Did you have a classmate in school who was head and shoulders above the other students? That kid was an outlier.

Outliers fall outside of the pattern that we’re used to seeing. And when it comes to data sets, they can throw a major kink into our analysis.

Let’s return to our example of professor evaluations and beauty scores. The relationship that we’ve observed between the two variables is that better-looking professors receive higher evaluations.

But what if a good-looking professor received a very low evaluation score? How would that affect our analysis?

In this video, Professor Stratmann will show you how much an outlier can affect a regression line – and you’ll have the chance to play with data on your own using DataSplash.

## Transcript

What is an outlier? You probably already have a good intuition of it. It's that 6'10" kid in high school who towers over his classmates. It's that little guy out-eating people twice his size. Outliers are those that don't fit the pattern we're used to seeing.

Let's go back to our data set. If you remember, we're examining the relationship between professors' beauty scores and their student evaluations. We generally see that better-looking teachers get better evaluations. Is there an outlier here? This point. It shows a professor with an extremely high beauty score but a very low evaluation score. So, I guess in this case, looks aren't everything.

A single outlier can have a major effect. We can see this in our data by removing or adding points and seeing the effect on the regression line. Let me show you. If you right click or control click, you can remove a point from our regression analysis. Let's remove this outlier, and see what happens. The dotted line is our old regression line, and the solid one is our new one. You can see it's gotten steeper. That makes sense, as we're removing an outlier that is pulling the line down. You can also see down below that our slope value has changed from 0.2 to 0.3. So removing a single outlier resulted in us finding a much stronger relationship between beauty and evaluation scores.
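The effect described above can be reproduced outside DataSplash. The following is a minimal sketch, assuming a small made-up set of (beauty score, evaluation score) pairs rather than the course's actual data; the `ols_slope` helper is a hypothetical name for a plain least-squares slope calculation.

```python
def ols_slope(points):
    """Slope of the least-squares line through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Made-up (beauty score, evaluation score) pairs, NOT the course's data set.
professors = [(2, 3.0), (3, 3.4), (4, 3.5), (5, 3.9), (6, 4.1), (7, 4.4), (8, 4.6)]
outlier = (9, 2.0)  # very high beauty score, very low evaluation

slope_with = ols_slope(professors + [outlier])
slope_without = ols_slope(professors)

print(f"slope with outlier:    {slope_with:.2f}")
print(f"slope without outlier: {slope_without:.2f}")
```

With these illustrative numbers, the slope is noticeably steeper once the outlier is dropped, which mirrors what the solid and dotted lines show in the video.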

If you remember, in our last video we did an exercise where we predicted the expected change in evaluation score when going from a 2 to a 7 in beauty score. Our old slope was 0.2, which meant a 1-point predicted increase in evaluation. With this new line, we'd predict an increase of 1.5 points in evaluation. That's a 50% stronger effect, and that's the power of an outlier. That one really good-looking but really bad teacher is changing our model a lot. The point is, regressions can be sensitive to outliers. If you want to get your outlier back, you can just click the undo button.
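The arithmetic behind those figures is just the slope multiplied by the change in beauty score. A short sketch, using the slopes quoted in the transcript (0.2 and 0.3) and a hypothetical helper named `predicted_change`:

```python
def predicted_change(slope, x_from, x_to):
    """Predicted change in y when x moves from x_from to x_to along a fitted line."""
    return slope * (x_to - x_from)


old_change = predicted_change(0.2, 2, 7)  # slope before removing the outlier
new_change = predicted_change(0.3, 2, 7)  # slope after removing the outlier
relative_increase = (new_change - old_change) / old_change

print(f"old prediction: {old_change:.1f} points")
print(f"new prediction: {new_change:.1f} points")
print(f"effect is {relative_increase:.0%} stronger")
```

This reproduces the 1-point and 1.5-point predictions and the 50% difference between them.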

Additionally, we can add data points to our regression by just clicking anywhere in the canvas. You can see, again, the line moves each time, showing my new regression line with a solid line and the previous line with a dotted line. Now it's your turn. I'd like you to try two things. First, remove a point so that the relationship weakens. The line should get flatter, so the slope should go down. And, second, add some data points, to make the relationship stronger. So pause the video now, and try it out.

Hopefully you've had some fun adding and removing data points. It's good to play with data to build your intuition.

Let's cover the first question—how to remove a point to make the relationship weaker. For example, remove this point. You would have seen a flatter line and our slope go down from 0.2 to 0.14.

So how about making the relationship stronger? Let's go back to our original data set, with a slope of 0.2. If I add a point, way up in the upper-right, I can see my slope go up from 0.2. I can also add one to the lower-left, and see another increase. The more points I add to these two regions, the stronger the relationship gets.
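The same experiment can be sketched in code. The data below are made up for illustration (chosen so the starting slope is about 0.2, like the video's), and `ols_slope` is a hypothetical helper, not part of DataSplash.

```python
def ols_slope(points):
    """Slope of the least-squares line through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Made-up (beauty score, evaluation score) pairs with a slope of about 0.2.
original = [(3, 3.2), (4, 3.5), (5, 3.6), (6, 3.9), (7, 4.0)]

upper_right = (9, 5.0)  # high beauty score, high evaluation
lower_left = (1, 1.5)   # low beauty score, low evaluation

slope_original = ols_slope(original)
slope_one_added = ols_slope(original + [upper_right])
slope_two_added = ols_slope(original + [upper_right, lower_left])

print(f"original:         {slope_original:.2f}")
print(f"plus upper-right: {slope_one_added:.2f}")
print(f"plus lower-left:  {slope_two_added:.2f}")
```

Each added point sits beyond the existing line in the direction of the trend, so the fitted slope rises with each addition in this example.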

So now that we've seen the power of outliers, your next question might be—can I just remove the outlier? It's important to remember we're typically using a regression to make a prediction about the real world. In the real world, we see guys who are 130 pounds eat 60 hot dogs. So if you want to predict hot dog eating based on weight, removing an outlier without thought can make your predictions worse. However, in certain cases, it might make sense to remove an outlier.

What if you are trying to predict the number of hot dogs an average person would eat at your upcoming barbecue? Then, including a competitive eater like Kobayashi would make your prediction much worse. In this case, he's clearly an outlier, and should be omitted from the model. You might have also noticed something else changing when we removed the outlier: these funny numbers. What do those mean? That's what we turn to next.
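To make the barbecue case concrete, here is a simplified sketch that uses a plain average instead of a regression. The guest counts are made up; only the 60-hot-dog figure comes from the transcript.

```python
from statistics import mean

# Made-up hot dog counts for ordinary barbecue guests (illustrative only).
typical_guests = [1, 2, 2, 3, 2, 1, 3, 2]
competitive_eater = 60  # the extreme count mentioned in the transcript

per_guest_with = mean(typical_guests + [competitive_eater])
per_guest_without = mean(typical_guests)

print(f"estimate including the outlier: {per_guest_with:.1f} hot dogs per guest")
print(f"estimate excluding the outlier: {per_guest_without:.1f} hot dogs per guest")
```

If the goal is to plan for typical guests, the estimate that excludes the extreme value is much closer to what any ordinary guest in this made-up list actually eats.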

And also, what's the best way to prepare a dataset for creating a chart? What kind of file should I use?

Regards,

George