WEBVTT

00:00:00.107 --> 00:00:03.926
♪ [music] ♪

00:00:21.040 --> 00:00:22.077
- [Thomas Stratmann] Hi!

00:00:22.077 --> 00:00:24.268
In the upcoming series of videos

00:00:24.268 --> 00:00:26.858
we're going to give you a shiny new tool

00:00:26.858 --> 00:00:30.414
to put into your Understanding Data toolbox:

00:00:30.414 --> 00:00:31.981
linear regression.

00:00:32.885 --> 00:00:34.668
Say you've got this theory.

00:00:34.668 --> 00:00:37.249
You've witnessed how good-looking people

00:00:37.249 --> 00:00:39.067
seem to get special perks.

00:00:39.642 --> 00:00:40.878
You're wondering,

00:00:40.878 --> 00:00:43.798
"Where else might we see this phenomenon?"

00:00:44.132 --> 00:00:45.637
What about for professors?

00:00:45.637 --> 00:00:48.259
Is it possible good-looking professors

00:00:48.259 --> 00:00:50.010
might get special perks too?

00:00:50.350 --> 00:00:53.899
Is it possible students treat them better

00:00:53.899 --> 00:00:57.209
by showering them with better student evaluations?

00:00:57.866 --> 00:01:00.467
If so, is the effect of looks

00:01:00.467 --> 00:01:03.573
on evaluation score big or [inaudible]?

00:01:04.349 --> 00:01:08.143
And say there is a new professor starting at a university.

00:01:08.619 --> 00:01:11.810
What can we predict about his evaluation

00:01:11.810 --> 00:01:13.371
simply by his looks?

00:01:13.940 --> 00:01:17.216
Given that these evaluations can determine pay raises,

00:01:17.671 --> 00:01:21.709
if this theory were true we might see professors resort

00:01:21.709 --> 00:01:24.980
to some surprising tactics to boost their scores.

00:01:25.471 --> 00:01:27.461
Suppose you wanted to find out

00:01:27.461 --> 00:01:30.801
if evaluations really improve with better looks.

00:01:31.441 --> 00:01:34.450
How would you go about testing this hypothesis?

00:01:34.956 --> 00:01:36.552
You could collect data.

00:01:36.761 --> 00:01:40.025
First you would have students rate on a scale from 1 to 10

00:01:40.025 --> 00:01:42.076
how good-looking a professor was,

00:01:42.076 --> 00:01:44.807
which gives you an average beauty score.

00:01:45.229 --> 00:01:48.552
Then you could retrieve the teacher's teaching evaluations

00:01:48.552 --> 00:01:50.421
from twenty-five students.

00:01:50.421 --> 00:01:53.273
Let's look at these two variables at the same time

00:01:53.273 --> 00:01:54.738
by using a scatterplot.

00:01:54.981 --> 00:01:57.419
We'll put beauty on the horizontal axis,

00:01:57.852 --> 00:02:00.589
and teacher evaluations on the vertical axis.

00:02:01.463 --> 00:02:05.514
For example, this dot represents Professor Peate,

00:02:06.173 --> 00:02:08.811
who received a beauty score of 3

00:02:08.811 --> 00:02:11.866
and an evaluation of 8.425.

00:02:12.084 --> 00:02:14.958
This one way out here is Professor Helmchen.

00:02:14.958 --> 00:02:16.797
- [Ben Stiller, "Zoolander"] Ridiculously good-looking!

00:02:16.797 --> 00:02:18.721
- [Thomas] Who got a very high beauty score,

00:02:18.721 --> 00:02:20.872
but not such a good evaluation.

00:02:21.101 --> 00:02:22.283
Can you see a trend?

00:02:22.283 --> 00:02:25.533
As we move from left to right on the horizontal axis,

00:02:25.533 --> 00:02:27.963
from the ugly to the gorgeous,

00:02:27.963 --> 00:02:31.186
we see a trend upwards in evaluation scores.

00:02:31.870 --> 00:02:35.174
By the way, the data we're exploring in this series

00:02:35.174 --> 00:02:38.923
is not made up -- it comes from a real study

00:02:38.923 --> 00:02:40.897
done at the University of Texas.

00:02:41.337 --> 00:02:46.023
If you're wondering, "pulchritude" is just the fancy academic way

00:02:46.023 --> 00:02:47.880
of saying beauty.

00:02:48.405 --> 00:02:51.474
With scatterplots it can sometimes be hard

00:02:51.474 --> 00:02:55.594
to make out the exact relationship between two variables --

00:02:55.594 --> 00:02:59.104
especially when the values bounce around quite a bit

00:02:59.104 --> 00:03:01.318
as we go from left to right.

00:03:02.000 --> 00:03:04.908
One way to cut through this bounciness

00:03:04.908 --> 00:03:08.144
is to draw a straight line through the data cloud

00:03:08.144 --> 00:03:10.775
in such a way that this line summarizes the data

00:03:10.775 --> 00:03:12.613
as closely as possible.

00:03:13.295 --> 00:03:17.181
The technical term for this is "linear regression."

00:03:17.669 --> 00:03:20.888
Later on we'll talk about how this line is created,

00:03:20.888 --> 00:03:24.278
but for now we can assume that the line fits the data

00:03:24.278 --> 00:03:26.456
as closely as possible.

00:03:27.087 --> 00:03:29.536
So, what can this line tell us?

00:03:30.067 --> 00:03:32.596
First, we immediately see

00:03:32.596 --> 00:03:35.358
if the line is sloping upward or downward.

00:03:36.107 --> 00:03:39.827
In our data set we see the [fitted] line slopes upward.

00:03:40.794 --> 00:03:43.807
It thus confirms what we have conjectured earlier

00:03:43.807 --> 00:03:45.587
by just looking at the scatterplot.

00:03:46.070 --> 00:03:50.237
The upward slope means that there is a positive association

00:03:50.237 --> 00:03:53.026
between looks and evaluation scores.

00:03:53.544 --> 00:03:55.907
In other words, on average,

00:03:55.907 --> 00:03:59.469
better-looking professors are getting better evaluations.

00:03:59.768 --> 00:04:03.939
For other data sets we might see a stronger positive association.

00:04:04.377 --> 00:04:07.420
Or, you might see a negative association.

00:04:07.857 --> 00:04:10.764
Or perhaps no association at all.

00:04:11.158 --> 00:04:13.903
And our lines don't have to be straight.

00:04:14.389 --> 00:04:17.304
They can curve to fit the data when necessary.

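As a rough illustration of the fitted line described above, here is a minimal Python sketch. It assumes the line is fit by ordinary least squares, which the video does not specify at this point (it defers how the line is created to a later video). The (beauty, evaluation) pairs are made up for illustration and are not the University of Texas data, and the names `points`, `fit_line`, and `describe` are hypothetical.

```python
# Hypothetical (beauty score, evaluation) pairs, invented for illustration;
# these are NOT the University of Texas data discussed in the video.
points = [(2.0, 3.1), (3.0, 3.6), (4.5, 3.4), (5.0, 4.0), (6.5, 3.9), (8.0, 4.4)]


def fit_line(pairs):
    """Return (slope, intercept) of the ordinary least-squares line.

    Assumption: least squares is used as the notion of "fits the data
    as closely as possible"; the video has not defined it yet.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Slope = co-variation of x and y divided by variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


def describe(slope, tol=1e-9):
    """Name the direction of the association implied by the slope."""
    if slope > tol:
        return "positive"
    if slope < -tol:
        return "negative"
    return "none"


slope, intercept = fit_line(points)
print(f"evaluation ~= {intercept:.2f} + {slope:.2f} * beauty "
      f"({describe(slope)} association)")
```

With these invented points the slope comes out positive, matching the upward-sloping case in the video; a data set that trends downward would give a negative slope, and a flat one a slope of zero.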
00:04:17.770 --> 00:04:21.262
This line also gives us a way to predict outcomes.

00:04:21.579 --> 00:04:25.569
We can simply take a beauty score and read off the line

00:04:25.569 --> 00:04:28.429
what the predicted evaluation score would be.

00:04:28.609 --> 00:04:30.546
So, back to our new professor.

00:04:31.097 --> 00:04:34.109
We can precisely predict his evaluation score.

00:04:34.683 --> 00:04:36.749
"But wait! Wait!" you might say.

00:04:37.019 --> 00:04:38.749
"Can we trust this prediction?"

00:04:39.233 --> 00:04:41.665
How well does this one beauty variable

00:04:41.665 --> 00:04:43.515
really predict evaluations?

00:04:44.844 --> 00:04:47.890
Linear regression gives us some useful measures

00:04:47.890 --> 00:04:49.770
to answer those questions

00:04:49.770 --> 00:04:52.039
which we'll cover in a future video.

00:04:52.838 --> 00:04:55.439
We also have to be aware of other pitfalls

00:04:55.439 --> 00:04:58.340
before we draw any definite conclusions.

00:04:58.833 --> 00:05:00.430
You could imagine a scenario

00:05:00.430 --> 00:05:03.639
where what is driving the association we see

00:05:03.639 --> 00:05:06.900
is really a third variable that we have left out.

00:05:07.344 --> 00:05:09.965
For example, the difficulty of the course

00:05:09.965 --> 00:05:12.456
might be behind the positive association

00:05:12.456 --> 00:05:15.645
between beauty ratings and evaluation scores.

00:05:16.052 --> 00:05:18.956
Easy intro. courses get good evaluations.

00:05:19.228 --> 00:05:22.972
Harder, more advanced courses get bad evaluations.

00:05:23.660 --> 00:05:27.668
And younger professors might get assigned to intro. courses.

00:05:28.080 --> 00:05:32.095
Then, if students judge younger professors more attractive,

00:05:32.095 --> 00:05:34.335
you will find a positive association

00:05:34.335 --> 00:05:37.383
between beauty ratings and evaluation scores.

00:05:37.861 --> 00:05:40.388
But it's really the difficulty of the course,

00:05:40.388 --> 00:05:43.537
the variable that we've left out, not beauty,

00:05:43.537 --> 00:05:45.848
that is driving evaluation scores.

00:05:46.346 --> 00:05:49.807
In that case, all the primping would be for naught --

00:05:50.289 --> 00:05:54.441
a case of mistaken correlation for causation,

00:05:54.900 --> 00:05:58.166
something we'll talk about further in a later video.

00:05:58.922 --> 00:06:02.069
And what if there were other important variables

00:06:02.069 --> 00:06:05.781
that affect both beauty ratings and evaluation scores?

00:06:06.626 --> 00:06:09.575
You might want to add considerations like skill,

00:06:09.846 --> 00:06:14.577
race, sex, and whether English is the teacher's native language

00:06:14.577 --> 00:06:18.994
to isolate more cleanly the effect of beauty on evaluations.

00:06:19.408 --> 00:06:21.758
When we get into multiple regression

00:06:21.758 --> 00:06:24.477
we will be able to measure the impact of beauty

00:06:24.477 --> 00:06:26.219
on teacher evaluations

00:06:26.219 --> 00:06:28.368
while accounting for other variables

00:06:28.368 --> 00:06:30.737
that might confound this association.

00:06:31.762 --> 00:06:35.509
Next up, we'll get our hands dirty by playing with this data

00:06:35.509 --> 00:06:39.070
to gain a better understanding of what this line can tell us.

00:06:41.169 --> 00:06:42.445
- [Narrator] Congratulations!

00:06:42.445 --> 00:06:45.247
You're one step closer to being a data ninja!

00:06:45.568 --> 00:06:47.139
However, to master this

00:06:47.139 --> 00:06:48.700
you'll need to strengthen your skills

00:06:48.700 --> 00:06:50.404
with some practice questions.

00:06:50.865 --> 00:06:53.976
Ready for your next mission? Click "Next Video."

00:06:54.313 --> 00:06:55.364
Still here?

00:06:55.598 --> 00:06:58.325
Move from understanding data to understanding your world

00:06:58.325 --> 00:07:01.642
by checking out MRU's other popular economics videos.

00:07:01.892 --> 00:07:04.406
♪ [music] ♪
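The left-out-variable scenario from the video (course difficulty driving evaluations while beauty merely tags along) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are invented, evaluations are constructed to depend on course type alone, and the least-squares slope is used as the measure of association. The names `records`, `slope`, `pooled`, and `within` are hypothetical.

```python
# Hypothetical records, invented for illustration: (beauty, course, evaluation).
# By construction the evaluation depends ONLY on the course type, while the
# professors teaching the easy course happen to have higher beauty scores.
records = [
    (3.0, "hard", 6.0), (4.0, "hard", 6.0), (5.0, "hard", 6.0),
    (6.0, "easy", 9.0), (7.0, "easy", 9.0), (8.0, "easy", 9.0),
]


def slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


# Ignoring course difficulty: beauty appears to raise evaluations.
pooled = slope([(b, e) for b, _, e in records])

# Comparing only professors who teach the same kind of course:
# the apparent effect of beauty disappears.
within = {
    course: slope([(b, e) for b, c, e in records if c == course])
    for course in ("easy", "hard")
}

print(f"pooled slope: {pooled:.3f}")
print(f"within-course slopes: {within}")
```

In this constructed example the pooled slope is positive even though beauty has no effect inside either course type. Comparing like with like in this way is a crude stand-in for the accounting for other variables that the video attributes to multiple regression.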