WEBVTT 00:00:00.660 --> 00:00:06.650 We will now begin our journey into the world of statistics, 00:00:06.650 --> 00:00:09.750 which is really a way to understand or get 00:00:09.750 --> 00:00:11.520 our head around data. 00:00:11.520 --> 00:00:14.670 So statistics is all about data. 00:00:14.670 --> 00:00:19.000 And as we begin our journey into the world of statistics, 00:00:19.000 --> 00:00:20.610 we will be doing a lot of what we 00:00:20.610 --> 00:00:23.210 can call descriptive statistics. 00:00:23.210 --> 00:00:25.470 So if we have a bunch of data, and if we 00:00:25.470 --> 00:00:27.990 want to tell something about all of that data 00:00:27.990 --> 00:00:29.890 without giving them all of the data, 00:00:29.890 --> 00:00:33.870 can we somehow describe it with a smaller set of numbers? 00:00:33.870 --> 00:00:35.720 So that's what we're going to focus on. 00:00:35.720 --> 00:00:37.360 And then once we build our toolkit 00:00:37.360 --> 00:00:39.260 on the descriptive statistics, then we 00:00:39.260 --> 00:00:41.710 can start to make inferences about that data, 00:00:41.710 --> 00:00:44.200 start to make conclusions, start to make judgments. 00:00:44.200 --> 00:00:49.430 And we'll start to do a lot of inferential statistics, 00:00:49.430 --> 00:00:51.160 make inferences. 00:00:51.160 --> 00:00:53.110 So with that out of the way, let's think 00:00:53.110 --> 00:00:56.390 about how we can describe data. 00:00:56.390 --> 00:01:00.710 So let's say we have a set of numbers. 00:01:00.710 --> 00:01:02.360 We can consider this to be data. 00:01:02.360 --> 00:01:04.580 Maybe we're measuring the heights of our plants 00:01:04.580 --> 00:01:05.740 in our garden. 00:01:05.740 --> 00:01:07.400 And let's say we have six plants. 00:01:07.400 --> 00:01:13.870 And the heights are 4 inches, 3 inches, 1 inch, 6 inches, 00:01:13.870 --> 00:01:17.990 and another one's 1 inch, and another one is 7 inches. 
00:01:17.990 --> 00:01:20.934 And let's say someone just said-- in another room, not 00:01:20.934 --> 00:01:22.350 looking at your plants, just said, 00:01:22.350 --> 00:01:24.657 well, you know, how tall are your plants? 00:01:24.657 --> 00:01:26.240 And they only want to hear one number. 00:01:26.240 --> 00:01:30.560 They want to somehow have one number that 00:01:30.560 --> 00:01:33.410 represents all of these different heights of plants. 00:01:33.410 --> 00:01:36.580 How would you do that? 00:01:36.580 --> 00:01:38.810 Well, you'd say, well, how can I find something 00:01:38.810 --> 00:01:40.990 that-- maybe I want a typical number. 00:01:40.990 --> 00:01:44.060 Maybe I want some number that somehow represents the middle. 00:01:44.060 --> 00:01:46.250 Maybe I want the most frequent number. 00:01:46.250 --> 00:01:48.830 Maybe I want the number that somehow represents 00:01:48.830 --> 00:01:51.270 the center of all of these numbers. 00:01:51.270 --> 00:01:53.220 And if you said any of those things, 00:01:53.220 --> 00:01:55.189 you would actually have done the same things 00:01:55.189 --> 00:01:57.730 that the people who first came up with descriptive statistics 00:01:57.730 --> 00:01:58.230 said. 00:01:58.230 --> 00:02:00.150 They said, well, how can we do it? 00:02:00.150 --> 00:02:04.960 And we'll start by thinking of the idea of average. 00:02:04.960 --> 00:02:07.610 And in everyday terminology, average 00:02:07.610 --> 00:02:09.720 has a very particular meaning, as we'll see. 00:02:09.720 --> 00:02:11.570 When many people talk about average, 00:02:11.570 --> 00:02:13.070 they're talking about the arithmetic 00:02:13.070 --> 00:02:14.960 mean, which we'll see shortly. 00:02:14.960 --> 00:02:18.100 But in statistics, average means something more general. 00:02:18.100 --> 00:02:22.980 It really means give me a typical, 00:02:22.980 --> 00:02:29.810 or give me a middle number, or-- and these are or's.
00:02:29.810 --> 00:02:31.930 And really it's an attempt to find 00:02:31.930 --> 00:02:33.490 a measure of central tendency. 00:02:38.550 --> 00:02:40.560 So once again, you have a bunch of numbers. 00:02:40.560 --> 00:02:42.970 You're somehow trying to represent these 00:02:42.970 --> 00:02:45.840 with one number we'll call the average, that's somehow 00:02:45.840 --> 00:02:49.130 typical, or middle, or the center somehow 00:02:49.130 --> 00:02:50.450 of these numbers. 00:02:50.450 --> 00:02:54.110 And as we'll see, there's many types of averages. 00:02:54.110 --> 00:02:56.690 The first is the one that you're probably most familiar with. 00:02:56.690 --> 00:02:58.398 It's the one-- and people talk about hey, 00:02:58.398 --> 00:03:00.840 the average on this exam or the average height. 00:03:00.840 --> 00:03:02.970 And that's the arithmetic mean. 00:03:02.970 --> 00:03:05.470 Just let me write it in. 00:03:05.470 --> 00:03:13.100 I'll write in yellow, arithmetic mean. 00:03:13.100 --> 00:03:16.010 When arithmetic is a noun, we call it arithmetic [a-RITH-me-tic]. 00:03:16.010 --> 00:03:19.960 When it's an adjective like this, we call it arithmetic [ar-ith-MET-ic], 00:03:19.960 --> 00:03:21.620 arithmetic mean. 00:03:21.620 --> 00:03:25.300 And this is really just the sum of all the numbers divided 00:03:25.300 --> 00:03:28.180 by-- this is a human-constructed definition that we've 00:03:28.180 --> 00:03:31.630 found useful-- the sum of all these numbers divided 00:03:31.630 --> 00:03:34.460 by the number of numbers we have. 00:03:34.460 --> 00:03:36.830 So given that, what is the arithmetic mean 00:03:36.830 --> 00:03:39.114 of this data set? 00:03:39.114 --> 00:03:40.280 Well, let's just compute it. 00:03:40.280 --> 00:03:46.160 It's going to be 4 plus 3 plus 1 plus 6 plus 1 00:03:46.160 --> 00:03:51.210 plus 7 over the number of data points we have. 00:03:51.210 --> 00:03:53.210 So we have six data points. 00:03:53.210 --> 00:03:54.860 So we're going to divide by 6.
00:03:54.860 --> 00:04:01.840 And we get 4 plus 3 is 7, plus 1 is 8, plus 6 is 14, 00:04:01.840 --> 00:04:04.934 plus 1 is 15, plus 7. 00:04:04.934 --> 00:04:07.927 15 plus 7 is 22. 00:04:07.927 --> 00:04:09.135 Let me do that one more time. 00:04:09.135 --> 00:04:15.180 You have 7, 8, 14, 15, 22, all of that over 6. 00:04:15.180 --> 00:04:17.070 And we could write this as a mixed number. 00:04:17.070 --> 00:04:21.120 6 goes into 22 three times with a remainder of 4. 00:04:21.120 --> 00:04:25.200 So it's 3 and 4/6, which is the same thing as 3 and 2/3. 00:04:25.200 --> 00:04:28.670 We could write this as a decimal with 3.6 repeating. 00:04:28.670 --> 00:04:32.360 So this is also 3.6 repeating. 00:04:32.360 --> 00:04:34.380 We could write it any one of those ways. 00:04:34.380 --> 00:04:36.700 But this is kind of a representative number. 00:04:36.700 --> 00:04:39.820 This is trying to get at a central tendency. 00:04:39.820 --> 00:04:41.620 Once again, these are human-constructed. 00:04:41.620 --> 00:04:43.590 No one ever-- it's not like someone just 00:04:43.590 --> 00:04:46.140 found some religious document that said, 00:04:46.140 --> 00:04:47.990 this is the way that the arithmetic mean 00:04:47.990 --> 00:04:49.180 must be defined. 00:04:49.180 --> 00:04:52.700 It's not as pure of a computation 00:04:52.700 --> 00:04:55.005 as, say, finding the circumference of the circle, 00:04:55.005 --> 00:04:56.880 which there really is-- that was kind of-- we 00:04:56.880 --> 00:04:57.840 studied the universe. 00:04:57.840 --> 00:05:00.600 And that just fell out of our study of the universe. 00:05:00.600 --> 00:05:02.250 It's a human-constructed definition 00:05:02.250 --> 00:05:04.110 that we found useful. 00:05:04.110 --> 00:05:07.260 Now there are other ways to measure the average 00:05:07.260 --> 00:05:10.130 or find a typical or middle value. 00:05:10.130 --> 00:05:14.470 The other very typical way is the median. 00:05:14.470 --> 00:05:15.667 And I will write median. 
00:05:15.667 --> 00:05:16.750 I'm running out of colors. 00:05:16.750 --> 00:05:18.660 I will write median in pink. 00:05:18.660 --> 00:05:21.280 So there is the median. 00:05:21.280 --> 00:05:25.160 And the median is literally looking for the middle number. 00:05:25.160 --> 00:05:27.350 So if you were to order all the numbers in your set 00:05:27.350 --> 00:05:31.460 and find the middle one, then that is your median. 00:05:31.460 --> 00:05:34.050 So given that, what's the median of this set of numbers 00:05:34.050 --> 00:05:35.806 going to be? 00:05:35.806 --> 00:05:36.930 Let's try to figure it out. 00:05:36.930 --> 00:05:38.170 Let's try to order it. 00:05:38.170 --> 00:05:39.810 So we have 1. 00:05:39.810 --> 00:05:41.010 Then we have another 1. 00:05:41.010 --> 00:05:42.860 Then we have a 3. 00:05:42.860 --> 00:05:46.630 Then we have a 4, a 6, and a 7. 00:05:46.630 --> 00:05:48.700 So all I did is I reordered this. 00:05:48.700 --> 00:05:50.890 And so what's the middle number? 00:05:50.890 --> 00:05:52.320 Well, you look here. 00:05:52.320 --> 00:05:54.960 Since we have an even number of numbers, we have six numbers, 00:05:54.960 --> 00:05:57.260 there's not one middle number. 00:05:57.260 --> 00:05:59.650 You actually have two middle numbers here. 00:05:59.650 --> 00:06:02.050 You have two middle numbers right over here. 00:06:02.050 --> 00:06:03.160 You have the 3 and the 4. 00:06:03.160 --> 00:06:05.940 And in this case, when you have two middle numbers, 00:06:05.940 --> 00:06:09.640 you actually go halfway between these two numbers. 00:06:09.640 --> 00:06:12.080 You're essentially taking the arithmetic mean of these two 00:06:12.080 --> 00:06:14.272 numbers to find the median. 00:06:14.272 --> 00:06:16.230 So the median is going to be halfway in-between 00:06:16.230 --> 00:06:19.190 3 and 4, which is going to be 3.5. 00:06:19.190 --> 00:06:24.424 So the median in this case is 3.5. 
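The two calculations just worked by hand can be sketched in Python. This is only an illustration, not part of the video: the function names `arithmetic_mean` and `median` are made up for this example, exact fractions are used so that 22/6 shows up as 11/3 rather than a rounded decimal, and both functions assume a non-empty list.

```python
from fractions import Fraction

# Plant heights in inches, as read out in the video.
heights = [4, 3, 1, 6, 1, 7]


def arithmetic_mean(values):
    """Sum of the values divided by how many values there are."""
    return Fraction(sum(values), len(values))


def median(values):
    """Middle value of the ordered data; with an even count,
    the value halfway between the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return Fraction(ordered[middle])
    return Fraction(ordered[middle - 1] + ordered[middle], 2)


plant_mean = arithmetic_mean(heights)    # 22/6, reduced to 11/3
plant_median = median(heights)           # halfway between 3 and 4

print(plant_mean, float(plant_mean))
print(plant_median, float(plant_median))
```

The even-count branch is the "two middle numbers" case from the video: the ordered list is 1, 1, 3, 4, 6, 7, and the code averages the 3 and the 4.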
00:06:24.424 --> 00:06:26.590 So if you have an even number of numbers, the median 00:06:26.590 --> 00:06:28.714 is the middle two-- essentially the arithmetic 00:06:28.714 --> 00:06:31.329 mean of the middle two, or halfway between the middle two. 00:06:31.329 --> 00:06:32.870 If you have an odd number of numbers, 00:06:32.870 --> 00:06:34.270 it's a little bit easier to compute. 00:06:34.270 --> 00:06:35.644 And just so that we see that, let 00:06:35.644 --> 00:06:36.920 me give you another data set. 00:06:36.920 --> 00:06:39.030 Let's say our data set-- and I'll 00:06:39.030 --> 00:06:41.740 order it for us-- let's say our data set 00:06:41.740 --> 00:06:55.689 was 0, 7, 50, I don't know, 10,000, and 1 million. 00:06:55.689 --> 00:06:56.980 Let's say that is our data set. 00:06:56.980 --> 00:06:58.450 Kind of a crazy data set. 00:06:58.450 --> 00:07:02.400 But in this situation, what is our median? 00:07:02.400 --> 00:07:04.045 Well, here we have five numbers. 00:07:04.045 --> 00:07:05.420 We have an odd number of numbers. 00:07:05.420 --> 00:07:07.200 So it's easier to pick out a middle. 00:07:07.200 --> 00:07:12.040 The middle is the number that is greater than two of the numbers 00:07:12.040 --> 00:07:13.540 and is less than two of the numbers. 00:07:13.540 --> 00:07:14.760 It's exactly in the middle. 00:07:14.760 --> 00:07:18.840 So in this case, our median is 50. 00:07:18.840 --> 00:07:20.742 Now, the third measure of central tendency, 00:07:20.742 --> 00:07:22.200 and this is the one that's probably 00:07:22.200 --> 00:07:26.426 used least often in life, is the mode. 00:07:26.426 --> 00:07:27.800 And people often forget about it. 00:07:27.800 --> 00:07:29.852 It sounds like something very complex. 00:07:29.852 --> 00:07:31.310 But what we'll see is it's actually 00:07:31.310 --> 00:07:33.080 a very straightforward idea. 00:07:33.080 --> 00:07:36.180 And in some ways, it is the most basic idea.
00:07:36.180 --> 00:07:40.510 So the mode is actually the most common number in a data set, 00:07:40.510 --> 00:07:41.885 if there is a most common number. 00:07:41.885 --> 00:07:43.801 If all of the numbers are represented equally, 00:07:43.801 --> 00:07:45.760 if there's no one single most common number, 00:07:45.760 --> 00:07:47.320 then you have no mode. 00:07:47.320 --> 00:07:50.240 But given that definition of the mode, 00:07:50.240 --> 00:07:54.190 what is the single most common number in our original data 00:07:54.190 --> 00:07:58.300 set, in this data set right over here? 00:07:58.300 --> 00:08:00.100 Well, we only have one 4. 00:08:00.100 --> 00:08:01.490 We only have one 3. 00:08:01.490 --> 00:08:03.370 But we have two 1's. 00:08:03.370 --> 00:08:04.880 We have one 6 and one 7. 00:08:04.880 --> 00:08:08.730 So the number that shows up the most number of times here 00:08:08.730 --> 00:08:11.060 is our 1. 00:08:11.060 --> 00:08:14.070 So the mode, the most typical number, the most common number 00:08:14.070 --> 00:08:17.610 here is a 1. 00:08:17.610 --> 00:08:19.590 So, you see, these are all different ways 00:08:19.590 --> 00:08:23.320 of trying to get at a typical, or middle, or central tendency. 00:08:23.320 --> 00:08:25.600 But they do it in very, very different ways. 00:08:25.600 --> 00:08:27.350 And as we study more and more statistics, 00:08:27.350 --> 00:08:29.760 we'll see that they're good for different things. 00:08:29.760 --> 00:08:31.730 This [the arithmetic mean] is used very frequently. 00:08:31.730 --> 00:08:34.574 The median is really good if you have some kind of crazy number 00:08:34.574 --> 00:08:35.990 out here that could have otherwise 00:08:35.990 --> 00:08:38.100 skewed the arithmetic mean. 00:08:38.100 --> 00:08:41.449 The mode could also be useful in situations like that, 00:08:41.449 --> 00:08:43.240 especially if you do have one number that's 00:08:43.240 --> 00:08:45.960 showing up a lot more frequently.
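The mode as defined in the video (the single most common number, or no mode when there is no single winner) might be sketched like this. The function name `single_mode` is hypothetical, and returning `None` for "no mode" is a choice made for this example; other texts and libraries treat ties differently, for instance by reporting every tied value.

```python
from collections import Counter


def single_mode(values):
    """Return the one most common value, or None when no single
    value is more common than all the others (the video's rule)."""
    counts = Counter(values)
    if not counts:
        return None  # empty data: nothing to report
    ranked = counts.most_common()  # (value, count) pairs, highest count first
    top_value, top_count = ranked[0]
    # A tie for first place means there is no single most common number.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    return top_value


heights = [4, 3, 1, 6, 1, 7]          # plant heights from the video
plant_mode = single_mode(heights)      # 1 appears twice, everything else once
no_mode = single_mode([0, 7, 50])      # every value appears once

print(plant_mode, no_mode)
```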
00:08:45.960 --> 00:08:47.570 Anyway, I'll leave you there. 00:08:47.570 --> 00:08:51.710 And we'll-- the next few videos, we will explore statistics even 00:08:51.710 --> 00:08:53.260 deeper.
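To see why the median holds up better than the arithmetic mean when one value is extreme, here is a small sketch using the second data set from the video (0, 7, 50, 10,000 and 1 million). It relies on Python's standard `statistics` module; the variable names are made up for this illustration.

```python
from statistics import mean, median

# The "crazy" data set from the video, already in order.
data = [0, 7, 50, 10_000, 1_000_000]

data_mean = mean(data)      # pulled far upward by the 1,000,000
data_median = median(data)  # the middle value of the five

# Count how many of the five values sit below the arithmetic mean.
below_mean = sum(1 for x in data if x < data_mean)

print("mean:", data_mean)
print("median:", data_median)
print("values below the mean:", below_mean, "of", len(data))
```

Four of the five values are below the mean here, so the mean is not very "typical" of this data, while the median still lands on a value in the middle of the list.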