... especially with a large class. Alright! We haven't seen you since last Thursday; I don't see this class quite as often as my Monday, Wednesday, Friday class. If you still have questions about switching a lab because something else in your schedule changed, or you're still waiting to get into the class, Angie at the stats student services email is the best person to contact, or you can go right over there. She's in West Hall in the Stat department, and she can sometimes help you much better if you go over to see her personally. That is where you can find Angie, who is managing my wait-list every morning and issuing permissions for the drops that went through overnight, so if you still have a question about that, see Angie. The prelabs are working. It wasn't just our site, and it wasn't all the 1570 students bringing site-maker down; it was some other issue in site-maker. The prelabs are now available and working fine after the fix and reload yesterday, so I hope you can view them properly and go back to them as needed when you start your first real homework set and have to make that histogram or boxplot. Watch the two-minute video on how to do that rather than having to go through a whole bunch of other material to remember how it works.
So the videos are there. We will still allow you to finish up your prelab by tomorrow evening; the last lab ends at 7. If you didn't get it done for your lab yesterday because of the difficulties with site-maker, we will extend that. Just email your GSI and work out getting that prelab 1 done. We just want you to do it so that you're ready for some of the other things we do in labs and homework. You get your point if you do it and no points if you don't, so please complete it.

Alright, the other item coming up on Thursday: when you go to your lecture book homework tool, you will start to see a homework there for you on Thursday morning. This is my testing-student view of the homework side, but I have the ability to see anything that's hidden, and I have created that first homework. Once you click on the little pencil icon that says you're going to start writing that homework, there are three questions, and you'll be able to click through them and put in your answers. Some of them are multiple choice, some ask you to type up a little explanation, and one of them is an upload of a graph, so you get that practice too. The practice homework going up Thursday is practice. I'm putting it under "required hand in" just so that there are points on it, and then it will be submitted automatically for you by 11 pm the following Wednesday.
The GSIs will get it and grade it to go through that process, because we have a few new GSIs this term who haven't done that yet. Then you will see what a graded homework looks like when it comes back to you, maybe with some feedback if you didn't get something right. Again, it is just practice, so if you're not in the class and haven't subscribed yet, that's fine; if you don't do it, that's fine too, it won't affect your grade. But it does give you a full practice run, so that when the first homework really does come up and you're working on it, the little issues or questions you have will hopefully be gone by then. So that will be available, and I'll probably highlight it again on Thursday.

Office hours have also started. They're pretty quiet right now, but if you do have an issue, or you want to see how virtual sites works with someone who can help you a little more, come on in: 274 West Hall, Monday through Wednesday from 9 to 9, Thursday from 9 in the morning until 5, and Friday just in the morning, 9 to 12. You're welcome to go to any of the office hours even if they aren't your own GSI's, and instructor office hours are also posted on c-tools; you can come to me or Dr. ?. The first real homework will not open until January 20th, but the practice one starts this Thursday morning, and you have basically a week to do it.
The following Wednesday night at 11 pm, it will be submitted automatically for you. We're working through chapter 2; we should actually finish much of it up today, with maybe a couple more slides on Thursday. For Thursday, I am asking you to read chapters 3 and 4. They're short reading chapters; there aren't really any formulas in them, maybe just one small one. I've posted partially complete notes for chapters 3 and 4. You've got the notes in your binder, and you can look at a few of the answers. I am going to go through the ones that are missing as a recap of those chapters at the beginning on Thursday, and then I'll put up a clicker practice quiz, 5 points. I'll actually keep those scores just to show you the histogram of them later, but it will serve as a little intermediate practice for you to see if you got those couple of ideas from those two reading chapters: more concepts on different types of bias, an experiment versus an observational study, and things like that. So try reading through the chapters, or at least look at the notes that are online; they're on c-tools under lecture info. Just read through them before Thursday and come ready to recap the ideas and try a little practice quiz. So those are my announcements. Questions or comments on anything before we recap the histograms we ended with? Question or comment on anything?
Alright, then we're going to recall the two types of variables. We're in chapter 2; the two types of variables were what? Categorical or... quantitative. And what was the very first graph we looked at that described a categorical variable's distribution, that gave us the categories your variable can take on and how often they occurred, and how many? That first graph we looked at had bars, and it was called a... bar chart or bar graph, and then we saw the pie graph along with it. The basic numerical summary is just the counts or percentages that fall into each category; we did that for the sleep-deprived status variable. Then we went on to our quantitative variable, where we had students report how many hours of sleep they typically get per night. For a quantitative variable, to show its distribution, and by distribution we just mean an idea of the values that the variable can take on and how often they tend to occur, the nice picture was also one with bars, but it wasn't a bar chart; it was called a... histogram. The histogram is what we looked at. We're going to recap that on page 8 of your notes, and then move on to summarizing quantitative data with numbers.
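The numerical summary for a categorical variable, counts and percentages per category, can be sketched in a few lines of Python. The responses below are hypothetical, not the actual class sleep-deprived data:

```python
from collections import Counter

# Hypothetical yes/no responses for a categorical variable
# such as "sleep-deprived status" (made-up data).
responses = ["yes", "no", "no", "yes", "no", "no", "no", "yes", "no", "no"]

counts = Counter(responses)                     # count per category
n = len(responses)
percents = {cat: 100 * c / n for cat, c in counts.items()}  # percent per category

print(counts)    # Counter({'no': 7, 'yes': 3})
print(percents)  # {'yes': 30.0, 'no': 70.0}
```

These counts and percentages are exactly the numbers a bar chart or pie graph displays.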
Numbers such as the mean and the median, and we'll get at least through the standard deviation by the end of today. So, page 8: we had our histogram of the amount of sleep for college students, and we did a clicker question at the very end where we talked about the shape of that distribution. We talked about it being approximately symmetric, and another word that would work is unimodal: one main mode or peak. Some of you did want to select the skewness aspect. It's not perfectly symmetric, there is a slight skewness to the left, but it's quite slight; I would leave that as a secondary feature and not the primary thing I would be looking for. The summary here: approximately symmetric, unimodal, centered around 7 hours, with most of the values between 4 and 10. I can pretty much picture that histogram from that description. This type of description is what you're asked to write out, even by hand, for the histogram you make in your prelab.

Alright, so let's move to looking at numerical summaries that would be appropriate. Oh, we have one more histogram here for a "what if". So, you have a response you measured for a study; it's quantitative, so you make a histogram. You look at it graphically first, and this is the picture you get. This is not unimodal; it's called what? [softly: bimodal] Bimodal: more than one mode or peak. A bimodal distribution, what would it tell you?
I certainly see two groups, kind of two subgroups. I see a group of observations, people who had low scores or low values on the response, and another group that had high scores over here. So I see a bimodal distribution; there seem to be two subgroups in my data set. That's one thing I would comment on, and I would want to try to figure out why it occurred. So there appear to be... two subgroups. I would not just note that and then go on and start calculating means and standard deviations, because I first want to figure out why. What made these two groups? Investigate why. Maybe it turns out that the lower observations were for the males in your data set and the higher observations for the females. Maybe it's the old versus the young, or some other aspect that you measured. You want to try to investigate why. It might be a gender issue, it might be an age issue, it could be a region issue, or something else that affects that response and gives you these two subgroups. This is why in a data set we don't record only the outcome of interest; we record a lot of other variables too. They may not be the main response variable, but they might help explain features we see in our response. We might have to control for or account for age or gender in our analyses if we see a picture like this.
So I would want to try to figure out why. We look at our data set for other variables that have been measured, maybe more demographic ones, but they might help us understand what's going on. Then, of course, I wouldn't just leave the data set grouped together after this, because if you calculate a mean, the mean is going to be a sort of balancing point, and that doesn't really reflect either group very well. It wouldn't make sense to summarize this data together. I'd probably lean toward analyzing the data separately by subgroup from there on out. So we might end up analyzing the data separately by subgroup, or at least include the variable that gives us that distinction as a variable to help control for that factor. It's important to look at your data first. If your histogram turned out like this and I asked you, "Should you calculate a mean?", well, you can; you can always have the calculator do that for you. SPSS gives you the mean by default with the histogram, so it doesn't even make you check first, but it would not make sense to report it. It would not be meaningful in this case. Alright, a couple of histogram comments are laid out there for you.
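The point about not pooling a bimodal response can be checked with made-up numbers. The two clusters below are hypothetical, but they show why the overall mean is misleading: it lands between the clusters, where almost no observations sit.

```python
# Hypothetical response values forming two clear subgroups
# (say, one cluster per gender or age group; made-up data).
low_group  = [10, 12, 11, 13, 12]
high_group = [30, 32, 31, 29, 33]

pooled = low_group + high_group
overall_mean = sum(pooled) / len(pooled)        # the "balancing point"
low_mean  = sum(low_group) / len(low_group)
high_mean = sum(high_group) / len(high_group)

print(overall_mean)         # 21.3: between the clusters, typical of neither
print(low_mean, high_mean)  # 11.6 31.0: each describes its own subgroup well
```

Summarizing each subgroup separately, as the lecture suggests, gives numbers that actually describe the data.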
For a bar chart we usually have a gap between the bars, because it separates the different categories, whereas in a histogram the bars are usually right next to each other, unless of course there's a gap because there are no observations there, and then you may have an outlier sticking out at the end. How many classes to use depends a little bit on judgment, but computers and calculators use a default algorithm to work out a reasonable number of classes, and you can always go in and change that slightly. You don't want too many, because with too many classes you'll have just a few values in each, and you get a kind of "flat pancake" histogram. If you have too few, then everybody is in just a couple of classes, and that doesn't give a very good picture of the distribution either. We like to put relative frequency, just another name for proportion or percent, on the y-axis whenever we're comparing groups. If I wanted to compare the histograms for male versus female college students in terms of amount of sleep, I could do that as long as my axes matched up. Lots of defaults, lots of options, and you get to see that in SPSS a little bit too.
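The binning and relative-frequency ideas can be sketched by hand. The sleep values below are hypothetical, and six equal-width classes are chosen for illustration, the kind of choice SPSS or a calculator would make by default:

```python
# Hypothetical hours-of-sleep data (not the actual class data).
sleep = [4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10]

# Six classes of width 1 covering [4, 10]; the top class includes 10.
low, width, k = 4, 1, 6
counts = [0] * k
for x in sleep:
    i = min((x - low) // width, k - 1)  # class index, clamping the top edge
    counts[i] += 1

# Relative frequency (proportion per class): dividing by n lets you
# compare histograms for groups of different sizes on the same scale.
n = len(sleep)
rel_freq = [c / n for c in counts]

print(counts)    # [1, 2, 3, 4, 3, 3]
print(rel_freq)  # proportions summing to 1
```

With too many classes the counts spread out into the "flat pancake" shape; with too few, the shape of the distribution is lost.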
Alright, one last comment here: one of the sections in your textbook talks about dot plots and stem-and-leaf plots. We're not going to have you make those particular types of graphs, but some of the examples there do comment on the shape of the resulting distribution, so it's still good to glance through them; you just won't be asked to make those graphs. The histogram is a good choice overall. The example you have on page 10 you can try on your own; I think that one was on an exam in a spring term. It is another histogram of a quantitative variable. See if you can go through and answer those couple of questions. The solutions are already posted on c-tools. I'll be posting lecture notes filled in up until the end of drop/add, and anything I miss or skip in class I will post no matter what. So under lecture notes you can find part 1 of the chapter 2 notes already filled in.

Let's turn to numerical summaries for the rest of today. Numerical summaries are only appropriate for quantitative data. Even if you had categorical data that you coded numerically, it may not make sense to do any kind of averaging or find a median there. But it certainly makes sense for a quantitative variable where we saw a reasonably unimodal, homogeneous set of observations. You've all calculated a mean at some point, and a median perhaps. Measures of center.
We're going to have a couple of formulas along the way. They're based on data written as those x's: x1 is the first observation in your data set, x2 is the second one, and so on; that represents your set of data. And n represents the number of items in your data set, sometimes called the sample size. How do you calculate the mean of a set of data? Add them up and divide by the total number, right? So the formula would be: add x1, x2, and all the observations up to the last one, and divide by the total number, represented here by... n. The symbol for that mean, when it's the mean of your sample, which is usually the kind of data you have, is x-bar. Usually you don't have the entire population of values but rather just a sample, so it's called the sample mean, represented by x-bar. It's probably a function on one of your calculators, where you can put some data in and press a button to get the mean, and it's the mean of your sample. We have different notation if it is the mean of the population: the Greek letter mu, µ, represents a population mean. But usually, when we're summarizing data, it's from a sample, part of a larger population, not the entire population. Anything calculated on a sample, again, is called a statistic...
So that sample mean is easy to find. A shorthand is the summation notation: that big sigma, Σ, just means add them up, add up whatever comes after it, so x-bar = Σx / n, adding up the x's and dividing by the total. The median, how do we find that? The median is the value that's in the middle, but you first have to order your data from smallest to largest. Say you had an odd number of observations, 5 of them; then, once you order them, there is a middle value, and that is your median. What if you have an even number of observations, say 4? Then the median is going to fall right in the middle, between the two middle values. Any value between those two numbers could serve as a median, since it would divide your data set into 50% above and 50% below, but in this case we define it to be the average of those two middle numbers so we all get the same answer. So the median: when n is an odd number of observations, the median, m, is the middle observation, because there is a unique one. There may be some of the same values on either side, but it will be THE middle value. Whereas if your number of observations is even, you define the median to be the average of the two middle observations. Alright, let's try it out.
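The two definitions translate directly into code. This is a sketch of the formulas as stated in lecture, using Python's built-in sorting:

```python
def sample_mean(xs):
    # x-bar = (x1 + x2 + ... + xn) / n
    return sum(xs) / len(xs)

def median(xs):
    # Order the data first, then take the middle value (n odd)
    # or the average of the two middle values (n even).
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(sample_mean([2, 4, 6, 8]))  # 5.0
print(median([5, 1, 3]))          # 3   (odd n: the unique middle value)
print(median([5, 1, 3, 7]))       # 4.0 (even n: average of 3 and 5)
```

Python's standard library provides the same computations as `statistics.mean` and `statistics.median`.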
We've got our small set of data from that study where students were looking at whether you get the same amount of french fries in your small orders; the data are provided for you there. Again, the note right here: whenever you have quantitative data, the first step should not be to calculate the mean or some standard deviation or something like that, but to graph it. The proper graph to show the distribution, to show the values that are possible, is a histogram for a quantitative variable. I'd say we have a sort of unimodal, bell-shaped picture again: roughly symmetric, ranging from the 60s up to the lower 80s, and centered around the mid-to-lower 70s. So there's our histogram. It makes sense to calculate a mean to summarize this data; we have some variability, but it is somewhat homogeneous. So let's calculate that mean first. How do we do it, and what symbol would represent it? The symbol would be x-bar. We'd plug this data set into our calculator or computer, add up all 12 observations, and divide by 12. Now, I rarely ask you to calculate a mean on an exam; I know you can do it with a computer or calculator, so I usually give you some basic summary measures. If we were to calculate this, would a value like 82 even make sense? Right? 82 would be way over here. Is that the balancing point of this?
Does that look like the mean? You just want to make sure that any calculation you do can be taken back to the picture you have, so it makes sense with that picture. 82 would not make sense; 69 may not make much sense either. But how about something more in the middle, 73.6? The units are always good to include when you're reporting numerical summaries; here it's weight in grams. 73.6 grams is visually about the balancing point; it makes sense with my histogram. We have had pictures of a histogram where we ask you to pick what you think might be the mean, and maybe which value might be the median, and you should be able to visualize that from the histogram, without calculating, and know how it relates back.

Alright, how about the median? For the median you need the data in order, and I've provided that for you. We have 12 observations, so the median is going to fall in the middle, between that 72 and that 74. So the median here will be what? 73: the average of the two middle observations, 73 grams. Now, the mean was 73.6 and the median was 73. The fact that those are similar, does that make sense here too? With a roughly symmetric distribution, the mean and the median should be very close to one another, and that supports it also. Alright, we have a couple of "what ifs" there at the bottom.
What if we had put the data into our calculator quickly, and instead of entering the 63 we actually entered just a 3? Our smallest observation was 63; what if we had accidentally typed in only the 3? How would that affect the measures of center we just calculated? How about the median? Would the median change at all? No. The median doesn't use every value directly; it uses the values only through their place, their size or order in the sorted list, so it doesn't use every value in the data set. The median is not going to be affected at all; it would stay the same. How about the mean? It will change. The mean uses every value; that 3 entering in place of the 63 is part of the total on top. The mean is going to do what, do you think? Go up? Go down? It's going to go down. One small value becoming smaller yet is going to drag the mean down. The mean would decrease, to 68.6. There would be only a couple of observations below the mean and 10 of them above it, much different from what we had before. So the mean would be smaller. We say the mean IS affected by extreme observations; the median is not so much. It is more of a resistant measure. Top of the next page, let's fill that idea in.
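The 63-versus-3 typo can be checked numerically. The twelve weights below are hypothetical, not the actual class data (which had mean 73.6), but the arithmetic of the typo is the same: the mean drops by (63 - 3)/12 = 5 grams, while the median never moves.

```python
from statistics import mean, median

# Hypothetical fry weights in grams: 12 orders, smallest 63,
# middle pair 72 and 74 (made-up values, not the class data).
weights = [63, 66, 68, 70, 71, 72, 74, 75, 76, 78, 80, 82]
typo    = [3 if w == 63 else w for w in weights]  # 63 mistyped as 3

print(mean(weights), median(weights))
print(mean(typo), median(typo))

# The mean uses every value, so it falls by (63 - 3) / 12 = 5 grams.
# The median uses only the order, and the two middle values are
# still 72 and 74, so the median stays at 73.
```

This is the resistance idea in miniature: one wild value moves the mean but not the median.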
The mean IS sensitive to extreme observations, whereas the median is our more resistant measure of center. So which measure of center might be the better one to report if, looking at your histogram, the distribution was strongly skewed, maybe with a couple of outliers on the high end or low end? The median. The median would paint a better picture of the middle observation, where more of the typical values tend to be, than the mean would; the mean can be affected by those extreme observations. Very good!

Alright, a couple of pictures to show the relationship between the mean and the median. What is the descriptor of this first one again? We call this bell-shaped, symmetric, unimodal; all of those words would work, and the symmetry is the main idea here. How would the mean and median compare, and where would they be? Both right in the middle, and approximately equal to each other. The smoothed-out histogram shown on the upper right is also symmetric. What's the other word to describe this one? Uniform: not unimodal but uniform. It doesn't really have any mode, or they're all modes if you will, because all values are equally likely. Uniform, and still symmetric. So it would be quite easy here to also find the mean and the median: you just need to find the midpoint.
So if I gave you the endpoints, you would be able to find the mean and the median for that one even without more detail about how many observations there are or the individual values. Alright, we've got two skewed distributions. This first one here is again what, skewed to the...? Skewed to the right; look at where the tail ends up being pulled out. Skewed to the right. That is more typical of income data, home sales, that type of thing. Skewed to the right means we've got a lot of small values, and then we're throwing a few really large values into the data set and averaging. So the mean and median are not going to be the same. Which one will tend to be higher? If you have a bunch of ones, twos, and threes and you throw in a couple of really large numbers, it's going to pull the mean toward those large values. Alright, skewed to the right: relative to each other, the median would be the smaller one, less than the mean. Exactly how much smaller, and exactly where they're placed: the median is the value on the axis such that the areas to the right and to the left are each about 50%, because the median divides things in half that way. The mean is more visually the balancing point of the histogram or the smooth curve. But relative to each other, that's how they should compare: the mean gets pulled in the direction of the tail. Our other distribution here is skewed to the left.
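The "mean gets pulled in the direction of the tail" claim is easy to see with a toy right-skewed data set (made-up numbers in the spirit of the income example):

```python
from statistics import mean, median

# A small right-skewed data set: lots of small values plus a few
# large ones thrown in (think incomes or home sale prices).
incomes = [1, 2, 2, 3, 3, 3, 4, 4, 50, 80]

print(median(incomes))  # 3.0: the typical value, resistant to the tail
print(mean(incomes))    # 15.2: pulled far toward the long right tail
```

Flip the signs of the tail values and the mean gets pulled below the median instead, which is the left-skewed case.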
271 00:23:09,106 --> 00:23:13,281 Very typical of your exam scores in this class. 272 00:23:13,281 --> 00:23:19,752 The mean and median again will not be equal to each other, but it would be the mean that gets pulled down 273 00:23:20,089 --> 00:23:25,340 compared to the median ... in this distribution. 274 00:23:26,140 --> 00:23:31,295 Which is why I often report the 5 number summary or the median ... as far as my measure of center 275 00:23:31,310 --> 00:23:35,218 when I report scores on your exams, rather than the mean; the mean can get pulled down by 276 00:23:35,218 --> 00:23:39,983 that one or two or few low scores that end up happening. 277 00:23:39,983 --> 00:23:42,771 And what's the descriptor for our last graph there? 278 00:23:45,171 --> 00:23:46,539 Bimodal 279 00:23:48,031 --> 00:23:50,611 Two main modes or peaks; it's still roughly symmetric, 280 00:23:50,611 --> 00:23:55,076 and what would the mean and median be there? 281 00:23:55,276 --> 00:23:58,644 Kind of in the middle again. 282 00:24:00,122 --> 00:24:07,073 Does either one of those measures of center really represent what we think of as the center, that sort of typical value? 283 00:24:07,073 --> 00:24:19,418 Not very well! ... So neither does a very good job as a summary measure here. 284 00:24:23,172 --> 00:24:28,344 One of the old exams had a picture that ended up being somewhat bimodal, showing distinct clusters, 285 00:24:28,344 --> 00:24:33,920 and the question was "should you report the mean here for this quantitative variable? is that an appropriate measure?" 286 00:24:33,920 --> 00:24:39,335 If yes, go ahead and report it, and it was right in the output summary; if no, explain why not. 287 00:24:39,428 --> 00:24:41,876 And for that type of picture, I would say no. 288 00:24:41,876 --> 00:24:46,967 And the reason is that you seem to have two subgroups and you shouldn't be aggregating them, combining them together.
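Both of those pictures can be sketched with invented numbers: a right-skewed "home sales" set where the mean is pulled above the median, and a bimodal set where both measures land in the empty middle.

```python
from statistics import mean, median

# Invented income-like data: mostly small values plus one very large sale.
sales = [1, 1, 2, 2, 2, 3, 3, 20]
print(median(sales), mean(sales))   # 2.0 4.25 -- mean pulled toward the right tail

# Invented bimodal data: two clusters with a gap in between.
waits = [2, 3, 3, 4, 10, 11, 11, 12]
print(mean(waits), median(waits))   # 7.0 7.0 -- both sit where almost no data is
```

For left-skewed data the same logic runs in reverse: the mean gets pulled below the median.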
289 00:24:48,120 --> 00:24:53,654 Alright, there are measures of center. Mean and median are the typical ones, we know when it's appropriate to use either one 290 00:24:53,654 --> 00:24:57,086 and when it's appropriate to perhaps use one over the other. 291 00:24:57,101 --> 00:25:01,474 The median is preferred when it's skewed, or when there are strong outliers in your data set. 292 00:25:01,721 --> 00:25:05,963 Measures of center are done. Measures of spread come next. 293 00:25:06,210 --> 00:25:10,776 So what if you had a distribution of scores on an exam that was somewhat 294 00:25:10,945 --> 00:25:18,846 unimodal, bell-shaped, symmetric, and the mean was reported to be 76 295 00:25:19,631 --> 00:25:21,738 and you scored an 88, how do you feel? 296 00:25:24,092 --> 00:25:24,914 Good? 297 00:25:25,329 --> 00:25:30,812 You're above the mean at least, right? Um? How good should you feel? 298 00:25:31,273 --> 00:25:38,983 Let's suppose 76 is our mean there, and 88 is right about here ... so there's one model. 299 00:25:39,106 --> 00:25:46,253 Would you feel better with the distribution of scores looking like that number one, or number two? 300 00:25:46,976 --> 00:25:51,953 Number two, right? Number two, relative to the peers and the scores, you're looking much better 301 00:25:52,144 --> 00:25:58,208 compared to the first distribution. They both have the same shape, they both have the same center, but 302 00:25:58,208 --> 00:26:03,233 they differ in their spread, or variation. And it's important to know that aspect of your model too. 303 00:26:03,233 --> 00:26:12,650 And a score of 88 against a mean of 76 may look better from one distribution than the other, relative to that idea of how spread out the values are. 304 00:26:12,650 --> 00:26:15,792 So a couple measures of spread to go along with measures of center. 305 00:26:15,792 --> 00:26:19,554 The two easiest ones of course, or the easiest one that is, is the range. 306 00:26:19,554 --> 00:26:25,793 Just look at the overall range of your data set.
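The "same center, different spread" idea can be checked with two made-up classes that both average 76 (the scores below are invented for illustration, not the lecture's data):

```python
from statistics import mean

# Two invented classes with the same mean score of 76 but different spread.
narrow = [70, 73, 75, 76, 77, 79, 82]
wide   = [55, 64, 71, 76, 81, 88, 97]
assert mean(narrow) == mean(wide) == 76

def share_below(scores, s):
    """Fraction of the class scoring below s (a rough percentile standing)."""
    return sum(x < s for x in scores) / len(scores)

print(share_below(narrow, 88))            # 1.0  -- an 88 tops the whole narrow class
print(round(share_below(wide, 88), 2))    # 0.71 -- less impressive with more spread
```

Same score, same mean, but your standing depends on the spread, which is exactly why we need a second summary number.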
The range is defined to be the max. minus the min. 307 00:26:25,793 --> 00:26:33,682 So if I have you calculate the range, I want you to take the largest value and subtract the smallest one off and get that spread of that 100% of your data. 308 00:26:33,682 --> 00:26:39,962 If I ask you to just comment on spread, one comment you can make is "it goes from 10 to 22" 309 00:26:39,962 --> 00:26:43,600 and that's kind of giving me the range, but it's not computing it. 310 00:26:43,954 --> 00:26:48,194 Another way of giving you some idea of the breakdown is to report some percentiles. 311 00:26:48,194 --> 00:26:53,147 In fact a lot of your standardized tests report your score, and what percentile you got. 312 00:26:53,147 --> 00:26:59,414 In the two pictures we just looked at you would have a higher percentile when there is less spread in your model. 313 00:26:59,414 --> 00:27:06,012 So percentiles tell you a value so that you know what percent are below it and what percent therefore are above it. 314 00:27:06,489 --> 00:27:09,298 Common percentiles, we've already done one of 'em. 315 00:27:09,298 --> 00:27:16,583 The median is actually a percentile of your distribution. It is the 50th percentile 'cause it cuts your data set in half, half below, half above. 316 00:27:16,583 --> 00:27:24,412 If you were to take the lower half of your data set, below the median, and pretend that's your data set and find the median again, 317 00:27:24,412 --> 00:27:34,525 that would be what is called the first quartile or the 25th percentile, and the first quartile is denoted typically with a Q1. 318 00:27:35,001 --> 00:27:41,063 If you were to take your data set, find the median, take all the values above the median, and find the median of that set again, 319 00:27:41,108 --> 00:27:50,423 you'd have the upper quartile or Q3 or third quartile ...
also known as the 75th percentile. 320 00:27:50,423 --> 00:27:53,623 So it's just dividing each half of the data set in half again, 321 00:27:53,623 --> 00:27:58,373 and those two quartiles then give you another little bit of positional idea in your data set. 322 00:27:58,373 --> 00:28:04,288 If you take those numbers, the median, the quartiles, along with the min. and the max. and put them in a table, 323 00:28:04,288 --> 00:28:09,180 this is the way your textbook lays out this summary, it's called the 5 number summary. 324 00:28:09,180 --> 00:28:14,205 It gives it to you all in one kind of picture here. A couple measures of spread. 325 00:28:14,205 --> 00:28:22,320 One measure of spread would be the max. minus the min., so you get the range by looking at that distance. 326 00:28:22,320 --> 00:28:28,475 You get the median as your measure of center, and then another measure of spread that's sometimes used 327 00:28:29,121 --> 00:28:34,365 instead of the overall range is called the interquartile range or I Q R. 328 00:28:37,102 --> 00:28:39,824 Interquartile range. 329 00:28:40,517 --> 00:28:46,839 It's a measure of the spread of the middle 50% of your data, 'cause the range depends on your two most extreme values. 330 00:28:46,839 --> 00:28:51,378 What if one of those values is an outlier? Then your range looks distorted to be quite large 331 00:28:51,378 --> 00:28:54,930 when most of your data is really in this range instead, a smaller range. 332 00:28:54,930 --> 00:28:57,530 IQR, interquartile range. 333 00:28:57,530 --> 00:29:02,790 So a nice set of summaries for when you have skewed data or outliers would be a 5 number summary. 334 00:29:02,790 --> 00:29:06,421 It gives your center and spread in a couple of ways. 335 00:29:06,591 --> 00:29:10,463 Let's try it out. Our French Fries data set again. 336 00:29:12,109 --> 00:29:18,825 Why don't you go ahead and try to work out the 5 number summary, we found the median already. 337 00:29:19,164 --> 00:29:22,262 Min. and max.
are easy to put in. 338 00:29:22,632 --> 00:29:25,699 And then let's find those quartiles. 339 00:29:26,099 --> 00:29:28,497 So 12 observations, 340 00:29:30,897 --> 00:29:32,220 middle, 341 00:29:32,392 --> 00:30:04,005 (silence) 342 00:30:04,128 --> 00:30:13,053 So I'm finding Q1 and Q3. Not too bad here right? You just take your data that's below and above that median position. 343 00:30:13,514 --> 00:30:19,074 What is going to be Q1? ... Any of these lower observations, there's 6 of them down there, 344 00:30:19,367 --> 00:30:23,423 so the middle falls right here now, between the 69 and the 70. 345 00:30:24,254 --> 00:30:35,662 So Q1, the lower quartile, would be what? 69.5 and on the upper side there's another 6 observations in that upper half and 346 00:30:35,662 --> 00:30:42,111 finding the middle there would be Q3, 78.5. 347 00:30:42,572 --> 00:30:49,619 Alright, so the range is easily found, the actual computation of the range would be the maximum value minus the min. 348 00:30:49,619 --> 00:30:54,573 The overall range of the data covers ... 20 grams. 349 00:30:54,573 --> 00:30:57,175 For a spread. 350 00:30:57,299 --> 00:31:02,723 The interquartile range, defined again to be Q3 minus Q1. 351 00:31:02,939 --> 00:31:14,375 Looking at the spread of the middle 50% of your data would be that 78.5 minus the 69.5. 352 00:31:14,390 --> 00:31:18,280 A difference there of only 9 grams instead. 353 00:31:20,004 --> 00:31:26,823 So the middle 50% cover a range of 9 grams. You can compare from one data set to another with common measures such as IQR 354 00:31:26,823 --> 00:31:32,733 to see which one looks like it's more spread out in terms of that middle 50%. ... Now a quick "what if" here. 355 00:31:32,733 --> 00:31:39,096 What if that 83 were not there? So instead of having 12 observations, I have 11, 356 00:31:40,341 --> 00:31:44,282 so now what will be my median of this data set?
357 00:31:45,066 --> 00:31:46,671 The 72 358 00:31:48,732 --> 00:31:52,468 and looking at finding Q1 and Q3, what would I do? 359 00:31:52,468 --> 00:32:00,691 So now, I'm looking at just the 5 observations below the 72, I'm not going to include the 72 in that lower half. 360 00:32:00,691 --> 00:32:07,300 So the definition of Q1 when the median is one of the values in your data set is everything strictly below, and strictly above for Q3; 361 00:32:07,300 --> 00:32:10,263 that works for both an odd and an even number of observations. 362 00:32:10,263 --> 00:32:20,039 So then my Q1 would have been a 69, and looking at only the 5 observations above, my Q3 would have been the 78, 363 00:32:20,039 --> 00:32:24,384 just so you know the calculations the computers are doing; that's typically the method used. 364 00:32:24,384 --> 00:32:32,512 Sometimes they specifically take .25 times the sample size to find the position of that 25th percentile 365 00:32:32,512 --> 00:32:38,705 and even interpolate, but this is the method that most calculators and computers use. 366 00:32:38,705 --> 00:32:44,111 Alright, the test score example is another one that you can try out on your own, 367 00:32:44,111 --> 00:32:50,055 it's really not too difficult to work through, just report values from that 5 number summary. 368 00:32:50,055 --> 00:32:53,663 It's very similar to one of the examples in chapter one, which I asked you to read initially, 369 00:32:53,663 --> 00:32:57,548 and it went through a 5 number summary there, pulling out a few values. 370 00:32:57,841 --> 00:33:01,683 So try that one out, that would be posted up on c-tools ... very soon. 371 00:33:02,267 --> 00:33:07,810 We're gonna look at another graph today ... called boxplots. 372 00:33:07,810 --> 00:33:13,831 You probably saw that in your prelab if you looked at your prelab video or maybe even ... saw one in lab. 373 00:33:14,107 --> 00:33:20,656 It is a picture of your 5 number summary ... in graphical form.
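The split-halves method just described can be sketched in code. The lecture only reads out some of the fries values (63, 69, 70, 72, 78, 80, 83); the 65, 71, 74, 76, and 79 below are placeholders chosen so that every summary number matches the ones worked out in class.

```python
def median(xs):
    """Middle value; average the two middle values when n is even."""
    xs = sorted(xs)
    n = len(xs)
    return xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2

def quartiles(xs):
    """Split-halves rule: the median itself is left out of both halves when n is odd."""
    xs = sorted(xs)
    n = len(xs)
    return median(xs[: n // 2]), median(xs[(n + 1) // 2 :])

fries = [63, 65, 69, 70, 71, 72, 74, 76, 78, 79, 80, 83]   # grams per order
q1, q3 = quartiles(fries)
print(median(fries), q1, q3)                # 73.0 69.5 78.5
print(max(fries) - min(fries), q3 - q1)     # range 20, IQR 9.0

fries11 = [x for x in fries if x != 83]     # the "what if 83 were gone" case
print(median(fries11), quartiles(fries11))  # 72 (69, 78)
```

With 12 values, each half has 6 observations; with 11, the median (72) is dropped and each half has 5, reproducing Q1 = 69 and Q3 = 78 from the lecture.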
374 00:33:20,656 --> 00:33:26,200 So you take your 5 number summary ... you take your quartiles Q1 and Q3 375 00:33:26,200 --> 00:33:33,337 and you use those to form your box. The length of the box is your IQR, visually, 376 00:33:33,337 --> 00:33:43,124 so Q1 and Q3, so this length here is really your interquartile range; it's visually showing you that spread of the middle 50% of your data. 377 00:33:43,124 --> 00:33:48,351 Inside the box, wherever it occurs, you put your median with a line drawn across the box. 378 00:33:48,351 --> 00:33:56,786 Now here it's shown more in the middle, but it could be that your Q1 and your median are the same; if you had a lot of repeats in your data set, it's possible, 379 00:33:56,786 --> 00:33:59,880 so there might not even be a line in the middle of the box. 380 00:33:59,880 --> 00:34:04,252 And then you do a little check. You check to see if there are any outliers, 'cause if there's outliers, 381 00:34:04,252 --> 00:34:08,089 I want to still show them visually in the graph that we're making here. 382 00:34:08,089 --> 00:34:13,770 There's a rule for checking for outliers, it's called the 1.5 times IQR rule 383 00:34:13,770 --> 00:34:24,030 'cause you calculate 1.5 times IQR, and that quantity's your step, your amount that you go out from the quartiles. 384 00:34:24,030 --> 00:34:28,484 You take Q1 and you go down one step, 385 00:34:28,484 --> 00:34:33,028 you take Q3 and you go up ... one step, 386 00:34:33,028 --> 00:34:36,883 and those values are your fences or your boundaries. 387 00:34:36,883 --> 00:34:44,281 Turns out with a little bit of theory you can show that any values that are OUTSIDE those fences are unusual; 388 00:34:44,281 --> 00:34:47,491 they would be deemed an outlier using this rule.
389 00:34:47,491 --> 00:34:54,225 So you take 1.5 times the IQR, you go out a step from your quartiles, put little boundaries there, fences, 390 00:34:54,225 --> 00:34:57,935 and say do I have any observations that are outside those fences? 391 00:34:57,935 --> 00:35:02,245 If I do, I want to plot them separately and draw attention to them 'cause they're sticking out from the rest; 392 00:35:02,245 --> 00:35:10,777 if I don't ... then just extend the lines out to the actual min. and max. 'cause there were no outliers in that data set. 393 00:35:10,777 --> 00:35:16,119 So you plot them individually if you have any; if not, just extend the lines out to the smallest and largest observation. 394 00:35:16,119 --> 00:35:22,207 If there are outliers, you say, apart from those outliers what are the min. and max., and draw your boxplot accordingly. 395 00:35:22,207 --> 00:35:25,744 So this boxplot right here is for your 12 orders of french fries, the weights. 396 00:35:25,928 --> 00:35:32,010 I do not see any points plotted outside those whiskers, plotted individually, so 397 00:35:32,010 --> 00:35:37,062 I didn't have any outliers in this example. If you remember the histogram that we looked at a little bit ago, 398 00:35:37,062 --> 00:35:40,886 it didn't have any outliers either, with this set of 12 observations. 399 00:35:40,886 --> 00:35:47,734 So I see no outliers on either end here; they would have been plotted separately. 400 00:35:47,734 --> 00:35:51,191 I don't see any outliers, so what we're going to do is confirm that there aren't any outliers, 401 00:35:51,191 --> 00:35:59,044 trying out this rule once ... and then pretend that we do have one larger value and see how that affects the boxplot overall. 402 00:35:59,044 --> 00:36:03,959 Boxplots are gonna be made for you typically, and I just want you to know how they're constructed, 403 00:36:03,959 --> 00:36:07,595 so let's do one little check on how that rule works.
404 00:36:07,595 --> 00:36:11,495 What we're doing basically is verifying that there are no outliers in our data set. 405 00:36:11,495 --> 00:36:17,146 We didn't see it visually in the histogram, the boxplot we just saw there didn't show it, so let's see 406 00:36:17,146 --> 00:36:20,568 what that rule is that shows us there aren't any outliers. 407 00:36:20,737 --> 00:36:29,484 The interquartile range is the first thing to compute, and that was a difference of 9 grams, we just did that together ... a minute ago. 408 00:36:29,484 --> 00:36:32,006 Calculate that thing called a step: 409 00:36:32,344 --> 00:36:34,870 1.5 times the IQR. 410 00:36:35,055 --> 00:36:39,588 So this is my step amount ... 13.5. 411 00:36:39,988 --> 00:36:46,077 13.5 is not put on my actual boxplot anywhere, it's just the amount I'm going to go out from the quartiles 412 00:36:46,077 --> 00:36:49,921 to figure out where these fences can be drawn or thought through. 413 00:36:49,921 --> 00:36:57,874 So the lower boundary or fence is taking Q1 and subtracting one step. So go from 69.5, which is Q1, 414 00:36:57,874 --> 00:37:06,896 subtract off one step amount and come up with your lower boundary, which is a 56. 415 00:37:07,266 --> 00:37:10,619 Now that's not necessarily gonna be a value in your data set, but you're going to ask yourself, 416 00:37:10,619 --> 00:37:15,848 "in my data set did I have any values that were even smaller than that lower boundary?" 417 00:37:15,848 --> 00:37:21,683 Any observation that fell below ... 56. What was our lowest value? 418 00:37:22,207 --> 00:37:26,322 63. So we have no outliers on the low side. 419 00:37:26,322 --> 00:37:33,346 Any observation that falls below this lower boundary? No, so there are no low outliers. 420 00:37:35,161 --> 00:37:41,468 If my lowest value was a 55 or a 54, that would start to be deemed an outlier, and I would be plotting that separately, 421 00:37:41,468 --> 00:37:47,626 but there is no outlier on the low end.
The 56 didn't appear on my boxplot that you had at the top of that page, 422 00:37:47,626 --> 00:37:52,197 but that is the boundary that I'm using to determine if there would be a point I would plot separately. 423 00:37:52,197 --> 00:38:00,902 What about on the upper end? The upper boundary or fence is 78.5 going up one step. 424 00:38:01,533 --> 00:38:05,281 This is a boundary of 92. 425 00:38:06,451 --> 00:38:09,482 Again, you may or may not even have that value in your data set, but you're gonna be asking, 426 00:38:09,482 --> 00:38:16,294 do I have any values that go beyond that number? Any values on the high end, above 92? 427 00:38:16,294 --> 00:38:19,967 Our largest value was? ... 83. 428 00:38:20,136 --> 00:38:27,954 So the answer here is also no. So there are no ... high outliers or large outliers. 429 00:38:28,600 --> 00:38:33,917 So we've just confirmed that the graph we just made was the graph using that rule; it just didn't have any outliers, 430 00:38:33,917 --> 00:38:37,007 so we just drew the lines out to the min. and max. 431 00:38:37,007 --> 00:38:41,377 What would happen if there were an outlier? How does the boxplot change? 432 00:38:41,377 --> 00:38:49,160 So we're going to pretend on the next page that that largest value of 83 is really a 93, so we will have an outlier, 433 00:38:49,160 --> 00:38:51,578 and see how the graph changes. 434 00:38:53,208 --> 00:39:01,342 So if we change from 83 to 93, that just makes the one large value a bit larger; does the median change? 435 00:39:01,342 --> 00:39:05,731 Is the median affected by those extreme kinds of values? No, still 73. 436 00:39:05,731 --> 00:39:10,631 In fact the quartiles are going to be the same too. We're just shifting that one large value out there further, 437 00:39:10,631 --> 00:39:16,829 so it's still going to be the same Q1 and Q3. Everything's the same except for the largest value, which is a 93 now.
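The whole 1.5 × IQR check can be scripted. Same caveat as before: the fries values not read aloud in class are plausible placeholders, but the quartiles, step, fences, outlier, and whisker ends below all match the lecture's numbers for the "83 becomes 93" version.

```python
q1, q3 = 69.5, 78.5                         # quartiles from the worked example
step = 1.5 * (q3 - q1)                      # 1.5 * IQR = 13.5
lo_fence, hi_fence = q1 - step, q3 + step   # invisible fences: 56.0 and 92.0

data = [63, 65, 69, 70, 71, 72, 74, 76, 78, 79, 80, 93]   # 83 bumped up to 93
outliers = [x for x in data if x < lo_fence or x > hi_fence]
inside = [x for x in data if lo_fence <= x <= hi_fence]
print(outliers)                   # [93] -- plotted separately on the boxplot
print(min(inside), max(inside))   # 63 80 -- where the whiskers actually end
```

The fences themselves (56 and 92) never get drawn; the whiskers run out to the most extreme values that are not outliers, 63 and 80.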
438 00:39:17,829 --> 00:39:26,731 The boundaries that we computed on the previous slide, that 56 lower boundary and that 92, still stand as the boundaries to determine if we have outliers, 439 00:39:26,731 --> 00:39:31,637 but now on the high side, 92's out there and I have a 93. 440 00:39:31,867 --> 00:39:37,177 It goes beyond ... so now I do have one high outlier, the value of my 93. 441 00:39:37,315 --> 00:39:43,639 That's the point that's plotted separately here. 442 00:39:45,686 --> 00:39:52,941 A different symbol's used depending on the package, it might be dots or asterisks, but any point that's plotted separately is an outlier by this rule. 443 00:39:52,941 --> 00:40:00,511 Without knowing the data set, there could actually be more than one outlier here, 'cause there could be two values at that same high value. 444 00:40:00,511 --> 00:40:06,637 But we plot that separately and then ... the lower whisker still goes out to the 63 'cause we didn't have any low outliers. 445 00:40:06,652 --> 00:40:17,278 That upper whisker or line goes up to an 80. Notice that this doesn't go up to the 92, which was our boundary, 'cause we didn't have any values at 92; 446 00:40:17,278 --> 00:40:21,442 that was just used to determine if we have an outlier or not. 447 00:40:21,442 --> 00:40:28,765 The upper end here should go up to the largest value in our data set that's NOT an outlier, which we would have plotted separately, 448 00:40:28,765 --> 00:40:32,945 and our largest value that is not an outlier is? 449 00:40:32,945 --> 00:40:33,660 80 450 00:40:33,660 --> 00:40:35,496 Why does the line extend out to 80? 451 00:40:36,886 --> 00:40:38,277 It's the largest value 452 00:40:42,017 --> 00:40:43,679 that is not an outlier. 453 00:40:52,182 --> 00:40:57,854 The lower boundary of 56 and the upper boundary of 92 don't appear on the graph.
454 00:40:57,854 --> 00:41:03,269 They are sort of your invisible fences, one kind of fence right there, and one way down over here, just to 455 00:41:03,269 --> 00:41:07,810 determine which values, if there are any, would be plotted separately. 456 00:41:07,810 --> 00:41:13,423 So now I can see truly where most of the values are. I see the outlier plotted separately, I see the gap, 457 00:41:13,423 --> 00:41:15,353 which I would have seen in the histogram too. 458 00:41:15,353 --> 00:41:19,880 If you had the histogram and just that one large outcome was moved way out, you would have 459 00:41:19,880 --> 00:41:25,744 had a gap of values and seen that outlier through the histogram, but the boxplot also draws it to your attention. 460 00:41:26,098 --> 00:41:31,510 Alright ... so the boxplots are going to be made for you; I don't like to have you work out this IQR rule by hand 461 00:41:31,510 --> 00:41:37,872 and try to draw them out, but know what these values are that appear separately and how they came to be plotted that way. 462 00:41:38,056 --> 00:41:40,408 Couple notes then to fill in. 463 00:41:40,732 --> 00:41:43,995 Boxplots are very helpful when you want to make comparisons. 464 00:41:44,149 --> 00:41:51,217 You can make side by side boxplots, one boxplot for males, one for females, to compare the two distributions quite easily. 465 00:41:51,217 --> 00:41:59,879 So side by side boxplots are good for ... comparing ... two or more ... sets of observations. 466 00:42:06,125 --> 00:42:14,140 It automatically puts them on the same axes, the same scaling, so that you don't have to go in and click the axes and make the two histograms have the same 467 00:42:14,140 --> 00:42:18,050 X and Y axis scaling so you can compare them; it puts 'em right next to each other.
468 00:42:18,050 --> 00:42:23,923 We do that quite often for a visual check when we look at multiple comparisons for different groups, 469 00:42:23,923 --> 00:42:26,990 watching out for those points that are plotted separately. 470 00:42:26,990 --> 00:42:29,790 They're called what again? ... They're outliers. 471 00:42:29,790 --> 00:42:35,009 And my point here is that they are still part of the data set. 472 00:42:40,731 --> 00:42:47,022 I don't want you to ignore them; in fact they were plotted there separately so you would be drawn to see them. 473 00:42:48,268 --> 00:42:52,451 Maybe that outlier is the most interesting observation in your data set. 474 00:42:52,543 --> 00:43:00,163 Here's all the data, and that observation's really good. How did you get to be THAT good for that combination of factors that I put together? 475 00:43:00,163 --> 00:43:02,716 I might wanna focus on that one rather than the rest. 476 00:43:03,008 --> 00:43:07,937 Alright, so they're still part of the data set, don't ignore them; they're there to show you that they did stand out 477 00:43:07,937 --> 00:43:10,519 from the general rest of the data set. 478 00:43:10,519 --> 00:43:17,073 One thing that boxplots do not do very well, that we can't see from a boxplot alone very well, is shape. 479 00:43:17,288 --> 00:43:21,957 You can't confirm shape from a boxplot only. 480 00:43:27,602 --> 00:43:37,551 What graph does a much better job of showing us the shape of the distribution of our quantitative variable? ... A histogram. 481 00:43:44,857 --> 00:43:52,488 Your boxplot can look ... beautifully symmetric, but your data set underlying it may not be. 482 00:43:52,488 --> 00:44:00,026 Just because your median is right in the middle between Q1 and Q3, you don't know what happens between Q1 and the median, or between the median and Q3. 483 00:44:00,026 --> 00:44:05,008 How are the data distributed in there?
They could be all clumped to one end, and then clumped to the middle on the other side, 484 00:44:05,008 --> 00:44:09,798 so it may not ... have that same pattern that you're thinking the boxplot tends to show. 485 00:44:09,798 --> 00:44:15,197 So it doesn't confirm shape, though you can start to see skewness. A boxplot showing some skewness would be 486 00:44:15,197 --> 00:44:19,801 a long tail going out and a bunch of outliers on one end, so you can start to see or visualize that, 487 00:44:19,801 --> 00:44:23,265 BUT it still doesn't confirm the shape; we would wanna use a histogram instead. 488 00:44:23,265 --> 00:44:27,800 And then one of the graphs you're gonna look at in lab #2, 489 00:44:27,800 --> 00:44:33,220 um, no labs next week, remember?! 'cause there are no Monday classes, so we don't have labs at all next week. 490 00:44:33,220 --> 00:44:39,503 But in the 2nd lab we're gonna look at Q-Q plots... That'll show you whether you have a bell-shaped model in a better way. 491 00:44:39,503 --> 00:44:45,053 We're gonna look at time plots 'cause a lot of data we gather over time. So a couple more graphs that we'll look at. 492 00:44:45,053 --> 00:44:49,811 And then my last comment really helps you on an exam ... or a quiz if we were to give one, 493 00:44:49,811 --> 00:44:53,619 and that is: show me what you're doing when you're reading values off the graph! 494 00:44:53,619 --> 00:44:57,924 If you're looking all over and reading that value that was the outlier as a 92 instead of a 93, 495 00:44:57,924 --> 00:45:05,291 well, I might give you credit for something that's reasonable, based on where the axes are and how fine they are for you to read values from. 496 00:45:05,291 --> 00:45:10,754 Show me what you're doing so I can see your approach ... and still give you credit for the right process. 497 00:45:10,754 --> 00:45:14,733 Alright, so on the next ... Oh!
I got my comic of the day 498 00:45:14,733 --> 00:45:19,857 Recycling, go green (sings softly: aluminum cans, bottles and) (soft laughter) Recycle those boxploh ... uh yeah, Ok! 499 00:45:23,857 --> 00:45:30,846 What I have next for you is to try out a question that was on an exam in the past. We have here scores 500 00:45:30,862 --> 00:45:40,233 on a standardized test, for children, and some of those children ate breakfast, some of them did not. 501 00:45:40,233 --> 00:45:45,066 So we have their scores compared side by side. Side by side boxplots. So I'm gonna ask you to review 502 00:45:45,066 --> 00:45:49,136 the notes on that page 16 that we just went through, or keep them in mind anyway, and work with a neighbor. 503 00:45:49,136 --> 00:45:55,979 Try out these 3 questions, and then we'll click them in. I'm gonna give you some choices then for the actual answers, 504 00:45:55,979 --> 00:45:58,295 but try them out, and we'll see how you do. 505 00:45:58,295 --> 00:46:03,503 This is a chance for you to talk with your neighbor and work out a problem. 506 00:46:03,503 --> 00:49:31,880 (silence) 507 00:49:31,880 --> 00:49:35,815 (inaudible talking in background) 508 00:49:35,815 --> 00:49:41,903 So this question that you have was on an exam before, but not with multiple choice on the exam. 509 00:49:41,903 --> 00:49:50,793 If your answer's not exactly there ... it's probably close to one of those answers perhaps ... pick the closest one. 510 00:49:51,516 --> 00:49:57,314 Do you notice in the questions there are sometimes words that are bold? ... or italicized? 511 00:49:57,314 --> 00:50:00,604 That usually means they're important ... to help guide you to the right boxplot. 512 00:50:08,771 --> 00:50:18,959 We're asking here for the approximate lowest score or grade ... for a child who ... "does" have breakfast. 513 00:50:20,697 --> 00:50:24,442 So I want "Do you have breakfast? YES!" 514 00:50:25,426 --> 00:50:27,275 That's the boxplot I wanna work with.
515 00:50:27,552 --> 00:50:35,291 Looking pretty good! Here's your distribution, most of you indeed are picking the right answer. Which is ... B) ... 4.5 516 00:50:35,460 --> 00:50:40,141 Plus 5 points minus 10 over here 517 00:50:41,110 --> 00:50:42,557 Alright! 518 00:50:46,021 --> 00:50:49,217 The ... lowest value, 4.5, is this value right here. What is that called? That's an ... outlier. (Outlier) 519 00:50:52,139 --> 00:50:55,615 But is it still part of the data set? (Yes) (It is) 520 00:50:55,615 --> 00:51:00,615 It's the lowest value. There could've even been two children that had 4.5. But it is the minimum. 521 00:51:00,954 --> 00:51:06,171 If you ignored the outliers, yes, then the next smallest observation is 6. 522 00:51:06,171 --> 00:51:10,117 But 6 is not the minimum value here, for that data set. 523 00:51:10,117 --> 00:51:16,249 And of course the other, um, answer of 4 was for the other group, in case you went to the wrong boxplot. 524 00:51:16,249 --> 00:51:17,798 Highlighting that. 525 00:51:17,798 --> 00:51:22,796 Now if someone wrote, and I saw someone in my other class write 4.6, I would have given credit for that, 526 00:51:22,796 --> 00:51:24,425 'cause that's close to what I have there. 527 00:51:24,425 --> 00:51:29,687 You know it's pretty much right in the middle, but if you read it off as 4.4 or 4.6 you'd still get credit, 528 00:51:29,687 --> 00:51:35,446 especially if you were to circle it or bring it over and kinda tell me what you think that number is, on the axis. 529 00:51:35,446 --> 00:51:42,664 Alright, a little more curious about this ... next question ... and go ahead! 530 00:51:46,326 --> 00:51:55,519 Among children who did NOT eat breakfast ... 25% had a score of at least how many points? 531 00:51:59,284 --> 00:52:14,459 (inaudible talking in background) 532 00:52:14,983 --> 00:52:16,981 We're a little more split on this one. (Uh oh) 533 00:52:18,304 --> 00:52:20,816 For our question 2.
534 00:52:21,554 --> 00:52:25,900 Alright, let's take a look. 535 00:52:26,776 --> 00:52:30,028 Wanna change your answer? Um? 536 00:52:31,073 --> 00:52:40,782 Question 2, I see a 25%, so I'm sort of thinking a quartile rather than the median. So the choice over these three. 537 00:52:40,814 --> 00:52:47,817 Let's take a look. I'm gonna tell you right up front that the right answer is ... A) 538 00:52:48,524 --> 00:52:50,608 Let's see WHY? 539 00:52:51,915 --> 00:52:58,162 Alright, we're looking at which boxplot? The one with the children that did ... not eat breakfast. So the "NO" group. 540 00:52:58,162 --> 00:53:04,444 I do see what? What is the median for this one? About 6 ... and a half. 541 00:53:04,444 --> 00:53:12,163 That would be my median ... and my quartiles are about this 5.5 ... and then about 7.5 542 00:53:13,470 --> 00:53:18,497 And I see a 25%, but 25% does not necessarily mean it's gonna be a Q1. 543 00:53:18,821 --> 00:53:26,032 When you take Q1, you've got 25% below, and what percent above Q1? ... 75 544 00:53:26,032 --> 00:53:34,260 When you take Q3, you've got another 75/25 split. There's 25% above it and 75% below it. 545 00:53:34,260 --> 00:53:40,489 So which quartile you use depends on which direction you were looking to go for that 25%. 546 00:53:40,618 --> 00:53:47,070 So often in these kinds of questions it's almost easier to just say "I know it's a Q1 or Q3", 'cause it could be kinda tricky that way. 547 00:53:47,070 --> 00:53:51,103 Put each one in and see if it makes the sentence correct. 548 00:53:51,103 --> 00:53:57,054 So let's try the one that was the most common answer. Most of you took 25% and said "oh, it's Q1, 5.5". 549 00:53:57,054 --> 00:54:08,526 So what if we put 5.5 in our sentence here? It says ... you're looking at the percent of students that had a score of ... at least ... 5.5 points. 550 00:54:08,526 --> 00:54:14,548 So at least 5.5 points means 5.5 points or ... more.
551 00:54:14,918 --> 00:54:19,900 And what percent of the students had a score of 5.5 points or more? or at least 5 and a half points? 552 00:54:19,900 --> 00:54:23,541 75% ... NOT 25%. 553 00:54:23,541 --> 00:54:30,617 There are 75% ... that had a score of at least 5 and a half points. 554 00:54:30,617 --> 00:54:40,385 Let's put in the other quartile, Q3 of 7.5. We put in 7.5 and say what percent of the students had a score of at least 7 and a half points. 555 00:54:40,385 --> 00:54:48,383 At least means that many points or? ... more. And what percent are there?... 25% 556 00:54:48,383 --> 00:54:51,887 The correct answer is A). 557 00:54:51,887 --> 00:54:59,740 If it were "at most" instead of "at least" ... if I had at most there then you would be calculating that many points or?... less, 558 00:54:59,740 --> 00:55:02,862 and then you do want 25%, so that's Q1. 559 00:55:02,862 --> 00:55:10,554 So is it "at least" or "at most"? is it more than or less than? Whether you want the 25% or the 75% in that spot tells you which one to use. 560 00:55:10,554 --> 00:55:14,749 You just have to think it through a little bit, or kinda try out both numbers and see which one ends up 561 00:55:14,749 --> 00:55:22,211 making it correct... noting the direction you're going and what percent you want there... in that direction. 562 00:55:23,042 --> 00:55:30,767 Alright, so that one I put there specifically 'cause that was one on a quiz, I think for spring term, and I did have a number of students get it wrong. 563 00:55:31,151 --> 00:55:34,627 I'm doing it now so you won't get it wrong if it were asked somewhere on your exam. 564 00:55:34,627 --> 00:55:39,102 Our last one is a true/false. Let's see what you choose there. 565 00:55:39,594 --> 00:55:46,275 We have a very nice symmetric boxplot for our "not eating breakfast" group. 566 00:55:46,413 --> 00:55:54,441 The statement is that this implies the distribution for scores is this nice bell-shaped symmetric distribution.
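(The try-each-quartile strategy from question 2 can be sketched in a few lines of Python. The scores below are invented for illustration, not the actual class data, and the quartiles numpy computes for them differ slightly from the 5.5 and 7.5 read off the boxplot.)

```python
import numpy as np

# Hypothetical scores for the "did not eat breakfast" group
# (made-up numbers, NOT the actual class data set).
scores = np.array([4.5, 5.0, 5.5, 6.0, 6.5, 6.5, 7.0, 7.5, 7.5, 8.0, 8.5, 9.0])

q1, q3 = np.percentile(scores, [25, 75])

# "At least x" means x or MORE, so count the fraction of scores >= x.
frac_at_least_q1 = np.mean(scores >= q1)   # 75% of scores are at least Q1
frac_at_least_q3 = np.mean(scores >= q3)   # 25% of scores are at least Q3

print(q1, q3, frac_at_least_q1, frac_at_least_q3)
```

Trying both quartiles this way shows why Q3, not Q1, completes the sentence "25% had a score of at least ___".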
567 00:55:55,025 --> 00:55:57,291 No outliers in that one. 568 00:55:58,460 --> 00:56:03,557 And ... most of you paid attention to the notes on the previous page. 569 00:56:04,511 --> 00:56:08,124 Notes on the previous page would lead you to say "This is false". 570 00:56:09,186 --> 00:56:15,632 It is false. ... It does not imply the distribution is bell-shaped, just because the boxplot is symmetric. 571 00:56:18,924 --> 00:56:25,820 A symmetric boxplot ... does not necessarily guarantee that your data when you plot it would be symmetric. 572 00:56:26,082 --> 00:56:30,674 It tells you that your median and your Qs, your Q1 and Q3, are nice and symmetric from each other, 573 00:56:30,674 --> 00:56:33,598 and even if you went out to the max. and min. the distance would be the same. 574 00:56:33,598 --> 00:56:38,612 But what happens in between, in those 25% chunks you've got? 575 00:56:38,612 --> 00:56:41,902 They could be distributed in different ways that are not mirrored or symmetric. 576 00:56:42,010 --> 00:56:46,978 And even if it were symmetric, does it mean it's gonna be this type of symmetry? bell-shaped? 577 00:56:47,362 --> 00:56:58,456 Aren't there other symmetric models? Bimodal will give you a symmetric boxplot, just the same as a unimodal bell-shaped curve will give you a symmetric boxplot. 578 00:56:58,456 --> 00:57:00,873 So you can't see the clusters. 579 00:57:00,873 --> 00:57:04,856 You can't see bimodality from your boxplot. It hides that aspect of shape. 580 00:57:04,856 --> 00:57:13,186 Boxplots hide clusters and gaps, you don't see them as readily. So shape is not best inferred from a boxplot. 581 00:57:13,186 --> 00:57:17,258 Look at the histogram along with it. So false! 582 00:57:17,489 --> 00:57:24,626 Alright ... now we did have one outlier, at least one outlier in our data set, there could actually be two children or more there, at that low value. 583 00:57:24,980 --> 00:57:29,104 What do you do with outliers?
584 00:57:29,458 --> 00:57:33,202 There's a small section in your text, Section 2.6, that talks about how to handle them. 585 00:57:33,264 --> 00:57:37,213 It gives you a couple of good examples for you to look at. 586 00:57:37,213 --> 00:57:43,545 The primary idea is that you just can't throw them out and say "oh they don't match, let's just not use them". 587 00:57:43,545 --> 00:57:50,483 You have to take a look at your data, it might be a legitimate value, it might represent something that's been gathered under different conditions. 588 00:57:50,698 --> 00:57:59,647 Joel, who did the measuring of these parts, knew how to operate this machine. Jim came in while Joel stepped out to the bathroom, did a measurement, and it was way wrong, 589 00:57:59,647 --> 00:58:03,501 and that would indicate that you shouldn't use it 'cause it wasn't calibrated correctly or whatever. 590 00:58:03,501 --> 00:58:08,129 It could just be natural variability, and you occasionally get a value that stands out, 591 00:58:08,236 --> 00:58:13,782 but more than likely it could be that someone else made the measurement, or the measurement was entered incorrectly, 592 00:58:13,782 --> 00:58:17,324 you can go back and check your records, just a switching around of the values or something, 593 00:58:17,324 --> 00:58:23,438 but you can't throw it out unless there's a legitimate reason to say it doesn't fit with the rest of the data for this situation, 594 00:58:23,438 --> 00:58:27,712 it's measured by a different person or on a different machine ... or under a different condition. 595 00:58:27,881 --> 00:58:34,195 It could be that it's an interesting value to look at, and that might be your focus. 596 00:58:34,195 --> 00:58:37,946 It could be that that particular person or item belongs to a different group. 597 00:58:37,946 --> 00:58:42,201 That was the only male in your data set and everybody else that you asked the question of was female.
598 00:58:42,201 --> 00:58:45,910 So that could give you a different rating or response. 599 00:58:45,910 --> 00:58:51,758 Alright, so, some good examples there. We have techniques that are more resistant to outliers than others, 600 00:58:51,758 --> 00:58:53,843 we have a median that's better to report than a mean ...if there were an outlier. 601 00:58:54,335 --> 00:58:59,671 Question in the back? ... or just stretching? 602 00:59:00,179 --> 00:59:07,744 Alright, one last measure of spread ....Standard Deviation... and we end with our pictures of the day. 603 00:59:09,359 --> 00:59:14,668 So how many of you have heard of the standard deviation too, as another numerical summary? 604 00:59:14,668 --> 00:59:18,485 How it's calculated, how you interpret it, would be our last focus. 605 00:59:18,869 --> 00:59:28,472 So if you have the median as your measure of center, then you're most likely gonna report the range or the IQR, interquartile range, to go along with it. 606 00:59:28,472 --> 00:59:36,102 If you had the mean, as a reasonable measure of center, then often you see next to that is the standard deviation as a measure of spread. 607 00:59:36,102 --> 00:59:43,101 'Cause what does the standard deviation do? It measures the spread of your observations from, the mean. 608 00:59:43,101 --> 00:59:51,134 The standard deviation takes every value of your data set, and looks at how far away every observation is from the mean. 609 00:59:51,334 --> 00:59:56,050 We're going to interpret the standard deviation in the following way. 610 00:59:56,281 --> 01:00:03,983 We want to interpret it as being roughly the average distance, that your values are ... from the mean. 611 01:00:03,983 --> 01:00:08,543 It's that typical distance, that average distance, of the values from the mean. 612 01:00:08,712 --> 01:00:09,292 Now it's not exactly the average.
613 01:00:09,615 --> 01:00:13,146 If you were going to calculate the average, you'd have a little different formula than what we're gonna write up here. 614 01:00:13,146 --> 01:00:19,175 But it does take distances from the mean, and we want to interpret it or view it as being roughly that average distance, 615 01:00:19,175 --> 01:00:23,486 so it can be kind of a yardstick for us ... to see how far away we are from the mean. 616 01:00:23,486 --> 01:00:25,605 So here's how it's computed. Let's work that out. 617 01:00:25,636 --> 01:00:29,734 We're gonna take every observation and look at how far away it is from the mean. 618 01:00:29,734 --> 01:00:32,379 So the first observation would be X1 619 01:00:32,379 --> 01:00:41,044 and what was the mean of a sample again? We represent that by X - bar, that represents the sample mean. The mean of your set of data. 620 01:00:41,044 --> 01:00:46,825 So there's the first distance, and I'm gonna calculate the next distance, my next observation from the mean. 621 01:00:47,147 --> 01:00:50,333 And I'm gonna do that for all of the observations. 622 01:00:51,732 --> 01:00:57,941 Every single value's used in calculating the standard deviation. Just as it is for calculating the mean. 623 01:00:58,110 --> 01:01:03,471 And if I were to average these, I would be starting to add them up then, right? 624 01:01:03,471 --> 01:01:10,967 Average these distances. And if you did that calculation right there, that would always come out to be zero. 625 01:01:10,967 --> 01:01:16,108 'Cause some of those distances are positive, some values were above the mean and some were below the mean. 626 01:01:16,108 --> 01:01:22,878 And so every negative and positive ends up, when you work them all out as a total, to add up to zero. That's a property of the mean. 627 01:01:22,878 --> 01:01:25,840 The total of the distances of every value from the mean is zero. 628 01:01:25,840 --> 01:01:28,593 So you can't really just average that 'cause you always get zero. 
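That always-comes-out-to-zero property is easy to check yourself; a tiny sketch with made-up numbers:

```python
# A property of the mean: the deviations of the values from their mean
# always total zero. (Made-up numbers, just for illustration.)
data = [2, 4, 4, 6, 9]
mean = sum(data) / len(data)            # 5.0

deviations = [x - mean for x in data]   # some negative, some positive
total = sum(deviations)                 # 0.0: positives and negatives cancel
print(deviations, total)
```

Whatever numbers you put in `data`, the total of the deviations cancels to zero, which is why averaging the raw deviations is useless as a measure of spread.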
629 01:01:28,593 --> 01:01:35,214 So one way of getting rid of the negatives versus positives is to take ... absolute value, and that would work 630 01:01:35,214 --> 01:01:41,061 that would be called "the mean absolute deviation", which is a measure of spread, 631 01:01:41,061 --> 01:01:47,366 which really is more of that average distance idea, but it doesn't work very well mathematically. 632 01:01:47,366 --> 01:01:53,593 Trying to integrate or show properties of this kind of measurement here, this statistic, when there's absolute values to work with ... not so fun. 633 01:01:53,593 --> 01:01:57,985 So another way to get rid of it is to ... square everything. 634 01:01:57,985 --> 01:02:06,104 Then a value that's 2 below the mean contributes 4, just as a value that sits 2 above the mean contributes the same squared distance of 4. 635 01:02:06,134 --> 01:02:08,553 But it's in squared units now. 636 01:02:08,553 --> 01:02:13,903 So if I were to use squares, I would average those squares by doing what? 637 01:02:13,903 --> 01:02:22,391 Add all those squared distances up and divide by ... 'n'? No, we're gonna divide by n minus 1. 638 01:02:24,021 --> 01:02:29,124 n minus one is sometimes gonna be called, later for us, the degrees of freedom. 639 01:02:29,862 --> 01:02:34,049 Those quantities that were in the top, the numerator there, 640 01:02:34,065 --> 01:02:38,787 before we put the squares there, we said they added up to zero. 641 01:02:38,787 --> 01:02:46,076 So if you have 10 numbers and you know that the distances add up to zero, 9 of those numbers could be anything, 642 01:02:46,352 --> 01:02:54,305 giving any distance at all; that last one though has to be fixed, because it has to make that total add up to zero. 643 01:02:54,305 --> 01:02:58,412 So there are 10 observations but only 9 that are free to vary, and be whatever they want to be.
644 01:02:58,412 --> 01:03:03,383 That last one's fixed by this constraint, because we're looking at that distance from the mean. 645 01:03:03,383 --> 01:03:10,348 It also turns out, if you do an "n-1" on the bottom ... you're gonna get a better estimate out of your data 646 01:03:10,348 --> 01:03:18,479 for what the true standard deviation would be for the population, than if you didn't correct for that loss of one degree of freedom. 647 01:03:18,479 --> 01:03:20,672 So later it's gonna be called 648 01:03:20,672 --> 01:03:23,789 degrees of freedom 649 01:03:24,650 --> 01:03:30,119 when we do some inference using this as our measure of spread. 650 01:03:30,119 --> 01:03:36,588 Right now, we're looking at roughly the average of the squared deviations of the values from the mean, 651 01:03:36,588 --> 01:03:41,973 and I don't want to have a measurement in squared units when it comes out, so we take the square root of that whole thing. 652 01:03:43,557 --> 01:03:51,587 And you've got the standard deviation ... represented by that little letter 's' for our sample standard deviation. 653 01:03:51,587 --> 01:03:56,047 's' goes along with X - bar... those are the symbols for a sample standard deviation and a sample mean. 654 01:03:56,047 --> 01:04:01,416 The shorthand notation that's on your yellow card, which has your second formula there (I think the mean is listed there too), 655 01:04:01,416 --> 01:04:12,423 would be that you take the X minus X - bars, square them, and sum them, dividing that total or sum by that 'n-1'. 656 01:04:12,423 --> 01:04:14,788 So that's just that more compact notation. 657 01:04:14,788 --> 01:04:17,688 You likely already have an 's' button on your calculator too. 658 01:04:17,688 --> 01:04:23,384 Or when you have it do some basic summaries it will give you that standard deviation, which is nice. 659 01:04:23,384 --> 01:04:29,471 What we calculated before we took the square root ... is called s² (s squared), that's your variance.
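The formula built up above translates almost word for word into code; a minimal sketch (the function name and the data are my own, not from the lecture):

```python
import math

def sample_std(data):
    """Sample standard deviation: sqrt(sum of squared deviations / (n - 1))."""
    n = len(data)
    xbar = sum(data) / n                            # the sample mean, x-bar
    squared_devs = [(x - xbar) ** 2 for x in data]  # each distance, squared
    variance = sum(squared_devs) / (n - 1)          # divide by n - 1, not n
    return math.sqrt(variance)                      # back to original units

s = sample_std([2, 4, 4, 6, 9])
print(s)
```

Note the order of steps matches the blackboard derivation: deviations, squares, sum, divide by n - 1 (the degrees of freedom), then the square root to undo the squared units.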
660 01:04:29,471 --> 01:04:35,939 Variance is not in original units. Variance is in squared units, so we wanna bring it back to the original units to be my yardstick. 661 01:04:35,939 --> 01:04:40,259 Here's the average, and my give or take, my standard deviation, each way. 662 01:04:40,259 --> 01:04:43,090 So that's how it's computed. Let's try it out once. 663 01:04:43,090 --> 01:04:46,233 There's an example at the bottom that you can look at on your own. 664 01:04:46,233 --> 01:04:53,879 We're gonna calculate it once but mostly focus on the interpretations that we have in the middle of that next page. Page 19. 665 01:04:58,370 --> 01:05:02,504 This is probably the only time you're gonna calculate the standard deviation by hand, working it out. 666 01:05:02,504 --> 01:05:06,536 The rest of the time I want you to use the calculator, or I provide it on the exam for you to interpret. 667 01:05:07,336 --> 01:05:13,787 These are our observations, the average was 73.6, that was our X - bar. 668 01:05:13,787 --> 01:05:18,069 The standard deviation, or 's', for our sample. 669 01:05:18,423 --> 01:05:21,870 We're gonna need a very large square root. 670 01:05:21,870 --> 01:05:33,509 We're gonna have to start calculating distances of each of the values from their mean, so the first one's a 77 ... minus the 73.6 ... but I do have to square that. 671 01:05:34,924 --> 01:05:42,760 We're kind of cheating a little bit here, and putting in just the first and the last observation... 672 01:05:42,944 --> 01:05:46,190 to show what it would be that you would have to work out and do. 673 01:05:46,712 --> 01:05:53,818 With only a very basic calculator this would take a bit of time, especially with a mean of 73.6, in decimal form like that, 674 01:05:53,818 --> 01:05:56,759 but I would do all of those squared terms and add them up. 675 01:05:56,759 --> 01:05:58,976 So, an observation that's further away from the mean contributes more.
676 01:05:58,976 --> 01:06:05,912 Leading to a higher spread or more variation than if you had a lot of observations that are close to the mean. 677 01:06:05,912 --> 01:06:12,440 What will we divide by here? ...12 is our number of observations, so 12 minus ...1 678 01:06:12,440 --> 01:06:18,357 Under the square root you would get a thirty...five point four 679 01:06:20,495 --> 01:06:23,594 35.4 has what name again? 680 01:06:24,701 --> 01:06:28,796 It's called the variance. 681 01:06:28,796 --> 01:06:30,880 So sometimes you'll see reports that show the variance and the standard deviation. 682 01:06:30,880 --> 01:06:37,989 Standard deviation's preferred because we bring it back to original units, and that would be about 5.9 683 01:06:37,989 --> 01:06:44,438 and so we can say the standard deviation's 5.9 grams versus a variance of 35.4 grams squared. 684 01:06:44,469 --> 01:06:47,099 Hard to think in squared units. 685 01:06:47,099 --> 01:06:53,168 So there's our standard deviation. If you looked at the histogram, the values did vary from the mean, 686 01:06:53,168 --> 01:07:00,119 some of them are more than 5.9 away, some of them are closer ... than 5.9 away. 687 01:07:00,119 --> 01:07:04,406 But the average distance can be thought of as being ABOUT 5.9 grams. 688 01:07:04,406 --> 01:07:10,225 Probably the more important thing is to know what this number tells you in terms of an interpretation or viewing it, 689 01:07:10,225 --> 01:07:18,090 and then also when you have certain shapes of distributions it's very useful as a yardstick, as we will see ...not today, on Thursday. 690 01:07:18,090 --> 01:07:22,873 So interpretation's coming first. This is gonna be an important page. 691 01:07:22,873 --> 01:07:28,555 We're definitely gonna have on real homework number 1, somewhere, interpreting a standard deviation. 692 01:07:28,555 --> 01:07:32,998 I think in one of your modules, it might even be in module one, there's some good examples to look at.
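The variance-versus-standard-deviation relationship from this worked example can be checked with Python's statistics module. The twelve weights below are made up (the transcript only shows the first observation, 77, and the mean, 73.6), so the numbers that come out are illustrative, not the lecture's 35.4 and 5.9:

```python
import math
import statistics

# Hypothetical weights in grams; NOT the lecture's actual 12 fry weights,
# just 12 invented values to illustrate the relationship.
weights = [77, 68, 74, 80, 70, 73, 79, 66, 75, 72, 78, 71]

s2 = statistics.variance(weights)   # sample variance, in grams SQUARED
s = statistics.stdev(weights)       # sample standard deviation, back in grams

print(round(s2, 1), round(s, 1))    # s is the square root of s2
```

Whatever data you feed in, `stdev` is always the square root of `variance`, mirroring the "take the square root to get back to original units" step on the board.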
693 01:07:32,998 --> 01:07:38,044 If you didn't get to them in lab, there are wrong interpretations and right ones, and practice in picking out the right ones. 694 01:07:38,044 --> 01:07:45,121 The weights of our small orders of french fries. They weren't all the same ... they did vary. 695 01:07:45,121 --> 01:07:51,201 They vary, they are about how far away from the mean? Roughly how far on average? 696 01:07:51,201 --> 01:08:01,608 So these weights of small orders of french fries are roughly about ... 5.9 grams away from their mean, which happened to be this 73.6. 697 01:08:02,162 --> 01:08:08,960 And then we've got that important clarifier there ... "on average". 698 01:08:08,960 --> 01:08:13,137 I'm not saying that every order of french fries has a weight that is exactly 5.9 grams away from the mean. 699 01:08:13,137 --> 01:08:20,506 There were some that were further away, and some that were closer; and if you look at the average of those distances, it would be ABOUT 5.9. 700 01:08:20,506 --> 01:08:24,963 So some key parts here are that we are clarifying it, 'cause we didn't calculate the absolute values, 701 01:08:24,963 --> 01:08:28,306 we did square things and then took the square root. 702 01:08:28,306 --> 01:08:33,616 We are talking about an average distance... and I give you the frame of reference from which you're looking at 703 01:08:33,616 --> 01:08:36,760 the values from what? from their mean. 704 01:08:36,760 --> 01:08:40,134 Another way to write it correctly is to start out with the "On average", so you ... 705 01:08:40,134 --> 01:08:46,542 "On average, these weights did vary ... by ABOUT how much... from their mean?" 706 01:08:46,542 --> 01:08:55,165 by ABOUT 5.9 grams ... from their mean, which happens to be known here as 73.6. 707 01:08:56,134 --> 01:09:02,070 So there's a couple examples of where you have all the right parts ... for an interpretation. 708 01:09:02,070 --> 01:09:08,662 A standard deviation's thought of as being roughly an average distance.
The average distance that the values vary from ... 709 01:09:08,662 --> 01:09:11,449 what frame of reference from? The mean. 710 01:09:12,265 --> 01:09:18,693 A data set that had a standard deviation of 5.9 compared to a data set that had a standard deviation of 59. 711 01:09:18,693 --> 01:09:23,123 Much more spread from the mean on average, compared to the other one. 712 01:09:23,123 --> 01:09:26,265 Alright, standard deviation. Couple of notes. 713 01:09:26,265 --> 01:09:29,655 Well, how would we get a standard deviation of 0? 714 01:09:29,655 --> 01:09:32,147 What would it mean? 715 01:09:32,147 --> 01:09:39,090 There's no variation, no spread. Has to occur when all the values are ... the same. 716 01:09:39,090 --> 01:09:46,032 So 's' could be zero, that's the smallest it could be. It represents that there is no spread at all, no variation, 717 01:09:46,032 --> 01:09:51,108 and it would occur of course when all the observations ...are the same. 718 01:09:53,001 --> 01:10:00,223 Otherwise, what kind of values do you get for an 's', a standard deviation? It'd have to be positive. 719 01:10:00,223 --> 01:10:03,478 You can't get an 's' of -2.8. It would be marked really wrong. 720 01:10:03,894 --> 01:10:12,550 The larger the value, the more spread. The closer to zero, the more consistent the values are, and close to the mean on average. 721 01:10:12,550 --> 01:10:20,973 Now, the mean, the measure of center that we talked about, was somewhat sensitive to those extreme observations; 722 01:10:20,973 --> 01:10:27,930 an extreme observation would pull the mean quite a bit down or up; and the standard deviation looks at every value from that mean, 723 01:10:27,930 --> 01:10:34,164 so the standard deviation is also ... SENSITIVE to extreme observations. 724 01:10:39,471 --> 01:10:48,464 So when do we prefer to use the mean and the standard deviation, to go along with a graph, to show or summarize a quantitative variable?
725 01:10:48,464 --> 01:10:54,562 We do that when our distribution looks to be reasonably symmetric, and bell-shaped. 726 01:10:58,545 --> 01:11:06,125 If our distribution overall looks reasonably symmetric, unimodal, bell-shaped ... it's even more ideal. 727 01:11:06,125 --> 01:11:10,648 Then the standard deviation and a mean work very well, in fact we're gonna see in the beginning of next class 728 01:11:10,648 --> 01:11:15,629 this empirical rule that allows us to really see how it works with a bell-shaped model. 729 01:11:15,629 --> 01:11:20,662 The 5 number summary ... uses a median, uses the interquartile range to measure spread, 730 01:11:20,662 --> 01:11:25,571 which are more resistant and better to use when you have what kind of distributions? 731 01:11:26,094 --> 01:11:32,662 Strongly skewed distributions, not just slightly skewed (slightly is okay for a mean and standard deviation), but 732 01:11:32,662 --> 01:11:36,589 strongly skewed distributions, OR if you have outliers. 733 01:11:41,711 --> 01:11:47,922 Strongly skewed distributions, or if there are outliers, 'cause they will affect the mean and the standard deviation. 734 01:11:47,922 --> 01:11:51,435 The last bullet ... and then a picture of the day. 735 01:11:51,435 --> 01:11:56,218 The last bullet is: your calculator will often have lots of these numbers that can be summarized for you, 736 01:11:56,218 --> 01:11:59,407 and calculated for you, but sometimes it even gives you more. 737 01:12:00,637 --> 01:12:04,098 Do you remember the notation for the sample mean? 738 01:12:04,098 --> 01:12:14,470 X - bar, that's the mean of a sample. That's called a statistic. And 'mu' ... is the notation for a population mean. 739 01:12:14,977 --> 01:12:23,820 For standard deviation we have a parallel, of sigma versus 's'.
Most of the time, I know on TI calculators 740 01:12:23,820 --> 01:12:27,997 when you get your summary measures, sometimes on regular calculators, the little buttons, 741 01:12:27,997 --> 01:12:35,605 they have a sigma button or an 's' button. The sigma is if you had a population of values and you had all those numbers 742 01:12:35,605 --> 01:12:40,754 put into your calculator, then you want the population standard deviation, which is computed a little differently 743 01:12:40,754 --> 01:12:48,779 than the sample one ... and you need to use ... sigma. But most of the time the data that we have is from a sample. 744 01:12:48,779 --> 01:12:53,806 Most of the time, we're computing sample statistics, and NOT true population values, 745 01:12:53,806 --> 01:12:59,987 so we're gonna often pull off the 's' and not the sigma when that comes out from your calculator, so you know that difference. 746 01:13:00,202 --> 01:13:03,593 And then you get to see pictures of the day before you leave. 747 01:13:03,593 --> 01:13:08,602 I have two dogs! First Lily ... and there's Molly. 748 01:13:08,602 --> 01:13:13,749 Lily is a 16 year old Beagle mix that can't hear, can't see, but she's still around. 749 01:13:13,749 --> 01:13:17,566 (inaudible talking in background) Molly is a lot of fun! She's kind of a cat-dog. (laugh in background) 750 01:13:17,566 --> 01:13:21,566 She keeps us (inaudible) busy, with walks... I hope you have a good day. We'll see you on Thursday!
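(A closing sketch of the sigma-versus-s point from a few minutes back: in code the same choice shows up as "divide by n" versus "divide by n - 1". Python's statistics module exposes both; the values below are made up.)

```python
import statistics

data = [2, 4, 4, 6, 9]           # made-up values, just to compare the two

s = statistics.stdev(data)       # the 's' button: sample std, divides by n - 1
sigma = statistics.pstdev(data)  # the sigma button: population std, divides by n

print(s, sigma)                  # s comes out a bit larger than sigma
```

Since dividing by the smaller number n - 1 inflates the result slightly, the sample value s is always a bit larger than sigma for the same data, which is the correction for that lost degree of freedom.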