-
Okay, let's go back to basics.
-
Let's look at sampling distributions.
-
We had something
called "population."
-
Like, let's take some
huge population,
-
say, all high school students
in the United States.
-
That's a huge population.
-
I don't know
how many there are,
-
and we want to study
something like weight,
-
obesity among students, and
we need their weights.
-
That's huge data.
-
What we usually do
when we have huge data--
-
or it might be the case that we
do not have access to data--
-
[unclear] for example, you
want to study something about...
-
...some rabbits, for example,
in the state of Minnesota.
-
Well, you cannot just gather all the rabbits
and maybe weigh them, for example.
-
So, what you do is you sample.
This one makes more sense
-
to just sample some rabbits at random from maybe
different places in the state and weigh them.
-
So, that's when you really don't
have access to the whole population.
-
The other case you might have
access to the whole population
-
of all high school students in the
country, but it's not feasible.
-
So, in that case, too, we choose
some samples from every state--
-
there are different
ways to sample them--
-
you can say based on population of each
state we sample according to that population.
-
So, a more populous state like
California you sample more students
-
from California than, say,
I don't know...Minnesota.
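
[Editor's sketch] Sampling "according to population" can be read as proportional allocation: each state's share of the sample matches its share of the population. The enrollment figures and the function name `proportional_allocation` below are hypothetical, chosen only to illustrate the idea; this is one possible scheme, not necessarily the one the lecturer has in mind.

```python
# Hypothetical enrollment counts, invented only for illustration (not real data).
enrollment = {"California": 1_900_000, "Texas": 1_600_000, "Minnesota": 280_000}


def proportional_allocation(strata_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size."""
    population = sum(strata_sizes.values())
    # Rounding means the parts may not add up exactly to total_sample.
    return {
        name: round(total_sample * size / population)
        for name, size in strata_sizes.items()
    }


allocation = proportional_allocation(enrollment, 1000)
print(allocation)
```

With these made-up counts, the more populous state receives the larger share of the 1000 sampled students.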
-
So, we have a population and we're
studying something about this population.
-
Usually we are talking about
the normal distribution,
-
although this applies
to any distribution.
-
So, this population
has some distribution.
-
It has some parameters.
-
Now, parameters for the
distribution of the population are--
-
usually what we study: mu is mean,
and sigma squared which is variance,
-
and sigma which is
standard deviation.
-
Well, there might be other things that
we study, depending on the situation,
-
if you're working for an insurance
company you might even consider other--
-
these are some kind of "moments,"
we call them "moments"--
-
so you can choose other moments.
You might need more than just these two.
-
So...
-
For the moment we
just need these two,
-
in fact this one [sigma/standard deviation]
and this one [mu/mean].
-
Now, that's for the population.
-
But, I said it's not feasible or it's not
possible to look at the whole population.
-
So, what we do is we sample.
-
And that's where statistics really
comes into the picture.
-
Statistics basically starts here.
-
Otherwise, if we know the population,
we know everybody and the weight
-
of every student in this country,
then that's it, we have data. So what?
-
Statistics comes in when it is not possible (for
whatever reason) to study the whole population.
-
In fact, when we do data science,
that's another story.
-
In data science, say, we
usually have the population.
-
Well, sometimes we would sample,
but we usually have the population.
-
And most often what
we want to do is predict.
-
So, we have to--
-
you want to predict something about
the population in the future.
-
But, anyways, this is not data science,
this is statistics, although they are related.
-
We have a sample here and this sample also has
some kind of mean and some kind of variance.
-
This mean and variance we denote
them differently. Remember? Sample--
-
This is population, and for sample
we have x̅ [x-bar] for mean,
-
and remember we have s
for standard deviation.
-
Okay, s for standard deviation
and this is for the sample.
-
Oh, it's getting so dark,
let me turn a light on.
-
Now...
-
First of all, you might say--
you might intuitively [unclear]--
-
You might feel like having a larger
sample, so you might say,
-
"A larger sample size will give me
a better idea of the population."
-
Where does that feeling
or intuition come from?
-
In fact, it's true. It's true that a larger
sample will give a much better idea.
-
But, there's something here, this
mean and this standard deviation,
-
so the larger sample will give me a mean and
a standard deviation. But, what do we want?
-
We want this [sample] mean and
this [sample] standard deviation
-
to be close to this [population] mean and
this [population] standard deviation.
-
In other words, you want these two [from the
sample] to estimate [the population] for me.
-
Right? So, you want these
two [from the sample]
-
to estimate this [population] mean
and this standard deviation.
-
And that's where the
word "estimators" comes in.
-
So, this x-bar, which is
the mean of the sample,
-
and s, which is the standard
deviation of the sample,
-
these two are estimators for
mu and sigma. In fact, in--
-
Well, the best-- I mean, x-bar is
an unbiased estimator for mu,
-
and s squared is an unbiased
estimator for sigma squared.
-
So, this estimates this,
and this estimates this.
-
Not sigma: s is not an unbiased estimator
of sigma, that's why we always write variance.
-
This guy x̅ estimates this μ, and
this guy s^2 estimates this σ^2.
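
[Editor's sketch] A small simulation can make the point about variance versus standard deviation concrete. It assumes a normal population with mean 60 and standard deviation 10 (numbers invented for illustration), draws many samples of size 5, and averages x-bar, s squared, and s over the repetitions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

MU, SIGMA = 60.0, 10.0   # assumed population mean and standard deviation
N = 5                    # a small sample size makes the bias of s visible
REPS = 10_000            # number of repeated samples

xbars, s2s, ss = [], [], []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbars.append(statistics.mean(sample))
    s2 = statistics.variance(sample)   # sample variance, divides by n - 1
    s2s.append(s2)
    ss.append(s2 ** 0.5)               # sample standard deviation

avg_xbar = statistics.mean(xbars)   # should land near MU
avg_s2 = statistics.mean(s2s)       # should land near SIGMA ** 2
avg_s = statistics.mean(ss)         # tends to land below SIGMA
print(avg_xbar, avg_s2, avg_s)
```

On average x-bar and s squared sit close to mu and sigma squared, while the average of s falls somewhat short of sigma, which is why the unbiasedness statement is made for the variance.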
-
And they're unbiased. Now, what "unbiased" is,
I'm not going to discuss in more detail, but it's...
-
Well...at some point I will
touch that, but anyways.
-
But, the word "unbiased" itself, should
give you some idea of what they really are.
-
So...
-
Okay, now...
-
This...
-
So, really there's something
that's coming into the picture.
-
We have the sample, and this
sample has some distribution.
-
So, the sample has
some distribution.
-
Now, what do we assume
for that distribution?
-
Well, most often what we do
is we graph the histogram
-
and by looking at the histogram
we guess, and we say,
-
"Okay, the histogram is telling me that
this distribution looks like normal."
-
Or, "This distribution looks like an
exponential distribution," so on and so forth.
-
The first thing we usually do is we graph
the histogram and look at the picture
-
and try to just guess
what the distribution is.
-
Well, anyway, the sample
comes with some numbers,
-
so if it is weight of students or age of students,
something [like that], you have some numbers.
-
Those numbers give you x-bar and s,
so you have at least 2 numbers here.
-
And these two numbers
definitely help you to--
-
not the distribution, but most often
you guess the distribution, but anyway--
-
to write down the distribution more clearly
by including these two in the distribution.
-
Okay, so...
-
The first thing we do is a histogram.
-
So, we have a sample
and we draw a histogram.
-
So, given a sample
-
graph the histogram first.
-
So, the histogram should give you an idea
of what the distribution should look like.
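
[Editor's sketch] As a rough illustration of "graph the histogram first," the snippet below bins a simulated sample and prints a text histogram. The sample is generated from a normal distribution purely as an assumption; with real data you would replace `sample` with your own numbers, and most people would use a plotting program instead of text bars.

```python
import random
from collections import Counter

random.seed(1)
# Simulated sample of 200 weights; the normal shape is an assumption.
sample = [random.gauss(60, 10) for _ in range(200)]

BIN_WIDTH = 5
# Map each value to the left edge of its bin, e.g. 57.3 -> 55.
counts = Counter(int(x // BIN_WIDTH) * BIN_WIDTH for x in sample)

# One row per bin: the bin range followed by one '#' per observation.
for left_edge in sorted(counts):
    bar = "#" * counts[left_edge]
    print(f"{left_edge:>4} to {left_edge + BIN_WIDTH:<4} {bar}")
```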
-
When you graph the histogram this means that
you are using some computer program,
-
whatever program you
are using, you'll find:
-
x-bar and s.
What is x-bar and what is s?
-
X-bar is [uppercase] sigma
[Σ meaning sum] of all x's
-
divided by n. And s squared
is [uppercase] sigma Σ,
-
x minus x-bar squared
over n minus 1.
-
That is s squared.
Now, what is n?
-
N is the sample size. Let me
write here: "n = sample size."
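
[Editor's sketch] The two formulas just read out, x-bar as the sum of the x's over n and s squared as the sum of squared deviations over n minus 1, can be checked on a few numbers. The weights below are made up for illustration; the standard-library `statistics` module is used only as a cross-check.

```python
import statistics

weights = [52.0, 61.5, 58.2, 70.1, 49.8]  # made-up weights, for illustration only

n = len(weights)                                        # sample size
xbar = sum(weights) / n                                 # x-bar = (sum of x) / n
s2 = sum((x - xbar) ** 2 for x in weights) / (n - 1)    # s^2 divides by n - 1
s = s2 ** 0.5                                           # sample standard deviation

# Cross-check: statistics.variance also divides by n - 1.
print(xbar, statistics.mean(weights))
print(s2, statistics.variance(weights))
print(s, statistics.stdev(weights))
```

With a single observation, `statistics.variance` raises `StatisticsError`, which matches the n minus 1 in the denominator.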
-
And these two formulas, in fact--
n is sample size-- these two formulas--
-
First of all, it means that your
sample size must be more than 1.
-
[He points to the denominator: with n = 1,
n minus 1 equals 0 and s squared is undefined.]
-
Definitely, can you say anything just
by having 1 [item in a] sample?
-
So, this one and this one give me two numbers,
and these two numbers theoretically--
-
by theory we know-- that these are unbiased
estimators for mu and [lowercase] sigma squared.
-
Now, as I said, and I'll say it again, this chapter,
in fact, is about the normal distribution.
-
So, what we usually do is we just say, "Okay, well
let's assume that the population is normal."
-
Although there are some theorems which have
nothing to do with normal distributions,
-
but they work for every distribution.
But, in order to work with a sample
-
and make some predictions
and other things,
-
we usually assume that the distribution of
the population is normal, and also the sample.
-
So, although there are some problems
with this assumption, but...
-
Well, we have to assume sometimes.
-
Unless we have a large sample,
if we have a large enough sample,
-
then by looking at the histogram we might say,
"I think this is not a normal distribution,
-
this is like an exponential distribution,
or like 'blah-blah' distribution."
-
That idea needs more experience and
knowledge of statistics and probability.
-
So, these two guys
are good estimators.
-
Now...
-
Let me erase...
what should I erase?
-
[unclear]
-
So, it's saying something like this.
When we study population--
-
So, this is the population...
-
You have to listen carefully
to what I'm saying.
-
So, we want to
study a population.
-
Say I'm a stats professor and I choose
some students from my class,
-
n students, say 10 students,
-
and send them to different places in the country
to measure the weight of some students.
-
So, to weigh some students,
high school students.
-
Today I sent them and they go around
the country in the morning and weigh,
-
and bring me 10 numbers.
So, today, day 1...
-
they bring me ten numbers.
[counting]
-
So, they give me ten numbers.
-
These ten numbers, you
can find using that formula,
-
you can find for these ten numbers
both x-bar and s squared for day 1.
-
So for day 1 we have
x-bar and s squared.
-
And, again, day 2,
-
I send students again to give me
another x-bar and another s squared.
-
Day 3, another one.
-
Day-- well, how many days do
you think? Let's say 100 days.
-
Day 100, so I have x-bar [subscript] 100,
and s squared [subscript] 100.
-
Say I send them 100 days, and I have
no money to do that. [chuckles]
-
That's why I'm saying
"if it's feasible."
-
Anyway, so I have some x-bars
here and some s squares.
-
And samples themselves are kind
of random, the whole set of ten--
-
How many students did I send? Ten.
-
So, this sample has ten students from
high schools around the country.
-
This sample [day 2] has ten students,
but they do not have to be the same,
-
they can be at random, so it's unlikely
to have two samples exactly the same.
-
Right? So, just imagine. You choose some
students from the high schools in the country.
-
The next day you choose other students,
but since you are doing it at random,
-
some of these students might be the same.
But, just imagine out of-- I don't know,
-
like out of a hundred million students--
I don't know, let's say ten million--
-
out of ten million students you are choosing
ten in 2 days and they are the same?
-
It's really unlikely to have the same.
-
In fact, we can find the probability of being
the same, but that's not the point here.
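
[Editor's sketch] For the curious, that probability can be computed under some assumptions that are the editor's, not the lecturer's: ten students are drawn without replacement from ten million, every group of ten is equally likely, and the two days are independent.

```python
import math

POPULATION = 10_000_000   # the "ten million" mentioned in the lecture
SAMPLE_SIZE = 10

# Number of distinct groups of 10 that can be drawn without replacement.
n_samples = math.comb(POPULATION, SAMPLE_SIZE)

# If every group is equally likely, day 2 repeats day 1 with this probability.
p_same = 1 / n_samples
print(p_same)
```

The result is astronomically small, in line with "really unlikely."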
-
And then day 3, another ten students,
up to day 100, ten students.
-
The point is, this sample
itself is random.
-
Not only is each student chosen at random,
the sample as a whole is also random.
-
Which makes these x-bars
and s squared's random.
-
Right? This makes them random.
-
So, being random, it means that we can
talk about some kind of random variable.
-
X-bar, for example, or s squared.
-
And this x-bar has values: x-bar [subscript] 1,
x-bar [subscript] 2, up to x-bar [subscript] 100.
-
This one [s squared] takes values from
-
s^2 [subscript] 1, s^2 [subscript] 2,
up to s^2 [subscript] 100.
-
Since they are random variables,
-
with these as values taken by these two
random variables, they have a distribution.
-
So, the distribution of these two, that's what
this thing is about: sampling distribution.
-
And that's what we usually graph.
That's what we graph. And...
-
So, we have this x-bar
and s squared for ten--
-
Well, ten's not enough, but say
I send 200 students, okay?
-
And then we graph it-- a histogram for
x-bar, and a histogram for s squared--
-
those two histograms will give us
an idea of what the distributions are.
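
[Editor's sketch] The 100-day experiment can be imitated on a computer. The population below is simulated (100,000 weights from a normal distribution with mean 60 and standard deviation 10, all assumed for illustration); each "day" draws 10 students at random and records that day's x-bar and s squared.

```python
import random
import statistics

random.seed(2)

# Stand-in population; its normal shape and its numbers are assumptions.
population = [random.gauss(60, 10) for _ in range(100_000)]

DAYS = 100   # one sample per day
N = 10       # students per sample

xbars, s2s = [], []
for _ in range(DAYS):
    sample = random.sample(population, N)     # 10 students chosen at random
    xbars.append(statistics.mean(sample))     # x-bar for that day
    s2s.append(statistics.variance(sample))   # s^2 for that day

# The 100 values in each list are the data for the two histograms.
print(min(xbars), max(xbars))
print(min(s2s), max(s2s))
```

Plotting a histogram of `xbars` and another of `s2s` gives a picture of the two sampling distributions.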
-
Now, next time I will discuss
the distributions of these...
-
...random variables and also some amazing
theorems about those distributions.
-
So, see you next time.