
Title:
Non-Normal Data | Intro to Data Science

Description:

When performing a t-test, we assume that our data is normal. In the wild, you'll often encounter probability distributions that are distinctly not normal. They might look like this, or like this, or completely different. As you'd imagine, there are still statistical tests that we can use when our data is not normal. Why don't we briefly discuss what you might do in situations like this.

First off, we should have some machinery in place for determining whether or not our data is Gaussian in the first place. A crude, inaccurate way of determining whether our data is normal is simply to plot a histogram of the data and ask: does this look like a bell curve? In both of these cases, the answer would definitely be no. But we can do a little bit better than that. There are statistical tests we can use to measure the likelihood that a sample is drawn from a normally distributed population. One such test is the Shapiro-Wilk test.
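As an aside, the crude histogram check can be sketched like this. This uses numpy's `histogram` to tabulate the bin counts you would normally plot; the right-skewed exponential sample is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A clearly non-normal sample: exponential data is skewed right.
sample = rng.exponential(scale=1.0, size=1000)

# Tabulate the histogram you would plot and eyeball.
counts, bin_edges = np.histogram(sample, bins=20)

# A bell curve peaks somewhere in the middle of its range;
# this right-skewed sample piles up in the first bin instead.
print("bin counts:", counts)
print("tallest bin:", int(np.argmax(counts)))
```

Eyeballing the counts (or the plotted bars) is quick, but it is subjective, which is why a formal test is preferable.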

I don't want to go into great depth on the theory behind this test, but I do want to let you know that it's implemented in scipy. You can call it really easily, like this: `W, p = scipy.stats.shapiro(data)`, where `data` is just an array or list containing all of our data points. This function returns two values. The first, W, is the Shapiro-Wilk test statistic.

The second value in this tuple is our p-value, which should be interpreted in the same way that we would interpret the p-value for our t-test. That is: given the null hypothesis that this data is drawn from a normal distribution, what is the likelihood that we would observe a value of W at least as extreme as the one we see?
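Putting that together, a call to `scipy.stats.shapiro` might look like this. The two samples and the 0.05 significance threshold here are made up for illustration; one sample is genuinely normal and one is skewed, so the test should flag only the second:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)

# shapiro returns the test statistic W and the p-value.
w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)

print(f"normal sample: W={w_norm:.3f}, p={p_norm:.3f}")
print(f"skewed sample: W={w_skew:.3f}, p={p_skew:.3g}")

# A small p-value means we reject the null hypothesis that
# the data was drawn from a normal distribution.
alpha = 0.05
print("normal sample looks normal:", p_norm > alpha)
print("skewed sample looks normal:", p_skew > alpha)
```

Note the direction of the conclusion: a large p-value does not prove the data is normal, it only means the test found no strong evidence against normality.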