
Title:
Box Plots - Data Analysis with R

Description:

Let's use another type of visualization, called a box plot, that's helpful

for seeing the distribution of a variable. Now if

you're unfamiliar with a box plot, you can find resources in

the instructor notes, and there's also a link to [UNKNOWN] statistics

class so you can test your own knowledge. You may recall

earlier that we split friend count by gender in a pair

of histograms using facet_wrap. The code looked like this. Instead

of using these histograms we're going to generate box plots of friend

count by gender, so we can quickly see the differences between

the distributions. And in particular we're going to see the difference between

the medians of the two groups. And remember, again, that the

qplot function automatically generates histograms when we pass it a

single variable. So we need to add a parameter to tell

qplot that we want a different type of plot. To

do that, we're going to use the geom called boxplot. Now,

I'm going to use the same data set as before. So I'm going to

keep this in qplot. Now, what's different about box

plots is that the y axis is going to be

our friend count. The x axis, on the other hand,

is going to be our categorical variable for male and female, or

gender. Notice that we use the continuous variable, friend count,

as y, and the grouping, or categorical, variable as x.

This will always be true for your box plots. I

forgot a parenthesis here and then let me just reformat my

code so it looks a little bit cleaner. There we go.
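
The call described above can be sketched roughly as follows. This is a minimal sketch, not the exact code from the video: the data frame name `pf`, the column names `gender` and `friend_count`, the simulated values, and the step that drops rows with a missing gender are all assumptions made for illustration. Recent ggplot2 releases mark `qplot` as deprecated, so it may print a warning.

```r
library(ggplot2)

# Hypothetical stand-in for the course data set (simulated, not real data).
set.seed(42)
pf <- data.frame(
  gender = factor(sample(c("female", "male", NA), 1000, replace = TRUE)),
  friend_count = round(rexp(1000, rate = 1 / 200))
)

# Continuous variable on y, categorical (grouping) variable on x.
# geom = "boxplot" tells qplot not to fall back to its default plot type.
# Dropping rows with a missing gender is an assumption about the earlier code.
p <- qplot(
  x = gender,
  y = friend_count,
  data = subset(pf, !is.na(gender)),
  geom = "boxplot"
)
print(p)
```

The same plot can also be built with `ggplot(...) + geom_boxplot()` if you prefer to avoid the deprecated `qplot` helper.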

Running this code, we can see that we get our two box

plots. Let's zoom in to get a closer look. The boxes here

and here cover the middle 50% of values, or what's called the

interquartile range. And I know these boxes are hard to see,

since we have so many outliers on this plot. Each of these

tiny little dots is an outlier in our data. We can also

see that the y axis is capturing all the friend counts from

zero all the way up to 5,000. So we're not

omitting any user data in this plot. And finally, this horizontal

line, which you may not have noticed at first, is the

median for the two box plots. And you might be wondering

what makes an outlier an actual outlier. Well, we

usually consider outliers to be values more than one and a

half times the IQR beyond the edges of the box. Since there are so

many outliers in these plots, let's adjust our code to focus

on just these two boxes. We'll have you do this in the next

programming exercise. See if you can alter our code to make that adjustment.
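
To make the outlier rule discussed earlier concrete, here is a small base R sketch. The friend counts are made-up values for illustration, and the cutoffs follow the common convention of measuring one and a half times the IQR from the first and third quartiles (the edges of the box). The exact quartile method a plotting library uses may differ slightly.

```r
# Hypothetical friend counts for one group (made-up values for illustration).
friend_count <- c(0, 3, 10, 25, 40, 82, 150, 200, 310, 450, 2500, 4900)

# First quartile, median, and third quartile.
q <- quantile(friend_count, probs = c(0.25, 0.5, 0.75), names = FALSE)

# Interquartile range: the height of the box (middle 50% of values).
iqr <- q[3] - q[1]

# Values beyond these fences are drawn as individual outlier points.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

outliers <- friend_count[friend_count < lower_fence | friend_count > upper_fence]
print(outliers)  # the two largest values, 2500 and 4900, exceed the upper fence
```

With a long right tail like this one, only the very large counts are flagged, which matches the many dots stacked above the boxes in the plot.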