English subtitles

Box Plots - Data Analysis with R

Showing Revision 3 created 05/24/2016 by Udacity Robot.

  1. Let's use another type of visualization, called a box plot, that's helpful
  2. for seeing the distribution of a variable. Now if
  3. you're unfamiliar with a box plot, you can find resources in
  4. the instructor notes, and there's also a link to [UNKNOWN] statistic
  5. class so you can test your own knowledge. You may recall
  6. earlier that we split friend count by gender in a pair
  7. of histograms using facet_wrap. The code looked like this. Instead
  8. of using these histograms we're going to generate box plots of friend
  9. count by gender, so we can quickly see the differences between
  10. the distributions. And in particular we're going to see the difference between
  11. the medians of the two groups. And remember again that the
  12. qplot function automatically generates histograms when we pass it a
  13. single variable. So we need to add a parameter to tell
  14. qplot that we want a different type of plot. To
  15. do that, we're going to use the geom called boxplot. Now,
  16. I'm going to use the same data set as before. So I'm going to
  17. keep this in qplot. Now, what's different about box
  18. plots is that the y axis is going to be
  19. our friend count. The x axis, on the other hand,
  20. is going to be our categorical variable for male and female, or
  21. gender. Notice that we use the continuous variable, friend count,
  22. as y, and the grouping, or categorical, variable as x.
  23. This will always be true for your box plots. I
  24. forgot a parenthesis here and then let me just reformat my
  25. code so it looks a little bit cleaner. There we go.
  26. Running this code, we can see that we get our two box
  27. plots. Let's zoom in to get a closer look. The boxes here
  28. and here cover the middle 50% of values, or what's called the
  29. interquartile range. And I know these boxes are hard to see,
  30. since we have so many outliers on this plot. Each of these
  31. tiny little dots is an outlier in our data. We can also
  32. see that the y axis is capturing all the friend counts from
  33. zero all the way up to 5,000. So we're not
  34. omitting any user data in this plot. And finally, this horizontal
  35. line, which you may have noticed at first, is the
  36. median for the two box plots. And you might be wondering
  37. what makes an outlier an actual outlier. Well, we
  38. usually consider outliers to be just outside of one and a
  39. half times the IQR from the median. Since there are so
  40. many outliers in these plots, let's adjust our code to focus
  41. on just these two boxes. We'll have you do this in the next
  42. programming exercise. See if you can alter our code to make that adjustment.
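
To make the walkthrough concrete, here is a minimal sketch of the box plot call and of the numbers a box plot is built from. The data frame `pf` and the column names `friend_count` and `gender` are assumptions, and the values are made up for illustration; they are not the course data. The helper `box_summary` is hypothetical. It assumes the common Tukey convention, where the 1.5 times IQR distance is measured from the edges of the box (the first and third quartiles). The plot itself is drawn only if ggplot2 is installed, and recent ggplot2 releases may warn that qplot is deprecated.

```r
# Hypothetical stand-in for the course data: the names `pf`, `friend_count`,
# and `gender` are assumptions, and the values are made up for illustration.
pf <- data.frame(
  gender = factor(rep(c("female", "male"), each = 8)),
  friend_count = c(0, 12, 40, 85, 96, 150, 310, 4900,
                   1, 8, 25, 60, 74, 120, 240, 3600)
)

# Summarise one group the way a box plot does: the median, the height of the
# box (the IQR), and the points that fall outside the 1.5 * IQR fences.
# Assumption: fences are measured from the first and third quartiles.
box_summary <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  iqr <- q[3] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[3] + 1.5 * iqr
  list(median = q[2], iqr = iqr, outliers = x[x < lower | x > upper])
}

# One summary per gender, matching the two boxes in the plot.
summaries <- lapply(split(pf$friend_count, pf$gender), box_summary)
print(summaries)

# Categorical variable on x, continuous variable on y, geom set to boxplot.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- qplot(x = gender, y = friend_count, data = pf, geom = "boxplot")
  print(p)
}
```

With only a handful of rows the boxes are easy to read, but one very large friend count per group is enough to stretch the y axis, which is the same effect we saw in the full plot.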