## Box Plots - Data Analysis with R


Showing Revision 3 created 05/24/2016 by Udacity Robot.

1. Let's use another type of visualization, called a box plot, that's helpful for seeing
2. the distribution of a variable. Now if
3. you're unfamiliar with a box plot, you can find resources in
4. the instructor notes, and there's also a link to [UNKNOWN] statistic
5. class so you can test your own knowledge. You may recall
6. earlier that we split friend count by gender in a pair
7. of histograms using facet_wrap. The code looked like this. Instead
8. of using these histograms we're going to generate box plots of friend
9. count by gender, so we can quickly see the differences between
10. the distributions. And in particular we're going to see the difference between
11. the medians of the two groups. And remember, again, that the
12. qplot function automatically generates histograms when we pass it a
13. single variable. So we need to add a parameter to tell
14. qplot that we want a different type of plot. To
15. do that, we're going to use the geom called boxplot. Now,
16. I'm going to use the same data set as before. So I'm going to
17. keep this in qplot. Now, what's different about box
18. plots is that the y axis is going to be
19. our friend count. The x axis, on the other hand,
20. is going to be our categorical variable for male and female, or
21. gender. Notice that we use the continuous variable, friend count,
22. as y, and the grouping, or categorical, variable as x.
23. This will always be true for your box plots. I
24. forgot a parenthesis here and then let me just reformat my
25. code so it looks a little bit cleaner. There we go.
26. Running this code, we can see that we get our two box
27. plots. Let's zoom in to get a closer look. The boxes here
28. and here cover the middle 50% of values, or what's called the
29. interquartile range. And I know these boxes are hard to see,
30. since we have so many outliers on this plot. Each of these
31. tiny little dots is an outlier in our data. We can also
32. see that the y axis is capturing all the friend counts from
33. zero all the way up to 5,000. So we're not
34. omitting any user data in this plot. And finally, this horizontal
35. line, which you may have noticed at first, is the
36. median for the two box plots. And you might be wondering
37. what makes an outlier an actual outlier. And well, we
38. usually consider outliers to be values that fall more than one and a
39. half times the IQR from the median. Since there are so
40. many outliers in these plots, let's adjust our code to focus
41. on just these two boxes. We'll have you do this in the next
42. programming exercise. See if you can alter our code to make that adjustment.
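
The on-screen code is not part of this transcript, so the following is only a minimal sketch of the kind of call being described: a continuous variable on y, a categorical variable on x, and the boxplot geom. The data frame `pf`, its columns `gender` and `friend_count`, and the simulated values are all assumptions made so the example runs on its own; the helper `outlier_fences` is hypothetical. Note that ggplot2 draws its whiskers out to roughly 1.5 times the IQR measured from the edges of the box (the quartiles), not from the median, and the helper follows that convention.

```r
library(ggplot2)

# Hypothetical stand-in for the lesson's data set. The real data and column
# names are not shown in the transcript, so everything here is an assumption.
set.seed(42)
pf <- data.frame(
  gender = rep(c("female", "male"), each = 500),
  friend_count = c(rexp(500, rate = 1 / 240), rexp(500, rate = 1 / 165))
)

# Continuous variable on y, categorical (grouping) variable on x.
# Roughly equivalent qplot form, which newer ggplot2 releases may flag as deprecated:
#   qplot(x = gender, y = friend_count, data = pf, geom = "boxplot")
p <- ggplot(pf, aes(x = gender, y = friend_count)) +
  geom_boxplot()
print(p)

# Approximate the 1.5 * IQR outlier rule by hand for one vector of values.
# geom_boxplot() uses hinges, which can differ slightly from these quartiles.
outlier_fences <- function(values) {
  q <- quantile(values, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# One pair of fences per gender; points beyond them would be drawn as dots.
fences <- lapply(split(pf$friend_count, pf$gender), outlier_fences)
print(fences)
```

With strongly right-skewed data like friend counts, many points land above the upper fence, which is why the boxes themselves look squashed in the full-range plot.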