English subtitles

Multiple Stages Using a Given Operator - Data Wrangling with MongoDB

Showing Revision 3 created 05/24/2016 by Udacity Robot.

  1. So I hope it's clear that the aggregation framework is
  2. designed to allow you to create a data processing pipeline, as
  3. we've seen several times. Now, you can include as many
  4. stages as you need in order to achieve a goal. For
  5. each stage, you just need to consider what input that
  6. stage needs to receive and what output it needs to produce.
  7. Now, what I want to talk about here, is the fact that
  8. many tasks require us to use more than one stage with
  9. the same operator. For example, we frequently need multiple
  10. group stages, in order to achieve our goal. So, let's
  11. look at another example. This time I really want you
  12. to consider each stage individually and just think about the
  13. inputs and outputs. We're going to do a tweak
  14. on our original question of user mentions. And this time
  15. we're going to look at who has mentioned the most
  16. unique users. So, in the same way that we looked
  17. at unique hashtags just a minute ago, now
  18. we're going to look at modifying this pipeline
  19. so that we're only counting mentions of users
  20. not already accounted for in our grouping. So
  21. here's the code for this. We have our same unwind stage as before, but the group
  22. stage is different, and you'll notice that there's
  23. another unwind stage here and another group stage here.
  24. In the first group stage, what we're doing is
  25. still aggregating on the user's screen name, but rather
  26. than simply summing up all of the documents that
  27. this group stage received in order to calculate all the
  28. user mentions (because, of course, unwind is going to generate
  29. one document for every user mention), what we're
  30. going to do here is use the addToSet operator
  31. that we looked at just a couple minutes ago. Now,
  32. what we're paying attention to here is the
  33. screen name for the user that was actually
  34. mentioned in the tweet. So we're accumulating an
  35. array here in this group stage of a unique
  36. set of users mentioned in tweets produced by
  37. this user, or by each user. Okay, but that
  38. doesn't get us what we want, because again
  39. remember our question is, who has mentioned the most
  40. unique users? Okay, so to this point, all we
  41. have is a list of unique users. We haven't counted
  42. them yet. Now in order to do that, what we're
  43. going to need to do is think about what's the
  44. output from this group stage? Well, the output is
  45. going to be exactly what we defined here as the
  46. structure of documents coming out of this. All documents will
  47. have an underscore id field, which will be the username
  48. that forms the basis for the grouping that
  49. that document represents. And there will be this
  50. mset field, our mentions set. Okay? So this
  51. stage here, will receive documents with an underscore
  52. id field and an mset field. So, what that means is that we can use unwind
  53. again here, and produce one document for every
  54. item in this array. And again remember this is
  55. going to be an array of unique elements because we used the addToSet operator to
  56. produce it. So unwind then will generate one document for every item found in
  57. the mset field for each document in its input. So this stage then
  58. will unwind this array of unique user mentions and pass
  59. those along to this second group stage. And it's this group stage where
  60. we end up calculating the count that we want.
  61. So instead of just counting all user mentions as we
  62. did before, what we're going to do here
  63. is count those unique mentions that get passed along
  64. to us here after unwinding the mset array. Okay?
  65. So the documents then to be passed along from this
  66. second group stage to sort are going to contain an
  67. id. This is essentially just a copying over of the
  68. id that we got as input to this stage,
  69. which is going to be the id that is produced
  70. here at this stage. And then we're simply counting
  71. how many documents did I receive that have that id?
  72. Or how many unique user mentions did I receive
  73. for this specific user? Okay? Then we're simply going to
  74. sort based on the count field, in descending order as
  75. we've been doing through most of our examples. And, finally,
  76. here I'm going to limit it to ten so we can actually see what the
  77. counts are for unique user mentions for the
  78. top ten tweeters. Okay. So, let's run this.
  79. Okay. So, we can see the names of the users,
  80. and their respective number of unique user mentions throughout this collection.
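
The code shown on screen is not part of the captions, so here is a rough sketch of what the pipeline described above might look like in Python. The field names mset and count come from the narration; the paths entities.user_mentions and user.screen_name are an assumption based on the usual Twitter tweet layout, and the sample tweets and the unique_mention_counts helper are hypothetical, added only so each stage's input and output can be traced without a running database.

```python
from collections import defaultdict

# The six stages as they might be passed to a collection's aggregate() call.
# Field paths are assumed, not taken from the captions.
pipeline = [
    {"$unwind": "$entities.user_mentions"},
    {"$group": {"_id": "$user.screen_name",
                "mset": {"$addToSet": "$entities.user_mentions.screen_name"}}},
    {"$unwind": "$mset"},
    {"$group": {"_id": "$_id", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]

# Hypothetical sample data: alice mentions bob twice and carol once.
tweets = [
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"},
                                    {"screen_name": "carol"}]}},
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"}]}},
    {"user": {"screen_name": "dave"},
     "entities": {"user_mentions": [{"screen_name": "alice"}]}},
]


def unique_mention_counts(docs, limit=10):
    """Plain-Python imitation of the pipeline, stage by stage."""
    # Stages 1-2: one item per mention, collected into a set per tweeting user
    # (the set plays the role of $addToSet, so repeats are dropped).
    msets = defaultdict(set)
    for doc in docs:
        for mention in doc["entities"]["user_mentions"]:
            msets[doc["user"]["screen_name"]].add(mention["screen_name"])

    # Stages 3-4: one item per element of each mset, counted per user.
    counts = defaultdict(int)
    for name, mset in msets.items():
        for _ in mset:
            counts[name] += 1

    # Stages 5-6: sort by count in descending order, keep the top `limit`.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [{"_id": name, "count": count} for name, count in ranked[:limit]]


result = unique_mention_counts(tweets)
print(result)
```

With a live collection, the same pipeline list would be handed to aggregate() instead of using the helper. Note that alice's repeated mention of bob is counted once, which is the whole point of grouping twice.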