English subtitles

Unwind Operator - Data Wrangling with MongoDB

Showing Revision 3 created 05/24/2016 by Udacity Robot.

  1. Alright, let's take some time now and talk about the
  2. unwind aggregation operator. In a lot of situations we're going
  3. to want to count or do some other sort of
  4. operation based on the values in an array field. We
  5. need to use array field values in some way. So,
  6. with this data we might want to answer a question like
  7. the following: In our collection, who included the most user
  8. mentions in their tweets? Now the reason why this is
  9. relevant to the unwind operator is because user
  10. mentions are included in our tweets inside an array
  11. field. Now in this data, if you remember,
  12. user mentions are found within the entities sub-document, in
  13. particular, in the user mentions field. User mentions
  14. is an array that contains documents that represent each
  15. individual user mention. So, what I'm going to do
  16. here is pull up some examples using this query.
  17. Now, here I'm using an operator that we haven't seen
  18. before. All this says is give me back documents where
  19. the user mentions field of the entities sub-document is of
  20. length three. So, I'll pretty print this. Then, if we scroll
  21. up, we can see that this example here does in
  22. fact have three user mentions in it. And, just so
  23. you have the full picture, entities is a top-level
  24. field here. It has a sub-document as a value and
  25. user mentions is one of the fields of that
  26. entities sub-document. User mentions is an array-valued
  27. field. And, we can see that it holds documents
  28. that are shaped like this. Now, what we're interested in
  29. are these screen names here, because these are the
  30. names of users that are mentioned in this particular tweet.
  31. And for any tweet that mentions a user, you're
  32. going to have an array with documents like this inside
  33. of it, naming the users mentioned. Now, what we want
  34. to find out is a count of all the user mentions
  35. made by an individual Twitter user. So what we're going to
  36. have to do is look through all of the tweets. There'll
  37. be some grouping involved, of course, because we want to
  38. group together tweets by the same user. But we also want
  39. to count the number of user mentions. Unwind is a convenient
  40. tool for doing that to answer this particular question. Let's take
  41. a look. Okay, so here's our aggregation pipeline.
  42. Our first stage uses the unwind operator and it's
  43. being run against that user mentions field. Now
  44. remember that what unwind does is create a copy
  45. of the containing document for any array field.
  46. It duplicates all fields except for the items in
  47. the array. And it will create one copy for
  48. each element in the array. And the only difference
  49. between all of the copies will be that this field
  50. will take on each of the different values in the array,
  51. in the documents that are produced. So let me, let
  52. me make this a little bit more concrete. If we take
  53. a look at this again, for this particular tweet, the
  54. unwind stage will produce three documents as output for this one
  55. tweet document here. All of the other fields that we
  56. see here, all of these, and everything else here in this
  57. tweet document will be exactly the same. The one difference will be
  58. that the user mentions field will have a single document as its
  59. value in each of those three copies of this tweet. In the
  60. first copy, it will have this as its value. In the second copy,
  61. it will have this, and finally, in the third copy, it
  62. will have this. So, in the documents that get passed along
  63. to the next stage, in this case a group stage,
  64. the documents will have a different value for the user mentions
  65. field. Now, it turns out that in our case, what
  66. we really care about is this splitting effect, not so much
  67. what the value of user mentions is each time through.
  68. Because, what we're interested in doing in the next stage is
  69. essentially counting all of the documents that
  70. pass through to this group stage with the
  71. same screen name for the user who created
  72. the tweet. Because, again, remember, the question we're
  73. after here is, who included the most user mentions in their tweets? So by the
  74. time we get to this stage unwind will
  75. have produced an individual document for every user
  76. mention in the collection. And group then will
  77. aggregate them together based on the screen name
  78. of the user who created the tweet. We'll
  79. then simply produce a count field here as
  80. part of our group operation. And again remember
  81. that sum simply increments this counter each time
  82. you see a document that's aggregated together with
  83. the screen name or a document that has
  84. the same screen name. Then we do our sort
  85. and limit stages. So one question I'll put to you
  86. here is does this count the number of unique
  87. user mentions? That is to say if a Twitter user
  88. mentions the same user more than once does this
  89. count each one of those mentions or does it count
  90. all mentions of the same user as one mention? If
  91. it's not unique mentions that are being counted here, the
  92. question I'll leave you with is what type of
  93. aggregation pipeline would we need to put together in order
  94. to count unique mentions of users? Okay, so let's
  95. run this. And because we limited this to one, we
  96. get one document in our result array with a
  97. count of 21 for user mentions for this user. Now
  98. in case it's not clear to you by this
  99. point, the advantage of the aggregation framework in MongoDB is
  100. that all of this work is being performed
  101. server side. That means that for this particular query
  102. all that comes across the network to our
  103. client is just that one result we just looked
  104. at. The aggregation framework is powerful, not just
  105. because of the functionality it provides, but because of
  106. the speed with which it can execute these queries
  107. because this functionality is fundamental to the server itself.
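
The pipeline described in this video might be sketched roughly as follows. This is a minimal illustration, not the exact code shown on screen. The field paths entities.user_mentions and user.screen_name are assumptions based on Twitter's JSON format, the sample tweets are made up, and the helper functions unwind and top_mentioner are hypothetical names that only mimic what the server-side stages do, so the sketch runs without a MongoDB server.

```python
import copy
from collections import Counter

# The four stages as they might be passed to pymongo, e.g.
# db.tweets.aggregate(pipeline). The collection name "tweets" and the
# field paths are assumptions, not taken from the video.
pipeline = [
    {"$unwind": "$entities.user_mentions"},
    {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 1},
]


def unwind(docs, outer, inner):
    """Mimic $unwind on docs[outer][inner] in plain Python.

    Yields one copy of each document per array element. Every copy is
    identical except that the array field holds a single element.
    Documents with a missing or empty array produce no output, which
    matches the default behavior of $unwind.
    """
    for doc in docs:
        for item in doc.get(outer, {}).get(inner) or []:
            clone = copy.deepcopy(doc)
            clone[outer][inner] = item
            yield clone


def top_mentioner(tweets):
    """Mimic the $group, $sort, and $limit stages on the unwound documents."""
    # $group with {"$sum": 1}: add one per unwound document, keyed by author.
    counts = Counter(
        d["user"]["screen_name"]
        for d in unwind(tweets, "entities", "user_mentions")
    )
    # $sort by count descending, then $limit 1.
    return [{"_id": name, "count": n} for name, n in counts.most_common(1)]


# Made-up sample data shaped like the tweets discussed in the video.
tweets = [
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"},
                                    {"screen_name": "carol"},
                                    {"screen_name": "erin"}]}},
    {"user": {"screen_name": "dave"},
     "entities": {"user_mentions": [{"screen_name": "alice"}]}},
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": []}},
]

result = top_mentioner(tweets)
print(result)
```

With this sample data, the first tweet is split into three documents, the second into one, and the third (with an empty array) into none, so the group stage sees four documents in total. Against a real server, only the final one-document result would travel over the network to the client.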