English subtitles

Counting Words Serially - Intro to Data Science

Showing Revision 5 created 05/25/2016 by Udacity Robot.

  1. Here is one way to explain the MapReduce programming
  2. model. Say that I wanted to count the number of
  3. occurrences of each word that appears at least once in
  4. a document. Let's use the text of Alice in Wonderland.
  5. Here's a bit of text that says Alice was
  6. beginning to get very tired of sitting by her sister
  7. on the bank, and of having nothing to do. If
  8. I wanted to solve this problem without MapReduce, I might
  9. create a Python dictionary consisting of all the words
  10. and their counts. I could go through the document
  11. and say, for each word in the document, if
  12. there is a key for that word, add one to its count.
  13. Otherwise, set the initial value for that key equal to
  14. one. And instead of applying it to this short
  15. sentence fragment from the book, we'd apply it to
  16. the entire book. Before we solve this problem with MapReduce,
  17. why don't you try to write a Python script
  18. along the lines of what we just discussed, that will
  19. get the job done. Given many lines of a text,
  20. create a dictionary with a key for each word, and
  21. a value corresponding to the count of the word in
  22. that text. Note that we want the words to be
  23. stripped of any capitalization and punctuation. We just want the
  24. basic words. Here's some code to get you started. First,
  25. we import sys and string. And then we
  26. initialize an empty dictionary, which will hold our
  27. words and values. We cycle through the lines of the input, and for each line we
  28. create an array, data, which is essentially all
  29. of the words in that line, split by
  30. white space. So if we started with the
  31. line "Hello, how are you?", it would become
  32. hello, how, are, and you, in an array of length four. Your code should go here,
  33. after we split the line by white space and before we print out the dictionary.
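
The serial approach described in the transcript could be sketched roughly as follows. This is one possible solution, not the course's reference answer. The function name count_words and the variable word_counts are assumptions chosen for illustration, and the sketch takes a list of lines instead of reading standard input so that it can run on its own.

```python
import string


def count_words(lines):
    """Return a dict mapping each lowercased, punctuation-free word to its count."""
    # Translation table that deletes every ASCII punctuation character.
    strip_punctuation = str.maketrans("", "", string.punctuation)
    word_counts = {}
    for line in lines:
        # split() with no argument splits on any run of whitespace.
        data = line.strip().split()
        for token in data:
            word = token.translate(strip_punctuation).lower()
            if not word:
                continue  # the token consisted only of punctuation
            if word in word_counts:
                # There is already a key for this word: add one.
                word_counts[word] += 1
            else:
                # First occurrence: set the initial value to one.
                word_counts[word] = 1
    return word_counts


print(count_words(["Hello, how are you?"]))
# → {'hello': 1, 'how': 1, 'are': 1, 'you': 1}
```

To process a whole book, the same function can be given sys.stdin (after importing sys) or an open file object, since both yield one line at a time. Note that removing all punctuation also removes apostrophes, so "don't" is counted as "dont"; whether that is acceptable depends on how the counts will be used.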