-
Title:
Counting Words Serially - Intro to Data Science
-
Description:
-
Here is one way to explain the MapReduce programming
-
model. Say that I wanted to count the number of
-
occurrences of each word that appears at least once in
-
a document. Let's use the text of Alice in Wonderland.
-
Here's a bit of text that says Alice was
-
beginning to get very tired of sitting by her sister
-
on the bank, and of having nothing to do. If
-
I wanted to solve this problem without MapReduce, I might
-
create a Python dictionary consisting of all the words
-
and their counts. I could go through the document
-
and say, for each word in the document, if
-
there is a key for that word, add one.
-
Otherwise, set the initial value for that key equal to
-
one. And instead of applying it to this short
-
sentence fragment from the book, we'd apply it to
-
the entire book. Before we solve this problem with MapReduce,
-
why don't you try to write a Python script
-
along the lines of what we just discussed, that will
-
get the job done. Given many lines of a text,
-
create a dictionary with a key for each word, and
-
a value corresponding to the count of the word in
-
that text. Note that we want the words to be
-
stripped of any capitalization and punctuation. We just want the
-
basic words. Here's some code to get you started. First,
-
we import sys and string. And then we
-
initialize an empty dictionary, which will hold our
-
words and values. We cycle through the lines of the input, and for each line we
-
create an array, data, which is essentially all
-
of the words in that line, split by
-
whitespace. So if we started with this
-
line, "Hello, how are you?", it would become
-
hello, how, are, and you, in an array of length four. Your code should go here,
-
after we split the line by whitespace and before we print out the dictionary.
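-
The serial approach described above could be sketched roughly as follows. This is only one possible version, assuming Python 3; the function name count_words and the variable word_counts are hypothetical and are not taken from the course's starter code. It splits each line by whitespace, strips capitalization and punctuation from each word, and then either adds one to an existing key or sets the initial value to one.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def count_words(lines):
    """Return a dictionary mapping each cleaned word to its count.

    `lines` is any iterable of strings, for example sys.stdin or a list.
    The name count_words is hypothetical, chosen for this sketch.
    """
    word_counts = {}
    for line in lines:
        # Split the line by whitespace; tokens may still carry punctuation.
        data = line.strip().split()
        for word in data:
            # Strip punctuation and capitalization to get the basic word.
            key = word.translate(_REMOVE_PUNCTUATION).lower()
            if not key:
                continue  # the token was punctuation only
            if key in word_counts:
                word_counts[key] += 1  # key exists: add one
            else:
                word_counts[key] = 1   # otherwise: initial value is one
    return word_counts


if __name__ == "__main__":
    sample = ["Hello, how are you?", "Hello hello"]
    print(count_words(sample))
```

To run it over a whole book, one option would be to pass sys.stdin (after importing sys) in place of the sample list and pipe the text file into the script.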