Combiners - Intro to Hadoop and MapReduce

  • 0:00 - 0:02
    Now we're going to talk about a pretty cool new
  • 0:02 - 0:05
    tool. Something that goes between the map and the reduce
  • 0:05 - 0:09
    phase, called the combiner. Before we do that, let's review
  • 0:09 - 0:11
    how you may have calculated the mean in the last
  • 0:11 - 0:14
    problem. And I don't know exactly what you did,
  • 0:14 - 0:16
    but one thing you may have done is something like
  • 0:16 - 0:21
    this. First, you probably had your mapper go through the
  • 0:21 - 0:24
    data. I should say your mappers go through the data,
  • 0:25 - 0:30
    and output key-value pairs that looked
  • 0:30 - 0:34
    something like, day of week, amount spent. Then,
  • 0:34 - 0:39
    for each day of the week, your reducer probably kept a total of the sum and
  • 0:39 - 0:42
    a count. So actually maybe your value here,
  • 0:42 - 0:45
    was the amount of money spent. And then
  • 0:45 - 0:48
    also a one, to help increment the count.
  • 0:48 - 0:51
    And then you probably would divide the sum
  • 0:51 - 0:54
    by the count, and it looks pretty good. But
  • 0:54 - 0:57
    remember, MapReduce is happening on this whole network
  • 0:57 - 1:00
    of data nodes. Each of these little boxes, we'll
  • 1:00 - 1:03
    pretend is a data node. And your data's spread
  • 1:03 - 1:07
    out across all of these different machines. And now
  • 1:07 - 1:09
    let's think about where each of these three steps
  • 1:09 - 1:12
    took place. So, we're not going to focus on what
  • 1:12 - 1:16
    happened but where it happened. Well, for step one,
  • 1:16 - 1:18
    the mappers were all over the place. There were
  • 1:18 - 1:22
    mappers, anywhere there is relevant data. But, for any
  • 1:22 - 1:25
    given day, all of the reduction had to happen
  • 1:25 - 1:28
    on a single machine, where that reducer was living.
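
As a concrete sketch, the three steps just described might look like this in plain Python. The field layout and names here are hypothetical, and this runs the reduce step in-process rather than via Hadoop Streaming, so the course's actual solution may differ:

```python
# Sketch of the mean-by-day-of-week job described above.
# Assumes tab-separated records where field 0 is the day of week
# and field 1 is the amount spent (a hypothetical layout).

def mapper(lines):
    """Step 1: emit (day_of_week, amount) key-value pairs."""
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) == 2:
            day, amount = fields
            yield day, float(amount)

def reducer(pairs):
    """Steps 2 and 3: keep a running sum and count per day,
    then divide the sum by the count."""
    totals = {}
    for day, amount in pairs:
        s, c = totals.get(day, (0.0, 0))
        totals[day] = (s + amount, c + 1)
    return {day: s / c for day, (s, c) in totals.items()}

records = ["Mon\t10.0", "Mon\t20.0", "Tue\t6.0"]
print(reducer(mapper(records)))  # {'Mon': 15.0, 'Tue': 6.0}
```

In a real Hadoop job the mapper and reducer would read from standard input on different machines, with the shuffle in between; this sketch just shows the logic end to end.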
  • 1:28 - 1:31
    And actually, that's enough, we've already exposed a problem.
  • 1:31 - 1:34
    Because if there's a lot of data, even if
  • 1:34 - 1:37
    each individual output was simply a day and a
  • 1:37 - 1:41
    number, depending on the number of records we have,
  • 1:41 - 1:44
    transferring all of this data to
  • 1:44 - 1:47
    this machine where the reducer lives, that's a lot
  • 1:47 - 1:49
    of bandwidth. That's going to be a lot of traffic
  • 1:49 - 1:52
    on our network that we don't necessarily want, and
  • 1:52 - 1:54
    it doesn't actually seem like we need to have.
  • 1:55 - 1:58
    So what if we could do some of this
  • 1:58 - 2:03
    reduction locally? What if on each machine, before we
  • 2:03 - 2:06
    send our key-value pairs off to be reduced,
  • 2:06 - 2:09
    what if we could do some pre-reduction? And it
  • 2:09 - 2:12
    turns out we can. And that pre-reduction happens with these
  • 2:12 - 2:16
    things called combiners. And before you practice using combiners
  • 2:16 - 2:18
    let me try to sell you a bit more on
  • 2:18 - 2:21
    their value. So as you know, when you use
  • 2:21 - 2:24
    a MapReduce job, you get some output like this.
  • 2:24 - 2:27
    This is where it's displaying how much mapping has happened
  • 2:27 - 2:30
    and how much reducing. But you get this tracking URL.
  • 2:31 - 2:36
    And if you open that URL in a browser you get a job page, and that job page
  • 2:36 - 2:38
    looks something like this. And actually if you scroll
  • 2:38 - 2:41
    down below here you'll see some more interesting statistics.
  • 2:41 - 2:45
    And in fact we ran two jobs. They were
  • 2:45 - 2:49
    identical jobs, except one implemented combiners, so they were
  • 2:49 - 2:53
    reducing locally as much as possible, before shuttling that
  • 2:53 - 2:56
    data off to the machine that held the final reducer.
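
One subtlety worth noting for the mean example: a combiner sees only its own node's partial data, and the mean of partial means is not the overall mean. A common trick is to have the combiner emit partial (sum, count) pairs that the final reducer can safely merge. This is a sketch of that idea in plain Python, not the course's actual code:

```python
# Sketch of local pre-reduction (a combiner) for the mean job.
# The combiner collapses each node's pairs into one
# (day, (partial_sum, partial_count)) entry per day, so the
# final reducer can merge partial results correctly.

def combiner(pairs):
    """Pre-reduce one node's (day, amount) pairs locally."""
    partial = {}
    for day, amount in pairs:
        s, c = partial.get(day, (0.0, 0))
        partial[day] = (s + amount, c + 1)
    return list(partial.items())

def reducer(combined):
    """Merge the partial (sum, count) pairs from every node,
    then divide the total sum by the total count."""
    totals = {}
    for day, (s, c) in combined:
        ts, tc = totals.get(day, (0.0, 0))
        totals[day] = (ts + s, tc + c)
    return {day: s / c for day, (s, c) in totals.items()}

# Two "nodes", each combining locally before the shuffle.
node_a = combiner([("Mon", 10.0), ("Mon", 20.0)])
node_b = combiner([("Mon", 30.0), ("Tue", 6.0)])
print(reducer(node_a + node_b))  # {'Mon': 20.0, 'Tue': 6.0}
```

Instead of shipping every (day, amount) pair across the network, each node ships at most one record per day, which is exactly the bandwidth savings described above.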
  • 2:56 - 2:59
    Let's compare the statistics from those two jobs. The
  • 2:59 - 3:02
    one with combiners, the one without. So let's compare what
  • 3:02 - 3:05
    happened on these job pages. The one without the combiner
  • 3:05 - 3:08
    versus the one with combiner. So, a lot of stuff
  • 3:08 - 3:11
    is the same. The mappers did the same amount
  • 3:11 - 3:16
    of work; the map input and output records match for without and with. That's
  • 3:16 - 3:18
    as expected. In fact, a lot of things look pretty
  • 3:18 - 3:21
    similar. But where things get really interesting is down here
  • 3:21 - 3:27
    at reduce input records. So, the reducers had to handle the whole
  • 3:27 - 3:33
    4 million records that the mappers output when there was no combiner. But when
  • 3:33 - 3:36
    there was a combiner, these 4 million
  • 3:36 - 3:39
    map output records are combined into 412
  • 3:39 - 3:44
    reduce input records.
  • 3:44 - 3:47
    That's pretty amazing. That's a huge change.
  • 3:47 - 3:51
    In fact, the reducers had to shuffle far fewer bytes. And for time
  • 3:51 - 3:57
    spent, with the combiner this job took 31 seconds to run versus without where
  • 3:57 - 4:02
    it took 43. That's a significant difference. So you can really use combiners to
  • 4:02 - 4:05
    improve the efficiency of your MapReduce
  • 4:05 - 4:07
    jobs. And we're going to try that next.
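
If you are running these jobs with Hadoop Streaming, as this course does, a combiner script is attached with the `-combiner` option. The jar path, input/output directories, and script names below are hypothetical placeholders for your own setup:

```shell
# Same job, run twice: once without a combiner, once with.
# mapper.py / reducer.py / combiner.py are hypothetical script names.
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py \
    -input purchases -output means_no_combiner

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -mapper mapper.py -combiner combiner.py -reducer reducer.py \
    -file mapper.py -file combiner.py -file reducer.py \
    -input purchases -output means_with_combiner
```

Comparing the two jobs' tracking pages, as the video does, should then show the drop in reduce input records and shuffled bytes.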
Video Language:
English
Team:
Udacity
Project:
ud617 - Intro to Hadoop and Mapreduce
Duration:
04:08