
Lecture 25 | Programming Abstractions (Stanford)

  • 0:00 - 0:12
  • 0:12 - 0:16
  • 0:16 - 0:25
    This presentation is delivered by the Stanford Center for Professional
  • 0:25 - 0:28
    Development.
  • 0:28 - 0:32
    The piece I need is not here, so I'm going to actually just talk to you about what
  • 0:32 - 0:34
    it says on my slides and then when Jason arrives we'll plug it back in and see what's going on.
  • 0:34 - 0:36
    Oh, there's Jason. There we go.
  • 0:36 - 0:40
    So what I'm going to do today is actually something kinda fun. And so I'm
  • 0:40 - 0:41
    gonna take a case study of a particular data structure
  • 0:41 - 0:45
    design,
  • 0:45 - 0:49
    and the one I'm looking at there is the lexicon.
  • 0:49 - 0:53
    So how many of you actually ever decided to just open up the lexicon file, just to see
  • 0:53 - 0:55
    what was in there at some point during the quarter? We used it a couple of times.
  • 0:55 - 1:00
    Anybody want to tell us what they learned by opening up that file?
  • 1:00 - 1:04
    Anything interesting?
  • 1:04 - 1:05
    It's a data structure called a DAWG. You're like, well that's a pretty funny name for something.
  • 1:05 - 1:10
    And
  • 1:10 - 1:14
    was the code particularly well-[inaudible], easy to read, easy to understand?
  • 1:14 - 1:18
    No. Who wrote that code? She should totally be fired.
  • 1:18 - 1:22
    Yeah, so this is what I wrote kinda a long time ago, actually, to solve this particular problem. And actually I would like to walk you through
  • 1:22 - 1:28
    kinda how it is I came to discover the DAWG and why it is the DAWG was the
  • 1:28 - 1:29
    choice that made sense for the usage pattern we had for the lexicon.
  • 1:29 - 1:30
    So let me just refresh what
  • 1:30 - 1:34
    the lexicon does.
  • 1:34 - 1:36
    It is a special case of kinda a set or map. It is a container where you're going to stick things
  • 1:36 - 1:38
    in, they're going to be unique, right? There's no
  • 1:38 - 1:41
    need for any sort of duplicate.
  • 1:41 - 1:43
    The keys are always strings. And in fact, they're English words, so that's going to be an important
  • 1:43 - 1:47
    thing to remember, that
  • 1:47 - 1:48
    they're not just arbitrary sequences of letters, right? They have certain patterns to
  • 1:48 - 1:51
    them
  • 1:51 - 1:54
    and certain sizes that might actually be something we could capitalize on. When we know
  • 1:54 - 1:57
    something about the domain, we're able to kinda special-case or tune our
  • 1:57 - 1:58
    data structure to solve that problem
  • 1:58 - 1:59
    very well,
  • 1:59 - 2:03
    in particular.
  • 2:03 - 2:07
    It has no associated value, so it doesn't have dictionary definitions or synonyms or anything
  • 2:07 - 2:08
    like that, right? So it really just is exists or doesn't exist
  • 2:08 - 2:12
    that we were interested in supporting.
  • 2:12 - 2:15
    And it also has, in addition to the kind of being able to add words and check if a
  • 2:15 - 2:16
    word is contained, it's also this prefixing operation that we had talked a
  • 2:16 - 2:19
    little about
  • 2:19 - 2:23
    in Boggle, and that came back on the midterm. Like, why was prefixing important, right?
  • 2:23 - 2:25
    It was critical for pruning these dead-end paths, these searches on the
  • 2:25 - 2:26
    Boggle board that otherwise
  • 2:26 - 2:32
    would really have
  • 2:32 - 2:32
    taken a lot of time and kinda led to nothing of any value. So what I'm gonna talk you
  • 2:32 - 2:36
    through
  • 2:36 - 2:40
    is some of the obvious choices based on data structures we already know about,
  • 2:40 - 2:43
    and see where they would go. So this is a thought experiment. It's not that I wrote all
  • 2:43 - 2:45
    these to find out. Actually, I mostly did a lot of this on paper, trying to figure
  • 2:45 - 2:48
    out what options we had
  • 2:48 - 2:49
    and how they would play out and what kind of trade-offs I'd have to make to make them
  • 2:49 - 2:52
    work.
  • 2:52 - 2:55
    So the simplest case of any kind of collection is to think about
  • 2:55 - 2:58
    whether vector or array can do the job for you.
  • 2:58 - 3:00
    In this case, having a vector that's kept in sorted - alphabetically sorted - order seems
  • 3:00 - 3:03
    like a pretty good,
  • 3:03 - 3:05
    easy way to get this set up and running and how would it behave?
  • 3:05 - 3:06
    So in order to do a contains
  • 3:06 - 3:07
    (word)
  • 3:07 - 3:15
    on this data structure,
  • 3:15 - 3:17
    what algorithm are you going to use to do your searching? Binary search,
  • 3:17 - 3:21
    right? It's in sorted order, right? We've got to take advantage of that, right? If we do
  • 3:21 - 3:25
    a linear search and I'm looking for the word ''mediocre,'' I'm going to be trucking through,
  • 3:25 - 3:28
    you know, thousands of words before I find it. So definitely want to be using binary search.
  • 3:28 - 3:29
    Equals-equals, comparing my strings, less than, greater than, dividing in half. And so
  • 3:29 - 3:32
    it should run in logarithmic time.
  • 3:32 - 3:35
    Logarithmic time, one of those great performers that you
  • 3:35 - 3:38
    learned in the PQ assignment, hopefully. It's just basically not measurable on
  • 3:38 - 3:41
    today's computer. So even for entries of
  • 3:41 - 3:42
    100,000, 500,000, basically
  • 3:42 - 3:46
    for free.
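    (For a sense of scale: binary search over 100,000 entries needs only about log2(100,000), roughly 17, string comparisons, and even a million entries only takes about 20.)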
  • 3:46 - 3:50
    How do we do contains (prefix)? Can we also do a fast prefix search given this
  • 3:50 - 3:51
    data structure?
  • 3:51 - 3:54
    I hope so right? Thinking about
  • 3:54 - 3:57
    the same kind of binary search. In this case, though, comparing substrings, right? So
  • 3:57 - 3:59
    if I'm looking for a three-character prefix, I'm only looking at the
  • 3:59 - 4:02
    first three characters to decide which way to go.
  • 4:02 - 4:05
    Once I find something that matches in the first three characters, I'm done.
  • 4:05 - 4:07
    So again, still using the sorted property to quickly narrow in on where
  • 4:07 - 4:10
    it could be.
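    (A minimal sketch of what that sorted-vector search might look like in C++, using std::lower_bound for the binary search; the names are illustrative, not the actual lexicon code:)

        #include <algorithm>
        #include <string>
        #include <vector>
        using namespace std;

        // contains(word): plain binary search, O(log n) string comparisons.
        bool containsWord(const vector<string>& words, const string& w) {
            return binary_search(words.begin(), words.end(), w);
        }

        // contains(prefix): binary search to the first entry >= prefix, then
        // check whether that entry actually starts with the prefix characters.
        bool containsPrefix(const vector<string>& words, const string& prefix) {
            vector<string>::const_iterator it =
                lower_bound(words.begin(), words.end(), prefix);
            return it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
        }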
  • 4:10 - 4:14
    When it's time to add a new word to this,
  • 4:14 - 4:17
    if someone asked you to put in the word ''abalone,'' I gotta
  • 4:17 - 4:21
    use that binary search to find out where to go. So in logarithmic time I can
  • 4:21 - 4:23
    tell you where the new word goes, but then to put it into position,
  • 4:23 - 4:26
    there's gonna be some shuffling.
  • 4:26 - 4:28
    And that's where this operation
  • 4:28 - 4:32
    starts to bog down,
  • 4:32 - 4:33
    given the idea that you might be moving half or more of your items,
  • 4:33 - 4:36
    on average,
  • 4:36 - 4:40
    to make that space for that one to open up.
  • 4:40 - 4:42
    It is going to require a linear time operation to get new words into the data
  • 4:42 - 4:43
    structure. Okay,
  • 4:43 - 4:48
    probably still, though,
  • 4:48 - 4:51
    a reasonable approach, right? These are the operations that get done
  • 4:51 - 4:54
    immense numbers of times in Boggle, right? You're doing a ton of contains (word)'s and contains (prefix)'s
  • 4:54 - 4:58
    as you're exhaustively searching that big board.
  • 4:58 - 5:01
    But the add's are done kinda once at the beginning. You build that big dictionary and
  • 5:01 - 5:02
    then you subsequently use it. And so if it took a little bit more time to build
  • 5:02 - 5:06
    that thing,
  • 5:06 - 5:08
    sorta an n-squared operation to kinda load it up, well,
  • 5:08 - 5:11
    maybe that would be okay.
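    (And a sketch of the add side, reusing the includes above: finding the spot is logarithmic, but vector::insert has to shift every later entry over, which is the linear-time part being described here:)

        // add(word): O(log n) to locate the slot, O(n) to shuffle entries over.
        void addWord(vector<string>& words, const string& w) {
            vector<string>::iterator it = lower_bound(words.begin(), words.end(), w);
            if (it == words.end() || *it != w) {
                words.insert(it, w);   // the linear-time shuffle happens here
            }
        }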
  • 5:11 - 5:14
    The other thing we want to look at is space usage, and that's - and some of this is a little
  • 5:14 - 5:18
    bit of an interesting artifact of the time I was working on this. It turns
  • 5:18 - 5:21
    out space was actually a limiting factor in what we had available
  • 5:21 - 5:22
    to us, in terms of core memories of our machines and
  • 5:22 - 5:25
    what we could take up for our dictionary.
  • 5:25 - 5:28
    In this case, it has per entry the size of the string,
  • 5:28 - 5:29
    though we don't exactly know how the string is represented. It's an abstraction
  • 5:29 - 5:32
    to us, but
  • 5:32 - 5:35
    we can make kinda a simplifying assumption that
  • 5:35 - 5:37
    it's probably about the length of the
  • 5:37 - 5:39
    string times the size of a character. And there might be a little bit more
  • 5:39 - 5:40
    housekeeping associated with it, but
  • 5:40 - 5:42
  • 5:42 - 5:44
    that's a pretty good estimate of what it's taking. So in fact, it's the string data
  • 5:44 - 5:46
    itself
  • 5:46 - 5:50
    that's being represented in this vector
  • 5:50 - 5:51
    without much other overhead. So if we say words average about eight characters, that's
  • 5:51 - 5:54
    about eight bytes per word.
  • 5:54 - 5:57
    So if we have 100,000 words we expect to have 800,000 bytes
  • 5:57 - 6:00
    invested in this sort of vector arrangement.
  • 6:00 - 6:02
    Excess capacity, a couple other things that might tweak that a little bit, but that
  • 6:02 - 6:04
    gives us at least a good starting point to
  • 6:04 - 6:07
    think about how this
  • 6:07 - 6:09
    compares to alternatives. Okay,
  • 6:09 - 6:12
    so then if we said
  • 6:12 - 6:16
    we know that sortedness - that binary search is the
  • 6:16 - 6:17
    real advantage of keeping it in sorted order, but the contiguous nature of the array
  • 6:17 - 6:19
    kinda
  • 6:19 - 6:23
    bites us in terms of insert.
  • 6:23 - 6:26
    If we reorganize into that binary search tree to give ourselves the dynamic flexibility of
  • 6:26 - 6:27
    wiring in the left and right halves through pointers, as opposed to
  • 6:27 - 6:29
    contiguous memory,
  • 6:29 - 6:31
    then we know we can also get add to run
  • 6:31 - 6:35
    in logarithmic time. So
  • 6:35 - 6:36
    looking at these operations, contains (word) is a tree search that's using equals-equals,
  • 6:36 - 6:40
    left-right, deciding which way to go,
  • 6:40 - 6:43
    and also performs logarithmically if the tree is balanced.
  • 6:43 - 6:45
    Contains (prefix), same deal. Keep
  • 6:45 - 6:46
    looking at the [inaudible],
  • 6:46 - 6:48
    working our way down, knowing which way to go.
  • 6:48 - 6:51
    And then
  • 6:51 - 6:54
    add can also perform in logarithmic time
  • 6:54 - 6:57
    because it does the same search, falling off the bottom and
  • 6:57 - 6:58
    then inserting the new word there, or finding it along the way and not having any
  • 6:58 - 7:01
    work to do.
  • 7:01 - 7:03
    So we can get all our operations done in logarithmic time, which we know to be kinda
  • 7:03 - 7:05
    fast enough to actually not be worth
  • 7:05 - 7:08
    fighting any harder for.
  • 7:08 - 7:11
    Where did we pay for this?
  • 7:11 - 7:13
    Nothing really comes for free, right?
  • 7:13 - 7:15
    Overhead, right? And overhead, in this case, memory is certainly one of the places
  • 7:15 - 7:20
    where there's going to be a
  • 7:20 - 7:23
    pretty big cost that we now have the string data for each entry, plus a
  • 7:23 - 7:26
    left pointer and a right pointer. And if we're doing the work to keep it
  • 7:26 - 7:30
    balanced, which we probably ought to do if we want to avoid the degenerate cases
  • 7:30 - 7:31
    of it kinda getting totally out of whack and degenerating into linear time,
  • 7:31 - 7:34
    we're going to have a couple more
  • 7:34 - 7:37
    bits, you know, bytes thrown into that factor, too.
  • 7:37 - 7:39
    And so if we're assuming that there's
  • 7:39 - 7:42
    ten bytes of overhead,
  • 7:42 - 7:46
    four bytes in each of these plus another two for the balance factor, then we've added
  • 7:46 - 7:48
    ten bytes of overhead on top of the eight bytes for the word.
  • 7:48 - 7:48
    Getting a little
  • 7:48 - 7:52
    bit hefty. Something to worry about.
  • 7:52 - 7:56
    Also in terms of code complexity, which I didn't put a bullet up for, but it's
  • 7:56 - 7:59
    also worth kinda keeping - weighing it in your mind also which is
  • 7:59 - 8:01
    that building the fancier, more complicated data structure probably meant
  • 8:01 - 8:02
    you spent more time debugging it,
  • 8:02 - 8:06
    more time developing it,
  • 8:06 - 8:08
    right? You will have - it will be harder for someone later to maintain and deal
  • 8:08 - 8:10
    with because it's actually just more complicated code that somebody looking at a
  • 8:10 - 8:12
    sorted vector
  • 8:12 - 8:13
    isn't likely to actually - like, they want to add a new operation,
  • 8:13 - 8:16
    let's say they add remove.
  • 8:16 - 8:19
    You're able to take a word out of the lexicon. The lexicon doesn't currently support
  • 8:19 - 8:23
    that. Somebody adding that on the sorted vector implementation probably won't find themselves
  • 8:23 - 8:25
    too tripped up. Someone who has to do a remove out of a binary search tree, removing it, reattaching
  • 8:25 - 8:27
    the subtrees
  • 8:27 - 8:30
    and not destroying the balancing
  • 8:30 - 8:32
    is a pretty complicated operation. So we've made - we've invested in this data
  • 8:32 - 8:34
    structure that actually will keep on
  • 8:34 - 8:36
    creating
  • 8:36 - 8:38
    further development hassles for any modifications that are down the road for
  • 8:38 - 8:40
    this data structure. So
  • 8:40 - 8:45
    it is
  • 8:45 - 8:47
    a big investment in your brainpower to keep it working. So
  • 8:47 - 8:51
    let's think about what a hash table can do for us. A
  • 8:51 - 8:56
    hash table is the last thing that we looked at on Friday,
  • 8:56 - 8:58
    kinda just moving away from this whole idea of sortedness and kinda handing it off to this idea of a
  • 8:58 - 9:02
    hashing function.
  • 9:02 - 9:05
    So if I have a hash table that has a certain number of buckets
  • 9:05 - 9:06
    that uses a linked list for chaining to
  • 9:06 - 9:10
    handle collisions,
  • 9:10 - 9:11
    then what I would have is my words just scattered around my table,
  • 9:11 - 9:14
  • 9:14 - 9:16
    allowing me to have quick access by hashing to the right bucket and then
  • 9:16 - 9:19
    working my way down the bucket to see.
  • 9:19 - 9:21
    So the contains (word) would do a standard
  • 9:21 - 9:24
    hash table lookup.
  • 9:24 - 9:27
    Take the word, you know, ''mediate,'' run it through the hash function. It says oh, it's bucket
  • 9:27 - 9:30
    three in this case. I go to bucket three. I walk down the linked list. I find it or
  • 9:30 - 9:31
    not. And I can tell you: the word ''mediate,'' if it's in the table, had to be in bucket
  • 9:31 - 9:33
    three.
  • 9:33 - 9:35
    No questions asked.
  • 9:35 - 9:38
    In that case, the
  • 9:38 - 9:39
    expected time for that is going to be n over b, where n is the number of entries, b is the number of
  • 9:39 - 9:41
    buckets.
  • 9:41 - 9:42
    If we have buckets
  • 9:42 - 9:47
    set to be in the
  • 9:47 - 9:52
    rough order of the number of entries, then we're going to get constant access there.
  • 9:52 - 9:54
    Now we want to do contains (prefix). I want
  • 9:54 - 10:00
    to
  • 10:00 - 10:05
    know if there's any words in this table that begin with ''med.''
  • 10:05 - 10:09
    Where do I look?
  • 10:09 - 10:11
    If I run ''med'' through the hash function,
  • 10:11 - 10:15
    do I expect that I would get the same number
  • 10:15 - 10:18
    as other words that begin with ''med,'' like ''mediate'' and ''median'' and
  • 10:18 - 10:19
    ''mediator?'' Yes.
  • 10:19 - 10:22
    You say yes. All
  • 10:22 - 10:26
    right, let's take that to an extreme.
  • 10:26 - 10:29
    If I expect that a prefix and any longer words built on it all hash to the same place, then should that
  • 10:29 - 10:33
    be the case? For example, think about the simplest case of a prefix, ''m.''
  • 10:33 - 10:35
    Should all the words that begin with ''m'' hash to the same place?
  • 10:35 - 10:38
    If they did, we're in trouble, right?
  • 10:38 - 10:42
    Now, if all prefixes did hash to the same place, then what I'm going to end up with is a
  • 10:42 - 10:45
    really heavily clustered table where ''a,'' ''b,'' ''c,'' you know, ''z'' -
  • 10:45 - 10:46
    there's 26 buckets that have all the ''z'' words, all the
  • 10:46 - 10:48
    ''y'' words, all the what not,
  • 10:48 - 10:52
    and none of the other buckets would be used. All
  • 10:52 - 10:55
    right, so in fact, actually, as part of the good behavior of a hash function, we're
  • 10:55 - 10:58
    hoping that, actually, it does jumble things up. And then a word that looks like one
  • 10:58 - 11:02
    but has another letter added on the end, we're hoping it goes somewhere totally different,
  • 11:02 - 11:04
    that ''median'' and ''medians'' hopefully go to two totally disparate places.
  • 11:04 - 11:07
    That if there were patterns where
  • 11:07 - 11:10
    words that were substrings of each other all got hashed to the same
  • 11:10 - 11:12
    location, we would end up with heavy clustering. Because there are a lot of words that
  • 11:12 - 11:15
    repeat parts of the other words in the dictionary.
  • 11:15 - 11:19
    So in fact, if I want to know if there's any words that begin with ''med,''
  • 11:19 - 11:21
    I have no a priori information about where to look for
  • 11:21 - 11:25
    them. They could be anywhere.
  • 11:25 - 11:26
    I could hash ''med'' and see if ''med'' itself was a word, right?
  • 11:26 - 11:30
    But let's say it was something that
  • 11:30 - 11:33
    itself was just the beginning of something. ''St,'' ''str'' -
  • 11:33 - 11:36
    there's a lot of words that begin with that, but ''str'' itself is not a word. So
  • 11:36 - 11:39
    hashing to that bucket and looking for it would only tell me if there's exactly ''str,'' and then
  • 11:39 - 11:41
    occasionally there might be some ''str'' words also there.
  • 11:41 - 11:44
    But other than that, we just gotta look.
  • 11:44 - 11:45
    And where do we look? Everywhere.
  • 11:45 - 11:50
    Everywhere, everywhere, everywhere.
  • 11:50 - 11:53
    That in order to conclude there is no word that begins with the prefix we have,
  • 11:53 - 11:55
    the only way to be confident of that is to either find something, or to have searched
  • 11:55 - 11:59
    the entire table and not found it.
  • 11:59 - 12:02
    And so it does require this completely linear, exhaustive
  • 12:02 - 12:04
    look through the entire contents of the hash table to know
  • 12:04 - 12:07
    something, for example, is not a prefix.
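    (A rough sketch of why, assuming a simple chained hash table of strings; containsWord gets to jump straight to one bucket, but containsPrefix has no better option than walking every chain. This is illustrative, not the actual code:)

        #include <string>
        #include <vector>
        using namespace std;

        struct Cell {                // one entry in a bucket's chain
            string word;
            Cell* next;
        };

        struct WordHashTable {
            vector<Cell*> buckets;   // each bucket is a linked list of cells
        };

        // A stand-in hash function, just for the sketch.
        size_t hashString(const string& s) {
            size_t h = 0;
            for (size_t i = 0; i < s.size(); i++) h = h * 31 + s[i];
            return h;
        }

        // contains(word): hashing picks the one bucket worth searching.
        bool containsWord(const WordHashTable& t, const string& w) {
            for (Cell* cur = t.buckets[hashString(w) % t.buckets.size()];
                 cur != NULL; cur = cur->next) {
                if (cur->word == w) return true;
            }
            return false;
        }

        // contains(prefix): no bucket is "the" bucket for a prefix, so in the
        // worst case every cell in every bucket has to be examined.
        bool containsPrefix(const WordHashTable& t, const string& prefix) {
            for (size_t b = 0; b < t.buckets.size(); b++) {
                for (Cell* cur = t.buckets[b]; cur != NULL; cur = cur->next) {
                    if (cur->word.compare(0, prefix.size(), prefix) == 0) return true;
                }
            }
            return false;
        }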
  • 12:07 - 12:09
    You might find one quickly. So in the best case, you might be in the first couple of
  • 12:09 - 12:10
    buckets, happen upon one.
  • 12:10 - 12:14
    But no guarantees,
  • 12:14 - 12:16
    overall. And certainly when it's not a prefix, it's going to require a complete look-see of everything.
  • 12:16 - 12:18
    That's kinda a bummer. Adding
  • 12:18 - 12:19
    a
  • 12:19 - 12:22
    word,
  • 12:22 - 12:24
    back to the fast behavior again. Hashing to the bucket,
  • 12:24 - 12:26
    checking to see if it's already there, adding if needed
  • 12:26 - 12:28
    [inaudible].
  • 12:28 - 12:31
    So it gets good times on these two,
  • 12:31 - 12:34
    but then kinda really bottoms out when it comes to supporting
  • 12:34 - 12:35
    that prefix search.
  • 12:35 - 12:38
    Its space usage
  • 12:38 - 12:42
    is kinda roughly comparable to that of the binary search tree.
  • 12:42 - 12:42
    We've got the string data itself, and we've got the four-byte pointer on the linked list.
  • 12:42 - 12:44
  • 12:44 - 12:48
    There's also the four-byte bucket pointer,
  • 12:48 - 12:51
    which is over in this other structure here, this array full of buckets.
  • 12:51 - 12:54
    If that number, though, is roughly equal to the number of cells, then you can kinda
  • 12:54 - 12:56
    just count it as each entry
  • 12:56 - 12:59
    had this
  • 12:59 - 13:02
    12-byte cell, plus a four-byte bucket pointer somewhere
  • 13:02 - 13:04
    that you can kinda think of as associated on a per-entry basis, so it's about
  • 13:04 - 13:06
  • 13:06 - 13:10
    eight bytes of overhead on top of the string.
  • 13:10 - 13:11
    So 16 bytes for each entry in the table.
  • 13:11 - 13:16
    Okay, that's
  • 13:16 - 13:18
    what we have so far.
  • 13:18 - 13:21
    We've got
  • 13:21 - 13:24
    pretty good performance on contains (word) no matter what we
  • 13:24 - 13:27
    choose. We can get either logarithmic or constant time with any of these three
  • 13:27 - 13:31
    data structures. So perfectly acceptable performance there.
  • 13:31 - 13:33
    We cannot get contains (prefix) to run well on a hash table.
  • 13:33 - 13:34
    Just doesn't happen the way hashing works.
  • 13:34 - 13:38
    And then add
  • 13:38 - 13:41
    can be fast on the two complicated pointer-based data structures to the right,
  • 13:41 - 13:42
    but can't be fast on the sorted vector.
  • 13:42 - 13:45
    And then
  • 13:45 - 13:47
    we see kinda some trade-offs here in terms of space versus time, that there's
  • 13:47 - 13:51
    very little space overhead
  • 13:51 - 13:55
    in the sorted vector case, but really blossoms into
  • 13:55 - 13:58
    two times, almost three times, the storage needed in the BST and hash table
  • 13:58 - 14:00
    case. The code also got a lot more complicated when we moved up to those
  • 14:00 - 14:04
    data structures. So
  • 14:04 - 14:06
    then I'll tell you, actually, that it turns out that the one type of performance that was interesting to us
  • 14:06 - 14:09
    when I was
  • 14:09 - 14:14
    exploring this, but in fact, more important at the time, was the memory usage.
  • 14:14 - 14:17
    That the memory usage was really the big obstacle that we had.
  • 14:17 - 14:19
    The Official Scrabble Players Dictionary II is actually the source for the word list that
  • 14:19 - 14:22
    we're using.
  • 14:22 - 14:23
    It has about 128,000 words; I think that's exactly the
  • 14:23 - 14:26
    number it has.
  • 14:26 - 14:29
    And the average length of those is eight characters. And the file itself is
  • 14:29 - 14:32
    over a megabyte. So on disk, if you were just to look at the listing of all the words,
  • 14:32 - 14:34
    it's a megabyte - a little over a megabyte there.
  • 14:34 - 14:37
    If we were loading it into the sorted vector,
  • 14:37 - 14:40
    we'd get something that was just roughly equal to that. It would take about a megabyte of space,
  • 14:40 - 14:45
    plus a little bit of noise.
  • 14:45 - 14:48
    The ones that double up take it up to two, two and a half megabytes total.
  • 14:48 - 14:50
    At the time we were working, and I know this is going to seem completely archaic to you, but we had
  • 14:50 - 14:51
    these stone tablets that we worked on.
  • 14:51 - 14:55
  • 14:55 - 15:00
    And they typically had main memories of about a megabyte, maybe, like in
  • 15:00 - 15:02
    an exciting case, four megabytes; four whole megabytes of RAM.
  • 15:02 - 15:05
    So it was not possible
  • 15:05 - 15:09
    that we could dedicate one or two megabytes of your main memory
  • 15:09 - 15:12
    to just your lexicon. It just wasn't - there just wasn't enough space to go around.
  • 15:12 - 15:15
    So we're working in this very tight environment in terms of that. So a little bit more
  • 15:15 - 15:16
    like, for example, today's world of embedded devices. If you're adding something for an
  • 15:16 - 15:18
    iPod or a
  • 15:18 - 15:21
  • 15:21 - 15:24
    cell phone you have a lot less
  • 15:24 - 15:27
    gargantuan memory. So your average desktop machine, having a
  • 15:27 - 15:28
    gigabyte of RAM, now you're like, whatever, a megabyte here, ten megabytes there, free, all the
  • 15:28 - 15:31
    memory you want. And
  • 15:31 - 15:34
    so this seems kinda like a silly thing to focus on, but this was 1995,
  • 15:34 - 15:35
    and it turns out, yeah, there was certainly more interest in
  • 15:35 - 15:37
    keeping it
  • 15:37 - 15:40
    tight to the memory.
  • 15:40 - 15:44
    The Boggle lexicon that we have actually takes under a third of a megabyte.
  • 15:44 - 15:45
    The memory that's used while it's working and searching and checking for words and
  • 15:45 - 15:47
    prefixes
  • 15:47 - 15:50
    is actually smaller
  • 15:50 - 15:53
    than the on-disk form of the data we started with.
  • 15:53 - 15:57
    So it actually does a kind of a bit of a compression
  • 15:57 - 16:00
    as part of its strategy for storage, which is pretty interesting to think about.
  • 16:00 - 16:02
    So I'll tell you, actually, the truth about how we first did Boggle, because it's kinda a little
  • 16:02 - 16:03
    interesting
  • 16:03 - 16:07
    historical perspective on it.
  • 16:07 - 16:09
    But Boggle was given by a colleague of mine for the first time, Todd
  • 16:09 - 16:10
    Feldman. He was the guy who came up with the idea, which is a great idea. And he said, ''Oh, this is a great
  • 16:10 - 16:13
    assignment.''
  • 16:13 - 16:16
    And when he wrote it, he said, ''Oh, I need a lexicon.'' So as part of kinda putting the assignment together, he was like, ''Oh, we
  • 16:16 - 16:18
    have to get some lexicons.'' So he looked around and he found a word list that had about
  • 16:18 - 16:20
    20,000 words.
  • 16:20 - 16:23
  • 16:23 - 16:27
    And I think we first found this word list and we said, ''Oh, it's just impossible.'' It's going to take a megabyte
  • 16:27 - 16:30
    or more and we just don't have that kind of space. So we need much smaller data files. So we had a data file that had
  • 16:30 - 16:33
    about 20,000 words, which then takes about
  • 16:33 - 16:34
    200k. And he actually wrote it using a hash table.
  • 16:34 - 16:37
    So given he wrote it as a hash table,
  • 16:37 - 16:38
    he just didn't support contains (prefix). He
  • 16:38 - 16:39
    just didn't put it in,
  • 16:39 - 16:42
  • 16:42 - 16:45
    and - because you can't do it efficiently. So it actually took a smaller amount
  • 16:45 - 16:49
    of space and had a smaller word list and it wouldn't do contains (prefix).
  • 16:49 - 16:51
    So then people wrote Boggle. And as you learned from that midterm question, you can write Boggle without
  • 16:51 - 16:55
    contains (prefix).
  • 16:55 - 16:58
    It spends a lot more time looking down these dead ends, but it eventually bottoms out
  • 16:58 - 17:01
    when it runs off the board and uses all the
  • 17:01 - 17:04
    cubes. And
  • 17:04 - 17:05
    it actually, in some ways, almost made the program a little more satisfying to
  • 17:05 - 17:08
    write in one odd way,
  • 17:08 - 17:09
    which is, when you're - so it's the computer's turn, right, that uses the
  • 17:09 - 17:12
    prefix pruning.
  • 17:12 - 17:14
    So you would type in your words and it would be perfectly fine finding it, but it
  • 17:14 - 17:16
    doesn't use the prefix part when it's doing the human's turn. You're typing in words; it's finding them. You're typing
  • 17:16 - 17:19
    in the words; it's finding them.
  • 17:19 - 17:21
    Then you would go to do the computer's turn and you'd say, ''Okay, go.''
  • 17:21 - 17:25
    And now it would take a really long time.
  • 17:25 - 17:27
    It'd be hunting around the board. It would be working so hard your fan would be coming on.
  • 17:27 - 17:30
    You'd feel like oh, my computer's doing something.
  • 17:30 - 17:31
    And then it would show up with twice as many words as you found or ten times as
  • 17:31 - 17:33
    many words as you found, whatever.
  • 17:33 - 17:36
    And in fact, then you would say,
  • 17:36 - 17:40
    ''Well, at least it had to work hard to beat me.''
  • 17:40 - 17:41
    And then you felt like I wrote this program that causes my processor to get busy, right? I feel good. And
  • 17:41 - 17:44
  • 17:44 - 17:47
    in fact, actually, the word list was pretty small. So in fact, it didn't - it wasn't even actually as likely to
  • 17:47 - 17:49
    beat you because it didn't know as many words. 125,000 words is
  • 17:49 - 17:52
    an enormous number of words.
  • 17:52 - 17:55
    The average - I've heard different factors on this, but the
  • 17:55 - 17:59
    average person's vocabulary, English vocabulary, is about 20,000 to
  • 17:59 - 18:00
    30,000 words. It tends to be about 10,000 higher if you've gone to college, so
  • 18:00 - 18:03
    maybe 30,000 to 40,000.
  • 18:03 - 18:06
    The fact that it knows 80,000 more words than the average college graduate
  • 18:06 - 18:09
    means it knows a lot of obscure words, too. So in fact, not only does it
  • 18:09 - 18:12
    very quickly find all those words and then produce these ridiculous words that you've
  • 18:12 - 18:14
    never heard of, it just - it's almost like it's taunting you by doing it
  • 18:14 - 18:17
    in a millisecond.
  • 18:17 - 18:21
    And back then it took a while, so it was actually like
  • 18:21 - 18:23
    okay, it beat me, but it had to work hard.
  • 18:23 - 18:27
    But then - so then I took over Boggle.
  • 18:27 - 18:31
    I taught it maybe a quarter or two later and I said, ''I'm going to write a
  • 18:31 - 18:35
    lexicon. I'm going to go up and make a little project for myself, because I don't have enough work to do.''
  • 18:35 - 18:38
    And my goal was to make one that would do prefix, but that also would be
  • 18:38 - 18:40
    tight enough on memory that we could actually load a much bigger dictionary file, because I thought I'd
  • 18:40 - 18:42
    actually make the game even more
  • 18:42 - 18:48
    fun to play.
  • 18:48 - 18:52
    All right, so here's where I started.
  • 18:52 - 18:55
    This is a little excerpt from the middle of the
  • 18:55 - 18:58
    dictionary. ''Stra,'' yeah, that's totally the thing you notice, right, when you look at this thing. You see
  • 18:58 - 19:02
    ''straddle,'' ''straddler,'' ''straddlers,'' ''straddles,'' ''straddling,'' and then these words that you can't
  • 19:02 - 19:04
    believe you're allowed to do this, but apparently you can put ''er'' and ''ier'' and
  • 19:04 - 19:05
    ''ies'' and ''ing'' on
  • 19:05 - 19:07
    almost anything out there and it
  • 19:07 - 19:08
    makes a valid word.
  • 19:08 - 19:11
    So
  • 19:11 - 19:14
    although you never really thought about the fact that
  • 19:14 - 19:16
    ''straightforward,'' ''straightforwardly,'' ''straightforwardness,'' are all there, there is a huge amount
  • 19:16 - 19:18
    of repetition. And this is
  • 19:18 - 19:20
    where I said we're going to use some information about the domain.
  • 19:20 - 19:23
    These aren't just arbitrary strings.
  • 19:23 - 19:24
    They're not just random sequences of letters that are
  • 19:24 - 19:28
    picked out of the air,
  • 19:28 - 19:30
    rather, they developed over time. They're word roots. They're
  • 19:30 - 19:34
    suffixes. There's these prefixes that mean things
  • 19:34 - 19:35
    that actually show you that looking at this little excerpt out of the middle,
  • 19:35 - 19:38
    there's an awful lot of words
  • 19:38 - 19:42
    that repeat portions of other words.
  • 19:42 - 19:44
    And that may be part of what we can capitalize on. That
  • 19:44 - 19:47
    instead of really recording ''straightaway''
  • 19:47 - 19:51
    and ''straightaways,'' repeating those first 11 characters and adding an ''s,''
  • 19:51 - 19:55
    there's some way that I can unify the data that I have here. The same way when we try to
  • 19:55 - 19:56
    find codes, you have the same passage of code repeated two or three times.
  • 19:56 - 19:59
    We're always saying
  • 19:59 - 20:00
    unify it, move it into a helper and call it in three places. It's like can we do the same thing
  • 20:00 - 20:02
    with our data?
  • 20:02 - 20:07
    Can we share the parts of the words that are in common
  • 20:07 - 20:09
    and only distinguish where they are different?
  • 20:09 - 20:11
    Take a look at this.
  • 20:11 - 20:14
    This is a little portion
  • 20:14 - 20:18
    of something called a trie.
  • 20:18 - 20:20
    A trie is spelled t-r-i-e and it's actually from ''retrieval,'' but
  • 20:20 - 20:23
    sadly, it's still pronounced tri, just to confuse you.
  • 20:23 - 20:28
    And it is a variant of a tree.
  • 20:28 - 20:30
    In this case it's a tree that has 26 children coming out of any particular node,
  • 20:30 - 20:32
    one for each letter of the alphabet.
  • 20:32 - 20:35
    And so starting from the root node at the top,
  • 20:35 - 20:39
    all the words that begin with ''a'' are actually
  • 20:39 - 20:42
    down the A branch, and then there's a B branch. There'd be a C branch and a D
  • 20:42 - 20:45
    branch. And so the idea is that all the words, like if you think of the levels of the
  • 20:45 - 20:46
    tree are actually like - these are all the zeroth characters of the word.
  • 20:46 - 20:49
    And then that second level is
  • 20:49 - 20:52
    all the characters that follow that zeroth character. So here's the ''ab''
  • 20:52 - 20:56
    words and the ''ac'' words and the ''ax'' words.
  • 20:56 - 20:57
    Over here is the ''st'' words, but there's not a ''sx,'' so there's some places that actually aren't
  • 20:57 - 21:00
    filled out in this tree.
  • 21:00 - 21:02
    And as you keep going down further, so that the depth
  • 21:02 - 21:04
    you would trace to a tree
  • 21:04 - 21:08
    would actually be
  • 21:08 - 21:10
    tracing through the characters in sequence that make words.
  • 21:10 - 21:11
    And so in this form, tracing out a
  • 21:11 - 21:12
    path through this tree
  • 21:12 - 21:14
    from the root
  • 21:14 - 21:17
    down to a particular node
  • 21:17 - 21:19
    tells you about prefixes. So there are words that begin with ''a'' and words that begin with ''ac''
  • 21:19 - 21:23
    and ''ace,''
  • 21:23 - 21:23
    and then there's actually a little notation there that's done by a
  • 21:23 - 21:24
    visual
  • 21:24 - 21:26
    of the
  • 21:26 - 21:29
    bold and circle around it
  • 21:29 - 21:30
    that says, ''Oh, and the path that led here from the root is a word.'' So
  • 21:30 - 21:33
    ''a'' is a word
  • 21:33 - 21:35
    and ''ace'' is a word and ''aces'' is a word.
  • 21:35 - 21:39
    ''Act'' is a word and so is ''acted'' and ''actor.''
  • 21:39 - 21:44
    And in this case, the ''act'' part is totally shared,
  • 21:44 - 21:46
    so everything that comes off of ''act,'' ''acting,'' ''acted,'' ''actor,'' ''actors,''
  • 21:46 - 21:50
    can all share that initial part.
  • 21:50 - 21:52
    And for a word like ''act,'' it doesn't seem that profound. But you start thinking about words like
  • 21:52 - 21:55
    ''strawberry'' or ''straightforward'' or ''stratosphere''
  • 21:55 - 21:59
    and realize that ''stratospheric'' and ''strawberries'' and ''straightforwardly''
  • 21:59 - 22:01
    can all leverage the other ten or 12 characters that were already
  • 22:01 - 22:04
    invested in the root word,
  • 22:04 - 22:06
    and that ''straightjacket'' and ''straightforward'' can even share ''straight.''
  • 22:06 - 22:10
    But there's actually a lot of
  • 22:10 - 22:15
    prefixes that can be unified across the space of the dictionary.
  • 22:15 - 22:17
    So knowing that words have these patterns helps us to do this.
  • 22:17 - 22:19
    So let's build that.
  • 22:19 - 22:22
    So the first form of this -
  • 22:22 - 22:25
    and this, again, is a thought experiment. I did not actually write it this way because
  • 22:25 - 22:27
    it would just be absolutely astronomically expensive. But it was a good way to kinda think
  • 22:27 - 22:30
    about where I needed to go.
  • 22:30 - 22:34
    I designed a new kinda trie node
  • 22:34 - 22:35
    that had the letter and a little Boolean flag of ''is this a word,'' so whether it had
  • 22:35 - 22:39
    the dark circle around it.
  • 22:39 - 22:41
    And then it had a 26-member children array, so pointers to other nodes like
  • 22:41 - 22:44
    this one. So totally recursive,
  • 22:44 - 22:47
    starting from that root, and then the
  • 22:47 - 22:50
    idea is to use those 26 children with ''a'' being in the first slot and ''z'' in the last
  • 22:50 - 22:52
    slot to make it really easy to kinda just trace letter by letter
  • 22:52 - 22:55
    down to the levels of the tree.
  • 22:55 - 22:59
    So I'd have this one root node that pointed to the initial one, and then all the words
  • 22:59 - 23:03
    that begin with ''a'' are off that first slot; all the ''z'' words off the last slot.
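    (A sketch of that first trie node in C++; the field names are made up for illustration, not taken from the real lexicon source:)

        struct TrieNode {
            char letter;              // letter on the edge leading into this node
            bool isWord;              // true if the path from the root to here spells a word
            TrieNode* children[26];   // slot 0 for 'a' ... slot 25 for 'z'; NULL where unused
        };
        // 26 four-byte pointers plus the letter and flag is where the roughly
        // 106 bytes per node mentioned later comes from (on a 32-bit machine).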
  • 23:03 - 23:08
    So when I want to know if a word is in the dictionary,
  • 23:08 - 23:12
    what do I have to do in this data structure? How do I find it? I
  • 23:12 - 23:18
    want to find the word ''far.'' Where do I look?
  • 23:18 - 23:27
    Left? Right? Down? Which way? It would be great
  • 23:27 - 23:31
    if you would tell me. Then I would understand what my data structure is. [Inaudible]
  • 23:31 - 23:35
    Exactly. So at the very top I say okay, ''f'' is my first letter. Match my ''f,'' get me to the
  • 23:35 - 23:35
    F node, and now it says recursively find ''ar'' from here. So it's very
  • 23:35 - 23:37
    recursive. It'll be like
  • 23:37 - 23:39
    find the first letter
  • 23:39 - 23:43
    and then recursively find the substring from here, working down. So find the ''a'' out of the next node, find
  • 23:43 - 23:46
    an ''r'' out of the next node, and then when you get to that node,
  • 23:46 - 23:47
    you'll check this little ''is word Boolean,''
  • 23:47 - 23:49
    and say okay,
  • 23:49 - 23:53
    was that path I traced
  • 23:53 - 23:58
    marked in the dictionary as leading to something that's known to be a word?
  • 23:58 - 24:01
    The prefix looks exactly the same, except for that last check of the is-word flag. So
  • 24:01 - 24:05
    given that nodes exist in this, because further down there there's some words, I just
  • 24:05 - 24:08
    trace out a sequence of letters ''str,'' and as long as there's something there,
  • 24:08 - 24:11
    then I know that away from there are subsequent children beneath it
  • 24:11 - 24:15
    that lead to words.
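    (Roughly how that tracing might look, building on the TrieNode sketch above; again illustrative rather than the actual implementation, and it assumes all-lowercase words:)

        #include <string>
        using namespace std;

        // Walk the string one letter at a time from the root; NULL means the path dies out.
        const TrieNode* findNode(const TrieNode* root, const string& s) {
            const TrieNode* cur = root;
            for (size_t i = 0; cur != NULL && i < s.size(); i++) {
                cur = cur->children[s[i] - 'a'];
            }
            return cur;
        }

        // contains(word): the path must exist AND end at a node marked as a word.
        bool containsWord(const TrieNode* root, const string& w) {
            const TrieNode* n = findNode(root, w);
            return n != NULL && n->isWord;
        }

        // contains(prefix): the path just has to exist; any words lie somewhere below it.
        bool containsPrefix(const TrieNode* root, const string& prefix) {
            return findNode(root, prefix) != NULL;
        }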
  • 24:15 - 24:17
    And adding a word does the same operation.
  • 24:17 - 24:19
    If I need to add the word ''strawberry,''
  • 24:19 - 24:20
    then I start looking for ''s,'' ''t,'' ''r,''
  • 24:20 - 24:24
    and at some point,
  • 24:24 - 24:26
    I find some places where the nodes don't exist. Then I start creating them from
  • 24:26 - 24:29
    there on down.
  • 24:29 - 24:33
    So it is tracing what exists and then adding new nodes off the bottom,
  • 24:33 - 24:38
    and then marking the final one as yes, this is a word.
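    (And the add, in the same illustrative style: trace the part of the path that already exists, create whatever nodes are missing, and mark the last one as a word:)

        // add(word): reuse existing nodes along the path; allocate only the missing tail.
        void addWord(TrieNode* root, const string& w) {
            TrieNode* cur = root;
            for (size_t i = 0; i < w.size(); i++) {
                int slot = w[i] - 'a';          // assumes lowercase letters only
                if (cur->children[slot] == NULL) {
                    TrieNode* node = new TrieNode;
                    node->letter = w[i];
                    node->isWord = false;
                    for (int j = 0; j < 26; j++) node->children[j] = NULL;
                    cur->children[slot] = node;
                }
                cur = cur->children[slot];
            }
            cur->isWord = true;   // the full path now spells a word
        }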
  • 24:38 - 24:43
    What's interesting about this is that all three of these operations
  • 24:43 - 24:47
    don't actually make any reference to how many words are in the dictionary.
  • 24:47 - 24:49
    They are all dependent on the length of the word you're searching for or adding into the dictionary.
  • 24:49 - 24:51
    If there's ten words in there, if there's 1,000 words in there, if
  • 24:51 - 24:53
    there's a million words in there.
  • 24:53 - 24:56
    But in no way does the big O depend on
  • 24:56 - 24:59
    how big or how loaded the data structure is.
  • 24:59 - 25:00
    That's kinda wacky.
  • 25:00 - 25:06
    The longest word,
  • 25:06 - 25:09
    according to the Oxford English Dictionary?
  • 25:09 - 25:11
    There you go. You're not going to make that on the Boggle board anytime soon because it's only got 16 cubes. Even Big Boggle
  • 25:11 - 25:12
    can't really make that puppy. But
  • 25:12 - 25:14
    just so you know,
  • 25:14 - 25:17
    45 is not so many, right?
  • 25:17 - 25:21
    And,
  • 25:21 - 25:24
    in fact, actually, here's a little thought for you. Not only is it not dependent on num words, it actually has a
  • 25:24 - 25:26
    little bit of an inverse relationship with num words,
  • 25:26 - 25:29
    especially the add operation.
  • 25:29 - 25:31
    That as the data structure becomes more loaded,
  • 25:31 - 25:34
    there's typically less work to do in add.
  • 25:34 - 25:37
    So if ''strawberry'' is in there, and you go to add ''strawberries,''
  • 25:37 - 25:38
    then actually you already have ''strawberry'' to depend on. You just need to build
  • 25:38 - 25:41
    off the thing. So the fact that
  • 25:41 - 25:45
    the more words that are in there, the more likely that some of the nodes you already need
  • 25:45 - 25:46
    are already in place, and then you can just kinda blow past them and just add your new
  • 25:46 - 25:49
    nodes.
  • 25:49 - 25:53
    Now, that's odd. You think of most data structures as really as
  • 25:53 - 26:00
    clearly as they get heavier, it takes more work to install new
  • 26:00 - 26:03
    things in them rather than less. [Inaudible].
  • 26:03 - 26:04
    Well, the thing is, they're organized by letter, and that 26-number array. So in fact,
  • 26:04 - 26:08
    I don't even have to look.
  • 26:08 - 26:11
    So I just go straight to the slot. If I'm looking for strawberry, I'll look for ''s'' and I go
  • 26:11 - 26:14
    straight to the ''t'' slot and then the ''r'' slot. So in fact, if the node was already in place, I'm
  • 26:14 - 26:16
    not searching for it. It really just is exactly in the ''r'' slot. That's why
  • 26:16 - 26:21
    there's 26 of them. In
  • 26:21 - 26:24
    each array? Each array is 26 and they're A through Z and if there's an empty slot we're using a null.
  • 26:24 - 26:27
    So in this case, it's optimized to make it super fast. I don't even have to look
  • 26:27 - 26:29
    through the letters to find the one that matches. There's no matching. It just goes straight
  • 26:29 - 26:32
    to the ''r'' slot, ''z'' slot, the ''s'' slot.
  • 26:32 - 26:35
    We're going to see this is going to be a consequence we can't
  • 26:35 - 26:37
    tolerate, but it is, at least, the starting point for thinking about how to
  • 26:37 - 26:38
    build a data structure.
  • 26:38 - 26:40
    So the space usage
  • 26:40 - 26:44
    is where this thing
  • 26:44 - 26:47
    really, really - so this is an extreme example of a space-time trade-off, right?
  • 26:47 - 26:50
    We've got this thing that optimizes beautifully for O of the length of
  • 26:50 - 26:53
    the word. So that means typically eight operations on average are going to be
  • 26:53 - 26:56
    needed to add or to search for something.
  • 26:56 - 26:57
    The space we're using here is 106 bytes per node, so that's this
  • 26:57 - 26:59
    26-member
  • 26:59 - 27:02
    array of four-byte pointers,
  • 27:02 - 27:04
    plus two bytes for the character and the Boolean that are up there. One
  • 27:04 - 27:05
    hundred and six bytes
  • 27:05 - 27:09
    is a lot of memory,
  • 27:09 - 27:13
    and the tree has about a quarter of a million nodes.
  • 27:13 - 27:17
    So given the 125,000 words that I said are in the input,
  • 27:17 - 27:19
    if you build it into a tree, they take about 250,000 nodes. That's an interesting sort of
  • 27:19 - 27:22
    number to think about, though. You have 125,000
  • 27:22 - 27:25
    words. They take 250,000 nodes. That tells you, actually, that
  • 27:25 - 27:27
    kinda on average, each
  • 27:27 - 27:29
    word has about two unique nodes,
  • 27:29 - 27:33
    that even words like ''straightforward'' and ''straightforwardly,''
  • 27:33 - 27:34
    on average, are sharing enough of their prefix - common prefix
  • 27:34 - 27:38
    parts - that
  • 27:38 - 27:39
    there's very little unique data added by any particular word in there that averages
  • 27:39 - 27:40
    across the whole thing. That's
  • 27:40 - 27:42
    pretty neat
  • 27:42 - 27:45
    to know. However, this is just not going to fly.
  • 27:45 - 27:50
    So here I was telling you that my main memory had
  • 27:50 - 27:53
    four megabytes, maybe, on a good day, and now I've actually
  • 27:53 - 27:55
    built a data structure that's going to require six times that
  • 27:55 - 27:58
    in terms of storage.
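    (Putting numbers on that: roughly 250,000 nodes at about 106 bytes apiece is on the order of 26 megabytes.)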
  • 27:58 - 28:00
    Okay, we gotta squeeze that down.
  • 28:00 - 28:04
    So the first thing we can do
  • 28:04 - 28:08
    is to realize that allocating a full 26-member array, of
  • 28:08 - 28:11
    which I only plan on using some of the slots, is pretty wasteful.
  • 28:11 - 28:13
    So instead of saying I'm committing to 26 per node, I'll say well, how about if
  • 28:13 - 28:16
    I have an array that actually
  • 28:16 - 28:20
    is tight to the number of children coming out of this node?
  • 28:20 - 28:20
    So the very first node will have 26. There are words that start with every letter in the
  • 28:20 - 28:22
    alphabet.
  • 28:22 - 28:25
    But from there, it very quickly winnows down.
  • 28:25 - 28:29
    You have the letter ''x.'' You do not need the full 26 children coming
  • 28:29 - 28:33
    off of ''x.'' There's only a few letters that follow ''x'' in the English language that are going
  • 28:33 - 28:37
    to need to be fleshed out. You need the ''xa,'' you need an ''xe,''
  • 28:37 - 28:40
    but you don't need ''xj,'' you don't need ''xk,'' a whole bunch of things like that.
  • 28:40 - 28:44
    So if I change the structure here to instead of being
  • 28:44 - 28:47
    a 26-member array of node pointers, I make it to be a pointer to a
  • 28:47 - 28:50
    set of node pointers, so this is a dynamic array
  • 28:50 - 28:54
    where I'm keeping track of the number of children in a second field here,
  • 28:54 - 28:56
    and the first one will be allocated to 26, the second will be allocated to eight.
  • 28:56 - 28:59
    This is going to change a little bit of our running times, because now we won't
  • 28:59 - 29:01
    be able to go exactly to the ''s'' slot. We'll have to go look for it.
  • 29:01 - 29:03
    So on each step down the tree,
  • 29:03 - 29:07
    it'll be
  • 29:07 - 29:12
    looking through that sized array at this slot to find the right match. But it adds a
  • 29:12 - 29:14
    constant factor, basically, on top of O of the length of the word.
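    (A sketch of that slimmed-down node, again with illustrative names; this is a revision of the TrieNode above, and each step now scans the node's small children array instead of indexing straight into a 26-slot table:)

        struct TrieNode {
            char letter;
            bool isWord;
            unsigned char numChildren;   // how many children this node actually has
            TrieNode** children;         // dynamic array holding exactly that many pointers
        };
        // About ten bytes for the node itself, plus (numChildren * 4) bytes out in the array.

        // Linear scan over just the children that exist; a small constant factor per step.
        TrieNode* findChild(const TrieNode* n, char ch) {
            for (int i = 0; i < n->numChildren; i++) {
                if (n->children[i]->letter == ch) return n->children[i];
            }
            return NULL;
        }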
  • 29:14 - 29:15
    So the space issue with this
  • 29:15 - 29:19
  • 29:19 - 29:23
    is - we're just not storing all those nulls. And there were a lot of nulls. That
  • 29:23 - 29:25
    the node itself is now just ten bytes, and then there's some space in this
  • 29:25 - 29:27
    dynamic array out there, which is the number of children
  • 29:27 - 29:31
    times the four-byte pointer,
  • 29:31 - 29:33
    that on average, a node has six children. Across the
  • 29:33 - 29:36
    250,000 nodes in the
  • 29:36 - 29:39
    data structure that's being built from this, so really 26 was way
  • 29:39 - 29:40
    overkill. About 20 of them had nulls
  • 29:40 - 29:42
    for most of the nodes.
  • 29:42 - 29:46
    And many of them, for example, at the very bottom of the tree,
  • 29:46 - 29:49
    have completely none. They're all nulls. So you get to a word like
  • 29:49 - 29:52
    ''straightforwardly.'' There's nothing after that ''y,''
  • 29:52 - 29:53
    so it has zero children coming out the back of that one.
  • 29:53 - 29:57
    And so certainly
  • 29:57 - 30:00
    having 26 null children pointing away from that was a
  • 30:00 - 30:01
    waste that we can tighten up.
  • 30:01 - 30:03
    So when we get it down to this we'll have
  • 30:03 - 30:04
  • 30:04 - 30:05
    44 bytes
  • 30:05 - 30:06
  • 30:06 - 30:08
    -
  • 30:08 - 30:11
    or 34 bytes per node.
  • 30:11 - 30:15
    Still a quarter million nodes. Still
  • 30:15 - 30:18
    eight and a half megabytes. So we're not winning any prizes just yet.
  • 30:18 - 30:22
    But we are making good progress. That got us down a factor of
  • 30:22 - 30:25
    three or four right there. So I still have the same
  • 30:25 - 30:28
    number of nodes. All I changed was the pointers to the
  • 30:28 - 30:32
    subsequent nodes further down the tree. So
  • 30:32 - 30:35
    I - the kind of letters and how the letters connect to each other - the connectivity is
  • 30:35 - 30:39
    still pretty much the same. It's just that in any particular node, how many children does it have?
  • 30:39 - 30:40
    I was storing a whole bunch of nulls and now I'm not storing those nulls.
  • 30:40 - 30:43
    So the number of nodes is still
  • 30:43 - 30:47
    exactly as it was before. I'm going
  • 30:47 - 30:49
    to do one other little squishing thing on this data structure
  • 30:49 - 30:51
    and then I'm going
  • 30:51 - 30:53
    to flip gears for a second.
  • 30:53 - 30:56
    So still using kinda my
  • 30:56 - 30:59
    CS side of things, how can I squash things down?
  • 30:59 - 31:03
    I'm going to use a technique that we used in the heap.
  • 31:03 - 31:04
    So in the case of the binary heap that we implemented into the
  • 31:04 - 31:07
    PQ,
  • 31:07 - 31:11
    you'll remember that we drew pictures in the handout that showed a heap with oh, I
  • 31:11 - 31:11
    have a parent with two children, which has four grandchildren and
  • 31:11 - 31:15
    so on.
  • 31:15 - 31:16
    But that we actually compressed that down into an array,
  • 31:16 - 31:20
    and in the array
  • 31:20 - 31:24
    there was this pattern of
  • 31:24 - 31:27
    here is the root node.
  • 31:27 - 31:30
    So we'll call this level zero.
  • 31:30 - 31:33
    And then the next two nodes
  • 31:33 - 31:36
    are level one.
  • 31:36 - 31:38
    And the next four nodes
  • 31:38 - 31:42
    are level two and so on. So
  • 31:42 - 31:45
    there's one here, two here, there's four here, there's eight here. And we kinda flattened it down
  • 31:45 - 31:50
    by generation, so looking at what was at the top and what was underneath it.
  • 31:50 - 31:52
    When we did that, we removed all the space for pointers.
  • 31:52 - 31:54
    Although we drew those pictures in the heap as though there were pointers
  • 31:54 - 31:57
    there, in fact, there were not.
  • 31:57 - 32:02
    There were no pointers being stored in the heap in this version of the PQ.
  • 32:02 - 32:02
    And so that gets us back to kind of array like storage requirements,
  • 32:02 - 32:06
    but
  • 32:06 - 32:09
    by flattening it down into this array, and still kinda using the tree
  • 32:09 - 32:10
    conceptually to work our way through it, we're getting the advantages of
  • 32:10 - 32:12
    that
  • 32:12 - 32:14
  • 32:14 - 32:15
    traversal that's based on height of tree,
  • 32:15 - 32:20
    not length of vector.
  • 32:20 - 32:22
    So if we do the same thing with the trie, it's a little bit more complicated
  • 32:22 - 32:25
    because there were a couple things about heap that made this a little
  • 32:25 - 32:27
    easier. Heap always had two children, so there were these rules about
  • 32:27 - 32:30
    how the tree was filled in that meant
  • 32:30 - 32:33
    you could always compute parent and child index using this kinda divide by two,
  • 32:33 - 32:36
    multiply by two exchange operation.
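    (As a reminder of that arithmetic, assuming the 1-based layout from the PQ assignment: the node stored at index i has its children at indexes 2i and 2i + 1 and its parent at i / 2, which is exactly what let the heap get away with storing no pointers at all.)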
  • 32:36 - 32:39
    In this case, what we're going to have to do is just lay it down by generation,
  • 32:39 - 32:41
    but we don't know how big a generation's gonna be, so we're actually gonna have to keep
  • 32:41 - 32:43
    track of it manually.
  • 32:43 - 32:46
    So the root node will have
  • 32:46 - 32:49
    all A through Z. So this will be
  • 32:49 - 32:53
    level one, which is A through Z here,
  • 32:53 - 32:57
    and then level two is - there'll be a bunch of slots here, which is the
  • 32:57 - 33:00
    A followed by some letter, and then there'd be the B's followed by some letter, and
  • 33:00 - 33:03
    then there'd be the C's followed by some letter, and so on.
  • 33:03 - 33:06
    And then further down is level three, which are the three-character words. So these are all
  • 33:06 - 33:09
    the single-character paths,
  • 33:09 - 33:12
    two-character paths, three-character paths, kinda flattened down.
  • 33:12 - 33:15
    And so if I look at what I changed here, my first node
  • 33:15 - 33:18
    says it has no letter - that's the root node - it's not a word,
  • 33:18 - 33:20
    and it says its children begin at index one. So
  • 33:20 - 33:23
    the next location. So it's not a
  • 33:23 - 33:26
    multiply by two and add one or something like that. It's okay, here's
  • 33:26 - 33:30
    where my children begin. They're at slot one in the array.
  • 33:30 - 33:31
    And then there are 26 of them. So that means that A, B, C, D are kinda in order there.
  • 33:31 - 33:33
    If I look at A,
  • 33:33 - 33:36
    ''a'' is a word,
  • 33:36 - 33:39
    and it says its children begin at index 27
  • 33:39 - 33:44
    and there are 12 of them, so there are 12 characters out of the 26 that
  • 33:44 - 33:47
    can follow an ''a'' in the sequence of a valid word. And
  • 33:47 - 33:48
    then B's children begin at index 39, which is 27 plus 12. So that's
  • 33:48 - 33:50
    where
  • 33:50 - 33:54
    A's children are sitting here at index
  • 33:54 - 33:56
    27 to 38 and then 39 to - I think B has eight or so children -
  • 33:56 - 33:59
    and C and so on.
  • 33:59 - 34:00
    So we flattened it all down by generation
  • 34:00 - 34:01
    into
  • 34:01 - 34:04
    an array.
  • 34:04 - 34:07
    That means all the space for pointers
  • 34:07 - 34:10
    has now been removed.
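    (A sketch of the flattened node, with made-up field names: instead of pointers, each node just records where its block of children sits in the one big array, and a lookup step scans that little range:)

        #include <string>
        #include <vector>
        using namespace std;

        struct FlatNode {
            char letter;
            bool isWord;
            int firstChild;    // index in the same array where this node's children start
            int numChildren;   // how many consecutive entries those children occupy
        };

        // Trace a string letter by letter through the flattened array; -1 means no such path.
        int findIndex(const vector<FlatNode>& nodes, const string& s) {
            int cur = 0;                                   // slot 0 holds the root
            for (size_t i = 0; i < s.size(); i++) {
                int match = -1;
                for (int c = 0; c < nodes[cur].numChildren; c++) {
                    int child = nodes[cur].firstChild + c;
                    if (nodes[child].letter == s[i]) { match = child; break; }
                }
                if (match == -1) return -1;
                cur = match;
            }
            return cur;
        }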
  • 34:10 - 34:12
    One serious consequence of this, though,
  • 34:12 - 34:14
    is that the pointers got us flexibility, that
  • 34:14 - 34:15
    insert and remove
  • 34:15 - 34:19
  • 34:19 - 34:22
    were dependent on the idea of pointers really being there, pointers buying us
  • 34:22 - 34:25
    the ability to insert and delete without doing a shuffle.
  • 34:25 - 34:27
    That now, once I have flattened it, I have suddenly bought back the same
  • 34:27 - 34:29
    problems that arrays have, which is
  • 34:29 - 34:33
    if I have the words here that have
  • 34:33 - 34:36
    ''ab'' for abalone, and ''ac'' for act and I don't happen to have any words with ''ad.'' And
  • 34:36 - 34:40
    I try to add the word ''add,''
  • 34:40 - 34:43
    that in order to put ''ad'' in there, we're going to have to move everything over and then
  • 34:43 - 34:46
    there's going to be a lot of shuffling and what not.
  • 34:46 - 34:48
    So I'm starting to build a data structure that's losing some flexibility but
  • 34:48 - 34:49
    saving space.
  • 34:49 - 34:53
    So
  • 34:53 - 34:54
    I'm, in some sense, focusing my attention on how can I build a data structure that's really fast
  • 34:54 - 34:56
    to search,
  • 34:56 - 35:00
    even if it was a little harder to build?
  • 35:00 - 35:02
    Because it turns out the thing I most care about is making sure it can
  • 35:02 - 35:07
    actually very, very efficiently look up words.
  • 35:07 - 35:11
    Maybe building it once, being a little slow, isn't so painful.
  • 35:11 - 35:14
    So the space usage in this gets us down to just ten bytes per
  • 35:14 - 35:16
    node. All the pointers are gone. So there was 24 bytes of pointers that we've
  • 35:16 - 35:19
    just eliminated.
  • 35:19 - 35:21
    And that means we've got a two-and-a-half-megabyte
  • 35:21 - 35:28
    memory footprint right now.
  • 35:28 - 35:29
    So we're kinda getting close to hitting the range for binary tree and hash tables.
  • 35:29 - 35:33
    Now I'm
  • 35:33 - 35:37
    going to go back to the domain again. So I worked on it kinda from the CS side, thinking
  • 35:37 - 35:39
    about things I know about data structures and
  • 35:39 - 35:41
    reorganizing arrays and heaps and pointers. I'm going
  • 35:41 - 35:43
    to go back and look at that domain again
  • 35:43 - 35:46
    and see if there's something else I
  • 35:46 - 35:48
    can kinda take advantage of.
  • 35:48 - 35:50
    So this is a different
  • 35:50 - 35:52
    cross-section of the dictionary
  • 35:52 - 35:57
    showing the words ''adapted,''
  • 35:57 - 36:02
    ''adaptor,'' ''adaptors,'' ''adapting,'' ''adaption,'' adaptions,'' ''adapts.''
  • 36:02 - 36:04
    So eight words that are all built off the ''adapt'' root with a bunch of suffixes. I look
  • 36:04 - 36:05
    sorta over a little bit further. I've got
  • 36:05 - 36:10
    ''desert,''
  • 36:10 - 36:15
    ''deserted,'' ''deserter,'' ''deserting,'' ''desertings,'' ''desertion,'' ''desertions,'' ''deserts.'' Okay,
  • 36:15 - 36:20
    and ''detect'' and ''insert'' and ''perfect'' and ''invent'' and ''interrupt''
  • 36:20 - 36:23
    that all are verbs for which you can apply the past tense and the
  • 36:23 - 36:26
    gerund form and the ''ion's'' that
  • 36:26 - 36:29
    tack onto that root word to give you these suffixes.
  • 36:29 - 36:32
    And each of these words that has shown up here has exactly the same set of suffixes
  • 36:32 - 36:33
    that can be applied to it.
  • 36:33 - 36:35
    So I
  • 36:35 - 36:37
    did this thing about prefixes. I
  • 36:37 - 36:42
    went out of my way to find that -
  • 36:42 - 36:45
    so all these words could share that same prefix, ''assert,'' ''asserter,'' ''assertings,'' ''asserting.''
  • 36:45 - 36:47
    But is there some way that I could take this idea of ''corrupt''
  • 36:47 - 36:50
    and ''assert'' and ''insert,''
  • 36:50 - 36:54
    and then once I get down to that ''t,'' where they end, is there a way where they can also
  • 36:54 - 36:57
    share their backend parts? That those ''ing's'' and ''ion's,'' can they be
  • 36:57 - 36:58
    unified, as well? Well,
  • 36:58 - 37:00
    why not?
  • 37:00 - 37:03
    Why not? So here is
  • 37:03 - 37:06
    the first version of what's going to be the DAWG,
  • 37:06 - 37:07
    the Directed Acyclic Word
  • 37:07 - 37:10
    Graph.
  • 37:10 - 37:12
    It takes the idea of unifying the prefixes and now applies it to the
  • 37:12 - 37:14
    suffixes.
  • 37:14 - 37:17
    That there are a lot of
  • 37:17 - 37:20
    words that end in the same letters, as well as start in the same letters. Why not just apply
  • 37:20 - 37:22
    the same kind of unification and sharing on the back end?
  • 37:22 - 37:24
    So here comes in
  • 37:24 - 37:26
    ''adapt'' and ''adopt''
  • 37:26 - 37:30
    and ''baste'' and ''port''
  • 37:30 - 37:33
    and out the back end is ''ports'' and ''portion'' and ''porting'' and ''porter'' and ''adopter''
  • 37:33 - 37:37
    and ''adopted'' and ''basting'' and ''bastion,''
  • 37:37 - 37:40
    all coming off that sharing all the endings, because they have the same
  • 37:40 - 37:41
    words that are viable and the same connections that are
  • 37:41 - 37:46
    between them. So
  • 37:46 - 37:49
    the idea is that they're all sharing the ''er,'' and the ''er'' leading in to the ''s.'' And that same ''s,'' for
  • 37:49 - 37:50
    example, being shared from ''portion'' and ''baste'' and
  • 37:50 - 37:51
    ''adopt'' and
  • 37:51 - 37:54
    ''adopter,''
  • 37:54 - 37:57
    all coming into the same ending of oh, these things that end in ''s'' are words. Here's an ''s'' that is a
  • 37:57 - 37:59
    word for them to share.
  • 37:59 - 38:01
    So it is directed. So
  • 38:01 - 38:04
    it is a graph.
  • 38:04 - 38:05
    So once we remove this - the idea
  • 38:05 - 38:09
    that there's a single path
  • 38:09 - 38:12
    from the root to any node, we're no longer a tree structure. So we
  • 38:12 - 38:16
    can get to ''t'' by going through ''adapt'' or ''adopt'' or ''port,''
  • 38:16 - 38:19
    and so that no longer fits the definition of what is a tree. It's
  • 38:19 - 38:23
    become a graph. We can get to these nodes from different places.
  • 38:23 - 38:24
    The arcs go one way in this case. We can't go back up the tree
  • 38:24 - 38:27
    to find the words in reverse.
  • 38:27 - 38:31
    There are no cycles. So acyclic, right, you can't just
  • 38:31 - 38:33
    wind around some portion and get as many ''na''s in ''banana'' as you want.
  • 38:33 - 38:34
    And
  • 38:34 - 38:37
    there are
  • 38:37 - 38:44
    certain nodes that are marked, in this case, the same way they were in the
  • 38:44 - 38:45
    trie. This path from the root node to here does represent a word in this
  • 38:45 - 38:48
    situation.
  • 38:48 - 38:53
    So building this is a little bit complicated. I'll give you a little bit
  • 38:53 - 38:56
    of the idea of the intuition for it, and I will not get too deep into it.
  • 38:56 - 39:00
    But largely the way you do it is you build the trie,
  • 39:00 - 39:01
    so that it doesn't have the unification of the suffixes; it only unifies the
  • 39:01 - 39:03
    prefixes. And
  • 39:03 - 39:07
    then you work backwards
  • 39:07 - 39:09
    on the suffixes. So in some sense you look at all the words that end, for example, in
  • 39:09 - 39:12
    ''s'' in the dictionary.
  • 39:12 - 39:15
    And you find all the ones that just end in one ''s'' that goes nowhere, and you say
  • 39:15 - 39:17
    look, this ''s'' looks like every other ''s'' in this thing. All these words
  • 39:17 - 39:20
    that are just plurals or
  • 39:20 - 39:23
    verbs that can have an ''s'' tacked on. Let's share that ''s.''
  • 39:23 - 39:25
    And then so you take however many s's there were, 1,000 of them, and you
  • 39:25 - 39:29
    say these are all the same ''s.''
  • 39:29 - 39:31
    And then what you do is you look at all the nodes that lead into that ''s.'' So you're kinda
  • 39:31 - 39:33
    traversing the graph in reverse, and you say well look,
  • 39:33 - 39:36
    here's a bunch of people who lead in on an ''e.''
  • 39:36 - 39:37
    And if that ''e'' is a word or not, there might be some group that has ''e''
  • 39:37 - 39:38
    that's also a word,
  • 39:38 - 39:40
    so like
  • 39:40 - 39:42
    the word ''apple'' and ''apples.''
  • 39:42 - 39:46
    But sometimes the ''e'' is not,
  • 39:46 - 39:49
    where that wasn't a stopping place, let's say, for ''parties'' it wasn't.
  • 39:49 - 39:53
    So all the ones that are a word you can unify on the ''e'' that was a word, and all the ones that aren't
  • 39:53 - 39:54
    can unify on the ''e'' that wasn't. So you just work your way backwards from there. So you say
  • 39:54 - 39:55
    what letters led into that ''e?''
  • 39:55 - 39:58
    And so kinda from
  • 39:58 - 40:02
    the end of the words backwards, you're looking for all the people who can end in an
  • 40:02 - 40:05
    ''s,'' and end in an ''e,'' and end in an ''ies,'' and unify them up,
  • 40:05 - 40:07
    making sure that they have all the same outgoing connections. So for
  • 40:07 - 40:09
    example, ''running'' has
  • 40:09 - 40:13
    an ''ing'' at the end. So does ''swimming,''
  • 40:13 - 40:15
    but there's ''swimmingly'' and there's not ''runningly.'' So you have to be careful about
  • 40:15 - 40:18
    unifying things -
  • 40:18 - 40:20
    they would kinda have to have all identical properties from this point
  • 40:20 - 40:22
    down to the bottom of the tree
  • 40:22 - 40:25
    to be unifiable.
  • 40:25 - 40:28
    But it's a pretty neat process. It's actually what's called
  • 40:28 - 40:29
    DFA subset minimization, so if you want to learn a little bit about it, that's the
  • 40:29 - 40:31
    word to look up.
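The details are beyond the scope of the lecture, but the gist of that bottom-up unification can be sketched roughly as follows. This is a simplified illustration with assumed names, not the actual builder: give each subtree a signature built from its letter, its is-a-word flag, and its children's signatures, and merge any two subtrees whose signatures match, since they are identical from that point down.

```cpp
#include <map>
#include <string>
#include <vector>

struct Node {
    char letter;
    bool isWord;
    std::vector<Node*> children;    // trie children, prefixes already shared
};

// Returns the canonical (shared) node equivalent to n, filling in the
// signature that describes everything from n down to the leaves. Two
// subtrees merge only when letter, isWord, and all outgoing connections
// match -- the "identical from this point down" requirement.
Node* unify(Node* n, std::map<std::string, Node*>& seen, std::string& signature) {
    std::string sig;
    sig += n->letter;
    sig += n->isWord ? '1' : '0';
    for (Node*& child : n->children) {
        std::string childSig;
        child = unify(child, seen, childSig);   // unify the suffixes bottom-up
        sig += '(' + childSig + ')';
    }
    signature = sig;
    auto it = seen.find(sig);
    if (it != seen.end()) return it->second;    // an identical subtree already exists
    seen[sig] = n;
    return n;
}
```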
  • 40:31 - 40:35
    And so if I do that,
  • 40:35 - 40:36
    and I build this as a DAWG, I'm still using the same structure definition here,
  • 40:36 - 40:40
  • 40:40 - 40:44
    but what I am doing is unifying suffixes as well as prefixes
  • 40:44 - 40:46
    and that, actually, is gonna change the number of nodes that I need drastically.
  • 40:46 - 40:49
    It goes down to 80,000
  • 40:49 - 40:52
    from my initial 250,000.
  • 40:52 - 40:55
    And if you think about the number of words again,
  • 40:55 - 40:57
    there were 125,000 words in the dictionary,
  • 40:57 - 41:01
    the DAWG has just 80,000 nodes. So it means, actually, once you start
  • 41:01 - 41:04
    unifying the suffixes, it's like most words don't even have -
  • 41:04 - 41:07
    or many of the words don't even have nodes of their own that are unique to
  • 41:07 - 41:11
    them whatsoever. They just exist kinda as portions of other words
  • 41:11 - 41:12
    that have all been joined together and
  • 41:12 - 41:15
    compressed into one package.
  • 41:15 - 41:18
    So 80,000 nodes total, about
  • 41:18 - 41:21
    two-thirds of the number of words total.
  • 41:21 - 41:23
    So we have ten nodes - ten-byte nodes
  • 41:23 - 41:27
    and 80,000 of them.
  • 41:27 - 41:31
    We are down to under a megabyte. So we've crossed that little interesting barrier of
  • 41:31 - 41:34
    the data structure on disk. The data that was fed into it, those words, was over a
  • 41:34 - 41:36
    megabyte and now, actually, we have created a representation that actually is
  • 41:36 - 41:40
    compressed, that uses less space
  • 41:40 - 41:42
    than the text file did in the first place. And that's really because it is
  • 41:42 - 41:44
    finding all these ways in which words are like each other
  • 41:44 - 41:46
    and sharing,
  • 41:46 - 41:48
    making use of the existing
  • 41:48 - 41:50
    structure in the
  • 41:50 - 41:51
    DAWG to add new words
  • 41:51 - 41:53
    without
  • 41:53 - 41:56
    adding bulk.
  • 41:56 - 41:57
    So once I've done this I have to say that I've made the data structure
  • 41:57 - 42:01
    even less flexible.
  • 42:01 - 42:03
    So to be clear, each of these steps has a cost somewhere else that made the code more
  • 42:03 - 42:07
    complicated. It also made it so that
  • 42:07 - 42:10
    now that I have all this sharing back here, that when I'm adding a new word, so I go
  • 42:10 - 42:14
    back to look at this picture here, that
  • 42:14 - 42:15
    I go in to add a word and I say oh, actually, the word
  • 42:15 - 42:17
    ''portioning'' is a word.
  • 42:17 - 42:19
    Well, ''adoptioning'' is not.
  • 42:19 - 42:22
    So
  • 42:22 - 42:26
    that will necessitate if I go in to add the word ''portioning,'' that I actually have
  • 42:26 - 42:30
    to split it off of its existing unified branch, because if I just tacked on the
  • 42:30 - 42:33
    ''ing'' on the end of ''portioning'' there, that it turns out that ''adaptioning''
  • 42:33 - 42:37
    and ''adoptioning'' would suddenly become words, too. So actually I have to
  • 42:37 - 42:38
    be extra careful once I've got it unified this way that when I add new words that I don't
  • 42:38 - 42:39
    actually
  • 42:39 - 42:42
    accidentally
  • 42:42 - 42:45
    add some additional words that I didn't plan on because these paths have
  • 42:45 - 42:49
    been shared. So it requires a little bit more careful management. Again,
  • 42:49 - 42:50
    the goal, though, here, in this case, was to try to build a data structure that was kinda freeze-dried.
  • 42:50 - 42:53
    So I
  • 42:53 - 42:55
    build it off the data structure - the input data file I had,
  • 42:55 - 42:57
    that then would operate very, very efficiently,
  • 42:57 - 43:00
    and I wasn't planning on making a lot
  • 43:00 - 43:01
    of runtime changes to it.
  • 43:01 - 43:04
    I get this down to this
  • 43:04 - 43:08
    and then the last thing I'm going to do to it
  • 43:08 - 43:11
    is go back to kinda the CS side of it and see if I can squeeze it down kinda
  • 43:11 - 43:15
    in terms of just data structure, what's being stored per node.
  • 43:15 - 43:18
    So I have 80,000 of these. So
  • 43:18 - 43:22
    typically, when you look at a structure,
  • 43:22 - 43:24
    it's very rare that the idea if you change something from an int to a
  • 43:24 - 43:28
    char, so it was a four-byte thing into a one-byte thing,
  • 43:28 - 43:32
    that you're going to have any profound impact on the total size needed.
  • 43:32 - 43:33
    But if you happen to know you're going to make a lot of these structures, tens of thousands,
  • 43:33 - 43:35
    almost 100,000 of them,
  • 43:35 - 43:38
    then it turns out every byte counts.
  • 43:38 - 43:40
    If I can get this thing down from, in this case,
  • 43:40 - 43:42
    ten bytes to eight bytes, or to
  • 43:42 - 43:46
    six bytes or four bytes,
  • 43:46 - 43:49
    then it will cut my memory by quite a factor.
  • 43:49 - 43:52
    So what I did in the final version of it
  • 43:52 - 43:58
    is I used a little technique called bit packing,
  • 43:58 - 44:00
    which is to actually stash stuff in the absolute minimum possible space
  • 44:00 - 44:04
    that I can fit it into,
  • 44:04 - 44:04
    taking advantage of things like, for example, if there are only 26 letters in the
  • 44:04 - 44:08
  • 44:08 - 44:10
    alphabet, a character is actually a full eight-bit value, which can hold
  • 44:10 - 44:13
    numbers from zero to 255.
  • 44:13 - 44:15
    And I just don't need that much space. I'm not using the yen sign, I'm not using the parentheses,
  • 44:15 - 44:19
    I'm not using the digits.
  • 44:19 - 44:22
    So having, in some sense, the capacity to store all those distinct characters is actually
  • 44:22 - 44:26
    a cost I'm paying on every character that I don't need to.
  • 44:26 - 44:29
    So in fact, I squeezed it down by dividing things up at the sub-byte level. This
  • 44:29 - 44:30
    is something that's talked about a little bit in Chapter 15 if you're at all curious. It's
  • 44:30 - 44:33
    not what I
  • 44:33 - 44:34
    consider core material for 106B, but I'm just sorta mentioning it to kinda
  • 44:34 - 44:36
    complete the picture.
  • 44:36 - 44:38
    Where five bits,
  • 44:38 - 44:41
    given that a bit is either zero or one,
  • 44:41 - 44:45
    that I have five of them connected together, then I have 2^5
  • 44:45 - 44:45
    different patterns that I can represent in that
  • 44:45 - 44:48
  • 44:48 - 44:50
    amount of capacity there, and that's 32 different patterns, which is enough for
  • 44:50 - 44:54
    my characters, plus a little slack.
  • 44:54 - 44:56
    I use exactly one bit for is word, rather than a full Boolean. All I need
  • 44:56 - 44:59
    to know is yes or no.
  • 44:59 - 45:02
    I also changed the way I did
  • 45:02 - 45:04
    knowing that you were the last child of a generation, rather than having
  • 45:04 - 45:06
    A say, ''there are 12 children here.''
  • 45:06 - 45:10
    I just said, go to index 27
  • 45:10 - 45:12
    and start reading, and one of them will tell you it's the last. So each of them says, ''No,
  • 45:12 - 45:16
    I'm not the last,'' ''No, I'm not the last.'' And then when you get to the last one, it
  • 45:16 - 45:18
    will be marked as ''yes,'' and that's how you'll know - that's where A ends and the
  • 45:18 - 45:20
    next one begins.
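Putting those pieces together, the per-node packing might look something like the bitfield below. The field names and widths are assumptions for illustration, not the exact lexicon.dat layout, but they show how a letter, the is-a-word flag, the last-child marker, and a child index can share four bytes.

```cpp
#include <cstdint>

// One DAWG node squeezed into 32 bits. 5 bits cover the 26 letters
// (2^5 = 32 patterns, a little slack), single bits flag "this completes
// a word" and "this is the last child in its run", and the remaining
// bits index where this node's children start in the array.
struct PackedNode {
    uint32_t letter      : 5;
    uint32_t isWord      : 1;
    uint32_t isLastChild : 1;
    uint32_t firstChild  : 25;
};

static_assert(sizeof(PackedNode) == 4, "node should pack into four bytes");
```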
  • 45:20 - 45:21
    And so if I can get it down in this way
  • 45:21 - 45:24
  • 45:24 - 45:28
    I've taken
  • 45:28 - 45:31
    what was a ten-byte structure down to four. And it turns out there's often
  • 45:31 - 45:35
    reasons why trying to get it down to a power of two is actually
  • 45:35 - 45:37
    a critical barrier. That if it were five bytes, it's actually likely to be eight,
  • 45:37 - 45:39
    just because of what's called alignment and padding restrictions.
  • 45:39 - 45:42
    So if I were at
  • 45:42 - 45:45
    seven and I squeeze down to six, it might not make any change. And if I squeeze down to five, it
  • 45:45 - 45:47
    still might not make any change. It might be, actually, artificially decided that
  • 45:47 - 45:50
    each of the structures is going to be eight behind the scenes.
  • 45:50 - 45:54
    But getting it down to four actually tends to be another one of those critical barriers where
  • 45:54 - 45:56
    okay, four actually will be smaller than something that was ten. And
  • 45:56 - 45:58
    so I take my 80,000 nodes,
  • 45:58 - 46:03
    I get them to be four bytes each,
  • 46:03 - 46:06
    and then I now have a data structure that's about a third of a megabyte in memory,
  • 46:06 - 46:09
    as opposed to the over a megabyte on disk
  • 46:09 - 46:11
    in the text form.
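(Roughly: 80,000 nodes at four bytes each is about 320 KB, a third of a megabyte.)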
  • 46:11 - 46:12
    And so that's how it does what it does. So if you're going to look at the code,
  • 46:12 - 46:16
    you'll see that
  • 46:16 - 46:17
    there is this complicated bit pack structure that
  • 46:17 - 46:20
    it does
  • 46:20 - 46:24
    lookups character by character using a recursive
  • 46:24 - 46:24
    strategy of start at the top, find the matching character, then go to the children index,
  • 46:24 - 46:28
    and then
  • 46:28 - 46:33
    work your way down the children to find the next matching character, and so on
  • 46:33 - 46:37
    to find its way through the data structure.
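That character-by-character walk can be sketched roughly like this, using illustrative field names and an iterative rather than recursive formulation: scan the current run of siblings for the matching letter, then jump to that node's children and repeat.

```cpp
#include <string>
#include <vector>

struct FlatNode {
    char letter;
    bool isWord;
    bool isLastChild;   // marks the end of a sibling run
    int  firstChild;    // index of the first child, or -1 if none
};

bool containsWord(const std::vector<FlatNode>& nodes, const std::string& word) {
    if (word.empty()) return false;
    int index = 0;                       // the top-level letters start the array here
    for (std::size_t i = 0; i < word.size(); i++) {
        // Walk along the siblings until we find the matching letter.
        while (nodes[index].letter != word[i]) {
            if (nodes[index].isLastChild) return false;   // ran off the end: no match
            index++;
        }
        if (i + 1 == word.size()) return nodes[index].isWord;
        if (nodes[index].firstChild < 0) return false;    // no children to descend into
        index = nodes[index].firstChild;                   // descend a generation
    }
    return false;   // not reached
}
```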
  • 46:37 - 46:41
    So [inaudible] is where the child index, or for example, the idea that
  • 46:41 - 46:42
    A says my children start at 27 or B starts at 39,
  • 46:42 - 46:45
    and so I just need
  • 46:45 - 46:51
    a number there, but I'm using
  • 46:51 - 46:55
    something slightly smaller than an integer, just to make them all fit very tightly. So I'll tell
  • 46:55 - 46:57
    you a couple of interesting things about it, actually, before I close this off, which is the
  • 46:57 - 47:02
    lexicon.dat file
  • 47:02 - 47:03
    actually just is an in-memory representation of what the DAWG data
  • 47:03 - 47:07
    structure looks like.
  • 47:07 - 47:11
    And so in fact, it's very easy to take the whole data structure. It doesn't have any pointers in it. Pointers, actually, tend to be a
  • 47:11 - 47:15
    little bit - make data structures harder to rebuild and
  • 47:15 - 47:17
    to dehydrate to disk because they have to kinda be wired up in memory. You don't know what
  • 47:17 - 47:20
    memory you're going to get from new and what the addresses are.
  • 47:20 - 47:22
    This one, actually, doesn't have any pointers. Everything is actually kinda self-contained. It's
  • 47:22 - 47:26
    just one big array of these blocks.
  • 47:26 - 47:27
    And all the information is relative in there. It says go to the index 27, index
  • 47:27 - 47:30
    45.
  • 47:30 - 47:34
    So you can take that whole block and just write it to disk, 300 kilobytes worth,
  • 47:34 - 47:36
    read it back in 300 kilobytes worth and it kinda is rehydrated
  • 47:36 - 47:39
    without any extra effort. So
  • 47:39 - 47:42
    that's actually what the lexicon.dat file is, it's just a dehydrated form of
  • 47:42 - 47:44
    the DAWG having been
  • 47:44 - 47:48
    built in memory. That makes it very, very fast to load.
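A rough sketch of that dehydrate/rehydrate step, assuming a flat array of fixed-size, pointer-free nodes (the function and type names here are made up for illustration): one bulk write saves the whole structure, and one bulk read brings it back with no pointer fix-up.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct PackedNode { uint32_t bits; };   // stand-in for the 4-byte packed node

// Write the entire node array to disk in one shot.
void saveDawg(const std::vector<PackedNode>& nodes, const std::string& filename) {
    std::ofstream out(filename, std::ios::binary);
    out.write(reinterpret_cast<const char*>(nodes.data()),
              nodes.size() * sizeof(PackedNode));
}

// Read it straight back; because everything is index-based, the structure
// is usable immediately after the read.
std::vector<PackedNode> loadDawg(const std::string& filename, std::size_t count) {
    std::vector<PackedNode> nodes(count);
    std::ifstream in(filename, std::ios::binary);
    in.read(reinterpret_cast<char*>(nodes.data()), count * sizeof(PackedNode));
    return nodes;
}
```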
  • 47:48 - 47:49
    At that point, though, it's super inflexible. If you want to add a word to
  • 47:49 - 47:52
    it,
  • 47:52 - 47:56
    the way the structure is built, you'd have to kinda add nodes and make sure it wasn't
  • 47:56 - 47:58
    glomming on to some existing words that were nearby, and so it would be a lot of work to
  • 47:58 - 48:01
    actually edit the DAWG in place.
  • 48:01 - 48:04
    In fact, when you ask it to add a word,
  • 48:04 - 48:05
    it actually doesn't edit that one at all.
  • 48:05 - 48:09
    You say I'd like to put a new word in.
  • 48:09 - 48:11
    It says I'm just - this one's so perfect the way it is. I've built this one, I love this
  • 48:11 - 48:14
    one, I'm very happy with it.
  • 48:14 - 48:16
    If you want to put some other words into it, it just keeps a second data structure
  • 48:16 - 48:21
    on the side, an
  • 48:21 - 48:21
    auxiliary one over here, where it sticks the additional words you ask it to put in. I
  • 48:21 - 48:24
    have
  • 48:24 - 48:28
    to say that seems like a very obvious thing to do, but in fact, I didn't even think of
  • 48:28 - 48:31
    that for years. We had the lexicon. I said you can't use the lexicon to make new
  • 48:31 - 48:32
    word lists or edit them. You can only load this one, beautiful, perfect one. You can't do
  • 48:32 - 48:35
    anything else. And
  • 48:35 - 48:38
    at some point people kept saying, ''I wish I could put some words in the lexicon. I wish I could use it
  • 48:38 - 48:41
    for the word list to keep track of the human and computer players.'' I'm like, ''Well, you just
  • 48:41 - 48:41
    can't do it. You don't understand. It's very complicated.
  • 48:41 - 48:45
    You
  • 48:45 - 48:50
    just don't understand my pain.'' Whatever. Whatever. I had excuses. And then, really,
  • 48:50 - 48:51
    after some number of years of just being a baby about it, I'm like oh, of course, why don't I
  • 48:51 - 48:54
    just
  • 48:54 - 48:57
    keep another structure off to the side? It's an abstraction. It's a word list. You don't need
  • 48:57 - 49:00
    to know what I do. The fact that I keep a DAWG for some of the words
  • 49:00 - 49:04
    and just an ordinary set for some of the other words
  • 49:04 - 49:06
    is really completely irrelevant. And as long as when you ask for words I
  • 49:06 - 49:08
    look in both and I tell you, ''Yes, it's there,'' why do you need to know
  • 49:08 - 49:11
    where I keep it? Okay, so it took
  • 49:11 - 49:13
    me a while to buy abstraction, even though I teach it all the time. It just
  • 49:13 - 49:17
    wasn't totally making sense.
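The abstraction being described can be sketched as two containers behind one interface. The class and member names are illustrative, not the actual Lexicon implementation: the frozen DAWG answers queries about the shipped dictionary, and an ordinary set quietly holds anything added later.

```cpp
#include <set>
#include <string>

class Lexicon {
public:
    // New words never touch the frozen DAWG; they go in the side structure.
    void add(const std::string& word) { extraWords.insert(word); }

    // A word "is in the lexicon" if either structure knows about it --
    // the caller never needs to know which one answered.
    bool contains(const std::string& word) const {
        return dawgContains(word) || extraWords.count(word) > 0;
    }

private:
    // Placeholder for the walk over the packed node array sketched earlier.
    bool dawgContains(const std::string& word) const { (void)word; return false; }

    std::set<std::string> extraWords;   // ordinary set for runtime additions
};
```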
  • 49:17 - 49:19
    So a neat little thing, just kinda keeping both behind. And so it turns out that I
  • 49:19 - 49:21
    learned about this, actually, from a paper
  • 49:21 - 49:22
    that I read at the - in the
  • 49:22 - 49:23
  • 49:23 - 49:27
    early '90s about somebody
  • 49:27 - 49:30
    building a Scrabble player, and they wanted a dictionary that
  • 49:30 - 49:33
    supported kinda finding plays on the board and being able to extend words. And so
  • 49:33 - 49:34
    knowing things about root words and how they extend and prefixes and suffixes was kinda the
  • 49:34 - 49:38
    original motivation.
  • 49:38 - 49:39
    And this kind of data structure is actually really great for any kind of word playing.
  • 49:39 - 49:42
    So you have
  • 49:42 - 49:43
    anagrams that you want to rearrange or words that you want to extend past suffix and
  • 49:43 - 49:46
    prefix -
  • 49:46 - 49:50
    this kind of data structure is really optimized for
  • 49:50 - 49:52
    doing that kind of traversal and kinda leveraging the existing patterns
  • 49:52 - 49:56
    in words to save space while doing so.
  • 49:56 - 50:00
    So it was a really fun little project to do. It was kinda neat to think about. Even though today,
  • 50:00 - 50:02
    if you needed a lexicon, you could just use a set. It's fast enough. It takes
  • 50:02 - 50:05
    a ton of memory, but you know what, memory's cheap.
  • 50:05 - 50:06
    So this is not the kind of thing you would need to do today except when you were in some kinda
  • 50:06 - 50:07
    embedded,
  • 50:07 - 50:11
    low memory,
  • 50:11 - 50:14
    very, very tight processor environment, but it still makes for kinda an interesting sort of artifact
  • 50:14 - 50:18
    to think about well, how you get there and what kind of things you can use.
  • 50:18 - 50:19
    So Wednesday we will talk about some big picture design, some kinda wrap-up stuff, and I will
  • 50:19 - 50:22
    see
  • 50:22 - 50:23
    you
Title:
Lecture 25 | Programming Abstractions (Stanford)
Description:

Lecture 25 by Julie Zelenski for the Programming Abstractions Course (CS106B) in the Stanford Computer Science Department.

Julie examines a case study, opening up the lexicon file, which is complicated; she walks the students through the code and explains why she wrote it as she did as opposed to a sorted vector or binary search tree. She then introduces the DAWG data structure.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=FE6E58F856038C69

CS 106B Course Website:
http://cs106b.stanford.edu

Stanford Center for Professional Development:
http://scpd.stanford.edu/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanforduniversity/

Video Language:
English
Duration:
50:36