
Lecture 25 | Programming Abstractions (Stanford)

  • 0:00 - 0:12
  • 0:12 - 0:16
  • 0:16 - 0:25
    This presentation is delivered by the Stanford Center for Professional
  • 0:25 - 0:28
    Development.
  • 0:28 - 0:32
    The piece I need is not here, so I'm going to actually just talk to you about what
  • 0:32 - 0:34
    it says on my slides and then when Jason arrives we'll plug it back in and see what's going on.
  • 0:34 - 0:36
    Oh, there's Jason. There we go.
  • 0:36 - 0:40
    So what I'm going to do today is actually something kinda fun. And so I'm
  • 0:40 - 0:41
    gonna take a case study of a particular data structure
  • 0:41 - 0:45
    design,
  • 0:45 - 0:49
    and the one I'm looking at there is the lexicon.
  • 0:49 - 0:53
    So how many of you actually ever decided to just open up the lexicon file, just to see
  • 0:53 - 0:55
    what was in there at some point during the quarter? We used it a couple of times.
  • 0:55 - 1:00
    Anybody want to tell us what they learned by opening up that file?
  • 1:00 - 1:04
    Anything interesting?
  • 1:04 - 1:05
    It's a data structure called a DAWG. You're like, well that's a pretty funny name for something.
  • 1:05 - 1:10
    And
  • 1:10 - 1:14
    was the code particularly well-[inaudible], easy to read, easy to understand?
  • 1:14 - 1:18
    No. Who wrote that code? She should totally be fired.
  • 1:18 - 1:22
    Yeah, so this is what I wrote kinda a long time ago, actually, to solve this particular problem. And actually I would like to walk you through
  • 1:22 - 1:28
    kinda how it is I came to discover the DAWG and why it is the DAWG was the
  • 1:28 - 1:29
    choice that made sense for the usage pattern we had for the lexicon.
  • 1:29 - 1:30
    So let me just refresh what
  • 1:30 - 1:34
    the lexicon does.
  • 1:34 - 1:36
    It is a special case of kinda a set or map. It is a container where you're going to stick things
  • 1:36 - 1:38
    in, they're going to be unique, right? There's no
  • 1:38 - 1:41
    need for any sort of duplicate.
  • 1:41 - 1:43
    The keys are always strings. And in fact, they're English words, so that's going to be an important
  • 1:43 - 1:47
    thing to remember, that
  • 1:47 - 1:48
    they're not just arbitrary sequences of letters, right? They have certain patterns to
  • 1:48 - 1:51
    them
  • 1:51 - 1:54
    and certain sizes that might actually be something we could capitalize on. When we know
  • 1:54 - 1:57
    something about the domain, we're able to kinda special-case or tune our
  • 1:57 - 1:58
    data structure to solve that problem
  • 1:58 - 1:59
    very well,
  • 1:59 - 2:03
    in particular.
  • 2:03 - 2:07
    It has no associated value, so it doesn't have dictionary definitions or synonyms or anything
  • 2:07 - 2:08
    like that, right? So it really just is exists or doesn't exist
  • 2:08 - 2:12
    that we were interested in supporting.
  • 2:12 - 2:15
    And it also has, in addition to the kind of being able to add words and check if a
  • 2:15 - 2:16
    word is contained, it's also this prefixing operation that we had talked a
  • 2:16 - 2:19
    little about
  • 2:19 - 2:23
    in Boggle, and that came back on the midterm. Like, why was prefixing important, right?
  • 2:23 - 2:25
    It was critical for pruning these dead-end paths, these searches on the
  • 2:25 - 2:26
    Boggle board that otherwise
  • 2:26 - 2:32
    would really have
  • 2:32 - 2:32
    taken a lot of time and kinda led to nothing of any value. So what I'm gonna talk you
  • 2:32 - 2:36
    through
  • 2:36 - 2:40
    is some of the obvious choices based on data structures we already know about,
  • 2:40 - 2:43
    and see where they would go. So this is a thought experiment. It's not that I wrote all
  • 2:43 - 2:45
    these to find out. Actually, I mostly did a lot of this on paper, trying to figure
  • 2:45 - 2:48
    out what options we had
  • 2:48 - 2:49
    and how they would play out and what kind of trade-offs I'd have to make to make them
  • 2:49 - 2:52
    work.
  • 2:52 - 2:55
    So the simplest case of any kind of collection is to think about
  • 2:55 - 2:58
    whether vector or array can do the job for you.
  • 2:58 - 3:00
    In this case, having a vector that's kept in sorted - alphabetically sorted - order seems
  • 3:00 - 3:03
    like a pretty good,
  • 3:03 - 3:05
    easy way to get this set up and running and how would it behave?
  • 3:05 - 3:06
    So in order to do a contains
  • 3:06 - 3:07
    (word)
  • 3:07 - 3:15
    on this data structure,
  • 3:15 - 3:17
    what algorithm are you going to use to do your searching? Binary search,
  • 3:17 - 3:21
    right? It's in sorted order, right? We've got to take advantage of that, right? If we do
  • 3:21 - 3:25
    a linear search and I'm looking for the word ''mediocre,'' I'm going to be trucking through,
  • 3:25 - 3:28
    you know, thousands of words before I find it. So definitely want to be using binary search.
  • 3:28 - 3:29
    Equals-equals, comparing my strings, less than, greater than, dividing in half. And so
  • 3:29 - 3:32
    it should run in logarithmic time.
  • 3:32 - 3:35
    Logarithmic time, one of those great performers that you
  • 3:35 - 3:38
    learned in the PQ assignment, hopefully. It's just basically not measurable on
  • 3:38 - 3:41
    today's computer. So even for entries of
  • 3:41 - 3:42
    100,000, 500,000, basically
  • 3:42 - 3:46
    for free.
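    (For a sense of scale: binary search over 100,000 entries needs only about log2(100,000), roughly 17, string comparisons, and even a million entries only takes about 20.)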
  • 3:46 - 3:50
    How do we do contains (prefix)? Can we also do a fast prefix search given this
  • 3:50 - 3:51
    data structure?
  • 3:51 - 3:54
    I hope so right? Thinking about
  • 3:54 - 3:57
    the same kind of binary search. In this case, though, comparing substrings, right? So
  • 3:57 - 3:59
    if I'm looking for a three-character prefix, I'm only looking at the
  • 3:59 - 4:02
    first three characters to decide which way to go.
  • 4:02 - 4:05
    Once I find something that matches in the first three characters, I'm done.
  • 4:05 - 4:07
    So again, still using the sorted property to quickly narrow in on where
  • 4:07 - 4:10
    it could be.
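    (A minimal sketch of what that sorted-vector search might look like in C++, using std::lower_bound for the binary search; the names are illustrative, not the actual lexicon code:)

        #include <algorithm>
        #include <string>
        #include <vector>
        using namespace std;

        // contains(word): plain binary search, O(log n) string comparisons.
        bool containsWord(const vector<string>& words, const string& w) {
            return binary_search(words.begin(), words.end(), w);
        }

        // contains(prefix): binary search to the first entry >= prefix, then
        // check whether that entry actually starts with the prefix characters.
        bool containsPrefix(const vector<string>& words, const string& prefix) {
            vector<string>::const_iterator it =
                lower_bound(words.begin(), words.end(), prefix);
            return it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
        }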
  • 4:10 - 4:14
    When it's time to add a new word to this,
  • 4:14 - 4:17
    if someone asked you to put in the word ''abalone,'' I gotta
  • 4:17 - 4:21
    use that binary search to find out where to go. So in logarithmic time I can
  • 4:21 - 4:23
    tell you where the new word goes, but then to put it into position,
  • 4:23 - 4:26
    there's gonna be some shuffling.
  • 4:26 - 4:28
    And that's where this operation
  • 4:28 - 4:32
    starts to bog down,
  • 4:32 - 4:33
    given the idea that you might be moving half or more of your items,
  • 4:33 - 4:36
    on average,
  • 4:36 - 4:40
    to make that space for that one to open up.
  • 4:40 - 4:42
    It is going to require a linear time operation to get new words into the data
  • 4:42 - 4:43
    structure. Okay,
  • 4:43 - 4:48
    probably still, though,
  • 4:48 - 4:51
    a reasonable approach, right? These are the operations that get done
  • 4:51 - 4:54
    immense numbers of times in Boggle, right? You're doing a ton of contains (word)'s and contains (prefix)'s
  • 4:54 - 4:58
    as you're exhaustively searching that big board.
  • 4:58 - 5:01
    But the add's are done kinda once at the beginning. You build that big dictionary and
  • 5:01 - 5:02
    then you subsequently use it. And so if it took a little bit more time to build
  • 5:02 - 5:06
    that thing,
  • 5:06 - 5:08
    sorta an n-squared operation to kinda load it up, well,
  • 5:08 - 5:11
    maybe that would be okay.
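    (And a sketch of the add side, reusing the includes above: finding the spot is logarithmic, but vector::insert has to shift every later entry over, which is the linear-time part being described here:)

        // add(word): O(log n) to locate the slot, O(n) to shuffle entries over.
        void addWord(vector<string>& words, const string& w) {
            vector<string>::iterator it = lower_bound(words.begin(), words.end(), w);
            if (it == words.end() || *it != w) {
                words.insert(it, w);   // the linear-time shuffle happens here
            }
        }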
  • 5:11 - 5:14
    The other thing we want to look at is space usage, and that's - and some of this is a little
  • 5:14 - 5:18
    bit of an interesting artifact of the time I was working on this. It turns
  • 5:18 - 5:21
    out space was actually a limiting factor in what we had available
  • 5:21 - 5:22
    to us, in terms of core memories of our machines and
  • 5:22 - 5:25
    what we could take up for our dictionary.
  • 5:25 - 5:28
    In this case, it has per entry the size of the string,
  • 5:28 - 5:29
    though we don't exactly know how the string is represented. It's an abstraction
  • 5:29 - 5:32
    to us, but
  • 5:32 - 5:35
    we can make kinda a simplifying assumption that
  • 5:35 - 5:37
    it's probably about the length of the
  • 5:37 - 5:39
    string times the size of a character. And there might be a little bit more
  • 5:39 - 5:40
    housekeeping associated with it, but
  • 5:40 - 5:42
  • 5:42 - 5:44
    that's a pretty good estimate of what it's taking. So in fact, it's the string data
  • 5:44 - 5:46
    itself
  • 5:46 - 5:50
    that's being represented in this vector
  • 5:50 - 5:51
    without much other overhead. So if we say words average about eight characters, that's
  • 5:51 - 5:54
    about eight bytes per word.
  • 5:54 - 5:57
    So if we have 100,000 words we expect to have 800,000 bytes
  • 5:57 - 6:00
    invested in this sort of vector arrangement.
  • 6:00 - 6:02
    Excess capacity, a couple other things that might tweak that a little bit, but that
  • 6:02 - 6:04
    gives us at least a good starting point to
  • 6:04 - 6:07
    think about how this
  • 6:07 - 6:09
    compares to alternatives. Okay,
  • 6:09 - 6:12
    so then if we said
  • 6:12 - 6:16
    we know that sortedness - that binary search is the
  • 6:16 - 6:17
    real advantage of keeping it in sorted order, but the contiguous nature of the array
  • 6:17 - 6:19
    kinda
  • 6:19 - 6:23
    bites us in terms of insert.
  • 6:23 - 6:26
    If we reorganize into that binary search tree to give ourselves the dynamic flexibility of
  • 6:26 - 6:27
    wiring in the left and right halves through pointers, as opposed to
  • 6:27 - 6:29
    contiguous memory,
  • 6:29 - 6:31
    then we know we can also get add to run
  • 6:31 - 6:35
    in logarithmic time. So
  • 6:35 - 6:36
    looking at these operations, contains (word) is a tree search that's using equals-equals,
  • 6:36 - 6:40
    left-right, deciding which way to go,
  • 6:40 - 6:43
    and also performs logarithmically if the tree is balanced.
  • 6:43 - 6:45
    Contains (prefix), same deal. Keep
  • 6:45 - 6:46
    looking at the [inaudible],
  • 6:46 - 6:48
    working our way down, knowing which way to go.
  • 6:48 - 6:51
    And then
  • 6:51 - 6:54
    add can also perform in logarithmic time
  • 6:54 - 6:57
    because it does the same search, falling off the bottom and
  • 6:57 - 6:58
    then inserting the new word there, or finding it along the way and not having any
  • 6:58 - 7:01
    work to do.
  • 7:01 - 7:03
    So we can get all our operations done in logarithmic time, which we know to be kinda
  • 7:03 - 7:05
    fast enough to actually not be worth
  • 7:05 - 7:08
    fighting any harder for.
  • 7:08 - 7:11
    Where did we pay for this?
  • 7:11 - 7:13
    Nothing really comes for free, right?
  • 7:13 - 7:15
    Overhead, right? And overhead, in this case, memory is certainly one of the places
  • 7:15 - 7:20
    where there's going to be a
  • 7:20 - 7:23
    pretty big cost that we now have the string data for each entry, plus a
  • 7:23 - 7:26
    left pointer and a right pointer. And if we're doing the work to keep it
  • 7:26 - 7:30
    balanced, which we probably ought to do if we want to avoid the degenerate cases
  • 7:30 - 7:31
    of it kinda getting totally out of whack and degenerating into linear time,
  • 7:31 - 7:34
    we're going to have a couple more
  • 7:34 - 7:37
    bits, you know, bytes thrown into that factor, too.
  • 7:37 - 7:39
    And so if we're assuming that there's
  • 7:39 - 7:42
    ten bytes of overhead,
  • 7:42 - 7:46
    four bytes in each of these plus another two for the balance factor, then we've added
  • 7:46 - 7:48
    ten bytes of overhead on top of the eight bytes for the word.
  • 7:48 - 7:48
    Getting a little
  • 7:48 - 7:52
    bit hefty. Something to worry about.
  • 7:52 - 7:56
    Also in terms of code complexity, which I didn't put a bullet up for, but it's
  • 7:56 - 7:59
    also worth kinda keeping - weighing it in your mind also which is
  • 7:59 - 8:01
    that building the fancier, more complicated data structure probably meant
  • 8:01 - 8:02
    you spent more time debugging it,
  • 8:02 - 8:06
    more time developing it,
  • 8:06 - 8:08
    right? You will have - it will be harder for someone later to maintain and deal
  • 8:08 - 8:10
    with because it's actually just more complicated code that somebody looking at a
  • 8:10 - 8:12
    sorted vector
  • 8:12 - 8:13
    isn't likely to actually - like, they want to add a new operation,
  • 8:13 - 8:16
    let's say they add remove.
  • 8:16 - 8:19
    You're able to take a word out of the lexicon. The lexicon doesn't currently support
  • 8:19 - 8:23
    that. Somebody adding that on the sorted vector implementation probably won't find themselves
  • 8:23 - 8:25
    too tripped up. Someone who has to do a remove out of a binary search tree, removing it, reattaching
  • 8:25 - 8:27
    the subtrees
  • 8:27 - 8:30
    and not destroying the balancing
  • 8:30 - 8:32
    is a pretty complicated operation. So we've made - we've invested in this data
  • 8:32 - 8:34
    structure that actually will keep on
  • 8:34 - 8:36
    creating
  • 8:36 - 8:38
    further development hassles for any modifications that are down the road for
  • 8:38 - 8:40
    this data structure. So
  • 8:40 - 8:45
    it is
  • 8:45 - 8:47
    a big investment in your brainpower to keep it working. So
  • 8:47 - 8:51
    let's think about what a hash table can do for us. A
  • 8:51 - 8:56
    hash table is the last thing that we looked at on Friday,
  • 8:56 - 8:58
    kinda just moving away from this whole idea of sortedness and kinda handing it off to this idea of a
  • 8:58 - 9:02
    hashing function.
  • 9:02 - 9:05
    So if I have a hash table that has a certain number of buckets
  • 9:05 - 9:06
    that uses a linked list for chaining to
  • 9:06 - 9:10
    handle collisions,
  • 9:10 - 9:11
    then what I would have is my words just scattered around my table,
  • 9:11 - 9:14
  • 9:14 - 9:16
    allowing me to have quick access by hashing to the right bucket and then
  • 9:16 - 9:19
    working my way down the bucket to see.
  • 9:19 - 9:21
    So the contains (word) would do a standard
  • 9:21 - 9:24
    hash table lookup.
  • 9:24 - 9:27
    Take the word, you know, ''mediate,'' run it through the hash function. It says oh, it's bucket
  • 9:27 - 9:30
    three in this case. I go to bucket three. I walk down the linked list. I find it or
  • 9:30 - 9:31
    not. And I can tell you: the word ''mediate,'' if it's in the table, had to be in bucket
  • 9:31 - 9:33
    three.
  • 9:33 - 9:35
    No questions asked.
  • 9:35 - 9:38
    In that case, the
  • 9:38 - 9:39
    expected time for that is going to be n over b, where n is the number of entries, b is the number of
  • 9:39 - 9:41
    buckets.
  • 9:41 - 9:42
    If we have buckets
  • 9:42 - 9:47
    set to be in the
  • 9:47 - 9:52
    rough order of the number of entries, then we're going to get constant access there.
  • 9:52 - 9:54
    Now we want to do contains (prefix). I want
  • 9:54 - 10:00
    to
  • 10:00 - 10:05
    know if there's any words in this table that begin with ''med.''
  • 10:05 - 10:09
    Where do I look?
  • 10:09 - 10:11
    If I run ''med'' through the hash function,
  • 10:11 - 10:15
    do I expect that I would get the same number
  • 10:15 - 10:18
    as other words that begin with ''med,'' like ''mediate'' and ''median'' and
  • 10:18 - 10:19
    ''mediator?'' Yes.
  • 10:19 - 10:22
    You say yes. All
  • 10:22 - 10:26
    right, let's take that to an extreme.
  • 10:26 - 10:29
    If I expect that a prefix and any longer words built on it all hash to the same place, then should that
  • 10:29 - 10:33
    be the case? For example, think about the simplest case of a prefix, ''m.''
  • 10:33 - 10:35
    Should all the words that begin with ''m'' hash to the same place?
  • 10:35 - 10:38
    If they did, we're in trouble, right?
  • 10:38 - 10:42
    Now, if all prefixes did hash to the same place, then what I'm going to end up with is a
  • 10:42 - 10:45
    really heavily clustered table where ''a,'' ''b,'' ''c,'' you know, ''z'' -
  • 10:45 - 10:46
    there's 26 buckets that have all the ''z'' words, all the
  • 10:46 - 10:48
    ''y'' words, all the what not,
  • 10:48 - 10:52
    and none of the other buckets would be used. All
  • 10:52 - 10:55
    right, so in fact, actually, as part of the good behavior of a hash function, we're
  • 10:55 - 10:58
    hoping that, actually, it does jumble things up. And then a word that looks like one
  • 10:58 - 11:02
    but has another letter added on the end, we're hoping it goes somewhere totally different,
  • 11:02 - 11:04
    that ''median'' and ''medians'' hopefully go to two totally disparate places.
  • 11:04 - 11:07
    That if there were patterns where
  • 11:07 - 11:10
    words that were substrings of each other all got hashed to the same
  • 11:10 - 11:12
    location, we would end up with heavy clustering. Because there are a lot of words that
  • 11:12 - 11:15
    repeat parts of the other words in the dictionary.
  • 11:15 - 11:19
    So in fact, if I want to know if there's any words that begin with ''med,''
  • 11:19 - 11:21
    I have no a priori information about where to look for
  • 11:21 - 11:25
    them. They could be anywhere.
  • 11:25 - 11:26
    I could hash ''med'' and see if ''med'' itself was a word, right?
  • 11:26 - 11:30
    But let's say it was something that
  • 11:30 - 11:33
    itself was just the beginning of something. ''St,'' ''str'' -
  • 11:33 - 11:36
    there's a lot of words that begin with that, but ''str'' itself is not a word. So
  • 11:36 - 11:39
    hashing to that bucket and looking for it would only tell me if there's exactly ''str,'' and then
  • 11:39 - 11:41
    occasionally there might be some ''str'' words also there.
  • 11:41 - 11:44
    But other than that, we just gotta look.
  • 11:44 - 11:45
    And where do we look? Everywhere.
  • 11:45 - 11:50
    Everywhere, everywhere, everywhere.
  • 11:50 - 11:53
    That in order to conclude there is no word that begins with the prefix we have,
  • 11:53 - 11:55
    the only way to be confident of that is to either find something, or to have searched
  • 11:55 - 11:59
    the entire table and not found it.
  • 11:59 - 12:02
    And so it does require this completely linear, exhaustive
  • 12:02 - 12:04
    look through the entire contents of the hash table to know
  • 12:04 - 12:07
    something, for example, is not a prefix.
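    (A rough sketch of why, assuming a simple chained hash table of strings; containsWord gets to jump straight to one bucket, but containsPrefix has no better option than walking every chain. This is illustrative, not the actual code:)

        #include <string>
        #include <vector>
        using namespace std;

        struct Cell {                // one entry in a bucket's chain
            string word;
            Cell* next;
        };

        struct WordHashTable {
            vector<Cell*> buckets;   // each bucket is a linked list of cells
        };

        // A stand-in hash function, just for the sketch.
        size_t hashString(const string& s) {
            size_t h = 0;
            for (size_t i = 0; i < s.size(); i++) h = h * 31 + s[i];
            return h;
        }

        // contains(word): hashing picks the one bucket worth searching.
        bool containsWord(const WordHashTable& t, const string& w) {
            for (Cell* cur = t.buckets[hashString(w) % t.buckets.size()];
                 cur != NULL; cur = cur->next) {
                if (cur->word == w) return true;
            }
            return false;
        }

        // contains(prefix): no bucket is "the" bucket for a prefix, so in the
        // worst case every cell in every bucket has to be examined.
        bool containsPrefix(const WordHashTable& t, const string& prefix) {
            for (size_t b = 0; b < t.buckets.size(); b++) {
                for (Cell* cur = t.buckets[b]; cur != NULL; cur = cur->next) {
                    if (cur->word.compare(0, prefix.size(), prefix) == 0) return true;
                }
            }
            return false;
        }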
  • 12:07 - 12:09
    You might find one quickly. So in the best case, you might be in the first couple of
  • 12:09 - 12:10
    buckets, happen upon one.
  • 12:10 - 12:14
    But no guarantees,
  • 12:14 - 12:16
    overall. And certainly when it's not a prefix, it's going to require a complete look-see of everything.
  • 12:16 - 12:18
    That's kinda a bummer. Adding
  • 12:18 - 12:19
    a
  • 12:19 - 12:22
    word,
  • 12:22 - 12:24
    back to the fast behavior again. Hashing to the bucket,
  • 12:24 - 12:26
    checking to see if it's already there, adding if needed
  • 12:26 - 12:28
    [inaudible].
  • 12:28 - 12:31
    So it gets good times on these two,
  • 12:31 - 12:34
    but then kinda really bottoms out when it comes to supporting
  • 12:34 - 12:35
    that prefix search.
  • 12:35 - 12:38
    Its space usage
  • 12:38 - 12:42
    is kinda roughly comparable to that of the binary search tree.
  • 12:42 - 12:42
    We've got the string data itself, and we've got the four-byte pointer on the linked list.
  • 12:42 - 12:44
  • 12:44 - 12:48
    There's also the four-byte bucket pointer,
  • 12:48 - 12:51
    which is over in this other structure here, this array full of buckets.
  • 12:51 - 12:54
    If that number, though, is roughly equal to the number of cells, then you can kinda
  • 12:54 - 12:56
    just count it as each entry
  • 12:56 - 12:59
    had this
  • 12:59 - 13:02
    12-byte cell, plus a four-byte bucket pointer somewhere
  • 13:02 - 13:04
    that you can kinda think of as associated on a per-entry basis, so it's about
  • 13:04 - 13:06
  • 13:06 - 13:10
    eight bytes of overhead on top of the string.
  • 13:10 - 13:11
    So 16 bytes for each entry in the table.
  • 13:11 - 13:16
    Okay, that's
  • 13:16 - 13:18
    what we have so far.
  • 13:18 - 13:21
    We've got
  • 13:21 - 13:24
    pretty good performance on contains (word) no matter what we
  • 13:24 - 13:27
    choose. We can get either logarithmic or constant time with any of these three
  • 13:27 - 13:31
    data structures. So perfectly acceptable performance there.
  • 13:31 - 13:33
    We cannot get contains (prefix) to run well on a hash table.
  • 13:33 - 13:34
    Just doesn't happen the way hashing works.
  • 13:34 - 13:38
    And then add
  • 13:38 - 13:41
    can be fast on the two complicated pointer-based data structures to the right,
  • 13:41 - 13:42
    but can't be fast on the sorted vector.
  • 13:42 - 13:45
    And then
  • 13:45 - 13:47
    we see kinda some trade-offs here in terms of space versus time, that there's
  • 13:47 - 13:51
    very little space overhead
  • 13:51 - 13:55
    in the sorted vector case, but really blossoms into
  • 13:55 - 13:58
    two times, almost three times, the storage needed in the BST and hash table
  • 13:58 - 14:00
    case. The code also got a lot more complicated when we moved up to those
  • 14:00 - 14:04
    data structures. So
  • 14:04 - 14:06
    then I'll tell you, actually, that it turns out that the one type of performance that was interesting to us
  • 14:06 - 14:09
    when I was
  • 14:09 - 14:14
    exploring this, but in fact, more important at the time, was the memory usage.
  • 14:14 - 14:17
    That the memory usage was really the big obstacle that we had.
  • 14:17 - 14:19
    The Official Scrabble Players Dictionary II is actually the source for the word list that
  • 14:19 - 14:22
    we're using.
  • 14:22 - 14:23
    It has about 128,000 words; I think that's exactly the
  • 14:23 - 14:26
    number it has.
  • 14:26 - 14:29
    And the average length of those is eight characters. And the file itself is
  • 14:29 - 14:32
    over a megabyte. So on disk, if you were just to look at the listing of all the words,
  • 14:32 - 14:34
    it's a megabyte - a little over a megabyte there.
  • 14:34 - 14:37
    If we were loading it into the sorted vector,
  • 14:37 - 14:40
    we'd get something that was just roughly equal to that. It would take about a megabyte of space,
  • 14:40 - 14:45
    plus a little bit of noise.
  • 14:45 - 14:48
    The ones that double up take it up to two, two and a half megabytes total.
  • 14:48 - 14:50
    At the time we were working, and I know this is going to seem completely archaic to you, but we had
  • 14:50 - 14:51
    these stone tablets that we worked on.
  • 14:51 - 14:55
  • 14:55 - 15:00
    And they typically had main memories of about a megabyte, maybe, like in
  • 15:00 - 15:02
    an exciting case, four megabytes; four whole megabytes of RAM.
  • 15:02 - 15:05
    So it was not possible
  • 15:05 - 15:09
    that we could dedicate one or two megabytes of your main memory
  • 15:09 - 15:12
    to just your lexicon. It just wasn't - there just wasn't enough space to go around.
  • 15:12 - 15:15
    So we're working in this very tight environment in terms of that. So a little bit more
  • 15:15 - 15:16
    like, for example, today's world of embedded devices. If you're adding something for an
  • 15:16 - 15:18
    iPod or a
  • 15:18 - 15:21
  • 15:21 - 15:24
    cell phone you have a lot less
  • 15:24 - 15:27
    gargantuan memory. So your average desktop machine, having a
  • 15:27 - 15:28
    gigabyte of RAM, now you're like, whatever, a megabyte here, ten megabytes there, free, all the
  • 15:28 - 15:31
    memory you want. And
  • 15:31 - 15:34
    so this seems kinda like a silly thing to focus on, but this was 1995,
  • 15:34 - 15:35
    and it turns out, yeah, there was certainly more interest in
  • 15:35 - 15:37
    keeping it
  • 15:37 - 15:40
    tight to the memory.
  • 15:40 - 15:44
    The Boggle lexicon that we have actually takes under a third of a megabyte.
  • 15:44 - 15:45
    The memory that's used while it's working and searching and checking for words and
  • 15:45 - 15:47
    prefixes
  • 15:47 - 15:50
    is actually smaller
  • 15:50 - 15:53
    than the on-disk form of the data we started with.
  • 15:53 - 15:57
    So it actually does a kind of a bit of a compression
  • 15:57 - 16:00
    as part of its strategy for storage, which is pretty interesting to think about.
  • 16:00 - 16:02
    So I'll tell you, actually, the truth about how we first did Boggle, because it's kinda a little
  • 16:02 - 16:03
    interesting
  • 16:03 - 16:07
    historical perspective on it.
  • 16:07 - 16:09
    But Boggle was given by a colleague of mine for the first time, Todd
  • 16:09 - 16:10
    Feldman. He was the guy who came up with the idea, which is a great idea. And he said, ''Oh, this is a great
  • 16:10 - 16:13
    assignment.''
  • 16:13 - 16:16
    And when he wrote it, he said, ''Oh, I need a lexicon.'' So as part of kinda putting the assignment together, he was like, ''Oh, we
  • 16:16 - 16:18
    have to get some lexicons.'' So he looked around and he found a word list that had about
  • 16:18 - 16:20
    20,000 words.
  • 16:20 - 16:23
  • 16:23 - 16:27
    And I think we first found this word list and we said, ''Oh, it's just impossible.'' It's going to take a megabyte
  • 16:27 - 16:30
    or more and we just don't have that kind of space. So we need much smaller data files. So we had a data file that had
  • 16:30 - 16:33
    about 20,000 words, which then takes about
  • 16:33 - 16:34
    200k. And he actually wrote it using a hash table.
  • 16:34 - 16:37
    So given he wrote it as a hash table,
  • 16:37 - 16:38
    he just didn't support contains (prefix). He
  • 16:38 - 16:39
    just didn't put it in,
  • 16:39 - 16:42
  • 16:42 - 16:45
    and - because you can't do it efficiently. So it actually took a smaller amount
  • 16:45 - 16:49
    of space and had a smaller word list and it wouldn't do contains (prefix).
  • 16:49 - 16:51
    So then people wrote Boggle. And as you learned from that midterm question, you can write Boggle without
  • 16:51 - 16:55
    contains (prefix).
  • 16:55 - 16:58
    It spends a lot more time looking down these dead ends, but it eventually bottoms out
  • 16:58 - 17:01
    when it runs off the board and uses all the
  • 17:01 - 17:04
    cubes. And
  • 17:04 - 17:05
    it actually, in some ways, almost made the program a little more satisfying to
  • 17:05 - 17:08
    write in one odd way,
  • 17:08 - 17:09
    which is, when you're - so it's the computer's turn, right, that uses the
  • 17:09 - 17:12
    prefix pruning.
  • 17:12 - 17:14
    So you would type in your words and it would be perfectly fine finding it, but it
  • 17:14 - 17:16
    doesn't use the prefix part when it's doing the human's turn. You're typing in words; it's finding them. You're typing
  • 17:16 - 17:19
    in the words; it's finding them.
  • 17:19 - 17:21
    Then you would go to do the computer's turn and you'd say, ''Okay, go.''
  • 17:21 - 17:25
    And now it would take a really long time.
  • 17:25 - 17:27
    It'd be hunting around the board. It would be working so hard your fan would be coming on.
  • 17:27 - 17:30
    You'd feel like oh, my computer's doing something.
  • 17:30 - 17:31
    And then it would show up with twice as many words as you found or ten times as
  • 17:31 - 17:33
    many words as you found, whatever.
  • 17:33 - 17:36
    And in fact, then you would say,
  • 17:36 - 17:40
    ''Well, at least it had to work hard to beat me.''
  • 17:40 - 17:41
    And then you felt like I wrote this program that causes my processor to get busy, right? I feel good. And
  • 17:41 - 17:44
  • 17:44 - 17:47
    in fact, actually, the word list was pretty small. So in fact, it didn't - it wasn't even actually as likely to
  • 17:47 - 17:49
    beat you because it didn't know as many words. 125,000 words is
  • 17:49 - 17:52
    an enormous number of words.
  • 17:52 - 17:55
    The average - I've heard different factors on this, but the
  • 17:55 - 17:59
    average person's vocabulary, English vocabulary, is about 20,000 to
  • 17:59 - 18:00
    30,000 words. It tends to be about 10,000 higher if you've gone to college, so
  • 18:00 - 18:03
    maybe 30,000 to 40,000.
  • 18:03 - 18:06
    The fact that it knows 80,000 more words than the average college graduate
  • 18:06 - 18:09
    means it knows a lot of obscure words, too. So in fact, not only does it
  • 18:09 - 18:12
    very quickly find all those words and then produce these ridiculous words that you've
  • 18:12 - 18:14
    never heard of, it just - it's almost like it's taunting you by doing it
  • 18:14 - 18:17
    in a millisecond.
  • 18:17 - 18:21
    And back then it took a while, so it was actually like
  • 18:21 - 18:23
    okay, it beat me, but it had to work hard.
  • 18:23 - 18:27
    But then - so then I took over Boggle.
  • 18:27 - 18:31
    I taught it maybe a quarter or two later and I said, ''I'm going to write a
  • 18:31 - 18:35
    lexicon. I'm going to go up and make a little project for myself, because I don't have enough work to do.''
  • 18:35 - 18:38
    And my goal was to make one that would do prefix, but that also would be
  • 18:38 - 18:40
    tight enough on memory that we could actually load a much bigger dictionary file, because I thought I'd
  • 18:40 - 18:42
    actually make the game even more
  • 18:42 - 18:48
    fun to play.
  • 18:48 - 18:52
    All right, so here's where I started.
  • 18:52 - 18:55
    This is a little excerpt from the middle of the
  • 18:55 - 18:58
    dictionary. ''Stra,'' yeah, that's totally the thing you notice, right, when you look at this thing. You see
  • 18:58 - 19:02
    ''straddle,'' ''straddler,'' ''straddlers,'' ''straddles,'' ''straddling,'' and then these words that you can't
  • 19:02 - 19:04
    believe you're allowed to do this, but apparently you can put ''er'' and ''ier'' and
  • 19:04 - 19:05
    ''ies'' and ''ing'' on
  • 19:05 - 19:07
    almost anything out there and it
  • 19:07 - 19:08
    makes a valid word.
  • 19:08 - 19:11
    So
  • 19:11 - 19:14
    although you never really thought about the fact that
  • 19:14 - 19:16
    ''straightforward,'' ''straightforwardly,'' ''straightforwardness,'' are all there, there is a huge amount
  • 19:16 - 19:18
    of repetition. And this is
  • 19:18 - 19:20
    where I said we're going to use some information about the domain.
  • 19:20 - 19:23
    These aren't just arbitrary strings.
  • 19:23 - 19:24
    They're not just random sequences of letters that are
  • 19:24 - 19:28
    picked out of the air,
  • 19:28 - 19:30
    rather, they developed over time. They're word roots. They're
  • 19:30 - 19:34
    suffixes. There's these prefixes that mean things
  • 19:34 - 19:35
    that actually show you that looking at this little excerpt out of the middle,
  • 19:35 - 19:38
    there's an awful lot of words
  • 19:38 - 19:42
    that repeat portions of other words.
  • 19:42 - 19:44
    And that may be part of what we can capitalize on. That
  • 19:44 - 19:47
    instead of really recording ''straightaway''
  • 19:47 - 19:51
    and ''straightaways,'' repeating those first 11 characters and adding an ''s,''
  • 19:51 - 19:55
    there's some way that I can unify the data that I have here. The same way when we try to
  • 19:55 - 19:56
    find codes, you have the same passage of code repeated two or three times.
  • 19:56 - 19:59
    We're always saying
  • 19:59 - 20:00
    unify it, move it into a helper and call it in three places. It's like can we do the same thing
  • 20:00 - 20:02
    with our data?
  • 20:02 - 20:07
    Can we share the parts of the words that are in common
  • 20:07 - 20:09
    and only distinguish where they are different?
  • 20:09 - 20:11
    Take a look at this.
  • 20:11 - 20:14
    This is a little portion
  • 20:14 - 20:18
    of something called a trie.
  • 20:18 - 20:20
    A trie is spelled t-r-i-e and it's actually from ''retrieval,'' but
  • 20:20 - 20:23
    sadly, it's still pronounced tri, just to confuse you.
  • 20:23 - 20:28
    And it is a variant of a tree.
  • 20:28 - 20:30
    In this case it's a tree that has 26 children coming out of any particular node,
  • 20:30 - 20:32
    one for each letter of the alphabet.
  • 20:32 - 20:35
    And so starting from the root node at the top,
  • 20:35 - 20:39
    all the words that begin with ''a'' are actually
  • 20:39 - 20:42
    down the A branch, and then there's a B branch. There'd be a C branch and a D
  • 20:42 - 20:45
    branch. And so the idea is that all the words, like if you think of the levels of the
  • 20:45 - 20:46
    tree are actually like - these are all the zeroth characters of the word.
  • 20:46 - 20:49
    And then that second level is
  • 20:49 - 20:52
    all the characters that follow that zeroth character. So here's the ''ab''
  • 20:52 - 20:56
    words and the ''ac'' words and the ''ax'' words.
  • 20:56 - 20:57
    Over here is the ''st'' words, but there's not a ''sx,'' so there's some places that actually aren't
  • 20:57 - 21:00
    filled out in this tree.
  • 21:00 - 21:02
    And as you keep going down further, so that the depth
  • 21:02 - 21:04
    you would trace to a tree
  • 21:04 - 21:08
    would actually be
  • 21:08 - 21:10
    tracing through the characters in sequence that make words.
  • 21:10 - 21:11
    And so in this form, tracing out a
  • 21:11 - 21:12
    path through this tree
  • 21:12 - 21:14
    from the root
  • 21:14 - 21:17
    down to a particular node
  • 21:17 - 21:19
    tells you about prefixes. So there are words that begin with ''a'' and words that begin with ''ac''
  • 21:19 - 21:23
    and ''ace,''
  • 21:23 - 21:23
    and then there's actually a little notation there that's done by a
  • 21:23 - 21:24
    visual
  • 21:24 - 21:26
    of the
  • 21:26 - 21:29
    bold and circle around it
  • 21:29 - 21:30
    that says, ''Oh, and the path that led here from the root is a word.'' So
  • 21:30 - 21:33
    ''a'' is a word
  • 21:33 - 21:35
    and ''ace'' is a word and ''aces'' is a word.
  • 21:35 - 21:39
    ''Act'' is a word and so is ''acted'' and ''actor.''
  • 21:39 - 21:44
    And in this case, the ''act'' part is totally shared,
  • 21:44 - 21:46
    so everything that comes off of ''act,'' ''acting,'' ''acted,'' ''actor,'' ''actors,''
  • 21:46 - 21:50
    can all share that initial part.
  • 21:50 - 21:52
    And for a word like ''act,'' it doesn't seem that profound. But you start thinking about words like
  • 21:52 - 21:55
    ''strawberry'' or ''straightforward'' or ''stratosphere''
  • 21:55 - 21:59
    and realize that ''stratospheric'' and ''strawberries'' and ''straightforwardly''
  • 21:59 - 22:01
    can all leverage the other ten or 12 characters that were already
  • 22:01 - 22:04
    invested in the root word,
  • 22:04 - 22:06
    and that ''straightjacket'' and ''straightforward'' can even share ''straight.''
  • 22:06 - 22:10
    But there's actually a lot of
  • 22:10 - 22:15
    prefixes that can be unified across the space of the dictionary.
  • 22:15 - 22:17
    So knowing that words have these patterns helps us to do this.
  • 22:17 - 22:19
    So let's build that.
  • 22:19 - 22:22
    So the first form of this -
  • 22:22 - 22:25
    and this, again, is a thought experiment. I did not actually write it this way because
  • 22:25 - 22:27
    it would just be absolutely astronomically expensive. But it was a good way to kinda think
  • 22:27 - 22:30
    about where I needed to go.
  • 22:30 - 22:34
    I designed a new kinda trie node
  • 22:34 - 22:35
    that had the letter and a little Boolean flag of ''is this a word,'' so whether it had
  • 22:35 - 22:39
    the dark circle around it.
  • 22:39 - 22:41
    And then it had a 26-member children array, so pointers to other nodes like
  • 22:41 - 22:44
    this one. So totally recursive,
  • 22:44 - 22:47
    starting from that root, and then the
  • 22:47 - 22:50
    idea is to use those 26 children with ''a'' being in the first slot and ''z'' in the last
  • 22:50 - 22:52
    slot to make it really easy to kinda just trace letter by letter
  • 22:52 - 22:55
    down to the levels of the tree.
  • 22:55 - 22:59
    So I'd have this one root node that pointed to the initial one, and then all the words
  • 22:59 - 23:03
    that begin with ''a'' are off that first slot; all the ''z'' words off the last slot.
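    (A sketch of that first trie node in C++; the field names are made up for illustration, not taken from the real lexicon source:)

        struct TrieNode {
            char letter;              // letter on the edge leading into this node
            bool isWord;              // true if the path from the root to here spells a word
            TrieNode* children[26];   // slot 0 for 'a' ... slot 25 for 'z'; NULL where unused
        };
        // 26 four-byte pointers plus the letter and flag is where the roughly
        // 106 bytes per node mentioned later comes from (on a 32-bit machine).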
  • 23:03 - 23:08
    So when I want to know if a word is in the dictionary,
  • 23:08 - 23:12
    what do I have to do in this data structure? How do I find it? I
  • 23:12 - 23:18
    want to find the word ''far.'' Where do I look?
  • 23:18 - 23:27
    Left? Right? Down? Which way? It would be great
  • 23:27 - 23:31
    if you would tell me. Then I would understand what my data structure is. [Inaudible]
  • 23:31 - 23:35
    Exactly. So at the very top I say okay, ''f'' is my first letter. Match my ''f,'' get me to the
  • 23:35 - 23:35
    F node, and now it says recursively find ''ar'' from here. So it's very
  • 23:35 - 23:37
    recursive. It'll be like
  • 23:37 - 23:39
    find the first letter
  • 23:39 - 23:43
    and then recursively find the substring from here, working down. So find the ''a'' out of the next node, find
  • 23:43 - 23:46
    an ''r'' out of the next node, and then when you get to that node,
  • 23:46 - 23:47
    you'll check this little ''is word Boolean,''
  • 23:47 - 23:49
    and say okay,
  • 23:49 - 23:53
    was that path I traced
  • 23:53 - 23:58
    marked in the dictionary as leading to something that's known to be a word?
  • 23:58 - 24:01
    The prefix looks exactly the same, except for that last check of the is-word flag. So
  • 24:01 - 24:05
    given that nodes exist in this, because further down there there's some words, I just
  • 24:05 - 24:08
    trace out a sequence of letters ''str,'' and as long as there's something there,
  • 24:08 - 24:11
    then I know that away from there are subsequent children beneath it
  • 24:11 - 24:15
    that lead to words.
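    (Roughly how that tracing might look, building on the TrieNode sketch above; again illustrative rather than the actual implementation, and it assumes all-lowercase words:)

        #include <string>
        using namespace std;

        // Walk the string one letter at a time from the root; NULL means the path dies out.
        const TrieNode* findNode(const TrieNode* root, const string& s) {
            const TrieNode* cur = root;
            for (size_t i = 0; cur != NULL && i < s.size(); i++) {
                cur = cur->children[s[i] - 'a'];
            }
            return cur;
        }

        // contains(word): the path must exist AND end at a node marked as a word.
        bool containsWord(const TrieNode* root, const string& w) {
            const TrieNode* n = findNode(root, w);
            return n != NULL && n->isWord;
        }

        // contains(prefix): the path just has to exist; any words lie somewhere below it.
        bool containsPrefix(const TrieNode* root, const string& prefix) {
            return findNode(root, prefix) != NULL;
        }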
  • 24:15 - 24:17
    And adding a word does the same operation.
  • 24:17 - 24:19
    If I need to add the word ''strawberry,''
  • 24:19 - 24:20
    then I start looking for ''s,'' ''t,'' ''r,''
  • 24:20 - 24:24
    and at some point,
  • 24:24 - 24:26
    I find some places where the nodes don't exist. Then I start creating them from
  • 24:26 - 24:29
    there on down.
  • 24:29 - 24:33
    So it is tracing what exists and then adding new nodes off the bottom,
  • 24:33 - 24:38
    and then marking the final one as yes, this is a word.
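    (And the add, in the same illustrative style: trace the part of the path that already exists, create whatever nodes are missing, and mark the last one as a word:)

        // add(word): reuse existing nodes along the path; allocate only the missing tail.
        void addWord(TrieNode* root, const string& w) {
            TrieNode* cur = root;
            for (size_t i = 0; i < w.size(); i++) {
                int slot = w[i] - 'a';          // assumes lowercase letters only
                if (cur->children[slot] == NULL) {
                    TrieNode* node = new TrieNode;
                    node->letter = w[i];
                    node->isWord = false;
                    for (int j = 0; j < 26; j++) node->children[j] = NULL;
                    cur->children[slot] = node;
                }
                cur = cur->children[slot];
            }
            cur->isWord = true;   // the full path now spells a word
        }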
  • 24:38 - 24:43
    What's interesting about this is that all three of these operations
  • 24:43 - 24:47
    don't actually make any reference to how many words are in the dictionary.
  • 24:47 - 24:49
    They are all dependent on the length of the word you're searching for or adding into the dictionary.
  • 24:49 - 24:51
    If there's ten words in there, if there's 1,000 words in there, if
  • 24:51 - 24:53
    there's a million words in there.
  • 24:53 - 24:56
    But in no way does the big O depend on
  • 24:56 - 24:59
    how big or how loaded the data structure is.
  • 24:59 - 25:00
    That's kinda wacky.
  • 25:00 - 25:06
    The longest word,
  • 25:06 - 25:09
    according to the Oxford English Dictionary?
  • 25:09 - 25:11
    There you go. You're not going to make that on the Boggle board anytime soon because it's only got 16 cubes. Even Big Boggle
  • 25:11 - 25:12
    can't really make that puppy. But
  • 25:12 - 25:14
    just so you know,
  • 25:14 - 25:17
    45 is not so many, right?
  • 25:17 - 25:21
    And,
  • 25:21 - 25:24
    in fact, actually, here's a little thought for you. Not only is it not dependent on num words, it actually has a
  • 25:24 - 25:26
    little bit of an inverse relationship with num words,
  • 25:26 - 25:29
    especially the add operation.
  • 25:29 - 25:31
    That as the data structure becomes more loaded,
  • 25:31 - 25:34
    there's typically less work to do in add.
  • 25:34 - 25:37
    So if ''strawberry'' is in there, and you go to add ''strawberries,''
  • 25:37 - 25:38
    then actually you already have ''strawberry'' to depend on. You just need to build
  • 25:38 - 25:41
    off the thing. So the fact that
  • 25:41 - 25:45
    the more words that are in there, the more likely that some of the nodes you already need
  • 25:45 - 25:46
    are already in place, and then you can just kinda blow past them and just add your new
  • 25:46 - 25:49
    nodes.
  • 25:49 - 25:53
    Now, that's odd. You think of most data structures as really as
  • 25:53 - 26:00
    clearly as they get heavier, it takes more work to install new
  • 26:00 - 26:03
    things in them rather than less. [Inaudible].
  • 26:03 - 26:04
    Well, the thing is, they're organized by letter, and that 26-number array. So in fact,
  • 26:04 - 26:08
    I don't even have to look.
  • 26:08 - 26:11
    So I just go straight to the slot. If I'm looking for strawberry, I'll look for ''s'' and I go
  • 26:11 - 26:14
    straight to the ''t'' slot and then the ''r'' slot. So in fact, if the node was already in place, I'm
  • 26:14 - 26:16
    not searching for it. It really just is exactly in the ''r'' slot. That's why
  • 26:16 - 26:21
    there's 26 of them. In
  • 26:21 - 26:24
    each array? Each array is 26 and they're A through Z and if there's an empty slot we're using a null.
  • 26:24 - 26:27
    So in this case, it's optimized to make it super fast. I don't even have to look
  • 26:27 - 26:29
    through the letters to find the one that matches. There's no matching. It just goes straight
  • 26:29 - 26:32
    to the ''r'' slot, ''z'' slot, the ''s'' slot.
  • 26:32 - 26:35
    We're going to see this is going to be a consequence we can't
  • 26:35 - 26:37
    tolerate, but it is, at least, the starting point for thinking about how to
  • 26:37 - 26:38
    build a data structure.
  • 26:38 - 26:40
    So the space usage
  • 26:40 - 26:44
    is where this thing
  • 26:44 - 26:47
    really, really - so this is an extreme example of a space-time trade-off, right?
  • 26:47 - 26:50
    We've got this thing that optimizes beautifully for O of the length of
  • 26:50 - 26:53
    the word. So that means typically eight operations on average are going to be
  • 26:53 - 26:56
    needed to add or to search for something.
  • 26:56 - 26:57
    The space we're using here is 106 bytes per node, so that's this
  • 26:57 - 26:59
    26-member
  • 26:59 - 27:02
    array of four-byte pointers,
  • 27:02 - 27:04
    plus two bytes for the character and the Boolean that are up there. One
  • 27:04 - 27:05
    hundred and six bytes
  • 27:05 - 27:09
    is a lot of memory,
  • 27:09 - 27:13
    and the tree has about a quarter of a million nodes.
  • 27:13 - 27:17
    So given the 125,000 words that I said are in the input,
  • 27:17 - 27:19
    if you build it into a tree, they take about 250,000 nodes. That's an interesting sort of
  • 27:19 - 27:22
    number to think about, though. You have 125,000
  • 27:22 - 27:25
    words. They take 250,000 nodes. That tells you, actually, that
  • 27:25 - 27:27
    kinda on average, each
  • 27:27 - 27:29
    word has about two unique nodes,
  • 27:29 - 27:33
    that even words like ''straightforward'' and ''straightforwardly,''
  • 27:33 - 27:34
    on average, are sharing enough of their prefix - common prefix
  • 27:34 - 27:38
    parts - that
  • 27:38 - 27:39
    there's very little unique data added by any particular word in there that averages
  • 27:39 - 27:40
    across the whole thing. That's
  • 27:40 - 27:42
    pretty neat
  • 27:42 - 27:45
    to know. However, this is just not going to fly.
  • 27:45 - 27:50
    So here I was telling you that my main memory had
  • 27:50 - 27:53
    four megabytes, maybe, on a good day, and now I've actually
  • 27:53 - 27:55
    built a data structure that's going to require six times that
  • 27:55 - 27:58
    in terms of storage.
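    (Putting numbers on that: roughly 250,000 nodes at about 106 bytes apiece is on the order of 26 megabytes.)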
  • 27:58 - 28:00
    Okay, we gotta squeeze that down.
  • 28:00 - 28:04
    So the first thing we can do
  • 28:04 - 28:08
    is to realize that allocating a full 26-member array, of
  • 28:08 - 28:11
    which I only plan on using some of the slots, is pretty wasteful.
  • 28:11 - 28:13
    So instead of saying I'm committing to 26 per node, I'll say well, how about if
  • 28:13 - 28:16
    I have an array that actually
  • 28:16 - 28:20
    is tight to the number of children coming out of this node?
  • 28:20 - 28:20
    So the very first node will have 26. There are words that start with every letter in the
  • 28:20 - 28:22
    alphabet.
  • 28:22 - 28:25
    But from there, it very quickly winnows down.
  • 28:25 - 28:29
    You have the letter ''x.'' You do not need the full 26 children coming
  • 28:29 - 28:33
    off of ''x.'' There's only a few letters that follow ''x'' in the English language that are going
  • 28:33 - 28:37
    to need to be fleshed out. You need the ''xa,'' you need an ''xe,''
  • 28:37 - 28:40
    but you don't need ''xj,'' you don't need ''xk,'' a whole bunch of things like that.
  • 28:40 - 28:44
    So if I change the structure here to instead of being
  • 28:44 - 28:47
    a 26-member array of node pointers, I make it to be a pointer to a
  • 28:47 - 28:50
    set of node pointers, so this is a dynamic array
  • 28:50 - 28:54
    where I'm keeping track of the number of children in a second field here,
  • 28:54 - 28:56
    and the first one will be allocated to 26, the second will be allocated to eight.
  • 28:56 - 28:59
    This is going to change a little bit of our running times, because now we won't
  • 28:59 - 29:01
    be able to go exactly to the ''s'' slot. We'll have to go look for it.
  • 29:01 - 29:03
    So on each step down the tree,
  • 29:03 - 29:07
    it'll be
  • 29:07 - 29:12
    looking through that sized array at this slot to find the right match. But it adds a
  • 29:12 - 29:14
    constant factor, basically, on top of O of the length of the word.
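    (A sketch of that slimmed-down node, again with illustrative names; this is a revision of the TrieNode above, and each step now scans the node's small children array instead of indexing straight into a 26-slot table:)

        struct TrieNode {
            char letter;
            bool isWord;
            unsigned char numChildren;   // how many children this node actually has
            TrieNode** children;         // dynamic array holding exactly that many pointers
        };
        // About ten bytes for the node itself, plus (numChildren * 4) bytes out in the array.

        // Linear scan over just the children that exist; a small constant factor per step.
        TrieNode* findChild(const TrieNode* n, char ch) {
            for (int i = 0; i < n->numChildren; i++) {
                if (n->children[i]->letter == ch) return n->children[i];
            }
            return NULL;
        }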
  • 29:14 - 29:15
    So the space issue with this
  • 29:15 - 29:19
  • 29:19 - 29:23
    is - we're just not storing all those nulls. And there were a lot of nulls. That
  • 29:23 - 29:25
    the node itself is now just ten bytes, and then there's some space in this
  • 29:25 - 29:27
    dynamic array out there, which is the number of children
  • 29:27 - 29:31
    times the four-byte pointer,
  • 29:31 - 29:33
    that on average, a node has six children. Across the
  • 29:33 - 29:36
    250,000 nodes in the
  • 29:36 - 29:39
    data structure that's being built from this, so really 26 was way
  • 29:39 - 29:40
    overkill. About 20 of them had nulls
  • 29:40 - 29:42
    for most of the nodes.
  • 29:42 - 29:46
    And many of them, for example, at the very bottom of the tree,
  • 29:46 - 29:49
    have completely none. They're all nulls. So you get to a word like
  • 29:49 - 29:52
    ''straightforwardly.'' There's nothing after that ''y,''
  • 29:52 - 29:53
    so it has zero children coming out the back of that one.
  • 29:53 - 29:57
    And so certainly
  • 29:57 - 30:00
    having 26 null children pointing away from that was a
  • 30:00 - 30:01
    waste that we can tighten up.
  • 30:01 - 30:03
    So when we get it down to this we'll have
  • 30:03 - 30:04
  • 30:04 - 30:05
    44 bytes
  • 30:05 - 30:06
  • 30:06 - 30:08
    -
  • 30:08 - 30:11
    or 34 bytes per node.
  • 30:11 - 30:15
    Still a quarter million nodes. Still
  • 30:15 - 30:18
    eight and a half megabytes. So we're not winning any prizes just yet.
  • 30:18 - 30:22
    But we are making good progress. That got us down a factor of
  • 30:22 - 30:25
    three or four right there. So I still have the same
  • 30:25 - 30:28
    number of nodes. All I changed was the pointers to the
  • 30:28 - 30:32
    subsequent nodes further down the tree. So
  • 30:32 - 30:35
    I - the kind of letters and how the letters connect to each other - the connectivity is
  • 30:35 - 30:39
    still pretty much the same. It's just that in any particular node, how many children does it have?
  • 30:39 - 30:40
    I was storing a whole bunch of nulls and now I'm not storing those nulls.
  • 30:40 - 30:43
    So the number of nodes is still
  • 30:43 - 30:47
    exactly as it was before. I'm going
  • 30:47 - 30:49
    to do one other little squishing thing on this data structure
  • 30:49 - 30:51
    and then I'm going
  • 30:51 - 30:53
    to flip gears for a second.
  • 30:53 - 30:56
    So still using kinda my
  • 30:56 - 30:59
    CS side of things, how can I squash things down?
  • 30:59 - 31:03
    I'm going to use a technique that we used in the heap.
  • 31:03 - 31:04
    So in the case of the binary heap that we implemented into the
  • 31:04 - 31:07
    PQ,
  • 31:07 - 31:11
    you'll remember that we drew pictures in the handout that showed a heap with oh, I
  • 31:11 - 31:11
    have a parent with two children, which has four grandchildren and
  • 31:11 - 31:15
    so on.
  • 31:15 - 31:16
    But that we actually compressed that down into an array,
  • 31:16 - 31:20
    and in the array
  • 31:20 - 31:24
    there was this pattern of
  • 31:24 - 31:27
    here is the root node.
  • 31:27 - 31:30
    So we'll call this level zero.
  • 31:30 - 31:33
    And then the next two nodes
  • 31:33 - 31:36
    are level one.
  • 31:36 - 31:38
    And the next four nodes
  • 31:38 - 31:42
    are level two and so on. So
  • 31:42 - 31:45
    there's one here, two here, there's four here, there's eight here. And we kinda flattened it down
  • 31:45 - 31:50
    by generation, so looking at what was at the top and what was underneath it.
  • 31:50 - 31:52
    When we did that, we removed all the space for pointers.
  • 31:52 - 31:54
    Although we drew those pictures in the heap as though there were pointers
  • 31:54 - 31:57
    there, in fact, there were not.
  • 31:57 - 32:02
    There were no pointers being stored in the heap in this version of the PQ.
  • 32:02 - 32:02
    And so that gets us back to kind of array like storage requirements,
  • 32:02 - 32:06
    but
  • 32:06 - 32:09
    by flattening it down into this array, and still kinda using the tree
  • 32:09 - 32:10
    conceptually to work our way through it, we're getting the advantages of
  • 32:10 - 32:12
    that
  • 32:12 - 32:14
  • 32:14 - 32:15
    traversal that's based on height of tree,
  • 32:15 - 32:20
    not length of vector.
  • 32:20 - 32:22
    So if we do the same thing with the trie, it's a little bit more complicated
  • 32:22 - 32:25
    because there were a couple things about heap that made this a little
  • 32:25 - 32:27
    easier. Heap always had two children, so there were these rules about
  • 32:27 - 32:30
    how the tree was filled in that meant
  • 32:30 - 32:33
    you could always compute parent and child index using this kinda divide by two,
  • 32:33 - 32:36
    multiply by two exchange operation.
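    (As a reminder of that arithmetic, assuming the 1-based layout from the PQ assignment: the node stored at index i has its children at indexes 2i and 2i + 1 and its parent at i / 2, which is exactly what let the heap get away with storing no pointers at all.)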
  • 32:36 - 32:39
    In this case, what we're going to have to do is just lay it down by generation,
  • 32:39 - 32:41
    but we don't know how big a generation's gonna be, so we're actually gonna have to keep
  • 32:41 - 32:43
    track of it manually.
  • 32:43 - 32:46
    So the root node will have
  • 32:46 - 32:49
    all A through Z. So this will be
  • 32:49 - 32:53
    level one, which is A through Z here,
  • 32:53 - 32:57
    and then level two is - there'll be a bunch of slots here, which is the
  • 32:57 - 33:00
    A followed by some letter, and then there'd be the B's followed by some letter, and
  • 33:00 - 33:03
    then there'd be the C's followed by some letter, and so on.
  • 33:03 - 33:06
    And then further down is level three, which are the three-character words. So these are all
  • 33:06 - 33:09
    the single-character paths,
  • 33:09 - 33:12
    two-character paths, three-character paths, kinda flattened down.
  • 33:12 - 33:15
    And so if I look at what I changed here, my first node
  • 33:15 - 33:18
    says it has no letter - that's the root node - it's not a word,
  • 33:18 - 33:20
    and it says its children begin at index one. So
  • 33:20 - 33:23
    the next location. So it's not a
  • 33:23 - 33:26
    multiply by two and add one or something like that. It's okay, here's
  • 33:26 - 33:30
    where my children begin. They're at slot one in the array.
  • 33:30 - 33:31
    And then there are 26 of them. So that means that A, B, C, D are kinda in order there.
  • 33:31 - 33:33
    If I look at A,
  • 33:33 - 33:36
    ''a'' is a word,
  • 33:36 - 33:39
    and it says its children begin at index 27
  • 33:39 - 33:44
    and there are 12 of them, so there are 12 characters out of the 26 that
  • 33:44 - 33:47
    can follow an ''a'' in the sequence of a valid word. And
  • 33:47 - 33:48
    then B's children begin at index 39, which is 27 plus 12. So that's
  • 33:48 - 33:50
    where
  • 33:50 - 33:54
    A's children are sitting here at index
  • 33:54 - 33:56
    27 to 38 and then 39 to - I think B has eight or so children -
  • 33:56 - 33:59
    and C and so on.
  • 33:59 - 34:00
    So we flattened it all down by generation
  • 34:00 - 34:01
    into
  • 34:01 - 34:04
    an array.
  • 34:04 - 34:07
    That means all the space for pointers
  • 34:07 - 34:10
    has now been removed.
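    (A sketch of the flattened node, with made-up field names: instead of pointers, each node just records where its block of children sits in the one big array, and a lookup step scans that little range:)

        #include <string>
        #include <vector>
        using namespace std;

        struct FlatNode {
            char letter;
            bool isWord;
            int firstChild;    // index in the same array where this node's children start
            int numChildren;   // how many consecutive entries those children occupy
        };

        // Trace a string letter by letter through the flattened array; -1 means no such path.
        int findIndex(const vector<FlatNode>& nodes, const string& s) {
            int cur = 0;                                   // slot 0 holds the root
            for (size_t i = 0; i < s.size(); i++) {
                int match = -1;
                for (int c = 0; c < nodes[cur].numChildren; c++) {
                    int child = nodes[cur].firstChild + c;
                    if (nodes[child].letter == s[i]) { match = child; break; }
                }
                if (match == -1) return -1;
                cur = match;
            }
            return cur;
        }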
  • 34:10 - 34:12
    One serious consequence of this, though,
  • 34:12 - 34:14
    is that the pointers got us flexibility, that
  • 34:14 - 34:15
    insert and remove
  • 34:15 - 34:19
  • 34:19 - 34:22
    were dependent on the idea of pointers really being there, pointers buying us
  • 34:22 - 34:25
    the ability to insert and delete without doing a shuffle.
  • 34:25 - 34:27
    That now, once I have flattened it, I have suddenly bought back the same
  • 34:27 - 34:29
    problems that arrays have, which is
  • 34:29 - 34:33
    if I have the words here that have
  • 34:33 - 34:36
    ''ab'' for abalone, and ''ac'' for act and I don't happen to have any words with ''ad.'' And
  • 34:36 - 34:40
    I try to add the word ''add,''
  • 34:40 - 34:43
    that in order to put ''ad'' in there, we're going to have to move everything over and then
  • 34:43 - 34:46
    there's going to be a lot of shuffling and what not.
  • 34:46 - 34:48
    So I'm starting to build a data structure that's losing some flexibility but
  • 34:48 - 34:49
    saving space.
  • 34:49 - 34:53
    So
  • 34:53 - 34:54
    I'm, in some sense, focusing my attention on how can I build a data structure that's really fast
  • 34:54 - 34:56
    to search,
  • 34:56 - 35:00
    even if it was a little harder to build?
  • 35:00 - 35:02
    Because it turns out the thing I most care about is making sure it can
  • 35:02 - 35:07
    actually very, very efficiently look up words.
  • 35:07 - 35:11
    Maybe building it once, being a little slow, isn't so painful.
  • 35:11 - 35:14
    So the space usage in this gets us down to just ten bytes per
  • 35:14 - 35:16
    node. All the pointers are gone. So there was 24 bytes of pointers that we've
  • 35:16 - 35:19
    just eliminated.
  • 35:19 - 35:21
    And that means we've got a two-and-a-half-megabyte
  • 35:21 - 35:28
    memory footprint right now.
  • 35:28 - 35:29
    So we're kinda getting close to hitting the range for binary tree and hash tables.
  • 35:29 - 35:33
    Now I'm
  • 35:33 - 35:37
    going to go back to the domain again. So I worked on it kinda from the CS side, thinking
  • 35:37 - 35:39
    about things I know about data structures and
  • 35:39 - 35:41
    reorganizing arrays and heaps and pointers. I'm going
  • 35:41 - 35:43
    to go back and look at that domain again
  • 35:43 - 35:46
    and see if there's something else I
  • 35:46 - 35:48
    can kinda take advantage of.
  • 35:48 - 35:50
    So this is a different
  • 35:50 - 35:52
    cross-section of the dictionary
  • 35:52 - 35:57
    showing the words ''adapted,''
  • 35:57 - 36:02
    ''adaptor,'' ''adaptors,'' ''adapting,'' ''adaption,'' adaptions,'' ''adapts.''
  • 36:02 - 36:04
    So eight words that are all built off the ''adapt'' root with a bunch of suffixes. I look
  • 36:04 - 36:05
    sorta over a little bit further. I've got
  • 36:05 - 36:10
    ''desert,''
  • 36:10 - 36:15
    ''deserted,'' ''deserter,'' ''deserting,'' ''desertings,'' ''desertion,'' ''desertions,'' ''deserts.'' Okay,
  • 36:15 - 36:20
    and ''detect'' and ''insert'' and ''perfect'' and ''invent'' and ''interrupt''
  • 36:20 - 36:23
    that all are verbs for which you can apply the past tense and the
  • 36:23 - 36:26
    gerund form and the ''ion's'' that
  • 36:26 - 36:29
    tack onto that root word to give you these suffixes.
  • 36:29 - 36:32
    And each of these words that has shown up here has exactly the same set of suffixes
  • 36:32 - 36:33
    that can be applied to it.
  • 36:33 - 36:35
    So I
  • 36:35 - 36:37
    did this thing about prefixes. I
  • 36:37 - 36:42
    went out of my way to find that -
  • 36:42 - 36:45
    so all these words could share that same prefix, ''assert,'' ''asserter,'' ''assertings,'' ''asserting.''
  • 36:45 - 36:47
    But is there some way that I could take this idea of ''corrupt''
  • 36:47 - 36:50
    and ''assert'' and ''insert,''
  • 36:50 - 36:54
    and then once I get down to that ''t,'' where they end, is there a way where they can also
  • 36:54 - 36:57
    share their backend parts? That those ''ing's'' and ''ion's,'' can they be
  • 36:57 - 36:58
    unified, as well? Well,
  • 36:58 - 37:00
    why not?
  • 37:00 - 37:03
    Why not? So here is
  • 37:03 - 37:06
    the first version of what's going to be the DAWG,
  • 37:06 - 37:07
    the Directed Acyclic Word
  • 37:07 - 37:10
    Graph.
  • 37:10 - 37:12
    It takes the idea of unifying the prefixes and now applies it to the
  • 37:12 - 37:14
    suffixes.
  • 37:14 - 37:17
    That there are a lot of
  • 37:17 - 37:20
    words that end in the same letters, as well as start in the same letters. Why not just apply
  • 37:20 - 37:22
    the same kind of unification and sharing on the back end?
  • 37:22 - 37:24
    So here comes in
  • 37:24 - 37:26
    ''adapt'' and ''adopt''
  • 37:26 - 37:30
    and ''baste'' and ''port''
  • 37:30 - 37:33
    and out the back end is ''ports'' and ''portion'' and ''porting'' and ''porter'' and ''adopter''
  • 37:33 - 37:37
    and ''adopted'' and ''basting'' and ''bastion,''
  • 37:37 - 37:40
    all coming off that sharing all the endings, because they have the same
  • 37:40 - 37:41
    words that are viable and the same connections that are
  • 37:41 - 37:46
    between them. So
  • 37:46 - 37:49
    the idea is that they're all sharing the ''er,'' and the ''er'' leading in to the ''s.'' And that same ''s,'' for
  • 37:49 - 37:50
    example, being shared from ''portion'' and ''baste'' and
  • 37:50 - 37:51
    ''adopt'' and
  • 37:51 - 37:54
    ''adopter,''
  • 37:54 - 37:57
    all coming into the same ending of oh, these things that end in ''s'' are words. Here's an ''s'' that is a
  • 37:57 - 37:59
    word for them to share.
  • 37:59 - 38:01
    So it is directed. So
  • 38:01 - 38:04
    it is a graph.
  • 38:04 - 38:05
    So once we remove this - the idea
  • 38:05 - 38:09
    that there's a single path
  • 38:09 - 38:12
    from the root to any node, we're no longer a tree structure. So we
  • 38:12 - 38:16
    can get to ''t'' by going through ''adapt'' or ''adopt'' or ''port,''
  • 38:16 - 38:19
    and so that no longer fits the definition of what is a tree. It's
  • 38:19 - 38:23
    become a graph. We can get to these nodes from different places.
  • 38:23 - 38:24
    The arcs go one way in this case. We can't go back up the tree
  • 38:24 - 38:27
    to find the words in reverse.
  • 38:27 - 38:31
    There are no cycles. So acyclic, right, you can't just
  • 38:31 - 38:33
    wind around some portion and get as many ''na''s in ''banana'' as you want.
  • 38:33 - 38:34
    And
  • 38:34 - 38:37
    there are
  • 38:37 - 38:44
    certain nodes that are marked, in this case, the same way they were in the
  • 38:44 - 38:45
    trie. This path from the root node to here does represent a word in this
  • 38:45 - 38:48
    situation.
  • 38:48 - 38:53
    So building this is a little bit complicated. I'll give you a little bit
  • 38:53 - 38:56
    of the idea of the intuition for it, and I will not get too deep into it.
  • 38:56 - 39:00
    But largely the way you do it is you build the trie,
  • 39:00 - 39:01
    so that it doesn't have the unification of the suffixes; it only unifies the
  • 39:01 - 39:03
    prefixes. And
  • 39:03 - 39:07
    then you work backwards
  • 39:07 - 39:09
    on the suffixes. So in some sense you look at all the words that end, for example, in
  • 39:09 - 39:12
    ''s'' in the dictionary.
  • 39:12 - 39:15
    And you find all the ones that just end in one ''s'' that goes nowhere, and you say
  • 39:15 - 39:17
    look, this ''s'' looks like every other ''s'' in this thing. All these words
  • 39:17 - 39:20
    that are just plurals or
  • 39:20 - 39:23
    verbs that can have an ''s'' tacked on. Let's share that ''s.''
  • 39:23 - 39:25
    And then so you take however many s's there were, 1,000 of them, and you
  • 39:25 - 39:29
    say these are all the same ''s.''
  • 39:29 - 39:31
    And then what you do is you look at all the nodes that lead into that ''s.'' So you're kinda
  • 39:31 - 39:33
    traversing the graph in reverse, and you say well look,
  • 39:33 - 39:36
    here's a bunch of people who lead in on an ''e.''
  • 39:36 - 39:37
    And if that ''e'' is a word or not, there might be some group that has ''e''
  • 39:37 - 39:38
    that's also a word,
  • 39:38 - 39:40
    so like
  • 39:40 - 39:42
    the word ''apple'' and ''apples.''
  • 39:42 - 39:46
    But sometimes the ''e'' is not,
  • 39:46 - 39:49
    where that wasn't a stopping place, let's say, for ''parties'' it wasn't.
  • 39:49 - 39:53
    So all the ones that are a word you can unify on the ''e'' that was a word, and all the ones that aren't
  • 39:53 - 39:54
    can unify on the ''e'' that wasn't. So you just work your way backwards from there. So you say
  • 39:54 - 39:55
    what letters led into that ''e?''
  • 39:55 - 39:58
    And so kinda from
  • 39:58 - 40:02
    the end of the words backwards, you're looking for all the people who can end in an
  • 40:02 - 40:05
    ''s,'' and end in an ''e,'' and end in an ''ies,'' and unify them up,
  • 40:05 - 40:07
    making sure that they have all the same outgoing connections. So for
  • 40:07 - 40:09
    example, ''running'' has
  • 40:09 - 40:13
    an ''ing'' at the end. So does ''swimming,''
  • 40:13 - 40:15
    but there's ''swimmingly'' and there's not ''runningly.'' So you have to be careful about
  • 40:15 - 40:18
    unifying things -
  • 40:18 - 40:20
    they would kinda have to have all identical properties from this point
  • 40:20 - 40:22
    down to the bottom of the tree
  • 40:22 - 40:25
    to be unifiable.
  • 40:25 - 40:28
    But it's a pretty neat process. It's actually what's called
  • 40:28 - 40:29
    DFA subset minimization, so if you want to learn a little bit about it, that's the
  • 40:29 - 40:31
    word to look up.
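The details are beyond the scope of the lecture, but the gist of that bottom-up unification can be sketched roughly as follows. This is a simplified illustration with assumed names, not the actual builder: give each subtree a signature built from its letter, its is-a-word flag, and its children's signatures, and merge any two subtrees whose signatures match, since they are identical from that point down.

```cpp
#include <map>
#include <string>
#include <vector>

struct Node {
    char letter;
    bool isWord;
    std::vector<Node*> children;    // trie children, prefixes already shared
};

// Returns the canonical (shared) node equivalent to n, filling in the
// signature that describes everything from n down to the leaves. Two
// subtrees merge only when letter, isWord, and all outgoing connections
// match -- the "identical from this point down" requirement.
Node* unify(Node* n, std::map<std::string, Node*>& seen, std::string& signature) {
    std::string sig;
    sig += n->letter;
    sig += n->isWord ? '1' : '0';
    for (Node*& child : n->children) {
        std::string childSig;
        child = unify(child, seen, childSig);   // unify the suffixes bottom-up
        sig += '(' + childSig + ')';
    }
    signature = sig;
    auto it = seen.find(sig);
    if (it != seen.end()) return it->second;    // an identical subtree already exists
    seen[sig] = n;
    return n;
}
```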
  • 40:31 - 40:35
    And so if I do that,
  • 40:35 - 40:36
    and I build this as a DAWG, I'm still using the same structure definition here,
  • 40:36 - 40:40
  • 40:40 - 40:44
    but what I am doing is unifying suffixes as well as prefixes
  • 40:44 - 40:46
    and that, actually, is gonna change the number of nodes that I need drastically.
  • 40:46 - 40:49
    It goes down to 80,000
  • 40:49 - 40:52
    from my initial 250,000.
  • 40:52 - 40:55
    And if you think about the number of words again,
  • 40:55 - 40:57
    there were 125,000 words in the dictionary,
  • 40:57 - 41:01
    the DAWG has just 80,000 nodes. So it means, actually, once you start
  • 41:01 - 41:04
    unifying the suffixes, it's like most words don't even have -
  • 41:04 - 41:07
    or many of the words don't even have nodes of their own that are unique to
  • 41:07 - 41:11
    them whatsoever. They just exist kinda as portions of other words
  • 41:11 - 41:12
    that have all been joined together and
  • 41:12 - 41:15
    compressed into one package.
  • 41:15 - 41:18
    So 80,000 nodes total, about
  • 41:18 - 41:21
    two-thirds of the number of words total.
  • 41:21 - 41:23
    So we have ten nodes - ten-byte nodes
  • 41:23 - 41:27
    and 80,000 of them.
  • 41:27 - 41:31
    We are down to under a megabyte. So we've crossed that little interesting barrier of
  • 41:31 - 41:34
    the data structure on disk. The data that was fed into it, those words, was over a
  • 41:34 - 41:36
    megabyte and now, actually, we have created a representation that actually is
  • 41:36 - 41:40
    compressed, that uses less space
  • 41:40 - 41:42
    than the text file did in the first place. And that's really because it is
  • 41:42 - 41:44
    finding all these ways in which words are like each other
  • 41:44 - 41:46
    and sharing,
  • 41:46 - 41:48
    making use of the existing
  • 41:48 - 41:50
    structure in the
  • 41:50 - 41:51
    DAWG to add new words
  • 41:51 - 41:53
    without
  • 41:53 - 41:56
    adding bulk.
  • 41:56 - 41:57
    So once I've done this I have to say that I've made the data structure
  • 41:57 - 42:01
    even less flexible.
  • 42:01 - 42:03
    So to be clear, each of these steps has a cost somewhere else that made the code more
  • 42:03 - 42:07
    complicated. It also made it so that
  • 42:07 - 42:10
    now that I have all this sharing back here, that when I'm adding a new word, so I go
  • 42:10 - 42:14
    back to look at this picture here, that
  • 42:14 - 42:15
    I go in to add a word and I say oh, actually, the word
  • 42:15 - 42:17
    ''portioning'' is a word.
  • 42:17 - 42:19
    Well, ''adoptioning'' is not.
  • 42:19 - 42:22
    So
  • 42:22 - 42:26
    that will necessitate if I go in to add the word ''portioning,'' that I actually have
  • 42:26 - 42:30
    to split it off of its existing unified branch, because if I just tacked on the
  • 42:30 - 42:33
    ''ing'' on the end of ''portioning'' there, that it turns out that ''adaptioning''
  • 42:33 - 42:37
    and ''adoptioning'' would suddenly become words, too. So actually I have to
  • 42:37 - 42:38
    be extra careful once I've got it unified this way that when I add new words that I don't
  • 42:38 - 42:39
    actually
  • 42:39 - 42:42
    accidentally
  • 42:42 - 42:45
    add some additional words that I didn't plan on because these paths have
  • 42:45 - 42:49
    been shared. So it requires a little bit more careful management. Again,
  • 42:49 - 42:50
    the goal, though, here, in this case, was to try to build a data structure that was kinda freeze-dried.
  • 42:50 - 42:53
    So I
  • 42:53 - 42:55
    build it off the data structure - the input data file I had,
  • 42:55 - 42:57
    that then would operate very, very efficiently,
  • 42:57 - 43:00
    and I wasn't planning on making a lot
  • 43:00 - 43:01
    of runtime changes to it.
  • 43:01 - 43:04
    I get this down to this
  • 43:04 - 43:08
    and then the last thing I'm going to do to it
  • 43:08 - 43:11
    is go back to kinda the CS side of it and see if I can squeeze it down kinda
  • 43:11 - 43:15
    in terms of just data structure, what's being stored per node.
  • 43:15 - 43:18
    So I have 80,000 of these. So
  • 43:18 - 43:22
    typically, when you look at a structure,
  • 43:22 - 43:24
    it's very rare that the idea if you change something from an int to a
  • 43:24 - 43:28
    char, so it was a four-byte thing into a one-byte thing,
  • 43:28 - 43:32
    that you're going to have any profound impact on the total size needed.
  • 43:32 - 43:33
    But if you happen to know you're going to make a lot of these structures, tens of thousands,
  • 43:33 - 43:35
    almost 100,000 of them,
  • 43:35 - 43:38
    then it turns out every byte counts.
  • 43:38 - 43:40
    If I can get this thing down from, in this case,
  • 43:40 - 43:42
    ten bytes to eight bytes, or to
  • 43:42 - 43:46
    six bytes or four bytes,
  • 43:46 - 43:49
    then it will cut my memory by quite a factor.
  • 43:49 - 43:52
    So what I did in the final version of it
  • 43:52 - 43:58
    is I used a little technique called bit packing,
  • 43:58 - 44:00
    which is to actually stash stuff in the absolute minimum possible space
  • 44:00 - 44:04
    that I can fit it into,
  • 44:04 - 44:04
    taking advantage of things like, for example, if there are only 26 letters in the
  • 44:04 - 44:08
  • 44:08 - 44:10
    alphabet, a character is actually a full eight-bit value, which can hold
  • 44:10 - 44:13
    numbers from zero to 255.
  • 44:13 - 44:15
    And I just don't need that much space. I'm not using the yen sign, I'm not using the parentheses,
  • 44:15 - 44:19
    I'm not using the digits.
  • 44:19 - 44:22
    So having, in some sense, the capacity to store all those distinct characters is actually
  • 44:22 - 44:26
    a cost I'm paying on every character that I don't need to.
  • 44:26 - 44:29
    So in fact, I squeezed it down by dividing things up at the sub-byte level. This
  • 44:29 - 44:30
    is something that's talked about a little bit in Chapter 15 if you're at all curious. It's
  • 44:30 - 44:33
    not what I
  • 44:33 - 44:34
    consider core material for 106B, but I'm just sorta mentioning it to kinda
  • 44:34 - 44:36
    complete the picture.
  • 44:36 - 44:38
    Where five bits,
  • 44:38 - 44:41
    given that a bit is either zero or one,
  • 44:41 - 44:45
    that I have five of them connected together, then I have 2^5
  • 44:45 - 44:45
    different patterns that I can represent in that
  • 44:45 - 44:48
  • 44:48 - 44:50
    amount of capacity there, and that's 32 different patterns, which is enough for
  • 44:50 - 44:54
    my characters, plus a little slack.
  • 44:54 - 44:56
    I use exactly one bit for is word, rather than a full Boolean. All I need
  • 44:56 - 44:59
    to know is yes or no.
  • 44:59 - 45:02
    I also changed the way I did
  • 45:02 - 45:04
    knowing that you were the last child of a generation, rather than having
  • 45:04 - 45:06
    A say, ''there are 12 children here.''
  • 45:06 - 45:10
    I just said, go to index 27
  • 45:10 - 45:12
    and start reading, and one of them will tell you it's the last. So each of them says, ''No,
  • 45:12 - 45:16
    I'm not the last,'' ''No, I'm not the last.'' And then when you get to the last one, it
  • 45:16 - 45:18
    will be marked as ''yes,'' and that's how you'll know - that's where A ends and the
  • 45:18 - 45:20
    next one begins.
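Putting those pieces together, the per-node packing might look something like the bitfield below. The field names and widths are assumptions for illustration, not the exact lexicon.dat layout, but they show how a letter, the is-a-word flag, the last-child marker, and a child index can share four bytes.

```cpp
#include <cstdint>

// One DAWG node squeezed into 32 bits. 5 bits cover the 26 letters
// (2^5 = 32 patterns, a little slack), single bits flag "this completes
// a word" and "this is the last child in its run", and the remaining
// bits index where this node's children start in the array.
struct PackedNode {
    uint32_t letter      : 5;
    uint32_t isWord      : 1;
    uint32_t isLastChild : 1;
    uint32_t firstChild  : 25;
};

static_assert(sizeof(PackedNode) == 4, "node should pack into four bytes");
```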
  • 45:20 - 45:21
    And so if I can get it down in this way
  • 45:21 - 45:24
  • 45:24 - 45:28
    I've taken
  • 45:28 - 45:31
    what was a ten-byte structure down to four. And it turns out there's often
  • 45:31 - 45:35
    reasons why trying to get it down to a power of two is actually
  • 45:35 - 45:37
    a critical barrier. That if it were five bytes, it's actually likely to be eight,
  • 45:37 - 45:39
    just because of what's called alignment and padding restrictions.
  • 45:39 - 45:42
    So if I were at
  • 45:42 - 45:45
    seven and I squeeze down to six, it might not make any change. And if I squeeze down to five, it
  • 45:45 - 45:47
    still might not make any change. It might be, actually, artificially decided that
  • 45:47 - 45:50
    each of the structures is going to be eight behind the scenes.
  • 45:50 - 45:54
    But getting it down to four actually tends to be another one of those critical barriers where
  • 45:54 - 45:56
    okay, four actually will be smaller than something that was ten. And
  • 45:56 - 45:58
    so I take my 80,000 nodes,
  • 45:58 - 46:03
    I get them to be four bytes each,
  • 46:03 - 46:06
    and then I now have a data structure that's about a third of a megabyte in memory,
  • 46:06 - 46:09
    as opposed to the over a megabyte on disk
  • 46:09 - 46:11
    in the text form.
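(Roughly: 80,000 nodes at four bytes each is about 320 KB, a third of a megabyte.)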
  • 46:11 - 46:12
    And so that's how it does what it does. So if you're going to look at the code,
  • 46:12 - 46:16
    you'll see that
  • 46:16 - 46:17
    there is this complicated bit pack structure that
  • 46:17 - 46:20
    it does
  • 46:20 - 46:24
    lookups character by character using a recursive
  • 46:24 - 46:24
    strategy of start at the top, find the matching character, then go to the children index,
  • 46:24 - 46:28
    and then
  • 46:28 - 46:33
    work your way down the children to find the next matching character, and so on
  • 46:33 - 46:37
    to find its way through the data structure.
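That character-by-character walk can be sketched roughly like this, using illustrative field names and an iterative rather than recursive formulation: scan the current run of siblings for the matching letter, then jump to that node's children and repeat.

```cpp
#include <string>
#include <vector>

struct FlatNode {
    char letter;
    bool isWord;
    bool isLastChild;   // marks the end of a sibling run
    int  firstChild;    // index of the first child, or -1 if none
};

bool containsWord(const std::vector<FlatNode>& nodes, const std::string& word) {
    if (word.empty()) return false;
    int index = 0;                       // the top-level letters start the array here
    for (std::size_t i = 0; i < word.size(); i++) {
        // Walk along the siblings until we find the matching letter.
        while (nodes[index].letter != word[i]) {
            if (nodes[index].isLastChild) return false;   // ran off the end: no match
            index++;
        }
        if (i + 1 == word.size()) return nodes[index].isWord;
        if (nodes[index].firstChild < 0) return false;    // no children to descend into
        index = nodes[index].firstChild;                   // descend a generation
    }
    return false;   // not reached
}
```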
  • 46:37 - 46:41
    So [inaudible] is where the child index, or for example, the idea that
  • 46:41 - 46:42
    A says my children start at 27 or B starts at 39,
  • 46:42 - 46:45
    and so I just need
  • 46:45 - 46:51
    a number there, but I'm using
  • 46:51 - 46:55
    something slightly smaller than an integer, just to make them all fit very tightly. So I'll tell
  • 46:55 - 46:57
    you a couple of interesting things about it, actually, before I close this off, which is the
  • 46:57 - 47:02
    lexicon.dat file
  • 47:02 - 47:03
    actually just is an in-memory representation of what the DAWG data
  • 47:03 - 47:07
    structure looks like.
  • 47:07 - 47:11
    And so in fact, it's very easy to take the whole data structure. It doesn't have any pointers in it. Pointers, actually, tend to be a
  • 47:11 - 47:15
    little bit - make data structures harder to rebuild and
  • 47:15 - 47:17
    to dehydrate to disk because they have to kinda be wired up in memory. You don't know what
  • 47:17 - 47:20
    memory you're going to get from new and what the addresses are.
  • 47:20 - 47:22
    This one, actually, doesn't have any pointers. Everything is actually kinda self-contained. It's
  • 47:22 - 47:26
    just one big array of these blocks.
  • 47:26 - 47:27
    And all the information is relative in there. It says go to the index 27, index
  • 47:27 - 47:30
    45.
  • 47:30 - 47:34
    So you can take that whole block and just write it to disk, 300 kilobytes worth,
  • 47:34 - 47:36
    read it back in 300 kilobytes worth and it kinda is rehydrated
  • 47:36 - 47:39
    without any extra effort. So
  • 47:39 - 47:42
    that's actually what the lexicon.dat file is, it's just a dehydrated form of
  • 47:42 - 47:44
    the DAWG having been
  • 47:44 - 47:48
    built in memory. That makes it very, very fast to load.
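A rough sketch of that dehydrate/rehydrate step, assuming a flat array of fixed-size, pointer-free nodes (the function and type names here are made up for illustration): one bulk write saves the whole structure, and one bulk read brings it back with no pointer fix-up.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct PackedNode { uint32_t bits; };   // stand-in for the 4-byte packed node

// Write the entire node array to disk in one shot.
void saveDawg(const std::vector<PackedNode>& nodes, const std::string& filename) {
    std::ofstream out(filename, std::ios::binary);
    out.write(reinterpret_cast<const char*>(nodes.data()),
              nodes.size() * sizeof(PackedNode));
}

// Read it straight back; because everything is index-based, the structure
// is usable immediately after the read.
std::vector<PackedNode> loadDawg(const std::string& filename, std::size_t count) {
    std::vector<PackedNode> nodes(count);
    std::ifstream in(filename, std::ios::binary);
    in.read(reinterpret_cast<char*>(nodes.data()), count * sizeof(PackedNode));
    return nodes;
}
```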
  • 47:48 - 47:49
    At that point, though, it's super inflexible. If you want to add a word to
  • 47:49 - 47:52
    it,
  • 47:52 - 47:56
    the way the structure is built, you'd have to kinda add nodes and make sure it wasn't
  • 47:56 - 47:58
    glomming on to some existing words that were nearby, and so it would be a lot of work to
  • 47:58 - 48:01
    actually edit the DAWG in place.
  • 48:01 - 48:04
    In fact, when you ask it to add a word,
  • 48:04 - 48:05
    it actually doesn't edit that one at all.
  • 48:05 - 48:09
    You say I'd like to put a new word in.
  • 48:09 - 48:11
    It says I'm just - this one's so perfect the way it is. I've built this one, I love this
  • 48:11 - 48:14
    one, I'm very happy with it.
  • 48:14 - 48:16
    If you want to put some other words into it, it just keeps a second data structure
  • 48:16 - 48:21
    on the side, an
  • 48:21 - 48:21
    auxiliary one over here, where it sticks the additional words you ask it to put in. I
  • 48:21 - 48:24
    have
  • 48:24 - 48:28
    to say that seems like a very obvious thing to do, but in fact, I didn't even think of
  • 48:28 - 48:31
    that for years. We had the lexicon. I said you can't use the lexicon to make new
  • 48:31 - 48:32
    word lists or edit them. You can only load this one, beautiful, perfect one. You can't do
  • 48:32 - 48:35
    anything else. And
  • 48:35 - 48:38
    at some point people kept saying, ''I wish I could put some words in the lexicon. I wish I could use it
  • 48:38 - 48:41
    for the word list to keep track of the human and computer players.'' I'm like, ''Well, you just
  • 48:41 - 48:41
    can't do it. You don't understand. It's very complicated.
  • 48:41 - 48:45
    You
  • 48:45 - 48:50
    just don't understand my pain.'' Whatever. Whatever. I had excuses. And then, really,
  • 48:50 - 48:51
    after some number of years of just being a baby about it, I'm like oh, of course, why don't I
  • 48:51 - 48:54
    just
  • 48:54 - 48:57
    keep another structure off to the side? It's an abstraction. It's a word list. You don't need
  • 48:57 - 49:00
    to know what I do. The fact that I keep a DAWG for some of the words
  • 49:00 - 49:04
    and just an ordinary set for some of the other words
  • 49:04 - 49:06
    is really completely irrelevant. And as long as when you ask for words I
  • 49:06 - 49:08
    look in both and I tell you, ''Yes, it's there,'' why do you need to know
  • 49:08 - 49:11
    where I keep it? Okay, so it took
  • 49:11 - 49:13
    me a while to buy abstraction, even though I teach it all the time. It just
  • 49:13 - 49:17
    wasn't totally making sense.
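The abstraction being described can be sketched as two containers behind one interface. The class and member names are illustrative, not the actual Lexicon implementation: the frozen DAWG answers queries about the shipped dictionary, and an ordinary set quietly holds anything added later.

```cpp
#include <set>
#include <string>

class Lexicon {
public:
    // New words never touch the frozen DAWG; they go in the side structure.
    void add(const std::string& word) { extraWords.insert(word); }

    // A word "is in the lexicon" if either structure knows about it --
    // the caller never needs to know which one answered.
    bool contains(const std::string& word) const {
        return dawgContains(word) || extraWords.count(word) > 0;
    }

private:
    // Placeholder for the walk over the packed node array sketched earlier.
    bool dawgContains(const std::string& word) const { (void)word; return false; }

    std::set<std::string> extraWords;   // ordinary set for runtime additions
};
```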
  • 49:17 - 49:19
    So a neat little thing, just kinda keeping both behind. And so it turns out that I
  • 49:19 - 49:21
    learned about this, actually, from a paper
  • 49:21 - 49:22
    that I read at the - in the
  • 49:22 - 49:23
  • 49:23 - 49:27
    early '90s about somebody
  • 49:27 - 49:30
    building a Scrabble player, and they wanted a dictionary that
  • 49:30 - 49:33
    supported kinda finding plays on the board and being able to extend words. And so
  • 49:33 - 49:34
    knowing things about root words and how they extend and prefixes and suffixes was kinda the
  • 49:34 - 49:38
    original motivation.
  • 49:38 - 49:39
    And this kind of data structure is actually really great for any kind of word playing.
  • 49:39 - 49:42
    So you have
  • 49:42 - 49:43
    anagrams that you want to rearrange or words that you want to extend past suffix and
  • 49:43 - 49:46
    prefix -
  • 49:46 - 49:50
    this kind of data structure is really optimized for
  • 49:50 - 49:52
    doing that kind of traversal and kinda leveraging the existing patterns
  • 49:52 - 49:56
    in words to save space while doing so.
  • 49:56 - 50:00
    So it was a really fun little project to do. It was kinda neat to think about. Even though today,
  • 50:00 - 50:02
    if you needed a lexicon, you could just use a set. It's fast enough. It takes
  • 50:02 - 50:05
    a ton of memory, but you know what, memory's cheap.
  • 50:05 - 50:06
    So this is not the kind of thing you would need to do today except when you were in some kinda
  • 50:06 - 50:07
    embedded,
  • 50:07 - 50:11
    low memory,
  • 50:11 - 50:14
    very, very tight processor environment, but it still makes for kinda an interesting sort of artifact
  • 50:14 - 50:18
    to think about well, how you get there and what kind of things you can use.
  • 50:18 - 50:19
    So Wednesday we will talk about some big picture design, some kinda wrap-up stuff, and I will
  • 50:19 - 50:22
    see
  • 50:22 - 50:23
    you
Title:
Lecture 25 | Programming Abstractions (Stanford)
Description:

Lecture 25 by Julie Zelenski for the Programming Abstractions Course (CS106B) in the Stanford Computer Science Department.

Julie examines a case study, opening up the lexicon file, which is complicated; she walks the students through the code and explains why she wrote it as she did as opposed to a sorted vector or binary search tree. She then introduces the DAWG data structure.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=FE6E58F856038C69

CS 106B Course Website:
http://cs106b.stanford.edu

Stanford Center for Professional Development:
http://scpd.stanford.edu/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanforduniversity/

Video Language:
English
Duration:
50:36