This presentation is delivered by the Stanford Center for Professional Development.

The piece I need is not here, so I'm going to actually just talk to you about what it says on my slides, and then when Jason arrives we'll plug it back in and see what's going on. Oh, there's Jason. There we go.

So what I'm going to do today is actually something kinda fun. I'm gonna take a case study of a particular data structure design, and the one I'm looking at there is the lexicon.

So how many of you actually ever decided to just open up the lexicon file, just to see what was in there, at some point during the quarter? We used it a couple of times. Anybody want to tell us what they learned by opening up that file? Anything interesting?

It's a data structure called a DAWG. You're like, well, that's a pretty funny name for something. And was the code particularly well-[inaudible], easy to read, easy to understand? No. Who wrote that code? She should totally be fired.

Yeah, so this is what I wrote kinda a long time ago, actually, to solve this particular problem. And I'd like to take you through kinda how it is I came to discover the DAWG, and why it is the DAWG was the choice that made sense for the need pattern we had for the lexicon.

So let me just refresh what the lexicon does. It is a special case of kinda a set or map. It is a container where you're going to stick things in, and they're going to be unique, right? There's no need for any sort of duplicate. The keys are always strings. And in fact, they're English words, so that's going to be an important thing to remember: they're not just arbitrary sequences of letters, right? They have certain patterns to them and certain sizes that might actually be something we could capitalize on. When we know something about the domain, we're able to kinda special-case or tune our data structure to solve that problem very well in particular.

It has no associated value, so it doesn't have dictionary definitions or synonyms or anything like that, right? So it really just is exists or doesn't exist that we were interested in supporting.
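[Sketch: the operations being described, as a minimal C++ interface. The names here are illustrative stand-ins, not necessarily the actual class.]

    #include <string>

    // Minimal sketch of the lexicon interface under discussion.
    // No associated values: a word either exists or it doesn't.
    class Lexicon {
    public:
        void add(const std::string& word);                     // duplicates ignored
        bool containsWord(const std::string& word) const;      // exact membership
        bool containsPrefix(const std::string& prefix) const;  // does any word start with prefix?
    };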
And in addition to being able to add words and check if a word is contained, there's also this prefixing operation that we had talked a little about in Boggle, and that came back on the midterm. Why was prefixing important, right? It was critical for pruning those dead-end paths, those searches on the Boggle board that otherwise would really have taken a lot of time and kinda led to nothing of any value.

So what I'm gonna talk you through is some of the obvious choices based on data structures we already know about, and see where they would go. This is a thought experiment; it's not that I wrote all these to find out. Actually, I mostly did a lot of this on paper, trying to figure out what options we had, how they would play out, and what kind of trade-offs I'd have to make to make them work.

So the simplest case of any kind of collection is to think about whether a vector or array can do the job for you. In this case, having a vector that's kept in sorted (alphabetically sorted) order seems like a pretty good, easy way to get this set up and running. And how would it behave?

So in order to do a contains(word) on this data structure, what algorithm are you going to use to do your searching? Binary search, right? It's in sorted order; we've got to take advantage of that. If we do a linear search and I'm looking for the word "mediocre," I'm going to be trucking through, you know, thousands of words before I find it. So we definitely want to be using binary search: equals-equals comparing my strings, less than, greater than, dividing in half. And so it should run in logarithmic time. Logarithmic time is one of those great performers that you learned about in the PQ assignment, hopefully; it's basically not measurable on today's computers. So even for entries of 100,000 or 500,000, it's basically free.

How do we do contains(prefix)? Can we also do a fast prefix search given this data structure? I hope so, right? It's the same kind of binary search. In this case, though, we're looking at substrings: if I'm looking for a three-character prefix, I compare only the first three characters to decide which way to go. Once I find something that matches in the first three characters, I'm done.
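[Sketch: those two searches on a sorted vector, in C++. std::binary_search and std::lower_bound stand in for the hand-rolled binary search being described.]

    #include <algorithm>
    #include <string>
    #include <vector>

    // Exact lookup: a plain binary search over the sorted words. O(log n).
    bool containsWord(const std::vector<std::string>& words, const std::string& w) {
        return std::binary_search(words.begin(), words.end(), w);
    }

    // Prefix lookup: sorted order clusters all words sharing a prefix together,
    // so the first word >= prefix starts with it if and only if any word does.
    bool containsPrefix(const std::vector<std::string>& words, const std::string& prefix) {
        auto it = std::lower_bound(words.begin(), words.end(), prefix);
        return it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
    }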
So again, we're still using the sorted property to quickly narrow in on where it could be.

When it's time to add a new word, if someone asked you to put in the word "abalone," I've gotta use that binary search to find out where it goes. So in logarithmic time I can tell you where the new word goes, but then to put it into position, there's gonna be some shuffling. And that's where this operation starts to bog down, given that you might be moving half or more of your items, on average, to make that space open up. It is going to require a linear-time operation to get new words into the data structure.

Okay, probably still a reasonable approach, right? Contains(word) and contains(prefix) are the operations that get done immense numbers of times in Boggle; you're doing a ton of them as you're exhaustively searching that big board. But the adds are done kinda once at the beginning. You build that big dictionary and then you subsequently use it. And so if it took a little bit more time to build that thing, sorta an n-squared operation to load it up, well, maybe that would be okay.

The other thing we want to look at is space usage, and some of this is a little bit of an interesting artifact of the time I was working on this. It turns out space was actually a limiting factor, in terms of the core memories of our machines and what we could take up for our dictionary.

In this case, we have per entry the size of the string. We don't exactly know how the string is represented; it's an abstraction to us. But we can make a simplifying assumption that it's about the length of the string times the size of a character. There might be a little more housekeeping associated with it, but that's a pretty good estimate. So in fact, it's the string data itself that's being represented in this vector, without much other overhead. If we say words average about eight characters, that's about eight bytes per word. So if we have 100,000 words, we expect to have about 800,000 bytes invested in this sorted vector arrangement.
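[Sketch: the add just described, with includes as in the sketch above. Finding the slot is logarithmic, but vector::insert has to shuffle everything after it.]

    // Insert w into its sorted position, keeping entries unique.
    void add(std::vector<std::string>& words, const std::string& w) {
        auto it = std::lower_bound(words.begin(), words.end(), w);  // O(log n)
        if (it == words.end() || *it != w) {
            words.insert(it, w);  // O(n): shuffles the tail over to open the slot
        }
    }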
Excess capacity and a couple of other things might tweak that a little bit, but that gives us at least a good starting point to think about how this compares to alternatives.

Okay, so then we said that binary search is the real advantage of keeping it in sorted order, but the contiguous nature of the array kinda bites us in terms of insert. If we reorganize into a binary search tree, to give ourselves the dynamic flexibility of wiring in the left and right halves through pointers as opposed to contiguous memory, then we know we can also get add to run in logarithmic time.

So looking at these operations: contains(word) is a tree search that's using equals-equals and less-than/greater-than to decide which way to go, and it also performs logarithmically if the tree is balanced. Contains(prefix), same deal: keep comparing against the prefix, working our way down, knowing which way to go. And add can also perform in logarithmic time, because it does the same search, falling off the bottom and then inserting the new word there, or finding it along the way and not having any work to do.

So we can get all our operations done in logarithmic time, which we know to be fast enough to not be worth fighting any harder for. Where did we pay for this? Nothing really comes for free, right? Overhead. And in this case, memory is certainly one of the places where there's going to be a pretty big cost: we now have the string data for each entry, plus a left pointer and a right pointer. And if we're doing the work to keep it balanced, which we probably ought to do if we want to avoid the degenerate cases of it getting totally out of whack and degenerating into linear time, we're going to have a couple more bytes thrown into that factor, too. So if we assume four bytes for each of the two pointers plus another two for the balance factor, then we've added ten bytes of overhead on top of the eight bytes for the word. Getting a little bit hefty. Something to worry about.
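[Sketch: the node layout that accounting describes, assuming 4-byte pointers as on the 32-bit machines of that era.]

    #include <string>

    // Each entry carries ~10 bytes of overhead beyond the string itself.
    struct TreeNode {
        std::string word;     // the entry (~8 characters on average)
        TreeNode* left;       // 4 bytes
        TreeNode* right;      // 4 bytes
        short balanceFactor;  // 2 bytes, maintained to keep the tree balanced
    };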
Also worth weighing in your mind, which I didn't put a bullet up for, is code complexity. Building the fancier, more complicated data structure probably meant you spent more time developing it and more time debugging it, right? And it will be harder for someone later to maintain and deal with, because it's just more complicated code than a sorted vector. Let's say they want to add a new operation, say remove, so you're able to take a word out of the lexicon. The lexicon doesn't currently support that. Somebody adding that on the sorted vector implementation probably won't find themselves too tripped up. Someone who has to do a remove out of a binary search tree, removing the node, reattaching the subtrees, and not destroying the balancing, is facing a pretty complicated operation. So we've invested in a data structure that will keep on creating further development hassles for any modifications down the road. It is a big investment in your brainpower to keep it working.

So let's think about what a hash table can do for us. The hash table is the last thing that we looked at on Friday, kinda moving away from this whole idea of sortedness and moving instead to this idea of a hashing function. So if I have a hash table that has a certain number of buckets, and that uses a linked list for chaining to handle collisions, then what I would have is my words just scattered around my table, allowing me quick access by hashing to the right bucket and then working my way down the bucket to see.

So contains(word) does a standard hash table lookup. Take the word, you know, "mediate," and run it through the hash function. It says, oh, it's bucket three in this case. I go to bucket three, I walk down the linked list, and I find it or not. If the word "mediate" is in the table, it had to be in bucket three. No questions asked. The expected time for that is going to be n over b, where n is the number of entries and b is the number of buckets.
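[Sketch: that lookup on a chained hash table. hashFn and the Cell layout are illustrative stand-ins, not the actual code.]

    #include <string>
    #include <vector>

    struct Cell {
        std::string word;
        Cell* next;   // chain of entries that hashed to this same bucket
    };

    // The word can live in exactly one bucket; walk only that chain.
    // Expected O(n/b) with n entries spread across b buckets.
    bool containsWord(const std::vector<Cell*>& buckets, const std::string& w,
                      int (*hashFn)(const std::string&, int)) {
        int b = hashFn(w, (int)buckets.size());
        for (Cell* cur = buckets[b]; cur != nullptr; cur = cur->next) {
            if (cur->word == w) return true;
        }
        return false;
    }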
If we have the number of buckets set to be on the rough order of the number of entries, then we're going to get constant access there.

Now we want to do contains(prefix). I want to know if there are any words in this table that begin with "med." Where do I look? If I run "med" through the hash function, do I expect that I would get the same number as other words that begin with "med," like "mediate" and "median" and "mediator"? Yes? You say yes. All right, let's take that to an extreme. If a prefix and all the longer words built on it hash to the same place, then think about the simplest case of a prefix, "m." Should all the words that begin with "m" hash to the same place? If they did, we're in trouble, right? If all prefixes hashed to the same place, then what I'm going to end up with is a really heavily clustered table: 26 buckets in use, one with all the "a" words, one with all the "b" words, on down to all the "z" words, and none of the other buckets would be used.

So in fact, as part of the good behavior of a hash function, we're hoping that it jumbles things up, that a word that looks like another but has one letter added on the end goes somewhere totally different: that "median" and "medians" hopefully go to two totally disparate places. If there were patterns where words that were substrings of each other all got hashed to the same location, we would end up with heavy clustering, because there are a lot of words that repeat parts of other words in the dictionary.

So if I want to know whether there are any words that begin with "med," I have no a priori information about where to look for them. They could be anywhere. I could hash "med" and see if "med" itself was a word, right? But say it was something that itself was just the beginning of something: "st," "str." There are a lot of words that begin with "str," but "str" itself is not a word. So hashing to that bucket and looking would only tell me if exactly "str" was there, and occasionally there might be some "str" words also in that bucket. But other than that, we just gotta look. And where do we look? Everywhere.
Everywhere, everywhere, everywhere. In order to conclude there is no word that begins with the prefix we have, the only way to be confident is to either find something, or to have searched the entire table and not found it. So it requires a completely linear, exhaustive look through the entire contents of the hash table to know that something, for example, is not a prefix. You might find one quickly; in the best case you happen upon one in the first couple of buckets. But no guarantees overall. And certainly when it's not a prefix, it's going to require a complete look-see of everything. That's kinda a bummer.

Adding a word is back to the fast behavior again: hashing to the bucket, checking to see if it's already there, adding it if needed. So the hash table gets good times on those two operations, but then really bottoms out when it comes to supporting that prefix search.

Its space usage is roughly comparable to that of the binary search tree. We've got the string data itself, and we've got the four-byte next pointer on the linked-list cell. There's also the four-byte bucket pointer, which is over in this other structure, the array of buckets. If the number of buckets is roughly equal to the number of cells, though, then you can count it per entry: each entry has this 12-byte cell, plus a four-byte bucket pointer somewhere that you can think of as associated with it on a per-entry basis. That comes to about eight bytes of overhead on top of the string, so 16 bytes for each entry in the table.

Okay, that's what we have so far. We've got pretty good performance on contains(word) no matter what we choose; we can get either logarithmic or constant time with any of these three data structures, so perfectly acceptable performance there. We cannot get contains(prefix) to run well on a hash table; it just doesn't happen, the way hashing works. And add can be fast on the two complicated pointer-based data structures to the right, but can't be fast on the sorted vector.
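[Sketch: the exhaustive contains(prefix) scan a chained hash table is reduced to, illustrating why it can't run well. Cell is as in the lookup sketch above.]

    #include <string>
    #include <vector>

    struct Cell { std::string word; Cell* next; };

    // Walk every bucket and every chain: linear in the table size
    // whenever the prefix is absent.
    bool containsPrefix(const std::vector<Cell*>& buckets, const std::string& prefix) {
        for (Cell* head : buckets)
            for (Cell* cur = head; cur != nullptr; cur = cur->next)
                if (cur->word.compare(0, prefix.size(), prefix) == 0)
                    return true;   // found a word beginning with prefix
        return false;              // had to look everywhere to be sure
    }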
And then we see some trade-offs here in terms of space versus time. There's very little space overhead in the sorted vector case, but it blossoms into two, almost three times the storage needed in the BST and hash table cases. The code also got a lot more complicated when we moved up to those data structures.

So then I'll tell you that the one type of performance that was interesting to us when I was exploring this, in fact the most important at the time, was the memory usage. The memory usage was really the big obstacle we had. The Official Scrabble Players Dictionary II is the source for the word list we're using. It has about 128,000 words, I think, is exactly the number, and the average length of those is eight characters. The file itself is a little over a megabyte on disk, if you were just to look at the listing of all the words. If we were loading it into the sorted vector, we'd get something roughly equal to that: about a megabyte of space, plus a little bit of noise. The ones that double up take it to two, two and a half megabytes total.

At the time we were working, and I know this is going to seem completely archaic to you, we had these stone tablets that we worked on. They typically had main memories of about a megabyte, maybe, in an exciting case, four megabytes; four whole megabytes of RAM. So it was not possible to dedicate one or two megabytes of your main memory to just your lexicon. There just wasn't enough space to go around. So we were working in a very tight environment in that respect, a little bit more like today's world of embedded devices. If you're building something for an iPod or a cell phone, you have a lot less gargantuan memory. Your average desktop machine has a gigabyte of RAM, so now you're like, whatever, a megabyte here, ten megabytes there, free, all the memory you want. So this seems kinda like a silly thing to focus on, but this was 1995, and yeah, there was certainly more interest in keeping it tight to the memory.
The Boggle lexicon that we have actually takes under a third of a megabyte. The memory that's used while it's working, searching and checking for words and prefixes, is actually smaller than the on-disk form of the data we started with. So it does a bit of compression as part of its strategy for storage, which is pretty interesting to think about.

So I'll tell you the truth about how we first did Boggle, because it's a little interesting historical perspective. Boggle was given for the first time by a colleague of mine, Todd Feldman. He was the guy who came up with the idea, which is a great idea, and he said, "Oh, this is a great assignment." And when he wrote it, he said, "Oh, I need a lexicon." So as part of putting the assignment together, he looked around and found a word list that had about 20,000 words. I think we first found the bigger word list and said, "Oh, it's just impossible. It's going to take a megabyte or more and we just don't have that kind of space. We need much smaller data files." So we had a data file of about 20,000 words, which takes about 200K. And he wrote it using a hash table.

So given that he wrote it as a hash table, he just didn't support contains(prefix). He just didn't put it in, because you can't do it efficiently. So it took a smaller amount of space, it had a smaller word list, and it wouldn't do contains(prefix).

So then people wrote Boggle. And as you learned from that midterm question, you can write Boggle without contains(prefix). It spends a lot more time looking down these dead ends, but it eventually bottoms out when it runs off the board and uses all the cubes. And in some ways that almost made the program a little more satisfying to write, in one odd way. It's the computer's turn that uses the prefix pruning. So you would type in your words and it would be perfectly fine finding them; it doesn't use the prefix part when it's doing the human's turn. You're typing in words; it's finding them.
Then you would go to do the computer's turn and you'd say, "Okay, go." And now it would take a really long time. It'd be hunting around the board. It would be working so hard your fan would be coming on. You'd feel like, oh, my computer's doing something. And then it would show up with twice as many words as you found, or ten times as many, whatever. And then you would say, "Well, at least it had to work hard to beat me." And you felt like, I wrote this program that causes my processor to get busy; I feel good.

And in fact, the word list was pretty small, so it wasn't even as likely to beat you, because it didn't know as many words. 125,000 words is an enormous number of words. I've heard different figures on this, but the average person's English vocabulary is about 20,000 to 30,000 words. It tends to be about 10,000 higher if you've gone to college, so maybe 30,000 to 40,000. The fact that it knows 80,000 more words than the average college graduate means it knows a lot of obscure words, too. So not only does it very quickly find all those words and produce these ridiculous words that you've never heard of, it's almost like it's taunting you by doing it in a millisecond. Back then it took a while, so it was like, okay, it beat me, but it had to work hard.

But then I took over Boggle. I taught X maybe a quarter or two later, and I said, "I'm going to write a lexicon. I'm going to make a little project for myself, because I don't have enough work to do." And my goal was to make one that would do prefix, but that also would be tight enough on memory that we could load a much bigger dictionary file, because I thought that would make the game even more fun to play.

All right, so here's where I started. This is a little excerpt from the middle of the dictionary: "stra." Yeah, that's totally the thing you notice, right, when you look at this thing. You see "straddle," "straddler," "straddlers," "straddles," "straddling," and then these words that you can't believe you're allowed to make, but apparently you can put "er" and "ier" and "ies" and "ing" on almost anything out there and it makes a valid word.
So although you never really thought about the fact that "straightforward," "straightforwardly," and "straightforwardness" are all in there, there is a huge amount of repetition. And this is where I said we're going to use some information about the domain. These aren't just arbitrary strings. They're not random sequences of letters picked out of the air; they developed over time. They're word roots, they're suffixes, there are these prefixes that mean things. Looking at this little excerpt out of the middle, there's an awful lot of words that repeat portions of other words, and that may be part of what we can capitalize on. Instead of really recording "straightaway" and "straightaways," repeating those first eleven characters and adding an "s," maybe there's some way I can unify the data I have here. It's the same way with code: when you have the same passage of code repeated two or three times, we're always saying unify it, move it into a helper and call it in three places. Can we do the same thing with our data? Can we share the parts of the words that are in common and only distinguish where they are different?

Take a look at this. This is a little portion of something called a trie. A trie is spelled t-r-i-e; it actually comes from "retrieval," but sadly, it's still pronounced "try," just to confuse you. It is a variant of a tree. In this case it's a tree that has up to 26 children coming out of any particular node, one for each letter of the alphabet. Starting from the root node at the top, all the words that begin with "a" are down the "a" branch, and then there's a "b" branch, and there'd be a "c" branch and a "d" branch. The levels of the tree correspond to positions in the word: the root's children are all the zeroth characters of the words, and the second level is all the characters that can follow that zeroth character. So here are the "ab" words and the "ac" words and the "ax" words. Over here are the "st" words, but there's no "sx," so there are some places that aren't filled out in this tree.
And as you keep going down further, the depth you trace through the tree corresponds to tracing through the characters, in sequence, that make words. So in this form, tracing out a path through this tree from the root down to a particular node tells you about prefixes. There are words that begin with "a" and words that begin with "ac" and "ace." And then there's a little notation there, shown visually by the bold circle around a node, that says: the path that led here from the root is a word. So "a" is a word, and "ace" is a word, and "aces" is a word. "Act" is a word, and so are "acted" and "actor." And in this case, the "act" part is totally shared, so everything that comes off of "act" ("acting," "acted," "actor," "actors") can all share that initial part.

For a word like "act," that doesn't seem so profound. But you start thinking about words like "strawberry" or "straightforward" or "stratosphere," and realize that "stratospheric" and "strawberries" and "straightforwardly" can all leverage the ten or twelve characters that were already invested in the root word, and that "straightjacket" and "straightforward" can even share "straight." There are actually a lot of prefixes that can be unified across the space of the dictionary. So knowing that words have these patterns helps us do this.

So let's build that. The first form of this, and this, again, is a thought experiment (I did not actually write it this way, because it would be absolutely astronomically expensive, but it was a good way to think about where I needed to go): I designed a new kind of trie node that had the letter, a little Boolean flag of "is this a word" (whether it had the dark circle around it), and then a 26-member children array, pointers to other nodes just like this one. So it's totally recursive, starting from that root. The idea is to use those 26 children, with "a" in the first slot and "z" in the last slot, to make it really easy to trace letter by letter down the levels of the tree.
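[Sketch: that first, deliberately wasteful node, assuming 4-byte pointers; field names are illustrative.]

    // One slot per letter: children['c' - 'a'] is the subtree of words whose
    // next character is 'c', or nullptr if no word continues that way.
    struct TrieNode {
        char letter;              // the character on the edge leading here
        bool isWord;              // the "dark circle": path from root is a word
        TrieNode* children[26];   // 26 four-byte pointers = 104 bytes
    };                            // ~106 bytes per node in total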
So I'd have this one root node, and then all the words that begin with "a" are off that first slot, and all the "z" words off the last slot.

So when I want to know if a word is in the dictionary, what do I have to do in this data structure? How do I find it? I want to find the word "far." Where do I look? Left? Right? Down? Which way? It would be great if you would tell me; then I would understand what my data structure is. [Inaudible] Exactly. So at the very top I say, okay, "f" is my first letter. Match my "f," get me to the "f" node, and now it says recursively find "ar" from here. So it's very recursive: find the first letter, then recursively find the rest of the string from there, working down. Find the "a" out of the next node, find an "r" out of the next node, and when you get to that node, check the little "is word" Boolean: was the path I traced marked in the dictionary as leading to something that's known to be a word?

The prefix search looks exactly the same, except for that last check of "is word." Given that nodes only exist because there are words further down, I just trace out a sequence of letters, "str," and as long as there's something there, I know that from there are subsequent children beneath it that lead to words.

And adding a word does the same operation. If I need to add the word "strawberry," I start tracing "s," "t," "r," and at some point I find places where the nodes don't exist. Then I start creating them from there on down. So it traces what exists, adds new nodes off the bottom, and marks the final one as, yes, this is a word.

What's interesting is that all three of these operations make no reference to how many words are in the dictionary. They depend only on the length of the word you're searching for or adding, whether there are ten words in there, 1,000 words, or a million words. In no way does the big O depend on how big or how loaded the data structure is. That's kinda wacky.

The longest word, according to the Oxford English Dictionary? There you go.
You're not going to make that on the Boggle board anytime soon, because it's only got 16 cubes. Even Big Boggle can't really make that puppy. But just so you know, 45 characters is not so many, right?

And in fact, here's a little thought for you. Not only is it not dependent on the number of words, it actually has a little bit of an inverse relationship with the number of words, especially the add operation. As the data structure becomes more loaded, there's typically less work to do in add. If "strawberry" is in there and you go to add "strawberries," you already have "strawberry" to depend on; you just need to build off the end of it. The more words that are in there, the more likely that some of the nodes you need are already in place, and then you can just blow past them and add your new nodes. Now, that's odd. With most data structures, clearly, as they get heavier, it takes more work to install new things, not less.

[Inaudible student question.] Well, the thing is, they're organized by letter in that 26-member array, so I don't even have to look. I just go straight to the slot. If I'm looking for "strawberry," I go straight to the "s" slot, then the "t" slot, then the "r" slot. If the node was already in place, I'm not searching for it; it really is exactly in the "r" slot. That's why there are 26 of them. [Student: In each array?] Each array is 26, they're "a" through "z," and if there's an empty slot, we're using a null. So in this case, it's optimized to make it super fast. I don't even have to look through the letters to find the one that matches. There's no matching; it goes straight to the right slot. We're going to see this is a consequence we can't tolerate, but it is at least the starting point for thinking about how to build the data structure.
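[Sketch: the recursive lookup on that 26-slot node, assuming lowercase alphabetic input; names are illustrative.]

    #include <cctype>
    #include <string>

    struct TrieNode {
        char letter;
        bool isWord;
        TrieNode* children[26];   // 'a' in slot 0 ... 'z' in slot 25
    };

    // Follow one child per character: no scanning, just direct indexing.
    // Runs in O(length of the word), independent of how many words are stored.
    TrieNode* trace(TrieNode* node, const std::string& s, size_t i = 0) {
        if (node == nullptr || i == s.size()) return node;
        return trace(node->children[tolower(s[i]) - 'a'], s, i + 1);
    }

    bool containsWord(TrieNode* root, const std::string& w) {
        TrieNode* end = trace(root, w);
        return end != nullptr && end->isWord;   // path must be marked as a word
    }

    bool containsPrefix(TrieNode* root, const std::string& p) {
        return trace(root, p) != nullptr;   // any node here implies words below
    }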
So the space usage is where this thing really, really suffers. This is an extreme example of a space-time trade-off: we've got something that optimizes beautifully for O(length of the word), meaning typically eight operations, on average, to add or to search for something. But the space we're using here is 106 bytes per node: the 26-member array of four-byte pointers, plus two bytes for the character and the Boolean. One hundred and six bytes is a lot of memory, and the tree has about a quarter of a million nodes. Given the 125,000 words I said are in the input, building it into a trie takes about 250,000 nodes. That's an interesting number to think about, though: 125,000 words take 250,000 nodes, which tells you that on average each word contributes about two unique nodes. Even words like "straightforward" and "straightforwardly," on average, are sharing enough of their common prefix parts that there's very little unique data added by any particular word, averaged across the whole thing. That's pretty neat to know.

However, this is just not going to fly. I was telling you that my main memory had four megabytes, maybe, on a good day, and now I've built a data structure that's going to require six times that in storage. Okay, we gotta squeeze that down.

So the first thing we can do is realize that allocating a full 26-member array, of which I only plan on using some of the slots, is pretty wasteful. Instead of committing to 26 per node, how about an array that's sized tight to the number of children actually coming out of this node? The very first node will have 26; there are words that start with every letter of the alphabet. But from there, it very quickly winnows down. Take the letter "x": you do not need the full 26 children coming off of "x." There are only a few letters that follow "x" in English that need to be fleshed out. You need the "xa," you need an "xe," but you don't need "xj," you don't need "xk," and a whole bunch of things like that. So I change the structure: instead of a 26-member array of node pointers, it's a pointer to a dynamically sized array of node pointers, and I keep track of the number of children in a second field. The first node's array will be allocated to 26; another's might be allocated to eight.
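[Sketch: the tightened node just described, counting bytes the same way the lecture does, that is, ignoring alignment padding.]

    // A dynamically sized children array replaces the fixed 26 slots.
    struct TrieNode {
        char letter;           // 1 byte
        bool isWord;           // 1 byte
        int numChildren;       // 4 bytes: how many slots the array holds
        TrieNode** children;   // 4 bytes: points to a tight array of child pointers
    };
    // Per node: ~10 bytes of node plus numChildren * 4 bytes of pointers
    // (about six children on average, instead of 26 mostly-null slots).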
This is going to change our running times a little bit, because now we won't be able to go exactly to the "s" slot; we'll have to go look for it. On each step down the tree, we'll be looking through that sized array to find the right match. But that adds a constant factor, basically, on top of O(length of the word).

The space win is that we're just not storing all those nulls, and there were a lot of nulls. The node itself is just ten bytes, and then there's some space in this dynamic array out there, which is the number of children times the four-byte pointer. On average, a node has six children, across the 250,000 nodes in the data structure that gets built from this. So 26 was way overkill; about 20 of the slots held nulls for most of the nodes. And many of them, for example at the very bottom of the tree, have no children at all. You get to a word like "straightforwardly": there's nothing after that "y," so it has zero children coming out the back of that one. Certainly having 26 null children pointing away from that was a waste we can tighten up.

So when we get it down to this, we'll have 34 bytes per node on average. Still a quarter million nodes; still eight and a half megabytes. So we're not winning any prizes just yet, but we are making good progress. That got us down a factor of three or four right there. I still have the same number of nodes; all I changed was the pointers to the subsequent nodes further down the tree. The letters and how they connect to each other, the connectivity, is still pretty much the same. It's just that in any particular node, I was storing a whole bunch of nulls, and now I'm not. So the number of nodes is exactly as it was before.

I'm going to do one other little squishing thing on this data structure, and then I'm going to flip gears for a second. Still using my CS side of things, how can I squash things down? I'm going to use a technique that we used in the heap.
In the case of the binary heap that we implemented in the PQ assignment, you'll remember that we drew pictures in the handout that showed a heap with a parent with two children, which has four grandchildren, and so on. But we actually compressed that down into an array, and in the array there was this pattern: here is the root node, call that level zero; then the next two nodes are level one; and the next four nodes are level two; and so on. There's one here, two here, four here, eight here. We flattened it down by generation, looking at what was at the top and what was underneath it. When we did that, we removed all the space for pointers. Although we drew those pictures of the heap as though there were pointers there, in fact there were not; no pointers were being stored in that version of the PQ. And so that gets us back to array-like storage requirements, but by flattening it down into the array and still using the tree conceptually to work our way through it, we get the advantage of a traversal that's based on the height of the tree, not the length of the vector.

So we do the same thing with the trie. It's a little more complicated, because there were a couple of things about the heap that made this easier. The heap always had two children, so there were rules about how the tree was filled in that meant you could always compute the parent and child index with that divide-by-two, multiply-by-two exchange (in a one-based layout, the children of the node at index i sit at 2i and 2i + 1). In this case, what we're going to do is lay it down by generation, but we don't know how big a generation is going to be, so we're going to have to keep track of it manually.

So the root node will have all of "a" through "z." This will be level one, which is "a" through "z." Then level two is a bunch of slots: the "a" followed by some letter, then the "b"s followed by some letter, then the "c"s followed by some letter, and so on. And further down is level three, which are the three-character words.
So these are all the single-character paths, two-character paths, three-character paths, kinda flattened down. And if I look at what I changed here: my first node says it has no letter - that's the root node - it's not a word, and its children begin at index one, the very next location. So it's not a multiply-by-two-and-add-one or something like that. It just says: here's where my children begin, they're at slot one in the array, and there are 26 of them. So that means A, B, C, D are kinda in order there. If I look at A: ''a'' is a word, and its children begin at index 27, and there are 12 of them - so there are 12 characters out of the 26 that can follow an ''a'' in the sequence of a valid word. And then B's children begin at index 39, which is 27 plus 12. So A's children are sitting here at indexes 27 to 38, and then 39 onward - I think B has eight or so children - and then C, and so on. So we've flattened it all down by generation into an array. That means all the space for pointers has now been removed. One serious consequence of this, though, is that the pointers got us flexibility. Insert and remove depended on the pointers really being there - pointers buying us the ability to insert and delete without doing a shuffle. Now, once I have flattened it, I have suddenly bought back the same problems that arrays have. If I have the words here with ''ab'' for abalone and ''ac'' for act, and I don't happen to have any words with ''ad,'' and I try to add the word ''add,'' then in order to put ''ad'' in there we're going to have to move everything over, and there's going to be a lot of shuffling and whatnot. So I'm starting to build a data structure that's losing some flexibility but saving space. I'm, in some sense, focusing my attention on how I can build a data structure that's really fast to search, even if it's a little harder to build. Because it turns out the thing I most care about is making sure it can very, very efficiently look up words. Maybe building it once, being a little slow, isn't so painful.
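Here is a hedged sketch of what lookup over that flattened layout could look like; the field and function names are mine, not the lecture's:

    #include <string>
    #include <vector>

    struct FlatNode {
        char letter;      // letter on the incoming edge ('\0' for the root)
        bool isWord;      // does the path ending here spell a word?
        int firstChild;   // index in the array where this node's children begin
        int numChildren;  // how many consecutive slots those children occupy
    };

    bool contains(const std::vector<FlatNode> &nodes, const std::string &word) {
        int cur = 0;                                    // root lives at slot 0
        for (char ch : word) {
            int begin = nodes[cur].firstChild;
            int end = begin + nodes[cur].numChildren;   // children are contiguous
            int next = -1;
            for (int i = begin; i < end; i++)           // stored in alphabetical order
                if (nodes[i].letter == ch) { next = i; break; }
            if (next == -1) return false;               // no branch for this letter
            cur = next;
        }
        return nodes[cur].isWord;
    }

A prefix search is the same walk minus the final isWord check.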
So the space issue here: this gets us down to just ten bytes per node. All the pointers are gone - there were, on average, 24 bytes of child pointers per node that we've just eliminated. And that means we've got a two-and-a-half-megabyte memory footprint right now. So we're kinda getting close to hitting the range of the binary search tree and the hash table. Now I'm going to go back to the domain again. I worked on it kinda from the CS side, thinking about things I know about data structures and reorganizing arrays and heaps and pointers. I'm going to go back and look at that domain again and see if there's something else I can take advantage of. So this is a different cross-section of the dictionary, showing the words ''adapt,'' ''adapted,'' ''adaptor,'' ''adaptors,'' ''adapting,'' ''adaption,'' ''adaptions,'' ''adapts.'' So eight words that are all built off the ''adapt'' root with a bunch of suffixes. I look over a little bit further and I've got ''desert,'' ''deserted,'' ''deserter,'' ''deserting,'' ''desertings,'' ''desertion,'' ''desertions,'' ''deserts.'' Okay, and ''detect'' and ''insert'' and ''perfect'' and ''invent'' and ''interrupt,'' all verbs to which you can apply the past tense, the gerund form, and the ''ion'' endings that tack onto that root word to give you these suffixes. And each of the words shown here has exactly the same set of suffixes that can be applied to it. So I did this thing about prefixes - I went out of my way so that all these words could share the same prefix: ''assert,'' ''asserter,'' ''asserting,'' ''assertings.'' But is there some way that I could take this idea with ''corrupt'' and ''assert'' and ''insert,'' and then, once I get down to that ''t'' where they end, is there a way they can also share their back-end parts? Those ''ing''s and ''ion''s - can they be unified as well? Well, why not? So here is the first version of what's going to be the DAWG, the Directed Acyclic Word Graph. It takes the idea of unifying the prefixes and now applies it to the suffixes. There are a lot of words that end in the same letters, as well as start in the same letters.
Why not just apply the same kind of unification and sharing on the back end? So in comes ''adapt'' and ''adopt'' and ''baste'' and ''port,'' and out the back end is ''ports'' and ''portion'' and ''porting'' and ''porter'' and ''adopter'' and ''adopted'' and ''basting'' and ''bastion,'' all sharing the endings, because they have the same viable continuations and the same connections between them. So, for example, they're all sharing the ''er,'' and the ''er'' leads into the ''s.'' And that same ''s'' is being shared from ''portion'' and ''baste'' and ''adopt'' and ''adopter,'' all coming into the same ending: these things that end in ''s'' are words, so here's one ''s'' that is a word for them all to share. So it is directed, and it is a graph. Once we remove the idea that there's a single path from the root to any node, we're no longer a tree structure. We can get to ''t'' by going through ''adapt'' or ''adopt'' or ''port,'' and that no longer fits the definition of a tree. It's become a graph - we can get to these nodes from different places. The arcs go one way, in this case: we can't go back up the graph to find the words in reverse. And there are no cycles - acyclic - so you can't just loop around some portion of the graph and repeat letters as many times as you want. And certain nodes are marked, the same way they were in the trie: a path from the root node to a marked node represents a word. So building this is a little bit complicated. I'll give you a little bit of the intuition for it, but I won't get too deep into it. Largely, the way you do it is you build the trie, which doesn't have the unification of the suffixes - it only unifies the prefixes - and then you work backwards on the suffixes. So in some sense, you look at all the words that end, for example, in ''s'' in the dictionary. You find all the ones that end in one ''s'' that goes nowhere, and you say: look, this ''s'' looks like every other such ''s'' in this thing - all these words that are just plurals, or verbs that can have an ''s'' tacked on.
Let's share that ''s.'' So you take however many s's there were - 1,000 of them, say - and you declare these are all the same ''s.'' And then you look at all the nodes that lead into that ''s.'' You're kinda traversing the graph in reverse, and you say: well look, here's a bunch of nodes that lead in on an ''e.'' Depending on whether that ''e'' is a word or not, there might be a group where the ''e'' is also a word - like ''apple'' and ''apples'' - but sometimes the ''e'' is not a stopping place; for ''parties,'' say, it wasn't. So all the ones where the ''e'' was a word can unify on one shared ''e,'' and all the ones where it wasn't can unify on another. And you just work your way backwards from there: what letters led into that ''e''? So, from the ends of the words backwards, you're looking for all the words that can end in an ''s,'' and end in an ''e,'' and end in ''ies,'' and unifying them up, making sure they have all the same outgoing connections. For example, ''running'' has an ''ing'' at the end, and so does ''swimming'' - but there's ''swimmingly'' and there's no ''runningly.'' So you have to be careful about what you unify: two nodes have to have completely identical properties from that point down to the bottom of the structure to be unifiable. But it's a pretty neat process. It's essentially what's called DFA minimization, so if you want to learn a little bit about it, that's the term to look up. And so if I do that, and I build this as a DAWG - still using the same structure definition here, but unifying suffixes as well as prefixes - that actually changes the number of nodes I need drastically. It goes down to 80,000 from my initial 250,000. And if you think about the number of words again: there were 125,000 words in the dictionary, and the DAWG has just 80,000 nodes. So once you start unifying the suffixes, many of the words don't even have any nodes of their own that are unique to them whatsoever. They just exist as portions of other words that have all been joined together and compressed into one package.
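Here is a compact sketch of that bottom-up unification in C++ - my illustration of the idea, not the lecture's actual code. Two nodes can be shared when they carry the same letter, the same word flag, and identical already-unified children, so each node gets a signature, and a registry remembers the first node seen with each signature:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct Node {
        char letter;
        bool isWord;
        std::vector<Node*> children;   // kept in alphabetical order
    };

    std::map<std::string, Node*> registry;  // signature -> canonical node

    Node *unify(Node *n) {
        for (Node *&child : n->children)
            child = unify(child);      // unify the subtrees first (bottom-up)
        std::string sig;
        sig += n->letter;              // signature = letter + word flag +
        sig += n->isWord ? '!' : '.';  // identities of the canonical children
        for (Node *c : n->children)
            sig += std::to_string(reinterpret_cast<std::uintptr_t>(c)) + ",";
        auto it = registry.find(sig);
        if (it != registry.end())
            return it->second;         // an equivalent node exists: share it
        registry[sig] = n;             // (duplicates are leaked here for brevity;
        return n;                      //  a real build would free them)
    }

Running unify on the trie's root collapses equivalent suffix subtrees into single shared nodes, which is where the drop from 250,000 nodes to 80,000 comes from.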
So, 80,000 nodes total - about two-thirds of the number of words. We have ten-byte nodes and 80,000 of them, so we are down to under a megabyte. We've crossed an interesting barrier there: the data that was fed into this structure, that text file of words, was over a megabyte, and now we've created a representation that is actually compressed - that uses less space than the text file did in the first place. And that's really because it's finding all these ways in which words are like each other and sharing them, making use of the existing structure in the DAWG to accommodate new words without adding bulk. Now, once I've done this, I have to say that I've made the data structure even less flexible. To be clear, each of these steps has had a cost somewhere else that made the code more complicated. All this sharing also means that adding a new word takes care. If I go back to look at this picture: say ''portioning'' is a word, but ''adoptioning'' is not. Then if I go in to add the word ''portioning,'' I actually have to split it off of its existing unified branch, because if I just tacked an ''ing'' onto the end of the shared ''portion'' path, it turns out that ''adaptioning'' and ''adoptioning'' would suddenly become words, too. So I have to be extra careful, once I've got it unified this way, that when I add new words I don't accidentally add some additional words that I didn't plan on, because these paths have been shared. It requires a little more careful management - see the small sketch below. Again, though, the goal here was to build a data structure that was kinda freeze-dried: I build it from the input data file I had, it then operates very, very efficiently, and I wasn't planning on making a lot of runtime changes to it. So I get it down to this, and then the last thing I'm going to do is go back to the CS side of it and see if I can squeeze it down in terms of just what's being stored per node.
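As a tiny illustration of why adding a word forces a split - reusing the Node sketch from above, and again not the lecture's code - any shared node along the path has to get a private copy before it can be modified:

    // To add ''portioning'' without also creating ''adoptioning,'' every node
    // along the shared ''portion...'' path must be cloned first, so the new
    // ''ing'' branch hangs only off portion's private copy:
    Node *cloneForEdit(const Node *shared) {
        return new Node(*shared);   // copies letter, isWord, and the child list;
    }                               // edits to the copy no longer leak into the
                                    // other words that shared this node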
So I have 80,000 of these. Typically, when you look at a structure, it's very rare that changing something from an int to a char - a four-byte thing into a one-byte thing - has any profound impact on the total size needed. But if you happen to know you're going to make a lot of these structures - tens of thousands, almost 100,000 of them - then it turns out every byte counts. If I can get this thing down from ten bytes to eight bytes, or six bytes, or four bytes, it will cut my memory by quite a factor. So what I did in the final version is use a little technique called bit packing, which is to stash stuff in the absolute minimum possible space it will fit into. It takes advantage of things like this: there are only 26 letters in the alphabet, but a character is a full eight-bit value, which can hold numbers from zero to 255, and I just don't need that much range. I'm not using the yen sign, I'm not using the parentheses, I'm not using the digits. So having, in some sense, the capacity to store all those distinct characters is a cost I'm paying on every character that I don't need to pay. So in fact, I squeezed it down by dividing things up at the sub-byte level. This is something that's talked about a little bit in Chapter 15 if you're at all curious. It's not what I consider core material for 106B, but I'm mentioning it to complete the picture. Given that a bit is either zero or one, if I have five of them connected together, then I have 2^5 different patterns I can represent in that capacity - that's 32 different patterns, which is enough for my 26 characters, plus a little slack. I use exactly one bit for is-word, rather than a full Boolean; all I need to know is yes or no. I also changed the way a node knows it's the last child of a generation. Rather than having A say, ''there are 12 children here,'' I just say: go to index 27 and start reading, and one of the children will tell you it's the last.
So each of them says, ''No, I'm not the last,'' ''No, I'm not the last,'' and then when you get to the last one, it's marked ''yes,'' and that's how you know where A's children end and the next node's children begin. And if I can get it down in this way, I've taken what was a ten-byte structure down to four bytes. It turns out there are often reasons why certain size boundaries are the critical barriers: if it were five bytes, it would actually likely occupy eight, just because of what are called alignment and padding restrictions. So if I were at seven bytes and squeezed down to six, it might not make any change, and if I squeezed down to five, it still might not make any change - behind the scenes, it might be decided that each of the structures is going to take eight. But getting it down to four tends to cross one of those critical barriers: four actually will be smaller than something that was ten. So I take my 80,000 nodes, I get them to be four bytes each, and now I have a data structure that's about a third of a megabyte in memory, as opposed to the over a megabyte on disk in text form. And so that's how it does what it does. If you go look at the code, you'll see there's this complicated bit-packed structure, and it looks words up using a recursive character-by-character strategy: start at the top, find the matching character, go to the children index, then work your way down the children to find the next matching character, and so on, to find its way through the data structure. The child index is, for example, the idea that A says ''my children start at 27,'' or B says ''mine start at 39'' - so I just need a number there, but I'm using something slightly smaller than an integer, just to make it all fit very tightly. So I'll tell you a couple of interesting things about it, actually, before I close this off. The lexicon.dat file actually is just an on-disk image of what the DAWG data structure looks like in memory. And so, in fact, it's very easy to take the whole data structure - it doesn't have any pointers in it.
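As a hedged illustration of that packing, using C-style bit fields - the exact field widths and names in the real lexicon may differ:

    #include <cstdint>

    struct PackedNode {
        std::uint32_t letter     : 5;   // 2^5 = 32 patterns; enough for 26 letters
        std::uint32_t isWord     : 1;   // one bit instead of a whole bool
        std::uint32_t lastChild  : 1;   // marks the final sibling in a child group
        std::uint32_t childIndex : 25;  // where this node's children begin; 2^25
    };                                  // indexes far more than 80,000 nodes

    // Holds on typical compilers: the four fields pack into one 32-bit word.
    static_assert(sizeof(PackedNode) == 4, "fits in a single 32-bit word");

Note the design trade in lastChild: instead of storing a child count per node, each group of siblings is terminated by a marked node, which is exactly the ''one of them will tell you it's the last'' scheme described above.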
Pointers actually tend to make data structures harder to dehydrate to disk and rebuild, because they have to kinda be wired up in memory - you don't know what memory you're going to get from new, or what the addresses will be. This one doesn't have any pointers. Everything is self-contained: it's just one big array of these blocks, and all the information in it is relative. It says go to index 27, go to index 45. So you can take that whole block and just write it to disk - 320 kilobytes or so - read that same amount back in, and it's rehydrated without any extra effort. That's what the lexicon.dat file is: a dehydrated form of the DAWG having been built in memory. That makes it very, very fast to load. At that point, though, it's super inflexible. If you want to add a word to it, the way the structure is built, you'd have to add nodes and make sure the new word wasn't glomming on to some existing words nearby, so it would be a lot of work to actually edit the DAWG in place. In fact, when you ask it to add a word, it doesn't edit the DAWG at all. You say, ''I'd like to put a new word in,'' and it says: this one's so perfect the way it is - I've built this one, I love this one, I'm very happy with it. If you want to put other words into it, it just keeps a second data structure on the side, an auxiliary one over here, where it sticks the additional words you ask it to put in. I have to say that seems like a very obvious thing to do, but in fact, I didn't even think of it for years. We had the lexicon, and I said: you can't use the lexicon to make new word lists or edit them. You can only load this one beautiful, perfect one. You can't do anything else. And people kept saying, ''I wish I could put some words in the lexicon. I wish I could use it for the word list to keep track of the human and computer players.'' And I'm like, ''Well, you just can't do it. You don't understand. It's very complicated. You just don't understand my pain.'' Whatever - I had excuses. And then, after some number of years of just being a baby about it, I'm like: oh, of course, why don't I just keep another structure off to the side? It's an abstraction. It's a word list. You don't need to know what I do.
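A minimal sketch of that dehydrate/rehydrate round trip, assuming the whole structure is one contiguous array of the 4-byte PackedNode records sketched earlier (the file-handling details here are invented for illustration, not the actual lexicon.dat format):

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    void saveDawg(const std::string &path, const std::vector<PackedNode> &nodes) {
        std::ofstream out(path, std::ios::binary);
        out.write(reinterpret_cast<const char *>(nodes.data()),
                  nodes.size() * sizeof(PackedNode));    // one bulk write, no fixups
    }

    std::vector<PackedNode> loadDawg(const std::string &path) {
        std::ifstream in(path, std::ios::binary | std::ios::ate);
        std::size_t bytes = static_cast<std::size_t>(in.tellg());
        std::vector<PackedNode> nodes(bytes / sizeof(PackedNode));
        in.seekg(0);
        in.read(reinterpret_cast<char *>(nodes.data()), bytes);  // one bulk read
        return nodes;   // all child indices are relative, so they're still valid
    }

Because every reference inside the array is a relative index rather than an address, nothing has to be re-wired after loading - which is exactly why the load is so fast.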
The fact that I keep a DAWG for some of the words and just an ordinary set for the others is really completely irrelevant. As long as, when you ask for a word, I look in both and tell you, ''Yes, it's there,'' why do you need to know where I keep it? (There's a small sketch of this hybrid arrangement below.) So it took me a while to buy abstraction, even though I teach it all the time - it just wasn't totally making sense. So, a neat little thing: just keep both behind the interface. It turns out I learned about this from a paper I read in the early '90s about somebody building a Scrabble player. They wanted a dictionary that supported finding plays on the board and being able to extend words, so knowing things about root words and how they extend, and about prefixes and suffixes, was kinda the original motivation. And this kind of data structure is actually really great for any kind of word play: anagrams that you want to rearrange, or words that you want to extend by prefix and suffix - it's really optimized for doing that kind of traversal, and for leveraging the existing patterns in words to save space while doing so. So it was a really fun little project to do, and kinda neat to think about. Even though today, if you needed a lexicon, you could just use a set - it's fast enough. It takes a ton of memory, but you know what, memory's cheap. So this is not the kind of thing you would need to do today, except in some kinda embedded, low-memory, very tight processor environment. But it still makes for an interesting artifact to think about: how you get there, and what kind of things you can use. So Wednesday we will talk about some big-picture design, some kinda wrap-up stuff, and I will see you then.
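For completeness, here is the hybrid arrangement referenced above as a minimal C++ sketch. The class and member names are mine, and it reuses the FlatNode array and the free contains() function sketched earlier; the real Lexicon's internals differ in detail:

    #include <set>
    #include <string>
    #include <vector>

    class HybridLexicon {
    public:
        void add(const std::string &word) {
            if (!inDawg(word))
                extras.insert(word);       // never touch the frozen DAWG
        }
        bool contains(const std::string &word) const {
            return inDawg(word) || extras.count(word) > 0;  // look in both places
        }
    private:
        bool inDawg(const std::string &word) const {
            return ::contains(nodes, word);  // the flattened walk sketched earlier
        }
        std::vector<FlatNode> nodes;       // the freeze-dried, pointer-free DAWG
        std::set<std::string> extras;      // auxiliary set for words added later
    };

Callers never see the split: the abstraction is just ''a word list,'' regardless of which structure actually holds any given word.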