This presentation is delivered by the Stanford Center for Professional Development.

The piece I need is not here, so I'm going to actually just talk to you about what it says on my slides, and then when Jason arrives we'll plug it back in and see what's going on. Oh, there's Jason. There we go.

So what I'm going to do today is actually something kinda fun. I'm gonna take a case study of a particular data structure design, and the one I'm looking at there is the lexicon.

So how many of you actually ever decided to just open up the lexicon file, just to see what was in there, at some point during the quarter? We used it a couple of times. Anybody want to tell us what they learned by opening up that file? Anything interesting?

It's a data structure called a DAWG. You're like, well, that's a pretty funny name for something. And was the code particularly well-[inaudible], easy to read, easy to understand? No. Who wrote that code? She should totally be fired.

Yeah, so this is what I wrote kinda a long time ago, actually, to solve this particular problem. And I'd like to take you through kinda how it is I came to discover the DAWG, and why it is the DAWG was the choice that made sense for the need pattern we had for the lexicon.

So let me just refresh what the lexicon does. It is a special case of kinda a set or map. It is a container where you're going to stick things in, and they're going to be unique, right? There's no need for any sort of duplicate. The keys are always strings. And in fact, they're English words, so that's going to be an important thing to remember: they're not just arbitrary sequences of letters, right? They have certain patterns to them and certain sizes that might actually be something we could capitalize on. When we know something about the domain, we're able to kinda special-case or tune our data structure to solve that problem very well in particular.

It has no associated value, so it doesn't have dictionary definitions or synonyms or anything like that, right? So it really just is exists or doesn't exist that we were interested in supporting.
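[Sketch: the operations being described, as a minimal C++ interface. The names here are illustrative stand-ins, not necessarily the actual class.]

    #include <string>

    // Minimal sketch of the lexicon interface under discussion.
    // No associated values: a word either exists or it doesn't.
    class Lexicon {
    public:
        void add(const std::string& word);                     // duplicates ignored
        bool containsWord(const std::string& word) const;      // exact membership
        bool containsPrefix(const std::string& prefix) const;  // does any word start with prefix?
    };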
And in addition to being able to add words and check if a word is contained, there's also this prefixing operation that we had talked a little about in Boggle, and that came back on the midterm. Why was prefixing important, right? It was critical for pruning those dead-end paths, those searches on the Boggle board that otherwise would really have taken a lot of time and kinda led to nothing of any value.

So what I'm gonna talk you through is some of the obvious choices based on data structures we already know about, and see where they would go. This is a thought experiment; it's not that I wrote all these to find out. Actually, I mostly did a lot of this on paper, trying to figure out what options we had, how they would play out, and what kind of trade-offs I'd have to make to make them work.

So the simplest case of any kind of collection is to think about whether a vector or array can do the job for you. In this case, having a vector that's kept in sorted (alphabetically sorted) order seems like a pretty good, easy way to get this set up and running. And how would it behave?

So in order to do a contains(word) on this data structure, what algorithm are you going to use to do your searching? Binary search, right? It's in sorted order; we've got to take advantage of that. If we do a linear search and I'm looking for the word "mediocre," I'm going to be trucking through, you know, thousands of words before I find it. So we definitely want to be using binary search: equals-equals comparing my strings, less than, greater than, dividing in half. And so it should run in logarithmic time. Logarithmic time is one of those great performers that you learned about in the PQ assignment, hopefully; it's basically not measurable on today's computers. So even for entries of 100,000 or 500,000, it's basically free.

How do we do contains(prefix)? Can we also do a fast prefix search given this data structure? I hope so, right? It's the same kind of binary search. In this case, though, we're looking at substrings: if I'm looking for a three-character prefix, I compare only the first three characters to decide which way to go. Once I find something that matches in the first three characters, I'm done.
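[Sketch: those two searches on a sorted vector, in C++. std::binary_search and std::lower_bound stand in for the hand-rolled binary search being described.]

    #include <algorithm>
    #include <string>
    #include <vector>

    // Exact lookup: a plain binary search over the sorted words. O(log n).
    bool containsWord(const std::vector<std::string>& words, const std::string& w) {
        return std::binary_search(words.begin(), words.end(), w);
    }

    // Prefix lookup: sorted order clusters all words sharing a prefix together,
    // so the first word >= prefix starts with it if and only if any word does.
    bool containsPrefix(const std::vector<std::string>& words, const std::string& prefix) {
        auto it = std::lower_bound(words.begin(), words.end(), prefix);
        return it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
    }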
So again, we're still using the sorted property to quickly narrow in on where it could be.

When it's time to add a new word, if someone asked you to put in the word "abalone," I've gotta use that binary search to find out where it goes. So in logarithmic time I can tell you where the new word goes, but then to put it into position, there's gonna be some shuffling. And that's where this operation starts to bog down, given that you might be moving half or more of your items, on average, to make that space open up. It is going to require a linear-time operation to get new words into the data structure.

Okay, probably still a reasonable approach, right? Contains(word) and contains(prefix) are the operations that get done immense numbers of times in Boggle; you're doing a ton of them as you're exhaustively searching that big board. But the adds are done kinda once at the beginning. You build that big dictionary and then you subsequently use it. And so if it took a little bit more time to build that thing, sorta an n-squared operation to load it up, well, maybe that would be okay.

The other thing we want to look at is space usage, and some of this is a little bit of an interesting artifact of the time I was working on this. It turns out space was actually a limiting factor, in terms of the core memories of our machines and what we could take up for our dictionary.

In this case, we have per entry the size of the string. We don't exactly know how the string is represented; it's an abstraction to us. But we can make a simplifying assumption that it's about the length of the string times the size of a character. There might be a little more housekeeping associated with it, but that's a pretty good estimate. So in fact, it's the string data itself that's being represented in this vector, without much other overhead. If we say words average about eight characters, that's about eight bytes per word. So if we have 100,000 words, we expect to have about 800,000 bytes invested in this sorted vector arrangement.
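[Sketch: the add just described, with includes as in the sketch above. Finding the slot is logarithmic, but vector::insert has to shuffle everything after it.]

    // Insert w into its sorted position, keeping entries unique.
    void add(std::vector<std::string>& words, const std::string& w) {
        auto it = std::lower_bound(words.begin(), words.end(), w);  // O(log n)
        if (it == words.end() || *it != w) {
            words.insert(it, w);  // O(n): shuffles the tail over to open the slot
        }
    }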
Excess capacity and a couple of other things might tweak that a little bit, but that gives us at least a good starting point to think about how this compares to alternatives.

Okay, so then we said that binary search is the real advantage of keeping it in sorted order, but the contiguous nature of the array kinda bites us in terms of insert. If we reorganize into a binary search tree, to give ourselves the dynamic flexibility of wiring in the left and right halves through pointers as opposed to contiguous memory, then we know we can also get add to run in logarithmic time.

So looking at these operations: contains(word) is a tree search that's using equals-equals and less-than/greater-than to decide which way to go, and it also performs logarithmically if the tree is balanced. Contains(prefix), same deal: keep comparing against the prefix, working our way down, knowing which way to go. And add can also perform in logarithmic time, because it does the same search, falling off the bottom and then inserting the new word there, or finding it along the way and not having any work to do.

So we can get all our operations done in logarithmic time, which we know to be fast enough to not be worth fighting any harder for. Where did we pay for this? Nothing really comes for free, right? Overhead. And in this case, memory is certainly one of the places where there's going to be a pretty big cost: we now have the string data for each entry, plus a left pointer and a right pointer. And if we're doing the work to keep it balanced, which we probably ought to do if we want to avoid the degenerate cases of it getting totally out of whack and degenerating into linear time, we're going to have a couple more bytes thrown into that factor, too. So if we assume four bytes for each of the two pointers plus another two for the balance factor, then we've added ten bytes of overhead on top of the eight bytes for the word. Getting a little bit hefty. Something to worry about.
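[Sketch: the node layout that accounting describes, assuming 4-byte pointers as on the 32-bit machines of that era.]

    #include <string>

    // Each entry carries ~10 bytes of overhead beyond the string itself.
    struct TreeNode {
        std::string word;     // the entry (~8 characters on average)
        TreeNode* left;       // 4 bytes
        TreeNode* right;      // 4 bytes
        short balanceFactor;  // 2 bytes, maintained to keep the tree balanced
    };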
Also worth weighing in your mind, which I didn't put a bullet up for, is code complexity. Building the fancier, more complicated data structure probably meant you spent more time developing it and more time debugging it, right? And it will be harder for someone later to maintain and deal with, because it's just more complicated code than a sorted vector. Let's say they want to add a new operation, say remove, so you're able to take a word out of the lexicon. The lexicon doesn't currently support that. Somebody adding that on the sorted vector implementation probably won't find themselves too tripped up. Someone who has to do a remove out of a binary search tree, removing the node, reattaching the subtrees, and not destroying the balancing, is facing a pretty complicated operation. So we've invested in a data structure that will keep on creating further development hassles for any modifications down the road. It is a big investment in your brainpower to keep it working.

So let's think about what a hash table can do for us. The hash table is the last thing that we looked at on Friday, kinda moving away from this whole idea of sortedness and moving instead to this idea of a hashing function. So if I have a hash table that has a certain number of buckets, and that uses a linked list for chaining to handle collisions, then what I would have is my words just scattered around my table, allowing me quick access by hashing to the right bucket and then working my way down the bucket to see.

So contains(word) does a standard hash table lookup. Take the word, you know, "mediate," and run it through the hash function. It says, oh, it's bucket three in this case. I go to bucket three, I walk down the linked list, and I find it or not. If the word "mediate" is in the table, it had to be in bucket three. No questions asked. The expected time for that is going to be n over b, where n is the number of entries and b is the number of buckets.
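[Sketch: that lookup on a chained hash table. hashFn and the Cell layout are illustrative stand-ins, not the actual code.]

    #include <string>
    #include <vector>

    struct Cell {
        std::string word;
        Cell* next;   // chain of entries that hashed to this same bucket
    };

    // The word can live in exactly one bucket; walk only that chain.
    // Expected O(n/b) with n entries spread across b buckets.
    bool containsWord(const std::vector<Cell*>& buckets, const std::string& w,
                      int (*hashFn)(const std::string&, int)) {
        int b = hashFn(w, (int)buckets.size());
        for (Cell* cur = buckets[b]; cur != nullptr; cur = cur->next) {
            if (cur->word == w) return true;
        }
        return false;
    }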
If we have the number of buckets set to be on the rough order of the number of entries, then we're going to get constant access there.

Now we want to do contains(prefix). I want to know if there are any words in this table that begin with "med." Where do I look? If I run "med" through the hash function, do I expect that I would get the same number as other words that begin with "med," like "mediate" and "median" and "mediator"? Yes? You say yes. All right, let's take that to an extreme. If a prefix and all the longer words built on it hash to the same place, then think about the simplest case of a prefix, "m." Should all the words that begin with "m" hash to the same place? If they did, we're in trouble, right? If all prefixes hashed to the same place, then what I'm going to end up with is a really heavily clustered table: 26 buckets in use, one with all the "a" words, one with all the "b" words, on down to all the "z" words, and none of the other buckets would be used.

So in fact, as part of the good behavior of a hash function, we're hoping that it jumbles things up, that a word that looks like another but has one letter added on the end goes somewhere totally different: that "median" and "medians" hopefully go to two totally disparate places. If there were patterns where words that were substrings of each other all got hashed to the same location, we would end up with heavy clustering, because there are a lot of words that repeat parts of other words in the dictionary.

So if I want to know whether there are any words that begin with "med," I have no a priori information about where to look for them. They could be anywhere. I could hash "med" and see if "med" itself was a word, right? But say it was something that itself was just the beginning of something: "st," "str." There are a lot of words that begin with "str," but "str" itself is not a word. So hashing to that bucket and looking would only tell me if exactly "str" was there, and occasionally there might be some "str" words also in that bucket. But other than that, we just gotta look. And where do we look? Everywhere.
Everywhere, everywhere, everywhere. In order to conclude there is no word that begins with the prefix we have, the only way to be confident is to either find something, or to have searched the entire table and not found it. So it requires a completely linear, exhaustive look through the entire contents of the hash table to know that something, for example, is not a prefix. You might find one quickly; in the best case you happen upon one in the first couple of buckets. But no guarantees overall. And certainly when it's not a prefix, it's going to require a complete look-see of everything. That's kinda a bummer.

Adding a word is back to the fast behavior again: hashing to the bucket, checking to see if it's already there, adding it if needed. So the hash table gets good times on those two operations, but then really bottoms out when it comes to supporting that prefix search.

Its space usage is roughly comparable to that of the binary search tree. We've got the string data itself, and we've got the four-byte next pointer on the linked-list cell. There's also the four-byte bucket pointer, which is over in this other structure, the array of buckets. If the number of buckets is roughly equal to the number of cells, though, then you can count it per entry: each entry has this 12-byte cell, plus a four-byte bucket pointer somewhere that you can think of as associated with it on a per-entry basis. That comes to about eight bytes of overhead on top of the string, so 16 bytes for each entry in the table.

Okay, that's what we have so far. We've got pretty good performance on contains(word) no matter what we choose; we can get either logarithmic or constant time with any of these three data structures, so perfectly acceptable performance there. We cannot get contains(prefix) to run well on a hash table; it just doesn't happen, the way hashing works. And add can be fast on the two complicated pointer-based data structures to the right, but can't be fast on the sorted vector.
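[Sketch: the exhaustive contains(prefix) scan a chained hash table is reduced to, illustrating why it can't run well. Cell is as in the lookup sketch above.]

    #include <string>
    #include <vector>

    struct Cell { std::string word; Cell* next; };

    // Walk every bucket and every chain: linear in the table size
    // whenever the prefix is absent.
    bool containsPrefix(const std::vector<Cell*>& buckets, const std::string& prefix) {
        for (Cell* head : buckets)
            for (Cell* cur = head; cur != nullptr; cur = cur->next)
                if (cur->word.compare(0, prefix.size(), prefix) == 0)
                    return true;   // found a word beginning with prefix
        return false;              // had to look everywhere to be sure
    }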
And then we see some trade-offs here in terms of space versus time. There's very little space overhead in the sorted vector case, but it blossoms into two, almost three times the storage needed in the BST and hash table cases. The code also got a lot more complicated when we moved up to those data structures.

So then I'll tell you that the one type of performance that was interesting to us when I was exploring this, in fact the most important at the time, was the memory usage. The memory usage was really the big obstacle we had. The Official Scrabble Players Dictionary II is the source for the word list we're using. It has about 128,000 words, I think, is exactly the number, and the average length of those is eight characters. The file itself is a little over a megabyte on disk, if you were just to look at the listing of all the words. If we were loading it into the sorted vector, we'd get something roughly equal to that: about a megabyte of space, plus a little bit of noise. The ones that double up take it to two, two and a half megabytes total.

At the time we were working, and I know this is going to seem completely archaic to you, we had these stone tablets that we worked on. They typically had main memories of about a megabyte, maybe, in an exciting case, four megabytes; four whole megabytes of RAM. So it was not possible to dedicate one or two megabytes of your main memory to just your lexicon. There just wasn't enough space to go around. So we were working in a very tight environment in that respect, a little bit more like today's world of embedded devices. If you're building something for an iPod or a cell phone, you have a lot less gargantuan memory. Your average desktop machine has a gigabyte of RAM, so now you're like, whatever, a megabyte here, ten megabytes there, free, all the memory you want. So this seems kinda like a silly thing to focus on, but this was 1995, and yeah, there was certainly more interest in keeping it tight to the memory.
The Boggle lexicon that we have actually takes under a third of a megabyte. The memory that's used while it's working, searching and checking for words and prefixes, is actually smaller than the on-disk form of the data we started with. So it does a bit of compression as part of its strategy for storage, which is pretty interesting to think about.

So I'll tell you the truth about how we first did Boggle, because it's a little interesting historical perspective. Boggle was given for the first time by a colleague of mine, Todd Feldman. He was the guy who came up with the idea, which is a great idea, and he said, "Oh, this is a great assignment." And when he wrote it, he said, "Oh, I need a lexicon." So as part of putting the assignment together, he looked around and found a word list that had about 20,000 words. I think we first found the bigger word list and said, "Oh, it's just impossible. It's going to take a megabyte or more and we just don't have that kind of space. We need much smaller data files." So we had a data file of about 20,000 words, which takes about 200K. And he wrote it using a hash table.

So given that he wrote it as a hash table, he just didn't support contains(prefix). He just didn't put it in, because you can't do it efficiently. So it took a smaller amount of space, it had a smaller word list, and it wouldn't do contains(prefix).

So then people wrote Boggle. And as you learned from that midterm question, you can write Boggle without contains(prefix). It spends a lot more time looking down these dead ends, but it eventually bottoms out when it runs off the board and uses all the cubes. And in some ways that almost made the program a little more satisfying to write, in one odd way. It's the computer's turn that uses the prefix pruning. So you would type in your words and it would be perfectly fine finding them; it doesn't use the prefix part when it's doing the human's turn. You're typing in words; it's finding them.
Then you would go to do the computer's turn and you'd say, "Okay, go." And now it would take a really long time. It'd be hunting around the board. It would be working so hard your fan would be coming on. You'd feel like, oh, my computer's doing something. And then it would show up with twice as many words as you found, or ten times as many, whatever. And then you would say, "Well, at least it had to work hard to beat me." And you felt like, I wrote this program that causes my processor to get busy; I feel good.

And in fact, the word list was pretty small, so it wasn't even as likely to beat you, because it didn't know as many words. 125,000 words is an enormous number of words. I've heard different figures on this, but the average person's English vocabulary is about 20,000 to 30,000 words. It tends to be about 10,000 higher if you've gone to college, so maybe 30,000 to 40,000. The fact that it knows 80,000 more words than the average college graduate means it knows a lot of obscure words, too. So not only does it very quickly find all those words and produce these ridiculous words that you've never heard of, it's almost like it's taunting you by doing it in a millisecond. Back then it took a while, so it was like, okay, it beat me, but it had to work hard.

But then I took over Boggle. I taught X maybe a quarter or two later, and I said, "I'm going to write a lexicon. I'm going to make a little project for myself, because I don't have enough work to do." And my goal was to make one that would do prefix, but that also would be tight enough on memory that we could load a much bigger dictionary file, because I thought that would make the game even more fun to play.

All right, so here's where I started. This is a little excerpt from the middle of the dictionary: "stra." Yeah, that's totally the thing you notice, right, when you look at this thing. You see "straddle," "straddler," "straddlers," "straddles," "straddling," and then these words that you can't believe you're allowed to make, but apparently you can put "er" and "ier" and "ies" and "ing" on almost anything out there and it makes a valid word.
So although you never really thought about the fact that "straightforward," "straightforwardly," and "straightforwardness" are all in there, there is a huge amount of repetition. And this is where I said we're going to use some information about the domain. These aren't just arbitrary strings. They're not random sequences of letters picked out of the air; they developed over time. They're word roots, they're suffixes, there are these prefixes that mean things. Looking at this little excerpt out of the middle, there's an awful lot of words that repeat portions of other words, and that may be part of what we can capitalize on. Instead of really recording "straightaway" and "straightaways," repeating those first eleven characters and adding an "s," maybe there's some way I can unify the data I have here. It's the same way with code: when you have the same passage of code repeated two or three times, we're always saying unify it, move it into a helper and call it in three places. Can we do the same thing with our data? Can we share the parts of the words that are in common and only distinguish where they are different?

Take a look at this. This is a little portion of something called a trie. A trie is spelled t-r-i-e; it actually comes from "retrieval," but sadly, it's still pronounced "try," just to confuse you. It is a variant of a tree. In this case it's a tree that has up to 26 children coming out of any particular node, one for each letter of the alphabet. Starting from the root node at the top, all the words that begin with "a" are down the "a" branch, and then there's a "b" branch, and there'd be a "c" branch and a "d" branch. The levels of the tree correspond to positions in the word: the root's children are all the zeroth characters of the words, and the second level is all the characters that can follow that zeroth character. So here are the "ab" words and the "ac" words and the "ax" words. Over here are the "st" words, but there's no "sx," so there are some places that aren't filled out in this tree.
And as you keep going down further, the depth you trace through the tree corresponds to tracing through the characters, in sequence, that make words. So in this form, tracing out a path through this tree from the root down to a particular node tells you about prefixes. There are words that begin with "a" and words that begin with "ac" and "ace." And then there's a little notation there, shown visually by the bold circle around a node, that says: the path that led here from the root is a word. So "a" is a word, and "ace" is a word, and "aces" is a word. "Act" is a word, and so are "acted" and "actor." And in this case, the "act" part is totally shared, so everything that comes off of "act" ("acting," "acted," "actor," "actors") can all share that initial part.

For a word like "act," that doesn't seem so profound. But you start thinking about words like "strawberry" or "straightforward" or "stratosphere," and realize that "stratospheric" and "strawberries" and "straightforwardly" can all leverage the ten or twelve characters that were already invested in the root word, and that "straightjacket" and "straightforward" can even share "straight." There are actually a lot of prefixes that can be unified across the space of the dictionary. So knowing that words have these patterns helps us do this.

So let's build that. The first form of this, and this, again, is a thought experiment (I did not actually write it this way, because it would be absolutely astronomically expensive, but it was a good way to think about where I needed to go): I designed a new kind of trie node that had the letter, a little Boolean flag of "is this a word" (whether it had the dark circle around it), and then a 26-member children array, pointers to other nodes just like this one. So it's totally recursive, starting from that root. The idea is to use those 26 children, with "a" in the first slot and "z" in the last slot, to make it really easy to trace letter by letter down the levels of the tree.
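[Sketch: that first, deliberately wasteful node, assuming 4-byte pointers; field names are illustrative.]

    // One slot per letter: children['c' - 'a'] is the subtree of words whose
    // next character is 'c', or nullptr if no word continues that way.
    struct TrieNode {
        char letter;              // the character on the edge leading here
        bool isWord;              // the "dark circle": path from root is a word
        TrieNode* children[26];   // 26 four-byte pointers = 104 bytes
    };                            // ~106 bytes per node in total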
So I'd have this one root node, and then all the words that begin with "a" are off that first slot, and all the "z" words off the last slot.

So when I want to know if a word is in the dictionary, what do I have to do in this data structure? How do I find it? I want to find the word "far." Where do I look? Left? Right? Down? Which way? It would be great if you would tell me; then I would understand what my data structure is. [Inaudible] Exactly. So at the very top I say, okay, "f" is my first letter. Match my "f," get me to the "f" node, and now it says recursively find "ar" from here. So it's very recursive: find the first letter, then recursively find the rest of the string from there, working down. Find the "a" out of the next node, find an "r" out of the next node, and when you get to that node, check the little "is word" Boolean: was the path I traced marked in the dictionary as leading to something that's known to be a word?

The prefix search looks exactly the same, except for that last check of "is word." Given that nodes only exist because there are words further down, I just trace out a sequence of letters, "str," and as long as there's something there, I know that from there are subsequent children beneath it that lead to words.

And adding a word does the same operation. If I need to add the word "strawberry," I start tracing "s," "t," "r," and at some point I find places where the nodes don't exist. Then I start creating them from there on down. So it traces what exists, adds new nodes off the bottom, and marks the final one as, yes, this is a word.

What's interesting is that all three of these operations make no reference to how many words are in the dictionary. They depend only on the length of the word you're searching for or adding, whether there are ten words in there, 1,000 words, or a million words. In no way does the big O depend on how big or how loaded the data structure is. That's kinda wacky.

The longest word, according to the Oxford English Dictionary? There you go.
You're not going to make that on the Boggle board anytime soon, because it's only got 16 cubes. Even Big Boggle can't really make that puppy. But just so you know, 45 characters is not so many, right?

And in fact, here's a little thought for you. Not only is it not dependent on the number of words, it actually has a little bit of an inverse relationship with the number of words, especially the add operation. As the data structure becomes more loaded, there's typically less work to do in add. If "strawberry" is in there and you go to add "strawberries," you already have "strawberry" to depend on; you just need to build off the end of it. The more words that are in there, the more likely that some of the nodes you need are already in place, and then you can just blow past them and add your new nodes. Now, that's odd. With most data structures, clearly, as they get heavier, it takes more work to install new things, not less.

[Inaudible student question.] Well, the thing is, they're organized by letter in that 26-member array, so I don't even have to look. I just go straight to the slot. If I'm looking for "strawberry," I go straight to the "s" slot, then the "t" slot, then the "r" slot. If the node was already in place, I'm not searching for it; it really is exactly in the "r" slot. That's why there are 26 of them. [Student: In each array?] Each array is 26, they're "a" through "z," and if there's an empty slot, we're using a null. So in this case, it's optimized to make it super fast. I don't even have to look through the letters to find the one that matches. There's no matching; it goes straight to the right slot. We're going to see this is a consequence we can't tolerate, but it is at least the starting point for thinking about how to build the data structure.
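[Sketch: the recursive lookup on that 26-slot node, assuming lowercase alphabetic input; names are illustrative.]

    #include <cctype>
    #include <string>

    struct TrieNode {
        char letter;
        bool isWord;
        TrieNode* children[26];   // 'a' in slot 0 ... 'z' in slot 25
    };

    // Follow one child per character: no scanning, just direct indexing.
    // Runs in O(length of the word), independent of how many words are stored.
    TrieNode* trace(TrieNode* node, const std::string& s, size_t i = 0) {
        if (node == nullptr || i == s.size()) return node;
        return trace(node->children[tolower(s[i]) - 'a'], s, i + 1);
    }

    bool containsWord(TrieNode* root, const std::string& w) {
        TrieNode* end = trace(root, w);
        return end != nullptr && end->isWord;   // path must be marked as a word
    }

    bool containsPrefix(TrieNode* root, const std::string& p) {
        return trace(root, p) != nullptr;   // any node here implies words below
    }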
So the space usage is where this thing really, really suffers. This is an extreme example of a space-time trade-off: we've got something that optimizes beautifully for O(length of the word), meaning typically eight operations, on average, to add or to search for something. But the space we're using here is 106 bytes per node: the 26-member array of four-byte pointers, plus two bytes for the character and the Boolean. One hundred and six bytes is a lot of memory, and the tree has about a quarter of a million nodes. Given the 125,000 words I said are in the input, building it into a trie takes about 250,000 nodes. That's an interesting number to think about, though: 125,000 words take 250,000 nodes, which tells you that on average each word contributes about two unique nodes. Even words like "straightforward" and "straightforwardly," on average, are sharing enough of their common prefix parts that there's very little unique data added by any particular word, averaged across the whole thing. That's pretty neat to know.

However, this is just not going to fly. I was telling you that my main memory had four megabytes, maybe, on a good day, and now I've built a data structure that's going to require six times that in storage. Okay, we gotta squeeze that down.

So the first thing we can do is realize that allocating a full 26-member array, of which I only plan on using some of the slots, is pretty wasteful. Instead of committing to 26 per node, how about an array that's sized tight to the number of children actually coming out of this node? The very first node will have 26; there are words that start with every letter of the alphabet. But from there, it very quickly winnows down. Take the letter "x": you do not need the full 26 children coming off of "x." There are only a few letters that follow "x" in English that need to be fleshed out. You need the "xa," you need an "xe," but you don't need "xj," you don't need "xk," and a whole bunch of things like that. So I change the structure: instead of a 26-member array of node pointers, it's a pointer to a dynamically sized array of node pointers, and I keep track of the number of children in a second field. The first node's array will be allocated to 26; another's might be allocated to eight.
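[Sketch: the tightened node just described, counting bytes the same way the lecture does, that is, ignoring alignment padding.]

    // A dynamically sized children array replaces the fixed 26 slots.
    struct TrieNode {
        char letter;           // 1 byte
        bool isWord;           // 1 byte
        int numChildren;       // 4 bytes: how many slots the array holds
        TrieNode** children;   // 4 bytes: points to a tight array of child pointers
    };
    // Per node: ~10 bytes of node plus numChildren * 4 bytes of pointers
    // (about six children on average, instead of 26 mostly-null slots).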
This is going to change our running times a little bit, because now we won't be able to go exactly to the "s" slot; we'll have to go look for it. On each step down the tree, we'll be looking through that sized array to find the right match. But that adds a constant factor, basically, on top of O(length of the word).

The space win is that we're just not storing all those nulls, and there were a lot of nulls. The node itself is just ten bytes, and then there's some space in this dynamic array out there, which is the number of children times the four-byte pointer. On average, a node has six children, across the 250,000 nodes in the data structure that gets built from this. So 26 was way overkill; about 20 of the slots held nulls for most of the nodes. And many of them, for example at the very bottom of the tree, have no children at all. You get to a word like "straightforwardly": there's nothing after that "y," so it has zero children coming out the back of that one. Certainly having 26 null children pointing away from that was a waste we can tighten up.

So when we get it down to this, we'll have 34 bytes per node on average. Still a quarter million nodes; still eight and a half megabytes. So we're not winning any prizes just yet, but we are making good progress. That got us down a factor of three or four right there. I still have the same number of nodes; all I changed was the pointers to the subsequent nodes further down the tree. The letters and how they connect to each other, the connectivity, is still pretty much the same. It's just that in any particular node, I was storing a whole bunch of nulls, and now I'm not. So the number of nodes is exactly as it was before.

I'm going to do one other little squishing thing on this data structure, and then I'm going to flip gears for a second. Still using my CS side of things, how can I squash things down? I'm going to use a technique that we used in the heap.
In the case of the binary heap that we implemented in the PQ assignment, you'll remember that we drew pictures in the handout that showed a heap with a parent with two children, which has four grandchildren, and so on. But we actually compressed that down into an array, and in the array there was this pattern: here is the root node, call that level zero; then the next two nodes are level one; and the next four nodes are level two; and so on. There's one here, two here, four here, eight here. We flattened it down by generation, looking at what was at the top and what was underneath it. When we did that, we removed all the space for pointers. Although we drew those pictures of the heap as though there were pointers there, in fact there were not; no pointers were being stored in that version of the PQ. And so that gets us back to array-like storage requirements, but by flattening it down into the array and still using the tree conceptually to work our way through it, we get the advantage of a traversal that's based on the height of the tree, not the length of the vector.

So we do the same thing with the trie. It's a little more complicated, because there were a couple of things about the heap that made this easier. The heap always had two children, so there were rules about how the tree was filled in that meant you could always compute the parent and child index with that divide-by-two, multiply-by-two exchange (in a one-based layout, the children of the node at index i sit at 2i and 2i + 1). In this case, what we're going to do is lay it down by generation, but we don't know how big a generation is going to be, so we're going to have to keep track of it manually.

So the root node will have all of "a" through "z." This will be level one, which is "a" through "z." Then level two is a bunch of slots: the "a" followed by some letter, then the "b"s followed by some letter, then the "c"s followed by some letter, and so on. And further down is level three, which are the three-character words.
So these are all the single-character paths, two-character paths, three-character paths, kinda flattened down. And if I look at what I changed here: my first node says it has no letter - that's the root node - it's not a word, and its children begin at index one, the very next location. So it's not a multiply-by-two-and-add-one or something like that. It just says: here's where my children begin, they're at slot one in the array, and there are 26 of them. So that means A, B, C, D are kinda in order there. If I look at A: ''a'' is a word, and its children begin at index 27, and there are 12 of them - so there are 12 characters out of the 26 that can follow an ''a'' in the sequence of a valid word. And then B's children begin at index 39, which is 27 plus 12. So A's children are sitting here at indexes 27 to 38, and then 39 onward - I think B has eight or so children - and then C, and so on. So we've flattened it all down by generation into an array. That means all the space for pointers has now been removed. One serious consequence of this, though, is that the pointers got us flexibility. Insert and remove depended on the pointers really being there - pointers buying us the ability to insert and delete without doing a shuffle. Now, once I have flattened it, I have suddenly bought back the same problems that arrays have. If I have the words here with ''ab'' for abalone and ''ac'' for act, and I don't happen to have any words with ''ad,'' and I try to add the word ''add,'' then in order to put ''ad'' in there we're going to have to move everything over, and there's going to be a lot of shuffling and whatnot. So I'm starting to build a data structure that's losing some flexibility but saving space. I'm, in some sense, focusing my attention on how I can build a data structure that's really fast to search, even if it's a little harder to build. Because it turns out the thing I most care about is making sure it can very, very efficiently look up words. Maybe building it once, being a little slow, isn't so painful.
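Here is a hedged sketch of what lookup over that flattened layout could look like; the field and function names are mine, not the lecture's:

    #include <string>
    #include <vector>

    struct FlatNode {
        char letter;      // letter on the incoming edge ('\0' for the root)
        bool isWord;      // does the path ending here spell a word?
        int firstChild;   // index in the array where this node's children begin
        int numChildren;  // how many consecutive slots those children occupy
    };

    bool contains(const std::vector<FlatNode> &nodes, const std::string &word) {
        int cur = 0;                                    // root lives at slot 0
        for (char ch : word) {
            int begin = nodes[cur].firstChild;
            int end = begin + nodes[cur].numChildren;   // children are contiguous
            int next = -1;
            for (int i = begin; i < end; i++)           // stored in alphabetical order
                if (nodes[i].letter == ch) { next = i; break; }
            if (next == -1) return false;               // no branch for this letter
            cur = next;
        }
        return nodes[cur].isWord;
    }

A prefix search is the same walk minus the final isWord check.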
So the space issue here: this gets us down to just ten bytes per node. All the pointers are gone - there were, on average, 24 bytes of child pointers per node that we've just eliminated. And that means we've got a two-and-a-half-megabyte memory footprint right now. So we're kinda getting close to hitting the range of the binary search tree and the hash table. Now I'm going to go back to the domain again. I worked on it kinda from the CS side, thinking about things I know about data structures and reorganizing arrays and heaps and pointers. I'm going to go back and look at that domain again and see if there's something else I can take advantage of. So this is a different cross-section of the dictionary, showing the words ''adapt,'' ''adapted,'' ''adaptor,'' ''adaptors,'' ''adapting,'' ''adaption,'' ''adaptions,'' ''adapts.'' So eight words that are all built off the ''adapt'' root with a bunch of suffixes. I look over a little bit further and I've got ''desert,'' ''deserted,'' ''deserter,'' ''deserting,'' ''desertings,'' ''desertion,'' ''desertions,'' ''deserts.'' Okay, and ''detect'' and ''insert'' and ''perfect'' and ''invent'' and ''interrupt,'' all verbs to which you can apply the past tense, the gerund form, and the ''ion'' endings that tack onto that root word to give you these suffixes. And each of the words shown here has exactly the same set of suffixes that can be applied to it. So I did this thing about prefixes - I went out of my way so that all these words could share the same prefix: ''assert,'' ''asserter,'' ''asserting,'' ''assertings.'' But is there some way that I could take this idea with ''corrupt'' and ''assert'' and ''insert,'' and then, once I get down to that ''t'' where they end, is there a way they can also share their back-end parts? Those ''ing''s and ''ion''s - can they be unified as well? Well, why not? So here is the first version of what's going to be the DAWG, the Directed Acyclic Word Graph. It takes the idea of unifying the prefixes and now applies it to the suffixes. There are a lot of words that end in the same letters, as well as start in the same letters.
Why not just apply the same kind of unification and sharing on the back end? So in comes ''adapt'' and ''adopt'' and ''baste'' and ''port,'' and out the back end is ''ports'' and ''portion'' and ''porting'' and ''porter'' and ''adopter'' and ''adopted'' and ''basting'' and ''bastion,'' all sharing the endings, because they have the same viable continuations and the same connections between them. So, for example, they're all sharing the ''er,'' and the ''er'' leads into the ''s.'' And that same ''s'' is being shared from ''portion'' and ''baste'' and ''adopt'' and ''adopter,'' all coming into the same ending: these things that end in ''s'' are words, so here's one ''s'' that is a word for them all to share. So it is directed, and it is a graph. Once we remove the idea that there's a single path from the root to any node, we're no longer a tree structure. We can get to ''t'' by going through ''adapt'' or ''adopt'' or ''port,'' and that no longer fits the definition of a tree. It's become a graph - we can get to these nodes from different places. The arcs go one way, in this case: we can't go back up the graph to find the words in reverse. And there are no cycles - acyclic - so you can't just loop around some portion of the graph and repeat letters as many times as you want. And certain nodes are marked, the same way they were in the trie: a path from the root node to a marked node represents a word. So building this is a little bit complicated. I'll give you a little bit of the intuition for it, but I won't get too deep into it. Largely, the way you do it is you build the trie, which doesn't have the unification of the suffixes - it only unifies the prefixes - and then you work backwards on the suffixes. So in some sense, you look at all the words that end, for example, in ''s'' in the dictionary. You find all the ones that end in one ''s'' that goes nowhere, and you say: look, this ''s'' looks like every other such ''s'' in this thing - all these words that are just plurals, or verbs that can have an ''s'' tacked on.
Let's share that ''s.'' So you take however many s's there were - 1,000 of them, say - and you declare these are all the same ''s.'' And then you look at all the nodes that lead into that ''s.'' You're kinda traversing the graph in reverse, and you say: well look, here's a bunch of nodes that lead in on an ''e.'' Depending on whether that ''e'' is a word or not, there might be a group where the ''e'' is also a word - like ''apple'' and ''apples'' - but sometimes the ''e'' is not a stopping place; for ''parties,'' say, it wasn't. So all the ones where the ''e'' was a word can unify on one shared ''e,'' and all the ones where it wasn't can unify on another. And you just work your way backwards from there: what letters led into that ''e''? So, from the ends of the words backwards, you're looking for all the words that can end in an ''s,'' and end in an ''e,'' and end in ''ies,'' and unifying them up, making sure they have all the same outgoing connections. For example, ''running'' has an ''ing'' at the end, and so does ''swimming'' - but there's ''swimmingly'' and there's no ''runningly.'' So you have to be careful about what you unify: two nodes have to have completely identical properties from that point down to the bottom of the structure to be unifiable. But it's a pretty neat process. It's essentially what's called DFA minimization, so if you want to learn a little bit about it, that's the term to look up. And so if I do that, and I build this as a DAWG - still using the same structure definition here, but unifying suffixes as well as prefixes - that actually changes the number of nodes I need drastically. It goes down to 80,000 from my initial 250,000. And if you think about the number of words again: there were 125,000 words in the dictionary, and the DAWG has just 80,000 nodes. So once you start unifying the suffixes, many of the words don't even have any nodes of their own that are unique to them whatsoever. They just exist as portions of other words that have all been joined together and compressed into one package.
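Here is a compact sketch of that bottom-up unification in C++ - my illustration of the idea, not the lecture's actual code. Two nodes can be shared when they carry the same letter, the same word flag, and identical already-unified children, so each node gets a signature, and a registry remembers the first node seen with each signature:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct Node {
        char letter;
        bool isWord;
        std::vector<Node*> children;   // kept in alphabetical order
    };

    std::map<std::string, Node*> registry;  // signature -> canonical node

    Node *unify(Node *n) {
        for (Node *&child : n->children)
            child = unify(child);      // unify the subtrees first (bottom-up)
        std::string sig;
        sig += n->letter;              // signature = letter + word flag +
        sig += n->isWord ? '!' : '.';  // identities of the canonical children
        for (Node *c : n->children)
            sig += std::to_string(reinterpret_cast<std::uintptr_t>(c)) + ",";
        auto it = registry.find(sig);
        if (it != registry.end())
            return it->second;         // an equivalent node exists: share it
        registry[sig] = n;             // (duplicates are leaked here for brevity;
        return n;                      //  a real build would free them)
    }

Running unify on the trie's root collapses equivalent suffix subtrees into single shared nodes, which is where the drop from 250,000 nodes to 80,000 comes from.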
So, 80,000 nodes total - about two-thirds of the number of words. We have ten-byte nodes and 80,000 of them, so we are down to under a megabyte. We've crossed an interesting barrier there: the data that was fed into this structure, that text file of words, was over a megabyte, and now we've created a representation that is actually compressed - that uses less space than the text file did in the first place. And that's really because it's finding all these ways in which words are like each other and sharing them, making use of the existing structure in the DAWG to accommodate new words without adding bulk. Now, once I've done this, I have to say that I've made the data structure even less flexible. To be clear, each of these steps has had a cost somewhere else that made the code more complicated. All this sharing also means that adding a new word takes care. If I go back to look at this picture: say ''portioning'' is a word, but ''adoptioning'' is not. Then if I go in to add the word ''portioning,'' I actually have to split it off of its existing unified branch, because if I just tacked an ''ing'' onto the end of the shared ''portion'' path, it turns out that ''adaptioning'' and ''adoptioning'' would suddenly become words, too. So I have to be extra careful, once I've got it unified this way, that when I add new words I don't accidentally add some additional words that I didn't plan on, because these paths have been shared. It requires a little more careful management - see the small sketch below. Again, though, the goal here was to build a data structure that was kinda freeze-dried: I build it from the input data file I had, it then operates very, very efficiently, and I wasn't planning on making a lot of runtime changes to it. So I get it down to this, and then the last thing I'm going to do is go back to the CS side of it and see if I can squeeze it down in terms of just what's being stored per node.
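As a tiny illustration of why adding a word forces a split - reusing the Node sketch from above, and again not the lecture's code - any shared node along the path has to get a private copy before it can be modified:

    // To add ''portioning'' without also creating ''adoptioning,'' every node
    // along the shared ''portion...'' path must be cloned first, so the new
    // ''ing'' branch hangs only off portion's private copy:
    Node *cloneForEdit(const Node *shared) {
        return new Node(*shared);   // copies letter, isWord, and the child list;
    }                               // edits to the copy no longer leak into the
                                    // other words that shared this node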
So I have 80,000 of these. Typically, when you look at a structure, it's very rare that changing something from an int to a char - a four-byte thing into a one-byte thing - has any profound impact on the total size needed. But if you happen to know you're going to make a lot of these structures - tens of thousands, almost 100,000 of them - then it turns out every byte counts. If I can get this thing down from ten bytes to eight bytes, or six bytes, or four bytes, it will cut my memory by quite a factor. So what I did in the final version is use a little technique called bit packing, which is to stash stuff in the absolute minimum possible space it will fit into. It takes advantage of things like this: there are only 26 letters in the alphabet, but a character is a full eight-bit value, which can hold numbers from zero to 255, and I just don't need that much range. I'm not using the yen sign, I'm not using the parentheses, I'm not using the digits. So having, in some sense, the capacity to store all those distinct characters is a cost I'm paying on every character that I don't need to pay. So in fact, I squeezed it down by dividing things up at the sub-byte level. This is something that's talked about a little bit in Chapter 15 if you're at all curious. It's not what I consider core material for 106B, but I'm mentioning it to complete the picture. Given that a bit is either zero or one, if I have five of them connected together, then I have 2^5 different patterns I can represent in that capacity - that's 32 different patterns, which is enough for my 26 characters, plus a little slack. I use exactly one bit for is-word, rather than a full Boolean; all I need to know is yes or no. I also changed the way a node knows it's the last child of a generation. Rather than having A say, ''there are 12 children here,'' I just say: go to index 27 and start reading, and one of the children will tell you it's the last.
So each of them says, ''No, I'm not the last,'' ''No, I'm not the last,'' and then when you get to the last one, it's marked ''yes,'' and that's how you know where A's children end and the next node's children begin. And if I can get it down in this way, I've taken what was a ten-byte structure down to four bytes. It turns out there are often reasons why certain size boundaries are the critical barriers: if it were five bytes, it would actually likely occupy eight, just because of what are called alignment and padding restrictions. So if I were at seven bytes and squeezed down to six, it might not make any change, and if I squeezed down to five, it still might not make any change - behind the scenes, it might be decided that each of the structures is going to take eight. But getting it down to four tends to cross one of those critical barriers: four actually will be smaller than something that was ten. So I take my 80,000 nodes, I get them to be four bytes each, and now I have a data structure that's about a third of a megabyte in memory, as opposed to the over a megabyte on disk in text form. And so that's how it does what it does. If you go look at the code, you'll see there's this complicated bit-packed structure, and it looks words up using a recursive character-by-character strategy: start at the top, find the matching character, go to the children index, then work your way down the children to find the next matching character, and so on, to find its way through the data structure. The child index is, for example, the idea that A says ''my children start at 27,'' or B says ''mine start at 39'' - so I just need a number there, but I'm using something slightly smaller than an integer, just to make it all fit very tightly. So I'll tell you a couple of interesting things about it, actually, before I close this off. The lexicon.dat file actually is just an on-disk image of what the DAWG data structure looks like in memory. And so, in fact, it's very easy to take the whole data structure - it doesn't have any pointers in it.
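As a hedged illustration of that packing, using C-style bit fields - the exact field widths and names in the real lexicon may differ:

    #include <cstdint>

    struct PackedNode {
        std::uint32_t letter     : 5;   // 2^5 = 32 patterns; enough for 26 letters
        std::uint32_t isWord     : 1;   // one bit instead of a whole bool
        std::uint32_t lastChild  : 1;   // marks the final sibling in a child group
        std::uint32_t childIndex : 25;  // where this node's children begin; 2^25
    };                                  // indexes far more than 80,000 nodes

    // Holds on typical compilers: the four fields pack into one 32-bit word.
    static_assert(sizeof(PackedNode) == 4, "fits in a single 32-bit word");

Note the design trade in lastChild: instead of storing a child count per node, each group of siblings is terminated by a marked node, which is exactly the ''one of them will tell you it's the last'' scheme described above.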
Pointers actually tend to make data structures harder to dehydrate to disk and rebuild, because they have to kinda be wired up in memory - you don't know what memory you're going to get from new, or what the addresses will be. This one doesn't have any pointers. Everything is self-contained: it's just one big array of these blocks, and all the information in it is relative. It says go to index 27, go to index 45. So you can take that whole block and just write it to disk - 320 kilobytes or so - read that same amount back in, and it's rehydrated without any extra effort. That's what the lexicon.dat file is: a dehydrated form of the DAWG having been built in memory. That makes it very, very fast to load. At that point, though, it's super inflexible. If you want to add a word to it, the way the structure is built, you'd have to add nodes and make sure the new word wasn't glomming on to some existing words nearby, so it would be a lot of work to actually edit the DAWG in place. In fact, when you ask it to add a word, it doesn't edit the DAWG at all. You say, ''I'd like to put a new word in,'' and it says: this one's so perfect the way it is - I've built this one, I love this one, I'm very happy with it. If you want to put other words into it, it just keeps a second data structure on the side, an auxiliary one over here, where it sticks the additional words you ask it to put in. I have to say that seems like a very obvious thing to do, but in fact, I didn't even think of it for years. We had the lexicon, and I said: you can't use the lexicon to make new word lists or edit them. You can only load this one beautiful, perfect one. You can't do anything else. And people kept saying, ''I wish I could put some words in the lexicon. I wish I could use it for the word list to keep track of the human and computer players.'' And I'm like, ''Well, you just can't do it. You don't understand. It's very complicated. You just don't understand my pain.'' Whatever - I had excuses. And then, after some number of years of just being a baby about it, I'm like: oh, of course, why don't I just keep another structure off to the side? It's an abstraction. It's a word list. You don't need to know what I do.
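A minimal sketch of that dehydrate/rehydrate round trip, assuming the whole structure is one contiguous array of the 4-byte PackedNode records sketched earlier (the file-handling details here are invented for illustration, not the actual lexicon.dat format):

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    void saveDawg(const std::string &path, const std::vector<PackedNode> &nodes) {
        std::ofstream out(path, std::ios::binary);
        out.write(reinterpret_cast<const char *>(nodes.data()),
                  nodes.size() * sizeof(PackedNode));    // one bulk write, no fixups
    }

    std::vector<PackedNode> loadDawg(const std::string &path) {
        std::ifstream in(path, std::ios::binary | std::ios::ate);
        std::size_t bytes = static_cast<std::size_t>(in.tellg());
        std::vector<PackedNode> nodes(bytes / sizeof(PackedNode));
        in.seekg(0);
        in.read(reinterpret_cast<char *>(nodes.data()), bytes);  // one bulk read
        return nodes;   // all child indices are relative, so they're still valid
    }

Because every reference inside the array is a relative index rather than an address, nothing has to be re-wired after loading - which is exactly why the load is so fast.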
The fact that I keep a DAWG for some of the words and just an ordinary set for the others is really completely irrelevant. As long as, when you ask for a word, I look in both and tell you, ''Yes, it's there,'' why do you need to know where I keep it? (There's a small sketch of this hybrid arrangement below.) So it took me a while to buy abstraction, even though I teach it all the time - it just wasn't totally making sense. So, a neat little thing: just keep both behind the interface. It turns out I learned about this from a paper I read in the early '90s about somebody building a Scrabble player. They wanted a dictionary that supported finding plays on the board and being able to extend words, so knowing things about root words and how they extend, and about prefixes and suffixes, was kinda the original motivation. And this kind of data structure is actually really great for any kind of word play: anagrams that you want to rearrange, or words that you want to extend by prefix and suffix - it's really optimized for doing that kind of traversal, and for leveraging the existing patterns in words to save space while doing so. So it was a really fun little project to do, and kinda neat to think about. Even though today, if you needed a lexicon, you could just use a set - it's fast enough. It takes a ton of memory, but you know what, memory's cheap. So this is not the kind of thing you would need to do today, except in some kinda embedded, low-memory, very tight processor environment. But it still makes for an interesting artifact to think about: how you get there, and what kind of things you can use. So Wednesday we will talk about some big-picture design, some kinda wrap-up stuff, and I will see you then.
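For completeness, here is the hybrid arrangement referenced above as a minimal C++ sketch. The class and member names are mine, and it reuses the FlatNode array and the free contains() function sketched earlier; the real Lexicon's internals differ in detail:

    #include <set>
    #include <string>
    #include <vector>

    class HybridLexicon {
    public:
        void add(const std::string &word) {
            if (!inDawg(word))
                extras.insert(word);       // never touch the frozen DAWG
        }
        bool contains(const std::string &word) const {
            return inDawg(word) || extras.count(word) > 0;  // look in both places
        }
    private:
        bool inDawg(const std::string &word) const {
            return ::contains(nodes, word);  // the flattened walk sketched earlier
        }
        std::vector<FlatNode> nodes;       // the freeze-dried, pointer-free DAWG
        std::set<std::string> extras;      // auxiliary set for words added later
    };

Callers never see the split: the abstraction is just ''a word list,'' regardless of which structure actually holds any given word.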