[Script Info] Title: [Events] Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text Dialogue: 0,0:00:00.00,0:00:12.00,Default,,0000,0000,0000,, Dialogue: 0,0:00:12.00,0:00:16.00,Default,,0000,0000,0000,, Dialogue: 0,0:00:16.00,0:00:25.00,Default,,0000,0000,0000,,This presentation is delivered by the Stanford Center for Professional Dialogue: 0,0:00:25.00,0:00:28.00,Default,,0000,0000,0000,,Development. Dialogue: 0,0:00:28.00,0:00:32.00,Default,,0000,0000,0000,,The piece I need is not here, so I'm going to actually just talk to you about what Dialogue: 0,0:00:32.00,0:00:34.00,Default,,0000,0000,0000,,it says on my slides and then when Jason arrives we'll plug it back in and see what's going on. Dialogue: 0,0:00:34.00,0:00:36.00,Default,,0000,0000,0000,,Oh, there's Jason. There we go. Dialogue: 0,0:00:36.00,0:00:40.00,Default,,0000,0000,0000,,So what I'm going to do today is actually something kinda fun. And so I'm Dialogue: 0,0:00:40.00,0:00:41.00,Default,,0000,0000,0000,,gonna take a case study of a particular data structure Dialogue: 0,0:00:41.00,0:00:45.00,Default,,0000,0000,0000,,design, Dialogue: 0,0:00:45.00,0:00:49.00,Default,,0000,0000,0000,,and the one I'm looking at there is the lexicon. Dialogue: 0,0:00:49.00,0:00:53.00,Default,,0000,0000,0000,,So how many of you actually ever decided to just open up the lexicon file, just to see Dialogue: 0,0:00:53.00,0:00:55.00,Default,,0000,0000,0000,,what was in there at some point during the quarter. We used it a couple of times. Dialogue: 0,0:00:55.00,0:01:00.00,Default,,0000,0000,0000,,Anybody want to tell us what they learned by opening up that file? Dialogue: 0,0:01:00.00,0:01:04.00,Default,,0000,0000,0000,,Anything interesting? Dialogue: 0,0:01:04.00,0:01:05.00,Default,,0000,0000,0000,,It's a data structure called a DAWG. You're like, well that's a pretty funny name for something. Dialogue: 0,0:01:05.00,0:01:10.00,Default,,0000,0000,0000,,And Dialogue: 0,0:01:10.00,0:01:14.00,Default,,0000,0000,0000,,was the code particularly well-[inaudible], easy to read, easy to understand? Dialogue: 0,0:01:14.00,0:01:18.00,Default,,0000,0000,0000,,No. Who wrote that code? She should totally be fired. Dialogue: 0,0:01:18.00,0:01:22.00,Default,,0000,0000,0000,,Yeah, so this what I wrote kinda a long time ago, actually, to solve this particular problem. And actually I would like to key through Dialogue: 0,0:01:22.00,0:01:28.00,Default,,0000,0000,0000,,kinda how it is I came to discover the DAWG and why it is the DAWG was the Dialogue: 0,0:01:28.00,0:01:29.00,Default,,0000,0000,0000,,choice that made sense for what the need pattern we had for the lexicon was. Dialogue: 0,0:01:29.00,0:01:30.00,Default,,0000,0000,0000,,So let me just refresh what Dialogue: 0,0:01:30.00,0:01:34.00,Default,,0000,0000,0000,,the lexicon does. Dialogue: 0,0:01:34.00,0:01:36.00,Default,,0000,0000,0000,,It is a special case of kinda set or map. It is a container where you're going stick things Dialogue: 0,0:01:36.00,0:01:38.00,Default,,0000,0000,0000,,in, they're going to be unique, right? There's no Dialogue: 0,0:01:38.00,0:01:41.00,Default,,0000,0000,0000,,need for any sort of duplicate. Dialogue: 0,0:01:41.00,0:01:43.00,Default,,0000,0000,0000,,The keys are always strings. And in fact, they're English words, so that's going to be an important Dialogue: 0,0:01:43.00,0:01:47.00,Default,,0000,0000,0000,,thing to remember, that Dialogue: 0,0:01:47.00,0:01:48.00,Default,,0000,0000,0000,,they're not just arbitrary sequences of letters, right? They have certain patterns to Dialogue: 0,0:01:48.00,0:01:51.00,Default,,0000,0000,0000,,them Dialogue: 0,0:01:51.00,0:01:54.00,Default,,0000,0000,0000,,and certain sizes that might actually be something we could capitalize on. When we know Dialogue: 0,0:01:54.00,0:01:57.00,Default,,0000,0000,0000,,something about the domain, we're able to kinda special case or tune or Dialogue: 0,0:01:57.00,0:01:58.00,Default,,0000,0000,0000,,data structure to solve that problem Dialogue: 0,0:01:58.00,0:01:59.00,Default,,0000,0000,0000,,very well, Dialogue: 0,0:01:59.00,0:02:03.00,Default,,0000,0000,0000,,in particular. Dialogue: 0,0:02:03.00,0:02:07.00,Default,,0000,0000,0000,,It has no associated value, so it doesn't have dictionary definitions or synonyms or anything Dialogue: 0,0:02:07.00,0:02:08.00,Default,,0000,0000,0000,,like that, right? So it really just is exists or doesn't exist Dialogue: 0,0:02:08.00,0:02:12.00,Default,,0000,0000,0000,,that we were interested in supporting. Dialogue: 0,0:02:12.00,0:02:15.00,Default,,0000,0000,0000,,And it also has, in addition to the kind of being able to add words and check if a Dialogue: 0,0:02:15.00,0:02:16.00,Default,,0000,0000,0000,,word is contained, it's also this prefixing operation that we had talked a Dialogue: 0,0:02:16.00,0:02:19.00,Default,,0000,0000,0000,,little about Dialogue: 0,0:02:19.00,0:02:23.00,Default,,0000,0000,0000,,in Boggle, and that came back on the midterm. Like, why was prefixing important, right? Dialogue: 0,0:02:23.00,0:02:25.00,Default,,0000,0000,0000,,It was critical for pruning these dead-end paths, these searches on the Dialogue: 0,0:02:25.00,0:02:26.00,Default,,0000,0000,0000,,Boggle board that otherwise Dialogue: 0,0:02:26.00,0:02:32.00,Default,,0000,0000,0000,,would really have Dialogue: 0,0:02:32.00,0:02:32.00,Default,,0000,0000,0000,,taken a lot of time and kinda led to nothing of any value. So what I'm gonna to talk you Dialogue: 0,0:02:32.00,0:02:36.00,Default,,0000,0000,0000,,through Dialogue: 0,0:02:36.00,0:02:40.00,Default,,0000,0000,0000,,is some of the obvious choices based on data structures we already know about, Dialogue: 0,0:02:40.00,0:02:43.00,Default,,0000,0000,0000,,and see where they would go. So this is a thought experiment. It's not that I wrote all Dialogue: 0,0:02:43.00,0:02:45.00,Default,,0000,0000,0000,,these to find out. Actually, I mostly did a lot of this on paper, trying to figure Dialogue: 0,0:02:45.00,0:02:48.00,Default,,0000,0000,0000,,out what options we had Dialogue: 0,0:02:48.00,0:02:49.00,Default,,0000,0000,0000,,and how they would play out and what kind of trade-offs I'd have to make to make them Dialogue: 0,0:02:49.00,0:02:52.00,Default,,0000,0000,0000,,work. Dialogue: 0,0:02:52.00,0:02:55.00,Default,,0000,0000,0000,,So the simplest case of any kind of collection is to think about Dialogue: 0,0:02:55.00,0:02:58.00,Default,,0000,0000,0000,,whether vector or array can do the job for you. Dialogue: 0,0:02:58.00,0:03:00.00,Default,,0000,0000,0000,,In this case, having a vector that's kept in sorted - alphabetically sorted - order seems Dialogue: 0,0:03:00.00,0:03:03.00,Default,,0000,0000,0000,,like a pretty good, Dialogue: 0,0:03:03.00,0:03:05.00,Default,,0000,0000,0000,,easy way to get this set up and running and how would it behave? Dialogue: 0,0:03:05.00,0:03:06.00,Default,,0000,0000,0000,,So in order to do a contains Dialogue: 0,0:03:06.00,0:03:07.00,Default,,0000,0000,0000,,(word) Dialogue: 0,0:03:07.00,0:03:15.00,Default,,0000,0000,0000,,on this data structure, Dialogue: 0,0:03:15.00,0:03:17.00,Default,,0000,0000,0000,,what algorithm are you going to use to do your searching? Binary search, Dialogue: 0,0:03:17.00,0:03:21.00,Default,,0000,0000,0000,,right? It's in sorted order, right? We've got to take advantage of that, right? If we do Dialogue: 0,0:03:21.00,0:03:25.00,Default,,0000,0000,0000,,a linear search I'm looking for the word ''mediocre,'' I'm going to be trucking through, Dialogue: 0,0:03:25.00,0:03:28.00,Default,,0000,0000,0000,,you know, thousands of words before I find it. So definitely want to be using binary search. Dialogue: 0,0:03:28.00,0:03:29.00,Default,,0000,0000,0000,,Equals-equals, comparing my strings, less than, greater than, dividing in half. And so Dialogue: 0,0:03:29.00,0:03:32.00,Default,,0000,0000,0000,,it should run a logarithm time. Dialogue: 0,0:03:32.00,0:03:35.00,Default,,0000,0000,0000,,Logarithm time, one of those greater performers that you Dialogue: 0,0:03:35.00,0:03:38.00,Default,,0000,0000,0000,,learned in the PQ assignment, hopefully. It's just basically not measurable on Dialogue: 0,0:03:38.00,0:03:41.00,Default,,0000,0000,0000,,today's computer. So even for entries of Dialogue: 0,0:03:41.00,0:03:42.00,Default,,0000,0000,0000,,100,000, 500,000, basically Dialogue: 0,0:03:42.00,0:03:46.00,Default,,0000,0000,0000,,for free. Dialogue: 0,0:03:46.00,0:03:50.00,Default,,0000,0000,0000,,How do we do contains (prefix)? Can we also do a fast prefix search given this Dialogue: 0,0:03:50.00,0:03:51.00,Default,,0000,0000,0000,,data structure? Dialogue: 0,0:03:51.00,0:03:54.00,Default,,0000,0000,0000,,I hope so right? Thinking about Dialogue: 0,0:03:54.00,0:03:57.00,Default,,0000,0000,0000,,the same kind of binary search. In this case, though, looking at substring, right? So Dialogue: 0,0:03:57.00,0:03:59.00,Default,,0000,0000,0000,,comparing I'm looking for a three-character substring, only looking at the Dialogue: 0,0:03:59.00,0:04:02.00,Default,,0000,0000,0000,,first three characters, decide which way to go. Dialogue: 0,0:04:02.00,0:04:05.00,Default,,0000,0000,0000,,Once I find something that matches in the first three characters, I'm done. Dialogue: 0,0:04:05.00,0:04:07.00,Default,,0000,0000,0000,,So again, still using the sorted property to quickly narrow in on where Dialogue: 0,0:04:07.00,0:04:10.00,Default,,0000,0000,0000,,it could be. Dialogue: 0,0:04:10.00,0:04:14.00,Default,,0000,0000,0000,,When it's time to add a new word to this, Dialogue: 0,0:04:14.00,0:04:17.00,Default,,0000,0000,0000,,if someone asked you to put in the word ''abalone,'' I gotta Dialogue: 0,0:04:17.00,0:04:21.00,Default,,0000,0000,0000,,use that binary search to find out where to go. So in logarithmic time I can Dialogue: 0,0:04:21.00,0:04:23.00,Default,,0000,0000,0000,,tell you where the new word goes, but then to put it into position, Dialogue: 0,0:04:23.00,0:04:26.00,Default,,0000,0000,0000,,there's gonna be some shuffling. Dialogue: 0,0:04:26.00,0:04:28.00,Default,,0000,0000,0000,,And that's where this operation Dialogue: 0,0:04:28.00,0:04:32.00,Default,,0000,0000,0000,,starts to bog down, Dialogue: 0,0:04:32.00,0:04:33.00,Default,,0000,0000,0000,,given the idea that you might be moving half or more of your items, Dialogue: 0,0:04:33.00,0:04:36.00,Default,,0000,0000,0000,,on average, Dialogue: 0,0:04:36.00,0:04:40.00,Default,,0000,0000,0000,,to make that space for that one to open up. Dialogue: 0,0:04:40.00,0:04:42.00,Default,,0000,0000,0000,,It is going to require a linear time operation to get new words into the data Dialogue: 0,0:04:42.00,0:04:43.00,Default,,0000,0000,0000,,structure. Okay, Dialogue: 0,0:04:43.00,0:04:48.00,Default,,0000,0000,0000,,probably still, though, Dialogue: 0,0:04:48.00,0:04:51.00,Default,,0000,0000,0000,,a reasonable approach, right? These are the operations that get done Dialogue: 0,0:04:51.00,0:04:54.00,Default,,0000,0000,0000,,immense numbers of times in Boggle, right? You're doing a ton of contains (word)'s and contains (prefix)'s Dialogue: 0,0:04:54.00,0:04:58.00,Default,,0000,0000,0000,,as you're exhaustively searching that big board. Dialogue: 0,0:04:58.00,0:05:01.00,Default,,0000,0000,0000,,But the add's are done kinda once at the beginning. You build that big dictionary and Dialogue: 0,0:05:01.00,0:05:02.00,Default,,0000,0000,0000,,then you subsequently use it. And so if it took a little bit more time to built Dialogue: 0,0:05:02.00,0:05:06.00,Default,,0000,0000,0000,,that thing, Dialogue: 0,0:05:06.00,0:05:08.00,Default,,0000,0000,0000,,sorta an n-squared operation to kinda load it up, well, Dialogue: 0,0:05:08.00,0:05:11.00,Default,,0000,0000,0000,,maybe that would be okay. Dialogue: 0,0:05:11.00,0:05:14.00,Default,,0000,0000,0000,,The other we want to look at is space usage, and that's - and some of this is a little Dialogue: 0,0:05:14.00,0:05:18.00,Default,,0000,0000,0000,,bit of an interesting artifact of the time I was working on this. It turns Dialogue: 0,0:05:18.00,0:05:21.00,Default,,0000,0000,0000,,out space was actually a limiting factor in what we had available Dialogue: 0,0:05:21.00,0:05:22.00,Default,,0000,0000,0000,,to us, in terms of core memories of our machines and Dialogue: 0,0:05:22.00,0:05:25.00,Default,,0000,0000,0000,,what we could take up for our dictionary. Dialogue: 0,0:05:25.00,0:05:28.00,Default,,0000,0000,0000,,In this case, it has per entry the size of the string, Dialogue: 0,0:05:28.00,0:05:29.00,Default,,0000,0000,0000,,which we don't exactly know the string it represented. It's an abstraction Dialogue: 0,0:05:29.00,0:05:32.00,Default,,0000,0000,0000,,to us, but Dialogue: 0,0:05:32.00,0:05:35.00,Default,,0000,0000,0000,,we can make the assumption of kinda a simplification, that Dialogue: 0,0:05:35.00,0:05:37.00,Default,,0000,0000,0000,,it's probably about the length of the Dialogue: 0,0:05:37.00,0:05:39.00,Default,,0000,0000,0000,,string times of the size of the character. And there might be a little bit more Dialogue: 0,0:05:39.00,0:05:40.00,Default,,0000,0000,0000,,housekeeping associated with it, but Dialogue: 0,0:05:40.00,0:05:42.00,Default,,0000,0000,0000,, Dialogue: 0,0:05:42.00,0:05:44.00,Default,,0000,0000,0000,,that's a pretty good estimate of what it's taking. So in fact, it's the string data Dialogue: 0,0:05:44.00,0:05:46.00,Default,,0000,0000,0000,,itself Dialogue: 0,0:05:46.00,0:05:50.00,Default,,0000,0000,0000,,that's being represented in this vector Dialogue: 0,0:05:50.00,0:05:51.00,Default,,0000,0000,0000,,without much else overhead. So if we say words are about average of about eight characters, Dialogue: 0,0:05:51.00,0:05:54.00,Default,,0000,0000,0000,,about eight bytes per word. Dialogue: 0,0:05:54.00,0:05:57.00,Default,,0000,0000,0000,,So if we have 100,000 we expect to have 800,000 bytes Dialogue: 0,0:05:57.00,0:06:00.00,Default,,0000,0000,0000,,invested in this sort of vector arrangement. Dialogue: 0,0:06:00.00,0:06:02.00,Default,,0000,0000,0000,,Excess capacity, a couple other things that might tweak that a little bit, but that Dialogue: 0,0:06:02.00,0:06:04.00,Default,,0000,0000,0000,,gives us at least a good starting point to Dialogue: 0,0:06:04.00,0:06:07.00,Default,,0000,0000,0000,,think about how this Dialogue: 0,0:06:07.00,0:06:09.00,Default,,0000,0000,0000,,compares to alternatives. Okay, Dialogue: 0,0:06:09.00,0:06:12.00,Default,,0000,0000,0000,,so then if we said Dialogue: 0,0:06:12.00,0:06:16.00,Default,,0000,0000,0000,,we know that sortedness - that binary search is the Dialogue: 0,0:06:16.00,0:06:17.00,Default,,0000,0000,0000,,real advantage of keeping it in sorted order, but the contiguous nature of the array Dialogue: 0,0:06:17.00,0:06:19.00,Default,,0000,0000,0000,,kinda Dialogue: 0,0:06:19.00,0:06:23.00,Default,,0000,0000,0000,,bites us in terms of insert. Dialogue: 0,0:06:23.00,0:06:26.00,Default,,0000,0000,0000,,If we reorganize into that binary search tree to give ourselves the dynamic flexibility of Dialogue: 0,0:06:26.00,0:06:27.00,Default,,0000,0000,0000,,wiring in the left and right halves through pointers, as opposed to Dialogue: 0,0:06:27.00,0:06:29.00,Default,,0000,0000,0000,,contiguous memory, Dialogue: 0,0:06:29.00,0:06:31.00,Default,,0000,0000,0000,,then we know we can also get add to run Dialogue: 0,0:06:31.00,0:06:35.00,Default,,0000,0000,0000,,in logarithmic time. So Dialogue: 0,0:06:35.00,0:06:36.00,Default,,0000,0000,0000,,looking at these operations, contains (word) is a tree search that's using equals-equals, Dialogue: 0,0:06:36.00,0:06:40.00,Default,,0000,0000,0000,,left-right, deciding which way to go, Dialogue: 0,0:06:40.00,0:06:43.00,Default,,0000,0000,0000,,and also performs logarithmically if the tree is balanced. Dialogue: 0,0:06:43.00,0:06:45.00,Default,,0000,0000,0000,,Contains (prefix), same deal. Keep Dialogue: 0,0:06:45.00,0:06:46.00,Default,,0000,0000,0000,,looking at the [inaudible], Dialogue: 0,0:06:46.00,0:06:48.00,Default,,0000,0000,0000,,working our way down, knowing which way to go. Dialogue: 0,0:06:48.00,0:06:51.00,Default,,0000,0000,0000,,And then Dialogue: 0,0:06:51.00,0:06:54.00,Default,,0000,0000,0000,,add can also perform in logarithmic time Dialogue: 0,0:06:54.00,0:06:57.00,Default,,0000,0000,0000,,because it does the same search, falling off the bottom and Dialogue: 0,0:06:57.00,0:06:58.00,Default,,0000,0000,0000,,then inserting the new word there, or finding it along the way and not having any Dialogue: 0,0:06:58.00,0:07:01.00,Default,,0000,0000,0000,,work to do. Dialogue: 0,0:07:01.00,0:07:03.00,Default,,0000,0000,0000,,So we can get all our operations done in logarithmic time, which we know to be kinda Dialogue: 0,0:07:03.00,0:07:05.00,Default,,0000,0000,0000,,fast enough to actually not be worth Dialogue: 0,0:07:05.00,0:07:08.00,Default,,0000,0000,0000,,fighting any harder for. Dialogue: 0,0:07:08.00,0:07:11.00,Default,,0000,0000,0000,,Where did we pay for this? Dialogue: 0,0:07:11.00,0:07:13.00,Default,,0000,0000,0000,,Nothing really comes for free, right? Dialogue: 0,0:07:13.00,0:07:15.00,Default,,0000,0000,0000,,Overhead, right? And overhead, in this case, memory is certainly one of the places Dialogue: 0,0:07:15.00,0:07:20.00,Default,,0000,0000,0000,,where there's going to be a Dialogue: 0,0:07:20.00,0:07:23.00,Default,,0000,0000,0000,,pretty big cost that we now have the string data for each entry, plus a Dialogue: 0,0:07:23.00,0:07:26.00,Default,,0000,0000,0000,,left pointer and a right pointer. And if we're doing the work to keep it Dialogue: 0,0:07:26.00,0:07:30.00,Default,,0000,0000,0000,,balanced, which we probably ought to do if we want to avoid the degenerate cases Dialogue: 0,0:07:30.00,0:07:31.00,Default,,0000,0000,0000,,of it kinda getting totally out of whack and degenerating into linear time, Dialogue: 0,0:07:31.00,0:07:34.00,Default,,0000,0000,0000,,we're going to have a couple more Dialogue: 0,0:07:34.00,0:07:37.00,Default,,0000,0000,0000,,bits, you know, bytes thrown into that factor, too. Dialogue: 0,0:07:37.00,0:07:39.00,Default,,0000,0000,0000,,And so if we're assuming that there's Dialogue: 0,0:07:39.00,0:07:42.00,Default,,0000,0000,0000,,ten bytes of overhead, Dialogue: 0,0:07:42.00,0:07:46.00,Default,,0000,0000,0000,,four bytes in each of these plus another two for the balance factor, then we've added Dialogue: 0,0:07:46.00,0:07:48.00,Default,,0000,0000,0000,,ten bytes of overhead on top of the eight bytes for the word. Dialogue: 0,0:07:48.00,0:07:48.00,Default,,0000,0000,0000,,Getting a little Dialogue: 0,0:07:48.00,0:07:52.00,Default,,0000,0000,0000,,bit hefty. Something to worry about. Dialogue: 0,0:07:52.00,0:07:56.00,Default,,0000,0000,0000,,Also in terms of code complexity, which I didn't put a bullet up for, but it's Dialogue: 0,0:07:56.00,0:07:59.00,Default,,0000,0000,0000,,also worth kinda keeping - weighing it in your mind also which is Dialogue: 0,0:07:59.00,0:08:01.00,Default,,0000,0000,0000,,that building the fancier, more complicated data structure probably meant Dialogue: 0,0:08:01.00,0:08:02.00,Default,,0000,0000,0000,,you spent more time debugging it, Dialogue: 0,0:08:02.00,0:08:06.00,Default,,0000,0000,0000,,more time developing it, Dialogue: 0,0:08:06.00,0:08:08.00,Default,,0000,0000,0000,,right? You will have - it will be harder for someone later to maintain and deal Dialogue: 0,0:08:08.00,0:08:10.00,Default,,0000,0000,0000,,with because it's actually just more complicated code that somebody looking at a Dialogue: 0,0:08:10.00,0:08:12.00,Default,,0000,0000,0000,,sorted vector Dialogue: 0,0:08:12.00,0:08:13.00,Default,,0000,0000,0000,,isn't likely to actually - like, they want to add a new operation, Dialogue: 0,0:08:13.00,0:08:16.00,Default,,0000,0000,0000,,let's say they add remove. Dialogue: 0,0:08:16.00,0:08:19.00,Default,,0000,0000,0000,,You're able to take a word out of the lexicon. The lexicon doesn't currently support Dialogue: 0,0:08:19.00,0:08:23.00,Default,,0000,0000,0000,,that. Somebody adding that on the sorted vector implementation probably won't find themselves Dialogue: 0,0:08:23.00,0:08:25.00,Default,,0000,0000,0000,,too tripped up. Someone who has to do a remove out of a binary search tree, removing it, reattaching Dialogue: 0,0:08:25.00,0:08:27.00,Default,,0000,0000,0000,,the subtrees Dialogue: 0,0:08:27.00,0:08:30.00,Default,,0000,0000,0000,,and not destroying the balancing Dialogue: 0,0:08:30.00,0:08:32.00,Default,,0000,0000,0000,,is a pretty complicated operation. So we've made - we've invested in this data Dialogue: 0,0:08:32.00,0:08:34.00,Default,,0000,0000,0000,,structure that actually will keep on Dialogue: 0,0:08:34.00,0:08:36.00,Default,,0000,0000,0000,,creating Dialogue: 0,0:08:36.00,0:08:38.00,Default,,0000,0000,0000,,further development hassles for any modifications that are down the road for Dialogue: 0,0:08:38.00,0:08:40.00,Default,,0000,0000,0000,,this data structure. So Dialogue: 0,0:08:40.00,0:08:45.00,Default,,0000,0000,0000,,it is Dialogue: 0,0:08:45.00,0:08:47.00,Default,,0000,0000,0000,,a big investment in your brainpower to keep it working. So Dialogue: 0,0:08:47.00,0:08:51.00,Default,,0000,0000,0000,,let's think about what a hash table can do for us. A Dialogue: 0,0:08:51.00,0:08:56.00,Default,,0000,0000,0000,,hash table is the last thing that we looked at on Friday, Dialogue: 0,0:08:56.00,0:08:58.00,Default,,0000,0000,0000,,kinda just moving away from this whole idea of sortedness and kinda moving it out of bound off to this idea of a Dialogue: 0,0:08:58.00,0:09:02.00,Default,,0000,0000,0000,,hashing function. Dialogue: 0,0:09:02.00,0:09:05.00,Default,,0000,0000,0000,,So if I have a hash table that has a certain number of buckets Dialogue: 0,0:09:05.00,0:09:06.00,Default,,0000,0000,0000,,that uses a link list for chaining to Dialogue: 0,0:09:06.00,0:09:10.00,Default,,0000,0000,0000,,handle the collision, Dialogue: 0,0:09:10.00,0:09:11.00,Default,,0000,0000,0000,,then what I would have is my words just scattered around my table, Dialogue: 0,0:09:11.00,0:09:14.00,Default,,0000,0000,0000,, Dialogue: 0,0:09:14.00,0:09:16.00,Default,,0000,0000,0000,,allowing me to have quick access by hashing to the right bucket and then Dialogue: 0,0:09:16.00,0:09:19.00,Default,,0000,0000,0000,,working my way down the bucket to C. Dialogue: 0,0:09:19.00,0:09:21.00,Default,,0000,0000,0000,,So the contains (word) would do a standard Dialogue: 0,0:09:21.00,0:09:24.00,Default,,0000,0000,0000,,hash table lookup. Dialogue: 0,0:09:24.00,0:09:27.00,Default,,0000,0000,0000,,Take the word, you know, ''mediate,'' run it through the hash function. It says oh, it's bucket Dialogue: 0,0:09:27.00,0:09:30.00,Default,,0000,0000,0000,,three in this case. I go to bucket three. I walk down the link list. I find it or Dialogue: 0,0:09:30.00,0:09:31.00,Default,,0000,0000,0000,,not. I can tell you whether the word ''mediate,'' if it's in the table, had to be in bucket Dialogue: 0,0:09:31.00,0:09:33.00,Default,,0000,0000,0000,,three. Dialogue: 0,0:09:33.00,0:09:35.00,Default,,0000,0000,0000,,No questions asked. Dialogue: 0,0:09:35.00,0:09:38.00,Default,,0000,0000,0000,,In that case, the Dialogue: 0,0:09:38.00,0:09:39.00,Default,,0000,0000,0000,,expected time for that is going to be n over b, where n is the number of entries, b is the number of Dialogue: 0,0:09:39.00,0:09:41.00,Default,,0000,0000,0000,,buckets. Dialogue: 0,0:09:41.00,0:09:42.00,Default,,0000,0000,0000,,If we have buckets Dialogue: 0,0:09:42.00,0:09:47.00,Default,,0000,0000,0000,,set to be in the Dialogue: 0,0:09:47.00,0:09:52.00,Default,,0000,0000,0000,,rough order of the number of entries, then we're going to get constant access there. Dialogue: 0,0:09:52.00,0:09:54.00,Default,,0000,0000,0000,,Now we want to do contains (prefix). I want Dialogue: 0,0:09:54.00,0:10:00.00,Default,,0000,0000,0000,,to Dialogue: 0,0:10:00.00,0:10:05.00,Default,,0000,0000,0000,,know if there's any words in this table that begin with ''med.'' Dialogue: 0,0:10:05.00,0:10:09.00,Default,,0000,0000,0000,,Where do I look? Dialogue: 0,0:10:09.00,0:10:11.00,Default,,0000,0000,0000,,If I run ''med'' through the hash function, Dialogue: 0,0:10:11.00,0:10:15.00,Default,,0000,0000,0000,,do I expect that I would get the same number Dialogue: 0,0:10:15.00,0:10:18.00,Default,,0000,0000,0000,,as other words that begin with ''med,'' like ''mediate'' and ''median'' and Dialogue: 0,0:10:18.00,0:10:19.00,Default,,0000,0000,0000,,''mediator?'' Yes. Dialogue: 0,0:10:19.00,0:10:22.00,Default,,0000,0000,0000,,You say yes. All Dialogue: 0,0:10:22.00,0:10:26.00,Default,,0000,0000,0000,,right, let's take that to an extreme. Dialogue: 0,0:10:26.00,0:10:29.00,Default,,0000,0000,0000,,If I expect that if a prefix in any longer words all have the same thing, then should it Dialogue: 0,0:10:29.00,0:10:33.00,Default,,0000,0000,0000,,be the case, for example, I think about the simplest case of a prefix, ''m.'' Dialogue: 0,0:10:33.00,0:10:35.00,Default,,0000,0000,0000,,Should all the words that begin with ''m'' hash to the same place? Dialogue: 0,0:10:35.00,0:10:38.00,Default,,0000,0000,0000,,If they did, we're in trouble, right? Dialogue: 0,0:10:38.00,0:10:42.00,Default,,0000,0000,0000,,Now, if all prefixes did hash to the same place, then what I'm going to end up with is a Dialogue: 0,0:10:42.00,0:10:45.00,Default,,0000,0000,0000,,really heavily clustered table where ''a,'' ''b,'' ''c,'' you know, ''z'' - Dialogue: 0,0:10:45.00,0:10:46.00,Default,,0000,0000,0000,,there's 26 buckets that have all the ''z'' words, all the Dialogue: 0,0:10:46.00,0:10:48.00,Default,,0000,0000,0000,,''y'' words, all the what not, Dialogue: 0,0:10:48.00,0:10:52.00,Default,,0000,0000,0000,,and none of the other buckets would be used. All Dialogue: 0,0:10:52.00,0:10:55.00,Default,,0000,0000,0000,,right, so in fact, actually, as part of the good behavior of a hash function, we're Dialogue: 0,0:10:55.00,0:10:58.00,Default,,0000,0000,0000,,hopping that, actually, it does jumble things up. And then a word that looks like one Dialogue: 0,0:10:58.00,0:11:02.00,Default,,0000,0000,0000,,but has another letter added on the end, we're hoping it goes somewhere totally different, Dialogue: 0,0:11:02.00,0:11:04.00,Default,,0000,0000,0000,,that ''median'' and ''medians'' hopefully go to two totally disparate places. Dialogue: 0,0:11:04.00,0:11:07.00,Default,,0000,0000,0000,,That if there were patterns where Dialogue: 0,0:11:07.00,0:11:10.00,Default,,0000,0000,0000,,words that were substrings of each other all got hashed to the same Dialogue: 0,0:11:10.00,0:11:12.00,Default,,0000,0000,0000,,location, we would end up with heavy clustering. Because there are a lot of words that Dialogue: 0,0:11:12.00,0:11:15.00,Default,,0000,0000,0000,,repeat parts of the other words in the dictionary. Dialogue: 0,0:11:15.00,0:11:19.00,Default,,0000,0000,0000,,So in fact, if I want to know if there's any words that begin with ''med,'' Dialogue: 0,0:11:19.00,0:11:21.00,Default,,0000,0000,0000,,I have no a priori information about where to look for Dialogue: 0,0:11:21.00,0:11:25.00,Default,,0000,0000,0000,,them. They could be anywhere. Dialogue: 0,0:11:25.00,0:11:26.00,Default,,0000,0000,0000,,I could hash ''med'' and see if ''med'' itself was a word, right? Dialogue: 0,0:11:26.00,0:11:30.00,Default,,0000,0000,0000,,But let's say it was something that Dialogue: 0,0:11:30.00,0:11:33.00,Default,,0000,0000,0000,,itself was just the beginning of something. ''St,'' ''str'' - Dialogue: 0,0:11:33.00,0:11:36.00,Default,,0000,0000,0000,,there's a lot of words that begin with that, but ''str'' itself is not a word. So Dialogue: 0,0:11:36.00,0:11:39.00,Default,,0000,0000,0000,,hashing to that bucket and looking for it would only tell me if there's exactly ''str,'' and then Dialogue: 0,0:11:39.00,0:11:41.00,Default,,0000,0000,0000,,occasionally there might be some ''str'' words also there. Dialogue: 0,0:11:41.00,0:11:44.00,Default,,0000,0000,0000,,But other than that, we just gotta look. Dialogue: 0,0:11:44.00,0:11:45.00,Default,,0000,0000,0000,,And where do we look? Everywhere. Dialogue: 0,0:11:45.00,0:11:50.00,Default,,0000,0000,0000,,Everywhere, everywhere, everywhere. Dialogue: 0,0:11:50.00,0:11:53.00,Default,,0000,0000,0000,,That in order to conclude there is no word that begins with the prefix we have, Dialogue: 0,0:11:53.00,0:11:55.00,Default,,0000,0000,0000,,the only way to be confident of that is to either find something, or to have searched Dialogue: 0,0:11:55.00,0:11:59.00,Default,,0000,0000,0000,,the entire table and not found it. Dialogue: 0,0:11:59.00,0:12:02.00,Default,,0000,0000,0000,,And so it does require this completely linear, exhaustive Dialogue: 0,0:12:02.00,0:12:04.00,Default,,0000,0000,0000,,look through the entire contents of the hash table to know Dialogue: 0,0:12:04.00,0:12:07.00,Default,,0000,0000,0000,,something, for example, is not a prefix. Dialogue: 0,0:12:07.00,0:12:09.00,Default,,0000,0000,0000,,You might find one quickly. So in the best case, you might be in the first couple of Dialogue: 0,0:12:09.00,0:12:10.00,Default,,0000,0000,0000,,buckets, happen upon one. Dialogue: 0,0:12:10.00,0:12:14.00,Default,,0000,0000,0000,,But no guarantees, Dialogue: 0,0:12:14.00,0:12:16.00,Default,,0000,0000,0000,,overall. And certainly when it's not a prefix, it's going to require a complete look-see of everything. Dialogue: 0,0:12:16.00,0:12:18.00,Default,,0000,0000,0000,,That's kinda a bummer. Adding Dialogue: 0,0:12:18.00,0:12:19.00,Default,,0000,0000,0000,,a Dialogue: 0,0:12:19.00,0:12:22.00,Default,,0000,0000,0000,,word, Dialogue: 0,0:12:22.00,0:12:24.00,Default,,0000,0000,0000,,back to the fast behavior again. Hashing in the bucket, Dialogue: 0,0:12:24.00,0:12:26.00,Default,,0000,0000,0000,,checking to see if it's already there, adding if needed Dialogue: 0,0:12:26.00,0:12:28.00,Default,,0000,0000,0000,,[inaudible]. Dialogue: 0,0:12:28.00,0:12:31.00,Default,,0000,0000,0000,,So it gets good times on these two, Dialogue: 0,0:12:31.00,0:12:34.00,Default,,0000,0000,0000,,but then kinda really bottoms out when it comes to supporting Dialogue: 0,0:12:34.00,0:12:35.00,Default,,0000,0000,0000,,that prefix search. Dialogue: 0,0:12:35.00,0:12:38.00,Default,,0000,0000,0000,,Its space usage Dialogue: 0,0:12:38.00,0:12:42.00,Default,,0000,0000,0000,,is kinda roughly comparable to that of the binary search tree. Dialogue: 0,0:12:42.00,0:12:42.00,Default,,0000,0000,0000,,We've got the string data itself, and we've got the four-byte pointer on the link list. Dialogue: 0,0:12:42.00,0:12:44.00,Default,,0000,0000,0000,, Dialogue: 0,0:12:44.00,0:12:48.00,Default,,0000,0000,0000,,There's also the four-byte bucket pointer, Dialogue: 0,0:12:48.00,0:12:51.00,Default,,0000,0000,0000,,which is over in this other structure here, this array full of buckets. Dialogue: 0,0:12:51.00,0:12:54.00,Default,,0000,0000,0000,,If that number, though, is roughly equal to the number of cells, then you can kinda Dialogue: 0,0:12:54.00,0:12:56.00,Default,,0000,0000,0000,,just count it as each entry Dialogue: 0,0:12:56.00,0:12:59.00,Default,,0000,0000,0000,,had this Dialogue: 0,0:12:59.00,0:13:02.00,Default,,0000,0000,0000,,12-byte cell, plus a four-byte bucket pointer somewhere Dialogue: 0,0:13:02.00,0:13:04.00,Default,,0000,0000,0000,,that you can kinda think of as associated on a per entry basis to tell you it's about Dialogue: 0,0:13:04.00,0:13:06.00,Default,,0000,0000,0000,, Dialogue: 0,0:13:06.00,0:13:10.00,Default,,0000,0000,0000,,eight bytes of overhead on top of the string. Dialogue: 0,0:13:10.00,0:13:11.00,Default,,0000,0000,0000,,So 16 bytes for each entry in the table. Dialogue: 0,0:13:11.00,0:13:16.00,Default,,0000,0000,0000,,Okay, that's Dialogue: 0,0:13:16.00,0:13:18.00,Default,,0000,0000,0000,,what we have so far. Dialogue: 0,0:13:18.00,0:13:21.00,Default,,0000,0000,0000,,We've got Dialogue: 0,0:13:21.00,0:13:24.00,Default,,0000,0000,0000,,pretty good performance on contains (word) no matter what we Dialogue: 0,0:13:24.00,0:13:27.00,Default,,0000,0000,0000,,choose. We can get either logarithmic or constant time with either of these three Dialogue: 0,0:13:27.00,0:13:31.00,Default,,0000,0000,0000,,data structure. So perfectly acceptable performance there. Dialogue: 0,0:13:31.00,0:13:33.00,Default,,0000,0000,0000,,We cannot get contains (prefix) to run well on a hash table. Dialogue: 0,0:13:33.00,0:13:34.00,Default,,0000,0000,0000,,Just doesn't happen the way hashing works. Dialogue: 0,0:13:34.00,0:13:38.00,Default,,0000,0000,0000,,And then add Dialogue: 0,0:13:38.00,0:13:41.00,Default,,0000,0000,0000,,can be fast on the two complicated pointer-based data structures to the right, Dialogue: 0,0:13:41.00,0:13:42.00,Default,,0000,0000,0000,,but can't be fast on the sorted vector. Dialogue: 0,0:13:42.00,0:13:45.00,Default,,0000,0000,0000,,And then Dialogue: 0,0:13:45.00,0:13:47.00,Default,,0000,0000,0000,,we see kinda some trade-offs here in terms of space versus time, that there's Dialogue: 0,0:13:47.00,0:13:51.00,Default,,0000,0000,0000,,very little space overhead Dialogue: 0,0:13:51.00,0:13:55.00,Default,,0000,0000,0000,,in the sorted vector case, but really blossoms into Dialogue: 0,0:13:55.00,0:13:58.00,Default,,0000,0000,0000,,two times, almost three times, the storage needed. In the VST and hash table Dialogue: 0,0:13:58.00,0:14:00.00,Default,,0000,0000,0000,,case, the code also got a lot more complicated when we moved up to those Dialogue: 0,0:14:00.00,0:14:04.00,Default,,0000,0000,0000,,data structures. So Dialogue: 0,0:14:04.00,0:14:06.00,Default,,0000,0000,0000,,then I'll tell you, actually, that it turns out that the one type of performance that was interesting to us Dialogue: 0,0:14:06.00,0:14:09.00,Default,,0000,0000,0000,,when I was Dialogue: 0,0:14:09.00,0:14:14.00,Default,,0000,0000,0000,,exploring this, but in fact, more important at the time, was the memory usage. Dialogue: 0,0:14:14.00,0:14:17.00,Default,,0000,0000,0000,,That the memory usage was really the big obstacle that we had. Dialogue: 0,0:14:17.00,0:14:19.00,Default,,0000,0000,0000,,The Official Scrabble Players Dictionary II is actually the source for the word list that Dialogue: 0,0:14:19.00,0:14:22.00,Default,,0000,0000,0000,,we're using. Dialogue: 0,0:14:22.00,0:14:23.00,Default,,0000,0000,0000,,It has about 128,000 words, I think, is exactly the Dialogue: 0,0:14:23.00,0:14:26.00,Default,,0000,0000,0000,,number it has. Dialogue: 0,0:14:26.00,0:14:29.00,Default,,0000,0000,0000,,And the average length of those is eight characters. And the file itself is Dialogue: 0,0:14:29.00,0:14:32.00,Default,,0000,0000,0000,,over a megabyte. So on disk, if you were just to look at the listing of all the words, Dialogue: 0,0:14:32.00,0:14:34.00,Default,,0000,0000,0000,,it's a megabyte - a little over a megabyte there. Dialogue: 0,0:14:34.00,0:14:37.00,Default,,0000,0000,0000,,If we were loading it into the sorted vector, Dialogue: 0,0:14:37.00,0:14:40.00,Default,,0000,0000,0000,,we'd get something that was just roughly equal to that. It would take about a megabyte of space, Dialogue: 0,0:14:40.00,0:14:45.00,Default,,0000,0000,0000,,plus a little bit of noise. Dialogue: 0,0:14:45.00,0:14:48.00,Default,,0000,0000,0000,,The ones that double up that take it up to two, two and a half megabytes total. Dialogue: 0,0:14:48.00,0:14:50.00,Default,,0000,0000,0000,,At the time we were working, and I know this is going to seem completely archaic to you, but we had Dialogue: 0,0:14:50.00,0:14:51.00,Default,,0000,0000,0000,,these stone tablets that we worked on. Dialogue: 0,0:14:51.00,0:14:55.00,Default,,0000,0000,0000,, Dialogue: 0,0:14:55.00,0:15:00.00,Default,,0000,0000,0000,,And they typically had main memories of about a megabyte, maybe, like in Dialogue: 0,0:15:00.00,0:15:02.00,Default,,0000,0000,0000,,an exciting case, four megabytes; four whole megabytes of RAM. Dialogue: 0,0:15:02.00,0:15:05.00,Default,,0000,0000,0000,,So it was not possible Dialogue: 0,0:15:05.00,0:15:09.00,Default,,0000,0000,0000,,that we could dedicate one or two megabytes of your main memory Dialogue: 0,0:15:09.00,0:15:12.00,Default,,0000,0000,0000,,to just your lexicon. It just wasn't - there just wasn't enough space to go around. Dialogue: 0,0:15:12.00,0:15:15.00,Default,,0000,0000,0000,,So we're working in this very tight environment in terms of that. So a little bit more Dialogue: 0,0:15:15.00,0:15:16.00,Default,,0000,0000,0000,,like, for example, today's world of embedded devices. If you're adding something for an Dialogue: 0,0:15:16.00,0:15:18.00,Default,,0000,0000,0000,,iPod or a Dialogue: 0,0:15:18.00,0:15:21.00,Default,,0000,0000,0000,, Dialogue: 0,0:15:21.00,0:15:24.00,Default,,0000,0000,0000,,cell phone you have a lot less Dialogue: 0,0:15:24.00,0:15:27.00,Default,,0000,0000,0000,,gargantuan memory. So your average desktop machine, having a Dialogue: 0,0:15:27.00,0:15:28.00,Default,,0000,0000,0000,,gigabyte of RAM, now you're like, whatever, a megabyte here, ten megabytes there, free, all the Dialogue: 0,0:15:28.00,0:15:31.00,Default,,0000,0000,0000,,memory you want. And Dialogue: 0,0:15:31.00,0:15:34.00,Default,,0000,0000,0000,,so this seems kinda like a silly thing to focus on, but this was 1995, Dialogue: 0,0:15:34.00,0:15:35.00,Default,,0000,0000,0000,,and it turns out, yeah, there was certainly more interest in Dialogue: 0,0:15:35.00,0:15:37.00,Default,,0000,0000,0000,,keeping it Dialogue: 0,0:15:37.00,0:15:40.00,Default,,0000,0000,0000,,tight to the memory. Dialogue: 0,0:15:40.00,0:15:44.00,Default,,0000,0000,0000,,The Boggle lexicon that we have actually takes under a third of a megabyte. Dialogue: 0,0:15:44.00,0:15:45.00,Default,,0000,0000,0000,,The memory that's used while it's working and searching and checking for words and Dialogue: 0,0:15:45.00,0:15:47.00,Default,,0000,0000,0000,,prefixes Dialogue: 0,0:15:47.00,0:15:50.00,Default,,0000,0000,0000,,is actually smaller Dialogue: 0,0:15:50.00,0:15:53.00,Default,,0000,0000,0000,,than the on-disk form of the data we started with. Dialogue: 0,0:15:53.00,0:15:57.00,Default,,0000,0000,0000,,So it actually does a kind of a bit of a compression Dialogue: 0,0:15:57.00,0:16:00.00,Default,,0000,0000,0000,,as part of its strategy for storage, which is pretty interesting to think about. Dialogue: 0,0:16:00.00,0:16:02.00,Default,,0000,0000,0000,,So I'll tell you, actually, the truth about how we first did Boggle, because it's kinda a little Dialogue: 0,0:16:02.00,0:16:03.00,Default,,0000,0000,0000,,interesting Dialogue: 0,0:16:03.00,0:16:07.00,Default,,0000,0000,0000,,historical perspective on it. Dialogue: 0,0:16:07.00,0:16:09.00,Default,,0000,0000,0000,,But Boggle was given by a colleague of mine for the first time, Todd Dialogue: 0,0:16:09.00,0:16:10.00,Default,,0000,0000,0000,,Feldman. He was the guy who came up with the idea, which is a great idea. And he said, ''Oh, this is a great Dialogue: 0,0:16:10.00,0:16:13.00,Default,,0000,0000,0000,,assignment.'' Dialogue: 0,0:16:13.00,0:16:16.00,Default,,0000,0000,0000,,And when he wrote it, he said, ''Oh, I need a lexicon.'' So as part of kinda putting the assignment together, he was like, ''Oh, we Dialogue: 0,0:16:16.00,0:16:18.00,Default,,0000,0000,0000,,have to get some lexicons.'' So he looked around and he found a word list that had about Dialogue: 0,0:16:18.00,0:16:20.00,Default,,0000,0000,0000,,20,000 words. Dialogue: 0,0:16:20.00,0:16:23.00,Default,,0000,0000,0000,, Dialogue: 0,0:16:23.00,0:16:27.00,Default,,0000,0000,0000,,And I think we first found this word list and we said, ''Oh, it's just impossible.'' It's going to take a megabyte Dialogue: 0,0:16:27.00,0:16:30.00,Default,,0000,0000,0000,,or more and we just don't have that kind of space. So we need much smaller data files. So we had a data file that had Dialogue: 0,0:16:30.00,0:16:33.00,Default,,0000,0000,0000,,about 20,000 words, which then takes about Dialogue: 0,0:16:33.00,0:16:34.00,Default,,0000,0000,0000,,200k. And he actually wrote it using a hash table. Dialogue: 0,0:16:34.00,0:16:37.00,Default,,0000,0000,0000,,So given he wrote it as a hash table, Dialogue: 0,0:16:37.00,0:16:38.00,Default,,0000,0000,0000,,he just didn't support contains (prefix). He Dialogue: 0,0:16:38.00,0:16:39.00,Default,,0000,0000,0000,,just didn't put it in, Dialogue: 0,0:16:39.00,0:16:42.00,Default,,0000,0000,0000,, Dialogue: 0,0:16:42.00,0:16:45.00,Default,,0000,0000,0000,,and - because you can't do it efficiently. So it actually took a smaller amount Dialogue: 0,0:16:45.00,0:16:49.00,Default,,0000,0000,0000,,of space and had a smaller word list and it wouldn't do contains (prefix). Dialogue: 0,0:16:49.00,0:16:51.00,Default,,0000,0000,0000,,So then people wrote Boggle. And as you learned from that midterm question, you can write Boggle without Dialogue: 0,0:16:51.00,0:16:55.00,Default,,0000,0000,0000,,contains (prefix). Dialogue: 0,0:16:55.00,0:16:58.00,Default,,0000,0000,0000,,It spends a lot more time looking down these data ends, but it eventually bottoms out Dialogue: 0,0:16:58.00,0:17:01.00,Default,,0000,0000,0000,,when it runs off the board and uses all the Dialogue: 0,0:17:01.00,0:17:04.00,Default,,0000,0000,0000,,cubes. And Dialogue: 0,0:17:04.00,0:17:05.00,Default,,0000,0000,0000,,it actually, in some ways, almost made the program a little more satisfying to Dialogue: 0,0:17:05.00,0:17:08.00,Default,,0000,0000,0000,,write in one odd way, Dialogue: 0,0:17:08.00,0:17:09.00,Default,,0000,0000,0000,,which is, when you're - so it's the computer's turn, right, that uses the Dialogue: 0,0:17:09.00,0:17:12.00,Default,,0000,0000,0000,,prefix pruning. Dialogue: 0,0:17:12.00,0:17:14.00,Default,,0000,0000,0000,,So you would type in your words and it would be perfectly fine finding it, but it Dialogue: 0,0:17:14.00,0:17:16.00,Default,,0000,0000,0000,,doesn't use the prefix part when it's doing the human's turn. You're typing in words; it's finding them. You're typing Dialogue: 0,0:17:16.00,0:17:19.00,Default,,0000,0000,0000,,in the words; it's find them. Dialogue: 0,0:17:19.00,0:17:21.00,Default,,0000,0000,0000,,Then you would go to do the computer's turn and you'd say, ''Okay, go.'' Dialogue: 0,0:17:21.00,0:17:25.00,Default,,0000,0000,0000,,And now it would take a really long time. Dialogue: 0,0:17:25.00,0:17:27.00,Default,,0000,0000,0000,,It'd be hunting around the board. It would be working so hard your fan would be coming on. Dialogue: 0,0:17:27.00,0:17:30.00,Default,,0000,0000,0000,,You'd feel like oh, my computer's doing something. Dialogue: 0,0:17:30.00,0:17:31.00,Default,,0000,0000,0000,,And then it would show up with twice as many words as you found or ten times as Dialogue: 0,0:17:31.00,0:17:33.00,Default,,0000,0000,0000,,many words as you found, whatever. Dialogue: 0,0:17:33.00,0:17:36.00,Default,,0000,0000,0000,,And in fact, then you would say, Dialogue: 0,0:17:36.00,0:17:40.00,Default,,0000,0000,0000,,''Well, at least it had to work hard to beat me.'' Dialogue: 0,0:17:40.00,0:17:41.00,Default,,0000,0000,0000,,And then you felt like I wrote this program that causes my processor to get busy, right? I feel good. And Dialogue: 0,0:17:41.00,0:17:44.00,Default,,0000,0000,0000,, Dialogue: 0,0:17:44.00,0:17:47.00,Default,,0000,0000,0000,,in fact, actually, the word list was pretty small. So in fact, it didn't - it wasn't even actually as likely to Dialogue: 0,0:17:47.00,0:17:49.00,Default,,0000,0000,0000,,beat you because it didn't know as many words. 125,000 words is Dialogue: 0,0:17:49.00,0:17:52.00,Default,,0000,0000,0000,,an enormous number of words. Dialogue: 0,0:17:52.00,0:17:55.00,Default,,0000,0000,0000,,The average - I've heard different factors on this, but the Dialogue: 0,0:17:55.00,0:17:59.00,Default,,0000,0000,0000,,average person's vocabulary, English vocabulary, is about 20,000 to Dialogue: 0,0:17:59.00,0:18:00.00,Default,,0000,0000,0000,,30,000 words. It tends to be about 10,000 higher if you've gone to college, so Dialogue: 0,0:18:00.00,0:18:03.00,Default,,0000,0000,0000,,maybe 30,000 to 40,000. Dialogue: 0,0:18:03.00,0:18:06.00,Default,,0000,0000,0000,,The fact that it knows 80,000 more words than the average college graduate Dialogue: 0,0:18:06.00,0:18:09.00,Default,,0000,0000,0000,,means it knows a lot of obscure words, too. So in fact, not only does it Dialogue: 0,0:18:09.00,0:18:12.00,Default,,0000,0000,0000,,very quickly find all those words and then produce these ridiculous words that you've Dialogue: 0,0:18:12.00,0:18:14.00,Default,,0000,0000,0000,,never heard of, it just - it's almost like it's taunting you by doing it Dialogue: 0,0:18:14.00,0:18:17.00,Default,,0000,0000,0000,,in a millisecond. Dialogue: 0,0:18:17.00,0:18:21.00,Default,,0000,0000,0000,,And back then it took a while, so it was actually like Dialogue: 0,0:18:21.00,0:18:23.00,Default,,0000,0000,0000,,okay, it beat me, but it had to work hard. Dialogue: 0,0:18:23.00,0:18:27.00,Default,,0000,0000,0000,,But then - so then I took over Boggle. Dialogue: 0,0:18:27.00,0:18:31.00,Default,,0000,0000,0000,,I taught x maybe a quarter or two later and I said, ''I'm going to write a Dialogue: 0,0:18:31.00,0:18:35.00,Default,,0000,0000,0000,,lexicon. I'm going to go up and make a little project for myself, because I don't have enough work to do.'' Dialogue: 0,0:18:35.00,0:18:38.00,Default,,0000,0000,0000,,And my goal was to make one that would do prefix, but that also would be Dialogue: 0,0:18:38.00,0:18:40.00,Default,,0000,0000,0000,,tight enough on memory that we could actually load a much bigger dictionary file, because I thought I'd Dialogue: 0,0:18:40.00,0:18:42.00,Default,,0000,0000,0000,,actually make the game even more Dialogue: 0,0:18:42.00,0:18:48.00,Default,,0000,0000,0000,,fun to play. Dialogue: 0,0:18:48.00,0:18:52.00,Default,,0000,0000,0000,,All right, so here's where I started. Dialogue: 0,0:18:52.00,0:18:55.00,Default,,0000,0000,0000,,This is a little excerpt from the middle of the Dialogue: 0,0:18:55.00,0:18:58.00,Default,,0000,0000,0000,,dictionary. ''Stra,'' yeah, that's totally the thing you notice, right, when you look at this thing. You see Dialogue: 0,0:18:58.00,0:19:02.00,Default,,0000,0000,0000,,''straddle,'' ''straddler,'' ''straddlers,'' ''straddles,'' ''straddling,'' and then these words that you can't Dialogue: 0,0:19:02.00,0:19:04.00,Default,,0000,0000,0000,,believe you're allowed to do this, but apparently you can put ''er'' and ''ier'' and Dialogue: 0,0:19:04.00,0:19:05.00,Default,,0000,0000,0000,,''ies' and ''ing'' on Dialogue: 0,0:19:05.00,0:19:07.00,Default,,0000,0000,0000,,almost anything out there and it Dialogue: 0,0:19:07.00,0:19:08.00,Default,,0000,0000,0000,,makes a valid word. Dialogue: 0,0:19:08.00,0:19:11.00,Default,,0000,0000,0000,,So Dialogue: 0,0:19:11.00,0:19:14.00,Default,,0000,0000,0000,,although you never really thought about the fact that Dialogue: 0,0:19:14.00,0:19:16.00,Default,,0000,0000,0000,,''straightforward,'' ''straightforwardly,'' ''straightforwardness,'' are all there, there is a huge amount Dialogue: 0,0:19:16.00,0:19:18.00,Default,,0000,0000,0000,,of repetition. And this is Dialogue: 0,0:19:18.00,0:19:20.00,Default,,0000,0000,0000,,where I said we're going to use some information about the domain. Dialogue: 0,0:19:20.00,0:19:23.00,Default,,0000,0000,0000,,These aren't just arbitrary strings. Dialogue: 0,0:19:23.00,0:19:24.00,Default,,0000,0000,0000,,They're not just random sequences of letters that are Dialogue: 0,0:19:24.00,0:19:28.00,Default,,0000,0000,0000,,picked out of the air, Dialogue: 0,0:19:28.00,0:19:30.00,Default,,0000,0000,0000,,that they develop over time. They're word roots. They're Dialogue: 0,0:19:30.00,0:19:34.00,Default,,0000,0000,0000,,suffixes. There's these prefixes that mean things Dialogue: 0,0:19:34.00,0:19:35.00,Default,,0000,0000,0000,,that actually show you that looking at this little excerpt out of the middle, Dialogue: 0,0:19:35.00,0:19:38.00,Default,,0000,0000,0000,,there's an awful lot of words Dialogue: 0,0:19:38.00,0:19:42.00,Default,,0000,0000,0000,,that repeat portions of other words. Dialogue: 0,0:19:42.00,0:19:44.00,Default,,0000,0000,0000,,And that may be part of what we can capitalize on. That Dialogue: 0,0:19:44.00,0:19:47.00,Default,,0000,0000,0000,,instead of really recording ''straightaway'' Dialogue: 0,0:19:47.00,0:19:51.00,Default,,0000,0000,0000,,and ''straightaways,'' repeating those first 11 characters and adding an ''s,'' Dialogue: 0,0:19:51.00,0:19:55.00,Default,,0000,0000,0000,,there's some way that I can unify the data that I have here. The same way when we try to Dialogue: 0,0:19:55.00,0:19:56.00,Default,,0000,0000,0000,,find codes, you have the same passage of code repeated two or three times. Dialogue: 0,0:19:56.00,0:19:59.00,Default,,0000,0000,0000,,We're always saying Dialogue: 0,0:19:59.00,0:20:00.00,Default,,0000,0000,0000,,unify it, move it into a helper and call it in three places. It's like can we do the same thing Dialogue: 0,0:20:00.00,0:20:02.00,Default,,0000,0000,0000,,with our data? Dialogue: 0,0:20:02.00,0:20:07.00,Default,,0000,0000,0000,,Can we share the parts of the words that are in common Dialogue: 0,0:20:07.00,0:20:09.00,Default,,0000,0000,0000,,and only distinguish where they are different? Dialogue: 0,0:20:09.00,0:20:11.00,Default,,0000,0000,0000,,Take a look at this. Dialogue: 0,0:20:11.00,0:20:14.00,Default,,0000,0000,0000,,This is a little portion Dialogue: 0,0:20:14.00,0:20:18.00,Default,,0000,0000,0000,,of something called a trie. Dialogue: 0,0:20:18.00,0:20:20.00,Default,,0000,0000,0000,,A trie is spelled t-r-i-e and it's actually from ''retrieval,'' but Dialogue: 0,0:20:20.00,0:20:23.00,Default,,0000,0000,0000,,sadly, it's still pronounced tri, just to confuse you. Dialogue: 0,0:20:23.00,0:20:28.00,Default,,0000,0000,0000,,And it is a variant of a tree. Dialogue: 0,0:20:28.00,0:20:30.00,Default,,0000,0000,0000,,In this case it's a tree that has 26 children coming out of any particular node, Dialogue: 0,0:20:30.00,0:20:32.00,Default,,0000,0000,0000,,one for each letter of the alphabet. Dialogue: 0,0:20:32.00,0:20:35.00,Default,,0000,0000,0000,,And so starting from the root node at the top, Dialogue: 0,0:20:35.00,0:20:39.00,Default,,0000,0000,0000,,all the words that begin with ''a'' are actually Dialogue: 0,0:20:39.00,0:20:42.00,Default,,0000,0000,0000,,down the A branch, and then there's a B branch. There'd be a C branch and a D Dialogue: 0,0:20:42.00,0:20:45.00,Default,,0000,0000,0000,,branch. And so the idea is that all the words, like if you think of the levels of the Dialogue: 0,0:20:45.00,0:20:46.00,Default,,0000,0000,0000,,tree are actually like - these are all the zero if characters of the word. Dialogue: 0,0:20:46.00,0:20:49.00,Default,,0000,0000,0000,,And then that second level is Dialogue: 0,0:20:49.00,0:20:52.00,Default,,0000,0000,0000,,all the characters that follow that zero if character. So here's the ''ab'' Dialogue: 0,0:20:52.00,0:20:56.00,Default,,0000,0000,0000,,words and the ''ac'' words and the ''ax'' words. Dialogue: 0,0:20:56.00,0:20:57.00,Default,,0000,0000,0000,,Over here is the ''st'' words, but there's not a ''sx,'' so there's some places that actually aren't Dialogue: 0,0:20:57.00,0:21:00.00,Default,,0000,0000,0000,,filled out in this tree. Dialogue: 0,0:21:00.00,0:21:02.00,Default,,0000,0000,0000,,And as you keep going down further, so that the depth Dialogue: 0,0:21:02.00,0:21:04.00,Default,,0000,0000,0000,,you would trace to a tree Dialogue: 0,0:21:04.00,0:21:08.00,Default,,0000,0000,0000,,would actually be Dialogue: 0,0:21:08.00,0:21:10.00,Default,,0000,0000,0000,,tracing through the characters in sequence that make words. Dialogue: 0,0:21:10.00,0:21:11.00,Default,,0000,0000,0000,,And so in this form, tracing out a Dialogue: 0,0:21:11.00,0:21:12.00,Default,,0000,0000,0000,,path through this tree Dialogue: 0,0:21:12.00,0:21:14.00,Default,,0000,0000,0000,,from the root Dialogue: 0,0:21:14.00,0:21:17.00,Default,,0000,0000,0000,,down to a particular node Dialogue: 0,0:21:17.00,0:21:19.00,Default,,0000,0000,0000,,tells you about prefixes. So there are words that begin with ''a'' and words that begin with ''ac'' Dialogue: 0,0:21:19.00,0:21:23.00,Default,,0000,0000,0000,,and ''ace,'' Dialogue: 0,0:21:23.00,0:21:23.00,Default,,0000,0000,0000,,and then there's actually a little notation there that's done by a Dialogue: 0,0:21:23.00,0:21:24.00,Default,,0000,0000,0000,,visual Dialogue: 0,0:21:24.00,0:21:26.00,Default,,0000,0000,0000,,of the Dialogue: 0,0:21:26.00,0:21:29.00,Default,,0000,0000,0000,,bold and circle around it Dialogue: 0,0:21:29.00,0:21:30.00,Default,,0000,0000,0000,,that says, ''Oh, and the path that led here from the root is a word.'' So Dialogue: 0,0:21:30.00,0:21:33.00,Default,,0000,0000,0000,,''a'' is a word Dialogue: 0,0:21:33.00,0:21:35.00,Default,,0000,0000,0000,,and ''ace'' is a word and ''aces'' is a word. Dialogue: 0,0:21:35.00,0:21:39.00,Default,,0000,0000,0000,,''Act'' is a word and so is ''acted'' and ''actor.'' Dialogue: 0,0:21:39.00,0:21:44.00,Default,,0000,0000,0000,,And in this case, the ''act'' part is totally shared, Dialogue: 0,0:21:44.00,0:21:46.00,Default,,0000,0000,0000,,so everything that comes off of ''act,'' ''acting,'' ''acted,'' ''actor,'' ''actors,'' Dialogue: 0,0:21:46.00,0:21:50.00,Default,,0000,0000,0000,,can all share that initial part. Dialogue: 0,0:21:50.00,0:21:52.00,Default,,0000,0000,0000,,And for a word like ''act,'' it doesn't seem that profound. But you start thinking about words like Dialogue: 0,0:21:52.00,0:21:55.00,Default,,0000,0000,0000,,''strawberry'' or ''straightforward'' or ''stratosphere'' Dialogue: 0,0:21:55.00,0:21:59.00,Default,,0000,0000,0000,,and realize that ''stratospheric'' and ''strawberries'' and ''straightforwardly'' Dialogue: 0,0:21:59.00,0:22:01.00,Default,,0000,0000,0000,,can all leverage the other ten or 12 characters that were already Dialogue: 0,0:22:01.00,0:22:04.00,Default,,0000,0000,0000,,invested in the root word, Dialogue: 0,0:22:04.00,0:22:06.00,Default,,0000,0000,0000,,and that ''straightjacket'' and ''straightforward'' can even share ''straight.'' Dialogue: 0,0:22:06.00,0:22:10.00,Default,,0000,0000,0000,,But there's actually a lot of Dialogue: 0,0:22:10.00,0:22:15.00,Default,,0000,0000,0000,,prefixes that can be unified across the space of the dictionary. Dialogue: 0,0:22:15.00,0:22:17.00,Default,,0000,0000,0000,,So knowing that words have these patterns helps us to do this. Dialogue: 0,0:22:17.00,0:22:19.00,Default,,0000,0000,0000,,So let's build that. Dialogue: 0,0:22:19.00,0:22:22.00,Default,,0000,0000,0000,,So the first form of this - Dialogue: 0,0:22:22.00,0:22:25.00,Default,,0000,0000,0000,,and this, again, is a thought experiment. I did not actually write it this way because Dialogue: 0,0:22:25.00,0:22:27.00,Default,,0000,0000,0000,,it would just be absolutely astronomically expensive. But it was a good way to kinda think Dialogue: 0,0:22:27.00,0:22:30.00,Default,,0000,0000,0000,,about where I needed to go. Dialogue: 0,0:22:30.00,0:22:34.00,Default,,0000,0000,0000,,I designed a new kinda trie node Dialogue: 0,0:22:34.00,0:22:35.00,Default,,0000,0000,0000,,that had the letter and a little Boolean flag of is this the word, so whether it had Dialogue: 0,0:22:35.00,0:22:39.00,Default,,0000,0000,0000,,the dark circle around it. Dialogue: 0,0:22:39.00,0:22:41.00,Default,,0000,0000,0000,,And then it had a 26-member children array, so pointers to other nodes like Dialogue: 0,0:22:41.00,0:22:44.00,Default,,0000,0000,0000,,this one. So totally recursive, Dialogue: 0,0:22:44.00,0:22:47.00,Default,,0000,0000,0000,,staring from that root and then the Dialogue: 0,0:22:47.00,0:22:50.00,Default,,0000,0000,0000,,idea is to use those 26 children as ''a'' being in the first slot and ''z'' in the last Dialogue: 0,0:22:50.00,0:22:52.00,Default,,0000,0000,0000,,slot to make it really easy to kinda just trace letter by letter Dialogue: 0,0:22:52.00,0:22:55.00,Default,,0000,0000,0000,,down to the levels of the tree. Dialogue: 0,0:22:55.00,0:22:59.00,Default,,0000,0000,0000,,So I'd have this one root node that pointed to the initial one, and then all the words Dialogue: 0,0:22:59.00,0:23:03.00,Default,,0000,0000,0000,,that begin with ''a'' are off that first lot; all from ''z'' off the last slot. Dialogue: 0,0:23:03.00,0:23:08.00,Default,,0000,0000,0000,,So when I want to know if a word is in the dictionary, Dialogue: 0,0:23:08.00,0:23:12.00,Default,,0000,0000,0000,,what do I have to do in this data structure? How do I find it? I Dialogue: 0,0:23:12.00,0:23:18.00,Default,,0000,0000,0000,,want to find the word ''far.'' Where do I look? Dialogue: 0,0:23:18.00,0:23:27.00,Default,,0000,0000,0000,,Left? Right? Down? Which way? It would be great Dialogue: 0,0:23:27.00,0:23:31.00,Default,,0000,0000,0000,,if you would tell me. Then I would understand what my data structure is. [Inaudible] Dialogue: 0,0:23:31.00,0:23:35.00,Default,,0000,0000,0000,,Exactly. So at the very top I say okay, ''f'' is my first letter. Match my ''f,'' get me to the Dialogue: 0,0:23:35.00,0:23:35.00,Default,,0000,0000,0000,,F node, and now it says recursively find ''ar'' from here. So it's very Dialogue: 0,0:23:35.00,0:23:37.00,Default,,0000,0000,0000,,recursive. It'll be like Dialogue: 0,0:23:37.00,0:23:39.00,Default,,0000,0000,0000,,find the first letter Dialogue: 0,0:23:39.00,0:23:43.00,Default,,0000,0000,0000,,and then recursively find the substring from here, working down. So find the ''a'' out of the next node, find Dialogue: 0,0:23:43.00,0:23:46.00,Default,,0000,0000,0000,,an ''r'' out of the next node, and then when you get to that node, Dialogue: 0,0:23:46.00,0:23:47.00,Default,,0000,0000,0000,,you'll check this little ''is word Boolean,'' Dialogue: 0,0:23:47.00,0:23:49.00,Default,,0000,0000,0000,,and say okay, Dialogue: 0,0:23:49.00,0:23:53.00,Default,,0000,0000,0000,,was that path I traced Dialogue: 0,0:23:53.00,0:23:58.00,Default,,0000,0000,0000,,marked in the dictionary as leading to something that's known to be a word? Dialogue: 0,0:23:58.00,0:24:01.00,Default,,0000,0000,0000,,The prefix looks exactly the same, except for that last check for the as word. So Dialogue: 0,0:24:01.00,0:24:05.00,Default,,0000,0000,0000,,given that nodes exist in this, because further down there there's some words, I just Dialogue: 0,0:24:05.00,0:24:08.00,Default,,0000,0000,0000,,trace out a sequence of letters ''str,'' and as long as there's something there, Dialogue: 0,0:24:08.00,0:24:11.00,Default,,0000,0000,0000,,then I know that away from there are subsequent children beneath it Dialogue: 0,0:24:11.00,0:24:15.00,Default,,0000,0000,0000,,that lead to words. Dialogue: 0,0:24:15.00,0:24:17.00,Default,,0000,0000,0000,,And adding a word does the same operation. Dialogue: 0,0:24:17.00,0:24:19.00,Default,,0000,0000,0000,,If I need to add the word ''strawberry,'' Dialogue: 0,0:24:19.00,0:24:20.00,Default,,0000,0000,0000,,then I start looking for ''s,'' ''t,'' ''r,'' Dialogue: 0,0:24:20.00,0:24:24.00,Default,,0000,0000,0000,,and at some point, Dialogue: 0,0:24:24.00,0:24:26.00,Default,,0000,0000,0000,,I find some places where the nodes don't exist. Then I start creating them from Dialogue: 0,0:24:26.00,0:24:29.00,Default,,0000,0000,0000,,there on down. Dialogue: 0,0:24:29.00,0:24:33.00,Default,,0000,0000,0000,,So it is tracing what exists and then adding new nodes off the bottom, Dialogue: 0,0:24:33.00,0:24:38.00,Default,,0000,0000,0000,,and then marking the final one as yes, this is a word. Dialogue: 0,0:24:38.00,0:24:43.00,Default,,0000,0000,0000,,What's interesting about this is that all three of these operations Dialogue: 0,0:24:43.00,0:24:47.00,Default,,0000,0000,0000,,don't actually make any reference to how many words are in the dictionary. Dialogue: 0,0:24:47.00,0:24:49.00,Default,,0000,0000,0000,,They are all dependent on the length of the word you're searching for or adding into the dictionary. Dialogue: 0,0:24:49.00,0:24:51.00,Default,,0000,0000,0000,,If there's ten words in there, if there's 1,000 words in there, if Dialogue: 0,0:24:51.00,0:24:53.00,Default,,0000,0000,0000,,there's a million words in there. Dialogue: 0,0:24:53.00,0:24:56.00,Default,,0000,0000,0000,,But in no way does the big O depend on Dialogue: 0,0:24:56.00,0:24:59.00,Default,,0000,0000,0000,,how big or how loaded the data structure is. Dialogue: 0,0:24:59.00,0:25:00.00,Default,,0000,0000,0000,,That's kinda wacky. Dialogue: 0,0:25:00.00,0:25:06.00,Default,,0000,0000,0000,,The longest word, Dialogue: 0,0:25:06.00,0:25:09.00,Default,,0000,0000,0000,,according to the Oxford English Dictionary? Dialogue: 0,0:25:09.00,0:25:11.00,Default,,0000,0000,0000,,There you go. You're not going to make that on the Boggle board anytime soon because it's only got 16 cubes. Even Big Boggle Dialogue: 0,0:25:11.00,0:25:12.00,Default,,0000,0000,0000,,can't really make that puppy. But Dialogue: 0,0:25:12.00,0:25:14.00,Default,,0000,0000,0000,,just so you know, Dialogue: 0,0:25:14.00,0:25:17.00,Default,,0000,0000,0000,,45 is not so many, right? Dialogue: 0,0:25:17.00,0:25:21.00,Default,,0000,0000,0000,,And, Dialogue: 0,0:25:21.00,0:25:24.00,Default,,0000,0000,0000,,in fact, actually, here's a little thought for you. Not only is it not dependent on num words, it actually has a Dialogue: 0,0:25:24.00,0:25:26.00,Default,,0000,0000,0000,,little bit of an inverse relationship with num words, Dialogue: 0,0:25:26.00,0:25:29.00,Default,,0000,0000,0000,,especially the add operation. Dialogue: 0,0:25:29.00,0:25:31.00,Default,,0000,0000,0000,,That as the data structure becomes more loaded, Dialogue: 0,0:25:31.00,0:25:34.00,Default,,0000,0000,0000,,there's typically less work to do in add. Dialogue: 0,0:25:34.00,0:25:37.00,Default,,0000,0000,0000,,So if ''strawberry'' is in there, and you go to add ''strawberries,'' Dialogue: 0,0:25:37.00,0:25:38.00,Default,,0000,0000,0000,,then actually you already have ''strawberries'' to depend on. You just need to build Dialogue: 0,0:25:38.00,0:25:41.00,Default,,0000,0000,0000,,off the thing. So the fact that Dialogue: 0,0:25:41.00,0:25:45.00,Default,,0000,0000,0000,,the more words that are in there, the more likely that some of the nodes you already need Dialogue: 0,0:25:45.00,0:25:46.00,Default,,0000,0000,0000,,are already in place, and then you can just kinda blow past them and just add your new Dialogue: 0,0:25:46.00,0:25:49.00,Default,,0000,0000,0000,,nodes. Dialogue: 0,0:25:49.00,0:25:53.00,Default,,0000,0000,0000,,Now, that's odd. You think of most data structures as really as Dialogue: 0,0:25:53.00,0:26:00.00,Default,,0000,0000,0000,,clearly as they get heavier, it takes more work to install new Dialogue: 0,0:26:00.00,0:26:03.00,Default,,0000,0000,0000,,things in it rather than less. [Inaudible]. Dialogue: 0,0:26:03.00,0:26:04.00,Default,,0000,0000,0000,,Well, the thing is, they're organized by letter, and that 26-number array. So in fact, Dialogue: 0,0:26:04.00,0:26:08.00,Default,,0000,0000,0000,,I don't even have to look. Dialogue: 0,0:26:08.00,0:26:11.00,Default,,0000,0000,0000,,So I just go straight to the slot. If I'm looking for strawberry, I'll look for ''s'' and I go Dialogue: 0,0:26:11.00,0:26:14.00,Default,,0000,0000,0000,,straight to the ''t'' slot and then the ''r'' slot. So in fact, if the mode was already in place, I'm Dialogue: 0,0:26:14.00,0:26:16.00,Default,,0000,0000,0000,,not searching for it. It really just is exactly in the ''r'' slot. That's why Dialogue: 0,0:26:16.00,0:26:21.00,Default,,0000,0000,0000,,there's 26 of them. In Dialogue: 0,0:26:21.00,0:26:24.00,Default,,0000,0000,0000,,each array? Each array is 26 and they're A through Z and if there's an empty slot we're using a null. Dialogue: 0,0:26:24.00,0:26:27.00,Default,,0000,0000,0000,,So in this case, it's optimized to make it super fast. I don't even have to look Dialogue: 0,0:26:27.00,0:26:29.00,Default,,0000,0000,0000,,through the letters to find the one that matches. There's no matching. It just goes straight Dialogue: 0,0:26:29.00,0:26:32.00,Default,,0000,0000,0000,,to the ''r'' slot, ''z'' slot, the ''s'' slot. Dialogue: 0,0:26:32.00,0:26:35.00,Default,,0000,0000,0000,,We're going to see this is going to be a consequence we can't Dialogue: 0,0:26:35.00,0:26:37.00,Default,,0000,0000,0000,,tolerate, but it is, at least, the starting point for thinking about how to Dialogue: 0,0:26:37.00,0:26:38.00,Default,,0000,0000,0000,,build a data structure. Dialogue: 0,0:26:38.00,0:26:40.00,Default,,0000,0000,0000,,So the space usage Dialogue: 0,0:26:40.00,0:26:44.00,Default,,0000,0000,0000,,is where this thing Dialogue: 0,0:26:44.00,0:26:47.00,Default,,0000,0000,0000,,really, really - so this is an extreme example of a space-time trade-off, right? Dialogue: 0,0:26:47.00,0:26:50.00,Default,,0000,0000,0000,,We've got this thing that optimizes beautifully for OAV length of Dialogue: 0,0:26:50.00,0:26:53.00,Default,,0000,0000,0000,,the word. So that means typically eight operations on average are going to be Dialogue: 0,0:26:53.00,0:26:56.00,Default,,0000,0000,0000,,needed to add or to search for something. Dialogue: 0,0:26:56.00,0:26:57.00,Default,,0000,0000,0000,,The space we're using here is 106 bytes per nodes, so that's this Dialogue: 0,0:26:57.00,0:26:59.00,Default,,0000,0000,0000,,26-member, Dialogue: 0,0:26:59.00,0:27:02.00,Default,,0000,0000,0000,,four-byte array, Dialogue: 0,0:27:02.00,0:27:04.00,Default,,0000,0000,0000,,plus two bytes for the character and the Boolean that are up there. One Dialogue: 0,0:27:04.00,0:27:05.00,Default,,0000,0000,0000,,hundred and six bytes Dialogue: 0,0:27:05.00,0:27:09.00,Default,,0000,0000,0000,,is a lot of memory, Dialogue: 0,0:27:09.00,0:27:13.00,Default,,0000,0000,0000,,and the tree has about a quarter of a million nodes. Dialogue: 0,0:27:13.00,0:27:17.00,Default,,0000,0000,0000,,So given the 125,000 words that I said are in the input, Dialogue: 0,0:27:17.00,0:27:19.00,Default,,0000,0000,0000,,if you build it into a tree, they take about 250,000 nodes. That's an interesting sort of Dialogue: 0,0:27:19.00,0:27:22.00,Default,,0000,0000,0000,,number to think about, though. You have 125,000 Dialogue: 0,0:27:22.00,0:27:25.00,Default,,0000,0000,0000,,words. They take 250,000 nodes. That tells you, actually, that Dialogue: 0,0:27:25.00,0:27:27.00,Default,,0000,0000,0000,,kinda on average, each Dialogue: 0,0:27:27.00,0:27:29.00,Default,,0000,0000,0000,,word has about two unique nodes, Dialogue: 0,0:27:29.00,0:27:33.00,Default,,0000,0000,0000,,that even words like ''straightforward'' and ''straightforwardly,'' Dialogue: 0,0:27:33.00,0:27:34.00,Default,,0000,0000,0000,,on average, are sharing enough of their prefix - common prefix Dialogue: 0,0:27:34.00,0:27:38.00,Default,,0000,0000,0000,,parts - that Dialogue: 0,0:27:38.00,0:27:39.00,Default,,0000,0000,0000,,there's very little unique data added by any particular word in there that averages Dialogue: 0,0:27:39.00,0:27:40.00,Default,,0000,0000,0000,,across the whole thing. That's Dialogue: 0,0:27:40.00,0:27:42.00,Default,,0000,0000,0000,,pretty neat Dialogue: 0,0:27:42.00,0:27:45.00,Default,,0000,0000,0000,,to know. However, this is just not going to fly. Dialogue: 0,0:27:45.00,0:27:50.00,Default,,0000,0000,0000,,So here I was telling you that my main memory had Dialogue: 0,0:27:50.00,0:27:53.00,Default,,0000,0000,0000,,four megabytes, maybe, on a good day, and now I've actually Dialogue: 0,0:27:53.00,0:27:55.00,Default,,0000,0000,0000,,built a data structure that's going to require six times that Dialogue: 0,0:27:55.00,0:27:58.00,Default,,0000,0000,0000,,in terms of storage. Dialogue: 0,0:27:58.00,0:28:00.00,Default,,0000,0000,0000,,Okay, we gotta squeeze that down. Dialogue: 0,0:28:00.00,0:28:04.00,Default,,0000,0000,0000,,So the first thing we can do Dialogue: 0,0:28:04.00,0:28:08.00,Default,,0000,0000,0000,,is to realize that those 26 - allocated a full 26-member array, of Dialogue: 0,0:28:08.00,0:28:11.00,Default,,0000,0000,0000,,which I only plan on using some of the slots, is pretty wasteful. Dialogue: 0,0:28:11.00,0:28:13.00,Default,,0000,0000,0000,,So instead of saying I'm committing to a 26 per thing, I'll say well, how about if Dialogue: 0,0:28:13.00,0:28:16.00,Default,,0000,0000,0000,,I have an array that actually Dialogue: 0,0:28:16.00,0:28:20.00,Default,,0000,0000,0000,,is tight to the number of children coming out of this node? Dialogue: 0,0:28:20.00,0:28:20.00,Default,,0000,0000,0000,,So the very first node will have 26. There are words that start with every letter in the Dialogue: 0,0:28:20.00,0:28:22.00,Default,,0000,0000,0000,,alphabet. Dialogue: 0,0:28:22.00,0:28:25.00,Default,,0000,0000,0000,,But from there, it very quickly winnows down. Dialogue: 0,0:28:25.00,0:28:29.00,Default,,0000,0000,0000,,You have the letter ''x.'' You do not need the full 26 children coming Dialogue: 0,0:28:29.00,0:28:33.00,Default,,0000,0000,0000,,off of ''x.'' There's only a few letters that follow ''x'' in the English language that are going Dialogue: 0,0:28:33.00,0:28:37.00,Default,,0000,0000,0000,,to need to be fleshed out. You need the ''xa,'' you need an ''xe,'' Dialogue: 0,0:28:37.00,0:28:40.00,Default,,0000,0000,0000,,but you don't need ''xj,'' you don't need ''xk,'' a whole bunch of things like that. Dialogue: 0,0:28:40.00,0:28:44.00,Default,,0000,0000,0000,,So if I change the structure here to instead of being Dialogue: 0,0:28:44.00,0:28:47.00,Default,,0000,0000,0000,,a 26-member array of node pointers, I make it to be a pointer to a Dialogue: 0,0:28:47.00,0:28:50.00,Default,,0000,0000,0000,,set of nodes pointers, so this is a dynamic array Dialogue: 0,0:28:50.00,0:28:54.00,Default,,0000,0000,0000,,that I'm keeping track of the number of children in a second field here Dialogue: 0,0:28:54.00,0:28:56.00,Default,,0000,0000,0000,,that the first one will be allocated to 26, the second will be allocated to eight. Dialogue: 0,0:28:56.00,0:28:59.00,Default,,0000,0000,0000,,This is going to change a little bit of our running times, because now we won't Dialogue: 0,0:28:59.00,0:29:01.00,Default,,0000,0000,0000,,be able to go exactly to the ''s'' slot. We'll have to go look for it. Dialogue: 0,0:29:01.00,0:29:03.00,Default,,0000,0000,0000,,So on each step down the tree, Dialogue: 0,0:29:03.00,0:29:07.00,Default,,0000,0000,0000,,it'll be Dialogue: 0,0:29:07.00,0:29:12.00,Default,,0000,0000,0000,,looking through that sized array at this slot to find the right match. But it adds a Dialogue: 0,0:29:12.00,0:29:14.00,Default,,0000,0000,0000,,constant factor, basically, on top of OAV length of the word. Dialogue: 0,0:29:14.00,0:29:15.00,Default,,0000,0000,0000,,So the space issues of this Dialogue: 0,0:29:15.00,0:29:19.00,Default,,0000,0000,0000,, Dialogue: 0,0:29:19.00,0:29:23.00,Default,,0000,0000,0000,,is - we're just not storing all those nulls. And there were a lot of nulls. That Dialogue: 0,0:29:23.00,0:29:25.00,Default,,0000,0000,0000,,the node itself is not just ten bytes, and then there's some space in this Dialogue: 0,0:29:25.00,0:29:27.00,Default,,0000,0000,0000,,dynamic array out there, which is the number of children Dialogue: 0,0:29:27.00,0:29:31.00,Default,,0000,0000,0000,,times the four-byte pointer, Dialogue: 0,0:29:31.00,0:29:33.00,Default,,0000,0000,0000,,that on average, a node has six children. Across the Dialogue: 0,0:29:33.00,0:29:36.00,Default,,0000,0000,0000,,250,000 nodes in the Dialogue: 0,0:29:36.00,0:29:39.00,Default,,0000,0000,0000,,data structure that's being billed from this, so really 26 was way Dialogue: 0,0:29:39.00,0:29:40.00,Default,,0000,0000,0000,,overkill. About 20 of them had nulls Dialogue: 0,0:29:40.00,0:29:42.00,Default,,0000,0000,0000,,for most of the nodes. Dialogue: 0,0:29:42.00,0:29:46.00,Default,,0000,0000,0000,,And many of them, for example, at the very bottom of the tree, Dialogue: 0,0:29:46.00,0:29:49.00,Default,,0000,0000,0000,,have completely none. They're all nulls. So you get to a word like Dialogue: 0,0:29:49.00,0:29:52.00,Default,,0000,0000,0000,,''straightforwardly.'' There's nothing after that ''y,'' Dialogue: 0,0:29:52.00,0:29:53.00,Default,,0000,0000,0000,,so it has zero children coming out the back of that one. Dialogue: 0,0:29:53.00,0:29:57.00,Default,,0000,0000,0000,,And so certainly Dialogue: 0,0:29:57.00,0:30:00.00,Default,,0000,0000,0000,,having 26 null children pointing away from that was a Dialogue: 0,0:30:00.00,0:30:01.00,Default,,0000,0000,0000,,waste that we can tighten up. Dialogue: 0,0:30:01.00,0:30:03.00,Default,,0000,0000,0000,,So when we get it down to this we'll have Dialogue: 0,0:30:03.00,0:30:04.00,Default,,0000,0000,0000,, Dialogue: 0,0:30:04.00,0:30:05.00,Default,,0000,0000,0000,,44 bytes Dialogue: 0,0:30:05.00,0:30:06.00,Default,,0000,0000,0000,, Dialogue: 0,0:30:06.00,0:30:08.00,Default,,0000,0000,0000,,- Dialogue: 0,0:30:08.00,0:30:11.00,Default,,0000,0000,0000,,or 34 bytes for the per node. Dialogue: 0,0:30:11.00,0:30:15.00,Default,,0000,0000,0000,,Still a quarter million nodes. Still Dialogue: 0,0:30:15.00,0:30:18.00,Default,,0000,0000,0000,,eight and half megabytes. So we're not winning any prizes just yet. Dialogue: 0,0:30:18.00,0:30:22.00,Default,,0000,0000,0000,,But we are making good progress. That got us down a factor of Dialogue: 0,0:30:22.00,0:30:25.00,Default,,0000,0000,0000,,three or four right there. So I still have the same Dialogue: 0,0:30:25.00,0:30:28.00,Default,,0000,0000,0000,,number of nodes. All I changed was the pointers to the Dialogue: 0,0:30:28.00,0:30:32.00,Default,,0000,0000,0000,,subsequent nodes further down the tree. So Dialogue: 0,0:30:32.00,0:30:35.00,Default,,0000,0000,0000,,I - the kind of letters and how the letters connect to each other - the connectivity is Dialogue: 0,0:30:35.00,0:30:39.00,Default,,0000,0000,0000,,still pretty much the same. It's just that in any particular node, how many children does it have? Dialogue: 0,0:30:39.00,0:30:40.00,Default,,0000,0000,0000,,I was storing a whole bunch of nulls and now I'm not storing those nulls. Dialogue: 0,0:30:40.00,0:30:43.00,Default,,0000,0000,0000,,So the number of nodes is still Dialogue: 0,0:30:43.00,0:30:47.00,Default,,0000,0000,0000,,exactly as it was before. I'm going Dialogue: 0,0:30:47.00,0:30:49.00,Default,,0000,0000,0000,,to do one other little squishing thing on this data structure Dialogue: 0,0:30:49.00,0:30:51.00,Default,,0000,0000,0000,,and then I'm going Dialogue: 0,0:30:51.00,0:30:53.00,Default,,0000,0000,0000,,to flip gears for a second. Dialogue: 0,0:30:53.00,0:30:56.00,Default,,0000,0000,0000,,So still using kinda my Dialogue: 0,0:30:56.00,0:30:59.00,Default,,0000,0000,0000,,CS side of things, how can I squash things down? Dialogue: 0,0:30:59.00,0:31:03.00,Default,,0000,0000,0000,,I'm going to use a technique that we used in the heap. Dialogue: 0,0:31:03.00,0:31:04.00,Default,,0000,0000,0000,,So in the case of the binary heap that we implemented into the Dialogue: 0,0:31:04.00,0:31:07.00,Default,,0000,0000,0000,,PQ, Dialogue: 0,0:31:07.00,0:31:11.00,Default,,0000,0000,0000,,you'll remember that we drew pictures in the handout that showed a heap with oh, I Dialogue: 0,0:31:11.00,0:31:11.00,Default,,0000,0000,0000,,have a parent with two children, which has four grandchildren and Dialogue: 0,0:31:11.00,0:31:15.00,Default,,0000,0000,0000,,so on. Dialogue: 0,0:31:15.00,0:31:16.00,Default,,0000,0000,0000,,But that we actually compressed that down into an array, Dialogue: 0,0:31:16.00,0:31:20.00,Default,,0000,0000,0000,,and in the array Dialogue: 0,0:31:20.00,0:31:24.00,Default,,0000,0000,0000,,there was this pattern of Dialogue: 0,0:31:24.00,0:31:27.00,Default,,0000,0000,0000,,here is the root node. Dialogue: 0,0:31:27.00,0:31:30.00,Default,,0000,0000,0000,,So we'll call this level zero. Dialogue: 0,0:31:30.00,0:31:33.00,Default,,0000,0000,0000,,And then the next two nodes Dialogue: 0,0:31:33.00,0:31:36.00,Default,,0000,0000,0000,,are level one. Dialogue: 0,0:31:36.00,0:31:38.00,Default,,0000,0000,0000,,And the next four nodes Dialogue: 0,0:31:38.00,0:31:42.00,Default,,0000,0000,0000,,are level two and so on. So Dialogue: 0,0:31:42.00,0:31:45.00,Default,,0000,0000,0000,,there's one here, two here, there's four here, there's eight here. And we kinda flattened it down Dialogue: 0,0:31:45.00,0:31:50.00,Default,,0000,0000,0000,,by generation, so looking at what was at the top and what was underneath it. Dialogue: 0,0:31:50.00,0:31:52.00,Default,,0000,0000,0000,,When we did that, we removed all the space for pointers. Dialogue: 0,0:31:52.00,0:31:54.00,Default,,0000,0000,0000,,Although we drew those pictures in the heap as though there were pointers Dialogue: 0,0:31:54.00,0:31:57.00,Default,,0000,0000,0000,,there, in fact, there were not. Dialogue: 0,0:31:57.00,0:32:02.00,Default,,0000,0000,0000,,There were no pointers being stored in the heap in this version of the PQ. Dialogue: 0,0:32:02.00,0:32:02.00,Default,,0000,0000,0000,,And so that gets us back to kind of array like storage requirements, Dialogue: 0,0:32:02.00,0:32:06.00,Default,,0000,0000,0000,,but Dialogue: 0,0:32:06.00,0:32:09.00,Default,,0000,0000,0000,,by flattening it down into this array, and still kinda using the tree Dialogue: 0,0:32:09.00,0:32:10.00,Default,,0000,0000,0000,,conceptually to work our way through it, we're getting the advantages of Dialogue: 0,0:32:10.00,0:32:12.00,Default,,0000,0000,0000,,that Dialogue: 0,0:32:12.00,0:32:14.00,Default,,0000,0000,0000,, Dialogue: 0,0:32:14.00,0:32:15.00,Default,,0000,0000,0000,,traversal that's based on height of tree, Dialogue: 0,0:32:15.00,0:32:20.00,Default,,0000,0000,0000,,not length of vector. Dialogue: 0,0:32:20.00,0:32:22.00,Default,,0000,0000,0000,,So if we do the same thing with the trie, it's a little bit more complicated Dialogue: 0,0:32:22.00,0:32:25.00,Default,,0000,0000,0000,,because there were a couple things about heap that made this a little Dialogue: 0,0:32:25.00,0:32:27.00,Default,,0000,0000,0000,,easier. Heap always had two children, so there were these rules about Dialogue: 0,0:32:27.00,0:32:30.00,Default,,0000,0000,0000,,how the tree was filled in that meant Dialogue: 0,0:32:30.00,0:32:33.00,Default,,0000,0000,0000,,you could always compute parent and child index using this kinda divide by two, Dialogue: 0,0:32:33.00,0:32:36.00,Default,,0000,0000,0000,,multiple by two exchange operation. Dialogue: 0,0:32:36.00,0:32:39.00,Default,,0000,0000,0000,,In this case, what we're going to have to do is just lay it down by generation, Dialogue: 0,0:32:39.00,0:32:41.00,Default,,0000,0000,0000,,but we don't know how big a generation's gonna be, so we're actually gonna have to keep Dialogue: 0,0:32:41.00,0:32:43.00,Default,,0000,0000,0000,,track of it manually. Dialogue: 0,0:32:43.00,0:32:46.00,Default,,0000,0000,0000,,So the root node will have Dialogue: 0,0:32:46.00,0:32:49.00,Default,,0000,0000,0000,,all A through Z. So this will be Dialogue: 0,0:32:49.00,0:32:53.00,Default,,0000,0000,0000,,level one, which is A through Z here, Dialogue: 0,0:32:53.00,0:32:57.00,Default,,0000,0000,0000,,and then level two is - there'll be a bunch of slots here, which is the Dialogue: 0,0:32:57.00,0:33:00.00,Default,,0000,0000,0000,,A followed by some letter, and then there'd be the B's followed by some letter, and Dialogue: 0,0:33:00.00,0:33:03.00,Default,,0000,0000,0000,,then there'd be the C's followed by some letter, and so on. Dialogue: 0,0:33:03.00,0:33:06.00,Default,,0000,0000,0000,,And then further down is level three, which are the three-character words. So these are all Dialogue: 0,0:33:06.00,0:33:09.00,Default,,0000,0000,0000,,the single-character paths, Dialogue: 0,0:33:09.00,0:33:12.00,Default,,0000,0000,0000,,two-character paths, three-character paths, kinda flattened down. Dialogue: 0,0:33:12.00,0:33:15.00,Default,,0000,0000,0000,,And so if I look at what I changed here, my first node Dialogue: 0,0:33:15.00,0:33:18.00,Default,,0000,0000,0000,,says it has no letter that's not the root node, it's not a word, Dialogue: 0,0:33:18.00,0:33:20.00,Default,,0000,0000,0000,,and it says it's children begin at index one. So Dialogue: 0,0:33:20.00,0:33:23.00,Default,,0000,0000,0000,,the next location. So it's not a Dialogue: 0,0:33:23.00,0:33:26.00,Default,,0000,0000,0000,,multiply by two and add one or something like that. It's okay, here's Dialogue: 0,0:33:26.00,0:33:30.00,Default,,0000,0000,0000,,where my children begin. They're at slot one in the array. Dialogue: 0,0:33:30.00,0:33:31.00,Default,,0000,0000,0000,,And then there are 26 of them. So that means that A, B, C, D are kinda in order there. Dialogue: 0,0:33:31.00,0:33:33.00,Default,,0000,0000,0000,,If I look at A, Dialogue: 0,0:33:33.00,0:33:36.00,Default,,0000,0000,0000,,''a'' is a word, Dialogue: 0,0:33:36.00,0:33:39.00,Default,,0000,0000,0000,,and it says its children begin at index 27 Dialogue: 0,0:33:39.00,0:33:44.00,Default,,0000,0000,0000,,and there are 12 of them, so there are 12 characters out of the 26 that Dialogue: 0,0:33:44.00,0:33:47.00,Default,,0000,0000,0000,,can follow an ''a'' in the sequence of a valid word. And Dialogue: 0,0:33:47.00,0:33:48.00,Default,,0000,0000,0000,,then B's children begin at index 39, which is 27 plus 12. So that's Dialogue: 0,0:33:48.00,0:33:50.00,Default,,0000,0000,0000,,where Dialogue: 0,0:33:50.00,0:33:54.00,Default,,0000,0000,0000,,A's children are sitting here at index Dialogue: 0,0:33:54.00,0:33:56.00,Default,,0000,0000,0000,,27 to 38 and then 39 to - I think B has eight or so children - Dialogue: 0,0:33:56.00,0:33:59.00,Default,,0000,0000,0000,,and C and so on. Dialogue: 0,0:33:59.00,0:34:00.00,Default,,0000,0000,0000,,So we flattened it all down by generation Dialogue: 0,0:34:00.00,0:34:01.00,Default,,0000,0000,0000,,into Dialogue: 0,0:34:01.00,0:34:04.00,Default,,0000,0000,0000,,an array. Dialogue: 0,0:34:04.00,0:34:07.00,Default,,0000,0000,0000,,That means all the space for pointers Dialogue: 0,0:34:07.00,0:34:10.00,Default,,0000,0000,0000,,has now been removed. Dialogue: 0,0:34:10.00,0:34:12.00,Default,,0000,0000,0000,,One serious consequence of this, though, Dialogue: 0,0:34:12.00,0:34:14.00,Default,,0000,0000,0000,,is that the pointers got us flexibility, that Dialogue: 0,0:34:14.00,0:34:15.00,Default,,0000,0000,0000,,that insert and remove Dialogue: 0,0:34:15.00,0:34:19.00,Default,,0000,0000,0000,, Dialogue: 0,0:34:19.00,0:34:22.00,Default,,0000,0000,0000,,was dependent on the idea that pointers really being there, pointers buying us Dialogue: 0,0:34:22.00,0:34:25.00,Default,,0000,0000,0000,,the ability to insert and delete without doing a shuffle. Dialogue: 0,0:34:25.00,0:34:27.00,Default,,0000,0000,0000,,That now, once I have flattened it, I have suddenly bought back the same Dialogue: 0,0:34:27.00,0:34:29.00,Default,,0000,0000,0000,,problems that arrays have, which is Dialogue: 0,0:34:29.00,0:34:33.00,Default,,0000,0000,0000,,if I have the words here that have Dialogue: 0,0:34:33.00,0:34:36.00,Default,,0000,0000,0000,,''ab'' for abalone, and ''ac'' for act and I don't happen to have any words with ''ad.'' And Dialogue: 0,0:34:36.00,0:34:40.00,Default,,0000,0000,0000,,I try to add the word ''add,'' Dialogue: 0,0:34:40.00,0:34:43.00,Default,,0000,0000,0000,,that in order to put ''ad'' in there, we're going to have to move everything over and then Dialogue: 0,0:34:43.00,0:34:46.00,Default,,0000,0000,0000,,there's going to be a lot of shuffling and what not. Dialogue: 0,0:34:46.00,0:34:48.00,Default,,0000,0000,0000,,So I'm starting to build a data structure that's losing some flexibility but Dialogue: 0,0:34:48.00,0:34:49.00,Default,,0000,0000,0000,,saving space. Dialogue: 0,0:34:49.00,0:34:53.00,Default,,0000,0000,0000,,So Dialogue: 0,0:34:53.00,0:34:54.00,Default,,0000,0000,0000,,I'm, in some sense, focusing my attention on how can I build a data structure that's really fast Dialogue: 0,0:34:54.00,0:34:56.00,Default,,0000,0000,0000,,to search, Dialogue: 0,0:34:56.00,0:35:00.00,Default,,0000,0000,0000,,even if it was a little harder to build? Dialogue: 0,0:35:00.00,0:35:02.00,Default,,0000,0000,0000,,Because it turns out the thing I most care about is making sure it can Dialogue: 0,0:35:02.00,0:35:07.00,Default,,0000,0000,0000,,actually very, very efficiently look up words. Dialogue: 0,0:35:07.00,0:35:11.00,Default,,0000,0000,0000,,Maybe building it once, being a little slow, isn't so painful. Dialogue: 0,0:35:11.00,0:35:14.00,Default,,0000,0000,0000,,So the space issues in this gets us down to just ten bytes per Dialogue: 0,0:35:14.00,0:35:16.00,Default,,0000,0000,0000,,node. All the pointers are gone. So there was 24 bytes of pointers that we've Dialogue: 0,0:35:16.00,0:35:19.00,Default,,0000,0000,0000,,just eliminated. Dialogue: 0,0:35:19.00,0:35:21.00,Default,,0000,0000,0000,,And that means we've got two and a half megabyte Dialogue: 0,0:35:21.00,0:35:28.00,Default,,0000,0000,0000,,memory footprint right now. Dialogue: 0,0:35:28.00,0:35:29.00,Default,,0000,0000,0000,,So we're kinda getting close to hitting the range for binary tree and hash tables. Dialogue: 0,0:35:29.00,0:35:33.00,Default,,0000,0000,0000,,Now I'm Dialogue: 0,0:35:33.00,0:35:37.00,Default,,0000,0000,0000,,going to go back to the domain again. So I worked on it kinda from the CS side, thinking Dialogue: 0,0:35:37.00,0:35:39.00,Default,,0000,0000,0000,,about things I know about data structures and Dialogue: 0,0:35:39.00,0:35:41.00,Default,,0000,0000,0000,,reorganizing arrays and heaps and pointers. I'm going Dialogue: 0,0:35:41.00,0:35:43.00,Default,,0000,0000,0000,,to go back and look at that domain again Dialogue: 0,0:35:43.00,0:35:46.00,Default,,0000,0000,0000,,and see if there's something else I Dialogue: 0,0:35:46.00,0:35:48.00,Default,,0000,0000,0000,,can kinda take advantage of. Dialogue: 0,0:35:48.00,0:35:50.00,Default,,0000,0000,0000,,So this is a different Dialogue: 0,0:35:50.00,0:35:52.00,Default,,0000,0000,0000,,cross-section of the dictionary Dialogue: 0,0:35:52.00,0:35:57.00,Default,,0000,0000,0000,,showing the words ''adapted,'' Dialogue: 0,0:35:57.00,0:36:02.00,Default,,0000,0000,0000,,''adaptor,'' ''adaptors,'' ''adapting,'' ''adaption,'' adaptions,'' ''adapts.'' Dialogue: 0,0:36:02.00,0:36:04.00,Default,,0000,0000,0000,,So eight words that are all built off the ''adapt'' root with a bunch of suffixes. I look Dialogue: 0,0:36:04.00,0:36:05.00,Default,,0000,0000,0000,,sorta over a little bit further. I've got Dialogue: 0,0:36:05.00,0:36:10.00,Default,,0000,0000,0000,,''desert,'' Dialogue: 0,0:36:10.00,0:36:15.00,Default,,0000,0000,0000,,''deserted,'' ''deserter,'' ''deserting,'' ''desertings,'' ''desertion,'' ''desertions,'' ''deserts.'' Okay, Dialogue: 0,0:36:15.00,0:36:20.00,Default,,0000,0000,0000,,and ''detect'' and ''insert'' and ''perfect'' and ''invent'' and ''interrupt'' Dialogue: 0,0:36:20.00,0:36:23.00,Default,,0000,0000,0000,,that all are verbs for which you can apply the past tense in the Dialogue: 0,0:36:23.00,0:36:26.00,Default,,0000,0000,0000,,gerund form and the ''ion's'' that Dialogue: 0,0:36:26.00,0:36:29.00,Default,,0000,0000,0000,,tack onto that root word to give you these suffixes. Dialogue: 0,0:36:29.00,0:36:32.00,Default,,0000,0000,0000,,And each of these words that has shown up here has exactly the same set of suffixes Dialogue: 0,0:36:32.00,0:36:33.00,Default,,0000,0000,0000,,that can be applied to it. Dialogue: 0,0:36:33.00,0:36:35.00,Default,,0000,0000,0000,,So I Dialogue: 0,0:36:35.00,0:36:37.00,Default,,0000,0000,0000,,did this thing about prefixes. I Dialogue: 0,0:36:37.00,0:36:42.00,Default,,0000,0000,0000,,went out of my way to find that - Dialogue: 0,0:36:42.00,0:36:45.00,Default,,0000,0000,0000,,so all these words could share that same prefix, ''assert,'' ''asserter,'' ''assertings,'' ''asserting.'' Dialogue: 0,0:36:45.00,0:36:47.00,Default,,0000,0000,0000,,But is there some way that I could take this idea of ''corrupt'' Dialogue: 0,0:36:47.00,0:36:50.00,Default,,0000,0000,0000,,and ''assert'' and ''insert,'' Dialogue: 0,0:36:50.00,0:36:54.00,Default,,0000,0000,0000,,and then once I get down to that ''t,'' where they end, is there a way where they can also Dialogue: 0,0:36:54.00,0:36:57.00,Default,,0000,0000,0000,,share their backend parts? That those ''ing's'' and ''ion's,'' can they be Dialogue: 0,0:36:57.00,0:36:58.00,Default,,0000,0000,0000,,unified, as well? Well, Dialogue: 0,0:36:58.00,0:37:00.00,Default,,0000,0000,0000,,why not? Dialogue: 0,0:37:00.00,0:37:03.00,Default,,0000,0000,0000,,Why not? So here is Dialogue: 0,0:37:03.00,0:37:06.00,Default,,0000,0000,0000,,the first version of what's going to be the DAWG, Dialogue: 0,0:37:06.00,0:37:07.00,Default,,0000,0000,0000,,the Directed Acyclic Word Dialogue: 0,0:37:07.00,0:37:10.00,Default,,0000,0000,0000,,Graph. Dialogue: 0,0:37:10.00,0:37:12.00,Default,,0000,0000,0000,,It takes the idea of unifying the prefixes and now applies it to the Dialogue: 0,0:37:12.00,0:37:14.00,Default,,0000,0000,0000,,suffixes. Dialogue: 0,0:37:14.00,0:37:17.00,Default,,0000,0000,0000,,That there are a lot of Dialogue: 0,0:37:17.00,0:37:20.00,Default,,0000,0000,0000,,words that end in the same letters, as well as start in the same letters. Why not just apply Dialogue: 0,0:37:20.00,0:37:22.00,Default,,0000,0000,0000,,the same kind of unification and sharing on the back end? Dialogue: 0,0:37:22.00,0:37:24.00,Default,,0000,0000,0000,,So here comes in Dialogue: 0,0:37:24.00,0:37:26.00,Default,,0000,0000,0000,,''adapt'' and ''adopt'' Dialogue: 0,0:37:26.00,0:37:30.00,Default,,0000,0000,0000,,and ''based'' and ''port'' Dialogue: 0,0:37:30.00,0:37:33.00,Default,,0000,0000,0000,,and out the back end is ''ports'' and ''portion'' and ''porting'' and ''porter'' and ''adopter'' Dialogue: 0,0:37:33.00,0:37:37.00,Default,,0000,0000,0000,,and ''adopted'' and ''basting'' and ''bastion,'' Dialogue: 0,0:37:37.00,0:37:40.00,Default,,0000,0000,0000,,all coming off that sharing all the endings, because they have the same Dialogue: 0,0:37:40.00,0:37:41.00,Default,,0000,0000,0000,,words that are viable and the same connections that are Dialogue: 0,0:37:41.00,0:37:46.00,Default,,0000,0000,0000,,between them. So Dialogue: 0,0:37:46.00,0:37:49.00,Default,,0000,0000,0000,,the idea that all sharing the ''er'' and the ''er'' leading in to the ''s.'' And that same ''s,'' for Dialogue: 0,0:37:49.00,0:37:50.00,Default,,0000,0000,0000,,example, being shared from ''portion'' and ''baste'' and Dialogue: 0,0:37:50.00,0:37:51.00,Default,,0000,0000,0000,,''adopt'' and Dialogue: 0,0:37:51.00,0:37:54.00,Default,,0000,0000,0000,,''adopter,'' Dialogue: 0,0:37:54.00,0:37:57.00,Default,,0000,0000,0000,,all coming into the same ending of oh, these things that end in ''s'' are words. Here's an ''s'' that is a Dialogue: 0,0:37:57.00,0:37:59.00,Default,,0000,0000,0000,,word for them to share. Dialogue: 0,0:37:59.00,0:38:01.00,Default,,0000,0000,0000,,So it is directed. So Dialogue: 0,0:38:01.00,0:38:04.00,Default,,0000,0000,0000,,it is a graph. Dialogue: 0,0:38:04.00,0:38:05.00,Default,,0000,0000,0000,,So once we remove this - the idea Dialogue: 0,0:38:05.00,0:38:09.00,Default,,0000,0000,0000,,that there's a single path Dialogue: 0,0:38:09.00,0:38:12.00,Default,,0000,0000,0000,,from the root to any node, we're no longer a tree structure. So we Dialogue: 0,0:38:12.00,0:38:16.00,Default,,0000,0000,0000,,can get to ''t'' by going through ''adapt'' or ''adopt'' or ''port,'' Dialogue: 0,0:38:16.00,0:38:19.00,Default,,0000,0000,0000,,and so that no longer fits the definition of what is a tree. It's Dialogue: 0,0:38:19.00,0:38:23.00,Default,,0000,0000,0000,,become a graph. We can get to these nodes from different places. Dialogue: 0,0:38:23.00,0:38:24.00,Default,,0000,0000,0000,,The arcs go one way in this case. We can't go back up the tree Dialogue: 0,0:38:24.00,0:38:27.00,Default,,0000,0000,0000,,to find the words in reverse. Dialogue: 0,0:38:27.00,0:38:31.00,Default,,0000,0000,0000,,There are no cycles. So acyclic, right, you can't just Dialogue: 0,0:38:31.00,0:38:33.00,Default,,0000,0000,0000,,wank around some portion and get this many bananas as you want. Dialogue: 0,0:38:33.00,0:38:34.00,Default,,0000,0000,0000,,And Dialogue: 0,0:38:34.00,0:38:37.00,Default,,0000,0000,0000,,they have Dialogue: 0,0:38:37.00,0:38:44.00,Default,,0000,0000,0000,,been with certain nodes that are marked, in this case, the same way they were in the Dialogue: 0,0:38:44.00,0:38:45.00,Default,,0000,0000,0000,,trie. This path from the root node to here does represent a word in this Dialogue: 0,0:38:45.00,0:38:48.00,Default,,0000,0000,0000,,situation. Dialogue: 0,0:38:48.00,0:38:53.00,Default,,0000,0000,0000,,So building this is a little bit complicated. I'll give you a little bit Dialogue: 0,0:38:53.00,0:38:56.00,Default,,0000,0000,0000,,of the idea of the intuition for it, and I will not get too deep into it. Dialogue: 0,0:38:56.00,0:39:00.00,Default,,0000,0000,0000,,But largely the way you do it is you build the trie, Dialogue: 0,0:39:00.00,0:39:01.00,Default,,0000,0000,0000,,so that it doesn't have the unification of the suffixes; it only unifies the Dialogue: 0,0:39:01.00,0:39:03.00,Default,,0000,0000,0000,,prefixes. And Dialogue: 0,0:39:03.00,0:39:07.00,Default,,0000,0000,0000,,then you work backwards Dialogue: 0,0:39:07.00,0:39:09.00,Default,,0000,0000,0000,,on the suffixes. So in some sense you look at all the words that end, for example, in Dialogue: 0,0:39:09.00,0:39:12.00,Default,,0000,0000,0000,,''s'' in the dictionary. Dialogue: 0,0:39:12.00,0:39:15.00,Default,,0000,0000,0000,,And you find all the ones that just end in one ''s'' that goes nowhere, and you say Dialogue: 0,0:39:15.00,0:39:17.00,Default,,0000,0000,0000,,look, this ''s'' looks like every other ''s'' in this thing. All these words Dialogue: 0,0:39:17.00,0:39:20.00,Default,,0000,0000,0000,,that are just plurals or Dialogue: 0,0:39:20.00,0:39:23.00,Default,,0000,0000,0000,,verbs that can have an ''s'' tacked on. Let's share that ''s.'' Dialogue: 0,0:39:23.00,0:39:25.00,Default,,0000,0000,0000,,And then so you take however many s's there were, 1,000 of them, and you Dialogue: 0,0:39:25.00,0:39:29.00,Default,,0000,0000,0000,,say these are all the same ''s.'' Dialogue: 0,0:39:29.00,0:39:31.00,Default,,0000,0000,0000,,And then what you do is you look at all the nodes that lead into that ''s.'' So you're kinda Dialogue: 0,0:39:31.00,0:39:33.00,Default,,0000,0000,0000,,traversing the graph in reverse, and you say well look, Dialogue: 0,0:39:33.00,0:39:36.00,Default,,0000,0000,0000,,here's a bunch of people who lead in on an ''e.'' Dialogue: 0,0:39:36.00,0:39:37.00,Default,,0000,0000,0000,,And if that ''e'' is a word or not, there might be some group that has ''e'' Dialogue: 0,0:39:37.00,0:39:38.00,Default,,0000,0000,0000,,that's also a word, Dialogue: 0,0:39:38.00,0:39:40.00,Default,,0000,0000,0000,,so like Dialogue: 0,0:39:40.00,0:39:42.00,Default,,0000,0000,0000,,the word ''apple'' and ''apples.'' Dialogue: 0,0:39:42.00,0:39:46.00,Default,,0000,0000,0000,,But sometimes the ''e'' is not, Dialogue: 0,0:39:46.00,0:39:49.00,Default,,0000,0000,0000,,where that wasn't a stopping place, let's say, for ''parties'' it wasn't. Dialogue: 0,0:39:49.00,0:39:53.00,Default,,0000,0000,0000,,So all the ones that are a word you can unify on the ''e'' that was a word, and all the ones that aren't Dialogue: 0,0:39:53.00,0:39:54.00,Default,,0000,0000,0000,,are that way can unify on that. So you just work your way backwards from there. So you say Dialogue: 0,0:39:54.00,0:39:55.00,Default,,0000,0000,0000,,what letters led into that ''e?'' Dialogue: 0,0:39:55.00,0:39:58.00,Default,,0000,0000,0000,,And so kinda from Dialogue: 0,0:39:58.00,0:40:02.00,Default,,0000,0000,0000,,the end of the words backwards, you're looking for all the people who can end in an Dialogue: 0,0:40:02.00,0:40:05.00,Default,,0000,0000,0000,,''s,'' and end in an ''e,'' and end in an ''ies,'' and unify them up, Dialogue: 0,0:40:05.00,0:40:07.00,Default,,0000,0000,0000,,making sure that they have all the same outgoing connections. So for Dialogue: 0,0:40:07.00,0:40:09.00,Default,,0000,0000,0000,,example, ''running'' has Dialogue: 0,0:40:09.00,0:40:13.00,Default,,0000,0000,0000,,an ''ing'' at the end. So does ''swimming,'' Dialogue: 0,0:40:13.00,0:40:15.00,Default,,0000,0000,0000,,but there's ''swimmingly'' and there's not ''runningly.'' So you have to be careful about Dialogue: 0,0:40:15.00,0:40:18.00,Default,,0000,0000,0000,,unifying things that Dialogue: 0,0:40:18.00,0:40:20.00,Default,,0000,0000,0000,,they would kinda have to have all identical properties from this point Dialogue: 0,0:40:20.00,0:40:22.00,Default,,0000,0000,0000,,down to the bottom of the tree Dialogue: 0,0:40:22.00,0:40:25.00,Default,,0000,0000,0000,,to be unifiable. Dialogue: 0,0:40:25.00,0:40:28.00,Default,,0000,0000,0000,,But it's a pretty neat process. It's actually what's called Dialogue: 0,0:40:28.00,0:40:29.00,Default,,0000,0000,0000,,DFT subset minimization, so if you want to learn a little bit about it, that's the Dialogue: 0,0:40:29.00,0:40:31.00,Default,,0000,0000,0000,,word to look up. Dialogue: 0,0:40:31.00,0:40:35.00,Default,,0000,0000,0000,,And so if I do that, Dialogue: 0,0:40:35.00,0:40:36.00,Default,,0000,0000,0000,,and I build this as a DAWG, I'm still using the same structure definition here, Dialogue: 0,0:40:36.00,0:40:40.00,Default,,0000,0000,0000,, Dialogue: 0,0:40:40.00,0:40:44.00,Default,,0000,0000,0000,,but what I am doing is unifying suffixes as well as prefixes Dialogue: 0,0:40:44.00,0:40:46.00,Default,,0000,0000,0000,,and that, actually, is gonna change the number of nodes that I need drastically. Dialogue: 0,0:40:46.00,0:40:49.00,Default,,0000,0000,0000,,It goes down to 80,000 Dialogue: 0,0:40:49.00,0:40:52.00,Default,,0000,0000,0000,,from my initial 250,000. Dialogue: 0,0:40:52.00,0:40:55.00,Default,,0000,0000,0000,,And if you think about the number of words again, Dialogue: 0,0:40:55.00,0:40:57.00,Default,,0000,0000,0000,,there were 125,000 words in the dictionary, Dialogue: 0,0:40:57.00,0:41:01.00,Default,,0000,0000,0000,,the DAWG has just 80,000 words. So it means, actually, once you start Dialogue: 0,0:41:01.00,0:41:04.00,Default,,0000,0000,0000,,unifying the suffixes, it's like most words don't even have - Dialogue: 0,0:41:04.00,0:41:07.00,Default,,0000,0000,0000,,or many of the words don't even have nodes of their own that are unique to Dialogue: 0,0:41:07.00,0:41:11.00,Default,,0000,0000,0000,,them whatsoever. They just exist kinda as portions of other words Dialogue: 0,0:41:11.00,0:41:12.00,Default,,0000,0000,0000,,that have all been joined together and Dialogue: 0,0:41:12.00,0:41:15.00,Default,,0000,0000,0000,,compressed into one package. Dialogue: 0,0:41:15.00,0:41:18.00,Default,,0000,0000,0000,,So 80,000 nodes total, about Dialogue: 0,0:41:18.00,0:41:21.00,Default,,0000,0000,0000,,two-thirds of the number of words total. Dialogue: 0,0:41:21.00,0:41:23.00,Default,,0000,0000,0000,,So we have ten nodes - ten-byte nodes Dialogue: 0,0:41:23.00,0:41:27.00,Default,,0000,0000,0000,,and 80,000 of them. Dialogue: 0,0:41:27.00,0:41:31.00,Default,,0000,0000,0000,,We are down to under a megabyte. So we've crossed that little interesting barrier of Dialogue: 0,0:41:31.00,0:41:34.00,Default,,0000,0000,0000,,the data structure on disk. The data that was fed into it, those words, was over a Dialogue: 0,0:41:34.00,0:41:36.00,Default,,0000,0000,0000,,megabyte and now, actually, we have created a representation that actually is Dialogue: 0,0:41:36.00,0:41:40.00,Default,,0000,0000,0000,,compressed, that uses less space Dialogue: 0,0:41:40.00,0:41:42.00,Default,,0000,0000,0000,,than the text file did in the first place. And that's really because it is Dialogue: 0,0:41:42.00,0:41:44.00,Default,,0000,0000,0000,,finding all these ways in which words are like each other Dialogue: 0,0:41:44.00,0:41:46.00,Default,,0000,0000,0000,,and sharing, Dialogue: 0,0:41:46.00,0:41:48.00,Default,,0000,0000,0000,,making use of the existing Dialogue: 0,0:41:48.00,0:41:50.00,Default,,0000,0000,0000,,structure in the Dialogue: 0,0:41:50.00,0:41:51.00,Default,,0000,0000,0000,,DAWG to add new words Dialogue: 0,0:41:51.00,0:41:53.00,Default,,0000,0000,0000,,without Dialogue: 0,0:41:53.00,0:41:56.00,Default,,0000,0000,0000,,adding bulk. Dialogue: 0,0:41:56.00,0:41:57.00,Default,,0000,0000,0000,,So once I've done this I have to say that I've made the data structure Dialogue: 0,0:41:57.00,0:42:01.00,Default,,0000,0000,0000,,even less flexible. Dialogue: 0,0:42:01.00,0:42:03.00,Default,,0000,0000,0000,,So to be clear, each of these steps has a cost somewhere else that made the code more Dialogue: 0,0:42:03.00,0:42:07.00,Default,,0000,0000,0000,,complicated. It also made it so that Dialogue: 0,0:42:07.00,0:42:10.00,Default,,0000,0000,0000,,now that I have all this sharing back here, that when I'm adding a new word, so I go Dialogue: 0,0:42:10.00,0:42:14.00,Default,,0000,0000,0000,,back to look at this picture here, that Dialogue: 0,0:42:14.00,0:42:15.00,Default,,0000,0000,0000,,I go in to add a word and I say oh, actually, the word Dialogue: 0,0:42:15.00,0:42:17.00,Default,,0000,0000,0000,,''portioning'' is a word. Dialogue: 0,0:42:17.00,0:42:19.00,Default,,0000,0000,0000,,Well, ''adoptioning'' is not. Dialogue: 0,0:42:19.00,0:42:22.00,Default,,0000,0000,0000,,So Dialogue: 0,0:42:22.00,0:42:26.00,Default,,0000,0000,0000,,that will necessitate if I go in to add the word ''portioning,'' that I actually have Dialogue: 0,0:42:26.00,0:42:30.00,Default,,0000,0000,0000,,to split it off of its existing unified branch, because if I just tacked on the Dialogue: 0,0:42:30.00,0:42:33.00,Default,,0000,0000,0000,,''ing'' on the end of ''portioning'' there, that it turns out that ''adaptioning'' Dialogue: 0,0:42:33.00,0:42:37.00,Default,,0000,0000,0000,,and ''adoptioning'' would suddenly become words, too. So actually I have to Dialogue: 0,0:42:37.00,0:42:38.00,Default,,0000,0000,0000,,be extra careful once I've got it unified this way that when I add new words that I don't Dialogue: 0,0:42:38.00,0:42:39.00,Default,,0000,0000,0000,,actually Dialogue: 0,0:42:39.00,0:42:42.00,Default,,0000,0000,0000,,accidentally Dialogue: 0,0:42:42.00,0:42:45.00,Default,,0000,0000,0000,,add some additional words that I didn't plan on because these paths have Dialogue: 0,0:42:45.00,0:42:49.00,Default,,0000,0000,0000,,been shared. So it requires a little bit more careful management. Again, Dialogue: 0,0:42:49.00,0:42:50.00,Default,,0000,0000,0000,,the goal, though, here, in this case, was to try to build a data structure that was kinda freeze-dried. Dialogue: 0,0:42:50.00,0:42:53.00,Default,,0000,0000,0000,,So I Dialogue: 0,0:42:53.00,0:42:55.00,Default,,0000,0000,0000,,build it off the data structure - the input data file I had, Dialogue: 0,0:42:55.00,0:42:57.00,Default,,0000,0000,0000,,that then would operate very, very efficiently, Dialogue: 0,0:42:57.00,0:43:00.00,Default,,0000,0000,0000,,and I wasn't planning on making a lot Dialogue: 0,0:43:00.00,0:43:01.00,Default,,0000,0000,0000,,of runtime changes to it. Dialogue: 0,0:43:01.00,0:43:04.00,Default,,0000,0000,0000,,I get this down to this Dialogue: 0,0:43:04.00,0:43:08.00,Default,,0000,0000,0000,,and then the last thing I'm going to do to it Dialogue: 0,0:43:08.00,0:43:11.00,Default,,0000,0000,0000,,is go back to kinda the CS side of it and see if I can squeeze it down kinda Dialogue: 0,0:43:11.00,0:43:15.00,Default,,0000,0000,0000,,in terms of just data structure, what's being stored per node. Dialogue: 0,0:43:15.00,0:43:18.00,Default,,0000,0000,0000,,So I have 80,000 of these. So Dialogue: 0,0:43:18.00,0:43:22.00,Default,,0000,0000,0000,,typically, when you look at a structure, Dialogue: 0,0:43:22.00,0:43:24.00,Default,,0000,0000,0000,,it's very rare that the idea if you change something from an int to a Dialogue: 0,0:43:24.00,0:43:28.00,Default,,0000,0000,0000,,char, so it was a two-byte thing into a one-byte thing, Dialogue: 0,0:43:28.00,0:43:32.00,Default,,0000,0000,0000,,that you're going to have any profound impact on the total size needed. Dialogue: 0,0:43:32.00,0:43:33.00,Default,,0000,0000,0000,,But if you happen to know you're going to make a lot of these structures, tens of thousands, Dialogue: 0,0:43:33.00,0:43:35.00,Default,,0000,0000,0000,,almost 100,000 of them, Dialogue: 0,0:43:35.00,0:43:38.00,Default,,0000,0000,0000,,then it turns out every byte counts. Dialogue: 0,0:43:38.00,0:43:40.00,Default,,0000,0000,0000,,If I can get this thing down from, in this case, Dialogue: 0,0:43:40.00,0:43:42.00,Default,,0000,0000,0000,,ten bytes to eight bytes, or to Dialogue: 0,0:43:42.00,0:43:46.00,Default,,0000,0000,0000,,six bytes or four bytes, Dialogue: 0,0:43:46.00,0:43:49.00,Default,,0000,0000,0000,,then it will cut my memory by quite a factor. Dialogue: 0,0:43:49.00,0:43:52.00,Default,,0000,0000,0000,,So what I did in the final version of it Dialogue: 0,0:43:52.00,0:43:58.00,Default,,0000,0000,0000,,is I used a little technique called bit packing, Dialogue: 0,0:43:58.00,0:44:00.00,Default,,0000,0000,0000,,which is to actually stash stuff in the absolute minimum possible space Dialogue: 0,0:44:00.00,0:44:04.00,Default,,0000,0000,0000,,that I can fit it into, Dialogue: 0,0:44:04.00,0:44:04.00,Default,,0000,0000,0000,,taking advantage of things like, for example, if there are only 26 letters in the Dialogue: 0,0:44:04.00,0:44:08.00,Default,,0000,0000,0000,, Dialogue: 0,0:44:08.00,0:44:10.00,Default,,0000,0000,0000,,alphabet, a character is actually a full eight-bit value, which can hold Dialogue: 0,0:44:10.00,0:44:13.00,Default,,0000,0000,0000,,numbers from zero to 255. Dialogue: 0,0:44:13.00,0:44:15.00,Default,,0000,0000,0000,,And I just don't need that much space. I'm not using the yen sign, I'm not using the parentheses, Dialogue: 0,0:44:15.00,0:44:19.00,Default,,0000,0000,0000,,I'm not using the digits. Dialogue: 0,0:44:19.00,0:44:22.00,Default,,0000,0000,0000,,So having, in some sense, the capacity to store all those distinct characters is actually Dialogue: 0,0:44:22.00,0:44:26.00,Default,,0000,0000,0000,,a cost I'm paying on every character that I don't need to. Dialogue: 0,0:44:26.00,0:44:29.00,Default,,0000,0000,0000,,So in fact, I squeezed it down by dividing up into the sub-bit level. This Dialogue: 0,0:44:29.00,0:44:30.00,Default,,0000,0000,0000,,something that's talked about a little bit in Chapter 15 if you're at all curious. It's Dialogue: 0,0:44:30.00,0:44:33.00,Default,,0000,0000,0000,,not what I Dialogue: 0,0:44:33.00,0:44:34.00,Default,,0000,0000,0000,,consider core material for 106B, but I'm just sorta mentioning it to kinda Dialogue: 0,0:44:34.00,0:44:36.00,Default,,0000,0000,0000,,complete the picture. Dialogue: 0,0:44:36.00,0:44:38.00,Default,,0000,0000,0000,,Where five bits, Dialogue: 0,0:44:38.00,0:44:41.00,Default,,0000,0000,0000,,given that a bit is either zero or one, Dialogue: 0,0:44:41.00,0:44:45.00,Default,,0000,0000,0000,,that I have five of them connected together, then I have 25 Dialogue: 0,0:44:45.00,0:44:45.00,Default,,0000,0000,0000,,different patterns that I can represent in that Dialogue: 0,0:44:45.00,0:44:48.00,Default,,0000,0000,0000,, Dialogue: 0,0:44:48.00,0:44:50.00,Default,,0000,0000,0000,,amount of capacity there, and that's 32 different patterns, which is enough for Dialogue: 0,0:44:50.00,0:44:54.00,Default,,0000,0000,0000,,my characters, plus a little slack. Dialogue: 0,0:44:54.00,0:44:56.00,Default,,0000,0000,0000,,I use exactly one bit for is word, rather than a full Boolean. All I need Dialogue: 0,0:44:56.00,0:44:59.00,Default,,0000,0000,0000,,to know is yes or no. Dialogue: 0,0:44:59.00,0:45:02.00,Default,,0000,0000,0000,,I also changed the way I did Dialogue: 0,0:45:02.00,0:45:04.00,Default,,0000,0000,0000,,knowing that you were the last child of a generation, rather than having Dialogue: 0,0:45:04.00,0:45:06.00,Default,,0000,0000,0000,,A say, ''there are 12 children here.'' Dialogue: 0,0:45:06.00,0:45:10.00,Default,,0000,0000,0000,,I just said, go to index 27 Dialogue: 0,0:45:10.00,0:45:12.00,Default,,0000,0000,0000,,and start reading, and one of them will tell you it's the last. So each of them says, ''No, Dialogue: 0,0:45:12.00,0:45:16.00,Default,,0000,0000,0000,,I'm not the last,'' ''No, I'm not the last.'' And then when you get to the last one, it Dialogue: 0,0:45:16.00,0:45:18.00,Default,,0000,0000,0000,,will be marked as ''yes,'' and that's how you'll know - that's where A ends and the Dialogue: 0,0:45:18.00,0:45:20.00,Default,,0000,0000,0000,,next one beings. Dialogue: 0,0:45:20.00,0:45:21.00,Default,,0000,0000,0000,,And so if I can get it down in this way Dialogue: 0,0:45:21.00,0:45:24.00,Default,,0000,0000,0000,, Dialogue: 0,0:45:24.00,0:45:28.00,Default,,0000,0000,0000,,I've taken Dialogue: 0,0:45:28.00,0:45:31.00,Default,,0000,0000,0000,,what was a ten-byte structure down to four. And it turns out there's often Dialogue: 0,0:45:31.00,0:45:35.00,Default,,0000,0000,0000,,reasons why trying to get it down to a multiple of two is actually Dialogue: 0,0:45:35.00,0:45:37.00,Default,,0000,0000,0000,,a critical barrier. That if it were five bytes, it's actually likely to be eight, Dialogue: 0,0:45:37.00,0:45:39.00,Default,,0000,0000,0000,,just because of what's called alignment and padding restrictions. Dialogue: 0,0:45:39.00,0:45:42.00,Default,,0000,0000,0000,,So if were at Dialogue: 0,0:45:42.00,0:45:45.00,Default,,0000,0000,0000,,seven and I squeeze down to six, it might not make any change. And if I squeeze down to five, it Dialogue: 0,0:45:45.00,0:45:47.00,Default,,0000,0000,0000,,still might not make any change. It might be, actually, artificially decided that in Dialogue: 0,0:45:47.00,0:45:50.00,Default,,0000,0000,0000,,each of the structures is going to be eight behind the scenes. Dialogue: 0,0:45:50.00,0:45:54.00,Default,,0000,0000,0000,,But getting it down to four actually tends to be another one of those critical barriers where Dialogue: 0,0:45:54.00,0:45:56.00,Default,,0000,0000,0000,,okay, four actually will be smaller than something that was ten. And Dialogue: 0,0:45:56.00,0:45:58.00,Default,,0000,0000,0000,,so I take my 80,000 nodes, Dialogue: 0,0:45:58.00,0:46:03.00,Default,,0000,0000,0000,,I get them to be four bytes each, Dialogue: 0,0:46:03.00,0:46:06.00,Default,,0000,0000,0000,,and then I now have a data structure that's about a third of a megabyte in memory, Dialogue: 0,0:46:06.00,0:46:09.00,Default,,0000,0000,0000,,as opposed to the over a megabyte on disk Dialogue: 0,0:46:09.00,0:46:11.00,Default,,0000,0000,0000,,in the text form. Dialogue: 0,0:46:11.00,0:46:12.00,Default,,0000,0000,0000,,And so that's how it does what it does. So if you're going to look at the code, Dialogue: 0,0:46:12.00,0:46:16.00,Default,,0000,0000,0000,,you'll see that Dialogue: 0,0:46:16.00,0:46:17.00,Default,,0000,0000,0000,,there is this complicated bit pack structure that Dialogue: 0,0:46:17.00,0:46:20.00,Default,,0000,0000,0000,,it does Dialogue: 0,0:46:20.00,0:46:24.00,Default,,0000,0000,0000,,look for things char byte using a recursive character-by-character Dialogue: 0,0:46:24.00,0:46:24.00,Default,,0000,0000,0000,,strategy of start at the top, find the matching character, then go to the children index, Dialogue: 0,0:46:24.00,0:46:28.00,Default,,0000,0000,0000,,and then Dialogue: 0,0:46:28.00,0:46:33.00,Default,,0000,0000,0000,,work your way down the children to find the next matching character, and so on Dialogue: 0,0:46:33.00,0:46:37.00,Default,,0000,0000,0000,,to find its way through the data structure. Dialogue: 0,0:46:37.00,0:46:41.00,Default,,0000,0000,0000,,So [inaudible] is where the child index, or for example, the idea that Dialogue: 0,0:46:41.00,0:46:42.00,Default,,0000,0000,0000,,A says my children start at 27 or B starts at 39, Dialogue: 0,0:46:42.00,0:46:45.00,Default,,0000,0000,0000,,and so I just need Dialogue: 0,0:46:45.00,0:46:51.00,Default,,0000,0000,0000,,a number there, but I'm using Dialogue: 0,0:46:51.00,0:46:55.00,Default,,0000,0000,0000,,a slightly smaller than an integer, just to make them all fit very tightly. So I'll tell Dialogue: 0,0:46:55.00,0:46:57.00,Default,,0000,0000,0000,,you a couple of interesting things about it, actually, before I close this off, which is the Dialogue: 0,0:46:57.00,0:47:02.00,Default,,0000,0000,0000,,lexicon.dat file Dialogue: 0,0:47:02.00,0:47:03.00,Default,,0000,0000,0000,,actually just is an in-memory representation of what the DAWG data Dialogue: 0,0:47:03.00,0:47:07.00,Default,,0000,0000,0000,,structure looks like. Dialogue: 0,0:47:07.00,0:47:11.00,Default,,0000,0000,0000,,And so in fact, it's very easy to take the whole data structure. It doesn't have any pointers in it. Pointers, actually, tend to be a Dialogue: 0,0:47:11.00,0:47:15.00,Default,,0000,0000,0000,,little bit - make data structures harder to rebuild and Dialogue: 0,0:47:15.00,0:47:17.00,Default,,0000,0000,0000,,to dehydrate to disk because they have to kinda be wired up in memory. You don't know what Dialogue: 0,0:47:17.00,0:47:20.00,Default,,0000,0000,0000,,memory you're going to get from new and what the addresses are. Dialogue: 0,0:47:20.00,0:47:22.00,Default,,0000,0000,0000,,This one, actually, doesn't have any pointers. Everything is actually kinda self-contained. It's Dialogue: 0,0:47:22.00,0:47:26.00,Default,,0000,0000,0000,,just one big array of these blocks. Dialogue: 0,0:47:26.00,0:47:27.00,Default,,0000,0000,0000,,And all the information is relative in there. It says go to the index 27, index Dialogue: 0,0:47:27.00,0:47:30.00,Default,,0000,0000,0000,,45. Dialogue: 0,0:47:30.00,0:47:34.00,Default,,0000,0000,0000,,So you can take that whole block and just write it to disk, 300 bytes worth, Dialogue: 0,0:47:34.00,0:47:36.00,Default,,0000,0000,0000,,read it back in 300 bytes worth and it kinda is rehydrated Dialogue: 0,0:47:36.00,0:47:39.00,Default,,0000,0000,0000,,without any extra effort. So Dialogue: 0,0:47:39.00,0:47:42.00,Default,,0000,0000,0000,,that's actually what the lexicon.dat file is, it's just a dehydrated form of Dialogue: 0,0:47:42.00,0:47:44.00,Default,,0000,0000,0000,,the DAWG having been Dialogue: 0,0:47:44.00,0:47:48.00,Default,,0000,0000,0000,,build in memory. That makes it very, very fast to load. Dialogue: 0,0:47:48.00,0:47:49.00,Default,,0000,0000,0000,,At that point, though, it's super inflexible. If you want to add a word to Dialogue: 0,0:47:49.00,0:47:52.00,Default,,0000,0000,0000,,it, Dialogue: 0,0:47:52.00,0:47:56.00,Default,,0000,0000,0000,,the way the structure is build, you'd have to kinda add nodes and make sure it wasn't Dialogue: 0,0:47:56.00,0:47:58.00,Default,,0000,0000,0000,,glomming on to some existing words that were nearby, and so it would be a lot of work to Dialogue: 0,0:47:58.00,0:48:01.00,Default,,0000,0000,0000,,actually edit the DAWG in place. Dialogue: 0,0:48:01.00,0:48:04.00,Default,,0000,0000,0000,,In fact, when you ask it to add a word, Dialogue: 0,0:48:04.00,0:48:05.00,Default,,0000,0000,0000,,it actually doesn't edit that one at all. Dialogue: 0,0:48:05.00,0:48:09.00,Default,,0000,0000,0000,,You say I'd like to put a new word in. Dialogue: 0,0:48:09.00,0:48:11.00,Default,,0000,0000,0000,,It says I'm just - this one's so perfect the way it is. I've built this one, I love this Dialogue: 0,0:48:11.00,0:48:14.00,Default,,0000,0000,0000,,one, I'm very happy with it. Dialogue: 0,0:48:14.00,0:48:16.00,Default,,0000,0000,0000,,If you want to put some other words into it, it just keeps a second data structure Dialogue: 0,0:48:16.00,0:48:21.00,Default,,0000,0000,0000,,on the side, an Dialogue: 0,0:48:21.00,0:48:21.00,Default,,0000,0000,0000,,auxiliary one over here, where it sticks the additional words you ask it to put in. I Dialogue: 0,0:48:21.00,0:48:24.00,Default,,0000,0000,0000,,have Dialogue: 0,0:48:24.00,0:48:28.00,Default,,0000,0000,0000,,to say that seems like a very obvious thing to do, but in fact, I didn't even think of Dialogue: 0,0:48:28.00,0:48:31.00,Default,,0000,0000,0000,,that for years. We had the lexicon. I said you can't use the lexicon to make new Dialogue: 0,0:48:31.00,0:48:32.00,Default,,0000,0000,0000,,word lists or edit them. You can only load this one, beautiful, perfect one. You can't do Dialogue: 0,0:48:32.00,0:48:35.00,Default,,0000,0000,0000,,anything else. And Dialogue: 0,0:48:35.00,0:48:38.00,Default,,0000,0000,0000,,at some point people kept saying, ''I wish I could put some words in the lexicon. I wish I could use it Dialogue: 0,0:48:38.00,0:48:41.00,Default,,0000,0000,0000,,for the word list to keep track of the human and computer players.'' I'm like, ''Well, you just Dialogue: 0,0:48:41.00,0:48:41.00,Default,,0000,0000,0000,,can't do it. You don't understand. It's very complicated. Dialogue: 0,0:48:41.00,0:48:45.00,Default,,0000,0000,0000,,You Dialogue: 0,0:48:45.00,0:48:50.00,Default,,0000,0000,0000,,just don't understand my pain.'' Whatever. Whatever. I had excuses. And then, really, Dialogue: 0,0:48:50.00,0:48:51.00,Default,,0000,0000,0000,,after some number of years of just being a baby about it, I'm like oh, of course, why don't I Dialogue: 0,0:48:51.00,0:48:54.00,Default,,0000,0000,0000,,just Dialogue: 0,0:48:54.00,0:48:57.00,Default,,0000,0000,0000,,keep another structure off to the side? It's an abstraction. It's a word list. You don't need Dialogue: 0,0:48:57.00,0:49:00.00,Default,,0000,0000,0000,,to know what I do. The fact that I keep a DAWG for some of the words Dialogue: 0,0:49:00.00,0:49:04.00,Default,,0000,0000,0000,,and just an ordinary set for some of the other words Dialogue: 0,0:49:04.00,0:49:06.00,Default,,0000,0000,0000,,is really completely irrelevant. And as long as when you ask for words I Dialogue: 0,0:49:06.00,0:49:08.00,Default,,0000,0000,0000,,look in both and I tell you, ''Yes, it's there,'' why do you need to know Dialogue: 0,0:49:08.00,0:49:11.00,Default,,0000,0000,0000,,where I keep it? Okay, so it took Dialogue: 0,0:49:11.00,0:49:13.00,Default,,0000,0000,0000,,me a while to buy abstraction, even though I teach it all the time. It just Dialogue: 0,0:49:13.00,0:49:17.00,Default,,0000,0000,0000,,wasn't totally making sense. Dialogue: 0,0:49:17.00,0:49:19.00,Default,,0000,0000,0000,,So a neat little thing, just kinda keeping both behind. And so it turns out that I Dialogue: 0,0:49:19.00,0:49:21.00,Default,,0000,0000,0000,,learned about this, actually, from a paper Dialogue: 0,0:49:21.00,0:49:22.00,Default,,0000,0000,0000,,that I read at the - in the Dialogue: 0,0:49:22.00,0:49:23.00,Default,,0000,0000,0000,, Dialogue: 0,0:49:23.00,0:49:27.00,Default,,0000,0000,0000,,early ?90s about somebody Dialogue: 0,0:49:27.00,0:49:30.00,Default,,0000,0000,0000,,building a Scrabble player, and they wanted a dictionary that Dialogue: 0,0:49:30.00,0:49:33.00,Default,,0000,0000,0000,,supported kinda finding plays on the board and being able to extend words. And so Dialogue: 0,0:49:33.00,0:49:34.00,Default,,0000,0000,0000,,knowing things about root words and how they extend and prefixes and suffixes was kinda the Dialogue: 0,0:49:34.00,0:49:38.00,Default,,0000,0000,0000,,original motivation. Dialogue: 0,0:49:38.00,0:49:39.00,Default,,0000,0000,0000,,And this kind of data structure is actually really great for any kind of word playing. Dialogue: 0,0:49:39.00,0:49:42.00,Default,,0000,0000,0000,,So you have Dialogue: 0,0:49:42.00,0:49:43.00,Default,,0000,0000,0000,,anagrams that you want to rearrange or words that you want to extend past suffix and Dialogue: 0,0:49:43.00,0:49:46.00,Default,,0000,0000,0000,,prefix that Dialogue: 0,0:49:46.00,0:49:50.00,Default,,0000,0000,0000,,this kind of data structure is really optimized for Dialogue: 0,0:49:50.00,0:49:52.00,Default,,0000,0000,0000,,doing that kind of traversal and kinda leveraging the existing patterns Dialogue: 0,0:49:52.00,0:49:56.00,Default,,0000,0000,0000,,in words to save space while doing so. Dialogue: 0,0:49:56.00,0:50:00.00,Default,,0000,0000,0000,,So it was a really fun little project to do. It was kinda neat to think about. Even though today, Dialogue: 0,0:50:00.00,0:50:02.00,Default,,0000,0000,0000,,if you needed a lexicon, you could just use a set. It's fast enough. It takes Dialogue: 0,0:50:02.00,0:50:05.00,Default,,0000,0000,0000,,a ton of memory, but you know what, memory's cheap. Dialogue: 0,0:50:05.00,0:50:06.00,Default,,0000,0000,0000,,So this is not the kind of thing you would need to do today except when you were in some kinda Dialogue: 0,0:50:06.00,0:50:07.00,Default,,0000,0000,0000,,embedded, Dialogue: 0,0:50:07.00,0:50:11.00,Default,,0000,0000,0000,,low memory, Dialogue: 0,0:50:11.00,0:50:14.00,Default,,0000,0000,0000,,very, very tight processor environment, but it still makes for kinda interesting sort of artifact Dialogue: 0,0:50:14.00,0:50:18.00,Default,,0000,0000,0000,,to think about well, how you get there and what kind of things you can use. Dialogue: 0,0:50:18.00,0:50:19.00,Default,,0000,0000,0000,,So Wednesday we will talk about some big picture design, some kinda wrap stuff, and I will Dialogue: 0,0:50:19.00,0:50:22.00,Default,,0000,0000,0000,,see Dialogue: 0,0:50:22.00,0:50:23.00,Default,,0000,0000,0000,,you