WEBVTT

00:00:04.108 --> 00:00:06.769
♪ (Music fades in) ♪

00:00:15.092 --> 00:00:19.394
(Chirping)

00:00:24.559 --> 00:00:30.051
(Vocalizations, different languages)

00:00:32.488 --> 00:00:35.487
(Talking overlaps in background)

00:00:36.347 --> 00:00:37.786
(Laughing)

00:00:38.486 --> 00:00:43.801
(Computerized beeping)

00:00:45.241 --> 00:00:46.556
(Beep)

00:00:49.116 --> 00:00:54.431
(Man) We come into this world with the innate ability to learn to interact

00:00:54.431 --> 00:00:57.320
with other sentient beings.

00:00:58.060 --> 00:00:59.767
(Child vocalizing)

00:01:00.116 --> 00:01:02.542
(Man) Suppose you had to interact with other people by writing little messages.

00:01:03.452 --> 00:01:04.768
(Child vocalizes)

00:01:05.258 --> 00:01:07.396
(Man) It'd be a real pain.

00:01:07.396 --> 00:01:09.486
(Man) And that's how we interact with computers.

00:01:09.490 --> 00:01:11.647
It's much easier just to talk to them... just so much easier...

00:01:12.137 --> 00:01:13.314
(Child vocalizes)

00:01:13.804 --> 00:01:16.491
(Man) If the computers could understand what we're saying.

00:01:17.814 --> 00:01:20.342
For that, you need really good speech recognition.

00:01:20.547 --> 00:01:23.859
(Narrator) The first speech recognition system was developed by Bell Laboratories

00:01:23.859 --> 00:01:28.108
in 1952. It could only recognize numbers spoken by one person.

00:01:28.108 --> 00:01:31.500
In the 1970s, Carnegie-Mellon came out with the Harpy System.

00:01:31.500 --> 00:01:36.522
This was able to recognize over 1,000 words and different pronunciations

00:01:36.522 --> 00:01:39.563
(Narrator) of the same word.
- (Man) Tomato
- (Woman) Tomato

00:01:39.563 --> 00:01:42.543
(Narrator) Speech recognition continued in the 80s with the introduction of the

00:01:42.543 --> 00:01:45.576
Hidden Markov Model, which used a more mathematical approach

00:01:45.576 --> 00:01:50.193
to analyzing sound waves that led to many breakthroughs we have today.
00:01:50.197 --> 00:01:52.904
You're taking in very raw audio waveforms

00:01:52.904 --> 00:01:54.623
like you get through a microphone

00:01:54.623 --> 00:01:55.858
on your phone

00:01:55.858 --> 00:01:56.831
or whatever...

00:01:56.966 --> 00:02:02.446
(Woman) We chop it into small pieces and it tries to identify which phoneme

00:02:02.446 --> 00:02:05.498
was spoken in that piece of speech.

00:02:05.498 --> 00:02:09.450
- A phoneme is a primitive unit for expressing words.

00:02:09.530 --> 00:02:13.924
(voicing phonemes shown above)

00:02:14.635 --> 00:02:19.757
And then it stitches those together into likely words, like "Palo Alto."

00:02:19.757 --> 00:02:23.593
- Speech recognition today is good at transcribing what you've said...

00:02:23.593 --> 00:02:25.355
(Man, to phone) What's the weather like in Topeka?

00:02:25.355 --> 00:02:30.122
(Man) You can talk about travels, your contacts, like, "Where can I get pizza?"

00:02:30.122 --> 00:02:32.072
(Phone) Here are the listings for pizza.

00:02:32.072 --> 00:02:34.281
(Man) "How tall is the Eiffel Tower?"
(Phone) The Eiffel Tower is ...

00:02:34.326 --> 00:02:36.912
(Woman) We've made tremendous improvements very quickly.

00:02:36.912 --> 00:02:39.212
(Man, to phone) Who is the 21st President of the United States?

00:02:39.560 --> 00:02:42.360
(Phone beeps)
(Phone) Chester A. Arthur was the 21st...

00:02:42.360 --> 00:02:44.397
(Man, to phone) Okay, Google, where is he from?

00:02:44.397 --> 00:02:47.303
(Man) Years ago, you had to be an engineer to interact with computers.

00:02:47.953 --> 00:02:50.023
Today, everybody can interact.

00:02:50.234 --> 00:02:53.772
- One thing still in its infancy is understanding.

00:02:53.772 --> 00:02:56.412
- We need a far more sophisticated language understanding model

00:02:56.412 --> 00:02:58.592
that understands what the sentence means.

00:02:58.592 --> 00:03:00.902
We're still a very long way from that.
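The pipeline the speakers describe — chop the waveform into small pieces, identify the phoneme in each piece, then stitch the phonemes into likely words — can be sketched in miniature. Everything below (the energy "templates", the tiny lexicon, the frame size) is invented purely for illustration; real recognizers use trained acoustic models, not lookups.

```python
FRAME_SIZE = 4

# Invented per-phoneme "acoustic templates": mean frame energy.
PHONEME_TEMPLATES = {"k": 0.9, "ae": 0.5, "t": 0.1}

# Invented pronunciation dictionary: phoneme sequence -> word.
LEXICON = {("k", "ae", "t"): "cat"}

def chop(signal, frame_size=FRAME_SIZE):
    """Chop raw samples into fixed-size frames ("small pieces")."""
    return [signal[i:i + frame_size]
            for i in range(0, len(signal), frame_size)]

def classify_frame(frame):
    """Pick the phoneme whose template energy is closest to this frame's."""
    energy = sum(abs(s) for s in frame) / len(frame)
    return min(PHONEME_TEMPLATES,
               key=lambda p: abs(PHONEME_TEMPLATES[p] - energy))

def recognize(signal):
    """Chop, classify each frame, collapse repeats, look up the word."""
    phonemes = [classify_frame(f) for f in chop(signal)]
    collapsed = tuple(p for i, p in enumerate(phonemes)
                      if i == 0 or p != phonemes[i - 1])
    return LEXICON.get(collapsed, "<unknown>")

# Samples whose frame energies roughly match "k", "ae", "t" in order.
signal = [0.9] * 4 + [0.5] * 4 + [0.1] * 4
print(recognize(signal))  # -> cat
```

The stitching step here is a plain dictionary lookup; in a real system, a Hidden Markov Model scores competing phoneme sequences probabilistically before committing to a word.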
00:03:01.187 --> 00:03:02.511
(Beeping)

00:03:03.501 --> 00:03:06.912
♪ (Soft background music) ♪

00:03:07.745 --> 00:03:11.625
(Woman) Our ability to use language is one of the things that helps us have culture.

00:03:13.467 --> 00:03:18.649
It's one of the things that helps us pass on traditions across generations.

00:03:19.516 --> 00:03:25.975
Figuring out how the system of language works, even though it seems easy,

00:03:25.982 --> 00:03:32.666
turns out to be very hard, but it's one that every baby understands by 2 years old.

00:03:32.666 --> 00:03:37.529
(Girl) There's two of them.
(Woman) There's two Ls, yeah. (spells word)

00:03:38.466 --> 00:03:41.045
- Language is extremely complex and sophisticated...

00:03:41.086 --> 00:03:42.445
- From the semantics

00:03:42.445 --> 00:03:43.819
- (Man in chair) Ironies...
- (Woman) Strong accents...

00:03:43.819 --> 00:03:44.887
- (Man) Facial expressions...

00:03:44.887 --> 00:03:47.445
- Human emotions, because that's part of how we communicate.

00:03:47.445 --> 00:03:48.846
- Humor...

00:03:48.846 --> 00:03:51.573
(Aside) Do I have to be careful not to offend the dinosaur?

00:03:51.573 --> 00:03:54.187
- Language has so many different layers and that's why it's

00:03:54.187 --> 00:03:56.126
such a difficult problem.

00:03:56.126 --> 00:03:59.214
(Man) The present human brain and the learning algorithms in it

00:03:59.214 --> 00:04:01.997
are far, far better at things like language understanding

00:04:02.179 --> 00:04:04.781
and they're still a lot better at pun recognition.

00:04:05.032 --> 00:04:09.379
- Whether or not we replicate exactly what the brain does, to understand

00:04:09.379 --> 00:04:14.194
language and speech, is still a question.

00:04:15.125 --> 00:04:16.951
(Beeping)

00:04:17.506 --> 00:04:23.306
(Man) For many years, we believed that neural networks should work better than

00:04:23.306 --> 00:04:26.914
the dumb existing technology that's basically just "table look-up."
00:04:27.443 --> 00:04:33.481
Then, in 2009, two of my students (with some help from me) got it

00:04:33.481 --> 00:04:36.849
working better. The first time it was just a little better.

00:04:36.849 --> 00:04:40.165
But it was obvious that this could be improved to work much better.

00:04:40.485 --> 00:04:44.442
(Man) The brain has this system of neurons all computing in parallel.

00:04:44.537 --> 00:04:48.882
All knowledge in the brain is in the strength of connection between neurons.

00:04:49.212 --> 00:04:53.219
What I mean by "neural net" is something that is simulated on a conventional

00:04:53.219 --> 00:04:58.624
computer, but is designed to work in roughly the same ways as the brain.

00:04:59.804 --> 00:05:03.954
Until quite recently, people got features by hand-engineering them.

00:05:04.203 --> 00:05:08.600
They looked at sine waves and did Fourier analysis and tried to figure out

00:05:08.600 --> 00:05:11.925
what features they should feed to the pattern recognition system.

00:05:12.045 --> 00:05:14.517
The thing about neural networks is that they learn their own features.

00:05:14.569 --> 00:05:19.552
In particular, they can learn features and features of features, etc.,

00:05:21.065 --> 00:05:23.770
and that's led to huge improvements in speech recognition.

00:05:23.860 --> 00:05:26.710
- But you can also use them for language understanding tasks.

00:05:26.842 --> 00:05:32.690
How you do this is to represent words in very high-dimensional spaces.

00:05:32.724 --> 00:05:36.200
- (Man) We can now deal with analogies where a word is represented as a list

00:05:36.200 --> 00:05:37.567
of numbers.

00:05:37.748 --> 00:05:44.292
For example, if I take 100 numbers that represent "Paris," and I subtract from it

00:05:44.292 --> 00:05:49.669
"France" and add "Italy," if I look at the numbers I have, the closest

00:05:49.669 --> 00:05:53.000
thing is a list of numbers that represents "Rome."
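The "Paris minus France plus Italy is closest to Rome" arithmetic just described can be sketched with toy word vectors. The 2-dimensional embeddings below are invented for illustration; real systems learn vectors of 100 or more dimensions from large text corpora with a neural net.

```python
import math

# Invented 2-D "embeddings"; one axis roughly separates cities from
# countries, the other separates France-related from Italy-related words.
EMBEDDINGS = {
    "Paris":  [0.9, 0.1],
    "France": [0.9, 0.9],
    "Rome":   [0.1, 0.1],
    "Italy":  [0.1, 0.9],
    "Berlin": [0.5, 0.1],
}

def analogy(a, b, c):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    target = [EMBEDDINGS[a][i] - EMBEDDINGS[b][i] + EMBEDDINGS[c][i]
              for i in range(2)]
    # Exclude the three query words themselves, as is standard practice.
    candidates = (w for w in EMBEDDINGS if w not in {a, b, c})
    return min(candidates, key=lambda w: math.dist(EMBEDDINGS[w], target))

print(analogy("Paris", "France", "Italy"))  # -> Rome
```

The same arithmetic runs in reverse (Rome − Italy + France lands near Paris), which is what makes learned word vectors useful for analogical reasoning rather than just lookup.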
00:05:53.155 --> 00:05:58.038
By first converting words into numbers, using a neural net, you can actually

00:05:58.038 --> 00:06:00.772
do this analogical reasoning.

00:06:01.194 --> 00:06:06.286
I predict that, in the next five years, it will be clear that these neural networks

00:06:06.286 --> 00:06:10.832
with new learning algorithms will give us much better language understanding.

00:06:13.476 --> 00:06:19.079
(Woman) When we started out, we thought things like chess or mathematics or logic

00:06:19.079 --> 00:06:21.303
would be things that were really hard.

00:06:21.303 --> 00:06:26.177
They're not that hard. We ended up with a machine that played as well as

00:06:26.177 --> 00:06:28.489
a Grand Master at chess.

00:06:28.489 --> 00:06:33.092
What we thought would be easy for a computer system, like language,

00:06:33.092 --> 00:06:36.730
has turned out to be incredibly hard.

00:06:36.730 --> 00:06:41.692
(Man) I can't even imagine the moment of success quite yet because there are so many

00:06:41.692 --> 00:06:46.534
pieces of this puzzle that are unsolved, both from a science point of view

00:06:46.534 --> 00:06:50.573
as well as a technical implementation point of view.

00:06:50.573 --> 00:06:52.372
There are a lot of unknowns.

00:06:52.372 --> 00:06:56.257
(Woman) Those are the great revolutions. Not just when we fiddle with what

00:06:56.257 --> 00:07:00.230
we already know, but when we discover something completely new and unexpected.

00:07:00.230 --> 00:07:03.252
(Man) Once you are in the area of

00:07:03.252 --> 00:07:08.534
human-level performance, that will be pretty remarkable.

00:07:12.760 --> 00:07:14.361
(Beep)