WEBVTT 00:00:00.876 --> 00:00:02.026 Look at these images. 00:00:02.051 --> 00:00:04.685 Now, tell me which Obama here is real. NOTE Paragraph 00:00:04.710 --> 00:00:07.568 (Video) Barack Obama: To help families refinance their homes, 00:00:07.595 --> 00:00:10.241 to invest in things like high-tech manufacturing, 00:00:10.266 --> 00:00:11.424 clean energy, 00:00:11.449 --> 00:00:14.228 and the infrastructure that creates good new jobs. NOTE Paragraph 00:00:14.647 --> 00:00:16.129 Supasorn Suwajanakorn: Anyone? 00:00:16.155 --> 00:00:18.029 The answer is none of them. NOTE Paragraph 00:00:18.053 --> 00:00:19.167 (Laughter) NOTE Paragraph 00:00:19.191 --> 00:00:20.977 None of these is actually real. 00:00:21.001 --> 00:00:22.841 So let me tell you how we got here. NOTE Paragraph 00:00:23.851 --> 00:00:28.056 My inspiration for this work was a project meant to preserve 00:00:28.127 --> 00:00:31.912 our last chance for learning about the Holocaust from the survivors. 00:00:32.769 --> 00:00:35.380 It's called New Dimensions in Testimony, 00:00:35.420 --> 00:00:38.539 and it allows you to have interactive conversations 00:00:38.570 --> 00:00:41.126 with a hologram of a real Holocaust survivor. NOTE Paragraph 00:00:41.793 --> 00:00:43.759 (Video) Question: How did you survive the Holocaust? NOTE Paragraph 00:00:43.783 --> 00:00:45.451 (Video) Answer: How did I survive? 00:00:45.912 --> 00:00:47.719 I survived, 00:00:48.419 --> 00:00:49.906 I believe, 00:00:49.970 --> 00:00:52.993 because providence watched over me. NOTE Paragraph 00:00:53.573 --> 00:00:57.027 SS: Turns out these answers were pre-recorded in a studio. 00:00:57.051 --> 00:00:59.408 Yet the effect is astounding. 00:00:59.527 --> 00:01:03.146 You feel so connected to his story and to him as a person. 00:01:04.011 --> 00:01:07.296 I think there is something special about human interaction 00:01:07.336 --> 00:01:09.998 that makes it much more profound 00:01:10.117 --> 00:01:11.457 and personal 00:01:11.482 --> 00:01:15.824 than what books or lectures or movies could ever teach us. NOTE Paragraph 00:01:16.267 --> 00:01:18.605 So I saw this and began to wonder, 00:01:18.716 --> 00:01:21.526 can we create a model like this for anyone? 00:01:21.550 --> 00:01:24.525 A model that looks, talks, and acts just like them? 00:01:25.573 --> 00:01:27.582 So I set out to see if this could be done, 00:01:27.604 --> 00:01:29.858 and eventually came up with a new solution 00:01:29.938 --> 00:01:33.158 that can build a model of a person using nothing but these: 00:01:33.747 --> 00:01:35.961 existing photos and videos of a person. 00:01:36.701 --> 00:01:39.287 If you can leverage this kind of passive information, 00:01:39.342 --> 00:01:41.334 just photos and video that are out there, 00:01:41.373 --> 00:01:43.429 that's the key to scaling to anyone. NOTE Paragraph 00:01:44.119 --> 00:01:45.896 By the way, here's Richard Feynman, 00:01:45.920 --> 00:01:49.325 who in addition to being a Nobel Prize Winner in physics 00:01:49.357 --> 00:01:51.810 was also known as a legendary teacher. 00:01:53.080 --> 00:01:55.272 Wouldn't it be great if we could bring him back 00:01:55.302 --> 00:01:58.567 to give his lectures and inspire millions of kids, 00:01:58.591 --> 00:02:01.583 perhaps not just in English but in any language? 00:02:02.441 --> 00:02:07.015 Or if you could ask our grandparents for advice and hear those comforting words 00:02:07.067 --> 00:02:08.837 even if they are no longer with us? 
00:02:09.683 --> 00:02:12.993 Or maybe using this tool, book authors, alive or not, 00:02:13.103 --> 00:02:16.040 could read aloud all of their books for anyone interested. NOTE Paragraph 00:02:17.199 --> 00:02:19.635 The creative possibilities here are endless, 00:02:19.660 --> 00:02:21.373 and to me that's very exciting. 00:02:22.595 --> 00:02:24.597 And here's how it's working so far. NOTE Paragraph 00:02:24.621 --> 00:02:26.288 First, we introduce a new technique 00:02:26.312 --> 00:02:29.761 that can reconstruct a highly detailed 3D face model 00:02:29.785 --> 00:02:30.935 from any image 00:02:30.986 --> 00:02:33.105 without ever 3D-scanning the person. 00:02:33.890 --> 00:02:36.532 And here's the same output model from different views. 00:02:37.969 --> 00:02:39.471 This also works on videos 00:02:39.495 --> 00:02:42.260 by running the same algorithm on each video frame 00:02:42.284 --> 00:02:44.836 and generating a moving 3D model. 00:02:45.501 --> 00:02:48.310 And here's the same output model from different angles. NOTE Paragraph 00:02:49.933 --> 00:02:52.427 It turns out this problem is very challenging, 00:02:52.491 --> 00:02:56.269 but the key trick is that we are going to analyze a large photo collection 00:02:56.293 --> 00:02:57.848 of the person beforehand. 00:02:58.650 --> 00:03:01.189 For George W. Bush, we can just search on Google, 00:03:02.309 --> 00:03:04.786 and from that, we are able to build an average model, 00:03:04.832 --> 00:03:07.888 then an iteratively refined model to recover the expression 00:03:07.967 --> 00:03:10.303 in fine details, like creases and wrinkles. 00:03:11.326 --> 00:03:12.728 What's fascinating about this 00:03:12.753 --> 00:03:16.052 is that the photo collection can come from your typical photos. 00:03:16.200 --> 00:03:18.800 It doesn't really matter what expression you're making 00:03:18.827 --> 00:03:20.578 or where you took those photos. 00:03:20.736 --> 00:03:23.136 What matters is that there are a lot of them. 00:03:23.160 --> 00:03:24.895 And we are still missing color here, 00:03:24.920 --> 00:03:27.197 so next we develop a new blending technique 00:03:27.292 --> 00:03:30.073 that improves upon a single averaging method 00:03:30.152 --> 00:03:32.970 and produces sharp facial textures and colors. 00:03:33.779 --> 00:03:36.550 And this can be done for any expression. NOTE Paragraph 00:03:37.485 --> 00:03:39.907 Now we have control of a model of a person, 00:03:40.008 --> 00:03:43.555 and the way it's controlled now is by a sequence of static photos. 00:03:43.827 --> 00:03:46.953 Notice how the wrinkles come and go depending on the expression. 00:03:48.109 --> 00:03:50.855 We can also use a video to drive the model. NOTE Paragraph 00:03:50.879 --> 00:03:54.489 (Video) Daniel Craig: Rory, but somehow we've managed to attract 00:03:54.513 --> 00:03:57.267 some more amazing people. NOTE Paragraph 00:03:58.021 --> 00:03:59.663 SS: And here's another fun demo. 00:03:59.687 --> 00:04:01.848 So what you see here are controllable models 00:04:01.957 --> 00:04:04.226 of people I built from their internet photos. 00:04:04.425 --> 00:04:07.162 Now, if you transfer the motion from the input video, 00:04:07.353 --> 00:04:09.504 we can actually drive the entire party. NOTE Paragraph 00:04:09.529 --> 00:04:11.600 (Video) George W. Bush: It's a difficult bill to pass, 00:04:11.625 --> 00:04:13.981 because there's a lot of moving parts, 00:04:14.052 --> 00:04:19.167 and the legislative process can be ugly.
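NOTE The following is an illustrative sketch added for clarity; it is not the speaker's actual code or method. It shows, in toy form, the two-step idea described above: average a person's face geometry over a large photo collection, then iteratively refine that average toward a single target expression so expression-specific detail (creases, wrinkles) returns. The landmark arrays, photo counts, and step size are made-up stand-ins for illustration only.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real detections: 68 2-D face landmarks in each of 200 photos of one person.
photos = rng.normal(size=(200, 68, 2))

# Step 1: the "average model" -- the mean landmark configuration over the whole collection.
average_model = photos.mean(axis=0)

# Step 2: iterative refinement toward one target photo's expression,
# so detail specific to that expression comes back.
target = photos[0]
model = average_model.copy()
for _ in range(20):
    residual = target - model      # expression detail the model has not captured yet
    model += 0.3 * residual        # move part of the way there on each iteration

print("remaining landmark error:", float(np.abs(target - model).max()))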
NOTE Paragraph 00:04:19.307 --> 00:04:20.890 (Applause) NOTE Paragraph 00:04:20.961 --> 00:04:22.750 SS: So coming back a little bit, 00:04:22.774 --> 00:04:26.012 our ultimate goal, rather, is to capture their mannerisms 00:04:26.037 --> 00:04:29.037 or the unique way each of these people talks and smiles. 00:04:29.106 --> 00:04:31.419 So to do that, can we actually teach the computer 00:04:31.443 --> 00:04:33.625 to imitate the way someone talks 00:04:33.689 --> 00:04:36.109 by only showing it video footage of the person? 00:04:36.898 --> 00:04:39.474 And what I did exactly was, I let a computer watch 00:04:39.499 --> 00:04:42.776 14 hours of pure Barack Obama giving addresses. 00:04:43.443 --> 00:04:46.879 And here's what we can produce given only his audio. NOTE Paragraph 00:04:46.983 --> 00:04:48.702 (Video) BO: The results are clear. 00:04:48.784 --> 00:04:53.052 America's businesses have created 14.5 million new jobs 00:04:53.157 --> 00:04:55.931 over 75 straight months. NOTE Paragraph 00:04:55.955 --> 00:04:58.860 SS: So what's being synthesized here is only the mouth region, 00:04:58.884 --> 00:05:00.424 and here's how we do it. 00:05:00.764 --> 00:05:02.589 Our pipeline uses a neural network 00:05:02.614 --> 00:05:05.550 to convert input audio into these mouth points. NOTE Paragraph 00:05:06.547 --> 00:05:10.654 (Video) BO: We get it through our job or through Medicare or Medicaid. NOTE Paragraph 00:05:10.796 --> 00:05:14.162 SS: Then we synthesize the texture, enhance details and teeth, 00:05:14.240 --> 00:05:17.314 and blend it into the head and background from a source video. NOTE Paragraph 00:05:17.338 --> 00:05:19.243 (Video) BO: Women can get free checkups, 00:05:19.267 --> 00:05:22.235 and you can't get charged more just for being a woman. 00:05:22.973 --> 00:05:26.279 Young people can stay on a parent's plan until they turn 26. NOTE Paragraph 00:05:27.267 --> 00:05:30.171 SS: I think these results seem very realistic and intriguing, 00:05:30.243 --> 00:05:33.288 but at the same time frightening, even to me. NOTE Paragraph 00:05:33.440 --> 00:05:37.455 Our goal was to build an accurate model of a person, not to misrepresent them. 00:05:37.956 --> 00:05:41.067 But one thing that concerns me is its potential for misuse. 00:05:41.958 --> 00:05:44.928 People have been thinking about this problem for a long time, 00:05:44.953 --> 00:05:47.334 since the days when Photoshop first hit the market. 00:05:47.862 --> 00:05:51.623 As a researcher, I'm also working on countermeasure technology, 00:05:51.687 --> 00:05:54.573 and I'm part of an ongoing effort at AI Foundation 00:05:54.653 --> 00:05:58.050 which uses a combination of machine learning and human moderators 00:05:58.074 --> 00:06:00.218 to detect fake images and videos, 00:06:00.242 --> 00:06:01.756 fighting against my own work. 00:06:02.675 --> 00:06:05.868 And one of the tools we plan to release is called Reality Defender, 00:06:05.889 --> 00:06:09.881 which is a web browser plugin that can flag potentially fake content 00:06:09.952 --> 00:06:12.420 automatically right in the browser.
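NOTE The following is a toy sketch added for illustration; it is not the speaker's actual pipeline. It shows the kind of audio-to-mouth mapping described above: a small recurrent network that turns a sequence of per-frame audio feature vectors into 2-D mouth landmark points. The feature size, number of mouth points, and random inputs are assumptions, and the later stages (texture synthesis, teeth enhancement, blending into the source video) are not shown.

import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    """Map per-frame audio features to (x, y) mouth landmark points."""
    def __init__(self, n_audio_features=28, n_mouth_points=18):
        super().__init__()
        self.rnn = nn.LSTM(n_audio_features, 64, batch_first=True)
        self.head = nn.Linear(64, n_mouth_points * 2)

    def forward(self, audio_features):
        hidden, _ = self.rnn(audio_features)    # (batch, frames, 64)
        points = self.head(hidden)              # (batch, frames, 2 * n_mouth_points)
        return points.view(audio_features.shape[0], audio_features.shape[1], -1, 2)

model = AudioToMouth()
fake_audio = torch.randn(1, 100, 28)   # 1 clip, 100 frames of stand-in audio features
mouth_points = model(fake_audio)       # -> (1, 100, 18, 2) mouth landmarks per frame
print(mouth_points.shape)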
NOTE Paragraph 00:06:12.509 --> 00:06:16.715 (Applause) NOTE Paragraph 00:06:16.761 --> 00:06:18.214 Despite all this, though, 00:06:18.238 --> 00:06:20.078 fake videos could do a lot of damage, 00:06:20.102 --> 00:06:23.364 even before anyone has a chance to verify, 00:06:23.420 --> 00:06:26.142 so it's very important that we make everyone aware 00:06:26.166 --> 00:06:28.158 of what's currently possible 00:06:28.197 --> 00:06:31.566 so we can have the right assumptions and be critical about what we see. NOTE Paragraph 00:06:32.423 --> 00:06:37.399 There's still a long way to go before we can fully model individual people 00:06:37.454 --> 00:06:40.240 and before we can ensure the safety of this technology, 00:06:41.097 --> 00:06:42.628 but I'm excited and hopeful 00:06:42.708 --> 00:06:46.239 because if we use it right and carefully, 00:06:46.271 --> 00:06:50.540 this tool can allow any individual's positive impact on the world 00:06:50.604 --> 00:06:52.747 to be massively scaled 00:06:52.818 --> 00:06:55.560 and really help shape our future the way we want it to be. NOTE Paragraph 00:06:55.584 --> 00:06:56.734 Thank you. NOTE Paragraph 00:06:56.759 --> 00:07:01.849 (Applause)