1 00:00:01,075 --> 00:00:02,277 Look at these images. 2 00:00:02,277 --> 00:00:04,935 Now, tell me which Obama here is real. 3 00:00:04,935 --> 00:00:07,320 (Video) Barack Obama: To help families refinance their homes, 4 00:00:07,320 --> 00:00:10,117 to invest in things like high-tech manufacturing, 5 00:00:10,117 --> 00:00:11,857 clean energy, 6 00:00:11,857 --> 00:00:14,660 and the infrastructure that creates good new jobs. 7 00:00:14,660 --> 00:00:16,029 Supasorn Suwajanakorn: Anyone? 8 00:00:16,029 --> 00:00:18,313 The answer is none of them. 9 00:00:18,313 --> 00:00:19,451 (Laughter) 10 00:00:19,451 --> 00:00:21,261 None of these is actually real. 11 00:00:21,261 --> 00:00:24,111 So let me tell you how we got here. 12 00:00:24,111 --> 00:00:28,207 My inspiration for this work was a project meant to preserve 13 00:00:28,207 --> 00:00:32,566 our last chance for learning about the Holocaust from the survivors. 14 00:00:32,566 --> 00:00:35,223 It's called New Dimensions in Testimony, 15 00:00:35,223 --> 00:00:38,185 and it allows you to have interactive conversations 16 00:00:38,185 --> 00:00:41,590 with a hologram of a real Holocaust survivor. 17 00:00:41,590 --> 00:00:44,007 (Video) Question: How did you survive the Holocaust? 18 00:00:44,007 --> 00:00:46,129 (Video) Answer: How did I survive? 19 00:00:46,129 --> 00:00:48,003 I survived, 20 00:00:48,003 --> 00:00:49,742 I believe, 21 00:00:49,742 --> 00:00:53,635 because providence watched over me. 22 00:00:53,635 --> 00:00:57,311 SS: Turns out these answers were pre-recorded in a studio. 23 00:00:57,311 --> 00:00:59,396 Yet the effect is astounding. 24 00:00:59,396 --> 00:01:03,968 You feel so connected to his story and to him as a person. 
25 00:01:03,968 --> 00:01:07,351 I think there is something special about human interaction 26 00:01:07,351 --> 00:01:09,931 that makes it much more profound 27 00:01:09,931 --> 00:01:11,355 and personal 28 00:01:11,355 --> 00:01:16,108 than what books or lectures or movies could ever teach us. 29 00:01:16,108 --> 00:01:18,272 So I saw this and began to wonder, 30 00:01:18,272 --> 00:01:21,810 can we create a model like this for anyone? 31 00:01:21,810 --> 00:01:25,123 A model that looks, talks, and acts just like them? 32 00:01:25,123 --> 00:01:27,666 So I set out to see if this could be done, 33 00:01:27,666 --> 00:01:30,045 and eventually came up with a new solution 34 00:01:30,045 --> 00:01:31,619 that can build a model of a person 35 00:01:31,619 --> 00:01:33,348 using nothing but these: 36 00:01:34,063 --> 00:01:36,961 existing photos and videos of a person. 37 00:01:36,961 --> 00:01:39,375 If you can leverage this kind of passive information, 38 00:01:39,375 --> 00:01:41,518 just photos and video that are out there, 39 00:01:41,518 --> 00:01:44,233 that's the key to scaling to anyone. 40 00:01:44,233 --> 00:01:46,180 By the way, here's Richard Feynman, 41 00:01:46,180 --> 00:01:49,059 who in addition to being a Nobel Prize winner in physics 42 00:01:49,059 --> 00:01:52,479 was also known as a legendary teacher. 43 00:01:52,479 --> 00:01:55,126 Wouldn't it be great if we could bring him back 44 00:01:55,126 --> 00:01:58,851 to give his lectures and inspire millions of kids, 45 00:01:58,851 --> 00:02:02,701 perhaps not just in English but in any language? 46 00:02:02,701 --> 00:02:07,299 Or if you could ask our grandparents for advice and hear those comforting words 47 00:02:07,299 --> 00:02:09,576 even if they are no longer with us? 48 00:02:09,576 --> 00:02:13,277 Or maybe using this tool, book authors, alive or not, 49 00:02:13,277 --> 00:02:17,578 could read aloud all of their books for anyone interested. 
50 00:02:17,578 --> 00:02:20,296 The creative possibilities here are endless, 51 00:02:20,296 --> 00:02:22,663 and to me that's very exciting. 52 00:02:22,663 --> 00:02:24,881 And here's how it's working so far. 53 00:02:24,881 --> 00:02:26,543 First, we introduce a new technique 54 00:02:26,543 --> 00:02:30,045 that can reconstruct a highly detailed 3D face model 55 00:02:30,045 --> 00:02:31,154 from any image 56 00:02:31,154 --> 00:02:33,767 without ever 3D-scanning the person. 57 00:02:33,767 --> 00:02:37,421 And here's the same output model from different views. 58 00:02:37,421 --> 00:02:39,755 This also works on videos 59 00:02:39,755 --> 00:02:42,544 by running the same algorithm on each video frame 60 00:02:42,544 --> 00:02:45,120 and generating a moving 3D model. 61 00:02:45,120 --> 00:02:49,674 And here's the same output model from different angles. 62 00:02:50,193 --> 00:02:52,536 It turns out, this problem is very challenging, 63 00:02:52,536 --> 00:02:56,553 but the key trick is that we are going to analyze a large photo collection 64 00:02:56,553 --> 00:02:58,859 of the person beforehand. 65 00:02:58,859 --> 00:03:02,569 For George W. Bush, we can just search on Google, 66 00:03:02,569 --> 00:03:05,055 and from that, we are able to build an average model, 67 00:03:05,055 --> 00:03:08,107 an iterative, refined model to recover the expression 68 00:03:08,107 --> 00:03:10,563 in fine details like creases and wrinkles. 69 00:03:11,586 --> 00:03:14,363 What's fascinating about this is that the photo collection 70 00:03:14,363 --> 00:03:16,336 can come from your typical photos. 71 00:03:16,336 --> 00:03:18,849 It doesn't really matter what expression you're making 72 00:03:18,849 --> 00:03:20,862 or where you took those photos. 73 00:03:20,862 --> 00:03:23,420 What matters is that there are a lot of them. 
74 00:03:23,420 --> 00:03:25,348 And we are still missing color here, 75 00:03:25,348 --> 00:03:27,474 so next we develop a new blending technique 76 00:03:27,474 --> 00:03:30,357 that improves upon a single averaging method 77 00:03:30,357 --> 00:03:33,512 and produces sharp facial textures and colors. 78 00:03:33,512 --> 00:03:36,834 And this can be done for any expression. 79 00:03:36,834 --> 00:03:40,044 Now we have control of a model of a person, 80 00:03:40,044 --> 00:03:43,839 and the way it's controlled now is by a sequence of static photos. 81 00:03:43,839 --> 00:03:47,961 Notice how the wrinkles come and go depending on the expression. 82 00:03:48,369 --> 00:03:51,139 We can also use a video to drive the model. 83 00:03:51,139 --> 00:03:54,773 (Video) Daniel Craig: Rory, but somehow we've managed to attract 84 00:03:54,773 --> 00:03:58,042 more amazing people. 85 00:03:58,042 --> 00:03:59,947 SS: And here's another fun demo. 86 00:03:59,947 --> 00:04:02,132 So what you see here are controllable models 87 00:04:02,132 --> 00:04:04,510 of people I built from their internet photos. 88 00:04:04,510 --> 00:04:07,446 Now, if you transfer the motion from the input video, 89 00:04:07,446 --> 00:04:09,694 we can actually drive the entire party. 90 00:04:09,882 --> 00:04:11,977 (Video) George W. Bush: It's a very difficult bill to pass, 91 00:04:11,977 --> 00:04:13,993 because there's a lot of moving parts, 92 00:04:13,993 --> 00:04:17,709 and the legislative process can be ugly. 93 00:04:18,877 --> 00:04:20,898 (Applause) 94 00:04:20,898 --> 00:04:23,034 SS: So coming back a little bit, 95 00:04:23,034 --> 00:04:26,342 our ultimate goal, rather, is to capture their mannerisms 96 00:04:26,342 --> 00:04:29,366 or the unique way each of these people talks and smiles. 
97 00:04:29,366 --> 00:04:31,703 So to do that, can we actually teach the computer 98 00:04:31,703 --> 00:04:33,718 to imitate the way someone talks 99 00:04:33,718 --> 00:04:37,158 by only showing it video footage of the person? 100 00:04:37,158 --> 00:04:40,544 And what I did exactly was, I let a computer watch 101 00:04:40,544 --> 00:04:43,420 14 hours of pure Barack Obama giving addresses. 102 00:04:43,420 --> 00:04:46,747 And here's what we can produce given only his audio. 103 00:04:46,747 --> 00:04:48,886 (Video) BO: The results are clear. 104 00:04:48,886 --> 00:04:53,344 America's businesses have created 14.5 million new jobs 105 00:04:53,344 --> 00:04:56,215 over 75 straight months. 106 00:04:56,215 --> 00:04:59,112 SS: So what's being synthesized here is only the mouth region, 107 00:04:59,112 --> 00:05:00,708 and here's how we do it. 108 00:05:00,708 --> 00:05:03,020 Our pipeline uses a neural network 109 00:05:03,020 --> 00:05:06,807 to convert an input audio into these mouth points. 110 00:05:06,807 --> 00:05:10,666 (Video) BO: We get it through our job or through Medicare or Medicaid. 111 00:05:10,666 --> 00:05:14,446 SS: Then we synthesize the texture, enhance details and teeth, 112 00:05:14,446 --> 00:05:17,598 and blend it into the head and background from a source video. 113 00:05:17,598 --> 00:05:19,523 (Video) BO: Women can get free checkups, 114 00:05:19,523 --> 00:05:22,863 and you can't get charged more just for being a woman. 115 00:05:22,863 --> 00:05:26,563 Young people can stay on a parent's plan until they turn 26. 116 00:05:26,563 --> 00:05:30,194 SS: I think these results seem very realistic and intriguing, 117 00:05:30,194 --> 00:05:32,023 but at the same time frightening, 118 00:05:32,023 --> 00:05:33,572 even to me. 119 00:05:33,572 --> 00:05:38,113 Our goal was to build an accurate model of a person, not to misrepresent them. 120 00:05:38,113 --> 00:05:41,626 But one thing that concerns me is its potential for misuse. 
121 00:05:41,626 --> 00:05:45,512 People have been thinking about this problem for a long time, 122 00:05:45,512 --> 00:05:47,665 since the days when Photoshop first hit the market. 123 00:05:47,665 --> 00:05:51,756 As a researcher, I'm also working on countermeasure technology, 124 00:05:51,756 --> 00:05:54,609 and I'm part of an ongoing effort at AI Foundation 125 00:05:54,609 --> 00:05:58,334 which uses a combination of machine learning and human moderators 126 00:05:58,334 --> 00:06:00,502 to detect fake images and videos, 127 00:06:00,502 --> 00:06:02,785 fighting against my own work. 128 00:06:02,785 --> 00:06:05,854 And one of the tools we plan to release is called Reality Defender, 129 00:06:05,854 --> 00:06:09,922 which is a web browser plugin that can flag potentially fake content 130 00:06:09,922 --> 00:06:12,080 automatically right in the browser. 131 00:06:12,080 --> 00:06:14,858 (Applause) 132 00:06:17,021 --> 00:06:18,498 Despite all this, though, 133 00:06:18,498 --> 00:06:20,362 fake videos could do a lot of damage, 134 00:06:20,362 --> 00:06:23,303 even before anyone has a chance to verify, 135 00:06:23,303 --> 00:06:26,426 so it's very important that we make everyone aware 136 00:06:26,426 --> 00:06:27,944 of what's currently possible 137 00:06:27,944 --> 00:06:31,676 so we can have the right assumption and be critical about what we see. 138 00:06:31,676 --> 00:06:36,787 There's still a long way to go before we can fully model individual people 139 00:06:36,787 --> 00:06:40,737 and before we can ensure the safety of this technology, 140 00:06:40,737 --> 00:06:42,794 but I'm excited and hopeful 141 00:06:42,794 --> 00:06:45,961 because if we use it right and carefully, 142 00:06:45,961 --> 00:06:50,497 this tool can allow any individual's positive impact on the world 143 00:06:50,497 --> 00:06:52,211 to be massively scaled 144 00:06:52,211 --> 00:06:55,844 and really help shape our future the way we want it to be. 
145 00:06:55,844 --> 00:06:57,055 Thank you. 146 00:06:57,055 --> 00:07:01,658 (Applause)