Look at these images. Now, tell me which Obama here is real.

(Video) Barack Obama: To help families refinance their homes, to invest in things like high-tech manufacturing, clean energy and the infrastructure that creates good new jobs.

Supasorn Suwajanakorn: Anyone? The answer is none of them.

(Laughter)

None of these is actually real. So let me tell you how we got here.

My inspiration for this work was a project meant to preserve our last chance for learning about the Holocaust from the survivors. It's called New Dimensions in Testimony, and it allows you to have interactive conversations with a hologram of a real Holocaust survivor.

(Video) Man: How did you survive the Holocaust?

(Video) Hologram: How did I survive? I survived, I believe, because providence watched over me.

SS: Turns out these answers were prerecorded in a studio. Yet the effect is astounding. You feel so connected to his story and to him as a person. I think there's something special about human interaction that makes it much more profound and personal than what books or lectures or movies could ever teach us.

So I saw this and began to wonder: Can we create a model like this for anyone? A model that looks, talks and acts just like them?

So I set out to see if this could be done and eventually came up with a new solution that can build a model of a person using nothing but these: existing photos and videos of a person. If you can leverage this kind of passive information, just photos and video that are out there, that's the key to scaling to anyone.

By the way, here's Richard Feynman, who in addition to being a Nobel Prize winner in physics was also known as a legendary teacher. Wouldn't it be great if we could bring him back to give his lectures and inspire millions of kids, perhaps not just in English but in any language? Or if you could ask our grandparents for advice and hear those comforting words even if they're no longer with us?
Or maybe, using this tool, book authors, alive or not, could read aloud all of their books for anyone interested. The creative possibilities here are endless, and to me, that's very exciting.

And here's how it's working so far. First, we introduce a new technique that can reconstruct a highly detailed 3D face model from any image without ever 3D-scanning the person. And here's the same output model from different views.

This also works on videos, by running the same algorithm on each video frame and generating a moving 3D model. And here's the same output model from different angles.

It turns out this problem is very challenging, but the key trick is that we are going to analyze a large photo collection of the person beforehand. For George W. Bush, we can just search on Google, and from that, we are able to build an average model, then iteratively refine it to recover the expression in fine detail, like creases and wrinkles.

What's fascinating about this is that the photo collection can come from your typical photos. It doesn't really matter what expression you're making or where you took those photos. What matters is that there are a lot of them.

And we are still missing color here, so next, we develop a new blending technique that improves upon a simple averaging method and produces sharp facial textures and colors. And this can be done for any expression.

Now we have control of a model of a person, and the way it's controlled now is by a sequence of static photos. Notice how the wrinkles come and go, depending on the expression.

We can also use a video to drive the model.

(Video) Daniel Craig: Right, but somehow, we've managed to attract some more amazing people.

SS: And here's another fun demo. So what you see here are controllable models of people I built from their internet photos. Now, if you transfer the motion from the input video, we can actually drive the entire party.

(Video) George W. Bush: It's a difficult bill to pass, because there's a lot of moving parts, and the legislative processes can be ugly.

(Applause)
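To make the "build an average model, then iteratively refine it" idea a little more concrete, here is a minimal, hypothetical sketch. It only illustrates the averaging-and-refinement structure the talk describes; the landmark count, step size and iteration count are assumptions, and a real pipeline would operate on full 3D face geometry and image evidence rather than the raw point sets used here.

```python
# Hypothetical sketch of the "average, then iteratively refine" idea:
# many noisy per-photo 3D shape estimates are averaged into one
# person-specific model, which is then nudged toward a single photo's
# expression to recover fine detail such as creases and wrinkles.
# This is an illustration, not the pipeline demonstrated in the talk.

import numpy as np

def average_model(per_photo_shapes: np.ndarray) -> np.ndarray:
    """Average per-photo shape estimates of shape (n_photos, n_points, 3)."""
    return per_photo_shapes.mean(axis=0)

def refine_expression(base: np.ndarray,
                      target: np.ndarray,
                      step: float = 0.5,
                      iterations: int = 5) -> np.ndarray:
    """Iteratively pull the averaged model toward one photo's expression."""
    shape = base.copy()
    for _ in range(iterations):
        shape += step * (target - shape)   # move part of the way each pass
    return shape

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in data: 200 photos, 68 facial points in 3D.
    fits = rng.normal(size=(200, 68, 3))
    neutral = average_model(fits)              # person-specific average
    expressive = refine_expression(neutral, fits[0])
    print(neutral.shape, expressive.shape)     # (68, 3) (68, 3)
```

The value of the large photo collection is exactly what the averaging line expresses: with enough photos, the quirks of any single shot wash out, leaving a stable person-specific base that can then be refined per expression.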
SS: So coming back a little bit, our ultimate goal, rather, is to capture their mannerisms, or the unique way each of these people talks and smiles. So to do that, can we actually teach the computer to imitate the way someone talks by only showing it video footage of the person?

And what I did exactly was, I let a computer watch 14 hours of pure Barack Obama giving addresses. And here's what we can produce given only his audio.

(Video) BO: The results are clear. America's businesses have created 14.5 million new jobs over 75 straight months.

SS: So what's being synthesized here is only the mouth region, and here's how we do it. Our pipeline uses a neural network to convert input audio into these mouth points.

(Video) BO: We get it through our job or through Medicare or Medicaid.

SS: Then we synthesize the texture, enhance details and teeth, and blend it into the head and background from a source video.

(Video) BO: Women can get free checkups, and you can't get charged more just for being a woman. Young people can stay on a parent's plan until they turn 26.

SS: I think these results seem very realistic and intriguing, but at the same time frightening, even to me. Our goal was to build an accurate model of a person, not to misrepresent them. But one thing that concerns me is its potential for misuse.

People have been thinking about this problem for a long time, since the days when Photoshop first hit the market. As a researcher, I'm also working on countermeasure technology, and I'm part of an ongoing effort at AI Foundation, which uses a combination of machine learning and human moderators to detect fake images and videos, fighting against my own work.

And one of the tools we plan to release is called Reality Defender, which is a web-browser plug-in that can flag potentially fake content automatically, right in the browser.

(Applause)
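Stepping back to the synthesis pipeline for a moment: the step where a neural network converts input audio into mouth points can be pictured with a minimal, hypothetical sketch like the one below. Every concrete choice in it (the 28-dimensional audio features, the LSTM, the 20-point mouth) is an assumption for illustration, not the model actually trained on the 14 hours of footage.

```python
# Hypothetical sketch of the audio-to-mouth step described earlier: a small
# recurrent network maps a sequence of per-frame audio features to 2D mouth
# landmark positions. Sizes are illustrative assumptions, not the trained
# model behind the Obama results.

import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    def __init__(self, audio_dim=28, hidden_dim=128, n_landmarks=20):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_landmarks * 2)  # (x, y) per point

    def forward(self, audio_features):
        # audio_features: (batch, frames, audio_dim), e.g. MFCC-like vectors
        hidden, _ = self.rnn(audio_features)
        points = self.head(hidden)                 # (batch, frames, 2 * landmarks)
        return points.reshape(*points.shape[:2], -1, 2)

if __name__ == "__main__":
    model = AudioToMouth()
    audio = torch.randn(1, 100, 28)                # 100 frames of audio features
    mouth = model(audio)
    print(mouth.shape)                             # torch.Size([1, 100, 20, 2])
```

A real system of this kind would be trained on pairs of audio features and mouth landmarks tracked in the source footage, and its predicted points would then feed the texture synthesis, teeth enhancement and blending steps described above.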
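As a purely illustrative companion to the countermeasure effort just described, here is a minimal sketch of how a machine-learning scorer could be combined with human moderators: a small classifier assigns each image a fake probability, confident detections are flagged automatically, and uncertain cases go to a human review queue. The model, thresholds and routing are assumptions, not Reality Defender's actual design.

```python
# Hypothetical sketch of an "ML model plus human moderators" triage flow
# for flagging possibly fake images. Purely illustrative; not the actual
# implementation of the tool described in the talk.

import torch
import torch.nn as nn

class FakeImageScorer(nn.Module):
    """Tiny CNN that outputs a probability that an image is synthetic."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 1)

    def forward(self, images):                     # images: (batch, 3, H, W)
        x = self.features(images).flatten(1)       # (batch, 32)
        return torch.sigmoid(self.classifier(x))   # fake probability in [0, 1]

def triage(prob, flag_above=0.9, review_above=0.5):
    """Route each image: auto-flag, send to human moderators, or pass."""
    if prob >= flag_above:
        return "flag"           # confidently fake: warn the viewer automatically
    if prob >= review_above:
        return "human_review"   # uncertain: queue for a human moderator
    return "pass"

if __name__ == "__main__":
    scorer = FakeImageScorer()
    batch = torch.rand(4, 3, 224, 224)             # stand-in images
    for p in scorer(batch).squeeze(1).tolist():
        print(round(p, 3), triage(p))
```

The thresholds are the interesting design knob: set them too low and moderators drown in borderline cases; set them too high and subtle fakes slip through unflagged.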
Despite all of these countermeasures, though, fake videos could do a lot of damage even before anyone has a chance to verify them, so it's very important that we make everyone aware of what's currently possible, so we can have the right assumptions and be critical about what we see.

There's still a long way to go before we can fully model individual people and before we can ensure the safety of this technology. But I'm excited and hopeful, because if we use it right and carefully, this tool can allow any individual's positive impact on the world to be massively scaled and really help shape our future the way we want it to be.

Thank you.

(Applause)