WEBVTT

00:00:07.360 --> 00:00:11.760
Machine learning is only as good as the training data you put into it.

00:00:11.800 --> 00:00:15.820
So, it's super important to use high-quality data, and lots of it.

00:00:16.760 --> 00:00:21.960
But if data is important, it's worth asking: where does training data come from?

00:00:22.280 --> 00:00:26.260
Often, computers are collecting training data from people like you and me,

00:00:26.260 --> 00:00:27.860
without any effort on our part.

00:00:28.440 --> 00:00:31.480
A video streaming service might keep track of what you watch. Then it can recognize patterns

00:00:31.660 --> 00:00:36.000
in that data to recommend what you might want to watch next.

00:00:37.420 --> 00:00:43.200
Other times, you're directly asked to help, like when a website asks you to spot street signs in photos.

00:00:43.780 --> 00:00:49.280
You're providing training data to help a machine learn to see, and maybe even one day drive.

00:00:52.320 --> 00:00:56.440
Medical researchers can use medical images as training data to teach

00:00:56.520 --> 00:00:59.900
computers how to recognize and diagnose diseases.

00:01:00.300 --> 00:01:05.560
Machine learning needs hundreds and thousands of images, and training direction from a doctor

00:01:05.640 --> 00:01:09.920
who knows what to look for, before it can correctly identify disease.

00:01:10.520 --> 00:01:15.540
Even with thousands of examples, there can be problems with the computer's predictions.

00:01:15.880 --> 00:01:20.660
If X-ray data is only collected from men, then the computer's predictions may only work for men.

00:01:21.880 --> 00:01:26.300
It may not recognize diseases when asked to diagnose the X-ray of a woman.

00:01:26.620 --> 00:01:30.820
This blind spot in the training data creates something called bias.

00:01:31.260 --> 00:01:36.420
Biased data favors some things, and de-prioritizes or excludes others.

00:01:36.780 --> 00:01:41.800
Depending on how training data is collected, who is doing the collecting, and how the data is fed,

00:01:41.800 --> 00:01:45.340
there is a chance that human bias is included in the data.

00:01:45.880 --> 00:01:50.700
By learning from biased data, the computer may make biased predictions,

00:01:50.780 --> 00:01:54.320
whether the people training the computer are aware of it or not.

00:01:54.760 --> 00:01:58.400
When you are looking at training data, ask yourself two questions:

00:01:58.640 --> 00:02:01.600
Is this enough data to accurately train a computer?

00:02:02.320 --> 00:02:06.860
And, does this data represent all possible scenarios and users without bias?

00:02:07.460 --> 00:02:11.040
This is where you, as the human training, play a crucial role.

00:02:11.160 --> 00:02:14.500
It's up to you to give your machine unbiased data.

00:02:14.500 --> 00:02:18.160
That means collecting tons of examples, from lots of sources.

00:02:19.300 --> 00:02:22.580
Remember, when you pick and choose data for machine learning,

00:02:22.580 --> 00:02:26.660
you're actually programming the algorithm, using training data instead of code.

00:02:27.100 --> 00:02:29.780
The data IS the code.

00:02:30.180 --> 00:02:34.680
The better the data you provide, the better the computer will learn.