WEBVTT

00:00:07.360 --> 00:00:11.760
Machine learning is only as good as the
training data you put into it.

00:00:11.800 --> 00:00:15.820
So, it's super important to use high-quality data, and lots of it.

00:00:16.760 --> 00:00:21.960
But if data is important, it's worth asking: where does training data come from?

00:00:22.280 --> 00:00:26.260
Often, computers are collecting training data from people like you and me,

00:00:26.260 --> 00:00:27.860
without any effort on our part.

00:00:28.440 --> 00:00:31.480
A video streaming service might keep track of what you watch. Then it can recognize patterns

00:00:31.660 --> 00:00:36.000
in that data to recommend what you might want to watch next.

00:00:37.420 --> 00:00:43.200
Other times, you're directly asked to help, like when a website asks you to spot street signs in photos.

00:00:43.780 --> 00:00:49.280
You're providing training data to help a
machine learn to see, and maybe even one day drive.

00:00:52.320 --> 00:00:56.440
Medical researchers can use
medical images as training data to teach

00:00:56.520 --> 00:00:59.900
computers how to recognize and diagnose diseases.

00:01:00.300 --> 00:01:05.560
Machine learning needs hundreds and thousands of images, and training direction from a doctor

00:01:05.640 --> 00:01:09.920
who knows what to look for, before it can correctly identify disease.

00:01:10.520 --> 00:01:15.540
Even with thousands of examples, there can be problems with the computer's predictions.

00:01:15.880 --> 00:01:20.660
If X-ray data is only collected from men, then the computer's predictions may only work for men.

00:01:21.880 --> 00:01:26.300
It may not recognize diseases when
asked to diagnose the X-ray of a woman.

00:01:26.620 --> 00:01:30.820
This blind spot in the training data
creates something called bias.

00:01:31.260 --> 00:01:36.420
Biased data favors some things, and de-prioritizes or excludes others.

00:01:36.780 --> 00:01:41.800
Depending on how training data is collected, who is doing the collecting, and how the data is fed,

00:01:41.800 --> 00:01:45.340
there is a chance that
human bias is included in the data.

00:01:45.880 --> 00:01:50.700
By learning from biased data, the computer may make biased predictions,

00:01:50.780 --> 00:01:54.320
whether the people training the computer
are aware of it or not.

00:01:54.760 --> 00:01:58.400
When you are looking at training data, ask yourself two questions:

00:01:58.640 --> 00:02:01.600
Is this enough data to accurately train a computer?

00:02:02.320 --> 00:02:06.860
And, does this data represent all possible scenarios and users without bias?

00:02:07.460 --> 00:02:11.040
This is where you, as the human trainer, play a crucial role.

00:02:11.160 --> 00:02:14.500
It's up to you to give your machine unbiased data.

00:02:14.500 --> 00:02:18.160
That means collecting tons of examples, from lots of sources.

00:02:19.300 --> 00:02:22.580
Remember, when you pick and choose data for machine learning,

00:02:22.580 --> 00:02:26.660
you're actually programming the algorithm, using training data instead of code.

00:02:27.100 --> 00:02:29.780
The data IS the code.

00:02:30.180 --> 00:02:34.680
The better the data you provide, the better the computer will learn.