-
Machine learning is only as good as the
training data you put into it.
-
So, it's super important to use high-quality data, and lots of it.
-
But if data is important, it's worth asking: where does training data come from?
-
Often, computers are collecting training data from people like you and me,
-
without any effort on our part.
-
A video streaming service might keep track of what you watch so that it can recognize patterns
-
in that data to recommend what you might want to watch next.
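As a rough illustration of how patterns in watch data could drive a recommendation, here is a minimal sketch. It is not how any real streaming service works. The titles, the `watch_histories` list, and the `recommend` function are all made up for this example, and the technique is simple counting: suggest the unseen title that shows up most often among viewers who share something with you.

```python
from collections import Counter

# Hypothetical watch histories: one set of titles per viewer (made-up data).
watch_histories = [
    {"Space Docs", "Robot Wars", "Ocean Life"},
    {"Space Docs", "Robot Wars"},
    {"Space Docs", "Cooking 101"},
    {"Ocean Life", "Cooking 101"},
]

def recommend(watched, histories):
    """Suggest the unseen title most often watched by viewers who share
    at least one title with `watched`. Returns None if nothing matches."""
    counts = Counter()
    for history in histories:
        if history & watched:                  # this viewer overlaps with us
            counts.update(history - watched)   # count only titles we have not seen
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(recommend({"Space Docs"}, watch_histories))  # Robot Wars
```

Notice that the suggestion depends entirely on the histories that were collected. Change the data and the recommendation changes with it.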
-
Other times, you're directly asked to help, like when a website asks you to spot street signs in photos.
-
You're providing training data to help a
machine learn to see, and maybe even one day drive.
-
Medical researchers can use
medical images as training data to teach
-
computers how to recognize and diagnose diseases.
-
Machine learning needs hundreds or even thousands of images, along with training direction from a doctor
-
who knows what to look for, before it can correctly identify disease.
-
Even with thousands of examples, there can be problems with the computer's predictions.
-
If X-ray data is only collected from men, then the computer's predictions may only work for men.
-
It may not recognize diseases when
asked to diagnose the X-ray of a woman.
-
This blind spot in the training data
creates something called bias.
-
Biased data favors some things and deprioritizes or excludes others.
-
Depending on how training data is collected, who is doing the collecting, and how the data is fed to the computer,
-
there is a chance that
human bias is included in the data.
-
By learning from biased data, the computer may make biased predictions,
-
whether the people training the computer
are aware of it or not.
-
When you are looking at training data, ask yourself two questions:
-
Is this enough data to accurately train a computer?
-
And, does this data represent all possible scenarios and users without bias?
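Those two questions can be turned into a quick check before training. The sketch below is only one possible way to do it, under some assumptions: each example is a dictionary, the groups you care about are known ahead of time, and `min_per_group` is a threshold you pick yourself. The function name `check_training_data` and the sample X-ray records are hypothetical.

```python
from collections import Counter

def check_training_data(records, group_key, expected_groups, min_per_group):
    """Return the groups that have fewer than `min_per_group` examples,
    mapped to how many examples they actually have."""
    counts = Counter(record[group_key] for record in records)
    return {
        group: counts[group]          # a Counter gives 0 for missing groups
        for group in expected_groups
        if counts[group] < min_per_group
    }

# Made-up dataset: five X-rays from men, one from a woman.
xrays = [{"sex": "male"}] * 5 + [{"sex": "female"}] * 1

print(check_training_data(xrays, "sex", ["male", "female"], min_per_group=3))
# {'female': 1}
```

An empty result means every expected group met the threshold. It does not prove the data is free of bias; it only catches the gaps you thought to look for.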
-
This is where you, as the human trainer, play a crucial role.
-
It's up to you to give your machine unbiased data.
-
That means collecting tons of examples, from lots of sources.
-
Remember, when you pick and choose data for machine learning,
-
you're actually programming the algorithm, using training data instead of code.
-
The data IS the code.
-
The better the data you provide, the better the computer will learn.
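To make the idea that the data is the code concrete, here is a small sketch, assuming a very simple learner that labels a number by copying the label of the closest training example (a nearest-neighbor approach, chosen here only for illustration). The `predict` function and both datasets are made up. The code never changes between the two runs; only the training data does.

```python
def predict(training_data, value):
    """Return the label of the training example whose number is closest to `value`."""
    nearest = min(training_data, key=lambda example: abs(example[0] - value))
    return nearest[1]

# Two hypothetical training sets that disagree about what "large" looks like.
data_a = [(1, "small"), (10, "large")]
data_b = [(1, "small"), (100, "large")]

print(predict(data_a, 8))  # large
print(predict(data_b, 8))  # small
```

The same function gives opposite answers for the same input, because the examples it learned from were different.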