-
In this video I'm going to talk with you about coding and encoding data. The learning outcomes are to recognize when a variable might
-
need a codebook or a dictionary to give it more explanation, to understand the difference between a variable and its encoding,
-
and to transform one encoding to another. We'll start to think about what we're going to need in order to do that.
-
So the data we've been talking about needs to be encoded. It needs to be stored somehow.
-
So the variables we talked about in the earlier video, we actually have to record those values in some way.
-
It's important to recognize that the encoding and the value that is encoded in that encoding are not the same.
-
We could have the number twenty-seven, and we can write it in multiple different ways.
-
This is a warm-up example. We could write the digits two, seven.
-
We can write out the words twenty-seven. We could write it out in hexadecimal, 0x1B. The value is twenty-seven, but we have different ways of writing it.
-
So to encode numeric data, we can encode it as a binary integer.
-
And in each of these I'm showing the hexadecimal values of the bytes used to encode it.
-
So we can encode it as a binary integer. We could encode it as a floating-point
-
binary number. Looks very different, doesn't it?
-
We could encode it as a decimal number. Those are the ASCII codes for the digits
-
two and seven. When we save a CSV file, it's stored as text,
-
so it's storing the decimal digits. There's another format called binary-coded
-
decimal that's used on some mainframes and other systems for efficiently storing the actual decimal values.
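The four encodings just described can be sketched in a few lines of Python. This is only an illustration: the 32-bit width, the little-endian byte order, and the packed form of binary-coded decimal are assumptions made for the sketch, and the bytes shown on the lecture slide may be laid out differently.

```python
import struct

value = 27

# 32-bit binary integer (little-endian byte order is an assumption)
as_int32 = struct.pack("<i", value)

# 32-bit IEEE 754 floating-point number (again little-endian by assumption)
as_float32 = struct.pack("<f", float(value))

# Decimal digits stored as ASCII text, the way a CSV file stores them
as_text = str(value).encode("ascii")

# Packed binary-coded decimal: one decimal digit per 4-bit nibble
as_bcd = bytes([(value // 10) << 4 | (value % 10)])

# Print the hexadecimal value of each byte in each encoding
for name, encoded in [
    ("int32", as_int32),
    ("float32", as_float32),
    ("ascii", as_text),
    ("bcd", as_bcd),
]:
    print(f"{name:8s} {encoded.hex(' ')}")
```

The value is twenty-seven in every case; only the bytes differ, which is why a reader needs to be told which encoding was used.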
-
Encodings can also be lossy. A floating-point encoding, for example, loses precision.
-
And if we record things as integers, it truncates whatever decimal part they may have had.
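A small sketch of both kinds of loss, using made-up numbers. The choice of a 32-bit float and the specific values are assumptions for illustration only.

```python
import struct


def through_float32(x):
    """Store x as a 32-bit float and read it back."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


# A 32-bit float has a 24-bit significand, so 2**24 + 1 cannot be stored exactly
big = 2**24 + 1
big_back = through_float32(big)

# One tenth has no exact binary representation, so it changes slightly too
tenth_back = through_float32(0.1)

# Recording a measurement as an integer drops the decimal part
measurement = 27.8  # hypothetical measurement
as_integer = int(measurement)

print(big, "->", big_back)
print(0.1, "->", tenth_back)
print(measurement, "->", as_integer)
```

In each case the stored value is close to the original but not equal to it, and the original cannot be recovered from what was stored.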
-
So this is just for encoding numbers. We've had these four ways of encoding, but the syntactic encoding,
-
"this is stored as decimal characters" or "this is stored as a 32-bit integer", is not enough to interpret it.
-
We also need to know how it was measured. We need to know what the units are.
-
Is this millimeters, feet,
-
crocodiles? It may have been transformed somehow: some data sets center the values or take the logarithm of their values.
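To make the point about transformations concrete, here is a minimal sketch with made-up measurements in millimeters. The numbers and variable names are hypothetical; the point is only that the stored values cannot be read correctly without knowing which transformation was applied.

```python
import math

lengths_mm = [181.0, 186.0, 195.0]  # hypothetical measurements in millimeters

# Centering: subtract the mean, so the stored values sum to (about) zero
mean = sum(lengths_mm) / len(lengths_mm)
centered = [x - mean for x in lengths_mm]

# Log transform: store the natural logarithm of each value
logged = [math.log(x) for x in lengths_mm]

# Undoing the transform requires knowing which one was used
from_centered = [c + mean for c in centered]
from_logged = [math.exp(v) for v in logged]

print(centered)
print(logged)
```

Note that undoing the centering needs the original mean, which is exactly the kind of detail a codebook should record.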
-
We also need to know if there are any sentinel values, because it's not uncommon to get a data set where it's numbers,
-
but then there's a special value that's used to indicate, say, unknown data.
-
So we might have the number of classes a student took in a day: one, two, seven, whatever.
-
And if we don't know, we record ninety-nine. Data recorded today tends to just leave the value out.
-
But there's lots of historical data out there,
-
lots of historical data processing systems that use specific values to indicate things such as unknown.
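As a sketch of why this matters, suppose a column of classes-per-day uses ninety-nine for unknown, as in the example above. The data values here are made up, and the sentinel is assumed to come from the codebook.

```python
UNKNOWN = 99  # sentinel for "unknown", assumed to be documented in the codebook

classes_per_day = [1, 2, 7, 99, 3, 99]  # hypothetical recorded values

# Replace the sentinel with None so it cannot be mistaken for a real count
cleaned = [None if v == UNKNOWN else v for v in classes_per_day]
known = [v for v in cleaned if v is not None]

# Treating 99 as a real count badly distorts a summary such as the mean
naive_mean = sum(classes_per_day) / len(classes_per_day)
clean_mean = sum(known) / len(known)

print(f"mean with sentinels left in: {naive_mean:.2f}")
print(f"mean of known values only:   {clean_mean:.2f}")
```

Nothing in the numbers themselves says that ninety-nine is special; that knowledge has to come from the documentation.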
-
We need to know if any of those values are in the data. When we have categorical data, we need to know how the data is what we call coded.
-
With categorical data, there are a few different values the variable can take on.
-
We call these codes or levels of the categorical variable. We need to know what they are.
-
We also need to know how they're stored. Are they stored as numbers or as strings? In some data sets
-
we have a string, like our penguins: we just wrote down the string. But in some, maybe there'll be numbers: zero,
-
one, two, three, four. And we need the codebook to tell us what those numbers are.
-
In any case, we need to know how the data was recorded, and we need to know what the codes are. We also need to know how they're defined:
-
what does a particular value for this categorical variable mean?
-
It's useful to know what rules were used to decide which code to apply. For some things that might be fairly straightforward and obvious,
-
but for others it might not be. Also, there's a meta question around coding categorical data about who decided the definitions and how
-
this set of codes was decided upon and given the definitions that they were given.
-
One example of this is in census data.
-
The categories for race, the way race is collected, have changed throughout the history of the United States.
-
The set of categories, whether you have to pick one or whether it's check-all-that-apply, etc., has changed throughout that history.
-
And it becomes a very political process: how do we define how it is that we record race when we are collecting census data?
-
That then has strong effects on how we understand the representation and distribution of race in the country. So, a few examples.
-
In the penguin data set we looked at, the species is a categorical variable, and it's written down
-
just with the name of the species: Adelie, Chinstrap, or Gentoo. In the raw data,
-
there's a version that has the full biological species name.
-
These come from biological taxonomy. There's another data set that is used for some things, which is credit information for German loan applicants.
-
And it has various variables. One of its variables is the status of the applicant's checking account.
-
And it has the codes A11 for overdrawn, A12 for between zero and two hundred deutschmarks,
-
A13 for either at least 200 deutschmarks or having had their salary deposited for a year,
-
and then A14 means no checking account. And so we've got these categorical codes.
-
If you see A12 in the data, you do not know what it means without looking at the codebook.
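One way to picture a codebook is as a lookup table from codes to meanings. The sketch below paraphrases the four checking-account codes; the dictionary name, the wording of the descriptions, and the sample data are illustrative, not taken from the data set's documentation verbatim.

```python
# Codebook for the checking-account status variable (descriptions paraphrased)
CHECKING_STATUS = {
    "A11": "overdrawn (balance below 0 DM)",
    "A12": "balance from 0 to under 200 DM",
    "A13": "balance of at least 200 DM, or salary deposited for at least a year",
    "A14": "no checking account",
}

observed = ["A12", "A14", "A11", "A12"]  # hypothetical column values

# Decode each stored code into its documented meaning;
# a code missing from the codebook raises KeyError rather than being guessed at
decoded = [CHECKING_STATUS[code] for code in observed]

for code, meaning in zip(observed, decoded):
    print(code, "=", meaning)
```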
-
Even if it does look obvious, it's good to look at the codebook. When we actually go to record categorical data,
-
a lot of times, especially in the raw data, in the data files that we load, it's going to be directly stored as a string or an integer.
-
We're going to have a column for the categorical variable and it has the value in there, but
-
for computational purposes, we're often going to need to encode it differently, because you can't compute on A13. There are a couple of different encodings there.
-
One is one-hot encoding, where each different code or level gets a variable:
-
a logical variable, but we encode it with an integer zero or one.
-
And so for the German credit data, we're going to have A11, A12, A13, A14.
-
They're all different variables. For the penguins, we would have three variables, one for each species, and exactly one of them is one.
-
So an Adelie penguin would have one in the Adelie variable and zero in Chinstrap and Gentoo.
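A minimal sketch of one-hot encoding for the penguin species. The function name `one_hot` and the use of a plain dictionary are choices made for this illustration; data analysis libraries provide their own tools for this.

```python
SPECIES = ["Adelie", "Chinstrap", "Gentoo"]


def one_hot(value, levels):
    """Return one 0/1 variable per level; exactly one of them is 1."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels}


for penguin in ["Adelie", "Gentoo"]:
    print(penguin, one_hot(penguin, SPECIES))
```

An Adelie penguin gets a one in the Adelie variable and zeros in the other two, matching the description above.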
-
Another option is what's called dummy coding, which is very, very similar, except one of the codes doesn't get a variable.
-
So all zeros in the variables for the categorical variable
-
means it's the omitted one, and a one in any of them means that one.
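Dummy coding can be sketched the same way. Here Adelie is arbitrarily chosen as the omitted level; that choice, the function name `dummy_code`, and the dictionary representation are assumptions made for this illustration.

```python
SPECIES = ["Adelie", "Chinstrap", "Gentoo"]


def dummy_code(value, levels, omitted):
    """Return one 0/1 variable per level except the omitted one."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels if level != omitted}


for penguin in SPECIES:
    print(penguin, dummy_code(penguin, SPECIES, omitted="Adelie"))
```

With three species there are only two variables, and an Adelie penguin is represented by zeros in both.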
-
Why we need that is going to come up when we start talking about linear modeling.
-
But it's a very common statistical way of encoding a variable. The variables that we use for this are called indicator variables.
-
So if we're transforming our penguins into either one-hot or dummy-coded variables, we say that we have an indicator variable for Chinstrap,
-
and it's one if the penguin is a Chinstrap and zero if it is not.
-
So data has to be coded and encoded in order for us to process and analyze it. We actually have to store it somehow.
-
And the process of coding affects the data that we have when we do an analysis,
-
when we do an inference. The way that the data was coded affects how we view and how we understand the things that the data are actually about.
-
Sometimes this is relatively straightforward, as with the penguin species,
-
although the way the penguins got divided into their various species and got those species names is a historical social process.
-
But with other things that have very contested social definitions, such as how you indicate race,
-
it becomes a very strong lens that affects how we understand the underlying
-
reality that the data is supposed to represent. And that representation,
-
the coding, the codebook,
-
the definitions, needs to be documented thoroughly in order for us to properly understand the data that we're working with.