In this video I'm going to talk with you about coding and encoding data. The learning outcomes are to recognize when a variable might
need a codebook or a dictionary to give it more explanation, to understand the difference between a variable and its encoding,
and to transform one encoding to another. We'll start by thinking about what we're going to need in order to do that.
So the data we've been talking about needs to be encoded. It needs to be stored somehow.
So the variables we talked about in the earlier video, we actually have to record those values in some way.
It's important to recognize that the encoding and the value that is encoded in that encoding are not the same.
We could have the number twenty-seven and we can write it in multiple different ways.
This is a warm-up example. We could write the digits two seven.
We can write out the words twenty-seven. We could write it in hexadecimal, 0x1B. The value is twenty-seven, but we have different ways of writing it.
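To make the warm-up concrete, here is a small sketch in Python (my choice of language for illustration; the variable names are made up) showing that the different written forms all parse to the same value.

```python
# Three different written forms of the same value.
from_decimal = int("27")        # decimal digits "2" and "7"
from_hex = int("0x1B", 16)      # hexadecimal notation
from_binary = int("11011", 2)   # binary notation

# The notations differ, but the value they encode is identical.
print(from_decimal == from_hex == from_binary)  # True
print(hex(27), bin(27))                          # 0x1b 0b11011
```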
So to encode numeric data, we can encode it as a binary integer.
For each of these I'm showing the hexadecimal values of the bytes used to encode it.
So we can encode it as a binary integer, or we could encode it as a binary floating-point number.
Looks very different, doesn't it?
We could encode it as a decimal number. Those are the ASCII codes for the digits
two and seven. When we save a CSV file, it's stored as text,
so it's storing the decimal digits. There's another format called binary-coded
decimal that's used on some mainframes and other systems for efficiently storing the actual decimal values.
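The four encodings just described can be sketched with Python's standard `struct` module. This is only an illustration: the big-endian byte order, the 32-bit widths, and the one-digit-per-nibble packing for binary-coded decimal are my assumptions, and the bytes shown in the lecture may be laid out differently.

```python
import struct

value = 27

# 32-bit big-endian binary integer (byte order is an assumption)
as_int32 = struct.pack(">i", value)
# 32-bit big-endian IEEE 754 floating-point number
as_float32 = struct.pack(">f", float(value))
# Decimal text: the ASCII codes for the characters "2" and "7"
as_text = str(value).encode("ascii")
# Binary-coded decimal: one decimal digit per 4-bit nibble
tens, ones = divmod(value, 10)
as_bcd = bytes([(tens << 4) | ones])

for name, raw in [("int32", as_int32), ("float32", as_float32),
                  ("text", as_text), ("bcd", as_bcd)]:
    print(f"{name:8s}{raw.hex()}")
```

Under those assumptions the same value comes out as `0000001b`, `41d80000`, `3237`, and `27` in hexadecimal, which is why you cannot interpret the bytes without knowing which encoding was used.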
Encodings can also be lossy. Floating point, for example, loses precision.
And we can record things as integers, which truncates whatever decimal part they may have had.
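Both kinds of loss can be seen in a short sketch. The specific numbers here are ones I picked for illustration: 16777217 is 2**24 + 1, which a 32-bit float cannot represent exactly, and 27.9 is an arbitrary measurement.

```python
import struct

# Round-trip through a 32-bit float: 2**24 + 1 has no exact float32 form.
original = 16777217
as_float32 = struct.unpack(">f", struct.pack(">f", float(original)))[0]
print(as_float32 == original)   # False: a neighboring value was stored instead

# Storing as an integer drops the fractional part.
measured = 27.9
as_integer = int(measured)
print(as_integer)               # 27
```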
So this is just for encoding numbers. We've had these four ways of encoding, but that's the syntactic encoding.
OK, this is stored as decimal characters; this is stored as a 32-bit integer. Is that enough to interpret it?
No, because we need to know how it was measured. We need to know what the units are.
Is this millimeters? Feet?
The values may also have been transformed somehow: some data sets center the values or take the logarithm of their values.
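As a rough sketch of those two transformations, using made-up measurements (the numbers and the choice of a base-10 logarithm are assumptions for illustration only):

```python
import math

# Hypothetical raw measurements; the units are not recoverable from the numbers alone.
raw = [10.0, 100.0, 1000.0]

# Centering: subtract the mean so the values average to zero.
mean = sum(raw) / len(raw)
centered = [x - mean for x in raw]

# Log transform: record the base-10 logarithm instead of the value.
logged = [math.log10(x) for x in raw]

print(centered)
print(logged)
```

Someone handed only `centered` or `logged` would draw the wrong conclusions about the original measurements unless the documentation says which transformation was applied.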
We also need to know if there are any sentinel values, because it's not uncommon to get a data set where it's numbers,
but then there's a special value that is used to indicate, say, unknown data.
So we might have the number of classes a student took in a day: one, two, seven, whatever.
And if we don't know, we record ninety-nine. Data recordings today tend to actually just exclude the value.
But there's lots of historical data out there, and
lots of historical data processing systems that use specific values to indicate things such as unknown.
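Here is a minimal sketch of why that matters, using the classes-per-day example with 99 as the unknown marker. The sample values are invented, and a real data set's codebook would define its own sentinel.

```python
from statistics import mean

UNKNOWN = 99  # sentinel from the example above; a real codebook defines its own

classes_per_day = [1, 2, 7, 99, 3]

# Treating the sentinel as a real measurement distorts the summary.
naive_mean = mean(classes_per_day)

# Convert the sentinel to an explicit missing marker before analysis.
cleaned = [None if x == UNKNOWN else x for x in classes_per_day]
known_mean = mean(x for x in cleaned if x is not None)

print(naive_mean, known_mean)  # 22.4 3.25
```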
We need to know if any of those values are in the data. When we have categorical data, we need to know how the data is what we call coded.
With categorical data, there are a few different values the variable can take on.
We call these the codes or levels of the categorical variable. We need to know what they are.
We also need to know how they're stored. Are they stored as numbers or as strings? In some data sets
we have a string, like our penguins: we just wrote down the string. But in some, maybe there'll be numbers: zero,
one, two, three, four. And we need the codebook to tell us what those numbers are.
In any case, we may know how the data was recorded and still not know what the codes are. We also need to know how they're defined:
what does a particular value for this categorical variable mean?
It's useful to know what rules were used to decide which code to apply. For some things that might be fairly straightforward and obvious,
but for others it might not be. Also, there's a meta question around coding categorical data about who decided the definitions: how
was this set of codes decided upon and given the definitions that they were given?
One example of this is in census data.
The categories for race, the way race is collected, have changed throughout the history of the United States.
The set of categories, whether you have to pick one or whether it's check all that apply, et cetera, has changed over the course of that history.
And it becomes a very political process: how do we define how it is that we record race when we are collecting census data?
That then has strong effects on how we understand the representation and distribution of race in the country. So, a few examples.
So the penguin data set we looked at: the species is a categorical variable, and it's written down
just with the name of the species: Adelie, Chinstrap, or Gentoo.
There's a raw version of the data that has the full biological species name.
These come from biological taxonomy. There's another data set that is used for some things that is credit information for German loan applicants.
And it has various variables. One of its variables is the status of the applicant's checking account.
And it has the codes A11 for overdrawn, A12 for between zero and 200 deutschmarks,
A13 for either at least 200 deutschmarks or they've had their salary deposited for a year.
And then A14 means no checking account. And so we've got these categorical codes.
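One way to keep a codebook next to the data is a simple lookup table. The sketch below paraphrases the descriptions from this lecture rather than quoting the data set's official documentation, and the `decode` helper is a hypothetical name introduced only for this example.

```python
# Codebook for the checking-account status variable, paraphrased from the
# descriptions in the lecture (not the data set's official wording).
CHECKING_STATUS = {
    "A11": "overdrawn",
    "A12": "zero or more, but below the A13 threshold",
    "A13": "at least 200 deutschmarks, or salary deposited for a year",
    "A14": "no checking account",
}

def decode(code: str) -> str:
    """Look up a code, failing loudly on anything the codebook does not define."""
    if code not in CHECKING_STATUS:
        raise KeyError(f"code {code!r} is not in the codebook")
    return CHECKING_STATUS[code]

print(decode("A12"))
```

Raising an error on an unknown code, instead of guessing, is a deliberate choice here: a code that is missing from the codebook usually means the documentation and the data have drifted apart.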
If you see A12 in the data, you do not know what it means without looking at the codebook.
Even if it does look obvious, it's good to look at the codebook. When we actually go to record categorical data,
a lot of times, especially in the raw data, the data files that we load, it's going to be directly stored as a string or an integer.
We're going to have a column for the categorical variable and it has the value in there.
But for computational purposes, we're often going to need to encode it differently, because you can't compute on A13. There are a couple of different encodings.
One is one-hot encoding, where each different code or level gets a variable:
a logical variable, but we encode it with an integer zero or one.
And so for the German credit data, we're going to have A11, A12, A13, A14;
they're all different variables. For the penguins, we would have three variables, one for each species, and exactly one of them is one.
So an Adelie penguin would have one in the Adelie variable and zero in Chinstrap and Gentoo.
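A minimal sketch of one-hot encoding for the penguin species, written in plain Python so nothing beyond the standard library is assumed. The sample rows and the `one_hot` helper are made up for illustration.

```python
species = ["Adelie", "Gentoo", "Chinstrap", "Adelie"]  # made-up sample rows

# The levels are the distinct values the variable can take.
levels = sorted(set(species))  # ['Adelie', 'Chinstrap', 'Gentoo']

def one_hot(value, levels):
    """Return one 0/1 indicator per level; exactly one of them is 1."""
    return {level: int(value == level) for level in levels}

encoded = [one_hot(s, levels) for s in species]
for row in encoded:
    print(row)
```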
Another option is what's called dummy coding, which is very, very similar, except one of the codes doesn't get a variable.
So all zeroes in the variables for the categorical variable
means it's the omitted one, and a one in any of them means that one.
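Dummy coding can be sketched the same way as one-hot encoding, with one level left out. Which level to omit is a modeling decision; omitting Adelie below is an arbitrary choice for this illustration, and `dummy_code` is a hypothetical helper name.

```python
species = ["Adelie", "Gentoo", "Chinstrap"]  # made-up sample rows
levels = ["Adelie", "Chinstrap", "Gentoo"]

def dummy_code(value, levels, omitted):
    """Like one-hot encoding, but the omitted level gets no variable."""
    return {level: int(value == level) for level in levels if level != omitted}

# Omitting Adelie is an arbitrary choice made for this illustration.
encoded = [dummy_code(s, levels, omitted="Adelie") for s in species]
for s, row in zip(species, encoded):
    print(s, row)
```

The Adelie row comes out as all zeroes, which is how the omitted level is represented.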
Why we need that is going to come up when we start talking about linear modeling.
But it's a very common statistical way of encoding a variable. The variables that we use for this are called indicator variables.
So if we're transforming our penguins into either one-hot or dummy-coded variables, we say that we have an indicator variable for Chinstrap,
and it's one if the penguin is a Chinstrap and zero if it is not.
So data has to be coded and encoded in order for us to process and analyze it; we actually have to store it somehow.
And the process of coding affects the data that we have when we do an analysis.
When we do an inference, the way that the data was coded affects how we view and how we understand the things that the data are actually about.
Sometimes this is relatively straightforward, as with the penguin species,
although the way the penguins got divided into their various species and got those species names is itself a historical social process.
But with other things that have very contested social definitions, such as how you indicate race,
it becomes a very strong lens that affects how we understand the underlying
reality that the data is supposed to represent. The representation,
the coding, the codebook, and
the definitions need to be documented thoroughly in order for us to properly understand the data that we're working with.