https:/.../9c3e036c-fc72-446f-8b20-ad9000efda2e-c0247da3-42e1-4d9b-9515-ad910127a4da.mp4?invocationId=5cc9289b-3009-ec11-a9e9-0a1a827ad0ec

0:05 - 0:11

In this video I'm going to talk with you about coding and encoding data. The learning outcomes are to recognize when a variable might
0:11 - 0:17

need a codebook or a dictionary to give it more explanation, to understand the difference between a variable and its encoding,
0:17 - 0:23

and to transform one encoding into another. We'll start to think about what we're going to need in order to do that.
0:23 - 0:28

So the data we've been talking about needs to be encoded. It needs to be stored somehow.
0:28 - 0:35

So the variables we talked about in the earlier video, we actually have to record those values in some way.
0:35 - 0:43

It's important to recognize that the encoding and the value that is encoded in that encoding are not the same.
0:43 - 0:46

We could have the number twenty-seven, and we can write it in multiple different ways.
0:46 - 0:51

This is a warm-up example. We could write the digits 2 and 7.
0:51 - 1:00

We can write out the words "twenty-seven". We could write it out in hexadecimal, 0x1B. The value is twenty-seven, but we have different ways of writing it.
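As a rough illustration of the point above, here is a minimal Python sketch showing that several written forms denote the same value. The `WORDS` lookup table is a toy assumption made up for this example; it is not part of any library.

```python
# Three written forms of the same value, twenty-seven.
from_decimal_digits = int("27")      # the decimal digits "2" and "7"
from_hexadecimal = int("0x1B", 16)   # the hexadecimal form 0x1B

# Toy lookup for the spelled-out form (an assumption for this sketch only).
WORDS = {"twenty-seven": 27}
from_words = WORDS["twenty-seven"]

# All three spellings decode to one and the same value.
same_value = from_decimal_digits == from_hexadecimal == from_words
print(same_value)

# Going the other way: one value, several ways of writing it down.
print(format(27, "d"), format(27, "#x"), format(27, "#b"))
```

The value never changes; only the notation used to write it does.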
1:00 - 1:05

So to encode numeric data, we can encode it as a binary integer.
1:05 - 1:13

And in each of these, I'm showing the hexadecimal values of the bytes used to encode it.
1:13 - 1:18

So we can encode it as a binary integer. We could encode it as a floating-point
1:18 - 1:23

binary number. Looks very different, doesn't it?
1:23 - 1:28

We could encode it as a decimal number. Those are the ASCII codes for the digits
1:28 - 1:32

2 and 7. When we save a CSV file, it's stored as text.
1:32 - 1:36

So it's storing the decimal digits. There's another format called binary-coded
1:36 - 1:43

decimal that's used on some mainframes and other systems for efficiently storing the actual decimal values.
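The four numeric encodings just described can be sketched with Python's standard `struct` module. This is an illustration under assumptions: it uses big-endian byte order so the hex reads left to right, and `packed_bcd` is a hypothetical helper written for this sketch that only handles non-negative integers. The bytes shown in the video may use a different width or byte order.

```python
import struct

value = 27

# 32-bit binary integer, big-endian.
int32_bytes = struct.pack(">i", value)

# 32-bit IEEE 754 floating-point number, big-endian.
float32_bytes = struct.pack(">f", float(value))

# Decimal text: the ASCII codes of the characters "2" and "7".
text_bytes = str(value).encode("ascii")


def packed_bcd(n: int) -> bytes:
    """Pack a non-negative integer as binary-coded decimal, two digits per byte."""
    digits = str(n)
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


bcd_bytes = packed_bcd(value)

for name, raw in [
    ("int32", int32_bytes),
    ("float32", float32_bytes),
    ("ascii text", text_bytes),
    ("packed BCD", bcd_bytes),
]:
    print(f"{name:>10}: {raw.hex()}")
```

The same value produces four quite different byte sequences, which is why a reader of the file has to know which encoding was used.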
1:43 - 1:47

Encodings can also be lossy. Floating point, for example, loses precision.
1:47 - 1:54

And we can record things as integers, which truncates whatever fractional part they may have had.
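Both kinds of loss can be seen in a short Python sketch. The numbers are made up for illustration, and the round trip through a 32-bit float is just one convenient way to show the precision loss.

```python
import struct

# Precision loss: store 0.1 as a 32-bit float, then read it back.
original = 0.1
as_float32 = struct.unpack(">f", struct.pack(">f", original))[0]
print(original, as_float32, original == as_float32)

# Truncation: recording a measurement as an integer drops the fractional part.
measurement = 27.9
as_integer = int(measurement)  # int() truncates toward zero
print(measurement, as_integer)
```

In both cases the stored value is close to the original but not identical, and the original cannot be recovered from what was stored.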
1:54 - 2:00

So this is just for encoding numbers. We've had these four ways of encoding, but that's just the syntactic encoding.
2:00 - 2:07

OK, this is stored as decimal characters; this is stored as a 32-bit integer. Is that enough to interpret it?
2:07 - 2:12

No, because we need to know how it was measured. We need to know what the units are.
2:12 - 2:15

Is this millimeters? Feet?
2:15 - 2:25

Crocodiles? It may have been transformed somehow: some data sets center the values, or they take the logarithm of the values.
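To see why the transformation has to be documented, here is a small Python sketch of centering and taking logarithms. The measurements and the variable name `lengths_mm` are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean

# Hypothetical raw measurements, in millimeters.
lengths_mm = [180.0, 190.0, 200.0]

# Centering: subtract the mean, so the stored values are offsets from it.
center = mean(lengths_mm)
centered = [x - center for x in lengths_mm]

# Log transform: store the natural logarithm of each value.
logged = [math.log(x) for x in lengths_mm]

print(centered)
print(logged)

# The raw values come back only if we know what was done and with what constant.
recovered_from_centered = [c + center for c in centered]
recovered_from_logged = [math.exp(v) for v in logged]
```

Someone who receives only `centered` or `logged`, without the documentation, has no way to tell that the numbers are not raw millimeters.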
2:25 - 2:33

We also need to know if there are any sentinel values, because it's not uncommon to get a data set where it's numbers,
2:33 - 2:37

but then there's a special value that's used to indicate, say, unknown data.
2:37 - 2:47

So we might have the number of classes a student took in a day: one, two, seven, whatever.
2:47 - 2:58

And if we don't know, we record ninety-nine. Data recordings today tend to actually just leave the value out.
2:58 - 3:01

But there's lots of historical data out there,
3:01 - 3:08

lots of historical data processing systems that use specific values to indicate things such as unknown.
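A minimal Python sketch of why sentinel values matter, using the classes-per-day example. The data values are invented, and the constant `UNKNOWN = 99` stands in for whatever a real codebook would specify.

```python
from statistics import mean

UNKNOWN = 99  # sentinel meaning "unknown", as a hypothetical codebook would state

# Number of classes each student took in a day; 99 marks an unknown answer.
classes_per_day = [1, 2, 7, 99, 3, 99]

# Treating the sentinel as a real number badly distorts the result.
naive_mean = mean(classes_per_day)

# Replace the sentinel with an explicit missing marker, then skip missing values.
cleaned = [None if x == UNKNOWN else x for x in classes_per_day]
known = [x for x in cleaned if x is not None]
correct_mean = mean(known)

print(naive_mean, correct_mean)
```

The naive average is more than ten times the correct one, and nothing in the numbers themselves signals the problem.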
3:08 - 3:17

We need to know if any of those values are in the data. When we have categorical data, we need to know how the data is what we call coded.
3:17 - 3:21

So with categorical data, there are a few different values the variable can take on.
3:21 - 3:27

We call these codes or levels of the categorical variable. We need to know what they are.
3:27 - 3:33

We also need to know how they're stored. Are they stored as numbers or as strings? In some data sets
3:33 - 3:38

we have a string, like our penguins: we just wrote down the string. But in some, maybe there'll be numbers: zero,
3:38 - 3:42

one, two, three, four. And we need the codebook to tell us what those numbers are.
3:42 - 3:52

In any case, we need to know how the data was recorded, and we need to know what the codes are. We also need to know how they're defined.
3:52 - 3:57

What does a particular value for this categorical variable mean?
3:57 - 3:58

We need to know.
3:58 - 4:05

It's useful to know what rules were used to decide which code to apply. For some things that might be fairly straightforward and obvious,
4:05 - 4:16

but for others it might not be. Also, there's a meta-question around coding categorical data: who decided the definitions, and how
4:16 - 4:22

was this set of codes decided upon and given the definitions that they were given?
4:22 - 4:29

One example of this is in census data.
4:29 - 4:35

The categories for race, the way race is collected, have changed throughout the history of the United States.
4:35 - 4:43

The set of categories, whether you have to pick one or whether it's check-all-that-apply, etc., has changed over the course of that history.
4:43 - 4:55

And it becomes a very political process: how do you define how it is that we record race when we are collecting census data?
4:55 - 5:05

That then has strong effects on how we understand the representation and distribution of race in the country. So, a few examples.
5:05 - 5:11

So in the penguin data set we looked at, the species is a categorical variable, and it's written down
5:11 - 5:17

just with the name of the species: Adelie, Chinstrap, or Gentoo.
5:17 - 5:21

There's a raw version of the data that has the full biological species name.
5:21 - 5:34

These come from biological taxonomy. There's another data set, used for some things, that is credit information for German loan applicants.
5:34 - 5:40

And it has various variables. One of its variables is the status of the applicant's checking account.
5:40 - 5:46

And it has the codes A11 for overdrawn, A12 for between zero and 200 Deutsche Marks,
5:46 - 5:53

A13 for either at least 200 Deutsche Marks or they've had their salary deposited for a year.
5:53 - 5:58

And then A14 means no checking account. And so we've got these categorical codes.
5:58 - 6:03

If you see A12 in the data, you do not know what it means without looking at the codebook.
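In code, a codebook is often just a lookup table. The following Python sketch uses the four checking-account codes mentioned above; the label wording is a paraphrase of the description in this video, and `CHECKING_STATUS` and `decode` are hypothetical names introduced for illustration.

```python
# Codebook for the checking-account status variable (labels paraphrased).
CHECKING_STATUS = {
    "A11": "overdrawn",
    "A12": "zero or more, but below the A13 threshold",
    "A13": "at or above the threshold, or salary deposited for a year",
    "A14": "no checking account",
}


def decode(code: str) -> str:
    """Look a code up in the codebook, failing loudly on unknown codes."""
    try:
        return CHECKING_STATUS[code]
    except KeyError:
        raise ValueError(f"code {code!r} is not in the codebook") from None


# A few values as they might appear in the raw column.
column = ["A12", "A14", "A11"]
decoded = [decode(code) for code in column]
print(decoded)
```

Raising an error on an unknown code, rather than guessing, makes gaps between the data and its documentation visible early.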
6:03 - 6:12

Even if it does look obvious, it's good to look at the codebook. But then, when we actually go to record categorical data,
6:12 - 6:19

a lot of times, especially in the raw data, the data files that we load, it's going to be directly stored as a string or an integer.
6:19 - 6:25

We're going to have a column for the categorical variable, and it has the value in there. But
6:25 - 6:34

for computational purposes, we're often going to need to encode it differently, because you can't compute on A13. There are a couple of different encodings.
6:34 - 6:39

One is one-hot encoding, where each different code or level gets a variable,
6:39 - 6:44

a logical variable, but we encode it with an integer zero or one.
6:44 - 6:49

And so for the German credit data, we're going to have A11, A12, A13, A14.
6:49 - 6:57

They're all different variables. For the penguins, we would have three variables, one for each species, and exactly one of them is one.
6:57 - 7:02

So an Adelie penguin would have one in the Adelie variable and zero in Chinstrap and Gentoo.
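One-hot encoding of the penguin species can be sketched in a few lines of plain Python. The function name `one_hot` and the dictionary layout are choices made for this illustration; data-analysis libraries offer their own versions of the same idea.

```python
SPECIES = ["Adelie", "Chinstrap", "Gentoo"]


def one_hot(value: str, levels=SPECIES) -> dict:
    """Return one 0/1 indicator per level; exactly one of them is 1."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels}


adelie = one_hot("Adelie")
gentoo = one_hot("Gentoo")
print(adelie)
print(gentoo)
```

Each penguin gets three variables, and the indicators for any one penguin always sum to one.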
7:02 - 7:11

Another option is what's called dummy coding, which is very, very similar, except one of the codes doesn't get a variable.
7:11 - 7:16

So all zeros in the variables for the categorical variable
7:16 - 7:21

means it's the omitted one, and a one in any of them means that one.
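Dummy coding differs from one-hot encoding only in that one level is left out. In this Python sketch the omitted level is Adelie, which is an arbitrary assumption for illustration; the function name `dummy_code` is also hypothetical.

```python
SPECIES = ["Adelie", "Chinstrap", "Gentoo"]


def dummy_code(value: str, levels=SPECIES, omitted: str = "Adelie") -> dict:
    """Return a 0/1 indicator for every level except the omitted one."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels if level != omitted}


adelie = dummy_code("Adelie")        # omitted level: every indicator is 0
chinstrap = dummy_code("Chinstrap")  # exactly one indicator is 1
print(adelie)
print(chinstrap)
```

With three species there are only two indicator variables, and the all-zeros row identifies the omitted species.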
7:21 - 7:26

Why we need that is going to come up when we start talking about linear modeling.
7:26 - 7:33

But it's a very common statistical way of encoding a variable. The variables that we use for this are called indicator variables.
7:33 - 7:42

So if we're transforming our penguins into either one-hot or dummy-coded variables, we say that we have an indicator variable for Chinstrap.
7:42 - 7:46

And it's one if the penguin is a Chinstrap and zero if it is not.
7:46 - 7:54

So data has to be coded and encoded in order for us to process and analyze it; we actually have to store it somehow.
7:54 - 7:59

And the process of coding affects the data that we have. When we do an analysis,
7:59 - 8:13

when we do an inference, the way that the data was coded affects how we view and how we understand the things that the data are actually about.
8:13 - 8:16

Sometimes this is relatively straightforward, as with the penguin species,
8:16 - 8:24

although the way the penguins got divided into their various species and got those species names is a historical social process.
8:24 - 8:33

But with other things that have very contested social definitions, such as how you indicate race,
8:33 - 8:39

it becomes a very strong lens that affects how we understand the underlying
8:39 - 8:44

reality that the data is supposed to represent. And that representation,
8:44 - 8:45

the coding, the codebook,
8:45 - 8:57

and the definitions all need to be documented thoroughly in order for us to properly understand the data that we're working with.

Title:: https:/.../9c3e036c-fc72-446f-8b20-ad9000efda2e-c0247da3-42e1-4d9b-9515-ad910127a4da.mp4?invocationId=5cc9289b-3009-ec11-a9e9-0a1a827ad0ec
Video Language:: English
Duration:: 08:57

janetlayne edited English subtitles for https:/.../9c3e036c-fc72-446f-8b20-ad9000efda2e-c0247da3-42e1-4d9b-9515-ad910127a4da.mp4?invocationId=5cc9289b-3009-ec11-a9e9-0a1a827ad0ec
