https:/.../9c3e036c-fc72-446f-8b20-ad9000efda2e-c0247da3-42e1-4d9b-9515-ad910127a4da.mp4?invocationId=5cc9289b-3009-ec11-a9e9-0a1a827ad0ec

  • 0:05 - 0:11
    In this video I'm going to talk with you about coding and encoding data. Learning outcomes are to recognize when a variable might
  • 0:11 - 0:17
    need a codebook or a dictionary to give it more explanation, to understand the difference between a variable and its encoding,
  • 0:17 - 0:23
    and to transform one encoding to another. We start to think about what we're going to need in order to do that.
  • 0:23 - 0:28
    So the data we've been talking about needs to be encoded. It needs to be stored somehow.
  • 0:28 - 0:35
    So the variables we talked about in the earlier video, we actually have to record those values in some way.
  • 0:35 - 0:43
    It's important to recognize that the encoding and the value that is encoded in that encoding are not the same.
  • 0:43 - 0:46
    We could have the number twenty-seven and we can write it in multiple different ways.
  • 0:46 - 0:51
    This is a warm-up example. We could write the digits two seven.
  • 0:51 - 1:00
    We can write out the words twenty-seven. We could write it out in hexadecimal, 0x1B. The value is twenty-seven, but we have different ways of writing it.
  • 1:00 - 1:05
    So to encode numeric data, we can encode it as a binary integer.
  • 1:05 - 1:13
    And in each of these I'm showing the hexadecimal values of the bytes used to encode it.
  • 1:13 - 1:18
    So we can encode it as a binary integer. We could encode it as a floating-point
  • 1:18 - 1:23
    binary number. Looks very different, doesn't it?
  • 1:23 - 1:28
    We could encode it as a decimal number. Those are the ASCII codes for the digits
  • 1:28 - 1:32
    two and seven. When we save a CSV file, it's stored as text,
  • 1:32 - 1:36
    so it's storing the decimal digits. There's another format called binary-coded
  • 1:36 - 1:43
    decimal that's used on some mainframes and other systems for efficiently storing the actual decimal values.
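
The four numeric encodings described above can be sketched in a few lines of Python. This is only an illustration: the 32-bit width and little-endian byte order are assumptions made for the example, and the binary-coded decimal line shows the packed form (one decimal digit per 4-bit nibble).

```python
import struct

value = 27

# 32-bit little-endian binary integer (width and byte order are assumptions)
int_bytes = struct.pack("<i", value)
# 32-bit little-endian IEEE 754 floating-point number
float_bytes = struct.pack("<f", float(value))
# Decimal digits stored as ASCII text, the way a CSV file stores them
text_bytes = str(value).encode("ascii")
# Packed binary-coded decimal: tens digit in the high nibble, ones in the low
bcd_bytes = bytes([(value // 10) << 4 | (value % 10)])

print(int_bytes.hex())    # 1b000000
print(float_bytes.hex())  # 0000d841
print(text_bytes.hex())   # 3237
print(bcd_bytes.hex())    # 27
```

The same value, twenty-seven, produces four different byte sequences, which is why the bytes alone cannot be read without knowing the encoding.
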
  • 1:43 - 1:47
    Encodings can also be lossy. Floating point, for example, loses precision.
  • 1:47 - 1:54
    And we can record things as integers, and that truncates whatever decimal part they may have had.
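
Both kinds of loss can be seen in a short Python sketch. The values 0.1 and 27.9 are made up for the example, and the 32-bit float width is an assumption.

```python
import struct

x = 0.1
# Round-trip through a 32-bit float: only the nearest representable value survives
as_float32 = struct.unpack("<f", struct.pack("<f", x))[0]
print(as_float32 == x)  # False: some precision was lost

y = 27.9
# Recording as an integer truncates the decimal part
as_int = int(y)
print(as_int)  # 27
```
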
  • 1:54 - 2:00
    So this is just for encoding numbers. We've had these four ways of encoding, but that's the syntactic encoding.
  • 2:00 - 2:07
    Knowing that this is stored as decimal characters, or this is stored as a 32-bit integer, is not enough to interpret it,
  • 2:07 - 2:12
    though, because we need to know how it was measured. We need to know what the units are.
  • 2:12 - 2:15
    Is this millimeters? Feet?
  • 2:15 - 2:25
    Crocodiles? It may have been transformed somehow: some data sets center the values or take the logarithm of their values.
  • 2:25 - 2:33
    We also need to know if there are any sentinel values, because it's not uncommon to get a data set where it's numbers,
  • 2:33 - 2:37
    but then there's a special value that is used to indicate, say, unknown data.
  • 2:37 - 2:47
    So we might have the number of classes a student took in a day: one, two, seven, whatever.
  • 2:47 - 2:58
    And if we don't know, we record ninety-nine. Data recordings today tend to actually just exclude the value.
  • 2:58 - 3:01
    But there's lots of historical data out there,
  • 3:01 - 3:08
    lots of historical data processing systems that use specific values to indicate things such as unknown.
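
A minimal sketch of why such a special value matters, using the ninety-nine example. The data values are invented for illustration, and the name `UNKNOWN` is hypothetical; in practice the code would come from the data set's codebook.

```python
UNKNOWN = 99  # hypothetical sentinel code meaning "unknown"

# Made-up counts of classes per day, with two unknown entries
classes_per_day = [1, 2, 7, 99, 3, 99]

# Replace the sentinel with None so it cannot be mistaken for a real count
cleaned = [None if v == UNKNOWN else v for v in classes_per_day]
known = [v for v in cleaned if v is not None]

naive_mean = sum(classes_per_day) / len(classes_per_day)
clean_mean = sum(known) / len(known)

print(clean_mean)  # 3.25
```

Treating 99 as a real count pulls the naive mean above 35 classes per day, which is clearly not what the data says.
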
  • 3:08 - 3:17
    We need to know if any of those values are in the data. When we have categorical data, we need to know how the data is what we call coded.
  • 3:17 - 3:21
    So with categorical data, there are a few different values the variable can take on.
  • 3:21 - 3:27
    We call these codes or levels of the categorical variable. We need to know what they are.
  • 3:27 - 3:33
    We also need to know how they're stored. Are they stored as numbers or strings? In some data sets
  • 3:33 - 3:38
    we have a string; like our penguins, we just wrote down the string. But in some, maybe there'll be numbers: zero,
  • 3:38 - 3:42
    one, two, three, four. And we need the codebook to tell us what those numbers are.
  • 3:42 - 3:52
    In any case, we need to know how the data was recorded, and we need to know what the codes are. We also need to know how they're defined:
  • 3:52 - 3:57
    What does a particular value for this categorical variable mean?
  • 3:57 - 3:58
    We need to know.
  • 3:58 - 4:05
    It's useful to know what rules were used to decide which code to apply. For some things that might be fairly straightforward and obvious,
  • 4:05 - 4:16
    but for others it might not be. Also, there's a meta question around coding categorical data about who decided the definitions, and how
  • 4:16 - 4:22
    this set of codes was decided upon and given the definitions that they were given.
  • 4:22 - 4:29
    One example of this is in census data.
  • 4:29 - 4:35
    The categories for race, the way race is collected, have changed throughout the history of the United States.
  • 4:35 - 4:43
    The set of categories, whether you have to pick one or whether it's check all that apply, etc., has changed over the course of that history.
  • 4:43 - 4:55
    And it becomes a very political process: how do you define how it is that we record race when we are collecting census data?
  • 4:55 - 5:05
    That then has strong effects on how we understand the representation and distribution of race in the country. So, a few examples.
  • 5:05 - 5:11
    In the penguin data set we looked at, the species is a categorical variable, and it's written down
  • 5:11 - 5:17
    just with the name of the species: Adelie, Chinstrap, or Gentoo.
  • 5:17 - 5:21
    There's a raw version of the data that has the full biological species name.
  • 5:21 - 5:34
    These come from biological taxonomy. There's another data set that is used for some things that is credit information for German loan applicants.
  • 5:34 - 5:40
    And it has various variables. One of its variables is the status of the applicants checking account.
  • 5:40 - 5:46
    And it has the codes A11 for overdrawn, A12 for between zero and 200 Deutsche Marks,
  • 5:46 - 5:53
    A13 for either at least 200 Deutsche Marks or they've had their salary deposited for a year.
  • 5:53 - 5:58
    And then A14 means no checking account. And so we've got these categorical codes.
  • 5:58 - 6:03
    If you see A12 in the data, you do not know what it means without looking at the codebook.
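
One way to keep a codebook next to the data is a plain lookup table. The sketch below is illustrative only: the names `CHECKING_STATUS` and `decode` are hypothetical, and the labels paraphrase the checking account codes described above rather than quoting the data set's documentation.

```python
# Hypothetical codebook for the checking account status variable
CHECKING_STATUS = {
    "A11": "overdrawn",
    "A12": "between 0 and 200 DM",
    "A13": "at least 200 DM, or salary deposited for at least a year",
    "A14": "no checking account",
}


def decode(code: str) -> str:
    """Look up the meaning of a code, failing loudly on unknown codes."""
    if code not in CHECKING_STATUS:
        raise ValueError(f"code {code!r} is not in the codebook")
    return CHECKING_STATUS[code]


print(decode("A12"))  # between 0 and 200 DM
```

Raising an error on an unknown code, instead of passing it through, makes undocumented values visible early.
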
  • 6:03 - 6:12
    Even if it does look obvious, it's good to look at the codebook. But when we actually go to record categorical data,
  • 6:12 - 6:19
    a lot of times, especially in the raw data, the data files that we load, it's going to be directly stored as a string or an integer.
  • 6:19 - 6:25
    We're going to have a column for the categorical variable and it has the value in there. But
  • 6:25 - 6:34
    for computational purposes, we're often going to need to encode it differently, because you can't compute on A13. There are a couple of different encodings.
  • 6:34 - 6:39
    One is one-hot encoding, where each different code or level gets a variable,
  • 6:39 - 6:44
    a logical variable, but we encode it with an integer zero or one.
  • 6:44 - 6:49
    And so for the German credit, we're going to have A11, A12, A13, A14.
  • 6:49 - 6:57
    They're all different variables. For the penguins, we would have three variables, one for each species, and exactly one of them is one.
  • 6:57 - 7:02
    So an Adelie penguin would have one in the Adelie variable and zero in Chinstrap and Gentoo.
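
One-hot encoding of the penguin species can be sketched in plain Python. The function name `one_hot` and the `SPECIES` list are hypothetical names chosen for this example.

```python
SPECIES = ["Adelie", "Chinstrap", "Gentoo"]


def one_hot(value, levels):
    """Return one 0/1 variable per level; exactly one of them is 1."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels}


print(one_hot("Adelie", SPECIES))
# {'Adelie': 1, 'Chinstrap': 0, 'Gentoo': 0}
```
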
  • 7:02 - 7:11
    Another option is what's called dummy coding, which is very, very similar, except one of the codes doesn't get a variable.
  • 7:11 - 7:16
    So all zeroes in the variables for the categorical variable
  • 7:16 - 7:21
    means it's the omitted one, and a one in any of them means that one.
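
Dummy coding differs from one-hot encoding only in leaving one level out. In the sketch below, the choice of Adelie as the omitted level is an assumption for illustration, and `dummy_code` is a hypothetical name.

```python
SPECIES = ["Adelie", "Chinstrap", "Gentoo"]


def dummy_code(value, levels, reference):
    """Return one 0/1 variable per level except the omitted reference level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels if level != reference}


# The omitted level is represented by all zeroes
print(dummy_code("Adelie", SPECIES, reference="Adelie"))
# {'Chinstrap': 0, 'Gentoo': 0}
print(dummy_code("Gentoo", SPECIES, reference="Adelie"))
# {'Chinstrap': 0, 'Gentoo': 1}
```

For tabular data, `pandas.get_dummies` with `drop_first=True` produces a similar result, dropping the first level of each categorical column.
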
  • 7:21 - 7:26
    Why we need that is going to come up when we start talking about linear modeling.
  • 7:26 - 7:33
    But it's a very common statistical way of encoding a variable. The variables that we use for this are called indicator variables.
  • 7:33 - 7:42
    So if we're transforming our penguins into either one-hot or dummy-coded variables, we say that we have an indicator variable for Chinstrap.
  • 7:42 - 7:46
    And it's one if the penguin is a Chinstrap and zero if it is not.
  • 7:46 - 7:54
    So data has to be coded and encoded in order for us to process and analyze it; we actually have to store it somehow.
  • 7:54 - 7:59
    And the process of coding affects the data that we have. When we do an analysis,
  • 7:59 - 8:13
    when we do an inference, the way that the data was coded affects how we view and how we understand the things that the data are actually about.
  • 8:13 - 8:16
    Sometimes this is relatively straightforward, as with the penguin species,
  • 8:16 - 8:24
    although the way the penguins got divided into their various species and got those species names is a historical social process.
  • 8:24 - 8:33
    But with other things that have very contested social definitions, such as how you indicate race,
  • 8:33 - 8:39
    it becomes a very strong lens that affects how we understand the underlying
  • 8:39 - 8:44
    reality that the data is supposed to represent. And that representation,
  • 8:44 - 8:45
    the coding, the codebook,
  • 8:45 - 8:57
    the definitions, need to be documented thoroughly in order for us to properly understand the data that we're working with.
Video Language:
English
Duration:
08:57
