https:/.../afc96ff2-0043-4387-af9b-ad9000efd6b0-753c0fcd-b2eb-4f7d-8d63-ad9101272f10.mp4?invocationId=c1ea84bb-2f09-ec11-a9e9-0a1a827ad0ec

  • 0:05 - 0:10
    In this video, I want to talk with you about variables, observations, and types.
  • 0:10 - 0:18
    In the previous video, we saw how to load a data file into pandas, how to see how many rows we have, and what Python types the data has.
  • 0:18 - 0:21
    But in this video, we're going to go from the Python types to the conceptual types,
  • 0:21 - 0:31
    and start to talk about what kinds of data we collect and we store in these pandas data frames.
  • 0:31 - 0:35
    So the learning outcomes for this video are to know the relationship between pandas
  • 0:35 - 0:41
    structures and statistical variables and to identify the type of a statistical variable.
  • 0:41 - 0:50
    So if we refer back to our data pipeline, we have observable phenomena that produce raw data:
  • 0:50 - 0:55
    raw observations that we then process into a data set.
  • 0:55 - 1:02
    So the core idea: when we have a data set that's processed and ready to use for a task,
  • 1:02 - 1:08
    what we usually have is a table of observations. Each row is one observation.
  • 1:08 - 1:14
    The datasheets reading calls this an instance. Sometimes this is called a sample.
  • 1:14 - 1:20
    And this is an observation of the values of one or more variables pertaining to a single object.
  • 1:20 - 1:25
    So I'm showing here three rows from a data set, the Palmer Penguins data set.
  • 1:25 - 1:32
    That's measurements of penguins in Antarctica. And so each one has several different variables.
  • 1:32 - 1:34
    Each row represents one penguin.
  • 1:34 - 1:42
    So the first row represents that we observed an Adelie penguin on Torgersen Island with a bill length of thirty-nine point one millimeters,
  • 1:42 - 1:51
    a bill depth of eighteen point seven, et cetera. We store this as a pandas data frame, with each variable as a column.
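
As a rough sketch of the "one row per observation, one column per variable" layout, here is a small hand-typed data frame. The column names are assumptions for illustration, and only the first row's species, island, bill length, and bill depth come from the lecture; the other rows are made up.

```python
import pandas as pd

# Hypothetical column names; only the first row's values are from the lecture.
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})

# shape is (number of observations, number of variables)
n_observations, n_variables = penguins.shape
print(n_observations, "observations of", n_variables, "variables")

# One row is one observation: the values of every variable for one penguin.
first = penguins.iloc[0]
print(first["species"], first["island"], first["bill_length_mm"])
```

Selecting a column, such as `penguins["bill_length_mm"]`, gives every observed value of one variable instead.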
  • 1:51 - 1:56
    But the variables have their own conceptual properties that we're going to start talking about here.
  • 1:56 - 2:03
    So for each penguin we have its species; there are three different species of penguin that are observed in this dataset. Island:
  • 2:03 - 2:06
    there are three different islands on which they're measured.
  • 2:06 - 2:15
    We have measurements of the penguin's bill, its length and depth; the length of its flipper; the body mass of the penguin; and then the penguin's sex.
  • 2:15 - 2:21
    Each data point, each observation is about one penguin. And we have these different observations.
  • 2:21 - 2:29
    We also have the documentation, which you'll find a link to online and in the slides.
  • 2:29 - 2:33
    The documentation tells us things like the bill length:
  • 2:33 - 2:39
    that's the length of the bill in millimeters. The penguin body mass is in grams.
  • 2:39 - 2:43
    And when we have a data set, it's important to document it.
  • 2:43 - 2:52
    When we have data, whether it's organized and curated and processed into a dataset or it's the raw observations,
  • 2:52 - 2:58
    the first-order things that we need to know, besides how much data we have,
  • 2:58 - 3:03
    are: OK, what are the columns? But also things like: what are the units? Without them we can't properly interpret
  • 3:03 - 3:09
    the bill length. A bill length of thirty-nine. Thirty-nine what? It's probably not thirty-nine feet.
  • 3:09 - 3:19
    That would be a very large penguin. So when we're producing data, we need to document all of these things, and when we're consuming data,
  • 3:19 - 3:28
    we need to find the answers to all of these things. So the variables we have can take on a variety of different types,
  • 3:28 - 3:35
    and the first type I want to talk about is a continuous variable. A continuous variable can take on any value, though
  • 3:35 - 3:39
    there might be a range that limits the values that the variable can take on.
  • 3:39 - 3:43
    Mathematically, continuous variables correspond to real numbers.
  • 3:43 - 3:50
    And the key idea here, what really makes it continuous, is that for any two values, we could have a value between them.
  • 3:50 - 3:58
    So if we have two penguins, let's say one has a flipper length of forty millimeters and another has a flipper length of forty-five.
  • 3:58 - 4:02
    We could have a penguin with a flipper length of forty two millimeters.
  • 4:02 - 4:08
    And no matter how close together they are, we could conceivably have a penguin with a flipper length that's in between them.
  • 4:08 - 4:13
    That's what makes it continuous. Now,
  • 4:13 - 4:19
    observations are often discretized, so even if we have something continuous like the penguin's flipper length,
  • 4:19 - 4:29
    often our observations will be discrete and noisy because we don't have infinitely precise rulers with which to measure penguin flippers.
  • 4:29 - 4:35
    But what makes it continuous is that, conceptually, it's a continuous variable,
  • 4:35 - 4:42
    even if our actual measurements of it might be discretized. Typically it's stored as a float.
  • 4:42 - 4:48
    Occasionally it'll be stored as an int. If we only measure to integer precision, we might store it as an int instead of a float.
  • 4:48 - 4:53
    It is important to note, though, that floating-point storage is also imprecise.
  • 4:53 - 5:00
    It's fine for the vast majority of what we're doing, because if you're taking any kind of a physical or natural measurement,
  • 5:00 - 5:03
    the measurement instrument is going to have some imprecision in it.
  • 5:03 - 5:12
    The imprecision of storing in a floating-point number is far less than the imprecision of most physical instruments,
  • 5:12 - 5:19
    unless you're doing something like high-energy physics in a particle accelerator. Most of us aren't doing that here.
  • 5:19 - 5:26
    So for most of our purposes, for measuring, say, physical quantities, we don't need to worry about the floating-point imprecision.
  • 5:26 - 5:34
    The one exception is for storing money. A discrete variable, on the other hand, takes on distinct values.
  • 5:34 - 5:44
    And most of our variables are basically going to fall into the category of continuous (real number) or discrete (almost everything else).
  • 5:44 - 5:49
    With a discrete variable, we can have values that have no intermediates.
  • 5:49 - 5:56
    If I have four eggs or five eggs, I mean, I could crack an egg and have half the contents.
  • 5:56 - 6:01
    But I'm talking about distinct eggs. I have four or five eggs.
  • 6:01 - 6:07
    Discrete variables might not have an order. There are many different types of discrete variables.
  • 6:07 - 6:11
    And I'm going to walk through some of those in these remaining slides.
  • 6:11 - 6:17
    So the first is an integer. Integers are discrete, and they're typically things like counts. Counting
  • 6:17 - 6:25
    something is our canonical example of an integer. They have an order: four is less than five.
  • 6:25 - 6:32
    But you can't have a value in the middle.
  • 6:32 - 6:37
    So, for example, the number of penguins measured. The size of something is usually an integer.
  • 6:37 - 6:48
    We often treat an integer continuously. So, for example, the American Veterinary Medical Association computed that for households that own cats,
  • 6:48 - 6:55
    the average number of cats per household is one point eight. Now, you can't actually have point eight of a cat.
  • 6:55 - 7:01
    So in terms of the individual observations, if you were going to observe how many cats are in each household,
  • 7:01 - 7:06
    the number of cats would be an integer.
  • 7:06 - 7:08
    You can't have point eight cats.
  • 7:08 - 7:17
    But we then can treat it as a continuous value so we can talk about the average number of cats per household and that's meaningful to talk about.
  • 7:17 - 7:22
    Even though nobody actually has one point eight cats. Integers are usually stored as ints,
  • 7:22 - 7:27
    and sometimes as floats, particularly
  • 7:27 - 7:34
    if you have missing values. By default, pandas can't represent missing values in an integer type,
  • 7:34 - 7:40
    so it will store those ints as floats if it finds any missing values.
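
A minimal sketch of that behavior, using made-up counts. The last line is an aside that is not in the lecture: recent pandas versions also offer an opt-in nullable `"Int64"` extension type that keeps integers alongside missing values.

```python
import pandas as pd

complete = pd.Series([2, 1, 3])        # no missing values
with_gap = pd.Series([2, None, 3])     # one missing value

print(complete.dtype)   # an integer dtype
print(with_gap.dtype)   # float64: the gap is stored as NaN
print(with_gap.isna().sum(), "missing value")

# Aside (not from the lecture): the opt-in nullable integer type.
nullable = pd.Series([2, None, 3], dtype="Int64")
print(nullable.dtype)
```

The values in `with_gap` are still conceptually counts; only their storage changed.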
  • 7:40 - 7:48
    A categorical variable takes on one of a fixed set of unordered values.
  • 7:48 - 7:52
    We can compare them for equality, but that's about all we can do.
  • 7:52 - 7:58
    There's no order, so we can't meaningfully sort them. We can't do arithmetic.
  • 7:58 - 8:04
    We can sort for convenience. The penguin species is one example of a categorical variable.
  • 8:04 - 8:09
    We can sort by species in alphabetical order. But that's just convention.
  • 8:09 - 8:16
    The convention of the English alphabet. It's not intrinsic to the meaning of the different penguin species.
  • 8:16 - 8:21
    That is, Adelie is not less than Chinstrap, which is not less than Gentoo.
  • 8:21 - 8:31
    They just happen to come in a particular order in the alphabet. Another example is the user in the MovieLens data we saw in the previous video.
  • 8:31 - 8:34
    The user ID column is an integer, but that's not a count.
  • 8:34 - 8:39
    It's a user, right? It's an identifier. It's actually a kind of categorical variable.
  • 8:39 - 8:48
    Which user in the system are we talking about? They're each assigned a numeric identifier for computational convenience.
  • 8:48 - 8:55
    But if we have user seventy-five and user three hundred forty-two, asking what is user seventy-
  • 8:55 - 8:59
    five plus user three hundred forty-two is a completely meaningless question.
  • 8:59 - 9:10
    They're categorical variables. There are no arithmetic or ordering operations that we can do between them.
  • 9:10 - 9:14
    Categorical variables can be stored as strings. They can be stored as integers.
  • 9:14 - 9:19
    Pandas also has a category type that's useful for storing categorical variables.
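
Here is a small sketch of the pandas category type, using made-up values. It shows that equality comparison and counting are the meaningful operations, and that integer identifiers can be marked as categorical too.

```python
import pandas as pd

# Species stored as a categorical instead of plain strings.
species = pd.Series(["Adelie", "Gentoo", "Chinstrap", "Adelie"], dtype="category")
print(species.cat.categories.tolist())   # the fixed set of possible values
print((species == "Adelie").tolist())    # equality is meaningful

# User IDs are integers, but they identify users; they are not counts.
user_ids = pd.Series([75, 342, 75]).astype("category")
n_users = user_ids.nunique()
print(n_users, "distinct users")
```

Counting how many distinct users there are makes sense; adding two user IDs together does not.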
  • 9:19 - 9:26
    A boolean or a logical variable is a special case of a categorical that has just the two values, true and false.
  • 9:26 - 9:32
    Usually it's stored as an int or a bool. And the bool is just a special-case type of int.
  • 9:32 - 9:38
    The convention typically is that one is true and zero is false.
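
A short sketch of that convention with a made-up logical variable: converting to int gives ones and zeros, so the sum counts the true values and the mean gives the fraction that are true.

```python
import pandas as pd

# Hypothetical logical variable: is each observed penguin an Adelie?
is_adelie = pd.Series([True, False, True, True])

as_ints = is_adelie.astype(int).tolist()   # True -> 1, False -> 0
n_true = int(is_adelie.sum())              # number of True values
frac_true = float(is_adelie.mean())        # fraction of True values

print(as_ints, n_true, frac_true)
```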
  • 9:38 - 9:44
    And then ordinal values: they're like categorical, in that they take on one of a fixed set of values.
  • 9:44 - 9:54
    But those values are ordered. This is the key difference between an ordinal and a categorical. And so we can order them; we can compare for inequality.
  • 9:54 - 10:01
    A few examples of these are class grades. A is a better grade than B.
  • 10:01 - 10:04
    They're in this order. But you can't directly do math on them.
  • 10:04 - 10:08
    A minus B? We can assign numbers to them to try to do math.
  • 10:08 - 10:12
    But they're just in an order. Another example is a Likert scale.
  • 10:12 - 10:22
    If you've taken a survey that asks you to rate something from strongly disagree to strongly agree, that is ordinal.
  • 10:22 - 10:29
    Also a movie rating: if you go and you rate a movie, or if you rate a product, five stars, four stars on Netflix.
  • 10:29 - 10:35
    Well, you're using a number, but it's ordinal in the sense that
  • 10:35 - 10:42
    we know that if you rate something four stars, you're saying you like it more than you like a three-star thing.
  • 10:42 - 10:47
    But we don't know, if you have a five-star and a four-star and a three-star movie:
  • 10:47 - 10:55
    is the five-star movie just as much better than the four-star as the four-star is than the three-star?
  • 10:55 - 10:58
    Or does it just tell us which order they're in? Intrinsically,
  • 10:58 - 11:03
    all it tells us is the order that you put these movies in.
  • 11:03 - 11:10
    But sometimes we do arithmetic anyway, like Amazon computes the average rating for a product,
  • 11:10 - 11:14
    even though ratings are ordinal. Or you have a GPA.
  • 11:14 - 11:23
    That's computing the average of your ordinal grades. They're stored as ints or floats, or as strings with an externally defined order.
  • 11:23 - 11:28
    If we have an ordinal variable stored in a string,
  • 11:28 - 11:32
    we have to have something that tells us what order those values actually go in.
  • 11:32 - 11:39
    Also, the pandas category type has an ordered mode. You can tell it, this is a categorical variable and it is ordered.
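
As a hedged sketch of the ordered mode, here are some made-up letter grades stored as strings, with the order supplied explicitly. The particular grade list is an assumption for illustration.

```python
import pandas as pd

# The order is defined by us, worst to best; pandas cannot infer it from the strings.
grade_order = ["F", "D", "C", "B", "A"]
grades = pd.Series(
    pd.Categorical(["B", "A", "C", "A"], categories=grade_order, ordered=True)
)

best = grades.max()                        # uses the declared order, not the alphabet
above_c = (grades > "C").tolist()          # ordering comparisons are allowed
ranked = grades.sort_values().tolist()     # sorted worst to best

print(best, above_c, ranked)
```

Without the declared order, sorting these strings alphabetically would put A first as the smallest value, which is the opposite of what the grades mean.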
  • 11:39 - 11:45
    There are other types of data we're going to encounter. Time we usually are going to treat as continuous.
  • 11:45 - 11:47
    It might be stored written out as a date,
  • 11:47 - 11:53
    but if we're actually going to work with time, we're probably going to convert it into a continuous variable.
  • 11:53 - 12:02
    Common encodings for that are either a number of seconds or a number of years, sometimes a number of milliseconds.
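
One way to sketch the seconds encoding, assuming the times arrive as written-out strings and that we count from the common Unix epoch of 1970-01-01. Both the example timestamps and the choice of reference point are assumptions.

```python
import pandas as pd

# Times written out as text, parsed into pandas timestamps.
stamps = pd.Series(pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 00:01:30"]))

# Continuous encoding: seconds since a reference point.
epoch = pd.Timestamp("1970-01-01")
seconds = (stamps - epoch).dt.total_seconds()

# Seconds elapsed since the first observation is often easier to read.
elapsed = seconds - seconds.iloc[0]
print(seconds.tolist(), elapsed.tolist())
```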
  • 12:02 - 12:07
    Text is categorical-ish, but we usually convert it into categorical or count variables.
  • 12:07 - 12:15
    We'll talk more about that later, when we actually get to processing text. Images are stored as matrices of ints or reals.
  • 12:15 - 12:21
    We may also extract features from them that become other kinds of variables. Money is often stored as an int or a float,
  • 12:21 - 12:25
    but we have to be careful because the imprecision of floating point numbers can
  • 12:25 - 12:30
    cause a problem when we're using money for the purposes
  • 12:30 - 12:35
    of creating financial transactions. For the kinds of things we might be doing with money here in this class,
  • 12:35 - 12:40
    it's not going to be a problem. Nobody loses money if you have a little imprecision in computing
  • 12:40 - 12:46
    the average price of a ton of potatoes.
  • 12:46 - 12:51
    It really becomes a problem when you feed that back into actual financial
  • 12:51 - 12:57
    transaction systems, because it is impossible to precisely represent 10 cents (0.10 dollars) in binary floating point:
  • 12:57 - 13:02
    it's just a hair under or over 10 cents when you store it in a floating-point value.
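
A quick sketch of the ten-cents problem using only the Python standard library. Storing whole cents as integers, or using `decimal.Decimal`, are two common workarounds; they are offered here as illustrations, not as something the lecture prescribes.

```python
from decimal import Decimal

# 0.10 dollars has no exact binary floating-point representation.
stored = Decimal(0.1)            # the exact value the float actually holds
print(stored)                    # slightly more than 0.1

total = 0.1 + 0.2                # ten cents plus twenty cents, as floats
print(total == 0.3)              # False
print(total)                     # 0.30000000000000004

# Workaround 1: keep money as an integer number of cents.
total_cents = 10 + 20

# Workaround 2: exact decimal arithmetic built from strings.
exact = Decimal("0.10") + Decimal("0.20")
print(total_cents, exact)
```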
  • 13:02 - 13:04
    So the other thing I want to highlight here, though,
  • 13:04 - 13:14
    is that knowing the Python or NumPy or pandas data type of a variable is not sufficient to know its type in a statistical sense.
  • 13:14 - 13:18
    Suppose we have a variable that's an int64. Well, what's in that?
  • 13:18 - 13:23
    Is it a categorical variable like our movie and user IDs in MovieLens?
  • 13:23 - 13:31
    Is it a continuous variable that happens to be measured with integer precision? In the penguins data set, when you download it,
  • 13:31 - 13:34
    the body mass is integers because they were just measuring to the whole gram.
  • 13:34 - 13:41
    They didn't measure fractional grams. But conceptually, mass really is a continuous value.
  • 13:41 - 13:45
    It's just that our measurements aren't that precise.
  • 13:45 - 13:51
    Is it ordinal? Or is it zeros and ones that are representing a logical variable?
  • 13:51 - 13:55
    As I said before, integers with missing values also load as floats.
  • 13:55 - 14:02
    So we can look at the data types, and we can look at the data itself, to try to start to get a sense of what it is.
  • 14:02 - 14:10
    But knowing that is not sufficient to know what type of data we're dealing with for the purposes of handling it properly.
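
To make the point concrete, here is a sketch with four made-up columns that all share the same pandas dtype but have four different statistical types. The column names, values, and the `statistical_type` dictionary are hypothetical; the dictionary stands in for what the documentation has to tell us.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [75, 342, 75],           # identifier
    "body_mass_g": [3750, 3800, 3250],  # mass measured to the whole gram
    "stars": [5, 3, 4],                 # rating
    "is_female": [0, 1, 1],             # zeros and ones
})

# Every column has the same storage type.
print(df.dtypes)

# The statistical type has to come from documentation, not from the dtype.
statistical_type = {
    "user_id": "categorical",
    "body_mass_g": "continuous",
    "stars": "ordinal",
    "is_female": "logical",
}
```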
  • 14:10 - 14:18
    I want to talk also a little bit about entities or instances, which I introduced last time and which are talked about in the reading.
  • 14:10 - 14:18
    So we want to be clear, when we have a data frame, a data table:
  • 14:18 - 14:24
    what are the things being observed in this set of observations? If you've taken the databases class,
  • 14:24 - 14:31
    we called these entities. Sometimes this is pretty straightforward.
  • 14:31 - 14:36
    For example, in this penguin dataset, each row represents the measurements of one penguin.
  • 14:36 - 14:40
    But sometimes they're complex and linked, such as in the rating data table.
  • 14:43 - 14:49
    Each instance is a rating. But that's a rating about movies.
  • 14:49 - 14:58
    And we can also derive things. For example, we could count the number of ratings for each movie, and that could be a variable for a movie.
  • 14:58 - 15:02
    We could do this aggregation. We're going to see how to do aggregations in a little bit.
  • 15:02 - 15:11
    That gives us a new variable, number of ratings, that becomes a variable for observations of movies.
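
A minimal sketch of that aggregation, using a tiny made-up ratings table. The column names and the `n_ratings` name are assumptions for illustration.

```python
import pandas as pd

# Each row is one rating: a user rated a movie.
ratings = pd.DataFrame({
    "user_id": [75, 75, 342, 342, 342],
    "movie_id": [1, 2, 1, 2, 3],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Aggregate ratings into one value per movie: how many ratings it received.
movie_stats = ratings.groupby("movie_id")["rating"].count().rename("n_ratings")
print(movie_stats)
```

The result has one row per movie, so the thing being observed has changed from a rating to a movie.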
  • 15:11 - 15:11
    So to wrap up,
  • 15:11 - 15:19
    there are many different kinds of variables, broadly divided into continuous and discrete, with several specific types of discrete variables.
  • 15:19 - 15:24
    These conceptual variable types do not map one-to-one to pandas data types.
  • 15:24 - 15:30
    You need more information in order to know how to properly interpret and work with a variable.
  • 15:30 - 15:34
    So the data source that you're working with needs to be documented.
  • 15:34 - 15:47
    And if you're creating a data source, you need to document what all of the columns mean and how they're being encoded and stored.
Title:
https:/.../afc96ff2-0043-4387-af9b-ad9000efd6b0-753c0fcd-b2eb-4f7d-8d63-ad9101272f10.mp4?invocationId=c1ea84bb-2f09-ec11-a9e9-0a1a827ad0ec
Video Language:
English
Duration:
15:47
