[Script Info] Title: [Events] Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text Dialogue: 0,0:00:04.86,0:00:09.66,Default,,0000,0000,0000,,In this video, I want to talk with you about variables, observations, and types. Dialogue: 0,0:00:09.66,0:00:17.87,Default,,0000,0000,0000,,In the previous video we saw how to load a data file into pandas, how to see how many rows we have, and what Python types the data has. Dialogue: 0,0:00:17.87,0:00:21.47,Default,,0000,0000,0000,,In this video, we're going to go from the Python types to the conceptual types, Dialogue: 0,0:00:21.47,0:00:30.56,Default,,0000,0000,0000,,and start to talk about what kinds of data we collect and store in these pandas data frames. Dialogue: 0,0:00:30.56,0:00:34.58,Default,,0000,0000,0000,,So the learning outcomes for this video are to know the relationship between pandas Dialogue: 0,0:00:34.58,0:00:41.15,Default,,0000,0000,0000,,structures and statistical variables and to identify the type of a statistical variable. Dialogue: 0,0:00:41.15,0:00:49.63,Default,,0000,0000,0000,,So if we refer back to our data pipeline, we have observable phenomena that produce raw data, Dialogue: 0,0:00:49.63,0:00:54.54,Default,,0000,0000,0000,,raw observations that we then process into a data set. Dialogue: 0,0:00:54.54,0:01:01.80,Default,,0000,0000,0000,,So the core idea is that when we have a data set that's processed and ready to use for a task, Dialogue: 0,0:01:01.80,0:01:07.98,Default,,0000,0000,0000,,what we usually have is a table of observations. Each row is one observation. Dialogue: 0,0:01:07.98,0:01:14.04,Default,,0000,0000,0000,,The datasheets reading calls this an instance. Sometimes this is called a sample. Dialogue: 0,0:01:14.04,0:01:19.50,Default,,0000,0000,0000,,And this is an observation of the values of one or more variables pertaining to a single object.
Dialogue: 0,0:01:19.50,0:01:24.90,Default,,0000,0000,0000,,So I'm showing here three rows from a data set, the Palmer Penguins data set. Dialogue: 0,0:01:24.90,0:01:32.31,Default,,0000,0000,0000,,That's measurements of penguins in Antarctica. And so each one has several different variables. Dialogue: 0,0:01:32.31,0:01:33.91,Default,,0000,0000,0000,,Each row represents one penguin. Dialogue: 0,0:01:33.91,0:01:41.73,Default,,0000,0000,0000,,So the first row represents that we observed an Adelie penguin on Torgersen Island with a bill length of 39.1 millimeters, Dialogue: 0,0:01:41.73,0:01:50.52,Default,,0000,0000,0000,,a depth of 18.7, et cetera. We store this as a pandas data frame, with each variable as a column. Dialogue: 0,0:01:50.52,0:01:56.01,Default,,0000,0000,0000,,But the variables have their own conceptual properties that we're going to start talking about here. Dialogue: 0,0:01:56.01,0:02:03.26,Default,,0000,0000,0000,,So for each penguin we have its species; there are three different species of penguin observed in this data set. Then island: Dialogue: 0,0:02:03.26,0:02:05.61,Default,,0000,0000,0000,,there are three different islands on which they're measured. Dialogue: 0,0:02:05.61,0:02:15.18,Default,,0000,0000,0000,,We have measurements of the penguin's bill, its length and depth, the length of its flipper, the body mass of the penguin, and then the penguin's sex. Dialogue: 0,0:02:15.18,0:02:21.09,Default,,0000,0000,0000,,Each data point, each observation, is about one penguin. And we have these different observations. Dialogue: 0,0:02:21.09,0:02:28.98,Default,,0000,0000,0000,,We also have the documentation, which you'll find a link to online and in the slides. Dialogue: 0,0:02:28.98,0:02:32.97,Default,,0000,0000,0000,,The documentation tells us things like the bill length: Dialogue: 0,0:02:32.97,0:02:38.91,Default,,0000,0000,0000,,that's the length of the bill in millimeters. The penguin body mass is in grams.
Dialogue: 0,0:02:38.91,0:02:42.84,Default,,0000,0000,0000,,And when we have a data set, it's important to document it. Dialogue: 0,0:02:42.84,0:02:52.42,Default,,0000,0000,0000,,When we have data, whether it's organized and curated and processed into a data set or it's the raw observations, Dialogue: 0,0:02:52.42,0:02:57.61,Default,,0000,0000,0000,,the first-order things that we need to know, besides how much data we have, Dialogue: 0,0:02:57.61,0:03:02.89,Default,,0000,0000,0000,,are: OK, what are the columns? But also things like what are the units? Without them we can't properly interpret Dialogue: 0,0:03:02.89,0:03:08.74,Default,,0000,0000,0000,,the bill length. A bill length of thirty nine: thirty nine what? It's probably not thirty nine feet. Dialogue: 0,0:03:08.74,0:03:19.24,Default,,0000,0000,0000,,That would be a very large penguin. So when we're producing data, we need to document all of these things, and when we're consuming data, Dialogue: 0,0:03:19.24,0:03:27.65,Default,,0000,0000,0000,,we need to find the answers to all of these things. So the variables we have can take on a variety of different types, Dialogue: 0,0:03:27.65,0:03:34.58,Default,,0000,0000,0000,,and the first type I want to talk about is a continuous variable. A continuous variable can take on any value, though Dialogue: 0,0:03:34.58,0:03:39.20,Default,,0000,0000,0000,,there might be a range that limits the values that the variable can take on. Dialogue: 0,0:03:39.20,0:03:42.83,Default,,0000,0000,0000,,Mathematically, continuous variables correspond to real numbers. Dialogue: 0,0:03:42.83,0:03:49.97,Default,,0000,0000,0000,,And the key idea here, what really makes it continuous, is that for any two values, we could have a value between them. Dialogue: 0,0:03:49.97,0:03:58.04,Default,,0000,0000,0000,,So if we have two penguins, let's say one has a flipper length of 40 millimeters and another has a flipper length of 45.
Dialogue: 0,0:03:58.04,0:04:02.30,Default,,0000,0000,0000,,We could have a penguin with a flipper length of 42 millimeters. Dialogue: 0,0:04:02.30,0:04:08.24,Default,,0000,0000,0000,,And no matter how close together they are, we could conceivably have a penguin with a flipper length that's in between them. Dialogue: 0,0:04:08.24,0:04:12.96,Default,,0000,0000,0000,,That's what makes it continuous. Now, Dialogue: 0,0:04:12.96,0:04:19.17,Default,,0000,0000,0000,,observations are often discretized, so even if we have something continuous like the penguin's flipper length, Dialogue: 0,0:04:19.17,0:04:29.19,Default,,0000,0000,0000,,often our observations will be discrete and noisy because we don't have infinitely precise rulers with which to measure penguin flippers. Dialogue: 0,0:04:29.19,0:04:35.42,Default,,0000,0000,0000,,But what makes it continuous is that, conceptually, it's a continuous variable, Dialogue: 0,0:04:35.42,0:04:41.75,Default,,0000,0000,0000,,even if our actual measurements of it might be discretized. Typically it's stored as a float. Dialogue: 0,0:04:41.75,0:04:48.41,Default,,0000,0000,0000,,Occasionally it'll be stored as an int: if we only measure to integer precision, we might store it as an int instead of a float. Dialogue: 0,0:04:48.41,0:04:53.30,Default,,0000,0000,0000,,It is important to note, though, that floating point storage is also imprecise. Dialogue: 0,0:04:53.30,0:05:00.36,Default,,0000,0000,0000,,It's fine for the vast majority of what we're doing because if you're taking any kind of a physical or natural measurement, Dialogue: 0,0:05:00.36,0:05:03.30,Default,,0000,0000,0000,,the measurement instrument is going to have some imprecision in it. Dialogue: 0,0:05:03.30,0:05:11.88,Default,,0000,0000,0000,,The imprecision of storing in a floating point number is far less than the imprecision of most physical instruments.
Dialogue: 0,0:05:11.88,0:05:19.29,Default,,0000,0000,0000,,Unless you're doing something like high energy physics in a particle accelerator, and most of us aren't doing that here. Dialogue: 0,0:05:19.29,0:05:26.31,Default,,0000,0000,0000,,So for most of our purposes, for measuring, say, physical quantities, we don't need to worry about the floating point imprecision. Dialogue: 0,0:05:26.31,0:05:34.41,Default,,0000,0000,0000,,The one exception is for storing money. A discrete variable, on the other hand, takes on distinct values. Dialogue: 0,0:05:34.41,0:05:43.91,Default,,0000,0000,0000,,And most of our variables are basically going to fall into the category of continuous, the real numbers, or discrete, almost everything else. Dialogue: 0,0:05:43.91,0:05:48.94,Default,,0000,0000,0000,,A discrete variable can have values that have no intermediates. Dialogue: 0,0:05:48.94,0:05:55.93,Default,,0000,0000,0000,,If I have four eggs or five eggs, I mean, I could crack an egg and have half the contents, Dialogue: 0,0:05:55.93,0:06:01.21,Default,,0000,0000,0000,,but I'm talking about distinct eggs. I have four or five eggs. Dialogue: 0,0:06:01.21,0:06:07.00,Default,,0000,0000,0000,,Discrete variables might not have an order. There are many different types of discrete variables, Dialogue: 0,0:06:07.00,0:06:11.10,Default,,0000,0000,0000,,and I'm going to walk through some of those in these remaining slides. Dialogue: 0,0:06:11.10,0:06:17.20,Default,,0000,0000,0000,,So the first is an integer. Integers are discrete and they're typically things like counts. Counting Dialogue: 0,0:06:17.20,0:06:24.59,Default,,0000,0000,0000,,something is our canonical example of an integer. They have an order: four is less than five. Dialogue: 0,0:06:24.59,0:06:32.40,Default,,0000,0000,0000,,But you can't have a value in the middle.
Dialogue: 0,0:06:32.40,0:06:37.14,Default,,0000,0000,0000,,So, for example, the number of penguins measured. The size of something is usually an integer. Dialogue: 0,0:06:37.14,0:06:47.91,Default,,0000,0000,0000,,We often treat an integer as continuous. So, for example, the American Veterinary Medical Association computed that, for households that own cats, Dialogue: 0,0:06:47.91,0:06:54.92,Default,,0000,0000,0000,,the average number of cats per household is 1.8. Now, you can't actually have 0.8 of a cat. Dialogue: 0,0:06:54.92,0:07:00.98,Default,,0000,0000,0000,,So in terms of the individual observations, if you were going to observe how many cats are in each household, Dialogue: 0,0:07:00.98,0:07:06.35,Default,,0000,0000,0000,,the number of cats would be an integer. Dialogue: 0,0:07:06.35,0:07:08.42,Default,,0000,0000,0000,,You can't have 0.8 cats. Dialogue: 0,0:07:08.42,0:07:16.64,Default,,0000,0000,0000,,But we can then treat it as a continuous value, so we can talk about the average number of cats per household, and that's meaningful to talk about Dialogue: 0,0:07:16.64,0:07:21.59,Default,,0000,0000,0000,,even though nobody actually has 1.8 cats. Integers are usually stored as ints, Dialogue: 0,0:07:21.59,0:07:27.41,Default,,0000,0000,0000,,and sometimes as floats, particularly Dialogue: 0,0:07:27.41,0:07:33.74,Default,,0000,0000,0000,,if you have missing values: pandas can't represent missing values in its default integer type, Dialogue: 0,0:07:33.74,0:07:39.67,Default,,0000,0000,0000,,so it will store those ints as floats if it finds any missing values. Dialogue: 0,0:07:39.67,0:07:47.56,Default,,0000,0000,0000,,A categorical variable takes on one of a fixed set of unordered values. Dialogue: 0,0:07:47.56,0:07:51.85,Default,,0000,0000,0000,,We can compare them for equality, but that's about all we can do. Dialogue: 0,0:07:51.85,0:07:57.80,Default,,0000,0000,0000,,There's no order, so we can't sort.
We can't do arithmetic. Dialogue: 0,0:07:57.80,0:08:04.36,Default,,0000,0000,0000,,We can sort for convenience. The penguin species is one example of a categorical variable. Dialogue: 0,0:08:04.36,0:08:09.49,Default,,0000,0000,0000,,We can sort by species in alphabetical order, but that's just convention, Dialogue: 0,0:08:09.49,0:08:16.50,Default,,0000,0000,0000,,the convention of the English alphabet. It's not intrinsic to the meaning of the different penguin species. Dialogue: 0,0:08:16.50,0:08:20.71,Default,,0000,0000,0000,,That is to say, Adelie is not less than Chinstrap is not less than Gentoo. Dialogue: 0,0:08:20.71,0:08:30.52,Default,,0000,0000,0000,,They just happen to come in a particular order in the alphabet. Another example is the user in the MovieLens data we saw in the previous video. Dialogue: 0,0:08:30.52,0:08:34.12,Default,,0000,0000,0000,,The user ID column is an integer, but that's not a count. Dialogue: 0,0:08:34.12,0:08:38.83,Default,,0000,0000,0000,,It's a user, right? It's an identifier. It's actually a kind of categorical variable: Dialogue: 0,0:08:38.83,0:08:47.77,Default,,0000,0000,0000,,which user in the system are we talking about? They're each assigned a numeric identifier for computational convenience. Dialogue: 0,0:08:47.77,0:08:55.20,Default,,0000,0000,0000,,But if we have user 75 and user 342, asking what is user 75 Dialogue: 0,0:08:55.20,0:08:59.32,Default,,0000,0000,0000,,plus user 342 is a completely meaningless question. Dialogue: 0,0:08:59.32,0:09:09.68,Default,,0000,0000,0000,,They're categorical variables. There are no arithmetic or ordering operations that we can meaningfully do between them. Dialogue: 0,0:09:09.68,0:09:13.54,Default,,0000,0000,0000,,Categorical variables can be stored as strings. They can be stored as integers.
Dialogue: 0,0:09:13.54,0:09:18.53,Default,,0000,0000,0000,,pandas also has a category type that's useful for storing categorical variables. Dialogue: 0,0:09:18.53,0:09:25.97,Default,,0000,0000,0000,,A boolean or logical variable is a special case of a categorical that has just the two values true and false. Dialogue: 0,0:09:25.97,0:09:32.03,Default,,0000,0000,0000,,Usually it's stored as an int or a bool, and the bool is just a special case of int. Dialogue: 0,0:09:32.03,0:09:37.95,Default,,0000,0000,0000,,The convention typically is that one is true and zero is false. Dialogue: 0,0:09:37.95,0:09:44.43,Default,,0000,0000,0000,,And then ordinal values: like categorical, they take on one of a fixed set of values, Dialogue: 0,0:09:44.43,0:09:54.29,Default,,0000,0000,0000,,but those values are ordered. This is the key difference between an ordinal and a categorical, and so we can order them as well as compare for equality. Dialogue: 0,0:09:54.29,0:10:00.80,Default,,0000,0000,0000,,A few examples of these are class grades. A is a better grade than B. Dialogue: 0,0:10:00.80,0:10:04.04,Default,,0000,0000,0000,,They're in this order. But you can't directly do math on them, Dialogue: 0,0:10:04.04,0:10:07.70,Default,,0000,0000,0000,,like A minus B. We can assign numbers to them to try to do math, Dialogue: 0,0:10:07.70,0:10:12.32,Default,,0000,0000,0000,,but they're just in an order. A Likert scale, Dialogue: 0,0:10:12.32,0:10:21.77,Default,,0000,0000,0000,,if you've taken a survey that asks you to rate from strongly disagree to strongly agree with something, that is ordinal. Dialogue: 0,0:10:21.77,0:10:29.10,Default,,0000,0000,0000,,Also a movie rating: if you go and you rate a movie or a product, five stars, four stars on Netflix, Dialogue: 0,0:10:29.10,0:10:35.30,Default,,0000,0000,0000,,you're using a number, but it's ordinal in the sense that
Dialogue: 0,0:10:35.30,0:10:42.02,Default,,0000,0000,0000,,we know that if you give something four stars, you're saying you like it more than you like a three star thing. Dialogue: 0,0:10:42.02,0:10:47.33,Default,,0000,0000,0000,,But suppose you have a five star, a four star, and a three star movie. Dialogue: 0,0:10:47.33,0:10:54.53,Default,,0000,0000,0000,,Is the five star movie just as much better than the four star as the four star is than the three star? Dialogue: 0,0:10:54.53,0:10:58.43,Default,,0000,0000,0000,,Or does it just tell us which order they're in? Intrinsically, Dialogue: 0,0:10:58.43,0:11:02.54,Default,,0000,0000,0000,,all it tells us is the order that you put these movies in. Dialogue: 0,0:11:02.54,0:11:09.56,Default,,0000,0000,0000,,But sometimes we do arithmetic anyway, like Amazon computes the average rating for a product Dialogue: 0,0:11:09.56,0:11:13.67,Default,,0000,0000,0000,,even though ratings are ordinal. Or you have a GPA: Dialogue: 0,0:11:13.67,0:11:23.39,Default,,0000,0000,0000,,that's computing the average of your ordinal grades. They're stored as ints, floats, or strings with an externally defined ordering. Dialogue: 0,0:11:23.39,0:11:27.56,Default,,0000,0000,0000,,If we have an ordinal variable stored in a string, Dialogue: 0,0:11:27.56,0:11:31.73,Default,,0000,0000,0000,,we have to have something that tells us what order those values actually go in. Dialogue: 0,0:11:31.73,0:11:39.16,Default,,0000,0000,0000,,Also, the pandas category type has an ordered mode. You can tell it, this is a categorical variable and it is ordered. Dialogue: 0,0:11:39.16,0:11:45.28,Default,,0000,0000,0000,,There are other types of data we're going to encounter. Time we usually are going to treat as continuous. Dialogue: 0,0:11:45.28,0:11:47.41,Default,,0000,0000,0000,,It might be stored written out as a date.
Dialogue: 0,0:11:47.41,0:11:52.87,Default,,0000,0000,0000,,But if we're actually going to work with time, we're probably going to convert it into a continuous variable. Dialogue: 0,0:11:52.87,0:12:01.51,Default,,0000,0000,0000,,Common encodings for that are either a number of seconds or a number of years, sometimes a number of milliseconds. Dialogue: 0,0:12:01.51,0:12:06.88,Default,,0000,0000,0000,,Text is categorical-ish, but we usually convert it into categorical or count variables. Dialogue: 0,0:12:06.88,0:12:14.65,Default,,0000,0000,0000,,We'll talk more about that later, when we actually get to processing text. Images are stored as matrices of ints or reals. Dialogue: 0,0:12:14.65,0:12:20.89,Default,,0000,0000,0000,,We may also extract features from them that become other kinds of variables. Money is often stored as an int or a float, Dialogue: 0,0:12:20.89,0:12:25.18,Default,,0000,0000,0000,,but we have to be careful because the imprecision of floating point numbers can Dialogue: 0,0:12:25.18,0:12:29.57,Default,,0000,0000,0000,,cause a problem when we're using money for the purposes of Dialogue: 0,0:12:29.57,0:12:34.56,Default,,0000,0000,0000,,creating financial transactions. For the kinds of things we might be doing with money here in this class, Dialogue: 0,0:12:34.56,0:12:40.24,Default,,0000,0000,0000,,it's not going to be a problem. Nobody loses money if you have a little imprecision in computing Dialogue: 0,0:12:40.24,0:12:46.05,Default,,0000,0000,0000,,the average price of a ton of potatoes. Dialogue: 0,0:12:46.05,0:12:50.95,Default,,0000,0000,0000,,It really becomes a problem when you feed that back into actual financial Dialogue: 0,0:12:50.95,0:12:56.74,Default,,0000,0000,0000,,transaction systems, because it is impossible to precisely represent 10 cents as a binary floating point number of dollars. Dialogue: 0,0:12:56.74,0:13:01.78,Default,,0000,0000,0000,,It's just a hair under or over 10 cents when you store it in a floating point value.
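To make the point about money and floating point concrete, here is a minimal sketch using only the Python standard library. The amounts are made up for illustration, and storing whole cents as integers or using Decimal are two common workarounds assumed here, not something prescribed in the lecture.

```python
from decimal import Decimal

# A binary float cannot hold 0.10 exactly; converting the float to Decimal
# reveals the value that is actually stored.
ten_cents = 0.10
stored = Decimal(ten_cents)
print(stored)

# Adding the float three times does not land exactly on 0.30.
total_float = sum([0.10] * 3)
print(total_float == 0.30)

# Workaround 1 (assumed convention): count whole cents as integers.
total_cents = sum([10] * 3)

# Workaround 2: exact decimal arithmetic, built from strings rather than floats.
total_decimal = sum([Decimal("0.10")] * 3, Decimal("0"))
print(total_cents, total_decimal)
```

For averages and other summary statistics the float version is fine, as the lecture says; the exact representations matter when the numbers feed back into transactions.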
Dialogue: 0,0:13:01.78,0:13:04.15,Default,,0000,0000,0000,,So the other thing I want to highlight here, though, Dialogue: 0,0:13:04.15,0:13:14.05,Default,,0000,0000,0000,,is that knowing the Python, NumPy, or pandas data type of a variable is not sufficient to know its type in a statistical sense. Dialogue: 0,0:13:14.05,0:13:18.37,Default,,0000,0000,0000,,Suppose we have a variable that's an int64. Well, what's in that? Dialogue: 0,0:13:18.37,0:13:22.93,Default,,0000,0000,0000,,Is it a categorical variable like our movie ID and user ID in MovieLens? Dialogue: 0,0:13:22.93,0:13:30.79,Default,,0000,0000,0000,,Is it a continuous variable that happens to be measured with integer precision? In the penguins data set, when you download it, Dialogue: 0,0:13:30.79,0:13:34.36,Default,,0000,0000,0000,,the body mass is integers because they were just measuring to the whole gram. Dialogue: 0,0:13:34.36,0:13:41.11,Default,,0000,0000,0000,,They didn't measure fractional grams. But conceptually, mass really is a continuous value. Dialogue: 0,0:13:41.11,0:13:45.45,Default,,0000,0000,0000,,It's just that our measurements aren't that precise. Dialogue: 0,0:13:45.45,0:13:51.22,Default,,0000,0000,0000,,Is it ordinal? Or is it zeros and ones that are representing a logical variable? Dialogue: 0,0:13:51.22,0:13:54.94,Default,,0000,0000,0000,,As I said before, integers with missing values also load as floats. Dialogue: 0,0:13:54.94,0:14:02.35,Default,,0000,0000,0000,,So we can look at the data types, we can look at the data itself to try to start to get a sense of what it is. Dialogue: 0,0:14:02.35,0:14:10.41,Default,,0000,0000,0000,,But knowing that is not sufficient to know what type of data we're dealing with for the purposes of handling it properly.
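As a rough illustration of why the stored dtype does not settle the statistical type, the sketch below builds a small pandas data frame. The column names and values are hypothetical toy data, not rows from the penguins or MovieLens files.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
frame = pd.DataFrame({
    "user_id": [75, 342, 75],            # identifier: categorical, stored as integers
    "body_mass_g": [3750, 3800, 3250],   # continuous, measured to the whole gram
    "n_ratings": [12, None, 7],          # count with one missing value
})

# Map each column name to the name of its pandas dtype.
dtypes = frame.dtypes.astype(str).to_dict()
print(dtypes)

# user_id and body_mass_g share a dtype even though one is categorical and
# the other is conceptually continuous; n_ratings is a count, but the
# missing value makes pandas store it as a float.
same_dtype = dtypes["user_id"] == dtypes["body_mass_g"]
print(same_dtype, dtypes["n_ratings"])
```

The dtype is a starting hint; the documentation for the data source is what tells you whether a column is an identifier, a count, or a measurement.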
Dialogue: 0,0:14:10.41,0:14:17.50,Default,,0000,0000,0000,,I want to talk a little bit about entities or instances, which I introduced last time and which are talked about in the reading. Dialogue: 0,0:14:17.50,0:14:23.92,Default,,0000,0000,0000,,So we want to be clear, when we have a data frame, a data table: Dialogue: 0,0:14:23.92,0:14:30.88,Default,,0000,0000,0000,,what are the things being observed in this set of observations? What's being observed? If you've taken the databases class, Dialogue: 0,0:14:30.88,0:14:35.71,Default,,0000,0000,0000,,we called these entities. Sometimes this is pretty straightforward. Dialogue: 0,0:14:35.71,0:14:39.82,Default,,0000,0000,0000,,For example, in this penguin data set, each row represents the measurements of one penguin. Dialogue: 0,0:14:39.82,0:14:43.48,Default,,0000,0000,0000,,But sometimes they're complex and linked, such as in the rating data table. Dialogue: 0,0:14:43.48,0:14:48.58,Default,,0000,0000,0000,,Each instance is a rating, but that's a rating about a movie. Dialogue: 0,0:14:48.58,0:14:57.72,Default,,0000,0000,0000,,And we can also derive things: we could count the number of ratings for each movie, and that could be a variable for a movie. Dialogue: 0,0:14:57.72,0:15:02.37,Default,,0000,0000,0000,,We could do this aggregation. We're going to see how to do aggregations in a little bit. Dialogue: 0,0:15:02.37,0:15:10.62,Default,,0000,0000,0000,,That gives us a new variable, number of ratings, that becomes a variable for observations of movies. Dialogue: 0,0:15:10.62,0:15:11.22,Default,,0000,0000,0000,,So to wrap up, Dialogue: 0,0:15:11.22,0:15:19.41,Default,,0000,0000,0000,,there are many different kinds of variables, broadly divided into continuous and discrete, with several specific types of discrete variables. Dialogue: 0,0:15:19.41,0:15:23.82,Default,,0000,0000,0000,,These conceptual variable types do not map one-to-one to pandas data types.
Dialogue: 0,0:15:23.82,0:15:30.27,Default,,0000,0000,0000,,You need more information in order to know how to properly interpret and work with a variable. Dialogue: 0,0:15:30.27,0:15:33.99,Default,,0000,0000,0000,,So the data source that you're working with needs to be documented. Dialogue: 0,0:15:33.99,0:15:46.93,Default,,0000,0000,0000,,And if you're creating a data source, you need to document what all of the columns mean and how they're being encoded and stored.