-
In this video, I want to talk with you about variables, observations, and types.
-
In the previous video we saw how to load a data file into pandas, how to see how many rows we have, and what Python types the data has.
-
But in this video, we're going to go from the Python types to the conceptual types,
-
and start to talk about what kinds of data we collect and we store in these pandas data frames.
-
So the learning outcomes for this video are to know the relationship between pandas
-
structures and statistical variables and to identify the type of a statistical variable.
-
So if you refer back to our data pipeline, we have observable phenomena that produce raw data,
-
raw observations that we then process into a data set.
-
So the core idea, when we have a data set that's processed and ready to use for a task,
-
is that what we usually have is a table of observations. Each row is one observation.
-
The datasheets reading calls this an instance. Sometimes this is called a sample.
-
And this is an observation of the values of one or more variables pertaining to a single object.
-
So I'm showing here three rows from a data set, the Palmer Penguins data set.
-
That's measurements of penguins in Antarctica. And so each one has several different variables.
-
So the first row represents that we observed an Adelie penguin on Torgersen Island with a bill length of 39.1 millimeters,
-
a bill depth of 18.7, et cetera. We store this as a pandas data frame, with each variable as a column.
-
But the variables have their own conceptual properties that we're going to start talking about here.
-
So for each penguin we have its species; there are three different species of penguin that are observed in this data set. We have the island;
-
there are three different islands on which they're measured.
-
We have measurements of the penguin's bill (its length and depth), the length of its flipper, the body mass of the penguin, and then the penguin's sex.
-
Each data point, each observation is about one penguin. And we have these different observations.
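As a minimal sketch of that table, here is how three rows like these could be built as a pandas data frame. The column names and all values other than the first row's bill length and depth are assumptions for illustration, not read from the actual data file.

```python
import pandas as pd

# Three illustrative penguin observations. Only the first row's bill length
# (39.1) and bill depth (18.7) are quoted in the video; the column names and
# the remaining values are assumed for the sake of the example.
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
    "flipper_length_mm": [181, 186, 195],
    "body_mass_g": [3750, 3800, 3250],
    "sex": ["male", "female", "female"],
})

# Rows are observations, columns are variables.
n_observations, n_variables = penguins.shape
print(n_observations, n_variables)

# One row: everything we observed about a single penguin.
print(penguins.iloc[0])

# One column: a single variable across all observations.
print(penguins["bill_length_mm"])
```

Selecting a row gives you one observation, and selecting a column gives you one variable, which is the correspondence between the pandas structure and the statistical idea.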
-
We also have the documentation, which you'll find a link to online and in the slides.
-
The documentation tells us things like what the bill length is:
-
that's the length of the bill in millimeters. The penguin body mass is in grams.
-
And when we have a data set, it's important to document it.
-
When we have data, whether it's organized and curated and processed into a data set or it's the raw observations,
-
there are first-order things that we need to know besides how much data we have.
-
OK, what are the columns? But also things like, what are the units? Without units we can't properly interpret the values.
-
A bill length of 39. Thirty-nine what? It's probably not 39 feet.
-
That would be a very large penguin. But when we're producing data, we need to document all of these things, and when we're consuming data,
-
we need to find the answers to all of these things. So the variables we have can take on a variety of different types,
-
and the first type I want to talk about is a continuous variable. A continuous variable can take on any value, although
-
there might be a range that limits the values that the variable can take on.
-
Mathematically, continuous variables correspond to real numbers.
-
And the key idea here, what really makes it continuous, is that for any two values, we could have a value between them.
-
So if we have two penguins, let's say one has a flipper length of 40 millimeters and another has a flipper length of 45,
-
we could have a penguin with a flipper length of 42 millimeters.
-
And no matter how close together they are, we could conceivably have a penguin with a flipper length that's in between them.
-
That's what makes it continuous. Now,
-
observations are often discretized, so even if we have something continuous like the penguin's flipper length,
-
often our observations will be discrete and noisy because we don't have infinitely precise rulers with which to measure penguin flippers.
-
But what makes it continuous is that, conceptually, it's a continuous variable,
-
even if our actual measurements of it might be discretized. Typically it's stored as a float.
-
Occasionally it'll be stored as an int: if we only measure to integer precision, we might store it as an int instead of a float.
-
It is important to note, though, that floating-point storage is also imprecise.
-
It's fine for the vast majority of what we're doing, because if you're taking any kind of physical or natural measurement,
-
the measurement instrument is going to have some imprecision in it.
-
The imprecision of storing in a floating-point number is far less than the imprecision of most physical instruments,
-
unless you're doing something like high-energy physics in a particle accelerator, and most of us aren't doing that here.
-
So for most of our purposes, for measuring, say, physical quantities, we don't need to worry about the floating-point imprecision.
-
The one exception is for storing money. A discrete variable, on the other hand, takes on distinct values.
-
And most of our variables are basically going to fall into the category of continuous (real numbers) or discrete (almost everything else).
-
Discrete variables can have values that have no intermediates.
-
If I have four eggs or five eggs, I mean, I could crack an egg and have half the contents,
-
but I'm talking about distinct eggs: I have four or five eggs.
-
Discrete variables might not have an order. There are many different types of discrete variables.
-
And I'm going to walk through some of those in these remaining slides.
-
So the first is an integer. Integers are discrete, and they're typically things like counts: counting
-
something is our canonical example of an integer. They have an order: four is less than five.
-
But you can't have a value in the middle.
-
So, for example, the number of penguins measured, or the size of something, is usually an integer.
-
We often treat an integer continuously. So, for example, the American Veterinary Medical Association computed that for households that own cats,
-
the average number of cats per household is 1.8. Now, you can't actually have 0.8 of a cat.
-
So in terms of the individual observations, if you were going to observe how many cats are in each household,
-
the number of cats would be an integer.
-
You can't have 0.8 cats.
-
But we then can treat it as a continuous value, so we can talk about the average number of cats per household, and that's meaningful to talk about,
-
even though nobody actually has 1.8 cats. Integers are usually stored as ints,
-
and sometimes as floats, particularly
-
if you have missing values: pandas can't represent missing values in its default integer type,
-
so it will store those ints as floats if it finds any missing values.
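Here is a small sketch of both points, using made-up cat counts (the numbers are assumptions chosen so the mean comes out to 1.8). It also shows the nullable `Int64` extension type, which pandas provides as an opt-in alternative to the default integer type.

```python
import numpy as np
import pandas as pd

# Hypothetical counts of cats in five households: each observation is an integer.
cats = pd.Series([1, 2, 3, 1, 2])
print(pd.api.types.is_integer_dtype(cats))  # the column is stored as an integer type
print(cats.mean())                          # the average is a perfectly good non-integer

# The same kind of data with one household missing: the default integer type
# has no way to represent the gap, so pandas stores the column as floats.
cats_with_gap = pd.Series([1, 2, np.nan, 1, 2])
print(cats_with_gap.dtype)

# Opting in to the nullable integer extension type keeps the values as integers.
cats_nullable = pd.Series([1, 2, None, 1, 2], dtype="Int64")
print(cats_nullable.dtype)
print(cats_nullable.isna().sum())
```

So a float column in a loaded file may really be a count with some missing observations, which is worth checking before you treat it as a continuous measurement.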
-
A categorical variable takes on one of a fixed set of unordered values.
-
We can compare them for equality, but that's about all we can do.
-
There's no order, so we can't meaningfully sort them, and we can't do arithmetic.
-
We can sort for convenience. The penguin species is one example of a categorical variable:
-
we can sort by species in alphabetical order, but that's just convention,
-
the convention of the English alphabet. It's not intrinsic to the meaning of the different penguin species.
-
Adelie is not less than Chinstrap, which is not less than Gentoo;
-
they just happen to come in a particular order in the alphabet. Another example is the user ID in the MovieLens data we saw in the previous video.
-
The user ID column is an integer, but that's not a count.
-
It's a user, right? It's an identifier. It's actually a kind of categorical variable:
-
Which user in the system are we talking about? They're each assigned a numeric identifier for computational convenience.
-
But if we have user 75 and user 342, asking what is user 75
-
plus user 342 is a completely meaningless question.
-
They're categorical variables. There are no arithmetic or ordering comparison operations that we can meaningfully do between them.
-
Categorical variables can be stored as strings. They can be stored as integers.
-
Pandas also has a category type that's useful for storing categorical variables.
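A minimal sketch of the category type, using a few made-up species values. It shows that equality comparison works, while an ordering comparison on an unordered categorical is rejected by pandas.

```python
import pandas as pd

# A handful of illustrative species labels stored with the category dtype.
species = pd.Series(["Adelie", "Chinstrap", "Gentoo", "Adelie"], dtype="category")

# The fixed set of values the variable can take on.
print(list(species.cat.categories))

# Equality comparison is meaningful for a categorical variable.
print((species == "Adelie").tolist())

# By default the categories are unordered, so "less than" is not defined.
print(species.cat.ordered)
try:
    species < "Chinstrap"
    ordering_allowed = True
except TypeError:
    ordering_allowed = False
print(ordering_allowed)
```

You can still call `sort_values()` on such a column for convenience; the sorted order is just the category listing order, not a statement about the penguins.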
-
A boolean or logical variable is a special case of a categorical that has just the two values true and false.
-
Usually it's stored as an int or a bool, and the bool is just a special-case type of int.
-
The convention typically is that one is true and zero is false.
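A quick sketch of that convention in Python and pandas; the `is_adelie` values are made up for illustration.

```python
import pandas as pd

# In Python, bool is a subclass of int, with True equal to 1 and False equal to 0.
print(isinstance(True, int))
print(True == 1, False == 0)

# A hypothetical logical variable: is this penguin an Adelie?
is_adelie = pd.Series([True, False, False, True])

# Because True counts as 1, the sum is the number of true observations
# and the mean is the fraction of true observations.
n_true = is_adelie.sum()
fraction_true = is_adelie.mean()
print(n_true, fraction_true)
```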
-
And then ordinal values: like categoricals, they take on one of a fixed set of values,
-
but those values are ordered. This is the key difference between an ordinal and a categorical, and so we can order them, and we can compare for inequality.
-
A few examples of these are class grades. A is a better grade than B.
-
They're in this order. But you can't directly do math on them,
-
like A minus B. We can assign numbers to them to try to do math,
-
but they're just in an order. Another example is a Likert scale:
-
if you've taken a survey that asks you to rate something from strongly disagree to strongly agree, that is ordinal.
-
Also a movie rating: if you go and you rate a movie, or you rate a product, five stars or four stars, on Netflix, say,
-
you're using a number, but it's ordinal in the sense that
-
we know that if you rate something four stars, you're saying you like it more than you like a three-star thing.
-
But if you have a five-star, a four-star, and a three-star movie, we don't know:
-
is the five-star movie just as much better than the four-star as the four-star is than the three-star?
-
Or does it just tell us which order they're in? Intrinsically,
-
all it tells us is the order that you put these movies in.
-
But sometimes we do arithmetic anyway, like Amazon computes the average rating for a product,
-
even though ratings are ordinal. Or you have a GPA:
-
that's computing the average of your ordinal grades. Ordinals are stored as ints or floats, or as strings with an externally defined ordering.
-
If we have an ordinal variable stored in a string,
-
we have to have something that tells us what order those values actually go in.
-
Also, the pandas category type has an ordered mode: you can tell it, this is a categorical variable and it is ordered.
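As a sketch of the ordered mode, here are letter grades stored as an ordered categorical. The grades and the grade-point mapping are assumptions for illustration; the mapping is the "assign numbers to try to do math" step, and it is a choice we impose rather than something the ordering gives us.

```python
import pandas as pd

# Declare the categories from worst to best and mark them as ordered.
grade_type = pd.CategoricalDtype(categories=["F", "D", "C", "B", "A"], ordered=True)
grades = pd.Series(["B", "A", "C", "A"], dtype=grade_type)

# Sorting and inequality comparisons now follow the declared order,
# not the alphabet.
print(grades.sort_values().tolist())
print((grades > "C").tolist())
print(grades.max())

# Arithmetic still needs an externally chosen numeric encoding.
grade_points = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}
gpa = grades.astype(str).map(grade_points).mean()
print(gpa)
```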
-
There are other types of data we're going to encounter. Time we usually are going to treat as continuous.
-
It might be stored written out as a date.
-
But if we're actually going to work with time, we're probably going to convert it into a continuous variable.
-
Common encodings for that are either a number of seconds or a number of years, sometimes a number of milliseconds.
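A minimal sketch of that conversion, assuming we start from timestamps written out as strings (the two timestamps are made up): parse them, then express each one as a number of seconds.

```python
import pandas as pd

# Two illustrative timestamps written out as text.
written = pd.Series(["2020-01-01 00:00:00", "2020-01-01 00:01:30"])
timestamps = pd.to_datetime(written)

# Continuous encoding 1: seconds elapsed since the first observation.
elapsed_seconds = (timestamps - timestamps.iloc[0]).dt.total_seconds()
print(elapsed_seconds.tolist())

# Continuous encoding 2: seconds since a fixed reference point
# (here the Unix epoch, 1970-01-01).
epoch_seconds = (timestamps - pd.Timestamp("1970-01-01")).dt.total_seconds()
print(epoch_seconds.tolist())
```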
-
Text is categorical-ish, but we usually convert it into categorical or count variables.
-
We'll talk more about that later, when we actually get to processing text. Images are stored as matrices of ints or reals.
-
We may also extract features from them that become other kinds of variables. Money is often stored as an int or a float,
-
but we have to be careful because the imprecision of floating-point numbers can
-
cause a problem when we're using money for the purposes
-
of creating financial transactions. For the kinds of things we might be doing with money here in this class,
-
it's not going to be a problem. Nobody loses money if you have a little imprecision in computing
-
the average price of a ton of potatoes.
-
It really becomes a problem when you feed that back into actual financial
-
transaction systems, because it is impossible to precisely represent 10 cents (0.10 dollars) in a binary floating-point value.
-
What you store is just a hair under or over 10 cents.
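A short sketch of what that looks like in Python. Storing the amount as integer cents or using the standard library's `decimal` module are two common ways to avoid the problem; they are shown here as options, not as what any particular financial system does.

```python
from decimal import Decimal

# Converting the float 0.10 to Decimal reveals the value that is actually
# stored, which is close to, but not exactly, one tenth.
stored = Decimal(0.10)
print(stored)
print(stored == Decimal("0.10"))

# The tiny errors show up when we add amounts together.
float_total = 0.10 + 0.10 + 0.10
print(float_total == 0.30)

# Option 1: store money as an integer number of cents.
cents_total = 10 + 10 + 10
print(cents_total == 30)

# Option 2: use exact decimal arithmetic.
decimal_total = Decimal("0.10") * 3
print(decimal_total == Decimal("0.30"))
```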
-
So the other thing I want to highlight here, though,
-
is that knowing the Python, NumPy, or pandas data type of a variable is not sufficient to know its type in a statistical sense.
-
Suppose we have a variable that's an int64. Well, what's in that?
-
Is it a categorical variable, like our movie ID and user ID in MovieLens?
-
Is it a continuous variable that happens to be measured with integer precision? In the penguins data set, when you download it,
-
the body mass is integers because they were just measuring to the whole gram.
-
They didn't measure fractional grams. But conceptually, mass really is a continuous value;
-
it's just that our measurements aren't that precise.
-
Is it ordinal? Or is it zeros and ones that are representing a logical variable?
-
And as I said before, integers with missing values also load as floats.
-
So we can look at the data, we can look at the data types, we can look at the data itself to try to start to get a sense of what it is.
-
But knowing that is not sufficient to know what type of data we're dealing with for the purposes of handling it properly.
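To make that concrete, here is a sketch with four made-up columns that all get the same integer storage type but are four different kinds of statistical variable. The column names, values, and the `variable_types` dictionary are assumptions for illustration; the dictionary stands in for the documentation you would need alongside the data.

```python
import pandas as pd

# Four hypothetical columns, placed side by side only to compare their dtypes.
df = pd.DataFrame({
    "user_id": [75, 342, 75],           # identifier
    "body_mass_g": [3750, 3800, 3250],  # mass measured to the whole gram
    "stars": [5, 3, 4],                 # star rating
    "is_adelie": [1, 0, 1],             # 1 = true, 0 = false
})

# pandas reports the same storage type for every column (typically int64).
print(df.dtypes)
same_storage = df.dtypes.nunique() == 1

# The statistical type has to come from documentation, not from the dtype.
variable_types = {
    "user_id": "categorical",
    "body_mass_g": "continuous",
    "stars": "ordinal",
    "is_adelie": "boolean",
}
for column, kind in variable_types.items():
    print(column, df[column].dtype, kind)
```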
-
I want to talk a little bit about entities or instances, which I introduced last time and which are talked about in the reading.
-
So we want to be clear, when we have a data frame, a data table:
-
what are the things being observed in this set of observations? If you've taken the databases class,
-
we called these entities. Sometimes this is pretty straightforward.
-
For example, in this penguin data set, each row represents the measurements of one penguin.
-
But sometimes they're complex and linked, such as in the rating data table.
-
Each instance is a rating. But that's a rating about a movie.
-
And we can also derive things: we could count the number of ratings for each movie, and that could be a variable for a movie.
-
We could do this aggregation. We're going to see how to do aggregations in a little bit.
-
That gives us a new variable, number of ratings, that becomes a variable for observations of movies.
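As a preview sketch of that aggregation, here is a tiny made-up ratings table where each row is one rating, turned into a table where each row is one movie. The column names and values are assumptions and may not match the actual data file.

```python
import pandas as pd

# Each row is one rating: a user's rating of a movie (illustrative values).
ratings = pd.DataFrame({
    "user_id": [75, 75, 342, 342, 342],
    "movie_id": [1, 2, 1, 2, 3],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Group the ratings by movie and aggregate: the result has one row per movie,
# so the entity being described has changed from "rating" to "movie".
movie_stats = ratings.groupby("movie_id")["rating"].agg(
    num_ratings="count",
    mean_rating="mean",
)
print(movie_stats)
```

The new `num_ratings` column is an integer count that describes movies, even though no single rating row contains it.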
-
So to wrap up,
-
there are many different kinds of variables broadly divided into continuous and discrete with several specific types of discrete variables.
-
These conceptual variable types do not map one-to-one to pandas data types.
-
You need more information in order to know how to properly interpret and work with a variable.
-
So the data source that you're working with needs to be documented.
-
And if you're creating a data source, you need to document what all of the columns mean and how they're being encoded and stored.