In this video, I want to talk with you about variables, observations, and types.
In the previous video we saw how to load a data file into pandas, how to see how many rows we have, and what Python data types the columns have.
In this video, we're going to go from the Python types to the conceptual types,
and start to talk about what kinds of data we collect and store in these pandas data frames.
So the learning outcomes for this video are to know the relationship between pandas
structures and statistical variables and to identify the type of a statistical variable.
So if you refer back to our data pipeline, we have observable phenomena that produce raw data,
raw observations, that we then process into a data set.
So the core idea is that when we have a data set that's processed and ready to use for a task,
what we usually have is a table of observations. Each row is one observation.
The datasheets reading calls this an instance. Sometimes this is called a sample.
And this is an observation of the values of one or more variables pertaining to a single object.
So I'm showing here three rows from a data set, the Palmer Penguins data set.
That's measurements of penguins in Antarctica, and each one has several different variables.
So the first row represents that we observed an Adelie penguin on Torgersen Island with a bill length of 39.1 millimeters,
a bill depth of 18.7, et cetera. We store this as a pandas data frame, with each variable as a column.
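As a rough sketch of that layout, here is a small hand-built data frame with one row per observed penguin and one column per variable. The column names and every value other than the first row's species, island, bill length, and bill depth are assumptions filled in for illustration; in practice you would load the real file.

```python
import pandas as pd

# Three example rows shaped like the penguin table: one row per observation,
# one column per variable. Column names and most values are illustrative.
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
    "flipper_length_mm": [181, 186, 195],
    "body_mass_g": [3750, 3800, 3250],
    "sex": ["male", "female", "female"],
})

# shape gives (number of observations, number of variables)
n_observations, n_variables = penguins.shape
print(n_observations, n_variables)

# One observation: the values of every variable for a single penguin
first = penguins.iloc[0]
print(first["species"], first["island"], first["bill_length_mm"])
```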
But the variables have their own conceptual properties that we're going to start talking about here.
So for each penguin we have its species; there are three different species of penguin observed in this data set.
We have the island; there are three different islands on which they're measured.
We have measurements of the penguin's bill, its length and depth, the length of its flipper, the body mass of the penguin, and then the penguin's sex.
Each data point, each observation is about one penguin. And we have these different observations.
We also have the documentation, which you'll find a link to online and in the slides.
The documentation tells us things like what the bill length is:
that's the length of the bill in millimeters. The penguin body mass is in grams.
And when we have a data set, it's important to document it.
When we have data, whether it's organized and curated and processed into a data set or it's the raw observations,
the first-order things that we need to know, besides how much data we have, start with:
OK, what are the columns? But also things like, what are the units? Without them we can't properly interpret
a bill length of thirty-nine. Thirty-nine what? It's probably not thirty-nine feet.
That would be a very large penguin. So when we're producing data, we need to document all of these things, and when we're consuming data,
we need to find the answers to all of these things. So the variables we have can take on a variety of different types,
and the first type I want to talk about is a continuous variable. A continuous variable can take on any value,
but there might be a range that limits the values that the variable can take on.
Mathematically, continuous variables correspond to real numbers.
And the key idea here, what really makes it continuous, is that for any two values, we could have a value between them.
So if we have two penguins, let's say one has a flipper length of 40 millimeters and another has a flipper length of 45.
We could have a penguin with a flipper length of 42 millimeters.
And no matter how close together they are, we could conceivably have a penguin with a flipper length that's in between them.
That's what makes it continuous. Now,
observations are often discretized, so even if we have something continuous like the penguin's flipper length,
often our observations will be discrete and noisy because we don't have infinitely precise rulers with which to measure penguin flippers.
But what makes it continuous is that conceptually, it's a continuous variable,
even if our actual measurements of it might be discretized. Typically it's stored as a float.
Occasionally it'll be stored as an int: if we only measure to integer precision, we might store it as an int instead of a float.
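To make the discretization point concrete, here is a minimal sketch. The "true" lengths are made-up numbers, and rounding to the nearest millimeter stands in for a ruler that only reads whole millimeters.

```python
import pandas as pd

# Hypothetical "true" flipper lengths in millimeters (made up for illustration)
true_length_mm = pd.Series([181.37, 186.02, 194.58])

# A ruler that only reads whole millimeters gives discretized measurements,
# which we might then store as integers
measured_mm = true_length_mm.round().astype("int64")

print(measured_mm.tolist())
print(true_length_mm.dtype, measured_mm.dtype)
```

The variable is still conceptually continuous in both cases; only the stored precision differs.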
It is important to note, though, that floating point storage is also imprecise.
It's fine for the vast majority of what we're doing, because if you're taking any kind of physical or natural measurement,
the measurement instrument is going to have some imprecision in it.
The imprecision of storing it in a floating point number is far less than the imprecision of most physical instruments,
unless you're doing something like high energy physics in a particle accelerator, and most of us aren't doing that here.
So for most of our purposes, for measuring, say, physical quantities, we don't need to worry about the floating point imprecision.
The one exception is for storing money. A discrete variable, on the other hand, takes on distinct values.
And most of our variables are basically going to fall into the category of continuous, meaning real numbers, or discrete, meaning almost everything else.
A discrete variable can have values that have no intermediates between them.
If I have four eggs or five eggs, I mean, I could crack an egg and have half the contents,
but I'm talking about distinct eggs. I have four or five eggs.
Discrete variables might not have an order. There are many different types of discrete variables,
and I'm going to walk through some of those in these remaining slides.
So the first is an integer. Integers are discrete, and they're typically things like counts:
counting something is our canonical example of an integer. They have an order: four is less than five.
But you can't have a value in the middle.
So, for example, the number of penguins measured, or the size of something, is usually an integer.
We often treat an integer continuously. So, for example, the American Veterinary Medical Association computed that for households that own cats,
the average number of cats per household is 1.8. Now, you can't actually have 0.8 of a cat.
So in terms of the individual observations, if you were going to observe how many cats are in each household,
the number of cats would be an integer.
You can't have 0.8 cats.
But we can then treat it as a continuous value, so we can talk about the average number of cats per household, and that's meaningful to talk about.
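A quick sketch of that idea, using made-up household counts chosen so the average comes out to 1.8 (these are not the association's data):

```python
import pandas as pd

# Hypothetical counts of cats in five cat-owning households (made-up numbers)
cats = pd.Series([1, 2, 1, 3, 2])

mean_cats = cats.mean()

print(cats.dtype)   # each individual observation is an integer count
print(mean_cats)    # the average does not have to be an integer
```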
That holds even though nobody actually has 1.8 cats. Integers are usually stored as ints,
and sometimes as floats, particularly
if you have missing values: pandas can't represent missing values in its default integer type,
so it will store those ints as floats if it finds any missing values.
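Here is a small sketch of that behavior with made-up values. The last part uses the opt-in nullable `Int64` extension type, which goes beyond what the video covers; it is included only to show that the float fallback is a property of the default integer type.

```python
import pandas as pd

# With no missing values, whole numbers are stored with an integer dtype
complete = pd.Series([1, 2, 3])
print(complete.dtype)

# With a missing value, the default behavior falls back to floats (NaN marks the gap)
with_missing = pd.Series([1, 2, None])
print(with_missing.dtype)
print(with_missing.tolist())

# Opt-in alternative: the nullable "Int64" extension dtype keeps integers
# and marks the missing entry with pd.NA
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().sum())
```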
A categorical variable takes on one of a fixed set of unordered values.
We can compare them for equality, but that's about all we can do.
There's no intrinsic order, and we can't do arithmetic.
We can sort for convenience. The penguin species is one example of a categorical variable.
We can sort by species in alphabetical order. But that's just convention,
the convention of the English alphabet. It's not intrinsic to the meaning of the different penguin species.
That is to say, Adelie is not less than Chinstrap is not less than Gentoo.
They just happen to come in a particular order in the alphabet. Another example is the user ID in the MovieLens data we saw in the previous video.
The user ID column is an integer, but that's not a count.
It's a user, right? It's an identifier. It's actually a kind of categorical variable:
which user in the system are we talking about? They're each assigned a numeric identifier for computational convenience.
But if we have user 75 and user 342, asking what is user 75
plus user 342 is a completely meaningless question.
They're categorical variables. There are no meaningful arithmetic or ordering operations that we can do between them.
Categorical variables can be stored as strings. They can be stored as integers.
Pandas also has a category type that's useful for storing categorical variables.
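A minimal sketch of the pandas category type, using a few made-up species values and the two example user IDs mentioned above:

```python
import pandas as pd

# Species stored as plain strings, then converted to the category dtype
species = pd.Series(["Gentoo", "Adelie", "Chinstrap", "Adelie"])
species_cat = species.astype("category")

print(species_cat.dtype)
print(list(species_cat.cat.categories))  # the fixed set of values
print(species_cat.cat.ordered)           # False: no intrinsic order

# Comparing for equality is meaningful
is_adelie = (species_cat == "Adelie").tolist()
print(is_adelie)

# Integer identifiers can be treated as categories as well
user_id = pd.Series([75, 342, 75]).astype("category")
print(list(user_id.cat.categories))
```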
A boolean or logical variable is a special case of a categorical that has just the two values true and false.
Usually it's stored as an int or a bool, and the bool is just a special-case type of int.
The convention typically is that one is true and zero is false.
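In Python this convention can be checked directly. The sketch below uses a made-up boolean column; the name `is_female` is just for illustration.

```python
import pandas as pd

# In Python, bool is a subclass of int, with True == 1 and False == 0
print(issubclass(bool, int))
print(True == 1, False == 0)

# A hypothetical logical variable stored as booleans
is_female = pd.Series([False, True, True])
print(is_female.dtype)

# Because True counts as 1, summing counts the true values
n_true = is_female.sum()
print(n_true)

# The same variable stored as zeros and ones
as_ints = is_female.astype("int64").tolist()
print(as_ints)
```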
And then ordinal values: like categoricals, they take on one of a fixed set of values.
But those values are ordered. This is the key difference between an ordinal and a categorical, and so we can order them; we can compare for inequality as well as equality.
A few examples of these are class grades. A is a better grade than B.
They're in this order. But you can't directly do math on them,
like A minus B. We can assign numbers to them to try to do math,
but intrinsically they're just in an order. Another example is a Likert scale.
If you've taken a survey that asks you to rate something from strongly disagree to strongly agree, that is ordinal.
Also a movie rating: if you go and rate a movie, or rate a product, five stars, four stars on Netflix,
you're using a number, but it's ordinal in the sense that
we know that if you give something four stars, you're saying you like it more than you like a three-star thing.
But we don't know, if you have a five-star and a four-star and a three-star:
is the five-star movie just as much better than the four-star as the four-star is than the three-star?
Or does it just tell us which order they're in? Intrinsically,
all it tells us is the order that you put these movies in.
But sometimes we do arithmetic anyway. Amazon computes the average rating for a product,
even though ratings are ordinal. Or you have a GPA;
that's computing the average of your ordinal grades. They're stored as ints or floats, or as strings with an externally defined order.
If we have an ordinal variable stored in a string,
we have to have something that tells us what order those values actually go in.
Also, the pandas category type has an ordered mode. You can tell it, this is a categorical variable and it is ordered.
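A sketch of that ordered mode, using letter grades as the example. The grade values and the worst-to-best ordering list are assumptions made for illustration.

```python
import pandas as pd

# The externally defined order, from worst to best
grade_order = ["F", "D", "C", "B", "A"]

# Strings alone carry no grade order; an ordered categorical attaches one
grades = pd.Series(
    pd.Categorical(["B", "A", "C", "A"], categories=grade_order, ordered=True)
)

print(grades.cat.ordered)

# Sorting and comparisons follow the declared order, not the alphabet
sorted_grades = grades.sort_values().tolist()
better_than_b = (grades > "B").tolist()
best = grades.max()

print(sorted_grades)
print(better_than_b)
print(best)
```

Note that this gives ordering and comparison but still no arithmetic: there is no built-in meaning for A minus B.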
There are other types of data we're going to encounter. Time we are usually going to treat as continuous.
It might be stored or written out as a date.
But if we're actually going to work with time, we're probably going to convert it into a continuous variable.
Common encodings for that are either a number of seconds or a number of years, sometimes a number of milliseconds.
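One way this conversion might look in pandas, as a sketch: the timestamps are made up, and measuring elapsed time from the first observation is just one possible choice of reference point.

```python
import pandas as pd

# Times written out as date strings (made-up values)
written = pd.Series([
    "2020-01-01 00:00:00",
    "2020-01-01 00:01:30",
    "2020-01-02 00:00:00",
])
timestamps = pd.to_datetime(written)

# Elapsed time since the first observation, as a continuous number of seconds
elapsed = timestamps - timestamps.iloc[0]
seconds = elapsed.dt.total_seconds()
print(seconds.tolist())

# The same quantity in milliseconds
milliseconds = seconds * 1000
print(milliseconds.tolist())
```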
Text is categorical-ish, but we usually convert it into categorical or count variables.
We'll talk more about that later, when we actually get to processing text. Images are stored as matrices of ints or reals.
We may also extract features from them that become other kinds of variables. Money is often stored as an int or a float,
but we have to be careful because the imprecision of floating point numbers can
cause a problem when we're using money for the purposes of
creating financial transactions. For the kinds of things we might be doing with money here in this class,
it's not going to be a problem. Nobody loses money if you have a little imprecision in computing
the average price of a ton of potatoes.
It really becomes a problem when you feed that back into actual financial
transaction systems, because it is impossible to precisely represent 10 cents as a floating point number of dollars.
It's just a hair under or over 10 cents when you store it in a floating point value.
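A short sketch of the problem and two common workarounds. Storing integer cents and using Python's `decimal` module are ordinary approaches offered here as illustrations, not something prescribed in the video.

```python
from decimal import Decimal

# 0.10 dollars cannot be represented exactly in binary floating point
print(f"{0.10:.20f}")

# So exact comparisons on float dollars can go wrong
float_total = 0.10 * 3
print(float_total == 0.30)

# Workaround 1: store money as an integer number of cents
price_cents = 10
total_cents = price_cents * 3
print(total_cents)

# Workaround 2: use exact decimal arithmetic
decimal_total = Decimal("0.10") * 3
print(decimal_total)
```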
So the other thing I want to highlight here, though,
is that knowing the Python or NumPy or pandas data type of a variable is not sufficient to know its type in a statistical sense.
Suppose we have a variable that's an int64. Well, what's in that?
Is it a categorical variable, like our movie and user IDs in MovieLens?
Is it a continuous variable that happens to be measured with integer precision? In the penguins data set, when you download it,
the body mass is integers, because they were just measuring to the whole gram.
They didn't measure fractional grams. But conceptually, mass really is a continuous value.
It's just that our measurements aren't that precise.
Is it ordinal? Or is it zeros and ones that are representing a logical variable?
And as I said before, integers with missing values also load as floats.
So we can look at the data types, and we can look at the data itself, to try to start to get a sense of what it is.
But knowing that is not sufficient to know what type of data we're dealing with for the purposes of handling it properly.
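To illustrate, here is a sketch with four made-up columns that all end up with the same storage type even though they are four different kinds of variable. The column names and the hand-written `statistical_type` mapping are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [75, 342, 75],            # categorical identifier
    "body_mass_g": [3750, 3800, 3250],   # continuous, measured to the whole gram
    "stars": [5, 3, 4],                  # ordinal rating
    "is_adult": [1, 0, 1],               # logical, coded as 0/1
})

# Every column has the same storage dtype
print(df.dtypes)

# The statistical type has to come from documentation, not from the dtype
statistical_type = {
    "user_id": "categorical",
    "body_mass_g": "continuous",
    "stars": "ordinal",
    "is_adult": "boolean",
}
for column, dtype in df.dtypes.items():
    print(column, dtype, statistical_type[column])
```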
I want to talk a little bit about entities or instances, which I introduced last time and which are talked about in the reading.
So we want to be clear, when we have a data frame, a data table:
what are the things being observed in this set of observations? If you've taken the databases class,
we called these entities. Sometimes this is pretty straightforward.
For example, in this penguin data set, each row represents the measurements of one penguin.
But sometimes they're complex and linked, such as the rating data table.
Each instance is a rating. But that's a rating about a movie.
And we can also derive things: we could count the number of ratings for each movie, and that could be a variable for a movie.
We could do this aggregation. We're going to see how to do aggregations in a little bit.
That gives us a new variable, number of ratings, that becomes a variable for observations of movies.
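As a preview, one way such a count might be computed is with a group-by. This is a sketch on a tiny made-up ratings table; the column names are assumptions, not necessarily those of the real data file.

```python
import pandas as pd

# Each row is one rating: an observation linking a user and a movie
ratings = pd.DataFrame({
    "user_id": [75, 75, 342, 342, 342],
    "movie_id": [1, 2, 1, 3, 2],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Aggregate to one row per movie: the number of ratings becomes a movie variable
n_ratings = ratings.groupby("movie_id")["rating"].count().rename("n_ratings")
print(n_ratings)
```

The result has movies, not ratings, as its instances.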
So to wrap up,
there are many different kinds of variables, broadly divided into continuous and discrete, with several specific types of discrete variables.
These conceptual variable types do not map one-to-one to pandas data types.
You need more information in order to know how to properly interpret and work with a variable.
So the data source that you're working with needs to be documented.
And if you're creating a data source, you need to document what all of the columns mean and how they're being encoded and stored.
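One lightweight way to record that kind of documentation is a small data dictionary kept alongside the data. The structure below is purely a hypothetical sketch, not a standard format; the units are the ones mentioned earlier for the penguin data.

```python
# A hypothetical data dictionary: for each column, its meaning,
# statistical type, storage type, and units
data_dictionary = {
    "species": {
        "description": "penguin species",
        "statistical_type": "categorical",
        "storage": "string",
        "units": None,
    },
    "bill_length_mm": {
        "description": "length of the bill",
        "statistical_type": "continuous",
        "storage": "float",
        "units": "millimeters",
    },
    "body_mass_g": {
        "description": "body mass of the penguin",
        "statistical_type": "continuous",
        "storage": "int",
        "units": "grams",
    },
}

for column, info in data_dictionary.items():
    print(column, info["statistical_type"], info["units"])
```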