https:/.../afc96ff2-0043-4387-af9b-ad9000efd6b0-753c0fcd-b2eb-4f7d-8d63-ad9101272f10.mp4?invocationId=c1ea84bb-2f09-ec11-a9e9-0a1a827ad0ec

0:05 - 0:10

In this video, I want to talk with you about variables, observations, and types.
0:10 - 0:18

In the previous video we saw how to load a data file into pandas, how to see how many rows we have, and what Python types the data has.
0:18 - 0:21

But in this video, we're going to go from the Python types to the conceptual types,
0:21 - 0:31

and start to talk about what kinds of data we collect and store in these pandas data frames.
0:31 - 0:35

So the learning outcomes for this video are to know the relationship between pandas
0:35 - 0:41

structures and statistical variables and to identify the type of a statistical variable.
0:41 - 0:50

So if you refer back to our data pipeline, we have observable phenomena that produce raw data,
0:50 - 0:55

raw observations that we then process into a data set.
0:55 - 1:02

So the core idea, when we have a data set that's processed and ready to use for a task,
1:02 - 1:08

is that what we usually have is a table of observations. Each row is one observation.
1:08 - 1:14

The datasheets reading calls this an instance. Sometimes this is called a sample.
1:14 - 1:20

And this is an observation of the values of one or more variables pertaining to a single object.
1:20 - 1:25

So I'm showing here three rows from a data set, the Palmer Penguins data set.
1:25 - 1:32

That's measurements of penguins in Antarctica. And so each one has several different variables.
1:32 - 1:34

Each row represents one observation.
1:34 - 1:42

So the first row represents that we observed an Adelie penguin on Torgersen Island with a bill length of 39.1 millimeters,
1:42 - 1:51

a bill depth of 18.7, et cetera. We store this as a pandas data frame, with each variable as a column.
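
As a minimal sketch of "each row is an observation, each variable is a column", the snippet below builds a one-row data frame from the values mentioned for the first penguin. The column names are assumptions chosen for illustration; the real file's names may differ.

```python
import pandas as pd

# One observation (row), using only the values mentioned in the talk.
# Column names are assumed for illustration.
penguins = pd.DataFrame([
    {
        "species": "Adelie",
        "island": "Torgersen",
        "bill_length_mm": 39.1,
        "bill_depth_mm": 18.7,
    },
])

print(penguins.shape)   # (number of rows, number of columns)
print(penguins.dtypes)  # the storage type pandas picked for each column
```

The storage types printed here say how each column is held in memory, not what kind of statistical variable it is.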
1:51 - 1:56

But the variables have their own conceptual properties that we're going to start talking about here.
1:56 - 2:03

So for each penguin we have its species; there are three different species of penguin observed in this data set. We have the island;
2:03 - 2:06

there are three different islands on which they're measured.
2:06 - 2:15

We have measurements of the penguin's bill (its length and depth), the length of its flipper, the body mass of the penguin, and then the penguin's sex.
2:15 - 2:21

Each data point, each observation is about one penguin. And we have these different observations.
2:21 - 2:29

We also have the documentation, which you'll find a link to online and in the slides.
2:29 - 2:33

The documentation tells us things like the bill length:
2:33 - 2:39

That's the length of the bill in millimeters. The penguin body mass is in grams.
2:39 - 2:43

When we have a data set, it's important to document it.
2:43 - 2:52

When we have data, whether it's organized, curated, and processed into a data set or it's the raw observations,
2:52 - 2:58

the first things we need to know, besides how much data we have,
2:58 - 3:03

are: what are the columns? But also things like: what are the units? Without units we can't properly interpret
3:03 - 3:09

the bill length. A bill length of 39: 39 what? It's probably not 39 feet.
3:09 - 3:19

That would be a very large penguin. When we're producing data, we need to document all of these things, and when we're consuming data,
3:19 - 3:28

we need to find the answers to all of these things. So the variables we have can take on a variety of different types,
3:28 - 3:35

and the first type I want to talk about is a continuous variable. A continuous variable can take on any value, though
3:35 - 3:39

there might be a range that limits the values that the variable can take on.
3:39 - 3:43

Mathematically, continuous variables correspond to real numbers.
3:43 - 3:50

And the key idea here, what really makes it continuous, is that for any two values, we could have a value between them.
3:50 - 3:58

So if we have two penguins, let's say one has a flipper length of 40 millimeters and another has a flipper length of 45,
3:58 - 4:02

we could have a penguin with a flipper length of 42 millimeters.
4:02 - 4:08

And no matter how close together they are, we could conceivably have a penguin with a flipper length that's in between them.
4:08 - 4:13

That's what makes it continuous. Now,
4:13 - 4:19

observations are often discretized, so even if we have something continuous like the penguin's flipper length,
4:19 - 4:29

often our observations will be discrete and noisy because we don't have infinitely precise rulers with which to measure penguin flippers.
4:29 - 4:35

But what makes it continuous is that, conceptually, it's a continuous variable,
4:35 - 4:42

even if our actual measurements of it might be discretized. Typically it's stored as a float.
4:42 - 4:48

Occasionally it'll be stored as an int. If we only measure to integer precision, we might store it as an int instead of a float.
4:48 - 4:53

It is important to note, though, that floating-point storage is also imprecise.
4:53 - 5:00

It's fine for the vast majority of what we're doing, because if you're taking any kind of a physical or natural measurement,
5:00 - 5:03

the measurement instrument is going to have some imprecision in it.
5:03 - 5:12

The imprecision of storing in a floating-point number is far less than the imprecision of most physical instruments,
5:12 - 5:19

unless you're doing something like high-energy physics in a particle accelerator. Most of us aren't doing that here.
5:19 - 5:26

So for most of our purposes, for measuring, say, physical quantities, we don't need to worry about the floating-point imprecision.
5:26 - 5:34

The one exception is for storing money. A discrete variable, on the other hand, takes on distinct values.
5:34 - 5:44

Most of our variables are basically going to fall into one of two categories: continuous (real numbers) or discrete (almost everything else).
5:44 - 5:49

With a discrete variable, we can have values that have no intermediates.
5:49 - 5:56

If I have four eggs or five eggs, I mean, I could crack an egg and have half the contents,
5:56 - 6:01

but I'm talking about distinct eggs. I have four or five eggs.
6:01 - 6:07

Discrete variables might not have an order. There are many different types of discrete variables,
6:07 - 6:11

and I'm going to walk through some of those in these remaining slides.
6:11 - 6:17

So the first is an integer. Integers are discrete, and they're typically things like counts. Counting
6:17 - 6:25

something is our canonical example of an integer. They have an order: four is less than five.
6:25 - 6:32

But you can't have a value in the middle.
6:32 - 6:37

So, for example, the number of penguins measured, or the size of something, is usually an integer.
6:37 - 6:48

We often treat an integer continuously. So, for example, the American Veterinary Medical Association computed that, for households that own cats,
6:48 - 6:55

the average number of cats per household is 1.8. Now, you can't actually have 0.8 of a cat.
6:55 - 7:01

So in terms of the individual observations, if you were going to observe how many cats are in each household,
7:01 - 7:06

the number of cats would be an integer.
7:06 - 7:08

You can't have 0.8 cats.
7:08 - 7:17

But we can then treat it as a continuous value, so we can talk about the average number of cats per household, and that's meaningful to talk about,
7:17 - 7:22

even though nobody actually has 1.8 cats. Integers are usually stored as ints,
7:22 - 7:27

and sometimes as floats, particularly
7:27 - 7:34

if you have missing values. Pandas can't represent missing values in its default integer type,
7:34 - 7:40

so it will store those ints as floats if it finds any missing values.
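
A small sketch of this int-to-float behavior, using made-up household cat counts. The last part uses the opt-in nullable `Int64` extension type, which is not mentioned in the talk and is shown only as one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical cat counts per household (made-up numbers).
cats = pd.Series([1, 2, 3])
print(cats.dtype)          # integer storage

# The same kind of data, with one household's count missing.
cats_missing = pd.Series([1, np.nan, 3])
print(cats_missing.dtype)  # float storage, because NaN is a float value

# Opt-in nullable integer type: keeps integer storage alongside a missing value.
cats_nullable = pd.Series([1, None, 3], dtype="Int64")
print(cats_nullable.dtype)
print(cats_nullable.isna().sum())  # number of missing entries
```

Either way, the count is conceptually an integer; only the storage changed.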
7:40 - 7:48

A categorical variable takes on one of a fixed set of unordered values.
7:48 - 7:52

We can compare them for equality, but that's about all we can do.
7:52 - 7:58

There's no order, so we can't meaningfully sort them. We can't do arithmetic.
7:58 - 8:04

We can sort for convenience. The penguin species is one example of a categorical variable.
8:04 - 8:09

We can sort by species in alphabetical order, but that's just convention,
8:09 - 8:16

the convention of the English alphabet. It's not intrinsic to the meaning of the different penguin species.
8:16 - 8:21

That is to say, Adelie is not less than Chinstrap, which is not less than Gentoo.
8:21 - 8:31

They just happen to come in a particular order in the alphabet. Another example is the user in the MovieLens data we saw in the previous video.
8:31 - 8:34

The user ID column is an integer, but that's not a count.
8:34 - 8:39

It's a user, right? It's an identifier. It's actually a kind of categorical variable:
8:39 - 8:48

which user in the system are we talking about? They're each assigned a numeric identifier for computational convenience.
8:48 - 8:55

But if we have user 75 and user 342, asking what is user 75
8:55 - 8:59

plus user 342 is a completely meaningless question.
8:59 - 9:10

They're categorical variables. There are no arithmetic or ordering operations that we can do between them.
9:10 - 9:14

Categorical variables can be stored as strings. They can be stored as integers.
9:14 - 9:19

Pandas also has a category type that's useful for storing categorical variables.
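
A minimal sketch of the pandas category type, assuming a few made-up rows. The species names and the two user IDs come from the talk; everything else is illustrative.

```python
import pandas as pd

# Species stored with the pandas category dtype (unordered by default).
species = pd.Series(["Adelie", "Gentoo", "Chinstrap", "Adelie"], dtype="category")
print(species.cat.categories)  # the fixed set of values
print(species == "Adelie")     # equality comparison is meaningful

# Ordering comparisons are rejected for an unordered categorical.
try:
    species < "Gentoo"
except TypeError as err:
    print("not comparable:", err)

# User IDs are integers in storage but categorical in meaning.
user_ids = pd.Series([75, 342, 75]).astype("category")
print(user_ids.cat.categories)
```

Converting an identifier column to a category is one way to signal that arithmetic on it is not intended.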
9:19 - 9:26

A Boolean or logical variable is a special case of a categorical that has just the two values true and false.
9:26 - 9:32

Usually it's stored as an int or a bool, and the bool is just a special-case type of int.
9:32 - 9:38

The convention typically is that one is true and zero is false.
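
The 0/1 convention can be checked directly in Python. This is a small sketch; the `flags` column is a hypothetical example, not data from the lecture.

```python
import pandas as pd

# In Python, bool is a subclass of int, with True == 1 and False == 0.
print(isinstance(True, int))
print(True == 1, False == 0)

# Hypothetical logical variable stored as 0/1 integers.
flags = pd.Series([1, 0, 1, 1])
as_bool = flags.astype(bool)
print(as_bool.dtype)
print(as_bool.mean())  # fraction of True values
```

Because true is 1 and false is 0, the mean of a logical column is the proportion of true values.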
9:38 - 9:44

And then ordinal values: like categoricals, they take on one of a fixed set of values,
9:44 - 9:54

but those values are ordered. This is the key difference between an ordinal and a categorical. So we can order them, and we can compare for inequality.
9:54 - 10:01

A few examples of these are class grades. A is a better grade than B.
10:01 - 10:04

They're in this order, but you can't directly do math on them,
10:04 - 10:08

such as A minus B. We can assign numbers to them to try to do math,
10:08 - 10:12

but they're just in an order. Another example is a Likert scale:
10:12 - 10:22

if you've taken a survey that asks you to rate something from strongly disagree to strongly agree, that is ordinal.
10:22 - 10:29

Also a movie rating: if you go and you rate a movie, or you rate a product, five stars or four stars, say on Netflix,
10:29 - 10:35

you're using a number, but it's ordinal in the sense that
10:35 - 10:42

we know that if you rate something four stars, you're saying you like it more than you like a three-star thing.
10:42 - 10:47

But if you have a five-star, a four-star, and a three-star movie, we don't know:
10:47 - 10:55

is the five-star movie just as much better than the four-star as the four-star is than the three-star?
10:55 - 10:58

Or does it just tell us which order they're in? Intrinsically,
10:58 - 11:03

all it tells us is the order that you put these movies in.
11:03 - 11:10

But sometimes we do arithmetic anyway, like Amazon computes the average rating for a product,
11:10 - 11:14

even though ratings are ordinal. Or you have a GPA:
11:14 - 11:23

that's computing the average of your ordinal grades. Ordinals are stored as ints or floats, or as strings with an externally defined ordering.
11:23 - 11:28

If we have an ordinal variable stored in a string,
11:28 - 11:32

we have to have something that tells us what order those values actually go in.
11:32 - 11:39

Also, the pandas category type has an ordered mode. You can tell it: this is a categorical variable, and it is ordered.
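
A minimal sketch of an ordered categorical, using letter grades as in the example above. The specific grades and the worst-to-best list are assumptions for illustration.

```python
import pandas as pd

# The externally defined order, listed from worst to best (assumed here).
grade_order = ["F", "D", "C", "B", "A"]

grades = pd.Series(
    pd.Categorical(["B", "A", "C", "A"], categories=grade_order, ordered=True)
)

print(grades.sort_values().tolist())  # sorted by grade order, not alphabetically
print((grades > "C").tolist())        # ordering comparisons are now allowed
print(grades.max())                   # the best grade present
```

Note that the sort puts C before B before A, which is the opposite of alphabetical order; the order comes from `grade_order`, not from the strings.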
11:39 - 11:45

There are other types of data we're going to encounter. Time, we usually are going to treat as continuous.
11:45 - 11:47

It might be stored written out as a date,
11:47 - 11:53

but if we're actually going to work with time, we're probably going to convert it into a continuous variable.
11:53 - 12:02

Common encodings for that are either a number of seconds or a number of years, sometimes a number of milliseconds.
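
One way to sketch the "number of seconds" encoding in pandas is to subtract a reference time from parsed timestamps. The timestamps and the choice of 1970-01-01 as the reference point are assumptions for illustration.

```python
import pandas as pd

# Hypothetical timestamps written out as text.
stamps = pd.Series(pd.to_datetime([
    "1970-01-01 00:00:00",
    "1970-01-01 00:01:00",
    "1970-01-02 00:00:00",
]))

# Convert to a continuous variable: seconds since an assumed reference time.
epoch = pd.Timestamp("1970-01-01")
seconds = (stamps - epoch).dt.total_seconds()
print(seconds.tolist())
```

The result is a float column that supports differences and averages like any other continuous variable.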
12:02 - 12:07

Text is categorical-ish, but we usually convert it into categorical or count variables.
12:07 - 12:15

We'll talk more about that later, when we actually get to processing text. Images are stored as matrices of ints or reals.
12:15 - 12:21

We may also extract features from them that become other kinds of variables. Money is often stored as an int or a float,
12:21 - 12:25

but we have to be careful because the imprecision of floating-point numbers can
12:25 - 12:30

cause a problem when we're using money for the purposes
12:30 - 12:35

of creating financial transactions. For the kinds of things we might be doing with money here in this class,
12:35 - 12:40

it's not going to be a problem. Nobody loses money if you have a little imprecision in computing
12:40 - 12:46

the average price of a ton of potatoes.
12:46 - 12:51

It really becomes a problem when you feed that back into actual financial
12:51 - 12:57

transaction systems, because it is impossible to precisely represent 10 cents, written as 0.10 dollars, in binary floating point.
12:57 - 13:02

It's just a hair under or over 10 cents when you store it in a floating-point value.
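
The 10-cent problem can be seen with the Python standard library alone. This is a sketch; storing integer cents or using `Decimal` are two common workarounds, not something the lecture prescribes.

```python
from decimal import Decimal

# The exact value actually stored for the float literal 0.1.
print(Decimal(0.1))

# Adding three 10-cent amounts as floats does not give exactly 0.30.
float_total = sum([0.10] * 3)
print(float_total == 0.30)

# Alternative 1: store money as an integer number of cents.
cents_total = sum([10] * 3)
print(cents_total == 30)

# Alternative 2: use exact decimal arithmetic.
decimal_total = sum([Decimal("0.10")] * 3)
print(decimal_total == Decimal("0.30"))
```

For averages and summaries the float error is negligible; it matters when the totals must balance to the cent.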
13:02 - 13:04

So the other thing I want to highlight here, though,
13:04 - 13:14

is that knowing the Python, NumPy, or pandas data type of a variable is not sufficient to know its type in a statistical sense.
13:14 - 13:18

Suppose we have a variable that's an int64. Well, what's in that?
13:18 - 13:23

Is it a categorical variable, like our movie and user IDs in MovieLens?
13:23 - 13:31

Is it a continuous variable that happens to be measured with integer precision? In the penguins data set, when you download it,
13:31 - 13:34

the body mass is integers, because they were just measuring to the whole gram.
13:34 - 13:41

They didn't measure fractional grams. But conceptually, mass really is a continuous value.
13:41 - 13:45

It's just that our measurements aren't that precise.
13:45 - 13:51

Is it ordinal? Or is it zeros and ones that are representing a logical variable?
13:51 - 13:55

As I said before, integers with missing values also load as floats.
13:55 - 14:02

So we can look at the data types, and we can look at the data itself, to try to start to get a sense of what it is.
14:02 - 14:10

But knowing that is not sufficient to know what type of data we're dealing with for the purposes of handling it properly.
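
To make this concrete, here is a small sketch with four integer columns that play four different statistical roles. All values and column names are hypothetical.

```python
import pandas as pd

# Four columns with the same integer storage but different meanings.
df = pd.DataFrame({
    "user_id": [75, 342, 75],           # identifier: categorical
    "body_mass_g": [3750, 3800, 3250],  # continuous, measured to the whole gram
    "rating": [5, 4, 3],                # ordinal
    "is_member": [1, 0, 1],             # logical, encoded as 0/1
})

print(df.dtypes)            # every column reports the same integer dtype
print(df.dtypes.nunique())  # number of distinct storage types
```

Nothing in `df.dtypes` distinguishes these columns; that information has to come from the documentation.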
14:10 - 14:18

I want to talk a little bit about entities, or instances, which I introduced last time and which are talked about in the reading.
14:18 - 14:24

We want to be clear, when we have a data frame, a data table:
14:24 - 14:31

what are the things being observed in this set of observations? If you've taken the databases class,
14:31 - 14:36

we called these entities. Sometimes this is pretty straightforward.
14:36 - 14:40

For example, in this penguin data set, each row represents the measurements of one penguin.
14:40 - 14:43

But sometimes they're complex and linked, such as in the rating data table:
14:43 - 14:49

each instance is a rating, but that's a rating about a movie.
14:49 - 14:58

And we can also derive things. For example, we could count the number of ratings for each movie, and that could be a variable for a movie.
14:58 - 15:02

We could do this aggregation. We're going to see how to do aggregations in a little bit.
15:02 - 15:11

That gives us a new variable, number of ratings, that becomes a variable for observations of movies.
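
A rough sketch of that aggregation, assuming a small made-up ratings table. The column names `user_id`, `movie_id`, and `rating` are assumptions, not necessarily the names in the MovieLens files.

```python
import pandas as pd

# Each row is one rating: an observation linking a user and a movie.
ratings = pd.DataFrame({
    "user_id": [75, 75, 342, 342, 342],
    "movie_id": [1, 2, 1, 2, 3],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Count ratings per movie: the result has one row per movie,
# so the count is a variable describing movies, not ratings.
rating_counts = (
    ratings.groupby("movie_id")["rating"].count().rename("num_ratings")
)
print(rating_counts)
```

The aggregation changes what each row represents: the input is indexed by rating, the output by movie.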
15:11 - 15:11

So to wrap up,
15:11 - 15:19

there are many different kinds of variables broadly divided into continuous and discrete with several specific types of discrete variables.
15:19 - 15:24

These conceptual variable types do not map one-to-one to pandas data types.
15:24 - 15:30

You need more information in order to know how to properly interpret and work with a variable.
15:30 - 15:34

So the data source that you're working with needs to be documented.
15:34 - 15:47

And if you're creating a data source, you need to document what all of the columns mean and how they're being encoded and stored.

Title:: https:/.../afc96ff2-0043-4387-af9b-ad9000efd6b0-753c0fcd-b2eb-4f7d-8d63-ad9101272f10.mp4?invocationId=c1ea84bb-2f09-ec11-a9e9-0a1a827ad0ec
Video Language:: English
Duration:: 15:47

janetlayne edited English subtitles for https:/.../afc96ff2-0043-4387-af9b-ad9000efd6b0-753c0fcd-b2eb-4f7d-8d63-ad9101272f10.mp4?invocationId=c1ea84bb-2f09-ec11-a9e9-0a1a827ad0ec
