-
Welcome back. This is the beginning of the material for week two, in which we're going to be talking about describing data.
-
And so the learning outcomes for this week are for you to be able to load a data
-
file in pandas and describe its basic structural characteristics.
-
For example, how big is the data file? What data types are in it?
-
We want to be able to identify the type of a variable and descriptive statistics that are appropriate to that variable.
-
We also want to be able to describe the distribution of a variable,
-
to describe how data was collected and to start to reason about the limitations of that collection process and of the representation of data.
-
So before we get into describing data, I want to talk a little bit about what the data is that we're talking about.
-
Let's start with a definition from the Oxford Dictionary: data is "facts or statistics collected together for reference or analysis."
-
That's a good enough definition for what we're going to be talking about.
-
So we have some data points, some facts of some kind or another, that have been assembled or collected together.
-
So, where does data come from? In the broader scheme of what we're trying to do, there are a lot of ways to think about it.
-
There are a lot of philosophical questions about what data is and where it comes from.
-
But for our purposes, we can think about it this way:
-
there's something that we want to learn about.
-
It might be objects or entities themselves; maybe we want to learn something about people or animals.
-
It might be a process like a social process or a natural process that we want to learn about.
-
But that conceptual thing results in some kind of phenomenon:
-
either a naturally occurring phenomenon that we can observe, or an experiment
-
that we can use to observe and elucidate what it is that we're looking at.
-
I want to make this distinction here between the thing and what we can observe,
-
because it might be that our observations are a layer removed from the thing that we're trying to study.
-
For example, it's impossible to directly observe anything about someone's internal mental state.
-
But we can make observations of what they do or of what they tell us about what they're thinking.
-
Those observations are really valuable, but it's important to maintain the distinction about what we observe. If someone tells us they are feeling happy,
-
that observation is just that: they told us they are feeling happy. It's not directly the underlying mental state.
-
We have some phenomenon or experiment that then produces raw data: directly what was observed or what was measured as a result of this process.
-
The raw data is then transformed, cleaned up, documented and labeled to produce a data set that is basically ready to use for some purpose.
-
We then use that data set to make inferences or analysis or whatever else we're going to do,
-
which hopefully then give us answers to what it is that we're trying to study, at least partial answers, at least some answers.
-
But we have these multiple steps here. We have the thing we can observe. We have the observations themselves.
-
We then have the collection and organization and preparation of the observations into something that's usable
-
for an inference task or a prediction task or whatever it is that we're trying to do with our data science tools.
-
One way to summarize it is that the data is the messy pile. The data set is when it's cleaned up and it's ready for us to actually be able to use it.
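To make the pile-versus-data-set distinction a little more concrete, here is a minimal sketch in Python with pandas. Everything in it is an assumption for illustration: the column names, the values, and the specific cleaning steps are invented, and real preparation depends on the data and the purpose. It just shows raw records being cleaned up and labeled into something ready to use.

```python
import pandas as pd

# Hypothetical raw observations: inconsistent labels, a missing value, a duplicate.
raw = pd.DataFrame({
    "resp": ["happy", "Happy ", "sad", None, "sad"],
    "id": [1, 2, 3, 4, 3],
})

dataset = (
    raw
    .drop_duplicates()                        # remove the repeated record
    .dropna(subset=["resp"])                  # drop records with no response
    .assign(resp=lambda d: d["resp"].str.strip().str.lower())  # normalize labels
    # Label the columns; "reported_mood" records what the person told us,
    # not their underlying mental state.
    .rename(columns={"resp": "reported_mood", "id": "participant_id"})
    .reset_index(drop=True)
)

print(dataset)
```

Note that the column is named for what was actually observed (a reported mood), which keeps the layer between the observation and the thing we want to study visible in the data set itself.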
-
There are a number of definitions of a data set. One of the readings that I've assigned goes over some of those definitions.
-
But the common themes across the definitions are that it's data that's collected or curated for a purpose.
-
It's mostly ready to use and it's documented for that purpose or for that task.
-
That doesn't mean it's the only purpose or task that it can be used for.
-
But usually it was created or assembled for some particular purpose or task.
-
It's documented in the context of that. So when we get some data,
-
whether it's raw data or a processed, ready-to-use data set, there are a few things that we need to know.
-
One is we need to know how much data we have: how many records are there, how many columns, how big is the file?
-
What kinds of data do we have? We're going to talk more about a number of these.
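As a rough sketch of answering those first two questions (how much data, and what kinds), here is one way it might look in Python with pandas. The tiny in-memory CSV and its column names are made up for illustration and stand in for a real data file; with a file on disk you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a data file; the columns are invented for illustration.
csv_text = """name,species,weight_kg,tagged
Ada,cat,4.2,True
Bo,dog,18.5,False
Cy,cat,3.9,True
"""

df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape                      # how many records and columns
col_types = df.dtypes                          # what kind of data each column holds
mem_bytes = df.memory_usage(deep=True).sum()   # approximate size in memory

print(f"{n_rows} records, {n_cols} columns")
print(col_types)
print(f"about {mem_bytes} bytes in memory")
```

The in-memory size is not the same as the size of the file on disk, but either one gives a sense of how much data there is to work with.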
-
What is the data about? What are the things, the entities, the objects that the data is about?
-
How was it collected? How was it recorded?
-
The biases in the data: there might be bias in the selection process,
-
the recording process, et cetera, for the data.
-
We want to know what it is that we know about the process, which we'll call the data generating process.
-
And the data generating process is the combination of the underlying phenomenon,
-
the observation mechanism, and actually recording those observations into data.
-
As I said, the readings that we have for this week discuss this more.
-
So to wrap up: data sets arise from curating or collecting data, often resulting from some observations, for a particular purpose.
-
There are layers between the thing that we actually want to study and the data that we have available.
-
And this week we're going to be leveraging the readings.
-
In the first week, it was primarily the videos, with the textbook and the Python tutorials to supplement and give you more of an on-ramp.
-
This week, the readings are a fairly fundamental piece of what it is that we're going to be discussing and working on.