-
Not Synced
1
00:00:04,870 --> 00:00:12,190
This video, I'm going to introduce some of the fundamental structures and principles of doing scientific computing in Python.
-
Not Synced
2
00:00:12,190 --> 00:00:18,070
Since the last couple of videos, I've briefly introduced Python's core structures and core data types.
-
Not Synced
3
00:00:18,070 --> 00:00:23,390
But a lot of our work is going to be working with an additional set of structures,
-
Not Synced
4
00:00:23,390 --> 00:00:28,060
a set of libraries known as scientific python or as the pie data stack.
-
Not Synced
5
00:00:28,060 --> 00:00:35,500
So learning outcomes of this video are to understand limitations of core python data types for data science to know.
-
Not Synced
6
00:00:35,500 --> 00:00:41,440
Three key rate data types particularly. Are we focusing primarily on the number high end the array?
-
Not Synced
7
00:00:41,440 --> 00:00:48,450
Also briefly introduce serious and data frame. We're going to see a lot more about those next week and the.
-
Not Synced
8
00:00:48,450 --> 00:00:56,390
To be able to perform basic vectorized operations. So in Python, we can write a list of numbers like this.
-
Not Synced
9
00:00:56,390 --> 00:01:01,040
So numbers equals I'm using the list syntax that we talked about in the earlier video.
-
Not Synced
10
00:01:01,040 --> 00:01:07,630
And I've got four numbers in here that I'm storing in this list and the variable numbers.
-
Not Synced
11
00:01:07,630 --> 00:01:11,260
Now. This seems like a perfectly natural thing to do.
-
Not Synced
12
00:01:11,260 --> 00:01:17,860
But remember, we said I said in the previous video that everything in Python is an object.
-
Not Synced
13
00:01:17,860 --> 00:01:24,220
So this isn't just a list of numbers. If we wrote this in Java or C, we would have an array of numbers where system array.
-
Not Synced
14
00:01:24,220 --> 00:01:31,180
And it's the stores, the numbers, one after the other. But in Python, that's not how it works because everything is an object.
-
Not Synced
15
00:01:31,180 --> 00:01:34,960
What our list stores is, it stores pointers to numbers.
-
Not Synced
16
00:01:34,960 --> 00:01:50,480
So we've got a list. And it's got a pointer to O point three and a pointer to nine point two, et cetera.
-
Not Synced
17
00:01:50,480 --> 00:01:58,510
So what we store is the list itself has these pointers, which are eight bytes each.
-
Not Synced
18
00:01:58,510 --> 00:02:02,650
And it has the. Numbers themselves.
-
Not Synced
19
00:02:02,650 --> 00:02:08,730
A flooding point. A double precision flooding point number takes eight bites. But the numbers aren't just numbers, they're objects.
-
Not Synced
20
00:02:08,730 --> 00:02:12,870
And every python object has at least 16 bites.
-
Not Synced
21
00:02:12,870 --> 00:02:17,610
This is all on a 64 bit system, has at least 16 bytes of header information.
-
Not Synced
22
00:02:17,610 --> 00:02:24,060
And so this whole list of numbers takes 144 bytes because we've the list has a header.
-
Not Synced
23
00:02:24,060 --> 00:02:28,830
It has pointers. The pointers are the objects that have headers in addition to the data.
-
Not Synced
24
00:02:28,830 --> 00:02:36,300
Also, the elements of a list can be different types. So when you go over the list, there's no guarantee that everything is a number.
-
Not Synced
25
00:02:36,300 --> 00:02:42,510
So if we if we want to sum our numbers, there is a python function called some that will double do a sum.
-
Not Synced
26
00:02:42,510 --> 00:02:47,550
But it's basically doing this. So we'll initialize a variable called total.
-
Not Synced
27
00:02:47,550 --> 00:02:51,400
Well, then loop over all of our numbers and we'll add each one to the total.
-
Not Synced
28
00:02:51,400 --> 00:02:57,360
And that's gonna make the total equal the total of the numbers. This works, it works just fine.
-
Not Synced
29
00:02:57,360 --> 00:03:02,310
And for a list of four numbers, it's completely fine. But Python.
-
Not Synced
30
00:03:02,310 --> 00:03:07,770
There's a couple of issues here. One python is Python. The language itself is rather slow.
-
Not Synced
31
00:03:07,770 --> 00:03:11,280
It's quite convenient, but it's slow and it's slow for two reasons.
-
Not Synced
32
00:03:11,280 --> 00:03:17,790
One is that it is interpreted the python code is compiled to an internal data structure,
-
Not Synced
33
00:03:17,790 --> 00:03:24,090
but then there's C code that runs in a loop interpreting that data structure.
-
Not Synced
34
00:03:24,090 --> 00:03:32,310
It's also dynamically typed. So remember, I said there's the the values and the numbers are in the list can have different types.
-
Not Synced
35
00:03:32,310 --> 00:03:38,070
We wrote a set of numbers there. But Python isn't guaranteed that they're all numbers.
-
Not Synced
36
00:03:38,070 --> 00:03:41,040
And so rather than saying, okay, I have a number, I'm going to keep adding it.
-
Not Synced
37
00:03:41,040 --> 00:03:45,810
What it says is I have a thing and I'm going to try to add it to the thing I already have.
-
Not Synced
38
00:03:45,810 --> 00:03:50,130
And it has to go look up how to do that, and it does that every time for each number.
-
Not Synced
39
00:03:50,130 --> 00:03:58,900
This is all very slow. Also, since it's pointers, if you've taken the computer architecture class.
-
Not Synced
40
00:03:58,900 --> 00:04:06,610
That may ring a few alarm bells for you because rather than just having an array of numbers which will be loaded into our cash very quickly accessed,
-
Not Synced
41
00:04:06,610 --> 00:04:12,280
we have an array of pointers and each pointer has to go off and look up the number in memory.
-
Not Synced
42
00:04:12,280 --> 00:04:16,970
And those numbers might be stored next to each other, but they might be stored all over the heap.
-
Not Synced
43
00:04:16,970 --> 00:04:25,390
We're gonna have cash misses which make these slow process even slower so we can write code like this and it works fine,
-
Not Synced
44
00:04:25,390 --> 00:04:31,510
but it's not an efficient way to do computation. And as we get to larger and larger data sets, you get a few hundred.
-
Not Synced
45
00:04:31,510 --> 00:04:37,780
You got a few thousand numbers. You're gonna be fine. When you've got a million numbers, when you have a hundred million or a billion numbers.
-
Not Synced
46
00:04:37,780 --> 00:04:47,500
Then things start to really get slow. So none PI is a python package that provides efficient data types for doing numeric computation.
-
Not Synced
47
00:04:47,500 --> 00:04:55,720
And NUM Pi underlies almost all of the rest of the scientific python and data science and machine learning for Python software.
-
Not Synced
48
00:04:55,720 --> 00:05:00,430
It has a data type called an NDA array. There's a variety of different ways you can create one,
-
Not Synced
49
00:05:00,430 --> 00:05:06,170
but here we're going to just create one using the array constructor and then we're going to pass it our list.
-
Not Synced
50
00:05:06,170 --> 00:05:14,890
So we're creating the list in this case. We are going to see later many ways to load arrays without having to go through a list.
-
Not Synced
51
00:05:14,890 --> 00:05:19,210
I'm just doing this here so I can demonstrate how the array works.
-
Not Synced
52
00:05:19,210 --> 00:05:24,940
But all the elements are of the same type in an array and they're also stored directly in the array.
-
Not Synced
53
00:05:24,940 --> 00:05:28,480
So this ENDI array, it's the stores, the floats, one right after each other.
-
Not Synced
54
00:05:28,480 --> 00:05:35,140
Eight bytes each. And so we don't have the indirection, three pointers. We don't have all of the overhead of storing all of these different objects.
-
Not Synced
55
00:05:35,140 --> 00:05:41,170
It's just storing the numbers, one right after each other. You can have an Endi array of objects and that's going to store the pointers.
-
Not Synced
56
00:05:41,170 --> 00:05:46,900
And that's useful in a few cases, especially for treating strings consistently with numbers.
-
Not Synced
57
00:05:46,900 --> 00:05:55,280
But it really shines when we're dealing with arrays of numbers for various scientific computing applications.
-
Not Synced
58
00:05:55,280 --> 00:06:01,880
So if you want to sum our numbers, we can use the num pi some function and it it's much shorter.
-
Not Synced
59
00:06:01,880 --> 00:06:08,340
A little python has a some function, as I mentioned, that we could have used, but also it's implemented in a compiled language.
-
Not Synced
60
00:06:08,340 --> 00:06:14,660
And when you have a num high array that's storing numbers, whether the integers,
-
Not Synced
61
00:06:14,660 --> 00:06:20,840
whether they're floating point numbers, it's stored internally in a format that's compatible with C or Fortran.
-
Not Synced
62
00:06:20,840 --> 00:06:25,570
And so a lot of num pi. Functions.
-
Not Synced
63
00:06:25,570 --> 00:06:29,650
What they're doing is they're passing the array to see code or Fortran code or
-
Not Synced
64
00:06:29,650 --> 00:06:36,520
C++ code that has a comp. loop that works on that data type and is able to very,
-
Not Synced
65
00:06:36,520 --> 00:06:42,070
very efficiently sum up those numbers. We don't have a cast mate cash issues from the indirection.
-
Not Synced
66
00:06:42,070 --> 00:06:45,640
We don't have the overhead of Python's interpreted code.
-
Not Synced
67
00:06:45,640 --> 00:06:52,810
We don't have the overhead of having to deal with the the elements of the array might be of different types.
-
Not Synced
68
00:06:52,810 --> 00:06:59,650
They're all the same type. We can work over them in in a loop, in comp.
-
Not Synced
69
00:06:59,650 --> 00:07:05,350
Machine code. So in general, don't loop. You can loop over a number high end the array.
-
Not Synced
70
00:07:05,350 --> 00:07:09,370
It's iterable just like a list. But in general, you don't want to do that.
-
Not Synced
71
00:07:09,370 --> 00:07:14,380
You want to set up your code so that num pi can do the looping for you.
-
Not Synced
72
00:07:14,380 --> 00:07:24,970
And effectively what we wind up using Python as is a scripting language to tell the underlying C, C++ and Fortran code.
-
Not Synced
73
00:07:24,970 --> 00:07:26,500
What to do.
-
Not Synced
74
00:07:26,500 --> 00:07:38,240
And the fact that Python is a slow language doesn't matter very much because the vast majority of our processing time won't be spent in Python.
-
Not Synced
75
00:07:38,240 --> 00:07:43,190
So I thought none pile. So has a feature called Vector Ization.
-
Not Synced
76
00:07:43,190 --> 00:07:47,750
There are a lot of operations that operate on an entire array at a time.
-
Not Synced
77
00:07:47,750 --> 00:07:52,580
So if I get it, I can create another array. The Linn's base function here.
-
Not Synced
78
00:07:52,580 --> 00:08:01,180
It. The land space function here.
-
Not Synced
79
00:08:01,180 --> 00:08:06,370
It creates an array of four values that are evenly spaced from zero to one inclusive.
-
Not Synced
80
00:08:06,370 --> 00:08:11,170
And then the plus operator here, remember, plus between two numbers is going to add it between two strings.
-
Not Synced
81
00:08:11,170 --> 00:08:18,460
It's going to concatenate them plus between two arrays requires them to be of compatible shapes.
-
Not Synced
82
00:08:18,460 --> 00:08:24,970
And it adds the the corresponding elements of the arrays to each other and returns a new array.
-
Not Synced
83
00:08:24,970 --> 00:08:29,500
So what if we have a bunch of number one array of numbers and we have another array of numbers?
-
Not Synced
84
00:08:29,500 --> 00:08:39,460
We want to add them together. We just add the two arrays and it does that addition again in a loop written in C or Fortran.
-
Not Synced
85
00:08:39,460 --> 00:08:41,740
And it does it very, very quickly.
-
Not Synced
86
00:08:41,740 --> 00:08:49,540
You can also add an integer or an integer or a floating point, single number to an array, and it'll add it to every element of the array.
-
Not Synced
87
00:08:49,540 --> 00:08:55,120
But this is the key point to be able to make scientific computing with Python fast.
-
Not Synced
88
00:08:55,120 --> 00:09:04,090
We setup our code and throughout. We're gonna be trying to set it up so that we use vectorized nation as much as possible.
-
Not Synced
89
00:09:04,090 --> 00:09:11,270
And we vectorized over as much data at a time as possible so we can allow the optimized loops and in num pi,
-
Not Synced
90
00:09:11,270 --> 00:09:21,400
in Pandas and Sai Pi and psychic learn to do the work and to put as much of the work as possible into those compile loops.
-
Not Synced
91
00:09:21,400 --> 00:09:28,390
So we're not spending a lot of time in slow python code. Each array has three key things.
-
Not Synced
92
00:09:28,390 --> 00:09:33,740
It has a data type called a D type, and that says what kind of elements are in the array?
-
Not Synced
93
00:09:33,740 --> 00:09:39,710
PI has data types for your standard integers of various sizes.
-
Not Synced
94
00:09:39,710 --> 00:09:44,870
Single and double precision floating point numbers. It also has D types for working with.
-
Not Synced
95
00:09:44,870 --> 00:09:50,080
Date. Date. Times. Strings and then storing arrays.
-
Not Synced
96
00:09:50,080 --> 00:09:53,000
That's where pointers to arbitrary python objects.
-
Not Synced
97
00:09:53,000 --> 00:09:59,720
The data type or the array also has a shape which is a tuple of integers that says how big the array is.
-
Not Synced
98
00:09:59,720 --> 00:10:04,700
The array may be multidimensional. So Endi array stands for N Dimensional Array.
-
Not Synced
99
00:10:04,700 --> 00:10:15,580
And it can be one, two, three, four, whatever dimensional. So if we have a 100 by 50 matrix, it's stored in a in a number PI in the array of shape.
-
Not Synced
100
00:10:15,580 --> 00:10:19,730
One hundred, comma 50. And then there's the data. It's stealth.
-
Not Synced
101
00:10:19,730 --> 00:10:26,420
That's the elements of the array. The data points themselves that are stored in the array.
-
Not Synced
102
00:10:26,420 --> 00:10:33,170
So then pandas, which we're going to see next week, builds on top of a raise with two new data types,
-
Not Synced
103
00:10:33,170 --> 00:10:38,960
a series is an array with an associated index that allows us to look up.
-
Not Synced
104
00:10:38,960 --> 00:10:48,260
So an ENDI array, like a python list is indexed using numbers starting from zero zero one, two, three, four, five.
-
Not Synced
105
00:10:48,260 --> 00:10:55,850
But sometimes for a lot of times we're gonna have some other natural index. If you've taken databases, it's equivalent to the primary key.
-
Not Synced
106
00:10:55,850 --> 00:11:00,150
So a series is an array with an associated index that might be other numbers.
-
Not Synced
107
00:11:00,150 --> 00:11:04,580
That might be strings. But some other way of accessing the points.
-
Not Synced
108
00:11:04,580 --> 00:11:12,470
It also has an efficient representations that you can have a series that's indexed zero through and minus one where N is the length of the series.
-
Not Synced
109
00:11:12,470 --> 00:11:18,800
And it does not take up a lot of space to do that. And then a data frame is a table where each column is a series.
-
Not Synced
110
00:11:18,800 --> 00:11:25,760
And they all share the same index. And we're gonna see those a lot because we load in a set of data points that's gonna be in a data frame.
-
Not Synced
111
00:11:25,760 --> 00:11:30,110
Now, an assignment zero, you're going to briefly see both of these data structures.
-
Not Synced
112
00:11:30,110 --> 00:11:35,930
I walk you through everything you have to do with them in assignment zero. And we're going to introduce them a lot more.
-
Not Synced
113
00:11:35,930 --> 00:11:39,560
Woomera's talking about how to describe data next week.
-
Not Synced
114
00:11:39,560 --> 00:11:47,920
But. Endi Arae, the number higher radiata structure is the fundamental core that all of these others are built on.
-
Not Synced
115
00:11:47,920 --> 00:11:55,740
The series augments it with an index. The data frame collects multiple series together with column names like a spreadsheet table.
-
Not Synced
116
00:11:55,740 --> 00:12:00,950
So we're still going to sometimes use Python native lists and loops.
-
Not Synced
117
00:12:00,950 --> 00:12:05,360
Oftentimes, it's going to be because for some reason, we need a list of arrays or data frames.
-
Not Synced
118
00:12:05,360 --> 00:12:07,310
Also, if we need to loop, if we have, say,
-
Not Synced
119
00:12:07,310 --> 00:12:15,950
20 input files that we need to put together to to to be our data set or we got different groups of data, we're going to loop over those.
-
Not Synced
120
00:12:15,950 --> 00:12:20,390
But the big thing we avoid doing is looping over individual data points.
-
Not Synced
121
00:12:20,390 --> 00:12:24,450
We load in a few hundred thousand records. They're going to be in a data frame.
-
Not Synced
122
00:12:24,450 --> 00:12:33,470
We don't loop over the rows of a data frame. If we can avoid it, because there's almost always a more efficient way to do that computation,
-
Not Synced
123
00:12:33,470 --> 00:12:41,000
that pushes a lot of it into the C and C++ code and Fortran code that underlies NUM, Pi, pandas, et cetera.
-
Not Synced
124
00:12:41,000 --> 00:12:46,070
So wrap up num pi provides efficient to ray data structures that are more memory compact's.
-
Not Synced
125
00:12:46,070 --> 00:12:50,690
They don't take up nearly as much space and they're also much more efficient to compute over.
-
Not Synced
126
00:12:50,690 --> 00:12:54,320
These are going to be the backbone of our data processing throughout the rest of the class.
-
Not Synced
127
00:12:54,320 --> 00:13:04,250
And we want to prefer vectorized operations that perform these loops in native comp. machine code whenever possible for a little bit of practice.
-
Not Synced
128
00:13:04,250 --> 00:13:09,320
I encourage you to take the example code from this from these slides and go and try them
-
Not Synced
129
00:13:09,320 --> 00:13:13,880
in a notebook so you can get a little more practice creating notebooks and running code.
-
Not Synced
130
00:13:13,880 --> 00:13:20,967
I will see you in class.
-
Not Synced