1 00:00:04,870 --> 00:00:12,190 This video, I'm going to introduce some of the fundamental structures and principles of doing scientific computing in Python. 2 00:00:12,190 --> 00:00:18,070 Since the last couple of videos, I've briefly introduced Python's core structures and core data types. 3 00:00:18,070 --> 00:00:23,390 But a lot of our work is going to be working with an additional set of structures, 4 00:00:23,390 --> 00:00:28,060 a set of libraries known as scientific python or as the pie data stack. 5 00:00:28,060 --> 00:00:35,500 So learning outcomes of this video are to understand limitations of core python data types for data science to know. 6 00:00:35,500 --> 00:00:41,440 Three key rate data types particularly. Are we focusing primarily on the number high end the array? 7 00:00:41,440 --> 00:00:48,450 Also briefly introduce serious and data frame. We're going to see a lot more about those next week and the. 8 00:00:48,450 --> 00:00:56,390 To be able to perform basic vectorized operations. So in Python, we can write a list of numbers like this. 9 00:00:56,390 --> 00:01:01,040 So numbers equals I'm using the list syntax that we talked about in the earlier video. 10 00:01:01,040 --> 00:01:07,630 And I've got four numbers in here that I'm storing in this list and the variable numbers. 11 00:01:07,630 --> 00:01:11,260 Now. This seems like a perfectly natural thing to do. 12 00:01:11,260 --> 00:01:17,860 But remember, we said I said in the previous video that everything in Python is an object. 13 00:01:17,860 --> 00:01:24,220 So this isn't just a list of numbers. If we wrote this in Java or C, we would have an array of numbers where system array. 14 00:01:24,220 --> 00:01:31,180 And it's the stores, the numbers, one after the other. But in Python, that's not how it works because everything is an object. 15 00:01:31,180 --> 00:01:34,960 What our list stores is, it stores pointers to numbers. 16 00:01:34,960 --> 00:01:50,480 So we've got a list. And it's got a pointer to O point three and a pointer to nine point two, et cetera. 17 00:01:50,480 --> 00:01:58,510 So what we store is the list itself has these pointers, which are eight bytes each. 18 00:01:58,510 --> 00:02:02,650 And it has the. Numbers themselves. 19 00:02:02,650 --> 00:02:08,730 A flooding point. A double precision flooding point number takes eight bites. But the numbers aren't just numbers, they're objects. 20 00:02:08,730 --> 00:02:12,870 And every python object has at least 16 bites. 21 00:02:12,870 --> 00:02:17,610 This is all on a 64 bit system, has at least 16 bytes of header information. 22 00:02:17,610 --> 00:02:24,060 And so this whole list of numbers takes 144 bytes because we've the list has a header. 23 00:02:24,060 --> 00:02:28,830 It has pointers. The pointers are the objects that have headers in addition to the data. 24 00:02:28,830 --> 00:02:36,300 Also, the elements of a list can be different types. So when you go over the list, there's no guarantee that everything is a number. 25 00:02:36,300 --> 00:02:42,510 So if we if we want to sum our numbers, there is a python function called some that will double do a sum. 26 00:02:42,510 --> 00:02:47,550 But it's basically doing this. So we'll initialize a variable called total. 27 00:02:47,550 --> 00:02:51,400 Well, then loop over all of our numbers and we'll add each one to the total. 28 00:02:51,400 --> 00:02:57,360 And that's gonna make the total equal the total of the numbers. This works, it works just fine. 29 00:02:57,360 --> 00:03:02,310 And for a list of four numbers, it's completely fine. But Python. 30 00:03:02,310 --> 00:03:07,770 There's a couple of issues here. One python is Python. The language itself is rather slow. 31 00:03:07,770 --> 00:03:11,280 It's quite convenient, but it's slow and it's slow for two reasons. 32 00:03:11,280 --> 00:03:17,790 One is that it is interpreted the python code is compiled to an internal data structure, 33 00:03:17,790 --> 00:03:24,090 but then there's C code that runs in a loop interpreting that data structure. 34 00:03:24,090 --> 00:03:32,310 It's also dynamically typed. So remember, I said there's the the values and the numbers are in the list can have different types. 35 00:03:32,310 --> 00:03:38,070 We wrote a set of numbers there. But Python isn't guaranteed that they're all numbers. 36 00:03:38,070 --> 00:03:41,040 And so rather than saying, okay, I have a number, I'm going to keep adding it. 37 00:03:41,040 --> 00:03:45,810 What it says is I have a thing and I'm going to try to add it to the thing I already have. 38 00:03:45,810 --> 00:03:50,130 And it has to go look up how to do that, and it does that every time for each number. 39 00:03:50,130 --> 00:03:58,900 This is all very slow. Also, since it's pointers, if you've taken the computer architecture class. 40 00:03:58,900 --> 00:04:06,610 That may ring a few alarm bells for you because rather than just having an array of numbers which will be loaded into our cash very quickly accessed, 41 00:04:06,610 --> 00:04:12,280 we have an array of pointers and each pointer has to go off and look up the number in memory. 42 00:04:12,280 --> 00:04:16,970 And those numbers might be stored next to each other, but they might be stored all over the heap. 43 00:04:16,970 --> 00:04:25,390 We're gonna have cash misses which make these slow process even slower so we can write code like this and it works fine, 44 00:04:25,390 --> 00:04:31,510 but it's not an efficient way to do computation. And as we get to larger and larger data sets, you get a few hundred. 45 00:04:31,510 --> 00:04:37,780 You got a few thousand numbers. You're gonna be fine. When you've got a million numbers, when you have a hundred million or a billion numbers. 46 00:04:37,780 --> 00:04:47,500 Then things start to really get slow. So none PI is a python package that provides efficient data types for doing numeric computation. 47 00:04:47,500 --> 00:04:55,720 And NUM Pi underlies almost all of the rest of the scientific python and data science and machine learning for Python software. 48 00:04:55,720 --> 00:05:00,430 It has a data type called an NDA array. There's a variety of different ways you can create one, 49 00:05:00,430 --> 00:05:06,170 but here we're going to just create one using the array constructor and then we're going to pass it our list. 50 00:05:06,170 --> 00:05:14,890 So we're creating the list in this case. We are going to see later many ways to load arrays without having to go through a list. 51 00:05:14,890 --> 00:05:19,210 I'm just doing this here so I can demonstrate how the array works. 52 00:05:19,210 --> 00:05:24,940 But all the elements are of the same type in an array and they're also stored directly in the array. 53 00:05:24,940 --> 00:05:28,480 So this ENDI array, it's the stores, the floats, one right after each other. 54 00:05:28,480 --> 00:05:35,140 Eight bytes each. And so we don't have the indirection, three pointers. We don't have all of the overhead of storing all of these different objects. 55 00:05:35,140 --> 00:05:41,170 It's just storing the numbers, one right after each other. You can have an Endi array of objects and that's going to store the pointers. 56 00:05:41,170 --> 00:05:46,900 And that's useful in a few cases, especially for treating strings consistently with numbers. 57 00:05:46,900 --> 00:05:55,280 But it really shines when we're dealing with arrays of numbers for various scientific computing applications. 58 00:05:55,280 --> 00:06:01,880 So if you want to sum our numbers, we can use the num pi some function and it it's much shorter. 59 00:06:01,880 --> 00:06:08,340 A little python has a some function, as I mentioned, that we could have used, but also it's implemented in a compiled language. 60 00:06:08,340 --> 00:06:14,660 And when you have a num high array that's storing numbers, whether the integers, 61 00:06:14,660 --> 00:06:20,840 whether they're floating point numbers, it's stored internally in a format that's compatible with C or Fortran. 62 00:06:20,840 --> 00:06:25,570 And so a lot of num pi. Functions. 63 00:06:25,570 --> 00:06:29,650 What they're doing is they're passing the array to see code or Fortran code or 64 00:06:29,650 --> 00:06:36,520 C++ code that has a comp. loop that works on that data type and is able to very, 65 00:06:36,520 --> 00:06:42,070 very efficiently sum up those numbers. We don't have a cast mate cash issues from the indirection. 66 00:06:42,070 --> 00:06:45,640 We don't have the overhead of Python's interpreted code. 67 00:06:45,640 --> 00:06:52,810 We don't have the overhead of having to deal with the the elements of the array might be of different types. 68 00:06:52,810 --> 00:06:59,650 They're all the same type. We can work over them in in a loop, in comp. 69 00:06:59,650 --> 00:07:05,350 Machine code. So in general, don't loop. You can loop over a number high end the array. 70 00:07:05,350 --> 00:07:09,370 It's iterable just like a list. But in general, you don't want to do that. 71 00:07:09,370 --> 00:07:14,380 You want to set up your code so that num pi can do the looping for you. 72 00:07:14,380 --> 00:07:24,970 And effectively what we wind up using Python as is a scripting language to tell the underlying C, C++ and Fortran code. 73 00:07:24,970 --> 00:07:26,500 What to do. 74 00:07:26,500 --> 00:07:38,240 And the fact that Python is a slow language doesn't matter very much because the vast majority of our processing time won't be spent in Python. 75 00:07:38,240 --> 00:07:43,190 So I thought none pile. So has a feature called Vector Ization. 76 00:07:43,190 --> 00:07:47,750 There are a lot of operations that operate on an entire array at a time. 77 00:07:47,750 --> 00:07:52,580 So if I get it, I can create another array. The Linn's base function here. 78 00:07:52,580 --> 00:08:01,180 It. The land space function here. 79 00:08:01,180 --> 00:08:06,370 It creates an array of four values that are evenly spaced from zero to one inclusive. 80 00:08:06,370 --> 00:08:11,170 And then the plus operator here, remember, plus between two numbers is going to add it between two strings. 81 00:08:11,170 --> 00:08:18,460 It's going to concatenate them plus between two arrays requires them to be of compatible shapes. 82 00:08:18,460 --> 00:08:24,970 And it adds the the corresponding elements of the arrays to each other and returns a new array. 83 00:08:24,970 --> 00:08:29,500 So what if we have a bunch of number one array of numbers and we have another array of numbers? 84 00:08:29,500 --> 00:08:39,460 We want to add them together. We just add the two arrays and it does that addition again in a loop written in C or Fortran. 85 00:08:39,460 --> 00:08:41,740 And it does it very, very quickly. 86 00:08:41,740 --> 00:08:49,540 You can also add an integer or an integer or a floating point, single number to an array, and it'll add it to every element of the array. 87 00:08:49,540 --> 00:08:55,120 But this is the key point to be able to make scientific computing with Python fast. 88 00:08:55,120 --> 00:09:04,090 We setup our code and throughout. We're gonna be trying to set it up so that we use vectorized nation as much as possible. 89 00:09:04,090 --> 00:09:11,270 And we vectorized over as much data at a time as possible so we can allow the optimized loops and in num pi, 90 00:09:11,270 --> 00:09:21,400 in Pandas and Sai Pi and psychic learn to do the work and to put as much of the work as possible into those compile loops. 91 00:09:21,400 --> 00:09:28,390 So we're not spending a lot of time in slow python code. Each array has three key things. 92 00:09:28,390 --> 00:09:33,740 It has a data type called a D type, and that says what kind of elements are in the array? 93 00:09:33,740 --> 00:09:39,710 PI has data types for your standard integers of various sizes. 94 00:09:39,710 --> 00:09:44,870 Single and double precision floating point numbers. It also has D types for working with. 95 00:09:44,870 --> 00:09:50,080 Date. Date. Times. Strings and then storing arrays. 96 00:09:50,080 --> 00:09:53,000 That's where pointers to arbitrary python objects. 97 00:09:53,000 --> 00:09:59,720 The data type or the array also has a shape which is a tuple of integers that says how big the array is. 98 00:09:59,720 --> 00:10:04,700 The array may be multidimensional. So Endi array stands for N Dimensional Array. 99 00:10:04,700 --> 00:10:15,580 And it can be one, two, three, four, whatever dimensional. So if we have a 100 by 50 matrix, it's stored in a in a number PI in the array of shape. 100 00:10:15,580 --> 00:10:19,730 One hundred, comma 50. And then there's the data. It's stealth. 101 00:10:19,730 --> 00:10:26,420 That's the elements of the array. The data points themselves that are stored in the array. 102 00:10:26,420 --> 00:10:33,170 So then pandas, which we're going to see next week, builds on top of a raise with two new data types, 103 00:10:33,170 --> 00:10:38,960 a series is an array with an associated index that allows us to look up. 104 00:10:38,960 --> 00:10:48,260 So an ENDI array, like a python list is indexed using numbers starting from zero zero one, two, three, four, five. 105 00:10:48,260 --> 00:10:55,850 But sometimes for a lot of times we're gonna have some other natural index. If you've taken databases, it's equivalent to the primary key. 106 00:10:55,850 --> 00:11:00,150 So a series is an array with an associated index that might be other numbers. 107 00:11:00,150 --> 00:11:04,580 That might be strings. But some other way of accessing the points. 108 00:11:04,580 --> 00:11:12,470 It also has an efficient representations that you can have a series that's indexed zero through and minus one where N is the length of the series. 109 00:11:12,470 --> 00:11:18,800 And it does not take up a lot of space to do that. And then a data frame is a table where each column is a series. 110 00:11:18,800 --> 00:11:25,760 And they all share the same index. And we're gonna see those a lot because we load in a set of data points that's gonna be in a data frame. 111 00:11:25,760 --> 00:11:30,110 Now, an assignment zero, you're going to briefly see both of these data structures. 112 00:11:30,110 --> 00:11:35,930 I walk you through everything you have to do with them in assignment zero. And we're going to introduce them a lot more. 113 00:11:35,930 --> 00:11:39,560 Woomera's talking about how to describe data next week. 114 00:11:39,560 --> 00:11:47,920 But. Endi Arae, the number higher radiata structure is the fundamental core that all of these others are built on. 115 00:11:47,920 --> 00:11:55,740 The series augments it with an index. The data frame collects multiple series together with column names like a spreadsheet table. 116 00:11:55,740 --> 00:12:00,950 So we're still going to sometimes use Python native lists and loops. 117 00:12:00,950 --> 00:12:05,360 Oftentimes, it's going to be because for some reason, we need a list of arrays or data frames. 118 00:12:05,360 --> 00:12:07,310 Also, if we need to loop, if we have, say, 119 00:12:07,310 --> 00:12:15,950 20 input files that we need to put together to to to be our data set or we got different groups of data, we're going to loop over those. 120 00:12:15,950 --> 00:12:20,390 But the big thing we avoid doing is looping over individual data points. 121 00:12:20,390 --> 00:12:24,450 We load in a few hundred thousand records. They're going to be in a data frame. 122 00:12:24,450 --> 00:12:33,470 We don't loop over the rows of a data frame. If we can avoid it, because there's almost always a more efficient way to do that computation, 123 00:12:33,470 --> 00:12:41,000 that pushes a lot of it into the C and C++ code and Fortran code that underlies NUM, Pi, pandas, et cetera. 124 00:12:41,000 --> 00:12:46,070 So wrap up num pi provides efficient to ray data structures that are more memory compact's. 125 00:12:46,070 --> 00:12:50,690 They don't take up nearly as much space and they're also much more efficient to compute over. 126 00:12:50,690 --> 00:12:54,320 These are going to be the backbone of our data processing throughout the rest of the class. 127 00:12:54,320 --> 00:13:04,250 And we want to prefer vectorized operations that perform these loops in native comp. machine code whenever possible for a little bit of practice. 128 00:13:04,250 --> 00:13:09,320 I encourage you to take the example code from this from these slides and go and try them 129 00:13:09,320 --> 00:13:13,880 in a notebook so you can get a little more practice creating notebooks and running code. 130 00:13:13,880 --> 00:13:20,967 I will see you in class.