9:59:59.000,9:59:59.000 1[br]00:00:04,870 --> 00:00:12,190[br]In this video, I'm going to introduce some of the fundamental structures and principles of doing scientific computing in Python. 9:59:59.000,9:59:59.000 2[br]00:00:12,190 --> 00:00:18,070[br]In the last couple of videos, I briefly introduced Python's core structures and core data types. 9:59:59.000,9:59:59.000 3[br]00:00:18,070 --> 00:00:23,390[br]But a lot of our work is going to use an additional set of structures, 9:59:59.000,9:59:59.000 4[br]00:00:23,390 --> 00:00:28,060[br]a set of libraries known as scientific Python or as the PyData stack. 9:59:59.000,9:59:59.000 5[br]00:00:28,060 --> 00:00:35,500[br]So the learning outcomes of this video are to understand the limitations of core Python data types for data science, to know 9:59:59.000,9:59:59.000 6[br]00:00:35,500 --> 00:00:41,440[br]three key array data types, and in particular we're focusing primarily on the NumPy ndarray. 9:59:59.000,9:59:59.000 7[br]00:00:41,440 --> 00:00:48,450[br]I'll also briefly introduce Series and DataFrame; we're going to see a lot more about those next week. And then, 9:59:59.000,9:59:59.000 8[br]00:00:48,450 --> 00:00:56,390[br]to be able to perform basic vectorized operations. So in Python, we can write a list of numbers like this. 9:59:59.000,9:59:59.000 9[br]00:00:56,390 --> 00:01:01,040[br]So numbers equals, and I'm using the list syntax that we talked about in the earlier video. 9:59:59.000,9:59:59.000 10[br]00:01:01,040 --> 00:01:07,630[br]And I've got four numbers in here that I'm storing in this list, in the variable numbers. 9:59:59.000,9:59:59.000 11[br]00:01:07,630 --> 00:01:11,260[br]Now, this seems like a perfectly natural thing to do. 9:59:59.000,9:59:59.000 12[br]00:01:11,260 --> 00:01:17,860[br]But remember, I said in the previous video that everything in Python is an object. 9:59:59.000,9:59:59.000 13[br]00:01:17,860 --> 00:01:24,220[br]So this isn't just a list of numbers. 
If we wrote this in Java or C, we would have an array of numbers, a system array. 9:59:59.000,9:59:59.000 14[br]00:01:24,220 --> 00:01:31,180[br]And it just stores the numbers, one after the other. But in Python, that's not how it works, because everything is an object. 9:59:59.000,9:59:59.000 15[br]00:01:31,180 --> 00:01:34,960[br]What our list stores is pointers to numbers. 9:59:59.000,9:59:59.000 16[br]00:01:34,960 --> 00:01:50,480[br]So we've got a list, and it's got a pointer to 0.3 and a pointer to 9.2, et cetera. 9:59:59.000,9:59:59.000 17[br]00:01:50,480 --> 00:01:58,510[br]So what we store is: the list itself has these pointers, which are eight bytes each. 9:59:59.000,9:59:59.000 18[br]00:01:58,510 --> 00:02:02,650[br]And it has the numbers themselves. 9:59:59.000,9:59:59.000 19[br]00:02:02,650 --> 00:02:08,730[br]A double-precision floating-point number takes eight bytes. But the numbers aren't just numbers, they're objects. 9:59:59.000,9:59:59.000 20[br]00:02:08,730 --> 00:02:12,870[br]And every Python object has at least 16 bytes, 9:59:59.000,9:59:59.000 21[br]00:02:12,870 --> 00:02:17,610[br]this is all on a 64-bit system, at least 16 bytes of header information. 9:59:59.000,9:59:59.000 22[br]00:02:17,610 --> 00:02:24,060[br]And so this whole list of numbers takes 144 bytes, because the list has a header, 9:59:59.000,9:59:59.000 23[br]00:02:24,060 --> 00:02:28,830[br]it has pointers, and the pointers point to objects that have headers in addition to the data. 9:59:59.000,9:59:59.000 24[br]00:02:28,830 --> 00:02:36,300[br]Also, the elements of a list can be different types. So when you go over the list, there's no guarantee that everything is a number. 9:59:59.000,9:59:59.000 25[br]00:02:36,300 --> 00:02:42,510[br]So if we want to sum our numbers, there is a Python function called sum that will do a sum. 
9:59:59.000,9:59:59.000 26[br]00:02:42,510 --> 00:02:47,550[br]But it's basically doing this. So we'll initialize a variable called total. 9:59:59.000,9:59:59.000 27[br]00:02:47,550 --> 00:02:51,400[br]We'll then loop over all of our numbers and we'll add each one to the total. 9:59:59.000,9:59:59.000 28[br]00:02:51,400 --> 00:02:57,360[br]And that's gonna make total equal the sum of the numbers. This works, it works just fine. 9:59:59.000,9:59:59.000 29[br]00:02:57,360 --> 00:03:02,310[br]And for a list of four numbers, it's completely fine. But 9:59:59.000,9:59:59.000 30[br]00:03:02,310 --> 00:03:07,770[br]there's a couple of issues here. One is that Python, the language itself, is rather slow. 9:59:59.000,9:59:59.000 31[br]00:03:07,770 --> 00:03:11,280[br]It's quite convenient, but it's slow, and it's slow for two reasons. 9:59:59.000,9:59:59.000 32[br]00:03:11,280 --> 00:03:17,790[br]One is that it is interpreted: the Python code is compiled to an internal data structure, 9:59:59.000,9:59:59.000 33[br]00:03:17,790 --> 00:03:24,090[br]but then there's C code that runs in a loop interpreting that data structure. 9:59:59.000,9:59:59.000 34[br]00:03:24,090 --> 00:03:32,310[br]It's also dynamically typed. So remember, I said the values in the list can have different types. 9:59:59.000,9:59:59.000 35[br]00:03:32,310 --> 00:03:38,070[br]We wrote a set of numbers there, but Python has no guarantee that they're all numbers. 9:59:59.000,9:59:59.000 36[br]00:03:38,070 --> 00:03:41,040[br]And so rather than saying, okay, I have a number, I'm going to keep adding it, 9:59:59.000,9:59:59.000 37[br]00:03:41,040 --> 00:03:45,810[br]what it says is, I have a thing and I'm going to try to add it to the thing I already have. 9:59:59.000,9:59:59.000 38[br]00:03:45,810 --> 00:03:50,130[br]And it has to go look up how to do that, and it does that every time for each number. 
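The loop described here can be sketched roughly as follows. Only 0.3 and 9.2 are named in the video; the other two values in `numbers` are placeholders chosen for illustration.

```python
# Hypothetical list: the video only mentions 0.3 and 9.2;
# 4.1 and 7.5 are made-up placeholders.
numbers = [0.3, 9.2, 4.1, 7.5]

# Roughly what the built-in sum() does: start from zero, add each element.
total = 0.0
for x in numbers:
    # Each `+` looks up how to add the runtime types of `total` and `x`.
    total = total + x

print(total)
```

Nothing in the loop knows in advance that every element is a float, which is the per-element lookup cost the video is pointing at.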
9:59:59.000,9:59:59.000 39[br]00:03:50,130 --> 00:03:58,900[br]This is all very slow. Also, since it's pointers, if you've taken the computer architecture class, 9:59:59.000,9:59:59.000 40[br]00:03:58,900 --> 00:04:06,610[br]that may ring a few alarm bells for you, because rather than just having an array of numbers, which will be loaded into our cache and very quickly accessed, 9:59:59.000,9:59:59.000 41[br]00:04:06,610 --> 00:04:12,280[br]we have an array of pointers, and for each pointer we have to go off and look up the number in memory. 9:59:59.000,9:59:59.000 42[br]00:04:12,280 --> 00:04:16,970[br]And those numbers might be stored next to each other, but they might be stored all over the heap. 9:59:59.000,9:59:59.000 43[br]00:04:16,970 --> 00:04:25,390[br]We're gonna have cache misses, which make this slow process even slower. So we can write code like this and it works fine, 9:59:59.000,9:59:59.000 44[br]00:04:25,390 --> 00:04:31,510[br]but it's not an efficient way to do computation. And as we get to larger and larger data sets: you've got a few hundred, 9:59:59.000,9:59:59.000 45[br]00:04:31,510 --> 00:04:37,780[br]you've got a few thousand numbers, you're gonna be fine. When you've got a million numbers, when you have a hundred million or a billion numbers, 9:59:59.000,9:59:59.000 46[br]00:04:37,780 --> 00:04:47,500[br]then things start to really get slow. So NumPy is a Python package that provides efficient data types for doing numeric computation. 9:59:59.000,9:59:59.000 47[br]00:04:47,500 --> 00:04:55,720[br]And NumPy underlies almost all of the rest of the scientific Python, data science, and machine learning software for Python. 9:59:59.000,9:59:59.000 48[br]00:04:55,720 --> 00:05:00,430[br]It has a data type called an ndarray. There's a variety of different ways you can create one, 9:59:59.000,9:59:59.000 49[br]00:05:00,430 --> 00:05:06,170[br]but here we're going to just create one using the array constructor and then we're going to pass it our list. 
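Creating the array from the list might look like the sketch below. The variable name `arr` and the last two list values are assumptions for illustration; `np.array` is the constructor the video refers to.

```python
import numpy as np

# Hypothetical values: only 0.3 and 9.2 appear in the video.
numbers = [0.3, 9.2, 4.1, 7.5]

# Build an ndarray by passing the list to the array constructor.
arr = np.array(numbers)

print(type(arr))
print(arr)
```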
9:59:59.000,9:59:59.000 50[br]00:05:06,170 --> 00:05:14,890[br]So we're creating the list in this case. We are going to see later many ways to load arrays without having to go through a list. 9:59:59.000,9:59:59.000 51[br]00:05:14,890 --> 00:05:19,210[br]I'm just doing this here so I can demonstrate how the array works. 9:59:59.000,9:59:59.000 52[br]00:05:19,210 --> 00:05:24,940[br]But all the elements are of the same type in an array and they're also stored directly in the array. 9:59:59.000,9:59:59.000 53[br]00:05:24,940 --> 00:05:28,480[br]So this ndarray just stores the floats, one right after the other, 9:59:59.000,9:59:59.000 54[br]00:05:28,480 --> 00:05:35,140[br]eight bytes each. And so we don't have the indirection through pointers. We don't have all of the overhead of storing all of these different objects. 9:59:59.000,9:59:59.000 55[br]00:05:35,140 --> 00:05:41,170[br]It's just storing the numbers, one right after the other. You can have an ndarray of objects, and that's going to store pointers. 9:59:59.000,9:59:59.000 56[br]00:05:41,170 --> 00:05:46,900[br]And that's useful in a few cases, especially for treating strings consistently with numbers. 9:59:59.000,9:59:59.000 57[br]00:05:46,900 --> 00:05:55,280[br]But it really shines when we're dealing with arrays of numbers for various scientific computing applications. 9:59:59.000,9:59:59.000 58[br]00:05:55,280 --> 00:06:01,880[br]So if we want to sum our numbers, we can use the NumPy sum function, and it's much shorter. 9:59:59.000,9:59:59.000 59[br]00:06:01,880 --> 00:06:08,340[br]Although Python has a sum function, as I mentioned, that we could have used; but also, it's implemented in a compiled language. 
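A small sketch of the NumPy sum and of the storage difference, assuming the same placeholder values as before. The list-side byte count uses `sys.getsizeof`, whose exact figures are CPython implementation details and need not match the estimate given in the video.

```python
import sys
import numpy as np

# Hypothetical values: only 0.3 and 9.2 appear in the video.
numbers = [0.3, 9.2, 4.1, 7.5]
arr = np.array(numbers)

# The summation loop runs inside NumPy's compiled code.
total = np.sum(arr)
print(total)

# Raw data buffer of the array: 4 float64 values at 8 bytes each.
print(arr.nbytes)

# The list: the container (header plus pointers) plus one float object per element.
list_bytes = sys.getsizeof(numbers) + sum(sys.getsizeof(x) for x in numbers)
print(list_bytes)
```

The array object itself also carries a fixed header, so the saving only becomes significant as the number of elements grows.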
9:59:59.000,9:59:59.000 60[br]00:06:08,340 --> 00:06:14,660[br]And when you have a NumPy array that's storing numbers, whether they're integers, 9:59:59.000,9:59:59.000 61[br]00:06:14,660 --> 00:06:20,840[br]whether they're floating-point numbers, it's stored internally in a format that's compatible with C or Fortran. 9:59:59.000,9:59:59.000 62[br]00:06:20,840 --> 00:06:25,570[br]And so a lot of NumPy functions, 9:59:59.000,9:59:59.000 63[br]00:06:25,570 --> 00:06:29,650[br]what they're doing is they're passing the array to C code or Fortran code or 9:59:59.000,9:59:59.000 64[br]00:06:29,650 --> 00:06:36,520[br]C++ code that has a compiled loop that works on that data type and is able to very, 9:59:59.000,9:59:59.000 65[br]00:06:36,520 --> 00:06:42,070[br]very efficiently sum up those numbers. We don't have the cache issues from the indirection. 9:59:59.000,9:59:59.000 66[br]00:06:42,070 --> 00:06:45,640[br]We don't have the overhead of Python's interpreted code. 9:59:59.000,9:59:59.000 67[br]00:06:45,640 --> 00:06:52,810[br]We don't have the overhead of having to deal with the possibility that the elements of the array might be of different types. 9:59:59.000,9:59:59.000 68[br]00:06:52,810 --> 00:06:59,650[br]They're all the same type. We can work over them in a loop, in compiled 9:59:59.000,9:59:59.000 69[br]00:06:59,650 --> 00:07:05,350[br]machine code. So in general, don't loop. You can loop over a NumPy ndarray. 9:59:59.000,9:59:59.000 70[br]00:07:05,350 --> 00:07:09,370[br]It's iterable just like a list. But in general, you don't want to do that. 9:59:59.000,9:59:59.000 71[br]00:07:09,370 --> 00:07:14,380[br]You want to set up your code so that NumPy can do the looping for you. 9:59:59.000,9:59:59.000 72[br]00:07:14,380 --> 00:07:24,970[br]And effectively what we wind up using Python as is a scripting language to tell the underlying C, C++ and Fortran code 9:59:59.000,9:59:59.000 73[br]00:07:24,970 --> 00:07:26,500[br]what to do. 
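To see the "don't loop" advice in practice, here is a rough timing sketch. The array size and variable names are arbitrary choices for illustration, and the measured times will vary by machine, so no particular speedup is claimed.

```python
import time
import numpy as np

# 100,000 float64 values: 0.0, 1.0, ..., 99999.0 (size chosen arbitrarily).
values = np.arange(100_000, dtype=np.float64)

# Looping in Python works, since an ndarray is iterable like a list,
# but every element passes through the interpreter.
start = time.perf_counter()
loop_total = 0.0
for v in values:
    loop_total += v
loop_seconds = time.perf_counter() - start

# Letting NumPy do the looping in compiled code.
start = time.perf_counter()
vector_total = values.sum()
vector_seconds = time.perf_counter() - start

print(loop_total, vector_total)
print(f"python loop: {loop_seconds:.4f}s, numpy sum: {vector_seconds:.6f}s")
```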
9:59:59.000,9:59:59.000 74[br]00:07:26,500 --> 00:07:38,240[br]And the fact that Python is a slow language doesn't matter very much because the vast majority of our processing time won't be spent in Python. 9:59:59.000,9:59:59.000 75[br]00:07:38,240 --> 00:07:43,190[br]So NumPy also has a feature called vectorization. 9:59:59.000,9:59:59.000 76[br]00:07:43,190 --> 00:07:47,750[br]There are a lot of operations that operate on an entire array at a time. 9:59:59.000,9:59:59.000 77[br]00:07:47,750 --> 00:07:52,580[br]So I can create another array with the linspace function here. 9:59:59.000,9:59:59.000 78[br]00:07:52,580 --> 00:08:01,180[br]The linspace function here, 9:59:59.000,9:59:59.000 79[br]00:08:01,180 --> 00:08:06,370[br]it creates an array of four values that are evenly spaced from zero to one inclusive. 9:59:59.000,9:59:59.000 80[br]00:08:06,370 --> 00:08:11,170[br]And then the plus operator here: remember, plus between two numbers is going to add them; between two strings, 9:59:59.000,9:59:59.000 81[br]00:08:11,170 --> 00:08:18,460[br]it's going to concatenate them. Plus between two arrays requires them to be of compatible shapes, 9:59:59.000,9:59:59.000 82[br]00:08:18,460 --> 00:08:24,970[br]and it adds the corresponding elements of the arrays to each other and returns a new array. 9:59:59.000,9:59:59.000 83[br]00:08:24,970 --> 00:08:29,500[br]So if we have one array of numbers and we have another array of numbers, 9:59:59.000,9:59:59.000 84[br]00:08:29,500 --> 00:08:39,460[br]and we want to add them together, we just add the two arrays, and it does that addition, again, in a loop written in C or Fortran. 9:59:59.000,9:59:59.000 85[br]00:08:39,460 --> 00:08:41,740[br]And it does it very, very quickly. 9:59:59.000,9:59:59.000 86[br]00:08:41,740 --> 00:08:49,540[br]You can also add an integer or a floating-point number, a single number, to an array, and it'll add it to every element of the array. 
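The vectorized operations just described could be tried like this. The contents of `arr` are placeholder values (only 0.3 and 9.2 come from the video), and the names `steps`, `summed`, and `shifted` are made up for the sketch.

```python
import numpy as np

# Hypothetical values: only 0.3 and 9.2 appear in the video.
arr = np.array([0.3, 9.2, 4.1, 7.5])

# Four evenly spaced values from 0 to 1, both endpoints included.
steps = np.linspace(0, 1, 4)

# Element-wise addition of two arrays with the same shape; returns a new array.
summed = arr + steps

# Adding a single number applies it to every element.
shifted = arr + 10

print(steps)
print(summed)
print(shifted)
```

Neither addition modifies `arr`; each one produces a new array of the same shape.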
9:59:59.000,9:59:59.000 87[br]00:08:49,540 --> 00:08:55,120[br]But this is the key point for making scientific computing with Python fast. 9:59:59.000,9:59:59.000 88[br]00:08:55,120 --> 00:09:04,090[br]We set up our code, and throughout we're gonna be trying to set it up, so that we use vectorization as much as possible. 9:59:59.000,9:59:59.000 89[br]00:09:04,090 --> 00:09:11,270[br]And we vectorize over as much data at a time as possible, so we can allow the optimized loops in NumPy, 9:59:59.000,9:59:59.000 90[br]00:09:11,270 --> 00:09:21,400[br]in pandas and SciPy and scikit-learn to do the work, and to put as much of the work as possible into those compiled loops, 9:59:59.000,9:59:59.000 91[br]00:09:21,400 --> 00:09:28,390[br]so we're not spending a lot of time in slow Python code. Each array has three key things. 9:59:59.000,9:59:59.000 92[br]00:09:28,390 --> 00:09:33,740[br]It has a data type, called a dtype, and that says what kind of elements are in the array. 9:59:59.000,9:59:59.000 93[br]00:09:33,740 --> 00:09:39,710[br]NumPy has data types for your standard integers of various sizes, 9:59:59.000,9:59:59.000 94[br]00:09:39,710 --> 00:09:44,870[br]single- and double-precision floating-point numbers. It also has dtypes for working with 9:59:59.000,9:59:59.000 95[br]00:09:44,870 --> 00:09:50,080[br]dates and times, strings, and then object arrays, 9:59:59.000,9:59:59.000 96[br]00:09:50,080 --> 00:09:53,000[br]which store pointers to arbitrary Python objects. 9:59:59.000,9:59:59.000 97[br]00:09:53,000 --> 00:09:59,720[br]The array also has a shape, which is a tuple of integers that says how big the array is. 9:59:59.000,9:59:59.000 98[br]00:09:59,720 --> 00:10:04,700[br]The array may be multidimensional. So ndarray stands for N-dimensional array. 9:59:59.000,9:59:59.000 99[br]00:10:04,700 --> 00:10:15,580[br]And it can be one-, two-, three-, four-, whatever-dimensional. 
So if we have a 100 by 50 matrix, it's stored in a NumPy ndarray of shape 9:59:59.000,9:59:59.000 100[br]00:10:15,580 --> 00:10:19,730[br](100, 50). And then there's the data itself. 9:59:59.000,9:59:59.000 101[br]00:10:19,730 --> 00:10:26,420[br]That's the elements of the array, the data points themselves that are stored in the array. 9:59:59.000,9:59:59.000 102[br]00:10:26,420 --> 00:10:33,170[br]So then pandas, which we're going to see next week, builds on top of arrays with two new data types. 9:59:59.000,9:59:59.000 103[br]00:10:33,170 --> 00:10:38,960[br]A Series is an array with an associated index that allows us to look things up. 9:59:59.000,9:59:59.000 104[br]00:10:38,960 --> 00:10:48,260[br]So an ndarray, like a Python list, is indexed using numbers starting from zero: zero, one, two, three, four, five. 9:59:59.000,9:59:59.000 105[br]00:10:48,260 --> 00:10:55,850[br]But a lot of times we're gonna have some other natural index. If you've taken databases, it's equivalent to the primary key. 9:59:59.000,9:59:59.000 106[br]00:10:55,850 --> 00:11:00,150[br]So a Series is an array with an associated index that might be other numbers, 9:59:59.000,9:59:59.000 107[br]00:11:00,150 --> 00:11:04,580[br]that might be strings, but some other way of accessing the points. 9:59:59.000,9:59:59.000 108[br]00:11:04,580 --> 00:11:12,470[br]It also has an efficient representation, so that you can have a Series that's indexed zero through N minus one, where N is the length of the Series, 9:59:59.000,9:59:59.000 109[br]00:11:12,470 --> 00:11:18,800[br]and it does not take up a lot of space to do that. And then a DataFrame is a table where each column is a Series, 9:59:59.000,9:59:59.000 110[br]00:11:18,800 --> 00:11:25,760[br]and they all share the same index. And we're gonna see those a lot, because when we load in a set of data points, that's gonna be in a DataFrame.
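A brief sketch of the array attributes and the two pandas types described above. All names, labels, and values here (`matrix`, `heights`, `"ana"`, and so on) are invented for illustration; the zero-filled matrix simply stands in for real data.

```python
import numpy as np
import pandas as pd

# A 100-by-50 matrix; zeros stand in for real data.
matrix = np.zeros((100, 50))
print(matrix.dtype, matrix.shape)

# A Series: an array plus an index, here with made-up string labels.
heights = pd.Series([1.62, 1.80, 1.75], index=["ana", "ben", "cho"])
print(heights["ben"])

# With no index given, pandas uses a compact 0..N-1 range index.
plain = pd.Series([1.62, 1.80, 1.75])
print(type(plain.index).__name__)

# A DataFrame: a table whose columns are Series sharing one index.
ages = pd.Series([31, 45, 27], index=["ana", "ben", "cho"])
people = pd.DataFrame({"height": heights, "age": ages})
print(people)
print(people.loc["cho", "age"])
```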
9:59:59.000,9:59:59.000 111[br]00:11:25,760 --> 00:11:30,110[br]Now, in Assignment 0, you're going to briefly see both of these data structures. 9:59:59.000,9:59:59.000 112[br]00:11:30,110 --> 00:11:35,930[br]I walk you through everything you have to do with them in Assignment 0. And we're going to introduce them a lot more 9:59:59.000,9:59:59.000 113[br]00:11:35,930 --> 00:11:39,560[br]when we're talking about how to describe data next week. 9:59:59.000,9:59:59.000 114[br]00:11:39,560 --> 00:11:47,920[br]But the ndarray, the NumPy array data structure, is the fundamental core that all of these others are built on. 9:59:59.000,9:59:59.000 115[br]00:11:47,920 --> 00:11:55,740[br]The Series augments it with an index. The DataFrame collects multiple Series together with column names, like a spreadsheet table. 9:59:59.000,9:59:59.000 116[br]00:11:55,740 --> 00:12:00,950[br]So we're still going to sometimes use Python native lists and loops. 9:59:59.000,9:59:59.000 117[br]00:12:00,950 --> 00:12:05,360[br]Oftentimes, it's going to be because for some reason, we need a list of arrays or DataFrames. 9:59:59.000,9:59:59.000 118[br]00:12:05,360 --> 00:12:07,310[br]Also, if we need to loop: if we have, say, 9:59:59.000,9:59:59.000 119[br]00:12:07,310 --> 00:12:15,950[br]20 input files that we need to put together to be our data set, or we've got different groups of data, we're going to loop over those. 9:59:59.000,9:59:59.000 120[br]00:12:15,950 --> 00:12:20,390[br]But the big thing we avoid doing is looping over individual data points. 9:59:59.000,9:59:59.000 121[br]00:12:20,390 --> 00:12:24,450[br]We load in a few hundred thousand records; they're going to be in a DataFrame. 9:59:59.000,9:59:59.000 122[br]00:12:24,450 --> 00:12:33,470[br]We don't loop over the rows of a DataFrame. 
If we can avoid it, because there's almost always a more efficient way to do that computation, 9:59:59.000,9:59:59.000 123[br]00:12:33,470 --> 00:12:41,000[br]that pushes a lot of it into the C and C++ and Fortran code that underlies NumPy, pandas, et cetera. 9:59:59.000,9:59:59.000 124[br]00:12:41,000 --> 00:12:46,070[br]So to wrap up: NumPy provides efficient array data structures that are more memory-compact. 9:59:59.000,9:59:59.000 125[br]00:12:46,070 --> 00:12:50,690[br]They don't take up nearly as much space and they're also much more efficient to compute over. 9:59:59.000,9:59:59.000 126[br]00:12:50,690 --> 00:12:54,320[br]These are going to be the backbone of our data processing throughout the rest of the class. 9:59:59.000,9:59:59.000 127[br]00:12:54,320 --> 00:13:04,250[br]And we want to prefer vectorized operations that perform these loops in native compiled machine code whenever possible. For a little bit of practice, 9:59:59.000,9:59:59.000 128[br]00:13:04,250 --> 00:13:09,320[br]I encourage you to take the example code from these slides and go and try it 9:59:59.000,9:59:59.000 129[br]00:13:09,320 --> 00:13:13,880[br]in a notebook, so you can get a little more practice creating notebooks and running code. 9:59:59.000,9:59:59.000 130[br]00:13:13,880 --> 00:13:20,967[br]I will see you in class. 9:59:59.000,9:59:59.000