https:/.../bb427c78-988a-4d4a-b671-ad75015afbf6-951fa520-04d5-48a8-a858-ad8c0122cdde.mp4?invocationId=cef9e109-7003-ec11-a9e9-0a1a827ad0ec

1
00:00:04,870 --> 00:00:12,190
In this video, I'm going to introduce some of the fundamental structures and principles of doing scientific computing in Python.

2
00:00:12,190 --> 00:00:18,070
In the last couple of videos, I briefly introduced Python's core structures and core data types.

3
00:00:18,070 --> 00:00:23,390
But a lot of our work is going to use an additional set of structures,

4
00:00:23,390 --> 00:00:28,060
a set of libraries known as Scientific Python or as the PyData stack.

5
00:00:28,060 --> 00:00:35,500
So the learning outcomes of this video are to understand the limitations of core Python data types for data science, to know

6
00:00:35,500 --> 00:00:41,440
three key array data types. We're focusing primarily on the NumPy ndarray.

7
00:00:41,440 --> 00:00:48,450
I'll also briefly introduce Series and DataFrame. We're going to see a lot more about those next week.

8
00:00:48,450 --> 00:00:56,390
To be able to perform basic vectorized operations. So in Python, we can write a list of numbers like this.

9
00:00:56,390 --> 00:01:01,040
So, numbers equals: I'm using the list syntax that we talked about in the earlier video.

10
00:01:01,040 --> 00:01:07,630
And I've got four numbers in here that I'm storing in this list, in the variable numbers.

11
00:01:07,630 --> 00:01:11,260
Now, this seems like a perfectly natural thing to do.

12
00:01:11,260 --> 00:01:17,860
But remember, I said in the previous video that everything in Python is an object.

13
00:01:17,860 --> 00:01:24,220
So this isn't just a list of numbers. If we wrote this in Java or C, we would have an array of numbers.

14
00:01:24,220 --> 00:01:31,180
And it stores the numbers one after the other. But in Python, that's not how it works, because everything is an object.

15
00:01:31,180 --> 00:01:34,960
What our list stores is pointers to numbers.

16
00:01:34,960 --> 00:01:50,480
So we've got a list. And it's got a pointer to 0.3 and a pointer to 9.2, et cetera.

17
00:01:50,480 --> 00:01:58,510
So what we store is the list itself, which has these pointers, which are eight bytes each.

18
00:01:58,510 --> 00:02:02,650
And it has the numbers themselves.

19
00:02:02,650 --> 00:02:08,730
A double-precision floating-point number takes eight bytes. But the numbers aren't just numbers, they're objects.

20
00:02:08,730 --> 00:02:12,870
And every Python object has at least 16 bytes.

21
00:02:12,870 --> 00:02:17,610
This is all on a 64-bit system: at least 16 bytes of header information.

22
00:02:17,610 --> 00:02:24,060
And so this whole list of numbers takes 144 bytes, because the list has a header.

23
00:02:24,060 --> 00:02:28,830
It has pointers. The pointers point to the objects, which have headers in addition to the data.
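
To make the arithmetic behind that 144-byte figure concrete, here is a small sketch that just adds up the sizes quoted above. The constant names are made up for illustration, and the sizes are the lecture's 64-bit estimates rather than measurements, so a real interpreter may report different numbers.

```python
# Back-of-the-envelope estimate for a 4-element list of floats,
# using the per-item sizes quoted in the lecture (64-bit system).
HEADER_BYTES = 16    # minimum object header, per the lecture
POINTER_BYTES = 8    # one pointer per list slot
FLOAT_BYTES = 8      # double-precision payload

n = 4
list_bytes = HEADER_BYTES + n * POINTER_BYTES            # list header + pointers
float_objects_bytes = n * (HEADER_BYTES + FLOAT_BYTES)   # each float is its own object
total_bytes = list_bytes + float_objects_bytes

packed_bytes = n * FLOAT_BYTES   # the same numbers stored back to back

print(total_bytes, packed_bytes)  # 144 32
```

Under these assumptions, the pointers and headers account for most of the space; the raw numbers themselves need only 32 bytes.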

24
00:02:28,830 --> 00:02:36,300
Also, the elements of a list can be different types. So when you go over the list, there's no guarantee that everything is a number.

25
00:02:36,300 --> 00:02:42,510
So if we want to sum our numbers, there is a Python function called sum that will do a sum.

26
00:02:42,510 --> 00:02:47,550
But it's basically doing this. So we'll initialize a variable called total.

27
00:02:47,550 --> 00:02:51,400
We'll then loop over all of our numbers and we'll add each one to the total.

28
00:02:51,400 --> 00:02:57,360
And that's gonna make the total equal the total of the numbers. This works, it works just fine.
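
The loop just described can be sketched as follows. Only 0.3 and 9.2 are named in the video; the other two values are placeholders picked for this example.

```python
import math

# Placeholder values: the video only names 0.3 and 9.2.
numbers = [0.3, 9.2, 4.8, 1.5]

total = 0.0
for x in numbers:   # visit each element in turn
    total += x      # add it to the running total

# The built-in sum gives the same answer, up to floating-point rounding.
print(total, math.isclose(total, sum(numbers)))
```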

29
00:02:57,360 --> 00:03:02,310
And for a list of four numbers, it's completely fine.

30
00:03:02,310 --> 00:03:07,770
There's a couple of issues here. One is that Python, the language itself, is rather slow.

31
00:03:07,770 --> 00:03:11,280
It's quite convenient, but it's slow and it's slow for two reasons.

32
00:03:11,280 --> 00:03:17,790
One is that it is interpreted: the Python code is compiled to an internal data structure,

33
00:03:17,790 --> 00:03:24,090
but then there's C code that runs in a loop interpreting that data structure.

34
00:03:24,090 --> 00:03:32,310
It's also dynamically typed. So remember, I said the values in the list can have different types.

35
00:03:32,310 --> 00:03:38,070
We wrote a set of numbers there. But Python has no guarantee that they're all numbers.

36
00:03:38,070 --> 00:03:41,040
And so rather than saying, okay, I have a number, I'm going to keep adding it.

37
00:03:41,040 --> 00:03:45,810
What it says is I have a thing and I'm going to try to add it to the thing I already have.

38
00:03:45,810 --> 00:03:50,130
And it has to go look up how to do that, and it does that every time for each number.
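
As a small illustration of why the interpreter cannot assume every element is a number, the sketch below puts a string into an otherwise numeric list. The list contents are made up for this example.

```python
# A list may mix types, so Python has to work out how to add each element.
mixed = [0.3, 9.2, "four"]   # made-up data with one non-number

total = 0.0
error = None
try:
    for x in mixed:
        total += x           # float + str has no defined addition
except TypeError as exc:
    error = exc              # raised only when the loop reaches the string

print(total, type(error).__name__)
```

The first two additions succeed; the failure only shows up at run time, when the string is reached, which is why the check has to happen on every iteration.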

39
00:03:50,130 --> 00:03:58,900
This is all very slow. Also, since it's pointers, if you've taken the computer architecture class,

40
00:03:58,900 --> 00:04:06,610
that may ring a few alarm bells for you, because rather than just having an array of numbers, which will be loaded into our cache and very quickly accessed,

41
00:04:06,610 --> 00:04:12,280
we have an array of pointers and each pointer has to go off and look up the number in memory.

42
00:04:12,280 --> 00:04:16,970
And those numbers might be stored next to each other, but they might be stored all over the heap.

43
00:04:16,970 --> 00:04:25,390
We're gonna have cache misses, which make this slow process even slower. So we can write code like this and it works fine,

44
00:04:25,390 --> 00:04:31,510
but it's not an efficient way to do computation, especially as we get to larger and larger data sets. If you've got a few hundred

45
00:04:31,510 --> 00:04:37,780
or a few thousand numbers, you're gonna be fine. When you've got a million numbers, when you have a hundred million or a billion numbers,

46
00:04:37,780 --> 00:04:47,500
then things start to really get slow. So NumPy is a Python package that provides efficient data types for doing numeric computation.

47
00:04:47,500 --> 00:04:55,720
And NumPy underlies almost all of the rest of the scientific Python, data science, and machine learning software for Python.

48
00:04:55,720 --> 00:05:00,430
It has a data type called an ndarray. There's a variety of different ways you can create one,

49
00:05:00,430 --> 00:05:06,170
but here we're going to just create one using the array constructor and then we're going to pass it our list.

50
00:05:06,170 --> 00:05:14,890
So we're creating the list in this case. We are going to see later many ways to load arrays without having to go through a list.

51
00:05:14,890 --> 00:05:19,210
I'm just doing this here so I can demonstrate how the array works.

52
00:05:19,210 --> 00:05:24,940
But all the elements are of the same type in an array and they're also stored directly in the array.

53
00:05:24,940 --> 00:05:28,480
So this ndarray stores the floats one right after the other.

54
00:05:28,480 --> 00:05:35,140
Eight bytes each. And so we don't have the indirection through pointers. We don't have all of the overhead of storing all of these different objects.

55
00:05:35,140 --> 00:05:41,170
It's just storing the numbers, one right after the other. You can have an ndarray of objects, and that's going to store the pointers.

56
00:05:41,170 --> 00:05:46,900
And that's useful in a few cases, especially for treating strings consistently with numbers.

57
00:05:46,900 --> 00:05:55,280
But it really shines when we're dealing with arrays of numbers for various scientific computing applications.
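
Here is a minimal sketch of creating an array from a list and checking how it is stored. The list values are placeholders (only 0.3 and 9.2 come from the video), and it assumes NumPy is installed and imported under its usual alias `np`.

```python
import numpy as np

numbers = [0.3, 9.2, 4.8, 1.5]   # placeholder values
arr = np.array(numbers)          # copies the list into one contiguous buffer

print(arr.dtype)      # element type shared by every entry
print(arr.itemsize)   # bytes per element
print(arr.nbytes)     # bytes used by the data buffer

# An object array stores pointers to Python objects instead of raw numbers.
objs = np.array(["a", 1.5], dtype=object)
print(objs.dtype)
```

For four double-precision floats, the data buffer is 4 × 8 = 32 bytes, with no per-element headers.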

58
00:05:55,280 --> 00:06:01,880
So if we want to sum our numbers, we can use the NumPy sum function, and it's much shorter.

59
00:06:01,880 --> 00:06:08,340
Although Python has a sum function, as I mentioned, that we could have used. But also, NumPy's sum is implemented in a compiled language.

60
00:06:08,340 --> 00:06:14,660
And when you have a NumPy array that's storing numbers, whether they're integers,

61
00:06:14,660 --> 00:06:20,840
whether they're floating point numbers, it's stored internally in a format that's compatible with C or Fortran.

62
00:06:20,840 --> 00:06:25,570
And so a lot of NumPy functions,

63
00:06:25,570 --> 00:06:29,650
what they're doing is they're passing the array to C code or Fortran code or

64
00:06:29,650 --> 00:06:36,520
C++ code that has a compiled loop that works on that data type and is able to very,

65
00:06:36,520 --> 00:06:42,070
very efficiently sum up those numbers. We don't have the cache issues from the indirection.

66
00:06:42,070 --> 00:06:45,640
We don't have the overhead of Python's interpreted code.

67
00:06:45,640 --> 00:06:52,810
We don't have the overhead of having to deal with the possibility that the elements of the array might be of different types.

68
00:06:52,810 --> 00:06:59,650
They're all the same type. We can work over them in a loop, in compiled

69
00:06:59,650 --> 00:07:05,350
machine code. So in general, don't loop. You can loop over a NumPy ndarray.

70
00:07:05,350 --> 00:07:09,370
It's iterable just like a list. But in general, you don't want to do that.

71
00:07:09,370 --> 00:07:14,380
You want to set up your code so that NumPy can do the looping for you.

72
00:07:14,380 --> 00:07:24,970
And effectively what we wind up using Python as is a scripting language to tell the underlying C, C++, and Fortran code

73
00:07:24,970 --> 00:07:26,500
what to do.

74
00:07:26,500 --> 00:07:38,240
And the fact that Python is a slow language doesn't matter very much because the vast majority of our processing time won't be spent in Python.
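
To see the two styles side by side, here is a small sketch that sums the same array with an explicit Python loop and with NumPy's `sum`. The array contents (a simple range of whole numbers) are chosen for this example so that both sums are exact; no timing is claimed here.

```python
import numpy as np

arr = np.arange(100_000, dtype=np.float64)   # 0.0, 1.0, ..., 99999.0

# Explicit Python loop: works, but every iteration runs in the interpreter.
loop_total = 0.0
for x in arr:
    loop_total += x

# Vectorized: the loop over elements happens inside NumPy's compiled code.
np_total = np.sum(arr)

print(loop_total == np_total)
```

If you want to compare speed, wrapping each version in a timer in a notebook is a reasonable next step.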

75
00:07:38,240 --> 00:07:43,190
So NumPy also has a feature called vectorization.

76
00:07:43,190 --> 00:07:47,750
There are a lot of operations that operate on an entire array at a time.

77
00:07:47,750 --> 00:07:52,580
So I can create another array with the linspace function here.

78
00:07:52,580 --> 00:08:01,180
The linspace function here:

79
00:08:01,180 --> 00:08:06,370
It creates an array of four values that are evenly spaced from zero to one inclusive.

80
00:08:06,370 --> 00:08:11,170
And then the plus operator here: remember, plus between two numbers is going to add them; between two strings,

81
00:08:11,170 --> 00:08:18,460
it's going to concatenate them. Plus between two arrays requires them to be of compatible shapes.

82
00:08:18,460 --> 00:08:24,970
And it adds the corresponding elements of the arrays to each other and returns a new array.

83
00:08:24,970 --> 00:08:29,500
So if we have one array of numbers and we have another array of numbers,

84
00:08:29,500 --> 00:08:39,460
and we want to add them together, we just add the two arrays, and it does that addition, again, in a loop written in C or Fortran.

85
00:08:39,460 --> 00:08:41,740
And it does it very, very quickly.

86
00:08:41,740 --> 00:08:49,540
You can also add a single number, an integer or a floating-point number, to an array, and it'll add it to every element of the array.
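
The array-plus-array and array-plus-number cases described above can be sketched like this. The contents of `numbers` are placeholders, and the variable names are just for this example.

```python
import numpy as np

numbers = np.array([0.3, 9.2, 4.8, 1.5])   # placeholder values
offsets = np.linspace(0, 1, 4)             # 4 evenly spaced values from 0 to 1, inclusive

summed = numbers + offsets   # elementwise: summed[i] is numbers[i] + offsets[i]
shifted = numbers + 10       # the single number is added to every element

print(offsets)
print(summed)
print(shifted)
```

Both operations return new arrays and leave `numbers` unchanged. NumPy calls the rule that decides which shapes are compatible "broadcasting".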

87
00:08:49,540 --> 00:08:55,120
But this is the key point to be able to make scientific computing with Python fast.

88
00:08:55,120 --> 00:09:04,090
We set up our code, and throughout we're gonna be trying to set it up, so that we use vectorization as much as possible.

89
00:09:04,090 --> 00:09:11,270
And we vectorize over as much data at a time as possible, so we can allow the optimized loops in NumPy,

90
00:09:11,270 --> 00:09:21,400
in pandas and SciPy and scikit-learn to do the work, and to put as much of the work as possible into those compiled loops.

91
00:09:21,400 --> 00:09:28,390
So we're not spending a lot of time in slow Python code. Each array has three key things.

92
00:09:28,390 --> 00:09:33,740
It has a data type, called a dtype, and that says what kind of elements are in the array.

93
00:09:33,740 --> 00:09:39,710
NumPy has data types for your standard integers of various sizes,

94
00:09:39,710 --> 00:09:44,870
single- and double-precision floating-point numbers. It also has dtypes for working with

95
00:09:44,870 --> 00:09:50,080
dates, date-times, strings, and then for storing arrays

96
00:09:50,080 --> 00:09:53,000
that store pointers to arbitrary Python objects.

97
00:09:53,000 --> 00:09:59,720
The array also has a shape, which is a tuple of integers that says how big the array is.

98
00:09:59,720 --> 00:10:04,700
The array may be multidimensional. So ndarray stands for n-dimensional array.

99
00:10:04,700 --> 00:10:15,580
And it can be one-, two-, three-, four-, whatever-dimensional. So if we have a 100 by 50 matrix, it's stored in a NumPy ndarray of shape

100
00:10:15,580 --> 00:10:19,730
(100, 50). And then there's the data itself.

101
00:10:19,730 --> 00:10:26,420
That's the elements of the array. The data points themselves that are stored in the array.
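
A short sketch of inspecting the dtype and shape of a few arrays. The 100 by 50 matrix matches the example above; the other arrays, including the dates, are made up for illustration.

```python
import numpy as np

matrix = np.zeros((100, 50))                       # 100 rows, 50 columns of float64 zeros
small_ints = np.array([1, 2, 3], dtype=np.int32)   # explicitly request 32-bit integers
days = np.array(["2021-08-23", "2021-08-24"], dtype="datetime64[D]")  # made-up dates

print(matrix.shape, matrix.ndim, matrix.dtype)
print(small_ints.dtype, small_ints.itemsize)
print(days.dtype, days[1] - days[0])
```

The shape is a plain tuple, so `matrix.shape[0]` gives the number of rows and `matrix.shape[1]` the number of columns.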

102
00:10:26,420 --> 00:10:33,170
So then pandas, which we're going to see next week, builds on top of arrays with two new data types,

103
00:10:33,170 --> 00:10:38,960
a Series is an array with an associated index that allows us to look up values.

104
00:10:38,960 --> 00:10:48,260
So an ndarray, like a Python list, is indexed using numbers starting from zero: zero, one, two, three, four, five.

105
00:10:48,260 --> 00:10:55,850
But a lot of times we're gonna have some other natural index. If you've taken databases, it's equivalent to the primary key.

106
00:10:55,850 --> 00:11:00,150
So a Series is an array with an associated index that might be other numbers.

107
00:11:00,150 --> 00:11:04,580
It might be strings. But it's some other way of accessing the points.

108
00:11:04,580 --> 00:11:12,470
It also has an efficient representation, so you can have a Series that's indexed zero through n minus one, where n is the length of the Series.

109
00:11:12,470 --> 00:11:18,800
And it does not take up a lot of space to do that. And then a DataFrame is a table where each column is a Series.

110
00:11:18,800 --> 00:11:25,760
And they all share the same index. And we're gonna see those a lot, because when we load in a set of data points, that's gonna be in a DataFrame.
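
As a rough preview of those two pandas types, here is a minimal sketch. All labels and values are made up, and it assumes pandas is installed and imported under its usual alias `pd`.

```python
import pandas as pd

# A Series with string labels as its index (made-up data).
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])                       # look up by label instead of by position

# With no index given, pandas uses a compact 0..n-1 range index.
plain = pd.Series([10, 20, 30])
print(type(plain.index).__name__)

# A DataFrame: a table whose columns are Series sharing one index.
df = pd.DataFrame({"x": [10, 20, 30], "y": [1.5, 2.5, 3.5]}, index=["a", "b", "c"])
print(type(df["x"]).__name__)
print(df["x"].index.equals(df.index))
```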

111
00:11:25,760 --> 00:11:30,110
Now, in assignment zero, you're going to briefly see both of these data structures.

112
00:11:30,110 --> 00:11:35,930
I walk you through everything you have to do with them in assignment zero. And we're going to introduce them a lot more.

113
00:11:35,930 --> 00:11:39,560
When we're talking about how to describe data next week.

114
00:11:39,560 --> 00:11:47,920
But the ndarray, the NumPy array data structure, is the fundamental core that all of these others are built on.

115
00:11:47,920 --> 00:11:55,740
The Series augments it with an index. The DataFrame collects multiple Series together with column names, like a spreadsheet table.

116
00:11:55,740 --> 00:12:00,950
So we're still going to sometimes use Python native lists and loops.

117
00:12:00,950 --> 00:12:05,360
Oftentimes, it's going to be because for some reason, we need a list of arrays or DataFrames.

118
00:12:05,360 --> 00:12:07,310
Also, if we need to loop, if we have, say,

119
00:12:07,310 --> 00:12:15,950
20 input files that we need to put together to be our data set, or we've got different groups of data, we're going to loop over those.

120
00:12:15,950 --> 00:12:20,390
But the big thing we avoid doing is looping over individual data points.

121
00:12:20,390 --> 00:12:24,450
We load in a few hundred thousand records. They're going to be in a DataFrame.

122
00:12:24,450 --> 00:12:33,470
We don't loop over the rows of a DataFrame if we can avoid it, because there's almost always a more efficient way to do that computation,

123
00:12:33,470 --> 00:12:41,000
that pushes a lot of it into the C and C++ code and Fortran code that underlies NumPy, pandas, et cetera.
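
To illustrate the difference between looping over rows and working on whole columns, here is a small sketch with a made-up table. The column names `price` and `qty` are hypothetical.

```python
import pandas as pd

# Made-up records: one row per data point.
df = pd.DataFrame({"price": [2.0, 3.5, 1.25], "qty": [3, 2, 8]})

# Row-by-row loop: works, but runs in the interpreter for every record.
loop_totals = [row.price * row.qty for row in df.itertuples()]

# Vectorized: one expression over whole columns.
df["total"] = df["price"] * df["qty"]

print(df["total"].tolist() == loop_totals)
```

Both versions compute the same values; with a few hundred thousand records, the column expression is the form to prefer.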

124
00:12:41,000 --> 00:12:46,070
So to wrap up, NumPy provides efficient array data structures that are more memory-compact.

125
00:12:46,070 --> 00:12:50,690
They don't take up nearly as much space and they're also much more efficient to compute over.

126
00:12:50,690 --> 00:12:54,320
These are going to be the backbone of our data processing throughout the rest of the class.

127
00:12:54,320 --> 00:13:04,250
And we want to prefer vectorized operations that perform these loops in native compiled machine code whenever possible. For a little bit of practice,

128
00:13:04,250 --> 00:13:09,320
I encourage you to take the example code from these slides and go and try it

129
00:13:09,320 --> 00:13:13,880
in a notebook so you can get a little more practice creating notebooks and running code.

130
00:13:13,880 --> 00:13:20,967
I will see you in class.

Title:: https:/.../bb427c78-988a-4d4a-b671-ad75015afbf6-951fa520-04d5-48a8-a858-ad8c0122cdde.mp4?invocationId=cef9e109-7003-ec11-a9e9-0a1a827ad0ec
Video Language:: English
Duration:: 13:21

janetlayne edited English subtitles for https:/.../bb427c78-988a-4d4a-b671-ad75015afbf6-951fa520-04d5-48a8-a858-ad8c0122cdde.mp4?invocationId=cef9e109-7003-ec11-a9e9-0a1a827ad0ec
