-
Not Synced
1
00:00:08,280 --> 00:00:13,300
Hello again. And this video, I want to talk about organizing notebooks as I've promised.
-
Not Synced
2
00:00:13,300 --> 00:00:18,210
So we've talked about how do we make charts? That's been a lot of what we've been talking about here.
-
Not Synced
3
00:00:18,210 --> 00:00:25,260
But I wanted to talk about how do we actually put together a notebook that's presenting these charts and presenting our conclusions from them.
-
Not Synced
4
00:00:25,260 --> 00:00:33,040
So learning outcomes for this video are for you to be able to use markdown document structure to organize a notebook, to use the Jupiter,
-
Not Synced
5
00:00:33,040 --> 00:00:40,350
a markdown features to format text in a notebook to create a notebook that clearly tells the story of a data analysis.
-
Not Synced
6
00:00:40,350 --> 00:00:47,610
First thing to understand is that a notebook is a document. It is a convenient way to run Python code and to see the results of it.
-
Not Synced
7
00:00:47,610 --> 00:00:52,950
But the notebook structure is first and foremost a document. It's meant to be read.
-
Not Synced
8
00:00:52,950 --> 00:01:00,300
And there's some structure imposed in the document because it has to read in the same order as the code is going to execute.
-
Not Synced
9
00:01:00,300 --> 00:01:08,420
But. We want to be able to actually read it and understand what's going on as we walk through the notebook.
-
Not Synced
10
00:01:08,420 --> 00:01:12,830
So we also want to factor particularly complex computations out of the notebook.
-
Not Synced
11
00:01:12,830 --> 00:01:15,410
So far, nothing. We've been doing a super complex.
-
Not Synced
12
00:01:15,410 --> 00:01:22,220
But if I have a large, complicated data processing operation, training and an extensive set of machine learning models or something,
-
Not Synced
13
00:01:22,220 --> 00:01:33,200
I'll put those out of the notebook and other scripts and other modules and leave the notebook for communicating the results of my data analysis.
-
Not Synced
14
00:01:33,200 --> 00:01:38,570
So a notebook has two primary types of cells. We have code cells, which you've seen a lot.
-
Not Synced
15
00:01:38,570 --> 00:01:44,720
The Python code and its output. And we have marked down cells that contain formatted text.
-
Not Synced
16
00:01:44,720 --> 00:01:48,260
Could keep you. I recommend keeping your code cells relatively short.
-
Not Synced
17
00:01:48,260 --> 00:01:57,050
One, a few lines. One function definition. If you're defining an entire class and it's taking 100 lines within a code cell.
-
Not Synced
18
00:01:57,050 --> 00:02:01,580
That's a good sign you to pull that out into a python module of some kind.
-
Not Synced
19
00:02:01,580 --> 00:02:06,440
If helpful, show results after the cell. I do this a lot, particularly in development.
-
Not Synced
20
00:02:06,440 --> 00:02:11,450
But if you have too much of it, it can make it hard to read the final notebook because you have all of these outputs.
-
Not Synced
21
00:02:11,450 --> 00:02:15,440
And the notebook wins it being a sea of charts and tables.
-
Not Synced
22
00:02:15,440 --> 00:02:20,600
And it's difficult to find your way through the notebook and find the pieces that you need to look at.
-
Not Synced
23
00:02:20,600 --> 00:02:27,020
So go ahead. Do a lot of them, especially while you're debugging in your prototyping before you submit.
-
Not Synced
24
00:02:27,020 --> 00:02:32,750
Maybe go through and clean up, remove things that were just there for you to test how something worked and leave the cells in
-
Not Synced
25
00:02:32,750 --> 00:02:37,940
your notebook being the ones that help the reader understand the results of what you're doing.
-
Not Synced
26
00:02:37,940 --> 00:02:45,790
Remember, the purpose of the presentation is to show the reader what you learned and how you know it's true.
-
Not Synced
27
00:02:45,790 --> 00:02:54,040
Cells that didn't help you do that. Maybe you can consider removing, though, or that don't help you do that.
-
Not Synced
28
00:02:54,040 --> 00:02:58,720
At the end of the day, they might have helped you figure out how to do that.
-
Not Synced
29
00:02:58,720 --> 00:03:02,650
You can save up a copy of your notebook before doing the cleanup so you don't lose them.
-
Not Synced
30
00:03:02,650 --> 00:03:07,210
You can have a supplementary notebook that has maybe Pazz, you went down.
-
Not Synced
31
00:03:07,210 --> 00:03:11,170
That didn't work out. Another thing you can consider doing is having an appendix.
-
Not Synced
32
00:03:11,170 --> 00:03:17,020
So you've got all of the main content, the notebook. And then down at the end, you have a big heading appendix.
-
Not Synced
33
00:03:17,020 --> 00:03:20,770
And there you have extra things. You want to make sure you can still run from top to bottom.
-
Not Synced
34
00:03:20,770 --> 00:03:27,490
But there you have some of the other things that maybe dove into more details about the building blocks of some of your computations.
-
Not Synced
35
00:03:27,490 --> 00:03:32,650
But it is good to show the results after loading data and after doing a complex manipulation,
-
Not Synced
36
00:03:32,650 --> 00:03:38,910
especially one that significantly changes the shape of the data that you're working with.
-
Not Synced
37
00:03:38,910 --> 00:03:41,790
And they talk mostly in this video, though, about markdown sales,
-
Not Synced
38
00:03:41,790 --> 00:03:48,180
because markdown sells or what you use to build up the structure of your document and make it tell a story,
-
Not Synced
39
00:03:48,180 --> 00:03:52,830
not just be a kind of strange way to present Python code.
-
Not Synced
40
00:03:52,830 --> 00:04:01,710
So markdown is a text syntax for simple markup. I'm going to provide a link to the markdown documentation in the class notes that go with this video.
-
Not Synced
41
00:04:01,710 --> 00:04:06,570
But there's several inline formatting things. If you put two stars around some text, that'll make it bold.
-
Not Synced
42
00:04:06,570 --> 00:04:16,110
One star will make it italics. You can indicate a code using the fit, something that's going to show up as the fixed width code layout using back Tex.
-
Not Synced
43
00:04:16,110 --> 00:04:24,570
This is one that I see ignored very frequently in and writing up because it's really,
-
Not Synced
44
00:04:24,570 --> 00:04:27,960
really useful for function names, variable names, things like that.
-
Not Synced
45
00:04:27,960 --> 00:04:32,670
To be able to set apart like this is a special thing. This is a function name also.
-
Not Synced
46
00:04:32,670 --> 00:04:39,180
Then you can use tech math syntax by putting it between dollar signs in this markdown notebook.
-
Not Synced
47
00:04:39,180 --> 00:04:48,450
Pay attention to the details of what your markdown code or what your text formatted text looks like after you render it in the notebook.
-
Not Synced
48
00:04:48,450 --> 00:04:55,500
Make sure it reads well. Make sure it's clear. Ask yourself if I weren't the one who right wrote this.
-
Not Synced
49
00:04:55,500 --> 00:05:03,290
What? I like reading this and clean it up and pay attention to those details, to make it look,
-
Not Synced
50
00:05:03,290 --> 00:05:08,670
to make it look good and to make it be effective at communicating and so that the reader
-
Not Synced
51
00:05:08,670 --> 00:05:13,170
can clearly understand what the different pieces are and what needs to be emphasized,
-
Not Synced
52
00:05:13,170 --> 00:05:17,260
etc. Markdown also has a number of block elements.
-
Not Synced
53
00:05:17,260 --> 00:05:21,520
The basic one is a paragraph, paragraphs or just text separated by blank lines.
-
Not Synced
54
00:05:21,520 --> 00:05:26,890
You can also have bulleted and numbered lists. You can have code blocks for if you need to have a little.
-
Not Synced
55
00:05:26,890 --> 00:05:31,720
These aren't super common in a notebook because a lot of your code is in the code cells that you execute.
-
Not Synced
56
00:05:31,720 --> 00:05:34,900
But if you need to have a little code that you don't execute for some reason,
-
Not Synced
57
00:05:34,900 --> 00:05:42,970
you can put it in the code block and markdown and then you can also block mathematics, a line on its own that begins and ends with two dollar signs.
-
Not Synced
58
00:05:42,970 --> 00:05:49,170
And you can actually span multiple lines so long as there aren't any blanks that's going to be treated as a piece of block mathematics.
-
Not Synced
59
00:05:49,170 --> 00:05:54,510
It's not in line in a sentence, but it becomes its own block and the rendered self.
-
Not Synced
60
00:05:54,510 --> 00:06:00,450
Headings are an important one to pay attention to. So Mark Down headings are lines that start with one, two,
-
Not Synced
61
00:06:00,450 --> 00:06:06,510
three up to six hash marks and then a space in the heading text having one heading to hitting three.
-
Not Synced
62
00:06:06,510 --> 00:06:11,370
Something that's important to know is the hashes do not mean big and bold.
-
Not Synced
63
00:06:11,370 --> 00:06:16,890
That's what they look like. But that's not what they mean. What they mean is heading.
-
Not Synced
64
00:06:16,890 --> 00:06:23,300
And so you need to have an outline structure to your notebook using the headings.
-
Not Synced
65
00:06:23,300 --> 00:06:29,030
And you need to nest them properly, so within each one, you have your H 2s.
-
Not Synced
66
00:06:29,030 --> 00:06:32,760
And then you have your H threes. You don't go straight from H one to H for you.
-
Not Synced
67
00:06:32,760 --> 00:06:41,540
You have H three in the middle. Start the notebook with an H one that has the notebook title and that that will become in a lot of rendering context.
-
Not Synced
68
00:06:41,540 --> 00:06:46,850
That becomes the title at the top of your notebook. And then all your other headings are two or lower.
-
Not Synced
69
00:06:46,850 --> 00:06:55,670
Also you might if you have an appendix, you might have Appendix B, another H1, but also the section headers should be short, not sentences.
-
Not Synced
70
00:06:55,670 --> 00:07:00,630
If you're writing an entire sentence in your section header. You're you're putting too much there.
-
Not Synced
71
00:07:00,630 --> 00:07:06,840
The section header should be a short title and then the section content comes after it.
-
Not Synced
72
00:07:06,840 --> 00:07:13,650
Now, one of the few reasons why it's important to use the section headers heading levels properly.
-
Not Synced
73
00:07:13,650 --> 00:07:19,710
One is just visually, it helps break up your notebook so we can easily see which component we're at.
-
Not Synced
74
00:07:19,710 --> 00:07:26,130
Second, there are extensions that will do things like no your headings or give you a Browsr Bowl table of contents.
-
Not Synced
75
00:07:26,130 --> 00:07:32,130
You can use to navigate the notebook by heading what? I'm rendering notebooks as a part of the course website.
-
Not Synced
76
00:07:32,130 --> 00:07:36,960
You'll see this over in the right hand side. You can jump directly to notebook headings.
-
Not Synced
77
00:07:36,960 --> 00:07:45,720
That only works because I'm consistently using the heading levels to build the structure and outline based structure of my notebook.
-
Not Synced
78
00:07:45,720 --> 00:07:53,160
Another a third reason is for accessibility. If someone's reading your notebook with an assistive technology such as a screen reader,
-
Not Synced
79
00:07:53,160 --> 00:08:01,500
the section headings are very important to help them navigate to the parts. The notebook that are both relevant to them at a given time.
-
Not Synced
80
00:08:01,500 --> 00:08:09,180
So on the section headers, one additional little rule is if your section editor has to wrap onto a second line, really rethink.
-
Not Synced
81
00:08:09,180 --> 00:08:13,230
It's almost certainly too long particularly.
-
Not Synced
82
00:08:13,230 --> 00:08:16,860
Don't put an entire question in the section. Maybe. Usually.
-
Not Synced
83
00:08:16,860 --> 00:08:22,890
Occasionally it's OK to put a whole question, but maybe put a brief like three to five word summary of the questions topic and
-
Not Synced
84
00:08:22,890 --> 00:08:30,750
then write this question itself as the first paragraph of the of the section.
-
Not Synced
85
00:08:30,750 --> 00:08:33,240
But pay attention to these different formatting features.
-
Not Synced
86
00:08:33,240 --> 00:08:41,460
You can build a well-structured notebook that communicates clearly and draws the reader's emphasis to the places where it needs to go.
-
Not Synced
87
00:08:41,460 --> 00:08:46,650
Writing the text itself. Use the document to tell a story. What's the goal of what you're doing?
-
Not Synced
88
00:08:46,650 --> 00:08:51,630
Either the whole notebook or of individual pieces of analysis. What's the data that we're doing?
-
Not Synced
89
00:08:51,630 --> 00:08:58,140
What do we know about it going in at the up at the top, either at the very top of your notebook or where you're loading the data?
-
Not Synced
90
00:08:58,140 --> 00:09:04,050
It's useful to write some, especially at the notebooks and we report you submit to somebody.
-
Not Synced
91
00:09:04,050 --> 00:09:07,800
It's useful to write there. What do you know? Where did you get the data? How was it collected?
-
Not Synced
92
00:09:07,800 --> 00:09:16,290
Not a full data sheet, but at least some summary information to help the reader understand what it is that we're going to be going and looking at.
-
Not Synced
93
00:09:16,290 --> 00:09:19,800
Why are we doing each piece of the analysis? What's the purpose here?
-
Not Synced
94
00:09:19,800 --> 00:09:25,260
How does it fit into our broader picture, into our broader goals? What approach are we using?
-
Not Synced
95
00:09:25,260 --> 00:09:28,200
We don't want to just repeat the code writing a a numbered list here.
-
Not Synced
96
00:09:28,200 --> 00:09:33,240
The steps and those steps are just a literal translation of the code doesn't help understanding.
-
Not Synced
97
00:09:33,240 --> 00:09:37,570
It creates an opportunity for a code and documentation to become mismatched.
-
Not Synced
98
00:09:37,570 --> 00:09:41,670
But explain if there's anything tricky in the code.
-
Not Synced
99
00:09:41,670 --> 00:09:48,600
Explain why that does the job. Explain the conceptual idea behind why you're approaching things the way you are.
-
Not Synced
100
00:09:48,600 --> 00:09:56,700
If you're doing a data clean up, explain why that what that cleanup's doing and why that's the right cleanup for your data.
-
Not Synced
101
00:09:56,700 --> 00:10:03,600
And then what do we learn from it? So oftentimes what I do with us, with an individual piece of it, like a chart.
-
Not Synced
102
00:10:03,600 --> 00:10:07,260
All right. What question the charge is supposed to be answering.
-
Not Synced
103
00:10:07,260 --> 00:10:11,010
Or at least the purpose of the chart that we have the code to generate the chart itself.
-
Not Synced
104
00:10:11,010 --> 00:10:18,450
And then we have a tech cell that has observations about what we learn from the chart.
-
Not Synced
105
00:10:18,450 --> 00:10:23,160
So what are we doing? How are we going to do it if that's not immediately clear code results?
-
Not Synced
106
00:10:23,160 --> 00:10:25,590
And then what do we observe from these results?
-
Not Synced
107
00:10:25,590 --> 00:10:33,350
So the over then the high level document structure that I recommend is to start start with the title and intros.
-
Not Synced
108
00:10:33,350 --> 00:10:37,140
You've got your title. You're heading one. Then what's the notebook for?
-
Not Synced
109
00:10:37,140 --> 00:10:41,160
Why does this notebook exist? Are there to include links?
-
Not Synced
110
00:10:41,160 --> 00:10:46,650
There's hyperlinks and taxes and markdown as well. Read the markdown documentation to see how to use that.
-
Not Synced
111
00:10:46,650 --> 00:10:50,790
But where does this go? Are there things we need to know?
-
Not Synced
112
00:10:50,790 --> 00:10:54,990
Background about where why this documents being created?
-
Not Synced
113
00:10:54,990 --> 00:11:00,360
Where did the data come from? If we have defined research questions, what are those research questions?
-
Not Synced
114
00:11:00,360 --> 00:11:03,750
You can write those right in the intro, the notebook. Then I have a set up.
-
Not Synced
115
00:11:03,750 --> 00:11:09,060
I almost always have a setup section that comes next. That has input. I import my python libraries.
-
Not Synced
116
00:11:09,060 --> 00:11:11,970
I've maybe defined some help or functions that I'm going to be using throughout the
-
Not Synced
117
00:11:11,970 --> 00:11:16,860
notebook helper function specific to one section I might define in that section.
-
Not Synced
118
00:11:16,860 --> 00:11:20,730
But then and then I LOEs load the data. Sometimes I load the data as a part of the setup.
-
Not Synced
119
00:11:20,730 --> 00:11:27,990
So it's OK. Important modules and then load my data. Sometimes if specially if I have more to say about the data, it's its own section.
-
Not Synced
120
00:11:27,990 --> 00:11:34,500
But then as I load each table, I just show the first few rows of it often so that I can see, OK, I've loaded this data and then it's right there.
-
Not Synced
121
00:11:34,500 --> 00:11:39,390
We can see as we're going through the rest of the notebook. What is the data just loaded look like?
-
Not Synced
122
00:11:39,390 --> 00:11:45,030
Then we perform our analysis and this might be two sections. It might be five, six, seven, eight sections.
-
Not Synced
123
00:11:45,030 --> 00:11:49,920
And then finally at the end, we can summarize and conclude this is going to be really I don't always do this in
-
Not Synced
124
00:11:49,920 --> 00:11:53,340
my research notebook because often that's the material that goes in the paper.
-
Not Synced
125
00:11:53,340 --> 00:11:57,150
But this is going to be something particularly in our assignment, and we're submitting notebooks.
-
Not Synced
126
00:11:57,150 --> 00:12:00,810
Put that at the end of the notebook. What do we learn from this?
-
Not Synced
127
00:12:00,810 --> 00:12:03,150
Sometimes they going to have specific directions for things.
-
Not Synced
128
00:12:03,150 --> 00:12:08,730
I want you to reflect on there when like an assignment one, I've broken down the different requirements.
-
Not Synced
129
00:12:08,730 --> 00:12:15,120
Those become good candidates for your age to your level. Two headings for each of those.
-
Not Synced
130
00:12:15,120 --> 00:12:21,730
So we've got, I think six require six different requirements. An assignment one.
-
Not Synced
131
00:12:21,730 --> 00:12:28,120
H2, heading a primary section of your document for each of those is a good starting point for your layout.
-
Not Synced
132
00:12:28,120 --> 00:12:34,460
In addition to you're probably gonna have another one up at the top for the setup and maybe another for the data load.
-
Not Synced
133
00:12:34,460 --> 00:12:38,590
But think about this. This the flow, your document be able to communicate.
-
Not Synced
134
00:12:38,590 --> 00:12:42,760
What are we doing? What are the prerequisites in terms of and data?
-
Not Synced
135
00:12:42,760 --> 00:12:47,790
How are we actually doing it? And then at the end, what do we learn?
-
Not Synced
136
00:12:47,790 --> 00:12:53,580
So you're going to write a lot of cells and produce a lot of outputs in your notebook while you're debugging,
-
Not Synced
137
00:12:53,580 --> 00:12:59,220
before you submit to before you share in other contexts. Spend some time cleaning up your notebook,
-
Not Synced
138
00:12:59,220 --> 00:13:06,780
remove dead ends and extraneous outputs that you included for debugging, but don't fit in the flow of the story.
-
Not Synced
139
00:13:06,780 --> 00:13:13,350
Consider putting them in a supplementary notebook. If you want to keep them around and then make sure you can rerun your notebook from top to bottom.
-
Not Synced
140
00:13:13,350 --> 00:13:20,300
So when the Jupiter interface is the kernel, when you click that and choose, restart and rerun all.
-
Not Synced
141
00:13:20,300 --> 00:13:26,570
And it will restart the python kernel that's actually running your code so all your variables disappear.
-
Not Synced
142
00:13:26,570 --> 00:13:31,220
Your data is no longer loaded. And then it starts running the notebook from top to bottom.
-
Not Synced
143
00:13:31,220 --> 00:13:37,220
You want that to succeed so that someone else working with the notebook can actually rerun and reproduce your results.
-
Not Synced
144
00:13:37,220 --> 00:13:43,790
If that doesn't succeed, then that means either you deleted something that's that's important or you're the order of
-
Not Synced
145
00:13:43,790 --> 00:13:48,740
your source in the notebook does not match the order in which it actually has to be executed.
-
Not Synced
146
00:13:48,740 --> 00:13:54,980
But make sure it succeeds and also read back to the notebook to make sure that the charts all still look right.
-
Not Synced
147
00:13:54,980 --> 00:14:01,100
The data is the conclusions are all still correct, etc. before you submit the final notebook.
-
Not Synced
148
00:14:01,100 --> 00:14:05,330
So when you're writing an up of two, you also need to know your audience and your purpose.
-
Not Synced
149
00:14:05,330 --> 00:14:10,550
For example, the notebooks I'm writing for you for teaching purposes here.
-
Not Synced
150
00:14:10,550 --> 00:14:17,060
They the things I write in them differ from what I'm going to write in a research notebook that I share with my collaborators,
-
Not Synced
151
00:14:17,060 --> 00:14:22,700
or I use my own purposes because my purpose partially in their notebooks, is to explain how they're working.
-
Not Synced
152
00:14:22,700 --> 00:14:28,850
So I'm going to say more in these notebooks about how exactly the what exactly the code is
-
Not Synced
153
00:14:28,850 --> 00:14:35,040
doing is that you can learn how the code works that I would expect in a research notebook.
-
Not Synced
154
00:14:35,040 --> 00:14:41,730
But also, you're your own internal your own personal use sharing with your adviser or your supervisor,
-
Not Synced
155
00:14:41,730 --> 00:14:47,970
sharing with the public, either the professional public working on your topic or the general public.
-
Not Synced
156
00:14:47,970 --> 00:14:54,180
These are all different audiences and they're going to need different levels of explanation and different things highlighted in your notebook.
-
Not Synced
157
00:14:54,180 --> 00:14:56,820
Also, not all audiences are well served for notebooks.
-
Not Synced
158
00:14:56,820 --> 00:15:04,500
Notebooks are fantastic for internal reports, collaboration, et cetera, sharing the results of a data analysis with colleagues or with yourself.
-
Not Synced
159
00:15:04,500 --> 00:15:09,090
But for final publication, you're often going to need a separate final report.
-
Not Synced
160
00:15:09,090 --> 00:15:15,930
I don't know that it's possible to write a research paper and Jupiter notebooks. Somebody might have tried, but.
-
Not Synced
161
00:15:15,930 --> 00:15:21,180
But I'll still have the notebook where I explain the analysis. I often make that notebook available.
-
Not Synced
162
00:15:21,180 --> 00:15:31,050
So for a lot of my a lot of my published research papers, you can download a zip file or a get repository that contains the notebooks and you
-
Not Synced
163
00:15:31,050 --> 00:15:36,240
can rerun the experiment and rerun my analysis with the notebooks in the notebook.
-
Not Synced
164
00:15:36,240 --> 00:15:40,370
Then also I write the files out to disk. And we're not going to see this quite yet.
-
Not Synced
165
00:15:40,370 --> 00:15:45,630
We're going to see it later when we start talking about workflow. Because right now I'm just having to submit notebooks.
-
Not Synced
166
00:15:45,630 --> 00:15:49,720
But the note, the figures as they show up in the notebook aren't very high resolution.
-
Not Synced
167
00:15:49,720 --> 00:15:55,200
So we're gonna want to render a higher resolution version of them to a PMG file or a PDA file or a
-
Not Synced
168
00:15:55,200 --> 00:16:01,110
postscript file that we can then include in our document and word or law tech or whatever we're writing.
-
Not Synced
169
00:16:01,110 --> 00:16:08,910
So to wrap up, your notebook is first and foremost a document that contains code to generate the results that you're trying to discuss.
-
Not Synced
170
00:16:08,910 --> 00:16:14,490
Take advantage of the document structure and use it as a store to tell the story of your analysis.
-
Not Synced
171
00:16:14,490 --> 00:16:19,910
The conclusion you come to in why we should believe them. Pay attention to the examples I'm giving you in class.
-
Not Synced
172
00:16:19,910 --> 00:16:36,043
I'm also going to be trying to give you some examples of research oriented notebooks that you can look at to see examples of good notebook practice.
-
Not Synced