1
00:00:08,280 --> 00:00:13,300
Hello again. And this video, I want to talk about organizing notebooks as I've promised.
2
00:00:13,300 --> 00:00:18,210
So we've talked about how do we make charts? That's been a lot of what we've been talking about here.
3
00:00:18,210 --> 00:00:25,260
But I wanted to talk about how do we actually put together a notebook that's presenting these charts and presenting our conclusions from them.
4
00:00:25,260 --> 00:00:33,040
So learning outcomes for this video are for you to be able to use markdown document structure to organize a notebook, to use the Jupiter,
5
00:00:33,040 --> 00:00:40,350
a markdown features to format text in a notebook to create a notebook that clearly tells the story of a data analysis.
6
00:00:40,350 --> 00:00:47,610
First thing to understand is that a notebook is a document. It is a convenient way to run Python code and to see the results of it.
7
00:00:47,610 --> 00:00:52,950
But the notebook structure is first and foremost a document. It's meant to be read.
8
00:00:52,950 --> 00:01:00,300
And there's some structure imposed in the document because it has to read in the same order as the code is going to execute.
9
00:01:00,300 --> 00:01:08,420
But. We want to be able to actually read it and understand what's going on as we walk through the notebook.
10
00:01:08,420 --> 00:01:12,830
So we also want to factor particularly complex computations out of the notebook.
11
00:01:12,830 --> 00:01:15,410
So far, nothing. We've been doing a super complex.
12
00:01:15,410 --> 00:01:22,220
But if I have a large, complicated data processing operation, training and an extensive set of machine learning models or something,
13
00:01:22,220 --> 00:01:33,200
I'll put those out of the notebook and other scripts and other modules and leave the notebook for communicating the results of my data analysis.
14
00:01:33,200 --> 00:01:38,570
So a notebook has two primary types of cells. We have code cells, which you've seen a lot.
15
00:01:38,570 --> 00:01:44,720
The Python code and its output. And we have marked down cells that contain formatted text.
16
00:01:44,720 --> 00:01:48,260
Could keep you. I recommend keeping your code cells relatively short.
17
00:01:48,260 --> 00:01:57,050
One, a few lines. One function definition. If you're defining an entire class and it's taking 100 lines within a code cell.
18
00:01:57,050 --> 00:02:01,580
That's a good sign you to pull that out into a python module of some kind.
19
00:02:01,580 --> 00:02:06,440
If helpful, show results after the cell. I do this a lot, particularly in development.
20
00:02:06,440 --> 00:02:11,450
But if you have too much of it, it can make it hard to read the final notebook because you have all of these outputs.
21
00:02:11,450 --> 00:02:15,440
And the notebook wins it being a sea of charts and tables.
22
00:02:15,440 --> 00:02:20,600
And it's difficult to find your way through the notebook and find the pieces that you need to look at.
23
00:02:20,600 --> 00:02:27,020
So go ahead. Do a lot of them, especially while you're debugging in your prototyping before you submit.
24
00:02:27,020 --> 00:02:32,750
Maybe go through and clean up, remove things that were just there for you to test how something worked and leave the cells in
25
00:02:32,750 --> 00:02:37,940
your notebook being the ones that help the reader understand the results of what you're doing.
26
00:02:37,940 --> 00:02:45,790
Remember, the purpose of the presentation is to show the reader what you learned and how you know it's true.
27
00:02:45,790 --> 00:02:54,040
Cells that didn't help you do that. Maybe you can consider removing, though, or that don't help you do that.
28
00:02:54,040 --> 00:02:58,720
At the end of the day, they might have helped you figure out how to do that.
29
00:02:58,720 --> 00:03:02,650
You can save up a copy of your notebook before doing the cleanup so you don't lose them.
30
00:03:02,650 --> 00:03:07,210
You can have a supplementary notebook that has maybe Pazz, you went down.
31
00:03:07,210 --> 00:03:11,170
That didn't work out. Another thing you can consider doing is having an appendix.
32
00:03:11,170 --> 00:03:17,020
So you've got all of the main content, the notebook. And then down at the end, you have a big heading appendix.
33
00:03:17,020 --> 00:03:20,770
And there you have extra things. You want to make sure you can still run from top to bottom.
34
00:03:20,770 --> 00:03:27,490
But there you have some of the other things that maybe dove into more details about the building blocks of some of your computations.
35
00:03:27,490 --> 00:03:32,650
But it is good to show the results after loading data and after doing a complex manipulation,
36
00:03:32,650 --> 00:03:38,910
especially one that significantly changes the shape of the data that you're working with.
37
00:03:38,910 --> 00:03:41,790
And they talk mostly in this video, though, about markdown sales,
38
00:03:41,790 --> 00:03:48,180
because markdown sells or what you use to build up the structure of your document and make it tell a story,
39
00:03:48,180 --> 00:03:52,830
not just be a kind of strange way to present Python code.
40
00:03:52,830 --> 00:04:01,710
So markdown is a text syntax for simple markup. I'm going to provide a link to the markdown documentation in the class notes that go with this video.
41
00:04:01,710 --> 00:04:06,570
But there's several inline formatting things. If you put two stars around some text, that'll make it bold.
42
00:04:06,570 --> 00:04:16,110
One star will make it italics. You can indicate a code using the fit, something that's going to show up as the fixed width code layout using back Tex.
43
00:04:16,110 --> 00:04:24,570
This is one that I see ignored very frequently in and writing up because it's really,
44
00:04:24,570 --> 00:04:27,960
really useful for function names, variable names, things like that.
45
00:04:27,960 --> 00:04:32,670
To be able to set apart like this is a special thing. This is a function name also.
46
00:04:32,670 --> 00:04:39,180
Then you can use tech math syntax by putting it between dollar signs in this markdown notebook.
47
00:04:39,180 --> 00:04:48,450
Pay attention to the details of what your markdown code or what your text formatted text looks like after you render it in the notebook.
48
00:04:48,450 --> 00:04:55,500
Make sure it reads well. Make sure it's clear. Ask yourself if I weren't the one who right wrote this.
49
00:04:55,500 --> 00:05:03,290
What? I like reading this and clean it up and pay attention to those details, to make it look,
50
00:05:03,290 --> 00:05:08,670
to make it look good and to make it be effective at communicating and so that the reader
51
00:05:08,670 --> 00:05:13,170
can clearly understand what the different pieces are and what needs to be emphasized,
52
00:05:13,170 --> 00:05:17,260
etc. Markdown also has a number of block elements.
53
00:05:17,260 --> 00:05:21,520
The basic one is a paragraph, paragraphs or just text separated by blank lines.
54
00:05:21,520 --> 00:05:26,890
You can also have bulleted and numbered lists. You can have code blocks for if you need to have a little.
55
00:05:26,890 --> 00:05:31,720
These aren't super common in a notebook because a lot of your code is in the code cells that you execute.
56
00:05:31,720 --> 00:05:34,900
But if you need to have a little code that you don't execute for some reason,
57
00:05:34,900 --> 00:05:42,970
you can put it in the code block and markdown and then you can also block mathematics, a line on its own that begins and ends with two dollar signs.
58
00:05:42,970 --> 00:05:49,170
And you can actually span multiple lines so long as there aren't any blanks that's going to be treated as a piece of block mathematics.
59
00:05:49,170 --> 00:05:54,510
It's not in line in a sentence, but it becomes its own block and the rendered self.
60
00:05:54,510 --> 00:06:00,450
Headings are an important one to pay attention to. So Mark Down headings are lines that start with one, two,
61
00:06:00,450 --> 00:06:06,510
three up to six hash marks and then a space in the heading text having one heading to hitting three.
62
00:06:06,510 --> 00:06:11,370
Something that's important to know is the hashes do not mean big and bold.
63
00:06:11,370 --> 00:06:16,890
That's what they look like. But that's not what they mean. What they mean is heading.
64
00:06:16,890 --> 00:06:23,300
And so you need to have an outline structure to your notebook using the headings.
65
00:06:23,300 --> 00:06:29,030
And you need to nest them properly, so within each one, you have your H 2s.
66
00:06:29,030 --> 00:06:32,760
And then you have your H threes. You don't go straight from H one to H for you.
67
00:06:32,760 --> 00:06:41,540
You have H three in the middle. Start the notebook with an H one that has the notebook title and that that will become in a lot of rendering context.
68
00:06:41,540 --> 00:06:46,850
That becomes the title at the top of your notebook. And then all your other headings are two or lower.
69
00:06:46,850 --> 00:06:55,670
Also you might if you have an appendix, you might have Appendix B, another H1, but also the section headers should be short, not sentences.
70
00:06:55,670 --> 00:07:00,630
If you're writing an entire sentence in your section header. You're you're putting too much there.
71
00:07:00,630 --> 00:07:06,840
The section header should be a short title and then the section content comes after it.
72
00:07:06,840 --> 00:07:13,650
Now, one of the few reasons why it's important to use the section headers heading levels properly.
73
00:07:13,650 --> 00:07:19,710
One is just visually, it helps break up your notebook so we can easily see which component we're at.
74
00:07:19,710 --> 00:07:26,130
Second, there are extensions that will do things like no your headings or give you a Browsr Bowl table of contents.
75
00:07:26,130 --> 00:07:32,130
You can use to navigate the notebook by heading what? I'm rendering notebooks as a part of the course website.
76
00:07:32,130 --> 00:07:36,960
You'll see this over in the right hand side. You can jump directly to notebook headings.
77
00:07:36,960 --> 00:07:45,720
That only works because I'm consistently using the heading levels to build the structure and outline based structure of my notebook.
78
00:07:45,720 --> 00:07:53,160
Another a third reason is for accessibility. If someone's reading your notebook with an assistive technology such as a screen reader,
79
00:07:53,160 --> 00:08:01,500
the section headings are very important to help them navigate to the parts. The notebook that are both relevant to them at a given time.
80
00:08:01,500 --> 00:08:09,180
So on the section headers, one additional little rule is if your section editor has to wrap onto a second line, really rethink.
81
00:08:09,180 --> 00:08:13,230
It's almost certainly too long particularly.
82
00:08:13,230 --> 00:08:16,860
Don't put an entire question in the section. Maybe. Usually.
83
00:08:16,860 --> 00:08:22,890
Occasionally it's OK to put a whole question, but maybe put a brief like three to five word summary of the questions topic and
84
00:08:22,890 --> 00:08:30,750
then write this question itself as the first paragraph of the of the section.
85
00:08:30,750 --> 00:08:33,240
But pay attention to these different formatting features.
86
00:08:33,240 --> 00:08:41,460
You can build a well-structured notebook that communicates clearly and draws the reader's emphasis to the places where it needs to go.
87
00:08:41,460 --> 00:08:46,650
Writing the text itself. Use the document to tell a story. What's the goal of what you're doing?
88
00:08:46,650 --> 00:08:51,630
Either the whole notebook or of individual pieces of analysis. What's the data that we're doing?
89
00:08:51,630 --> 00:08:58,140
What do we know about it going in at the up at the top, either at the very top of your notebook or where you're loading the data?
90
00:08:58,140 --> 00:09:04,050
It's useful to write some, especially at the notebooks and we report you submit to somebody.
91
00:09:04,050 --> 00:09:07,800
It's useful to write there. What do you know? Where did you get the data? How was it collected?
92
00:09:07,800 --> 00:09:16,290
Not a full data sheet, but at least some summary information to help the reader understand what it is that we're going to be going and looking at.
93
00:09:16,290 --> 00:09:19,800
Why are we doing each piece of the analysis? What's the purpose here?
94
00:09:19,800 --> 00:09:25,260
How does it fit into our broader picture, into our broader goals? What approach are we using?
95
00:09:25,260 --> 00:09:28,200
We don't want to just repeat the code writing a a numbered list here.
96
00:09:28,200 --> 00:09:33,240
The steps and those steps are just a literal translation of the code doesn't help understanding.
97
00:09:33,240 --> 00:09:37,570
It creates an opportunity for a code and documentation to become mismatched.
98
00:09:37,570 --> 00:09:41,670
But explain if there's anything tricky in the code.
99
00:09:41,670 --> 00:09:48,600
Explain why that does the job. Explain the conceptual idea behind why you're approaching things the way you are.
100
00:09:48,600 --> 00:09:56,700
If you're doing a data clean up, explain why that what that cleanup's doing and why that's the right cleanup for your data.
101
00:09:56,700 --> 00:10:03,600
And then what do we learn from it? So oftentimes what I do with us, with an individual piece of it, like a chart.
102
00:10:03,600 --> 00:10:07,260
All right. What question the charge is supposed to be answering.
103
00:10:07,260 --> 00:10:11,010
Or at least the purpose of the chart that we have the code to generate the chart itself.
104
00:10:11,010 --> 00:10:18,450
And then we have a tech cell that has observations about what we learn from the chart.
105
00:10:18,450 --> 00:10:23,160
So what are we doing? How are we going to do it if that's not immediately clear code results?
106
00:10:23,160 --> 00:10:25,590
And then what do we observe from these results?
107
00:10:25,590 --> 00:10:33,350
So the over then the high level document structure that I recommend is to start start with the title and intros.
108
00:10:33,350 --> 00:10:37,140
You've got your title. You're heading one. Then what's the notebook for?
109
00:10:37,140 --> 00:10:41,160
Why does this notebook exist? Are there to include links?
110
00:10:41,160 --> 00:10:46,650
There's hyperlinks and taxes and markdown as well. Read the markdown documentation to see how to use that.
111
00:10:46,650 --> 00:10:50,790
But where does this go? Are there things we need to know?
112
00:10:50,790 --> 00:10:54,990
Background about where why this documents being created?
113
00:10:54,990 --> 00:11:00,360
Where did the data come from? If we have defined research questions, what are those research questions?
114
00:11:00,360 --> 00:11:03,750
You can write those right in the intro, the notebook. Then I have a set up.
115
00:11:03,750 --> 00:11:09,060
I almost always have a setup section that comes next. That has input. I import my python libraries.
116
00:11:09,060 --> 00:11:11,970
I've maybe defined some help or functions that I'm going to be using throughout the
117
00:11:11,970 --> 00:11:16,860
notebook helper function specific to one section I might define in that section.
118
00:11:16,860 --> 00:11:20,730
But then and then I LOEs load the data. Sometimes I load the data as a part of the setup.
119
00:11:20,730 --> 00:11:27,990
So it's OK. Important modules and then load my data. Sometimes if specially if I have more to say about the data, it's its own section.
120
00:11:27,990 --> 00:11:34,500
But then as I load each table, I just show the first few rows of it often so that I can see, OK, I've loaded this data and then it's right there.
121
00:11:34,500 --> 00:11:39,390
We can see as we're going through the rest of the notebook. What is the data just loaded look like?
122
00:11:39,390 --> 00:11:45,030
Then we perform our analysis and this might be two sections. It might be five, six, seven, eight sections.
123
00:11:45,030 --> 00:11:49,920
And then finally at the end, we can summarize and conclude this is going to be really I don't always do this in
124
00:11:49,920 --> 00:11:53,340
my research notebook because often that's the material that goes in the paper.
125
00:11:53,340 --> 00:11:57,150
But this is going to be something particularly in our assignment, and we're submitting notebooks.
126
00:11:57,150 --> 00:12:00,810
Put that at the end of the notebook. What do we learn from this?
127
00:12:00,810 --> 00:12:03,150
Sometimes they going to have specific directions for things.
128
00:12:03,150 --> 00:12:08,730
I want you to reflect on there when like an assignment one, I've broken down the different requirements.
129
00:12:08,730 --> 00:12:15,120
Those become good candidates for your age to your level. Two headings for each of those.
130
00:12:15,120 --> 00:12:21,730
So we've got, I think six require six different requirements. An assignment one.
131
00:12:21,730 --> 00:12:28,120
H2, heading a primary section of your document for each of those is a good starting point for your layout.
132
00:12:28,120 --> 00:12:34,460
In addition to you're probably gonna have another one up at the top for the setup and maybe another for the data load.
133
00:12:34,460 --> 00:12:38,590
But think about this. This the flow, your document be able to communicate.
134
00:12:38,590 --> 00:12:42,760
What are we doing? What are the prerequisites in terms of and data?
135
00:12:42,760 --> 00:12:47,790
How are we actually doing it? And then at the end, what do we learn?
136
00:12:47,790 --> 00:12:53,580
So you're going to write a lot of cells and produce a lot of outputs in your notebook while you're debugging,
137
00:12:53,580 --> 00:12:59,220
before you submit to before you share in other contexts. Spend some time cleaning up your notebook,
138
00:12:59,220 --> 00:13:06,780
remove dead ends and extraneous outputs that you included for debugging, but don't fit in the flow of the story.
139
00:13:06,780 --> 00:13:13,350
Consider putting them in a supplementary notebook. If you want to keep them around and then make sure you can rerun your notebook from top to bottom.
140
00:13:13,350 --> 00:13:20,300
So when the Jupiter interface is the kernel, when you click that and choose, restart and rerun all.
141
00:13:20,300 --> 00:13:26,570
And it will restart the python kernel that's actually running your code so all your variables disappear.
142
00:13:26,570 --> 00:13:31,220
Your data is no longer loaded. And then it starts running the notebook from top to bottom.
143
00:13:31,220 --> 00:13:37,220
You want that to succeed so that someone else working with the notebook can actually rerun and reproduce your results.
144
00:13:37,220 --> 00:13:43,790
If that doesn't succeed, then that means either you deleted something that's that's important or you're the order of
145
00:13:43,790 --> 00:13:48,740
your source in the notebook does not match the order in which it actually has to be executed.
146
00:13:48,740 --> 00:13:54,980
But make sure it succeeds and also read back to the notebook to make sure that the charts all still look right.
147
00:13:54,980 --> 00:14:01,100
The data is the conclusions are all still correct, etc. before you submit the final notebook.
148
00:14:01,100 --> 00:14:05,330
So when you're writing an up of two, you also need to know your audience and your purpose.
149
00:14:05,330 --> 00:14:10,550
For example, the notebooks I'm writing for you for teaching purposes here.
150
00:14:10,550 --> 00:14:17,060
They the things I write in them differ from what I'm going to write in a research notebook that I share with my collaborators,
151
00:14:17,060 --> 00:14:22,700
or I use my own purposes because my purpose partially in their notebooks, is to explain how they're working.
152
00:14:22,700 --> 00:14:28,850
So I'm going to say more in these notebooks about how exactly the what exactly the code is
153
00:14:28,850 --> 00:14:35,040
doing is that you can learn how the code works that I would expect in a research notebook.
154
00:14:35,040 --> 00:14:41,730
But also, you're your own internal your own personal use sharing with your adviser or your supervisor,
155
00:14:41,730 --> 00:14:47,970
sharing with the public, either the professional public working on your topic or the general public.
156
00:14:47,970 --> 00:14:54,180
These are all different audiences and they're going to need different levels of explanation and different things highlighted in your notebook.
157
00:14:54,180 --> 00:14:56,820
Also, not all audiences are well served for notebooks.
158
00:14:56,820 --> 00:15:04,500
Notebooks are fantastic for internal reports, collaboration, et cetera, sharing the results of a data analysis with colleagues or with yourself.
159
00:15:04,500 --> 00:15:09,090
But for final publication, you're often going to need a separate final report.
160
00:15:09,090 --> 00:15:15,930
I don't know that it's possible to write a research paper and Jupiter notebooks. Somebody might have tried, but.
161
00:15:15,930 --> 00:15:21,180
But I'll still have the notebook where I explain the analysis. I often make that notebook available.
162
00:15:21,180 --> 00:15:31,050
So for a lot of my a lot of my published research papers, you can download a zip file or a get repository that contains the notebooks and you
163
00:15:31,050 --> 00:15:36,240
can rerun the experiment and rerun my analysis with the notebooks in the notebook.
164
00:15:36,240 --> 00:15:40,370
Then also I write the files out to disk. And we're not going to see this quite yet.
165
00:15:40,370 --> 00:15:45,630
We're going to see it later when we start talking about workflow. Because right now I'm just having to submit notebooks.
166
00:15:45,630 --> 00:15:49,720
But the note, the figures as they show up in the notebook aren't very high resolution.
167
00:15:49,720 --> 00:15:55,200
So we're gonna want to render a higher resolution version of them to a PMG file or a PDA file or a
168
00:15:55,200 --> 00:16:01,110
postscript file that we can then include in our document and word or law tech or whatever we're writing.
169
00:16:01,110 --> 00:16:08,910
So to wrap up, your notebook is first and foremost a document that contains code to generate the results that you're trying to discuss.
170
00:16:08,910 --> 00:16:14,490
Take advantage of the document structure and use it as a store to tell the story of your analysis.
171
00:16:14,490 --> 00:16:19,910
The conclusion you come to in why we should believe them. Pay attention to the examples I'm giving you in class.
172
00:16:19,910 --> 00:16:36,043
I'm also going to be trying to give you some examples of research oriented notebooks that you can look at to see examples of good notebook practice.