-
I guess we should do an intro to this as well,
-
so this is just sort of a
-
free-form Q&A lecture where you, as in
the two people sitting here, but also
-
everyone at home who did not come here
in person get to ask questions and we
-
have a bunch of questions people asked
in advance but you can also ask
-
additional questions during, for the two
of you who are here, you can do it either
-
by raising your hand or you can submit it on
the forum and be anonymous, it's up to you
-
regardless though, what we're gonna
do is just go through some of the
-
questions that have been asked and try to
give as helpful answers as we can
-
although they are unprepared on our side and
-
yeah that's the plan I guess we go
-
from popular to least popular
-
fire away
-
all right so for our first question any
-
recommendations on learning operating
system related topics like processes,
-
virtual memory, interrupts,
memory management, etc
-
so I think this
-
is an interesting question because these
are really low level concepts that often
-
do not matter, unless you have to
deal with this in some capacity,
-
right so
-
one instance where this matters is you're
writing really low level code like
-
you're implementing a kernel or something
like that, or you want to
-
just hack on the Linux kernel.
-
It's rare otherwise that you need to work with
-
especially like virtual memory and
interrupts and stuff yourself
-
processes, I think are a more general concept
that we've talked a little bit about in
-
this class as well and tools like
htop, pgrep, kill, and signals and
-
that sort of stuff
-
in terms of learning it
-
maybe one of the best ways, is to try to
take either an introductory class on the
-
topic, so for example MIT has a class
called 6.828, which is where
-
you essentially build and develop your
own operating system based on some code
-
that you're given, and all of those labs
are publicly available and all the
-
resources for the class are publicly available,
and so that is a good way to
-
really learn them is by doing them yourself.
-
There are also various
-
tutorials online that basically guide
you through how do you write a kernel
-
from scratch. Not necessarily a very
elaborate one, not one you would want
-
to run any real software on, but just to
teach you the basics and so that would
-
be another thing to look up.
-
Like how do I write a kernel in and then your
-
language of choice. You will probably not
find one that lets you do it in Python
-
but in like C, C++, Rust, there
are a bunch of topics like this
-
one other note on operating systems
-
so like Jon mentioned MIT has a 6.828 class but
-
if you're looking for a more high-level
overview, not necessarily programming
-
an operating system, but just learning about
the concepts another good resource
-
is a book called "Modern Operating
Systems" by Andrew Tanenbaum
-
there's also actually a book called "The FreeBSD
Operating System" which is really good,
-
It doesn't go through Linux, but it goes
through FreeBSD and the BSD kernel is
-
arguably better organized than the Linux
one and better documented and so it
-
might be a gentler introduction to some of those
topics than trying to understand Linux
-
You want to check it as answered?
-
- Yes + Nice
-
Answered
-
For our next question,
-
What are some of the tools you'd
prioritize learning first?
-
- Maybe we can all go through and
give our opinion on this? + Yeah
-
Tools to prioritize learning first?
-
I think learning your editor well,
just serves you in all capacities
-
like being efficient at editing files,
is just like a majority of
-
what you're going to spend your time doing.
And in general, just using your
-
keyboard more and your mouse less. It means
that you get to spend more of your
-
time doing useful things and
less of your time moving
-
I think that would be my top priority,
-
so I would say that for what
-
tool to prioritize will depend
on what exactly you're doing
-
I think the core idea is you should try
to find the types of tasks that you are
-
doing repetitively and so
-
if you are doing some sort of like
machine learning workload and
-
you find yourself using Jupyter notebooks,
like the one we presented
-
yesterday, a lot. Then again, using
a mouse for that might not be
-
the best idea and you want to familiarize yourself
with the keyboard shortcuts
-
and pretty much with anything you will
end up figuring out that there are some
-
repetitive tasks that you're running on a
computer, and just trying to figure out
-
oh there's probably a better way to do this
-
be it a terminal, be it an editor
-
And it might be really interesting to
learn to use some of the topics that
-
we have covered, but if they're not
extremely useful on an everyday
-
basis then it might not be worth prioritizing them
-
Out of the topics
-
covered in this class, in my opinion, two
of the most useful things are version
-
control and text editors, and I think they're
a little bit different from each
-
other, in the sense that text editors I
think are really useful to learn well,
-
but it was probably the case that before
we started using Vim and all its fancy
-
keyboard shortcuts you had some other
text editor you were using before and
-
you could edit text just fine maybe a little
bit inefficiently, whereas I think
-
version control is another really useful
skill and that's one where if you don't
-
really know the tool properly, it can actually
lead to some problems like loss
-
of data or just inability to collaborate
properly with people. So I
-
think version control is one of the first
things that's worth learning well.
-
Yeah, I agree with that. I think
learning a tool like Git is just
-
gonna save you so much heartache down the line.
-
It, also, to add on to that,
-
it really helps you collaborate with others,
and Anish touched a little bit on GitHub
-
in the last lecture, and just learning
to use that tool well in order
-
to work on larger software projects
that other people are working on is
-
an invaluable skill.
-
For our next question,
-
"When do I use Python versus a
-
Bash script versus some other language?"
-
This is tough, because I think this comes
-
down to what Jose was saying earlier too,
-
that it really depends on
what you're trying to do.
-
For me, I think for Bash scripts in particular,
-
Bash scripts are for
-
automating running a bunch of commands.
You don't want to write any
-
other, like, business logic in Bash.
-
Like, it is just for, 'I want to run these
-
commands, in this order... maybe with
arguments?' But - but, like, even that,
-
it's unclear that you want a Bash script
once you start taking arguments.
-
Similarly, like, once you start doing any
kind of, like, text processing, or
-
configuration, all that,
-
reach for a language that is... a more, a more serious
-
programming language than Bash is.
-
Bash is really for short, one-off
-
scripts, or ones that have a very well-defined
use case, on the terminal, in
-
the shell, probably.
-
For a slightly more concrete guideline,
you might say, 'Write a
-
Bash script if it's less than a hundred
lines of code or so', but once it gets
-
beyond that point, Bash is kind of
unwieldy, and it's probably worth
-
switching to a more serious programming
language, like Python.
-
And, to add to that,
-
I would say, like, I found myself writing,
sometimes, scripts in Python, because
-
if I have already solved some subproblem
that covers part of the problem in Python,
-
I find it much easier to compose the
previous solution that I found out in
-
Python than just try to reuse Bash code,
that I don't find as reusable as Python.
-
And in the same way it's kind of nice that
a lot of people have written something
-
like Python libraries or like Ruby libraries
to do a lot of these things,
-
whereas, in Bash, it's kind of hard
to have, like, code reuse.
-
And, in fact,
-
I think to add to that, usually, if you
find a library, in some language that
-
helps with the task you're trying to
do, use that language for the job.
-
And in Bash, there are no libraries. There
are only the programs on your computer.
-
So you probably don't want to use
it, unless like there's a program
-
you can just invoke. I do think another
thing worth remembering about Bash is:
-
Bash is really hard to get right.
-
It's very easy to get it right for the particular
use case you're trying to solve right now,
-
but, like, things like,
-
"What if one of the filenames has a space in it?"
-
It has caused so many bugs, and so
-
many problems in Bash scripts. And, if you
use a - a real programming language, then
-
those problems just go away.
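
To make that concrete, here is a minimal sketch in Python, assuming a couple of made-up file names and a hypothetical helper called line_counts. Each path is a single value, so a space in a file name never gets split into separate arguments the way an unquoted Bash variable does.

```python
import tempfile
from pathlib import Path


def line_counts(directory):
    """Return {file name: number of lines} for every .txt file in directory."""
    counts = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # `path` is one object, however many spaces the name contains;
        # there is no word splitting to guard against.
        counts[path.name] = len(path.read_text().splitlines())
    return counts


# Hypothetical demo files, one of them with a space in its name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "my notes.txt").write_text("a\nb\n")
    (Path(tmp) / "todo.txt").write_text("x\n")
    result = line_counts(tmp)
    print(result)
```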
-
Yes! Checked it.
-
For our next question,
-
what is the difference between sourcing
a script, and executing that script?
-
Ooh. So, this, actually, we got in office
hours a - a while back, as well, which is,
-
'Aren't they the same? Like, aren't they
both just running the Bash script?'
-
And, it is true
-
both of these will end up executing the
lines of code that are in the script.
-
The ways in which they differ is that
sourcing a script is telling your
-
current Bash session
to execute that program,
-
whereas the other one is, 'Start up a new instance
of Bash, and run the program there, instead.'
-
And, this matters for things like... Imagine that
"script.sh" tries to change directories.
-
If you are running the script,
as in the second invocation,
-
"./script.sh", then the new
process is going to change
-
directories. But, by the time that script
exits, and returns to your shell,
-
your shell still remains in the same place. However,
if you do "cd" in a script, and you "source" it,
-
your current instance of Bash is the
one that ends up running it, and
-
so, it ends up "cd"-ing where you are.
-
This is also why, if you define functions,
-
for example, that you may want to
execute in your shell session,
-
you need to source the script, not run it,
-
because if you run it, that function
will be defined in the
-
instance of Bash,
-
in the Bash process that gets launched, but it
will not be defined in your current shell.
-
I think those are two of the biggest
differences between the two.
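
The same distinction can be sketched in Python, as an analogy rather than as what Bash literally does: SCRIPT here is a made-up one-line "script" that changes directory. Running it in a child process is like "./script.sh", and running it inside the current process is like "source script.sh".

```python
import os
import subprocess
import sys
import tempfile

# The "script": one line that changes directory to the filesystem root.
SCRIPT = "import os; os.chdir(os.sep)"

original = os.getcwd()
start = os.path.realpath(tempfile.mkdtemp())  # scratch starting directory
os.chdir(start)

# Like "./script.sh": a new process runs the script, changes its own
# directory, and exits. The current process does not move.
subprocess.run([sys.executable, "-c", SCRIPT], check=True)
after_child = os.getcwd()

# Like "source script.sh": the same lines run inside the current process,
# so the change of directory sticks.
exec(SCRIPT)
after_exec = os.getcwd()

print("after running in a child process:", after_child)
print("after running in this process:   ", after_exec)

# Clean up the scratch directory.
os.chdir(original)
os.rmdir(start)
```

Anything else the script defines (functions, variables) follows the same rule: it only survives if it ran in the current process.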
-
Next question...
-
"What are the places where various packages and tools
are stored and how does referencing them work?
-
What even is /bin or /lib?"
-
So, as we covered in the first lecture,
there is this PATH environment variable,
-
which is like a colon-separated
string of all the places
-
where your shell is gonna look for binaries.
And, if you just do something like
-
"echo $PATH", you're gonna get this list;
-
all these places are gonna
be consulted, in order.
-
It's gonna go through all of them, and, in fact,
-
- There is already... Did we cover which? + Yeah
-
So, if you run "which", and a specific command,
-
the shell is actually gonna tell
you where it's finding this (command).
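
As a rough sketch of that lookup in Python: find_in_path is a hypothetical helper written for illustration, and it assumes a Unix-like system where a command is a plain executable file in one of the PATH directories. The standard library's shutil.which does the same job more carefully.

```python
import os
import shutil

# PATH is one string; os.pathsep is ":" on Linux and macOS.
directories = os.environ.get("PATH", "").split(os.pathsep)
for directory in directories:
    print(directory)


def find_in_path(command):
    """Return the first PATH entry that holds an executable `command`."""
    for directory in directories:  # consulted in order, first match wins
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Compare the hand-written lookup with the standard-library one.
print(find_in_path("ls"), shutil.which("ls"))
```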
-
Beyond that,
-
there are some conventions where a lot
of programs will install their binaries
-
in places like /usr/bin (or at
least they will include symlinks
-
in /usr/bin) so you can find them
-
There's also a /usr/local/bin
-
There are special directories. For example,
/usr/sbin is for programs meant to be run as root, and
-
some of these conventions are slightly
different between different distros so
-
I know some distros, for example, install
the user libraries under /opt
-
Yeah I think one thing just
to talk a little bit more
-
about /bin and then Anish maybe you can
do the other folders so when it comes to
-
/bin the convention
-
There are conventions, and the conventions are
usually /bin are for essential system utilities
-
/usr/bin are for user programs and
-
/usr/local/bin are for user
compiled programs, sort of
-
so things that you installed that you intend
the user to run, are in /usr/bin
-
things that a user has compiled themselves and stuck
on your system, probably goes in /usr/local/bin
-
but again, this varies a lot from machine
to machine, and distro to distro
-
On Arch Linux, for example, /bin
is a symlink to /usr/bin
-
They're the same and as Jose mentioned, there's
also /sbin which is for programs that are
-
intended to only be run as root, that
also varies from distro to distro
-
whether you even have that directory, and
on many systems like /usr/local/bin
-
might not even be in your PATH, or
might not even exist on your system
-
On BSD on the other hand /usr/local/bin
is often used a lot more heavily
-
yeah so
-
What we were talking about so far, these
are all ways that files and folders are
-
organized on Linux or
BSD; things vary a little bit between
-
that and macOS or other platforms
-
I think for the specific locations,
-
if you want to know exactly what it's
used for, you can look it up
-
But some general patterns to keep in mind are: anything
with /bin in it has binary executable programs in it,
-
anything with /lib in it, has
libraries in it so things that
-
programs can link against, and then some
other things that are useful to know are
-
there's a /etc on many systems, which
has configuration files in it and
-
then there's /home, which underneath that directory
contains each user's home directory
-
so on a Linux box my username,
if it's anish, will
-
correspond to a home directory /home/anish
-
Yeah I guess there are
-
a couple of others like /tmp is usually
a temporary directory that gets
-
erased when you reboot not always but sometimes,
you should check on your system
-
There's a /var which often holds like
files that change over time so
-
these are usually going to be things
like lock files for package managers
-
they're gonna be things like log files
files to keep track of process IDs
-
then there's /dev which shows devices so
-
usually so these are special files that
correspond to devices on your system we
-
talked about /sys, Anish mentioned /etc
-
/opt is a common one for third-party
software; basically it's usually for
-
companies that ported their software to Linux
but don't actually understand what
-
running software on Linux is like, and
so they just have a directory with all
-
their stuff in it and when those get installed
they usually get installed into /opt
-
I think those are the ones off the top of my head
-
yeah
-
And we will list these in our lecture notes
which we'll produce after this lecture
-
Next question
-
Should I apt-get install a Python whatever
-
package or pip install that package
-
so this is a good question that I think at
-
a higher level this question is asking
should I use my systems package manager
-
to install things or should I use some other
package manager. Like in this case
-
one that's more specific to a particular
language. And the answer here is also
-
kind of it depends, sometimes it's nice
to manage things using a system package
-
manager so everything can be installed
and upgraded in a single place but
-
I think oftentimes whatever is available
in the system repositories the things
-
you can get via a tool like
apt-get or something similar
-
might be slightly out of date compared to
the more language specific repository
-
so for example a lot of the Python packages
I use I really want the most
-
up-to-date version and so
I use pip to install them
-
Then, to extend on that, it is
-
sometimes the case that the system packages
might require some other
-
dependencies that you might not have realized
you need, and it also might be
-
the case that for some systems,
at least for Alpine Linux, they
-
don't have wheels for like a lot of the
Python packages so it will just take
-
longer to compile them, it will take more
space because they have to compile them
-
from scratch. Whereas if you just go
to pip, pip has binaries for a lot of
-
different platforms and that will probably work
-
You also should be aware that pip might not do
the exact same thing in different computers
-
So, for example, if you are in a kind of laptop
or like a desktop that is running like
-
a x86 or x86_64 you probably have binaries,
but if you're running something
-
like Raspberry Pi or some other kind of
embedded device. These are running on a
-
different kind of hardware architecture
and you might not have binaries
-
I think that's also good to take into account,
in that case it might be worthwhile to
-
use the system packages just because they
will take much shorter to get them
-
than to just to compile from scratch
the entire Python installation
-
Apart from that, I don't think I can think of any exceptions
where I would actually use the system packages
-
instead of the Python provided ones
-
So, one other thing to keep in mind is that
-
sometimes you will have more than one
program on your computer and you might
-
be developing more than one program on
your computer and for some reason not
-
all programs are always built with the latest
version of things, sometimes they
-
are a little bit behind, and when you
install something system-wide you can
-
only... depends on your exact system,
but often you just have one version
-
what pip lets you do, especially combined
with something like Python's virtualenv,
-
and similar concepts exist for other
languages, where you can sort of say
-
I want to (NPM does the same thing as well
with its node modules, for example) where
-
I'm gonna compile the dependencies of
this package in sort of a subdirectory
-
of its own, and all of the versions that it
requires are going to be built in there
-
and you can do this separately for separate
projects so even if they have
-
different dependencies or the same dependencies
with different versions
-
they're still sort of kept separate. And that
is one thing that's hard to achieve
-
with system packages
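
A small sketch of that per-project isolation, using Python's standard-library venv module. The project names are made up, and with_pip=False is only there to keep the sketch quick and offline; a real project would normally want pip inside the environment.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    envs = {}
    for project in ("project-a", "project-b"):  # hypothetical project names
        env_dir = Path(tmp) / project / ".venv"
        # Each project gets its own environment directory, so packages
        # installed into one are invisible to the other.
        venv.create(env_dir, with_pip=False)
        # pyvenv.cfg is the marker file that makes a directory a venv.
        envs[project] = (env_dir / "pyvenv.cfg").exists()
    print(envs)
```

Installing different versions of the same dependency into the two environments would then not conflict, because each one has its own site-packages directory.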
-
Next question
-
What are the easiest and best profiling tools
to use to improve performance of my code?
-
This is a topic we could talk
about for a very long time
-
The easiest and best is to print stuff using time
-
Like, I'm not joking, very often
the easiest thing is in your code
-
At the top you figure out what the current
time is, and then you do sort of
-
a binary search over your program of add
a print statement that prints how much
-
time has elapsed since the start of your
program and then you do that until you
-
find the segment of code that took the
longest. And then you go into that
-
function and then you do the same thing
again and you keep doing this until you
-
find roughly where the time was spent. It's
not foolproof, but it is really easy
-
and it gives you good information quickly
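
Here is roughly what that looks like in Python. The functions load and process are hypothetical stand-ins for the real stages of a program, and checkpoint is a helper invented for this sketch.

```python
import time

start = time.perf_counter()  # taken once, at the top of the program
checkpoints = []


def checkpoint(label):
    """Record and print how much time has elapsed since the start."""
    elapsed = time.perf_counter() - start
    checkpoints.append((label, elapsed))
    print(f"{label}: {elapsed:.3f}s since start")


def load():  # stand-in for a cheap stage
    time.sleep(0.05)


def process():  # stand-in for the slow stage we are hunting for
    time.sleep(0.20)


load()
checkpoint("after load")
process()
checkpoint("after process")
```

The biggest jump between two consecutive checkpoints marks the segment to look inside next, and then you repeat the same trick within that function.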
-
if you do need more advanced information
Valgrind has a tool called Cachegrind?
-
Callgrind? Cachegrind? One of the two.
-
and this tool lets you run your program and
-
measure how long everything takes and
all of the call stacks, like which
-
function called which function, and what
you end up with is a really neat
-
annotation of your entire program source
with the heat of every line basically
-
how much time was spent there. It does
slow down your program by like an order
-
of magnitude or more, and it doesn't really
support threads but it is really
-
useful if you can use it. If you can't,
then tools like perf or similar tools
-
for other languages that do usually some
kind of sampling profiling like we
-
talked about in the profiler lecture, can
give you pretty useful data quickly,
-
but it's a lot of data, and
they're a little bit
-
biased in what kind of things they usually
highlight as a problem and it
-
can sometimes be hard to extract meaningful
information about what should
-
I change in response to them. Whereas the
sort of print approach very quickly
-
gives you like this section
of code is bad or slow
-
I think would be my answer
-
Flamegraphs are great, they're a good way
to visualize some of this information
-
Yeah I just have one thing to add,
oftentimes programming languages
-
have language specific tools for profiling
so to figure out what's the
-
right tool to use for your language like if
you're doing JavaScript in the web browser
-
the web browser has a really nice tool for
doing profiling you should just use that
-
or if you are using Go, for example, Go has a built-in
profiler that is really good; you should just use that
-
A last thing to add to that
-
Sometimes you might find that doing this binary
search over time that you're kind of
-
finding where the time is going, but this
time is sometimes happening because
-
you're waiting on the network, or you're
waiting for some file, and in that case
-
you want to make sure that the time
that is, if I want to write
-
like 1 gigabyte file or like read 1
gigabyte file and put it into memory
-
you want to check that the actual time
there, is the minimum amount of time
-
you actually have to wait. If it's ten times
longer, you should try to use some
-
other tools that we covered in the debugging
and profiling section to see
-
why you're not utilizing all your
resources because that might...
-
Because that might be a lot of what's happening.
For example, in my research
-
in machine learning workloads, a lot of
time is loading data and you have to
-
make sure well like the time it takes to
load data is actually the minimum amount
-
of time you want to have that happening
-
And to build on that, there are actually
specialized tools for doing things like
-
analyzing wait times. Very often when
you're waiting for something what's
-
really happening is you're issuing a
system call, and that system call takes
-
some amount of time to respond. Like you do
a really large write, or a really large read
-
or you do many of them, and one thing
that can be really handy here is
-
to try to get information out of the
kernel about where your program is
-
spending its time. And so there's (it's
not new), but there's a relatively
-
newly available thing called BPF or eBPF.
Which is essentially kernel tracing
-
and you can do some really cool things with
it, and that includes tracing user programs.
-
It can be a little bit awkward to
get started with, there's a tool
-
called bpftrace that I would recommend
you look into, if you need to do
-
this kind of low-level performance debugging.
But it is really good for this
-
kind of stuff. You can get things like
histograms over how much time was spent
-
in particular system calls
-
It's a great tool
-
What browser plugins do you use?
-
I try to use as few as I can get away with using
-
because I don't like things being in
my browser, but there are a couple of
-
ones that are sort of staples.
The first one is uBlock Origin.
-
So uBlock Origin is one of many ad blockers but
it's a little bit more than an ad blocker.
-
It is (a what do they call it?) a
network filtering tool so it lets
-
you do more things than just block ads.
It also lets you like block connections
-
to certain domains, block connections
for certain types of resources
-
So I have mine set up in what they call
the Advanced Mode, where basically
-
you can disable basically all network requests.
But it's not just Network requests,
-
It's also like I have disabled all inline
scripts on every page and all
-
third-party images and resources, and then
you can sort of create a whitelist
-
for every page so it gives you really
low-level tools around how to
-
how to improve the security of your browsing.
But you can also set it in not the
-
advanced mode, and then it does much of
the same as a regular ad blocker would
-
do, although in a fairly efficient way
if you're looking at an ad blocker it's
-
probably the one to use and it
works on like every browser
-
That would be my top pick I think,
-
I think probably the one I
use like the most actively
-
is one called Stylus. It lets you modify
the CSS or like the stylesheets
-
that webpages have. And it's pretty
neat, because sometimes you're
-
looking at a website and you want
to hide some part of the website
-
you don't care about. Like maybe an ad, maybe
some sidebar you're not finding useful
-
The thing is, at the end of
the day these things are
-
displaying in your browser, and you
have control of what code is
-
executing and similar to what Jon was
saying, like you can customize this
-
to no end, and what I have for a lot of
web pages is rules like 'hide this part', or
-
also trying to make like dark modes for
them like you can change pretty much the
-
color for every single website. And what
is actually pretty neat is that there's
-
like a repository online of people that
have contributed stylesheets
-
for the websites. So someone probably
has (done) one for GitHub
-
Like I want dark GitHub and someone has
already contributed one that makes
-
that much more pleasing to browse. Apart
from that, one that it's not really
-
fancy, but I have found incredibly helpful
is one that just takes a screenshot of an
-
entire website. It will
scroll for you and make
-
a composite image of the entire website and that's
really great for when you're trying to
-
print a website and the result is just terrible.
-
(It's built into Firefox)
-
oh interesting
-
oh, now that you mention built into Firefox,
another one that I really like about
-
Firefox is the multi account containers
-
(Oh yeah, it's fantastic)
-
Which kind of lets you
-
By default a lot of web browsers, like
for example Chrome, have this
-
notion of a session that you
have, where you have all your cookies
-
and they are kind of all shared from the
different websites in the sense of
-
you keep opening new tabs and unless you go into
incognito you kind of have the same profile
-
And that profile is the same for
all websites, there is this
-
Is it an extension or is it built in?
-
(it's a mix, it's complicated)
-
So I think you actually have to say you want
to install it or enable it and again
-
the name is Multi Account Containers and
these let you tell Firefox to have
-
separate isolated sessions. So
for example, you want to say
-
I have separate sessions for whenever I
visit Google or whenever I visit Amazon
-
and that can be pretty neat, because then you can
-
At a browser level it's ensuring that no information
sharing is happening between the two of them
-
And it's much more convenient than
having to open an incognito window
-
where it's gonna clear everything every time
-
(One thing to mention is Stylus vs Stylish)
-
Oh yeah, I forgot about that
-
One important thing is the browser extension
for side loading CSS Stylesheets
-
is called Stylus and that's different
from the older one that was
-
called Stylish, because that one got
bought at some point by some shady
-
company, that started abusing it not only to have
-
that functionality, but also to read your
entire browser history and send that
-
back to their servers so they could data mine it.
-
So, then people just built this open-source alternative
that is called Stylus, and that's the one
-
we recommend. That said, I think the repository
for styles is the same for the
-
two of them, but I would have
to double check that.
-
Do you have any browser plugins Anish?
-
Yes, so I also have some recommendations
for browser plugins
-
I also use uBlock Origin and I also use Stylus,
-
but one other one that I'd recommend is
integration with a password manager
-
So this is a topic that we have in
the lecture notes for the security
-
lecture, but we didn't really get to talk
about in detail. But basically password
-
managers do a really good job of increasing
your security when working
-
with online accounts, and having browser
integration with your password manager
-
can save you a lot of time like you
can open up a website then it can
-
autofill your login information for you
so you don't have to go and copy and paste it
-
back and forth from a separate program,
as you would if it's not integrated with your
-
web browser, and it can also, this integration,
can save you from certain
-
attacks that would otherwise be possible if
you were doing this manual copy pasting.
-
For example, phishing attacks. So
you find a website that looks very
-
similar to Facebook and you go to log in
with your facebook login credentials and
-
you go to your password manager and copy
paste the correct credentials into this
-
funny web site and now all of a sudden
it has your password but if you have
-
browser integration then the extension
can automatically check
-
like, "am I on facebook.com, or
is it some other domain
-
that maybe looks similar?" and it will not enter
the login information if it's the wrong domain
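
The kind of check described here can be sketched in a few lines of Python. Everything in it is an assumption made for illustration: SAVED_LOGINS, the placeholder credentials, and the credentials_for helper are invented, and real password managers use more careful matching rules than this.

```python
from urllib.parse import urlsplit

# Hypothetical vault: saved domain -> (username, password) placeholders.
SAVED_LOGINS = {"facebook.com": ("user@example.com", "not-a-real-password")}


def credentials_for(url):
    """Return the saved login only if the page's host really is a saved domain."""
    host = (urlsplit(url).hostname or "").lower()
    for domain, login in SAVED_LOGINS.items():
        # Exact match, or a true subdomain such as www.facebook.com.
        if host == domain or host.endswith("." + domain):
            return login
    return None  # lookalike or unknown domain: fill in nothing


print(credentials_for("https://www.facebook.com/login"))
print(credentials_for("https://faceb00k.com/login"))
print(credentials_for("https://facebook.com.evil.example/login"))
```

A person copy-pasting by hand is comparing how the page looks; the extension is comparing the host name, which a lookalike page cannot fake.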
-
so browser extension for
password managing is good
-
Yeah I agree
-
Next question
-
What are other useful data wrangling tools?
-
So in yesterday's lecture, I mentioned curl, so
curl is a fantastic tool for just making web
-
requests and dumping them to your terminal.
You can also use it for things
-
like uploading files which is really handy.
-
In the exercises of that lecture we also talked about
jq and pup which are command line tools that let you
-
basically write queries over JSON
and HTML documents respectively
-
that can be really handy. Other
data wrangling tools?
-
Ah Perl, the Perl programming language is
-
often referred to as a write only
programming language because it's
-
impossible to read even if you wrote it.
But it is fantastic at doing just like
-
straight up text processing, like nothing
beats it there, so maybe worth learning
-
some very rudimentary Perl just
to write some of those scripts
-
It's easier often than writing some like hacked-up
combination of grep and awk and sed,
-
and it will be much faster to just hack something
up than writing it up in Python, for example
-
but apart from that, other data wrangling
-
No, not off the top of my head really
-
column -t, if you pipe any white space separated
-
input into column -t it will align all
the white space of the columns so that
-
you get nicely aligned columns, and
head and tail but we talked about those
-
I think a couple of additions to that,
that I find myself using commonly
-
one is vim. Vim can be pretty useful
for data wrangling by itself
-
Sometimes you might find that the operation
that you're trying to do is
-
hard to put down in terms of piping
different operators but if you
-
can just open the file and just record
-
a couple of quick vim macros to do what you
want it to do, it might be like much,
-
much easier. That's one, and then the other
one, if you're dealing with tabular
-
data and you want to do more complex operations
like sorting by one column,
-
then grouping and then computing some sort
of statistic, I think a lot of that
-
workload I ended up just using Python
and pandas because it's built for that
-
And one of the pretty neat features that
I find myself also using is that it
-
will export to many different formats.
So this intermediate state
-
has its own kind of pandas dataframe
object but it can
-
export to HTML, LaTeX, a lot of different
like table formats so if your end
-
product is some sort of summary table, then pandas
I think it's a fantastic choice for that
-
I would second the vim and also
Python I think those are
-
two of my most used data wrangling tools.
For the vim one, last year we had a demo
-
in the lecture notes for the series, but
we didn't cover it in class we had a
-
demo of turning an XML file into a JSON version
of that same data using only vim macros
-
And I think that's actually the
way I would do it in practice
-
I don't want to go find a tool that does
this conversion; it is actually simple
-
to encode as a vim macro,
so I just do it that way
-
And then also Python especially in an interactive
tool like a Jupyter notebook
-
is a really great way of doing data wrangling
-
A third tool I'd mention which
I don't remember if we
-
covered in the data wrangling
lecture or elsewhere
-
is a tool called pandoc which can do transformations
between different text
-
document formats so you can convert from
plaintext to HTML or HTML to markdown
-
or LaTeX to HTML or many other formats
it actually supports a large
-
list of input formats and a
large list of output formats
-
I think there's one last one which I mentioned briefly
in the lecture on data wrangling which is
-
the R programming language, it's
an awful (I think it's an awful)
-
language to program in. And I would never
use it in the middle of a data wrangling
-
pipeline, but at the end, in order to like produce
pretty plots and statistics R is great
-
Because R is built for doing
statistics and plotting
-
there's a library for R called
ggplot which is just amazing
-
ggplot2, I guess, technically. It's
great, it produces very
-
nice visualizations and it lets you do,
it does very easily do things like
-
If you have a data set that has like multiple
facets like it's not just X and Y
-
it's like X Y Z and some other variable,
and then you want to plot like the
-
throughput grouped by all of those parameters
at the same time and produce
-
a visualization. R very easily lets you
do this and I haven't seen anywhere
-
that lets you do that as easily
-
Next question,
-
What's the difference between
Docker and a virtual machine
-
What's the easiest way to explain this? So Docker
-
starts something called containers and
Docker is not the only program that
-
starts containers. There are many others
and usually they rely on some feature of
-
the underlying kernel. In the case of
Docker, early versions used something called LXC,
-
which are Linux containers (newer versions use the same kernel features directly), and the basic
premise there is if you want to start
-
what looks like a virtual machine that
is running roughly the same operating
-
system as you are already running on your
computer then you don't really need
-
to run another instance of the kernel
really that other virtual machine can
-
share a kernel. And you can just use the
kernel's built-in isolation mechanisms to
-
spin up a program that thinks it's
running on its own hardware but in
-
reality it's sharing the kernel and so this
means that containers can often run
-
with much lower overhead than a full virtual
machine will do but you should
-
keep in mind that it also has somewhat weaker
isolation because you are sharing
-
a kernel between the two if you spin up
a virtual machine the only thing that's
-
shared is sort of the hardware and to
some extent the hypervisor, whereas
-
with a docker container you're sharing
the full kernel and that is a
-
different threat model that you
might have to keep in mind
-
One other small note there: as Jon pointed
out, to use containers with something
-
like Docker you need the underlying operating
system to be roughly the same
-
as whatever the program that's running
on top of the container expects and so
-
if you're using macOS for example, the
way you use docker is you run Linux
-
inside a virtual machine and then you can
run Docker on top of Linux so maybe
-
if you're going for containers in order
to get better performance, so you're trading
-
isolation for performance, then if you're running
on macOS that may not work out
-
exactly as expected
-
And one last note is that there
is a slight difference, so
-
with Docker and containers,
one of the gotchas you have
-
to be familiar with is that containers
are more similar to virtual
-
machines in the sense that they will
persist all the storage that you
-
have, whereas Docker by default won't have that.
-
Like Docker is supposed to be running
-
So the main idea is like I want
to run some software and
-
I get the image and it runs and if you
want to have any kind of persistent
-
storage that links to the host system
you have to kind of manually specify
-
that, whereas a virtual machine is using
some virtual disk that is being provided
-
Next question
-
What are the advantages of each operating system
-
and how can we choose between them?
For example, choosing the best Linux
-
distribution for our purposes
-
I will say that for many, many tasks the
-
specific Linux distribution that you're
running is not that important
-
the thing is, it's just what kind of
-
knowing that there are different types
or like groups of distributions,
-
So for example, there are some distributions
that have really frequent updates
-
but they kind of break more easily. So for
example Arch Linux has a rolling update
-
way of pushing updates, where things might
break but they're fine with the things
-
being that way. Whereas maybe if you
have some really important web server
-
that is hosting all your business
analytics you want that thing
-
to have like a much more steady way of
updates. So that's for example why you
-
will see distributions like Debian being
-
much more conservative about what they push, or
even for example Ubuntu makes a difference
-
between the Long Term Support releases
that they only put out every
-
two years and the more frequent
releases, of which there are
-
like two a year that they make.
So, kind of knowing that there's that
-
difference. Apart from that, some distributions
have different ways
-
of providing the binaries
to you and the way they
-
have the repositories. So I think a lot of Red
Hat based distributions don't want non-free drivers in
-
their official repositories, whereas I
think Ubuntu is fine with some of
-
them. Apart from that I think like just
a lot of what is core to most Linux
-
distros is kind of shared between them
and there's a lot of learning in the
-
common ground. So you don't have
to worry about the specifics
-
Keeping with the theme of this class being somewhat
opinionated, I'm gonna go ahead and say
-
that if you're using Linux especially for
the first time choose something like
-
Ubuntu or Debian. So Ubuntu is a
Debian based distribution but maybe is a
-
little bit more friendly, Debian is a little
bit more minimalist. I use Debian
-
on all my servers, for example. And I use
Debian desktop on my desktop computers
-
that run Linux. If you're going for maybe
trying to learn more things and you want
-
a distribution that trades stability for
having more up-to-date software maybe
-
at the expense of you having to fix a
broken distribution every once in a
-
while then maybe you can consider something
like Arch Linux or Gentoo
-
or Slackware. Oh man, I'd say that like
if you're installing Linux and just like
-
want to get work done Debian is a great choice
-
Yeah I think I agree with that.
-
The other observation is like
you could install BSD.
-
BSD has gotten, has come a long way from
where it was. There's still a bunch of
-
software you can't really get for BSD but
it gives you a very well-documented
-
experience and one thing that's different
about BSD compared to Linux is
-
that in BSD, when you install BSD, you
get a full operating system, mostly
-
So many of the programs are maintained by
the same team that maintains the kernel
-
and everything is sort of upgraded together,
which is a little different
-
than how things work in the Linux world. It does
-
mean that things often move a little bit
slower. I would not use it for things
-
like gaming either, because driver support
is meh. But it is an interesting
-
environment to look at. And then for things
like macOS and Windows I think
-
If you are a programmer, I don't know why
you are using Windows unless you are
-
building things for Windows; or you want
to be able to do gaming and stuff
-
but in that case, maybe try dual booting,
even though that's a pain too
-
macOS is a good sort of middle point
between the two where you get a system
-
that is like relatively nicely polished
for you. But you still have access to
-
some of the lower-level bits
at least to a certain extent.
-
It's also really easy to dual boot macOS and Windows;
it is not quite the case with like macOS and
-
Linux or Linux and Windows
-
Alright, for the rest of the
questions so these are
-
all 0 upvote questions so maybe we can go
through them quickly in the last five
-
or so minutes of class. So the next
one is Vim versus Emacs? Vim!
-
Easy answer, but a more serious answer is like I think
all three of us use vim as our primary editor
-
I use Emacs for some research specific
stuff which requires Emacs but
-
at a higher level both editors have interesting
ideas behind them and if you
-
have the time it's worth exploring both
to see which fits you better and also
-
you can use Emacs and run it in a vim
emulation mode. I actually know a
-
good number of people who do that so
they get access to some of the cool
-
Emacs functionality and some of the cool
philosophy behind that like Emacs is
-
programmable through Lisp which is kind of cool.
-
Much better than vimscript, but people like
vim's modal editing, so there's an
-
emacs plugin called evil mode which gives
you vim modal editing within Emacs so
-
it's not necessarily a binary choice you
can kind of combine both tools if you
-
want to. And it's worth exploring
both if you have the time.
-
Next question
-
Any tips or tricks for machine
learning applications?
-
I think, like, knowing how to use
-
a lot of these tools, mainly the data wrangling
-
and a lot of the shell tools, is really
important because it seems a lot
-
of what you're doing as a machine learning
researcher is trying different things
-
but I think one core aspect of doing that,
and like a lot of scientific work is being
-
able to have reproducible results
and logging them in a sensible way
-
So for example, instead of trying to come
up with really hacky solutions of how
-
you name your folders to make
sense of the experiments
-
Maybe it's just worth having for example
-
what I do is have like a JSON
file that describes the
-
entire experiment I know like all the parameters
that are within and then I can
-
really quickly, using the tools that
we have covered, query for all the
-
experiments that have some specific
purpose or use some data set
-
Things like that. Apart from that, the other
side of this is, if you are running
-
kind of things for training machine
learning applications and you
-
are not already using some sort of
cluster, like one your university or your
-
company is providing and you're just kind
of manually sshing, like a lot of
-
labs do, because that's kind of the easy way
-
It's worth automating a lot of that job
because it might not seem like it but
-
manually doing a lot of these operations
takes away a lot of your time and also
-
kind of your mental energy
for running these things
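
The idea of one JSON file per experiment that you can query later could be sketched roughly like this. The directory layout (one folder per run holding a `config.json`), the helper name `find_experiments`, and the keys `dataset` and `lr` are all assumptions made up for the illustration, not something from the lecture.

```python
import json
import tempfile
from pathlib import Path


def find_experiments(root, **wanted):
    """Return names of run directories under root whose config.json
    contains every requested key/value pair."""
    matches = []
    for config_path in sorted(Path(root).glob("*/config.json")):
        config = json.loads(config_path.read_text())
        if all(config.get(key) == value for key, value in wanted.items()):
            matches.append(config_path.parent.name)
    return matches


# Demo with throwaway data: three hypothetical runs in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    configs = {
        "run1": {"dataset": "mnist", "lr": 0.1},
        "run2": {"dataset": "cifar10", "lr": 0.1},
        "run3": {"dataset": "mnist", "lr": 0.01},
    }
    for name, cfg in configs.items():
        run_dir = Path(root) / name
        run_dir.mkdir()
        (run_dir / "config.json").write_text(json.dumps(cfg))

    # Query for every experiment that used a particular data set.
    mnist_runs = find_experiments(root, dataset="mnist")
    print(mnist_runs)
```

Keeping the parameters in a file next to the results means the folder name no longer has to encode them, which is the hacky naming scheme the answer warns against.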
-
Any more vim tips?
-
I have one. So in the vim lecture we tried
not to link you to too many different
-
vim plugins because we didn't want that
lecture to be overwhelming but I think
-
it's actually worth exploring vim plugins
because there are lots and lots
-
of really cool ones out there.
One resource you can use is the
-
different instructors' dotfiles. Like a lot
of us, I think I use like two dozen
-
vim plugins and I find a lot of them quite
helpful and I use them every day
-
we all use slightly different subsets of
them. So go look at what we use or look
-
at some of the other resources we've linked
to and you might find some stuff useful
-
A thing to add to that is, I don't think
we went into a lot of detail in the
-
lecture, correct me if I'm wrong. It's
getting familiar with the leader key
-
Which is kind of a special key
that a lot of programs,
-
especially plugins, will bind to,
and for a lot of the common operations
-
vim has short ways of doing it, but you
can just figure out like quicker
-
versions for doing them. So for example, like
I know that you can do like colon WQ
-
to save and exit or that you
can do like capital ZZ but I
-
just actually just do leader (which for
me is the space) and then W. And I have
-
done that for a lot of kind of
common operations that I keep doing all
-
the time. Because just saving one keystroke
for an extremely common operation
-
is just saving thousands a month
-
Yeah just to expand a little bit
-
on what the leader key is so in vim you
can bind some keys I can do like ctrl J
-
does something like holding one key and
then pressing another I can bind that to
-
something or I can bind a single keystroke
to something. What the leader
-
key lets you do, is bind
-
So you can assign any key
to be the leader key and
-
then you can assign leader followed by
some other key to some action so for
-
example like Jose's leader key is space
and they can bind space and then
-
releasing space followed by some other
key to an arbitrary vim command so it
-
just gives you yet another way of binding
like a whole set of key combinations.
-
Leader key plus kind of any key on
the keyboard to some functionality
-
I forget whether
we covered macros in the vim
-
lecture, but vim macros are worth
learning they're not that complicated
-
but knowing that they're there and knowing
how to use them is going to save
-
you so much time. The other one is something
called marks. So in vim you can
-
press m and then any letter on your keyboard
to make a mark in that file and
-
then you can press apostrophe and the
same letter to jump back to the same
-
place. This is really useful if you're
like moving back and forth
-
between two different parts of your code
for example. You can mark one as A and
-
one as B and you can then jump between
them with tick A and tick B.
-
There's also Ctrl+O which jumps to the previous
place you were in the file no matter
-
what caused you to move. So for example
if I am on some line and then I jump
-
to B and then I jump to A, Ctrl+O will
take me back to B and then back to the
-
place I originally was. This can also be
handy for things like if you're doing a
-
search then the place that you
started the search is a part of
-
that stack. So I can do a search I can
then like step through the results
-
and like change them and then Ctrl+O
all the way back up to the search
-
Ctrl+O also lets you move across files so
if I go from one file to somewhere else in a
-
different file and somewhere else in the
first file Ctrl+O will move me back
-
through that stack and then there's
Ctrl+I to move forward in that
-
stack and so it's not as though you
pop it and it goes away forever
-
The command colon earlier is really handy.
So, colon earlier gives you an earlier
-
version of the same file and it does
this based on time not based on actions
-
so for example if you press a bunch of like
undo and redo and make some changes
-
and stuff, earlier will take a literally
earlier as in time version of your file
-
and restore it to your buffer. This can
sometimes be good if you like undid and
-
then rewrote something and then realize
you actually wanted the version that was
-
there before you started undoing; earlier
lets you do this. And there's a plug-in
-
called undo tree or something like
that. There are several of these
-
that let you actually explore the full
tree of undo history that vim keeps
-
because it doesn't just keep a linear history
it actually keeps the full tree
-
and letting you explore that might in
some cases save you from having to
-
re-type stuff you typed in the past or
stuff you just forgot exactly what you
-
had there that used to work and no longer
works. And there is one final one I
-
want to mention which is, we mentioned
how in vim you have verbs and nouns
-
right, so you have verbs like delete or yank
and then you have nouns like next of
-
this character or percent to swap brackets
and that sort of stuff the
-
search command is a noun so you can do
things like D slash and then a string
-
and it will delete up to the next match
of that pattern this is extremely useful
-
and I use it all the time
-
One other neat addition on the undo stuff
that I find incredibly valuable on
-
an everyday basis is that like one of
the built-in functionalities of vim
-
is that you can specify an undo directory.
By default, if you have not specified an
-
undo directory, that is, if you
don't have this enabled, whenever you
-
enter a file your undo history is
clean, there's nothing in there
-
and as you make changes and then
undo them you kind of create this
-
history but as soon as you exit the
file that's lost. Sorry, as soon
-
as you exit vim, that's lost. However,
if you have an undodir (and undofile enabled), vim is
-
gonna persist all those changes into
this directory so no matter how many
-
times you enter and leave that history
is persisted and it's incredibly
-
helpful because even like
-
it can be very helpful for
some files that you modify
-
often because then you can kind of keep
the flow. But it's also sometimes really
-
helpful if you modify your bashrc and
something broke like five days later and
-
then you open vim again, like, what actually
did I change? If you don't
-
have say like version control, then
you can just check the undos and
-
that's actually what happened. And
the last one, it's also really
-
worth familiarizing yourself with registers
and what different special
-
registers vim uses. So for example if
you want to copy/paste, really that's
-
going into a specific register, and if you
want to, for example, use
-
the OS clipboard, you should
be copying or yanking,
-
copying and pasting, from a different register,
and there's a lot of them and yeah
-
I think that you should explore, there's
a lot of things to know about registers
-
The next question is asking about two-factor
authentication and I'll just give
-
a very quick answer to this one in the interest
of time. So it's worth using two
-
factor auth for anything security sensitive
so I use it for my GitHub
-
account and for my email and stuff like
that. And there's a bunch of different
-
types of two-factor auth, from SMS based
two-factor auth, where you get a special
-
number texted to you when you try
to log in and you have to type that number,
-
to other tools like Universal 2nd
Factor (U2F), this is like those YubiKeys
-
that you plug into your computer and you have
to tap it every time you log in
-
so not all, (yeah Jon is holding a
Yubikey), not all two-factor auth is
-
created equal and you really want to be
using something like U2F rather than SMS
-
based two-factor auth. There's also something
based on one-time pass codes that you
-
have to type in we don't have time to get
into the details of why some methods
-
are better than others but at a high
level use U2F and the Internet has
-
plenty of explanations for why other
methods are not a great idea
-
Last question, any comments on differences
between web browsers?
-
Yes
-
Differences between web browsers, there
are fewer and fewer differences between
-
web browsers these days. At this point
almost all web browsers are Chrome
-
Either because you're using Chrome or
because you're using a browser that's
-
using the same browser engine as Chrome.
It's a little bit sad, one might say, but
-
I think these days whether you choose
-
Chrome is a great browser for security reasons
-
if you want to have something
that's more customizable or
-
you don't want to be tied to Google then
use Firefox, don't use Safari it's a
-
worse version of Chrome. The new Internet
Explorer, Edge, is pretty decent and also
-
uses the same browser engine as
Chrome and that's probably fine
-
although avoid it if you can because it
has some like legacy modes you don't
-
want to deal with. I think that's
-
Oh, there's a cool new browser called Flow
-
that you can't use for anything useful
yet but they're actually writing
-
their own browser engine and that's really neat
-
Firefox also has this project called Servo, where
they're reimplementing their browser engine
-
in Rust in order to write it to be like
super concurrent and what they've done
-
is they've started to take modules
from that version and port them
-
over to Gecko or integrate them with Gecko
which is the main browser engine
-
for Firefox just to get those
speed ups there as well
-
and that's a neat thing
you can be watching out for
-
That is all the questions, hey we did it. Nice
-
I guess thanks for taking the Missing Semester
class and let's do it again next year