-
all right so welcome to today's lecture
-
which is going to be on data wrangling
-
and data wrangling might be a phrase that
-
sounds a little bit odd to you but the
-
basic idea of data wrangling is that you
-
have data in one format and you want it
-
in some different format and this
-
happens all of the time I'm not just
-
talking about like converting images but
-
it could be like you have a text file or
-
a log file and what you really want is this
-
data in some other format like you want
-
a graph or you want statistics over the
-
data anything that goes from one piece
-
of data to another representation of
-
that data is what I would call data
-
wrangling we've seen some examples of
-
this kind of data wrangling already
-
previously in the semester like
-
basically whenever you use the pipe
-
operator that lets you sort of take
-
output from one program and feed it
-
through another program you are doing
-
data wrangling in one way or another but
-
what we're going to do in this lecture is
-
take a look at some of the fancier ways
-
you can do data wrangling and some of
-
the really useful ways you can do data
-
wrangling in order to do any kind of
-
data wrangling though you need a data
-
source you need some data to operate on
-
in the first place and there are a lot
-
of good candidates for that kind of data
-
we give some examples in the exercise
-
section for today's lecture notes in
-
this particular one though I'm going to
-
be using a system log so I have a server
-
that's running somewhere in the Netherlands
-
because that seemed like a reasonable
-
thing at the time and on that server
-
it's running sort of a regular logging
-
daemon that comes with systemd
-
it's a sort of relatively standard Linux
-
logging mechanism and there's a command
-
called journalctl on Linux systems that
-
will let you view the system log and so
-
what I'm gonna do is I'm gonna do some
-
transformations over that log and see if
-
we can extract something interesting
-
from it you'll see though that if I run
-
this command I end up with a lot of data
-
because this is a log that has just like
-
there's a lot of stuff in it right a lot
-
of things have happened on my server and
-
this goes back to like January first and
-
there are logs that go even further back on
-
this there's a lot of stuff so the first
-
thing we're gonna do is try to limit it
-
down to only
-
one piece of content and here the grep
-
command is your friend so we're gonna
-
pipe this through grep and we're gonna
-
grep for ssh right so SSH we haven't
-
really talked to you about yet but it is
-
a way to access computers remotely
-
through the command line and in
-
particular what happens when you put a
-
server on the public Internet is that
-
lots and lots of people around the world
-
try to connect to it and log in and
-
take over your server and so I want to
-
see how those people are trying to do
-
that and so I'm going to grep for SSH
-
and you'll see pretty quickly that this
-
also generates a bunch of content at
-
least in theory this is gonna be real
-
slow there we go so this generates tons
-
and tons and tons of content and it's
-
really hard to even just visualize
-
what's going on here so let's look at
-
only what user names people have used to
-
try to log into my server so you'll see
-
some of these lines say
-
disconnected from invalid user and then
-
some user name I want only those lines
-
that's all I really care about I'm gonna
-
make one more change here though which
-
is if you think about what this pipeline
-
does if I here add disconnected from so
-
this pipeline at the bottom here what
-
that will do is it will send the entire
-
log file over the network to my machine
-
and then locally run grep to find only
-
the lines that contain ssh and then
-
locally filter them further this seems a
-
little bit wasteful because i don't care
-
about most of these lines and the remote
-
site is also running a shell so what I
-
can actually do is I can have that
-
entire command run on the server right
-
so I'm telling SSH the command I
-
want you to run on the server is this
-
pipeline of three things and then what I
-
get back I want to pipe through less so
-
what does this do well it's gonna do
-
that same filtering that we did but it's
-
gonna do it on the server side and the
-
server is only going to send me those
-
lines that I care about and then when I
-
pipe it locally through the program
-
called less less is a pager you'll see
-
some examples of this you've actually
-
seen some of them already like when you
-
type man and some command that opens in
-
a pager and a pager is a convenient way
-
to take a long piece of content and fit
-
it into your terminal
-
window and let you scroll down and
-
scroll up and navigate it so that it
-
doesn't just like scroll past your
-
screen and so if I run this it still
-
takes a little while because it has to
-
parse through a lot of log files and in
-
particular grep is buffering and
-
therefore it decides to be relatively
-
unhelpful
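A minimal sketch of the server-side filtering described above. The real invocation needs a reachable host, so it appears only as a comment; `myserver` is a placeholder hostname, and `fake_journal` is a hypothetical helper that prints made-up log lines standing in for the system log.

```shell
# Real invocation (placeholder hostname, not the one used in the lecture):
#   ssh myserver 'journalctl | grep ssh | grep "Disconnected from"' | less
# Everything inside the single quotes runs on the server, so only the
# matching lines travel over the network.

# Local stand-in so the sketch runs without a server. The log lines below
# are invented for illustration.
fake_journal() {
  cat <<'EOF'
Jan 01 00:00:01 example-host systemd[1]: Started Daily Cleanup of Temporary Directories.
Jan 01 00:00:02 example-host sshd[101]: Disconnected from invalid user admin 192.0.2.10 port 40001 [preauth]
Jan 01 00:00:03 example-host sshd[102]: Connection closed by 192.0.2.11 port 40002 [preauth]
Jan 01 00:00:04 example-host sshd[103]: Disconnected from authenticating user root 192.0.2.12 port 40003 [preauth]
EOF
}

# Same two filters as in the lecture: keep ssh lines, then keep disconnects.
filtered=$(fake_journal | grep ssh | grep "Disconnected from")
printf '%s\n' "$filtered"
```

Piping the result through `less` instead of printing it gives the scrollable view the lecture describes.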
-
I may do this without let's see if
-
that's more helpful why doesn't it want
-
to be helpful to me fine I'm gonna cheat
-
a little just ignore me
-
or the internet is really slow those are
-
two possible options luckily there's a
-
fix for that because previously I have
-
run the following command so this
-
command just takes the output of that
-
command and sticks it into a file
-
locally on my computer alright so I ran
-
this when I was up in my office and so
-
what this did is it downloaded all of
-
the SSH log entries that matched
-
disconnected from so I have those locally
-
and this is really handy right there's
-
no reason for me to stream the full log
-
every single time because I know that
-
that starting pattern is what I'm going
-
to want anyway so we can take a look at
-
ssh.log and you will see there are
-
lots and lots and lots of lines that all
-
say disconnected from invalid user
-
authenticating users etc right so these
-
are the lines that we have to work on
-
and this also means that going forward
-
we don't have to go through this whole
-
SSH process we can just cat that file
-
and then operate on it directly so
-
here I can also demonstrate this pager
-
so if I do cat ssh.log
-
and I pipe it through less it gives me a
-
pager where I can scroll up and down
-
make that a little bit smaller maybe so
-
I can scroll through
-
this file and I can do so with what are
-
roughly vim bindings so control U to
-
scroll up control D to scroll down and
-
q to exit this is still a lot of
-
content though and these lines contain a
-
bunch of garbage that I'm not really
-
interested in what I really want to see
-
is what are what are these user names
-
and here the tool that we're going to
-
start using is one called sed sed is a
-
stream editor it's
-
a modification of a much earlier program
-
called ed which was a really weird
-
editor that none of you will probably
-
want to use yeah Oh tsp is the name of
-
the remote computer I'm connecting to
-
so sed is a stream editor and it
-
basically lets you make changes to the
-
contents of a stream you can think of it
-
a little bit like doing replacements but
-
it's actually a full programming
-
language
-
over the stream that is given one of the
-
most common things you do with sed
-
though is to just run replacement
-
expressions on an input stream what do
-
these look like well let me show you
-
here I'm gonna pipe this to sed and
-
I'm going to say that I want to remove
-
everything that comes before
-
disconnected from so this might look a
-
little weird the observation is that the
-
date and the host name and the sort of
-
process ID of the SSH daemon I don't
-
care about I can just remove that
-
straightaway and I can also remove that
-
like disconnected from bit because that
-
seems to be present in every single log
-
entry so I just want to get rid of it
-
and so what I write is a sed expression
-
in this particular case it's an S
-
expression which is a substitute
-
expression it takes two arguments that
-
are basically enclosed in these slashes
-
so the first one is the search string
-
and the second one which is currently
-
empty is a replacement string so here
-
I'm saying search for the following
-
pattern and replace it with blank and
-
then I'm gonna pipe it into less at the
-
end do you see that now what it's done
-
is trim off the beginning of all these
-
lines and that seems really handy but
-
you might wonder what is this pattern
-
that I've built up here right this is
-
this dot star what does that mean this
-
is an example of a regular expression
-
and regular expressions are something
-
that you may have come across in
-
programming in the past
-
but it's something that once you go into
-
the command line you will find yourself
-
using a lot especially for this kind of
-
data wrangling regular expressions are
-
essentially a powerful way to match text
-
you can use it for other things than
-
text too but text is the most common
-
example and in regular expressions you
-
have a number of special characters that
-
say don't just match this character but
-
match for example a particular type of
-
character or a particular set of options
-
it essentially generates a program for
-
you that searches the given text dot for
-
example means any single
-
character and star if you follow a
-
character with a star it means zero or
-
more of that character and so in this
-
case this pattern is saying zero or more
-
of any character followed by the literal
-
string disconnected from I'm saying
-
match that and then replace it with
-
blank regular expressions have a number
-
of these kind of special characters that
-
have various meanings you can take
-
advantage of I talked about star which
-
is zero or more
-
there's also Plus which is one or more
-
right so this is saying I want the
-
previous expression to match at least
-
once
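A small sketch of the substitution just described, run on one invented log line (the host, PID, and address are made up). It also contrasts star with plus; `-E` selects extended regular expression syntax, where `+` needs no backslash.

```shell
# One made-up log line in the shape the lecture describes.
line='Jan 01 00:00:02 example-host sshd[101]: Disconnected from invalid user admin 192.0.2.10 port 40001 [preauth]'

# s/SEARCH/REPLACEMENT/ with an empty replacement: drop everything up to
# and including "Disconnected from ".
trimmed=$(printf '%s\n' "$line" | sed 's/.*Disconnected from //')
printf '%s\n' "$trimmed"
# prints: invalid user admin 192.0.2.10 port 40001 [preauth]

# a* accepts zero a's, so "ct" matches; a+ needs at least one, so it does not.
star_result=$(printf 'ct\n' | sed -E 's/ca*t/MATCH/')   # MATCH
plus_result=$(printf 'ct\n' | sed -E 's/ca+t/MATCH/')   # ct (unchanged)
printf '%s %s\n' "$star_result" "$plus_result"
```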
-
you also have square brackets so square
-
brackets let you match one of many
-
different characters so here let us
-
build up a string something like
-
aba and I want to substitute a and b with
-
nothing okay so here what I'm telling
-
the pattern to do is to replace any
-
character that is either A or B with
-
nothing so if I make the first character
-
B it will still produce BA you might
-
wonder though why did it only replace
-
once well it's because what regular
-
expressions will do especially in this
-
default mode is they will just match the
-
pattern once and then apply the
-
replacement once per line that is what
-
sed normally does you can provide the g
-
modifier which says do this as many
-
times as it keeps matching which in this
-
case would erase the entire line because
-
every single character is either an A or
-
a b if I added a c here it would remove
-
everything but the C if I added other
-
characters in the middle of this string
-
somewhere they would all be preserved
-
but anything that is an a or a b is
-
removed you can also do things like add
-
modifiers to this for example
-
what would this do this is saying I want
-
zero or more of the string a B and I'm
-
gonna replace them with nothing
-
this means that if I have a standalone a
-
it will not be replaced if I have a
-
standalone B it will not be replaced but
-
if I have the string a B it will be
-
removed which yeah sed is
-
stupid
-
the -E here is because sed is a really
-
old tool and so it supports only a very
-
old version of regular expressions
-
generally you will want to run it with
-
-E which makes it use a more
-
modern syntax that supports more things
-
if you are in a place where you can't
-
you have to prefix these with
-
backslashes to say I want the special
-
meaning of parentheses otherwise they
-
will just match a literal parenthesis
-
which is probably not what you want so
-
notice how this replaced the a B here
-
and it replaced the a be here but it
-
left this C and it also left the a at
-
the end because that a does not match
-
this pattern anymore and you can group
-
these patterns in whatever ways you want
-
you also have things like alternations
-
you can say anything that matches a b or
-
b c i want to remove and here you'll
-
notice that this a b got removed this bc
-
did not get removed even though it
-
matches the pattern because the a b had
-
already been removed this a b is removed
-
right but the c stays in place this a b
-
is removed and this c stays because it
-
still does not match that if I made this
-
if I remove this a then now this a B
-
pattern will not match this B so it'll
-
be preserved and then BC will match BC
-
and it'll go away
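The character-class, `g`, grouping, and alternation examples from this stretch can be replayed as follows. The input strings are reconstructions from the spoken description rather than a copy of what was on screen, so treat them as illustrative.

```shell
# [ab] matches a single character that is either a or b.
once=$(echo 'aba' | sed 's/[ab]//')        # ba  (only the first match is replaced)
first_b=$(echo 'bba' | sed 's/[ab]//')     # ba  (a leading b is replaced just the same)
every=$(echo 'aba' | sed 's/[ab]//g')      # empty: g repeats the replacement along the line
keep_c=$(echo 'abcab' | sed 's/[ab]//g')   # c   (characters outside the class survive)

# Parentheses group; with -E they need no backslashes.
group=$(echo 'abcaba' | sed -E 's/(ab)*//g')       # ca  (the lone trailing a is not "ab")
escaped=$(echo 'abcaba' | sed 's/\(ab\)*//g')      # ca  (same pattern in the older syntax)

# Alternation: ab or bc, repeated.
alt=$(echo 'abcababc' | sed -E 's/(ab|bc)*//g')    # cc
alt2=$(echo 'abcabbc' | sed -E 's/(ab|bc)*//g')    # c

printf '%s\n' "$once" "$first_b" "$every" "$keep_c" "$group" "$escaped" "$alt" "$alt2"
```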
-
regular expressions can be all sorts of
-
complicated when you first encounter
-
them and even once you get more
-
experience with them they can be
-
daunting to look at and this is why very
-
often you want to use something like a
-
regular expression debugger which we'll
-
look at in a little bit but first let's
-
try to make up a
-
pattern that will match the logs
-
that we've been working
-
with so far so here I'm gonna just sort
-
of extract a couple of lines from this
-
file let's say the first five so these
-
lines all now look like this right and
-
what we want to do is we want to only
-
have the user name okay so what might
-
this look like well here's one thing we
-
could try to do actually let me show you
-
one other thing first let me take a
-
line that says something like
-
disconnected from invalid user
-
disconnected from maybe four to one one
-
whatever okay so this is an example of a
-
login line where someone tried to login
-
with the username disconnected from
-
missing an S disconnected thank you
-
you'll notice that this actually removed
-
the username as well and this is because
-
when you use dot star and any of these
-
sort of range expressions in regular
-
expressions they are greedy they will
-
match as much as they can so in this
-
case this was the username that we
-
wanted to retain but this pattern
-
actually matched all the way up until
-
the second occurrence of it or the last
-
occurrence of it and so everything
-
before it including the username itself
-
got removed and so we need to come up
-
with a slightly cleverer matching
-
strategy than just saying sort of dot
-
star because it means that if we have
-
particularly adversarial input we might
-
end up with something that we didn't
-
expect okay so let's see how we might
-
try to match these lines let's just do a
-
head first well let's try to construct
-
this up from the beginning we first of
-
all know that we want - capital e right
-
because we want to not have to put all
-
these back slashes everywhere
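The greedy-matching pitfall mentioned above is easy to reproduce. This sketch uses one invented log line in which the attacker chose the username "Disconnected from"; the host, PID, and address are made up.

```shell
# Made-up line whose username is itself "Disconnected from".
tricky='Jan 01 00:00:05 example-host sshd[104]: Disconnected from invalid user Disconnected from 192.0.2.13 port 40004 [preauth]'

# .* is greedy, so it runs up to the LAST "Disconnected from " on the line
# and the username is deleted along with the prefix.
greedy=$(printf '%s\n' "$tricky" | sed 's/.*Disconnected from //')
printf '%s\n' "$greedy"
# prints: 192.0.2.13 port 40004 [preauth]
```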
-
these lines look like they say from and
-
then some of them say invalid but some
-
of them do not right this line has
-
invalid that one does not question mark
-
here is saying zero or one so I want
-
zero or one of invalid space
-
user what else well that's going to be a
-
double space so we can't have that and
-
then there's gonna be some username and
-
then there's gonna be what exactly is
-
gonna be what looks like an IP address
-
so here we can use our range syntax and
-
say zero to nine and a dot right that's
-
what IP addresses are and we want many
-
of those then it says port so we're
-
just going to match a literal port and
-
then another number zero to nine and
-
we're going to want plus of that the
-
other thing we're going to do here is
-
we're going to do what's known as
-
anchoring the regular expression so
-
there are two special characters in
-
regular expressions there's caret or
-
hat which matches the beginning of a
-
line and there's dollar which matches
-
the end of a line so here we're gonna
-
say that this regular expression has to match
-
the complete line the reason we do this
-
is because imagine that someone made
-
their username the entire log string
-
then now if you try to match this
-
pattern it would match the username
-
itself which is not what we want
-
generally you will want to try to anchor
-
your patterns wherever you can to avoid
-
those kind of oddities okay let's see
-
what that gave us that removed many of
-
the lines but not all of them so this
-
one for example includes this preauth at
-
the end so we'll want to cut that off if
-
there's a space preauth square brackets
-
are special so we need to escape them
-
right now let's see what happens if we
-
try more lines of this no it still gets
-
something weird some of these lines are
-
not empty right which means that the
-
pattern did not match this one for
-
example it says authenticating user
-
instead of invalid
-
user okay so this has to match invalid or
-
authenticating zero or one time before
-
user how about now okay that looks
-
pretty promising but this output is not
-
particularly helpful right here we've
-
just erased every line of our log files
-
successfully which is not very helpful
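A sketch of where the pattern stands at this point: anchored at both ends, with an empty replacement. The exact expression typed in the lecture is not captured in the transcript, so this is a reconstruction from the spoken steps, and the log lines from the hypothetical `sample` helper are invented. Lines the pattern matches come out empty; any line left non-empty reveals a case the pattern does not cover yet.

```shell
# Invented log lines; the last one has no "user" field on purpose.
sample() {
  cat <<'EOF'
Jan 01 00:00:02 example-host sshd[101]: Disconnected from invalid user admin 192.0.2.10 port 40001 [preauth]
Jan 01 00:00:04 example-host sshd[103]: Disconnected from authenticating user root 192.0.2.12 port 40003 [preauth]
Jan 01 00:00:06 example-host sshd[105]: Disconnected from user alice 192.0.2.14 port 40005
Jan 01 00:00:07 example-host sshd[106]: Disconnected from 192.0.2.15 port 40006 [preauth]
EOF
}

# ^ and $ force the pattern to cover the whole line.
# (invalid |authenticating )? : optional qualifier before "user"
# [0-9.]+                     : something that looks like an IP address
# ( \[preauth\])?             : optional suffix, brackets escaped
pattern='s/^.*Disconnected from (invalid |authenticating )?user .* [0-9.]+ port [0-9]+( \[preauth\])?$//'

# Count the lines that were NOT blanked out, i.e. that the pattern missed.
unmatched=$(sample | sed -E "$pattern" | grep -c .)
echo "lines the pattern did not match: $unmatched"
```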
-
instead what we really wanted to do is
-
when we match the username right over
-
here we really wanted to remember what
-
that username was because that is what
-
we want to print out and the way we can
-
do that in regular expressions is using
-
something like capture groups so capture
-
groups are a way to say that I want to
-
remember this value and reuse it later
-
and in regular expressions any bracketed
-
expression any parenthesis expression is
-
going to be such a capture group so we
-
already actually have one here which is
-
this first group and now we're creating
-
a second one here notice that these
-
parentheses don't do anything to the
-
matching right because they're just
-
saying this expression as a unit but we
-
don't have any modifiers after it so
-
it just matches one time and then the
-
reason matching groups are useful or
-
capture groups are useful is because you
-
can refer back to them in the
-
replacement so in the replacement here I
-
can say backslash two this is the way
-
that you refer to the name of a capture
-
group in this case I'm
-
saying match the entire line and then in
-
the replacement put in the value you
-
captured in the second capture group
-
right remember this is the first capture
-
group and this is the second one and
-
this gives me all the usernames now if
-
you look back at what we wrote this is
-
pretty complicated right it might make
-
sense now that we walk through it and
-
why it had to be the way it was but this
-
is like not obvious that this is how
-
these lines work and this is where a
-
regular expression debugger can come in
-
really really handy so we have one here
-
there are many online but here I've sort
-
of pre filled in this expression that we
-
just used and notice that it tells me
-
what all the matching does in fact now this
-
window is a little small with this font
-
size but if I do here this explanation
-
says dot star matches any character
-
between zero and unlimited times
-
followed by disconnected from literally
-
followed by a capture group and then
-
walks you through all the stuff and
-
that's one thing but it also lets you
-
give a test string and then matches the
-
pattern against every single test string
-
that you give and highlights what the
-
different capture groups for example are
-
so here we made user a capture group
-
right so it'll say okay the full string
-
matched right the whole thing is blue so
-
it matched Green is the first capture
-
group red is the second capture group
-
and this is the third because preauth
-
was also put into parenthesis and this
-
can be a handy way to try to debug your
-
regular expressions for example if I put
-
disconnected from and let's add a new
-
line here and I make the username
-
disconnected from now that line already
-
had the username be disconnected from
-
great here's me thinking ahead you'll
-
notice that with this pattern this was
-
no longer a problem because it got
-
matched as the username what happens if we
-
take this entire line or this entire
-
line and make that the username now what
-
happens it gets really confused right so
-
this is where regular expressions can be
-
a pain to get right because it now tries
-
to match it matches the first place
-
where username appears or the first
-
invalid in this case the second invalid
-
because this is greedy we can make this
-
non greedy by putting a question mark
-
here so if you suffix a plus or a star
-
with a question mark it becomes a non
-
greedy match so it will not try to match
-
as much as possible and then you see
-
that this actually gets parsed correctly
-
because this dot star
-
will stop at the first disconnected
-
from which is the one that's actually
-
emitted by SSH the one that actually
-
appears in our logs as you can probably
-
tell from the explanation of this so far
-
regular expressions can get really
-
complicated and there are all sorts of
-
weird modifiers that you might have to
-
apply in your pattern the only way to
-
really learn them is to start with
-
simple ones and then build them up until
-
they match what you need often you're
-
just doing some like one-off job like
-
when we're hacking out the user names
-
here and you don't need to care about
-
all the special conditions right you
-
don't have to care about someone having
-
the SSH username perfectly match your
-
log format that's probably not
-
something that matters because you're
-
just trying to find the usernames but
-
regular expressions are really powerful
-
and you want to be careful if you're
-
doing something where it actually
-
matters you had a question
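To make the capture-group walkthrough above concrete, here is a sketch of the full substitution with the second group in the replacement. The expression is reconstructed from the spoken description, not copied from the screen, and the log lines from the hypothetical `sample` helper are invented.

```shell
# Invented log lines, including one whose username is "Disconnected from".
sample() {
  cat <<'EOF'
Jan 01 00:00:02 example-host sshd[101]: Disconnected from invalid user admin 192.0.2.10 port 40001 [preauth]
Jan 01 00:00:04 example-host sshd[103]: Disconnected from authenticating user root 192.0.2.12 port 40003 [preauth]
Jan 01 00:00:06 example-host sshd[105]: Disconnected from user alice 192.0.2.14 port 40005
Jan 01 00:00:05 example-host sshd[104]: Disconnected from invalid user Disconnected from 192.0.2.13 port 40004 [preauth]
EOF
}

# Group 1: the optional "invalid " / "authenticating " qualifier.
# Group 2: the username, which is what \2 puts back in the replacement.
# Group 3: the optional " [preauth]" suffix.
usernames=$(sample | sed -E 's/^.*Disconnected from (invalid |authenticating )?user (.*) [0-9.]+ port [0-9]+( \[preauth\])?$/\2/')
printf '%s\n' "$usernames"
```

Because the rest of the pattern must still match after the prefix, the fourth line yields the username "Disconnected from" instead of losing it.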
-
regular expressions by default only
-
match per line anyway they will not
-
match across new lines so so the way
-
that sed works is that it operates per
-
line and so sed will do this
-
expression for every line okay questions
-
about regular expressions or this pattern
-
so far it is a complicated pattern so if
-
it if it feels confusing like don't be
-
worried about it look at it in the
-
debugger later yep so so keep in mind
-
that the we're assuming here that the
-
user only has control over their
-
username right so the worst that they
-
could do is take like this entire entry
-
and make that the username let's see
-
what happens right so that works
-
and the reason for this is this question
-
mark means that the moment we hit the
-
disconnected keyword we start parsing the
-
rest of the pattern right and the
-
first occurrence of disconnected is
-
printed by SSH before anything the user
-
controls so in this particular instance
-
even this will not confuse the pattern
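The question-mark suffix shown in the debugger is not available in sed, whose regular expression syntax has no non-greedy quantifier. As an assumption on my part, this sketch swaps in `perl -pe`, which accepts the same `s/search/replace/` form and does understand `.*?`; it presumes perl is installed. The log line is invented.

```shell
# Made-up line whose username is itself "Disconnected from".
tricky='Jan 01 00:00:05 example-host sshd[104]: Disconnected from invalid user Disconnected from 192.0.2.13 port 40004 [preauth]'

# .*? stops at the FIRST "Disconnected from ", the one written by sshd
# before anything the user controls.
lazy=$(printf '%s\n' "$tricky" | perl -pe 's/.*?Disconnected from //')
printf '%s\n' "$lazy"
# prints: invalid user Disconnected from 192.0.2.13 port 40004 [preauth]
```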
-
yep well so this
-
sort of odd matching in general
-
when you're doing data wrangling is like
-
not security related
-
but it might mean that you get really
-
weird data back and so if you're doing
-
something like plotting data you might
-
drop data points that matter you might
-
parse out the wrong number and then like
-
your plot suddenly has data points that
-
weren't in the original data and so it's
-
more that if you find yourself writing a
-
complicated regular expression like
-
double check that it's actually matching
-
what you think it's matching and even if
-
it's not security related and as you can
-
imagine these patterns can get really
-
complicated like for example there's a
-
big debate about how do you match an
-
email address with a regular expression
-
and you might think of something like
-
this so this is a very straightforward
-
one that just says letters and numbers
-
and underscores and percent followed
-
by a plus because in Gmail you can have
-
pluses in email addresses with a suffix
-
in this case the plus is just for any
-
number of these but at least one because
-
you can't have an email address that
-
doesn't have anything before the at and
-
then similarly after the domain right
-
and the top-level domain has to be at
-
least two characters and can't include
-
digits right you can have .com but
-
you can't have .7 it turns out
-
this is not really correct right there
-
are a bunch of valid email addresses
-
that will not be matched by this and
-
there are a bunch of invalid email
-
addresses that will be matched by this
-
so there are many many suggestions and
-
there are people who've built like full
-
test suites to try to see which regular
-
expression is best and this is this
-
particular one is for URLs there are
-
similar ones for email where they found
-
that the best one is this one I don't
-
recommend you trying to understand this
-
pattern but this one apparently will almost
-
perfectly match what the like
-
internet standard for email addresses
-
says as a valid email address and that
-
includes all sorts of weird Unicode code
-
points this is just to say regular
-
expressions can be really hairy and if
-
you end up somewhere like this there's
-
probably a better way to do it for
-
example if you find yourself trying to
-
parse HTML or something or parse like
-
parse JSON with regular expressions you
-
should probably use a different tool and
-
there is an exercise that has you do
-
this not with regular expressions
-
yeah there's all sorts of
-
suggestions and they give you deep deep
-
dives into how they work if you want to
-
look that up it's it's in the lecture
-
notes okay so now we have the list of
-
user names so let's go back to data
-
wrangling right like this list of user
-
names is still not that interesting to
-
me right let's let's see how many lines
-
there are so if I do wc -l there are
-
one hundred and ninety eight thousand
-
lines so wc is the word count program
-
-l makes it count the number of lines
-
this is a lot of lines then if I start
-
scrolling through them that still
-
doesn't really help me right like I need
-
statistics over this I need aggregates
-
of some kind and the sed tool is like
-
useful for many things it gives you a
-
full programming language it can do
-
weird things like insert text or only
-
print matching lines but it's not
-
necessarily the perfect tool for
-
everything right like sometimes there
-
are better tools like for example you
-
could write a line counter in sed you
-
just should never sed is a terrible
-
programming language except for
-
searching and replacing but there are
-
other useful tools so for example
-
there's a tool called sort so sort this
-
is also not going to be very helpful but
-
sort takes a bunch of lines of input
-
sorts them and then prints them to your
-
output so in this case I now get the
-
sorted output of that list it is still
-
two hundred thousand lines long so it's
-
still not very helpful to me but now I
-
can combine it
-
with the tool called uniq so uniq will
-
look at a sorted list of lines and it
-
will only print those that are unique so
-
if you have multiple instances of any
-
given line it will only print it once
-
and then I can say uniq -c so this is
-
gonna say count the number of duplicates
-
for any lines that are duplicated and
-
eliminate them what does this look like
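A tiny stand-in for that step, using a handful of invented usernames in place of the real extracted list. `names` is a hypothetical helper for this sketch.

```shell
# Six made-up login attempts standing in for the extracted usernames.
names() { printf '%s\n' root admin root test admin root; }

# wc -l counts lines.
total=$(names | wc -l)

# uniq only collapses ADJACENT duplicates, which is why sort comes first.
# -c prefixes each surviving line with how many times it occurred.
counts=$(names | sort | uniq -c)
distinct=$(names | sort | uniq | wc -l)

printf '%s\n' "$counts"
echo "total: $total, distinct: $distinct"
```

The exact padding in front of each count differs between uniq implementations, so it is safer to treat the output as whitespace-separated columns.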
-
well if I run it it's gonna take a while
-
there were thirteen zze user names there
-
were ten zxvf user names etc and
-
I can scroll through this this is still
-
a very long list right but at least now
-
it's a little bit more collated than it
-
was let's see how many lines I'm down
-
to now okay
-
twenty-four thousand lines it's still
-
too much it's not useful information to
-
me but I can keep narrowing this down with
-
more tools for example what I might care
-
about is which user names have been used
-
the most well I can do sort again and I
-
can say I want a numeric sort on the
-
first column of the input so -n says
-
numeric sort -k lets you select a whitespace
-
separated column from the input to
-
sort by and the reason I'm giving one
-
comma one here is because I want to
-
start at the first column and stop at
-
the first column alternatively I could
-
say I want you to sort by this list of
-
columns but in this case I just want to
-
sort by that column and then I want only
-
the ten last lines so sort by default
-
will output in ascending order so the
-
the ones with the highest counts are
-
gonna be at the bottom and then I want
-
only the last ten lines and now when I run
-
this I actually get a useful bit of data
-
right it tells me there were eleven
-
thousand login attempts with the
-
username root there were four thousand
-
with one two three four five six as
-
username etc and this is pretty handy
-
right and now suddenly this giant log
-
file actually produces useful
-
information for me this is what I really
-
wanted from that log file now maybe I want to
-
just like do a quick disabling of root
-
for example for SSH login on my machine
-
which I recommend you do by the way
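The ranking step described above, sketched on a few invented usernames in place of the real log. `names` is a hypothetical helper, and the sketch keeps the top two entries where the lecture keeps ten.

```shell
# Six made-up login attempts standing in for the extracted usernames.
names() { printf '%s\n' root admin root test admin root; }

# -n      : compare numerically, not as text
# -k1,1   : the sort key starts and ends at the first whitespace-separated column
# tail -n2: sort is ascending, so the most frequent names are at the bottom
top=$(names | sort | uniq -c | sort -nk1,1 | tail -n2)
printf '%s\n' "$top"
```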
-
in this particular case we don't
-
actually need the -k for sort because sort
-
by default will sort by the entire line
-
and the number happens to come first but
-
it's useful to know about these
-
additional flags and you might wonder
-
well how would I know that these flags
-
exist how would I know that these
-
programs even exist
-
well the programs you usually pick up just
-
from being told about them in classes
-
like here the flags are usually like I
-
want to sort by something that is not
-
the full line your first instinct should
-
be to type man sort and then read
-
through the page and it very quickly
-
will tell you here's how to select a
-
particular column here's how to sort by
-
a number etc okay what if now that I
-
have this like top let's say top 20 list
-
let's say I don't actually care about
-
the counts I just want like a comma
-
separated list of the user names because
-
I'm gonna like send it to myself by
-
email every day or something like that
-
like these are the top 20 usernames well
-
I can do this
-
ok that's a lot more weird commands but
-
they're commands that are useful to know
-
about so awk is a column based stream
-
processor so we talked about sed which
-
is a stream editor so it tries to edit
-
text primarily in the inputs awk on the
-
other hand also lets you edit text it is
-
still a full programming language but
-
it's more focused on columnar data so in
-
this case awk by default will parse its
-
input in white space separated columns
-
and then lets you operate on those
-
columns separately in this case I'm
-
saying just print the second column
-
which is the user name right paste is a
-
command that takes a bunch of lines and
-
pastes them together into a single line
-
that's the -s with the delimiter comma
-
so in this case from this I'm going to
-
get a comma separated list of the top
-
user names which I can then do whatever
-
useful thing I might want maybe I want
-
to stick this in a config file of
-
disallowed usernames or something along
-
those lines
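The command itself is not captured in the transcript, so this is a sketch that follows the description above: awk prints the second column and paste joins the lines with commas. `names` is a hypothetical helper with invented usernames, and the explicit `-` operand tells paste to read standard input, which some implementations require.

```shell
# Six made-up login attempts standing in for the extracted usernames.
names() { printf '%s\n' root admin root test admin root; }

csv=$(names | sort | uniq -c | sort -nk1,1 | tail -n2 \
  | awk '{ print $2 }' \
  | paste -s -d, -)
# awk splits each line on whitespace; $2 is the username column.
# paste -s joins all input lines into one; -d, sets the delimiter to a comma.

printf '%s\n' "$csv"
# prints: admin,root
```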
-
um awk is worth talking a little bit
-
more about because it turns out to be a
-
really powerful language for this kind
-
of data wrangling we mentioned briefly
-
what this print dollar 2 does but it
-
turns out that with awk you can do some
-
really really fancy things so for
-
example let's go back to here where we
-
just have the usernames and let's
-
still do sort and uniq because
-
otherwise the list gets far too
-
long
-
and let's say that I only want to print
-
the usernames that match a particular
-
pattern let's say for example that I
-
want to see I want all of the usernames
-
that only appear once and that start
-
with a c and end with an e that's a
-
really weird thing to look for but in
-
awk this is really simple to express I
-
can say I want the first column to be 1
-
and I want the second column to match
-
the following regular expression
-
hey this could probably just be dot and
-
then I want to print the whole line so
-
unless I mess something up this will
-
give me all the usernames that start
-
with a C end with an e and only appear
-
once in my log now that might not be a
-
very useful thing to do with the data
-
what I'm trying to do in this lecture is
-
show you the kind of tools that are
-
available and in this particular case
-
this pattern is like not that
-
complicated even though what we're doing
-
is sort of weird and this is because
-
very often on Linux with Linux tools in
-
particular and command-line tools in
-
general the tools are built to be based
-
on lines of input and lines of output
-
and very often those lines are going to
-
have multiple columns and awk is
-
great for operating over columns now awk
-
is not just able to do things like
-
match per line but it lets you do things
-
like let's say I want the number of
-
these right I want to know how many user
-
names match this pattern well I can do
-
wc -l that works just fine all right
-
there are 31 such user names but awk is
-
a programming language this is something
-
that you will probably never end up
-
doing yourself but it's important to
-
know that you can every now and again it
-
is actually useful to know about these
-
this might be hard to read on my screen
-
I just realized let me try to fix that
-
in a second
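Since the commands on screen do not survive in the transcript, here is a rough sketch of the two ways of counting being compared: filtering with an awk pattern and piping to wc -l, versus counting inside awk with BEGIN and END. The sample data and variable names are hypothetical, and the regular expression is one plausible way to write "starts with c and ends with e".

```shell
# Hypothetical sample input: a count, then a username.
counts='   1 charlie
   3 carole
   1 claire
   1 bob
   1 chase'

# Pattern: first column equals 1 and second column starts with c, ends with e.
# The action prints the whole line ($0).
matches=$(printf '%s\n' "$counts" | awk '$1 == 1 && $2 ~ /^c.*e$/ { print $0 }')

# Option 1: count the matching lines with wc -l
# (tr strips the padding some wc implementations add).
n_wc=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

# Option 2: do the counting inside awk. BEGIN runs before any input,
# END runs after the last line, the middle rule runs on every matching line.
n_awk=$(printf '%s\n' "$counts" | awk '
  BEGIN { rows = 0 }
  $1 == 1 && $2 ~ /^c.*e$/ { rows += 1 }
  END { print rows }')

echo "$n_wc $n_awk"   # prints 3 3
```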
-
let's do yeah apparently fish does not
-
want me to do that um so here begin is a
-
special pattern that only matches the
-
zeroth line end is a special pattern
-
that only matches after the last line
-
and then this is gonna be a normal
-
pattern that's matched against every
-
line so what I'm saying here is on the
-
zeroth line set the variable rows to
-
zero on every line that matches this
-
pattern increment rows and after you
-
have matched the last line print the
-
value of rows and this will have the
-
same effect as running wc -l but all
-
within awk in this particular instance
-
wc -l is just fine but sometimes you want
-
to do things like you might want
-
to keep a dictionary or a map of some
-
kind you might want to compute
-
statistics you might want to do things
-
like I want the second match of this
-
pattern so you need a stateful matcher
-
like ignore the first match but then
-
print everything following the second
-
match and for that this kind of simple
-
programming in awk can be useful to know
-
about in fact we could in this pattern
-
get rid of sed and sort and uniq and
-
grep that we originally used to produce
-
this file and do it all in awk
-
but you probably don't want to do that
-
it would be probably too painful to be
-
worth it it's worth talking a little bit
-
about the other kinds of tools that you
-
might want to use on the command line
-
the first of these is a really handy
-
program called bc so bc is the Berkeley
-
calculator I believe man bc I think bc
-
is originally from Berkeley calculator
-
anyway it is a very simple command-line
-
calculator but instead of giving you a
-
prompt it reads from standard in so I
-
can do something like echo 1 plus 2 and
-
pipe it to bc -l because many of
-
these programs normally operate in like
-
a stupid mode where they're unhelpful so
-
here it prints 3 Wow very impressive but
-
it turns out this can be really handy
-
imagine you have a file with a bunch of
-
lines
-
let's say something like oh I don't know
-
this file and let's say I want to sum up
-
the number of logins the number of user
-
names that have not been used only once
-
all right so the ones where the count is
-
not equal to one I want to print just
-
the count right this means give me the
-
counts for all the non single-use user
-
names and then I want to know how many
-
are there of these notice that I can't
-
just count the lines that wouldn't work
-
right because there are numbers on each
-
line I want to sum well I can use paste
-
to paste by plus so this pastes every
-
line together into a plus expression
-
right and this is now an arithmetic
-
expression so I can pipe it through bc -l
-
and now there have been a hundred and
-
ninety one thousand logins that share a
-
username with at least one other login
-
again probably not something you really
-
care about but this is just to show you
-
that you can extract this data pretty
-
easily and there's all sorts of other
-
stuff you can do with this for example
-
there are tools that let you compute
-
statistics over inputs so for example
-
for this list of numbers if I
-
just took the numbers and just printed
-
out just the distribution of numbers I
-
could do things like use R R is a
-
separate programming language that's
-
specifically built for statistical
-
analysis and I can say let's see if I
-
got this right
-
this is again a different programming
-
language that you would have to learn
-
but if you already know R or you can
-
pipe them through other languages
-
too like so so this gives me summary
-
statistics over that input stream of
-
numbers so the median number of login
-
attempts per user name is 3 the max is
-
10,000 that was root
-
we saw before and it'll tell me the average
-
was 8 for this might not matter in this
-
particular instance like this might not
-
be interesting numbers but if you're
-
looking at things like output from your
-
benchmarking script or something else
-
where you have some numerical
-
distribution and you want to look at
-
them these tools are really handy we can
-
even do some simple plotting if we
-
wanted to right so this has a bunch of
-
numbers let's do let's go back to our
-
sort -nk1,1 and look at only the
-
top 5 gnuplot is a plotter that lets
-
you take things from standard in I'm not
-
expecting you to know all of these
-
programming languages because they
-
really are programming languages in
-
their own right but it is just to show you
-
what is possible right so this is now a
-
histogram of how many times each of the
-
top 5 user names have been used on my
-
server since January 1st and it's just
-
one command line it's somewhat
-
complicated command line but it's just
-
one command line thing that you can do
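A hedged sketch of the sort-then-plot step. The data is a hypothetical sample, and the gnuplot call is an assumption about what a suitable command could look like: it uses the text-only "dumb" terminal so the bar chart is drawn as ASCII in the terminal, whereas the lecture presumably showed a graphical window. The plot is skipped if gnuplot is not installed.

```shell
# Hypothetical sample input: a count, then a username.
counts='  11 admin
   1 bob
   4 root
   2 guest'

# Sort numerically on the first column and keep the two largest entries
# (the lecture keeps the top 5; 2 is enough for this tiny sample).
top=$(printf '%s\n' "$counts" | sort -nk1,1 | tail -n2)
echo "$top"

# plot "-" reads the data from standard input; column 1 is the bar height
# and column 2 is used as the x-axis tick label.
if command -v gnuplot >/dev/null 2>&1; then
  printf '%s\n' "$top" \
    | gnuplot -e 'set terminal dumb; set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'
fi
```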
-
there are two sort of special types of
-
data wrangling that I want to talk to
-
you about in the last little bit
-
of time that we have and the first one
-
is command line argument wrangling
-
sometimes you might have something that
-
actually we looked at in the last
-
lecture like you have things like find
-
that produces a list of files or maybe
-
something that produces a list of
-
arguments for your benchmarking script
-
like you want to run it with a
-
particular distribution of arguments
-
like let's say you had a script that
-
printed the number of iterations to run
-
a particular project and you wanted like
-
an exponential distribution or something
-
and this prints the number of iterations
-
on each line and you want to run your
-
benchmark for each one well here is a
-
tool called xargs
-
that's your friend so xargs takes lines
-
of input and turns them into arguments
-
and this might
-
look a little weird let me see if I can come up
-
with a good example for this so I
-
program in Rust and Rust lets you
-
install multiple versions of the
-
compiler so in this case you can see
-
that I have stable beta I have a couple
-
of earlier stable releases and I have
-
a bunch of different dated nightlies
-
and this is all very well but over time
-
like I don't really need the nightly
-
version from like March of last year
-
anymore
-
I can probably delete that every now and
-
again and maybe I want to clean these up
-
a little well this is a list of lines so
-
I can grep for nightly I can get rid of
-
so -v is don't match I don't want to
-
match the current nightly okay so
-
this is now a list of dated nightlies
-
maybe I want only the ones from 2019
-
and now I want to remove each of these
-
toolchains from my machine I could copy
-
paste each one into so there's a rustup
-
toolchain remove or uninstall maybe
-
toolchain uninstall right so I could
-
manually type out the name of each one
-
or copy/paste them but that
-
gets annoying really quickly because I
-
have the list right here so instead how
-
about I sed away this sort of this
-
suffix that it adds right so now it's
-
just that and then I use xargs so xargs
-
takes a list of inputs and turns
-
them into arguments so I want this to
-
become arguments to rustup toolchain
-
uninstall and just for my own sanity's
-
sake I'm gonna make this echo just so
-
it's going to show which command it's
-
gonna run well it's relatively unhelpful
-
and hard to read but at least you see
-
the command it's going to execute if I
-
remove this echo it is rustup toolchain
-
uninstall and then the list of
-
nightlies as arguments to that program
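Roughly, the xargs dry run described here could look like the sketch below. The toolchain names are a hypothetical listing that only imitates what rustup prints, and the exact grep and sed patterns are assumptions. Keeping the echo in front means nothing is actually uninstalled.

```shell
# Hypothetical listing, loosely modeled on `rustup toolchain list` output.
toolchains='stable-x86_64-unknown-linux-gnu (default)
beta-x86_64-unknown-linux-gnu
nightly-x86_64-unknown-linux-gnu
nightly-2019-03-01-x86_64-unknown-linux-gnu
nightly-2019-06-15-x86_64-unknown-linux-gnu
nightly-2020-01-02-x86_64-unknown-linux-gnu'

# grep nightly       keep only nightlies
# grep -vE ...       -v inverts the match: drop the current (undated) nightly
# grep 2019          keep only the ones from 2019
# sed 's/-x86.*//'   strip the platform suffix
# xargs echo ...     turn the remaining lines into arguments; echo makes it
#                    print the command instead of running it
cmd=$(printf '%s\n' "$toolchains" \
  | grep nightly \
  | grep -vE 'nightly-x86' \
  | grep 2019 \
  | sed 's/-x86.*//' \
  | xargs echo rustup toolchain uninstall)

echo "$cmd"
# prints rustup toolchain uninstall nightly-2019-03-01 nightly-2019-06-15
```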
-
and so if I run this it uninstalls
-
every toolchain instead of me having to
-
copy paste them so this is one example
-
where this kind of data wrangling
-
actually can be useful for other tasks
-
than just looking at data it's just
-
going from one
-
format to another you can also wrangle
-
binary data so a good example of this is
-
stuff like videos and images where you
-
might actually want to operate over them
-
in some interesting way so for example
-
there's a tool called ffmpeg ffmpeg is
-
for encoding and decoding video and to
-
some extent images I'm gonna set its log
-
level to panic because otherwise it
-
prints a bunch of stuff I want it to
-
read from /dev/video0 which is my
-
webcam video device and I want it
-
to take the first frame so I just want it
-
to take a picture and I want it to take
-
an image rather than a single frame
-
video file and I want it to print its
-
output so the image it captures to
-
standard output - is usually the way you
-
tell the program to use standard input
-
or output rather than a given file so
-
here it expects a file name and the file
-
name - means standard output in this
-
context and then I want to pipe that
-
through a program called convert
-
convert is an image manipulation program
-
I want to tell convert to read from
-
standard input and turn the image into
-
the color space gray and then write the
-
resulting image into the file - which is
-
standard output and then I want to pipe
-
that into gzip we're just gonna compress
-
this image file and that's also going to
-
just operate on standard input standard
-
output and then I'm going to pipe that
-
to my remote server and on that I'm
-
going to decode that image and then I'm
-
gonna store a copy of that image so
-
remember tee reads input prints it to
-
standard out and to a file this is gonna
-
make a copy of the decoded image file
-
as copy.png and then it's gonna
-
continue to stream that out so now I'm
-
gonna bring that back into a local
-
stream and here I'm going to display
-
that in an image displayer let's see
-
if that works
-
Hey right so this now did a round-trip
-
to my server
-
and then came back over pipes and
-
there's now a
-
decompressed version of this file at
-
least in theory on my server let's see
-
if that's there scp tsp copy.png to
-
here and CP 8 yeah hey same file ended
-
up on the server so our pipeline worked
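The webcam, ffmpeg, convert and the remote server are not reproduced here. As a stand-in, this sketch pushes a few hypothetical binary bytes through the same shape of pipeline: compress, decompress (where the ssh hop would be), save a copy with tee, and keep streaming. The file names are made up, and cmp takes the place of the image viewer to check that the bytes survived the round trip.

```shell
workdir=$(mktemp -d)

# Stand-in for the captured image: a tiny byte stream that includes
# non-text bytes (written with octal escapes).
printf 'P5\n2 2\n255\n\000\177\200\377' > "$workdir/original.pgm"

# gzip compresses stdin to stdout; gzip -d decompresses it again (in the
# lecture this step runs on the remote machine). tee writes a copy to a
# file and passes the same bytes along; "cmp -" compares stdin to a file.
cat "$workdir/original.pgm" \
  | gzip \
  | gzip -d \
  | tee "$workdir/copy.pgm" \
  | cmp - "$workdir/original.pgm" && roundtrip=ok

# The copy saved by tee should be byte-for-byte identical too.
cmp "$workdir/copy.pgm" "$workdir/original.pgm" && copy=ok

echo "$roundtrip $copy"   # prints ok ok
```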
-
again this is a sort of silly example
-
but it lets you see the power of building
-
these pipelines where it doesn't have to
-
be textual data it's just taking data
-
from any format to any other like for
-
example if I wanted to I can do cat
-
/dev/video0 and then pipe that to a server
-
that like Anish controls and then he
-
could watch that video stream by piping
-
it into a video player on his machine if
-
we wanted to right you just need to know
-
that these things exist there are a bunch
-
of exercises for this lab and some of
-
them rely on you having a data source
-
that looks a little bit like a log on
-
macOS and Linux we give you some
-
commands you can try to experiment with
-
but keep in mind that it's not
-
that important exactly what data source
-
you use this is more find some data
-
source where you think there might
-
be an interesting signal and then try to
-
extract something interesting from it
-
that is what all of the exercises are
-
about we will not have class on Monday
-
because it's MLK Day so next lecture
-
will be Tuesday on command line
-
environments any questions about what
-
we've covered so far or the pipelines or
-
regular expressions I really recommend
-
that you look into regular expressions
-
and try to learn them they are extremely
-
handy both for this and in programming
-
in general and if you have any questions
-
come to office hours and we'll help you
-
out