
Lecture 4: Data Wrangling (2020)

  • 0:01 - 0:06
    all right so welcome to today's lecture
  • 0:04 - 0:09
    which is going to be on data wrangling
  • 0:06 - 0:11
    and data wrangling might be a phrase it
  • 0:09 - 0:13
    sounds a little bit odd to you but the
  • 0:11 - 0:15
    basic idea of data wrangling is that you
  • 0:13 - 0:17
    have data in one format and you want it
  • 0:15 - 0:19
    in some different format and this
  • 0:17 - 0:21
    happens all of the time I'm not just
  • 0:19 - 0:23
    talking about like converting images but
  • 0:21 - 0:25
    it could be like you have a text file or
  • 0:23 - 0:27
    a log file and what you really want is this
  • 0:25 - 0:29
    data in some other format like you want
  • 0:27 - 0:32
    a graph or you want statistics over the
  • 0:29 - 0:35
    data anything that goes from one piece
  • 0:32 - 0:37
    of data to another representation of
  • 0:35 - 0:40
    that data is what I would call data
  • 0:37 - 0:42
    wrangling we've seen some examples of
  • 0:40 - 0:44
    this kind of data wrangling already
  • 0:42 - 0:46
    previously in the semester like
  • 0:44 - 0:48
    basically whenever you use the pipe
  • 0:46 - 0:50
    operator that lets you sort of take
  • 0:48 - 0:51
    output from one program and feed it
  • 0:50 - 0:54
    through another program you are doing
  • 0:51 - 0:55
    data wrangling in one way or another but
  • 0:54 - 0:58
    what we're going to do in this lecture is
  • 0:55 - 1:00
    take a look at some of the fancier ways
  • 0:58 - 1:02
    you can do data wrangling and some of
  • 1:00 - 1:06
    the really useful ways you can do data
  • 1:02 - 1:07
    wrangling in order to do any kind of
  • 1:06 - 1:09
    data wrangling though you need a data
  • 1:07 - 1:12
    source you need some data to operate on
  • 1:09 - 1:14
    in the first place and there are a lot
  • 1:12 - 1:17
    of good candidates for that kind of data
  • 1:14 - 1:19
    we give some examples in the exercise
  • 1:17 - 1:21
    section for today's lecture notes in
  • 1:19 - 1:23
    this particular one though I'm going to
  • 1:21 - 1:26
    be using a system log so I have a server
  • 1:23 - 1:27
    that's running somewhere in the Netherlands
  • 1:26 - 1:30
    because that seemed like a reasonable
  • 1:27 - 1:33
    thing at the time and on that server
  • 1:30 - 1:34
    it's running sort of a regular logging
  • 1:33 - 1:37
    daemon that comes with systemd
  • 1:34 - 1:39
    it's a sort of relatively standard Linux
  • 1:37 - 1:42
    logging mechanism and there's a command
  • 1:39 - 1:45
    called journalctl on Linux systems that
  • 1:42 - 1:46
    will let you view the system log and so
  • 1:45 - 1:49
    what I'm gonna do is I'm gonna do some
  • 1:46 - 1:50
    transformations over that log and see if
  • 1:49 - 1:53
    we can extract something interesting
  • 1:50 - 1:56
    from it you'll see though that if I run
  • 1:53 - 1:59
    this command I end up with a lot of data
  • 1:56 - 2:02
    because this is a log that has just like
  • 1:59 - 2:03
    there's a lot of stuff in it right a lot
  • 2:02 - 2:06
    of things have happened on my server and
  • 2:03 - 2:08
    this goes back to like January first and
  • 2:06 - 2:11
    there are logs that go even further back on
  • 2:08 - 2:12
    this there's a lot of stuff so the first
  • 2:11 - 2:13
    thing we're gonna do is try to limit it
  • 2:12 - 2:16
    down to only
  • 2:13 - 2:18
    one piece of content and here the grep
  • 2:16 - 2:20
    command is your friend so we're gonna
  • 2:18 - 2:23
    pipe this through grep and we're gonna
  • 2:20 - 2:25
    grep for ssh right so ssh we haven't
  • 2:23 - 2:27
    really talked to you about yet but it is
  • 2:25 - 2:29
    a way to access computers remotely
  • 2:27 - 2:31
    through the command line and in
  • 2:29 - 2:32
    particular what happens when you put a
  • 2:31 - 2:34
    server on the public Internet is that
  • 2:32 - 2:36
    lots and lots of people around the world
  • 2:34 - 2:38
    will try to connect to it and log in and
  • 2:36 - 2:39
    take over your server and so I want to
  • 2:38 - 2:41
    see how those people are trying to do
  • 2:39 - 2:45
    that and so I'm going to grep for SSH
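A sketch of this stage, for reference. The real server and journalctl aren't available outside the lecture, so the log lines below are made-up stand-ins; only the grep stage is the real one:

```shell
# Made-up stand-ins for journalctl output; grep keeps only lines mentioning ssh
printf '%s\n' \
  'Jan 17 03:13:00 tsp sshd[2631]: Disconnected from invalid user admin 1.2.3.4 port 55920 [preauth]' \
  'Jan 17 03:13:05 tsp systemd[1]: Starting Rotate log files...' |
  grep ssh
```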
  • 2:41 - 2:48
    and you'll see pretty quickly that this
  • 2:45 - 2:51
    also generates a bunch of content at
  • 2:48 - 2:56
    least in theory this is gonna be real
  • 2:51 - 2:59
    slow there we go so this generates tons
  • 2:56 - 3:00
    and tons and tons of content and it's
  • 2:59 - 3:02
    really hard to even just visualize
  • 3:00 - 3:05
    what's going on here so let's look at
  • 3:02 - 3:07
    only what user names people have used to
  • 3:05 - 3:10
    try to log into my server so you'll see
  • 3:07 - 3:13
    some of these lines say disconnected
  • 3:10 - 3:15
    disconnected from invalid user and then
  • 3:13 - 3:17
    some user name I want only those lines
  • 3:15 - 3:19
    that's all I really care about I'm gonna
  • 3:17 - 3:22
    make one more change here though which
  • 3:19 - 3:26
    is if you think about how this pipeline
  • 3:22 - 3:29
    works if I here grep for "Disconnected from" so
  • 3:26 - 3:31
    this pipeline at the bottom here what
  • 3:29 - 3:33
    that will do is it will send the entire
  • 3:31 - 3:36
    log file over the network to my machine
  • 3:33 - 3:38
    and then locally run grep to find only
  • 3:36 - 3:41
    the lines that contain ssh and then
  • 3:38 - 3:42
    locally filter them further this seems a
  • 3:41 - 3:44
    little bit wasteful because I don't care
  • 3:42 - 3:46
    about most of these lines and the remote
  • 3:44 - 3:49
    site is also running a shell so what I
  • 3:46 - 3:52
    can actually do is I can have that
  • 3:49 - 3:54
    entire command run on the server right
  • 3:52 - 3:55
    so I'm telling you SSH the command I
  • 3:54 - 3:57
    want you to run on the server is this
  • 3:55 - 4:01
    pipeline of three things and then what I
  • 3:57 - 4:03
    get back I want to pipe through less so
  • 4:01 - 4:04
    what does this do well it's gonna do
  • 4:03 - 4:06
    that same filtering that we did but it's
  • 4:04 - 4:08
    gonna do it on the server side and the
  • 4:06 - 4:12
    server is only going to send me those
  • 4:08 - 4:13
    lines that I care about and then when I
  • 4:12 - 4:16
    pipe it locally through the program
  • 4:13 - 4:18
    called less less is a pager you'll see
  • 4:16 - 4:19
    some examples of this you've actually
  • 4:18 - 4:22
    seen some of them already like when you
  • 4:19 - 4:24
    type man and some command that opens in
  • 4:22 - 4:27
    a pager and a pager is a convenient way
  • 4:24 - 4:27
    to take a long piece of content and fit
  • 4:27 - 4:30
    it into your terminal
  • 4:27 - 4:32
    window and have you scroll down and
  • 4:30 - 4:33
    scroll up and navigate it so that it
  • 4:32 - 4:36
    doesn't just like scroll past your
  • 4:33 - 4:37
    screen and so if I run this it still
  • 4:36 - 4:41
    takes a little while because it has to
  • 4:37 - 4:43
    parse through a lot of log files and in
  • 4:41 - 4:46
    particular grep is buffering and
  • 4:43 - 4:47
    therefore it decides to be relatively
  • 4:46 - 4:56
    unhelpful
  • 4:47 - 5:01
    I may do this without let's see if
  • 4:56 - 5:05
    that's more helpful why doesn't it want
  • 5:01 - 5:10
    to be helpful to me fine I'm gonna cheat
  • 5:05 - 5:10
    a little just ignore me
  • 5:17 - 5:23
    or the internet is really slow those are
  • 5:21 - 5:27
    two possible options luckily there's a
  • 5:23 - 5:30
    fix for that because previously I have
  • 5:27 - 5:33
    run the following command so this
  • 5:30 - 5:34
    command just takes the output of that
  • 5:33 - 5:37
    command and sticks it into a file
  • 5:34 - 5:39
    locally on my computer alright so I ran
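The command referred to was presumably of this shape (the hostname tsp and the exact filters are assumed from context). Since the server isn't reachable here, the runnable part below simulates the stream with printf and keeps the same redirection:

```shell
# Not runnable without the server:
#   ssh tsp journalctl | grep sshd | grep "Disconnected from" > ssh.log
# Local simulation of the same capture, with printf standing in for the stream:
printf '%s\n' \
  'sshd[2631]: Disconnected from invalid user admin 1.2.3.4 port 55920 [preauth]' \
  'systemd[1]: Starting Rotate log files...' |
  grep sshd | grep 'Disconnected from' > /tmp/ssh.log
cat /tmp/ssh.log   # only the matching lines were written to the file
```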
  • 5:37 - 5:41
    this when I was up in my office and so
  • 5:39 - 5:43
    what this did is it downloaded all of
  • 5:41 - 5:46
    the SSH log entries that matched
  • 5:43 - 5:47
    disconnect from so I have those locally
  • 5:46 - 5:49
    and this is really handy right there's
  • 5:47 - 5:51
    no reason for me to stream the full log
  • 5:49 - 5:53
    every single time because I know that
  • 5:51 - 5:55
    that starting pattern is what I'm going
  • 5:53 - 5:57
    to want anyway so we can take a look at
  • 5:55 - 5:59
    SSH dot log and you will see there are
  • 5:57 - 6:02
    lots and lots and lots of lines that all
  • 5:59 - 6:05
    say disconnected from invalid user
  • 6:02 - 6:06
    authenticating users etc right so these
  • 6:05 - 6:09
    are the lines that we have to work on
  • 6:06 - 6:11
    and this also means that going forward
  • 6:09 - 6:12
    we don't have to go through this whole
  • 6:11 - 6:16
    SSH process we can just cat that file
  • 6:12 - 6:18
    and then operate it on it directly so
  • 6:16 - 6:22
    here I can also demonstrate this pager
  • 6:18 - 6:24
    so if I do cat s is a cat SSH dot log
  • 6:22 - 6:25
    and I pipe it through less it gives me a
  • 6:24 - 6:29
    pager where I can scroll up and down
  • 6:25 - 6:31
    make that a little bit smaller maybe so
  • 6:29 - 6:33
    I can scroll through
  • 6:31 - 6:36
    this file and I can do so with what are
  • 6:33 - 6:38
    roughly vim bindings so Ctrl-U to
  • 6:36 - 6:43
    scroll up Ctrl-D to scroll down and
  • 6:38 - 6:45
    q to exit this is still a lot of
  • 6:43 - 6:47
    content though and these lines contain a
  • 6:45 - 6:48
    bunch of garbage that I'm not really
  • 6:47 - 6:50
    interested in what I really want to see
  • 6:48 - 6:53
    is what are these user names
  • 6:50 - 6:56
    and here the tool that we're going to
  • 6:53 - 6:59
    start using is one called sed sed is a
  • 6:56 - 7:01
    stream editor it's a
  • 6:59 - 7:04
    modification of a much earlier program
  • 7:01 - 7:06
    called ed which was a really weird
  • 7:04 - 7:12
    editor that none of you will probably
  • 7:06 - 7:16
    want to use yeah Oh tsp is the name of
  • 7:12 - 7:16
    the remote computer I'm connecting to
  • 7:16 - 7:24
    so sed is a stream editor and it
  • 7:20 - 7:26
    basically lets you make changes to the
  • 7:24 - 7:28
    contents of a stream you can think of it
  • 7:26 - 7:30
    a little bit like doing replacements but
  • 7:28 - 7:30
    it's actually a full programming
  • 7:30 - 7:33
    language
  • 7:30 - 7:35
    over the stream that is given one of the
  • 7:33 - 7:38
    most common things you do with said
  • 7:35 - 7:41
    though is to just run replacement
  • 7:38 - 7:45
    expressions on an input stream what do
  • 7:41 - 7:45
    these look like well let me show you
  • 7:45 - 7:50
    here I'm gonna pipe this through sed and
  • 7:48 - 7:53
    I'm going to say that I want to remove
  • 7:50 - 7:58
    everything that comes before
  • 7:53 - 8:01
    disconnected from so this might look a
  • 7:58 - 8:04
    little weird the observation is that the
  • 8:01 - 8:06
    date and the host name and the sort of
  • 8:04 - 8:07
    process ID of the SSH daemon I don't
  • 8:06 - 8:10
    care about I can just remove that
  • 8:07 - 8:12
    straightaway and I can also remove that
  • 8:10 - 8:14
    like disconnected from bit because that
  • 8:12 - 8:15
    seems to be present in every single log
  • 8:14 - 8:18
    entry so I just want to get rid of it
  • 8:15 - 8:20
    and so what I write is a sed expression
  • 8:18 - 8:22
    in this particular case it's an S
  • 8:20 - 8:26
    expression which is a substitute
  • 8:22 - 8:28
    expression it takes two arguments that
  • 8:26 - 8:31
    are basically enclosed in these slashes
  • 8:28 - 8:32
    so the first one is the search string
  • 8:31 - 8:34
    and the second one which is currently
  • 8:32 - 8:36
    empty is a replacement string so here
  • 8:34 - 8:40
    I'm saying search for the following
  • 8:36 - 8:41
    pattern and replace it with blank and
  • 8:40 - 8:43
    then I'm gonna pipe it into less at the
  • 8:41 - 8:45
    end do you see that now what it's done
  • 8:43 - 8:50
    is trim off the beginning of all these
  • 8:45 - 8:52
    lines and that seems really handy but
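For reference, that substitution reproduced on a single made-up log line:

```shell
# s/.*Disconnected from //  deletes everything up to and including the
# literal "Disconnected from " (note that the .* is greedy)
line='Jan 17 03:13:00 tsp sshd[2631]: Disconnected from invalid user admin 46.97.239.16 port 55920 [preauth]'
echo "$line" | sed 's/.*Disconnected from //'
```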
  • 8:50 - 8:55
    you might wonder what is this pattern
  • 8:52 - 8:58
    that I've built up here right this is
  • 8:55 - 8:59
    this dot star what does that mean this
  • 8:58 - 9:02
    is an example of a regular expression
  • 8:59 - 9:04
    and regular expressions are something
  • 9:02 - 9:05
    that you may have come across in
  • 9:04 - 9:07
    programming in the past
  • 9:05 - 9:08
    but it's something that once you go into
  • 9:07 - 9:10
    the command line you will find yourself
  • 9:08 - 9:13
    using a lot especially for this kind of
  • 9:10 - 9:16
    data wrangling regular expressions are
  • 9:13 - 9:18
    essentially a powerful way to match text
  • 9:16 - 9:20
    you can use it for other things than
  • 9:18 - 9:23
    text too but text is the most common
  • 9:20 - 9:27
    example and in regular expressions you
  • 9:23 - 9:30
    have a number of special characters that
  • 9:27 - 9:32
    say don't just match this character but
  • 9:30 - 9:34
    match for example a particular type of
  • 9:32 - 9:37
    character or a particular set of options
  • 9:34 - 9:40
    it essentially generates a program for
  • 9:37 - 9:42
    you that searches the given text dot for
  • 9:40 - 9:46
    example means any single
  • 9:42 - 9:49
    character and star if you follow a
  • 9:46 - 9:52
    character with a star it means zero or
  • 9:49 - 9:54
    more of that character and so in this
  • 9:52 - 9:58
    case this pattern is saying zero or more
  • 9:54 - 10:00
    of any character followed by the literal
  • 9:58 - 10:03
    string disconnected from I'm saying
  • 10:00 - 10:06
    match that and then replace it with
  • 10:03 - 10:08
    blank regular expressions have a number
  • 10:06 - 10:09
    of these kind of special characters that
  • 10:08 - 10:12
    have various meanings you can take
  • 10:09 - 10:12
    advantage of I talked about star which
  • 10:12 - 10:15
    is zero or more
  • 10:12 - 10:16
    there's also Plus which is one or more
  • 10:15 - 10:18
    right so this is saying I want the
  • 10:16 - 10:19
    previous expression to match at least
  • 10:18 - 10:23
    once
  • 10:19 - 10:25
    you also have square brackets so square
  • 10:23 - 10:27
    brackets let you match one of many
  • 10:25 - 10:30
    different characters so here let us
  • 10:27 - 10:36
    build up a string something like a
  • 10:30 - 10:42
    ba and I want to substitute a and b with
  • 10:36 - 10:44
    nothing okay so here what I'm telling
  • 10:42 - 10:47
    the pattern to do is to replace any
  • 10:44 - 10:50
    character that is either A or B with
  • 10:47 - 10:53
    nothing so if I make the first character
  • 10:50 - 10:54
    b it will still produce ba you might
  • 10:53 - 10:56
    wonder though why did it only replace
  • 10:54 - 10:58
    once well it's because what regular
  • 10:56 - 11:00
    expressions will do especially in this
  • 10:58 - 11:02
    default mode is they will just match the
  • 11:00 - 11:04
    pattern once and then apply the
  • 11:02 - 11:07
    replacement once per line that is what's
  • 11:04 - 11:09
    sed normally does you can provide the g
  • 11:07 - 11:12
    modifier which says do this as many
  • 11:09 - 11:14
    times as it keeps matching which in this
  • 11:12 - 11:16
    case would erase the entire line because
  • 11:14 - 11:19
    every single character is either an A or
  • 11:16 - 11:21
    a b if I added a c here it would remove
  • 11:19 - 11:23
    everything but the c if I added other
  • 11:21 - 11:24
    characters in the middle of this string
  • 11:23 - 11:26
    somewhere they would all be preserved
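These experiments are easy to reproduce (the input strings are stand-ins for what's typed on screen):

```shell
echo 'aba'  | sed 's/[ab]//'    # only the first match is replaced: ba
echo 'aba'  | sed 's/[ab]//g'   # g flag: replace every match, leaving an empty line
echo 'acbd' | sed 's/[ab]//g'   # characters outside the class survive: cd
```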
  • 11:24 - 11:34
    but anything that is an a or a b is
  • 11:26 - 11:38
    removed you can also do things like add
  • 11:34 - 11:38
    modifiers to this for example
  • 11:42 - 11:52
    what would this do this is saying I want
  • 11:47 - 11:53
    zero or more of the string a B and I'm
  • 11:52 - 11:55
    gonna replace them with nothing
  • 11:53 - 11:57
    this means that if I have a standalone a
  • 11:55 - 12:00
    it will not be replaced if I have a
  • 11:57 - 12:02
    standalone B it will not be replaced but
  • 12:00 - 12:10
    if I have the string a B it will be
  • 12:02 - 12:12
    removed which yeah sed is
  • 12:10 - 12:12
    stupid
  • 12:12 - 12:18
    the -E here is because sed is a really
  • 12:15 - 12:20
    old tool and so it supports only a very
  • 12:18 - 12:22
    old version of regular expressions
  • 12:20 - 12:24
    generally you will want to run it with
  • 12:22 - 12:26
    -E capital E which makes it use a more
  • 12:24 - 12:29
    modern syntax that supports more things
  • 12:26 - 12:31
    if you are in a place where you can't do that
  • 12:29 - 12:33
    you have to prefix these with back
  • 12:31 - 12:36
    slashes to say I want the special
  • 12:33 - 12:37
    meaning of parenthesis otherwise they
  • 12:36 - 12:40
    will just match a literal parenthesis
  • 12:37 - 12:44
    which is probably not what you want so
  • 12:40 - 12:46
    notice how this replaced the ab here
  • 12:44 - 12:49
    and it replaced the ab here but it
  • 12:46 - 12:51
    left this C and it also left the a at
  • 12:49 - 12:54
    the end because that a does not match
  • 12:51 - 12:56
    this pattern anymore and you can group
  • 12:54 - 12:58
    these patterns in whatever ways you want
  • 12:56 - 13:01
    you also have things like alternations
  • 12:58 - 13:07
    you can say anything that matches ab or
  • 13:01 - 13:11
    bc I want to remove and here you'll
  • 13:07 - 13:12
    notice that this ab got removed this bc
  • 13:11 - 13:15
    did not get removed even though it
  • 13:12 - 13:18
    matches the pattern because the ab had
  • 13:15 - 13:20
    already been removed this ab is removed
  • 13:18 - 13:23
    right but the c stays in place this ab
  • 13:20 - 13:26
    is removed and this c stays because it
  • 13:23 - 13:29
    still does not match that if I made this
  • 13:26 - 13:32
    if I remove this a then now this ab
  • 13:29 - 13:34
    pattern will not match this b so it'll
  • 13:32 - 13:36
    be preserved and then bc will match bc
  • 13:34 - 13:38
    and it'll go away
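Both the grouping and the alternation behavior can be checked like this (the input strings are my guesses at what's on screen):

```shell
# (ab)* : zero or more copies of the group "ab"; -E avoids backslash-escaping
echo 'abcaba' | sed -E 's/(ab)*//g'    # leaves: ca
# (ab|bc): matches are found left to right and never overlap, so a bc that
# shared its b with an already-removed ab is left behind as a bare c
echo 'abcabc' | sed -E 's/(ab|bc)//g'  # leaves: cc
```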
  • 13:36 - 13:40
    regular expressions can be all sorts of
  • 13:38 - 13:42
    complicated when you first encounter
  • 13:40 - 13:43
    them and even once you get more
  • 13:42 - 13:45
    experience with them they can be
  • 13:43 - 13:48
    daunting to look at and this is why very
  • 13:45 - 13:50
    often you want to use something like a
  • 13:48 - 13:52
    regular expression debugger which we'll
  • 13:50 - 13:53
    look at in a little bit but first let's
  • 13:52 - 13:56
    try to make up a
  • 13:53 - 13:57
    pattern that will
  • 13:56 - 14:00
    match the logs that we've been working
  • 13:57 - 14:02
    with so far so here I'm gonna just sort
  • 14:00 - 14:05
    of extract a couple of lines from this
  • 14:02 - 14:09
    file let's say the first five so these
  • 14:05 - 14:12
    lines all now look like this right and
  • 14:09 - 14:15
    what we want to do is we want to only
  • 14:12 - 14:21
    have the user name okay so what might
  • 14:15 - 14:30
    this look like well here's one thing we
  • 14:21 - 14:33
    could try to do actually let me show you
  • 14:30 - 14:34
    one thing first let me take a
  • 14:33 - 14:39
    line that says something like
  • 14:34 - 14:44
    disconnected from invalid user
  • 14:39 - 14:47
    disconnected from maybe some IP and port
  • 14:44 - 14:50
    whatever okay so this is an example of a
  • 14:47 - 14:54
    login line where someone tried to login
  • 14:50 - 14:54
    with the username disconnected from
  • 14:54 - 15:05
    missing an S disconnected thank you
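The line being constructed here behaves like this when run through the earlier substitution (IP and port are made up, since the ones on screen aren't legible in the transcript):

```shell
# A hostile username that itself contains "Disconnected from"
evil='Jan 17 tsp sshd[2631]: Disconnected from invalid user Disconnected from 1.2.3.4 port 22'
echo "$evil" | sed 's/.*Disconnected from //'   # greedy .* eats up to the LAST match
```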
  • 15:03 - 15:08
    you'll notice that this actually removed
  • 15:05 - 15:11
    the username as well and this is because
  • 15:08 - 15:12
    when you use dot star and any of these
  • 15:11 - 15:14
    sort of range expressions in regular
  • 15:12 - 15:17
    expressions they are greedy they will
  • 15:14 - 15:20
    match as much as they can so in this
  • 15:17 - 15:22
    case this was the username that we
  • 15:20 - 15:25
    wanted to retain but this pattern
  • 15:22 - 15:27
    actually matched all the way up until
  • 15:25 - 15:29
    the second occurrence of it or the last
  • 15:27 - 15:31
    occurrence of it and so everything
  • 15:29 - 15:33
    before it including the username itself
  • 15:31 - 15:34
    got removed and so we need to come up
  • 15:33 - 15:36
    with a slightly cleverer matching
  • 15:34 - 15:38
    strategy than just saying sort of dot
  • 15:36 - 15:40
    star because it means that if we have
  • 15:38 - 15:41
    particularly adversarial input we might
  • 15:40 - 15:44
    end up with something that we didn't
  • 15:41 - 15:48
    expect okay so let's see how we might
  • 15:44 - 15:57
    try to match these lines let's just do a
  • 15:48 - 16:01
    head first well let's try to construct
  • 15:57 - 16:03
    this up from the beginning we first of
  • 16:01 - 16:05
    all know that we want - capital e right
  • 16:03 - 16:07
    because we want to not have to put all
  • 16:05 - 16:10
    these back slashes everywhere
  • 16:07 - 16:15
    these lines look like they say from and
  • 16:10 - 16:17
    then some of them say invalid but some
  • 16:15 - 16:19
    of them do not right this line has
  • 16:17 - 16:22
    invalid that one does not question mark
  • 16:19 - 16:26
    here is saying zero or one so I want
  • 16:22 - 16:31
    zero or one of invalid space
  • 16:26 - 16:34
    user what else well that's going to be a
  • 16:31 - 16:37
    double space so we can't have that and
  • 16:34 - 16:40
    then there's gonna be some username and
  • 16:37 - 16:43
    then there's gonna be what exactly is
  • 16:40 - 16:46
    gonna be what looks like an IP address
  • 16:43 - 16:50
    so here we can use our range syntax and
  • 16:46 - 16:53
    say zero to nine and a dot right that's
  • 16:50 - 16:58
    what IP addresses are and we want many
  • 16:53 - 17:00
    of those then it says port so we're
  • 16:58 - 17:03
    just going to match a literal port and
  • 17:00 - 17:08
    then another number zero to nine and
  • 17:03 - 17:09
    we're going to want plus of that the
  • 17:08 - 17:10
    other thing we're going to do here is
  • 17:09 - 17:12
    we're going to do what's known as
  • 17:10 - 17:13
    anchoring the regular expression so
  • 17:12 - 17:16
    there are two special characters and
  • 17:13 - 17:18
    regular expressions there's caret or
  • 17:16 - 17:20
    hat which matches the beginning of a
  • 17:18 - 17:22
    line and there's dollar which matches
  • 17:20 - 17:25
    the end of a line so here we're gonna
  • 17:22 - 17:28
    say that this regex has to match
  • 17:25 - 17:30
    the complete line the reason we do this
  • 17:28 - 17:33
    is because imagine that someone made
  • 17:30 - 17:35
    their username the entire log string
  • 17:33 - 17:38
    then now if you try to match this
  • 17:35 - 17:41
    pattern it would match the username
  • 17:38 - 17:43
    itself which is not what we want
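A tiny demonstration of anchoring, on synthetic strings:

```shell
# ^ and $ pin the pattern to the whole line
echo 'xxabyy' | sed -E 's/^ab$/MATCH/'   # unchanged: the line is not exactly "ab"
echo 'ab'     | sed -E 's/^ab$/MATCH/'   # becomes MATCH
```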
  • 17:41 - 17:44
    generally you will want to try to anchor
  • 17:43 - 17:47
    your patterns wherever you can to avoid
  • 17:44 - 17:50
    those kind of oddities okay let's see
  • 17:47 - 17:52
    what that gave us that removed many of
  • 17:50 - 17:54
    the lines but not all of them so this
  • 17:52 - 17:57
    one for example includes this preauth at
  • 17:54 - 18:03
    the end so we'll want to cut that off if
  • 17:57 - 18:05
    there's a space preauth square brackets
  • 18:03 - 18:07
    are special we need to escape them
  • 18:05 - 18:11
    right now let's see what happens if we
  • 18:07 - 18:12
    try more lines of this no it still gets
  • 18:11 - 18:14
    something weird some of these lines are
  • 18:12 - 18:17
    not empty right which means that the
  • 18:14 - 18:19
    pattern did not match this one for
  • 18:17 - 18:20
    example it says authenticating user
  • 18:19 - 18:25
    instead of invalid
  • 18:20 - 18:27
    user okay so let's match invalid or
  • 18:25 - 18:31
    authenticating zero or one time before
  • 18:27 - 18:35
    user how about now okay that looks
  • 18:31 - 18:37
    pretty promising but this output is not
  • 18:35 - 18:39
    particularly helpful right here we've
  • 18:37 - 18:41
    just erased every line of our log files
  • 18:39 - 18:44
    successfully which is not very helpful
  • 18:41 - 18:46
    instead what we really wanted to do is
  • 18:44 - 18:49
    when we match the username right over
  • 18:46 - 18:50
    here we really wanted to remember what
  • 18:49 - 18:53
    that username was because that is what
  • 18:50 - 18:56
    we want to print out and the way we can
  • 18:53 - 19:00
    do that in regular expressions is using
  • 18:56 - 19:04
    something like capture groups so capture
  • 19:00 - 19:07
    groups are a way to say that I want to
  • 19:04 - 19:10
    remember this value and reuse it later
  • 19:07 - 19:12
    and in regular expressions any bracketed
  • 19:10 - 19:14
    expression any parenthesis expression is
  • 19:12 - 19:17
    going to be such a capture group so we
  • 19:14 - 19:19
    already actually have one here which is
  • 19:17 - 19:21
    this first group and now we're creating
  • 19:19 - 19:23
    a second one here notice that these
  • 19:21 - 19:25
    parentheses don't do anything to the
  • 19:23 - 19:27
    matching right because they're just
  • 19:25 - 19:29
    saying this expression as a unit but we
  • 19:27 - 19:33
    don't have any modifiers after it so
  • 19:29 - 19:35
    it just matches one time and then the
  • 19:33 - 19:37
    reason matching groups are useful or
  • 19:35 - 19:38
    capture groups are useful is because you
  • 19:37 - 19:41
    can refer back to them in the
  • 19:38 - 19:44
    replacement so in the replacement here I
  • 19:41 - 19:46
    can say backslash two this is the way
  • 19:44 - 19:48
    that you refer to the name of a capture
  • 19:46 - 19:50
    group in this case I'm
  • 19:48 - 19:53
    saying match the entire line and then in
  • 19:50 - 19:55
    the replacement put in the value you
  • 19:53 - 19:57
    captured in the second capture group
  • 19:55 - 20:00
    right remember this is the first capture
  • 19:57 - 20:03
    group and this is the second one and
  • 20:00 - 20:06
    this gives me all the usernames now if
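Putting the whole thing together, the command is, as best I can reconstruct it from the walkthrough (sample lines are made up):

```shell
pat='s/^.*Disconnected from (invalid |authenticating )?user (.*) [0-9.]+ port [0-9]+( \[preauth\])?$/\2/'
echo 'Jan 17 03:13:00 tsp sshd[2631]: Disconnected from invalid user admin 46.97.239.16 port 55920 [preauth]' |
  sed -E "$pat"   # prints: admin
echo 'Jan 17 tsp sshd[9]: Disconnected from authenticating user root 1.2.3.4 port 22' |
  sed -E "$pat"   # prints: root
```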
  • 20:03 - 20:09
    you look back at what we wrote this is
  • 20:06 - 20:10
    pretty complicated right it might make
  • 20:09 - 20:12
    sense now that we walk through it and
  • 20:10 - 20:14
    why it had to be the way it was but this
  • 20:12 - 20:16
    is like not obvious that this is how
  • 20:14 - 20:20
    these lines work and this is where a
  • 20:16 - 20:22
    regular expression debugger can come in
  • 20:20 - 20:25
    really really handy so we have one here
  • 20:22 - 20:28
    there are many online but here I've sort
  • 20:25 - 20:32
    of pre filled in this expression that we
  • 20:28 - 20:34
    just used and notice that it tells me
  • 20:32 - 20:37
    all the matching it does in fact now this
  • 20:34 - 20:43
    window is a little small with this font
  • 20:37 - 20:46
    size but if I go here this explanation
  • 20:43 - 20:48
    says dot star matches any character
  • 20:46 - 20:52
    between zero and unlimited times
  • 20:48 - 20:54
    followed by disconnected from literally
  • 20:52 - 20:57
    followed by a capture group and then
  • 20:54 - 20:59
    walks you through all the stuff and
  • 20:57 - 21:01
    that's one thing but it also lets you
  • 20:59 - 21:04
    give a test string and then matches the
  • 21:01 - 21:05
    pattern against every single test string
  • 21:04 - 21:07
    that you give and highlights what the
  • 21:05 - 21:11
    different capture groups for example are
  • 21:07 - 21:15
    so here we made user a capture group
  • 21:11 - 21:17
    right so it'll say okay the full string
  • 21:15 - 21:19
    matched right the whole thing is blue so
  • 21:17 - 21:21
    it matched Green is the first capture
  • 21:19 - 21:23
    group red is the second capture group
  • 21:21 - 21:26
    and this is the third because preauth
  • 21:23 - 21:28
    was also put into parenthesis and this
  • 21:26 - 21:31
    can be a handy way to try to debug your
  • 21:28 - 21:36
    regular expressions for example if I put
  • 21:31 - 21:41
    disconnected from and let's add a new
  • 21:36 - 21:45
    line here and I make the username
  • 21:41 - 21:47
    disconnected from now that line already
  • 21:45 - 21:50
    had the username be disconnected from
  • 21:47 - 21:54
    great look at me thinking ahead you'll
  • 21:50 - 21:56
    notice that with this pattern this was
  • 21:54 - 21:59
    no longer a problem because it still
  • 21:56 - 21:59
    matched the username what happens if we
  • 21:59 - 22:07
    take this entire line or this entire
  • 22:03 - 22:14
    line and make that the username now what
  • 22:07 - 22:15
    happens it gets really confused right so
  • 22:14 - 22:18
    this is where regular expressions can be
  • 22:15 - 22:22
    a pain to get right because it now tries
  • 22:18 - 22:24
    to match it matches the first place
  • 22:22 - 22:27
    where username appears or the first
  • 22:24 - 22:30
    invalid in this case the second invalid
  • 22:27 - 22:32
    because this is greedy we can make this
  • 22:30 - 22:36
    non greedy by putting a question mark
  • 22:32 - 22:39
    here so if you suffix a plus or a star
  • 22:36 - 22:41
    with a question mark it becomes a non
  • 22:39 - 22:43
    greedy match so it will not try to match
  • 22:41 - 22:44
    as much as possible and then you see
  • 22:43 - 22:46
    that this actually gets parsed correctly
  • 22:44 - 22:48
    because this dot star
  • 22:46 - 22:49
    will stop at the first disconnected
  • 22:48 - 22:52
    from which is the one that's actually
  • 22:49 - 22:57
    emitted by SSH the one that actually
  • 22:52 - 22:59
    appears in our logs as you can probably
  • 22:57 - 23:01
    tell from the explanation of this so far
  • 22:59 - 23:03
    regular expressions can get really
  • 23:01 - 23:05
    complicated and there are all sorts of
  • 23:03 - 23:07
    weird modifiers that you might have to
  • 23:05 - 23:09
    apply in your pattern the only way to
  • 23:07 - 23:11
    really learn them is to start with
  • 23:09 - 23:13
    simple ones and then build them up until
  • 23:11 - 23:15
    they match what you need often you're
  • 23:13 - 23:16
    just doing some like one-off job like
  • 23:15 - 23:18
    when we're hacking out the user names
  • 23:16 - 23:20
    here and you don't need to care about
  • 23:18 - 23:22
    all the special conditions right you
  • 23:20 - 23:24
    don't have to care about someone having
  • 23:22 - 23:26
    the SSH username perfectly match your
  • 23:24 - 23:27
    login format that's probably not
  • 23:26 - 23:29
    something that matters because you're
  • 23:27 - 23:31
    just trying to find the usernames but
  • 23:29 - 23:33
    regular expressions are really powerful
  • 23:31 - 23:34
    and you want to be careful if you're
  • 23:33 - 23:37
    doing something where it actually
  • 23:34 - 23:37
    matters you had a question
  • 23:41 - 23:48
    regular expressions by default only
  • 23:44 - 23:59
    match per line anyway they will not
  • 23:48 - 24:01
    match across newlines so the way
  • 23:59 - 24:05
    that sed works is that it operates per
  • 24:01 - 24:10
    line and so sed will do this
  • 24:05 - 24:12
    expression for every line okay questions
  • 24:10 - 24:14
    about regular expressions or this pattern
  • 24:12 - 24:16
    so far it is a complicated pattern so if
  • 24:14 - 24:18
    it feels confusing don't be
  • 24:16 - 24:31
    worried about it look at it in the
  • 24:18 - 24:34
    debugger later yep so so keep in mind
  • 24:31 - 24:36
    that the we're assuming here that the
  • 24:34 - 24:39
    user only has control over their
  • 24:36 - 24:42
    username right so the worst that they
  • 24:39 - 24:44
    could do is take like this entire entry
  • 24:42 - 24:48
    and make that the username let's see
  • 24:44 - 24:51
    what happens right so that still works
  • 24:48 - 24:54
    and the reason for this is this question
  • 24:51 - 24:56
    mark means that the moment we hit the
  • 24:54 - 24:59
    disconnect keyword we start parsing the
  • 24:56 - 25:01
    rest of the pattern right and the
  • 24:59 - 25:03
    first occurrence of disconnected is
  • 25:01 - 25:06
    printed by SSH before anything the user
  • 25:03 - 25:08
    controls so in this particular instance
  • 25:06 - 25:21
    even this will not confuse the pattern
  • 25:08 - 25:25
    yep well so this sort of odd matching
  • 25:21 - 25:26
    when you're doing data wrangling is in
  • 25:25 - 25:29
    general not security related
  • 25:26 - 25:31
    it's not a security problem
  • 25:29 - 25:34
    but it might mean that you get really
  • 25:31 - 25:35
    weird data back and so if you're doing
  • 25:34 - 25:37
    something like plotting data you might
  • 25:35 - 25:40
    drop data points that matter you might
  • 25:37 - 25:41
    parse out the wrong number and then like
  • 25:40 - 25:43
    your plot suddenly has data points that
  • 25:41 - 25:46
    weren't in the original data and so it's
  • 25:43 - 25:47
    more that if you find yourself writing a
  • 25:46 - 25:49
    complicated regular expression like
  • 25:47 - 25:52
    double check that it's actually matching
  • 25:49 - 25:57
    what you think it's matching and even if
  • 25:52 - 25:58
    it's not security related and as you can
  • 25:57 - 26:01
    imagine these patterns can get really
  • 25:58 - 26:03
    complicated like for example there's a
  • 26:01 - 26:04
    big debate about how do you match an
  • 26:03 - 26:06
    email address with a regular expression
  • 26:04 - 26:09
    and you might think of something like
  • 26:06 - 26:11
    this so this is a very straightforward
  • 26:09 - 26:14
    one that just says letters and numbers
  • 26:11 - 26:16
    and underscores dots and percents followed
  • 26:14 - 26:18
    by a plus because in Gmail you can have
  • 26:16 - 26:22
    pluses in email addresses with a suffix
  • 26:18 - 26:25
    in this case the plus is just for any
  • 26:22 - 26:26
    number of these but at least one because
  • 26:25 - 26:27
    you can't have an email address that
  • 26:26 - 26:29
    doesn't have anything before the ad and
  • 26:27 - 26:32
    then similarly after the domain right
  • 26:29 - 26:33
    and the top-level domain has to be at
  • 26:32 - 26:35
    least two characters and can't include
  • 26:33 - 26:38
    digits right you can have .com but
  • 26:35 - 26:40
    you can't have .7 it turns out
  • 26:38 - 26:42
    this is not really correct right there
  • 26:40 - 26:43
    are a bunch of valid email addresses
  • 26:42 - 26:44
    that will not be matched by this and
  • 26:43 - 26:46
    they're a bunch of invalid email
  • 26:44 - 26:51
    addresses that will be matched by this
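A commonly quoted version of the simple pattern described here can be tried with grep -E (this exact regex is an assumption; the one on the lecturer's screen may differ, and as noted it both misses some valid addresses and accepts some invalid ones):

```shell
# Letters, digits, dots, underscores, percents, plus and minus before the @,
# then a domain, then a top-level domain of at least two letters (no digits).
pat='^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'

# Only the first two candidates survive: "nope@x.7" fails because the TLD
# may not end in a digit, and "no-at-sign" has no @ at all.
printf '%s\n' 'alice@example.com' 'bob+tag@mail.co' 'nope@x.7' 'no-at-sign' \
  | grep -E "$pat"
```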
  • 26:46 - 26:52
    so there are many many suggestions and
  • 26:51 - 26:55
    there are people who've built like full
  • 26:52 - 26:58
    test suites to try to see which regular
  • 26:55 - 27:01
    expression is best and this
  • 26:58 - 27:03
    particular one is for URLs there are
  • 27:01 - 27:06
    similar ones for email where they found
  • 27:03 - 27:08
    that the best one is this one I don't
  • 27:06 - 27:11
    recommend trying to understand this
  • 27:08 - 27:14
    pattern but this one apparently will
  • 27:11 - 27:16
    almost perfectly match what the
  • 27:14 - 27:18
    internet standard for email addresses
  • 27:16 - 27:20
    says is a valid email address and that
  • 27:18 - 27:22
    includes all sorts of weird Unicode code
  • 27:20 - 27:24
    points this is just to say regular
  • 27:22 - 27:26
    expressions can be really hairy and if
  • 27:24 - 27:29
    you end up somewhere like this there's
  • 27:26 - 27:31
    probably a better way to do it for
  • 27:29 - 27:35
    example if you find yourself trying to
  • 27:31 - 27:38
    parse HTML or something or parse like
  • 27:35 - 27:40
    parse JSON with regular expressions you
  • 27:38 - 27:42
    should probably use a different tool and
  • 27:40 - 27:44
    there is an exercise that has you do
  • 27:42 - 27:50
    this not with regular expressions
  • 27:44 - 27:53
    yeah there's all sorts of
  • 27:50 - 27:55
    suggestions and they give you deep
  • 27:53 - 27:57
    dives into how they work if you want to
  • 27:55 - 28:02
    look that up it's in the lecture
  • 27:57 - 28:04
    notes okay so now we have this list of
  • 28:02 - 28:06
    user names so let's go back to data
  • 28:04 - 28:08
    wrangling right like this list of user
  • 28:06 - 28:10
    names is still not that interesting to
  • 28:08 - 28:16
    me right let's let's see how many lines
  • 28:10 - 28:16
    there are so if I do wc -l there are
  • 28:16 - 28:21
    one hundred and ninety eight thousand
  • 28:18 - 28:23
    lines so wc is the word count program
  • 28:21 - 28:26
    and -l makes it count the number of lines
  • 28:23 - 28:28
    this is a lot of lines then if I start
  • 28:26 - 28:30
    scrolling through them that still
  • 28:28 - 28:32
    doesn't really help me right like I need
  • 28:30 - 28:37
    statistics over this I need aggregates
  • 28:32 - 28:38
    of some kind and the sed tool is like
  • 28:37 - 28:40
    useful for many things it gives you a
  • 28:38 - 28:43
    full programming language it can do
  • 28:40 - 28:45
    weird things like insert text or only
  • 28:43 - 28:46
    print matching lines but it's not
  • 28:45 - 28:49
    necessarily the perfect tool for
  • 28:46 - 28:50
    everything right like sometimes there
  • 28:49 - 28:53
    are better tools like for example you
  • 28:50 - 28:55
    could write a line counter in sed but
  • 28:53 - 28:57
    you just shouldn't because sed is a terrible
  • 28:55 - 29:00
    programming language except for
  • 28:57 - 29:03
    searching and replacing but there are
  • 29:00 - 29:08
    other useful tools so for example
  • 29:03 - 29:10
    there's a tool called sort so sort this
  • 29:08 - 29:12
    is also not going to be very helpful but
  • 29:10 - 29:14
    sort takes a bunch of lines of input
  • 29:12 - 29:17
    sorts them and then prints them to your
  • 29:14 - 29:19
    output so in this case I now get the
  • 29:17 - 29:21
    sorted output of that list it is still
  • 29:19 - 29:24
    two hundred thousand lines long so it's
  • 29:21 - 29:25
    still not very helpful to me but now I
  • 29:24 - 29:27
    can combine it
  • 29:25 - 29:31
    the tool called uniq so uniq will
  • 29:27 - 29:33
    look at a sorted list of lines and it
  • 29:31 - 29:35
    will only print those that are unique so
  • 29:33 - 29:37
    if you have multiple instances of any
  • 29:35 - 29:41
    given line it will only print it once
  • 29:37 - 29:44
    and then I can say uniq -c so this is
  • 29:41 - 29:46
    gonna say count the number of duplicates
  • 29:44 - 29:48
    for any lines that are duplicated and
  • 29:46 - 29:52
    eliminate them what does this look like
  • 29:48 - 29:56
    well if I run it it's gonna take a while
  • 29:52 - 30:00
    there were thirteen zze user names there
  • 29:56 - 30:01
    were ten zxvf user names etc and
  • 30:00 - 30:03
    I can scroll through this this is still
  • 30:01 - 30:06
    a very long list right but at least now
  • 30:03 - 30:08
    it's a little bit more collated than it
  • 30:06 - 30:11
    was let's see how many lines I'm down
  • 30:08 - 30:11
    to now okay
  • 30:13 - 30:17
    twenty-four thousand lines it's still
  • 30:15 - 30:20
    too much it's not useful information to
  • 30:17 - 30:23
    me but I can keep narrowing this down with
  • 30:20 - 30:25
    more tools for example what I might care
  • 30:23 - 30:29
    about is which user names have been used
  • 30:25 - 30:31
    the most well I can do sort again and I
  • 30:29 - 30:36
    can say I want a numeric sort on the
  • 30:31 - 30:39
    first column of the input so -n says
  • 30:36 - 30:41
    numeric sort -k lets you select a white
  • 30:39 - 30:44
    space separated column from the input to
  • 30:41 - 30:46
    sort by and the reason I'm giving one
  • 30:44 - 30:48
    comma one here is because I want to
  • 30:46 - 30:50
    start at the first column and stop at
  • 30:48 - 30:52
    the first column alternatively I could
  • 30:50 - 30:54
    say I want you to sort by this list of
  • 30:52 - 30:58
    columns but in this case I just want to
  • 30:54 - 31:02
    sort by that column and then I want only
  • 30:58 - 31:07
    the ten last lines so sort by default
  • 31:02 - 31:09
    will output in ascending order so the
  • 31:07 - 31:10
    the ones with the highest counts are
  • 31:09 - 31:15
    gonna be at the bottom and then I want
  • 31:10 - 31:17
    only the last ten lines and now when I run
  • 31:15 - 31:21
    this I actually get a useful bit of data
  • 31:17 - 31:22
    right it tells me there were eleven
  • 31:21 - 31:25
    thousand login attempts with the
  • 31:22 - 31:26
    username root there were four thousand
  • 31:25 - 31:30
    with 123456 as the
  • 31:26 - 31:34
    username etc and this is pretty handy
  • 31:30 - 31:36
    right and now suddenly this giant log
  • 31:34 - 31:38
    file actually produces useful
  • 31:36 - 31:41
    information for me this is what I really
  • 31:38 - 31:44
    wanted from that log file now maybe I want to
  • 31:41 - 31:47
    just like do a quick disabling of root
  • 31:44 - 31:51
    for example for SSH login on my machine
  • 31:47 - 31:51
    which I recommend you do by the way
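The pipeline built up to this point can be tried on a small made-up username list instead of the real log (a sketch assuming GNU coreutils; the usernames and counts are invented):

```shell
# Toy stand-in for the extracted username list.
# sort + uniq -c collapses duplicates, prefixing each line with its count;
# sort -nk1,1 then sorts numerically on that count column (ascending),
# and tail keeps the highest counts, which come last.
printf '%s\n' root admin root guest root admin \
  | sort | uniq -c \
  | sort -nk1,1 \
  | tail -n2
```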
  • 31:51 - 31:57
    in this particular case we don't
  • 31:53 - 31:59
    actually need the -k flag for sort because sort
  • 31:57 - 32:01
    by default will sort by the entire line
  • 31:59 - 32:02
    and the number happens to come first but
  • 32:01 - 32:04
    it's useful to know about these
  • 32:02 - 32:06
    additional flags and you might wonder
  • 32:04 - 32:07
    well how would I know that these flags
  • 32:06 - 32:09
    exist how would I know that these
  • 32:07 - 32:11
    programs even exist
  • 32:09 - 32:13
    well the programs you usually pick up just
  • 32:11 - 32:16
    from being told about them in classes
  • 32:13 - 32:19
    like this one as for the flags if you
  • 32:16 - 32:22
    want to sort by something that is not
  • 32:19 - 32:24
    the full line your first instinct should
  • 32:22 - 32:26
    be to type man sort and then read
  • 32:24 - 32:28
    through the page and it will very quickly
  • 32:26 - 32:29
    tell you here's how to select a
  • 32:28 - 32:36
    particular column here's how to sort by
  • 32:29 - 32:38
    a number etc okay so now that I
  • 32:36 - 32:40
    have this like top let's say top 20 list
  • 32:38 - 32:43
    let's say I don't actually care about
  • 32:40 - 32:45
    the counts I just want like a comma
  • 32:43 - 32:47
    separated list of the user names because
  • 32:45 - 32:50
    I'm gonna like send it to myself by
  • 32:47 - 32:53
    email every day or something like that
  • 32:50 - 32:57
    like these are the top 20 usernames well
  • 32:53 - 32:57
    I can do this
  • 32:58 - 33:03
    ok that's a lot more weird commands but
  • 33:01 - 33:07
    they're commands that are useful to know
  • 33:03 - 33:10
    about so awk is a column based stream
  • 33:07 - 33:12
    processor so we talked about sed which
  • 33:10 - 33:16
    is a stream editor so it tries to edit
  • 33:12 - 33:19
    text primarily in the inputs awk on the
  • 33:16 - 33:21
    other hand also lets you edit text it is
  • 33:19 - 33:23
    still a full programming language but
  • 33:21 - 33:26
    it's more focused on columnar data so in
  • 33:23 - 33:28
    this case awk by default will parse its
  • 33:26 - 33:30
    input in white space separated columns
  • 33:28 - 33:32
    and then lets you operate on those
  • 33:30 - 33:33
    columns separately in this case I'm
  • 33:32 - 33:38
    saying just print the second column
  • 33:33 - 33:40
    which is the user name right paste is a
  • 33:38 - 33:43
    command that takes a bunch of lines and
  • 33:40 - 33:46
    pastes them together into a single line
  • 33:43 - 33:49
    that's the -s flag with the delimiter comma
  • 33:46 - 33:52
    so in this case I want to
  • 33:49 - 33:54
    get a comma separated list of the top
  • 33:52 - 33:56
    user names which I can then do whatever
  • 33:54 - 33:58
    useful thing I might want maybe I want
  • 33:56 - 33:59
    to stick this in a config file of
  • 33:58 - 34:00
    disallowed usernames or something along
  • 33:59 - 34:04
    those lines
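The awk and paste steps just described can be sketched on made-up uniq -c style output (the counts and names here are invented):

```shell
# awk '{print $2}' keeps only the second whitespace-separated column
# (the username); paste -s joins all lines into one, -d, sets the
# delimiter to a comma, and - means read from standard input.
printf '%s\n' '10986 root' '4033 123456' '684 admin' \
  | awk '{print $2}' \
  | paste -sd, -
# → root,123456,admin
```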
  • 34:00 - 34:06
    um awk is worth talking a little bit
  • 34:04 - 34:09
    more about because it turns out to be a
  • 34:06 - 34:13
    really powerful language for this kind
  • 34:09 - 34:16
    of data wrangling we mentioned briefly
  • 34:13 - 34:19
    what this print $2 does but it
  • 34:16 - 34:21
    turns out that with awk you can do some
  • 34:19 - 34:23
    really really fancy things so for
  • 34:21 - 34:25
    example let's go back to here where we
  • 34:23 - 34:29
    just have the usernames I say let's
  • 34:25 - 34:32
    still do sort and uniq because
  • 34:29 - 34:32
    otherwise the list gets far too
  • 34:32 - 34:34
    long
  • 34:32 - 34:37
    and let's say that I only want to print
  • 34:34 - 34:41
    the usernames that match a particular
  • 34:37 - 34:51
    pattern let's say for example that I
  • 34:41 - 34:57
    want to see I want all of the usernames
  • 34:51 - 35:00
    that only appear once and that start
  • 34:57 - 35:02
    with a C and end with an e there's a
  • 35:00 - 35:04
    really weird thing to look for but in
  • 35:02 - 35:06
    awk this is really simple to express I
  • 35:04 - 35:11
    can say I want the first column to be 1
  • 35:06 - 35:15
    and I want the second column to match
  • 35:11 - 35:15
    the following regular expression
  • 35:20 - 35:32
    hey this could probably just be dot and
  • 35:26 - 35:34
    then I want to print the whole line so
  • 35:32 - 35:36
    unless I mess something up this will
  • 35:34 - 35:39
    give me all the usernames that start
  • 35:36 - 35:43
    with a C end with an e and only appear
  • 35:39 - 35:45
    once in my log now that might not be a
  • 35:43 - 35:47
    very useful thing to do with the data
  • 35:45 - 35:48
    what I'm trying to do in this lecture is
  • 35:47 - 35:50
    show you the kind of tools that are
  • 35:48 - 35:52
    available and in this particular case
  • 35:50 - 35:53
    this pattern is like not that
  • 35:52 - 35:55
    complicated even though what we're doing
  • 35:53 - 35:58
    is sort of weird and this is because
  • 35:55 - 36:00
    very often on Linux with Linux tools in
  • 35:58 - 36:03
    particular and command-line tools in
  • 36:00 - 36:05
    general the tools are built to be based
  • 36:03 - 36:06
    on lines of input and lines of output
  • 36:05 - 36:09
    and very often those lines are going to
  • 36:06 - 36:18
    be have multiple columns and awk is
  • 36:09 - 36:22
    great for operating over columns now awk
  • 36:18 - 36:27
    is not just able to do things like
  • 36:22 - 36:29
    match per line but it lets you do things
  • 36:27 - 36:31
    like let's say I want the number of
  • 36:29 - 36:33
    these right I want to know how many user
  • 36:31 - 36:37
    names match this pattern well I can do
  • 36:33 - 36:40
    wc -l that works just fine all right
  • 36:37 - 36:42
    there are 31 such user names but awk is
  • 36:40 - 36:45
    a programming language this is something
  • 36:42 - 36:47
    that you will probably never end up
  • 36:45 - 36:49
    doing yourself but it's important to
  • 36:47 - 36:53
    know that you can every now and again it
  • 36:49 - 36:53
    is actually useful to know about these
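The filter just described can be reproduced on a few made-up count-username lines (the data is invented; the shape matches what sort | uniq -c produces):

```shell
# Keep usernames that appear exactly once ($1 == 1) and that start with
# a c and end with an e ($2 matches the regex), then count the matches.
printf '%s\n' '1 cole' '3 claire' '1 root' '1 ce' \
  | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $0 }' \
  | wc -l
# → 2  ("1 cole" and "1 ce" match; "claire" fails the count test)
```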
  • 36:54 - 37:02
    this might be hard to read on my screen
  • 36:57 - 37:05
    I just realized let me try to fix that
  • 37:02 - 37:05
    in a second
  • 37:07 - 37:18
    let's do yeah apparently fish does not
  • 37:14 - 37:20
    want me to do that um so here begin is a
  • 37:18 - 37:23
    special pattern that only matches the
  • 37:20 - 37:26
    zeroth line end is a special pattern
  • 37:23 - 37:28
    that only matches after the last line
  • 37:26 - 37:30
    and then this is gonna be a normal
  • 37:28 - 37:32
    pattern that's matched against every
  • 37:30 - 37:34
    line so what I'm saying here is on the
  • 37:32 - 37:37
    zeroth line set the variable rows to
  • 37:34 - 37:40
    zero on every line that matches this
  • 37:37 - 37:42
    pattern increment rows and after you
  • 37:40 - 37:45
    have matched the last line print the
  • 37:42 - 37:47
    value of rows and this will have the
  • 37:45 - 37:50
    same effect as running wc -l but all
  • 37:47 - 37:53
    within awk in this particular instance
  • 37:50 - 37:56
    wc -l is just fine but sometimes you want
  • 37:53 - 37:57
    to do things like you might want
  • 37:56 - 37:59
    to keep a dictionary or a map of some
  • 37:57 - 38:01
    kind you might want to compute
  • 37:59 - 38:03
    statistics you might want to do things
  • 38:01 - 38:05
    like I want the second match of this
  • 38:03 - 38:08
    pattern so you need a stateful matcher
  • 38:05 - 38:09
    like ignore the first match but then
  • 38:08 - 38:11
    print everything following the second
  • 38:09 - 38:13
    match and for that this kind of simple
  • 38:11 - 38:18
    programming in awk can be useful to know
  • 38:13 - 38:23
    about in fact we could in this pattern
  • 38:18 - 38:25
    get rid of sed and sort and uniq and
  • 38:23 - 38:27
    grep that we originally used to produce
  • 38:25 - 38:28
    this file and do it all in awk
  • 38:27 - 38:31
    but you probably don't want to do that
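The BEGIN/END counting described above can be sketched on made-up count-username lines, keeping the whole computation inside awk instead of piping to wc -l:

```shell
# BEGIN runs before the first input line, END after the last one;
# the middle rule fires once per line that matches its pattern.
printf '%s\n' '1 cole' '3 claire' '1 root' '1 ce' \
  | awk 'BEGIN { rows = 0 } $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += 1 } END { print rows }'
# → 2
```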
  • 38:28 - 38:35
    it would be probably too painful to be
  • 38:31 - 38:37
    worth it it's worth talking a little bit
  • 38:35 - 38:39
    about the other kinds of tools that you
  • 38:37 - 38:41
    might want to use on the command line
  • 38:39 - 38:45
    the first of these is a really handy
  • 38:41 - 38:50
    program called bc so bc is the Berkeley
  • 38:45 - 38:51
    calculator I believe man bc I think bc
  • 38:50 - 38:54
    is originally from Berkeley
  • 38:51 - 38:56
    anyway it is a very simple command-line
  • 38:54 - 38:59
    calculator but instead of giving you a
  • 38:56 - 39:01
    prompt it reads from standard in so I
  • 38:59 - 39:05
    can do something like echo 1 plus 2 and
  • 39:01 - 39:07
    pipe it to bc -l because many of
  • 39:05 - 39:11
    these programs normally operate in like
  • 39:07 - 39:16
    a stupid mode where they're unhelpful so
  • 39:11 - 39:17
    here it prints 3 Wow very impressive but
  • 39:16 - 39:20
    it turns out this can be really handy
  • 39:17 - 39:21
    imagine you have a file with a bunch of
  • 39:20 - 39:26
    lines
  • 39:21 - 39:32
    let's say something like oh I don't know
  • 39:26 - 39:35
    this file and let's say I want to sum up
  • 39:32 - 39:37
    the number of logins the number of user
  • 39:35 - 39:40
    names that have not been used only once
  • 39:37 - 39:44
    all right so the ones where the count is
  • 39:40 - 39:49
    not equal to one I want to print just
  • 39:44 - 39:51
    the count right this gives me the
  • 39:49 - 39:53
    counts for all the non single-use user
  • 39:51 - 39:55
    names and then I want to know how many
  • 39:53 - 39:57
    are there of these notice that I can't
  • 39:55 - 39:59
    just count the lines that wouldn't work
  • 39:57 - 40:02
    right because there are numbers on each
  • 39:59 - 40:06
    line I want to sum well I can use paste
  • 40:02 - 40:08
    to paste with plus so this pastes every
  • 40:06 - 40:12
    line together into a plus expression
  • 40:08 - 40:14
    right and this is now an arithmetic
  • 40:12 - 40:19
    expression so I can pipe it through bc -l
  • 40:14 - 40:21
    and now there have been one hundred and
  • 40:19 - 40:23
    ninety one thousand logins that share a
  • 40:21 - 40:26
    username with at least one other login
  • 40:23 - 40:28
    again probably not something you really
  • 40:26 - 40:30
    care about but this is just to show you
  • 40:28 - 40:34
    that you can extract this data pretty
  • 40:30 - 40:36
    easily and there's all sort of other
  • 40:34 - 40:38
    stuff you can do with this for example
  • 40:36 - 40:41
    there are tools that let you compute
  • 40:38 - 40:44
    statistics over inputs so for example
  • 40:41 - 40:46
    for this list of numbers where I
  • 40:44 - 40:50
    just took the counts and printed
  • 40:46 - 40:55
    out the distribution of numbers I
  • 40:50 - 40:56
    could do things like use R R is a
  • 40:55 - 40:58
    separate programming language that's
  • 40:56 - 41:02
    specifically built for a statistical
  • 40:58 - 41:04
    analysis and I can say let's see if I
  • 41:02 - 41:06
    got this right
  • 41:04 - 41:10
    this is again a different programming
  • 41:06 - 41:13
    language that you would have to learn
  • 41:10 - 41:14
    but if you already know R you can
  • 41:13 - 41:24
    pipe data through it like other languages
  • 41:14 - 41:26
    too so this gives me summary
  • 41:24 - 41:30
    statistics over that input stream of
  • 41:26 - 41:33
    numbers so the median number of login
  • 41:30 - 41:34
    attempts per user name is 3 the max is
  • 41:33 - 41:36
    10,000 that was root
  • 41:34 - 41:39
    we saw before it tells me the average
  • 41:36 - 41:41
    was 8.4 this might not matter in this
  • 41:39 - 41:42
    particular instance like this might not
  • 41:41 - 41:44
    be interesting numbers but if you're
  • 41:42 - 41:46
    looking at things like output from your
  • 41:44 - 41:47
    benchmarking script or something else
  • 41:46 - 41:49
    where you have some numerical
  • 41:47 - 41:53
    distribution and you want to look at
  • 41:49 - 41:54
    them these tools are really handy we can
  • 41:53 - 41:58
    even do some simple plotting if we
  • 41:54 - 42:01
    wanted to right so this has a bunch of
  • 41:58 - 42:06
    numbers let's do let's go back to our
  • 42:01 - 42:12
    sort -nk1,1 and look at only the
  • 42:06 - 42:18
    top 5 gnuplot is a plotter that lets
  • 42:12 - 42:19
    you take things from standard in I'm not
  • 42:18 - 42:22
    expecting you to know all of these
  • 42:19 - 42:24
    programming languages because they
  • 42:22 - 42:26
    really are programming languages in
  • 42:24 - 42:31
    their own right this is just to show you
  • 42:26 - 42:34
    what is possible right so this is now a
  • 42:31 - 42:37
    histogram of how many times each of the
  • 42:34 - 42:41
    top 5 user names have been used for my
  • 42:37 - 42:44
    server since January 1st and it's just
  • 42:41 - 42:45
    one command line it's somewhat
  • 42:44 - 42:49
    complicated command line but it's just
  • 42:45 - 42:49
    one command line thing that you can do
  • 42:51 - 42:55
    there are two sort of special types of
  • 42:54 - 42:56
    data wrangling that I want to talk to
  • 42:55 - 42:58
    you about in the in the last little bit
  • 42:56 - 43:02
    of time that we have and the first one
  • 42:58 - 43:08
    is command line argument wrangling
  • 43:02 - 43:09
    sometimes you might have something that
  • 43:08 - 43:11
    actually we looked at in the last
  • 43:09 - 43:14
    lecture like you have things like find
  • 43:11 - 43:18
    that produces a list of files or maybe
  • 43:14 - 43:18
    something that produces a list of
  • 43:19 - 43:23
    arguments for your benchmarking script
  • 43:22 - 43:25
    like you want to run it with a
  • 43:23 - 43:26
    particular distribution of arguments
  • 43:25 - 43:29
    like let's say you had a script that
  • 43:26 - 43:30
    printed the number of iterations to run
  • 43:29 - 43:32
    a particular project and you wanted like
  • 43:30 - 43:34
    an exponential distribution or something
  • 43:32 - 43:36
    and this prints the number of iterations
  • 43:34 - 43:38
    on each line and you were to run your
  • 43:36 - 43:39
    benchmark for each one well here is a
  • 43:38 - 43:43
    tool called xargs
  • 43:39 - 43:46
    that's your friend so xargs takes lines
  • 43:43 - 43:48
    of input and turns them into arguments
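That behavior can be sketched safely with echo standing in for a destructive command (the toolchain names below are made up):

```shell
# xargs appends each input line as an argument to the given command.
# Here the command is "echo rustup toolchain uninstall", so nothing is
# actually uninstalled; it just prints what would run.
printf '%s\n' nightly-2019-01-01 nightly-2019-04-29 \
  | xargs echo rustup toolchain uninstall
# → rustup toolchain uninstall nightly-2019-01-01 nightly-2019-04-29
```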
  • 43:46 - 43:50
    and this might
  • 43:48 - 43:52
    look a little weird let's see if I can
  • 43:50 - 43:55
    come up with a good example for this so I
  • 43:52 - 43:57
    program in rust and rust lets you
  • 43:55 - 43:59
    install multiple versions of the
  • 43:57 - 44:01
    compiler so in this case you can see
  • 43:59 - 44:04
    that I have stable beta I have a couple
  • 44:01 - 44:06
    of earlier stable releases and I've
  • 44:04 - 44:09
    launched a different dated Knightley's
  • 44:06 - 44:12
    and this is all very well but over time
  • 44:09 - 44:14
    like I don't really need the nightly
  • 44:12 - 44:15
    version from like March of last year
  • 44:14 - 44:16
    anymore
  • 44:15 - 44:18
    I can probably delete that every now and
  • 44:16 - 44:22
    again and maybe I want to clean these up
  • 44:18 - 44:25
    a little well this is a list of lines so
  • 44:22 - 44:30
    I can grep for nightly and I can get
  • 44:25 - 44:30
    rid of so -v means don't match I don't
  • 44:30 - 44:35
    want to match the current nightly okay so
  • 44:32 - 44:38
    this is a list of dated nightlies
  • 44:35 - 44:43
    maybe I want only the ones from 2019
  • 44:38 - 44:45
    and now I want to remove each of these
  • 44:43 - 44:48
    tool chains from my machine I could copy
  • 44:45 - 44:53
    paste each one into so there's a rustup
  • 44:48 - 44:56
    toolchain remove or uninstall maybe
  • 44:53 - 44:58
    toolchain uninstall right so I could
  • 44:56 - 44:59
    manually type out the name of each one
  • 44:58 - 45:01
    or copy/paste them but that
  • 44:59 - 45:04
    gets annoying really quickly because I
  • 45:01 - 45:11
    have the list right here so instead how
  • 45:04 - 45:15
    about I sed away this sort of this
  • 45:11 - 45:18
    suffix that it adds right so now it's
  • 45:15 - 45:21
    just that and then I use xargs so
  • 45:18 - 45:24
    xargs takes a list of inputs and turns
  • 45:21 - 45:27
    them into arguments so I want this to
  • 45:24 - 45:31
    become arguments to rustup toolchain
  • 45:27 - 45:33
    uninstall and just for my own sanity
  • 45:31 - 45:34
    sake I'm gonna make this echo just so
  • 45:33 - 45:36
    it's going to show which command it's
  • 45:34 - 45:39
    gonna run well it's relatively
  • 45:39 - 45:44
    hard to read but at least you see
  • 45:39 - 45:44
    the command it's going to execute if I
  • 45:42 - 45:46
    remove this echo it's rustup toolchain
  • 45:44 - 45:48
    uninstall and then the list of
  • 45:46 - 45:51
    nightlies as arguments to that program
  • 45:48 - 45:53
    and so if I run this it uninstalls
  • 45:53 - 45:58
    every toolchain instead of me having to
  • 45:53 - 45:58
    copy paste them so this is one example
  • 45:56 - 45:59
    where this kind of data wrangling
  • 45:58 - 46:01
    actually can be useful for other tasks
  • 45:59 - 46:01
    than just looking at data it's just
  • 46:01 - 46:04
    going from one
  • 46:01 - 46:07
    format to another you can also wrangle
  • 46:04 - 46:10
    binary data so a good example of this is
  • 46:07 - 46:12
    stuff like videos and images where you
  • 46:10 - 46:15
    might actually want to operate over them
  • 46:12 - 46:17
    in some interesting way so for example
  • 46:15 - 46:20
    there's a tool called ffmpeg ffmpeg is
  • 46:17 - 46:23
    for encoding and decoding video and to
  • 46:20 - 46:24
    some extent images I'm gonna set its log
  • 46:23 - 46:27
    level to panic because otherwise it
  • 46:24 - 46:31
    prints a bunch of stuff I want it to
  • 46:27 - 46:35
    read from /dev/video0 which is the
  • 46:31 - 46:37
    video device of my webcam and I want it
  • 46:35 - 46:40
    to take the first frame so I just wanted
  • 46:37 - 46:43
    to take a picture and I wanted to take
  • 46:40 - 46:46
    an image rather than a single frame
  • 46:43 - 46:48
    video file and I wanted to print its
  • 46:46 - 46:50
    output so the image it captures to
  • 46:48 - 46:53
    standard output - is usually the way you
  • 46:50 - 46:54
    tell the program to use standard input
  • 46:53 - 46:56
    or output rather than a given file so
  • 46:54 - 46:59
    here it expects a file name and the file
  • 46:56 - 47:01
    name - means standard output in this
  • 46:59 - 47:03
    context and then I want to pipe that
  • 47:01 - 47:06
    through a program called convert
  • 47:03 - 47:08
    convert is an image manipulation program
  • 47:06 - 47:12
    I want to tell convert to read from
  • 47:08 - 47:16
    standard input and turn the image into
  • 47:12 - 47:19
    the color space gray and then write the
  • 47:16 - 47:22
    resulting image into the file - which is
  • 47:19 - 47:25
    standard output and now I want to pipe
  • 47:22 - 47:29
    that into gzip which is just gonna compress
  • 47:25 - 47:31
    this image file and that's also going to
  • 47:29 - 47:33
    just operate on standard input standard
  • 47:31 - 47:38
    output and then I'm going to pipe that
  • 47:33 - 47:41
    to my remote server and on that I'm
  • 47:38 - 47:44
    going to decode that image and then I'm
  • 47:41 - 47:47
    gonna store a copy of that image so
  • 47:44 - 47:49
    remember tee reads input prints it to
  • 47:47 - 47:51
    standard out and to a file this is gonna
  • 47:49 - 47:56
    make a copy of the decoded image file
  • 47:51 - 47:58
    as copy.png and then it's gonna
  • 47:56 - 48:01
    continue to stream that out so now I'm
  • 47:58 - 48:05
    gonna bring that back into a local
  • 48:01 - 48:07
    stream and here I'm going to display
  • 48:05 - 48:09
    that in an image displayer let's see
  • 48:07 - 48:13
    if that works
  • 48:09 - 48:15
    Hey right so this now did a round-trip
  • 48:13 - 48:18
    to my server
  • 48:15 - 48:21
    and then came back over pipes and
  • 48:18 - 48:23
    there should now be a
  • 48:21 - 48:26
    decompressed version of this file at
  • 48:23 - 48:29
    least in theory on my server let's see
  • 48:26 - 48:38
    if that's there: scp the copy.png to
  • 48:29 - 48:41
    here and... yeah hey same file ended
  • 48:38 - 48:44
    up on the server so our pipeline worked
  • 48:41 - 48:46
    again this is a sort of silly example
  • 48:44 - 48:48
    but lets you see the power of building
  • 48:46 - 48:50
    these pipelines where it doesn't have to
  • 48:48 - 48:52
    be textual data it's just taking data
  • 48:50 - 48:55
    from any format to any other like for
  • 48:52 - 48:58
    example if I wanted to I can do cat
  • 48:52 - 48:58
    /dev/video0 and then pipe that to a server
  • 48:58 - 49:03
    that like Anish controls and then he
  • 49:01 - 49:05
    could watch that video stream by piping
  • 49:03 - 49:09
    it into a video player on his machine if
  • 49:05 - 49:13
    we wanted to right you just need to know
  • 47:09 - 47:15
    that these things exist there are a bunch
  • 49:13 - 49:17
    of exercises for this lab and some of
  • 49:15 - 49:19
    them rely on you having a data source
  • 49:17 - 49:21
    that looks a little bit like a log; on
  • 47:19 - 47:21
    macOS and Linux we give you some
  • 49:21 - 49:25
    commands you can try to experiment with
  • 49:22 - 49:27
    but keep in mind that it's not
  • 47:25 - 47:29
    that important exactly what data source
  • 47:27 - 47:30
    you use this is more: find some data
  • 47:29 - 47:32
    source where you think there might
  • 49:30 - 49:34
    be an interesting signal and then try to
  • 49:32 - 49:36
    extract something interesting from it
  • 49:34 - 49:39
    that is what all of the exercises are
  • 49:36 - 49:41
    about we will not have class on Monday
  • 49:39 - 49:43
    because it's MLK Day so next lecture
  • 49:41 - 49:45
    will be Tuesday on command line
  • 49:43 - 49:47
    environments any questions about what
  • 49:45 - 49:51
    we've covered so far or the pipelines or
  • 49:47 - 49:53
    regular expressions I really recommend
  • 49:51 - 49:55
    that you look into regular expressions
  • 49:53 - 49:57
    and try to learn them they are extremely
  • 49:55 - 49:59
    handy both for this and in programming
  • 49:57 - 50:00
    in general and if you have any questions
  • 49:59 - 50:03
    come to office hours and we'll help you
  • 50:00 - 50:03
    out
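    Since the lecture recommends learning regular
    expressions, here is a tiny self-contained taste
    using grep and sed (the log line format below is
    made up for illustration, not from the exercises):

```shell
# Extract the username from "Accepted" lines of a made-up auth log:
# grep filters the lines we care about, then sed's regular expression
# captures the word after "for" and replaces the line with it.
printf 'Jan 1 sshd: Accepted password for alice\nJan 1 sshd: Failed password for bob\n' \
  | grep 'Accepted' \
  | sed -E 's/.*for ([a-z]+)$/\1/'
# prints: alice
```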
Title:
Lecture 4: Data Wrangling (2020)
Video Language:
English
Duration:
50:04
