WEBVTT 00:00:01.310 --> 00:00:06.420 all right so welcome to today's lecture 00:00:04.440 --> 00:00:08.760 which is going to be on data wrangling 00:00:06.420 --> 00:00:10.620 and data wrangling might be a phrase it 00:00:08.760 --> 00:00:12.630 sounds a little bit odd to you but the 00:00:10.620 --> 00:00:14.940 basic idea of data wrangling is that you 00:00:12.630 --> 00:00:16.800 have data in one format and you want it 00:00:14.940 --> 00:00:18.930 in some different format and this 00:00:16.800 --> 00:00:20.820 happens all of the time I'm not just 00:00:18.930 --> 00:00:22.859 talking about like converting images but 00:00:20.820 --> 00:00:25.080 it could be like you have a text file or 00:00:22.859 --> 00:00:27.480 a log file and what you really want this 00:00:25.080 --> 00:00:29.429 data in some other format like you want 00:00:27.480 --> 00:00:32.399 a graph or you want statistics over the 00:00:29.429 --> 00:00:35.160 data anything that goes from one piece 00:00:32.399 --> 00:00:37.110 of data to another representation of 00:00:35.160 --> 00:00:40.079 that data is what I would call data 00:00:37.110 --> 00:00:42.180 wrangling we've seen some examples of 00:00:40.079 --> 00:00:43.739 this kind of data wrangling already 00:00:42.180 --> 00:00:45.750 previously in the semester like 00:00:43.739 --> 00:00:48.000 basically whenever you use the pipe 00:00:45.750 --> 00:00:49.739 operator that lets you sort of take 00:00:48.000 --> 00:00:51.449 output from one program and feed it 00:00:49.739 --> 00:00:54.149 through another program you are doing 00:00:51.449 --> 00:00:55.289 data wrangling in one way or another but 00:00:54.149 --> 00:00:57.960 we're going to do in this lecture is 00:00:55.289 --> 00:00:59.850 take a look at some of the fancier ways 00:00:57.960 --> 00:01:01.859 you can do data wrangling and some of 00:00:59.850 --> 00:01:05.640 the really useful ways you can do data 00:01:01.859 --> 00:01:06.990 wrangling in order to do any kind of 00:01:05.640 
--> 00:01:09.000 data wrangling though you need a data 00:01:06.990 --> 00:01:12.240 source you need some data to operate on 00:01:09.000 --> 00:01:14.400 in the first place and there are a lot 00:01:12.240 --> 00:01:16.560 of good candidates for that kind of data 00:01:14.400 --> 00:01:18.930 we give some examples in the exercise 00:01:16.560 --> 00:01:20.580 section for today's lecture notes in 00:01:18.930 --> 00:01:23.400 this particular one though I'm going to 00:01:20.580 --> 00:01:25.500 be using a system log so I have a server 00:01:23.400 --> 00:01:27.180 that's running somewhere in the Netherlands 00:01:25.500 --> 00:01:29.750 because that seemed like a reasonable 00:01:27.180 --> 00:01:32.790 thing at the time and on that server 00:01:29.750 --> 00:01:34.380 it's running sort of a regular logging 00:01:32.790 --> 00:01:36.630 daemon that comes with systemd 00:01:34.380 --> 00:01:39.030 it's a sort of relatively standard Linux 00:01:36.630 --> 00:01:41.880 logging mechanism and there's a command 00:01:39.030 --> 00:01:44.700 called journalctl on Linux systems that 00:01:41.880 --> 00:01:46.439 will let you view the system log and so 00:01:44.700 --> 00:01:48.689 what I'm gonna do is I'm gonna do some 00:01:46.439 --> 00:01:50.009 transformations over that log and see if 00:01:48.689 --> 00:01:52.829 we can extract something interesting 00:01:50.009 --> 00:01:56.280 from it you'll see though that if I run 00:01:52.829 --> 00:01:59.329 this command I end up with a lot of data 00:01:56.280 --> 00:02:01.979 because this is a log that has just like 00:01:59.329 --> 00:02:03.360 there's a lot of stuff in it right a lot 00:02:01.979 --> 00:02:06.299 of things have happened on my server and 00:02:03.360 --> 00:02:08.250 this goes back to like January first and 00:02:06.299 --> 00:02:10.560 there are logs that go even further back on 00:02:08.250 --> 00:02:12.120 this there's a lot of stuff so the first 00:02:10.560 --> 00:02:13.440 thing we're gonna do is try to 
limit it 00:02:12.120 --> 00:02:16.260 down to only 00:02:13.440 --> 00:02:18.060 one piece of content and here the grep 00:02:16.260 --> 00:02:19.830 command is your friend so we're gonna 00:02:18.060 --> 00:02:23.220 pipe this through grep and we're gonna 00:02:19.830 --> 00:02:24.810 grep for SSH right so SSH we haven't 00:02:23.220 --> 00:02:26.760 really talked to you about yet but it is 00:02:24.810 --> 00:02:28.560 a way to access computers remotely 00:02:26.760 --> 00:02:30.780 through the command line and in 00:02:28.560 --> 00:02:32.190 particular what happens when you put a 00:02:30.780 --> 00:02:34.080 server on the public Internet is that 00:02:32.190 --> 00:02:35.700 lots and lots of people around the world 00:02:34.080 --> 00:02:37.530 try to connect to it and log in and 00:02:35.700 --> 00:02:39.360 take over your server and so I want to 00:02:37.530 --> 00:02:41.480 see how those people are trying to do 00:02:39.360 --> 00:02:44.850 that and so I'm going to grep for SSH 00:02:41.480 --> 00:02:47.700 and you'll see pretty quickly that this 00:02:44.850 --> 00:02:51.270 also generates a bunch of content at 00:02:47.700 --> 00:02:55.980 least in theory this is gonna be real 00:02:51.270 --> 00:02:58.650 slow there we go so this generates tons 00:02:55.980 --> 00:03:00.240 and tons and tons of content and it's 00:02:58.650 --> 00:03:01.860 really hard to even just visualize 00:03:00.240 --> 00:03:05.070 what's going on here so let's look at 00:03:01.860 --> 00:03:06.660 only what user names people have used to 00:03:05.070 --> 00:03:09.780 try to log into my server so you'll see 00:03:06.660 --> 00:03:12.540 some of these lines say disconnected 00:03:09.780 --> 00:03:14.940 disconnected from invalid user and then 00:03:12.540 --> 00:03:17.430 some user name I want only those lines 00:03:14.940 --> 00:03:19.080 that's all I really care about I'm gonna 00:03:17.430 --> 00:03:21.750 make one more change here though which 00:03:19.080 --> 
00:03:26.459 is if you think about what this pipeline 00:03:21.750 --> 00:03:29.160 does if I here do disconnected from so 00:03:26.459 --> 00:03:31.320 this pipeline at the bottom here what 00:03:29.160 --> 00:03:33.420 that will do is it will send the entire 00:03:31.320 --> 00:03:36.209 log file over the network to my machine 00:03:33.420 --> 00:03:38.250 and then locally run grep to find only 00:03:36.209 --> 00:03:40.530 the lines that contained ssh and then 00:03:38.250 --> 00:03:42.150 locally filter them further this seems a 00:03:40.530 --> 00:03:44.220 little bit wasteful because I don't care 00:03:42.150 --> 00:03:45.959 about most of these lines and the remote 00:03:44.220 --> 00:03:48.900 side is also running a shell so what I 00:03:45.959 --> 00:03:51.510 can actually do is I can have that 00:03:48.900 --> 00:03:53.519 entire command run on the server right 00:03:51.510 --> 00:03:55.200 so I'm telling SSH the command I 00:03:53.519 --> 00:03:57.420 want you to run on the server is this 00:03:55.200 --> 00:04:01.230 pipeline of three things and then what I 00:03:57.420 --> 00:04:02.700 get back I want to pipe through less so 00:04:01.230 --> 00:04:04.260 what does this do well it's gonna do 00:04:02.700 --> 00:04:06.150 that same filtering that we did but it's 00:04:04.260 --> 00:04:08.280 gonna do it on the server side and the 00:04:06.150 --> 00:04:11.730 server is only going to send me those 00:04:08.280 --> 00:04:13.290 lines that I care about and then when I 00:04:11.730 --> 00:04:16.320 pipe it locally through the program 00:04:13.290 --> 00:04:17.519 called less less is a pager you'll see 00:04:16.320 --> 00:04:19.290 some examples of this you've actually 00:04:17.519 --> 00:04:21.900 seen some of them already like when you 00:04:19.290 --> 00:04:24.180 type man and some command that opens in 00:04:21.900 --> 00:04:26.669 a pager and a pager is a convenient way 00:04:24.180 --> 00:04:27.389 to take a long piece of content and fit 
00:04:26.669 --> 00:04:29.759 it into your terminal 00:04:27.389 --> 00:04:31.889 window and have you scroll down and 00:04:29.759 --> 00:04:33.150 scroll up and navigate it so that it 00:04:31.889 --> 00:04:36.120 doesn't just like scroll past your 00:04:33.150 --> 00:04:37.409 screen and so if I run this it still 00:04:36.120 --> 00:04:40.800 takes a little while because it has to 00:04:37.409 --> 00:04:42.919 parse through a lot of log files and in 00:04:40.800 --> 00:04:45.930 particular grep is buffering and 00:04:42.919 --> 00:04:46.919 therefore it decides to be relatively 00:04:45.930 --> 00:04:56.039 unhelpful 00:04:46.919 --> 00:05:01.259 I may do this without let's see if 00:04:56.039 --> 00:05:05.189 that's more helpful why doesn't it want 00:05:01.259 --> 00:05:09.949 to be helpful to me fine I'm gonna cheat 00:05:05.189 --> 00:05:09.949 a little just ignore me 00:05:17.380 --> 00:05:22.520 or the internet is really slow those are 00:05:20.570 --> 00:05:27.140 two possible options luckily there's a 00:05:22.520 --> 00:05:30.470 fix for that because previously I have 00:05:27.140 --> 00:05:33.080 run the following command so this 00:05:30.470 --> 00:05:34.340 command just takes the output of that 00:05:33.080 --> 00:05:36.560 command and sticks it into a file 00:05:34.340 --> 00:05:38.660 locally on my computer alright so I ran 00:05:36.560 --> 00:05:40.970 this when I was up in my office and so 00:05:38.660 --> 00:05:43.490 what this did is it downloaded all of 00:05:40.970 --> 00:05:45.530 the SSH log entries that matched 00:05:43.490 --> 00:05:47.330 disconnected from so I have those locally 00:05:45.530 --> 00:05:49.070 and this is really handy right there's 00:05:47.330 --> 00:05:50.990 no reason for me to stream the full log 00:05:49.070 --> 00:05:52.640 every single time because I know that 00:05:50.990 --> 00:05:55.220 that starting pattern is what I'm going 00:05:52.640 --> 00:05:57.260 to want anyway so we can take a look at 00:05:55.220 --> 
00:05:59.480 ssh.log and you will see there are 00:05:57.260 --> 00:06:01.760 lots and lots and lots of lines that all 00:05:59.480 --> 00:06:04.940 say disconnected from invalid user 00:06:01.760 --> 00:06:06.230 authenticating users etc right so these 00:06:04.940 --> 00:06:08.870 are the lines that we have to work on 00:06:06.230 --> 00:06:10.550 and this also means that going forward 00:06:08.870 --> 00:06:12.500 we don't have to go through this whole 00:06:10.550 --> 00:06:16.220 SSH process we can just cat that file 00:06:12.500 --> 00:06:18.080 and then operate on it directly so 00:06:16.220 --> 00:06:21.680 here I can also demonstrate this pager 00:06:18.080 --> 00:06:23.720 so if I do cat ssh.log 00:06:21.680 --> 00:06:25.220 and I pipe it through less it gives me a 00:06:23.720 --> 00:06:28.850 pager where I can scroll up and down 00:06:25.220 --> 00:06:30.560 make that a little bit smaller maybe so 00:06:28.850 --> 00:06:33.320 I can scroll this file scroll through 00:06:30.560 --> 00:06:36.260 this file and I can do so with what are 00:06:33.320 --> 00:06:37.820 roughly vim bindings so control U to 00:06:36.260 --> 00:06:42.770 scroll up control D to scroll down and 00:06:37.820 --> 00:06:45.169 Q to exit this is still a lot of 00:06:42.770 --> 00:06:47.000 content though and these lines contain a 00:06:45.169 --> 00:06:48.440 bunch of garbage that I'm not really 00:06:47.000 --> 00:06:50.030 interested in what I really want to see 00:06:48.440 --> 00:06:52.610 is what are what are these user names 00:06:50.030 --> 00:06:55.790 and here the tool that we're going to 00:06:52.610 --> 00:06:59.210 start using is one called sed sed is a 00:06:55.790 --> 00:07:01.040 stream editor that's modified or it's it's 00:06:59.210 --> 00:07:04.100 a modification of a much earlier program 00:07:01.040 --> 00:07:05.540 called ed which was a really weird 00:07:04.100 --> 00:07:12.320 editor that none of you will probably 00:07:05.540 --> 
00:07:16.270 want to use yeah oh tsp is the name of 00:07:12.320 --> 00:07:16.270 my the remote computer I'm connecting to 00:07:16.390 --> 00:07:23.720 so sed is a stream editor and it 00:07:19.850 --> 00:07:26.060 basically lets you make changes to the 00:07:23.720 --> 00:07:28.490 contents of a stream you can think of it 00:07:26.060 --> 00:07:29.870 a little bit like doing replacements but 00:07:28.490 --> 00:07:30.410 it's actually a full programming 00:07:29.870 --> 00:07:33.440 language 00:07:30.410 --> 00:07:35.180 over the stream that is given one of the 00:07:33.440 --> 00:07:38.060 most common things you do with sed 00:07:35.180 --> 00:07:40.610 though is to just run replacement 00:07:38.060 --> 00:07:44.590 expressions on an input stream what do 00:07:40.610 --> 00:07:44.590 these look like well let me show you 00:07:45.160 --> 00:07:50.000 here I'm gonna pipe this to sed and 00:07:47.780 --> 00:07:52.540 I'm going to say that I want to remove 00:07:50.000 --> 00:07:58.370 everything that comes before 00:07:52.540 --> 00:08:00.980 disconnected from so this might look a 00:07:58.370 --> 00:08:03.950 little weird the observation is that the 00:08:00.980 --> 00:08:06.230 date and the host name and the sort of 00:08:03.950 --> 00:08:07.310 process ID of the SSH daemon I don't 00:08:06.230 --> 00:08:09.740 care about I can just remove that 00:08:07.310 --> 00:08:11.930 straightaway and I can also remove that 00:08:09.740 --> 00:08:13.580 like disconnected from bit because that 00:08:11.930 --> 00:08:15.170 seems to be present in every single log 00:08:13.580 --> 00:08:18.200 entry so I just want to get rid of it 00:08:15.170 --> 00:08:20.360 and so what I write is a sed expression 00:08:18.200 --> 00:08:21.980 in this particular case it's an s 00:08:20.360 --> 00:08:25.730 expression which is a substitute 00:08:21.980 --> 00:08:27.620 expression it takes two arguments that 00:08:25.730 --> 00:08:30.590 are basically enclosed in these slashes 
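NOTE
A minimal sketch of the substitute expression being described at this point, assuming a made-up log line in the general shape the lecture describes. The host name, PID, address, and port are placeholders, not real server output. The empty replacement between the last two slashes deletes whatever the search pattern matched.
```shell
# Hypothetical log line for illustration only; every field value is a placeholder.
line='Jan 17 03:13:00 myhost sshd[2631]: Disconnected from invalid user admin 192.0.2.10 port 55920 [preauth]'
# s/SEARCH/REPLACEMENT/ with an empty REPLACEMENT: everything up to and
# including "Disconnected from " matches the search pattern and is removed.
trimmed=$(printf '%s\n' "$line" | sed 's/.*Disconnected from //')
printf '%s\n' "$trimmed"
```
What is left on the line is the part starting at "invalid user", which is the portion the rest of the lecture keeps narrowing down.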
00:08:27.620 --> 00:08:32.360 so the first one is the search string 00:08:30.590 --> 00:08:34.430 and the second one which is currently 00:08:32.360 --> 00:08:36.470 empty is a replacement string so here 00:08:34.430 --> 00:08:39.560 I'm saying search for the following 00:08:36.470 --> 00:08:40.820 pattern and replace it with blank and 00:08:39.560 --> 00:08:43.099 then I'm gonna pipe it into less at the 00:08:40.820 --> 00:08:45.380 end do you see that now what it's done 00:08:43.099 --> 00:08:49.760 is trim off the beginning of all these 00:08:45.380 --> 00:08:52.220 lines and that seems really handy but 00:08:49.760 --> 00:08:54.740 you might wonder what is this pattern 00:08:52.220 --> 00:08:57.890 that I've built up here right this is 00:08:54.740 --> 00:08:59.480 this dot star what does that mean this 00:08:57.890 --> 00:09:01.820 is an example of a regular expression 00:08:59.480 --> 00:09:03.620 and regular expressions are something 00:09:01.820 --> 00:09:04.970 that you may have come across in 00:09:03.620 --> 00:09:06.710 programming in the past 00:09:04.970 --> 00:09:08.030 but it's something that once you go into 00:09:06.710 --> 00:09:09.920 the command line you will find yourself 00:09:08.030 --> 00:09:12.550 using a lot especially for this kind of 00:09:09.920 --> 00:09:16.040 data wrangling regular expressions are 00:09:12.550 --> 00:09:18.080 essentially a powerful way to match text 00:09:16.040 --> 00:09:19.580 you can use it for other things than 00:09:18.080 --> 00:09:23.030 text too but text is the most common 00:09:19.580 --> 00:09:26.840 example and in regular expressions you 00:09:23.030 --> 00:09:29.810 have a number of special characters that 00:09:26.840 --> 00:09:31.580 say don't just match this character but 00:09:29.810 --> 00:09:34.210 match for example a particular type of 00:09:31.580 --> 00:09:36.980 character or a particular set of options 00:09:34.210 --> 00:09:39.770 it essentially generates a program for 00:09:36.980 --> 
00:09:42.040 you that searches the given text dot for 00:09:39.770 --> 00:09:46.000 example means any single 00:09:42.040 --> 00:09:48.730 character and star if you follow a 00:09:46.000 --> 00:09:51.910 character with a star it means zero or 00:09:48.730 --> 00:09:54.399 more of that character and so in this 00:09:51.910 --> 00:09:57.579 case this pattern is saying zero or more 00:09:54.399 --> 00:10:00.490 of any character followed by the literal 00:09:57.579 --> 00:10:02.680 string disconnected from I'm saying 00:10:00.490 --> 00:10:05.560 match that and then replace it with 00:10:02.680 --> 00:10:07.660 blank regular expressions have a number 00:10:05.560 --> 00:10:09.310 of these kind of special characters that 00:10:07.660 --> 00:10:11.500 have various meanings you can take 00:10:09.310 --> 00:10:12.459 advantage of I talked about star which 00:10:11.500 --> 00:10:14.560 is zero or more 00:10:12.459 --> 00:10:16.149 there's also plus which is one or more 00:10:14.560 --> 00:10:17.620 right so this is saying I want the 00:10:16.149 --> 00:10:19.139 previous expression to match at least 00:10:17.620 --> 00:10:22.509 once 00:10:19.139 --> 00:10:24.910 you also have square brackets so square 00:10:22.509 --> 00:10:27.180 brackets let you match one of many 00:10:24.910 --> 00:10:29.800 different characters so here let us 00:10:27.180 --> 00:10:36.370 build up a string let's say something like 00:10:29.800 --> 00:10:41.680 aba and I want to substitute a and b with 00:10:36.370 --> 00:10:43.899 nothing okay so here what I'm telling 00:10:41.680 --> 00:10:46.540 the pattern to do is to replace any 00:10:43.899 --> 00:10:50.079 character that is either a or b with 00:10:46.540 --> 00:10:52.810 nothing so if I make the first character 00:10:50.079 --> 00:10:54.100 b it will still produce ba you might 00:10:52.810 --> 00:10:56.019 wonder though why did it only replace 00:10:54.100 --> 00:10:57.699 once well it's because what regular 00:10:56.019 --> 00:11:00.160 expressions 
will do especially in this 00:10:57.699 --> 00:11:01.569 default mode is they will just match the 00:11:00.160 --> 00:11:04.269 pattern once and then apply the 00:11:01.569 --> 00:11:07.360 replacement once per line that is what 00:11:04.269 --> 00:11:09.279 sed normally does you can provide the g 00:11:07.360 --> 00:11:12.250 modifier which says do this as many 00:11:09.279 --> 00:11:14.139 times as it keeps matching which in this 00:11:12.250 --> 00:11:15.790 case would erase the entire line because 00:11:14.139 --> 00:11:18.699 every single character is either an a or 00:11:15.790 --> 00:11:21.100 a b if I added a c here it'd remove 00:11:18.699 --> 00:11:23.019 everything but the c if I added other 00:11:21.100 --> 00:11:24.370 characters in the middle of this string 00:11:23.019 --> 00:11:26.260 somewhere they would all be preserved 00:11:24.370 --> 00:11:34.209 but anything that is an a or a b is 00:11:26.260 --> 00:11:37.889 removed you can also do things like add 00:11:34.209 --> 00:11:37.889 modifiers to this for example 00:11:42.330 --> 00:11:51.730 what would this do this is saying I want 00:11:46.720 --> 00:11:52.800 zero or more of the string ab and I'm 00:11:51.730 --> 00:11:55.270 gonna replace them with nothing 00:11:52.800 --> 00:11:57.400 this means that if I have a standalone a 00:11:55.270 --> 00:11:59.560 it will not be replaced if I have a 00:11:57.400 --> 00:12:01.540 standalone b it will not be replaced but 00:11:59.560 --> 00:12:09.580 if I have the string ab it will be 00:12:01.540 --> 00:12:11.940 removed which yeah what are they sed is 00:12:09.580 --> 00:12:11.940 stupid 00:12:12.340 --> 00:12:18.250 the -E here is because sed is a really 00:12:15.160 --> 00:12:19.930 old tool and so it supports only a very 00:12:18.250 --> 00:12:22.270 old version of regular expressions 00:12:19.930 --> 00:12:24.070 generally you will want to run it with - 00:12:22.270 --> 00:12:25.810 capital E which makes it use a more 00:12:24.070 --> 
00:12:28.620 modern syntax that supports more things 00:12:25.810 --> 00:12:30.940 if you are in a place where you can't 00:12:28.620 --> 00:12:33.160 you have to prefix these with back 00:12:30.940 --> 00:12:35.650 slashes to say I want the special 00:12:33.160 --> 00:12:37.180 meaning of parentheses otherwise they 00:12:35.650 --> 00:12:39.990 will just match a literal parenthesis 00:12:37.180 --> 00:12:43.510 which is probably not what you want so 00:12:39.990 --> 00:12:46.390 notice how this replaced the ab here 00:12:43.510 --> 00:12:48.790 and it replaced the ab here but it 00:12:46.390 --> 00:12:51.040 left this c and it also left the a at 00:12:48.790 --> 00:12:54.100 the end because that a does not match 00:12:51.040 --> 00:12:55.740 this pattern anymore and you can group 00:12:54.100 --> 00:12:58.180 these patterns in whatever ways you want 00:12:55.740 --> 00:13:00.850 you also have things like alternations 00:12:58.180 --> 00:13:07.420 you can say anything that matches ab or 00:13:00.850 --> 00:13:10.510 bc I want to remove and here you'll 00:13:07.420 --> 00:13:12.220 notice that this ab got removed this bc 00:13:10.510 --> 00:13:14.740 did not get removed even though it 00:13:12.220 --> 00:13:17.950 matches the pattern because the ab had 00:13:14.740 --> 00:13:20.500 already been removed this ab is removed 00:13:17.950 --> 00:13:22.960 right but the c stays in place this ab 00:13:20.500 --> 00:13:25.870 is removed and this c stays because it 00:13:22.960 --> 00:13:29.470 still does not match that if I made this 00:13:25.870 --> 00:13:31.750 if I remove this a then now this ab 00:13:29.470 --> 00:13:34.000 pattern will not match this b so it'll 00:13:31.750 --> 00:13:36.280 be preserved and then bc will match bc 00:13:34.000 --> 00:13:37.810 and it'll go away 00:13:36.280 --> 00:13:39.940 regular expressions can be all sorts of 00:13:37.810 --> 00:13:41.530 complicated when you first encounter 00:13:39.940 --> 00:13:42.790 them and even once 
you get more 00:13:41.530 --> 00:13:45.160 experience with them they can be 00:13:42.790 --> 00:13:47.770 daunting to look at and this is why very 00:13:45.160 --> 00:13:49.600 often you want to use something like a 00:13:47.770 --> 00:13:51.700 regular expression debugger which we'll 00:13:49.600 --> 00:13:52.560 look at in a little bit but first let's 00:13:51.700 --> 00:13:55.500 try to make up a 00:13:52.560 --> 00:13:57.300 pattern that will match the logs and and 00:13:55.500 --> 00:14:00.390 match the logs that we've been working 00:13:57.300 --> 00:14:02.070 with so far so here I'm gonna just sort 00:14:00.390 --> 00:14:04.680 of extract a couple of lines from this 00:14:02.070 --> 00:14:08.910 file let's say the first five so these 00:14:04.680 --> 00:14:12.300 lines all now look like this right and 00:14:08.910 --> 00:14:15.360 what we want to do is we want to only 00:14:12.300 --> 00:14:21.210 have the user name okay so what might 00:14:15.360 --> 00:14:30.120 this look like well here's one thing we 00:14:21.210 --> 00:14:32.670 could try to do actually let me show you 00:14:30.120 --> 00:14:34.370 one thing first let me take a 00:14:32.670 --> 00:14:38.990 line that says something like 00:14:34.370 --> 00:14:44.279 disconnected from invalid user 00:14:38.990 --> 00:14:46.620 disconnected from maybe four to one one 00:14:44.279 --> 00:14:49.740 whatever okay so this is an example of a 00:14:46.620 --> 00:14:54.200 log line where someone tried to log in 00:14:49.740 --> 00:14:54.200 with the username disconnected from 00:14:54.500 --> 00:15:05.400 missing an S disconnected thank you 00:15:03.200 --> 00:15:08.310 you'll notice that this actually removed 00:15:05.400 --> 00:15:10.770 the username as well and this is because 00:15:08.310 --> 00:15:11.940 when you use dot star and any of these 00:15:10.770 --> 00:15:14.490 sort of range expressions in regular 00:15:11.940 --> 00:15:17.070 expressions they are greedy they will 00:15:14.490 --> 
00:15:19.890 match as much as they can so in this 00:15:17.070 --> 00:15:22.130 case this was the username that we 00:15:19.890 --> 00:15:24.930 wanted to retain but this pattern 00:15:22.130 --> 00:15:27.060 actually matched all the way up until 00:15:24.930 --> 00:15:28.620 the second occurrence of it or the last 00:15:27.060 --> 00:15:30.960 occurrence of it and so everything 00:15:28.620 --> 00:15:33.000 before it including the username itself 00:15:30.960 --> 00:15:34.470 got removed and so we need to come up 00:15:33.000 --> 00:15:36.150 with a slightly cleverer matching 00:15:34.470 --> 00:15:38.190 strategy than just saying sort of dot 00:15:36.150 --> 00:15:39.959 star because it means that if we have 00:15:38.190 --> 00:15:41.339 particularly adversarial input we might 00:15:39.959 --> 00:15:44.430 end up with something that we didn't 00:15:41.339 --> 00:15:47.670 expect okay so let's see how we might 00:15:44.430 --> 00:15:56.850 try to match these lines let's just do a 00:15:47.670 --> 00:16:00.660 head first well let's try to construct 00:15:56.850 --> 00:16:02.970 this up from the beginning we first of 00:16:00.660 --> 00:16:05.190 all know that we want - capital E right 00:16:02.970 --> 00:16:07.170 because we want to not have to put all 00:16:05.190 --> 00:16:09.839 these backslashes everywhere 00:16:07.170 --> 00:16:14.880 these lines look like they say from and 00:16:09.839 --> 00:16:16.769 then some of them say invalid but some 00:16:14.880 --> 00:16:19.170 of them do not right this line has 00:16:16.769 --> 00:16:21.690 invalid that one does not question mark 00:16:19.170 --> 00:16:26.029 here is saying zero or one so I want 00:16:21.690 --> 00:16:31.320 zero or zero or one of invalid space 00:16:26.029 --> 00:16:34.320 user what else well that's going to be a 00:16:31.320 --> 00:16:36.529 double space so we can't have that and 00:16:34.320 --> 00:16:40.440 then there's gonna be some username and 00:16:36.529 --> 00:16:43.160 then there's 
gonna be what exactly is 00:16:40.440 --> 00:16:46.290 gonna be what looks like an IP address 00:16:43.160 --> 00:16:50.190 so here we can use our range syntax and 00:16:46.290 --> 00:16:53.490 say zero to nine and a dot right that's 00:16:50.190 --> 00:16:58.170 what IP addresses are and we want many 00:16:53.490 --> 00:17:00.300 of those then it says port so we're 00:16:58.170 --> 00:17:03.060 just going to match a literal port and 00:17:00.300 --> 00:17:07.980 then another number zero to nine and 00:17:03.060 --> 00:17:09.150 we're going to want plus of that the 00:17:07.980 --> 00:17:10.049 other thing we're going to do here is 00:17:09.150 --> 00:17:11.880 we're going to do what's known as 00:17:10.049 --> 00:17:13.439 anchoring the regular expression so 00:17:11.880 --> 00:17:15.780 there are two special characters in 00:17:13.439 --> 00:17:17.699 regular expressions there's caret or 00:17:15.780 --> 00:17:19.799 hat which matches the beginning of a 00:17:17.699 --> 00:17:22.439 line and there's dollar which matches 00:17:19.799 --> 00:17:24.839 the end of a line so here we're gonna 00:17:22.439 --> 00:17:27.990 say that this regular expression has to match 00:17:24.839 --> 00:17:29.760 the complete line the reason we do this 00:17:27.990 --> 00:17:33.290 is because imagine that someone made 00:17:29.760 --> 00:17:35.250 their username the entire log string 00:17:33.290 --> 00:17:38.460 then now if you try to match this 00:17:35.250 --> 00:17:40.730 pattern it would match the username 00:17:38.460 --> 00:17:42.990 itself which is not what we want 00:17:40.730 --> 00:17:44.490 generally you will want to try to anchor 00:17:42.990 --> 00:17:46.860 your patterns wherever you can to avoid 00:17:44.490 --> 00:17:49.919 those kind of oddities okay let's see 00:17:46.860 --> 00:17:51.960 what that gave us that removed many of 00:17:49.919 --> 00:17:54.360 the lines but not all of them so this 00:17:51.960 --> 00:17:56.880 one for example includes this preauth at 
00:17:54.360 --> 00:18:02.760 the end so we'll want to cut that off if 00:17:56.880 --> 00:18:04.549 there's a space preauth square brackets 00:18:02.760 --> 00:18:07.350 are special so we need to escape them 00:18:04.549 --> 00:18:10.650 right now let's see what happens if we 00:18:07.350 --> 00:18:12.360 try more lines of this no it still gets 00:18:10.650 --> 00:18:13.710 something weird some of these lines are 00:18:12.360 --> 00:18:16.740 not empty right which means that the 00:18:13.710 --> 00:18:18.990 pattern did not match this one for 00:18:16.740 --> 00:18:20.010 example it says authenticating user 00:18:18.990 --> 00:18:24.690 instead of invalid 00:18:20.010 --> 00:18:27.300 user okay so it has to match invalid or 00:18:24.690 --> 00:18:30.900 authenticating zero or one time before 00:18:27.300 --> 00:18:34.530 user how about now okay that looks 00:18:30.900 --> 00:18:36.990 pretty promising but this output is not 00:18:34.530 --> 00:18:38.880 particularly helpful right here we've 00:18:36.990 --> 00:18:41.360 just erased every line of our log files 00:18:38.880 --> 00:18:43.890 successfully which is not very helpful 00:18:41.360 --> 00:18:46.110 instead what we really wanted to do is 00:18:43.890 --> 00:18:48.780 when we match the username right over 00:18:46.110 --> 00:18:50.310 here we really wanted to remember what 00:18:48.780 --> 00:18:53.310 that username was because that is what 00:18:50.310 --> 00:18:55.770 we want to print out and the way we can 00:18:53.310 --> 00:19:00.300 do that in regular expressions is using 00:18:55.770 --> 00:19:03.630 something like capture groups so capture 00:19:00.300 --> 00:19:06.570 groups are a way to say that I want to 00:19:03.630 --> 00:19:10.350 remember this value and reuse it later 00:19:06.570 --> 00:19:12.180 and in regular expressions any bracketed 00:19:10.350 --> 00:19:14.460 expression any parenthesized expression is 00:19:12.180 --> 00:19:16.770 going to be such a capture group so we 00:19:14.460 --> 
00:19:18.570 already actually have one here which is 00:19:16.770 --> 00:19:20.850 this first group and now we're creating 00:19:18.570 --> 00:19:22.590 a second one here notice that these 00:19:20.850 --> 00:19:24.870 parentheses don't do anything to the 00:19:22.590 --> 00:19:27.210 matching right because they're just 00:19:24.870 --> 00:19:28.800 saying this expression as a unit but we 00:19:27.210 --> 00:19:32.550 don't have any modifiers after it so 00:19:28.800 --> 00:19:34.980 it just matches one time and then the 00:19:32.550 --> 00:19:36.810 reason matching groups are are useful or 00:19:34.980 --> 00:19:38.370 capture groups are useful is because you 00:19:36.810 --> 00:19:40.920 can refer back to them in the 00:19:38.370 --> 00:19:43.800 replacement so in the replacement here I 00:19:40.920 --> 00:19:45.630 can say backslash two this is the way 00:19:43.800 --> 00:19:47.760 that you refer to the name of a capture 00:19:45.630 --> 00:19:50.250 group in this case I'm 00:19:47.760 --> 00:19:53.340 saying match the entire line and then in 00:19:50.250 --> 00:19:55.380 the replacement put in the value you 00:19:53.340 --> 00:19:57.330 captured in the second capture group 00:19:55.380 --> 00:20:00.020 right remember this is the first capture 00:19:57.330 --> 00:20:03.330 group and this is the second one and 00:20:00.020 --> 00:20:05.670 this gives me all the usernames now if 00:20:03.330 --> 00:20:08.580 you look back at what we wrote this is 00:20:05.670 --> 00:20:10.050 pretty complicated right it might make 00:20:08.580 --> 00:20:12.000 sense now that we walked through it and 00:20:10.050 --> 00:20:14.130 why it had to be the way it was but this 00:20:12.000 --> 00:20:16.140 is like not obvious that this is how 00:20:14.130 --> 00:20:19.680 these lines work and this is where a 00:20:16.140 --> 00:20:22.260 regular expression debugger can come in 00:20:19.680 --> 00:20:25.410 really really handy so we have one here 00:20:22.260 --> 00:20:27.510 
there are many online but here I've sort 00:20:25.410 --> 00:20:31.710 of prefilled in this expression that we 00:20:27.510 --> 00:20:34.380 just used and notice that it it tells me 00:20:31.710 --> 00:20:37.470 all the matching it does in fact now this 00:20:34.380 --> 00:20:42.950 window is a little small with this font 00:20:37.470 --> 00:20:45.620 size but if I do here this explanation 00:20:42.950 --> 00:20:48.320 says dot star matches any character 00:20:45.620 --> 00:20:52.170 between zero and unlimited times 00:20:48.320 --> 00:20:54.270 followed by disconnected from literally 00:20:52.170 --> 00:20:56.790 followed by a capture group and then 00:20:54.270 --> 00:20:59.190 walks you through all the stuff and 00:20:56.790 --> 00:21:00.960 that's one thing but it also lets you 00:20:59.190 --> 00:21:03.510 give it a test string and then matches the 00:21:00.960 --> 00:21:05.370 pattern against every single test string 00:21:03.510 --> 00:21:07.460 that you give and highlights what the 00:21:05.370 --> 00:21:11.490 different capture groups for example are 00:21:07.460 --> 00:21:15.060 so here we made user a capture group 00:21:11.490 --> 00:21:16.980 right so it'll say okay the full string 00:21:15.060 --> 00:21:19.110 matched right the whole thing is blue so 00:21:16.980 --> 00:21:21.180 it matched green is the first capture 00:21:19.110 --> 00:21:23.370 group red is the second capture group 00:21:21.180 --> 00:21:26.130 and this is the third because preauth 00:21:23.370 --> 00:21:27.750 was also put into parentheses and this 00:21:26.130 --> 00:21:31.020 can be a handy way to try to debug your 00:21:27.750 --> 00:21:35.610 regular expressions for example if I put 00:21:31.020 --> 00:21:41.070 disconnected from and let's add a new 00:21:35.610 --> 00:21:45.240 line here and I make the username 00:21:41.070 --> 00:21:46.530 disconnected from now that line already 00:21:45.240 --> 00:21:49.950 had the username be disconnected from 00:21:46.530 --> 00:21:54.150 
great here's me thinking ahead you'll 00:21:49.950 --> 00:21:56.010 notice that with this pattern this was 00:21:54.150 --> 00:21:58.740 no longer a problem because it got 00:21:56.010 --> 00:22:02.580 matched as the username what happens if we 00:21:58.740 --> 00:22:07.170 take this entire line or this entire 00:22:02.580 --> 00:22:13.830 line and make that the username now what 00:22:07.170 --> 00:22:15.180 happens it gets really confused right so 00:22:13.830 --> 00:22:18.390 this is where regular expressions can be 00:22:15.180 --> 00:22:21.780 a pain to get right because it now tries 00:22:18.390 --> 00:22:23.970 to match it matches the first place 00:22:21.780 --> 00:22:27.420 where username appears or the first 00:22:23.970 --> 00:22:29.700 invalid in this case the second invalid 00:22:27.420 --> 00:22:31.830 because this is greedy we can make this 00:22:29.700 --> 00:22:36.360 non greedy by putting a question mark 00:22:31.830 --> 00:22:38.520 here so if you suffix a plus or a star 00:22:36.360 --> 00:22:40.860 with a question mark it becomes a non 00:22:38.520 --> 00:22:42.540 greedy match so it will not try to match 00:22:40.860 --> 00:22:43.820 as much as possible and then you see 00:22:42.540 --> 00:22:46.030 that this actually gets parsed correctly 00:22:43.820 --> 00:22:47.950 because this dot star 00:22:46.030 --> 00:22:49.480 will stop at the first disconnected 00:22:47.950 --> 00:22:52.450 from which is the one that's actually 00:22:49.480 --> 00:22:57.070 emitted by SSH the one that actually 00:22:52.450 --> 00:22:58.720 appears in our logs as you can probably 00:22:57.070 --> 00:23:00.790 tell from the explanation of this so far 00:22:58.720 --> 00:23:03.130 regular expressions can get really 00:23:00.790 --> 00:23:05.320 complicated and there are all sorts of 00:23:03.130 --> 00:23:07.330 weird modifiers that you might have to 00:23:05.320 --> 00:23:09.130 apply in your pattern the only way to 00:23:07.330 --> 00:23:10.750 really learn them is to 
start with 00:23:09.130 --> 00:23:12.970 simple ones and then build them up until 00:23:10.750 --> 00:23:14.860 they match what you need often you're 00:23:12.970 --> 00:23:16.150 just doing some like one-off job like 00:23:14.860 --> 00:23:17.770 when we're hacking out the usernames 00:23:16.150 --> 00:23:19.870 here and you don't need to care about 00:23:17.770 --> 00:23:21.610 all the special conditions right you 00:23:19.870 --> 00:23:24.190 don't have to care about someone having 00:23:21.610 --> 00:23:26.020 the SSH username perfectly match your 00:23:24.190 --> 00:23:27.430 log format that's probably not 00:23:26.020 --> 00:23:29.440 something that matters because you're 00:23:27.430 --> 00:23:30.730 just trying to find the usernames but 00:23:29.440 --> 00:23:32.710 regular expressions are really powerful 00:23:30.730 --> 00:23:33.730 and you want to be careful if you're 00:23:32.710 --> 00:23:36.870 doing something where it actually 00:23:33.730 --> 00:23:36.870 matters you had a question 00:23:41.380 --> 00:23:47.560 regular expressions by default only 00:23:43.510 --> 00:23:58.630 match per line anyway they will not 00:23:47.560 --> 00:24:01.210 match across new lines so the way 00:23:58.630 --> 00:24:04.680 that sed works is that it operates per 00:24:01.210 --> 00:24:10.390 line and so sed will do this 00:24:04.680 --> 00:24:12.250 expression for every line okay questions 00:24:10.390 --> 00:24:14.410 about regular expressions or this pattern 00:24:12.250 --> 00:24:16.390 so far it is a complicated pattern so if 00:24:14.410 --> 00:24:17.560 it feels confusing like don't be 00:24:16.390 --> 00:24:31.450 worried about it look at it in the 00:24:17.560 --> 00:24:33.550 debugger later yep so keep in mind 00:24:31.450 --> 00:24:36.130 that we're assuming here that the 00:24:33.550 --> 00:24:38.590 user only has control over their 00:24:36.130 --> 00:24:41.800 username right so the worst that they 00:24:38.590 --> 00:24:43.510 could do is 
take like this entire entry 00:24:41.800 --> 00:24:48.490 and make that the username let's see 00:24:43.510 --> 00:24:51.490 what happens right so that still works 00:24:48.490 --> 00:24:53.710 and the reason for this is this question 00:24:51.490 --> 00:24:56.200 mark means that the moment we hit the 00:24:53.710 --> 00:24:58.820 disconnected keyword we start parsing the 00:24:56.200 --> 00:25:00.769 rest of the pattern right and the 00:24:58.820 --> 00:25:03.200 first occurrence of disconnected is 00:25:00.769 --> 00:25:05.720 printed by SSH before anything the user 00:25:03.200 --> 00:25:08.210 controls so in this particular instance 00:25:05.720 --> 00:25:21.049 even this will not confuse the pattern 00:25:08.210 --> 00:25:24.919 yep well so this 00:25:21.049 --> 00:25:26.149 sort of odd matching in general 00:25:24.919 --> 00:25:29.120 when you're doing data wrangling is like 00:25:26.149 --> 00:25:31.370 not security related 00:25:29.120 --> 00:25:33.889 but it might mean that you get really 00:25:31.370 --> 00:25:35.299 weird data back and so if you're doing 00:25:33.889 --> 00:25:37.399 something like plotting data you might 00:25:35.299 --> 00:25:39.559 drop data points that matter you might 00:25:37.399 --> 00:25:41.450 parse out the wrong number and then like 00:25:39.559 --> 00:25:43.370 your plot suddenly has data points that 00:25:41.450 --> 00:25:45.559 weren't in the original data and so it's 00:25:43.370 --> 00:25:47.419 more that if you find yourself writing a 00:25:45.559 --> 00:25:49.070 complicated regular expression like 00:25:47.419 --> 00:25:51.710 double check that it's actually matching 00:25:49.070 --> 00:25:56.570 what you think it's matching even if 00:25:51.710 --> 00:25:58.220 it's not security related and as you can 00:25:56.570 --> 00:26:00.950 imagine these patterns can get really 00:25:58.220 --> 00:26:02.809 complicated like for example there's a 00:26:00.950 --> 00:26:04.210 big 
debate about how do you match an 00:26:02.809 --> 00:26:06.230 email address with a regular expression 00:26:04.210 --> 00:26:08.870 and you might think of something like 00:26:06.230 --> 00:26:10.850 this so this is a very straightforward 00:26:08.870 --> 00:26:13.909 one that just says letters and numbers 00:26:10.850 --> 00:26:15.620 and underscores some percent followed 00:26:13.909 --> 00:26:17.799 by a plus because in Gmail you can have 00:26:15.620 --> 00:26:22.100 pluses in email addresses with a suffix 00:26:17.799 --> 00:26:24.620 in this case the plus is just for any 00:26:22.100 --> 00:26:25.730 number of these but at least one because 00:26:24.620 --> 00:26:26.929 you can't have an email address that 00:26:25.730 --> 00:26:29.269 doesn't have anything before the at sign and 00:26:26.929 --> 00:26:31.789 then similarly after the domain right 00:26:29.269 --> 00:26:33.139 and the top-level domain has to be at 00:26:31.789 --> 00:26:35.059 least two characters and can't include 00:26:33.139 --> 00:26:38.000 digits right you can have dot com but 00:26:35.059 --> 00:26:40.039 you can't have dot seven it turns out 00:26:38.000 --> 00:26:42.139 this is not really correct right there 00:26:40.039 --> 00:26:43.220 are a bunch of valid email addresses 00:26:42.139 --> 00:26:44.360 that will not be matched by this and 00:26:43.220 --> 00:26:45.559 there are a bunch of invalid email 00:26:44.360 --> 00:26:50.629 addresses that will be matched by this 00:26:45.559 --> 00:26:52.399 so there are many many suggestions and 00:26:50.629 --> 00:26:54.529 there are people who've built like full 00:26:52.399 --> 00:26:58.460 test suites to try to see which regular 00:26:54.529 --> 00:27:00.889 expression is best and this 00:26:58.460 --> 00:27:02.899 particular one is for URLs there are 00:27:00.889 --> 00:27:06.470 similar ones for email where they found 00:27:02.899 --> 00:27:07.909 that the best one is this one I don't 00:27:06.470 --> 00:27:10.790 recommend you 
trying to understand this 00:27:07.909 --> 00:27:13.720 pattern but this one apparently will 00:27:10.790 --> 00:27:15.830 almost perfectly match what the like 00:27:13.720 --> 00:27:17.840 internet standard for email addresses 00:27:15.830 --> 00:27:20.000 says is a valid email address and that 00:27:17.840 --> 00:27:22.250 includes all sorts of weird Unicode code 00:27:20.000 --> 00:27:24.440 points this is just to say regular 00:27:22.250 --> 00:27:26.060 expressions can be really hairy and if 00:27:24.440 --> 00:27:28.880 you end up somewhere like this there's 00:27:26.060 --> 00:27:30.620 probably a better way to do it for 00:27:28.880 --> 00:27:35.320 example if you find yourself trying to 00:27:30.620 --> 00:27:38.300 parse HTML or something or parse like 00:27:35.320 --> 00:27:40.310 parse JSON with regular expressions you 00:27:38.300 --> 00:27:42.230 should probably use a different tool and 00:27:40.310 --> 00:27:44.480 there is an exercise that has you do 00:27:42.230 --> 00:27:49.960 this not with regular expressions 00:27:44.480 --> 00:27:53.180 yeah there's all sorts of 00:27:49.960 --> 00:27:54.740 suggestions and they give you deep deep 00:27:53.180 --> 00:27:56.660 dives into how they work if you want to 00:27:54.740 --> 00:28:01.670 look that up it's in the lecture 00:27:56.660 --> 00:28:04.280 notes okay so now we have the list of 00:28:01.670 --> 00:28:05.960 usernames so let's go back to data 00:28:04.280 --> 00:28:08.210 wrangling right like this list of 00:28:05.960 --> 00:28:10.250 usernames is still not that interesting to 00:28:08.210 --> 00:28:15.790 me right let's see how many lines 00:28:10.250 --> 00:28:15.790 there are so if I do wc -l there are 00:28:15.910 --> 00:28:21.470 one hundred and ninety eight thousand 00:28:18.320 --> 00:28:23.260 lines so wc is the word count program 00:28:21.470 --> 00:28:26.030 -l makes it count the number of lines 00:28:23.260 --> 00:28:27.530 this is a lot 
of lines and if I start 00:28:26.030 --> 00:28:29.690 scrolling through them that still 00:28:27.530 --> 00:28:31.730 doesn't really help me right like I need 00:28:29.690 --> 00:28:37.130 statistics over this I need aggregates 00:28:31.730 --> 00:28:38.450 of some kind and the sed tool is like 00:28:37.130 --> 00:28:40.100 useful for many things it gives you a 00:28:38.450 --> 00:28:43.010 full programming language it can do 00:28:40.100 --> 00:28:45.020 weird things like insert text or only 00:28:43.010 --> 00:28:46.400 print matching lines but it's not 00:28:45.020 --> 00:28:48.560 necessarily the perfect tool for 00:28:46.400 --> 00:28:50.330 everything right like sometimes there 00:28:48.560 --> 00:28:53.420 are better tools like for example you 00:28:50.330 --> 00:28:55.400 could write a line counter in sed you 00:28:53.420 --> 00:28:56.840 just should never sed is a terrible 00:28:55.400 --> 00:29:00.440 programming language except for 00:28:56.840 --> 00:29:02.740 searching and replacing but there are 00:29:00.440 --> 00:29:07.940 other useful tools so for example 00:29:02.740 --> 00:29:09.710 there's a tool called sort so sort this 00:29:07.940 --> 00:29:12.080 is also not going to be very helpful but 00:29:09.710 --> 00:29:13.850 sort takes a bunch of lines of input 00:29:12.080 --> 00:29:16.940 sorts them and then prints them to your 00:29:13.850 --> 00:29:19.130 output so in this case I now get the 00:29:16.940 --> 00:29:20.540 sorted output of that list it is still 00:29:19.130 --> 00:29:23.840 two hundred thousand lines long so it's 00:29:20.540 --> 00:29:24.760 still not very helpful to me but now I 00:29:23.840 --> 00:29:27.340 can combine it with 00:29:24.760 --> 00:29:30.550 the tool called uniq so uniq will 00:29:27.340 --> 00:29:33.130 look at a sorted list of lines and it 00:29:30.550 --> 00:29:34.930 will only print those that are unique so 00:29:33.130 --> 00:29:37.090 if you have multiple instances of any 00:29:34.930 --> 00:29:40.750 
given line it will only print it once 00:29:37.090 --> 00:29:44.290 and then I can say uniq -c so this is 00:29:40.750 --> 00:29:46.030 gonna say count the number of duplicates 00:29:44.290 --> 00:29:48.010 for any lines that are duplicated and 00:29:46.030 --> 00:29:52.000 eliminate them what does this look like 00:29:48.010 --> 00:29:56.050 well if I run it it's gonna take a while 00:29:52.000 --> 00:29:59.710 there were thirteen zze usernames there 00:29:56.050 --> 00:30:01.240 were ten ZX VF usernames etc and 00:29:59.710 --> 00:30:03.460 I can scroll through this this is still 00:30:01.240 --> 00:30:06.130 a very long list right but at least now 00:30:03.460 --> 00:30:08.200 it's a little bit more collated than it 00:30:06.130 --> 00:30:10.770 was let's see how many lines I'm down 00:30:08.200 --> 00:30:10.770 to now okay 00:30:13.480 --> 00:30:17.380 twenty-four thousand lines it's still 00:30:15.460 --> 00:30:19.810 too much it's not useful information to 00:30:17.380 --> 00:30:22.960 me but I can keep burning this down with 00:30:19.810 --> 00:30:24.730 more tools for example what I might care 00:30:22.960 --> 00:30:29.050 about is which usernames have been used 00:30:24.730 --> 00:30:31.330 the most well I can do sort again and I 00:30:29.050 --> 00:30:35.560 can say I want a numeric sort on the 00:30:31.330 --> 00:30:38.980 first column of the input so -n says 00:30:35.560 --> 00:30:41.320 numeric sort -k lets you select a 00:30:38.980 --> 00:30:43.720 whitespace separated column from the input to 00:30:41.320 --> 00:30:45.760 sort by and the reason I'm giving one 00:30:43.720 --> 00:30:47.680 comma one here is because I want to 00:30:45.760 --> 00:30:49.690 start at the first column and stop at 00:30:47.680 --> 00:30:52.150 the first column alternatively I could 00:30:49.690 --> 00:30:54.130 say I want you to sort by this list of 00:30:52.150 --> 00:30:58.300 columns but in this case I just want to 00:30:54.130 --> 00:31:01.840 sort by 
that column and then I want only 00:30:58.300 --> 00:31:06.720 the ten last lines so sort by default 00:31:01.840 --> 00:31:08.890 will output in ascending order so 00:31:06.720 --> 00:31:10.330 the ones with the highest counts are 00:31:08.890 --> 00:31:14.560 gonna be at the bottom and then I want 00:31:10.330 --> 00:31:17.470 only the last ten lines and now when I run 00:31:14.560 --> 00:31:20.590 this I actually get a useful bit of data 00:31:17.470 --> 00:31:21.730 right it tells me there were eleven 00:31:20.590 --> 00:31:24.730 thousand login attempts with the 00:31:21.730 --> 00:31:26.500 username root there were four thousand 00:31:24.730 --> 00:31:29.530 with 123456 as a 00:31:26.500 --> 00:31:33.790 username etc and this is pretty handy 00:31:29.530 --> 00:31:36.040 right and now suddenly this giant log 00:31:33.790 --> 00:31:38.230 file actually produces useful 00:31:36.040 --> 00:31:40.540 information for me this is what I really 00:31:38.230 --> 00:31:44.230 wanted from that log file now maybe I want to 00:31:40.540 --> 00:31:46.530 just like do a quick disabling of root 00:31:44.230 --> 00:31:50.610 for example for SSH login on my machine 00:31:46.530 --> 00:31:50.610 which I recommend you all do by the way 00:31:51.210 --> 00:31:56.559 in this particular case we don't 00:31:53.410 --> 00:31:58.510 actually need the -k for sort because sort 00:31:56.559 --> 00:32:00.850 by default will sort by the entire line 00:31:58.510 --> 00:32:01.990 and the number happens to come first but 00:32:00.850 --> 00:32:04.059 it's useful to know about these 00:32:01.990 --> 00:32:06.010 additional flags and you might wonder 00:32:04.059 --> 00:32:07.330 well how would I know that these flags 00:32:06.010 --> 00:32:08.559 exist how would I know that these 00:32:07.330 --> 00:32:11.410 programs even exist 00:32:08.559 --> 00:32:12.850 well the programs you usually pick up just 00:32:11.410 --> 00:32:15.900 from being told about them in classes 00:32:12.850 --> 
00:32:19.030 like here the flags are usually like I 00:32:15.900 --> 00:32:22.299 want to sort by something that is not 00:32:19.030 --> 00:32:24.160 the full line your first instinct should 00:32:22.299 --> 00:32:25.929 be to type man sort and then read 00:32:24.160 --> 00:32:27.669 through the page and it very quickly 00:32:25.929 --> 00:32:29.230 will tell you here's how to select a 00:32:27.669 --> 00:32:35.919 particular column here's how to sort by 00:32:29.230 --> 00:32:38.490 a number etc okay what if now that I 00:32:35.919 --> 00:32:40.419 have this like top let's say top 20 list 00:32:38.490 --> 00:32:42.790 let's say I don't actually care about 00:32:40.419 --> 00:32:45.010 the counts I just want like a comma 00:32:42.790 --> 00:32:47.470 separated list of the usernames because 00:32:45.010 --> 00:32:49.510 I'm gonna like send it to myself by 00:32:47.470 --> 00:32:53.410 email every day or something like that 00:32:49.510 --> 00:32:56.910 like these are the top 20 usernames well 00:32:53.410 --> 00:32:56.910 I can do this 00:32:58.290 --> 00:33:02.559 ok that's a lot more weird commands but 00:33:01.360 --> 00:33:07.330 they're commands that are useful to know 00:33:02.559 --> 00:33:09.880 about so awk is a column based stream 00:33:07.330 --> 00:33:12.429 processor so we talked about sed which 00:33:09.880 --> 00:33:15.640 is a stream editor so it tries to edit 00:33:12.429 --> 00:33:18.820 text primarily in the input awk on the 00:33:15.640 --> 00:33:20.650 other hand also lets you edit text it is 00:33:18.820 --> 00:33:23.290 still a full programming language but 00:33:20.650 --> 00:33:25.660 it's more focused on columnar data so in 00:33:23.290 --> 00:33:28.390 this case awk by default will parse its 00:33:25.660 --> 00:33:30.190 input in whitespace separated columns 00:33:28.390 --> 00:33:32.169 and then lets you operate on those 00:33:30.190 --> 00:33:33.429 columns separately in this case I'm 00:33:32.169 --> 00:33:38.320 saying just print 
the second column 00:33:33.429 --> 00:33:40.299 which is the username right paste is a 00:33:38.320 --> 00:33:43.030 command that takes a bunch of lines and 00:33:40.299 --> 00:33:46.350 pastes them together into a single line 00:33:43.030 --> 00:33:49.450 that's the -s with the delimiter comma 00:33:46.350 --> 00:33:51.740 so in this case if I run this I 00:33:49.450 --> 00:33:53.929 get a comma separated list of the top 00:33:51.740 --> 00:33:56.120 usernames with which I can then do whatever 00:33:53.929 --> 00:33:57.500 useful thing I might want maybe I want 00:33:56.120 --> 00:33:59.149 to stick this in a config file of 00:33:57.500 --> 00:34:00.429 disallowed usernames or something along 00:33:59.149 --> 00:34:04.039 those lines 00:34:00.429 --> 00:34:05.720 um awk is worth talking a little bit 00:34:04.039 --> 00:34:08.510 more about because it turns out to be a 00:34:05.720 --> 00:34:12.859 really powerful language for this kind 00:34:08.510 --> 00:34:16.190 of data wrangling we mentioned briefly 00:34:12.859 --> 00:34:19.010 what this print $2 does but it 00:34:16.190 --> 00:34:21.020 turns out that with awk you can do some 00:34:19.010 --> 00:34:22.849 really really fancy things so for 00:34:21.020 --> 00:34:25.129 example let's go back to here where we 00:34:22.849 --> 00:34:29.419 just have the usernames let's 00:34:25.129 --> 00:34:31.669 still do sort and uniq because 00:34:29.419 --> 00:34:32.089 otherwise the list gets far too 00:34:31.669 --> 00:34:34.040 long 00:34:32.089 --> 00:34:36.800 and let's say that I only want to print 00:34:34.040 --> 00:34:40.760 the usernames that match a particular 00:34:36.800 --> 00:34:51.440 pattern let's say for example that I 00:34:40.760 --> 00:34:56.570 want to see I want all of the usernames 00:34:51.440 --> 00:34:59.599 that only appear once and that start 00:34:56.570 --> 00:35:02.359 with a c and end with an e that's a 00:34:59.599 --> 00:35:04.310 really weird thing to look for but 
in 00:35:02.359 --> 00:35:06.410 awk this is really simple to express I 00:35:04.310 --> 00:35:11.200 can say I want the first column to be 1 00:35:06.410 --> 00:35:15.190 and I want the second column to match 00:35:11.200 --> 00:35:15.190 the following regular expression 00:35:20.480 --> 00:35:32.030 hey this could probably just be dot and 00:35:26.119 --> 00:35:33.920 then I want to print the whole line so 00:35:32.030 --> 00:35:36.230 unless I messed something up this will 00:35:33.920 --> 00:35:38.900 give me all the usernames that start 00:35:36.230 --> 00:35:42.859 with a c end with an e and only appear 00:35:38.900 --> 00:35:44.780 once in my log now that might not be a 00:35:42.859 --> 00:35:46.640 very useful thing to do with the data 00:35:44.780 --> 00:35:48.230 what I'm trying to do in this lecture is 00:35:46.640 --> 00:35:49.940 show you the kind of tools that are 00:35:48.230 --> 00:35:51.619 available and in this particular case 00:35:49.940 --> 00:35:53.180 this pattern is like not that 00:35:51.619 --> 00:35:54.980 complicated even though what we're doing 00:35:53.180 --> 00:35:58.339 is sort of weird and this is because 00:35:54.980 --> 00:35:59.570 very often on Linux with Linux tools in 00:35:58.339 --> 00:36:02.570 particular and command-line tools in 00:35:59.570 --> 00:36:04.609 general the tools are built to be based 00:36:02.570 --> 00:36:06.440 on lines of input and lines of output 00:36:04.609 --> 00:36:09.079 and very often those lines are going to 00:36:06.440 --> 00:36:18.079 have multiple columns and awk is 00:36:09.079 --> 00:36:22.160 great for operating over columns now awk 00:36:18.079 --> 00:36:26.750 is not just able to do things like 00:36:22.160 --> 00:36:29.060 match per line but it lets you do things 00:36:26.750 --> 00:36:31.220 like let's say I want the number of 00:36:29.060 --> 00:36:32.900 these right I want to know how many 00:36:31.220 --> 00:36:36.829 usernames match this pattern well I can do 00:36:32.900 --> 
00:36:39.710 wc -l that works just fine all right 00:36:36.829 --> 00:36:41.990 there are 31 such usernames but awk is 00:36:39.710 --> 00:36:44.780 a programming language this is something 00:36:41.990 --> 00:36:46.819 that you will probably never end up 00:36:44.780 --> 00:36:49.430 doing yourself but it's important to 00:36:46.819 --> 00:36:53.200 know that you can every now and again it 00:36:49.430 --> 00:36:53.200 is actually useful to know about these 00:36:53.619 --> 00:37:02.420 this might be hard to read on my screen 00:36:57.140 --> 00:37:04.960 I just realized let me try to fix that 00:37:02.420 --> 00:37:04.960 in a second 00:37:07.299 --> 00:37:17.649 let's do yeah apparently fish does not 00:37:14.469 --> 00:37:19.749 want me to do that um so here BEGIN is a 00:37:17.649 --> 00:37:22.539 special pattern that only matches the 00:37:19.749 --> 00:37:25.779 zeroth line END is a special pattern 00:37:22.539 --> 00:37:28.179 that only matches after the last line 00:37:25.779 --> 00:37:29.619 and then this is gonna be a normal 00:37:28.179 --> 00:37:32.019 pattern that's matched against every 00:37:29.619 --> 00:37:34.149 line so what I'm saying here is on the 00:37:32.019 --> 00:37:36.579 zeroth line set the variable rows to 00:37:34.149 --> 00:37:40.419 zero on every line that matches this 00:37:36.579 --> 00:37:42.309 pattern increment rows and after you 00:37:40.419 --> 00:37:44.919 have matched the last line print the 00:37:42.309 --> 00:37:47.499 value of rows and this will have the 00:37:44.919 --> 00:37:50.259 same effect as running wc -l but all 00:37:47.499 --> 00:37:52.809 within awk in this particular instance like 00:37:50.259 --> 00:37:55.599 wc -l is just fine but sometimes you want 00:37:52.809 --> 00:37:57.429 to do things like you might want 00:37:55.599 --> 00:37:59.109 to keep a dictionary or a map of some 00:37:57.429 --> 00:38:01.119 kind you might want to compute 00:37:59.109 --> 00:38:03.219 statistics you might want to do things 
00:38:01.119 --> 00:38:05.469 like I want the second match of this 00:38:03.219 --> 00:38:07.630 pattern so you need a stateful matcher 00:38:05.469 --> 00:38:09.099 like ignore the first match but then 00:38:07.630 --> 00:38:11.140 print everything following the second 00:38:09.099 --> 00:38:12.639 match and for that this kind of simple 00:38:11.140 --> 00:38:18.489 programming in awk can be useful to know 00:38:12.639 --> 00:38:22.929 about in fact we could in this pattern 00:38:18.489 --> 00:38:24.789 get rid of sed and sort and uniq and 00:38:22.929 --> 00:38:26.799 grep that we originally used to produce 00:38:24.789 --> 00:38:28.209 this file and do it all in awk 00:38:26.799 --> 00:38:30.880 but you probably don't want to do that 00:38:28.209 --> 00:38:34.539 it would be probably too painful to be 00:38:30.880 --> 00:38:37.359 worth it it's worth talking a little bit 00:38:34.539 --> 00:38:38.999 about the other kinds of tools that you 00:38:37.359 --> 00:38:41.169 might want to use on the command line 00:38:38.999 --> 00:38:45.039 the first of these is a really handy 00:38:41.169 --> 00:38:49.929 program called bc so bc is the Berkeley 00:38:45.039 --> 00:38:51.449 calculator I believe man bc I think bc 00:38:49.929 --> 00:38:54.069 is originally from Berkeley calculator 00:38:51.449 --> 00:38:56.169 anyway it is a very simple command-line 00:38:54.069 --> 00:38:58.959 calculator but instead of giving you a 00:38:56.169 --> 00:39:00.759 prompt it reads from standard in so I 00:38:58.959 --> 00:39:04.899 can do something like echo 1 plus 2 and 00:39:00.759 --> 00:39:06.789 pipe it to bc -l because many of 00:39:04.899 --> 00:39:11.319 these programs normally operate in like 00:39:06.789 --> 00:39:15.699 a stupid mode where they're unhelpful so 00:39:11.319 --> 00:39:17.469 here it prints 3 wow very impressive but 00:39:15.699 --> 00:39:19.779 it turns out this can be really handy 00:39:17.469 --> 00:39:21.100 imagine you have a file with a bunch of 
00:39:19.779 --> 00:39:26.340 lines 00:39:21.100 --> 00:39:32.020 let's say something like oh I don't know 00:39:26.340 --> 00:39:35.020 this file and let's say I want to sum up 00:39:32.020 --> 00:39:36.910 the number of logins for the 00:39:35.020 --> 00:39:40.030 usernames that have not been used only once 00:39:36.910 --> 00:39:43.870 all right so the ones where the count is 00:39:40.030 --> 00:39:48.550 not equal to one I want to print just 00:39:43.870 --> 00:39:50.950 the count right this is give me the 00:39:48.550 --> 00:39:52.930 counts for all the non single-use 00:39:50.950 --> 00:39:55.180 usernames and then I want to know how many 00:39:52.930 --> 00:39:56.740 are there of these notice that I can't 00:39:55.180 --> 00:39:59.110 just count the lines that wouldn't work 00:39:56.740 --> 00:40:02.200 right because there are numbers on each 00:39:59.110 --> 00:40:05.950 line that I want to sum well I can use paste 00:40:02.200 --> 00:40:08.100 to paste by plus so this pastes every 00:40:05.950 --> 00:40:12.040 line together into a plus expression 00:40:08.100 --> 00:40:14.200 right and this is now an arithmetic 00:40:12.040 --> 00:40:18.910 expression so I can pipe it through bc -l 00:40:14.200 --> 00:40:20.920 and now there have been a hundred and 00:40:18.910 --> 00:40:22.720 ninety one thousand logins that share a 00:40:20.920 --> 00:40:25.540 username with at least one other login 00:40:22.720 --> 00:40:27.700 again probably not something you really 00:40:25.540 --> 00:40:29.560 care about but this is just to show you 00:40:27.700 --> 00:40:34.360 that you can extract this data pretty 00:40:29.560 --> 00:40:36.070 easily and there's all sorts of other 00:40:34.360 --> 00:40:37.810 stuff you can do with this for example 00:40:36.070 --> 00:40:40.810 there are tools that let you compute 00:40:37.810 --> 00:40:43.660 statistics over inputs so for example 00:40:40.810 --> 00:40:45.850 for this list of numbers that I 00:40:43.660 --> 00:40:49.590 just 
took the numbers and just printed 00:40:45.850 --> 00:40:54.880 out just the distribution of numbers I 00:40:49.590 --> 00:40:56.080 could do things like use R R is a 00:40:54.880 --> 00:40:57.640 separate programming language that's 00:40:56.080 --> 00:41:02.230 specifically built for statistical 00:40:57.640 --> 00:41:03.570 analysis and I can say let's see if I 00:41:02.230 --> 00:41:06.280 got this right 00:41:03.570 --> 00:41:10.440 this is again a different programming 00:41:06.280 --> 00:41:13.210 language that you would have to learn 00:41:10.440 --> 00:41:14.200 but if you already know R or you can 00:41:13.210 --> 00:41:23.860 pipe them through other languages 00:41:14.200 --> 00:41:26.380 too like so so this gives me summary 00:41:23.860 --> 00:41:30.160 statistics over that input stream of 00:41:26.380 --> 00:41:33.310 numbers so the median number of login 00:41:30.160 --> 00:41:34.330 attempts per username is 3 the max is 00:41:33.310 --> 00:41:35.980 10,000 that was root 00:41:34.330 --> 00:41:39.250 we saw before it'll tell me the average 00:41:35.980 --> 00:41:40.600 was 8 this might not matter in this 00:41:39.250 --> 00:41:42.040 particular instance like this might not 00:41:40.600 --> 00:41:43.660 be interesting numbers but if you're 00:41:42.040 --> 00:41:45.790 looking at things like output from your 00:41:43.660 --> 00:41:46.780 benchmarking script or something else 00:41:45.790 --> 00:41:48.520 where you have some numerical 00:41:46.780 --> 00:41:52.900 distribution and you want to look at 00:41:48.520 --> 00:41:54.250 them these tools are really handy we can 00:41:52.900 --> 00:41:57.640 even do some simple plotting if we 00:41:54.250 --> 00:42:01.330 wanted to right so this has a bunch of 00:41:57.640 --> 00:42:06.220 numbers let's do let's go back to our 00:42:01.330 --> 00:42:11.860 sort -nk1,1 and look at only the 00:42:06.220 --> 00:42:17.770 top 5 gnuplot is a plotter that lets 00:42:11.860 --> 00:42:19.150 you 
take things from standard in I'm not 00:42:17.770 --> 00:42:22.480 expecting you to know all of these 00:42:19.150 --> 00:42:23.950 programming languages because they 00:42:22.480 --> 00:42:25.810 really are programming languages in 00:42:23.950 --> 00:42:30.580 their own right but it is just to show you 00:42:25.810 --> 00:42:34.360 what is possible right so this is now a 00:42:30.580 --> 00:42:37.360 histogram of how many times each of the 00:42:34.360 --> 00:42:41.020 top 5 usernames have been used on my 00:42:37.360 --> 00:42:43.810 server since January 1st and it's just 00:42:41.020 --> 00:42:45.340 one command line it's a somewhat 00:42:43.810 --> 00:42:48.570 complicated command line but it's just 00:42:45.340 --> 00:42:48.570 one command line thing that you can do 00:42:50.520 --> 00:42:54.790 there are two sort of special types of 00:42:53.590 --> 00:42:56.290 data wrangling that I want to talk to 00:42:54.790 --> 00:42:58.420 you about in the last little bit 00:42:56.290 --> 00:43:01.980 of time that we have and the first one 00:42:58.420 --> 00:43:07.750 is command line argument wrangling 00:43:01.980 --> 00:43:09.220 sometimes you might have something that 00:43:07.750 --> 00:43:11.140 actually we looked at in the last 00:43:09.220 --> 00:43:14.170 lecture like you have things like find 00:43:11.140 --> 00:43:17.760 that produces a list of files or maybe 00:43:14.170 --> 00:43:17.760 something that produces a list of 00:43:19.380 --> 00:43:23.080 arguments for your benchmarking script 00:43:21.940 --> 00:43:24.670 like you want to run it with a 00:43:23.080 --> 00:43:26.020 particular distribution of arguments 00:43:24.670 --> 00:43:28.810 like let's say you had a script that 00:43:26.020 --> 00:43:29.980 printed the number of iterations to run 00:43:28.810 --> 00:43:31.630 a particular project and you wanted like 00:43:29.980 --> 00:43:33.520 an exponential distribution or something 00:43:31.630 --> 00:43:35.500 and this prints the number of 
iterations 00:43:33.520 --> 00:43:37.960 on each line and you want to run your 00:43:35.500 --> 00:43:39.190 benchmark for each one well here is a 00:43:37.960 --> 00:43:43.420 tool called xargs 00:43:39.190 --> 00:43:46.210 that's your friend so xargs takes lines 00:43:43.420 --> 00:43:47.620 of input and turns them into arguments 00:43:46.210 --> 00:43:50.170 and this might 00:43:47.620 --> 00:43:52.270 look a little weird see if I can come up 00:43:50.170 --> 00:43:55.480 with a good example for this so I 00:43:52.270 --> 00:43:56.770 program in Rust and Rust lets you 00:43:55.480 --> 00:43:58.540 install multiple versions of the 00:43:56.770 --> 00:44:01.360 compiler so in this case you can see 00:43:58.540 --> 00:44:04.420 that I have stable beta I have a couple 00:44:01.360 --> 00:44:05.860 of earlier stable releases and 00:44:04.420 --> 00:44:08.980 a bunch of different dated nightlies 00:44:05.860 --> 00:44:12.010 and this is all very well but over time 00:44:08.980 --> 00:44:14.140 like I don't really need the nightly 00:44:12.010 --> 00:44:14.890 version from like March of last year 00:44:14.140 --> 00:44:16.450 anymore 00:44:14.890 --> 00:44:17.710 I can probably delete that every now and 00:44:16.450 --> 00:44:21.550 again and maybe I want to clean these up 00:44:17.710 --> 00:44:25.330 a little well this is a list of lines so 00:44:21.550 --> 00:44:29.770 I can grep for nightly I can get rid of 00:44:25.330 --> 00:44:32.170 so -v is don't match I don't want to 00:44:29.770 --> 00:44:34.540 match the current nightly okay so 00:44:32.170 --> 00:44:37.810 this is now a list of dated nightlies 00:44:34.540 --> 00:44:42.730 maybe I want only the ones from 2019 00:44:37.810 --> 00:44:45.370 and now I want to remove each of these 00:44:42.730 --> 00:44:48.340 toolchains from my machine I could copy 00:44:45.370 --> 00:44:52.630 paste each one into so there's a rustup 00:44:48.340 --> 00:44:56.110 toolchain remove or uninstall maybe 00:44:52.630 
--> 00:44:58.060 toolchain uninstall right so I could 00:44:56.110 --> 00:44:59.470 manually type out the name of each one 00:44:58.060 --> 00:45:01.030 or copy/paste them but that 00:44:59.470 --> 00:45:03.700 gets annoying really quickly because I 00:45:01.030 --> 00:45:10.660 have the list right here so instead how 00:45:03.700 --> 00:45:14.890 about I sed away this sort of this 00:45:10.660 --> 00:45:17.770 suffix that it adds right so now it's 00:45:14.890 --> 00:45:20.800 just that and then I use xargs so 00:45:17.770 --> 00:45:23.770 xargs takes a list of inputs and turns 00:45:20.800 --> 00:45:27.060 them into arguments so I want this to 00:45:23.770 --> 00:45:30.730 become arguments to rustup toolchain 00:45:27.060 --> 00:45:32.710 uninstall and just for my own sanity's 00:45:30.730 --> 00:45:33.910 sake I'm gonna make this echo just so 00:45:32.710 --> 00:45:36.460 it's going to show which command it's 00:45:33.910 --> 00:45:39.460 gonna run well it's relatively unhelpful 00:45:36.460 --> 00:45:41.770 or hard to read but at least you see 00:45:39.460 --> 00:45:43.990 the command it's going to execute if I 00:45:41.770 --> 00:45:45.550 remove this echo it's rustup toolchain 00:45:43.990 --> 00:45:47.520 uninstall and then the list of 00:45:45.550 --> 00:45:51.130 nightlies as arguments to that program 00:45:47.520 --> 00:45:52.630 and so if I run this it uninstalls 00:45:51.130 --> 00:45:56.110 every toolchain instead of me having to 00:45:52.630 --> 00:45:57.520 copy paste them so this is one example 00:45:56.110 --> 00:45:59.110 where this kind of data wrangling 00:45:57.520 --> 00:46:00.670 actually can be useful for other tasks 00:45:59.110 --> 00:46:01.480 than just looking at data it's just 00:46:00.670 --> 00:46:04.420 going from one 00:46:01.480 --> 00:46:07.150 format to another you can also wrangle 00:46:04.420 --> 00:46:09.550 binary data so a good example of this is 00:46:07.150 --> 00:46:11.710 stuff like videos and images
where you 00:46:09.550 --> 00:46:14.770 might actually want to operate over them 00:46:11.710 --> 00:46:17.109 in some interesting way so for example 00:46:14.770 --> 00:46:19.720 there's a tool called ffmpeg ffmpeg is 00:46:17.109 --> 00:46:23.079 for encoding and decoding video and to 00:46:19.720 --> 00:46:24.310 some extent images I'm gonna set its log 00:46:23.079 --> 00:46:26.800 level to panic because otherwise it 00:46:24.310 --> 00:46:30.730 prints a bunch of stuff I want it to 00:46:26.800 --> 00:46:34.570 read from /dev/video0 which is 00:46:30.730 --> 00:46:37.300 my webcam video device and I want it 00:46:34.570 --> 00:46:40.420 to take the first frame so I just want it 00:46:37.300 --> 00:46:42.670 to take a picture and I want it to take 00:46:40.420 --> 00:46:45.790 an image rather than a single frame 00:46:42.670 --> 00:46:48.070 video file and I want it to print its 00:46:45.790 --> 00:46:50.410 output so the image it captures to 00:46:48.070 --> 00:46:52.570 standard output - is usually the way you 00:46:50.410 --> 00:46:54.430 tell the program to use standard input 00:46:52.570 --> 00:46:56.200 or output rather than a given file so 00:46:54.430 --> 00:46:58.930 here it expects a file name and the file 00:46:56.200 --> 00:47:00.790 name - means standard output in this 00:46:58.930 --> 00:47:02.550 context and then I want to pipe that 00:47:00.790 --> 00:47:05.500 through a program called convert 00:47:02.550 --> 00:47:08.170 convert is an image manipulation program 00:47:05.500 --> 00:47:12.280 I want to tell convert to read from 00:47:08.170 --> 00:47:16.050 standard input and turn the image into 00:47:12.280 --> 00:47:19.390 the color space gray and then write the 00:47:16.050 --> 00:47:22.119 resulting image into the file - which is 00:47:19.390 --> 00:47:25.119 standard output and I now want to pipe 00:47:22.119 --> 00:47:28.720 that into gzip which is just gonna compress 00:47:25.119 --> 00:47:30.579 this image file and that's also
going to 00:47:28.720 --> 00:47:33.450 just operate on standard input standard 00:47:30.579 --> 00:47:37.780 output and then I'm going to pipe that 00:47:33.450 --> 00:47:41.349 to my remote server and on that I'm 00:47:37.780 --> 00:47:44.050 going to decode that image and then I'm 00:47:41.349 --> 00:47:46.839 gonna store a copy of that image so 00:47:44.050 --> 00:47:49.030 remember tee reads input prints it to 00:47:46.839 --> 00:47:51.250 standard out and to a file this is gonna 00:47:49.030 --> 00:47:55.750 make a copy of the decoded image file 00:47:51.250 --> 00:47:58.210 as copy.png and then it's gonna 00:47:55.750 --> 00:48:00.550 continue to stream that out so now I'm 00:47:58.210 --> 00:48:04.990 gonna bring that back into a local 00:48:00.550 --> 00:48:07.240 stream and here I'm going to display 00:48:04.990 --> 00:48:08.550 that in an image displayer let's see 00:48:07.240 --> 00:48:13.240 if that works 00:48:08.550 --> 00:48:15.050 hey right so this now did a round-trip 00:48:13.240 --> 00:48:18.340 to my server 00:48:15.050 --> 00:48:21.380 and then came back over pipes and 00:48:18.340 --> 00:48:23.060 there's now a 00:48:21.380 --> 00:48:25.820 decompressed version of this file at 00:48:23.060 --> 00:48:29.360 least in theory on my server let's see 00:48:25.820 --> 00:48:38.180 if that's there scp tsp copy.png to 00:48:29.360 --> 00:48:40.900 here and CP 8 yeah hey same file ended 00:48:38.180 --> 00:48:43.580 up on the server so our pipeline worked 00:48:40.900 --> 00:48:45.890 again this is a sort of silly example 00:48:43.580 --> 00:48:48.290 but lets you see the power of building 00:48:45.890 --> 00:48:50.150 these pipelines where it doesn't have to 00:48:48.290 --> 00:48:52.310 be textual data it's just taking data 00:48:50.150 --> 00:48:55.100 from any format to any other like for 00:48:52.310 --> 00:48:58.280 example if I wanted to I can do cat 00:48:55.100 --> 00:49:00.710 /dev/video0 and then pipe that to a
server 00:48:58.280 --> 00:49:02.660 that like Anish controls and then he 00:49:00.710 --> 00:49:05.420 could watch that video stream by piping 00:49:02.660 --> 00:49:08.900 it into a video player on his machine if 00:49:05.420 --> 00:49:13.100 we wanted to right you just need to know 00:49:08.900 --> 00:49:15.200 that these things exist there are a bunch 00:49:13.100 --> 00:49:17.180 of exercises for this lab and some of 00:49:15.200 --> 00:49:19.310 them rely on you having a data source 00:49:17.180 --> 00:49:21.110 that looks a little bit like a log on 00:49:19.310 --> 00:49:22.460 macOS and Linux we give you some 00:49:21.110 --> 00:49:24.590 commands you can try to experiment with 00:49:22.460 --> 00:49:26.630 but keep in mind that it's not 00:49:24.590 --> 00:49:28.970 that important exactly what data source 00:49:26.630 --> 00:49:30.290 you use this is more find some data 00:49:28.970 --> 00:49:32.240 source where you think there might 00:49:30.290 --> 00:49:33.680 be an interesting signal and then try to 00:49:32.240 --> 00:49:35.510 extract something interesting from it 00:49:33.680 --> 00:49:38.660 that is what all of the exercises are 00:49:35.510 --> 00:49:41.240 about we will not have class on Monday 00:49:38.660 --> 00:49:43.370 because it's MLK Day so next lecture 00:49:41.240 --> 00:49:45.440 will be Tuesday on command line 00:49:43.370 --> 00:49:47.420 environments any questions about what 00:49:45.440 --> 00:49:51.410 we've covered so far or the pipelines or 00:49:47.420 --> 00:49:52.790 regular expressions I really recommend 00:49:51.410 --> 00:49:54.800 that you look into regular expressions 00:49:52.790 --> 00:49:57.230 and try to learn them they are extremely 00:49:54.800 --> 00:49:59.300 handy both for this and in programming 00:49:57.230 --> 00:50:00.440 in general and if you have any questions 00:49:59.300 --> 00:50:02.560 come to office hours and we'll help you 00:50:00.440 --> 00:50:02.560 out
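The xargs cleanup walked through in the lecture can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact command typed on screen: the toolchain names are hypothetical stand-ins for what `rustup toolchain list` might print, the `sed` expression is an assumed way to strip the platform suffix, and `echo` is left in front of the command so nothing is actually uninstalled.

```shell
# Hypothetical toolchain list standing in for the output of `rustup toolchain list`.
toolchains='stable-x86_64-unknown-linux-gnu (default)
beta-x86_64-unknown-linux-gnu
nightly-x86_64-unknown-linux-gnu
nightly-2019-03-01-x86_64-unknown-linux-gnu
nightly-2019-07-15-x86_64-unknown-linux-gnu
nightly-2020-01-02-x86_64-unknown-linux-gnu'

# 1. keep only the nightly lines
# 2. drop the current (undated) nightly with grep -v
# 3. keep only the nightlies dated 2019
# 4. strip the platform suffix (assumed pattern) so only the toolchain name remains
# 5. let xargs turn the remaining lines into arguments; echo makes this a dry run
dry_run=$(printf '%s\n' "$toolchains" \
  | grep nightly \
  | grep -vE 'nightly-x86' \
  | grep 2019 \
  | sed 's/-x86.*//' \
  | xargs echo rustup toolchain uninstall)

printf '%s\n' "$dry_run"
# prints: rustup toolchain uninstall nightly-2019-03-01 nightly-2019-07-15
```

Removing the `echo` would make xargs run the uninstall command for real, so keeping the dry run until the printed command looks right is the safer habit.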