0:00:01.310,0:00:06.420 all right so welcome to today's lecture 0:00:04.440,0:00:08.760 which is going to be on data wrangling 0:00:06.420,0:00:10.620 and data wrangling might be a phrase that 0:00:08.760,0:00:12.630 sounds a little bit odd to you but the 0:00:10.620,0:00:14.940 basic idea of data wrangling is that you 0:00:12.630,0:00:16.800 have data in one format and you want it 0:00:14.940,0:00:18.930 in some different format and this 0:00:16.800,0:00:20.820 happens all of the time I'm not just 0:00:18.930,0:00:22.859 talking about like converting images but 0:00:20.820,0:00:25.080 it could be like you have a text file or 0:00:22.859,0:00:27.480 a log file and what you really want is this 0:00:25.080,0:00:29.429 data in some other format like you want 0:00:27.480,0:00:32.399 a graph or you want statistics over the 0:00:29.429,0:00:35.160 data anything that goes from one piece 0:00:32.399,0:00:37.110 of data to another representation of 0:00:35.160,0:00:40.079 that data is what I would call data 0:00:37.110,0:00:42.180 wrangling we've seen some examples of 0:00:40.079,0:00:43.739 this kind of data wrangling already 0:00:42.180,0:00:45.750 previously in the semester like 0:00:43.739,0:00:48.000 basically whenever you use the pipe 0:00:45.750,0:00:49.739 operator that lets you sort of take 0:00:48.000,0:00:51.449 output from one program and feed it 0:00:49.739,0:00:54.149 through another program you are doing 0:00:51.449,0:00:55.289 data wrangling in one way or another but 0:00:54.149,0:00:57.960 what we're going to do in this lecture is 0:00:55.289,0:00:59.850 take a look at some of the fancier ways 0:00:57.960,0:01:01.859 you can do data wrangling and some of 0:00:59.850,0:01:05.640 the really useful ways you can do data 0:01:01.859,0:01:06.990 wrangling in order to do any kind of 0:01:05.640,0:01:09.000 data wrangling though you need a data 0:01:06.990,0:01:12.240 source you need some data to operate on 0:01:09.000,0:01:14.400 in the first place and there are a lot 
0:01:12.240,0:01:16.560 of good candidates for that kind of data 0:01:14.400,0:01:18.930 we give some examples in the exercise 0:01:16.560,0:01:20.580 section for today's lecture notes in 0:01:18.930,0:01:23.400 this particular one though I'm going to 0:01:20.580,0:01:25.500 be using a system log so I have a server 0:01:23.400,0:01:27.180 that's running somewhere in the Netherlands 0:01:25.500,0:01:29.750 because that seemed like a reasonable 0:01:27.180,0:01:32.790 thing at the time and on that server 0:01:29.750,0:01:34.380 it's running sort of a regular logging 0:01:32.790,0:01:36.630 daemon that comes with systemd 0:01:34.380,0:01:39.030 it's a sort of relatively standard Linux 0:01:36.630,0:01:41.880 logging mechanism and there's a command 0:01:39.030,0:01:44.700 called journalctl on Linux systems that 0:01:41.880,0:01:46.439 will let you view the system log and so 0:01:44.700,0:01:48.689 what I'm gonna do is I'm gonna do some 0:01:46.439,0:01:50.009 transformations over that log and see if 0:01:48.689,0:01:52.829 we can extract something interesting 0:01:50.009,0:01:56.280 from it you'll see though that if I run 0:01:52.829,0:01:59.329 this command I end up with a lot of data 0:01:56.280,0:02:01.979 because this is a log that has just like 0:01:59.329,0:02:03.360 there's a lot of stuff in it right a lot 0:02:01.979,0:02:06.299 of things have happened on my server and 0:02:03.360,0:02:08.250 this goes back to like January first and 0:02:06.299,0:02:10.560 there are logs that go even further back on 0:02:08.250,0:02:12.120 this there's a lot of stuff so the first 0:02:10.560,0:02:13.440 thing we're gonna do is try to limit it 0:02:12.120,0:02:16.260 down to only 0:02:13.440,0:02:18.060 one piece of content and here the grep 0:02:16.260,0:02:19.830 command is your friend so we're gonna 0:02:18.060,0:02:23.220 pipe this through grep and we're gonna 0:02:19.830,0:02:24.810 grep for SSH right so SSH we haven't 0:02:23.220,0:02:26.760 really talked to you about 
yet but it is 0:02:24.810,0:02:28.560 a way to access computers remotely 0:02:26.760,0:02:30.780 through the command line and in 0:02:28.560,0:02:32.190 particular what happens when you put a 0:02:30.780,0:02:34.080 server on the public Internet is that 0:02:32.190,0:02:35.700 lots and lots of people around the world 0:02:34.080,0:02:37.530 try to connect to it and log in and 0:02:35.700,0:02:39.360 take over your server and so I want to 0:02:37.530,0:02:41.480 see how those people are trying to do 0:02:39.360,0:02:44.850 that and so I'm going to grep for SSH 0:02:41.480,0:02:47.700 and you'll see pretty quickly that this 0:02:44.850,0:02:51.270 also generates a bunch of content at 0:02:47.700,0:02:55.980 least in theory this is gonna be real 0:02:51.270,0:02:58.650 slow there we go so this generates tons 0:02:55.980,0:03:00.240 and tons and tons of content and it's 0:02:58.650,0:03:01.860 really hard to even just visualize 0:03:00.240,0:03:05.070 what's going on here so let's look at 0:03:01.860,0:03:06.660 only what user names people have used to 0:03:05.070,0:03:09.780 try to log into my server so you'll see 0:03:06.660,0:03:12.540 some of these lines say disconnected 0:03:09.780,0:03:14.940 disconnected from invalid user and then 0:03:12.540,0:03:17.430 some user name I want only those lines 0:03:14.940,0:03:19.080 that's all I really care about I'm gonna 0:03:17.430,0:03:21.750 make one more change here though which 0:03:19.080,0:03:26.459 is if you think about what this pipeline 0:03:21.750,0:03:29.160 does if I here do disconnected from so 0:03:26.459,0:03:31.320 this pipeline at the bottom here what 0:03:29.160,0:03:33.420 that will do is it will send the entire 0:03:31.320,0:03:36.209 log file over the network to my machine 0:03:33.420,0:03:38.250 and then locally run grep to find only 0:03:36.209,0:03:40.530 the lines that contained ssh and then 0:03:38.250,0:03:42.150 locally filter them further this seems a 0:03:40.530,0:03:44.220 little bit wasteful 
because I don't care 0:03:42.150,0:03:45.959 about most of these lines and the remote 0:03:44.220,0:03:48.900 site is also running a shell so what I 0:03:45.959,0:03:51.510 can actually do is I can have that 0:03:48.900,0:03:53.519 entire command run on the server right 0:03:51.510,0:03:55.200 so I'm telling SSH the command I 0:03:53.519,0:03:57.420 want you to run on the server is this 0:03:55.200,0:04:01.230 pipeline of three things and then what I 0:03:57.420,0:04:02.700 get back I want to pipe through less so 0:04:01.230,0:04:04.260 what does this do well it's gonna do 0:04:02.700,0:04:06.150 that same filtering that we did but it's 0:04:04.260,0:04:08.280 gonna do it on the server side and the 0:04:06.150,0:04:11.730 server is only going to send me those 0:04:08.280,0:04:13.290 lines that I care about and then when I 0:04:11.730,0:04:16.320 pipe it locally through the program 0:04:13.290,0:04:17.519 called less less is a pager you'll see 0:04:16.320,0:04:19.290 some examples of this you've actually 0:04:17.519,0:04:21.900 seen some of them already like when you 0:04:19.290,0:04:24.180 type man and some command that opens in 0:04:21.900,0:04:26.669 a pager and a pager is a convenient way 0:04:24.180,0:04:27.389 to take a long piece of content and fit 0:04:26.669,0:04:29.759 it into your terminal 0:04:27.389,0:04:31.889 window and have you scroll down and 0:04:29.759,0:04:33.150 scroll up and navigate it so that it 0:04:31.889,0:04:36.120 doesn't just like scroll past your 0:04:33.150,0:04:37.409 screen and so if I run this it still 0:04:36.120,0:04:40.800 takes a little while because it has to 0:04:37.409,0:04:42.919 parse through a lot of log files and in 0:04:40.800,0:04:45.930 particular grep is buffering and 0:04:42.919,0:04:46.919 therefore it decides to be relatively 0:04:45.930,0:04:56.039 unhelpful 0:04:46.919,0:05:01.259 I may do this without let's see if 0:04:56.039,0:05:05.189 that's more helpful why doesn't it want 0:05:01.259,0:05:09.949 to be 
helpful to me fine I'm gonna cheat 0:05:05.189,0:05:09.949 a little just ignore me 0:05:17.380,0:05:22.520 or the internet is really slow those are 0:05:20.570,0:05:27.140 two possible options luckily there's a 0:05:22.520,0:05:30.470 fix for that because previously I have 0:05:27.140,0:05:33.080 run the following command so this 0:05:30.470,0:05:34.340 command just takes the output of that 0:05:33.080,0:05:36.560 command and sticks it into a file 0:05:34.340,0:05:38.660 locally on my computer alright so I ran 0:05:36.560,0:05:40.970 this when I was up in my office and so 0:05:38.660,0:05:43.490 what this did is it downloaded all of 0:05:40.970,0:05:45.530 the SSH log entries that matched 0:05:43.490,0:05:47.330 disconnected from so I have those locally 0:05:45.530,0:05:49.070 and this is really handy right there's 0:05:47.330,0:05:50.990 no reason for me to stream the full log 0:05:49.070,0:05:52.640 every single time because I know that 0:05:50.990,0:05:55.220 that starting pattern is what I'm going 0:05:52.640,0:05:57.260 to want anyway so we can take a look at 0:05:55.220,0:05:59.480 ssh.log and you will see there are 0:05:57.260,0:06:01.760 lots and lots and lots of lines that all 0:05:59.480,0:06:04.940 say disconnected from invalid user 0:06:01.760,0:06:06.230 authenticating users etc right so these 0:06:04.940,0:06:08.870 are the lines that we have to work on 0:06:06.230,0:06:10.550 and this also means that going forward 0:06:08.870,0:06:12.500 we don't have to go through this whole 0:06:10.550,0:06:16.220 SSH process we can just cat that file 0:06:12.500,0:06:18.080 and then operate on it directly so 0:06:16.220,0:06:21.680 here I can also demonstrate this pager 0:06:18.080,0:06:23.720 so if I do cat ssh.log 0:06:21.680,0:06:25.220 and I pipe it through less it gives me a 0:06:23.720,0:06:28.850 pager where I can scroll up and down 0:06:25.220,0:06:30.560 make that a little bit smaller maybe so 0:06:28.850,0:06:33.320 I can scroll this 
file scroll through 0:06:30.560,0:06:36.260 this file and I can do so with what are 0:06:33.320,0:06:37.820 roughly vim bindings so Ctrl-U to 0:06:36.260,0:06:42.770 scroll up Ctrl-D to scroll down and 0:06:37.820,0:06:45.169 q to exit this is still a lot of 0:06:42.770,0:06:47.000 content though and these lines contain a 0:06:45.169,0:06:48.440 bunch of garbage that I'm not really 0:06:47.000,0:06:50.030 interested in what I really want to see 0:06:48.440,0:06:52.610 is what are these user names 0:06:50.030,0:06:55.790 and here the tool that we're going to 0:06:52.610,0:06:59.210 start using is one called sed sed is a 0:06:55.790,0:07:01.040 stream editor it's 0:06:59.210,0:07:04.100 a modification of a much earlier program 0:07:01.040,0:07:05.540 called ed which was a really weird 0:07:04.100,0:07:12.320 editor that none of you will probably 0:07:05.540,0:07:16.270 want to use yeah oh tsp is the name of 0:07:12.320,0:07:16.270 the remote computer I'm connecting to 0:07:16.390,0:07:23.720 so sed is a stream editor and it 0:07:19.850,0:07:26.060 basically lets you make changes to the 0:07:23.720,0:07:28.490 contents of a stream you can think of it 0:07:26.060,0:07:29.870 a little bit like doing replacements but 0:07:28.490,0:07:30.410 it's actually a full programming 0:07:29.870,0:07:33.440 language 0:07:30.410,0:07:35.180 over the stream that is given one of the 0:07:33.440,0:07:38.060 most common things you do with sed 0:07:35.180,0:07:40.610 though is to just run replacement 0:07:38.060,0:07:44.590 expressions on an input stream what do 0:07:40.610,0:07:44.590 these look like well let me show you 0:07:45.160,0:07:50.000 here I'm gonna pipe this to sed and 0:07:47.780,0:07:52.540 I'm going to say that I want to remove 0:07:50.000,0:07:58.370 everything that comes before 0:07:52.540,0:08:00.980 disconnected from so this might look a 0:07:58.370,0:08:03.950 little weird the observation is that the 
0:08:00.980,0:08:06.230 date and the host name and the sort of 0:08:03.950,0:08:07.310 process ID of the SSH daemon I don't 0:08:06.230,0:08:09.740 care about I can just remove that 0:08:07.310,0:08:11.930 straight away and I can also remove that 0:08:09.740,0:08:13.580 like disconnected from bit because that 0:08:11.930,0:08:15.170 seems to be present in every single log 0:08:13.580,0:08:18.200 entry so I just want to get rid of it 0:08:15.170,0:08:20.360 and so what I write is a sed expression 0:08:18.200,0:08:21.980 in this particular case it's an s 0:08:20.360,0:08:25.730 expression which is a substitute 0:08:21.980,0:08:27.620 expression it takes two arguments that 0:08:25.730,0:08:30.590 are basically enclosed in these slashes 0:08:27.620,0:08:32.360 so the first one is the search string 0:08:30.590,0:08:34.430 and the second one which is currently 0:08:32.360,0:08:36.470 empty is a replacement string so here 0:08:34.430,0:08:39.560 I'm saying search for the following 0:08:36.470,0:08:40.820 pattern and replace it with blank and 0:08:39.560,0:08:43.099 then I'm gonna pipe it into less at the 0:08:40.820,0:08:45.380 end and you see that now what it's done 0:08:43.099,0:08:49.760 is trim off the beginning of all these 0:08:45.380,0:08:52.220 lines and that seems really handy but 0:08:49.760,0:08:54.740 you might wonder what is this pattern 0:08:52.220,0:08:57.890 that I've built up here right this is 0:08:54.740,0:08:59.480 this dot star what does that mean this 0:08:57.890,0:09:01.820 is an example of a regular expression 0:08:59.480,0:09:03.620 and regular expressions are something 0:09:01.820,0:09:04.970 that you may have come across in 0:09:03.620,0:09:06.710 programming in the past 0:09:04.970,0:09:08.030 but it's something that once you go into 0:09:06.710,0:09:09.920 the command line you will find yourself 0:09:08.030,0:09:12.550 using a lot especially for this kind of 0:09:09.920,0:09:16.040 data wrangling regular expressions are 0:09:12.550,0:09:18.080 
essentially a powerful way to match text 0:09:16.040,0:09:19.580 you can use it for other things than 0:09:18.080,0:09:23.030 text too but text is the most common 0:09:19.580,0:09:26.840 example and in regular expressions you 0:09:23.030,0:09:29.810 have a number of special characters that 0:09:26.840,0:09:31.580 say don't just match this character but 0:09:29.810,0:09:34.210 match for example a particular type of 0:09:31.580,0:09:36.980 character or a particular set of options 0:09:34.210,0:09:39.770 it essentially generates a program for 0:09:36.980,0:09:42.040 you that searches the given text dot for 0:09:39.770,0:09:46.000 example means any single 0:09:42.040,0:09:48.730 character and star if you follow a 0:09:46.000,0:09:51.910 character with a star it means zero or 0:09:48.730,0:09:54.399 more of that character and so in this 0:09:51.910,0:09:57.579 case this pattern is saying zero or more 0:09:54.399,0:10:00.490 of any character followed by the literal 0:09:57.579,0:10:02.680 string disconnected from I'm saying 0:10:00.490,0:10:05.560 match that and then replace it with 0:10:02.680,0:10:07.660 blank regular expressions have a number 0:10:05.560,0:10:09.310 of these kind of special characters that 0:10:07.660,0:10:11.500 have various meanings you can take 0:10:09.310,0:10:12.459 advantage of I talked about star which 0:10:11.500,0:10:14.560 is zero or more 0:10:12.459,0:10:16.149 there's also plus which is one or more 0:10:14.560,0:10:17.620 right so this is saying I want the 0:10:16.149,0:10:19.139 previous expression to match at least 0:10:17.620,0:10:22.509 once 0:10:19.139,0:10:24.910 you also have square brackets so square 0:10:22.509,0:10:27.180 brackets let you match one of many 0:10:24.910,0:10:29.800 different characters so here let us 0:10:27.180,0:10:36.370 build up a string let's say something like 0:10:29.800,0:10:41.680 aba and I want to substitute a and b with 0:10:36.370,0:10:43.899 nothing okay so here what I'm telling 0:10:41.680,0:10:46.540 the 
pattern to do is to replace any 0:10:43.899,0:10:50.079 character that is either a or b with 0:10:46.540,0:10:52.810 nothing so if I make the first character 0:10:50.079,0:10:54.100 b it will still produce ba you might 0:10:52.810,0:10:56.019 wonder though why did it only replace 0:10:54.100,0:10:57.699 once well it's because what regular 0:10:56.019,0:11:00.160 expressions will do especially in this 0:10:57.699,0:11:01.569 default mode is they will just match the 0:11:00.160,0:11:04.269 pattern once and then apply the 0:11:01.569,0:11:07.360 replacement once per line that is what 0:11:04.269,0:11:09.279 sed normally does you can provide the g 0:11:07.360,0:11:12.250 modifier which says do this as many 0:11:09.279,0:11:14.139 times as it keeps matching which in this 0:11:12.250,0:11:15.790 case would erase the entire line because 0:11:14.139,0:11:18.699 every single character is either an a or 0:11:15.790,0:11:21.100 a b if I added a c here it would remove 0:11:18.699,0:11:23.019 everything but the c if I added other 0:11:21.100,0:11:24.370 characters in the middle of this string 0:11:23.019,0:11:26.260 somewhere they would all be preserved 0:11:24.370,0:11:34.209 but anything that is an a or a b is 0:11:26.260,0:11:37.889 removed you can also do things like add 0:11:34.209,0:11:37.889 modifiers to this for example 0:11:42.330,0:11:51.730 what would this do this is saying I want 0:11:46.720,0:11:52.800 zero or more of the string ab and I'm 0:11:51.730,0:11:55.270 gonna replace them with nothing 0:11:52.800,0:11:57.400 this means that if I have a standalone a 0:11:55.270,0:11:59.560 it will not be replaced if I have a 0:11:57.400,0:12:01.540 standalone b it will not be replaced but 0:11:59.560,0:12:09.580 if I have the string ab it will be 0:12:01.540,0:12:11.940 removed which yeah sed is 0:12:09.580,0:12:11.940 stupid 0:12:12.340,0:12:18.250 the -E here is because sed is a really 0:12:15.160,0:12:19.930 old tool and so it supports only a very 
0:12:18.250,0:12:22.270 old version of regular expressions 0:12:19.930,0:12:24.070 generally you will want to run it with 0:12:22.270,0:12:25.810 -E which makes it use a more 0:12:24.070,0:12:28.620 modern syntax that supports more things 0:12:25.810,0:12:30.940 if you are in a place where you can't 0:12:28.620,0:12:33.160 you have to prefix these with back 0:12:30.940,0:12:35.650 slashes to say I want the special 0:12:33.160,0:12:37.180 meaning of parenthesis otherwise they 0:12:35.650,0:12:39.990 will just match a literal parenthesis 0:12:37.180,0:12:43.510 which is probably not what you want so 0:12:39.990,0:12:46.390 notice how this replaced the ab here 0:12:43.510,0:12:48.790 and it replaced the ab here but it 0:12:46.390,0:12:51.040 left this c and it also left the a at 0:12:48.790,0:12:54.100 the end because that a does not match 0:12:51.040,0:12:55.740 this pattern anymore and you can group 0:12:54.100,0:12:58.180 these patterns in whatever ways you want 0:12:55.740,0:13:00.850 you also have things like alternations 0:12:58.180,0:13:07.420 you can say anything that matches ab or 0:13:00.850,0:13:10.510 bc I want to remove and here you'll 0:13:07.420,0:13:12.220 notice that this ab got removed this bc 0:13:10.510,0:13:14.740 did not get removed even though it 0:13:12.220,0:13:17.950 matches the pattern because the ab had 0:13:14.740,0:13:20.500 already been removed this ab is removed 0:13:17.950,0:13:22.960 right but the c stays in place this ab 0:13:20.500,0:13:25.870 is removed and this c stays because it 0:13:22.960,0:13:29.470 still does not match that if I made this 0:13:25.870,0:13:31.750 if I remove this a then now this ab 0:13:29.470,0:13:34.000 pattern will not match this b so it'll 0:13:31.750,0:13:36.280 be preserved and then bc will match bc 0:13:34.000,0:13:37.810 and it'll go away 0:13:36.280,0:13:39.940 regular expressions can be all sorts of 0:13:37.810,0:13:41.530 complicated when you first encounter 
0:13:39.940,0:13:42.790 them and even once you get more 0:13:41.530,0:13:45.160 experience with them they can be 0:13:42.790,0:13:47.770 daunting to look at and this is why very 0:13:45.160,0:13:49.600 often you want to use something like a 0:13:47.770,0:13:51.700 regular expression debugger which we'll 0:13:49.600,0:13:52.560 look at in a little bit but first let's 0:13:51.700,0:13:55.500 try to make up a 0:13:52.560,0:13:57.300 pattern that will match the logs and 0:13:55.500,0:14:00.390 match the logs that we've been working 0:13:57.300,0:14:02.070 with so far so here I'm gonna just sort 0:14:00.390,0:14:04.680 of extract a couple of lines from this 0:14:02.070,0:14:08.910 file let's say the first five so these 0:14:04.680,0:14:12.300 lines all now look like this right and 0:14:08.910,0:14:15.360 what we want to do is we want to only 0:14:12.300,0:14:21.210 have the user name okay so what might 0:14:15.360,0:14:30.120 this look like well here's one thing we 0:14:21.210,0:14:32.670 could try to do actually let me show you 0:14:30.120,0:14:34.370 one thing first let me take a 0:14:32.670,0:14:38.990 line that says something like 0:14:34.370,0:14:44.279 disconnected from invalid user 0:14:38.990,0:14:46.620 disconnected from maybe four to one one 0:14:44.279,0:14:49.740 whatever okay so this is an example of a 0:14:46.620,0:14:54.200 login line where someone tried to login 0:14:49.740,0:14:54.200 with the username disconnected from 0:14:54.500,0:15:05.400 missing an S disconnected thank you 0:15:03.200,0:15:08.310 you'll notice that this actually removed 0:15:05.400,0:15:10.770 the username as well and this is because 0:15:08.310,0:15:11.940 when you use dot star and any of these 0:15:10.770,0:15:14.490 sort of range expressions in regular 0:15:11.940,0:15:17.070 expressions they are greedy they will 0:15:14.490,0:15:19.890 match as much as they can so in this 0:15:17.070,0:15:22.130 case this was the username that we 0:15:19.890,0:15:24.930 wanted to 
retain but this pattern 0:15:22.130,0:15:27.060 actually matched all the way up until 0:15:24.930,0:15:28.620 the second occurrence of it or the last 0:15:27.060,0:15:30.960 occurrence of it and so everything 0:15:28.620,0:15:33.000 before it including the username itself 0:15:30.960,0:15:34.470 got removed and so we need to come up 0:15:33.000,0:15:36.150 with a slightly cleverer matching 0:15:34.470,0:15:38.190 strategy than just saying sort of dot 0:15:36.150,0:15:39.959 star because it means that if we have 0:15:38.190,0:15:41.339 particularly adversarial input we might 0:15:39.959,0:15:44.430 end up with something that we didn't 0:15:41.339,0:15:47.670 expect okay so let's see how we might 0:15:44.430,0:15:56.850 try to match these lines let's just do a 0:15:47.670,0:16:00.660 head first well let's try to construct 0:15:56.850,0:16:02.970 this up from the beginning we first of 0:16:00.660,0:16:05.190 all know that we want -E right 0:16:02.970,0:16:07.170 because we want to not have to put all 0:16:05.190,0:16:09.839 these back slashes everywhere 0:16:07.170,0:16:14.880 these lines look like they say from and 0:16:09.839,0:16:16.769 then some of them say invalid but some 0:16:14.880,0:16:19.170 of them do not right this line has 0:16:16.769,0:16:21.690 invalid that one does not question mark 0:16:19.170,0:16:26.029 here is saying zero or one so I want 0:16:21.690,0:16:31.320 zero or one of invalid space 0:16:26.029,0:16:34.320 user what else well that's going to be a 0:16:31.320,0:16:36.529 double space so we can't have that and 0:16:34.320,0:16:40.440 then there's gonna be some username and 0:16:36.529,0:16:43.160 then there's gonna be what exactly is 0:16:40.440,0:16:46.290 gonna be what looks like an IP address 0:16:43.160,0:16:50.190 so here we can use our range syntax and 0:16:46.290,0:16:53.490 say zero to nine and a dot right that's 0:16:50.190,0:16:58.170 what IP addresses are and we want many 0:16:53.490,0:17:00.300 of those then it 
says port so we're 0:16:58.170,0:17:03.060 just going to match a literal port and 0:17:00.300,0:17:07.980 then another number zero to nine and 0:17:03.060,0:17:09.150 we're going to want plus of that the 0:17:07.980,0:17:10.049 other thing we're going to do here is 0:17:09.150,0:17:11.880 we're going to do what's known as 0:17:10.049,0:17:13.439 anchoring the regular expression so 0:17:11.880,0:17:15.780 there are two special characters in 0:17:13.439,0:17:17.699 regular expressions there's caret or 0:17:15.780,0:17:19.799 hat which matches the beginning of a 0:17:17.699,0:17:22.439 line and there's dollar which matches 0:17:19.799,0:17:24.839 the end of a line so here we're gonna 0:17:22.439,0:17:27.990 say that this regular expression has to match 0:17:24.839,0:17:29.760 the complete line the reason we do this 0:17:27.990,0:17:33.290 is because imagine that someone made 0:17:29.760,0:17:35.250 their username the entire log string 0:17:33.290,0:17:38.460 then now if you try to match this 0:17:35.250,0:17:40.730 pattern it would match the username 0:17:38.460,0:17:42.990 itself which is not what we want 0:17:40.730,0:17:44.490 generally you will want to try to anchor 0:17:42.990,0:17:46.860 your patterns wherever you can to avoid 0:17:44.490,0:17:49.919 those kind of oddities okay let's see 0:17:46.860,0:17:51.960 what that gave us that removed many of 0:17:49.919,0:17:54.360 the lines but not all of them so this 0:17:51.960,0:17:56.880 one for example includes this preauth at 0:17:54.360,0:18:02.760 the end so we'll want to cut that off if 0:17:56.880,0:18:04.549 there's a space preauth square brackets 0:18:02.760,0:18:07.350 are special so we need to escape them 0:18:04.549,0:18:10.650 right now let's see what happens if we 0:18:07.350,0:18:12.360 try more lines of this no it still gets 0:18:10.650,0:18:13.710 something weird some of these lines are 0:18:12.360,0:18:16.740 not empty right which means that the 0:18:13.710,0:18:18.990 pattern did not match this one for 
0:18:16.740,0:18:20.010 example it says authenticating user 0:18:18.990,0:18:24.690 instead of invalid 0:18:20.010,0:18:27.300 user okay so it has to match invalid or 0:18:24.690,0:18:30.900 authenticating zero or one time before 0:18:27.300,0:18:34.530 user how about now okay that looks 0:18:30.900,0:18:36.990 pretty promising but this output is not 0:18:34.530,0:18:38.880 particularly helpful right here we've 0:18:36.990,0:18:41.360 just erased every line of our log files 0:18:38.880,0:18:43.890 successfully which is not very helpful 0:18:41.360,0:18:46.110 instead what we really wanted to do is 0:18:43.890,0:18:48.780 when we match the username right over 0:18:46.110,0:18:50.310 here we really wanted to remember what 0:18:48.780,0:18:53.310 that username was because that is what 0:18:50.310,0:18:55.770 we want to print out and the way we can 0:18:53.310,0:19:00.300 do that in regular expressions is using 0:18:55.770,0:19:03.630 something like capture groups so capture 0:19:00.300,0:19:06.570 groups are a way to say that I want to 0:19:03.630,0:19:10.350 remember this value and reuse it later 0:19:06.570,0:19:12.180 and in regular expressions any bracketed 0:19:10.350,0:19:14.460 expression any parenthesized expression is 0:19:12.180,0:19:16.770 going to be such a capture group so we 0:19:14.460,0:19:18.570 already actually have one here which is 0:19:16.770,0:19:20.850 this first group and now we're creating 0:19:18.570,0:19:22.590 a second one here notice that these 0:19:20.850,0:19:24.870 parentheses don't do anything to the 0:19:22.590,0:19:27.210 matching right because they're just 0:19:24.870,0:19:28.800 saying this expression as a unit but we 0:19:27.210,0:19:32.550 don't have any modifiers after it so 0:19:28.800,0:19:34.980 it's just matched one time and then the 0:19:32.550,0:19:36.810 reason matching groups are useful or 0:19:34.980,0:19:38.370 capture groups are useful is because you 0:19:36.810,0:19:40.920 can refer back to them in the 
0:19:38.370,0:19:43.800 replacement so in the replacement here I 0:19:40.920,0:19:45.630 can say backslash two this is the way 0:19:43.800,0:19:47.760 that you refer to the name of a capture 0:19:45.630,0:19:50.250 group in this case I'm 0:19:47.760,0:19:53.340 saying match the entire line and then in 0:19:50.250,0:19:55.380 the replacement put in the value you 0:19:53.340,0:19:57.330 captured in the second capture group 0:19:55.380,0:20:00.020 right remember this is the first capture 0:19:57.330,0:20:03.330 group and this is the second one and 0:20:00.020,0:20:05.670 this gives me all the usernames now if 0:20:03.330,0:20:08.580 you look back at what we wrote this is 0:20:05.670,0:20:10.050 pretty complicated right it might make 0:20:08.580,0:20:12.000 sense now that we walk through it and 0:20:10.050,0:20:14.130 why it had to be the way it was but this 0:20:12.000,0:20:16.140 is like not obvious that this is how 0:20:14.130,0:20:19.680 these lines work and this is where a 0:20:16.140,0:20:22.260 regular expression debugger can come in 0:20:19.680,0:20:25.410 really really handy so we have one here 0:20:22.260,0:20:27.510 there are many online but here I've sort 0:20:25.410,0:20:31.710 of pre filled in this expression that we 0:20:27.510,0:20:34.380 just used and notice that it tells me 0:20:31.710,0:20:37.470 all the matching it does in fact now this 0:20:34.380,0:20:42.950 window is a little small with this font 0:20:37.470,0:20:45.620 size but if I do here this explanation 0:20:42.950,0:20:48.320 says dot star matches any character 0:20:45.620,0:20:52.170 between zero and unlimited times 0:20:48.320,0:20:54.270 followed by disconnected from literally 0:20:52.170,0:20:56.790 followed by a capture group and then 0:20:54.270,0:20:59.190 walks you through all the stuff and 0:20:56.790,0:21:00.960 that's one thing but it also lets you 0:20:59.190,0:21:03.510 give it a test string and then matches the 0:21:00.960,0:21:05.370 pattern against every 
single test string 0:21:03.510,0:21:07.460 that you give and highlights what the 0:21:05.370,0:21:11.490 different capture groups for example are 0:21:07.460,0:21:15.060 so here we made user a capture group 0:21:11.490,0:21:16.980 right so it'll say okay the full string 0:21:15.060,0:21:19.110 matched right the whole thing is blue so 0:21:16.980,0:21:21.180 it matched green is the first capture 0:21:19.110,0:21:23.370 group red is the second capture group 0:21:21.180,0:21:26.130 and this is the third because preauth 0:21:23.370,0:21:27.750 was also put into parentheses and this 0:21:26.130,0:21:31.020 can be a handy way to try to debug your 0:21:27.750,0:21:35.610 regular expressions for example if I put 0:21:31.020,0:21:41.070 disconnected from and let's add a new 0:21:35.610,0:21:45.240 line here and I make the username 0:21:41.070,0:21:46.530 disconnected from now that line already 0:21:45.240,0:21:49.950 had the username be disconnected from 0:21:46.530,0:21:54.150 great me thinking ahead you'll 0:21:49.950,0:21:56.010 notice that with this pattern this was 0:21:54.150,0:21:58.740 no longer a problem because it got 0:21:56.010,0:22:02.580 matched as the username what happens if we 0:21:58.740,0:22:07.170 take this entire line or this entire 0:22:02.580,0:22:13.830 line and make that the username now what 0:22:07.170,0:22:15.180 happens it gets really confused right so 0:22:13.830,0:22:18.390 this is where regular expressions can be 0:22:15.180,0:22:21.780 a pain to get right because it now tries 0:22:18.390,0:22:23.970 to match it matches the first place 0:22:21.780,0:22:27.420 where username appears or the first 0:22:23.970,0:22:29.700 invalid in this case the second invalid 0:22:27.420,0:22:31.830 because this is greedy we can make this 0:22:29.700,0:22:36.360 non greedy by putting a question mark 0:22:31.830,0:22:38.520 here so if you suffix a plus or a star 0:22:36.360,0:22:40.860 with a question mark it becomes a non 0:22:38.520,0:22:42.540 greedy match 
so it will not try to match 0:22:40.860,0:22:43.820 as much as possible and then you see 0:22:42.540,0:22:46.030 that this actually gets parsed correctly 0:22:43.820,0:22:47.950 because this dot 0:22:46.030,0:22:49.480 will stop at the first disconnected 0:22:47.950,0:22:52.450 from which is the one that's actually 0:22:49.480,0:22:57.070 emitted by SSH the one that actually 0:22:52.450,0:22:58.720 appears in our logs as you can probably 0:22:57.070,0:23:00.790 tell from the explanation of this so far 0:22:58.720,0:23:03.130 regular expressions can get really 0:23:00.790,0:23:05.320 complicated and there are all sorts of 0:23:03.130,0:23:07.330 weird modifiers that you might have to 0:23:05.320,0:23:09.130 apply in your pattern the only way to 0:23:07.330,0:23:10.750 really learn them is to start with 0:23:09.130,0:23:12.970 simple ones and then build them up until 0:23:10.750,0:23:14.860 they match what you need often you're 0:23:12.970,0:23:16.150 just doing some like one-off job like 0:23:14.860,0:23:17.770 when we're hacking out the user names 0:23:16.150,0:23:19.870 here and you don't need to care about 0:23:17.770,0:23:21.610 all the special conditions right you 0:23:19.870,0:23:24.190 don't have to care about someone having 0:23:21.610,0:23:26.020 the SSH username perfectly match your 0:23:24.190,0:23:27.430 login format that's probably not 0:23:26.020,0:23:29.440 something that matters because you're 0:23:27.430,0:23:30.730 just trying to find the usernames but 0:23:29.440,0:23:32.710 regular expressions are really powerful 0:23:30.730,0:23:33.730 and you want to be careful if you're 0:23:32.710,0:23:36.870 doing something where it actually 0:23:33.730,0:23:36.870 matters you had a question 0:23:41.380,0:23:47.560 regular expressions by default only 0:23:43.510,0:23:58.630 match per line anyway they will not 0:23:47.560,0:24:01.210 match across new lines so the way 0:23:58.630,0:24:04.680 that sed works is that it operates per 0:24:01.210,0:24:10.390 
line and so sed will do this 0:24:04.680,0:24:12.250 expression for every line okay questions 0:24:10.390,0:24:14.410 about regular expressions or this pattern 0:24:12.250,0:24:16.390 so far it is a complicated pattern so if 0:24:14.410,0:24:17.560 it if it feels confusing like don't be 0:24:16.390,0:24:31.450 worried about it look at it in the 0:24:17.560,0:24:33.550 debugger later yep so so keep in mind 0:24:31.450,0:24:36.130 that we're assuming here that the 0:24:33.550,0:24:38.590 user only has control over their 0:24:36.130,0:24:41.800 username right so the worst that they 0:24:38.590,0:24:43.510 could do is take like this entire entry 0:24:41.800,0:24:48.490 and make that the username let's see 0:24:43.510,0:24:51.490 what happens right so that still works 0:24:48.490,0:24:53.710 and the reason for this is this question 0:24:51.490,0:24:56.200 mark means that the moment we hit the 0:24:53.710,0:24:58.820 disconnected keyword we start parsing the 0:24:56.200,0:25:00.769 rest of the pattern right and the 0:24:58.820,0:25:03.200 first occurrence of disconnected is 0:25:00.769,0:25:05.720 printed by SSH before anything the user 0:25:03.200,0:25:08.210 controls so in this particular instance 0:25:05.720,0:25:21.049 even this will not confuse the pattern 0:25:08.210,0:25:24.919 yep if well so if you're writing a this 0:25:21.049,0:25:26.149 sort of odd matching will in general 0:25:24.919,0:25:29.120 when you're doing data wrangling is like 0:25:26.149,0:25:31.370 not security it's not security related 0:25:29.120,0:25:33.889 but it might mean that you get really 0:25:31.370,0:25:35.299 weird data back and so if you're doing 0:25:33.889,0:25:37.399 something like plotting data you might 0:25:35.299,0:25:39.559 drop data points that matter you might 0:25:37.399,0:25:41.450 parse out the wrong number and then like 0:25:39.559,0:25:43.370 your plot suddenly has data points that 0:25:41.450,0:25:45.559 weren't in the original data and so it's 0:25:43.370,0:25:47.419 
more that if you find yourself writing a 0:25:45.559,0:25:49.070 complicated regular expression like 0:25:47.419,0:25:51.710 double check that it's actually matching 0:25:49.070,0:25:56.570 what you think it's matching and even if 0:25:51.710,0:25:58.220 it's not security related and as you can 0:25:56.570,0:26:00.950 imagine these patterns can get really 0:25:58.220,0:26:02.809 complicated like for example there's a 0:26:00.950,0:26:04.210 big debate about how do you match an 0:26:02.809,0:26:06.230 email address with a regular expression 0:26:04.210,0:26:08.870 and you might think of something like 0:26:06.230,0:26:10.850 this so this is a very straightforward 0:26:08.870,0:26:13.909 one that just says letters and numbers 0:26:10.850,0:26:15.620 and underscores and percent followed 0:26:13.909,0:26:17.799 by a plus because in Gmail you can have 0:26:15.620,0:26:22.100 pluses in email addresses with a suffix 0:26:17.799,0:26:24.620 in this case the plus is just for any 0:26:22.100,0:26:25.730 number of these but at least one because 0:26:24.620,0:26:26.929 you can't have an email address that 0:26:25.730,0:26:29.269 doesn't have anything before the at and 0:26:26.929,0:26:31.789 then similarly after the domain right 0:26:29.269,0:26:33.139 and the top-level domain has to be at 0:26:31.789,0:26:35.059 least two characters and can't include 0:26:33.139,0:26:38.000 digits right you can have dot com but 0:26:35.059,0:26:40.039 you can't have dot seven it turns out 0:26:38.000,0:26:42.139 this is not really correct right there 0:26:40.039,0:26:43.220 are a bunch of valid email addresses 0:26:42.139,0:26:44.360 that will not be matched by this and 0:26:43.220,0:26:45.559 there are a bunch of invalid email 0:26:44.360,0:26:50.629 addresses that will be matched by this 0:26:45.559,0:26:52.399 so there are many many suggestions and 0:26:50.629,0:26:54.529 there are people who've built like full 0:26:52.399,0:26:58.460 test suites to try to see which regular 
0:26:54.529,0:27:00.889 expression is best and this is this 0:26:58.460,0:27:02.899 particular one is for URLs there are 0:27:00.889,0:27:06.470 similar ones for email where they found 0:27:02.899,0:27:07.909 that the best one is this one I don't 0:27:06.470,0:27:10.790 recommend you trying to understand this 0:27:07.909,0:27:13.720 pattern but this one apparently will 0:27:10.790,0:27:15.830 almost perfectly match what the like 0:27:13.720,0:27:17.840 internet standard for email addresses 0:27:15.830,0:27:20.000 says is a valid email address and that 0:27:17.840,0:27:22.250 includes all sorts of weird Unicode code 0:27:20.000,0:27:24.440 points this is just to say regular 0:27:22.250,0:27:26.060 expressions can be really hairy and if 0:27:24.440,0:27:28.880 you end up somewhere like this there's 0:27:26.060,0:27:30.620 probably a better way to do it for 0:27:28.880,0:27:35.320 example if you find yourself trying to 0:27:30.620,0:27:38.300 parse HTML or something or parse like 0:27:35.320,0:27:40.310 parse JSON with regular expressions you 0:27:38.300,0:27:42.230 should probably use a different tool and 0:27:40.310,0:27:44.480 there is an exercise that has you do 0:27:42.230,0:27:49.960 this not with regular expressions point 0:27:44.480,0:27:53.180 you yeah that it's there's all sorts of 0:27:49.960,0:27:54.740 suggestions and they give you deep deep 0:27:53.180,0:27:56.660 dives into how they work if you want to 0:27:54.740,0:28:01.670 look that up it's it's in the lecture 0:27:56.660,0:28:04.280 notes okay so now we have the list of 0:28:01.670,0:28:05.960 usernames so let's go back to data 0:28:04.280,0:28:08.210 wrangling right like this list of 0:28:05.960,0:28:10.250 usernames is still not that interesting to 0:28:08.210,0:28:15.790 me right let's let's see how many lines 0:28:10.250,0:28:15.790 there are so if I do wc -l there are 0:28:15.910,0:28:21.470 one hundred and ninety eight thousand 0:28:18.320,0:28:23.260 lines so wc is the word 
count program 0:28:21.470,0:28:26.030 -l makes it count the number of lines 0:28:23.260,0:28:27.530 this is a lot of lines and if I start 0:28:26.030,0:28:29.690 scrolling through them that still 0:28:27.530,0:28:31.730 doesn't really help me right like I need 0:28:29.690,0:28:37.130 statistics over this I need aggregates 0:28:31.730,0:28:38.450 of some kind and the sed tool is like 0:28:37.130,0:28:40.100 useful for many things it gives you a 0:28:38.450,0:28:43.010 full programming language it can do 0:28:40.100,0:28:45.020 weird things like insert text or only 0:28:43.010,0:28:46.400 print matching lines but it's not 0:28:45.020,0:28:48.560 necessarily the perfect tool for 0:28:46.400,0:28:50.330 everything right like sometimes there 0:28:48.560,0:28:53.420 are better tools like for example you 0:28:50.330,0:28:55.400 could write a line counter in sed you 0:28:53.420,0:28:56.840 just should never sed is a terrible 0:28:55.400,0:29:00.440 programming language except for 0:28:56.840,0:29:02.740 searching and replacing but there are 0:29:00.440,0:29:07.940 other useful tools so for example 0:29:02.740,0:29:09.710 there's a tool called sort so sort this 0:29:07.940,0:29:12.080 is also not going to be very helpful but 0:29:09.710,0:29:13.850 sort takes a bunch of lines of input 0:29:12.080,0:29:16.940 sorts them and then prints them to your 0:29:13.850,0:29:19.130 output so in this case I now get the 0:29:16.940,0:29:20.540 sorted output of that list it is still 0:29:19.130,0:29:23.840 two hundred thousand lines long so it's 0:29:20.540,0:29:24.760 still not very helpful to me but now I 0:29:23.840,0:29:27.340 can combine it with 0:29:24.760,0:29:30.550 the tool called uniq so uniq will 0:29:27.340,0:29:33.130 look at a sorted list of lines and it 0:29:30.550,0:29:34.930 will only print those that are unique so 0:29:33.130,0:29:37.090 if you have multiple instances of any 0:29:34.930,0:29:40.750 given line it will only print it once 0:29:37.090,0:29:44.290 and 
then I can say uniq -c so this is 0:29:40.750,0:29:46.030 gonna say count the number of duplicates 0:29:44.290,0:29:48.010 for any lines that are duplicated and 0:29:46.030,0:29:52.000 eliminate them what does this look like 0:29:48.010,0:29:56.050 well if I run it it's gonna take a while 0:29:52.000,0:29:59.710 there were thirteen zze usernames there 0:29:56.050,0:30:01.240 were ten ZX VF usernames etc there and 0:29:59.710,0:30:03.460 I can scroll through this this is still 0:30:01.240,0:30:06.130 a very long list right but at least now 0:30:03.460,0:30:08.200 it's a little bit more collated than it 0:30:06.130,0:30:10.770 was let's see how many lines I'm down 0:30:08.200,0:30:10.770 to now okay 0:30:13.480,0:30:17.380 twenty-four thousand lines it's still 0:30:15.460,0:30:19.810 too much it's not useful information to 0:30:17.380,0:30:22.960 me but I can keep burning down this with 0:30:19.810,0:30:24.730 more tools for example what I might care 0:30:22.960,0:30:29.050 about is which usernames have been used 0:30:24.730,0:30:31.330 the most well I can do sort again and I 0:30:29.050,0:30:35.560 can say I want a numeric sort on the 0:30:31.330,0:30:38.980 first column of the input so -n says 0:30:35.560,0:30:41.320 numeric sort -k lets you select a 0:30:38.980,0:30:43.720 whitespace separated column from the input to 0:30:41.320,0:30:45.760 sort by and the reason I'm giving 0:30:43.720,0:30:47.680 1,1 here is because I want to 0:30:45.760,0:30:49.690 start at the first column and stop at 0:30:47.680,0:30:52.150 the first column alternatively I could 0:30:49.690,0:30:54.130 say I want you to sort by this list of 0:30:52.150,0:30:58.300 columns but in this case I just want to 0:30:54.130,0:31:01.840 sort by that column and then I want only 0:30:58.300,0:31:06.720 the ten last lines so sort by default 0:31:01.840,0:31:08.890 will output in ascending order so the 0:31:06.720,0:31:10.330 the ones with the highest counts are 0:31:08.890,0:31:14.560 
gonna be at the bottom and then I want 0:31:10.330,0:31:17.470 only the last ten lines and now when I run 0:31:14.560,0:31:20.590 this I actually get a useful bit of data 0:31:17.470,0:31:21.730 right it tells me there were eleven 0:31:20.590,0:31:24.730 thousand login attempts with the 0:31:21.730,0:31:26.500 username root there were four thousand 0:31:24.730,0:31:29.530 with 123456 as 0:31:26.500,0:31:33.790 username etc and this is pretty handy 0:31:29.530,0:31:36.040 right and now suddenly this giant log 0:31:33.790,0:31:38.230 file actually produces useful 0:31:36.040,0:31:40.540 information for me this is what I really 0:31:38.230,0:31:44.230 wanted from that log file now maybe I want to 0:31:40.540,0:31:46.530 just like do a quick disabling of root 0:31:44.230,0:31:50.610 for example for SSH login on my machine 0:31:46.530,0:31:50.610 which I recommend you all do by the way 0:31:51.210,0:31:56.559 in this particular case we don't 0:31:53.410,0:31:58.510 actually need the -k for sort because sort 0:31:56.559,0:32:00.850 by default will sort by the entire line 0:31:58.510,0:32:01.990 and the number happens to come first but 0:32:00.850,0:32:04.059 it's useful to know about these 0:32:01.990,0:32:06.010 additional flags and you might wonder 0:32:04.059,0:32:07.330 well how would I know that these flags 0:32:06.010,0:32:08.559 exist how would I know that these 0:32:07.330,0:32:11.410 programs even exist 0:32:08.559,0:32:12.850 well the programs you usually pick up just 0:32:11.410,0:32:15.900 from being told about them in classes 0:32:12.850,0:32:19.030 like here the flags are usually like I 0:32:15.900,0:32:22.299 want to sort by something that is not 0:32:19.030,0:32:24.160 the full line your first instinct should 0:32:22.299,0:32:25.929 be to type man sort and then read 0:32:24.160,0:32:27.669 through the page and it very quickly 0:32:25.929,0:32:29.230 will tell you here's how to select a 0:32:27.669,0:32:35.919 particular column here's how to sort 
by 0:32:29.230,0:32:38.490 a number etc okay what if now that I 0:32:35.919,0:32:40.419 have this like top let's say top 20 list 0:32:38.490,0:32:42.790 let's say I don't actually care about 0:32:40.419,0:32:45.010 the counts I just want like a comma 0:32:42.790,0:32:47.470 separated list of the usernames because 0:32:45.010,0:32:49.510 I'm gonna like send it to myself by 0:32:47.470,0:32:53.410 email every day or something like that 0:32:49.510,0:32:56.910 like these are the top 20 usernames well 0:32:53.410,0:32:56.910 I can do this 0:32:58.290,0:33:02.559 ok that's a lot more weird commands but 0:33:01.360,0:33:07.330 they're commands that are useful to know 0:33:02.559,0:33:09.880 about so awk is a column based stream 0:33:07.330,0:33:12.429 processor so we talked about sed which 0:33:09.880,0:33:15.640 is a stream editor so it tries to edit 0:33:12.429,0:33:18.820 text primarily in the inputs awk on the 0:33:15.640,0:33:20.650 other hand also lets you edit text it is 0:33:18.820,0:33:23.290 still a full programming language but 0:33:20.650,0:33:25.660 it's more focused on columnar data so in 0:33:23.290,0:33:28.390 this case awk by default will parse its 0:33:25.660,0:33:30.190 input in whitespace separated columns 0:33:28.390,0:33:32.169 and then lets you operate on those 0:33:30.190,0:33:33.429 columns separately in this case I'm 0:33:32.169,0:33:38.320 saying just print the second column 0:33:33.429,0:33:40.299 which is the username right paste is a 0:33:38.320,0:33:43.030 command that takes a bunch of lines and 0:33:40.299,0:33:46.350 pastes them together into a single line 0:33:43.030,0:33:49.450 that's the -s with the delimiter comma 0:33:46.350,0:33:51.740 so in this case for this I want to 0:33:49.450,0:33:53.929 get a comma separated list of the top 0:33:51.740,0:33:56.120 usernames which I can then do whatever 0:33:53.929,0:33:57.500 useful thing I might want maybe I want 0:33:56.120,0:33:59.149 to stick this in a config file of 
0:33:57.500,0:34:00.429 disallowed usernames or something along 0:33:59.149,0:34:04.039 those lines 0:34:00.429,0:34:05.720 um awk is worth talking a little bit 0:34:04.039,0:34:08.510 more about because it turns out to be a 0:34:05.720,0:34:12.859 really powerful language for this kind 0:34:08.510,0:34:16.190 of data wrangling we mentioned briefly 0:34:12.859,0:34:19.010 what this print $2 does but it 0:34:16.190,0:34:21.020 turns out that for awk you can do some 0:34:19.010,0:34:22.849 really really fancy things so for 0:34:21.020,0:34:25.129 example let's go back to here where we 0:34:22.849,0:34:29.419 just have the usernames I say let's 0:34:25.129,0:34:31.669 still do sort and uniq because we 0:34:29.419,0:34:32.089 don't otherwise the list gets far too 0:34:31.669,0:34:34.040 long 0:34:32.089,0:34:36.800 and let's say that I only want to print 0:34:34.040,0:34:40.760 the usernames that match a particular 0:34:36.800,0:34:51.440 pattern let's say for example that I 0:34:40.760,0:34:56.570 want to see I want all of the usernames 0:34:51.440,0:34:59.599 that only appear once and that start 0:34:56.570,0:35:02.359 with a c and end with an e that's a 0:34:59.599,0:35:04.310 really weird thing to look for but in 0:35:02.359,0:35:06.410 awk this is really simple to express I 0:35:04.310,0:35:11.200 can say I want the first column to be 1 0:35:06.410,0:35:15.190 and I want the second column to match 0:35:11.200,0:35:15.190 the following regular expression 0:35:20.480,0:35:32.030 hey this could probably just be dot and 0:35:26.119,0:35:33.920 then I want to print the whole line so 0:35:32.030,0:35:36.230 unless I mess something up this will 0:35:33.920,0:35:38.900 give me all the usernames that start 0:35:36.230,0:35:42.859 with a c end with an e and only appear 0:35:38.900,0:35:44.780 once in my log now that might not be a 0:35:42.859,0:35:46.640 very useful thing to do with the data 0:35:44.780,0:35:48.230 what I'm trying to do in this lecture is 
0:35:46.640,0:35:49.940 show you the kind of tools that are 0:35:48.230,0:35:51.619 available and in this particular case 0:35:49.940,0:35:53.180 this pattern is like not that 0:35:51.619,0:35:54.980 complicated even though what we're doing 0:35:53.180,0:35:58.339 is sort of weird and this is because 0:35:54.980,0:35:59.570 very often on Linux with Linux tools in 0:35:58.339,0:36:02.570 particular and command-line tools in 0:35:59.570,0:36:04.609 general the tools are built to be based 0:36:02.570,0:36:06.440 on lines of input and lines of output 0:36:04.609,0:36:09.079 and very often those lines are going to 0:36:06.440,0:36:18.079 have multiple columns and awk is 0:36:09.079,0:36:22.160 great for operating over columns now awk 0:36:18.079,0:36:26.750 is is not just able to do things like 0:36:22.160,0:36:29.060 match per line but it lets you do things 0:36:26.750,0:36:31.220 like let's say I want the number of 0:36:29.060,0:36:32.900 these right I want to know how many 0:36:31.220,0:36:36.829 usernames match this pattern well I can do 0:36:32.900,0:36:39.710 wc -l that works just fine all right 0:36:36.829,0:36:41.990 there are 31 such usernames but awk is 0:36:39.710,0:36:44.780 a programming language this is something 0:36:41.990,0:36:46.819 that you will probably never end up 0:36:44.780,0:36:49.430 doing yourself but it's important to 0:36:46.819,0:36:53.200 know that you can every now and again it 0:36:49.430,0:36:53.200 is actually useful to know about these 0:36:53.619,0:37:02.420 this might be hard to read on my screen 0:36:57.140,0:37:04.960 I just realized let me try to fix that 0:37:02.420,0:37:04.960 in a second 0:37:07.299,0:37:17.649 let's do yeah apparently fish does not 0:37:14.469,0:37:19.749 want me to do that um so here BEGIN is a 0:37:17.649,0:37:22.539 special pattern that only matches the 0:37:19.749,0:37:25.779 zeroth line END is a special pattern 0:37:22.539,0:37:28.179 that only matches after the last line 0:37:25.779,0:37:29.619 and 
then this is gonna be a normal 0:37:28.179,0:37:32.019 pattern that's matched against every 0:37:29.619,0:37:34.149 line so what I'm saying here is on the 0:37:32.019,0:37:36.579 zeroth line set the variable rows to 0:37:34.149,0:37:40.419 zero on every line that matches this 0:37:36.579,0:37:42.309 pattern increment rows and after you 0:37:40.419,0:37:44.919 have matched the last line print the 0:37:42.309,0:37:47.499 value of rows and this will have the 0:37:44.919,0:37:50.259 same effect as running wc -l but all 0:37:47.499,0:37:52.809 within awk in this particular instance like 0:37:50.259,0:37:55.599 wc -l is just fine but sometimes you want 0:37:52.809,0:37:57.429 to do things like you might want 0:37:55.599,0:37:59.109 to keep a dictionary or a map of some 0:37:57.429,0:38:01.119 kind you might want to compute 0:37:59.109,0:38:03.219 statistics you might want to do things 0:38:01.119,0:38:05.469 like I want the second match of this 0:38:03.219,0:38:07.630 pattern so you need a stateful matcher 0:38:05.469,0:38:09.099 like ignore the first match but then 0:38:07.630,0:38:11.140 print everything following the second 0:38:09.099,0:38:12.639 match and for that this kind of simple 0:38:11.140,0:38:18.489 programming in awk can be useful to know 0:38:12.639,0:38:22.929 about in fact we could in this pattern 0:38:18.489,0:38:24.789 get rid of sed and sort and uniq and 0:38:22.929,0:38:26.799 grep that we originally used to produce 0:38:24.789,0:38:28.209 this file and do it all in awk 0:38:26.799,0:38:30.880 but you probably don't want to do that 0:38:28.209,0:38:34.539 it would be probably too painful to be 0:38:30.880,0:38:37.359 worth it it's worth talking a little bit 0:38:34.539,0:38:38.999 about the other kinds of tools that you 0:38:37.359,0:38:41.169 might want to use on the command line 0:38:38.999,0:38:45.039 the first of these is a really handy 0:38:41.169,0:38:49.929 program called bc so bc is the Berkeley 0:38:45.039,0:38:51.449 calculator I 
believe man bc I think bc 0:38:49.929,0:38:54.069 is originally from Berkeley calculator 0:38:51.449,0:38:56.169 anyway it is a very simple command-line 0:38:54.069,0:38:58.959 calculator but instead of giving you a 0:38:56.169,0:39:00.759 prompt it reads from standard in so I 0:38:58.959,0:39:04.899 can do something like echo 1 plus 2 and 0:39:00.759,0:39:06.789 pipe it to bc -l because many of 0:39:04.899,0:39:11.319 these programs normally operate in like 0:39:06.789,0:39:15.699 a stupid mode where they're unhelpful so 0:39:11.319,0:39:17.469 here it prints 3 wow very impressive but 0:39:15.699,0:39:19.779 it turns out this can be really handy 0:39:17.469,0:39:21.100 imagine you have a file with a bunch of 0:39:19.779,0:39:26.340 lines 0:39:21.100,0:39:32.020 let's say something like oh I don't know 0:39:26.340,0:39:35.020 this file and let's say I want to sum up 0:39:32.020,0:39:36.910 the number of logins the number of 0:39:35.020,0:39:40.030 usernames that have not been used only once 0:39:36.910,0:39:43.870 all right so the ones where the count is 0:39:40.030,0:39:48.550 not equal to one I want to print just 0:39:43.870,0:39:50.950 the count right this is give me the 0:39:48.550,0:39:52.930 counts for all the non single-use 0:39:50.950,0:39:55.180 usernames and then I want to know how many 0:39:52.930,0:39:56.740 are there of these notice that I can't 0:39:55.180,0:39:59.110 just count the lines that wouldn't work 0:39:56.740,0:40:02.200 right because there are numbers on each 0:39:59.110,0:40:05.950 line I want to sum well I can use paste 0:40:02.200,0:40:08.100 to paste by plus so this pastes every 0:40:05.950,0:40:12.040 line together into a plus expression 0:40:08.100,0:40:14.200 right and this is now an arithmetic 0:40:12.040,0:40:18.910 expression so I can pipe it through bc -l 0:40:14.200,0:40:20.920 and now there have been a hundred and 0:40:18.910,0:40:22.720 ninety one thousand logins that share a 0:40:20.920,0:40:25.540 username with at least 
one other login 0:40:22.720,0:40:27.700 again probably not something you really 0:40:25.540,0:40:29.560 care about but this is just to show you 0:40:27.700,0:40:34.360 that you can extract this data pretty 0:40:29.560,0:40:36.070 easily and there's all sorts of other 0:40:34.360,0:40:37.810 stuff you can do with this for example 0:40:36.070,0:40:40.810 there are tools that let you compute 0:40:37.810,0:40:43.660 statistics over inputs so for example 0:40:40.810,0:40:45.850 for this list of numbers that's that I 0:40:43.660,0:40:49.590 just took the numbers and just printed 0:40:45.850,0:40:54.880 out just the distribution of numbers I 0:40:49.590,0:40:56.080 could do things like use R R is a 0:40:54.880,0:40:57.640 separate programming language that's 0:40:56.080,0:41:02.230 specifically built for statistical 0:40:57.640,0:41:03.570 analysis and I can say let's see if I 0:41:02.230,0:41:06.280 got this right 0:41:03.570,0:41:10.440 this is again a different programming 0:41:06.280,0:41:13.210 language that you would have to learn 0:41:10.440,0:41:14.200 but if you already know R or you can 0:41:13.210,0:41:23.860 pipe them through other languages 0:41:14.200,0:41:26.380 too like so so this gives me summary 0:41:23.860,0:41:30.160 statistics over that input stream of 0:41:26.380,0:41:33.310 numbers so the median number of login 0:41:30.160,0:41:34.330 attempts per username is 3 the max is 0:41:33.310,0:41:35.980 10,000 that was root 0:41:34.330,0:41:39.250 we saw before and it tells me the average 0:41:35.980,0:41:40.600 was 8 for this might not matter in this 0:41:39.250,0:41:42.040 particular instance like this might not 0:41:40.600,0:41:43.660 be interesting numbers but if you're 0:41:42.040,0:41:45.790 looking at things like output from your 0:41:43.660,0:41:46.780 benchmarking script or something else 0:41:45.790,0:41:48.520 where you have some numerical 0:41:46.780,0:41:52.900 distribution and you want to look at 0:41:48.520,0:41:54.250 them these 
tools are really handy we can 0:41:52.900,0:41:57.640 even do some simple plotting if we 0:41:54.250,0:42:01.330 wanted to right so this has a bunch of 0:41:57.640,0:42:06.220 numbers let's do let's go back to our 0:42:01.330,0:42:11.860 sort -nk1,1 and look at only the 0:42:06.220,0:42:17.770 top 5 gnuplot is a plotter that lets 0:42:11.860,0:42:19.150 you take things from standard in I'm not 0:42:17.770,0:42:22.480 expecting you to know all of these 0:42:19.150,0:42:23.950 programming languages because they 0:42:22.480,0:42:25.810 really are programming languages in 0:42:23.950,0:42:30.580 their own right but it is just to show you 0:42:25.810,0:42:34.360 what is possible right so this is now a 0:42:30.580,0:42:37.360 histogram of how many times each of the 0:42:34.360,0:42:41.020 top 5 usernames have been used for my 0:42:37.360,0:42:43.810 server since January 1st and it's just 0:42:41.020,0:42:45.340 one command line it's a somewhat 0:42:43.810,0:42:48.570 complicated command line but it's just 0:42:45.340,0:42:48.570 one command line thing that you can do 0:42:50.520,0:42:54.790 there are two sort of special types of 0:42:53.590,0:42:56.290 data wrangling that I want to talk to 0:42:54.790,0:42:58.420 you about in the in the last little bit 0:42:56.290,0:43:01.980 of time that we have and the first one 0:42:58.420,0:43:07.750 is command line argument wrangling 0:43:01.980,0:43:09.220 sometimes you might have something that 0:43:07.750,0:43:11.140 actually we looked at in the last 0:43:09.220,0:43:14.170 lecture like you have things like find 0:43:11.140,0:43:17.760 that produces a list of files or maybe 0:43:14.170,0:43:17.760 something that produces a list of 0:43:19.380,0:43:23.080 arguments for your benchmarking script 0:43:21.940,0:43:24.670 like you want to run it with a 0:43:23.080,0:43:26.020 particular distribution of arguments 0:43:24.670,0:43:28.810 like let's say you had a script that 0:43:26.020,0:43:29.980 printed the number of iterations to 
run 0:43:28.810,0:43:31.630 a particular project and you wanted like 0:43:29.980,0:43:33.520 an exponential distribution or something 0:43:31.630,0:43:35.500 and this prints the number of iterations 0:43:33.520,0:43:37.960 on each line and you want to run your 0:43:35.500,0:43:39.190 benchmark for each one well here is a 0:43:37.960,0:43:43.420 tool called xargs 0:43:39.190,0:43:46.210 that's your friend so xargs takes lines 0:43:43.420,0:43:47.620 of input and turns them into arguments 0:43:46.210,0:43:50.170 and this might 0:43:47.620,0:43:52.270 look a little weird see if I can come up 0:43:50.170,0:43:55.480 with a good example for this so I 0:43:52.270,0:43:56.770 program in Rust and Rust lets you 0:43:55.480,0:43:58.540 install multiple versions of the 0:43:56.770,0:44:01.360 compiler so in this case you can see 0:43:58.540,0:44:04.420 that I have stable beta I have a couple 0:44:01.360,0:44:05.860 of earlier stable releases and I have 0:44:04.420,0:44:08.980 a bunch of different dated nightlies 0:44:05.860,0:44:12.010 and this is all very well but over time 0:44:08.980,0:44:14.140 like I don't really need the nightly 0:44:12.010,0:44:14.890 version from like March of last year 0:44:14.140,0:44:16.450 anymore 0:44:14.890,0:44:17.710 I can probably delete that every now and 0:44:16.450,0:44:21.550 again and maybe I want to clean these up 0:44:17.710,0:44:25.330 a little well this is a list of lines so 0:44:21.550,0:44:29.770 I can grep for nightly I can get rid of 0:44:25.330,0:44:32.170 so -v is don't match I don't want to 0:44:29.770,0:44:34.540 match the current nightly okay so 0:44:32.170,0:44:37.810 this is a list of dated nightlies 0:44:34.540,0:44:42.730 maybe I want only the ones from 2019 0:44:37.810,0:44:45.370 and now I want to remove each of these 0:44:42.730,0:44:48.340 toolchains from my machine I could copy 0:44:45.370,0:44:52.630 paste each one into so there's a rustup 0:44:48.340,0:44:56.110 toolchain remove or uninstall maybe 
0:44:52.630,0:44:58.060 toolchain uninstall right so I could 0:44:56.110,0:44:59.470 manually type out the name of each one 0:44:58.060,0:45:01.030 or copy/paste them but that 0:44:59.470,0:45:03.700 gets annoying really quickly because I 0:45:01.030,0:45:10.660 have the list right here so instead how 0:45:03.700,0:45:14.890 about I sed away this sort of this 0:45:10.660,0:45:17.770 suffix that it adds right so now it's 0:45:14.890,0:45:20.800 just that and then I use xargs so 0:45:17.770,0:45:23.770 xargs takes a list of inputs and turns 0:45:20.800,0:45:27.060 them into arguments so I want this to 0:45:23.770,0:45:30.730 become arguments to rustup toolchain 0:45:27.060,0:45:32.710 uninstall and just for my own sanity's 0:45:30.730,0:45:33.910 sake I'm gonna make this echo just so 0:45:32.710,0:45:36.460 it's going to show which command it's 0:45:33.910,0:45:39.460 gonna run well it's relatively unhelpful 0:45:36.460,0:45:41.770 and hard to read but at least you see 0:45:39.460,0:45:43.990 the command it's going to execute if I 0:45:41.770,0:45:45.550 remove this echo is rustup toolchain 0:45:43.990,0:45:47.520 uninstall and then the list of 0:45:45.550,0:45:51.130 nightlies as arguments to that program 0:45:47.520,0:45:52.630 and so if I run this it uninstalls 0:45:51.130,0:45:56.110 every toolchain instead of me having to 0:45:52.630,0:45:57.520 copy paste them so this is one example 0:45:56.110,0:45:59.110 where this kind of data wrangling 0:45:57.520,0:46:00.670 actually can be useful for other tasks 0:45:59.110,0:46:01.480 than just looking at data it's just 0:46:00.670,0:46:04.420 going from one 0:46:01.480,0:46:07.150 format to another you can also wrangle 0:46:04.420,0:46:09.550 binary data so a good example of this is 0:46:07.150,0:46:11.710 stuff like videos and images where you 0:46:09.550,0:46:14.770 might actually want to operate over them 0:46:11.710,0:46:17.109 in some interesting way so for example 0:46:14.770,0:46:19.720 
there's a tool called ffmpeg ffmpeg is 0:46:17.109,0:46:23.079 for encoding and decoding video and to 0:46:19.720,0:46:24.310 some extent images I'm gonna set its log 0:46:23.079,0:46:26.800 level to panic because otherwise it 0:46:24.310,0:46:30.730 prints a bunch of stuff I want it to 0:46:26.800,0:46:34.570 read from /dev/video0 which is my 0:46:30.730,0:46:37.300 webcam video device and I want it 0:46:34.570,0:46:40.420 to take the first frame so I just want it 0:46:37.300,0:46:42.670 to take a picture and I want it to take 0:46:40.420,0:46:45.790 an image rather than a single frame 0:46:42.670,0:46:48.070 video file and I want it to print its 0:46:45.790,0:46:50.410 output so the image it captures to 0:46:48.070,0:46:52.570 standard output - is usually the way you 0:46:50.410,0:46:54.430 tell the program to use standard input 0:46:52.570,0:46:56.200 or output rather than a given file so 0:46:54.430,0:46:58.930 here it expects a file name and the file 0:46:56.200,0:47:00.790 name - means standard output in this 0:46:58.930,0:47:02.550 context and then I want to pipe that 0:47:00.790,0:47:05.500 through a program called convert 0:47:02.550,0:47:08.170 convert is an image manipulation program 0:47:05.500,0:47:12.280 I want to tell convert to read from 0:47:08.170,0:47:16.050 standard input and turn the image into 0:47:12.280,0:47:19.390 the color space gray and then write the 0:47:16.050,0:47:22.119 resulting image into the file - which is 0:47:19.390,0:47:25.119 standard output and then I want to pipe 0:47:22.119,0:47:28.720 that into gzip we're just gonna compress 0:47:25.119,0:47:30.579 this image file and that's also going to 0:47:28.720,0:47:33.450 just operate on standard input standard 0:47:30.579,0:47:37.780 output and then I'm going to pipe that 0:47:33.450,0:47:41.349 to my remote server and on that I'm 0:47:37.780,0:47:44.050 going to decode that image and then I'm 0:47:41.349,0:47:46.839 gonna store a copy of that image so 
0:47:44.050,0:47:49.030 remember tee reads input prints it to 0:47:46.839,0:47:51.250 standard out and to a file this is gonna 0:47:49.030,0:47:55.750 make a copy of the decoded image file 0:47:51.250,0:47:58.210 as copy.png and then it's gonna 0:47:55.750,0:48:00.550 continue to stream that out so now I'm 0:47:58.210,0:48:04.990 gonna bring that back into a local 0:48:00.550,0:48:07.240 stream and here I'm going to display 0:48:04.990,0:48:08.550 that in an image displayer let's see 0:48:07.240,0:48:13.240 if that works 0:48:08.550,0:48:15.050 hey right so this now did a round-trip 0:48:13.240,0:48:18.340 to my server 0:48:15.050,0:48:21.380 and then came back over pipes and 0:48:18.340,0:48:23.060 there's now a computer there's a 0:48:21.380,0:48:25.820 decompressed version of this file at 0:48:23.060,0:48:29.360 least in theory on my server let's see 0:48:25.820,0:48:38.180 if that's there scp tsp copy.png to 0:48:29.360,0:48:40.900 here and CP 8 yeah hey same file ended 0:48:38.180,0:48:43.580 up on the server so our pipeline worked 0:48:40.900,0:48:45.890 again this is a sort of silly example 0:48:43.580,0:48:48.290 but lets you see the power of building 0:48:45.890,0:48:50.150 these pipelines where it doesn't have to 0:48:48.290,0:48:52.310 be textual data it's just taking data 0:48:50.150,0:48:55.100 from any format to any other like for 0:48:52.310,0:48:58.280 example if I wanted to I can do cat 0:48:55.100,0:49:00.710 /dev/video0 and then pipe that to a server 0:48:58.280,0:49:02.660 that like Anish controls and then he 0:49:00.710,0:49:05.420 could watch that video stream by piping 0:49:02.660,0:49:08.900 it into a video player on his machine if 0:49:05.420,0:49:13.100 we wanted to right you just need to know 0:49:08.900,0:49:15.200 that these things exist there are a bunch 0:49:13.100,0:49:17.180 of exercises for this lab and some of 0:49:15.200,0:49:19.310 them rely on you having a data source 0:49:17.180,0:49:21.110 that looks a little bit 
like a log on 0:49:19.310,0:49:22.460 macOS and Linux we give you some 0:49:21.110,0:49:24.590 commands you can try to experiment with 0:49:22.460,0:49:26.630 but keep in mind that it's not it's not 0:49:24.590,0:49:28.970 that important exactly what data source 0:49:26.630,0:49:30.290 you use this is more find some data 0:49:28.970,0:49:32.240 source where you think there might 0:49:30.290,0:49:33.680 be an interesting signal and then try to 0:49:32.240,0:49:35.510 extract something interesting from it 0:49:33.680,0:49:38.660 that is what all of the exercises are 0:49:35.510,0:49:41.240 about we will not have class on Monday 0:49:38.660,0:49:43.370 because it's MLK Day so next lecture 0:49:41.240,0:49:45.440 will be Tuesday on command line 0:49:43.370,0:49:47.420 environments any questions about what 0:49:45.440,0:49:51.410 we've covered so far or the pipelines or 0:49:47.420,0:49:52.790 regular expressions I really recommend 0:49:51.410,0:49:54.800 that you look into regular expressions 0:49:52.790,0:49:57.230 and try to learn them they are extremely 0:49:54.800,0:49:59.300 handy both for this and in programming 0:49:57.230,0:50:00.440 in general and if you have any questions 0:49:59.300,0:50:02.560 come to office hours and we'll help you 0:50:00.440,0:50:02.560 out
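
For reference, the counting pipeline from this lecture (sort, uniq -c, a numeric sort on the first column, tail, awk, paste) and the awk BEGIN/END row counter can be sketched roughly as below. This is a minimal sketch, not the exact commands typed in the lecture: the function names and the sample usernames are made up for illustration, and it assumes the input is already one username per line, as produced by the earlier grep and sed steps.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): reads one username per line
# on stdin and prints the N most frequent ones as a comma-separated line.
top_usernames() {
    n="${1:-10}"
    sort |                   # group identical names so uniq can see them
        uniq -c |            # collapse duplicates, prefixing each with a count
        sort -nk1,1 |        # numeric sort on the first column (the count)
        tail -n "$n" |       # ascending order, so the top N are at the bottom
        awk '{print $2}' |   # keep only the second column (the username)
        paste -sd, -         # join all lines into one, separated by commas
}

# Hypothetical helper: counts usernames that appear exactly once and that
# start with c and end with e, using awk BEGIN/END blocks instead of wc -l.
count_single_use_c_e() {
    sort | uniq -c |
        awk 'BEGIN { rows = 0 }
             $1 == 1 && $2 ~ /^c.*e$/ { rows += 1 }
             END { print rows }'
}

# Made-up sample data standing in for the extracted usernames.
printf '%s\n' root admin root test root admin claire | top_usernames 2
# → admin,root
```

Swapping the final paste for `paste -sd+ - | bc -l` would sum a column of counts instead, in the way described for bc above, assuming bc is installed.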