0:00:17.080,0:00:18.190 TODD SCHNEIDER: All right. We're, we're good.[br]Thank you. 0:00:18.190,0:00:19.910 Sorry for the delay. Classic. 0:00:19.910,0:00:22.270 Even in the future nothing works. Welcome. 0:00:22.270,0:00:26.240 I am Todd. I'm an engineer at Rap Genius. 0:00:26.240,0:00:31.640 And today's talk is going to be about data[br]science with a live tutorial. 0:00:31.640,0:00:34.360 And before we get into the live coding component, 0:00:34.360,0:00:36.070 I wanted to show you all a project I 0:00:36.070,0:00:39.030 built previously, which kind of serves as[br]the inspiration 0:00:39.030,0:00:41.470 for this talk. Sort of. So this is a 0:00:41.470,0:00:45.440 website called weddingcrunchers dot com. What[br]is Wedding Crunchers? 0:00:45.440,0:00:48.110 It's a place where you can track the, the 0:00:48.110,0:00:50.979 popularity of words and phrases in the New[br]York 0:00:50.979,0:00:54.449 Times wedding section over the past thirty-some[br]years. 0:00:54.449,0:00:56.129 And a lot of you might be wondering why 0:00:56.129,0:00:58.640 on earth would this be interesting or relevant[br]or 0:00:58.640,0:01:01.530 funny or anything, and I hope to convince[br]you 0:01:01.530,0:01:04.360 of that very quickly. Here is a, an example 0:01:04.360,0:01:07.220 wedding announcement from the New York Times.[br]This one's 0:01:07.220,0:01:08.030 from 1985. 0:01:08.030,0:01:08.970 If you don't know me, you don't live in 0:01:08.970,0:01:11.260 New York, read the New York Times, the wedding 0:01:11.260,0:01:14.280 section has a certain cultural cachet. It's[br]kind of 0:01:14.280,0:01:15.720 an honor to be listed in there and it's 0:01:15.720,0:01:18.580 got a very resume-like structure. People get[br]to brag 0:01:18.580,0:01:20.110 about where they went to school and what they 0:01:20.110,0:01:20.979 do. 0:01:20.979,0:01:23.050 So here is an example. You know, Diane deCordova 0:01:23.050,0:01:25.270 is marrying Michael Monro Lewis. 
They both[br]went to 0:01:25.270,0:01:28.250 Princeton. They graduated cum laude. You know,[br]she works 0:01:28.250,0:01:30.440 at Morgan Stanley. He works at Salomon Brothers[br]in 0:01:30.440,0:01:32.610 New York and they're gonna go to London. And 0:01:32.610,0:01:34.430 this should be a little familiar to a bunch 0:01:34.430,0:01:35.420 of you. 0:01:35.420,0:01:37.870 Mr. Lewis, an associate at Salomon Brothers,[br]is Michael Lewis. 0:01:37.870,0:01:40.600 He wrote Liar's Poker, a famous[br]book about 0:01:40.600,0:01:42.810 his experience there. And before, before he[br]was a 0:01:42.810,0:01:45.710 famous writer, he was just another New York[br]Times 0:01:45.710,0:01:49.630 wedding announced person. 0:01:49.630,0:01:51.560 And so what Wedding Crunchers does is it takes 0:01:51.560,0:01:54.560 the entire corpus of New York Times wedding[br]announcements 0:01:54.560,0:01:57.409 back to 1981 and you can search for words 0:01:57.409,0:01:59.520 and phrases and you can see how common those 0:01:59.520,0:02:01.800 words and phrases are, you know, by year.[br]It's 0:02:01.800,0:02:03.320 like, this is a good one that's relevant to 0:02:03.320,0:02:06.409 people here. You know, banker and programmer.[br]You know, 0:02:06.409,0:02:08.979 for example, when you list so-and-so is a[br]banker 0:02:08.979,0:02:11.780 or is a programmer in the announcement and[br]you 0:02:11.780,0:02:13.700 see, over time, you know, banker used to be 0:02:13.700,0:02:18.450 way more commonly used than programmer in[br]these announcements. 0:02:18.450,0:02:21.140 And only just this year, in 2014, programmer[br]has 0:02:21.140,0:02:28.140 finally overtaken banker as, you know, the,[br]the place, 0:02:28.190,0:02:29.890 you know, the people getting married in New[br]York, 0:02:29.890,0:02:32.770 who are part of society, come from. Another[br]good 0:02:32.770,0:02:35.170 one is, if you look at goldman sachs and 0:02:35.170,0:02:37.600 google- is my internet on? Good. 
0:02:37.600,0:02:41.150 So here's another good one. So Goldman Sachs,[br]you 0:02:41.150,0:02:44.120 know, classic New York financial institution.[br]Google, new kid 0:02:44.120,0:02:47.160 on the block. Tech scene. Boom. Taking over. 0:02:47.160,0:02:49.800 And, you know, this is obviously fun, and[br]it's 0:02:49.800,0:02:52.440 amusing. But it's also actually pretty insightful[br]for a 0:02:52.440,0:02:55.760 relatively simple concept. I mean, this one[br]graph tells 0:02:55.760,0:02:58.740 a pretty powerful story of, you know, New[br]York 0:02:58.740,0:03:01.750 the, the finance capital of the world. Meanwhile,[br]we 0:03:01.750,0:03:03.550 have this sort of emerging tech scene. You[br]know, 0:03:03.550,0:03:05.150 Google may be the biggest player in the kind 0:03:05.150,0:03:06.959 of new tech world. 0:03:06.959,0:03:09.510 And now, when you turn to the society pages 0:03:09.510,0:03:11.209 to see who's getting married, you know, there's[br]more 0:03:11.209,0:03:13.970 employees from Google than there are from[br]Goldman Sachs. 0:03:13.970,0:03:16.750 And that's, you know, kind of an interesting thing[br]in 0:03:16.750,0:03:17.739 the world. 0:03:17.739,0:03:20.500 And so what we're gonna do today is build 0:03:20.500,0:03:25.120 something just like Wedding Crunchers, except,[br]instead of using 0:03:25.120,0:03:28.280 the text of wedding announcements to analyze,[br]we're going 0:03:28.280,0:03:32.670 to look at all of the RailsConf talk abstracts. 0:03:32.670,0:03:34.080 And so, you know, hopefully this is, this[br]is 0:03:34.080,0:03:36.550 interesting to people here and, I always say,[br]you 0:03:36.550,0:03:38.709 know, if there's only one thing you take from 0:03:38.709,0:03:41.319 this talk, really, what it should be is that, 0:03:41.319,0:03:43.709 you know, work on a problem that's interesting[br]to 0:03:43.709,0:03:46.260 you. 
Because, especially when you're dealing[br]with data science, 0:03:46.260,0:03:47.590 a lot of it's pretty messy and then you 0:03:47.590,0:03:49.290 have to go through scraping stuff as we'll[br]get 0:03:49.290,0:03:51.879 into, and it's easy to get frustrated and[br]kind 0:03:51.879,0:03:53.810 of lost and like, if you're not working on 0:03:53.810,0:03:55.450 something that you care about, and something[br]that you 0:03:55.450,0:03:58.060 really want to know, kind of, the final result, 0:03:58.060,0:04:00.110 it's just much easier to get distracted and[br]kind 0:04:00.110,0:04:01.069 of, ultimately, bail. 0:04:01.069,0:04:03.819 So, again, if you take one thing, just work 0:04:03.819,0:04:07.550 on something that is interesting to you. So[br]the 0:04:07.550,0:04:09.819 particular kind of analysis we're gonna do[br]is something 0:04:09.819,0:04:12.680 called n-gram analysis. And I have a little[br]example 0:04:12.680,0:04:14.190 set up here. So what is an n-gram? You 0:04:14.190,0:04:15.800 may have heard the word before. 0:04:15.800,0:04:19.099 Really, all it means is, you know, a, a 0:04:19.099,0:04:23.830 consecutive words as part of a sentence. So[br]like, 0:04:23.830,0:04:26.030 examples very simple, for one simple. This[br]talk is 0:04:26.030,0:04:28.000 boring. What are the, what are the one grams 0:04:28.000,0:04:30.330 in this sentence? It's just the words. This,[br]talk, 0:04:30.330,0:04:32.780 is, and boring. The two grams are every pair 0:04:32.780,0:04:35.839 of consecutive words. This talk, talk is,[br]is boring, 0:04:35.839,0:04:37.219 and so on. 
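The one-gram and two-gram breakdown of the example sentence can be sketched in a few lines of plain Ruby. This is only an illustration of the idea, not code from the talk's repository; the helper name `ngrams_of` is hypothetical.

```ruby
# Hypothetical helper (not from the talk's repo): split a sentence on
# whitespace and return every run of n consecutive words as one phrase.
def ngrams_of(sentence, n)
  sentence.split.each_cons(n).map { |words| words.join(" ") }
end

p ngrams_of("This talk is boring", 1)
# => ["This", "talk", "is", "boring"]
p ngrams_of("This talk is boring", 2)
# => ["This talk", "talk is", "is boring"]
```

A sentence of four words yields four one-grams and three two-grams; in general a sentence of w words yields w - n + 1 n-grams when w is at least n.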
0:04:37.219,0:04:38.150 And so what we need to be able to 0:04:38.150,0:04:40.889 do in order to build, you know, a graph 0:04:40.889,0:04:43.300 like this, is we need to take a term 0:04:43.300,0:04:45.159 that's, you know, relevant to RailsConf, say[br]something like 0:04:45.159,0:04:46.960 Ember or whatever, and we need to be able 0:04:46.960,0:04:48.759 to look up, you know, for each year how 0:04:48.759,0:04:51.300 many times does this, you know, word or n-gram 0:04:51.300,0:04:53.610 appear in the data. 0:04:53.610,0:04:55.550 And so that is what we are going to 0:04:55.550,0:04:58.789 build. And I have this brief little outline[br]here. 0:04:58.789,0:05:01.020 There's kind of three steps. And this is pretty 0:05:01.020,0:05:04.629 general to, to any data project. You know,[br]step 0:05:04.629,0:05:06.719 one is gonna be just gathering the data, getting 0:05:06.719,0:05:09.659 it in some usable form. Step two is gonna 0:05:09.659,0:05:11.259 be kind of the analysis part where we do 0:05:11.259,0:05:14.050 the n-gram calculation. We store the results.[br]And then 0:05:14.050,0:05:15.789 step three is gonna be to create a nice 0:05:15.789,0:05:19.259 little front-end interface that lets us investigate,[br]visualize and 0:05:19.259,0:05:20.809 see what we've done. 0:05:20.809,0:05:23.300 Now unfortunately, you know, in a, in a thirty 0:05:23.300,0:05:26.020 minute talk we can't possibly do all of this. 0:05:26.020,0:05:28.689 So we're gonna focus more on items one and 0:05:28.689,0:05:31.490 two and less so on three, and even then 0:05:31.490,0:05:33.099 it's too much. So, you know, I sort of 0:05:33.099,0:05:34.689 used the analogy, it'll be a bit like watching 0:05:34.689,0:05:37.419 TV on the Food Network, where we might, you 0:05:37.419,0:05:40.039 know, throw something in the oven, mysteriously[br]something else 0:05:40.039,0:05:42.009 pops out of the other oven even though it's, 0:05:42.009,0:05:43.759 where did that come from? 
0:05:43.759,0:05:46.089 But not to worry. Everything is also on GitHub. 0:05:46.089,0:05:47.869 There's a repo I'll share with you at the 0:05:47.869,0:05:50.339 end. So anything that we don't cover or that 0:05:50.339,0:05:51.979 we cover too quickly or something, you'll[br]be able 0:05:51.979,0:05:53.779 to see sort of the, the full version on 0:05:53.779,0:05:55.740 GitHub. 0:05:55.740,0:05:57.770 So let us jump in now to step one, 0:05:57.770,0:06:00.189 which is, you know, gathering the data. And[br]so 0:06:00.189,0:06:01.909 let's take a look back at the, the RailsConf 0:06:01.909,0:06:03.080 website again. So we have to figure out how 0:06:03.080,0:06:06.460 we're gonna model a, a RailsConf talk in our 0:06:06.460,0:06:09.889 database. So like, what, you know, attributes[br]does a, 0:06:09.889,0:06:13.339 do a, excuse me, does a RailsConf talk have. 0:06:13.339,0:06:14.289 And it's like, one thing we see is they 0:06:14.289,0:06:17.669 all have titles. So that looks like something.[br]They 0:06:17.669,0:06:20.089 have speakers. You know, there's this thing,[br]which is 0:06:20.089,0:06:23.330 the abstract, and then there's the bio. And[br]that's 0:06:23.330,0:06:25.469 probably it. That's probably all we need. 0:06:25.469,0:06:27.669 So that's pretty simple. And, you know, I[br]have 0:06:27.669,0:06:29.999 the little migration. I've already run here.[br]But here 0:06:29.999,0:06:31.789 are attributes for talks. It's just the year,[br]you 0:06:31.789,0:06:33.909 know, what, what conference were we actually[br]at. The 0:06:33.909,0:06:36.110 title of the talk, the speaker, the abstract,[br]and 0:06:36.110,0:06:37.569 the bio. 0:06:37.569,0:06:41.490 And so also, that's, again, pretty straightforward.[br]The gemfile 0:06:41.490,0:06:45.089 is also very simple. It's mostly pretty boiler[br]plate. 0:06:45.089,0:06:47.830 Rails 4, Ruby 2.1. 
The only gems I wanted 0:06:47.830,0:06:49.409 to call out here are, we're gonna use nokogiri 0:06:49.409,0:06:52.309 for, you know, fetching, or, parsing websites[br]and kind 0:06:52.309,0:06:54.229 of scraping the data we need. We're gonna[br]use 0:06:54.229,0:06:56.389 Postgres as our main data store and we're gonna 0:06:56.389,0:06:58.219 use redis to build this sort of index that 0:06:58.219,0:07:00.180 we can ultimately use to look up, you know, 0:07:00.180,0:07:02.389 how common a word is. 0:07:02.389,0:07:05.389 And so one thing that's not here is, like, 0:07:05.389,0:07:09.009 you know, gem fancy data algorithm. And a[br]lot 0:07:09.009,0:07:10.689 of people, this is kind of where Ruby often 0:07:10.689,0:07:13.369 gets a bad reputation of, you know, not being 0:07:13.369,0:07:16.039 supportive of scientific computing or whatever.[br]And other languages 0:07:16.039,0:07:18.589 have more, more support. But my claim is that 0:07:18.589,0:07:20.520 it's really not that important. You can get[br]a 0:07:20.520,0:07:23.509 ton of mileage out of very simple tools that 0:07:23.509,0:07:24.210 you can build yourself. 0:07:24.210,0:07:25.809 You know, you don't need a fancy gem or 0:07:25.809,0:07:28.360 any fancy algorithm. Those things are cool[br]too and 0:07:28.360,0:07:30.740 they have their place. But they're not needed[br]a 0:07:30.740,0:07:33.349 lot of the time. And, you know, Ruby is 0:07:33.349,0:07:36.210 a wonderful language for, especially, scraping[br]stuff from the 0:07:36.210,0:07:38.449 web. There's a ton of support there. And so 0:07:38.449,0:07:40.979 I don't think that the, the lack of, you 0:07:40.979,0:07:43.509 know, fancy algorithm gems should necessarily[br]be a deterrent 0:07:43.509,0:07:44.439 at all. 0:07:44.439,0:07:46.960 And so hopefully part of this talk is convincing 0:07:46.960,0:07:49.649 people that Ruby and Rails are actually quite[br]well-suited 0:07:49.649,0:07:50.939 to problems like this. 0:07:50.939,0:07:53.559 OK. 
So now we actually need to write some 0:07:53.559,0:07:56.249 code to scrape the talk. And you know, if 0:07:56.249,0:07:57.419 you've ever done anything like this before,[br]you know 0:07:57.419,0:07:59.520 that Chrome Inspector is your best friend.[br]So let's 0:07:59.520,0:08:02.499 fire that up. We're gonna inspect element,[br]and so 0:08:02.499,0:08:04.069 like, we actually, what we need to do now 0:08:04.069,0:08:06.889 is take, you know, this HTML on the page 0:08:06.889,0:08:09.119 and turn it into a database record that we 0:08:09.119,0:08:11.889 can then, you know, use to our advantage later. 0:08:11.889,0:08:13.050 And so it looks like, you know, all the 0:08:13.050,0:08:16.629 talks are in these session classes. So that's[br]something. 0:08:16.629,0:08:19.849 We can look in here. This looks like something. 0:08:19.849,0:08:23.469 So let's make this bigger. 0:08:23.469,0:08:25.039 And you know it helps to, well, it's kind 0:08:25.039,0:08:29.059 of essential to be decent with CSS selectors[br]here, 0:08:29.059,0:08:32.149 because that's how we're going to basically[br]find stuff. 0:08:32.149,0:08:34.719 So let's see, OK, so there's eighty-one session[br]divs. 0:08:34.719,0:08:37.990 That sounds about right. I happen to know[br]that 0:08:37.990,0:08:42.229 mine is number seventy-eight, so let's, let's[br]look at 0:08:42.229,0:08:44.360 that. And so here we are. So we need 0:08:44.360,0:08:46.970 to, again, the, the things we're mod- or,[br]the 0:08:46.970,0:08:50.250 attributes we're storing are the title, the[br]speaker, the 0:08:50.250,0:08:52.680 abstract, and the bio. And so we're gonna[br]need 0:08:52.680,0:08:54.850 to pull these things out. 0:08:54.850,0:08:57.630 So let's see. It looks like the, the title 0:08:57.630,0:09:00.490 is in this h1 element inside the header. So 0:09:00.490,0:09:04.830 let's just make sure that works. You know,[br]header 0:09:04.830,0:09:08.450 h1. That looks right. 
0:09:08.450,0:09:13.650 The, the speaker looks to be the header h2. 0:09:13.650,0:09:16.060 Cool. 0:09:16.060,0:09:20.640 Now the abstract is in this p tag, so 0:09:20.640,0:09:23.130 we can do something like this. But this is 0:09:23.130,0:09:26.490 actually not quite right. So what's wrong[br]with this? 0:09:26.490,0:09:30.140 Well, the abstract ends, you know, suited[br]to the 0:09:30.140,0:09:32.310 problem. The bio here is also in the p 0:09:32.310,0:09:35.310 tag. Originally a math guy. And we've actually[br]pulled 0:09:35.310,0:09:37.010 all the p-tags. So we need a way of 0:09:37.010,0:09:38.940 not doing that. And this is where you just 0:09:38.940,0:09:40.200 need to know a little bit of CSS. Not 0:09:40.200,0:09:42.550 very complicated. But if you use the little[br]greater 0:09:42.550,0:09:44.800 than guy, what this says is only take the 0:09:44.800,0:09:47.210 p tags that are immediate descendants of the[br]session 0:09:47.210,0:09:50.390 div. And so now we have, you know, only 0:09:50.390,0:09:51.060 the abstract. 0:09:51.060,0:09:54.340 And lastly, you know, the bio is just in 0:09:54.340,0:09:58.460 its own little section. So something like[br]that. Cool. 0:09:58.460,0:10:00.190 So that is the jQuery version of it. We 0:10:00.190,0:10:03.180 need to do this, though, in Ruby. And as 0:10:03.180,0:10:05.250 I said, this does sometimes get a little tedious. 0:10:05.250,0:10:07.340 But let's, let's write the code. So I have 0:10:07.340,0:10:12.160 this empty method - create_railsconf_2014_talks.[br]And also this method 0:10:12.160,0:10:14.760 I've written already called fetch_and_parse,[br]which just gets a 0:10:14.760,0:10:16.610 URL and sends it to nokogiri, which we can 0:10:16.610,0:10:17.690 then use to do our CSS selectors. 0:10:17.690,0:10:20.510 So let, let's just write this. So we can 0:10:20.510,0:10:27.400 say doc is fetch_and_parse. The url is this.[br]Let's 0:10:27.400,0:10:33.940 see if this works in the console. 
0:10:33.940,0:10:40.940 Of course, in here. Do I have internet? Nice. 0:10:47.360,0:10:52.700 So we can then check the same thing. Again. 0:10:52.700,0:10:57.830 Looks right. Let's find my talk, which, this[br]part 0:10:57.830,0:10:59.310 I couldn't possibly tell you. When you use[br]the 0:10:59.310,0:11:01.610 nokogiri, the eq thing, you have to add two 0:11:01.610,0:11:04.330 from whatever jQuery does. So I'm number 80[br]now. 0:11:04.330,0:11:06.570 Don't ask me why. I couldn't possibly tell[br]you. 0:11:06.570,0:11:10.210 But maybe someone here knows. Be curious to[br]find 0:11:10.210,0:11:10.780 out. 0:11:10.780,0:11:11.920 AUDIENCE: ?? (00:11:13) 0:11:11.920,0:11:15.400 T.S.: So there it is. There's the title. So 0:11:15.400,0:11:17.380 let us now write some code here. We have 0:11:17.380,0:11:21.520 our, our document. We're gonna go through[br]each session. 0:11:21.520,0:11:24.320 The CSS method is kind of like, you know, 0:11:24.320,0:11:28.900 the selector for nokogiri. Each elements.[br]So each of 0:11:28.900,0:11:35.370 these we're gonna create a talk. 0:11:35.370,0:11:38.390 And again. So the year we already know is 0:11:38.390,0:11:45.390 2014. The title we're gonna say is, elm.css("header[br]h1").inner_text. 0:11:48.300,0:11:55.300 Speaker, header h2, dun nuh nuh dun nuh nuh 0:12:00.460,0:12:04.520 nuh. Gettin' there. 0:12:04.520,0:12:09.950 All right. So I think this will probably work. 0:12:09.950,0:12:13.980 Let's find out. And so we're back in here. 0:12:13.980,0:12:19.470 Just to prove to you that I'm not lying, 0:12:19.470,0:12:23.450 2014 dot count. There's none of them. And,[br]what'd 0:12:23.450,0:12:26.440 I call this method? This guy. Delayed::Job. 0:12:26.440,0:12:33.440 All right. So we just did something. Did it 0:12:33.440,0:12:40.440 work? Nice. We got eighty-one talks. Most[br]importantly, let's, 0:12:41.150,0:12:42.390 we have my talk. That's the, that's the only 0:12:42.390,0:12:46.760 one that matters anyway. 
And so, you know,[br]you 0:12:46.760,0:12:48.260 might be thinking now, like, you know, what[br]the 0:12:48.260,0:12:50.120 heck, I came to the, the data science talk, 0:12:50.120,0:12:52.330 not the scraping talk. You know, to that,[br]I 0:12:52.330,0:12:56.020 would say, tough luck. They're the same thing.[br]You 0:12:56.020,0:12:57.880 know, you might not, you might not want to 0:12:57.880,0:13:00.040 hear it, but guess what, this is usually the 0:13:00.040,0:13:02.020 most important part of the entire project. 0:13:02.020,0:13:04.960 It's the hardest part, you know, because guess[br]what, 0:13:04.960,0:13:07.080 just because we got the 2014 talks, you know, 0:13:07.080,0:13:08.630 now we have to get the 2013 talks. And 0:13:08.630,0:13:10.880 the 2012 talks. And they're all on different[br]websites. 0:13:10.880,0:13:12.890 They all have different structures. You know,[br]you're gonna 0:13:12.890,0:13:15.090 have to write different code to get each type 0:13:15.090,0:13:17.120 of website. It's a pain. And this is why 0:13:17.120,0:13:19.240 I said earlier, you know, really make sure[br]you're 0:13:19.240,0:13:21.160 working on something you care about. Because[br]it's just 0:13:21.160,0:13:24.300 not fun to like, like, ugh, in 2008 they 0:13:24.300,0:13:26.850 separated the speakers and the abstracts.[br]And it's like, 0:13:26.850,0:13:29.260 it's just, it's annoying, but again, it's[br]the most 0:13:29.260,0:13:30.290 important part I would say. 0:13:30.290,0:13:32.920 You know, so much of data science is taking 0:13:32.920,0:13:35.960 data that's either unstructured or structured[br]in the wrong 0:13:35.960,0:13:39.020 format to you and, you know, getting it into 0:13:39.020,0:13:40.510 the way, you know, into the structure that[br]you 0:13:40.510,0:13:43.410 need to do whatever analysis you want to do. 
0:13:43.410,0:13:45.130 So in this case, that's taking, you know,[br]html 0:13:45.130,0:13:47.930 on a page and converting it into a Postgres 0:13:47.930,0:13:49.430 database. 0:13:49.430,0:13:52.800 And so we have done that now. And again, 0:13:52.800,0:13:53.920 take my word that, you know, I've done this 0:13:53.920,0:13:56.980 for the other years as well. Back to 2007 0:13:56.980,0:14:00.930 and so we have a total of 497 talks 0:14:00.930,0:14:04.220 in here from RailsConfs over the years. And[br]so 0:14:04.220,0:14:06.870 that's cool. That's basically our dataset[br]that we're gonna 0:14:06.870,0:14:07.200 use. 0:14:07.200,0:14:08.730 And so we can sort of move on to, 0:14:08.730,0:14:11.000 you know, step two of the project here, which 0:14:11.000,0:14:14.000 is, you know, do the n-gram calculation and[br]store 0:14:14.000,0:14:16.800 the results. And so let's go back to talk.rb. 0:14:16.800,0:14:18.700 All this by the way is just in, you 0:14:18.700,0:14:21.920 know, app/models/talk.rb. That's where all[br]this code is. 0:14:21.920,0:14:25.560 And I have another empty method somewhere[br]called def 0:14:25.560,0:14:27.590 ngrams. And so this method, we're gonna need[br]to 0:14:27.590,0:14:29.810 give, you know, it goes on a talk. So 0:14:29.810,0:14:32.400 given a value of n, calculate all the ngrams 0:14:32.400,0:14:34.540 from that talk's abstract. 0:14:34.540,0:14:36.160 And so, what are we gonna do here? So 0:14:36.160,0:14:43.160 again, let's look at, talk dot mine. Dot abstract. 0:14:43.550,0:14:45.160 So here's the abstract, and we need to, you 0:14:45.160,0:14:48.920 know, get ngrams out of this. And so the 0:14:48.920,0:14:51.060 first thing, I've written a little helper[br]method over 0:14:51.060,0:14:54.050 here. Which I've just tacked on a string called 0:14:54.050,0:14:57.410 normalized_for_ngrams. And you know, what[br]does this do? Well, 0:14:57.410,0:14:59.640 it downcases it, cause we're gonna do case[br]insensitive. 
0:14:59.640,0:15:01.560 There might be cases where you want to keep 0:15:01.560,0:15:03.820 case sensitivity. Whatever. Doesn't really[br]matter. In this case 0:15:03.820,0:15:06.060 we're gonna go case insensitive. 0:15:06.060,0:15:08.880 Squish is a nice, convenient method that will[br]kind 0:15:08.880,0:15:11.460 of standardize the white space for you. So[br]like, 0:15:11.460,0:15:13.990 if there's any trailing or leading white space,[br]and 0:15:13.990,0:15:16.600 if there's like a bunch of middle white space, 0:15:16.600,0:15:18.730 this will, it'll kill the beginning and ending[br]and 0:15:18.730,0:15:20.630 it'll turn anything in the middle into a single 0:15:20.630,0:15:21.220 space. 0:15:21.220,0:15:22.230 So that way you just don't have to worry 0:15:22.230,0:15:25.130 about things like double spaces or, you know,[br]other, 0:15:25.130,0:15:26.820 other weird things that can happen. Cause[br]of course 0:15:26.820,0:15:28.600 it's the web. Whatever can go wrong will go 0:15:28.600,0:15:31.510 wrong. So make sure that your data's in[br]some 0:15:31.510,0:15:33.360 kind of standardized format. 0:15:33.360,0:15:36.500 And the last thing I've done is removed punctuation. 0:15:36.500,0:15:38.360 And the reason for that is just cause like, 0:15:38.360,0:15:40.279 you know, there's commas, periods, colons,[br]all sorts of 0:15:40.279,0:15:42.930 stuff like that. We don't really care about[br]it. 0:15:42.930,0:15:44.710 And so let's just kill any character that's[br]not 0:15:44.710,0:15:46.540 either a space or a word character. This is 0:15:46.540,0:15:49.450 kind of the, little like, Ruby special regex[br]thing. 0:15:49.450,0:15:53.040 So we're gonna kill punctuation. 0:15:53.040,0:15:54.190 And so we can actually just mess with this 0:15:54.190,0:15:56.610 in the console maybe. So let's take our little 0:15:56.610,0:16:00.460 example sentence. You know, this talk is boring.[br]And 0:16:00.460,0:16:04.240 let's normalize that for ngrams. OK. 
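The three normalization steps described here (downcase, squish, strip punctuation) might look roughly like the following. This is an assumed reconstruction rather than the helper from the talk's repository: it is written as a plain function instead of a method on String, and since `squish` comes from ActiveSupport it is approximated with `strip` and `gsub` so the sketch runs without Rails.

```ruby
# Assumed reconstruction of the normalization helper described in the talk.
def normalize_for_ngrams(text)
  lowered  = text.downcase                     # case-insensitive matching
  squished = lowered.strip.gsub(/\s+/, " ")    # stand-in for ActiveSupport's squish
  squished.gsub(/[^\w\s]/, "")                 # keep only word characters and whitespace
end

p normalize_for_ngrams("  This talk,   is BORING!  ")
# => "this talk is boring"
```

One thing to keep in mind with this regex: in Ruby, `\w` matches only ASCII letters, digits and underscore, so accented characters would be dropped along with the punctuation.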
All it[br]did 0:16:04.240,0:16:07.710 was downcase it. And now we want to get 0:16:07.710,0:16:09.410 that into an array of words, which we can 0:16:09.410,0:16:13.060 just do with split. Cool. 0:16:13.060,0:16:16.830 And now there's actually this neat little[br]Ruby enumerable 0:16:16.830,0:16:18.290 thing, which I didn't know about until pretty[br]recently. 0:16:18.290,0:16:21.800 each_cons, which stands for each consecutive.[br]And it 0:16:21.800,0:16:25.380 takes an argument, a single number, like two,[br]and 0:16:25.380,0:16:27.279 what this says is give me all of the, 0:16:27.279,0:16:29.779 you know, consecutive pairs of two. So if[br]we 0:16:29.779,0:16:32.440 to_a this, now we have this array of arrays, 0:16:32.440,0:16:34.180 which looks like exactly what we want. 0:16:34.180,0:16:36.870 This talk, talk is, and is boring. And so 0:16:36.870,0:16:38.310 the last thing we can do there is we 0:16:38.310,0:16:43.690 can just map that array to make these just 0:16:43.690,0:16:44.190 phrases. 0:16:44.190,0:16:46.860 So cool. So this is actually the entirety[br]of 0:16:46.860,0:16:49.820 our ngrams method, is just, you know, this[br]code 0:16:49.820,0:16:51.630 right here. So let's copy and paste this into 0:16:51.630,0:16:56.500 the old method here. So we want. We're doing 0:16:56.500,0:17:03.040 this on the abstract. Let's get some new lines 0:17:03.040,0:17:04.079 here. 0:17:04.079,0:17:09.839 All right, cool. So again, just to recap,[br]you 0:17:09.839,0:17:12.039 take the abstract, we normalize it, which[br]means, you 0:17:12.039,0:17:14.880 know, downcase and kill the punctuation. We[br]split it 0:17:14.880,0:17:17.289 to words. Uh, wait. Actually this should not[br]be 0:17:17.289,0:17:21.118 two. That should be n. And then we join 0:17:21.118,0:17:24.220 those. So let's, let's see if this worked. 0:17:24.220,0:17:31.220 So talk dot mine again. And one. OK. 
So 0:17:31.360,0:17:32.769 here are all the one grams, which is just 0:17:32.769,0:17:36.240 the sequence of words. And that looks correct.[br]And 0:17:36.240,0:17:41.869 all of the two grams. Also looks correct,[br]I 0:17:41.869,0:17:45.369 think. Yeah. To get, get a, yeah, OK, perfect. 0:17:45.369,0:17:47.619 And so this is kind of the, the method 0:17:47.619,0:17:50.690 we're gonna use to decompose these talks into[br]just, 0:17:50.690,0:17:53.799 you know, an array of words and phrases. And 0:17:53.799,0:17:55.929 so what is the next step, now that we 0:17:55.929,0:17:57.549 have this method? Well, the next step is we 0:17:57.549,0:17:59.470 have to build these indexes that we're actually[br]gonna 0:17:59.470,0:18:03.659 use to look up, you know, the final results. 0:18:03.659,0:18:05.139 And so for that, we're gonna use redis. 0:18:05.139,0:18:07.179 Now, we don't have sort of enough time to 0:18:07.179,0:18:10.990 really get totally into the details of redis.[br]But, 0:18:10.990,0:18:12.039 you know, the, the thing that we're really[br]gonna 0:18:12.039,0:18:14.759 use is the, the sorted set data structure,[br]which 0:18:14.759,0:18:16.440 I'd definitely encourage you to check out.[br]It's a 0:18:16.440,0:18:19.159 great data structure. Great feature of redis.[br]And so 0:18:19.159,0:18:20.210 what is a sorted set? 0:18:20.210,0:18:22.730 Well, it's got the word set in it, so 0:18:22.730,0:18:24.720 that tells you something. It's, you know,[br]unique elements. 0:18:24.720,0:18:27.059 And the, the neat feature of a sorted set 0:18:27.059,0:18:28.990 is that each element in the set also has 0:18:28.990,0:18:32.360 a score associated with it. So the way we 0:18:32.360,0:18:34.669 can use this is, remember, again, the question[br]I'm 0:18:34.669,0:18:36.610 gonna answer is, like, you know, if someone[br]searches 0:18:36.610,0:18:38.559 for Ember, you know, how many times was Ember 0:18:38.559,0:18:40.429 mentioned in 2007. 
How many times was it mentioned 0:18:40.429,0:18:42.169 in 2008. How many times was it mentioned in 0:18:42.169,0:18:42.659 2009? 0:18:42.659,0:18:44.610 So we're gonna have one sorted set for each 0:18:44.610,0:18:47.700 year, where the members of the sorted set[br]are 0:18:47.700,0:18:50.259 all the words and phrases that appeared in[br]RailsConf 0:18:50.259,0:18:54.100 talks, and the scores are the number of times 0:18:54.100,0:18:56.419 that those ngrams appeared. 0:18:56.419,0:18:58.399 And then, you know, redis is very efficient[br]about 0:18:58.399,0:19:00.249 this zscore method. You can look up. It's[br]like 0:19:00.249,0:19:02.590 this command right here would say, OK, in[br]the 0:19:02.590,0:19:05.990 sorted set for 2014, get me the score associated 0:19:05.990,0:19:09.249 with the member ember. And that's gonna tell[br]you, 0:19:09.249,0:19:11.559 you know, some number. Like, three or whatever.[br]Is 0:19:11.559,0:19:14.340 the number of times it gets mentioned. 0:19:14.340,0:19:15.840 So what we have to do is build these 0:19:15.840,0:19:18.799 sorted sets. One for each year again. And[br]again 0:19:18.799,0:19:23.590 I have an empty method called generate_ngram_data_by_year.[br]So iterate 0:19:23.590,0:19:26.110 through all talks from a given year, you know, 0:19:26.110,0:19:27.389 calculate the ngram counts and add it to the 0:19:27.389,0:19:29.940 appropriate redis sorted set. So let's write[br]that. 0:19:29.940,0:19:32.450 So one thing we need to do is make 0:19:32.450,0:19:34.460 sure we're not double counting. So if we have 0:19:34.460,0:19:37.240 an old sorted set sitting around, let's delete[br]it. 0:19:37.240,0:19:40.210 So let's, redis.delete year. We need to decide[br]what 0:19:40.210,0:19:43.460 values of n we're gonna use. 
So let's just 0:19:43.460,0:19:46.210 say one, two, and three, meaning we're gonna[br]calculate 0:19:46.210,0:19:48.190 all the one grams, two grams, three grams.[br]Anything 0:19:48.190,0:19:49.700 longer than that and it's sort of, like, what's 0:19:49.700,0:19:51.740 even the point. You're getting into pretty[br]specific sentences. 0:19:51.740,0:19:53.110 There's not gonna be a lot of repetition. 0:19:53.110,0:19:55.789 So now we need to iterate through each talk 0:19:55.789,0:20:02.789 for the given years. Where(:year => year).find_each.[br]And then 0:20:05.789,0:20:07.860 for each talk we need to iterate through each 0:20:07.860,0:20:14.330 value of n. And then for each value of 0:20:14.330,0:20:15.610 n, what do we need to do? We need 0:20:15.610,0:20:17.480 to calculate the ngram, so do talk dot ngrams. 0:20:17.480,0:20:19.059 This is the method we just wrote. We're gonna 0:20:19.059,0:20:19.989 pass it n. 0:20:19.989,0:20:22.649 Do |ngram|. 0:20:22.649,0:20:26.489 And then finally, we're going to add this[br]to 0:20:26.489,0:20:29.330 the relevant redis sorted set. So the command[br]for 0:20:29.330,0:20:30.049 that is redis.zincrby. 0:20:30.049,0:20:34.669 And this goes, you give it a year, you 0:20:34.669,0:20:38.769 give it a number, like one, and you give 0:20:38.769,0:20:40.320 it what are you incrementing. 0:20:40.320,0:20:42.779 OK. So let's look at this method now. We're 0:20:42.779,0:20:45.019 gonna take, give it a year. We're gonna go 0:20:45.019,0:20:48.419 through every talk from that year. We're gonna[br]go 0:20:48.419,0:20:50.629 through values of n, which is one, two and 0:20:50.629,0:20:53.200 three, so let's say one, OK. Get the talk. 0:20:53.200,0:20:55.289 Calculate all of its one grams. And then for 0:20:55.289,0:20:59.149 each one gram, add to the year sorted set 0:20:59.149,0:21:02.869 the value of one for that ngram. And then 0:21:02.869,0:21:05.139 do that just a bunch of times. 0:21:05.139,0:21:07.549 So let's see if this works. 
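The counting loop walked through above can be sketched without Redis. In this sketch a plain Ruby Hash stands in for one year's sorted set, and `counts[ngram] += 1` plays the role of the ZINCRBY call; the `Talk` struct, the sample abstracts and the method name `generate_ngram_counts` are all hypothetical stand-ins, not the talk's actual model or data.

```ruby
# Hypothetical stand-in for the Rails model: just a year and an abstract.
Talk = Struct.new(:year, :abstract) do
  # Normalize, split into words, and return every n-word phrase.
  def ngrams(n)
    words = abstract.downcase.gsub(/[^\w\s]/, "").split
    words.each_cons(n).map { |group| group.join(" ") }
  end
end

# For one year: visit each talk, each n in 1..3, and each n-gram,
# adding 1 to that n-gram's score (the in-memory analogue of ZINCRBY).
def generate_ngram_counts(talks, year, ns: [1, 2, 3])
  counts = Hash.new(0)
  talks.select { |talk| talk.year == year }.each do |talk|
    ns.each do |n|
      talk.ngrams(n).each { |ngram| counts[ngram] += 1 }
    end
  end
  counts
end

# Toy data, invented for illustration only.
talks = [
  Talk.new(2014, "This talk is boring."),
  Talk.new(2014, "This talk is about Ruby."),
  Talk.new(2013, "Ember is new.")
]

counts_2014 = generate_ngram_counts(talks, 2014)
p counts_2014["this talk"]   # => 2
p counts_2014["ember"]       # => 0
```

Starting from an empty Hash on every call corresponds to deleting the old sorted set first, so rerunning the method does not double count.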
0:21:07.549,0:21:14.480 Let's reload. Again to prove I'm not lying.[br]There's 0:21:14.480,0:21:21.360 nothing in redis at the moment. Oops. Gotta[br]do 0:21:21.360,0:21:22.419 talk. 0:21:22.419,0:21:29.419 Let's worry about those Delayed::Jobs. Perfect.[br]Drink break. 0:21:30.419,0:21:33.019 So it's going through each year now. And each 0:21:33.019,0:21:34.559 talk in each year, counting up all the words 0:21:34.559,0:21:39.489 and phrases and building our sorted sets.[br]And it 0:21:39.489,0:21:40.440 is done. 0:21:40.440,0:21:43.049 So let's see what we got in here now. 0:21:43.049,0:21:46.779 OK, cool. So we got these keys. Let's, let's 0:21:46.779,0:21:48.039 look into one of these. One of the nice 0:21:48.039,0:21:49.610 things about the sorted set is you can, of 0:21:49.610,0:21:52.909 course, sort by it. And so the command here 0:21:52.909,0:21:55.950 is zrevrange. So we can do the 2014 sorted 0:21:55.950,0:21:58.869 set. So this is gonna give us the top 0:21:58.869,0:22:01.470 ten, or actually eleven, top eleven, you know,[br]ngrams 0:22:01.470,0:22:03.909 in 2014. So let's see. 0:22:03.909,0:22:09.090 And we can actually add :with_scores => true.[br]So 0:22:09.090,0:22:11.759 the most common words and phrases from 2014[br]RailsConf 0:22:11.759,0:22:16.639 talk abstracts. Not very surprising. The,[br]to, and, a, 0:22:16.639,0:22:20.200 of, in, you, how. Rails. OK. Rails makes the 0:22:20.200,0:22:21.110 number ten. 0:22:21.110,0:22:23.519 So there you go. 0:22:23.519,0:22:25.249 Now we can also, let's just have a little 0:22:25.249,0:22:28.369 fun here. See what some of the sort of top 0:22:28.369,0:22:30.480 non-trivial ones are. Obviously you could[br]write some code, 0:22:30.480,0:22:32.950 maybe kill stop words. Stuff like that. If[br]you 0:22:32.950,0:22:34.690 don't care about them. 0:22:34.690,0:22:40.330 But, so. Rails. Can code. This talk. Most[br]popular 0:22:40.330,0:22:44.619 two-word phrase. Pretty good. How to. Ruby[br]developers. 
Eh, 0:22:44.619,0:22:46.399 this looks pretty, pretty relevant, right.[br]I mean, these 0:22:46.399,0:22:51.220 are not words you'd be surprised to see in 0:22:51.220,0:22:53.289 a RailsConf talk abstract. 0:22:53.289,0:22:56.220 So those, you know, are the most common words. 0:22:56.220,0:22:57.289 So we now have this. We have this for 0:22:57.289,0:22:58.509 every year, by the way. So we can also 0:22:58.509,0:23:01.440 do something, this is the same thing for 2011. 0:23:01.440,0:23:04.279 Whatever. And the last piece of code we're[br]going 0:23:04.279,0:23:05.739 to write, is we need to be able to 0:23:05.739,0:23:06.769 query this data. 0:23:06.769,0:23:08.940 So, you know, the actual, sort of, website[br]or 0:23:08.940,0:23:11.590 finished product, you're gonna have to, you[br]know, search 0:23:11.590,0:23:13.429 for a term. And you're gonna have to go 0:23:13.429,0:23:15.739 look up in your data, you know, what, what 0:23:15.739,0:23:19.340 are the relevant values for that term. 0:23:19.340,0:23:21.299 And so, how we're gonna do this. Well, the 0:23:21.299,0:23:23.499 first thing we gotta remember is that we normal- 0:23:23.499,0:23:27.409 remember we did this normalize for ngrams[br]thing. So 0:23:27.409,0:23:28.919 we have to do that again, because what if 0:23:28.919,0:23:31.100 someone searches for a capitalized word or[br]something 0:23:31.100,0:23:32.989 with punctuation. We have to process it the[br]exact 0:23:32.989,0:23:35.739 same way that we processed our input. Otherwise[br]it 0:23:35.739,0:23:38.889 won't match. So let's just do that. 0:23:38.889,0:23:42.809 And then we have this constant ALL_YEARS.[br]And we're 0:23:42.809,0:23:45.950 gonna iterate through that with each_with_object,[br]with a 0:23:45.950,0:23:47.299 hash. Let's just build up a hash. That's probably 0:23:47.299,0:23:51.999 the easy way to do it. Do |year, hash|. 0:23:51.999,0:23:57.549 And the, the relevant redis command, again,[br]is zscore. 
0:23:57.549,0:24:03.700 So we can do redis dot zscore(). We're gonna 0:24:03.700,0:24:05.869 look up in the hash for that year, the 0:24:05.869,0:24:08.470 term. And we need to put this actually in 0:24:08.470,0:24:13.739 the hash. And so, and then we need to 0:24:13.739,0:24:16.289 to_i that in case it's nil. 0:24:16.289,0:24:19.100 OK. So this now, what does this say? ALL_YEARS 0:24:19.100,0:24:22.859 is just, you know, 2007 through 2014. Go through 0:24:22.859,0:24:25.889 each of those years. And then build up our 0:24:25.889,0:24:27.609 hash so that the hash, the key of the 0:24:27.609,0:24:30.499 year, maps to the value of, you know, the 0:24:30.499,0:24:33.889 number of times that term appeared in that[br]year. 0:24:33.889,0:24:38.179 So let's, again, see if that works. Talk dot 0:24:38.179,0:24:43.639 query, you know, ruby or something. Cool.[br]So in 0:24:43.639,0:24:47.330 2007 it was mentioned 52 times, 2014 22 times. 0:24:47.330,0:24:50.230 Whatever. We can, I guess, we said Ember originally. 0:24:50.230,0:24:54.309 And there you go. It was not mentioned until 0:24:54.309,0:24:58.369 this year. Which is also kind of telling. 0:24:58.369,0:25:01.690 And so this is basically, you know, all of 0:25:01.690,0:25:04.100 the kind of step two code you need. That's 0:25:04.100,0:25:06.840 sort of the ngram calculation, store the results.[br]And 0:25:06.840,0:25:09.840 again, I reiterate, like, everything we just[br]did, is 0:25:09.840,0:25:12.830 kind of trivially simple. There's no fancy[br]algorithms. It's 0:25:12.830,0:25:15.220 just counting, you know, putting stuff in[br]the right 0:25:15.220,0:25:17.169 data structure. Accessing it in sort of the[br]right 0:25:17.169,0:25:18.269 way. 0:25:18.269,0:25:20.940 And I just think there's something like pretty,[br]you 0:25:20.940,0:25:23.179 know, insightful about that, that you don't[br]need to 0:25:23.179,0:25:26.389 do fancy things all the time. 
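The query step described above might look something like the sketch below. It is an illustration under assumptions rather than the code from the talk: an in-memory Hash and a local `zscore` helper stand in for the Redis sorted sets, `normalize` is a hypothetical version of the normalization written earlier in the talk, and `COUNTS_BY_YEAR` holds sample values only (the ember count in particular is made up).

```ruby
ALL_YEARS = (2007..2014).to_a

# Stand-in data (assumption): year => { ngram => count }. Illustrative values.
COUNTS_BY_YEAR = {
  2007 => { "ruby" => 52 },
  2014 => { "ruby" => 22, "ember" => 3 }
}

# Mimics a sorted-set score lookup: returns nil when the member is absent.
def zscore(year, member)
  COUNTS_BY_YEAR.fetch(year, {})[member]
end

# Hypothetical normalization: lowercase, strip punctuation, squeeze spaces.
# It has to match whatever was applied when the ngrams were counted.
def normalize(term)
  term.downcase.gsub(/[^a-z0-9\s]/, "").split.join(" ")
end

# Build a hash of year => number of times the term appeared that year.
def query(term)
  normalized = normalize(term)
  ALL_YEARS.each_with_object({}) do |year, hash|
    hash[year] = zscore(year, normalized).to_i # nil.to_i is 0
  end
end
```

A call such as `query("Ruby!")` goes through the same normalization as the input, so capitalization and punctuation in the search term do not prevent a match, and years with no mentions come back as 0 instead of nil.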
And that often 0:25:26.389,0:25:28.590 the kind of the coolest results will come[br]from 0:25:28.590,0:25:30.749 something simple. 0:25:30.749,0:25:31.769 And so, as I said, the last thing we're 0:25:31.769,0:25:33.139 gonna do here is create this nice front end 0:25:33.139,0:25:35.970 interface that lets us investigate the results.[br]You know, 0:25:35.970,0:25:37.989 unfortunately, we don't really have time to[br]get into 0:25:37.989,0:25:40.320 that. It is all on the GitHub. But, I 0:25:40.320,0:25:42.940 will tell you, I use Highcharts as a 0:25:42.940,0:25:46.100 nice library, front-end library that makes[br]it very simple 0:25:46.100,0:25:47.450 to get charts up and running. It's actually[br]not 0:25:47.450,0:25:48.419 that much code. 0:25:48.419,0:25:49.889 And I've done this already. So let's start[br]up 0:25:49.889,0:25:54.039 a server. And, oops. Let's fire up the localhost. 0:25:54.039,0:25:58.950 And so here we are. The abstractogram is our 0:25:58.950,0:26:00.009 app. So what are we, what are we gonna 0:26:00.009,0:26:01.080 search for here? 0:26:01.080,0:26:03.919 Let's see. I, you, we or something. And there 0:26:03.919,0:26:05.330 we go. So there, there it is. The number 0:26:05.330,0:26:08.730 of times the word you appears in each year. 0:26:08.730,0:26:11.100 Looks pretty flat. So, you know, the, these[br]are 0:26:11.100,0:26:13.100 kind of constant. Anyone have any, anything[br]else they 0:26:13.100,0:26:15.539 want to search for? Let's try ember, backbone. 0:26:15.539,0:26:19.369 All right. Let's say, we got, Postgres I heard. 0:26:19.369,0:26:24.109 All right. I guess we could also say, let's 0:26:24.109,0:26:28.639 say SQL. No one cares about Postgres this year. 0:26:28.639,0:26:32.700 Service. SOA. Oh, there is sort of a rising 0:26:32.700,0:26:35.850 trend of service-oriented architecture. 0:26:35.850,0:26:36.320 Anything else? 0:26:36.320,0:26:41.419 TDD. That's a good one. TDD. Testing. Test-driven,[br]how 0:26:41.419,0:26:48.419 about. 
So there we go. I'm sorry? 0:26:48.909,0:26:53.739 REST. That's a tricky one though, 'cause rest[br]is 0:26:53.739,0:26:55.480 also like a real word that, you know, like, 0:26:55.480,0:26:57.440 the rest of the time will be something else. 0:26:57.440,0:27:04.149 And. Refactor. Let's see. Ooh. That's a good[br]one. 0:27:04.149,0:27:09.629 DHH. Wow. Peaked 2011, peak DHH. Let's see,[br]we 0:27:09.629,0:27:11.570 got, Heroku is a good one. On the rise. 0:27:11.570,0:27:13.700 I like, we can just look at Ruby and 0:27:13.700,0:27:15.409 Rails. This is actually, I think, pretty relevant.[br]It's 0:27:15.409,0:27:18.980 like, what are people talking about? Not Rails[br]anymore. 0:27:18.980,0:27:20.269 We got to find something new to talk about. 0:27:20.269,0:27:22.730 You know, it's like, too many RailsConfs.[br]And, in 0:27:22.730,0:27:25.350 fact, this actually came up at the, you know, 0:27:25.350,0:27:27.119 there was a speaker meeting, whatever, and[br]everyone was 0:27:27.119,0:27:29.489 talking about how, you know, their talks weren't[br]actually 0:27:29.489,0:27:30.600 about Rails. 0:27:30.600,0:27:32.879 And, you know, maybe this is actually an insightful 0:27:32.879,0:27:35.639 statement, that, you know, the, the community[br]has obviously 0:27:35.639,0:27:37.710 gotten very large and there's just a ton of 0:27:37.710,0:27:38.350 other stuff to talk about. People have been[br]talking 0:27:38.350,0:27:41.299 about Rails for a long time. And so, you 0:27:41.299,0:27:42.909 know, here I am giving a talk that's not 0:27:42.909,0:27:46.059 really directly about Rails. But, so maybe[br]this is 0:27:46.059,0:27:47.369 like a real trend that people are just finding 0:27:47.369,0:27:49.039 other stuff to talk about. 0:27:49.039,0:27:53.080 And that is pretty cool. So I promised that 0:27:53.080,0:27:56.470 I would show you the repo or whatever on 0:27:56.470,0:27:59.609 GitHub. You can just do bit.ly slash railsconfdata.[br]It's 0:27:59.609,0:28:02.059 just the code. 
Everything we've looked at[br]today. Plus 0:28:02.059,0:28:04.419 some more stuff. It's actually running live[br]on the 0:28:04.419,0:28:07.399 internet at abstractogram dot herokuapp dot[br]com. 0:28:07.399,0:28:09.679 I figure the internet's probably not working,[br]but let's 0:28:09.679,0:28:16.679 see. Yup. Classic. And, you know, otherwise[br]that is 0:28:16.809,0:28:19.649 it. And thank you for listening. And I think 0:28:19.649,0:28:20.450 we have time for questions.