1 00:00:17,080 --> 00:00:18,190 TODD SCHNEIDER: All right. We're, we're good. Thank you. 2 00:00:18,190 --> 00:00:19,910 Sorry for the delay. Classic. 3 00:00:19,910 --> 00:00:22,270 Even in the future nothing works. Welcome. 4 00:00:22,270 --> 00:00:26,240 I am Todd. I'm an engineer at Rap Genius. 5 00:00:26,240 --> 00:00:31,640 And today's talk is going to be about data science with a live tutorial. 6 00:00:31,640 --> 00:00:34,360 And before we get into the live coding component, 7 00:00:34,360 --> 00:00:36,070 I wanted to show you all a project I 8 00:00:36,070 --> 00:00:39,030 built previously, which kind of serves as the inspiration 9 00:00:39,030 --> 00:00:41,470 for this talk. Sort of. So this is a 10 00:00:41,470 --> 00:00:45,440 website called weddingcrunchers dot com. What is Wedding Crunchers? 11 00:00:45,440 --> 00:00:48,110 It's a place where you can track the, the 12 00:00:48,110 --> 00:00:50,979 popularity of words and phrases in the New York 13 00:00:50,979 --> 00:00:54,449 Times wedding section over the past thirty-some years. 14 00:00:54,449 --> 00:00:56,129 And a lot of you might be wondering why 15 00:00:56,129 --> 00:00:58,640 on earth would this be interesting or relevant or 16 00:00:58,640 --> 00:01:01,530 funny or anything, and I hope to convince you 17 00:01:01,530 --> 00:01:04,360 of that very quickly. Here is a, an example 18 00:01:04,360 --> 00:01:07,220 wedding announcement from the New York Times. This one's 19 00:01:07,220 --> 00:01:08,030 from 1985. 20 00:01:08,030 --> 00:01:08,970 If you don't know me, you don't live in 21 00:01:08,970 --> 00:01:11,260 New York, read the New York Times, the wedding 22 00:01:11,260 --> 00:01:14,280 section has a certain cultural cachet. It's kind of 23 00:01:14,280 --> 00:01:15,720 an honor to be listed in there and it's 24 00:01:15,720 --> 00:01:18,580 got a very resume-like structure. 
People get to brag 25 00:01:18,580 --> 00:01:20,110 about where they went to school and what they 26 00:01:20,110 --> 00:01:20,979 do. 27 00:01:20,979 --> 00:01:23,050 So here is an example. You know, Diane deCordova 28 00:01:23,050 --> 00:01:25,270 is marrying Michael Monro Lewis. They both went to 29 00:01:25,270 --> 00:01:28,250 Princeton. They graduated Cum Laude. You know, she works 30 00:01:28,250 --> 00:01:30,440 at Morgan Stanley. He works at Salomon Brothers in 31 00:01:30,440 --> 00:01:32,610 New York and they're gonna go to London. And 32 00:01:32,610 --> 00:01:34,430 this should be a little familiar to a bunch 33 00:01:34,430 --> 00:01:35,420 of you. 34 00:01:35,420 --> 00:01:37,870 Mr. Lewis, an associate at Salomon Brothers, is Michael Lewis. 35 00:01:37,870 --> 00:01:40,600 He's the guy who wrote Liar's Poker, famous book about 36 00:01:40,600 --> 00:01:42,810 his experience there. And before, before he was a 37 00:01:42,810 --> 00:01:45,710 famous writer, he was just another New York Times 38 00:01:45,710 --> 00:01:49,630 wedding announced person. 39 00:01:49,630 --> 00:01:51,560 And so what Wedding Crunchers does is it takes 40 00:01:51,560 --> 00:01:54,560 the entire corpus of New York Times wedding announcements 41 00:01:54,560 --> 00:01:57,409 back from 1981 and you can search for words 42 00:01:57,409 --> 00:01:59,520 and phrases and you can see how common those 43 00:01:59,520 --> 00:02:01,800 words and phrases are, you know, by year. It's 44 00:02:01,800 --> 00:02:03,320 like, this is a good one that's relevant to 45 00:02:03,320 --> 00:02:06,409 people here. You know, banker and programmer. You know, 46 00:02:06,409 --> 00:02:08,979 for example, when you list so-and-so is a banker 47 00:02:08,979 --> 00:02:11,780 or is a programmer in the announcement and you 48 00:02:11,780 --> 00:02:13,700 see, over time, you know, banker used to be 49 00:02:13,700 --> 00:02:18,450 way more commonly used than programmer in these announcements. 
50 00:02:18,450 --> 00:02:21,140 And only just this year, in 2014, programmer has 51 00:02:21,140 --> 00:02:28,140 finally overtaken banker as, you know, the, the place, 52 00:02:28,190 --> 00:02:29,890 you know, the people getting married in New York, 53 00:02:29,890 --> 00:02:32,770 who are part of society, come from. Another good 54 00:02:32,770 --> 00:02:35,170 one is, if you look at goldman, sachs and 55 00:02:35,170 --> 00:02:37,600 google- is my internet on? Good. 56 00:02:37,600 --> 00:02:41,150 So here's another good one. So Goldman Sachs, you 57 00:02:41,150 --> 00:02:44,120 know, classic New York financial institution. Google, new kid 58 00:02:44,120 --> 00:02:47,160 on the block. Tech scene. Boom. Taking over. 59 00:02:47,160 --> 00:02:49,800 And, you know, this is obviously fun, and it's 60 00:02:49,800 --> 00:02:52,440 amusing. But it's also actually pretty insightful for a 61 00:02:52,440 --> 00:02:55,760 relatively simple concept. I mean, this one graph tells 62 00:02:55,760 --> 00:02:58,740 a pretty powerful story of, you know, New York 63 00:02:58,740 --> 00:03:01,750 the, the finance capital of the world. Meanwhile, we 64 00:03:01,750 --> 00:03:03,550 have this sort of emerging tech scene. You know, 65 00:03:03,550 --> 00:03:05,150 Google may be the biggest player in the kind 66 00:03:05,150 --> 00:03:06,959 of new tech world. 67 00:03:06,959 --> 00:03:09,510 And now, when you turn to the society pages 68 00:03:09,510 --> 00:03:11,209 to see who's getting married, you know, there's more 69 00:03:11,209 --> 00:03:13,970 employees from Google than there are from Goldman Sachs. 70 00:03:13,970 --> 00:03:16,750 And that's, you know, kind of an interesting thing in 71 00:03:16,750 --> 00:03:17,739 the world. 
72 00:03:17,739 --> 00:03:20,500 And so what we're gonna do today is build 73 00:03:20,500 --> 00:03:25,120 something just like Wedding Crunchers, except, instead of using 74 00:03:25,120 --> 00:03:28,280 the text of wedding announcements to analyze, we're going 75 00:03:28,280 --> 00:03:32,670 to look at all of the RailsConf talk abstracts. 76 00:03:32,670 --> 00:03:34,080 And so, you know, hopefully this is, this is 77 00:03:34,080 --> 00:03:36,550 interesting to people here and, I always say, you 78 00:03:36,550 --> 00:03:38,709 know, if there's only one thing you take from 79 00:03:38,709 --> 00:03:41,319 this talk, really, what it should be is that, 80 00:03:41,319 --> 00:03:43,709 you know, work on a problem that's interesting to 81 00:03:43,709 --> 00:03:46,260 you. Because, especially when you're dealing with data science, 82 00:03:46,260 --> 00:03:47,590 a lot of it's pretty messy and then you 83 00:03:47,590 --> 00:03:49,290 have to go through scraping stuff as we'll get 84 00:03:49,290 --> 00:03:51,879 into, and it's easy to get frustrated and kind 85 00:03:51,879 --> 00:03:53,810 of lost and like, if you're not working on 86 00:03:53,810 --> 00:03:55,450 something that you care about, and something that you 87 00:03:55,450 --> 00:03:58,060 really want to know, kind of, the final result, 88 00:03:58,060 --> 00:04:00,110 it's just much easier to get distracted and kind 89 00:04:00,110 --> 00:04:01,069 of, ultimately, bail. 90 00:04:01,069 --> 00:04:03,819 So, again, if you take one thing, just work 91 00:04:03,819 --> 00:04:07,550 on something that is interesting to you. So the 92 00:04:07,550 --> 00:04:09,819 particular kind of analysis we're gonna do is something 93 00:04:09,819 --> 00:04:12,680 called n-gram analysis. And I have a little example 94 00:04:12,680 --> 00:04:14,190 set up here. So what is an n-gram? You 95 00:04:14,190 --> 00:04:15,800 may have heard the word before. 
96 00:04:15,800 --> 00:04:19,099 Really, all it means is, you know, a, a 97 00:04:19,099 --> 00:04:23,830 sequence of consecutive words as part of a sentence. So like, 98 00:04:23,830 --> 00:04:26,030 example's very simple, for one example. This talk is 99 00:04:26,030 --> 00:04:28,000 boring. What are the, what are the one grams 100 00:04:28,000 --> 00:04:30,330 in this sentence? It's just the words. This, talk, 101 00:04:30,330 --> 00:04:32,780 is, and boring. The two grams are every pair 102 00:04:32,780 --> 00:04:35,839 of consecutive words. This talk, talk is, is boring, 103 00:04:35,839 --> 00:04:37,219 and so on. 104 00:04:37,219 --> 00:04:38,150 And so what we need to be able to 105 00:04:38,150 --> 00:04:40,889 do in order to build, you know, a graph 106 00:04:40,889 --> 00:04:43,300 like this, is we need to take a term 107 00:04:43,300 --> 00:04:45,159 that's, you know, relevant to RailsConf, say something like 108 00:04:45,159 --> 00:04:46,960 Ember or whatever, and we need to be able 109 00:04:46,960 --> 00:04:48,759 to look up, you know, for each year how 110 00:04:48,759 --> 00:04:51,300 many times does this, you know, word or n-gram 111 00:04:51,300 --> 00:04:53,610 appear in the data. 112 00:04:53,610 --> 00:04:55,550 And so that is what we are going to 113 00:04:55,550 --> 00:04:58,789 build. And I have this brief little outline here. 114 00:04:58,789 --> 00:05:01,020 There's kind of three steps. And this is pretty 115 00:05:01,020 --> 00:05:04,629 general to, to any data project. You know, step 116 00:05:04,629 --> 00:05:06,719 one is gonna be just gathering the data, getting 117 00:05:06,719 --> 00:05:09,659 it in some usable form. Step two is gonna 118 00:05:09,659 --> 00:05:11,259 be kind of the analysis part where we do 119 00:05:11,259 --> 00:05:14,050 the n-gram calculation. We store the results. 
And then 120 00:05:14,050 --> 00:05:15,789 step three is gonna be to create a nice 121 00:05:15,789 --> 00:05:19,259 little front-end interface that lets us investigate, visualize and 122 00:05:19,259 --> 00:05:20,809 see what we've done. 123 00:05:20,809 --> 00:05:23,300 Now unfortunately, you know, in a, in a thirty 124 00:05:23,300 --> 00:05:26,020 minute talk we can't possibly do all of this. 125 00:05:26,020 --> 00:05:28,689 So we're gonna focus more on items one and 126 00:05:28,689 --> 00:05:31,490 two and less so on three, and even then 127 00:05:31,490 --> 00:05:33,099 it's too much. So, you know, I sort of 128 00:05:33,099 --> 00:05:34,689 used the analogy, it'll be a bit like watching 129 00:05:34,689 --> 00:05:37,419 TV on the Food Network, where we might, you 130 00:05:37,419 --> 00:05:40,039 know, throw something in the oven, mysteriously something else 131 00:05:40,039 --> 00:05:42,009 pops out of the other oven even though it's, 132 00:05:42,009 --> 00:05:43,759 where did that come from? 133 00:05:43,759 --> 00:05:46,089 But not to worry. Everything is also on GitHub. 134 00:05:46,089 --> 00:05:47,869 There's a repo I'll share with you at the 135 00:05:47,869 --> 00:05:50,339 end. So anything that we don't cover or that 136 00:05:50,339 --> 00:05:51,979 we cover too quickly or something, you'll be able 137 00:05:51,979 --> 00:05:53,779 to see sort of the, the full version on 138 00:05:53,779 --> 00:05:55,740 GitHub. 139 00:05:55,740 --> 00:05:57,770 So let us jump in now to step one, 140 00:05:57,770 --> 00:06:00,189 which is, you know, gathering the data. And so 141 00:06:00,189 --> 00:06:01,909 let's take a look back at the, the RailsConf 142 00:06:01,909 --> 00:06:03,080 website again. So we have to figure out how 143 00:06:03,080 --> 00:06:06,460 we're gonna model a, a RailsConf talk in our 144 00:06:06,460 --> 00:06:09,889 database. 
So like, what, you know, attributes does a, 145 00:06:09,889 --> 00:06:13,339 do a, excuse me, does a RailsConf talk have. 146 00:06:13,339 --> 00:06:14,289 And it's like, one thing we see is they 147 00:06:14,289 --> 00:06:17,669 all have titles. So that looks like something. They 148 00:06:17,669 --> 00:06:20,089 have speakers. You know, there's this thing, which is 149 00:06:20,089 --> 00:06:23,330 the abstract, and then there's the bio. And that's 150 00:06:23,330 --> 00:06:25,469 probably it. That's probably all we need. 151 00:06:25,469 --> 00:06:27,669 So that's pretty simple. And, you know, I have 152 00:06:27,669 --> 00:06:29,999 the little migration. I've already run here. But here 153 00:06:29,999 --> 00:06:31,789 are attributes for talks. It's just the year, you 154 00:06:31,789 --> 00:06:33,909 know, what, what conference were we actually at. The 155 00:06:33,909 --> 00:06:36,110 title of the talk, the speaker, the abstract, and 156 00:06:36,110 --> 00:06:37,569 the bio. 157 00:06:37,569 --> 00:06:41,490 And so also, that's, again, pretty straightforward. The Gemfile 158 00:06:41,490 --> 00:06:45,089 is also very simple. It's mostly pretty boilerplate. 159 00:06:45,089 --> 00:06:47,830 Rails 4, Ruby 2.1. The only gems I wanted 160 00:06:47,830 --> 00:06:49,409 to call out here are, we're gonna use nokogiri 161 00:06:49,409 --> 00:06:52,309 for, you know, fetching, or, parsing websites and kind 162 00:06:52,309 --> 00:06:54,229 of scraping the data we need. We're gonna use 163 00:06:54,229 --> 00:06:56,389 Postgres as our main data store and we're gonna 164 00:06:56,389 --> 00:06:58,219 use redis to build this sort of index that 165 00:06:58,219 --> 00:07:00,180 we can ultimately use to look up, you know, 166 00:07:00,180 --> 00:07:02,389 how common a word is. 167 00:07:02,389 --> 00:07:05,389 And so one thing that's not here is, like, 168 00:07:05,389 --> 00:07:09,009 you know, gem fancy data algorithm. 
And a lot 169 00:07:09,009 --> 00:07:10,689 of people, this is kind of where Ruby often 170 00:07:10,689 --> 00:07:13,369 gets a bad reputation of, you know, not being 171 00:07:13,369 --> 00:07:16,039 supportive of scientific computing or whatever. And other languages 172 00:07:16,039 --> 00:07:18,589 have more, more support. But my claim is that 173 00:07:18,589 --> 00:07:20,520 it's really not that important. You can get a 174 00:07:20,520 --> 00:07:23,509 ton of mileage out of very simple tools that 175 00:07:23,509 --> 00:07:24,210 you can build yourself. 176 00:07:24,210 --> 00:07:25,809 You know, you don't need a fancy gem or 177 00:07:25,809 --> 00:07:28,360 any fancy algorithm. Those things are cool too and 178 00:07:28,360 --> 00:07:30,740 they have their place. But they're not needed a 179 00:07:30,740 --> 00:07:33,349 lot of the time. And, you know, Ruby is 180 00:07:33,349 --> 00:07:36,210 a wonderful language for, especially, scraping stuff from the 181 00:07:36,210 --> 00:07:38,449 web. There's a ton of support there. And so 182 00:07:38,449 --> 00:07:40,979 I don't think that the, the lack of, you 183 00:07:40,979 --> 00:07:43,509 know, fancy algorithm gems should necessarily be a deterrant 184 00:07:43,509 --> 00:07:44,439 at all. 185 00:07:44,439 --> 00:07:46,960 And so hopefully part of this talk is convincing 186 00:07:46,960 --> 00:07:49,649 people that Ruby and Rails are actually quite well-suited 187 00:07:49,649 --> 00:07:50,939 to problems like this. 188 00:07:50,939 --> 00:07:53,559 OK. So now we actually need to write some 189 00:07:53,559 --> 00:07:56,249 code to scrape the talk. And you know, if 190 00:07:56,249 --> 00:07:57,419 you've ever done anything like this before, you know 191 00:07:57,419 --> 00:07:59,520 that Chrome Inspector is your best friend. So let's 192 00:07:59,520 --> 00:08:02,499 fire that up. 
We're gonna inspect element, and so 193 00:08:02,499 --> 00:08:04,069 like, we actually, what we need to do now 194 00:08:04,069 --> 00:08:06,889 is take you know, this HTML on the page 195 00:08:06,889 --> 00:08:09,119 and turn it into a database record that we 196 00:08:09,119 --> 00:08:11,889 can then, you know, use to our advantage later. 197 00:08:11,889 --> 00:08:13,050 And so it looks like, you know, all the 198 00:08:13,050 --> 00:08:16,629 talks are in these session classes. So that's something. 199 00:08:16,629 --> 00:08:19,849 We can look in here. This looks like something. 200 00:08:19,849 --> 00:08:23,469 So let's make this bigger. 201 00:08:23,469 --> 00:08:25,039 And you know it helps to, well, it's kind 202 00:08:25,039 --> 00:08:29,059 of essential to be decent with CSS selectors here, 203 00:08:29,059 --> 00:08:32,149 because that's how we're going to basically find stuff. 204 00:08:32,149 --> 00:08:34,719 So let's see, OK, so there's eighty-one session divs. 205 00:08:34,719 --> 00:08:37,990 That sounds about right. I happen to know that 206 00:08:37,990 --> 00:08:42,229 mine is number seventy-eight, so let's, let's look at 207 00:08:42,229 --> 00:08:44,360 that. And so here we are. So we need 208 00:08:44,360 --> 00:08:46,970 to, again, the, the things we're mod- or, the 209 00:08:46,970 --> 00:08:50,250 attributes we're storing are the title, the speaker, the 210 00:08:50,250 --> 00:08:52,680 abstract, and the bio. And so we're gonna need 211 00:08:52,680 --> 00:08:54,850 to pull these things out. 212 00:08:54,850 --> 00:08:57,630 So let's see. It looks like the, the title 213 00:08:57,630 --> 00:09:00,490 is in this h1 element inside the header. So 214 00:09:00,490 --> 00:09:04,830 let's just make sure that works. You know, header 215 00:09:04,830 --> 00:09:08,450 h1. That looks right. 216 00:09:08,450 --> 00:09:13,650 The, the speaker looks to be the header h2. 217 00:09:13,650 --> 00:09:16,060 Cool. 
218 00:09:16,060 --> 00:09:20,640 Now the abstract is in this p tag, so 219 00:09:20,640 --> 00:09:23,130 we can do something like this. But this is 220 00:09:23,130 --> 00:09:26,490 actually not quite right. So what's wrong with this? 221 00:09:26,490 --> 00:09:30,140 Well, the abstract ends, you know, suited to the 222 00:09:30,140 --> 00:09:32,310 problem. The bio here is also in the p 223 00:09:32,310 --> 00:09:35,310 tag. Originally a math guy. And we've actually pulled 224 00:09:35,310 --> 00:09:37,010 all the p-tags. So we need a way of 225 00:09:37,010 --> 00:09:38,940 not doing that. And this is where you just 226 00:09:38,940 --> 00:09:40,200 need to know a little bit of CSS. Not 227 00:09:40,200 --> 00:09:42,550 very complicated. But if you use the little greater 228 00:09:42,550 --> 00:09:44,800 than guy, what this says is only take the 229 00:09:44,800 --> 00:09:47,210 p tags that are immediate descendants of the session 230 00:09:47,210 --> 00:09:50,390 div. And so now we have, you know, only 231 00:09:50,390 --> 00:09:51,060 the abstract. 232 00:09:51,060 --> 00:09:54,340 And lastly, you know, the bio is just in 233 00:09:54,340 --> 00:09:58,460 its own little section. So something like that. Cool. 234 00:09:58,460 --> 00:10:00,190 So that is the jQuery version of it. We 235 00:10:00,190 --> 00:10:03,180 need to do this, though, in Ruby. And as 236 00:10:03,180 --> 00:10:05,250 I said, this does sometimes get a little tedious. 237 00:10:05,250 --> 00:10:07,340 But let's, let's write the code. So I have 238 00:10:07,340 --> 00:10:12,160 this empty method - create_railsconf_2014_talks. And also this method 239 00:10:12,160 --> 00:10:14,760 I've written already called fetch_and_parse, which just gets a 240 00:10:14,760 --> 00:10:16,610 URL and sends it to nokogiri, which we can 241 00:10:16,610 --> 00:10:17,690 then use to do our CSS selectors. 242 00:10:17,690 --> 00:10:20,510 So let, let's just write this. 
So we can 243 00:10:20,510 --> 00:10:27,400 say doc is fetch_and_parse. The url is this. Let's 244 00:10:27,400 --> 00:10:33,940 see if this works in the console. 245 00:10:33,940 --> 00:10:40,940 Of course, in here. Do I have internet? Nice. 246 00:10:47,360 --> 00:10:52,700 So we can then check the same thing. Again. 247 00:10:52,700 --> 00:10:57,830 Looks right. Let's find my talk, which, this part 248 00:10:57,830 --> 00:10:59,310 I couldn't possibly tell you. When you use the 249 00:10:59,310 --> 00:11:01,610 nokogiri, the eq thing, you have to add two 250 00:11:01,610 --> 00:11:04,330 from whatever jQuery does. So I'm number 80 now. 251 00:11:04,330 --> 00:11:06,570 Don't ask me why. I couldn't possibly tell you. 252 00:11:06,570 --> 00:11:10,210 But maybe someone here knows. Be curious to find 253 00:11:10,210 --> 00:11:10,780 out. 254 00:11:10,780 --> 00:11:11,920 AUDIENCE: ?? (00:11:13) 255 00:11:11,920 --> 00:11:15,400 T.S.: So there it is. There's the title. So 256 00:11:15,400 --> 00:11:17,380 let us now write some code here. We have 257 00:11:17,380 --> 00:11:21,520 our, our document. We're gonna go through each session. 258 00:11:21,520 --> 00:11:24,320 The CSS method is kind of like, you know, 259 00:11:24,320 --> 00:11:28,900 the selector for nokogiri. Each elements. So each of 260 00:11:28,900 --> 00:11:35,370 these we're gonna create a talk. 261 00:11:35,370 --> 00:11:38,390 And again. So the year we already know is 262 00:11:38,390 --> 00:11:45,390 2014. The title we're gonna say is, elm.css("header h1").inner_text. 263 00:11:48,300 --> 00:11:55,300 Speaker, header h2, dun nuh nuh dun nuh nuh 264 00:12:00,460 --> 00:12:04,520 nuh. Gettin' there. 265 00:12:04,520 --> 00:12:09,950 All right. So I think this will probably work. 266 00:12:09,950 --> 00:12:13,980 Let's find out. And so we're back in here. 267 00:12:13,980 --> 00:12:19,470 Just to prove to you that I'm not lying, 268 00:12:19,470 --> 00:12:23,450 2014 dot count. There's none of them. 
And, what'd 269 00:12:23,450 --> 00:12:26,440 I call this method? This guy. Delayed::Job. 270 00:12:26,440 --> 00:12:33,440 All right. So we just did something. Did it 271 00:12:33,440 --> 00:12:40,440 work? Nice. We got eighty-one talks. Most importantly, let's, 272 00:12:41,150 --> 00:12:42,390 we have my talk. That's the, that's the only 273 00:12:42,390 --> 00:12:46,760 one that matters anyway. And so, you know, you 274 00:12:46,760 --> 00:12:48,260 might be thinking now, like, you know, what the 275 00:12:48,260 --> 00:12:50,120 heck, I came to the, the data science talk, 276 00:12:50,120 --> 00:12:52,330 not the scraping talk. You know, to that, I 277 00:12:52,330 --> 00:12:56,020 would say, tough luck. They're the same thing. You 278 00:12:56,020 --> 00:12:57,880 know, you might not, you might not want to 279 00:12:57,880 --> 00:13:00,040 hear it, but guess what, this is usually the 280 00:13:00,040 --> 00:13:02,020 most important part of the entire project. 281 00:13:02,020 --> 00:13:04,960 It's the hardest part, you know, because guess what, 282 00:13:04,960 --> 00:13:07,080 just because we got the 2014 talks, you know, 283 00:13:07,080 --> 00:13:08,630 now we have to get the 2013 talks. And 284 00:13:08,630 --> 00:13:10,880 the 2012 talks. And they're all on different websites. 285 00:13:10,880 --> 00:13:12,890 They all have different structures. You know, you're gonna 286 00:13:12,890 --> 00:13:15,090 have to write different code to get each type 287 00:13:15,090 --> 00:13:17,120 of website. It's a pain. And this is why 288 00:13:17,120 --> 00:13:19,240 I said earlier, you know, really make sure you're 289 00:13:19,240 --> 00:13:21,160 working on something you care about. Because it's just 290 00:13:21,160 --> 00:13:24,300 not fun to like, like, ugh, in 2008 they 291 00:13:24,300 --> 00:13:26,850 separated the speakers and the abstracts. 
And it's like, 292 00:13:26,850 --> 00:13:29,260 it's just, it's annoying, but again, it's the most 293 00:13:29,260 --> 00:13:30,290 important part I would say. 294 00:13:30,290 --> 00:13:32,920 You know, so much of data science is taking 295 00:13:32,920 --> 00:13:35,960 data that's either unstructured or structured in the wrong 296 00:13:35,960 --> 00:13:39,020 format for you and, you know, getting it into 297 00:13:39,020 --> 00:13:40,510 the way, you know, into the structure that you 298 00:13:40,510 --> 00:13:43,410 need to do whatever analysis you want to do. 299 00:13:43,410 --> 00:13:45,130 So in this case, that's taking, you know, html 300 00:13:45,130 --> 00:13:47,930 on a page and converting it into a Postgres 301 00:13:47,930 --> 00:13:49,430 database. 302 00:13:49,430 --> 00:13:52,800 And so we have done that now. And again, 303 00:13:52,800 --> 00:13:53,920 take my word that, you know, I've done this 304 00:13:53,920 --> 00:13:56,980 for the other years as well. Back to 2007, 305 00:13:56,980 --> 00:14:00,930 and so we have a total of 497 talks 306 00:14:00,930 --> 00:14:04,220 in here from RailsConfs over the years. And so 307 00:14:04,220 --> 00:14:06,870 that's cool. That's basically our dataset that we're gonna 308 00:14:06,870 --> 00:14:07,200 use. 309 00:14:07,200 --> 00:14:08,730 And so we can sort of move on to, 310 00:14:08,730 --> 00:14:11,000 you know, step two of the project here, which 311 00:14:11,000 --> 00:14:14,000 is, you know, do the n-gram calculation and store 312 00:14:14,000 --> 00:14:16,800 the results. And so let's go back to talk.rb. 313 00:14:16,800 --> 00:14:18,700 All this by the way is just in, you 314 00:14:18,700 --> 00:14:21,920 know, app/models/talk.rb. That's where all this code is. 315 00:14:21,920 --> 00:14:25,560 And I have another empty method somewhere called def 316 00:14:25,560 --> 00:14:27,590 ngrams. And so this method, we're gonna need to 317 00:14:27,590 --> 00:14:29,810 give, you know, it goes on a talk. 
So 318 00:14:29,810 --> 00:14:32,400 given a value of n, calculate all the ngrams 319 00:14:32,400 --> 00:14:34,540 from that talk's abstract. 320 00:14:34,540 --> 00:14:36,160 And so, what are we gonna do here? So 321 00:14:36,160 --> 00:14:43,160 again, let's look at, talk dot mine. Dot abstract. 322 00:14:43,550 --> 00:14:45,160 So here's the abstract, and we need to, you 323 00:14:45,160 --> 00:14:48,920 know, get ngrams out of this. And so the 324 00:14:48,920 --> 00:14:51,060 first thing, I've written a little helper method over 325 00:14:51,060 --> 00:14:54,050 here. Which I've just tacked onto String, called 326 00:14:54,050 --> 00:14:57,410 normalized_for_ngrams. And you know, what does this do? Well, 327 00:14:57,410 --> 00:14:59,640 it downcases it, cause we're gonna do case insensitive. 328 00:14:59,640 --> 00:15:01,560 There might be cases where you want to keep 329 00:15:01,560 --> 00:15:03,820 case sensitivity. Whatever. Doesn't really matter. In this case 330 00:15:03,820 --> 00:15:06,060 we're gonna go case insensitive. 331 00:15:06,060 --> 00:15:08,880 Squish is a nice, convenient method that will kind 332 00:15:08,880 --> 00:15:11,460 of standardize the white space for you. So like, 333 00:15:11,460 --> 00:15:13,990 if there's any trailing or leading white space, and 334 00:15:13,990 --> 00:15:16,600 if there's like a bunch of middle white space, 335 00:15:16,600 --> 00:15:18,730 this will, it'll kill the beginning and ending and 336 00:15:18,730 --> 00:15:20,630 it'll turn anything in the middle into a single 337 00:15:20,630 --> 00:15:21,220 space. 338 00:15:21,220 --> 00:15:22,230 So that way you just don't have to worry 339 00:15:22,230 --> 00:15:25,130 about things like double spaces or, you know, other, 340 00:15:25,130 --> 00:15:26,820 other weird things that can happen. Cause of course 341 00:15:26,820 --> 00:15:28,600 it's the web. Whatever can go wrong will go 342 00:15:28,600 --> 00:15:31,510 wrong. 
So make sure that your data's in some 343 00:15:31,510 --> 00:15:33,360 kind of standardized format. 344 00:15:33,360 --> 00:15:36,500 And the last thing I've done is removed punctuation. 345 00:15:36,500 --> 00:15:38,360 And the reason for that is just cause like, 346 00:15:38,360 --> 00:15:40,279 you know, there's commas, periods, colons, all sorts of 347 00:15:40,279 --> 00:15:42,930 stuff like that. We don't really care about it. 348 00:15:42,930 --> 00:15:44,710 And so let's just kill any character that's not 349 00:15:44,710 --> 00:15:46,540 either a space or a word character. This is 350 00:15:46,540 --> 00:15:49,450 kind of the, little like, Ruby special regex thing. 351 00:15:49,450 --> 00:15:53,040 So we're gonna kill punctuation. 352 00:15:53,040 --> 00:15:54,190 And so we can actually just mess with this 353 00:15:54,190 --> 00:15:56,610 in the console maybe. So let's take our little 354 00:15:56,610 --> 00:16:00,460 example sentence. You know, this talk is boring. And 355 00:16:00,460 --> 00:16:04,240 let's normalize that for ngrams. OK. All it did 356 00:16:04,240 --> 00:16:07,710 was downcase it. And now we want to get 357 00:16:07,710 --> 00:16:09,410 that into an array of words, which we can 358 00:16:09,410 --> 00:16:13,060 just do with split. Cool. 359 00:16:13,060 --> 00:16:16,830 And now there's actually this neat little Ruby enumerable 360 00:16:16,830 --> 00:16:18,290 thing, which I didn't know about until pretty recently. 361 00:16:18,290 --> 00:16:21,800 each_cons, which stands for each consecutive. And it 362 00:16:21,800 --> 00:16:25,380 takes an argument, a single number, like two, and 363 00:16:25,380 --> 00:16:27,279 what this says is give me all of the, 364 00:16:27,279 --> 00:16:29,779 you know, consecutive pairs of two. So if we 365 00:16:29,779 --> 00:16:32,440 to_a this, now we have this array of arrays, 366 00:16:32,440 --> 00:16:34,180 which looks like exactly what we want. 
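
The normalization helper described above might look roughly like the sketch below. This is an assumption-laden stand-in, not the code from the talk: `normalize_for_ngrams` is a hypothetical plain function rather than a method patched onto String, `strip` plus `gsub(/\s+/, " ")` approximates ActiveSupport's `squish` so that no Rails dependency is needed, and punctuation is removed before the whitespace cleanup so that no double spaces are left behind.

```ruby
# Hypothetical plain-Ruby stand-in for the talk's String#normalized_for_ngrams.
def normalize_for_ngrams(text)
  text.downcase
      .gsub(/[^\w\s]/, "")  # drop anything that is not a word character or whitespace
      .strip                # remove leading and trailing whitespace
      .gsub(/\s+/, " ")     # collapse inner runs of whitespace into one space
end

words = normalize_for_ngrams("  This talk,   is boring!  ").split
p words
```

After normalizing, `split` with no arguments breaks the string on whitespace, which gives the array of words that the n-gram step works on.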
367 00:16:34,180 --> 00:16:36,870 This talk, talk is, and is boring. And so 368 00:16:36,870 --> 00:16:38,310 the last thing we can do there is we 369 00:16:38,310 --> 00:16:43,690 can just map that array to make these just 370 00:16:43,690 --> 00:16:44,190 phrases. 371 00:16:44,190 --> 00:16:46,860 So cool. So this is actually the entirety of 372 00:16:46,860 --> 00:16:49,820 our ngrams method, is just, you know, this code 373 00:16:49,820 --> 00:16:51,630 right here. So let's copy and paste this into 374 00:16:51,630 --> 00:16:56,500 the old method here. So we want. We're doing 375 00:16:56,500 --> 00:17:03,040 this on the abstract. Let's get some new lines 376 00:17:03,040 --> 00:17:04,079 here. 377 00:17:04,079 --> 00:17:09,839 All right, cool. So again, just to recap, you 378 00:17:09,839 --> 00:17:12,039 take the abstract, we normalize it, which means, you 379 00:17:12,039 --> 00:17:14,880 know, downcase and kill the punctuation. We split it 380 00:17:14,880 --> 00:17:17,289 to words. Uh, wait. Actually this should not be 381 00:17:17,289 --> 00:17:21,118 two. That should be n. And then we join 382 00:17:21,118 --> 00:17:24,220 those. So let's, let's see if this worked. 383 00:17:24,220 --> 00:17:31,220 So talk dot mine again. And one. OK. So 384 00:17:31,360 --> 00:17:32,769 here are all the one grams, which is just 385 00:17:32,769 --> 00:17:36,240 the sequence of words. And that looks correct. And 386 00:17:36,240 --> 00:17:41,869 all of the two grams. Also looks correct, I 387 00:17:41,869 --> 00:17:45,369 think. Yeah. To get, get a, yeah, OK, perfect. 388 00:17:45,369 --> 00:17:47,619 And so this is kind of the, the method 389 00:17:47,619 --> 00:17:50,690 we're gonna use to decompose these talks into just, 390 00:17:50,690 --> 00:17:53,799 you know, an array of words and phrases. And 391 00:17:53,799 --> 00:17:55,929 so what is the next step, now that we 392 00:17:55,929 --> 00:17:57,549 have this method? 
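
Put together, the method recapped above could be sketched as follows. The `Talk` struct here is a hypothetical stand-in for the Rails model, and the normalization is inlined and simplified; only `downcase`, `gsub`, `split`, `each_cons`, and `join` from core Ruby are used.

```ruby
# Hypothetical stand-in for the Rails Talk model; only the abstract matters here.
Talk = Struct.new(:abstract) do
  # Returns every run of n consecutive words in the abstract as a phrase.
  def ngrams(n)
    abstract.downcase
            .gsub(/[^\w\s]/, "")               # kill punctuation
            .split                             # array of words
            .each_cons(n)                      # every window of n consecutive words
            .map { |words| words.join(" ") }   # turn each window into a phrase
  end
end

talk = Talk.new("This talk is boring.")
p talk.ngrams(1)  # => ["this", "talk", "is", "boring"]
p talk.ngrams(2)  # => ["this talk", "talk is", "is boring"]
```

If n is larger than the number of words, `each_cons` yields nothing and the method returns an empty array, so short abstracts need no special handling.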
Well, the next step is we 393 00:17:57,549 --> 00:17:59,470 have to build these indexes that we're actually gonna 394 00:17:59,470 --> 00:18:03,659 use to look up, you know, the final results. 395 00:18:03,659 --> 00:18:05,139 And so for that, we're gonna use redis. 396 00:18:05,139 --> 00:18:07,179 Now, we don't have sort of enough time to 397 00:18:07,179 --> 00:18:10,990 really get totally into the details of redis. But, 398 00:18:10,990 --> 00:18:12,039 you know, the, the thing that we're really gonna 399 00:18:12,039 --> 00:18:14,759 use is the, the sorted set data structure, which 400 00:18:14,759 --> 00:18:16,440 I'd definitely encourage you to check out. It's a 401 00:18:16,440 --> 00:18:19,159 great data structure. Great feature of redis. And so 402 00:18:19,159 --> 00:18:20,210 what is a sorted set? 403 00:18:20,210 --> 00:18:22,730 Well, it's got the word set in it, so 404 00:18:22,730 --> 00:18:24,720 that tells you something. It's, you know, unique elements. 405 00:18:24,720 --> 00:18:27,059 And the, the neat feature of a sorted set 406 00:18:27,059 --> 00:18:28,990 is that each element in the set also has 407 00:18:28,990 --> 00:18:32,360 a score associated with it. So the way we 408 00:18:32,360 --> 00:18:34,669 can use this is, remember, again, the question I'm 409 00:18:34,669 --> 00:18:36,610 gonna answer is, like, you know, if someone searches 410 00:18:36,610 --> 00:18:38,559 for Ember, you know, how many times was Ember 411 00:18:38,559 --> 00:18:40,429 mentioned in 2007. How many times was it mentioned 412 00:18:40,429 --> 00:18:42,169 in 2008. How many times was it mentioned in 413 00:18:42,169 --> 00:18:42,659 2009? 
414 00:18:42,659 --> 00:18:44,610 So we're gonna have one sorted set for each 415 00:18:44,610 --> 00:18:47,700 year, where the members of the sorted set are 416 00:18:47,700 --> 00:18:50,259 all the words and phrases that appeared in RailsConf 417 00:18:50,259 --> 00:18:54,100 talks, and the scores are the number of times 418 00:18:54,100 --> 00:18:56,419 that those ngrams appeared. 419 00:18:56,419 --> 00:18:58,399 And then, you know, redis is very efficient about 420 00:18:58,399 --> 00:19:00,249 this zscore method. You can look up. It's like 421 00:19:00,249 --> 00:19:02,590 this command right here would say, OK, in the 422 00:19:02,590 --> 00:19:05,990 sorted set for 2014, get me the score associated 423 00:19:05,990 --> 00:19:09,249 with the member "ember". And that's gonna tell you, 424 00:19:09,249 --> 00:19:11,559 you know, some number. Like, three or whatever. That's 425 00:19:11,559 --> 00:19:14,340 the number of times it gets mentioned. 426 00:19:14,340 --> 00:19:15,840 So what we have to do is build these 427 00:19:15,840 --> 00:19:18,799 sorted sets. One for each year again. And again 428 00:19:18,799 --> 00:19:23,590 I have an empty method called generate_ngram_data_by_year. So iterate 429 00:19:23,590 --> 00:19:26,110 through all talks from a given year, you know, 430 00:19:26,110 --> 00:19:27,389 calculate the ngram counts and add it to the 431 00:19:27,389 --> 00:19:29,940 appropriate redis sorted set. So let's write that. 432 00:19:29,940 --> 00:19:32,450 So one thing we need to do is make 433 00:19:32,450 --> 00:19:34,460 sure we're not double counting. So if we have 434 00:19:34,460 --> 00:19:37,240 an old sorted set sitting around, let's delete it. 435 00:19:37,240 --> 00:19:40,210 So let's, redis.del year. We need to decide what 436 00:19:40,210 --> 00:19:43,460 values of n we're gonna use.
So let's just 437 00:19:43,460 --> 00:19:46,210 say one, two, and three, meaning we're gonna calculate 438 00:19:46,210 --> 00:19:48,190 all the one grams, two grams, three grams. Anything 439 00:19:48,190 --> 00:19:49,700 longer than that and it's sort of, like, what's 440 00:19:49,700 --> 00:19:51,740 even the point. You're getting into pretty specific sentences. 441 00:19:51,740 --> 00:19:53,110 There's not gonna be a lot of repetition. 442 00:19:53,110 --> 00:19:55,789 So now we need to iterate through each talk 443 00:19:55,789 --> 00:20:02,789 for the given years. Where(:year => year).find_each. And then 444 00:20:05,789 --> 00:20:07,860 for each talk we need to iterate through each 445 00:20:07,860 --> 00:20:14,330 value of n. And then for each value of 446 00:20:14,330 --> 00:20:15,610 n, what do we need to do? We need 447 00:20:15,610 --> 00:20:17,480 to calculate the ngram, so do talk dot ngrams. 448 00:20:17,480 --> 00:20:19,059 This is the method we just wrote. We're gonna 449 00:20:19,059 --> 00:20:19,989 pass it n. 450 00:20:19,989 --> 00:20:22,649 Do |ngram|. 451 00:20:22,649 --> 00:20:26,489 And then finally, we're going to add this to 452 00:20:26,489 --> 00:20:29,330 the relevant redis sorted set. So the command for 453 00:20:29,330 --> 00:20:30,049 that is redis.zincrby. 454 00:20:30,049 --> 00:20:34,669 And this goes, you give it a year, you 455 00:20:34,669 --> 00:20:38,769 give it a number, like one, and you give 456 00:20:38,769 --> 00:20:40,320 it what are you incrementing. 457 00:20:40,320 --> 00:20:42,779 OK. So let's look at this method now. We're 458 00:20:42,779 --> 00:20:45,019 gonna take, give it a year. We're gonna go 459 00:20:45,019 --> 00:20:48,419 through every talk from that year. We're gonna go 460 00:20:48,419 --> 00:20:50,629 through values of n, which is one, two and 461 00:20:50,629 --> 00:20:53,200 three, so let's say one, OK. Get the talk. 462 00:20:53,200 --> 00:20:55,289 Calculate all of its one grams. 
And then for 463 00:20:55,289 --> 00:20:59,149 each one gram, add to the year sorted set 464 00:20:59,149 --> 00:21:02,869 the value of one for that ngram. And then 465 00:21:02,869 --> 00:21:05,139 do that just a bunch of times. 466 00:21:05,139 --> 00:21:07,549 So let's see if this works. 467 00:21:07,549 --> 00:21:14,480 Let's reload. Again to prove I'm not lying. There's 468 00:21:14,480 --> 00:21:21,360 nothing in redis at the moment. Oops. Gotta do 469 00:21:21,360 --> 00:21:22,419 talk. 470 00:21:22,419 --> 00:21:29,419 Let's worry about those Delayed::Jobs. Perfect. Drink break. 471 00:21:30,419 --> 00:21:33,019 So it's going through each year now. And each 472 00:21:33,019 --> 00:21:34,559 talk in each year, counting up all the words 473 00:21:34,559 --> 00:21:39,489 and phrases and building our sorted sets. And it 474 00:21:39,489 --> 00:21:40,440 is done. 475 00:21:40,440 --> 00:21:43,049 So let's see what we got in here now. 476 00:21:43,049 --> 00:21:46,779 OK, cool. So we got these keys. Let's, let's 477 00:21:46,779 --> 00:21:48,039 look into one of these. One of the nice 478 00:21:48,039 --> 00:21:49,610 things about the sorted set is you can, of 479 00:21:49,610 --> 00:21:52,909 course, sort by it. And so the command here 480 00:21:52,909 --> 00:21:55,950 is zrevrange. So we can do the 2014 sorted 481 00:21:55,950 --> 00:21:58,869 set. So this is gonna give us the top 482 00:21:58,869 --> 00:22:01,470 ten, or actually eleven, top eleven, you know, ngrams 483 00:22:01,470 --> 00:22:03,909 in 2014. So let's see. 484 00:22:03,909 --> 00:22:09,090 And we can actually add :with_scores => true. So 485 00:22:09,090 --> 00:22:11,759 the most common words and phrases from 2014 RailsConf 486 00:22:11,759 --> 00:22:16,639 talk abstracts. Not very surprising. The, to, and, a, 487 00:22:16,639 --> 00:22:20,200 of, in, you, how. Rails. OK. Rails makes it at 488 00:22:20,200 --> 00:22:21,110 number ten. 489 00:22:21,110 --> 00:22:23,519 So there you go.
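The indexing loop and the top-N lookup walked through above can be sketched end to end without a Redis server. Everything here is a hedged stand-in: `Talk` is a plain Struct, not the Rails model, `FakeRedis` is a hypothetical class that imitates only `del`, `zincrby`, and `zrevrange` (always returning member and score pairs), and the method takes the client and the talks as arguments so the sketch is self-contained.

```ruby
# Stand-in for the Talk model (assumption): just a year and an abstract.
Talk = Struct.new(:year, :abstract) do
  def ngrams(n)
    words = abstract.downcase.gsub(/[^a-z0-9\s]/, "").split
    words.each_cons(n).map { |group| group.join(" ") }
  end
end

# Stand-in for the Redis client (assumption): only the commands used below.
class FakeRedis
  def initialize
    @sets = {}
  end

  def del(key)
    @sets.delete(key)
  end

  def zincrby(key, increment, member)
    set = (@sets[key] ||= Hash.new(0.0))
    set[member] += increment
  end

  # Highest scores first, inclusive indexes, returning [member, score] pairs.
  def zrevrange(key, start, stop)
    ranked = (@sets[key] || {}).sort_by { |member, score| [score, member] }.reverse
    ranked[start..stop] || []
  end
end

def generate_ngram_data_by_year(redis, talks, year)
  redis.del(year) # drop any old sorted set so re-runs do not double count
  talks.select { |talk| talk.year == year }.each do |talk|
    [1, 2, 3].each do |n|
      talk.ngrams(n).each { |ngram| redis.zincrby(year, 1, ngram) }
    end
  end
end

talks = [
  Talk.new(2014, "Rails is fun. Rails is fast."),
  Talk.new(2014, "Ruby developers love Rails."),
  Talk.new(2013, "Something else entirely.")
]
redis = FakeRedis.new
generate_ngram_data_by_year(redis, talks, 2014)
p redis.zrevrange(2014, 0, 2)
```

The indexes passed to `zrevrange` are inclusive, which is why asking for 0 through 10 in the demo returns eleven entries.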
490 00:22:23,519 --> 00:22:25,249 Now we can also, let's just have a little 491 00:22:25,249 --> 00:22:28,369 fun here. See what some of the sort of top 492 00:22:28,369 --> 00:22:30,480 non-trivial ones are. Obviously you could write some code, 493 00:22:30,480 --> 00:22:32,950 maybe kill stop words. Stuff like that. If you 494 00:22:32,950 --> 00:22:34,690 don't care about them. 495 00:22:34,690 --> 00:22:40,330 But, so. Rails. Can code. This talk. Most popular 496 00:22:40,330 --> 00:22:44,619 two-word phrase. Pretty good. How to. Ruby developers. Eh, 497 00:22:44,619 --> 00:22:46,399 this looks pretty, pretty relevant, right. I mean, these 498 00:22:46,399 --> 00:22:51,220 are not words you'd be surprised to see in 499 00:22:51,220 --> 00:22:53,289 a RailsConf talk abstract. 500 00:22:53,289 --> 00:22:56,220 So those, you know, are the most common words. 501 00:22:56,220 --> 00:22:57,289 So we now have this. We have this for 502 00:22:57,289 --> 00:22:58,509 every year, by the way. So we can also 503 00:22:58,509 --> 00:23:01,440 do something, this is the same thing for 2011. 504 00:23:01,440 --> 00:23:04,279 Whatever. And the last piece of code we're going 505 00:23:04,279 --> 00:23:05,739 to write, is we need to be able to 506 00:23:05,739 --> 00:23:06,769 query this data. 507 00:23:06,769 --> 00:23:08,940 So, you know, the actual, sort of, website or 508 00:23:08,940 --> 00:23:11,590 finished product, you're gonna have to, you know, search 509 00:23:11,590 --> 00:23:13,429 for a term. And you're gonna have to go 510 00:23:13,429 --> 00:23:15,739 look up in your data, you know, what, what 511 00:23:15,739 --> 00:23:19,340 are the relevant values for that term. 512 00:23:19,340 --> 00:23:21,299 And so, how we're gonna do this. Well, the 513 00:23:21,299 --> 00:23:23,499 first thing we gotta remember is that we normal- 514 00:23:23,499 --> 00:23:27,409 remember we did this normalize for ngrams thing.
So 515 00:23:27,409 --> 00:23:28,919 we have to do that again, because what if 516 00:23:28,919 --> 00:23:31,100 someone searches for a capitalized word or with something 517 00:23:31,100 --> 00:23:32,989 with punctuation. We have to process it the exact 518 00:23:32,989 --> 00:23:35,739 same way that we processed our input. Otherwise it 519 00:23:35,739 --> 00:23:38,889 won't match. So let's just do that. 520 00:23:38,889 --> 00:23:42,809 And then we have this constant ALL_YEARS. And we're 521 00:23:42,809 --> 00:23:45,950 gonna iterate through that with an object with a 522 00:23:45,950 --> 00:23:47,299 hash. Let's just build up a hash. That's probably 523 00:23:47,299 --> 00:23:51,999 the easy way to do it. Do |year, hash|. 524 00:23:51,999 --> 00:23:57,549 And the, the relevant redis command, again, is zscore. 525 00:23:57,549 --> 00:24:03,700 So we can do redis dot zscore(). We're gonna 526 00:24:03,700 --> 00:24:05,869 look up in the hash for that year, the 527 00:24:05,869 --> 00:24:08,470 term. And we need to put this actually in 528 00:24:08,470 --> 00:24:13,739 the hash. And so, and then we need to 529 00:24:13,739 --> 00:24:16,289 to_i that in case it's nil. 530 00:24:16,289 --> 00:24:19,100 OK. So this now, what does this say? ALL_YEARS 531 00:24:19,100 --> 00:24:22,859 is just, you know, 2007 through 2014. Go through 532 00:24:22,859 --> 00:24:25,889 each of those years. And then build up our 533 00:24:25,889 --> 00:24:27,609 hash so that the hash, the key of the 534 00:24:27,609 --> 00:24:30,499 year, maps to the value of, you know, the 535 00:24:30,499 --> 00:24:33,889 number of times that term appeared in that year. 536 00:24:33,889 --> 00:24:38,179 So let's, again, see if that works. Talk dot 537 00:24:38,179 --> 00:24:43,639 query, you know, ruby or something. Cool. So in 538 00:24:43,639 --> 00:24:47,330 2007 it was mentioned 52 times, 2014 22 times. 539 00:24:47,330 --> 00:24:50,230 Whatever. We can, I guess, we said Ember originally. 
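The query step just described (normalize the search term the same way as the input, then build a hash of year to count) might look like the sketch below. It is an illustration under assumptions, not the code from the demo: `FakeRedis` is a hypothetical stand-in that answers only `zscore`, the method takes the client as an argument, and the sample data reuses the "ruby" counts read out above plus a made-up "ember" score.

```ruby
ALL_YEARS = (2007..2014).to_a

# Stand-in for the Redis client (assumption): year => { ngram => score }.
class FakeRedis
  def initialize(data)
    @data = data
  end

  # Like ZSCORE: the score, or nil when the member is absent.
  def zscore(key, member)
    @data.fetch(key, {})[member]
  end
end

# Must match how the abstracts were processed, or lookups will miss.
def normalize(text)
  text.downcase.gsub(/[^a-z0-9\s]/, "").split.join(" ")
end

def query(redis, term)
  normalized = normalize(term)
  ALL_YEARS.each_with_object({}) do |year, hash|
    # nil.to_i is 0, so years where the term never appeared report zero
    hash[year] = redis.zscore(year, normalized).to_i
  end
end

# Illustrative data only; the "ember" score is invented for the example.
redis = FakeRedis.new({
  2007 => { "ruby" => 52.0 },
  2014 => { "ruby" => 22.0, "ember" => 3.0 }
})
p query(redis, "Ruby!")
p query(redis, "Ember")
```

Calling `to_i` on the result is what lets a missing member come back as 0 and keeps every year present in the hash, which is convenient when the result is handed to a chart.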
540 00:24:50,230 --> 00:24:54,309 And there you go. It was not mentioned until 541 00:24:54,309 --> 00:24:58,369 this year. Which is also kind of telling. 542 00:24:58,369 --> 00:25:01,690 And so this is basically, you know, all of 543 00:25:01,690 --> 00:25:04,100 the kind of step two code you need. That's 544 00:25:04,100 --> 00:25:06,840 sort of the ngram calculation, store the results. And 545 00:25:06,840 --> 00:25:09,840 again, I reiterate, like, everything we just did, is 546 00:25:09,840 --> 00:25:12,830 kind of trivially simple. There's no fancy algorithms. It's 547 00:25:12,830 --> 00:25:15,220 just counting, you know, putting stuff in the right 548 00:25:15,220 --> 00:25:17,169 data structure. Accessing it in sort of the right 549 00:25:17,169 --> 00:25:18,269 way. 550 00:25:18,269 --> 00:25:20,940 And I just think there's something like pretty, you 551 00:25:20,940 --> 00:25:23,179 know, insightful about that, that you don't need to 552 00:25:23,179 --> 00:25:26,389 do fancy things all the time. And that often 553 00:25:26,389 --> 00:25:28,590 the kind of the coolest results will come from 554 00:25:28,590 --> 00:25:30,749 something simple. 555 00:25:30,749 --> 00:25:31,769 And so, as I said, the last thing we're 556 00:25:31,769 --> 00:25:33,139 gonna do here is create this nice front end 557 00:25:33,139 --> 00:25:35,970 interface that lets us investigate the results. You know, 558 00:25:35,970 --> 00:25:37,989 unfortunately, we don't really have time to get into 559 00:25:37,989 --> 00:25:40,320 that. It is all on the GitHub. But, I 560 00:25:40,320 --> 00:25:42,940 will tell you, I use Highcharts as a 561 00:25:42,940 --> 00:25:46,100 nice library, front-end library that makes it very simple 562 00:25:46,100 --> 00:25:47,450 to get charts up and running. It's actually not 563 00:25:47,450 --> 00:25:48,419 that much code. 564 00:25:48,419 --> 00:25:49,889 And I've done this already. So let's start up 565 00:25:49,889 --> 00:25:54,039 a server.
And, oops. Let's fire up the localhost. 566 00:25:54,039 --> 00:25:58,950 And so here we are. The abstractogram is our 567 00:25:58,950 --> 00:26:00,009 app. So what are we, what are we gonna 568 00:26:00,009 --> 00:26:01,080 search for here? 569 00:26:01,080 --> 00:26:03,919 Let's see. I, you, we or something. And there 570 00:26:03,919 --> 00:26:05,330 we go. So there, there it is. The number 571 00:26:05,330 --> 00:26:08,730 of times the word "you" appears in each year. 572 00:26:08,730 --> 00:26:11,100 Looks pretty flat. So, you know, the, these are 573 00:26:11,100 --> 00:26:13,100 kind of constant. Anyone have any, anything else they 574 00:26:13,100 --> 00:26:15,539 want to search for? Let's try ember, backbone. 575 00:26:15,539 --> 00:26:19,369 All right. Let's say, we got, Postgres I heard. 576 00:26:19,369 --> 00:26:24,109 All right. I guess we could also say, let's 577 00:26:24,109 --> 00:26:28,639 say SQL. No one cares about Postgres this year. 578 00:26:28,639 --> 00:26:32,700 Service. SOA. Oh, there is sort of a rising 579 00:26:32,700 --> 00:26:35,850 trend of service-oriented architecture. 580 00:26:35,850 --> 00:26:36,320 Anything else? 581 00:26:36,320 --> 00:26:41,419 TDD. That's a good one. TDD. Testing. Test-driven, how 582 00:26:41,419 --> 00:26:48,419 about. So there we go. I'm sorry? 583 00:26:48,909 --> 00:26:53,739 REST. That's a tricky one though, 'cause rest is 584 00:26:53,739 --> 00:26:55,480 also like a real word that, you know, like, 585 00:26:55,480 --> 00:26:57,440 the rest of the time will be something else. 586 00:26:57,440 --> 00:27:04,149 And. Refactor. Let's see. Ooh. That's a good one. 587 00:27:04,149 --> 00:27:09,629 DHH. Wow. Peaked 2011, peak DHH. Let's see, we 588 00:27:09,629 --> 00:27:11,570 got, Heroku is a good one. On the rise. 589 00:27:11,570 --> 00:27:13,700 I like we can just look at Ruby and 590 00:27:13,700 --> 00:27:15,409 Rails. This is actually, I think, pretty relevant.
It's 591 00:27:15,409 --> 00:27:18,980 like, what are people talking about? Not Rails anymore. 592 00:27:18,980 --> 00:27:20,269 We got to find something new to talk about. 593 00:27:20,269 --> 00:27:22,730 You know, it's like, too many RailsConfs. And, in 594 00:27:22,730 --> 00:27:25,350 fact, this actually came up at the, you know, 595 00:27:25,350 --> 00:27:27,119 there was a speaker meeting, whatever, and everyone was 596 00:27:27,119 --> 00:27:29,489 talking about how, you know, their talks weren't actually 597 00:27:29,489 --> 00:27:30,600 about Rails. 598 00:27:30,600 --> 00:27:32,879 And, you know, maybe this is actually an insightful 599 00:27:32,879 --> 00:27:35,639 statement, that, you know, the, the community has obviously 600 00:27:35,639 --> 00:27:37,710 gotten very large and there's just a ton of 601 00:27:37,710 --> 00:27:38,350 other stuff to talk about. People have been talking 602 00:27:38,350 --> 00:27:41,299 about Rails for a long time. And so, you 603 00:27:41,299 --> 00:27:42,909 know, here I am giving a talk that's not 604 00:27:42,909 --> 00:27:46,059 really directly about Rails. But, so maybe this is 605 00:27:46,059 --> 00:27:47,369 like a real trend that people are just finding 606 00:27:47,369 --> 00:27:49,039 other stuff to talk about. 607 00:27:49,039 --> 00:27:53,080 And that is pretty cool. So I promised that 608 00:27:53,080 --> 00:27:56,470 I would show you the repo or whatever on 609 00:27:56,470 --> 00:27:59,609 GitHub. You can just do bit.ly slash railsconfdata. It's 610 00:27:59,609 --> 00:28:02,059 just the code. Everything we've looked at today. Plus 611 00:28:02,059 --> 00:28:04,419 some more stuff. It's actually running live on the 612 00:28:04,419 --> 00:28:07,399 internet at abstractogram dot herokuapp dot com. 613 00:28:07,399 --> 00:28:09,679 I figure the internet's probably not working, but let's 614 00:28:09,679 --> 00:28:16,679 see. Yup. Classic. 
And, you know, otherwise that is 615 00:28:16,809 --> 00:28:19,649 it. And thank you for listening. And I think 616 00:28:19,649 --> 00:28:20,450 we have time for questions.