STARR HORNE: So thank you guys for coming. This is gonna be an awesome talk. Quite possibly the best talk you'll ever hear in your life. I actually have a built-in escape hatch here, so if things start going wrong, I might just bail. So, anyway. A couple weeks ago I was having a panic attack about this presentation, because I'm a programmer, and my natural habitat is some sort of dark place looking at a screen. It's not talking to a bunch of people with, like, lights. What's up with this? So, in the middle of this panic attack, I went out on a walk, because that always clears my mind, and I was listening to a podcast. My favorite podcast. It's called Hardcore History. I highly recommend it. Oh, somebody likes it. Yeah! So this particular episode of Hardcore History was about the Punic Wars. And in case you haven't heard the story, I'll give you the brief rundown. Basically, a long, long time ago, there was this place called Rome. And they decided that they should take over the world. And, for the most part, they did. But there was this one country that was a thorn in their side for a long time, called Carthage. And they fought all these wars back and forth. It was kind of a stalemate. And then one day, when, I don't know, maybe he was making a sacrifice to the fire god or whatever, this guy named Hannibal had this great idea. He was like, I'm gonna lead a sneak attack on Rome, and I'm gonna do it by marching my army through the Alps. Which is pretty cool. That's pretty badass. But the most awesome part of the story, for me at least, is that the dude had war elephants. I don't know if you can see it in the slide, but there's actually a war elephant. It's kind of under the G. And a lot of people have doubted this story over the years. And so, in 1959, there was a British engineer who, on a bar bet, borrowed a circus elephant named Jumbo and marched it across the Alps, too. So, I don't know what my point is here, really. I guess: don't underestimate elephants, because they don't like it and they have long memories. So, OK. This talk is really about biggish data. So what the hell is biggish data? It's not big data. This talk isn't gonna be about Hadoop clusters and server farms and all that stuff. That's way over my head. It's not about fancy architectures. I'm not gonna show you how to make multi-write Postgres clusters that do automatic sharding and stuff. I mean, that's all wicked cool, but it's not what we're gonna talk about. Instead, I'm gonna talk about something that I think is more practical, and more interesting to me, and that is: how do you keep your app working as your production dataset grows? And grows into biggish data territory? This is very easy to have happen, even if you don't have a firehose of data coming in. If you run a popular e-commerce site, or just a site with a lot of users, over the course of years you can accumulate a ton of data. And as this data accumulates, you find that your site performance goes downhill, gradually, in ways that you don't understand. So, what is biggish data? This talk is based on my experience at Honeybadger.
In case you haven't heard of us, we're an exception, performance, and uptime monitoring company. That means we essentially have a firehose of new data coming in every day. Right now we have about a terabyte of errors in our database. The world has a lot of errors. You guys need to start doing a better job. And we get about two gigabytes of new errors per day, roughly. All this goes into a plain vanilla Postgres database that backs a pretty plain vanilla Rails application. Well, the good news is that Postgres can handle this pretty easily. Postgres has got it covered. The bad news is that, unless you've engineered your app to deal with this level of data, you're kind of screwed. And the reason is that a hundred-megabyte database behaves fundamentally differently than a one-terabyte database. A lot of the conventions, a lot of the normal things we do in Rails apps, just stop working when you have this much data. Yup. This is actually a live feed. This query's been running since this morning. Something I learned: pretty much every pagination system breaks at, like, page 2,000. And even if you're like, dear God, make this stop, I just want to delete half my data, you're still screwed, because deleting data takes a lot of time. Now, I'm gonna explain all of this. I'm gonna explain why this happens, how you can work around it, and how you can optimize the queries in your own database and optimize your stack. But to really understand this, we've got to take a little trip back in time, to the summer of 1978. The summer of disco, free love, and the VT100 terminal. The VT100 was the first sort of computer-shaped thing that you could buy from an actual company and sit on your desk to prove that you were the alpha nerd of the office. It was also the time that Oracle 1.0 was being developed, which was one of the first databases that used the SQL language. It was written in PDP-11 assembly language, which is pretty cool. And in case you're too young to have ever seen a PDP-11, this is what a PDP-11 looks like. Yeah. Show some love for the PDP-11. Now, this is pretty awesome. Nowadays, in our modern-day times, right this second, marketers down in the, what do they call it, the exhibit hall are throwing around all sorts of words about cloud computing and platform-as-a-service and, I don't know, some stuff I haven't even heard of, probably. And back then it was the same, but the buzzwords were different. The buzzword on everybody's lips at that time was "real-time" computing. And I'm using air quotes around "real-time," because that meant you could have your report generated in five minutes instead of five days. Which I'm sure was pretty awesome. Like, I would have been really happy about that. But, you know, looking at this, it doesn't really seem like the sort of thing I would want to build a web app on top of, right? I wouldn't stick Honeybadger on that. But a funny thing happened in the past thirty-six years. What happened is that Moore's Law made this into a web stack. Now, I know that everybody here knows vaguely what Moore's Law is. But I thought it would be good to go over the technical, precise definition of Moore's Law. And that is that computers get more awesome as time goes forward.
And awesomeness is generally defined as, well, technically, being able to do more stuff faster. Being able to process more data faster. So, based on this, I would like to postulate, for the first time in public (you guys are very lucky to be able to hear this), Starr's Corollary to Moore's Law, which is: as database growth outpaces Moore's Law, you literally travel back in time. That was the TED moment of this talk. I can hear your minds being blown. It sounds like bubble wrap popping. So, based on my extensive research of time-travel movies, the first rule for surviving your new time period is that you need to understand the methods and the motivations of the people in that time period. We need to start thinking like our ancestors. And our ancestors were very interested in this thing called hardware. Well, I don't mean to be flip, but I want to bring this up first, because if you happen to be having database scaling issues right now, if your app is getting slower because the amount of data in your database is getting bigger, and your app happens to live anywhere named EC2, Linode, or whatever other companies offer that, you can probably solve your scaling issue right now just by going and buying a real damn computer. Yeah. You'll be shocked and amazed. Because the two things that a database needs to work at high performance with lots of data are lots of RAM and really fast disk I/O. And virtual servers give you neither of those things. So just go buy a real damn computer. And while you're at it, throw in a bunch of disks. I'm not just talking about a RAID array. You probably want your operating system, your Postgres data, and your Postgres log file each on separate hard drives. That just lets the operating system schedule disk I/O more efficiently. So if that solves your problems, great. You're welcome. If you guys want to leave the room now, that's fine. I won't be offended. You can go call your local colo facility and work something out. If that doesn't solve your problem, or even if it does, you probably want to look at your queries next. Now, most of us develop against smallish datasets. And when you develop against a small dataset, you don't notice inefficient queries. It's just life. Books have been written about query optimization. It's a huge topic, and I can't explain everything in thirty minutes. So I'm just going to explain one thing, and that's EXPLAIN. Fortunately, Postgres gives us an awesome tool called EXPLAIN: you pass it a query and it gives you back a query plan. Oh, wait a second. That's a chapter from the Iliad. Sorry. It gives us a query plan, which still kind of looks like a chapter from the Iliad. But, fortunately, for this talk we only have to look at one metric: rows. And here we have an example of a very efficient query. It's gonna use an index, and it's gonna look at one row. That's pretty sweet. Very fast. This is the type of query that works with biggish data. But there's one thing you need to know about rows. And that's that, obviously, the more rows you're dealing with, the more data your computer has to crunch to give you the answers you want.
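To make that concrete, here's a minimal sketch of what that looks like, assuming a hypothetical errors table with an index on id. The cost numbers are made up for illustration; the shape of the plan, and that rows estimate, are the things to notice:

```sql
-- Ask Postgres how it plans to run the query, without running it.
EXPLAIN SELECT * FROM errors WHERE id = 12345;

-- A plan like this is what you want to see: an index scan that
-- only has to touch a single row (numbers here are illustrative).
--
--  Index Scan using errors_pkey on errors  (cost=0.57..8.59 rows=1 width=104)
--    Index Cond: (id = 12345)
```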
And so the whole name of the game in query optimization is to limit the number of rows that you have to touch. Let's go back to that inefficient count query. If you run EXPLAIN on it, it turns out that when you ask Postgres to count all of the rows in a table, it literally counts all the rows in the table, one by one. And so you wind up with one of these things. It's not pretty. And it's often hard to know when stuff like this is gonna crop up. For example, I mentioned that most pagination systems break at page 1,000 or 2,000. That's because the OFFSET and LIMIT operators work a lot like count: if you do an offset of 500,000, Postgres is going to count up 500,000 rows, and then if you have a limit of 100, it's gonna count up another 100. And so what you see is that pages one, two, and three load extremely quickly. And so you as a developer, and when I say you I mean me, because I actually fell into this trap, test it with one or two pages and it works fine, so you move on, and then later on your customers are complaining because they can't access page 500. It's timing out. The solution to this is to use a range query. Range queries are very fast, and I'm gonna give you a really stupid example here (there's a sketch of both versions at the end of this section). Here's a link with a much more detailed analysis of all this. But essentially, if you do a range query, you're gonna be able to use an index. It's gonna touch 100 rows and it's gonna be really fast. Now, I know what you guys are thinking: that's not the same as an offset/limit query. And that's the reason I gave you the link in the previous slide. Sorting. Sorting is tricky. Sorting is the devil. Sometimes it's super fast: if you happen to be asking for data sorted in exactly the same way that an index is sorted, it'll be super fast. Other times, even if you have an index on a column, if it's not set up right or if you're asking for the data in a slightly unusual way, you'll wind up sorting the whole damn dataset. It's no good. So here's your rundown for query optimization. You want to develop against a real dataset, because otherwise you won't know when things are going wrong until they go wrong in production. Use EXPLAIN. Use it a lot. And the whole name of the game here is to limit the number of rows that you touch. Because the more rows you touch, the slower your queries are. You know, in general.
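Before we move on, here's that promised sketch of the two pagination approaches, again against a hypothetical errors table. Run EXPLAIN on both against your own data and watch the rows estimates:

```sql
-- OFFSET pagination: to build "page 5,001", Postgres still has to
-- walk and throw away 500,000 rows before returning your 100.
SELECT * FROM errors ORDER BY id LIMIT 100 OFFSET 500000;

-- Range (keyset) pagination: remember the last id you rendered on
-- the previous page and seek straight past it via the index, so
-- Postgres touches roughly 100 rows instead of 500,100.
SELECT * FROM errors WHERE id > 500000 ORDER BY id LIMIT 100;
```

The trade-off, as mentioned, is that you page by "everything after the last id I saw" rather than by an arbitrary page number.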
So, I don't know about you, but that was a lot of cognitive load happening right up here. So I want to give you guys a cookie. For the second half of this talk, we're gonna cover a lot of issues that relate to the infrastructure around biggish data, around big datasets and firehoses of data coming in. And here's a legend of how we're gonna approach this. We're gonna have a cute picture that hopefully symbolizes something about the topic, we're gonna have the name of the topic, and then we're gonna have a link with more info about the topic, if you want to find it later. And you don't have to write all these links down right now. At the end of the presentation, on the very last slide, I'll have a URL where you can get a list of all of them. And, yeah, I promise this isn't malware. You just gotta take my word on that. Our ancestors were really interested in disk operating systems. It seems like they were building a new disk operating system, like, every two seconds. What's up with that? I don't know. Now we have, like, three. The first thing that you should do if you're using Linux is to increase your read-ahead cache. Read-ahead is something that, I've found, not a ton of people know about, but it's a super easy way to get up to a doubling of your read throughput. Essentially what this means is that Linux examines the requests that you make to the disk, and if it sees that you're asking for lots of blocks that come right after one another in a row, it preloads the next set of blocks into RAM. The default for this is something like 256K of RAM used to prefetch these blocks. If you bump that up to, like, two megs or four megs, you'll get a really big increase in read performance. Use a modern file system. That means ext3 is not an option. If you want to know why, check out the link. And if you happen to be using ext4 or XFS (I can never say that unless I say it super slow), you might want to look into journaling settings. If you have your database on a completely separate hard drive, and it's running ext4 with full data journaling enabled, then, since Postgres does its own journaling, you've got an inefficiency there: two things journaling. Next, you gotta tell Postgres about all the RAM that you bought in that fancy new box. A really easy way to set this up is to use the pgtune script. It examines your computer and writes a Postgres configuration file with some pretty reasonable values in it. And you can tweak them from there. This stuff is all really easy to find. And then, finally, the bane of DB administrators is the vacuum command. Postgres needs to vacuum, because it's messy. When you run queries, when you delete things, when you update things, it leaves a lot of stuff undone. And it does that in the name of speed: it only does what's necessary at query time to get you an answer to your query, and then it uses vacuum to go and clean up after itself. The problem is that vacuum can be really resource-intensive. And so if your server is under a lot of load, and you see that vacuum is also causing a lot of load, you may be tempted to turn off vacuum, or to make autovacuum happen maybe once a night or something. And that's generally a bad idea. We actually got bitten by this ourselves, so that's why I bring it up. Usually the answer to vacuum problems is to vacuum more often, not less often.
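For what it's worth, here's a rough sketch of the kinds of settings involved, both the memory settings and the autovacuum knobs. The values are made up for illustration; pgtune's output for your actual hardware is the real starting point. Also, ALTER SYSTEM needs Postgres 9.4 or later; on older versions you'd put the same settings in postgresql.conf:

```sql
-- Tell Postgres about the RAM in that real damn computer.
-- (shared_buffers only takes effect after a restart.)
ALTER SYSTEM SET shared_buffers = '8GB';
ALTER SYSTEM SET effective_cache_size = '24GB';

-- Vacuum more often, not less: trigger autovacuum after a smaller
-- fraction of a table has changed, and let it do more work per pass.
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.02;
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;

-- Pick up the reloadable settings without a restart.
SELECT pg_reload_conf();
```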
All right. So. Velocity. I really wanted, but didn't have time, to have a little Tron guy on a motorcycle going down the little grid there. Eh. Now we're gonna talk about a couple things that are important if you have a ton of data coming in, or a ton of read requests, a ton of queries coming in. The first is too many database connections. Each database connection in Postgres is its own process, and each process has its own RAM overhead. So there's a limited number of connections that you want to have to your database server. If you have, I don't know, a thousand workers and web app processes and all this, you don't want to open a thousand database connections, and you probably already know about this. The solution is to pool connections. There are ways to do this in Ruby. There's also, if you're interested in a more ops-y approach, something like PgBouncer, which is a proxy that sits in between your Ruby app and your database and functions as a connection pool. You can also run into problems with too many locks. And this is the sort of problem that you don't really ever run into if you don't have a firehose of data coming in. I don't expect you to know everything about locks (it's a pretty complex topic), but you should know that, within a transaction, if you go to update a row, that transaction is gonna put a lock on the row until it's done. It's gonna say that nothing else can write to that row until it's done. And, you know, that makes sense. But here's where this can bite you. Imagine you have two Rails models, a parent model and a child model, and the parent model has a counter cache that gets incremented every time you add a child. Normally, this is no big deal. People do this sort of thing all the time. But if something crazy happens and someone bombards your API and suddenly you have, like, fifteen thousand children created, you're gonna have some locking issues. Because what's gonna happen is your first query is going to execute fine. The child's gonna get created. It's gonna increment the counter. Everything's gonna be fine. And while it was doing that, it put a lock on that row. So once the first query's done, the lock is removed, and the second query happens. And if you had infinite time, this would all be fine. It would all eventually work itself out. But what happens is that by the time you finish query 100, query 15,000 has timed out, which causes all sorts of fun in your Unicorns and your Sidekiqs and all that. It's a huge pain in the neck. (There's a little SQL sketch of this pile-up after this section.) And the way you avoid it is architectural: you just have to avoid any situation where you could be updating the same record in the database, like, a zillion times per second. Intensive database queries. Sometimes we have our production database, and we need that to be very, very performant for our users, but we also need to maybe do some preprocessing on data as it comes in. A really simple way to do this is to use Postgres's streaming replication facilities to create a read-only replica. Then you just do your intensive queries against that, and it doesn't affect your users. It's super simple. I kind of feel dumb even talking about it here. But there you go. Partitioning is awesome. Partitioning is, like, the best thing ever, because it allows you to dramatically speed up data culling and data archiving. What you can set up in Postgres is a partitioning scheme in which data for different days goes into different physical database tables. And, you know, that's all right. But the really cool thing is that you access that data by querying a parent, sort of virtual, table, which propagates your query across all of its children, magically aggregates the results, and spits them back to you. So you don't have to update your Ruby code at all, which is sweet. And this way, if you want to, say, delete data that's 180 days old, you just drop the table associated with the day 180 days ago. If you were to do a DELETE WHERE created_at is less than 180 days ago, you'd probably be waiting for weeks.
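Here's a stripped-down sketch of what that scheme looks like, using the inheritance-based partitioning Postgres had at the time (newer Postgres, 10 and up, has nicer declarative partitioning). Table and column names are hypothetical, and a real setup also needs a trigger on the parent to route inserts to the right child; the link has the full story:

```sql
-- The parent, "virtual" table that your Ruby code talks to.
CREATE TABLE errors (
  id         bigserial,
  created_at timestamptz NOT NULL,
  payload    text
);

-- One physical child table per day. The CHECK constraint lets the
-- planner skip days that can't match (with constraint_exclusion on).
CREATE TABLE errors_2014_04_25 (
  CHECK (created_at >= '2014-04-25' AND created_at < '2014-04-26')
) INHERITS (errors);

-- Culling a day of old data is now a metadata operation, not a
-- weeks-long DELETE: just drop the child table.
DROP TABLE errors_2014_04_25;
```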
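And circling back to that counter-cache pile-up for a second, here's roughly what it looks like at the SQL level, with hypothetical parents and children tables. This is more or less what Rails is doing on your behalf under the hood:

```sql
-- Session A:
BEGIN;
INSERT INTO children (parent_id) VALUES (1);
UPDATE parents SET children_count = children_count + 1 WHERE id = 1;
-- ^ takes a row-level lock on parents row 1, held until COMMIT.

-- Session B, at the same time:
BEGIN;
INSERT INTO children (parent_id) VALUES (1);
UPDATE parents SET children_count = children_count + 1 WHERE id = 1;
-- ^ blocks here, waiting on session A's lock. Stack up 15,000 of
--   these and the ones at the back of the line time out.

-- Session A:
COMMIT;  -- only now can session B's UPDATE proceed.
```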
And last, but not least, backups are a real pain in the neck when you have a huge dataset. That cron job that dumps your entire database and uploads it to S3 doesn't work so well when your database is, like, two terabytes big. And, yeah, that's a big S3 bill, too. Fortunately, we can take advantage of the same machinery that lets Postgres do streaming replication to do, sort of, on-the-fly incremental backups. And there's a tool called WAL-E which makes this super easy. It's really cool: it makes it very easy to upload incremental backups to, say, S3, and then when you want to restore your database, it makes it really easy to restore to a specific point in time. WAL-E is really awesome. I love it. Now, I stayed in my hotel room last night. I didn't get to go to the speakeasy thing, which kind of bummed me out, but I had to work on these slides for you people. And when I got done with this, when I reached this point, I looked back on all of these slides (there are like fifty-four of them) and I was like, holy crap. This is a lot of information. But the thing I want to stress is that, at least in my experience, these issues tend to come at you one at a time. This isn't the sort of thing where you have to know all of this stuff in advance in order to deal with biggish data in Postgres. Things come at you one at a time, and you can deal with them one at a time. And I have faith in you. I think you can do it, because you're awesome, because nobody told us that we could turn transistors into LOLcats, but we did it. That's the type of people we are, and that's why I'm proud to be at RailsConf. If you're interested in talking to me about this, or if you think I'm full of shit about anything (which I probably am, on at least one point), just say so. Feel free to come up to me after the conference, or after my talk here. I have delicious hot and spicy Mexican candy as an incentive. So there you go. Do you want to take a picture of the Mexican candy? OK. I can send it to you. If you want to learn more, all of the links referenced in this talk are at this URL. And if you're interested in having more visibility into your production errors, check out Honeybadger, because, yeah, we love Rails devs. And that's it. It says "end of show" here, so I guess it must be the end of show.