STARR HORNE: So thank you guys for coming.
This is gonna be an awesome talk.
Quite possibly the best talk you'll ever hear
in your life. What? I actually have
a built-in escape hatch here,
so if things start going wrong,
I might just bail. Yeah.
So, anyway. A couple weeks ago, I was, well,
I was sort of having a panic attack about
this presentation, because I'm a programmer,
and my natural
habitat is in some sort of dark place looking
a screen. It's not, it's not talking to a
bunch of people with, like, lights. What's
up with
this?
So I, I was in the middle of this
panic attack and I went out on a walk,
cause that always clears my mind. I was listening
to a podcast. My favorite podcast. It's called
Hardcore
History. I highly recommend it. Yeah. Oh.
Somebody likes
it. Yeah!
So this, this particular episode about Hardcore,
in, of
Hardcore History was about the Punic Wars.
And in
case you haven't heard about the Punic Wars,
don't
know the story, I'll just give you, like,
the
brief rundown. Basically, a long, long time
ago, there
was this place called Rome. And they, they
decided
that they should take over the world. And,
for
the most part, they did. But there was this
one country that was a thorn in their side
for a long time called Carthidge. And they
fought
all these wars back and forth. It was kind
of a stale mate.
And then one day, when, I don't know, maybe
he was making a sacrifice to the fire god
or whatever, this guy named Hannibal had this
great
idea. He was like, I'm gonna, I'm gonna lead
this sneak-attack on Rome, and I'm gonna do
so
by marching my army through the Alps. Which
is,
is pretty cool. I mean, that, that's pretty
bad
ass. But I, I think the most awesome part
of the story, for me, at least, is that
the dude had war elephants. So, I, I don't
know if you can see it in the slide,
but there's actually a war elephant. It's,
it's kind
of, I don't know, it's kind of under the
G. And a lot of people have doubted this
story, you know, over the years. And so, in
1959, there was a British engineer who, on
a
bar bet, borrowed a circus elephant named
Jumbo and
marched it across the Alps, too.
So, I don't know what my point is here,
really. I guess don't underestimate elephants,
because they don't
like it and they have long memories.
So, OK. This talk is really about biggish
data.
So what, what the hell is biggish data. It's
not, what is? Whoa, I've got one of these
things. It's not big data. It's not about,
you
know, this talk isn't gonna be about Hadoop
clusters
and server farms and all that stuff. That's,
that's
way over my head.
It's not about fancy architectures. I'm not
gonna show
you how to make multi-right PostGres clusters
that do
automatic sharding and stuff. I mean, that's
all like
wicked cool, but it's not what we're gonna
talk
about.
Instead, I'm gonna talk about something that
I think
is, is more practical, and it's kind of more
interesting to me, and that is, how do you
keep your app working as your production dataset
grows?
And grows into biggish data territory?
And this is very easy. This is very easy
to happen. Even if you don't have a firehose
of data coming in. If you run a, a
popular e-commerce site or just a site with
a
lot of users, over the course of years you
can accumulate a ton of data. And as this
data accumulates, you find that your site
performance goes
downhill sort of gradually in ways that you
don't
understand.
So. What is biggish data? This talk is based
on my experience at Honeybadger. In case you
haven't
heard it, of us, we are an exception performance
and uptime monitoring company.
That means that we essentially
have a firehose of new data coming in every
day. Right now we have about a terabyte of
errors in our database. The world has a lot
of errors. You guys need to start, like, doing
a better job.
And, and we get about, we get two gigabytes
of new errors per day, roughly. And all this
goes into a plain vanilla PostGres database,
and it's
served, that backs a pretty plain
vanilla Rails application.
Well, the good news is that PostGres can handle
this pretty easily. PostGres has got it covered.
The bad news is that, unless you've engineered
your
app to deal with this level of data, you're
kind of screwed. And the reason is that a
hundred megabyte database behaves
fundamentally different than a one
terabyte database. And a lot of the conventions,
a
lot of the normal things we do in Rails
apps just stop working when you have this
much
data.
Yup. This is actually a live feed. This queries
been running since this morning.
Something I learned, pretty much every pagination
system breaks
on page, like, 2000. Even if you, you're like,
dear God, make this stop, I just want to
delete half my data. You're, you're just still
screwed,
because deleting data takes a lot of time.
Now, I'm gonna explain all of this. I'm gonna
explain why this happens, how you can work
around
it, and how you can optimize the queries in
your own database and optimize your stack.
But, to
really understand this, we've got to take
a little
trip back in time, to the summer of 1978.
The summer of disco, free love, and the VT-100
terminal. The VT-100 was the first sort of
computer-shaped
thing that you could buy from an actual company
and sit on your desk to prove that you
were the alpha nerd of the office. It was
also the time that Oracle 1 point 0 was
being developed, which was one of the first
databases
that used the SQL language. It's pretty cool.
It
was written in PDP-11 assembly language.
And, and in case you're like too young to
have ever seen a PDP-11, this is what a
PDP-11 looks like. Yeah. Give it, show some
love
for the PDP-11.
Now, this is pretty awesome. Nowadays, in
our, like,
modern day times, right this second, marketers
down in
the, the presentat- what, what do they call
that?
The exhibit hall. Are, right now, throwing
around all
sorts of words about cloud computing and platform
as
a service and, and I don't know, some, some
stuff I haven't even heard of, probably. And
back
then it was the same, but the, the buzz
words were different. The buzz words on everybody's,
the
buzz word on everybody's lips, at that time,
was
real time computing.
And I, I'm using air quotes around real time,
because that meant that you could have your,
your
report generated in five minutes instead of
five days.
Which I'm sure was pretty awesome. Like, I
would
have been really happy about that. But, you
know,
looking at this, it doesn't really seem like
the
sort of thing I would want to build a
web app on top of, right?
Like, I. Yeah. I, I wouldn't stick Honeybadger
on
that. But, a funny thing happened in the past
thirty, thirty-six years. A funny thing happened
is that
Moore's Law made this into a web stack. Now,
I know that everybody here knows sort of,
vaguely,
what Moore's Law is. But I thought it would
be good to go over the technical, precise
definition
of Moore's Law.
And that is that, whoa. That is that computers
get more awesome as time goes forward. Did
that
come back normally? Yeah. As time goes forward.
And
awesomeness is generally defined as, well,
technically, being able
to do more stuff faster. Being able to process
more data faster. So, based on this, I would
like to postulate, for the first time in public
- you guys are very lucky to be able
to hear this - the first time in public,
Starr's Corelary to Moore's Law, which is,
as database
growth outpaces Moore's Law, you literally
travel back in
time.
That was the Ted moment of this, of this
talk. It's like, I can hear you guys, your
minds being blown. It sounds like bubble wrap
popping.
So, based on my, my extensive research of
time-travel
movies, the, the first rule for surviving
your new
time period is that you need to understand
the
methods and the motivations of the people
in that
time period. We need to start thinking like
our
ancestors. And our ancestors were very interested
in this
thing called hard ware.
Well, I, I don't mean to be flip, but
I just want to bring this up first, because
if you happen to be having database scaling
issues,
like, if your app right now is getting slower
because the amount of data in your database
is
getting bigger, and your app happens to live
anywhere
named EC-2, Linode, I don't know, whatever,
whatever other
companies offer that, you can probably solve
your scaling
issue just, like, right now just by going
and
buying a real damn computer.
Yeah. You'll be shocked and amazed. Because
the two
things that databases need to perform well,
to work
at high performance with lots of data, is
they
need lots of ram, and they need really fast
disc io. And virtual servers give you neither
of
those things. So, just go buy a real damn
computer. Yeah. And, and while you're at it,
throw
in a bunch of discs.
I'm not talking about a radar ray. You probably
want each of your, you probably want each
of
your operating system, your PostGres data,
your PostGres log
file, you probably want all that stuff to
be
on separate hard drives. And that's just gonna
make
the operating system able to more efficiently
sort of
schedule disc io.
So, if, if that solves your problems, you
know,
great. You're welcome. If you guys want to
leave
the room now, that's fine. I won't be offended.
You can go call your local Colo facility.
Work
something out. If, if that doesn't solve your
problem,
or even if it does, you probably want to
look at your queries next.
Now, most people, most of us develop - oops.
Hey. Most of us develop against smallish datasets.
So,
when you develop against a small dataset,
you don't
notice inefficient queries. It's just life.
And books have
been written about query optimization. It's,
it's a very
huge topic, and I can't ev, explain everything
in
thirty minutes. So I'm just going to explain
one
thing, and that's called a, a explain.
Fortunately, PostGres gives us an awesome
tool called Explain,
which basically you pass it a query and it
gives us a query plan. Oh it, wait a
second. That's a, that's a chapter from the
Iliad.
Sorry. It gives us a query plan, which still
kind of looks like a chapter from the Iliad.
But, fortunately, we only have to, to look
at
one metric here. The only thing we need to
worry about is rows. For this talk.
And here we have an example of a very
efficient query, right. It's gonna use an
index, and
it's gonna look at one row. That's pretty
sweet.
Very fast. This is the type of query that
biggish data works with.
But there's one thing you need to know about
rows. And that, obviously, the more rows you're
dealing
with, the more data your computer has to crunch
to give you the answers you want. And so
the whole, the whole name of the game in
query optimization is to limit the number
of rows
that you have to touch.
Let's go back to that inefficient count query.
So
if you run Explain on that, it turns out
that when you ask PostGres to count all of
the tab, all of the rows in a table,
it actually literally counts all the rows
in the
table, one by one.
And so you wind up with a, one of
these things. It's not pretty. And it's often
hard
to know when stuff like this is gonna crop
up. For example, I mentioned that most pagination
systems
break at page 1000 or 2000. And that's because,
well, the offset and limit operators work
a lot
like count, in that, if you do an offset
of 500,000, PostGres is going to count up
500,000
rows, and then if you have a limit of
100, it's gonna count up another 100. And
so
what you see is that pages one, two, and
three load extremely quickly. And so you,
as a
developer, are actually, when I say you I
mean
me, cause I actually, I actually fell into
this
trap. You test it with, like, one or two
pages and it works fine, so you move on
and then later on your customers are complaining
because
they can't access page 500. It's timing out.
The solution to this is to use a range
query. Range queries are very fast, and I'm
gonna
give you a really stupid example here. Here's
a
link with a much more detailed analysis of
all
this. But essentially, if you do a range query,
you're gonna be able to use an index. It's
gonna touch 100 rows and it's gonna be really
fast.
Now, I know what you guys are thinking. That's
not the same as an offset limit query. And
that's the reason I gave you the link in
the previous slide, so.
Sorting. Sorting is tricky. Sorting is the
devil. Sometimes
it's super fast. Sometimes if you happen to
be
asking for data sorted in exactly the same
way
that an index is sorted, well, it'll be super
fast. Other times, even if you have an index
on a column, if it's not set up right
or if you're asking for the data in a
slightly unusual way, you'll wind up sorting
the whole
damn dataset.
It's no good. So here's your rundown for query
optimization. You want to develop against
a real dataset,
because otherwise you won't know when things
are going
wrong until they go wrong in production. Use,
use
Explain. Use it a lot. And the whole name
of the game here is to limit the number
of rows that you touch. Because the more rows
you touch, the slower your queries are. You
know,
in general.
So, I, I don't know about you, but that
just like, that just was a lot of cognitive
load happening, right up here. So I want to
give you guys a cookie.
For the second half of this talk, we're gonna
cover a lot of issues that relate to kind
of the infrastructure around biggish data,
around big datasets
and firehoses of data coming in. And here's
a,
a legend of, of how we're gonna approach this.
We're gonna have a cute picture that, hopefully,
sort
of symbolizes something about the, the topic.
We're gonna
have the name of the topic, and then we're
gonna have a link with more info about the
topic if, if you want to, you know, find
this later.
And you don't have to write all these links
down right now. At the end of the presentation,
at the very last slide, I'll have a url
where you can get a list of all of
them. And, yeah, and I promise this isn't
malware.
So, you just gotta take my word on that.
Our ancestors were really interested in disc
operating systems.
It seems like they were building a new disc
operating system like every two seconds, but.
What, what's
up with that? I, I don't know. Now we
have, like, three.
The first thing that you should do if you're
using Linux is to increase your read-ahead
cache. A
read-ahead cache is something that, well,
I found not,
not a ton of people know about, but it's
a super easy way to get up to a
doubling of your read-throughput. And essentially
what this means
is that Linux examines the request that you
make
to the, to the disc. And if it sees
that you're asking for lots of blocks that
come
right after another in a row, it's gonna preload
the next set of blocks into RAM.
The normal, the, the default for this is,
like,
256K RAM that it uses to prefix these blocks.
If you update this to use, like, two megs,
four megs, you'll get a really big increase
in
read performance. Use a modern file system.
That means
EXT-3 is not an option. If you want to
know why, check out the link.
And, and if you happen to be using EXT-4
or XFS. I can never say that unless I
say it super slow. You might want to consider,
you might want to look into journaling settings.
If
you have your database on a completely separate
hard
drive, and it's running EXT-4 and you have
full
data journaling enabled, since PostGres does
its own journaling,
you're gonna have an inefficiency there because
you have
two things journaling.
Finally, or, wait. Not finally. Anyway. You
gotta tell
PG, tell, tell Paul Gram about, about all
the
RAM that you bought in that, that fancy new
box. A, a really easy way to set this
up is to use a PG-tune script. It examines
your computer and writes a, a PostGres configuration
file
that has some pretty, pretty reasonable values
in it.
And you can tweak them from there. This stuff
is all really easy to find.
And then, finally, the bane of, of DB administrators
is the vacuum command. PostGres needs to vacuum,
because
it's messy. It. When you, when you run queries,
when you delete things, when you update things,
it
leaves sort of a lot of stuff undone. And
it does that, it does that in the, to,
it. In the name of speed, right. It only
does what is necessary at the time of query
to get you an answer to your query.
And then it uses vacuum to go and sort
of clean up after itself. The problem is that
vacuum can be really resource intensive. And
so if
your, if your server is under a lot of
load, and you see that, OK, vacuum is also
causing a lot of load, you may be tempted
to turn off vacuum or to, or to make
autovacuum happen maybe once a night or something.
And
that's generally a bad idea. We actually got
bitten
by this ourselves, so that's why I bring it
up. Usually the answer to vacuum problems
is to
vacuum more often, not less often.
All right. So. Velocity. I really wanted to
like
- I didn't have time - I really wanted
to have a little, a little like, Tron guy
on the motorcycle going down the little grid
there.
But. Eh.
Now we're gonna talk about a, a couple things
that are important if you have a ton of
data coming in, or a ton of read-requests
coming
in, a ton of queries coming in. The first
is too many database connections. Each database
connection in
PostGres is its own process, and each process
has
its own RAM overhead. So there's a limited
number
of connections that you want to have to your,
your database server.
If you have, I don't know, a thousand workers
and, and web app processes and all this, you
don't want to open a thousand database connections
and
you probably already know about this. The
solution is
to pool connections.
There's ways to do this in Ruby. There's also,
if you're interested in a more ops-y approach,
you
can check out something like PG-bouncer, which
is a
proxy that sits in between your Ruby app and
your database and functions as a, as a connection
pool.
You can also run into problems with too many
locks. And this is, this is the sort of
problem that you don't really ever run into
if
you don't have a firehose of database, of
data
coming in.
I don't expect you to know, like, everything
about
locks, because it's a, it's a pretty complex
topic.
But you should know that, within a transaction,
if
you go to update a row, that transaction is
gonna put a lock on the row until it's
done. It's gonna say that nothing else can
write
to that row until it's done. And, you know,
that makes sense.
But where this can bite you is if you
have, say, imagine you have two Rails models.
You
have a parent model and a child model. And
the parent model has a counter-cache that
gets incremented
every time you add a child. Normally, this
is
no big deal. People do this sort of thing
all the time. But if something crazy happens
and
someone, you know, bombards your API and suddenly
you
have, like, fifteen thousand children created,
you're, you're gonna
have some locking issues.
Cause what's gonna happen is your first query
is
going to execute fine. The child's gonna get
created.
It's gonna increment the counter. Everything's
gonna be fine.
And while it was doing that, it put a
lock on, on that row. And so, once the
first query's done, the lock is, is removed,
and
the second query happens. And if you had infinite
time, like, this would all be fine. It would
all eventually work itself out. But what happens
is
that by the time you finish query 100, query
15, 15,000 has timed out, which causes all
sorts
of fun in your, you know, in your Unicorns
and your Sidekiqs and all that. It's a, yeah,
it's a huge pain in the neck.
And the way you avoid this is to, it,
it's just an architectural thing. You just
have to
avoid any situation where you could be updating
the
same record in the database like a gillion
times
per second.
Intensive database queries. Like, sometimes,
sometimes we have our
production database, and we need that to be
very,
very performant for our users. But we also
need
to maybe do some preprocessing on data as
it
comes in. And a really simple way to do
this is to use PostGres's streaming replication
facilities to
create a read-only replicant. And then you
just do
your intensive queries against that and it
doesn't affect,
affect your users. It's super simple. Sorry.
I, I,
I kind of feel dumb even talking about it
here. But, there you go.
Partitioning is awesome. Partitioning is like
the best thing
ever, because it allows you to dramatically
speed up
data calling and data archiving. What you
can set
up in, in PostGres is a partitioning scheme
in
which data for different days go into different
physical
database tables.
And, you know, that's all right. But the really
cool thing is that you access that data by
querying a parent sort of virtual table that
then
propagates that query across all of its children
and,
you know, magically does its results and,
and spits
them back to you.
So you don't have to update your Ruby code
at all, which is sweet. And this way, if
you want to, say, delete data that's 180 days
old, you just drop the table that's associated
with
180 days ago. If you were to do a
delete where, you know, created_at is less
than 180
days ago, you're probably gonna be waiting
for weeks.
And last, but not least, backups are a real
pain in the neck when you have a huge
dataset. Like, that crime job that dumps your
entire
database and uploads it to S3 doesn't work
so
well when your database is, like, two terabytes
big.
And, yeah, that's a big S3 bill, too.
Fortunately, we can take advantage of the
same sort
of things that PostGres does to allow it to
do streaming replication to do, sort of, on
the
fly, incremental backups. And there's a tool
called Wall-E,
which makes this super easy. And it, it's
really
cool, because it allows, it, it makes it very
easy to upload incremental backups to, say,
S3, and
then when you want to restore your database,
it
makes it really easy to, to restore a specific
point in time.
And so that, Wall-E is really awesome. I love
it.
Now, I, I, I stayed in my hotel room
last night. I didn't get to go to the
Speakeasy thing, which kind of bummed me out.
But
I, I had to work on these slides for
you people. And, and, and when I got done
with this, when I sort of reached this point,
I looked back on all of these, these slides
- there's like fifty-four of them - and I
was like, holy crap. This is a lot of
information.
Yeah. This is a lot of information. But the
thing I, I want to stress is that, at
least in, in my experience, these issues tend
to
come at you one at a time. This isn't
the sort of thing where you have to know
all of this stuff in advance in order to
deal with biggish data in PostGres. Things
come at
you one at a time, and you can deal
with them one at a time.
And I have faith in you. I think you
can do it, because you're awesome, because
nobody told
us that we could turn, like, transistors into
LOLCats,
but we did it. Like, that's the type of
people we are, and that's, that's why I'm
proud
to be at RailsConf. If you're interested in,
like,
talking to me about this, if you think I'm
full of shit about anything, which I probably
am
on at least one point, just, just say so.
Just feel free to come up to me after
the conference, or after my talk here. I have
delicious hot and spicy Mexican candy, as
an incentive.
So there you go.
If you want to learn more about the stuff
that I. Are you taking? Do you want to
take a picture of the Mexican candy? OK. OK.
I can send it to you.
If. If you want to learn more, all of
the links referenced in this talk are at this
url, and if you are interested in having more
visibility into your production errors, check
out Honeybadger, because,
yeah. We love Rails devs. And that's it. That's.
It says end of show here, so I guess
it must be the end of show.