STARR HORNE: So thank you guys for coming. This is gonna be an awesome talk. Quite possibly the best talk you'll ever hear in your life. What? I actually have a built-in escape hatch here, so if things start going wrong, I might just bail. Yeah.

So, anyway. A couple weeks ago, I was sort of having a panic attack about this presentation, because I'm a programmer, and my natural habitat is some sort of dark place, looking at a screen. It's not talking to a bunch of people with, like, lights. What's up with this?

So I was in the middle of this panic attack and I went out on a walk, 'cause that always clears my mind. I was listening to a podcast. My favorite podcast. It's called Hardcore History. I highly recommend it. Yeah. Oh, somebody likes it. Yeah!

So this particular episode of Hardcore History was about the Punic Wars. And in case you haven't heard of the Punic Wars, don't know the story, I'll just give you, like, the brief rundown. Basically, a long, long time ago, there was this place called Rome. And they decided that they should take over the world. And, for the most part, they did. But there was this one country that was a thorn in their side for a long time, called Carthage. And they fought all these wars back and forth. It was kind of a stalemate.

And then one day, when, I don't know, maybe he was making a sacrifice to the fire god or whatever, this guy named Hannibal had this great idea. He was like, I'm gonna lead this sneak attack on Rome, and I'm gonna do so by marching my army through the Alps. Which is pretty cool. I mean, that's pretty badass. But I think the most awesome part of the story, for me at least, is that the dude had war elephants. So, I don't know if you can see it in the slide, but there's actually a war elephant. It's kind of, I don't know, it's kind of under the G. And a lot of people have doubted this story, you know, over the years. And so, in 1959, there was a British engineer who, on a bar bet, borrowed a circus elephant named Jumbo and marched it across the Alps, too.

So, I don't know what my point is here, really. I guess don't underestimate elephants, because they don't like it and they have long memories.

So, OK. This talk is really about biggish data. So what the hell is biggish data? Whoa, I've got one of these things. It's not big data. This talk isn't gonna be about Hadoop clusters and server farms and all that stuff. That's way over my head. It's not about fancy architectures, either. I'm not gonna show you how to make multi-write Postgres clusters that do automatic sharding and stuff. I mean, that's all, like, wicked cool, but it's not what we're gonna talk about.

Instead, I'm gonna talk about something that I think is more practical, and it's kind of more interesting to me, and that is: how do you keep your app working as your production dataset grows? And grows into biggish data territory? And this is very easy to have happen, even if you don't have a firehose of data coming in. If you run a popular e-commerce site, or just a site with a lot of users, over the course of years you can accumulate a ton of data. And as this data accumulates, you find that your site performance goes downhill gradually, in ways that you don't understand.

So. What is biggish data? This talk is based on my experience at Honeybadger. In case you haven't heard of us, we are an exception, performance, and uptime monitoring company. That means that we essentially have a firehose of new data coming in every day. Right now we have about a terabyte of errors in our database. The world has a lot of errors. You guys need to start, like, doing a better job. And we get about two gigabytes of new errors per day, roughly. And all this goes into a plain vanilla Postgres database that backs a pretty plain vanilla Rails application.

Well, the good news is that Postgres can handle this pretty easily. Postgres has got it covered. The bad news is that, unless you've engineered your app to deal with this level of data, you're kind of screwed. And the reason is that a hundred-megabyte database behaves fundamentally differently than a one-terabyte database. And a lot of the conventions, a lot of the normal things we do in Rails apps, just stop working when you have this much data.

Yup. This is actually a live feed. This query's been running since this morning. Something I learned: pretty much every pagination system breaks on page, like, 2000. And even if you're like, dear God, make this stop, I just want to delete half my data, you're still screwed, because deleting data takes a lot of time.

Now, I'm gonna explain all of this. I'm gonna explain why this happens, how you can work around it, and how you can optimize the queries in your own database and optimize your stack. But to really understand this, we've got to take a little trip back in time, to the summer of 1978.

The summer of disco, free love, and the VT100 terminal. The VT100 was the first sort of computer-shaped thing that you could buy from an actual company and sit on your desk to prove that you were the alpha nerd of the office. It was also the time that Oracle 1.0 was being developed, which was one of the first databases that used the SQL language. It's pretty cool. It was written in PDP-11 assembly language. And in case you're too young to have ever seen a PDP-11, this is what a PDP-11 looks like. Yeah. Show some love for the PDP-11.

Now, this is pretty awesome. Nowadays, in our, like, modern-day times, right this second, marketers down in the, what do they call that? The exhibit hall. Are, right now, throwing around all sorts of words about cloud computing and platform-as-a-service and, I don't know, some stuff I haven't even heard of, probably. And back then it was the same, but the buzzwords were different. The buzzword on everybody's lips at that time was "real-time computing."

And I'm using air quotes around "real time," because that meant that you could have your report generated in five minutes instead of five days. Which I'm sure was pretty awesome. Like, I would have been really happy about that. But, you know, looking at this, it doesn't really seem like the sort of thing I would want to build a web app on top of, right?

Like, yeah. I wouldn't stick Honeybadger on that. But a funny thing happened in the past thirty-six years, and that funny thing is that Moore's Law made this into a web stack. Now, I know that everybody here knows, sort of vaguely, what Moore's Law is. But I thought it would be good to go over the technical, precise definition of Moore's Law.

And that is: computers get more awesome as time goes forward. Did that come back normally? Yeah. As time goes forward. And awesomeness is generally defined as, well, technically, being able to do more stuff faster. Being able to process more data faster. So, based on this, I would like to postulate, for the first time in public - you guys are very lucky to be able to hear this - Starr's Corollary to Moore's Law, which is: as database growth outpaces Moore's Law, you literally travel back in time.

That was the TED moment of this talk. It's like, I can hear your minds being blown. It sounds like bubble wrap popping. So, based on my extensive research of time-travel movies, the first rule for surviving your new time period is that you need to understand the methods and the motivations of the people in that time period. We need to start thinking like our ancestors. And our ancestors were very interested in this thing called hardware.

Well, I don't mean to be flip, but I just want to bring this up first, because if you happen to be having database scaling issues - like, if your app right now is getting slower because the amount of data in your database is getting bigger - and your app happens to live anywhere named EC2, Linode, I don't know, whatever other companies offer that, you can probably solve your scaling issue just, like, right now, by going and buying a real damn computer.

Yeah. You'll be shocked and amazed. Because the two things that databases need to work at high performance with lots of data are lots of RAM and really fast disk I/O. And virtual servers give you neither of those things. So just go buy a real damn computer. Yeah. And while you're at it, throw in a bunch of disks.

I'm not talking about a RAID array. You probably want your operating system, your Postgres data, your Postgres log file - you probably want all that stuff to be on separate hard drives. And that's just gonna make the operating system able to schedule disk I/O more efficiently.

So, if that solves your problems, you know, great. You're welcome. If you guys want to leave the room now, that's fine. I won't be offended. You can go call your local colo facility, work something out. If that doesn't solve your problem, or even if it does, you probably want to look at your queries next.

Now, most people - oops. Hey. Most of us develop against smallish datasets. So, when you develop against a small dataset, you don't notice inefficient queries. It's just life. And books have been written about query optimization. It's a very huge topic, and I can't explain everything in thirty minutes. So I'm just going to explain one thing, and that's called EXPLAIN.

Fortunately, Postgres gives us an awesome tool called EXPLAIN: basically, you pass it a query and it gives you back a query plan. Oh, wait a second. That's a chapter from the Iliad. Sorry. It gives us a query plan, which still kind of looks like a chapter from the Iliad. But, fortunately, we only have to look at one metric here. The only thing we need to worry about, for this talk, is rows.

And here we have an example of a very efficient query, right? It's gonna use an index, and it's gonna look at one row. That's pretty sweet. Very fast. This is the type of query that works with biggish data.

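To make that concrete, here's a minimal sketch of what that looks like, with a made-up errors table; the costs are illustrative, but rows=1 is the part to look at:

    EXPLAIN SELECT * FROM errors WHERE id = 42;

    --                        QUERY PLAN
    -- ------------------------------------------------------------
    --  Index Scan using errors_pkey on errors
    --    (cost=0.57..8.59 rows=1 width=120)
    --    Index Cond: (id = 42)
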
But there's one thing you need to know about rows. And that is, obviously, the more rows you're dealing with, the more data your computer has to crunch to give you the answers you want. And so the whole name of the game in query optimization is to limit the number of rows that you have to touch.

Let's go back to that inefficient count query. So if you run EXPLAIN on that, it turns out that when you ask Postgres to count all of the rows in a table, it actually literally counts all the rows in the table, one by one.

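Sketched against the same made-up errors table, the plan looks something like this; note the sequential scan touching every single row:

    EXPLAIN SELECT COUNT(*) FROM errors;

    --                        QUERY PLAN
    -- ------------------------------------------------------------
    --  Aggregate  (cost=233855.01..233855.02 rows=1 width=8)
    --    ->  Seq Scan on errors
    --          (cost=0.00..208855.00 rows=10000000 width=0)
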
And so you wind up with one of these things. It's not pretty. And it's often hard to know when stuff like this is gonna crop up. For example, I mentioned that most pagination systems break at page 1000 or 2000. And that's because the OFFSET and LIMIT operators work a lot like COUNT, in that, if you do an offset of 500,000, Postgres is going to count up 500,000 rows, and then if you have a limit of 100, it's gonna count up another 100. And so what you see is that pages one, two, and three load extremely quickly. And so you, as a developer - and when I say you, I mean me, 'cause I actually fell into this trap - you test it with, like, one or two pages and it works fine, so you move on. And then later on your customers are complaining because they can't access page 500. It's timing out.

The solution to this is to use a range query. Range queries are very fast, and I'm gonna give you a really stupid example here. Here's a link with a much more detailed analysis of all this. But essentially, if you do a range query, you're gonna be able to use an index. It's gonna touch 100 rows and it's gonna be really fast.

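Here's that stupid example, sketched against a made-up errors table with an indexed id column; the link has the grown-up version:

    -- Offset pagination: Postgres counts off and discards 500,000 rows
    -- before it even starts on the 100 you asked for.
    SELECT * FROM errors ORDER BY id LIMIT 100 OFFSET 500000;

    -- Range pagination: seek straight to the last id you served,
    -- then read the next 100 rows off the index.
    SELECT * FROM errors WHERE id > 500000 ORDER BY id LIMIT 100;
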
Now, I know what you guys are thinking. That's not the same as an offset-limit query. And that's the reason I gave you the link in the previous slide, so.

Sorting. Sorting is tricky. Sorting is the devil. Sometimes it's super fast: if you happen to be asking for data sorted in exactly the same way that an index is sorted, well, it'll be super fast. Other times, even if you have an index on a column, if it's not set up right, or if you're asking for the data in a slightly unusual way, you'll wind up sorting the whole damn dataset.

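A sketch of both sides of that, again with a made-up errors table:

    -- An index sorted the same way the query asks for: no sort step.
    CREATE INDEX index_errors_on_created_at ON errors (created_at DESC);
    SELECT * FROM errors ORDER BY created_at DESC LIMIT 50;

    -- Ask for a slightly different ordering and Postgres may fall back
    -- to sorting the whole damn dataset before returning 50 rows.
    SELECT * FROM errors ORDER BY created_at DESC, id ASC LIMIT 50;
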
It's no good. So here's your rundown for query optimization. You want to develop against a real dataset, because otherwise you won't know when things are going wrong until they go wrong in production. Use EXPLAIN. Use it a lot. And the whole name of the game here is to limit the number of rows that you touch. Because the more rows you touch, the slower your queries are. You know, in general.

So, I don't know about you, but that was just a lot of cognitive load happening, right up here. So I want to give you guys a cookie.

For the second half of this talk, we're gonna cover a lot of issues that relate to kind of the infrastructure around biggish data, around big datasets and firehoses of data coming in. And here's a legend of how we're gonna approach this. We're gonna have a cute picture that, hopefully, sort of symbolizes something about the topic. We're gonna have the name of the topic, and then we're gonna have a link with more info about the topic, if you want to, you know, find this later.

And you don't have to write all these links down right now. At the end of the presentation, on the very last slide, I'll have a URL where you can get a list of all of them. And, yeah, I promise this isn't malware. So you just gotta take my word on that.

Our ancestors were really interested in disk operating systems. It seems like they were building a new disk operating system, like, every two seconds. What's up with that? I don't know. Now we have, like, three.

The first thing that you should do, if you're using Linux, is to increase your read-ahead cache. The read-ahead cache is something that, well, I found not a ton of people know about, but it's a super easy way to get up to a doubling of your read throughput. And essentially what this means is that Linux examines the requests that you make to the disk, and if it sees that you're asking for lots of blocks that come right after one another in a row, it's gonna preload the next set of blocks into RAM.

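The knob itself is a one-liner with the standard Linux blockdev tool. A sketch, assuming your Postgres data lives on /dev/sdb; the value is counted in 512-byte sectors, so 4096 means two megs:

    # See the current read-ahead value for the drive, then raise it.
    blockdev --getra /dev/sdb
    blockdev --setra 4096 /dev/sdb
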
The default for this is, like, 256K of RAM that it uses to prefetch these blocks. If you update this to use, like, two megs, four megs, you'll get a really big increase in read performance.

Use a modern file system. That means ext3 is not an option. If you want to know why, check out the link. And if you happen to be using ext4 or XFS - I can never say that unless I say it super slow - you might want to look into your journaling settings. If you have your database on a completely separate hard drive, and it's running ext4, and you have full data journaling enabled, then, since Postgres does its own journaling, you're gonna have an inefficiency there, because you have two things journaling.

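In practice that's a mount option. A sketch for /etc/fstab, assuming a dedicated ext4 data drive at /dev/sdb1; read up on the trade-offs before copying this, because journaling modes are a durability decision:

    # ext4 defaults to data=ordered; data=journal (full data journaling)
    # duplicates work Postgres already does in its own WAL.
    /dev/sdb1  /var/lib/postgresql  ext4  noatime,data=writeback  0  2
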
Finally, or, wait. Not finally. Anyway. You gotta tell PG - tell Paul Graham - about all the RAM that you bought in that fancy new box. A really easy way to set this up is to use the pgtune script. It examines your computer and writes a Postgres configuration file that has some pretty reasonable values in it. And you can tweak them from there. This stuff is all really easy to find.

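The sort of thing it generates looks roughly like this; the numbers are illustrative for a box with 32 gigs of RAM, which is exactly why a generated starting point beats guessing:

    # postgresql.conf sketch (illustrative values, not gospel)
    shared_buffers = 8GB             # often ~25% of system RAM
    effective_cache_size = 24GB      # roughly what the OS page cache can hold
    work_mem = 64MB                  # per sort/hash operation, so keep it sane
    maintenance_work_mem = 1GB       # speeds up vacuum and index builds
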
And then, finally, the bane of DB administrators is the vacuum command. Postgres needs to vacuum because it's messy. When you run queries, when you delete things, when you update things, it leaves sort of a lot of stuff undone. And it does that in the name of speed, right? It only does what is necessary at the time of query to get you an answer to your query.

And then it uses vacuum to go and sort of clean up after itself. The problem is that vacuum can be really resource-intensive. And so if your server is under a lot of load, and you see that, OK, vacuum is also causing a lot of load, you may be tempted to turn off vacuum, or to make autovacuum happen maybe once a night or something. And that's generally a bad idea. We actually got bitten by this ourselves, so that's why I bring it up. Usually the answer to vacuum problems is to vacuum more often, not less often.

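"Vacuum more often" is mostly a settings change. These are real autovacuum knobs with illustrative values; by default a table has to accumulate dead rows equal to about twenty percent of itself before autovacuum touches it, which on a huge table is an eternity:

    -- Let a big, busy table qualify for autovacuum much sooner.
    -- (Hypothetical errors table; the defaults are in the comments.)
    ALTER TABLE errors SET (
      autovacuum_vacuum_scale_factor = 0.02,  -- default is 0.2
      autovacuum_vacuum_threshold    = 1000   -- default is 50
    );
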
All right. So. Velocity. I really wanted to, like - I didn't have time - I really wanted to have a little, like, Tron guy on the motorcycle going down the little grid there. But, eh.

Now we're gonna talk about a couple things that are important if you have a ton of data coming in, or a ton of read requests coming in, a ton of queries coming in. The first is too many database connections. Each database connection in Postgres is its own process, and each process has its own RAM overhead. So there's a limited number of connections that you want to have to your database server.

If you have, I don't know, a thousand workers and web app processes and all this, you don't want to open a thousand database connections, and you probably already know about this. The solution is to pool connections.

There's ways to do this in Ruby. There's also, if you're interested in a more ops-y approach, something like PgBouncer, which is a proxy that sits in between your Ruby app and your database and functions as a connection pool.

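The Rails side of this is just the pool setting in database.yml, and if you put PgBouncer in the middle, you point Rails at the bouncer instead of at Postgres. A sketch with made-up hostnames:

    # config/database.yml
    production:
      adapter: postgresql
      pool: 5                     # connections per Rails process
      host: pgbouncer.internal    # hypothetical PgBouncer host
      port: 6432                  # PgBouncer's conventional port
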
You can also run into problems with too many locks. And this is the sort of problem that you don't really ever run into if you don't have a firehose of data coming in.

I don't expect you to know, like, everything about locks, because it's a pretty complex topic. But you should know that, within a transaction, if you go to update a row, that transaction is gonna put a lock on the row until it's done. It's gonna say that nothing else can write to that row until it's done. And, you know, that makes sense.

But where this can bite you is if you have, say - imagine you have two Rails models. You have a parent model and a child model. And the parent model has a counter cache that gets incremented every time you add a child. Normally, this is no big deal. People do this sort of thing all the time. But if something crazy happens and someone, you know, bombards your API, and suddenly you have, like, fifteen thousand children created, you're gonna have some locking issues.

'Cause what's gonna happen is, your first query is going to execute fine. The child's gonna get created. It's gonna increment the counter. Everything's gonna be fine. And while it was doing that, it put a lock on that row. And so, once the first query's done, the lock is removed, and the second query happens. And if you had infinite time, like, this would all be fine. It would all eventually work itself out. But what happens is that by the time you finish query 100, query 15,000 has timed out, which causes all sorts of fun in your, you know, in your Unicorns and your Sidekiqs and all that. It's a huge pain in the neck.

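In SQL terms, each of those fifteen thousand child-creates boils down to something like this (made-up table names), and every copy of it has to wait in line for the same parent row:

    BEGIN;
    INSERT INTO children (parent_id, created_at) VALUES (1, now());
    -- The row lock on parents id=1 is held until COMMIT, so concurrent
    -- copies of this transaction queue up single-file behind it.
    UPDATE parents SET children_count = children_count + 1 WHERE id = 1;
    COMMIT;
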
And the way you avoid this is - it's just an architectural thing. You just have to avoid any situation where you could be updating the same record in the database, like, a gillion times per second.

Intensive database queries. Like, sometimes we have our production database, and we need that to be very, very performant for our users. But we also need to maybe do some preprocessing on data as it comes in. And a really simple way to do this is to use Postgres's streaming replication facilities to create a read-only replica. And then you just do your intensive queries against that, and it doesn't affect your users. It's super simple. Sorry. I kind of feel dumb even talking about it here. But, there you go.

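A rough sketch of the moving parts on the Postgres 9.x of that era, with made-up hostnames; the linked docs have the real walkthrough:

    # Primary's postgresql.conf: let standbys stream WAL.
    wal_level = hot_standby
    max_wal_senders = 3

    # Replica's postgresql.conf: answer read-only queries while replaying.
    hot_standby = on

    # Replica's recovery.conf: follow the primary.
    standby_mode = 'on'
    primary_conninfo = 'host=db-primary.internal user=replicator'
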
Partitioning is awesome. Partitioning is, like, the best thing ever, because it allows you to dramatically speed up data culling and data archiving. What you can set up in Postgres is a partitioning scheme in which data for different days goes into different physical database tables.

And, you know, that's all right. But the really cool thing is that you access that data by querying a parent, sort of virtual, table that then propagates that query across all of its children and, you know, magically gathers the results and spits them back to you.

So you don't have to update your Ruby code at all, which is sweet. And this way, if you want to, say, delete data that's 180 days old, you just drop the table that's associated with 180 days ago. If you were to do a DELETE WHERE, you know, created_at is less than 180 days ago, you're probably gonna be waiting for weeks.

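A sketch of the inheritance-style partitioning Postgres offered at the time, with made-up table names and dates:

    -- One child table per day, hanging off the parent.
    CREATE TABLE errors_2014_04_25 (
      CHECK (created_at >= '2014-04-25' AND created_at < '2014-04-26')
    ) INHERITS (errors);

    -- Queries still go through the parent; Postgres fans them out
    -- to the children and merges the results.
    SELECT * FROM errors WHERE created_at >= '2014-04-25';

    -- Culling 180-day-old data becomes a table drop, not a giant DELETE.
    DROP TABLE errors_2013_10_27;
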
And last, but not least, backups are a real pain in the neck when you have a huge dataset. Like, that cron job that dumps your entire database and uploads it to S3 doesn't work so well when your database is, like, two terabytes big. And, yeah, that's a big S3 bill, too.

Fortunately, we can take advantage of the same machinery that Postgres uses for streaming replication to do sort of on-the-fly, incremental backups. And there's a tool called WAL-E which makes this super easy. And it's really cool, because it makes it very easy to upload incremental backups to, say, S3, and then when you want to restore your database, it makes it really easy to restore to a specific point in time.

And so, WAL-E is really awesome. I love it.

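For flavor, here's roughly what wiring it up looks like; the paths are assumptions, the envdir directory holds your S3 credentials, and the WAL-E README is the real reference:

    # postgresql.conf: ship each finished WAL segment to S3.
    wal_level = archive
    archive_mode = on
    archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

    # Cron-friendly base backup; the archived WAL segments between base
    # backups are what make point-in-time restores possible.
    envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.3/main
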
Now, I stayed in my hotel room last night. I didn't get to go to the Speakeasy thing, which kind of bummed me out. But I had to work on these slides for you people. And when I got done with this, when I sort of reached this point, I looked back on all of these slides - there's, like, fifty-four of them - and I was like, holy crap. This is a lot of information.

Yeah. This is a lot of information. But the thing I want to stress is that, at least in my experience, these issues tend to come at you one at a time. This isn't the sort of thing where you have to know all of this stuff in advance in order to deal with biggish data in Postgres. Things come at you one at a time, and you can deal with them one at a time.

And I have faith in you. I think you can do it, because you're awesome, because nobody told us that we could turn, like, transistors into LOLcats, but we did it. Like, that's the type of people we are, and that's why I'm proud to be at RailsConf. If you're interested in, like, talking to me about this, if you think I'm full of shit about anything - which I probably am on at least one point - just say so. Just feel free to come up to me after the conference, or after my talk here. I have delicious hot and spicy Mexican candy as an incentive. So there you go.

If you want to learn more about the stuff that I - are you taking, do you want to take a picture of the Mexican candy? OK. OK. I can send it to you.

If you want to learn more, all of the links referenced in this talk are at this URL. And if you are interested in having more visibility into your production errors, check out Honeybadger, because, yeah, we love Rails devs. And that's it. It says end of show here, so I guess it must be the end of show.