RailsConf 2014 - An Ode to 17 Databases in 33 Minutes by Toby Hede

0:17 - 0:18

TOBY HEDE: Good morning everybody.
0:18 - 0:24

Friday. Yes. It's been a long week. I'm excited.
0:24 - 0:29

I'm highly caffeinated. So without further
ado,
0:29 - 0:34

I present An Ode to 17 Databases in 33 Minutes.
0:34 - 0:38

I'm gonna mangle a large number of metaphors.
0:38 - 0:41

There'll be a lot of animated gifs.
0:41 - 0:44

I've learned that this week, if you see it
like that,
0:44 - 0:48

there's Star Wars, Dungeons and Dragons,
0:48 - 0:49

and all of that's very, unfortunately, stereotypical.
0:49 - 0:52

So a bit of an indictment.
0:52 - 0:56

This whole thing started as a joke. Seventeen
databases.
0:56 - 0:59

I actually did in five minutes. Thirty-three
minutes is
0:59 - 1:04

worse. The whole thing is just a catastrophe,
really.
1:04 - 1:05

But anyway.
1:05 - 1:08

We're gonna cover a whole bunch of different
databases
1:08 - 1:10

and a little bit of the underlying theory,
and
1:10 - 1:13

hopefully you'll walk out and you'll understand
why to
1:13 - 1:14

use Postgres.
1:14 - 1:14

[laughter]
1:14 - 1:20

I'm Toby. You can find me on the internet.
1:20 - 1:22

I work at a company called Ninefold.
1:22 - 1:26

V.O.: We're having a problem, there's no screen.
1:26 - 1:33

T.H.: Oh. No screens. Is that me?
1:36 - 1:41

Before it was, there was no red. So, now
1:41 - 1:44

there's no any, anything.
1:44 - 1:45

V.O.: Nothing.
1:45 - 1:46

T.H.: Hey.
1:46 - 1:48

AUDIENCE: Hey!
1:48 - 1:51

T.H.: I have no slides.
1:51 - 1:55

Well, you missed my beautiful slides. There's.
You missed
1:55 - 1:58

the first animation. That's a shame. You missed
the
1:58 - 2:02

list. It's awesome. You missed me and my excellent
2:02 - 2:05

job titles. So yes.
2:05 - 2:08

I work at Ninefold. They have very kindly
2:08 - 2:13

flown me over here from Australia, which explains
why
2:13 - 2:17

I sound like I come from the deep south.
2:17 - 2:18

Cause I do.
2:18 - 2:21

Most of this week, this has been me. So
2:21 - 2:24

today I'm finally over the jetlag just in
time
2:24 - 2:27

to go home and have it all over again
2:27 - 2:28

next week.
2:28 - 2:32

So, a couple of quick facts about Straya.
There
2:32 - 2:39

are much fewer syllables than you're used
to using.
2:39 - 2:44

This is an, a genuine Australian politician.
He's a
2:44 - 2:48

mining magnate billionaire and he is currently
running a
2:48 - 2:53

MVP Jurassic theme park with giant fiberglass
dinosaurs. And
2:53 - 2:56

I, I for one am for it. So I
2:56 - 2:59

realize there wasn't enough Star Wars references
so this
2:59 - 3:01

is just completely gratuitous.
3:01 - 3:05

Anyway. So. The thrust is that distributed
systems are
3:05 - 3:08

hard and databases are fun. Pictured here
is a
3:08 - 3:14

distributed system. You can see there's two
app nodes
3:14 - 3:17

and then there's two, there's like a master/slave
kind
3:17 - 3:21

of setup going on here as well. So we're
3:21 - 3:24

gonna talk about some of the complexities
of running
3:24 - 3:28

these types of systems, and it's really fun
stuff
3:28 - 3:30

once you get under the cover and start thinking
3:30 - 3:32

about some of the complexities.
3:32 - 3:37

So. NoSQL is a thing. We have NewSQL now.
3:37 - 3:39

I'm gonna be covering some of these things.
We've
3:39 - 3:44

also got PostSQL, Post-Rock Ambient SQL. And
there's a
3:44 - 3:47

whole gamut of these things. They all make
my
3:47 - 3:51

brain explode and the, I think the trick to
3:51 - 3:53

understanding all of this stuff is to actually
think
3:53 - 3:55

about some of what's happening underneath.
And you can
3:55 - 4:00

make decisions about your databases.
4:00 - 4:02

Hopefully you're all familiar with some of
the concepts
4:02 - 4:07

of traditional relational databases. We have
ACID, which provides
4:07 - 4:11

certain guarantees about the way that your
data behaves.
4:11 - 4:13

You can update data and be sure it was
4:13 - 4:18

updated. Things are isolated from each other.
Things persist
4:18 - 4:21

over time.
4:21 - 4:23

Another thing that you may have heard of,
this
4:23 - 4:26

is a, this is a leap that I need
4:26 - 4:28

to another animation, is a thing called the
CAP
4:28 - 4:31

Theorem. So this gets talked about a lot when
4:31 - 4:35

we start talking about this new generation
of databases.
4:35 - 4:40

CAP stands for consistency, availability,
and partition tolerance, and
4:40 - 4:44

it provides, basically, some strong foundation
for reasoning about
4:44 - 4:48

the way distributed systems behave and how
they interoperate
4:48 - 4:50

and how they communicate. So I'm gonna give
you
4:50 - 4:53

a brief introduction to how that all kind
of
4:53 - 4:53

works.
4:53 - 4:57

So, the original CAP Theorem, as stated, was,
is
4:57 - 5:00

called Brewer's Conjecture. A guy called Brewer
just sort
5:00 - 5:03

of had this idea. It's actually on some really
5:03 - 5:07

awesomely-designed PowerPoint slides from
some thing he did. And
5:07 - 5:12

he was saying that with consistency, availability,
and partition
5:12 - 5:15

tolerance - so the data can, can only be
5:15 - 5:17

two of these things at any one time. So
5:17 - 5:20

the data can be consistent or it can be
5:20 - 5:24

accessible or it can handle network failures.
5:24 - 5:28

So people then took this conjecture and actually
made
5:28 - 5:32

a formal kind of proof in, in much more
5:32 - 5:39

rigorous computer science terms. And actually
said, it's impossible,
5:39 - 5:43

in an asynchronous network model, to implement
a read/write
5:43 - 5:49

data object that is simultaneously available
and is also
5:49 - 5:51

atomically consistent.
5:51 - 5:53

And so all of this stuff around NewSQL and
5:53 - 5:57

NoSQL and bleh, all of that stuff, is about
5:57 - 6:02

manipulating these different variables. There's
also a thing called
6:02 - 6:03

BASE but I'm not gonna talk about it cause
6:03 - 6:06

it's actually just a made-up acronym that
has no
6:06 - 6:07

relevance to anything.
6:07 - 6:10

So, what, what does CAP actually, what, what
are
6:10 - 6:14

we talking about here? And why is it important?
6:14 - 6:17

It's important, actually, because everything
is already distributed. What
6:17 - 6:20

we do today is inherently a distributed system.
You
6:20 - 6:23

have a browser talking to a server, an app
6:23 - 6:26

server, Rails server - cause we're at RailsConf
-
6:26 - 6:29

and then that's talking to a Postgres database,
or
6:29 - 6:34

a MySQL database or something even fancier
and shinier.
6:34 - 6:36

That's a distributed system. And as we move
into
6:36 - 6:41

more heavy client-based operations, that distribution
is getting much
6:41 - 6:44

more front-loaded, so you, you've got state
in the
6:44 - 6:47

browser that's now synchronizing with state
on the server.
6:47 - 6:50

So we already actually suffer many of these
problems.
6:50 - 6:55

This is a handy and completely untrue guide
to
6:55 - 6:59

NoSQL systems and breaking them into this
idea of
6:59 - 7:03

some things are available and some things
are consistent.
7:03 - 7:07

So, all of that is almost but not quite
7:07 - 7:09

entirely untrue.
7:09 - 7:13

What the actual theorem says is that under
a
7:13 - 7:16

network failure - so you've got multiple nodes
and
7:16 - 7:20

they now can no longer communicate - you can
7:20 - 7:23

choose whether the data is consistent or whether
the
7:23 - 7:29

data is available. And I have some demonstrations
here
7:29 - 7:31

to just - it actually ends up being very
7:31 - 7:32

easy to understand.
7:32 - 7:36

So, here we have a typical cluster of nodes
working
7:36 - 7:41

together. We're gonna model some communication
between them. So
7:41 - 7:46

there's a, there's a write on this system.
It
7:46 - 7:49

comes in, that gets replicated across, and
then on
7:49 - 7:51

the other system we now have that data coming
7:51 - 7:53

out. Someone's doing a read. And so this is
7:53 - 7:58

the kind of situation that we're talking about.
So
7:58 - 8:01

whether you're doing master/slave setup in
a relational database
8:01 - 8:06

or something trickier, this is kind of the
way
8:06 - 8:08

it works. A node gets some data and it
8:08 - 8:12

gives it to another node, and they have the
8:12 - 8:14

same information.
8:14 - 8:18

So when there's a network partition, that,
they no
8:18 - 8:22

longer can communicate. So a write comes in,
and
8:22 - 8:26

now we have to make a decision. And all
8:26 - 8:28

of this is actually just science, as you can
8:28 - 8:31

tell from this diagram. If those two nodes
can't
8:31 - 8:33

communicate, you can talk to the one that
got
8:33 - 8:37

the write - that's consistent. It got the
write.
8:37 - 8:39

It can now, can read out that same data.
8:39 - 8:40

That's all cool.
8:40 - 8:44

Or, you can have both nodes still communicating,
and
8:44 - 8:46

now you have someone reading data that is
no
8:46 - 8:49

longer in the right state. So we've got, you
8:49 - 8:52

know, we have updated a bank account. It's
got
8:52 - 8:54

a hundred dollars in it. It used to have
8:54 - 8:57

ten dollars in it. These people are reading
ten.
8:57 - 8:59

These people are reading a hundred. That's
available. The
8:59 - 9:01

data is now not consistent. But all of the
9:01 - 9:03

nodes can send back that data.
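
A minimal Python sketch of the partition scenario described above, not code from the talk. The Node and Cluster names and the "CP"/"AP" mode flag are assumptions made for illustration; the balances (ten dollars before, a hundred after) follow the bank account example:

```python
class Node:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance


class Cluster:
    """Two replicas; `mode` decides what happens during a partition."""

    def __init__(self, mode):
        self.mode = mode              # "CP" (consistent) or "AP" (available)
        self.a = Node("a", 10)
        self.b = Node("b", 10)
        self.partitioned = False

    def write(self, balance):
        # Writes always land on node a in this sketch.
        self.a.balance = balance
        if not self.partitioned:
            self.b.balance = balance  # replication only works when connected

    def read(self, node):
        # A consistent system refuses to answer from the cut-off replica.
        if self.partitioned and self.mode == "CP" and node is self.b:
            raise RuntimeError("node b is unavailable during the partition")
        return node.balance
```

With the partition in place, a "CP" cluster answers 100 from node a and refuses reads on node b, while an "AP" cluster answers from both nodes but returns 100 from a and a stale 10 from b.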
9:03 - 9:07

And so all of the discussion about CAP Theorem
9:07 - 9:11

and, and you know, people even claiming, we've
defeated
9:11 - 9:14

the CAP Theorem in our database at, you know,
9:14 - 9:19

low-low prices is incredibly awesome. Just
remember this image.
9:19 - 9:26

Two things that cannot communicate cannot
communicate. It's science.
9:26 - 9:28

And then when they can communicate, we're
back into
9:28 - 9:31

the realm of normal operations and things
get a
9:31 - 9:35

lot easier. If you were interested in any
of
9:35 - 9:40

the guts of how these things work, definitely
have
9:40 - 9:42

a look at a thing called Jepsen, which is
9:42 - 9:48

this crazy motherfucker who is just analyzing
the network
9:48 - 9:52

operations of a whole variety of distributed
systems, and
9:52 - 9:55

it will, it's just, it will blow your mind.
9:55 - 10:01

OK. Good. That's, that's why. Now I remember.
10:01 - 10:05

So, here is our cast. We're about to go
10:05 - 10:09

on an adventure through a tortured maze of
ridiculous
10:09 - 10:12

Dungeons and Dragons metaphors. But, first
of all, a
10:12 - 10:15

shout out to the OwlBear. Yeah. The thing
I
10:15 - 10:19

love about the OwlBear is they've taken the
wrong,
10:19 - 10:23

the least scary aspects of a bear and an
10:23 - 10:26

owl, like if that was an owl with, you
10:26 - 10:30

know, if it had a bear's head and wings,
10:30 - 10:34

that would be way more scary. Anyway.
10:34 - 10:37

It's just been bugging me for months. So.
10:37 - 10:42

Postgres. As we all know, it's MySQL for hipsters.
10:42 - 10:45

It's actually pretty good. So here's its character
reference
10:45 - 10:49

sheet. We, it's a relational database. It
has a
10:49 - 10:54

consistent model. So under conditions of network partition, you
partition, you
10:54 - 10:57

know, your, your slave is not in contact with
10:57 - 11:00

the master, it's, it's essentially unavailable.
That's the way
11:00 - 11:02

we treat it.
11:02 - 11:05

Postgres is actually really, really interesting tech, because it
tic, because it
11:05 - 11:10

has a bunch of cool stuff hidden underneath
it.
11:10 - 11:13

So there's a thing called Hstore which is
a
11:13 - 11:16

key-value store that's baked right in. So
if you
11:16 - 11:19

need a lightweight key-value store and you're
already running
11:19 - 11:23

Postgres in production, you, you have one.
You don't
11:23 - 11:26

need to spin up any other thing. You can
11:26 - 11:27

actually do that today.
11:27 - 11:30

The really interesting thing about that is,
you can
11:30 - 11:34

index those keys. You can do joins across
an
11:34 - 11:38

Hstore reference into, across multiple tables.
It looks and
11:38 - 11:40

feels exactly like the kind of thing that
you're
11:40 - 11:42

already working with.
11:42 - 11:47

We've got, there's some things already baked
into the
11:47 - 11:49

Rails ecosystem that make this really easy
if you're
11:49 - 11:53

doing that kind of information. But the really
exciting
11:53 - 11:56

thing about what Postgres is up to at the
11:56 - 12:02

moment is JSON. And 9.2, 9.3, and upcoming
9.4
12:02 - 12:06

have pretty much a fully baked in JSON document
12:06 - 12:11

database. And it is crazy awesome. The new
one
12:11 - 12:14

is super high-performance. If you were sort
of, it's
12:14 - 12:17

the same thing. If you're thinking, ah, you
know,
12:17 - 12:20

documents would be easier for this use case,
let's
12:20 - 12:25

install something else, well, actually, you
already have one,
12:25 - 12:27

and it, it has all of those same properties.
12:27 - 12:29

You can index. You can do joins across your
12:29 - 12:33

normal table into the documents. It's crazy
cool.
12:33 - 12:37

MySQL. It's pretty much the same as Postgres,
is
12:37 - 12:42

my answer. But there's a slight caveat. So,
you
12:42 - 12:48

know, Oracle, they're a company. Many
of
12:48 - 12:50

the same things apply. Like, this is why,
you
12:50 - 12:52

know, they're, they're kind of in the same
bucket.
12:52 - 12:56

For me, it doesn't particularly matter at
the end
12:56 - 12:58

of the day. Whatever you happen to have expertise
12:58 - 13:01

in, it's cool. It's got some kind of interesting
13:01 - 13:03

things that you can do. You can switch out
13:03 - 13:08

storage engines to actually get your different
performance profiles.
13:08 - 13:12

It is everywhere. It's got a thing called
HandlerSocket,
13:12 - 13:16

which is essentially raw write access through a
through a
13:16 - 13:20

low-level socket into the table infrastructure.
There's some paper
13:20 - 13:24

with really high performance kind of things.
13:24 - 13:27

You can actually just sort of bypass the whole
13:27 - 13:30

SQL engine, which is kind of interesting.
The other
13:30 - 13:32

thing that's happened since Oracle took over,
which is
13:32 - 13:35

kind of a really good thing, is that there's
13:35 - 13:40

some alternatives. So MariaDB is sort of the,
the
13:40 - 13:45

more open fork. There's a semi-commercial
edition that has
13:45 - 13:48

lots of really high-performance features,
and they basically run
13:48 - 13:52

binary compatible patches, that's Percona.
And they have, like,
13:52 - 13:56

huge expertise. And this Toku is quite interesting.
It's,
13:56 - 13:58

they're doing all of this crazy fractal indexing
and
13:58 - 14:02

things for particular use cases on very large
datasets.
14:02 - 14:05

But it still just looks and behaves in many
14:05 - 14:08

ways like the MySQL that you are kind of
14:08 - 14:09

used to.
14:09 - 14:13

So, there's some interesting things happening
there. So these,
14:13 - 14:17

hopefully none of that's a huge surprise.
That's databases.
14:17 - 14:21

You use it. It comes in the box, and
14:21 - 14:23

ActiveRecord talks to it.
14:23 - 14:25

So now we're gonna get slightly off the beaten
14:25 - 14:30

track. So, a lot of what we know as NoSQL
14:30 - 14:35

comes from Dynamo, which was actually a paper
that
14:35 - 14:40

Amazon released years ago. I'm not gonna labor
too
14:40 - 14:42

much on this one. The paper's quite interesting.
It
14:42 - 14:48

talks about how you make a distributed system.
14:48 - 14:52

The interesting thing is actually that Riak
is essentially
14:52 - 14:55

an implementation of the underlying Dynamo
theory. So Riak
14:55 - 14:58

is crazy awesome. This is what happens to
you
14:58 - 15:02

when you run Riak in production.
15:02 - 15:02

[laughter]
15:02 - 15:06

I pretty much, like, it's a conversation I,
I
15:06 - 15:09

often have with people is like, wouldn't it
be
15:09 - 15:13

awesome to have a problem that needed Riak?
And
15:13 - 15:14

it was like, yeah, that would be so cool.
15:14 - 15:18

I'd be like the awesomeness engineer.
15:18 - 15:21

So Riak is, it's just crazy-well engineered.
They're doing
15:21 - 15:27

all sorts of interesting stuff. It's inherently,
it just
15:27 - 15:31

understands clustering. You know, you add
a new node,
15:31 - 15:35

it just, it's there. You know. With, with
those
15:35 - 15:38

older kind of databases, it's, it's a pain
in
15:38 - 15:40

the ass to actually get it working.
15:40 - 15:45

So, yeah, they're doing some really interesting
things. It's
15:45 - 15:48

got a cloud storage thing so you've got an
15:48 - 15:50

S3-compatible API and all of these kind of
stuff.
15:50 - 15:51

A lot of the magic of the way this
15:51 - 15:57

works is through consistent hashing. So, my
slides are
15:57 - 15:58

all mucked up. But anyway.
15:58 - 16:01

So, basically what it does is it just partitions
16:01 - 16:05

all of your data into a giant hash ring.
16:05 - 16:10

Excuse me. Physical nodes then just own parts
of
16:10 - 16:12

that hash. You add a new node or take
16:12 - 16:16

a node away and it repartitions all the rest
16:16 - 16:18

of the data across the remaining nodes. And
all
16:18 - 16:21

of that is just completely in the background
of
16:21 - 16:25

how Riak just works operationally.
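
The hash ring idea can be sketched in a few lines of Python. This is a generic consistent-hashing illustration under stated assumptions, not Riak's actual implementation: the HashRing class, the use of MD5, and the 64 virtual points per node are all choices made for the example.

```python
import bisect
import hashlib


def _hash(key):
    # Stable hash so placements do not change between runs.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual points per physical node
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> physical node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        # Walk clockwise to the first point at or after the key's hash,
        # wrapping around to the start of the ring if needed.
        i = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]
```

The property that matters is that adding a node only moves keys onto the new node; every other key stays where it was, which is why growing or shrinking the cluster does not reshuffle everything.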
16:25 - 16:27

So for large scale data and, you know, you,
16:27 - 16:31

you get away with, it has some really nice
16:31 - 16:34

operational characteristics that, that make
it quite cool to
16:34 - 16:35

manage.
16:35 - 16:37

And then the other thing is, it's a very
16:37 - 16:40

simple API. It's key-value store, you can
store JSON
16:40 - 16:42

documents in it, and it's just a bucket that
16:42 - 16:45

has keys, and then it's got other stuff on
16:45 - 16:50

top to retrieve data, do secondary indexes
and searching
16:50 - 16:51

and all of that kind of stuff.
16:51 - 16:54

So, it's a very cool piece of tech.
16:54 - 16:59

So, the other one we've got is, Google. Fucking
16:59 - 17:04

annoying. And you'll see why in a second.
So,
17:04 - 17:07

Google had this thing called BigTable that,
again, kind
17:07 - 17:10

of comes out of the internal research. You
have
17:10 - 17:14

access to it through some of their cloud properties.
17:14 - 17:17

As you can see, it's got, it's actually a
17:17 - 17:21

sparse distributed multidimensional sorted
map, which is good, I
17:21 - 17:24

guess. I imagine. It's awesome.
17:24 - 17:28

The stuff they're doing with this is crazy.
So
17:28 - 17:30

this is actually a, all, a couple years old
17:30 - 17:33

I think now. Some of these, some of the
17:33 - 17:37

information, so. Hundreds of petabytes of
data, you know,
17:37 - 17:41

ridiculous numbers of operations a second.
You do not
17:41 - 17:43

have any of these problems.
17:43 - 17:47

So, then they, they took this stuff, they
were
17:47 - 17:50

like, ah, we've got BigTable. You know, that
was,
17:50 - 17:53

that was fucking easy. Whatever. And so now
they've
17:53 - 17:55

got two other things. They've got one called
Spanner
17:55 - 18:00

and one called F1, where they're basically doing, you
doing, you
18:00 - 18:07

know, proper, sort of relational looking data
across multiple
18:07 - 18:10

data centers and, you know, and. They're kind
of
18:10 - 18:13

really pushing the boundaries of some of that
CAP
18:13 - 18:15

stuff that's going on.
18:15 - 18:18

But all you need is a GPS in every
18:18 - 18:21

server, a couple of atomic clocks in each
data
18:21 - 18:27

center, and you, great. So, Google's basically
telling everyone
18:27 - 18:30

to, you know, just fuck off.
18:30 - 18:35

So, another one that I really, I really like,
18:35 - 18:39

and have used a long, a long time ago
18:39 - 18:46

in, in tech land, tech time, is Cassandra.
Cassandra
18:46 - 18:50

is a column-oriented database. Eventually
it's awesome. It's really
18:50 - 18:54

all about eventual consistency.
18:54 - 18:58

And you can see here, this is a man,
18:58 - 18:59

he eventually gets it right. So that's well
done
18:59 - 19:02

to him there. So Cassandra's a lot like that.
19:02 - 19:06

And, again, you know, the cool thing is, it's
19:06 - 19:11

a sparse distributed multidimensional sorted map. It, when
map. It, when
19:11 - 19:13

I was working with it, you, it was, you
19:13 - 19:16

had, you described your tables kind of thing
in
19:16 - 19:20

XML and hated yourself, and then every time
something
19:20 - 19:23

changed you rebooted the server and that took
a while
19:23 - 19:27

and, yeah, the whole thing was really difficult.
19:27 - 19:31

What it basically does is it takes the availability
19:31 - 19:34

side of the question. Like, that's its world
model.
19:34 - 19:38

It has, again, a very simple clustering system.
New
19:38 - 19:41

nodes, add in, the data gets streamed out.
It
19:41 - 19:46

has a data model that is really complicated,
and
19:46 - 19:48

I, even though I've used it, it's really hard
19:48 - 19:51

to explain how it actually works.
19:51 - 19:55

So column databases basically kind of invert
the, the
19:55 - 19:57

whole table structure that you're used to
from the
19:57 - 20:01

relational world. And the advantage is that,
for some
20:01 - 20:04

types of data, and for some queries, it is
20:04 - 20:08

crazy blazing fast, cause you can just. Time
series
20:08 - 20:09

are always a good one, where you can just
20:09 - 20:11

have long streams of time series and it will
20:11 - 20:13

actually put that on disk all next to each
20:13 - 20:16

other and you can just pull it all out.
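
A toy Python model of the "sorted map" idea for time series, offered as an illustration only. It is not Cassandra's storage engine; the WideRow class and its put and slice methods are made-up names. Each row key holds many columns kept in sorted order, so a range of timestamps sits together and can be pulled out as one slice:

```python
import bisect


class WideRow:
    """One row key holding many columns, kept sorted by column name."""

    def __init__(self):
        self._names = []    # column names in sorted order
        self._values = {}   # column name -> value

    def put(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        # Return every (name, value) with start <= name <= end, in order.
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# The table is sparse: each row only stores the columns it actually has.
table = {}
readings = table.setdefault("sensor-1", WideRow())
readings.put(1003, 21.0)
readings.put(1001, 20.5)
readings.put(1002, 20.7)
```

Here readings.slice(1001, 1002) returns the two earliest readings in timestamp order, even though they were written out of order.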
20:16 - 20:19

The cool thing in the new versions of Cassandra
20:19 - 20:22

is that they've abstracted all of that out,
and
20:22 - 20:26

you actually just get tables, so you can create
20:26 - 20:28

a table and give it a primary key, and
20:28 - 20:32

under the covers, it's setting up rows and
column
20:32 - 20:35

families and columns and all of, all of these
20:35 - 20:39

really abstract concepts, and they've completely
made some of
20:39 - 20:41

that go away. Which is really nice.
20:41 - 20:44

So you end up with something that looks a
20:44 - 20:49

lot like just SQL and, you know, a normal
20:49 - 20:53

table kind of structure. It's just clustering
out lots
20:53 - 20:55

of nodes. It's very tunable, so you can actually
20:55 - 20:58

set up, you know, it writes to a node
20:58 - 21:00

and you can say, actually write to five nodes
21:00 - 21:02

and that's a quorum and now we're cool. So
21:02 - 21:06

you can tune how much redundancy you have.
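
The arithmetic behind that tuning is small enough to write down. The talk only says you can require more nodes to acknowledge a write; the majority formula and the overlap rule R + W > N below are the standard quorum conditions, added as background, and the function names are hypothetical:

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n, w, r):
    """True when every set of r replicas overlaps every set of w replicas.

    n: replicas per piece of data
    w: replicas that must acknowledge a write
    r: replicas consulted on a read
    """
    return r + w > n
```

With three replicas, writing to a quorum of two and reading from two always overlaps on at least one up-to-date replica; writing to one and reading from one does not, which is the fast but eventually consistent setting.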
21:06 - 21:13

So that's kind of cool. That is a reminder.
21:13 - 21:18

That went cold really fast. Thank you.
21:18 - 21:21

So, the next one on our list is Memcache.
21:21 - 21:24

Memcache, there was, there was a talk earlier
in
21:24 - 21:28

the week that was describing using Memcache
and caching
21:28 - 21:30

and it, it had a very interesting observation,
which
21:30 - 21:33

was, it just works. He didn't even know what
21:33 - 21:37

version he was running in production, cause
neh. Doesn't
21:37 - 21:39

matter. That API has been stable for ages.
21:39 - 21:42

And I know, I know what you're saying. It's
21:42 - 21:46

not a database. It's a cache. Technically
true. But
21:46 - 21:48

it's interesting to think about, because the
moment you
21:48 - 21:51

add caching, even if you've been ignoring
the fact
21:51 - 21:55

that you had a distributed system before,
with caching
21:55 - 21:57

you now really have a distributed system.
You've got
21:57 - 22:00

data in one thing that may or may not
22:00 - 22:03

be fresh, and you've got data in your database
22:03 - 22:05

that, you know, you assume is up to date,
22:05 - 22:07

and now you've got a synchronization problem.
22:07 - 22:12

So, Memcache is actually really, you know,
it's, it's
22:12 - 22:17

just rock solid, old as the hills technology,
completely
22:17 - 22:22

simple. The API is everywhere. Lots of people
actually
22:22 - 22:26

have made their, you know, key-value store
they made
22:26 - 22:28

in the hacknight, which, you know, is a useful
22:28 - 22:31

hobby if you want to annoy everyone.
22:31 - 22:33

You have the, their API is actually the Memcached
22:33 - 22:36

API. It's got a handful of things. You can
22:36 - 22:40

set a key, you can replace one. It does
22:40 - 22:44

have some atomic operations so you can increment and
increment and
22:44 - 22:46

decrement so that there is some flexibility
to actually
22:46 - 22:52

do a little bit of data storage in a,
22:52 - 22:56

in a more traditional sense.
22:56 - 22:59

It's actually a client-server model. Your,
your driver is
22:59 - 23:02

responsible for the clustering in a way, so
you
23:02 - 23:07

can have multiple Memcache nodes and the,
the hashing
23:07 - 23:11

algorithm determines which node, which node
a particular piece
23:11 - 23:13

of data is gonna be on.
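
A rough Python sketch of that client-side sharding, with in-process dictionaries standing in for real cache servers. The ShardedCache class, the server addresses, and the choice of CRC32 with a simple modulo are assumptions for the example; real client libraries differ in the hashing scheme they use:

```python
import zlib


class ShardedCache:
    """The client, not the servers, decides where each key lives."""

    def __init__(self, servers):
        self.servers = list(servers)
        # Stand-in for network connections: one dict per server.
        self.stores = {server: {} for server in self.servers}

    def server_for(self, key):
        # Hash the key and map it onto one of the configured servers.
        index = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        return self.servers[index]

    def set(self, key, value):
        self.stores[self.server_for(key)][key] = value

    def get(self, key):
        return self.stores[self.server_for(key)].get(key)
```

The servers never talk to each other here, which matches the point above: there is no cluster state to coordinate, only a hash function that every client agrees on.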
23:13 - 23:16

That has the property of making it very, very
23:16 - 23:19

simple to use. And there's no cluster state.
There's
23:19 - 23:22

no coordination that nodes have. Like, a lot
of
23:22 - 23:24

the heavy lifting all of these other things
are
23:24 - 23:28

doing is about coordinating around all of
that information.
23:28 - 23:30

There's a whole bunch of awesome stuff just
baked
23:30 - 23:35

into Rails. So you can just easily cache into
23:35 - 23:39

Memcache, or your normal Rails fragment caching. All of
All of
23:39 - 23:41

that kind of stuff.
23:41 - 23:42

And there's even some things we can, you can
23:42 - 23:46

actually put, push that into ActiveRecord
and have, have
23:46 - 23:48

caching at that level as well.
23:48 - 23:51

Redis is an interesting one for the, the Rails
23:51 - 23:57

community. Cause it's basically a queue, now.
Everyone seems
23:57 - 24:01

to be running Resque, Sidekiq, and, you know, Redis
Redis
24:01 - 24:06

is, again, one of those just pieces of technology
24:06 - 24:12

that is beautifully engineered, incredibly
simple, incredibly robust. The
24:12 - 24:19

maintainers are just absolute, you know, scientists,
I guess.
24:19 - 24:23

Just a whole other level of crazy algorithm
stuff.
24:23 - 24:25

And they make blog posts and, you know, I'm
24:25 - 24:32

so stupid. I don't understand what you're
talking about.
24:32 - 24:36

It's really fast, it's slightly hard to distribute.
A
24:36 - 24:39

lot of that's in the pipeline with Redis.
It's
24:39 - 24:42

much more, it's much more simple to, to stick
24:42 - 24:46

it on one node and increase the RAM. It's
24:46 - 24:49

mu, more complicated than Memcache. It's essentially
just an
24:49 - 24:52

in-memory cache. It has a bunch of really
interesting
24:52 - 24:57

data structures, though. I think if you've
been confused
24:57 - 24:59

all week, now, which country I'm from, whether
I
24:59 - 25:02

say dayta or dahta, so now I just changed
25:02 - 25:04

them randomly.
25:04 - 25:08

So, you can, you have hashes, you have lists,
25:08 - 25:10

you have strings. You've got all sorts of
other
25:10 - 25:14

interesting things. You can do optimistic
locking and have,
25:14 - 25:18

you know, a bunch of operations that are essentially
25:18 - 25:22

batched. You can do sort of, there's long
ways
25:22 - 25:25

of doing this kind of stuff. It's Resque and
25:25 - 25:29

Sidekiq both just make this, make it super
simple
25:29 - 25:31

to do background tasks with Rails and install
the
25:31 - 25:37

gem, have a worker, and it's all just magic.
25:37 - 25:40

It has Lua baked in, which is a whole
25:40 - 25:42

other thing. But Lua is a really cool programming
25:42 - 25:45

language that is designed for embeddability.
But one of
25:45 - 25:47

the things that happens is you can actually
write
25:47 - 25:51

little Lua scripts that end up going
into
25:51 - 25:55

the Redis server to do more complex operations.
So,
25:55 - 25:57

in this case, this is a little script that
25:57 - 26:00

grabs something off a sorted hash and then
deletes
26:00 - 26:03

them and then returns the first thing, like,
then
26:03 - 26:06

returns what we had done. But it's, it's an
26:06 - 26:10

atomic kind of transactional way.
26:10 - 26:13

And, good news everybody! We've just invented
stored procedures.
26:13 - 26:16

So that's very exciting. Except now they're
much more
26:16 - 26:19

hip, because it's an in-memory database with
a language
26:19 - 26:23

no one's heard of. So. We are rocking it.
26:23 - 26:28

Also, maybe use a queue. Just, I know it's
26:28 - 26:33

crazy. But, if you're actually queuing, using
Redis as
26:33 - 26:37

your queue, maybe you have a queuing problem
and
26:37 - 26:40

you have queues. They exist. They're a thing.
It's
26:40 - 26:41

ridiculous. I know.
26:41 - 26:46

So, RabbitMQ is sort of the gold standard,
and
26:46 - 26:49

Kafka is another one that was talked about
earlier
26:49 - 26:51

this week, and it is crazy cool.
26:51 - 26:56

Where am I? Man. All right. Just gonna stretch.
26:56 - 26:59

I've lost count, so I don't know, now I'm
26:59 - 27:02

just gonna talk faster. Cool.
27:02 - 27:08

Neo4j is really interesting. It's a graph
database. That's.
27:08 - 27:13

It's slightly hard to explain. But you, the
way
27:13 - 27:15

I actually think about it, we'll just jump
straight
27:15 - 27:17

to here, is it's almost but not quite entirely
27:17 - 27:23

unlike a relational database. The difference,
essentially, is that
27:23 - 27:27

it is optimized for the connections rather
than aggregated
27:27 - 27:32

data. So relational database, you, puts things
in, in
27:32 - 27:33

a way where you can get a sum and
27:33 - 27:35

a count and like, that's kind of the heritage
27:35 - 27:37

of that kind of world view.
27:37 - 27:40

Whereas what the Neo4j people are doing is
actually
27:40 - 27:45

thinking about connections between pieces
of data, and for
27:45 - 27:49

some use cases, this is actually really, really
amazing
27:49 - 27:52

stuff. So you have, a graph is basically a
27:52 - 27:57

collection of nodes, and those nodes can have
relationships
27:57 - 27:59

between each other, and then a node just has
27:59 - 28:01

properties.
28:01 - 28:04

It's essentially an object database in a way.
It's
28:04 - 28:06

like very similar to the way that we think
28:06 - 28:08

about objects. So it has some really nice
properties
28:08 - 28:12

if you're working in a language like Ruby.
And
28:12 - 28:17

then it just does stuff that, you know, in
28:17 - 28:19

a really intuitive way. So if we've got a
28:19 - 28:22

graph of movies and actors, you actually define
a
28:22 - 28:26

relationship by name. Then an actor acts in
a
28:26 - 28:29

movie. And then when you were doing your queries,
28:29 - 28:33

this is a language called Cypher, you actually,
that's
28:33 - 28:34

a first-class thing.
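
To make "relationships as first-class things" concrete, here is a small in-memory property graph in Python. It is only a sketch of the data model, not Neo4j or Cypher; the Graph class, the ACTED_IN relationship name, and the example actor and movie are hypothetical:

```python
class Graph:
    def __init__(self):
        self.nodes = {}   # node id -> dict of properties
        self.edges = []   # (source id, relationship name, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def relate(self, src, rel, dst):
        # The relationship carries its own name instead of being
        # implied by a foreign key column.
        self.edges.append((src, rel, dst))

    def outgoing(self, src, rel):
        return [d for s, r, d in self.edges if s == src and r == rel]

    def incoming(self, dst, rel):
        return [s for s, r, d in self.edges if d == dst and r == rel]


graph = Graph()
graph.add_node("alice", kind="actor", name="Alice Example")
graph.add_node("movie-1", kind="movie", title="Example Movie")
graph.relate("alice", "ACTED_IN", "movie-1")
```

Asking graph.incoming("movie-1", "ACTED_IN") finds the cast by following the named relationship directly, where the relational version would go through a join table of actor ids and movie ids.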
28:34 - 28:36

Whereas in a relational world, you're, you're
using a
28:36 - 28:39

foreign key, which has no semantic meaning
at all.
28:39 - 28:41

You, you just have to remember that, you know,
28:41 - 28:43

an actor, you know, there's a table with an
28:43 - 28:46

actor id, and a movie id, and we're joining
28:46 - 28:50

across somewhere. Whereas Neo4j actually makes
those relationships first
28:50 - 28:53

class citizens. So if you've got problems
that are
28:53 - 29:00

graph problems, like social network friend
cloud stuff, some
29:02 - 29:05

of that stuff, Neo4j just makes trivially
easy in
29:05 - 29:06

a way that you would have had to do
29:06 - 29:10

a recursive self-join in Postgres and hate
your life
29:10 - 29:12

and, you know.
29:12 - 29:17

Couch is cool. I guess. Pretty much that's
my
29:17 - 29:21

opinion of it. It's really awesome. But, you
can't
29:21 - 29:26

query it. So cool.
29:26 - 29:28

That's it. That's a slight disservice to Couch
but,
29:28 - 29:32

you know, whatever. MongoDB, as we all know,
it
29:32 - 29:35

is webscale and that's excellent. If you think
of
29:35 - 29:39

it as Redis for JSON, that's good. Sixty percent
29:39 - 29:41

of the time, it works every time. Everyone's
familiar
29:41 - 29:43

with that.
29:43 - 29:47

So, the thing that's really, I mean, Mongo,
it
29:47 - 29:51

reminds me of My, MySQL. Like, Mongo is kind
29:51 - 29:54

of terrible, but MySQL was kind of terrible,
too.
29:54 - 29:57

Like, when that came out, it didn't do transactions,
29:57 - 30:00

for example, and I, I was working in enterprise-y
30:00 - 30:04

land, and transactions are actually a thing.
And, you're
30:04 - 30:09

like, you script kiddies with your database.
30:09 - 30:11

So Mongo feels like that, and not, you know,
30:11 - 30:14

what we learned is, if you make something
that's
30:14 - 30:18

awesome and useful and everywhere and ubiquitous
and it
30:18 - 30:21

doesn't work, you can make it work. And eventually,
30:21 - 30:23

you know, MySQL is a real database. So Mongo
30:23 - 30:25

feels a bit like that. It's come a massive
30:25 - 30:31

way, right about really early on with very
early
30:31 - 30:32

versions.
30:32 - 30:35

It stores JSON. Well, sort of. It stores
30:35 - 30:40

BSON, anyway. That's just binary JSON basically.
And it's
30:40 - 30:42

a, it's a really beautiful model to work with
30:42 - 30:45

in a development cycle, which I think
is
30:45 - 30:47

why there's, why there's so much appeal. You've
just
30:47 - 30:51

got kind of, people treat it like an object
30:51 - 30:54

database. You've just got an object that's
in there,
30:54 - 30:56

and you can pull out objects and manipulate
them
30:56 - 31:00

and do all of this kind of crazy stuff.
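
[Editor's sketch] The "object that's in there" workflow can be shown with a toy document collection. This is not MongoDB or its driver: `TinyCollection`, `insert` and `find` are hypothetical names, and plain JSON from the standard library stands in for BSON.

```python
import copy
import json


class TinyCollection:
    """Hypothetical in-memory stand-in for a document collection."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        # Round-trip through JSON so only JSON-shaped data is stored,
        # and so the stored copy is independent of the caller's object.
        self._docs.append(json.loads(json.dumps(doc)))

    def find(self, query):
        """Return copies of documents whose top-level fields match `query`."""
        return [
            copy.deepcopy(doc)
            for doc in self._docs
            if all(doc.get(key) == value for key, value in query.items())
        ]


users = TinyCollection()
users.insert({"name": "ada", "tags": ["databases"], "address": {"country": "AU"}})
users.insert({"name": "bob", "tags": [], "address": {"country": "US"}})

# Pull a whole nested object out and manipulate it like any other object.
ada = users.find({"name": "ada"})[0]
ada["tags"].append("ruby")
```

Nothing here says anything about durability or clustering, which is the part the talk warns about.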
31:00 - 31:05

The people who know what they're talking about,
though,
31:05 - 31:08

with distributed systems, if the reason you're
using Mongo
31:08 - 31:10

is because you think it's a panacea for all
31:10 - 31:14

of this, you know, we need to be webscale
31:14 - 31:17

and do all of this kind of stuff, that
31:17 - 31:19

is not a good reason to use it. Cause
31:19 - 31:22

there, there's still a lot of operational
problems and,
31:22 - 31:24

and stuff going on.
31:24 - 31:30

This, this one is interesting. It's essentially,
RethinkDB is
31:30 - 31:33

coming from the PostGres world view. Cause
PostGres made,
31:33 - 31:37

you know, MySQL was like, whatever, we'll
fix it.
31:37 - 31:40

PostGres was like, we'll do it right and it,
31:40 - 31:42

you can't use it cause it's so slow, but
31:42 - 31:44

at least it's correct. And they took lots
of
31:44 - 31:47

iterations to make it usable. So Rethink is
kind
31:47 - 31:48

of that school of thought. It's like, we're
gonna
31:48 - 31:51

make it all correct first, and then we'll
make
31:51 - 31:56

it usable. So it's very similar idea. JSON,
you
31:56 - 31:59

know, they're trying to make it operationally
great with
31:59 - 32:03

automatic clustering and all this kind of
stuff. You
32:03 - 32:05

know. Who knows what it is and how it's
32:05 - 32:07

actually gonna behave in the real world. It's
still
32:07 - 32:09

a very early piece of tech.
32:09 - 32:11

And that leads me into, there's a whole world
32:11 - 32:15

of databases around what I'm loosely calling
the commercial
32:15 - 32:20

fringe. So Couchbase is the Couch guys and
sort
32:20 - 32:24

of some commercial Memcached guys who got
together to
32:24 - 32:28

make a hybrid something. Aerospike is, their
marketing is
32:28 - 32:32

great. That's about the best you can say about
32:32 - 32:32

it.
32:32 - 32:33

So there's a whole bunch of people trying
to
32:33 - 32:37

solve these problems in interesting ways.
But all of
32:37 - 32:41

these ones cost money and, you know, they're,
the
32:41 - 32:42

mileage varies and all of that kind of stuff.
32:42 - 32:44

The cool thing about open source ones is
you
32:44 - 32:45

get it and you try it and you hate
32:45 - 32:47

it and you go back to PostGres so it's
32:47 - 32:48

all fine.
32:48 - 32:53

So, HyperDex. This is my favorite. Because
they have
32:53 - 32:58

HyperSpace Hashing, and it is so cool. These
guys
32:58 - 33:02

are making some really broad, amazing claims
about the,
33:02 - 33:07

the kind of things that they can do. Crazy
33:07 - 33:09

fast. It's, it's a key-value store but it
will
33:09 - 33:12

index, you know, it's not just a key but
33:12 - 33:14

it will index the properties of a value. So
33:14 - 33:17

now you can do que, you know, genuine queries
33:17 - 33:21

into the structure of objects that you're
storing.
33:21 - 33:23

They've got a whole bunch of papers around
what
33:23 - 33:27

they're doing. So, you can read that as, who
33:27 - 33:30

knows what it means. It maps objects to coordinates
33:30 - 33:35

in a multi-dimensional Euclidean space. HyperSpace.
And I'm like.
33:35 - 33:37

Take my money!
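
[Editor's sketch] One way to read "maps objects to coordinates in a multi-dimensional space" is: give every attribute its own axis and hash each attribute value to a position on that axis. The sketch below is only a toy reading of that sentence, not HyperDex's actual algorithm or API; the attribute names, the bucket count and all function names are assumptions.

```python
import hashlib
from itertools import product

BUCKETS_PER_AXIS = 4  # assumed grid resolution per dimension
ATTRIBUTES = ("first_name", "last_name", "city")  # one axis per attribute


def axis_coordinate(value):
    """Hash a single attribute value to a bucket on its axis."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS_PER_AXIS


def to_coordinates(obj):
    """Map an object to a point in the grid, one coordinate per attribute."""
    return tuple(axis_coordinate(obj[name]) for name in ATTRIBUTES)


def candidate_regions(query):
    """Grid cells that could hold objects matching a partial query.

    Attributes named in the query pin their axis to a single bucket;
    unspecified attributes leave that axis unconstrained.
    """
    axes = []
    for name in ATTRIBUTES:
        if name in query:
            axes.append([axis_coordinate(query[name])])
        else:
            axes.append(range(BUCKETS_PER_AXIS))
    return set(product(*axes))


person = {"first_name": "Ada", "last_name": "Lovelace", "city": "London"}
point = to_coordinates(person)
regions_for_city = candidate_regions({"city": "London"})
```

In this toy grid, a query on `city` alone only has to look at 16 of the 64 cells, which is the intuition behind being able to query on the properties of a value and not just on its key.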
33:37 - 33:41

And there's a, there's a picture of HyperSpace.
And,
33:41 - 33:44

like, I've read that like eight times. I don't
33:44 - 33:50

understand what's going on. But if, it does
seem
33:50 - 33:52

to be true. They're trying to solve some of
33:52 - 33:55

these problems and, you know, they call themselves
like
33:55 - 34:00

a second generation NoSQL thing, in a similar
way
34:00 - 34:02

to Google, you know, kind of taking all of
34:02 - 34:05

this stuff and trying to push the science
underneath
34:05 - 34:07

it forward.
34:07 - 34:10

So you can, you know, it's got a Ruby
34:10 - 34:13

client. You can use it now. It's got, just,
34:13 - 34:18

normal key-value. It's got atomic stuff. You
can do
34:18 - 34:23

conditional puts, so this is some code that's
basically
34:23 - 34:27

is only updating if the, only updating the
current
34:27 - 34:32

balance if the, updating the balance if the
current
34:32 - 34:34

balance is what we think it is. Otherwise
some
34:34 - 34:36

other thread has updated it.
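
[Editor's sketch] The slide's code is not in the transcript, so the following is a stand-in that shows the same idea: write the new balance only if the stored balance is still the one we read. It is an in-memory toy, not the HyperDex client; `TinyStore`, `cond_put` and `withdraw` are hypothetical names.

```python
import threading


class TinyStore:
    """Hypothetical in-memory key-value store with a conditional put."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data[key]

    def cond_put(self, key, expected, new_value):
        """Store new_value only if the current value equals expected."""
        with self._lock:
            if self._data.get(key) != expected:
                return False  # someone else changed it since we read it
            self._data[key] = new_value
            return True


def withdraw(store, account, amount):
    """Retry until the conditional put succeeds or funds run out."""
    while True:
        balance = store.get(account)
        if balance < amount:
            return False
        if store.cond_put(account, balance, balance - amount):
            return True


store = TinyStore()
store.put("acct", 100)
first_ok = store.cond_put("acct", 100, 70)   # balance was 100, so this applies
stale_ok = store.cond_put("acct", 100, 50)   # balance is now 70, so this is refused
```

The retry loop in `withdraw` is the usual companion to a conditional put: re-read, recompute, and try again if another writer got there first.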
34:36 - 34:39

So there's some really interesting stuff they
can do.
34:39 - 34:44

And they're guaranteeing those operations
across the cluster. And
34:44 - 34:46

it's also got a transactional engine as well,
so
34:46 - 34:47

that's really exciting.
34:47 - 34:52

Running out of time. HBase and Hadoop. You
don't
34:52 - 34:55

have any of these problems. Don't worry about
it.
34:55 - 34:56

You probably don't want to have any of these
34:56 - 35:00

problems. Cause this just ends up, you need
to
35:00 - 35:04

install every fucking thing the Apache foundation
has ever
35:04 - 35:08

made. And this isn't even the full list. This
35:08 - 35:10

is like, you probably need those.
35:10 - 35:13

I have a friend, he's a bit of a
35:13 - 35:17

dick, and he, he calls it, cause he, he
35:17 - 35:20

works in an actual big data organization,
and he
35:20 - 35:22

just, he goes, oh, you people with your small
35:22 - 35:26

to medium data. So, yeah, like, most of us,
35:26 - 35:28

we don't have big data in any sense of
35:28 - 35:31

the word, really. Like, if, if it's got GB
35:31 - 35:35

on the end of it, meh. You're not there
35:35 - 35:36

yet.
35:36 - 35:41

So, again, this is just you know, Facebook
is
35:41 - 35:42

using the hell out of this stuff, and they're
35:42 - 35:45

just like, this is all out of date. They're
35:45 - 35:50

like now just, they can't buy hard disks fast
35:50 - 35:54

enough. It's crazy. Yeah. There was a punch
line
35:54 - 35:56

at the end of all of that.
35:56 - 35:58

But my friend, the guy who I said was
35:58 - 36:01

a bit of a dick, he, he recommends having
36:01 - 36:04

a look at this. And this is his quote,
36:04 - 36:07

if you want to appear really cool and underground,
36:07 - 36:09

then I reckon the next big thing is the
36:09 - 36:12

Berkeley Data Analytics Stack. So, there's
a whole bunch
36:12 - 36:16

of people who are looking at that, you know,
36:16 - 36:18

crazy big data situation and trying to work
out
36:18 - 36:22

what that means and what the future is.
36:22 - 36:25

And so Apache and Berkeley are kind of in
36:25 - 36:27

a cold war for that at the moment. And
36:27 - 36:29

then there's heaps of people in the enterprise
space
36:29 - 36:32

because you can sell lots of products and
or
36:32 - 36:35

services to large companies who think they
have a
36:35 - 36:38

big data problem. So that's cool.
36:38 - 36:40

That's fine. This isn't, this is just a little
36:40 - 36:45

thing that's an embeddable document key-value
store that you
36:45 - 36:47

can, it's just kind of a fun thing and
36:47 - 36:49

has an API that looks very similar to the
36:49 - 36:53

Mongo one. And it just sits in process.
36:53 - 36:56

Oh, ElasticSearch. Every time I use it, I
think,
36:56 - 37:01

why can you not be my database? It's awesome.
37:01 - 37:03

But it loses a couple of points there because
37:03 - 37:09

of its configurationability. It went, it works
when you
37:09 - 37:11

know how to make it work, and it's crazy
37:11 - 37:13

complicated sometimes.
37:13 - 37:20

So anyway. Thirty. Four minutes over technically,
I think.
37:20 - 37:22

Yeah. So that's good.
37:22 - 37:29

That's databases in a nutshell. I'm Toby Hede.
I'm
37:29 - 37:31

around the conference if you want to talk
about
37:31 - 37:35

databases. I think of myself as a lapa-, a
37:35 - 37:39

lap- a butterfly collector, I guess, is what
I'm
37:39 - 37:41

looking for, of databases.
37:41 - 37:46

Yeah. So come and say hi. Cool.

Title:: RailsConf 2014 - An Ode to 17 Databases in 33 Minutes by Toby Hede
Duration:: 38:13

