-
TOM DALE: Hey, you guys ready?
-
Thank you guys so much for coming.
-
This is awesome. I was really,
-
I, when they were putting together the schedule,
-
I said, make sure that you put us down
-
in the Caves of Moria. So thank you
-
guys for coming down and making it.
-
I'm Tom. This is Yehuda.
-
YEHUDA KATZ: When people told me I was signed up
-
to do a back-to-back talk, I don't know what
-
I was thinking.
-
T.D.: Yup. So. We want to talk to you
-
today about, about Skylight. So, just a little
bit
-
before we talk about that, I want to talk
-
about us a little bit.
-
So, in 2011 we started a company called Tilde.
-
It's this shirt. It may have made me self-conscious,
-
because this is actually a first edition and
it's
-
printed off-center. Well, either I'm off-center
or the shirt's
-
off-center. One of the two.
-
So we started Tilde in 2011, and we had
-
all just left a venture-backed company, and
that was
-
a pretty traumatic experience for us because
we spent
-
a lot of time building the company and then
-
we ran out of money and sold to Facebook
-
and we really didn't want to repeat that experience.
-
So, we decided to start Tilde, and when we
-
did it, we decided to be. DHH and the
-
other people at Basecamp were talking about,
you know,
-
being bootstrapped and proud. And that was
a message
-
that really resonated with us, and so we wanted
-
to capture the same thing.
-
There's only one problem with being bootstrapped
and proud, and
-
that is, in order to be both of those
-
things, you actually need money, it turns
out. It's
-
not like you just say it in a blog
-
post and then all of the sudden you are
-
in business.
-
So, we had to think a lot about, OK,
-
well, how do we make money? How do we
-
make money? How do we make a profitable and,
-
most importantly, sustainable business? Because
we didn't want to
-
just flip to Facebook in a couple years.
-
So, looking around, I think the most obvious
thing
-
that people suggested to us is, well, why
don't
-
you guys just become Ember, Inc.? Raise a
few
-
million dollars, you know, build a business
-
model that's mostly prayer. But that's not really
how we
-
want to think about building open source communities.
-
We don't really think that that necessarily
leads to
-
the best open source communities. And if you're
interested
-
more in that, I recommend Leah Silber, who
is
-
one of our co-founders. She's giving a talk
this
-
afternoon. Oh, sorry. Friday afternoon, about
how to build
-
a company that is centered on open source.
So
-
if you want to learn more about how we've
-
done that, I would really suggest you go check
-
out her talk.
-
So, no. So, no Ember, Inc. Not allowed.
-
So, we really want to build something that
leveraged
-
the strengths that we thought that we had.
One,
-
I think most importantly, a really deep knowledge
of
-
open source and a deep knowledge of the Rails
-
stack, and also Carl, it turns out, is really,
-
really good at building highly scalable big
data sys-
-
big data systems. Lots of Hadoop in there.
-
So, last year at RailsConf, we announced the
private
-
beta of Skylight. How many of you have used
-
Skylight? Can you raise your hand if you have
-
used it? OK. Many of you. Awesome.
-
So, so Skylight is a tool for profiling and
-
measuring the performance of your Rails applications
in production.
-
And, as a product, Skylight, I think, was
built
-
on three key breakthroughs.
-
We didn't want to
ship a
-
product that was incrementally better than
the competition. We
-
wanted to ship a product that was dramatically
better.
-
Quantum leap. An order of magnitude better.
-
And, in order to do that, we spent a
-
lot of time thinking about it, about how we
-
could solve most of the problems that we saw
-
in the existing landscape. And delivering a
product that does
-
that is predicated on these three breakthroughs.
-
So, the first one I want to talk about
-
is, honest response times. Honest response
times. So, DHH
-
wrote a blog post on what was then the
-
37Signals blog, now the Basecamp blog, called
The problem
-
with averages. How many of you have read this?
-
Awesome.
-
For those of you that have not, how many
-
of you hate raising your hands at presentations?
-
So, for those of you that-
-
Y.K.: Just put a button in every seat. Press
-
this button-
-
T.D.: Press the button if you have. Yes. Great.
-
So, if you read this blog post, the way
-
it opens is, Our average response time for
Basecamp
-
right now is 87ms... That sounds fantastic.
And it
-
easily leads you to believe that all is well
-
and that we wouldn't need to spend any more
-
time optimizing performance.
-
But that's actually wrong. The average number
is completely
-
skewed by tons of fast responses to feed requests
-
and other cached replies. If you have 1000
requests
-
that return in 5ms, and then you have
-
200 requests taking 2000ms, or two seconds,
you can
-
still report a respectable average of 337ms.
-
That's useless.
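-
A quick back-of-the-envelope version of that example, as a minimal Ruby sketch (numbers taken from the quote above):

    fast_ms = 1000 * 5     # 1000 requests at 5ms each
    slow_ms = 200 * 2000   # 200 requests at 2000ms each
    average = (fast_ms + slow_ms) / 1200.0
    # => 337.5 -- a "respectable" average, even though one in six
    #    requests took two full seconds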
-
So what does DHH say that we need? DHH
-
says the solution is histograms. So, for those
of
-
you like me who were sleeping through your
statistics
-
class in high school, and college, a brief
primer
-
on histograms. So a histogram is very simple.
Basically,
-
you have a, you have a series of numbers
-
along some axis, and every time you, you're
in
-
that number, you're in that bucket, you basically
increment
-
that bar by one.
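-
A minimal sketch of that bucketing idea in Ruby (the bucket boundaries here are made up for illustration):

    BUCKETS = [0, 100, 250, 500, 1000, 2000]  # ms

    def bucket_for(duration_ms)
      BUCKETS.reverse.find { |b| duration_ms >= b }
    end

    histogram = Hash.new(0)
    [42, 480, 495, 470, 1900].each do |ms|   # sample response times
      histogram[bucket_for(ms)] += 1         # "increment that bar by one"
    end
    # histogram => {0=>1, 250=>3, 1000=>1}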
-
So, this is an example of a histogram of
-
response times in a Rails application. So
you can
-
see that there's a big cluster in the middle
-
around 488ms, 500ms. This isn't a super speedy
app
-
but it's not the worst thing in the world.
-
And they're all clustered, and then as you
kind
-
of move to the right you can see that
-
the response times get longer and longer and
longer,
-
and as you move to the left, response times
-
get shorter and shorter and shorter.
-
So, why do you want a histogram? What's the,
-
what's the most important thing about a histogram?
-
Y.K.: Well, I think it's because most requests
don't
-
actually look like this.
-
T.D.: Yes.
-
Y.K.: Most end points don't actually look
like this.
-
T.D.: Right. If you think about what your
Rails
-
app is doing, it's a complicated beast, right.
Turns
-
out, Ruby frankly, you can, you can do branching
-
logic. You can do a lot of things.
-
And so what that means is that one end
-
point, if you represent that with a single
number,
-
you are losing a lot of fidelity, to the
-
point where it becomes, as DHH said, useless.
So,
-
for example, in a histogram, you can easily
see,
-
oh, here's a group of requests and response
times
-
where I'm hitting the cache, and here's another
group
-
where I'm missing it. And you can see that
-
that cluster is significantly slower than
the faster cache-hitting
-
cluster.
-
And the other thing that you get when you
-
have a, a distribution, when you keep the
whole
-
distribution in the histogram, is you can
look at
-
this number at the 95th percentile, right.
So the
-
way to think about the performance
of
-
your web application is not the average, because
the
-
average doesn't really tell you anything.
You want to
-
think about the 95th percentile, because that's
not the
-
average response time, that's the average
worst response time
-
that a user is likely to hit.
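-
A rough sketch of reading the 95th percentile off a histogram, reusing the bucket shape from the earlier sketch: walk the buckets in order until 95% of the requests are covered.

    def percentile_bucket(histogram, pct)
      total  = histogram.values.inject(:+)
      target = total * pct
      seen   = 0
      histogram.sort.each do |bucket, count|  # walk buckets in order
        seen += count
        return bucket if seen >= target
      end
    end

    histogram = { 0 => 920, 250 => 60, 1000 => 15, 2000 => 5 }
    percentile_bucket(histogram, 0.95)  # => 250 (the 250ms bucket)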
-
And the thing to keep in mind is that
-
it's not as though a customer comes to your
-
site, they issue one request, and then they're
done,
-
right. As someone is using your website, they're
gonna
-
be generating a lot of requests. And you need
-
to look at the 95th percentile, because otherwise
every
-
request is basically you rolling the dice
that they're
-
not gonna hit one of those two second, three
-
second, four second responses, close the tab
and go
-
to your competitor.
-
So we look at this as, here's the crazy
-
thing. Here's what I think is crazy. That
blog
-
post that DHH wrote, it's from 2009. It's
been
-
five years, and there's still no tool that
does
-
what DHH was asking for. So, we, frankly,
we
-
smelled money. We were like, holy crap.
-
Y.K.: Yeah, why isn't that slide green?
-
T.D.: Yeah. It should be green and dollars.
I
-
think Keynote has the dollars, the make it
rain
-
effect I should have used. So we smelled blood
-
in the water. We're like, this is awesome.
There's
-
only one problem that we discovered, and that
is,
-
it turns out that building this thing is actually
-
really, really freaky hard. Really, really
hard.
-
So, we announced the private beta at RailsConf
last
-
year. Before doing that, we spent a year of
-
research spiking out prototypes, building
prototypes, building out the
-
beta. We launched at RailsConf, and we realized,
we
-
made a lot of mistakes. We made a lot
-
of errors when we were building this system.
-
So then, after RailsConf last year, we basically
took
-
six months to completely rewrite the backend
from the
-
ground up. And I think tying into your keynote,
-
Yehuda, we, we were like, oh. We clearly have
-
a bespoke problem. No one else is doing this.
-
So we rewrote our own custom backend. And
then
-
we had all these problems, and we realized
that
-
they had actually already all been solved
by the
-
open source community. And so we benefited
tremendously by
-
having a shared solution.
-
Y.K.: Yeah. So our first release of this was
-
really very bespoke, and the current release
uses a
-
tremendous amount of very off-the-shelf open
source projects that
-
just solved the particular problem very effectively,
very well.
-
None of which are as easy to use as
-
Rails, but all of which solve really thorny
problems
-
very effectively.
-
T.D.: So, so let's just talk, just for your
-
own understanding, let's talk about how most
performance monitoring
-
tools work. So the way that most of these
-
work is that you run your Rails app, and
-
running inside of your Rails app is some gem,
-
some agent that you install. And every time
the
-
Rails app handles a request, it generates
events, and
-
those events, which include information about
performance data, those
-
events are passed into the agent.
-
And then the agent sends that data to some
-
kind of centralized server. Now, it turns
out that
-
doing a running average is actually really
simple. Which
-
is why everyone does it. Basically you can
do
-
it in a single SQL query, right. All you
-
do is you have three columns in the database:
the
-
end point, the running average, and the number
of
-
requests. The average and the count are
the
-
two things that you need to keep a running
-
average, right.
-
So keeping a running average is actually really
simple
-
from a technical point of view.
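-
A minimal sketch of that running average in plain Ruby (names invented for illustration); note that the whole distribution is thrown away:

    def record(stats, endpoint, duration_ms)
      avg, count = stats.fetch(endpoint, [0.0, 0])
      stats[endpoint] = [(avg * count + duration_ms) / (count + 1), count + 1]
    end

    stats = {}
    record(stats, "PostsController#index", 45)
    record(stats, "PostsController#index", 2000)
    stats  # => {"PostsController#index"=>[1022.5, 2]}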
-
Y.K.: I don't think you could even do it in
JavaScript,
-
due to the lack of integers.
-
T.D.: Yes. You probably wouldn't want to do
any
-
math in JavaScript, it turns out. So, so we
-
took a little bit different approach. Yehuda,
do you
-
want to go over the next section?
-
Y.K.: Yeah. Sure. So, when we first started,
right
-
at the beginning, we basically did a similar
thing
-
where we had a bunch - your app creates
-
events. Most of those start off as being ActiveSupport::Notifications,
-
although it turns out that there's very limited
use
-
of ActiveSupport::Notifications so we had
to do some normalization
-
work to get them sane, which we're gonna be
-
upstreaming back into, into Rails.
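-
For context, a minimal sketch of consuming one of those events with the public ActiveSupport::Notifications API. The helper at the end is hypothetical, and none of Skylight's own normalization is shown:

    ActiveSupport::Notifications.subscribe("sql.active_record") do |name, started, finished, id, payload|
      duration_ms = (finished - started) * 1000
      # payload[:sql] holds the query text for this event
      record_trace_node(name, duration_ms, payload[:sql])  # hypothetical helper
    end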
-
But, one thing that's kind of unfortunate
about having
-
every single Rails app have an agent is that
-
you end up having to do a lot of
-
the same kind of work over and over again,
-
and use up a lot of memory. So, for
-
example, every one of these things is making
HTTP
-
requests. So now you have a queue of things
-
that you're sending over HTTP in every single
one
-
of your Rails processes. And, of course, you
probably
-
don't notice this. People are used to Rails
taking
-
up hundreds and hundreds of megabytes, so
you probably
-
don't notice if you install some agent and
it
-
suddenly starts taking twenty, thirty, forty,
fifty more megabytes.
-
But we really wanted to keep the actual memory
-
per process down to a small amount. So one
-
of the very first things that we did, we
-
even did it before last year, is that we
-
pulled out all that shared logic into a, a
-
separate process called the coordinator. And
the agent is
-
basically responsible simply for collecting
the, the trace, and
-
it's not responsible for actually talking
to our server
-
at all. And that means that the coordinator
only
-
has to do this queueing, keeping
-
a bunch of work in one place,
-
and it doesn't end up using up as much
-
memory.
-
And I think this, this ended up being very
-
effective for us.
-
T.D.: And I think that low overhead also allows
-
us to just collect more information, in general.
-
Y.K.: Yeah.
-
Now, after our first attempt, we started getting
a
-
bunch of customers that were telling us that
even
-
the separate - so the separate coordinator,
started as
-
a good thing and a bad thing. On the
-
one hand, there's only one of them, so it
-
uses up only one set of memory. On the
-
other hand, it's really easy for someone to
go
-
in and ps that process and see how many
-
megabytes of memory it's using.
-
So, we got a lot of additional complaints
that
-
said oh, your process is using a lot of
-
memory. And, I spent a few weeks, I, I
-
know Ruby pretty well. I spent a couple of
-
weeks. I actually wrote a gem called Allocation
Counter
-
that basically went in to try to pin point
-
exactly where the allocations were hap- coming
from. But
-
it turns out that it's actually really, really
hard
-
to track down exactly where allocations are
coming from
-
in Ruby, because something as simple as using
a
-
regular expression in Ruby can allocate match
objects that
-
get put back on the stack.
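-
A tiny illustration of that kind of hidden allocation: a successful Regexp match allocates a MatchData object behind the scenes even if you never ask for it.

    "GET /posts/1" =~ /(\d+)/
    $~        # => #<MatchData "1" 1:"1">  -- allocated for you
    $~.class  # => MatchData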
-
And so I was able to pare this down
-
to some degree. But I really discovered quickly
that,
-
trying to keep a lid on the memory allocation
-
by doing all the stuff in Ruby, is mostly
-
fine. But for our specific use case where
we
-
really wanna, we wanna be telling you, you
can
-
run the agent on your process, on your box,
-
and it's not gonna use a lot of memory.
-
We really needed something more efficient,
and our first
-
thought was, we'll use C++ or C. No problem.
-
C is, is native. It's great. And Carl did
-
the work. Carl is very smart. And then he
-
said, Yehuda. It is now your turn. You need
-
to start maintaining this. And I said, I don't
-
trust myself to write C++ code that's running
in
-
all of your guys's boxes, and not seg-fault.
So
-
I don't think that, that doesn't work for
me.
-
And so I, I noticed that Rust was coming
-
along, and what Rust really gives you is it
-
gives you the ability to write low-level code
a
-
la C or C++ with manual memory management,
that
-
keeps your memory allocation low and keeps
things speedy.
-
Low resource utilization. While also giving
you compile time
-
guarantees about not seg-faulting. So, again,
if your processes
-
randomly started seg-faulting because you
installed the agent, I
-
think you would stop being our customer very
quickly.
-
So having what, pretty much 100% guarantees
about that
-
was very important to us. And so that's why
-
we decided to use Rust.
-
I'll just keep going.
-
T.D.: Keep going.
-
Y.K.: So, we had this coordinator object.
And basically
-
the coordinator object is receiving events.
So the events
-
basically end up being these traces that describe
what's
-
happening in your application. And the next
thing, I
-
think our initial work on this we used JSON
-
just to send the payload to the server,
-
but we noticed that a lot of people have
-
really big requests. So you may have a big
-
request with a big SQL query in it, or
-
a lot of big SQL queries in it. Some
-
people have traces that are hundreds and hundreds
of
-
nodes long. And so we really wanted to figure
-
out how to shrink down the payload size to
-
something that we could be, you know, pumping
out
-
of your box on a regular basis without running
-
up your bandwidth costs.
-
So, one of the first things that we did
-
early on was we switched to using protobuf as
the
-
transport mechanism, and that really shrunk,
shrunk down the
-
payloads a lot. Our earlier prototypes for
actually collecting
-
the data were written in Ruby, but I think
-
Carl did, like, a weekend hack to just port
-
it over to Java and got, like, 200x performance.
-
And you don't always get 200x performance,
if mostly
-
what you're doing is database queries, you're
not gonna
-
get a huge performance swing.
-
But mostly what we're doing is math. And algorithms
-
and data structures. And for that, Ruby is,
it
-
could, in theory, one day, have a good JIT
-
or something, but today, writing that code
in Java
-
didn't end up being significantly more code
because it's
-
just, you know, algorithms and data structures.
-
T.D.: And I'll just note something about standardizing
on
-
protobufs in our, in our stack, is actually
a
-
huge win, because we, we realized, hey, browsers,
as
-
it turns out are pretty powerful these days.
They've
-
got, you know, they can allocate memory, they
can
-
do all these types of computation. So, and
protobuf
-
libraries exist everywhere. So we save ourselves
a lot
-
of computation and a lot of time by just
-
treating protobuf as the canonical serialization
form, and then
-
you can move payloads around the entire stack
and
-
everything speaks the same language, so you've
saved the
-
serialization and deserialization.
-
Y.K.: And JavaScript is actually surprisingly
effective at des-
-
at taking protobufs and converting them to
the format
-
that we need efficiently. So, so we basically
take
-
this data. The Java collector is basically
collecting all
-
these protobufs, and pretty much it just
turns around,
-
and this is sort of where we got into
-
bespoke territory before we started rolling
our own, but
-
we realized that when you write a big, distributed,
-
fault-tolerant system, there's a lot of problems
that you
-
really just want someone else to have thought
about.
-
So, what we do is we basically take these,
-
take these payloads that are coming in. We
convert
-
them into batches and we send the batches
down
-
into the Kafka queue. And the, the next thing
-
that happens, so the Kafka, sorry, Kafka's
basically just
-
a queue that allows you to throw things into,
-
I guess, it might be considered similar to
like,
-
something like AMQP. It has some nice fault-tolerance
properties
-
and integrates well with Storm. But most importantly,
it's
-
just super, super high throughput.
-
So we basically didn't want to put any barrier
between
-
you giving us the data and us getting it
-
to disk as soon as possible.
-
T.D.: Yeah. Which we'll, I think, talk about
in
-
a bit.
-
Y.K.: So, so basically Kafka takes the
-
data and starts sending it into Storm. And
if
-
you think about what has to happen in order
-
to get some request. So, you have these requests.
-
There's, you know, maybe traces that have
a bunch
-
of SQL queries, and our job is basically to
-
take all those SQL queries and say, OK, I
-
can see that in all of your requests, you
-
had the SQL query and it took around this
-
amount of time and it happened as a child
-
of this other node. And the way to think
-
about that is basically just a processing
pipeline. Right.
-
So you have these traces that come in one
-
side. You start passing them through a bunch
of
-
processing steps, and then you end up on the
-
other side with the data.
-
And Storm is actually a way of describing
that
-
processing pipeline in sort of functional
style, and then
-
you tell it, OK. Here's how many servers I
-
need. Here's how, here's how I'm gonna handle
failures.
-
And it basically deals with distribution and
scaling and
-
all that stuff for you. And part of that
-
is because you wrote everything using functional
style.
-
And so what happens is Kafka sends the data
-
into the entry spout, which is sort of terminology
-
in, terminology in Storm for these streams
that get
-
created. And they basically go into these
processing things,
-
which very clever- cutely are called bolts.
This is
-
definitely not the naming I would have used,
but.
-
So they're called bolts. And the idea is that
-
basically every request may have several things.
-
So, for example, we now automatically detect
n +
-
1 queries and that's sort of a different kind
-
of processing from just, make a picture of
the
-
entire request. Or what is the 95th percentile
across
-
your entire app, right. These are all different
kinds
-
of processing. So we take the data and we
-
send them into a bunch of bolts, and the
-
cool thing about bolts is that, again, because
they're
-
just functional chaining, you can take the
output from
-
one bolt and feed it into another bolt. And
-
that works, that works pretty well. And, and
you
-
don't have to worry about - I mean, you
-
have to worry a little bit about things like
-
fault tolerance, failure, idempotence. But
you worry about
-
them at, at the abstraction level, and then
the
-
operational part is handled for you.
-
T.D.: So it's just like a very declarative
way
-
of describing how this computation works in,
in a
-
way that's easy to scale.
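-
As a loose Ruby analogy only (Storm itself is a JVM system, and this is not its API): each step is a pure function over a batch of traces, and steps can be chained the way bolts are.

    extract_queries = ->(traces) { traces.flat_map { |t| t[:sql_nodes] } }
    group_by_query  = ->(nodes)  { nodes.group_by { |n| n[:normalized_sql] } }
    flag_n_plus_1   = ->(groups) { groups.select { |_sql, nodes| nodes.size > 10 } }

    traces = [
      { sql_nodes: [{ normalized_sql: "SELECT * FROM comments WHERE post_id = ?" }] * 12 }
    ]

    pipeline = [extract_queries, group_by_query, flag_n_plus_1]
    pipeline.reduce(traces) { |data, step| step.call(data) }
    # => {"SELECT * FROM comments WHERE post_id = ?" => [...12 nodes...]}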
-
Y.K.: And Carl actually talked about this
at very
-
high speed yesterday, and you, some of you
may
-
have been there. I would recommend watching
the video
-
when it comes out if you want to make
-
use of this stuff in your own applications.
-
And then when you're finally done with all
the
-
processing, you need to actually do something
with it.
-
You need to put it somewhere so that the
-
web app can get access to it, and that
-
is basically, we use Cassandra for this. And
Cassandra
-
again is mostly a dumb database, but
it
-
has high capacity. It has some
of
-
the fault-tolerance capabilities that we want.
-
T.D.: We're very, we're just very, very write-heavy,
right.
-
Like, we tend to be writing more than we're
-
ever reading.
-
Y.K.: Yup. And then when we're done, when
we're
-
done with a particular batch, Cassandra basically
kicks off
-
the process over again. So we're basically
doing these
-
things as batches.
-
T.D.: So these are, these are roll-ups, is
what's
-
happening here. So basically every minute,
every ten minutes,
-
and then at every hour, we reprocess and we
-
re-aggregate, so that when you query us we
know
-
exactly what to give you.
-
Y.K.: Yup. So we sort of have this cycle
-
where we start off, obviously, in the first
five
-
seconds, the first minute, you really want
high granularity.
-
You want to see what's happening right now.
But,
-
if you want to go back and look at
-
data from three months ago, you probably care
about
-
it, like the day granularity or maybe the
hour
-
granularity. So, we basically do these roll-ups
and we
-
cycle through the process.
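-
A minimal sketch of that roll-up idea in Ruby (the bucket shape is invented for illustration): merge many minute buckets into one coarser bucket so old data stays cheap to store and fast to query.

    def roll_up(minute_buckets)
      {
        count: minute_buckets.map { |b| b[:count] }.inject(:+),
        histogram: minute_buckets.each_with_object(Hash.new(0)) do |b, merged|
          b[:histogram].each { |bucket, n| merged[bucket] += n }
        end
      }
    end

    roll_up([
      { count: 40, histogram: { 0 => 38, 500 => 2 } },
      { count: 25, histogram: { 0 => 20, 500 => 4, 2000 => 1 } }
    ])
    # => {:count=>65, :histogram=>{0=>58, 500=>6, 2000=>1}}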
-
T.D.: So this, it turns out, building the
system
-
required an intense amount of work. Carl spent
probably
-
six months reading PhD thesises to find-
-
Y.K.: Theses.
-
T.D.: Theses. To find, to find data structures
and
-
algorithms that we could use. Because this
is a
-
huge amount of data. Like, I think even a
-
few months after we were in private data,
private
-
beta, we were already handling over a billion
requests
-
per month. And obviously there's no way that
we-
-
Y.K.: Basically the number of requests that
we handle
-
is the sum of all of the requests that
-
you handle.
-
T.D.: Right.
-
Y.K.: And all of our customers handle.
-
T.D.: Right. Right. So.
-
Y.K.: So that's a lot of requests.
-
T.D.: So obviously we can't provide a service,
at
-
least one that's not, we can't provide an
affordable
-
service, an accessible service, if we have
to store
-
terabytes or exabytes of data just to tell
you
-
how your app is running.
-
Y.K.: And I think it's also problematic
-
if you store all the data in a database
-
and then every single time someone wants to
learn
-
something about that, you have to do a query.
-
Those queries can take a very long time. They
-
can take minutes. And I think we really wanted
-
to have something that would be very, that
would,
-
where the feedback loop would be fast. So
we
-
wanted to find algorithms that let us handle
the
-
data at, at real time, and then provide it
-
to you at real time instead of these, like,
-
dump the data somewhere and then do these
complicated
-
queries.
-
T.D.: So, hold on. So this slide was not
-
supposed to be here. It was supposed to be
-
a Rails slide. So, whoa. I went too far.
-
K. We'll watch that again. That's pretty cool.
So
-
then the last thing I want to say is,
-
perhaps your take away from looking at this
architecture
-
diagram is, oh my gosh, these Rails guys completely-
-
Y.K.: They really jumped the shark.
-
T.D.: They jumped the shark. They ditched
Rails. I
-
saw, like, three Tweets yesterday - I wasn't
here,
-
I was in Portland yesterday, but I saw, like,
-
three Tweets that were like, I'm at RailsConf
and
-
I haven't seen a single talk about, like,
Rails.
-
So that's true here, too. But, I want to
-
assure you that we are only using this stack
-
for the heavy computation. We started in Rails.
We
-
started, we were like, hey, what do we need.
-
Ah, well, people probably need to authenticate
and log
-
in, and we probably need to do billing. And
-
those are all things that Rails is really,
really
-
good at. So we started with Rails as, basically,
-
the starting point, and then when we realized
oh
-
my gosh, computation is really slow. There's
no way
-
we're gonna be able to offer this service.
OK.
-
Now let's think about how we can do all
-
of that.
-
Y.K.: And I think notably, a lot of people
-
who look at Rails are like, there's a lot
-
of companies that have built big stuff on
Rails,
-
and their attitude is, like, oh, this legacy
terrible
-
Rails app. I really wish we could get rid
-
of it. If we could just write everything in
-
Scala or Clojure or Go, everything would be
amazing.
-
That is definitely not our attitude. Our attitude
is
-
that Rails is really amazing, in particular,
at the
-
kinds of things that are really common across
everyone's
-
web applications - authentication, billing,
et cetera. And we
-
really want to be using Rails for the parts
-
of our app- even things like error-tracking,
we do
-
through the Rails app. We want to be using
-
Rails because it's very productive at doing
those things.
-
It happens to be very slow with doing data
-
crunching, so we're gonna use a different
tool for
-
that.
-
But I don't think you'll ever see me getting
-
up and saying, ah, I really wish we had
-
just started writing, you know, the Rails
app in
-
rust.
-
T.D.: Yeah.
-
Y.K.: That would be terrible.
-
T.D.: So that's number one, is, is, honest
response
-
times, which, it turns out, seems
like
-
it should be easy, but requires storing an insane
amount of
-
data.
-
So the second thing that we realized when
we
-
were looking at a lot of these tools, is
-
that most of them focus on data. They focus
-
on giving you the raw data. But I'm not
-
a machine. I'm not a computer. I don't enjoy
-
sifting through data. That's what computers
are good for.
-
I would rather be drinking a beer. It's really
-
nice in Portland, this time of year.
-
So, we wanted to think about, if you're trying
-
to solve the performance problems in your
application, what
-
are the things that you would suss out with
-
the existing tools after spending, like, four
hours depleting
-
your ego to get there?
-
Y.K.: And I think part of this is just
-
people are actually very, people like to think
that
-
they're gonna use these tools, but when the
tools
-
require you to dig through a lot of data,
-
people just don't use them very much. So,
the
-
goal here was to build a tool that people
-
actually use and actually like using, and
not to
-
build a tool that happens to provide a lot
-
of data you can sift through.
-
T.D.: Yes.
-
So, probably the, one of the first things
that
-
we realized is that we don't want to provide this.
-
This is a trace of a request, you've probably
-
seen similar UIs in other tools, using,
for example,
-
the inspector in, in like Chrome or Safari,
and
-
this is just showing basically, it's basically
a visual
-
stack trace of where your application is spending
its
-
time.
-
But I think what was important for us is
-
showing not just a single request, because
your app
-
handles, you know, hundreds of thousands of
requests, or
-
millions of requests. So looking at a single
request
-
statistically, is just noise.
-
Y.K.: And it's especially bad if it's the
worst
-
request, because the worst request is, is
really noise.
-
It's like, a hiccup in the network, right.
-
T.D.: It's the outlier. Yeah.
-
Y.K.: It's literally the outlier.
-
T.D.: It's literally the outlier. Yup. So,
what we
-
present in Skylight is something a little
bit different,
-
and it's something that we call the aggregate
trace.
-
So the aggregate trace is basically us taking
all
-
of your requests, averaging them out where
each of
-
these things spends their time, and then showing
you
-
that. So this is basically like, this is like,
-
this is like the statue of David. It is
-
the idealized form of the stack trace of how
-
your application's behaving.
-
But, of course, you have the same problem
as
-
before, which is, if this is all that we
-
were showing you, it would be obscuring a
lot
-
of information. You want to actually be able
to
-
tell the difference between, OK, what's my
stack trace
-
look like for fast requests, and how does
that
-
differ from requests that are slower.
-
So what we've got, I've got a little video
-
here. You can see that when I move the
-
slider, that this trace below it is actually
updating
-
in real time. As I move the slider around,
-
you can see that the aggregate trace actually
updates
-
with it. And that's because we're collecting
all this
-
information. We're collecting, like I said,
a lot of
-
data. We can recompute this aggregate trace
on the
-
fly.
-
Basically, for each bucket, we're storing
a different trace,
-
and then on the client we're reassembling
that. We'll
-
go into that a little bit.
-
Y.K.: And I think it's really important that
you
-
be able to do these experiments quickly. If
every
-
time you think, oh, I wonder what happens
if
-
I add another histogram bucket, if it requires
a
-
whole full page refresh. Then that would basically
make
-
people not want to use the tool. Not able
-
to use the tool. So, actually building something
which
-
is real time and fast, gets the data as
-
it comes, was really important to us.
-
T.D.: So that's number one.
-
And the second thing. So we built that, and
-
we're like, OK, well what's next? And I think
-
that the big problem with this is that you
-
need to know that there's a problem before
you
-
go look at it, right. So we have been
-
working for the past few months, and the Storm
-
infrastructure that we built makes it pretty
straight-forward to
-
start building more abstractions on top of
the data
-
that we've already collected.
-
It's a very declarative system. So we've been
working
-
on a feature called inspections. And what's
cool about
-
inspections is that we can look at this tremendous
-
volume of data that we've collected from your
app,
-
and we can automatically tease out what the
problems
-
are. So the first one that we shipped, this
-
is in beta right now. It's not, it's not
-
out and enabled by default, but there, it's
behind
-
a feature flag that we've had some users turning
-
on.
-
And, and trying out. And so what we can
-
do in this case, is because we have information
-
about all of the database queries in your
app,
-
we can look and see if you have n
-
plus one queries. Can you maybe explain what
an
-
n plus one query is?
-
Y.K.: Yeah. So, I'm, people know, hopefully,
what n
-
plus one queries are. But the, it's the idea that
-
you, by accident, for some reason, instead
of making
-
one query, you ask for like all the posts
-
and then you iterated through all of them
and
-
got all the comments and now you, instead
of
-
having one query, you have one query per post,
-
right. And what I'd like
to
-
do is eager loading, where you say include
-
comments, right. But you have to know that
you
-
have to do that.
-
So there's some tools that will run in development
-
mode, if you happen to catch it, like
-
Bullet. This is basically a tool that's looking
at
-
every single one of your classes and has some
-
thresholds that, once we see that a bunch
of
-
your requests have the same exact query, so
we
-
do some work to pull out binds. So if
-
it's, like, where something equals one, we
will automatically
-
pull out the one and replace it with a
-
question mark.
-
And then we basically take all those queries,
if
-
they're the exact same query repeated multiple
times, subject
-
to some thresholds, we'll start showing you
hey, there's
-
an n plus one query.
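-
A rough sketch of that normalization and thresholding in Ruby; the regexes and the threshold here are simplified stand-ins, not what Skylight actually ships:

    def normalize(sql)
      sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?")  # strip literal binds
    end

    def n_plus_one?(queries, threshold = 10)
      counts = Hash.new(0)
      queries.each { |q| counts[normalize(q)] += 1 }
      counts.any? { |_shape, n| n >= threshold }
    end

    queries = (1..12).map { |id| "SELECT * FROM comments WHERE post_id = #{id}" }
    n_plus_one?(queries)  # => true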
-
And you can imagine this same sort of thing
-
being done for things, like, are you missing
an
-
index, right. Or, are you using the Ruby version
-
of JSON when you should be using the native
-
version of JSON. These are all things that
we
-
can start detecting just because we're consuming
an enormous
-
amount of information, and we can start writing
some
-
heuristics for bubbling it up.
-
So, third and final breakthrough, we realized
that we
-
really, really needed a lightning fast UI.
Something really
-
responsive. So, in particular, the feedback
loop is critical,
-
right. You can imagine, if the way that you
-
dug into data was you clicked and you wait
-
an hour, and then you get your results, no
-
one would do it. No one would ever do
-
it.
-
And the existing tools are OK, but you click
-
and you wait. You look at it and you're
-
like, oh, I want a different view, so then
-
you go edit your query and then you click
-
and you wait and it's just not a pleasant
-
experience.
-
So, so we use Ember. The UI that
-
you're using when you log into Skylight, even
though
-
it feels just like a regular website, not
-
like a native app: all of
-
the routing, all of the rendering, all of
the
-
decision making is happening in an Ember.js
app,
-
and we pair that with D3. So all of
-
the charts, the charts that you saw there
in
-
the aggregate trace, that is all Ember components
powered
-
by D3.
-
So, this has actually significantly cleaned
up our client-side
-
code. It makes re-usability really, really
awesome. So to
-
give you an example, this is from our billing
-
page that I, the designer came and they had,
-
they had a component that was like, the date
-
component.
-
And, the-
-
T.D.: It seems really boring at first.
-
Y.K.: It seemed really boring. But, this is
the
-
implementation, right. So you could copy and
paste this
-
code over and over again, everywhere you go.
Just
-
remember to format it correctly. If you forget
to
-
format it, it's not gonna look the same everywhere.
-
But I was like, hey, we're using this all
-
over the place. Why don't we bundle this up
-
into a component? And so with Ember, it was
-
super easy. We basically just said, OK, here's
new
-
calendar date component. It has a property
on it
-
called date. Just set that to any JavaScript
Date
-
object. Just set, you don't have to remember
about
-
converting it or formatting it. Here's the
component. Set
-
the date and it will render the correct thing
-
automatically.
-
And, so the architecture of the Ember app
looks
-
a little bit, something like this, where you
have
-
many, many different components, most of them
just driven
-
by D3, and then they're plugged into the model
-
and the controller.
-
And the Ember app will go fetch those models
-
from the cloud, and the cloud is the Java
-
app, which just queries Cassandra, and render
them. And
-
what's neat about this model is turning on
web
-
sockets is super easy, right. Because all
of these
-
components are bound to a single place. So
when
-
the web socket says, hey, we have updated
information
-
for you to show, it just pushes it onto
-
the model or onto the controller, and the
whole
-
UI updates automatically.
-
It's like magic.
-
And-
-
Y.K.: Like magic.
-
T.D.: It's like magic. And, and when debugging,
this
-
is especially awesome too, because, and I'll
maybe show
-
a demo of the Ember inspector. It's nice.
-
So. Yeah. So, lightning fast UI. Reducing
the feedback
-
loop so that you can quickly play with your
-
data, makes it go from a chore to something
-
that actually feels kind of fun.
-
So, these were the breakthroughs that we had
when
-
we were building Skylight. The things that
made us
-
think, yes, this is actually a product that
we
-
think deserves to be on the market. So, one,
-
honest response times. Collect data that no
one else
-
can collect. Focus on answers instead of just
dumping
-
data, and have a lightning fast UI to do
-
it.
-
So we like to think of Skylight as basically
-
a smart profiler. It's a smart profiler that
runs
-
in production. It's like the profiler that
you run
-
on your local development machine, but instead
of being
-
on your local dev box which has nothing to
-
do with the performance characteristics of
what your users
-
are experiencing, we're actually running in
production.
-
So, let me just give you guys a quick
-
demo.
-
So, this is what the Skylight, this is what
-
Skylight looks like. What's under this? There
we go.
-
So, the first thing here is we've got the
-
app dashboard. So this, it's like our 95th
-
percentile response time
has peaked. Maybe you're
-
all hammering it right now. That would be
nice.
-
So, this is a graph of your response time
-
over time, and then on the right, this is
-
the graph of the RPMs, the requests per minute
-
that your app is handling. So this is app-wide.
-
And this is live. This updates every minute.
-
Then down below, you have a list of the
-
end points in your application. So you can
see,
-
actually, the top, the slowest ones for us
were,
-
we have an instrumentation API, and we've
gone and
-
instrumented our background workers. So we
can see them
-
here, and their response time plays in. So
we
-
can see that we have this reporting worker
that's
-
taking 95th percentile, thirteen seconds.
-
Y.K.: So all that time used to be inside
-
of some request somewhere, and we discovered
that there
-
was a lot of time being spent in things
-
that we could push to the background. We probably
-
need to update the agony index so that it
-
doesn't rank workers very high, because spending
some time
-
in your workers is not that big of a
-
deal.
-
T.D.: So, so then, if we dive into one
-
of these, you can see that for this request,
-
we've got the time explorer up above, and
that
-
shows a graph of response time at, again,
95th
-
percentile, and you can, if you want to go
-
back and look at historical data, you just
drag
-
it like this. And this has got a brush,
-
so you can zoom in and out on different
-
times.
-
And every time you change the range, you can
-
see that it's very responsive. It's never
waiting for
-
the server. But it is going back and fetching
-
data from the server and then when the data
-
comes back, you see the whole UI just updates.
-
And we get that for free with Ember.
-
And then down below, as we discussed, you
actually
-
have a real histogram. And this histogram,
in this
-
case, is showing. So this is for fifty-seven
requests.
-
And if we click and drag, we could just
-
move this. And you can see that the aggregate
-
trace below updates in response to us dragging
this.
-
And if we want to look at the fastest
-
quartile, we just click faster and we'll just
choose
-
that range on the histogram.
-
Y.K.: I think it's the fastest load.
-
T.D.: The fastest load. And then if you click
-
on slower, you can see the slower requests.
So
-
this makes it really easy to compare and contrast.
-
OK. Why are certain requests faster and why
are
-
certain requests slow?
-
You can see the blue, these blue areas. This
-
is Ruby code. So, right now it's not super
-
granular. It would be nice if you could actually
-
know what was going on here. But, it'll at
-
least tell you where in your controller action
this
-
is happening, and then you can actually see
which
-
database queries are being executed, and what
their duration
-
is.
-
And you can see that we actually extract the
-
SQL and we denormalize it so we, so you,
-
or, we normalize it so you can see exactly
-
what those requests are even if the values
are
-
totally different between them.
-
Y.K.: Yeah. So the real query, courtesy of
Rails,
-
not yet supporting bind extraction is like,
where id
-
equals one or, ten or whatever.
-
T.D.: Yup. So that's pretty cool.
-
Y.K.: So one, one other thing is, initially,
we
-
actually just showed the whole trace, but
we discovered
-
that, obviously when you show whole traces
you have
-
information that doesn't really matter that
much. So we
-
started off by, we've recently basically started
to collapse
-
things that don't matter so much so that you
-
can basically expand or condense the trace.
-
And we wanted to make it so you don't
-
have to think about expanding or condensing
individual areas,
-
but just see what matters the most and
-
then expand to see the trivial areas.
-
T.D.: Yup. So, so that's the demo of Skylight.
-
We'd really like it if you checked it out.
-
There is one more thing I want to show
-
you that is, like, really freaking cool. This
is
-
coming out of Tilde labs. Carl was like, has
-
been hacking, he's been up until past midnight,
getting
-
almost no sleep for the past month trying
to
-
have this ready.
-
I don't know how many of you know this,
-
but Ruby 2.1 has a new,
-
a stack sampling feature. So you can get really
-
granular information about how your Ruby code
is performing.
-
So I want to show you, I just mentioned
-
how it would be nice if we could get
-
more information out of what your Ruby code
is
-
doing. And now we can do that.
-
Basically, every few milliseconds, this code
that Carl wrote
-
is going into the, to the Ruby, into MRI,
-
and it's taking a snapshot of the stack.
-
And because this is built-in, it's very low-impact.
It's
-
not allocating any new memory. It's very little
performance
-
hit. Basically you wouldn't even notice it.
And so
-
every few milliseconds it's sampling, and
we take that
-
information and we send it up to our servers.
-
So it's almost like you're running Ruby profiler
on
-
your local dev box, where you get extremely
granular
-
information about where your code is spending
its time
-
in Ruby, per method, per all of these things.
-
But it's happening in production.
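-
To illustrate the sampling idea only: the hook being described is a C-level profiling API new in Ruby 2.1 (the same one stackprof builds on), not the pure-Ruby sketch below, but the principle is the same: wake up every few milliseconds, grab the current stack, and tally which frames show up.

    samples = Hash.new(0)

    sampler = Thread.new do
      loop do
        (Thread.main.backtrace || []).first(20).each { |frame| samples[frame] += 1 }
        sleep 0.005   # every few milliseconds
      end
    end

    do_some_work   # hypothetical: the code you actually care about
    sampler.kill

    samples.sort_by { |_frame, n| -n }.first(5)  # the hottest frames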
-
So, this is, so this is a, we enabled
-
it in staging. You can see that we've got
-
some rendering bugs. It's still in beta.
-
Y.K.: Yeah, and we haven't yet collapsed things
that
-
are not important-
-
T.D.: Yes.
-
Y.K.: -for this particular feature.
-
T.D.: So we want to show, we want to
-
hide things like, like framework code, obviously.
But this
-
gives you an incredibly, incredibly granular
view of what
-
your app is doing in production. And we think.
-
This is a, an API that's built into, into
-
Ruby 2.1.1. Because our agent is running so
low-level,
-
because we wrote it in Rust, we have the
-
ability to do things like this, and Carl thinks
-
that we may be able to actually backport
-
this to older Rubies, too. So if you're not
-
on Ruby 2.1, we think that we can actually
-
bring this. But that's TBD.
-
Y.K.: Yeah, I- so I think the cool thing
-
about this, in general, is when you run a
-
sampling- so this is a sampling profiler,
right, we
-
don't want to be burdening every single thing
that
-
you do in your program with tracing, right.
That
-
would be very slow.
-
So when you normally run a sampling profiler,
you
-
have to basically make a loop. You have to
-
basically create a loop, run this code a million
-
times and keep sampling it. Eventually we'll
get enough
-
samples to get the information. But it turns
out
-
that your production server is a loop. Your
production
-
server is serving tons and tons of requests.
So,
-
by simply tak- you know, taking a few microseconds
-
out of every request and collecting a couple
of
-
samples, over time we can actually get this
really
-
high fidelity picture with basically no cost.
-
And that's pretty mind-blowing. And this is
the kind
-
of stuff that we can start doing by really
-
caring about, about both the user experience
and the
-
implementation and getting really scary about
it. And I'm
-
really, like, honestly this is a really exciting
feature
-
that really shows what we can do as we
-
start building this out.
-
T.D.: Once we've got that, once we've got
that
-
groundwork.
-
So if you guys want to check it out,
-
Skylight dot io, it's available today. It's
no longer
-
in private beta. Everyone can sign up. No
invitation
-
token necessary. And you can get a thirty-day
free
-
trial if you haven't started one already.
So if
-
you have any questions, please come see us
right
-
now, or we have a booth in the vendor
-
hall. Thank you guys very much.