TOM DALE: Hey, you guys ready?
Thank you guys so much for coming.
This is awesome. I was really,
I, when they were putting together the schedule,
I said, make sure that you put us down
in the Caves of Moria. So thank you
guys for coming down and making it.
I'm Tom. This is Yehuda.
YEHUDA KATZ: When people told me I was signed up to do a back-to-back talk, I don't know what I was thinking.
T.D.: Yup. So. We want to talk to you today about Skylight. But just a little bit before we talk about that, I want to talk about us a little bit.
So, in 2011 we started a company called Tilde. It's this shirt. Now I'm a little self-conscious, because this is actually a first edition and it's printed off-center. Well, either I'm off-center or the shirt's off-center. One of the two.
So we started Tilde in 2011, and we had all just left a venture-backed company. That was a pretty traumatic experience for us, because we spent a lot of time building the company, and then we ran out of money and sold to Facebook, and we really didn't want to repeat that experience.
So we decided to start Tilde. And when we did, DHH and the other people at Basecamp were talking about, you know, being bootstrapped and proud. That was a message that really resonated with us, and we wanted to capture the same thing.
There's only one problem with being bootstrapped and proud, and that is, in order to be both of those things, it turns out you actually need money. It's not like you just say it in a blog post and then all of a sudden you're in business.
So we had to think a lot about, OK, how do we make money? How do we build a profitable and, most importantly, sustainable business? Because we didn't want to just flip to Facebook in a couple of years.
So, looking around, I think the most obvious thing that people suggested to us was, well, why don't you guys just become Ember, Inc.? Raise a few million dollars, you know, with a business model built mostly on prayer. But that's not really how we want to think about building open source communities.
We don't really think that that necessarily leads to the best open source communities. And if you're interested in more on that, I recommend Leah Silber, who is one of our co-founders. She's giving a talk this afternoon. Oh, sorry. Friday afternoon, about how to build a company that is centered on open source. So if you want to learn more about how we've done that, I would really suggest you go check out her talk.
So, no. So, no Ember, Inc. Not allowed.
So, we really wanted to build something that leveraged the strengths that we thought we had. One, I think most importantly, a really deep knowledge of open source and a deep knowledge of the Rails stack. And also, Carl, it turns out, is really, really good at building highly scalable big data systems. Lots of Hadoop in there.
So, last year at RailsConf, we announced the
private
beta of Skylight. How many of you have used
Skylight? Can you raise your hand if you have
used it? OK. Many of you. Awesome.
So, Skylight is a tool for profiling and measuring the performance of your Rails applications in production. And as a product, Skylight, I think, was built on three key breakthroughs. We didn't want to ship a product that was incrementally better than the competition. We wanted to ship a product that was dramatically better. A quantum leap. An order of magnitude better.
And, in order to do that, we spent a lot of time thinking about how we could solve most of the problems that we saw in the existing landscape. Delivering a product that does that is predicated on these three breakthroughs.
So, the first one I want to talk about is honest response times. Honest response times. So, DHH wrote a blog post on what was then the 37signals blog, now the Basecamp blog, called The problem with averages. How many of you have read this? Awesome.
For those of you that have not, how many
of you hate raising your hands at presentations?
So, for those of you that-
Y.K.: Just put a button in every seat. Press
this button-
T.D.: Press the button if you have. Yes. Great.
So, if you read this blog post, the way it opens is: "Our average response time for Basecamp right now is 87ms... That sounds fantastic. And it easily leads you to believe that all is well and that we wouldn't need to spend any more time optimizing performance."
"But that's actually wrong. The average number is completely skewed by tons of fast responses to feed requests and other cached replies. If you have 1000 requests that return in 5ms, and then you have 200 requests taking 2000ms, or two seconds, you can still report a respectable 170ms average. That's useless."
So what does DHH say that we need? DHH says the solution is histograms. So, for those of you like me who were sleeping through your statistics class in high school and college, a brief primer on histograms. A histogram is very simple. Basically, you have a series of numbers along some axis, and every time a value falls into a bucket, you increment that bucket's bar by one.
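As a rough sketch of that bucketing idea in Ruby (the 50ms bucket width here is arbitrary, purely for illustration):

```ruby
# Minimal histogram sketch: round each response time down to a bucket
# boundary and count how many requests land in each bucket.
BUCKET_WIDTH_MS = 50

def histogram(response_times_ms)
  response_times_ms.each_with_object(Hash.new(0)) do |ms, buckets|
    buckets[(ms / BUCKET_WIDTH_MS) * BUCKET_WIDTH_MS] += 1
  end
end

histogram([12, 48, 470, 492, 505, 1900])
# => {0=>2, 450=>2, 500=>1, 1900=>1}
```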
So, this is an example of a histogram of response times in a Rails application. You can see that there's a big cluster in the middle, around 488ms, 500ms. This isn't a super speedy app, but it's not the worst thing in the world. And they're all clustered; and then as you move to the right, you can see that the response times get longer and longer, and as you move to the left, response times get shorter and shorter.
So, why do you want a histogram? What's the most important thing about a histogram?
Y.K.: Well, I think it's because most requests
don't
actually look like this.
T.D.: Yes.
Y.K.: Most end points don't actually look
like this.
T.D.: Right. If you think about what your Rails app is doing, it's a complicated beast, right. It turns out, in Ruby, frankly, you can do branching logic. You can do a lot of things.
And so what that means is that if you represent one end point with a single number, you are losing a lot of fidelity, to the point where it becomes, as DHH said, useless. So, for example, in a histogram, you can easily see, oh, here's a group of requests and response times where I'm hitting the cache, and here's another group where I'm missing it. And you can see that that cluster is significantly slower than the faster, cache-hitting cluster.
And the other thing that you get when you have a distribution, when you keep the whole distribution in the histogram, is that you can look at this number at the 95th percentile, right. The way to think about the performance of your web application is not the average, because the average doesn't really tell you anything. You want to think about the 95th percentile, because that's not the average response time, that's the average worst response time that a user is likely to hit.
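For concreteness, here is a naive sketch of pulling a 95th percentile out of raw samples; a real system would read this off the stored histogram buckets rather than sorting raw response times:

```ruby
# Naive percentile sketch: sort the samples and index into them.
def percentile(samples_ms, pct)
  sorted = samples_ms.sort
  sorted[((pct / 100.0) * (sorted.length - 1)).round]
end

percentile([5, 5, 5, 480, 490, 510, 2000], 95) # => 2000
```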
And the thing to keep in mind is that
it's not as though a customer comes to your
site, they issue one request, and then they're
done,
right. As someone is using your website, they're
gonna
be generating a lot of requests. And you need
to look at the 95th percentile, because otherwise
every
request is basically you rolling the dice
that they're
not gonna hit one of those two second, three
second, four second responses, close the tab
and go
to your competitor.
So we looked at this and, here's the crazy thing. Here's what I think is crazy. That blog post that DHH wrote is from 2009. It's been five years, and there's still no tool that does what DHH was asking for. So, frankly, we smelled money. We were like, holy crap.
Y.K.: Yeah, why isn't that slide green?
T.D.: Yeah. It should be green and dollars. I think Keynote has the dollars, the make-it-rain effect I should have used. So we smelled blood in the water. We were like, this is awesome. There's only one problem that we discovered, and that is, it turns out that building this thing is actually really, really freaky hard. Really, really hard.
So, we announced the private beta at RailsConf last year. Before doing that, we spent a year of research, spiking out prototypes, building prototypes, building out the beta. We launched at RailsConf, and we realized we had made a lot of mistakes. We made a lot of errors when we were building this system.
So then, after RailsConf last year, we basically took six months to completely rewrite the backend from the ground up. And I think, tying into your keynote, Yehuda, we had been like, oh, we clearly have a bespoke problem, no one else is doing this, so we wrote our own custom backend. And then we had all these problems, and we realized that they had actually already all been solved by the open source community. And so we benefited tremendously from having a shared solution.
Y.K.: Yeah. So our first release of this was really very bespoke, and the current release uses a tremendous number of off-the-shelf open source projects that each solve a particular problem very effectively. None of which are as easy to use as Rails, but all of which solve really thorny problems very effectively.
T.D.: So, just for your own understanding, let's talk about how most performance monitoring tools work. The way that most of these work is that you run your Rails app, and running inside of your Rails app is some gem, some agent that you install. And every time the Rails app handles a request, it generates events, and those events, which include information about performance data, are passed into the agent.
And then the agent sends that data to some kind of centralized server. Now, it turns out that keeping a running average is actually really simple, which is why everyone does it. Basically, you can do it in a single SQL query. All you do is have three columns in the database: the end point, the running average, and the number of requests. The average and the count are the only two pieces of state you need to keep a running average, right.
So keeping a running average is actually really simple from a technical point of view.
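To make that concrete, here is the arithmetic as a sketch; the current average and the request count are the only state involved:

```ruby
# A running average needs only two pieces of state per end point:
# the current average and the request count. One cheap UPDATE per
# request keeps both current.
def update_running_average(average_ms, request_count, new_ms)
  new_count = request_count + 1
  [(average_ms * request_count + new_ms) / new_count.to_f, new_count]
end

avg, count = update_running_average(170.0, 1200, 2000)
# avg => ~171.5, count => 1201
```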
Y.K.: I don't think you could even do it in JavaScript, thanks to the lack of integers.
T.D.: Yes. You probably wouldn't want to do any math in JavaScript, it turns out. So, we took a little bit different approach. Yehuda, do you want to go over the next section?
Y.K.: Yeah. Sure. So, when we first started, right at the beginning, we basically did a similar thing where we had a bunch - your app creates events. Most of those start off as ActiveSupport::Notifications, although it turns out that there's very limited use of ActiveSupport::Notifications, so we had to do some normalization work to get them sane, which we're gonna be upstreaming back into Rails.
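For context, consuming those events looks roughly like this; sql.active_record is one of the standard Rails hooks, and the trace-building step is sketched as a comment:

```ruby
require "active_support/notifications"

# Rails instruments itself with ActiveSupport::Notifications. An agent
# subscribes to hooks like "sql.active_record" and turns each event
# into a node in the current request's trace.
ActiveSupport::Notifications.subscribe("sql.active_record") do |name, start, finish, id, payload|
  duration_ms = (finish - start) * 1000
  # A real agent would attach this to the in-flight trace here.
  puts format("%s: %s (%.1fms)", name, payload[:sql], duration_ms)
end
```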
But, one thing that's kind of unfortunate
about having
every single Rails app have an agent is that
you end up having to do a lot of
the same kind of work over and over again,
and use up a lot of memory. So, for
example, every one of these things is making
HTTP
requests. So now you have a queue of things
that you're sending over HTTP in every single
one
of your Rails processes. And, of course, you
probably
don't notice this. People are used to Rails
taking
up hundreds and hundreds of megabytes, so
you probably
don't notice if you install some agent and
it
suddenly starts taking twenty, thirty, forty,
fifty more megabytes.
But we really wanted to keep the actual memory per process down to a small amount. So one of the very first things that we did, even before last year, is that we pulled out all that shared logic into a separate process called the coordinator. The agent is basically responsible simply for collecting the trace, and it's not responsible for actually talking to our server at all. And that means that only the coordinator has to keep this queue, this bunch of in-flight work, in one place, and the agent doesn't end up using as much memory.
And I think this ended up being very effective for us.
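A heavily simplified sketch of that split; the socket path and the line-based framing here are invented for illustration, not the actual protocol:

```ruby
require "socket"
require "json"

SOCKET_PATH = "/tmp/coordinator.sock" # hypothetical path

# Agent side, inside each Rails process: hand the finished trace to the
# coordinator and return immediately. No HTTP queue lives in-process.
def report_trace(trace)
  UNIXSocket.open(SOCKET_PATH) { |sock| sock.puts(JSON.generate(trace)) }
end

# Coordinator side, one per box: owns the queue and the batched uplink
# to the server, so that memory cost is paid once, not per process.
def run_coordinator
  server = UNIXServer.new(SOCKET_PATH)
  queue = Queue.new
  loop do
    client = server.accept
    queue << JSON.parse(client.gets)
    client.close
    # A real coordinator would drain the queue in batches over HTTP.
  end
end
```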
T.D.: And I think that low overhead also allows
us to just collect more information, in general.
Y.K.: Yeah.
Now, after our first attempt, we started getting a bunch of customers telling us that even the separate - so the separate coordinator was both a good thing and a bad thing. On the one hand, there's only one of them, so it uses up only one set of memory. On the other hand, it's really easy for someone to go in and ps that process and see how many megabytes of memory it's using.
So, we got a lot of additional complaints that said, oh, your process is using a lot of memory. And I spent a few weeks on it; I know Ruby pretty well. I actually wrote a gem called Allocation Counter that basically went in to try to pinpoint exactly where the allocations were coming from. But it turns out that it's actually really, really hard to track down exactly where allocations are coming from in Ruby, because something as simple as using a regular expression in Ruby can allocate match objects that get put back on the stack.
And so I was able to pare this down to some degree. But I quickly discovered that trying to keep a lid on the memory allocation by doing all this stuff in Ruby is mostly fine. But for our specific use case, where we really wanna be telling you, you can run the agent on your process, on your box, and it's not gonna use a lot of memory, we really needed something more efficient. And our first thought was, we'll use C++ or C. No problem.
C is native. It's great. And Carl did the work. Carl is very smart. And then he said, Yehuda, it is now your turn. You need to start maintaining this. And I said, I don't trust myself to write C++ code that's running on all of your guys's boxes and not segfault. So that doesn't work for me.
And so I noticed that Rust was coming along, and what Rust really gives you is the ability to write low-level code a la C or C++, with manual memory management that keeps your memory allocation low and keeps things speedy. Low resource utilization, while also giving you compile-time guarantees about not segfaulting. So, again, if your processes randomly started segfaulting because you installed the agent, I think you would stop being our customer very quickly. So having pretty much 100% guarantees about that was very important to us. And so that's why we decided to use Rust.
I'll just keep going.
T.D.: Keep going.
Y.K.: So, we had this coordinator object. And basically the coordinator object is receiving events. So the events basically end up being these traces that describe what's happening in your application. And the next thing - in our initial work on this, I think we used JSON just to send the payload to the server, but we noticed that a lot of people have really big requests. So you may have a big request with a big SQL query in it, or a lot of big SQL queries in it. Some people have traces that are hundreds and hundreds of nodes long. And so we really wanted to figure out how to shrink down the payload size to something that we could be, you know, pumping out of your box on a regular basis without running up your bandwidth costs.
So, one of the first things that we did early on was switch to using protobuf as the transport mechanism, and that shrunk down the payloads a lot. Our earlier prototypes for actually collecting the data were written in Ruby, but I think Carl did, like, a weekend hack to just port it over to Java and got, like, 200x performance. And you don't always get 200x performance; if mostly what you're doing is database queries, you're not gonna get a huge performance swing.
But mostly what we're doing is math. And algorithms and data structures. And for that, Ruby - it could, in theory, one day, have a good JIT or something, but today, writing that code in Java didn't end up being significantly more code, because it's just, you know, algorithms and data structures.
T.D.: And I'll just note that standardizing on protobufs in our stack is actually a huge win, because we realized, hey, browsers, as it turns out, are pretty powerful these days. They can allocate memory, they can do all these types of computation. And protobuf libraries exist everywhere. So we save ourselves a lot of computation and a lot of time by just treating protobuf as the canonical serialization format. You can move payloads around the entire stack and everything speaks the same language, so you've saved the serialization and deserialization.
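As an illustration of the idea using the google-protobuf gem's runtime definition DSL; the message fields here are invented, since the real Skylight schema isn't shown in the talk:

```ruby
require "google/protobuf"

# Hypothetical trace-node schema. The point is that the encoded bytes
# are compact and readable by any platform with a protobuf library.
pool = Google::Protobuf::DescriptorPool.new
pool.build do
  add_message "demo.Span" do
    optional :name,        :string, 1
    optional :duration_us, :uint64, 2
  end
end
Span = pool.lookup("demo.Span").msgclass

span   = Span.new(name: "SELECT * FROM posts", duration_us: 4_200)
binary = Span.encode(span)  # compact wire format, same bytes everywhere
Span.decode(binary)         # round-trips on Ruby, Java, or JavaScript
```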
Y.K.: And JavaScript is actually surprisingly effective at taking protobufs and converting them to the format that we need efficiently. So, we basically take this data. The Java collector is basically collecting all these protobufs, and pretty much it just turns around - and this is sort of where we got into bespoke territory before, when we started rolling our own - but we realized that when you write a big, distributed, fault-tolerant system, there's a lot of problems that you really just want someone else to have thought about.
So, what we do is we basically take these payloads that are coming in, we convert them into batches, and we send the batches down into the Kafka queue. Kafka's basically just a queue that you can throw things into; I guess it might be considered similar to something like AMQP. It has some nice fault-tolerance properties and integrates well with Storm. But most importantly, it's just super, super high throughput.
So we basically didn't want to put any barrier between you giving us the data and us getting it to disk as soon as possible.
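As a sketch of that hand-off using the ruby-kafka client (the real collector is Java, and the broker addresses and topic name here are hypothetical):

```ruby
require "kafka" # ruby-kafka gem; just the shape of the idea

kafka    = Kafka.new(["kafka1:9092", "kafka2:9092"]) # hypothetical brokers
producer = kafka.producer

# Take a batch of protobuf-encoded payloads and hand them straight to
# Kafka; getting the data durably to disk fast is the whole job here.
payloads = ["<protobuf bytes>", "<protobuf bytes>"] # stand-ins
payloads.each { |bytes| producer.produce(bytes, topic: "traces") }
producer.deliver_messages # ships the buffered batch to the brokers
```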
T.D.: Yeah. Which we'll, I think, talk about
in
a bit.
Y.K.: So basically Kafka takes the data and starts sending it into Storm. And think about what has to happen in order to process some requests. So, you have these requests. There are, you know, maybe traces that have a bunch of SQL queries, and our job is basically to take all those SQL queries and say, OK, I can see that in all of your requests, you had this SQL query, and it took around this amount of time, and it happened as a child of this other node. And the way to think about that is basically just as a processing pipeline. Right.
So you have these traces that come in on one side, you start passing them through a bunch of processing steps, and then you end up on the other side with the data.
And Storm is actually a way of describing that processing pipeline in a sort of functional style. Then you tell it, OK, here's how many servers I need, here's how I'm gonna handle failures, and it basically deals with distribution and scaling and all that stuff for you. And part of that is because you wrote everything in a functional style.
And so what happens is Kafka sends the data into the entry spout, which is the terminology in Storm for these streams that get created. And they basically go into these processing things, which very cutely are called bolts. This is definitely not the naming I would have used, but.
So they're called bolts. And the idea is that basically every request may involve several things. So, for example, we now automatically detect N+1 queries, and that's sort of a different kind of processing from just, make a picture of the entire request. Or, what is the 95th percentile across your entire app, right. These are all different kinds of processing. So we take the data and we send it into a bunch of bolts, and the cool thing about bolts is that, again, because they're just functional chaining, you can take the output from one bolt and feed it into another bolt. And that works pretty well. And you don't have to worry about - I mean, you have to worry a little bit about things like fault tolerance, failure, idempotence. But you worry about them at the abstraction level, and then the operational part is handled for you.
T.D.: So it's just a very declarative way of describing how this computation works, in a way that's easy to scale.
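Storm topologies are actually written in Java; purely to convey the shape of the idea in Ruby, the pipeline is just chained transformations over a stream of traces, which is what makes it easy to distribute:

```ruby
# Not Storm itself; just the functional shape of the pipeline. Because
# each step is a pure transformation, the framework is free to
# distribute, retry, and scale the steps for you.
Trace = Struct.new(:endpoint, :duration_ms)

traces = [Trace.new("PostsController#index", 480),
          Trace.new("PostsController#index", 2100),
          Trace.new("PostsController#show",  95)]

summary = traces
  .group_by(&:endpoint)                    # one "bolt"
  .map do |endpoint, ts|                   # a downstream "bolt"
    durations = ts.map(&:duration_ms).sort
    p95 = durations[(0.95 * (durations.length - 1)).round]
    [endpoint, { count: ts.length, p95_ms: p95 }]
  end.to_h

# summary => {"PostsController#index"=>{count: 2, p95_ms: 2100},
#             "PostsController#show"=>{count: 1, p95_ms: 95}}
```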
Y.K.: And Carl actually talked about this at very high speed yesterday, and some of you may have been there. I would recommend watching the video when it comes out if you want to make use of this stuff in your own applications.
And then when you're finally done with all the processing, you need to actually do something with it. You need to put it somewhere so that the web app can get access to it, and for that we basically use Cassandra. And Cassandra, again, is mostly a dumb database, but it has high capacity, and it has some of the fault-tolerance capabilities that we want.
T.D.: We're just very, very write-heavy, right. Like, we tend to be writing more than we're ever reading.
Y.K.: Yup. And then when we're done with a particular batch, Cassandra basically kicks off the process over again. So we're basically doing these things as batches.
T.D.: So these are roll-ups, is what's happening here. Basically every minute, every ten minutes, and then every hour, we reprocess and we re-aggregate, so that when you query us, we know exactly what to give you.
Y.K.: Yup. So we sort of have this cycle where we start off, obviously, in the first five seconds, the first minute, you really want high granularity. You want to see what's happening right now. But if you want to go back and look at data from three months ago, you probably care about it at, like, the day granularity, or maybe the hour granularity. So we basically do these roll-ups and we cycle through the process.
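A sketch of the roll-up idea, with an invented bucket layout: fine-grained buckets get folded into coarser ones, so old data keeps only the granularity anyone will ask for:

```ruby
# Buckets map a Unix timestamp to aggregate stats. Rolling up to hours
# just re-keys each bucket to the top of its hour and sums the stats.
def roll_up(buckets, width_seconds)
  buckets.group_by { |ts, _| ts - (ts % width_seconds) }
         .transform_values do |entries|
           stats = entries.map { |_, v| v }
           { count:    stats.sum { |v| v[:count] },
             total_ms: stats.sum { |v| v[:total_ms] } }
         end
end

minute_buckets = {
  3600 => { count: 10, total_ms: 900 },
  3660 => { count:  5, total_ms: 600 },
}
roll_up(minute_buckets, 3600)
# => {3600=>{count: 15, total_ms: 1500}}
```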
T.D.: So it turns out building this system required an intense amount of work. Carl spent probably six months reading PhD thesises to find-
Y.K.: Theses.
T.D.: Theses. To find data structures and algorithms that we could use. Because this is a huge amount of data. Like, I think even a few months after we went into private beta, we were already handling over a billion requests per month. And obviously there's no way that we-
Y.K.: Basically the number of requests that
we handle
is the sum of all of the requests that
you handle.
T.D.: Right.
Y.K.: And all of our customers handle.
T.D.: Right. Right. So.
Y.K.: So that's a lot of requests.
T.D.: So obviously we can't provide a service, at least not an affordable service, an accessible service, if we have to store terabytes or exabytes of data just to tell you how your app is running.
Y.K.: And I think it's also problematic if you store all the data in a database and then, every single time someone wants to learn something about it, you have to do a query. Those queries can take a very long time. They can take minutes. And we really wanted something where the feedback loop would be fast. So we wanted to find algorithms that let us handle the data in real time, and then provide it to you in real time, instead of this, like, dump the data somewhere and then run complicated queries approach.
T.D.: So, hold on. This slide was not supposed to be here. It was supposed to be a Rails slide. Whoa, I went too far. OK. We'll watch that again. That's pretty cool.
So then the last thing I want to say is, perhaps your takeaway from looking at this architecture diagram is, oh my gosh, these Rails guys completely-
Y.K.: They really jumped the shark.
T.D.: They jumped the shark. They ditched Rails. I saw, like, three Tweets yesterday - I wasn't here, I was in Portland yesterday - but I saw, like, three Tweets that were like, I'm at RailsConf and I haven't seen a single talk about, like, Rails.
So that's true here, too. But I want to assure you that we are only using this stack for the heavy computation. We started in Rails. We were like, hey, what do we need? Ah, well, people probably need to authenticate and log in, and we probably need to do billing. And those are all things that Rails is really, really good at. So we started with Rails as, basically, the starting point, and then we realized, oh my gosh, computation is really slow, there's no way we're gonna be able to offer this service. OK. Now let's think about how we can do all of that.
Y.K.: And I think, notably, a lot of people who look at Rails - there's a lot of companies that have built big stuff on Rails, and their attitude is, like, oh, this legacy terrible Rails app, I really wish we could get rid of it. If we could just write everything in Scala or Clojure or Go, everything would be amazing.
That is definitely not our attitude. Our attitude is that Rails is really amazing at the kinds of things that are really common across everyone's web applications - authentication, billing, et cetera. And we really want to be using Rails for those parts of our app; even things like error-tracking, we do through the Rails app. We want to be using Rails because it's very productive at doing those things. It happens to be very slow at doing data crunching, so we're gonna use a different tool for that.
But I don't think you'll ever see me getting up and saying, ah, I really wish we had just started writing, you know, the Rails app in Rust.
T.D.: Yeah.
Y.K.: That would be terrible.
T.D.: So that's number one: honest response times. Which, it turns out, seems like it should be easy, but requires storing an insane amount of data.
So the second thing that we realized when
we
were looking at a lot of these tools, is
that most of them focus on data. They focus
on giving you the raw data. But I'm not
a machine. I'm not a computer. I don't enjoy
sifting through data. That's what computers
are good for.
I would rather be drinking a beer. It's really
nice in Portland, this time of year.
So, we wanted to think about: if you're trying to solve the performance problems in your application, what are the things that you would suss out with the existing tools only after spending, like, four hours depleting your ego to get there?
Y.K.: And I think part of this is just that people like to think that they're gonna use these tools, but when the tools require you to dig through a lot of data, people just don't use them very much. So the goal here was to build a tool that people actually use and actually like using, and not to build a tool that happens to provide a lot of data you can sift through.
T.D.: Yes.
So, probably one of the first things that we realized is what we don't want to provide. This is a trace of a request. You've probably seen similar UIs in other tools, for example the inspector in Chrome or Safari. This is just showing, basically, a visual stack trace of where your application is spending its time.
But I think what was important for us is showing not just a single request, because your app handles, you know, hundreds of thousands of requests, or millions of requests. So looking at a single request is, statistically, just noise.
Y.K.: And it's especially bad if it's the
worst
request, because the worst request is, is
really noise.
It's like, a hiccup in the network, right.
T.D.: It's the outlier. Yeah.
Y.K.: It's literally the outlier.
T.D.: It's literally the outlier. Yup. So, what we present in Skylight is something a little bit different, something that we call the aggregate trace. The aggregate trace is basically us taking all of your requests, averaging out where each of these things spends its time, and then showing you that. So this is like the statue of David. It is the idealized form of the stack trace, of how your application's behaving.
But, of course, you have the same problem as before, which is, if this is all that we were showing you, it would be obscuring a lot of information. You want to actually be able to tell the difference between, OK, what does my stack trace look like for fast requests, and how does that differ from requests that are slower.
So what we've got, I've got a little video
here. You can see that when I move the
slider, that this trace below it is actually
updating
in real time. As I move the slider around,
you can see that the aggregate trace actually
updates
with it. And that's because we're collecting
all this
information. We're collecting, like I said,
a lot of
data. We can recompute this aggregate trace
on the
fly.
Basically, for each bucket, we're storing
a different trace,
and then on the client we're reassembling
that. We'll
go into that a little bit.
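The actual reassembly happens in JavaScript on the client; the idea, sketched here in Ruby with an invented per-bucket shape, is a weighted re-average of each trace node across the selected buckets:

```ruby
# Each histogram bucket carries its own averaged trace plus a request
# count. Selecting a range of buckets means re-averaging every node,
# weighted by how many requests each bucket saw.
def merge_traces(buckets)
  totals = Hash.new { |h, k| h[k] = { time: 0.0, weight: 0 } }
  buckets.each do |bucket|
    bucket[:nodes].each do |name, avg_ms|
      totals[name][:time]   += avg_ms * bucket[:count]
      totals[name][:weight] += bucket[:count]
    end
  end
  totals.transform_values { |v| v[:time] / v[:weight] }
end

merge_traces([
  { count: 90, nodes: { "db.sql" => 12.0 } },
  { count: 10, nodes: { "db.sql" => 310.0 } },
])
# => {"db.sql"=>41.8}
```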
Y.K.: And I think it's really important that you be able to do these experiments quickly. If every time you think, oh, I wonder what happens if I add another histogram bucket, it requires a whole full-page refresh, then that would basically make people not want to use the tool. Not able to use the tool. So actually building something which is real time and fast, which gets the data as it comes, was really important to us.
T.D.: So that's number one.
And the second thing. So we built that, and we're like, OK, well, what's next? And I think the big problem with this is that you need to know that there's a problem before you go look at it, right. So we have been working for the past few months - and the Storm infrastructure that we built makes it pretty straightforward to start building more abstractions on top of the data that we've already collected.
It's a very declarative system. So we've been working on a feature called inspections. And what's cool about inspections is that we can look at this tremendous volume of data that we've collected from your app, and we can automatically tease out what the problems are. So the first one that we shipped - this is in beta right now. It's not out and enabled by default, but it's behind a feature flag that we've had some users turning on and trying out. And so what we can do in this case is, because we have information about all of the database queries in your app, we can look and see if you have N+1 queries. Can you maybe explain what an N+1 query is?
Y.K.: Yeah. So, people know, hopefully, what N+1 queries are. But it's the idea that, by accident, for some reason, instead of making one query, you ask for, like, all the posts, and then you iterate through all of them and get all the comments, and now, instead of having one query, you have one query per post, right. And what I like to do is eager loading, where you say, include comments, right. But you have to know that you have to do that.
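For reference, the canonical Rails shape of the problem and of the eager-loading fix:

```ruby
# The N+1: one query for the posts, then one more query per post.
Post.all.each do |post|
  post.comments.each { |c| puts c.body } # SELECT ... WHERE post_id = ? (N times)
end

# Eager loading: two queries total, no matter how many posts there are.
Post.includes(:comments).each do |post|
  post.comments.each { |c| puts c.body } # already in memory
end
```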
So there are some tools that will run in development mode, if you happen to catch it there, like Bullet. Ours is basically a tool that's looking at every single one of your queries, and it has some thresholds. We do some work to pull out binds: if it's, like, WHERE something = 1, we will automatically pull out the one and replace it with a question mark. And then we basically take all those queries, and if they're the exact same query repeated multiple times, subject to some thresholds, we'll start showing you: hey, there's an N+1 query.
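A sketch of that bind-stripping idea; the real agent does this more carefully than a regex, and the threshold here is invented:

```ruby
# Strip literal binds so repeated instances of the "same" query compare
# equal. A regex is only demo-grade; real SQL needs a real parser.
def normalize(sql)
  sql.gsub(/'(?:[^']|'')*'/, "?") # string literals
     .gsub(/\b\d+\b/, "?")        # numeric literals
end

queries = [
  "SELECT * FROM comments WHERE post_id = 1",
  "SELECT * FROM comments WHERE post_id = 2",
  "SELECT * FROM comments WHERE post_id = 3",
]
queries.group_by { |q| normalize(q) }
       .select { |_, qs| qs.length >= 3 } # hypothetical N+1 threshold
```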
And you can imagine this same sort of thing being done for things like, are you missing an index, right? Or, are you using the Ruby version of JSON when you should be using the native version? These are all things that we can start detecting, just because we're consuming an enormous amount of information, and we can start writing some heuristics for bubbling it up.
So, third and final breakthrough: we realized that we really, really needed a lightning-fast UI. Something really responsive. In particular, the feedback loop is critical, right. You can imagine, if the way that you dug into data was, you clicked and you waited an hour and then you got your results, no one would do it. No one would ever do it.
And the existing tools are OK, but you click and you wait. You look at it and you're like, oh, I want a different view, so then you go edit your query, and then you click and you wait, and it's just not a pleasant experience.
So, we use Ember. The UI that you're using when you log into Skylight, even though it feels just like a regular website, not like a native app - all of the routing, all of the rendering, all of the decision making is happening in an Ember.js app. And we pair that with D3. So all of the charts, the charts that you saw there in the aggregate trace, are all Ember components powered by D3.
So, this has actually significantly cleaned up our client-side code. It makes reusability really, really awesome. To give you an example, this is from our billing page. The designer came, and they had a component that was, like, the date component.
And, the-
T.D.: It seems really boring at first.
Y.K.: It seemed really boring. But this is the implementation, right. So you could copy and paste this code over and over again, everywhere you go. Just remember to format it correctly. If you forget to format it, it's not gonna look the same everywhere.
But I was like, hey, we're using this all over the place. Why don't we bundle this up into a component? And so with Ember, it was super easy. We basically just said, OK, here's a new calendar-date component. It has a property on it called date. Just set that to any JavaScript Date object; you don't have to remember anything about converting it or formatting it. Here's the component: set the date, and it will render the correct thing automatically.
And so the architecture of the Ember app looks a little bit like this: you have many, many different components, most of them just driven by D3, and they're plugged into the model and the controller.
And the Ember app will go fetch those models from the cloud - and the cloud, from the Java app, which just queries Cassandra - and render them. And what's neat about this model is that turning on web sockets is super easy, right. Because all of these components are bound to a single place, when the web socket says, hey, we have updated information for you to show, it just pushes it onto the model or onto the controller, and the whole UI updates automatically.
It's like magic.
And-
Y.K.: Like magic.
T.D.: It's like magic. And when debugging, this is especially awesome, too. I'll maybe show a demo of the Ember inspector. It's nice.
So. Yeah. So, lightning-fast UI. Reducing the feedback loop so that you can quickly play with your data makes it go from a chore to something that actually feels kind of fun.
So, these were the breakthroughs that we had when we were building Skylight. The things that made us think, yes, this is actually a product that deserves to be on the market. So: one, honest response times - collect data that no one else can collect. Two, focus on answers instead of just dumping data. And three, have a lightning-fast UI to do it.
So we like to think of Skylight as basically a smart profiler. It's a smart profiler that runs in production. It's like the profiler that you run on your local development machine, but instead of being on your local dev box, which has nothing to do with the performance characteristics of what your users are experiencing, we're actually running in production.
So, let me just give you guys a quick demo.
So, this is what Skylight looks like. What's under this? There we go.
So, the first thing here is we've got the app dashboard. And it looks like our 95th percentile response time has spiked. Maybe you're all hammering it right now. That would be nice.
So, this is a graph of your response time over time, and then on the right is a graph of the RPMs, the requests per minute, that your app is handling. So this is app-wide. And this is live. This updates every minute.
Then down below, you have a list of the end points in your application. You can see that, actually, the slowest ones for us - we have an instrumentation API, and we've gone and instrumented our background workers, so we can see them here, and their response time plays in. We can see that we have this reporting worker that's taking, at the 95th percentile, thirteen seconds.
Y.K.: So all that time used to be inside of some request somewhere, and we discovered that there was a lot of time being spent on things that we could push to the background. We probably need to update the agony index so that it doesn't rank workers very high, because spending some time in your workers is not that big of a deal.
T.D.: So then, if we dive into one of these, you can see that for this request, we've got the time explorer up above, and that shows a graph of response time at, again, the 95th percentile. And if you want to go back and look at historical data, you just drag it like this. And it's got a brush, so you can zoom in and out on different time ranges.
And every time you change the range, you can see that it's very responsive. It's never waiting for the server. But it is going back and fetching data from the server, and when the data comes back, you see the whole UI just update. And we get that for free with Ember and D3.
And then down below, as we discussed, you actually have a real histogram. And this histogram, in this case, is showing - so this is for fifty-seven requests. And if we click and drag, we can just move this. And you can see that the aggregate trace below updates in response to us dragging this. And if we want to look at the fastest quartile, we just click Faster, and we'll just choose that range on the histogram.
Y.K.: I think it's the fastest load.
T.D.: The fastest load. And then if you click
on slower, you can see the slower requests.
So
this makes it really easy to compare and contrast.
OK. Why are certain requests faster and why
are
certain requests slow?
You can see these blue areas - this is Ruby code. So, right now it's not super granular. It would be nice if you could actually know what was going on in there. But it'll at least tell you where in your controller action this is happening, and then you can actually see which database queries are being executed, and what their duration is.
And you can see that we actually extract the SQL, and we normalize it, so you can see exactly what those requests are, even if the values are totally different between them.
Y.K.: Yeah. So the real query, courtesy of Rails not yet supporting bind extraction, is, like, WHERE id = 1, or 10, or whatever.
T.D.: Yup. So that's pretty cool.
Y.K.: So one other thing is, initially, we actually just showed the whole trace, but we discovered that, obviously, when you show whole traces, you have information that doesn't really matter that much. So we've recently started to collapse things that don't matter so much, so that you can basically expand or condense the trace.
And we wanted to make it so that you don't have to think about expanding or condensing individual areas; you just see what matters the most, and then you can expand to see the trivial parts.
T.D.: Yup. So that's the demo of Skylight. We'd really like it if you checked it out. There is one more thing I want to show you that is, like, really freaking cool. This is coming out of Tilde labs. Carl has been hacking on it - he's been up until past midnight, getting almost no sleep for the past month, trying to have this ready.
I don't know how many of you know this, but Ruby 2.1 has a new stack-sampling feature. So you can get really granular information about how your Ruby code is performing.
So I want to show you - I just mentioned how it would be nice if we could get more information about what your Ruby code is doing. And now we can do that.
Basically, every few milliseconds, this code that Carl wrote goes into the Ruby VM, into MRI, and takes a snapshot of the stack. And because this is built in, it's very low-impact. It's not allocating any new memory, and it's very little performance hit; basically, you wouldn't even notice it. And so every few milliseconds it's sampling, and we take that information and we send it up to our servers.
So it's almost like you're running a Ruby profiler on your local dev box, where you get extremely granular information about where your code is spending its time in Ruby - per method, per all of these things. But it's happening in production.
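For a feel of the underlying mechanism, the stackprof gem is built on that same Ruby 2.1 sampling API; handle_request here is a hypothetical stand-in:

```ruby
require "stackprof" # gem built on Ruby 2.1's frame-sampling API

# Sample the call stack every 1000 microseconds of wall time while the
# block runs; the aggregated frames show where the time actually went,
# without the cost of tracing every single call.
profile = StackProf.run(mode: :wall, interval: 1000) do
  handle_request # hypothetical stand-in for a controller action
end

StackProf::Report.new(profile).print_text
```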
So this is it - we enabled it in staging. You can see that we've got some rendering bugs. It's still in beta.
Y.K.: Yeah, and we haven't yet collapsed things
that
are not important-
T.D.: Yes.
Y.K.: -for this particular feature.
T.D.: So we want to hide things like framework code, obviously. But this gives you an incredibly, incredibly granular view of what your app is doing in production. This is an API that's built into Ruby 2.1. Because our agent is running so low-level, because we wrote it in Rust, we have the ability to do things like this, and Carl thinks that we may be able to actually backport this to older Rubies, too. So if you're not on Ruby 2.1, we think that we can actually bring you this. But that's TBD.
Y.K.: Yeah - so I think the cool thing about this, in general, is when you run a sampling - so this is a sampling profiler, right. We don't want to be burdening every single thing that you do in your program with tracing, right. That would be very slow.
So when you normally run a sampling profiler, you basically have to make a loop. You have to basically create a loop, run this code a million times, and keep sampling it; eventually you'll get enough samples to get the information. But it turns out that your production server is a loop. Your production server is serving tons and tons of requests. So, by simply taking, you know, a few microseconds out of every request and collecting a couple of samples, over time we can actually get this really high-fidelity picture at basically no cost.
And that's pretty mind-blowing. And this is the kind of stuff that we can start doing by really caring about both the user experience and the implementation, and getting really scary about it. And honestly, this is a really exciting feature that really shows what we can do as we start building this out.
T.D.: Once we've got that groundwork.
So if you guys want to check it out: skylight.io. It's available today. It's no longer in private beta. Everyone can sign up, no invitation token necessary. And you can get a thirty-day free trial if you haven't started one already. So if you have any questions, please come see us right now, or we have a booth in the vendor hall. Thank you guys very much.