Lec 8 | MIT 6.033 Computer System Engineering, Spring 2005

Performance

View the complete course at: http://ocw.mit.edu/6-033S05

License: Creative Commons BY-NC-SA
More information at http://ocw.mit.edu/terms
More courses at http://ocw.mit.edu

  1. Let's go ahead and get started.
  2. OK, so today we have one topic
    to finish up very briefly from
  3. last time.
    So if you remember,
  4. when we finished off last time,
    we were talking about the
  5. example of a multithreaded Web
    server.
  6. So the example that we were
    talking about,
  7. and this is an example that I'm
    going to use throughout the
  8. lecture today,
    consists of a Web server with
  9. three main modules or main
  10. components.
  11. The three modules are a
    networking module,
  12. a Web server module --
  13. -- which is in charge of
    generating, for example,
  14. HTML pages, and then a disk
    module which is in charge of
  15. reading data off a disk.
    OK, so this thing is going to
  16. be communicating with the disk,
    which I've drawn as a cylinder
  17. here.
    So what happens is the client
  18. requests come in to this Web
    server.
  19. They come in to the network
    module.
  20. The network module forwards
    those requests on to the Web
  21. server.
    The Web server is in charge of
  22. generating, say,
    the HTML page that corresponds
  23. to the request.
    And in order to do that,
  24. it may need to read some data
    off of the disk.
  25. So it forwards this request
    onto the disk module,
  26. which goes and actually gets
    the page from the disk and at
  27. some point later,
    the disk returns the page to
  28. the Web server.
    The Web server returns the page
  29. to the network module,
    and then the network module
  30. sends the answer back over the
    network to the user.
  31. So this is a very simple
    example of a Web server.
  32. It should be sort of familiar
    to you since you have just spent
  33. a while studying the Flash Web
    server.
  34. So you can see that this is
    sort of a simplified description
  35. of what a Web server does.
    So if you think about how you
  36. actually go about designing a
    Web server like this,
  37. of course it's not the case
    that there is only one request
  38. that is moving between these
    modules at any one point in
  39. time.
    So, in fact,
  40. there may be multiple client
    requests that come into the
  41. network module.
    And the network module may want
  42. to have multiple outstanding
    pages that it's asking the Web
  43. server to generate.
    And the Web server itself might
  44. be requesting multiple items
    from the disk.
  45. And so, in turn,
    that means that at any point in
  46. time there could be sort of
    multiple results.
  47. There could be results
    streaming back in from the disk
  48. which are going into the Web
    server, which is sort of chewing
  49. on results and producing pages
    to the network module.
  50. And so it's possible for there
    to be sort of queues that are
  51. building up between these
    modules both on the send and
  52. receive.
    So, I'm going to draw a queue,
  53. I'll draw a queue just sort as
    a box with these vertical arrows
  54. through it.
    So there is some buffering
  55. that's happening between the
    sort of incoming requests and
  56. the outgoing requests on these
    modules.
  57. OK, and this buffering is a
    good thing.
  58. And we're going to talk more
    about this throughout the
  59. lecture today because what it
    allows us to do is it allows us
  60. to decouple the operations of
    these different modules.
  61. So, for example,
    the disk module can be reading
  62. a page from disk while the HTML
    page, while the HTML server is,
  63. for example simultaneously
    generating an HTML page that
  64. wants to return to the client.
    But in this architecture,
  65. you can see that,
    for example,
  66. when the Web server wants to
    produce a result,
  67. it can only produce a result
    when the disk pages that it
  68. needs are actually available.
    So the Web server is dependent
  69. on some result from the disk
    module being available.
  70. So if we were to look at just
    this Web server,
  71. I'm going to call this the HTML
    thread here and the disk thread,
  72. so these two threads that are
    on the right side of this
  73. diagram that I've drawn here,
    and look at the code that
  74. was running in these things,
    as we saw last time,
  75. the code might look something
    like this.
  76. So the HTML thread is just
    going to sit in a loop
  77. continually trying to de-queue
    information from this queue that
  78. is shared between it and the
    disk thread.
  79. And then, the disk thread is
    going to be in a loop where it
  80. continually reads blocks off the
    disk, and then enqueues them
  81. onto this queue.
    So this design at first seems
  82. like it might be fine.
    But then if you start thinking
  83. about what's really going on
    here, there could be a problem.
  84. So, suppose for example that
    the queue is of a finite length.
  85. It only has a certain number of
    elements in it.
  86. Now when we keep calling
    enqueue over and over and over
  87. again, it's possible that if the
    HTML thread isn't consuming
  88. these pages off the queue fast
    enough, that the queue could
  89. fill up, and it could overflow,
    right?
  90. So that might be a problem that
    we want to sort of make a
  91. condition that we would
    explicitly check for in the
  92. code.
    And so we could do that by
  93. adding a set of conditions like
    this.
  94. So what you see here is that I
    have just augmented the code
  95. with these two additional
    variables, used and free,
  96. where used indicates the number
    of blocks that are in the queue
  97. that are currently in use.
    And free indicates the number
  98. of blocks that are
  99. in the queue that are currently
    free.
  100. So what this loop does is that
    the disk thread says it only
  101. wants to enqueue something onto
    the queue when there are some
  102. free blocks.
    So, it has a while loop that
  103. just loops over and over,
    waiting, while
  104. there are no free blocks,
    OK?
  105. And similarly,
    the HTML thread is just going
  106. to wait forever when there are
    no used blocks,
  107. OK?
    And then, when the disk thread
  108. enqueues a block onto the queue,
    it's going to decrement the
  109. free count because it's reduced
    the number of free slots in
  110. the queue.
    And it's going to increment the
  111. used count because now there is
    one additional thing that's
  112. available in the queue,
    OK?
  113. So this is a simple way in
    which now we've made it so these
  114. things are waiting for each
    other.
  115. They are coordinating with each
    other by use of these two shared
  116. variables, used and free.
    OK, so these two threads share
  117. these variables.
    So that's fine.
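
As a concrete sketch of the polling version just described (this is illustrative C, not the code from the board: the queue size, the simulated 10 ms disk read, and the names used/nfree are invented, with free renamed to nfree only to avoid the C library name):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    #define QSIZE 16                          /* illustrative queue length        */

    static int buf[QSIZE];                    /* the shared queue of disk blocks  */
    static int in = 0, out = 0;               /* producer / consumer positions    */
    static atomic_int used = 0;               /* slots currently holding a block  */
    static atomic_int nfree = QSIZE;          /* slots currently free             */

    static int read_block_from_disk(void) {   /* stand-in for a ~10 ms disk read  */
        static int block = 0;
        usleep(10000);
        return block++;
    }

    static void *disk_thread(void *arg) {
        (void)arg;
        for (;;) {
            int b = read_block_from_disk();
            while (atomic_load(&nfree) == 0)  /* spin: queue is full...           */
                ;                             /* ...busy-waiting burns the CPU    */
            buf[in] = b;
            in = (in + 1) % QSIZE;
            atomic_fetch_sub(&nfree, 1);      /* one fewer free slot              */
            atomic_fetch_add(&used, 1);       /* one more used slot               */
        }
        return NULL;
    }

    static void *html_thread(void *arg) {
        (void)arg;
        for (;;) {
            while (atomic_load(&used) == 0)   /* spin: nothing to consume yet     */
                ;
            int b = buf[out];
            out = (out + 1) % QSIZE;
            atomic_fetch_sub(&used, 1);
            atomic_fetch_add(&nfree, 1);
            printf("generated page from block %d\n", b);  /* stand-in for HTML work */
        }
        return NULL;
    }

    int main(void) {
        pthread_t d, h;
        pthread_create(&d, NULL, disk_thread, NULL);
        pthread_create(&h, NULL, html_thread, NULL);
        pthread_join(d, NULL);                /* both loops run forever in this sketch */
        return 0;
    }
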
  118. But if you think about this
    from a scheduling point of view,
  119. there still is a little bit of
    a problem with this approach.
  120. So in particular,
    what's going on here is that,
  121. oops, when one of these threads
    enters into one of these while
  122. loops, it's just going to sit
    there checking this condition
  123. over and over and over and over
    again, right?
  124. So then the thread scheduler
    schedules that thread.
  125. It's going to repeatedly check
    this condition.
  126. And that's maybe not so
    desirable.
  127. So suppose, for example,
    that the HTML thread enters
  128. into this loop and starts
    looping because there's no data
  129. available.
    Now, what really we would like
  130. to have happen is for the disk
    thread to be allowed to get a
  131. chance to run,
    so maybe it can produce some
  132. data so that the HTML thread can
    then go ahead and operate.
  133. But, with this while loop
    there, we can't quite do that.
  134. We just sort of waste the CPU
    during the time we are in this
  135. while loop.
    So, instead what we are going
  136. to do is introduce the set of
    what we call sequence
  137. coordination operators.
  138. So in order to introduce this,
    we're going to add something,
  139. a new kind of data type that we
    call an event count.
  140. An event count,
    you can just think of it as an
  141. integer that indicates the
    number of times that something
  142. has occurred.
    It's just some sort of running
  143. counter-variable.
    And we're going to introduce
  144. two new routines.
    So these two routines are
  145. called wait and notify.
    OK, so wait takes two
  146. arguments.
    It takes one of these event
  147. count variables.
    And it takes a value.
  148. OK, so what wait says is check
    the value of this event count
  149. thing, and see whether, when
    we check it, the value of this
  150. event count is less than or
    equal to value.
  151. If the event count is less than
    or equal to value,
  152. then it waits.
    And what it means for it to
  153. wait is that it tells the thread
    scheduler that it no longer
  154. wants to be scheduled until
    somebody later calls this notify
  155. routine on this same event count
    variable.
  156. OK, so wait says wait,
    if this condition is true,
  157. and then notify says,
    wake up everybody who's waiting
  158. on this variable.
    So we can use these routines in
  159. the following way in this code.
    And it's really very
  160. straightforward.
    We simply change our iteration
  161. through, our while loops into
    wait statements.
  162. So what we're going to do is
    we're going to have the HTML
  163. thread wait until the value of
    used becomes greater than zero.
  164. And we're going to have our
    disk thread wait until the value
  165. of free becomes greater than
    zero.
  166. And then the only other thing
    that we have to add to this is
  167. simply a call to notify.
    So what notify does is it
  168. indicates to any other thread
    that is waiting on a particular
  169. variable that that thread can
    run.
  170. So the HTML thread will notify
    free, which will tell the
  171. disk thread that it can now
    begin running if it had been
  172. waiting on the variable free.
    OK, so this emulates the
  173. behavior of the while loop that
    we had before except that the
  174. thread scheduler,
    rather than sitting in any
  175. infinite while loop simply
    doesn't schedule the HTML thread
  176. or the disk thread while it's
    waiting in one of these wait
  177. statements.
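
The lecture treats wait and notify as primitives supplied by the thread system. Purely as an illustration of their behavior, here is one possible sketch in C that realizes an event count with a mutex and condition variable and folds the notify into the counter update; none of this is the actual 6.033 implementation:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define QSIZE 16

    /* A counter that threads can wait on: roughly the lecture's event count. */
    typedef struct {
        long            count;
        pthread_mutex_t mu;
        pthread_cond_t  cv;
    } eventcount;

    static void ec_wait(eventcount *ec, long v) {   /* block while count <= v      */
        pthread_mutex_lock(&ec->mu);
        while (ec->count <= v)
            pthread_cond_wait(&ec->cv, &ec->mu);    /* not scheduled while waiting */
        pthread_mutex_unlock(&ec->mu);
    }

    static void ec_add(eventcount *ec, long d) {    /* change count, then notify   */
        pthread_mutex_lock(&ec->mu);
        ec->count += d;
        pthread_cond_broadcast(&ec->cv);            /* wake anyone waiting on it   */
        pthread_mutex_unlock(&ec->mu);
    }

    static int buf[QSIZE];
    static int in = 0, out = 0;
    static eventcount used  = { 0,     PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };
    static eventcount nfree = { QSIZE, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

    static void *disk_thread(void *arg) {
        (void)arg;
        for (int block = 0; ; block++) {
            usleep(10000);                   /* stand-in for a 10 ms disk read     */
            ec_wait(&nfree, 0);              /* wait until free > 0                */
            buf[in] = block;
            in = (in + 1) % QSIZE;
            ec_add(&nfree, -1);
            ec_add(&used, +1);               /* this is the notify on used         */
        }
        return NULL;
    }

    static void *html_thread(void *arg) {
        (void)arg;
        for (;;) {
            ec_wait(&used, 0);               /* wait until used > 0                */
            int b = buf[out];
            out = (out + 1) % QSIZE;
            ec_add(&used, -1);
            ec_add(&nfree, +1);              /* this is the notify on free         */
            printf("generated page from block %d\n", b);
        }
        return NULL;
    }

    int main(void) {
        pthread_t d, h;
        pthread_create(&d, NULL, disk_thread, NULL);
        pthread_create(&h, NULL, html_thread, NULL);
        pthread_join(d, NULL);
        return 0;
    }
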
    OK, so what we're going to talk
  178. about for the rest of the
    lecture today is related to
  179. this, and I think you will see
    why as we get through the talk.
  180. The topic for today is
    performance.
  181. So performance,
    what we've looked at so far in
  182. this class are these various
    ways of structuring complex
  183. programs, how to break them up
    into several modules,
  184. the client/server paradigm,
    how threads work,
  185. how a thread scheduler works,
    all of these sort of big topics
  186. about how you design a system.
    But we haven't said anything
  187. about how you take a system
    design and in an ordered,
  188. regular, systematic way,
    think about making that system
  189. run efficiently.
    So that's what we're going to
  190. try and get at today.
    We're going to look at a set of
  191. techniques that we can use to
    make a computer system more
  192. efficient.
    And so, these techniques,
  193. there are really three
    techniques that we're going to
  194. look at today.
    The first one is a technique
  195. called concurrency.
    And concurrency is really about
  196. allowing the system to perform
    multiple operations
  197. simultaneously.
    So, for example,
  198. in our sample Web server,
    it may be the case that we have
  199. this disc that we can sort of
    read pages from at the same time
  200. that, for example,
    the CPU generates some Web
  201. pages that it's going to output
    to the client.
  202. OK, so that's what concurrency
    is about.
  203. We are also going to look at a
    technique called caching,
  204. which you guys should have all
    seen before.
  205. Caching is really just about
    saving off some previous work,
  206. some previous computation that
    we've already done,
  207. or our previous disk page that
    we've already read in.
  208. We want to save it off so that
    we can reuse it again at a later
  209. time.
    And then finally,
  210. we are going to look at
    something called scheduling.
  211. So scheduling is about when we
    have multiple requests to
  212. process, we might be able to
    order those requests in a
  213. certain way or group the
    requests together in a certain
  214. way so that we can make the
    system more efficient.
  215. So it's really about sort of
    choosing the order in which we
  216. do things in order to make the
    system run more efficiently.
  217. And throughout the course of
    this, I'm going to use this
  218. example of this Web server that
    we've been talking about to sort
  219. of motivate each of the
    applications,
  220. or each of these performance
    techniques that we're going to
  221. talk about.
    So in order to get to the point
  222. where we can understand how
    these performance techniques
  223. work, we need to talk a little
    bit about what we mean by
  224. performance.
    How do we measure the
  225. performance of the system,
    and how do we understand where
  226. the bottlenecks in performance
    in a system might be?
  227. So one thing we might want to,
    the first thing we need to do
  228. is to define a set of
    performance metrics.
  229. These are just a set of terms
    and definitions that we can use
  230. so that we can talk about what
    the performance of the system
  231. is.
    So the first metric we might be
  232. interested in is the capacity of
    the system.
  233. And capacity is simply some
    measure of the amount of
  234. resource in a system.
    So this sounds kind of
  235. abstract, but what we mean by a
    resource is some sort of thing
  236. that we can compete with.
    It's a disk,
  237. or a CPU, or a network,
    so we might,
  238. for example,
    talk about the capacity of a
  239. disk might be the size in
    gigabytes or the capacity of a
  240. processor might be the number of
    instructions it can execute per
  241. second.
    OK, so once we have capacity,
  242. now we can start talking about
    how much of the system we are
  243. actually using.
    So we talk about the
  244. utilization.
    So utilization is simply the
  245. percentage of capacity we're
    using.
  246. So we might have used up 80% of
    the disk blocks on our computer.
  247. So now there are two sort of
    properties, or two metrics that
  248. are very commonly used in
    computer systems in order to
  249. classify or sort of talk about
    what the performance of the
  250. system is.
    So, the first metric is
  251. latency.
    So, latency is simply the time
  252. for a request to complete.
    REQ here is short for request.
  253. OK, and we can also talk about
    sort of the inverse of this,
  254. what at first will seem like
    the inverse of this,
  255. which is throughput.
    That's simply the number of
  256. requests per second that we can
    process.
  257. So when you think about latency
    and throughput,
  258. when you first see this
    definition, it's tempting to
  259. think that simply throughput is
    the inverse of latency,
  260. right?
    If it takes 10 ms for a request
  261. to complete, well,
    then I must be able to complete
  262. 100 requests per second,
    right?
  263. And, that's true in the simple
    case where in the very simple
  264. example where I have a single
    module, for example,
  265. that can process one request at
    a time, so a single
  266. computational resource,
    for example,
  267. that can only do one thing at a
    time, if this thing has some
  268. infinite set of inputs in it,
    it takes 10 ms to process each
  269. input, we'll see,
    say, 100 results per second
  270. coming out, OK?
    So if something takes 10 ms to
  271. do, you can do 100 of them per
    second.
  272. So we could say the throughput
    of this system is 100 per
  273. second, and the latency is 10
    ms.
  274. What we're going to see
    throughout this talk is that in
  275. fact a strict relationship
    between latency and throughput
  276. doesn't hold. I mean,
    you guys have probably already
  277. seen the notion of pipelining
    before in 6.004,
  278. and you understand that
    pipelining is a way in which we
  279. can improve the throughput of
    the system without necessarily
  280. changing the latency.
    And we'll talk about that more
  281. carefully as this talk goes on.
    OK, so given these metrics,
  282. now what we need to do is think
    a little bit about,
  283. OK, so suppose I have some
    system, and suppose I have some
  284. sort of set of goals for that
    system like I want the system to
  285. be able to process a certain
    number of requests per second,
  286. or I want the latency of this
    system to be under some amount.
  287. So then the question is,
    so you are given this computer
  288. system and you sit down and you
    want to measure it.
  289. And so you're going to measure
    the system.
  290. And what do you expect to find?
    So, in the design of computer
  291. systems, it turns out that there
    is some sort of well-known
  292. performance pitfalls,
    or so-called performance
  293. bottlenecks.
    And the goal of sort of doing
  294. performance analysis of a system
    is to look at the system and
  295. figure out where the bottlenecks
    are.
  296. So, this typically in the
    design of the big computer
  297. system, what we're worried about
    is which of the little
  298. individual modules within the
    system is most responsible for
  299. slowing down my computer.
    And what should I do in order
  300. to, sort of, and then once
    you've identified that module,
  301. thinking about how to make a
    particular module that's slow run
  302. faster.
    So that's really what finding
  303. performance bottlenecks is
    about.
  304. And there's a classic
    bottleneck that occurs in
  305. computer systems that you guys
    all need to know about.
  306. It's this so-called IO
    bottleneck.
  307. OK, so what the IO bottleneck
    says it's really fairly
  308. straightforward.
    If you think about a computer
  309. system, it has a hierarchy of
    memory devices in it,
  310. OK?
    And these memory devices,
  311. or storage devices,
    these storage devices
  312. start with the CPU.
    So the CPU has some set of
  313. registers on it,
    a small number of them,
  314. say for example,
    32.
  315. And you can access those
    registers very,
  316. very fast, say once per
    instruction, once per cycle on
  317. the computer.
    So, for example,
  318. if your CPU runs at one gigahertz,
    you may be able to access one
  319. of these registers in 1 ns.
    OK, and so typically at the
  320. top level of this pyramid,
  321. we have a small storage that is
    fast, OK?
  322. As we go down this pyramid
    adding new layers,
  323. and looking at this storage
    hierarchy, we're going to see
  324. that things get bigger and
    slower.
  325. So, just below the CPU,
    we may have some processor
  326. cache, OK, and this might be,
    for example,
  327. 512 kB.
    And it might take 20 ns to
  328. access a single,
    say, block of this memory.
  329. And then we're going to have
    the RAM, the main memory of the
  330. device, which on a modern
    machine might be 1 GB.
  331. And it might take 100 ns to
    access.
  332. And then below that,
    you take a big step down or big
  333. step up in size and big step
    down and performance.
  334. You typically have a disk.
    So a disk might be as big as
  335. 100 GB, right?
    But, performance is very slow.
  336. So it's a mechanical thing that
    has to spin,
  337. and it only spins so fast.
    So a typical access time for a
  338. block of the disk might be as
    high as 10 ms or even higher.
  339. And then sometimes people will
    say that in this hierarchy the
  340. network is actually a level
    below that.
  341. So if something isn't available
    on the local disk,
  342. for example,
    on our Web server,
  343. we might actually have to go
    out into the network and fetch
  344. it.
    And if this network is the
  345. Internet, right,
    the Internet has a huge amount
  346. of data.
    I mean, who knows how much it
  347. is.
    It's certainly orders of
  348. terabytes.
    And it could take a long time
  349. to get a page of the Internet.
    So it might take 100 ms to
  350. reach some remote site on the
    Internet.
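
To recap the ballpark figures being used for this hierarchy (rough numbers from the lecture, not exact specs):

    registers:           a few dozen words,     ~1 ns per access
    processor cache:     ~512 kB,               ~20 ns
    main memory (RAM):   ~1 GB,                 ~100 ns
    disk:                ~100 GB,               ~10 ms
    network (Internet):  terabytes and beyond,  ~100 ms to a remote site
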
  351. All right, so the point about
    this IO bottleneck is that this
  352. is going to be a very common,
    sort of the disparity in the
  353. performance of these different
    levels of the system is going to
  354. be a very common source of
    performance problems in our
  355. computers.
    So in particular,
  356. if you look at the access time,
    here's 1 ns.
  357. The access time down here is 100
    ms.
  358. This is a ten to the eighth
    difference, right,
  359. which is equal to 100 million
    times difference in the
  360. performance of the fastest to
    the slowest thing here.
  361. So, if the CPU has to wait for
    something to come over the
  362. network, you're waiting for a
    very long time in terms of the
  363. amount of time the CPU takes to,
    say, read a single word of
  364. memory.
    So when we look at the
  365. performance of a computer
    system, we're going to see that
  366. often this sort of IO bottleneck
    is the problem with that system.
  367. So if we look,
    for example,
  368. at our Web server,
    with its three stages,
  369. where this stage is the one
    that goes to disk,
  370. this is the HTML stage,
    which maybe can just be
  371. computed in memory.
    And this is the network stage.
  372. We might be talking about 10 ms
    latency for the disk stage.
  373. We might be talking about just
    1 ms for the HTML page,
  374. because all it has to do is do
    some computation in memory.
  375. And we might be talking about
    100 ms for the network stage to
  376. run because it has to send some
    data out to some remote site.
  377. So if you, in order to process
    a single request,
  378. have to go through each of
    these steps in sequence,
  379. then the total performance of
    the system, the time to process
  380. a single request is going to be,
    say for example,
  381. 111 ms, the sum of these three
    things, OK?
  382. And so if you look at the
    system and you say,
  383. OK, what's the performance
    bottleneck in this system?
  384. So the performance bottleneck,
    right, is clearly this network
  385. stage because it takes the
    longest to run.
  386. And so if we want to answer a
    question about where we should
  387. be optimizing the system,
    one place we might think to
  388. optimize is within this network
    stage.
  389. And we'll see later an example
    of a simple kind of optimization
  390. that we can apply based on this
    notion of concurrency to improve
  391. the performance of the
    networking stage.
  392. So as I just said,
    the notion of concurrency is
  393. going to be the way that we are
    really going to get at sort of
  394. eliminating these IO
    bottlenecks.
  395. So --
  396. And the idea is going to be
    that we want to overlap the use
  397. of some other resource during
    the time that we are waiting for
  398. one of these slow IO devices to
    complete.
  399. And, we are going to look at
    two types of concurrency.
  400. We're going to look at
    concurrency between modules --
  401. -- and within a module,
    OK?
  402. So we may have modules that are
    composed, for example,
  403. our networking module may be
    composed of multiple threads,
  404. each of which can be accessing
    the network.
  405. So that's an example of
    concurrency within a module.
  406. And, we're going to look at the
    case of between module
  407. concurrency where,
    for example,
  408. the HTML module can be
    processing and generating an
  409. HTML page, while the disk module
    is reading a request for another
  410. client at the same time.
    OK, and so the idea behind
  411. concurrency is really going to
    be by using concurrency,
  412. we can hide the latency of one
    of these slow IO stages.
  413. OK, so the first kind of
    concurrency we're going to talk
  414. about is concurrency between
    modules.
  415. And the primary technique we
    use for doing this is
  416. pipelining.
    So the idea with pipelining is
  417. as follows.
    Suppose we have our Web server
  418. again.
    And this time let's draw it as
  419. I drew it at first with queues
    between each of the modules,
  420. OK?
    So, we have our Web server
  421. which has our three stages.
    And suppose that what we are
  422. doing is we have some set of
    requests, sort of an infinite
  423. queue of requests that is sort
    of queued up at the disk thread,
  424. and the disk thread is
    producing these things.
  425. And we're sending them through.
    Well, we want to look at how
  426. many pages come out here per
    second, and what the latency of
  427. each page is.
    So, if we have some list of
  428. requests, suppose these requests
    are numbered R1 through RN,
  429. OK?
    So what's going to happen is
  430. that the first request is going
    to start being processed by the
  431. disk server, right?
    So, it's going to start
  432. processing R1.
    Now, in a pipelining system,
  433. what we're going to want to do
    is to have each one of these
  434. threads sort of working on a
    different request,
  435. each one of these modules
    working on a different request
  436. at each point in time.
    And because the disk is this
  437. independent resource from the
    CPU, is an independent resource
  438. from the network,
    this is going to be OK.
  439. These three modules aren't
    actually going to contend with
  440. each other too much.
    So what's going to happen is
  441. this guy's going to start
    processing R1,
  442. right?
    And then after 10 ms,
  443. he's going to pass R1 up to
    here, and start working on R2,
  444. OK?
    And now, 1 ms after that,
  445. this guy is going to finish R1
    and send it to here.
  446. And then, 9 ms after that,
    R2 is going to come up here.
  447. And this guy can start
    processing R3.
  448. OK, so does everybody sort of
    see where those numbers are
  449. coming from?
    OK.
  450. [LAUGHTER] Good.
    So now , what we're going to do
  451. is if we look at time starting
    with this equal to time zero,
  452. in terms of the requests that
    come in and out of this last
  453. network thread,
    we can sort of get a sense of
  454. how fast this thing is
    processing.
  455. So the first R1 enters into
    this system after 11 ms,
  456. right?
    It takes 10 ms to get through
  457. here and 1 ms to get through
    here.
  458. And, it starts processing R1 at
    this time.
  459. So, I'm going to write plus R1
    to suggest that we start
    processing it here.
    The next time that this module
  461. can do anything is 100 ms after
    it first started processing,
  462. that is, 100 ms after it
  463. started processing R1.
    So, at time 111 ms,
  464. it can output R1,
    or it's done processing R1.
  465. And then, of course,
    by that time,
  466. R2 and R3, some set of requests
    have already queued up in this
  467. queue waiting for it.
    So it can immediately begin
  468. processing R2 at this time,
    OK?
  469. So then, clearly what's going
    to happen is after 211 ms,
  470. it's going to output R2,
    and it's going to begin
  471. processing R3,
    OK?
  472. So, there should be a plus
    there and a plus there.
  473. So, and similarly,
    at 311 we're going to move onto
  474. the next one.
    So, if we look now at the system,
  475. we've done something pretty
    interesting, which is that it
  476. still took us,
    sort of the time for this
  477. request to travel through this
    whole thing was 111 ms.
  478. But if you look at the
    interarrival time between each of
  479. these successive outputs,
    they are only 100 ms apart,
  480. right?
    So we are only waiting as long
  481. as it takes the network thread
    to process a result in order
  482. to produce these results,
    in order to produce answers.
  483. So by pipelining the system in
    this way and having the Web
  484. server thread and the disk
    thread do their processing on
  485. later requests while R1 is
    processing its request,
  486. we can increase the throughput
    of the system.
  487. So in this case,
    we get an arrival every 100 ms.
  488. So the throughput is now equal
    to one result every 100 ms,
  489. or ten results per second,
    OK?
  490. So, even though the latency is
    still 111 ms,
  491. the throughput is no longer one
    over the latency because we have
  492. separated them in this way by
    pipelining them.
  493. OK, so that was good.
    That was nice.
  494. We improve the performance of
    the system a little bit.
  495. But we didn't really improve it
    very much, right?
  496. We increased the throughput of
    this thing a little bit.
  497. But we haven't really addressed
    what we identified earlier as
  498. being a bottleneck,
    which is the fact that this network
  499. stage is taking 100 ms to
    process.
  500. And in general,
    when we have a pipeline system
  501. like this, we can say that the
    throughput of the system is
  502. bottlenecked by the slowest
    stage of the system.
  503. So anytime you have a pipeline,
    the throughput of the system is
  504. going to be the throughput of
    the slowest stage.
  505. So in this case,
    the throughput is 10 results
  506. per second.
    And that's the throughput of
  507. the whole system.
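
To make the arithmetic explicit: the latency of one request is still the sum of the stage times, 10 ms + 1 ms + 100 ms = 111 ms. Without pipelining the throughput would be 1/(111 ms), about 9 requests per second; with the pipeline full, the output rate is set by the slowest stage alone, 1/(100 ms) = 10 requests per second.
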
    So if we want to increase the
  508. throughput anymore than this,
    what we're going to have to do
  509. is to somehow improve the
    performance of this module here.
  510. And the way that we're going to
    do that is also by exploiting
  511. concurrency.
  512. This is going to be this within
    a module concurrency.
  513. So if you think about how a Web
    server works,
  514. or how a network works,
    typically when we are sending
  515. these requests to a client,
    it's not that we are using up
  516. all of the available bandwidth
    of the network when we are
  517. sending these requests to a
    client, right?
  518. You may be able to send 100 MB
    per second out over your
  519. network.
    Or if you're connected to a
  520. machine here,
    you may be able to send 10 MB a
  521. second across the country to
    some other university.
  522. The issue is that it takes a
    relatively long time for that
  523. request to propagate,
    especially when that request is
  524. propagating out over the
    Internet.
  525. The latency can be quite high.
    But you may not be using all
  526. the bandwidth when you are,
    say, for example,
  527. sending an HTML page.
    So in particular it is the case
  528. that multiple applications,
    multiple threads,
  529. can be simultaneously sending
    data out over the network.
  530. And if that doesn't make sense
    to you right now,
  531. we're going to spend the whole
    next four lectures talking about
  532. network performance.
    And it should make sense for
  533. you.
    So just take my word for it
  534. that one of the properties of
    the network is that the
  535. latency of the network may be
    relatively high.
  536. But in this case we are not
    actually going to be using all
  537. the bandwidth that's available
    to us.
  538. So that suggests that there is
    an idle resource.
  539. It means that we sort of have
    some network bandwidth that we
  540. could be using that we are not
    using.
  541. So we'd like to take advantage
    of that in the design of our
  542. system.
    So we can do this in a
  543. relatively simple way,
    which is simply to say,
  544. let's, within networking
    module, rather than only having
  545. one thread sending out requests
    at one time, let's have multiple
  546. threads.
    Let's, for example have,
  547. say we have 10 threads.
    So we have thread one,
  548. thread two, thread ten,
    OK?
  549. And we're going to allow these
    all to be using the network at
  550. once.
    And they are all going to be
  551. talking to the same queue that's
    connected to the same HTML
  552. module that's connected to the
    same disk module.
  553. And there's a queue between
    these as well.
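
As a sketch of what "ten threads inside the network module" might look like (everything here is invented for illustration: NTHREADS, the 100 ms send_page_to_client stand-in, and a placeholder dequeue that is deliberately not thread-safe, which is exactly the race condition discussed below):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 10                       /* ten workers inside the network module */

    /* Placeholder for the shared queue: just hands out page numbers.  It is
     * deliberately NOT thread-safe; that race is the subject of what follows. */
    static int next_page = 0;
    static int dequeue(void) {
        return next_page++;
    }

    /* Stand-in for pushing one page out over the network: high latency (100 ms
     * here), but many of these sends can be in flight at the same time. */
    static void send_page_to_client(int page) {
        usleep(100000);
        printf("sent page %d\n", page);
    }

    static void *network_worker(void *arg) {
        (void)arg;
        for (;;) {                            /* each worker loops forever        */
            int page = dequeue();             /* all ten share the same queue     */
            send_page_to_client(page);
        }
        return NULL;
    }

    int main(void) {
        pthread_t workers[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&workers[i], NULL, network_worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(workers[i], NULL);
        return 0;
    }
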
  554. OK, so now when we think about
    the performance of this,
  555. now let's see what happens when
    we start running requests
  556. through this pipeline.
    And let's see how frequently we
  557. get requests coming out of the
    other end.
  558. We draw our timeline again.
    You can see that R1 is going to
  559. come in here,
    and then after 10 ms it's going
  560. to move to here.
    And then after 11 ms it'll
  561. arrive here.
    We'll start processing request
  562. one.
    Now the second request,
  563. R2, is going to be here.
    And, we're going to have 9 ms
  564. of processing left to do on it
    after R1 gets sent on to
  565. the next thread.
    So, R2 is going to be in here
  566. for 9 ms.
    It will be in here for 1 ms.
  567. So, 10 ms after R1 arrives
    here, R2 is going to arrive
  568. here.
    So, what we have is we have 11
  569. ms.
    We have R1.
  570. Now, 10 ms later,
    we have R2.
  571. OK, so now you can see that
    suddenly this module,
  572. this system is able to process
    multiple requests,
  573. so it has multiple requests
    that it's processing at the same
  574. time.
    And so 10 ms after that,
  575. R3 is going to start being
    processed, and then,
  576. so what that means is that
    after some passage of time,
  577. we're going to have R10 in
    here.
  578. And, that's going to go in
    after 101 ms,
  579. right?
    So, we're going to get R10.
  580. OK, and now we are ready to
    start processing.
  581. Now we've sort of pushed all
    these through.
  582. And now, suppose we start
    processing R11.
  583. OK, so R11 is going to flow
    through this pipeline.
  584. And then, it's at time 111,
    R11 is going to be ready to be
  585. processed.
    But notice that at time 111,
  586. we are finished processing R1,
    right?
  587. So, at this time,
    we can add R11 to the system,
  588. and we can output R1.
    OK, so now every 10 ms after
  589. this, another result is going to
    arrive, and we're going to be
  590. able to output the next one.
    OK, and this is just going to
  591. continue.
    So now, you see what we've
  592. managed to do is we've made this
    system so that every 10 ms after
  593. this sort of startup time of 111
    ms,
  594. we are producing results,
    right?
  595. So we are going to get,
    actually, 100 per second.
  596. This is going to be the
    throughput of this system now.
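
Making those numbers explicit: each network thread still spends 100 ms per page, but with ten of them the network stage can have ten sends in flight, so its effective throughput is 10/(100 ms) = 100 pages per second. The disk stage does one 10 ms read at a time, which is also 100 per second, so the two stages are now matched, and after the 111 ms start-up a finished result comes out every 10 ms.
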
  597. OK, so that was kind of neat.
    How did we do that?
  598. What have we done here?
    Well, effectively what we've
  599. done is we've made it so that
    this module here can process
  600. sort of 10 times as many
    requests as it could before.
  601. So this module itself now has
    10 times the throughput that it
  602. had before.
    And we said before that the
  603. bottleneck in the system is,
    the throughput of the system is
  604. the throughput of the slowest
    stage.
  605. So what we've managed to do is
    increase the throughput of the
  606. slowest stage.
    And so now the system is
  607. running 10 times as fast.
    Notice now that the disk thread
  608. and the network threads both
    take 10 ms, sort of the
  609. throughput of each of them is
    100 per second.
  610. And so, now we have sort of two
    stages that have been equalized
  611. in their throughput.
    And so if we wanted to further
  612. increase the performance of the
    system, we would have to
  613. increase the performance of both
    of these stages,
  614. not just one of them.
    OK, so that was a nice result,
  615. right?
    It seems like we've done
  616. something sort of,
    we've shown that we can use
  617. this notion of concurrency to
    increase the performance of a
  618. system.
    But, we've introduced a little
  619. bit of a problem.
    In particular,
  620. the problem we've introduced is
    as follows.
  621. So, remember,
    we said we had this set of
  622. threads, one through,
    say for example,
  623. ten, that our processing,
    they're all sharing this queue
  624. data structure that is connected
    up to our HTML thread.
  625. So, the problem with this is
    that what we've done is to
  626. introduce what's called a race
    condition on this queue.
  627. And I'll show you what I mean
    by that.
  628. So if we look at our code
    snippet up here,
  629. for example for what's
    happening in our HTML thread,
  630. we see that what it does is it
    calls de-queue,
  631. right?
    So the problem that we can have
  632. is that we may have multiple of
    these modules that are
  633. simultaneously executing at the
    same time.
  634. And they may simultaneously
    both call de-queue,
  635. right?
    So depending on how de-queue is
  636. implemented, we can get some
    weird results.
  637. So, let me give you a sort of
    very simple possible
  638. implementation of de-queue.
    Suppose that what de-queue does
  639. is it reads, so,
    given this queue here,
  640. let's say the queue is managed
    by, there's two variables that
  641. keep track of the current state
    of this queue.
  642. There is a variable called
    first, which points to the head
  643. of the queue,
    and there's a variable called
  644. last, which first points to the
    first used element in this
  645. queue, and last points to the
    last used element.
  646. So, the elements that are in
    use in the queue at any one time
  647. are between first and last,
    OK?
  648. And, what's going to happen is
    when we de-queue,
  649. we're going to sort of move
    first over one,
  650. right?
    So, when we de-queue something,
  651. we'll free up this cell.
    And when we enqueue,
  652. we'll move last down one.
    And then, when last reaches the
  653. end, we are going to wrap it
    around.
  654. So this is a fairly standard
    implementation of the queue.
  655. It's called the circular
    buffer.
  656. And if first is equal to last,
    then we know that the queue is
  657. full.
    So that's the condition that we
  658. can check.
    So we are not going to go into
  659. too many details about how this
    thing is actually implemented.
  660. But let's look at a very simple
    example of how de-queue might
  661. work.
    So remember we have these two
  662. shared variables first and last
    that are shared between these,
  663. say, all these threads that are
    accessing this thing.
  664. And what de-queue might do is
    to say it's going to read a
  665. block, read a page from this
    queue, so read the next HTML
  666. page to output,
    and it's going to read that
  667. into a local variable called
    page.
  668. Let's call this queue buf,
    B-U-F, and we'll use
  669. array notation for accessing
    it.
  670. So it's going to read buf[first],
    OK, and then it's going
  671. to increment first.
    First gets first plus one,
  672. and then it's going to return
    page.
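
Spelled out in C (the queue length and the wrap-around with % are invented details; the board just says first gets first plus one), the unsynchronized dequeue and the bad interleaving look like this:

    #define QSIZE 16

    static int buf[QSIZE];            /* the circular buffer of finished pages      */
    static int first = 0;             /* index of the first used element            */
    static int last  = 0;             /* one past the last used element; advanced
                                         by enqueue(), which is not shown here      */

    /* NOT safe if two network threads call it at once. */
    static int dequeue(void) {
        int page = buf[first];        /* (1) read the next page                     */
        first = (first + 1) % QSIZE;  /* (2) advance the head pointer               */
        return page;
    }

    /* A bad interleaving with two network threads T1 and T2:
     *   - T1 runs (1) and reads buf[first];
     *   - the scheduler preempts T1, and T2 runs (1) and (2);
     *     T2 has now read the very same page;
     *   - T1 resumes and runs (2), so first has advanced twice:
     *     one element of the queue is skipped and never sent.   */
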
  673. OK, that seems like a
    straightforward implementation
  674. of de-queue.
    And so we have one thread
  675. that's doing this.
    Now, suppose we have another
  676. thread that's doing exactly the
    same thing at the same time.
  677. So it runs exactly the same
    code.
  678. And remember that these two
    threads are sharing the
  679. variables buf and first.
  680. OK, so if you think about this
    and think about how these two
  681. things run as two threads at
    the same time,
  682. there is sort of an interesting
    problem that can arise.
  683. So one thing that might happen
    when we are running these two
  684. things at the same time is that
    the thread scheduler might first
  685. start running thread one.
    And it might run the first
  686. instruction of thread one.
    And then it might run the
  687. second instruction.
    And then it might run this
  688. return thing.
    And that might come over here,
  689. and it might start running T2.
    So, it might,
  690. then, stop running T1 and start
    running T2, and execute its
  691. three instructions.
    So if the thread scheduler does
  692. this, there's nothing wrong.
    It's not a problem,
  693. right?
    The thread scheduler,
  694. each of these things read its
    value from the queue and
  695. incremented it.
    T1 read one thing from the
  696. queue, and then T2 read the next
    thing from the queue.
  697. So clearly some of the time
    this is going to work fine.
  698. So let's make a list of
    possible outcomes.
  699. Sometimes we'll be OK.
    The first possible outcome was
  700. OK.
    But let's look at a different
  701. situation.
    Suppose what happens is that
  702. the first thing the thread
    scheduler does is schedule T1.
  703. And T1 executes this first
    instruction, and then just after
  704. that the thread scheduler
    decides to preempt T1,
  705. and allow T2 to start running.
    So it in particular allows T2
  706. to execute this de-queue
    instruction to its end,
  707. and then it comes over here and
    it runs T1.
  708. OK, so what's the problem now?
  709. Yeah?
  710. Right, OK, so they both read in
    the same page variable.
  711. So now both of these threads
    have de-queued the same page.
  712. So the value first,
    for T1, it was pointing here.
  713. And then we switched.
    And it was still pointing here,
  714. right?
    And now, so both of these guys
  715. have read the same page.
    And now they are both at some
  716. point going to increment first.
    So you're going to increment it
  717. once.
    Then you're going to increment
  718. it again.
    So this second element here in
  719. the queue has been skipped.
    OK, so this is a problem.
  720. We don't want this to happen.
    Because the system is not
  721. outputting all the pages that it
    was supposed to output.
  722. So what can we do to fix this?
  723. So the way that we fixed this
    is by introducing something we
  724. call isolation primitives.
  725. And the basic idea is that we
    want to introduce an operation
  726. that will make it so that any
    time that the page variable gets
  727. read out of the queue,
    that we also at the same time
  728. increment first without any
    other sort of threads' accesses
  729. to this queue being interleaved
    with our accesses to this queue,
  730. or our de-queues from the
    queue.
  731. So in sort of technical terms,
    what we say is we want these
  732. two things, the reading of page
    and the incrementing of first to
  733. be so-called atomic.
    OK, and the way that we're going
  734. to make these things atomic is
    by isolating them from each
  735. other, that by isolating these
    two threads from each other when
  736. they are executing the enqueue
    and de-queue things.
  737. So, these two terms we're going
    to come back to in a few months,
  738. in a class towards the end of
    the term.
  739. But all you need to understand
    here is that there is this race
  740. condition, and we want some way
    to prevent it.
  741. And the way that we're going to
    prevent it is by using these
  742. isolation routines also
    sometimes called locks.
  743. So in this case,
    the isolation schemes are going
  744. to be called locks.
    So the idea is that a lock is
  745. simply a variable,
    which can be in one of two
  746. states.
    It can either be set or unset.
  747. And we have two operations that
    we can apply on a lock.
  748. We can acquire it,
    and we can release it.
  749. OK, and acquire and release
    have the following behavior.
  750. What acquire says is check the
    state of the lock,
  751. and if the lock is unset,
    then change the state to set.
  752. But if the lock is set,
    then wait until the lock
  753. becomes unset.
    What a release says is it
  754. simply says change the state of
    the lock from set to unset.
  756. So let's see how we can use
    these two routines in our code.
  757. So let's go back to our example
    of enqueue and de-queue.
  758. Let's introduce a lock
    variable.
  759. We'll call it TL for thread
    lock.
  760. And, what we're going to do is
    simply around these two
  761. operations to access the queue,
    to modify this page and first,
  762. to read the page and modify
    first, we're going to put in an
  763. acquire and a release.
  764. OK so we have ACQ on this
    thread lock, and we have release
  765. on this thread lock.
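
With the lock in place, the critical section might look like the following sketch; a pthread mutex is used here only as a stand-in for the acquire and release operations being described:

    #include <pthread.h>

    #define QSIZE 16

    static int buf[QSIZE];
    static int first = 0;

    /* "tl", the thread lock from the board; a pthread mutex stands in for the
     * acquire/release lock described in lecture. */
    static pthread_mutex_t tl = PTHREAD_MUTEX_INITIALIZER;

    static int dequeue(void) {
        pthread_mutex_lock(&tl);           /* acquire(tl)                          */
        int page = buf[first];             /* the read and the increment now       */
        first = (first + 1) % QSIZE;       /* happen with no other thread between  */
        pthread_mutex_unlock(&tl);         /* release(tl)                          */
        return page;
    }
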
    OK, so let's look,
  766. so this seems fine.
    It looks like we've done this.
  767. But it's sort of positing the
    existence of this acquire
  768. procedure that just sort of does
    the right thing.
  769. If you think about this for a
    minute, it seems like we can
  770. have the same race condition
    problem in the acquire module as
  771. well, right, or the acquire
    function as well.
  772. What if two guys both try to
    acquire the lock at the same
  773. time?
    How are we going to avoid this
  774. problem?
    And there's a couple of ways
  775. that are sort of well understood
    for avoiding this problem in
  776. practice, and they're talked about
    in the book.
  777. I'm just going to introduce the
    simplest of them now,
  778. which is that we're going to
    add a special instruction to the
  779. microprocessor that allows us to
    do this acquire efficiently.
  780. It turns out that most modern
    microprocessors have an
  781. equivalent instruction.
    So we're going to call this
  782. instruction RSL for read and set
    lock.
  783. OK, so the idea with RSL is as
    follows.
  784. We can basically,
    the implementation of the
  785. acquire module is going to be
    like this.
  786. What it's going to do,
    remember, is loop while
  787. we don't have the lock.
  788. If we don't have the lock,
    we want to loop until we've
  789. acquired the lock.
    So the implementation of acquire
  790. may look as follows.
    We'll have a local variable
  791. called held.
    We'll initially set it to false,
  792. and in a while loop, while we
    don't hold the lock,
  793. we're going to use this RSL
    instruction.
  794. So, what this says is held
    equals RSL of TL,
  795. OK?
    So, what the RSL instruction
  796. does is it looks at the state of
    the lock, and if the lock is
  797. unset, then it sets it
  798. and it returns
    true.
  799. And if the lock is set,
    then it returns false.
  800. So it has the property that it
    can both read and set the lock
  801. within a single instruction,
    right?
  802. And we're going to use this
    read and set lock sort of
  803. primitive, this basic thing,
    as a way to build up this sort
  804. of more complicated acquire
    function, which we can then use
  805. to build up these locks.
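
RSL is the name used on the board for a read-and-set-lock instruction, not a real mnemonic you can write directly in C; C11's atomic_flag_test_and_set has the same read-and-set-in-one-step behavior, so a spin-lock acquire built on it might look like this sketch (note the sense is flipped: test-and-set returns true when the lock was already held):

    #include <stdatomic.h>

    static atomic_flag tl = ATOMIC_FLAG_INIT;        /* clear = the lock is unset  */

    /* Spin until we are the thread that changed the flag from unset to set.
     * atomic_flag_test_and_set reads the old value and sets the flag in one
     * indivisible step, which removes the race a separate
     * "if (!locked) locked = 1;" would have. */
    static void acquire(atomic_flag *lock) {
        int held = 0;
        while (!held)
            held = !atomic_flag_test_and_set(lock);  /* true only if it was unset  */
    }

    static void release(atomic_flag *lock) {
        atomic_flag_clear(lock);                     /* back to unset              */
    }
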
    OK, so anytime you're designing
  806. a multithreaded system in this
    way, or a system with lots of
  807. concurrency, you should be
    worrying about whether you have
  808. race conditions.
    And if you have race
  809. conditions, you need to think
    about how to use locks in order
  810. to prevent those race
    conditions.
  811. All right, so there are a
    couple of other topics related
  812. to performance that appear in
    the text.
  813. And one of those topics is
    caching.
  814. And I just want to spend one
    very brief minute on caching.
  815. So you guys have already seen
    caching presumably in the
  816. context of 6.004 with processor
    caches.
  817. And what you might want to do
    is sit down
  818. and think through, as an example,
    how you would use a cache
  819. to improve the performance of
    our Web server.
  820. So one thing that you might do
    in order to improve the
  821. performance of the Web server is
    to put a cache in the disk thread
  822. that you use instead of going to
    disk in order to sort of reduce
  823. the latency of a disk access.
    And I'll at the beginning of
  824. class next time take you through
    a very simple example of how we
  825. can actually use the disk thread
    in order to do that.
  826. But you guys should think about
    this a little bit on your own.
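
As a starting point for that exercise, here is one possible sketch, with the cache geometry and the read_block_from_disk name invented for illustration: the disk thread keeps a small table of blocks it has already read and consults it before going to disk.

    #include <string.h>

    #define CACHE_SLOTS 128
    #define BLOCK_SIZE  512

    /* A tiny direct-mapped cache of disk blocks, purely for illustration. */
    static struct {
        int  valid;
        long block_no;
        char data[BLOCK_SIZE];
    } cache[CACHE_SLOTS];

    /* Assumed to exist elsewhere: the real, slow (~10 ms) disk read. */
    void read_block_from_disk(long block_no, char *out);

    /* Read a block, going to disk only on a cache miss. */
    void cached_read(long block_no, char *out) {
        int slot = (int)(block_no % CACHE_SLOTS);
        if (cache[slot].valid && cache[slot].block_no == block_no) {
            memcpy(out, cache[slot].data, BLOCK_SIZE);     /* hit: no disk I/O     */
            return;
        }
        read_block_from_disk(block_no, cache[slot].data);  /* miss: slow path      */
        cache[slot].valid    = 1;
        cache[slot].block_no = block_no;
        memcpy(out, cache[slot].data, BLOCK_SIZE);
    }
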
  827. So barring that little
    digression that we'll have next
  828. time, this takes us to the end
    of our discussion of sort of
  829. modularity, abstraction,
    and performance.
  830. And what we're going to start
    talking about next time is
  831. networking, and how networks
    work.
  832. But I want you guys to make
    sure you keep in mind all these
  833. topics that we've talked about
    because these are going to be
  834. the sort of fundamental tools
    that we are going to use
  835. throughout the class in the
    design of computer systems.
  836. So because we've finished this
    module, it doesn't mean that
  837. it's sort of OK to stop thinking
    about this stuff.
  838. You need to keep all of this in
    mind at the same time.
  839. So we'll see you all on
    Wednesday.