
Transpose Code Example Part3 - Intro to Parallel Programming

  • 0:00 - 0:02
    So let's keep track of what we're doing.
  • 0:02 - 0:05
    Version 1 is serial in a single thread,
  • 0:05 - 0:08
    and it took about 466 milliseconds.
  • 0:08 - 0:11
    So 466 milliseconds is pretty slow,
  • 0:11 - 0:13
    but sometimes that's okay.
  • 0:13 - 0:15
    For code that's only going to be executed once,
  • 0:15 - 0:19
    code that's not performance critical at all, or code that's going to run on a really small data set,
  • 0:19 - 0:24
    like that 8 by 8 matrix that we started with, it's just not worthwhile to optimize the heck out of this.
  • 0:24 - 0:27
    So even though this simple serial kernel may seem very naive,
  • 0:27 - 0:30
    that's really sometimes the right thing to do,
  • 0:30 - 0:32
    so keep this in mind when you're optimizing.
  • 0:32 - 0:35
    Think about what you need to optimize, whether it's important.
  • 0:35 - 0:42
    Now let's assume that in fact, performance is critical on this section, and that's why we're optimizing it,
  • 0:42 - 0:44
    and let's go back to the code and see what we can do.
  • 0:44 - 0:49
    Now one easy step would be to launch 1 thread for each row of the input, okay?
  • 0:49 - 0:51
    So now here's the code that does that.
  • 0:51 - 0:55
    In this code, we're going to launch 1 thread per row of the output matrix.
  • 0:55 - 1:00
    So the value of i is going to be fixed by the thread ID,
  • 1:00 - 1:02
    and every thread is going to execute
  • 1:02 - 1:06
    just the outer loop of this code we saw before,
  • 1:06 - 1:11
    and the inner loop we're essentially handing off to be run across many different threads.
  • 1:11 - 1:14
    So these 2 codes are almost identical,
  • 1:14 - 1:19
    and the only difference is that we're launching threads instead of looping over values of i.
  • 1:19 - 1:21
    Let's time this.
  • 1:21 - 1:24
    Okay so here's the code for calling our new function.
  • 1:24 - 1:27
    We're going to launch the function transpose parallel per row as a kernel.
  • 1:27 - 1:30
    We're going to launch a single thread block consisting of n threads.
  • 1:30 - 1:34
    Remember, n is the size of our matrix, 1,024, currently.
  • 1:34 - 1:38
    We're going to pass in the input matrix, get the output matrix back, copy it to the host,
  • 1:38 - 1:41
    and then we're going to print out the timing and verify it.
  • 1:41 - 1:43
    Let's compile and run this code.
  • 1:43 - 1:45
    Okay.
  • 1:45 - 1:48
    Transpose serial ran in 484 milliseconds, roughly what we saw before.
  • 1:48 - 1:52
    Transpose parallel per row is running in 4.7 milliseconds.
  • 1:52 - 1:55
    So obviously we're making a huge improvement
  • 1:55 - 1:59
    by parallelizing this just across the threads of a single thread block.
  • 1:59 - 2:01
    So let's note that down:
  • 2:01 - 2:05
    4.7 milliseconds, roughly a 100x improvement.
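
The two versions compared above can be sketched roughly as follows. This is a minimal illustration, not the course's actual source. The kernel name transpose_parallel_per_row and the launch of a single block of N threads follow the narration. The indexing expression, the is_transposed check, the test data, and the CUDA event timing are assumptions added so the sketch can be compiled with nvcc. Measured times depend on the GPU and will not necessarily match the figures quoted in the video.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

const int N = 1024;  // the matrix is N x N, as stated in the lecture

// Version 1: a single thread walks the whole matrix.
__global__ void transpose_serial(const float *in, float *out)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[j + i * N] = in[i + j * N];  // assumed indexing
}

// Version 2: the value of i is fixed by the thread ID, so each thread
// runs only the loop over j. The loop over i is replaced by the threads.
__global__ void transpose_parallel_per_row(const float *in, float *out)
{
    int i = threadIdx.x;
    for (int j = 0; j < N; j++)
        out[j + i * N] = in[i + j * N];
}

// Hypothetical host-side check that out is the transpose of in.
bool is_transposed(const float *in, const float *out)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            if (out[j + i * N] != in[i + j * N]) return false;
    return true;
}

int main()
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *h_in  = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int k = 0; k < N * N; k++) h_in[k] = (float)k;  // arbitrary test data

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Version 1: one block containing one thread.
    cudaEventRecord(start);
    transpose_serial<<<1, 1>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("transpose_serial: %g ms (%s)\n", ms,
           is_transposed(h_in, h_out) ? "verified" : "FAILED");

    // Version 2: one block containing N threads.
    cudaMemset(d_out, 0, bytes);  // clear the previous result
    cudaEventRecord(start);
    transpose_parallel_per_row<<<1, N>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("transpose_parallel_per_row: %g ms (%s)\n", ms,
           is_transposed(h_in, h_out) ? "verified" : "FAILED");

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return 0;
}
```

Launching N = 1,024 threads in one block works here because 1,024 is the per-block thread limit on current NVIDIA GPUs. A larger matrix would need more than one block under this scheme.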
Video Language:
English
Team:
Udacity
Project:
CS344 - Intro to Parallel Programming
Duration:
02:05