# CSE113: Parallel Programming June 1, 2021

- Topic: GPUs 2
  - Review last week optimizations
  - Continue optimizing and have the final round of CPU vs. GPU!
  - Overview of advanced GPU topics

| Instruction Buffer              |      |      |               |       |     |  |  |  |  |
|---------------------------------|------|------|---------------|-------|-----|--|--|--|--|
| Warp Scheduler                  |      |      |               |       |     |  |  |  |  |
| Dispatch Unit                   |      |      | Dispatch Unit |       |     |  |  |  |  |
| Register File (16,384 x 32-bit) |      |      |               |       |     |  |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |  |

https://www.techpowerup.com/gpu-specs/docs/nvidia-gtx-980.pdf

### Announcements

- HW2 grades posted
  - Talk to us in 1 week if you have questions/issues
- We plan to have HW3 done in ~1 week.
- HW4 is out
  - Please try not to be late on this one!
  - Due on Monday, June 7
  - There is no guarantee that we will check Piazza on the weekend
  - Joint office hours on Wednesday

### Few hints on HW4

- If you are having trouble observing relaxed behaviors:
  - Try closing ALL applications
  - Try running natively (e.g. not in Docker)
  - Try running on the unix timeshare
    - Compilation on the time share works
    - I (and other students) have been able to get relaxed behaviors observations
    - run "top" and "who" to make sure the machine is not being heavily used

### Announcements

- SETs are out
  - Please do them!
- Final:
  - Wendesday June 9.
  - You have 1 day (Released midnight June 8, due midnight June 9)
  - If you want to budget time: 4pm 7pm is our allotted time
  - Plan on duration similar to midterm
  - We will be monitoring private piazza posts and emails for clarification questions
  - Late finals will not be accepted!

### Announcements

- The rest of the quarter:
  - 1 lectures about GPUs
  - 1 lecture about distributed computing
- If you are interested in GPU programming:
  - CUDA by example is a great book!
  - Linked to in the course material
  - IF you are interested and IF you do not have an Nvidia GPU, message the teach mailing list and we can try to find (limited) resources on campus

### Quiz

## Quiz

• Go over answers

## Schedule

- Review previous optimizations
- New optimizations
- Advanced GPU topics

### Schedule

- Review previous optimizations
- New optimizations
- Advanced GPU topics

### Round 2

The GPU in my PhD laptop



Nvidia 940m 1.8 Billion transistors 33 TDP Est. \$130 Fight!



The CPU in my professor workstation



Intel i7-9700K 2.16 Billion transistors 95 TDP Est. \$316

https://www.techpowerup.com/gpu-specs/geforce-940m.c2648 https://www.alibaba.com/product-detail/Intel-Core-i7-9700K-8-Cores\_62512430487.html https://www.prolast.com/prolast-elevated-boxing-rings-22-x-22/

## Round 2

The GPU in my PhD laptop



Nvidia 940m 1.8 Billion transistors 33 TDP Est. \$130 Fight!



The CPU in my professor workstation

Where were we?



Intel i7-9700K 2.16 Billion transistors 95 TDP Est. \$316

https://www.techpowerup.com/gpu-specs/geforce-940m.c2648 https://www.alibaba.com/product-detail/Intel-Core-i7-9700K-8-Cores\_62512430487.html https://www.prolast.com/prolast-elevated-boxing-rings-22-x-22/

# Programming a GPU

• The problem: Vector addition

# Embarrassingly parallel



# Programming a GPU

- The problem: Vector addition
- Who can do it faster?

### Lets set up the CPU

- CPU code
- Why do we access memory like this?

### GPU code

• Review:

### GPU set up

• Our heterogeneous, parallel, programming model



# The GPU Program

- Write a special function in your C++ code.
  - Called a Kernel
  - Use the new keyword \_\_\_global\_\_\_
  - Keywords in
    - OpenCL \_\_kernel
    - Metal kernel
- Write it how you'd write any other function

### The GPU Program

```
__global___ void vector_add(int * a, int * b, int * c, int size) {
   for (int i = 0; i < size; i++) {
      a[i] = b[i] + c[i];
   }
}</pre>
```

calling the function

vector\_add<<<1,1>>>(d\_a, d\_b, d\_c, size);

#### The GPU hardware



https://www.techpowerup.com/gpu-specs/docs/nvidia-gtx-980.pdf

### First parallelization attempt

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

vector\_add<<<1,32>>>(d\_a, d\_b, d\_c, size);

number of threads thread id

# GPU Memory

**CPU Memory:** 

Fast: Low Latency Easily saturated: Low Bandwidth Scales well: up to 1 TB DDR

2-lane straight highway driven on by sports cars



Different technologies

16-lane highway on a windy road driven by semi trucks

## GPU Memory

bandwidth: ~700 GB/s for GPU ~50 GB/s for CPUs

memory Latency:~600 cycles for GPU memory~200 cycles for CPU memory

Cache Latency: ~**28** cycles for L1 hit for GPU ~**4** cycles for L1 hit on CPUs



















Hey, my memory has arrived!





But wait, I thought preemption was expensive?

Registers all stay on chip

| Instruction Buffer              |      |      |               |       |     |  |  |  |
|---------------------------------|------|------|---------------|-------|-----|--|--|--|
| Warp Scheduler                  |      |      |               |       |     |  |  |  |
| Dispatch Unit                   |      |      | Dispatch Unit |       |     |  |  |  |
| Register File (16,384 x 32-bit) |      |      |               |       |     |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core | Core | Core          | LD/ST | SFU |  |  |  |

But wait, I thought preemption was expensive? dedicated scheduler logic



But wait, I thought preemption was expensive?

bound on number of warps: 32

### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

Lets launch with 32 warps

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```
# Go back to our program

• What performance were we at?

# Schedule

- Review previous optimizations
- New optimizations
- Advanced GPU topics

# Schedule

- Review previous optimizations
- New optimizations
- Advanced GPU topics

# Optimizing memory accesses

| Instruction Buffer              |      |      |               |       |     |
|---------------------------------|------|------|---------------|-------|-----|
| Warp Scheduler                  |      |      |               |       |     |
| Dispatch Unit                   |      |      | Dispatch Unit |       |     |
| Register File (16,384 x 32-bit) |      |      |               |       |     |
| Core                            | Core | Core | Core          | LD/ST | SFU |
| Core                            | Core | Core | Core          | LD/ST | SFU |
| Core                            | Core | Core | Core          | LD/ST | SFU |
| Core                            | Core | Core | Core          | LD/ST | SFU |
| Core                            | Core | Core | Core          | LD/ST | SFU |
| Core                            | Core | Core | Core          | LD/ST | SFU |
| Core                            | Core | Core | Core          | LD/ST | SFU |
| Core                            | Core | Core | Core          | LD/ST | SFU |

# Optimizing memory accesses



this is the load/store unit. The hardware component responsible for issuing loads and stores.

Why doesn't every core have one?

### Optimizing memory accesses



This is the instruction cache... Why doesn't every core have a instruction buffer to keep track of its program?

this is the load/store unit. The hardware component responsible for issuing loads and stores.

Why doesn't every core have one?



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

instruction is fetched from the buffer and distributed to all the cores.

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

Cores can a large register file they share expensive HW units (load/store and special functions)

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

All cores need to wait until all cores finish the first instruction

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

Start the next instruction.

Program: int variable1 = b[0]; int variable2 = c[0]; int variable3 = variable1 + variable2; a[0] = variable3;

Why would we have a programming model like this?



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

Start the next instruction.

Program: int variable1 = b[0]; int variable2 = c[0]; int variable3 = variable1 + variable2; a[0] = variable3;

Why would we have a programming model like this? More cores (share program counters) Can be efficient to share other hardware resources



#### Lets look closer at memory

#### Program:

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```

4 cores are accessing memory. what happens if they access the same value?

**GPU Memory** 

### Load Store Unit



All read the same value

**GPU Memory** 

### Load Store Unit



### All read the same value

This is efficient: the load store unit can ask for the value and then broadcast it to all cores.



### All read the same value

This is efficient: the load store unit can ask for the value and then broadcast it to all cores.

1 request to GPU memory

Efficient, but probably not too common.



Read contiguous values

GPU Memory

### Load Store Unit



**Read contiguous values** Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes **GPU Memory** 

### Load Store Unit



**Read contiguous values** Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes





**Read contiguous values** Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes

Can easily distribute the values to the threads





**Read contiguous values** Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes

Can easily distribute the values to the threads

1 request to GPU memory



Read non-contiguous values

Not good!

Accesses are Serialized. You need 4 requests to GPU memory

### **GPU Memory**

### Load Store Unit



Read non-contiguous values

Not good!



Read non-contiguous values

Not good!



Read non-contiguous values

Not good!



Read non-contiguous values

Not good!



### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```

### Chunked Pattern



array a

## Chunked Pattern

the first element accessed by the 4 threads sharing a load store queue. What sort of access is this?



array a

## Chunked Pattern

the first element accessed by the 4 threads sharing a load store queue. What sort of access is this?



array a

How can we fix this

Read non-contiguous values

Not good!

Accesses are Serialized. You need 4 requests to GPU memory

### **GPU Memory**

### Load Store Unit



Read non-contiguous values

Not good!

Accesses are Serialized. You need 4 requests to GPU memory

### **GPU Memory**

### Load Store Unit



Read non-contiguous values

Not good!


4 cores are accessing memory. What can happen

Read non-contiguous values

Not good!

Accesses are Serialized. You need 4 requests to GPU memory



4 cores are accessing memory. What can happen

Read non-contiguous values

Not good!

Accesses are Serialized. You need 4 requests to GPU memory



4 cores are accessing memory. What can happen

Read non-contiguous values

Not good!

Accesses are Serialized. You need 4 requests to GPU memory



## Chunked Pattern

the first element accessed by the 4 threads sharing a load store queue. What sort of access is this?



array a

How can we fix this

## Stride Pattern



array a

#### What sort of pattern is this?

### Stride Pattern



array a

```
__global__ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

*Lets change this to a stride pattern* 

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   for (int i = threadIdx.x; i < size; i+=blockDim.x) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

calling the function

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```

#### What sort of pattern is this?

### Stride Pattern



array a

## Coalesced memory accesses

Lets try it! What do we think?

## Coalesced memory accesses

Lets try it! What do we think?



What else can we do?

## Good time for a break

• 5 minute break

# Multiple streaming multiprocessors

We've been talking only about 1 streaming multiprocessor, most GPUs have multiple SMs big ML GPUs have 32. My little GPU has 4

|                | Instruction Buffer |            |          |            |      |  |  |  |  |
|----------------|--------------------|------------|----------|------------|------|--|--|--|--|
| Warp Scheduler |                    |            |          |            |      |  |  |  |  |
| Di             | spatch Uni         | it         | C        | ispatch Un | it   |  |  |  |  |
|                | <b>•</b>           |            |          | •          |      |  |  |  |  |
|                | Regist             | er File (' | 16,384 x | 32-bit)    |      |  |  |  |  |
| Coro           | Coro               | Coro       | Core     | I D/ST     | SEII |  |  |  |  |
| Core           | Core               | Core       | Core     | LD/ST      | SFU  |  |  |  |  |
| Core           | Core               | Core       | Core     | LD/ST      | SFU  |  |  |  |  |
| Core           | Core               | Core       | Core     | LD/ST      | SFU  |  |  |  |  |
| Core           | Core               | Core       | Core     | LD/ST      | SFU  |  |  |  |  |
| Core           | Core               | Core       | Core     | LD/ST      | SFU  |  |  |  |  |
| Core           | Core               | Core       | Core     | LD/ST      | SFU  |  |  |  |  |
| Core           | Core               | Core       | Core     | LD/ST      | SFU  |  |  |  |  |
| Core           | Core               | Core       | Core     | LD/ST      | SFU  |  |  |  |  |

# Multiple streaming multiprocessors

We've been talking only about 1 streaming multiprocessor, most GPUs have multiple SMs big ML GPUs have 32. My little GPU has 4

|      | h              | nstructi   | on Buffe | er             |     |      | h                                                                                                   | nstructio | on Buffe | er          |                      |      | h          | nstructio      | on Buffe | r          |     |               | Ir   | structio | on Buffe      | r     |     |
|------|----------------|------------|----------|----------------|-----|------|-----------------------------------------------------------------------------------------------------|-----------|----------|-------------|----------------------|------|------------|----------------|----------|------------|-----|---------------|------|----------|---------------|-------|-----|
|      | Warp Scheduler |            |          | Warp Scheduler |     |      |                                                                                                     |           |          | Warp So     | heduler              |      |            | Warp Scheduler |          |            |     |               |      |          |               |       |     |
| D    | ispatch Uni    | it         | [        | Dispatch Ur    | nit | Dis  | spatch Uni                                                                                          | t         | [        | )ispatch Ur | it                   | Di   | spatch Uni | it             | [        | ispatch Ur | nit | Dispatch Unit |      | t        | Dispatch Unit |       | it  |
|      | Regist         | er File (' | 16,384 x | 32-bit)        |     |      | Register File (16,384 x 32-bit)   Register File (16,384 x 32-bit)   Register File (16,384 x 32-bit) |           |          | er File (1  | le (16,384 x 32-bit) |      |            |                |          |            |     |               |      |          |               |       |     |
| Core | Core           | Core       | Core     | LD/ST          | SFU | Core | Core                                                                                                | Core      | Core     | LD/ST       | SFU                  | Core | Core       | Core           | Core     | LD/ST      | SFU | Core          | Core | Core     | Core          | LD/ST | SFU |
| Core | Core           | Core       | Core     | LD/ST          | SFU | Core | Core                                                                                                | Core      | Core     | LD/ST       | SFU                  | Core | Core       | Core           | Core     | LD/ST      | SFU | Core          | Core | Core     | Core          | LD/ST | SFU |
| Core | Core           | Core       | Core     | LD/ST          | SFU | Core | Core                                                                                                | Core      | Core     | LD/ST       | SFU                  | Core | Core       | Core           | Core     | LD/ST      | SFU | Core          | Core | Core     | Core          | LD/ST | SFU |
| Core | Core           | Core       | Core     | LD/ST          | SFU | Core | Core                                                                                                | Core      | Core     | LD/ST       | SFU                  | Core | Core       | Core           | Core     | LD/ST      | SFU | Core          | Core | Core     | Core          | LD/ST | SFU |
| Core | Core           | Core       | Core     | LD/ST          | SFU | Core | Core                                                                                                | Core      | Core     | LD/ST       | SFU                  | Core | Core       | Core           | Core     | LD/ST      | SFU | Core          | Core | Core     | Core          | LD/ST | SFU |
| Core | Core           | Core       | Core     | LD/ST          | SFU | Core | Core                                                                                                | Core      | Core     | LD/ST       | SFU                  | Core | Core       | Core           | Core     | LD/ST      | SFU | Core          | Core | Core     | Core          | LD/ST | SFU |
| Core | Core           | Core       | Core     | LD/ST          | SFU | Core | Core                                                                                                | Core      | Core     | LD/ST       | SFU                  | Core | Core       | Core           | Core     | LD/ST      | SFU | Core          | Core | Core     | Core          | LD/ST | SFU |
| Core | Core           | Core       | Core     | LD/ST          | SFU | Core | Core                                                                                                | Core      | Core     | LD/ST       | SFU                  | Core | Core       | Core           | Core     | LD/ST      | SFU | Core          | Core | Core     | Core          | LD/ST | SFU |

# Multiple streaming multiprocessors

CUDA provides virtual streaming multiprocessors called **blocks** 

Very efficient at launching and joining blocks.

practically no limit on blocks: launch as many as you need to map 1 thread to 1 data element

| Buffer           Warp Scheduler           Orgadized Unit           Orgadized Unit           Core Core Core Core Core Core Core Core           Core Core Core Core Core Core Core Core                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | Instruction Buffer Warp Scheduler Disports Unit Register File (16,384 x 32-bit) Core Core Core Core Lost SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | Instruction Buffer           Ward Scheduler           Dispatch Unit         Dispatch Unit           Dispatch Unit         Dispatch Unit           Register File (16,364 x 32-bit)           Core         Core         Unit           Core         Core <th>Instruction Buffer           Were Scheduler           Urgedot Unit           Dispatch Unit           Register F10 (16.384 x 32-bit)           Core         Core         Structure           Core         Core         Lott         Structure</th> <th>Instruction Buffer<br/>Warp Scheduler<br/>Dispatch Unit<br/>Register File (16,384 x 32-bit)<br/>Core Core Core Core Lost SFU<br/>Core Core Core Core Lost SFU<br/>Core Core Core Core Lost SFU<br/>Core Core Core Core Stat SFU</th> <th>Instruction Buffer           Wars Scheduler           Dispatch Unit         Dispatch Unit           Register File (16,384 × 32-bit)           Core         Core         Structure           Core         Core         Core         Structure</th> <th>Instruction Buffer           Useable           Disable Unit           Disable Unit           Rogister File (16,384 x 32-bit)           Core         Core         SFU           Core         Core         SFU           Core         Core         Core         SFU           Core         Core         Core         SFU           Core         Core         Core         SFU           Core         Core         Core         SFU</th> <th>Instruction Buffer           Unspace Unit           Dispace Unit           Dispace Unit           Dispace Unit           Core         Core         Dispace Unit           Core         Core         Core         Dispace Unit         SPEC           Core         Core         Core         Dispace Unit         SPEC           Core         Core         Core         Core         Dispace Unit         SPEC           Core         Core         Core         Core         Core         Core         Core         Core         Core         <th c<="" th=""><th>Instruction Buffer           Warp Scheduler           Disearts from           Rogister File (16,394 x 32-bit)           Core         Core</th></th></th> | Instruction Buffer           Were Scheduler           Urgedot Unit           Dispatch Unit           Register F10 (16.384 x 32-bit)           Core         Core         Structure           Core         Core         Lott         Structure                                                                                                                                                                                                                                                                                                                                       | Instruction Buffer<br>Warp Scheduler<br>Dispatch Unit<br>Register File (16,384 x 32-bit)<br>Core Core Core Core Lost SFU<br>Core Core Core Core Lost SFU<br>Core Core Core Core Lost SFU<br>Core Core Core Core Stat SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | Instruction Buffer           Wars Scheduler           Dispatch Unit         Dispatch Unit           Register File (16,384 × 32-bit)           Core         Core         Structure           Core         Core         Core         Structure                                                                                                                                                                                                                                                                                                                                                | Instruction Buffer           Useable           Disable Unit           Disable Unit           Rogister File (16,384 x 32-bit)           Core         Core         SFU           Core         Core         SFU           Core         Core         Core         SFU           Core         Core         Core         SFU           Core         Core         Core         SFU           Core         Core         Core         SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | Instruction Buffer           Unspace Unit           Dispace Unit           Dispace Unit           Dispace Unit           Core         Core         Dispace Unit           Core         Core         Core         Dispace Unit         SPEC           Core         Core         Core         Dispace Unit         SPEC           Core         Core         Core         Core         Dispace Unit         SPEC           Core         Core         Core         Core         Core         Core         Core         Core         Core <th c<="" th=""><th>Instruction Buffer           Warp Scheduler           Disearts from           Rogister File (16,394 x 32-bit)           Core         Core</th></th> | <th>Instruction Buffer           Warp Scheduler           Disearts from           Rogister File (16,394 x 32-bit)           Core         Core</th>                                                                                                                                                                      | Instruction Buffer           Warp Scheduler           Disearts from           Rogister File (16,394 x 32-bit)           Core         Core |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Core         Core         Core         LDRT         SFU           Core         Core         Core         DDRT         SFU           Core         Core         Core         LDRT         SFU           Core         Core         Core         LDRT         SFU           Core         Core         Core         LDRT         SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | Core     SFU       Core     Core     Core     Core     Core     Core     SFU       Core     Core     Core     Core     Core     Core     SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | Core         Core         Core         LOST         SFU           Core         Core         Core         Core         DOST         SFU           Core         Core         Core         Core         LDST         SFU           Core         Core         Core         Core         LDST         SFU           Core         Core         Core         Core         LDST         SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | Core     Core     Core     Core     Dot     SFU       Core     Core     Core     Core     Uo31     SFU       Core     Core     Core     Core     Lo31     SFU       Core     Core     Core     Core     Lo31     SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | Core     Core     Core     Core     SrU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | Core     Core     Core     Core     STU       Core     Core     Core     Core     BTU       Core     Core     Core     Core     STU       Core     Core     Core     Core     STU       Core     Core     Core     Core     STU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | Core     Core     Core     Core     FU       Core     Core     Core     Core     SFU       Core     Core     Core     Core     Loro     SFU       Core     Core     Core     Core     Loro     SFU       Core     Core     Core     Core     Loro     SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | Core     Core     Core     Core     Core     Core     Core     Core     Core     SFU       Core     Core     Core     Core     LDST     SFU       Core     Core     Core     Core     LDST     SFU       Core     Core     Core     LDST     SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | Core         Core         Core         Lost         SFU           Core         Core         Core         Core         Lost         SFU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| Opposite Using         Opposite Using           Register File (16,534 × 32-bit)         SPU           Core         Core         Lot         SPU           Core         Core         Core         Lot         SPU | Disparter UN         Disparter UN           Register File (16.384 x 32-bit)           Core         Core         Core         Core         SFU           Core         Core         Core         Core         SFU           Core         Core         Core         Core         SFU           Core         Core         Core         Core         Core         SFU | Doppedia billi         Doppedia billi           Register File (16,384 × 32-bit)         SF4           Core         Core         Lost         SF4           Core         Core         Lost         SF4           Core         Core         Lost         SF4           Core         Core         Lost         SF4           Core         Core         Core         Lost         SF4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | Dependent break         Dependent break           Register File (FLS.384 × 32-bit)         SFU           Com         Com         Com         Com         SFU | Depended Upent         Depended Upent           Register File (16,384 × 32-bit)           Core         Core         Core         SFU           Core         Core         Core         Seu         SFU           Core         Core         Core         Core         Seu         SFU | Dispat/or         Dispat/or           Rodistor File (16.384 × 32-bit)         SF0           Core         Core         Core         Core         Core         Core         Core         Core         SF0           Core         Core         Core         Core         Core         Core         Core         SF0           Core         SF0           Core         Core         Core         Core         Core         Core         SF0         SF0           Core         Core         Core         Core         Core         SF0         SF0 | Departed level         Departed level           Register         Searce         Searce           Register         Searce         Searce         Searce           Register         Searce         Searce         Searce         Searce           Register         Searce         Searce         Searce         Searce | Departer biol         Departer biol           Register biol         (16.384 × 32-bit)           Register biol         (16.784 × 32-bit)           Core         Core         Core         Core         Core           Core         Core         Core         Core         Core         Core           Core         Core         Core         Core         Core         Core         Core           Core         Core         Core         Core         Core         Core         Core           Core         Core         Core         Core         Core         Core         Core           Core         Core         Core         Core         Core         Core         Core           Core         Core         Core         Core         Core         Core         Core           Core         Core         Core         Core         Core         Core         Core           Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core         Core                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | Dependet Unit         Dependet Unit           Register File (16, 384 x 32-bit)           Core         Core         Core         SPU           Core         Core         Core         Core         Core         Core         SPU |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   for (int i = threadIdx.x; i < size; i+=blockDim.x) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

id within the block

calling the function

vector\_add<<<1,1024>>>(d\_a, d\_b, d\_c, size);

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   for (int i = threadIdx.x; i < size; i+=blockDim.x) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

id within the block

threads per block

calling the function

vector\_add<<<1,1024>>>(d\_a, d\_b, d\_c, size);

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   for (int i = threadIdx.x; i < size; i+=blockDim.x) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

calling the function

Launch with many thread blocks

vector\_add<<<1,1024>>>(d\_a, d\_b, d\_c, size);

```
__global___void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[i] = d_b[i] + d_c[i];
}
```

calling the function

Need to recalculate some thread ids.

Launch with many thread blocks

vector\_add<<<1024,1024>>>(d\_a, d\_b, d\_c, size);

Now we have 1 thread for each element

#define SIZE (1024\*1024)

## thread ids

#### threadIdx.x is thread id within a block

#### example threadIdx.x == 9

|      | Instruction Buffer |            |          |            |     |  |  |  |  |  |
|------|--------------------|------------|----------|------------|-----|--|--|--|--|--|
|      | Warp Scheduler     |            |          |            |     |  |  |  |  |  |
| Di   | spatch Uni         | it         | C        | ispatch Ur | nit |  |  |  |  |  |
|      | Regist             | er File (′ | 16,384 x | 32-bit)    |     |  |  |  |  |  |
| Core | Core               | Core       | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core | Core               | Core       | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core |                    | Core       | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core | Core               | Core       | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core | Core               | Core       | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core | Core               | Core       | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core | Core               | Core       | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core | Core               | Core       | Core     | LD/ST      | SFU |  |  |  |  |  |

blockIdx.x == 0

|                                 | h              | nstructio | on Buffe | r          |     |  |  |  |  |  |
|---------------------------------|----------------|-----------|----------|------------|-----|--|--|--|--|--|
|                                 | Warp Scheduler |           |          |            |     |  |  |  |  |  |
| Di                              | spatch Uni     | it        | C        | ispatch Un | it  |  |  |  |  |  |
| Register File (16,384 x 32-bit) |                |           |          |            |     |  |  |  |  |  |
| Core                            | Core           | Core      | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core           | Core      | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core                            |                | Core      | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core           | Core      | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core           | Core      | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core           | Core      | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core           | Core      | Core     | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core           | Core      | Core     | LD/ST      | SFU |  |  |  |  |  |

blockIdx.x == 1

|      | Instruction Buffer |            |         |            |      |  |  |  |  |  |
|------|--------------------|------------|---------|------------|------|--|--|--|--|--|
|      | Warp Scheduler     |            |         |            |      |  |  |  |  |  |
| Di   | spatch Uni         | it         | C       | ispatch Un | iit  |  |  |  |  |  |
|      | _ <b>_</b>         |            |         | •          |      |  |  |  |  |  |
|      | Regist             | er File (1 | 6,384 x | 32-bit)    |      |  |  |  |  |  |
|      |                    |            |         |            |      |  |  |  |  |  |
| Core | Core               | Core       | Core    | LD/ST      | SFU  |  |  |  |  |  |
| Core | Core               | Core       | Core    | LD/ST      | SFU  |  |  |  |  |  |
|      |                    |            |         |            |      |  |  |  |  |  |
| Core |                    | Core       | Core    | LD/ST      | SFU  |  |  |  |  |  |
| Core | Coro               | Coro       | Coro    | I D/ST     | SEII |  |  |  |  |  |
| COIE | Core               | Core       | Core    | 20/31      | GFU  |  |  |  |  |  |
| Core | Core               | Core       | Core    | LD/ST      | SFU  |  |  |  |  |  |
| Core | Core               | Core       | Core    | LD/ST      | SFU  |  |  |  |  |  |
|      |                    |            |         |            |      |  |  |  |  |  |
| Core | Core               | Core       | Core    | LD/ST      | SFU  |  |  |  |  |  |
| Core | Core               | Core       | Core    | LD/ST      | SFU  |  |  |  |  |  |

Instruction Buffer Warp Scheduler **Dispatch Unit Dispatch Unit** JL. Register File (16,384 x 32-bit) Core Core LD/ST SFU SFU Core LD/ST Core Core Core LD/ST SFU Core Core Core LD/ST SFU Core Core LD/ST SFU LD/ST SFU Core Core LD/ST SFU Core SFU Core Core LD/ST Core Core

blockIdx.x == 2

blockIdx.x == 4

## thread ids

How to get a unique id per thread?

#### int i = blockIdx.x \* blockDim.x + threadIdx.x;

threadIdx.x is thread id within a block

#### example threadIdx.x == 9

|      | Instruction Buffer              |      |      |             |     |  |  |  |  |  |
|------|---------------------------------|------|------|-------------|-----|--|--|--|--|--|
|      | Warp Scheduler                  |      |      |             |     |  |  |  |  |  |
| Di   | spatch Un                       | it   | C    | )ispatch Ur | nit |  |  |  |  |  |
|      | Register File (16,384 x 32-bit) |      |      |             |     |  |  |  |  |  |
| Core | Core                            | Core | Core | LD/ST       | SFU |  |  |  |  |  |
| Core | Core                            | Core | Core | LD/ST       | SFU |  |  |  |  |  |
| Core |                                 | Core | Core | LD/ST       | SFU |  |  |  |  |  |
| Core | Core                            | Core | Core | LD/ST       | SFU |  |  |  |  |  |
| Core | Core                            | Core | Core | LD/ST       | SFU |  |  |  |  |  |
| Core | Core                            | Core | Core | LD/ST       | SFU |  |  |  |  |  |
| Core | Core                            | Core | Core | LD/ST       | SFU |  |  |  |  |  |
| Core | Core                            | Core | Core | LD/ST       | SFU |  |  |  |  |  |

blockIdx.x == 0

| Instruction Buffer          |        |            |          |         |     |  |  |  |  |
|-----------------------------|--------|------------|----------|---------|-----|--|--|--|--|
| Warp Scheduler              |        |            |          |         |     |  |  |  |  |
| Dispatch Unit Dispatch Unit |        |            |          |         |     |  |  |  |  |
|                             | +      |            |          | +       |     |  |  |  |  |
|                             | Regist | er File (1 | 16,384 x | 32-bit) |     |  |  |  |  |
|                             |        |            |          |         |     |  |  |  |  |
| Core                        | Core   | Core       | Core     | LD/ST   | SFU |  |  |  |  |
| Core                        | Core   | Core       | Core     | LD/ST   | SFU |  |  |  |  |
| Core                        |        | Core       | Core     | LD/ST   | SFU |  |  |  |  |
| Core                        | Core   | Core       | Core     | LD/ST   | SFU |  |  |  |  |
| Core                        | Core   | Core       | Core     | LD/ST   | SFU |  |  |  |  |
| Core                        | Core   | Core       | Core     | LD/ST   | SFU |  |  |  |  |
| Core                        | Core   | Core       | Core     | LD/ST   | SFU |  |  |  |  |
| Core                        | Core   | Core       | Core     | LD/ST   | SFU |  |  |  |  |

blockIdx.x == 1

| Instruction Buffer              |                             |      |      |       |     |  |  |  |  |  |
|---------------------------------|-----------------------------|------|------|-------|-----|--|--|--|--|--|
|                                 | Warp Scheduler              |      |      |       |     |  |  |  |  |  |
| Di                              | Dispatch Unit Dispatch Unit |      |      |       |     |  |  |  |  |  |
| Register File (16,384 x 32-bit) |                             |      |      |       |     |  |  |  |  |  |
| Core                            | Core                        | Core | Core | LD/ST | SFU |  |  |  |  |  |
| Core                            | Core                        | Core | Core | LD/ST | SFU |  |  |  |  |  |
| Core                            |                             | Core | Core | LD/ST | SFU |  |  |  |  |  |
| Core                            | Core                        | Core | Core | LD/ST | SFU |  |  |  |  |  |
| Core                            | Core                        | Core | Core | LD/ST | SFU |  |  |  |  |  |
| Core                            | Core                        | Core | Core | LD/ST | SFU |  |  |  |  |  |
| Core                            | Core                        | Core | Core | LD/ST | SFU |  |  |  |  |  |
| Core                            | Core                        | Core | Core | LD/ST | SFU |  |  |  |  |  |

|                                 | Instruction Buffer |      |      |            |     |  |  |  |  |  |
|---------------------------------|--------------------|------|------|------------|-----|--|--|--|--|--|
| Warp Scheduler                  |                    |      |      |            |     |  |  |  |  |  |
| Di                              | spatch Uni         | t    | D    | ispatch Un | it  |  |  |  |  |  |
| Register File (16,384 x 32-bit) |                    |      |      |            |     |  |  |  |  |  |
| Core                            | Core               | Core | Core | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core               | Core | Core | LD/ST      | SFU |  |  |  |  |  |
| Core                            |                    | Core | Core | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core               | Core | Core | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core               | Core | Core | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core               | Core | Core | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core               | Core | Core | LD/ST      | SFU |  |  |  |  |  |
| Core                            | Core               | Core | Core | LD/ST      | SFU |  |  |  |  |  |

blockIdx.x == 2

blockIdx.x == 4

# Final Round

The GPU in my PhD laptop



Nvidia 940m 1.8 Billion transistors 75 TDP Est. \$130 Fight!



The CPU in my professor workstation



Intel i7-9700K 2.16 Billion transistors 95 TDP Est. \$316

https://www.techpowerup.com/gpu-specs/geforce-940m.c2648 https://www.alibaba.com/product-detail/Intel-Core-i7-9700K-8-Cores\_62512430487.html https://www.prolast.com/prolast-elevated-boxing-rings-22-x-22/

# Final Round

Nearly 4x faster!!

Fight!

### The GPU in my PhD laptop



Nvidia 940m 1.8 Billion transistors 75 TDP Est. \$130



The CPU in my professor workstation



Intel i7-9700K 2.16 Billion transistors 95 TDP Est. \$316

https://www.techpowerup.com/gpu-specs/geforce-940m.c2648 https://www.alibaba.com/product-detail/Intel-Core-i7-9700K-8-Cores\_62512430487.html https://www.prolast.com/prolast-elevated-boxing-rings-22-x-22/

## Schedule

- Review previous optimizations
- New optimizations
- Advanced GPU topics

## Schedule

- Review previous optimizations
- New optimizations
- Advanced GPU topics



\_\_global\_\_\_void diverged (...) {
 if (threadIdx.x < 16) {
 // Do some work
 }
 else {
 // Do something else
 }
</pre>





\_\_global\_\_\_ void diverged (...) { if (threadIdx.x < 16) { // Do some work } else { // Do something else }



\_global\_\_ void diverged (...) {
 if (threadIdx.x < 16) {
 // Do some work
 }
 else {
 // Do something else
 }
</pre>



Warp scheduler keeps a bitmask for each warp Example: [0 1 1 0 ...] -> execute thread 1 and 2, no-op the rest

How is the bit vector initialized?

Do "only" need a bit vector?

Bit vector stack for warp 0

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ]

```
global
void diverged2 (...) {
  if (threadIdx.x < 8) {</pre>
    // Work 0
    if (threadIdx.x == 3) {
      // Work 1
    // Work 2
  else {
    // Work 3
```

Γ1

Bit vector stack for warp 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

```
global
void diverged2 (...) {
  if (threadIdx.x < 8) {</pre>
                                Push on the stack for if
        Work 0
     if (threadIdx.x == 3) {
       // Work 1
        Work 2
  else
        {
     // Work 3
```

Bit vector stack for warp 0

```
global
void diverged2 (...) {
  if (threadIdx.x < 8) {</pre>
       Work 0
    if (threadIdx.x == 3) {
         Work 1
       Work 2
  else {
    // Work 3
```

Bit vector stack for warp 0



```
global
void diverged2 (...) {
  if (threadIdx.x < 8) {</pre>
    // Work 0
    if (threadIdx.x == 3) {
       // Work 1
                                Pop the stack
       Work 2
  else {
    // Work 3
```

Bit vector stack for warp 0

```
global
void diverged2 (...) {
  if (threadIdx.x < 8) {</pre>
     // Work 0
     if (threadIdx.x == 3) {
       // Work 1
        Work 2
  else
        Work 3
                    Complement the top item on the stack for else
```

Bit vector stack for warp 0

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

```
global
void diverged2 (...) {
  if (threadIdx.x < 8) {</pre>
    // Work 0
    if (threadIdx.x == 3) {
       // Work 1
    // Work 2
  else {
    // Work 3
                Pop
```
### What about this?

```
__global_____void waiting(...) {
if (threadIdx.x == 0) {
    while (global_flag != 1); //Wait for thread 1 to set a flag
}
if (thread Idx.x == 1) {
    *global_flag = 1;
}
```

Synchronization and Shared Memory



# Synchronization and Shared Memory

- Last part of SM architecture to talk about!
- What if program requires thread communication?
- Threads in same warp execute "lock-step"
- BUT threads in different warps can de-sync (execute different instructions)

• Vector reduction: Sum up elements in array

• Parallel algorithm? Consider 4 threads



Reduce to 4 elements by striding



Reduce to 4 elements by striding

Use half as many threads to compute the next round

### Requires synchronization between levels!



Use half as many threads to compute the next round

half as many again, until single result is reached



### Demo

• Barrier



## Shared memory (if time)

- Software managed cache
- Can be used to share memory between threads in the same block efficiently

### Performance Portability

### What about different vendors?

- AMD warp size of 64
- Intel warp size of 8 OR 16 depending on resource usage
- ARM old chips do not have warps, new chips have warp size of 8
- All vendors except Nvidia have max block size of 256. ARM supports only 128 in some cases.
- Different memory technologies (DDR, GDDR, HBM)

| tuned for | R9 Fury X -   | 24%      | 20%        | 20%        | 23%        | 37%         | 100%      | 99%       | 100%      | 99%      | 62%            | 80%         | 65%         |
|-----------|---------------|----------|------------|------------|------------|-------------|-----------|-----------|-----------|----------|----------------|-------------|-------------|
|           | R9 290X -     | 91%      | 79%        | 84%        | 97%        | 97%         | 99%       | 100%      | 93%       | 100%     | 2%             | 70%         | 57%         |
|           | HD 7970 -     | 23%      | 19%        | 20%        | 21%        | 33%         | 100%      | 100%      | 99%       | 100%     | 73%            | 85%         | 71%         |
|           | GTX 1080 Ti - | 91%      | 97%        | 58%        | 98%        | 100%        | 68%       | 38%       | 28%       | 62%      | 1%             | 5%          | 4%          |
|           | GTX 980 Ti -  | 71%      | 69%        | 51%        | 100%       | 92%         | 53%       | 33%       | 49%       | 36%      | 5%             | 5%          | 5%          |
|           | GTX 780 Ti -  | 93%      | 66%        | 100%       | 95%        | 90%         | 68%       | 33%       | 32%       | 62%      | 5%             | 5%          | 4%          |
|           | GTX 680 -     | 92%      | 100%       | 71%        | 98%        | 98%         | 55%       | 38%       | 26%       | 57%      | 1%             | 5%          | 4%          |
|           | GTX 580 -     | 100%     | 87%        | 74%        | 53%        | 56%         | х         | х         | Х         | х        | 1%             | 5%          | 4%          |
|           |               | GTX 580- | GTX 680 -  | GTX 780 TI | GTX 980 TI | GTX 1080 Ti | - 0262 CH | R9 2905 - | R9 Fury X | RX 480 - | Ivy Bridge CPU | Haswell CPU | Skylake CPU |
|           |               |          | running on |            |            |             |           |           |           |          |                |             |             |

From:

tuned for

nvidia chips tuned for Nvidia gpus

| Skylake CPU -    | 90%        | 69%     | 93%        | 90%        | 90%         | 45%     | 28%       | 14%       | 21%      | 2%             | 99%         | 100%        |
|------------------|------------|---------|------------|------------|-------------|---------|-----------|-----------|----------|----------------|-------------|-------------|
| Haswell CPU -    | 12%        | 16%     | 16%        | 39%        | 43%         | 37%     | 20%       | 11%       | 17%      | 2%             | 100%        | 97%         |
| Ivy Bridge CPU - | 21%        | 14%     | 18%        | 41%        | 40%         | 19%     | 14%       | 6%        | 13%      | 100%           | 53%         | 44%         |
| RX 480 -         | 87%        | 74%     | 78%        | 95%        | 93%         | 99%     | 99%       | 97%       | 100%     | 2%             | 93%         | 81%         |
| R9 Fury X -      | 24%        | 20%     | 20%        | 23%        | 37%         | 100%    | 99%       | 100%      | 99%      | 62%            | 80%         | 65%         |
| R9 290X -        | 91%        | 79%     | 84%        | 97%        | 97%         | 99%     | 100%      | 93%       | 100%     | 2%             | 70%         | 57%         |
| HD 7970 -        | 23%        | 19%     | 20%        | 21%        | 33%         | 100%    | 100%      | 99%       | 100%     | 73%            | 85%         | 71%         |
| GTX 1080 Ti      | 91%        | 97%     | 58%        | 98%        | 100%        | 68%     | 38%       | 28%       | 62%      | 1%             | 5%          | 4%          |
| GTX 980 Ti       | 71%        | 69%     | 51%        | 100%       | 92%         | 53%     | 33%       | 49%       | 36%      | 5%             | 5%          | 5%          |
| GTX 780 Ti       | 93%        | 66%     | 100%       | 95%        | 90%         | 68%     | 33%       | 32%       | 62%      | 5%             | 5%          | 4%          |
| GTX 680          | 92%        | 100%    | 71%        | 98%        | 98%         | 55%     | 38%       | 26%       | 57%      | 1%             | 5%          | 4%          |
| GTX 580          | 100%       | 87%     | 74%        | 53%        | 56%         | х       | Х         | Х         | Х        | 1%             | 5%          | 4%          |
|                  | GTX 580    | GTX 680 | GTX 780 TI | GTX 980 TI | GTX 1080 Ti | 0262 CH | R9 290X - | R9 Fury X | RX 480 - | Ivy Bridge CPU | Haswell CPU | Skylake CPU |
|                  | running on |         |            |            |             |         |           |           |          |                |             |             |

#### From:

AMD chips tuned for AMD chips

tuned for

Skylake CPU -69% 90% 45% 28% 14% 21% 2% 909 39% 11% Haswell CPU 12% 16% 16% 43% 37% 20% 17% 2% Ivy Bridge CPU -21% 14% 18% 41% 40% 19% 14% 6% 13% 53% 44% RX 480 -87% 74% 78% 81% 2% 20% 23% 37% R9 Fury X -24% 20% 62% 80% 65% R9 290X -70% 57% 79% 84% 2% HD 7970 19% 20% 21% 33% 100% 85% 23% 73% 71% GTX 1080 Ti -58% 5% 68% 38% 28% 62% 1% 4% 69% 53% GTX 980 Ti -71% 51% 33% 49% 36% 5% 5% 5% GTX 780 Ti -66% 68% 33% 32% 62% 5% 5% 4% 71% 55% 38% 26% 57% GTX 680 -1% 5% 4% GTX 580 -87% 74% 53% 56% Х Х Х Х 1% 5% 4% GTX 580 GTX 780 GTX 780 GTX 980 GTX 980 GTX 980 Ti GTX 980 Ti BO HD 7970 HD 7970 R9 Fury X R9 Fu running on

#### From:

CPU chips tuned for CPUs

tuned for

Skylake CPU -69% 45% 28% 14% 21% 2% 909 39% Haswell CPU 12% 16% 16% 43% 37% 20% 11% 17% 2% Ivy Bridge CPU -21% 14% 18% 41% 40% 19% 14% 6% 13% 53% 44% RX 480 -87% 74% 78% 2% 81% 23% R9 Fury X -24% 20% 20% 37% 62% 80% 65% R9 290X -70% 57% 79% 84% 2% HD 7970 19% 20% 21% 85% 23% 33% 73% 71% 62% GTX 1080 Ti -58% 68% 38% 28% 1% 5% 4% 69% 53% 49% GTX 980 Ti -71% 51% 33% 36% 5% 5% 5% GTX 780 Ti -66% 68% 33% 32% 62% 5% 5% 4% 71% 55% 38% 26% 57% GTX 680 -1% 5% 4% GTX 580 -87% 74% 53% 56% Х Х Х Х 1% 5% 4% GTX 780 Ti GTX 780 Ti GTX 980 Ti GTX 1080 Ti HD 7970 R9 Fury X Skylake CPU Skylake CPU GTX 580 running on

#### From:

tuned for

Skylake CPU -69% 90% 45% 28% 14% 21% 2% 16% 39% 37% Haswell CPU 12% 16% 43% 20% 11% 17% 2% Ivy Bridge CPU -21% 14% 18% 41% 40% 19% 14% 6% 13% 53% 44% RX 480 -87% 74% 78% 81% 2% 23% R9 Fury X -24% 20% 20% 37% 62% 80% 65% R9 290X -70% 57% 79% 84% 2% HD 7970 19% 20% 21% 85% 71% 23% 33% 73% 62% GTX 1080 Ti -58% 68% 38% 28% 1% 5% 4% 69% 53% 33% 49% GTX 980 Ti -71% 51% 36% 5% 5% 5% GTX 780 Ti -66% 68% 33% 32% 62% 5% 5% 4% 71% 55% 38% 26% 57% GTX 680 -1% 5% 4% GTX 580 -87% 74% 53% 56% Х Х Х Х 1% 5% 4% GTX 780 Ti GTX 780 Ti GTX 980 Ti GTX 1080 Ti HD 7970 R9 Fury X Skylake CPU Skylake CPU GTX 580 running on

AMD chips tuned for CPUs and Nvidia GPUS

#### From:

CPU chips tuned for Nvidia and AMD GPUs



#### From:

### Putting it all together:

- GPUs are programmed as external accelerators: Host must manage memory!
- threads execute in groups of 32 (or 1,8,16,64) called a warp.
- Parallelism across warps hides latency
- Access memory in strided patterns
- Use many blocks!
- Synchronization available across threads in the same block



EDWARD KANDROT

FOREWORD BY JACK DONGARRA

**Copyrighted Materia** 





#### Further readings

### On Thursday

- Reese will talk about distributed computing!
- Office hours on Wednesday will be joint (go over homework questions)