### CSE113: Parallel Programming March 15, 2023

- Topics:
  - GPU programming continued

| Instruction Buffer              |            |         |               |       |     |
|---------------------------------|------------|---------|---------------|-------|-----|
|                                 |            | Warp So | heduler       |       |     |
| Di                              | spatch Uni | it      | Dispatch Unit |       |     |
| Register File (16,384 x 32-bit) |            |         |               |       |     |
| Core                            | Core       | Core    | Core          | LD/ST | SFU |
| Core                            | Core       | Core    | Core          | LD/ST | SFU |
| Core                            | Core       | Core    | Core          | LD/ST | SFU |
| Core                            | Core       | Core    | Core          | LD/ST | SFU |
| Core                            | Core       | Core    | Core          | LD/ST | SFU |
| Core                            | Core       | Core    | Core          | LD/ST | SFU |
| Core                            | Core       | Core    | Core          | LD/ST | SFU |
| Core                            | Core       | Core    | Core          | LD/ST | SFU |

#### Announcements

- extra day on Homework 4 (You can turn it in by the end of today)
- HW 5 is out, you should be able to get started
- Last two days of class!

Discuss some of the API differences between Javascript web workers and C++ threads. For example, how do you pass data to web workers? How are they launched? How are they joined?

Read this short blog post about shared array buffers:

https://hacks.mozilla.org/2017/06/a-cartoon-intro-to-arraybuffers-and-sharedarraybuffers/

write a few sentences about what you learned and how shared array buffers can enable performant parallelism in javascript.

Please try launching the HW 5 server and make sure you can navigate to each of the pages.

| I have done this and I can see what is expected for each part                         | 22 respondents | 67 <sup>%</sup> | ✓ |
|---------------------------------------------------------------------------------------|----------------|-----------------|---|
| I have got the homework but I get an<br>error when I navigate to some of the<br>pages | 1 respondent   | 3 %             |   |
| I have not yet downloaded the package<br>and tried things out for HW 5                | 10 respondents | 30 <sup>%</sup> |   |

Which type of GPU will you be using for HW 5?

| Nvidia | 13 respondents | <b>39</b> %     | $\checkmark$ |
|--------|----------------|-----------------|--------------|
| Intel  | 15 respondents | 45 <sup>%</sup> |              |
| AMD    | 2 respondents  | 6 %             |              |
| Apple  | 7 respondents  | 21 <sup>%</sup> |              |





How do we allocate memory on a CPU?



How do we allocate CPU memory on the host?

• Our heterogeneous, parallel, programming model

int \*x = (int\*) malloc(sizeof(int)\*SIZE);



How do we allocate CPU memory on the host?



We need to allocate GPU memory on the host



We need to allocate GPU memory on the host



We need to allocate GPU memory on the host



We need to allocate GPU memory on the host





• Our heterogeneous, parallel, programming model

If we can't access d\_x on the CPU, how do we initialize the memory?

GPU has no access to input devices e.g. disk



• Our heterogeneous, parallel, programming model

If we can't access d\_x on the CPU, how do we initialize the memory?

GPU has no access to input devices e.g. disk



• Our heterogeneous, parallel, programming model

If we can't access d\_x on the CPU, how do we initialize the memory?

GPU has no access to input devices e.g. disk



### How does this look in code?

### How does this look in code?

Nothing too exciting yet.

- Write a special function in your C++ code.
  - Called a Kernel
  - Use the new keyword \_\_\_global\_\_\_
  - Keywords in
    - OpenCL \_\_kernel
    - Metal kernel
- Write it how you'd write any other function

```
__global___ void vector_add(int * a, int * b, int * c, int size) {
    for (int i = 0; i < size; i++) {
        a[i] = b[i] + c[i];
    }
}</pre>
```

```
__global___ void vector_add(int * a, int * b, int * c, int size) {
   for (int i = 0; i < size; i++) {
      a[i] = b[i] + c[i];
   }
}</pre>
```

calling the function

```
__global___ void vector_add(int * a, int * b, int * c, int size) {
   for (int i = 0; i < size; i++) {
      a[i] = b[i] + c[i];
   }
}</pre>
```

calling the function

*What in the world?* special new CUDA syntax. We will talk more soon

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    for (int i = 0; i < size; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

Pass in pointers to memory on the device

calling the function

• Our heterogeneous, parallel, programming model

Remember, GPU needs to access its own memory



```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   for (int i = 0; i < size; i++) {
     d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

*Constants can be passed in regularly* 

calling the function

Are we ready to run the program? What are we missing?





Finally, we can run the GPU program!

Lets see what all the hype is about



It didn't do so well...

# First parallelization attempt

- Lets look at some GPU documentation.
- The Maxwell whitepaper shows a diagram of one of the GPU cores

|                                 | Instruction Buffer |         |               |       |     |  |
|---------------------------------|--------------------|---------|---------------|-------|-----|--|
|                                 |                    | Warp So | cheduler      |       |     |  |
| Dispatch Unit                   |                    |         | Dispatch Unit |       |     |  |
| Register File (16,384 x 32-bit) |                    |         |               |       |     |  |
| Core                            | Core               | Core    | Core          | LD/ST | SFU |  |
| Core                            | Core               | Core    | Core          | LD/ST | SFU |  |
| Core                            | Core               | Core    | Core          | LD/ST | SFU |  |
| Core                            | Core               | Core    | Core          | LD/ST | SFU |  |
| Core                            | Core               | Core    | Core          | LD/ST | SFU |  |
| Core                            | Core               | Core    | Core          | LD/ST | SFU |  |
| Core                            | Core               | Core    | Core          | LD/ST | SFU |  |
| Core                            | Core               | Core    | Core          | LD/ST | SFU |  |

https://www.techpowerup.com/gpu-specs/docs/nvidia-gtx-980.pdf

woah, 32 cores!

We should parallelize our application!

|                                 | Instruction Buffer |      |               |       |     |  |  |  |  |
|---------------------------------|--------------------|------|---------------|-------|-----|--|--|--|--|
| Warp Scheduler                  |                    |      |               |       |     |  |  |  |  |
| Dispatch Unit                   |                    |      | Dispatch Unit |       |     |  |  |  |  |
| Register File (16,384 x 32-bit) |                    |      |               |       |     |  |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |  |

https://www.techpowerup.com/gpu-specs/docs/nvidia-gtx-980.pdf

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    for (int i = 0; i < size; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

vector\_add<<<1,1>>>(d\_a, d\_b, d\_c, size);

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    for (int i = 0; i < size; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

vector\_add<<<1,32>>>(d\_a, d\_b, d\_c, size);

number of threads to launch the program with

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

number of threads

```
vector_add<<<1,32>>>(d_a, d_b, d_c, size);
```

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

vector\_add<<<1,32>>>(d\_a, d\_b, d\_c, size);

number of threads thread id

Lets try it! What do we think?



Getting better but we have a long ways to go!

# GPU Memory



# GPU Memory

**CPU Memory:** 

Fast: Low Latency Easily saturated: Low Bandwidth Scales well: up to 1 TB DDR

2-lane straight highway driven on by sports cars



Different technologies

16-lane highway on a windy road driven by semi trucks

## GPU Memory

bandwidth: ~**700 GB/s** for GPU ~**50 GB/s** for CPUs

memory Latency:~600 cycles for GPU memory~200 cycles for CPU memory

Cache Latency: ~**28** cycles for L1 hit for GPU ~**4** cycles for L1 hit on CPUs









warp 0



600 cycles!



warp 0

















Hey, my memory has arrived!



But wait, I thought preemption was expensive?



But wait, I thought preemption was expensive?

Registers all stay on chip

|                                 | Instruction Buffer |      |               |       |     |  |  |  |
|---------------------------------|--------------------|------|---------------|-------|-----|--|--|--|
| Warp Scheduler                  |                    |      |               |       |     |  |  |  |
| Dispatch Unit                   |                    |      | Dispatch Unit |       |     |  |  |  |
| Register File (16,384 x 32-bit) |                    |      |               |       |     |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |
| Core                            | Core               | Core | Core          | LD/ST | SFU |  |  |  |

But wait, I thought preemption was expensive? dedicated scheduler logic



But wait, I thought preemption was expensive?

bound on number of warps: 32

#### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

Lets launch with 32 warps

```
vector_add<<<1,32>>>(d_a, d_b, d_c, size);
```

#### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

Lets launch with 32 warps

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```

### Concurrent warps

Lets try it! What do we think?

## Concurrent warps

Lets try it! What do we think?



Getting better!

# Optimizing memory accesses

| Instruction Buffer              |            |      |               |       |     |  |  |  |  |
|---------------------------------|------------|------|---------------|-------|-----|--|--|--|--|
| Warp Scheduler                  |            |      |               |       |     |  |  |  |  |
| Di                              | spatch Uni | it   | Dispatch Unit |       |     |  |  |  |  |
| Register File (16,384 x 32-bit) |            |      |               |       |     |  |  |  |  |
| Core                            | Core       | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core       | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core       | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core       | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core       | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core       | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core       | Core | Core          | LD/ST | SFU |  |  |  |  |
| Core                            | Core       | Core | Core          | LD/ST | SFU |  |  |  |  |

### Optimizing memory accesses



this is the load/store unit. The hardware component responsible for issuing loads and stores.

Why doesn't every core have one?

#### Optimizing memory accesses



This is the instruction cache... Why doesn't every core have a instruction buffer to keep track of its program?

this is the load/store unit. The hardware component responsible for issuing loads and stores.

Why doesn't every core have one?

# Warp execution



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

## Warp execution



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

#### Program:

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```

## Warp execution



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

#### Program:

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

instruction is fetched from the buffer and distributed to all the cores.

#### **Program:**

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

Cores can a large register file they share expensive HW units (load/store and special functions)

#### Program:

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

All cores need to wait until all cores finish the first instruction

#### **Program:**

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

Start the next instruction.

Program: int variable1 = b[0]; int variable2 = c[0]; int variable3 = variable1 + variable2; a[0] = variable3;

Why would we have a programming model like this?



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

Start the next instruction.

Program: int variable1 = b[0]; int variable2 = c[0]; int variable3 = variable1 + variable2; a[0] = variable3;

Why would we have a programming model like this? More cores (share program counters) Can be efficient to share other hardware resources



#### Lets look closer at memory

#### Program:

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```

4 cores are accessing memory. what happens if they access the same value?

**GPU Memory** 

### Load Store Unit



All read the same value

**GPU Memory** 

### Load Store Unit



#### All read the same value

This is efficient: the load store unit can ask for the value and then broadcast it to all cores.



#### All read the same value

This is efficient: the load store unit can ask for the value and then broadcast it to all cores.

1 request to GPU memory

Efficient, but probably not too common.



Read contiguous values

GPU Memory

### Load Store Unit



**Read contiguous values** Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes **GPU Memory** 

### Load Store Unit



**Read contiguous values** Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes





**Read contiguous values** Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes

Can easily distribute the values to the threads





**Read contiguous values** Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes

Can easily distribute the values to the threads

1 request to GPU memory



Read non-contiguous values

Not good!

Accesses are Serialized. You need 4 requests to GPU memory

### **GPU Memory**

### Load Store Unit



Read non-contiguous values

Not good!



Read non-contiguous values

Not good!



Read non-contiguous values

Not good!



Read non-contiguous values

Not good!



## Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

```
vector_add<<<1,32>>>(d_a, d_b, d_c, size);
```

## Chunked Pattern



array a

## Chunked Pattern

the first element accessed by the 4 threads sharing a load store unit. What sort of access is this?



array a

## Chunked Pattern

the first element accessed by the 4 threads sharing a load store unit. What sort of access is this?



array a

How can we fix this

## Stride Pattern



array a

#### What sort of pattern is this?

## Stride Pattern



array a

## Go back to our program

```
__global__ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}</pre>
```

calling the function

Lets change this to a stride pattern

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```

### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   for (int i = threadIdx.x; i < size; i+=blockDim.x) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

calling the function

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```

## Coalesced memory accesses

Lets try it! What do we think?

## Coalesced memory accesses

Lets try it! What do we think?



What else can we do?

# Multiple streaming multiprocessors

We've been talking only about 1 streaming multiprocessor, most GPUs have multiple SMs big ML GPUs have 32. My GPU has 4

|      | h          | nstructi   | on Buffe | )r          |     |
|------|------------|------------|----------|-------------|-----|
|      |            | Warp So    | cheduler |             |     |
| Di   | spatch Uni | it         | C        | )ispatch Ur | it  |
|      | Regist     | er File (' | 16,384 x | 32-bit)     |     |
| Core | Core       | Core       | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST       | SFU |

# Multiple streaming multiprocessors

We've been talking only about 1 streaming multiprocessor, most GPUs have multiple SMs big ML GPUs have 32. This little GPU has 1

|      | h          | nstructi   | on Buffe | ər         |     |      | ir         | nstructio  | on Buffe | er         |     |      | 1         | nstructi  | on Buffe | )r         |     |      | 1         | nstructi  | on Buffe | ər          |     |
|------|------------|------------|----------|------------|-----|------|------------|------------|----------|------------|-----|------|-----------|-----------|----------|------------|-----|------|-----------|-----------|----------|-------------|-----|
|      |            | Warp So    | cheduler |            |     |      |            | Warp So    | cheduler |            |     |      |           | Warp So   | cheduler |            |     |      |           | Warp So   | cheduler |             |     |
| Dis  | spatch Uni | it         | Ĺ        | Dispatch U | nit | Di   | spatch Uni | t          | [        | Dispatch U | nit | D    | spatch Un | it        | [        | )ispatch U | nit | Di   | spatch Un | it        | Ĺ        | Dispatch Un | nit |
|      | Ť          | er File (' | 16,384 x |            |     |      | Registe    | er File (* | 16,384 x | 32-bit)    |     |      |           | er File ( | 16,384 x |            |     |      |           | er File ( | 16,384 x |             |     |
| Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST       | SFU |
| Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core       | Core       | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST      | SFU | Core | Core      | Core      | Core     | LD/ST       | SFU |

# Multiple streaming multiprocessors

CUDA provides virtual streaming multiprocessors called **blocks** 

Very efficient at launching and joining blocks.

No limit on blocks: launch as many as you need to map 1 thread to 1 data element

|                                    |                                                       | nstructi                                              | on Buffe                                             | ər                                                                     |                                 |
|------------------------------------|-------------------------------------------------------|-------------------------------------------------------|------------------------------------------------------|------------------------------------------------------------------------|---------------------------------|
|                                    |                                                       | Warp Se                                               |                                                      |                                                                        |                                 |
| Di                                 | spatch Un                                             | ıt                                                    |                                                      | Dispatch U                                                             | nit                             |
|                                    | Regist                                                | er File (                                             | 16,384 x                                             | : 32-bit)                                                              |                                 |
| Core                               | Core                                                  | Core                                                  | Core                                                 | LD/ST                                                                  | SFU                             |
| Core                               | Core                                                  | Core                                                  | Core                                                 | LD/ST                                                                  | SFU                             |
| ore                                | Core                                                  | Core                                                  | Core                                                 | LD/ST                                                                  | SFU                             |
| Core                               | Core                                                  | Core                                                  | Core                                                 | LD/ST                                                                  | SFU                             |
| Core                               | Core                                                  | Core                                                  | Core                                                 | LD/ST                                                                  | SFU                             |
| Core                               | Core                                                  | Core                                                  | Core                                                 | LD/ST                                                                  | SFU                             |
| Core                               | Core                                                  | Core                                                  | Core                                                 | LD/ST                                                                  | SFU                             |
| Core                               | Core                                                  | Core                                                  | Core                                                 | LD/ST                                                                  | SFU                             |
|                                    |                                                       |                                                       |                                                      |                                                                        |                                 |
|                                    |                                                       |                                                       |                                                      |                                                                        |                                 |
|                                    | 1                                                     |                                                       | on Buffe                                             | ər                                                                     |                                 |
| Di                                 |                                                       | Warp Se                                               | cheduler                                             |                                                                        | nia                             |
| Di                                 | spatch Un                                             | Warp Si                                               | cheduler<br>C                                        | Dispatch Ur                                                            | nit                             |
| Di                                 | spatch Un                                             | Warp Si                                               | cheduler                                             | Dispatch Ur                                                            | nit                             |
|                                    | spatch Un                                             | Warp Si                                               | cheduler<br>C                                        | Dispatch Ur                                                            | nit<br>SFU                      |
| Core                               | spatch Uni<br>T<br>Regist                             | Warp Si<br>it<br>er File ('                           | cheduler<br>(<br>16,384 x                            | Dispatch Ui<br>4<br>x 32-bit)                                          |                                 |
| Core<br>Core                       | Regist                                                | Warp Si<br>it<br>er File ('<br>Core                   | cheduler<br>(<br>16,384 x<br>Core                    | Dispatch Ui                                                            | SFU                             |
| Core<br>Core<br>Core               | spatch Uni<br>Regista<br>Core<br>Core                 | Warp Si<br>it<br>Er File (<br>Core<br>Core            | Core<br>Core                                         | Dispatch Un<br>32-bit)<br>LD/ST<br>LD/ST                               | SFU<br>SFU                      |
| Core<br>Core<br>Core               | Regista<br>Core<br>Core<br>Core                       | Warp Si<br>it<br>Er File ('<br>Core<br>Core<br>Core   | cheduler<br>(<br>16,384 x<br>Core<br>Core<br>Core    | Cispatch Ui<br>( 32-bit)<br>LOIST<br>LOIST<br>LDIST                    | SFU<br>SFU<br>SFU               |
| D:<br>Core<br>Core<br>Core<br>Core | Registr<br>Core<br>Core<br>Core<br>Core               | Warp Si<br>it<br>er File (<br>Core<br>Core<br>Core    | Core<br>Core<br>Core<br>Core<br>Core                 | Cispatch Ui<br>Cispatch Ui<br>Cist<br>LDIST<br>LDIST<br>LDIST<br>LDIST | SFU<br>SFU<br>SFU<br>SFU        |
| Core<br>Core<br>Core<br>Core       | spatch Uni<br>Registr<br>Core<br>Core<br>Core<br>Core | Warp Si<br>it<br>Core<br>Core<br>Core<br>Core<br>Core | Core<br>Core<br>Core<br>Core<br>Core<br>Core<br>Core | Clispatch Ui<br>( 32-bit)<br>LD/ST<br>LD/ST<br>LD/ST<br>LD/ST          | SFU<br>SFU<br>SFU<br>SFU<br>SFU |

### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   for (int i = threadIdx.x; i < size; i+=blockDim.x) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

calling the function

Launch with many thread blocks

vector\_add<<<1,1024>>>(d\_a, d\_b, d\_c, size);

## Go back to our program

```
__global___void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[i] = d_b[i] + d_c[i];
}
```

calling the function

Need to recalculate some thread ids.

Launch with many thread blocks

vector\_add<<<1024,1024>>>(d\_a, d\_b, d\_c, size);

Now we have 1 thread for each element

#define SIZE (1024\*1024)

# Final Round

Tiny GPU in an embedded system



Fight!



Nvidia Jetson Nano (whole chip, CPU + GPU) 2 Billion transistors 10 TDP Est. \$99 https://www.techpow

https://www.techpowerup.com/gpu-specs/geforce-940m.c2648 https://www.alibaba.com/product-detail/Intel-Core-i7-9700K-8-Cores\_62512430487.html https://www.prolast.com/prolast-elevated-boxing-rings-22-x-22/ The CPU in my professor workstation



Intel i7-9700K 2.16 Billion transistors 95 TDP Est. \$316

# See you on Friday

- Turn in HW 4 if you haven't already
- Working on GPU programming