# CSE113: Parallel Programming Feb. 16, 2022

- Topics:
  - Intro to module 4
  - Barriers



#### Announcements

- Midterm was due on Monday
  - Me and Reese will start grading ASAP: 2 week turnaround time
- Homework 2 grades:
  - Plan on Friday, you have 1 week to raise any concerns
- Homework 3 is due on Friday
  - Several office hours/mentoring hours left
  - Piazza is available
  - You can share results
- Homework 4 will be assigned Friday
  - You should have what you need to get started on part 1 instantly

# Today's Quiz

• Due tomorrow by midnight, please do it!

When a CAS operation fails in a lock-free linked list, the implementation should:

○ spin on the same CAS (like a mutex), it will eventually succeed

○ throw an exception

○ retry the operation from the start, even if it means traversing the list again

 $\bigcirc$  The CAS operation will never fail because it is an atomic operation

An object wrapped in the C++ atomic template allows you to:

Check all that apply

atomically execute method calls

□ load and store the object atomically

perform atomic CAS on the object

□ locks each method call automatically

Which are possible ways that hardware implements RMW operations? check all that apply:

special loads and stores that track if there has been a conflict

] sleeping all other threads while one thread accesses the location

having the hardware lock the cache line

☐ flushing the pipeline to avoid ILP interleavings

It is the end of module 3: concurrent data structures:

Feel free to write a few sentences about how you found the module, e.g. what you found most interesting, what you found unclear, etc.

See you on Wednesday to start module 4!

#### Atomic wrapper

• code works now! It was in a different directory

#### C++ Atomic template

```
bank_account snapshot;
bank_account update;
bool success;
do {
  bank_account snapshot = tylers_account.load();
  bank_account update = snapshot;
  update.buy_coffee();
  success = tylers_account.compare_exchange_strong(snapshot, update);
  } while (success == false);
```

Recall atomic templates allow you to do compare and swap

#### Review

• Optimizing the concurrent set

# Single traversal operations



































#### CAS based insertion

#### 

Adding


















Find the location Cache your insertion point!

b.next == e



Find the location Cache your insertion point!

b.next == e



Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location Cache your insertion point!

b.next == e

e create "c" Adding Using CAS С

Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location Cache your insertion point!

b.next == e



Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location Cache your insertion point!

b.next == e

notion is being abused here:  ${f e}$  and  ${f c}$  will be node  ${f *}$ 



Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location Cache your insertion point!

b.next == e



Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location Cache your insertion point!

b.next == e



Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location Cache your insertion point!

b.next == e



## Further considerations

- need to include "valid bit" in compare and swap to make sure the node is still valid
- Can "pack the bit" into the address (there will always be room because addresses are byte addressable, and addresses 8 bytes)
- More details in the book!

# Schedule

- Module 4 introduction
- Barriers
  - Specification
  - Implementation

# Schedule

- Module 4 introduction
- Barriers
  - Specification
  - Implementation

Mental model of concurrency

- Functional
  - Interleavings events from different threads can interleave
  - Atomicity what events are indivisible?
  - Specifications how can we create useful abstractions (mutexes, concurrent data structures)
- Performance
  - Increase parallelism judicious use of mutexes, load balancing
  - Cache behaviors threads should try to utilize their own cache lines
  - **Operating system** yielding/sleeping threads
  - Architectural details instruction-level parallelism

• Depending on your needs, programs become more/less complex to reason about.



• Depending on your needs, programs become more/less complex to reason about.

In many cases, we use building blocks and specifications to traverse this range



• Depending on your needs, programs become more/less complex to reason about.

To get a more complete picture, we will fill in some of the gaps here



• Depending on your needs, programs become more/less complex to reason about.

To get a more complete picture, we will fill in some of the gaps here



• Depending on your needs, programs become more/less complex to reason about.



• Depending on your needs, programs become more/less complex to reason about.



 On newer architectures, you may only get specifications about the relaxed memory and progress. It is up to the user to implement their own atomics, mutexes, concurrent data structures, etc!



# Modern chips

• From David Brooks lab at Harvard:

http://vlsiarch.eecs.harvard. edu/research/accelerators/di e-photo-analysis/

 GPUs/accelerators will have different guarantees w.r.t. atomics and memory orderings



#### Historical issues on Nvidia GPUs





Issue implementing a concurrent data structure related to memory orderings









*Issue implementing a mutex relating to memory orderings* 

#### Historical issues on Nvidia GPUs



In both cases, they had reasoned about their implementations exactly how we have! The issues came from lower in the stack: memory orderings

# running a mutex on an iPad GPU?

| 2:50 AM | Tue Dec 2                                                               | 9 |          |                             |          |          |          | <b>00</b> 🗢 73% 🗔                                                                                |
|---------|-------------------------------------------------------------------------|---|----------|-----------------------------|----------|----------|----------|--------------------------------------------------------------------------------------------------|
| <       | >                                                                       | Ш |          |                             | webglsam | ples.org |          |                                                                                                  |
|         |                                                                         |   | W Wikipe | edia, the free encyclopedia | ×        | ×        | l        | Run Tests                                                                                        |
|         |                                                                         |   |          |                             |          |          |          | Ran 12 tests out of 366<br>Local iterations: 75<br>Killed Tests: 100<br>Time (seconds): 0.000000 |
|         | A problem repeatedly occurred on "https://webglsamples.org/electricflow |   |          |                             |          |          | flower/e | Cancel                                                                                           |
|         |                                                                         |   |          |                             |          |          |          | Clear Test Log                                                                                   |
|         |                                                                         |   |          |                             |          |          |          | Save to File                                                                                     |
|         |                                                                         |   |          |                             |          |          |          | Action Log:                                                                                      |

This one has to do with functional properties of the scheduler

video demo

 On newer architectures, you may only get specifications about the relaxed memory and progress. It is up to the user to implement their own atomics, mutexes, concurrent data structures, etc!



# Schedule

- Module 4 introduction
- Barriers
  - Specification
  - Implementation

## Barriers

- Why do barriers fit into this module: "Reasoning About Parallel Computing"?
  - Relaxed Memory Models make reasoning about parallel computing HARD
  - Barriers make it EASIER (at the cost of performance potentially)
- A barrier is a concurrent object (like a mutex):
  - Only one method: barrier (called await in the book)
- Separates computational phases

#### My current favorite: particle simulation



by Yanwen Xu

My current favorite: particle simulation







time = 0

time = 1

time = 2

#### My current favorite: particle simulation







time = 0

time = 1

time = 2

at each time, compute new positions for each particle (in parallel)

My current favorite: particle simulation



time = 0

time = 1

time = 2

at each time, compute new positions for each particle (in parallel)

But you need to wait for all particles to be computed before starting the next time step

• Deep neural networks



from http://cs231n.stanford.edu/

• Deep neural networks



## Barriers

- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived
- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived

Example: say there are 4 threads:



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



A more formal specification

Given a global barrier B and a global memory location x where initially \*x = 0;

First, what would we expect var to be after this program?

<u>Thread 1:</u>
B.barrier();
var = \*x;

thread 0

thread 1

A more formal specification

Given a global barrier B and a global memory location x where initially \*x = 0;

| Thread 1:    |
|--------------|
| B.barrier(); |
| var = *x;    |

gives an event: barrier arrive

thread 0

Thread 0:

\*x = 1;

B.barrier();

thread 1 - barrier arrive

<u>Thread 0:</u>
\*x = 1;
B.barrier();

barrier arrive

A more formal specification

Given a global barrier B and a global memory location x where initially x = 0;

Thread 1: B.barrier var = \*X;

gives an event: barrier arrive

barrier arrive needs to wait for all threads to arrive (similar to how a mutex request must wait for another to release)

thread 0

thread 1

<u>Thread O:</u>
\*x = 1;
B.barrier();

A more formal specification

Given a global barrier B and a global memory location x where initially \*x = 0;

| Thread 1:    |
|--------------|
| B.barrier(); |
| var = *x;    |



A more formal specification

Given a global barrier B and a global memory location x where initially \*x = 0;

| <u>Thread 1:</u> |  |
|------------------|--|
| B.barrier();     |  |
| var = *x;        |  |





<u>Thread O:</u> \*x = 1; B.barrier(); A more formal specification

Given a global barrier B and a global memory location x where initially \*x = 0;

| Thread 1:    |
|--------------|
| B.barrier(); |
| var = *x;    |

now that all threads have arrived: They can leave (1 event at the same time)



Thread 0: \*x = 1; B.barrier(); A more formal specification

Given a global barrier B and a global memory location x where initially \*x = 0;

| Thread 1:    |
|--------------|
| B.barrier(); |
| var = *x;    |

This finishes the barrier execution



<u>Thread O:</u> \*x = 1; B.barrier(); A more formal specification

Given a global barrier B and a global memory location x where initially \*x = 0;

| Thread 1:    |
|--------------|
| B.barrier(); |
| var = *x;    |



\_\_\_\_\_

thread 0

thread 1

thread 2

thread 0

thread 1

thread 2 – barrier arrive























They've all arrived



They've all arrived







extending to the next *barrier leave* 



- Barrier Property:
  - If the only concurrent object you use in your program is a barrier (no mutexes, concurrent data-structures, atomic accesses)
  - If every barrier interval contains no data conflicts, then

your program will be deterministic (only 1 outcome allowed)

• much easier to reason about  $\textcircled{\odot}$ 

Assume we are reading from x

We are only allowed to return one possible value



no data conflicts means that x is written to at most once per barrier interval

Assume we are reading from x

We are only allowed to return one possible value



Assume we are reading no data conflicts means that x is written to at most once from x per barrier interval We are only allowed to return one possible not allowed value Barrier Interval N - 2 Barrier Interval N - 1 Barrier Interval N - 3 **Barrier Interval N** thread 0 barrier leave barrier leave var = \*x \*x = 2 barrier leave barrier leave thread 1 barrier leave barrier leave \*x = 1 barrier leave barrier leave barrier leave thread 2

no data conflicts means that x is written to at most once per barrier interval

we will read from the write from the most recent barrier interval Assume we are reading from x

We are only allowed to return one possible value



# Schedule

• Module 4 introduction

#### • Barriers

- Specification
- Implementation

# **Barrier Implementation**

• First attempt at implementation

```
class Barrier {
 private:
    atomic int counter;
    int num threads;
 public:
    Barrier(int num threads) {
      counter = 0;
      this->num_threads = num_threads;
     void barrier() {
        // ??
```

# **Barrier Implementation**

```
class Barrier {
 private:
    atomic int counter;
    int num threads;
 public:
    Barrier(int num_threads) {
      counter = 0;
      this->num threads = num threads;
     void barrier() {
        int arrival_num = atomic_fetch_add(&counter, 1);
        // What next?
```

#### **Barrier Implementation**

First handle the case where the thread is the last thread to arrive

```
class Barrier {
  private:
    atomic int counter;
    int num threads;
  public:
    Barrier(int num threads) {
      counter = 0;
      this->num threads = num threads;
     void barrier() {
        int arrival num = atomic fetch add(&counter, 1);
        if (arrival_num == num_threads) {
           counter.store(0);
        // What next?
```

Spin while there is a thread waiting at the barrier

**Barrier** Implementation

```
class Barrier {
  private:
    atomic int counter;
    int num threads;
  public:
    Barrier(int num threads) {
      counter = 0;
      this->num threads = num threads;
     void barrier() {
        int arrival num = atomic fetch add(&counter, 1);
        if (arrival_num == num_threads) {
           counter.store(0);
        else {
          while (counter.load() != 0);
```

Spin while there is a thread waiting at the barrier

**Barrier** Implementation

Does this work?

```
class Barrier {
  private:
    atomic int counter;
    int num threads;
  public:
    Barrier(int num threads) {
      counter = 0;
      this->num threads = num threads;
     void barrier() {
        int arrival num = atomic fetch add(&counter, 1);
        if (arrival_num == num_threads) {
           counter.store(0);
        else {
          while (counter.load() != 0);
```

```
<u>Thread 0:</u>
B.barrier();
B.barrier();
```

```
void barrier() {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
    }
    else {
        while (counter.load() != 0);
    }
```

```
<u>Thread 1:</u>
B.barrier();
B.barrier();
```

thread 0

thread 1







arrival\_num = 2

arrival\_num = 1

thread 0

thread 1
```
num_threads == 2
counter == 2
```



```
num_threads == 2
counter == 0
```



```
num_threads == 2
counter == 0
```



thread 1

```
num_threads == 2
counter == 0
```



| oid barrier() {                                                 |  |
|-----------------------------------------------------------------|--|
| <pre>int arrival_num = atomic_fetch_add(&amp;counter, 1);</pre> |  |
| <pre>if (arrival_num == num_threads) {</pre>                    |  |
| counter.store(0);                                               |  |
| }                                                               |  |
| else {                                                          |  |
| <b>while</b> (counter.load() != 0); /                           |  |
| }                                                               |  |
|                                                                 |  |
|                                                                 |  |

Thread 1: B.barrier(); B.barrier();

Leaves barrier

arrival\_num = 1

in a perfect world, thread 1 executes now and leaves the barrier

but what if the OS preempted thread 1? Or it was asleep?

```
num_threads == 2
counter == 0
```



| <pre>void barrier() {</pre>                                     |
|-----------------------------------------------------------------|
| <pre>int arrival_num = atomic_fetch_add(&amp;counter, 1);</pre> |
| <pre>if (arrival_num == num_threads) {</pre>                    |
| counter.store(0);                                               |
| }                                                               |
| else {                                                          |
| <b>while</b> (counter.load() != 0);                             |
| }                                                               |
| }                                                               |
|                                                                 |

Thread 1: B.barrier(); B.barrier();

enters next barrier

arrival\_num = 1

in a perfect world, thread 1 executes now and leaves the barrier

but what if the OS preempted thread 1? Or it was asleep?

num\_threads == 2
counter == 1





| Thread 1:               |
|-------------------------|
| <pre>B.barrier();</pre> |
| B.barrier();            |

arrival\_num == 1

arrival\_num = 1

in a perfect world, thread 1 executes now and leaves the barrier

but what if the OS preempted thread 1? Or it was asleep?

```
num_threads == 2
counter == 1
```





Thread 1 wakes up! Doesn't think its missed anything

arrival\_num == 1

arrival\_num = 1

in a perfect world, thread 1 executes now and leaves the barrier

```
num_threads == 2
counter == 1
```







Thread 1 wakes up! Doesn't think its missed anything

arrival\_num == 1

arrival\_num = 1

in a perfect world, thread 1 executes now and leaves the barrier

Both threads get stuck here!

```
<u>Thread 0:</u>
B.barrier();
B.barrier();
```

```
void barrier() {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
    }
    else {
        while (counter.load() != 0);
    }
}
```

```
<u>Thread 1:</u>
B.barrier();
B.barrier();
```

```
<u>Thread 0:</u>
B.barrier();
B.barrier();
```

```
void barrier() {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
    }
    else {
        while (counter.load() != 0);
    }
}
```

```
<u>Thread 1:</u>
B.barrier();
B.barrier();
```

Two different barriers that alternate?

```
<u>Thread 0:</u>
B0.barrier();
B1.barrier();
```

```
void barrier() {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
    }
    else {
        while (counter.load() != 0);
    }
}
```

```
<u>Thread 1:</u>
B0.barrier();
B1.barrier();
```

Two different barriers that alternate?

Pros: simple to implement

Cons: user has to alternate barriers

```
<u>Thread 0:</u>
B0.barrier();
B1.barrier();
```

```
void barrier() {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
    }
    else {
        while (counter.load() != 0);
    }
}
```

```
<u>Thread 1:</u>
B0.barrier();
B1.barrier();
```

Two different barriers that alternate?

Pros: simple to implement

Cons: user has to alternate barriers

```
B.barrier();
if (...) {
   B.barrier();
}
B.barrier();
```

How to alternate these calls?

## Sense Reversing Barrier

- Book Chapter 17
- Alternating "sense" dynamically



sync on sense = false

Thread 1: B.barrier(); B.barrier();

## Sense Reversing Barrier

- Book Chapter 17
- Alternating "sense" dynamically



sync on sense = true

<u>Thread 1:</u>
B.barrier();
B.barrier();

```
class SenseBarrier {
 private:
   atomic int counter;
   int num threads;
   atomic bool sense;
   bool thread sense[num threads];
 public:
   Barrier(int num threads) {
      counter = 0;
     this->num threads = num threads;
      sense = false;
     thread sense = {true, ...};
    void barrier(int tid) {
        int arrival num = atomic fetch add(&counter, 1);
        if (arrival num == num threads) {
           counter.store(0);
           sense = thread sense[tid];
        else {
          while (sense != thread sense[tid]);
        thread sense[tid] = !thread sense[tid];
```

thread\_sense = true

```
num_threads == 2
counter == 0
sense = false
```

thread\_sense = true

```
<u>Thread O:</u>
B.barrier();
B.barrier();
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

<u>Thread 1:</u>
B.barrier();
B.barrier();

thread\_sense = true arrival\_num = 2



```
num_threads == 2
counter == 2
sense = false
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = true
arrival_num = 1
```

B.barrier();

B.barrier();

Thread 1:





```
num_threads == 2
counter == 0
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
     while (sense != thread_sense[tid]);
    }
    thread sense[tid] = !thread_sense[tid];
}
```

thread\_sense = true arrival\_num = 1 <u>Thread 1:</u>

B.barrier();

B.barrier();



```
num_threads == 2
counter == 0
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = true
arrival_num = 1

<u>Thread 1:</u>
B.barrier();
B.barrier();
```

Remember the issue! Thread 1 went to sleep around this time and thread 0 went into the barrier again!



```
num_threads == 2
counter == 1
sense = true
```



```
thread_sense = true
arrival_num = 1
```

| <u>Thread 1:</u>        |
|-------------------------|
| <pre>B.barrier();</pre> |
| B.barrier();            |



```
num_threads == 2
counter == 1
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread sense[tid] = !thread sense[tid];
}
```

```
thread_sense = true
arrival_num = 1
<u>Thread 1:</u>
B.barrier();
```

B.barrier();

both are waiting!, but thread 1 can leave



```
num_threads == 2
counter == 1
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
     while (sense != thread_sense[tid]);
    }
    thread sense[tid] = !thread sense[tid];
}
```

```
thread_sense = false
arrival_num = 1
<u>Thread 1:</u>
B.barrier();
```

B.barrier();

both are waiting!, but thread 1 can leave



```
num_threads == 2
counter == 1
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread sense[tid] = !thread sense[tid];
}
```

```
thread_sense = false
arrival_num = 1
<u>Thread 1:</u>
B.barrier();
```

B.barrier();

```
Thread 1 finishes the barrier
```



```
num_threads == 2
counter == 1
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread sense[tid] = !thread sense[tid];
}
```

```
thread_sense = false
arrival_num = 1
<u>Thread 1:</u>
```

B.barrier();

B.barrier();



```
num_threads == 2
counter == 2
sense = true
```



```
thread_sense = false
arrival_num = 2
<u>Thread 1:</u>
B.barrier();
```

B.barrier();







thread 0 can leave, thread 1 can leave and the barrier works as expected!

# See you on Wedensday!

- Starting on module 4
- Get your midterm in!
- Work on HW 3