## CSE113: Parallel Programming

March 1, 2023

- Topics:
- Intro to module 4
- Barriers


## Announcements

- Midterm grades are out
- Let us know within 1 week if there are issues
- We are working on HW 2 grades and will strive to get them released by Monday
- Homework 3 is due today (last late day)
- We are hoping to get HW 4 assigned today
- There are some issues with the server we are trying to work out


## Previous quiz

When a CAS operation fails in a lock-free linked list, the implementation should:spin on the same CAS (like a mutex), it will eventually succeedthrow an exceptionretry the operation from the start, even if it means traversing the list againThe CAS operation will never fail because it is an atomic operation

## Previous quiz

An object wrapped in the C++ atomic template allows you to:
Check all that applyatomically execute method callsload and store the object atomicallyperform atomic CAS on the objectlocks each method call automatically

## Previous quiz

Which are possible ways that hardware implements RMW operations? check all that apply:special loads and stores that track if there has been a conflictsleeping all other threads while one thread accesses the locationhaving the hardware lock the cache lineflushing the pipeline to avoid ILP interleavings

## Previous quiz

It is the end of module 3: concurrent data structures:
Feel free to write a few sentences about how you found the module, e.g. what you found most interesting, what you found unclear, etc.

See you on Wednesday to start module 4!

## Review

- Optimizing the concurrent set

Single traversal operations

What could go wrong?


What could go wrong?


What could go wrong?


What could go wrong?


What could go wrong?


What could go wrong?


Fixed with logical flag


Fixed with logical flag


Fixed with logical flag


Fixed with logical flag


Fixed with logical flag


Fixed with logical flag


Fixed with logical flag


Fixed with logical flag


Fixed with logical flag


Fixed with logical flag


## Fixed with logical flag



## CAS based insertion

## Lock-free Lists



Adding

## Lock-free Lists



## Lock-free Lists



## Lock-free Lists



## Lock-free Lists



## Lock-free Lists



## Lock-free Lists



## Lock-free Lists



## Lock-free Lists



## Lock-free Lists

## Adding <br> Solution: use CAS <br> 



Find the location
create "d"
insert "d"

Can this just be a regular store?

Find the location
Cache your insertion point!
b.next ==e


Find the location
Cache your insertion point!
b.next ==e


Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location
Cache your insertion point!
b.next ==e
notion is being abused here: e and $c$ will be node *


Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location
Cache your insertion point!
b.next ==e
notion is being abused here: e and $c$ will be node *


Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location
Cache your insertion point!
b.next ==e
notion is being abused here: e and c will be node *


Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location
Cache your insertion point!
b.next ==e
notion is being abused here: e and c will be node *


Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location
Cache your insertion point!
b.next ==e
notion is being abused here: $e$ and $c$ will be node *


Only insert if your insertion point is valid!

CAS(b.next, e, c);

Find the location
Cache your insertion point!
b.next ==e
notion is being abused here: $e$ and $c$ will be node *


## Further considerations

- need to include "valid bit" in compare and swap to make sure the node is still valid
- Can "pack the bit" into the address (there will always be room because addresses are byte addressable, and addresses 8 bytes)
- More details in the book!


## Schedule

- Module 4 introduction
- Barriers
- Specification
- Implementation


## Schedule

- Module 4 introduction
- Barriers
- Specification
- Implementation


## Reasoning about concurrency

Mental model of concurrency

- Functional
- Interleavings - events from different threads can interleave
- Atomicity - what events are indivisible?
- Specifications - how can we create useful abstractions (mutexes, concurrent data structures)
- Performance
- Increase parallelism - judicious use of mutexes, load balancing
- Cache behaviors - threads should try to utilize their own cache lines
- Operating system - yielding/sleeping threads
- Architectural details - instruction-level parallelism


## Reasoning about concurrency

- Depending on your needs, programs become more/less complex to reason about.

Harder
Easier

using concurrent data structures

## Reasoning about concurrency

- Depending on your needs, programs become more/less complex to reason about.



## Reasoning about concurrency

- Depending on your needs, programs become more/less complex to reason about.

To get a more complete picture, we will fill in some of the gaps here


## Reasoning about concurrency

- Depending on your needs, programs become more/less complex to reason about.

To get a more complete picture, we will fill in some of the gaps here

Harder
Easier


## Reasoning about concurrency

- Depending on your needs, programs become more/less complex to reason about.



## Reasoning about concurrency

- Depending on your needs, programs become more/less complex to reason about.



## Reasoning about concurrency

- On newer architectures, you may only get specifications about the relaxed memory and progress. It is up to the user to implement their own atomics, mutexes, concurrent data structures, etc!



## Modern chips

- From David Brooks lab at Harvard:
http://vlsiarch.eecs.harvard. edu/research/accelerators/di e-photo-analysis/
- GPUs/accelerators will have different guarantees w.r.t. atomics and memory orderings



## Historical issues on Nvidia GPUs



Issue implementing a concurrent data structure related to memory orderings

(d)


Issue implementing a mutex relating to memory orderings

## Historical issues on Nvidia GPUs


(d)

In both cases, they had reasoned about their implementations exactly how we have! The issues came from lower in the stack: memory orderings

## running a mutex on an iPad GPU?



This one has to do with functional properties of the scheduler

## Reasoning about concurrency

- On newer architectures, you may only get specifications about the relaxed memory and progress. It is up to the user to implement their own atomics, mutexes, concurrent data structures, etc!



## Schedule

- Module 4 introduction
- Barriers
- Specification
- Implementation


## Barriers

- Why do barriers fit into this module: "Reasoning About Parallel Computing"?
- Relaxed Memory Models make reasoning about parallel computing HARD
- Barriers make it EASIER (at the cost of performance potentially)
- A barrier is a concurrent object (like a mutex):
- Only one method: barrier (called await in the book)
- Separates computational phases


## Barrier Examples

My current favorite: particle simulation

## Barrier Examples

My current favorite: particle simulation

time $=0$

time $=1$

time $=2$

## Barrier Examples

My current favorite: particle simulation

time $=0$

time $=1$

time $=2$
at each time, compute
new positions for each particle
(in parallel)

## Barrier Examples

My current favorite: particle simulation

time $=0$

time $=1$

time $=2$
at each time, compute
new positions for each particle (in parallel)

But you need to wait for all particles to be computed before starting the next time step

## Barrier Examples

- Deep neural networks

hidden layer 1 hidden layer 2


## Barrier Examples

- Deep neural networks

hidden layer 1 hidden layer 2


## Barriers

- Intuition: threads stop and wait for each other:
- Threads arrive at the barrier
- Threads wait at the barrier
- Threads leave the barrier once all other threads have arrived


## Barriers

- Intuition: threads stop and wait for each other:
- Threads arrive at the barrier
- Threads wait at the barrier
- Threads leave the barrier once all other threads have arrived

Example: say there are 4 threads: barrier();


## Barriers

- Intuition: threads stop and wait for each other:
- Threads arrive at the barrier
- Threads wait at the barrier
- Threads leave the barrier once all other threads have arrived
Example: say there are 4 threads:
thread 0 arrives


## Barriers

- Intuition: threads stop and wait for each other:
- Threads arrive at the barrier
- Threads wait at the barrier
- Threads leave the barrier once all other threads have arrived

| Example: say there are 4 threads: |  |
| :--- | :--- |
| $\xrightarrow[\text { thread } 0 \text { waits }]{\text { thread } 1 \text { arrives }}$ |  |
|  |  |

## Barriers

- Intuition: threads stop and wait for each other:
- Threads arrive at the barrier
- Threads wait at the barrier
- Threads leave the barrier once all other threads have arrived

| Example: say there are 4 threads: |  |
| :--- | :--- |
| $\xrightarrow[\text { thread } 0 \text { waits }]{\text { thread } 1 \text { waits }}$ |  |
|  |  |

## Barriers

- Intuition: threads stop and wait for each other:
- Threads arrive at the barrier
- Threads wait at the barrier
- Threads leave the barrier once all other threads have arrived
$\xrightarrow{\substack{\text { Example: say there are } 4 \text { threads: } \\ \xrightarrow[\text { thread } 1 \text { waits }]{\text { thread } 2 \text { waits }}}}$


## Barriers

- Intuition: threads stop and wait for each other:
- Threads arrive at the barrier
- Threads wait at the barrier
- Threads leave the barrier once all other threads have arrived
$\xrightarrow{\substack{\text { Example: say there are } 4 \text { threads: } \\ \text { thread } 0 \text { waits } \\ \text { thread } 1 \text { waits } \\ \text { thread } 2 \text { waits }}}$ now that they have all arrived


## Barriers

- Intuition: threads stop and wait for each other:
- Threads arrive at the barrier
- Threads wait at the barrier
- Threads leave the barrier once all other threads have arrived


A more formal specification

## Given a global barrier B

and a global memory location x where initially *x $=0$;

First, what would we expect var to be after this program?

```
Thread 1:
B.barrier();
var = *x;
```

A more formal specification

## Given a global barrier B

and a global memory location x where initially ${ }^{*} x=0$;

## Thread 1: <br> B.barrier(); <br> var $=$ *x;

gives an event:
barrier arrive

## A more formal specification

## Given a global barrier B

and a global memory location x where initially ${ }^{*} x=0$;

## Thread 1: <br> B.barrier(); <br> var $=$ *x;

gives an event:
barrier arrive
barrier arrive needs to wait for all threads to arrive (similar to how a mutex request must wait for another to release)

A more formal specification

## Given a global barrier B

and a global memory location x where initially *x $=0$;

## Thread 1: <br> B.barrier(); <br> var $=$ *x;


thread 1 barrier arrive

A more formal specification

## Given a global barrier B

and a global memory location x where initially *x $=0$;

## Thread 1: <br> B.barrier(); <br> var $=$ *x;

thread $0 \quad{ }^{*} \times 1 \quad$ barrier arrive

A more formal specification

## Given a global barrier B

and a global memory location x where initially *x $=0$;

## Thread 1: <br> B.barrier(); <br> var $=$ *x;

now that all threads have arrived:
They can leave (1 event at the same time)


## A more formal specification

## Given a global barrier B

and a global memory location x where initially ${ }^{*} x=0$;

## Thread 1: <br> B.barrier() ; <br> var $=$ *x;

This finishes the barrier execution


## A more formal specification

## Given a global barrier B

and a global memory location x where initially ${ }^{*} x=0$;

## Thread 1: <br> B.barrier() ; <br> var $=* x$;

what value must this read? Any other value possible?


One more example, assume initially $* x=* y=0$

| Thread 0: |
| :--- |
| $\star_{\mathrm{x}=1}=1 ;$ <br> B. barrier ( $) ;$ |

$$
\begin{aligned}
& \text { Thread 1: } \\
& \text { * }^{\mathrm{y}=2 ;} \\
& \text { B. } \operatorname{barrier}(\mathrm{)} \text {; }
\end{aligned}
$$

## Thread 2: <br> B.barrier(); <br> var $=* x+* y ;$

One more example, assume initially $* x=* y=0$

| Thread 0: |
| :--- |
| $*_{x=1 ;} ;$ |
| B.barrier () ; |

$$
\begin{aligned}
& \text { Thread 1: } \\
& \hline \text { *y }=2 ; \\
& \text { B.barrier ( ) ; }
\end{aligned}
$$

```
Thread 2:
B.barrier();
var = *x + *y;
```

One more example, assume initially $* x=* y=0$

| Thread 0: |
| :--- |
| $\star_{\mathrm{x}=1}=1 ;$ <br> B. barrier (); |

$$
\begin{aligned}
& \text { Thread 1: } \\
& \text { *y = 2; } \\
& \text { B. barrier(); }
\end{aligned}
$$

## Thread 2: <br> B.barrier(); <br> var $=*_{x}+* y$;

```
thread 0 **x=1
```

thread $1 \quad{ }^{*} y=2$
thread 2
barrier arrive

One more example, assume initially $* x=* y=0$

| Thread 0: |
| :--- |
| $\mathrm{*x}_{\mathrm{x}=1 ;} ;$ |
| B. barrier (); |

$$
\begin{aligned}
& \text { Thread 1: } \\
& \text { *y }=2 \text {; }^{\text {B. barrier() ; }}
\end{aligned}
$$

```
Thread 2:
B.barrier();
var = *x + *y;
```

```
thread 0 ***=1 barrier arrive
```

thread $1 \quad{ }^{*} y=2 \quad$ barrier arrive

One more example, assume initially $* x=* y=0$

| Thread 0 : |
| :--- |
| $\star_{\mathrm{x}=1 ;} ;$ |
| B. barrier ()$;$ |

## Thread 1: <br> *y = 2; <br> B.barrier();

Thread 2:
B.barrier();
var $=*_{x}+* y$;

They've all arrived


One more example, assume initially $* x=* y=0$

| Thread 0: |
| :--- |
| $* x=1 ;$ |
| B.barrier () ; |

```
Thread 1:
*y = 2;
B.barrier();
```


## Thread 2: <br> B.barrier(); <br> var $=* x+* y$;

They've all arrived


One more example, assume initially $* x=* y=0$

| Thread 0: |
| :--- |
| $* x=1 ;$ |
| B.barrier () ; |

```
Thread 1:
*y = 2;
B.barrier();
```


## Thread 2: <br> B.barrier(); <br> var $=$ *x + *y;


barrier leave $\quad$ var $={ }^{*} x+{ }^{*} y$

One more example, assume initially $* x=* y=0$

## Thread 0: <br> *x $=1$; <br> B.barrier();

```
Thread 1:
*y = 2;
B.barrier();
```


## Thread 2:

B.barrier();
var $=* x+* y ;$
sometimes called a phase

| Barrier Interval 0 | Barrier Interval 1 |
| :---: | :---: |



## Barriers

- Barrier Property:
- If the only concurrent object you use in your program is a barrier (no mutexes, concurrent data-structures, atomic accesses)
- If every barrier interval contains no data conflicts, then
your program will be deterministic (only 1 outcome allowed)
- much easier to reason about :)

Assume we are reading from $x$

We are only allowed to return one possible value

| Barrier Interval N-3 | Barrier Interval N-2 | Barrier Interval N-1 | Barrier Interval N |
| :--- | :--- | :--- | :--- |


no data conflicts means that $x$ is written to at most once per barrier interval

Assume we are reading from $x$

We are only allowed to return one possible value

| Barrier Interval N-3 | Barrier Interval N-2 | Barrier Interval N-1 | Barrier Interval N |
| :--- | :--- | :--- | :--- |


no data conflicts means that $x$ is written to at most once per barrier interval
not allowed

Assume we are reading from $x$

We are only allowed to return one possible value

| Barrier Interval N-3 | Barrier Interval N-2 | Barrier Interval N-1 | Barrier Interval N |
| :--- | :--- | :--- | :--- |


no data conflicts means that $x$ is written to at most once per barrier interval
we will read from the write from the most recent barrier interval

Assume we are reading from $x$

We are only allowed to return one possible value

| Barrier Interval N-3 | Barrier Interval N-2 | Barrier Interval N-1 | Barrier Interval N |
| :--- | :--- | :--- | :--- |



## Schedule

- Module 4 introduction
- Barriers
- Specification
- Implementation


## Barrier Implementation

- First attempt at implementation

```
class Barrier {
    private:
    atomic_int counter;
    int num_threads;
    public:
    Barrier(int num_threads) {
        counter = 0;
        this->num_threads = num_threads;
    }
    void barrier() {
            // ??
        }
}
```


## Barrier Implementation

```
class Barrier {
    private:
        atomic_int counter;
        int num_threads;
    public:
        Barrier(int num_threads) {
            counter = 0;
            this->num_threads = num_threads;
        }
            void barrier() {
                int arrival_num = atomic_fetch_add(&counter, 1);
            // What next?
        }
}
```


## Barrier Implementation

First handle the case where the thread is the last thread to arrive

```
class Barrier {
    private:
        atomic_int counter;
        int num_threads;
    public:
        Barrier(int num_threads) {
        counter = 0;
        this->num_threads = num_threads;
    }
    void barrier() {
            int arrival_num = atomic_fetch_add(&counter, 1);
            if (arrival_num == num_threads - 1) {
                counter.store(0);
            }
            // What next?
        }
}
```


## Barrier Implementation

Spin while there is a thread waiting at the barrier

```
class Barrier {
    private:
        atomic_int counter;
        int num_threads;
    public:
    Barrier(int num_threads) {
        counter = 0;
        this->num_threads = num_threads;
    }
    void barrier() {
        int arrival_num = atomic_fetch_add(&counter, 1);
            if (arrival_num == num_threads - 1) {
                counter.store(0);
            }
            else {
                while (counter.load() != 0);
            }
        }
}
```


## Barrier Implementation

Spin while there is a thread waiting at the barrier

Does this work?

```
class Barrier {
    private:
        atomic_int counter;
        int num_threads;
    public:
    Barrier(int num_threads) {
        counter = 0;
        this->num_threads = num_threads;
    }
    void barrier() {
        int arrival_num = atomic_fetch_add(&counter, 1);
            if (arrival_num == num_threads - 1) {
                counter.store(0);
            }
            else {
                while (counter.load() != 0);
            }
        }
}
```


## Thread 0:

B.barrier();
B.barrier();
void barrier() \{

```
int arrival_num = atomic_fetch_add(&counter, 1);
```

    if (arrival_num \(==\) num_threads) \{
        counter.store (0);
    \}
    else \{
        while (counter.load() ! = 0);
    \}
    \}

## Thread 1: <br> B.barrier(); <br> B.barrier();

```
num_threads == 2
```


## Thread 0: <br> B.barrier(); <br> B.barrier();

void barrier() \{

```
int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads - 1) {
        counter.store(0);
    }
    else {
        while (counter.load() != 0);
```

    \}
    \}
arrival_num = 1

## Thread 1: <br> B.barrier(); <br> B.barrier();

```
num_threads == 2
counter == 2
```


## Thread 0: <br> B.barrier(); <br> B.barrier();

void barrier() \{
int arrival_num = atomic_fetch_add(\&counter, 1); if (arrival_num $==$ num_threads - 1) \{
counter.store (0);
\}
else
while (counter.load() $!=0$ )
\}
\}
arrival_num = 1

## Thread 1: <br> B.barrier(); <br> B.barrier();

```
num_threads == 2
counter == 0
```


## Thread 0: <br> B.barrier(); <br> B.barrier();

arrival_num = 1
void barrier() \{
int arrival_num = atomic_fetch_add(\&counter, 1); if (arrival_num == num_threads - 1) \{
counter.store(0);
\}
else \{
while (counter.load() ! = 0)
\}
\}

## Thread 1: <br> B.barrier(); <br> B.barrier();

```
num_threads == 2
counter == 0
```


## Thread 0: <br> B.barrier(); <br> B.barrier();

Leaves barrier

```
void barrier() {
```

void barrier() {
int arrival_num = atomic_fetch_add(\&counter, 1);
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads - 1) {
if (arrival_num == num_threads - 1) {
counter.store(0);
counter.store(0);
}
}
else
else
while (counter.load() != 0);
while (counter.load() != 0);
}
}
}

```

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
in a perfect world,
thread 1 executes now and leaves the barrier
```

num_threads == 2
counter == 0

```

\section*{Thread 0:}
B.barrier();
B.barrier();
```

void barrier() {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads - 1) {
counter.store(0);
}
else {
while (counter.load() != 0);
}
}

```

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
\[
\text { arrival_num = } 0
\]
in a perfect world,
thread 1 executes now and leaves the barrier
but what if the OS preempted thread 1? Or it was asleep?
```

num_threads == 2
counter == 0

```

\section*{Thread 0:}
B.barrier() ;
B.barrier();
```

void barrier()
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads - 1) {
counter.store(0);
}
else {
while (counter.load() != 0);
}
}

```

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
\[
\text { arrival_num = } 0
\]
in a perfect world,
thread 1 executes now and leaves the barrier
but what if the OS preempted thread 1? Or it was asleep?
```

num_threads == 2
counter == 1

```

\section*{Thread 0: \\ B.barrier(); B.barrier();}
void barrier() \{
    int arrival_num = atomic_fetch_add(\&counter, 1);
    if (arrival_num == num_threads - 1)
        counter.store(0);
    \}
    else \{
        while (counter.load() != 0);
    \}
\}

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
\[
\text { arrival_num = } 0
\]
in a perfect world,
thread 1 executes now and leaves the barrier
but what if the OS preempted thread 1? Or it was asleep?
```

num_threads == 2
counter == 1

```

\section*{Thread 0:}
B.barrier() ;
B.barrier();
```

void barrier() {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads - 1) {
counter.store(0);
}
else {
while (counter.load() != 0)
}
}

```

Thread 1 wakes up! Doesn't think its missed anything
in a perfect world,
thread 1 executes now and leaves the barrier

\section*{Thread 1: \\ B.barrier(); B.barrier();}
\[
\text { arrival_num = } 0
\]
arrival_num == 0
```

num_threads == 2
counter == 1

```

\section*{Thread 0:}
B.barrier() ;
B.barrier();
```

void barrier() {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads - 1) {
counter.store(0);
}
else
while (counter.load() != 0);
}
}

```

Thread 1 wakes up! Doesn't think its missed anything
in a perfect world,
thread 1 executes now and leaves the barrier

Both threads get stuck here!

\section*{Thread 0:}
B.barrier();
B.barrier();
void barrier() \{
```

int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads - 1) {
counter.store(0);
}
else {
while (counter.load() != 0);

```
    \}
\}

Ideas for fixing?

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}

\section*{Thread 0:}
B.barrier();
B.barrier();
```

void barrier() {

```
    int arrival_num = atomic_fetch_add(\&counter, 1);
    if (arrival_num == num_threads - 1) \{
        counter.store(0);
    \}
    else \{
        while (counter.load() != 0);
    \}
\}

Ideas for fixing?
Two different barriers that alternate?

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}

Thread 0:
B0.barrier(); B1.barrier();
```

void barrier() {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads - 1) {
counter.store(0);
}
else
while (counter.load() != 0);
}
}

```

Ideas for fixing?
Two different barriers that alternate?
Pros: simple to implement
Cons: user has to alternate barriers

\section*{Thread 1: \\ B0.barrier(); \\ B1.barrier();}

Thread 0:
B0.barrier();
B1.barrier();
```

void barrier() {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads - 1) {
counter.store(0);
}
else
while (counter.load() != 0);
}
}

```

\section*{Thread 1: \\ B0.barrier(); \\ B1.barrier();}

Ideas for fixing?
Two different barriers that alternate?
Pros: simple to implement
Cons: user has to alternate barriers
B.barrier();
if (...) \{
    B.barrier();
\}
B.barrier();

How to alternate these calls?

\section*{Sense Reversing Barrier}
- Book Chapter 17
- Alternating "sense" dynamically
```

Thread 0:
B.barrier();
B.barrier();

```
sync on sense \(=\) false
```

Thread 1:
B.barrier();
B.barrier();

```

\section*{Sense Reversing Barrier}
- Book Chapter 17
- Alternating "sense" dynamically
```

Thread 0:
B.barrier();
B.barrier();

```
sync on sense \(=\) true
```

Thread 1:
B.barrier();
B.barrier();

```
```

class SenseBarrier {
private:
atomic int counter;
int num_threads;
atomic_bool sense;
bool thread_sense[num_threads];
public:
Barrier(int num_threads) {
counter = 0;
this->num_threads = num_threads;
sense = false;
thread_sense = {true, ...};
}
void barrier(int tid) {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads) {
counter.store(0);
sense = thread_sense[tid];
}
else {
while (sense != thread_sense[tid]);
}
thread_sense[tid] = !thread_sense[tid];
}
}

```
num_threads == 2
thread_sense \(=\) true

\section*{Thread 0:}
B.barrier();
B.barrier() ;
counter == 0
sense = false
```

void barrier(int tid) {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads-1) {
counter.store(0);
sense = thread_sense[tid];
}
else {
while (sense != thread_sense[tid]);
}
thread_sense[tid] = !thread_sense[tid];
}

```
num_threads == 2
thread sense \(=\) true
arrival_num = 1

\section*{Thread 0:}
B.barrier() ;
B.barrier();
counter \(==2\)
sense = false
void barrier(int tid) \{
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads-1) \{
counter.store (0);
sense = thread_sense[tid];
\}
else \{
while (sense ! = thread_sense[tid]);
\}
thread_sense[tid] = !thread_sense[tid];
\}
thread_sense = true arrival_num \(=0\)

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
num_threads \(==2\)
thread sense \(=\) true
arrival_num = 1

\section*{Thread 0:}
B.barrier() ;
B.barrier();
counter \(==2\)
sense = false
void barrier(int tid) \{
int arrival_num = atomic_fetch_add(\&counter, 1); if (arrival_num == num_threads-1) \{
counter.store (0) ;
sense \(=\) thread sense[tid];
else while (sense ! = thread_sense[tid]);
\}
thread_sense[tid] = !thread_sense[tid];
thread_sense = true
arrival_num \(=0\)

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
num_threads \(==2\)
thread sense \(=\) false
arrival_num = 1

\section*{Thread 0:}
B.barrier() ;
B.barrier();
counter \(==0\)
sense \(=\) true
```

void barrier(int tid) {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads-1) {
counter.store(0);
sense = thread_sense[tid];
}
else
while (sense != thread sense[tid]);
}
thread_sense[tid] = !thread_sense[tid];

```
        thread_sense = true
    arrival_num = 0

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
thread sense \(=\) false
arrival_num = ?

\section*{Thread 0:}
B.barrier(); B.barrier();
```

num_threads == 2
num_threads $==2$

```
counter \(==0\)
sense \(=\) true
```

void barrier(int tid) {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads-1) {
counter.store(0);
sense = thread_sense[tid];
}
else {
while (sense != thread sense[tid])
}
thread_sense[tid] = !thread_sense[tid];
}

```
thread_sense \(=\) true arrival_num \(=0\)

\section*{Thread 1: \\ B.barrier(); B.barrier();}

Remember the issue! Thread 1 went to sleep around this time and thread 0 went into the barrier again!
num_threads \(==2\)
thread sense \(=\) false
arrival_num = 0

\section*{Thread 0:}
B.barrier();
B.barrier();
counter == 1
sense \(=\) true
void barrier (int tid)
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads-1) \{
counter.store (0);
sense \(=\) thread_sense[tid];
\}
else \{
while (sense ! = thread_sense[tid]);
\}
thread_sense[tid] = !thread_sense[tid];
thread_sense = true
arrival_num \(=0\)

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
num_threads \(==2\)
thread sense = false
arrival_num = 0

\section*{Thread 0:}
B.barrier () ; B.barrier () ;
counter == 1
sense \(=\) true
```

                                    thread_sense = true
    arrival_num = 0
    ```

\section*{Thread 1: \\ B.barrier(); B.barrier();}
```

void barrier(int tid) {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads-1) {
counter.store(0);
sense = thread_sense[tid];
}
else {
while (sense != thread_sense[tid]);
}
thread_sense[tid] = !thread_sense[tid];
}

```
both are waiting!, but thread 1 can leave
num_threads \(==2\)
thread sense \(=\) false
arrival_num = 0

\section*{Thread 0:}
B.barrier(); B.barrier();
counter \(==1\)
sense \(=\) true
```

thread_sense = false arrival_num = 0

```

\section*{Thread 1: \\ B.barrier(); B.barrier() ;}
both are waiting!, but thread 1 can leave
num_threads \(==2\)
thread sense = false
arrival_num = 0

\section*{Thread 0:}
B.barrier () ; B.barrier () ;
counter \(==1\)
sense \(=\) true
```

thread_sense = false arrival_num = ?

```

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
num_threads \(==2\)
thread sense = false
arrival_num = 0

\section*{Thread 0:}
B.barrier();
B.barrier();
counter \(==1\)
sense \(=\) true
\[
\begin{gathered}
\text { thread_sense }=\text { false } \\
\text { arrival_num }=?
\end{gathered}
\]

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
```

void barrier(int tid) {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads-1) {
counter.store(0);
sense = thread_sense[tid];
}
else
while (sense != thread_sense[tid]);
}
thread_sense[tid] = !thread_sense[tid];
}

```
num_threads \(==2\)
thread sense = false
arrival_num = 0

\section*{Thread 0:}
B.barrier(); B.barrier();
counter \(==2\)
sense \(=\) true
```

void barrier(int tid) {
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads-1) {
counter.store(0);
sense = thread_sense[tid];
}
else
while (sense != thread_sense[tid]);
}
thread_sense[tid] = !thread_sense[tid];
}

```
thread_sense = false arrival_num = 1

\section*{Thread 1: \\ B.barrier(); \\ B.barrier() ;}
num_threads \(==2\)
thread sense = false
arrival_num = 0
Thread 0:
B.barrier();
B.barrier();
counter \(==2\)
sense \(=\) true
void barrier(int tid)
int arrival_num = atomic_fetch_add(\&counter, 1); if (arrival_num == num_threads-1) \{

                while (sense != thread_sense[tid]);
\}
thread_sense[tid] = !thread_sense[tid];
thread_sense = false arrival_num = 1

\section*{Thread 1: \\ B.barrier(); \\ B.barrier() ;}
num_threads \(==2\)
thread sense \(=\) false
arrival_num = 0

\section*{Thread 0:}
B.barrier();
B.barrier();
counter == 0
sense = false
thread_sense = false
\[
\text { arrival_num = } 1
\]

\section*{Thread 1: \\ B.barrier() ; \\ B.barrier();}
void barrier(int tid)
int arrival_num = atomic_fetch_add(\&counter, 1); if (arrival_num == num_threads-1) \{

    \}
    thread_sense[tid] = !thread_sense[tid];
    \}
thread sense \(=\) false
arrival_num = 0

\section*{Thread 0:}
B.barrier () ; B.barrier() ;
```

num_threads == 2
num_threads == 2

```
counter \(==0\)
sense \(=\) false
void barrier(int tid) \{
int arrival_num = atomic_fetch_add(\&counter, 1);
if (arrival_num == num_threads-1) \{

                while (sense != thread_sense[tid]);
\}
thread_sense[tid] = !thread_sense[tid];
\}
thread_sense = false arrival_num = 1

\section*{Thread 1: \\ B.barrier(); \\ B.barrier();}
thread 0 can leave, thread 1 can leave and the barrier works as expected!

\section*{See you on Wednesday!}
- Starting on module 4
- Work on HW 3```

