#### CSE113: Parallel Programming March 3, 2023

#### • Topics:

- Finish up barriers
- Memory consistency models:
  - Total store order
  - Relaxed memory consistency
  - Examples



#### Announcements

- Midterm grades are out
  - Let us know within 1 week if there are issues
- We are working on HW 2 grades and will strive to get them released by Monday
- Homework 4 is out!
  - Should be able to do most of it (part 1 and some of the others) by end of lecture

Barrier objects have how many API calls?

| 0 0        |  |  |  |
|------------|--|--|--|
| 01         |  |  |  |
| ○ 2        |  |  |  |
| <b>○</b> 3 |  |  |  |

A barrier call emits which of the following events? Check all that apply

| barrier_lock    |  |  |
|-----------------|--|--|
| barrier_arrive  |  |  |
| barrier_enqueue |  |  |
| barrier_leave   |  |  |

If a program uses both barriers and mutexes, the outcome is deterministic (i.e. the same every time) if there are no data conflicts

⊖ True

 $\bigcirc$  False

# Example

Write a few sentences about what you think the best interface for parallel programming is: that is: do you think it is atomics? Mutexes? Concurrent Data Structures? Barriers? Or even maybe the compiler should simply do it all automatically? Or is it some combination of the above? What are the trade-offs involved?

- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived

Example: say there are 4 threads:



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



- Intuition: threads stop and wait for each other:
  - Threads *arrive* at the barrier
  - Threads *wait* at the barrier
  - Threads *leave* the barrier once all other threads have arrived



```
num_threads == 2
counter == 1
```



| <pre>void barrier() {</pre>                                     |
|-----------------------------------------------------------------|
| <pre>int arrival_num = atomic_fetch_add(&amp;counter, 1);</pre> |
| <pre>if (arrival_num == num_threads - 1) {</pre>                |
| counter.store(0);                                               |
| }                                                               |
| else {                                                          |
| <pre>while (counter.load() != 0);</pre>                         |
| }                                                               |
| }                                                               |



Thread 1 wakes up! Doesn't think its missed anything

arrival\_num == 0

arrival\_num = 0

in a perfect world, thread 1 executes now and leaves the barrier

Both threads get stuck here!

#### Sense Reversing Barrier

- Book Chapter 17
- Alternating "sense" dynamically



sync on sense = true

<u>Thread 1:</u>
B.barrier();
B.barrier();

```
class SenseBarrier {
 private:
   atomic int counter;
   int num threads;
   atomic bool sense;
   bool thread sense[num threads];
 public:
   Barrier(int num threads) {
      counter = 0;
     this->num threads = num threads;
      sense = false;
     thread sense = {true, ...};
    void barrier(int tid) {
        int arrival num = atomic fetch add(&counter, 1);
        if (arrival num == num threads) {
           counter.store(0);
           sense = thread sense[tid];
        else {
          while (sense != thread sense[tid]);
        thread sense[tid] = !thread sense[tid];
```

thread\_sense = true

```
num_threads == 2
counter == 0
sense = false
```

thread\_sense = true

```
<u>Thread O:</u>
B.barrier();
B.barrier();
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads-1) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
<u>Thread 1:</u>
B.barrier();
B.barrier();
```

thread\_sense = true arrival\_num = 1



```
num_threads == 2
counter == 2
sense = false
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads-1) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = true
arrival_num = 0
Thread 1:
```

B.barrier();

B.barrier();





```
num_threads == 2
counter == 0
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads-1) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = true
arrival_num = 0
<u>Thread 1:</u>
B.barrier();
```

B.barrier();



```
num_threads == 2
counter == 0
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads-1) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = true
arrival_num = 0
<u>Thread 1:</u>
B.barrier();
```

B.barrier();

Remember the issue! Thread 1 went to sleep around this time and thread 0 went into the barrier again!



```
num_threads == 2
counter == 1
sense = true
```



```
thread_sense = true
arrival_num = 0
Thread 1:
```

B.barrier();

B.barrier();



```
num_threads == 2
counter == 1
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads-1) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread sense[tid] = !thread sense[tid];
}
```

```
thread_sense = true
arrival_num = 0
<u>Thread 1:</u>
B.barrier();
```

B.barrier();

```
both are waiting!,
but thread 1 can leave
```



```
num_threads == 2
counter == 1
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads-1) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread sense[tid] = !thread sense[tid];
}
```

```
thread_sense = false
arrival_num = 0
<u>Thread 1:</u>
B.barrier();
```

B.barrier();

both are waiting!, but thread 1 can leave



```
num_threads == 2
counter == 1
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads-1) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread sense[tid] = !thread sense[tid];
}
```

```
thread_sense = false
arrival_num = ?
<u>Thread 1:</u>
B.barrier();
```

B.barrier();

Thread 1 finishes the barrier



```
num_threads == 2
counter == 1
sense = true
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads-1) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread sense[tid] = !thread sense[tid];
}
```

```
thread_sense = false
arrival_num = ?
<u>Thread 1:</u>
B.barrier();
B.barrier();
```



```
num_threads == 2
counter == 2
sense = true
```



```
thread_sense = false
arrival_num = 1
```









thread 0 can leave, thread 1 can leave and the barrier works as expected!

# Moving on to memory consistency!

• One of my favorite topics!

### Moving on to memory consistency!

- One of my favorite topics!
- What do other people think?

Look, memory ordering pretty much \_is\_ the rocket science of CS, but the C standards committee basically made it a ton harder by specifying "we have to make the rocket out of duct tape and bricks, and only use liquid hydrogen as a propellant".

Linus

# Memory Consistency

- We have been very strict about using atomic types in this class
  - and the methods (.load and .store)
  - why?
  - Architectures do very strange things with memory loads and stores
  - Compilers do too (but we won't talk too much about them today)
  - C++ gives us sequential consistency if we use atomic types and operations
  - What do we remember sequential consistency from?

# Sequential consistency for atomic memory

• Let's play our favorite game:

#### **Global variable:**

atomic\_int x(0); atomic\_int y(0);

#### Thread 0:

x.store(1);
y.store(1);

Is it possible for t0 == 0 and t1 == 1

#### <u>Thread 1:</u> int t0 = y.load(); int t1 = x.load();

atomic\_int x(0); atomic\_int y(0);

#### Thread 0:

x.store(1);
y.store(1);

Is it possible for t0 == 0 and t1 == 1

<u>Thread 1:</u> int t0 = y.load(); int t1 = x.load();

#### x.store(1);

y.store(1);

int t1 = x.load();

### Global variable:

atomic\_int x(0); atomic\_int y(0);

#### Thread 0:

x.store(1);
y.store(1);

```
Is it possible for
t0 == 0 and t1 ==1
```

```
int t0 = y.load();
x.store(1);
y.store(1);
int t1 = x.load();
```

<u>Thread 1:</u> int t0 = y.load(); int t1 = x.load();

### **Global variable:**

atomic\_int x(0); atomic\_int y(0);

#### Thread 0:

x.store(1);
y.store(1);

#### How about:

Is it possible for t0 == 1 and t1 == 0

### <u>Thread 1:</u> int t0 = y.load(); int t1 = x.load();

#### x.store(1);

y.store(1);

int t1 = x.load();



atomic\_int x(0); atomic\_int y(0);

<u>Thread 0:</u> x.store(1); int t0 = y.load(); Another test Can t0 == t1 == 0?

<u>Thread 1:</u>
y.store(1);
int t1 = x.load();

atomic\_int x(0); atomic\_int y(0);

<u>Thread 0:</u> x.store(1); int t0 = y.load(); Another test Can t0 == t1 == 0?

<u>Thread 1:</u>
y.store(1);
int t1 = x.load();

int t0 = y.load();

y.store(1);



atomic\_int x(0); atomic\_int y(0);



Another test

Can t 0 == t 1 == 0?

<u>Thread 1:</u>
y.store(1);
int t1 = x.load();

- Plain atomic accesses are documented to be sequentially consistent (SC)
- Why wasn't SC very good for concurrent data structures?
  - Compossibility: two objects that are SC might not be SC when used together
  - Programs contain only 1 shared memory though; no reason to compose different main memories.

## What about ISAs?

- Remember, it is important for us to understand how our code executes on the architecture to write high performing programs
- Lets think about x86
  - Instructions:
  - MOV %t0 [x] loads the value at x to register t0
  - MOV [y] 1 stores the value 1 to memory location y

int x[1] = {0}; int y[1] = {0};

| <u>Thread 0:</u> |      |     |  |
|------------------|------|-----|--|
| mov              | [X], | 1   |  |
| mov              | %t0, | [Y] |  |

Another test Can t0 == t1 == 0?

| Thread 1: |       |
|-----------|-------|
| mov [y],  | 1     |
| mov %t1,  | [ X ] |

int x[1] = {0}; int y[1] = {0};

| 7 |  | h | r | е | a | d | 0 | • |
|---|--|---|---|---|---|---|---|---|
|---|--|---|---|---|---|---|---|---|

mov [x], 1 mov %t0, [y]

mov [x], 1

Another test Can t0 == t1 == 0?

| <u>Threc</u> | nd 1: |     |
|--------------|-------|-----|
| mov          | [Y],  | 1   |
| mov          | %t1,  | [X] |

### Global variable:

int x[1] = {0}; int y[1] = {0};

| Thread 0: |              |     |
|-----------|--------------|-----|
| mov       | [X],         | 1   |
| mov       | %t0 <b>,</b> | [y] |
|           | •            |     |

Another test Can t0 == t1 == 0?

mov %t1, [x]

mov [x], 1

mov %t0, [y]



no place for this event!

# What if we actually run this code?

• We'd like to be able to compile atomic instructions just to regular ISA loads and stores

## Schedule

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency
  - Examples



<u>Thread 1:</u>

mov [y], 1

mov %t1, [x]

Core 1

| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |























### Thread 1:

Execute next instruction









### <u>Thread 1:</u>

Values get loaded from memory



<u>Thread 0:</u>

<u>Thread 1:</u>

we see t0 == t1 == 0!





Main Memory

x:0

y:0



## Our first relaxed memory execution!

- also known as weak memory behaviors
- An execution that is NOT allowed by sequential consistency
- A memory model that allows relaxed memory executions is known as a relaxed memory model
  - X86 has a relaxed memory model due to store buffering
  - If you restrict yourself to use only default atomic operations, C++ has does NOT have a weak memory model

### Litmus tests

- Small concurrent programs that check for relaxed memory behaviors
- Vendors have a long history of under documented memory consistency models
- Academics have empirically explored the memory models
  - Many vendors have unofficially endorsed academic models
  - X86 behaviors were documented by researchers before Intel!

### Litmus tests

This test is called "store buffering"

| <u>Thread 0:</u> |      |     |  |
|------------------|------|-----|--|
| mov              | [X], | 1   |  |
| mov              | %t0, | [Y] |  |

| Threa | nd 1: |     |
|-------|-------|-----|
| mov   | [Y],  | 1   |
| mov   | %t1,  | [x] |

$$Can t 0 == t 1 == 0?$$

### Restoring sequential consistency

- It is typical that relaxed memory models provide special instructions which can be used to disallow weak behaviors.
- These instructions are called Fences
- The X86 fence is called mfence. It flushes the store buffer.





### <u>Thread 0:</u>

























execute next instruction





values are loaded from memory



Thread 1:

#### We don't get the problematic behavior: t0 == t1 == 0



# Next example



single thread same address

possible outcomes: t0 = 1 t0 = 0

Which one do you expect?

| x:0 |             |  |
|-----|-------------|--|
| y:0 | Main Memory |  |
|     |             |  |

| Thread O: | -            |                        |
|-----------|--------------|------------------------|
| mov [x],  | 1            | How does this execute? |
| mov %t0,  | [X]          |                        |
| Core 0    | Store Buffer |                        |
|           |              |                        |

| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |



#### execute first instruction

mov %t0, [x]



| x:0 |             |  |
|-----|-------------|--|
| y:0 | Main Memory |  |
|     |             |  |

Store the value in the store buffer

mov %t0, [x]

|        | Store Buffer |
|--------|--------------|
| Core 0 | x:1          |
|        |              |
|        |              |

| x:0 |             |  |
|-----|-------------|--|
| y:0 | Main Memory |  |
|     |             |  |



Next instruction



| x:0 |             |  |
|-----|-------------|--|
| y:0 | Main Memory |  |
|     |             |  |

Where to load??

Store buffer? Main memory?





Where to load??

Threads check store buffer before going to main memory

It is close and cheap to check.



| x:0 |             |  |
|-----|-------------|--|
| y:0 | Main Memory |  |
|     |             |  |

# Memory Consistency

- How to specify a relaxed memory model?
- We can do it operationally
  - by constructing a high-level machine and reasoning about operations through the machine.
  - or we can talk about instructions that are allowed to "break" program order.

**Global variable:** 

int x[1] = {0}; int y[1] = {0};

| <u>Thread 0:</u> |     |  |
|------------------|-----|--|
| mov [x],         | 1   |  |
| mov %t0,         | [Y] |  |

Another test Can t0 == t1 == 0?

| Thread 1: |       |
|-----------|-------|
| mov [y],  | 1     |
| mov %t1,  | [ X ] |

We will annotate instructions with S for store, and L for loads

**Global variable:** 

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:mov [x], 1 L:mov %t0, [y] Another test Can t0 == t1 == 0?

| <u>Thread 1:</u> |     |
|------------------|-----|
| S:mov [y],       | 1   |
| L:mov %t1,       | [X] |

We will annotate instructions with S for store, and L for loads

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:mov [x], 1 L:mov %t0, [y]

S:mov [x], 1

L:mov %t0, [y]

Another test Can t0 == t1 == 0?

| Thread 1:  |     |
|------------|-----|
| S:mov [y], | 1   |
| L:mov %t1, | [X] |

S:mov [y], 1

L:mov %t1, [x]

int x[1] = {0}; int y[1] = {0};

# Thread O:

S:mov [x], 1 L:mov %t0, [y] Another test Can t0 == t1 == 0?



S(tores) followed by a L(oad) do not have to follow program order

int x[1] = {0}; int y[1] = {0};

### Thread O:

S:mov [x], 1 L:mov %t0, [y] Another test Can t0 == t1 == 0?



S(tores) followed by a L(oad) do not have to follow program order

### Global variable:

int x[1] = {0}; int y[1] = {0};

### <u>Thread 0:</u>

S:mov [x], 1 L:mov %t0, [y] Another test Can t0 == t1 == 0?

we can ignore this condition!!



| Thread 1: |      |     |  |
|-----------|------|-----|--|
| S:mov     | [Y], | 1   |  |
| L:mov     | %t1, | [X] |  |

Now we can satisfy the condition!



<u>Thread 0:</u> S:mov [x], 1 L:mov %t0, [y]

Lets peak under the hood here

Another test Can t0 == t1 == 0?



int x[1] = {0}; int y[1] = {0};

Thread 0:

S:mov [x], 1 L:mov %t0, [y]

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads





int x[1] = {0}; int y[1] = {0};

<u>Thread 0:</u> S:mov [x], 1 L:mov %t0, [y]

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads





int x[1] = {0}; int y[1] = {0};

#### Thread 0:

S:mov [x], 1 L:mov %t0, [y]

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads Another test Can t0 == t1 == 0?





| Thread 1: |       |  |
|-----------|-------|--|
| S:mov [y] | , 1   |  |
| L:mov %t1 | , [X] |  |

# Questions

• Can stores be reordered with stores?

mov [x], 1

mov [y], 1



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |

mov [y], 1



#### execute the first instruction

| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |

mov [y], 1



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |

value goes into store buffer

mov [y], 1



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |

execute next instruction

execute next instruction



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |

value goes into the store buffer



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |



On x86, the store buffer trains in a FIFO way: thus stores cannot be reordered



On x86, the store buffer trains in a FIFO way: thus stores cannot be reordered



# Questions

- Can stores be reordered with stores?
- How do we make rules about mfence?

int x[1] = {0}; int y[1] = {0};

### Thread 0:

S:mov [x], 1 mfence

L:mov %t0, [y]

S:mov [x], 1

mfence

L:mov %t0, [y]

Another test Can t0 == t1 == 0?



Rules: S(tores) followed by a L(oad) do not have to follow program order.

int x[1] = {0}; int y[1] = {0};

<u>Thread 0:</u>
S:mov [x], 1
mfence
L:mov %t0, [y]

So we can't reorder this instruction at all!



<u>Thread 1:</u>
S:mov [y], 1
mfence
L:mov %t1, [x]

S:mov [y], 1

Rules:

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

## Rules

• Are we done?

Rules: S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

Global variable:

int x[1] = {0}; int y[1] = {0};

### Thread 0:

S:mov [x], 1 L:mov %t0, [x]

S:mov [x], 1

L:mov %t0, [x]

Another test Can t 0 == 0?

Rules: S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order







Another test Can  $\pm 0$ ?

Rules: S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

S(tores) cannot be reordered past L(oads) from the same address

### TSO - Total Store Order

### **Rules:**

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

S(tores) cannot be reordered past L(oads) from the same address

# Schedule

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency
  - Examples

• We can specify them in terms of what reorderings are allowed



• We can specify them in terms of what reorderings are allowed



### **Sequential Consistency**

• We can specify them in terms of what reorderings are allowed



TSO - total store order

• We can specify them in terms of what reorderings are allowed



### Weaker models?

• We can specify them in terms of what reorderings are allowed



### **PSO - partial store order**

If memory access 0 appears before memory access 1 in program order, can it bypass program order?

Allows stores to drain from the store buffer in any order

• We can specify them in terms of what reorderings are allowed



**RMO - Relaxed Memory Order** 

If memory access 0 appears before memory access 1 in program order, can it bypass program order?

Very relaxed model!

• FENCE: can always restore order using fences. Accesses cannot be reordered past fences!



### **Any Memory Model**

If memory access 0 appears before memory access 1 in program order, and there is a FENCE between the two accesses, can it bypass program order?

# Schedule

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency
  - Examples

<u>Global variable:</u>

int x[1] = {0}; int y[1] = {0};

Thread 0:

L:mov %t0, [y] S:mov [x], 1 First thing: change our syntax to pseudo code

| Thread 2 |      |     |  |
|----------|------|-----|--|
| L:mov    | %t1, | [X] |  |
| S:mov    | [Y], | 1   |  |

<u>Global variable:</u>

int x[1] = {0}; int y[1] = {0}; First thing: change our syntax to pseudo code You should be able to find natural mappings to any ISA

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1) <u>Thread 1:</u> L:%t1 = load(x) S:store(y,1) <u>Global variable:</u>

Question:  $can \pm 0 == \pm 1 == 1$ ?

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1)

int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

<u>Thread 1:</u> L:%t1 = load(x) S:store(y,1) Global variable:

int x[1] = {0}; int y[1] = {0}; Question:  $can \pm 0 == \pm 1 == 1$ ?

Get out our lego bricks and try for sequential consistency

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1)

L:t = load(y)

S:store(x,1)

<u>Thread 1:</u> L:%t1 = load(x) S:store(y,1)

$$L:$$
t1 = load(x)



Question: can t0 == t1 == 1?

Get out our lego bricks



Not allowed under sequential consistency!



Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





```
int x[1] = {0};
int y[1] = {0};
```

Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?

Get out our lego bricks



Are we done? The behavior is no longer allowed

## One more example

Global variable:

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:store(x,1) S:store(y,1)

S:store(x,1)

S:store(y,1)

Question: can t 0 == 1 and t 1 == 0?

<u>Thread 1:</u> L:%t0 = load(y) S:%t1 = load(x)

L:
$$&t0 = load(y)$$

L:%t1 = load(x)

### Global variable:

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:store(x,1) S:store(y,1) Question: can t 0 == 1 and t 1 == 0?

start off thinking about sequential consistency

| Thread 1: |         |
|-----------|---------|
| L:%t0 =   | load(y) |
| S:%t1 =   | load(x) |

L:
$$t = load(y)$$

L:%t1 = load(x)

S:store(x,1)

S:store(y,1)



int  $x[1] = \{0\};$ 



int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

Question: can t 0 == 1 and t 1 == 0?



What about TSO? NO

int  $x[1] = \{0\};$ 



int  $x[1] = \{0\};$ 



int  $x[1] = \{0\};$ 



int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

Question: can t 0 == 1 and t 1 == 0?



Now it is disallowed in PSO

int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 



Question: can t 0 == 1 and t 1 == 0?

int x[1] = {0}; int y[1] = {0};



What about RMO?

Thread 0:

fence

int  $x[1] = \{0\};$ int  $y[1] = \{0\};$  Question: can t 0 == 1 and t 1 == 0?



What about RMO? The loads can be reordered also!

int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

Question: can t 0 == 1 and t 1 == 0?



#### What about RMO? add a fence

Question: can t 0 == 1 and t 1 == 0?

int x[1] = {0}; int y[1] = {0};



Now the relaxed behavior is disallowed

- Historic Chips:
  - X86: TSO
    - Surprising robust
    - mutexes and concurrent data structures generally seem to work
    - watch out for store buffering
  - IBM Power and ARM
    - Very relaxed. Similar to RMO with even more rules
    - Mutexes and data structures must be written with care
    - ARM recently strengthened theirs

- Historic Chips:
  - X86: TSO
    - Surprising robust
    - mutexes and concurrent data structures generally seem to work
    - watch out for store buffering
  - IBM Power and ARM
    - Very relaxed. Similar to RMO with even more rules
    - Mutexes and data structures must be written with care
    - ARM recently strengthened theirs

Companies have a history of providing insufficient documentation about their rules: academics have then gone and figured it out!

Getting better these days

- Modern Chips:
  - RISC-V : two specs: one similar to TSO, one similar to RMO
  - Apple M1: toggles between TSO and weaker
  - Vulkan does not provide any fences that provide S L ordering

- PSO and RMO were never implemented widely
  - I have not met anyone who knows of any RMO taped out chip
  - They are part of SPARC ISAs (i.e. RISC-V before it was cool)
  - These memory models might have been part of specialized chips
- Interestingly:
  - Early Nvidia GPUs appeared to informally implement RMO
- Other chips have very strange memory models:
  - Alpha DEC basically no rules

# See you on Wednesday!

- Get HW 3 in!
- Watch out for HW2 grades
- Finishing up memory models on Wednesday