# **CSE113: Parallel Programming** Feb. 23, 2022

# • Topics:

- Memory consistency models:
  - Relaxed memory consistency
  - Examples
  - Compiling memory consistency



# Announcements

- HW 4 is out
  - Due on March 4
  - Please don't share timing results until next Monday
  - You mostly had what you needed at the end of Friday
  - You will have everything you need to do it at the end of lecture today
- Grades for HW 2 are out
  - Let us know by next Monday if you have any issues
- Grades for Midterm are on their way
  - Expect them by Monday

# Today's Quiz

• Due Tomorrow by midnight; please do it!

# Previous quiz

A relaxed memory execution refers to:

 $\bigcirc$  An execution where some stores failed to reach main memory

○ Any execution which contains a data-conflict

○ An execution that utilizes the processor's store buffer

 $\bigcirc$  An execution that is not sequentially consistent

# Previous quiz

Which of the following memory accesses pairs, when they appear in program order, does x86 allow to be re-ordered?

 $\bigcirc$  Load followed by a Store

 $\bigcirc$  Load followed by a Load

 $\bigcirc$  Store followed by a Store

 $\bigcirc$  Store followed by a Load

# Previous quiz

- Didn't save the question:
  - What is the difference between a fence and a barrier?

- The naming can get confusing!
- https://en.wikipedia.org/wiki/Barrier\_(computer\_science)
- https://en.wikipedia.org/wiki/Memory\_barrier

# Review

# Sequential consistency and litmus tests

### Global variable:

atomic\_int x(0); atomic\_int y(0);

### Thread 0:

x.store(1);
y.store(1);

Is it possible for t0 == 0 and t1 == 1

# <u>Thread 1:</u> int t0 = y.load(); int t1 = x.load();

atomic\_int x(0); atomic\_int y(0);

### Thread 0:

x.store(1);
y.store(1);

Is it possible for t0 == 0 and t1 == 1

<u>Thread 1:</u> int t0 = y.load(); int t1 = x.load();

#### x.store(1);

y.store(1);

int t1 = x.load();

### Global variable:

atomic\_int x(0); atomic\_int y(0);

### Thread 0:

x.store(1);
y.store(1);

```
Is it possible for
t0 == 0 and t1 ==1
```

```
int t0 = y.load();
x.store(1);
y.store(1);
int t1 = x.load();
```

<u>Thread 1:</u> int t0 = y.load(); int t1 = x.load();

# **Global variable:**

atomic\_int x(0); atomic\_int y(0);

### Thread 0:

x.store(1);
y.store(1);

#### How about:

Is it possible for t0 == 1 and t1 == 0

# <u>Thread 1:</u> int t0 = y.load(); int t1 = x.load();

#### x.store(1);

y.store(1);

int t1 = x.load();



atomic\_int x(0); atomic\_int y(0);

<u>Thread 0:</u> x.store(1); int t0 = y.load(); Another test Can t0 == t1 == 0?

<u>Thread 1:</u>
y.store(1);
int t1 = x.load();

atomic\_int x(0); atomic\_int y(0);

<u>Thread 0:</u> x.store(1); int t0 = y.load(); Another test Can t0 == t1 == 0?

<u>Thread 1:</u>
y.store(1);
int t1 = x.load();

int t0 = y.load();

y.store(1);



atomic\_int x(0); atomic\_int y(0);



Another test

Can t0 == t1 == 0?

<u>Thread 1:</u>
y.store(1);
int t1 = x.load();

# X86 TSO operational model



<u>Thread 1:</u>

mov [y], 1

mov %t1, [x]

Core 1

| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |























# Thread 1:

Execute next instruction









# <u>Thread 1:</u>

Values get loaded from memory



<u>Thread 0:</u>

<u>Thread 1:</u>

we see t0 == t1 == 0!





Main Memory

x:0

y:0



# X86 TSO programming model

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:mov [x], 1 L:mov %t0, [y]

S:mov [x], 1

L:mov %t0, [y]

Another test Can t0 == t1 == 0?

| Thread 1:  | Thread 1: |  |  |  |  |
|------------|-----------|--|--|--|--|
| S:mov [y], | 1         |  |  |  |  |
| L:mov %t1, | [X]       |  |  |  |  |

S:mov [y], 1

L:mov %t1, [x]

int x[1] = {0}; int y[1] = {0};

# Thread O:

S:mov [x], 1 L:mov %t0, [y] Another test Can t0 == t1 == 0?



S(tores) followed by a L(oad) do not have to follow program order

int x[1] = {0}; int y[1] = {0};

## Thread O:

S:mov [x], 1 L:mov %t0, [y] Another test Can t0 == t1 == 0?



S(tores) followed by a L(oad) do not have to follow program order

# Global variable:

int x[1] = {0}; int y[1] = {0};

### <u>Thread 0:</u>

S:mov [x], 1 L:mov %t0, [y] Another test Can t0 == t1 == 0?

we can ignore this condition!!



| Thread 2 | Thread 1: |     |  |  |  |
|----------|-----------|-----|--|--|--|
| S:mov    | [Y],      | 1   |  |  |  |
| L:mov    | %t1,      | [X] |  |  |  |

Now we can satisfy the condition!



<u>Thread 0:</u> S:mov [x], 1 L:mov %t0, [y]

Lets peak under the hood here

Another test Can t0 == t1 == 0?



int x[1] = {0}; int y[1] = {0};

Thread 0:

S:mov [x], 1 L:mov %t0, [y]

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads





int x[1] = {0}; int y[1] = {0};

<u>Thread 0:</u> S:mov [x], 1 L:mov %t0, [y]

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads





int x[1] = {0}; int y[1] = {0};

### Thread 0:

S:mov [x], 1 L:mov %t0, [y]

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads Another test Can t0 == t1 == 0?





| Thread 1: |       |  |  |
|-----------|-------|--|--|
| S:mov [y] | , 1   |  |  |
| L:mov %t1 | , [X] |  |  |

## TSO - Total Store Order

**Rules:** S(tores) followed by a L(oad) do not have to follow program order.

int x[1] = {0}; int y[1] = {0};

### Thread 0:

S:mov [x], 1 mfence

L:mov %t0, [y]

S:mov [x], 1

mfence

L:mov %t0, [y]

Another test Can t0 == t1 == 0?



Rules: S(tores) followed by a L(oad) do not have to follow program order.

int x[1] = {0}; int y[1] = {0};

<u>Thread 0:</u>
S:mov [x], 1
mfence
L:mov %t0, [y]

So we can't reorder this instruction at all!



<u>Thread 1:</u>
S:mov [y], 1
mfence
L:mov %t1, [x]

S:mov [y], 1

Rules:

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

## TSO - Total Store Order

### **Rules:**

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

Global variable:

int x[1] = {0}; int y[1] = {0};

### Thread 0:

S:mov [x], 1 L:mov %t0, [x]

S:mov [x], 1

L:mov %t0, [x]

Another test Can  $\pm 0 = 0$ ?

> Rules: S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

## TSO - Total Store Order

### **Rules:**

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

S(tores) cannot be reordered past L(oads) from the same address

## On to new material!

# Schedule

- Memory consistency models
  - More general relaxed consistency
  - Examples
  - Compiling relaxed memory models

• We can specify them in terms of what re-orderings are allowed



• We can specify them in terms of what re-orderings are allowed



#### **Sequential Consistency**

• We can specify them in terms of what re-orderings are allowed



TSO - total store order

• We can specify them in terms of what re-orderings are allowed



#### Weaker models?

• We can specify them in terms of what re-orderings are allowed



**PSO - partial store order** 

If memory access 0 appears before memory access 1 in program order, can it bypass program order?

Allows stores to drain from the store buffer in any order

• We can specify them in terms of what re-orderings are allowed



### **RMO - Relaxed Memory Order**

• FENCE: can always restore order using fences. Accesses cannot be reordered past fences!



#### **Any Memory Model**

If memory access 0 appears before memory access 1 in program order, and there is a FENCE between the two accesses, can it bypass program order?

# Schedule

- Memory consistency models
  - More general relaxed consistency
  - Examples
  - Compiling relaxed memory models

int x[1] = {0}; int y[1] = {0};

Thread 0:

L:mov %t0, [y] S:mov [x], 1 First thing: change our syntax to pseudo code

| Thread 1: |      |     |  |
|-----------|------|-----|--|
| L:mov     | %t1, | [X] |  |
| S:mov     | [Y], | 1   |  |

int x[1] = {0}; int y[1] = {0}; First thing: change our syntax to pseudo code You should be able to find natural mappings to any ISA

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1) <u>Thread 1:</u> L:%t1 = load(x) S:store(y,1)

Question:  $can \pm 0 == \pm 1 == 1$ ?

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1)

int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

<u>Thread 1:</u> L:%t1 = load(x) S:store(y,1)

int x[1] = {0}; int y[1] = {0}; Question:  $can \pm 0 == \pm 1 == 1$ ?

Get out our lego bricks and try for sequential consistency

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1)

L:t = load(y)

S:store(x,1)

<u>Thread 1:</u> L:%t1 = load(x) S:store(y,1)

$$L:$$
t1 = load(x)



Question: can t0 == t1 == 1?

Get out our lego bricks



Not allowed under sequential consistency!



Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





```
int x[1] = {0};
int y[1] = {0};
```

Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?





Question: can t0 == t1 == 1?

Get out our lego bricks



Are we done? The behavior is no longer allowed

#### One more example

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:store(x,1) S:store(y,1)

S:store(x,1)

S:store(y,1)

Question: can t 0 == 1 and t 1 == 0?

L:
$$t = load(y)$$

L:t1 = load(x)

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:store(x,1) S:store(y,1) Question: can t 0 == 1 and t 1 == 0?

start off thinking about sequential consistency

| Thread 1: |         |
|-----------|---------|
| L:%t0 =   | load(y) |
| L:%t1 =   | load(x) |

L:
$$t = load(y)$$

L:%t1 = load(x)

S:store(x,1)

S:store(y,1)



int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

Question: can t 0 == 1 and t 1 == 0?



What about TSO?

int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

Question: can t 0 == 1 and t 1 == 0?



What about TSO? NO

int  $x[1] = \{0\};$ 



int  $x[1] = \{0\};$ 



int  $x[1] = \{0\};$ 



int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

Question: can t 0 == 1 and t 1 == 0?



Now it is disallowed in PSO

int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 



What about RMO?

Question: can t 0 == 1 and t 1 == 0?

int x[1] = {0}; int y[1] = {0};



What about RMO?

Thread 0:

fence

int  $x[1] = \{0\};$ int  $y[1] = \{0\};$  Question: can t 0 == 1 and t 1 == 0?



What about RMO? The loads can be reordered also!

int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

Question: can t 0 == 1 and t 1 == 0?



#### What about RMO? add a fence

Question: can t 0 == 1 and t 1 == 0?

int x[1] = {0}; int y[1] = {0};



Now the relaxed behavior is disallowed

- Historic Chips:
  - X86: TSO
    - Surprising robost
    - mutexes and concurrent data structures generally seem to work
    - watch out for store buffering
  - IBM Power and ARM
    - Very relaxed. Similar to RMO with even more rules
    - Mutexes and data structures must be written with care
    - ARM recently strengthened theirs

- Historic Chips:
  - X86: TSO
    - Surprising robost
    - mutexes and concurrent data structures generally seem to work
    - watch out for store buffering
  - IBM Power and ARM
    - Very relaxed. Similar to RMO with even more rules
    - Mutexes and data structures must be written with care
    - ARM recently strengthened theirs

Companies have a history of providing insufficient documentation about their rules: academics have then gone and figured it out!

Getting better these days

- Modern Chips:
  - RISC-V : two specs: one similar to TSO, one similar to RMO
  - Apple M1: toggles between TSO and weaker (ARM)
  - Vulkan does not provide any fences that provide S L ordering

- PSO and RMO were never implemented widely
  - I have not met anyone who knows of any RMO taped out chip
  - They are part of SPARC ISAs (i.e. RISC-V before it was cool)
  - These memory models might have been part of specialized chips
- Interestingly:
  - Early Nvidia GPUs appeared to informally implement RMO
- Other chips have very strange memory models:
  - Alpha DEC basically no rules

## More about GPU memory models

- GPU Harbor
  - created by our teaching staff! Reese Levine and Tim Guo!

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:mov [x], 1 L:mov %t0, [y]

S:mov [x], 1

L:mov %t0, [y]

Another test Can t0 == t1 == 0?

| Thread 1:  |     |
|------------|-----|
| S:mov [y], | 1   |
| L:mov %t1, | [X] |

S:mov [y], 1

L:mov %t1, [x]

Show this example on GPU Harbor

int x[1] = {0}; int y[1] = {0}; Question:  $can \pm 0 == \pm 1 == 1$ ?

Get out our lego bricks and try for sequential consistency

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1)

L:%t0 = load(y)

S:store(x,1)

<u>Thread 1:</u> L:%t1 = load(x) S:store(y,1)

$$L:$$
t1 = load(x)

Show this example on GPU Harbor

# Where do programming languages fit in?

- One of the highest priorities of a programming language
  - Write once, run everywhere

start with both of the grids for the two different memory models

language C++11 (sequential consistency)



target machine







start with both of the grids for the two different memory models

language C++11 (sequential consistency)



find mismatch





start with both of the grids for the two different memory models



find mismatch Two options:

make sure stores are not reordered with later loads

make sure loads are not reordered with earlier stores target machine TSO (x86)







start with both of the grids for the two different memory models



want to reduce the number of atomic load/stores in your program









#### Memory orders

- Atomic operations take an additional "memory order" argument
  - memory\_order\_seq\_cst default
  - memory\_order\_relaxed weakest

Where have we seen memory\_order\_relaxed?

#### Relaxed memory order



L S L NO NO S NO NO

| language             |           |  |  |  |
|----------------------|-----------|--|--|--|
| C++11 (memory_order_ | _relaxed) |  |  |  |



basically no orderings except for accesses to the same address

#### Compiling memory order relaxed

language C++11 (memory\_order\_relaxed)

|   | L                    | S                    |  |
|---|----------------------|----------------------|--|
| L | different<br>address | different<br>address |  |
| S | different<br>address | different<br>address |  |



L

S

## Compiling memory order relaxed

language C++11 (memory\_order\_relaxed)

|   | L                    | S                    |  |
|---|----------------------|----------------------|--|
| L | different<br>address | different<br>address |  |
| S | different<br>address | different<br>address |  |

lots of mismatches!

target machine TSO (x86)



L

S

# Compiling memory order relaxed

language C++11 (memory\_order\_relaxed)

|   | L                    | S                    |   |
|---|----------------------|----------------------|---|
| L | different<br>address | different<br>address | E |
| S | different<br>address | different<br>address | S |

lots of mismatches!

But language is more relaxed than machine

so no fences are needed

target machine TSO (x86)



L

S

# Compiling memory order relaxed

Do any of the ISA memory models need any fences for relaxed memory order?

language C++11 (memory\_order\_relaxed)



### Memory order relaxed

- Very few use-cases! Be very careful when using it
  - Peeking at values (later accessed using a heavier memory order)
  - Counting (e.g. number of finished threads in work stealing)
  - DO NOT USE FOR QUEUE INDEXES

## More memory orders: we will not discuss in class

- Atomic operations take an additional "memory order" argument
  - memory\_order\_seq\_cst default
  - memory\_order\_relaxed weakest
- More memory orders (useful for mutex implementations):
  - memory\_order\_acquire
  - memory\_order\_release
- EVEN MORE memory orders (complicated: in most research it is omitted)
  - memory\_order\_consume

### A cautionary tale

Thread 0:

m.lock(); display.enq(triangle0); m.unlock(); Thread 1:

m.lock(); display.enq(triangle1); m.unlock();

Thread 0:

m.lock(); display.enq(triangle0); m.unlock(); Thread 1:

m.lock(); display.enq(triangle1); m.unlock();

We know how lock and unlock are implemented

### Thread 0:

SPIN:CAS(mutex,0,1); display.enq(triangle0); store(mutex,0); Thread 1:

```
SPIN:CAS(mutex,0,1);
display.enq(triangle1);
store(mutex,0);
```

We know how lock and unlock are implemented We also know how a queue is implemented

```
Thread 0:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
Thread 1:
```

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```

We know how lock and unlock are implemented We also know how a queue is implemented

What is an execution?

### Thread 0:

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

#### Thread 1:

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```

#### CAS(mutex,0,1);

*if blue goes first it gets to complete its critical section while thread 1 is spinning* 

### Thread 0:

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```



```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

#### Thread 1:

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```



now yellow gets a change to go

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

#### Thread 1:

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```



now yellow gets a change to go

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```





```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```





S

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```



### Nvidia in 2015

- Nvidia architects implemented a weak memory model
- Nvidia programmers expected a strong memory model
- Mutexes implemented without fences!

### Nvidia in 2015







bug found in two Nvidia textbooks

We implemented a side-channel attack that made the bugs appear more frequently

These days Nvidia has a very well-specified memory model!







```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

#### Thread 1:

```
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```



How to fix the issue?





# Memory Model Strength

- If one memory model M0 allows more relaxed behaviors than another memory model M1, then M0 is more *relaxed* (or *weaker*) than M1.
- It is safe to run a program written for M0 on M1. But not vice versa



TSO

RMO

# Memory Model Strength

- Many times specifications are weaker than implementations:
  - A chip might document PSO, but implement TSO:

• Why?



# See you on Friday!

- Get started on HW 4
- Let us know if there are any issues with HW 2 grades
- Finishing up module 4 on Friday: forward progress models
  - Then on to GPUs!