# CSE211: Compiler Design Nov. 10, 2021

• **Topic:** Compiling relaxed memory models

L:
$$t = load(y)$$

#### Announcements

- Homework 3 is due next Wednesday
  - 1 more office hour before then tomorrow
    - Sign up sheet link posted around noon tomorrow
    - Please don't sign up until link is posted!
  - Feel free to share results (not code!) on slack
  - Part 2 uses a lot of memory. Feel free to reduce the array size, but try not to reduce it too far.
- Friday's class will be canceled
  - Work on homework 3 and project/paper proposals
- Guest lecture for Nov. 22
  - Aviral Goel will talk about laziness in R

#### Announcements

- Friday's class will be canceled
  - Work on homework 3 and project/paper proposals
- Guest lecture for Nov. 22
  - Aviral Goel will talk about laziness in R

# Paper and project proposals

- Due on Nov. 14
  - Thanks to everyone who has messaged me so far!
  - I will try to have grades for HW2 and midterm by then

# CSE211: Compiler Design Nov. 10, 2021

• **Topic:** Compiling relaxed memory models

L:
$$t = load(y)$$

#### Review

- How to implement parallelism in DOALL loops
  - Regular parallelism?

# DOALL regular parallelism on SMP system



# DOALL regular parallelism on GPUs

one streaming multiprocessor contains many small Compute Elements (CE)

CEs Can load adjacent memory locations simultaneously



What about a striped pattern?

| 0 1 2 3 | 4 5 6 7 |
|---------|---------|
|---------|---------|

# Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically



### Work stealing - local worklists

• local worklists: divide tasks into different worklists for each thread





# Today's lecture

- We have been assuming DOALL loops:
  - Threads access completely disjoint memory
  - This might not be the case
  - Examples?

# Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

Pros? Cons?

Shared head of the list



### Work stealing - local worklists

• local worklists: divide tasks into different worklists for each thread



### What happens when threads share data?

#### <u>Global variable:</u>

int x[1] = {0}; int y[1] = {0};

#### Thread 0:

```
S:store(x, 1);
L:%t0 = load(y);
```

# <u>Thread 1:</u> S:store(y, 1); L:%t1 = load(x);

#### Global variable:

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:store(x, 1);
L:%t0 = load(y);

S:store(x, 1);

L:%t0 = load(y);

<u>Thread 1:</u> S:store(y, 1); L:%t1 = load(x);

S:store(y, 1);
L:%t1 = load(x);

pick from the top of the pile of either thread

### Sequential Consistency

- Sequential interleaving of atomic instructions
- What are "atomic instructions"?



int x[1] = {0}; int y[1] = {0};



pick from the top of the pile of either thread Can t0 == t1 == 0 at the end of the execution?

#### Demo

• What is going on?



<u>Thread 1:</u>

mov [y], 1

mov %t1, [x]



| x:0 |             |  |
|-----|-------------|--|
| y:0 | Main Memory |  |
|     |             |  |



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |























#### Thread 1:

Execute next instruction









#### <u>Thread 1:</u>

Values get loaded from memory



<u>Thread 0:</u>

<u>Thread 1:</u>

we see t0 == t1 == 0!





Main Memory

x:0

y:0



# Our first relaxed memory execution!

- also known as weak memory behaviors
- An execution that is NOT allowed by sequential consistency
- A memory model that allows relaxed memory executions is known as a relaxed memory model

#### Litmus tests

- Small concurrent programs that check for relaxed memory behaviors
- Vendors have a long history of under documented memory consistency models
- Academics have empirically explored the memory models
  - Many vendors have unofficially endorsed academic models
  - X86 behaviors were documented by researchers before Intel!

#### Litmus tests

This test is called "store buffering"

| <u>Thread 0:</u> |      |     |  |
|------------------|------|-----|--|
| mov              | [X], | 1   |  |
| mov              | %t0, | [Y] |  |

| Thread 1: |     |
|-----------|-----|
| mov [y],  | 1   |
| mov %t1,  | [x] |

Can t 0 == t 1 == 0?
## Restoring sequential consistency

- It is typical that relaxed memory models provide special instructions which can be used to disallow weak behaviors.
- These instructions are called Fences
- The X86 fence is called mfence. It flushes the store buffer.





### <u>Thread 0:</u>

























execute next instruction





values are loaded from memory



We don't get the problematic behavior: t0 = 0 and t1 = 0



# Next example

### <u>Thread 0:</u>



single thread same address

possible outcomes: t0 = 1 t0 = 0

Which one do you expect?

| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |

| <u>Thread 0:</u> |              |                        |
|------------------|--------------|------------------------|
| mov [x],         | 1            | How does this execute? |
| mov %t0,         | [x]          |                        |
| Core 0           | Store Buffer |                        |
|                  |              |                        |

| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |



#### execute first instruction

mov %t0, [x]



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |

### <u>Thread 0:</u>

Store the value in the store buffer

mov %t0, [x]

|        | Store Buffer |
|--------|--------------|
| Core 0 | x:1          |
|        |              |
|        |              |

| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |



Next instruction



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |

### <u>Thread 0:</u>

Where to load??

Store buffer? Main memory?





### <u>Thread 0:</u>

Where to load??

Threads check store buffer before going to main memory

It is close and cheap to check.



| x:0 |             |
|-----|-------------|
| y:0 | Main Memory |
|     |             |

## Question

• Can stores be reordered with stores?

### **Global variable:**

int x[1] = {0}; int y[1] = {0};

#### Thread 0:

S:mov [x], 1 L:mov %t0, [y]

### S:mov [x], 1

L:mov %t0, [y]

Can t 0 == t 1 == 0?



| L:mov | %t1, | [X] |
|-------|------|-----|
|-------|------|-----|

Rules: S(tores) followed by a L(oad) do not have to follow program order.

### <u>Global variable:</u>

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:mov [x], 1 mfence L:mov %t0, [y]

S:mov [x], 1

mfence

L:mov %t0, [y]

Can t 0 == t 1 == 0?



Rules: S(tores) followed by a L(oad) do not have to follow program order.

### <u>Global variable:</u>

int x[1] = {0}; int y[1] = {0};

Thread 0:

S:mov [x], 1 mfence L:mov %t0, [y]

S:mov [x], 1

mfence

L:mov %t0, [y]

Can t 0 == t 1 == 0?

<u>Thread 1:</u>
S:mov [y], 1
mfence
L:mov %t1, [x]



Rules:

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

## Rules

• Are we done?

Rules: S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

Global variable:

int x[1] = {0}; int y[1] = {0};

#### Thread 0:

S:mov [x], 1 L:mov %t0, [x]

S:mov [x], 1

L:mov %t0, [x]

Another test Can t 0 == 0?

Rules: S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

<u>Global variable:</u>

int x[1] = {0}; int y[1] = {0};

#### Thread 0:

S:mov [x], 1 L:mov %t0, [x]

S:mov [x], 1

L:mov %t0, [x]

Another test Can  $\pm 0 = 0$ ?

> Rules: S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

S(tores) cannot be reordered past L(oads) from the same address

## TSO - Total Store Order

### **Rules:**

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

S(tores) cannot be reordered past L(oads) from the same address

• We can specify them in terms of what reorderings are allowed



• We can specify them in terms of what reorderings are allowed



#### **Sequential Consistency**

• We can specify them in terms of what reorderings are allowed



TSO - total store order

• We can specify them in terms of what reorderings are allowed



#### Weaker models?

• We can specify them in terms of what reorderings are allowed



#### **PSO - partial store order**

If memory access 0 appears before memory access 1 in program order, can it bypass program order?

Allows stores to drain from the store buffer in any order

• We can specify them in terms of what reorderings are allowed



**RMO - Relaxed Memory Order** 

If memory access 0 appears before memory access 1 in program order, can it bypass program order?

Very relaxed model!

• FENCE: can always restore order using fences. Accesses cannot be reordered past fences!

#### **Any Memory Model**

If memory access 0 appears before memory access 1 in program order, and there is a FENCE between the two accesses, can it bypass program order? <u>Global variable:</u>

int x[1] = {0}; int y[1] = {0}; First thing: change our syntax to pseudo code You should be able to find natural mappings to any ISA

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1) <u>Thread 1:</u> L:%t1 = load(x) S:store(y,1) <u>Global variable:</u>

Question:  $can \pm 0 == \pm 1 == 1$ ?

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1)

int  $x[1] = \{0\};$ 

int  $y[1] = \{0\};$ 

<u>Thread 1:</u> L:%t1 = load(x) S:store(y,1) Global variable:

int x[1] = {0}; int y[1] = {0}; Question:  $can \pm 0 == \pm 1 == 1$ ?

Get out our lego bricks and try for sequential consistency

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1)

L:t = load(y)

S:store(x,1)

<u>Thread 1:</u> L:%t1 = load(x) S:store(y,1)

L:
$$t1 = load(x)$$

S:store(y,1)
Global variable:

int x[1] = {0}; int y[1] = {0}; Question:  $can \pm 0 == \pm 1 == 1$ ?

Get out our lego bricks and try for TSO

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1)

L:t = load(y)

S:store(x,1)

Thread 1: L:%t1 = load(x) S:store(y,1) L:%t1 = load(x) S:store(y,1) memory access 0 S L Different L NO address memory access 1 S NO NO

Global variable:

int x[1] = {0}; int y[1] = {0}; Question:  $can \pm 0 == \pm 1 == 1$ ?

Get out our lego bricks and try for PSO

<u>Thread 0:</u> L:%t0 = load(y) S:store(x,1)

L:t = load(y)

S:store(x,1)



<u>Global variable:</u>

int x[1] = {0}; int y[1] = {0}; Question: can t0 == t1 == 1?

Get out our lego bricks and try for RMO



<u>Global variable:</u>

int x[1] = {0}; int y[1] = {0}; Question:  $can \pm 0 == \pm 1 == 1$ ?

Get out our lego bricks and try for RMO



L:%t0 = load(y)

fence

S:store(x,1)

<u>Thread 1:</u> L:%t1 = load(x) fence S:store(y,1)



How do we disallow it?

## Compiling relaxed memory models

# Compiling relaxed memory models

- C++ style:
  - Any memory conflicts (read-write or write-write) must be accessed with an atomic operation\*
  - Otherwise your program is undefined
  - By default, you will get sequentially consistent behavior
  - \*unless they are synchronized, which is a really complicated concept in c++...
    If you are interested, I can recommend papers.

start with both of the grids for the two different memory models

language C++11 (sequential consistency)



target machine







start with both of the grids for the two different memory models

language C++11 (sequential consistency)



find mismatch





start with both of the grids for the two different memory models



find mismatch Two options:

make sure stores are not reordered with later loads

make sure loads are not reordered with earlier stores target machine TSO (x86)







start with both of the grids for the two different memory models



want to reduce the number of atomic load/stores in your program









#### Memory orders

- Atomic operations take an additional "memory order" argument
  - memory\_order\_seq\_cst default
  - memory\_order\_relaxed weakest

Where have we seen memory\_order\_relaxed?

## Relaxed memory order



L S L NO NO S NO NO

|       | langu    | lage    |           |
|-------|----------|---------|-----------|
| C++11 | (memory_ | _order_ | _relaxed) |



basically no orderings except for accesses to the same address

language C++11 (memory\_order\_relaxed)

|   | L                    | S                    |  |
|---|----------------------|----------------------|--|
| L | different<br>address | different<br>address |  |
| S | different<br>address | different<br>address |  |



L

S

language C++11 (memory\_order\_relaxed)

|   | L                    | S                    |  |
|---|----------------------|----------------------|--|
| L | different<br>address | different<br>address |  |
| S | different<br>address | different<br>address |  |

lots of mismatches!

target machine TSO (x86)



L

S

language C++11 (memory\_order\_relaxed)

|   | L                    | S                    |  |
|---|----------------------|----------------------|--|
| L | different<br>address | different<br>address |  |
| S | different<br>address | different<br>address |  |

lots of mismatches!

But language is more relaxed than machine

so no fences are needed

target machine TSO (x86)



L

S

Do any of the ISA memory models need any fences for relaxed memory order?

language C++11 (memory\_order\_relaxed)



### Memory order relaxed

- Very few use-cases! Be very careful when using it
  - Peeking at values (later accessed using a heavier memory order)
  - Counting (e.g. number of finished threads in work stealing)

## More memory orders: we will not discuss in class

- Atomic operations take an additional "memory order" argument
  - memory\_order\_seq\_cst default
  - memory\_order\_relaxed weakest
- More memory orders (useful for mutex implementations):
  - memory\_order\_acquire
  - memory\_order\_release
- EVEN MORE memory orders (complicated: in most research it is ommitted)
  - memory\_order\_consume

## Memory consistency in the real world

- Historic Chips:
  - X86: TSO
    - Surprising robost
    - mutexes and concurrent data structures generally seem to work
    - watch out for store buffering
  - IBM Power and ARM
    - Very relaxed. Similar to RMO with even more rules
    - Mutexes and data structures must be written with care
    - ARM recently strengthened theirs

## Memory consistency in the real world

- Modern Chips:
  - RISC-V : two specs: one similar to TSO, one similar to RMO
  - Apple M1: toggles between TSO and weaker

## Memory consistency in the real world

- PSO and RMO were never implemented widely
  - I have not met anyone who knows of any RMO taped out chip
  - They are part of SPARC ISAs (i.e. RISC-V before it was cool)
  - These memory models might have been part of specialized chips
- Interestingly:
  - Early Nvidia GPUs appeared to informally implement RMO
- Other chips have very strange memory models:
  - Alpha DEC basically no rules

# Compiler

- Previously (before C/++11):
  - Use volatile
  - Use inline assembly for fences
  - Not portable!
- Now:
  - C/++11 memory model
  - But there are still bugs: Intel OpenCL compiler, IBM C++ compiler... others?

## Further research

- Should we provide sequential consistency by default? even without atomics?
  - How to do this?
  - Many interesting papers

## Thanks!

- Friday's lecture is canceled
- On Monday, we will talk about decoupled access execute (DAE)