## CSE113: Parallel Programming

Feb. 18, 2022

#### • Topics:

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency
  - Examples

target machine TSO (x86)

. S

NO different address

S

#### Announcements

- HW 3 due today
  - Questions on piazza will be answered until 5
  - Sanya has office hours
- Grades for HW 2 are planned for release today
  - by midnight
- HW 4 released tonight by midnight
  - You have what you need for part 1, likely the other parts today

## Today's Quiz

• Due Wednesday before class

Barrier objects have how many API calls?

 $\bigcirc$  0

 $\bigcirc$  1

O 2

○ 3

| A barrier call emits which of the following events? Check all that apply |  |
|--------------------------------------------------------------------------|--|
| ☐ barrier_lock                                                           |  |
| ☐ barrier_arrive                                                         |  |
| ☐ barrier_enqueue                                                        |  |
| □ barrier leave                                                          |  |

If a program uses both barriers and mutexes, the outcome is deterministic (i.e. the same every time) if there are no data conflicts

- True
- False

# Example

Write a few sentences about what you think the best interface for parallel programming is: that is: do you think it is atomics? Mutexes? Concurrent Data Structures? Barriers? Or even maybe the compiler should simply do it all automatically? Or is it some combination of the above? What are the trade-offs involved?

#### Reasoning about concurrency

 Depending on your needs, programs become more/less complex to reason about.



#### Review

#### Barriers

#### Barrier Examples

My current favorite: particle simulation



at each time, compute new positions for each particle (in parallel)

But you need to wait for all particles to be computed before starting the next time step

#### Barrier Examples

Deep neural networks



#### Question

• We already know how to do this... Why do we need barriers?

```
Thread 0:
*x = 1;
B.barrier();
```

thread 2

```
<u>Thread 1:</u>
*y = 2;
B.barrier();
```

```
<u>Thread 2:</u>
B.barrier();
var = *x + *y;
```

```
thread 0
thread 1
```

```
Thread 0:
*x = 1;
B.barrier();
```

barrier arrive

thread 2

```
<u>Thread 1:</u>
*y = 2;
B.barrier();
```

```
<u>Thread 2:</u>

B.barrier();

var = *x + *y;
```

```
thread 0
thread 1
```

```
<u>Thread 0:</u>

*x = 1;

B.barrier();
```

barrier arrive

thread 2

```
Thread 1:
*y = 2;
B.barrier();
```

```
<u>Thread 2:</u>

B.barrier();

var = *x + *y;
```

```
thread 0 *x=1 thread 1 *y=2
```

```
<u>Thread 0:</u>
*x = 1;
B.barrier();
```

```
Thread 1:
*y = 2;
B.barrier();
```

```
<u>Thread 2:</u>

B.barrier();

var = *x + *y;
```

```
<u>Thread 0:</u>
*x = 1;
B.barrier();
```

```
Thread 1:
*y = 2;
B.barrier();
```



They've all arrived



```
<u>Thread 0:</u>
*x = 1;
B.barrier();
```

```
Thread 1:
*y = 2;
B.barrier();
```

```
<u>Thread 2:</u>
B.barrier();
var = *x + *y;
```

They've all arrived



```
<u>Thread 0:</u>
*x = 1;
B.barrier();
```

```
<u>Thread 1:</u>
*y = 2;
B.barrier();
```

```
Thread 2:
B.barrier();
var = *x + *y;
```



What is this guaranteed to be?

```
Thread 0:
*x = 1;
B.barrier();
```

```
Thread 1:
*y = 2;
B.barrier();
```



sometimes called a phase

extending to the next barrier leave

Barrier Interval 0 Barrier Interval 1 thread 0 barrier arrive barrier leave thread 1 \*y = 2 barrier arrive

barrier arrive thread 2 barrier leave var = \*x + \*v

barrier leave

```
class SenseBarrier {
 private:
   atomic int counter;
   int num threads;
   atomic bool sense;
   bool thread sense[num threads];
 public:
   Barrier(int num threads) {
      counter = 0;
     this->num threads = num threads;
      sense = false;
     thread sense = {true, ...};
    void barrier(int tid) {
        int arrival num = atomic fetch add(&counter, 1);
        if (arrival num == num threads) {
           counter.store(0);
           sense = thread sense[tid];
        else {
          while (sense != thread sense[tid]);
        thread sense[tid] = !thread sense[tid];
```

thread sense = true

```
num_threads == 2
counter == 0
sense = false
```

thread\_sense = true

```
Thread 0:
B.barrier();
B.barrier();
```

```
void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense != thread_sense[tid]);
    }
     thread_sense[tid] = !thread_sense[tid];
}
```

```
Thread 1:
B.barrier();
B.barrier();
```

```
thread_sense = true
arrival_num = 2
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 2
    sense = false

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
```

```
thread_sense = true
arrival_num = 1
```

```
Thread 1:
B.barrier();
B.barrier();
```

```
thread_sense = true
arrival_num = 2
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 2
    sense = false

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = true
arrival_num = 1
```

```
Thread 1:
B.barrier();
B.barrier();
```

```
thread_sense = false
arrival_num = 2
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
thread_sense = true
arrival_num = 1
```

```
Thread 1:
B.barrier();
B.barrier();
```

```
thread_sense = false
arrival_num = 2
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 0
    sense = true

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
```

```
thread_sense = true
arrival_num = 1
Thread 1:
```

```
Thread 1:
B.barrier();
B.barrier();
```

Remember the issue! Thread 1 went to sleep around this time and thread 0 went into the barrier again!

```
thread_sense = false
arrival_num = 1
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
thread_sense = true
arrival_num = 1
```

```
Thread 1:
B.barrier();
```

B.barrier();

```
thread_sense = false
arrival_num = 1
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
thread_sense = true
arrival_num = 1
```

```
Thread 1:
B.barrier();
B.barrier();
```

both are waiting!, but thread 1 can leave

```
thread_sense = false
arrival_num = 1
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 1
    sense = true

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
```

```
thread_sense = false
arrival_num = 1
```

```
Thread 1:
B.barrier();
B.barrier();
```

both are waiting!, but thread 1 can leave

```
thread_sense = false
arrival_num = 1
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 1
    sense = true

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = false
    arrival_num = 1

Thread 1:
B.barrier();
B.barrier();
```

Thread 1 finishes the barrier

```
thread_sense = false
arrival_num = 1
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 1
    sense = true

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
```

```
thread_sense = false
arrival_num = 1

<u>Thread 1:</u>
```

```
Ihread 1:
B.barrier();
B.barrier();
```

```
thread_sense = false
arrival_num = 1
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 2
    sense = true

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
```

```
thread_sense = false
arrival_num = 2

<u>Thread 1:</u>
B.barrier();
```

B.barrier();

```
thread_sense = false
arrival_num = 1
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 2
    sense = true

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense!= thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = false
    arrival_num = 2

Thread 1:
B.barrier();
B.barrier();
```

```
thread_sense = false
arrival_num = 1
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 0
    sense = false

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
        while (sense!= thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = false
  arrival_num = 2

Thread 1:
B.barrier();
B.barrier();
```

```
thread_sense = false
arrival_num = 1
```

```
Thread 0:
B.barrier();
B.barrier();
```

```
num_threads == 2
    counter == 0
    sense = false

void barrier(int tid) {
    int arrival_num = atomic_fetch_add(&counter, 1);
    if (arrival_num == num_threads) {
        counter.store(0);
        sense = thread_sense[tid];
    }
    else {
      while (sense != thread_sense[tid]);
    }
    thread_sense[tid] = !thread_sense[tid];
}
```

```
thread_sense = false
  arrival_num = 2

Thread 1:
B.barrier();
B.barrier();
```

thread 0 can leave, thread 1 can leave and the barrier works as expected!

# On to the lecture!

# Schedule

#### Memory consistency models:

- Total store order
- Relaxed memory consistency
- Examples

# Memory Consistency

- We have been very strict about using atomic types in this class
  - and the methods (.load and .store)
  - why?
  - Architectures do very strange things with memory loads and stores
  - Compilers do too (but we won't talk too much about them today)
  - C++ gives us sequential consistency if we use atomic types and operations
  - What do we remember sequential consistency from?

# Sequential consistency for atomic memory

• Let's play our favorite game:

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

```
Is it possible for
t0 == 0 and t1 ==1
```

```
Thread 1:
int t0 = y.load();
int t1 = x.load();
```

```
Global variable:
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

```
x.store(1);
```

y.store(1);

```
Is it possible for
t0 == 0 and t1 ==1
```

```
Thread 1:
int t0 = y.load();
int t1 = x.load();
```

```
int t0 = y.load();
```

int 
$$t1 = x.load();$$

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

```
Is it possible for
  t0 == 0 \text{ and } t1 == 1
  int t0 = y.load();
    x.store(1);
     y.store(1);
int t1 = x.load();
```

```
Thread 1:
int t0 = y.load();
int t1 = x.load();
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

```
x.store(1);
```

#### **How about:**

```
to == 1 and t1 ==0
```

int 
$$t0 = y.load();$$

int 
$$t1 = x.load();$$

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

```
x.store(1);
```

no where for this one to go!

#### **How about:**

```
Is it possible for
 t0 == 1 \text{ and } t1 == 0
   y.store(1);
 int t0 = y.load();
int t1 = x.load();
```

```
Thread 1:
int t0 = y.load();
int t1 = x.load();
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Another test Can t0 == t1 == 0?
```

```
Thread 0:
x.store(1);
int t0 = y.load();
```

```
Thread 1:
y.store(1);
int t1 = x.load();
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Another test Can t0 == t1 == 0?
```

```
Thread 0:
x.store(1);
int t0 = y.load();
```

```
x.store(1);
```

```
int t0 = y.load();
```

```
Thread 1:
y.store(1);
int t1 = x.load();
```

```
y.store(1);
int t1 = x.load();
```

```
Global variable:
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Another test Can t0 == t1 == 0?
```

```
Thread 0:
x.store(1);
int t0 = y.load();
```

x.store(1);

```
int t0 = y.load();
y.store(1);
int t1 = x.load();
```

```
Thread 1:
y.store(1);
int t1 = x.load();
```

no place for this one!

#### C++

• Plain atomic accesses are documented to be sequentially consistent (SC)

- Why wasn't SC very good for concurrent data structures?
  - Compossibility: two objects that are SC might not be SC when used together
  - Programs contain only 1 shared memory though; no reason to compose different main memories.

## What about ISAs?

 Remember, it is important for us to understand how our code executes on the architecture to write high performing programs

- Lets think about x86
  - Instructions:
  - MOV %t0 [x] loads the value at x to register t0
  - MOV [y] 1 stores the value 1 to memory location y

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test Can t0 == t1 == 0?
```

#### Thread 0:

```
mov [x], 1
mov %t0, [y]
```

#### Thread 1:

```
mov [y], 1
mov %t1, [x]
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test Can t0 == t1 == 0?
```

```
Thread 0:
```

```
mov [x], 1
mov %t0, [y]
```

mov [x], 1

mov %t0, [y]

#### Thread 1:

```
mov [y], 1
mov %t1, [x]
```

mov [y], 1

mov %t1, [x]

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test

Can t0 == t1 == 0?
```

# Thread 0: mov [x], 1 mov %t0, [y]



# <u>Thread 1:</u> mov [y], 1 mov %t1, [x]

mov [y], 1

no place for this event!

# What if we actually run this code?

 We'd like to be able to compile atomic instructions just to regular ISA loads and stores

# Schedule

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency
  - Examples

mov [x], 1

mov %t0, [y]

Core 0

#### Thread 1:

mov [y], 1

mov %t1, [x]

Core 1

y:0

#### Thread 1:

mov %t0, [y]

mov %t1, [x]

Core 0

mov [x], 1

execute first instruction what happens to the stores?

Core 1

mov [y], 1

y:0

#### Thread 1:

mov %t0, [y]

X86 cores contain a store buffer; holds stores before going to main memory

mov %t1, [x]

Core 0

Store Buffer x:1

Store Buffer
y:1

Core 1

x:0 y:0

#### Thread 0: Thread 1: X86 cores contain a store buffer; holds stores before mov %t1, [x] mov %t0, [y] going to main memory Store Buffer Store Buffer Core 0 Core 1 x:1 y:1 eventually they flush to main memory x:0 **Main Memory** y:0

#### Thread 0: Thread 1: X86 cores contain a store buffer; holds stores before mov %t1, [x] mov %t0, [y] going to main memory Store Buffer Store Buffer Core 0 Core 1 x:1 eventually they flush to main memory x:0 **Main Memory** y:1

# Thread 0: mov [x], 1 mov %t0, [y]

rewind

#### Thread 1:

mov [y], 1

mov %t1, [x]

Core 0

Store Buffer

Store Buffer

Core 1

y:0

#### Thread 1:

mov %t1, [x]

mov %t0, [y]

Store Buffer

Core 0

mov [x], 1

execute first instruction



y:0 Main Memory

#### Thread 1:

mov %t0, [y]

values get stored in SB

mov %t1, [x]

Core 0

Store Buffer x:1

Store Buffer y:1

Core 1

y:0

#### Thread 1:

#### Execute next instruction







#### Values get loaded from memory



#### Thread 1:





Thread 1:

#### Store buffers are drained eventually



Thread 1:

Store buffers are drained eventually but we've already done our loads

Core 0



# Our first relaxed memory execution!

also known as weak memory behaviors

An execution that is NOT allowed by sequential consistency

- A memory model that allows relaxed memory executions is known as a relaxed memory model
  - X86 has a relaxed memory model due to store buffering
  - If you restrict yourself to use only default atomic operations, C++ has does
     NOT have a weak memory model

### Litmus tests

Small concurrent programs that check for relaxed memory behaviors

 Vendors have a long history of under documented memory consistency models

- Academics have empirically explored the memory models
  - Many vendors have unofficially endorsed academic models
  - X86 behaviors were documented by researchers before Intel!

## Litmus tests

This test is called "store buffering"

```
Thread 0:

mov [x], 1

mov %t0, [y]
```

```
Thread 1:

mov [y], 1

mov %t1, [x]
```

```
Another test Can t0 == t1 == 0?
```

# Restoring sequential consistency

• It is typical that relaxed memory models provide special instructions which can be used to disallow weak behaviors.

These instructions are called Fences

• The X86 fence is called mfence. It flushes the store buffer.

mov [x], 1

mfence

mov %t0, [y]

Core 0

Store Buffer

## Thread 1:

mov [y], 1

mfence

mov %t1, [x]

Store Buffer

Core 1

x:0 Main Memory y:0

## Thread 1:

mfence

mov %t1, [x]



Execute first instruction



y:0 Main Memory

# Thread 1:

mfence

mov %t0, [y]

Core 0

Store Buffer x:1

Values go into the store buffer

Store Buffer y:1 mfence

mov %t1, [x]

Core 1

y:0

Main Memory

# Thread 1:

#### Execute next instruction

mov %t0, [y]

Store Buffer
x:1



y:0 Main Memory

# Thread 1:



Thread 1:

store buffers are flushed

mov %t0, [y]

Core 0

Store Buffer

Store Buffer

Core 1

mov %t1, [x]

x:1

y:1

Main Memory

Thread 1:

#### execute next instruction





y:1 Main Memory

#### values are loaded from memory



Thread 1:

We don't get the problematic behavior: t0 != 0 and t1 != 0



# Next example

mov [x], 1

mov %t0, [x]

Core 0

Store Buffer

single thread same address

possible outcomes:

t0 = 1

t0 = 0

Which one do you expect?

x:0

y:0

Main Memory

mov [x], 1

How does this execute?

mov %t0, [x]

Core 0

Store Buffer

x:0

y:0

Main Memory

#### execute first instruction





Store the value in the store buffer

mov %t0, [x]

Core 0

Store Buffer x:1

y:0 Main Memory

#### Next instruction





Where to load??

Store buffer? Main memory?







Where to load??

Threads check store buffer before going to main memory

It is close and cheap to check.



# Memory Consistency

How to specify a relaxed memory model?

- We can do it operationally
  - by constructing a high-level machine and reasoning about operations through the machine.
  - or we can talk about instructions that are allowed to "break" program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test Can t0 == t1 == 0?
```

```
Thread 0:
mov [x], 1
mov %t0, [y]
```

```
Thread 1:
mov [y], 1
mov %t1, [x]
```

We will annotate instructions with S for store, and L for loads

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test Can t0 == t1 == 0?
```

```
Thread 0:
```

```
S:mov [x], 1
L:mov %t0, [y]
```

#### Thread 1:

```
S:mov [y], 1
L:mov %t1, [x]
```

We will annotate instructions with S for store, and L for loads

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test Can t0 == t1 == 0?
```

### Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

S:mov [x], 1

L:mov %t0, [y]

## Thread 1:

```
S:mov [y], 1
L:mov %t1, [x]
```

S:mov [y], 1

L:mov %t1, [x]

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test

Can t0 == t1 == 0?
```

#### Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```



#### Thread 1:

```
S:mov [y], 1
L:mov %t1, [x]
```

S:mov [y], 1

Now we make a new rule:

S(tores) followed by a L(oad) do not have to follow program order

```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Another test Can t0 == t1 == 0?

## Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```



Now we make a new rule:

S(tores) followed by a L(oad) do not have to follow program order

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test Can t0 == t1 == 0?
```

### Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

we can ignore this condition!!



#### Thread 1:

```
S:mov [y], 1
L:mov %t1, [x]
```

Now we can satisfy the condition!

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Another test Can 
$$t0 == t1 == 0$$
?

# Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

Lets peak under the hood here



int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

# Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads Another test Can t0 == t1 == 0?



int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

```
Thread 0:
S:mov [x], 1
```

L:mov %t0, [y]

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads Another test

Can t0 == t1 == 0?



```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test Can t0 == t1 == 0?
```

```
Thread 0:
```

```
S:mov [x], 1
L:mov %t0, [y]
```

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads we can ignore this condition!!



Thread 1:

S:mov [y], 1

# Questions

• Can stores be reordered with stores?

mov [x], 1

mov [y], 1

Core 0

Store Buffer

x:0

y:0

Main Memory

mov [y], 1

execute the first instruction

Core 0

Store Buffer

mov [x], 1

y:0 Main Memory

mov [y], 1

value goes into store buffer

Core 0

Store Buffer x:1

y:0 Main Memory

mov [y], 1

execute next instruction

Core 0



x:0

y:0

Main Memory

#### execute next instruction





value goes into the store buffer

Core 0

| Store Buffer |
|--------------|
| x:1          |
| y:1          |
|              |

y:0 Main Memory

### Thread 0:



### Thread 0:



### Thread 0:



## Questions

• Can stores be reordered with stores?

• How do we make rules about mfence?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Thread 0:

```
S:mov [x], 1
mfence
L:mov %t0, [y]
```

```
S:mov [x], 1
```

mfence

```
L:mov %t0, [y]
```

Another test Can t0 == t1 == 0?

#### Thread 1:

```
S:mov [y], 1
mfence
L:mov %t1, [x]
```

```
S:mov [y], 1
```

mfence

```
L:mov %t1, [x]
```

Rules: S(tores) followed by a L(oad) do not have to follow program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Thread 0:
```

S:mov [x], 1
mfence
L:mov %t0, [y]

So we can't reorder this instruction at all!

Another test Can t0 == t1 == 0?



#### Thread 1:

S:mov [y], 1
mfence
L:mov %t1, [x]

S:mov [y], 1

#### Rules:

S(tores) followed by a L(oad) do not have to follow program order.

### Rules

• Are we done?

Rules:

S(tores) followed by a L(oad) do not have to follow program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test
Can t0 == 0?
```

### Thread 0:

S:mov [x], 1

L:mov %t0, [x]

S:mov [x], 1

L:mov %t0, [x]

Rules:

S(tores) followed by a L(oad) do not have to follow program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Another test
Can t0 == 0?

### Thread 0:

S:mov [x], 1

L:mov %t0, [x]

S:mov [x], 1

where to put this store?



Rules:

S(tores) followed by a L(oad) do not have to follow program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Thread 0:

S:mov[x], 1

L:mov %t0, [x]

where to put this store?



Rules:

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

S(tores) cannot be reordered past L(oads) from the same address

### TSO - Total Store Order

#### **Rules:**

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

S(tores) cannot be reordered past L(oads) from the same address

## Schedule

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency
  - Examples

We can specify them in terms of what reorderings are allowed



We can specify them in terms of what reorderings are allowed

|                 |   | memory access 0 |    |
|-----------------|---|-----------------|----|
|                 |   | L               | S  |
| memory access 1 | L | NO              | NO |
|                 | S | NO              | NO |

#### **Sequential Consistency**

We can specify them in terms of what reorderings are allowed

|                 |   | memory access 0 |                      |
|-----------------|---|-----------------|----------------------|
|                 |   | L               | S                    |
| memory access 1 | L | NO              | Different<br>address |
|                 | S | NO              | NO                   |

TSO - total store order

We can specify them in terms of what reorderings are allowed

|                 |   | memory access 0 |   |
|-----------------|---|-----------------|---|
|                 |   | L               | S |
| memory access 1 | L | ?               |   |
|                 | S | ?               | ? |

#### Weaker models?

We can specify them in terms of what reorderings are allowed



Allows stores to drain from the store buffer in any order

We can specify them in terms of what reorderings are allowed



#### **RMO - Relaxed Memory Order**

If memory access 0 appears before memory access 1 in program order, can it bypass program order?

Very relaxed model!

• FENCE: can always restore order using fences. Accesses cannot be reordered past fences!



#### **Any Memory Model**

If memory access 0 appears before memory access 1 in program order, and there is a FENCE between the two accesses, can it bypass program order?

## Schedule

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency
  - Examples

```
int x[1] = \{0\};
int y[1] = \{0\};
```

First thing: change our syntax to pseudo code

#### Thread 0:

```
L:mov %t0, [y]
S:mov [x], 1
```

#### Thread 1:

```
L:mov %t1, [x]
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

First thing: change our syntax to pseudo code You should be able to find natural mappings to any ISA

#### Thread 0:

```
L:%t0 = load(y)
```

S:store(x,1)

### *Thread 1:*

```
L:%t1 = load(x)
```

S:store(y,1)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

#### Thread 0:

```
L:%t0 = load(y)
```

S:store(x,1)

#### Thread 1:

L:%t1 = load(x)

S:store(y,1)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks and try for sequential consistency

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x,1)

$$L:%t0 = load(y)$$

S:store(x,1)

#### Thread 1:

$$L:%t1 = load(x)$$

S:store(y,1)

$$L:%t1 = load(x)$$

S:store(y,1)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks



L:%t0 = load(y)



#### Thread 1:

L:%t1 = load(x)

S:store(y,1)

Not allowed under sequential consistency!

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x,1)

L: %t0 = load(y)

respect program order

S:store(x,1)

L:%t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

memory access 0

| NO | Different<br>address |
|----|----------------------|
| NO | NO                   |

What about TSO?

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x,1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L:%t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

memory access 0

L

S

| NO | Different<br>address |
|----|----------------------|
| NO | NO                   |

What about TSO? NOT ALLOWED!

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x,1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

*Thread 1:* 

L:%t1 = load(x)

S:store(y,1)

memory access 0

.

NO Different address

NO Different address

What about PSO?

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x,1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

memory access 0

L

NO Different address

NO Different address

What about PSO? NO!

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x,1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L:%t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

memory access 0

-

S

| YES       | Different<br>address |  |
|-----------|----------------------|--|
| different | Different            |  |
| address   | address              |  |

What about RMO?

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x,1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

memory access 0

L

S

YES

Different address

different address

address

What about RMO?

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x,1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

memory access 0

Different YES address different Different address address

What about RMO? YES!

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x,1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

memory access 0

L

.

YES

Different address

different address

address

address

How do we disallow the behavior in RMO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

fence

S:store(x,1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

memory access 0

S

YES

Different address

different address

Different address

How do we disallow the behavior in RMO?

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks



address

address

How do we disallow the behavior in RMO?

int  $x[1] = \{0\};$ int  $y[1] = \{0\};$ 

Question: can t0 == t1 == 1?

Get out our lego bricks



fence

respect program order

fence

S:store(x,1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

Different YES address

different Different address address

Now we cannot break program order past the fence! Are we done?

L:%t0 = load(y)

S:store(x,1)

L:%t0 = load(y)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks



address

address

Now we cannot break program order past the fence! Are we done?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

L:%t0 = load(y)

Question: can t0 == t1 == 1?

Get out our lego bricks



respect program order

fence

S:store(x,1)

$$L:%t1 = load(x)$$

S:store(y,1)

satisfy constraints

memory access 1

Now we cannot break program order past the fence! Are we done? The behavior is no longer allowed

## Thread 1:

L:%t1 = load(x)

fence

S:store(y,1)

memory access 0

YES

Different address

different address

Different address

# One more example

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?

### Thread 0:

```
S:store(x,1)
S:store(y,1)
```

S:store(x,1)

S:store(y,1)

## Thread 1:

L:%t0 = load(y) S:%t1 = load(x)

L:%t0 = load(y)

L:%t1 = load(x)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Thread 0:

S:store(x,1) S:store(y,1)

S:store(x,1)

S:store(y,1)

Question: can t0 == 1 and t1 == 0?

start off thinking about sequential consistency

## Thread 1:

L:%t0 = load(y)

S:%t1 = load(x)

L:%t0 = load(y)

L: %t1 = load(x)

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

start off thinking about sequential consistency

## Thread 0:

```
S:store(x,1)
S:store(y,1)
```

S:store(x,1)

respect program order

## S:store(y,1)

$$L:%t0 = load(y)$$

L:%t1 = load(x)

satisfy constraints

#### Thread 1:

$$L:%t0 = load(y)$$

$$S:%t1 = load(x)$$

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about TSO?

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about TSO? NO

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about PSO?

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about PSO?

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about PSO? YES

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



Now it is disallowed in PSO

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about RMO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?

### Thread 0:

```
S:store(x,1)
```

fence

S:store(y,1)



```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == 1 and t1 == 0?
```

```
Thread 0:
S:store(x,1)
fence
S:store(y,1)
```



What about RMO? The loads can be reordered also!

```
Global variable:
```

S:store(y,1)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?

```
Thread 0:
S:store(x,1)
fence
```

$$L:%t1 = load(x)$$



What about RMO? add a fence

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?

## Thread 0:

S:store(x,1)

fence

S:store(y,1)



Now the relaxed behavior is disallowed

## • Historic Chips:

- X86: TSO
  - Surprising robost
  - mutexes and concurrent data structures generally seem to work
  - watch out for store buffering
- IBM Power and ARM
  - Very relaxed. Similar to RMO with even more rules
  - Mutexes and data structures must be written with care
  - ARM recently strengthened theirs

## Historic Chips:

- X86: TSO
  - Surprising robost
  - mutexes and concurrent data structures generally seem to work
  - watch out for store buffering
- IBM Power and ARM
  - Very relaxed. Similar to RMO with even more rules
  - Mutexes and data structures must be written with care
  - ARM recently strengthened theirs

Companies have a history of providing insufficient documentation about their rules: academics have then gone and figured it out!

Getting better these days

- Modern Chips:
  - RISC-V: two specs: one similar to TSO, one similar to RMO
  - Apple M1: toggles between TSO and weaker
  - Vulkan does not provide any fences that provide S L ordering

- PSO and RMO were never implemented widely
  - I have not met anyone who knows of any RMO taped out chip
  - They are part of SPARC ISAs (i.e. RISC-V before it was cool)
  - These memory models might have been part of specialized chips
- Interestingly:
  - Early Nvidia GPUs appeared to informally implement RMO
- Other chips have very strange memory models:
  - Alpha DEC basically no rules

## See you on Wednesday!

• Get HW 3 in!

Watch out for HW2 grades

Finishing up memory models on Wednesday