# Response Time Analysis of Synchronous Data Flow Programs on a Many-Core Processor

Hamza Rihani, Matthieu Moy, Claire Maiza, Robert I. Davis, Sebastian Altmeyer

RTNS'16, October 19, 2016













Single-core code generation

```
int main_app(i1, i2)
{
    na = NA(i1);
    ne = NE(i2);
    nb = NB(na);
    nd = ND(na);
    nf = NF(ne);
    o = NC(nb,nd,nf);
    return o;
}
```

High level representation

static non-preemptive scheduling



Multi/Many-core code generation



static non-preemptive scheduling

High level representation





Multi/Many-core code generation



static non-preemptive scheduling

High level representation





High level representation

Multi/Many-core code generation



static non-preemptive scheduling



#### Contributions

1 Precise accounting for interference on shared resources in a many-core processor



#### Contributions

1 Precise accounting for interference on shared resources in a many-core processor





| 8 shared memory banks | NoC Rx NoC T   |                | Tx              | 8 s             |                       |
|-----------------------|----------------|----------------|-----------------|-----------------|-----------------------|
|                       | RM             |                | DSU             |                 | 8 shared memory banks |
| mor                   | $P_6$          | $P_7$          | $P_{14}$        | $P_{15}$        | d me                  |
| Ĕ P                   | $P_4$          | $P_5$          | P <sub>12</sub> | $P_{13}$        | mor                   |
| hare                  | P2             | P <sub>3</sub> | P <sub>10</sub> | P <sub>11</sub> | y bai                 |
| 8 s                   | P <sub>0</sub> | P <sub>1</sub> | P <sub>8</sub>  | $P_9$           | nks                   |

#### Contributions

1 Precise accounting for interference on shared resources in a many-core processor







80

40

00

**Q** task of interest

#### Outline

1 Motivation and Context

- 2 Models Definition
  - Architecture Model
  - Execution Model
  - Application Model
- 3 Multicore Response Time Analysis of SDF Programs
- 4 Evaluation
- 5 Conclusion and Future Work

# Outline

Motivation and Context

- 2 Models Definition
  - Architecture Model
  - Execution Model
  - Application Model

3 Multicore Response Time Analysis of SDF Programs

- 4 Evaluation
- 5 Conclusion and Future Work



- Kalray MPPA 256 Bostan
- 16 compute clusters + 4 I/O clusters
- Dual NoC



#### Per cluster:

- 16 cores + 1 Resource Manager
- NoC Tx, NoC Rx, Debug Unit
- 16 shared memory banks (total size: 2 MB)



#### Per cluster:

- 16 cores + 1 Resource Manager
- NoC Tx, NoC Rx, Debug Unit
- 16 shared memory banks (total size: 2 MB)
- Multi-level bus arbiter per memory bank



ω

shared

memory

banks









- Tasks mapping on cores
- Static non-preemptive scheduling



- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks



(128 KB)



- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks

• Interference from communications







- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation
  - different tasks go to different memory banks
- Interference from communications
- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank

Single phase: execute and write data.

memory access patter







- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation
  - different tasks go to different memory banks
- Interference from communications
- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank





- Direct Acyclic Task Graph
- Mono-rate (or at least harmonic rates)
- Fixed mapping and execution order



- Direct Acyclic Task Graph
- Mono-rate (or at least harmonic rates)
- Fixed mapping and execution order

Each task  $\tau_i$ :





- Direct Acyclic Task Graph
- Mono-rate (or at least harmonic rates)
- Fixed mapping and execution order
   Each task τ<sub>i</sub>:
- Processor Demand, Memory Demand





- Direct Acyclic Task Graph
- Mono-rate (or at least harmonic rates)
- Fixed mapping and execution order
   Each task τ<sub>i</sub>:
- Processor Demand, Memory Demand
- Release date  $(rel_i)$ , response time  $(R_i)$





- Direct Acyclic Task Graph
- Mono-rate (or at least harmonic rates)
- Fixed mapping and execution order
   Each task τ<sub>i</sub>:
- Processor Demand, Memory Demand
- Release date  $(rel_i)$ , response time  $(R_i)$





- Direct Acyclic Task Graph
- Mono-rate (or at least harmonic rates)
- Fixed mapping and execution order
   Each task τ<sub>i</sub>:
- Processor Demand, Memory Demand
- Release date  $(rel_i)$ , response time  $(R_i)$



# Outline

Motivation and Context

- 2 Models Definition
  - Architecture Model
  - Execution Model
  - Application Model

#### 3 Multicore Response Time Analysis of SDF Programs

- 4 Evaluation
- 5 Conclusion and Future Work

• Response Time 
$$R = PD + I^{BUS}(R)$$









• Recursive formula  $\Rightarrow$  fixed-point algorithm.



- Recursive formula  $\Rightarrow$  fixed-point algorithm.
- Multiple shared resources (memory banks)



- Recursive formula  $\Rightarrow$  fixed-point algorithm.
- Multiple shared resources (memory banks)

$$I^{BUS}(R) = \sum_{b \in B} I^{BUS}_b(R)$$

where B: a set of memory banks



- Recursive formula  $\Rightarrow$  fixed-point algorithm.
- Multiple shared resources (memory banks)

$$I^{BUS}(R) = \sum_{b \in B} I^{BUS}_b(R)$$

where B: a set of memory banks

Q Requires a model of the bus arbiter

#### Model of the MPPA Bus



 $I_h^{\text{BUS}}$ : delay from all accesses + concurrent ones









$$\begin{split} I^{\text{BUS}}_b: \text{ delay from all accesses } + \text{ concurrent ones} \\ S^b_i: \text{ number of accesses of task } \tau_i \text{ to bank } b \\ S^b_i = \text{ Memory Demand to bank } b \\ A^{y,b}_i: \text{ number of concurrent accesses from core } y \text{ to bank } b \end{split}$$





$$\begin{split} I^{\text{BUS}}_b: \text{ delay from all accesses } + \text{ concurrent ones} \\ S^b_i: \text{ number of accesses of task } \tau_i \text{ to bank } b \\ S^b_i = \text{ Memory Demand to bank } b \\ A^{y,b}_i: \text{ number of concurrent accesses from core } y \text{ to bank } b \end{split}$$





$$\begin{split} I^{\text{BUS}}_b: \text{ delay from all accesses } + \text{ concurrent ones} \\ S^b_i: \text{ number of accesses of task } \tau_i \text{ to bank } b \\ S^b_i = \text{ Memory Demand to bank } b \\ A^{y,b}_i: \text{ number of concurrent accesses from core } y \text{ to bank } b \end{split}$$

$$Lv_{1} = S_{i}^{b}$$

$$Lv_{2} = Lv_{1} + \sum_{y=1}^{15} \min(A_{i}^{y,b}, Lv_{1})$$

$$Lv_{3} = Lv_{2} + \min(A_{i}^{G2,b}, Lv_{2})$$

$$Lv_{4} = Lv_{4} + A_{i}^{G3,b}$$

 $I_b^{BUS} = Lv_4 \times Bus Delay$ 





$$\begin{split} I^{\mathsf{BUS}}_b\colon \text{delay from all accesses} &+ \text{ concurrent ones} \\ S^b_i\colon \text{number of accesses of task } \tau_i \text{ to bank } b \\ S^b_i &= \text{ Memory Demand to bank } b \\ A^{y,b}_i: \text{ number of concurrent accesses from core } y \text{ to bank } b \\ A^{y,b}_i &= \sum \text{ overlapping concurrent accesses} \end{split}$$

$$Lv_{1} = S_{i}^{b}$$

$$Lv_{2} = Lv_{1} + \sum_{y=1}^{15} \min(A_{i}^{y,b}) Lv_{1}$$

$$Lv_{3} = Lv_{2} + \min(A_{i}^{G2,b}) Lv_{2}$$

$$Lv_{4} = Lv_{4} + (A_{i}^{G3,b})$$

 $I_b^{\scriptscriptstyle BUS} = Lv_4 \times \text{Bus Delay}$ 





$$\begin{split} I^{\text{BUS}}_b\colon \text{delay from all accesses } + \text{ concurrent ones} \\ S^b_i\colon \text{number of accesses of task } \tau_i \text{ to bank } b \\ S^b_i = \text{Memory Demand to bank } b \\ A^{y,b}_i\colon \text{number of concurrent accesses from core } y \text{ to bank } b \\ A^{y,b}_i = \sum \text{ overlapping concurrent accesses} \end{split}$$





1 Start with initial release dates.







**1** Start with initial release dates.

2 Compute response times

•••





1 Start with initial release dates.

2 Compute response times

... ...



- 1 Start with initial release dates.
- 2 Compute response times
  - ... ... a fixed-point is reached!





- 1 Start with initial release dates.
- 2 Compute response times
  - ... ... a fixed-point is reached!
- 3 Update the release dates.





- 1 Start with initial release dates.
- 2 Compute response times
  - ... ... a fixed-point is reached!
- 3 Update the release dates.
- 4 Repeat until no release date changes (another fixed-point iteration).





- 1 Start with initial release dates.
- 2 Compute response times
  - ... ... a fixed-point is reached!
- 3 Update the release dates.
- 4 Repeat until no release date changes (another fixed-point iteration).







- Convergence of the  $1^{st}$  fixed-point iteration:
  - Monotonic and bounded







13 / 21



http://www-verimag.imag.fr/TR/TR-2016-1.pdf



http://www-verimag.imag.fr/TR/TR-2016-1.pdf





http://www-verimag.imag.fr/TR/TR-2016-1.pdf



Full proof in our technical report: http://www-verimag.imag.fr/TR/TR-2016-1.pdf





# Outline

Motivation and Context

- 2 Models Definition
  - Architecture Model
  - Execution Model
  - Application Model
- 3 Multicore Response Time Analysis of SDF Programs
- 4 Evaluation
- 5 Conclusion and Future Work



• Flight management system controller

<sup>&</sup>lt;sup>1</sup> Pagetti et al., RTAS 2014



- Flight management system controller
- Receive from sensors and transmit to actuators

<sup>&</sup>lt;sup>1</sup> Pagetti et al., RTAS 2014





- Flight management system controller
- Receive from sensors and transmit to actuators
- Assumptions:

Tasks are mapped on 5 cores Debug Support Unit is disabled Context switches are over-approximated constants

<sup>&</sup>lt;sup>1</sup> Pagetti et al., RTAS 2014





- Flight management system controller
- Receive from sensors and transmit to actuators
- Assumptions:

Tasks are mapped on 5 cores Debug Support Unit is disabled Context switches are over-approximated constants

<sup>&</sup>lt;sup>1</sup> Pagetti et al., RTAS 2014

| Task       | Processor Demand (cycles) | Memory Demand (accesses) |
|------------|---------------------------|--------------------------|
| altitude   | 275                       | 22                       |
| az_filter  | 274                       | 22                       |
| h_filter   | 326                       | 24                       |
| va_control | 303                       | 24                       |
| va_filter  | 301                       | 23                       |
| vz_control | 320                       | 25                       |
| vz_filter  | 334                       | 25                       |

Table: Task profiles of the FMS controller

• Profile obtained from measurements

| Task       | Processor Demand (cycles) | Memory Demand (accesses) |
|------------|---------------------------|--------------------------|
| altitude   | 275                       | 22                       |
| az_filter  | 274                       | 22                       |
| h_filter   | 326                       | 24                       |
| va_control | 303                       | 24                       |
| va_filter  | 301                       | 23                       |
| vz_control | 320                       | 25                       |
| vz_filter  | 334                       | 25                       |

Table: Task profiles of the FMS controller

- Profile obtained from measurements
- Memory Demand: data and instruction cache misses + communications

| Task       | Processor Demand (cycles) | Memory Demand (accesses) |
|------------|---------------------------|--------------------------|
| altitude   | 275                       | 22                       |
| az_filter  | 274                       | 22                       |
| h_filter   | 326                       | 24                       |
| va_control | 303                       | 24                       |
| va_filter  | 301                       | 23                       |
| vz_control | 320                       | 25                       |
| vz_filter  | 334                       | 25                       |

Table: Task profiles of the FMS controller

- Profile obtained from measurements
- · Memory Demand: data and instruction cache misses + communications
- Moreover:
  - NoC Rx: writes 5 words
  - NoC Tx: reads 2 words

| Task       | Processor Demand (cycles) | Memory Demand (accesses) |
|------------|---------------------------|--------------------------|
| altitude   | 275                       | 22                       |
| az_filter  | 274                       | 22                       |
| h_filter   | 326                       | 24                       |
| va_control | 303                       | 24                       |
| va_filter  | 301                       | 23                       |
| vz_control | 320                       | 25                       |
| vz_filter  | 334                       | 25                       |

Table: Task profiles of the FMS controller

- Profile obtained from measurements
- Memory Demand: data and instruction cache misses + communications
- Moreover:
  - NoC Rx: writes 5 words
  - *NoC Tx*: reads 2 words

Experiments: Find the smallest schedulable hyper-period

### **Evaluation: Experiments**



Smallest schedulable hyper-period

### **Evaluation: Experiments**



 Pessimistic assumption: High priority tasks are bounded by 1 access per bank

### **Evaluation: Experiments**



 Pessimistic assumption: High priority tasks are bounded by 1 access per bank



 Pessimistic assumption: High priority tasks are bounded by 1 access per bank



E5: All accesses interfere

E4, E3: We don't use the release dates

E2, E1: Our approach. We use the release dates

Taking into account the memory banks improves the analysis with a factor in [1.77,2.52]



Smallest schedulable hyper-period

|                 | E5/E1 | E5/E2 | E3/E1 | E4/E2 | E2/E1 | E4/E3 |
|-----------------|-------|-------|-------|-------|-------|-------|
| MPPA            | 4.15  | 4.12  | 1.68  | 1.29  | ~1.01 | 0.77  |
| RR              | 3.3   | 3.29  | 1.24  | 1.13  | ~1.01 | 0.91  |
| Speedup factors |       |       |       |       |       |       |











# Outline

Motivation and Context

- 2 Models Definition
  - Architecture Model
  - Execution Model
  - Application Model
- 3 Multicore Response Time Analysis of SDF Programs
- 4 Evaluation
- 5 Conclusion and Future Work

• A response time analysis of SDF on the Kalray MPPA 256

• A response time analysis of SDF on the Kalray MPPA 256

• Given:

- Task profile
- Mapping of Tasks
- Execution Order

• A response time analysis of SDF on the Kalray MPPA 256

- Given:
  - Task profile
  - Mapping of Tasks
  - Execution Order
- We compute:
  - Tight response times taking into account the interference.
  - Release dates respecting the dependency constraints.

 $\circ\,$  A response time analysis of SDF on the Kalray MPPA 256  $\,$ 

- Given:
  - Task profile
  - Mapping of Tasks
  - Execution Order
- We compute:
  - Tight response times taking into account <u>the interference</u>

model of

the multi-level arbiter

• Release dates respecting the dependency constraints.

• A response time analysis of SDF on the Kalray MPPA 256

- Given:
  - Task profile
  - Mapping of Tasks
  - Execution Order



• A response time analysis of SDF on the Kalray MPPA 256

- Given:
  - Task profile
  - Mapping of Tasks
  - Execution Order



• Model of the Resource Manager.



• Model of the Resource Manager.

tighter estimation of context switches and other interrupts



• Model of the Resource Manager.

tighter estimation of context switches and other interrupts

• Model of the NoC accesses.



Model of the Resource Manager.

• Model of the NoC accesses.



tighter estimation of

use the output of

 Model of the Resource Manager.
 use the output of any NoC analysis

- Model of the NoC accesses.
- Memory access pipelining.



 Model of the Resource Manager.
 Model of the NoC accesses.
 use the output of any NoC analysis
 current assumption: bus delay is 10 cycles

Memory access pipelining.



• Model of the Resource Manager.

Model of the NoC accesses.

current assumption: bus delay is 10 cycles

use the output of

tighter estimation of context switches and

other interrupts

- Memory access pipelining.
- Model Blocking and non-blocking accesses.



• Model of the Resource Manager.

• Model of the NoC accesses.

current assumption: bus delay is 10 cycles

use the output of

- Memory access pipelining.
- Model Blocking and non-blocking accesses.



reads are blocking writes are non-blocking

tighter estimation of context switches and

other interrupts



# Questions?

#### BACKUP

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10$  (response time in isolation)

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10 \text{ (response time in isolation)}$  $R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$ 

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10$  (response time in isolation)

 $R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$ 

 $R_2 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 = 80$ 

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10$  (response time in isolation)

 $R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$ 

 $R_2 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 = 80$ 

 $R_3 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 + 0 = 80$  (fixed-point)

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

# The Global Picture

