

#### Overview on modern computer architectures: the software crisis

#### Ivan Girotto – igirotto@ictp.it

Information & Communication Technology Section (ICTS), International Centre for Theoretical Physics (ICTP)





# How fast is my CPU?!

- CPU power is measured in FLOPS: the number of floating-point operations per second
  - FLOPS = #cores x clock x FLOP/cycle
- FLOP/cycle is the number of floating-point operations completed per cycle (each fused multiply-add, FMA, counts as two operations)
  - architectural limit
  - also depends on the supported instruction set
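As a minimal sketch of the formula above, the function below multiplies the three factors. The function name and the sample figures in the usage note are hypothetical; nothing here queries real hardware.

```c
#include <assert.h>

/* Theoretical peak in GFLOP/s = cores x clock (GHz) x FLOP per cycle per core.
   All three inputs are supplied by the caller (hypothetical helper). */
double peak_gflops(int cores, double clock_ghz, int flop_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flop_per_cycle > 0);
    return cores * clock_ghz * flop_per_cycle;
}
```

For an assumed CPU with 8 cores at 2.5 GHz and 16 FLOP/cycle, `peak_gflops(8, 2.5, 16)` gives 320 GFLOP/s.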





# The CPU Memory Hierarchy



Ivan Girotto - igirotto@ictp.it San José, 1 August 2017






# **Performance Metrics**

- When all CPU components work at maximum speed, the CPU reaches its *peak performance*
  - Tech specs normally describe the theoretical peak
  - Benchmarks measure the real peak
  - Applications show the real performance value
- CPU performance is measured as floating-point operations per second (GFLOP/s)
- In many cases, however, real performance is mostly limited by memory bandwidth (GBytes/s)
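One common way to make the bandwidth limit concrete is a roofline-style estimate: a kernel cannot run faster than the memory bandwidth multiplied by the number of floating-point operations it performs per byte moved. The sketch below is an assumption-laden illustration, not a measurement; the function name and all sample numbers are hypothetical.

```c
#include <assert.h>

/* Upper bound on sustained GFLOP/s for a kernel that performs
   flop_per_byte floating-point operations per byte moved to/from memory.
   The result is the smaller of the compute peak and the memory-bound rate. */
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flop_per_byte)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && flop_per_byte > 0.0);
    double memory_bound = bandwidth_gbs * flop_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For example, an in-place scaling of a double-precision array does 1 FLOP per 16 bytes (one 8-byte load, one 8-byte store), i.e. 0.0625 FLOP/byte. With an assumed 64 GBytes/s of bandwidth, the bound is 4 GFLOP/s, far below an assumed 320 GFLOP/s peak.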





# Cache Memory

- Designed to exploit temporal and spatial locality
- Data is transferred to cache in blocks of fixed size, called *cache lines*.
- A LOAD/STORE operation can lead to two different scenarios:
  - cache hit
  - cache miss

    Loop: load   r1, A(i)
          load   r2, s
          mult   r3, r2, r1
          store  A(i), r3
          branch => loop
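In C, the loop above corresponds roughly to an in-place scaling of an array. The sketch below uses a hypothetical function name. Assuming 64-byte cache lines and 8-byte doubles (typical, but hardware dependent), eight consecutive elements share one line, so a miss on `a[i]` can be followed by up to seven hits.

```c
#include <assert.h>
#include <stddef.h>

/* Scale every element of a by s, in place.
   Consecutive iterations touch consecutive addresses (spatial locality);
   s is reused in every iteration (temporal locality). */
void scale(double *a, size_t n, double s)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        a[i] = s * a[i];
}
```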

[Figure: CPU registers, cache, and main memory]






# The CPU Memory Hierarchy



(a) Memory hierarchy for a server





# Data Memory Access

- Data ordering
- Minimize data transfers
- Avoid complex data structures within computationally intensive kernels
- Define constants and help the compiler
- Exploit the memory hierarchy
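Data ordering can be illustrated with a matrix stored row by row, as C does. Both functions below (hypothetical names) return the same sum, but the first walks memory contiguously while the second jumps `n_cols` elements at every step, which typically wastes most of each cache line for large matrices.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an n_rows x n_cols matrix stored in row-major (C) order.
   The inner loop has stride 1: each cache line is fully used. */
double sum_row_order(const double *m, size_t n_rows, size_t n_cols)
{
    assert(m != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n_rows; ++i)
        for (size_t j = 0; j < n_cols; ++j)
            total += m[i * n_cols + j];
    return total;
}

/* Same sum with the loops swapped: the inner loop has stride n_cols. */
double sum_column_order(const double *m, size_t n_rows, size_t n_cols)
{
    assert(m != NULL);
    double total = 0.0;
    for (size_t j = 0; j < n_cols; ++j)
        for (size_t i = 0; i < n_rows; ++i)
            total += m[i * n_cols + j];
    return total;
}
```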






# The Classical Model

John von Neumann








## Sequential Processing








# Pipelining

| Clock Period  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | N |
|---------------|---|---|---|---|---|---|---|---|---|-----|---|
| Instruction 1 | F | D | L | E | S |   |   |   |   |     |   |
| Instruction 2 |   | F | D | L | E | S |   |   |   |     |   |
| Instruction 3 |   |   | F | D | L | E | S |   |   |     |   |
| Instruction 4 |   |   |   | F | D | L | E | S |   |     |   |
| Instruction 5 |   |   |   |   | F | D | L | E | S |     |   |
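The table can be summarized by a simple count, assuming an ideal pipeline with no stalls: the first instruction needs one cycle per stage, and each further instruction completes one cycle later. The function names below are hypothetical.

```c
#include <assert.h>

/* Cycles to complete n instructions when each one must finish all
   stages before the next one starts (no overlap). */
unsigned long sequential_cycles(unsigned long n, unsigned long stages)
{
    assert(n > 0 && stages > 0);
    return n * stages;
}

/* Cycles to complete n instructions on an ideal pipeline:
   stages cycles for the first, plus one cycle for each of the others. */
unsigned long pipelined_cycles(unsigned long n, unsigned long stages)
{
    assert(n > 0 && stages > 0);
    return stages + n - 1;
}
```

With the 5 stages and 5 instructions of the table, `pipelined_cycles(5, 5)` is 9 (the last S falls in clock period 9), against 25 cycles without overlap.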





# Superscalar Execution

| Clock Period  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | N |
|---------------|---|---|---|---|---|---|---|---|---|-----|---|
| Instruction 1 | F | D | L | E | S |   |   |   |   |     |   |
| Instruction 2 | F | D | L | E | S |   |   |   |   |     |   |
| Instruction 3 |   | F | D | L | E | S |   |   |   |     |   |
| Instruction 4 |   | F | D | L | E | S |   |   |   |     |   |
| Instruction 5 |   |   | F | D | L | E | S |   |   |     |   |
| Instruction 6 |   |   | F | D | L | E | S |   |   |     |   |
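The same counting argument extends to the table above, where two instructions enter the pipeline in every clock period. The sketch assumes an ideal machine with no dependencies between instructions; the function name and the `width` parameter are hypothetical.

```c
#include <assert.h>

/* Cycles to complete n instructions on an ideal pipeline that issues
   `width` instructions per cycle: instructions advance in groups. */
unsigned long superscalar_cycles(unsigned long n, unsigned long stages,
                                 unsigned long width)
{
    assert(n > 0 && stages > 0 && width > 0);
    unsigned long groups = (n + width - 1) / width; /* ceil(n / width) */
    return stages + groups - 1;
}
```

For the 6 instructions, 5 stages, and issue width 2 shown in the table, `superscalar_cycles(6, 5, 2)` is 7; with width 1 the same instructions would need 10 cycles.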





# The Inside Parallelism

#### Scalar Mode







# The Inside Parallelism

#### SIMD Mode
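As a rough illustration of the two modes, the sketch below adds two arrays first one element at a time, then in groups of `LANES` independent elements. The second form only mimics in plain C what a vector instruction does in hardware; `LANES` and the function names are hypothetical, and the actual vector width depends on the instruction set.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* hypothetical vector width, in elements */

/* Scalar mode: one addition per loop iteration. */
void add_scalar(const double *a, const double *b, double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* SIMD-style: each outer iteration handles LANES independent elements,
   the shape a vectorizing compiler maps onto one vector instruction.
   Leftover elements are handled by a scalar tail loop. */
void add_simd_style(const double *a, const double *b, double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (size_t lane = 0; lane < LANES; ++lane)
            c[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```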









#### Multi-core System vs. Serial Programming

Xeon E5650 hex-core processors (12 GB RAM)





# Threading and Vectorization







# Aid Automatic Vectorization

- Avoid data dependences
- Data alignment
  - memory allocation using posix\_memalign
- Avoid pointer aliasing
- Avoid conditional statements
- Use compiler auto-vectorization and analyze the compiler report (every compiler has its own flags)
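Several of these hints can be combined in one small kernel. The sketch below is one possible arrangement, not a prescribed recipe: `alloc_aligned`, `is_aligned`, and `axpy` are hypothetical helper names, and the 64-byte alignment in the usage note is an assumption that depends on the target CPU.

```c
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n doubles aligned to `alignment` bytes (a power of two that is
   a multiple of sizeof(void *)). Returns NULL on failure; release with free(). */
double *alloc_aligned(size_t n, size_t alignment)
{
    void *p = NULL;
    if (posix_memalign(&p, alignment, n * sizeof(double)) != 0)
        return NULL;
    return p;
}

/* Check that a pointer is a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* y = a*x + y. `restrict` promises the compiler that x and y do not
   overlap (no aliasing); the body has no branch and no dependence
   between iterations, so it is a good candidate for vectorization. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

A caller would allocate both arrays with, say, `alloc_aligned(n, 64)`, fill them, and call `axpy`. With GCC, compiling with `-O3 -fopt-info-vec` reports which loops were vectorized; other compilers use different flags.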





# Conclusion

- The development of today's computer architectures is based on increasing complexity: the software crisis!
- Scientific software developers for high-performance computing need to master this complexity:
  - avoid useless instructions, branches (if possible) and expensive operations
  - enhance data locality and data reuse (memory throughput)
  - aid the compiler with automatic vectorization
  - use compiler optimization (-O3)
  - make use of HPC libraries