









#### Trends in Energy, Power and Thermal Efficiency of HPC systems

Andrea Bartolini, PhD

http://www-micrel.deis.unibo.it/sitonew/ a.bartolini@unibo.it

Università di Bologna & ETH Zurich

http://www.iis.ee.ethz.ch/ barandre@iis.ee.ethz.ch











#### Outline

- Power in digital systems
- Dynamic Power Management
- Power Management & Heterogeneity
- Characterization of thermal effects



#### **Dynamic Power**







#### Sub-threshold Leakage Current



- Exponential  $\downarrow$  with  $\downarrow$  V<sub>ds</sub>
- Exponential  $\downarrow$  with  $\uparrow$  V<sub>TH</sub>
- Exponential  $\downarrow$  with  $\downarrow$  T
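The three trends above are consistent with the textbook expression for sub-threshold current, given here as a sketch rather than the exact model behind the slide. The symbols $I_0$ (process-dependent prefactor), $n$ (sub-threshold slope factor), $\eta$ (DIBL coefficient) and $v_T$ (thermal voltage) do not appear in the slides and are introduced for illustration:

$$I_{sub} = I_0 \, e^{\frac{V_{gs} - V_{TH} + \eta V_{ds}}{n\, v_T}} \left(1 - e^{-\frac{V_{ds}}{v_T}}\right), \qquad v_T = \frac{k_B T}{q}$$

For an off transistor ($V_{gs} = 0$) the exponent is negative, so a lower $V_{ds}$, a higher $V_{TH}$, or a lower $T$ (smaller $v_T$) each reduce the leakage exponentially.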



#### Is it the same across the whole die?







#### **Alpha-Power Thermal Model**

Delay:  

$$D_{p} = \frac{C_{out} V_{dd}}{I_{ON}} = \frac{C_{out} V_{dd}}{\mu(T) [V_{dd} - V_{th}(T)]^{\alpha}}$$

Carrier Mobility:

$$\mu(\mathsf{T}) = \mu(\mathsf{T}_{0}) \left( \frac{\mathsf{T}_{0}}{\mathsf{T}} \right)^{\mathsf{m}}$$

Threshold Voltage:

$$V_{th}(T) = V_{th}(T_0) - k(T - T_0)$$

$$T \uparrow \;\Rightarrow\; \mu \downarrow, \quad V_{th} \downarrow$$
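The two temperature effects pull the delay in opposite directions: lower mobility slows the gate, lower threshold voltage speeds it up. A minimal Python sketch of the three equations above shows which effect wins at a given supply voltage. All numeric parameters (`m`, `k`, `alpha`, `vth0`, the normalised `mu0` and `c_out`) are illustrative assumptions, not values from the slides.

```python
def mobility(temp_k, mu0=1.0, t0_k=300.0, m=1.5):
    """Carrier mobility: mu(T) = mu(T0) * (T0 / T)**m (mu0 normalised, m assumed)."""
    return mu0 * (t0_k / temp_k) ** m


def threshold_voltage(temp_k, vth0=0.40, k=1.0e-3, t0_k=300.0):
    """Threshold voltage: Vth(T) = Vth(T0) - k * (T - T0) (vth0 and k assumed)."""
    return vth0 - k * (temp_k - t0_k)


def gate_delay(vdd, temp_k, c_out=1.0, alpha=1.3):
    """Alpha-power delay: Dp = Cout * Vdd / (mu(T) * (Vdd - Vth(T))**alpha)."""
    i_on = mobility(temp_k) * (vdd - threshold_voltage(temp_k)) ** alpha
    return c_out * vdd / i_on


if __name__ == "__main__":
    for vdd in (1.0, 0.5):
        cold = gate_delay(vdd, 300.0)
        hot = gate_delay(vdd, 380.0)
        trend = "slower" if hot > cold else "faster"
        print(f"Vdd = {vdd:.1f} V: the hot gate is {trend} than the cold gate")
```

With these assumed parameters the gate slows down with temperature at the nominal supply (mobility dominates) and speeds up near threshold (the $V_{th}$ drop dominates).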



#### **Thermal Behavior of CMOS gates**



#### Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich



#### **Power Wall**

#### Here is a Clue to the Problem

The problem is now called "the Power Wall". It is illustrated in this figure, taken from Patterson & Hennessy.



- The design goal for the late 1990s and early 2000s was to drive the clock rate up. This was done by packing more transistors onto a smaller chip.
- Unfortunately, this increased the power dissipation of the CPU chip beyond the capacity of inexpensive cooling techniques.



#### **Roadmap for CPU Clock Speed: Around 2005**



Here is the result of the best thinking in 2005. By 2015, the clock speed of the top "hot chip" was projected to be in the 12 - 15 GHz range.




#### The CPU Clock Speed Roadmap (A Few Revisions Later)



[Figure: revised CPU clock-speed roadmap; time axis from 2001 to 2013]

This reflects the practical experience gained with dense chips that were literally "hot": they dissipated considerable thermal power and were difficult to cool. Law of physics: all electrical power consumed is eventually dissipated as heat.





# The MultiCore Approach

Multiple cores on the same chip:

- Simpler
- Slower
- Less power demanding



#### Multi-Core Energy-Efficient Performance

[Figure: "Under-Clocking". Relative performance and power of over-clocked (+20%), max-frequency and under-clocked (-20%) configurations; axis label: relative single-core frequency and Vcc. Values legible in the extracted figure include 1.73x, 1.13x, 1.00x and 1.02x.]

[Intel® Multi-Core Processors Making the Move to Quad-Core and Beyond]
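A rough way to see why two under-clocked cores fit in the power budget of one full-speed core: if dynamic power goes as $f \cdot V^2$ and the supply voltage is scaled in proportion to frequency, power goes as $f^3$. The sketch below uses only that cubic assumption, which is a simplification and not necessarily Intel's exact model. It yields values consistent with the 1.73x and 1.02x labels that survive from the figure. Performance is not modelled here.

```python
def relative_power(freq_scale, cores=1):
    """Power relative to one core at max frequency, assuming P ~ f * V^2 with V ~ f."""
    return cores * freq_scale ** 3


if __name__ == "__main__":
    print(f"over-clocked (+20%):        {relative_power(1.2):.2f}x")
    print(f"under-clocked (-20%):       {relative_power(0.8):.2f}x")
    print(f"two cores, each at -20%:    {relative_power(0.8, cores=2):.2f}x")
```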



#### **Transition to Multicore**







# **The Utilization Wall**

- Scaling theory
  - Transistor and power budgets no longer balanced
  - Exponentially increasing problem
- Observations in the wild
  - Flat frequency curve
  - "Turbo Mode"
  - Increasing cache/processor ratio

#### **Classical scaling**

| Quantity                      | Scaling per generation |
|-------------------------------|------------------------|
| Device count                  | S<sup>2</sup>          |
| Device frequency              | S                      |
| Device power (cap)            | 1/S                    |
| Device power (V<sub>dd</sub>) | 1/S<sup>2</sup>        |
| Utilization                   | 1                      |

#### **Leakage-limited scaling**

| Quantity                      | Scaling per generation |
|-------------------------------|------------------------|
| Device count                  | S<sup>2</sup>          |
| Device frequency              | S                      |
| Device power (cap)            | 1/S                    |
| Device power (V<sub>dd</sub>) | ~1                     |
| Utilization                   | 1/S<sup>2</sup>        |
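The utilization rows follow from multiplying the other four factors: under a fixed power budget, the usable fraction of the chip is the inverse of how much full-chip power would grow. A small Python sketch of that arithmetic follows. The scaling factor `s = 1.4` per generation is an assumed, commonly quoted value and does not come from the slides.

```python
def utilization(generations, s=1.4, leakage_limited=True):
    """Fraction of the chip that can run at full frequency within a fixed power budget."""
    power_growth = 1.0
    for _ in range(generations):
        device_count = s ** 2          # more transistors per chip
        frequency = s                  # faster devices
        capacitance = 1.0 / s          # smaller devices switch less charge
        # Classical scaling lowers Vdd (V^2 term gives 1/S^2);
        # leakage-limited scaling keeps Vdd roughly constant (~1).
        vdd_term = 1.0 if leakage_limited else 1.0 / s ** 2
        power_growth *= device_count * frequency * capacitance * vdd_term
    return min(1.0, 1.0 / power_growth)


if __name__ == "__main__":
    for gen in range(4):
        print(f"generation {gen}: "
              f"classical {utilization(gen, leakage_limited=False):.2f}, "
              f"leakage-limited {utilization(gen):.2f}")
```

Each leakage-limited generation multiplies the usable fraction by $1/S^2$, which is why the problem compounds exponentially.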











**Utilization Wall:** 



# **Dark Implications for Multicore**



2x4 cores @ 3 GHz (8 cores dark) (*Industry's Choice*)

4 cores @ 2x3 GHz (12 cores dark)





# What do we do with Dark Silicon?

- Insights:
  - Power is now more expensive than area
  - Specialized logic has been shown as an effective way to improve energy efficiency (10-1000x)
- Possible Approach:
  - Fill dark silicon with specialized cores to save energy on common apps
  - Near-threshold Computing
  - Turbo mode





#### HW – Turbo Mode Controller





[Haswell\_Turbo\_Boost\_Technology\_v0\_7\_02-1]




#### HW – Turbo Mode Controller



ALMA MATER STUDIORUM Università di Bologna

#### Intel<sup>®</sup> Turbo Boost Technology Power vs. Frequency

Reducing PL2 down to TDP (PL1) does not equal turbo disablement:

- You can still enjoy extended turbo performance with most applications.
- Max turbo frequency depends on number of active cores.
- Lowering PL2 will, however, reduce performance.



We will evaluate the impact of the turbo logic on energy efficiency.





#### **Near-Threshold Computing**









Used in ultra-low-power devices; induces high variation.

[Vivek De, Intel Labs]

Still not suitable for HPC systems because of the performance loss.



#### Outline

- Power in digital systems
- Dynamic Power Management
- Power Management & Heterogeneity
- Characterization of thermal effects


#### **General Architecture**



- System
- Sensors
  - Performance counters (PMU)
  - Core temperature
- Actuator knobs
  - ACPI states
    - P-state  $\rightarrow$  DVFS
    - C-state  $\rightarrow$  power gating
  - Task allocation
- Controller
  - Reactive
    - Threshold/heuristic
    - Control theory
  - Proactive
    - Predictors
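As an illustration of the reactive, threshold-based controller class listed above, here is a minimal Python sketch: it reads one sensor (core temperature) and moves one knob (the P-state index). The P-state table, the two thresholds and the temperature trace are hypothetical values chosen for the example.

```python
# Hypothetical P-state table, lowest frequency first (MHz).
P_STATES_MHZ = [1200, 1600, 2000, 2400]


def reactive_step(p_index, temp_c, t_hot=80.0, t_cool=70.0):
    """Return the next P-state index given the current core temperature.

    Above t_hot the controller steps down one P-state, below t_cool it steps
    up, and in between it holds (hysteresis avoids oscillation).
    """
    if temp_c > t_hot and p_index > 0:
        return p_index - 1
    if temp_c < t_cool and p_index < len(P_STATES_MHZ) - 1:
        return p_index + 1
    return p_index


if __name__ == "__main__":
    index = len(P_STATES_MHZ) - 1
    for temp in (75.0, 83.0, 86.0, 78.0, 68.0):  # hypothetical sensor trace
        index = reactive_step(index, temp)
        print(f"T = {temp:.0f} C -> {P_STATES_MHZ[index]} MHz")
```

A proactive controller would replace the measured `temp_c` with a predicted value, so that the knob moves before the threshold is crossed.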





#### DVFS – with deadline or "on-demand governor"

*Key idea*: Exploit *slack* by scaling V & f to run evenly across a time quantum
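The slack idea can be sketched as choosing the lowest available frequency that still finishes the work within the deadline. The function name, the frequency list and the cycle counts below are hypothetical, and the sketch assumes the work is a fixed number of core cycles (that is, CPU bound).

```python
def pick_frequency(cycles, deadline_s, freqs_hz):
    """Lowest frequency that completes `cycles` within `deadline_s`.

    Falls back to the maximum frequency when no setting meets the deadline.
    """
    needed_hz = cycles / deadline_s
    feasible = [f for f in sorted(freqs_hz) if f >= needed_hz]
    return feasible[0] if feasible else max(freqs_hz)


if __name__ == "__main__":
    freqs = [1.2e9, 1.6e9, 2.0e9]  # hypothetical P-state frequencies
    chosen = pick_frequency(cycles=1.5e9, deadline_s=1.0, freqs_hz=freqs)
    print(f"run at {chosen / 1e9:.1f} GHz")
```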

Linux on-demand governor:







# Run Fast Then Stop (RFTS) vs DVFS

- Run Fast Then Stop (RFTS) is a technique where the processor runs at the highest frequency until the job is finished, then stops.
- DVFS runs "low and slow": lowering V together with f reduces dynamic power, which scales with $V^2$.
- Once the job is finished, RFTS can stop the core in two ways:
  1. Clock gating: the core continues to leak.
  2. Power gating: the core is powered off and does not leak.

[Figure legend: Active, Power Gate, Clock Gate]
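To make the trade-off concrete, the sketch below compares the energy spent over one period by the three options: DVFS, RFTS with clock gating, and RFTS with power gating. It rests on a deliberately simple model that is an assumption of this example: dynamic power $C_{eff} V^2 f$, leakage power $I_{leak} V$, and no save/restore overhead for power gating. All parameter values are hypothetical.

```python
def period_energy(cycles, period_s, v, f_hz, idle="clock_gate",
                  c_eff=1e-9, i_leak=0.5):
    """Energy (J) to run `cycles` at (v, f_hz) and then idle until `period_s`."""
    t_active = cycles / f_hz
    if t_active > period_s:
        raise ValueError("job does not fit in the period at this frequency")
    p_dyn = c_eff * v ** 2 * f_hz           # dynamic power while running
    p_leak = i_leak * v                     # leakage while powered
    # A clock-gated core still leaks; a power-gated core does not.
    p_idle = p_leak if idle == "clock_gate" else 0.0
    return (p_dyn + p_leak) * t_active + p_idle * (period_s - t_active)


if __name__ == "__main__":
    for leak in (0.5, 3.0):  # low-leakage and high-leakage cases
        dvfs = period_energy(1e9, 1.0, v=0.7, f_hz=1e9, i_leak=leak)
        rfts_cg = period_energy(1e9, 1.0, v=1.0, f_hz=2e9, i_leak=leak)
        rfts_pg = period_energy(1e9, 1.0, v=1.0, f_hz=2e9,
                                idle="power_gate", i_leak=leak)
        print(f"i_leak = {leak}: DVFS {dvfs:.2f} J, "
              f"RFTS+clock gate {rfts_cg:.2f} J, RFTS+power gate {rfts_pg:.2f} J")
```

Under these assumptions DVFS wins when leakage is low, while RFTS with power gating wins once leakage dominates, because the slow run leaks for the whole period.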






#### Run Fast Then Stop (RFTS) vs DVFS - II








#### Run Fast Then Stop (RFTS) vs DVFS - III

- "Break-even time" is the time the core must stay powered off for the leakage energy saved to compensate for the energy spent saving and restoring its state.
- In high-leakage situations, the power gating benefit is realized in a shorter time.
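The definition translates into a one-line formula: break-even time equals the save/restore energy divided by the leakage power that gating removes. The sketch below is a direct reading of that definition; the function name and the example values are hypothetical.

```python
def break_even_time(e_save_restore_j, p_leak_w, p_gated_w=0.0):
    """Minimum off time (s) for power gating to pay back its overhead energy."""
    saved_power = p_leak_w - p_gated_w
    if saved_power <= 0:
        raise ValueError("power gating must reduce leakage power")
    return e_save_restore_j / saved_power


if __name__ == "__main__":
    for leak_w in (0.5, 2.0):
        t_be = break_even_time(e_save_restore_j=1e-3, p_leak_w=leak_w)
        print(f"leakage {leak_w} W -> break-even after {t_be * 1e3:.2f} ms")
```

Higher leakage power shrinks the break-even time, which is the second bullet above.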













#### Outline

- Power in digital systems
- Dynamic Power Management
- Power Management & Heterogeneity
- Characterization of thermal effects





# Heterogeneity in Supercomputers

Desired vs. undesired











#### **Eurora Node**





Eurora system: 64 nodes

#### CPUs:

- 2x Intel Xeon E5-2658 (nodes 1-32)
  - 8 cores @ 2.1 GHz
  - TDP 95 W
  - 2.8 GHz max turbo frequency
- 2x Intel Xeon E5-2687W (nodes 33-64)
  - 8 cores @ 3.1 GHz
  - TDP 150 W
  - 3.8 GHz max turbo frequency

#### Accelerator:

- 2x Nvidia Kepler K20 card (node 33-64)
  - 32 GB of GDDR5
  - Peak 2 TFLOP DP @ 250W
- 2x Intel Xeon Phi (node 1-32)
  - 16GB GDDR5
  - Peak 1.4 TFLOP DP @ 245W

#### Software:

- SMP CentOS Linux
- On demand power governor



## **Workload Design**

Benchmarks and tests performed:

**SYNT CPU:** synthetic parallel benchmark. It emulates a **CPU bound** application.

**SYNT Mem:** synthetic parallel benchmark. It emulates a **memory bound** task execution.

**QE:** Quantum ESPRESSO is a freely available integrated suite of computer codes for electronic-structure calculations.

To generate the data set used for characterizing the variability sources in Eurora:

- We designed a PBS script that sets the frequencies and runs the benchmarks.
- We save the start time and end time of each run.
- Offline, the log files are used to navigate the traces of the Eurora monitoring framework.





#### CPU bound

- Time ~ 1/freq (execution time tracks the core frequency)

#### Mem bound

- Time only weakly dependent on freq
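A common first-order way to capture both behaviours is to split the execution time into a core part that scales as $1/f$ and a memory part that does not. The sketch below uses that split; the model, the reference frequency and the example times are assumptions for illustration and are not fitted to the Eurora measurements.

```python
def exec_time(freq_ghz, t_core_s, t_mem_s, f_ref_ghz=2.0):
    """Execution time at `freq_ghz`, given the core and memory time at `f_ref_ghz`."""
    return t_core_s * f_ref_ghz / freq_ghz + t_mem_s


def slowdown(freq_ghz, t_core_s, t_mem_s, f_ref_ghz=2.0):
    """Execution time relative to the reference frequency."""
    reference = exec_time(f_ref_ghz, t_core_s, t_mem_s, f_ref_ghz)
    return exec_time(freq_ghz, t_core_s, t_mem_s, f_ref_ghz) / reference


if __name__ == "__main__":
    # Halve the frequency for a CPU-bound and a memory-bound workload.
    print(f"CPU bound: {slowdown(1.0, t_core_s=10.0, t_mem_s=0.0):.2f}x time")
    print(f"Mem bound: {slowdown(1.0, t_core_s=1.0, t_mem_s=9.0):.2f}x time")
```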







#### CPU bound

- P<sub>DRAM</sub> constant, ~5 W
- P<sub>pkg</sub> ~20 W higher than P<sub>cpu</sub>
- Slightly different power between nodes, ~5 W
- Power ~ P<sub>pkg</sub>







#### MEM bound

- P<sub>DRAM</sub> constant, ~10 W
- P<sub>pkg</sub> ~25 W higher than P<sub>cpu</sub>
- Slightly different power between nodes, ~5 W





CPU bound – energy minimum @ 1.8-2GHz
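An energy minimum at an intermediate frequency is what a simple model predicts when power has a frequency-independent part: with $P(f) = P_{static} + c f^3$ and execution time $W/f$, energy is $E(f) = W (P_{static}/f + c f^2)$, which is minimised at $f^* = (P_{static} / 2c)^{1/3}$. The sketch below evaluates this model. The parameter values are hypothetical and were picked so that the minimum falls in the range reported above; they are not fitted to the Eurora data.

```python
def energy_j(freq_ghz, p_static_w=40.0, c_w_per_ghz3=3.0, work_gcycles=100.0):
    """Energy for a CPU-bound job: (P_static + c * f^3) * (W / f)."""
    power_w = p_static_w + c_w_per_ghz3 * freq_ghz ** 3
    return power_w * work_gcycles / freq_ghz


def optimal_frequency_ghz(p_static_w=40.0, c_w_per_ghz3=3.0):
    """Closed-form minimiser of energy_j: f* = (P_static / (2 c))^(1/3)."""
    return (p_static_w / (2.0 * c_w_per_ghz3)) ** (1.0 / 3.0)


if __name__ == "__main__":
    f_star = optimal_frequency_ghz()
    print(f"energy-optimal frequency: {f_star:.2f} GHz")
    for f in (1.2, f_star, 3.1):
        print(f"  f = {f:.2f} GHz -> {energy_j(f):.0f} J")
```

Running slower than $f^*$ wastes static energy over a longer run; running faster wastes dynamic energy.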







#### MEM bound

- Minimum E @ min freq.
- Energy gain ↓ with ↓ frequency
- @ 2^37 iterations
  - @ Turbo => 380 kcal
  - @ 1.2 GHz => 190 kcal








# **Energy Heterogeneity**



Current system (Eurora):

- Intra-node variability ~10%
- Operating point sensitivity
  - Max perf. is not min energy
- HW accelerators


#### DVFS vs. RFTS

| Nodes                                          | Optimal frequency [MHz] | Execution time overhead [%] | Energy saving [%] | EDP saving [%] |
|------------------------------------------------|-------------------------|-----------------------------|-------------------|----------------|
| **Benchmark SYNT CPU**                         |                         |                             |                   |                |
| 2.1GHz                                         | 1900 (2101)             | -11 (0)                     | +2 (0)            | -11 (0)        |
| 3.1GHz                                         | 2000 (3101)             | -70 (0)                     | +18 (0)           | -39 (0)        |
| **Benchmark SYNT Mem**                         |                         |                             |                   |                |
| 2.1GHz                                         | 1200 (1600)             | -18 (-5)                    | +18 (+8)          | +2 (+11)       |
| 3.1GHz                                         | 1200 (1600)             | -23 (-9)                    | +50 (+48)         | +38 (+43)      |
| **Benchmark QE Al<sub>2</sub>O<sub>3</sub>**   |                         |                             |                   |                |
| 2.1GHz                                         | 1700 (2101)             | -20 (0)                     | +3 (0)            | -17 (0)        |
| 3.1GHz                                         | 1800 (3100)             | -65 (-4)                    | +27 (+11)         | -21 (+8)       |
| **Benchmark QE SiO<sub>2</sub>**               |                         |                             |                   |                |
| 2.1GHz                                         | 1800 (2101)             | -18 (0)                     | +3 (0)            | -15 (0)        |
| 3.1GHz                                         | 1800 (3100)             | -79 (-9)                    | +21 (+8)          | -40 (+1)       |
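The EDP column can be cross-checked from the other two, since the energy-delay product is energy multiplied by execution time. The sketch below assumes the table's sign convention is that a negative execution-time value means a slowdown and a positive energy value means a saving; that reading is an assumption, chosen because it reproduces the 3.1GHz SYNT CPU row. Because the published figures are rounded, some rows come out a point or two off.

```python
def edp_saving_pct(time_overhead_pct, energy_saving_pct):
    """EDP saving (%) from the execution-time and energy columns.

    Assumed convention: time_overhead_pct < 0 means the run got slower,
    energy_saving_pct > 0 means the run used less energy.
    """
    time_rel = 1.0 - time_overhead_pct / 100.0      # e.g. -70 -> 1.70x time
    energy_rel = 1.0 - energy_saving_pct / 100.0    # e.g. +18 -> 0.82x energy
    return (1.0 - energy_rel * time_rel) * 100.0


if __name__ == "__main__":
    # 3.1GHz nodes, SYNT CPU and SYNT Mem rows of the table.
    print(f"SYNT CPU: {edp_saving_pct(-70, 18):.1f} %")
    print(f"SYNT Mem: {edp_saving_pct(-23, 50):.1f} %")
```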










PV: 4x power saving, 5% energy gain (GPU DVFS) + 10% energy gain (CPU min. freq.)

QE: 2-3x power saving, 15% energy gain (GPU DVFS) + 5% energy gain (CPU DVFS)



#### Outline

- Power in digital systems
- Dynamic Power Management
- Power Management & Heterogeneity
- Characterization of thermal effects


#### Thermal Heterogeneity: Direct Liquid Cooling



[Figure panels: system level and chip level, plotted over time; board-level values are in the table below]

|             | Node n. | R <sub>thCPU1</sub><br>[°C/W] | R <sub>thCPU2</sub><br>[°C/W] | R <sub>thGPU1</sub><br>[°C/W] | R <sub>thGPU2</sub><br>[°C/W] | <i>T</i><br>[°C] |
|-------------|---------|-------------------------------|-------------------------------|-------------------------------|-------------------------------|------------------|
| Board level | 33      | 0.3160                        | 0.3080                        | 0.1260                        | 0.1242                        | 40.1             |
|             | 25      | 0.3230                        | 0.3000                        | 0.1157                        | 0.1023                        | 20.7             |
|             | 35      | 0.2902                        | 0.2085                        | 0.1137                        | 0.1203                        | 40.1             |
|             | 30      | 0.3047                        | 0.2905                        | 0.1220                        | 0.1213                        | 40.1             |
|             | 57      | 0.0200                        | 0.0102                        | 0.1000                        | 0.117                         | 40.0             |

Thermal aware job schedulers can effectively allocate tasks to mitigate thermal gradients, thermal hazard, and average temperature.
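A thermal-aware allocation step can be sketched with the steady-state relation $T = T_{ref} + R_{th} \cdot P$: predict the temperature each node would reach for the job's power and pick the coolest. The sketch below uses the R<sub>thCPU1</sub> and *T* columns of the table. Treating the *T* column as the reference (coolant) temperature, the 90 W job power, and the function names are all assumptions of this example.

```python
# node id -> (R_th of CPU1 in C/W, T in C), copied from the table above.
NODES = {
    33: (0.3160, 40.1),
    25: (0.3230, 20.7),
    35: (0.2902, 40.1),
    30: (0.3047, 40.1),
    57: (0.0200, 40.0),
}


def steady_state_temp(r_th_c_per_w, power_w, t_ref_c):
    """Steady-state temperature: T = T_ref + R_th * P."""
    return t_ref_c + r_th_c_per_w * power_w


def coolest_node(nodes, power_w):
    """Node whose CPU1 is predicted to run coolest for a job drawing `power_w`."""
    return min(nodes, key=lambda n: steady_state_temp(nodes[n][0], power_w, nodes[n][1]))


if __name__ == "__main__":
    job_power_w = 90.0  # hypothetical CPU power of the job to place
    for node, (r_th, t_ref) in NODES.items():
        temp = steady_state_temp(r_th, job_power_w, t_ref)
        print(f"node {node}: predicted {temp:.1f} C")
    print(f"allocate to node {coolest_node(NODES, job_power_w)}")
```

A real scheduler would also account for thermal coupling between neighbouring sockets and for transient behaviour, which this steady-state sketch ignores.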








#### **Thermal Impact**

#### #1 Core Rotating Power Virus

- Up to 20 °C temperature difference on die
- ~30 °C temperature difference between sockets
- Thermal neighbours exist!





[Figure; axis label: Fan Speed [RPM]]







#### Hot & Cold Cores – Haswell – E5-2699 v3

Single core maximum temperature - ranking





**Supercomputers** 









#### **Biblio**

[TC 2014] Beneventi, Francesco, et al. "An effective gray-box identification procedure for multicore thermal modeling." *IEEE Transactions on Computers* 63.5 (2014): 1097-1110.

[DATE 2014] Bartolini, Andrea, et al. "Unveiling Eurora: thermal and power characterization of the most energy-efficient supercomputer in the world." *Proceedings of the Conference on Design, Automation & Test in Europe*. European Design and Automation Association, 2014.

[ISLPED 2014] Fraternali, Francesco, et al. "Quantifying the impact of variability on the energy efficiency for a next-generation ultra-green supercomputer." *Proceedings of the 2014 International Symposium on Low Power Electronics and Design*. ACM, 2014.