# **arm** Research

# Arm's role in co-design for the next generation of HPC platforms

Filippo Spiga, Software and Large Scale Systems

### What is Co-design?

Abstract: Preparations for Exascale computing have led to the realization that future computing environments will be significantly different from those that provide Petascale capabilities. This change is driven by energy constraints, which is compelling architects to design systems that will require a significant re-thinking of how algorithms are developed and implemented. **Co-design has been proposed as a methodology for scientific application, software and hardware communities to work together.** This chapter gives an overview of co-design and discusses the early application of this methodology to High Performance Computing.

[...] The co-design strategy is based on developing partnerships with computer vendors and application scientists and engaging them in a highly collaborative and iterative design process well before a given system is available for commercial use. The process is built around identifying leading edge, high-impact scientific applications and providing concrete optimization targets rather than focusing on speeds and feeds (FLOPs and bandwidth) and percent of peak.

"On the Role of Co-design in High Performance Computing", Transition of HPC Towards Exascale Computing (2013)



## Arm business model

#### Global leader in the development of semiconductor IP

R&D outsourcing for semiconductor companies

#### Innovative business model

- Upfront license fee: flexible licensing models
- Ongoing royalties: typically based on a percentage of chip price
- Technology reused across multiple applications

#### Create and transform markets

Based on the #1 architecture and #1 ecosystem with more than 100 billion chips shipped to date



### Why does Arm matter in HPC?





### **Arm HPC strategy**

#### **Mission**

Enable the world's first Arm supercomputer(s)

#### **Strategy**

Enablement + Co-Design + Partnership

#### **Building Blocks**

#### **Enablement**

- Address gaps in computational capability and data movement within Architecture.
- **Seed the software ecosystem** with open source support for Armv8 and SVE libraries, tools, and optimized workloads.
- Provide **world class tools** for compilation, analysis, and debug at large scale.

#### **Co-Design**

- Work with key end-customers in DoE, DoD, RIKEN, and EU to **design balanced architecture**, uArchitecture and SoCs based on real-world workloads, not benchmarks.
- Develop **simulation and modelling tools** to support co-design development with end-customers, partners, and academia.

#### **Partnership**

- Work with Architecture partners to quickly bring **optimized solutions to market**.
- Work with ATG & uArchitecture design teams to **steer future designs** to be more relevant for HPC, HPDA, and ML.
- Work with key ISVs to enable mid-market.

## **European HPC initiatives adopting Arm for HPC**



University of BRISTOL



GW4

#### Isambard system specification:

- Cray system
- 10,000+ ARMv8 cores
- Cray software tools
  - Compiler, math libs, tools...
- Technology comparison:
  - x86, Xeon Phi, Pascal GPUs
- Phase 1 installed March 2017
- The ARM part arrives early 2018
  - Early access nodes from Sep

Simon McIntosh-Smith, simonm@cs.bris.ac.uk, @simonmcs, http://gw4.ac.uk/isambard/






## **Other WW HPC initiatives adopting Arm for HPC**

Sandia National Laboratories

#### Sandia's NNSA/ASC ARM Platforms



#### Post-K: Fujitsu HPC CPU to Support ARM v8

- Post-K fully utilizes Fujitsu proven supercomputer microarchitecture
- Fujitsu, as a lead partner of ARM HPC extension development, is working to realize ARM Powered® supercomputer w/ high application performance
- ARM v8 brings out the real strength of Fujitsu's microarchitecture

| HPC apps acceleration feature | Post-K | FX100 | FX10 | K computer |
| --- | --- | --- | --- | --- |
| FMA: Floating Multiply and Add | ✓ | ✓ | ✓ | ✓ |
| Math. acceleration primitives\* | ✓ Enhanced | ✓ | ✓ | ✓ |
| Inter core barrier | ✓ | ✓ | ✓ | ✓ |
| Sector cache | ✓ Enhanced | ✓ | ✓ | ✓ |
| Hardware prefetch assist | ✓ Enhanced | ✓ | ✓ | ✓ |
| Tofu interconnect | ✓ Integrated | ✓ Integrated | ✓ | ✓ |

\* Mathematical acceleration primitives include trigonometric functions, sine & cosines, and exponential...



### **Co-design with Arm**

Analysis of applications to devise the most efficient solutions







- Initial focus on successfully building with both Arm and GCC compilers across a broad front.
- Often only modest changes to environment variables, build scripts and architecture files are needed

Public wiki & repo: https://gitlab.com/arm-hpc/packages/wikis/home






### **Arm Allinea Studio**



Website: https://www.arm.com/products/development-tools/hpc-tools/allinea-studio



### Arm Compiler - Building on LLVM, Clang and Flang projects



## System Software & Tools roadmap

#### Fortran Compiler

- Improvements in debugging
- Increased Fortran 2008 support
- Improved OpenMP 4.5 support

#### All compilers

- Improvements in optimization report

#### More optimizations for current hardware

- More features in compilers
- Application specific tuning and optimization
  - For Cavium ThunderX2 and other server-class Arm-based platforms

#### Getting ready for SVE-based future hardware

- SVE enabled Performance Libraries
- Application specific tuning and optimization in Compilers and Libraries for SVE



### The hardware




#### Qualcomm Centriq™ 2400 Processor: Accelerating Innovation in the Data Center

- **Performance**: 48-core SoC design with optimized performance for throughput data center workloads
- **Power**: World's first and only 10nm processor for server-class performance and leading power efficiency
- **Total Cost of Ownership**: Significant hardware CapEx and OpEx savings
- **Ecosystem**: Robust Arm64-based ecosystem across infrastructure software and core operating system components
- **Security**: Reliable security for the modern datacenter founded on immutable root of trust and TrustZone architecture



## DGEMM performance with Arm tools on Cavium ThunderX2

Excellent serial and parallel performance

- Achieving very high performance at the node level leveraging high core counts and large memory bandwidth
- Single core performance at 95% of peak for DGEMM
- Parallel performance significantly higher than OpenBLAS
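The "percentage of peak" figure above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming the usual 2·m·n·k operation count for DGEMM; the clock frequency, FLOPs per cycle, matrix size and run time are hypothetical values chosen only to illustrate the calculation and are not measurements from these slides.

```python
def dgemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations in an m x k by k x n DGEMM (2*m*n*k convention)."""
    return 2 * m * n * k


def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical double-precision peak in GFLOP/s."""
    return cores * ghz * flops_per_cycle


def fraction_of_peak(m: int, n: int, k: int, seconds: float, peak: float) -> float:
    """Achieved DGEMM rate divided by the theoretical peak (both in GFLOP/s)."""
    achieved = dgemm_flops(m, n, k) / seconds / 1e9
    return achieved / peak


# Hypothetical single-core example: 2.0 GHz clock, 8 DP FLOPs per cycle.
core_peak = peak_gflops(cores=1, ghz=2.0, flops_per_cycle=8)

# Hypothetical run: a 4000 x 4000 x 4000 DGEMM finishing in 8.4 seconds.
share = fraction_of_peak(4000, 4000, 4000, seconds=8.4, peak=core_peak)
print(f"peak = {core_peak} GFLOP/s, DGEMM at {share:.0%} of peak")
```

The same functions apply at node level by passing the full core count to `peak_gflops`.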



Figure: DGEMM with 56 threads on Cavium ThunderX2 CN99, Arm Performance Libraries versus OpenBLAS.

### **Results from Isambard on ThunderX2**



Source: http://gw4.ac.uk/isambard

## **Scalable Vector Extension (SVE)**

- There is no preferred vector length
  - Vector Length (VL) is hardware choice, from 128 to 2048 bits, in increments of 128
  - Vector Length Agnostic (VLA) programming adjusts dynamically to the available VL
  - No need to recompile, or to rewrite hand-coded SVE assembler or C intrinsics
- SVE is not an extension of Advanced SIMD
  - A separate architectural extension with a new set of A64 instruction encodings
  - Focus is HPC scientific workloads, not media/image processing
- Common-sense says you need high vector utilisation to achieve significant speedups
  - Compilers are often unable to vectorize due to intra-vector data & control dependencies
  - SVE begins to address some of the traditional barriers to auto-vectorization (e.g. control flow)
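The vector-length-agnostic idea can be illustrated without SVE hardware. The sketch below is a plain Python simulation, not SVE code: it models a `y = a*x + y` loop in which the number of 64-bit lanes is queried at run time and a per-lane predicate masks the loop tail, loosely in the spirit of the SVE `whilelt` instruction. The function name `vla_daxpy` and its parameters are hypothetical.

```python
def vla_daxpy(a, x, y, vl_bits):
    """Simulate a vector-length-agnostic y = a*x + y loop over 64-bit elements.

    vl_bits plays the role of the hardware vector length; the loop body
    never hard-codes it, so the same code runs for every legal length.
    """
    if vl_bits % 128 != 0 or not 128 <= vl_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 in [128, 2048]")

    lanes = vl_bits // 64  # 64-bit lanes available per vector
    n = len(x)
    out = list(y)

    i = 0
    while i < n:
        # Predicate: a lane is active only while its element index is below n.
        predicate = [i + lane < n for lane in range(lanes)]
        for lane, active in enumerate(predicate):
            if active:
                out[i + lane] = a * x[i + lane] + out[i + lane]
        i += lanes  # advance by however many lanes this hardware provides
    return out


x = [float(v) for v in range(11)]  # length deliberately not a multiple of any lane count
y = [1.0] * 11
reference = [2.0 * xv + yv for xv, yv in zip(x, y)]

# The same loop gives the same answer for every legal vector length.
results = [vla_daxpy(2.0, x, y, vl) for vl in range(128, 2049, 128)]
print(all(r == reference for r in results))
```

Because the predicate handles the partial final vector, no scalar tail loop is needed, which is why the source does not have to be recompiled or rewritten for a different vector length.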




© 2017 Arm Limited

### **Scalable Vector Extension (SVE)**



**HPCAFE-2017** *"Energy-efficient HPC in Mont-Blanc and beyond: an ARM hardware and software perspective",* Roxana Rusitoru




## Jülich efforts exploring SVE for scientific applications

Objective: understand whether apps can benefit from SVE, and assess the quality and readiness of tools

- Various Arm-based SoCs (Huawei Taishan)
- Several applications of interest: QE, KKRnano, GRID, BQCD
- Results on MiniKKR show that an Arm-based SoC (without SVE) achieves performance similar to x86
- Estimate performance using instruction/branch counting (*dynamic*) and critical path analysis (*static*)
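As an illustration of the static side of such an estimate, the sketch below computes the critical path of a small instruction dependency graph, that is, the longest latency-weighted chain of dependent instructions. It is a generic textbook formulation and not the tooling used in the Jülich study; the instruction names and latencies are hypothetical.

```python
from functools import lru_cache


def critical_path(latency, deps):
    """Length in cycles of the longest dependency chain.

    latency: maps each instruction name to its latency in cycles.
    deps: maps an instruction name to the instructions it depends on.
    """
    @lru_cache(maxsize=None)
    def finish(node):
        # An instruction can start once all of its inputs have finished.
        start = max((finish(d) for d in deps.get(node, ())), default=0)
        return start + latency[node]

    return max(finish(node) for node in latency)


# Hypothetical loop body: two loads feed a fused multiply-add, then a store.
latency = {"load_x": 4, "load_y": 4, "fma": 6, "store": 1}
deps = {"fma": ("load_x", "load_y"), "store": ("fma",)}

print(critical_path(latency, deps))
```

A dynamic instruction count gives a complementary throughput bound (instructions divided by issue width); taking the larger of the two bounds is one simple way to combine them.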



"Early Experience with ARM SVE", presented at SC'17 Arm SVE User Meeting by D. Pleiter (JSC)
http://www.goingarm.com/slides/2017/SVE\_SC17/GoingArm\_SVE\_SC17\_Arm\_Dirk.pdf

"Exploring SVE for scientific applications", presented at HiPEAC'18 by S. Nassyr (JSC)
http://www.goingarm.com/slides/2018/HiPEAC2018/julich\_hipeac\_goingarm\_2018.pdf

### **Platforms**

- HPE Apollo 70
  https://news.hpe.com/hpe-helps-businesses-capitalize-on-hpc-and-aiapplications-with-new-high-density-compute-and-storage/
- Cray XC50
  http://investors.cray.com/phoenix.zhtml?c=98390&p=irolnewsArticle&ID=2316352
- Atos Bull Sequana x1310
  https://atos.net/en/2017/press-release/general-pressreleases 2017\_06\_19/atos-expands-range-supercomputers-include-armprocessors-new-bull-sequana-x1310





## **Building communities**

Beyond our direct collaborators, various complementary Arm HPC communities already exist:

- Arm HPC User Group (SC) and GoingArm (ISC/ArmRS/HiPEAC)
- Arm HPC Google Group (https://groups.google.com/forum/arm-hpc)
- Arm HPC GitLab pages (https://gitlab.com/arm-hpc/)

Our application work **engages with code owners and users** to obtain suitable test cases, to get Arm support built in, and to help them make testing part of their development processes.



