



Universidad Nacional de San Luis



# FPGA for the Acceleration of Machine Learning Algorithms

### Romina Molina

Joint ICTP-IAEA School on FPGA-based SoC and its Applications for Nuclear and Related Instrumentation — (smr 3562)

January – February 2021

# Outline

- Introduction
- Machine Learning (ML)
- SoC-based FPGA
- Acceleration of ML Inference
- High-Level Synthesis for ML (hls4ml)

# Introduction

### Machine learning and SoC





Artificial Intelligence, Machine Learning, Deep Learning

"Learning can be defined as the process of estimating associations between inputs, outputs, and parameters of a system using a limited number of observations" (Cherkassky et al. 2007)









#### Supervised Learning













#### Unsupervised Learning



An artificial neural network (ANN) is composed of neuron (or node) interconnections arranged in different layers.

$$y = f(b + \sum w_i x_i) \tag{1}$$
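Equation (1) can be written out directly in code. The following is a minimal sketch: the function name `neuron_output`, the choice of ReLU as the activation $f$, and the numeric values are illustrative assumptions, not part of the lecture material.

```python
def relu(z):
    """ReLU activation: returns z for positive inputs and 0 otherwise."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation=relu):
    """Compute y = f(b + sum_i w_i * x_i) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


# Illustrative values only: 0.5 + (0.5*1.0 - 0.25*2.0 + 0.25*3.0) = 1.25
y = neuron_output(inputs=[1.0, 2.0, 3.0], weights=[0.5, -0.25, 0.25], bias=0.5)
print(y)
```

A layer is then a set of such neurons sharing the same inputs, each with its own weights and bias.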



### **Multi-Layer Perceptron**





### **Convolutional Neural Networks (CNN)**





In a classifier, an input is mapped into a specific class.

Supervised training to recognize patterns: the network compares its actual output with the desired output, and backpropagation adjusts the weights to reduce the difference between the two.



K-fold cross-validation
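As a rough illustration of K-fold cross-validation, the sketch below splits sample indices into k folds and uses each fold once for validation while the remaining folds are used for training. The helper name `k_fold_indices` is hypothetical, and no shuffling is done; in practice the data is usually shuffled first, or a library routine such as scikit-learn's `KFold` is used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("validation:", validation, "training:", train)
```

The model is trained k times, once per split, and the k validation scores are averaged.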



# SoC-based FPGA



High-level comparison of the Zynq-7000 SoC and the Zynq UltraScale+ MPSoC

### SoC-based FPGA

#### Machine learning and SoC





#### Considerations to map inference into FPGAs





- Low-precision arithmetic to reduce power consumption and increase throughput.
- Reduce the memory footprint so that the NN model fits in on-chip memory, avoiding DDR accesses and the associated bottlenecks.
- Model compression techniques [1]

### Model compression

- Quantization (Q) and pruning (P) (train from scratch and pre-trained model)
  - Q: Reduce the number of bits used to represent weights and biases
  - P: Remove connections and/or neurons
- Low-rank factorization (train from scratch and pre-trained model)
- Compact convolutional filters (train from scratch)
- Knowledge distillation (train from scratch)

### Model compression: Pruning





#### Left: Before pruning - Right: after pruning
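One common way to decide which connections to remove is by weight magnitude. The sketch below is an assumption about the criterion (the slides do not state which one is used): it zeroes the given fraction of weights with the smallest absolute value. The function name `prune_by_magnitude` and the sample weights are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to 0."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)
```

Zero-valued weights need no multiplier in hardware, which is why pruning reduces resource usage on the FPGA.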


### Model compression: Knowledge distillation



High-level diagram of knowledge distillation.

### **High-Level Synthesis**

### Hardware design - High-Level Synthesis

Vivado HLS:

- It provides the facility to create RTL from a high level of abstraction.
- It allows the optimization of the input code using directives to:
  - Reduce latency
  - Improve performance and throughput
  - Reduce resource utilization

Without directives, Vivado HLS tries to minimize latency and improve concurrency.

### Hardware design - High-Level Synthesis

- Minimize latency: UNROLL, LOOP\_FLATTEN, LOOP\_MERGE.
- Maximize throughput: DATAFLOW, PIPELINE.
- Relieve bottlenecks: RESOURCE, ARRAY\_PARTITION, ARRAY\_RESHAPE.

### Hardware design - High-Level Synthesis - Directives



Loop example


### Hardware design - High-Level Synthesis - Directives



Loop + Unroll directive
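The restructuring that UNROLL performs can be illustrated in plain code. This is only a sketch of the transformation: in Vivado HLS it is requested with the directive rather than written by hand, and Python carries no hardware meaning. The function names and the unroll factor of 4 are illustrative assumptions.

```python
def accumulate_rolled(data):
    """Rolled loop: one addition per iteration."""
    total = 0
    for i in range(len(data)):
        total += data[i]
    return total


def accumulate_unrolled_by_4(data):
    """Same loop unrolled by a factor of 4: four elements per iteration."""
    n = len(data)
    total = 0
    i = 0
    while i + 4 <= n:
        # In hardware, these four reads can be scheduled in parallel.
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
        i += 4
    while i < n:  # Remainder when len(data) is not a multiple of 4.
        total += data[i]
        i += 1
    return total


samples = list(range(10))
print(accumulate_rolled(samples), accumulate_unrolled_by_4(samples))
```

Both functions return the same value; the unrolled version needs fewer iterations at the cost of more operators per iteration, which in an FPGA means lower latency and higher resource usage.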




Inference IP core

## High level synthesis for machine learning (hls4ml)

- Package for ML inference in FPGAs using HLS (Duarte et al. [2]).
- Paper: "Fast inference of deep neural networks (DNN) in FPGAs for particle physics", Duarte et al. [2]
- GitHub: https://github.com/fastmachinelearning/hls4ml-tutorial
- https://fastmachinelearning.org/hls4ml/

#### hls4ml



Design flow for hls4ml [2]

#### hls4ml

- HLS to create IP Core.
- Keras, TensorFlow, PyTorch.
- On-chip data structures.
- Quantization through ap\_fixed in HLS.
- Trade-off between resource utilization and latency/throughput.
- Integration with other tools.

Pipelining speeds up processing: the design accepts a new input every initiation interval (II) instead of waiting for the previous inference to finish.
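The benefit of a short initiation interval can be estimated with simple cycle counting. The sketch below uses the usual relation for a pipeline, total cycles = latency + (N - 1) x II; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
def cycles_without_pipelining(n_inputs, latency):
    """Each input must wait until the previous one has finished."""
    return n_inputs * latency


def cycles_with_pipelining(n_inputs, latency, initiation_interval):
    """A new input is accepted every `initiation_interval` cycles."""
    if n_inputs == 0:
        return 0
    return latency + (n_inputs - 1) * initiation_interval


# Hypothetical design: latency of 20 cycles, II of 1 cycle, 100 inputs.
sequential = cycles_without_pipelining(100, 20)
pipelined = cycles_with_pipelining(100, 20, 1)
print(sequential, pipelined)
```

For a long stream of inputs the throughput is set by the initiation interval, while the latency only matters for the first result.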

- Size/Compression
- Precision
- Dataflow/Resource Reuse
- Quantization-aware training: QKeras [3]
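To get a feeling for what a precision such as `ap_fixed<16,8>` does to a value, the fixed-point type can be emulated in software. The sketch below assumes the default `ap_fixed` modes (truncation toward minus infinity and wrap-around on overflow); the helper name `to_ap_fixed` is hypothetical and the code is not part of hls4ml.

```python
import math


def to_ap_fixed(value, total_bits=16, int_bits=8):
    """Emulate ap_fixed<total_bits, int_bits> with truncation and wrap-around.

    `int_bits` counts the integer bits including the sign bit, so the
    fractional part has total_bits - int_bits bits.
    """
    frac_bits = total_bits - int_bits
    scaled = math.floor(value * (1 << frac_bits))  # truncate toward -infinity
    modulus = 1 << total_bits
    scaled %= modulus                              # keep only total_bits bits
    if scaled >= modulus // 2:                     # two's complement sign
        scaled -= modulus
    return scaled / (1 << frac_bits)


print(to_ap_fixed(0.1))    # nearest representable value at or below 0.1
print(to_ap_fixed(-0.1))   # truncation moves negative values away from zero
print(to_ap_fixed(128.0))  # out of range for 8 integer bits: wraps around
```

With 8 fractional bits the resolution is 1/256, and values outside [-128, 128) wrap around. This is why profiling the weight and activation ranges before choosing the precision is useful.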



#### Profiling

#### Method: hls4ml.model.profiling.numerical



Profiling to adjust precision

#### How do we start with the tool?

First, download the packages and dependencies. Then choose between:

- Using the command line
- Using Jupyter Notebook

#### Using the command line:

- Configuration file (.yml). In this case, the file is named model-config.yml.
- Files required: .json and .h5

```
# particles_keras_config.yml

# File json
KerasJson: ../model/model_architecture.json
# File h5
KerasH5: ../model/model_weights.h5
#InputData: ../model/modelInput.dat
#OutputPredictions: ../modelPredictions.dat

OutputDir: particleHW
ProjectName: particlesIdentification
XilinxPart: xc7z020clg484-1
ClockPeriod: 10
Backend: Vivado

IOType: io_parallel # options: io_serial/io_parallel
HLSConfig:
  Model:
    Precision: ap_fixed<16,8>
    ReuseFactor: 1
  # LayerType:
  #   Dense:
  #     ReuseFactor: 2
  #     Strategy: Resource
  #     Compression: True

# Parts: xczu9eg-ffvb1156-2-e xc7z020clg484-1
```

#### Commands for terminal execution

- `hls4ml convert -c model-config.yml`
- `hls4ml build -p ProjectName -a`
- `vivado_hls -f ProjectName.tcl "csim=1 synth=1 cosim=0 export=0"`

Using the values from the configuration file above, **ProjectName** is replaced by **particlesIdentification**:

- `hls4ml convert -c model-config.yml`
- `hls4ml build -p particlesIdentification -a`
- `vivado_hls -f particlesIdentification.tcl "csim=1 synth=1 cosim=0 export=0"`

#### Using Jupyter Notebook:

```
from tensorflow.keras.models import load_model
from sklearn.metrics import accuracy_score

# Load the trained Keras MLP and print its layer structure
model = load_model('model_keras_MLP.h5')
model.summary()
```

Model: "sequential"

| Layer (type)       | Output Shape | Param # |
|--------------------|--------------|---------|
| fc1 (Dense)        | (None, 60)   | 3900    |
| relu1 (Activation) | (None, 60)   | 0       |
| fc0 (Dense)        | (None, 40)   | 2440    |
| relu0 (Activation) | (None, 40)   | 0       |
| fc2 (Dense)        | (None, 30)   | 1230    |
| relu2 (Activation) | (None, 30)   | 0       |
| fc3 (Dense)        | (None, 10)   | 310     |

#### Model summary
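From the notebook, the loaded Keras model can be handed to hls4ml through its Python API. The sketch below follows the pattern used in the hls4ml tutorial and is not taken from the slides: the wrapper name `convert_keras_model_to_hls` is hypothetical, the target device is left at the library default because the keyword for it has changed between hls4ml releases, and the exact configuration keys may differ between versions.

```python
def convert_keras_model_to_hls(model_path, output_dir):
    """Convert a saved Keras model into an hls4ml project (sketch)."""
    # Imported lazily so this file can be loaded without the packages installed.
    from tensorflow.keras.models import load_model
    import hls4ml

    model = load_model(model_path)

    # One precision and reuse factor for the whole model; the returned
    # dictionary can be inspected and edited before the conversion.
    config = hls4ml.utils.config_from_keras_model(model, granularity="model")
    print(config)

    hls_model = hls4ml.converters.convert_from_keras_model(
        model, hls_config=config, output_dir=output_dir
    )
    hls_model.compile()  # Builds a C++ emulation that can be called from Python.
    return hls_model
```

A call such as `convert_keras_model_to_hls('model_keras_MLP.h5', 'particleHW')` would write the HLS project to `particleHW`. The returned object offers `predict(...)` to compare against the Keras output and `build(...)` to launch the HLS synthesis.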

#### **Using Jupyter Notebook:**

#### Generated code

```
layer3_t layer3_out[N_LAYER_3];
#pragma HLS ARRAY_PARTITION variable=layer3_out complete dim=0
nnet::dense_latency<input2_t, layer3_t, config3>(input1, layer3_out, w3, b3);

layer5_t layer5_out[N_LAYER_3];
#pragma HLS ARRAY_PARTITION variable=layer5_out complete dim=0
nnet::relu<layer3_t, layer5_t, relu_config5>(layer3_out, layer5_out);

layer6_t layer6_out[N_LAYER_6];
#pragma HLS ARRAY_PARTITION variable=layer6_out complete dim=0
nnet::dense_latency<layer5_t, layer6_t, config6>(layer5_out, layer6_out, w6, b6);

layer8_t layer8_out[N_LAYER_6];
#pragma HLS ARRAY_PARTITION variable=layer8_out complete dim=0
nnet::relu<layer6_t, layer8_t, relu_config8>(layer6_out, layer8_out);

layer9_t layer9_out[N_LAYER_9];
#pragma HLS ARRAY_PARTITION variable=layer9_out complete dim=0
nnet::dense_latency<layer8_t, layer9_t, config9>(layer8_out, layer9_out, w9, b9);

layer11_t layer11_out[N_LAYER_9];
#pragma HLS ARRAY_PARTITION variable=layer11_out complete dim=0
nnet::relu<layer9_t, layer11_t, relu_config11>(layer9_out, layer11_out);

layer12_t layer12_out[N_LAYER_12];
#pragma HLS ARRAY_PARTITION variable=layer12_out complete dim=0
nnet::dense_latency<layer11_t, layer12_t, config12>(layer11_out, layer12_out, w12, b12);

layer14_t layer14_out[N_LAYER_12];
#pragma HLS ARRAY_PARTITION variable=layer14_out complete dim=0
nnet::relu<layer12_t, layer14_t, relu_config14>(layer12_out, layer14_out);

layer15_t layer15_out[N_LAYER_15];
#pragma HLS ARRAY_PARTITION variable=layer15_out complete dim=0
nnet::dense_latency<layer14_t, layer15_t, config15>(layer14_out, layer15_out, w15, b15);

nnet::softmax<layer15_t, result_t, softmax_config17>(layer15_out, layer17_out);
```

# Case study: Muon–Electron Pulse Shape Discrimination for Water Cherenkov Detectors [4]

- Water Cherenkov detector (WCD) at the Escuela de Ciencias Físicas y Matemáticas in Universidad de San Carlos de Guatemala (ECFM-USAC).
- Signal: sampled at 125 MHz, 14-bit resolution.
- Features are extracted from the incoming signal and used to classify it.

# Case study: Muon–Electron Pulse Shape Discrimination for Water Cherenkov Detectors



Different types of signals: electron, muon and electric discharges

# Case study: Muon–Electron Pulse Shape Discrimination for Water Cherenkov Detectors



#### MLP architecture


# Case study: Muon–Electron Pulse Shape Discrimination for Water Cherenkov Detectors

| Layer (type)         | Output Shape | Param # |
|----------------------|--------------|---------|
| fc1 (Dense)          | (None, 60)   | 3900    |
| relu1 (Activation)   | (None, 60)   | 0       |
| fc0 (Dense)          | (None, 40)   | 2440    |
| relu0 (Activation)   | (None, 40)   | 0       |
| fc2 (Dense)          | (None, 30)   | 1230    |
| relu2 (Activation)   | (None, 30)   | 0       |
| fc3 (Dense)          | (None, 10)   | 310     |
| relu3 (Activation)   | (None, 10)   | 0       |
| output (Dense)       | (None, 3)    | 33      |
| softmax (Activation) | (None, 3)    | 0       |
#### Model summary
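The parameter counts in the summary can be cross-checked by hand: a dense layer with n inputs and m outputs has n x m weights plus m biases. In the sketch below the layer widths come from the table, while the input size of 64 is not stated in the slides and is inferred from the 3900 parameters of the first layer (64 x 60 + 60).

```python
def dense_params(n_inputs, n_outputs):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_outputs + n_outputs


# Input size (assumed, see above) followed by the width of each dense layer.
layer_sizes = [64, 60, 40, 30, 10, 3]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

The activation layers add no parameters, so the whole MLP has 7913 trainable parameters.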

# Case study: Muon–Electron Pulse Shape Discrimination for Water Cherenkov Detectors

Comparison of HLS reports (NS: no Softmax; RF: reuse factor):

- Solutions 1 and 5: without directives.
- Solutions 2 and 6: with directives applied by hls4ml and Softmax as activation function.
- Solutions 3 and 7: with directives applied by hls4ml, PIPELINE to improve the interval, without Softmax and with a reuse factor of 1 for all the layers.
- Solutions 4 and 8: with directives applied by hls4ml, PIPELINE to improve the interval, without Softmax and with a reuse factor of 8 for all the dense layers.

| Solution    | Directives       | Estimated clock [ns] | Clock cycles | Inference clock cycles | Interval | BRAM | DSP   | FF      | LUT     |
|-------------|------------------|----------------------|--------------|------------------------|----------|------|-------|---------|---------|
| **ZU9EG**   |                  |                      |              |                        |          |      |       |         |         |
| 1           | No               | 4.653                | 36,917       | 36,848                 | 36,917   | 23   | 2     | 2,407   | 5,732   |
| 2           | Yes + Softmax    | 4.653                | 18,526       | 18,457                 | 18,526   | 2    | 1,245 | 26,192  | 180,066 |
| 3           | Yes + NS + RF: 1 | 4.251                | 84           | 19                     | 64       | 0    | 1,235 | 27,221  | 167,158 |
| 4           | Yes + NS + RF: 8 | 4.993                | 115          | 50                     | 64       | 0    | 155   | 38,571  | 141,443 |
| **XC7Z020** |                  |                      |              |                        |          |      |       |         |         |
| 5           | No               | 6.508                | 91,777       | 91,707                 | 91,777   | 23   | 2     | 4,313   | 6,952   |
| 6           | Yes + Softmax    | 6.508                | 40,063       | 39,993                 | 40,063   | 2    | 1,245 | 188,626 | 171,599 |
| 7           | Yes + NS + RF: 1 | 4.350                | 121          | 55                     | 64       | 0    | 1,235 | 189,059 | 159,351 |
| 8           | Yes + NS + RF: 8 | 5.561                | 143          | 77                     | 64       | 0    | 155   | 76,286  | 118,936 |
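Two quantities can be derived from the table with simple arithmetic. The sketch below is a rough reading of the reported numbers, not an additional measurement: latency in nanoseconds is taken as clock cycles times the estimated clock period (the clock achieved after implementation may differ), and the DSP count for a reuse factor of 8 is compared with a plain 1/RF scaling, which is an assumption that happens to match the table. The function names are hypothetical.

```python
import math


def latency_ns(clock_cycles, clock_period_ns):
    """Latency estimate: number of cycles times the estimated clock period."""
    return clock_cycles * clock_period_ns


def shared_units(units, reuse_factor):
    """Units needed if each one is shared by `reuse_factor` operations."""
    return math.ceil(units / reuse_factor)


# (clock cycles, estimated clock [ns]) copied from solutions 3, 4, 7 and 8.
solutions = {3: (84, 4.251), 4: (115, 4.993), 7: (121, 4.350), 8: (143, 5.561)}
for solution, (cycles, period) in solutions.items():
    print(f"Solution {solution}: {latency_ns(cycles, period):.1f} ns")

# 1,235 DSPs at RF = 1 versus 155 DSPs at RF = 8 in the table.
print(shared_units(1235, 8))
```

The comparison shows the trade-off listed earlier: a higher reuse factor saves DSPs at the cost of additional clock cycles.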





### References

- [1] Cheng, Y.; Wang, D.; Zhou, P.; Zhang, T. A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv 2017, arXiv:1710.09282.
- [2] Duarte, J.; Han, S.; Harris, P.; Jindariani, S.; Kreinar, E.; Kreis, B.; Ngadiuba, J.; Pierini, M.; Rivera, R.; Tran, N.; et al. Fast inference of deep neural networks in FPGAs for particle physics. J. Instrum. 2018, 13, P07027, doi:10.1088/1748-0221/13/07/p07027.
- [3] Coelho, J.; Kuusela, A.; Zhuang, H.; Aarrestad, T.; Loncar, V.; Ngadiuba, J.; Pierini, M.; Summers, S. Ultra Low-latency, Low-area Inference Accelerators using Heterogeneous Deep Quantization with QKeras and hls4ml. arXiv 2020, arXiv:2006.10159.
- [4] Garcia, L.G.; Molina, R.S.; Crespo, M.L.; Carrato, S.; Ramponi, G.; Cicuttin, A.; Morales, I.R.; Perez, H. Muon–Electron Pulse Shape Discrimination for Water Cherenkov Detectors Based on FPGA/SoC. Electronics 2021, 10, 224. https://doi.org/10.3390/electronics10030224
- [5] Vivado Design Suite User Guide: High-Level Synthesis, UG902 (v2019.1), July 12, 2019.

Mg. Romina S. Molina

ICTP rmolina@ictp.it

Università degli Studi di Trieste rominasoledad.molina@phd.units.it

National University of San Luis: rsmolina@unsl.edu.ar



## Thanks!

