

#### Workshop on Fully Programmable Systems-on-Chip for Scientific Applications



Independent University, Bangladesh



FPGA for Accelerating Machine Learning (ML) Algorithms

**Romina Soledad Molina** 

MLab/STI Unit - ICTP

Doha, Qatar 2024



## Outline

- Introduction.
- Remarks from SOTA.
- ML and model compression techniques.
- An end-to-end workflow to compress and deploy DNN on FPGA.
  - DNN training and compression.
  - Integration with a hardware synthesis tool for ML.
  - Hardware assessment framework.
- Applications.
- Final remarks.









A machine/system capable of imitating human behavior











Machine learning

#### Classification



















#### Generalization



Image from Togootogtokh, E., & Amartuvshin, A. (2018). Deep Learning Approach for Very Similar Objects Recognition Application on Chihuahua and Muffin Problem. ArXiv, abs/1801.09573.









- In a classifier, an input is mapped to a specific class.
- Supervised training phase: the network compares its current output with the desired output. The difference between the two (the error) is reduced by adjusting the weights through backpropagation.
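The two points above can be illustrated with a minimal sketch. It assumes a single softmax layer and synthetic two-class data; the data, learning rate, and epoch count are illustrative choices, not values from the talk. With one layer, backpropagation reduces to a single gradient step per iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, well-separated data: two 2-D clusters (illustrative only).
n_per_class = 100
x = np.vstack([rng.normal(-2.0, 0.5, (n_per_class, 2)),
               rng.normal(+2.0, 0.5, (n_per_class, 2))])
labels = np.repeat([0, 1], n_per_class)
targets = np.eye(2)[labels]            # desired output, one-hot encoded

weights = np.zeros((2, 2))
bias = np.zeros(2)


def forward(inputs):
    """Map each input to class probabilities with a softmax layer."""
    logits = inputs @ weights + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


learning_rate = 0.1
for epoch in range(200):
    probs = forward(x)                 # current output
    error = probs - targets            # difference from the desired output
    # Gradient of the mean cross-entropy loss with respect to the parameters.
    grad_w = x.T @ error / len(x)
    grad_b = error.mean(axis=0)
    weights -= learning_rate * grad_w
    bias -= learning_rate * grad_b

predicted = forward(x).argmax(axis=1)  # each input is mapped to one class
accuracy = (predicted == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```

In a deep network the same error signal is propagated backwards layer by layer, which is what frameworks such as Keras automate.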



#### A classifier as example
















































Introduction: ML and SoC







SoC-based FPGA

- Low latency
- Low power consumption
- High parallelism
- **Resource-constrained devices**









- Towards ML-based models implemented on resource-constrained devices.
- SOTA models: VGG16, MobileNet V2, BERT, U-Net, YOLO.
- Compression: focused on pruning and quantization.
- Workflows addressing some parts of the development cycle.
- Off-chip memory transactions.

| Remark                                                    | Focus                               |
|-----------------------------------------------------------|-------------------------------------|
| SOTA models: VGG16, MobileNet V2, BERT, U-Net, YOLO       | Memory footprint and latency        |
| Compression: focused on pruning and quantization          | Ensemble of compression techniques  |
| Off-chip memory transactions                              | On-chip memory deployment           |
| Workflows addressing some parts of the development cycle  | Productivity: end-to-end workflow   |



#### **ML and model compression techniques**



#### ML and model compression techniques for reconfigurable hardware accelerators

**Ensemble of compression techniques:** exploration of the interplay between pruning, quantization, and knowledge distillation.
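As a rough illustration of how two of these techniques combine, the sketch below applies magnitude pruning followed by uniform quantization to a random weight matrix. The 20% sparsity and 8-bit width mirror the student settings quoted later in the backup slides; the pruning and quantization rules themselves are common textbook choices assumed here, not necessarily the ones used in the workflow, and the footprint estimate ignores any index storage for sparse weights.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    n_prune = int(round(sparsity * flat.size))
    if n_prune == 0:
        return weights.copy()
    threshold = np.sort(flat)[n_prune - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned


def quantize_uniform(weights, n_bits):
    """Symmetric uniform quantization onto a signed n_bits grid."""
    max_abs = np.abs(weights).max()
    levels = 2 ** (n_bits - 1) - 1
    scale = max_abs / levels if max_abs > 0 else 1.0
    return np.round(weights / scale) * scale


rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=(30, 20))      # hypothetical dense layer

w_pruned = magnitude_prune(w, sparsity=0.20)
w_compressed = quantize_uniform(w_pruned, n_bits=8)

sparsity = np.mean(w_compressed == 0.0)
dense_bits = w.size * 32                      # float32 baseline (assumption)
compressed_bits = np.count_nonzero(w_compressed) * 8
print(f"sparsity: {sparsity:.2%}")
print(f"weight storage: {dense_bits} -> {compressed_bits} bits")
```

Quantization can push additional small weights to zero, so the measured sparsity may end up slightly above the pruning target. This is one small example of the interplay between the techniques.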





# An end-to-end workflow to efficiently compress and deploy DNN on SoC/FPGA



#### **End-to-end workflow**

#### A- DNN training and compression










## A. DNN training and compression



#### **DNN training and compression** Stage 1 - Teacher training

#### **DNN training and compression** Stage 2 - Student training
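Student training in a teacher-student setup is commonly driven by a knowledge-distillation loss. The sketch below assumes the widely used temperature-scaled formulation (cross-entropy on the true labels plus a KL term between softened teacher and student outputs). The slides do not state the exact loss, so `temperature`, `alpha`, and the example logits are illustrative assumptions.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.1):
    """alpha * CE(student, labels) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    eps = 1e-12
    labels = np.asarray(labels)
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(len(labels)), labels] + eps).mean()
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = np.sum(q_teacher * (np.log(q_teacher + eps) - np.log(q_student + eps)),
                axis=-1).mean()
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * kl


# Hypothetical logits for two samples and four classes.
teacher_logits = np.array([[4.0, 1.0, 0.0, -1.0],
                           [0.0, 3.0, 0.5, 0.0]])
labels = np.array([0, 1])

loss_close = distillation_loss(0.9 * teacher_logits, teacher_logits, labels)
loss_far = distillation_loss(-teacher_logits, teacher_logits, labels)
print(f"student close to teacher: {loss_close:.4f}")
print(f"student far from teacher: {loss_far:.4f}")
```

A student whose outputs track the teacher incurs a smaller loss than one that contradicts it, which is the signal that guides the compressed model during training.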
































https://github.com/fastmachinelearning/










# Integration with a hardware synthesis tool for ML



#### ML framework support:

- (Q)Keras
- **PyTorch** (limited)
- (Q)ONNX (in development)

#### Neural network architectures:

- Fully Connected NN
- Convolutional NN
- Recurrent NN
- Graph NN

#### HLS backends:

- Vivado HLS
- Intel HLS
- Vitis HLS (experimental)

https://fastmachinelearning.org/hls4ml/



#### **Python integration**

    from tensorflow.keras.models import load_model
    from sklearn.metrics import accuracy_score

    model = load_model('model_keras_MLP.h5')
    model.summary()

Model: "sequential"

| Layer (type)       | Output Shape | Param # |
|--------------------|--------------|---------|
| fc1 (Dense)        | (None, 60)   | 3900    |
| relu1 (Activation) | (None, 60)   | 0       |
| fc0 (Dense)        | (None, 40)   | 2440    |
| relu0 (Activation) | (None, 40)   | 0       |
| fc2 (Dense)        | (None, 30)   | 1230    |
| relu2 (Activation) | (None, 30)   | 0       |
| fc… (Dense)        | (None, 10)   | 210     |
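The parameter counts in the summary can be cross-checked with a short sketch: a dense layer with `n_inputs` inputs and `n_units` units holds `n_inputs * n_units` weights plus `n_units` biases. The input width of 64 is not shown on the slide; it is inferred here from 3900 = 64 × 60 + 60 and should be read as an assumption.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


# 64 inputs is inferred from the first row of the summary (assumption).
layer_widths = [64, 60, 40, 30]
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_widths, layer_widths[1:])]
print(counts)  # [3900, 2440, 1230]
```

These counts are what the compression stage tries to shrink, since every parameter has to fit in on-chip memory.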






```
import hls4ml
import plotting  # helper module providing print_dict (as in the hls4ml tutorial)

# Round and saturate the outputs of Activation layers
hls4ml.model.optimizer.OutputRoundingSaturationMode.layers = ['Activation']
hls4ml.model.optimizer.OutputRoundingSaturationMode.rounding_mode = 'AP_RND'
hls4ml.model.optimizer.OutputRoundingSaturationMode.saturation_mode = 'AP_SAT'

# Per-layer configuration derived from the Keras model
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['Model'] = {'Precision': 'ap_fixed<17,16>', 'ReuseFactor': 1, 'Strategy': 'Latency'}
config['LayerName']['fc1']['Precision']['weight'] = 'ap_fixed<9,1>'
config['LayerName']['softmax']['Precision'] = 'ap_fixed<32,15>'
print("-----------------------------------")
plotting.print_dict(config)
print("-----------------------------------")

# Generate the HLS project and compile the C simulation model
hls_model = hls4ml.converters.convert_from_keras_model(model,
                                                       hls_config=config,
                                                       output_dir='model_3/MLP_student_smr3765')
hls_model.compile()
```
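To make the precision strings in the configuration more concrete, the sketch below emulates an `ap_fixed<W,I>` type in NumPy: `W` is the total bit width and `I` the number of integer bits including the sign. It assumes the usual meaning of `AP_RND` (round to nearest, ties toward plus infinity) and `AP_SAT` (clip to the representable range). The helper `to_ap_fixed` is hypothetical and is not part of hls4ml.

```python
import numpy as np


def to_ap_fixed(values, width, integer_bits):
    """Emulate ap_fixed<width, integer_bits> with AP_RND rounding and AP_SAT saturation."""
    frac_bits = width - integer_bits
    step = 2.0 ** -frac_bits                      # value of one least-significant bit
    lowest = -(2.0 ** (integer_bits - 1))
    highest = 2.0 ** (integer_bits - 1) - step
    scaled = np.asarray(values, dtype=float) / step
    rounded = np.floor(scaled + 0.5) * step       # nearest, ties toward +infinity
    return np.clip(rounded, lowest, highest)      # saturate instead of wrapping


# ap_fixed<9,1>: 8 fractional bits, range [-1, 1 - 2**-8]
quantized = to_ap_fixed([0.3, 1.5, -2.0], width=9, integer_bits=1)
print(quantized)  # [ 0.30078125  0.99609375 -1.        ]
```

With only one integer bit, weights outside [-1, 1) saturate, which is why the weight precision has to match the range learned during quantization-aware training.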



#### **QKeras for quantization-aware training**




```
# MLP architecture
# Create the student with QKeras
from tensorflow import keras
from tensorflow.keras.layers import Input, Activation
from qkeras import QDense, QActivation, quantized_bits, quantized_relu
from qkeras.estimate import print_qstats

studentQ_MLP = keras.Sequential(
    [
        Input(shape=(30,)),
        QDense(20, name='fc1',
               kernel_quantizer=quantized_bits(9,1,alpha=1), bias_quantizer=quantized_bits(23,15,alpha=1)),
        QActivation(activation=quantized_relu(16,15), name='relu1'),
        QDense(10, name='fc2',
               kernel_quantizer=quantized_bits(9,1,alpha=1), bias_quantizer=quantized_bits(23,15,alpha=1)),
        QActivation(activation=quantized_relu(16,15), name='relu2'),
        QDense(10, name='fc6',
               kernel_quantizer=quantized_bits(9,1,alpha=1), bias_quantizer=quantized_bits(23,15,alpha=1)),
        QActivation(activation=quantized_relu(16,15), name='relu3'),
        QDense(4, name='output',
               kernel_quantizer=quantized_bits(32,15,alpha=1), bias_quantizer=quantized_bits(32,15,alpha=1)),
        Activation(activation='softmax', name='softmax')
    ],
    name="student",
)
print_qstats(studentQ_MLP)
```



High-level synthesis project generated through hls4ml

```
layer8_t layer8_out[N_LAYER_8];
#pragma HLS ARRAY_PARTITION variable=layer8_out complete dim=0
nnet::dense<layer6_t, layer8_t, config8>(layer6_out, layer8_out, w8, b8); // fc3

layer9_t layer9_out[N_LAYER_8];
#pragma HLS ARRAY_PARTITION variable=layer9_out complete dim=0
nnet::linear<layer8_t, layer9_t, linear_config9>(layer8_out, layer9_out); // fc3_linear

layer11_t layer11_out[N_LAYER_11];
#pragma HLS ARRAY_PARTITION variable=layer11_out complete dim=0
nnet::dense<layer9_t, layer11_t, config11>(layer9_out, layer11_out, w11, b11); // fc4

layer12_t layer12_out[N_LAYER_11];
#pragma HLS ARRAY_PARTITION variable=layer12_out complete dim=0
nnet::linear<layer11_t, layer12_t, linear_config12>(layer11_out, layer12_out); // fc4_linear
```



























## **Applications**



- Gamma/Neutron discrimination [submitted TNS].
- Pest classification in fruit crops [9, 11].
- Pulse shape discriminator for cosmic rays studies [8, 11].
- Volcanic seismic event detection [12].
- Object detection for adverse weather conditions, particularly haze and fog [India on-going].
- Water quality monitoring applied to Dunav river [Serbia Remarkable / on-going].





# **Gamma/Neutron** discrimination






- Tagged dataset of gamma and neutron events from Deuterium-Deuterium (DD) and Deuterium-Tritium (DT) generators.
- The dataset was recorded at the Neutron Science Facility (NSF) of the Nuclear Science and Instrumentation Laboratory (NSIL), IAEA.
- The detector is based on a small **CLYC** (Cs2LiYCl6:Ce) crystal (0.5 in diameter by 30 mm length) coupled to a 4-element SiPM array.
- The data were **sampled at 4 GSPS with 10-bit resolution** using a CAEN DT5761 digitizer.
- The dataset contains 10,913 gamma events and 27,696 neutron events.








Morales, I. R., Crespo, M. L., Bogovac, M., Cicuttin, A., Kanaki, K., & Carrato, S. (2023). Gamma/neutron classification with SiPM CLYC detectors using frequency-domain analysis for embedded real-time applications. Nuclear Engineering and Technology.

Dataset from https://doi.org/10.5281/zenodo.8037059
















- Overall accuracy
  - Teacher architecture (original): **99.00%**
  - Student architecture (compressed): **98.20%**
- SoC memory footprint in terms of resource utilization at 200 MHz
  - Artix-7 platform: **below 35%**
- SoC latency
  - Zedboard platform: 45 clock cycles (at 200 MHz)
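For reference, a latency quoted in clock cycles converts to time by multiplying by the clock period. The small helper below (`cycles_to_ns` is a name chosen here for illustration) applies this to the 45 cycles at 200 MHz reported for the Zedboard.

```python
def cycles_to_ns(clock_cycles, clock_mhz):
    """Convert a latency in clock cycles to nanoseconds for a given clock in MHz."""
    period_ns = 1e3 / clock_mhz        # 200 MHz -> 5 ns per cycle
    return clock_cycles * period_ns


latency_ns = cycles_to_ns(45, 200.0)
print(f"{latency_ns:.0f} ns")  # 225 ns
```

This figure covers only the inference core; any data-movement overhead around it is not included.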

Romina Soledad Molina | Doha - Qatar | 2024








### **Image classification based on CNN**



Pest classification in fruit crops: precision agriculture on the edge

[Figure: Nectras IoT trap and a captured image showing *Lobesia botrana* and other insects.]













Pest classification in fruit crops

Teacher architecture based on VGG16 and obtained through transfer learning (14,818,706 parameters).














- Overall accuracy
  - Teacher architecture: **98.87%**
  - Student architecture: **95.78%**
- SoC memory footprint in terms of resource utilization at 200 MHz
  - KRIA platform: **below 21%**
  - PYNQ-Z1 platform: below 63%



### Object detection [work in progress]



#### Image from Cityscapes dataset.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., ... & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In *Proceedings of the IEEE conference on computer vision and pattern recognition* (pp. 3213-3223).






Image from Du, X., Lin, T. Y., Jin, P., Ghiasi, G., Tan, M., Cui, Y., ... & Song, X. (2020). Spinenet: Learning scale-permuted backbone for recognition and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11592-11601).



# **Final remarks**

- The proposed workflow successfully generates compressed models, leading to a **fully on-chip memory-mapped implementation** on the FPGA.
- The **integration of KD** into the ensemble of compression techniques contributes to achieving a balanced student model in terms of size, computational efficiency, and accuracy.
- The workflow addresses the entire development cycle: from the ML-based architecture training to the hardware deployment, overcoming the limitations outlined in previous works.
- Addition of FPGA metric estimation in the training stage to adapt the ML-based model to the hardware architecture.



## Workshop on Fully Programmable Systems-on-Chip for Scientific Applications







Thank you!!

Doha, Qatar 2024





# **Back up slides**

Romina Soledad Molina | Novi Sad - Serbia | 2024



# Pulse shape discriminator for cosmic rays studies




**Left:** Teacher architecture based on MLP: 16,352 parameters.

**Right:** Distilled architecture: **529** parameters. Compression ratio: 30.91×.
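The compression ratio follows directly from the two parameter counts. The sketch below reproduces it and, as a rough extension, compares weight storage under the assumption of a float32 teacher against the 8-bit fixed-point student. The float32 baseline is an assumption made here, and sparsity is ignored.

```python
teacher_params = 16_352
student_params = 529

# Ratio of parameter counts, as quoted on the slide.
ratio = teacher_params / student_params
print(f"compression ratio: {ratio:.2f}x")  # compression ratio: 30.91x

# Rough storage comparison: float32 teacher (assumption) vs 8-bit student.
teacher_bits = teacher_params * 32
student_bits = student_params * 8
print(f"weight storage: {teacher_bits} bits -> {student_bits} bits")
```

Combining distillation with quantization therefore reduces storage by more than the parameter ratio alone suggests.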



- Overall accuracy
  - Teacher architecture: **99.70%**
  - Student architecture: **98.96%**
    - 8-bit fixed point
    - Target sparsity: 20%
- SoC memory footprint in terms of resource utilization at 200 MHz
  - Artix-7: **below 27%**
- SoC latency at 200 MHz
  - Artix-7: 10 clock cycles

# ML and model compression techniques for SoC/FPGA Applications - Connections example

Pulse shape discriminator for cosmic rays studies

    from pynq import Overlay

    # Load the bitstream containing the inference IP and the AXI DMA
    ol = Overlay("hw/inference_PYNQ.bit")
    ol.ip_dict

    dma = ol.axi_dma_0
    dma_send = ol.axi_dma_0.sendchannel
    dma_recv = ol.axi_dma_0.recvchannel

    from pynq import allocate
    import numpy as np

    data_size = 30
    input_buffer = allocate(shape=(data_size,), dtype=np.uint32)

    x3 = [0, 2, 0, 0, 0, 0, 2, 14, 60, 231, 232, 232, 232, 230, 232,
          231, 233, 232, 231, 231, 232, 232, 231, 230, 232, 231, 232, 231, 230]
    # Copy the samples into the DMA buffer
    for i, sample in enumerate(x3):
        input_buffer[i] = sample

    import matplotlib.pyplot as plt

    plt.figure(figsize=(15,7))
    plt.xlabel('Samples', fontsize=11)
    plt.ylabel('Amplitude', fontsize=11)
    plt.grid(True, alpha=1.0)
    plt.plot(x3, 'o', label="Signal 1", color='navy', markersize=7, lw=1)







    hls_ip = ol.inference_HW_0
    hls_ip.register_map

    # Initialize HLS IP core
    CONTROL_REGISTER = 0x0
    hls_ip.write(CONTROL_REGISTER, 0x81)  # 0x81 sets bit 0 (ap_start) and bit 7 (auto_restart)
    hls_ip.register_map

    # Start the DMA transfer
    dma_send.transfer(input_buffer)

    output_buffer = allocate(shape=(4,), dtype=np.uint32)
    dma_recv.transfer(output_buffer)

    for i in range(4):
        print(output_buffer[i])
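The notebook steps above can be wrapped into a single helper. The sketch below is an assumption-laden illustration: `run_inference` and `LoopbackChannel` are hypothetical names, the `wait()` calls are added so the output buffer is read only after both transfers complete, and taking the arg-max assumes the four outputs are non-negative class scores. `LoopbackChannel` merely mimics the `transfer`/`wait` interface of a PYNQ DMA channel so the logic can be dry-run without a board.

```python
import numpy as np


def run_inference(dma_send, dma_recv, input_buffer, output_buffer):
    """Stream one input frame through the DMA and return the winning class index."""
    dma_send.transfer(input_buffer)
    dma_recv.transfer(output_buffer)
    dma_send.wait()   # block until the input has been sent
    dma_recv.wait()   # block until the result has arrived
    return int(np.argmax(output_buffer))


class LoopbackChannel:
    """Hypothetical stand-in for a PYNQ DMA channel (off-board dry runs only)."""

    def __init__(self, fill=None):
        self.fill = fill
        self.started = False

    def transfer(self, buffer):
        if self.fill is not None:
            buffer[:] = self.fill      # pretend the hardware wrote these scores
        self.started = True

    def wait(self):
        if not self.started:
            raise RuntimeError("wait() called before transfer()")


input_buffer = np.zeros(30, dtype=np.uint32)
output_buffer = np.zeros(4, dtype=np.uint32)
predicted_class = run_inference(LoopbackChannel(),
                                LoopbackChannel(fill=[3, 250, 1, 2]),
                                input_buffer, output_buffer)
print(predicted_class)  # 1
```

On the board, the same helper would be called with `ol.axi_dma_0.sendchannel`, `ol.axi_dma_0.recvchannel`, and buffers obtained from `pynq.allocate`.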



