



2065-D4304

#### Advanced Training Course on FPGA Design and VHDL for Hardware Simulation and Synthesis

26 October - 20 November, 2009

Presentations by Participants

### National University of San Luis San Luis, Argentina Research and Electronic Services Lab

(Laboratorio de Electrónica, Investigación y Servicios)

L.E.I.S.

#### Topics:

- FPGA Platform development for Image and Video Processing
- Virtual Instrumentation (RVI)





- Objectives
- Results
- Drawbacks
- Staff
- What's next?







- Prepare a platform capable of Image and Video processing tests, based on parallel algorithms.
- The researcher doesn't need to develop the entire platform.
- The image or video researcher should be able of implement an algorithm in hardware, test it and compare the results with a computerimplemented version.





Prepare a platform capable of Image and Video processing tests, based on parallel algorithms

#### The platform...

- ... must fit in the Xilinx Multimedia Board FPGA.
- ... must use the on-board hardware.
- ... all the available IP cores and modules can, but not necessarily must, be used.











#### The board...

- ... Video decoder ADV7185.
- ... VGA DAC FMS3810.
- ... Memory ZBT RAM K7N163601M.
- ... FPGA Xilinx Virtex II XC2V2000-6FF896.

















- Objectives
- Results
- Drawbacks
- Staff
- What's next?











#### Results











```
entity TRANSLATE is
       Port ( CLK
                                     : in
                                              STD LOGIC;
                                    : in
              RESET
                                    : in
              PAL MAX ADDRESS
                                              STD LOGIC VECTOR (18 downto 0);
                                : in
              USER ACCESS OK
                                              STD LOGIC;
                                    : in
              PAL READ DATA
                                              STD LOGIC VECTOR (31 downto 0);
                                              STD LOGIC VECTOR (31 downto 0);
              VGA WRITE DATA
                                    : out
              PAL READ ADDRESS : inout STD LOGIC VECTOR (18 downto 0);
                                   : inout STD LOGIC VECTOR (18 downto 0);
              VGA WRITE ADDRESS
                                   : inout
              PAL READ
              VGA WRITE
                                    : out
              );
   end TRANSLATE;
   architecture TRANSLATE ARCH of TRANSLATE is
   begin
                                                                             VGA_WRITE_DATA
   end architecture;
                                                                            VGA_WRITE_ADDRESS
                                                            USER_ACCESS_OK
                                                                    TRANSLATE
                                                                             PAL_READ_ADDRESS
                                                                                VGA_WRITE
                                                                                PAL_READ
Andrés Miguel Airabella
```





By-Pass, Basic EQ, RGB2GS conversion and GS2BW has been performed.















- Objectives
- Results
- Drawbacks
- Staff
- What's next?







- Virtex II is obsolete.
- Due to the FPGA discontinuity, old software versions must be used. Xilinx ISE 9.1 or 9.2.
- Clock distribution in the board is too bad!
- The board programming methods are a Parallel Port JTAG cable (too old!) and a CompactFlash.
- The video inputs and outputs are Analog.





- Objectives
- Results
- Drawbacks
- Staff
- What's next?







Andres Miguel Airabella
Carlos Federico Sosa Paez
Ricardo Petrino







- Objectives
- Results
- Drawbacks
- Staff
- What's next?







## FPGA Platform development for Image and Video Processing What's next?

- Test more complex algorithms, that involves processing more than one pixel at each time.
- Test image processing in control applications, but using the platform.
- Develop a Firewire Interface for new digital video cameras.





### National University of San Luis San Luis, Argentina

Research and Electronic Services Lab

(Laboratorio de Electrónica, Investigación y Servicios)

L.E.I.S.

#### Topics:

- FPGA Platform development for Image and Video Processing
- Virtual Instrumentation (RVI)





#### Virtual Instrumentation (RVI)

#### **Objectives**

Design a digital osciloscope using the FPGA RVI Prototype Board and the LP Data Conversion Daughter Board, provided by ICTP.

- Develop the Hardware
- Develop the Software







#### Virtual Instrumentation (RVI)



Hardware







**Software** 





### Virtual Instrumentation (RVI) Staff

Facundo Aguilera
Carlos Federico Sosa Paez
Diego Costa









Email: andres.airabella@ysbes.com.ar Website: http://www.unsl.edu.ar/~amairabe



#### **General Projects and Challenges**

Andreia Barbiero (andreia.barbiero@lactec.org.br)



#### Curriculum

- GRADUATED IN COMPUTER SCIENCE AT FEDERAL UNIVERSITY OF PARANÁ - BRAZIL
- MSC IN COMPUTER ARCHITETURE AT UFPR BRAZIL
- WORKING FOR 3 YEARS AT LACTEC RESEARCH INSTITUTE
- DEVELOPING EMBEDDED SYSTEM PROJECTS USING MICROCONTROLERS, C LANGUAGE, LINUX, JAVA...



### LACTEC RESEARCH INSTITUTE FOR DEVELOPMENT

- 50 YEARS OLD INSTITUTE
- TECHNOLOGICAL RESEARCH CENTER AND A NON-PROFIT AND SELF-SUSTAINABLE INSTITUTE
- WAS CREATED BY A GOVERNMENT COMPANY (COPEL) AND FEDERAL UNIVERSITY OF PARANÁ
- AROUND 400 RESEARCHERS WITH 60 DOCTORS AND MASTERS
- LOCATED AT CURITIBA PARANÁ BRAZIL



#### **LACTEC'S PROJECTS**

- MAINLY R&D PROJECTS INVOLVING POWER ENERGY UTILITIES COMPANIES

- POWER UTILITIES COMPANIES HAVE TO SPEND 0.5% OF THEIR PROFITS IN R&D PROJECTS. IT IS A LAW!



#### **SCENARIO**

#### ENERGY GENERATION, TRANSMITION AND DISTRIBUTION





#### **CHALLENGES**

#### BRAZIL A HUGE COUNTRY...





#### **CHALLENGES**



Advanced Training Course on FPGA Design and VHDL for Hardware Simulation and Synthesis



#### **CHALLENGES**

### - ELECTRICAL ENERGY NETWORK IMPROVEMENT AND INTEGRATION (SMARTGRID)





#### **CHALLENGS**

- PREVENT BLACK-OUT
- AVOID ENERGY ROBBERY (25-30% ENERGY LOSS)
- SELF-RECOVERY
- ONLINE CONSUMPTION MEASURE



#### R&D CPFL / LACTEC





#### PROJECTS USING FPGA (FUTURE)

#### CURRENT/VOLTAGE SENSORS TO DETECT SURGES (LIGHTNING)





### PROJECTS USING FPGA (FUTURE)

### POWER QUALITY SYSTEMS





### PROJECTS USING FPGA (FUTURE)

### SIGNAL FILTERS TO CORRECT PHASE NOISE







www.lactec.org.br

# FPGA Implementations of Artificial Neural Networks

Prof. Amr Khairat Radi Physics Department Faculty of Sciences Ain Shams University



### **Artificial Neural Networks**

- What is Neural Networks
- Artificial Neural Networks
- Applications
- Modelling a Neuron
- Hardware Neuron

The question
'What is a neural network?'
is ill-posed.
-- Pinkus (1999)

A method of computing, based on the interaction of multiple connected processing elements

## Brain Computer: What is it?





Human brain contains a massively interconnected net of 10<sup>13</sup>-10<sup>14</sup> (1000 billion) neurons (cortical cells)



Biological Neuron
- The simple
"arithmetic
computing" element

## Biological Neurons

- 1. Soma or body cell is a large, round central body in which almost all the logical functions of the neuron are realized.
- 2. The axon (output), is a nerve fibre attached to the soma which can serve as a final output channel of the neuron. An axon is usually highly branched.
- 3. The dendrites (inputs)- represent a highly branching tree of fibres. These long irregularly shaped nerve fibres (processes) are attached to the soma.
- 4. Synapses are specialized contacts on a neuron which are the termination points for the axons from other neurons.



## ANN as a Brain-Like Computer

NN as an model of brain-like Computer





#### Brain

➤ The human brain is still not well understood and indeed its behavior is very complex!
➤ There are about 1000 billion neurons in the human cortex and 60 trillion synapses of connections
➤ The brain is a highly complex, nonlinear and parallel computer (information-processing system)

- ❖ An artificial neural network (ANN) is a massively parallel distributed processor that has a natural propensity for storing experimental knowledge and making it available for use. It means that:
- ➤ Knowledge is acquired by the network through a learning (training) process;
  ➤ The strength of the interconnections between neurons is implemented by means of the synaptic weights used to store the knowledge.

The learning process is a procedure of the adapting the weights with a learning algorithm in order to capture the knowledge. On more mathematically, the aim of the learning process is to map a given relation between inputs and output (outputs) of the network.



# Mathematical Interpretation of Classification in Neural Networks

Mathematical model of quantization:

"Learning by Examples"



# Learning via Self-Organization Principle

**Self-organization** – basic principle of learning: *Structure reconstruction* 



## Modelling a Neuron





- A neuron has a set of *n synapses* associated to the *inputs*. Each of them is characterized by a weight.
- A signal  $x_i$ , i = 1,...,n at the  $i^{th}$  input is multiplied (weighted) by the weight Output  $w_i$ , i = 1,...,n
- The weighted input signals are summed. Thus, a linear combination of the input signals  $w_1x_1 + ... + w_nx_n$  is obtained. A "free weight" (or bias)  $w_n$  which does not correspond to any input, is added to this linear combination and this forms a weighted sum  $z = w_0 + w_1x_1 + ... + w_nx_n$   $\Rightarrow$  A nonlinear activation function  $\varphi$  is applied to the weighted sum. A value of the activation function  $y = \phi(z)$  is the neuron's output.

# Artificial Neuron: Classical Activation Functions







#### Threshold activation

$$\phi(z) = \operatorname{sign}(z) = \begin{cases} 1, & \text{if } z \ge 0, \\ -1, & \text{if } z < 0. \end{cases}$$

#### Logistic activation



#### Hyperbolic tangent activation



## A simplest network

- ➤ What we need to implement ANN:
- ➤ Hardware neuron (sigmoid function)
- Random generator for the weights
- Comparing
- Changing the weights after an iteration



## Hardware Neuron



## Hardware Neuron



# Logical Feedback Shift Register Random Number Generator



# Sigmoid function

$$F(z) = \begin{cases} z(\beta - \theta \ z) & for \quad 0 \le z \le L \\ z(\beta + \theta \ z) & for \quad -L \le z < L \end{cases}$$



where  $\beta$  and  $\theta$  represent the slope and the gain of the nonlinear function F(z) between the saturation regions - L and L. Fig. 2 shows a block diagram of the activation function implemented using this process.



### Shift and Add

- $Y(x)=2^{-n^*}x + b$
- Advantages
  - Small Design
  - Short Latency of 5
- Disadvantages
  - Piecewise Outputs
  - Limited Accuracy



### References

- [1] L. Smith, ed. (1996, 2001), "An Introduction to Neural Networks", URL: <a href="http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html">http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html</a>
- [2] Sarle, W.S., ed. (1997), Neural Network FAQ, URL: <a href="ftp://ftp.sas.com/pub/neural/FAQ.html">ftp://ftp.sas.com/pub/neural/FAQ.html</a>
- [3] StatSoft, "Neural Networks", URL: <a href="http://www.statsoftinc.com/textbook/stneunet.html">http://www.statsoftinc.com/textbook/stneunet.html</a>
- [4] S. Cho, T. Chow, and C. Leung, "A Neural-Based Crowd Estimation by Hybrid Global Learning Algorithm", IEEE Transactions on Systems, Man and Cybernetics, Part B, No. 4. 1999.
- [5]W. S. McCulloch, W. Pitts, "A logical calculus of the ideas immanent in nervous activity," *Bulletin of Mathematical Biophysics*, vol. 5 pp. 115-133, 1943.
- [6] J. L. McClelland, D. E. Rumelhart, *Parallel Distributed Processing: Explorations in the Microstructure of Cognition*, The MIT Press, 1986.



### What can a Neural Net do?

- Compute a known function
- Approximate an unknown function
- Pattern Recognition
- Signal Processing
- Learn to do any of the above

## **Basic Concepts**

- A Neural Network generally maps a set of inputs to a set of outputs
- Number of inputs/outputs is variable
- The Network itself is composed of an arbitrary number of nodes with an arbitrary topology



# **Basic Concepts**



Definition of a node:

 A node is an element which performs the function

$$y = f_H(\sum (w_i x_i) + W_b)$$

## Simple Perceptron

- Binary logic application
- $f_H(x) = u(x)$  [linear threshold]
- $\bullet$  W<sub>i</sub> = random(-1,1)
- $\bullet$  Y = u(W<sub>0</sub>X<sub>0</sub> + W<sub>1</sub>X<sub>1</sub> + W<sub>b</sub>)
- Now how do we train it?



## Basic Training

- Perception learning rule  $\Delta W_i = \eta * (D Y) * X_i$
- η = Learning Rate
- D = Desired Output

Adjust weights based on a how well the current weights match an objective

# Logic Training

- Expose the network to the logical OR operation
- Update the weights after each epoch
- As the output approaches the desired output for all cases, ΔW<sub>i</sub> will approach 0

| $X_0$ | $X_1$ | D |
|-------|-------|---|
| 0     | 0     | 0 |
| 0     | 1     | 1 |
| 1     | 0     | 1 |
| 1     | 1     | 1 |

## Results

 $W_0 \overline{W_1 W_b}$ 



### Details

- Network converges on a hyper-plane decision surface
- $X_1 = (W_0/W_1)X_0 + (W_b/W_1)$



## Typical Activation Functions

- $F(x) = 1 / (1 + e^{-k \sum (w_i x_i)})$
- Shown fork = 0.5, 1 and 10
- Using a nonlinear function which approximates a linear threshold allows a network to approximate nonlinear functions



# Back-Propagated Delta Rule Networks (BP)

- Inputs are put through a 'Hidden Layer' before the output layer
- All nodesconnectedbetween layers



### **BP Network Details**

- Forward Pass:
  - Error is calculated from outputs
  - Used to update output weights
- Backward Pass:
  - Error at hidden nodes is calculated by back propagating the error at the outputs through the new weights
  - Hidden weights updated

### In Matrix Form

For:
n inputs, m hidden nodes
and q outputs

**o**<sub>lk</sub> is the output of the I<sup>th</sup> neuron For the k<sup>th</sup> of p patterns

$$\mathbf{A} = \begin{pmatrix} a_{10} & a_{11} & \cdots & a_{1n} \\ a_{20} & a_{21} & \cdots & a_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ a_{m0} & a_{m1} & \cdots & a_{mn} \end{pmatrix},$$

$$\mathbf{B} = \begin{pmatrix} b_{10} & b_{11} & \cdots & b_{1m} \\ b_{20} & b_{21} & \cdots & b_{2m} \\ \cdots & \cdots & \cdots & \cdots \\ b_{q0} & b_{q1} & \cdots & b_{qm} \end{pmatrix}.$$

$$o_{lk} = f_H \left( \sum_{j=0}^m b_{lj} f_H \left( \sum_{i=0}^n a_{ji} x_{ik} \right) \right), \qquad 1 \le k \le p.$$

 $\mathbf{v}_k$  is the output of the hidden layer  $\mathbf{o}_k$  is the true output vector

$$egin{aligned} oldsymbol{v}_k &= \begin{pmatrix} 1 \\ F_H(oldsymbol{A}oldsymbol{x}_k) \end{pmatrix} \ oldsymbol{o}_k &= F(oldsymbol{A}, oldsymbol{B}, oldsymbol{x}_k) = F_H(oldsymbol{B}oldsymbol{v}_k) \end{aligned}$$

### Matrix Tricks

- $\bullet E(A, B) = \sum_{k=1}^{p} \sum_{k=1}^{p} (\mathbf{t}_{k} \mathbf{o}_{k})^{T} (\mathbf{t}_{k} \mathbf{o}_{k})^{T}$
- t<sub>k</sub> denotes true output vectors
- The optimal weight matrix of B can be computed directly if f<sub>H</sub><sup>-1</sup>(t) is known
- $\bullet \mathbf{B}' = f_{H}^{-1}(\mathbf{t})\mathbf{v}^{\mathsf{T}}(\mathbf{v}\mathbf{v}^{\mathsf{T}})^{*}$
- So... E(A, B) = E(A, B(A)) = E'(A)
  - Which makes our weight space much smaller

# Hybrid LS RS/SA/GA Training

 Delta rule training may converge to a local minimum

- Hybrid Global Learning (HGL) will converge on a global minimum
  - Randomize A [-0.5, 0.5]
  - Minimize the Error function E'(A)

# Random Sampling

- Randomize A [-0.5, 0.5]
- Loop for a long time
  - Compute  $B_{new} = B(A)$
  - if E(A, B) < THRESH
    - Training Complete
  - else
    - $\bullet$  A<sub>new</sub> = A + N(0,  $\sigma$ )

    - if  $E(A, B) > E(A_{new}, B_{new})$ 
      - $E(A, B) = E(A_{new}, B_{new})$
      - $A = A_{new}$ ;  $B = B_{new}$ ;  $\sigma = \sigma_{bound}$

## Other methods

- Simulated Annealing
  - More accurate results
  - Much slower
- Genetic Algorithms
  - More accurate results
  - Slower
- For details on methods and results see:
  - S. Cho, Chow, C. Leung, "A neural-based crowd estimation by hybrid global learning algorithm", Systems, Man and Cybernetics, Part B, IEEE Transactions on, Page(s): 535-541

## Alternative Activation functions

- Radial BasisFunctions
  - Square
  - Triangle
  - Gaussian!

(μ, σ) can be varied at each hidden node to guide training





# Alternate Topologies

 Inputs analyze signal at multiple points in time

RBF functions may be used to select a 'window' in the input data



# Typical Topologies

- Set of inputs
- Set of hidden nodes
- Set of outputs
- Too many nodes makes network hard to train

# Supervised Vs. Unsupervised

- Previously discussed networks are 'supervised'
  - Need to be trained ahead of time with lots of data
- Unsupervised networks adapt to the input
  - Applications in Clustering and reducing dimensionality
  - Learning may be very slow

# Current Applications

- Investment Analysis
  - Predicting movement of stocks
  - Replacing earlier linear models
- Signature Analysis
  - Bank Checks, VISA, etc.
- Process Control
  - Chemistry related
- Monitoring
  - Sensor networks may gather more data than can be processed by operators
  - Inputs: Cues from camera data, vibration levels, sound, radar, lydar, etc.
  - Output: Number of people at a terminal, engine warning light, control for light switch

## Current Work

- Work of S. Cho, Chow and C. Leung
- Training Neural Network to approximate crowd size based on image cues
- Network approximates a non linear mapping between input and output space
- Their input cues: Black Pixels, White Pixels, Edge Pixels
- Our proposed inputs: Edge Histogram, Blob Size histogram (based on background model)



## Middleware Switch

Muhammad Faisal German Aerospace Center





#### Middleware Switch

History

- Need for the Core avionic System in Standard Satellite Bus (SSB).
- Start of *Compatksatellit* project.







# Testing of the Middleware Switch board



# Testing of the Middleware Switch board



# Testing of the Middleware Switch board



# SpaceWire Port



## SpaceWire Protocol

SpaceWire is a standard for high-speed links and networks for use onboard spacecraft, easing the interconnection of: enormous mass-memory processing units, and downlink telemetry sub-systems.

# SpaceWire Protocol







## Spacewire

. SpaceWire is defined in the European Cooperation for Space Standardization ECSS-E50-12A standard

## SpaceWire Protocol

- SpaceWire is being widely used on many space missions by:
- <u>ESA</u>
- NASA
- JAXA
- SpaceWire equipment is connected together using SpaceWire links which are:
- serial,
- high-speed (2 Mbits/sec to 200 Mbits/sec),
- bi-directional,
- full-duplex.









## Residue-to-Decimal Converters for Moduli Sets With Common Factors

#### Kazeem Gbolagade<sup>1,2</sup> and Sorin Cotofana<sup>2</sup>

1.Computer Sc. Dept., UDS, Navrongo, Ghana

2. Computer Engineering Laboratory,

Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology (TU Delft), The Netherlands.





## **Presentation Overview**

- Introduction
- Proposed Algorithm
- Performance Evaluation
- Conclusions

ICTP 2009 <sup>2</sup>



## Residue Number Systems

- The basis for an RNS is a set of relatively prime integers  $\{m_1, m_2, m_3, ..., m_n\}$  where gcd( $m_i, m_j$ )=1 for i≠j and  $M = \prod_{i=1}^{m} m_i$
- $X \longrightarrow [0,M-1] \longrightarrow X = (x_1,x_2,x_3,...,x_n) \text{ where } x_i = |X|_{m_i}$
- Suppose  $K = (k_1, k_2, k_3, ..., k_n)$  and  $L = (l_1, l_2, l_3, ..., l_n)$  and  $\Theta$  rep. operation of +,-, & \*
- If  $W = (w_1, w_2, w_3, ..., w_n)$ and  $W = K \Theta L$   $w_i = |k_i \Theta l_i|_{m_i}$

ICTP 2009



# RNS Advantages and Challenges

- RNS advantages:
  - Carry free addition/subtraction,
  - Parallelism,
  - Error detection and correction.
- RNS challenges:
  - Sign detection,
  - Magnitude comparison,
  - Overflow detection,
  - Division and other complex operations,
  - Moduli Selection,
  - Residue to Binary/Decimal Conversion and vice-versa.



#### RNS to Decimal Conversion

**Conventional Methods** 

Chinese Remainder Theorem (CRT)

Large Modulo M

Mixed Radix Conversion (MRC)

**Naturally Serial** 

**Research Question**: Given the moduli set  $\{2n+2,2n+1,2n\}$ , can we select a moduli set such that CRT can be performed with less expensive modulo operations? Given that larger dynamic range is of practical interest, can we extend the moduli set  $\{2n+2,2n+1,2n\}$  to  $\{2n+3,2n+2,2n+1,2n\}$ ?



# **Proposed Conversion Method**

Assume the moduli set  $\{2n+2, 2n+1, 2n\}$ .

#### Strategy:

- Eliminate the computation of multiplicative inverses,
- Simplify the resulting CRT,
- Eliminate the explicit calculation of the modulo operation,
- Possible extension to {2n+3,2n+2,2n+1,2n}.



#### Elimination of the Multiplicative Inverses

Given the moduli set  $\{2n+2, 2n+1, 2n\}$  with  $m_1=2n+2, m_2=2n+1, m_3=2n$ , we obtain the following :

$$\left| \left( m_1 m_2 \right)^{-1} \right|_{\frac{m_3}{2}} = \frac{n+1}{2},$$
 (1)

$$\left| \left( m_2 \frac{m_3}{2} \right)^{-1} \right|_{m_1} = n + 2, \qquad (2)$$

$$\left| \left( m_1 \frac{m_3}{2} \right)^{-1} \right|_{m_2} = 2n - 1.$$
 (3)

ICTP 2009 7



### CRT Simplification (1)

For moduli set sharing a common factor, it has been proved in (Wang, 1999) that  $(x_1,x_2,x_3)$  represents a valid number if and only if  $(x_1+x_3)$  is even.

The decimal equivalent of the RNS number  $(x_1, x_2, x_3)$  for the moduli set  $\{m_1, m_2, m_3\}$  in the form  $\{2n+2, 2n+1, 2n\}$  is computed for  $(x_1+x_3)$  even, as follows:

$$X = \left| \frac{m_2 m_3}{2} x_1 - m_1 m_3 x_2 + \frac{m_1 m_2}{2} x_3 \right|_{M_L}$$
 (4)

ICTP 2009 8



### CRT Simplification (2)

Using the lemma  $|am_1|_{m_1m_2} = m_1|a|_{m_2}$  presented in (Wang, '99), we further simplify the equations and we obtain for X positive and negative, respectively, the following:

$$X = (x_2 - x_1)m_1 + x_1 + m_1 m_2 \left| \frac{(x_1 + x_3)}{2} - x_2 \right|_{\frac{m_3}{2}}$$
 (5)

$$X = (x_2 - x_1)m_1 + x_1 + m_1 m_2 \left( \left| \frac{(x_1 + x_3)}{2} - x_2 \right|_{\frac{m_3}{2}} + \frac{m_3}{2} \right)$$
 (6)

# Seption to state

### Extension: $\{2n+3,2n+2,2n+1,2n\}$

The decimal equivalent of the RNS number  $(x_1, x_2, x_3, x_4)$  for the moduli set  $\{m_1, m_2, m_3, m_4\}$  in the form  $\{2n+3, 2n+2, 2n+1, 2n\}$  is computed as follows:

$$X = (x_1 + x_2 + x_3) + m_1 m_2 m_3 \left| k_1 x_1 + k_2 x_2 + k_3 x_3 + \left| M_4^{-1} \right|_{m_4} x_4 \right|_{m_4}$$

 $k_1$ ,  $k_2$ , and  $k_3$  can be pre-computed.

This proposal is better than the traditional CRT.



#### Low Cost Modulo Calculation

We summarize the 4 possibilities as follows:

- If the tentative sum is  $< -\frac{m_3}{2}$ , add m<sub>3</sub>;
- If the tentative sum is  $>=-\frac{m_3}{2}$ , add  $\frac{m_3}{2}$ ;
- If tentative sum is zero and  $x_1 > x_2$ , add  $\frac{m_3}{2}$ ;
- Otherwise, do nothing.

In this way all the modulo operations have been replaced by a single corrective addition or subtraction greatly reducing the complexity of the converter.



#### Performance Evaluation

#### Area-delay comparison:

| Metrics        | Wang, et al ,'99                 | Our proposal                       |
|----------------|----------------------------------|------------------------------------|
| Area           | 4 adders,<br>2 multipliers       | 4 adders,<br>2 multipliers         |
| Delay          | 3 additions<br>2 multiplications | 3/4 additions<br>1 multiplications |
| Mod Operations | $m_1 m_3$                        | $m_3$                              |

- Apparently similar area and smaller delay,
- Simpler modulo operation.



### Performance Evaluation (1)

Synthesized results: Area-delay comparison

| n  | Our Area | Area<br>(Wang) | Our Delay | Delay<br>(Wang) |
|----|----------|----------------|-----------|-----------------|
| 10 | 56       | 80             | 32.524    | 40.042          |
| 11 | 78       | 107            | 33.372    | 43.683          |
| 12 | 56       | 78             | 31.978    | 37.454          |
| 15 | 88       | 116            | 37.424    | 49.108          |
| 20 | 64       | 93             | 31.545    | 39.9            |

- On average, our proposal is 20% faster than Wang's converter,
- On average, our proposal requires 29% lesser area.



### Performance Evaluation (2)





 The figures show that the proposed converter outperforms Wang's proposal in terms of both area and delay.

14



### Conclusions

- We presented a new algorithm for RNS to Decimal conversion using the moduli set {2n+2, 2n+1, 2n}.
- We eliminated the computation of multiplicative inverses.
- We further simplified the resulting CRT in order to obtain a low complexity implementation which does not require the explicit utilization of modulo operation in the conversion process.
- The proposed scheme is shown to be better in terms of both area cost and conversion delay when compared to the best known equivalent state of the art converters.

ICTP 2009 15



# 



Md. Saiful Islam
Lecturer, Institute of Information Technology
University of Dhaka
Dhaka-1000, Bangladesh

#### CONTENTS

- Reversible Logic
  - Motivation
  - Reversible Computing
  - Applications
- Reversible Gates
  - Basic Reversible Gates
  - Feynman, Toffoli, Fredkin, Peres and F2G gate
- Properties of Reversible Network
- Reversible Full-Adder: An Example
- Toffoli Gate: An Example
- Complexity Analysis
- Reversible Logic Minimization
- Generalized Algorithm: An Example

November 4, 2009



November 4, 2009

### MOTIVATION

Getting more and more information...

What about Nizar...?



BEFORE AFTER

AF

You may be overheated and lose your hair ...uh, its too hot to keep

Don't keep it, lets them go out...um, now its cool

November 4, 2009

4

### MOTIVATION (CONTD...)

Landauer 's Principle:

kTln2 Joule for every bit of information lost

Where k is the Boltzmann Constant (1.38  $\times$  10<sup>-23</sup> J/K) T is the absolute temperature



### MOTIVATION (CONTD...)

#### CPU Transistor Counts 1971-2008 & Moore's Law



### MOTIVATION (CONTD...)

As we pack more and more logic elements into smaller and smaller volumes and clock them at higher and higher frequencies, we dissipate more and more heat.

Energy cost money

Systems overheat

Portable systems exhaust their batteries

November 4, 2009

### REVERSIBLE COMPUTING

- Also known as non-destructive computing
- Implements bijective function
- One-to-one mapping
- Every input vector will be mapped to a unique output vector



### **APPLICATIONS**

- Cryptography
- Digital Signal Processing
- DNA Computing
- Optical Information Processing
- Nanotechnology
- Low Power CMOS Design, etc.

November 4, 2009



Advanced Training Course on FPGA Design and VHDL for Hardware Simulation and Synthesis

### BASIC REVERSIBLE GATES





Feynman Gate (1985)

| A | В | Р | Q |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 |

Two-through gate...



Toffoli Gate (1980)

November 4, 2009



11

ICTP is boring, no dance, no party...

We wish to dance and have party...cheers

The lady is dancing controlled by the gentleman...



One-through gate...



The gentle man (active)...

Fredkin gate (1982)

The lady is changing her hands...(exchange)





Peres gate (1985)

| A | В | С | Р | Q | R |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 |

One-through gate...



Feynman Double gate

| A | В | С | Р | Q | R |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 |

### **PROPERTIES**

- No. of input lines must be equal to no. of output lines.
- No fan-out is allowed
- No feedback
- Networks include only Reversible gates

# REVERSIBLE FULL ADDER: AN EXAMPLE



# REVERSIBLE FULL ADDER: AN EXAMPLE (CONTD...)



### TOFFOLI GATE: AN EXAMPLE (CONTD...)



Toffoli gate from other Reversible gates...

### **COMPLEXITY ANALYSIS**



### **COMPLEXITY ANALYSIS (CONTD...)**

- □ Garbage Outputs (GO): outputs that are not used as primary outputs are termed as garbage outputs
- ☐ Constant Inputs (CI): constants are the input lines that are either set to zero(0) or one (1) in the circuit's input side
- ☐ Gate Count (GC): number of gates used to realize the system
- □ Hardware Complexity (HC): refers to the number of basic gates (NOT, AND and EXOR gate) used to synthesize the given function

### **COMPLEXITY ANALYSIS (CONTD...)**

□ Area Consumption (AC): the area consumed by a design can be calculated as the sum of the width of individual gates used to realize the system.

$$AC = \sum_{i=1}^{n} w_i$$

□Path Delay (PD): total number of gates on the way of final output



### REVERSIBLE LOGIC MINIMIZATION

To synthesize a given function

- ✓ Minimize the Gate Count
- ✓ Use minimum no. of Constant Inputs
- ✓ Reduce the no of Garbage Outputs produced
- ✓ Reduce Hardwire Complexity as much as possible.
- ✓ Minimize Area Consumption
- ✓ Reduce the overall Path Delay

### GENERALIZED ALGORITHM: TRANSFORMATION-BASED



We use such gates for any number of inputs

### GENERALIZED ALGORITHM: TRANSFORMATION-BASED

- Reversible function is given as a truth table
- For each row in the truth table
  - Introduce Toffoli gates such that the output equals the input
  - Make sure that previous outputs are not affected



# TRANSFORMATION-BASED (CONTD...) BIDIRECTIONAL



# TRANSFORMATION-BASED (CONTD...) BIDIRECTIONAL



### TRANSFORMATION-BASED (CONTD...) IMPROVEMENTS

- Output permutations
  - via swap gates
- Control input reduction
  - there may be more than on possible assignment of control variables
  - select the one that makes the function "simple"
- Template matching

### TRANSFORMATION-BASED (CONTD...) TEMPLATES

- Idea: replace a sequence of gates with an equivalent shorter sequence
- example:



### 



November 4, 2009

### REFERENCES

- Nizar Abdallah
- R. Landauer, "Irreversibility and Heat Generation in the Computational Process", IBM Journal of Research Development, 5,1961, 183-191.
- Bennett, C., "Logical Reversibility of Computation," IBM Journal of Research and Development, 17, 1973, 525-532.
- R. Feynman, "Quantum Mechanical Computers," Optics News, Vol. 11, pp. 11–20, 1985.
- T. Toffoli, "Reversible Computing," Tech memo MIT/LCS/TM-151, MIT Lab for Comp. Sci, 1980.
- E. Fredkin and T. Toffoli, "Conservative logic," Int'l J. Theoretical Physics, Vol. 21, pp.219–253, 1982.
- Peres, A., 1985. Reversible Logic and quantum computers. Physical Review: A, 32(6): 3266-3276.
- Dmitri Maslov, "Reversible Logic Synthesis", PhD Dissertation, Computer Science Department, University of New Brunswick, Canada, Oct 2003.

#### REFERENCES (CONTD...)

- Md. Saiful Islam, et. al., Fault Tolerant Reversible Logic Synthesis: Carry Look-Ahead and Carry-Skip Adders, ACTEA 2009, pp. 396-401.
- Ralph C. Merkle, "Two Types of Mechanical Reversible Logic", <a href="http://www.zyvex.com/nanotech/mechano.html">http://www.zyvex.com/nanotech/mechano.html</a>
- **Md. Saiful Islam** and Md. Rafiqul Islam, "Minimization of Reversible Adder Circuits", Asian Journal of Information Technology, Vol. 4, No. 12, pp. 1146-1151, 2005.
- B. Parhami, "Fault-Tolerant Reversible Circuits", Proc. 40th Asilomar Conf. Signals, Systems and Computers, October 2006, Pacific Grove, CA, pp: 1726-1729.
- M. Miller, D. Masolv and G. W. Dueck, "A Transformation Based Algorithm for Reversible Logic Synthesis", Design Automation Conference, 2003.

November 4, 2009 34

DON'T ASK MUCH...THIS WILL MAKE ME HAPPY ...!!!!!!!!!!

November 4, 2009 35

# Implementing a Distributed Shared Memory for Switch Fabric on FPGA

#### Hadi Khani

Router lab, ECE Department, Faculty of Engineering, University of Tehran<sup>2</sup> Tehran, Iran

#### What is Switch Fabric?

 Switch Fabric is a Hardware Module in Network Routers

 Switch Fabric directs a cell to the proper output port actor ing to destination field in the header



#### Speed Up

- Problem: If we direct cell in sequence it will be too slow!
- Solution: We can direct all arriving cells
  - At the same time
  - In parallel
  - Concurrently
  - In a Dataflow Manner



#### Collision

 Question : what happens if two cells go to one output?



#### Collision

 And Worse, what happens if all cells go to one output?



#### Solution (Queuing)

 Some cells must be stored in a shared memory before going outside



## After a while We have queues at every output ports

So every cells must be written into a shared memory through on single port in sequence



#### Low Capacity Switch Fabric

### Switch Capacity = Memory Frequency x Port Width

- Example
  - Memory port width = 32 bit
  - Memory Access Frequency = 100 MHz
  - Switch Capacity = 3.2 Gbps

#### Distributed Shared Memory

(instead of a monolithic one)

In order to avoid any collision during writing process we must distribute cell to all memory banks



#### Higher Capacity Switch Fabric

Switch Capacity = Memory Frequency x Port Width x No Ports

#### Example

- Memory port width = 32 bit
- Memory Access Frequency = 100 MHz
- No of Ports = 4
- Switch Capacity = 12.8 Gbps

## Distributed Shared Memory Block Diagram



#### Clarifying Concept with an Example

- 4 x 4 switch
- Each cell consists of 4 words
- Cell Time Slot = 4 clock cycles

### First Clock Cycle (WRITE PROCESS)



#### Second Clock Cycle

(WRITE PROCESS)



#### Third Clock Cycle

(WRITE PROCESS)



#### Fourth Clock Cycle

(WRITE PROCESS)



#### First Clock Cycle



#### Second Clock Cycle



#### Third Clock Cycle



#### Fourth Clock Cycle



### Implemented Switch Fabric Specification in 2005

- Virtex II 8000 from Xilinx
- 16 \* 16 switch Fabric
- Working Clock = @ 70 M Hz
- Verilog HDL
- Cell Size = 81 Bytes (with 65 bytes payload size)
- Switching Capacity = 20 Giga Bit per second

#### **Thanks**

Question? / Answer





# FPGA-based Correlator for Random Signal Processing in Noise Radar

Konstantin Lukin, Olena Melnykova, Sergey Lukin and Pavlo Vyplavin





#### Outline

- Time domain correlator for continuous signals
- Frequency domain correlator
- Correlator processing results
- Conclusions





### Estimation of cross-correlation using FPGA

$$K_{\tau} = \sum_{n=1}^{N} A_{n-\tau} B_n$$

## Diagram of correlator for CW signals



LHDES



- Representation of received data



#### Flow diagram of cross-correlation computing model



- Transmitting results to PC







### Flow diagram of multichannel correlator using FPGA



7





### Flow diagram of multichannel correlator using FPGA







#### FFT of a length-N sequence

$$N-1$$

$$X[k] = \sum_{n=0}^{\infty} x(n)e^{(-j2\pi nk)/N}$$

where k = 0, 1, ... N - 1





#### Diagram of FFT Module







### Flow diagram of FFT-based model using FPGA



11



### Flow diagram of FFT-based correlator model using FPGA



12



## User Interface L. H. D. E. S.







## Results of processing of noise signals





# Results of processing of noise signals







Results of processing of delta-pulses











# Results of processing of periodic functions



# Results of processing of periodic signal and delta-pulse







### Conclusions

- 1. We have implemented algorithms for correlation processing in FPGA in time domain and we are implementing algorithms in spectral domain.
- 2. The analysis shows that all correlators have rather good performance and each of them can be used in specific applications.
- 3. We realized correlator in the Altera/Stratix evaluation board.
- 4. In future we plan using these correlators in real radars.



L.A. Moreno-Coria<sup>1</sup>, J.G. Pérez-Luna<sup>1</sup>, B.S. Soto-Cruz<sup>1</sup>, J.L. Sosa-Sanchez<sup>1</sup>

<sup>1</sup> Center for Research in Semiconductor Devices, Benemérita Universidad Autónoma de Puebla, México.

Advanced Training Course on FPGA Design and VHDL for Hardware 🧓 Simulation and Synthesis 👝

- 26 October 2009-20 November 2009
  - Trieste-Italy 🧶



- > Introduction
- > Method
- > Discussion



Figure 1. Insolation for cover global demand for energy.

Are currently developing mechanisms for harnessing solar energy such as:

Solar Electricity
Solar Fuels
Solar thermal systems (thermoelectric cells, thermionic cells but this requires high temperatures [1])

In this work, we are working on the implementation of a solar tracking control of polar and high precision from a set of geometric equations whose define the apparent position of the sun.

Using FPGA, we create conditions for the search of optimization of power generation, as is the case of thermionic-thermoelectric generators, where cogeneration is done with a better conversion efficiency.

[1] Pérez Luna J. G., Desarrollo de un Generador Termoiónico de Corriente Alterna, Tesis Doctoral, U.N.A.M., México 2001.

To locate the solar position, we use the geometric model that determines the position centered on the position of the observer (Fig 2):



Figure 2. Definition of latitude L, hour angle h, a point P on the Earth's surface and solar declination  $\delta$ .

To locate the solar position we use the geometric model, that determines its position based on the position of the

observer (Fig 3):



Figure 3. Angles of the solar position.

From the Figures 1 and 2, we can obtain the angles involved in solar geometry. Their relationships through trigonometric inferences, are:

$$\delta = 23.45 \sin \left[ \frac{360}{365} (284 + n) \right]$$

$$Solar Time = Oficial Time + E + 4(L_{ref} - L_{loc})$$

$$E = 9.87 \sin (2B) - 7.53 \cos B - 1.5 \sin B$$

$$B = \frac{360}{364} (n - 81)$$

$$n = Day of the year (1 \le n \le 365)$$

$$\omega = -(15^{\circ}/hr)(solar time) + 180^{\circ}$$

$$sin \alpha = \cos \varphi \cos \delta + sen\varphi \sin \delta$$

$$\cos \delta \sin \omega$$
8

E= equation of time  $L_{ref}=$  length of the official time reference meridian for the zone in question  $L_{loc}=$  length of the meridian of the place



- > Introduction
- > Wethod
- > Discussion

• Once established the restriction of movement to two degrees of freedom:

- Solar altitude ( a )
- Solar azimuth ( y )

#### Diagram flow

➤ Is part of an algorithm that calculates the solar position where data is entered and then you get the angles that define a particular position, finally, updates the timedata and obtains the solar position for each instant

Figure 5. Diagram flow.



#### PID Control in closed loop

$$m(k) = k_p e(kT) + K_i \sum_{h=1}^{k} \frac{e((k-1)T) + e(kT)}{2} + k_d((e(kT) - e((k-1)T)))$$
 9

m(t) = output response in the control stage e(t) = actuating error signal  $k_p$  = proportional control action  $K_i$  = integral control action  $k_d$  = derivative control action

> The PID acts on the error of the variable, which will be manipulate through an appropriate combination of the three control measures

#### Power stage



Figure 6. Schematic diagram of the power stage.

- >The currents of the PWM signals (of the order of 25mA) are amplified to control dc motors
- ➤ It has 4 channels and provides up to 1A per channel
- Each channel is controlled by signals compatible with TTL input
- Each pair of channels has an enable signal that disconnects the output from them, well, feeding is independent of the control signals and dc motors fed from 5 to 12 Volts.





## Discussion

Applications

Solar Concentrator is using for:

- synthesis of phthalocyanines,
- > wafer oxidation,
- reactions of biofuels,
- > thermoelectric generators,
- > thermionic cells,
- > and photovoltaic cells

CHEMICAL REACTIONS

DIRECT CONVERSION

## Thank you!

## CONTACT PERSON

L.A. Moreno-Coria acoria acee.buap.mx

Centro de Investigaciones en Dispositivos Semiconductores, Benemérita Universidad Autónoma de Puebla, 14 Sur y Av. San Claudio, Ciudad Universitaria, C.P. 72570, Puebla, México.

\*Also: jgperezl@siu.buap.mx, blanca.soto@icbuap.mx, jose.sosa@icbuap.buap.mx

## Hardware Architecture for Particle Swarm Optimization using Floating-point Arithmetic

Daniel M. Muñoz Arboleda

Ph. D. Student - Mechatronic Systems GRACO – Automation and Control Group Mechanical Engineer Department University of Brasília, D.F., Brazil

> 4 November 2009 ICTP – Trieste - Italy

#### **About GRACO**

The research activities are organized in three areas:

- Automation and Manufacturing process
  - Welding process
  - Manufacturing automation process
- Robotic systems
  - Sensing, mapping and navigation for autonomous mobile robots
  - Computer vision
- Composite and functional materials
  - Shape Memory Alloys SMA
  - Piezoelectric ceramics sensors



#### **About GRACO**

- Staff
  - > 6 Ph. D. Professors
  - > 15 Master Students
  - > 9 Ph. D Students
- Cooperation
  - University of Santa Catarina Brazil
  - Pontificial Catholic University of Paraná Brazil
  - University of Uberlandia Brazil
  - > Cranfield University England
  - > INRIA France
  - > ITIV Universität Karlsruhe Germany
  - Europäisches Zentrum für Mechatronik
  - > PETROBRAS
  - > FIAT Brazil
  - Rockwell Automation
  - > Eletronorte













#### **About GRACO**

#### **Reconfigurable Systems Laboratory**

#### Research focuses

- High performance computer systems
- Architecture synthesis
- Hardware Synthesis and SoC Integration

#### Typical projects are from the fields

- Robotics (manipulators and or mobile robots)
- Automotive applications
- Welding processes
- Computer Vision applications
- Intelligent Systems and Optimization processes
- Control systems
- Building Automation







## Hardware Architecture for Particle Swarm Optimization using Floating-point Arithmetic

Daniel M. Muñoz Arboleda

Ph. D. Student - Mechatronic Systems
GRACO – Automation and Control Group
Mechanical Engineer Department
University of Brasília, D.F., Brazil

4 November 2009 ICTP – Trieste - Italy

#### Summary

Introduction

The PSO operation

Hardware implementations

Synthesis results

Simulation results

Conclusions

Future works

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

#### **Summary**

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

Particle Swarm Optimization (PSO) is a bio-inspired stochastic technique.

• Easy implementation



• Less computational requirements



• Non-gradient computation



• Parallel capabilities



• Large elapsed time



Parallel implementations of PSO can increase the throughput and improve their performance!

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works



Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

FPGA implementation of the PSO is a feasible and cheap solution, exploring the parallel capabilities by implementing on a single chip:

- Parallel particles
- Simultaneous computations

#### **HARDWARE IMPLEMENTATION PROBLEM:**

FPGAs only provides integer or fixed-point arithmetic, whereas an CPU usually works with a floating-point arithmetic.

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

Fixed-point arithmetic can works with high precision by using large bit-width representations.

Floating-point arithmetic is an essential component of high-performance computer systems.

• High precision computation. Capabilities to retain its resolution

• Large dynamic range to represent small and large real

numbers

The choice of using a fixed or floating-point arithmetic depends on whether the floating-point is needed by the data set?

Many optimization problems require to work with large and small real numbers.



Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

## The PSO Operation

- Population is called swarm.
- Each individual is called particle.
- Mass-less and volume-less.
- Each particle i has a current vector position  $x_i$ , a personal best position vector  $y_i$  and a velocity vector  $v_i$
- The PSO shares the global best position  $y_s$  among all the particles.



Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

### The PSO Operation

$$x_{ij}^{(t+1)} = x_{ij}^{(t)} + v_{ij}^{(t+1)}$$
(1)

$$v_{ij}^{(t+1)} = w v_{ij}^{(t)} + c_1 r_1 (y_{ij}^{(t)} - x_{ij}^{(t)}) + c_2 r_2 (y_s^{(t)} - x_{ij}^{(t)})$$
 (2)

w: inertia weight. Linearly decrease [0.9 to 0.1]

 $c_1$ : cognitive coefficient

 $c_2$ : social coefficient

 $r_1, r_2$ : uniform random numbers [0 to 1]

The velocities are clamped to the range  $[v_{\min}, v_{\max}]$ Large  $c_1$  values provide particles with a high self confidence in their experience.

Large  $c_2$  values provide particles with high confidence in the swarm.

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

### Hardware Implementations

#### The IEEE-754 standard

| 1 '      | Ew       | <u></u> — Mw — — — |  |
|----------|----------|--------------------|--|
| S        | E        | M                  |  |
| <u>+</u> | e + bias | mantissa = 1.M     |  |

Three binary components:

- Sign S
- Exponent *E*
- Mantissa M

This standard allows the user to work not only with single (32 bits) or double (64 bits) precisions, but also with the suitable bit-width according to the applications requirements.

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

### Particle architecture



$$\begin{aligned} v_{ij}^{(t+1)} &= w v_{ij}^{(t)} + r_1 \left( y_{ij}^{(t)} - x_{ij}^{(t)} \right) + r_2 \left( y_s^{(t)} - x_{ij}^{(t)} \right) \\ x_{ij}^{(t+1)} &= x_{ij}^{(t)} + v_{ij}^{(t+1)} \end{aligned}$$

- One floating-point addition/subtraction (FPadd)
- One floating-point multiplier (FPmul)
- One floating-point Linear Feedback Shift Register The FP operators have been previously developed.

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

### **Floating-point LFSR**



A Linear Feedback Shift Register (LFSR) is a shift register based technique.

- Several bits (taps) are chosen as feedback function.
- By selecting the appropriate *taps* the bit sequence works as a pseudo-random number generator.
- By selecting the appropriate changing bits in both the exponent and mantissa words, the LFSR works in the desired range.  $U[0, c_1]$  or  $U[0, c_2]$

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

### General HPSO architecture



The HPSO was developed using a Finite State Machine (FSM) approach. It is composed of:

- A swarm unit (S parallel particles)
- An evaluation unit (S parallel fitness functions)
- An individual best detection unit
- A global best detection unit

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

The proposed HPSO architecture has been validated using a swarm composed of 4 particles (S=4) and four well-know benchmark problems for a two-dimensional situation (N=2), also implemented in hardware.

• Sphere: 
$$f(\vec{x}) = \sum_{i=1}^{N} x_i^2$$
 (4)





Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

# Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

• Quadric: 
$$f(\vec{x}) = \sum_{i=1}^{N} \left( \sum_{j=1}^{i} x_j \right)^2$$





**Hardware Architecture for PSO using Floating**point Arithmetic

(5)

#### **Hardware Implementations**

Future works

• Rosenbrock:  $f(\vec{x}) = \sum_{i=1}^{N/2} 100 \cdot (x_{2i} - x_{2i-1}^2)^2 + (1 - x_{2i-1})^2$  (6)



Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summar

Introduction

The PSO operation

#### Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works



Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

# **Hardware Implementations**

Synthesis results

Simulation results

Conclusions

Future works

• Rastrigin: 
$$f(\vec{x}) = \sum_{i=1}^{N} (x_i^2 - 10\cos(2\pi x_i) + 10)$$
 (7)



Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

# Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

• Rastrigin: 
$$f(\vec{x}) = \sum_{i=1}^{N} (x_i^2 - 10\cos(2\pi x_i) + 10)$$
 (7)



# Hardware Architecture for PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

# **Hardware Implementations**

Synthesis results

Simulation results

Conclusions

Future works

### Results

The same experiment conditions were used for both the hardware ans software implementations.

| Parameter                                             | Value                          |
|-------------------------------------------------------|--------------------------------|
| Max. number iterations                                | 500                            |
| Cognitive Coefficient $c_1$                           | 2.0                            |
| Social Coefficient $c_1$                              | 2.0                            |
| Inertia weight                                        | [0.9 to 0.1]                   |
| Initial velocity                                      | 0.5                            |
| Maximum velocity                                      | 3.0                            |
| Domain range Rosenbrock Domain range remain functions | [-16.02 16.02]<br>[-5.12 5.12] |

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

# **Synthesis Results**

The proposed HPSO architecture was described in VHDL and synthesized for a FPGA Virtex5 family (chip xc5vlx330).

HPSO single precision (32 bits)

| Implemented FP-core | Flip-Flops<br>Max 207360 | LUTs<br>Max 207360 | DSP MUL<br>Max 192 | Freq.<br>MHz |
|---------------------|--------------------------|--------------------|--------------------|--------------|
| Particle            | 358 (0.17%)              | 1298 (0.63%)       | 2 (1.04%)          | 97.5         |
| PSO-Sphere          | 4246 (2.05%)             | 12058 (5.81%)      | 26 (13.5%)         | 90.5         |
| PSO-Quadric         | 4441 (2.14%)             | 12578 (6.07%)      | 18 (9.38%)         | 91.8         |
| PSO-Rosenbrock      | 3377 (1.63%)             | 8423 (4.06%)       | 10 (5.21%)         | 94.7         |
| PSO-Rastrigin       | 8474 (4.09%)             | 27067 (13.1%)      | 42 (21.9%)         | 90.6         |

The HPSO is feasibility implemented for all bit-width. The HPSO requires a large number of embedded multipliers. The complexity of the fitness function affects directly the scalability of the HPSO architecture.

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

**Synthesis results** 

Simulation results

Conclusions

Future works

### Validation of the FPLib

PSO for a Rastrigin function without random factor

$$x_{ij}^{(t+1)} = x_{ij}^{(t)} + v_{ij}^{(t+1)}$$
(1)

$$v_{ij}^{(t+1)} = w v_{ij}^{(t)} + r_1 (y_{ij}^{(t)} - x_{ij}^{(t)}) + r_2 (y_s^{(t)} - x_{ij}^{(t)})$$
(3)

$$f(\vec{x}) = \sum_{i=1}^{N} (x_i^2 - 10\cos(2\pi x_i) + 10)$$
 (7)



Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

**Simulation results** 

Conclusions

Future works

### **Simulation Results**

For each benchmark problem the HPSO 64 bits were simulated using different initial positions conditions.

The same initial position for each single run was used in a Matlab® •Sphere implementation (64 bits) and the Quadric best fitness results were compared. •Rastrigin •Rosenbrock initial values individual best evaluation swarm x<sub>p1</sub> particle 1 ≪ p1|RAM x<sub>p2</sub> particle 2 10 different  $\leq_{\rm pS}$ RAM particle S initial positions minimum best result global best

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works



# HPSO implementation: FPGA Xilinx at **50MHz**

Matlab<sup>®</sup> implementation: Intel Core Duo at **1.6 GHz**, 1GB of RAM, Windows XP OS.







#### **Elapsed time per iteration**

| Case problem   | Hardware | Software  | Times faster |
|----------------|----------|-----------|--------------|
| PSO-sphere     | 0.94 us  | 120.20 us | 127          |
| PSO-quadric    | 0.98 us  | 127.03 us | 129          |
| PSO-rosenbrock | 1.10 us  | 105.74 us | 96           |
| PSO-rastrigin  | 1.64 us  | 128.40 us | 78           |

HPSO achieves similar fitness values than software implementations; however, at the worst case, the HPSO is around 78 times faster than Matlab<sup>®</sup> implementation.

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

**Simulation results** 

Conclusions

Future works

### **Conclusions**

- A hardware architecture of the PSO algorithm (HPSO) have been implemented using a floating-point arithmetic.
- The floating-point arithmetic allows the algorithm to work with very large and small real numbers with a high precision.
- The area cost has a significant increase with the number of particles. The number of dimensions increase the elapsed time and has a minor effect in the area cost.
- The HPSO achieves similar fitness values than a software implementation. The main difference is due to the random number generator technique.
- •The best improvement case is for the Quadric function (127 times faster per iteration). The worst improvement case is for Rastrigin problem (around 78 times faster per iteration).

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

**Conclusions** 

Future works

### **Future works**

- Scalability study for different number of particles, bitwidth representations and number of dimensions.
- Application to different engineering problems: Robotics, Control and Automation, System Identification.
- A hardware implementation of an adaptive PSO algorithm could improve the performance, guarantying convergence.
- In order to accelerate the design process, a PSO VHDL CODE GENERATOR is a suitable implementation.

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

**Future works** 

### References

- J. Schuttle, J. Reinblot, B. Fregly, R. Haftka and A. George, "Parallel global optimization with the particle swarm algorithm," Int. J. Numer. Methods Eng. Vol. 6, No. 13, pp. 2296-2315, 2004.
- P. Reynolds, R. Duren, M. Trumbo, and R. Marks, "FPGA implementation of particle swarm optimization for inversion of large neural networks," Proc. IEEE Swarm Intelligence Symp., 2005, pp. 389-392.
- P. Palangpour, G. Venayagamoorthy and S. Smith, "Particle Swarm Optimization: A Hardware Implementation", in International Conference on Computer Design, 2009, pp. 134-139.
- D. Sanchez, D. Muñoz, C. Llanos, M. Ayala-Rincón, "Parameterizable Floating-point Library for Arithmetic Operations in FPGAs", in Proc. ACM Inter. Symp. on Int. Circuits and Syst. Design, 2009, pp. 253-258.
- D. Muñoz, D. Sanchez, C. Llanos, M. Ayala-Rincón,. "Tradeoff of FPGA Design of Floating-point Transcendental Functions", in Proc. IEEE Int. Conf. on Very Large Scale Integration, 2009, pp. 1-4.

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

# Acknowledgments



The authors gratefully acknowledge the support from Conselho Nacional de Desenvolvimento Científico e Tecnológico /PNM.



We acknowledge the financial support of this presentation from Fundação de Empreendimentos Científicos e Tecnológicos.

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

## **Questions?**

# Hardware Architecture for PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works





Division: (a)Goldschmidt (b) Newton-Rapson



Square root: (a)Goldschmidt (b) Newton-Rapson

Hardware
Architecture for
PSO using Floatingpoint Arithmetic

Summary

Introduction

The PSO operation

Hardware Implementations

Synthesis results

Simulation results

Conclusions

Future works

# FPGA leap from microcontrollers

Presented by: Arnold Bett Kenya

# Institution affiliated to

- University of Nairobi
- Physics department

# Type of institution

Educational

# Institution goals/ projects

- Teach students
- Provide instrumentation services to the country

# Specialization

- Analogue and digital circuit
- Micro-controller applications programming
- Microcontroller development kit production
- Circuit board design and production
- Scientific instruments development

# challenges

- As a country
  - ☐ Level of digital electronics at a low level
  - □ There are few specialists in FPGA/ microcontrollers
  - □ Luck of funding by government and institution
- As an institution
  - curriculum change takes time to be change
  - □ Analogue electronics is preferred to be taught
  - ☐ Few lecturers that teach embedded programming

# Physics department programs

- Develop microprocessor lab
- Build capacity in the following areas:
  - Multi layer circuit boards
  - Microcontroller
  - Digital signal processors
  - FPGA

# Projects directed towards FPGA

Fuel pump Dispenser with embedded distributed system

Details.....

- -Dispenser controller board
- -Memory for capturing and reading data
- -RF tag for car identification
- -Magnetic card reader or smart card reader
- -Ethernet link. Wired and wireless

# Prototype board characteristics

- Microcontroller
  - Z8 encore xp 64k
  - 80 pins
- 4 x 4 Matrix Keyboard
- Led display
- 4 seven segment display
- rs-232 communication
- I2C interface



## **Final Product**

- Cost efficient
  - Very few external parts
  - Minimal hardware material use
- Easy to implement
  - Plug and play if possible
- Distributed system
  - Control
  - Information read and write

## How to achieve better results

- Outsource circuit board production
- Use of FPGA
- Reduce external components

# Prototype photos









### The Big jump

• With the realization of the good side of life in the FPGA field its time to make the big jump

### What I feel about Actel Fusion



It Matters!.....

### Something to think about





Advanced Training Course on FPGA Design and VHDL for Hardware Simulation and Synthesis





### M. V. LOMONOSOV MOSCOW STATE UNIVERSITY

### **FACULTY OF PHYSICS**

### Methodology of FPGA Design

Lock-In Amplifier

Yuri Rumyantsev yarumyantsev@gmail.com

■ What is it?

Implementation







$$A(t)\sin(\omega t + \varphi) * \sin(\omega t) = \frac{A(t)}{2} [\cos\varphi - \cos(2\omega t + \varphi)]$$

$$A(t)\sin(\omega t + \varphi) * \cos(\omega t) = \frac{A(t)}{2} [\sin \varphi + \sin(2\omega t + \varphi)]$$

$$\langle A(t)\sin(\omega t + \varphi) * \sin(\omega t) \rangle = \frac{A(t)}{2}\cos\varphi = \alpha$$
$$\langle A(t)\sin(\omega t + \varphi) * \cos(\omega t) \rangle = \frac{A(t)}{2}\sin\varphi = \beta$$

$$A(t) = 2\sqrt{\alpha^2 + \beta^2}$$

$$\varphi = arctg(\frac{\beta}{\alpha})$$

### Lock-In Amplifier: Implementation





### FPGA-based Single-Chip GPS Navigation System using SDR and SOC

By Ahmad Sadiq Nigeria Communications Satellite Limited – Research and Development Department

NIGCOMSAT

### Introduction

- The aim of this presentation is to propose an FPGA-based design of GPS Navigation system
- This design proposes a single chip solution to handle all the GPS receiver functions as well as a complete SOC design with LCD display interface capable of running the Map software



### Table of Content

- Introduction
- About GPS
- A Look at Existing Systems
- Introducing the base technologies for this design: FPGA, SDR, SOC
- Preliminary (High-level)Design
- Design Considerations
- Summary
- References



## GLOBAL DISCOVER ABOUT GPS





### What is GPS

- GPS global positioning system
- A system of 30 active satellites orbiting the earth
- The satellites transmits signals which enable the exact location of a GPS receiver on the surface of the earth to be determined (or the atmosphere; up to LEO)



### The setup of GPS system

- The GPS system can be divided into three basic segments:
- Space segment (satellites)
- Control segment (control stations)
- User segment (GPS receiver)



### The GPS Receiver

- In order for a GPS receiver to work it must perform 4 tasks:
- 1.Find GPS signals (frequency, code phase)
- 2.Track/Demodulate the message from each GPS satellite (at the same time)
- 3. Calculate the position based on distances to the satellites
- 4. Calculate the correction to your local clock



### Trilateration





### A LOOK AT EXISTING SYSTEMS



### Typical Navigation System

 Existing systems typically uses at least two chips the GPS chip and a SOC for user applications e.g. the Map software.







### Typical Navigation System





### In contrast: our Proposed System





### **BASE TECHNOLOGIES**



### **FPGA**

- Field Programmable Gate Array is the most advanced programmable logic device
- Called the Poor man's ASIC because of the absence of NRE cost
- Advantages over ASIC includes reconfigurability, i.e. the device hardware can be reconfigured to implement new functionality without requiring any manufacturing process

NigComSat

### FPGAs: Hottest new features

- Processor cores inside the chip with computation clocks up to 500 MHz and above, and
- lower core voltages to keep power and heat down
- Over 1000 dedicated hardware multipliers
- Memory densities approaching 15 million bits
- Silicon geometries to 40 nanometres
- Configurable logic and I/O interface standards

### SDR

Software-Defined Radio (SDR) is a radio communication system where components that have typically been implemented in hardware (e.g. mixers, filters, modulators/ demodulators, detectors. etc.) are instead defined using software on a personal computer and then implemented on an FPGA.



# NFILI GLOBAL

### Software Defined GPS receiver

- A GPS receiver can be implemented as a software defined radio
- Students at the National Taipei university of Technology implemented a simplified four-channel GPS Digital Satellite Signal (GPS-DSS) baseband system including design of C/A code generator, navigation message processing and BPSK baseband modulation, utilizing only 10% resource of Xilinx Xc2s50e FPGA.

NIGCOMSAT

# GLOBAL

### Software Defined GPS receiver

 GPS receiver development based on SDR has advantages over conventional GPS receiver because the software GPS receiver can be upgraded to process future satellite signals such as Galileo, GPS L5 and Glonass without extra hardware.



### SOC

- With the increase of system resources available on latest generation FPGA, the System-on-Chip paradigm can be borrowed from classical silicon implementations into reconfigurable environments.
- It is now practical to ship volume embedded systems in a single FPGA
- the FPGA implements all of the system logic including a processor core

### PRELIMINARY DESIGN



### Preliminary Requirement

- 12 channel GPS receiver function
- Embedded System (SOC) with LCD Controller IP
- GUI Library including Map/Navigation Functions



### **Block Diagram**



**NIGCOMSAT** 

## DESIGN CONSIDERATIONS



### Challenges

 Because the received GPS signals are at such low levels this presents some challenges. One of the main considerations is to avoid contamination of the incoming signals with interference that can be generated from the digital electronics when using an FPGA.



### Fine tuning the FPGA to minimize interference

- Reduction of the slew rate of I/O pins.
- using the floor plan editor to optimise the FPGA routing after design was compiled and fitted into the device.
- For the clock: using a Low Voltage Differential Signalling (LVDS) (sine wave)<sup>4</sup>



### PCB Layout

- Copper Flooding
- Component placement to give maximum isolation for the RF components
- RF Shielding
- Bypass Capacitor and FPGA grounding





### Summary

- We believe that current FPGs technology is able to handle the design presented here
- The high-level integration offered by this design is the trend of the future as FPGAs are becoming more widely used in communications and mobile applications
- Because FPGA-based hardware design is reconfigurable, future technologies like L5 or Galileo can be incorporated easily
- Receiving algorithm can be developed for special purposes like military applications



# Type References



### References

- Sung-Chun Bu, et al (December 2004), A FPGAbased software GPS receiver design using Simulink
- 2. Trong-Yen Lee, et al, A Low-Cost GPS Satellite Signal Baseband System Using FPGA Prototyping
- Zedong Nie, Kangling Fang, Xu Xin (2006) A
   Portable Positioning System Based on SOPC
   Technology
- 4. Parkinson et al. (2006): **FPGA based GPS receiver** design considerations
- 5. Jun Xu et al., Implementation of FPGA-Based Acquisition of Weak GPS Signals

NIGCOMSAT

Government OF HIGH EDUCATION

REPUBLIC OF CAMEROON

DSCHANG UNIVERSITY

-----

**FACULTY OF SCIENCES** 

**DEPARTEMENT OF PHYSICS** 





### DESIGN AND SIMULATION OF AN ULTRA MODERN SECURISATION SYSTEM TELLY FOLLOWED OF A PUBLIC BUILDING:

CASE OF THE DSCHANG UNIVERSITY 'S RECTORATE

<u>BY</u>:

TCHAPGA TCHITO CHRISTIAN

In intention to obtain a MASTER IN PHYSICS

**Option: ELECTRONICS** 

**UNDER THE SUPERVISION OF:** 

**UNDER THE SUPERVISION OF:** 

PR. ANACLET FOMETHE

DR. ROBERT TCHITNGA

ACADEMIC YEAR: 2008/2009

### ABSTRACT

- We have designed and implemented for our master thesis in physics, a high-performance security system manager for a public building. Case of the of Dschang University 's Rectorate.
- This system will enable control of electrical equipments, from a position of mobile phone via SMS, from a position compatible mobile phone via the internet, a computer connected to the Internet, from a any position in the local network, from the machine on which it is connected, the status of all connected devices that form the safety and comfort of the cockpit..



# PART 1: USED LANGUAGES AND LOGIC CONSTITUTION

### CONCEPT OF CLIENT-SERVER TECHNOLOGY

- II.1 client
- II.2 server
- III The PHP language



### THE LANGUAGE XHTML AND CSS



### THE WML LANGUAGE



### THE GSM



Présenté par TCHAPGA TCHITO CHRISTIAN

### THE JAVA LANGUAGE





Présenté par TCHAPGA TCHITO CHRISTIAN

## PART 2 : ELECTRONIC CONSTITUTION



### ELECTRONIC CONTROL SYSTEM IMPLEMENTATION



### WHY MICROCONTROLLER?



#### CCS



### PIC PROGRAMMATOR



### ANALOGIC DIGITAL CONVERTER



### **RS232 COMMUNICATION**



### **INFRA-RED MODULE**



## DOORS AND WINDOWS CONTROL



### POWER INTERFACE



#### SOME RESULTS





Advanced Training Course on FPGA Design and VHDL for Hardware Simulation and Synthesis

#### COST ESTIMATE

- Electrical equipement: 90 250 XAF
- Computer devices: 1 050 000 XAF

Total: 1 140 250 XAF

1Euro = 655.94 XAF

## SCOPE OF PROJECT AND PROSPECTS FOR IMPROVEMENT

- Change sequential execution to data flow organisation
- Find a way to reduce components around command unit
- Increase I/O number (digital;analog...)

#### END

## TANK YOUFOR YOUR KINDLY ATTENTION