Zynq Architecture, PS (ARM) and PL

Joint ICTP-IAEA School on Hybrid Reconfigurable Devices for Scientific Instrumentation

Trieste, 1-5 June 2015
Contents

- Zynq All Programmable SoC
- Processing System (PS)
  - Application Processing Unit
  - Processor Peripherals
  - PS ↔ PL interconnections
  - Clocks and resets
- Programmable Logic (PL)
- Booting and PL configuration
- References
Zynq-7000 Silicon Devices

- Driver Assistance
- Consumer Equipment
- Factory Automation
- Broadcast Camera
- Military Radios
- Medical Imaging and Networking
- Wired Communications
- Wireless Communications
- AVB Routers, Switches, Encoders

ARM® Dual Core Cortex®-A9 MPCore with Peripherals

<table>
<thead>
<tr>
<th>Artix-7 Fabric</th>
<th>Kintex-7 Fabric</th>
</tr>
</thead>
<tbody>
<tr>
<td>28k - 85k LC FPGA Fabric</td>
<td>125k - 444k LC FPGA Fabric</td>
</tr>
<tr>
<td>80 - 220 DSP Slices</td>
<td>400 - 2,020 DSP Slices</td>
</tr>
<tr>
<td>High Reliability I/Os</td>
<td>High Reliability and High Performance I/Os</td>
</tr>
<tr>
<td>6.25Gb/s Transceivers</td>
<td>12.5Gb/s Transceivers</td>
</tr>
</tbody>
</table>
# Zynq-7000 Device Family

<table>
<thead>
<tr>
<th></th>
<th>Z-7010</th>
<th>Z-7015</th>
<th>Z-7020</th>
<th>Z-7030</th>
<th>Z-7035</th>
<th>Z-7045</th>
<th>Z-7100</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor core</td>
<td colspan="7">Dual ARM Cortex-A9 MPCore</td>
</tr>
<tr>
<td>Processor extensions</td>
<td colspan="7">NEON</td>
</tr>
<tr>
<td>L1 Cache</td>
<td colspan="7">Separate 32KB instruction and data caches</td>
</tr>
<tr>
<td>L2 Cache</td>
<td colspan="7">512KB</td>
</tr>
<tr>
<td>On-Chip Memory</td>
<td colspan="7">256KB</td>
</tr>
<tr>
<td>Memory Interfaces</td>
<td colspan="7">DDR3, DDR3L, DDR2, LPDDR2, 2x Quad-SPI, NAND, NOR</td>
</tr>
<tr>
<td>Peripherals</td>
<td colspan="7">2x USB 2.0 (OTG), 2x Tri-mode Gigabit Ethernet, 2x SD/SDIO</td>
</tr>
<tr>
<td>Logic Cells</td>
<td>28K</td>
<td>74K</td>
<td>85K</td>
<td>125K</td>
<td>275K</td>
<td>350K</td>
<td>444K</td>
</tr>
<tr>
<td>Block RAM</td>
<td>240KB</td>
<td>380KB</td>
<td>560KB</td>
<td>1060KB</td>
<td>2000KB</td>
<td>2180KB</td>
<td>3020KB</td>
</tr>
<tr>
<td>DSP Slices</td>
<td>80</td>
<td>160</td>
<td>220</td>
<td>400</td>
<td>900</td>
<td>900</td>
<td>2020</td>
</tr>
<tr>
<td>Transceiver Count</td>
<td>None</td>
<td>4 (6.25 Gb/s)</td>
<td>None</td>
<td>Up to 4 (12.5 Gb/s)</td>
<td>Up to 16 (12.5 Gb/s)</td>
<td>Up to 16 (12.5 Gb/s)</td>
<td>Up to 16 (12.5 Gb/s)</td>
</tr>
</tbody>
</table>
Zynq All Programmable SoC

• Architecture: The Zynq-7000 AP SoC contains two parts
  - Processing System (PS)
    - Hard processors (ARM Cortex-A9) and peripherals
  - Programmable Logic (PL)
    - Based on 7 series: Kintex and Artix
Zynq All Programmable SoC

- Full Hw/Sw platform

More than just Silicon: A Comprehensive Platform Offering
Processing System (PS)

Block diagram of the PS (labels recovered from the figure):

- Application Processing Unit (APU): processors + RAM
  - ARM Cortex-A9 CPUs
  - 512 KB L2 cache and controller
  - 256 KB SRAM (on-chip memory)
- Interconnections (AXI)
  - Central interconnect
  - OCM interconnect
  - Programmable logic to memory interconnect
  - 64b AXI ACP slave port
  - High-performance AXI 32b/64b slave ports
- Memory
  - On-chip
  - DDR: DDR2/3/LPDDR2 controller
  - Flash: NAND and Quad SPI interfaces, SMC timing calculation
- Peripherals
  - I/O peripherals: SPI 0/1, IIC 0/1, CAN 0/1, UART 0/1, GPIO, SD 0/1, USB 0/1, ENET 0/1
  - I/O MUX (MIO): Bank 0 MIO (15:0), Bank 1 MIO (53:16)
  - Extended MIO (EMIO)
- Other blocks: GIC, DMA channels, DMA sync, DAP, DevC, CoreSight components
- Programmable Logic (PL)
  - XADC
  - Config AES/SHA
Application Processing Unit (APU)

- Dual ARM® Cortex™-A9 MPCore with NEON extensions
  - Up to 1GHz operation
  - 2.5 DMIPS/MHz per core
  - Separate 32KB instruction and data caches
- Snoop control unit
  - L1 cache snoop control
  - Accelerator coherency port
- Level 2 cache and controller
  - Shared 512 KB cache with parity
**PS Interconnect (1)**

- Programmable logic to memory
  - Two ports to DDR
  - One port to OCM SRAM
- Central interconnect
  - Enables other interconnects to communicate
- Peripheral master
  - USB, GigE, and SDIO connect to DDR and PL via the central interconnect
- Peripheral slave
  - CPU, DMA, and PL access to IOP peripherals
PS Interconnect (2)

- **Processing system master**
  - Two ports from the processing system to programmable logic
  - Connects the CPU block to common peripherals through the central interconnect

- **Processing system slave**
  - Two ports from programmable logic to the processing system
### Memory Map

<table>
<thead>
<tr>
<th>Address Range</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFFC_0000 to FFFF_FFFF</td>
<td>OCM</td>
</tr>
<tr>
<td>FD00_0000 to FFFB_FFFF</td>
<td>Reserved</td>
</tr>
<tr>
<td>FC00_0000 to FCFF_FFFF</td>
<td>Quad SPI linear address</td>
</tr>
<tr>
<td>F8F0_3000 to FBFF_FFFF</td>
<td>Reserved</td>
</tr>
<tr>
<td>F890_0000 to F8F0_2FFF</td>
<td>CPU Private registers</td>
</tr>
<tr>
<td>F801_0000 to F88F_FFFF</td>
<td>Reserved</td>
</tr>
<tr>
<td>F800_1000 to F880_FFFF</td>
<td>PS System registers</td>
</tr>
<tr>
<td>F800_0C00 to F800_0FFF</td>
<td>Reserved</td>
</tr>
<tr>
<td>F800_0000 to F800_0BFF</td>
<td>SLCR Registers</td>
</tr>
<tr>
<td>E600_0000 to F7FF_FFFF</td>
<td>Reserved</td>
</tr>
<tr>
<td>E100_0000 to E5FF_FFFF</td>
<td>SMC Memory</td>
</tr>
<tr>
<td>E030_0000 to E0FF_FFFF</td>
<td>Reserved</td>
</tr>
<tr>
<td>E000_0000 to E02F_FFFF</td>
<td>IO Peripherals</td>
</tr>
<tr>
<td>C000_0000 to DFFF_FFFF</td>
<td>Reserved</td>
</tr>
<tr>
<td>8000_0000 to BFFF_FFFF</td>
<td>PL (MAXI_GP1)</td>
</tr>
<tr>
<td>4000_0000 to 7FFF_FFFF</td>
<td>PL (MAXI_GP0)</td>
</tr>
<tr>
<td>0010_0000 to 3FFF_FFFF</td>
<td>DDR (address not filtered by SCU)</td>
</tr>
<tr>
<td>0004_0000 to 000F_FFFF</td>
<td>DDR (address filtered by SCU)</td>
</tr>
<tr>
<td>0000_0000 to 0003_FFFF</td>
<td>OCM</td>
</tr>
</tbody>
</table>

- The Cortex-A9 processor uses 32-bit addressing.
- All PS peripherals and PL peripherals are memory mapped to the Cortex-A9 processor cores.
- All slave PL peripherals will be located between `4000_0000` and `7FFF_FFFF` (connected to GP0) and `8000_0000` and `BFFF_FFFF` (connected to GP1).
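
The two PL address windows in the last bullet can be checked with a few lines of Python. This is only a host-side sketch: the names `PL_WINDOWS` and `pl_port_for` are hypothetical and not part of any Xilinx tool, and the example base address is an arbitrary value inside the GP0 window.

```python
# Host-side sketch: decide which general-purpose master port (GP0 or GP1)
# a 32-bit address falls under, using the two windows quoted above.
# PL_WINDOWS and pl_port_for are illustrative names, not a Xilinx API.

PL_WINDOWS = {
    "GP0": (0x4000_0000, 0x7FFF_FFFF),
    "GP1": (0x8000_0000, 0xBFFF_FFFF),
}


def pl_port_for(addr):
    """Return 'GP0', 'GP1', or None if the address is not in a PL window."""
    if not 0 <= addr <= 0xFFFF_FFFF:
        # The Cortex-A9 uses 32-bit addressing.
        raise ValueError("address does not fit in 32 bits")
    for port, (low, high) in PL_WINDOWS.items():
        if low <= addr <= high:
            return port
    return None


if __name__ == "__main__":
    for addr in (0x43C0_0000, 0x8000_0000, 0xE000_1000):
        print(f"{addr:#010x} -> {pl_port_for(addr)}")
```

Each window spans 1 GB, so a PL peripheral's base address alone tells software which GP port carries its traffic.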
Input/Output Peripherals

- Two USB 2.0 OTG/device/host
- Two tri-mode gigabit Ethernet (10/100/1000)
- Two SD/SDIO interfaces
  - Memory, I/O, and combo cards
- Two CAN 2.0Bs, SPIs, I2Cs, UARTs
- Four GPIO 32-bit blocks
  - 54 available through MIO; other 64 available through EMIO
- Static memories
  - NAND, NOR/SRAM, Quad SPI
- Trace ports
MIO

- Multiplexed output of peripheral and static memories
  - 54 dedicated package pins available
  - Two I/O banks; each selectable: 1.8V, 2.5V, or 3.3V
- Dedicated pins are used
  - User constraints (LOC) should not be present
- Software configurable
  - Automatically added to bootloader by tools
- Not available for all peripheral ports
  - Some ports can only use EMIO
1. The configuration is exported to SDK
2. A Tcl initialization file is generated with the chosen configuration
3. The initialization is included in the bootloader
EMIO

- Port to programmable logic
  - Enables use of the SelectIO™ interface with PS peripherals
- Alternative to using MIO
  - Mandatory for some peripheral ports
- User constraints must be present for the signals brought out to the SelectIO pins
  - BitGen throws an error if LOC constraints are not present
General-Purpose I/O

- GPIO blocks
  - 4 separate banks of 32 GPIO bits
    - 2 connect to the 54 MIO pins (32 bits and 22 bits, respectively)
    - 2 connect to EMIO (64 bits)
  - Each GPIO bit can be dynamically programmed as input or output
  - Reset values independently configurable for each bit
  - Programmable interrupt generation for each bit
    - One interrupt generated per GPIO bank
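
The bank layout above (32 + 22 MIO bits, then 2 × 32 EMIO bits) can be turned into a small pin-to-bank lookup. The flat pin numbering used here (MIO pins first, EMIO pins after) is an assumption made for illustration, and `gpio_bank_and_bit` is a hypothetical helper name.

```python
# Sketch: map a flat GPIO pin number to (bank, bit) for the four banks
# described above. Assumption: pins are numbered MIO first (0..53),
# then EMIO (54..117). The helper name is illustrative only.

BANK_WIDTHS = [32, 22, 32, 32]  # banks 0-1 on MIO, banks 2-3 on EMIO


def gpio_bank_and_bit(pin):
    """Return (bank, bit) for a flat pin number in 0..117."""
    if not 0 <= pin < sum(BANK_WIDTHS):
        raise ValueError(f"pin {pin} is outside 0..{sum(BANK_WIDTHS) - 1}")
    for bank, width in enumerate(BANK_WIDTHS):
        if pin < width:
            return bank, pin
        pin -= width  # skip past this bank and keep looking


if __name__ == "__main__":
    for pin in (0, 31, 32, 53, 54, 117):
        print(pin, gpio_bank_and_bit(pin))
```

Because one interrupt is generated per bank, the bank index returned here also identifies which bank-level interrupt a pin contributes to.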
PS ↔ PL interfaces

- EMIO pins
- 2 AXI general-purpose ports (GP0-GP1)
- 4 AXI high-performance slave ports (HP0-HP3)
- 1 Accelerator coherency port (ACP) AXI slave I/F to CPU memory
- DMA, interrupts and event signals

<table>
<thead>
<tr>
<th>Interface</th>
<th>Features</th>
<th>Similar To</th>
</tr>
</thead>
<tbody>
<tr>
<td>Memory Map/Full</td>
<td>Traditional address/data bus (single address, multiple data)</td>
<td>PLB v46, PCI</td>
</tr>
<tr>
<td>Streaming</td>
<td>Data only, burst</td>
<td>Local Link / FIFO / FSL</td>
</tr>
<tr>
<td>Lite</td>
<td>Traditional address/data bus (single address, single data)</td>
<td>PLB v46 single / OPB</td>
</tr>
</table>

AMBA

- AMBA 3.0 (2003)
- AMBA 4.0 (2010)
  - Enhancements for FPGAs
Basic AXI signaling

- 5 channels
  - Read Address Channel
  - Read Data Channel
  - Write Address Channel
  - Write Data Channel
  - Write Response Channel
AXI4-Lite Interface

- No burst
- Data width 32 or 64 only
  - Xilinx IP only supports 32 bits
- Very small footprint
- Bridging to AXI4 handled automatically by AXI_Interconnect (if needed)
AXI4 Interface

- Sometimes called “Full AXI” or “AXI Memory Mapped”
  - Not ARM-sanctioned names
- Single address multiple data
  - Burst up to 256 data beats
- Data width parameterizable
  - Up to 1024 bits
AXI4-Stream

- No address channel, no read and write, always just master to slave
  - Effectively an AXI4 “write data” channel
- Unlimited burst length
  - AXI4 max 256
  - AXI4-Lite does not burst
- Virtually same signaling as AXI Data Channels
  - Protocol allows merging, packing, width conversion
  - Supports sparse, continuous, aligned, unaligned streams
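
The burst limits quoted above (AXI4: at most 256 beats per burst; AXI4-Lite: no bursts) determine how many address phases a transfer needs. The sketch below only does that arithmetic; it deliberately ignores any other protocol constraint (such as address-boundary rules), and the function names are hypothetical.

```python
# Sketch: count address phases needed to move a block of data, using the
# burst limits from the slides. Other AXI rules are intentionally ignored.

AXI4_MAX_BEATS = 256  # maximum data beats per AXI4 burst


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def axi4_bursts_needed(total_bytes, data_width_bits, beats_per_burst=AXI4_MAX_BEATS):
    """Number of AXI4 bursts (one address each) to move total_bytes."""
    if not 1 <= beats_per_burst <= AXI4_MAX_BEATS:
        raise ValueError("AXI4 bursts carry 1 to 256 beats")
    beats = ceil_div(total_bytes, data_width_bits // 8)
    return ceil_div(beats, beats_per_burst)


def axi4_lite_transfers(total_bytes, data_width_bits=32):
    """AXI4-Lite does not burst: one address per data word."""
    return ceil_div(total_bytes, data_width_bits // 8)


if __name__ == "__main__":
    print(axi4_bursts_needed(4096, 64))  # 512 beats split into 256-beat bursts
    print(axi4_lite_transfers(4096))     # one transfer per 32-bit word
```

For the same 4 KB block, the burst-capable interface issues far fewer addresses, which is why AXI4-Lite is suited to register access rather than bulk data.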
AXI General-Purpose Ports

- 2 masters from PS → PL
  - Cortex A9 processors (via L2 cache controller)
  - USB, Ethernet and SD controllers
  - DMAC
  - Debug access port
- 32-bit data width
- 2 ports for higher bandwidth

- 2 slaves from PL → PS
  - PL masters:
    - MicroBlaze
    - User IP
    - Third-party IP
  - PS slaves:
    - DDR / OCM
    - Peripherals
    - Device configuration controller
    - Debug access port
High-Performance Slave Ports

- 4 64-bit/32-bit FIFO-based AXI slave interfaces (AFI)
  - 1KB data FIFOs
- Asynchronous communication between PL/PS clock domains
- QoS supported from the programmable logic ports
- Low latency access to DDR & OCM
**AXI ACP Interface**

- 64-bit AXI slave port from PL → PS
- Direct connection to the SCU
- Cache coherent with L1 & L2 caches
- Tightly coupled co-processor
  - Performance relies on cache hits
  - Sw program coordinated with coprocessor
  - Cache miss → slower access to memory than using HPx slave ports
### AXI bandwidths

#### Raw Performance – Single channel, Single direction

<table>
<thead>
<tr>
<th>Type</th>
<th>Max Bandwidth</th>
<th>Connects To</th>
</tr>
</thead>
<tbody>
<tr>
<td>M/S AXI GPx (32 bits)</td>
<td>600 MB/s</td>
<td>Masters to PL<br>Slaves from PL to internal resources of the PS (memory, peripherals, …)</td>
</tr>
<tr>
<td>S AXI HPx (64 bits)</td>
<td>1200 MB/s</td>
<td>DDRx controller, OCM-RAM</td>
</tr>
<tr>
<td>S AXI ACP (64 bits)</td>
<td>1200 MB/s</td>
<td>SCU (L1, L2 caches, DDRx indirectly)</td>
</tr>
</tbody>
</table>

#### Compare to Peripheral Bandwidth

<table>
<thead>
<tr>
<th>Type</th>
<th>Max Bandwidth</th>
<th>Type</th>
<th>Max Bandwidth</th>
</tr>
</thead>
<tbody>
<tr>
<td>DDR controller (32 bits)</td>
<td>4264 MB/s</td>
<td>SD (4 bits)</td>
<td>25 MB/s</td>
</tr>
<tr>
<td>OCM (64 bits)</td>
<td>1779 MB/s</td>
<td>USB (8 bits)</td>
<td>60 MB/s</td>
</tr>
</tbody>
</table>
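
The raw figures in these tables are consistent with a simple width × rate product. The sketch below reproduces several of them; the transfer rates (150 MHz for the AXI ports, 1066 MT/s for DDR, 50 MHz for SD, 60 MHz for USB) are assumptions chosen because they match the table, not values stated in the slides.

```python
# Sketch: peak bandwidth = bus width in bytes x transfers per second.
# All rates below are assumed values that happen to reproduce the table.


def peak_mb_per_s(width_bits, mega_transfers_per_s):
    """Peak bandwidth in MB/s for one channel in one direction."""
    return width_bits / 8 * mega_transfers_per_s


ASSUMED_RATES = {
    # name: (width in bits, assumed mega-transfers per second)
    "AXI GPx": (32, 150),
    "AXI HPx": (64, 150),
    "AXI ACP": (64, 150),
    "DDR controller": (32, 1066),
    "SD": (4, 50),
    "USB": (8, 60),
}

if __name__ == "__main__":
    for name, (width, rate) in ASSUMED_RATES.items():
        print(f"{name:15s} {peak_mb_per_s(width, rate):7.0f} MB/s")
```

These are upper bounds for a single channel in a single direction; sustained throughput depends on burst length, arbitration, and memory behaviour.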
PL Clocking sources

- **PS Clocks**
  - External source
  - 3 PLLs
  - 4 sources to PL

- **PL Clocks**
  - External source
  - 4 PS clocks
  - Can't source PS

![Diagram of PL Clocking sources](image)
Clock generation (Vivado)

- PLL configuration for PS & PL
  - One input reference clock
- GUI clock configuration
- PS peripheral clock in Zynq Tab
  - Dedicated PLL clock for PS
  - PS I/O peripherals use the I/O PLL clock and ARM PLL
- Advanced clocking configuration
Zynq Resets

• Internal resets
  – Power-on reset (POR)
  – Watchdog resets from the three watchdog timers
  – Secure violation reset

• PS resets
  – External reset: PS_SRST_B
  – Warm reset: SRSTB

• PL resets
  – Four reset outputs from PS to PL
  – FCLK_RESET[3:0]
Zynq PL

- Logic fabric
  - Configurable Logic Blocks (CLBs), built from slices
  - Switch matrices
  - Programmable interconnects
- Input / Output Blocks (IOBs)
Zynq PL (2)

- **Block RAMs**
  - For RAM / ROM / FIFO
  - 36 Kb each
    - Configurable as 2 18Kb
  - Different word sizes
    - 2048 * 18b, 4096 * 9b, ...
  - Can be combined into larger ones
  - Distributed RAM in LUTs as alternative

- **DSP48E1s**
  - Low power
  - Can be combined into larger ones
  - Configurable using OPCODES
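
The Block RAM word sizes listed above all follow from the same 36 Kb capacity (2048 × 18 = 4096 × 9 = 36,864 bits). The sketch below shows that arithmetic and a naive estimate of how many blocks a larger memory needs when blocks are tiled in depth and width. The helper names are hypothetical, and synthesis tools may pack memories differently.

```python
# Sketch: depth/width trade-off of one 36 Kb block RAM, plus a naive
# tiling estimate for larger memories. Not a model of real tool behaviour.

BRAM_BITS = 36 * 1024  # 36 Kb per block


def depth_for_width(width_bits, total_bits=BRAM_BITS):
    """Number of words one block holds at the given word width."""
    if total_bits % width_bits:
        raise ValueError("width does not divide the block capacity")
    return total_bits // width_bits


def blocks_needed(depth, width_bits, block_depth=2048, block_width=18):
    """Blocks required when tiling block_depth x block_width primitives."""
    rows = -(-depth // block_depth)       # ceiling division
    cols = -(-width_bits // block_width)  # ceiling division
    return rows * cols


if __name__ == "__main__":
    print(depth_for_width(18))      # 2048 words of 18 bits
    print(depth_for_width(9))       # 4096 words of 9 bits
    print(blocks_needed(8192, 36))  # 4 rows x 2 columns of 2048 x 18 blocks
```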
Zynq PL (3)

- General Purpose I/Os
  - Referred to as SelectIO resources
  - Banks of 50 IOBs each
    - Single-ended & differential supported
    - High Performance (HP):
      - 1.8 V high-speed interfaces to memory & other chips
    - High Range (HR)
      - 3.3V & support for wider variety of IO standards

- GTX transceivers
  - Dedicated “Hard IP” blocks
  - PCI Express, Serial RapidIO, SCSI, SATA
  - Implemented in groups of 4 channels
    - Dedicated PLL
    - Up to 12.5 Gbps
- Analog to Digital conversion
  - XADC “Hard IP” block
    - 2 separate ADCs
    - 1 Msps each
    - Programmable from the APU (PS)
Zynq Booting

- **What is a boot loader?**
  - First program to run on power up or reset
  - Copies program from non-volatile memory to RAM
  - Loads an application or OS
  - Then transfers control

- **Why needed?**
  - Final Sw system
    - Might not fit into ROM
    - Might require some kind of run-time set up before it is launched
    - Might be determined dynamically

- **Boot loaders tend to range from simple to quite complex systems**
Zynq Booting (2)

- PS boots first
- Multi-stage boot process
  - Stage 0: Runs from ROM; loads from non-volatile memory to OCM
    - Provided by Xilinx; unmodifiable
  - Stage 1: Runs from OCM; loads from non-volatile memory to DDRx memory
    - User developed; Xilinx offers example code through SDK project
    - Initiates PS boot and PL configuration
  - Stage 2: Optional; runs from DDR
    - User developed; Xilinx offers example code – U-Boot
    - Sourced from flash memory or through common peripherals, programmable logic I/O, etc.
    - Programmable logic configuration can be performed in Stage 1 or 2
- Boot source selected via package bootstrapping pins
- Optional secure boot mode allows the loading of encrypted bootloader
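
The multi-stage hand-off described above can be written down as a small table and checked for consistency: each stage must run from the memory the previous stage loaded into. This is a descriptive sketch only; the stage names "BootROM" and "FSBL" and the `BootStage` class are labels chosen for illustration.

```python
# Sketch: the boot stages as data, with a check that each stage runs from
# the memory its predecessor loaded. Names are illustrative labels.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BootStage:
    number: int
    name: str
    runs_from: str
    loads_into: Optional[str]  # None: not specified in the slides
    optional: bool = False


BOOT_CHAIN = [
    BootStage(0, "BootROM", "ROM", "OCM"),
    BootStage(1, "FSBL", "OCM", "DDR"),
    BootStage(2, "U-Boot", "DDR", None, optional=True),
]


def chain_is_consistent(chain):
    """True if every stage runs from where the previous stage loaded."""
    return all(prev.loads_into == nxt.runs_from
               for prev, nxt in zip(chain, chain[1:]))


if __name__ == "__main__":
    for stage in BOOT_CHAIN:
        tag = " (optional)" if stage.optional else ""
        print(f"Stage {stage.number}: {stage.name} runs from {stage.runs_from}{tag}")
    print("consistent:", chain_is_consistent(BOOT_CHAIN))
```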
Boot and Configuration

- Zynq devices can be booted and/or configured in
  - Secure mode via static memories only (JTAG excluded)
    - Ability to have secure software
    - Protects bitstream and IP
  - Non-secure mode via JTAG or static memories (debug and development environment)
    - Standard boot model

- Four master boot devices
  - QSPI: serial memory, linear addressing
  - NAND: complex parallel memory
  - NOR: parallel memory, linear addressing
  - SD: Flash memory card

- Secondary boot devices
  - USB, Ethernet, and most other peripherals
Non-Secure Boot Example

Stage 0
- Power up Zynq-7000 EPP
  - Boot mode pins identify the boot device
  - ROM code runs
  - Copies First Stage Boot Loader (FSBL) to OCM

Stage 1
- First Stage Boot Loader runs from OCM
  - PS boot data is loaded into specified memory
  - Bitstream is loaded and PL is configured (optional)
  - Can enable second-stage boot (optional)

Stage 2
- U-Boot runs from second-stage source
  - Responsible for loading OS kernel
  - Can be from any source

Boot chain: FSBL → U-Boot → OS (kernel & drivers)
Secure Boot Example

Stage 0
- Power up Zynq-7000 EPP
  - Boot mode pins identify the boot device
  - ROM code runs
  - First Stage Boot Loader (FSBL) is decrypted using AES and then copied into OCM

Stage 1
- First Stage Boot Loader runs from OCM
  - PS boot data is decrypted using the AES engine and put into specified memory
  - Bitstream is decrypted using AES and PL is configured

Stage 2
- U-Boot from Xilinx does not support encryption at this point

Boot chain: FSBL → U-Boot → OS (kernel & drivers)
Configuration and Re-Configuration

- The PL is configured via the device configuration (DevC) interface module
- Accessed via a software application using an AXI port in the PS
  - Supported by Xilinx-provided APIs in SDK
  - Recommended methodology
- Separate DMA port into the Central interconnect for simultaneous PL configuration with software download
- Accessed from the PL via a GPx master AXI port
  - Not recommended
Device Configuration (DevC) Interface

- Three main blocks operate independently
  - An AXI-PCAP bridge for interfacing to the PL configuration logic
  - Device security management
  - An XADC interface
- Also contains an APB interface used by the host to configure the three blocks, to access the overall status, and to communicate with the PL XADC
The DevC Interface

• Manages basic device security and provides a simple DMA interface, PS setup, and PL configuration
  – Enables PL configuration through the processor configuration access port (PCAP) in both secure and non-secure master boot, including support for compressed PL bitstreams
  – Supports PL configuration readback
  – Supports concurrent bitstream download/upload
  – Enforces Zynq-7000 device system-level security including debug security
  – Supports XADC serial interface
  – Supports XADC alarm and over-temperature interrupt
  – Secure boot ROM code protection
AXI-PCAP Bridge

- Converts 32-bit AXI formatted data to the 32-bit PCAP protocol and vice versa
- Supports both concurrent and non-concurrent download and upload of configuration data
- The DMA engine moves data between the FIFOs and a memory device, typically the on-chip RAM, the DDR memory, or one of the peripheral memories
- Non-secure data to the PCAP interface can be sent every clock cycle, encrypted data can be sent every four clock cycles
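
The last bullet gives enough to estimate PL configuration time: one 32-bit word per clock for non-secure data, one word every four clocks for encrypted data. In the sketch below, the 100 MHz PCAP clock and the 4 MB bitstream size are assumptions picked for round numbers, not values from the slides.

```python
# Sketch: PCAP throughput and bitstream load time from the rates above.
# The clock frequency and bitstream size used in the demo are assumptions.

WORD_BYTES = 4  # the PCAP interface is 32 bits wide


def pcap_bytes_per_second(clock_hz, encrypted):
    """Peak PCAP throughput: one word per cycle, or per four cycles if encrypted."""
    cycles_per_word = 4 if encrypted else 1
    return clock_hz * WORD_BYTES / cycles_per_word


def load_time_s(bitstream_bytes, clock_hz, encrypted):
    """Lower bound on the time to push a bitstream through the PCAP."""
    return bitstream_bytes / pcap_bytes_per_second(clock_hz, encrypted)


if __name__ == "__main__":
    clock_hz = 100_000_000       # assumed PCAP clock
    bitstream_bytes = 4_000_000  # assumed bitstream size
    for encrypted in (False, True):
        t_ms = load_time_s(bitstream_bytes, clock_hz, encrypted) * 1e3
        print(f"encrypted={encrypted}: {t_ms:.0f} ms")
```

Under these assumptions an encrypted bitstream takes four times as long to load as a non-secure one; DMA and memory bandwidth can add to either figure.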
References

