Accelerate AI Applications Using VITIS AI on Xilinx ZynqMP UltraScale+ FPGA

By Vaibhav Kothari

Associate Principal Engineer

Softnautics

April 13, 2021

Vitis is a unified software platform for developing both software and hardware, using Vivado and other components, for Xilinx platforms such as the ZynqMP UltraScale+ SoC and Alveo accelerator cards. The key component of the Vitis SDK, the Vitis AI Runtime (VART), provides a unified interface for deploying end ML/AI applications on edge and cloud.

Machine learning inference is computation-intensive and requires high memory bandwidth and high-performance compute to meet the low-latency and high-throughput requirements of various end applications.

Vitis AI Workflow

Xilinx Vitis AI provides a workflow to deploy deep learning inference applications on the Xilinx Deep Learning Processor Unit (DPU) using a simple process:

(Image source: https://www.xilinx.com/support/documentation/sw_manuals/vitis_ai/1_0/ug1414-vitis-ai.pdf)

The DPU is a configurable computation engine optimized for convolutional neural networks, placed in the programmable logic (PL) for deep learning inference applications. It is built from efficient, scalable IP cores that can be customized to meet the needs of many different applications. The DPU defines its own instruction set, and the Vitis AI compiler generates these instructions.

The Vitis AI compiler also schedules the instructions in an optimized manner to get the maximum performance possible.

A typical workflow to run an AI application on the Xilinx ZynqMP UltraScale+ SoC platform comprises the following steps:

  1. Model Quantization

  2. Model Compilation

  3. Model Optimization (Optional)

  4. Build DPU executable

  5. Build software application

  6. Integrate VITIS AI Unified APIs

  7. Compile and link the hybrid DPU application

  8. Deploy the hybrid DPU executable on FPGA

AI Quantizer

The AI Quantizer is a compression tool that converts 32-bit floating-point weights and activations to fixed-point INT8. It reduces computational complexity without significant loss of model accuracy. The fixed-point model needs less memory bandwidth, thus providing faster execution and higher power efficiency than the floating-point implementation.

(Source: Xilinx)
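
As a concrete illustration, the TensorFlow flow of the quantizer (vai_q_tensorflow) calibrates the model on a small set of unlabeled sample images supplied through a user-written Python input function. The sketch below shows a minimal version of such a function; the image directory, batch size, input node name ("input_1"), and image size are placeholders for illustration, not values from the original project.

# Minimal calibration input_fn sketch for the Vitis AI TensorFlow quantizer
# (vai_q_tensorflow). Directory, batch size, node name, and image size are
# placeholders -- adjust them to the actual model being quantized.
import os
import cv2
import numpy as np

CALIB_DIR = "calib_images"   # folder of representative, unlabeled images
BATCH_SIZE = 10
IMAGES = sorted(os.listdir(CALIB_DIR))

def calib_input(iter_num):
    """Called once per calibration iteration; returns one input batch."""
    batch = []
    for i in range(BATCH_SIZE):
        idx = (iter_num * BATCH_SIZE + i) % len(IMAGES)
        img = cv2.imread(os.path.join(CALIB_DIR, IMAGES[idx]))
        img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
        batch.append(img)
    # The key must match the name of the graph's input placeholder.
    return {"input_1": np.array(batch)}

The function is referenced from the quantizer command line (via the --input_fn and --calib_iter options of the Vitis AI 1.x TensorFlow quantizer), and the quantizer emits a quantized model ready for the AI Compiler.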

AI Compiler

The AI Compiler maps the network model to a highly efficient instruction set and data flow. The compiler's input is the quantized INT8 neural network, and its output is the DPU kernel, the executable that will run on the DPU. Layers not supported by the DPU need to be deployed on the CPU, or the model can be customized to replace or remove those unsupported operations. The compiler also performs sophisticated optimizations such as layer fusion, instruction scheduling, and reuse of on-chip memory.

Once we have the DPU executable, we use the Vitis AI unified APIs to initialize the data structures, initialize the DPU, implement on the CPU the layers not supported by the DPU, and add pre-processing and post-processing on the PL/PS as needed.

(Source: Xilinx)
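
With the unified xmodel flow of recent Vitis AI releases, the compiled model can be inspected to see how it was partitioned. The short sketch below, assuming the XIR Python bindings shipped with Vitis AI are installed and using the placeholder file name "model.xmodel", lists which subgraphs were mapped to the DPU and which fall back to the CPU:

# Sketch: list DPU vs. CPU subgraphs in a compiled model using the XIR
# Python bindings from Vitis AI. "model.xmodel" is a placeholder file name.
import xir

graph = xir.Graph.deserialize("model.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()

for sg in subgraphs:
    device = sg.get_attr("device") if sg.has_attr("device") else "unknown"
    print(f"{sg.get_name():40s} -> {device}")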

AI Optimizer

With its model compression technology, the AI Optimizer can reduce model complexity by 5x to 50x with minimal impact on accuracy. This deep compression takes inference performance to the next level: we can achieve a desired sparsity and reduce runtime by 2.5x.

(Source: Xilinx)
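
The AI Optimizer itself is a separately licensed tool, so the sketch below does not use it; it only illustrates the underlying idea of driving a model toward a target sparsity through magnitude pruning, here with the open-source TensorFlow Model Optimization Toolkit and a toy Keras model.

# Illustrative only: this is NOT the Xilinx AI Optimizer, just the general
# idea of magnitude pruning using the TensorFlow Model Optimization Toolkit.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Gradually drive 50% of the weights to zero over 1000 training steps.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(base, pruning_schedule=schedule)

pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# During fit(), the UpdatePruningStep callback applies the sparsity schedule:
# pruned.fit(x, y, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before export; the zeroed weights remain.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)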

AI Profiler

The AI Profiler helps profile inference to find the operations causing bottlenecks in the end-to-end pipeline. The profiler gives the designer a common timeline for DPU, CPU, and memory activity. This process requires no code changes; it can trace functions and profile them as they run.

(Source: Xilinx)

AI Runtime

The Vitis AI Runtime (VART) allows applications to use unified high-level runtime APIs for both edge and cloud deployments, making deployment seamless and efficient. Some of the key features include the following (a minimal usage sketch follows the list):

  • Asynchronous job submission

  • Asynchronous job collection

  • C++ and Python implementations

  • Multi-threading and multi-process execution
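
Below is a minimal sketch of the asynchronous submit/collect pattern, assuming the vart and xir Python packages from the Vitis AI runtime are available on the target and that the compiled model (placeholder name "model.xmodel") contains a single DPU subgraph.

# Sketch of VART's asynchronous job submission and collection on the edge.
# File name, shapes, and dtype are assumptions for illustration.
import numpy as np
import vart
import xir

graph = xir.Graph.deserialize("model.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_sg = [s for s in subgraphs
          if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]

runner = vart.Runner.create_runner(dpu_sg, "run")

in_tensor = runner.get_input_tensors()[0]
out_tensor = runner.get_output_tensors()[0]

# Allocate host buffers with the shapes the DPU expects.
input_data = [np.zeros(tuple(in_tensor.dims), dtype=np.float32, order="C")]
output_data = [np.zeros(tuple(out_tensor.dims), dtype=np.float32, order="C")]

# ... fill input_data[0] with a pre-processed frame here ...

# Asynchronous job submission, then collection via wait().
job_id = runner.execute_async(input_data, output_data)
runner.wait(job_id)

print("raw DPU output shape:", output_data[0].shape)

Creating several runners on the same subgraph, each fed from its own thread, is the usual way to exploit the multi-threading support listed above.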

Vitis AI also offers utilities such as DSight, DExplorer, DDump, and DLet for various tasks.

DSight & DExplorer

The DPU IP offers a number of configurations and core counts to choose from according to the network model. DSight reports the percentage utilization of each DPU core. It also shows the efficiency of the scheduler so that user threads can be tuned. One can also see performance numbers such as MOPS, runtime, and memory bandwidth for each layer and each DPU node.

Softnautics chose the Xilinx ZynqMP UltraScale+ platform for high-performance compute deployments. It combines strong application processing, highly configurable FPGA acceleration capabilities, and the Vitis SDK to accelerate high-performance ML/AI inferencing. One such application we targeted was face-mask detection for Covid-19 screening. The intention was to deploy multi-stream inferencing that screens people for masks and identifies non-compliance in real time, in line with government guidelines on Covid-19 precautions.

We prepared a dataset and selected pre-trained weights to design a model for mask detection and screening. We trained and pruned our custom models with the TensorFlow framework, as a two-stage deployment of face detection followed by mask detection. The trained model was then passed through the Vitis AI workflow covered in the earlier sections. We observed a 10x speed-up in inference time compared to the CPU.

Xilinx provides different debugging tools and utilities that are very helpful during initial development and deployment. During our initial deployment stage, we were not getting detections for the mask and no-mask categories. To debug this, we used one of the debug utilities, DExplorer, in debug mode to compare the PC-based inference output with the output on the target, and root-caused the issue. By re-running the quantizer with more calibration images and iterations, we could tune the output and reached approximately 96% detection accuracy on the video feed. We also identified bottlenecks in the pipeline using the AI Profiler and then took corrective actions to remove them by various means, such as using HLS acceleration for the compute-heavy post-processing.

Vaibhav is an Associate Principal Engineer at Softnautics and has been a key member of various embedded software projects across different domains for over nine years, the last 3-4 years of which have been in the machine learning and deep learning field. He has worked on audio/video/wireless domain solutions for chipsets from Audience (now Knowles), Xilinx, Lattice Semiconductor, Microsemi, and others. He is passionate about enabling practical, real-world AI solutions on embedded edge platforms and various FPGA devices.
