The Four Stages of Inference Benchmarking

By Geoff Tate

CEO, Co-Founder & Board Member

Flex Logix Inc.

February 17, 2020


This blog discusses how to benchmark inference accelerators to find the one that is the best for your neural network.

Over the last decade, neural networks have gone from interesting research to wide deployment for language translation, keyword recognition, and object recognition.

For a long time, neural networks were limited to data centers, which had the compute resources required to run them: initially microprocessors, and then increasingly GPUs, which have many more of the MACs (multiply-accumulate units) that neural networks require.

Nvidia recently announced that sales of its inference products had surpassed sales of its training products for the first time.

As inference moves to the edge (anywhere outside of the data center), where power and cost budgets are constrained, customers are searching for inference accelerators that can deliver the throughput they need at a price and power they can afford.

This blog discusses how to benchmark inference accelerators to find the one that is the best for your neural network; and how customers commonly evolve their thinking on benchmarking as they come up the learning curve. Neural Network Inference is exciting but complicated, so it is initially very confusing. The lights come on step-by-step as customers work through the issues.

First let’s review the common elements of inference accelerators and the neural networks they run.

Common Elements of All Inference Accelerators

All inference accelerators have in common the following elements:

MACs (lots of them)
On-chip SRAM
Off-chip DRAM
Control logic
On-chip interconnect between all of the units

The number and organization of these elements vary widely between inference accelerators. How the MACs are organized, the ratio of MACs to SRAM and DRAM, and how data flows between them are critical in determining how well the accelerator actually accelerates.

Common Elements of All Neural Network Models

All neural networks contain the following elements:

Numerics choice: 32-bit floating point (what the model was trained using), 16-bit floating point, 16-bit integer, or 8-bit integer

Input data: images, audio, text, etc.
Layers (from dozens to hundreds), each of which processes the activations from the prior layer and passes its output activations on to the next layer
Weights for each layer of the model

TOPS - The 1st Stage of Inference Benchmarking

Customers new to neural network performance estimation almost always start by asking, “How many TOPS does your chip/module/board have?” They ask because they assume TOPS and throughput correlate, but they don’t.

TOPS is an acronym for Trillions of Operations per Second. It is calculated as the number of MACs available, in thousands, times the frequency the MACs run at, in gigahertz, times 2 (one MAC = two operations). So, in simpler terms, 1K MACs at 1 GHz = 2 TOPS.
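As a quick check of the arithmetic above, here is a minimal Python sketch of the peak TOPS calculation. The function name `peak_tops` is only an illustrative choice; the formula is the one stated in the text (MACs times frequency times two operations per MAC).

```python
def peak_tops(num_macs: int, frequency_hz: float) -> float:
    """Peak TOPS = number of MACs x clock frequency x 2 operations per MAC."""
    ops_per_second = num_macs * frequency_hz * 2
    return ops_per_second / 1e12  # convert operations/second to trillions


# 1K MACs at 1 GHz, the example from the text
print(peak_tops(1_000, 1e9))  # 2.0
```

Note that this is a peak figure: it says how many operations the MACs could perform if they were never idle, not how many they actually perform on a given model.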

More MACs means more TOPS.

But more TOPS does not by itself mean more throughput. What matters is whether the memory organization and the interconnect can keep the MACs “fed” so they are highly utilized and thus produce high throughput on the model.

ResNet-50 - The 2nd Stage of Inference Benchmarking

Once customers realize the metric that matters is throughput, they usually move on to asking, “What is your chip/module/board’s throughput in inferences/second for ResNet-50?”

MLPerf recently published benchmarks for ResNet-50 submitted by numerous manufacturers.

ResNet-50 is a popular CNN (convolutional neural network) for categorizing images and has been widely used for benchmarking for years.

The problem is, no customer actually uses ResNet-50.

Customers ask about ResNet-50 because they assume that a chip/module/board’s throughput on their model will correlate to ResNet-50 throughput.

The Two Main Flaws with this Assumption are:

Image size: ResNet-50 uses 224x224 images, but most customers want to process megapixel images, which are 16+ times larger. ResNet-50 might run well on a chip/module/board for 224x224 images but perhaps not for megapixel images, because the larger images stress the memory subsystem much more than the smaller ones. For a 2-megapixel image, the intermediate activations can be 64 megabytes, whereas for a 224x224 image they are at most a couple of megabytes.
Batch size: manufacturers want to quote the biggest number they can for benchmarks, so their ResNet-50 benchmark numbers are typically for the biggest batch size they can run. But almost all applications on the edge need batch size = 1 for minimum latency. Consider a car: if you are looking for objects like pedestrians, you need to become aware of them as soon as possible. So large batch sizes may maximize throughput, but what is needed on the edge is minimum latency, which means a batch size of 1.
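To see why image size matters so much for memory, the size of one layer's activations can be estimated as height times width times channels times bytes per value. The sketch below is a rough illustration only: the layer shapes are assumptions chosen for the example (a 112x112x64 layer for the small image and a hypothetical 32-channel layer at full 1920x1080 resolution for the large one), and 8-bit integer activations are assumed.

```python
def activation_bytes(height: int, width: int, channels: int,
                     bytes_per_value: int = 1) -> int:
    """Size of one layer's output activations (default: 8-bit integers)."""
    return height * width * channels * bytes_per_value


MIB = 2 ** 20

# Assumed early layer for a 224x224 input: 112x112 spatial size, 64 channels.
small = activation_bytes(112, 112, 64)

# Hypothetical layer keeping a 2-megapixel (1920x1080) frame at 32 channels.
large = activation_bytes(1080, 1920, 32)

print(f"224x224 input:    {small / MIB:.2f} MiB")
print(f"2-megapixel input: {large / MIB:.2f} MiB")
```

Under these assumptions the small case stays below one megabyte while the large case is in the tens of megabytes, which is the order-of-magnitude gap the text describes. Activations of that size are far less likely to fit in on-chip SRAM.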

ResNet-50 is not a bad benchmark for real world models IF it is run on megapixel images at batch size = 1. But it’s not a good benchmark as usually used.
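The batch size trade-off can be made concrete with a simple latency model. This is a sketch under stated assumptions, not a measurement: frames are assumed to arrive at a fixed interval (as from a camera), the accelerator waits until a full batch has arrived, and the compute times used in the example are hypothetical.

```python
def worst_case_latency_s(batch_size: int, frame_interval_s: float,
                         batch_compute_s: float) -> float:
    """Latency seen by the first frame in a batch.

    The first frame waits for the remaining (batch_size - 1) frames to
    arrive, then for the whole batch to be processed.
    """
    wait_for_batch = (batch_size - 1) * frame_interval_s
    return wait_for_batch + batch_compute_s


FRAME_INTERVAL_S = 1 / 30  # assumed 30 frames/second camera

# Hypothetical compute times: batching improves per-frame cost (10 ms vs 20 ms)
latency_b1 = worst_case_latency_s(1, FRAME_INTERVAL_S, batch_compute_s=0.020)
latency_b8 = worst_case_latency_s(8, FRAME_INTERVAL_S, batch_compute_s=0.080)

print(f"batch 1: {latency_b1 * 1000:.1f} ms")
print(f"batch 8: {latency_b8 * 1000:.1f} ms")
```

In this example the batch of 8 processes frames at twice the rate, yet the first frame in each batch waits more than ten times longer for its result. That is why a throughput number quoted at a large batch size says little about an edge application.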

Real World Models & Images - The 3rd Stage of Inference Benchmarking

The next stage customers reach in the learning curve is realizing that they should find an open source neural network model with characteristics similar to their own: a similar type of model (CNN or RNN or LSTM), a similar size of image (or other input type), a similar number of layers, and similar operations.

For example, customers interested in CNNs most commonly ask, “What is your throughput in frames per second for YOLOv2 (or YOLOv3) for 2 Megapixels (or 1 or 4)?”

What’s really interesting is that although the majority of customers want to know about YOLOv2/v3, almost no manufacturer provides a benchmark for it (one exception is Nvidia Xavier which benchmarks YOLOv3 for 608x608 or 1/3 megapixel).

YOLOv3 is a very stressful benchmark, which makes it a great test of the robustness of an inference accelerator: 62 million weights, 100+ layers, and more than 300 billion MACs to process a single 2-megapixel image. Benchmarking this model shows whether an accelerator can simultaneously achieve high MAC utilization, manage storage reads and writes without stalling the MACs, and move data efficiently between memory and MACs over the interconnect without stalling compute.
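The MAC count above also shows why peak TOPS alone does not predict throughput. A rough upper bound on frames per second is the peak operation rate, scaled by the fraction of time the MACs are actually busy, divided by the operations needed per frame. In the sketch below, the 300 billion MACs per frame comes from the text; the 2 TOPS accelerator and the utilization values are hypothetical.

```python
def max_frames_per_second(peak_tops: float, macs_per_frame: float,
                          mac_utilization: float) -> float:
    """Upper bound on frames/second for a given MAC utilization (0.0 to 1.0)."""
    ops_per_frame = 2 * macs_per_frame  # one MAC = two operations
    return peak_tops * 1e12 * mac_utilization / ops_per_frame


YOLOV3_MACS_2MP = 300e9  # ">300 billion MACs" per 2-megapixel image (from the text)

# Hypothetical 2 TOPS accelerator at several assumed utilization levels
for utilization in (1.0, 0.5, 0.2):
    fps = max_frames_per_second(2.0, YOLOV3_MACS_2MP, utilization)
    print(f"utilization {utilization:.0%}: at most {fps:.2f} frames/second")
```

Two accelerators with the same TOPS rating can therefore differ several-fold in delivered frames per second, depending on how well memory and interconnect keep the MACs fed.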

Of course, it’s not just throughput that matters; it’s also the cost and power required to achieve that throughput.

An Nvidia Tesla T4 at $2000 and 75 Watts might have the throughput you want but may far exceed your budget.

So the other things customers think about are throughput efficiency, throughput/$, and throughput/watt for the kind of model they plan to run.
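Throughput/$ and throughput/watt are simple ratios, and comparing candidates on them can be sketched as follows. Everything in the example is hypothetical: the candidate names and all frames-per-second values are made up for illustration, and only the $2000 and 75 Watt figures for the first candidate echo the price and power mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    frames_per_second: float  # measured on the model you plan to run
    price_usd: float
    power_watts: float

    @property
    def fps_per_dollar(self) -> float:
        return self.frames_per_second / self.price_usd

    @property
    def fps_per_watt(self) -> float:
        return self.frames_per_second / self.power_watts


# Hypothetical candidates; throughput numbers are invented for illustration.
candidates = [
    Candidate("accelerator_a", frames_per_second=100, price_usd=2000, power_watts=75),
    Candidate("accelerator_b", frames_per_second=30, price_usd=300, power_watts=10),
]

for c in candidates:
    print(f"{c.name}: {c.fps_per_dollar:.3f} fps/$, {c.fps_per_watt:.2f} fps/W")

best_per_watt = max(candidates, key=lambda c: c.fps_per_watt)
print("best throughput/watt:", best_per_watt.name)
```

In this made-up comparison the slower part wins on both ratios, which is the kind of result that raw throughput numbers hide.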

The 4th Stage: Benchmarking the Actual Model for Throughput, Power, and Cost

The final stage in customers’ learning curve on benchmarking inference is to develop their own model, using training hardware and software (typically from Nvidia, or in data centers), and then to benchmark that model on the candidate target inference accelerators.

This way a customer can really tell which accelerator will give them the best throughput efficiency.
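A benchmark of this kind can be as simple as timing repeated batch size = 1 inferences on each candidate. The sketch below is a generic timing harness, not any vendor's tool: `run_inference` is a hypothetical stand-in for whatever call runs your model on the target accelerator, and `dummy_model` exists only so the example runs on its own.

```python
import statistics
import time


def benchmark(run_inference, sample, warmup: int = 3, iterations: int = 20) -> dict:
    """Time single-input (batch size = 1) inferences and report the median."""
    for _ in range(warmup):  # let caches and lazy initialization settle
        run_inference(sample)

    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference(sample)
        latencies.append(time.perf_counter() - start)

    median = statistics.median(latencies)
    rate = 1.0 / median if median > 0 else float("inf")
    return {"median_latency_s": median, "inferences_per_second": rate}


def dummy_model(values):
    """Placeholder workload standing in for a real model invocation."""
    return sum(v * v for v in values)


result = benchmark(dummy_model, list(range(10_000)))
print(result)
```

Running the same harness with the same model and input size on each candidate, and recording power and price alongside, gives the numbers needed for a like-for-like comparison.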

The end point seems obvious, but everything does in hindsight. Neural Network Inference is very complicated, and all customers go through a learning curve to reach the right conclusion.


Embedded Computing Design
