# The Four Stages of Inference Benchmarking

February 17, 2020


This blog discusses how to benchmark inference accelerators to find the one that is the best for your neural network.

Over the last decade, neural networks have gone from interesting research to wide deployment in language translation, keyword recognition, and object recognition.

For a long time, neural networks were confined to data centers, which had the compute resources required to run them: initially microprocessors, and then increasingly GPUs, which have many more of the multiply-accumulate units (MACs) that neural networks need.

Nvidia recently announced that its inference products were outselling its training products for the first time.

As inference moves to the edge (anywhere outside the data center), where power and cost budgets are constrained, customers are searching for inference accelerators that deliver the throughput they need at a price and power they can afford.

This blog also walks through how customers’ thinking on benchmarking commonly evolves as they come up the learning curve. Neural network inference is exciting but complicated, so it is very confusing at first; the lights come on step by step as customers work through the issues.

First let’s review the common elements of inference accelerators and the neural networks they run.

**Common Elements of All Inference Accelerators**

### All inference accelerators have in common the following elements:

- MACs (lots of them)
- On-chip SRAM
- Off-chip DRAM
- Control logic
- On-chip interconnect between all of the units

The number and organization of these elements vary widely between inference accelerators. How the MACs are organized, the ratio of MACs to SRAM and DRAM, and how data flows between them are critical in determining how well the accelerator actually accelerates.

**Common Elements of All Neural Network Models**

### All neural networks contain the following elements:

- Numerics choice: 32-bit floating point (what the model was trained using), 16-bit floating point, 16-bit integer, or 8-bit integer

- Input data: images, audio, text, etc.
- Layers: from dozens to hundreds, each of which processes the activations from the prior layer and passes its output activations on to the next layer
- Weights for each layer of the model
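To make the numerics choice concrete, here is a minimal sketch of how it changes the storage needed for a model's weights. The byte widths follow directly from the bit widths listed above; the weight count of 62 million is borrowed from the YOLOv3 figure quoted later in this blog, and the function and variable names are purely illustrative.

```python
# Bytes needed to store one weight for each numerics choice.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "int16": 2, "int8": 1}


def weight_storage_mb(num_weights: int, numerics: str) -> float:
    """Return the weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_weights * BYTES_PER_VALUE[numerics] / 1e6


# Illustrative weight count (the YOLOv3 figure quoted later in the blog).
EXAMPLE_WEIGHTS = 62_000_000

for numerics in BYTES_PER_VALUE:
    size_mb = weight_storage_mb(EXAMPLE_WEIGHTS, numerics)
    print(f"{numerics}: {size_mb:.0f} MB of weights")
```

This counts weights only; activations and any padding or packing an accelerator applies are not modeled.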

**TOPS - The 1st Stage of Inference Benchmarking**

Customers new to neural network performance estimation almost always start by asking, “How many TOPS does your chip/module/board have?” They assume TOPS and throughput correlate, but they don’t.

TOPS is an acronym for Trillions of Operations per Second: the number of MACs available, in thousands, times the frequency the MACs run at, in gigahertz, times 2 (one MAC = two operations). In simpler terms, 1K MACs at 1 GHz = 2 TOPS.
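The TOPS arithmetic can be written as a short sketch. The function name `peak_tops` is illustrative, and the result is a peak figure that says nothing about how busy the MACs actually are.

```python
def peak_tops(num_macs: int, freq_ghz: float) -> float:
    """Peak TOPS = MACs (in thousands) x frequency (GHz) x 2 ops per MAC."""
    macs_in_thousands = num_macs / 1000
    return macs_in_thousands * freq_ghz * 2


# The example from the text: 1K MACs at 1 GHz.
print(peak_tops(1000, 1.0))  # 2.0
```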

More MACs means more TOPS.

What matters is whether the memory organization and the interconnect can keep the MACs “fed” so they are highly utilized and thus produce high throughput on the model.

**ResNet-50 - The 2nd Stage of Inference Benchmarking**

Once customers realize the metric that matters is throughput, they usually move on to asking, “What is your chip/module/board’s throughput in inferences/second for ResNet-50?”

MLPerf recently published benchmarks for ResNet-50 submitted by numerous manufacturers.

ResNet-50 is a popular CNN (convolutional neural network) for categorizing images and has been widely used for benchmarking for years.

The problem is, no customer actually uses ResNet-50.

Customers ask about ResNet-50 because they assume that a chip/module/board’s throughput on their model will correlate to ResNet-50 throughput.

### The Two Main Flaws with this Assumption are:

- ResNet-50 uses 224x224 images, but most customers want to process megapixel images, which are 16+ times larger. ResNet-50 might run well on a chip/module/board for 224x224 images but not for megapixel images, because the larger images stress the memory subsystem much more than the smaller ones. For a 2-megapixel image the intermediate activations can be 64 megabytes, whereas for a 224x224 image they are at most a couple of megabytes.
- Batch size: manufacturers want to quote the biggest benchmark number they can, so their ResNet-50 numbers are typically for the largest batch size they can run. But almost all edge applications need batch size = 1 for minimum latency. Consider a car: if you are looking for objects like pedestrians, you need to become aware of them as soon as possible. Large batch sizes may maximize throughput, but what the edge needs is minimum latency, which means a batch size of 1.
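As a rough illustration of the image-size gap, the sketch below compares pixel counts against the 224x224 ResNet-50 input. The two megapixel resolutions (1000x1000 and 1920x1080) are assumptions chosen for illustration; the blog does not specify exact dimensions.

```python
def pixel_ratio(width: int, height: int,
                ref_width: int = 224, ref_height: int = 224) -> float:
    """How many times more pixels an image has than the reference size."""
    return (width * height) / (ref_width * ref_height)


# Assumed example resolutions, not taken from any benchmark.
examples = {
    "1 megapixel (1000x1000)": (1000, 1000),
    "2 megapixels (1920x1080)": (1920, 1080),
}

for name, (w, h) in examples.items():
    print(f"{name}: {pixel_ratio(w, h):.1f}x the pixels of 224x224")
```

Pixel count is only a proxy: actual activation sizes depend on the channel counts and strides of each layer.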

ResNet-50 is not a bad benchmark for real world models IF it is run on megapixel images at batch size = 1. But it’s not a good benchmark as usually used.
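To see why batching hurts latency on the edge, consider a simplified model in which frames arrive from a camera at a fixed rate and the accelerator cannot start until the whole batch has arrived. Every number below (the 30 frames/second camera and both compute times) is hypothetical and not measured on any real accelerator.

```python
def first_frame_latency_ms(batch_size: int,
                           frame_interval_ms: float,
                           batch_compute_ms: float) -> float:
    """Latency seen by the first frame in a batch.

    The first frame waits for the remaining (batch_size - 1) frames to
    arrive, then for the whole batch to be processed.
    """
    fill_wait_ms = (batch_size - 1) * frame_interval_ms
    return fill_wait_ms + batch_compute_ms


FRAME_INTERVAL_MS = 1000 / 30  # hypothetical 30 frames/second camera

# Hypothetical compute times: batch 8 has better throughput per frame
# (5 ms/frame vs 10 ms/frame) but much worse latency.
latency_b1 = first_frame_latency_ms(1, FRAME_INTERVAL_MS, batch_compute_ms=10.0)
latency_b8 = first_frame_latency_ms(8, FRAME_INTERVAL_MS, batch_compute_ms=40.0)

print(f"batch 1: {latency_b1:.1f} ms, batch 8: {latency_b8:.1f} ms")
```

In this toy model the larger batch doubles throughput per unit of compute time, yet the first frame waits more than a quarter of a second for a result.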

**Real World Models & Images - The 3rd Stage of Inference Benchmarking**

The next stage customers reach in the learning curve is finding an open source neural network model with characteristics similar to theirs: similar type of model (CNN or RNN or LSTM), similar size of image (or other input type), similar number of layers, and similar operations.

For example, customers interested in CNNs most commonly ask, “What is your throughput in frames per second for YOLOv2 (or YOLOv3) at 2 megapixels (or 1 or 4)?”

What’s really interesting is that although the majority of customers want to know about YOLOv2/v3, almost no manufacturer provides a benchmark for it (one exception is Nvidia Xavier which benchmarks YOLOv3 for 608x608 or 1/3 megapixel).

YOLOv3 is a very stressful benchmark, which makes it a great test of the robustness of an inference accelerator: 62 million weights, 100+ layers, and >300 billion MACs to process a single 2-megapixel image. Benchmarking this model shows whether an accelerator can simultaneously achieve high MAC utilization, manage storage reads/writes without stalling the MACs, and move data efficiently between memory and MACs over the interconnect without stalling compute.
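MAC utilization can be estimated from a measured frame rate with a few lines of arithmetic. In the sketch below, the 300 billion MACs per frame comes from the figure above, while the accelerator (4096 MACs at 1 GHz) and the measured 10 frames/second are hypothetical values chosen only to show the calculation.

```python
def mac_utilization(macs_per_frame: float, frames_per_sec: float,
                    num_macs: int, freq_hz: float) -> float:
    """Fraction of peak MAC capacity actually used on the model."""
    achieved_macs_per_sec = macs_per_frame * frames_per_sec
    peak_macs_per_sec = num_macs * freq_hz
    return achieved_macs_per_sec / peak_macs_per_sec


MACS_PER_FRAME = 300e9   # from the text: >300 billion MACs per 2 MP image
NUM_MACS = 4096          # hypothetical accelerator
FREQ_HZ = 1e9            # hypothetical 1 GHz clock
MEASURED_FPS = 10        # hypothetical benchmark result

utilization = mac_utilization(MACS_PER_FRAME, MEASURED_FPS, NUM_MACS, FREQ_HZ)
max_fps = NUM_MACS * FREQ_HZ / MACS_PER_FRAME  # frame rate at 100% utilization

print(f"utilization: {utilization:.1%}, ceiling: {max_fps:.1f} frames/second")
```

Two accelerators with identical TOPS can land at very different utilization figures, which is why the frame rate on a real model is the number to ask for.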

Of course, it’s not just throughput that matters; it’s also the cost and power required to achieve that throughput.

An Nvidia Tesla T4 at $2000 and 75 watts might have the throughput you want but may far exceed your budget.

The other thing customers think about is throughput efficiency: throughput/$ and throughput/watt for the kind of model they plan to run.
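Throughput efficiency is simple to tabulate once a frame rate has been measured. The sketch below compares two made-up candidates; the first reuses the $2000 and 75 watt figures mentioned above, but every frames/second value, and the second candidate entirely, is a placeholder rather than a real benchmark result.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    fps: float        # measured throughput on your model (placeholder here)
    price_usd: float
    power_w: float

    @property
    def fps_per_dollar(self) -> float:
        return self.fps / self.price_usd

    @property
    def fps_per_watt(self) -> float:
        return self.fps / self.power_w


# Hypothetical candidates; the fps values are invented for illustration.
candidates = [
    Candidate("accelerator_a", fps=100.0, price_usd=2000.0, power_w=75.0),
    Candidate("accelerator_b", fps=40.0, price_usd=200.0, power_w=10.0),
]

for c in candidates:
    print(f"{c.name}: {c.fps_per_dollar:.3f} fps/$, {c.fps_per_watt:.2f} fps/W")
```

In this invented comparison the slower part wins on both efficiency metrics, which only matters if its absolute throughput is still enough for the application.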

**The 4th Stage: Benchmarking the Actual Model for Throughput, Power, and Cost**

The final stage in customers’ learning curve on benchmarking inference is to develop their own model using training hardware/software, typically from Nvidia or in data centers, and then to benchmark that model on candidate target inference accelerators.

This way a customer can really tell which accelerator will give them the best throughput efficiency.
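A batch-size-1 benchmark of your own model can start from a very small timing harness like the sketch below. The `run_inference` callable stands in for whatever API a given accelerator's runtime exposes, and `fake_model` is a placeholder workload; neither reflects a real vendor SDK.

```python
import statistics
import time


def benchmark(run_inference, sample, warmup=5, iterations=50):
    """Time single-input (batch size 1) inference calls."""
    for _ in range(warmup):          # let caches and clocks settle
        run_inference(sample)

    latencies_s = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference(sample)
        latencies_s.append(time.perf_counter() - start)

    mean_s = statistics.mean(latencies_s)
    return {
        "mean_latency_ms": mean_s * 1e3,
        "max_latency_ms": max(latencies_s) * 1e3,
        "throughput_fps": 1.0 / mean_s,
    }


def fake_model(image):
    """Placeholder workload standing in for a real inference call."""
    return sum(image)


result = benchmark(fake_model, list(range(10_000)))
print(result)
```

Running the same harness with the same input on each candidate, and recording price and power alongside the result, gives the throughput efficiency comparison described above.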

The end point seems obvious, but everything does in hindsight. Neural Network Inference is very complicated, and all customers go through a learning curve to reach the right conclusion.