Bigger Is Not Always Better: Exploiting Parallelism in NN Hardware
December 16, 2020
Many automotive system designers, when considering suitable hardware platforms for executing high-performance NNs (Neural Networks), determine the total compute power required by simply adding up each NN’s requirements – the sum defines the capabilities of the NN accelerator needed. Or does it?
The reality is that almost all automotive NN applications comprise a series of smaller NN workloads. By considering the task-parallel nature of the NN inferences within automotive software, a far more flexible approach using multiple cores can deliver superior results with far greater scalability and power efficiency.
Looking “Under the Hood” of AI Workloads
When designing a hardware platform capable of executing the AI workloads of automated driving, many factors need to be considered. However, the biggest one is uncertainty: what workload does the hardware actually need to execute, and how much performance is needed to safely and reliably execute it? How much of the time do I need to run that workload? How can I make my hardware platform scalable, so I can allow for performance upgrades, and smaller or larger configurations depending on the vehicle model or sensors?
Without a deep understanding of the range of workloads and how each operates, SoC designers are forced to target the worst case, which often means large parts of the chip’s capabilities are rarely used. For an automated vehicle, that means unnecessary cost and much higher power consumption. That is why AImotive develops both software and hardware technologies: it allows us to take a holistic approach to system design.
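To make the cost of worst-case sizing concrete, here is a minimal back-of-the-envelope sketch. The workload names, TOPS figures and duty cycles are illustrative assumptions, not measurements of any real system:

```python
# Illustrative only: hypothetical per-workload compute demands in TOPS,
# paired with the fraction of time each workload actually runs (duty cycle).
workloads = {
    # name: (peak_tops, duty_cycle) -- assumed values for illustration
    "lane_detection":    (10.0, 1.00),
    "object_detection":  (40.0, 1.00),
    "driver_monitoring": (5.0,  0.25),
    "parking_assist":    (30.0, 0.05),  # only active at low speed
}

# Sizing for the worst case means summing every peak demand.
worst_case_tops = sum(peak for peak, _ in workloads.values())

# Average demand, weighted by how often each workload really runs.
average_tops = sum(peak * duty for peak, duty in workloads.values())

utilization = average_tops / worst_case_tops
print(f"worst-case sizing: {worst_case_tops:.1f} TOPS")
print(f"average demand:    {average_tops:.1f} TOPS")
print(f"utilization:       {utilization:.0%}")
```

Even with these generous assumptions, a monolithic engine sized for the worst case sits well below full utilization on average – silicon (and power budget) that is paid for but rarely used.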
What is the Workload?
Executing NNs requires extremely high-performance engines, measured in many tens or hundreds of TOPS (Trillions of Operations Per Second). However, building in sufficient contingency to manage uncertainty results in significant increases in size, power consumption and, of course, cost. That doesn’t align well with demanding automotive constraints on cost, power and guaranteed performance.
This is why we designed aiWare hardware IP first and foremost as a highly scalable architecture. Our aiWare hardware IP complements our modular aiDrive software technology portfolio, by offering our customers and partners alternatives to mainstream solutions that are often over-specified.
Do We Need One Big NN Accelerator Engine?
The simple answer: no!
When we were designing the architecture for aiWare, we looked at all the different NN workloads within an AI-based system, which together required more than 100 TOPS. Based on experience with our own aiDrive software, as well as extensive discussions with our partners about how they are building their solutions, we saw that the total is almost never one single large “monolithic” NN workload: it is usually a collection of much smaller NN workloads. Often this means applying one much smaller NN to input from multiple different sensors in parallel; sometimes the results from one set of NN computations are then passed on to a different NN that combines multiple inputs, a technique known as “late fusion”.
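The structure described above – the same small NN applied to several sensors in parallel, with the results combined by a later fusion NN – can be sketched as a simple task graph. This is a hypothetical illustration in plain Python (the function names and the use of threads to stand in for accelerator cores are assumptions, not aiWare’s API):

```python
from concurrent.futures import ThreadPoolExecutor

def perception_nn(sensor_frame):
    """Stand-in for a small per-sensor NN (e.g. a camera backbone)."""
    return {"sensor": sensor_frame["id"],
            "detections": len(sensor_frame["pixels"])}

def fusion_nn(per_sensor_results):
    """Stand-in for the 'late fusion' NN that combines per-sensor outputs."""
    return sorted(r["sensor"] for r in per_sensor_results)

# Four hypothetical camera frames.
frames = [{"id": f"cam{i}", "pixels": [0] * (i + 1)} for i in range(4)]

# The per-sensor inferences are independent of each other, so they can run
# task-parallel on separate accelerator cores; threads model those cores here.
with ThreadPoolExecutor(max_workers=4) as pool:
    per_sensor = list(pool.map(perception_nn, frames))

# Only the fusion step needs all the results together.
fused = fusion_nn(per_sensor)
print(fused)  # ['cam0', 'cam1', 'cam2', 'cam3']
```

The point of the sketch is the dependency structure: nothing forces the four per-sensor inferences onto one big engine, because only the final fusion step synchronizes them.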
AI systems use multiple NNs in various ways, breaking the overall task down into a series of modules. Pre-processing, that is the work done to transform each individual sensor’s raw data before it is combined downstream with data from other sensors, can sometimes dominate the total TOPS budget. That’s why, by looking at the complete system, not just one or two workloads in isolation, engineers can identify ways to exploit the inherent parallelism of the system to produce simpler yet more efficient solutions.
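As a back-of-the-envelope illustration of how per-sensor pre-processing can dominate the budget, consider a hypothetical configuration (all figures below are assumed for illustration, not measured):

```python
# Assumed configuration: 8 cameras, each with its own pre-processing NN.
num_cameras = 8
pre_process_tops_per_camera = 8.0   # assumed per-sensor NN demand
fusion_tops = 20.0                  # assumed late-fusion NN demand

pre_total = num_cameras * pre_process_tops_per_camera
total = pre_total + fusion_tops
share = pre_total / total

print(f"pre-processing: {pre_total:.0f} of {total:.0f} TOPS ({share:.0%})")
```

Under these assumptions, pre-processing accounts for roughly three quarters of the budget – and because the eight per-camera workloads are mutually independent, they map naturally onto eight small cores rather than one large monolithic engine.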
Do You Need to Scale All Performance Equally?
A single NN accelerator core is, by definition, limited in capacity, however powerful. What happens when you exceed the capabilities of that engine? If the NN accelerator is integrated into an SoC, will you be forced to move to a new SoC when the primary reason for the upgrade is to substantially increase NN inference performance to keep up with the latest NN algorithms? That would require re-validation of all the software for every function executing on that SoC – not just the NN parts: a costly and time-consuming process if all you want is more performance from the NN engine.
There will always be reasons to upgrade other parts of an SoC, such as the CPU, memory or communications. However, those upgrades need to be traded off against the expensive and time-consuming re-validation of any new SoC and the complex embedded software closely tied to it. By adopting an external NN accelerator, you can postpone upgrading the SoC containing the host processor itself, just as PC gamers upgrade their GPU while keeping the same CPU and chassis.
Should Hardware Scale Over Lifetime?
As experience of integrating AI into automated vehicles grows rapidly, so too does the need for modularity and scalability. These days, cars often use common platforms for the underlying chassis, which can then be adapted to the various models. OEMs and Tier1s are now starting to apply similar concepts to vehicle software, by bringing together the different software components – sometimes distributed over 50 or 100 ECUs (Electronic Control Units) – into a common software platform. That will increasingly demand a more scalable hardware platform to execute that software, especially as software from more ECUs is gradually centralized.
As cars become increasingly upgradeable during their lifetime, the hardware platform also needs to move towards a standard, modular, scalable and upgradeable solution. But the different types of processors need to be upgradeable separately: in particular the NN acceleration hardware, where workloads and algorithms are likely to change dramatically every year for the foreseeable future.