Chip synthesis and high-level synthesis: software in hardware
May 24, 2017
Combining chip synthesis and high-level synthesis provides the best mix of area, performance, and power.
A new generation of High-Level Synthesis (HLS) tools is currently being used for two main purposes. The first is to implement software in hardware for performance reasons. The second is to drive semiconductor design to a higher level of abstraction for reasons of productivity, reuse, architectural exploration, and better Quality of Results (QoR) than otherwise possible.
At the same time, a new approach to Register Transfer Level (RTL) synthesis called chip synthesis is making it easier than ever to achieve a fast and accurate assessment of the final performance without needing to create a complete physical implementation. Combining these two technologies lets designers quickly vary the parameters of a design, obtain accurate performance numbers, and converge on a design with the best mix of area, performance, and power.
More horsepower, better abstraction
Embedded software of all types, especially software with high-throughput requirements such as high-definition video processing, often runs into performance problems. While some software approaches can be used to increase performance, the only workable approach when the performance is off by orders of magnitude is to change the underlying computing fabric on which the software runs, which could be as simple as switching to a multicore processor. However, changing the computing fabric is often not the best option, usually for power or cost reasons.
HLS is an increasingly attractive approach that takes part of the software and automatically implements it in hardware, either in raw gates on a System-on-Chip (SoC) or, better yet, in an FPGA. AutoPilot from AutoESL is an example of a tool that takes in C, C++, or SystemC as the input and quickly produces RTL Verilog or VHDL as the output.
At the same time, SoC designers are looking for ways to push design to a higher level of abstraction, describe their algorithms in C or SystemC, automatically convert this into RTL code, and hit the correct trade-off point for area (cost), power, and performance. By working at a higher level, designers can dramatically increase their productivity and be assured of QoR that is close to or better than hand-coded results. Again, HLS tools are the link that performs this optimized conversion from input to RTL code. The traditional RTL implementation flow can then take over.
Getting to the assessment
It would be an exaggeration to say that HLS makes compilation of hardware as simple as C compilation of software, but it certainly makes the transformation of software into hardware straightforward, especially compared to creating complex RTL implementations by hand. One reason that hardware compilation is more complicated than software compilation is that the HLS tool needs to consider a much richer set of trade-offs.
For example, a data path can be implemented simply, pipelined, or replicated. Each of these options has different performance, area, and power characteristics varying by factors as large as 1,000. HLS tools can be given directives to steer the implementation toward the sweet spot that the designer wants. But there is a problem: Given that the output of HLS is RTL code, how can designers quickly determine the area, power, and performance of a particular candidate implementation?
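To make the trade-off concrete, here is a minimal sketch of a toy cost model for the three implementation styles. It is not how any HLS tool estimates results. The function name, the area constants, and the assumption that every stage takes exactly one clock cycle are all hypothetical and chosen only for illustration.

```python
from math import ceil

# Hypothetical cost constants, chosen only for illustration.
STAGE_AREA = 10     # area units for the logic of one data path stage
REGISTER_AREA = 1   # area units for one bank of pipeline registers


def estimate(style, n_items, stages, copies=1):
    """Return (cycles, area) for a toy data path that processes n_items.

    Assumes each stage takes exactly one clock cycle.
    """
    if style == "simple":
        # Each item passes through every stage before the next item starts.
        return n_items * stages, stages * STAGE_AREA
    if style == "pipelined":
        # Once the pipeline is full, one item completes per cycle.
        area = stages * STAGE_AREA + (stages - 1) * REGISTER_AREA
        return stages + n_items - 1, area
    if style == "replicated":
        # `copies` identical simple data paths split the items between them.
        return ceil(n_items / copies) * stages, copies * stages * STAGE_AREA
    raise ValueError(f"unknown style: {style}")


if __name__ == "__main__":
    for style in ("simple", "pipelined", "replicated"):
        cycles, area = estimate(style, n_items=1000, stages=4, copies=4)
        print(f"{style:>10}: {cycles} cycles, {area} area units")
```

Even in this toy model, pipelining and four-way replication reach roughly the same cycle count at very different area costs, which is the kind of choice a directive steers. Real designs add power, clock frequency, and memory bandwidth to the picture, so the spread can be far wider.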
The missing link is a way to assess those characteristics accurately and get quick feedback about any issues. HLS tools provide reasonable yet fairly coarse estimates, and more accuracy is often required. However, the traditional tools that reduce RTL code to an implementation run far more slowly than HLS tools do.
Although HLS runs extremely fast (about an hour or so), reducing an RTL implementation far enough to obtain accurate performance numbers might require half a day of synthesis followed by a day and a half of physical design. This is hardly the quick feedback loop that the HLS user would like: it squanders the potential to iterate five or six times a day and cuts it to a couple of times per week. These newer HLS tools are language agnostic and can simultaneously optimize for timing, area, and performance, thus producing highly implementable RTL code, but their power is marginalized by downstream RTL synthesis.
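The arithmetic behind that feedback loop can be sketched in a few lines. The tool runtimes come from the figures quoted above. The eight-hour working day, the five-day week, and the reading of "an hour or so" as 1.5 hours are assumptions made only for this illustration.

```python
def iterations(available_hours, *step_hours):
    """Whole design iterations that fit into the available working hours."""
    return int(available_hours // sum(step_hours))


# Assumptions for illustration: an 8-hour working day and a 5-day week.
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5

HLS_HOURS = 1.5                               # "about an hour or so" (assumed)
RTL_SYNTHESIS_HOURS = 0.5 * HOURS_PER_DAY     # half a day of synthesis
PHYSICAL_DESIGN_HOURS = 1.5 * HOURS_PER_DAY   # a day and a half of physical design

# HLS on its own: how many runs fit into one day.
hls_only_per_day = iterations(HOURS_PER_DAY, HLS_HOURS)

# HLS followed by traditional synthesis and physical design: runs per week.
full_flow_per_week = iterations(
    HOURS_PER_DAY * DAYS_PER_WEEK,
    HLS_HOURS,
    RTL_SYNTHESIS_HOURS,
    PHYSICAL_DESIGN_HOURS,
)

if __name__ == "__main__":
    print("HLS-only iterations per day:", hls_only_per_day)
    print("Full-flow iterations per week:", full_flow_per_week)
```

Under these assumptions the downstream steps account for more than 90 percent of each iteration, so speeding up HLS alone would barely change the weekly count.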
Blocks and the chip
A further nuance is that the detailed performance of a block doesn’t just depend on the block itself, but also on the other blocks around it. A design is not always synthesized entirely from a high level; it may also include legacy blocks, third-party IP blocks, and blocks designed by hand at the RTL level. When these blocks are implemented together, the performance of any particular block is interrelated with the performance of the other blocks that share some of the same physical resources.
With traditional RTL synthesis, the designer faces an unattractive choice: fast but coarse feedback or accurate but extremely slow feedback. What is required is an approach that provides both fast and accurate feedback. Chip synthesis tools such as RealTime Designer from Oasys Design Systems offer this combination of features.
Chip synthesis operates by directly reducing the RTL code to placed elements, thus providing two major advantages over traditional synthesis followed by place and route: the process is fast, and the timing and size data correlate well with what will eventually be obtained when the design is finally implemented. The combination of HLS and chip synthesis makes it possible to take a quantity of C code and quickly acquire excellent estimates for performance and area (see Figure 1). This makes it much more efficient for the designer to home in on the most appropriate implementation. In addition, because chip synthesis can quickly process huge blocks, it can synthesize the block being designed as well as the surrounding blocks that impact performance.
The difference is clear
Chip synthesis works differently from traditional synthesis. Once the RTL code has been parsed, it is partitioned (based on connectivity) into smaller segments that eventually will be reduced to gates. Each partition is small enough that it won’t contain any long wires, which would lead to high variability in timing, and large enough that it can include implementations with potentially different area/time trade-offs. Each partition is independent of the others. Of course, the timing numbers from all the other partitions are required to be able to time the whole chip, but the detailed internals of each partition are not required simultaneously. Because it is no longer necessary to look at the whole chip at the gate level at the same time, memory requirements are reduced.
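The idea of connectivity-based partitioning can be illustrated with a small sketch. This is a generic greedy breadth-first clustering, not the algorithm used by any particular chip synthesis tool. The function name, the netlist representation, and the simple size cap are assumptions made for illustration.

```python
from collections import deque


def partition_by_connectivity(nodes, edges, max_size):
    """Greedily group connected RTL elements into partitions of bounded size.

    nodes:    list of element names (hypothetical RTL-level blocks)
    edges:    list of (a, b) pairs meaning a and b are connected
    max_size: cap on elements per partition, a stand-in for "small enough
              that it won't contain long wires"
    """
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    unassigned = set(nodes)
    partitions = []
    for seed in nodes:                  # fixed order keeps results repeatable
        if seed not in unassigned:
            continue
        unassigned.discard(seed)
        group, queue = [], deque([seed])
        while queue:
            node = queue.popleft()
            group.append(node)
            for neighbor in sorted(adjacency[node]):
                # Claim a neighbor only while the partition still has room.
                if neighbor in unassigned and len(group) + len(queue) < max_size:
                    unassigned.discard(neighbor)
                    queue.append(neighbor)
        partitions.append(group)
    return partitions


if __name__ == "__main__":
    nodes = ["a", "b", "c", "d", "e", "f"]
    edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
    print(partition_by_connectivity(nodes, edges, max_size=3))
```

Each resulting group can then be handled on its own, with only its boundary timing shared with the rest of the chip. A real tool would also enforce a lower bound on partition size so that each partition keeps room for different area/time trade-offs.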
The RTL partitioning approach is the main reason that chip synthesis can be so fast and effective. By operating at a higher level, this method intelligently synthesizes and times the design one partition at a time. It then resynthesizes, re-places (updating the global routes), and repartitions parts of the design until the timing constraints are met.
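A heavily simplified sketch of such a refinement loop follows. It assumes, purely for illustration, that the partitions sit in series on one critical path, so the chip delay is the sum of the partition delays, and that each partition offers a precomputed list of (delay, area) implementations. The function name and data layout are hypothetical, and placement and routing are not modeled at all.

```python
def close_timing(partitions, target_delay, max_passes=100):
    """Pick one implementation per partition until the delay target is met.

    partitions: dict mapping a partition name to a list of (delay, area)
                options, ordered from smallest area to fastest.
    Returns a dict of chosen option indexes, or None if timing cannot be met.
    """
    choice = {name: 0 for name in partitions}   # start with smallest area

    def total_delay():
        # Only each partition's timing number is needed here, not its gates.
        return sum(partitions[name][choice[name]][0] for name in partitions)

    for _ in range(max_passes):
        if total_delay() <= target_delay:
            return choice
        # Find the single partition whose next option saves the most delay.
        best, best_gain = None, 0
        for name, options in partitions.items():
            i = choice[name]
            if i + 1 < len(options):
                gain = options[i][0] - options[i + 1][0]
                if gain > best_gain:
                    best, best_gain = name, gain
        if best is None:
            return None            # no faster implementation is left
        choice[best] += 1          # "resynthesize" just that one partition
    return None


if __name__ == "__main__":
    options = {
        "p0": [(5, 10), (3, 14)],
        "p1": [(6, 8), (4, 12), (3, 20)],
    }
    print(close_timing(options, target_delay=8))
```

The point of the sketch is that each pass touches one partition and reuses the timing numbers of all the others, which is why the whole chip never has to be held at gate level at once.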
Working at a higher level with the latest HLS and chip synthesis technology shortens the design feedback loop dramatically. For the typical size of design created by HLS, chip synthesis runs in about the same amount of time as HLS. Using the two technologies together means that a design can be iterated in an hour or two, allowing several trial implementations to be considered per day instead of a couple per week. The time freed up by this approach can go toward a tighter schedule or toward exploring a richer space of alternatives.