Improving memory performance in SoC designs
April 03, 2017
No matter how new, fast, or high-performance an electronic device is touted to be, there’s always a slight, almost imperceptible lag between a request from the user and the device’s response. That’s the memory doing its job at an 80 percent or lower efficiency rate. Of course, the user still considers the device to be blazing fast, but the engineering group knows the performance of the system on chip (SoC) design driving the device could be better – much better, in fact.
Efficient, streamlined communication between the processor and the memory is every engineering group’s dream. That dream is thwarted by the highly integrated nature of today’s SoCs, which comprise many different clients, each generating a different type of request stream to a memory subsystem that may require hundreds of clock cycles of latency to access. Even a single client with multi-threaded capabilities running pointer-chasing code for linked-list processing will produce request streams that are random and appear to have almost no locality of reference. This makes it impossible to extract the best performance from the memory subsystem or to communicate efficiently with the processor.
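To see why pointer-chasing code defeats locality, consider a minimal sketch. The node layout and memory size below are illustrative assumptions, not taken from any particular design: each request address comes out of the previous response, so the stream cannot be predicted by simple strides.

```python
import random

# Simulate a linked list whose 16 nodes are scattered across a memory
# of 4096 word addresses. Traversal order bears no relation to address
# order, so consecutive requests show almost no locality of reference.
random.seed(7)
addresses = random.sample(range(4096), 16)   # scattered node locations
memory = {}
for i, addr in enumerate(addresses):
    nxt = addresses[i + 1] if i + 1 < len(addresses) else None
    memory[addr] = nxt                       # each node holds only a 'next' pointer

# Pointer-chasing traversal: the next request address is only known
# after the current response arrives.
request_stream = []
node = addresses[0]
while node is not None:
    request_stream.append(node)
    node = memory[node]

# The address deltas between consecutive requests are all over the map,
# so a stride-based predictor has nothing to latch onto.
strides = [b - a for a, b in zip(request_stream, request_stream[1:])]
print(f"{len(set(strides))} distinct strides across {len(strides)} requests")
```

A conventional prefetcher watching this stream sees no regular pattern, which is exactly the situation the technology described next is meant to handle.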
What’s needed is a streamlined way to collect and process this apparently random request information to create a virtual locality of reference for better decision making and greater efficiency. One new technology – actually, a block of intellectual property (IP) embedded in the SoC – is poised to do that. It manages widely divergent request streams to create a virtual locality of reference that makes requests appear more linear. Implementing such technology improves memory bandwidth and lets an SoC extract the best possible performance from its memory subsystem.
Not to be confused with a memory scheduler, the IP is a memory prefetch engine that works with memory schedulers by grouping similar requests together. It analyzes multiple concurrent request streams from clients and determines which requests should be optimized, or prefetched, and which should not. The result is high hit rates with ultra-low false fetch rates.
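One way to picture the prefetch decision is per-client stride detection: prefetch only when a stream shows a stable, repeating pattern, and leave irregular streams alone so the false fetch rate stays low. The heuristic below is purely illustrative; the vendor has not published the IP’s actual analysis.

```python
from collections import defaultdict

class PrefetchEngine:
    """Illustrative sketch of a prefetch decision (not the IP's actual
    algorithm): each client's recent addresses are tracked, and a
    prefetch is issued only when the same stride repeats, which keeps
    false fetches low for irregular, pointer-chasing streams."""

    def __init__(self, confidence=2):
        self.history = defaultdict(list)   # client id -> recent addresses
        self.confidence = confidence       # strides that must repeat

    def observe(self, client, addr):
        """Record a request; return an address to prefetch, or None."""
        h = self.history[client]
        h.append(addr)
        if len(h) > self.confidence + 1:
            h.pop(0)
        if len(h) == self.confidence + 1:
            strides = {b - a for a, b in zip(h, h[1:])}
            if len(strides) == 1:          # stable stride detected
                return addr + strides.pop()
        return None                        # irregular: do not prefetch
```

For a linear stream such as 0, 64, 128 the engine starts predicting the next line, while a random pointer-chasing stream never triggers a prefetch, so nothing useless is fetched.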
Once a client request has been optimized, it is stored in a request optimization buffer (a small micro-cache holding optimized client requests) until it is needed by a client. The buffers present a non-blocking interface to each of the multiple client interfaces, allowing peak response bandwidth to exceed that of the memory subsystem and reducing average memory latency.
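In behavior, the request optimization buffer acts like a small micro-cache: prefetched responses are held until a client asks for them, and a hit is served without touching the memory subsystem. The capacity and LRU replacement policy in this sketch are illustrative assumptions, not the IP’s documented parameters.

```python
from collections import OrderedDict

class RequestOptimizationBuffer:
    """Sketch of a request optimization buffer as a tiny micro-cache.
    Capacity and LRU eviction are assumed for illustration only."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.lines = OrderedDict()          # address -> prefetched data

    def fill(self, addr, data):
        """Store a prefetched response, evicting the oldest line if full."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
        self.lines[addr] = data
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # drop least-recently-used line

    def lookup(self, addr):
        """Serve a client request: returns (hit, data).
        A miss would be forwarded to the memory subsystem."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True, self.lines[addr]
        return False, None
```

Because a hit is answered from the buffer rather than from DRAM, a burst of hits can briefly deliver more response bandwidth than the memory subsystem itself sustains, which is the effect the article describes.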
A multi-client interface that supports both AXI and OCP protocols can manage up to 16 clients, a number the designer specifies when configuring the technology. The configuration tool automatically builds the specified number of client interfaces, each functioning independently and able to support concurrent operation. This allows the IP to issue multiple concurrent client requests for any responses issued from the request optimization buffers. Consequently, the IP supplies a higher peak burst bandwidth than the underlying memory subsystem provides. Benchmarks showed the IP reduced read latency by 71 to 78 percent.
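The configuration step might look something like the following. The field names, protocol strings, and interface labels here are hypothetical, invented only to show the shape of a designer-facing tool that enforces the 16-client limit and generates one independent interface per client.

```python
from dataclasses import dataclass

@dataclass
class ClientInterfaceConfig:
    """Hypothetical designer-facing configuration (names are assumed):
    the tool validates the client count against the IP's 16-client
    limit and the supported protocols before generating interfaces."""
    num_clients: int
    protocol: str            # "AXI" or "OCP"

    def __post_init__(self):
        if not 1 <= self.num_clients <= 16:
            raise ValueError("the IP supports 1 to 16 client interfaces")
        if self.protocol not in ("AXI", "OCP"):
            raise ValueError("protocol must be AXI or OCP")

def build_interfaces(cfg):
    # One independently operating interface per configured client.
    return [f"{cfg.protocol}_client_{i}" for i in range(cfg.num_clients)]
```

Each generated interface operates independently, which is what lets the IP accept concurrent requests from all clients at once.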
Every engineering group’s dream is to reduce memory latency and thereby improve the performance of each system component implemented in an SoC, yielding a faster-running design without increasing power consumption. All electronic devices can benefit from an improved memory subsystem, and now a block of IP offers an efficient way to deliver one. No more lagging!