Making embedded storage fault-tolerant

May 11, 2017

Fault tolerance is the holy grail for embedded systems, especially in military and industrial applications where real-time operating systems are common and downtime is costly. Yet minimizing downtime is easier said than done, especially when it comes to storage.

Redundant storage using redundant array of independent disks (RAID) technology has been prevalent at the enterprise level for decades, but the size, weight, and computing constraints of embedded systems have made it much more difficult to implement in this sector. Recently, the prevalence of high-density SSDs in ever-smaller form factors has made storage redundancy possible even in compact embedded systems. Combined with ultra-compact hardware RAID controllers, these drives may be ushering in a new age in which highly available embedded storage is no longer an oxymoron.

When it comes to creating reliable storage systems, redundancy is key. Mirroring disks using RAID has been common practice since the 1990s. RAID, a standardized set of schemes for distributing or mirroring data across multiple drives, allowed fault-tolerant storage systems to be built even with relatively inexpensive hardware. If a drive failed, its mirrored backup could take over, allowing for minimal or even no downtime in well-implemented systems.
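The failover behavior described above can be illustrated with a toy model. This is only a sketch of the mirroring idea: the `MirroredVolume` class and its methods are hypothetical names invented for illustration, and a real RAID controller works on physical block devices rather than in-memory dictionaries.

```python
class MirroredVolume:
    """Toy RAID 1 model: every block is written to all healthy drives."""

    def __init__(self, num_drives=2):
        # Each "drive" is just a dict mapping block number -> bytes.
        self.drives = [dict() for _ in range(num_drives)]
        self.failed = set()

    def write(self, block, data):
        healthy = [i for i in range(len(self.drives)) if i not in self.failed]
        if not healthy:
            raise IOError("no healthy drives left")
        for i in healthy:
            self.drives[i][block] = data

    def read(self, block):
        # Serve the read from the first healthy drive that holds the block.
        for i, drive in enumerate(self.drives):
            if i not in self.failed and block in drive:
                return drive[block]
        raise IOError("block unavailable on any healthy drive")

    def fail(self, index):
        # Mark a drive as failed; it is skipped for all later I/O.
        self.failed.add(index)


volume = MirroredVolume()
volume.write(0, b"boot config")
volume.fail(0)                # simulate the first drive dying
recovered = volume.read(0)    # the read is served from the mirror
```

The host keeps reading the same block after the first drive fails, which is the property that lets a mirrored array ride through a single drive failure.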

While RAID made good sense for server applications, implementing it at the embedded system level was a challenge. Before SSDs were prevalent, hard drives were the main storage medium. Their size and weight meant that having redundant drives was out of the question for most, if not all, embedded applications.

When SSDs came onto the market in force, RAID was still challenging to implement. Flash storage was initially quite expensive, and redundant embedded storage would have been cost-prohibitive for many applications. Size was also an issue even with SSDs, as early SSDs were not always smaller than the hard drives they replaced.

Managing RAID has traditionally required either a bulky hardware RAID controller, which is impractical for space-constrained systems, or a software RAID controller. While a software RAID controller makes sense in terms of space savings, it’s not always the right option for embedded systems. Embedded computers are often size- and energy-constrained systems that can’t afford the CPU and memory overhead of running RAID software.

Reliability versus fault tolerance

Because of the various challenges of achieving storage redundancy in embedded systems, minimizing downtime for embedded storage has traditionally focused on reliability instead of fault tolerance. By using high-quality components and designing reliable systems with a higher mean time to failure (MTTF), designers can improve service life and operational times.

Mechanical hard drives were prone to numerous failure modes. Vibration, shock and plain old wear and tear meant that it was not a matter of if a drive would fail but when. Making reliable hard drives meant using better quality components and rugged mechanical design to better tolerate shock and vibration.

Today’s SSDs, with their solid state design, eliminate mechanical problems as a failure mode, but can still fail at the drive controller or storage medium level. Flash cells have a limited number of write cycles before the cell simply no longer stores bit states accurately. So, while flash is robust in the face of shock and vibration, write endurance needs to be carefully monitored for SSDs.
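As a rough illustration of why write endurance matters, the sketch below estimates how long a drive might last before its rated program/erase cycles are used up. The formula is a common back-of-the-envelope approximation, not a vendor specification. The function name and every figure (capacity, cycle ratings, daily write volume, write amplification) are assumptions chosen for illustration only.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                             write_amplification=1.0):
    """Back-of-the-envelope endurance estimate for a flash drive.

    Assumes ideal wear leveling, so every cell absorbs an equal share
    of the writes. Real drives will deviate from this.
    """
    total_nand_writes_gb = capacity_gb * pe_cycles
    nand_writes_per_day = host_writes_gb_per_day * write_amplification
    return total_nand_writes_gb / (nand_writes_per_day * 365)


# Illustrative numbers only: a 32 GB drive written at 20 GB/day with an
# assumed write amplification of 2. The cycle ratings are hypothetical.
mlc_years = estimated_lifetime_years(32, pe_cycles=3_000,
                                     host_writes_gb_per_day=20,
                                     write_amplification=2.0)
slc_years = estimated_lifetime_years(32, pe_cycles=60_000,
                                     host_writes_gb_per_day=20,
                                     write_amplification=2.0)
print(f"lower-endurance flash:  {mlc_years:.1f} years")
print(f"higher-endurance flash: {slc_years:.1f} years")
```

Under these assumptions the estimate scales linearly with the cycle rating, which is why the grade of flash has such a direct effect on service life in write-heavy applications.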

For SSDs, improving reliability therefore entails using industrial drives, which combine higher-grade flash with drive controllers optimized for reliability and write endurance rather than pure performance. Rather than using consumer-grade multi-level cell (MLC) flash, industrial systems will often use single-level cell (SLC) or SLC-like flash such as iSLC. These higher-grade types of flash endure thousands more write cycles than MLC flash, greatly extending storage service life.

While improving reliability is always a major goal for industrial systems, true resiliency requires fault tolerance as well. To understand how to create fault tolerance, we only need to look at enterprise data centers, where downtime can cost thousands to millions of dollars. In these mission-critical environments, reliable components are combined with fault-tolerant design to create highly available systems.

Availability, which can be thought of as minimizing downtime, is approached in two ways. The first approach is to extend the service life of the system, which means improving reliability. The other approach is to reduce the time it takes to recover the system after a failure, which means improving fault tolerance.
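The two levers can be made concrete with the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR), where MTTR is the mean time to repair. The sketch below is a minimal illustration. The function names and all of the hour figures are assumptions picked to show the effect, not measurements of any real system.

```python
HOURS_PER_YEAR = 8760


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected downtime per year for a given availability."""
    return (1 - avail) * HOURS_PER_YEAR


# Hypothetical figures for three design choices.
baseline = availability(mttf_hours=10_000, mttr_hours=24)
more_reliable = availability(mttf_hours=20_000, mttr_hours=24)   # better parts
fault_tolerant = availability(mttf_hours=10_000, mttr_hours=1)   # fast recovery

for name, a in [("baseline", baseline),
                ("more reliable", more_reliable),
                ("fault tolerant", fault_tolerant)]:
    print(f"{name:15s} availability={a:.5f} "
          f"downtime={downtime_hours_per_year(a):.2f} h/year")
```

With these illustrative numbers, doubling MTTF halves the expected downtime, while cutting recovery time from a day to an hour reduces it by more than an order of magnitude. Both levers help, and they can be combined.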

Fault-tolerant embedded storage

Fault-tolerant storage requires storage redundancy – there’s no way around it. Thankfully these days, both SSDs and RAID controllers have greatly shrunk in size.

[Figure 1 | M.2 SSDs like this Innodisk M.2 3SE3 drive pack up to 32GB of storage into a tiny 22x42x3.5mm form factor]

Whereas SSDs were originally the same size as the 3.5″ hard drives they replaced, today’s mSATA and M.2 form factor SSDs make even 2.5″ laptop drives look like oversized behemoths. These compact SSDs are less than half the size of a playing card, and their thickness is measured in millimeters.

RAID controllers have also undergone a serious diet. What used to require a full PCIe card now can be implemented on an SoC-type chip. When paired with the right firmware, this new generation of RAID controllers is designed to work with SSDs, not against them.

For the embedded system designer today, there are a number of options on the market for various storage form factors:

[Figure 2 | This E2SS-32R2 xRAID controller comes in a 2.5" drive enclosure, virtualizing a dual M.2 SSD array into a single 2.5" drive.]

For larger systems with an existing 2.5″ drive slot, these RAID controllers emulate a 2.5″ disk. They consist of a hardware RAID controller with two mSATA or M.2 slots for redundant SSDs. They can be configured as either RAID 1 for redundancy and fault tolerance or RAID 0 for higher performance, and in both cases they present as a normal 2.5″ drive to the host system.
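The trade-off between the two modes can be sketched numerically for a two-drive array. This is a simplified model with assumptions stated up front: drive failures are treated as independent, each drive has the same probability of failing over the period of interest, and no failed drive is replaced or rebuilt during that period. The function name and the sample figures are hypothetical.

```python
def two_drive_array(level, drive_capacity_gb, drive_failure_prob):
    """Usable capacity and data-loss probability for a two-drive array.

    Simplified model: independent failures, no rebuild during the period.
    """
    p = drive_failure_prob
    if level == 0:
        # Striping: capacities add up, but losing either drive loses the data.
        return {"usable_gb": 2 * drive_capacity_gb,
                "data_loss_prob": 1 - (1 - p) ** 2}
    if level == 1:
        # Mirroring: one drive's worth of capacity; data is lost only if
        # both drives fail.
        return {"usable_gb": drive_capacity_gb,
                "data_loss_prob": p ** 2}
    raise ValueError("this sketch only models RAID 0 and RAID 1")


# Illustrative inputs: two 32 GB drives, each with an assumed 2% chance
# of failing during the period.
raid0 = two_drive_array(0, drive_capacity_gb=32, drive_failure_prob=0.02)
raid1 = two_drive_array(1, drive_capacity_gb=32, drive_failure_prob=0.02)
print(raid0)
print(raid1)
```

In this model RAID 0 doubles the capacity but is roughly twice as likely to lose data as a single drive, while RAID 1 gives up half the raw capacity in exchange for a far smaller chance of data loss.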

[Figure 3 | This EGSS-32R1 RAID controller is integrated into a 22x42x11mm M.2 form factor, making it the smallest RAID controller card currently available.]

For smaller systems, an mSATA or M.2 interface can provide one of the most compact RAID configurations available today. Just like the 2.5″ disk replacement, the mSATA or M.2 RAID controller plugs into the corresponding interface and presents a single drive to the host. Behind that single drive, it provides storage redundancy through physical connections to two SATA drives.

These SATA drives can either be normal sized SATA drives connected using a flexible cable, or for maximum space efficiency, SATADOM drives, which are compact SSDs that connect directly to the SATA connector. SATADOM drives from Innodisk come in various physical configurations, from vertical to horizontal, to fit a variety of embedded systems.

[Figure 4 | SATADOM drives like this Innodisk SH 3SE3 come in both vertical and horizontal configurations to fit space constrained embedded systems]

While not an option for most low-power embedded systems, high-end embedded PCs with severe space constraints can consider using dual SSDs in conjunction with software RAID. The compact nature of mSATA, M.2 and SATADOM SSDs makes this the ultimate compact RAID configuration, but the CPU and memory overhead of software RAID means it is viable only for higher-end embedded systems with the resources to support it.
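For readers who go the software RAID route, the array still has to be monitored so that a failed drive is noticed and replaced. The article does not name a particular software RAID implementation, so the sketch below assumes the Linux md driver, which reports array state in `/proc/mdstat`. The helper name `degraded_arrays` and the sample text are illustrative only.

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status shows a missing member.

    In /proc/mdstat, a status such as [UU] means all members are up,
    while an underscore (for example [U_]) marks a missing or failed member.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None
    return degraded


# Illustrative sample in the /proc/mdstat layout (not taken from a real system).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      31234048 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      1047552 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a real Linux system the same function could be fed the contents of `/proc/mdstat`, for example from a periodic health-check task, so that a degraded mirror raises an alert before the second drive fails.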

Implementing high-availability embedded storage

The combination of fault-tolerant, redundant RAID storage with reliable, industrial-grade SSDs, such as those built on SLC or iSLC flash, allows embedded systems to achieve true high availability. Both reliability (the time before failure) and fault tolerance (the time to repair) are addressed, minimizing downtime for the storage subsystem.

Fault tolerance can also be used on its own, with MLC-grade SSDs. For low write-cycle applications, this can be an affordable yet highly effective approach to minimizing downtime.

While it’s been a long and arduous journey, the miniaturization of SSDs and RAID controllers is finally allowing today’s embedded systems to achieve true fault-tolerant storage.

C.C. Wu is vice president of Innodisk and director of the Embedded Flash Division. He is a frequent presenter at the annual Flash Memory Summit held in Santa Clara, California, and speaks on the topics of NAND flash technology and embedded systems storage.

C. C. Wu, Innodisk Corporation