Tesla's Failing Vehicles Due to Worn Out eMMCs. What happened?
January 04, 2021
The embedded NAND-based eMMC found in older Model S and X units wore out due to the NAND flash cell structure within the eMMC.
The recent information request from NHTSA's Office of Defects Investigation (ODI) into older Tesla Model S and Model X vehicles highlighted a workload oversight: Main Control Units (MCUs) based on the NVIDIA Tegra 3 processor with an integrated 8GB eMMC NAND flash memory were experiencing issues. The problem was compounded when new firmware updates brought additional features to the electric vehicles (EVs), which accelerated the wear on the NAND flash memory. The firmware wasn't an issue in the beginning, and there was plenty of storage to handle the logged data, but each upgrade added new features and left less free storage space than the one before.
In response to ODI's information request, Tesla listed 2,399 complaints and field reports, 7,777 warranty claims, and 4,746 non-warranty claims related to MCU replacements. The failing MCUs resulted in loss of the rear camera image display when in reverse gear. Once the NAND flash memory had worn out, drivers also lost access to vehicle features such as HVAC (defogging) and audible chimes relating to ADAS, Autopilot, and turn signals. While owners could still technically drive their vehicles, they could no longer charge them effectively, making the cars inoperative.
eMMC modules have a pre-defined lifespan due to the NAND flash technology they are based upon. They can only be subjected to a limited number of Program/Erase (P/E) cycles, and even if a company designs to those specifications initially, it must anticipate the growing workload that the same system will have to manage over time. Ultimately, the problem here is threefold: a lack of technological understanding of NAND flash, a significantly more complex and multifaceted lack of use case understanding, and an assumption that the lifespan of the drive hinges entirely on the NAND flash technology rather than on the flash memory controller at play.
Understanding NAND Flash Technology in Tesla
According to several Tesla repair professionals, the embedded NAND-based eMMC found in older Model S and X units wore out due to the NAND flash cell structure within the eMMC. This is true, to an extent. Different types of NAND flash technology have a different (but always limited) number of P/E cycles, which others call 'write cycles'.
- SLC NAND flash technology: approx. 100,000 P/E cycles
- MLC NAND flash technology: approx. 3,500 to 10,000 P/E cycles
- TLC NAND flash technology: approx. 3,000 P/E cycles
- QLC NAND flash technology: approx. 100 to 1,000 P/E cycles
This means that once these cycles have been used up, the drive can no longer store data reliably. According to Tesla's report, the Hynix units "are rated for 3,000 program/erase cycles for each block of NAND flash within the eMMC".
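To put these ratings into perspective, the following minimal sketch estimates how much data could be written to an 8GB device before the rated P/E cycles are used up. It assumes perfectly even wear across all blocks, and the `write_amplification` parameter is a hypothetical knob for illustration (the source gives no figure for it); the per-type cycle counts take the upper end of the approximate ranges listed above.

```python
# Approximate P/E ratings from the list above (upper end where a range is given).
PE_CYCLES = {"SLC": 100_000, "MLC": 10_000, "TLC": 3_000, "QLC": 1_000}


def total_writable_gb(capacity_gb, pe_cycles, write_amplification=1.0):
    """Rough volume of host data (in GB) that can be written before the
    rated P/E cycles are exhausted, assuming perfectly even wear.

    write_amplification is an illustrative assumption: the ratio of data
    physically written to the flash versus data written by the host.
    """
    return capacity_gb * pe_cycles / write_amplification


if __name__ == "__main__":
    for cell_type, cycles in PE_CYCLES.items():
        budget = total_writable_gb(8, cycles)
        print(f"{cell_type}: roughly {budget:,.0f} GB of writes on an 8GB device")
```

Under these assumptions, a higher write amplification shrinks the budget proportionally, which is one reason the same flash can last very different lengths of time under different workloads.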
To understand why NAND flash cells always have a limited number of P/E cycles, one has to understand the technology they are based on. NAND flash is a type of Non-Volatile Memory (NVM) technology that stores data in arrays of memory cells built using either Charge Trap technology or Floating-Gate MOSFET transistors. By applying a high voltage to the control gate of the transistor while the source and drain are grounded, the electrons in the channel can gain enough energy to overcome the oxide barrier and move from the channel into the floating gate. This process of trapping electrons in the floating gate is the programming (or "write") operation of a flash device, which corresponds to a logical bit 0. Conversely, the erase operation extracts the electrons from the floating gate, switching the data stored in the cell to a logical bit 1. NAND flash cells wear out because program and erase cycles eventually damage the isolating layer between the floating gate and the substrate. This reduces data retention and can lead to loss of data or to cells being programmed unintentionally.
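The program and erase behaviour described above can be pictured with a deliberately simplified toy model. The `FlashCell` class below is hypothetical and only tracks the logical bit and a cycle counter; real cells degrade gradually rather than failing at a fixed count, so the `within_rating` check is an illustrative assumption, not a description of the physics.

```python
class FlashCell:
    """Toy single-bit flash cell: erased = logical 1, programmed = logical 0."""

    def __init__(self, rated_pe_cycles=3000):
        self.bit = 1                          # a fresh cell starts erased
        self.pe_cycles = 0                    # completed program/erase cycles
        self.rated_pe_cycles = rated_pe_cycles

    def program(self):
        # Electrons are trapped in the floating gate: logical 0.
        self.bit = 0

    def erase(self):
        # Electrons are extracted from the floating gate: logical 1.
        # Each erase completes one P/E cycle and adds wear to the oxide.
        self.bit = 1
        self.pe_cycles += 1

    @property
    def within_rating(self):
        # Simplification: treat the cell as reliable until the rating is reached.
        return self.pe_cycles < self.rated_pe_cycles


if __name__ == "__main__":
    cell = FlashCell(rated_pe_cycles=2)
    for _ in range(2):
        cell.program()
        cell.erase()
    print("cycles used:", cell.pe_cycles, "within rating:", cell.within_rating)
```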
Understanding the Use Case Workload in eMMC modules
Tesla EVs are a challenging environment for any storage application, not only because of the automotive quality demands on temperature and functional safety, but because each vehicle is used differently. In this case, the eMMC modules were affected by daily drive time, daily charge time, daily music streaming time, and a range of other factors. Furthermore, many vital functions and features depended on the MCU's ability to do its job reliably. The eMMC in this ecosystem faces a unique, industrial-grade workload that can only be handled properly with a high-quality flash memory controller designed to industrial standards.
Tesla claimed "at the nominal daily P/E cycle use rate of 0.7 per block, it would take between 11 and 12 years to accumulate an average of 3,000 P/E cycles per block in the device, at the 95th percentile of daily P/E cycle use rate of 1.5 per block it would take five to six years to accumulate an average of 3,000 P/E cycles per block in the device." At the end of the day, the demands of the compounding firmware updates caused these drives to fail well before the anticipated time frame. This raises the question: why were these MCUs failing so early?
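The arithmetic behind Tesla's quoted figures is simple to reproduce. The sketch below divides the rated 3,000 P/E cycles by the daily use rate per block; the function name is illustrative, and the third rate (3.0 per block per day) is a hypothetical value added only to show how quickly the lifetime shrinks as the workload grows.

```python
DAYS_PER_YEAR = 365.25


def years_to_reach_rating(rated_pe_cycles, daily_pe_per_block):
    """Years until the average block accumulates its rated P/E cycles."""
    return rated_pe_cycles / daily_pe_per_block / DAYS_PER_YEAR


if __name__ == "__main__":
    rates = {
        "nominal (0.7/day, from Tesla's statement)": 0.7,
        "95th percentile (1.5/day, from Tesla's statement)": 1.5,
        "hypothetical heavier workload (3.0/day)": 3.0,
    }
    for label, rate in rates.items():
        years = years_to_reach_rating(3000, rate)
        print(f"{label}: {years:.1f} years")
```

The first two results fall inside the ranges Tesla quoted (between 11 and 12 years, and five to six years). Note that this estimate covers only the average block; it says nothing about how evenly the controller spreads wear.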
Understanding the role of the NAND flash controller in Storage Systems
The role of the flash memory controller in high-end storage systems is often overlooked. While the NAND flash is often thrust into the spotlight, many neglect to assess how well the controller can manage their application and the selected flash's pre-defined P/E cycles. Flash technology plays a significant role in defining the lifetime of a drive, but the selected controller should mask the inherent imperfections of the flash, extending its life and guarding against device failure and data corruption.
For example, the best type of Error Correction Coding (ECC) that the flash memory controller can carry out for a given storage device depends entirely on the characteristics of the selected NAND flash and the processing performance available in the controller. Different types of errors are also more common in different types of NAND flash; for example, read-disturb errors are more likely in Multi-Level Cells (MLCs). Other controller features, such as wear leveling and the timing of garbage collection, are affected by the amount of over-provisioning in the NAND flash. As a result, the controller needs to be carefully matched to the characteristics of the NAND flash. If this is overlooked, it's no wonder drives crash earlier than predicted. It's an expensive oversight, and choosing the right flash controller is a vital part of designing an efficient and reliable storage system such as an eMMC module.
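To illustrate why a controller feature such as wear leveling matters for lifetime, here is a toy comparison, assuming a workload that keeps rewriting the same piece of "hot" data. The function and its block counts are hypothetical; production controllers combine mapping tables, garbage collection, and static and dynamic wear leveling, none of which is modelled here.

```python
def max_erase_count(num_blocks, writes, wear_leveling):
    """Return the highest per-block erase count after a run of rewrites.

    Without wear leveling, hot data always lands on the same physical block.
    With this naive form of wear leveling, each rewrite goes to the block
    that has been erased the fewest times so far.
    """
    erase_counts = [0] * num_blocks
    for _ in range(writes):
        if wear_leveling:
            target = erase_counts.index(min(erase_counts))
        else:
            target = 0
        erase_counts[target] += 1
    return max(erase_counts)


if __name__ == "__main__":
    worst_without = max_erase_count(num_blocks=8, writes=800, wear_leveling=False)
    worst_with = max_erase_count(num_blocks=8, writes=800, wear_leveling=True)
    print("most-worn block without wear leveling:", worst_without)
    print("most-worn block with wear leveling:", worst_with)
```

In this simplified setting, the same number of writes wears the most-used block eight times faster without wear leveling, which is the kind of gap that separates an average-based lifetime estimate from what a device experiences in the field.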
Ultimately, in the industrial sector, failing systems and data corruption are less acceptable than in other markets because lifetime expectations are higher and failures are significantly more costly. Storage systems like eMMC modules need to be designed around their unique workload and managed properly to avoid failures in their specific field. Lastly, the flash memory controller plays an incredibly important role in masking the imperfections of the chosen NAND flash technology and should be considered a core component, not just a supporting part for the NAND flash.
Lena Harman is responsible for digital marketing, online strategy and the optimization of online platforms at Hyperstone. She holds a double degree in Communications and International Studies from the University of Technology, Sydney.