Working with embedded flash isn’t necessarily as easy as it may appear, particularly if you have a demanding application. These top 10 tips outline the things that engineers need to consider when specifying embedded flash to help make the process simpler.
#10: FOB vs. steady-state performance
The performance of flash devices deteriorates over time through usage or fragmentation on the physical level. When a drive is new or has been through only a few program/erase (P/E) cycles, it is deemed to be fresh out of the box (FOB). Flash cannot be overwritten in place: new data can be written, or programmed, only into erased space, and erasing works on whole blocks. Before a block holding obsolete data can be reused, any data in it that is still valid must be moved and the block must be erased. This process is called garbage collection. In the FOB state, there is no garbage, so all solid-state drives (SSDs) and flash devices initially deliver very high performance. However, as more data is added and old data is invalidated, that initial peak levels off into what is called a steady state. The impact of the cleaning process on performance depends on the structure of the garbage: whether it is highly fragmented over many blocks after many small random writes or concentrated in a few blocks. It is a challenge for designers to predict what the steady-state performance will be for their use case. There is not always an applicable standard benchmark, so it is vital to check the performance of a drive after it has been through two or three P/E cycles while being exposed to a representative use case.
#9: Seeking consistent performance
SSD or flash drive configurations can be optimized to deliver the required performance, whether that is for speed, data retention, reliability, or endurance of the system. In order to tune the configuration, it is crucial to know the use case. Hyperstone offers tools to assess the use case and the impact on the physical level of the flash. These tools can be used in the target system during a qualification with a representative use-case model, or “in flight” during use. This information enables a full understanding of the performance behavior of the flash device over its life cycle. Also, based on that information, several configurations of the system, the firmware, or the flash translation layer (FTL) can be optimized.
#8: Overprovisioning
One option available to designers is overprovisioning (OP). This can deliver a lower write amplification factor (WAF), which equates to greater endurance. OP will reserve a certain physical area of storage for the FTL. By changing the OP, you can trade capacity for performance and endurance. Carefully assessing the workload and tuning the OP enables a more cost-effective solution that performs better and lasts longer.
To support this effort, health monitoring and lifetime estimation are extremely valuable tools. During qualification, you can monitor lifetime parameters such as block erase counts and bit errors to evaluate how your storage media behaves within your system. Once deployed in the field, you can remotely monitor the status of the solution or have the health-monitoring utility issue warnings when the media approaches its end of life (EoL).
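The trade-off between capacity, WAF, and endurance can be made concrete with a little arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, OP is expressed relative to user capacity (some vendors quote it differently), and the lifetime estimate assumes ideal wear leveling and a constant WAF, which real workloads rarely deliver.

```python
def overprovisioning_ratio(physical_bytes, user_bytes):
    """Spare capacity reserved for the FTL, as a fraction of user capacity."""
    return (physical_bytes - user_bytes) / user_bytes


def write_amplification(flash_bytes_written, host_bytes_written):
    """WAF: bytes physically programmed per byte the host asked to write."""
    return flash_bytes_written / host_bytes_written


def estimated_host_writes(physical_bytes, rated_pe_cycles, waf):
    """Rough total of host bytes that can be written before the rated
    P/E cycles are used up (assumes perfect wear leveling)."""
    return physical_bytes * rated_pe_cycles / waf


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical drive: 64 GiB of raw flash exposed as 56 GiB to the host.
    print(f"OP: {overprovisioning_ratio(64 * GIB, 56 * GIB):.1%}")
    # Hypothetical counters read during qualification.
    waf = write_amplification(flash_bytes_written=900 * GIB,
                              host_bytes_written=300 * GIB)
    print(f"WAF: {waf:.2f}")
    total = estimated_host_writes(64 * GIB, rated_pe_cycles=3000, waf=waf)
    print(f"Estimated host writes: {total / GIB:.0f} GiB")
```

Measuring the two byte counters under a representative workload, then repeating the measurement with a different OP setting, shows directly how much endurance a given amount of reserved capacity buys.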
#7: Managing static data with RDM and DDR
Managing static data (also known as data at rest) is often overlooked. Numerous physical effects degrade the retention of a memory cell, and the number of bit errors can be determined only by reading the data. As a consequence, if data is static and rarely accessed, errors accumulate unnoticed until the data is read. By then, it may be too late for correction, and the information is lost. Hyperstone’s dynamic data refresh (DDR) checks the full drive at intervals decided by the end user to make sure that errors can still be corrected. The data is rewritten if necessary.
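The idea behind a periodic refresh scan can be sketched in a few lines of Python. This is a simplified model and not Hyperstone's implementation: the Page class, the thresholds, and the assumption that a rewrite resets the error count to zero are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    data: bytes
    bit_errors: int = 0  # errors the ECC reported on the last read (simulated)


def refresh_scan(pages, ecc_limit, refresh_threshold):
    """Read every page once and rewrite those whose error count is
    approaching the ECC limit. Returns (refreshed, lost) index lists."""
    refreshed, lost = [], []
    for index, page in enumerate(pages):
        if page.bit_errors > ecc_limit:
            # More errors than the ECC can correct: the scan came too late.
            lost.append(index)
        elif page.bit_errors >= refresh_threshold:
            # Still correctable: rewrite the corrected data (modeled here
            # as simply resetting the accumulated error count).
            page.bit_errors = 0
            refreshed.append(index)
    return refreshed, lost
```

The scan interval and the refresh threshold together set the trade-off: a lower threshold or a shorter interval costs more background writes but leaves a wider margin before errors become uncorrectable.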
Read disturb management (RDM) handles the disturbance induced on a memory cell when adjacent cells are read. An example, relevant in the automotive industry, is the design of navigation systems. While almost all data in navigation systems is static, the pages holding the map around one’s home are read more often than other data. As the navigation system repeatedly reads these pages, their neighboring pages are also disturbed. RDM counts the reads in adjacent physical pages and rewrites the affected data when necessary. Both RDM and DDR require background activities and can be set up to manage the trade-off between reliability and performance.
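A read-disturb counter can be modeled as follows. This Python sketch is an assumption-laden simplification: it keeps one counter per physical block and uses a fixed read limit, whereas a real controller may track reads at a different granularity and derive its limits from the flash vendor's data. The class and method names are hypothetical.

```python
from collections import defaultdict


class ReadDisturbMonitor:
    """Counts reads per physical block and flags blocks due for a rewrite."""

    def __init__(self, read_limit):
        self.read_limit = read_limit
        self.read_counts = defaultdict(int)

    def record_read(self, block):
        """Count one read; return True once the block should be rewritten."""
        self.read_counts[block] += 1
        return self.read_counts[block] >= self.read_limit

    def mark_rewritten(self, block):
        """Data was moved to a fresh block, so the count starts over."""
        self.read_counts[block] = 0
```

In the navigation example, the block holding the map around the driver's home would reach the limit long before the rest of the drive, and only that block would be rewritten.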
#6: Ensuring health
Health monitoring throughout the lifetime of a device can be crucial to ensure reliable operation and avoid unexpected failure. Self-Monitoring, Analysis, and Reporting Technology (SMART) delivers important data about the status of the NAND flash. Initially, this function was developed to monitor the health of hard-disk drives (HDDs), but it is also a very powerful and precise tool for NAND flash devices, especially if mission-critical information as to what is happening on the physical level can be monitored. In this case, it enables users to monitor the lifetime of the flash device using different attributes. Foreseeing impending issues before they happen or reach critical levels can avoid system downtime or loss of valuable data. State-of-the-art tools like Hyperstone’s hySMART utility assist in this process and allow users to monitor and track their solution. The results can be displayed on a graphical user interface (GUI).
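As an illustration of how lifetime attributes can be turned into an early warning, the Python sketch below classifies a device from its block erase counts and remaining spare blocks. The attribute names, the 90% warning level, and the decision rules are assumptions made for this example; they are not SMART attribute definitions and not the logic of any particular utility.

```python
def health_status(erase_counts, rated_pe_cycles, spare_blocks_left,
                  warn_at_percent=90.0):
    """Classify a device as 'ok', 'warning', or 'end_of_life'.

    erase_counts:      per-block erase counts read from the device
    rated_pe_cycles:   P/E cycles the flash is rated for
    spare_blocks_left: spare blocks still available to replace bad ones
    """
    # Judge wear by the most-worn block, the first one expected to fail.
    worn_percent = 100.0 * max(erase_counts) / rated_pe_cycles
    if spare_blocks_left == 0 or worn_percent >= 100.0:
        return "end_of_life"
    if worn_percent >= warn_at_percent:
        return "warning"
    return "ok"
```

A monitoring agent in the field could evaluate such a rule periodically and raise an alert on the first "warning" result, leaving time to schedule a replacement.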
#5: Improved mapping with hyMap
Either block-mapping or page-mapping strategies can be adopted to suit different use cases. For consumer products, cost is often the prime concern; for enterprise products, it is performance; and for industrial applications, it is reliability. For video recording — a use case with mostly large sequential writes — block-based mapping is suitable. That is why it is mostly used within consumer USB flash drives or SD cards. In this approach, file system sectors are combined into blocks and mapped to physical blocks. For operating systems, databases, or logging applications, the FTL may need to be optimized for small random writes. Hence, another approach is page-based mapping — a more granular approach like Hyperstone’s hyMap — in which file system sectors are mapped to physical pages on the flash.
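The difference between the two strategies shows up in how many flash pages have to be programmed for a given write pattern. The Python sketch below is a deliberately naive model, assuming 64 pages per block, already-populated storage, a block-mapped FTL that rewrites a whole block for any update, and no garbage-collection overhead on the page-mapped side. The function names are hypothetical, and real FTLs are more sophisticated on both sides.

```python
PAGES_PER_BLOCK = 64  # assumed flash geometry


def pages_programmed_block_mapped(writes):
    """Each write is (start_page, page_count). Every block touched is
    rewritten in full, because the map only tracks whole blocks."""
    total = 0
    for start, count in writes:
        first_block = start // PAGES_PER_BLOCK
        last_block = (start + count - 1) // PAGES_PER_BLOCK
        total += (last_block - first_block + 1) * PAGES_PER_BLOCK
    return total


def pages_programmed_page_mapped(writes):
    """Each logical page can be redirected to any free physical page,
    so only the pages actually written are programmed."""
    return sum(count for _, count in writes)


if __name__ == "__main__":
    sequential = [(0, 128)]                 # e.g., video recording
    random_small = [(5, 1), (70, 1), (200, 1)]  # e.g., database updates
    for name, workload in (("sequential", sequential),
                           ("random", random_small)):
        print(name,
              pages_programmed_block_mapped(workload),
              pages_programmed_page_mapped(workload))
```

For the large sequential write, both strategies program the same number of pages, which is why block mapping is adequate for video recording. For the three single-page updates, the block-mapped model programs 64 times as many pages as the page-mapped one.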
#4: Invisible caching trade-offs
Single-level cell (SLC) flash can store one bit of information in each cell. Programming is faster, but the cost per bit is higher than that of multi-level cell (MLC) NAND. SLC is used for applications in which reliability, endurance, and speed — especially for lower-capacity drives — are essential, such as industrial automation, networking, and robotics. MLC can store two bits of data per cell and, therefore, is much cheaper per bit than SLC. However, MLC has lower write performance, and because the threshold voltages of its states are closer together, more accuracy is required to program the cells.
Triple-level cell (TLC) and quad-level cell (QLC) can store three or four bits per cell, respectively, thereby achieving even lower cost per bit. TLC and QLC are currently targeting consumer applications for which capacity is more important than lifetime and reliability. While their page-read times are comparable, the program times for the physical flash array increase dramatically from SLC to QLC. Also, the amount of data per program operation becomes larger. In order to save SRAM area, cheap controller solutions overwrite the SRAM buffer immediately after the data is transferred to the flash but before the write confirmation of the flash is received. This means that they cannot handle program errors.
With SLC, MLC, or TLC, performance can be boosted with flash controller features such as early acknowledge (EA), which is a write-caching feature. If a host sends 4 KB of random data to be written, the controller confirms reception to the host even though this data has not yet been written to the flash. Once the controller has received four 4-KB clusters, enough to complete a flash page, it writes all four clusters into one page. For TLC and QLC in particular, RAID-like features are necessary to ensure reliable operation. Depending on how these are configured, they can have a tremendous impact on random write performance and the WAF.
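The early-acknowledge behavior described above can be modeled as a small write cache. The Python sketch below assumes a 16-KB flash page made up of four 4-KB clusters; the class name and the at_risk_bytes helper are hypothetical and serve only to make the trade-off visible.

```python
CLUSTER_SIZE = 4096     # 4-KB host cluster
CLUSTERS_PER_PAGE = 4   # assumed: one flash page holds four clusters


class EarlyAckCache:
    """Acknowledges each cluster at once; programs flash one page at a time."""

    def __init__(self):
        self.pending = []      # clusters acknowledged but not yet in flash
        self.flash_pages = []  # stand-in for pages programmed to the flash

    def host_write(self, cluster):
        if len(cluster) != CLUSTER_SIZE:
            raise ValueError("expected exactly one 4-KB cluster")
        self.pending.append(cluster)
        if len(self.pending) == CLUSTERS_PER_PAGE:
            # A full page has accumulated: program it in one operation.
            self.flash_pages.append(b"".join(self.pending))
            self.pending.clear()
        return "ack"  # the host is told the write is done either way

    def at_risk_bytes(self):
        """Acknowledged data that would be lost if power failed right now."""
        return len(self.pending) * CLUSTER_SIZE
```

The host sees low latency on every small write, but up to three clusters at a time exist only in the controller's buffer. That window is the invisible part of the trade-off and is one reason power-fail handling matters.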
#3: Trading capacity for performance and reliability
It is possible to operate MLC and TLC in so-called pseudo modes (pSLC or pMLC). For example, a pSLC NAND is a standard MLC NAND in which only one bit per cell is programmed. In other words, pSLC is a particular usage model of an MLC flash that gives up half of the capacity. Storing only one bit per cell makes endurance, reliability, and data retention significantly better than in MLC operation and brings them closer to those of SLC. Hyperstone offers dedicated firmware (FW) for such modes and works together with flash vendors to optimize this for the most reliable usage.
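The capacity cost of a pseudo mode follows directly from the number of bits used per cell. The short Python sketch below shows the arithmetic; the function name is hypothetical, and the calculation ignores any additional capacity that a particular firmware might reserve.

```python
def pseudo_mode_capacity(native_bytes, native_bits_per_cell, used_bits_per_cell):
    """Usable capacity when only some of the bits per cell are programmed."""
    if not 1 <= used_bits_per_cell <= native_bits_per_cell:
        raise ValueError("used bits must be between 1 and the native bits per cell")
    return native_bytes * used_bits_per_cell // native_bits_per_cell


if __name__ == "__main__":
    GIB = 1024 ** 3
    # 64-GiB MLC (2 bits per cell) operated as pSLC (1 bit per cell)
    print(pseudo_mode_capacity(64 * GIB, 2, 1) // GIB, "GiB")
    # 96-GiB TLC (3 bits per cell) operated as pMLC (2 bits per cell)
    print(pseudo_mode_capacity(96 * GIB, 3, 2) // GIB, "GiB")
```

An MLC device run as pSLC therefore offers half of its native capacity, and a TLC device run as pSLC offers one third.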
#2: Data in power-failure events
It is important that a power failure doesn’t result in corrupted data or worse. If a log of recent flash transactions is maintained, the controller can recover the last valid entry. Data from a write operation that is active when the power fails may be lost, but the controller should still be able to recover the last valid data. Hyperstone controllers use an internal voltage detector to monitor the supply voltage, and this can be enabled to stop flash accesses early in power-fail situations. If enabled, the FW will finish the current command and assert the flash write-protect. Additionally, a guard time can be set to avoid false triggering due to noise. Even if the detector is disabled, the FW continues to work in a power-fail situation until the voltage has fallen to a second, lower level at which the internal reset detector triggers. Hyperstone carries out extensive tests on this behavior before releasing any controller and firmware. When choosing file systems, controller systems, and firmware configurations, system designers should consider whether their system is susceptible to unexpected power-downs.
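One common way to make a transaction log recoverable is to protect each entry with a checksum, so that an entry torn by a power loss can be recognized and ignored. The Python sketch below illustrates that general technique under stated assumptions: the entry layout (sequence number, length, payload, CRC-32) and the function names are invented for this example and do not describe the log format of any actual controller.

```python
import struct
import zlib

HEADER = struct.Struct("<IH")  # sequence number, payload length
CRC = struct.Struct("<I")      # CRC-32 over header and payload


def encode_entry(seq, payload):
    """Build one log entry: header, payload, then a CRC over both."""
    body = HEADER.pack(seq, len(payload)) + payload
    return body + CRC.pack(zlib.crc32(body))


def last_valid_entry(log):
    """Walk the log from the start and return (seq, payload) of the last
    entry whose CRC matches, or None if there is no intact entry."""
    offset, last = 0, None
    while offset + HEADER.size <= len(log):
        seq, length = HEADER.unpack_from(log, offset)
        body_end = offset + HEADER.size + length
        if body_end + CRC.size > len(log):
            break  # entry was cut off by the power failure
        (stored_crc,) = CRC.unpack_from(log, body_end)
        if zlib.crc32(log[offset:body_end]) != stored_crc:
            break  # entry was only partially programmed
        last = (seq, log[offset + HEADER.size:body_end])
        offset = body_end + CRC.size
    return last
```

After an unexpected power-down, the recovery code discards everything from the first damaged entry onward and resumes from the last intact one. The interrupted write is lost, but the state before it remains consistent.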
#1: Defining requirements
For an optimum design to be achieved, you must first understand your target use case and the specific demands and requirements for the device. Armed with this information, you will be able to select the controller, firmware, and configurations that provide the optimum solution. The heart of any storage system is its controller, which controls and defines both its behavior and reliability. To meet specific use-case demands — whether it be consumer, enterprise, or industrial — controllers and firmware should be optimized for cost, performance, or reliability. If reliability, data retention, and endurance are the prime concerns, then an industrial controller would be the best solution.
Conclusion
A variety of storage systems and controllers is available, differing in interface options and quality levels. If the storage system is vital for your application or holds sensitive data, or if failure would result in costly downtime, choose your controller carefully and make sure that you fully understand your requirements.
— Axel Mehnert is the VP Marketing & Product Strategy at Hyperstone.