SRAM-based FPGAs are increasingly being used in orbit, and if an essential bit is flipped, the consequences could be catastrophic.
The configuration memory of SRAM-based space-grade FPGAs can be susceptible to single-event upsets (SEUs). A soft error may or may not impact device functionality, depending on the criticality of the affected bit. If upsets accumulate, the probability of failure increases.
To mitigate unintended changes, some devices harden the circuit design of the configuration memory cells to substantially increase the critical charge necessary to flip a bit, and interleave the layout of the configuration RAM (CRAM) so that physically adjacent cells are not contained in the same configuration packet. The latter reduces the likelihood of a multiple-bit upset resulting from a single radiation strike, increasing the effectiveness of the error detection and correction (EDAC) logic. Frame-based error-correcting code (ECC) combined with a CRC of the complete bitstream has increased the number of random bit errors that can be detected. Collectively, these innovations have significantly improved FPGA reliability and availability, lowering soft-error failure rates.
Ordinarily for Xilinx’s XQRKU060 FPGA, as the bitstream is initially loaded into the device, a CRC value is calculated from the data frames. Following configuration, an optional check can be initiated that compares the actual checksum with the expected one, and if these do not match, configuration is aborted.
Figure 1 This diagram shows the normal configuration sequence for Xilinx FPGAs. Source: Xilinx
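The load-time comparison described above can be sketched in a few lines of Python. This is only an illustration: `zlib.crc32` stands in for the device's own CRC, the frame contents are made-up test data, and the function names `running_crc` and `configuration_ok` are hypothetical rather than part of any Xilinx tool.

```python
import zlib

def running_crc(frames):
    """Accumulate a CRC-32 over data frames in the order they are loaded."""
    crc = 0
    for frame in frames:
        crc = zlib.crc32(frame, crc)  # continue from the previous frame's CRC
    return crc

def configuration_ok(frames, expected_crc):
    """True when the CRC of the loaded frames matches the expected value."""
    return running_crc(frames) == expected_crc

# Made-up "bitstream": four 8-byte frames (real frames are far larger).
golden_frames = [bytes([i] * 8) for i in range(4)]
expected_crc = running_crc(golden_frames)

# Flip a single bit in frame 2 to imitate a corrupted load.
corrupted_frames = list(golden_frames)
bad = bytearray(corrupted_frames[2])
bad[0] ^= 0x01
corrupted_frames[2] = bytes(bad)

print(configuration_ok(golden_frames, expected_crc))     # clean load
print(configuration_ok(corrupted_frames, expected_crc))  # corrupted load
```

A CRC of this kind only reports that something changed; it does not say which bit, which is why it is paired with frame-level ECC for correction.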
For SRAM-based FPGAs, scrubbing is the collective name given to a range of techniques used to refresh (or re-program) the configuration memory, or detect (readback) and correct (writeback) errors in the background during normal device operation to prevent the accumulation of SEUs.
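Why the scrub period matters can be illustrated with a simple model. Assuming, purely for illustration, that upsets arrive as a Poisson process with a constant rate, the chance that two or more accumulate between consecutive scrubs grows quickly with the scrub period. The rate used below is a hypothetical number, not a measured figure for any device or orbit.

```python
import math

def prob_accumulation(upset_rate_per_s, scrub_period_s):
    """P(two or more upsets within one scrub period) under a Poisson model."""
    mean = upset_rate_per_s * scrub_period_s  # expected upsets per period
    # 1 - P(0 upsets) - P(1 upset)
    return 1.0 - math.exp(-mean) * (1.0 + mean)

HYPOTHETICAL_RATE = 1e-4  # upsets per second, assumed for illustration only

for period_s in (1, 60, 3600):
    p = prob_accumulation(HYPOTHETICAL_RATE, period_s)
    print(f"scrub every {period_s:>5} s -> P(accumulation) = {p:.3e}")
```

For small values of rate times period the result is close to half the square of that product, so halving the scrub period cuts the accumulation probability by roughly a factor of four in this model.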
An internal scrubber implements the above functionality within the fabric and can be coded by the user or provided by the supplier. For example, the V5QV FPGA provides continuous readback and writeback of configuration memory in the background. A CRC check value is calculated and compared with that computed from the original bitstream to validate the integrity of the current configuration. Some bits can change during normal user operation, e.g. those in distributed RAM or BRAM, and the CRC function can mask these variable locations.
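The idea of masking locations that legitimately change can be shown with a small comparison routine. This is a sketch under stated assumptions: frames are plain byte strings, a mask bit of 1 means "ignore this bit", and `masked_mismatches` is a hypothetical helper, not a vendor function.

```python
from typing import List

def masked_mismatches(readback: bytes, golden: bytes, mask: bytes) -> List[int]:
    """Return bit positions that differ from golden, skipping masked bits.

    Bit position = byte_index * 8 + bit_in_byte, with bit 0 the LSB.
    """
    if not (len(readback) == len(golden) == len(mask)):
        raise ValueError("readback, golden and mask must be the same length")
    flipped = []
    for i, (r, g, m) in enumerate(zip(readback, golden, mask)):
        diff = (r ^ g) & ~m & 0xFF  # differences outside the masked bits
        for bit in range(8):
            if diff & (1 << bit):
                flipped.append(i * 8 + bit)
    return flipped

# Made-up 3-byte frame: byte 1 models user RAM contents and is fully masked.
golden = bytes([0b10100000, 0x00, 0xFF])
mask = bytes([0x00, 0xFF, 0x00])
readback = bytes([0b10100001, 0x5A, 0xFF])  # byte 0 upset, byte 1 user data

flipped = masked_mismatches(readback, golden, mask)
print(flipped)
```

Only the upset in byte 0 is reported; the changed user data in the masked byte is ignored, so it does not raise a false error.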
There are two readback modes: verify and capture. The former reads all CRAM cells including the current values on all user memory elements to confirm the bitstream remains as intended, while the latter also allows access to the CLB and IOB registers, which by design change during normal operation. The time required to scrub the complete configuration using readback and writeback depends on which interface is used, the clock frequency, and the number of upsets detected.
Figure 2 These steps initiate the readback of configuration memory. Source: Xilinx
For the XQRKU060, Xilinx provides access to the internal scrubber through its Soft-Error Mitigation (SEM) IP, which uses the ICAP to check (readback) the contents of the FPGA’s configuration memory in the background during normal operation. It detects errors by re-calculating frame-level ECC and device-level CRC checksums and comparing them with the intended bitstream stored in external flash. Configuration bits are corrected if a mismatch is found. For the XQRKU060, the supported use of the readback CRC capability is through the SEM IP only.
The SEM IP also allows you to classify those bits that would result in a functional change if flipped. In a design that utilises 70% of the FPGA’s resources, typically 25 to 50% of the configuration bits are essential. The Vivado Design Suite creates a mask file of these, which can be stored in external flash memory. If an upset is detected, the SEM IP compares this with the mask data, correcting CRAM if necessary. The essential bits technology has been shown to reduce design-specific error rates by 50% compared to the intrinsic device SEU failures in time (FIT). Furthermore, the SEM IP includes fault injection to allow you to investigate the system response to a targeted SEU and architect mitigation strategies to reduce single-event functional interrupts (SEFIs).
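The classification step can be pictured as a simple lookup. The sketch below is hypothetical: the (frame, bit) addresses are invented, the function names are not taken from the SEM IP, and the proportional derating assumes upsets are spread uniformly over the configuration bits, which is a simplification.

```python
def is_essential(upset_location, essential_bits):
    """True if the upset hit a bit that the mask marks as essential."""
    return upset_location in essential_bits

def derated_upset_rate(device_upset_rate, essential_fraction):
    """Scale a device-level upset rate by the fraction of essential bits."""
    if not 0.0 <= essential_fraction <= 1.0:
        raise ValueError("essential_fraction must be between 0 and 1")
    return device_upset_rate * essential_fraction

# Invented essential-bit mask, keyed by (frame index, bit index).
essential_bits = {(12, 5), (12, 6), (40, 0)}
detected_upsets = [(12, 5), (13, 1)]

classified = [(loc, is_essential(loc, essential_bits)) for loc in detected_upsets]
print(classified)

# The 25% and 50% fractions are the range quoted in the text; 100.0 is an
# arbitrary device-level rate in arbitrary units.
for fraction in (0.25, 0.50):
    print(fraction, derated_upset_rate(100.0, fraction))
```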
As an example, the maximum ICAP frequency for the XQRKU060 is 200 MHz. Using a 90 MHz clock, the times required for single-bit error mitigation are as follows.
Figure 3 These are SEM IP sample latency times required for single-bit error mitigation. Source: Xilinx
The SEM IP requires 425 LUTs, 490 flip-flops, 59 I/O, 4 RAMB36 blocks, and one DSP48 core when implemented on the XQRKU060. An arbiter is also available to share the ICAP between the SEM IP and other functions. Please note that the SelectMAP interface is not available while the SEM IP is using the ICAP, and you may need to consider using the JTAG interface to scrub deliberately injected errors.
Similarly, NanoXplore’s rad-hard FPGAs contain an internal scrubber known as the Configuration Memory Integrity Check (CMIC). As the bitstream is loaded during power-up, an ECC-protected checksum is calculated. When enabled, the FPGA’s internal configuration is read and compared with this stored reference. At the maximum clock frequency of 50 MHz, the scrub period can be set from 5.3 ms to 65 days, with a memory scan requiring 4 ms. At a clock frequency of 20 MHz, the scrub period can be set from 13.25 ms to 162.5 days, with a check requiring 10 ms.
Figure 4 NanoXplore’s NG-Medium space-grade SRAM FPGA contains an internal scrubber. Source: NanoXplore
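The CMIC figures quoted for 50 MHz and 20 MHz are consistent with timings that scale inversely with the clock frequency. The short check below reproduces the 20 MHz numbers from the 50 MHz ones; the inverse-proportional rule is inferred from the quoted values, and `scale_with_clock` is a hypothetical helper, so timings at other frequencies should be treated as estimates.

```python
REF_CLOCK_HZ = 50e6  # maximum CMIC clock quoted in the text

def scale_with_clock(value_at_ref, clock_hz, ref_clock_hz=REF_CLOCK_HZ):
    """Scale a timing quoted at the reference clock to another clock."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return value_at_ref * ref_clock_hz / clock_hz

# Values quoted at 50 MHz.
MIN_PERIOD_MS = 5.3
MAX_PERIOD_DAYS = 65.0
SCAN_TIME_MS = 4.0

clock_hz = 20e6
min_period_ms = scale_with_clock(MIN_PERIOD_MS, clock_hz)
max_period_days = scale_with_clock(MAX_PERIOD_DAYS, clock_hz)
scan_time_ms = scale_with_clock(SCAN_TIME_MS, clock_hz)

print(f"min scrub period: {min_period_ms:.2f} ms")
print(f"max scrub period: {max_period_days:.1f} days")
print(f"memory scan time: {scan_time_ms:.1f} ms")
# Share of each period spent scanning at the shortest setting.
print(f"scan/period ratio: {scan_time_ms / min_period_ms:.3f}")
```

Because both the scan time and the shortest period scale together, the ratio between them is the same at either clock: at the fastest setting, roughly three quarters of each period is spent scanning.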
All internal scrubbing solutions may be susceptible to single-event effects (SEEs), depending on the hardness of the device. External solutions are typically implemented using a non-volatile FPGA or a microcontroller, and range in complexity from periodic re-booting (soft reset) of the entire FPGA configuration from the off-chip non-volatile memory (blind scrubbing), to checking (readback) the contents of the FPGA’s configuration in the background during normal operation by comparing with the original bitstream stored in external flash, and correcting (writeback) if a mismatch is found.
For the XQRKU060 FPGA, the simplest external solution is to periodically pulse the PROGRAM_B pin in bank 0 to force a re-configuration. This blind scrubbing method is only suitable for those applications that can tolerate interruption to normal FPGA operation, e.g. LEO Earth-observation missions, and you decide the scrub rate: how often you want to soft reset. The time taken to re-configure the device depends on the mode, the external configuration-clock frequency, and whether the external memory uses a parallel or serial interface. Obviously this technique does not perform any checking, simply refreshing the FPGA’s internal configuration with the bitstream stored in off-chip flash. The INIT_B pin remains low during configuration or if a CRC error is detected, and a high on the DONE output indicates successful device configuration.
For those missions that cannot tolerate an interruption to normal FPGA operation, more sophisticated external scrubbing solutions, implemented using an off-chip microprocessor or a non-volatile flash or antifuse FPGA, manage configuration at power-up by streaming the bitstream from the flash memory to the XQRKU060 in slave mode, as shown in Figure 5.
Figure 5 This is an example of Slave SelectMAP configuration mode. Source: Xilinx
The XQRKU060 FPGA allows access to its configuration memory using the SelectMAP, ICAP, and JTAG interfaces. During normal operation, the microprocessor or the non-volatile FPGA accesses the SelectMAP or JTAG ports in the background to either refresh the XQRKU060’s complete configuration (blind scrubbing), or read back its contents, compare them with the original golden image stored in the external flash memory, and correct if required to prevent the accumulation of SEUs. The former is quicker as individual frames are simply re-written, and you decide the scrub rate based on your mission’s reliability needs. The speed of the latter depends on whether the interface to the external flash memory is parallel or serial, the external configuration clock frequency, and the number of upsets detected.
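The two external strategies, blind rewriting and readback with selective writeback, differ only in whether a comparison precedes the write. The sketch below models both against an in-memory stand-in for the device. `FakeConfigPort`, `flip_bit`, and `scrub_pass` are hypothetical names invented for this illustration; a real scrubber would drive the SelectMAP or JTAG port and handle frame addressing, which is not modelled here.

```python
class FakeConfigPort:
    """Hypothetical stand-in for a configuration port; frames live in RAM."""

    def __init__(self, frames):
        self.frames = [bytes(f) for f in frames]

    def read_frame(self, index):
        return self.frames[index]

    def write_frame(self, index, data):
        self.frames[index] = bytes(data)

def flip_bit(port, frame_index, bit_index):
    """Inject a single-bit upset into one frame of the fake device."""
    data = bytearray(port.read_frame(frame_index))
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    port.write_frame(frame_index, data)

def scrub_pass(port, golden_frames, blind=False):
    """Run one scrub pass and return the indices of rewritten frames.

    blind=True rewrites every frame without reading it back first.
    """
    rewritten = []
    for index, golden in enumerate(golden_frames):
        if blind or port.read_frame(index) != golden:
            port.write_frame(index, golden)
            rewritten.append(index)
    return rewritten

# Made-up golden image: eight 4-byte frames.
golden_frames = [bytes([i] * 4) for i in range(8)]
port = FakeConfigPort(golden_frames)

flip_bit(port, 3, 0)   # upset in frame 3
flip_bit(port, 6, 17)  # upset in frame 6

first_pass = scrub_pass(port, golden_frames)               # repairs upsets
second_pass = scrub_pass(port, golden_frames)              # nothing to repair
blind_pass = scrub_pass(port, golden_frames, blind=True)   # rewrites all

print(first_pass, second_pass, len(blind_pass))
```

The readback variant also yields a record of which frames were upset, which can be logged as telemetry; the blind variant gives no such visibility.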
Table 1 lists the maximum bandwidths for the XQRKU060’s configuration ports. This FPGA also offers a power-on-reset option (POR_OVERRIDE) to reduce the initialisation delay when the device is first powered up.
Table 1 Clock rates, data widths, and bandwidths for XQRKU060 configuration ports
Previously, SEFIs were observed within the memory cells of the configuration control circuitry of the V5QV FPGA, and an external scrubber for the XQRKU060 should also check for these. These SEFIs are not a failure of the user design and could cause device-wide functionality or visibility issues for the configuration management scheme. The observed SEFIs are listed below together with their mitigation actions, and can be recovered by performing a soft reset by pulsing the PROGRAM_B pin.
Figure 6 These are the observed SEFIs from the V5QV FPGA. Source: Xilinx
For the XQRKU060, enabling the PERSIST bitstream option maintains the configuration logic access to the SelectMAP port after configuration for readback access. This allows you to re-configure the FPGA using an external device such as a microcontroller, without pulsing the PROGRAM_B pin, or using the JTAG interface.
The XQRKU060 also offers partial re-configuration to dynamically re-program specific regions in-orbit: active partitions within the fabric can be updated without compromising the integrity of applications running elsewhere within the FPGA that use the imported logic. Re-configurable modules can be swapped in and out as needed using the ICAP or SelectMAP, as illustrated below.
Figure 7 Partial re-configuration of specific user logic blocks can dynamically re-program specific regions in-orbit. Source: Xilinx
The main configuration (shown in blue above) uses a static, top-level place and route implementation and imported partitions have to respect this. Partial bit files can be as small as one frame or as large as the complete bitstream, and this technique is being used by some avionics sub-systems as a blind-scrubbing technique. If partial re-configuration is used to re-program an entire FPGA, its speed is similar (less power-up INIT_B/DONE initialisation) to conventional configuration as the same interfaces are used. The XQRKU060 requires a minimum flash memory of 256 Mb and the specifications of its configuration ports are listed in Table 1. For the maximum clock frequencies shown there, a parallel interface obviously supports a higher bandwidth, and requires less time to configure the device, but will require more real-estate on the PCB. It is important to note that following partial re-configuration, the golden-frame ECC and device-level CRC values must be re-calculated to prevent false errors during readback.
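The trade-off between parallel and serial interfaces follows from a first-order estimate: transfer time is the image size divided by the product of clock frequency and bus width. The sketch below uses the 256 Mb flash size quoted above as an upper bound on the image size (taken here as 256 x 2^20 bits) and a hypothetical 50 MHz configuration clock; it ignores start-up, command, and initialisation overhead, and the real port limits should be taken from Table 1.

```python
def config_time_s(image_bits, clock_hz, bus_width_bits):
    """First-order transfer time: bits / (clock rate x bits per clock)."""
    if image_bits <= 0 or clock_hz <= 0 or bus_width_bits <= 0:
        raise ValueError("all arguments must be positive")
    return image_bits / (clock_hz * bus_width_bits)

IMAGE_BITS = 256 * 2**20  # upper bound: assumes the image fills a 256 Mb flash
CLOCK_HZ = 50e6           # hypothetical configuration clock, not from Table 1

serial_s = config_time_s(IMAGE_BITS, CLOCK_HZ, 1)    # 1-bit serial interface
parallel_s = config_time_s(IMAGE_BITS, CLOCK_HZ, 8)  # 8-bit parallel interface

print(f"serial x1:   {serial_s:.3f} s")
print(f"parallel x8: {parallel_s:.3f} s")
```

The same function applies to a partial bit file by passing its size instead of the full image, which is why a partial update covering a small region completes proportionally faster.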
Spacechips offers scrubbing and in-orbit re-configuration solutions and teaches these on our FPGA training course. I presented a talk on both topics at the XRTC on the 17th of June. Until next month, the best suggestion of how scrubbing can be improved will win a Courses for Rocket Scientists World Tour t-shirt. Congratulations to Daisy from South Africa, the first to answer the riddle from my previous post.
Dr. Rajan Bedi is the CEO and founder of Spacechips, which designs and builds a range of advanced, L to Ku-band, ultra-high-throughput on-board processors and transponders for telecommunication, Earth-observation, navigation, internet, and M2M/IoT satellites. You can also contact Rajan on Twitter.