Protecting your embedded software against memory corruption

Article By : Leandro Francucci

This article provides a software method that explains how to deal with corruption of memory data sets stored in non-volatile devices.

This article presents a software method for dealing with corruption of data sets stored in non-volatile devices, such as small EEPROM or flash memories. Such data sets are common in small embedded systems, which use them to keep persistent data such as configuration parameters and critical system logs. They may be corrupted after a system crash, a power failure or an electrostatic discharge (ESD).

This article proposes a simple but effective mechanism that stores such data with a lower likelihood of corruption. The method also includes a well-known mechanism to detect corruption of the variables that hold the data in RAM, which may be corrupted by a variety of causes such as environmental factors (e.g., EMI, heat, radiation), hardware faults (e.g., power fluctuation, power failures, memory cell faults, address line shorts), or software faults (other software erroneously modifying memory). Although this article uses C to implement the proposed method, it can easily be implemented in other programming languages such as C++.

Consider an embedded system that can be configured at runtime through a set of parameters stored in non-volatile memory. These parameters are arranged in a C structure:

typedef struct ConfigData ConfigData;
struct ConfigData
{
    int optionA;
    long optionB;
};

What would happen if a power failure or system reset occurred while these data were being updated? They could be left corrupted. To address this problem, a fixed-length check value called a CRC (cyclic redundancy check) is calculated over the configuration data so that corruption can be detected. This value is stored in the non-volatile memory alongside the data:

typedef struct Config Config;
struct Config
{
    ConfigData data;
    Crc32 crc;
};
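
To make the CRC step more concrete, the following is a minimal sketch of a bitwise CRC-32 routine and of a check that compares a recalculated CRC with the stored one. It assumes the common reflected polynomial 0xEDB88320 with a final inversion; the Crc32_calc() function used in this article may be implemented differently (for example, table-driven or without the final inversion). The names sketchCrc32() and sketchIsValid() are hypothetical, and the structures are repeated only to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Crc32;

typedef struct ConfigData { int optionA; long optionB; } ConfigData;
typedef struct Config { ConfigData data; Crc32 crc; } Config;

/* Hypothetical helper: bitwise CRC-32, reflected polynomial 0xEDB88320.
 * 'init' is the starting register value, typically 0xffffffff. */
static Crc32
sketchCrc32(const uint8_t *buf, size_t len, Crc32 init)
{
    Crc32 crc = init;
    size_t i;
    int bit;

    assert((buf != NULL) || (len == 0));
    for (i = 0; i < len; ++i)
    {
        crc ^= buf[i];
        for (bit = 0; bit < 8; ++bit)
        {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return crc ^ 0xffffffffu;   /* final inversion (an assumption) */
}

/* The CRC covers the payload only, never the crc field itself.
 * Note that any padding bytes inside ConfigData are covered too. */
static bool
sketchIsValid(const Config *cfg)
{
    Crc32 calc = sketchCrc32((const uint8_t *)&cfg->data,
                             sizeof(ConfigData), 0xffffffffu);
    return calc == cfg->crc;
}
```

Because the CRC is calculated over the raw bytes of the structure, padding bytes take part in the result. A layout without padding, or a structure cleared before use, keeps the stored and recalculated values comparable.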

Suppose the software module that implements this method is called Config, which provides functions to initialize, set and get the configuration data, while protecting them via calls to the CRC calculator. These functions are defined in a header file named Config.h and implemented in a source file named Config.c. The following code snippet shows a fragment of the file Config.h.

typedef enum ConfigErrorCode ConfigErrorCode;
enum ConfigErrorCode
{
    NO_ERRORS,
    INIT_DATA,
    CORRUPT_DATA
};
...
typedef void (*ConfigErrorHandler)(ConfigErrorCode errorCode);
...
ConfigErrorCode Config_init(void);
void Config_setErrorHandler(ConfigErrorHandler errorHandler);
bool Config_getOptionA(int *value);
bool Config_getOptionB(long *value);
bool Config_setOptionA(int value);
bool Config_setOptionB(long value);

When the system starts, the data stored in the non-volatile memory are first checked and, if they are not corrupted, copied to a variable. Otherwise they are restored to the default values, which are defined in the file ConfigDft.h. This procedure also covers the case in which the system starts for the first time, although there is a more sophisticated alternative that will be explored later on. The function Config_init() implements this startup check.

static const Config configDefault =
{
    {
        CONFIG_OPTA_DFT,
        CONFIG_OPTB_DFT
    }, 0
};
...
ConfigErrorCode
Config_init(void)
{
    ConfigErrorCode res = NO_ERRORS;
    Crc32_init();
    if (checkDataFromNVMem(&config) == false)
    {
        res = INIT_DATA;
        if (errorHandler != (ConfigErrorHandler)0)
        {   
            errorHandler(res);
        }
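        config = configDefault;
        /* The defaults carry a zero CRC, so calculate it before storing */
        config.crc = Crc32_calc((const uint8_t *)&config.data,
                                sizeof(ConfigData), 0xffffffff);
        NVMem_storeData(CONFIG_ADDR_BEGIN, sizeof(Config),
                        (const uint8_t *)&config);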
    }
    return res;
}

The CRC value is set when configuration is updated and checked when configuration is read. Updating implies storing the new configuration together with its CRC in both the private variable and the non-volatile memory.

As an example, the function Config_setOptionA() shows how to set a configuration option, in this case optionA.

bool
Config_setOptionA(int value)
{
    bool res = false;
    if (checkData((const Config *)&config) == false)
    {
        if (errorHandler != (ConfigErrorHandler)0)
        {
            errorHandler(CORRUPT_DATA);
        }
    }
    else
    {
        config.data.optionA = value;
        /* The CRC covers the data only, not the crc field itself */
        config.crc = Crc32_calc((const uint8_t *)&config.data,
                                sizeof(ConfigData), 0xffffffff);
        NVMem_storeData(CONFIG_ADDR_BEGIN, sizeof(Config),
                        (const uint8_t *)&config);
        res = true;
    }
    return res;
}

After updating the non-volatile memory, the function Config_setOptionA() could add another verification to ensure that the recently stored data have been correctly written.
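
The extra verification mentioned above could be a read-back comparison. The following is a minimal sketch under stated assumptions: the non-volatile device is simulated with a RAM array, and sketchNvStore(), sketchNvRead(), sketchStoreVerified() and the fault-injection flag are hypothetical names that are not part of the NVMem module used in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NV_SIZE 64u

/* Stand-in for the non-volatile device: a RAM array (assumption) */
static uint8_t sketchNvMem[SKETCH_NV_SIZE];

/* Test hook: when true, the simulated device corrupts the first
 * byte of every write */
static bool sketchInjectFault = false;

static void
sketchNvStore(size_t addr, size_t len, const uint8_t *src)
{
    assert(addr + len <= SKETCH_NV_SIZE);
    memcpy(&sketchNvMem[addr], src, len);
    if (sketchInjectFault && (len > 0))
    {
        sketchNvMem[addr] ^= 0x01u;
    }
}

static void
sketchNvRead(size_t addr, size_t len, uint8_t *dst)
{
    assert(addr + len <= SKETCH_NV_SIZE);
    memcpy(dst, &sketchNvMem[addr], len);
}

/* Store 'len' bytes, then read them back in small chunks and compare
 * each chunk with the source buffer */
static bool
sketchStoreVerified(size_t addr, size_t len, const uint8_t *src)
{
    uint8_t chunk[8];
    size_t done = 0;

    sketchNvStore(addr, len, src);
    while (done < len)
    {
        size_t n = len - done;
        if (n > sizeof(chunk))
        {
            n = sizeof(chunk);
        }
        sketchNvRead(addr + done, n, chunk);
        if (memcmp(chunk, src + done, n) != 0)
        {
            return false;   /* stored bytes differ from the source */
        }
        done += n;
    }
    return true;
}
```

Reading back in small chunks keeps the stack usage low, which matters on small targets. A set function could return false, or call the error handler, when the verification fails.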

The method proposed in this article reads the configuration from a variable rather than directly from the non-volatile memory, since this variable is an up-to-date copy of the configuration stored there. When the configuration is read, the CRC is recalculated and compared to the stored CRC. If they differ, errorHandler() is called; otherwise the retrieved data are returned to the client. The function Config_getOptionA() shown below demonstrates how to retrieve a configuration option from the system configuration.

bool
Config_getOptionA(int *value)
{
    bool res = false;
    if (checkData((const Config *)&config) == false)
    {
        if (errorHandler != (ConfigErrorHandler)0)
        {   
            errorHandler(CORRUPT_DATA);
        }
    }
    else
    {
        if (value != (int *)0)
        {
            *value = config.data.optionA;
            res = true;
        }
    }
    return res;
}

The Config_init() function shown earlier deals with data corruption at startup by restoring the whole configuration to default values. A more advanced alternative uses two blocks of the non-volatile memory, each storing the configuration data arranged as the Config structure. One block is called main and the other backup; the names reflect the roles the blocks play during recovery. When the system starts, the configuration stored in each block is checked by recalculating its CRC and comparing it with the stored CRC. If only one block is corrupted, the whole healthy block is copied over the corrupted one. If both are corrupted, both are restored to default values. The last case arises when both blocks are healthy: the stored CRCs of the two blocks are compared with each other and, if they differ, the main block is copied to the backup block.

When using non-volatile devices like flash memories, every block should be assigned an exclusive physical sector of that memory. The next diagram represents the behavior of this mechanism:

The Config_init() function is modified to perform the mechanism explained above:

...
static const RecProc recovery[] =
{
    proc_in_error, proc_recovery, proc_backup, proc_cmp
};
...
static ConfigErrorCode
proc_in_error(void)
{
    block = configDefault;
    block.crc = Crc32_calc((const uint8_t *)&block.data,
                           sizeof(ConfigData), 0xffffffff);
    NVMem_storeData(CONFIG_MAIN_ADDR, sizeof(Config),
                    (const uint8_t *)&block);
    NVMem_storeData(CONFIG_BACKUP_ADDR, sizeof(Config),
                    (const uint8_t *)&block);
    return CORRUPT_DATA;
}
static ConfigErrorCode
proc_recovery(void)
{
    block = backupBlock;
    NVMem_storeData(CONFIG_MAIN_ADDR, sizeof(Config),
                    (const uint8_t *)&block);
    return RECOVER_DATA;
}
static ConfigErrorCode
proc_backup(void)
{
    NVMem_storeData(CONFIG_BACKUP_ADDR, sizeof(Config),
                    (const uint8_t *)&block);
    return BACKUP_DATA;
}
static ConfigErrorCode
proc_cmp(void)
{
    ConfigErrorCode res = NO_ERRORS;
    if (main.readCRC != backup.readCRC)
    {
        res = proc_backup();
    }
    return res;
}
...
ConfigErrorCode
Config_init(void)
{
    int status;
    Crc32_init();
    NVMem_readData(CONFIG_MAIN_ADDR, sizeof(Config),
                   (uint8_t *)&block);
    main.readCRC = Crc32_calc((const uint8_t *)&block.data,
                              sizeof(ConfigData), 0xffffffff);
    main.result = (main.readCRC == block.crc) ? 1: 0;
    NVMem_readData(CONFIG_BACKUP_ADDR, sizeof(Config),
                   (uint8_t *)&backupBlock);
    backup.readCRC = Crc32_calc((const uint8_t *)&backupBlock.data,
                                sizeof(ConfigData), 0xffffffff);
    backup.result = (backup.readCRC == backupBlock.crc) ? 1: 0;
    /* Bit 1: main block is healthy; bit 0: backup block is healthy */
    status = (main.result << 1) | backup.result;
    return (*recovery[status])();
}
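
The index arithmetic in Config_init() may be easier to follow when written as a pure decision function. The following sketch restates the four cases of the lookup table above without touching any memory; the SketchAction enumeration and sketchDecide() are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum SketchAction
{
    SKETCH_RESTORE_DEFAULTS,     /* both blocks are corrupted */
    SKETCH_COPY_BACKUP_TO_MAIN,  /* only the backup block is healthy */
    SKETCH_COPY_MAIN_TO_BACKUP,  /* only main is healthy, or both are
                                    healthy but their CRCs differ */
    SKETCH_NOTHING_TO_DO         /* both healthy with equal CRCs */
} SketchAction;

static SketchAction
sketchDecide(bool mainOk, bool backupOk,
             uint32_t mainCrc, uint32_t backupCrc)
{
    /* Bit 1: main block is healthy; bit 0: backup block is healthy */
    unsigned status = ((mainOk ? 1u : 0u) << 1) | (backupOk ? 1u : 0u);

    switch (status)
    {
        case 0u:
            return SKETCH_RESTORE_DEFAULTS;
        case 1u:
            return SKETCH_COPY_BACKUP_TO_MAIN;
        case 2u:
            return SKETCH_COPY_MAIN_TO_BACKUP;
        default:
            assert(status == 3u);
            return (mainCrc != backupCrc) ? SKETCH_COPY_MAIN_TO_BACKUP
                                          : SKETCH_NOTHING_TO_DO;
    }
}
```

Keeping the decision separate from the memory accesses makes each of the four startup cases straightforward to exercise in a unit test.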

If the system does not need to check the data set stored in RAM every time a configuration option is accessed by set and get functions, then an alternative version would look like this:

bool
Config_getOptionA(int *value)
{
    bool res = false;
    if (value != (int *)0)
    {
        *value = block.data.optionA;
        res = true;
    }
    return res;
}
bool
Config_setOptionA(int value)
{
    block.data.optionA = value;
    block.crc = Crc32_calc((const uint8_t *)&block.data,
                           sizeof(ConfigData), 0xffffffff);
    /* Store the whole block (data plus CRC), main first and then backup */
    NVMem_storeData(CONFIG_MAIN_ADDR, sizeof(Config),
                    (const uint8_t *)&block);
    NVMem_storeData(CONFIG_BACKUP_ADDR, sizeof(Config),
                    (const uint8_t *)&block);
    return true;
}

The source code shown here, written in C, and its unit test cases are available in the safety-mem-patterns repository. It contains three directories, Config.alt1/, Config.alt2/ and Config.recovery/, which correspond to different ways of implementing the Config module according to the proposed method. Config.alt1 and Config.alt2 are similar, but the latter does not check the data set stored in RAM every time a configuration option is accessed by the set and get functions. Config.recovery is derived from Config.alt2 and adds the recovery mechanism.

Even though the method introduced here is an effective way to protect embedded software against non-volatile memory corruption, it is strongly recommended to add electronic circuitry such as early power-failure detection and an alternative power supply, like a battery or a supercapacitor, to deal with a power failure more reliably. While a power failure is in progress, this circuitry not only allows the software to finish updating the data set in the non-volatile memory, but also lets it avoid starting a new update. As a result, memory corruption caused by a power failure becomes less likely.
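
On the software side, avoiding a new update during a power failure could look like the following sketch. It assumes that the early power-failure detector raises an interrupt; the flag, the handler name sketchOnPowerFail() and the RAM array standing in for the non-volatile device are all hypothetical and not part of the code presented in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Set by the (hypothetical) power-failure interrupt, read by the
 * application code */
static volatile bool sketchPowerFailing = false;

/* RAM stand-in for the non-volatile device (assumption) */
static uint8_t sketchNv[32];

/* Would be called from the power-failure detector ISR */
static void
sketchOnPowerFail(void)
{
    sketchPowerFailing = true;
}

/* Refuses to start a write once a power failure has been signalled.
 * A write that is already in progress is expected to finish on the
 * energy held by the alternative supply. */
static bool
sketchGuardedStore(size_t addr, size_t len, const uint8_t *src)
{
    assert(addr + len <= sizeof(sketchNv));
    if (sketchPowerFailing)
    {
        return false;
    }
    memcpy(&sketchNv[addr], src, len);
    return true;
}
```

The guard only prevents new updates from starting; it relies on the hardware holding the supply long enough for an update that has already begun.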

This article was originally published on Embedded.

Leandro Francucci is an electronic engineer who has focused on real-time embedded system development using software models for more than ten years, in industries such as railway, medical, IoT, telecom, and energy. Leandro is the author of the free and open-source RKH state machine framework, and he is also the co-founder and owner of VortexMakes, a startup that provides consulting and training services in embedded software for companies of all sizes. Leandro is always interested in new challenges, as well as knowledge transfer, research and constant learning.

 
