Software tracing is an important tool in every embedded developer’s toolbox, especially when combined with advanced visualization. Most embedded systems have plenty of cyclical patterns, where the same sequence repeats over and over. When debugging, you often want to find the anomalies, i.e. deviations from the normal cyclical behavior where something out of the ordinary happened.
However, software tracing in itself is only a form of data collection. Looking for a problem in a trove of textual or numerical log data is akin to searching for a needle in a haystack, but with proper visualization the search is transformed into a problem of visual pattern recognition, something the human brain is particularly well equipped to do. Interactive graphs showing execution times, response times, task switches and message passing between tasks all allow a developer to quickly spot anomalies in the execution of their firmware and see where to dig deeper.
Tools for visual trace diagnostics have been around for at least a decade and have proved to be useful for development and debugging in the lab. With more and more embedded software developers adding secure ‘internet of things’ cloud connectivity, it is quite natural to consider the use of tracing in deployed devices in the field, in order to capture real-world problems that had been missed during testing. After all, software-based tracing does not require any additional hardware and a connected IoT device is obviously capable of uploading diagnostic trace data, in the same way as regular application data. In this way, developers can quickly become aware of any remaining software issues that cause problems during real-world operation and also get detailed diagnostics to understand the cause.
In this scenario, software tracing is comparable to a virtual “flight recorder”, like those used in airliners in case of accidents. It is an integrated part of the product that is always recording, providing vital information in case of a problem. But unlike the real flight recorder boxes, it is a software solution and intended for software issues.
One solution for this kind of IoT device monitoring is Percepio’s DevAlert (figure 1), which consists of three parts: a firmware monitor, a small library that you add to your firmware to enable tracing and uploading of alerts; our Tracealyzer tool for visual trace diagnostics; and a cloud service, responsible for categorizing and storing alerts, notifying developers, filtering out duplicate alerts, and more.
Figure 1. Percepio DevAlert provides IoT developers with instant feedback about errors in their cloud connected devices, allowing for rapid continuous improvement of the device software.
The initial version runs on AWS and is intended for RTOS applications using AWS IoT Core, but the solution can be adapted for other cloud platforms.
Software tracing and cloud connectivity
Tracing in the development lab and tracing deployed devices are two different things. If you are using visual trace diagnostics in the lab today and look to expand it into the field, there are a few things you need to think through.
Compared to a direct physical connection like USB or Ethernet, a cloud connection offers both limited bandwidth and much longer response times. Uploading, say, 5 KB of data may require tens or hundreds of milliseconds over a wireless interface. In this approach, however, traces are not transmitted continuously. Data is uploaded only when an alert is generated, and then only as a small trace of the most recent events. Alerts are intended only for unusual but important things, for instance when an error has been detected in the application code, such as a failed sanity check, a hard fault or a watchdog reset.
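To make the "always recording, rarely uploading" idea concrete, here is a minimal sketch in C of a ring buffer that keeps only the most recent events and hands out a snapshot when an alert fires. All names (trace_record, trace_snapshot, the event layout and the tiny buffer size) are assumptions made for illustration. This is not the DevAlert firmware monitor or the Tracealyzer recorder API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; not the DevAlert or Tracealyzer API. */
#define TRACE_CAPACITY 8u  /* kept tiny for illustration */

typedef struct {
    uint32_t timestamp;  /* e.g. an RTOS tick count */
    uint16_t event_id;   /* e.g. task switch or queue send */
    uint16_t arg;        /* event-specific argument */
} trace_event_t;

typedef struct {
    trace_event_t events[TRACE_CAPACITY];
    size_t head;   /* index of the next slot to write */
    size_t count;  /* number of valid events, at most TRACE_CAPACITY */
} trace_buffer_t;

void trace_init(trace_buffer_t *tb)
{
    tb->head = 0;
    tb->count = 0;
}

/* Always recording: once the buffer is full, the oldest event is overwritten. */
void trace_record(trace_buffer_t *tb, uint32_t timestamp,
                  uint16_t event_id, uint16_t arg)
{
    tb->events[tb->head] = (trace_event_t){ timestamp, event_id, arg };
    tb->head = (tb->head + 1u) % TRACE_CAPACITY;
    if (tb->count < TRACE_CAPACITY) {
        tb->count++;
    }
}

/* Called only when an alert fires. Copies up to max_out of the most recent
 * events into out, oldest first, and returns the number copied. */
size_t trace_snapshot(const trace_buffer_t *tb, trace_event_t *out,
                      size_t max_out)
{
    size_t n = (tb->count < max_out) ? tb->count : max_out;
    /* Index of the oldest event among the n most recent ones. */
    size_t start = (tb->head + TRACE_CAPACITY - n) % TRACE_CAPACITY;
    for (size_t i = 0; i < n; i++) {
        out[i] = tb->events[(start + i) % TRACE_CAPACITY];
    }
    return n;
}
```

In a design like this, recording costs a few memory writes per event, and the radio is used only when an alert handler calls trace_snapshot and passes the result to whatever upload routine the application already has.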
Any internet-connected device needs to be secure. It’s therefore important not to introduce any new attack vectors. We solve this in DevAlert by relying on existing cloud connectivity rather than introducing a new connection. This leverages the security of AWS and other leading IoT/cloud providers, which offer verified SDKs for cloud connectivity that are secured according to best practices, such as device authentication using X.509 certificates and encrypted communication using TLS. This makes DevAlert uploads just as secure as regular IoT application data. For added security, the monitor only needs one-way communication: it never listens for incoming messages.
In this approach, alerts are uploaded to the same cloud account as normally used by the device, and with the same level of security. Once in the cloud, a small part of the data is provided to the cloud service. This doesn’t include the actual trace data, which may be considered as sensitive information and therefore remains in the cloud account of the device. Figures 2a and 2b show the data flow and the security barriers in more detail.
Figure 2a. Data flow starts in the device software, where developers add alerts to the source code. Every alert that is uploaded to the device cloud account includes a short trace with the most recent events preceding the alert. Finally, a metadata signature is forwarded to the DevAlert cloud service.
Figure 2b. The cloud service compares incoming alerts to previous alerts from the customer’s entire device fleet and notifies developers about any new issues. Alerts that are duplicates are counted and stored, but no notifications are sent. This way, developers’ inboxes aren’t flooded if the same alert is triggered in multiple devices.
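The duplicate filtering described above can be illustrated with a small sketch. It assumes each alert is reduced to a 32-bit signature. The signature format, the table size and the function names are all hypothetical, and a real cloud service would use a database rather than a fixed array. C is used here only to stay in the language of the firmware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KNOWN_ALERTS 16u  /* illustrative limit only */

typedef struct {
    uint32_t signature;  /* hypothetical metadata signature of an alert type */
    uint32_t count;      /* how many times it has been reported by the fleet */
} alert_entry_t;

typedef struct {
    alert_entry_t entries[MAX_KNOWN_ALERTS];
    size_t used;
} alert_store_t;

void alert_store_init(alert_store_t *s)
{
    s->used = 0;
}

/* Registers an incoming alert. Returns true if it is a new issue that
 * developers should be notified about, false if it is a known duplicate. */
bool alert_store_add(alert_store_t *s, uint32_t signature)
{
    for (size_t i = 0; i < s->used; i++) {
        if (s->entries[i].signature == signature) {
            s->entries[i].count++;  /* counted and stored, no notification */
            return false;
        }
    }
    if (s->used < MAX_KNOWN_ALERTS) {
        s->entries[s->used].signature = signature;
        s->entries[s->used].count = 1;
        s->used++;
    }
    /* A full table still reports the alert as new; it just cannot store it. */
    return true;
}

/* Returns how often a signature has been seen, or 0 if it is unknown. */
uint32_t alert_store_count(const alert_store_t *s, uint32_t signature)
{
    for (size_t i = 0; i < s->used; i++) {
        if (s->entries[i].signature == signature) {
            return s->entries[i].count;
        }
    }
    return 0;
}
```

With this logic, a thousand devices hitting the same failed sanity check produce one notification and a counter value of 1000, which is itself useful for prioritizing fixes.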
Operational costs for receiving alerts to a cloud account are typically low, although they naturally depend on the volume. To begin with, as long as no issues are detected, no alerts are sent. In general, cloud providers also charge very little for sending and storing occasional alert messages. Most IoT applications generate a lot more data, which is reflected in the pricing of the IoT/cloud services. For example, sending 1 million MQTT messages to AWS IoT Core costs US$ 1.
Most of the alert processing is done in the cloud service, a fully managed service hosted by Percepio. Only the initial processing is done in the device developer cloud account, which keeps the cloud costs low and simplifies integration.
Sending out over-the-air updates to fix reported errors can potentially cost a bit more, since you need to transfer a lot more data and to all devices. AWS provides a pricing example where the cost of updating a fleet of 600,000 devices is US$ 1,275, or about 0.2 US cents per device. This is, however, not very expensive in relation to the cost of letting a bug remain unfixed: damaged customer experience, lower product review ratings, lower sales, or even accidents and legal action.
DevOps for embedded development
Enabling your IoT devices to “phone home” in case of software issues comes with a significant upside. The direct awareness of errors and detailed diagnostics create a feedback loop between developers and deployed code, allowing developers to fix bugs faster and push out updated firmware faster – see Figure 3. This so-called DevOps philosophy has long been the standard in development of mobile and cloud applications, and with the introduction of secure cloud-based IoT platforms it has become possible for embedded development to work this way too.
Figure 3. The DevAlert dashboard in Tracealyzer lists the most recent reported alerts and traces.
From a business perspective, this DevOps-style monitoring translates into fewer dissatisfied customers, since fewer end users will be affected by bugs in production code. Most embedded software contains some missed bugs at release, despite all verification efforts, but they typically do not show up directly for everyone. There is often some time to fix the problem before many customers are affected, if you know about it early. Ideally, developers should be notified within seconds of the very first alert, and the provided trace diagnostics allow for rapid analysis and correction. Developers may then send out an automatic over-the-air update to fix the problem. The instant awareness and trace diagnostics may greatly reduce the time-to-repair and minimize the number of affected customers.
Improved device reliability reduces liability risks and also reduces costs for customer support, returns and debugging. The provided diagnostics make it far easier for developers to reproduce customer issues, since they get information directly from the device and do not have to rely on the user to describe the circumstances. Without automatic feedback, you rely on your end users to report any issues and provide sufficiently detailed information. A vague error report like “the system stops responding” isn’t very helpful, and it may take weeks to find a likely cause. And even then, it’s just your best guess: you can’t really know if you solved the right problem.
Not only bugs
One thing to note is that alerts do not have to be only about missed bugs and the resulting errors. Since developers are free to decide where and why alerts should be generated, they can also use them to monitor key performance metrics of the application and find the reason for occasional performance issues.
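As one possible illustration of a performance alert, the sketch below checks a measured response time against a developer-chosen limit and tracks the worst case seen so far. The structure, the function names and the microsecond unit are assumptions for this example, not part of any Percepio API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical performance monitor; all names are illustrative. */
typedef struct {
    uint32_t limit_us;  /* alert threshold chosen by the developer */
    uint32_t worst_us;  /* longest response time observed so far */
    uint32_t alerts;    /* number of times the limit was exceeded */
} perf_monitor_t;

void perf_monitor_init(perf_monitor_t *m, uint32_t limit_us)
{
    m->limit_us = limit_us;
    m->worst_us = 0;
    m->alerts = 0;
}

/* Reports one measurement. Returns true if the caller should raise an alert. */
bool perf_monitor_report(perf_monitor_t *m, uint32_t measured_us)
{
    if (measured_us > m->worst_us) {
        m->worst_us = measured_us;
    }
    if (measured_us > m->limit_us) {
        m->alerts++;
        return true;
    }
    return false;
}
```

When the function returns true, the application would attach the recent trace to the alert, so the developer can see what else was running when the limit was exceeded.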
Monitoring the user interface can also reveal interesting information. Let’s say you have a situation where the user opens up a menu on a touch screen, e.g. in the infotainment system of a car, and then hesitates about where to proceed. To catch such issues, the application developer can start a timer after each input event and generate an alert if no input is received within, say, 5 seconds. If many alerts are then received about the same part of the user interface, this can be important feedback that can help your organization build better products.
All in all, leveraging software tracing and cloud-based alerts in deployed devices has major benefits and is not complicated. However, fully embracing a DevOps-style workflow requires the capability for over-the-air updates and a responsive development organization that understands the limitations of software testing and the importance of continuous improvement after release as well.
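The hesitation timer can be sketched as follows, assuming a millisecond tick that the application polls periodically. The names, the screen identifier and the polling approach are hypothetical. An RTOS software timer would work equally well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical UI idle watch; all names are illustrative. */
typedef struct {
    uint32_t timeout_ms;     /* e.g. 5000 for the 5 second example */
    uint32_t last_input_ms;  /* time of the most recent input event */
    uint16_t screen_id;      /* which part of the UI the user was in */
    bool armed;              /* true while waiting for the next input */
} ui_idle_watch_t;

void ui_watch_init(ui_idle_watch_t *w, uint32_t timeout_ms)
{
    w->timeout_ms = timeout_ms;
    w->last_input_ms = 0;
    w->screen_id = 0;
    w->armed = false;
}

/* Call on every input event: restarts the timer for the current screen. */
void ui_watch_on_input(ui_idle_watch_t *w, uint32_t now_ms, uint16_t screen_id)
{
    w->last_input_ms = now_ms;
    w->screen_id = screen_id;
    w->armed = true;
}

/* Call periodically. Returns true once when the timeout has expired, at which
 * point the caller would generate an alert tagged with w->screen_id. */
bool ui_watch_poll(ui_idle_watch_t *w, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct across tick counter wrap-around. */
    if (w->armed && (uint32_t)(now_ms - w->last_input_ms) >= w->timeout_ms) {
        w->armed = false;  /* one alert per hesitation, not one per poll */
        return true;
    }
    return false;
}
```

Tagging each alert with the screen identifier is what lets the alerts be grouped by the part of the user interface where users hesitate.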
— Johan Kraft is CEO of Percepio AB