Making machine learning work within the resource-constrained environment of an embedded device can become a quagmire. This article will focus on working through issues with a hypothetical device with significant ML components.
Machine learning (ML) is hard; making it work within the resource-constrained environment of an embedded device can easily become a quagmire. In light of this harsh reality, anyone attempting to implement ML in an embedded system must consider, and frequently revisit, the design aspects crucially affected by its requirements. A bit of upfront planning makes the difference between project success and failure.
For this article, our focus is on building commercial-grade applications with significant, or even dominant, ML components. We’ll use a theoretical scenario in which you have a device, or better yet an idea for one, that will perform complex analytics, usually in something close to real time, and deliver results in the form of network traffic, user data displays, machine control, or all three. The earlier you are in the design process, the better positioned you’ll be to adjust your hardware and software stack to match the ML requirements. The available tools, especially at the edge, are neither mature nor general purpose. Also keep in mind that the more flexible you are, the better your odds of building a viable product.
Let’s start by describing a hypothetical device and then work through some of the ML-related issues that will affect its design.
A smart security camera
For our design, let’s look at a networked security camera. As an IoT device, it’s continuously connected to the Internet. We’ll assume our device has at least 4 GB of SDRAM and a 64-bit ARM CPU, and runs an embedded Linux that supports an Anaconda Python distribution, OpenCV, dlib, and TensorFlow.
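Before any model work begins, it’s worth a quick sanity check that the assumed stack actually imports and can grab frames on the target board. The following is a minimal sketch; the camera index (0, i.e. the first attached video device) is an assumption about this hypothetical hardware, and the calls themselves are standard OpenCV, dlib, and TensorFlow.

```python
# Minimal stack check for the hypothetical camera's embedded Linux image.
import cv2
import dlib
import tensorflow as tf

print("OpenCV:", cv2.__version__)
print("dlib:", dlib.__version__)
print("TensorFlow:", tf.__version__)

cap = cv2.VideoCapture(0)        # first attached camera; device index is assumed
ok, frame = cap.read()
if ok:
    print("Captured frame:", frame.shape)   # e.g. (720, 1280, 3) for 720p BGR
cap.release()
```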
Our ML-related goals for this application are: 1) recording, identifying, and labeling “interesting” frames, and 2) alerting security personnel to suspicious activity.
As with most projects of this nature, we are constrained by various physical, environmental, and cost factors. To make the best use of the available data, we’ll need ML to examine and classify multiple objects in every image frame. The first design strategy we’ll explore is to use edge-based ML to identify frames of interest and perform some simple initial classification, with a cloud-based service handling the actual determination of a possible alert; a structural sketch of that split appears below. How should we proceed?
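Before weighing the options, it helps to see the shape of that split in code. This is a structural sketch only: the edge_classify stub stands in for whatever on-device model we eventually choose, and the endpoint URL and JSON payload format are invented for this hypothetical design.

```python
# Structural sketch of the edge/cloud split, under assumed names and endpoints.
import json
import time
import urllib.request

import cv2

CLOUD_ALERT_URL = "https://example.com/api/v1/alerts"  # hypothetical endpoint


def edge_classify(frame):
    """Placeholder for the on-device model: return a list of
    (label, confidence, bounding_box) results, or [] if nothing of interest."""
    return []  # a real implementation would run a TensorFlow/dlib detector here


def send_to_cloud(detections, timestamp):
    """Forward only metadata for interesting frames; the cloud side
    decides whether this rises to an alert."""
    payload = json.dumps({"ts": timestamp, "detections": detections}).encode()
    req = urllib.request.Request(CLOUD_ALERT_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)


cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    detections = edge_classify(frame)      # cheap, on-device
    if detections:                         # only "interesting" frames leave the box
        send_to_cloud(detections, time.time())
```

Everything interesting about the design lives in how much work edge_classify does before anything leaves the device.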
Process images in the cloud?
Regardless of where the processing occurs, this application will need to recognize various objects in the video frames, such as people, faces, and vehicles. Each object set requires execution (inference) of an ML model, which produces a set of bounded, labeled objects. A typical camera records about 30 frames per second. If we wanted to minimize the cost and complexity of the camera, couldn’t it just send the raw data to a cloud provider? Ignoring other considerations, that is about 2.6 million images per day. Even at a deep discount, processing this number of images with a commercial ML services provider would be billed at about $1,000 a day, and that’s for each recognition model applied. Clearly, we’ll need to make other choices.
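The arithmetic behind that estimate is straightforward; the per-image price below is an assumed deep-discount figure for illustration, not a quoted rate from any particular provider.

```python
# Daily cost of sending every frame to a hosted, per-image ML service.
FPS = 30
SECONDS_PER_DAY = 24 * 60 * 60
PRICE_PER_IMAGE = 0.0004        # USD; assumed discounted per-image rate

frames_per_day = FPS * SECONDS_PER_DAY          # 2,592,000 frames
daily_cost = frames_per_day * PRICE_PER_IMAGE   # ~$1,037 per recognition model

print(f"{frames_per_day:,} frames/day -> ${daily_cost:,.0f}/day per model")
```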
Let’s start by examining the raw input stream. Each raw 720p (standard HD, 1280×720 px) frame uses about 5 MB, so if we were to send 30 frames/sec over the network, we would need an incredible 1.2 Gbps connection (about a terabyte every two hours). For full HD and higher, multiply this by four to ten. We will not be sending raw, uncompressed video. Our problem is that our ML models only work against individual image frames. Where should we make our tradeoffs?
Some clues to our dilemma lie in the basic nature of video, which produces vast amounts of data, but in most cases very little of the information in a given frame differs from the frame that preceded it. This is why various compression techniques work extremely well; raw data reductions of 100-1000× are typical. It’s reasonable to assume that compression can reduce our 1.2 Gbps stream to something in the neighborhood of 1-2 Mbps, so perhaps we can do our ML inference using cloud services, as long as we host our own machine instances rather than paying for per-image ML services. Depending on our workloads, we can expect to process 24×7 video for between $50 and $200 per month.
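Running the same numbers for bandwidth shows why compression changes the picture; the figures below use the working assumptions already stated (about 5 MB per raw 720p frame, 30 fps, and a 100-1000× reduction).

```python
# Back-of-envelope bandwidth for the raw stream and its compressed equivalent.
FRAME_MB = 5          # ~5 MB per raw 720p frame (working figure from above)
FPS = 30

raw_mbps = FRAME_MB * 8 * FPS                    # ~1,200 Mbps uncompressed
raw_tb_per_2h = FRAME_MB * FPS * 2 * 3600 / 1e6  # ~1.1 TB every two hours

print(f"raw: ~{raw_mbps / 1000:.1f} Gbps, ~{raw_tb_per_2h:.1f} TB per two hours")

for ratio in (100, 1000):                        # typical compression range
    print(f"{ratio}x compression: ~{raw_mbps / ratio:.1f} Mbps")
```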
Are there any other factors that might bias us towards more cloud processing?
Bear in mind that the choice between processing in the cloud and processing on the edge is not binary but a continuum of options. Designs can benefit enormously from even limited edge computing; in some cases, simple detections (motion, vehicle, human, face) implemented on the edge can reduce cloud workloads by orders of magnitude with no loss in utility, as the sketch below illustrates.
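As a concrete example of the cheap end of that continuum, a stock background-subtraction motion gate can decide whether a frame deserves any further attention at all. This is a sketch only; the foreground-area threshold is an assumption that would need tuning per scene, and the real work would replace the pass statement.

```python
# Motion gate: spend ML cycles (or bandwidth) only on frames where something moved.
import cv2

MIN_FOREGROUND_PIXELS = 2000     # assumed threshold; tune for the scene and lens

cap = cv2.VideoCapture(0)
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)              # per-pixel foreground mask
    if cv2.countNonZero(mask) > MIN_FOREGROUND_PIXELS:
        # Something changed: now run the on-device detector and, if it finds
        # anything interesting, forward metadata or frames to the cloud.
        pass
```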
[Continue reading on EDN US: Can we do most of the work on the edge?]
John Fogarty is an advisory software engineer at Base2 Solutions.