AI makes data storage more effective for analytics

Article By : Noam Mizrahi

Turning data into intelligence requires analysis. AI implemented in the storage controller can substantially speed that analysis, as this proof-of-concept demonstration shows.

Data is being generated today at a pace far greater than anyone could have imagined. In the past, humans were the primary source of data generation. Now, imaging devices, sensors, drones, connected cars, IoT devices, and pieces of industrial equipment generate data in many forms and formats. But we should not confuse data with information; it is vital to differentiate between the two terms.

Currently, only a small fraction of collected data is valuable enough to be treated as a real asset. Take an imaging device: one minute of relevant activity matters far more than long hours of extraneous video footage in which nothing of importance happens. By way of analogy, ‘data’ is the mine in which people dig for the golden nugget that is ‘information.’ The ability to turn this data into valuable information (the ‘digging,’ if you will) is what we refer to as ‘analytics.’

Editor’s Note: This article is part of an AspenCore Special Project, a collection of interrelated articles that explores the application of AI at the edge, looking beyond the voice and vision systems that have garnered much of the press. Included in this Special Project are deep dives on the innovations pushing AI toward the edge, AI in the test industry, and how AI changes the future of edge computing.

Figure 1 Increase in data storage demand 2009 to 2020

The graph shown in Figure 1, compiled by analyst firm Statista, describes the phenomenal ramp-up in stored data capacity over the course of the last decade. It predicts that by 2020 demand for storage will exceed 42,000 exabytes. However, most of the data being stored (most estimates suggest at least 80% of it) is still completely unstructured, and this presents difficulties when using it for analytical purposes. Estimates suggest that only 5% of stored data is actually analyzed. If this unstructured data could be represented by metadata that effectively described it in the context of the analysis being done, much larger amounts of data could be analyzed, significantly increasing the value organizations can generate from the data they possess.

Artificial intelligence (AI) is a technology set to have significant impact on every aspect of modern society. This includes areas such as e-commerce recommendations, natural language translation, FinTech, security, object identification/detection, and even medicine, where life-threatening cancer cells (or other abnormalities) can be quickly pinpointed. Despite their diversity, these use cases share a common thread: we now have a technology that can effectively scan through enormous sets of unstructured data (videos, text, voice, images, etc.) and process them so that true value can be derived.

Specifically, we can use AI not only for the analytical process itself, but also to pre-process raw unstructured data, tagging it with metadata that represents it in a simple yet precise manner. This simplified database can then be analyzed by upper-layer analytics software, and useful information gleaned from it. Organizations have been waiting for AI to get much more out of the data they store, which until now has remained ‘dark.’
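To make the idea concrete, here is a minimal sketch of that two-stage flow: an AI model tags each raw record with metadata, and analytics then queries the small metadata index instead of the raw payloads. All names here are hypothetical, and the `classify` function is a trivial stand-in for a real inference model.

```python
def classify(payload: bytes) -> list[str]:
    """Stand-in for an AI model: derive simple tags from a raw payload."""
    tags = []
    if b"person" in payload:
        tags.append("person")
    if b"vehicle" in payload:
        tags.append("vehicle")
    return tags or ["empty"]

def build_metadata_index(records: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each record ID to its tags; analytics queries this, not the raw data."""
    return {rec_id: classify(payload) for rec_id, payload in records.items()}

def query(index: dict[str, list[str]], tag: str) -> list[str]:
    """Return only the record IDs whose metadata matches the query."""
    return [rec_id for rec_id, tags in index.items() if tag in tags]

records = {
    "frame_001": b"...person walking...",
    "frame_002": b"...empty street...",
    "frame_003": b"...vehicle and person...",
}
index = build_metadata_index(records)
print(query(index, "person"))  # only these frames need full analysis
```

The point of the sketch is the asymmetry: the index is tiny compared with the payloads, so moving and scanning it is cheap, while the bulky raw data stays where it was written.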

Okay, so we want to generate metadata so that our analytics software can run more effectively, and we have AI as the tool to create that metadata database from our enormous unstructured one. Now, we just need to bring these huge amounts of data to our AI compute entities to do the work. But wait: is that really the right way to go?

If we look at the two main places where data is generated and stored today, namely the cloud and the edge, it quickly becomes apparent that moving large amounts of data around is very expensive and should be avoided. In the cloud, routing all this data through the data center puts strain on the network infrastructure, consumes a lot of power, and increases latency (thereby adding to the overall processing time). At the edge, by contrast, compute and power resources are limited, and the restricted network capabilities of the small devices located there make uploading large quantities of data to the cloud for processing impractical. In both cases, minimizing the amount of data we move around, and instead relying on metadata, is key to maximizing operational efficiency.

It would be far more effective if, instead of moving data around, the metadata could be assigned at the source, i.e. inside the storage device where the data is located. Solid state drives (SSDs) already include the essential elements needed to serve as compute entities. These are normally used only for drive operation, but they can be repurposed for function-related tasks such as this tagging work, or complemented with integrated hardware/software/firmware blocks that undertake such functions. One mode of operation is to use the drive’s idle windows to carry out background mapping tasks. Another is to process the data as it is being written to the drive. Each of these two modes of operation comes with its own pros and cons, and each suits different use cases.
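The idle-window mode described above can be sketched as follows. This is a toy model under assumed names (`BackgroundTagger`, `on_idle`), not any real drive firmware interface: writes only store the block and enqueue tagging work, and the deferred AI pass runs when the drive reports itself idle, within a compute budget.

```python
class BackgroundTagger:
    """Toy model of an SSD controller that tags stored data in the
    background, during the drive's idle windows."""

    def __init__(self):
        self.blocks = {}    # lba -> payload (the stored data)
        self.tags = {}      # lba -> metadata tags
        self.backlog = []   # lbas still awaiting tagging

    def _infer_tags(self, payload):
        # Stand-in for an AI inference pass over the block contents.
        return ["person"] if b"person" in payload else ["other"]

    def write(self, lba, payload):
        """Normal write path: store the block and defer the tagging work."""
        self.blocks[lba] = payload
        self.backlog.append(lba)

    def on_idle(self, budget=1):
        """Tag up to `budget` deferred blocks during an idle window."""
        while self.backlog and budget > 0:
            lba = self.backlog.pop(0)
            self.tags[lba] = self._infer_tags(self.blocks[lba])
            budget -= 1

ssd = BackgroundTagger()
ssd.write(0, b"...person at gate...")
ssd.write(1, b"...empty corridor...")
ssd.on_idle(budget=2)      # background mapping pass
print(ssd.tags)            # {0: ['person'], 1: ['other']}
```

The budget parameter captures the key trade-off of this mode: tagging never competes with foreground I/O, but the metadata lags behind the data until an idle window arrives.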

Analyzing data as it is being written to the drive can be very useful for generating alerts, for example. Consider a surveillance system: logic that scans the data as it comes into storage can complement the alerts the camera itself generates (such as motion detection), recognize events of importance (such as suspicious behavior or people), and notify the security control center. This is also the most effective approach in terms of ‘data touches,’ since the data is touched and processed only once, as it comes in.
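The in-line mode can be sketched in a few lines. Again, the names (`write_with_scan`, `detect`) are hypothetical and the detector is a stand-in for a real model; the point is that each incoming block is inspected exactly once, on the write path, and an alert callback fires for events of interest.

```python
alerts = []

def on_alert(lba, event):
    """Callback standing in for a notification to the security control center."""
    alerts.append((lba, event))

def detect(payload):
    """Stand-in for an AI model scanning an incoming frame for notable events."""
    if b"intruder" in payload:
        return "suspicious person"
    return None

def write_with_scan(storage, lba, payload, alert_cb):
    storage[lba] = payload      # normal write path
    event = detect(payload)     # data is touched and processed exactly once
    if event is not None:
        alert_cb(lba, event)

storage = {}
write_with_scan(storage, 0, b"...empty hallway...", on_alert)
write_with_scan(storage, 1, b"...intruder at door...", on_alert)
print(alerts)  # [(1, 'suspicious person')]
```

Note that `detect` sits directly on the write path here, which is exactly why, as discussed next, this mode demands real-time inference performance from the drive's compute resources.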

However, in many cases this implies using stronger CPUs and AI engines in order to deliver real-time results on streams that may be high-resolution video, for example. In a cost- and power-sensitive environment like an SSD, that can become an issue. This in-line analysis would also compete with other drive operations, since system reads and writes potentially contend for the same compute and memory resources of the drive.

[Continue reading on EDN US: Offline processing]

Noam Mizrahi is a Marvell Fellow, and vice president of Technology and Architecture in Marvell’s CTO Office.
