Image and video data, or visual data, has seen unprecedented growth in the last few years. Applications across domains are shifting to Machine Learning (ML) and Data Science to create new products with better user experiences and to derive insights from this vast and rich collection of visual data. These insights help businesses gain a better understanding of their customers and provide inference points for making complex decisions.
In 2016, Luis, the rest of our team at Intel Labs, and I started looking at visual cloud infrastructure for large-scale ML deployments. Since then, we have spoken with hundreds of data engineers, ML (infrastructure) engineers, data scientists, and systems researchers working in application domains such as medical imaging, smart retail, sports, entertainment, and smart cities. These conversations have confirmed the tremendous progress made in improving the performance and accuracy of ML models, as well as the shift in focus towards developing infrastructure for large-scale deployment and improving data quality. Practitioners routinely tell us that big visual data management is either an active problem for them or one they see on their very near horizon. These insights and our desire to address the challenges of visual data management led us to form ApertureData. To better understand our solution, let us first look more specifically at the issues users face.
Visual Data Infrastructure Challenges Today
Visual data is a collection of images and videos that typically grows over time. For example, it could be X-rays or MRI scans of patients in the radiology department of a health center, pictures of clothes from different retailers, or traffic camera videos used to detect pedestrian patterns. This visual data is usually accompanied by some metadata, such as patient age, source of data capture, date, location, and other attributes that exist at the time of creation. Over time, this metadata continues to be enriched with region-of-interest annotations, feature vectors, and more application context. The visual data itself may be needed in different resolutions or formats depending on the end goal, for example, display vs. training.
Depending on how far along an organization is in its ML deployment journey, it faces three basic problems when working with this information-rich but complex-to-manage visual data:
- The semi-duplicate dataset problem - Often, a large team of data scientists trains on smaller subsets of a larger dataset so that each member can develop models that focus on different classes of entities, for instance, training a model to recognize different animals versus training one to recognize dogs specifically. Popular ML models often require constant retraining due to updates to input data, misclassifications, or improvements to the datasets to fix biases. Parameters describing the dataset, such as sources of data capture, annotations, or the amount of space a certain entity class occupies in an image or frame, are commonly stored in comma-separated value (.csv) or spreadsheet (.xlsx) files. As a result, for each new training cycle, data scientists lose precious time and resources creating copies of visual data in their storage buckets and parsing these files to understand the data before they can prepare it for consumption by ML frameworks like PyTorch and finally launch their training tasks. Given that teammates might be training on potentially overlapping classes (e.g. all dogs are animals), this also results in duplication of datasets across the team, wasting not just time but also the storage, networking, and compute resources involved in replicating data.
- The technical debt / glue code problem - The primary challenge with visual data is its multimodal nature. When creating infrastructure to store and search it efficiently, besides handling the size and volume of visual data, a solution needs to handle images, videos or individual frames, regions of interest within those images or frames along with their corresponding labels, and all the other application metadata. Lacking visual-first data management options that understand these special characteristics, teams scatter visual data and metadata across multiple disparate systems such as cloud buckets and databases, with wrapper scripts to bind queries to multiple systems and interchange formats. This is essentially glue code. Since visual data is often pre-processed as part of an ML pipeline (e.g. cropped, zoomed, rotated, normalized), additional glue code is continually added to these scripts to layer on data transformations and ML functionality. This glue code creates mounting technical debt, with multiple data access points and a maintenance nightmare that worsens as an ML deployment scales to larger datasets. It requires constant upkeep as the versions or interfaces of various pipeline components change, leading to increased resource usage (extra engineers, more infrastructure), go-to-market (GTM) delays, increased risk of infrastructure failure, and lost revenue.
- The ML-in-practice problem - ML practitioners need tools to manipulate datasets: for instance, the ability to explore a given visual dataset to ensure they are creating a balanced training set (e.g. an animal dataset should contain not just cats and dogs but horses, lions, tigers, and other animals). Once such a dataset is identified, when experimenting with models to achieve the best accuracy for a desired task or comparing various models, the dataset needs to be stable, like a snapshot. The inability to search through visual datasets and create snapshots of a desired dataset across the glue code layers discussed earlier leads to extremely slow alternatives: manual inspection and copies as checkpoints. Beyond this, certain teams might want to use feature vectors to speed up their ML or to perform similarity searches. Given the limited options for feature indexing and search, especially ones that persist across reboots, most teams resort to internal solutions. Solutions to all these ML-in-practice problems tend to be team- or organization-specific, and are often not well integrated with the wrapper or glue scripts described earlier, adding further to the mountain of technical debt.
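The ad-hoc workflows described above often reduce to a sketch like the following: parse an annotation file, copy the matching images into a per-experiment folder (duplicating bytes each time), and brute-force a similarity search over feature vectors held in memory, which is lost on every restart. All file names, column names, and paths here are hypothetical, chosen only to illustrate the pattern.

```python
import csv
import shutil
from pathlib import Path

import numpy as np

def build_training_subset(csv_path, image_dir, out_dir, wanted_labels):
    """Copy every image whose label is in `wanted_labels` into a
    per-experiment folder -- duplicating bytes for each experiment.
    Assumes a hypothetical annotation CSV with `filename` and `label`
    columns."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    selected = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["label"] in wanted_labels:
                # Yet another physical copy of the same image bytes.
                shutil.copy(Path(image_dir) / row["filename"],
                            out / row["filename"])
                selected.append(row["filename"])
    return selected

def nearest_neighbors(query_vec, vectors, k=5):
    """Brute-force cosine similarity over an (N, D) matrix of feature
    vectors kept in memory -- no index, nothing survives a reboot."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = v @ q
    return np.argsort(-sims)[:k]
```

Every team ends up re-writing some variant of these two functions, which is exactly the duplicated-data and glue-code cost described above.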
Visual data management in the context of ML and data science is one of the early pain points teams across industries need to address to get the desired results from ML. Beyond its impact on user productivity, there is a sizeable business impact: misuse or overuse of resources due to the lack of a unified solution, hiring costs from needing more data scientists or from mismatched engineering skill sets, and, most importantly, the market cost of the delays that result from setting up infrastructure. We believe these problems can be solved by creating a new way to manage visual datasets, one that lays the path for an increasingly ML-driven future.
ML-Ready Visual Database Infrastructure
To solve the visual data management problems and create a solution that brings step change innovation, we asked ourselves:
- Could we design a high-performance, scalable system that recognized the unique nature of visual data and offered interfaces designed to handle it?
- What would ML users’ lives look like if they could spend most of the time focusing on ML and data science rather than worrying about their data infrastructure?
- Could we combine feature search with metadata search to more closely match expected results from a user query?
- Could we offer a unified interface and backend infrastructure that can cater to all the stages of ML and any use case of visual data?
- Could we do more for visual ML?
These questions led us to create the open source Visual Data Management System (VDMS). Using this new system, we enabled a new class of applications to scale to much larger data sizes at radically improved performance. This open source system forms the core of our product, ApertureDB: a unique, purpose-built database for visual analytics.
Introducing ApertureDB
ApertureDB stores and manages images, videos, feature vectors, and associated metadata like annotations. It natively supports complex searching and preprocessing operations over media objects. ApertureDB’s visual data-first approach saves hundreds of hours of data platform engineering efforts spent by data science and ML engineering teams, setting them up for success when scaling their visual analytics pipelines. It removes the time consuming tasks of manually linking visual data with metadata, related access challenges, and overhead of maintaining multiple disparate data systems.
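As a sketch of what this unification looks like in practice, a single request can filter on metadata and apply a visual preprocessing operation server-side, replacing the usual database lookup plus bucket download plus local transform. The JSON shape below follows the style of the open source VDMS query language that ApertureDB builds on; the command, field, and property names (e.g. `patient_age`, `capture_date`) are illustrative assumptions, not exact API documentation.

```python
def find_xray_thumbnails(min_age, width, height):
    """Build a single illustrative query: find images of patients at
    least `min_age` years old, resize them server-side, and list their
    capture dates. Property names here are hypothetical examples."""
    return [{
        "FindImage": {
            # Metadata filter and image operation in one request,
            # instead of separate database + bucket + script steps.
            "constraints": {"patient_age": [">=", min_age]},
            "operations": [
                {"type": "resize", "width": width, "height": height}
            ],
            "results": {"list": ["capture_date"]},
        }
    }]
```

The point is not the exact syntax but that filtering, preprocessing, and metadata retrieval travel together as one query instead of three systems stitched together with glue code.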
Using ApertureDB, (potentially smaller) ML and data science teams can focus on application development and on providing value to their customers. By offloading data infrastructure scaling to ApertureDB, they see an average 15x increase in data access speed. For large ML deployments, ApertureDB reduces network overhead by up to 63% thanks to the optimizations it offers via its unified interface.
Partner with us - use ApertureDB
If your organization uses or intends to use ML on visual data (with a small or large team), or you are simply curious about our technology, our approach to infrastructure development, and where we are headed, please contact us at team@aperturedata.io or sign up for a free trial.
We will be documenting our journey in these blogs; click here to subscribe.
I want to thank Luis Remis, ApertureData co-founder, for helping focus the content. I also want to acknowledge the insights and valuable edits from Namrata Banerjee, Jim Blakley, Jonathan Gray, Priyanka Tembey, and Romain Cledat.