Working with visual data (images, videos) and its metadata is no picnic, especially if you are running analytics on it. You not only need to deal with the visual data itself, but also with multimodal metadata (e.g., bounding box coordinates and their labels, or a text description of an image's content), and even feature vectors, just to name a few. When you complete a project and attempt to apply a "cookie-cutter" approach to a new, yet similar business problem, you discover that you have to deal with all these types of metadata all over again. There is not much re-use. Your prior experience does help, but only linearly, not exponentially. If you were wondering what is missing, you are not alone.
If you have trained an ML model to run object detection, segmentation, or tracking, you are familiar with the long list of steps needed before you even start tuning your neural network model's parameters (let alone run the training and wait for the results).
Let’s walk through some of the steps. As an example, let’s use a popular dataset: Common Objects in Context (or COCO for short). COCO is a large-scale object detection, segmentation, and captioning dataset that includes more than 200 thousand images, together with annotations. These annotations include bounding boxes, object segmentation, and recognition in context for about 1.5 million different objects. While COCO can be stored on a single machine (less than 100GB, including images, labels, bounding boxes, and segmentation information), its rich annotations and complexity make it a great example dataset to start with.
COCO can be used for training models on general objects (for which labels are provided) that can later be fine-tuned (using transfer learning) on more specific objects based on application-specific needs. You could also select a subset of the images for training, throwing away images that are not relevant to your application (i.e., images containing only objects and labels you do not care about).
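To make that selection step concrete, here is a small sketch using the standard pycocotools package to pick images by label from COCO's annotation files. The annotation file path and the category name are assumptions, purely for illustration.

# Sketch: selecting a COCO subset by label with the standard pycocotools package.
# The annotation file path and category name are assumptions; adjust to your local copy.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")
cat_ids = coco.getCatIds(catNms=["dog"])                  # keep only the label(s) we care about
img_ids = coco.getImgIds(catIds=cat_ids)                  # images containing those objects
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0]))   # bounding boxes + segmentation polygons
print(len(img_ids), "relevant images;", len(anns), "annotations in the first one")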
Now, consider an ML developer constructing a pipeline for object detection. First, you need all images in an accessible place, together with the right data structures associating each image and its bounding boxes with the right labels. This requires assigning consistent identifiers to the images and adding their metadata to some form of relational or key-value database. Finally, if the ML pipeline needs images of a size different from the originals in the dataset, additional compute is diverted towards pre-processing after the images are fetched. All these steps require integrating different software solutions that provide various functionalities, stitched together with a script for this specific use case.
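Here is a minimal sketch of the kind of ad-hoc "glue" this usually produces: image files on disk, metadata in SQLite, and resizing done client-side with Pillow. Every name and the schema below are hypothetical, shown only to illustrate the pattern.

# Hypothetical glue pipeline: files on disk + SQLite metadata + client-side resize.
import sqlite3
from pathlib import Path
from PIL import Image

conn = sqlite3.connect("metadata.db")
conn.execute("""CREATE TABLE IF NOT EXISTS images (
    image_id TEXT PRIMARY KEY, file_path TEXT, width INT, height INT, label TEXT)""")

def register_image(image_id, path, label):
    # Assign a consistent identifier and record metadata separately from the pixels.
    with Image.open(path) as im:
        w, h = im.size
    conn.execute("INSERT OR REPLACE INTO images VALUES (?, ?, ?, ?, ?)",
                 (image_id, str(Path(path)), w, h, label))
    conn.commit()

def load_resized(image_id, size=(224, 224)):
    # Fetch the full-size file, then throw most of it away with a resize.
    (path,) = conn.execute(
        "SELECT file_path FROM images WHERE image_id = ?", (image_id,)).fetchone()
    return Image.open(path).resize(size)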
More generally, data scientists and machine learning developers usually end up building an ad-hoc solution that combines databases and file systems to store metadata and visual data (images, videos), respectively. There are N technologies for dealing with visual data and M technologies for metadata. Each is mostly unaware of the others, so a connector, or "glue", has to be written to assemble a solution for a given problem, as we describe in greater detail here. Companies (or teams within the same company) select one of each technology according to immediate needs, which results in multiple implementations of each of the N*M connectors. These ad-hoc solutions make replicating experiments difficult, and, more importantly, they do not scale well when deployed in the real world. The reason behind such complexity is the lack of a single system that can store and serve all the data the application needs.
Now, to add more real-world complexity, let’s assume we work in a team of ML engineers and data scientists. There are many steps in the ML pipeline, and many people involved in the task. It is important for the team to share data, intermediate results, and good practices, and, given the usually growing scale, keeping multiple replicas of the dataset is no longer an option. An ad-hoc solution with loosely defined interfaces is painful to maintain and error-prone. What we need is a database system designed for visual data and ML workloads.
The good news is that at ApertureData, we have been focusing all our efforts on solving these problems. Our solution, ApertureDB, provides an API that enables smooth interaction with visual data. Images, videos, bounding boxes, feature vectors, and labels are first-class entities.
Accelerating Data Preparation
Let’s see how easy it is to interact with the COCO dataset in ApertureDB using Python. As a starting point, we have pre-populated ApertureDB with the dataset. Connecting to ApertureDB is very simple. All we need to do is instantiate a connector to the database, using the binary-compatible VDMS connector (VDMS is the original, open-source version of ApertureDB):
import vdms
db = vdms.vdms() # Instantiate a DB connector
db.connect("coco.datasets.aperturedata.io")
Now, with a connection open to the database, we can start sending queries. We can do this in two ways: using the Native API, or using the ApertureDB Object Model API. The latter allows us to interact with the data in the database as if we were dealing with Python objects, without the need to write queries in the native API. We will use the Object Model API for this demonstration.
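For context, a native API query is a JSON document sent through the same connector. Here is a rough sketch; the command and constraint syntax below follow the open-source VDMS query language, so treat the exact field names as illustrative rather than authoritative.

# Sketch of a native API query: a JSON document sent through the connector.
# Command and field names follow the VDMS-style query language; treat them as illustrative.
query = [{
    "FindImage": {
        "constraints": {"width": [">=", 600]},
        "results": {"list": ["width", "height"], "limit": 5}
    }
}]
response, blobs = db.query(query)  # JSON metadata response plus the matching image blobs
print(response)

The Object Model API wraps this JSON construction for us, as the rest of this post shows.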
Here is an example of how easy it is to filter and retrieve images. Let’s retrieve only the images that are larger than 600x600 pixels and that carry a specific type of Creative Commons license:
from aperturedb import Image
imgs = Image.Images(db)          # Handle for image results
const = Image.Constraints()
const.greater("width", 600)      # Only images wider than 600 pixels
const.greater("height", 600)     # ... and taller than 600 pixels
const.equal("license", 4)        # A specific Creative Commons license id
imgs.search(constraints=const)   # Run the search against the database
And that’s it. The imgs object now contains a handle for all the matching images. We can pass those images to our favorite framework for processing (PyTorch, TensorFlow, etc.). We also provide tools for quick visualization. Showing the images is as simple as it can be:
imgs.display(limit=6) # We limit the number of images for display purposes.
Combining the snippets above finds the images that are larger than 600x600 in resolution and covered by license 4; a few (6) examples are displayed below.
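Beyond quick visualization, the same imgs handle can feed a training framework, as mentioned earlier. Here is a rough sketch of wrapping it in a PyTorch Dataset; the two accessors used are hypothetical stand-ins for whatever your client version provides to get the number of results and the decoded pixel arrays.

# Rough sketch: exposing the retrieved images to PyTorch as a Dataset.
# Both accessors marked "hypothetical" stand in for the real client methods.
import torch
from torch.utils.data import Dataset, DataLoader

class ApertureDBImages(Dataset):
    def __init__(self, imgs):
        self.imgs = imgs

    def __len__(self):
        return self.imgs.total_results()              # hypothetical count accessor

    def __getitem__(self, idx):
        array = self.imgs.get_np_image_by_index(idx)  # hypothetical pixel accessor (H, W, C)
        return torch.from_numpy(array).permute(2, 0, 1).float() / 255.0

loader = DataLoader(ApertureDBImages(imgs), batch_size=32)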
Even though the images are now fully accessible to the application and ready for use, they will typically need to be preprocessed with basic operations like a resize (to match the size expected by the input layer of our neural network), or even a normalization step. It is a shame to transfer the full-size image from ApertureDB only to discard most of that information later with a resize (which will most likely downsample the image to, say, 224x224). But because ApertureDB was designed with ML use cases in mind, these basic pre-processing operations are supported as part of the API. Instead of retrieving the full-size images, we can simply indicate that we want the images to be resized before retrieving them, saving bandwidth and accelerating the fetching significantly.
We can retrieve resized images directly (and even rotate them as a data augmentation technique) by simply doing:
import vdms
from aperturedb import Image
db = vdms.vdms()
db.connect("coco.datasets.aperturedata.io")
imgs = Image.Images(db)
const = Image.Constraints()
const.greater("width", 600)
const.greater("height", 600)
ops = Image.Operations()
ops.resize(224, 224)    # Resize on the server before transferring
ops.rotate(45.0, True)  # Rotate 45 degrees, e.g., as data augmentation
imgs.search(constraints=const, operations=ops)
imgs.display(limit=6)
The result is:
As part of our pipeline, we may also need to retrieve metadata associated with the images. We can easily obtain a list of all the metadata properties associated with the images by doing:
prop_names = imgs.get_props_names()
print(prop_names)
# Response:
['aspect_ratio', 'coco_url', 'date_captured', 'date_insertion', 'file_name', 'flickr_url', 'height', 'id', 'license', 'seg_id', 'type', 'width', 'yfcc_id']
For instance, let’s retrieve the metadata information associated with the images we retrieved in our last query:
import json
props = imgs.get_properties(["coco_url", "height", "width"])
print("Response: ", json.dumps (props, indent=4))
Response:
{
    "341010": {     # This is the image id
        "coco_url": "http://images.cocodataset.org/train2017/000000341010.jpg",
        "height": 640,
        "width": 640
    },
    "208867": {     # This is the image id
        "coco_url": "http://images.cocodataset.org/train2017/000000208867.jpg",
        "height": 640,
        "width": 628
    },
    …
}
What about bounding boxes and segmentation masks?
Because ApertureDB recognizes images, videos, and bounding boxes as first-class citizens of the database, we can retrieve and show these objects with very little effort. For instance, retrieving the bounding boxes associated with the images is as simple as:
imgs.display(limit=6, show_bboxes=True)
Similarly, let’s retrieve the segmentation annotations associated with the images. As with bounding boxes, we can do that with a simple function parameter, as follows.
imgs.display(limit=6, show_segmentation=True)
Note that the segmentation masks will be retrieved from ApertureDB only when explicitly requested (in this case, for displaying them). You can also use other interfaces to retrieve bounding box coordinates, labels, and segmentation polygons directly as JSON objects or Python dictionaries, as sketched below.
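For example, a native API query along the following lines returns bounding box coordinates and labels as plain JSON. This is a sketch: the command and field names follow the VDMS-style query language and are assumptions, so check the query reference for your ApertureDB version.

# Sketch: bounding box coordinates and labels as a JSON/Python dictionary response.
# Command and field names are assumptions based on the VDMS-style query language.
import json

query = [{
    "FindImage": {
        "_ref": 1,                              # reference this image in the next command
        "constraints": {"id": ["==", 341010]},  # image id from the earlier metadata example
        "results": {"list": ["id"]}
    }
}, {
    "FindBoundingBox": {
        "image_ref": 1,                         # boxes attached to the image above
        "coordinates": True,                    # return x, y, width, height
        "results": {"list": ["_label"]}
    }
}]
response, _ = db.query(query)
print(json.dumps(response, indent=4))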
Support for Similarity Search
Another important piece of the puzzle when dealing with visual data is feature vectors. In the context of the COCO dataset, for instance, feature vectors capture information about the nature or content of the images. They represent this information in a high-dimensional space, which can be used to perform a “similarity search” (i.e., finding objects that are similar). This similarity is not tied to metadata (like labels or image properties), but rather to how close those feature vectors are to each other. Because ApertureDB recognizes feature vectors as part of its interface, associating feature vectors with images and, later, searching for similar images is as simple as it can be.
Let's take as an example the following image:
Now, let’s find the 6 most similar images (and their bounding boxes, just because it is so easy):
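Under the hood, this is a k-nearest-neighbor query against the feature vectors (descriptors) stored alongside the images. Here is a rough sketch of what such a search can look like through the native API; the descriptor-set name coco_features is hypothetical, and the FindDescriptor parameters are assumptions based on the open-source query language.

# Sketch: k-nearest-neighbor search over stored feature vectors (descriptors).
# "coco_features" is a hypothetical descriptor-set name; parameter names are assumptions.
import numpy as np

query_vector = np.random.rand(2048).astype(np.float32)  # stand-in for a real embedding

query = [{
    "FindDescriptor": {
        "set": "coco_features",
        "k_neighbors": 6,
        "results": {"list": ["_distance"]}
    }
}]
response, blobs = db.query(query, [query_vector.tobytes()])
print(response)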
As we have seen, interacting with visual data has never been easier, and we are only scratching the surface of ApertureDB’s powerful API.
Last, but not least, ApertureDB offers a web user interface that allows users to explore the content of the database and run image searches. Here is an example of what our web UI looks like for the COCO dataset:
We have made an ApertureDB free trial available through our website, with all the sample code we show here. Simply select the COCO use case from the dropdown menu, and you will receive an email with instructions on how to access an instance of ApertureDB pre-populated with the COCO dataset.
Coming up next…
While COCO is a good case to study because of its rich annotations, its size is rather small. We have designed ApertureDB to support a significantly larger number of objects while providing the same simple interfaces, without sacrificing performance. We have been using much larger datasets, such as the Yahoo Flickr Creative Commons 100 Million dataset (YFCC100m for short), both to support real-world customer use cases and for internal evaluation. We will be releasing this evaluation and more details soon.
Special thanks to Vishakha and Andrei from ApertureData, who helped me shape this article, and to my friends and family who reviewed and annotated it countless times.