
Lessons Learned Building a Cloud-Agnostic Database

December 12, 2024
Vishakha Gupta

In our previous blog, we covered the need to define infrastructure in a cloud-agnostic manner. In this article, we’ll dive deeper into the core challenges of designing cloud-agnostic infrastructure. Because we ran into many of these challenges while building ApertureDB, a cloud-agnostic database purpose-built for multimodal data and metadata, we’ll also add some color from what we learned along the way.

Challenge #1: How to handle distributed deployments

We knew from the start that building a modern database for advanced AI applications meant that high levels of scalability, availability, and performance were going to be table stakes for small startups and large enterprises alike. That’s why one of the goals for ApertureDB was to efficiently manage distributed deployments.

But when you’re building a cloud-agnostic database or similar infrastructure software, handling distributed deployments requires careful planning to ensure portability, consistency, and performance across cloud platforms.

Kubernetes as the key

Kubernetes is an open-source platform that makes it easy to deploy and scale containerized applications, such as a database instance packaged as a Docker image. For example, you could take a MongoDB database, containerize it using a Docker image, and then write Kubernetes YAML manifests to deploy the container. Kubernetes then makes it easy to horizontally scale the MongoDB database by adding new “pods,” to monitor the pods’ performance and logs, and to manage database upgrades. Essentially, Kubernetes handles much of the heavy lifting required in application management.
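
To make this concrete, here is a minimal sketch of that MongoDB example using the official Kubernetes Python client (the kubernetes package); the deployment name, namespace, and image tag are illustrative, and the same deployment is more commonly expressed as a YAML manifest applied with kubectl.

```python
# Minimal sketch: deploy and horizontally scale a containerized database
# on Kubernetes with the official Python client (pip install kubernetes).
# Deployment name, namespace, and image tag are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use the current kubectl context

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="mongodb"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "mongodb"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "mongodb"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="mongodb",
                        image="mongo:7.0",
                        ports=[client.V1ContainerPort(container_port=27017)],
                    )
                ]
            ),
        ),
    ),
)

apps = client.AppsV1Api()
apps.create_namespaced_deployment(namespace="default", body=deployment)

# Horizontal scaling is then a one-line patch; Kubernetes adds the new pods.
apps.patch_namespaced_deployment_scale(
    name="mongodb", namespace="default", body={"spec": {"replicas": 3}}
)
```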

Given Kubernetes’ status as the most popular open-source container orchestration platform, it’s vital to understand how it integrates with your chosen cloud provider. Most cloud providers offer managed Kubernetes services, and Kubernetes can just as well run in a private data center. This is why we chose to package ApertureDB’s components into Docker containers, making them easy to deploy in Kubernetes clusters using Terraform or Helm chart configurations.

ApertureDB's cloud-agnostic architecture

At its core, ApertureDB is a set of Docker images deployed in a Kubernetes environment via Terraform (or a Helm chart). ApertureDB implements some modules differently depending on the cloud provider, such as the object store or file system interaction layer. Written in Terraform, these implementations expose standardized interfaces to ApertureDB’s core module, making them plug-and-play across cloud providers. This modular approach ensures that ApertureDB can seamlessly integrate with any cloud provider offering Kubernetes, including AWS, Azure, and GCP.

By building the foundation of your data layer on Kubernetes, you can reuse a significant portion of your deployment code across different cloud providers. This strategy not only simplifies the deployment process but also enhances the flexibility and portability of your system, ensuring that you can easily adapt to any cloud environment.

Challenge #2: How to scale storage capacity 

Another goal we had when building a cloud-agnostic database was to be able to scale storage capacity while continuing to work across cloud platforms. Beyond the typical considerations, like which data APIs are supported and how storage fees are calculated, there are additional aspects to consider. For example, what happens when you start operating multiple cloud object stores simultaneously? How do you navigate each store’s permission model, throughput characteristics, and file-system semantics?

As we built ApertureDB, we realized that the complexity of storage-agnostic cloud deployment was begging for a simple, high-performance abstraction layer. By connecting to the major cloud providers’ object stores through their SDKs, and building into our own server a storage abstraction that can take advantage of each provider’s specific configurations and optimizations, we were able to simplify storage management across clouds.

Challenge #3: How to standardize the unstandardized

Designing for multi-cloud support meant we specifically had to standardize the storage and load-balancer interfaces to work with any cloud provider, which we detail below.

Storage interface

ApertureDB’s standardized storage interface is designed to function uniformly regardless of the cloud provider chosen by the user. This interface accepts the same inputs and produces consistent outputs across different platforms, making integration straightforward and hassle-free.

Key Elements:

The primary inputs to the storage interface are the bucket name and the object name. Credentials are specified through a separate mechanism, handled independently of the storage interface, since credential handling varies across cloud providers; these credentials ensure secure, authenticated access to the data. By abstracting these details, ApertureDB simplifies interaction with various cloud storage services, such as Amazon S3 buckets and Google Cloud Storage (GCS) buckets, and can even be used with a POSIX-compliant filesystem.
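
To illustrate the shape of such an interface, here is a hypothetical sketch in Python; the class and method names are ours for illustration, not ApertureDB’s actual internals, and credentials are resolved outside the interface (environment variables, instance roles, or key files), as described above.

```python
# Hypothetical sketch of a standardized storage interface: every backend
# answers to the same two inputs, bucket name and object name. Class and
# method names are illustrative, not ApertureDB's actual internals.
from abc import ABC, abstractmethod
from pathlib import Path


class ObjectStore(ABC):
    @abstractmethod
    def get(self, bucket: str, object_name: str) -> bytes: ...

    @abstractmethod
    def put(self, bucket: str, object_name: str, data: bytes) -> None: ...


class S3Store(ObjectStore):
    def __init__(self):
        import boto3
        # Credentials are resolved outside the interface: environment
        # variables, shared config, or an instance role.
        self._s3 = boto3.client("s3")

    def get(self, bucket: str, object_name: str) -> bytes:
        return self._s3.get_object(Bucket=bucket, Key=object_name)["Body"].read()

    def put(self, bucket: str, object_name: str, data: bytes) -> None:
        self._s3.put_object(Bucket=bucket, Key=object_name, Body=data)


class GCSStore(ObjectStore):
    def __init__(self):
        from google.cloud import storage
        # Credentials come from GOOGLE_APPLICATION_CREDENTIALS or the
        # attached service account.
        self._client = storage.Client()

    def get(self, bucket: str, object_name: str) -> bytes:
        return self._client.bucket(bucket).blob(object_name).download_as_bytes()

    def put(self, bucket: str, object_name: str, data: bytes) -> None:
        self._client.bucket(bucket).blob(object_name).upload_from_string(data)


class PosixStore(ObjectStore):
    """Maps a 'bucket' to a directory on a POSIX-compliant filesystem."""

    def __init__(self, root: str):
        self._root = Path(root)

    def get(self, bucket: str, object_name: str) -> bytes:
        return (self._root / bucket / object_name).read_bytes()

    def put(self, bucket: str, object_name: str, data: bytes) -> None:
        path = self._root / bucket / object_name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
```

The calling code never branches on the provider; the appropriate backend is selected once, at deployment time.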

This standardization allows ApertureDB to provide a consistent storage experience, whether the user is operating on AWS, GCP, or any other supported cloud platform.

Load-balancer interface

Similarly, ApertureDB’s load-balancer interface is standardized to operate uniformly across cloud providers. This interface ensures that the necessary inputs are compatible with any environment, facilitating seamless load balancing and high availability. We chose to implement our own load balancer because each cloud provider imposes different limitations, and those limitations created scale-out hurdles when dealing with large objects, which ApertureDB handles regularly given its support for image and video data.

Key Elements:

  • Inputs: The load-balancer interface requires two primary inputs:
    • Kubernetes Node Group: This defines the group of nodes in the Kubernetes cluster that will handle the load.
    • Kubernetes Node Port Details: These include ApertureDB’s TCP port and the HTTP/HTTPS ports of ApertureDB’s internal application load balancer.

By using these standardized inputs, ApertureDB can configure load balancing appropriately for each cloud provider via its native mechanism: a Network Load Balancer (NLB) on AWS and global forwarding rules on GCP.
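
As a rough illustration of that mapping, here is a hypothetical Python sketch; the spec fields mirror the inputs listed above, while the function and output keys are invented for illustration and do not reflect ApertureDB’s actual Terraform modules.

```python
# Hypothetical sketch: one cloud-neutral load-balancer spec, translated
# into provider-specific settings. Field and key names are illustrative.
from dataclasses import dataclass


@dataclass
class LoadBalancerSpec:
    node_group: str   # Kubernetes node group that receives the traffic
    tcp_port: int     # ApertureDB's TCP node port
    http_port: int    # HTTP node port of the internal application load balancer
    https_port: int   # HTTPS node port of the internal application load balancer


def render_lb_config(spec: LoadBalancerSpec, provider: str) -> dict:
    """Translate the cloud-neutral spec into provider-specific settings."""
    ports = [spec.tcp_port, spec.http_port, spec.https_port]
    if provider == "aws":
        # One NLB listener per exposed node port.
        return {"kind": "nlb", "target_group": spec.node_group, "listeners": ports}
    if provider == "gcp":
        # One global forwarding rule per exposed node port.
        return {"kind": "global_forwarding_rule", "backend": spec.node_group, "ports": ports}
    raise ValueError(f"unsupported provider: {provider}")
```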

Ensuring feature parity in your cloud-agnostic architecture

If you’re worried about feature parity across platforms, it’s important to note that most major cloud providers cover much of the same ground and offer very similar feature sets, resource types, and tools. Take a single feature at random—say, container registries for storing container images. Amazon offers its ECR, Google its Artifact Registry, and Azure its ACR. ApertureDB users already deploy or are in the process of deploying across all three of these providers. 

Lessons learned

In today's dynamic landscape, where you might need features from OpenAI in Azure, Gemini in GCP, or Anthropic in AWS, staying cloud-agnostic in your build is a smart, strategic choice that preserves optionality. But architecting a product to support that is hard: you have to optimize performance, cost, and ease of use, and you need scalable elements that won’t lock you into nightmare tech debt in a year’s time. 

The most important thing when building or choosing a tool like ApertureDB for multimodal data management is to understand exactly how the tool can plug into other pieces of a product’s architecture. Choosing a cloud provider is just one of many elements involved in architecting a product. 

We thought we’d close out with some of the key lessons learned along the way while building our own cloud-agnostic database:

  1. Choose flexible software: choosing tools that work across cloud providers to form your fundamental infrastructure layers will reduce the burden down the line. 
  2. Prioritize resource planning: even with cloud-agnostic tools, there are always certain configurations that can throw you off when deploying to a new cloud provider. Costs of machine resources, their performance mappings, and the time for your team to actually deploy on different clouds can add months to your project.
  3. Don’t underestimate the learning curve: working across cloud providers is not a trivial undertaking and can add weeks to your timeline, unless you build a team that has already worked in multi-cloud environments.

By choosing software that offers maximum flexibility and scalability, you can build a robust, cloud-agnostic architecture. If you work with multimodal data and want to see how ApertureDB can simplify these challenges, contact us at team@aperturedata.io.

Stay informed about our journey by subscribing to our blog.

I want to acknowledge the insights and valuable edits from JJ Nguyen, Ali Asadpoor and Ian Yanusko.
