In the first blog of this series, we covered the basics of multimodal AI, generative AI, and how vector databases enable semantic search over text and multimodal data. In this second blog of the series, we’ll look at real-life examples of where vector databases are used when building text and multimodal AI applications: semantic search, often for question-answering solutions, recognizing faces for security, and using robots to check store shelves. These examples will help us understand the fundamental requirements of various text and multimodal AI applications, and give us an idea of why we need advanced data solutions, beyond vector databases, that can enable more complex and contextual data searches.
Search and Retrieval Requirements of Sample Multimodal AI Applications
Vector indexing, search, and classification are hard problems, tackled by a growing collection of vector databases in the market as well as incumbents that have introduced vector support. However, these databases can’t do everything that is expected from the data layer in a classic machine learning pipeline, regardless of whether it’s for traditional use cases or for GenAI applications.
Let’s walk through a few examples we have encountered when working with our users to really understand what’s required from the data layer to support multimodal AI in the real world.
Chat Support or Question-Answering Using Semantic Search
A common example we come across now is chat or question-answering support (let’s call it a chatbot for now), thanks to LLM-based bots like ChatGPT that often use semantic search. Let’s start with a manufacturer that makes millions of different products, each of which reports error codes when something goes wrong and ships with a product manual. A customer experiences an issue and sees an error code but doesn’t understand what it means. They visit the manufacturer’s chatbot and enter the error code, as a first step of similarity search. Next, they want to narrow the search to products sold in 2019, adding metadata filtering. The customer also wants a copy of the product manual to learn more: not only an ‘answer’ explaining the error code, but a link to the complete original manual, perhaps with some author or date restrictions. The query has evolved from similarity search to complex metadata filtering followed by PDF access, making it a problem that vector databases, and even incumbent databases with “vector add-ons”, cannot solve on their own.
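To make the shape of this query concrete, here is a minimal, self-contained Python sketch of the three stages. Everything in it is hypothetical (the `embed` stand-in, the toy records, the manual paths); in a real deployment these stages would span an embedding model, a vector index, a metadata store, and a document store.

```python
import zlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real text-embedding model (deterministic toy vectors)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# Toy knowledge base: one record per documented error code.
records = [
    {"text": "E42: pump pressure out of range", "year_sold": 2019,
     "manual": "manuals/pump-x100.pdf"},
    {"text": "E17: filter clogged, replace cartridge", "year_sold": 2021,
     "manual": "manuals/filter-z3.pdf"},
    {"text": "E42: drum unbalanced during spin cycle", "year_sold": 2019,
     "manual": "manuals/washer-w7.pdf"},
]
vectors = np.stack([embed(r["text"]) for r in records])

# Step 1: pure similarity search on the error code the customer typed.
query = embed("what does error E42 mean")
scores = vectors @ query

# Step 2: metadata filtering -- restrict to products sold in 2019.
mask = np.array([r["year_sold"] == 2019 for r in records])
scores = np.where(mask, scores, -np.inf)

# Step 3: return the answer plus a pointer to the original manual. The PDF
# itself lives outside the vector index, so a third system must serve it.
best = records[int(np.argmax(scores))]
print(best["text"], "->", best["manual"])
```

Each stage is trivial on its own; the pain in production is that the three stages typically live in three different systems that must be kept consistent.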
Facial Recognition
Another very simple but common use case is finding a specific face in a collection of faces, say for surveillance and safety, or just to remember someone’s name. We start with a similarity search over face images, but then we want to restrict the search by nationality, say to people from the United States only. This introduces a metadata attribute. Next, we remember that the person we are searching for acted in a specific movie, so we add that constraint. We would also like to see video clips of the movies in which this person appeared. As you can see, the search gradually evolved from simple similarity search to more and more complex metadata filtering followed by video access, capabilities that vector databases were simply not built to support on their own.
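Here is a hypothetical sketch of why this hurts: the nationality filter is flat metadata, but the “acted in this movie” constraint is relational, so the application ends up joining vector-search results against a separate relational store. All names and data below are made up, with SQLite standing in for that second system.

```python
import sqlite3
import numpy as np

rng = np.random.default_rng(0)
face_vectors = rng.normal(size=(4, 64))  # toy face embeddings in the "vector index"
face_ids = ["p1", "p2", "p3", "p4"]      # IDs the vector index hands back

# SQLite stands in for the separate relational store holding the metadata.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE person (id TEXT, name TEXT, nationality TEXT);
    CREATE TABLE acted_in (person_id TEXT, movie TEXT, clip_uri TEXT);
""")
db.executemany("INSERT INTO person VALUES (?, ?, ?)", [
    ("p1", "Ann", "United States"), ("p2", "Bo", "France"),
    ("p3", "Cy", "United States"), ("p4", "Di", "United States")])
db.executemany("INSERT INTO acted_in VALUES (?, ?, ?)", [
    ("p1", "Heist Movie", "clips/heist-ann.mp4"),
    ("p3", "Space Drama", "clips/space-cy.mp4")])

# Step 1: similarity search against the probe face.
probe = rng.normal(size=64)
order = np.argsort(-(face_vectors @ probe))  # best matches first

# Steps 2 and 3: the nationality filter and the acted-in constraint run in
# the relational store, re-resolving each ID the vector index returned.
for idx in order:
    row = db.execute(
        "SELECT p.name, a.clip_uri FROM person p "
        "JOIN acted_in a ON a.person_id = p.id "
        "WHERE p.id = ? AND p.nationality = 'United States' "
        "AND a.movie = 'Heist Movie'",
        (face_ids[int(idx)],)).fetchone()
    if row:
        print("match:", row[0], "-> clip to fetch:", row[1])
        break
```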
Visual Recommendations
Let’s say an e-commerce company with millions of products wants to support visual recommendations based on the colors and shapes of products, but wants to limit items to its fall 2023 catalog or to a specific supplier’s catalog. The goal is then similarity search constrained by complex metadata filtering, followed by image access.
Before running these queries, many steps must be completed to support visual recommendations for the e-commerce company. Datasets must be easy to create and manage so they can be revised with every catalog update. Models that extract embeddings for visual search must be trained iteratively over the millions of images that are constantly being updated. Finally, the embeddings need to be indexed effectively to support similarity search quickly and easily at scale. A vector database can solve that last part, but a) managing the datasets of product and image data, b) training with millions of relevant images, and c) keeping track of all the metadata, embeddings, and images to respond to users all require seamless integration with the entire machine learning pipeline, not just vector or metadata search support.
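A rough sketch of that recurring pipeline work, with `embed_image` and the catalog rows as hypothetical stand-ins, might look like this. Note how much of the code is about keeping the catalog, the embeddings, and the index in sync rather than about search itself.

```python
import zlib
import numpy as np

def embed_image(path: str) -> np.ndarray:
    """Stand-in for a trained visual model (color/shape embedding)."""
    rng = np.random.default_rng(zlib.crc32(path.encode()))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# One row per product; this must stay in sync with every catalog update.
catalog = [
    {"sku": "A1", "image": "img/a1.jpg", "catalog": "fall2023", "supplier": "s9"},
    {"sku": "B2", "image": "img/b2.jpg", "catalog": "spring2023", "supplier": "s9"},
    {"sku": "C3", "image": "img/c3.jpg", "catalog": "fall2023", "supplier": "s4"},
]

def rebuild_index(rows):
    """Re-embed every image; at millions of products this is a batch pipeline."""
    return np.stack([embed_image(r["image"]) for r in rows])

index = rebuild_index(catalog)  # repeated after each catalog or model revision

# The actual query: visually similar items, limited to the fall 2023 catalog.
scores = index @ embed_image("img/query.jpg")
allowed = np.array([r["catalog"] == "fall2023" for r in catalog])
best = int(np.argmax(np.where(allowed, scores, -np.inf)))
print("recommend:", catalog[best]["sku"], "-> fetch image:", catalog[best]["image"])
```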
Smart Retail
Ever seen a robot checking for empty shelves at a grocery or retail store? The robots go up and down the aisles looking for empty shelves, then determine which product needs restocking. This query involves matching thousands of vectors to their corresponding product names, which requires classification followed by product lookup, at scale.
Again, before the queries can be executed, many tasks need to be completed and managed effectively. All the labeled data must be created and managed so it can be easily updated with every new image of an empty shelf. Models must be constantly retrained as new products are added or old products removed, which requires large numbers of images. Finally, all the embeddings need to be indexed so that queries perform quickly and reliably enough to support the business.
Source: https://www.badger-technologies.com/resource/badger-resources.html
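As a toy illustration of the query itself, here is a nearest-centroid classifier that assigns each detected shelf region to a product name. The embeddings, centroids, and product names are all hypothetical stand-ins for a trained model and a real product database.

```python
import numpy as np

rng = np.random.default_rng(7)

# Class centroids learned from labeled product images, retrained whenever
# products are added or removed; the dict maps class index -> product name.
centroids = rng.normal(size=(3, 32))
product_names = {0: "cereal-brand-x", 1: "pasta-500g", 2: "canned-soup"}

def classify(region_embedding: np.ndarray) -> str:
    """Nearest-centroid classification followed by product-name lookup."""
    dists = np.linalg.norm(centroids - region_embedding, axis=1)
    return product_names[int(np.argmin(dists))]

# A real aisle scan yields thousands of regions; a tiny batch for show.
shelf_regions = rng.normal(size=(5, 32))
for i, region in enumerate(shelf_regions):
    print(f"empty slot {i}: restock with", classify(region))
```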
Building a Data Stack for Current and Future AI Applications
All of these examples share common requirements: index high-dimensional embeddings from any multimodal data type, allow metadata-based filtering, and retrieve the corresponding data after the search. This imposes certain requirements on the data layer, such as:
- High throughput vector search and classification
- Filter on rich and evolving metadata
- Ability to connect with original data
- Seamless integrations with various steps of an AI pipeline beyond just a query or analytics framework
- Reliability and stability at scale to support production
- Production-ready, cloud-agnostic, often virtual private cloud (VPC) deployments
But can vector databases handle all this by themselves? Today’s popular databases can:
- Store feature vectors and return an ID, but the ID needs to be managed separately and linked to different data types (see the sketch after this list)
- Sometimes allow users to attach a row of columns with some metadata, but a handful of columns makes it difficult to represent the complex metadata typical of real applications
- Rarely support complex filtering, such as overlaying intricate schemas that mimic graphs, even in incumbent relational databases that allow vector search
- Often lack the ability to manage multimodal data and provide provenance information
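The first limitation, ID management, is worth a concrete look. In the hypothetical sketch below, the vector index only knows integer IDs, so the application itself must maintain the mappings to metadata and to the original objects; every name and store here is a made-up stand-in.

```python
import numpy as np

vectors = np.random.default_rng(1).normal(size=(3, 16))  # the "vector index"

# Mappings the application must maintain and keep consistent on its own:
vector_id_to_key = {0: "uuid-a", 1: "uuid-b", 2: "uuid-c"}
metadata_store = {"uuid-a": {"type": "image", "label": "shoe"},
                  "uuid-b": {"type": "pdf", "label": "manual"},
                  "uuid-c": {"type": "video", "label": "clip"}}
object_store = {"uuid-a": "s3://bucket/a.jpg",
                "uuid-b": "s3://bucket/b.pdf",
                "uuid-c": "s3://bucket/c.mp4"}

query = np.random.default_rng(2).normal(size=16)
hit = int(np.argmax(vectors @ query))  # the index returns a bare integer ID

key = vector_id_to_key[hit]   # hop 1: translate the integer ID
meta = metadata_store[key]    # hop 2: look up metadata elsewhere
uri = object_store[key]       # hop 3: fetch the original data elsewhere again
print(meta["label"], "->", uri)
# If any one of the three stores drifts out of sync, results silently break.
```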
While vector databases are able to solve the vector indexing and search challenges, you need to add other databases and storage options to the data pipeline and architecture to handle the rest, resulting in a complex, glued-together system that is brittle and painful to maintain and reuse. We cover the other database alternatives and their pros and cons in detail in our blog on multimodal data requirements.
Given that traditional data and database tools result in a spaghetti data architecture, a form of technical debt that should be avoided, one solution is to architect a purpose-built database that can manage complex and varied data types while exposing a unified interface to index, search, and retrieve those data types. With the right implementation, it can save companies a lot of headaches and wasted engineering time. ApertureDB is one such database. Databases like ApertureDB focus on multimodal data and AI, giving businesses a unified, scalable, and easy-to-use solution for data management and analysis in today’s fast-changing world.
What Next?
Want to learn more? To continue digging deeper into the world of multimodal data and AI, check out these blogs: a) Why Do We Need A Purpose-Built Database For Multimodal Data?, b) Your Multimodal Data Is Constantly Evolving - How Bad Can It Get?, and c) How a Purpose-Built Database for Multimodal AI Can Save You Time and Money.
Last but not least, you can subscribe here to learn more about text and multimodal AI and data as we document lessons from our journey in our blog.
I want to acknowledge the insights and valuable edits from Laura Horvath and Drew Ogle.