Multimodal AI, vector databases, large language models (LLMs), retrieval-augmented generation (RAG), and knowledge graphs are cutting-edge technologies making a significant impact across industries today. These technologies have evolved rapidly in the past few years and seen unprecedented adoption, in some form, in nearly every industry vertical, making them need-to-know rather than nice-to-know. For data practitioners and decision makers especially, it's essential to dig beneath the terminology and understand the challenges that stand in the way of successfully implementing their AI strategies.
Given our specific interest and research in data, which is in large part responsible for the quality of these AI methods, we have created a series of blogs that ranges from introducing the relevant terminology to an advanced cost analysis of getting things wrong. In this first blog of the series, in addition to introducing these terms, let's explore multimodal data, how databases such as vector or graph databases fit in, practical use cases, and how the generative AI wave is reshaping the landscape.
The Fundamentals
Multimodal Data
To diagnose complex symptoms, doctors often prescribe a variety of tests. Think of the CT scans or X-rays that are commonly taken, as well as blood tests and all sorts of other tests used in health and life sciences to diagnose patients and come up with treatment plans. Multimodal data here refers to all the different types of data that, when combined, the doctor uses to improve your health. And it is not just at the doctor's office: multimodal data (like video, audio, sensor readings, text, etc.) can be captured virtually everywhere. Most of this data also goes beyond structured representations that could be captured in spreadsheets or relational databases, to unstructured formats that are not as easy to search through but are more representative of how humans understand the world.
Multimodal AI
To bridge the gap between human-like understanding and AI capabilities, multimodal AI combines different types of data, such as text, images, videos, or audio, to generate more comprehensive and accurate outputs. Consider autonomous cars. To operate in a rich human environment, these vehicles collect many different types, or modalities, of data that enable them to travel without a driver. This data could come from cameras collecting video from both inside and outside the vehicle, recorders capturing audio of the passengers as well as the external environment, sensors collecting radar or LiDAR (Light Detection and Ranging) data, weather information, and so on. Multimodal AI combines all of this data and, with AI models, generates the commands that drive and control the car, for example providing depth perception so the car doesn't hit anything or anyone.
Generative AI
This is probably a household term by now, ever since OpenAI's ChatGPT gained popularity. Generative AI (Gen AI) leverages machine learning techniques to generate data similar to, or derived from, the datasets it was trained on, in an attempt to mimic human responses to questions. This includes creating new content such as images, music, and text. A key aspect of Gen AI is its ability to work with both text and multimodal data. Text-based generative AI can create human-like text, while multimodal generative AI can understand and generate content that combines multiple data types, such as text, audio, video, and images. This makes Gen AI a powerful method in a variety of fields, from content creation to data augmentation. Gen AI methods often use semantic search combined with other techniques to find the right subset of supporting data in the context of the question asked, and then use large models to generate their responses.
Vector Embeddings
For all of this multimodal information to be useful, we want to be able to search through this vast collection to find what we are looking for (semantic search). We may know some keywords to guide this search, or we may know what the information should contain or look like.
Imagine how large and complicated all this unstructured data is and how hard it would be to search through it. If trying to do facial recognition on a casino floor, for example, huge volumes of video and images would have to be compared pixel by pixel to find a specific face in the crowds. It would involve a lot of manual visual inspection, making it extremely inefficient. But it gets dramatically easier with embeddings.
Embeddings give a simpler, lower-dimensional representation of a specific piece of data, so it is faster and easier to spot what we are looking for. Initially they may give approximate answers that can then be used to target more specific or relevant ones.
To extract embeddings, you can leverage different, potentially off-the-shelf models such as FaceNet, YOLO, or the large language models offered by OpenAI, Cohere, Google, Anthropic, and numerous open source alternatives, by running inference on your data and, commonly, extracting the output of the penultimate layer.
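As an illustration, here is a minimal sketch of that "penultimate layer" idea using a pretrained image model from torchvision. The choice of ResNet-50 and the image filename are assumptions for the example, not a recommendation of a specific model.

```python
# A minimal sketch: extract an image embedding by dropping a pretrained
# network's classification head and keeping the features underneath it.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # remove the classifier; keep the 2048-d features
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")   # hypothetical input image
with torch.no_grad():
    embedding = model(preprocess(image).unsqueeze(0))  # shape: (1, 2048)
```

The same pattern applies to text or audio: run inference with a suitable model and store the resulting vector instead of the raw data.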
Multimodal Embeddings
Multimodal embeddings are a way of linking different types of information, like text and images, into a shared understanding. This means that a computer can look at a picture (for example of a dog) and understand its meaning in relation to a piece of text (like ‘cute golden retriever puppies’), or vice versa. By combining these different forms of data, multimodal embeddings help computers interpret and interact with the world in a more human-like way.
Contrastive Language-Image Pre-Training (CLIP), developed by OpenAI, is one of the earliest and most prominent models to use this approach. It learns by looking at many image and text pairs from the internet, figuring out which texts go with which images. This helps it know what an image is showing just by looking at it, even if it hasn't seen that exact image before. Newer models like OpenAI's GPT-4o and Google's Gemini series extend multimodal capabilities to other types of data, like audio, to get closer to human-like abilities of interpretation and search.
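To make this concrete, here is a small sketch of comparing an image against text prompts with a publicly available CLIP checkpoint via Hugging Face Transformers; the image file and captions are made-up examples.

```python
# A minimal sketch: score how well each caption matches an image using CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("puppy.jpg")                      # hypothetical image
texts = ["cute golden retriever puppies", "a city skyline at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)      # similarity of the image to each caption
print(dict(zip(texts, probs[0].tolist())))
```

Because both the image and the captions land in the same embedding space, either one can be used to search for the other.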
Why We Need Vector Databases
Regardless of the source of these embeddings or feature vectors, finding similar data leads to complex requirements due to their high-dimensional nature. As a data scientist or an entire data team uses their well-trained AI models to extract embeddings, they need to efficiently index them for search and classification. Vector databases, and now some traditional databases, offer special indexes to accommodate these high-dimensional vectors.
With these indexes, as well as some clustering and other algorithmic magic, a vector database can return a label for a given embedding based on its closest matches, which is classification. Picture an image of Michelle Obama: the vector database can return a 'label' that tells you it is a picture of Michelle Obama. A vector database can also return the other closest vectors in the given search space. Imagine starting with an image of Taylor Swift and finding similar images of people who look like her.
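Under the hood, this is nearest-neighbor search over vectors. Below is a minimal sketch using the FAISS library; the random 512-dimensional vectors are placeholders standing in for real image or text embeddings, and a production system would typically use an approximate index rather than the exact one shown here.

```python
# A minimal sketch of nearest-neighbor search over embeddings with FAISS.
import numpy as np
import faiss

dim = 512
database_vectors = np.random.rand(10_000, dim).astype("float32")   # indexed embeddings
query_vector = np.random.rand(1, dim).astype("float32")            # embedding of a new query

index = faiss.IndexFlatL2(dim)      # exact L2 index; large deployments use ANN indexes
index.add(database_vectors)

distances, ids = index.search(query_vector, 5)   # 5 closest matches
print(ids[0], distances[0])
```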
Vector databases are therefore used in applications that look for similar items (text or multimodal), classify data that is missing labels, or help Gen AI applications as a step towards creating a response to a user's query.
LLMs, RAGs, Knowledge Graphs
Large Language Models (LLMs) are trained on a vast amount of text data (there are large vision models for visual data, and so on). LLMs are designed to generate human-like text based on the input they are given. LLMs can understand context, answer questions, write essays, summarize texts, translate languages, and even generate creative content like poems or stories. They are used in a variety of applications, including chatbots, text editors, and more. Examples of LLMs include OpenAI’s GPT-3 and GPT-4.
If a question asked by a user requires context that a large model was not trained on, the model can potentially hallucinate an answer. One of the methods to correct that, or to have the model say "I don't know the answer," is Retrieval-Augmented Generation (RAG): take the first set of matches for a query, reorder them based on relevant context or augment them with newer data that was not included in the model's training, and then generate an answer. The methods can range from applying a different ranking algorithm to the text or multimodal data that matched the first query, all the way to attaching richer context, for example as retrieved from knowledge graphs.
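The pattern itself is simple: retrieve the most relevant supporting data, then hand it to the model as context. Here is a minimal sketch of that flow; `embed()` and `call_llm()` are hypothetical stand-ins for whatever embedding model and LLM API you actually use, and the documents are made-up examples.

```python
# A minimal sketch of the RAG pattern: retrieve relevant passages, then
# ask an LLM to answer using only that retrieved context.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; returns a placeholder vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384).astype("float32")

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (OpenAI, Anthropic, a local model, etc.)."""
    return "[LLM answer grounded in the supplied context]"

documents = [
    "ApertureDB stores images, videos, embeddings, and metadata together.",
    "Knowledge graphs represent entities and their relationships.",
]
doc_vectors = np.stack([embed(d) for d in documents])

question = "How are entities and relationships represented?"
q = embed(question)
scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
context = documents[int(scores.argmax())]            # top-1 retrieved passage

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(call_llm(prompt))
```

In practice the retrieval step is backed by a vector or multimodal database rather than an in-memory array, and more than one passage is usually retrieved and re-ranked.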
A knowledge graph is a semantic data representation that describes real-world concepts and their relationships. In these graphs, nodes represent real-world entities (e.g., a person or a product) and edges represent their relationships (e.g., a person buys a product). The connections between nodes and edges provide rich semantic information, which helps infer knowledge about whichever domain the graph is constructed for. The core unit of a knowledge graph is the "Entity-Relationship-Entity" triplet, and graph databases can be used to represent it when building applications. There are now models that help you not only extract embeddings from a piece of text but also extract the relevant named entities and their relationships, which can then be used to construct these knowledge graphs. This is a great intertwining of vector databases, graph databases, various models, and access to the relevant data itself.
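As a small illustration of those triplets, here is a sketch that builds a toy knowledge graph with the networkx library; the entities and relationships are invented examples, and a graph database would play this role in a real application.

```python
# A minimal sketch: represent "Entity-Relationship-Entity" triplets as a graph.
import networkx as nx

triplets = [
    ("Alice", "buys", "Laptop"),
    ("Laptop", "manufactured_by", "Acme Corp"),
    ("Alice", "lives_in", "Seattle"),
]

graph = nx.DiGraph()
for subject, relation, obj in triplets:
    graph.add_edge(subject, obj, relation=relation)

# Traverse the graph to surface everything directly related to one entity.
for _, neighbor, data in graph.edges("Alice", data=True):
    print(f"Alice --{data['relation']}--> {neighbor}")
```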
Next Steps
Vector databases are useful for managing and analyzing embeddings, but modern data needs are more complex than vectors alone. Databases for multimodal data like ApertureDB offer a unified platform that combines these different functions, giving businesses a strong solution for data management and analysis in today's fast-changing world.
Want to learn more? Continue reading the next blog in the series, where we look at real-life examples of how multimodal AI is used. These examples show why we need advanced systems that do more than basic data searches, and why specialized multimodal AI databases are needed to handle these complex tasks efficiently and reliably.
Last but not least, we will be documenting our journey and explaining all the components listed above on our blog; subscribe here.
I want to acknowledge the insights and valuable edits from Laura Horvath and Drew Ogle.