Creating Advanced Retrieval-Augmented Generation (RAG) Systems Using Ollama and Embedding Models

Combining retrieval-based methods with generative capabilities can significantly enhance the performance and relevance of AI applications. This approach, known as Retrieval-Augmented Generation (RAG), leverages the best of both worlds: the ability to fetch relevant information from vast datasets and the power to generate coherent, contextually accurate responses. In this detailed blog post, we will explore how to build an advanced RAG system using Ollama and embedding models, specifically targeted at mid-level developers.

Introduction to RAG Systems

Retrieval-Augmented Generation (RAG) systems integrate two primary components:

  1. Retriever: This component searches through a knowledge base to find relevant documents or pieces of information.

  2. Generator: An LLM that generates responses based on the retrieved documents.

By combining these components, RAG systems can provide more accurate and contextually relevant answers than standalone generative models.
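
In code, the interaction between the two components boils down to a retrieve-then-generate loop. The sketch below uses placeholder names (knowledge_base, llm, search) rather than real APIs; the concrete Ollama and ChromaDB calls follow in the steps below.

def rag_answer(query, knowledge_base, llm):
    # 1. Retrieve: find the documents most relevant to the query
    relevant_docs = knowledge_base.search(query, top_k=3)

    # 2. Augment: fold the retrieved context into the prompt
    prompt = f"Using this information: {relevant_docs}, respond to the query: {query}"

    # 3. Generate: let the language model produce the final answer
    return llm.generate(prompt)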

Setting Up the Environment

Before we dive into building a RAG system, let's set up our development environment.

Prerequisites

  1. Python: Ensure you have Python 3.8 or later installed.

  2. Ollama: Download and install Ollama from the official website.

  3. Dependencies: Install the necessary Python libraries.

pip install ollama chromadb pandas matplotlib
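
Before continuing, you can confirm that the Ollama server is running and reachable from Python with a quick sanity check (this assumes the Ollama daemon has already been started):

import ollama

# Lists the models pulled locally; raises a connection error
# if the Ollama server is not running.
print(ollama.list())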

Step 1: Data Preparation

To demonstrate the RAG system, we will use a small sample dataset of text documents. For this example, we'll use a handful of short facts about llamas.

Sample Data

documents = [
    "Llamas are members of the camelid family, meaning they're closely related to vicuñas and camels.",
    "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands.",
    "Llamas can grow as much as 6 feet tall, though the average llama is between 5 feet 6 inches and 5 feet 9 inches tall.",
    "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight.",
    "Llamas are vegetarians and have very efficient digestive systems.",
    "Llamas live to be about 20 years old, though some live up to 30 years."
]

Step 2: Generate Embeddings

Embeddings are vector representations of the documents. These vectors capture the semantic meaning of the text, allowing us to compare and retrieve similar documents efficiently.
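
To make "comparing" embeddings concrete: similarity between two embeddings is usually measured with a metric such as cosine similarity, where vectors pointing in similar directions score close to 1. Here is a minimal NumPy sketch (the cosine_similarity helper is ours for illustration, not part of Ollama or ChromaDB):

import numpy as np

def cosine_similarity(a, b):
    # Close to 1.0 for semantically similar texts, closer to 0.0 for
    # unrelated ones (with typical embedding models).
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example usage once the embeddings below have been generated:
# cosine_similarity(embeddings[0], embeddings[1])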

Generating Embeddings with Ollama

First, pull the necessary embedding model:

ollama pull mxbai-embed-large

Next, generate embeddings for the documents:

import ollama

# Initialize the Ollama client
client = ollama.Client()

# Generate embeddings
embeddings = []
for doc in documents:
    response = client.embeddings(model="mxbai-embed-large", prompt=doc)
    embeddings.append(response["embedding"])

# Display the embeddings
for i, emb in enumerate(embeddings):
    print(f"Document {i+1}: {emb[:5]}...")  # Print the first 5 dimensions of each embedding

Step 3: Storing Embeddings in a Vector Database

For efficient retrieval, we'll store the embeddings in a vector database. We will use chromadb for this purpose.

import chromadb

# Initialize the ChromaDB client
chroma_client = chromadb.Client()

# Create a collection to store documents and embeddings
collection = chroma_client.create_collection(name="document_collection")

# Store documents and their embeddings in the collection
for i, (doc, emb) in enumerate(zip(documents, embeddings)):
    collection.add(ids=[str(i)], embeddings=[emb], documents=[doc])

print("Documents and embeddings stored successfully.")

Step 4: Implementing the Retrieval Component

To retrieve the most relevant document for a given query, we need to generate an embedding for the query and find the closest match in the vector database.

Example Query

query = "What animals are llamas related to?"

# Generate embedding for the query
query_embedding = client.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]

# Retrieve the most relevant document
results = collection.query(query_embeddings=[query_embedding], n_results=1)
retrieved_document = results['documents'][0][0]

print(f"Retrieved Document: {retrieved_document}")

Step 5: Generating Responses with Llama 3

With the relevant document retrieved, we can now use the Llama 3 model to generate a response that incorporates this information.

Setting Up the Generator

Pull the Llama 3 model:

ollama pull llama3

Generate a Response

# Generate a response using the retrieved document and the query
response_prompt = f"Using this information: {retrieved_document}, respond to the query: {query}"
response = client.generate(model="llama3", prompt=response_prompt)

print(f"Generated Response: {response['response']}")

Step 6: Integrating All Components into a RAG Pipeline

To streamline the process, we'll integrate the retrieval and generation steps into a single pipeline.

class RAGPipeline:
    def __init__(self, retriever, generator, collection):
        self.retriever = retriever
        self.generator = generator
        self.collection = collection

    def __call__(self, query):
        # Step 1: Generate query embedding
        query_embedding = self.retriever.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]

        # Step 2: Retrieve the most relevant document
        results = self.collection.query(query_embeddings=[query_embedding], n_results=1)
        retrieved_document = results['documents'][0][0]

        # Step 3: Generate a response
        response_prompt = f"Using this information: {retrieved_document}, respond to the query: {query}"
        response = self.generator.generate(model="llama3", prompt=response_prompt)

        return response['response']

# Initialize the RAG pipeline
rag_pipeline = RAGPipeline(retriever=client, generator=client, collection=collection)

# Example usage
query = "Tell me about the dietary habits of llamas."
response = rag_pipeline(query)
print(f"RAG Response: {response}")

Visualizing the Process

To better understand how the RAG system works, let's visualize the workflow:

  1. Query Embedding Generation: The user's query is transformed into a vector embedding.

  2. Document Retrieval: The embedding is used to find the most relevant document in the vector database.

  3. Response Generation: The retrieved document is combined with the query to generate a final response.

Example Visualization

import matplotlib.pyplot as plt
import numpy as np

# Convert the embeddings to NumPy arrays for plotting
doc_embeddings = np.array(embeddings)
query_embedding = np.array(query_embedding)

# Plot only the first two embedding dimensions (a rough, lossy view of the vectors)
plt.scatter(doc_embeddings[:, 0], doc_embeddings[:, 1], color='blue', label='Documents')
plt.scatter(query_embedding[0], query_embedding[1], color='red', label='Query')
plt.legend()
plt.title('Document and Query Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()
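
The scatter plot above looks only at the first two raw dimensions of each embedding, which discards almost all of the information in a high-dimensional vector. A somewhat more faithful picture projects the same vectors (doc_embeddings and query_embedding from the block above) onto their first two principal components; here is a sketch that does this with a plain NumPy SVD, so no extra dependencies are needed:

# Stack the document and query embeddings and center them
all_vectors = np.vstack([doc_embeddings, query_embedding.reshape(1, -1)])
centered = all_vectors - all_vectors.mean(axis=0)

# Project onto the first two principal components via SVD
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T  # shape: (n_documents + 1, 2)

plt.scatter(projected[:-1, 0], projected[:-1, 1], color='blue', label='Documents')
plt.scatter(projected[-1, 0], projected[-1, 1], color='red', label='Query')
plt.legend()
plt.title('PCA Projection of Document and Query Embeddings')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()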

Conclusion

Building a Retrieval-Augmented Generation (RAG) system with Ollama and embedding models can significantly enhance the capabilities of AI applications by combining the strengths of retrieval-based and generative approaches. This tutorial has guided you through the process of setting up a RAG system, from data preparation and embedding generation to document retrieval and response generation.