Retrieval System

This document explains the knowledge retrieval system in Atlas, which enables agents to find and use contextually relevant information.

Overview

The retrieval system in Atlas lets agents find relevant information from natural language queries. Key features include:

Semantic Search: Find documents based on meaning, not just keywords
Keyword Search: Find exact matches for specific terms and phrases
Hybrid Retrieval: Combine semantic and keyword search for optimal results
Relevance Ranking: Sort results by similarity to the query
Metadata Filtering: Filter documents by version, source, or other attributes
LangGraph Integration: Use retrieval as a node in graph-based workflows
Persistent Storage: Efficient access to stored knowledge

The system is designed to be:

Accurate: Return the most contextually relevant information
Efficient: Optimize search performance for interactive use
Flexible: Support various query types and filtering options
Robust: Handle errors and edge cases gracefully

Core Components

KnowledgeBase Class

The KnowledgeBase class is the central interface for knowledge retrieval:

python

from atlas.knowledge.retrieval import KnowledgeBase

# Initialize the knowledge base
kb = KnowledgeBase(
    collection_name="atlas_knowledge_base",  # Optional, default value shown
    db_path=None  # Optional, defaults to ~/atlas_chroma_db or environment variable
)

# Basic retrieval
documents = kb.retrieve("What is the trimodal methodology?", n_results=5)

# Retrieval with version filter
documents = kb.retrieve(
    "What is the trimodal methodology?",
    n_results=5,
    version_filter="3"
)

Constructor Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `collection_name` | `Optional[str]` | Name of the ChromaDB collection | From the `ATLAS_COLLECTION_NAME` env var, else `"atlas_knowledge_base"` |
| `db_path` | `Optional[str]` | Path to ChromaDB storage | From the `ATLAS_DB_PATH` env var, else `~/atlas_chroma_db` |

Key Methods

retrieve(): Find documents relevant to a query
get_versions(): Get all available document versions
search_by_topic(): Find documents on a specific topic

ChromaDB Integration

The retrieval system integrates with ChromaDB for vector search:

python

self.chroma_client = chromadb.PersistentClient(path=self.db_path)
self.collection = self.chroma_client.get_or_create_collection(
    name=self.collection_name
)

This provides:

Persistent storage of embeddings
Efficient vector search capabilities
Collection-based organization
Error handling with fallback options

LangGraph Integration

For use with LangGraph workflows, the system provides the retrieve_knowledge function:

python

from atlas.knowledge.retrieval import retrieve_knowledge

# Update state with retrieved knowledge
updated_state = retrieve_knowledge(
    state=current_state,
    query="What is the trimodal methodology?",
    collection_name="atlas_knowledge_base",
    db_path=None
)

# Retrieved documents are in updated_state["context"]["documents"]

This function:

Initializes the knowledge base
Extracts the query from state or parameters
Retrieves relevant documents
Updates the state with the results

Retrieval Process

Basic Retrieval Flow

The standard retrieval process follows these steps:

Query Preparation: The user’s query is processed
Vector Search: ChromaDB performs a similarity search
Result Processing: Raw results are converted to a structured format
Relevance Sorting: Documents are sorted by relevance score
Return to Caller: Processed documents are returned

python

def retrieve(self, query: str, n_results: int = 5, version_filter: Optional[str] = None):
    # Prepare filters if any
    where_clause = {}
    if version_filter:
        where_clause["version"] = version_filter

    # Query the collection
    results = self.collection.query(
        query_texts=[query],
        n_results=n_results,
        where=where_clause if where_clause else None,
    )

    # Format results
    documents = []
    for doc, metadata, distance in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        documents.append({
            "content": doc,
            "metadata": metadata,
            "relevance_score": 1.0 - distance,  # Convert distance to relevance score
        })

    return documents
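
The result-processing and sorting steps can also be written as a standalone function that tolerates missing values. This is a sketch under assumptions: `format_results` is a hypothetical helper, it handles a single query, and it keeps the same `1.0 - distance` conversion shown above.

```python
from typing import Any, Dict, List, Optional


def format_results(results: Optional[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn a ChromaDB-style query result into documents sorted by relevance."""
    if not results:
        return []

    # Each field holds one list per query; this sketch reads the first query only.
    docs = (results.get("documents") or [[]])[0] or []
    metadatas = (results.get("metadatas") or [[]])[0] or []
    distances = (results.get("distances") or [[]])[0] or []

    formatted = []
    for index, content in enumerate(docs):
        if content is None:
            continue  # Skip entries that carry no text
        metadata = metadatas[index] if index < len(metadatas) else None
        distance = distances[index] if index < len(distances) else None
        formatted.append({
            "content": content,
            "metadata": metadata or {},
            # Same conversion as retrieve(); a missing distance scores 0.0
            "relevance_score": 1.0 - distance if distance is not None else 0.0,
        })

    # Highest relevance first
    formatted.sort(key=lambda doc: doc["relevance_score"], reverse=True)
    return formatted
```

Note that `1.0 - distance` only stays within the 0-1 range when the collection's distance metric is itself bounded by 1; with an unbounded metric the score can become negative.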

Document Format

Retrieved documents follow a consistent format:

json

{
  "content": "# Trimodal Methodology\n\nThe Atlas Trimodal Methodology is a framework...",
  "metadata": {
    "path": "src-markdown/prev/v3/core/TRIMODAL_PRINCIPLES.md",
    "source": "src-markdown/prev/v3/core/TRIMODAL_PRINCIPLES.md",
    "file_name": "TRIMODAL_PRINCIPLES.md",
    "section_title": "Trimodal Methodology",
    "version": "3",
    "chunk_size": 1843
  },
  "relevance_score": 0.945
}

Key attributes:

content: The actual text of the document
metadata: Source information and document attributes
relevance_score: Similarity to the query (0-1 scale)

Filtering with Metadata

The system supports filtering documents by metadata attributes:

python

# Filter by version
docs = kb.retrieve(query, version_filter="3")

# Other metadata filters are not exposed by retrieve(); they would be passed
# to ChromaDB as a where clause like this one
where_clause = {"section_title": "Trimodal Methodology"}

Currently, version filtering is directly implemented, while other metadata filtering would require customization of the retrieve method.
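
One way such a customization could look is a helper that builds the where clause from several metadata filters. This is a sketch, not part of Atlas: `build_where` is a hypothetical name. It relies on ChromaDB's `$and` operator for combining more than one condition.

```python
from typing import Any, Dict, Optional


def build_where(**filters: Any) -> Optional[Dict[str, Any]]:
    """Build a ChromaDB where clause from keyword filters, ignoring None values."""
    conditions = [{key: value} for key, value in filters.items() if value is not None]
    if not conditions:
        return None  # No filtering requested
    if len(conditions) == 1:
        return conditions[0]
    # Multiple conditions are combined with a logical operator
    return {"$and": conditions}


where = build_where(version="3", section_title="Trimodal Methodology")
```

The result can be passed as the `where` argument of `collection.query(...)`, in the same position the version filter occupies today.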

Version Support

The system provides special support for document versioning:

python

# Get all available versions
versions = kb.get_versions()
print(f"Available versions: {versions}")  # e.g., ["1", "2", "3", "5"]

# Filter retrieval by version
v3_docs = kb.retrieve(query, version_filter="3")

This enables:

Discovering what versions exist in the knowledge base
Retrieving documents from specific versions
Comparing information across versions
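
Version discovery can be pictured as collecting the distinct `version` values from stored metadata. The helper below is a hypothetical sketch, not the actual `get_versions()` implementation; it assumes the metadata list comes from something like `collection.get(include=["metadatas"])["metadatas"]`.

```python
from typing import Any, Dict, Iterable, List, Optional


def distinct_versions(metadatas: Iterable[Optional[Dict[str, Any]]]) -> List[str]:
    """Collect the unique version labels found in document metadata."""
    versions = {
        str(meta["version"])
        for meta in metadatas
        if meta and meta.get("version") is not None
    }
    # Numeric labels sort by value ("10" after "9"); other labels come last, alphabetically
    return sorted(
        versions,
        key=lambda v: (not v.isdecimal(), int(v) if v.isdecimal() else 0, v),
    )
```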

Integration with Agents

Basic Agent Integration

The retrieval system integrates with Atlas agents to provide context for responses:

python

def process_message(self, message: str) -> str:
    # Retrieve relevant documents from the knowledge base
    documents = self.query_knowledge_base(message)

    # Create system message with context
    system_msg = self.system_prompt
    if documents:
        context_text = self.format_knowledge_context(documents)
        system_msg = system_msg + context_text

    # Generate response using the model provider
    model_request = ModelRequest(
        messages=[ModelMessage.user(msg["content"]) for msg in self.messages],
        system_prompt=system_msg,
        max_tokens=self.config.max_tokens,
    )

    response = self.provider.generate(model_request)
    # ...

The agent uses retrieved documents to augment the system prompt with relevant context, enabling more informed responses.

Context Formatting

Retrieved documents are formatted for inclusion in the system prompt:

python

def format_knowledge_context(self, documents: List[Dict[str, Any]]) -> str:
    if not documents:
        return ""

    context_text = "\n\n## Relevant Knowledge\n\n"

    # Use only the top 3 most relevant documents to avoid token limits
    for i, doc in enumerate(documents[:3]):
        source = doc["metadata"].get("source", "Unknown")
        content = doc["content"]
        context_text += f"### Document {i + 1}: {source}\n{content}\n\n"

    return context_text

This structured format:

Clearly identifies relevant knowledge
Maintains document attribution
Preserves formatting of the original content
Limits context to the most relevant documents
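
If a fixed document count is too coarse, the same formatting can be capped by a size budget. The variant below is a sketch under assumptions: `format_context_with_budget` and its `max_chars` parameter are hypothetical, and a character count is only a rough proxy for tokens.

```python
from typing import Any, Dict, List


def format_context_with_budget(
    documents: List[Dict[str, Any]],
    max_chars: int = 4000,
    max_documents: int = 3,
) -> str:
    """Format documents for the system prompt, truncating content to a character budget."""
    if not documents:
        return ""

    sections = []
    remaining = max_chars
    for index, doc in enumerate(documents[:max_documents], start=1):
        if remaining <= 0:
            break  # Budget exhausted; drop the remaining documents
        source = doc.get("metadata", {}).get("source", "Unknown")
        content = doc["content"][:remaining]  # Truncate the last document if needed
        remaining -= len(content)
        sections.append(f"### Document {index}: {source}\n{content}\n\n")

    return "\n\n## Relevant Knowledge\n\n" + "".join(sections)
```

Only document content counts against the budget here; headings and source labels are small by comparison.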

Direct Usage Examples

Basic Retrieval

python

from atlas.knowledge.retrieval import KnowledgeBase

# Initialize the knowledge base
kb = KnowledgeBase()

# Retrieve documents
documents = kb.retrieve("How does Atlas handle knowledge graphs?")

# Display results
print(f"Found {len(documents)} relevant documents:")
for i, doc in enumerate(documents):
    score = doc["relevance_score"]
    source = doc["metadata"]["source"]
    title = doc["metadata"]["section_title"]
    print(f"Document {i+1}: {title} (from {source}) - Relevance: {score:.4f}")
    print(f"  Preview: {doc['content'][:100]}...\n")

Version-Specific Retrieval

python

from atlas.knowledge.retrieval import KnowledgeBase

# Initialize the knowledge base
kb = KnowledgeBase()

# Get available versions
versions = kb.get_versions()
print(f"Available Atlas versions: {versions}")

# Compare information across versions
query = "Explain the trimodal methodology"
for version in versions:
    print(f"\nRetrieving from version {version}:")
    docs = kb.retrieve(query, n_results=1, version_filter=version)
    if docs:
        print(f"Source: {docs[0]['metadata']['source']}")
        print(f"Content summary: {docs[0]['content'][:150]}...")
    else:
        print(f"No relevant documents found in version {version}")

Topic-Based Search

python

from atlas.knowledge.retrieval import KnowledgeBase

# Initialize the knowledge base
kb = KnowledgeBase()

# Search for a specific topic
topic_docs = kb.search_by_topic("Knowledge Graph", n_results=3)

print("Documents about Knowledge Graphs:")
for doc in topic_docs:
    title = doc["metadata"]["section_title"]
    version = doc["metadata"]["version"]
    print(f"- {title} (version {version})")

With LangGraph Workflows

python

from typing import Any, Dict

from langgraph.graph import END, StateGraph
from atlas.knowledge.retrieval import retrieve_knowledge

# Define state type
State = Dict[str, Any]

# Create a workflow (StateGraph requires a state schema)
workflow = StateGraph(State)

# Add nodes (process_query_node and generate_response_node are defined elsewhere)
workflow.add_node("retrieve_context", retrieve_knowledge)
workflow.add_node("process_query", process_query_node)
workflow.add_node("generate_response", generate_response_node)

# Define the entry point and edges
workflow.set_entry_point("retrieve_context")
workflow.add_edge("retrieve_context", "process_query")
workflow.add_edge("process_query", "generate_response")
workflow.add_edge("generate_response", END)

# Create the runnable
runnable = workflow.compile()

# Run with initial state
initial_state = {
    "messages": [{"role": "user", "content": "What is the trimodal methodology?"}]
}
result = runnable.invoke(initial_state)

Configuration Options

Environment Variables

ATLAS_COLLECTION_NAME: The name of the ChromaDB collection (default: "atlas_knowledge_base")
ATLAS_DB_PATH: Path to store ChromaDB files (default: "~/atlas_chroma_db")

Runtime Parameters

The KnowledgeBase class accepts:

collection_name: Override for the collection name
db_path: Override for the database path
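
The precedence between runtime parameters, environment variables, and built-in defaults can be sketched as follows. `resolve_settings` is a hypothetical helper written for illustration; the assumed order is explicit argument first, then environment variable, then default.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple

DEFAULT_COLLECTION = "atlas_knowledge_base"
DEFAULT_DB_DIR = "atlas_chroma_db"


def resolve_settings(
    collection_name: Optional[str] = None,
    db_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Resolve the collection name and database path (assumed precedence)."""
    env = os.environ if env is None else env
    # Explicit argument, then environment variable, then default
    name = collection_name or env.get("ATLAS_COLLECTION_NAME") or DEFAULT_COLLECTION
    path = db_path or env.get("ATLAS_DB_PATH") or str(Path.home() / DEFAULT_DB_DIR)
    return name, path
```

Passing `env` explicitly is only a convenience for testing; by default the function reads the real process environment.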

Command Line Options

When using the Atlas CLI, several retrieval-related options are available:

python main.py -m query -q "What is the trimodal methodology?" -c custom_collection --db-path custom_db_path

-q/--query: The query to process
-c/--collection: Specify the collection name
--db-path: Override the database path
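
For readers wiring up similar options, the flags above map naturally onto `argparse`. This is a hypothetical parser, not the code in `main.py`; in particular the long name `--mode` for `-m` is an assumption, since only the short form appears in the command above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the retrieval-related options (sketch)."""
    parser = argparse.ArgumentParser(description="Retrieval options (illustrative)")
    # "--mode" is an assumed long name for -m
    parser.add_argument("-m", "--mode", help="Run mode, for example 'query'")
    parser.add_argument("-q", "--query", help="The query to process")
    parser.add_argument("-c", "--collection", help="Collection name override")
    parser.add_argument("--db-path", help="Database path override")
    return parser


args = build_parser().parse_args([
    "-m", "query",
    "-q", "What is the trimodal methodology?",
    "-c", "custom_collection",
    "--db-path", "custom_db_path",
])
```

Here `args.collection` and `args.db_path` hold the overrides that would be handed to `KnowledgeBase(collection_name=..., db_path=...)`.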

Performance Considerations

Optimizing Retrieval Results

For best results:

Query Formulation: Specific, well-formed queries yield better results
- Good: “Explain the trimodal methodology framework”
- Less effective: “methodology”
Result Count: Request appropriate number of results
- For agent context: 3-5 documents (avoid prompt token limits)
- For research: 5-10 documents (broader coverage)
Metadata Filtering: Use filters to narrow the search when appropriate
- Version filtering for specific documentation versions
- Future: source filtering for specific document types

Handling Large Collections

For larger knowledge bases:

Collection Organization: Use separate collections for different domains
Query Specificity: More specific queries perform better at scale
Result Limiting: Start with fewer results and increase if needed

Error Handling

The retrieval system handles the following failure cases:

Database Connection: Falls back to in-memory DB if persistent connection fails
Empty Collections: Warns and returns empty results for empty collections
Query Execution: Catches and logs exceptions during query execution
Result Processing: Handles null or unexpected values in results

python

try:
    # Query the collection
    results = self.collection.query(...)
    # Process results...
except Exception as e:
    print(f"Error retrieving from knowledge base: {str(e)}")
    print(f"Query was: {query[:50]}...")
    return []  # Return empty results on error
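
The database-connection fallback can be expressed as a small generic pattern. The sketch below is an assumption about the shape of that logic, not the Atlas implementation; `connect_with_fallback` is a hypothetical name and takes the two client factories as arguments so it can be exercised without a database.

```python
import logging
from typing import Callable, TypeVar

logger = logging.getLogger("atlas.retrieval.sketch")
Client = TypeVar("Client")


def connect_with_fallback(
    primary: Callable[[], Client],
    fallback: Callable[[], Client],
) -> Client:
    """Try the primary client factory and use the fallback if it raises."""
    try:
        return primary()
    except Exception as error:  # Broad on purpose: any failure triggers the fallback
        logger.warning("Persistent store unavailable (%s); using fallback", error)
        return fallback()
```

With ChromaDB, `primary` might be `lambda: chromadb.PersistentClient(path=db_path)` and `fallback` might be `chromadb.EphemeralClient`. Data written to an in-memory client is lost when the process exits, so a fallback like this is best paired with a visible warning.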

Hybrid Retrieval

The Atlas knowledge system provides hybrid retrieval capabilities that combine semantic search with keyword-based search for more robust and accurate results.

Overview

Hybrid search leverages two complementary approaches:

Semantic (Vector) Search: Uses embeddings to find conceptually similar content, even when exact keywords aren’t present
Keyword (BM25) Search: Finds exact text matches for specific terms or phrases

By combining these approaches with configurable weights, hybrid retrieval can achieve better results than either method alone.
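
To make the keyword side concrete, here is a compact textbook version of Okapi BM25 scoring. It is a sketch for intuition only: the whitespace tokenizer, the `k1` and `b` defaults, and the function names are assumptions, and Atlas's own keyword search may differ.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


def bm25_scores(query: str, documents: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in documents]
    if not tokenized:
        return []

    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))  # Each query term is counted once
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            freq = term_freq[term]
            if freq == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: long documents need more matches to score as high
            norm = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that contain none of the query terms score exactly zero, which is why keyword search complements vector search rather than replacing it.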

Using Hybrid Retrieval

You can use hybrid retrieval via the RetrievalSettings class:

python

from atlas.knowledge.retrieval import KnowledgeBase, RetrievalFilter
from atlas.knowledge.settings import RetrievalSettings

# Initialize knowledge base
kb = KnowledgeBase()

# Create retrieval settings with hybrid search enabled
settings = RetrievalSettings(
    use_hybrid_search=True,  # Enable hybrid search
    semantic_weight=0.7,     # 70% weight for semantic results
    keyword_weight=0.3,      # 30% weight for keyword results
    num_results=5,           # Return top 5 documents
    min_relevance_score=0.25 # Minimum relevance threshold
)

# Perform hybrid retrieval
documents = kb.retrieve(
    query="knowledge graph structure with nodes and edges",
    settings=settings,
    filter=RetrievalFilter.from_metadata(file_type="md")
)

Adjusting Hybrid Weights

You can customize the balance between semantic and keyword search:

Conceptual Queries: Use higher semantic weight (e.g., 0.8/0.2) for abstract or conceptual questions
Specific Term Queries: Use higher keyword weight (e.g., 0.3/0.7) when looking for specific terms or phrases
Balanced Approach: Equal weights (0.5/0.5) provide good general-purpose results

The weights are automatically normalized if they don’t sum to 1.0.
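
Weight normalization and score blending can be illustrated with a weighted sum. This sketch assumes both score sets are already scaled to a comparable 0-1 range and are keyed by document ID; `normalize_weights` and `combine_scores` are hypothetical names, and Atlas may combine results differently.

```python
from typing import Dict, List, Tuple


def normalize_weights(semantic_weight: float, keyword_weight: float) -> Tuple[float, float]:
    """Scale the two weights so they sum to 1.0."""
    total = semantic_weight + keyword_weight
    if semantic_weight < 0 or keyword_weight < 0 or total <= 0:
        raise ValueError("weights must be non-negative and sum to a positive number")
    return semantic_weight / total, keyword_weight / total


def combine_scores(
    semantic: Dict[str, float],
    keyword: Dict[str, float],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Blend per-document scores; a document missing from one side scores 0 there."""
    sw, kw = normalize_weights(semantic_weight, keyword_weight)
    doc_ids = set(semantic) | set(keyword)
    combined = {
        doc_id: sw * semantic.get(doc_id, 0.0) + kw * keyword.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Raw keyword scores such as BM25 are unbounded, so they would need rescaling (for example, dividing by the maximum score) before a blend like this is meaningful.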

LangGraph Integration

Hybrid retrieval can be used with LangGraph workflows via the retrieve_knowledge function:

python

from atlas.knowledge.retrieval import retrieve_knowledge
from atlas.knowledge.settings import RetrievalSettings

# Define settings with hybrid search
settings = RetrievalSettings(use_hybrid_search=True)

# Use as a node in a LangGraph workflow
updated_state = retrieve_knowledge(
    state=current_state,
    query="What is the knowledge graph structure?",
    settings=settings
)

Future Enhancements

Planned improvements to the retrieval system include:

Advanced Re-ranking: More sophisticated post-retrieval scoring
Enhanced Filtering: More flexible metadata filtering options
Result Enrichment: Adding contextual information to search results
Query Expansion: Automatically enhancing queries for better results
Multi-stage Retrieval: Sequential retrieval with context refinement

Related Documentation

Knowledge System Overview - Overview of the knowledge management system
Document Ingestion - Information about document processing
Agent Documentation - Documentation for Atlas agents that use the retrieval system
LangGraph Integration - Documentation on graph nodes including knowledge retrieval
