Knowledge System Overview
This document provides a comprehensive overview of the knowledge management system in Atlas, which enables semantic storage and retrieval of information for agent interactions.
Introduction
The knowledge system in Atlas is a foundational component that provides agents with access to contextually relevant information. It consists of two main subsystems:
- Document Ingestion: Processing and storing documents in a vector database
- Knowledge Retrieval: Semantic search capabilities for finding relevant information
This system enables Atlas to maintain a comprehensive, versioned, and searchable knowledge base that serves as the foundation for intelligent agent interactions.
Architecture
The knowledge system is built around these key components:
- Vector Database: ChromaDB for persistent storage of document embeddings
- Document Processor: Handles document splitting, processing, and embedding
- Knowledge Base: Interface for retrieving relevant information
- Integration Layer: Connects the knowledge system to agents and workflows
The system is designed to be:
- Scalable: Handle large volumes of documents
- Persistent: Store embeddings efficiently for reuse
- Flexible: Support various document types and querying patterns
- Integrated: Work seamlessly with the agent infrastructure
Core Components
ChromaDB Integration
Atlas uses ChromaDB as its vector database, providing:
- Persistent Storage: Document embeddings are stored on disk
- Semantic Search: Embedding-based similarity search
- Metadata Filtering: Search refinement using document metadata
- Collection Management: Organizing embeddings in named collections
The integration handles connection management, persistence configuration, and error recovery:
```python
# Initialize the database
self.chroma_client = chromadb.PersistentClient(path=self.db_path)
self.collection = self.chroma_client.get_or_create_collection(
    name=self.collection_name
)
```
Metadata Filtering with ChromaDB 1.0.8+
With ChromaDB 1.0.8 and newer versions, Atlas supports enhanced metadata filtering using the `RetrievalFilter` class:
```python
from atlas.knowledge.retrieval import RetrievalFilter, KnowledgeBase

# Initialize knowledge base
kb = KnowledgeBase()

# Create filter for documents from a specific source path
filter = RetrievalFilter().add_filter("source", {"$eq": "docs/components/knowledge"})

# Add additional filters
filter = filter.add_range_filter("version", "2", "3")  # Version between 2 and 3 (inclusive)
filter = filter.add_in_filter("file_name", ["index.md", "retrieval.md"])  # File is one of these

# Retrieve with filter
results = kb.retrieve("knowledge system", filter=filter)
```
The `RetrievalFilter` class supports these ChromaDB 1.0.8+ compatible filter operations:
- Equality: `add_filter(key, {"$eq": value})` - exact match
- Inequality: `add_filter(key, {"$ne": value})` - not equal to
- Range: `add_range_filter(key, min, max)` - between min and max values (inclusive)
- Comparison: `add_operator_filter(key, "$gt/$gte/$lt/$lte", value)` - greater/less than
- Inclusion: `add_in_filter(key, [value1, value2])` - value is in the list
- Exclusion: `add_filter(key, {"$nin": [value1, value2]})` - value is not in the list
These filters can be combined for precise document retrieval and access.
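Conceptually, a combined filter compiles down to a single ChromaDB `where` clause. The sketch below shows the ChromaDB 1.0.8+ operator syntax such a filter maps onto; the exact dictionary `RetrievalFilter` builds internally may differ:

```python
# Sketch of the ChromaDB "where" clause a combined filter might produce.
# The exact structure RetrievalFilter builds internally is an assumption;
# this illustrates the ChromaDB 1.0.8+ operator syntax the filters map onto.
where = {
    "$and": [
        {"source": {"$eq": "docs/components/knowledge"}},
        {"version": {"$gte": "2"}},
        {"version": {"$lte": "3"}},
        {"file_name": {"$in": ["index.md", "retrieval.md"]}},
    ]
}

# A clause like this can be passed directly to a ChromaDB collection, e.g.:
# collection.query(query_texts=["knowledge system"], where=where)
```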
Document Processing System
The `DocumentProcessor` class manages the ingestion pipeline:
- Document Discovery: Finding relevant files in a directory structure
- Content Extraction: Reading document content from files
- Chunking: Splitting documents into manageable sections
- Metadata Annotation: Adding source information and other metadata
- Embedding Generation: Converting text to vector embeddings
- Vector Storage: Storing embeddings in ChromaDB
The processor uses heading-based document splitting to maintain content coherence:
```python
# Split document by headings
heading_pattern = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
headings = list(heading_pattern.finditer(content))

# Process each section
sections = []
for i, match in enumerate(headings):
    title = match.group(2).strip()
    # Section text runs from the end of this heading to the start of the next
    start_pos = match.end()
    end_pos = headings[i + 1].start() if i + 1 < len(headings) else len(content)
    section_text = content[start_pos:end_pos].strip()
    sections.append({"title": title, "text": section_text})
```
Retrieval System
The `KnowledgeBase` class provides the semantic search interface:
- Query Processing: Converting user queries to vector searches
- Relevance Ranking: Sorting results by similarity score
- Result Formatting: Converting raw results to usable document formats
- Filtering: Narrowing results by metadata (e.g., document version)
The retrieval system supports several search patterns:
```python
# Basic retrieval
documents = kb.retrieve(query, n_results=5)

# Version-filtered retrieval
documents = kb.retrieve(query, n_results=5, version_filter="v2")

# Topic-based search
documents = kb.search_by_topic("Knowledge Graph", n_results=5)
```
Integration with LangGraph
The knowledge system integrates with LangGraph through the `retrieve_knowledge` function:
```python
def retrieve_knowledge(
    state: Dict[str, Any],
    query: Optional[str] = None,
    collection_name: Optional[str] = None,
    db_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Retrieve knowledge from the Atlas knowledge base."""
    # Initialize knowledge base
    kb = KnowledgeBase(collection_name=collection_name, db_path=db_path)

    # Get query from state or parameter
    query = query or get_query_from_state(state)

    # Retrieve documents
    documents = kb.retrieve(query)

    # Update state with results
    state["context"] = {"documents": documents, "query": query}
    return state
```
This function serves as a standard node in LangGraph workflows and:
- Extracts query information from graph state
- Performs knowledge retrieval
- Updates the graph state with retrieved documents
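To make that state flow concrete, here is a minimal sketch of the node contract with the knowledge base stubbed out. The stub class and its canned result are illustrative only, not part of Atlas:

```python
# Minimal sketch of the LangGraph node contract: a state dict goes in, an
# updated state dict comes out. The stub knowledge base is illustrative.
class StubKnowledgeBase:
    def retrieve(self, query, n_results=5):
        return [{"content": "...",
                 "metadata": {"source": "docs/index.md"},
                 "relevance_score": 0.9}]

def retrieve_knowledge_node(state):
    query = state.get("query", "")
    documents = StubKnowledgeBase().retrieve(query)
    # Return a new state with the retrieval results attached as context
    return {**state, "context": {"documents": documents, "query": query}}

state = retrieve_knowledge_node({"query": "knowledge system"})
```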
Document Representation
Each document in the knowledge system is represented with a rich structure:
```json
{
  "content": "## Knowledge Graph Structure\n\nThe Atlas knowledge graph...",
  "metadata": {
    "path": "src-markdown/prev/v2/core/KNOWLEDGE_FRAMEWORK.md",
    "source": "src-markdown/prev/v2/core/KNOWLEDGE_FRAMEWORK.md",
    "file_name": "KNOWLEDGE_FRAMEWORK.md",
    "section_title": "Knowledge Graph Structure",
    "version": "2",
    "chunk_size": 1250
  },
  "relevance_score": 0.92
}
```
Key attributes include:
- Content: The actual text of the document chunk
- Metadata:
  - path/source: Location of the original document
  - file_name: Name of the source file
  - section_title: Title of the document section
  - version: Version of the Atlas system the document belongs to
  - chunk_size: Size of the text chunk
- Relevance Score: Similarity score relative to the query (0-1)
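ChromaDB itself reports distances (lower is closer), so a 0-1 `relevance_score` implies a conversion step. One common choice, shown here as an assumption rather than Atlas's confirmed formula, is to invert the distance:

```python
# Hypothetical conversion from a ChromaDB distance (lower = closer) to the
# 0-1 relevance_score shown above. The exact formula Atlas uses is an
# assumption; inverting the distance is one common approach.
def to_relevance(distance: float) -> float:
    return max(0.0, 1.0 - distance)
```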
Configuration Options
The knowledge system is highly configurable via a combination of:
Environment Variables
- `ATLAS_COLLECTION_NAME`: The name of the ChromaDB collection (default: "atlas_knowledge_base")
- `ATLAS_DB_PATH`: Path to store ChromaDB files (default: "~/atlas_chroma_db")
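Reading these variables follows the usual pattern of falling back to the documented defaults when they are unset. This is a sketch of that pattern, not Atlas's actual configuration code:

```python
import os

# Resolve knowledge-system settings from the environment, using the
# documented defaults when the variables are unset. Sketch only; Atlas's
# own configuration loading may differ.
collection_name = os.environ.get("ATLAS_COLLECTION_NAME", "atlas_knowledge_base")
db_path = os.path.expanduser(os.environ.get("ATLAS_DB_PATH", "~/atlas_chroma_db"))
```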
Runtime Parameters
Both `DocumentProcessor` and `KnowledgeBase` accept:
- `collection_name`: Override for the collection name
- `db_path`: Override for the database path
- `anthropic_api_key`: API key for embedding generation (`DocumentProcessor` only)
Command Line Options
When using the CLI, several knowledge-related options are available:
```bash
python main.py -m ingest -d <directory> -c <collection_name> --db-path <db_path>
```
- `-c/--collection`: Specify the collection name
- `--db-path`: Override the database path
- `-d/--directory`: Directory to ingest documents from
- `-r/--recursive`: Recursively process subdirectories
Versioning Support
The knowledge system includes special support for versioned documentation:
Version Detection: Automatically extracts version information from file paths

```python
version_match = re.search(r"/v(\d+(?:\.\d+)?)/", file_path)
version = version_match.group(1) if version_match else "current"
```

Version Filtering: Allows retrieving documents from specific Atlas versions

```python
# Only get documents from v3
documents = kb.retrieve(query, version_filter="3")
```

Version Enumeration: Lists all available versions in the knowledge base

```python
versions = kb.get_versions()
print(f"Available Atlas versions: {versions}")
```
This enables agents to access documentation from different evolutionary stages of the Atlas framework.
Usage Patterns
Ingesting Documentation
```python
from atlas.knowledge.ingest import DocumentProcessor

# Initialize the processor
processor = DocumentProcessor(collection_name="my_knowledge_base")

# Process a directory of markdown files
processor.process_directory("./documents", recursive=True)

print(f"Processed {processor.collection.count()} document chunks")
```
Basic Retrieval
```python
from atlas.knowledge.retrieval import KnowledgeBase

# Initialize the knowledge base
kb = KnowledgeBase(collection_name="my_knowledge_base")

# Retrieve relevant documents
documents = kb.retrieve("How does Atlas handle knowledge graphs?", n_results=5)

for i, doc in enumerate(documents):
    print(f"Result {i+1}: {doc['metadata']['source']} - Score: {doc['relevance_score']:.4f}")
    print(f"Content snippet: {doc['content'][:100]}...")
```
Enhancing Agent Responses
```python
from atlas.knowledge.retrieval import KnowledgeBase
from atlas.agents.base import AtlasAgent

# Initialize components
kb = KnowledgeBase()
agent = AtlasAgent()

# Process a user query
query = "Explain the trimodal methodology in Atlas"

# Retrieve relevant knowledge
documents = kb.retrieve(query)

# Format context from documents
context = ""
for i, doc in enumerate(documents[:3]):
    source = doc["metadata"]["source"]
    content = doc["content"]
    context += f"Document {i+1} ({source}):\n{content}\n\n"

# Generate enhanced response
response = agent.process_message_with_context(query, context)
print(response)
```
Integration with LangGraph Workflows
```python
from langgraph.graph import StateGraph
from atlas.knowledge.retrieval import retrieve_knowledge

# Define the workflow
workflow = StateGraph()

# Add knowledge retrieval node
workflow.add_node("retrieve_knowledge", retrieve_knowledge)

# Connect the nodes
workflow.add_edge("user_input", "retrieve_knowledge")
workflow.add_edge("retrieve_knowledge", "generate_response")
```
Performance Considerations
Chunking Strategy
The default chunking strategy balances two competing priorities:
- Semantic Coherence: Keeping logically related content together
- Embedding Effectiveness: Avoiding chunks that are too large
The heading-based chunking strategy:
- Uses markdown headings as natural document boundaries
- Falls back to size-based chunking for large heading sections
- Preserves document structure and hierarchy
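The size-based fallback for oversized heading sections can be sketched roughly as follows. The character limit and the paragraph-boundary heuristic here are assumptions for illustration, not Atlas's actual parameters:

```python
def split_large_section(text: str, max_chars: int = 2000) -> list[str]:
    """Fallback size-based splitter for heading sections that grow too large.

    Sketch only: splits on paragraph boundaries where possible. A single
    paragraph longer than max_chars is kept whole here, whereas a real
    chunker would likely subdivide it further.
    """
    if len(text) <= max_chars:
        return [text]
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```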
Database Optimization
For optimal performance:
- Collection Size: Keep collections under 100,000 documents for best performance
- Query Complexity: Simple, focused queries yield better results than complex ones
- Result Count: Retrieve 3-5 results for most agent interactions
- Persistence Location: Store the database on a fast SSD for better retrieval times
Error Handling
The knowledge system includes robust error handling:
- Connectivity Issues: Falls back to in-memory database if disk access fails
- Empty Collections: Warns about missing documents during retrieval
- File Access Problems: Skips problematic files during ingestion
- Invalid Content: Handles malformed documents gracefully
Core Features
The knowledge system includes several key features:
- Advanced Chunking Strategies: Sophisticated document splitting based on semantic boundaries
- Hybrid Retrieval: Combining embedding similarity with keyword search for optimal results
- Metadata Filtering: Fine-grained control over document selection based on attributes
- Customizable Processing: Configurable settings for different document types and use cases
- Versioned Content: Support for tracking different versions of the knowledge base
Future Enhancements
Planned improvements to the knowledge system include:
- Cross-Reference Analysis: Identifying relationships between documents
- Advanced Reranking: More sophisticated post-retrieval scoring algorithms
- Incremental Updates: Better handling of document changes and updates
- Cache Management: Query caching for better performance
- Multimedia Support: Handling of non-text document types
Related Documentation
- Document Ingestion - Detailed documentation on document processing
- Retrieval System - Information about knowledge retrieval functionality
- ChromaDB Integration - Details on vector database configuration
- LangGraph Integration - Documentation on using knowledge with LangGraph