Knowledge System Overview

This document provides a comprehensive overview of the knowledge management system in Atlas, which enables semantic storage and retrieval of information for agent interactions.

Introduction

The knowledge system in Atlas is a foundational component that provides agents with access to contextually relevant information. It consists of two main subsystems:

Document Ingestion: Processing and storing documents in a vector database
Knowledge Retrieval: Semantic search capabilities for finding relevant information

This system enables Atlas to maintain a comprehensive, versioned, and searchable knowledge base that serves as the foundation for intelligent agent interactions.

Architecture

The knowledge system is built around these key components:

Vector Database: ChromaDB for persistent storage of document embeddings
Document Processor: Handles document splitting, processing, and embedding
Knowledge Base: Interface for retrieving relevant information
Integration Layer: Connects the knowledge system to agents and workflows

The system is designed to be:

Scalable: Handle large volumes of documents
Persistent: Store embeddings efficiently for reuse
Flexible: Support various document types and querying patterns
Integrated: Work seamlessly with the agent infrastructure

Core Components

ChromaDB Integration

Atlas uses ChromaDB as its vector database, providing:

Persistent Storage: Document embeddings are stored on disk
Semantic Search: Embedding-based similarity search
Metadata Filtering: Search refinement using document metadata
Collection Management: Organizing embeddings in named collections

The integration handles connection management, persistence configuration, and error recovery:

python

import chromadb

# Initialize the database (excerpt from inside a class, hence `self`)
self.chroma_client = chromadb.PersistentClient(path=self.db_path)
self.collection = self.chroma_client.get_or_create_collection(
    name=self.collection_name
)

Metadata Filtering with ChromaDB 1.0.8+

With ChromaDB 1.0.8 and newer versions, Atlas supports enhanced metadata filtering using the RetrievalFilter class:

python

from atlas.knowledge.retrieval import RetrievalFilter, KnowledgeBase

# Initialize knowledge base
kb = KnowledgeBase()

# Create filter for documents from a specific source path
# (named retrieval_filter to avoid shadowing Python's built-in filter)
retrieval_filter = RetrievalFilter().add_filter("source", {"$eq": "docs/components/knowledge"})

# Add additional filters
retrieval_filter = retrieval_filter.add_range_filter("version", "2", "3")  # Version between 2 and 3 (inclusive)
retrieval_filter = retrieval_filter.add_in_filter("file_name", ["index.md", "retrieval.md"])  # File is one of these

# Retrieve with filter
results = kb.retrieve("knowledge system", filter=retrieval_filter)

The RetrievalFilter class supports these ChromaDB 1.0.8+ compatible filter operations:

Equality: add_filter(key, {"$eq": value}) - Exact match
Inequality: add_filter(key, {"$ne": value}) - Not equal to
Range Filters: add_range_filter(key, min, max) - Between min and max values
Comparison: add_operator_filter(key, "$gt/$gte/$lt/$lte", value) - Greater/less than
Inclusion: add_in_filter(key, [value1, value2]) - Value is in the list
Exclusion: add_filter(key, {"$nin": [value1, value2]}) - Value is not in the list

These filters can be combined to narrow retrieval to a specific set of documents.
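As a rough illustration of what combining filters can look like at the ChromaDB level, the sketch below merges several conditions into a single `where` clause using ChromaDB's `$and` operator. The helper names `combine_filters` and `range_condition` are hypothetical and are not part of the RetrievalFilter API; how RetrievalFilter builds its clause internally is not shown in this document.

```python
from typing import Any, Dict, List


def combine_filters(conditions: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine per-field conditions into one ChromaDB-style `where` clause."""
    if not conditions:
        return {}
    if len(conditions) == 1:
        # ChromaDB's $and operator expects at least two sub-expressions,
        # so a single condition is returned unwrapped.
        return conditions[0]
    return {"$and": conditions}


def range_condition(key: str, minimum: Any, maximum: Any) -> List[Dict[str, Any]]:
    """Express an inclusive range as a pair of $gte / $lte conditions."""
    return [{key: {"$gte": minimum}}, {key: {"$lte": maximum}}]


# Hypothetical example: one source, a chunk-size range, and two file names.
where = combine_filters(
    [{"source": {"$eq": "docs/components/knowledge"}}]
    + range_condition("chunk_size", 500, 2000)
    + [{"file_name": {"$in": ["index.md", "retrieval.md"]}}]
)
```

The resulting dictionary has the shape ChromaDB accepts for the `where` argument of a collection query. The range here uses the numeric `chunk_size` field because ChromaDB's comparison operators work on numeric values.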

Document Processing System

The DocumentProcessor class manages the ingestion pipeline:

Document Discovery: Finding relevant files in a directory structure
Content Extraction: Reading document content from files
Chunking: Splitting documents into manageable sections
Metadata Annotation: Adding source information and other metadata
Embedding Generation: Converting text to vector embeddings
Vector Storage: Storing embeddings in ChromaDB
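The first step of this pipeline, document discovery, can be sketched with the standard library alone. The function name `discover_documents` and the `.md` default are assumptions made for illustration; the actual DocumentProcessor implementation is not shown here.

```python
from pathlib import Path
from typing import List


def discover_documents(directory: str, recursive: bool = True, suffix: str = ".md") -> List[Path]:
    """Return matching files under `directory`, sorted for a stable ingestion order."""
    root = Path(directory)
    pattern = f"*{suffix}"
    # rglob descends into subdirectories; glob only looks at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(path for path in matches if path.is_file())
```

Sorting the result keeps ingestion order deterministic between runs, which makes it easier to compare chunk counts after re-ingesting the same directory.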

The processor uses heading-based document splitting to maintain content coherence:

python

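import re

# Split document by headings
heading_pattern = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
headings = list(heading_pattern.finditer(content))

# Process each section
sections = []
for i, match in enumerate(headings):
    title = match.group(2).strip()
    # A section runs from its heading line to the next heading (or the end of the document)
    start_pos = match.start()
    end_pos = headings[i + 1].start() if i + 1 < len(headings) else len(content)
    section_text = content[start_pos:end_pos].strip()
    sections.append({"title": title, "text": section_text})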

Retrieval System

The KnowledgeBase class provides the semantic search interface:

Query Processing: Converting user queries to vector searches
Relevance Ranking: Sorting results by similarity score
Result Formatting: Converting raw results to usable document formats
Filtering: Narrowing results by metadata (e.g., document version)

The retrieval system supports several search patterns:

python

# Basic retrieval
documents = kb.retrieve(query, n_results=5)

# Version-filtered retrieval (versions are stored without the "v" prefix)
documents = kb.retrieve(query, n_results=5, version_filter="2")

# Topic-based search
documents = kb.search_by_topic("Knowledge Graph", n_results=5)

Integration with LangGraph

The knowledge system integrates with LangGraph through the retrieve_knowledge function:

python

from typing import Any, Dict, Optional


def retrieve_knowledge(
    state: Dict[str, Any],
    query: Optional[str] = None,
    collection_name: Optional[str] = None,
    db_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Retrieve knowledge from the Atlas knowledge base."""
    # Initialize knowledge base
    kb = KnowledgeBase(collection_name=collection_name, db_path=db_path)

    # Get query from state or parameter
    query = query or get_query_from_state(state)

    # Retrieve documents
    documents = kb.retrieve(query)

    # Update state with results
    state["context"] = {"documents": documents, "query": query}

    return state

This function serves as a standard node in LangGraph workflows and:

Extracts query information from graph state
Performs knowledge retrieval
Updates the graph state with retrieved documents
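The three steps above can be exercised without a database by swapping in a stand-in retriever. This is only a sketch of the node contract: `make_retrieval_node`, `fake_retriever`, and the `"query"` state key are hypothetical, and the real `get_query_from_state` helper may locate the query differently.

```python
from typing import Any, Callable, Dict, List

Retriever = Callable[[str], List[Dict[str, Any]]]


def make_retrieval_node(retriever: Retriever) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a node function that follows the extract / retrieve / update steps."""

    def node(state: Dict[str, Any]) -> Dict[str, Any]:
        # 1. Extract the query (assumed here to live under a "query" key).
        query = state.get("query", "")
        # 2. Perform retrieval; an empty query returns no documents.
        documents = retriever(query) if query else []
        # 3. Write results under "context", mirroring retrieve_knowledge.
        return {**state, "context": {"documents": documents, "query": query}}

    return node


def fake_retriever(query: str) -> List[Dict[str, Any]]:
    """Stand-in for KnowledgeBase.retrieve so the sketch runs without a database."""
    return [{"content": f"Notes about {query}", "metadata": {}, "relevance_score": 1.0}]


node = make_retrieval_node(fake_retriever)
new_state = node({"query": "chunking"})
```

Injecting the retriever this way keeps the node easy to unit test, since the state handling can be checked separately from the vector search.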

Document Representation

Each document in the knowledge system is represented with a rich structure:

json

{
  "content": "## Knowledge Graph Structure\n\nThe Atlas knowledge graph...",
  "metadata": {
    "path": "src-markdown/prev/v2/core/KNOWLEDGE_FRAMEWORK.md",
    "source": "src-markdown/prev/v2/core/KNOWLEDGE_FRAMEWORK.md",
    "file_name": "KNOWLEDGE_FRAMEWORK.md",
    "section_title": "Knowledge Graph Structure",
    "version": "2",
    "chunk_size": 1250
  },
  "relevance_score": 0.92
}

Key attributes include:

Content: The actual text of the document chunk
Metadata:
- path/source: Location of the original document
- file_name: Name of the source file
- section_title: Title of the document section
- version: Version of the Atlas system the document belongs to
- chunk_size: Size of the text chunk
Relevance Score: Similarity score relative to the query (0-1)
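To show how raw vector search output might be turned into this structure, the sketch below converts a ChromaDB-style query result (nested lists under `documents`, `metadatas`, and `distances`) into a list of document dictionaries. The function name `format_results` is hypothetical, and the distance-to-score mapping `1 / (1 + distance)` is an assumption chosen because it stays within 0-1; the formula Atlas actually uses is not given in this document.

```python
from typing import Any, Dict, List


def format_results(raw: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert a single-query ChromaDB-style result into a list of document dicts."""
    # Results are nested per query, so index 0 selects the only query.
    texts = raw["documents"][0]
    metadatas = raw["metadatas"][0]
    distances = raw["distances"][0]

    documents = []
    for text, metadata, distance in zip(texts, metadatas, distances):
        documents.append(
            {
                "content": text,
                "metadata": metadata,
                # Assumed mapping from distance to a (0, 1] score,
                # not a confirmed Atlas formula.
                "relevance_score": 1.0 / (1.0 + distance),
            }
        )
    # Highest score first.
    return sorted(documents, key=lambda doc: doc["relevance_score"], reverse=True)
```

Each returned item has the `content`, `metadata`, and `relevance_score` keys described above, so downstream code can treat it like any retrieved document.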

Configuration Options

The knowledge system can be configured through environment variables, runtime parameters, and command-line options.

Environment Variables

ATLAS_COLLECTION_NAME: The name of the ChromaDB collection (default: "atlas_knowledge_base")
ATLAS_DB_PATH: Path where ChromaDB files are stored (default: "~/atlas_chroma_db")

Runtime Parameters

DocumentProcessor and KnowledgeBase accept the following parameters, except where noted:

collection_name: Override for the collection name
db_path: Override for the database path
anthropic_api_key: API key for embedding generation (DocumentProcessor only)
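A minimal sketch of how these settings might be resolved follows. It assumes that an explicit argument takes precedence over the environment variable, which in turn takes precedence over the documented default; the function name `resolve_settings` is hypothetical.

```python
import os
from typing import Optional, Tuple

DEFAULT_COLLECTION_NAME = "atlas_knowledge_base"
DEFAULT_DB_PATH = "~/atlas_chroma_db"


def resolve_settings(
    collection_name: Optional[str] = None,
    db_path: Optional[str] = None,
) -> Tuple[str, str]:
    """Resolve settings with assumed precedence: argument, then environment, then default."""
    name = collection_name or os.environ.get("ATLAS_COLLECTION_NAME", DEFAULT_COLLECTION_NAME)
    path = db_path or os.environ.get("ATLAS_DB_PATH", DEFAULT_DB_PATH)
    # Expand "~" so the default path points into the user's home directory.
    return name, os.path.expanduser(path)
```

Calling `resolve_settings()` with no arguments falls back to the environment or the defaults, while `resolve_settings(collection_name="my_knowledge_base")` overrides only the collection name.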

Command Line Options

When using the CLI, several knowledge-related options are available:

python main.py -m ingest -d <directory> -c <collection_name> --db-path <db_path>

-c/--collection: Specify the collection name
--db-path: Override the database path
-d/--directory: Directory to ingest documents from
-r/--recursive: Recursively process subdirectories
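For readers who want to see how such options could be declared, here is a sketch using `argparse`. It is not the actual `main.py`: the long form `--mode` for `-m` is an assumption, since only the short flag appears in the command above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the knowledge-related options listed above."""
    parser = argparse.ArgumentParser(prog="main.py")
    # "--mode" as the long form of -m is an assumption.
    parser.add_argument("-m", "--mode", required=True, help="Operation to run, e.g. ingest")
    parser.add_argument("-d", "--directory", help="Directory to ingest documents from")
    parser.add_argument("-c", "--collection", help="Collection name")
    parser.add_argument("--db-path", help="Override the database path")
    parser.add_argument("-r", "--recursive", action="store_true", help="Process subdirectories")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-m", "ingest", "-d", "./documents", "-c", "my_knowledge_base", "-r"]
)
```

Note that `argparse` exposes `--db-path` as `args.db_path`, with the hyphen converted to an underscore.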

Versioning Support

The knowledge system includes special support for versioned documentation:

Version Detection: Automatically extracts version information from file paths

python

version_match = re.search(r"/v(\d+(?:\.\d+)?)/", file_path)
version = version_match.group(1) if version_match else "current"

Version Filtering: Allows retrieving documents from specific Atlas versions

python

# Only get documents from v3
documents = kb.retrieve(query, version_filter="3")

Version Enumeration: Lists all available versions in the knowledge base

python

versions = kb.get_versions()
print(f"Available Atlas versions: {versions}")

This enables agents to access documentation from different evolutionary stages of the Atlas framework.
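The three versioning features can be tried together on plain dictionaries. The sketch below reuses the version-detection pattern shown above; the helper names and the second and third example paths are hypothetical, and the real `kb.get_versions()` presumably reads from the collection rather than an in-memory list.

```python
import re
from typing import Any, Dict, List

VERSION_PATTERN = re.compile(r"/v(\d+(?:\.\d+)?)/")


def detect_version(file_path: str) -> str:
    """Extract the version from a path segment such as /v2/, else 'current'."""
    match = VERSION_PATTERN.search(file_path)
    return match.group(1) if match else "current"


def list_versions(documents: List[Dict[str, Any]]) -> List[str]:
    """Return the distinct versions present, sorted as strings."""
    return sorted({doc["metadata"]["version"] for doc in documents})


def filter_by_version(documents: List[Dict[str, Any]], version: str) -> List[Dict[str, Any]]:
    """Keep only documents whose metadata matches the requested version."""
    return [doc for doc in documents if doc["metadata"]["version"] == version]


paths = [
    "src-markdown/prev/v2/core/KNOWLEDGE_FRAMEWORK.md",
    "src-markdown/prev/v3/core/OVERVIEW.md",  # hypothetical path
    "docs/components/knowledge/index.md",  # hypothetical path, no version segment
]
documents = [{"metadata": {"source": p, "version": detect_version(p)}} for p in paths]
```

Because the pattern requires a slash on both sides of the version segment, a relative path that begins with `v2/` would not match and would be labeled "current".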

Usage Patterns

Ingesting Documentation

python

from atlas.knowledge.ingest import DocumentProcessor

# Initialize the processor
processor = DocumentProcessor(collection_name="my_knowledge_base")

# Process a directory of markdown files
processor.process_directory("./documents", recursive=True)

print(f"Processed {processor.collection.count()} document chunks")

Basic Retrieval

python

from atlas.knowledge.retrieval import KnowledgeBase

# Initialize the knowledge base
kb = KnowledgeBase(collection_name="my_knowledge_base")

# Retrieve relevant documents
documents = kb.retrieve("How does Atlas handle knowledge graphs?", n_results=5)

for i, doc in enumerate(documents):
    print(f"Result {i+1}: {doc['metadata']['source']} - Score: {doc['relevance_score']:.4f}")
    print(f"Content snippet: {doc['content'][:100]}...")

Enhancing Agent Responses

python

from atlas.knowledge.retrieval import KnowledgeBase
from atlas.agents.base import AtlasAgent

# Initialize components
kb = KnowledgeBase()
agent = AtlasAgent()

# Process a user query
query = "Explain the trimodal methodology in Atlas"

# Retrieve relevant knowledge
documents = kb.retrieve(query)

# Format context from documents
context = ""
for i, doc in enumerate(documents[:3]):
    source = doc["metadata"]["source"]
    content = doc["content"]
    context += f"Document {i+1} ({source}):\n{content}\n\n"

# Generate enhanced response
response = agent.process_message_with_context(query, context)
print(response)

Integration with LangGraph Workflows

python

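from typing import Any, Dict, TypedDict

from langgraph.graph import StateGraph
from atlas.knowledge.retrieval import retrieve_knowledge


# StateGraph requires a state schema. This one is an assumption for illustration;
# "context" is the key that retrieve_knowledge writes to.
class WorkflowState(TypedDict, total=False):
    context: Dict[str, Any]


# Define the workflow
workflow = StateGraph(WorkflowState)

# Add knowledge retrieval node
workflow.add_node("retrieve_knowledge", retrieve_knowledge)

# Connect the node ("user_input" and "generate_response" are assumed to be added elsewhere)
workflow.add_edge("user_input", "retrieve_knowledge")
workflow.add_edge("retrieve_knowledge", "generate_response")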

Performance Considerations

Chunking Strategy

The default chunking strategy balances two competing priorities:

Semantic Coherence: Keeping logically related content together
Embedding Effectiveness: Avoiding chunks that are too large

The heading-based chunking strategy:

Uses markdown headings as natural document boundaries
Falls back to size-based chunking for large heading sections
Preserves document structure and hierarchy
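The size-based fallback for large heading sections could look roughly like the sketch below. It is an assumption-laden illustration rather than the Atlas implementation: the function name `split_large_section`, the 2000-character default, and the preference for paragraph boundaries are all choices made for this example.

```python
from typing import List


def split_large_section(text: str, max_chars: int = 2000) -> List[str]:
    """Split an oversized section into chunks of at most `max_chars` characters."""
    if len(text) <= max_chars:
        return [text]

    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # Ignore empty paragraphs left by repeated blank lines.
        # Hard-split any single paragraph that is itself too long.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate  # The paragraph still fits in the current chunk.
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Packing whole paragraphs into a chunk before starting a new one keeps related sentences together, and the hard split only applies when a single paragraph exceeds the limit.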

Database Optimization

For optimal performance:

Collection Size: Keep collections under 100,000 documents for best performance
Query Complexity: Simple, focused queries yield better results than complex ones
Result Count: Retrieve 3-5 results for most agent interactions
Persistence Location: Store the database on a fast SSD for better retrieval times

Error Handling

The knowledge system handles the following failure cases:

Connectivity Issues: Falls back to in-memory database if disk access fails
Empty Collections: Warns about missing documents during retrieval
File Access Problems: Skips problematic files during ingestion
Invalid Content: Handles malformed documents gracefully
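As an illustration of the file-level cases, the sketch below reads a list of files and skips any that cannot be opened, cannot be decoded, or are empty, logging a warning for each. The function name `read_documents` and the choice of UTF-8 are assumptions; the exact checks performed during Atlas ingestion are not shown in this document.

```python
import logging
from pathlib import Path
from typing import Dict, Iterable, List

logger = logging.getLogger(__name__)


def read_documents(paths: Iterable[Path]) -> List[Dict[str, str]]:
    """Read each file, skipping any that cannot be opened, decoded, or used."""
    loaded: List[Dict[str, str]] = []
    for path in paths:
        try:
            content = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as error:
            # Missing files, permission errors, and invalid encodings land here.
            logger.warning("Skipping %s: %s", path, error)
            continue
        if not content.strip():
            logger.warning("Skipping %s: file is empty", path)
            continue
        loaded.append({"source": str(path), "content": content})
    return loaded
```

Catching errors per file means one unreadable document does not abort the whole ingestion run.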

Core Features

The knowledge system includes several key features:

Advanced Chunking Strategies: Sophisticated document splitting based on semantic boundaries
Hybrid Retrieval: Combining embedding similarity with keyword search for optimal results
Metadata Filtering: Fine-grained control over document selection based on attributes
Customizable Processing: Configurable settings for different document types and use cases
Versioned Content: Support for tracking different versions of the knowledge base

Future Enhancements

Planned improvements to the knowledge system include:

Cross-Reference Analysis: Identifying relationships between documents
Advanced Reranking: More sophisticated post-retrieval scoring algorithms
Incremental Updates: Better handling of document changes and updates
Cache Management: Query caching for better performance
Multimedia Support: Handling of non-text document types

Related Documentation

Document Ingestion - Detailed documentation on document processing
Retrieval System - Information about knowledge retrieval functionality
ChromaDB Integration - Details on vector database configuration
LangGraph Integration - Documentation on using knowledge with LangGraph
