ChromaDB Usage Guide

This guide provides practical instructions and patterns for using ChromaDB effectively within the Atlas framework. ChromaDB serves as the vector database powering Atlas’ knowledge retrieval system, enabling semantic search and contextual relevance for agents.

Introduction

ChromaDB is a lightweight, embedded vector database that enables efficient storage and retrieval of embeddings and their metadata. In Atlas, it forms the foundation of the knowledge system, allowing agents to find and utilize relevant information based on semantic meaning rather than exact keyword matching.

Getting Started

Installation

Atlas includes ChromaDB as a dependency, so you don’t need to install it separately. However, if you’re building a standalone application using Atlas components, you can install ChromaDB directly:

bash

pip install chromadb>=1.0.8

Basic Setup

To create a ChromaDB client and collection in Atlas:

python

import chromadb
from atlas.core import env

# Get database path from environment or use default
db_path = env.get_string("ATLAS_DB_PATH") or "~/atlas_chroma_db"

# Initialize client with persistent storage
client = chromadb.PersistentClient(path=db_path)

# Create or access a collection
collection_name = env.get_string("ATLAS_COLLECTION_NAME") or "atlas_knowledge_base"
collection = client.get_or_create_collection(name=collection_name)

Core Operations

Adding Documents

To add documents to a collection, you need three components:

Documents (text content)
Metadata (associated information)
IDs (unique identifiers)

Here’s a simple example:

python

# Add documents to collection
collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[
        {"source": "file1.md", "category": "documentation"},
        {"source": "file2.md", "category": "examples"}
    ],
    ids=["doc1", "doc2"]
)

TIP

When adding many documents, consider using batch operations for better performance.

Querying Documents

The primary operation in ChromaDB is similarity search based on vectors:

python

# Basic query with text
results = collection.query(
    query_texts=["How to use ChromaDB?"],
    n_results=5
)

# Process results
for i, (doc, metadata, distance) in enumerate(
    zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    )
):
    print(f"Result {i+1}: {metadata.get('source', 'Unknown')}")
    print(f"Distance: {distance}")
    print(f"Content: {doc[:100]}...")
    print()

Note the nested list structure in the results - the outer list represents each query (usually just one), while the inner list contains multiple results.

Metadata Filtering

ChromaDB allows filtering results based on document metadata. This is essential for retrieving relevant subsets of information in Atlas.

Basic Filtering

python

# Filter by exact metadata value
results = collection.query(
    query_texts=["Knowledge graph"],
    n_results=5,
    where={"category": "documentation"}
)

Filtering Operators

ChromaDB 1.0.8+ supports a variety of filtering operators:

python

# Equality operators
filter_eq = {"version": {"$eq": "3"}}  # Equal to "3" (recommended for 1.0.8+)
filter_eq_simple = {"version": "3"}    # Equal to "3" (alternative syntax)
filter_ne = {"version": {"$ne": "3"}}  # Not equal to "3"

# Numerical operators
filter_gt = {"version_num": {"$gt": 2}}  # Greater than 2
filter_gte = {"version_num": {"$gte": 2}}  # Greater than or equal to 2
filter_lt = {"version_num": {"$lt": 4}}  # Less than 4
filter_lte = {"version_num": {"$lte": 4}}  # Less than or equal to 4

# List operators
filter_in = {"category": {"$in": ["docs", "guides"]}}  # Matches any in list
filter_nin = {"category": {"$nin": ["temp", "draft"]}}  # Matches none in list

# Logical operators
filter_and = {
    "$and": [
        {"category": {"$eq": "documentation"}},
        {"version_num": {"$gte": 3}}
    ]
}

filter_or = {
    "$or": [
        {"category": {"$eq": "documentation"}},
        {"category": {"$eq": "guides"}}
    ]
}

WARNING

ChromaDB 1.0.8+ is more strict about filter syntax than previous versions. Always use the full operator syntax (e.g., {"$eq": value}) rather than the shorthand syntax where possible.

Using Atlas RetrievalFilter

Atlas provides a RetrievalFilter class that simplifies building ChromaDB-compatible filters:

python

from atlas.knowledge.retrieval import RetrievalFilter

# Create a filter
filter = RetrievalFilter()
filter.add_filter("category", {"$eq": "documentation"})  # Using operator syntax
filter.add_operator_filter("version_num", "$gte", 3)     # Helper method for operators

# Create range filter (between min and max, inclusive)
filter.add_range_filter("version_num", 2, 4)  # Equivalent to $gte 2 AND $lte 4

# Create in filter (value must be in list)
filter.add_in_filter("file_name", ["index.md", "overview.md"])

# Create from metadata fields
filter = RetrievalFilter.from_metadata(
    source_path="docs/guides",
    file_type="md",
    version="3"
)

# Combine filters
combined = RetrievalFilter.combine_filters([filter1, filter2], operator="$and")

# Use with knowledge base
from atlas.knowledge.retrieval import KnowledgeBase
kb = KnowledgeBase()
documents = kb.retrieve(query, filter=filter)

The RetrievalFilter class handles ChromaDB 1.0.8’s strict filter syntax requirements automatically and provides convenient helper methods for creating common filter patterns.

Document Content Filtering

In addition to metadata filtering, ChromaDB allows filtering based on document content with the where_document parameter:

python

# Find documents containing specific text
results = collection.query(
    query_texts=["architecture"],
    where_document={"$contains": "knowledge graph"},
    n_results=5
)

This is useful for finding documents that specifically mention certain terms or concepts.

Retrieval Settings

Atlas extends ChromaDB’s capabilities with retrieval settings that allow for more sophisticated search patterns:

python

from atlas.knowledge.retrieval import KnowledgeBase
from atlas.knowledge.settings import RetrievalSettings

# Create knowledge base
kb = KnowledgeBase()

# Configure retrieval settings
settings = RetrievalSettings(
    use_hybrid_search=True,  # Combine vector and keyword search
    semantic_weight=0.7,     # Weight for vector similarity
    keyword_weight=0.3,      # Weight for keyword matching
    num_results=5,           # Number of results to return
    min_relevance_score=0.5, # Minimum similarity threshold
    rerank_results=True      # Apply custom reranking
)

# Retrieve with settings and filter
documents = kb.retrieve(
    query="How does Atlas handle knowledge graphs?",
    filter=filter,
    settings=settings
)

Integration Patterns

Query Client Pattern

Atlas provides a query client interface for retrieving knowledge from ChromaDB:

python

from atlas import create_query_client

# Create a query client
client = create_query_client(
    collection_name="atlas_knowledge_base",
    provider_name="anthropic"  # Provider for completing the query
)

# Retrieve documents and generate a response
result = client.query_with_context(
    "Explain Atlas's knowledge graph structure",
    filter={"version": "latest"},
    max_results=3
)

# Access retrieved documents and response
print(result["response"])
print(f"Found {len(result['context']['documents'])} supporting documents:")
for doc in result['context']['documents']:
    print(f"- {doc['source']}")

Direct ChromaDB Access

For more control, you can use ChromaDB’s API directly:

python

from atlas.knowledge.retrieval import KnowledgeBase

# Initialize knowledge base
kb = KnowledgeBase(collection_name="atlas_knowledge_base")

# Access the ChromaDB collection
collection = kb.collection

# Perform direct operations
count = collection.count()
print(f"Collection contains {count} documents")

# Get all metadata fields
results = collection.get(limit=100)
fields = set()
for metadata in results["metadatas"]:
    fields.update(metadata.keys())
print(f"Available metadata fields: {sorted(list(fields))}")

# Peek at collection contents
peek = collection.peek(limit=5)
for i, (doc, metadata) in enumerate(zip(peek["documents"], peek["metadatas"])):
    print(f"Document {i+1}: {metadata.get('source', 'Unknown')}")
    print(f"Content preview: {doc[:100]}...")

Best Practices

Optimizing Performance

Batch Operations: Use batch operations when adding multiple documents

python

collection.add(
    documents=many_documents,
    metadatas=many_metadatas,
    ids=many_ids
)

Specific Filters: Make filters as specific as possible to reduce result set

python

# More specific is better
filter = {"category": "documentation", "version": "3"}

Filter Before Embedding: Apply metadata filters before semantic search when possible

python

# More efficient
results = collection.query(
    query_texts=["knowledge graph"],
    where={"category": "documentation"},
    n_results=5
)

Limit Result Count: Only request as many results as needed

python

# For agent context, 3-5 results is often sufficient
results = collection.query(query_texts=[query], n_results=3)

Error Handling

Implement robust error handling for ChromaDB operations:

python

try:
    results = collection.query(
        query_texts=[query],
        n_results=5,
        where=filter
    )
except Exception as e:
    logger.error(f"ChromaDB query failed: {e}")
    # Fall back to a simpler query
    try:
        results = collection.query(
            query_texts=[query],
            n_results=5
        )
    except Exception as e2:
        logger.error(f"Fallback query failed: {e2}")
        # Return empty results
        results = {"ids": [[]], "documents": [[]], "metadatas": [[]], "distances": [[]]}

Common Tasks

Managing Document Versions

Use metadata to track document versions:

python

# Add versioned documents
collection.add(
    documents=["Content for version 3", "Content for version 4"],
    metadatas=[
        {"source": "document.md", "version": "3"},
        {"source": "document.md", "version": "4"}
    ],
    ids=["doc_v3", "doc_v4"]
)

# Query specific version
results = collection.query(
    query_texts=["How to use the feature"],
    where={"version": "4"},
    n_results=5
)

Document Categorization

Organize documents by category for targeted retrieval:

python

# Add with categories
collection.add(
    documents=["Architecture document", "Implementation guide"],
    metadatas=[
        {"source": "arch.md", "category": "architecture"},
        {"source": "impl.md", "category": "implementation"}
    ],
    ids=["arch1", "impl1"]
)

# Query by category
results = collection.query(
    query_texts=["system design"],
    where={"category": "architecture"},
    n_results=5
)

Document Updating

Update documents when content changes:

python

# Update a specific document
collection.update(
    ids=["doc1"],
    documents=["Updated content for document 1"],
    metadatas=[{"source": "file1.md", "category": "documentation", "updated": True}]
)

Finding Similar Documents

Find documents similar to an existing one:

python

# Get a specific document
doc = collection.get(ids=["reference_doc"])

# Find similar documents
if doc["documents"]:
    results = collection.query(
        query_texts=[doc["documents"][0]],
        n_results=5
    )

Troubleshooting

Common Issues

Filter Syntax Errors: ChromaDB 1.0.8+ is strict about filter syntax

python

# Wrong: missing $ prefix for operator
wrong_filter = {"version": {"gt": 3}}

# Correct: includes $ prefix
correct_filter = {"version": {"$gt": 3}}

# Wrong: Using simple key-value in logical operators (in 1.0.8+)
wrong_and = {"$and": [{"category": "documentation"}, {"version": "3"}]}

# Correct: Using full operator syntax in logical operators
correct_and = {"$and": [{"category": {"$eq": "documentation"}}, {"version": {"$eq": "3"}}]}

Path Normalization Issues: Path formats must match exactly

python

# Wrong: Using different path formats than what's stored
wrong_filter = {"source": {"$eq": "some/partial/path"}}  # May not match actual path format

# Better: Use exact path as stored in metadata
better_filter = {"source": {"$eq": "docs/components/knowledge/index.md"}}

Type Mismatch: Type inconsistencies in filters

python

# Wrong: comparing string with number
wrong_filter = {"version_num": {"$gt": "3"}}

# Correct: consistent types
correct_filter = {"version_num": {"$gt": 3}}

Path Not Found: Database path doesn’t exist

python

# Ensure the directory exists
import os
os.makedirs(db_path, exist_ok=True)
client = chromadb.PersistentClient(path=db_path)

Exists Operator Issues: The $exists operator is problematic in ChromaDB 1.0.8+

python

# Avoid using $exists operator
problematic_filter = {"source": {"$exists": True}}

# Better: Use specific value matching if possible
better_filter = {"source": {"$ne": None}}

Debugging Techniques

Check Available Metadata:

python

# Get all metadata fields
results = collection.get(limit=100)
fields = set()
for metadata in results["metadatas"]:
  fields.update(metadata.keys())
print(f"Available fields: {sorted(list(fields))}")

Test Filters Directly:

python

# Try filter without query to see if any documents match
filter_test = collection.get(where=filter, limit=10)
print(f"Filter matched {len(filter_test['ids'])} documents")

Check Collection Content:

python

# Peek at collection to verify content
peek = collection.peek(limit=5)
for i, (doc, metadata) in enumerate(zip(peek["documents"], peek["metadatas"])):
    print(f"Document {i}: {metadata}")
    print(f"Content: {doc[:50]}...")

Conclusion

ChromaDB provides the vector database foundation that powers Atlas’s knowledge retrieval system. By understanding how to effectively use ChromaDB for document storage, filtering, and similarity search, you can build powerful knowledge-enabled applications with Atlas.

The Atlas framework extends ChromaDB’s capabilities with additional features like hybrid search, structured filtering, and tight integration with the agent system, making it easy to build sophisticated knowledge applications while still providing access to the underlying database when needed.

Archive

Archive

Possible Future

Archive

Archive

Architecture

Agents

Core

Graph

Knowledge

Providers

Tools

The Matrix

Inner Universe

Nerv

Components

Composites

Patterns

Primitives

Types

ChromaDB Usage Guide ​

Introduction ​

Getting Started ​

Installation ​

Basic Setup ​

Core Operations ​

Adding Documents ​

Querying Documents ​

Metadata Filtering ​

Basic Filtering ​

Filtering Operators ​

Using Atlas RetrievalFilter ​

Document Content Filtering ​

Retrieval Settings ​

Integration Patterns ​

Query Client Pattern ​

Direct ChromaDB Access ​

Best Practices ​

Optimizing Performance ​

Error Handling ​

Common Tasks ​

Managing Document Versions ​

Document Categorization ​

Document Updating ​

Finding Similar Documents ​

Troubleshooting ​

Common Issues ​

Debugging Techniques ​

Conclusion ​

Additional Resources ​

ChromaDB Usage Guide

Introduction

Getting Started

Installation

Basic Setup

Core Operations

Adding Documents

Querying Documents

Metadata Filtering

Basic Filtering

Filtering Operators

Using Atlas RetrievalFilter

Document Content Filtering

Retrieval Settings

Integration Patterns

Query Client Pattern

Direct ChromaDB Access

Best Practices

Optimizing Performance

Error Handling

Common Tasks

Managing Document Versions

Document Categorization

Document Updating

Finding Similar Documents

Troubleshooting

Common Issues

Debugging Techniques

Conclusion

Additional Resources