ChromaDB Usage Guide
This guide provides practical instructions and patterns for using ChromaDB effectively within the Atlas framework. ChromaDB serves as the vector database powering Atlas’ knowledge retrieval system, enabling semantic search and contextual relevance for agents.
Introduction
ChromaDB is a lightweight, embedded vector database that enables efficient storage and retrieval of embeddings and their metadata. In Atlas, it forms the foundation of the knowledge system, allowing agents to find and utilize relevant information based on semantic meaning rather than exact keyword matching.
Getting Started
Installation
Atlas includes ChromaDB as a dependency, so you don’t need to install it separately. However, if you’re building a standalone application using Atlas components, you can install ChromaDB directly:
pip install chromadb>=1.0.8
Basic Setup
To create a ChromaDB client and collection in Atlas:
import chromadb
from atlas.core import env
# Get database path from environment or use default
db_path = env.get_string("ATLAS_DB_PATH") or "~/atlas_chroma_db"
# Initialize client with persistent storage
client = chromadb.PersistentClient(path=db_path)
# Create or access a collection
collection_name = env.get_string("ATLAS_COLLECTION_NAME") or "atlas_knowledge_base"
collection = client.get_or_create_collection(name=collection_name)
Core Operations
Adding Documents
To add documents to a collection, you need three components:
- Documents (text content)
- Metadata (associated information)
- IDs (unique identifiers)
Here’s a simple example:
# Add documents to collection
collection.add(
documents=["This is document 1", "This is document 2"],
metadatas=[
{"source": "file1.md", "category": "documentation"},
{"source": "file2.md", "category": "examples"}
],
ids=["doc1", "doc2"]
)
TIP
When adding many documents, consider using batch operations for better performance.
Querying Documents
The primary operation in ChromaDB is similarity search based on vectors:
# Basic query with text
results = collection.query(
query_texts=["How to use ChromaDB?"],
n_results=5
)
# Process results
for i, (doc, metadata, distance) in enumerate(
zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
)
):
print(f"Result {i+1}: {metadata.get('source', 'Unknown')}")
print(f"Distance: {distance}")
print(f"Content: {doc[:100]}...")
print()
Note the nested list structure in the results - the outer list represents each query (usually just one), while the inner list contains multiple results.
Metadata Filtering
ChromaDB allows filtering results based on document metadata. This is essential for retrieving relevant subsets of information in Atlas.
Basic Filtering
# Filter by exact metadata value
results = collection.query(
query_texts=["Knowledge graph"],
n_results=5,
where={"category": "documentation"}
)
Filtering Operators
ChromaDB 1.0.8+ supports a variety of filtering operators:
# Equality operators
filter_eq = {"version": {"$eq": "3"}} # Equal to "3" (recommended for 1.0.8+)
filter_eq_simple = {"version": "3"} # Equal to "3" (alternative syntax)
filter_ne = {"version": {"$ne": "3"}} # Not equal to "3"
# Numerical operators
filter_gt = {"version_num": {"$gt": 2}} # Greater than 2
filter_gte = {"version_num": {"$gte": 2}} # Greater than or equal to 2
filter_lt = {"version_num": {"$lt": 4}} # Less than 4
filter_lte = {"version_num": {"$lte": 4}} # Less than or equal to 4
# List operators
filter_in = {"category": {"$in": ["docs", "guides"]}} # Matches any in list
filter_nin = {"category": {"$nin": ["temp", "draft"]}} # Matches none in list
# Logical operators
filter_and = {
"$and": [
{"category": {"$eq": "documentation"}},
{"version_num": {"$gte": 3}}
]
}
filter_or = {
"$or": [
{"category": {"$eq": "documentation"}},
{"category": {"$eq": "guides"}}
]
}
WARNING
ChromaDB 1.0.8+ is more strict about filter syntax than previous versions. Always use the full operator syntax (e.g., {"$eq": value}
) rather than the shorthand syntax where possible.
Using Atlas RetrievalFilter
Atlas provides a RetrievalFilter
class that simplifies building ChromaDB-compatible filters:
from atlas.knowledge.retrieval import RetrievalFilter
# Create a filter
filter = RetrievalFilter()
filter.add_filter("category", {"$eq": "documentation"}) # Using operator syntax
filter.add_operator_filter("version_num", "$gte", 3) # Helper method for operators
# Create range filter (between min and max, inclusive)
filter.add_range_filter("version_num", 2, 4) # Equivalent to $gte 2 AND $lte 4
# Create in filter (value must be in list)
filter.add_in_filter("file_name", ["index.md", "overview.md"])
# Create from metadata fields
filter = RetrievalFilter.from_metadata(
source_path="docs/guides",
file_type="md",
version="3"
)
# Combine filters
combined = RetrievalFilter.combine_filters([filter1, filter2], operator="$and")
# Use with knowledge base
from atlas.knowledge.retrieval import KnowledgeBase
kb = KnowledgeBase()
documents = kb.retrieve(query, filter=filter)
The RetrievalFilter
class handles ChromaDB 1.0.8’s strict filter syntax requirements automatically and provides convenient helper methods for creating common filter patterns.
Document Content Filtering
In addition to metadata filtering, ChromaDB allows filtering based on document content with the where_document
parameter:
# Find documents containing specific text
results = collection.query(
query_texts=["architecture"],
where_document={"$contains": "knowledge graph"},
n_results=5
)
This is useful for finding documents that specifically mention certain terms or concepts.
Retrieval Settings
Atlas extends ChromaDB’s capabilities with retrieval settings that allow for more sophisticated search patterns:
from atlas.knowledge.retrieval import KnowledgeBase
from atlas.knowledge.settings import RetrievalSettings
# Create knowledge base
kb = KnowledgeBase()
# Configure retrieval settings
settings = RetrievalSettings(
use_hybrid_search=True, # Combine vector and keyword search
semantic_weight=0.7, # Weight for vector similarity
keyword_weight=0.3, # Weight for keyword matching
num_results=5, # Number of results to return
min_relevance_score=0.5, # Minimum similarity threshold
rerank_results=True # Apply custom reranking
)
# Retrieve with settings and filter
documents = kb.retrieve(
query="How does Atlas handle knowledge graphs?",
filter=filter,
settings=settings
)
Integration Patterns
Query Client Pattern
Atlas provides a query client interface for retrieving knowledge from ChromaDB:
from atlas import create_query_client
# Create a query client
client = create_query_client(
collection_name="atlas_knowledge_base",
provider_name="anthropic" # Provider for completing the query
)
# Retrieve documents and generate a response
result = client.query_with_context(
"Explain Atlas's knowledge graph structure",
filter={"version": "latest"},
max_results=3
)
# Access retrieved documents and response
print(result["response"])
print(f"Found {len(result['context']['documents'])} supporting documents:")
for doc in result['context']['documents']:
print(f"- {doc['source']}")
Direct ChromaDB Access
For more control, you can use ChromaDB’s API directly:
from atlas.knowledge.retrieval import KnowledgeBase
# Initialize knowledge base
kb = KnowledgeBase(collection_name="atlas_knowledge_base")
# Access the ChromaDB collection
collection = kb.collection
# Perform direct operations
count = collection.count()
print(f"Collection contains {count} documents")
# Get all metadata fields
results = collection.get(limit=100)
fields = set()
for metadata in results["metadatas"]:
fields.update(metadata.keys())
print(f"Available metadata fields: {sorted(list(fields))}")
# Peek at collection contents
peek = collection.peek(limit=5)
for i, (doc, metadata) in enumerate(zip(peek["documents"], peek["metadatas"])):
print(f"Document {i+1}: {metadata.get('source', 'Unknown')}")
print(f"Content preview: {doc[:100]}...")
Best Practices
Optimizing Performance
Batch Operations: Use batch operations when adding multiple documents
pythoncollection.add( documents=many_documents, metadatas=many_metadatas, ids=many_ids )
Specific Filters: Make filters as specific as possible to reduce result set
python# More specific is better filter = {"category": "documentation", "version": "3"}
Filter Before Embedding: Apply metadata filters before semantic search when possible
python# More efficient results = collection.query( query_texts=["knowledge graph"], where={"category": "documentation"}, n_results=5 )
Limit Result Count: Only request as many results as needed
python# For agent context, 3-5 results is often sufficient results = collection.query(query_texts=[query], n_results=3)
Error Handling
Implement robust error handling for ChromaDB operations:
try:
results = collection.query(
query_texts=[query],
n_results=5,
where=filter
)
except Exception as e:
logger.error(f"ChromaDB query failed: {e}")
# Fall back to a simpler query
try:
results = collection.query(
query_texts=[query],
n_results=5
)
except Exception as e2:
logger.error(f"Fallback query failed: {e2}")
# Return empty results
results = {"ids": [[]], "documents": [[]], "metadatas": [[]], "distances": [[]]}
Common Tasks
Managing Document Versions
Use metadata to track document versions:
# Add versioned documents
collection.add(
documents=["Content for version 3", "Content for version 4"],
metadatas=[
{"source": "document.md", "version": "3"},
{"source": "document.md", "version": "4"}
],
ids=["doc_v3", "doc_v4"]
)
# Query specific version
results = collection.query(
query_texts=["How to use the feature"],
where={"version": "4"},
n_results=5
)
Document Categorization
Organize documents by category for targeted retrieval:
# Add with categories
collection.add(
documents=["Architecture document", "Implementation guide"],
metadatas=[
{"source": "arch.md", "category": "architecture"},
{"source": "impl.md", "category": "implementation"}
],
ids=["arch1", "impl1"]
)
# Query by category
results = collection.query(
query_texts=["system design"],
where={"category": "architecture"},
n_results=5
)
Document Updating
Update documents when content changes:
# Update a specific document
collection.update(
ids=["doc1"],
documents=["Updated content for document 1"],
metadatas=[{"source": "file1.md", "category": "documentation", "updated": True}]
)
Finding Similar Documents
Find documents similar to an existing one:
# Get a specific document
doc = collection.get(ids=["reference_doc"])
# Find similar documents
if doc["documents"]:
results = collection.query(
query_texts=[doc["documents"][0]],
n_results=5
)
Troubleshooting
Common Issues
Filter Syntax Errors: ChromaDB 1.0.8+ is strict about filter syntax
python# Wrong: missing $ prefix for operator wrong_filter = {"version": {"gt": 3}} # Correct: includes $ prefix correct_filter = {"version": {"$gt": 3}} # Wrong: Using simple key-value in logical operators (in 1.0.8+) wrong_and = {"$and": [{"category": "documentation"}, {"version": "3"}]} # Correct: Using full operator syntax in logical operators correct_and = {"$and": [{"category": {"$eq": "documentation"}}, {"version": {"$eq": "3"}}]}
Path Normalization Issues: Path formats must match exactly
python# Wrong: Using different path formats than what's stored wrong_filter = {"source": {"$eq": "some/partial/path"}} # May not match actual path format # Better: Use exact path as stored in metadata better_filter = {"source": {"$eq": "docs/components/knowledge/index.md"}}
Type Mismatch: Type inconsistencies in filters
python# Wrong: comparing string with number wrong_filter = {"version_num": {"$gt": "3"}} # Correct: consistent types correct_filter = {"version_num": {"$gt": 3}}
Path Not Found: Database path doesn’t exist
python# Ensure the directory exists import os os.makedirs(db_path, exist_ok=True) client = chromadb.PersistentClient(path=db_path)
Exists Operator Issues: The $exists operator is problematic in ChromaDB 1.0.8+
python# Avoid using $exists operator problematic_filter = {"source": {"$exists": True}} # Better: Use specific value matching if possible better_filter = {"source": {"$ne": None}}
Debugging Techniques
Check Available Metadata:
python# Get all metadata fields results = collection.get(limit=100) fields = set() for metadata in results["metadatas"]: fields.update(metadata.keys()) print(f"Available fields: {sorted(list(fields))}")
Test Filters Directly:
python# Try filter without query to see if any documents match filter_test = collection.get(where=filter, limit=10) print(f"Filter matched {len(filter_test['ids'])} documents")
Check Collection Content:
python# Peek at collection to verify content peek = collection.peek(limit=5) for i, (doc, metadata) in enumerate(zip(peek["documents"], peek["metadatas"])): print(f"Document {i}: {metadata}") print(f"Content: {doc[:50]}...")
Conclusion
ChromaDB provides the vector database foundation that powers Atlas’s knowledge retrieval system. By understanding how to effectively use ChromaDB for document storage, filtering, and similarity search, you can build powerful knowledge-enabled applications with Atlas.
The Atlas framework extends ChromaDB’s capabilities with additional features like hybrid search, structured filtering, and tight integration with the agent system, making it easy to build sophisticated knowledge applications while still providing access to the underlying database when needed.