Test Suite Planning
Current Development Status: Atlas is currently focusing on example-driven development rather than comprehensive test-driven development. The testing infrastructure described in this document represents the planned testing architecture and strategy, but is not the current implementation priority. We are maintaining this document as a reference for future test implementation while focusing on robust examples and functional demonstrations.
This document outlines the comprehensive plan for restructuring, enhancing, and maintaining the Atlas test suite. It combines the test audit findings with our refactoring strategy.
1. Current Testing Architecture Assessment
Atlas employs a multi-tiered testing approach with distinct categories:
- Unit Tests: Tests for individual components
- Mock Tests: Tests that don’t require API access using the MockProvider
- API Tests: Integration tests that make real API calls
- Minimal Tests: Basic functionality tests for core components
Distribution of Test Types
- Abundant unit tests covering individual components
- Good mocked tests for API-free testing
- Limited integration tests connecting multiple components
- Very few end-to-end tests simulating real usage patterns
- Almost no real API tests that actually test against live endpoints
Current Issues
- Scattered Test Implementation: Tests are scattered across various files and directories without clear organization
- Confusing Directory Structure: The purpose of subdirectories like standard and integration is unclear
- Environment Variable Configuration: Using environment variables for test configuration adds complexity
- Mixed Test Types: Unit tests, mock tests, and API tests are mixed together in ways that make them hard to track
- Unclear Test CLI: Current test runner CLI doesn’t clearly indicate which tests will run
- Implementation-Specific Tests: Many tests are tightly coupled to implementation details
2. Recent Improvements
The following test issues have been successfully resolved:
Fixed OpenAI Provider Tests:
- Added missing ErrorSeverity import to fix error handling tests
- Fixed generate test with proper environment variable mocking
- Corrected stream and stream_with_callback tests
Fixed Ollama Provider Tests:
- Updated error handling assertions to match actual implementation messages
- Improved mocking in generate, streaming, and error handling tests
Fixed Tool Capable Agent Tests:
- Updated system prompt format assertion to match actual implementation
Implemented Real API Tests:
- Created real_api_tests.py with comprehensive provider tests
- Implemented test decorators for API and expensive tests
- Added token usage tracking and cost controls
- Updated test runner to support real API tests
- Added documentation for running API tests
All 110 tests are now passing, with 4 tests being skipped (these require actual API keys).
3. Provider Tests Analysis
OpenAI Provider Tests (test_openai_provider.py)
Strengths:
- Comprehensive test coverage of core functionality
- Detailed mocking of OpenAI API responses
- Good error handling test coverage
- Thorough tests for streaming capabilities
Weaknesses:
- No actual API calls, relies entirely on mocking
- Hardcoded expectations for model names and pricing that could become outdated
- Some tests simulate OpenAI’s API structure rather than using actual responses
- Model cost calculations may diverge from actual pricing over time
Ollama Provider Tests (test_ollama_provider.py)
Strengths:
- Comprehensive test coverage similar to OpenAI provider
- Good coverage of Ollama-specific API behaviors
- Thorough error handling tests
Weaknesses:
- Similar to OpenAI, relies entirely on mocking
- Simulates Ollama’s API structures rather than using real responses
- Token estimation may not match Ollama’s actual behaviors
Mock Provider Tests (test_mock_provider.py)
Strengths:
- Well-implemented MockProvider for API-free testing
- Complete with simulated delays, streaming, error handling, etc.
- Used in many tests to avoid API costs
Weaknesses:
- May not fully replicate the behavior of real providers
- Could give false confidence if mock behavior diverges from real APIs
4. Agent Tests Analysis
Tool Capable Agent Tests (test_tool_capable_agent.py)
Strengths:
- Tests both basic functionality and tool-specific behaviors
- Covers error handling in tool execution
- Tests structured messaging with tools
Weaknesses:
- System prompt format assertions are brittle and may break with implementation changes
- Limited coverage of complex tool interactions
- Missing tests for tool permission management
- No tests for handling streaming responses with tool calls
Agent Toolkit Tests (test_agent_toolkit.py)
Strengths:
- Good coverage of toolkit registration and permissions
- Tests for both function-based and class-based tools
- Thorough error condition testing for invalid arguments and permissions
Weaknesses:
- No tests for integration with actual provider responses
- Limited testing of complex tools with different output types
- Missing tests for dynamic tool discovery
5. Identified Issues and False Positives
Several patterns in the current tests might lead to false positives:
Mocked API Structure Mismatch
- Some mocks don’t match the actual API response structure
- These tests may pass with incorrect mocks but fail with real APIs
Hardcoded Expectations
- Tests with hardcoded model names, pricing, or text values
- These fail when implementations change even if behavior is correct
System Prompt Assertions
- Tests that check for specific text in system prompts
- These break when prompt wording changes, even if functionality remains
Token Counting Approximations
- Mocked token counts that may not match actual tokenization
- May pass in tests but fail with real usage
6. Proposed Test Structure
atlas/tests/
├── unit/ # Unit tests that don't require API calls
│ ├── core/ # Tests for core features
│ ├── models/ # Tests for model abstractions
│ ├── knowledge/ # Tests for knowledge features
│ ├── agents/ # Tests for agent features
│ └── tools/ # Tests for tool features
│
├── mock/ # Tests that use mocked providers/APIs
│ ├── providers/ # Mocked provider tests
│ ├── agents/ # Agent tests with mocked providers
│ └── workflows/ # Workflow tests with mocked components
│
├── integration/ # Tests that connect multiple components
│ ├── agent_tool/ # Agent + Tool integration tests
│ ├── knowledge_agent/ # Knowledge + Agent integration tests
│ └── workflow/ # Full workflow integration tests
│
├── api/ # Real API tests (expensive)
│ ├── openai/ # OpenAI-specific API tests
│ ├── anthropic/ # Anthropic-specific API tests
│ └── ollama/ # Ollama-specific API tests
│
├── fixtures/ # Test fixtures and data
│ ├── documents/ # Test documents
│ ├── responses/ # Sample API responses
│ └── tools/ # Test tool implementations
│
├── helpers/ # Test helper functions
│ ├── __init__.py
│ ├── decorators.py # Test decorators
│ ├── assertions.py # Custom assertions
│ └── mocks.py # Common mocks
│
└── __init__.py # Main test entry point
7. Updated CLI Interface
Replace environment variable configuration with explicit CLI flags:
# Run all unit tests
python -m atlas.scripts.testing.run_tests unit
# Run specific unit test(s)
python -m atlas.scripts.testing.run_tests unit --module core.config
# Run mock tests
python -m atlas.scripts.testing.run_tests mock
# Run integration tests
python -m atlas.scripts.testing.run_tests integration
# Run real API tests (with confirmation)
python -m atlas.scripts.testing.run_tests api --confirm
# Run OpenAI API tests only
python -m atlas.scripts.testing.run_tests api --provider openai --confirm
# Run multiple test types
python -m atlas.scripts.testing.run_tests unit mock
# Run all tests (will prompt for confirmation before running API tests)
python -m atlas.scripts.testing.run_tests all
8. Standardization Recommendations
8.1 Make Tests Less Brittle
Many tests are tightly coupled to implementation details:
- Hardcoded model names that might change
- Specific expected text in system prompts
- Fixed pricing expectations
Solution: Extract expected values from implementation:
# Instead of this:
self.assertEqual(provider._model_name, "gpt-4o")
# Do this:
expected_model = provider._model_name # Or a configurable test value
self.assertEqual(provider._model_name, expected_model)
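An alternative, assuming a hypothetical shared expectations module in the test helpers, is to keep such values in one place so a model-name or pricing change only touches a single file:

# tests/helpers/expected.py (hypothetical module): single source of truth for expected values
DEFAULT_OPENAI_MODEL = "gpt-4o"

# In a test:
from atlas.tests.helpers import expected  # hypothetical import path

self.assertEqual(provider._model_name, expected.DEFAULT_OPENAI_MODEL)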
8.2 Standardize Test Patterns Across Providers
Each provider has slightly different test patterns:
Solution: Create standard test wrappers:
def test_provider_basic_functionality(provider_class, model_name, **kwargs):
    """Run standard provider tests for any provider implementation."""
    provider = provider_class(model_name=model_name, **kwargs)

    # Test API key validation
    assert provider.validate_api_key()

    # Test basic request
    request = ModelRequest(
        messages=[ModelMessage.user("Test message")],
        max_tokens=10
    )

    # Mock any API calls
    with mock_provider_api(provider_class):
        response = provider.generate(request)

    # Standard assertions
    assert isinstance(response, ModelResponse)
    assert response.provider == provider.name
    assert response.model == model_name
    assert isinstance(response.usage, TokenUsage)
    assert isinstance(response.cost, CostEstimate)
8.3 Implement Test Categories via Decorators
Instead of using environment variables, use clear decorators:
import functools
import unittest

# API_TESTS_ENABLED and EXPENSIVE_TESTS_ENABLED are module-level flags set by the test runner.

def unit_test(f):
    """Decorator for unit tests."""
    return f

def mock_test(f):
    """Decorator for tests that use mocked providers."""
    return f

def integration_test(f):
    """Decorator for integration tests that connect multiple components."""
    return f

def api_test(f):
    """Decorator for tests that make real API calls."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        if not API_TESTS_ENABLED:
            raise unittest.SkipTest("API tests disabled - run with appropriate flag")
        return f(*args, **kwargs)
    return wrapper

def expensive_test(f):
    """Decorator for tests that make expensive API calls."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        if not EXPENSIVE_TESTS_ENABLED:
            raise unittest.SkipTest("Expensive tests disabled - run with appropriate flag")
        return f(*args, **kwargs)
    return wrapper
8.4 Implement Token Usage Tracking in Tests
class TestWithTokenTracking(unittest.TestCase):
    """Base class for tests with token usage tracking."""

    def setUp(self):
        self.total_tokens_used = 0
        self.estimated_cost = 0.0

    def track_usage(self, response):
        """Track token usage from a response."""
        if hasattr(response, "usage") and response.usage:
            self.total_tokens_used += response.usage.total_tokens
        if hasattr(response, "cost") and response.cost:
            self.estimated_cost += response.cost.total_cost

    def tearDown(self):
        """Report token usage."""
        print(f"\nTest used {self.total_tokens_used} tokens at an estimated cost of ${self.estimated_cost:.6f}")
9. API Testing Strategy
9.1 Cost Control for API Tests
For API tests, we’ll implement the following safeguards:
- Explicit Confirmation: Require the --confirm flag for API tests
- Provider Selection: Allow testing specific providers only
- Token Usage Tracking: Track and report token usage and cost
- Usage Limits: Set token/cost limits for test runs (see the sketch below)
- Minimal Requests: Design tests to use minimal token counts
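A minimal sketch of how usage limits might be enforced, building on the TestWithTokenTracking base class from section 8.4; the budget constants are illustrative values, not project policy:

# Illustrative per-run budgets; actual limits would be project policy, not these values.
MAX_TEST_TOKENS = 2000
MAX_TEST_COST_USD = 0.05


class TestWithCostLimits(TestWithTokenTracking):
    """Fail fast once a test run exceeds the configured token or cost budget."""

    def track_usage(self, response):
        super().track_usage(response)
        if self.total_tokens_used > MAX_TEST_TOKENS:
            self.fail(f"Token budget exceeded: {self.total_tokens_used} > {MAX_TEST_TOKENS}")
        if self.estimated_cost > MAX_TEST_COST_USD:
            self.fail(f"Cost budget exceeded: ${self.estimated_cost:.4f} > ${MAX_TEST_COST_USD}")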
9.2 Provider-Specific Strategies
OpenAI Tests
Use GPT-3.5-Turbo for API tests to minimize costs while still testing actual API integration:
def test_openai_real_api(self):
    """Integration test with real OpenAI API."""
    provider = OpenAIProvider(model_name="gpt-3.5-turbo")
    request = ModelRequest(
        messages=[ModelMessage.user("Write a single word response: Hello")],
        max_tokens=5
    )
    response = provider.generate(request)
    self.assertIsNotNone(response.content)
    self.assertLessEqual(response.usage.total_tokens, 20)
Ollama Tests
Ollama is free to run locally, making it ideal for real API testing without costs:
def test_ollama_real_api(self):
    """Integration test with real Ollama API."""
    if not self._is_ollama_running():
        self.skipTest("Ollama not running")
    provider = OllamaProvider(model_name="llama2")
    request = ModelRequest(
        messages=[ModelMessage.user("Write a single word response: Hello")],
        max_tokens=5
    )
    response = provider.generate(request)
    self.assertIsNotNone(response.content)
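The _is_ollama_running helper is referenced above but not shown; a minimal sketch, assuming Ollama's default local endpoint on port 11434:

import urllib.error
import urllib.request


def _is_ollama_running(self, base_url="http://localhost:11434"):
    """Return True if a local Ollama server answers, so tests can be skipped otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False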
Anthropic Tests
Use Anthropic Claude Haiku for cost-effective API tests:
def test_anthropic_real_api(self):
    """Integration test with real Anthropic API."""
    provider = AnthropicProvider(model_name="claude-3-haiku-20240307")
    request = ModelRequest(
        messages=[ModelMessage.user("Write a single word response: Hello")],
        max_tokens=5
    )
    response = provider.generate(request)
    self.assertIsNotNone(response.content)
    self.assertLessEqual(response.usage.total_tokens, 20)
10. Implementation Plan
Phase 1: Create New Directory Structure
- Create new test directories according to the proposed structure
- Set up __init__.py files to enable proper imports
- Create test helper modules under the new structure
Phase 2: Migrate Existing Tests
- Move existing tests to appropriate locations in the new structure
- Update imports and paths as needed
- Ensure all tests still run in the new structure
Phase 3: Update Test Runner
- Refactor the test runner to work with the new directory structure
- Replace environment variable configuration with CLI flags
- Add confirmation prompts for expensive API tests (a sketch follows this list)
- Update discovery mechanism to find tests in the new locations
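A minimal sketch of a flag-driven runner with a confirmation prompt, using argparse and unittest discovery; the flag names follow section 7, but the module layout and implementation details here are assumptions about the eventual runner:

import argparse
import sys
import unittest


def main(argv=None):
    parser = argparse.ArgumentParser(prog="run_tests")
    parser.add_argument("suites", nargs="+", choices=["unit", "mock", "integration", "api", "all"])
    parser.add_argument("--module", help="Dotted path of a specific test module, e.g. core.config")
    parser.add_argument("--provider", help="Restrict API tests to one provider (filtering omitted in this sketch)")
    parser.add_argument("--confirm", action="store_true", help="Acknowledge that API tests may incur costs")
    args = parser.parse_args(argv)

    suites = ["unit", "mock", "integration", "api"] if "all" in args.suites else args.suites

    # Require explicit acknowledgement before anything that spends real tokens.
    if "api" in suites and not args.confirm:
        answer = input("API tests make real (billable) calls. Continue? [y/N] ")
        if answer.strip().lower() != "y":
            print("Skipping API tests.")
            suites = [s for s in suites if s != "api"]

    loader = unittest.TestLoader()
    combined = unittest.TestSuite()
    for suite in suites:
        if args.module:
            # Exact module naming under each suite directory is an assumption.
            combined.addTests(loader.loadTestsFromName(f"atlas.tests.{suite}.{args.module}"))
        else:
            combined.addTests(loader.discover(f"atlas/tests/{suite}"))

    result = unittest.TextTestRunner(verbosity=2).run(combined)
    sys.exit(0 if result.wasSuccessful() else 1)


if __name__ == "__main__":
    main()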
Phase 4: Create New Test Templates
- Create standardized test templates for each test type
- Ensure consistent patterns across all test types
- Document test patterns and best practices
Phase 5: Update Documentation
- Update test documentation to reflect the new structure
- Create examples of how to run different types of tests
- Document best practices for writing tests
Phase 6: Cleanup
- Remove deprecated test structure and files
- Update all references to old test paths
- Verify all tests still run correctly
11. Migration Strategy
To minimize disruption, we’ll use a phased approach:
- Create the new structure alongside the existing one
- Move tests one category at a time, ensuring they still pass
- Update the test runner to support both old and new structures temporarily
- After all tests are migrated, remove the old structure
This will ensure that tests continue to run throughout the migration process.
12. Current Challenges and Implementation Steps
Completed Steps
- ✅ Implement new test directory structure (unit, mock, integration, api)
- ✅ Update test runner CLI to use explicit flags instead of environment variables
- ✅ Create standardized test patterns with helper classes and decorators
- ✅ Migrate existing tests to new structure
- ✅ Implement proper API test cost controls and confirmation system
- ✅ Add comprehensive integration tests for component interactions
Current Challenges
API Tests in Mock Mode: The current test decorators don’t properly skip API tests when running in mock mode, causing potential unwanted API calls ✅
Provider Testing with Retry Mechanism: The newly implemented robust retry mechanism requires specialized testing approaches (see the sketch after this list)
- Provider tests need to be updated to work with retry mechanism and circuit breaker
- Tests should verify exponential backoff behavior with deterministic failures
- Circuit breaker state transitions need explicit testing
- Test mocks should provide deterministic failure sequences for retry testing
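A minimal sketch of how deterministic failure sequences might drive such tests; FlakyAPI and call_with_retry are illustrative stand-ins, not existing Atlas APIs:

import unittest
from unittest import mock


class FlakyAPI:
    """Test double that fails a fixed number of times before succeeding."""

    def __init__(self, failures_before_success, error):
        self.calls = 0
        self.failures_before_success = failures_before_success
        self.error = error

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls <= self.failures_before_success:
            raise self.error
        return {"content": "ok"}


class TestRetryBehavior(unittest.TestCase):
    def test_retries_then_succeeds(self):
        flaky = FlakyAPI(failures_before_success=2, error=TimeoutError("simulated timeout"))
        # Patch sleep so exponential backoff runs instantly and its delays can be inspected.
        with mock.patch("time.sleep") as mock_sleep:
            result = call_with_retry(flaky, max_retries=3)  # hypothetical retry helper under test
        self.assertEqual(result["content"], "ok")
        self.assertEqual(flaky.calls, 3)
        # Delays recorded between attempts should be non-decreasing under exponential backoff.
        delays = [call.args[0] for call in mock_sleep.call_args_list]
        self.assertEqual(delays, sorted(delays))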
Structured Mocking Strategy: Implement a more consistent provider-specific mocking approach
- Create standardized mock libraries for each provider (initial implementation complete)
- Use these mock libraries across all provider tests for consistency
- Implement retry-aware mocking that can simulate transient failures
- Add helpers to simulate different error conditions (rate limits, timeouts, etc.)
- Mock objects don’t properly replicate the actual API response structures
Test Runner Limitations: The test runner needs improvements to better identify test types and prevent running API tests in mock mode
Missing Helper Libraries: Need standardized, reliable mock implementations for each provider to ensure consistent testing
Implementation Steps
Update API test decorator to check current test context more robustly ✅ Complete
- Added robust test context detection based on CLI arguments
- Implemented proper skipping of API tests when in mock mode
- Added unit tests for decorator behavior in atlas/tests/unit/helpers/test_decorators.py
Create provider-specific mock libraries ✅ Started but needs fixing
- Create mock_openai.py with standardized OpenAI mocking utilities
- Fix null to None issue in sample response data
- Create mock_anthropic.py with standardized Anthropic mocking utilities
- Create mock_ollama.py with standardized Ollama mocking utilities
- Ensure all mock utilities return properly structured objects that match real API responses
- Add unit tests for mock utilities to verify their behavior
Fix provider test files to properly mock all API calls
- Fix OpenAI provider tests to properly initialize mocked client in setUp
- Use consistent patching approach with addCleanup for proper teardown
- Implement proper error simulation for each provider type
- Ensure streaming tests don’t hang or timeout
- Add proper assertions to verify mock interactions
- Standardize test patterns across all provider implementations
Enhance test runner to better filter tests based on their type
- Update test discovery to recognize test type metadata from decorators (see the sketch after this list)
- Add ability to filter by multiple test types in a single run
- Improve reporting to show which tests are being skipped and why
- Add clear error messages when tests fail due to missing mocks
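A minimal sketch of how type metadata could be attached by a decorator and used to filter a discovered suite; the _test_type attribute and helper are assumptions rather than existing runner code:

import unittest


def mock_test(f):
    """Tag a test function as a mock test so the runner can filter on it."""
    f._test_type = "mock"  # hypothetical metadata attribute read by the runner
    return f


def filter_suite_by_type(suite, allowed):
    """Return a new suite containing only tests whose declared type is in `allowed`."""
    filtered = unittest.TestSuite()
    for item in suite:
        if isinstance(item, unittest.TestSuite):
            filtered.addTests(filter_suite_by_type(item, allowed))
        else:
            method = getattr(item, item._testMethodName, None)
            test_type = getattr(method, "_test_type", "unit")  # untagged tests default to unit
            if test_type in allowed:
                filtered.addTest(item)
    return filtered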
Create comprehensive documentation of mocking approach
- Document standard patterns for mocking each provider
- Create examples of correct mock usage for common test scenarios
- Add documentation of common pitfalls and debugging approaches
- Update README with clear instructions for running tests with mocks
13. Structured Mocking Strategy
After evaluating the current status of the test suite, we’ve identified a need for a more structured approach to provider mocking. This section outlines the strategy for implementing reliable, consistent mocks across all provider tests.
13.1 Provider-Specific Mock Libraries
Each provider requires a dedicated mock library to handle its unique API structure; a sketch of the general shape follows the list below:
OpenAI Mock Library (mock_openai.py):
- Sample responses for different models (GPT-4, GPT-3.5-Turbo)
- Realistic error responses (auth errors, rate limits, content filtering)
- Streaming utilities that properly simulate chunks
- Helper methods for common testing scenarios
Anthropic Mock Library (mock_anthropic.py):
- Sample responses for different Claude models
- Error simulation for Anthropic-specific error types
- Streaming utilities for content_block_delta format
- Realistic token usage and cost calculation
Ollama Mock Library (mock_ollama.py):
- Sample responses for various Ollama models
- Mocking for Ollama’s unique response format
- Request validation and error handling
- Simulated token counting (since Ollama doesn’t provide this)
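As a rough shape for these libraries, here is a minimal sketch of a mock_openai.py helper built on unittest.mock; the patch target and helper name mirror the patterns shown in section 13.3, but the sample payload structure is an assumption:

import contextlib
from unittest import mock

# Illustrative payload shaped like an OpenAI chat completion response (fields are assumptions).
SAMPLE_COMPLETION = {
    "id": "chatcmpl-test",
    "model": "gpt-3.5-turbo",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Hello"}, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 1, "total_tokens": 6},
}


@contextlib.contextmanager
def mock_openai_completions(payload=SAMPLE_COMPLETION, error=None):
    """Patch the OpenAI client so chat.completions.create returns `payload` or raises `error`."""
    # Patch target follows the pattern used in section 13.3; adjust to wherever Atlas imports the client.
    with mock.patch("atlas.providers.openai.OpenAI") as mock_cls:
        client = mock_cls.return_value
        if error is not None:
            client.chat.completions.create.side_effect = error
        else:
            client.chat.completions.create.return_value = mock.MagicMock(
                **{"model_dump.return_value": payload}
            )
        yield client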
13.2 Implementation Progress
Completed:
- ✅ Fixed API test decorator to properly skip tests in mock mode
- ✅ Created initial mock_openai.py with standard mocking utilities
- ✅ Fixed OpenAI provider test setUp method to properly mock client
- ✅ Fixed ‘null’ to ‘None’ issue in mock_openai.py
In Progress:
- 🔄 Creating standardized mock_anthropic.py library
- 🔄 Fixing streaming tests to avoid timeouts
- 🔄 Updating provider tests to use new mock libraries
Pending:
- ⏱️ Create mock_ollama.py
- ⏱️ Add unit tests for mock libraries
- ⏱️ Fix Provider Errors tests
- ⏱️ Update test documentation
13.3 Recommended Testing Patterns
For effective provider testing, all tests should follow these patterns:
Test Isolation:
def setUp(self):
    # Create patch with cleanup to ensure it's removed after tests
    patcher = mock.patch('atlas.providers.openai.OpenAI')
    self.mock_openai = patcher.start()
    self.addCleanup(patcher.stop)

    # Configure mock client
    mock_client = mock.MagicMock()
    self.mock_openai.return_value = mock_client

    # Create provider with mocked dependencies
    self.provider = self.provider_class(api_key="test-key")

    # Ensure the provider uses our mock
    self.provider._client = mock_client
Standardized Mock Usage:
@mock_test
def test_generate_mocked(self):
    """Test generating a response with a mocked API."""
    # Create a request
    request = self._create_request("Test message")

    # Use standardized mock from library
    with mock_openai_completions():
        # Generate response using mocked client
        response = self.provider.generate(request)

    # Verify response structure
    self.assertIsInstance(response, ModelResponse)
    self.assertEqual(response.provider, self.provider_name)
Timeout Prevention in Streaming Tests:
# Limit iterations to prevent infinite loops
try:
    for _ in range(10):  # Maximum of 10 iterations
        delta, response = next(stream_handler)
        # Process chunk...
except StopIteration:
    # Stream completed normally
    pass
Mock Verification:
# Verify our mock was called with expected parameters
self.provider._client.chat.completions.create.assert_called_once()
call_kwargs = self.provider._client.chat.completions.create.call_args[1]
self.assertEqual(call_kwargs.get("model"), expected_model)
14. Conclusion
This test suite restructuring will significantly improve the organization, clarity, and maintainability of the Atlas test suite. It will provide clearer separation of concerns, better discoverability, and more explicit control over test execution. By implementing standardized patterns and helper classes, we can make tests more resilient to implementation changes while maintaining comprehensive coverage of all system components.
The addition of proper API test cost controls and confirmation mechanisms will enable safe testing with real APIs while keeping costs under control. This will ensure that Atlas works correctly with actual LLM APIs while preventing regressions during development.
The new structured mocking approach with dedicated provider mock libraries will ensure consistent, reliable testing that doesn’t make unwanted API calls while providing thorough coverage of the Atlas codebase.
With these improvements, the Atlas test suite will provide reliable validation of system behavior across different providers and use cases, supporting the continued development and enhancement of the Atlas platform.