Multi-Engine Architecture

The MicroDC Worker supports running multiple inference engines simultaneously, loading them on-demand based on job requirements.

Overview

Instead of a single static engine, the worker can be configured with multiple available engines. Each engine is loaded only when a job requests it, optimizing resource usage.

Configuration

Available Engines

Configure which engines this worker supports:

engine:
  available:
    - ollama
    - transformers

Or via environment variable:

export MICRODC_ENGINES="[ollama, transformers]"
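The bracketed, comma-separated format above might be parsed along these lines (a sketch only; the function name and exact parsing rules are assumptions, not the worker's actual config loader):

```python
import os

def parse_engine_list(raw: str) -> list[str]:
    """Parse a bracketed list like "[ollama, transformers]" into engine names.

    Illustrative helper, not part of the MicroDC Worker API.
    """
    return [name.strip() for name in raw.strip("[] ").split(",") if name.strip()]

# Fall back to a single-engine default when the variable is unset (assumed default).
engines = parse_engine_list(os.environ.get("MICRODC_ENGINES", "[ollama]"))
```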

Engine-Specific Configuration

Each engine has its own configuration section:

engine:
  available:
    - ollama
    - transformers

  ollama:
    base_url: http://localhost:11434
    timeout: 600

  transformers:
    model_path: ./models
    device: auto
    max_memory_mb: 0
    auto_unload: true

How It Works

1. Engine Discovery

At startup, the worker reads engine.available and stores the list of supported engine types. No engines are loaded yet.

2. Job Routing

When a job arrives, the worker checks the platform field:

{
  "model_id": "llama3.1:8b",
  "platform": "ollama",
  "job_type": "llm",
  "input_data": "Hello!"
}

3. On-Demand Loading

If the requested engine isn't loaded:

  1. Worker creates the engine instance
  2. Engine initializes with its config
  3. Engine is cached for future jobs
  4. Job is executed
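The load-and-cache flow above can be sketched as follows. This is a simplified illustration: `DummyEngine`, `get_engine`, and the `initialize` hook are hypothetical names; the real engine construction is dispatched in `src/core/client.py:_create_engine()`.

```python
class DummyEngine:
    """Stand-in for a real engine class (illustrative only)."""
    def __init__(self, platform: str):
        self.platform = platform
        self.initialized = False

    def initialize(self) -> None:
        # A real engine would read its own config section here.
        self.initialized = True

class Worker:
    def __init__(self, available: list[str]):
        self.available = available  # from engine.available; nothing is loaded yet
        self._engines = {}          # platform name -> initialized engine instance

    def get_engine(self, platform: str):
        if platform not in self.available:
            raise ValueError(f"engine '{platform}' is not in engine.available")
        if platform not in self._engines:    # on-demand load
            engine = DummyEngine(platform)   # real code goes through _create_engine()
            engine.initialize()              # engine initializes with its config
            self._engines[platform] = engine # cached for future jobs
        return self._engines[platform]
```

Because the cache is keyed by platform name, the initialization cost is paid once per engine per worker lifetime.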

4. Default Platform

If a job omits the platform field, the worker falls back to the first engine listed in engine.available.

Job Examples

Ollama Job

{
  "model_id": "llama3.1:8b",
  "platform": "ollama",
  "job_type": "llm",
  "input_data": "Explain quantum computing."
}

Transformers Job

{
  "model_id": "meta-llama/Llama-2-7b-chat-hf",
  "platform": "transformers",
  "job_type": "llm",
  "input_data": "Write a haiku about coding."
}

Embedding Job (Auto-Platform)

{
  "model_id": "nomic-embed-text",
  "job_type": "embed",
  "input_data": {"texts": ["Hello world"]}
}

Without a platform field, the worker uses the first available engine that has the model.

Heartbeat Reporting

The worker reports all available engines and their loaded models in heartbeats:

{
  "engines": ["ollama", "transformers"],
  "models": [
    {"id": "llama3.1:8b", "platform": "ollama"},
    {"id": "nomic-embed-text", "platform": "transformers"}
  ]
}
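A payload of that shape could be assembled roughly like this (a sketch; the input mapping of loaded models per engine is an assumption about worker internals):

```python
def build_heartbeat(loaded: dict[str, list[str]]) -> dict:
    """Build a heartbeat payload.

    `loaded` maps platform name -> model ids currently loaded on that engine
    (hypothetical internal state, shown for illustration).
    """
    return {
        "engines": list(loaded),  # all available engines
        "models": [
            {"id": model_id, "platform": platform}
            for platform, models in loaded.items()
            for model_id in models
        ],
    }
```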

Memory Management

Each engine manages its own memory independently:

  • Ollama: Managed by Ollama server
  • Transformers: LRU eviction with VRAM tracking

When multiple engines are loaded, their memory use is not coordinated, so monitor total GPU memory consumption across all engines.

Adding Custom Engines

To add a new engine:

  1. Create engine class inheriting from InferenceEngine
  2. Implement all abstract methods (see src/engines/base.py)
  3. Add configuration section to config/default.yaml
  4. Register in src/core/client.py:_create_engine()
  5. Add documentation in docs/engines/
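A new engine might look like the skeleton below. The abstract method names (`initialize`, `run`) are illustrative assumptions; the authoritative interface is defined in src/engines/base.py.

```python
from abc import ABC, abstractmethod

class InferenceEngine(ABC):
    """Simplified stand-in for the base class in src/engines/base.py."""

    @abstractmethod
    def initialize(self, config: dict) -> None:
        """Set up the engine from its config section."""

    @abstractmethod
    def run(self, job: dict) -> dict:
        """Execute a job and return its result."""

class EchoEngine(InferenceEngine):
    """Toy engine that echoes its input; a real engine would call a backend."""

    def initialize(self, config: dict) -> None:
        self.config = config

    def run(self, job: dict) -> dict:
        return {"output": job["input_data"]}
```

After implementing the class, it still needs to be registered in `src/core/client.py:_create_engine()` and listed under engine.available before jobs can route to it.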

Troubleshooting

Engine not loading

  • Check engine is listed in engine.available
  • Verify engine dependencies are installed
  • Check engine-specific config is valid

Wrong engine used

  • Explicitly set platform field in job
  • Check model exists in expected engine

Out of memory

  • Limit engines to what you need
  • Use quantization for Transformers models
  • Enable auto_unload for dynamic memory management