Multi-Engine Architecture

The MicroDC Worker supports running multiple inference engines simultaneously, loading them on-demand based on job requirements.

Overview

Instead of a single static engine, the worker can be configured with multiple available engines. Each engine is loaded only when a job requests it, optimizing resource usage.

Configuration

Available Engines

Configure which engines this worker supports:

engine:
  available:
    - ollama
    - transformers

Or via environment variable:

export MICRODC_ENGINES="[ollama, transformers]"
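The bracketed, comma-separated format above might be parsed along these lines (a sketch only; the function name and exact parsing rules are assumptions, not the worker's actual config loader):

```python
import os

def parse_engine_list(raw: str) -> list[str]:
    """Parse a bracketed list like "[ollama, transformers]" into engine names.

    Illustrative helper, not part of the MicroDC Worker API.
    """
    return [name.strip() for name in raw.strip("[] ").split(",") if name.strip()]

# Fall back to a single-engine default when the variable is unset (assumed default).
engines = parse_engine_list(os.environ.get("MICRODC_ENGINES", "[ollama]"))
```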

Engine-Specific Configuration

Each engine has its own configuration section:

engine:
  available:
    - ollama
    - transformers

  ollama:
    base_url: http://localhost:11434
    timeout: 600

  transformers:
    model_path: ./models
    device: auto
    max_memory_mb: 0
    auto_unload: true

How It Works

1. Engine Discovery

At startup, the worker reads engine.available and stores the list of supported engine types. No engines are loaded yet.

2. Job Routing

When a job arrives, the worker checks the platform field:

{
  "model_id": "llama3.1:8b",
  "platform": "ollama",
  "job_type": "llm",
  "input_data": "Hello!"
}

3. On-Demand Loading

If the requested engine isn't loaded:

  1. Worker creates the engine instance
  2. Engine initializes with its config
  3. Engine is cached for future jobs
  4. Job is executed
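The load-and-cache flow above can be sketched as follows. This is a simplified illustration: `DummyEngine`, `get_engine`, and the `initialize` hook are hypothetical names; the real engine construction is dispatched in `src/core/client.py:_create_engine()`.

```python
class DummyEngine:
    """Stand-in for a real engine class (illustrative only)."""
    def __init__(self, platform: str):
        self.platform = platform
        self.initialized = False

    def initialize(self) -> None:
        # A real engine would read its own config section here.
        self.initialized = True

class Worker:
    def __init__(self, available: list[str]):
        self.available = available  # from engine.available; nothing is loaded yet
        self._engines = {}          # platform name -> initialized engine instance

    def get_engine(self, platform: str):
        if platform not in self.available:
            raise ValueError(f"engine '{platform}' is not in engine.available")
        if platform not in self._engines:    # on-demand load
            engine = DummyEngine(platform)   # real code goes through _create_engine()
            engine.initialize()              # engine initializes with its config
            self._engines[platform] = engine # cached for future jobs
        return self._engines[platform]
```

Because the cache is keyed by platform name, the initialization cost is paid once per engine per worker lifetime.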

4. Default Platform

If a job omits the platform field, the worker falls back to the first engine listed in engine.available.

Job Examples

Ollama Job

{
  "model_id": "llama3.1:8b",
  "platform": "ollama",
  "job_type": "llm",
  "input_data": "Explain quantum computing."
}

Transformers Job

{
  "model_id": "meta-llama/Llama-2-7b-chat-hf",
  "platform": "transformers",
  "job_type": "llm",
  "input_data": "Write a haiku about coding."
}

Embedding Job (Auto-Platform)

{
  "model_id": "nomic-embed-text",
  "job_type": "embed",
  "input_data": {"texts": ["Hello world"]}
}

Without a platform field, the worker uses the first available engine that has the model.

Heartbeat Reporting

The worker reports all available engines and their loaded models in heartbeats:

{
  "engines": ["ollama", "transformers"],
  "models": [
    {"id": "llama3.1:8b", "platform": "ollama"},
    {"id": "nomic-embed-text", "platform": "transformers"}
  ]
}
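A payload of that shape could be assembled roughly like this (a sketch; the input mapping of loaded models per engine is an assumption about worker internals):

```python
def build_heartbeat(loaded: dict[str, list[str]]) -> dict:
    """Build a heartbeat payload.

    `loaded` maps platform name -> model ids currently loaded on that engine
    (hypothetical internal state, shown for illustration).
    """
    return {
        "engines": list(loaded),  # all available engines
        "models": [
            {"id": model_id, "platform": platform}
            for platform, models in loaded.items()
            for model_id in models
        ],
    }
```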

Memory Management

Each engine manages its own memory independently:

  • Ollama: Managed by Ollama server
  • Transformers: LRU eviction with VRAM tracking

When multiple engines are loaded, their memory use is not coordinated, so monitor total GPU memory consumption across all engines.

Adding Custom Engines

To add a new engine:

  1. Create engine class inheriting from InferenceEngine
  2. Implement all abstract methods (see src/engines/base.py)
  3. Add configuration section to config/default.yaml
  4. Register in src/core/client.py:_create_engine()
  5. Add documentation in docs/engines/
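A new engine might look like the skeleton below. The abstract method names (`initialize`, `run`) are illustrative assumptions; the authoritative interface is defined in src/engines/base.py.

```python
from abc import ABC, abstractmethod

class InferenceEngine(ABC):
    """Simplified stand-in for the base class in src/engines/base.py."""

    @abstractmethod
    def initialize(self, config: dict) -> None:
        """Set up the engine from its config section."""

    @abstractmethod
    def run(self, job: dict) -> dict:
        """Execute a job and return its result."""

class EchoEngine(InferenceEngine):
    """Toy engine that echoes its input; a real engine would call a backend."""

    def initialize(self, config: dict) -> None:
        self.config = config

    def run(self, job: dict) -> dict:
        return {"output": job["input_data"]}
```

After implementing the class, it still needs to be registered in `src/core/client.py:_create_engine()` and listed under engine.available before jobs can route to it.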

Troubleshooting

Engine not loading

  • Check engine is listed in engine.available
  • Verify engine dependencies are installed
  • Check engine-specific config is valid

Wrong engine used

  • Explicitly set platform field in job
  • Check model exists in expected engine

Out of memory

  • Limit engines to what you need
  • Use quantization for Transformers models
  • Enable auto_unload for dynamic memory management