vLLM Engine

The vLLM engine provides high-performance LLM inference backed by the vLLM serving library.

Status

Planned - Not yet implemented.

Planned Features

  • PagedAttention for efficient memory management
  • Continuous batching for high throughput
  • Tensor parallelism for multi-GPU inference
  • OpenAI-compatible API
  • Quantization support (AWQ, GPTQ, SqueezeLLM)
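To make the continuous-batching item above concrete, here is a toy scheduler sketch. It is not vLLM's actual scheduler; it only illustrates the idea that finished requests leave the batch and waiting requests join immediately, so decode steps are never wasted the way they are in static batching. All names are hypothetical.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy illustration of continuous batching: each step decodes one
    token for every active request; completed requests free their slot
    right away and waiting requests fill it on the next step.
    `requests` is a list of (request_id, tokens_to_generate) pairs."""
    waiting = deque(requests)
    active = {}       # request_id -> tokens still to generate
    completed = []
    steps = 0
    while waiting or active:
        # Admit new requests as soon as a slot frees up.
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        # One decode step for the whole batch.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                completed.append(rid)
    return steps, completed
```

With a batch size of 2 and 13 total tokens to generate, the scheduler finishes in 7 steps, the theoretical minimum; a static batcher would idle slots while waiting for the longest request in each batch.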

Why vLLM?

vLLM offers significant performance improvements over standard HuggingFace inference:

  • 2-4x higher throughput with PagedAttention
  • Better GPU utilization with continuous batching
  • Lower latency for concurrent requests

Planned Configuration

engine:
  available:
    - vllm

  vllm:
    model_path: ${VLLM_MODEL_PATH:-./models}
    tensor_parallel_size: ${VLLM_TP_SIZE:-1}
    gpu_memory_utilization: ${VLLM_GPU_UTIL:-0.9}
    max_model_len: ${VLLM_MAX_LEN:-4096}
    quantization: ${VLLM_QUANTIZATION:-}  # awq, gptq, squeezellm
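The `${VAR:-default}` placeholders in the configuration above follow shell-style defaulting: use the environment variable if it is set and non-empty, otherwise fall back to the default. A minimal sketch of how a config loader could resolve them (the resolver itself is hypothetical, not part of the codebase):

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}.
_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def resolve_env(value: str) -> str:
    """Expand shell-style placeholders in a config string. Mirrors the
    shell's ':-' semantics: the default applies when the variable is
    unset or empty; with no default, the expansion is an empty string."""
    def repl(match):
        var, default = match.group(1), match.group(2) or ""
        return os.environ.get(var) or default
    return _PLACEHOLDER.sub(repl, value)
```

For example, `resolve_env("${VLLM_TP_SIZE:-1}")` yields `"1"` when `VLLM_TP_SIZE` is unset and the variable's value otherwise, and `${VLLM_QUANTIZATION:-}` resolves to an empty string, meaning no quantization.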

Implementation Notes

When implemented, the vLLM engine will:

  1. Use vLLM's AsyncLLMEngine for async inference
  2. Support both local and Hub models
  3. Integrate with existing multi-engine architecture
  4. Report memory usage for heartbeat

Contributing

If you'd like to help implement the vLLM engine, see:

  • src/engines/base.py - InferenceEngine interface
  • src/engines/transformers_engine.py - Reference implementation
  • docs/engines/README.md - Adding new engines