vLLM Engine

The vLLM engine provides high-performance LLM inference backed by the vLLM serving library.

Status

Planned - Not yet implemented.

Planned Features

  • PagedAttention for efficient memory management
  • Continuous batching for high throughput
  • Tensor parallelism for multi-GPU inference
  • OpenAI-compatible API
  • Quantization support (AWQ, GPTQ, SqueezeLLM)
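To make the continuous-batching item above concrete, here is a toy scheduler sketch. It is not vLLM's actual scheduler; it only illustrates the idea that finished requests leave the batch and waiting requests join immediately, so decode steps are never wasted the way they are in static batching. All names are hypothetical.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy illustration of continuous batching: each step decodes one
    token for every active request; completed requests free their slot
    right away and waiting requests fill it on the next step.
    `requests` is a list of (request_id, tokens_to_generate) pairs."""
    waiting = deque(requests)
    active = {}       # request_id -> tokens still to generate
    completed = []
    steps = 0
    while waiting or active:
        # Admit new requests as soon as a slot frees up.
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        # One decode step for the whole batch.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                completed.append(rid)
    return steps, completed
```

With a batch size of 2 and 13 total tokens to generate, the scheduler finishes in 7 steps, the theoretical minimum; a static batcher would idle slots while waiting for the longest request in each batch.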

Why vLLM?

vLLM offers significant performance improvements over standard HuggingFace inference:

  • 2-4x higher throughput with PagedAttention
  • Better GPU utilization with continuous batching
  • Lower latency for concurrent requests

Planned Configuration

engine:
  available:
    - vllm

  vllm:
    model_path: ${VLLM_MODEL_PATH:-./models}
    tensor_parallel_size: ${VLLM_TP_SIZE:-1}
    gpu_memory_utilization: ${VLLM_GPU_UTIL:-0.9}
    max_model_len: ${VLLM_MAX_LEN:-4096}
    quantization: ${VLLM_QUANTIZATION:-}  # awq, gptq, squeezellm
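The `${VAR:-default}` placeholders in the configuration above follow shell-style defaulting: use the environment variable if it is set and non-empty, otherwise fall back to the default. A minimal sketch of how a config loader could resolve them (the resolver itself is hypothetical, not part of the codebase):

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}.
_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def resolve_env(value: str) -> str:
    """Expand shell-style placeholders in a config string. Mirrors the
    shell's ':-' semantics: the default applies when the variable is
    unset or empty; with no default, the expansion is an empty string."""
    def repl(match):
        var, default = match.group(1), match.group(2) or ""
        return os.environ.get(var) or default
    return _PLACEHOLDER.sub(repl, value)
```

For example, `resolve_env("${VLLM_TP_SIZE:-1}")` yields `"1"` when `VLLM_TP_SIZE` is unset and the variable's value otherwise, and `${VLLM_QUANTIZATION:-}` resolves to an empty string, meaning no quantization.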

Implementation Notes

When implemented, the vLLM engine will:

  1. Use vLLM's AsyncLLMEngine for async inference
  2. Support both local and Hub models
  3. Integrate with existing multi-engine architecture
  4. Report memory usage for heartbeat

Contributing

If you'd like to help implement the vLLM engine, see:

  • src/engines/base.py - InferenceEngine interface
  • src/engines/transformers_engine.py - Reference implementation
  • docs/engines/README.md - Adding new engines