vLLM Engine¶
The vLLM engine will provide high-performance LLM inference through the vLLM library.
Status¶
Planned - Not yet implemented.
Planned Features¶
- PagedAttention for efficient memory management
- Continuous batching for high throughput
- Tensor parallelism for multi-GPU inference
- OpenAI-compatible API
- Quantization support (AWQ, GPTQ, SqueezeLLM)
Why vLLM?¶
vLLM offers significant performance improvements over standard HuggingFace inference:
- 2-4x higher throughput with PagedAttention
- Better GPU utilization with continuous batching
- Lower latency for concurrent requests
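As a rough illustration of the throughput benefit, the sketch below submits a whole batch of prompts in one call so vLLM can schedule them with continuous batching. This is a hypothetical example, not project code: the model name and sampling settings are placeholders, and it assumes vLLM is installed with a CUDA-capable GPU, so the vLLM import is deferred and the call is guarded.

```python
def run_batch(model: str, prompts: list[str]) -> list[str]:
    # Deferred import: vLLM (and a GPU) is only needed when actually running.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, gpu_memory_utilization=0.9)
    params = SamplingParams(temperature=0.8, max_tokens=64)
    # Submitting all prompts at once lets vLLM batch them continuously
    # instead of processing one request at a time.
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]

if __name__ == "__main__":
    # Placeholder model; any HF model id or local path works.
    print(run_batch("facebook/opt-125m", ["Hello,", "The capital of France is"]))
```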
Planned Configuration¶
```yaml
engine:
  available:
    - vllm

  vllm:
    model_path: ${VLLM_MODEL_PATH:-./models}
    tensor_parallel_size: ${VLLM_TP_SIZE:-1}
    gpu_memory_utilization: ${VLLM_GPU_UTIL:-0.9}
    max_model_len: ${VLLM_MAX_LEN:-4096}
    quantization: ${VLLM_QUANTIZATION:-}  # awq, gptq, squeezellm
```
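The `${VAR:-default}` placeholders follow shell-style substitution: use the environment variable if set, otherwise the default after `:-`. A minimal stdlib-only sketch of that expansion (the helper name and regex are illustrative, not part of the project):

```python
import os
import re

# Matches shell-style ${VAR:-default} placeholders.
_PLACEHOLDER = re.compile(r"\$\{(\w+):-([^}]*)\}")

def expand_env(value: str) -> str:
    """Replace each ${VAR:-default} with os.environ[VAR], or the default."""
    return _PLACEHOLDER.sub(
        lambda m: os.environ.get(m.group(1), m.group(2)), value
    )

# With VLLM_TP_SIZE unset in the environment, the default "1" is used.
print(expand_env("${VLLM_TP_SIZE:-1}"))
```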
Implementation Notes¶
When implemented, the vLLM engine will:
- Use vLLM's `AsyncLLMEngine` for async inference
- Support both local and Hub models
- Integrate with existing multi-engine architecture
- Report memory usage for heartbeat
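A possible shape for the engine, assuming the planned configuration keys above. The `InferenceEngine` interface in `src/engines/base.py` is not shown here, so the class and method names are assumptions; the vLLM import is deferred so the module can load without a GPU.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VLLMConfig:
    # Field names mirror the planned configuration keys above.
    model_path: str = "./models"
    tensor_parallel_size: int = 1
    gpu_memory_utilization: float = 0.9
    max_model_len: int = 4096
    quantization: Optional[str] = None  # "awq", "gptq", or "squeezellm"

class VLLMEngine:
    """Hypothetical wrapper; the real InferenceEngine interface in
    src/engines/base.py may define different method names."""

    def __init__(self, config: VLLMConfig) -> None:
        self.config = config
        self._engine = None  # created lazily, so vLLM is only required on start

    def start(self) -> None:
        # Deferred import: vLLM (and a GPU) is only needed when starting.
        from vllm.engine.arg_utils import AsyncEngineArgs
        from vllm.engine.async_llm_engine import AsyncLLMEngine

        args = AsyncEngineArgs(
            model=self.config.model_path,
            tensor_parallel_size=self.config.tensor_parallel_size,
            gpu_memory_utilization=self.config.gpu_memory_utilization,
            max_model_len=self.config.max_model_len,
            quantization=self.config.quantization,
        )
        self._engine = AsyncLLMEngine.from_engine_args(args)
```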
Contributing¶
If you'd like to help implement the vLLM engine, see:
- `src/engines/base.py` - `InferenceEngine` interface
- `src/engines/transformers_engine.py` - Reference implementation
- `docs/engines/README.md` - Adding new engines