Transformers Engine¶
The Transformers engine provides native HuggingFace Transformers inference for local model files.
Status¶
Production - Fully implemented and tested.
Features¶
- Text generation models (CausalLM, Seq2SeqLM)
- Embedding models (BERT, RoBERTa, sentence-transformers)
- Multimodal models (LLaVA, Qwen-VL)
- Dynamic VRAM management with LRU eviction
- bitsandbytes 4-bit/8-bit quantization
- HuggingFace Hub download with allowlist/blocklist
- Streaming text generation
- Auto device selection (CUDA, MPS, CPU)
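The auto device selection above can be sketched as follows; `pick_device` is a hypothetical helper for illustration, not the engine's actual code:

```python
def pick_device() -> str:
    """Choose the best available torch device, falling back to CPU."""
    try:
        import torch  # optional here: without torch we can only run on CPU
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

With `device: auto`, the engine resolves the device once at load time; setting `cuda`, `mps`, or `cpu` explicitly skips the probing.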
Requirements¶
Installing the engine pulls in the following dependencies:
- transformers>=4.40.0
- torch>=2.0.0
- accelerate>=0.27.0
- safetensors>=0.4.0
- bitsandbytes>=0.41.0
- Pillow>=10.0.0
- huggingface_hub>=0.20.0
Configuration¶
```yaml
engine:
  available:
    - transformers

  transformers:
    # Local model storage directory
    model_path: ${TRANSFORMERS_MODEL_PATH:-./models}

    # Device configuration
    device: ${TRANSFORMERS_DEVICE:-auto}       # auto, cuda, mps, cpu
    torch_dtype: ${TRANSFORMERS_DTYPE:-auto}   # auto, float16, bfloat16, float32

    # Memory management
    max_memory_mb: ${TRANSFORMERS_MAX_MEMORY:-0}    # 0 = use available VRAM
    auto_unload: ${TRANSFORMERS_AUTO_UNLOAD:-true}  # LRU unload when low memory

    # Model loading options
    trust_remote_code: ${TRANSFORMERS_TRUST_REMOTE_CODE:-false}

    # Quantization (bitsandbytes)
    default_quantization: ${TRANSFORMERS_QUANTIZATION:-}  # int4, int8, or empty

    # HuggingFace Hub download control
    hub_download:
      enabled: ${TRANSFORMERS_HUB_ENABLED:-true}
      allowed_models: []  # e.g., ["meta-llama/*", "mistralai/*"]
      blocked_models: []  # e.g., ["dangerous-org/*"]
      cache_dir: ${HF_HOME:-}
```
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| `TRANSFORMERS_MODEL_PATH` | `./models` | Local model directory |
| `TRANSFORMERS_DEVICE` | `auto` | Device (auto, cuda, mps, cpu) |
| `TRANSFORMERS_DTYPE` | `auto` | Data type (auto, float16, bfloat16, float32) |
| `TRANSFORMERS_MAX_MEMORY` | `0` | Max VRAM in MB (0 = auto) |
| `TRANSFORMERS_AUTO_UNLOAD` | `true` | Enable LRU model unloading |
| `TRANSFORMERS_TRUST_REMOTE_CODE` | `false` | Allow custom code in models |
| `TRANSFORMERS_QUANTIZATION` | (empty) | Default quantization (int4, int8) |
| `TRANSFORMERS_HUB_ENABLED` | `true` | Enable Hub downloads |
| `HF_HOME` | `~/.cache/huggingface` | HuggingFace cache directory |
Model Discovery¶
Models are discovered from the local `model_path` directory. Each subdirectory containing a `config.json` file is recognized as a model.
```
models/
  llama-2-7b-chat/
    config.json
    model.safetensors
    tokenizer.json
  nomic-embed-text/
    config.json
    model.safetensors
```
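A directory scan like the one above can be sketched in a few lines; `discover_models` is a hypothetical name, not the engine's API:

```python
from pathlib import Path

def discover_models(model_path: str) -> list[str]:
    """Return names of subdirectories that contain a config.json."""
    root = Path(model_path)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir()
                  if p.is_dir() and (p / "config.json").is_file())
```

Directories without a `config.json` (e.g. half-downloaded models) are simply skipped.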
Model type is auto-detected from the `architectures` field in `config.json`:
- Embedding: `BertModel`, `RobertaModel`, `NomicBertModel`
- Multimodal: `LlavaForConditionalGeneration`, `Qwen2VLForConditionalGeneration`
- Text: default (`CausalLM`, `Seq2SeqLM`)
Hub Download Control¶
Control which models can be downloaded from HuggingFace Hub:
```yaml
hub_download:
  enabled: true
  allowed_models:
    - "meta-llama/*"            # All models from meta-llama org
    - "mistralai/Mistral-7B*"   # Specific models
    - "sentence-transformers/*"
  blocked_models:
    - "untrusted-org/*"         # Block entire org
```
If `allowed_models` is empty, all non-blocked models are allowed.

Note: Allowlist matching is case-insensitive. For example, `sentence-transformers/all-MiniLM-L6-v2` will match jobs requesting `sentence-transformers/all-minilm-l6-v2`.
Models in the allowlist are automatically advertised as available, even before being downloaded. When a job requests such a model, it will be downloaded on-demand.
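The allowlist/blocklist semantics described above can be sketched with stdlib glob matching; `is_model_allowed` is a hypothetical helper, not the engine's actual function:

```python
from fnmatch import fnmatchcase

def is_model_allowed(model_id: str, allowed: list, blocked: list) -> bool:
    """Blocklist wins; empty allowlist allows everything not blocked.

    Both sides are lowercased, making the glob match case-insensitive.
    """
    mid = model_id.lower()
    if any(fnmatchcase(mid, pat.lower()) for pat in blocked):
        return False
    if not allowed:
        return True
    return any(fnmatchcase(mid, pat.lower()) for pat in allowed)
```

`fnmatchcase` is used instead of `fnmatch` so case handling does not vary by platform.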
Quantization¶
Supports bitsandbytes quantization for reduced memory usage:
- `int4`: 4-bit quantization (NF4 format with double quantization)
- `int8`: 8-bit quantization

Set `default_quantization` in the engine config, or per model via the `quantization_config` in the model's `config.json`.
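As a sketch, the setting could translate to bitsandbytes-style options roughly as follows; the keyword names mirror transformers' `BitsAndBytesConfig`, but the helper itself is illustrative, not the engine's contract:

```python
def quantization_kwargs(mode):
    """Map the engine's int4/int8 setting to bitsandbytes-style options."""
    if mode == "int4":
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",       # NF4 format
            "bnb_4bit_use_double_quant": True,  # double quantization
        }
    if mode == "int8":
        return {"load_in_8bit": True}
    return {}  # empty setting: load at full precision
```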
Memory Management¶
The engine supports multiple loaded models with automatic LRU eviction:
- When a model is requested, the engine checks available VRAM
- If VRAM is insufficient, the least-recently-used models are unloaded
- The requested model is loaded and added to the cache
- The model's `last_used_at` timestamp is updated on each use

Disable auto-unloading with `auto_unload: false`; the engine will then error on insufficient memory.
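The eviction flow above can be sketched with an `OrderedDict`; model sizes here are plain numbers standing in for real weights, and the class is a simplified stand-in for the engine's cache:

```python
from collections import OrderedDict

class ModelCache:
    """LRU model cache: evicts least-recently-used entries to fit new loads."""

    def __init__(self, max_memory_mb: int):
        self.max_memory_mb = max_memory_mb
        self._cache = OrderedDict()  # model_id -> size_mb, oldest first

    def used_mb(self) -> int:
        return sum(self._cache.values())

    def get_or_load(self, model_id: str, size_mb: int) -> str:
        if model_id in self._cache:
            self._cache.move_to_end(model_id)  # mark as most recently used
            return model_id
        # Evict LRU entries until the new model fits
        while self._cache and self.used_mb() + size_mb > self.max_memory_mb:
            self._cache.popitem(last=False)
        self._cache[model_id] = size_mb
        return model_id
```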
Job Examples¶
Text Generation:
```json
{
  "model_id": "llama-2-7b-chat",
  "platform": "transformers",
  "job_type": "llm",
  "input_data": "Write a haiku about coding."
}
```
Chat Format:
```json
{
  "model_id": "llama-2-7b-chat",
  "platform": "transformers",
  "job_type": "llm",
  "llm_interaction_type": "chat",
  "input_data": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
  ]
}
```
Embedding:
```json
{
  "model_id": "nomic-embed-text",
  "platform": "transformers",
  "job_type": "embed",
  "input_data": {"texts": ["Hello world"]}
}
```
Parameter Mapping¶
| Generic | Transformers | Description |
|---|---|---|
| `max_tokens` | `max_new_tokens` | Maximum tokens to generate |
| `temperature` | `temperature` | Sampling temperature |
| `top_p` | `top_p` | Nucleus sampling |
| `top_k` | `top_k` | Top-k sampling |
| `repetition_penalty` | `repetition_penalty` | Repetition penalty |
| `seed` | `torch.manual_seed()` | Random seed |
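The mapping can be expressed as a small translation step; the name pairs come from the table above, while the helper itself is hypothetical:

```python
# Generic job parameter -> transformers generate() keyword
PARAM_MAP = {
    "max_tokens": "max_new_tokens",
    "temperature": "temperature",
    "top_p": "top_p",
    "top_k": "top_k",
    "repetition_penalty": "repetition_penalty",
}

def to_generate_kwargs(params: dict) -> dict:
    """Translate generic job parameters into generate() kwargs.

    `seed` is handled separately via torch.manual_seed(), so it is
    dropped here along with any other unrecognized keys.
    """
    return {PARAM_MAP[k]: v for k, v in params.items() if k in PARAM_MAP}
```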
Implementation Files¶
- `src/engines/transformers_engine.py` - TransformersEngine class (~1000 lines)
- `tests/test_transformers_engine.py` - Test suite (47 tests, 66% coverage)
Troubleshooting¶
CUDA out of memory¶
- Enable quantization: `default_quantization: int4`
- Enable `auto_unload` to free memory from unused models
- Reduce `max_memory_mb` if running other GPU processes
Model not found¶
- Check that the `model_path` directory contains a model with a `config.json`
- If using the Hub: ensure the model is in the `allowed_models` list
Slow loading¶
- First load downloads model weights (subsequent loads use cache)
- Use quantized models for faster loading and less memory
MPS (Apple Silicon) issues¶
- Some models may not support MPS; set `device: cpu` as a fallback
- bitsandbytes quantization is not supported on MPS