Transformers Engine¶
The Transformers engine provides native HuggingFace Transformers inference for local model files.
Status¶
Production - Fully implemented and tested.
Features¶
- Text generation models (CausalLM, Seq2SeqLM)
- Embedding models (BERT, RoBERTa, sentence-transformers)
- Multimodal models (LLaVA, Qwen-VL)
- Dynamic VRAM management with LRU eviction
- bitsandbytes 4-bit/8-bit quantization
- HuggingFace Hub download with allowlist/blocklist
- Streaming text generation
- Auto device selection (CUDA, MPS, CPU)
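The auto device selection above can be sketched as follows; `pick_device` is a hypothetical helper for illustration, not the engine's actual code:

```python
def pick_device() -> str:
    """Choose the best available torch device, falling back to CPU."""
    try:
        import torch  # optional here: without torch we can only run on CPU
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

With `device: auto`, the engine resolves the device once at load time; setting `cuda`, `mps`, or `cpu` explicitly skips the probing.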
Requirements¶
Installing the engine pulls in the following dependencies:
- transformers>=4.40.0
- torch>=2.0.0
- accelerate>=0.27.0
- safetensors>=0.4.0
- bitsandbytes>=0.41.0
- Pillow>=10.0.0
- huggingface_hub>=0.20.0
Configuration¶
```yaml
engine:
  available:
    - transformers

  transformers:
    # Local model storage directory
    model_path: ${TRANSFORMERS_MODEL_PATH:-./models}

    # Device configuration
    device: ${TRANSFORMERS_DEVICE:-auto}       # auto, cuda, mps, cpu
    torch_dtype: ${TRANSFORMERS_DTYPE:-auto}   # auto, float16, bfloat16, float32

    # Memory management
    max_memory_mb: ${TRANSFORMERS_MAX_MEMORY:-0}    # 0 = use available VRAM
    auto_unload: ${TRANSFORMERS_AUTO_UNLOAD:-true}  # LRU unload when low memory

    # Model loading options
    trust_remote_code: ${TRANSFORMERS_TRUST_REMOTE_CODE:-false}

    # Quantization (bitsandbytes)
    default_quantization: ${TRANSFORMERS_QUANTIZATION:-}  # int4, int8, or empty

    # HuggingFace Hub download control
    hub_download:
      enabled: ${TRANSFORMERS_HUB_ENABLED:-true}
      allowed_models: []  # e.g., ["meta-llama/*", "mistralai/*"]
      blocked_models: []  # e.g., ["dangerous-org/*"]
      cache_dir: ${HF_HOME:-}
```
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| `TRANSFORMERS_MODEL_PATH` | `./models` | Local model directory |
| `TRANSFORMERS_DEVICE` | `auto` | Device (auto, cuda, mps, cpu) |
| `TRANSFORMERS_DTYPE` | `auto` | Data type (auto, float16, bfloat16, float32) |
| `TRANSFORMERS_MAX_MEMORY` | `0` | Max VRAM in MB (0 = auto) |
| `TRANSFORMERS_AUTO_UNLOAD` | `true` | Enable LRU model unloading |
| `TRANSFORMERS_TRUST_REMOTE_CODE` | `false` | Allow custom code in models |
| `TRANSFORMERS_QUANTIZATION` | (empty) | Default quantization (int4, int8) |
| `TRANSFORMERS_HUB_ENABLED` | `true` | Enable Hub downloads |
| `HF_HOME` | `~/.cache/huggingface` | HuggingFace cache directory |
Model Discovery¶
Models are discovered from the local `model_path` directory. Each subdirectory containing a `config.json` file is recognized as a model.
```
models/
  llama-2-7b-chat/
    config.json
    model.safetensors
    tokenizer.json
  nomic-embed-text/
    config.json
    model.safetensors
```
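A directory scan like the one above can be sketched in a few lines; `discover_models` is a hypothetical name, not the engine's API:

```python
from pathlib import Path

def discover_models(model_path: str) -> list[str]:
    """Return names of subdirectories that contain a config.json."""
    root = Path(model_path)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir()
                  if p.is_dir() and (p / "config.json").is_file())
```

Directories without a `config.json` (e.g. half-downloaded models) are simply skipped.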
Model type is auto-detected from the `architectures` field in `config.json`:
- Embedding: `BertModel`, `RobertaModel`, `NomicBertModel`
- Multimodal: `LlavaForConditionalGeneration`, `Qwen2VLForConditionalGeneration`
- Text: default (`CausalLM`, `Seq2SeqLM`)
Hub Download Control¶
Control which models can be downloaded from HuggingFace Hub:
```yaml
hub_download:
  enabled: true
  allowed_models:
    - "meta-llama/*"            # All models from meta-llama org
    - "mistralai/Mistral-7B*"   # Specific models
    - "sentence-transformers/*"
  blocked_models:
    - "untrusted-org/*"         # Block entire org
```
If `allowed_models` is empty, all non-blocked models are allowed.

Note: Allowlist matching is case-insensitive. For example, `sentence-transformers/all-MiniLM-L6-v2` will match jobs requesting `sentence-transformers/all-minilm-l6-v2`.
Models in the allowlist are automatically advertised as available, even before being downloaded. When a job requests such a model, it will be downloaded on-demand.
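The allowlist/blocklist semantics described above can be sketched with stdlib glob matching; `is_model_allowed` is a hypothetical helper, not the engine's actual function:

```python
from fnmatch import fnmatchcase

def is_model_allowed(model_id: str, allowed: list, blocked: list) -> bool:
    """Blocklist wins; empty allowlist allows everything not blocked.

    Both sides are lowercased, making the glob match case-insensitive.
    """
    mid = model_id.lower()
    if any(fnmatchcase(mid, pat.lower()) for pat in blocked):
        return False
    if not allowed:
        return True
    return any(fnmatchcase(mid, pat.lower()) for pat in allowed)
```

`fnmatchcase` is used instead of `fnmatch` so case handling does not vary by platform.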
Quantization¶
Supports bitsandbytes quantization for reduced memory usage:
- `int4`: 4-bit quantization (NF4 format with double quantization)
- `int8`: 8-bit quantization

Set `default_quantization` in the engine config, or per model via the `quantization_config` in the model's `config.json`.
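As a sketch, the setting could translate to bitsandbytes-style options roughly as follows; the keyword names mirror transformers' `BitsAndBytesConfig`, but the helper itself is illustrative, not the engine's contract:

```python
def quantization_kwargs(mode):
    """Map the engine's int4/int8 setting to bitsandbytes-style options."""
    if mode == "int4":
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",       # NF4 format
            "bnb_4bit_use_double_quant": True,  # double quantization
        }
    if mode == "int8":
        return {"load_in_8bit": True}
    return {}  # empty setting: load at full precision
```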
Memory Management¶
The engine supports multiple loaded models with automatic LRU eviction:
- When a model is requested, the engine checks available VRAM
- If VRAM is insufficient, the least-recently-used models are unloaded
- The requested model is loaded and added to the cache
- The model's `last_used_at` timestamp is updated on each use

Disable auto-unloading with `auto_unload: false`; the engine will then error on insufficient memory.
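The eviction flow above can be sketched with an `OrderedDict`; model sizes here are plain numbers standing in for real weights, and the class is a simplified stand-in for the engine's cache:

```python
from collections import OrderedDict

class ModelCache:
    """LRU model cache: evicts least-recently-used entries to fit new loads."""

    def __init__(self, max_memory_mb: int):
        self.max_memory_mb = max_memory_mb
        self._cache = OrderedDict()  # model_id -> size_mb, oldest first

    def used_mb(self) -> int:
        return sum(self._cache.values())

    def get_or_load(self, model_id: str, size_mb: int) -> str:
        if model_id in self._cache:
            self._cache.move_to_end(model_id)  # mark as most recently used
            return model_id
        # Evict LRU entries until the new model fits
        while self._cache and self.used_mb() + size_mb > self.max_memory_mb:
            self._cache.popitem(last=False)
        self._cache[model_id] = size_mb
        return model_id
```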
Job Examples¶
Text Generation:
```json
{
  "model_id": "llama-2-7b-chat",
  "platform": "transformers",
  "job_type": "llm",
  "input_data": "Write a haiku about coding."
}
```
Chat Format:
```json
{
  "model_id": "llama-2-7b-chat",
  "platform": "transformers",
  "job_type": "llm",
  "llm_interaction_type": "chat",
  "input_data": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
  ]
}
```
Embedding:
```json
{
  "model_id": "nomic-embed-text",
  "platform": "transformers",
  "job_type": "embed",
  "input_data": {"texts": ["Hello world"]}
}
```
Parameter Mapping¶
| Generic | Transformers | Description |
|---|---|---|
| `max_tokens` | `max_new_tokens` | Maximum tokens to generate |
| `temperature` | `temperature` | Sampling temperature |
| `top_p` | `top_p` | Nucleus sampling |
| `top_k` | `top_k` | Top-k sampling |
| `repetition_penalty` | `repetition_penalty` | Repetition penalty |
| `seed` | `torch.manual_seed()` | Random seed |
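The mapping can be expressed as a small translation step; the name pairs come from the table above, while the helper itself is hypothetical:

```python
# Generic job parameter -> transformers generate() keyword
PARAM_MAP = {
    "max_tokens": "max_new_tokens",
    "temperature": "temperature",
    "top_p": "top_p",
    "top_k": "top_k",
    "repetition_penalty": "repetition_penalty",
}

def to_generate_kwargs(params: dict) -> dict:
    """Translate generic job parameters into generate() kwargs.

    `seed` is handled separately via torch.manual_seed(), so it is
    dropped here along with any other unrecognized keys.
    """
    return {PARAM_MAP[k]: v for k, v in params.items() if k in PARAM_MAP}
```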
Implementation Files¶
- `src/engines/transformers_engine.py` - TransformersEngine class (~1000 lines)
- `tests/test_transformers_engine.py` - Test suite (47 tests, 66% coverage)
Troubleshooting¶
CUDA out of memory¶
- Enable quantization: `default_quantization: int4`
- Enable `auto_unload` to free memory from unused models
- Reduce `max_memory_mb` if running other GPU processes
Model not found¶
- Check that the `model_path` directory contains a model with a `config.json`
- If using the Hub: ensure the model is in the `allowed_models` list
Slow loading¶
- First load downloads model weights (subsequent loads use cache)
- Use quantized models for faster loading and less memory
MPS (Apple Silicon) issues¶
- Some models may not support MPS; set `device: cpu` as a fallback
- bitsandbytes quantization is not supported on MPS