Transformers Engine

The Transformers engine provides native HuggingFace Transformers inference for local model files.

Status

Production - Fully implemented and tested.

Features

  • Text generation models (CausalLM, Seq2SeqLM)
  • Embedding models (BERT, RoBERTa, sentence-transformers)
  • Multimodal models (LLaVA, Qwen-VL)
  • Dynamic VRAM management with LRU eviction
  • bitsandbytes 4-bit/8-bit quantization
  • HuggingFace Hub download with allowlist/blocklist
  • Streaming text generation
  • Auto device selection (CUDA, MPS, CPU)

Requirements

pip install -e ".[transformers]"

This installs:

  • transformers>=4.40.0
  • torch>=2.0.0
  • accelerate>=0.27.0
  • safetensors>=0.4.0
  • bitsandbytes>=0.41.0
  • Pillow>=10.0.0
  • huggingface_hub>=0.20.0

Configuration

engine:
  available:
    - transformers

  transformers:
    # Local model storage directory
    model_path: ${TRANSFORMERS_MODEL_PATH:-./models}

    # Device configuration
    device: ${TRANSFORMERS_DEVICE:-auto}  # auto, cuda, mps, cpu
    torch_dtype: ${TRANSFORMERS_DTYPE:-auto}  # auto, float16, bfloat16, float32

    # Memory management
    max_memory_mb: ${TRANSFORMERS_MAX_MEMORY:-0}  # 0 = use available VRAM
    auto_unload: ${TRANSFORMERS_AUTO_UNLOAD:-true}  # LRU unload when low memory

    # Model loading options
    trust_remote_code: ${TRANSFORMERS_TRUST_REMOTE_CODE:-false}

    # Quantization (bitsandbytes)
    default_quantization: ${TRANSFORMERS_QUANTIZATION:-}  # int4, int8, or empty

    # HuggingFace Hub download control
    hub_download:
      enabled: ${TRANSFORMERS_HUB_ENABLED:-true}
      allowed_models: []  # e.g., ["meta-llama/*", "mistralai/*"]
      blocked_models: []  # e.g., ["dangerous-org/*"]
      cache_dir: ${HF_HOME:-}

Environment Variables

Variable                        Default               Description
TRANSFORMERS_MODEL_PATH         ./models              Local model directory
TRANSFORMERS_DEVICE             auto                  Device (auto, cuda, mps, cpu)
TRANSFORMERS_DTYPE              auto                  Data type (auto, float16, bfloat16, float32)
TRANSFORMERS_MAX_MEMORY         0                     Max VRAM in MB (0 = auto)
TRANSFORMERS_AUTO_UNLOAD        true                  Enable LRU model unloading
TRANSFORMERS_TRUST_REMOTE_CODE  false                 Allow custom code in models
TRANSFORMERS_QUANTIZATION       (empty)               Default quantization (int4, int8)
TRANSFORMERS_HUB_ENABLED        true                  Enable Hub downloads
HF_HOME                         ~/.cache/huggingface  HuggingFace cache directory

Model Discovery

Models are discovered from the local model_path directory. Each subdirectory containing a config.json file is recognized as a model.

models/
  llama-2-7b-chat/
    config.json
    model.safetensors
    tokenizer.json
  nomic-embed-text/
    config.json
    model.safetensors

Model type is auto-detected from the architecture in config.json:

  • Embedding: BertModel, RobertaModel, NomicBertModel
  • Multimodal: LlavaForConditionalGeneration, Qwen2VLForConditionalGeneration
  • Text: Default (CausalLM, Seq2SeqLM)
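The detection logic above can be sketched roughly as follows. This is an illustrative reading of config.json, not the engine's exact implementation (which lives in src/engines/transformers_engine.py); the function name and architecture sets are assumptions based on the lists above.

```python
import json
from pathlib import Path

# Illustrative architecture sets, taken from the lists above; the
# engine's actual sets may be larger.
EMBEDDING_ARCHS = {"BertModel", "RobertaModel", "NomicBertModel"}
MULTIMODAL_ARCHS = {"LlavaForConditionalGeneration",
                    "Qwen2VLForConditionalGeneration"}

def detect_model_type(model_dir: str) -> str:
    """Classify a model directory as embedding, multimodal, or text
    based on the "architectures" field of its config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    archs = set(config.get("architectures", []))
    if archs & EMBEDDING_ARCHS:
        return "embedding"
    if archs & MULTIMODAL_ARCHS:
        return "multimodal"
    return "text"  # default: CausalLM / Seq2SeqLM
```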

Hub Download Control

Control which models can be downloaded from HuggingFace Hub:

hub_download:
  enabled: true
  allowed_models:
    - "meta-llama/*"           # All models from meta-llama org
    - "mistralai/Mistral-7B*"  # Specific models
    - "sentence-transformers/*"
  blocked_models:
    - "untrusted-org/*"        # Block entire org

If allowed_models is empty, all non-blocked models are allowed.

Note: Allowlist matching is case-insensitive. For example, sentence-transformers/all-MiniLM-L6-v2 will match jobs requesting sentence-transformers/all-minilm-l6-v2.

Models in the allowlist are automatically advertised as available, even before being downloaded. When a job requests such a model, it will be downloaded on-demand.
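The policy above (blocklist wins, empty allowlist permits everything, case-insensitive glob matching) can be sketched with stdlib fnmatch. The function name is hypothetical; it is not the engine's actual API.

```python
from fnmatch import fnmatch

def is_download_allowed(model_id: str,
                        allowed: list[str],
                        blocked: list[str]) -> bool:
    """Sketch of the Hub download policy: blocked patterns always win,
    an empty allowlist allows all non-blocked models, and matching is
    case-insensitive (both sides lowercased before globbing)."""
    mid = model_id.lower()
    if any(fnmatch(mid, pat.lower()) for pat in blocked):
        return False
    if not allowed:
        return True
    return any(fnmatch(mid, pat.lower()) for pat in allowed)
```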

Quantization

Supports bitsandbytes quantization for reduced memory usage:

  • int4: 4-bit quantization (NF4 format, double quantization)
  • int8: 8-bit quantization

Set default_quantization in the engine config, or per model via the quantization_config in the model's config.json.
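The two modes above correspond to standard bitsandbytes configurations in the Transformers library; a sketch of what the engine plausibly builds (the compute dtype here is an assumption):

```python
import torch
from transformers import BitsAndBytesConfig

# "int4" as described above: 4-bit NF4 with double quantization
bnb_int4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)

# "int8": plain 8-bit quantization
bnb_int8 = BitsAndBytesConfig(load_in_8bit=True)
```

Either config is passed as quantization_config to from_pretrained() when loading the model.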

Memory Management

The engine supports multiple loaded models with automatic LRU eviction:

  1. When a model is requested, the engine checks available VRAM
  2. If there is not enough, the least-recently-used models are unloaded
  3. The requested model is loaded and added to the cache
  4. Its last_used_at timestamp is updated on each use

Disable auto-unload with auto_unload: false; the engine then raises an error on insufficient memory instead of evicting.
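The eviction loop above can be sketched with an OrderedDict as a toy LRU cache. Class and method names are hypothetical, and sizes are plain numbers; the real engine accounts for actual VRAM.

```python
from collections import OrderedDict

class ModelCache:
    """Toy sketch of the LRU eviction described above."""

    def __init__(self, capacity_mb: int, auto_unload: bool = True):
        self.capacity_mb = capacity_mb
        self.auto_unload = auto_unload
        self._models = OrderedDict()  # model_id -> size_mb, oldest first

    def load(self, model_id: str, size_mb: int) -> None:
        if model_id in self._models:
            self._models.move_to_end(model_id)  # refresh last_used_at
            return
        # Evict least-recently-used models until the new one fits.
        while sum(self._models.values()) + size_mb > self.capacity_mb:
            if not self.auto_unload or not self._models:
                raise MemoryError(f"insufficient memory for {model_id}")
            self._models.popitem(last=False)  # drop the LRU model
        self._models[model_id] = size_mb
```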

Job Examples

Text Generation:

{
  "model_id": "llama-2-7b-chat",
  "platform": "transformers",
  "job_type": "llm",
  "input_data": "Write a haiku about coding."
}

Chat Format:

{
  "model_id": "llama-2-7b-chat",
  "platform": "transformers",
  "job_type": "llm",
  "llm_interaction_type": "chat",
  "input_data": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
  ]
}

Embedding:

{
  "model_id": "nomic-embed-text",
  "platform": "transformers",
  "job_type": "embed",
  "input_data": {"texts": ["Hello world"]}
}

Parameter Mapping

Generic             Transformers         Description
max_tokens          max_new_tokens       Maximum tokens to generate
temperature         temperature          Sampling temperature
top_p               top_p                Nucleus sampling
top_k               top_k                Top-k sampling
repetition_penalty  repetition_penalty   Repetition penalty
seed                torch.manual_seed()  Random seed
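The mapping in the table amounts to renaming keys before calling model.generate(); a minimal sketch (function and dict names are assumptions, and "seed" is applied separately via torch.manual_seed() rather than passed through):

```python
GENERIC_TO_TRANSFORMERS = {
    "max_tokens": "max_new_tokens",
    "temperature": "temperature",
    "top_p": "top_p",
    "top_k": "top_k",
    "repetition_penalty": "repetition_penalty",
}

def map_generation_params(params: dict) -> dict:
    """Translate generic job parameters into model.generate() kwargs
    according to the table above; unknown keys (including "seed") are
    dropped here and handled elsewhere."""
    return {GENERIC_TO_TRANSFORMERS[k]: v
            for k, v in params.items()
            if k in GENERIC_TO_TRANSFORMERS}
```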

Implementation Files

  • src/engines/transformers_engine.py - TransformersEngine class (~1000 lines)
  • tests/test_transformers_engine.py - Test suite (47 tests, 66% coverage)

Troubleshooting

CUDA out of memory

  • Enable quantization: default_quantization: int4
  • Enable auto_unload to free memory from unused models
  • Reduce max_memory_mb if running other GPU processes

Model not found

  • Check model_path directory contains model with config.json
  • If using Hub: ensure model is in allowed_models list

Slow loading

  • First load downloads model weights (subsequent loads use cache)
  • Use quantized models for faster loading and less memory

MPS (Apple Silicon) issues

  • Some models may not support MPS; set device: cpu as fallback
  • bitsandbytes quantization not supported on MPS