Hardware Acceleration Guide for STING-CE
Overview
STING-CE supports hardware acceleration for faster LLM inference using:
- MPS (Metal Performance Shaders) on Apple Silicon Macs
- CUDA on NVIDIA GPUs
- CPU optimizations for systems without GPU
For basic installation and system requirements, see the STING Platform Installation Guide. For model-specific setup, see the Ollama Model Setup Guide.
Current Status
Docker Limitations
Currently, Docker containers cannot access Mac GPUs (MPS) because Docker's virtualization layer does not expose them. When running in Docker, the LLM service falls back to CPU.
Native Execution for Mac GPU
To use Apple Silicon GPU acceleration, run the LLM service natively:
# Use the provided script
./run_native_mps.sh
# Or manually:
export TORCH_DEVICE=auto
export PERFORMANCE_PROFILE=gpu_accelerated
cd llm_service
python3 server.py
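When TORCH_DEVICE=auto is set, the service picks the best available backend at startup. The snippet below is a minimal sketch of that selection logic, assuming a resolve_device() helper that is not necessarily how server.py implements it:
# Minimal sketch of auto device selection (illustrative; server.py may differ)
import os
import torch

def resolve_device() -> torch.device:
    """Honor an explicit TORCH_DEVICE override, otherwise prefer MPS, then CUDA."""
    requested = os.environ.get("TORCH_DEVICE", "auto").lower()
    if requested != "auto":
        return torch.device(requested)      # e.g. "mps", "cuda", "cpu"
    if torch.backends.mps.is_available():
        return torch.device("mps")          # Apple Silicon GPU
    if torch.cuda.is_available():
        return torch.device("cuda")         # NVIDIA GPU
    return torch.device("cpu")              # fallback

print(f"Using device: {resolve_device()}")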
Performance Comparison
| Configuration | Load Time | Inference Speed | Memory Usage |
|---|---|---|---|
| CPU (Docker) | ~2.5 min | ~30s/response | 30GB |
| MPS (Native) | ~30s | ~2s/response | 16GB |
| CUDA | ~20s | ~1s/response | 12GB |
Setup Instructions
1. Apple Silicon Mac (M1/M2/M3)
Requirements:
- macOS 12.0+
- Python 3.9+
- PyTorch 2.0+ with MPS support
Installation:
# Install PyTorch with MPS support
pip3 install torch torchvision torchaudio
# Verify MPS availability
python3 -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
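Beyond the availability check, a quick smoke test confirms that tensors actually execute on the Apple GPU (generic PyTorch, not specific to STING-CE):
# MPS smoke test: allocate a tensor on the GPU and run one matrix multiply
import torch

assert torch.backends.mps.is_available(), "MPS backend not available"
x = torch.randn(1024, 1024, device="mps")
y = x @ x                     # runs on the Apple GPU
torch.mps.synchronize()       # wait for the kernel to complete
print("MPS matmul OK:", y.shape)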
Running:
# Stop Docker LLM service
docker compose stop llm-gateway
# Run native service
./run_native_mps.sh
2. NVIDIA GPU (Linux/Windows)
Requirements:
- CUDA 11.8+
- NVIDIA Driver 450+
- nvidia-docker2
Docker Configuration:
llm-gateway:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
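After restarting the stack, you can confirm that PyTorch sees the GPU inside the container (for example via docker compose exec llm-gateway python3 ...). This is a generic PyTorch check and assumes PyTorch is installed in the image:
# Generic CUDA visibility check (run inside the llm-gateway container)
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA runtime:", torch.version.cuda)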
3. CPU Optimization
For systems without a GPU, optimize CPU performance with the following settings:
llm-gateway:
  environment:
    - PERFORMANCE_PROFILE=cpu_optimized
    - OMP_NUM_THREADS=8  # Adjust based on CPU cores
    - MKL_NUM_THREADS=8
    - QUANTIZATION=int8  # Reduce memory usage
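How these variables translate into PyTorch calls depends on the service code; the sketch below shows one plausible mapping for the thread settings (QUANTIZATION is covered under Troubleshooting: High Memory Usage):
# Sketch: applying the CPU-profile thread settings in Python (illustrative)
import os
import torch

threads = int(os.environ.get("OMP_NUM_THREADS", os.cpu_count() or 1))
torch.set_num_threads(threads)                        # intra-op parallelism
# Inter-op threads must be set before any parallel work starts
torch.set_num_interop_threads(max(1, threads // 2))
print("CPU threads:", torch.get_num_threads())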
Troubleshooting
MPS Not Detected
- Check PyTorch version:
pip3 show torch | grep Version
# Should be 2.0+
- Verify MPS support:
import torch
print(torch.backends.mps.is_available())
print(torch.backends.mps.is_built())
- Update PyTorch:
pip3 install --upgrade torch torchvision
High Memory Usage
- Enable quantization (a sketch of what this does follows this list):
export QUANTIZATION=int8
- Use smaller models:
export MODEL_NAME=phi3 # 3.8B params
- Reduce batch size:
export BATCH_SIZE=1
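The likely effect of QUANTIZATION=int8 is dynamic int8 quantization of the model's linear layers. A minimal sketch using PyTorch's built-in API with a stand-in model (the service's actual quantization path may differ):
# Dynamic int8 quantization sketch (stand-in model, illustrative only)
import torch
from torch.ao.quantization import quantize_dynamic

model = torch.nn.Sequential(       # placeholder for the loaded LLM
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)
quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(quantized)                   # Linear layers become DynamicQuantizedLinear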
Slow Inference
- Check device usage:
# In server.py logs
INFO:__main__:Using device: mps # Good
INFO:__main__:Using device: cpu # Slow
- Monitor GPU usage (a programmatic check follows this list):
# Mac
sudo powermetrics --samplers gpu_power -i1000 -n1
# NVIDIA
nvidia-smi
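GPU memory can also be checked programmatically from inside the service using standard PyTorch calls:
# Programmatic memory check (generic PyTorch; works for CUDA and MPS)
import torch

if torch.cuda.is_available():
    print(f"CUDA allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    print(f"CUDA reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
elif torch.backends.mps.is_available():
    print(f"MPS allocated:  {torch.mps.current_allocated_memory() / 2**30:.2f} GiB")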
Best Practices
- Development: Use Docker with CPU for consistency
- Production: Use native GPU execution for performance
- Testing: Profile both configurations
- Monitoring: Track GPU memory and utilization
Future Improvements
- Docker GPU Support: Waiting for Docker Desktop MPS passthrough
- Multi-GPU: Support for multiple GPUs
- Mixed Precision: FP16/BF16 for faster inference
- Dynamic Batching: Better throughput for multiple users
Performance Optimization Tips
For MPS (Apple Silicon)
# Enable MPS optimizations
export PYTORCH_ENABLE_MPS_FALLBACK=1
export TORCH_COMPILE_BACKEND=aot_eager
# Use appropriate precision
export TORCH_PRECISION=fp16 # Faster on MPS
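In plain PyTorch terms, fp16 inference on MPS amounts to moving the model and its inputs to half precision. A minimal sketch with a stand-in model:
# fp16 on MPS sketch (stand-in model; shows the generic PyTorch pattern)
import torch

model = torch.nn.Linear(4096, 4096).to("mps", dtype=torch.float16)
x = torch.randn(1, 4096, device="mps", dtype=torch.float16)
with torch.no_grad():
    out = model(x)
print(out.dtype)                   # torch.float16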
For CPU
# Enable all CPU optimizations (hw.ncpu is macOS; use $(nproc) on Linux)
export OMP_NUM_THREADS=$(sysctl -n hw.ncpu)
export MKL_NUM_THREADS=$(sysctl -n hw.ncpu)
export NUMEXPR_MAX_THREADS=$(sysctl -n hw.ncpu)
Memory Management
# Reduce memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Enable memory efficient attention
export TORCH_CUDNN_V8_API_ENABLED=1
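To see how the allocator is behaving after changing these settings, PyTorch's built-in memory summary can help (CUDA only):
# Inspect CUDA allocator statistics (there is no MPS equivalent of this summary)
import torch

if torch.cuda.is_available():
    print(torch.cuda.memory_summary(abbreviated=True))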