Hardware Acceleration Guide for STING-CE
Overview
STING-CE supports hardware acceleration for faster LLM inference using:
- MPS (Metal Performance Shaders) on Apple Silicon Macs
- CUDA on NVIDIA GPUs
- CPU optimizations for systems without GPU
For basic installation and system requirements, see the STING Platform Installation Guide. For model-specific setup, see the Ollama Model Setup Guide.
Current Status
Docker Limitations
Currently, Docker containers cannot access Mac GPUs (MPS) because Docker's virtualization layer does not expose them. When running in Docker, the LLM service falls back to CPU.
Native Execution for Mac GPU
To use Apple Silicon GPU acceleration, run the LLM service natively:
# Use the provided script
./run_native_mps.sh
# Or manually:
export TORCH_DEVICE=auto
export PERFORMANCE_PROFILE=gpu_accelerated
cd llm_service
python3 server.py
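When TORCH_DEVICE=auto is set, the service picks the best available backend at startup. The snippet below is a minimal sketch of that selection logic, assuming a resolve_device() helper that is not necessarily how server.py implements it:
# Minimal sketch of auto device selection (illustrative; server.py may differ)
import os
import torch

def resolve_device() -> torch.device:
    """Honor an explicit TORCH_DEVICE override, otherwise prefer MPS, then CUDA."""
    requested = os.environ.get("TORCH_DEVICE", "auto").lower()
    if requested != "auto":
        return torch.device(requested)      # e.g. "mps", "cuda", "cpu"
    if torch.backends.mps.is_available():
        return torch.device("mps")          # Apple Silicon GPU
    if torch.cuda.is_available():
        return torch.device("cuda")         # NVIDIA GPU
    return torch.device("cpu")              # fallback

print(f"Using device: {resolve_device()}")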
Performance Comparison
| Configuration | Load Time | Inference Speed | Memory Usage |
|---|---|---|---|
| CPU (Docker) | ~2.5 min | ~30s/response | 30GB |
| MPS (Native) | ~30s | ~2s/response | 16GB |
| CUDA | ~20s | ~1s/response | 12GB |
Setup Instructions
1. Apple Silicon Mac (M1/M2/M3)
Requirements:
- macOS 12.0+
- Python 3.9+
- PyTorch 2.0+ with MPS support
Installation:
# Install PyTorch with MPS support
pip3 install torch torchvision torchaudio
# Verify MPS availability
python3 -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
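Beyond the availability check, a quick smoke test confirms that tensors actually execute on the Apple GPU (generic PyTorch, not specific to STING-CE):
# MPS smoke test: allocate a tensor on the GPU and run one matrix multiply
import torch

assert torch.backends.mps.is_available(), "MPS backend not available"
x = torch.randn(1024, 1024, device="mps")
y = x @ x                     # runs on the Apple GPU
torch.mps.synchronize()       # wait for the kernel to complete
print("MPS matmul OK:", y.shape)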
Running:
# Stop Docker LLM service
docker compose stop llm-gateway
# Run native service
./run_native_mps.sh
2. NVIDIA GPU (Linux/Windows)
Requirements:
- CUDA 11.8+
- NVIDIA Driver 450+
- nvidia-docker2
Docker Configuration:
llm-gateway:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
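After restarting the stack, you can confirm that PyTorch sees the GPU inside the container (for example via docker compose exec llm-gateway python3 ...). This is a generic PyTorch check and assumes PyTorch is installed in the image:
# Generic CUDA visibility check (run inside the llm-gateway container)
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA runtime:", torch.version.cuda)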
3. CPU Optimization
For systems without a GPU, optimize CPU performance with the following settings:
llm-gateway:
  environment:
    - PERFORMANCE_PROFILE=cpu_optimized
    - OMP_NUM_THREADS=8  # Adjust based on CPU cores
    - MKL_NUM_THREADS=8
    - QUANTIZATION=int8  # Reduce memory usage
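How these variables translate into PyTorch calls depends on the service code; the sketch below shows one plausible mapping for the thread settings (QUANTIZATION is covered under Troubleshooting: High Memory Usage):
# Sketch: applying the CPU-profile thread settings in Python (illustrative)
import os
import torch

threads = int(os.environ.get("OMP_NUM_THREADS", os.cpu_count() or 1))
torch.set_num_threads(threads)                        # intra-op parallelism
# Inter-op threads must be set before any parallel work starts
torch.set_num_interop_threads(max(1, threads // 2))
print("CPU threads:", torch.get_num_threads())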
Troubleshooting
MPS Not Detected
- Check PyTorch version:
pip3 show torch | grep Version
# Should be 2.0+
- Verify MPS support:
import torch
print(torch.backends.mps.is_available())
print(torch.backends.mps.is_built())
- Update PyTorch:
pip3 install --upgrade torch torchvision
High Memory Usage
- Enable quantization (a sketch of what this does follows this list):
export QUANTIZATION=int8
- Use smaller models:
export MODEL_NAME=phi3 # 3.8B params
- Reduce batch size:
export BATCH_SIZE=1
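The likely effect of QUANTIZATION=int8 is dynamic int8 quantization of the model's linear layers. A minimal sketch using PyTorch's built-in API with a stand-in model (the service's actual quantization path may differ):
# Dynamic int8 quantization sketch (stand-in model, illustrative only)
import torch
from torch.ao.quantization import quantize_dynamic

model = torch.nn.Sequential(       # placeholder for the loaded LLM
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)
quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(quantized)                   # Linear layers become DynamicQuantizedLinear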
Slow Inference
- Check device usage:
# In server.py logs
INFO:__main__:Using device: mps # Good
INFO:__main__:Using device: cpu # Slow
- Monitor GPU usage (a programmatic check follows this list):
# Mac
sudo powermetrics --samplers gpu_power -i1000 -n1
# NVIDIA
nvidia-smi
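GPU memory can also be checked programmatically from inside the service using standard PyTorch calls:
# Programmatic memory check (generic PyTorch; works for CUDA and MPS)
import torch

if torch.cuda.is_available():
    print(f"CUDA allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    print(f"CUDA reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
elif torch.backends.mps.is_available():
    print(f"MPS allocated:  {torch.mps.current_allocated_memory() / 2**30:.2f} GiB")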
Best Practices
- Development: Use Docker with CPU for consistency
- Production: Use native GPU execution for performance
- Testing: Profile both configurations
- Monitoring: Track GPU memory and utilization
Future Improvements
- Docker GPU Support: Waiting for Docker Desktop MPS passthrough
- Multi-GPU: Support for multiple GPUs
- Mixed Precision: FP16/BF16 for faster inference
- Dynamic Batching: Better throughput for multiple users
Performance Optimization Tips
For MPS (Apple Silicon)
# Enable MPS optimizations
export PYTORCH_ENABLE_MPS_FALLBACK=1
export TORCH_COMPILE_BACKEND=aot_eager
# Use appropriate precision
export TORCH_PRECISION=fp16 # Faster on MPS
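In plain PyTorch terms, fp16 inference on MPS amounts to moving the model and its inputs to half precision. A minimal sketch with a stand-in model:
# fp16 on MPS sketch (stand-in model; shows the generic PyTorch pattern)
import torch

model = torch.nn.Linear(4096, 4096).to("mps", dtype=torch.float16)
x = torch.randn(1, 4096, device="mps", dtype=torch.float16)
with torch.no_grad():
    out = model(x)
print(out.dtype)                   # torch.float16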
For CPU
# Enable all CPU optimizations (hw.ncpu is macOS; use $(nproc) on Linux)
export OMP_NUM_THREADS=$(sysctl -n hw.ncpu)
export MKL_NUM_THREADS=$(sysctl -n hw.ncpu)
export NUMEXPR_MAX_THREADS=$(sysctl -n hw.ncpu)
Memory Management
# Reduce memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Enable memory efficient attention
export TORCH_CUDNN_V8_API_ENABLED=1
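To see how the allocator is behaving after changing these settings, PyTorch's built-in memory summary can help (CUDA only):
# Inspect CUDA allocator statistics (there is no MPS equivalent of this summary)
import torch

if torch.cuda.is_available():
    print(torch.cuda.memory_summary(abbreviated=True))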