External AI Gateway

STING uses an External AI Gateway (external_ai_service) as the central interface between the platform and language models. Instead of connecting directly to Ollama, OpenAI, or other providers, all AI requests route through this gateway — providing unified model management, failover, and provider abstraction.

Architecture

┌────────────┐  ┌────────────┐  ┌────────────┐
│  Bee Chat  │  │  Reports   │  │  Knowledge │
└─────┬──────┘  └─────┬──────┘  └─────┬──────┘
      │               │               │
      └───────────────┼───────────────┘
                      │
              ┌───────▼────────┐
              │  External AI   │  ← Provider Registry
              │   Gateway      │  ← Model routing
              │  (port 8091)   │  ← Health checks
              └───────┬────────┘
                      │
           ┌──────────┼──────────┐
           │          │          │
    ┌──────▼──┐ ┌─────▼────┐ ┌──▼───────┐
    │ Ollama  │ │ MiniMax  │ │ OpenAI   │
    │ (local) │ │ (cloud)  │ │ (cloud)  │
    └─────────┘ └──────────┘ └──────────┘

Key Concepts

  • Provider Registry — singleton that manages all configured LLM providers, their endpoints, API keys, and model lists
  • Nginx LLM Proxy — nginx-llm-proxy.conf provides upstream failover between providers with streaming support (proxy_buffering off, 300s timeouts)
  • Gateway endpoints — unified API at /api/external-ai/* that the frontend and other services call
  • No direct LLM coupling — services never call Ollama/OpenAI directly; they go through the gateway
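
The registry concept above can be sketched in a few lines. This is an illustrative Python sketch, not STING's actual implementation — the `Provider` and `ProviderRegistry` names, fields, and failover rule are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    """One configured LLM backend (fields are illustrative)."""
    name: str
    endpoint: str
    models: list
    healthy: bool = True

class ProviderRegistry:
    """Singleton that tracks configured providers and picks a healthy one."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.providers = []
        return cls._instance

    def register(self, provider: Provider):
        self.providers.append(provider)

    def pick(self):
        # Failover: return the first provider that reports healthy
        for p in self.providers:
            if p.healthy:
                return p
        return None

registry = ProviderRegistry()
registry.register(Provider("ollama", "http://ollama-host:11434", ["llama3.1:8b"], healthy=False))
registry.register(Provider("minimax", "https://api.minimax.io", ["MiniMax-Text-01"]))
print(registry.pick().name)  # "minimax" — ollama is marked unhealthy
```

Because every service resolves providers through the same registry instance, switching or failing over a provider never requires touching the calling services.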

Supported Providers

Provider     Type              Configuration
Ollama       Local or remote   Self-hosted, any Ollama-compatible endpoint
MiniMax      Cloud API         API key required
OpenAI       Cloud API         API key required
Anthropic    Cloud API         API key required
vLLM         Local or remote   OpenAI-compatible endpoint
LM Studio    Local             OpenAI-compatible endpoint

Configuration

config.yml

The primary LLM configuration lives in conf/config.yml:

ai:
  # Primary provider
  provider: minimax          # or: ollama, openai, anthropic, vllm
  
  # Ollama / local LLM settings
  ollama:
    host: dev-ubuntu.tail4e263b.ts.net  # Hostname or IP
    port: 11434
    model: llama3.1:8b                   # Default model
    
  # Cloud provider API keys (stored in Vault)
  minimax:
    api_key: vault:sting/minimax         # Vault path
    model: MiniMax-Text-01
    
  openai:
    api_key: vault:sting/openai
    model: gpt-4
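
Values prefixed with vault: are references to Vault paths rather than literal secrets. A minimal sketch of that resolution step — the `resolve_secret` helper and its return shape are assumptions for illustration, not STING's actual loader:

```python
def resolve_secret(value: str) -> dict:
    """Classify a config value as a literal or a Vault lookup.

    'vault:sting/minimax' -> {'source': 'vault', 'path': 'sting/minimax'}
    anything else         -> {'source': 'literal', 'value': value}
    """
    if value.startswith("vault:"):
        return {"source": "vault", "path": value[len("vault:"):]}
    return {"source": "literal", "value": value}

print(resolve_secret("vault:sting/minimax"))
print(resolve_secret("sk-plaintext-key"))
```

The actual secret is fetched from Vault at runtime, so config.yml never contains key material.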

Vault-Managed API Keys

API keys are stored securely in HashiCorp Vault, not in config files:

# Store an API key
sudo msting vault-secret openai sk-your-api-key-here

# Store MiniMax key
sudo msting vault-secret minimax your-minimax-key

# List stored providers
sudo msting vault-secret list

Environment Variables

The gateway reads from env/external_ai.env:

Variable           Description
AI_PROVIDER        Primary provider (ollama, minimax, openai)
OLLAMA_HOST        Ollama server hostname
OLLAMA_PORT        Ollama server port (default: 11434)
OPENAI_API_KEY     OpenAI API key (from Vault)
MINIMAX_API_KEY    MiniMax API key (from Vault)
LLM_TIMEOUT        Request timeout in seconds (default: 300)
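
A sketch of how a service might read these variables, applying the documented defaults. The fallbacks for AI_PROVIDER and OLLAMA_HOST are assumptions; the OLLAMA_PORT and LLM_TIMEOUT defaults come from the table above:

```python
import os

def gateway_settings(env=os.environ) -> dict:
    """Read gateway settings from the environment with fallback defaults."""
    return {
        "provider": env.get("AI_PROVIDER", "ollama"),        # assumed fallback
        "ollama_host": env.get("OLLAMA_HOST", "localhost"),  # assumed fallback
        "ollama_port": int(env.get("OLLAMA_PORT", "11434")), # documented default
        "timeout": int(env.get("LLM_TIMEOUT", "300")),       # documented default
    }

print(gateway_settings({}))
```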

Gateway API Endpoints

All endpoints are prefixed with /api/external-ai/:

Endpoint       Method   Description
/health        GET      Gateway health and provider status
/models        GET      List available models across all providers
/generate      POST     Generate text (streaming supported)
/chat          POST     Chat completion
/embeddings    POST     Generate embeddings for knowledge sync
/pull          POST     Pull a model (Ollama only)
/restart       POST     Restart the gateway service
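
A sketch of preparing a /chat request from Python. The table above only names the endpoints, so the payload fields (messages, model, stream) are assumptions modeled on common chat-completion APIs:

```python
import json
import urllib.request

BASE = "https://localhost:5050/api/external-ai"

def build_chat_request(messages, model=None, stream=False):
    """Build a POST request for the gateway's /chat endpoint.

    The payload schema here is assumed, not confirmed by the gateway docs.
    """
    payload = {"messages": messages, "stream": stream}
    if model:
        payload["model"] = model
    return urllib.request.Request(
        f"{BASE}/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request([{"role": "user", "content": "Hello"}], model="llama3.1:8b")
print(req.full_url, req.get_method())
# Sending is omitted here; urllib.request.urlopen(req) would perform the call.
```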

Example: Check Gateway Health

curl -s https://localhost:5050/api/external-ai/health | python3 -m json.tool
{
  "status": "ready",
  "provider": "minimax",
  "models_available": 3,
  "ollama_reachable": true
}
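
The health payload can also be checked programmatically. A sketch that parses the response shown above and decides whether the gateway is usable — the field names come from the example output, and the readiness rule is an assumption:

```python
import json

# Sample response, copied from the health-check example above
health_json = """
{
  "status": "ready",
  "provider": "minimax",
  "models_available": 3,
  "ollama_reachable": true
}
"""

def gateway_ready(payload: dict) -> bool:
    # Usable when the gateway reports "ready" and at least one model exists
    return payload.get("status") == "ready" and payload.get("models_available", 0) > 0

health = json.loads(health_json)
print(gateway_ready(health))  # True
```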

Example: List Models

curl -s https://localhost:5050/api/external-ai/models

Nginx LLM Proxy

The nginx-llm-proxy.conf provides load balancing and failover between LLM backends:

upstream llm_backend {
    server ollama-host:11434;
    server minimax-gateway:8091 backup;
}

Key settings:

  • Streaming: proxy_buffering off for real-time token streaming
  • Timeouts: 300s for long-running generation requests
  • Failover: Automatic fallback if primary provider is unavailable
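
Putting those settings together, the proxy location might look like the sketch below. It is consistent with the upstream block shown above, but directive values beyond proxy_buffering and the 300s timeouts are assumptions, not the shipped nginx-llm-proxy.conf:

```nginx
location / {
    proxy_pass http://llm_backend;

    # Streaming: deliver tokens as they arrive; never buffer whole responses
    proxy_buffering off;

    # Long-running generation requests
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    # Failover: try the backup server when the primary errors out
    proxy_next_upstream error timeout http_502 http_503;
}
```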

Using Tailscale for Remote LLM

If your LLM server (e.g., Ollama) runs on a different machine, Tailscale MagicDNS provides a stable hostname that survives IP changes:

ai:
  ollama:
    host: your-machine.tailnet-name.ts.net
    port: 11434

This is preferred over raw Tailscale IPs (100.x.x.x), which can change when devices reconnect.

Model Management

Pulling Models (Ollama)

From the STING admin UI (Bee Settings page), or via CLI:

# On the Ollama host
ollama pull llama3.1:8b
ollama pull nomic-embed-text    # For embeddings

Switching Providers

Update config.yml and regenerate:

sudo msting regenerate-env
sudo msting restart external-ai

Troubleshooting

Gateway reports “No providers available”

  1. Check if the LLM host is reachable: curl http://ollama-host:11434/api/tags
  2. Verify API keys are in Vault: sudo msting vault-secret list
  3. Check gateway logs: sudo docker logs sting-ce-external-ai --tail 50

Slow responses

  • Check if the model is loaded: first request after idle may take 30-60s to load
  • Verify network latency to remote LLM hosts
  • Consider using a smaller model for faster responses

Streaming not working

Ensure the nginx LLM proxy has proxy_buffering off and the client supports SSE (Server-Sent Events).
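
When streaming works, tokens arrive as SSE data: lines. A minimal parser sketch for debugging a raw stream — the exact event format (one chunk per data: line, a [DONE] sentinel) is an assumption modeled on OpenAI-compatible APIs, not confirmed by the gateway docs:

```python
def parse_sse_tokens(raw: str) -> list:
    """Extract payloads from Server-Sent Events 'data:' lines.

    Assumes one 'data: <chunk>' line per token and a final
    'data: [DONE]' sentinel, as many OpenAI-compatible APIs emit.
    """
    tokens = []
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines, comments, other fields
        chunk = line[len("data:"):].strip()
        if chunk == "[DONE]":
            break
        tokens.append(chunk)
    return tokens

stream = "data: Hel\n\ndata: lo\n\ndata: [DONE]\n"
print(parse_sse_tokens(stream))  # ['Hel', 'lo']
```

If this parser receives one giant data line instead of incremental chunks, buffering is still enabled somewhere between the gateway and the client.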
