Service Health Monitoring Guide

This guide explains how to monitor the health and status of all STING platform services.

Overview
Health Check Endpoints
Monitoring Commands
Service Dependencies
Automated Health Checks
Troubleshooting Unhealthy Services

Overview

STING implements health monitoring across all of its microservices. Most services expose an HTTP health endpoint that reports real-time status; the remainder are checked with a CLI probe or by container status.

Health Check Types:

Liveness: Is the service running?
Readiness: Is the service ready to accept requests?
Dependencies: Are required services available?
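
The three check types can be combined into a single health payload. The following is a minimal sketch, not STING's actual implementation: `evaluate_health` and its payload keys are hypothetical names chosen for illustration, and each dependency probe is assumed to be a callable that returns True when the dependency is reachable.

```python
from typing import Callable, Dict


def evaluate_health(dependency_checks: Dict[str, Callable[[], bool]]) -> dict:
    """Build a health payload from named dependency probes (hypothetical helper)."""
    dependencies = {}
    for name, probe in dependency_checks.items():
        try:
            dependencies[name] = bool(probe())
        except Exception:
            # A probe that raises is treated as an unavailable dependency.
            dependencies[name] = False

    ready = all(dependencies.values())
    return {
        "live": True,                  # liveness: the process ran this code
        "ready": ready,                # readiness: every dependency answered
        "dependencies": dependencies,  # per-dependency availability
        "status": "healthy" if ready else "unhealthy",
    }
```

A service with no dependencies is reported as ready, while one failed probe is enough to mark the service unhealthy even though it is still live.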

Health Check Endpoints

Core Services

Service	Health Endpoint	Expected Response	Critical
Flask API	`https://localhost:5050/health`	`{"status": "healthy"}`	Yes
Kratos Auth	`https://localhost:4434/admin/health/ready`	`{"status": "ready"}`	Yes
PostgreSQL	`docker exec sting-ce-db pg_isready`	`accepting connections`	Yes
Vault	`http://localhost:8200/v1/sys/health`	`{"initialized": true, "sealed": false, ...}`	Yes

AI & Knowledge Services

Service	Health Endpoint	Expected Response	Critical
Knowledge Service	`http://localhost:8090/health`	`{"status": "healthy", "service": "knowledge"}`	Yes
ChromaDB	`http://localhost:8000/api/v1/heartbeat`	`{"nanosecond heartbeat": ...}`	Yes
Chatbot	`http://localhost:8888/health`	`{"status": "healthy"}`	No
LLM Gateway	`http://localhost:8085/health`	`{"status": "healthy"}`	No

Supporting Services

Service	Health Endpoint	Expected Response	Critical
Redis	`docker exec sting-ce-redis redis-cli ping`	`PONG`	No
Messaging	`http://localhost:8889/health`	`{"status": "healthy"}`	No
Mailpit	Check container status	Running	No
Frontend	Check container status	Running	Yes
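
The "Expected Response" column lists only the fields that matter, and some responses (such as the ChromaDB heartbeat) carry values that change on every call. One way to compare a response against the table is a subset match on the JSON body. This is a sketch under that assumption; `response_matches` is a hypothetical helper, not part of STING.

```python
import json


def response_matches(body: str, expected: dict) -> bool:
    """Return True if the JSON body contains every expected key/value pair."""
    try:
        actual = json.loads(body)
    except json.JSONDecodeError:
        # Non-JSON bodies (for example an HTML error page) never match.
        return False
    if not isinstance(actual, dict):
        return False
    return all(
        key in actual and actual[key] == value
        for key, value in expected.items()
    )
```

Extra fields in the response are ignored, so `{"status": "healthy", "service": "knowledge"}` still matches an expectation of `{"status": "healthy"}`.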

Monitoring Commands

Quick Health Check Script

#!/bin/bash
# Save as check_health.sh

echo "=== STING Service Health Check ==="
echo

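# Core Services
# -k skips certificate verification; this assumes the local HTTPS endpoints use a self-signed certificate
echo "Flask API: $(curl -sk https://localhost:5050/health 2>/dev/null || echo 'OFFLINE')"
echo "Kratos: $(curl -sk https://localhost:4434/admin/health/ready 2>/dev/null || echo 'OFFLINE')"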
echo "Vault: $(curl -s http://localhost:8200/v1/sys/health 2>/dev/null || echo 'OFFLINE')"
echo "Database: $(docker exec sting-ce-db pg_isready 2>&1 || echo 'OFFLINE')"
echo

# AI Services
echo "Knowledge: $(curl -s http://localhost:8090/health 2>/dev/null || echo 'OFFLINE')"
echo "ChromaDB: $( (curl -s http://localhost:8000/api/v1/heartbeat 2>/dev/null || echo 'OFFLINE') | head -c 50)"
echo "Chatbot: $(curl -s http://localhost:8888/health 2>/dev/null || echo 'OFFLINE')"
echo "LLM Gateway: $(curl -s http://localhost:8085/health 2>/dev/null || echo 'OFFLINE')"
echo

# Support Services
echo "Redis: $(docker exec sting-ce-redis redis-cli ping 2>&1 || echo 'OFFLINE')"
echo "Messaging: $(curl -s http://localhost:8889/health 2>/dev/null || echo 'OFFLINE')"

Container Status Monitoring

# View all containers with health status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.State}}"

# Watch container status in real-time
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'

# Get detailed health info for a specific service
docker inspect sting-ce-knowledge --format='{{json .State.Health}}' | jq

Service Logs Monitoring

# Monitor multiple services simultaneously
docker compose logs -f app knowledge chatbot

# View logs with timestamps
docker logs --timestamps --tail 50 sting-ce-knowledge

# Search for errors across all services
docker compose logs | grep -i error
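
A plain `grep` shows matching lines but not which service produces most of them. The sketch below counts error lines per service, assuming the usual Compose log layout of a container name, a `|` separator, and then the message. `count_errors` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable

# Compose prefixes each log line with the container name and a "|" separator.
LINE_PATTERN = re.compile(r"^(?P<service>\S+)\s*\|\s?(?P<message>.*)$")


def count_errors(lines: Iterable[str]) -> Counter:
    """Count lines whose message mentions "error" (case-insensitive), per service."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match and "error" in match.group("message").lower():
            counts[match.group("service")] += 1
    return counts
```

To use it, pipe `docker compose logs --no-color` into a small script that passes `sys.stdin` to `count_errors`; disabling color keeps escape codes out of the container names.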

Service Dependencies

Understanding service dependencies helps troubleshoot cascading failures:

┌─────────────────┐
│    Frontend     │
└────────┬────────┘
         │
┌────────▼────────┐
│   Flask API     │
├─────────────────┤
│ Depends on:     │
│ • PostgreSQL    │
│ • Vault         │
│ • Kratos        │
│ • Knowledge*    │
└────────┬────────┘
         │
┌────────▼────────┐     ┌─────────────┐
│   Knowledge     │────▶│  ChromaDB   │
└─────────────────┘     └─────────────┘
         │
┌────────▼────────┐     ┌─────────────┐
│    Chatbot      │────▶│ LLM Gateway │
├─────────────────┤     └─────────────┘
│ Depends on:     │
│ • Messaging     │
│ • Redis         │
└─────────────────┘
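
The diagram can also be expressed as data, which makes it possible to compute a safe start order and to see which services a failure will affect. The mapping below is one reading of the diagram: the service keys are hypothetical identifiers, and the unlabeled vertical link between Knowledge and Chatbot is left out because its direction is not stated.

```python
from graphlib import TopologicalSorter

# Maps each service to the services it depends on (one reading of the diagram).
DEPENDENCIES = {
    "frontend": {"flask_api"},
    "flask_api": {"postgresql", "vault", "kratos", "knowledge"},
    "knowledge": {"chromadb"},
    "chatbot": {"llm_gateway", "messaging", "redis"},
}


def startup_order(dependencies: dict) -> list:
    """Return an order in which every service comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def affected_by(failed: str, dependencies: dict) -> set:
    """Return services that directly or transitively depend on `failed`."""
    affected = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for service, needs in dependencies.items():
            if current in needs and service not in affected:
                affected.add(service)
                frontier.append(service)
    return affected
```

With this mapping, a ChromaDB outage affects Knowledge, the Flask API, and the Frontend, while a Redis outage affects only the Chatbot. `graphlib` requires Python 3.9 or later.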

Automated Health Checks

Docker Compose Health Checks

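Each service in docker-compose.yml includes a health check configuration. For example, a service listening on port 8090 (the Knowledge Service in the table above) would use: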

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8090/health"]
  interval: 30s
  timeout: 10s
  retries: 5
  start_period: 30s
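
These settings determine how long a failure can go unnoticed. Docker marks a container unhealthy after `retries` consecutive failed probes, and failures during `start_period` do not count toward that limit. The following is a rough upper-bound estimate rather than an exact model: it assumes every probe hangs until the timeout and that the next probe starts one interval after the previous one completes. The function names are hypothetical.

```python
def worst_case_detection_seconds(interval: float, timeout: float, retries: int) -> float:
    """Rough upper bound on the time from a failure to the 'unhealthy' state."""
    # Each failed probe may wait a full interval and then run until the timeout.
    return retries * (interval + timeout)


def worst_case_startup_failure_seconds(
    interval: float, timeout: float, retries: int, start_period: float
) -> float:
    """Rough upper bound for a container that never becomes healthy after start."""
    # Failures inside the start period are not counted toward `retries`.
    return start_period + worst_case_detection_seconds(interval, timeout, retries)


# Values from the configuration above: about 200 seconds to detect a hung service.
print(worst_case_detection_seconds(interval=30, timeout=10, retries=5))
```

If that delay is too long for a critical service, lowering `interval` or `retries` shortens it at the cost of more frequent probes or more false alarms.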

Monitoring with Docker Events

# Monitor health check events
docker events --filter event=health_status

# Get health check history
docker inspect sting-ce-knowledge | jq '.[0].State.Health.Log'
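
The health log returned by `docker inspect` can be condensed into a short summary. This sketch assumes the standard inspect layout, where `State.Health` holds `Status`, `FailingStreak`, and a `Log` list whose entries include `ExitCode` and `Output`. `summarize_health` is a hypothetical helper that takes the full JSON text printed by `docker inspect <container>`.

```python
import json


def summarize_health(inspect_json: str) -> dict:
    """Summarize the health section of `docker inspect` output for one container."""
    state = json.loads(inspect_json)[0]["State"]
    # Containers without a configured health check have no "Health" key.
    health = state.get("Health") or {}
    log = health.get("Log") or []
    failures = [entry for entry in log if entry.get("ExitCode") != 0]
    last_failure = failures[-1] if failures else None
    return {
        "status": health.get("Status", "none"),
        "failing_streak": health.get("FailingStreak", 0),
        "failed_probes": len(failures),
        "last_failure_output": (
            (last_failure.get("Output") or "").strip() if last_failure else None
        ),
    }
```

The last failure output is usually the quickest pointer to the cause, for example a connection refused message from `curl`.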

Creating Custom Health Monitors

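# health_monitor.py
import json
import time

import requests
import urllib3

# verify=False below skips TLS verification for the local HTTPS endpoint;
# silence the warning urllib3 would otherwise print on every request.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

SERVICES = {
    "flask_api": "https://localhost:5050/health",
    "knowledge": "http://localhost:8090/health",
    "chatbot": "http://localhost:8888/health",
    "chromadb": "http://localhost:8000/api/v1/heartbeat",
}

def check_services():
    """Probe every service and return a status dict keyed by service name."""
    results = {}
    for name, url in SERVICES.items():
        try:
            response = requests.get(url, timeout=5, verify=False)
            results[name] = {
                "status": "healthy" if response.status_code == 200 else "unhealthy",
                "response_time": response.elapsed.total_seconds()
            }
        except requests.RequestException as e:
            # Connection errors and timeouts mean the service is unreachable
            results[name] = {
                "status": "offline",
                "error": str(e)
            }
    return results

if __name__ == "__main__":
    # Run continuous monitoring until interrupted with Ctrl+C
    try:
        while True:
            status = check_services()
            print(json.dumps(status, indent=2))
            time.sleep(30)
    except KeyboardInterrupt:
        pass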

Troubleshooting Unhealthy Services

Common Health Check Failures

1. Knowledge Service Unhealthy

# Check if ChromaDB is running (dependency)
curl http://localhost:8000/api/v1/heartbeat

# View knowledge service logs
docker logs sting-ce-knowledge --tail 100

# Restart the service
docker compose restart knowledge

2. Database Connection Issues

# Check PostgreSQL status
docker exec sting-ce-db pg_isready -U postgres

# View connection count
docker exec sting-ce-db psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Reset connections
docker compose restart db app

3. Redis Memory Issues

# Check Redis memory usage
docker exec sting-ce-redis redis-cli info memory

# Clear Redis if needed (warning: FLUSHALL deletes every key in every database, not only cached data)
docker exec sting-ce-redis redis-cli FLUSHALL

# Restart Redis
docker compose restart redis
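
Before flushing or restarting, it helps to know how close Redis is to its memory limit. The sketch below parses the text printed by `redis-cli info memory`, which consists of `key:value` lines under `#` section headers, and computes the ratio of `used_memory` to `maxmemory`. The helper names are hypothetical.

```python
from typing import Optional


def parse_redis_info(text: str) -> dict:
    """Parse `redis-cli info` output into a dict of string fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_usage_ratio(info: dict) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is configured."""
    maxmemory = int(info.get("maxmemory", 0))
    if maxmemory == 0:
        return None  # maxmemory of 0 means Redis has no memory limit set
    return int(info["used_memory"]) / maxmemory
```

A ratio near 1.0 together with an eviction policy of `noeviction` usually explains write errors; a result of None means the limit is not the cause.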

4. LLM Gateway Not Responding

# Check native LLM service (macOS)
./sting-llm status

# Check proxy configuration
docker logs sting-ce-llm-gateway-proxy

# Test direct connection
curl http://host.docker.internal:8086/health

Health Check Best Practices

Monitor Critical Services First: Focus on database, authentication, and API services
Set Appropriate Timeouts: Tune each health check's timeout, interval, and start period to the service's startup time
Use Cascading Restarts: Restart a failed dependency first, then the services that depend on it
Log Health Events: Keep a history of health check failures for pattern analysis
Alert on Repeated Failures: Don’t alert on a single failure; wait for several consecutive failures
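
The last practice, alerting only on repeated failures, can be sketched as a per-service counter of consecutive failures. `FailureTracker` and its default threshold of 3 are assumptions for illustration, not values taken from STING.

```python
from collections import defaultdict


class FailureTracker:
    """Track consecutive failures per service and signal when to alert."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._streaks = defaultdict(int)

    def record(self, service: str, healthy: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if healthy:
            # Any success resets the streak, so isolated blips never alert.
            self._streaks[service] = 0
            return False
        self._streaks[service] += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self._streaks[service] == self.threshold
```

Calling `record` once per monitoring cycle for each service gives one alert per outage rather than one per failed check.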

Integration with Monitoring Systems

Prometheus Metrics

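# Example Prometheus configuration
# Prometheus scrapes /metrics over HTTP by default; set metrics_path or scheme
# per job if a service differs (the Flask API health endpoint above uses HTTPS).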
scrape_configs:
  - job_name: 'sting-services'
    static_configs:
      - targets:
        - 'localhost:5050'  # Flask API metrics
        - 'localhost:8090'  # Knowledge service metrics
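
If a service does not expose metrics itself, health check results can be converted into the Prometheus text exposition format and served from a small exporter. This sketch takes a dict shaped like the return value of `check_services()` in health_monitor.py. The metric name `sting_service_up` is a hypothetical choice, and serving the text over HTTP is left out.

```python
def render_metrics(results: dict) -> str:
    """Render health check results as Prometheus text exposition format."""
    lines = [
        "# HELP sting_service_up Whether the last health check succeeded (1) or not (0).",
        "# TYPE sting_service_up gauge",
    ]
    for name in sorted(results):
        up = 1 if results[name].get("status") == "healthy" else 0
        lines.append(f'sting_service_up{{service="{name}"}} {up}')
    lines.append("")  # the exposition format ends with a trailing newline
    return "\n".join(lines)
```

A gauge with one `service` label per target keeps the metric easy to aggregate, for example when averaging it over time to chart uptime.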

Grafana Dashboard

Create dashboards to visualize:

Service uptime percentages
Response time trends
Resource usage (CPU, memory)
Error rates

Alertmanager Rules

# Example Prometheus alerting rule (firing alerts are routed through Alertmanager)
groups:
  - name: sting_alerts
    rules:
      - alert: ServiceDown
        expr: up{job="sting-services"} == 0
        for: 5m
        annotations:
          summary: "Service {{ $labels.instance }} is down"

Last updated: October 22, 2025