Service Health Monitoring Guide

This guide provides detailed information about monitoring the health and status of all STING platform services.

Overview

STING implements comprehensive health monitoring across all microservices. Each service exposes health endpoints that provide real-time status information.

Health Check Types:

  • Liveness: Is the service running?
  • Readiness: Is the service ready to accept requests?
  • Dependencies: Are required services available?

Health Check Endpoints

Core Services

| Service | Health Endpoint | Expected Response | Critical |
|---|---|---|---|
| Flask API | https://localhost:5050/health | {"status": "healthy"} | Yes |
| Kratos Auth | https://localhost:4434/admin/health/ready | {"status": "ready"} | Yes |
| PostgreSQL | docker exec sting-ce-db pg_isready | accepting connections | Yes |
| Vault | http://localhost:8200/v1/sys/health | {"initialized": true} | Yes |

AI & Knowledge Services

| Service | Health Endpoint | Expected Response | Critical |
|---|---|---|---|
| Knowledge Service | http://localhost:8090/health | {"status": "healthy", "service": "knowledge"} | Yes |
| ChromaDB | http://localhost:8000/api/v1/heartbeat | {"nanosecond heartbeat": ...} | Yes |
| Chatbot | http://localhost:8888/health | {"status": "healthy"} | No |
| LLM Gateway | http://localhost:8085/health | {"status": "healthy"} | No |

Supporting Services

| Service | Health Endpoint | Expected Response | Critical |
|---|---|---|---|
| Redis | docker exec sting-ce-redis redis-cli ping | PONG | No |
| Messaging | http://localhost:8889/health | {"status": "healthy"} | No |
| Mailpit | N/A | Check container status | No |
| Frontend | Check container status | Running | Yes |
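
The Critical column suggests a simple way to roll individual results up into one platform status. The following is a minimal sketch, not part of STING itself: the `Service` class, the `overall_status` function, and the "ok" / "degraded" / "down" labels are hypothetical names chosen for illustration, while the criticality flags are copied from the tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    critical: bool


# Criticality flags copied from the three tables above.
SERVICES = [
    Service("Flask API", True),
    Service("Kratos Auth", True),
    Service("PostgreSQL", True),
    Service("Vault", True),
    Service("Knowledge Service", True),
    Service("ChromaDB", True),
    Service("Chatbot", False),
    Service("LLM Gateway", False),
    Service("Redis", False),
    Service("Messaging", False),
    Service("Mailpit", False),
    Service("Frontend", True),
]


def overall_status(healthy: dict[str, bool]) -> str:
    """Return 'down', 'degraded', or 'ok' (illustrative labels).

    A service missing from `healthy` is treated as failing.
    """
    failed = [s for s in SERVICES if not healthy.get(s.name, False)]
    if any(s.critical for s in failed):
        return "down"
    return "degraded" if failed else "ok"


if __name__ == "__main__":
    all_up = {s.name: True for s in SERVICES}
    print(overall_status(all_up))                      # ok
    print(overall_status({**all_up, "Redis": False}))  # degraded
    print(overall_status({**all_up, "Vault": False}))  # down
```

Treating unknown services as failing is a deliberately conservative choice: a check that was never run should not count as healthy.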

Monitoring Commands

Quick Health Check Script

#!/bin/bash
# Save as check_health.sh

echo "=== STING Service Health Check ==="
echo

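# Core Services
# -k skips certificate verification for the HTTPS endpoints (assumes self-signed local certificates)
echo "Flask API: $(curl -sk https://localhost:5050/health 2>/dev/null || echo 'OFFLINE')"
echo "Kratos: $(curl -sk https://localhost:4434/admin/health/ready 2>/dev/null || echo 'OFFLINE')"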
echo "Vault: $(curl -s http://localhost:8200/v1/sys/health 2>/dev/null || echo 'OFFLINE')"
echo "Database: $(docker exec sting-ce-db pg_isready 2>&1 || echo 'OFFLINE')"
echo

# AI Services
echo "Knowledge: $(curl -s http://localhost:8090/health 2>/dev/null || echo 'OFFLINE')"
echo "ChromaDB: $( (curl -s http://localhost:8000/api/v1/heartbeat 2>/dev/null || echo 'OFFLINE') | head -c 50)"
echo "Chatbot: $(curl -s http://localhost:8888/health 2>/dev/null || echo 'OFFLINE')"
echo "LLM Gateway: $(curl -s http://localhost:8085/health 2>/dev/null || echo 'OFFLINE')"
echo

# Support Services
echo "Redis: $(docker exec sting-ce-redis redis-cli ping 2>&1 || echo 'OFFLINE')"
echo "Messaging: $(curl -s http://localhost:8889/health 2>/dev/null || echo 'OFFLINE')"

Container Status Monitoring

# View all containers with health status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.State}}"

# Watch container status in real-time
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'

# Get detailed health info for a specific service
docker inspect sting-ce-knowledge --format='{{json .State.Health}}' | jq

Service Logs Monitoring

# Monitor multiple services simultaneously
docker compose logs -f app knowledge chatbot

# View logs with timestamps
docker logs --timestamps --tail 50 sting-ce-knowledge

# Search for errors across all services
docker compose logs | grep -i error
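
A plain grep shows matching lines but not which services produce the most errors. As a rough sketch, the function below tallies error lines per service. It assumes the usual Compose log layout of a container prefix followed by `|` and the message; `count_errors` is a hypothetical helper, and the sample lines are invented for illustration.

```python
from collections import Counter


def count_errors(lines):
    """Count lines mentioning 'error' (case-insensitive) per log prefix.

    Each line is assumed to look like '<container> | <message>';
    lines without a '|' separator are ignored.
    """
    counts = Counter()
    for line in lines:
        prefix, sep, message = line.partition("|")
        if sep and "error" in message.lower():
            counts[prefix.strip()] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines, shaped like Compose log output.
    sample = [
        "knowledge-1  | ERROR failed to reach ChromaDB",
        "knowledge-1  | INFO retrying",
        "app-1        | Error: database connection refused",
        "knowledge-1  | error: timeout",
    ]
    for service, total in count_errors(sample).most_common():
        print(f"{service}: {total}")
```

Passing `sys.stdin` as `lines` lets the same function consume the output of `docker compose logs --no-color` through a pipe.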

Service Dependencies

Understanding service dependencies helps troubleshoot cascading failures:

┌─────────────────┐
│    Frontend     │
└────────┬────────┘
         │
┌────────▼────────┐
│   Flask API     │
├─────────────────┤
│ Depends on:     │
│ • PostgreSQL    │
│ • Vault         │
│ • Kratos        │
│ • Knowledge*    │
└────────┬────────┘
         │
┌────────▼────────┐     ┌─────────────┐
│   Knowledge     │────▶│  ChromaDB   │
└─────────────────┘     └─────────────┘
         │
┌────────▼────────┐     ┌─────────────┐
│    Chatbot      │────▶│ LLM Gateway │
├─────────────────┤     └─────────────┘
│ Depends on:     │
│ • Messaging     │
│ • Redis         │
└─────────────────┘
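
The diagram can also be written down as data, which makes it possible to compute a safe start or restart order and to list what a single failure affects. This is a sketch under stated assumptions: the lowercase service labels are illustrative rather than the Compose service names, and the vertical link from Knowledge to Chatbot is read as a dependency, which the diagram does not state explicitly.

```python
from graphlib import TopologicalSorter

# service -> services it depends on, transcribed from the diagram above.
# Assumption: the vertical Knowledge -> Chatbot link is a dependency.
DEPENDS_ON = {
    "frontend": {"flask_api"},
    "flask_api": {"postgresql", "vault", "kratos", "knowledge"},
    "knowledge": {"chromadb", "chatbot"},
    "chatbot": {"llm_gateway", "messaging", "redis"},
}


def start_order(graph):
    """Return an order in which every service comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def affected_by(failed, graph):
    """Return every service that directly or transitively depends on `failed`."""
    hit = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for service, deps in graph.items():
            if current in deps and service not in hit:
                hit.add(service)
                frontier.append(service)
    return hit


if __name__ == "__main__":
    print(start_order(DEPENDS_ON))
    print(sorted(affected_by("chromadb", DEPENDS_ON)))
    # ['flask_api', 'frontend', 'knowledge']
```

Restarting services in `start_order` sequence brings each dependency up before its dependents, while `affected_by` gives a first list of suspects when a low-level service fails.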

Automated Health Checks

Docker Compose Health Checks

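Each service in docker-compose.yml includes a health check configuration. For example, a service whose health endpoint listens on port 8090 (the Knowledge Service port listed above):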

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8090/health"]
  interval: 30s
  timeout: 10s
  retries: 5
  start_period: 30s
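
As a rough worked example of what these settings imply, the sketch below estimates how long a running service can be broken before Docker marks the container unhealthy. It assumes Docker's documented behaviour: `retries` consecutive failures are needed, and each probe starts `interval` after the previous one completes. The start period is ignored because it only matters during startup. `detection_window` is a hypothetical helper name.

```python
def detection_window(interval: float, timeout: float, retries: int):
    """Estimate (fastest, slowest) seconds from failure to 'unhealthy'.

    fastest: the service breaks just before a probe and every probe
             fails immediately, so only the waits between probes count.
    slowest: the service breaks just after a successful probe and every
             failing probe runs into the full timeout.
    """
    fastest = (retries - 1) * interval
    slowest = retries * (interval + timeout)
    return fastest, slowest


if __name__ == "__main__":
    # Values from the healthcheck block above.
    print(detection_window(interval=30, timeout=10, retries=5))  # (120, 200)
```

With the values shown, an outage is therefore flagged after roughly two to three and a half minutes, which is worth keeping in mind when choosing alert delays.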

Monitoring with Docker Events

# Monitor health check events
docker events --filter event=health_status

# Get health check history
docker inspect sting-ce-knowledge | jq '.[0].State.Health.Log'
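
The health JSON that `docker inspect` returns can be condensed into a few fields for dashboards or scripts. The following sketch assumes the standard shape of Docker's `.State.Health` object (`Status`, `FailingStreak`, and a `Log` list whose entries carry `ExitCode` and `Output`). `summarize_health` is a hypothetical helper, and the sample document is invented.

```python
import json


def summarize_health(health_json: str) -> dict:
    """Condense the output of: docker inspect --format='{{json .State.Health}}'."""
    # Containers without a healthcheck print 'null', which becomes {} here.
    health = json.loads(health_json) or {}
    log = health.get("Log") or []
    failures = [entry for entry in log if entry.get("ExitCode") != 0]
    return {
        "status": health.get("Status"),
        "failing_streak": health.get("FailingStreak", 0),
        "probes": len(log),
        "failed_probes": len(failures),
        "last_output": log[-1].get("Output", "").strip() if log else "",
    }


if __name__ == "__main__":
    # Invented sample in the shape Docker reports.
    sample = json.dumps({
        "Status": "unhealthy",
        "FailingStreak": 2,
        "Log": [
            {"ExitCode": 0, "Output": "ok\n"},
            {"ExitCode": 22, "Output": "curl: (22) error 503\n"},
            {"ExitCode": 22, "Output": "curl: (22) error 503\n"},
        ],
    })
    print(json.dumps(summarize_health(sample), indent=2))
```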

Creating Custom Health Monitors

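# health_monitor.py
import json
import time

import requests
import urllib3

# The Flask API is served over HTTPS on localhost. Certificate verification is
# disabled below (assuming a self-signed certificate), so silence the warning
# that requests would otherwise print on every call.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

SERVICES = {
    "flask_api": "https://localhost:5050/health",
    "knowledge": "http://localhost:8090/health",
    "chatbot": "http://localhost:8888/health",
    "chromadb": "http://localhost:8000/api/v1/heartbeat",
}

CHECK_INTERVAL_SECONDS = 30

def check_services():
    """Probe every endpoint once and return a status dict keyed by service name."""
    results = {}
    for name, url in SERVICES.items():
        try:
            response = requests.get(url, timeout=5, verify=False)
            results[name] = {
                "status": "healthy" if response.status_code == 200 else "unhealthy",
                "response_time": response.elapsed.total_seconds()
            }
        except requests.RequestException as e:
            # Connection refused, timeout, TLS failure, and similar errors
            results[name] = {
                "status": "offline",
                "error": str(e)
            }
    return results

if __name__ == "__main__":
    # Run continuous monitoring until interrupted with Ctrl+C
    try:
        while True:
            status = check_services()
            print(json.dumps(status, indent=2))
            time.sleep(CHECK_INTERVAL_SECONDS)
    except KeyboardInterrupt:
        pass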

Troubleshooting Unhealthy Services

Common Health Check Failures

1. Knowledge Service Unhealthy

# Check if ChromaDB is running (dependency)
curl http://localhost:8000/api/v1/heartbeat

# View knowledge service logs
docker logs sting-ce-knowledge --tail 100

# Restart the service
docker compose restart knowledge

2. Database Connection Issues

# Check PostgreSQL status
docker exec sting-ce-db pg_isready -U postgres

# View connection count
docker exec sting-ce-db psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Reset connections
docker compose restart db app

3. Redis Memory Issues

# Check Redis memory usage
docker exec sting-ce-redis redis-cli info memory

# Clear Redis if needed (destructive: FLUSHALL deletes every key in every database, not only cached entries)
docker exec sting-ce-redis redis-cli FLUSHALL

# Restart Redis
docker compose restart redis
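
Before flushing or restarting, it can help to check how close Redis actually is to its limit. The sketch below parses the `key:value` text that `redis-cli info memory` prints and compares `used_memory` with `maxmemory`. Both field names are standard Redis INFO fields; `parse_info` and `memory_usage_ratio` are hypothetical helpers, and the sample text is invented.

```python
def parse_info(text: str) -> dict:
    """Parse 'key:value' lines from redis-cli INFO output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def memory_usage_ratio(info_text: str):
    """Return used_memory / maxmemory, or None when no limit is configured."""
    fields = parse_info(info_text)
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", "0"))
    # maxmemory of 0 means Redis has no configured memory limit.
    return used / limit if limit > 0 else None


if __name__ == "__main__":
    # Invented sample in the shape of `redis-cli info memory` output.
    sample = "# Memory\r\nused_memory:524288\r\nmaxmemory:1048576\r\n"
    print(memory_usage_ratio(sample))  # 0.5
```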

4. LLM Gateway Not Responding

# Check native LLM service (macOS)
./sting-llm status

# Check proxy configuration
docker logs sting-ce-llm-gateway-proxy

# Test direct connection
curl http://host.docker.internal:8086/health

Health Check Best Practices

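  1. Monitor Critical Services First: Focus on the services marked critical above, starting with the database, authentication, and the API.
  2. Set Appropriate Timeouts: Tune interval, timeout, and start_period to each service's startup time so that slow starters are not marked unhealthy prematurely.
  3. Use Cascading Restarts: Restart a failed dependency first, then the services that depend on it, following the dependency order.
  4. Log Health Events: Keep a history of health check failures for pattern analysis.
  5. Alert on Repeated Failures: Do not alert on a single failed check; wait until failures are consistent.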
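
The last practice can be sketched as a small counter of consecutive failures per service. `FailureTracker` and its default threshold of three are hypothetical choices for illustration; the idea is only that an alert fires once, when the failure streak first reaches the threshold, and that any successful check resets the streak.

```python
from collections import defaultdict


class FailureTracker:
    """Track consecutive failed checks per service."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._streaks = defaultdict(int)

    def record(self, service: str, healthy: bool) -> bool:
        """Record one check result.

        Returns True only on the check where the failure streak first
        reaches the threshold, so a long outage alerts once.
        """
        if healthy:
            self._streaks[service] = 0
            return False
        self._streaks[service] += 1
        return self._streaks[service] == self.threshold


if __name__ == "__main__":
    tracker = FailureTracker(threshold=3)
    results = [False, False, True, False, False, False, False]
    print([tracker.record("knowledge", ok) for ok in results])
    # [False, False, False, False, False, True, False]
```

The early pair of failures never alerts because the successful check in between resets the streak.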

Integration with Monitoring Systems

Prometheus Metrics

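# Example Prometheus configuration
# Prometheus scrapes http://<target>/metrics by default; set scheme or metrics_path per job if a service differs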
scrape_configs:
  - job_name: 'sting-services'
    static_configs:
      - targets:
        - 'localhost:5050'  # Flask API metrics
        - 'localhost:8090'  # Knowledge service metrics

Grafana Dashboard

Create dashboards to visualize:

  • Service uptime percentages
  • Response time trends
  • Resource usage (CPU, memory)
  • Error rates
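
The first two items reduce to simple arithmetic over stored check results. As an illustration only, the sketch below computes an uptime percentage from a list of pass/fail results and a nearest-rank percentile over response times; `uptime_percent` and `percentile` are hypothetical helpers, and the sample values are invented.

```python
import math


def uptime_percent(checks):
    """Percentage of checks that passed; `checks` is a list of booleans."""
    if not checks:
        raise ValueError("no checks recorded")
    return 100.0 * sum(checks) / len(checks)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no values recorded")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Invented sample data.
    print(uptime_percent([True, True, True, False]))       # 75.0
    print(percentile([0.1, 0.2, 0.3, 0.4, 0.5], 95))       # 0.5
```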

Alert Manager Rules

# Example alert rule
groups:
  - name: sting_alerts
    rules:
      - alert: ServiceDown
        expr: up{job="sting-services"} == 0
        for: 5m
        annotations:
          summary: "Service {{ $labels.instance }} is down"
