STING Service Startup Resilience

Overview

STING includes enhanced service startup resilience to handle common issues during installation and updates, such as:

Services failing to start due to dependency timing
Containers stuck in “created” state
Port conflicts preventing service startup
Network timeouts during image pulls
Services exiting unexpectedly

How It Works

Automatic Resilience

When you run ./install_sting.sh, ./manage_sting.sh start, or ./manage_sting.sh update, STING automatically:

Checks service dependencies - Services start in the correct order based on their dependencies
Retries failed services - Up to 3 attempts by default, with a backoff delay between attempts (see Retry Settings)
Performs health checks - Verifies services are actually responding, not just running
Detects port conflicts - Identifies when required ports are already in use
Handles container states - Manages containers stuck in “created” or “exited” states
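The retry step in the list above can be pictured with a minimal sketch like the one below. It is not STING's actual implementation: the function name retry_with_backoff is hypothetical, and the linear backoff (delay multiplied by the attempt number) is an assumption, since the source only states the retry count and a delay setting.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the code in the resilience script.
MAX_RETRIES=${MAX_RETRIES:-3}   # total attempts per command
RETRY_DELAY=${RETRY_DELAY:-5}   # base delay in seconds

# retry_with_backoff CMD [ARGS...]
# Runs CMD until it succeeds or MAX_RETRIES attempts have been used.
# Assumed policy: wait RETRY_DELAY * attempt seconds between attempts.
retry_with_backoff() {
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$MAX_RETRIES" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in $((RETRY_DELAY * attempt))s" >&2
    sleep "$((RETRY_DELAY * attempt))"
    attempt=$((attempt + 1))
  done
}

# Usage: retry_with_backoff docker start sting-ce-app
```

With the default values this makes at most three attempts and waits 5 and then 10 seconds between them.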

Service Dependencies

Services are started in this order to respect dependencies:

Infrastructure: postgres, redis
Authentication: kratos, kratos-migrate
Core services: mailpit, app
Frontend: frontend
AI services: chatbot, knowledge
Gateway: nginx
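One way to derive such an order from a dependency map is a depth-first walk that emits each service only after its dependencies. The sketch below assumes a SERVICE_DEPS associative array in the format shown under For Developers; the specific entries and the resolve_order helper are illustrative, not STING's real map or code.

```shell
#!/usr/bin/env bash
# Hypothetical dependency map; the entries are examples, not STING's real map.
declare -A SERVICE_DEPS=(
  [postgres]=""
  [redis]=""
  [kratos]="postgres"
  [app]="postgres redis kratos"
  [frontend]="app"
  [nginx]="app frontend"
)

declare -A VISITED=()
START_ORDER=()

# Depth-first walk: a service is appended only after all of its dependencies.
# Marking a service as visited first also stops the walk from looping on cycles.
resolve_order() {
  local svc=$1 dep
  [[ -n ${VISITED[$svc]:-} ]] && return 0
  VISITED[$svc]=1
  for dep in ${SERVICE_DEPS[$svc]:-}; do
    resolve_order "$dep"
  done
  START_ORDER+=("$svc")
}

resolve_order nginx
printf '%s\n' "${START_ORDER[@]}"
```

For this example map the script prints postgres, redis, kratos, app, frontend, nginx, one per line. Services that nginx does not reach would need their own resolve_order call.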

Manual Recovery Tool

If services fail to start automatically, use the interactive recovery tool:

./scripts/recover_services.sh

This provides an interactive menu with options to:

1. View detailed service status
2. Check system resources and port conflicts
3. Attempt automatic recovery
4. View service logs
5. Recreate specific services
6. Perform full system restart

Common Issues and Solutions

Services Stuck in “Created” State

Symptom: Services show as created but not running after update/install.

Solution: The enhanced startup automatically detects and starts these services. If it fails:

# Manual recovery
./scripts/recover_services.sh
# Select option 3 for automatic recovery
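If you prefer to handle this by hand, the same idea can be sketched with plain Docker commands. The function name start_created_containers is hypothetical, and the sting-ce- name prefix is taken from the container names used elsewhere in this document; this is not the recovery tool's actual code.

```shell
#!/usr/bin/env bash
# Sketch: find STING containers that were created but never started, then start them.
start_created_containers() {
  local name
  docker ps -a --filter "status=created" --filter "name=sting-ce-" --format '{{.Names}}' |
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    echo "starting $name"
    # stdin is redirected so docker cannot consume the list being read by the loop
    docker start "$name" >/dev/null </dev/null || echo "failed to start $name" >&2
  done
}

# Usage: start_created_containers
```

Note that this starts containers in whatever order Docker lists them, so it does not respect the dependency order described earlier.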

Port Conflicts

Symptom: Services fail to start with “port already allocated” errors.

Solution:

Check what’s using the port:

./scripts/recover_services.sh
# Select option 2 to check system resources

Stop the conflicting service, or change the STING ports in the .env files
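For a quick check outside the recovery tool, a small script can probe the ports directly. The sketch below is an assumption-laden example: the port list is taken only from the health check URLs in this document, and the helper names are hypothetical. It relies on bash's /dev/tcp redirection, so it needs bash rather than plain sh.

```shell
#!/usr/bin/env bash
# Ports taken from the health check URLs in this document; STING may publish others.
STING_PORTS=(5050 8443 5005)

# Succeeds if something accepts TCP connections on 127.0.0.1:PORT.
# The subshell opens the connection and closes it again when it exits.
port_in_use() {
  local port=$1
  (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null
}

# Print one line for every given port that is already taken.
report_busy_ports() {
  local port
  for port in "$@"; do
    if port_in_use "$port"; then
      echo "port $port is in use"
    fi
  done
}

# Usage: report_busy_ports "${STING_PORTS[@]}"
```

Run it while STING is stopped, otherwise STING's own services will show up as busy. To see which process owns a port, a command such as lsof -nP -iTCP:8443 -sTCP:LISTEN or ss -ltnp can help, depending on what is installed.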

Network Timeouts

Symptom: Image pulls fail with timeout errors.

Solution:

Check your Docker proxy settings
Retry on a more stable network connection
The enhanced startup will automatically retry failed pulls

Service Health Issues

Symptom: Service running but not responding to health checks.

Solution: The recovery tool can recreate unhealthy services:

./scripts/recover_services.sh
# Select option 5 to recreate specific service

Configuration

Retry Settings

You can modify retry behavior by editing /lib/service_startup_resilience.sh:

MAX_RETRIES=3        # Number of retry attempts
RETRY_DELAY=5        # Seconds between retries

Health Check URLs

Health check endpoints are defined in the resilience script:

HEALTH_CHECKS[app]="https://localhost:5050/health"
HEALTH_CHECKS[frontend]="https://localhost:8443"
HEALTH_CHECKS[chatbot]="http://localhost:5005/health"
# etc...
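To probe these endpoints by hand, something like the following sketch may be useful. The function names are hypothetical, the map holds only the three entries shown above, and the --insecure flag is an assumption that the local HTTPS endpoints use self-signed certificates.

```shell
#!/usr/bin/env bash
# Endpoints copied from the examples above; the real map may contain more entries.
declare -A HEALTH_CHECKS=(
  [app]="https://localhost:5050/health"
  [frontend]="https://localhost:8443"
  [chatbot]="http://localhost:5005/health"
)

# Succeeds only if the endpoint answers without an HTTP error within 5 seconds.
# --insecure is an assumption (self-signed local certificates).
check_health() {
  curl --silent --fail --insecure --max-time 5 --output /dev/null "$1"
}

# Print "<service>: healthy" or "<service>: unhealthy" for every entry.
# Returns 1 if at least one check failed.
report_health() {
  local svc status=0
  for svc in "${!HEALTH_CHECKS[@]}"; do
    if check_health "${HEALTH_CHECKS[$svc]}"; then
      echo "$svc: healthy"
    else
      echo "$svc: unhealthy"
      status=1
    fi
  done
  return "$status"
}

# Usage: report_health
```

Bash does not guarantee the iteration order of associative arrays, so the services may be listed in any order.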

Troubleshooting

Enable Debug Logging

For more detailed output during startup:

DEBUG=true ./manage_sting.sh start

Check Service Logs

View logs for a specific service:

docker logs sting-ce-<service-name> --tail 50

Manual Service Start

If automatic recovery fails, start services manually:

# Start a specific service
docker start sting-ce-frontend

# Or recreate it
docker compose up -d --force-recreate frontend

Full System Reset

As a last resort, perform a full reset:

./manage_sting.sh stop
./manage_sting.sh cleanup
./install_sting.sh

Integration with manage_sting.sh

The enhanced resilience is automatically integrated into:

./install_sting.sh - Ensures all services start after installation
./manage_sting.sh update - Handles services that fail during updates
./manage_sting.sh start - Recovers any failed services

No additional configuration is needed.

For Developers

Adding New Services

When adding new services, update the dependency map in /lib/service_startup_resilience.sh:

# Add your service dependencies
SERVICE_DEPS[myservice]="postgres redis app"

# Add health check if applicable
HEALTH_CHECKS[myservice]="http://localhost:PORT/health"
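After editing the map, a quick sanity check can catch typos in dependency names. The sketch below is not part of STING: find_unknown_deps is a hypothetical helper, and the map is a shortened example that assumes every service, including those without dependencies, has its own SERVICE_DEPS entry.

```shell
#!/usr/bin/env bash
# Shortened example map; assumes every service has an entry, even if it is empty.
declare -A SERVICE_DEPS=(
  [postgres]=""
  [redis]=""
  [app]="postgres redis"
  [myservice]="postgres redis app"
)

# Print "<service> -> <dependency>" for each dependency that has no entry of its own.
find_unknown_deps() {
  local svc dep
  for svc in "${!SERVICE_DEPS[@]}"; do
    for dep in ${SERVICE_DEPS[$svc]}; do
      # ${array[key]+set} expands to "set" when the key exists, even if its value is empty
      [[ ${SERVICE_DEPS[$dep]+set} ]] || echo "$svc -> $dep"
    done
  done
}

# Usage: find_unknown_deps   (no output means every dependency is known)
```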

Custom Recovery Logic

You can add service-specific recovery logic by extending the start_service_with_retry function in the resilience script.
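One possible pattern is a per-service hook that the retry function calls before each new attempt. Everything in the sketch below is hypothetical: the recover_<service> naming convention, the run_recovery_hook helper, and the example hook body are illustrations, and the actual structure of start_service_with_retry may differ.

```shell
#!/usr/bin/env bash
# Hypothetical hook for one service; the body is a placeholder.
recover_chatbot() {
  echo "clearing stale chatbot state"
}

# Call recover_<service> if such a function is defined; otherwise do nothing.
# Hyphens in service names are mapped to underscores (kratos-migrate -> recover_kratos_migrate).
run_recovery_hook() {
  local service=$1
  local hook="recover_${service//-/_}"
  if declare -F "$hook" >/dev/null; then
    "$hook"
  fi
}

# Usage inside a retry loop (assumed): run_recovery_hook "$service"
```

Keeping the hooks in separate functions means services without special needs are unaffected.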

Support

If you continue experiencing startup issues:

Run the recovery tool and note any error messages
Check the STING documentation
Report issues with full error logs and recovery tool output

Last updated: October 20, 2025