STING Service Startup Resilience
Overview
STING includes enhanced service startup resilience to handle common issues during installation and updates, such as:
- Services failing to start due to dependency timing
- Containers stuck in “created” state
- Port conflicts preventing service startup
- Network timeouts during image pulls
- Services exiting unexpectedly
How It Works
Automatic Resilience
When you run ./install_sting.sh or ./manage_sting.sh start/update, STING automatically:
- Checks service dependencies - Services start in the correct order based on their dependencies
- Retries failed services - Up to 3 attempts by default, with a backoff delay between attempts (see Retry Settings)
- Performs health checks - Verifies services are actually responding, not just running
- Detects port conflicts - Identifies when required ports are already in use
- Handles container states - Manages containers stuck in “created” or “exited” states
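The retry step can be pictured with a small sketch. This is not STING's actual implementation: `retry_with_delay` is a hypothetical helper, and the linearly growing delay is an assumption about how the backoff behaves. Only the `MAX_RETRIES` and `RETRY_DELAY` names come from the Retry Settings section of this guide.

```bash
#!/usr/bin/env bash
# Illustrative only: a generic retry loop, not STING's actual code.
MAX_RETRIES=${MAX_RETRIES:-3}   # number of attempts
RETRY_DELAY=${RETRY_DELAY:-5}   # base delay in seconds

# Run "$@" until it succeeds or MAX_RETRIES attempts have been made.
retry_with_delay() {
    local attempt=1
    while [ "$attempt" -le "$MAX_RETRIES" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $attempt of $MAX_RETRIES failed: $*" >&2
        if [ "$attempt" -lt "$MAX_RETRIES" ]; then
            # Assumed linear backoff: wait longer after each failure.
            sleep $((RETRY_DELAY * attempt))
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, `retry_with_delay docker start sting-ce-frontend` would try the start command up to three times before giving up.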
Service Dependencies
Services are started in this order to respect dependencies:
- Infrastructure: postgres, redis
- Authentication: kratos, kratos-migrate
- Core services: mailpit, app
- Frontend: frontend
- AI services: chatbot, knowledge
- Gateway: nginx
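One way to derive such an order from a dependency map is a depth-first walk that emits each service only after its dependencies. The sketch below illustrates the idea. The edges are assumed for the example and cover only a subset of the services, and `resolve_order`, `VISITED`, and `START_ORDER` are hypothetical names. `SERVICE_DEPS` mirrors the map described under For Developers.

```bash
#!/usr/bin/env bash
# Illustrative only: dependency-first ordering, not STING's actual code.
# The dependency edges below are assumptions made for this example.
declare -A SERVICE_DEPS=(
    [postgres]=""
    [redis]=""
    [kratos]="postgres"
    [app]="postgres redis kratos"
    [frontend]="app"
    [nginx]="app frontend"
)
declare -A VISITED=()
START_ORDER=()

# Append a service to START_ORDER after all of its dependencies.
resolve_order() {
    local service="$1" dep
    if [ -n "${VISITED[$service]:-}" ]; then
        return 0
    fi
    VISITED[$service]=1
    for dep in ${SERVICE_DEPS[$service]:-}; do
        resolve_order "$dep"
    done
    START_ORDER+=("$service")
}

for service in "${!SERVICE_DEPS[@]}"; do
    resolve_order "$service"
done
echo "Start order: ${START_ORDER[*]}"
```

The exact order printed can vary between runs because bash does not guarantee key order for associative arrays, but every service always appears after the services it depends on. The visited check stops the recursion on a dependency cycle, though it does not report the cycle.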
Manual Recovery Tool
If services fail to start automatically, use the interactive recovery tool:
./scripts/recover_services.sh
This provides an interactive menu with options to:
- View detailed service status
- Check system resources and port conflicts
- Attempt automatic recovery
- View service logs
- Recreate specific services
- Perform full system restart
Common Issues and Solutions
Services Stuck in “Created” State
Symptom: Services show as created but not running after update/install.
Solution: The enhanced startup automatically detects and starts these services. If it fails:
# Manual recovery
./scripts/recover_services.sh
# Select option 3 for automatic recovery
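To inspect or fix this by hand, the standard Docker CLI can list containers by state. The following is a minimal sketch, not what the recovery tool runs internally. `start_created_containers` is a hypothetical helper, and it assumes all STING containers use the `sting-ce-` name prefix seen elsewhere in this guide.

```bash
#!/usr/bin/env bash
# Illustrative only: find and start containers left in the "created" state.
# Assumes STING containers share the sting-ce- name prefix.
start_created_containers() {
    local name
    docker ps -a --filter "status=created" --filter "name=sting-ce-" \
        --format '{{.Names}}' |
    while read -r name; do
        [ -n "$name" ] || continue
        echo "Starting $name"
        docker start "$name" >/dev/null || echo "Failed to start $name" >&2
    done
}
```

Calling `start_created_containers` prints one line per container it tries to start and reports any that still fail.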
Port Conflicts
Symptom: Services fail to start with “port already allocated” errors.
Solution:
- Check what’s using the port:
./scripts/recover_services.sh # Select option 2 to check system resources
- Stop conflicting services or change STING ports in .env files
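If you prefer to check ports directly, a short script can report which ones already have a listener. This is a sketch under assumptions: the port list is taken from the health check URLs in this guide and may not cover every port STING uses, `port_in_use` and `check_ports` are hypothetical helpers, and `lsof` must be installed.

```bash
#!/usr/bin/env bash
# Illustrative only: report which ports already have a listener.
# The port list is an assumption based on this guide's health check URLs.
STING_PORTS="5050 8443 5005"

# Succeeds if some process is listening on TCP port $1 (requires lsof).
port_in_use() {
    lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# Print each busy port; succeed only if none are in use.
check_ports() {
    local port conflicts=0
    for port in $STING_PORTS; do
        if port_in_use "$port"; then
            echo "Port $port is already in use"
            conflicts=$((conflicts + 1))
        fi
    done
    [ "$conflicts" -eq 0 ]
}
```

Run `check_ports` while STING is stopped, otherwise its own services will show up as conflicts. Without root privileges, `lsof` may not see listeners owned by other users.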
Network Timeouts
Symptom: Image pulls fail with timeout errors.
Solution:
- Check your Docker proxy settings
- Retry on a more stable network connection
- The enhanced startup will automatically retry failed pulls
Service Health Issues
Symptom: Service running but not responding to health checks.
Solution: The recovery tool can recreate unhealthy services:
./scripts/recover_services.sh
# Select option 5 to recreate specific service
Configuration
Retry Settings
You can modify retry behavior by editing /lib/service_startup_resilience.sh:
MAX_RETRIES=3 # Number of retry attempts
RETRY_DELAY=5 # Seconds between retries
Health Check URLs
Health check endpoints are defined in the resilience script:
HEALTH_CHECKS[app]="https://localhost:5050/health"
HEALTH_CHECKS[frontend]="https://localhost:8443"
HEALTH_CHECKS[chatbot]="http://localhost:5005/health"
# etc...
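A health check of this kind usually amounts to polling the URL until it returns a successful HTTP status. The sketch below shows that idea with `curl`. It is not the resilience script's own code: `wait_for_healthy` and its default attempt count and delay are hypothetical, and `-k` is used on the assumption that the local HTTPS endpoints present a self-signed certificate.

```bash
#!/usr/bin/env bash
# Illustrative only: poll a health endpoint until it answers or attempts run out.
# -k skips certificate verification, assuming a self-signed local certificate.
wait_for_healthy() {
    local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl fail on HTTP errors such as 500 or 503.
        if curl -k -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
            echo "healthy: $url"
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "unhealthy: $url" >&2
    return 1
}
```

For example, `wait_for_healthy "http://localhost:5005/health" 5 2` makes up to five requests, two seconds apart, before reporting the service as unhealthy.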
Troubleshooting
Enable Debug Logging
For more detailed output during startup:
DEBUG=true ./manage_sting.sh start
Check Service Logs
View logs for a specific service:
docker logs sting-ce-<service-name> --tail 50
Manual Service Start
If automatic recovery fails, start services manually:
# Start a specific service
docker start sting-ce-frontend
# Or recreate it
docker compose up -d --force-recreate frontend
Full System Reset
As a last resort, perform a full reset:
./manage_sting.sh stop
./manage_sting.sh cleanup
./install_sting.sh
Integration with manage_sting.sh
The enhanced resilience is automatically integrated into:
- ./install_sting.sh - Ensures all services start after installation
- ./manage_sting.sh update - Handles services that fail during updates
- ./manage_sting.sh start - Recovers any failed services
No additional configuration is needed; the resilience logic runs automatically.
For Developers
Adding New Services
When adding new services, update the dependency map in /lib/service_startup_resilience.sh:
# Add your service dependencies
SERVICE_DEPS[myservice]="postgres redis app"
# Add health check if applicable
HEALTH_CHECKS[myservice]="http://localhost:PORT/health"
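After editing the map, it can help to confirm that every dependency names a service that is itself defined, since a typo would otherwise only surface at startup. The check below is a sketch: `find_unknown_deps` is a hypothetical helper, and the map entries are example values, not STING's real dependency map.

```bash
#!/usr/bin/env bash
# Illustrative only: sanity-check a dependency map after adding a service.
# The entries below are example values, not STING's real map.
declare -A SERVICE_DEPS=(
    [postgres]=""
    [redis]=""
    [app]="postgres redis"
    [myservice]="postgres redis app"
)

# Print every dependency that is not itself a key of SERVICE_DEPS.
# Returns non-zero if any unknown dependency is found.
find_unknown_deps() {
    local service dep status=0
    for service in "${!SERVICE_DEPS[@]}"; do
        for dep in ${SERVICE_DEPS[$service]}; do
            if [ -z "${SERVICE_DEPS[$dep]+set}" ]; then
                echo "$service depends on unknown service: $dep"
                status=1
            fi
        done
    done
    return "$status"
}
```

With the example map above, `find_unknown_deps` prints nothing and succeeds. A misspelled dependency name would be reported along with the service that references it.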
Custom Recovery Logic
You can add service-specific recovery logic by extending the start_service_with_retry function in the resilience script.
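One possible pattern for this is an optional per-service hook that is called when a service needs recovery. The sketch below is an assumption about how such an extension could look, not a description of `start_service_with_retry` itself: the `recover_<service>` naming convention and `run_recovery_hook` are hypothetical.

```bash
#!/usr/bin/env bash
# Illustrative only: dispatch to an optional per-service recovery hook.
# Hook names (recover_<service>) are hypothetical, not part of STING.
recover_chatbot() {
    echo "chatbot: running custom recovery"
}

run_recovery_hook() {
    # Function names cannot contain hyphens, so kratos-migrate maps to kratos_migrate.
    local hook
    hook="recover_$(echo "$1" | tr '-' '_')"
    if command -v "$hook" >/dev/null 2>&1; then
        "$hook"
    else
        echo "$1: no custom recovery defined"
    fi
}
```

A call such as `run_recovery_hook chatbot` runs the hook if one is defined, while services without a hook fall through to the default message.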
Support
If you continue experiencing startup issues:
- Run the recovery tool and note any error messages
- Check the STING documentation
- Report issues with full error logs and recovery tool output