SDLC 06: Maintenance and Operations
Revision history: Updated June 2026 — 19-microservice architecture; systemd; Middleware.io APM; pino structured logging; journald; GitHub Actions health checks; incident response procedures.
1. Environment Management
| Branch | Environment | Domain | Purpose |
|---|---|---|---|
| production | Live | pakashop.store | End-user production platform |
| main | Staging | staging.pakashop.store | Final QA and integration testing |
2. systemd Operations
All 19 backend services are managed by systemd on their respective EC2 hosts.
# Detailed status of a single service
systemctl status pakashop-backend.service
# Live log tracking
journalctl -u pakashop-backend.service -f
# Check all services at once
./scripts/pakashop-status.sh
# Restart a service after config change
sudo systemctl restart pakashop-backend.service
# Reload systemd daemon after unit file changes
sudo systemctl daemon-reload
# Enable service to start on boot
sudo systemctl enable pakashop-backend.service
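A restart that returns successfully does not guarantee the service stays up afterwards. The following is a minimal sketch of a restart-then-verify helper, assuming the same unit naming as above. The function name `restart_and_verify` and the default of five one-second polls are hypothetical and not part of the existing Pakashop scripts.

```shell
# Hypothetical helper: restart a unit, then poll until systemd reports it active.
# $1 = service name without the .service suffix, $2 = number of polls (assumed default: 5).
restart_and_verify() {
  local unit attempts i
  unit="$1.service"
  attempts="${2:-5}"
  i=0
  sudo systemctl restart "$unit" || return 1
  while [ "$i" -lt "$attempts" ]; do
    # is-active --quiet prints nothing and signals the state through its exit status.
    if systemctl is-active --quiet "$unit"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "$unit did not become active after $attempts checks" >&2
  return 1
}
```

Usage might look like `restart_and_verify pakashop-backend 10`, followed by `journalctl -u pakashop-backend.service -n 50` if it returns non-zero.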
2.1 Unified Status Dashboard
The pakashop-status.sh script provides a color-coded overview:
#!/bin/bash
# scripts/pakashop-status.sh
echo "=== Pakashop Service Status ==="
echo ""
services=(
    "pakashop-gateway"
    "pakashop-backend"
    "pakashop-config"
    "pakashop-notifications"
    "pakashop-tracking"
    "pakashop-moderation"
    "pakashop-recommendations"
    "pakashop-scheduler"
    "pakashop-search"
    "pakashop-analytics"
    "pakashop-fraud"
    "pakashop-coupon"
    "pakashop-loyalty"
    "pakashop-whatsapp"
    "pakashop-reports"
    "pakashop-reconciliation"
    "pakashop-invoicing"
    "pakashop-pricing"
    "pakashop-settlement"
)
# Report each unit's state; anything other than "active" is shown as a failure
for service in "${services[@]}"; do
    status=$(systemctl is-active "${service}.service" 2>/dev/null)
    if [ "$status" = "active" ]; then
        echo -e "\e[32m[OK]\e[0m $service"
    else
        echo -e "\e[31m[FAIL]\e[0m $service ($status)"
    fi
done
echo ""
echo "=== Nginx Status ==="
# --quiet suppresses the state text so only the [OK]/[FAIL] line is printed
systemctl is-active --quiet nginx.service && echo -e "\e[32m[OK]\e[0m nginx" || echo -e "\e[31m[FAIL]\e[0m nginx"
echo ""
echo "=== Redis Status ==="
redis-cli ping | grep -q PONG && echo -e "\e[32m[OK]\e[0m redis" || echo -e "\e[31m[FAIL]\e[0m redis"
echo ""
echo "=== PostgreSQL Status ==="
systemctl is-active --quiet postgresql.service && echo -e "\e[32m[OK]\e[0m postgresql" || echo -e "\e[31m[FAIL]\e[0m postgresql"
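As written, the script ends on an `echo`, so it appears to exit with status 0 whether or not a service is down, which makes it awkward to call from cron or CI. A sketch of a variant that signals failures through its exit status is shown below. The function name `check_units` is hypothetical and is not part of pakashop-status.sh.

```shell
# Hypothetical helper: print one line per unit and return non-zero if any unit is not active.
# Arguments are service names without the .service suffix.
check_units() {
  local failed unit
  failed=0
  for unit in "$@"; do
    if systemctl is-active --quiet "${unit}.service"; then
      printf '[OK]   %s\n' "$unit"
    else
      printf '[FAIL] %s\n' "$unit"
      failed=$((failed + 1))
    fi
  done
  # Succeeds only when every unit was active.
  [ "$failed" -eq 0 ]
}
```

For example, `check_units pakashop-gateway pakashop-backend || echo "at least one service is down"` could gate a deployment step.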
3. Monitoring and Observability
3.1 Middleware.io APM
All services ship traces, logs, and metrics to Middleware.io via OpenTelemetry:
- Traces: Custom spans for critical flows (checkout.complete, payment.process, zra.validate, fraud.evaluate).
- Logs: Structured pino logs with correlation IDs.
- Metrics: Request rates, error rates, latency percentiles, cache hit/miss ratios.
Access dashboards at app.middleware.io.
3.2 Structured Logging with pino
All services use pino for structured JSON logging:
{
  "level": 30,
  "time": 1716374400000,
  "pid": 1234,
  "hostname": "pakashop-prod-01",
  "service": "pakashop-backend",
  "traceId": "abc123-def456",
  "spanId": "span789",
  "msg": "Payment initiation started",
  "orderId": "ord_123",
  "gateway": "PAWAPAY",
  "phone": "+26097*****56"
}
Logs are shipped to:
- Middleware.io via OpenTelemetry
- journald for local persistence
- CloudWatch Logs (via journald-cloudwatch-logs agent)
3.3 Log Investigation
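# Find all log entries for a specific order
journalctl -u pakashop-backend.service --since "1 hour ago" | grep '"orderId":"ord_123"'
# Find entries recorded at journald priority err or higher in the last 10 minutes.
# Note: journald priority is assigned by systemd, not read from pino's JSON "level" field (pino error = 50),
# so this only matches if the service's output carries a syslog priority.
journalctl -u pakashop-backend.service --since "10 minutes ago" --priority=err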
# Follow logs for multiple services
journalctl -u pakashop-backend.service -u pakashop-gateway.service -f
# Search by trace ID
journalctl -u pakashop-backend.service | grep '"traceId":"abc123-def456"'
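Because pino records severity in the JSON `level` field (50 is error and 60 is fatal in pino's default levels), filtering on that field can be more reliable than journald priority. A minimal sketch follows, assuming pino's default compact single-line output with no space after the colon. The function name `pino_errors` is hypothetical.

```shell
# Hypothetical filter: read log lines on stdin and keep those whose pino level is 50 or higher.
# Lines without a "level":<number> field are dropped.
pino_errors() {
  awk '
    match($0, /"level":[0-9]+/) {
      # The matched text is "level": followed by digits; skip the 8-character prefix.
      level = substr($0, RSTART + 8, RLENGTH - 8) + 0
      if (level >= 50) print
    }
  '
}
```

It could be combined with the commands above as `journalctl -u pakashop-backend.service --since "10 minutes ago" -o cat | pino_errors`.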
3.4 Health Check Endpoints
Every service exposes health endpoints:
| Endpoint | Purpose | Response |
|---|---|---|
| GET /health | Liveness probe | {"status":"ok","service":"pakashop-backend"} |
| GET /health/ready | Readiness probe | {"status":"ready","dependencies":{"postgres":"ok","redis":"ok"}} |
| GET /health/metrics | Prometheus metrics | Raw metrics output |
The GitHub Actions health-check.yml workflow polls these every 15 minutes.
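The contents of health-check.yml are not shown here, but the readiness check it performs could be sketched roughly as follows. This assumes the compact response format from the table above. The function names `is_ready` and `check_service_ready`, and the base URL passed in, are hypothetical.

```shell
# Hypothetical check: succeed only if the body reports "ready" and every dependency is "ok".
is_ready() {
  local body deps
  body=$1
  case $body in
    *'"status":"ready"'*) ;;
    *) return 1 ;;
  esac
  # Pull out the contents of the flat dependencies object, e.g. "postgres":"ok","redis":"ok".
  deps=$(printf '%s' "$body" | sed -n 's/.*"dependencies":{\([^}]*\)}.*/\1/p')
  [ -n "$deps" ] || return 1
  # Fail if any dependency entry has a value other than "ok".
  if printf '%s' "$deps" | tr ',' '\n' | grep -qv ':"ok"$'; then
    return 1
  fi
  return 0
}

# Fetch /health/ready from a base URL (assumed argument) and evaluate the response.
check_service_ready() {
  local body
  body=$(curl -fsS --max-time 5 "$1/health/ready") || return 1
  is_ready "$body"
}
```

String matching keeps the sketch dependency-free; a JSON-aware tool would be more robust if the response format ever gains whitespace or nesting.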
4. Incident Response
4.1 P1 — Critical (Live Platform Down)
- Identify: Check production branch status and EC2 health via CloudWatch/Middleware.io.
    ./scripts/pakashop-status.sh
    journalctl -u pakashop-gateway.service --since "5 minutes ago"
- Assess: Determine whether the issue is infrastructure-related (EC2, RDS, Redis) or code-related.
- Rollback: If caused by a merge, revert the offending commit on the production branch and trigger a redeploy. If HEAD is itself a merge commit, git revert also needs -m 1 to select the mainline parent.
    git revert HEAD
    git push origin production
- Mitigate: If it is an infrastructure issue, restart services or fail over to standby.
- Notify: Alert stakeholders within 15 minutes via Slack #incidents.
- Post-mortem: Document root cause and remediation within 24 hours.
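The rollback step behaves differently depending on whether the commit being reverted is a merge commit, since `git revert` refuses to revert a merge unless `-m` names the mainline parent. A sketch of a helper that picks the flag automatically is shown below. The function names `revert_flags` and `rollback_head` are hypothetical, and treating parent 1 as the mainline is an assumption about how merges into production are made.

```shell
# Hypothetical helper: given one line of `git rev-list --parents -n 1 <commit>`
# ("<commit> <parent1> [<parent2> ...]"), print "-m 1" when the commit has two or more parents.
revert_flags() {
  # Intentionally unquoted so the line is split into positional parameters.
  set -- $1
  if [ "$#" -ge 3 ]; then
    echo "-m 1"
  fi
}

# Revert HEAD on the current branch, adding -m 1 only for merge commits.
rollback_head() {
  local line
  line=$(git rev-list --parents -n 1 HEAD) || return 1
  # Unquoted substitution so "-m 1" expands to two arguments (or to nothing).
  git revert --no-edit $(revert_flags "$line") HEAD
}
```

After `rollback_head` succeeds on a checkout of production, `git push origin production` would trigger the redeploy as described in the Rollback step.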
4.2 P2 — High (Staging Down or Degraded Performance)
- Identify: Check main branch status and Middleware.io dashboards.
- Resolve: Fix the offending code on main, or revert if it is blocking other teams.
- Verify: Run E2E tests on staging before declaring the incident resolved.
4.3 P3 — Medium (Non-Critical Service Down)
- Identify: Check service-specific logs and metrics.
- Restart: sudo systemctl restart pakashop-<service>.service
- Investigate: Review recent deployments or configuration changes.
4.4 P4 — Low (Minor Issue or Question)
- Document in issue tracker.
- Address during next sprint.
5. Scheduled Maintenance
5.1 Weekly Tasks
- Monday: Run npm audit on the main branch.
- Wednesday: Review Middleware.io alert thresholds and adjust if needed.
- Friday: Review disk usage on EC2 instances; clean up old logs.
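The Friday disk review can be partly scripted. The sketch below flags filesystems at or above a usage threshold by reading `df -P` output. The function name `disk_over` and the 80% threshold in the usage example are assumptions, since no disk threshold is defined in this document.

```shell
# Hypothetical helper: read `df -P` output on stdin and print "<mount point> <use%>"
# for every filesystem whose usage is at or above the threshold given as $1.
disk_over() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # strip the percent sign from the Capacity column
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}
```

On a host this might be run as `df -P | disk_over 80`. Mount points that contain spaces are not handled by this sketch.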
5.2 Monthly Tasks
- Verify database backups for both Staging and Production stacks.
- Review and rotate API keys (PawaPay, Flutterwave, Cloudinary).
- Check SSL certificate expiry (Cloudflare auto-renews, but verify).
- Review and prune old Docker images and build artifacts.
5.3 Quarterly Tasks
- Security audit: review RBAC assignments, remove stale accounts.
- Performance review: analyse k6 performance test trends.
- Dependency update: major version upgrades for Node.js, Go, Python.
- Disaster recovery drill: restore from backup to test environment.
6. Alert Response Runbook
| Alert | Impact | Action |
|---|---|---|
| payment_error_rate > 5% | Customers cannot pay | Check payment.process logs. Verify PawaPay/Flutterwave API status. Check fraud service for false positives. |
| api_p95_latency > 500ms | Slow user experience | Check db_query_duration and high-latency spans. Review Redis cache hit rates. |
| unhandled_exception | Potential platform crash | Use traceId to identify the failing request and environment. |
| redis_memory > 80% | Cache eviction risk | Review cache TTLs. Flush stale keys. Consider scaling Redis instance. |
| postgres_connections > 80% | Database connection pool exhaustion | Check for connection leaks. Restart services if necessary. Enable PgBouncer. |
| fraud_queue_backlog > 100 | Fraud reviews delayed | Scale fraud service workers. Check rule thresholds. |
| zra_transmission_failure > 5 | Tax compliance risk | Check ZRA API status. Verify API key validity. Review mock mode fallback. |
| meilisearch_index_lag > 60s | Search results stale | Check Meilisearch task queue. Restart Meilisearch if necessary. |
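When the redis_memory alert fires, it can help to confirm the figure directly on the host. The sketch below derives a usage percentage from the `used_memory` and `maxmemory` fields of `redis-cli info memory`. The function name `redis_memory_pct` is hypothetical, and how Middleware.io computes the alerting metric is not specified here, so the two numbers may differ.

```shell
# Hypothetical helper: read `redis-cli info memory` output on stdin and print
# used_memory as a whole-number percentage of maxmemory, or "unknown" if no limit is set.
redis_memory_pct() {
  # INFO lines end in CRLF, so strip carriage returns before parsing "key:value" pairs.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%d\n", used * 100 / max
      else print "unknown"   # maxmemory 0 means no limit is configured
    }
  '
}
```

A quick check could then be `redis-cli info memory | redis_memory_pct`, compared against the 80% threshold in the table.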
For internal use only. Do not distribute outside Pakashop engineering.