
SDLC 06: Maintenance and Operations

Revision history: Updated June 2026 — 19-microservice architecture; systemd; Middleware.io APM; pino structured logging; journald; GitHub Actions health checks; incident response procedures.


1. Environment Management

| Branch | Environment | Domain | Purpose |
|---|---|---|---|
| production | Live | pakashop.store | End-user production platform |
| main | Staging | staging.pakashop.store | Final QA and integration testing |

2. systemd Operations

All 19 backend services are managed by systemd on their respective EC2 hosts.

# Detailed status of a single service
systemctl status pakashop-backend.service

# Live log tracking
journalctl -u pakashop-backend.service -f

# Check all services at once
./scripts/pakashop-status.sh

# Restart a service after config change
sudo systemctl restart pakashop-backend.service

# Reload systemd daemon after unit file changes
sudo systemctl daemon-reload

# Enable service to start on boot
sudo systemctl enable pakashop-backend.service

2.1 Unified Status Dashboard

The pakashop-status.sh script provides a color-coded overview:

#!/bin/bash
# scripts/pakashop-status.sh

echo "=== Pakashop Service Status ==="
echo ""

services=(
"pakashop-gateway"
"pakashop-backend"
"pakashop-config"
"pakashop-notifications"
"pakashop-tracking"
"pakashop-moderation"
"pakashop-recommendations"
"pakashop-scheduler"
"pakashop-search"
"pakashop-analytics"
"pakashop-fraud"
"pakashop-coupon"
"pakashop-loyalty"
"pakashop-whatsapp"
"pakashop-reports"
"pakashop-reconciliation"
"pakashop-invoicing"
"pakashop-pricing"
"pakashop-settlement"
)

# Report each unit as OK or FAIL based on `systemctl is-active`
for service in "${services[@]}"; do
  status=$(systemctl is-active "${service}.service" 2>/dev/null)
  if [ "$status" = "active" ]; then
    echo -e "\e[32m[OK]\e[0m $service"
  else
    echo -e "\e[31m[FAIL]\e[0m $service ($status)"
  fi
done

echo ""
echo "=== Nginx Status ==="
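# --quiet suppresses the raw "active"/"inactive" line so only the colored result is printed
systemctl is-active --quiet nginx.service && echo -e "\e[32m[OK]\e[0m nginx" || echo -e "\e[31m[FAIL]\e[0m nginx"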

echo ""
echo "=== Redis Status ==="
redis-cli ping | grep -q PONG && echo -e "\e[32m[OK]\e[0m redis" || echo -e "\e[31m[FAIL]\e[0m redis"

echo ""
echo "=== PostgreSQL Status ==="
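# --quiet suppresses the raw "active"/"inactive" line so only the colored result is printed
systemctl is-active --quiet postgresql.service && echo -e "\e[32m[OK]\e[0m postgresql" || echo -e "\e[31m[FAIL]\e[0m postgresql"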
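
The script above prints a visual overview, but its exit code does not reflect failures, which limits its use from cron or CI. The following is a minimal sketch of an exit-code-aware variant. The function names and the CHECK_CMD override hook are hypothetical and not part of the original script; the hook exists only so the logic can be exercised without systemd.

```bash
#!/bin/bash
# Sketch only: check_unit, check_services and CHECK_CMD are illustrative names.

# Runs the unit check. CHECK_CMD (hypothetical) may name a replacement
# command or function; by default `systemctl is-active --quiet` is used.
check_unit() {
  if [ -n "${CHECK_CMD:-}" ]; then
    "$CHECK_CMD" "$1"
  else
    systemctl is-active --quiet "$1"
  fi
}

# Usage: check_services NAME...
# Prints one "FAIL <name>" line per inactive unit and returns 1 if any
# unit is not active, 0 otherwise.
check_services() {
  local failed=0 service
  for service in "$@"; do
    if ! check_unit "${service}.service"; then
      echo "FAIL ${service}"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, check_services pakashop-gateway pakashop-backend || exit 1 would let a scheduled job fail loudly when either unit is down.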

3. Monitoring and Observability

3.1 Middleware.io APM

All services ship traces, logs, and metrics to Middleware.io via OpenTelemetry:

  • Traces: Custom spans for critical flows (checkout.complete, payment.process, zra.validate, fraud.evaluate).
  • Logs: Structured pino logs with correlation IDs.
  • Metrics: Request rates, error rates, latency percentiles, cache hit/miss ratios.

Access dashboards at app.middleware.io.

3.2 Structured Logging with pino

All services use pino for structured JSON logging:

{
"level": 30,
"time": 1716374400000,
"pid": 1234,
"hostname": "pakashop-prod-01",
"service": "pakashop-backend",
"traceId": "abc123-def456",
"spanId": "span789",
"msg": "Payment initiation started",
"orderId": "ord_123",
"gateway": "PAWAPAY",
"phone": "+26097*****56"
}

Logs are shipped to:

  1. Middleware.io via OpenTelemetry
  2. journald for local persistence
  3. CloudWatch Logs (via journald-cloudwatch-logs agent)

3.3 Log Investigation

# Find all log entries for a specific order (last hour)
journalctl -u pakashop-backend.service --since "1 hour ago" | grep '"orderId":"ord_123"'

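# Find entries journald recorded at priority err or higher in the last 10 minutes
# (the "level" field inside pino's JSON may not be reflected in the journald
# priority unless the unit or logger is configured to map it)
journalctl -u pakashop-backend.service --since "10 minutes ago" --priority=err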

# Follow logs for multiple services
journalctl -u pakashop-backend.service -u pakashop-gateway.service -f

# Search by trace ID
journalctl -u pakashop-backend.service | grep '"traceId":"abc123-def456"'
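
Because the pino records carry their own numeric level field, it can help to filter on that field directly. The sketch below assumes the services use pino's default level numbers (50 for error, 60 for fatal) and emit one JSON object per line; pino_errors is a hypothetical helper name.

```bash
# Sketch only: pino_errors is an illustrative helper, not an existing script.
# pino's default numeric levels: 10 trace, 20 debug, 30 info, 40 warn,
# 50 error, 60 fatal.

# Reads log lines on stdin and keeps only those whose "level" is 50 or 60.
pino_errors() {
  grep -E '"level":[[:space:]]*(50|60)[,}]'
}
```

A possible invocation is journalctl -u pakashop-backend.service --since "10 minutes ago" -o cat | pino_errors, where -o cat strips the journald prefix so each line is the raw JSON record.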

3.4 Health Check Endpoints

Every service exposes health endpoints:

| Endpoint | Purpose | Response |
|---|---|---|
| GET /health | Liveness probe | {"status":"ok","service":"pakashop-backend"} |
| GET /health/ready | Readiness probe | {"status":"ready","dependencies":{"postgres":"ok","redis":"ok"}} |
| GET /health/metrics | Prometheus metrics | Raw metrics output |

The GitHub Actions health-check.yml workflow polls these every 15 minutes.
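
For a manual check from a shell, the readiness probe can be sketched as follows. This is an illustration under assumptions, not the contents of health-check.yml: the function names are hypothetical, and the base URL (host and port) must be supplied by the caller.

```bash
# Sketch only: is_ready_body and probe_ready are illustrative names.

# Returns 0 when a /health/ready response body reports "status":"ready".
is_ready_body() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ready"'
}

# Usage: probe_ready BASE_URL
# Fetches BASE_URL/health/ready; fails on HTTP errors, timeouts (5 s),
# or a body that does not report ready.
probe_ready() {
  local body
  body=$(curl --silent --fail --max-time 5 "$1/health/ready") || return 1
  is_ready_body "$body"
}
```

Calling probe_ready https://staging.pakashop.store || echo "not ready" would exercise the staging domain listed in section 1, assuming the endpoint is reachable at that path through the gateway.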


4. Incident Response

4.1 P1 — Critical (Live Platform Down)

  1. Identify: Check production branch status and EC2 health via CloudWatch/Middleware.io.
    ./scripts/pakashop-status.sh
    journalctl -u pakashop-gateway.service --since "5 minutes ago"
  2. Assess: Determine if issue is infrastructure (EC2, RDS, Redis) or code-related.
  3. Rollback: If caused by a merge, revert the offending commit on the production branch and trigger a redeploy.
    git revert HEAD    # use "git revert -m 1 HEAD" if HEAD is a merge commit
    git push origin production
  4. Mitigate: If it is an infrastructure issue, restart the affected services or fail over to the standby.
  5. Notify: Alert stakeholders within 15 minutes via Slack #incidents.
  6. Post-mortem: Document root cause and remediation within 24 hours.
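
For the rollback step, plain git revert refuses to revert a merge commit unless a mainline parent is given. The sketch below picks the right form by checking whether HEAD has a second parent. revert_command is a hypothetical helper; it only prints the command so the operator can review it before running it.

```bash
# Sketch only: revert_command is an illustrative helper.

# Prints the revert command appropriate for the current HEAD.
# A merge commit has a second parent (HEAD^2); reverting it needs -m 1,
# which treats the first parent (the branch merged into) as the mainline.
revert_command() {
  if git rev-parse --verify --quiet 'HEAD^2' >/dev/null; then
    echo "git revert -m 1 HEAD"
  else
    echo "git revert HEAD"
  fi
}
```

Run it from a checkout of the production branch, then execute the printed command followed by git push origin production.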

4.2 P2 — High (Staging Down or Degraded Performance)

  1. Identify: Check main branch status and Middleware.io dashboards.
  2. Resolve: Fix the offending code on main or revert if blocking other teams.
  3. Verify: Run E2E tests on staging before declaring resolved.

4.3 P3 — Medium (Non-Critical Service Down)

  1. Identify: Check service-specific logs and metrics.
  2. Restart: sudo systemctl restart pakashop-<service>.service
  3. Investigate: Review recent deployments or configuration changes.

4.4 P4 — Low (Minor Issue or Question)

  1. Document in issue tracker.
  2. Address during next sprint.

5. Scheduled Maintenance

5.1 Weekly Tasks

  • Monday: Check npm audit on main branch.
  • Wednesday: Review Middleware.io alert thresholds and adjust if needed.
  • Friday: Review disk usage on EC2 instances; clean up old logs.
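
The Friday disk review can be partly scripted. The sketch below flags mount points at or above a usage threshold from df -P output; the function name and the 80 percent default are assumptions, not values from this document.

```bash
# Sketch only: disk_over_threshold is an illustrative helper.

# Reads `df -P` output on stdin and prints "<mount> <use>%" for every
# filesystem whose usage is at or above THRESHOLD percent (default 80).
# Mount points containing spaces are not handled.
disk_over_threshold() {
  local threshold="${1:-80}"
  awk -v limit="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6 " " use "%"
    }'
}
```

Typical use would be df -P | disk_over_threshold 80. For the log cleanup itself, sudo journalctl --vacuum-time=14d removes archived journal files older than the given age; the 14-day retention shown here is only an example.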

5.2 Monthly Tasks

  • Verify database backups for both Staging and Production stacks.
  • Review and rotate API keys (PawaPay, Flutterwave, Cloudinary).
  • Check SSL certificate expiry (Cloudflare auto-renews, but verify).
  • Review and prune old Docker images and build artifacts.
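
The certificate expiry check can be done with openssl. This is a minimal sketch with hypothetical function names and an assumed 30-day default window. Since Cloudflare fronts the domain, it inspects the certificate presented at the edge, not any origin certificate.

```bash
# Sketch only: cert_valid_for_days and fetch_cert are illustrative names.

# Reads a PEM certificate on stdin. Returns 0 if it stays valid for at
# least DAYS more days (default 30), non-zero if it expires sooner.
cert_valid_for_days() {
  local days="${1:-30}"
  openssl x509 -noout -checkend "$((days * 86400))" >/dev/null
}

# Prints the TLS handshake output, including the PEM certificate, that
# HOST presents on port 443.
fetch_cert() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null
}
```

For example, fetch_cert pakashop.store | cert_valid_for_days 30 || echo "certificate expires within 30 days".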

5.3 Quarterly Tasks

  • Security audit: review RBAC assignments, remove stale accounts.
  • Performance review: analyse k6 performance test trends.
  • Dependency update: major version upgrades for Node.js, Go, Python.
  • Disaster recovery drill: restore from backup to test environment.

6. Alert Response Runbook

| Alert | Impact | Action |
|---|---|---|
| payment_error_rate > 5% | Customers cannot pay | Check payment.process logs. Verify PawaPay/Flutterwave API status. Check fraud service for false positives. |
| api_p95_latency > 500ms | Slow user experience | Check db_query_duration and high-latency spans. Review Redis cache hit rates. |
| unhandled_exception | Potential platform crash | Use traceId to identify the failing request and environment. |
| redis_memory > 80% | Cache eviction risk | Review cache TTLs. Flush stale keys. Consider scaling Redis instance. |
| postgres_connections > 80% | Database connection pool exhaustion | Check for connection leaks. Restart services if necessary. Enable PgBouncer. |
| fraud_queue_backlog > 100 | Fraud reviews delayed | Scale fraud service workers. Check rule thresholds. |
| zra_transmission_failure > 5 | Tax compliance risk | Check ZRA API status. Verify API key validity. Review mock mode fallback. |
| meilisearch_index_lag > 60s | Search results stale | Check Meilisearch task queue. Restart Meilisearch if necessary. |
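
When responding to the redis_memory alert, the current ratio can be confirmed by hand. The sketch below derives it from the used_memory and maxmemory fields of redis-cli info memory; redis_memory_percent is a hypothetical helper, and how the alert itself computes the figure is not specified here.

```bash
# Sketch only: redis_memory_percent is an illustrative helper.

# Reads `redis-cli info memory` output on stdin and prints used_memory as
# an integer percentage of maxmemory, or "unlimited" when maxmemory is 0
# (no limit configured). INFO lines end in CRLF, so \r is stripped first.
redis_memory_percent() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", used * 100 / max
    }'
}
```

Typical use would be redis-cli info memory | redis_memory_percent.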

For internal use only. Do not distribute outside Pakashop engineering.