SDLC 06: Maintenance and Operations

Revision history: Updated June 2026 — 19-microservice architecture; systemd; Middleware.io APM; pino structured logging; journald; GitHub Actions health checks; incident response procedures.

1. Environment Management

Branch	Environment	Domain	Purpose
`production`	Live	`pakashop.store`	End-user production platform
`main`	Staging	`staging.pakashop.store`	Final QA and integration testing
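
Deployment and health-check scripts often need to resolve the target domain from a branch name. A minimal sketch of that lookup, using only the mapping in the table above; the function name `domain_for_branch` is hypothetical and not part of the existing scripts.

```shell
# Hypothetical helper: map a deployment branch to its domain,
# following the environment table above.
domain_for_branch() {
  case "$1" in
    production) echo "pakashop.store" ;;
    main)       echo "staging.pakashop.store" ;;
    *)          echo "unknown branch: $1" >&2; return 1 ;;
  esac
}
```

Calling `domain_for_branch main` prints the staging domain; any other branch name returns a non-zero status so a calling script can stop early.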

2. systemd Operations

All 19 backend services are managed by systemd on their respective EC2 hosts.

# Detailed status of a single service
systemctl status pakashop-backend.service

# Live log tracking
journalctl -u pakashop-backend.service -f

# Check all services at once
./scripts/pakashop-status.sh

# Restart a service after config change
sudo systemctl restart pakashop-backend.service

# Reload systemd daemon after unit file changes
sudo systemctl daemon-reload

# Enable service to start on boot
sudo systemctl enable pakashop-backend.service
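
The reload and restart commands above are usually run together after a unit file edit. The sketch below chains them and then polls `systemctl is-active` until the unit reports active. The function name `apply_unit_change` and the default of 10 attempts are assumptions for illustration.

```shell
# Hypothetical helper: apply a unit-file change and confirm the service came back.
# Usage: apply_unit_change <unit> [max_attempts]
apply_unit_change() {
  local unit="$1" attempts="${2:-10}" i=0
  sudo systemctl daemon-reload || return 1
  sudo systemctl restart "$unit" || return 1
  # Poll until systemd reports the unit active, one second apart.
  while [ "$i" -lt "$attempts" ]; do
    if systemctl is-active --quiet "$unit"; then
      echo "$unit is active"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "$unit did not become active after $attempts checks" >&2
  return 1
}
```

Example: `apply_unit_change pakashop-backend.service`. An active state only means systemd started the process, so the service's own health endpoint is still worth checking afterwards.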

2.1 Unified Status Dashboard

The pakashop-status.sh script provides a color-coded overview:

#!/bin/bash
# scripts/pakashop-status.sh

echo "=== Pakashop Service Status ==="
echo ""

services=(
  "pakashop-gateway"
  "pakashop-backend"
  "pakashop-config"
  "pakashop-notifications"
  "pakashop-tracking"
  "pakashop-moderation"
  "pakashop-recommendations"
  "pakashop-scheduler"
  "pakashop-search"
  "pakashop-analytics"
  "pakashop-fraud"
  "pakashop-coupon"
  "pakashop-loyalty"
  "pakashop-whatsapp"
  "pakashop-reports"
  "pakashop-reconciliation"
  "pakashop-invoicing"
  "pakashop-pricing"
  "pakashop-settlement"
)

for service in "${services[@]}"; do
  status=$(systemctl is-active "$service.service" 2>/dev/null)
  if [ "$status" = "active" ]; then
    echo -e "\e[32m[OK]\e[0m $service"
  else
    echo -e "\e[31m[FAIL]\e[0m $service ($status)"
  fi
done

echo ""
echo "=== Nginx Status ==="
systemctl is-active --quiet nginx.service && echo -e "\e[32m[OK]\e[0m nginx" || echo -e "\e[31m[FAIL]\e[0m nginx"

echo ""
echo "=== Redis Status ==="
redis-cli ping | grep -q PONG && echo -e "\e[32m[OK]\e[0m redis" || echo -e "\e[31m[FAIL]\e[0m redis"

echo ""
echo "=== PostgreSQL Status ==="
systemctl is-active --quiet postgresql.service && echo -e "\e[32m[OK]\e[0m postgresql" || echo -e "\e[31m[FAIL]\e[0m postgresql"
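
When the status script runs from cron or CI rather than a terminal, an exit status is more useful than colors. The sketch below shows one way to add that; `report_inactive` is a hypothetical name and is not part of the script above.

```shell
# Hypothetical helper: print each unit that is not active and
# return non-zero if any were found.
# Usage: report_inactive pakashop-gateway pakashop-backend ...
report_inactive() {
  local failed=0 svc
  for svc in "$@"; do
    if ! systemctl is-active --quiet "$svc.service"; then
      echo "[FAIL] $svc" >&2
      failed=$((failed + 1))
    fi
  done
  echo "$failed of $# services inactive"
  # The function's exit status is the result of this final test.
  [ "$failed" -eq 0 ]
}
```

Inside pakashop-status.sh this could be called as `report_inactive "${services[@]}" || exit 1`, reusing the existing array.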

3. Monitoring and Observability

3.1 Middleware.io APM

All services ship traces, logs, and metrics to Middleware.io via OpenTelemetry:

Traces: Custom spans for critical flows (checkout.complete, payment.process, zra.validate, fraud.evaluate).
Logs: Structured pino logs with correlation IDs.
Metrics: Request rates, error rates, latency percentiles, cache hit/miss ratios.

Access dashboards at app.middleware.io.

3.2 Structured Logging with pino

All services use pino for structured JSON logging:

{
  "level": 30,
  "time": 1716374400000,
  "pid": 1234,
  "hostname": "pakashop-prod-01",
  "service": "pakashop-backend",
  "traceId": "abc123-def456",
  "spanId": "span789",
  "msg": "Payment initiation started",
  "orderId": "ord_123",
  "gateway": "PAWAPAY",
  "phone": "+26097*****56"
}

Logs are shipped to:

Middleware.io via OpenTelemetry
journald for local persistence
CloudWatch Logs (via journald-cloudwatch-logs agent)
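
The `"level": 30` field in the sample entry is pino's numeric log level. Assuming the services keep pino's default levels (custom levels would need extra cases), a small lookup makes raw journal output easier to read. The function name `pino_level_name` is hypothetical.

```shell
# Map pino's default numeric levels to their names.
pino_level_name() {
  case "$1" in
    10) echo trace ;;
    20) echo debug ;;
    30) echo info ;;
    40) echo warn ;;
    50) echo error ;;
    60) echo fatal ;;
    *)  echo "custom($1)" ;;
  esac
}
```

For the sample entry, `pino_level_name 30` prints `info`.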

3.3 Log Investigation

# Find all log entries for a specific order in the last hour
journalctl -u pakashop-backend.service --since "1 hour ago" | grep '"orderId":"ord_123"'

# Find all errors in the last 10 minutes
# (filters on journald priority, which may differ from the pino "level" field)
journalctl -u pakashop-backend.service --since "10 minutes ago" --priority=err

# Follow logs for multiple services
journalctl -u pakashop-backend.service -u pakashop-gateway.service -f

# Search by trace ID
journalctl -u pakashop-backend.service | grep '"traceId":"abc123-def456"'
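
journald assigns priority based on how systemd captured the output stream, so when a service writes pino JSON to stdout the `--priority` filter may not match pino's own `level` field. A minimal sketch of filtering on the JSON field instead, assuming compact JSON with a two-digit `level` as in the sample entry; `pino_errors_only` is a hypothetical name.

```shell
# Hypothetical filter: keep only JSON log lines whose pino level is
# 50 (error) or higher. Assumes compact JSON with a two-digit "level".
pino_errors_only() {
  grep -E '"level":[5-9][0-9][,}]'
}
```

Example: `journalctl -u pakashop-backend.service -o cat --since "10 minutes ago" | pino_errors_only`. The `-o cat` option prints only the message, which here is the JSON line itself.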

3.4 Health Check Endpoints

Every service exposes health endpoints:

Endpoint	Purpose	Response
`GET /health`	Liveness probe	`{"status":"ok","service":"pakashop-backend"}`
`GET /health/ready`	Readiness probe	`{"status":"ready","dependencies":{"postgres":"ok","redis":"ok"}}`
`GET /health/metrics`	Prometheus metrics	Raw metrics output

The GitHub Actions health-check.yml workflow polls these every 15 minutes.
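
The contents of health-check.yml are not shown here. As an illustration of what a poller could do with the readiness response, the sketch below accepts a body only when the status is `ready` and every dependency reports `ok`. It assumes the compact JSON shape and field order from the table above, and `readiness_ok` is a hypothetical name. A JSON parser such as jq would be more robust if it is available on the runner.

```shell
# Hypothetical check: succeed only if a /health/ready body reports
# "status":"ready" and every listed dependency is "ok".
readiness_ok() {
  local body="$1" deps rest
  # Require the overall status followed by a dependencies object.
  case "$body" in
    *'"status":"ready"'*'"dependencies":{'*) ;;
    *) return 1 ;;
  esac
  # Extract the inside of the dependencies object.
  deps=$(printf '%s\n' "$body" | sed -n 's/.*"dependencies":{\([^}]*\)}.*/\1/p')
  # Drop every pair whose state is "ok"; anything still quoted is unhealthy.
  rest=$(printf '%s\n' "$deps" | sed 's/"[^"]*":"ok"//g')
  case "$rest" in
    *'"'*) return 1 ;;
  esac
  return 0
}
```

Example: `readiness_ok "$(curl -fsS "http://localhost:$PORT/health/ready")" || echo "not ready"`, where `$PORT` stands for the service's listening port, which this document does not specify.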

4. Incident Response

4.1 P1 — Critical (Live Platform Down)

Identify: Check production branch status and EC2 health via CloudWatch/Middleware.io.

./scripts/pakashop-status.sh
journalctl -u pakashop-gateway.service --since "5 minutes ago"

Assess: Determine if issue is infrastructure (EC2, RDS, Redis) or code-related.
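Rollback: If caused by a merge, revert the offending commit on the production branch and trigger a redeploy. Reverting a merge commit requires `-m 1` to select the mainline parent; omit the flag for an ordinary commit.
```
git revert -m 1 HEAD
git push origin production
```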
Mitigate: If infrastructure issue, restart services or failover to standby.
Notify: Alert stakeholders within 15 minutes via Slack #incidents.
Post-mortem: Document root cause and remediation within 24 hours.
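
The rollback step can be sketched as a helper that checks whether the latest production commit is a merge before reverting it, since `git revert` refuses a merge commit without `-m`. Both function names are hypothetical, and the sketch assumes the offending change is the most recent commit on production.

```shell
# Given one line of `git rev-list --parents -n 1 <commit>` output
# ("<commit> <parent1> [<parent2> ...]"), print the extra flags
# `git revert` needs: "-m 1" for a merge commit, nothing otherwise.
revert_flags() {
  # Word-split the line into positional parameters on purpose.
  set -- $1
  if [ "$#" -gt 2 ]; then
    echo "-m 1"
  fi
}

# Hypothetical rollback wrapper for the production branch.
rollback_production() {
  local parents
  git checkout production || return 1
  git pull --ff-only origin production || return 1
  parents=$(git rev-list --parents -n 1 HEAD) || return 1
  # The substitution is left unquoted so "-m 1" becomes two arguments.
  git revert --no-edit $(revert_flags "$parents") HEAD || return 1
  git push origin production
}
```

With `-m 1` the first parent is treated as the mainline, so the revert undoes what the merge brought into production.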

4.2 P2 — High (Staging Down or Degraded Performance)

Identify: Check main branch status and Middleware.io dashboards.
Resolve: Fix the offending code on main or revert if blocking other teams.
Verify: Run E2E tests on staging before declaring resolved.

4.3 P3 — Medium (Non-Critical Service Down)

Identify: Check service-specific logs and metrics.
Restart: sudo systemctl restart pakashop-<service>.service
Investigate: Review recent deployments or configuration changes.

4.4 P4 — Low (Minor Issue or Question)

Document in issue tracker.
Address during next sprint.

5. Scheduled Maintenance

5.1 Weekly Tasks

Monday: Run npm audit on the main branch and review the findings.
Wednesday: Review Middleware.io alert thresholds and adjust if needed.
Friday: Review disk usage on EC2 instances; clean up old logs.
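
The Friday disk review can be partly scripted. The sketch below reads `df -P` output and lists filesystems above a usage threshold. The name `disks_over` and the 80% figure in the example are assumptions, not values from this document.

```shell
# Hypothetical filter: read `df -P` output on stdin and print
# "<mount> <used%>" for every filesystem above the given threshold.
# Mount points containing spaces are not handled.
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    used = $5
    sub(/%/, "", used)
    if (used + 0 > limit + 0) print $6, used "%"
  }'
}
```

Example: `df -P | disks_over 80`. For journald itself, `journalctl --disk-usage` reports the space used by the journal and `sudo journalctl --vacuum-time=14d` removes archived entries older than 14 days; the retention period here is only an example.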

5.2 Monthly Tasks

Verify database backups for both Staging and Production stacks.
Review and rotate API keys (PawaPay, Flutterwave, Cloudinary).
Check SSL certificate expiry (Cloudflare auto-renews, but verify).
Review and prune old Docker images and build artifacts.
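
The certificate check can be sketched with openssl. The arithmetic is split into its own function so it can be verified without network access. Both names are hypothetical, and `cert_days_left` assumes GNU `date` for parsing the expiry string.

```shell
# Whole days between "now" and a certificate's notAfter time,
# both given as Unix epoch seconds. Zero or negative means the
# certificate has expired or expires within a day.
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# Hypothetical check: fetch the certificate served for a host and
# print how many days remain. Requires openssl and GNU date.
cert_days_left() {
  local host="$1" end
  end=$(openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null |
    openssl x509 -noout -enddate | cut -d= -f2)
  [ -n "$end" ] || return 1
  days_left "$(date -d "$end" +%s)" "$(date +%s)"
}
```

Example: `cert_days_left pakashop.store`. Because Cloudflare terminates TLS, this inspects the edge certificate rather than any certificate on the origin hosts.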

5.3 Quarterly Tasks

Security audit: review RBAC assignments, remove stale accounts.
Performance review: analyse k6 performance test trends.
Dependency update: major version upgrades for Node.js, Go, Python.
Disaster recovery drill: restore from backup to test environment.

6. Alert Response Runbook

Alert	Impact	Action
`payment_error_rate > 5%`	Customers cannot pay	Check `payment.process` logs. Verify PawaPay/Flutterwave API status. Check fraud service for false positives.
`api_p95_latency > 500ms`	Slow user experience	Check `db_query_duration` and high-latency spans. Review Redis cache hit rates.
`unhandled_exception`	Potential platform crash	Use `traceId` to identify the failing request and environment.
`redis_memory > 80%`	Cache eviction risk	Review cache TTLs. Flush stale keys. Consider scaling Redis instance.
`postgres_connections > 80%`	Database connection pool exhaustion	Check for connection leaks. Restart services if necessary. Enable PgBouncer.
`fraud_queue_backlog > 100`	Fraud reviews delayed	Scale fraud service workers. Check rule thresholds.
`zra_transmission_failure > 5`	Tax compliance risk	Check ZRA API status. Verify API key validity. Review mock mode fallback.
`meilisearch_index_lag > 60s`	Search results stale	Check Meilisearch task queue. Restart Meilisearch if necessary.
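
The numeric thresholds in the table can be expressed as a small lookup, for example to sanity-check a reading by hand during an incident. This is a sketch only: the real alert rules live in Middleware.io, the function names are hypothetical, and readings are assumed to be integers in the units shown.

```shell
# Threshold for each numeric alert in the runbook table above.
alert_threshold() {
  case "$1" in
    payment_error_rate)       echo 5 ;;    # percent
    api_p95_latency)          echo 500 ;;  # milliseconds
    redis_memory)             echo 80 ;;   # percent
    postgres_connections)     echo 80 ;;   # percent
    fraud_queue_backlog)      echo 100 ;;  # count (unit not stated in the table)
    zra_transmission_failure) echo 5 ;;    # count (unit not stated in the table)
    meilisearch_index_lag)    echo 60 ;;   # seconds
    *) return 1 ;;
  esac
}

# Succeeds when an integer reading is strictly above its threshold;
# returns 2 for a metric that is not in the table.
alert_fires() {
  local limit
  limit=$(alert_threshold "$1") || return 2
  [ "$2" -gt "$limit" ]
}
```

Example: `alert_fires redis_memory 81 && echo "review cache TTLs"`.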

For internal use only. Do not distribute outside Pakashop engineering.
