Observability Guide
Related docs: SDLC/MAINTENANCE · Microservices · Hosting Infrastructure
1. Accessing Dashboards
The following dashboards are available in the Middleware Console:
- API Performance: Overview of request rates, error rates, and latency across all 19 services.
- Payment Success: Tracking transaction success rates by provider (PawaPay, Flutterwave, Mock).
- ZRA Compliance: Monitoring TPIN and NRC validation rates and failure causes.
- Fraud Detection: Real-time fraud score distribution, blocked transaction trends, review queue depth.
- Delivery Tracking: GPS update frequency, geofence trigger accuracy, ETA accuracy.
- Cache Performance: Redis hit/miss ratios, memory usage, eviction rates.
2. Troubleshooting with Logs
All services emit structured JSON logs, and every entry carries a traceId so that entries from different services can be correlated.
2.1 Log Format (pino)
{
  "level": 30,
  "time": 1716374400000,
  "pid": 1234,
  "hostname": "pakashop-prod-01",
  "service": "pakashop-backend",
  "traceId": "abc123-def456",
  "spanId": "span789",
  "msg": "Payment initiation started",
  "orderId": "ord_123",
  "gateway": "PAWAPAY",
  "phone": "+26097*****56",
  "duration_ms": 45
}
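When reading entries by hand, it helps to decode the numeric fields. The sketch below assumes pino's default level numbering (10 trace, 20 debug, 30 info, 40 warn, 50 error, 60 fatal) and that `time` is milliseconds since the Unix epoch, as in the sample above. The `describe` helper is hypothetical and not part of the codebase.

```python
import json
from datetime import datetime, timezone

# pino's default numeric levels
PINO_LEVELS = {10: "trace", 20: "debug", 30: "info", 40: "warn", 50: "error", 60: "fatal"}

def describe(line: str) -> str:
    """Render one pino JSON log line as a short readable string (hypothetical helper)."""
    entry = json.loads(line)
    level = PINO_LEVELS.get(entry["level"], str(entry["level"]))
    # `time` is milliseconds since the Unix epoch; convert to an ISO 8601 UTC timestamp
    ts = datetime.fromtimestamp(entry["time"] / 1000, tz=timezone.utc).isoformat()
    return f'{ts} {level.upper()} [{entry.get("traceId", "-")}] {entry["msg"]}'

sample = '{"level": 30, "time": 1716374400000, "traceId": "abc123-def456", "msg": "Payment initiation started"}'
print(describe(sample))
# prints: 2024-05-22T10:40:00+00:00 INFO [abc123-def456] Payment initiation started
```

Under these assumptions, level 30 in the sample is an info entry, and the error queries that filter on `.level >= 50` match error and fatal entries.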
2.2 Example Queries
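In the commands below, -o cat makes journalctl print only the message text, so jq receives the raw JSON without the journald prefix. jq -R with fromjson? skips any line that is not valid JSON (for example, systemd's own messages about the unit).
Find all logs for a specific order:
journalctl -u pakashop-backend.service -o cat --since "1 hour ago" | jq -R 'fromjson? | select(.orderId == "ord_123")'
View all errors for ZRA validation (this assumes entries carry a spanName field, which the sample in 2.1 does not show):
journalctl -u pakashop-invoicing.service -o cat --since "24 hours ago" | jq -R 'fromjson? | select(.level >= 50 and .spanName == "zra.validate")'
Trace a single transaction across services:
# Take the traceId from the backend logs, then query each service with it
TRACE_ID="abc123-def456"
journalctl -u pakashop-backend.service -o cat | jq -R --arg t "$TRACE_ID" 'fromjson? | select(.traceId == $t)'
journalctl -u pakashop-tracking.service -o cat | jq -R --arg t "$TRACE_ID" 'fromjson? | select(.traceId == $t)'
journalctl -u pakashop-notifications.service -o cat | jq -R --arg t "$TRACE_ID" 'fromjson? | select(.traceId == $t)'
Find slow database queries (over 500 ms):
journalctl -u pakashop-backend.service -o cat | jq -R 'fromjson? | select(.duration_ms > 500 and .spanName == "db.query")'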
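When jq is not available, or a filter outgrows a one-liner, the same selection can be scripted. This is a minimal sketch, assuming the input holds one log message per line (for example, from `journalctl -o cat`). `select_entries` is a hypothetical helper name, and the sample lines are illustrative.

```python
import json
from typing import Iterable, Iterator

def select_entries(lines: Iterable[str], **match) -> Iterator[dict]:
    """Yield JSON log entries whose fields equal every key/value in `match`.

    Lines that are not JSON objects are skipped rather than raising.
    """
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and all(entry.get(k) == v for k, v in match.items()):
            yield entry

# Illustrative input: one non-JSON line and two JSON entries
lines = [
    "not a JSON line",
    '{"level": 30, "traceId": "abc123-def456", "msg": "Payment initiation started"}',
    '{"level": 50, "traceId": "other-trace", "msg": "unrelated request"}',
]
for entry in select_entries(lines, traceId="abc123-def456"):
    print(entry["msg"])
# prints: Payment initiation started
```

To run this against the journal, pass `sys.stdin` as `lines` and pipe the journalctl output into the script.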
3. Interpreting Traces
Custom spans have been added for critical flows across all services:
| Span Name | Service | Description |
|---|---|---|
| checkout.complete | backend | End-to-end checkout process |
| payment.process | backend | Gateway communication time |
| payment.webhook | backend | Webhook processing time |
| zra.validate | invoicing | ZRA compliance API response times |
| zra.transmit | invoicing | VSDC transmission time |
| fraud.evaluate | fraud | Fraud rule evaluation time |
| search.query | search | Meilisearch query time |
| tracking.location | tracking | GPS processing + Kalman filter |
| moderation.analyze | moderation | Sightengine API call time |
| recommendations.generate | recommendations | Collaborative filtering computation |
| report.generate | reports | PDF/CSV generation time |
| settlement.batch | settlement | Batch payout processing time |
Slow checkout? Look at the payment.process child span to see if the gateway is the bottleneck.
High fraud false positives? Check the fraud.evaluate span for rule execution times and thresholds.
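The checkout advice above amounts to ranking the child spans of a slow parent by duration. This is a minimal sketch over exported span data, assuming each span is a dict with `name`, `span_id`, `parent_id` and `duration_ms` keys. Those keys are hypothetical because this guide does not specify the export schema, and the durations are made-up illustration values.

```python
from typing import List, Optional

def slowest_child(spans: List[dict], parent_name: str) -> Optional[dict]:
    """Return the longest-running direct child of the span named `parent_name`."""
    parent = next((s for s in spans if s["name"] == parent_name), None)
    if parent is None:
        return None
    children = [s for s in spans if s.get("parent_id") == parent["span_id"]]
    # default=None covers a parent that has no child spans
    return max(children, key=lambda s: s["duration_ms"], default=None)

# Illustrative values only
trace = [
    {"name": "checkout.complete", "span_id": "a", "parent_id": None, "duration_ms": 1900},
    {"name": "db.query", "span_id": "b", "parent_id": "a", "duration_ms": 120},
    {"name": "payment.process", "span_id": "c", "parent_id": "a", "duration_ms": 1600},
]
worst = slowest_child(trace, "checkout.complete")
print(worst["name"], worst["duration_ms"])
# prints: payment.process 1600
```

In this made-up trace the gateway call accounts for most of the checkout time, which is the pattern to look for before investigating the database or cache.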
4. Custom Spans & Events
4.1 Adding a New Span (Node.js)
const { withSpan } = require('./lib/tracing');

await withSpan('my.operation', {
  'order.id': orderId,
  'user.role': userRole
}, async (span) => {
  // your logic
  span.setAttribute('custom.metric', value);
});
4.2 Adding a New Span (Go)
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// Inside a function that already has a ctx (context.Context):
ctx, span := otel.Tracer("pakashop-search").Start(ctx, "search.query")
defer span.End()
span.SetAttributes(
	attribute.String("search.query", query),
	attribute.Int("search.results", len(results)),
)
4.3 Adding a New Span (Python)
from opentelemetry import trace

tracer = trace.get_tracer("pakashop-moderation")
with tracer.start_as_current_span("moderation.analyze") as span:
    span.set_attribute("asset.id", asset_id)
    span.set_attribute("asset.type", asset_type)
    # moderation logic
5. Alert Response Runbook
| Alert | Impact | Action |
|---|---|---|
| payment_error_rate > 5% | Customers cannot pay | Check payment.process logs. Verify PawaPay/Flutterwave API status. Check fraud service for false positives. |
| zra_validation_failures > 10 | Shops cannot be approved | Check zra.validate spans. Verify ZRA credentials and API availability. |
| api_p95_latency > 500ms | Slow user experience | Check db_query_duration and high-latency spans. Review Redis cache hit rates. |
| unhandled_exception | Potential platform crash | Use traceId to identify the failing request and environment. |
| fraud_queue_backlog > 100 | Fraud reviews delayed | Scale fraud service workers. Check rule thresholds. |
| tracking_ws_disconnects > 50/min | Buyers lose live tracking | Check tracking service health. Review Redis Pub/Sub connection pool. |
| moderation_retry_exhausted > 10 | Images unmoderated | Check Sightengine API status. Review moderation service health. |
| redis_memory > 80% | Cache eviction risk | Review cache TTLs. Flush stale keys. Consider scaling Redis instance. |
| meilisearch_index_lag > 60s | Search results stale | Check Meilisearch task queue. Restart Meilisearch if necessary. |
| settlement_batch_failures > 3 | Vendors not paid | Check settlement service logs. Verify PawaPay/Flutterwave payout API status. |
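The numeric thresholds in the table can also be kept as data for quick checks in scripts or tests. This is a minimal sketch, assuming metric names identical to the alert names and values already expressed in the units noted in the comments. How the alerts are actually configured is not covered in this guide, and `firing_alerts` is a hypothetical helper.

```python
# Thresholds copied from the runbook table; units noted per entry.
THRESHOLDS = {
    "payment_error_rate": 5,           # percent
    "zra_validation_failures": 10,     # count
    "api_p95_latency": 500,            # milliseconds
    "fraud_queue_backlog": 100,        # items
    "tracking_ws_disconnects": 50,     # per minute
    "moderation_retry_exhausted": 10,  # count
    "redis_memory": 80,                # percent
    "meilisearch_index_lag": 60,       # seconds
    "settlement_batch_failures": 3,    # count
}

def firing_alerts(metrics: dict) -> list:
    """Return the sorted names of alerts whose current value exceeds its threshold."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    )

# Illustrative readings; redis_memory sits exactly at the limit and does not fire
print(firing_alerts({"payment_error_rate": 7.2, "redis_memory": 80, "api_p95_latency": 640}))
# prints: ['api_p95_latency', 'payment_error_rate']
```

The comparison is strictly greater-than, matching the `>` used in the table. The unhandled_exception alert has no numeric threshold and is left out.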
6. Rollback
To disable observability without a code change, set the following environment variable:
ENABLE_OBSERVABILITY=false
This disables export to Middleware.io; local pino logging is unaffected.
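How each service reads the flag is not shown in this guide. Below is a minimal sketch of one possible check, assuming that observability is on by default and that only an explicit falsy string turns it off. The accepted values and the `observability_enabled` helper are assumptions, not the services' actual behaviour.

```python
import os

# Assumed set of values that switch export off (case-insensitive)
_FALSY = {"false", "0", "no", "off"}

def observability_enabled(env=None) -> bool:
    """Return False only when ENABLE_OBSERVABILITY is set to an explicit falsy value."""
    env = os.environ if env is None else env
    value = env.get("ENABLE_OBSERVABILITY", "true")
    return value.strip().lower() not in _FALSY

if observability_enabled():
    print("exporting telemetry")  # exporter setup would go here
else:
    print("export disabled; local logging only")
```

Treating an unset variable as enabled means a missing entry in a deployment config does not silently drop telemetry.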
For internal use only. Do not distribute outside Pakashop engineering.