OrdersService Runbook
Operational runbook for troubleshooting and maintaining the OrdersService
This runbook provides operational procedures for the OrdersService, which is responsible for managing the entire lifecycle of customer orders in the FlowMart e-commerce platform.
Architecture
The OrdersService is responsible for:
- Creating and processing customer orders
- Tracking order status throughout fulfillment
- Coordinating with other services (Inventory, Payment, Shipping)
- Managing order history and amendments
Service Dependencies
[Diagram: OrdersService dependency graph]
Monitoring and Alerting
Key Metrics
| Metric | Description | Warning Threshold | Critical Threshold |
|---|---|---|---|
| order_creation_rate | Orders created per minute | < 5 | < 1 |
| order_creation_latency | Time to create an order | > 2s | > 5s |
| order_error_rate | Percentage of failed orders | > 1% | > 5% |
| database_connection_pool | Database connection pool utilization | > 70% | > 90% |
| memory_usage | Container memory usage | > 80% | > 90% |
| cpu_usage | Container CPU usage | > 70% | > 85% |
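The thresholds above can be read as a simple classification rule. The sketch below is illustrative only: the `classifyMetric` helper and the `direction` field are assumptions for this example, not part of the OrdersService codebase. Note that `order_creation_rate` alerts when the value drops below its limits, while every other metric alerts when the value rises above them.

```javascript
// Thresholds copied from the Key Metrics table. `direction` (an assumption of
// this sketch) records whether a metric alerts when it falls below or rises
// above its limits. Units: orders/min, seconds, and percent respectively.
const THRESHOLDS = {
  order_creation_rate: { direction: 'below', warning: 5, critical: 1 },
  order_creation_latency: { direction: 'above', warning: 2, critical: 5 },
  order_error_rate: { direction: 'above', warning: 1, critical: 5 },
  database_connection_pool: { direction: 'above', warning: 70, critical: 90 },
  memory_usage: { direction: 'above', warning: 80, critical: 90 },
  cpu_usage: { direction: 'above', warning: 70, critical: 85 },
};

// Hypothetical helper: returns 'critical', 'warning', or 'ok' for a reading.
function classifyMetric(name, value) {
  const t = THRESHOLDS[name];
  if (!t) throw new Error(`Unknown metric: ${name}`);
  const breached = (limit) =>
    t.direction === 'below' ? value < limit : value > limit;
  // Check the critical limit first, since it also breaches the warning limit.
  if (breached(t.critical)) return 'critical';
  if (breached(t.warning)) return 'warning';
  return 'ok';
}

console.log(classifyMetric('order_creation_rate', 3)); // 3 orders/min
console.log(classifyMetric('cpu_usage', 90)); // 90% CPU
```

The comparisons are strict, so a reading exactly at a limit (for example an error rate of exactly 1%) does not trigger that level.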
Dashboards
Common Alerts
| Alert | Description | Troubleshooting Steps |
|---|---|---|
| OrdersServiceHighLatency | API latency exceeds thresholds | See High Latency |
| OrdersServiceHighErrorRate | Error rate exceeds thresholds | See High Error Rate |
| OrdersServiceDatabaseConnectionIssues | Database connection issues | See Database Issues |
Troubleshooting Guides
High Latency
If the service is experiencing high latency:
- Check system metrics:

      kubectl top pods -n orders

- Check database metrics in the MongoDB dashboard to identify slow queries.
- Check dependent services to see whether the delays are caused by downstream systems:

      curl -X GET https://api.internal.flowmart.com/inventory/health
      curl -X GET https://api.internal.flowmart.com/payment/health

- Analyze recent changes that might have impacted performance.
- Scale the service if needed:

      kubectl scale deployment orders-service -n orders --replicas=5
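When choosing a value for `--replicas`, one option is the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current utilization / target utilization). The sketch below applies that rule. The `estimateReplicas` helper, the 70% target (borrowed from the `cpu_usage` warning threshold), and the cap of 10 replicas are assumptions for illustration, not values defined by this runbook.

```javascript
// Hypothetical helper applying the HPA-style proportional scaling rule.
// targetCpuPercent and maxReplicas are illustrative defaults, not service policy.
function estimateReplicas(currentReplicas, currentCpuPercent, targetCpuPercent = 70, maxReplicas = 10) {
  if (currentReplicas < 1 || targetCpuPercent <= 0) {
    throw new RangeError('currentReplicas must be >= 1 and targetCpuPercent > 0');
  }
  // Multiply before dividing to keep the arithmetic exact for whole numbers.
  const desired = Math.ceil((currentReplicas * currentCpuPercent) / targetCpuPercent);
  // Never scale to zero, and never exceed the configured cap.
  return Math.min(Math.max(desired, 1), maxReplicas);
}

// 3 pods averaging 90% CPU, aiming for 70%.
console.log(estimateReplicas(3, 90));
```

Take the average CPU figure from `kubectl top pods -n orders`. The estimate only helps when the latency is CPU-bound; it does not apply if the bottleneck is the database or a downstream service.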
High Error Rate
If the service is experiencing a high error rate:
- Check application logs:

      kubectl logs -l app=orders-service -n orders --tail=100

- Check for recent deployments that might have introduced issues:

      kubectl rollout history deployment/orders-service -n orders

- Verify database connectivity from inside a service pod. The script exits after reporting the result so that the command does not hang on the open connection:

      kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- \
        node -e "const mongoose = require('mongoose'); mongoose.connect(process.env.MONGODB_URI).then(() => { console.log('Connected!'); process.exit(0); }).catch(err => { console.error('Connection error', err); process.exit(1); });"

- Check dependent services for failures:

      curl -X GET https://api.internal.flowmart.com/inventory/health
      curl -X GET https://api.internal.flowmart.com/payment/health

- Consider rolling back if issues persist:

      kubectl rollout undo deployment/orders-service -n orders
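When the last 100 log lines contain many errors, grouping them by message can show whether a single failure dominates. The sketch below assumes the service writes JSON log lines with `level` and `msg` fields. That format is an assumption for illustration (the runbook does not specify the log format), and `summarizeErrors` is a hypothetical helper.

```javascript
// Hypothetical helper: counts error-level entries by message.
// Assumes one JSON object per line with `level` and `msg` fields (not confirmed
// by this runbook); lines that are not valid JSON are skipped.
function summarizeErrors(logText) {
  const counts = new Map();
  for (const line of logText.split('\n')) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // plain-text or truncated line
    }
    if (!entry || entry.level !== 'error') continue;
    const key = entry.msg || '(no message)';
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Most frequent message first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Illustrative sample, not real OrdersService output.
const sampleLogs = [
  '{"level":"error","msg":"payment timeout"}',
  '{"level":"info","msg":"order created"}',
  'not json at all',
  '{"level":"error","msg":"inventory unavailable"}',
  '{"level":"error","msg":"payment timeout"}',
].join('\n');

console.log(summarizeErrors(sampleLogs));
```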
Database Issues
If there are database connection issues:
- Check MongoDB status:

      kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- \
        mongo admin -u admin -p $MONGODB_PASSWORD --eval "db.serverStatus()"

- Verify network connectivity:

      kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- \
        ping mongodb.data.svc.cluster.local

- Check MongoDB resource usage:

      kubectl top pods -l app=mongodb -n data

- Review MongoDB logs:

      kubectl logs -l app=mongodb -n data --tail=100
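The `db.serverStatus()` output includes a `connections` document with `current` and `available` counts, which gives a quick view of how close the server is to its connection limit. The sketch below computes that utilization. `connectionUtilization` is a hypothetical helper, and the result describes server-side connections, so it is only a rough counterpart to the service's own `database_connection_pool` metric.

```javascript
// Hypothetical helper: percentage of MongoDB server connections in use,
// based on the `connections` section of db.serverStatus().
function connectionUtilization(connections) {
  const { current, available } = connections;
  const capacity = current + available;
  if (!(capacity > 0)) {
    throw new RangeError('connections.current + connections.available must be > 0');
  }
  return (current * 100) / capacity;
}

// Illustrative values, not taken from a real cluster.
const sampleStatus = { connections: { current: 150, available: 50 } };
console.log(`${connectionUtilization(sampleStatus.connections)}% of connections in use`);
```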
Common Operational Tasks
Scaling the Service
To scale the service horizontally:

    kubectl scale deployment orders-service -n orders --replicas=<number>

Restarting the Service
To restart all pods:

    kubectl rollout restart deployment orders-service -n orders

Viewing Recent Orders
To view recent orders in the database:

    kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- \
      node -e "const mongoose = require('mongoose'); const Order = require('./models/order'); mongoose.connect(process.env.MONGODB_URI).then(async () => { const orders = await Order.find().sort({createdAt: -1}).limit(10); console.log(JSON.stringify(orders, null, 2)); process.exit(0); });"

Manually Processing Stuck Orders
If orders are stuck in a particular state:
- Identify stuck orders, meaning orders still in `PROCESSING` that have not been updated for 30 minutes. The `$` in `$lt` is escaped so that the local shell does not expand it inside the double-quoted script:

      kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- \
        node -e "const mongoose = require('mongoose'); const Order = require('./models/order'); mongoose.connect(process.env.MONGODB_URI).then(async () => { const stuckOrders = await Order.find({status: 'PROCESSING', updatedAt: {\$lt: new Date(Date.now() - 30*60*1000)}}); console.log(JSON.stringify(stuckOrders, null, 2)); process.exit(0); });"

- Manually trigger processing for a specific order, replacing `ORDER_ID` with the order's ID:

      curl -X POST https://api.internal.flowmart.com/orders/process \
        -H "Content-Type: application/json" \
        -d '{"orderId": "ORDER_ID", "force": true}'
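The stuck-order criterion used above (status `PROCESSING` and no update for 30 minutes) can be written as a reusable filter, which avoids shell quoting problems in one-liners. This is a minimal sketch: `stuckOrderFilter` and `isStuck` are hypothetical helper names, and only the `status` and `updatedAt` fields from the query above are assumed.

```javascript
const STUCK_AFTER_MS = 30 * 60 * 1000; // 30 minutes, as in the query above

// Builds the MongoDB filter for orders stuck in PROCESSING.
function stuckOrderFilter(now = new Date(), thresholdMs = STUCK_AFTER_MS) {
  return {
    status: 'PROCESSING',
    updatedAt: { $lt: new Date(now.getTime() - thresholdMs) },
  };
}

// Applies the same rule to an order object that has already been loaded.
function isStuck(order, now = new Date(), thresholdMs = STUCK_AFTER_MS) {
  return (
    order.status === 'PROCESSING' &&
    new Date(order.updatedAt).getTime() < now.getTime() - thresholdMs
  );
}

// Illustrative usage with a fixed clock.
const checkTime = new Date('2024-01-01T12:00:00Z');
console.log(JSON.stringify(stuckOrderFilter(checkTime)));
console.log(isStuck({ status: 'PROCESSING', updatedAt: '2024-01-01T11:00:00Z' }, checkTime));
```

Inside a service pod, the filter object could be passed directly to `Order.find(...)`.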
Recovery Procedures
Database Failure Recovery
If the MongoDB database becomes unavailable:
- Verify the status of the MongoDB cluster:

      kubectl get pods -l app=mongodb -n data

- If the primary node is degraded but still reachable, initiate a manual failover if necessary by stepping it down. `rs.stepDown()` only succeeds when run against the current primary (this example assumes it is `mongodb-0`). If the primary is completely down, the replica set elects a new primary on its own:

      kubectl exec -it mongodb-0 -n data -- mongo admin -u admin -p $MONGODB_PASSWORD --eval "rs.stepDown()"

- If the entire cluster is unavailable, create an incident and notify the Database Team.
- Once database availability is restored, validate the OrdersService functionality:

      curl -X GET https://api.internal.flowmart.com/orders/health
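Because `rs.stepDown()` has to be run on the current primary, it helps to confirm which member holds that role before a manual failover. The sketch below picks the primary out of an `rs.status()` result, using the `members`, `stateStr`, `health`, and `name` fields of that output. `findPrimary` is a hypothetical helper, and the member host names are illustrative, not confirmed names from the FlowMart cluster.

```javascript
// Hypothetical helper: returns the name of the healthy PRIMARY member from an
// rs.status() document, or null if the replica set currently has no primary.
function findPrimary(replStatus) {
  const members = replStatus.members || [];
  const primary = members.find((m) => m.stateStr === 'PRIMARY' && m.health === 1);
  return primary ? primary.name : null;
}

// Illustrative rs.status() excerpt; host names are made up for this example.
const sampleReplStatus = {
  set: 'rs0',
  members: [
    { name: 'mongodb-0.mongodb.data.svc.cluster.local:27017', health: 0, stateStr: '(not reachable/healthy)' },
    { name: 'mongodb-1.mongodb.data.svc.cluster.local:27017', health: 1, stateStr: 'PRIMARY' },
    { name: 'mongodb-2.mongodb.data.svc.cluster.local:27017', health: 1, stateStr: 'SECONDARY' },
  ],
};

console.log(findPrimary(sampleReplStatus));
```

A `null` result means no primary is currently elected. In that case stepping a node down cannot help, and escalating to the Database Team is the appropriate next step.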
Event Bus Failure Recovery
If the Event Bus is unavailable:
- The OrdersService implements the Circuit Breaker pattern and queues messages locally while the Event Bus is unavailable.
- When the Event Bus is restored, check the backlog of events:

      kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- \
        curl localhost:9090/metrics | grep event_queue

- Manually trigger event processing if necessary:

      curl -X POST https://api.internal.flowmart.com/orders/admin/process-event-queue \
        -H "Authorization: Bearer $ADMIN_TOKEN"
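To show what "circuit breaker with a local queue" means in practice, here is a simplified, synchronous sketch of the pattern. It is not the OrdersService implementation: the class name, the thresholds, and the injected `publish` function are assumptions for this example, and a real publisher would be asynchronous.

```javascript
// Simplified circuit breaker that queues events locally while the bus is down.
// All names and defaults here are illustrative assumptions.
class QueueingCircuitBreaker {
  constructor(publish, { failureThreshold = 3, resetAfterMs = 30000, now = Date.now } = {}) {
    this.publish = publish; // function that throws when the Event Bus is down
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null; // timestamp when the breaker last opened
    this.queue = []; // events waiting for the bus to recover
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.resetAfterMs ? 'half-open' : 'open';
  }

  // Returns true if the event was published, false if it was queued.
  send(event) {
    if (this.state === 'open') {
      this.queue.push(event); // fail fast: do not touch the bus at all
      return false;
    }
    try {
      this.publish(event);
    } catch (err) {
      this.queue.push(event);
      this.failures += 1;
      // A failed trial call in half-open state re-opens the breaker immediately.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.openedAt = this.now();
      }
      return false;
    }
    this.failures = 0;
    this.openedAt = null;
    return true;
  }

  // Replays queued events in order; stops at the first failure.
  drain() {
    let sent = 0;
    while (this.queue.length > 0) {
      try {
        this.publish(this.queue[0]);
      } catch (err) {
        break;
      }
      this.queue.shift();
      sent += 1;
    }
    if (sent > 0) {
      this.failures = 0;
      this.openedAt = null;
    }
    return sent;
  }
}

// Illustrative usage: the bus is down for two events, then recovers.
let demoBusUp = false;
const demoBreaker = new QueueingCircuitBreaker(
  (event) => { if (!demoBusUp) throw new Error('bus down'); },
  { failureThreshold: 2 }
);
demoBreaker.send({ type: 'OrderCreated', orderId: 'A1' });
demoBreaker.send({ type: 'OrderCreated', orderId: 'A2' });
console.log(demoBreaker.state, demoBreaker.queue.length);
demoBusUp = true;
console.log(demoBreaker.drain(), demoBreaker.state);
```

In this sketch, `drain()` plays the role of the manual process-event-queue trigger: it replays the backlog in order and closes the breaker once a publish succeeds.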
Disaster Recovery
Complete Service Failure
In case of a complete service failure:
- Initiate incident response by notifying the on-call team through PagerDuty.
- Check for region-wide AWS issues on the AWS Status page.
- If necessary, trigger the DR plan to fail over to the secondary region:

      ./scripts/dr-failover.sh orders-service

- Update Route53 DNS to point to the secondary region if global failover is needed:

      aws route53 change-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --change-batch file://dr-dns-change.json
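The runbook does not show the contents of `dr-dns-change.json`. As a rough guide, a Route53 change batch is a JSON document with a `Changes` array of `Action` / `ResourceRecordSet` pairs. The sketch below generates one that repoints a CNAME record. The record type, the TTL, and the secondary-region hostname are assumptions for illustration; the real file may use alias or failover routing records instead.

```javascript
// Hypothetical helper: builds a Route53 change batch that UPSERTs a CNAME.
// Record type, TTL, and target hostname are illustrative assumptions.
function buildDnsChangeBatch(recordName, targetHostname, ttlSeconds = 60) {
  return {
    Comment: `DR failover: point ${recordName} at ${targetHostname}`,
    Changes: [
      {
        Action: 'UPSERT', // create the record, or replace it if it exists
        ResourceRecordSet: {
          Name: recordName,
          Type: 'CNAME',
          TTL: ttlSeconds,
          ResourceRecords: [{ Value: targetHostname }],
        },
      },
    ],
  };
}

// The target below is a placeholder, not a real FlowMart endpoint.
const changeBatch = buildDnsChangeBatch(
  'api.internal.flowmart.com',
  'orders.secondary-region.example.internal'
);
console.log(JSON.stringify(changeBatch, null, 2));
```

Written to a file, the printed JSON is in the form that `--change-batch file://...` expects.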
Maintenance Tasks
Deploying New Versions
To deploy a new version, update the container image on the deployment:

    kubectl set image deployment/orders-service -n orders orders-service=ecr.aws/flowmart/orders-service:$VERSION

Database Maintenance
Scheduled database maintenance should be performed during off-peak hours:
- Notify stakeholders through the #maintenance Slack channel.
- Set OrdersService to maintenance mode:

      curl -X POST https://api.internal.flowmart.com/orders/admin/maintenance \
        -H "Authorization: Bearer $ADMIN_TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"maintenanceMode": true, "message": "Scheduled maintenance"}'

- Perform database maintenance operations.
- Turn off maintenance mode:

      curl -X POST https://api.internal.flowmart.com/orders/admin/maintenance \
        -H "Authorization: Bearer $ADMIN_TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"maintenanceMode": false}'
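A common failure in this procedure is leaving maintenance mode on when the maintenance step itself fails. If the sequence is ever scripted, wrapping the work in `try`/`finally` guarantees the last step still runs. The sketch below is synchronous for brevity: `withMaintenanceMode` and `maintenanceRequest` are hypothetical helpers, and `setMode` stands in for whatever sends the request body to the maintenance endpoint.

```javascript
// Builds the request body used by the maintenance endpoint calls above.
function maintenanceRequest(enabled, message) {
  const body = { maintenanceMode: enabled };
  if (enabled && message) body.message = message;
  return body;
}

// Hypothetical wrapper: enables maintenance mode, runs the task, and always
// disables maintenance mode afterwards, even if the task throws.
function withMaintenanceMode(setMode, task, message = 'Scheduled maintenance') {
  setMode(maintenanceRequest(true, message));
  try {
    return task();
  } finally {
    setMode(maintenanceRequest(false));
  }
}

// Illustrative usage: setMode only logs here instead of calling the API.
const outcome = withMaintenanceMode(
  (body) => console.log('POST /orders/admin/maintenance', JSON.stringify(body)),
  () => 'maintenance finished'
);
console.log(outcome);
```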
Contact Information
Primary On-Call: Orders Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Orders Team Lead > Engineering Manager > CTO
Slack Channels:
- #orders-support (primary support channel)
- #orders-alerts (automated alerts)
- #incident-response (for major incidents)