OrdersService Runbook
Operational runbook for troubleshooting and maintaining the OrdersService
This runbook provides operational procedures for the OrdersService, which is responsible for managing the entire lifecycle of customer orders in the FlowMart e-commerce platform.
Architecture
The OrdersService is responsible for:
- Creating and processing customer orders
- Tracking order status throughout fulfillment
- Coordinating with other services (Inventory, Payment, Shipping)
- Managing order history and amendments
Service Dependencies
Monitoring and Alerting
Key Metrics
Metric | Description | Warning Threshold | Critical Threshold |
---|---|---|---|
order_creation_rate | Orders created per minute | < 5 | < 1 |
order_creation_latency | Time to create an order | > 2s | > 5s |
order_error_rate | Percentage of failed orders | > 1% | > 5% |
database_connection_pool | Database connection pool utilization | > 70% | > 90% |
memory_usage | Container memory usage | > 80% | > 90% |
cpu_usage | Container CPU usage | > 70% | > 85% |
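For scripted checks, the thresholds in the table can be encoded in a small helper. The following is a minimal sketch, assuming metric values have already been fetched as plain numbers (no units such as "s" or "%"). The function name `classify_metric` and its `above`/`below` argument are hypothetical and not part of the OrdersService tooling.

```shell
# classify_metric VALUE WARN CRIT DIRECTION
# DIRECTION is "above" when higher values are worse (latency, error rate,
# utilization) and "below" when lower values are worse (order_creation_rate).
# Prints ok, warning, or critical. awk is used so decimals compare correctly.
classify_metric() {
  awk -v v="$1" -v w="$2" -v c="$3" -v d="$4" 'BEGIN {
    if (d == "below") { v = -v; w = -w; c = -c }  # flip so that greater is worse
    if (v > c)      print "critical"
    else if (v > w) print "warning"
    else            print "ok"
  }'
}
```

For example, `classify_metric 3 2 5 above` evaluates a 3-second order_creation_latency against the 2s and 5s thresholds, and `classify_metric 0.5 5 1 below` evaluates an order_creation_rate of 0.5 per minute.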
Dashboards
Common Alerts
Alert | Description | Troubleshooting Steps |
---|---|---|
OrdersServiceHighLatency | API latency exceeds thresholds | See High Latency |
OrdersServiceHighErrorRate | Error rate exceeds thresholds | See High Error Rate |
OrdersServiceDatabaseConnectionIssues | Database connection issues | See Database Issues |
Troubleshooting Guides
High Latency
If the service is experiencing high latency:
Check system metrics:
kubectl top pods -n orders
Check database metrics in the MongoDB dashboard to identify slow queries.
Check dependent services to see if delays are caused by downstream systems:
curl -X GET https://api.internal.flowmart.com/inventory/health
curl -X GET https://api.internal.flowmart.com/payment/health
Analyze recent changes that might have impacted performance.
Scale the service if needed:
kubectl scale deployment orders-service -n orders --replicas=5
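When latency is the symptom, the response time of each dependency is as useful as its status code. The sketch below wraps the health checks from the steps above so that both are printed. The function names `check_health` and `check_dependencies` and the 5-second time limit are assumptions chosen for illustration.

```shell
# check_health URL
# Prints "<http_status> <seconds> <url>" and returns non-zero unless the
# endpoint answered with HTTP 200 within the time limit.
check_health() {
  local url=$1 out status
  # -s hides the progress meter, -o discards the body,
  # -w prints the status code and the total request time.
  out=$(curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}' "$url") || true
  status=${out%% *}
  echo "${out:-000 0} $url"
  [ "$status" = "200" ]
}

# check_dependencies
# Probes each dependency health endpoint and reports how many failed.
check_dependencies() {
  local base=https://api.internal.flowmart.com failed=0 svc
  for svc in inventory payment; do
    check_health "$base/$svc/health" || failed=$((failed + 1))
  done
  echo "failed=$failed"
  [ "$failed" -eq 0 ]
}
```

Running `check_dependencies` prints one line per dependency followed by a failure count, and its exit status can gate later steps in a script.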
High Error Rate
If the service is experiencing a high error rate:
Check application logs:
kubectl logs -l app=orders-service -n orders --tail=100
Check for recent deployments that might have introduced issues:
kubectl rollout history deployment/orders-service -n orders
Verify database connectivity:
kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- node -e "const mongoose = require('mongoose'); mongoose.connect(process.env.MONGODB_URI).then(() => { console.log('Connected'); process.exit(0); }).catch(err => { console.error('Connection error', err); process.exit(1); });"
Check dependent services for failures:
curl -X GET https://api.internal.flowmart.com/inventory/health
curl -X GET https://api.internal.flowmart.com/payment/health
Consider rolling back if issues persist:
kubectl rollout undo deployment/orders-service -n orders
Database Issues
If there are database connection issues:
Check MongoDB status:
kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- mongo admin -u admin -p $MONGODB_PASSWORD --eval "db.serverStatus()"
Verify network connectivity (a Kubernetes Service address may not answer ICMP, so a failed ping alone is not conclusive):
kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- ping mongodb.data.svc.cluster.local
Check MongoDB resource usage:
kubectl top pods -l app=mongodb -n data
Review MongoDB logs:
kubectl logs -l app=mongodb -n data --tail=100
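A TCP-level probe against the database port can be more conclusive than ping, because it exercises the same path the driver uses. The sketch below relies on bash's `/dev/tcp` redirection. It assumes that `bash` and coreutils `timeout` exist wherever it runs, which may not hold for the orders-service image. The function name `tcp_check` is hypothetical.

```shell
# tcp_check HOST PORT [TIMEOUT_SECONDS]
# Returns 0 when a TCP connection can be opened and non-zero otherwise.
# The host and port are passed as arguments to the inner bash so that
# they are never interpolated into the script text.
tcp_check() {
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Inside an orders-service pod this could be called as `tcp_check mongodb.data.svc.cluster.local 27017`. 27017 is MongoDB's default port; the port this cluster actually uses is an assumption.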
Common Operational Tasks
Scaling the Service
To scale the service horizontally:
kubectl scale deployment orders-service -n orders --replicas=<number>
Restarting the Service
To restart all pods:
kubectl rollout restart deployment orders-service -n orders
Viewing Recent Orders
To view recent orders in the database:
kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- node -e "const mongoose = require('mongoose'); const Order = require('./models/order'); mongoose.connect(process.env.MONGODB_URI).then(async () => { const orders = await Order.find().sort({createdAt: -1}).limit(10); console.log(JSON.stringify(orders, null, 2)); process.exit(0); });"
Manually Processing Stuck Orders
If orders are stuck in a particular state:
Identify stuck orders:
kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- node -e "const mongoose = require('mongoose'); const Order = require('./models/order'); mongoose.connect(process.env.MONGODB_URI).then(async () => { const stuckOrders = await Order.find({status: 'PROCESSING', updatedAt: {\$lt: new Date(Date.now() - 30*60*1000)}}); console.log(JSON.stringify(stuckOrders, null, 2)); process.exit(0); });"
Manually trigger processing for a specific order:
curl -X POST https://api.internal.flowmart.com/orders/process -H "Content-Type: application/json" -d '{"orderId": "ORDER_ID", "force": true}'
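When several orders are stuck, the manual trigger above can be repeated per order. The following sketch reads one order ID per line and posts each to the same endpoint. The function names `build_payload` and `process_orders` are hypothetical, and the sketch assumes order IDs contain no characters that need JSON escaping.

```shell
# build_payload ORDER_ID
# Prints the JSON body expected by the process endpoint.
build_payload() {
  printf '{"orderId": "%s", "force": true}' "$1"
}

# process_orders
# Reads one order ID per line on stdin and posts each one. Blank lines are
# skipped, and a failed request is counted instead of aborting the batch.
process_orders() {
  local id failed=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    # -f makes curl return non-zero on HTTP errors (status 400 and above).
    if curl -fsS -X POST https://api.internal.flowmart.com/orders/process \
         -H "Content-Type: application/json" \
         -d "$(build_payload "$id")" >/dev/null; then
      echo "ok $id"
    else
      echo "failed $id"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

With the IDs saved to a file, `process_orders < stuck-order-ids.txt` prints one result line per order and exits non-zero if any request failed.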
Recovery Procedures
Database Failure Recovery
If the MongoDB database becomes unavailable:
Verify the status of the MongoDB cluster:
kubectl get pods -l app=mongodb -n data
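If the primary node is unhealthy but still reachable, initiate a manual failover by stepping it down so that the replica set elects a new primary (the command assumes mongodb-0 is the current primary; a primary that is completely down cannot be stepped down, and the remaining members normally hold an election on their own):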
kubectl exec -it mongodb-0 -n data -- mongo admin -u admin -p $MONGODB_PASSWORD --eval "rs.stepDown()"
If the entire cluster is unavailable, create an incident and notify the Database Team.
Once database availability is restored, validate the OrdersService functionality:
curl -X GET https://api.internal.flowmart.com/orders/health
Event Bus Failure Recovery
If the Event Bus is unavailable:
The OrdersService implements the Circuit Breaker pattern and will queue messages locally.
When the Event Bus is restored, check the backlog of events:
kubectl exec $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- curl -s localhost:9090/metrics | grep event_queue
Manually trigger event processing if necessary:
curl -X POST https://api.internal.flowmart.com/orders/admin/process-event-queue -H "Authorization: Bearer $ADMIN_TOKEN"
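To decide whether manual processing is needed, the backlog can be reduced to a single number. The sketch below sums the samples of one metric from Prometheus text-format output. The metric name `event_queue_size` is an assumption; substitute whatever name the `event_queue` grep actually shows. The sketch also assumes the samples carry no trailing timestamps, so the value is the last field on each line.

```shell
# queue_depth METRIC_NAME
# Reads Prometheus text-format metrics on stdin and prints the sum of all
# samples of METRIC_NAME across label sets. Comment lines are ignored.
queue_depth() {
  awk -v name="$1" '
    /^#/ { next }                               # skip HELP and TYPE lines
    { metric = $1; sub(/[{].*/, "", metric) }   # strip any {label="..."} suffix
    metric == name { sum += $NF }               # value is the last field
    END { print sum + 0 }'
}
```

Piping the metrics endpoint output of a pod into `queue_depth event_queue_size` gives the local backlog for that pod. Because each pod queues locally, the check would need to be repeated for every replica.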
Disaster Recovery
Complete Service Failure
In case of a complete service failure:
Initiate incident response by notifying the on-call team through PagerDuty.
Check for region-wide AWS issues on the AWS Status page.
If necessary, trigger the DR plan to fail over to the secondary region:
./scripts/dr-failover.sh orders-service
Update Route53 DNS to point to the secondary region if global failover is needed:
aws route53 change-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --change-batch file://dr-dns-change.json
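The contents of dr-dns-change.json are not shown in this runbook. As an illustration of the change batch format that `change-resource-record-sets` expects, the sketch below generates a simple CNAME swap. The function name `make_change_batch`, the record type, and the TTL are assumptions; the real DR file may use alias or failover records instead.

```shell
# make_change_batch RECORD_NAME TARGET [TTL]
# Prints a Route53 change batch that UPSERTs a CNAME record pointing
# RECORD_NAME at TARGET. The TTL defaults to 60 seconds.
make_change_batch() {
  cat <<EOF
{
  "Comment": "DR failover for orders-service",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "CNAME",
        "TTL": ${3:-60},
        "ResourceRecords": [{ "Value": "$2" }]
      }
    }
  ]
}
EOF
}
```

The output can be redirected to a file and reviewed before it is passed to the AWS CLI with `--change-batch file://<path>`.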
Maintenance Tasks
Deploying New Versions
kubectl set image deployment/orders-service -n orders orders-service=ecr.aws/flowmart/orders-service:$VERSION
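The image update above returns as soon as the Deployment is patched, before the new pods are ready. A possible wrapper, sketched below, waits for the rollout and undoes it on failure. The function name `deploy_version`, the `KUBECTL` override, and the 300-second timeout are assumptions made for illustration.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# deploy_version VERSION
# Updates the image, waits for the rollout to finish, and rolls back if the
# new pods do not become ready within the timeout.
deploy_version() {
  local version=$1
  [ -n "$version" ] || { echo "usage: deploy_version VERSION" >&2; return 2; }
  $KUBECTL set image deployment/orders-service -n orders \
    "orders-service=ecr.aws/flowmart/orders-service:$version" || return 1
  if ! $KUBECTL rollout status deployment/orders-service -n orders --timeout=300s; then
    echo "rollout failed, rolling back" >&2
    $KUBECTL rollout undo deployment/orders-service -n orders
    return 1
  fi
}
```

Setting `KUBECTL=echo` before calling `deploy_version` prints the commands that would run without touching the cluster.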
Database Maintenance
Scheduled database maintenance should be performed during off-peak hours:
Notify stakeholders through the #maintenance Slack channel.
Set OrdersService to maintenance mode:
curl -X POST https://api.internal.flowmart.com/orders/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": true, "message": "Scheduled maintenance"}'
Perform database maintenance operations.
Turn off maintenance mode:
curl -X POST https://api.internal.flowmart.com/orders/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": false}'
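A risk in the procedure above is that a failed maintenance operation leaves the service in maintenance mode. The sketch below wraps the two maintenance calls around an arbitrary command so that the mode is switched off whether or not the command succeeds. The function names `set_maintenance` and `with_maintenance` are hypothetical.

```shell
# set_maintenance JSON_BODY
# Posts the body to the maintenance endpoint using the admin token.
set_maintenance() {
  curl -fsS -X POST https://api.internal.flowmart.com/orders/admin/maintenance \
    -H "Authorization: Bearer $ADMIN_TOKEN" \
    -H "Content-Type: application/json" -d "$1"
}

# with_maintenance COMMAND [ARGS...]
# Enables maintenance mode, runs the command, disables maintenance mode
# again regardless of the outcome, and returns the command's exit status.
with_maintenance() {
  local rc=0
  set_maintenance '{"maintenanceMode": true, "message": "Scheduled maintenance"}' || return 1
  "$@" || rc=$?
  set_maintenance '{"maintenanceMode": false}' ||
    echo "WARNING: could not disable maintenance mode" >&2
  return "$rc"
}
```

A call such as `with_maintenance ./run-db-maintenance.sh` (a hypothetical script name) keeps the three steps together. The wrapper does not handle interruption by a signal; covering that case would require a `trap`.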
Contact Information
Primary On-Call: Orders Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Orders Team Lead > Engineering Manager > CTO
Slack Channels:
- #orders-support (primary support channel)
- #orders-alerts (automated alerts)
- #incident-response (for major incidents)