OrdersService Runbook

This runbook provides operational procedures for the OrdersService, which is responsible for managing the entire lifecycle of customer orders in the FlowMart e-commerce platform.

Architecture

The OrdersService is responsible for:

Creating and processing customer orders
Tracking order status throughout fulfillment
Coordinating with other services (Inventory, Payment, Shipping)
Managing order history and amendments

Service Dependencies

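The OrdersService depends on MongoDB (`mongodb.data.svc.cluster.local` in the `data` namespace), the Event Bus, and the Inventory, Payment, and Shipping services.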

Monitoring and Alerting

Key Metrics

Metric	Description	Warning Threshold	Critical Threshold
`order_creation_rate`	Orders created per minute	< 5	< 1
`order_creation_latency`	Time to create an order	> 2s	> 5s
`order_error_rate`	Percentage of failed orders	> 1%	> 5%
`database_connection_pool`	Database connection pool utilization	> 70%	> 90%
`memory_usage`	Container memory usage	> 80%	> 90%
`cpu_usage`	Container CPU usage	> 70%	> 85%
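The thresholds above can also be encoded as data. This helps when checking a raw value copied from a dashboard. The following is a minimal sketch in Node.js, the language the service's own snippets use. `THRESHOLDS` and `classifyMetric` are hypothetical helpers, not part of the OrdersService. Values are assumed to be plain numbers in the table's units: orders per minute, seconds, or percent.

```javascript
// Hypothetical encoding of the Key Metrics table.
// Units: orders/minute for the rate, seconds for latency, percent for everything else.
const THRESHOLDS = {
  order_creation_rate: { warning: 5, critical: 1, direction: "below" },
  order_creation_latency: { warning: 2, critical: 5, direction: "above" },
  order_error_rate: { warning: 1, critical: 5, direction: "above" },
  database_connection_pool: { warning: 70, critical: 90, direction: "above" },
  memory_usage: { warning: 80, critical: 90, direction: "above" },
  cpu_usage: { warning: 70, critical: 85, direction: "above" },
};

// Returns "critical", "warning", or "ok" for one observed value.
function classifyMetric(name, value) {
  const t = THRESHOLDS[name];
  if (!t) throw new Error(`Unknown metric: ${name}`);
  // The order rate is unhealthy when it drops below a limit; the others when they rise above it.
  const breaches = (limit) => (t.direction === "below" ? value < limit : value > limit);
  if (breaches(t.critical)) return "critical";
  if (breaches(t.warning)) return "warning";
  return "ok";
}
```

For example, `classifyMetric("cpu_usage", 78)` returns `"warning"`. `classifyMetric("order_creation_rate", 0.5)` returns `"critical"`, because the order rate is unhealthy when it falls below its thresholds.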

Dashboards

Common Alerts

Alert	Description	Troubleshooting Steps
`OrdersServiceHighLatency`	API latency exceeds thresholds	See High Latency
`OrdersServiceHighErrorRate`	Error rate exceeds thresholds	See High Error Rate
`OrdersServiceDatabaseConnectionIssues`	Database connection issues	See Database Issues

Troubleshooting Guides

High Latency

If the service is experiencing high latency:

Check system metrics:
```
kubectl top pods -n orders
```
Check database metrics in the MongoDB dashboard to identify slow queries.

Check dependent services to see if delays are caused by downstream systems:

```
curl -X GET https://api.internal.flowmart.com/inventory/health
curl -X GET https://api.internal.flowmart.com/payment/health
```

Review recent deployments and configuration changes that might have affected performance.

Scale the service if needed:

```
kubectl scale deployment orders-service -n orders --replicas=5
```
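Timing each health check helps separate a slow dependency from a slow OrdersService. This is a minimal sketch with these assumptions:

- Node.js 18+ is available, for the global `fetch` and `AbortSignal.timeout`.
- Each health endpoint returns a 2xx status when healthy.
- `checkDependencies` is a hypothetical helper.
- The 2-second timeout simply mirrors the `order_creation_latency` warning threshold.

```javascript
// Hypothetical helper; assumes Node.js 18+ (global fetch, AbortSignal.timeout).
const DEPENDENCY_HEALTH_URLS = {
  inventory: "https://api.internal.flowmart.com/inventory/health",
  payment: "https://api.internal.flowmart.com/payment/health",
};

// Checks every dependency concurrently and records how long each one took.
// A request that errors or exceeds timeoutMs is reported as unhealthy.
async function checkDependencies(
  urls = DEPENDENCY_HEALTH_URLS,
  { fetchImpl = globalThis.fetch, timeoutMs = 2000 } = {}
) {
  const checks = Object.entries(urls).map(async ([name, url]) => {
    const started = Date.now();
    try {
      const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
      return { name, healthy: res.ok, status: res.status, ms: Date.now() - started };
    } catch (err) {
      return { name, healthy: false, error: String(err), ms: Date.now() - started };
    }
  });
  return Promise.all(checks);
}
```

Run `checkDependencies().then(console.table)` from a host that can reach the internal API. It prints one row per dependency with its health, status, and response time.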

High Error Rate

If the service is experiencing a high error rate:

Check application logs:

```
kubectl logs -l app=orders-service -n orders --tail=100
```

Check for recent deployments that might have introduced issues:
```
kubectl rollout history deployment/orders-service -n orders
```

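Verify database connectivity. The script exits once the connection succeeds or fails. It gives up after 5 seconds instead of the driver's default 30:

```
kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- node -e "const mongoose = require('mongoose'); mongoose.connect(process.env.MONGODB_URI, { serverSelectionTimeoutMS: 5000 }).then(() => { console.log('Connected'); process.exit(0); }).catch(err => { console.error('Connection error', err); process.exit(1); });"
```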

Check dependent services for failures:

```
curl -X GET https://api.internal.flowmart.com/inventory/health
curl -X GET https://api.internal.flowmart.com/payment/health
```

Consider rolling back if issues persist:

```
kubectl rollout undo deployment/orders-service -n orders
```

Database Issues

If there are database connection issues:

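Check MongoDB status. `$MONGODB_PASSWORD` is expanded by your local shell, so it must be set locally before you run the command. On MongoDB 6.0 and later the legacy `mongo` shell has been removed; use `mongosh` with the same arguments:

```
kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- mongo admin -u admin -p $MONGODB_PASSWORD --eval "db.serverStatus()"
```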

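Verify network connectivity. `ping` may not be installed in the container image. A ClusterIP Service may also not answer ICMP even when MongoDB is reachable, so treat a failed ping as inconclusive:

```
kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- ping -c 3 mongodb.data.svc.cluster.local
```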

Check MongoDB resource usage:

```
kubectl top pods -l app=mongodb -n data
```

Review MongoDB logs:

```
kubectl logs -l app=mongodb -n data --tail=100
```
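`ping` only exercises ICMP. A TCP connection attempt to the MongoDB port tests the path the driver actually uses. This is a minimal sketch using Node's built-in `net` module. `canConnect` is a hypothetical helper, and port 27017 (MongoDB's default) is an assumption about this deployment.

```javascript
const net = require("node:net");

// Resolves true if a TCP connection to host:port opens within timeoutMs, false otherwise.
// Assumes MongoDB listens on its default port 27017 unless told otherwise.
function canConnect(host, port = 27017, timeoutMs = 3000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (ok) => {
      socket.destroy(); // we only care whether the handshake succeeded
      resolve(ok);
    };
    socket.setTimeout(timeoutMs, () => finish(false));
    socket.once("connect", () => finish(true));
    socket.once("error", () => finish(false));
  });
}
```

Inside an OrdersService pod, the same logic can be pasted into `node -e` and called as `canConnect("mongodb.data.svc.cluster.local").then(console.log)`.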

Common Operational Tasks

Scaling the Service

To scale the service horizontally:

```
kubectl scale deployment orders-service -n orders --replicas=<number>
```
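One way to choose `<number>` is the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). This is a minimal sketch under these assumptions:

- CPU utilization is read from `kubectl top`.
- The target is 70%, which is the `cpu_usage` warning threshold.
- `desiredReplicas` is a hypothetical helper, and its min/max bounds are illustrative.

```javascript
// Hypothetical helper applying the HPA proportional scaling rule:
// desired = ceil(current * currentUtilization / target), clamped to [min, max].
function desiredReplicas(currentReplicas, currentUtilization, { target = 70, min = 2, max = 20 } = {}) {
  if (currentReplicas < 1 || target <= 0) throw new RangeError("invalid input");
  const raw = Math.ceil(currentReplicas * (currentUtilization / target));
  return Math.min(max, Math.max(min, raw));
}
```

For example, 3 replicas running at 95% CPU gives ceil(3 × 95 / 70) = ceil(4.07) = 5 replicas.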

Restarting the Service

To restart all pods:

```
kubectl rollout restart deployment orders-service -n orders
```

Viewing Recent Orders

To view recent orders in the database:

```
kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- node -e "const mongoose = require('mongoose'); const Order = require('./models/order'); mongoose.connect(process.env.MONGODB_URI).then(async () => { const orders = await Order.find().sort({createdAt: -1}).limit(10); console.log(JSON.stringify(orders, null, 2)); process.exit(0); });"
```

Manually Processing Stuck Orders

If orders are stuck in a particular state:

Identify stuck orders, meaning orders with status `PROCESSING` and no update in the last 30 minutes. The `$lt` operator is escaped as `\$lt` so that the local shell does not expand it inside the double quotes:

```
kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- node -e "const mongoose = require('mongoose'); const Order = require('./models/order'); mongoose.connect(process.env.MONGODB_URI).then(async () => { const stuckOrders = await Order.find({status: 'PROCESSING', updatedAt: {\$lt: new Date(Date.now() - 30*60*1000)}}); console.log(JSON.stringify(stuckOrders, null, 2)); process.exit(0); });"
```

Manually trigger processing for a specific order, replacing `ORDER_ID` with the ID of the stuck order:

```
curl -X POST https://api.internal.flowmart.com/orders/process -H "Content-Type: application/json" -d '{"orderId": "ORDER_ID", "force": true}'
```
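When many orders are stuck, the two steps above can be combined. This is a minimal sketch under these assumptions:

- Node.js 18+ is available.
- Orders are shaped like the Mongoose documents returned by the query above (`_id`, `status`, `updatedAt`).
- `/orders/process` accepts the `orderId` / `force` payload shown in the curl command.
- `findStuckOrders` and `reprocessStuckOrders` are hypothetical helpers.

```javascript
// Hypothetical helpers; assumes Node.js 18+ and orders shaped like the Mongoose
// documents queried above ({ _id, status, updatedAt }).
const STUCK_AFTER_MS = 30 * 60 * 1000; // same 30-minute cutoff as the query above

function findStuckOrders(orders, now = Date.now()) {
  return orders.filter(
    (o) => o.status === "PROCESSING" && new Date(o.updatedAt).getTime() < now - STUCK_AFTER_MS
  );
}

// Re-triggers stuck orders one at a time so the service is not flooded with requests.
async function reprocessStuckOrders(orders, { fetchImpl = globalThis.fetch, now = Date.now() } = {}) {
  const results = [];
  for (const order of findStuckOrders(orders, now)) {
    const orderId = String(order._id);
    const res = await fetchImpl("https://api.internal.flowmart.com/orders/process", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ orderId, force: true }),
    });
    results.push({ orderId, ok: res.ok, status: res.status });
  }
  return results;
}
```

Processing sequentially is deliberately conservative. A stuck backlog often points to a struggling dependency, and a burst of forced retries could make that worse.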

Recovery Procedures

Database Failure Recovery

If the MongoDB database becomes unavailable:

Verify the status of the MongoDB cluster:

```
kubectl get pods -l app=mongodb -n data
```

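If the primary is still running but unhealthy, step it down to trigger an election. `rs.stepDown()` only works when run on the current primary. The command below targets `mongodb-0`, so confirm that it is the primary with `rs.status()` first. If the primary pod is down entirely, the replica set should elect a new primary on its own, as long as a majority of members are available:

```
kubectl exec -it mongodb-0 -n data -- mongo admin -u admin -p $MONGODB_PASSWORD --eval "rs.stepDown()"
```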

If the entire cluster is unavailable, create an incident and notify the Database Team.
Once database availability is restored, validate the OrdersService functionality:
```
curl -X GET https://api.internal.flowmart.com/orders/health
```
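Right after the database returns, the first health check may still fail while connection pools reconnect. Polling with a short backoff avoids a false alarm. This is a minimal sketch under these assumptions:

- Node.js 18+ is available.
- `/orders/health` returns a 2xx status when healthy.
- The attempt count and delays are arbitrary choices.
- `waitForHealthy` is a hypothetical helper.

```javascript
// Hypothetical helper; assumes Node.js 18+ (global fetch).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls url until it returns a 2xx status, doubling the delay after each failed attempt.
async function waitForHealthy(url, { attempts = 5, initialDelayMs = 1000, fetchImpl = globalThis.fetch } = {}) {
  let delay = initialDelayMs;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const res = await fetchImpl(url);
      if (res.ok) return { healthy: true, attempt };
    } catch {
      // Network errors count as a failed attempt; fall through to the retry.
    }
    if (attempt < attempts) {
      await sleep(delay);
      delay *= 2;
    }
  }
  return { healthy: false, attempt: attempts };
}
```

For example, `waitForHealthy("https://api.internal.flowmart.com/orders/health").then(console.log)` waits at most about 15 seconds with the defaults before reporting the service as unhealthy.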

Event Bus Failure Recovery

If the Event Bus is unavailable:

The OrdersService implements the Circuit Breaker pattern and will queue messages locally.

When the Event Bus is restored, check the backlog of events:

```
kubectl exec $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- curl -s localhost:9090/metrics | grep event_queue
```

Manually trigger event processing if necessary:

```
curl -X POST https://api.internal.flowmart.com/orders/admin/process-event-queue -H "Authorization: Bearer $ADMIN_TOKEN"
```
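A small model of a circuit breaker with a local queue helps show what happens to events during an Event Bus outage. This is a generic sketch of the pattern, not the OrdersService's actual implementation. Its assumptions:

- The class name, failure threshold, and cooldown are illustrative.
- `publish` stands in for whatever Event Bus client the service uses.
- `drain` only loosely corresponds to the `process-event-queue` admin call.

```javascript
// Generic circuit breaker that buffers events locally while the breaker is open.
class QueueingCircuitBreaker {
  constructor(publish, { failureThreshold = 3, cooldownMs = 30000, now = () => Date.now() } = {}) {
    this.publish = publish; // async (event) => void; throws when the Event Bus is down
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
    this.queue = []; // local backlog of undelivered events
  }

  // Open until the cooldown elapses; after that one call is let through ("half-open").
  isOpen() {
    return this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs;
  }

  async send(event) {
    if (this.isOpen()) {
      this.queue.push(event); // short-circuit: do not even try the Event Bus
      return "queued";
    }
    try {
      await this.publish(event);
      this.failures = 0;
      this.openedAt = null;
      return "sent";
    } catch {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      this.queue.push(event);
      return "queued";
    }
  }

  // Publishes the backlog in order; stops at the first failure so nothing is dropped.
  async drain() {
    let sent = 0;
    while (this.queue.length > 0) {
      try {
        await this.publish(this.queue[0]);
      } catch {
        break; // Event Bus still unavailable; keep the remaining backlog
      }
      this.queue.shift();
      sent += 1;
    }
    if (sent > 0) {
      this.failures = 0;
      this.openedAt = null;
    }
    return sent;
  }
}
```

An event is removed from the local queue only after a successful publish. A failed drain therefore leaves the backlog intact for the next attempt.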

Disaster Recovery

Complete Service Failure

In case of a complete service failure:

Initiate incident response by notifying the on-call team through PagerDuty.
Check for region-wide AWS issues on the AWS Status page.
If necessary, trigger the DR plan to fail over to the secondary region:
```
./scripts/dr-failover.sh orders-service
```

Update Route53 DNS to point to the secondary region if global failover is needed:

```
aws route53 change-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --change-batch file://dr-dns-change.json
```
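This runbook does not include the contents of `dr-dns-change.json`. For orientation, a Route53 change batch has the general shape built below. The record name, target, record type (`CNAME`), and TTL are hypothetical placeholders; the real values should come from the DR plan. `UPSERT` creates the record if it is missing and overwrites it otherwise.

```javascript
// Builds a Route53 ChangeBatch document; all names passed in are hypothetical placeholders.
function buildFailoverChangeBatch(recordName, secondaryTarget, ttl = 60) {
  return {
    Comment: `DR failover for ${recordName}`,
    Changes: [
      {
        Action: "UPSERT", // create the record, or overwrite its current value
        ResourceRecordSet: {
          Name: recordName,
          Type: "CNAME",
          TTL: ttl,
          ResourceRecords: [{ Value: secondaryTarget }],
        },
      },
    ],
  };
}
```

To produce a file in the format `--change-batch file://...` expects, write `JSON.stringify(buildFailoverChangeBatch("orders.example.com", "orders-dr.example.com"), null, 2)` to `dr-dns-change.json`.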

Maintenance Tasks

Deploying New Versions

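Update the container image to the new tag (set `$VERSION` first), then follow the rollout with `kubectl rollout status deployment/orders-service -n orders`:

```
kubectl set image deployment/orders-service -n orders orders-service=ecr.aws/flowmart/orders-service:$VERSION
```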

Database Maintenance

Scheduled database maintenance should be performed during off-peak hours:

Notify stakeholders through the #maintenance Slack channel.

Set OrdersService to maintenance mode:

```
curl -X POST https://api.internal.flowmart.com/orders/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": true, "message": "Scheduled maintenance"}'
```

Perform database maintenance operations.

Turn off maintenance mode:

```
curl -X POST https://api.internal.flowmart.com/orders/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": false}'
```
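Forgetting the final step leaves the service in maintenance mode. Wrapping the work so that maintenance mode is always switched off, even when the work fails, guards against this. This is a minimal sketch under these assumptions:

- Node.js 18+ is available.
- An `ADMIN_TOKEN` environment variable is set, as in the curl commands above.
- `setMaintenance` and `withMaintenanceMode` are hypothetical helpers.

```javascript
// Hypothetical helpers; assumes Node.js 18+ (global fetch).
const MAINTENANCE_URL = "https://api.internal.flowmart.com/orders/admin/maintenance";

async function setMaintenance(body, { fetchImpl = globalThis.fetch, token = process.env.ADMIN_TOKEN } = {}) {
  const res = await fetchImpl(MAINTENANCE_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`maintenance endpoint returned HTTP ${res.status}`);
}

// Enables maintenance mode, runs work(), and always disables maintenance mode afterwards.
async function withMaintenanceMode(work, options = {}) {
  await setMaintenance({ maintenanceMode: true, message: "Scheduled maintenance" }, options);
  try {
    return await work();
  } finally {
    await setMaintenance({ maintenanceMode: false }, options);
  }
}
```

Because the disable call runs in `finally`, an error thrown by the maintenance work still propagates to the caller. It is reported only after the service has left maintenance mode.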

Contact Information

Primary On-Call: Orders Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Orders Team Lead > Engineering Manager > CTO

Slack Channels:

#orders-support (primary support channel)
#orders-alerts (automated alerts)
#incident-response (for major incidents)