OrdersService Runbook

Operational runbook for troubleshooting and maintaining the OrdersService

This runbook provides operational procedures for the OrdersService, which is responsible for managing the entire lifecycle of customer orders in the FlowMart e-commerce platform.

Architecture

The OrdersService is responsible for:

  • Creating and processing customer orders
  • Tracking order status throughout fulfillment
  • Coordinating with other services (Inventory, Payment, Shipping)
  • Managing order history and amendments

Service Dependencies

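The dependency graph is not reproduced here. The dependencies referenced in this runbook are the Inventory, Payment, and Shipping services, the MongoDB database, and the Event Bus.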

Monitoring and Alerting

Key Metrics

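| Metric                   | Description                          | Warning Threshold | Critical Threshold |
|--------------------------|--------------------------------------|-------------------|--------------------|
| order_creation_rate      | Orders created per minute            | < 5               | < 1                |
| order_creation_latency   | Time to create an order              | > 2s              | > 5s               |
| order_error_rate         | Percentage of failed orders          | > 1%              | > 5%               |
| database_connection_pool | Database connection pool utilization | > 70%             | > 90%              |
| memory_usage             | Container memory usage               | > 80%             | > 90%              |
| cpu_usage                | Container CPU usage                  | > 70%             | > 85%              |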
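
The thresholds above read in two directions: order_creation_rate alerts when the value falls below its threshold, while every other metric alerts when the value rises above it. The sketch below shows one way to classify a reading against a warning and critical pair. The function name `classify` and its argument order are illustrative assumptions, not part of the OrdersService tooling.

```shell
#!/usr/bin/env bash
# classify VALUE WARN CRIT DIRECTION
# DIRECTION is "above" (alert when the value exceeds a threshold) or
# "below" (alert when the value drops under a threshold).
# Pass plain numbers without units, e.g. 75 rather than 75%.
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" -v d="$4" 'BEGIN {
    v += 0; w += 0; c += 0          # force numeric comparison
    if (d == "above") { s = (v > c) ? "CRITICAL" : (v > w) ? "WARNING" : "OK" }
    else              { s = (v < c) ? "CRITICAL" : (v < w) ? "WARNING" : "OK" }
    print s
  }'
}

# Usage:
#   classify 3 5 1 below     # order_creation_rate of 3 orders per minute
#   classify 75 70 90 above  # database_connection_pool at 75%
```

The critical threshold is checked first so that a value breaching both thresholds is reported once, at the higher severity.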

Dashboards

Common Alerts

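| Alert                                 | Description                   | Troubleshooting Steps |
|---------------------------------------|-------------------------------|-----------------------|
| OrdersServiceHighLatency              | API latency exceeds thresholds | See High Latency      |
| OrdersServiceHighErrorRate            | Error rate exceeds thresholds  | See High Error Rate   |
| OrdersServiceDatabaseConnectionIssues | Database connection issues     | See Database Issues   |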

Troubleshooting Guides

High Latency

If the service is experiencing high latency:

  1. Check system metrics:

    kubectl top pods -n orders
    
  2. Check database metrics in the MongoDB dashboard to identify slow queries.

  3. Check dependent services to see if delays are caused by downstream systems:

    curl -X GET https://api.internal.flowmart.com/inventory/health
    curl -X GET https://api.internal.flowmart.com/payment/health
    
  4. Analyze recent changes that might have impacted performance.

  5. Scale the service if needed:

    kubectl scale deployment orders-service -n orders --replicas=5
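
When checking dependent services for latency, the response time of each health endpoint is often more telling than the status alone. The following is a minimal sketch that prints both for each service, using curl's `--write-out` variables. The function name `check_dependencies` and the 5-second timeout are assumptions; the base URL and the inventory and payment paths come from the commands above.

```shell
# Print the HTTP status and total response time of each dependency's health endpoint.
BASE_URL="${BASE_URL:-https://api.internal.flowmart.com}"

check_dependencies() {
  local svc result
  for svc in "$@"; do
    # -o /dev/null discards the body; -w prints the status code and total time.
    result=$(curl -s -o /dev/null --max-time 5 \
      -w '%{http_code} %{time_total}s' "$BASE_URL/$svc/health") \
      || result="000 timeout-or-error"
    echo "$svc $result"
  done
}

# Usage: check_dependencies inventory payment
```

A dependency that answers with 200 but takes several seconds is a likely contributor to order_creation_latency even though it reports as healthy.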
    

High Error Rate

If the service is experiencing a high error rate:

  1. Check application logs:

    kubectl logs -l app=orders-service -n orders --tail=100
    
  2. Check for recent deployments that might have introduced issues:

    kubectl rollout history deployment/orders-service -n orders
    
  3. Verify database connectivity:

    kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- node -e "const mongoose = require('mongoose'); mongoose.connect(process.env.MONGODB_URI).then(() => { console.log('Connected'); process.exit(0); }).catch(err => { console.error('Connection error', err); process.exit(1); });"
    
  4. Check dependent services for failures:

    curl -X GET https://api.internal.flowmart.com/inventory/health
    curl -X GET https://api.internal.flowmart.com/payment/health
    
  5. Consider rolling back if issues persist:

    kubectl rollout undo deployment/orders-service -n orders
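
`kubectl rollout undo` returns to the previous revision by default. If the faulty change was introduced more than one deployment ago, a specific revision from the rollout history can be targeted with `--to-revision`. The sketch below wraps both cases and then waits for the rollout to settle. The function name `rollback_orders` and the 300-second timeout are assumptions.

```shell
# Roll back orders-service and wait for the rollout to finish.
# With no argument, kubectl returns to the previous revision; pass a revision
# number from `kubectl rollout history` to target a specific one.
rollback_orders() {
  local revision="${1:-}"
  if [ -n "$revision" ]; then
    kubectl rollout undo deployment/orders-service -n orders --to-revision="$revision" || return 1
  else
    kubectl rollout undo deployment/orders-service -n orders || return 1
  fi
  # Block until the pods are ready or the assumed 5-minute timeout expires.
  kubectl rollout status deployment/orders-service -n orders --timeout=300s
}

# Usage: rollback_orders      (previous revision)
#        rollback_orders 12   (revision 12 from the rollout history)
```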
    

Database Issues

If there are database connection issues:

  1. Check MongoDB status:

    kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- mongo admin -u admin -p $MONGODB_PASSWORD --eval "db.serverStatus()"
    
  2. Verify network connectivity:

    kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- ping mongodb.data.svc.cluster.local
    
  3. Check MongoDB resource usage:

    kubectl top pods -l app=mongodb -n data
    
  4. Review MongoDB logs:

    kubectl logs -l app=mongodb -n data --tail=100
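
The ping check in step 2 can fail for reasons unrelated to MongoDB, for example when ICMP is blocked or the `ping` binary is missing from the container image. A TCP connection test against the database port is a closer proxy for what the service actually needs. The sketch below assumes bash and `timeout` are available wherever it runs, and that MongoDB listens on its default port 27017; the function name `tcp_check` is hypothetical.

```shell
# Check whether a TCP connection to HOST:PORT can be opened.
# Uses bash's /dev/tcp pseudo-device, so only bash and timeout are needed.
tcp_check() {
  local host="$1" port="$2" timeout_s="${3:-3}"
  # Inside the inner shell, $0 and $1 are the host and port passed after the script.
  if timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}

# Usage: tcp_check mongodb.data.svc.cluster.local 27017
```

The check tests connectivity from the machine where the function runs, so to reproduce the service's view of the network it needs to be run from a pod in the orders namespace.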
    

Common Operational Tasks

Scaling the Service

To scale the service horizontally:

kubectl scale deployment orders-service -n orders --replicas=<number>

Restarting the Service

To restart all pods:

kubectl rollout restart deployment orders-service -n orders

Viewing Recent Orders

To view recent orders in the database:

kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- node -e "const mongoose = require('mongoose'); const Order = require('./models/order'); mongoose.connect(process.env.MONGODB_URI).then(async () => { const orders = await Order.find().sort({createdAt: -1}).limit(10); console.log(JSON.stringify(orders, null, 2)); process.exit(0); });"

Manually Processing Stuck Orders

If orders are stuck in a particular state:

  1. Identify orders that have been in the PROCESSING state for more than 30 minutes. The backslash in \$lt stops the local shell from expanding the operator inside the double-quoted script:

    kubectl exec -it $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- node -e "const mongoose = require('mongoose'); const Order = require('./models/order'); mongoose.connect(process.env.MONGODB_URI).then(async () => { const stuckOrders = await Order.find({status: 'PROCESSING', updatedAt: {\$lt: new Date(Date.now() - 30*60*1000)}}); console.log(JSON.stringify(stuckOrders, null, 2)); process.exit(0); });"
    
  2. Manually trigger processing for a specific order:

    curl -X POST https://api.internal.flowmart.com/orders/process -H "Content-Type: application/json" -d '{"orderId": "ORDER_ID", "force": true}'
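
When several orders are stuck, the request in step 2 can be repeated for each ID. The following is a minimal sketch that reads order IDs from standard input, one per line. The endpoint and payload are taken from the command above; the function name `process_stuck_orders`, the `PAUSE_SECONDS` variable, and the one-second pause between requests are assumptions added to avoid flooding the service.

```shell
# Read order IDs from stdin (one per line) and force-process each one.
process_stuck_orders() {
  local order_id
  while IFS= read -r order_id; do
    [ -n "$order_id" ] || continue   # skip blank lines
    echo "Processing $order_id"
    curl -s -X POST https://api.internal.flowmart.com/orders/process \
      -H "Content-Type: application/json" \
      -d "{\"orderId\": \"$order_id\", \"force\": true}" \
      || echo "Request failed for $order_id" >&2
    sleep "${PAUSE_SECONDS:-1}"      # assumed pause between requests
  done
}

# Usage: process_stuck_orders < stuck-order-ids.txt
```

A failed request is reported on standard error and the loop continues, so one bad order does not block the rest of the batch.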
    

Recovery Procedures

Database Failure Recovery

If the MongoDB database becomes unavailable:

  1. Verify the status of the MongoDB cluster:

    kubectl get pods -l app=mongodb -n data
    
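  2. If the primary is unhealthy but still reachable, and the replica set has not already elected a new primary, force an election by stepping it down. Run the command on the current primary (mongodb-0 is used here as an example). A primary that is completely down cannot be stepped down; in that case the remaining members normally elect a replacement on their own:

    kubectl exec -it mongodb-0 -n data -- mongo admin -u admin -p $MONGODB_PASSWORD --eval "rs.stepDown()"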
    
  3. If the entire cluster is unavailable, create an incident and notify the Database Team.

  4. Once database availability is restored, validate the OrdersService functionality:

    curl -X GET https://api.internal.flowmart.com/orders/health
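
The service may take some time to reconnect after the database returns, so a single health request can give a misleading result. The sketch below polls the health endpoint until it succeeds or the attempts run out. It assumes the endpoint answers with HTTP 200 when healthy; the function name `wait_for_health` and the default attempt count and delay are also assumptions.

```shell
# Poll a health endpoint until it returns HTTP 200 or attempts run out.
# Arguments: URL, number of attempts (default 10), delay in seconds (default 5).
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-5}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url") || code="000"
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP $code" >&2
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}

# Usage: wait_for_health https://api.internal.flowmart.com/orders/health 12 10
```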
    

Event Bus Failure Recovery

If the Event Bus is unavailable:

  1. No immediate action is needed to preserve messages: the OrdersService implements the Circuit Breaker pattern and queues messages locally while the Event Bus is unavailable.

  2. When the Event Bus is restored, check the backlog of events:

    kubectl exec $(kubectl get pods -l app=orders-service -n orders -o jsonpath='{.items[0].metadata.name}') -n orders -- curl -s localhost:9090/metrics | grep event_queue
    
  3. Manually trigger event processing if necessary:

    curl -X POST https://api.internal.flowmart.com/orders/admin/process-event-queue -H "Authorization: Bearer $ADMIN_TOKEN"
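
If the backlog metric is reported per topic or per label, the grep output in step 2 has to be added up by hand. The following sketch sums all samples of one metric from Prometheus text-format output. The metric name `event_queue_size` in the usage line is hypothetical, since the runbook only shows the `event_queue` prefix, and the parser assumes samples carry no trailing timestamp.

```shell
# Sum the values of every sample of one metric.
# Reads Prometheus text format on stdin; comment lines (# HELP, # TYPE) are skipped.
sum_metric() {
  awk -v metric="$1" '
    /^#/ { next }
    {
      split($1, parts, "{")        # strip any {label="..."} suffix from the name
      if (parts[1] == metric) total += $NF
    }
    END { print total + 0 }
  '
}

# Usage (pod name and metric name are placeholders):
#   kubectl exec <pod> -n orders -- curl -s localhost:9090/metrics | sum_metric event_queue_size
```

Running it before and after triggering event processing gives a quick check that the backlog is draining.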
    

Disaster Recovery

Complete Service Failure

In case of a complete service failure:

  1. Initiate incident response by notifying the on-call team through PagerDuty.

  2. Check for region-wide AWS issues on the AWS Status page.

  3. If necessary, trigger the DR plan to fail over to the secondary region:

    ./scripts/dr-failover.sh orders-service
    
  4. If global failover is needed, update the Route 53 DNS records to point to the secondary region:

    aws route53 change-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --change-batch file://dr-dns-change.json
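
The contents of dr-dns-change.json are not shown in this runbook. For orientation, the sketch below generates a change batch in the format that `change-resource-record-sets` accepts, repointing a single CNAME record. The function name, the record type, the TTL default, and the hostnames in the usage line are all assumptions; the real file may use alias or failover records instead.

```shell
# Print a Route 53 change batch that repoints one CNAME record.
# Arguments: record name, target hostname, TTL in seconds (assumed default 60).
make_dns_change_batch() {
  local name="$1" target="$2" ttl="${3:-60}"
  cat <<EOF
{
  "Comment": "DR failover for $name",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$name",
        "Type": "CNAME",
        "TTL": $ttl,
        "ResourceRecords": [{ "Value": "$target" }]
      }
    }
  ]
}
EOF
}

# Usage (hostnames are hypothetical):
#   make_dns_change_batch orders.example.internal orders.dr.example.internal > dr-dns-change.json
```

UPSERT creates the record if it is missing and overwrites it otherwise, which makes the change safe to re-run during an incident.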
    

Maintenance Tasks

Deploying New Versions

To deploy a new version, update the image of the deployment:

kubectl set image deployment/orders-service -n orders orders-service=ecr.aws/flowmart/orders-service:$VERSION
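
If $VERSION is unset, the command above produces an image reference ending in a bare colon, and the failure only shows up once pods fail to pull the image. The sketch below guards against that and waits for the rollout to complete. The function name `deploy_orders` and the 300-second timeout are assumptions; the image path is the one used above.

```shell
# Deploy a specific image tag and wait for the rollout to complete.
deploy_orders() {
  local version="${1:-}"
  if [ -z "$version" ]; then
    # Refuse to continue rather than submit an image reference with an empty tag.
    echo "usage: deploy_orders <version>" >&2
    return 2
  fi
  kubectl set image deployment/orders-service -n orders \
    "orders-service=ecr.aws/flowmart/orders-service:$version" || return 1
  # The timeout is an assumed value; adjust it to the service's normal startup time.
  kubectl rollout status deployment/orders-service -n orders --timeout=300s
}

# Usage: deploy_orders "$VERSION"
```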

Database Maintenance

Scheduled database maintenance should be performed during off-peak hours:

  1. Notify stakeholders through the #maintenance Slack channel.

  2. Set OrdersService to maintenance mode:

    curl -X POST https://api.internal.flowmart.com/orders/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": true, "message": "Scheduled maintenance"}'
    
  3. Perform database maintenance operations.

  4. Turn off maintenance mode:

    curl -X POST https://api.internal.flowmart.com/orders/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": false}'
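
A risk with the manual sequence is that step 4 gets skipped when step 3 fails partway, leaving the service in maintenance mode. The sketch below wraps a maintenance command so that maintenance mode is switched off afterwards regardless of the command's exit status. The endpoint and payloads are those shown above; the function names are assumptions, and the wrapper does not cover interruption by a signal such as Ctrl-C.

```shell
ORDERS_ADMIN_URL="https://api.internal.flowmart.com/orders/admin/maintenance"

# Send one maintenance-mode request; $1 is the JSON payload.
set_maintenance() {
  curl -s -X POST "$ORDERS_ADMIN_URL" \
    -H "Authorization: Bearer $ADMIN_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$1"
}

# Enable maintenance mode, run the given command, then always disable
# maintenance mode and return the command's exit status.
with_maintenance_mode() {
  local status=0
  set_maintenance '{"maintenanceMode": true, "message": "Scheduled maintenance"}' || return 1
  "$@" || status=$?
  set_maintenance '{"maintenanceMode": false}' \
    || echo "WARNING: failed to disable maintenance mode" >&2
  return "$status"
}

# Usage (the script name is a placeholder): with_maintenance_mode ./run-db-maintenance.sh
```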
    

Contact Information

Primary On-Call: Orders Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Orders Team Lead > Engineering Manager > CTO

Slack Channels:

  • #orders-support (primary support channel)
  • #orders-alerts (automated alerts)
  • #incident-response (for major incidents)

Reference Information