Inventory Service - Runbook

Operational runbook for troubleshooting and maintaining the InventoryService

This runbook provides operational procedures for the InventoryService, which is responsible for managing product inventory and stock levels across the FlowMart e-commerce platform.

Architecture

The InventoryService is responsible for:

  • Managing product inventory and stock levels
  • Reserving inventory for pending orders
  • Tracking inventory across warehouses and locations
  • Providing real-time availability information
  • Triggering restock notifications

Service Dependencies

Loading graph...

Monitoring and Alerting

Key Metrics

MetricDescriptionWarning ThresholdCritical Threshold
inventory_check_rateInventory availability checks per minute> 1000> 5000
inventory_check_latencyTime to check inventory availability> 100ms> 500ms
inventory_update_latencyTime to update inventory levels> 200ms> 1s
low_stock_itemsNumber of items with low stock> 50> 100
connection_pool_usageDatabase connection pool utilization> 70%> 90%
redis_hit_rateCache hit rate< 80%< 60%

Dashboards

Common Alerts

AlertDescriptionTroubleshooting Steps
InventoryServiceHighLatencyAPI latency exceeds thresholdsSee High Latency
InventoryServiceDatabaseIssuesDatabase connection or performance issuesSee Database Issues
InventoryServiceCacheFailureRedis cache unavailable or performance degradedSee Cache Issues
InventoryServiceOutOfStockCritical products out of stockSee Stock Management

Troubleshooting Guides

High Latency

If the service is experiencing high latency:

  1. Check system resource usage:

    kubectl top pods -n inventory
    
  2. Check database connection pool:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/actuator/metrics/hikaricp.connections.usage
    
  3. Check cache hit rate:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/actuator/metrics/cache.gets | grep "hit_ratio"
    
  4. Check for slow queries in the database:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
    
  5. Scale the service if needed:

    kubectl scale deployment inventory-service -n inventory --replicas=5
    

Database Issues

If there are database connection or performance issues:

  1. Check PostgreSQL status:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_isready -U postgres
    
  2. Check for long-running transactions:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "SELECT pid, now() - xact_start AS duration, state, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;"
    
  3. Check for table bloat:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "SELECT schemaname, relname, n_live_tup, n_dead_tup, (n_dead_tup::float / n_live_tup::float) AS dead_ratio FROM pg_stat_user_tables WHERE n_live_tup > 1000 ORDER BY dead_ratio DESC;"
    
  4. Restart database connections in the application if needed:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/actuator/restart-db-connections
    

Cache Issues

If there are Redis cache issues:

  1. Check Redis status:

    kubectl exec -it $(kubectl get pods -l app=redis -n data -o jsonpath='{.items[0].metadata.name}') -n data -- redis-cli ping
    
  2. Check Redis memory usage:

    kubectl exec -it $(kubectl get pods -l app=redis -n data -o jsonpath='{.items[0].metadata.name}') -n data -- redis-cli info memory
    
  3. Check cache hit rate:

    kubectl exec -it $(kubectl get pods -l app=redis -n data -o jsonpath='{.items[0].metadata.name}') -n data -- redis-cli info stats | grep hit_rate
    
  4. Clear cache if necessary:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/actuator/caches/clearAll
    

Stock Management

For critical stock issues:

  1. Identify products with low or no stock:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/internal/api/inventory/low-stock
    
  2. Check for stuck inventory reservations:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/internal/api/inventory/stuck-reservations
    
  3. Release expired reservations if necessary:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/release-expired-reservations
    
  4. Manually update inventory levels for emergency corrections:

    curl -X PUT https://api.internal.flowmart.com/inventory/products/{productId}/stock \
      -H "Authorization: Bearer $ADMIN_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"warehouseId": "WAREHOUSE_ID", "quantity": 100, "reason": "Manual correction"}'
    

Common Operational Tasks

Scaling the Service

To scale the service horizontally:

kubectl scale deployment inventory-service -n inventory --replicas=<number>

Restarting the Service

To restart all pods:

kubectl rollout restart deployment inventory-service -n inventory

Database Maintenance

For routine database maintenance:

  1. Run VACUUM ANALYZE to optimize tables:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "VACUUM ANALYZE inventory_items;"
    
  2. Update database statistics:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "ANALYZE;"
    

Reconcile Inventory

To reconcile inventory with the warehouse management system:

kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/reconcile

Manually Trigger Restock Notifications

To trigger restock notifications for low stock items:

kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/trigger-restock-notifications

Recovery Procedures

Database Failure Recovery

If the PostgreSQL database becomes unavailable:

  1. Verify the status of the PostgreSQL cluster:

    kubectl get pods -l app=postgresql -n data
    
  2. If the primary instance is down, check if automatic failover has occurred:

    kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl list
    
  3. If automatic failover has not occurred, initiate manual failover:

    kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl failover
    
  4. Once database availability is restored, validate the InventoryService functionality:

    curl -X GET https://api.internal.flowmart.com/inventory/health
    

Cache Failure Recovery

If the Redis cache becomes unavailable:

  1. Verify Redis cluster status:

    kubectl get pods -l app=redis -n data
    
  2. If needed, restart the Redis cluster:

    kubectl rollout restart statefulset redis -n data
    
  3. The InventoryService will fall back to database queries when the cache is unavailable.

  4. When the cache is restored, you can warm it up:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/warm-cache
    

Disaster Recovery

Complete Service Failure

In case of a complete service failure:

  1. Initiate incident response by notifying the on-call team through PagerDuty.

  2. Verify the deployment status:

    kubectl describe deployment inventory-service -n inventory
    
  3. If necessary, restore from a previous version:

    kubectl rollout undo deployment inventory-service -n inventory
    
  4. If the primary region is experiencing issues, fail over to the secondary region:

    ./scripts/dr-failover.sh inventory-service
    
  5. Verify the service is functioning in the secondary region:

    curl -X GET https://api-dr.internal.flowmart.com/inventory/health
    

Maintenance Tasks

Deploying New Versions

kubectl set image deployment/inventory-service -n inventory inventory-service=ecr.aws/flowmart/inventory-service:$VERSION

Database Schema Updates

For database schema updates:

  1. Notify stakeholders through the #maintenance Slack channel.

  2. Set InventoryService to maintenance mode:

    curl -X POST https://api.internal.flowmart.com/inventory/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": true, "message": "Database schema update"}'
    
  3. Apply the database migrations:

    kubectl apply -f inventory-flyway-job.yaml
    
  4. Verify migration completion:

    kubectl logs -l job-name=inventory-flyway-migration -n inventory
    
  5. Turn off maintenance mode:

    curl -X POST https://api.internal.flowmart.com/inventory/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": false}'
    

Contact Information

Primary On-Call: Inventory Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Inventory Team Lead > Engineering Manager > CTO

Slack Channels:

  • #inventory-support (primary support channel)
  • #inventory-alerts (automated alerts)
  • #incident-response (for major incidents)

Reference Information