Inventory Service - Runbook
Operational runbook for troubleshooting and maintaining the InventoryService
This runbook provides operational procedures for the InventoryService, which manages product inventory and stock levels across the FlowMart e-commerce platform.
Architecture
The InventoryService is responsible for:
- Managing product inventory and stock levels
- Reserving inventory for pending orders
- Tracking inventory across warehouses and locations
- Providing real-time availability information
- Triggering restock notifications
Service Dependencies
Monitoring and Alerting
Key Metrics
| Metric | Description | Warning Threshold | Critical Threshold |
|---|---|---|---|
| inventory_check_rate | Inventory availability checks per minute | > 1000 | > 5000 |
| inventory_check_latency | Time to check inventory availability | > 100ms | > 500ms |
| inventory_update_latency | Time to update inventory levels | > 200ms | > 1s |
| low_stock_items | Number of items with low stock | > 50 | > 100 |
| connection_pool_usage | Database connection pool utilization | > 70% | > 90% |
| redis_hit_rate | Cache hit rate | < 80% | < 60% |
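The thresholds in the table can be applied mechanically when checking a metric by hand. The following is a minimal sketch, not part of the InventoryService tooling: `classify_metric` is a hypothetical helper that takes a bare number (no `ms` or `%` suffix), the warning and critical thresholds, and a direction. Use `below` for metrics such as redis_hit_rate, where lower values are worse.

```shell
# classify_metric VALUE WARNING CRITICAL [above|below]
# Prints "ok", "warning", or "critical".
# "above" (default): higher values are worse (latency, pool usage).
# "below": lower values are worse (cache hit rate).
# Hypothetical helper for illustration only.
classify_metric() {
  awk -v v="$1" -v w="$2" -v c="$3" -v d="${4:-above}" 'BEGIN {
    # Negating all three values turns a "below" check into an "above" check.
    if (d == "below") { v = -v; w = -w; c = -c }
    if (v + 0 > c + 0)      print "critical"
    else if (v + 0 > w + 0) print "warning"
    else                    print "ok"
  }'
}
```

For example, `classify_metric 150 100 500` evaluates an inventory_check_latency of 150 ms, and `classify_metric 75 80 60 below` evaluates a cache hit rate of 75%.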
Dashboards
Common Alerts
| Alert | Description | Troubleshooting Steps |
|---|---|---|
| InventoryServiceHighLatency | API latency exceeds thresholds | See High Latency |
| InventoryServiceDatabaseIssues | Database connection or performance issues | See Database Issues |
| InventoryServiceCacheFailure | Redis cache unavailable or performance degraded | See Cache Issues |
| InventoryServiceOutOfStock | Critical products out of stock | See Stock Management |
Troubleshooting Guides
High Latency
If the service is experiencing high latency:
- Check system resource usage:

      kubectl top pods -n inventory

- Check database connection pool:

      kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/actuator/metrics/hikaricp.connections.usage

- Check cache hit rate:

      kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/actuator/metrics/cache.gets | grep "hit_ratio"

- Check for slow queries in the database:

      kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

- Scale the service if needed:

      kubectl scale deployment inventory-service -n inventory --replicas=5
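Most commands in this runbook repeat the same `kubectl get pods ... -o jsonpath` lookup to find a pod name. As a sketch, that lookup could be factored into two small shell functions. The names `first_pod` and `inventory_exec` are hypothetical. The labels and namespaces are the ones used throughout this runbook.

```shell
# first_pod APP NAMESPACE
# Prints the name of the first pod carrying the label app=APP.
first_pod() {
  kubectl get pods -l "app=$1" -n "$2" \
    -o jsonpath='{.items[0].metadata.name}'
}

# inventory_exec CMD [ARGS...]
# Runs CMD inside the first inventory-service pod.
inventory_exec() {
  local pod
  pod="$(first_pod inventory-service inventory)" || return 1
  kubectl exec "$pod" -n inventory -- "$@"
}
```

With these loaded in the current shell, the connection pool check becomes `inventory_exec curl -s localhost:8080/actuator/metrics/hikaricp.connections.usage`.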
Database Issues
If there are database connection or performance issues:
- Check PostgreSQL status:

      kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_isready -U postgres

- Check for long-running transactions:

      kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "SELECT pid, now() - xact_start AS duration, state, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;"

- Check for table bloat:

      kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "SELECT schemaname, relname, n_live_tup, n_dead_tup, (n_dead_tup::float / n_live_tup::float) AS dead_ratio FROM pg_stat_user_tables WHERE n_live_tup > 1000 ORDER BY dead_ratio DESC;"

- Restart database connections in the application if needed:

      kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/actuator/restart-db-connections
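The table bloat check returns every large table, so it can help to filter the result down to likely VACUUM candidates. The sketch below assumes the query is run with `psql -At`, which prints unaligned, pipe-separated rows without headers, and that it selects only schemaname, relname, n_live_tup, and n_dead_tup. The function name `bloated_tables` and the default threshold of 0.2 are illustrative assumptions, not values defined by the InventoryService team.

```shell
# bloated_tables [THRESHOLD]
# Reads "schemaname|relname|n_live_tup|n_dead_tup" rows on stdin and prints
# "schema.table ratio" for tables whose dead/live tuple ratio exceeds
# THRESHOLD (default 0.2, an assumed value).
bloated_tables() {
  awk -F'|' -v t="${1:-0.2}" '
    $3 > 0 && $4 / $3 > t + 0 { printf "%s.%s %.2f\n", $1, $2, $4 / $3 }'
}
```

A possible invocation pipes the query output into the filter: `psql -U postgres -At -c "SELECT schemaname, relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE n_live_tup > 1000;" | bloated_tables 0.2`.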
Cache Issues
If there are Redis cache issues:
- Check Redis status:

      kubectl exec -it $(kubectl get pods -l app=redis -n data -o jsonpath='{.items[0].metadata.name}') -n data -- redis-cli ping

- Check Redis memory usage:

      kubectl exec -it $(kubectl get pods -l app=redis -n data -o jsonpath='{.items[0].metadata.name}') -n data -- redis-cli info memory

- Check cache hit rate. Redis does not report a hit rate directly; compute it as keyspace_hits / (keyspace_hits + keyspace_misses) from the counters in the stats section:

      kubectl exec -it $(kubectl get pods -l app=redis -n data -o jsonpath='{.items[0].metadata.name}') -n data -- redis-cli info stats | grep -E 'keyspace_(hits|misses)'

- Clear cache if necessary:

      kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/actuator/caches/clearAll
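Redis exposes `keyspace_hits` and `keyspace_misses` counters in `INFO stats` rather than a ready-made ratio, so the hit rate has to be derived. A minimal sketch, with `redis_hit_rate` as a hypothetical helper name: it reads the `redis-cli info stats` output on stdin and prints the percentage to compare against the redis_hit_rate thresholds.

```shell
# redis_hit_rate
# Reads `redis-cli info stats` output on stdin and prints the hit rate as a
# percentage with two decimals, or "n/a" if no lookups have been recorded.
redis_hit_rate() {
  # INFO output uses CRLF line endings; strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) print "n/a"
      else printf "%.2f\n", 100 * hits / total
    }'
}
```

Both counters are cumulative since the server started or since the last `CONFIG RESETSTAT`, so the result is a long-run average rather than the current rate.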
Stock Management
For critical stock issues:
- Identify products with low or no stock:

      kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/internal/api/inventory/low-stock

- Check for stuck inventory reservations:

      kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/internal/api/inventory/stuck-reservations

- Release expired reservations if necessary:

      kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/release-expired-reservations

- Manually update inventory levels for emergency corrections:

      curl -X PUT https://api.internal.flowmart.com/inventory/products/{productId}/stock \
        -H "Authorization: Bearer $ADMIN_TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"warehouseId": "WAREHOUSE_ID", "quantity": 100, "reason": "Manual correction"}'
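Because a manual correction writes directly to stock levels, it may be worth validating the request body before sending it. The sketch below builds the JSON payload from the manual update step. `build_stock_payload` is a hypothetical helper, and the validation rules are assumptions for illustration; the service may enforce different ones.

```shell
# build_stock_payload WAREHOUSE_ID QUANTITY REASON
# Prints the JSON body for the manual stock update, or fails with a message
# on stderr. Hypothetical helper; field names follow the curl example above.
build_stock_payload() {
  local warehouse_id="$1" quantity="$2" reason="$3"
  case "$quantity" in
    '' | *[!0-9]* | 0?*)
      echo "quantity must be a non-negative integer without leading zeros" >&2
      return 1 ;;
  esac
  # Reject characters that would need JSON escaping instead of escaping them.
  case "$warehouse_id$reason" in
    *\"* | *\\*)
      echo "warehouse ID and reason must not contain quotes or backslashes" >&2
      return 1 ;;
  esac
  printf '{"warehouseId": "%s", "quantity": %s, "reason": "%s"}\n' \
    "$warehouse_id" "$quantity" "$reason"
}
```

The output can then be passed to the update request with `-d "$(build_stock_payload WAREHOUSE_ID 100 'Manual correction')"`.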
Common Operational Tasks
Scaling the Service
To scale the service horizontally:

    kubectl scale deployment inventory-service -n inventory --replicas=<number>

Restarting the Service
To restart all pods:

    kubectl rollout restart deployment inventory-service -n inventory

Database Maintenance
For routine database maintenance:
- Run VACUUM ANALYZE to optimize tables:

      kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "VACUUM ANALYZE inventory_items;"

- Update database statistics:

      kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "ANALYZE;"
Reconcile Inventory
To reconcile inventory with the warehouse management system:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/reconcile

Manually Trigger Restock Notifications
To trigger restock notifications for low stock items:

    kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/trigger-restock-notifications

Recovery Procedures
Database Failure Recovery
If the PostgreSQL database becomes unavailable:
- Verify the status of the PostgreSQL cluster:

      kubectl get pods -l app=postgresql -n data

- If the primary instance is down, check if automatic failover has occurred:

      kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl list

- If automatic failover has not occurred, initiate manual failover:

      kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl failover

- Once database availability is restored, validate the InventoryService functionality:

      curl -X GET https://api.internal.flowmart.com/inventory/health
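After a failover the service may take a short while to reconnect, so a single health request can give a misleading result. The following sketch polls the health endpoint until it responds successfully. `wait_for_healthy` is a hypothetical helper, and it assumes the endpoint returns a 2xx status when the service is healthy and an error status otherwise.

```shell
# wait_for_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl gets a successful HTTP status.
# Returns 0 on success, 1 if every attempt fails.
wait_for_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error statuses (4xx/5xx).
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Only pause when another attempt remains.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

For example: `wait_for_healthy https://api.internal.flowmart.com/inventory/health 12 10`.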
Cache Failure Recovery
If the Redis cache becomes unavailable:
- Verify Redis cluster status:

      kubectl get pods -l app=redis -n data

- If needed, restart the Redis cluster:

      kubectl rollout restart statefulset redis -n data

- The InventoryService will fall back to database queries when the cache is unavailable.

- When the cache is restored, you can warm it up:

      kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/warm-cache
Disaster Recovery
Complete Service Failure
In case of a complete service failure:
- Initiate incident response by notifying the on-call team through PagerDuty.

- Verify the deployment status:

      kubectl describe deployment inventory-service -n inventory

- If necessary, restore from a previous version:

      kubectl rollout undo deployment inventory-service -n inventory

- If the primary region is experiencing issues, fail over to the secondary region:

      ./scripts/dr-failover.sh inventory-service

- Verify the service is functioning in the secondary region:

      curl -X GET https://api-dr.internal.flowmart.com/inventory/health
Maintenance Tasks
Deploying New Versions

    kubectl set image deployment/inventory-service -n inventory inventory-service=ecr.aws/flowmart/inventory-service:$VERSION

Database Schema Updates
For database schema updates:
- Notify stakeholders through the #maintenance Slack channel.

- Set InventoryService to maintenance mode:

      curl -X POST https://api.internal.flowmart.com/inventory/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": true, "message": "Database schema update"}'

- Apply the database migrations:

      kubectl apply -f inventory-flyway-job.yaml

- Verify migration completion:

      kubectl logs -l job-name=inventory-flyway-migration -n inventory

- Turn off maintenance mode:

      curl -X POST https://api.internal.flowmart.com/inventory/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": false}'
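The schema update steps above can be sketched as a single script so that maintenance mode is only lifted once the migration job has finished. This is an illustration under assumptions, not an established team script. The function names are hypothetical, the Job name `inventory-flyway-migration` is inferred from the `job-name` label in the log command, and the 600-second timeout is arbitrary. On failure the sketch deliberately leaves maintenance mode on so the schema can be inspected first.

```shell
# set_maintenance true|false [MESSAGE]
# Toggles maintenance mode through the admin endpoint used in the steps above.
# Requires ADMIN_TOKEN in the environment.
set_maintenance() {
  local body
  if [ "$1" = "true" ]; then
    body="{\"maintenanceMode\": true, \"message\": \"$2\"}"
  else
    body='{"maintenanceMode": false}'
  fi
  curl -fsS -X POST https://api.internal.flowmart.com/inventory/admin/maintenance \
    -H "Authorization: Bearer $ADMIN_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$body"
}

# run_schema_migration
# Enables maintenance mode, applies the Flyway job, waits for it to complete,
# and disables maintenance mode only if the job succeeded.
run_schema_migration() {
  set_maintenance true "Database schema update" || return 1
  if kubectl apply -f inventory-flyway-job.yaml &&
     kubectl wait --for=condition=complete job/inventory-flyway-migration \
       -n inventory --timeout=600s; then
    set_maintenance false
  else
    echo "migration failed; maintenance mode left on" >&2
    return 1
  fi
}
```

Stakeholder notification in #maintenance remains a manual step before calling `run_schema_migration`.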
Contact Information
Primary On-Call: Inventory Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Inventory Team Lead > Engineering Manager > CTO
Slack Channels:
- #inventory-support (primary support channel)
- #inventory-alerts (automated alerts)
- #incident-response (for major incidents)