Inventory Service - Runbook
Operational runbook for troubleshooting and maintaining the InventoryService
This runbook provides operational procedures for the InventoryService, which is responsible for managing product inventory and stock levels across the FlowMart e-commerce platform.
Architecture
The InventoryService is responsible for:
- Managing product inventory and stock levels
- Reserving inventory for pending orders
- Tracking inventory across warehouses and locations
- Providing real-time availability information
- Triggering restock notifications
Service Dependencies
Monitoring and Alerting
Key Metrics
Metric | Description | Warning Threshold | Critical Threshold |
---|---|---|---|
inventory_check_rate | Inventory availability checks per minute | > 1000 | > 5000 |
inventory_check_latency | Time to check inventory availability | > 100ms | > 500ms |
inventory_update_latency | Time to update inventory levels | > 200ms | > 1s |
low_stock_items | Number of items with low stock | > 50 | > 100 |
connection_pool_usage | Database connection pool utilization | > 70% | > 90% |
redis_hit_rate | Cache hit rate | < 80% | < 60% |
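The thresholds in the table can be checked mechanically when triaging an alert. The sketch below is illustrative only: `classify` is a hypothetical helper, not part of the service or its tooling, and it assumes values are passed as plain numbers in the table's units (so the 1s critical threshold for inventory_update_latency is passed as 1000 ms).

```shell
# classify VALUE WARN CRIT [above|below]   (hypothetical helper)
# Prints OK, WARNING, or CRITICAL.
# "above" (default) alerts when VALUE exceeds a threshold;
# "below" alerts when VALUE drops under it (used for redis_hit_rate).
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" -v dir="${4:-above}" 'BEGIN {
    # Force numeric comparison regardless of how awk typed the inputs.
    v += 0; w += 0; c += 0
    # Negating all three values turns a "below" check into an "above" check.
    if (dir == "below") { v = -v; w = -w; c = -c }
    if (v > c)      print "CRITICAL"
    else if (v > w) print "WARNING"
    else            print "OK"
  }'
}
```

For example, `classify 250 100 500` evaluates an inventory_check_latency of 250 ms, and `classify 75 80 60 below` evaluates a redis_hit_rate of 75%.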
Dashboards
Common Alerts
Alert | Description | Troubleshooting Steps |
---|---|---|
InventoryServiceHighLatency | API latency exceeds thresholds | See High Latency |
InventoryServiceDatabaseIssues | Database connection or performance issues | See Database Issues |
InventoryServiceCacheFailure | Redis cache unavailable or performance degraded | See Cache Issues |
InventoryServiceOutOfStock | Critical products out of stock | See Stock Management |
Troubleshooting Guides
High Latency
If the service is experiencing high latency:
Check system resource usage:
kubectl top pods -n inventory
Check database connection pool:
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/actuator/metrics/hikaricp.connections.usage
Check cache hits (query again with tag=result:miss to derive the hit rate):
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl "localhost:8080/actuator/metrics/cache.gets?tag=result:hit"
Check for slow queries in the database:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
Scale the service if needed:
kubectl scale deployment inventory-service -n inventory --replicas=5
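Most commands in this runbook repeat the same pod lookup inline. As a convenience, the lookup could be wrapped in two small shell functions. This is a sketch, not existing tooling: `first_pod` and `inv_curl` are hypothetical names, and the sketch drops `-it` because these calls are non-interactive.

```shell
# first_pod LABEL NAMESPACE   (hypothetical helper)
# Prints the name of the first pod matching a label selector.
first_pod() {
  kubectl get pods -l "$1" -n "$2" -o jsonpath='{.items[0].metadata.name}'
}

# inv_curl ARGS...   (hypothetical helper)
# Runs curl inside the first inventory-service pod, passing ARGS through.
inv_curl() {
  pod=$(first_pod app=inventory-service inventory) || return 1
  [ -n "$pod" ] || { echo "no inventory-service pod found" >&2; return 1; }
  kubectl exec "$pod" -n inventory -- curl -s "$@"
}
```

With these defined in the current shell, the connection pool check becomes `inv_curl localhost:8080/actuator/metrics/hikaricp.connections.usage`.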
Database Issues
If there are database connection or performance issues:
Check PostgreSQL status:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_isready -U postgres
Check for long-running transactions:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "SELECT pid, now() - xact_start AS duration, state, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;"
Check for table bloat:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "SELECT schemaname, relname, n_live_tup, n_dead_tup, (n_dead_tup::float / n_live_tup::float) AS dead_ratio FROM pg_stat_user_tables WHERE n_live_tup > 1000 ORDER BY dead_ratio DESC;"
Restart database connections in the application if needed:
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/actuator/restart-db-connections
Cache Issues
If there are Redis cache issues:
Check Redis status:
kubectl exec -it $(kubectl get pods -l app=redis -n data -o jsonpath='{.items[0].metadata.name}') -n data -- redis-cli ping
Check Redis memory usage:
kubectl exec -it $(kubectl get pods -l app=redis -n data -o jsonpath='{.items[0].metadata.name}') -n data -- redis-cli info memory
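Check cache hit and miss counters (Redis reports keyspace_hits and keyspace_misses rather than a precomputed hit rate):
kubectl exec -it $(kubectl get pods -l app=redis -n data -o jsonpath='{.items[0].metadata.name}') -n data -- redis-cli info stats | grep -E 'keyspace_(hits|misses)'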
Clear cache if necessary:
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/actuator/caches/clearAll
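Redis exposes cumulative `keyspace_hits` and `keyspace_misses` counters in `INFO stats`, so a hit rate to compare against the redis_hit_rate thresholds has to be computed from them. The following sketch does that with awk; `redis_hit_rate` is a hypothetical function name. The counters accumulate from server start, so the result is a lifetime average rather than a recent rate.

```shell
# redis_hit_rate   (hypothetical helper)
# Reads `redis-cli info stats` output on stdin and prints the keyspace
# hit rate as a percentage with one decimal place, or "n/a" if there
# have been no lookups yet.
redis_hit_rate() {
  # INFO output uses CRLF line endings; strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hits / total
    }'
}
```

Usage: pipe the output of the `redis-cli info stats` command above into `redis_hit_rate` instead of `grep`.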
Stock Management
For critical stock issues:
Identify products with low or no stock:
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/internal/api/inventory/low-stock
Check for stuck inventory reservations:
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl localhost:8080/internal/api/inventory/stuck-reservations
Release expired reservations if necessary:
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/release-expired-reservations
Manually update inventory levels for emergency corrections:
curl -X PUT https://api.internal.flowmart.com/inventory/products/{productId}/stock \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"warehouseId": "WAREHOUSE_ID", "quantity": 100, "reason": "Manual correction"}'
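Hand-editing the JSON body during an incident is error-prone, so the payload for the manual stock update could be generated and validated first. The sketch below assumes the three fields shown in the request above; `stock_payload` is a hypothetical helper, and the allowed character set for warehouse IDs (letters, digits, `_`, `-`) is an assumption. It also assumes the reason is a single line of printable text.

```shell
# stock_payload WAREHOUSE_ID QUANTITY REASON   (hypothetical helper)
# Prints the JSON body for the stock update, or fails on invalid input.
stock_payload() {
  case "$1" in
    # Assumed ID format: non-empty, only letters, digits, "_" and "-".
    ''|*[!A-Za-z0-9_-]*) echo "invalid warehouse id: $1" >&2; return 1 ;;
  esac
  case "$2" in
    # Reject empty values, non-digits, and leading zeros (invalid in JSON).
    ''|*[!0-9]*|0?*) echo "quantity must be a non-negative integer: $2" >&2; return 1 ;;
  esac
  # Escape backslashes and double quotes so the reason stays valid JSON.
  reason=$(printf '%s' "$3" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"warehouseId": "%s", "quantity": %s, "reason": "%s"}\n' "$1" "$2" "$reason"
}
```

The result can then be passed to curl with `-d "$(stock_payload WH-1 100 'Manual correction')"`, where `WH-1` stands in for a real warehouse ID.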
Common Operational Tasks
Scaling the Service
To scale the service horizontally:
kubectl scale deployment inventory-service -n inventory --replicas=<number>
Restarting the Service
To restart all pods:
kubectl rollout restart deployment inventory-service -n inventory
Database Maintenance
For routine database maintenance:
Run VACUUM ANALYZE to optimize tables:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "VACUUM ANALYZE inventory_items;"
Update database statistics:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -c "ANALYZE;"
Reconcile Inventory
To reconcile inventory with the warehouse management system:
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/reconcile
Manually Trigger Restock Notifications
To trigger restock notifications for low stock items:
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/trigger-restock-notifications
Recovery Procedures
Database Failure Recovery
If the PostgreSQL database becomes unavailable:
Verify the status of the PostgreSQL cluster:
kubectl get pods -l app=postgresql -n data
If the primary instance is down, check if automatic failover has occurred:
kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl list
If automatic failover has not occurred, initiate manual failover:
kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl failover
Once database availability is restored, validate the InventoryService functionality:
curl -X GET https://api.internal.flowmart.com/inventory/health
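After a failover the service may take a short while to reconnect, so a single health request can give a false negative. A polling loop is one way to handle this. The sketch below is an assumption about how that might look: `wait_for_health` is a hypothetical helper, and the default of 10 attempts at 5-second intervals is arbitrary.

```shell
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]   (hypothetical helper)
# Returns 0 as soon as the URL responds without an HTTP error
# (curl -f treats status 400 and above as failure), 1 otherwise.
# Always makes at least one attempt.
wait_for_health() {
  url=$1 attempts=${2:-10} delay=${3:-5}
  i=1
  while :; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Stop without sleeping once the last attempt has failed.
    [ "$i" -ge "$attempts" ] && break
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

Usage: `wait_for_health https://api.internal.flowmart.com/inventory/health`.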
Cache Failure Recovery
If the Redis cache becomes unavailable:
Verify Redis cluster status:
kubectl get pods -l app=redis -n data
If needed, restart the Redis cluster:
kubectl rollout restart statefulset redis -n data
The InventoryService will fall back to database queries when the cache is unavailable.
When the cache is restored, you can warm it up:
kubectl exec -it $(kubectl get pods -l app=inventory-service -n inventory -o jsonpath='{.items[0].metadata.name}') -n inventory -- curl -X POST localhost:8080/internal/api/inventory/warm-cache
Disaster Recovery
Complete Service Failure
In case of a complete service failure:
Initiate incident response by notifying the on-call team through PagerDuty.
Verify the deployment status:
kubectl describe deployment inventory-service -n inventory
If necessary, roll back to the previous deployment revision:
kubectl rollout undo deployment inventory-service -n inventory
If the primary region is experiencing issues, fail over to the secondary region:
./scripts/dr-failover.sh inventory-service
Verify the service is functioning in the secondary region:
curl -X GET https://api-dr.internal.flowmart.com/inventory/health
Maintenance Tasks
Deploying New Versions
To deploy a new version, update the container image:
kubectl set image deployment/inventory-service -n inventory inventory-service=ecr.aws/flowmart/inventory-service:$VERSION
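`kubectl set image` returns as soon as the deployment spec is updated, not when the new pods are ready. One possible wrapper, sketched below, waits for the rollout with `kubectl rollout status` and reverts with `kubectl rollout undo` if it does not finish in time. `deploy_version` is a hypothetical function, and both the automatic rollback and the 300s default timeout are assumptions rather than documented team policy.

```shell
# deploy_version VERSION [TIMEOUT]   (hypothetical helper)
# Sets the new image, waits for the rollout to finish, and rolls back
# to the previous revision if the rollout fails or times out.
deploy_version() {
  version=$1 timeout=${2:-300s}
  [ -n "$version" ] || { echo "usage: deploy_version VERSION [TIMEOUT]" >&2; return 2; }
  kubectl set image deployment/inventory-service -n inventory \
    "inventory-service=ecr.aws/flowmart/inventory-service:$version" || return 1
  if ! kubectl rollout status deployment/inventory-service -n inventory --timeout="$timeout"; then
    echo "rollout failed, reverting to the previous revision" >&2
    kubectl rollout undo deployment inventory-service -n inventory
    return 1
  fi
}
```

Usage: `deploy_version "$VERSION"`, optionally followed by a longer timeout such as `600s`.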
Database Schema Updates
For database schema updates:
Notify stakeholders through the #maintenance Slack channel.
Set InventoryService to maintenance mode:
curl -X POST https://api.internal.flowmart.com/inventory/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": true, "message": "Database schema update"}'
Apply the database migrations:
kubectl apply -f inventory-flyway-job.yaml
Verify migration completion:
kubectl logs -l job-name=inventory-flyway-migration -n inventory
Turn off maintenance mode:
curl -X POST https://api.internal.flowmart.com/inventory/admin/maintenance -H "Authorization: Bearer $ADMIN_TOKEN" -H "Content-Type: application/json" -d '{"maintenanceMode": false}'
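Reading the job logs shows progress but does not confirm that the migration finished before maintenance mode is turned off. A blocking wait is one option, sketched below. `run_migration` is a hypothetical function; the job name `inventory-flyway-migration` is inferred from the `job-name` label used in the log command above, and the 600s default timeout is an assumption.

```shell
# run_migration [TIMEOUT]   (hypothetical helper)
# Applies the Flyway job manifest and blocks until the job reports
# the "complete" condition or the timeout expires.
run_migration() {
  timeout=${1:-600s}
  kubectl apply -f inventory-flyway-job.yaml || return 1
  if kubectl wait --for=condition=complete job/inventory-flyway-migration \
       -n inventory --timeout="$timeout"; then
    echo "migration complete"
  else
    echo "migration did not complete within $timeout; leave maintenance mode on and check the job logs" >&2
    return 1
  fi
}
```

Because `kubectl wait --for=condition=complete` keeps waiting until the timeout even if the job has already failed, a non-zero result here means "not confirmed complete" and the job logs still need to be checked.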
Contact Information
Primary On-Call: Inventory Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Inventory Team Lead > Engineering Manager > CTO
Slack Channels:
- #inventory-support (primary support channel)
- #inventory-alerts (automated alerts)
- #incident-response (for major incidents)