ShippingService Runbook

Operational runbook for troubleshooting and maintaining the ShippingService

This runbook provides operational procedures for the ShippingService, which is responsible for managing shipping options, carrier integration, and delivery tracking in the FlowMart e-commerce platform.

Architecture

The ShippingService is responsible for:

  • Calculating shipping costs and delivery estimates
  • Managing shipping carriers and integration
  • Generating shipping labels
  • Tracking shipments
  • Handling delivery exceptions and returns

Service Dependencies

Loading graph...

Monitoring and Alerting

Key Metrics

MetricDescriptionWarning ThresholdCritical Threshold
shipping_rate_calculation_rateRate calculations per minute< 10< 2
shipping_label_generation_successLabel generation success %< 98%< 95%
carrier_api_response_timeCarrier API response time> 2s> 5s
carrier_api_error_rateCarrier API errors %> 2%> 5%
tracking_update_processing_rateTracking updates processed per minute< 50< 10
shipment_tracking_lagDelay in tracking information> 15m> 1h

Dashboards

Common Alerts

AlertDescriptionTroubleshooting Steps
ShippingServiceHighErrorRateShipping API error rate above thresholdSee High Error Rate
ShippingCarrierAPIDownCarrier API connection issuesSee Carrier API Issues
ShippingServiceHighLatencyShipping service latency issuesSee High Latency
ShippingServiceDatabaseIssuesDatabase connection issuesSee Database Issues

Troubleshooting Guides

High Error Rate

If the service is experiencing a high error rate:

  1. Check application logs for error patterns:

    Terminal window
    kubectl logs -l app=shipping-service -n shipping --tail=100
  2. Check specific error types:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/error-analyzer.jar --last-hour
  3. Check for patterns in failed shipments:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/failed-shipments-analyzer.jar
  4. Check for recent deployments that might have introduced issues:

    Terminal window
    kubectl rollout history deployment/shipping-service -n shipping
  5. Verify if the issue is specific to a carrier (FedEx, UPS, etc.):

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/carrier-success-rates.jar

Carrier API Issues

If there are issues with carrier APIs:

  1. Check carrier API connectivity:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/carrier-health-check.jar
  2. Check carrier API credentials and rotation status:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/check-carrier-credentials.jar
  3. Check carrier status pages for announced outages:

  4. Check carrier timeouts in application logs:

    Terminal window
    kubectl logs -l app=shipping-service -n shipping | grep "carrier timeout"
  5. Enable fallback shipping carrier:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- curl -X POST localhost:8080/internal/api/shipping/enable-fallback-carrier -H "Content-Type: application/json" -d '{"primaryCarrier": "fedex", "fallbackCarrier": "ups", "reason": "FedEx API outage"}'

High Latency

If the service is experiencing high latency:

  1. Check system metrics:

    Terminal window
    kubectl top pods -n shipping
  2. Check JVM memory and GC metrics:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/jvm-metrics.jar
  3. Check MongoDB performance:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- mongo --eval "db.currentOp()"
  4. Check carrier API response times:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/carrier-response-times.jar
  5. Scale the service if needed:

    Terminal window
    kubectl scale deployment shipping-service -n shipping --replicas=5

Database Issues

If there are MongoDB issues:

  1. Check MongoDB status:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- mongo --eval "rs.status()"
  2. Check for slow queries:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- mongo --eval "db.currentOp({ 'active': true, 'secs_running': { '$gt': 5 } })"
  3. Check database connection pool:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/db-pool-stats.jar
  4. Restart database connections if needed:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- curl -X POST localhost:8080/internal/api/system/refresh-db-connections

Common Operational Tasks

Managing Carrier API Credentials

Rotating Carrier API Keys

  1. Generate new API keys in the carrier portal:

  2. Store the new keys in AWS Secrets Manager:

    Terminal window
    aws secretsmanager update-secret --secret-id flowmart/shipping/fedex-api-key --secret-string '{"api_key": "NEW_KEY", "password": "NEW_PASSWORD", "account_number": "ACCOUNT_NUMBER"}'
  3. Trigger key rotation in the service:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- curl -X POST localhost:8080/internal/api/system/reload-carrier-credentials
  4. Verify the new keys are working by testing label generation:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/test-label-generation.jar --carrier=fedex

Managing Shipping Rates

Updating Shipping Rate Tables

When carrier rates change:

  1. Prepare the new rate table in the required JSON format.

  2. Upload the rate table to S3:

    Terminal window
    aws s3 cp new-fedex-rates.json s3://flowmart-configs/shipping/rates/
  3. Trigger rate table reload:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- curl -X POST localhost:8080/internal/api/shipping/reload-rate-tables -H "Content-Type: application/json" -d '{"carrier": "fedex"}'
  4. Verify rate calculations with test scenarios:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/test-rate-calculation.jar

Shipping Label Generation Troubleshooting

Debugging Failed Label Generation

If labels are failing to generate:

Terminal window
# Find recent failed label generation attempts
kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/find-failed-labels.jar --hours=2
# Get detailed error for a specific shipment
kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/label-error-details.jar --shipment-id=SHIP123456

Manual Label Generation

For special cases requiring manual intervention:

Terminal window
curl -X POST https://api.internal.flowmart.com/shipping/shipments/{shipmentId}/generate-label \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"carrier": "fedex", "service": "PRIORITY_OVERNIGHT", "forceGeneration": true}'

Tracking Updates

Triggering Manual Tracking Updates

To manually trigger tracking updates:

Terminal window
kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/sync-tracking.jar --shipment-id=SHIP123456
# For bulk tracking updates
kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/sync-tracking.jar --status=in_transit --hours=24

Tracking Webhook Troubleshooting

If tracking webhooks from carriers are failing:

Terminal window
# Check recent webhook failures
kubectl logs -l app=shipping-webhook-service -n shipping | grep "Webhook failure"
# Replay failed webhooks
kubectl exec -it $(kubectl get pods -l app=shipping-webhook-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/replay-webhooks.jar --hours=2

Recovery Procedures

Failed Shipment Recovery

If shipments are stuck or failed:

  1. Identify stuck shipments:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/find-stuck-shipments.jar
  2. Check shipment status with the carrier:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/check-carrier-shipment.jar --shipment-id=SHIP123456
  3. Resolve shipments that completed at carrier but failed in our system:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/resolve-shipment.jar --shipment-id=SHIP123456 --tracking-number=1Z999AA10123456784 --status=label_created

Carrier API Failure Recovery

If a carrier API is unavailable:

  1. Enable automatic carrier fallback:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- curl -X POST localhost:8080/internal/api/system/enable-carrier-fallback
  2. Monitor carrier API status for recovery:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/monitor-carrier-health.jar --carrier=fedex
  3. Switch back to primary carrier once it’s restored:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- curl -X POST localhost:8080/internal/api/system/disable-carrier-fallback

Database Failure Recovery

If the MongoDB database becomes unavailable:

  1. Verify the status of the MongoDB cluster:

    Terminal window
    kubectl get pods -l app=mongodb -n data
  2. Check if automatic failover has occurred:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- mongo --eval "rs.status()"
  3. Once database availability is restored, validate ShippingService functionality:

    Terminal window
    curl -X GET https://api.internal.flowmart.com/shipping/health

Disaster Recovery

Complete Service Failure

In case of a complete service failure:

  1. Initiate incident response by notifying the on-call team through PagerDuty.

  2. Deploy to the disaster recovery environment if necessary:

    Terminal window
    ./scripts/dr-failover.sh shipping-service
  3. Update DNS records to point to the DR environment:

    Terminal window
    aws route53 change-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --change-batch file://dr-dns-change.json
  4. Enable simplified shipping flow (if necessary):

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- curl -X POST localhost:8080/internal/api/system/enable-simplified-flow
  5. Regularly check primary environment recovery status.

Maintenance Tasks

Deploying New Versions

Terminal window
kubectl set image deployment/shipping-service -n shipping shipping-service=ecr.aws/flowmart/shipping-service:$VERSION

Database Maintenance

MongoDB Index Maintenance

Periodically verify and optimize MongoDB indexes:

Terminal window
# Check current indexes
kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- mongo --eval "db.shipments.getIndexes()"
# Add new index (example)
kubectl exec -it $(kubectl get pods -l app=mongodb -n data -o jsonpath='{.items[0].metadata.name}') -n data -- mongo --eval "db.shipments.createIndex({carrier: 1, status: 1, createdAt: -1})"

Database Backups

Verify scheduled MongoDB backups:

Terminal window
# Check recent backups
aws s3 ls s3://flowmart-mongodb-backups/shipping/ --human-readable
# Trigger manual backup if needed
kubectl apply -f shipping-db-backup-job.yaml

Carrier Integration Updates

When a carrier updates their API:

  1. Test the API changes in the staging environment:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service-staging -n shipping-staging -o jsonpath='{.items[0].metadata.name}') -n shipping-staging -- java -jar /app/tools/test-carrier-integration.jar --carrier=fedex --mode=new
  2. Update integration configuration if needed:

    Terminal window
    kubectl apply -f updated-fedex-integration-config.yaml
  3. Validate the updated integration:

    Terminal window
    kubectl exec -it $(kubectl get pods -l app=shipping-service -n shipping -o jsonpath='{.items[0].metadata.name}') -n shipping -- java -jar /app/tools/validate-carrier-integration.jar --carrier=fedex

Contact Information

Primary On-Call: Logistics Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Logistics Team Lead > Engineering Manager > CTO

Slack Channels:

  • #shipping-support (primary support channel)
  • #shipping-alerts (automated alerts)
  • #incident-response (for major incidents)

External Contacts:

Reference Information