PaymentService Runbook
Operational runbook for troubleshooting and maintaining the PaymentService
This runbook provides operational procedures for the PaymentService, which is responsible for processing payments, refunds, and managing financial transactions in the FlowMart e-commerce platform.
Architecture
The PaymentService is responsible for:
- Processing customer payments
- Managing refunds and chargebacks
- Integrating with external payment gateways
- Storing payment transactions
- Handling subscription billing
Service Dependencies
Monitoring and Alerting
Key Metrics
| Metric | Description | Warning Threshold | Critical Threshold |
|---|---|---|---|
| payment_processing_rate | Payments processed per minute | < 5 | < 1 |
| payment_success_rate | Percentage of successful payments | < 95% | < 90% |
| payment_processing_latency | Time to process a payment | > 3s | > 8s |
| refund_processing_latency | Time to process a refund | > 5s | > 15s |
| gateway_error_rate | Payment gateway errors | > 2% | > 5% |
| fraud_detection_latency | Time for fraud checks | > 1s | > 3s |
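The thresholds in the table can be checked mechanically. The following is a minimal sketch, not part of the service: `classify_metric` is a hypothetical helper that compares a sampled value against the warning and critical thresholds listed above. It assumes values are passed as plain numbers in the table's units (payments per minute, percent, or seconds).

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a sampled metric value as ok, warning, or critical
# using the thresholds from the Key Metrics table.

classify_metric() {
  local metric=$1 value=$2 direction warn crit
  case "$metric" in
    # "below" metrics alert when the value drops under the threshold.
    payment_processing_rate)    direction=below; warn=5;  crit=1  ;;
    payment_success_rate)       direction=below; warn=95; crit=90 ;;
    # "above" metrics alert when the value rises over the threshold.
    payment_processing_latency) direction=above; warn=3;  crit=8  ;;
    refund_processing_latency)  direction=above; warn=5;  crit=15 ;;
    gateway_error_rate)         direction=above; warn=2;  crit=5  ;;
    fraud_detection_latency)    direction=above; warn=1;  crit=3  ;;
    *) echo "unknown metric: $metric" >&2; return 2 ;;
  esac
  # awk performs the floating-point comparison that shell arithmetic cannot.
  awk -v v="$value" -v w="$warn" -v c="$crit" -v d="$direction" 'BEGIN {
    v += 0; w += 0; c += 0   # force numeric comparison
    if (d == "below") {
      level = (v < c) ? "critical" : ((v < w) ? "warning" : "ok")
    } else {
      level = (v > c) ? "critical" : ((v > w) ? "warning" : "ok")
    }
    print level
  }'
}
```

For example, `classify_metric payment_success_rate 93` reports a warning, because 93 is below 95 but not below 90.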
Dashboards
Common Alerts
| Alert | Description | Troubleshooting Steps |
|---|---|---|
| PaymentServiceHighErrorRate | Payment failure rate above threshold | See High Error Rate |
| PaymentServiceGatewayFailure | Payment gateway connection issues | See Payment Gateway Issues |
| PaymentServiceHighLatency | Payment processing latency issues | See High Latency |
| PaymentServiceDatabaseIssues | Database connection issues | See Database Issues |
Troubleshooting Guides
High Error Rate
If the service is experiencing a high payment error rate:
- Check application logs for error patterns:
    kubectl logs -l app=payment-service -n payment --tail=100
- Check payment gateway status on their status pages.
- Check for patterns in failed transactions:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/query-failed-transactions.js --last-hour
- Check for recent deployments that might have introduced issues:
    kubectl rollout history deployment/payment-service -n payment
- Verify if the issue is specific to a payment method (credit card, PayPal, etc.):
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/payment-method-success-rates.js
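Most commands in this runbook repeat the same pod lookup. The sketch below wraps it in two shell functions so the pod name is resolved once per call. The names `payment_pod` and `payment_exec` are hypothetical and not part of any FlowMart tooling; the label, namespace, and jsonpath are copied from the commands above. The wrapper omits `-it`, which is only needed for interactive commands.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; label, namespace, and jsonpath match the runbook commands.

# Print the name of the first payment-service pod.
payment_pod() {
  kubectl get pods -l app=payment-service -n payment \
    -o jsonpath='{.items[0].metadata.name}'
}

# Run a command inside that pod, e.g.:
#   payment_exec node scripts/query-failed-transactions.js --last-hour
payment_exec() {
  local pod
  pod=$(payment_pod) || return 1
  if [ -z "$pod" ]; then
    echo "no payment-service pod found in namespace payment" >&2
    return 1
  fi
  kubectl exec "$pod" -n payment -- "$@"
}
```

With these sourced into the shell, the payment-method check becomes `payment_exec node scripts/payment-method-success-rates.js`.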
Payment Gateway Issues
If there are issues with payment gateways:
- Check gateway connectivity:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -o /dev/null -s -w "%{http_code}\n" https://api.stripe.com/v1/charges -H "Authorization: Bearer $STRIPE_TEST_KEY"
- Check payment gateway API keys rotation status:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/check-api-key-rotation.js
- Check gateway timeouts in application logs:
    kubectl logs -l app=payment-service -n payment | grep "gateway timeout"
- Verify if the issue is isolated to a specific gateway:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/gateway-health-check.js
- Switch to backup payment gateway if primary is down:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/payment/switch-gateway -H "Content-Type: application/json" -d '{"primaryGateway": "paypal", "reason": "Stripe outage"}'
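The connectivity check prints only an HTTP status code, so it helps to know how to read it. The sketch below is one possible interpretation, based on general HTTP semantics and on curl printing `000` for `%{http_code}` when no HTTP response was received. The function name `interpret_gateway_status` and the suggested follow-up actions are assumptions, not documented gateway behavior.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate the status code printed by the connectivity
# check into a likely next step. The mapping is a general HTTP reading only.

interpret_gateway_status() {
  case "$1" in
    000)     echo "no response: DNS, network, or TLS failure before any HTTP reply" ;;
    2??)     echo "reachable: gateway accepted the request" ;;
    401|403) echo "reachable, but credentials were rejected: check API key rotation" ;;
    429)     echo "reachable, but rate limited" ;;
    4??)     echo "reachable, but the request was rejected as invalid" ;;
    5??)     echo "gateway-side error: consider switching to the backup gateway" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}
```

A `401` points at the API key steps in this section, while `000` or a `5xx` code suggests moving on to the gateway health check and, if needed, the backup gateway.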
High Latency
If the service is experiencing high latency:
- Check system metrics:
    kubectl top pods -n payment
- Check database connection pool:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/db-pool-stats.js
- Check slow queries in the payment database:
    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT query, calls, mean_exec_time, max_exec_time FROM pg_stat_statements WHERE mean_exec_time > 100 ORDER BY mean_exec_time DESC LIMIT 10;"
- Check payment gateway response times:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/gateway-latency-check.js
- Scale the service if needed:
    kubectl scale deployment payment-service -n payment --replicas=5
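Scaling returns as soon as the new replica count is accepted, not when the pods are ready. A minimal sketch of a wrapper that validates the replica count and then waits for the deployment to settle is shown below. The function name `scale_payment_service` and the 120-second timeout are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the scale step: validate input, scale, then wait
# until the deployment reports the new replicas as available.

scale_payment_service() {
  local replicas=$1
  # Reject empty input, non-digits, zero, and leading zeros before touching the cluster.
  case "$replicas" in
    ''|*[!0-9]*|0*) echo "replicas must be a positive integer" >&2; return 2 ;;
  esac
  kubectl scale deployment payment-service -n payment --replicas="$replicas" &&
    kubectl rollout status deployment/payment-service -n payment --timeout=120s
}
```

Calling `scale_payment_service 5` reproduces the command above and exits non-zero if the replicas are not available within the timeout.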
Database Issues
If there are database issues:
- Check PostgreSQL status:
    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_isready -U postgres -d payments
- Check for long-running transactions:
    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT pid, now() - xact_start AS duration, state, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;"
- Check for database locks (columns are qualified with l. because pid exists in both pg_locks and pg_stat_activity):
    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT l.relation::regclass, l.mode, l.pid, l.granted FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid WHERE l.relation = 'payments.transactions'::regclass;"
- Restart database connections if needed:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/refresh-db-connections
Common Operational Tasks
Managing API Keys
Rotating Payment Gateway API Keys
- Generate new API keys in the payment gateway admin portal.
- Store the new keys in AWS Secrets Manager:
    aws secretsmanager update-secret --secret-id flowmart/payment/stripe-api-key --secret-string '{"api_key": "sk_live_NEW_KEY", "webhook_secret": "whsec_NEW_SECRET"}'
- Trigger key rotation in the service:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/reload-api-keys
- Verify the new keys are active:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/verify-api-keys.js
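Typing a live key directly into the `update-secret` command leaves it in shell history. One possible alternative, sketched below, writes the secret JSON to a private temporary file and passes it with the AWS CLI's `file://` parameter syntax. The function name `write_stripe_secret_file` is hypothetical, and the sketch assumes the key and webhook secret contain only letters, digits, and underscores, so no JSON escaping is needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the secret JSON to a temp file (mktemp creates it
# readable by the owner only) and print the file path.

write_stripe_secret_file() {
  local api_key=$1 webhook_secret=$2 value file
  for value in "$api_key" "$webhook_secret"; do
    # Assumption: key material is a plain token; anything else is rejected.
    case "$value" in
      ''|*[!A-Za-z0-9_]*) echo "unexpected characters in key material" >&2; return 2 ;;
    esac
  done
  file=$(mktemp) || return 1
  printf '{"api_key": "%s", "webhook_secret": "%s"}\n' "$api_key" "$webhook_secret" > "$file"
  echo "$file"
}

# Possible usage (NEW_KEY and NEW_WEBHOOK_SECRET are placeholder variables):
#   secret_file=$(write_stripe_secret_file "$NEW_KEY" "$NEW_WEBHOOK_SECRET")
#   aws secretsmanager update-secret --secret-id flowmart/payment/stripe-api-key \
#     --secret-string "file://$secret_file"
#   rm -f "$secret_file"
```

The JSON keys match the ones used in the step above, so the service should see the same secret layout either way.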
Managing Refunds
Processing Manual Refunds
For special cases requiring manual intervention:
    curl -X POST https://api.internal.flowmart.com/payment/transactions/{transactionId}/refund \
      -H "Authorization: Bearer $ADMIN_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"amount": 1999, "reason": "Customer service request", "refundToOriginalMethod": true}'
Finding Failed Refunds
To identify and retry failed refunds:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/list-failed-refunds.js --last-24h
Handling Chargebacks
To record and process a new chargeback:
    curl -X POST https://api.internal.flowmart.com/payment/transactions/{transactionId}/chargeback \
      -H "Authorization: Bearer $ADMIN_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"chargebackReference": "CB12345", "amount": 1999, "reason": "Unauthorized transaction"}'
Payment Reconciliation
To trigger payment reconciliation with the payment gateway:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/reconcile-payments.js --gateway=stripe --date=2023-05-15
Recovery Procedures
Failed Transactions Recovery
If transactions are stuck or failed:
- Identify stuck transactions:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/find-stuck-transactions.js
- Check transaction status with the payment gateway:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/check-gateway-transaction.js --transaction-id=TXN123456
- Resolve transactions that completed at gateway but failed in our system:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/resolve-stuck-transaction.js --transaction-id=TXN123456 --status=completed
Payment Gateway Failure Recovery
If a payment gateway is unavailable:
- Enable fallback gateway mode:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/enable-fallback-gateway
- Monitor gateway status for recovery:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/monitor-gateway-health.js --gateway=stripe
- Disable fallback mode once the primary gateway is restored:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/disable-fallback-gateway
Database Failure Recovery
If the PostgreSQL database becomes unavailable:
- Verify the status of the PostgreSQL cluster:
    kubectl get pods -l app=postgresql -n data
- Check if automatic failover has occurred:
    kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl list
- Once database availability is restored, validate the PaymentService functionality:
    curl -X GET https://api.internal.flowmart.com/payment/health
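After a database failover the service may take a few moments to reconnect, so a single health request can give a misleading result. The sketch below polls the health endpoint until it answers. It assumes the endpoint returns HTTP 200 when the service is healthy, which this runbook does not state. The function name `wait_for_health` and the default attempt count and delay are also assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a health URL until it returns HTTP 200 or the
# attempts run out. Assumes 200 means healthy.

wait_for_health() {
  local url=$1 attempts=${2:-10} delay=${3:-5} i=1 code
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress, -o discards the body, -w prints only the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP $code" >&2
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_health https://api.internal.flowmart.com/payment/health 12 10` would retry for about two minutes before giving up.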
Disaster Recovery
Complete Service Failure
In case of a complete service failure:
- Initiate incident response by notifying the on-call team through PagerDuty.
- If necessary, deploy to the disaster recovery environment:
    ./scripts/dr-failover.sh payment-service
- Update DNS records to point to the DR environment:
    aws route53 change-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --change-batch file://dr-dns-change.json
- Enable simplified payment flow (if necessary):
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/enable-simplified-flow
- Regularly check primary environment recovery status.
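The DNS step reads its changes from `dr-dns-change.json`, whose contents this runbook does not show. The sketch below generates a file in the Route 53 change-batch format with a single `UPSERT`. The function name `dr_change_batch`, the CNAME record type, the 60-second TTL, and the hostnames in the usage comment are all hypothetical placeholders; the real record set depends on how the hosted zone is laid out.

```shell
#!/usr/bin/env bash
# Hypothetical generator for a Route 53 change batch that repoints one record
# at the DR environment. Record type and TTL are illustrative assumptions.

dr_change_batch() {
  local record_name=$1 dr_target=$2 ttl=${3:-60}
  cat <<EOF
{
  "Comment": "Fail over ${record_name} to the DR environment",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${dr_target}" }]
      }
    }
  ]
}
EOF
}

# Possible usage (placeholder hostnames):
#   dr_change_batch payments.example.com payments-dr.example.com > dr-dns-change.json
```

A short TTL keeps the later switch back to the primary environment quick, at the cost of more DNS queries while the DR environment is serving traffic.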
Maintenance Tasks
Deploying New Versions
    kubectl set image deployment/payment-service -n payment payment-service=ecr.aws/flowmart/payment-service:$VERSION
Database Migrations
For database schema updates:
- Notify stakeholders through the #maintenance Slack channel.
- Create a migration plan and back up the database. The -it flags are omitted here because a TTY can insert carriage returns into the redirected dump:
    kubectl exec $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_dump -U postgres -d payments > payments_backup_$(date +%Y%m%d).sql
- Apply database migrations:
    kubectl apply -f payment-migration-job.yaml
- Verify migration completion:
    kubectl logs -l job-name=payment-db-migration -n payment
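Reading the logs shows what the migration printed, but not whether the job finished. A minimal sketch that blocks until the job reports completion is shown below. It assumes the Job is named `payment-db-migration`, inferred from the `job-name` label in the log command, since Kubernetes sets that label to the Job's name. The function name `verify_migration` and the default timeout are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait for the migration Job's Complete condition and show
# recent logs if it is not reached within the timeout.

verify_migration() {
  local timeout=${1:-300s}
  if kubectl wait --for=condition=complete job/payment-db-migration \
       -n payment --timeout="$timeout"; then
    echo "migration completed"
  else
    echo "migration did not complete within $timeout; recent job logs:" >&2
    kubectl logs -l job-name=payment-db-migration -n payment --tail=50 >&2
    return 1
  fi
}
```

Note that waiting on the Complete condition does not return early when a job fails, so a failed migration shows up as a timeout followed by its logs.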
Compliance and Auditing
To generate PCI compliance reports:
    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/generate-pci-audit-report.js --month=2023-05
Contact Information
Primary On-Call: Payments Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Payments Team Lead > Engineering Manager > CTO
Slack Channels:
- #payments-support (primary support channel)
- #payments-alerts (automated alerts)
- #incident-response (for major incidents)
External Contacts:
- Stripe Support: support@stripe.com, 1-888-555-1234
- PayPal Support: merchant-support@paypal.com, 1-888-555-5678