PaymentService Runbook
Operational runbook for troubleshooting and maintaining the PaymentService
This runbook covers operational procedures for the PaymentService, which processes payments and refunds and manages financial transactions for the FlowMart e-commerce platform.
Architecture
The PaymentService is responsible for:
- Processing customer payments
- Managing refunds and chargebacks
- Integrating with external payment gateways
- Storing payment transactions
- Handling subscription billing
Service Dependencies
Monitoring and Alerting
Key Metrics
Metric | Description | Warning Threshold | Critical Threshold |
---|---|---|---|
payment_processing_rate | Payments processed per minute | < 5 | < 1 |
payment_success_rate | Percentage of successful payments | < 95% | < 90% |
payment_processing_latency | Time to process a payment | > 3s | > 8s |
refund_processing_latency | Time to process a refund | > 5s | > 15s |
gateway_error_rate | Percentage of payment gateway requests that return errors | > 2% | > 5% |
fraud_detection_latency | Time for fraud checks | > 1s | > 3s |
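The thresholds in the table above can be applied mechanically once a metric value is in hand. The following is a minimal sketch, not part of the service: the function name classify_metric and its argument order are hypothetical, and fetching the value from the monitoring system is left out. Rate metrics alert when the value drops below a threshold, and latency and error metrics alert when it rises above one.

```shell
# classify_metric VALUE DIRECTION WARNING CRITICAL   (hypothetical helper)
# DIRECTION is "below" (alert when the value drops under the threshold)
# or "above" (alert when the value exceeds the threshold).
# Prints one of: ok, warning, critical.
classify_metric() {
  awk -v v="$1" -v dir="$2" -v warn="$3" -v crit="$4" 'BEGIN {
    if (dir == "below") {
      if (v < crit) print "critical"
      else if (v < warn) print "warning"
      else print "ok"
    } else {
      if (v > crit) print "critical"
      else if (v > warn) print "warning"
      else print "ok"
    }
  }'
}

# Example: payment_success_rate at 92.5% against the 95% / 90% thresholds.
#   classify_metric 92.5 below 95 90
```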
Dashboards
Common Alerts
Alert | Description | Troubleshooting Steps |
---|---|---|
PaymentServiceHighErrorRate | Payment failure rate above threshold | See High Error Rate |
PaymentServiceGatewayFailure | Payment gateway connection issues | See Payment Gateway Issues |
PaymentServiceHighLatency | Payment processing latency issues | See High Latency |
PaymentServiceDatabaseIssues | Database connection issues | See Database Issues |
Troubleshooting Guides
High Error Rate
If the service is experiencing a high payment error rate:
Check application logs for error patterns:
kubectl logs -l app=payment-service -n payment --tail=100
Check each payment gateway's public status page for reported incidents.
Check for patterns in failed transactions:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/query-failed-transactions.js --last-hour
Check for recent deployments that might have introduced issues:
kubectl rollout history deployment/payment-service -n payment
Verify if the issue is specific to a payment method (credit card, PayPal, etc.):
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/payment-method-success-rates.js
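Most commands in this runbook repeat the same pod lookup. The sketch below wraps it in two helper functions. The names payment_pod and payment_exec are hypothetical and do not exist in the service repository. The wrapper also fails with a clear message when the selector matches no pod, which the inline form does not do.

```shell
# payment_pod: print the name of the first payment-service pod,
# or fail with a message if the selector matches nothing.
payment_pod() {
  pod=$(kubectl get pods -l app=payment-service -n payment \
    -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
  if [ -z "$pod" ]; then
    echo "no payment-service pod found in namespace payment" >&2
    return 1
  fi
  printf '%s\n' "$pod"
}

# payment_exec: run a command inside that pod.
payment_exec() {
  pod=$(payment_pod) || return 1
  kubectl exec -it "$pod" -n payment -- "$@"
}

# Example:
#   payment_exec node scripts/payment-method-success-rates.js
```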
Payment Gateway Issues
If there are issues with payment gateways:
Check gateway connectivity:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -o /dev/null -s -w "%{http_code}\n" https://api.stripe.com/v1/charges -H "Authorization: Bearer $STRIPE_TEST_KEY"
Check the rotation status of the payment gateway API keys:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/check-api-key-rotation.js
Check gateway timeouts in application logs:
kubectl logs -l app=payment-service -n payment --tail=-1 | grep -i "gateway timeout"
Verify if the issue is isolated to a specific gateway:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/gateway-health-check.js
Switch to the backup payment gateway if the primary is down:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/payment/switch-gateway -H "Content-Type: application/json" -d '{"primaryGateway": "paypal", "reason": "Stripe outage"}'
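The connectivity check above prints only an HTTP status code, so it helps to know what each code usually points to. The mapping below is a rough sketch with a hypothetical function name. The suggested causes are common interpretations, not statements about how the PaymentService or any particular gateway behaves. curl prints 000 when it received no HTTP response at all.

```shell
# interpret_gateway_status HTTP_CODE   (hypothetical helper)
# Maps the code printed by curl's %{http_code} to a likely cause.
interpret_gateway_status() {
  case "$1" in
    000)     echo "no response: DNS, network, or TLS problem between the pod and the gateway" ;;
    2??)     echo "reachable: the gateway accepted the request" ;;
    401|403) echo "reachable, but credentials were rejected: check API key rotation" ;;
    429)     echo "reachable, but rate limited" ;;
    5??)     echo "gateway-side error: check the gateway status page" ;;
    *)       echo "unexpected status $1: inspect the response body" ;;
  esac
}

# Example:
#   interpret_gateway_status 401
```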
High Latency
If the service is experiencing high latency:
Check system metrics:
kubectl top pods -n payment
Check database connection pool:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/db-pool-stats.js
Check slow queries in the payment database:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT query, calls, mean_exec_time, max_exec_time FROM pg_stat_statements WHERE mean_exec_time > 100 ORDER BY mean_exec_time DESC LIMIT 10;"
Check payment gateway response times:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/gateway-latency-check.js
Scale the service if needed:
kubectl scale deployment payment-service -n payment --replicas=5
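Before scaling, it can be useful to see which pods are actually CPU-bound. The filter below is a sketch that reads the output of kubectl top pods and lists pods at or above a CPU limit given in millicores. The function name hot_pods and the 800m limit in the example are assumptions, not values taken from the service configuration. It expects the header line to be present.

```shell
# hot_pods CPU_LIMIT_MILLICORES   (hypothetical helper)
# Reads `kubectl top pods` output on stdin and prints each pod whose CPU
# usage is at or above the limit, with the usage in millicores.
hot_pods() {
  awk -v limit="$1" 'NR > 1 {
    cpu = $2
    if (cpu ~ /m$/) { sub(/m$/, "", cpu); cpu = cpu + 0 }   # "250m" -> 250
    else            { cpu = cpu * 1000 }                     # "2"    -> 2000
    if (cpu >= limit) print $1, cpu "m"
  }'
}

# Example:
#   kubectl top pods -n payment | hot_pods 800
```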
Database Issues
If there are database issues:
Check PostgreSQL status:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_isready -U postgres -d payments
Check for long-running transactions:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT pid, now() - xact_start AS duration, state, query FROM pg_stat_activity WHERE xact_start IS NOT NULL AND state != 'idle' ORDER BY duration DESC LIMIT 10;"
Check for database locks:
kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT l.relation::regclass, l.mode, l.pid, l.granted FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid WHERE l.relation = 'payments.transactions'::regclass;"
Restart database connections if needed:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/refresh-db-connections
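The pg_isready check reports its result through its exit code as well as its output, and kubectl exec passes that exit code through. The sketch below, with a hypothetical function name, translates the documented pg_isready exit codes into a short description for use in scripts or incident notes.

```shell
# describe_pg_isready EXIT_CODE   (hypothetical helper)
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, during startup)" ;;
    2) echo "no response to the connection attempt" ;;
    3) echo "no attempt made (for example, invalid parameters)" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Example: run the pg_isready command from this section, then:
#   describe_pg_isready $?
```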
Common Operational Tasks
Managing API Keys
Rotating Payment Gateway API Keys
Generate new API keys in the payment gateway admin portal.
Store the new keys in AWS Secrets Manager:
aws secretsmanager update-secret --secret-id flowmart/payment/stripe-api-key --secret-string '{"api_key": "sk_live_NEW_KEY", "webhook_secret": "whsec_NEW_SECRET"}'
Trigger key rotation in the service:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/reload-api-keys
Verify the new keys are active:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/verify-api-keys.js
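A quick prefix check before storing a new key can catch a test key pasted into the production secret. The sketch below relies on Stripe's key prefixes (sk_live_ and sk_test_ for secret keys, rk_live_ and rk_test_ for restricted keys). The function name is hypothetical, and the check says nothing about whether the key is valid.

```shell
# stripe_key_mode KEY   (hypothetical helper)
# Prints "live" or "test" based on the key prefix; fails for anything else.
stripe_key_mode() {
  case "$1" in
    sk_live_*|rk_live_*) echo "live" ;;
    sk_test_*|rk_test_*) echo "test" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

# Example: refuse to continue unless the new key is a live key.
#   [ "$(stripe_key_mode "$NEW_KEY")" = "live" ] || echo "not a live key" >&2
```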
Managing Refunds
Processing Manual Refunds
For special cases requiring manual intervention:
curl -X POST https://api.internal.flowmart.com/payment/transactions/{transactionId}/refund \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"amount": 1999, "reason": "Customer service request", "refundToOriginalMethod": true}'
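Hand-typed JSON is an easy place to make a mistake during a refund. The sketch below builds the request body from two arguments and rejects obviously wrong input. The function name is hypothetical. It assumes the amount is a whole number (the example value 1999 suggests minor currency units, but this runbook does not say so). For reasons containing quotes or other special characters, a JSON tool such as jq is a better choice.

```shell
# build_refund_payload AMOUNT REASON   (hypothetical helper)
# Prints the JSON body for the refund endpoint, or fails on invalid input.
build_refund_payload() {
  case "$1" in
    ''|*[!0-9]*|0*)
      echo "amount must be a positive integer without leading zeros" >&2
      return 1 ;;
  esac
  case "$2" in
    *'"'*|*'\'*)
      echo "reason must not contain quotes or backslashes" >&2
      return 1 ;;
  esac
  printf '{"amount": %s, "reason": "%s", "refundToOriginalMethod": true}\n' "$1" "$2"
}

# Example:
#   build_refund_payload 1999 "Customer service request"
```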
Finding Failed Refunds
To list refunds that failed in the last 24 hours so they can be retried:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/list-failed-refunds.js --last-24h
Handling Chargebacks
To record and process a new chargeback:
curl -X POST https://api.internal.flowmart.com/payment/transactions/{transactionId}/chargeback \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"chargebackReference": "CB12345", "amount": 1999, "reason": "Unauthorized transaction"}'
Payment Reconciliation
To trigger payment reconciliation with a payment gateway for a given date:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/reconcile-payments.js --gateway=stripe --date=2023-05-15
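The date in the command above is an example value. When reconciliation is run for the previous day, the date can be computed instead of typed. This is a sketch with a hypothetical function name. It uses UTC, which is an assumption, because the runbook does not state which time zone the reconciliation script expects.

```shell
# reconcile_date: print yesterday's date (UTC) as YYYY-MM-DD.   (hypothetical helper)
reconcile_date() {
  ts=$(( $(date +%s) - 86400 ))
  # GNU and BusyBox date accept "-d @SECONDS"; BSD/macOS date uses "-r SECONDS".
  date -u -d "@$ts" +%Y-%m-%d 2>/dev/null || date -u -r "$ts" +%Y-%m-%d
}

# Example:
#   node scripts/reconcile-payments.js --gateway=stripe --date="$(reconcile_date)"
```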
Recovery Procedures
Failed Transactions Recovery
If transactions are stuck or failed:
Identify stuck transactions:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/find-stuck-transactions.js
Check transaction status with the payment gateway:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/check-gateway-transaction.js --transaction-id=TXN123456
Resolve transactions that completed at the gateway but failed in our system:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/resolve-stuck-transaction.js --transaction-id=TXN123456 --status=completed
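When several transactions are stuck, it is safer to generate the resolve commands for review than to run them in a loop. The sketch below reads transaction IDs from standard input and prints one command per ID without executing anything. The function name and the ID character check are assumptions. Each ID should still be confirmed as completed at the gateway before its command is run.

```shell
# resolve_commands   (hypothetical helper)
# Reads transaction IDs from stdin, one per line, and prints the resolve
# command for each. Nothing is executed.
resolve_commands() {
  while IFS= read -r txn; do
    case "$txn" in
      '') continue ;;
      *[!A-Za-z0-9_-]*) echo "skipping suspicious ID: $txn" >&2; continue ;;
    esac
    printf 'node scripts/resolve-stuck-transaction.js --transaction-id=%s --status=completed\n' "$txn"
  done
}

# Example:
#   printf 'TXN123456\nTXN123457\n' | resolve_commands
```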
Payment Gateway Failure Recovery
If a payment gateway is unavailable:
Enable fallback gateway mode:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/enable-fallback-gateway
Monitor gateway status for recovery:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/monitor-gateway-health.js --gateway=stripe
Disable fallback mode once the primary gateway is restored:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/disable-fallback-gateway
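Disabling fallback mode on the first successful health check risks flapping if the primary gateway is still unstable. One option is to wait for several consecutive successes first. The sketch below does that for any check command. The function name is hypothetical, and it assumes the check exits with status 0 only when the gateway is healthy, which this runbook does not confirm for the scripts it lists.

```shell
# wait_until_stable REQUIRED MAX_ATTEMPTS INTERVAL COMMAND [ARGS...]   (hypothetical helper)
# Succeeds once COMMAND has exited 0 REQUIRED times in a row.
# Fails if that has not happened after MAX_ATTEMPTS runs.
wait_until_stable() {
  required=$1; max_attempts=$2; interval=$3
  shift 3
  streak=0; attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$((attempt + 1))
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      [ "$streak" -ge "$required" ] && return 0
    else
      streak=0   # any failure resets the count
    fi
    sleep "$interval"
  done
  return 1
}

# Example: require 3 consecutive successes, checking every 10 seconds,
# for at most 60 attempts (check_primary_gateway is a placeholder):
#   wait_until_stable 3 60 10 check_primary_gateway
```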
Database Failure Recovery
If the PostgreSQL database becomes unavailable:
Verify the status of the PostgreSQL cluster:
kubectl get pods -l app=postgresql -n data
Check if automatic failover has occurred:
kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl list
Once database availability is restored, validate the PaymentService functionality:
curl -X GET https://api.internal.flowmart.com/payment/health
Disaster Recovery
Complete Service Failure
In case of a complete service failure:
Initiate incident response by notifying the on-call team through PagerDuty.
If necessary, deploy to the disaster recovery environment:
./scripts/dr-failover.sh payment-service
Update DNS records to point to the DR environment:
aws route53 change-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --change-batch file://dr-dns-change.json
Enable simplified payment flow (if necessary):
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/enable-simplified-flow
Check the recovery status of the primary environment at regular intervals until it is restored.
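The DNS step above reads its changes from dr-dns-change.json, whose contents are not shown in this runbook. The sketch below generates a file in the Route 53 change batch format. The function name, the CNAME record type, the 60-second TTL, and the DR target in the example are all assumptions. The real file may use a different record type, such as an alias record.

```shell
# dr_dns_change_batch RECORD_NAME DR_TARGET   (hypothetical helper)
# Prints a Route 53 change batch that points RECORD_NAME at DR_TARGET.
dr_dns_change_batch() {
  cat <<EOF
{
  "Comment": "Fail over $1 to the DR environment",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "$2" }]
      }
    }
  ]
}
EOF
}

# Example (the DR hostname is a placeholder):
#   dr_dns_change_batch api.internal.flowmart.com dr.example.internal > dr-dns-change.json
```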
Maintenance Tasks
Deploying New Versions
kubectl set image deployment/payment-service -n payment payment-service=ecr.aws/flowmart/payment-service:$VERSION
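Setting the image only starts a rollout. It does not confirm that the new pods became ready. The sketch below wraps the command with a rollout status check and an automatic rollback on failure. The function name, the 300-second timeout, and the decision to roll back automatically are assumptions, not established team policy.

```shell
# deploy_payment_service VERSION   (hypothetical helper)
# Updates the image, waits for the rollout, and rolls back if it fails.
deploy_payment_service() {
  version=$1
  [ -n "$version" ] || { echo "usage: deploy_payment_service VERSION" >&2; return 2; }
  kubectl set image deployment/payment-service -n payment \
    "payment-service=ecr.aws/flowmart/payment-service:$version" || return 1
  if ! kubectl rollout status deployment/payment-service -n payment --timeout=300s; then
    echo "rollout did not complete; rolling back" >&2
    kubectl rollout undo deployment/payment-service -n payment
    return 1
  fi
}

# Example:
#   deploy_payment_service "$VERSION"
```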
Database Migrations
For database schema updates:
Notify stakeholders through the #maintenance Slack channel.
Create a migration plan and back up the database:
kubectl exec $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_dump -U postgres -d payments > payments_backup_$(date +%Y%m%d).sql
Apply database migrations:
kubectl apply -f payment-migration-job.yaml
Verify migration completion:
kubectl logs -l job-name=payment-db-migration -n payment
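A backup is only useful if it is complete, so it is worth checking the dump file before applying migrations. A plain-text pg_dump ends with a "PostgreSQL database dump complete" comment. The sketch below, with a hypothetical function name, checks that the file is non-empty and that the marker appears near the end. It does not prove the dump can be restored.

```shell
# verify_dump FILE   (hypothetical helper)
# Succeeds only if FILE looks like a complete plain-text pg_dump.
verify_dump() {
  [ -s "$1" ] || { echo "$1 is missing or empty" >&2; return 1; }
  tail -n 10 "$1" | grep -q 'PostgreSQL database dump complete' || {
    echo "$1 has no completion marker; the dump may be truncated" >&2
    return 1
  }
}

# Example:
#   verify_dump "payments_backup_$(date +%Y%m%d).sql" && kubectl apply -f payment-migration-job.yaml
```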
Compliance and Auditing
To generate PCI compliance reports:
kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/generate-pci-audit-report.js --month=2023-05
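The month in the command above is an example value. If the report is generated at the start of each month for the month that just ended, the argument can be derived from the current date. The sketch below is a pure string and arithmetic helper with a hypothetical name. It assumes the script expects the YYYY-MM format shown in the example.

```shell
# previous_month YYYY-MM   (hypothetical helper)
# Prints the month before the given one, also as YYYY-MM.
previous_month() {
  year=${1%-*}
  month=${1#*-}
  month=${month#0}   # strip a leading zero so 08 and 09 are not read as octal
  if [ "$month" -eq 1 ]; then
    year=$((year - 1)); month=12
  else
    month=$((month - 1))
  fi
  printf '%04d-%02d\n' "$year" "$month"
}

# Example:
#   node scripts/generate-pci-audit-report.js --month="$(previous_month "$(date +%Y-%m)")"
```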
Contact Information
Primary On-Call: Payments Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Payments Team Lead > Engineering Manager > CTO
Slack Channels:
- #payments-support (primary support channel)
- #payments-alerts (automated alerts)
- #incident-response (for major incidents)
External Contacts:
- Stripe Support: support@stripe.com, 1-888-555-1234
- PayPal Support: merchant-support@paypal.com, 1-888-555-5678