PaymentService Runbook

Operational runbook for troubleshooting and maintaining the PaymentService

This runbook provides operational procedures for the PaymentService, which processes payments and refunds and manages financial transactions in the FlowMart e-commerce platform.

Architecture

The PaymentService is responsible for:

  • Processing customer payments
  • Managing refunds and chargebacks
  • Integrating with external payment gateways
  • Storing payment transactions
  • Handling subscription billing

Service Dependencies

[Figure: service dependency graph]

Monitoring and Alerting

Key Metrics

Metric                      Description                        Warning Threshold  Critical Threshold
payment_processing_rate     Payments processed per minute      < 5                < 1
payment_success_rate        Percentage of successful payments  < 95%              < 90%
payment_processing_latency  Time to process a payment          > 3s               > 8s
refund_processing_latency   Time to process a refund           > 5s               > 15s
gateway_error_rate          Payment gateway errors             > 2%               > 5%
fraud_detection_latency     Time for fraud checks              > 1s               > 3s
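
The thresholds above can be applied by hand to a sampled value. The following is a minimal sketch, assuming a hypothetical helper named classify_metric that is not part of the PaymentService tooling. It takes a direction argument because rates alert when they fall below a threshold, while latencies and error rates alert when they rise above one.

```shell
# Hypothetical helper (not shipped with the service): classify a sampled
# metric value against the warning/critical thresholds from the table.
# Usage: classify_metric <value> <above|below> <warning> <critical>
classify_metric() {
  awk -v v="$1" -v dir="$2" -v warn="$3" -v crit="$4" 'BEGIN {
    v += 0; warn += 0; crit += 0          # force numeric comparison
    if (dir == "above") {
      if (v > crit) {
        print "critical"
      } else if (v > warn) {
        print "warning"
      } else {
        print "ok"
      }
    } else {
      if (v < crit) {
        print "critical"
      } else if (v < warn) {
        print "warning"
      } else {
        print "ok"
      }
    }
  }'
}
```

For example, classify_metric 4.2 above 3 8 prints warning for a payment_processing_latency of 4.2s, and classify_metric 89.5 below 95 90 prints critical for a payment_success_rate of 89.5%.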

Dashboards

Common Alerts

Alert                         Description                           Troubleshooting Steps
PaymentServiceHighErrorRate   Payment failure rate above threshold  See High Error Rate
PaymentServiceGatewayFailure  Payment gateway connection issues     See Payment Gateway Issues
PaymentServiceHighLatency     Payment processing latency issues     See High Latency
PaymentServiceDatabaseIssues  Database connection issues            See Database Issues

Troubleshooting Guides

High Error Rate

If the service is experiencing a high payment error rate:

  1. Check application logs for error patterns:

    kubectl logs -l app=payment-service -n payment --tail=100
    
  2. Check the payment gateways' public status pages for ongoing incidents.

  3. Check for patterns in failed transactions:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/query-failed-transactions.js --last-hour
    
  4. Check for recent deployments that might have introduced issues:

    kubectl rollout history deployment/payment-service -n payment
    
  5. Verify if the issue is specific to a payment method (credit card, PayPal, etc.):

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/payment-method-success-rates.js
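
Most commands in this runbook repeat the same pod lookup. As a convenience, that lookup can be wrapped once. This is a sketch only: payment_exec is a hypothetical name, the Running-phase field selector is an added assumption to skip pods that are still starting, and -it is dropped on the assumption that the scripts are not interactive.

```shell
# Hypothetical wrapper around the pod lookup used throughout this runbook.
# Usage: payment_exec <command> [args...]
payment_exec() {
  local pod
  # Pick the first pod that is actually running (assumption: any replica will do).
  pod=$(kubectl get pods -l app=payment-service -n payment \
        --field-selector=status.phase=Running \
        -o jsonpath='{.items[0].metadata.name}') || return 1
  if [ -z "$pod" ]; then
    echo "no running payment-service pod found" >&2
    return 1
  fi
  kubectl exec "$pod" -n payment -- "$@"
}
```

With this in place, step 5 becomes payment_exec node scripts/payment-method-success-rates.js.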
    

Payment Gateway Issues

If there are issues with payment gateways:

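  1. Check gateway connectivity from inside a pod. $STRIPE_TEST_KEY is expanded by your local shell before the command reaches the pod, so it must be set locally: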

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -o /dev/null -s -w "%{http_code}\n" https://api.stripe.com/v1/charges -H "Authorization: Bearer $STRIPE_TEST_KEY"
    
  2. Check payment gateway API keys rotation status:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/check-api-key-rotation.js
    
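  3. Check gateway timeouts in application logs (--tail=-1 is required because kubectl returns only the last 10 lines per pod when a label selector is used):

    kubectl logs -l app=payment-service -n payment --tail=-1 | grep -i "gateway timeout"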
    
  4. Verify if the issue is isolated to a specific gateway:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/gateway-health-check.js
    
  5. Switch to backup payment gateway if primary is down:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/payment/switch-gateway -H "Content-Type: application/json" -d '{"primaryGateway": "paypal", "reason": "Stripe outage"}'
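
The connectivity check in step 1 prints only an HTTP status code, which takes some interpretation. The sketch below maps the code to a likely cause. The function name and messages are illustrative, and the mapping assumes standard HTTP semantics plus curl's convention of printing 000 when no HTTP response was received at all.

```shell
# Hypothetical helper: turn the status code printed by the curl check in
# step 1 into a troubleshooting hint.
# Usage: interpret_gateway_status <http_code>
interpret_gateway_status() {
  case "$1" in
    000)     echo "no HTTP response: DNS, network, or TLS problem" ;;
    2??)     echo "gateway reachable and credentials accepted" ;;
    401|403) echo "gateway reachable but credentials rejected" ;;
    429)     echo "gateway reachable but rate limiting requests" ;;
    5??)     echo "gateway reachable but returning server errors" ;;
    *)       echo "unexpected status $1: inspect the full response" ;;
  esac
}
```

A 401 or 403 result points toward the key rotation check in step 2, while 000 or 5xx points toward the gateway itself and the backup gateway in step 5.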
    

High Latency

If the service is experiencing high latency:

  1. Check system metrics:

    kubectl top pods -n payment
    
  2. Check database connection pool:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/db-pool-stats.js
    
  3. Check slow queries in the payment database:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT query, calls, mean_exec_time, max_exec_time FROM pg_stat_statements WHERE mean_exec_time > 100 ORDER BY mean_exec_time DESC LIMIT 10;"
    
  4. Check payment gateway response times:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/gateway-latency-check.js
    
  5. Scale the service if needed:

    kubectl scale deployment payment-service -n payment --replicas=5
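
Before scaling, it can help to see which pods are actually under CPU pressure. The filter below is a sketch that reads the output of kubectl top pods with the --no-headers flag. The name hot_pods and the 500-millicore limit in the usage line are arbitrary choices for illustration, not values taken from the service's configuration.

```shell
# Hypothetical filter: read `kubectl top pods --no-headers` output on stdin
# and print pods whose CPU usage exceeds the given limit in millicores.
# Usage: kubectl top pods -n payment --no-headers | hot_pods 500
hot_pods() {
  awk -v limit="$1" '{
    cpu = $2
    if (cpu ~ /m$/) {
      sub(/m$/, "", cpu)        # "250m" -> 250 millicores
    } else {
      cpu = cpu * 1000          # "2" (whole cores) -> 2000 millicores
    }
    if (cpu + 0 > limit + 0) {
      print $1, cpu "m"
    }
  }'
}
```

If every pod appears in the output, adding replicas as in step 5 is more likely to help than if a single pod is the outlier.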
    

Database Issues

If there are database issues:

  1. Check PostgreSQL status:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_isready -U postgres -d payments
    
  2. Check for long-running transactions:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT pid, now() - xact_start AS duration, state, query FROM pg_stat_activity WHERE state != 'idle' AND xact_start IS NOT NULL ORDER BY duration DESC LIMIT 10;"
    
  3. Check for database locks:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT l.relation::regclass, l.mode, l.pid, l.granted, a.state, a.query FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid WHERE l.relation = 'payments.transactions'::regclass;"
    
  4. Restart database connections if needed:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/refresh-db-connections
    

Common Operational Tasks

Managing API Keys

Rotating Payment Gateway API Keys

  1. Generate new API keys in the payment gateway admin portal.

  2. Store the new keys in AWS Secrets Manager:

    aws secretsmanager update-secret --secret-id flowmart/payment/stripe-api-key --secret-string '{"api_key": "sk_live_NEW_KEY", "webhook_secret": "whsec_NEW_SECRET"}'
    
  3. Trigger key rotation in the service:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/reload-api-keys
    
  4. Verify the new keys are active:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/verify-api-keys.js
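
A mistyped or test-mode key stored in step 2 would only surface in step 4. The following sketch checks the values before they are stored. It assumes standard Stripe secret keys (prefix sk_live_) and webhook signing secrets (prefix whsec_); validate_stripe_secrets is a hypothetical name, and the check would need adjusting if restricted keys or another gateway are in use.

```shell
# Hypothetical pre-flight check for step 2: reject values that do not look
# like a live Stripe secret key and a webhook signing secret.
# Usage: validate_stripe_secrets <api_key> <webhook_secret>
validate_stripe_secrets() {
  case "$1" in
    sk_live_?*) ;;
    sk_test_*)  echo "refusing test-mode key for a live secret" >&2; return 1 ;;
    *)          echo "API key does not start with sk_live_" >&2; return 1 ;;
  esac
  case "$2" in
    whsec_?*) ;;
    *)        echo "webhook secret does not start with whsec_" >&2; return 1 ;;
  esac
}
```

The function only inspects prefixes. It does not contact the gateway, so step 4 remains the authoritative verification.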
    

Managing Refunds

Processing Manual Refunds

For special cases requiring manual intervention, replace {transactionId} with the ID of the transaction to refund:

curl -X POST https://api.internal.flowmart.com/payment/transactions/{transactionId}/refund \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"amount": 1999, "reason": "Customer service request", "refundToOriginalMethod": true}'

Finding Failed Refunds

To identify and retry failed refunds:

kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/list-failed-refunds.js --last-24h

Handling Chargebacks

To record and process a new chargeback:

curl -X POST https://api.internal.flowmart.com/payment/transactions/{transactionId}/chargeback \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"chargebackReference": "CB12345", "amount": 1999, "reason": "Unauthorized transaction"}'

Payment Reconciliation

To trigger payment reconciliation with a payment gateway for a given date:

kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/reconcile-payments.js --gateway=stripe --date=2023-05-15

Recovery Procedures

Failed Transactions Recovery

If transactions are stuck or failed:

  1. Identify stuck transactions:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/find-stuck-transactions.js
    
  2. Check transaction status with the payment gateway:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/check-gateway-transaction.js --transaction-id=TXN123456
    
  3. Resolve transactions that completed at the gateway but failed in our system:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/resolve-stuck-transaction.js --transaction-id=TXN123456 --status=completed
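
When several transactions are stuck, step 3 has to be repeated per transaction. The sketch below prints the command for each ID so the batch can be reviewed before anything is run. The helper name is hypothetical, and it assumes every ID in the input has already been confirmed as completed at the gateway in step 2.

```shell
# Hypothetical dry-run helper: read transaction IDs (one per line) on stdin
# and print the step 3 resolve command for each. Nothing is executed.
# Usage: print_resolve_commands < confirmed-transaction-ids.txt
print_resolve_commands() {
  while IFS= read -r txn; do
    [ -n "$txn" ] || continue    # skip blank lines
    printf 'node scripts/resolve-stuck-transaction.js --transaction-id=%s --status=completed\n' "$txn"
  done
}
```

Each printed line is the part of the step 3 command that follows the -- separator.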
    

Payment Gateway Failure Recovery

If a payment gateway is unavailable:

  1. Enable fallback gateway mode:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/enable-fallback-gateway
    
  2. Monitor gateway status for recovery:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/monitor-gateway-health.js --gateway=stripe
    
  3. Disable fallback mode once the primary gateway is restored:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/disable-fallback-gateway
    

Database Failure Recovery

If the PostgreSQL database becomes unavailable:

  1. Verify the status of the PostgreSQL cluster:

    kubectl get pods -l app=postgresql -n data
    
  2. Check if automatic failover has occurred:

    kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl list
    
  3. Once database availability is restored, validate the PaymentService functionality:

    curl -X GET https://api.internal.flowmart.com/payment/health
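
After a failover the health endpoint may take a while to recover, so a single request is not always conclusive. A small polling loop is one way to handle this. The sketch below is generic: retry_until_ok is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```shell
# Hypothetical polling helper: run a command until it succeeds or the
# attempts run out.
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, retry_until_ok 10 15 curl -fsS -o /dev/null https://api.internal.flowmart.com/payment/health polls the health endpoint up to ten times, 15 seconds apart. The -f flag makes curl exit non-zero on HTTP error responses, which is what drives the retry.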
    

Disaster Recovery

Complete Service Failure

In case of a complete service failure:

  1. Initiate incident response by notifying the on-call team through PagerDuty.

  2. If necessary, deploy to the disaster recovery environment:

    ./scripts/dr-failover.sh payment-service
    
  3. Update DNS records to point to the DR environment:

    aws route53 change-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --change-batch file://dr-dns-change.json
    
  4. Enable simplified payment flow (if necessary):

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/enable-simplified-flow
    
  5. Regularly check primary environment recovery status.
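
Step 3 reads its change set from dr-dns-change.json, whose contents are not shown in this runbook. The sketch below generates a file in the change-batch format that Route 53 expects. The record type (CNAME), the 60-second TTL, and the function name are assumptions for illustration; the real record names and targets must come from the DR plan.

```shell
# Hypothetical generator for the change batch consumed in step 3.
# Usage: dr_dns_change_batch <record_name> <dr_target> > dr-dns-change.json
dr_dns_change_batch() {
  cat <<EOF
{
  "Comment": "Fail over $1 to the DR environment",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "$2" }]
      }
    }
  ]
}
EOF
}
```

UPSERT creates the record if it is missing and overwrites it otherwise, so the same file shape can be reused with the primary target when failing back.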

Maintenance Tasks

Deploying New Versions

kubectl set image deployment/payment-service -n payment payment-service=ecr.aws/flowmart/payment-service:$VERSION
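
The command above returns as soon as the Deployment is updated, not when the new pods are healthy. One possible follow-up is to wait for the rollout and revert if it does not complete. The wrapper below is a sketch: deploy_payment_service is a hypothetical name and the five-minute timeout is an arbitrary choice.

```shell
# Hypothetical wrapper: update the image, wait for the rollout to finish,
# and roll back to the previous revision if it does not.
# Usage: deploy_payment_service <version>
deploy_payment_service() {
  local version="$1"
  if [ -z "$version" ]; then
    echo "usage: deploy_payment_service <version>" >&2
    return 2
  fi
  kubectl set image deployment/payment-service -n payment \
    "payment-service=ecr.aws/flowmart/payment-service:$version" || return 1
  if ! kubectl rollout status deployment/payment-service -n payment --timeout=5m; then
    echo "rollout did not complete; reverting to the previous revision" >&2
    kubectl rollout undo deployment/payment-service -n payment
    return 1
  fi
}
```

The function returns non-zero whenever the new version did not become healthy, even if the rollback itself succeeded, so callers can tell that the deployment did not take effect.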

Database Migrations

For database schema updates:

  1. Notify stakeholders through the #maintenance Slack channel.

  2. Create a migration plan and back up the database. The command omits -it because an allocated TTY can corrupt the redirected dump:

    kubectl exec $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_dump -U postgres -d payments > payments_backup_$(date +%Y%m%d).sql
    
  3. Apply database migrations:

    kubectl apply -f payment-migration-job.yaml
    
  4. Verify migration completion (--tail=-1 shows the full log rather than only the last 10 lines):

    kubectl logs -l job-name=payment-db-migration -n payment --tail=-1
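
Before applying migrations in step 3, it is worth confirming that the backup from step 2 is complete, since an interrupted kubectl exec can leave a truncated file. The sketch below relies on the comment lines that pg_dump writes at the start and end of a plain-format dump. verify_backup is a hypothetical helper and is not a substitute for a test restore.

```shell
# Hypothetical sanity check for a plain-format pg_dump file: the file must be
# non-empty and carry pg_dump's opening and closing comment lines.
# Usage: verify_backup payments_backup_20230515.sql
verify_backup() {
  [ -s "$1" ] || { echo "$1 is missing or empty" >&2; return 1; }
  head -n 5 "$1" | grep -q 'PostgreSQL database dump' \
    || { echo "$1 has no pg_dump header" >&2; return 1; }
  tail -n 10 "$1" | grep -q 'PostgreSQL database dump complete' \
    || { echo "$1 looks truncated" >&2; return 1; }
}
```

A non-zero exit status means the backup should be retaken before the migration job is applied.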
    

Compliance and Auditing

To generate PCI compliance reports:

kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/generate-pci-audit-report.js --month=2023-05

Contact Information

Primary On-Call: Payments Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Payments Team Lead > Engineering Manager > CTO

Slack Channels:

  • #payments-support (primary support channel)
  • #payments-alerts (automated alerts)
  • #incident-response (for major incidents)

External Contacts:

Reference Information