PaymentService Runbook

Operational runbook for troubleshooting and maintaining the PaymentService

This runbook provides operational procedures for the PaymentService, which processes payments and refunds and manages financial transactions in the FlowMart e-commerce platform.

Architecture

The PaymentService is responsible for:

  • Processing customer payments
  • Managing refunds and chargebacks
  • Integrating with external payment gateways
  • Storing payment transactions
  • Handling subscription billing

Service Dependencies

[Figure: PaymentService dependency graph]

Monitoring and Alerting

Key Metrics

Metric                      Description                        Warning Threshold  Critical Threshold
payment_processing_rate     Payments processed per minute      < 5                < 1
payment_success_rate        Percentage of successful payments  < 95%              < 90%
payment_processing_latency  Time to process a payment          > 3s               > 8s
refund_processing_latency   Time to process a refund           > 5s               > 15s
gateway_error_rate          Payment gateway errors             > 2%               > 5%
fraud_detection_latency     Time for fraud checks              > 1s               > 3s
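
The thresholds above run in two directions: rate and success metrics alert when they fall, while latency and error metrics alert when they rise. The following is a minimal sketch of that comparison, useful when checking a reading by hand. The function name and its "above"/"below" argument are hypothetical, and values are assumed to be plain numbers with the unit (s, %) stripped.

```shell
#!/usr/bin/env bash
# classify_metric VALUE WARN CRIT DIRECTION
# DIRECTION is "above" for metrics that alert when they rise (latency, error
# rate) and "below" for metrics that alert when they fall (processing rate,
# success rate). Prints one of: ok, warning, critical.
classify_metric() {
  awk -v v="$1" -v w="$2" -v c="$3" -v d="$4" 'BEGIN {
    v += 0; w += 0; c += 0   # force numeric comparison
    if (d == "above") {
      if (v > c) print "critical"
      else if (v > w) print "warning"
      else print "ok"
    } else {
      if (v < c) print "critical"
      else if (v < w) print "warning"
      else print "ok"
    }
  }'
}

# Example: a payment_processing_latency reading of 4 seconds
# classify_metric 4 3 8 above
```

For example, a payment_success_rate of 92 would be checked with `classify_metric 92 95 90 below`.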

Dashboards

Common Alerts

Alert                         Description                           Troubleshooting Steps
PaymentServiceHighErrorRate   Payment failure rate above threshold  See High Error Rate
PaymentServiceGatewayFailure  Payment gateway connection issues     See Payment Gateway Issues
PaymentServiceHighLatency     Payment processing latency issues     See High Latency
PaymentServiceDatabaseIssues  Database connection issues            See Database Issues

Troubleshooting Guides

High Error Rate

If the service is experiencing a high payment error rate:

  1. Check application logs for error patterns:

    kubectl logs -l app=payment-service -n payment --tail=100
  2. Check each payment gateway's status page for reported incidents.

  3. Check for patterns in failed transactions:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/query-failed-transactions.js --last-hour
  4. Check for recent deployments that might have introduced issues:

    kubectl rollout history deployment/payment-service -n payment
  5. Verify if the issue is specific to a payment method (credit card, PayPal, etc.):

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/payment-method-success-rates.js
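
Most commands in this runbook repeat the same pod lookup. As a convenience, that lookup can be wrapped once. This is a sketch only: the function names `payment_pod` and `payment_exec` are hypothetical, and the wrapper omits `-it` because the scripts shown here do not need an interactive terminal.

```shell
#!/usr/bin/env bash
# payment_pod: print the name of the first pod carrying the service label.
payment_pod() {
  kubectl get pods -l app=payment-service -n payment \
    -o jsonpath='{.items[0].metadata.name}'
}

# payment_exec CMD [ARGS...]: run a command inside that pod.
# Fails early with a message if no pod is found.
payment_exec() {
  local pod
  pod="$(payment_pod)" || return 1
  if [ -z "$pod" ]; then
    echo "no payment-service pod found in namespace payment" >&2
    return 1
  fi
  kubectl exec "$pod" -n payment -- "$@"
}

# Example:
# payment_exec node scripts/query-failed-transactions.js --last-hour
```

With the helper loaded, step 5 above would become `payment_exec node scripts/payment-method-success-rates.js`.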

Payment Gateway Issues

If there are issues with payment gateways:

  1. Check gateway connectivity:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -o /dev/null -s -w "%{http_code}\n" https://api.stripe.com/v1/charges -H "Authorization: Bearer $STRIPE_TEST_KEY"
  2. Check the rotation status of the payment gateway API keys:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/check-api-key-rotation.js
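  3. Check gateway timeouts in application logs. With a label selector, kubectl logs returns only the last 10 lines per pod by default, so request the full log explicitly:

    kubectl logs -l app=payment-service -n payment --tail=-1 | grep "gateway timeout"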
  4. Verify if the issue is isolated to a specific gateway:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/gateway-health-check.js
  5. Switch to backup payment gateway if primary is down:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/payment/switch-gateway -H "Content-Type: application/json" -d '{"primaryGateway": "paypal", "reason": "Stripe outage"}'

High Latency

If the service is experiencing high latency:

  1. Check system metrics:

    kubectl top pods -n payment
  2. Check database connection pool:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/db-pool-stats.js
  3. Check slow queries in the payment database:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT query, calls, mean_exec_time, max_exec_time FROM pg_stat_statements WHERE mean_exec_time > 100 ORDER BY mean_exec_time DESC LIMIT 10;"
  4. Check payment gateway response times:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/gateway-latency-check.js
  5. Scale the service if needed:

    kubectl scale deployment payment-service -n payment --replicas=5

Database Issues

If there are database issues:

  1. Check PostgreSQL status:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_isready -U postgres -d payments
  2. Check for long-running transactions:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT pid, now() - xact_start AS duration, state, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;"
  3. Check for database locks:

    kubectl exec -it $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- psql -U postgres -d payments -c "SELECT l.relation::regclass, l.mode, l.pid, l.granted FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid WHERE l.relation = 'payments.transactions'::regclass;"
  4. Restart database connections if needed:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/refresh-db-connections

Common Operational Tasks

Managing API Keys

Rotating Payment Gateway API Keys

  1. Generate new API keys in the payment gateway admin portal.

  2. Store the new keys in AWS Secrets Manager:

    aws secretsmanager update-secret --secret-id flowmart/payment/stripe-api-key --secret-string '{"api_key": "sk_live_NEW_KEY", "webhook_secret": "whsec_NEW_SECRET"}'
  3. Trigger key rotation in the service:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/reload-api-keys
  4. Verify the new keys are active:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/verify-api-keys.js

Managing Refunds

Processing Manual Refunds

For special cases requiring manual intervention:

curl -X POST https://api.internal.flowmart.com/payment/transactions/{transactionId}/refund \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"amount": 1999, "reason": "Customer service request", "refundToOriginalMethod": true}'
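
Because a manual refund moves real money, it can help to validate the request body before sending it. The sketch below builds the same JSON shape as the command above. It assumes that `amount` must be a positive integer (the example value 1999 suggests minor currency units, but this runbook does not confirm that), and the function name `refund_payload` is hypothetical.

```shell
#!/usr/bin/env bash
# refund_payload AMOUNT REASON
# Prints the JSON body for the refund request, or fails if AMOUNT is not a
# positive integer without leading zeros (an assumed validation rule).
refund_payload() {
  local amount="$1" reason="$2"
  case "$amount" in
    ''|*[!0-9]*|0*)
      echo "amount must be a positive integer" >&2
      return 1
      ;;
  esac
  # Escape backslashes and double quotes so the reason stays valid JSON.
  # Control characters in the reason are not handled by this sketch.
  reason=$(printf '%s' "$reason" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"amount": %s, "reason": "%s", "refundToOriginalMethod": true}\n' \
    "$amount" "$reason"
}

# Example:
# refund_payload 1999 "Customer service request"
```

The output can be passed to the refund command with `-d "$(refund_payload 1999 'Customer service request')"`.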

Finding Failed Refunds

To identify and retry failed refunds:

kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/list-failed-refunds.js --last-24h

Handling Chargebacks

To record and process a new chargeback:

curl -X POST https://api.internal.flowmart.com/payment/transactions/{transactionId}/chargeback \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"chargebackReference": "CB12345", "amount": 1999, "reason": "Unauthorized transaction"}'

Payment Reconciliation

To trigger payment reconciliation with a payment gateway:

kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/reconcile-payments.js --gateway=stripe --date=2023-05-15
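
The `--date` value in the command above is a fixed example. If reconciliation is normally run for the previous day (an assumption, as this runbook does not say), the date can be computed instead of typed. The helper name `yesterday` is hypothetical. It tries the GNU `date` syntax first and falls back to the BSD/macOS syntax.

```shell
#!/usr/bin/env bash
# yesterday: print the previous day's date as YYYY-MM-DD.
yesterday() {
  # GNU date understands "-d yesterday"; BSD/macOS date uses "-v-1d".
  date -d yesterday +%Y-%m-%d 2>/dev/null || date -v-1d +%Y-%m-%d
}

# Example:
# node scripts/reconcile-payments.js --gateway=stripe --date="$(yesterday)"
```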

Recovery Procedures

Failed Transactions Recovery

If transactions are stuck or failed:

  1. Identify stuck transactions:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/find-stuck-transactions.js
  2. Check transaction status with the payment gateway:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/check-gateway-transaction.js --transaction-id=TXN123456
  3. Resolve transactions that completed at the gateway but failed in our system:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/resolve-stuck-transaction.js --transaction-id=TXN123456 --status=completed

Payment Gateway Failure Recovery

If a payment gateway is unavailable:

  1. Enable fallback gateway mode:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/enable-fallback-gateway
  2. Monitor gateway status for recovery:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/monitor-gateway-health.js --gateway=stripe
  3. Disable fallback mode once the primary gateway is restored:

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/disable-fallback-gateway

Database Failure Recovery

If the PostgreSQL database becomes unavailable:

  1. Verify the status of the PostgreSQL cluster:

    kubectl get pods -l app=postgresql -n data
  2. Check if automatic failover has occurred:

    kubectl exec -it $(kubectl get pods -l app=postgresql-patroni -n data -o jsonpath='{.items[0].metadata.name}') -n data -- patronictl list
  3. Once database availability is restored, validate the PaymentService functionality:

    curl -X GET https://api.internal.flowmart.com/payment/health
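
After a failover the service may take some time to reconnect, so a single health request can be misleading. The sketch below polls the health endpoint until it succeeds or a retry budget runs out. It assumes the endpoint returns HTTP 200 when the service is healthy, which this runbook does not state. The function name and the default attempt count and delay are hypothetical.

```shell
#!/usr/bin/env bash
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it returns HTTP 200 (assumed to mean healthy).
# Returns 0 on success, 1 if every attempt fails.
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    # On a connection failure curl reports the code as 000.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP ${code:-none}" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example:
# wait_for_health https://api.internal.flowmart.com/payment/health
```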

Disaster Recovery

Complete Service Failure

In case of a complete service failure:

  1. Initiate incident response by notifying the on-call team through PagerDuty.

  2. If necessary, deploy to the disaster recovery environment:

    ./scripts/dr-failover.sh payment-service
  3. Update DNS records to point to the DR environment:

    aws route53 change-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --change-batch file://dr-dns-change.json
  4. Enable simplified payment flow (if necessary):

    kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- curl -X POST localhost:3000/internal/api/system/enable-simplified-flow
  5. Regularly check primary environment recovery status.

Maintenance Tasks

Deploying New Versions

kubectl set image deployment/payment-service -n payment payment-service=ecr.aws/flowmart/payment-service:$VERSION
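
`kubectl set image` returns as soon as the Deployment is updated, not when the new pods are ready. One possible way to confirm the rollout, and to revert if it does not finish, is sketched below. The function name, the 300-second timeout and the automatic `rollout undo` are assumptions, not part of the documented deployment procedure.

```shell
#!/usr/bin/env bash
# deploy_payment_service VERSION
# Sets the new image, waits for the rollout to finish, and reverts to the
# previous revision if the rollout does not complete within the timeout.
deploy_payment_service() {
  local version="$1"
  if [ -z "$version" ]; then
    echo "usage: deploy_payment_service VERSION" >&2
    return 2
  fi
  kubectl set image deployment/payment-service -n payment \
    "payment-service=ecr.aws/flowmart/payment-service:${version}" || return 1
  if ! kubectl rollout status deployment/payment-service -n payment --timeout=300s; then
    echo "rollout failed; reverting to the previous revision" >&2
    kubectl rollout undo deployment/payment-service -n payment
    return 1
  fi
}

# Example:
# deploy_payment_service "$VERSION"
```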

Database Migrations

For database schema updates:

  1. Notify stakeholders through the #maintenance Slack channel.

  2. Create a migration plan and back up the database (omit -it so that no TTY is allocated, which would otherwise alter the redirected dump):

    kubectl exec $(kubectl get pods -l app=postgresql -n data -o jsonpath='{.items[0].metadata.name}') -n data -- pg_dump -U postgres -d payments > payments_backup_$(date +%Y%m%d).sql
  3. Apply database migrations:

    kubectl apply -f payment-migration-job.yaml
  4. Verify migration completion:

    kubectl logs -l job-name=payment-db-migration -n payment
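
Reading the logs shows what the migration printed but not whether the Job finished. A sketch that waits for the Job's `complete` condition first is given below. The Job name `payment-db-migration` is inferred from the `job-name` label above, and the function name and timeout are hypothetical. Note that `kubectl wait` on this condition runs until the timeout if the Job fails rather than completes.

```shell
#!/usr/bin/env bash
# wait_for_migration [JOB_NAME] [TIMEOUT]
# Waits for the migration Job to report the "complete" condition and prints
# its recent log lines if it does not.
wait_for_migration() {
  local job="${1:-payment-db-migration}" timeout="${2:-600s}"
  if kubectl wait --for=condition=complete "job/${job}" -n payment --timeout="$timeout"; then
    echo "migration job ${job} completed"
  else
    echo "migration job ${job} did not complete within ${timeout}" >&2
    kubectl logs -l "job-name=${job}" -n payment --tail=50
    return 1
  fi
}

# Example:
# wait_for_migration payment-db-migration 900s
```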

Compliance and Auditing

To generate PCI compliance reports:

kubectl exec -it $(kubectl get pods -l app=payment-service -n payment -o jsonpath='{.items[0].metadata.name}') -n payment -- node scripts/generate-pci-audit-report.js --month=2023-05

Contact Information

Primary On-Call: Payments Team (rotating schedule)
Secondary On-Call: Platform Team
Escalation Path: Payments Team Lead > Engineering Manager > CTO

Slack Channels:

  • #payments-support (primary support channel)
  • #payments-alerts (automated alerts)
  • #incident-response (for major incidents)

External Contacts:

Reference Information