Fraud Analytics DB (v0.0.1)

Analytics database for fraud detection patterns, risk scoring, and transaction analysis

What is this?

Fraud Analytics DB is a ClickHouse columnar database optimized for high-speed analytics on transaction patterns, fraud signals, and risk assessment. It stores historical fraud data and real-time transaction features used by machine learning models.

What does it store?

  • Transaction Features: Extracted features from payment transactions for ML model scoring
  • Fraud Signals: Device fingerprints, IP addresses, velocity metrics, behavioral patterns
  • Risk Scores: Historical risk scores and fraud decisions for model training
  • Fraud Cases: Confirmed fraud incidents with investigation notes
  • Model Predictions: ML model outputs and confidence scores for analysis

Who writes to it?

  • FraudDetectionService writes real-time transaction features and risk scores
  • Data Pipeline ingests historical fraud data from payment systems
  • Manual Review Team updates fraud case outcomes and labels

Who reads from it?

  • FraudDetectionService queries historical patterns for real-time scoring
  • ML Training Pipeline extracts training data for model retraining
  • Data Science Team analyzes fraud trends and model performance
  • Business Intelligence generates fraud metrics and dashboards
  • Compliance Team audits fraud detection decisions

High-level data model

  • transaction_features: Real-time extracted features (device, location, amount patterns)
  • fraud_signals: Aggregated signals (velocity, anomaly scores)
  • risk_decisions: Historical risk decisions and outcomes
  • fraud_labels: Ground truth labels for confirmed fraud cases
  • Time-series data partitioned by day for optimal query performance
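As an illustration of the daily partitioning scheme, the transaction_features table might be defined along these lines (a hypothetical sketch; the real DDL lives in the schema migrations, and the column list here is only the subset used in the example queries below):

```sql
-- Hypothetical sketch of transaction_features; illustrates the
-- one-partition-per-day layout, not the authoritative schema.
CREATE TABLE transaction_features
(
    event_date          Date,
    created_at          DateTime,
    customer_id         UInt64,
    country_code        LowCardinality(String),
    ip_address          String,
    device_fingerprint  String,
    amount_cents        UInt64,
    is_fraud            UInt8,
    risk_score          Float32
)
ENGINE = MergeTree
PARTITION BY event_date                          -- one partition per day
ORDER BY (event_date, customer_id, created_at);  -- sorting key drives pruning
```

Partitioning by event_date lets ClickHouse skip entire days of data when queries filter on the partition key, which is why the guidance below insists on event_date in WHERE clauses.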

Common queries

-- Calculate fraud rate by country (last 30 days)
SELECT
    country_code,
    COUNT(*) AS total_transactions,
    SUM(is_fraud) AS fraud_count,
    SUM(is_fraud) * 100.0 / COUNT(*) AS fraud_rate
FROM transaction_features
WHERE event_date >= today() - 30
GROUP BY country_code
ORDER BY fraud_rate DESC;

-- Get high-risk transaction patterns
SELECT
    customer_id,
    COUNT(*) AS transaction_count,
    AVG(risk_score) AS avg_risk_score,
    MAX(risk_score) AS max_risk_score
FROM risk_decisions
WHERE event_date = today()
  AND risk_score > 80
GROUP BY customer_id
HAVING transaction_count > 5;

-- Analyze model performance over time
SELECT
    toStartOfHour(created_at) AS hour,
    model_version,
    AVG(risk_score) AS avg_score,
    SUM(is_fraud) AS actual_fraud_count,
    COUNT(*) AS total_predictions
FROM risk_decisions
WHERE created_at >= now() - INTERVAL 7 DAY
GROUP BY hour, model_version
ORDER BY hour DESC;

-- Find velocity anomalies (multiple transactions, same customer)
SELECT
    customer_id,
    COUNT(*) AS tx_count,
    SUM(amount_cents) AS total_amount,
    arrayDistinct(groupArray(ip_address)) AS unique_ips,
    arrayDistinct(groupArray(device_fingerprint)) AS unique_devices
FROM transaction_features
WHERE event_date = today()
  AND created_at >= now() - INTERVAL 1 HOUR
GROUP BY customer_id
HAVING tx_count > 5 OR length(unique_ips) > 3;

Access patterns and guidance

  • ClickHouse reads only the columns a query references and parallelizes scans across cores, so analytical queries stay fast
  • Use event_date partition key in WHERE clauses for optimal performance
  • Aggregate queries benefit from ClickHouse’s native aggregation functions
  • Avoid SELECT * on large tables; specify needed columns
  • Use materialized views for frequently accessed aggregations
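For example, a frequently requested daily fraud-rate rollup could be maintained as a materialized view so dashboards avoid rescanning raw data (a sketch; the view name daily_fraud_by_country is illustrative, not an existing object):

```sql
-- Illustrative materialized view keeping a per-day, per-country rollup.
-- SummingMergeTree collapses rows with the same sorting key by summing
-- the numeric columns at merge time.
CREATE MATERIALIZED VIEW daily_fraud_by_country
ENGINE = SummingMergeTree
PARTITION BY event_date
ORDER BY (event_date, country_code)
AS SELECT
    event_date,
    country_code,
    count() AS total_transactions,
    sum(is_fraud) AS fraud_count
FROM transaction_features
GROUP BY event_date, country_code;
```

The view is populated automatically on each insert into transaction_features, so the fraud-rate query from the previous section can read a few thousand rollup rows instead of the raw table.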

Data retention and archiving

  • Hot data: Last 90 days on high-performance SSD storage
  • Warm data: 91 days to 2 years on standard storage
  • Cold archive: 2-5 years on object storage (S3)
  • Automated archival jobs run monthly
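In ClickHouse, this hot/warm/cold tiering can be expressed with table TTLs moving parts between volumes of a storage policy; a sketch of what that could look like (the volume names 'warm' and 'cold' are assumptions defined in the server's storage configuration, not names taken from this deployment):

```sql
-- Sketch of TTL-based tiering; volume names are hypothetical and must
-- match a storage policy configured on the server. Rows older than
-- 5 years are dropped, matching the retention schedule above.
ALTER TABLE transaction_features
    MODIFY TTL
        event_date + INTERVAL 90 DAY TO VOLUME 'warm',
        event_date + INTERVAL 2 YEAR TO VOLUME 'cold',
        event_date + INTERVAL 5 YEAR DELETE;
```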

Security and compliance

  • Contains sensitive fraud signals and PII (IP addresses, device IDs)
  • Access requires security clearance and fraud team membership
  • All queries logged for audit trail
  • Data anonymized for ML training datasets shared outside fraud team
  • GDPR right-to-deletion supported via customer_id purge jobs
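A purge job of this kind typically issues a lightweight mutation keyed on customer_id; for example (a sketch only — the real job likely batches across every table holding the customer's data):

```sql
-- Sketch of a right-to-deletion mutation using a parameterized query.
-- ClickHouse mutations run asynchronously: the job should poll
-- system.mutations until is_done = 1 before confirming the purge.
ALTER TABLE transaction_features
    DELETE WHERE customer_id = {customer_id:UInt64};
```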

Requesting access

To request access to Fraud Analytics DB:

  1. Analyst access (read-only):

    • Submit request via ServiceNow
    • Select “Data Analytics Access” → “Fraud Analytics DB”
    • Requires fraud team manager approval + security clearance
    • Access granted within 2-3 business days
  2. Data Science access (read + export):

    • Additional approval from Chief Data Officer required
    • Data export must be to secure workbench environment only
    • Training data exports require anonymization review
  3. Production write access:

    • Restricted to FraudDetectionService automated processes
    • Manual writes require incident ticket and fraud team lead approval

Contact: for access questions, reach out to the fraud team.

Performance characteristics

  • Query latency: p95 < 500ms for point queries, < 5s for complex aggregations
  • Insert throughput: 100,000+ events/second
  • Compression ratio: ~10x (columnar storage)
  • Data freshness: Near real-time (< 10 second delay)

Monitoring and alerts

  • Query performance monitoring (slow queries > 10s)
  • Data freshness lag alerts
  • Storage capacity monitoring (alert at 80%)
  • Replication lag alerts for distributed tables
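The slow-query alert can be backed by ClickHouse's built-in query log; a sketch of the kind of check the monitor might run (assumes query_log is enabled on the server):

```sql
-- Sketch: surface queries slower than 10s in the last hour from the
-- built-in system.query_log table.
SELECT
    event_time,
    query_duration_ms,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 1 HOUR
  AND query_duration_ms > 10000
ORDER BY query_duration_ms DESC
LIMIT 20;
```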

Local development

  • Local ClickHouse via Docker: docker-compose up fraud-analytics-db
  • Connection string: FRAUD_ANALYTICS_DB_URL environment variable
  • Sample dataset available: npm run seed:fraud-analytics
  • Use ClickHouse client: clickhouse-client --host localhost --port 9000

Common issues and troubleshooting

  • Slow aggregation queries: Ensure event_date partition key is used in WHERE clause
  • Out of memory errors: Reduce query complexity or increase max_memory_usage setting
  • Data duplication: Use ReplacingMergeTree for deduplication on insert
  • Query timeout: Increase max_execution_time or optimize query with EXPLAIN
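For the deduplication case, a ReplacingMergeTree keyed on a transaction identifier collapses duplicate inserts at merge time; a sketch (the table name, transaction_id column, and column list are illustrative, not the production schema):

```sql
-- Sketch: ReplacingMergeTree keeps the row with the highest created_at
-- per sorting key when parts merge. Dedup is eventual; read with FINAL
-- (or argMax) when exact results are required.
CREATE TABLE risk_decisions_dedup
(
    event_date     Date,
    transaction_id UInt64,
    customer_id    UInt64,
    risk_score     Float32,
    created_at     DateTime
)
ENGINE = ReplacingMergeTree(created_at)
PARTITION BY event_date
ORDER BY (event_date, transaction_id);

-- Exact-dedup read (slower than a plain read; use sparingly):
SELECT transaction_id, risk_score
FROM risk_decisions_dedup FINAL
WHERE event_date = today();
```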

For more information, see FraudDetectionService documentation and ClickHouse best practices guide.