# Fraud Analytics DB (v0.0.1)
Analytics database for fraud detection patterns, risk scoring, and transaction analysis
## What is this?
Fraud Analytics DB is a ClickHouse columnar database optimized for high-speed analytics on transaction patterns, fraud signals, and risk assessment. It stores historical fraud data and real-time transaction features used by machine learning models.
## What does it store?
- Transaction Features: Extracted features from payment transactions for ML model scoring
- Fraud Signals: Device fingerprints, IP addresses, velocity metrics, behavioral patterns
- Risk Scores: Historical risk scores and fraud decisions for model training
- Fraud Cases: Confirmed fraud incidents with investigation notes
- Model Predictions: ML model outputs and confidence scores for analysis
## Who writes to it?
- FraudDetectionService writes real-time transaction features and risk scores
- Data Pipeline ingests historical fraud data from payment systems
- Manual Review Team updates fraud case outcomes and labels
## Who reads from it?
- FraudDetectionService queries historical patterns for real-time scoring
- ML Training Pipeline extracts training data for model retraining
- Data Science Team analyzes fraud trends and model performance
- Business Intelligence generates fraud metrics and dashboards
- Compliance Team audits fraud detection decisions
## High-level data model
- `transaction_features`: Real-time extracted features (device, location, amount patterns)
- `fraud_signals`: Aggregated signals (velocity, anomaly scores)
- `risk_decisions`: Historical risk decisions and outcomes
- `fraud_labels`: Ground truth labels for confirmed fraud cases
- Time-series data partitioned by day for optimal query performance
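To make the daily partitioning concrete, here is a minimal DDL sketch for `transaction_features`, assuming only the columns referenced in the queries below; actual column types and engine settings may differ:

```sql
-- Hypothetical DDL illustrating the daily-partitioning scheme.
CREATE TABLE transaction_features
(
    transaction_id     UInt64,
    customer_id        UInt64,
    country_code       LowCardinality(String),
    amount_cents       UInt64,
    ip_address         String,
    device_fingerprint String,
    is_fraud           UInt8,
    created_at         DateTime,
    event_date         Date DEFAULT toDate(created_at)
)
ENGINE = MergeTree
PARTITION BY event_date
ORDER BY (event_date, customer_id, created_at);
```

Queries that filter on `event_date` can then skip whole partitions instead of scanning them.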
## Common queries
```sql
-- Calculate fraud rate by country (last 30 days)
SELECT
    country_code,
    COUNT(*) as total_transactions,
    SUM(is_fraud) as fraud_count,
    (SUM(is_fraud) * 100.0 / COUNT(*)) as fraud_rate
FROM transaction_features
WHERE event_date >= today() - 30
GROUP BY country_code
ORDER BY fraud_rate DESC;
```
```sql
-- Get high-risk transaction patterns
SELECT
    customer_id,
    COUNT(*) as transaction_count,
    AVG(risk_score) as avg_risk_score,
    MAX(risk_score) as max_risk_score
FROM risk_decisions
WHERE event_date = today()
  AND risk_score > 80
GROUP BY customer_id
HAVING transaction_count > 5;
```
```sql
-- Analyze model performance over time
SELECT
    toStartOfHour(created_at) as hour,
    model_version,
    AVG(risk_score) as avg_score,
    SUM(is_fraud) as actual_fraud_count,
    COUNT(*) as total_predictions
FROM risk_decisions
WHERE created_at >= now() - INTERVAL 7 DAY
GROUP BY hour, model_version
ORDER BY hour DESC;
```
```sql
-- Find velocity anomalies (multiple transactions, same customer)
SELECT
    customer_id,
    COUNT(*) as tx_count,
    SUM(amount_cents) as total_amount,
    arrayDistinct(groupArray(ip_address)) as unique_ips,
    arrayDistinct(groupArray(device_fingerprint)) as unique_devices
FROM transaction_features
WHERE event_date = today()
  AND created_at >= now() - INTERVAL 1 HOUR
GROUP BY customer_id
HAVING tx_count > 5 OR length(unique_ips) > 3;
```
## Access patterns and guidance
- Queries are highly parallelized across columns for fast analytics
- Use the `event_date` partition key in WHERE clauses for optimal performance
- Aggregate queries benefit from ClickHouse's native aggregation functions
- Avoid SELECT * on large tables; specify needed columns
- Use materialized views for frequently accessed aggregations (see the sketch below)
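As an example of the materialized-view guidance above, here is a sketch that pre-aggregates the daily fraud-rate query from the "Common queries" section; the view name and engine choice are illustrative:

```sql
-- Pre-aggregate daily fraud counts per country as rows are inserted.
CREATE MATERIALIZED VIEW fraud_rate_by_country_mv
ENGINE = SummingMergeTree
PARTITION BY event_date
ORDER BY (event_date, country_code)
AS SELECT
    event_date,
    country_code,
    count() as total_transactions,
    sum(is_fraud) as fraud_count
FROM transaction_features
GROUP BY event_date, country_code;
```

Because SummingMergeTree collapses rows asynchronously during merges, reads should re-aggregate, e.g. `sum(fraud_count) * 100.0 / sum(total_transactions)`.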
## Data retention and archiving
- Hot data: Last 90 days on high-performance SSD storage
- Warm data: 91 days to 2 years on standard storage
- Cold archive: 2-5 years on object storage (S3)
- Automated archival jobs run monthly
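For reference, this tiering could be expressed directly as ClickHouse table TTLs, assuming a storage policy that defines `warm` and `cold` (S3-backed) volumes; the volume names here are hypothetical:

```sql
-- Move parts between storage tiers as they age, then drop them at 5 years.
ALTER TABLE transaction_features MODIFY TTL
    event_date + INTERVAL 90 DAY TO VOLUME 'warm',
    event_date + INTERVAL 2 YEAR TO VOLUME 'cold',
    event_date + INTERVAL 5 YEAR DELETE;
```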
## Security and compliance
- Contains sensitive fraud signals and PII (IP addresses, device IDs)
- Access requires security clearance and fraud team membership
- All queries logged for audit trail
- Data anonymized for ML training datasets shared outside fraud team
- GDPR right-to-deletion supported via customer_id purge jobs
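Under the hood, a right-to-deletion purge boils down to ClickHouse delete mutations like the following; the literal ID is a placeholder, and in practice these run inside the audited purge job rather than ad hoc:

```sql
-- Asynchronous mutations: affected parts are rewritten without the rows.
ALTER TABLE transaction_features DELETE WHERE customer_id = 123456;
ALTER TABLE risk_decisions DELETE WHERE customer_id = 123456;
```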
## Requesting access
To request access to Fraud Analytics DB:
**Analyst access (read-only):**
- Submit request via ServiceNow
- Select “Data Analytics Access” → “Fraud Analytics DB”
- Requires fraud team manager approval + security clearance
- Access granted within 2-3 business days
**Data Science access (read + export):**
- Additional approval from Chief Data Officer required
- Data export must be to secure workbench environment only
- Training data exports require anonymization review
**Production write access:**
- Restricted to FraudDetectionService automated processes
- Manual writes require incident ticket and fraud team lead approval
For access questions, contact:
- Slack: #fraud-detection-team
- Email: fraud-team@company.com
- Data governance: data-governance@company.com
## Performance characteristics
- Query latency: p95 < 500ms for point queries, < 5s for complex aggregations
- Insert throughput: 100,000+ events/second
- Compression ratio: ~10x (columnar storage)
- Data freshness: Near real-time (< 10 second delay)
## Monitoring and alerts
- Query performance monitoring (slow queries > 10s; a manual check is sketched below)
- Data freshness lag alerts
- Storage capacity monitoring (alert at 80%)
- Replication lag alerts for distributed tables
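The slow-query threshold can also be checked by hand against ClickHouse's built-in `system.query_log`; a sketch, with the 10s cutoff mirroring the alert above:

```sql
-- Recent queries that exceeded the 10s slow-query threshold.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    substring(query, 1, 120) as query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 10000
  AND event_date >= today() - 1
ORDER BY query_duration_ms DESC
LIMIT 20;
```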
## Local development
- Local ClickHouse via Docker: `docker-compose up fraud-analytics-db`
- Connection string: `FRAUD_ANALYTICS_DB_URL` environment variable
- Sample dataset available: `npm run seed:fraud-analytics`
- Use the ClickHouse client: `clickhouse-client --host localhost --port 9000`
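After seeding, a quick sanity check from `clickhouse-client` confirms that rows landed in the daily partitions (this assumes the seed script populates `transaction_features`):

```sql
-- Row counts for the most recent daily partitions.
SELECT event_date, count() as row_count
FROM transaction_features
GROUP BY event_date
ORDER BY event_date DESC
LIMIT 7;
```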
## Common issues and troubleshooting
- Slow aggregation queries: Ensure the `event_date` partition key is used in the WHERE clause
- Out-of-memory errors: Reduce query complexity or increase the `max_memory_usage` setting
- Data duplication: Use a `ReplacingMergeTree` engine, which deduplicates during background merges (see the sketch below)
- Query timeouts: Increase `max_execution_time` or optimize the query with `EXPLAIN`
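To illustrate the last two items: a ReplacingMergeTree table deduplicates rows that share a sort key during background merges, keeping the row with the highest version column, and per-query `SETTINGS` can raise memory and time limits. The table below is a hypothetical dedup variant of `risk_decisions`, not its actual schema:

```sql
-- Rows with the same transaction_id collapse on merge, keeping the latest.
CREATE TABLE risk_decisions_dedup
(
    transaction_id UInt64,
    customer_id    UInt64,
    risk_score     Float32,
    is_fraud       UInt8,
    created_at     DateTime,
    event_date     Date
)
ENGINE = ReplacingMergeTree(created_at)
PARTITION BY event_date
ORDER BY transaction_id;

-- FINAL collapses not-yet-merged duplicates at read time (slower but exact);
-- SETTINGS overrides the per-query memory and execution-time limits.
SELECT customer_id, max(risk_score) as max_risk
FROM risk_decisions_dedup FINAL
WHERE event_date = today()
GROUP BY customer_id
SETTINGS max_memory_usage = 10000000000, max_execution_time = 300;
```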
For more information, see the FraudDetectionService documentation and the ClickHouse best-practices guide.