ML Model Store (v0.0.1)
Versioned object storage for trained fraud detection models, feature extractors, and model artifacts
What is this?
ML Model Store is an S3 bucket that stores trained machine learning models, feature extractors, and model artifacts used by the Fraud Detection Service. It provides versioned storage for model rollback and A/B testing.
What does it store?
- Trained Models: Serialized fraud detection models (TensorFlow SavedModel, ONNX, pickle)
- Model Metadata: Model version, training date, performance metrics, feature schema
- Feature Extractors: Preprocessing pipelines and feature engineering code
- Model Configs: Hyperparameters, training configuration, deployment settings
- Experiment Results: Training logs, validation metrics, confusion matrices
Storage structure
```
ml-models/
├── fraud-detection/
│   ├── production/
│   │   ├── v2.3.1/
│   │   │   ├── model.onnx
│   │   │   ├── metadata.json
│   │   │   ├── feature_schema.json
│   │   │   └── performance_metrics.json
│   │   └── current -> v2.3.1 (symlink)
│   ├── staging/
│   │   └── v2.4.0-rc1/
│   └── experiments/
│       └── exp-2024-01-15-xgboost/
├── feature-extractors/
│   └── v1.2.0/
│       ├── preprocessor.pkl
│       └── feature_config.yaml
└── archived/
    └── deprecated-models/
```
Who writes to it?
- ML Training Pipeline uploads newly trained models after validation
- Data Science Team uploads experimental models and feature extractors
- CI/CD Pipeline promotes models from staging to production
Who reads from it?
- FraudDetectionService loads production models on startup and refresh
- Model Serving Infrastructure fetches models for deployment
- A/B Testing Framework loads multiple model versions for comparison
- Model Monitoring Service reads metadata for drift detection
Object lifecycle
- Models trained in ML platform → uploaded to `experiments/` folder
- Validated models promoted to `staging/` with metadata
- Approved models moved to `production/` with version tag (see the promotion sketch after this list)
- Old production models archived after 90 days in `archived/`
- Archived models deleted after 3 years
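At the storage level, promotion between lifecycle stages is a set of S3 copies. The sketch below shows a staging-to-production promotion, assuming the `ml-model-store` bucket name used in the CLI example later on this page and the prefix layout from the storage structure; the `promote_to_production` helper is hypothetical, not the CI/CD pipeline's actual interface.

```python
# Hypothetical promotion helper; bucket name and prefix layout are assumed
# from the storage structure above, not confirmed for this environment.
import boto3

BUCKET = "ml-model-store"  # assumed bucket name (matches the local CLI example)

def promote_to_production(version: str) -> None:
    """Copy every object under staging/<version>/ to production/<version>/."""
    s3 = boto3.client("s3")
    src_prefix = f"fraud-detection/staging/{version}/"
    dst_prefix = f"fraud-detection/production/{version}/"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            dst_key = dst_prefix + obj["Key"][len(src_prefix):]
            s3.copy_object(
                Bucket=BUCKET,
                Key=dst_key,
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
            )

promote_to_production("v2.4.0-rc1")
```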
Model metadata format
{ "model_id": "fraud-detection-v2.3.1", "version": "2.3.1", "trained_at": "2024-01-15T10:30:00Z", "framework": "tensorflow", "format": "onnx", "training_dataset": { "date_range": "2023-10-01 to 2024-01-01", "total_samples": 5000000, "fraud_rate": 0.023 }, "performance": { "auc_roc": 0.94, "precision": 0.89, "recall": 0.87, "f1_score": 0.88, "false_positive_rate": 0.02 }, "features": ["transaction_amount", "device_fingerprint", "ip_country", "..."], "deployment": { "min_memory_mb": 512, "inference_latency_p99_ms": 50, "deployed_at": "2024-01-16T08:00:00Z" }}
Access patterns
- Models loaded on FraudDetectionService startup (cold start)
- Periodic refresh every 6 hours to pick up new model versions
- Blue-green deployment: new version tested in parallel before full rollout
- Model download cached locally on service instances to reduce S3 calls
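A minimal sketch of the startup/refresh load path with local caching follows. It assumes the `ml-model-store` bucket, the prefix layout from the storage structure, and that the production `current` pointer is a small object whose body holds the active version string (S3 has no native symlinks, so the real mechanism may differ); the cache path is hypothetical.

```python
# Sketch of the startup / refresh load path, assuming:
#   - bucket "ml-model-store" with the prefix layout shown above
#   - the "current" pointer is a small object whose body holds the active
#     version string (an assumption; S3 has no real symlinks)
import os
import boto3

BUCKET = "ml-model-store"
CACHE_DIR = "/var/cache/fraud-models"  # hypothetical local cache path

def load_current_model_path() -> str:
    s3 = boto3.client("s3")
    pointer = s3.get_object(Bucket=BUCKET, Key="fraud-detection/production/current")
    version = pointer["Body"].read().decode().strip()  # e.g. "v2.3.1"

    local_path = os.path.join(CACHE_DIR, version, "model.onnx")
    if not os.path.exists(local_path):  # cache hit avoids repeated S3 downloads
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(
            BUCKET,
            f"fraud-detection/production/{version}/model.onnx",
            local_path,
        )
    return local_path
```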
Versioning strategy
- Semantic versioning: major.minor.patch (e.g., 2.3.1)
- Major: Breaking changes to feature schema or model API
- Minor: Model improvements without breaking changes
- Patch: Bug fixes, retraining with same architecture
- Git tags linked to model versions for traceability
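Because version folders follow major.minor.patch, tooling can sort them numerically rather than lexically. The sketch below picks the newest production version under that assumption; it is illustrative, not part of the deployment tooling.

```python
# Illustrative only: choose the newest production version by semantic-version
# order. Assumes version folders look like "v<major>.<minor>.<patch>/".
import boto3

def latest_production_version(bucket: str = "ml-model-store") -> str:
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket=bucket, Prefix="fraud-detection/production/", Delimiter="/"
    )
    candidates = []
    for cp in resp.get("CommonPrefixes", []):
        name = cp["Prefix"].rstrip("/").rsplit("/", 1)[-1]  # e.g. "v2.3.1"
        try:
            candidates.append((tuple(int(p) for p in name.lstrip("v").split(".")), name))
        except ValueError:
            continue  # skip folders that are not plain semver (e.g. rc builds)
    if not candidates:
        raise RuntimeError("no production versions found")
    return max(candidates)[1]
```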
Security and access control
- Read access: FraudDetectionService IAM role only
- Write access: ML training pipeline CI/CD role only
- Encryption: AES-256 server-side encryption enabled
- Versioning: S3 versioning enabled for rollback capability
- Access logs: All S3 access logged to audit bucket
Requesting access
To request access to ML Model Store:
Read access (for service integration):
- Create IAM role request via AWS Access Portal
- Select “S3 Read Access” → “ml-model-store”
- Requires fraud team lead approval
- Access granted within 1 business day
Write access (for ML engineers):
- Submit request via #ml-platform Slack channel
- Requires senior ML engineer approval + security review
- Write access limited to `experiments/` and `staging/` folders
- Production writes restricted to CI/CD pipeline only
Data Science exploration:
- Use ML Platform workbench with pre-configured read access
- Contact #ml-platform for workspace setup
Contact:
- ML Platform: #ml-platform
- Fraud ML Team: #fraud-ml-team
- Model governance: ml-governance@company.com
Model deployment workflow
- Train model in ML platform environment
- Upload to `experiments/` with metadata and performance metrics (see the upload sketch after this list)
- Validation tests run automatically (schema check, performance baseline)
- If validated, promote to `staging/` for canary deployment
- Monitor staging metrics for 24 hours
- Approve production promotion via deployment ticket
- CI/CD pipeline moves model to `production/` and updates symlink
- FraudDetectionService auto-refreshes and loads new model
Monitoring and alerts
- Model staleness: Alert if production model > 30 days old
- Download failures: Alert on 5xx errors from S3
- Storage costs: Monitor bucket size (alert at $500/month)
- Performance drift: Compare new model metrics vs. baseline
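A staleness check like the alert above can be derived from the model's metadata.json. The sketch below assumes the metadata layout shown earlier and the 30-day threshold from the alert rule; it is not the monitoring service's actual implementation.

```python
# Staleness check sketch based on the metadata.json "trained_at" field.
import json
from datetime import datetime, timezone

import boto3

BUCKET = "ml-model-store"
MAX_AGE_DAYS = 30  # threshold from the alert rule above

def production_model_is_stale(version: str) -> bool:
    s3 = boto3.client("s3")
    obj = s3.get_object(
        Bucket=BUCKET, Key=f"fraud-detection/production/{version}/metadata.json"
    )
    meta = json.loads(obj["Body"].read())
    trained_at = datetime.fromisoformat(meta["trained_at"].replace("Z", "+00:00"))
    age_days = (datetime.now(timezone.utc) - trained_at).days
    return age_days > MAX_AGE_DAYS
```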
Backup and disaster recovery
- S3 versioning: Enabled for accidental deletion protection
- Cross-region replication: Models replicated to us-east-1 for DR
- Backup frequency: Automatic with S3 durability (99.999999999%)
- Recovery time: < 5 minutes (point production symlink to previous version)
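Rollback can be as small as repointing production's `current` marker at a previous version. As in the earlier load sketch, this assumes `current` is a small pointer object rather than a true symlink; the helper below is illustrative.

```python
# Rollback sketch: repoint the production "current" marker to a previous
# version. Assumes "current" is a pointer object (an assumption, not the
# documented mechanism).
import boto3

def rollback_production(previous_version: str, bucket: str = "ml-model-store") -> None:
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key="fraud-detection/production/current",
        Body=previous_version.encode(),  # e.g. b"v2.3.0"
    )

rollback_production("v2.3.0")
```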
Local development
- Local MinIO S3-compatible storage: `docker-compose up minio`
- Connection: `AWS_ENDPOINT_URL=http://localhost:9000`
- Seed models: `npm run seed:ml-models`
- CLI access: `aws s3 ls s3://ml-model-store/ --endpoint-url http://localhost:9000`
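For code running against the local MinIO instance, the same S3 client works once it is pointed at the local endpoint. The credentials below are MinIO's common defaults and are an assumption; adjust to your docker-compose configuration.

```python
# Point boto3 at the local MinIO endpoint for development.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",      # assumed default MinIO credential
    aws_secret_access_key="minioadmin",  # assumed default MinIO credential
)

# List seeded model objects in the local bucket.
for obj in s3.list_objects_v2(Bucket="ml-model-store").get("Contents", []):
    print(obj["Key"])
```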
Common issues and troubleshooting
- Model load timeout: Increase service timeout, check S3 connectivity
- Version mismatch: Ensure feature schema matches model version in metadata
- Cold start latency: Pre-warm model cache on service startup
- S3 rate limits: Use CloudFront or S3 Transfer Acceleration for high-traffic models
- Model size too large: Compress models with ONNX optimization or quantization
Best practices
- Always include metadata.json with model performance metrics
- Test models in staging before production deployment
- Keep last 3 production versions for quick rollback
- Document breaking changes in model version release notes
- Monitor inference latency in production after deployment
For more information, see ML Platform documentation and Model Deployment Playbook.