Challenge
Loomstack's existing fraud system was a hand-tuned rules engine: 1,400 rules, accumulated over five years, maintained by a team that had largely turned over. It worked, in the sense that it caught real fraud, but the false-positive rate was 11.4% — meaning roughly one in nine flagged transactions was a legitimate purchase.
That number had a real cost. Each false positive translated to a frustrated customer, a manual review, and an average $4.20 in operational overhead. Across 30 million flagged transactions a year, the rules engine was costing more than it was saving. Worse, every new rule made the system harder to reason about. Adding a rule for one fraud pattern often broke detection of three others, and nobody could predict which ones.
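As a rough check on the scale of that overhead, the figures above can be multiplied out. This is a back-of-the-envelope sketch, assuming the 11.4% rate applies uniformly to all 30 million flagged transactions; the variable names are ours, and the result covers only the per-review overhead, not fraud losses or customer churn.

```python
# Figures quoted in the text; variable names are illustrative.
flagged_per_year = 30_000_000
false_positive_share = 0.114     # share of flagged transactions that were legitimate
cost_per_false_positive = 4.20   # USD of operational overhead per false positive

false_positives = flagged_per_year * false_positive_share
annual_overhead = false_positives * cost_per_false_positive

print(f"{false_positives:,.0f} false positives per year")
print(f"${annual_overhead:,.0f} in review overhead per year")
```

Under those assumptions the overhead comes to a little over $14 million a year.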
The team had tried twice to replace the rules engine with ML. Both attempts stalled — not because the models couldn't beat the baseline, but because nobody could explain a model's decision to a regulator on demand.
Approach
We did not start with the model. We started with the eval.
The first six weeks of the engagement produced no production code. They produced a labeled evaluation set of 18,000 transactions, hand-graded by Loomstack's risk team, and an automated harness that could score any candidate model against it in under three minutes. The harness measured precision, recall, calibration, and — critically — explainability: every score had to come with a structured rationale that could be shown to a customer or a regulator.
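The kind of scoring such a harness performs can be sketched in a few lines. This is a minimal illustration, not the harness built for Loomstack: the `ScoredTransaction` type, the 0.5 threshold, the use of the Brier score as the calibration measure, and the "rationale is non-empty" check are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredTransaction:
    is_fraud: bool                  # label assigned by the human graders
    score: float                    # candidate model's fraud probability, 0..1
    rationale: dict = field(default_factory=dict)  # structured explanation (hypothetical shape)


def evaluate(rows, threshold=0.5):
    """Return precision, recall, Brier score, and rationale coverage."""
    if not rows:
        raise ValueError("evaluation set is empty")
    tp = fp = fn = 0
    for r in rows:
        flagged = r.score >= threshold
        if flagged and r.is_fraud:
            tp += 1
        elif flagged:
            fp += 1
        elif r.is_fraud:
            fn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Brier score: mean squared gap between score and outcome (lower is better).
        "brier": sum((r.score - float(r.is_fraud)) ** 2 for r in rows) / len(rows),
        # Share of scores that arrived with a non-empty rationale.
        "explained": sum(1 for r in rows if r.rationale) / len(rows),
    }


sample = [
    ScoredTransaction(True, 0.9, {"top_signal": "velocity"}),
    ScoredTransaction(False, 0.8, {"top_signal": "new_device"}),
    ScoredTransaction(True, 0.2),   # missed fraud, and no rationale supplied
    ScoredTransaction(False, 0.1, {"top_signal": "none"}),
]
report = evaluate(sample)
print(report)
```

A real harness would treat a missing rationale as a hard failure rather than a reported ratio; the ratio is used here only to keep the sketch short.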
With the eval in place, we built a hybrid pipeline. The rules engine stayed, but as a small set of high-confidence signals (chargeback history, known-bad device fingerprints, sanctions screening). The ML layer — a gradient-boosted model with a small transformer for sequence features — handled the long tail.
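One way to picture the split between the two layers is sketched below. It assumes the rule signals short-circuit the model, which is only one possible reading of the design (they could equally be fed to the model as features). The field names, the `score_transaction` function, and the stand-in model are hypothetical.

```python
from typing import Callable, Mapping

# High-confidence signals named in the text; the field names are hypothetical.
RULE_SIGNALS = (
    ("has_chargeback_history", "chargeback history"),
    ("device_known_bad", "known-bad device fingerprint"),
    ("sanctions_hit", "sanctions screening"),
)


def score_transaction(txn: Mapping, model: Callable[[Mapping], float]) -> dict:
    """Apply the rule signals first, then fall back to the ML layer."""
    for field_name, reason in RULE_SIGNALS:
        if txn.get(field_name):
            # A rule hit is decisive and carries its own rationale.
            return {"score": 1.0, "source": "rule", "reason": reason}
    # Everything else (the long tail) is scored by the model.
    return {"score": model(txn), "source": "model", "reason": "model score"}


def stand_in_model(txn: Mapping) -> float:
    """Placeholder for the real model; returns a fixed probability."""
    return 0.12


rule_hit = score_transaction({"sanctions_hit": True}, stand_in_model)
long_tail = score_transaction({"amount": 42.0}, stand_in_model)
print(rule_hit)
print(long_tail)
```

Keeping the rule path separate means every rule-driven decision has a one-line explanation, which fits the explainability requirement described above.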
The architecture was deliberately boring: Kafka for ingestion, a Python scoring service running on EKS, Redis for feature caching, Postgres for audit. We avoided the temptation to introduce a vector database, an online feature store, or any of the other things that make demo slides look impressive and 3am pages more frequent.
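The feature-caching step follows the familiar cache-aside pattern. The sketch below uses an in-memory stand-in rather than a Redis client so that it runs on its own; the `TTLCache` class, the `get_features` helper, and the 60-second expiry are assumptions for illustration, not details of the production service.

```python
import time


class TTLCache:
    """In-memory stand-in for a key-value cache with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]   # entry has expired; treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)


def get_features(account_id, cache, compute):
    """Cache-aside lookup: read the cache, compute and store on a miss."""
    cached = cache.get(account_id)
    if cached is not None:
        return cached
    features = compute(account_id)
    cache.set(account_id, features)
    return features


calls = []

def slow_compute(account_id):
    calls.append(account_id)       # record each expensive computation
    return {"txn_count_24h": 3}


cache = TTLCache(ttl_seconds=60)
first = get_features("acct-1", cache, slow_compute)
second = get_features("acct-1", cache, slow_compute)
print(first, second, len(calls))
```

The second lookup is served from the cache, so the expensive computation runs only once within the expiry window.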
Outcome
The new system returns scores in under 40 ms at the 99th percentile, against a target of 100 ms. Throughput peaks at 4 million transactions per hour during the holiday season, comfortably above the rules engine's old ceiling.
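For readers less familiar with percentile targets, the sketch below shows how a 99th-percentile latency can be checked against a budget using the nearest-rank method. The function name and the sample latencies are invented for illustration; production monitoring would normally compute this from histogram metrics rather than raw samples.

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Invented latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [12] * 90 + [25] * 8 + [38, 95]
p99 = percentile_nearest_rank(latencies_ms, 99)
TARGET_MS = 100

print(f"p99 = {p99} ms, within target: {p99 <= TARGET_MS}")
```

A p99 figure says that 99 in 100 requests were at least this fast; the slowest 1% can still exceed it, as the 95 ms sample does here.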
The headline number is the false-positive rate: from 11.4% to 3.7%, a drop of 7.7 percentage points and a relative reduction of about two-thirds (67.5%). Loomstack's risk team estimates this will eliminate $24M in operational overhead in the first full year. Equally important, recall on confirmed fraud improved by 4.1 percentage points: the new system catches more fraud, not less.
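The distinction between the percentage-point drop and the relative reduction is easy to blur, so the arithmetic is spelled out here. Only the two rates come from the text; the variable names are ours.

```python
rate_before = 11.4   # false-positive rate under the rules engine, in percent
rate_after = 3.7     # false-positive rate under the new system, in percent

absolute_drop = rate_before - rate_after      # percentage points
relative_drop = absolute_drop / rate_before   # fraction of the original rate

print(f"{absolute_drop:.1f} percentage points, {relative_drop:.1%} relative reduction")
```

The 4.1-point recall improvement quoted above is an absolute figure of the first kind, not a relative one.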
The system has been in production for seven months. Loomstack's engineers own and operate it; we are on a small retainer for model retraining and eval updates. The codebase is in their repo, the models are in their accounts, and the runbooks were written by their on-call team during shadow rotations in the final month of the engagement.
The next phase, scoped but not yet started, is extending the explainability layer to support customer-facing decision letters — the regulatory groundwork is now in place to make that a small project rather than a six-month rebuild.