
Architecting Sub-50ms Fraud Scoring: The Infrastructure Decisions That Matter

[Figure: real-time data pipeline architecture diagram]

Scoring a payment transaction for fraud risk in under 50 milliseconds is not a model performance problem — most production fraud models can produce a score in under 5ms once the features are ready. The 50ms budget is almost entirely consumed by infrastructure: network transit, feature retrieval, and data serialization. The architecture decisions that determine whether you hit that threshold or miss it are mostly made before a single model weight is trained.

This article walks through the specific infrastructure decisions that matter for sub-50ms fraud scoring, the tradeoffs each involves, and where teams typically lose time without realizing it.

The Latency Budget: Where the Time Actually Goes

Breaking down the end-to-end latency for a fraud scoring API call gives you a clearer target for where to invest optimization effort. A typical production path looks like this: network transit from the payment gateway to the scoring endpoint (5–15ms depending on geography and connection type), feature retrieval from the feature store (10–20ms), model inference (1–10ms), and response serialization and network return (5–10ms). Total: 21–55ms.
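As a sanity check, the stage ranges above can be totaled in a few lines. The stage names are just labels for the figures quoted in this section:

```python
# Stage-by-stage latency budget (ms), using the ranges above.
BUDGET_MS = {
    "network_transit_in": (5, 15),   # gateway -> scoring endpoint
    "feature_retrieval": (10, 20),   # feature store reads
    "model_inference": (1, 10),      # scoring the prepared features
    "serialize_and_return": (5, 10), # response serialization + return transit
}

best = sum(lo for lo, _ in BUDGET_MS.values())
worst = sum(hi for _, hi in BUDGET_MS.values())
print(f"end-to-end: {best}-{worst} ms")  # end-to-end: 21-55 ms
```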

The immediate observation is that model inference — the part that receives the most optimization attention in ML teams — is the smallest component. A model that takes 2ms versus 5ms to produce a score contributes a 3ms difference to the total. A feature store with P95 latency of 20ms versus 8ms contributes a 12ms difference. For most teams trying to get below 50ms, the gains are in infrastructure, not in model optimization.

Network transit is the one component you can't directly optimize — it's determined by the speed of light and the distance between your processor's authorization infrastructure and your scoring service. What you can control is co-location: running the scoring service in the same data center region as the authorization flow eliminates a large fraction of transit latency. A scoring service running in us-east-1 responding to authorization requests from a gateway also in us-east-1 has 1–2ms transit time. The same scoring service responding to requests from a gateway in eu-west-1 has 80–100ms transit time, which blows the entire budget before any computation happens.

The Feature Store Is the Constraint

For teams where sub-50ms scoring is achievable architecturally, the feature store is almost always the binding constraint on latency. Fraud scoring models typically require 50–200 features per transaction, and most of those features cannot be computed entirely from the transaction event itself. Velocity features require looking up counts of recent events for the card, device, and IP. Historical baseline features require looking up statistics computed from the past 30 days of account history. Network graph features require reading pre-computed graph centrality scores from a graph database.

All of these require external reads during the scoring path. The latency of those reads determines the floor of the scoring latency. Getting those reads under 10ms total requires a combination of feature co-location (the feature store in the same availability zone as the scoring service), in-memory caching for high-frequency entities (the most active cards and devices should have their features cached in process memory, not just in Redis), and pre-computation of all aggregate features asynchronously so the scoring path is read-only.
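A minimal sketch of the in-process caching layer for high-frequency entities, assuming a small TTL-bounded cache in front of whatever store backs the features. The `TTLCache` class and the `fetch` callable are hypothetical names; `fetch` stands in for the actual Redis or Cassandra read:

```python
import time

# In-process TTL cache in front of the feature store, for the hottest
# entities. A cache hit avoids the network read entirely on the scoring path.
class TTLCache:
    def __init__(self, ttl_s: float = 5.0):
        self.ttl_s = ttl_s
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, fetch, now=None):
        now = now if now is not None else time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: no external read
        value = fetch(key)   # miss: read from the feature store
        self._entries[key] = (now + self.ttl_s, value)
        return value
```

The TTL bounds staleness the same way the async update interval does for the backing store; it should be chosen to match the freshness requirements of the velocity features being cached.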

The pre-computation point deserves emphasis. Teams that compute features on-demand during the scoring path — running a SQL query against the transaction database to count events in the last hour, for example — are structurally unable to hit sub-50ms latency at any significant transaction volume. On-demand computation that requires database queries has P95 latency of 30–80ms for simple queries, which exceeds the entire latency budget before any other processing happens. The scoring path must be read-only against a purpose-built feature store, not a general transactional database.
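The read-only scoring path can be sketched with time-bucketed velocity counters: incoming events increment pre-computed buckets asynchronously, and scoring only sums them — no query runs during scoring. A dict stands in for the per-bucket Redis keys here, and the one-minute bucket layout is an illustrative assumption, not a prescribed schema:

```python
from collections import defaultdict

BUCKET_SECONDS = 60   # one-minute buckets
WINDOW_BUCKETS = 60   # one hour of buckets

# (entity_id, bucket_start_ts) -> count; in production each bucket
# would be a Redis key with a TTL slightly longer than the window.
_buckets = defaultdict(int)

def record_event(entity_id: str, ts: float) -> None:
    """Async write path: increment the current bucket for the entity."""
    bucket = int(ts // BUCKET_SECONDS) * BUCKET_SECONDS
    _buckets[(entity_id, bucket)] += 1

def velocity_last_hour(entity_id: str, ts: float) -> int:
    """Read-only scoring path: sum the pre-computed buckets."""
    current = int(ts // BUCKET_SECONDS) * BUCKET_SECONDS
    return sum(
        _buckets.get((entity_id, current - i * BUCKET_SECONDS), 0)
        for i in range(WINDOW_BUCKETS)
    )
```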

Feature Store Architecture Options

There are three practical feature store architectures for real-time fraud scoring, each with different latency and cost characteristics.

Redis with materialized feature records. Each entity (card, device, IP) has a key in Redis with a serialized feature vector that's updated by an asynchronous process as events arrive. The scoring path does a Redis GET by entity key, deserializes the feature vector, and passes it to the model. P95 read latency from an in-availability-zone Redis instance is 0.3–2ms. This is the lowest-latency option and the most common architecture in production fraud systems.

The limitation is that feature freshness is bounded by how quickly the async update process runs. If the update process runs every 30 seconds, features can be up to 30 seconds stale. For most fraud scoring applications, 30-second staleness is acceptable; for applications where near-real-time velocity features are critical, the update frequency needs to be tuned accordingly, which increases write load on Redis and associated infrastructure costs.

Apache Cassandra or ScyllaDB for large feature sets. For models with very wide feature vectors or with feature sets that don't fit in Redis memory at acceptable cost, Cassandra provides a low-latency read path with horizontal scaling. P95 read latency for a single Cassandra row by primary key is 2–8ms from within the same data center. Higher than Redis but acceptable within the 50ms budget for most feature retrieval patterns.

Embedded feature computation from event stream. For teams using a streaming architecture (Kafka), some features can be maintained as running state in the scoring service process itself rather than in a separate feature store. Kafka consumer groups maintaining in-memory velocity counters (updated with each incoming transaction event) avoid the external read entirely. The tradeoff is that in-memory state is not shared across scoring service instances without additional coordination, which complicates horizontal scaling.

Model Serving: The Infrastructure That Determines Inference Latency

The 1–10ms model inference window is small but not negligible. The two most common sources of inference latency above 5ms are model complexity (ensemble models with 500+ trees are slower than single models) and serving infrastructure that adds overhead beyond raw model computation (HTTP servers with request routing, middleware, and logging add 5–20ms of overhead around the actual inference call).

For gradient boosting models, which are the dominant model type in production payment fraud (more on why in our article on gradient boosting vs. deep learning), ONNX Runtime or LightGBM's C++ prediction path both achieve sub-5ms inference for models up to 1,000 trees at typical feature widths. The overhead on top of the inference call is where most teams lose time.
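To see why tree inference is cheap, note that scoring is essentially num_trees × tree_depth float comparisons. A toy pure-Python traversal makes the cost structure concrete; the nested-tuple tree layout is illustrative, not LightGBM's internal format:

```python
# Internal node: (feature_index, threshold, left_subtree, right_subtree)
# Leaf: a bare float (the tree's contribution to the score).

def score_tree(tree, features):
    # Walk from root to leaf: one comparison per level.
    while not isinstance(tree, float):
        feat_idx, threshold, left, right = tree
        tree = left if features[feat_idx] <= threshold else right
    return tree

def score_ensemble(trees, features):
    # Total work: num_trees x depth comparisons, no allocation.
    return sum(score_tree(t, features) for t in trees)
```

Compiled prediction paths do the same walk over a flat array with no interpreter overhead, which is why even 1,000-tree models stay in single-digit milliseconds.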

gRPC serving reduces serialization overhead compared to REST APIs for internal service-to-service calls. A REST endpoint with JSON serialization adds 3–8ms of overhead for a typical fraud scoring request and response. A gRPC endpoint with protobuf serialization adds 0.5–1.5ms. For a scoring service called at 500+ transactions per second, the serialization choice matters both for latency and for CPU cost.
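The payload-size side of that difference is easy to see: for a 100-float feature vector, a JSON body carries roughly 1.5–2x the bytes of a packed binary encoding (protobuf's packed repeated float field is close to the struct layout below). An illustrative comparison:

```python
import json
import struct

# A hypothetical 100-feature scoring request.
features = [float(i) / 100 for i in range(100)]

json_payload = json.dumps({"features": features}).encode()
binary_payload = struct.pack("<100f", *features)  # 4 bytes per float = 400 bytes

print(len(json_payload), len(binary_payload))
```

The CPU cost of parsing text floats, not just the byte count, is where most of the 3–8ms JSON overhead comes from.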

The serving decision that has the largest latency impact is whether to run inference in-process or via a separate model serving infrastructure. Teams using dedicated model serving (TorchServe, MLflow models, SageMaker endpoints) often see 15–30ms added by the network hop and protocol overhead of calling the serving endpoint. For sub-50ms scoring, the model inference should happen in the same process as feature retrieval where possible, or via a localhost socket call if not — not via an external HTTP call to a separate service.

Concurrency and the 99th Percentile Problem

Fraud scoring latency targets are usually stated as averages or P50 values: "under 50ms average latency." This is the wrong metric. Payment authorization is a synchronous flow — the customer is waiting. A P99 latency of 200ms means 1% of customers experience a 200ms hold on their payment, which at 10 million monthly transactions means 100,000 monthly instances of visible latency to end users. P99 is the number that matters, not P50.

P99 latency is largely determined by how the system handles load spikes. Fraud scoring systems experience bursty load patterns — high traffic around retail open-close hours, promotional events, holidays — and need to handle the burst without P99 spikes. The failure mode is thread pool exhaustion: when all scoring threads are occupied, new requests queue and latency spikes until threads become available. Sizing the thread pool for peak rather than average load, combined with autoscaling that reacts fast enough to add capacity before queuing degrades P99, is the operational requirement.
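Sizing for peak follows from Little's law: concurrent in-flight requests = arrival rate x service time. A back-of-envelope helper — the headroom factor and the example numbers are illustrative choices, not recommendations:

```python
import math

def threads_needed(peak_rps: float, p99_latency_s: float, headroom: float = 1.5) -> int:
    """Little's law: concurrent in-flight requests at peak, plus burst headroom."""
    return math.ceil(peak_rps * p99_latency_s * headroom)

# 2,000 req/s at peak, 50ms P99 service time -> 100 in-flight, 150 with headroom.
print(threads_needed(peak_rps=2000, p99_latency_s=0.050))  # 150
```

Note the input is P99 service time, not the average — sizing against the average is exactly how thread pools end up exhausted during bursts.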

Circuit breakers matter for resilience. If the feature store goes slow (not down, just slow), the scoring service needs to detect that and fall back to a degraded scoring mode — using only transaction-level features without historical features — rather than propagating the feature store latency through to authorization decisions. A 200ms feature store P99 that blows through the authorization timeout is worse than a degraded score that's less accurate but available in 5ms.
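A minimal circuit-breaker sketch under assumed thresholds: consecutive slow feature-store reads trip the breaker, and the scoring path skips historical features until a cooldown expires. The thresholds and the consecutive-slow-read policy are illustrative; production breakers typically use sliding error-rate windows:

```python
import time

class FeatureStoreBreaker:
    def __init__(self, budget_s=0.010, max_slow=5, cooldown_s=30.0):
        self.budget_s = budget_s      # per-read latency budget
        self.max_slow = max_slow      # consecutive slow reads before tripping
        self.cooldown_s = cooldown_s  # how long to stay in degraded mode
        self.slow_count = 0
        self.open_until = 0.0

    def is_open(self, now=None):
        """Open breaker = score with transaction-level features only."""
        return (now if now is not None else time.monotonic()) < self.open_until

    def record(self, read_latency_s, now=None):
        now = now if now is not None else time.monotonic()
        if read_latency_s > self.budget_s:
            self.slow_count += 1
            if self.slow_count >= self.max_slow:
                self.open_until = now + self.cooldown_s  # trip: degrade scoring
                self.slow_count = 0
        else:
            self.slow_count = 0  # a fast read resets the streak
```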

Data Locality: The Geography Problem

Payment processors often have authorization infrastructure distributed across geographies for redundancy and compliance reasons. A processor with US and EU operations will have authorization endpoints in both regions. If the fraud scoring service is centralized in one region, the scoring latency for transactions processed in the other region will include cross-region network transit — often 80–150ms — which blows the sub-50ms budget entirely.

Solving this requires either regional fraud scoring deployments (scoring service in each region, with feature stores synchronized across regions) or a hybrid architecture where each region can score locally for the majority of transactions but routes to a central scoring service for transactions where cross-regional feature data is needed for full accuracy. Regional deployment solves the latency problem but creates complexity in keeping models and features consistent across regions. The hybrid approach maintains single-source-of-truth model and feature data but accepts degraded latency for a subset of transactions.

Monitoring Latency in Production

P50, P95, and P99 latency per stage of the scoring pipeline should be monitored and alerted independently. A P99 feature retrieval latency alert separate from an overall scoring latency alert lets you diagnose whether a latency degradation is coming from the feature store, model inference, or network transit — information you need to direct the engineering response correctly.

The specific signals worth instrumenting are: feature cache hit rate (should be above 90% for high-frequency entities; if it drops, feature store reads increase and latency follows), feature store read P99 broken out by entity type (card features vs. device features vs. IP features often have different latency profiles), model inference P99 broken out by model type if you're running ensemble scoring, and end-to-end latency as seen from the API caller. The gap between end-to-end API latency and the sum of internal stage latencies is your serialization and middleware overhead — a gap of more than 5ms is worth investigating.

Sub-50ms fraud scoring is achievable with standard infrastructure components — it doesn't require exotic hardware or novel algorithms. It does require deliberate architecture choices made early in system design, before the system is built around components that structurally prevent low-latency operation. The teams that consistently hit this target are the ones who treat infrastructure latency as a first-class constraint from the beginning, not a performance optimization project applied after the fact.