The Exfiltration Risk
Sending customer data to public LLM APIs is a non-starter under strict regulatory regimes. Data sovereignty demands that inference happen where the data lives.
The Architecture: Private Cloud Inference
We deployed a quantized Llama-3-70B model on a private Kubernetes cluster inside the bank's AWS VPC. With vLLM handling high-throughput serving, we measured latencies under 20 ms.
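A deployment along these lines can be sketched as a Kubernetes manifest. This is an illustrative assumption, not the production config: the image tag, model path, quantization method (AWQ), and GPU counts are placeholders.

```yaml
# Sketch of a vLLM serving Deployment inside a VPC-scoped cluster.
# All names, paths, and resource counts below are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # official vLLM serving image
          args:
            - "--model"
            - "/models/llama-3-70b-awq"    # assumed local path to quantized weights
            - "--quantization"
            - "awq"                        # assumed quantization scheme
            - "--tensor-parallel-size"
            - "4"                          # shard the 70B model across 4 GPUs
          resources:
            limits:
              nvidia.com/gpu: 4
          ports:
            - containerPort: 8000          # OpenAI-compatible HTTP API
```

Keeping the service on a `ClusterIP` (no public ingress) ensures prompts never leave the VPC.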
Auditability & Logging
Every prompt and completion is logged to an immutable ledger (Amazon QLDB) for compliance auditing, giving auditors a complete, tamper-evident record of the model's inputs and outputs.
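The tamper-evidence property QLDB provides can be illustrated with a minimal hash-chained log. This is a local, in-process sketch of the concept only; the production path writes to QLDB, and the `AuditLog` class and its fields are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

class AuditLog:
    """Minimal hash-chained, append-only log. Illustrates the tamper-evidence
    a managed ledger like QLDB provides; not the production implementation."""

    def __init__(self):
        self._entries = []
        self._last_hash = "0" * 64  # genesis hash

    def record(self, prompt: str, completion: str) -> dict:
        # Each entry commits to the previous entry's hash, forming a chain.
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prompt": prompt,
            "completion": completion,
            "prev_hash": self._last_hash,
        }
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self._entries.append(entry)
        self._last_hash = entry["hash"]
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edit to a past entry breaks the chain."""
        prev = "0" * 64
        for entry in self._entries:
            if entry["prev_hash"] != prev:
                return False
            body = {k: v for k, v in entry.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode()
            if hashlib.sha256(payload).hexdigest() != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

Rewriting any past prompt or completion invalidates every subsequent hash, which is what makes after-the-fact edits detectable in an audit.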