When a major insurer’s AI system takes months to settle a claim that should be resolved in hours, the problem usually isn’t the model in isolation. It’s the system around the model and the latency that system introduces at every step. Speed in enterprise AI isn’t about impressive benchmark numbers. It’s about whether AI can...
The post AI latency is a business risk. Here’s how to manage it appeared first on DataRobot.
Why reasoning models dramatically increase token usage, latency, and infrastructure costs in production systems
The post Inference Scaling (Test-Time Compute): Why Reasoning Models Raise Your Compute Bill appeared first on Towards Data Science.
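The cost effect the headline describes comes from reasoning tokens being billed as output tokens even though the user never sees them. A minimal sketch of the arithmetic, with illustrative per-1k prices and token counts (all numbers here are assumptions, not any vendor's actual rates):

```python
# Hypothetical sketch: how hidden reasoning tokens inflate per-request cost.
# Prices and token counts are illustrative assumptions, not vendor figures.

def request_cost(prompt_tokens, visible_output_tokens, reasoning_tokens,
                 price_in_per_1k=0.003, price_out_per_1k=0.015):
    """Reasoning tokens are typically billed as output tokens."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (billed_output / 1000) * price_out_per_1k

standard = request_cost(500, 300, reasoning_tokens=0)
reasoning = request_cost(500, 300, reasoning_tokens=4000)
print(f"standard:  ${standard:.4f}")
print(f"reasoning: ${reasoning:.4f}  ({reasoning / standard:.1f}x)")
```

With these assumed numbers, a few thousand reasoning tokens multiply the bill by an order of magnitude per request, which is why the effect compounds quickly at production traffic volumes.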
Gothenburg promised to optimise school admissions with a piece of code. The resulting chaos showed how unaccountable systems are ruining lives
We like to imagine that injustice announces itself loudly. That when something goes wrong in a public system, alarms sound and someone takes responsibility, or is held accountable if they do not. But in 2020 in Gothenburg, injustice arrived quietly, disguised as efficiency.
For the first time, the city used an algorithm to allocate places in its schools. After all, working out geographical catchment areas and admissions is an administrative headache for any municipality. What better than a machine to optimise distances, preferences and capacity? The system was designed to serve public efficiency: framed as neutral, streamlined and objective.
Charlotta Kronblad researches digital transformation at the University of Gothenburg.
DataRobot's new ACL Hydration framework preserves source-system access controls and enforces them at query time — so agentic AI retrieves the right information for the right user, every time.
The post Introducing ACL Hydration: secure knowledge workflows for agentic AI appeared first on DataRobot.
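The core idea behind enforcing source-system access controls at query time can be sketched as a post-retrieval filter: each chunk carries the ACL it had in the source system, and results are intersected with the requesting user's entitlements before the model ever sees them. The names below (`Document`, `retrieve`, `allowed_groups`) are hypothetical illustrations, not DataRobot's API:

```python
# Minimal sketch of query-time ACL enforcement for retrieval.
# All identifiers here are hypothetical, not part of any product API.
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    allowed_groups: set = field(default_factory=set)  # ACL carried over from the source system

def retrieve(query_hits, user_groups):
    """Drop any retrieved chunk the requesting user is not entitled to see."""
    return [d for d in query_hits if d.allowed_groups & user_groups]

hits = [
    Document("Q3 revenue forecast", {"finance"}),
    Document("Public FAQ", {"finance", "everyone"}),
]
print([d.text for d in retrieve(hits, {"everyone"})])  # only the public doc survives
```

Filtering at query time, rather than at index time, means a permission change in the source system takes effect on the very next query instead of waiting for a re-index.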
Late last year I got pulled into a capacity planning exercise for a global retailer that had wired a 70B model into their product search and recommendation pipeline. Every search query triggered an inference call. During holiday traffic, their cluster was burning through GPU-hours at a rate that made their cloud finance team physically uncomfortable. They had already scaled from 24 to 48 H100s, and latency was still spiking during peak hours. I was brought in to answer a simple question: do we need 96 GPUs for the January sale, or is something else going on?

I started where I always start with these engagements: profiling. I instrumented the serving layer and broke the utilization data down by inference phase. What came back changed how I think about GPU infrastructure.
During prompt processing — the phase where the model reads the entire user input in parallel — the H100s were running at 92% compute utilization. Tensor cores fully saturated. Exactly what you want to see on a $30K GPU. But…
You bet on a hyperscaler to power your AI ambitions. One provider, one ecosystem, one set of tools. What nobody said out loud is that you just walked into a walled garden. The walls are the point. AWS, GCP, and Azure can all be connected to other environments, but none of them is built to...
The post Your AI agents will run everywhere. Is your architecture ready for that? appeared first on DataRobot.