When a major insurer’s AI system takes months to settle a claim that should be resolved in hours, the problem usually isn’t the model in isolation. It’s the system around the model and the latency that system introduces at every step. Speed in enterprise AI isn’t about impressive benchmark numbers. It’s about whether AI can...
The post AI latency is a business risk. Here’s how to manage it appeared first on DataRobot.
Why reasoning models dramatically increase token usage, latency, and infrastructure costs in production systems
The post Inference Scaling (Test-Time Compute): Why Reasoning Models Raise Your Compute Bill appeared first on Towards Data Science.
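The cost effect the headline describes comes from reasoning tokens being billed as output tokens even though the user never sees them. A minimal sketch of the arithmetic, with illustrative per-1k prices and token counts (all numbers here are assumptions, not any vendor's actual rates):

```python
# Hypothetical sketch: how hidden reasoning tokens inflate per-request cost.
# Prices and token counts are illustrative assumptions, not vendor figures.

def request_cost(prompt_tokens, visible_output_tokens, reasoning_tokens,
                 price_in_per_1k=0.003, price_out_per_1k=0.015):
    """Reasoning tokens are typically billed as output tokens."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (billed_output / 1000) * price_out_per_1k

standard = request_cost(500, 300, reasoning_tokens=0)
reasoning = request_cost(500, 300, reasoning_tokens=4000)
print(f"standard:  ${standard:.4f}")
print(f"reasoning: ${reasoning:.4f}  ({reasoning / standard:.1f}x)")
```

With these assumed numbers, a few thousand reasoning tokens multiply the bill by an order of magnitude per request, which is why the effect compounds quickly at production traffic volumes.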
Gothenburg promised to optimise school admissions with a piece of code. The resulting chaos showed how unaccountable systems are ruining lives
We like to imagine that injustice announces itself loudly. That when something goes wrong in a public system, alarms sound and someone takes responsibility, or is held accountable if they do not. But in 2020 in Gothenburg, injustice arrived quietly, disguised as efficiency.
For the first time, the city used an algorithm to allocate places in its schools. After all, working out geographical catchment areas and admissions is an administrative headache for any municipality. What better than a machine to optimise distances, preferences and capacity? The system was designed to serve public efficiency: framed as neutral, streamlined and objective.
Charlotta Kronblad researches digital transformation at the University of Gothenburg.
DataRobot's new ACL Hydration framework preserves source-system access controls and enforces them at query time — so agentic AI retrieves the right information for the right user, every time.
The post Introducing ACL Hydration: secure knowledge workflows for agentic AI appeared first on DataRobot.
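The core idea behind enforcing source-system access controls at query time can be sketched as a post-retrieval filter: each chunk carries the ACL it had in the source system, and results are intersected with the requesting user's entitlements before the model ever sees them. The names below (`Document`, `retrieve`, `allowed_groups`) are hypothetical illustrations, not DataRobot's API:

```python
# Minimal sketch of query-time ACL enforcement for retrieval.
# All identifiers here are hypothetical, not part of any product API.
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    allowed_groups: set = field(default_factory=set)  # ACL carried over from the source system

def retrieve(query_hits, user_groups):
    """Drop any retrieved chunk the requesting user is not entitled to see."""
    return [d for d in query_hits if d.allowed_groups & user_groups]

hits = [
    Document("Q3 revenue forecast", {"finance"}),
    Document("Public FAQ", {"finance", "everyone"}),
]
print([d.text for d in retrieve(hits, {"everyone"})])  # only the public doc survives
```

Filtering at query time, rather than at index time, means a permission change in the source system takes effect on the very next query instead of waiting for a re-index.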
Late last year I got pulled into a capacity planning exercise for a global retailer that had wired a 70B model into their product search and recommendation pipeline. Every search query triggered an inference call. During holiday traffic, their cluster was burning through GPU-hours at a rate that made their cloud finance team physically uncomfortable. They had already scaled from 24 to 48 H100s, and latency was still spiking during peak hours. I was brought in to answer a simple question: do we need 96 GPUs for the January sale, or is something else going on?

I started where I always start with these engagements: profiling. I instrumented the serving layer and broke the utilization data down by inference phase. What came back changed how I think about GPU infrastructure.
During prompt processing — the phase where the model reads the entire user input in parallel — the H100s were running at 92% compute utilization. Tensor cores fully saturated. Exactly what you want to see on a $30K GPU. But…
You bet on a hyperscaler to power your AI ambitions. One provider, one ecosystem, one set of tools. What nobody said out loud is that you just walked into a walled garden. The walls are the point. AWS, GCP, and Azure can all be connected to other environments, but none of them is built to...
The post Your AI agents will run everywhere. Is your architecture ready for that? appeared first on DataRobot.