Source link : https://tech365.info/the-crew-behind-steady-batching-says-your-idle-gpus-must-be-operating-inference-not-sitting-darkish/
Every GPU cluster has dead time. Training jobs finish, workloads shift and hardware sits dark while power and cooling costs keep running. For neocloud operators, those empty cycles are lost margin.
The obvious workaround is spot GPU markets: renting spare capacity to whoever needs it. But spot instances mean the cloud provider is still just renting out hardware, and the engineers buying that capacity are still paying for raw compute with no inference stack attached.
FriendliAI’s answer is different: run inference directly on the unused hardware, optimize for token throughput, and split the revenue with the operator. FriendliAI was founded by Byung-Gon Chun, the researcher whose paper on continuous batching became foundational to vLLM, the open source inference engine used across most production deployments today.
Chun spent over a decade as a professor at Seoul National University studying the efficient execution of machine learning models at scale. That research produced a paper called Orca, which introduced continuous batching. The technique processes inference requests dynamically rather than waiting to fill a fixed batch before executing. It is now the industry standard and is the core mechanism inside vLLM.
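The idea is easiest to see in code. Below is a minimal, hypothetical Python sketch of iteration-level scheduling as described above: the scheduler re-forms the GPU batch at every decoding step, admitting waiting requests and retiring finished ones, instead of blocking until a fixed batch fully drains. All names here (Request, decode_step, MAX_BATCH) are illustrative assumptions, not FriendliAI’s or vLLM’s actual API.

```python
import collections

MAX_BATCH = 8  # assumed GPU batch capacity, for illustration only

class Request:
    """A toy generation request tracking how many tokens it still needs."""
    def __init__(self, req_id, max_new_tokens):
        self.req_id = req_id
        self.generated = 0
        self.max_new_tokens = max_new_tokens

    def finished(self):
        return self.generated >= self.max_new_tokens

def decode_step(batch):
    """Stand-in for one model forward pass emitting one token per request."""
    for req in batch:
        req.generated += 1

def continuous_batching_loop(incoming):
    waiting = collections.deque(incoming)
    running = []
    while waiting or running:
        # Iteration-level scheduling: top up the batch before *every* step,
        # rather than only when an entire fixed batch has completed.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished requests immediately, freeing their slots for
        # the next iteration instead of idling until the batch drains.
        for req in running:
            if req.finished():
                print(f"request {req.req_id} done after {req.generated} tokens")
        running = [r for r in running if not r.finished()]

if __name__ == "__main__":
    reqs = [Request(i, max_new_tokens=4 + 3 * i) for i in range(12)]
    continuous_batching_loop(reqs)
```

The payoff is utilization: a short request no longer waits on the longest request in its batch, and freed slots are refilled on the very next decoding step.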
This week, FriendliAI is launching a new platform called InferenceSense. Just as publishers use Google AdSense to monetize unsold ad inventory,…
----
Author : tech365
Publish date : 2026-03-12 15:26:00
Copyright for syndicated content belongs to the linked Source.
----