2 posts tagged with "semantic-router"

vLLM Semantic Router + Milvus: How Semantic Routing and Caching Build Scalable AI Systems the Smart Way

October 30, 2025 · 9 min read

Milvus Ambassador

Most AI apps rely on a single model for every request, but that approach quickly runs into limits. Large models are powerful yet expensive, even when they're used for simple queries. Smaller models are cheaper and faster but often struggle with complex reasoning. When traffic surges (say your AI app suddenly goes viral with ten million users overnight), the inefficiency of this one-model-for-all setup becomes painfully apparent: latency spikes, GPU bills explode, and the model that ran fine yesterday starts gasping for air.

vLLM Semantic Router: Next Phase in LLM inference

September 6, 2025 · 5 min read

Huamin Chen

Distinguished Engineer @ Red Hat

Chen Wang

Senior Staff Research Scientist @ IBM

Yue Zhu

Staff Research Scientist @ IBM

Xunzhuo Liu

Software Engineer @ Tencent

Synced from the official vLLM Blog: vLLM Semantic Router: Next Phase in LLM inference