
vLLM Semantic Router + Milvus: How Semantic Routing and Caching Build Scalable AI Systems the Smart Way

· 9 min read
Min Yin
Milvus Ambassador

Most AI apps rely on a single model for every request, but that approach quickly runs into limits. Large models are powerful but expensive, and that cost is wasted on simple queries. Smaller models are cheaper and faster but can't handle complex reasoning. When traffic surges (say your AI app suddenly goes viral with ten million users overnight), the inefficiency of this one-model-for-all setup becomes painfully apparent: latency spikes, GPU bills explode, and the model that ran fine yesterday starts gasping for air.

Semantic Router Q4 2025 Roadmap: Journey to Iris

· 15 min read
Xunzhuo Liu
Software Engineer @ Tencent
Huamin Chen
Distinguished Engineer @ Red Hat
Chen Wang
Senior Staff Research Scientist @ IBM
Yue Zhu
Staff Research Scientist @ IBM

As we approach the end of 2025, we're excited to share our Q4 2025 roadmap for vLLM Semantic Router. This quarter marks a significant milestone in our project's evolution as we prepare for our first major release: v0.1, codename "Iris", expected in late 2025 to early 2026.
