
This blog post focuses on new features and improvements. For a comprehensive list, including bug fixes, please see the release notes.
LLM inference at scale typically involves deploying multiple replicas of the same model behind a load balancer. The standard approach treats these replicas as interchangeable and routes requests randomly or round-robin across them.
But LLM inference isn't stateless. Each replica builds up a KV cache of previously computed attention states. When a request lands on a replica without the relevant context already cached, the model has to recompute everything from scratch. This wastes GPU cycles and increases latency.
The problem becomes visible in three common patterns: shared system prompts (every app has one), RAG pipelines (users query the same knowledge base), and multi-turn conversations (follow-up messages share context). In all three cases, a naive load balancer forces replicas to independently compute the same prefixes, multiplying redundant work by your replica count.
Clarifai 12.3 introduces KV Cache-Aware Routing, which automatically detects prompt overlap across requests and routes them to the replica most likely to already have the relevant context cached. This delivers measurably higher throughput and lower time-to-first-token with zero configuration required.
This release also includes Warm Node Pools for faster scaling and failover, Session-Aware Routing to keep user requests on the same replica, Prediction Caching for identical inputs, and Clarifai Skills for AI coding assistants.
When you deploy an LLM with multiple replicas, standard load balancing distributes requests evenly across all replicas. This works well for stateless applications, but LLM inference has state: the KV cache.
The KV cache stores previously computed key-value pairs from the attention mechanism. When a new request shares context with a previous request, the model can reuse these cached computations instead of recalculating them. This makes inference faster and more efficient.
But if your load balancer doesn't account for cache state, requests get scattered randomly across replicas. Each replica ends up recomputing the same context independently, wasting GPU resources.
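To make the reuse concrete, here is a minimal, purely illustrative sketch of prefix-based KV cache reuse. Real engines such as vLLM manage cached attention blocks on the GPU; this toy version just tracks which token prefixes a replica has already processed and counts how much work a new request actually requires.

```python
# Illustrative sketch only: real KV caches store attention tensors per block,
# not booleans, but the reuse logic is the same.

def longest_cached_prefix(cache: dict, tokens: list) -> int:
    """Return the length of the longest prompt prefix already in the cache."""
    best = 0
    for n in range(1, len(tokens) + 1):
        if tuple(tokens[:n]) in cache:
            best = n
    return best

def process_request(cache: dict, tokens: list) -> int:
    """Simulate processing a prompt; return how many tokens were recomputed."""
    hit = longest_cached_prefix(cache, tokens)
    recomputed = len(tokens) - hit          # only the uncached suffix is new work
    for n in range(hit + 1, len(tokens) + 1):
        cache[tuple(tokens[:n])] = True     # stand-in for real KV tensors
    return recomputed

cache = {}
system_prompt = list(range(50))             # shared 50-token system prompt
first = process_request(cache, system_prompt + [100, 101])   # cold cache
second = process_request(cache, system_prompt + [200, 201])  # prefix is warm
print(first, second)  # 52 2
```

The second request recomputes only its two unique tokens because the 50-token system prompt is already cached. Scatter that second request to a different replica, and all 52 tokens are recomputed.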
Shared system prompts are the clearest example. Every application has a system instruction that prefixes user messages. When 100 users hit the same model, a random load balancer scatters them across replicas, forcing each one to independently compute the same system prompt prefix. If you have 5 replicas, you're computing that system prompt 5 times instead of once.
RAG pipelines amplify the problem. Users querying the same knowledge base get near-identical retrieved-document prefixes injected into their prompts. Without cache-aware routing, this shared context is recomputed on every replica instead of being reused. The overlap can be substantial, especially when multiple users ask related questions within a short time window.
Multi-turn conversations create implicit cache dependencies. Follow-up messages in a conversation share the entire prior context. If the second message lands on a different replica than the first, the full conversation history has to be reprocessed. This gets worse as conversations grow longer.
Clarifai Compute Orchestration analyzes incoming requests, detects prompt overlap, and routes them to the replica most likely to already have the relevant KV cache loaded.
The routing layer identifies shared prefixes and directs traffic to replicas where that context is already warm. This happens transparently at the platform level. You don't configure cache keys, manage sessions, or modify your application code.
The result is measurably higher throughput and lower time-to-first-token. GPU utilization improves because replicas spend less time on redundant computation. Users see faster responses because requests hit replicas that are already warmed up with the relevant context.
This optimization is available automatically on any multi-replica deployment of vLLM or SGLang-backed models. No configuration required. No code changes needed.
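Clarifai's router is internal to the platform, but the core idea can be sketched in a few lines. This hypothetical router scores each replica by how much of the incoming prompt it already has cached, routes to the best match, and falls back to the least-loaded replica on a cold start. All names here are illustrative, not the production implementation.

```python
# Hedged sketch of prefix-affinity routing; the real router is internal.

class PrefixAffinityRouter:
    def __init__(self, n_replicas: int):
        self.prefixes = [set() for _ in range(n_replicas)]  # cached prefix hashes
        self.load = [0] * n_replicas

    def route(self, tokens: list) -> int:
        def overlap(r: int) -> int:
            # longest prefix of `tokens` already cached on replica r
            n = 0
            while n < len(tokens) and hash(tuple(tokens[:n + 1])) in self.prefixes[r]:
                n += 1
            return n

        scores = [overlap(r) for r in range(len(self.prefixes))]
        if max(scores) > 0:
            replica = scores.index(max(scores))       # warmest replica wins
        else:
            replica = self.load.index(min(self.load))  # cold start: least loaded
        # record the prefixes this request leaves warm on that replica
        for n in range(1, len(tokens) + 1):
            self.prefixes[replica].add(hash(tuple(tokens[:n])))
        self.load[replica] += 1
        return replica

router = PrefixAffinityRouter(n_replicas=3)
system = list(range(20))
a = router.route(system + [1])   # cold: least-loaded replica
b = router.route(system + [2])   # shares the system prefix: same replica
print(a, b)  # 0 0
```

The second request is pinned to the replica that already computed the shared prefix, which is exactly the behavior that avoids the redundant work described above.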
GPU cold starts happen when deployments need to scale beyond their current capacity. The typical sequence: provision a cloud node (1-5 minutes), pull the container image, download model weights, load into GPU memory, then serve the first request.
Setting min_replicas ≥ 1 keeps baseline capacity always warm. But when traffic exceeds that baseline or failover happens to a secondary nodepool, you still face infrastructure provisioning delays.
Warm Node Pools keep GPU infrastructure pre-warmed and ready to accept workloads.
Popular GPU instance types have nodes standing by, ready to accept workloads without waiting for cloud provider provisioning. When your deployment needs to scale up, the node is already there.
When your primary nodepool approaches capacity, Clarifai automatically begins preparing the next priority nodepool before traffic spills over. By the time overflow happens, the infrastructure is ready.
Warm capacity is held using lightweight placeholder workloads that are instantly evicted when a real model needs the GPU. Your model gets the resources immediately without competing for scheduling.
This eliminates the infrastructure provisioning step (1-5 minutes). Container image pull and model weight loading still happen when a new replica starts, but combined with Clarifai's pre-built base images and optimized model loading, scaling delays are significantly reduced.
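The placeholder mechanism resembles priority-based preemption in schedulers like Kubernetes: a low-priority filler workload holds the GPU node, and any real model workload preempts it instantly. The sketch below is a simplified illustration of that idea, not Clarifai's actual scheduler.

```python
# Simplified illustration of placeholder eviction via priority preemption.

PLACEHOLDER, MODEL = 0, 100  # placeholder priority is always lower

class GPUNode:
    def __init__(self):
        # the node is held warm by a lightweight placeholder workload
        self.workload = ("placeholder", PLACEHOLDER)

    def schedule(self, name: str, priority: int) -> bool:
        # a higher-priority (real) workload instantly evicts the placeholder
        if priority > self.workload[1]:
            self.workload = (name, priority)
            return True
        return False

node = GPUNode()
evicted = node.schedule("llama-70b", MODEL)
print(evicted, node.workload[0])  # True llama-70b
```

Because the placeholder never outranks a real workload, the model never competes for scheduling: the node is already provisioned, and the only remaining startup cost is pulling the image and loading weights.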
Beyond KV cache affinity, Clarifai 12.3 includes two additional routing optimizations that work together to improve performance.
Session-Aware Routing keeps user requests on the same replica throughout a session. This is particularly useful for conversational applications where follow-up messages from the same user share context. Instead of relying on KV cache affinity to detect overlap, session-aware routing ensures continuity by routing based on user or session identifiers.
This works without any client-side changes. The platform handles session tracking automatically and ensures that requests with the same session ID land on the same replica, preserving KV cache locality.
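One common way to implement this kind of affinity is a stable hash of the session identifier, so every request in a session deterministically maps to the same replica. The names below are hypothetical; the platform performs the equivalent server-side.

```python
# Sketch of session affinity via stable hashing (illustrative names).
import hashlib

def replica_for_session(session_id: str, n_replicas: int) -> int:
    # sha256 is stable across processes, unlike Python's built-in hash()
    digest = hashlib.sha256(session_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_replicas

r1 = replica_for_session("user-42-chat-7", n_replicas=5)
r2 = replica_for_session("user-42-chat-7", n_replicas=5)
print(r1 == r2)  # True: the session is pinned to one replica
```

A deterministic hash means no session table is strictly required for routing, though a real implementation also has to handle replicas joining and leaving (for example, with consistent hashing).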
Prediction Caching stores results for identical input, model, and version combinations. When the exact same request arrives, the cached result is returned immediately without invoking the model.
This is useful for scenarios where multiple users submit identical queries. For example, in a customer support application where users frequently ask the same questions, prediction caching eliminates redundant inference calls entirely.
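The cache key semantics can be sketched simply: hash the (model, version, input) triple, and return the stored result on an exact match. This is an illustration of the concept, not Clarifai's implementation.

```python
# Sketch of a prediction cache keyed on (model, version, input).
import hashlib
import json

class PredictionCache:
    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, model: str, version: str, inputs) -> str:
        payload = json.dumps([model, version, inputs], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def predict(self, model, version, inputs, run_model):
        key = self._key(model, version, inputs)
        if key in self._store:
            self.hits += 1
            return self._store[key]          # identical request: no inference
        result = run_model(inputs)           # only runs on a cache miss
        self._store[key] = result
        return result

cache = PredictionCache()
fake_model = lambda x: x.upper()             # stand-in for real inference
cache.predict("llm", "v1", "hello", fake_model)
cache.predict("llm", "v1", "hello", fake_model)  # exact repeat: served from cache
print(cache.hits)  # 1
```

Note that the version is part of the key, so upgrading a model never serves stale results from the previous version.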
Both features are enabled automatically. You don't configure cache policies or manage session state. The routing layer handles this transparently.
We're releasing Clarifai Skills that turn AI coding assistants like Claude Code into Clarifai platform experts. Instead of explaining APIs from scratch, you describe what you want in plain language and your assistant finds the right skill and gets to work.
Built on the open Agent Skills standard, Clarifai Skills work across 30+ agent platforms including Claude Code, Cursor, GitHub Copilot, and Gemini. Each skill includes detailed reference documentation and working code examples.
Available skills cover the full platform: CLI commands (clarifai-cli), model deployment (clarifai-model-upload), inference (clarifai-inference), MCP server development (clarifai-mcp), deployment lifecycle management (clarifai-deployment-lifecycle), observability (clarifai-observability), and more.
Installation is straightforward. Once installed, skills activate automatically when your request matches their description. Ask naturally ("Deploy Qwen3-0.6B with vLLM") and your assistant generates the correct code using Clarifai's APIs and conventions.
Full documentation, installation instructions, and examples are available in the docs.
Model Serving and Deployment
The clarifai model deploy command now includes multi-cloud GPU discovery and a zero-prompt deployment flow. Simplified config.yaml structure for model initialization makes it easier to get started.
clarifai model serve now reuses existing resources when available instead of creating new ones. Served models are private by default. Added --keep flag to preserve the build directory after serving, useful for debugging and inspecting build artifacts.
Local Runner is now public by default. Models launched via the local runner are publicly accessible without manually setting visibility.
Model Runner
Added VLLMOpenAIModelClass parent class with built-in cancellation support and health probes for vLLM-backed models.
Optimized model runner memory and latency. Reduced the runner's memory footprint, improved response latency, and streamlined overhead in SSE (Server-Sent Events) streaming.
Auto-detect and clamp max_tokens. The runner now automatically detects the backend's max_seq_len and clamps max_tokens to that value, preventing out-of-range errors.
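The clamping behavior described above reduces to a one-line guard (function name is illustrative):

```python
def clamp_max_tokens(requested: int, max_seq_len: int) -> int:
    # never request more tokens than the backend's context window allows
    return min(requested, max_seq_len)

print(clamp_max_tokens(8192, 4096))  # 4096: clamped to the backend limit
print(clamp_max_tokens(256, 4096))   # 256: within range, unchanged
```

Some runtimes additionally subtract the prompt length so prompt plus completion fits in the window; the release notes describe clamping to max_seq_len itself.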
Bug Fixes
Fixed reasoning model token tracking and streaming in agentic class. Token tracking for reasoning models now correctly accounts for reasoning tokens. Fixed event-loop safety, streaming, and tool call passthrough in the agentic class.
Fixed user/app context conflicts in CLI. Resolved conflicts between user_id and app_id when using named contexts in CLI commands.
Fixed clarifai model init directory handling. The command now correctly updates an existing model directory instead of creating a subdirectory.
KV Cache-Aware Routing is available now on all multi-replica deployments. Deploy a model with multiple replicas, and routing optimizations are enabled automatically. No configuration required.
Install Clarifai Skills to turn Claude Code, Cursor, or any AI coding assistant into a Clarifai platform expert. Read the full installation guide and see the complete release notes for all updates in 12.3.
Sign up to start deploying models with intelligent request routing, or join the community on Discord if you have any questions.