FluxCompute sits between your agent and your models. Every query gets analyzed in ~12ms and routed to the cheapest model that can answer it correctly. Same accuracy, fraction of the cost, no infra change.
1$ flux compile --provider anthropic \ 2 --traces ./agent-traces.jsonl 3 4→ analyzing 14d of agent traces ✓ 3.2s 5→ training query classifier ✓ 18.4s 6→ attaching KV cache layer ✓ 0.4s 7→ healthcheck /v1/route ✓ 12ms
Multi-turn agents recompute attention states on every turn. We persist them by session ID and restore them on the next step. Invisible to your agent code.
Real-time difficulty analysis on every incoming query. Easy → Haiku. Medium → Sonnet. Hard → Opus. Re-classifies when execution branches into a tool call or reasoning loop.
Dispatches each query to the chosen model — API tier or local weights. Handles provider-level routing across OpenAI, Anthropic, and local weight deployments.
When execution crosses a model boundary mid-loop, FluxCompute serialises full agent state — memory, tool calls, conversation history — and cross-translates it to the target model's format. The loop resumes without restart.
Tracks whether each routing decision was correct. Detects when query distribution shifts — seasonal, customer cohort, new feature — and triggers automatic recompilation monthly.
Cost per query, per type, per customer. Latency breakdowns. Routing accuracy. The dashboard your CFO asks for — exposed as Prometheus, OTLP, or our UI.
For regulated workloads — HIPAA, GDPR, classified, sovereign. Run 80% of queries on compressed models 40–70% smaller, with no measurable accuracy loss on your workload. All data stays local.
Measured on real production agent workloads, A6000 Ada hardware, against HumanEval and TriviaQA. Cost normalized to baseline of routing every query to the top-tier model. Lower is better.
FluxCompute is built by two Cornell Tech researchers with hands-on experience in ML hardware, production LLM systems, and inference optimization.
Researcher at Cornell Tech in Agentic Systems, ML hardware, and hardware-software co-design. Inference research at MIT CSAIL: hierarchical MoE, DAG-based token routing. SWE at Google: SVD-based context engineering and evaluation frameworks for Gemini. Prior: HPC systems, math libraries, and container platforms at Hewlett Packard Enterprise.
Two years shipping production LLM inference, dense retrieval, and RAG pipelines at Veolia — re-architected enterprise knowledge access from weeks to minutes, 6× faster reporting cadence, +20% NDCG. Researcher at Cornell Tech's Social Technologies Lab.