Research-backed · 60–70% inference cost reduction Research · NE Agents Day 2026 · Cornell Tech

The optimization layer for
agentic workflows.

FluxCompute sits between your agent and your models. Every query gets analyzed in ~12ms and routed to the cheapest model that can answer it correctly. Same accuracy, fraction of the cost, no infra change.

~/agent-app · flux compile
1$ flux compile --provider anthropic \
2  --traces ./agent-traces.jsonl
3
4 analyzing 14d of agent traces     3.2s
5 training query classifier         18.4s
6 attaching KV cache layer          0.4s
7 healthcheck /v1/route             12ms
Routing live · baseline $48,200/mo → projected $14,100/mo (−71%) total: 22.0s
Research & engineering from
Cornell Tech MIT CSAIL Google Veolia
01 · How it works

We optimize inference costs for large scale agentic workflows.

L0 · KV cache persistence

Multi-turn agents recompute attention states on every turn. We persist them by session ID and restore them on the next step. Invisible to your agent code.

Prefill recompute −50%

L1 · Query classifier

Real-time difficulty analysis on every incoming query. Easy → Haiku. Medium → Sonnet. Hard → Opus. Re-classifies when execution branches into a tool call or reasoning loop.

Routing overhead 12ms

L2 · Model executor

Dispatches each query to the chosen model — API tier or local weights. Handles provider-level routing across OpenAI, Anthropic, and local weight deployments.

Providers OpenAI · Anthropic · Local

Context Handoff

When execution crosses a model boundary mid-loop, FluxCompute serialises full agent state — memory, tool calls, conversation history — and cross-translates it to the target model's format. The loop resumes without restart.

State fidelity lossless

L3 · Drift monitor

Tracks whether each routing decision was correct. Detects when query distribution shifts — seasonal, customer cohort, new feature — and triggers automatic recompilation monthly.

Accuracy delta vs. baseline <1%

L4 · Observability

Cost per query, per type, per customer. Latency breakdowns. Routing accuracy. The dashboard your CFO asks for — exposed as Prometheus, OTLP, or our UI.

Telemetry OTLP · Prom · S3

On-prem compression

For regulated workloads — HIPAA, GDPR, classified, sovereign. Run 80% of queries on compressed models 40–70% smaller, with no measurable accuracy loss on your workload. All data stays local.

Model compression 40–70%
02 · Performance

Measured on production workloads.

Measured on real production agent workloads, A6000 Ada hardware, against HumanEval and TriviaQA. Cost normalized to baseline of routing every query to the top-tier model. Lower is better.

Inference cost · normalized

N=2.1M queries · <1% accuracy delta
FluxCompute
0.30×
Single-tier router
0.72×
Prompt compression
0.84×
KV cache only
0.88×
Baseline (top tier)
1.00×

Live routing feed

prod · streaming
21:14:0811mseasy → haiku200
21:14:0814msmedium → sonnet200
21:14:079mseasy → haiku200
21:14:0722mshard → opus200
21:14:0617msmedium → sonnet · branch200
21:14:0610mseasy → haiku200
21:14:0513mseasy → haiku200
21:14:0519mshard → opus200
04 · Team

Built by researchers who've worked on the problem.

FluxCompute is built by two Cornell Tech researchers with hands-on experience in ML hardware, production LLM systems, and inference optimization.

I

Ishan Patwardhan

Co-founder

Researcher at Cornell Tech in Agentic Systems, ML hardware, and hardware-software co-design. Inference research at MIT CSAIL: hierarchical MoE, DAG-based token routing. SWE at Google: SVD-based context engineering and evaluation frameworks for Gemini. Prior: HPC systems, math libraries, and container platforms at Hewlett Packard Enterprise.

N

Niki Karanikola

Co-founder

Two years shipping production LLM inference, dense retrieval, and RAG pipelines at Veolia — re-architected enterprise knowledge access from weeks to minutes, 6× faster reporting cadence, +20% NDCG. Researcher at Cornell Tech's Social Technologies Lab.

See your inference bill drop in two weeks.