# AI Cost Optimization — Research & Direction
## Background: What We Built
We built an AI proxy / cost-optimization MVP and submitted it to YC.
- YC result: top 10% of applicants, but not selected or interviewed
- What it did: an API proxy layer that tracked business AI usage, plus semantic caching for local tools (Cursor, Claude Code)
- Caching keyed on local file paths was novel, but competitors can replicate it
## Why YC Probably Passed
The core problem: no durable moat
- We forked open-source repos and added a proxy layer
- Good codebase, but gateway/proxy is a crowded layer
- Any team can spin up the same stack in a weekend
- YC wants to see "why can't the big cloud providers or OpenAI just do this?"
## Competitive Landscape
### API Gateway / Proxy Layer (crowded)
| Company | What it does | Status |
|---|---|---|
| Portkey | Open-source LLM gateway — routing, caching, fallbacks | OSS + SaaS, well-funded |
| Martian | ML-based model router — routes to cheapest model that meets quality bar | $9M raised |
| RouteLLM | Open-source model routing (LMSYS research) | OSS, widely adopted |
| LiteLLM | Open-source unified LLM API with cost tracking | Very popular OSS |
| OpenRouter | API marketplace for 100+ models with routing | Growing fast |
| Helicone | LLM observability + cost analytics | Well-funded |
Conclusion: the API proxy/gateway layer is commoditizing. It's hard to win here without massive distribution or being first.
## The Real Opportunity: Go Deeper Than the API
The gateway is the wrong layer. The moat is at layers where systems knowledge matters — where you can't just fork a repo.
### Direction 1: Agent Workflow Optimization ⭐
The problem: Multi-step agentic pipelines (LangGraph, CrewAI, AutoGen) are wasteful by design.
- Each step calls the frontier model (GPT-4, Claude 3.5) even for trivial subtasks
- No one is optimizing the pipeline — only individual API calls
- A "write a report" agent might make 20 LLM calls — most could be routed to a $0.001 model
What we'd build:
- An agent execution optimizer — sits between your agent framework and LLM APIs
- Analyzes the DAG of agent tasks at runtime
- Automatically routes each step to the cheapest model that can handle it
- Learns over time: "this summarization step → always use Haiku, this code review step → needs Sonnet"
- Caches intermediate agent states (not just prompts — full context snapshots)
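A minimal sketch of the learned routing idea above — route each step type to the cheapest model whose observed quality clears a bar, falling back to the frontier model when there's no data yet. All model names, prices, and thresholds here are illustrative assumptions, not real pricing or our actual design:

```python
from dataclasses import dataclass, field

@dataclass
class StepRouter:
    # Hypothetical learned router: per-(step_type, model) quality estimates
    # plus a price table. Quality scores live in [0, 1].
    price: dict = field(default_factory=dict)    # model -> $ per 1k tokens (made up)
    quality: dict = field(default_factory=dict)  # (step_type, model) -> score

    def route(self, step_type: str, min_quality: float = 0.8) -> str:
        # Candidates that have proven good enough for this step type
        ok = [m for m in self.price
              if self.quality.get((step_type, m), 0.0) >= min_quality]
        if not ok:
            # No evidence yet: be safe and use the most capable (priciest) model
            return max(self.price, key=self.price.get)
        return min(ok, key=self.price.get)  # cheapest model that is good enough

    def record(self, step_type: str, model: str, score: float, alpha: float = 0.2):
        # Exponential moving average, so routing improves with usage (the flywheel)
        prev = self.quality.get((step_type, model), score)
        self.quality[(step_type, model)] = (1 - alpha) * prev + alpha * score

router = StepRouter(price={"haiku": 0.001, "sonnet": 0.015, "frontier": 0.06})
router.record("summarize", "haiku", 0.92)   # cheap model proved sufficient
print(router.route("summarize"))            # -> haiku
print(router.route("code_review"))          # no data yet -> frontier
```

The real system would replace the quality table with a learned policy over richer step features, but the cheapest-model-above-a-quality-bar decision rule is the core mechanism.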
Why this is defensible:
- Requires deep understanding of agent execution patterns, not just HTTP proxying
- The optimization model itself (learned routing policies) is the moat
- Gets better with more usage — flywheel effect
Target user: Companies running agent pipelines in production (LangChain, AutoGen users) with $10k+/mo LLM bills
Revenue: % of savings (e.g., take 20% of the cost reduction we generate), or flat SaaS per seat
Hosung's edge: Building the scheduler/optimizer that dispatches tasks to right-sized models is exactly systems programming — task scheduling, resource allocation, performance optimization
### Direction 2: Hybrid Local + Cloud Routing ⭐⭐
The problem: Edge/local inference (llama.cpp, Ollama, LM Studio) is good but nobody has cracked intelligent hybrid routing.
- Local models: fast, private, $0 — but lower quality
- Cloud models: expensive, slower — but higher quality
- Right now: developers manually decide which calls go where
- Nobody is doing dynamic local/cloud routing based on task complexity + latency budget
What we'd build:
- A local inference daemon with intelligent routing
- Runs on dev machines / company servers
- Intercepts LLM calls, scores task complexity, routes to:
  - Local (Ollama/llama.cpp) for simple/private tasks
  - Cloud (Anthropic/OpenAI) for complex tasks that need it
- Power-aware: Hosung's GPU power systems background — optimize inference for laptop battery life / thermal limits
- Predictive prefetching: anticipate next agent steps, pre-load context
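The daemon's routing decision above can be sketched as a single function combining task complexity with system state. The signals and thresholds are hypothetical placeholders (the real daemon would be C++ and read actual battery/thermal telemetry):

```python
def route_call(complexity: float, private: bool,
               on_battery: bool, gpu_temp_c: float,
               latency_budget_ms: int) -> str:
    """Decide 'local' or 'cloud' for one intercepted LLM call (illustrative)."""
    if private:
        return "local"      # privacy-sensitive data never leaves the machine
    if gpu_temp_c > 90 or (on_battery and complexity > 0.3):
        return "cloud"      # protect thermals / battery instead of local GPU
    if complexity < 0.5 and latency_budget_ms < 500:
        return "local"      # simple and latency-sensitive: skip the network
    return "cloud" if complexity >= 0.5 else "local"

print(route_call(0.2, private=False, on_battery=False,
                 gpu_temp_c=65, latency_budget_ms=300))    # -> local
print(route_call(0.9, private=False, on_battery=False,
                 gpu_temp_c=65, latency_budget_ms=5000))   # -> cloud
print(route_call(0.9, private=True, on_battery=True,
                 gpu_temp_c=95, latency_budget_ms=100))    # -> local
```

Note the ordering: privacy overrides everything, then power/thermal limits, then cost/quality — which is the argument for putting this logic in an OS-level daemon rather than an HTTP proxy.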
Why this is defensible:
- The local inference optimizer runs on hardware — needs deep systems knowledge (C++, power management, memory optimization)
- This is NOT just an HTTP proxy — it's a daemon that sits at the OS level
- Hosung literally worked on GPU power systems at Nvidia. This is his exact skillset applied to a new domain.
- The optimization loop (thermal, memory, latency) can't be replicated by a web dev
Target user:
- Enterprises with data privacy requirements (can't send data to cloud)
- Power users paying $200+/mo in API costs
- Companies running AI on edge devices
Revenue: Self-hosted license (enterprises pay $5k-50k/yr for on-prem), SaaS tier for individuals
## Our Specific Unfair Advantage
| What we have | Why it matters here |
|---|---|
| Hosung: GPU power systems @ Nvidia | Local inference is GPU-bound. Power/thermal optimization = better performance on constrained hardware |
| Hosung: C++ systems programming | The daemon that intercepts + routes needs to be low-latency, low-overhead. Python won't cut it |
| Angie: built B2B SaaS from 0→1 | Can ship the dashboard, billing, onboarding — the enterprise integration surface |
| Angie: product design chops | Cost savings UX is complex — visualizing where money is going is a real product challenge |
| We already built a working proxy | We have the codebase. We know the problem space. We're not starting from zero. |
## Why This Is Timely (2026)
- Agentic AI explosion: Every company is now running multi-step agent pipelines. LLM costs are the #1 complaint.
- Model proliferation: 100+ capable models now. The routing problem is real and growing.
- Local inference maturation: llama.cpp, Ollama, and quantized models (Gemma 3, Llama 3.3) are now good enough for 70% of tasks
- Enterprise AI cost pressure: Companies that went all-in on GPT-4 are now under budget pressure from finance teams
- No clear winner yet: Martian raised $9M but is still API-layer only. No one owns the agent workflow layer yet.
## Recommended Next Steps
### Immediate (this week)
- Talk to 10 people running agent pipelines — what are their actual monthly LLM costs? What would they pay to cut it by 40%?
- Audit our existing MVP codebase — what parts are reusable for the agent optimizer direction?
### Short-term (1 month)
- Build a prototype agent optimizer — pick one framework (LangGraph) and show 30%+ cost reduction on a real pipeline
- Measure baseline: Pick 3 common agent tasks, run them unoptimized vs. optimized, publish the numbers
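The baseline measurement is simple arithmetic worth writing down: per-run cost is tokens × price summed over steps. Prices below are hypothetical $ per 1M tokens, not real vendor pricing:

```python
# Illustrative cost model: one 20-step agent pipeline, ~2k tokens per step,
# unoptimized (all frontier) vs. optimized (most steps routed to cheaper tiers).
PRICE = {"frontier": 30.0, "mid": 3.0, "small": 0.25}  # $ per 1M tokens (made up)

def run_cost(steps):
    """steps: list of (model, tokens) pairs for one pipeline run."""
    return sum(PRICE[m] * tok / 1_000_000 for m, tok in steps)

unoptimized = [("frontier", 2000)] * 20
optimized = ([("frontier", 2000)] * 4
             + [("mid", 2000)] * 6
             + [("small", 2000)] * 10)

base, opt = run_cost(unoptimized), run_cost(optimized)
print(f"${base:.3f} -> ${opt:.3f} ({1 - opt / base:.0%} saved)")
# -> $1.200 -> $0.281 (77% saved)
```

Multiply by runs/month to turn this into the headline number a buyer cares about; the real experiment should measure quality deltas alongside cost.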
### Medium-term (3 months)
- Hosung builds the local routing daemon — C++ daemon that intercepts Ollama/llama.cpp calls and makes routing decisions
- Angie builds the analytics dashboard — show users where their costs are going, what's being routed where, savings to date
## Key Questions to Answer
- What is the actual LLM cost for a typical enterprise running LangGraph agents?
- Is the pain point cost reduction, latency reduction, or both?
- Does Martian's approach work well? Why haven't they won yet?
- Would companies trust a third-party daemon running on their inference stack?
- What's the minimum accuracy loss acceptable when routing to cheaper models?
## YC Resubmission Angle
Before: "We built an API proxy that tracks LLM costs and caches responses" → Weak. Sounds like a feature, not a company.
After (agent optimizer framing): "We cut LLM costs for agentic AI pipelines by 40-70% by optimizing which model handles each step in a multi-agent workflow — without writing any new prompts or changing existing code" → Stronger. Specific savings, specific mechanism, clear before/after, plug-and-play.
After (hybrid local+cloud framing): "We built the intelligent routing layer between local and cloud inference — companies keep sensitive data local and cut cloud costs 60%+, with zero code changes" → Enterprise angle. Privacy + cost. Defensible with Hosung's systems background.
Saved 2026-03-17 — from conversation research session
## RouteLLM Analysis (added 2026-03-17)
Full deep dive: Startup/routellm-deep-dive.md
What it is: Open-source binary router from LMSYS / UC Berkeley (same team as Chatbot Arena). ICLR 2025. 4,700+ stars.
The mechanism: Trains a Matrix Factorization model on 80k Chatbot Arena preference samples. Scores each incoming prompt → routes to strong (GPT-4) or weak (Mixtral) model based on a tunable threshold α.
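The threshold mechanism described above reduces to a one-line decision rule; here the learned MF scorer is stubbed with a toy function (an assumption for illustration — the real router is the model trained on Arena preference data):

```python
def route(prompt: str, score_fn, alpha: float = 0.5) -> str:
    """RouteLLM-style binary routing: score_fn(prompt) -> p in [0, 1],
    the estimated need for the strong model; alpha tunes cost vs. quality."""
    return "strong" if score_fn(prompt) >= alpha else "weak"

def toy_score(prompt: str) -> float:
    # Stand-in for the Matrix Factorization scorer (toy assumption:
    # longer prompts are more likely to need the strong model).
    return min(1.0, len(prompt) / 400)

print(route("summarize this paragraph", toy_score, alpha=0.5))  # -> weak
print(route("x" * 500, toy_score, alpha=0.5))                   # -> strong
```

Raising α sends more traffic to the weak model (cheaper, riskier); the paper's headline numbers come from sweeping α along that tradeoff curve.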
Benchmark results:
- 95% GPT-4 performance using only 14% GPT-4 calls on MT Bench
- vs. Martian: same quality, >40% cheaper
- Overhead: <40ms per request
Cross-model generalization: Works on Claude + Llama without retraining — learns query complexity, not model-specific patterns.
Key gaps (our differentiation):
| Gap | What it means for us |
|---|---|
| Binary routing only | Can't handle a pool of models (local + 3 cloud tiers) |
| No local inference | Doesn't know Ollama/llama.cpp exist — treats everything as remote API |
| No system-state | Ignores battery, GPU temp, memory pressure, monthly budget |
| No agent awareness | Optimizes individual calls, not pipelines |
| Prompt leaks on scoring | The scoring step itself sends your prompt to the cloud — no privacy |
Our strategic options:
- Build on top: Use their MF router as the complexity signal layer. Add local inference + system-state layer on top. Fast to build, borrows their academic credibility.
- Replace the cloud assumption: Local-first routing where the scoring model runs on-device (prompt never leaves the machine). Privacy as the product. Hosung's C++ daemon is the moat.
- Go above: Optimize agent execution graphs holistically — RouteLLM becomes a dep we call internally, not a competitor. Bigger build, much harder to replicate.