AI Cost Optimization — Research & Direction

Background: What We Built

We built an AI proxy / cost-optimization MVP and submitted it to YC.

  • YC result: Top 10% — but not selected or interviewed
  • What it did: API proxy layer to track business AI usage + semantic caching for local tools (Cursor, Claude Code)
  • Semantic caching keyed on local file paths was novel, but competitors could replicate it

Why YC Probably Passed

The core problem: no durable moat

  • We forked open-source repos and added a proxy layer
  • Good codebase, but gateway/proxy is a crowded layer
  • Any team can spin up the same stack in a weekend
  • YC wants to see "why can't the big cloud providers or OpenAI just do this?"

Competitive Landscape

API Gateway / Proxy Layer (crowded)

  • Portkey: open-source LLM gateway with routing, caching, and fallbacks (OSS + SaaS, well-funded)
  • Martian: ML-based model router that picks the cheapest model meeting a quality bar ($9M raised)
  • RouteLLM: open-source model routing from LMSYS research (OSS, widely adopted)
  • LiteLLM: open-source unified LLM API with cost tracking (very popular OSS)
  • OpenRouter: API marketplace for 100+ models with routing (growing fast)
  • Helicone: LLM observability + cost analytics (well-funded)

Conclusion: The API proxy/gateway layer is commoditizing. Hard to win here without massive distribution or being first.


The Real Opportunity: Go Deeper Than the API

The gateway is the wrong layer. The moat is at layers where systems knowledge matters — where you can't just fork a repo.

Direction 1: Agent Workflow Optimization ⭐

The problem: Multi-step agentic pipelines (LangGraph, CrewAI, AutoGen) are wasteful by design.

  • Each step calls the frontier model (GPT-4, Claude 3.5) even for trivial subtasks
  • No one is optimizing the pipeline — only individual API calls
  • A "write a report" agent might make 20 LLM calls — most could be routed to a $0.001 model

What we'd build:

  • An agent execution optimizer — sits between your agent framework and LLM APIs
  • Analyzes the DAG of agent tasks at runtime
  • Automatically routes each step to the cheapest model that can handle it
  • Learns over time: "this summarization step → always use Haiku, this code review step → needs Sonnet"
  • Caches intermediate agent states (not just prompts — full context snapshots)
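The learned-routing idea above can be sketched in a few lines. Everything here is illustrative: the model names, prices, quality bar, and `StepRouter` class are made-up placeholders, and a production policy would use richer signals than a per-step success rate:

```python
from collections import defaultdict

# Hypothetical per-1K-token prices, cheapest first.
MODELS = [("haiku", 0.001), ("sonnet", 0.015), ("frontier", 0.060)]

class StepRouter:
    """Learns, per step type, the cheapest model that meets a quality bar."""

    def __init__(self, quality_bar=0.9, min_trials=5):
        self.quality_bar = quality_bar
        self.min_trials = min_trials
        # stats[step_type][model] = [successes, trials]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def route(self, step_type):
        """Return the cheapest model whose observed success rate clears the bar."""
        for model, _price in MODELS:
            wins, trials = self.stats[step_type][model]
            if trials < self.min_trials:
                return model  # still exploring this tier
            if wins / trials >= self.quality_bar:
                return model
        return MODELS[-1][0]  # fall back to the frontier model

    def record(self, step_type, model, success):
        s = self.stats[step_type][model]
        s[0] += int(success)
        s[1] += 1

router = StepRouter()
# Simulate: "summarize" succeeds on the cheap model, "code_review" does not.
for _ in range(10):
    router.record("summarize", "haiku", True)
    router.record("code_review", "haiku", False)
    router.record("code_review", "sonnet", True)

print(router.route("summarize"))    # haiku is good enough
print(router.route("code_review"))  # escalates past haiku to sonnet
```

The point of the sketch: the routing table is learned from observed outcomes, which is why it improves with usage rather than being a static config a competitor could copy.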

Why this is defensible:

  • Requires deep understanding of agent execution patterns, not just HTTP proxying
  • The optimization model itself (learned routing policies) is the moat
  • Gets better with more usage — flywheel effect

Target user: Companies running agent pipelines in production (LangChain, AutoGen users) with $10k+/mo LLM bills

Revenue: % of savings (e.g., take 20% of the cost reduction we generate), or flat SaaS per seat
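The savings-share model is worth a worked example (numbers invented): a customer with a $20k/mo bill cut by 40% saves $8k, so at a 20% take rate we bill $1.6k.

```python
def monthly_fee(baseline_cost: float, optimized_cost: float,
                take_rate: float = 0.20) -> float:
    """Savings-share pricing: bill a fixed share of realized savings."""
    savings = max(baseline_cost - optimized_cost, 0.0)
    return take_rate * savings

# Hypothetical customer: $20k/mo bill cut by 40% -> $8k saved -> $1.6k fee.
print(monthly_fee(20_000, 12_000))  # 1600.0
```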

Hosung's edge: Building the scheduler/optimizer that dispatches tasks to right-sized models is exactly systems programming — task scheduling, resource allocation, performance optimization


Direction 2: Hybrid Local + Cloud Routing ⭐⭐

The problem: Edge/local inference (llama.cpp, Ollama, LM Studio) is good but nobody has cracked intelligent hybrid routing.

  • Local models: fast, private, $0 — but lower quality
  • Cloud models: expensive, slower — but higher quality
  • Right now: developers manually decide which calls go where
  • Nobody is doing dynamic local/cloud routing based on task complexity + latency budget

What we'd build:

  • A local inference daemon with intelligent routing
  • Runs on dev machines / company servers
  • Intercepts LLM calls, scores task complexity, routes to:
    • Local (Ollama/llama.cpp) for simple/private tasks
    • Cloud (Anthropic/OpenAI) for complex tasks that need it
  • Power-aware: Hosung's GPU power systems background — optimize inference for laptop battery life / thermal limits
  • Predictive prefetching: anticipate next agent steps, pre-load context
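The daemon's routing decision above can be sketched as a pure function of the prompt and system state. All thresholds, keywords, and field names here are invented for illustration; the real complexity scorer would be a small on-device model, not string heuristics, and the real implementation would be the C++ daemon:

```python
from dataclasses import dataclass

@dataclass
class SystemState:
    on_battery: bool
    gpu_temp_c: float
    latency_budget_ms: int

def complexity_score(prompt: str) -> float:
    """Stand-in scorer: a real daemon would run a small learned model."""
    score = min(len(prompt) / 2000, 1.0)  # longer prompts ~ harder
    if any(k in prompt.lower() for k in ("prove", "refactor", "architecture")):
        score = max(score, 0.8)           # crude keyword bump
    return score

def route(prompt: str, private: bool, state: SystemState) -> str:
    """Pick 'local' (Ollama/llama.cpp) or 'cloud' for a single call."""
    if private:
        return "local"                  # data must never leave the machine
    if state.latency_budget_ms < 300:
        return "local"                  # no time for a network round trip
    if state.on_battery or state.gpu_temp_c > 85.0:
        return "cloud"                  # spare the battery / thermal headroom
    if complexity_score(prompt) >= 0.5:
        return "cloud"                  # task likely needs a frontier model
    return "local"

cool = SystemState(on_battery=False, gpu_temp_c=60.0, latency_budget_ms=2000)
print(route("summarize this changelog", private=False, state=cool))    # local
print(route("refactor the auth module for testability", False, cool))  # cloud
```

Note how privacy and system state (battery, thermals, latency budget) gate the decision before task complexity does; that ordering is the part an HTTP-layer proxy can't see.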

Why this is defensible:

  • The local inference optimizer runs on hardware — needs deep systems knowledge (C++, power management, memory optimization)
  • This is NOT just an HTTP proxy — it's a daemon that sits at the OS level
  • Hosung literally worked on GPU power systems at Nvidia. This is his exact skillset applied to a new domain.
  • The optimization loop (thermal, memory, latency) can't be replicated by a web dev

Target user:

  • Enterprises with data privacy requirements (can't send data to cloud)
  • Power users paying $200+/mo in API costs
  • Companies running AI on edge devices

Revenue: Self-hosted license (enterprises pay $5k-50k/yr for on-prem), SaaS tier for individuals


Our Specific Unfair Advantage

  • Hosung (GPU power systems @ Nvidia): local inference is GPU-bound, so power/thermal optimization means better performance on constrained hardware
  • Hosung (C++ systems programming): the daemon that intercepts and routes calls must be low-latency and low-overhead; Python won't cut it
  • Angie (built B2B SaaS from 0→1): can ship the dashboard, billing, and onboarding — the enterprise integration surface
  • Angie (product design chops): cost-savings UX is complex; visualizing where money is going is a real product challenge
  • Working proxy already built: we have the codebase, we know the problem space, and we're not starting from zero

Why This Is Timely (2026)

  • Agentic AI explosion: Every company is now running multi-step agent pipelines. LLM costs are the #1 complaint.
  • Model proliferation: 100+ capable models now. The routing problem is real and growing.
  • Local inference maturation: llama.cpp, Ollama, and quantized models (Gemma 3, Llama 3.3) are now good enough for 70% of tasks
  • Enterprise AI cost pressure: Companies that went all-in on GPT-4 are now under budget pressure from finance teams
  • No clear winner yet: Martian raised $9M but is still API-layer only. No one owns the agent workflow layer yet.

Recommended Next Steps

Immediate (this week)

  1. Talk to 10 people running agent pipelines — what are their actual monthly LLM costs? What would they pay to cut them by 40%?
  2. Audit our existing MVP codebase — what parts are reusable for the agent optimizer direction?

Short-term (1 month)

  1. Build a prototype agent optimizer — pick one framework (LangGraph) and show 30%+ cost reduction on a real pipeline
  2. Measure baseline: Pick 3 common agent tasks, run them unoptimized vs. optimized, publish the numbers

Medium-term (3 months)

  1. Hosung builds the local routing daemon — C++ daemon that intercepts Ollama/llama.cpp calls and makes routing decisions
  2. Angie builds the analytics dashboard — show users where their costs are going, what's being routed where, savings to date

Key Questions to Answer

  • What is the actual LLM cost for a typical enterprise running LangGraph agents?
  • Is the pain point cost reduction or latency reduction or both?
  • Does Martian's approach work well? Why haven't they won yet?
  • Would companies trust a third-party daemon running on their inference stack?
  • What's the minimum accuracy loss acceptable when routing to cheaper models?

YC Resubmission Angle

Before: "We built an API proxy that tracks LLM costs and caches responses" → Weak. Sounds like a feature, not a company.

After (agent optimizer framing): "We cut LLM costs for agentic AI pipelines by 40-70% by optimizing which model handles each step in a multi-agent workflow — without writing any new prompts or changing existing code" → Stronger. Specific savings, specific mechanism, clear before/after, plug-and-play.

After (hybrid local+cloud framing): "We built the intelligent routing layer between local and cloud inference — companies keep sensitive data local and cut cloud costs 60%+, with zero code changes" → Enterprise angle. Privacy + cost. Defensible with Hosung's systems background.


Saved 2026-03-17 — from conversation research session


RouteLLM Analysis (added 2026-03-17)

Full deep dive: Startup/routellm-deep-dive.md

What it is: Open-source binary router from LMSYS / UC Berkeley (same team as Chatbot Arena). ICLR 2025. 4,700+ stars.

The mechanism: Trains a Matrix Factorization model on 80k Chatbot Arena preference samples. Scores each incoming prompt → routes to strong (GPT-4) or weak (Mixtral) model based on a tunable threshold α.
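The threshold mechanism is easy to picture. This is not RouteLLM's actual API — just a toy reproduction of the α knob: the trained router emits a score per prompt (roughly, the predicted probability the strong model is needed), and α sets the cost/quality trade. The scores below are made up:

```python
# Hypothetical router scores for six incoming prompts.
scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.95]

def binary_route(strong_needed_prob: float, alpha: float) -> str:
    """Route to the strong model only when the score clears the threshold."""
    return "strong" if strong_needed_prob >= alpha else "weak"

# Raising alpha sends fewer calls to the expensive model.
for alpha in (0.3, 0.5, 0.7):
    frac = sum(s >= alpha for s in scores) / len(scores)
    print(f"alpha={alpha}: {frac:.0%} of calls to strong")
```

Sweeping α like this is how the "95% quality at 14% GPT-4 calls" operating point gets chosen on a benchmark.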

Benchmark results:

  • 95% GPT-4 performance using only 14% GPT-4 calls on MT Bench
  • vs. Martian: same quality, >40% cheaper
  • Overhead: <40ms per request

Cross-model generalization: Works on Claude + Llama without retraining — learns query complexity, not model-specific patterns.

Key gaps (our differentiation):

  • Binary routing only: can't handle a pool of models (local + 3 cloud tiers)
  • No local inference: doesn't know Ollama/llama.cpp exist — treats everything as a remote API
  • No system-state awareness: ignores battery, GPU temp, memory pressure, monthly budget
  • No agent awareness: optimizes individual calls, not pipelines
  • Prompt leaks on scoring: the scoring step itself sends your prompt to the cloud — no privacy

Our strategic options:

  1. Build on top: Use their MF router as the complexity signal layer. Add local inference + system-state layer on top. Fast to build, borrows their academic credibility.
  2. Replace the cloud assumption: Local-first routing where the scoring model runs on-device (prompt never leaves the machine). Privacy as the product. Hosung's C++ daemon is the moat.
  3. Go above: Optimize agent execution graphs holistically — RouteLLM becomes a dep we call internally, not a competitor. Bigger build, much harder to replicate.