AI Cost Optimization — Research & Direction

Background: What We Built

We built an AI proxy / cost-optimization MVP and submitted it to YC.

  • YC result: Top 10% — but not selected or interviewed
  • What it did: API proxy layer to track business AI usage + semantic caching for local tools (Cursor, Claude Code)
  • Semantic caching keyed on local file paths was novel, but competitors could replicate it

Why YC Probably Passed

The core problem: no durable moat

  • We forked open-source repos and added a proxy layer
  • Good codebase, but gateway/proxy is a crowded layer
  • Any team can spin up the same stack in a weekend
  • YC wants to see "why can't the big cloud providers or OpenAI just do this?"

Competitive Landscape

API Gateway / Proxy Layer (crowded)

  • Portkey: open-source LLM gateway with routing, caching, and fallbacks (OSS + SaaS, well-funded)
  • Martian: ML-based model router that picks the cheapest model meeting a quality bar ($9M raised)
  • RouteLLM: open-source model routing from LMSYS research (OSS, widely adopted)
  • LiteLLM: open-source unified LLM API with cost tracking (very popular OSS)
  • OpenRouter: API marketplace for 100+ models with routing (growing fast)
  • Helicone: LLM observability + cost analytics (well-funded)

Conclusion: The API proxy/gateway layer is commoditizing. Hard to win here without massive distribution or being first.


The Real Opportunity: Go Deeper Than the API

The gateway is the wrong layer. The moat is at layers where systems knowledge matters — where you can't just fork a repo.

Direction 1: Agent Workflow Optimization ⭐

The problem: Multi-step agentic pipelines (LangGraph, CrewAI, AutoGen) are wasteful by design.

  • Each step calls the frontier model (GPT-4, Claude 3.5) even for trivial subtasks
  • No one is optimizing the pipeline — only individual API calls
  • A "write a report" agent might make 20 LLM calls — most could be routed to a $0.001 model

What we'd build:

  • An agent execution optimizer — sits between your agent framework and LLM APIs
  • Analyzes the DAG of agent tasks at runtime
  • Automatically routes each step to the cheapest model that can handle it
  • Learns over time: "this summarization step → always use Haiku, this code review step → needs Sonnet"
  • Caches intermediate agent states (not just prompts — full context snapshots)
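The learned-routing idea above can be sketched in a few lines. Everything here is illustrative: the model names, prices, quality bar, and `StepRouter` class are made-up placeholders, and a production policy would use richer signals than a per-step success rate:

```python
from collections import defaultdict

# Hypothetical per-1K-token prices, cheapest first.
MODELS = [("haiku", 0.001), ("sonnet", 0.015), ("frontier", 0.060)]

class StepRouter:
    """Learns, per step type, the cheapest model that meets a quality bar."""

    def __init__(self, quality_bar=0.9, min_trials=5):
        self.quality_bar = quality_bar
        self.min_trials = min_trials
        # stats[step_type][model] = [successes, trials]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def route(self, step_type):
        """Return the cheapest model whose observed success rate clears the bar."""
        for model, _price in MODELS:
            wins, trials = self.stats[step_type][model]
            if trials < self.min_trials:
                return model  # still exploring this tier
            if wins / trials >= self.quality_bar:
                return model
        return MODELS[-1][0]  # fall back to the frontier model

    def record(self, step_type, model, success):
        s = self.stats[step_type][model]
        s[0] += int(success)
        s[1] += 1

router = StepRouter()
# Simulate: "summarize" succeeds on the cheap model, "code_review" does not.
for _ in range(10):
    router.record("summarize", "haiku", True)
    router.record("code_review", "haiku", False)
    router.record("code_review", "sonnet", True)

print(router.route("summarize"))    # haiku is good enough
print(router.route("code_review"))  # escalates past haiku to sonnet
```

The point of the sketch: the routing table is learned from observed outcomes, which is why it improves with usage rather than being a static config a competitor could copy.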

Why this is defensible:

  • Requires deep understanding of agent execution patterns, not just HTTP proxying
  • The optimization model itself (learned routing policies) is the moat
  • Gets better with more usage — flywheel effect

Target user: Companies running agent pipelines in production (LangChain, AutoGen users) with $10k+/mo LLM bills

Revenue: % of savings (e.g., take 20% of the cost reduction we generate), or flat SaaS per seat
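The savings-share model is worth a worked example (numbers invented): a customer with a $20k/mo bill cut by 40% saves $8k, so at a 20% take rate we bill $1.6k.

```python
def monthly_fee(baseline_cost: float, optimized_cost: float,
                take_rate: float = 0.20) -> float:
    """Savings-share pricing: bill a fixed share of realized savings."""
    savings = max(baseline_cost - optimized_cost, 0.0)
    return take_rate * savings

# Hypothetical customer: $20k/mo bill cut by 40% -> $8k saved -> $1.6k fee.
print(monthly_fee(20_000, 12_000))  # 1600.0
```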

Hosung's edge: Building the scheduler/optimizer that dispatches tasks to right-sized models is exactly systems programming — task scheduling, resource allocation, performance optimization


Direction 2: Hybrid Local + Cloud Routing ⭐⭐

The problem: Edge/local inference (llama.cpp, Ollama, LM Studio) is good but nobody has cracked intelligent hybrid routing.

  • Local models: fast, private, $0 — but lower quality
  • Cloud models: expensive, slower — but higher quality
  • Right now: developers manually decide which calls go where
  • Nobody is doing dynamic local/cloud routing based on task complexity + latency budget

What we'd build:

  • A local inference daemon with intelligent routing
  • Runs on dev machines / company servers
  • Intercepts LLM calls, scores task complexity, routes to:
    • Local (Ollama/llama.cpp) for simple/private tasks
    • Cloud (Anthropic/OpenAI) for complex tasks that need it
  • Power-aware: Hosung's GPU power systems background — optimize inference for laptop battery life / thermal limits
  • Predictive prefetching: anticipate next agent steps, pre-load context
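The daemon's routing decision above can be sketched as a pure function of the prompt and system state. All thresholds, keywords, and field names here are invented for illustration; the real complexity scorer would be a small on-device model, not string heuristics, and the real implementation would be the C++ daemon:

```python
from dataclasses import dataclass

@dataclass
class SystemState:
    on_battery: bool
    gpu_temp_c: float
    latency_budget_ms: int

def complexity_score(prompt: str) -> float:
    """Stand-in scorer: a real daemon would run a small learned model."""
    score = min(len(prompt) / 2000, 1.0)  # longer prompts ~ harder
    if any(k in prompt.lower() for k in ("prove", "refactor", "architecture")):
        score = max(score, 0.8)           # crude keyword bump
    return score

def route(prompt: str, private: bool, state: SystemState) -> str:
    """Pick 'local' (Ollama/llama.cpp) or 'cloud' for a single call."""
    if private:
        return "local"                  # data must never leave the machine
    if state.latency_budget_ms < 300:
        return "local"                  # no time for a network round trip
    if state.on_battery or state.gpu_temp_c > 85.0:
        return "cloud"                  # spare the battery / thermal headroom
    if complexity_score(prompt) >= 0.5:
        return "cloud"                  # task likely needs a frontier model
    return "local"

cool = SystemState(on_battery=False, gpu_temp_c=60.0, latency_budget_ms=2000)
print(route("summarize this changelog", private=False, state=cool))    # local
print(route("refactor the auth module for testability", False, cool))  # cloud
```

Note how privacy and system state (battery, thermals, latency budget) gate the decision before task complexity does; that ordering is the part an HTTP-layer proxy can't see.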

Why this is defensible:

  • The local inference optimizer runs on hardware — needs deep systems knowledge (C++, power management, memory optimization)
  • This is NOT just an HTTP proxy — it's a daemon that sits at the OS level
  • Hosung literally worked on GPU power systems at Nvidia. This is his exact skillset applied to a new domain.
  • The optimization loop (thermal, memory, latency) can't be replicated by a web dev

Target user:

  • Enterprises with data privacy requirements (can't send data to cloud)
  • Power users paying $200+/mo in API costs
  • Companies running AI on edge devices

Revenue: Self-hosted license (enterprises pay $5k-50k/yr for on-prem), SaaS tier for individuals


Our Specific Unfair Advantage

  • Hosung (GPU power systems @ Nvidia): local inference is GPU-bound, so power/thermal optimization means better performance on constrained hardware
  • Hosung (C++ systems programming): the daemon that intercepts and routes calls must be low-latency and low-overhead; Python won't cut it
  • Angie (built B2B SaaS from 0→1): can ship the dashboard, billing, and onboarding — the enterprise integration surface
  • Angie (product design chops): cost-savings UX is complex; visualizing where money is going is a real product challenge
  • Working proxy already built: we have the codebase, we know the problem space, and we're not starting from zero

Why This Is Timely (2026)

  • Agentic AI explosion: Every company is now running multi-step agent pipelines. LLM costs are the #1 complaint.
  • Model proliferation: 100+ capable models now. The routing problem is real and growing.
  • Local inference maturation: llama.cpp, Ollama, and quantized models (Gemma 3, Llama 3.3) are now good enough for 70% of tasks
  • Enterprise AI cost pressure: Companies that went all-in on GPT-4 are now under budget pressure from finance teams
  • No clear winner yet: Martian raised $9M but is still API-layer only. No one owns the agent workflow layer yet.

Recommended Next Steps

Immediate (this week)

  1. Talk to 10 people running agent pipelines — what are their actual monthly LLM costs? What would they pay to cut them by 40%?
  2. Audit our existing MVP codebase — what parts are reusable for the agent optimizer direction?

Short-term (1 month)

  1. Build a prototype agent optimizer — pick one framework (LangGraph) and show 30%+ cost reduction on a real pipeline
  2. Measure baseline: Pick 3 common agent tasks, run them unoptimized vs. optimized, publish the numbers

Medium-term (3 months)

  1. Hosung builds the local routing daemon — C++ daemon that intercepts Ollama/llama.cpp calls and makes routing decisions
  2. Angie builds the analytics dashboard — show users where their costs are going, what's being routed where, savings to date

Key Questions to Answer

  • What is the actual LLM cost for a typical enterprise running LangGraph agents?
  • Is the pain point cost reduction or latency reduction or both?
  • Does Martian's approach work well? Why haven't they won yet?
  • Would companies trust a third-party daemon running on their inference stack?
  • What's the minimum accuracy loss acceptable when routing to cheaper models?

YC Resubmission Angle

Before: "We built an API proxy that tracks LLM costs and caches responses" → Weak. Sounds like a feature, not a company.

After (agent optimizer framing): "We cut LLM costs for agentic AI pipelines by 40-70% by optimizing which model handles each step in a multi-agent workflow — without writing any new prompts or changing existing code" → Stronger. Specific savings, specific mechanism, clear before/after, plug-and-play.

After (hybrid local+cloud framing): "We built the intelligent routing layer between local and cloud inference — companies keep sensitive data local and cut cloud costs 60%+, with zero code changes" → Enterprise angle. Privacy + cost. Defensible with Hosung's systems background.


Saved 2026-03-17 — from conversation research session


RouteLLM Analysis (added 2026-03-17)

Full deep dive: Startup/routellm-deep-dive.md

What it is: Open-source binary router from LMSYS / UC Berkeley (same team as Chatbot Arena). ICLR 2025. 4,700+ stars.

The mechanism: Trains a Matrix Factorization model on 80k Chatbot Arena preference samples. Scores each incoming prompt → routes to strong (GPT-4) or weak (Mixtral) model based on a tunable threshold α.
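The threshold mechanism is easy to picture. This is not RouteLLM's actual API — just a toy reproduction of the α knob: the trained router emits a score per prompt (roughly, the predicted probability the strong model is needed), and α sets the cost/quality trade. The scores below are made up:

```python
# Hypothetical router scores for six incoming prompts.
scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.95]

def binary_route(strong_needed_prob: float, alpha: float) -> str:
    """Route to the strong model only when the score clears the threshold."""
    return "strong" if strong_needed_prob >= alpha else "weak"

# Raising alpha sends fewer calls to the expensive model.
for alpha in (0.3, 0.5, 0.7):
    frac = sum(s >= alpha for s in scores) / len(scores)
    print(f"alpha={alpha}: {frac:.0%} of calls to strong")
```

Sweeping α like this is how the "95% quality at 14% GPT-4 calls" operating point gets chosen on a benchmark.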

Benchmark results:

  • 95% GPT-4 performance using only 14% GPT-4 calls on MT Bench
  • vs. Martian: same quality, >40% cheaper
  • Overhead: <40ms per request

Cross-model generalization: Works on Claude + Llama without retraining — learns query complexity, not model-specific patterns.

Key gaps (our differentiation):

  • Binary routing only: can't handle a pool of models (local + 3 cloud tiers)
  • No local inference: doesn't know Ollama/llama.cpp exist — treats everything as a remote API
  • No system-state awareness: ignores battery, GPU temp, memory pressure, monthly budget
  • No agent awareness: optimizes individual calls, not pipelines
  • Prompt leaks on scoring: the scoring step itself sends your prompt to the cloud — no privacy

Our strategic options:

  1. Build on top: Use their MF router as the complexity signal layer. Add local inference + system-state layer on top. Fast to build, borrows their academic credibility.
  2. Replace the cloud assumption: Local-first routing where the scoring model runs on-device (prompt never leaves the machine). Privacy as the product. Hosung's C++ daemon is the moat.
  3. Go above: Optimize agent execution graphs holistically — RouteLLM becomes a dep we call internally, not a competitor. Bigger build, much harder to replicate.