Umarfarook Gurramkonda
I build production LLM systems. Multi-stage agents, retrieval pipelines, natural-language-to-SQL over warehouses, and the eval harnesses that keep them honest.
what i actually do
Five lanes, one focus
LLM Orchestration
Multi-stage agent flows with routing, intent, retrieval, composition. Streaming responses, structured outputs, fallback chains across Claude, Gemini, OpenAI.
Retrieval & RAG
Hybrid retrieval (BM25 + vector + reranker), chunking strategies, metadata filtering, multimodal RAG. Tuning that survives real document corpora.
Natural Language to SQL
Schema discovery, synonym matching, cost-capped query generation over BigQuery and Postgres. With a real eval harness, not vibes.
Voice & Realtime Agents
Full-duplex voice agents on top of streaming speech models. Latency engineering, interruption handling, turn-taking that feels human.
Evals & Observability
Eval harnesses, regression suites, cost dashboards. The unsexy work that separates a prototype from a system you can defend.
now shipping
Building in public
Five OSS projects across the lanes I care about. Each ships with evals, a public URL, and a write-up on the tradeoffs. Live status below.
BigQuery NL2SQL MCP Server
Shipping Soon
Query BigQuery in natural language from Claude Desktop, Cursor, and Claude Code. Schema discovery, cost caps, query explanation, safety guardrails.
MCP is an underserved lane. NL2SQL over warehouses is something I ship at work. First project.
NL2SQL Eval Framework + Public Leaderboard
Shipping Soon
Open benchmark of Claude, GPT, Gemini, Llama on real BigQuery-style schemas. Live leaderboard updated as new models drop.
Evals are the most underserved skill in production AI. A live leaderboard is also a content engine.
Voice Mock Interview Coach
Shipping Soon
Real-time full-duplex voice agent that runs mock AI engineer interviews and gives feedback. Latency-tuned for conversational feel.
Voice is visually impressive and rare. Solves a real pain (mine, and every job seeker's).
Personal AI Research Assistant
Shipping Soon
Ingest arXiv papers, blogs, and PDFs into a personal RAG. Weekly digest, semantic search across your library, local-first option via Ollama.
I need it. Tools you actually use end up well-built.
prod-llm-starter
Shipping Soon
Opinionated production template for LLM apps. FastAPI + LangGraph + Pydantic + Postgres/pgvector + eval harness + cost dashboard + auth + GitHub Actions. The thing every AI engineer wishes existed on day one.
Flagship. Utility repos compound. Forces deep understanding of every choice.
the track record
Where I've been
AI Engineer
Building production LLM systems for D2C trend prediction. Multi-stage orchestration with routing/retrieval/composition, NL-to-SQL over BigQuery with cost guardrails, RAG pipelines with sentence-transformers, deployment on GCP Cloud Run with full observability.
Freelance ML / AI Engineer
Built an AI-powered inventory system for a retail client. LLM-based invoice extraction, demand forecasting with scikit-learn, real-time stock alerts, and a dashboard that visualizes the resulting insights.
Backend Developer Intern
Built the REST backend for an event-management web app. Also contributed to an internal LLM-based healthcare assistant, integrating RAG over clinical documents and adding guardrails.
B.Tech, Computer Science
Graduated with CGPA 8.14 / 10.
how i think
Engineering principles
Coding is the easy part. Building the right system for a problem that keeps shifting is where the work actually lives.
Tradeoffs over tools
Pick by constraint, not hype. Postgres + pgvector beats a managed vector DB until it doesn't. Knowing when each breaks is the actual skill.
Evals before scale
If you cannot measure it, you cannot improve it. A bad eval beats no eval. A good eval beats opinions in standups.
Data quality over model swapping
A new model rarely fixes bad inputs. Time spent on retrieval quality, prompt structure, and labeled failures pays compounding interest.
Infrastructure is the product
Latency, cost, and reliability are features users feel. The model is one component of a system that has to stay up.
Ship narrow, then expand
One user, one workflow, working end-to-end. A tiny system that ships beats a grand system that demos.
AI pair programming with judgment
I use Claude Code, Cursor, and copilots aggressively. Then I reason through every architectural choice myself. Tools speed up typing; judgment doesn't delegate.
the toolkit
What I work with
Tools I use day to day. Not a list of every framework I've heard of.
LLMs & GenAI
Backend
Data
Cloud & Infra
DevOps & Observability
Frontend
say hello
Let's build something.
I'm open to roles in production AI, especially teams shipping LLM systems, RAG, agents, or NL interfaces over data. Remote or San Francisco. Quick replies.