AI for us means workflows that demonstrably save money, not demos that wow a board meeting. We build retrieval-grounded assistants on the Anthropic Claude API and OpenAI, classification pipelines for unstructured documents, and evaluation harnesses so you can tell whether a model upgrade actually broke production. We measure cost-per-task, latency, and accuracy against a held-out set before we recommend rolling anything out.
Document ingestion (PDFs, Word, scanned receipts via Tesseract), chunking, embeddings via OpenAI or local models, and a retriever that returns source paragraphs alongside every answer. No hallucinated quotes — every assertion links to a source.
Claude or GPT-4o-mini for structured extraction from invoices, KYC documents, contracts. Output is typed JSON, validated against your schema, with a human-in-the-loop queue for low-confidence cases.
A held-out test set, golden outputs, and an automated harness that scores every prompt change before it ships. We don't push prompt changes to production without seeing the eval delta first.
Every workflow ships with a cost-per-task target and a p95 latency target. We instrument both in production and alert when either drifts. Most builds end up cheaper than the manual process by 40–80×.
Cheap models (Claude Haiku, GPT-4o-mini) for the bulk of traffic, escalation to bigger models only when the cheaper one returns low confidence. Saves 60–90% on inference cost without measurable quality loss.
We treat AI as a software engineering problem first and a research problem second. The hard parts are usually data plumbing, evaluation, and integration with the existing workflow — not the model choice. We build the evaluation harness in week one so we can measure progress.
We are deliberately conservative about deploying autonomous agents. We've seen too many demos break down when the model has to chain four tool calls on real-world inputs. We prefer well-scoped assistants that do one thing reliably and hand off to a human when they hit a known edge.
Everything we build runs through the Anthropic SDK or OpenAI SDK directly — no Langchain layer cake. We've found that a thin wrapper around the official SDKs is easier to debug at 2am than a chain of abstractions.
Model inference (Claude, OpenAI) billed at cost — we pass through the actual provider invoice. Typical RAG workflows run at $0.001–0.01 per query.
Support triage router — a hybrid LLM router that reads inbound customer-support emails, classifies intent, drafts a response, and routes 78% of the queue to auto-send. The remaining 22% goes to a human with a pre-drafted reply. Replaced a 3-person triage team.
Send us a description of the manual task and we'll come back with an honest read on whether AI helps, and what the eval would look like.