Maintenance & evolution

It ships, then it gets better.

The system you own gets measurably sharper every month - monitored, tuned, and upgraded on the loop real usage creates - instead of quietly decaying.

Scope this build ↳ Free agentic audit · zero obligation

Eval loop · 3,960 evals run

Measure: Golden set · LLM-judge (running)
Tune: Retrieval · prompts (queued)
Improve: Upgrade · extend (queued)
Complete: -

09:41:00 groundedness ↑ 0.94
09:40:59 prompt.tuned · v7
09:40:58 model.upgraded → newer
09:40:57 resolution ↑ +6pts

↳ Built on the stack that ships

Claude Code · Agent SDK · n8n · Railway · Vercel · Supabase

[ 01 ] What it is

The capability, defined.

The moat isn't the workflow you shipped on day one - it's the loop that real usage creates over time. We watch your systems in production, fix what drifts, fold in new model capabilities, and extend them as your business grows. The system you own keeps compounding instead of rotting.

Not that · ✓ this

Not a retainer that bills for nothing. Not set-and-forget that rots as models move. It's an eval-driven loop - measure, tune, prove - that compounds the system you already own, with the numbers reported, not asserted, and a clean handoff whenever you want it.

[ 02 ] The status quo

What this costs you today.

The system shipped, everyone moved on, and underneath it the world kept changing - so it's drifting and nobody can see it.

Models change beneath it and your data shifts, but the prompts and retrieval were tuned for a frozen moment.

Answers start drifting and nobody notices, because there's no eval on real traffic - the failures are invisible until something breaks loudly.

You're stuck on last year's model and capabilities, because nobody's tracking releases or testing upgrades safely.

'It's working' is asserted in a status email, with no metric behind it - so you're trusting, not verifying.

[ 03 ] What we build

The anatomy of the system.

The moat was never the day-one code - it's the loop that real usage creates over time. We operate that loop so the system compounds instead of rotting, and every improvement shows up on a chart.

Eval suite + golden datasets

A curated set of real inputs with known-good outputs acts as a deterministic gate - the regression test that proves a change made the system better, not just different.
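A minimal sketch of such a gate, assuming exact-match scoring and a hypothetical `GoldenCase` shape (real suites usually score with rubrics or a judge rather than string equality). The names `pass_rate` and `release_gate` are illustrative, not part of any library:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenCase:
    """One curated input with its known-good output (hypothetical shape)."""
    case_id: str
    prompt: str
    expected: str


def pass_rate(system: Callable[[str], str], cases: List[GoldenCase]) -> float:
    """Fraction of golden cases where the system output matches exactly."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for case in cases
        if system(case.prompt).strip() == case.expected.strip()
    )
    return passed / len(cases)


def release_gate(baseline_rate: float, candidate_rate: float,
                 tolerance: float = 0.0) -> bool:
    """Allow a change only if it does not fall below the baseline pass rate."""
    return candidate_rate >= baseline_rate - tolerance
```

The gate compares a candidate against the current baseline on the same cases, which is what separates "better" from "just different".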

Observability + LLM-as-judge

Production traces are sampled continuously and scored on groundedness, resolution, latency, and cost - human review where it counts, an LLM-as-judge where it doesn't scale.
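One way the sampling-and-scoring step could look, as a sketch. The trace shape, the `review_below` cutoff, and `toy_judge` are all assumptions: `toy_judge` is a word-overlap stand-in for a real LLM-as-judge call, used here only so the example runs without a model:

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple

Trace = Dict[str, str]            # assumed keys: "context", "answer"
Judge = Callable[[Trace], float]  # returns a groundedness score in [0, 1]


def sample_traces(traces: List[Trace], k: int,
                  seed: Optional[int] = None) -> List[Trace]:
    """Draw up to k traces uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def score_sample(traces: List[Trace], judge: Judge,
                 review_below: float = 0.5) -> Tuple[float, List[Trace]]:
    """Return the mean score and the traces that should go to human review."""
    if not traces:
        raise ValueError("no traces to score")
    scored = [(trace, judge(trace)) for trace in traces]
    flagged = [trace for trace, score in scored if score < review_below]
    return mean(score for _, score in scored), flagged


def toy_judge(trace: Trace) -> float:
    """Stand-in judge: share of answer words that also appear in the context."""
    answer_words = set(trace["answer"].lower().split())
    context_words = set(trace["context"].lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)
```

Low-scoring traces are routed to a person, so human review is spent where the judge is least confident.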

Prompt + retrieval tuning

When a number slips we fix the cause - prompts, chunking, the re-ranker, agent behavior - and re-run the evals to prove it moved, instead of guessing in the dark.

Model upgrades

When a sharper or cheaper model ships, we evaluate it against your suite and migrate you safely - so you ride the frontier without tracking releases or risking a silent regression.

Drift detection

A weekly job re-runs the agent on recent production inputs and compares outputs, tool-call patterns, cost, and latency to baseline - alerting when anything moves a few points week over week.

Monthly reporting

Your numbers, your dashboard - groundedness up, resolution up, cost per task down, regressions caught - improvement you can see, not a 'trust us' email.

[ 04 ] How it works

Engineered, not prompted.

We operate on the loop, not the launch - using the observability we build into every system on Claude Code, the Claude Agent SDK, n8n, Railway, Vercel, Cloudflare, and Supabase.

Measure

We score the system on real production traffic - groundedness, resolution rate, latency, cost per task - against golden datasets, using LLM-as-judge where human review doesn't scale. Bessemer's observation: most AI failures are invisible. We make yours visible.

Tune

Where the numbers slip, we fix the cause - retrieval, prompts, agent behavior, the model itself - and re-run the evals to prove it moved. Improvement you can see on a chart, not a claim in a status email.

Improve

When a sharper model ships, we evaluate it against your suite and upgrade you safely; as your business changes, we extend the system onto the foundation you own. It compounds instead of decaying.

How we engineer it

Learn from production

Real usage shows where the system is weak. We use the logs and evals to tune retrieval, prompts, and agent behavior - so it gets measurably better month over month.

Ride the frontier

When a sharper model ships, we evaluate it and upgrade you safely - so your system improves as the underlying models do, without you tracking releases.

Extend, don't rebuild

As your business changes, we add workflows and agents onto the foundation you already own - compounding the system instead of starting over.

Hand off cleanly

Want to bring it in-house? We document and transfer it so your team can run it. No lock-in, no hostage situation.

[ 05 ] Example builds

What this looks like in the wild.

Always-on operations

We run point on your live agents and workflows - catching drift, fixing breaks, and keeping outcomes steady - while your team focuses on the business.

Continuous improvement loop

A monthly cycle of measurement and tuning that turns a good system into a great one, grounded in your real production data and proven on golden datasets.

Safe model migration

When the frontier moves, we benchmark the new model against your eval suite and roll it out behind a flag - capability and cost gains without the regression risk.

Scale-up builds

New agents and workflows added onto the foundation you already own as you grow - the engineering team that scales with you, not a one-off project that ages out.

[ 06 ] By the numbers

The reliability that ships.

~3%

The week-over-week movement in groundedness, cost, or tool-call patterns that should trigger an alert - the drift threshold that catches decay before users feel it.

Golden + sampled

The 2026-standard eval mix - a lean golden dataset as a deterministic release gate, plus random production sampling to surface the new failures the golden set can't predict.

Measured, not asserted

How 'it got better' should be proven - against historical baselines on real traffic - since LLM calls are non-deterministic and a single passing run proves nothing.

↳ Industry benchmarks and engineering standards, not Anfloy client metrics - we report your real numbers once you're live.

[ 07 ] The stack

Named tools, and why.

The model is fungible - the system is the moat. Here's what we build it on, and the reason each earns its place.

Langfuse / LangSmith

Continuous tracing, eval datasets, and LLM-as-judge scoring on live traffic - the observability backbone that makes drift and quality visible.

Braintrust

Prompt-centric experiments and regression evals - so a prompt or model change is benchmarked against your golden dataset before it ships, not after it breaks.

Ragas

Groundedness, faithfulness, and context-precision scoring for the RAG layer - the metrics that tell you whether retrieval quality is decaying.

OpenTelemetry (GenAI)

Vendor-neutral instrumentation so traces and evals stay portable - you can swap or stack observability tools without re-wiring the system.

Claude (Anthropic API)

The frontier you ride - new model versions evaluated against your suite and migrated in safely, so capability rises without a silent regression.

Deepchecks / drift jobs

Scheduled regression and drift checks that compare current outputs to historical baselines - catching degradation automatically instead of waiting for a complaint.

CI + scheduled evals

Evals run on every change and on a cron, so improvement is gated and continuous - the loop runs whether or not anyone's watching.

[ 08 ] The architectural difference

Why not just set it and forget it?

A system shipped and left alone doesn't hold steady - it decays. Models change underneath it, your data shifts, edge cases surface, and the failures are invisible until something breaks loudly. A maintained system runs the opposite way: real usage feeds an eval loop that makes it measurably sharper every month. The moat was never day-one code; it's the loop.

| Dimension | Set-and-forget | Anfloy custom |
| --- | --- | --- |
| Over time | Quietly decays as the world moves. | Compounds - sharper every month on the loop. |
| Failures | Invisible until something breaks loudly. | Caught on evals against golden datasets, on real traffic. |
| New models | You're stuck on whatever shipped that day. | Evaluated and upgraded safely as the frontier moves. |
| Drift | Answers degrade and nobody notices. | Groundedness and resolution tracked continuously. |
| Growth | A one-off project that ages out. | Extended onto the foundation you already own. |
| Proof | 'Trust us, it's working.' | Metrics reported, not asserted - your numbers, your dashboard. |

[ 09 ] Who it's for

The honest fit check.

Build this if

You're running an AI system in production - ours or someone else's - and you want it to keep improving and stay reliable as models and your business change, with the metrics to prove it and no obligation to keep us forever.

Skip it if

Your system is genuinely static, low-stakes, and rarely touched. A light monitoring setup may be all you need rather than an active loop - we'll set that up and step back. Also skip it if you have an in-house ML team already running evals and upgrades: you don't need us to duplicate it.

[ 10 ] Questions

The honest answers.

Q01

Isn't an AI system just 'done' once it's built?

No - that's the single most common failure. Models change underneath it, your data and edge cases shift, and without maintenance a system quietly decays while everyone assumes it's fine. The compounding value comes from the loop of real usage feeding evals and tuning over time - that's the moat, not the day-one code. A maintained system gets measurably sharper every month; an abandoned one drifts until something breaks loudly in front of a customer.

Q02

Are we locked into paying you forever?

No. Maintenance is a choice, not a trap. The system is yours and runs without us - we offer ongoing operation because the eval loop keeps it improving, not because you're stuck. Whenever you'd rather run it in-house, we document and transfer the whole thing - the evals, the dashboards, the runbooks - so your team can take the loop and keep it going. No hostage situation, no platform you can't leave.

Q03

What does 'getting better' actually mean - and how do you prove it?

Measurable improvement on your numbers: higher groundedness and resolution rates, fewer escalations, faster runs, lower cost per task. We prove it by re-running changes against a golden dataset and comparing to historical baselines on real traffic - because LLM calls are non-deterministic, we look at distributions, not a single lucky run. Every tune is gated by evals before it ships, and the results land on a dashboard you can see, reported monthly, not asserted in an email.
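To make "distributions, not a single lucky run" concrete, here is one possible approach: a percentile bootstrap on the difference in mean score between baseline runs and candidate runs. The bootstrap is our choice for illustration, not the only valid test, and the function names and defaults are assumptions:

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_diff_ci(baseline: List[float], candidate: List[float],
                      n_resamples: int = 2000, alpha: float = 0.05,
                      seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(candidate) - mean(baseline)."""
    if not baseline or not candidate:
        raise ValueError("both score lists must be non-empty")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        diffs.append(mean(c) - mean(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def is_improvement(baseline: List[float], candidate: List[float]) -> bool:
    """Count a change as better only if the whole interval is above zero."""
    lower, _ = bootstrap_diff_ci(baseline, candidate)
    return lower > 0
```

Each list holds per-run or per-case scores from repeated evals, so one unusually good run cannot carry the result on its own.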

Q04

Do we still own everything, and does it run on our infra?

Yes to both. The system, the eval suite, the golden datasets, and the dashboards all live on your infrastructure and accounts - we operate the loop on top of what you own, we don't take it hostage. Nothing routes through an Anfloy platform, your data stays in your perimeter, and the moment you want to bring it in-house, it's already all there for your team to run.

Q05

What happens when a new model comes out - do you just swap it in?

Never blindly - an upgrade is a regression risk until proven otherwise. When a sharper or cheaper model ships, we benchmark it against your golden dataset and production traces, check for behavior and cost changes, and roll it out behind a flag with the old version one click away. You get the capability and cost gains of riding the frontier without tracking releases yourself or risking a silent quality drop - the upgrade is proven on your evals before it touches a real user.
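A flagged rollout can be sketched as stable, hash-based bucketing, where a single fraction controls how much traffic sees the new model. This is an illustration under assumptions: the model names, the salt, and the `choose_model` helper are hypothetical, and a production setup would typically read the fraction from a feature-flag service:

```python
import hashlib


def bucket(user_id: str, salt: str = "model-upgrade") -> float:
    """Map a user id to a stable value in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x1_0000_0000


def choose_model(user_id: str, rollout_fraction: float,
                 old_model: str = "model-old",
                 new_model: str = "model-new") -> str:
    """Route a stable share of users to the new model.

    Setting rollout_fraction back to 0.0 is the rollback: every user
    returns to the old model without a deploy.
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be between 0 and 1")
    return new_model if bucket(user_id) < rollout_fraction else old_model
```

Because the bucket depends only on the user id and salt, a given user keeps seeing the same model as the fraction grows, which keeps before-and-after comparisons clean.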

Q06

How do you catch problems before our users do?

Continuous drift detection. A scheduled job samples recent production traces, re-runs the system on those inputs, and compares outputs, tool-call patterns, cost, and latency against baseline - alerting when any metric moves a few points week over week. Groundedness and resolution are scored on live traffic with an LLM-as-judge, so degradation surfaces on a chart, not in a complaint. The whole point of the loop is that the failure is caught and measured before it's customer-facing.
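The comparison step of that job might look like the sketch below. It assumes the weekly metrics are already aggregated, treats "a few points" as a 3% relative change, and uses a hypothetical `WeeklyMetrics` shape; tool-call pattern comparison is left out for brevity:

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class WeeklyMetrics:
    """Aggregated scores for one week of sampled traffic (assumed fields)."""
    groundedness: float
    resolution_rate: float
    cost_per_task: float
    p95_latency_s: float


def drift_alerts(baseline: WeeklyMetrics, current: WeeklyMetrics,
                 threshold: float = 0.03) -> Dict[str, float]:
    """Return {metric: relative change} for metrics that moved past threshold."""
    current_values = asdict(current)
    alerts: Dict[str, float] = {}
    for name, base_value in asdict(baseline).items():
        if base_value == 0:
            continue  # relative change is undefined; handle such metrics separately
        change = (current_values[name] - base_value) / base_value
        if abs(change) > threshold:
            alerts[name] = change
    return alerts
```

Movement in either direction raises an alert, since an unexplained jump in a score can signal a broken judge as easily as a real gain.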

[ 11 ] Keep going up the ladder

Automate · Workflow Automation: Deterministic workflows with LLM steps - wired into your stack.
Build · Autonomous Agents: Autonomous agents that decide and act, not just answer.
Build · Company Intelligence: Multi-agent systems and knowledge brains grounded in your data.

Let's scope your maintenance & evolution.