LLMOps you can maintain: traces, evals, guardrails, and cost budgets in one architecture

A pragmatic, vendor-agnostic LLMOps architecture that ties traces, evaluations, guardrails, and hard cost budgets to releases - so you can ship safely and keep control.

LLMs are only useful when you can operate them: observe behaviour, measure quality, control risk, and cap spend. This article lays out a minimal, durable LLMOps architecture you can run with a small team: Traces → Evals → Guardrails → Budgets, wired into CI/CD and your app.

Why this architecture

Principles

  • Prefer simple, observable components over complex orchestration.
  • Keep one path to prod: the same code path serves human traffic and automated tests.
  • Prove value with evals before exposing features to real users.
  • Fail closed with guardrails and cap spend with hard budgets.

System overview

flowchart LR
  A[App] --> B[Tracing Middleware]
  B --> C["Guardrails (schema, policy, filters)"]
  C --> D["LLM Provider(s)"]
  B --> E[Trace Store + Cost Meter]
  E --> F["Evals Runner (CI + nightly)"]
  F --> G[Release Gate]
  E --> H["Budget Enforcer (quotas, alerts)"]

What this gives you

  • End-to-end visibility (inputs, outputs, tokens, latency, user, version).
  • Reproducible quality signals (eval scores) tied to a commit.
  • Runtime safety (schema validation, policy filters, PII scrubs).
  • Spend control (per-model, per-route, per-user budgets).

Traces: capture everything the model did

Instrument every model call once at your HTTP/SDK boundary. Emit OpenTelemetry spans (or structured JSON Lines) with inputs/outputs, token counts, costs, and calling context.

// express/fastify-style middleware (TypeScript)
import { v4 as uuid } from 'uuid'
import { SpanStatusCode } from '@opentelemetry/api'
import { tracer } from './otel'
import { price } from './pricing' // maps model+tokens -> $
import { callLLM } from './providers'

export async function llmMiddleware(req, res) {
  const span = tracer.startSpan('llm.call', {
    attributes: {
      route: req.route?.path ?? req.path,
      model: req.body.model,
      version: process.env.APP_SHA,
      user_id: req.user?.id ?? 'anon',
      request_id: uuid(),
    },
  })
  const started = Date.now()

  try {
    const result = await callLLM(req.body)
    const tokensIn = result.usage.prompt_tokens
    const tokensOut = result.usage.completion_tokens
    const cost = price(req.body.model, tokensIn, tokensOut)

    span.setAttributes({ tokensIn, tokensOut, cost_usd: cost, latency_ms: Date.now() - started })
    span.end()
    res.json({ ...result, meta: { cost_usd: cost } })
  } catch (err) {
    span.recordException(err)
    span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) })
    span.end()
    throw err
  }
}

What to include in every trace

  • User/session, route, model+version, prompt template name, inputs (redacted), outputs, tokens, latency, cost, and a stable eval_id when running tests; a minimal record shape is sketched below.
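
A minimal sketch of that record as a TypeScript type; the field names are illustrative and should match whatever your trace store expects.

// Illustrative trace record shape (align names with your store)
export interface LlmTrace {
  requestId: string
  ts: string             // ISO-8601 timestamp
  route: string
  model: string
  appSha: string         // git SHA of the release that served the call
  promptName: string     // versioned prompt template identifier
  userId: string
  inputRedacted: unknown // inputs after PII/secret scrubbing
  output: unknown
  tokensIn: number
  tokensOut: number
  latencyMs: number
  costUsd: number
  evalId?: string        // set only when the call is part of an eval run
}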

Evals: gate releases with automated checks

Create a small, high-signal eval set per feature (10–50 cases). Run it in CI (fast subset) and nightly (full). Block deploys when scores regress or cost blows past thresholds. These release gates are how you earn autonomy one level at a time.

# eval_runner.py
from scipy.stats import wilcoxon
from evals import faithfulness, relevance, exact_match

def run_eval(model, dataset):
    scores = []
    for case in dataset:
        out = model(case.input)
        scores.append({
          "faith": faithfulness(out, case.refs),
          "rel": relevance(out, case.refs),
          "em": exact_match(out, case.refs),
          "cost": out.meta["cost_usd"]
        })
    return scores

baseline = load_baseline("support_bot")   # per-case scores from the previous release
current  = run_eval(model, dataset())     # model: callable wrapping the release under test

# One-sided paired test: a small p-value means faithfulness dropped vs. the baseline
faith_now  = [s["faith"] for s in current]
faith_base = [s["faith"] for s in baseline]
assert wilcoxon(faith_now, faith_base, alternative="less").pvalue > 0.05, "Faithfulness regressed"
assert sum(s["cost"] for s in current) <= sum(s["cost"] for s in baseline) * 1.15, "Cost budget exceeded"
save_results(current, app_sha)

Eval types that matter

  • Task success (exact/semantic match, structured field accuracy); a minimal check is sketched after this list.
  • Groundedness/faithfulness (uses provided context, no fabrication).
  • Toxicity/safety policy checks on outputs.
  • Latency & cost per case and in aggregate.
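
Two of these checks sketched in TypeScript; the EvalCase and ModelOutput shapes are assumptions for illustration, and a production groundedness check would usually use an LLM judge or NLI model rather than this citation heuristic.

// Sketch: task-success check + citation-grounding heuristic (shapes are illustrative)
type EvalCase = { input: string; expected: { intent: string }; context: string[] }
type ModelOutput = { intent: string; answer: string; citations: string[] }

export function structuredFieldAccuracy(out: ModelOutput, c: EvalCase): number {
  // 1 if the structured field matches the reference exactly, else 0
  return out.intent === c.expected.intent ? 1 : 0
}

export function citationGrounding(out: ModelOutput, c: EvalCase): number {
  // Fraction of cited sources that actually appear in the provided context
  if (out.citations.length === 0) return 0
  const grounded = out.citations.filter((url) => c.context.some((doc) => doc.includes(url)))
  return grounded.length / out.citations.length
}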

Guardrails: schema, policy, and redaction

Guardrails sit before and after the model. Use them to reject unsafe inputs, constrain outputs to a schema, and scrub sensitive data from logs.

// JSON Schema validation for outputs (Zod example)
import { z } from 'zod'
import { callLLM } from './providers'
const SupportReply = z.object({
  intent: z.enum(['status', 'refund', 'return', 'handoff']),
  answer: z.string().max(1000),
  citations: z.array(z.string().url()).max(5),
})

const rail = async (prompt) => {
  // Input guard: PII / secrets filter
  if (containsSecrets(prompt)) throw new Error('blocked: secrets')

  const raw = await callLLM({ prompt, schema: SupportReply }) // provider w/ JSON mode
  const parsed = SupportReply.parse(raw) // throws on violation
  if (violatesPolicy(parsed.answer)) throw new Error('blocked: policy')
  return parsed
}

Practical guardrails

  • Input: rate limits, size limits, PII/secret filters, prompt-injection checks.
  • Output: JSON schema validation, allow-list enums, citation/URL checks, profanity/toxicity filters.
  • Context: approved tools only; redact secrets before tracing; store hashes for sensitive values (see the sketch below).
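
A minimal sketch of the input-guard and redaction helpers referenced above (containsSecrets plus a trace scrubber). The regex patterns are illustrative only; a dedicated secrets/PII scanner is a better fit in production.

// Illustrative input guard + trace redaction (patterns are examples, not exhaustive)
import { createHash } from 'node:crypto'

const SECRET_PATTERNS = [/sk-[A-Za-z0-9]{20,}/, /-----BEGIN [A-Z ]*PRIVATE KEY-----/]
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g

export function containsSecrets(text: string): boolean {
  return SECRET_PATTERNS.some((re) => re.test(text))
}

export function redactForTrace(text: string): string {
  // Replace emails with a short stable hash so traces stay joinable without storing raw PII
  return text.replace(EMAIL, (m) =>
    'email:' + createHash('sha256').update(m).digest('hex').slice(0, 12)
  )
}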

Budgets: hard caps for cost and concurrency

Budgets stop runaway spend and keep tail latencies under control. Enforce per-model, per-feature, and per-user caps with circuit breakers.

# budgets.yml
models:
  gpt-4o:
    daily_usd: 75
    tpm: 120000
    rpm: 3000
routes:
  support_bot:
    per_user_daily_usd: 0.30
    per_conv_cap_usd: 0.06
    max_concurrency: 200
alerts:
  slack_channel: '#llm-spend'
  notify_over: '80%' # of any budget

// Budget enforcement snippet (TypeScript)
import { Budget } from './budget'

const budget = new Budget('support_bot')
// est_usd: cost pre-estimated from prompt tokens before the call (see Budgeting tips below)
if (!budget.allow({ user: req.user?.id, estCost: est_usd })) {
  return res.status(429).json({ error: 'budget_exceeded' })
}
const result = await rail(req.body.prompt)
budget.commit(result.meta.cost_usd)

Budgeting tips

  • Pre-estimate cost from tokens before calling; hard-fail if over cap (sketched after these tips).
  • Alert at 80% of any budget and trip at 100% with a friendly fallback.
  • Separate eval spend from prod spend with distinct budgets.
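
A sketch of the pre-estimate step from the first tip; the per-1K prices and the ~4-characters-per-token heuristic are illustrative, so load real numbers from your pricing module.

// Rough pre-call cost estimate (prices are placeholders; see ./pricing for real values)
const PRICE_PER_1K_USD: Record<string, { in: number; out: number }> = {
  'gpt-4o': { in: 0.0025, out: 0.01 }, // placeholder numbers
}

export function estimateCostUsd(model: string, prompt: string, maxOutputTokens: number): number {
  const estTokensIn = Math.ceil(prompt.length / 4) // crude ~4 chars/token heuristic
  const p = PRICE_PER_1K_USD[model] ?? { in: 0, out: 0 }
  return (estTokensIn / 1000) * p.in + (maxOutputTokens / 1000) * p.out
}

// Usage before the provider call:
//   const est_usd = estimateCostUsd(req.body.model, req.body.prompt, 500)
//   if (!budget.allow({ user: req.user?.id, estCost: est_usd })) { /* return 429 */ }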

One data model for everything

-- Minimal schema in Postgres
CREATE TABLE llm_traces(
  id uuid PRIMARY KEY, ts timestamptz, route text, model text, app_sha text,
  user_id text, prompt_name text, tokens_in int, tokens_out int,
  latency_ms int, cost_usd numeric(10,5),
  input_redacted jsonb, output jsonb, eval_id text
);

CREATE TABLE eval_results(
  id uuid PRIMARY KEY, eval_id text, app_sha text, route text,
  case_id text, faith real, relevance real, exact real,
  cost_usd numeric(10,5), passed boolean, ts timestamptz
);

CREATE TABLE budgets(
  scope text, key text, period date, spent_usd numeric(12,4),
  PRIMARY KEY(scope, key, period)
);

Why this works

  • Traces, evals, and budgets share identifiers (route, model, app_sha) so you can correlate them in dashboards and gates, as in the query sketched below.
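
For example, a gate or dashboard query can join eval scores to production spend on app_sha and route; a sketch using node-postgres against the tables above.

// Sketch: eval quality + spend for one release and route (assumes a node-postgres pool)
import { Pool } from 'pg'

const pool = new Pool()

export async function gateSummary(appSha: string, route: string) {
  const { rows } = await pool.query(
    `SELECT e.route,
            avg(e.faith)     AS avg_faith,
            avg(e.relevance) AS avg_relevance,
            sum(t.cost_usd)  AS spend_usd
       FROM eval_results e
       LEFT JOIN llm_traces t ON t.app_sha = e.app_sha AND t.route = e.route
      WHERE e.app_sha = $1 AND e.route = $2
      GROUP BY e.route`,
    [appSha, route]
  )
  return rows[0]
}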

CI/CD integration (release gates)

# .github/workflows/eval.yml
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pnpm test:prompts # unit tests on templates/guards
      - run: python eval_runner.py # fast eval set
      - run: node scripts/check_gates.js # assert scores & budgets
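
A sketch of what the gate script could assert, written here in TypeScript (run with tsx, or compile to check_gates.js); it assumes eval_runner.py writes per-case scores for the current run and the stored baseline to JSON, and the thresholds are illustrative.

// scripts/check_gates.ts (sketch): fail CI if quality drops or eval spend blows the budget
import { readFileSync } from 'node:fs'

type CaseScore = { faith: number; rel: number; em: number; cost: number }

const current: CaseScore[] = JSON.parse(readFileSync('eval_current.json', 'utf8'))
const baseline: CaseScore[] = JSON.parse(readFileSync('eval_baseline.json', 'utf8'))

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length
const total = (xs: number[]) => xs.reduce((a, b) => a + b, 0)

const failures: string[] = []

const faithDrop = mean(baseline.map((c) => c.faith)) - mean(current.map((c) => c.faith))
if (faithDrop > 0.02) failures.push(`faithfulness down ${faithDrop.toFixed(3)} vs baseline`)

const costNow = total(current.map((c) => c.cost))
if (costNow > total(baseline.map((c) => c.cost)) * 1.15) failures.push(`eval cost $${costNow.toFixed(2)} over budget`)

if (failures.length > 0) {
  console.error('Release gate failed:\n  - ' + failures.join('\n  - '))
  process.exit(1)
}
console.log('Release gate passed')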

Gates to enforce

  • No regression on critical evals (p-value or threshold).
  • No new policy violations in test logs.
  • Cost/latency within configured budgets.

Ops dashboard (one page)

Panels to ship

  • Spend & tokens by route/model (today, 7d), with budget utilisation.
  • P95 latency and error rate by route.
  • Eval trend (last 10 runs) with red/green gate status.
  • Top failing cases with links to traces and prompts.

60-day rollout plan

Days 0–15

  • Add tracing middleware; start logging tokens, cost, latency, route, user.
  • Define budgets.yml and wire basic enforcement + alerts.

Days 16–30

  • Draft eval sets per feature; run in CI; add release gates for faithfulness + cost.

Days 31–60

  • Add JSON-schema output validation, policy filters, and PII redaction.
  • Build a minimal dashboard (spend, latency, evals) and begin weekly reviews.

Common failure modes

Watch for these

  • Unversioned prompts: put template names and SHAs in traces.
  • Hidden tool calls: trace tool I/O separately; apply guardrails there too.
  • Budget drift: revisit caps monthly; separate dev/eval/prod budgets.
  • Eval rot: rotate 10–20% of cases quarterly; keep a frozen "golden" subset.
