Everything you need to measure AI quality
Four quality pillars. Seven infrastructure touchpoints. One index. The complete platform for teams who refuse to ship blind.
Quality Index
One score, fully decomposed
The Quality Index (0–100) is a weighted composite of four pillars. Each metric is normalized using piecewise linear interpolation, then aggregated with customer-defined weights. You always know why the score moved.
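A minimal sketch of that math, assuming numpy; the metric, breakpoints, weights, and pillar scores below are illustrative placeholders, not the platform's defaults:

```python
# Illustrative only: breakpoints, metric names, and weights are assumptions,
# not the platform's actual configuration.
import numpy as np

def normalize(value: float, breakpoints: list[float], scores: list[float]) -> float:
    """Piecewise linear interpolation of a raw metric onto a 0-1 scale."""
    return float(np.interp(value, breakpoints, scores))

# Example: map P95 latency (ms) onto 0-1, where lower latency scores higher.
latency_score = normalize(850, [200, 500, 1000, 3000], [1.0, 0.9, 0.6, 0.0])

# Customer-defined pillar weights (must sum to 1.0) and normalized pillar scores.
weights = {"task_quality": 0.40, "reliability": 0.25, "efficiency": 0.15, "safety": 0.20}
pillars = {"task_quality": 0.82, "reliability": 0.95, "efficiency": latency_score, "safety": 1.0}

quality_index = 100 * sum(weights[p] * pillars[p] for p in weights)
print(round(quality_index, 1))
```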

Task Quality
Correctness, relevance, faithfulness, and grounding quality rolled into one pillar. Combines dataset-based scoring from LangSmith, LLM-as-judge evaluation for subjective criteria, and RAG-specific metrics like Faithfulness and Context Recall (see the judge sketch after this list).
- Dataset-based scoring with LangSmith, Promptfoo, or custom runner
- LLM-as-judge for non-deterministic criteria with confidence scores
- RAG metrics: Faithfulness, Context Recall, Context Precision
- Piecewise linear normalization to 0–1 scale per metric
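A minimal LLM-as-judge sketch, assuming the OpenAI Python client; the model name, rubric wording, and JSON output contract are illustrative assumptions, not the platform's built-in judge:

```python
# Hypothetical judge: rubric text, model name, and output contract are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an answer for faithfulness to the provided context.
Return JSON: {{"score": <0.0-1.0>, "confidence": <0.0-1.0>, "reason": "<one sentence>"}}

Context: {context}
Question: {question}
Answer: {answer}"""

def judge(question: str, context: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
        temperature=0,
    )
    # Returns a score plus a confidence estimate for the non-deterministic criterion.
    return json.loads(resp.choices[0].message.content)
```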

Reliability
Track failures, timeouts, rate limits, tool-call success rates, and error budgets derived from your OpenTelemetry traces. Set SLOs and get alerted before your users notice (see the error-budget sketch after this list).
- Automatic error rate and timeout detection from OTel spans
- Tool-call success rate tracking for agentic workflows
- SLO-based error budget monitoring
- Rate limit and fallback event detection
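A minimal sketch of SLO-style error-budget math over one window; the counts and the 99% availability target are illustrative placeholders, and in practice the counts come from your OTel spans:

```python
# Illustrative numbers only; derive these counts from your OpenTelemetry spans.
SLO_TARGET = 0.99          # assumption: 99% of requests succeed in the window

total_requests = 50_000    # spans in the evaluation window
failed_requests = 180      # errors, timeouts, exhausted rate-limit retries

error_rate = failed_requests / total_requests
error_budget = 1 - SLO_TARGET                 # failure fraction the SLO allows
budget_consumed = error_rate / error_budget   # > 1.0 means the budget is blown

print(f"error rate {error_rate:.2%}, error budget consumed {budget_consumed:.0%}")
# Alert well before budget_consumed reaches 100% so users never notice.
```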

Efficiency
Latency distributions, token usage, cost per request, and cache hit rates. Every API call is converted to dollars using provider pricing, so you can optimize spend without guessing (see the cost sketch after this list).
- P50/P95/P99 latency tracking across all pipeline stages
- Per-request cost calculation using live provider pricing
- Token usage analysis with cached vs. uncached breakdown
- Cache hit rate monitoring and optimization signals
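A minimal per-request cost sketch; the rates below are placeholders, not real provider pricing:

```python
# Placeholder pricing (USD per 1M tokens); substitute your provider's live rates.
PRICE = {"input": 3.00, "cached_input": 0.30, "output": 12.00}

def request_cost_usd(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cached prompt tokens are billed at the discounted rate; the rest at full price."""
    uncached = input_tokens - cached_tokens
    return (
        uncached * PRICE["input"]
        + cached_tokens * PRICE["cached_input"]
        + output_tokens * PRICE["output"]
    ) / 1_000_000

print(f"${request_cost_usd(input_tokens=6_000, cached_tokens=4_500, output_tokens=800):.4f}")
```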

Safety
Policy compliance, guardrail pass rates, and unsafe content detection. Safety is special: critical violations cap your entire Quality Index no matter how well the other pillars score (see the gate sketch after this list).
- Guardrail pass/fail rates across all requests
- Content safety scoring and PII detection
- Policy compliance checks with configurable rules
- Safety gate: violations cap the overall Quality Index
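A minimal sketch of the safety gate; the cap value of 40 and the violation flag are illustrative assumptions, not the platform's actual policy:

```python
# Assumption for illustration: a critical safety violation caps the index at 40.
CRITICAL_SAFETY_CAP = 40.0

def gated_quality_index(weighted_index: float, critical_violation: bool) -> float:
    """Safety overrides everything: no pillar mix can lift a score past the cap."""
    if critical_violation:
        return min(weighted_index, CRITICAL_SAFETY_CAP)
    return weighted_index

print(gated_quality_index(92.3, critical_violation=True))   # capped at 40.0
print(gated_quality_index(92.3, critical_violation=False))  # unchanged
```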
Infrastructure
Built for every layer of your stack
Seven capabilities baked into the platform — so you don't have to build them yourself.
Open instrumentation, zero lock-in
OTel/OTLP native. LLM calls, retrieval, and tool spans are understood automatically (see the setup sketch after this list).
- Instrument once, fan out to Datadog, your warehouse, and us
- Swap vendors without rewriting instrumentation
- No proprietary SDK — your OTel setup just works
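A minimal OpenTelemetry setup sketch in Python; the endpoints and attribute values are placeholders, and the fan-out shown is simply two OTLP exporters on one tracer provider:

```python
# Standard OpenTelemetry SDK usage; endpoints and span attributes are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Instrument once, fan out: add one processor per OTLP-compatible backend.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otel.example.com/v1/traces")))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://collector.internal/v1/traces")))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-app")
with tracer.start_as_current_span("llm.call") as span:
    # GenAI semantic-convention attributes let LLM spans be recognized automatically.
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.usage.input_tokens", 1200)
    span.set_attribute("gen_ai.usage.output_tokens", 250)
```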
Traces + evals + experiments, unified
One system: what happened, did it help, should this ship?
- Diagnose bad requests to the span level
- Measure if prompt/model/retrieval changes improved outcomes
- Rank failures by user impact
AI-native release management
Every deploy gets a quality snapshot, and regressions are caught before users see them (see the CI gate sketch after this list).
- Baseline vs. candidate on every PR
- CI/CD eval gates — block bad releases
- Prompt, model, and retriever versioning
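A minimal sketch of a CI eval gate; the file names, threshold, and score format are assumptions about how you might wire this into a pipeline, not the platform's CLI:

```python
# Hypothetical gate script: compare candidate eval scores against the baseline
# and fail the pipeline when any gating metric regresses past its threshold.
import json
import sys

MAX_REGRESSION = 0.02  # assumption: allow up to a 0.02 drop on the 0-1 scale

baseline = json.load(open("baseline_scores.json"))    # e.g. {"faithfulness": 0.91, ...}
candidate = json.load(open("candidate_scores.json"))

failures = [
    f"{metric}: {baseline[metric]:.3f} -> {candidate[metric]:.3f}"
    for metric in baseline
    if candidate.get(metric, 0.0) < baseline[metric] - MAX_REGRESSION
]

if failures:
    print("Eval gate failed:\n" + "\n".join(failures))
    sys.exit(1)  # non-zero exit blocks the release in CI
print("Eval gate passed.")
```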
Datasets & annotation
Production traces become eval datasets. Quality compounds daily (see the sampling sketch after this list).
- Auto-sample bad requests into datasets
- Human annotation, custom scorers, rubrics
- Segment by task, user cohort, or workflow
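A minimal sampling sketch; the trace fields, score threshold, and output path are illustrative assumptions:

```python
# Hypothetical trace records: field names and the threshold are assumptions.
import json

BAD_SCORE_THRESHOLD = 0.5

traces = [
    {"trace_id": "a1", "input": "reset my password", "output": "...", "quality_score": 0.31},
    {"trace_id": "b2", "input": "cancel my order",   "output": "...", "quality_score": 0.88},
]

# Sample low-scoring production traces into a JSONL eval dataset for annotation.
with open("eval_dataset.jsonl", "w") as f:
    for t in traces:
        if t["quality_score"] < BAD_SCORE_THRESHOLD:
            f.write(json.dumps({"input": t["input"], "output": t["output"],
                                "source_trace": t["trace_id"]}) + "\n")
```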
Managed platform, portable data
We run the infra. You own the data.
- We manage: ingestion, dashboards, evals, alerting
- You keep: traces, results, annotations, cost data
- Export everything, anytime
Enterprise-ready from day one
RBAC, workspaces, audit logs, PII redaction — built in, not bolted on.
- Dev / staging / prod workspaces
- RBAC with audit logging
- PII redaction, configurable retention
Offline + online quality
Gate releases offline. Catch regressions live.
- Offline: benchmarks, model comparisons, CI gates
- Online: errors, latency, feedback, drift
- Traces feed back into datasets — continuous loop
Integrations
20+ connectors. Your entire stack.
OTel, LangSmith, GitHub, eval frameworks, guardrails, gateways, experiment trackers — all with step-by-step guides.
Eval Studio
Define what quality means for you
Not every AI app prioritizes the same metrics. Eval Studio lets you declare what matters, how to measure it, and what blocks a release.
Eval Cards
Each evaluation is captured as an Eval Card: a human-readable spec that doubles as machine-runnable config. Cards include a title, linked task, rubric criteria, data binding, and gating designation (see the sketch after this list).
- Title, description, and linked user task
- "Why it matters" and "risk if it fails"
- Measurable rubric with thresholds
- Data binding: LangSmith, production sampling, CSV/JSONL
- Gating (blocks release) vs. monitoring (SLO-style)
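A sketch of what an Eval Card could look like as data; the field names and values are illustrative, not the platform's actual schema:

```python
# Illustrative Eval Card shape: field names and values are assumptions.
eval_card = {
    "title": "Grounded answers for billing questions",
    "linked_task": "billing-support-agent",
    "why_it_matters": "Wrong billing answers erode trust and create support load.",
    "risk_if_it_fails": "Users act on incorrect refund or invoice information.",
    "rubric": [
        {"criterion": "faithfulness", "method": "llm_judge", "threshold": 0.85},
        {"criterion": "context_recall", "method": "rag_metric", "threshold": 0.80},
    ],
    "data_binding": {"source": "langsmith_dataset", "name": "billing-qa-v3"},
    "gating": True,  # False would make this a monitoring (SLO-style) card
}
```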
Relevance Weighting
Interactive sliders let you define pillar weights. A medical AI prioritizes accuracy; a chatbot prioritizes latency. Persona templates accelerate onboarding.
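A sketch of persona templates as pillar-weight presets; the persona names and numbers are illustrative assumptions:

```python
# Illustrative persona presets; pillar weights must sum to 1.0.
PERSONA_WEIGHTS = {
    "clinical-assistant": {"task_quality": 0.50, "safety": 0.30, "reliability": 0.15, "efficiency": 0.05},
    "support-chatbot":    {"task_quality": 0.35, "safety": 0.15, "reliability": 0.20, "efficiency": 0.30},
}

assert all(abs(sum(w.values()) - 1.0) < 1e-9 for w in PERSONA_WEIGHTS.values())
```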
Automated Remediation
When regressions are detected, the platform analyzes correlated eval scores and traces, then drafts a GitHub PR with the proposed fix. You review and merge, so a human is always in the loop (see the PR sketch after this list).
- Prompt rewrites and parameter tuning
- OTel instrumentation additions
- Eval config and CI hook updates
- Retrieval strategy adjustments
- Comprehensive PR descriptions with linked evidence
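A minimal sketch of just the final step, opening a draft PR through the GitHub REST API; the repo, branch names, and token handling are assumptions, not the platform's internals:

```python
# Opens a draft pull request via the GitHub REST API (values are placeholders).
import os
import requests

def open_draft_pr(owner: str, repo: str, head: str, base: str, title: str, body: str) -> str:
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": title, "head": head, "base": base, "body": body, "draft": True},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]  # link for the human reviewer
```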
Impact Quantification
Translate technical regressions into business impact: quality change × exposure × cost of failure gives you a dollar figure for every regression (worked example after this list).
- Index change since last release
- Top regressions by estimated impact ($)
- Implicated resource identification
- Developer productivity tracking (PR cycle time)
- ROI reporting for stakeholders
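A worked sketch of the impact arithmetic; every number below is an illustrative placeholder:

```python
# Impact = quality change x exposure x cost of failure (all values illustrative).
quality_drop = 0.04            # faithfulness fell 0.04 on the 0-1 scale
affected_requests = 120_000    # weekly requests hitting the regressed workflow
cost_per_failure_usd = 1.80    # blended cost of a bad answer (escalation, churn risk)

estimated_weekly_impact = quality_drop * affected_requests * cost_per_failure_usd
print(f"~${estimated_weekly_impact:,.0f} per week")   # ~$8,640 per week
```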
Start measuring AI quality today
Deploy in minutes. See your first Quality Index in under 30 minutes.