Integration Guides
15 connectors. One platform.
Step-by-step guides with copy-paste configs, verified instructions, and the exact API calls we make under the hood.
OpenTelemetry
OTLP endpoints on gRPC (4317) and HTTP (4318). Every LLM call becomes a trace with spans for model calls, retrieval, tool invocations, and post-processing.
Install the OpenTelemetry SDK
Add the OTel SDK and the GenAI instrumentation package to your application.
# Python
pip install opentelemetry-sdk opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-openai
# Node.js
npm install @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc \
  @traceloop/instrumentation-openai

Initialize instrumentation
Call the instrumentor before any LLM client is created. This patches OpenAI, Anthropic, and other providers to emit GenAI spans automatically.
# Python — add at application startup
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
OpenAIInstrumentor().instrument(tracer_provider=provider)

Configure the OTel Collector
If you run an OTel Collector as a sidecar or gateway, add qualityindex.ai as an OTLP exporter.
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    send_batch_size: 1000
    timeout: 5s
exporters:
  otlp/qualityindex:
    endpoint: "https://ingest.qualityindex.ai:4317"
    headers:
      Authorization: "Bearer ${QI_API_KEY}"
    compression: gzip
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/qualityindex]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/qualityindex]

Verify spans are flowing
Make a test LLM call and check the dashboard. You should see a trace within seconds.
curl -s https://api.qualityindex.ai/api/v1/connectors/otel/health \
-H "Authorization: Bearer ${QI_API_KEY}"
# Response
{
  "type": "otel",
  "status": "healthy",
  "last_span_received": "2026-03-07T14:32:00Z",
  "spans_ingested_24h": 1
}

Using a managed OTel provider?
If you use Datadog, New Relic, or Grafana as your OTel backend, add qualityindex.ai as a secondary exporter. Traces flow to both destinations.
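As a sketch of what that secondary export could look like in an OTel Collector config (the `datadog` exporter stanza here is illustrative; substitute the exporter for your own backend):

```yaml
# Sketch: fan out one trace pipeline to two backends
exporters:
  datadog:                     # illustrative primary backend
    api:
      key: ${DD_API_KEY}
  otlp/qualityindex:           # secondary exporter
    endpoint: "https://ingest.qualityindex.ai:4317"
    headers:
      Authorization: "Bearer ${QI_API_KEY}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, otlp/qualityindex]
```

Every span in the pipeline is delivered to each exporter listed, so neither backend sees a reduced sample.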
LangSmith
Sync your curated datasets and evaluation results. Correlate semantic quality scores with production telemetry for a complete picture.
Generate a LangSmith API key
In LangSmith, go to Settings → API Keys and create a Service Key (recommended) or Personal Access Token.
# The key looks like:
# ls-svc-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
export LANGSMITH_API_KEY="ls-svc-xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Connect in the dashboard
Navigate to Settings → Connectors → Add Connector → LangSmith. Paste your API key.
# Under the hood, we validate by calling:
GET https://api.smith.langchain.com/api/v1/workspaces/current
Headers: x-api-key: ls-svc-xxxx...

Select datasets to sync
Choose which LangSmith datasets to sync. Each maps to Eval Cards. Enable 'data minimization mode' to import only scores, not raw content.
# We list your datasets:
GET https://api.smith.langchain.com/api/v1/datasets
Headers: x-api-key: ls-svc-xxxx...
# Synced fields: id, name, description, example_count, schema

Map datasets to Eval Cards
For each synced dataset, create or link an Eval Card defining which metrics and thresholds to track.
POST /api/v1/eval-cards
{
  "name": "Product Search Relevance",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "relevance-score", "threshold": 0.8, "weight": 0.6 },
    { "metric": "faithfulness", "threshold": 0.9, "weight": 0.4 }
  ],
  "data_binding": {
    "source": "langsmith",
    "dataset_id": "ds_abc123",
    "sample_size": 200
  },
  "gate": false
}

Import evaluation results
When you run evaluations in LangSmith, we automatically sync the results into the Task Quality pillar.
# We import results via:
POST https://api.smith.langchain.com/api/v1/runs/query
Body: {
  "session_id": "experiment-session-id",
  "is_root": true,
  "select": ["id", "outputs", "feedback_stats"]
}
# Feedback scores fetched via:
GET https://api.smith.langchain.com/api/v1/feedback
Params: run_id=<run_id>

Data minimization mode
Import only scores, metadata, and schema — not raw prompt/response content. Useful for HIPAA and SOC 2 compliance.
GitHub
Link every deployment to a Quality Index snapshot. Get automated remediation PRs when regressions are detected. Always human-in-the-loop.
Install the GitHub App
Visit the installation page. Choose "Only select repositories" and pick the repos to monitor.
# Permissions requested:
# - Contents (Read): Read code to analyze prompt files and configs
# - Pull requests (Read & Write): Create remediation PRs
# - Metadata (Read): Repository metadata
# - Webhooks: Receive push and PR events
#
# No access to: Actions, Issues, Discussions, Pages, Secrets

Configure release tracking
Map each repository to a service name. The platform watches for push events on your main branch.
# Webhook events we listen to:
# - push (main/master): triggers rescore
# - pull_request (opened, synchronize): eval preview
# - installation_repositories (added, removed): updates repo list
# Each push generates a release snapshot:
{
  "tag": "v2.4.1",
  "commit_sha": "a1b2c3d",
  "quality_index": 87,
  "deployed_at": "2026-03-07T08:00:00Z"
}

Enable automated remediation
When the Quality Index drops, the platform proposes a fix as a PR with full evidence.
# Remediation PR includes:
# - Which Eval Card detected the regression
# - Before/after scores with timestamps
# - Root cause analysis
# - The proposed fix
# - Link to the qualityindex.ai dashboard
#
# You review and merge — always human-in-the-loop.

PR quality gates (optional)
Block PRs that drop Quality Index below a threshold. Posted as a check status.
# PR check status:
# ✅ qualityindex.ai — Quality Index: 89 (+2 from main)
# ❌ qualityindex.ai — Quality Index: 74 (-8 from main)
# Reason: Faithfulness dropped below 0.80
# Configure in dashboard:
# Settings → Repositories → [repo] → Quality Gate
# Minimum Quality Index: 80
# Block on: Safety violations, Task Quality regression > 5 pts

Security model
GitHub App tokens are scoped per-installation, auto-expire after 1 hour, and can be revoked instantly. Webhook signatures verified via HMAC-SHA256.
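For illustration, HMAC-SHA256 webhook verification on the receiving side works like this (a minimal sketch with placeholder secret and payload; GitHub sends the signature in the X-Hub-Signature-256 header):

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a webhook payload against its X-Hub-Signature-256 header."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature bytes via timing
    return hmac.compare_digest(expected, signature_header)

# Example with placeholder values
secret = b"webhook-secret"
body = b'{"action": "opened"}'
header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
assert verify_github_signature(secret, body, header)
assert not verify_github_signature(secret, b'{"tampered": true}', header)
```

Any payload that fails this check should be rejected before processing, since it was not signed with the shared webhook secret.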
Langfuse
Dual-export your traces to Langfuse and qualityindex.ai simultaneously. Sync Langfuse scores, sessions, and cost data for unified quality tracking.
Configure dual OTel export
Langfuse SDK v3 is OTel-native. Configure your OTel Collector or SDK to export traces to both Langfuse and qualityindex.ai simultaneously.
# Option A: OTel Collector with dual exporters
exporters:
  otlp/langfuse:
    endpoint: "https://cloud.langfuse.com/api/public/otel"
    headers:
      Authorization: "Basic ${LANGFUSE_BASE64_KEYS}"
  otlp/qualityindex:
    endpoint: "https://ingest.qualityindex.ai:4317"
    headers:
      Authorization: "Bearer ${QI_API_KEY}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/langfuse, otlp/qualityindex]

Connect Langfuse REST API
In the qualityindex.ai dashboard, add your Langfuse API keys to sync scores, sessions, and cost data that Langfuse computes.
# We sync the following from Langfuse:
# - Session scores (user feedback, auto-evals)
# - Trace cost calculations
# - Generation metadata (model, tokens, latency)
# - Score definitions and configs
# Auth: Langfuse Public Key + Secret Key
# Endpoint: https://cloud.langfuse.com/api/public/

Map Langfuse scores to Eval Cards
Langfuse scores (from human feedback or auto-eval functions) map directly to Eval Card rubric metrics.
POST /api/v1/eval-cards
{
  "name": "User Satisfaction Score",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "langfuse.user_score", "threshold": 0.7, "weight": 1.0 }
  ],
  "data_binding": {
    "source": "langfuse",
    "score_name": "user_satisfaction",
    "min_traces": 100
  },
  "gate": false
}

Promptfoo
Run local or CI eval suites with Promptfoo and push results to qualityindex.ai. Assertion pass rates map directly to Task Quality Eval Cards.
Install Promptfoo
Promptfoo is an open-source CLI for LLM evaluation and red-teaming.
# Install globally
npm install -g promptfoo
# Or as a dev dependency
npm install --save-dev promptfoo

Run eval and export JSON results
Run your eval suite and output structured JSON results that qualityindex.ai can parse.
# Run eval with JSON output
promptfoo eval --output results.json
# results.json contains:
# - Each test case with pass/fail per assertion
# - Scores, latency, token counts per provider
# - Aggregate pass rates per metric

Upload results via the API
Push Promptfoo results to qualityindex.ai. Each assertion type maps to an Eval Card rubric metric.
# Upload results
curl -X POST https://api.qualityindex.ai/api/v1/connectors/promptfoo/import \
-H "Authorization: Bearer ${QI_API_KEY}" \
-H "Content-Type: application/json" \
-d @results.json
# Response
{
  "imported": 248,
  "eval_cards_updated": ["ec_relevance", "ec_safety"],
  "quality_index_delta": "+3"
}

Automate in CI/CD
Add to your CI pipeline to run evals on every push and automatically update the Quality Index.
# .github/workflows/eval.yml
name: LLM Eval
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g promptfoo
      - run: promptfoo eval --output results.json
      - run: |
          curl -X POST https://api.qualityindex.ai/api/v1/connectors/promptfoo/import \
            -H "Authorization: Bearer ${{ secrets.QI_API_KEY }}" \
            -H "Content-Type: application/json" \
            -d @results.json

Ragas
Import RAG-specific evaluation metrics — faithfulness, context precision, answer relevancy, context recall — directly into your Quality Index.
Install Ragas
Ragas is a widely used open-source framework for evaluating RAG pipelines.
pip install ragas

Run evaluation and capture results
Run ragas.evaluate() on your dataset and capture the output.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
results = evaluate(
    dataset=your_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
# Export as dict
scores = results.to_pandas().to_dict(orient="records")

Push results to qualityindex.ai
Each Ragas metric maps to a Task Quality rubric criterion. Upload the scores via our Python SDK or REST API.
import requests
response = requests.post(
    "https://api.qualityindex.ai/api/v1/connectors/ragas/import",
    headers={"Authorization": f"Bearer {QI_API_KEY}"},
    json={
        "metrics": {
            "faithfulness": results["faithfulness"],
            "answer_relevancy": results["answer_relevancy"],
            "context_precision": results["context_precision"],
            "context_recall": results["context_recall"],
        },
        "dataset_size": len(your_dataset),
        "eval_card_id": "ec_rag_quality"
    },
)

Create Eval Card for RAG metrics
Define thresholds for each Ragas metric. The Quality Index updates whenever new results are imported.
POST /api/v1/eval-cards
{
  "name": "RAG Pipeline Quality",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "faithfulness", "threshold": 0.85, "weight": 0.35 },
    { "metric": "answer_relevancy", "threshold": 0.80, "weight": 0.30 },
    { "metric": "context_precision", "threshold": 0.75, "weight": 0.20 },
    { "metric": "context_recall", "threshold": 0.70, "weight": 0.15 }
  ],
  "data_binding": { "source": "ragas" },
  "gate": false
}

Braintrust
Sync experiment results and scores from Braintrust projects. Map Braintrust evaluations to Eval Cards with automatic threshold detection.
Generate a Braintrust API key
In Braintrust, go to Settings → API Keys and create a key scoped to your project.
# Set as environment variable
export BRAINTRUST_API_KEY="bt_xxxxxxxxxxxxxxxxxxxx"

Connect in the dashboard
Navigate to Settings → Connectors → Add Connector → Braintrust. Paste your API key.
# We validate by fetching your project list:
GET https://api.braintrust.dev/v1/project
Headers: Authorization: Bearer bt_xxxx...
# Then list experiments in each project:
GET https://api.braintrust.dev/v1/experiment?project_id=<id>

Select experiments to sync
Choose which Braintrust experiments should feed into your Quality Index. Scores sync automatically after each experiment run.
# We import experiment scores:
GET https://api.braintrust.dev/v1/experiment/<id>/results
# Mapped fields:
# - scores → Eval Card rubric metrics
# - metadata.model → resource linkage
# - duration → Efficiency pillar
# - input/output tokens → cost calculation

Map to Eval Cards
Each Braintrust scorer maps to an Eval Card metric. Thresholds are auto-detected from your historical baseline.
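The auto-detection idea can be pictured as deriving a floor from historical scores. The rule below (mean minus one standard deviation, clamped to [0, 1]) is a hypothetical sketch for intuition; the platform's actual detection rule is not specified here:

```python
from statistics import mean, stdev

def auto_threshold(historical_scores: list[float]) -> float:
    """Derive a metric threshold from a historical baseline.

    Hypothetical heuristic: mean minus one standard deviation,
    clamped to [0, 1]. The platform's real rule may differ.
    """
    if len(historical_scores) < 2:
        raise ValueError("need at least two historical scores")
    floor = mean(historical_scores) - stdev(historical_scores)
    return round(min(1.0, max(0.0, floor)), 2)

# A stable accuracy history around 0.90 yields a threshold just below it
print(auto_threshold([0.91, 0.89, 0.92, 0.88, 0.90]))  # → 0.88
```

The point of a baseline-derived threshold is that normal run-to-run noise stays above the floor, while a genuine regression drops below it.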
POST /api/v1/eval-cards
{
  "name": "Braintrust Accuracy",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "braintrust.accuracy", "threshold": 0.85, "weight": 0.6 },
    { "metric": "braintrust.relevance", "threshold": 0.80, "weight": 0.4 }
  ],
  "data_binding": {
    "source": "braintrust",
    "project_id": "proj_abc123",
    "experiment_filter": "latest"
  },
  "gate": false
}

DeepEval
Unit-test style LLM evaluation. Parse DeepEval test results and map metric scores to Eval Card thresholds. CI/CD integration triggers on test completion.
Install DeepEval
DeepEval provides pytest-style assertions for LLM outputs.
pip install deepeval

Write and run tests
Create test cases with metrics like hallucination, answer relevancy, faithfulness, and toxicity.
# test_llm.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
def test_no_hallucination():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        context=["Paris is the capital of France."]
    )
    metric = HallucinationMetric(threshold=0.5)
    assert_test(test_case, [metric])

# Run with JSON output
# deepeval test run test_llm.py --output-file results.json

Upload results to qualityindex.ai
Push DeepEval test results. Each metric type maps to an Eval Card criterion.
curl -X POST https://api.qualityindex.ai/api/v1/connectors/deepeval/import \
-H "Authorization: Bearer ${QI_API_KEY}" \
-H "Content-Type: application/json" \
-d @results.json
# Metric mapping:
# HallucinationMetric → Task Quality (faithfulness)
# AnswerRelevancyMetric → Task Quality (relevance)
# ToxicityMetric → Safety (content safety)
# BiasMetric → Safety (bias detection)

Add to CI pipeline
Run DeepEval on every PR and gate merges on quality thresholds.
# .github/workflows/deepeval.yml
name: DeepEval Quality Gate
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
      - run: deepeval test run tests/ --output-file results.json
      - run: |
          curl -X POST https://api.qualityindex.ai/api/v1/connectors/deepeval/import \
            -H "Authorization: Bearer ${{ secrets.QI_API_KEY }}" \
            -d @results.json

Arize Phoenix
Traces flow natively via OTLP — no separate connector needed. Import Phoenix-specific eval results via REST API for additional quality signals.
Traces already flow via OTel
If you use the OpenTelemetry connector, Phoenix traces already reach qualityindex.ai. Phoenix is built on OTel — no extra configuration needed for trace data.
# Phoenix uses standard OTel OTLP export.
# If your OTel Collector is configured for qualityindex.ai,
# Phoenix traces are already flowing.
# Verify:
curl -s https://api.qualityindex.ai/api/v1/connectors/otel/health \
-H "Authorization: Bearer ${QI_API_KEY}"

Connect Phoenix evaluations API
For Phoenix-specific eval results (LLM-as-judge, retrieval evals), connect the Phoenix API to import scores.
# In the dashboard: Settings → Connectors → Add → Arize Phoenix
# Enter your Phoenix server URL and API key
# We fetch eval results:
GET https://your-phoenix-instance/api/v1/evaluations
Headers: Authorization: Bearer <phoenix_api_key>
# Imported: evaluation names, scores, labels, trace links

Map Phoenix evals to Eval Cards
Phoenix evaluation annotations map to Eval Card metrics. Retrieval evals feed Task Quality, latency feeds Efficiency.
POST /api/v1/eval-cards
{
  "name": "Phoenix Retrieval Quality",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "phoenix.relevance", "threshold": 0.80, "weight": 0.5 },
    { "metric": "phoenix.qa_correctness", "threshold": 0.85, "weight": 0.5 }
  ],
  "data_binding": {
    "source": "arize_phoenix",
    "evaluation_name": "retrieval_quality"
  },
  "gate": false
}

LiteLLM
Unified gateway to 100+ LLMs. Enable the built-in OTel callback and all model calls are captured with unified token, cost, and latency attributes.
Enable the OTel callback in LiteLLM
LiteLLM has a built-in OpenTelemetry callback. Enable it to emit spans for every model call across all providers.
# Python — add before any LiteLLM calls
import os
import litellm

litellm.callbacks = ["otel"]
# Set the OTLP endpoint. Note: Python does not expand ${...} inside string
# literals, so read the API key from the environment explicitly.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://ingest.qualityindex.ai:4317"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Bearer {os.environ['QI_API_KEY']}"

For LiteLLM Proxy: configure in config.yaml
If you run the LiteLLM Proxy Server, add OTel export to the proxy config.
# litellm_config.yaml
general_settings:
  otel: true
environment_variables:
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://ingest.qualityindex.ai:4317"
  OTEL_EXPORTER_OTLP_HEADERS: "Authorization=Bearer YOUR_QI_API_KEY"
  OTEL_SERVICE_NAME: "litellm-proxy"

Verify multi-provider traces
LiteLLM normalizes attributes across providers. Check that traces appear with correct model, token, and cost data.
# Each LiteLLM call produces a span with:
# gen_ai.provider.name: "openai" / "anthropic" / "bedrock" / etc.
# gen_ai.request.model: "gpt-4o" / "claude-3.5-sonnet" / etc.
# gen_ai.usage.input_tokens: 150
# gen_ai.usage.output_tokens: 83
# litellm.cost: 0.0024 (calculated from provider pricing)
# The Efficiency pillar auto-computes:
# - Cost per request across all providers
# - Latency p50/p95 per model
# - Token usage trends

Guardrails AI
Capture per-validator pass/fail from Guard calls. Each validator result maps to Safety pillar guardrail pass rates.
Install OTel instrumentation for Guardrails AI
Wrap your Guard calls with OTel spans to capture per-validator pass/fail results.
pip install guardrails-ai opentelemetry-sdk opentelemetry-exporter-otlp
# Initialize OTel (if not already done via the OTel connector)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://ingest.qualityindex.ai:4317"))
)

Instrument Guard calls
Add tracing to your Guard validation pipeline. Each validator produces a child span with pass/fail status.
from opentelemetry import trace
from guardrails import Guard
tracer = trace.get_tracer("guardrails")
guard = Guard.from_pydantic(OutputModel)
with tracer.start_as_current_span("guardrails.validate") as span:
    result = guard(
        llm_api=openai.chat.completions.create,
        model="gpt-4o",
        messages=[{"role": "user", "content": user_input}]
    )
    span.set_attribute("guardrails.passed", result.validation_passed)
    span.set_attribute("guardrails.validators_run", len(result.validated_output or []))
    for i, log in enumerate(result.validation_logs):
        span.set_attribute(f"guardrails.validator.{i}.name", log.validator_name)
        span.set_attribute(f"guardrails.validator.{i}.passed", log.validation_passed)

Create Safety Eval Cards
Map validator results to Safety pillar metrics. Critical validators can gate your Quality Index.
POST /api/v1/eval-cards
{
  "name": "Output Validation Safety",
  "pillar": "safety",
  "rubric": [
    { "metric": "guardrails.pii_detection", "threshold": 0.99, "weight": 0.4 },
    { "metric": "guardrails.hallucination", "threshold": 0.95, "weight": 0.3 },
    { "metric": "guardrails.format_validation", "threshold": 0.98, "weight": 0.3 }
  ],
  "data_binding": { "source": "otel", "span_filter": "guardrails.validate" },
  "gate": true
}

Lakera Guard
Import prompt injection scan results. Map threat detections to Safety pillar metrics: injection rate, detection confidence, policy compliance.
Get your Lakera API key
Sign up at Lakera and generate an API key from your dashboard.
# Lakera API key format
export LAKERA_API_KEY="lk_xxxxxxxxxxxxxxxxxxxx"

Connect in the dashboard
Add Lakera Guard as a connector. We poll your scan history and import threat detection results.
# Under the hood, we call Lakera's API:
POST https://api.lakera.ai/v2/guard
Headers: Authorization: Bearer lk_xxxx...
Body: { "input": "<prompt text>" }
# Response includes:
# - flagged: true/false
# - categories: { "prompt_injection": 0.92, "jailbreak": 0.15 }
# - threshold: the configured policy threshold

Map to Safety pillar
Lakera detections map to Safety pillar metrics. High-confidence threat detections can gate your Quality Index.
POST /api/v1/eval-cards
{
  "name": "Prompt Security",
  "pillar": "safety",
  "rubric": [
    { "metric": "lakera.prompt_injection_rate", "threshold": 0.01, "weight": 0.5 },
    { "metric": "lakera.jailbreak_rate", "threshold": 0.005, "weight": 0.3 },
    { "metric": "lakera.policy_compliance", "threshold": 0.99, "weight": 0.2 }
  ],
  "data_binding": { "source": "lakera", "scan_window": "24h" },
  "gate": true
}

Real-time screening
Lakera Guard runs in <50ms per request. Use it as middleware before your LLM call, and we'll track the block rate and threat categories in your Safety pillar.
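A middleware sketch of that pattern follows. The endpoint and response shape mirror the example above; the blocking rule (`flagged` or any category score at or above a threshold) is an assumption to tune against your own policy:

```python
import os

LAKERA_URL = "https://api.lakera.ai/v2/guard"

def is_blocked(guard_response: dict, threshold: float = 0.9) -> bool:
    """Decide whether to block a prompt, given a Lakera Guard response.

    Assumed rule: block when the response is flagged, or when any
    category score meets the threshold.
    """
    if guard_response.get("flagged"):
        return True
    categories = guard_response.get("categories", {})
    return any(score >= threshold for score in categories.values())

def screen_prompt(prompt: str) -> bool:
    """Screen a prompt before the LLM call; True means safe to proceed."""
    import requests  # imported here so the decision helper stays dependency-free

    resp = requests.post(
        LAKERA_URL,
        headers={"Authorization": f"Bearer {os.environ['LAKERA_API_KEY']}"},
        json={"input": prompt},
        timeout=2,
    )
    resp.raise_for_status()
    return not is_blocked(resp.json())
```

Calling `screen_prompt(user_input)` before the model call gives you a block/allow decision per request; the resulting block rate is what feeds the Safety pillar.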
NeMo Guardrails
NVIDIA's programmable guardrails for content safety, topic control, and jailbreak prevention. Track rail activations via OTel spans.
Add OTel instrumentation to NeMo config
NeMo Guardrails integrates with LangChain. If you have LangChain OTel instrumentation, rail activations produce child spans automatically.
# Install NeMo Guardrails with OTel support
pip install nemoguardrails opentelemetry-instrumentation-langchain
# If using LangChain, instrumenting LangChain also captures NeMo rails:
from opentelemetry.instrumentation.langchain import LangchainInstrumentor
LangchainInstrumentor().instrument()

Configure rail activation tracking
Each rail activation (topic blocked, jailbreak detected, PII redacted) produces a span event with the rail name, action taken, and user message context.
# NeMo Guardrails config.yml
models:
  - type: main
    engine: openai
    model: gpt-4o
rails:
  input:
    flows:
      - self check input    # Jailbreak detection
      - check pii           # PII redaction
  output:
    flows:
      - self check output   # Hallucination prevention
      - check topic         # Off-topic redirect

# Each rail activation produces OTel span attributes:
# nemo.rail.name: "self check input"
# nemo.rail.action: "blocked" / "allowed" / "redirected"
# nemo.rail.confidence: 0.94

Map rail activations to Safety pillar
Each rail type maps to a Safety Eval Card metric. Block rates and confidence scores flow into the Quality Index.
POST /api/v1/eval-cards
{
  "name": "NeMo Rail Safety",
  "pillar": "safety",
  "rubric": [
    { "metric": "nemo.jailbreak_block_rate", "threshold": 0.98, "weight": 0.4 },
    { "metric": "nemo.pii_redaction_rate", "threshold": 0.99, "weight": 0.3 },
    { "metric": "nemo.topic_compliance_rate", "threshold": 0.95, "weight": 0.3 }
  ],
  "data_binding": { "source": "otel", "span_filter": "nemo.rail.*" },
  "gate": true
}

Weights & Biases
Sync W&B run metrics, model versions, and artifacts. Correlate experiment results with Quality Index releases.
Generate a W&B API key
In Weights & Biases, go to Settings → Danger Zone → API Keys.
export WANDB_API_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Connect in the dashboard
Add W&B as a connector. We sync run metrics, model versions, and artifact metadata from your selected projects.
# We validate and fetch projects:
GET https://api.wandb.ai/graphql
Headers: Authorization: Bearer <wandb_api_key>
# Query: list projects, runs, and metrics
# We import: run.summary metrics, run.config, model artifacts

Map experiments to Eval Cards
W&B run metrics map to Eval Card rubric criteria. Model registry versions link to Quality Index releases.
POST /api/v1/eval-cards
{
  "name": "Model Accuracy (W&B)",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "wandb.accuracy", "threshold": 0.90, "weight": 0.6 },
    { "metric": "wandb.f1_score", "threshold": 0.85, "weight": 0.4 }
  ],
  "data_binding": {
    "source": "wandb",
    "project": "my-project",
    "run_filter": "latest"
  },
  "gate": false
}

Link model versions to releases
When you promote a model version in W&B, we automatically create a release snapshot in qualityindex.ai.
# W&B Model Registry events trigger release snapshots:
# - Model linked to registry → new release in qualityindex.ai
# - Model promoted to "production" → release marked as production
# - Model metrics updated → Quality Index recalculated
# In the dashboard, you'll see:
# Release v3.2 (W&B model: my-model:v12) → QI: 87

MLflow
Import experiment metrics and model versions from MLflow. Correlate MLflow runs with Quality Index history for release-level quality tracking.
Configure MLflow tracking server URL
Point qualityindex.ai at your MLflow tracking server (self-hosted or Databricks-managed).
# In the dashboard: Settings → Connectors → Add → MLflow
# Enter your MLflow Tracking URI:
# - Self-hosted: http://mlflow.internal:5000
# - Databricks: https://<workspace>.cloud.databricks.com
# For Databricks, also provide a Personal Access Token.

Select experiments to sync
Choose which MLflow experiments should feed into your Quality Index. We import run metrics, parameters, and model version metadata.
# Under the hood:
POST http://mlflow.internal:5000/api/2.0/mlflow/runs/search
Body: {
  "experiment_ids": ["1", "2"],
  "filter_string": "metrics.accuracy > 0",
  "max_results": 100,
  "order_by": ["start_time DESC"]
}
# Imported fields: run_id, metrics.*, params.*, tags.*, artifacts

Map runs to Eval Cards and releases
MLflow run metrics map to Eval Card rubric criteria. Registered model versions link to Quality Index releases.
POST /api/v1/eval-cards
{
  "name": "MLflow Evaluation Metrics",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "mlflow.accuracy", "threshold": 0.90, "weight": 0.5 },
    { "metric": "mlflow.latency_p95", "threshold": 500, "weight": 0.3 },
    { "metric": "mlflow.cost_per_req", "threshold": 0.05, "weight": 0.2 }
  ],
  "data_binding": {
    "source": "mlflow",
    "experiment_id": "1",
    "run_filter": "latest"
  }
}

Need a connector that's not listed?
Our connector framework is plugin-based. If your tool emits OpenTelemetry traces or has a REST API with export capabilities, we can build a connector for it. Request access and tell us what you need.
Ready to connect?
Most teams go from zero to a live Quality Index in under 30 minutes.