Integration Guides
15 connectors. One platform.
Step-by-step guides with copy-paste configs, verified instructions, and the exact API calls we make under the hood.
OpenTelemetry
OTLP endpoints on gRPC (4317) and HTTP (4318). Every LLM call becomes a trace with spans for model calls, retrieval, tool invocations, and post-processing.
Install the OpenTelemetry SDK
Add the OTel SDK and the GenAI instrumentation package to your application.
# Python
pip install opentelemetry-sdk opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-openai
# Node.js
npm install @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc \
  @traceloop/instrumentation-openai

Initialize instrumentation
Call the instrumentor before any LLM client is created. This patches OpenAI, Anthropic, and other providers to emit GenAI spans automatically.
# Python — add at application startup
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
OpenAIInstrumentor().instrument(tracer_provider=provider)

Configure the OTel Collector
If you run an OTel Collector as a sidecar or gateway, add qualityindex.ai as an OTLP exporter.
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    send_batch_size: 1000
    timeout: 5s
exporters:
  otlp/qualityindex:
    endpoint: "https://ingest.qualityindex.ai:4317"
    headers:
      Authorization: "Bearer ${QI_API_KEY}"
    compression: gzip
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/qualityindex]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/qualityindex]

Verify spans are flowing
Make a test LLM call and check the dashboard. You should see a trace within seconds.
curl -s https://api.qualityindex.ai/api/v1/connectors/otel/health \
-H "Authorization: Bearer ${QI_API_KEY}"
# Response
{
  "type": "otel",
  "status": "healthy",
  "last_span_received": "2026-03-07T14:32:00Z",
  "spans_ingested_24h": 1
}

Using a managed OTel provider?
If you use Datadog, New Relic, or Grafana as your OTel backend, add qualityindex.ai as a secondary exporter. Traces flow to both destinations.
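As a sketch of what that secondary export could look like in an OTel Collector config (the `datadog` exporter stanza here is illustrative; substitute the exporter for your own backend):

```yaml
# Sketch: fan out one trace pipeline to two backends
exporters:
  datadog:                     # illustrative primary backend
    api:
      key: ${DD_API_KEY}
  otlp/qualityindex:           # secondary exporter
    endpoint: "https://ingest.qualityindex.ai:4317"
    headers:
      Authorization: "Bearer ${QI_API_KEY}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, otlp/qualityindex]
```

Every span in the pipeline is delivered to each exporter listed, so neither backend sees a reduced sample.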
LangSmith
Sync your curated datasets and evaluation results. Correlate semantic quality scores with production telemetry for a complete picture.
Generate a LangSmith API key
In LangSmith, go to Settings → API Keys and create a Service Key (recommended) or Personal Access Token.
# The key looks like:
# ls-svc-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
export LANGSMITH_API_KEY="ls-svc-xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Connect in the dashboard
Navigate to Settings → Connectors → Add Connector → LangSmith. Paste your API key.
# Under the hood, we validate by calling:
GET https://api.smith.langchain.com/api/v1/workspaces/current
Headers: x-api-key: ls-svc-xxxx...

Select datasets to sync
Choose which LangSmith datasets to sync. Each maps to Eval Cards. Enable 'data minimization mode' to import only scores, not raw content.
# We list your datasets:
GET https://api.smith.langchain.com/api/v1/datasets
Headers: x-api-key: ls-svc-xxxx...
# Synced fields: id, name, description, example_count, schema

Map datasets to Eval Cards
For each synced dataset, create or link an Eval Card defining which metrics and thresholds to track.
POST /api/v1/eval-cards
{
  "name": "Product Search Relevance",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "relevance-score", "threshold": 0.8, "weight": 0.6 },
    { "metric": "faithfulness", "threshold": 0.9, "weight": 0.4 }
  ],
  "data_binding": {
    "source": "langsmith",
    "dataset_id": "ds_abc123",
    "sample_size": 200
  },
  "gate": false
}

Import evaluation results
When you run evaluations in LangSmith, we automatically sync the results into the Task Quality pillar.
# We import results via:
POST https://api.smith.langchain.com/api/v1/runs/query
Body: {
  "session_id": "experiment-session-id",
  "is_root": true,
  "select": ["id", "outputs", "feedback_stats"]
}
# Feedback scores fetched via:
GET https://api.smith.langchain.com/api/v1/feedback
Params: run_id=<run_id>

Data minimization mode
Import only scores, metadata, and schema — not raw prompt/response content. Useful for HIPAA and SOC 2 compliance.
GitHub
Link every deployment to a Quality Index snapshot. Get automated remediation PRs when regressions are detected. Always human-in-the-loop.
Install the GitHub App
Visit the installation page. Choose "Only select repositories" and pick the repos to monitor.
# Permissions requested:
# - Contents (Read): Read code to analyze prompt files and configs
# - Pull requests (Read & Write): Create remediation PRs
# - Metadata (Read): Repository metadata
# - Webhooks: Receive push and PR events
#
# No access to: Actions, Issues, Discussions, Pages, Secrets

Configure release tracking
Map each repository to a service name. The platform watches for push events on your main branch.
# Webhook events we listen to:
# - push (main/master): triggers rescore
# - pull_request (opened, synchronize): eval preview
# - installation_repositories (added, removed): updates repo list
# Each push generates a release snapshot:
{
  "tag": "v2.4.1",
  "commit_sha": "a1b2c3d",
  "quality_index": 87,
  "deployed_at": "2026-03-07T08:00:00Z"
}

Enable automated remediation
When the Quality Index drops, the platform proposes a fix as a PR with full evidence.
# Remediation PR includes:
# - Which Eval Card detected the regression
# - Before/after scores with timestamps
# - Root cause analysis
# - The proposed fix
# - Link to the qualityindex.ai dashboard
#
# You review and merge — always human-in-the-loop.

PR quality gates (optional)
Block PRs that drop Quality Index below a threshold. Posted as a check status.
# PR check status:
# ✅ qualityindex.ai — Quality Index: 89 (+2 from main)
# ❌ qualityindex.ai — Quality Index: 74 (-8 from main)
# Reason: Faithfulness dropped below 0.80
# Configure in dashboard:
# Settings → Repositories → [repo] → Quality Gate
# Minimum Quality Index: 80
# Block on: Safety violations, Task Quality regression > 5 pts

Security model
GitHub App tokens are scoped per-installation, auto-expire after 1 hour, and can be revoked instantly. Webhook signatures verified via HMAC-SHA256.
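For illustration, HMAC-SHA256 webhook verification on the receiving side works like this (a minimal sketch with placeholder secret and payload; GitHub sends the signature in the X-Hub-Signature-256 header):

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a webhook payload against its X-Hub-Signature-256 header."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature bytes via timing
    return hmac.compare_digest(expected, signature_header)

# Example with placeholder values
secret = b"webhook-secret"
body = b'{"action": "opened"}'
header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
assert verify_github_signature(secret, body, header)
assert not verify_github_signature(secret, b'{"tampered": true}', header)
```

Any payload that fails this check should be rejected before processing, since it was not signed with the shared webhook secret.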
Langfuse
Dual-export your traces to Langfuse and qualityindex.ai simultaneously. Sync Langfuse scores, sessions, and cost data for unified quality tracking.
Configure dual OTel export
Langfuse SDK v3 is OTel-native. Configure your OTel Collector or SDK to export traces to both Langfuse and qualityindex.ai simultaneously.
# Option A: OTel Collector with dual exporters
exporters:
  otlp/langfuse:
    endpoint: "https://cloud.langfuse.com/api/public/otel"
    headers:
      Authorization: "Basic ${LANGFUSE_BASE64_KEYS}"
  otlp/qualityindex:
    endpoint: "https://ingest.qualityindex.ai:4317"
    headers:
      Authorization: "Bearer ${QI_API_KEY}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/langfuse, otlp/qualityindex]

Connect Langfuse REST API
In the qualityindex.ai dashboard, add your Langfuse API keys to sync scores, sessions, and cost data that Langfuse computes.
# We sync the following from Langfuse:
# - Session scores (user feedback, auto-evals)
# - Trace cost calculations
# - Generation metadata (model, tokens, latency)
# - Score definitions and configs
# Auth: Langfuse Public Key + Secret Key
# Endpoint: https://cloud.langfuse.com/api/public/

Map Langfuse scores to Eval Cards
Langfuse scores (from human feedback or auto-eval functions) map directly to Eval Card rubric metrics.
POST /api/v1/eval-cards
{
  "name": "User Satisfaction Score",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "langfuse.user_score", "threshold": 0.7, "weight": 1.0 }
  ],
  "data_binding": {
    "source": "langfuse",
    "score_name": "user_satisfaction",
    "min_traces": 100
  },
  "gate": false
}

Promptfoo
Run local or CI eval suites with Promptfoo and push results to qualityindex.ai. Assertion pass rates map directly to Task Quality Eval Cards.
Install Promptfoo
Promptfoo is an open-source CLI for LLM evaluation and red-teaming.
# Install globally
npm install -g promptfoo
# Or as a dev dependency
npm install --save-dev promptfoo

Run eval and export JSON results
Run your eval suite and output structured JSON results that qualityindex.ai can parse.
# Run eval with JSON output
promptfoo eval --output results.json
# results.json contains:
# - Each test case with pass/fail per assertion
# - Scores, latency, token counts per provider
# - Aggregate pass rates per metric

Upload results via the API
Push Promptfoo results to qualityindex.ai. Each assertion type maps to an Eval Card rubric metric.
# Upload results
curl -X POST https://api.qualityindex.ai/api/v1/connectors/promptfoo/import \
-H "Authorization: Bearer ${QI_API_KEY}" \
-H "Content-Type: application/json" \
-d @results.json
# Response
{
  "imported": 248,
  "eval_cards_updated": ["ec_relevance", "ec_safety"],
  "quality_index_delta": "+3"
}

Automate in CI/CD
Add to your CI pipeline to run evals on every push and automatically update the Quality Index.
# .github/workflows/eval.yml
name: LLM Eval
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g promptfoo
      - run: promptfoo eval --output results.json
      - run: |
          curl -X POST https://api.qualityindex.ai/api/v1/connectors/promptfoo/import \
            -H "Authorization: Bearer ${{ secrets.QI_API_KEY }}" \
            -H "Content-Type: application/json" \
            -d @results.json

Ragas
Import RAG-specific evaluation metrics — faithfulness, context precision, answer relevancy, context recall — directly into your Quality Index.
Install Ragas
Ragas is a widely used open-source framework for evaluating RAG pipelines.
pip install ragas

Run evaluation and capture results
Run ragas.evaluate() on your dataset and capture the output.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
results = evaluate(
    dataset=your_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
# Export as dict
scores = results.to_pandas().to_dict(orient="records")

Push results to qualityindex.ai
Each Ragas metric maps to a Task Quality rubric criterion. Upload the scores via our Python SDK or REST API.
import requests
response = requests.post(
    "https://api.qualityindex.ai/api/v1/connectors/ragas/import",
    headers={"Authorization": f"Bearer {QI_API_KEY}"},
    json={
        "metrics": {
            "faithfulness": results["faithfulness"],
            "answer_relevancy": results["answer_relevancy"],
            "context_precision": results["context_precision"],
            "context_recall": results["context_recall"],
        },
        "dataset_size": len(your_dataset),
        "eval_card_id": "ec_rag_quality"
    },
)

Create Eval Card for RAG metrics
Define thresholds for each Ragas metric. The Quality Index updates whenever new results are imported.
POST /api/v1/eval-cards
{
  "name": "RAG Pipeline Quality",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "faithfulness", "threshold": 0.85, "weight": 0.35 },
    { "metric": "answer_relevancy", "threshold": 0.80, "weight": 0.30 },
    { "metric": "context_precision", "threshold": 0.75, "weight": 0.20 },
    { "metric": "context_recall", "threshold": 0.70, "weight": 0.15 }
  ],
  "data_binding": { "source": "ragas" },
  "gate": false
}

Braintrust
Sync experiment results and scores from Braintrust projects. Map Braintrust evaluations to Eval Cards with automatic threshold detection.
Generate a Braintrust API key
In Braintrust, go to Settings → API Keys and create a key scoped to your project.
# Set as environment variable
export BRAINTRUST_API_KEY="bt_xxxxxxxxxxxxxxxxxxxx"

Connect in the dashboard
Navigate to Settings → Connectors → Add Connector → Braintrust. Paste your API key.
# We validate by fetching your project list:
GET https://api.braintrust.dev/v1/project
Headers: Authorization: Bearer bt_xxxx...
# Then list experiments in each project:
GET https://api.braintrust.dev/v1/experiment?project_id=<id>

Select experiments to sync
Choose which Braintrust experiments should feed into your Quality Index. Scores sync automatically after each experiment run.
# We import experiment scores:
GET https://api.braintrust.dev/v1/experiment/<id>/results
# Mapped fields:
# - scores → Eval Card rubric metrics
# - metadata.model → resource linkage
# - duration → Efficiency pillar
# - input/output tokens → cost calculation

Map to Eval Cards
Each Braintrust scorer maps to an Eval Card metric. Thresholds are auto-detected from your historical baseline.
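The auto-detection idea can be pictured as deriving a floor from historical scores. The rule below (mean minus one standard deviation, clamped to [0, 1]) is a hypothetical sketch for intuition; the platform's actual detection rule is not specified here:

```python
from statistics import mean, stdev

def auto_threshold(historical_scores: list[float]) -> float:
    """Derive a metric threshold from a historical baseline.

    Hypothetical heuristic: mean minus one standard deviation,
    clamped to [0, 1]. The platform's real rule may differ.
    """
    if len(historical_scores) < 2:
        raise ValueError("need at least two historical scores")
    floor = mean(historical_scores) - stdev(historical_scores)
    return round(min(1.0, max(0.0, floor)), 2)

# A stable accuracy history around 0.90 yields a threshold just below it
print(auto_threshold([0.91, 0.89, 0.92, 0.88, 0.90]))  # → 0.88
```

The point of a baseline-derived threshold is that normal run-to-run noise stays above the floor, while a genuine regression drops below it.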
POST /api/v1/eval-cards
{
  "name": "Braintrust Accuracy",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "braintrust.accuracy", "threshold": 0.85, "weight": 0.6 },
    { "metric": "braintrust.relevance", "threshold": 0.80, "weight": 0.4 }
  ],
  "data_binding": {
    "source": "braintrust",
    "project_id": "proj_abc123",
    "experiment_filter": "latest"
  },
  "gate": false
}

DeepEval
Unit-test style LLM evaluation. Parse DeepEval test results and map metric scores to Eval Card thresholds. CI/CD integration triggers on test completion.
Install DeepEval
DeepEval provides pytest-style assertions for LLM outputs.
pip install deepeval

Write and run tests
Create test cases with metrics like hallucination, answer relevancy, faithfulness, and toxicity.
# test_llm.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
def test_no_hallucination():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        context=["Paris is the capital of France."]
    )
    metric = HallucinationMetric(threshold=0.5)
    assert_test(test_case, [metric])

# Run with JSON output
# deepeval test run test_llm.py --output-file results.json

Upload results to qualityindex.ai
Push DeepEval test results. Each metric type maps to an Eval Card criterion.
curl -X POST https://api.qualityindex.ai/api/v1/connectors/deepeval/import \
-H "Authorization: Bearer ${QI_API_KEY}" \
-H "Content-Type: application/json" \
-d @results.json
# Metric mapping:
# HallucinationMetric → Task Quality (faithfulness)
# AnswerRelevancyMetric → Task Quality (relevance)
# ToxicityMetric → Safety (content safety)
# BiasMetric → Safety (bias detection)

Add to CI pipeline
Run DeepEval on every PR and gate merges on quality thresholds.
# .github/workflows/deepeval.yml
name: DeepEval Quality Gate
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
      - run: deepeval test run tests/ --output-file results.json
      - run: |
          curl -X POST https://api.qualityindex.ai/api/v1/connectors/deepeval/import \
            -H "Authorization: Bearer ${{ secrets.QI_API_KEY }}" \
            -d @results.json

Arize Phoenix
Traces flow natively via OTLP — no separate connector needed. Import Phoenix-specific eval results via REST API for additional quality signals.
Traces already flow via OTel
If you use the OpenTelemetry connector, Phoenix traces already reach qualityindex.ai. Phoenix is built on OTel — no extra configuration needed for trace data.
# Phoenix uses standard OTel OTLP export.
# If your OTel Collector is configured for qualityindex.ai,
# Phoenix traces are already flowing.
# Verify:
curl -s https://api.qualityindex.ai/api/v1/connectors/otel/health \
-H "Authorization: Bearer ${QI_API_KEY}"

Connect Phoenix evaluations API
For Phoenix-specific eval results (LLM-as-judge, retrieval evals), connect the Phoenix API to import scores.
# In the dashboard: Settings → Connectors → Add → Arize Phoenix
# Enter your Phoenix server URL and API key
# We fetch eval results:
GET https://your-phoenix-instance/api/v1/evaluations
Headers: Authorization: Bearer <phoenix_api_key>
# Imported: evaluation names, scores, labels, trace links

Map Phoenix evals to Eval Cards
Phoenix evaluation annotations map to Eval Card metrics. Retrieval evals feed Task Quality, latency feeds Efficiency.
POST /api/v1/eval-cards
{
  "name": "Phoenix Retrieval Quality",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "phoenix.relevance", "threshold": 0.80, "weight": 0.5 },
    { "metric": "phoenix.qa_correctness", "threshold": 0.85, "weight": 0.5 }
  ],
  "data_binding": {
    "source": "arize_phoenix",
    "evaluation_name": "retrieval_quality"
  },
  "gate": false
}

LiteLLM
Unified gateway to 100+ LLMs. Enable the built-in OTel callback and all model calls are captured with unified token, cost, and latency attributes.
Enable the OTel callback in LiteLLM
LiteLLM has a built-in OpenTelemetry callback. Enable it to emit spans for every model call across all providers.
# Python — add before any LiteLLM calls
import os
import litellm

litellm.callbacks = ["otel"]
# Set the OTLP endpoint. Note: Python does not expand ${...} inside string
# literals, so read the API key from the environment explicitly.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://ingest.qualityindex.ai:4317"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Bearer {os.environ['QI_API_KEY']}"

For LiteLLM Proxy: configure in config.yaml
If you run the LiteLLM Proxy Server, add OTel export to the proxy config.
# litellm_config.yaml
general_settings:
  otel: true
environment_variables:
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://ingest.qualityindex.ai:4317"
  OTEL_EXPORTER_OTLP_HEADERS: "Authorization=Bearer YOUR_QI_API_KEY"
  OTEL_SERVICE_NAME: "litellm-proxy"

Verify multi-provider traces
LiteLLM normalizes attributes across providers. Check that traces appear with correct model, token, and cost data.
# Each LiteLLM call produces a span with:
# gen_ai.provider.name: "openai" / "anthropic" / "bedrock" / etc.
# gen_ai.request.model: "gpt-4o" / "claude-3.5-sonnet" / etc.
# gen_ai.usage.input_tokens: 150
# gen_ai.usage.output_tokens: 83
# litellm.cost: 0.0024 (calculated from provider pricing)
# The Efficiency pillar auto-computes:
# - Cost per request across all providers
# - Latency p50/p95 per model
# - Token usage trends

Guardrails AI
Capture per-validator pass/fail from Guard calls. Each validator result maps to Safety pillar guardrail pass rates.
Install OTel instrumentation for Guardrails AI
Wrap your Guard calls with OTel spans to capture per-validator pass/fail results.
pip install guardrails-ai opentelemetry-sdk opentelemetry-exporter-otlp
# Initialize OTel (if not already done via the OTel connector)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://ingest.qualityindex.ai:4317"))
)

Instrument Guard calls
Add tracing to your Guard validation pipeline. Each validator produces a child span with pass/fail status.
from opentelemetry import trace
from guardrails import Guard
tracer = trace.get_tracer("guardrails")
guard = Guard.from_pydantic(OutputModel)
with tracer.start_as_current_span("guardrails.validate") as span:
    result = guard(
        llm_api=openai.chat.completions.create,
        model="gpt-4o",
        messages=[{"role": "user", "content": user_input}]
    )
    span.set_attribute("guardrails.passed", result.validation_passed)
    span.set_attribute("guardrails.validators_run", len(result.validated_output or []))
    for i, log in enumerate(result.validation_logs):
        span.set_attribute(f"guardrails.validator.{i}.name", log.validator_name)
        span.set_attribute(f"guardrails.validator.{i}.passed", log.validation_passed)

Create Safety Eval Cards
Map validator results to Safety pillar metrics. Critical validators can gate your Quality Index.
POST /api/v1/eval-cards
{
  "name": "Output Validation Safety",
  "pillar": "safety",
  "rubric": [
    { "metric": "guardrails.pii_detection", "threshold": 0.99, "weight": 0.4 },
    { "metric": "guardrails.hallucination", "threshold": 0.95, "weight": 0.3 },
    { "metric": "guardrails.format_validation", "threshold": 0.98, "weight": 0.3 }
  ],
  "data_binding": { "source": "otel", "span_filter": "guardrails.validate" },
  "gate": true
}

Lakera Guard
Import prompt injection scan results. Map threat detections to Safety pillar metrics: injection rate, detection confidence, policy compliance.
Get your Lakera API key
Sign up at Lakera and generate an API key from your dashboard.
# Lakera API key format
export LAKERA_API_KEY="lk_xxxxxxxxxxxxxxxxxxxx"

Connect in the dashboard
Add Lakera Guard as a connector. We poll your scan history and import threat detection results.
# Under the hood, we call Lakera's API:
POST https://api.lakera.ai/v2/guard
Headers: Authorization: Bearer lk_xxxx...
Body: { "input": "<prompt text>" }
# Response includes:
# - flagged: true/false
# - categories: { "prompt_injection": 0.92, "jailbreak": 0.15 }
# - threshold: the configured policy threshold

Map to Safety pillar
Lakera detections map to Safety pillar metrics. High-confidence threat detections can gate your Quality Index.
POST /api/v1/eval-cards
{
  "name": "Prompt Security",
  "pillar": "safety",
  "rubric": [
    { "metric": "lakera.prompt_injection_rate", "threshold": 0.01, "weight": 0.5 },
    { "metric": "lakera.jailbreak_rate", "threshold": 0.005, "weight": 0.3 },
    { "metric": "lakera.policy_compliance", "threshold": 0.99, "weight": 0.2 }
  ],
  "data_binding": { "source": "lakera", "scan_window": "24h" },
  "gate": true
}

Real-time screening
Lakera Guard runs in <50ms per request. Use it as middleware before your LLM call, and we'll track the block rate and threat categories in your Safety pillar.
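A middleware sketch of that pattern follows. The endpoint and response shape mirror the example above; the blocking rule (`flagged` or any category score at or above a threshold) is an assumption to tune against your own policy:

```python
import os

LAKERA_URL = "https://api.lakera.ai/v2/guard"

def is_blocked(guard_response: dict, threshold: float = 0.9) -> bool:
    """Decide whether to block a prompt, given a Lakera Guard response.

    Assumed rule: block when the response is flagged, or when any
    category score meets the threshold.
    """
    if guard_response.get("flagged"):
        return True
    categories = guard_response.get("categories", {})
    return any(score >= threshold for score in categories.values())

def screen_prompt(prompt: str) -> bool:
    """Screen a prompt before the LLM call; True means safe to proceed."""
    import requests  # imported here so the decision helper stays dependency-free

    resp = requests.post(
        LAKERA_URL,
        headers={"Authorization": f"Bearer {os.environ['LAKERA_API_KEY']}"},
        json={"input": prompt},
        timeout=2,
    )
    resp.raise_for_status()
    return not is_blocked(resp.json())
```

Calling `screen_prompt(user_input)` before the model call gives you a block/allow decision per request; the resulting block rate is what feeds the Safety pillar.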
NeMo Guardrails
NVIDIA's programmable guardrails for content safety, topic control, and jailbreak prevention. Track rail activations via OTel spans.
Add OTel instrumentation to NeMo config
NeMo Guardrails integrates with LangChain. If you have LangChain OTel instrumentation, rail activations produce child spans automatically.
# Install NeMo Guardrails with OTel support
pip install nemoguardrails opentelemetry-instrumentation-langchain
# If using LangChain, instrumenting LangChain also captures NeMo rails:
from opentelemetry.instrumentation.langchain import LangchainInstrumentor
LangchainInstrumentor().instrument()

Configure rail activation tracking
Each rail activation (topic blocked, jailbreak detected, PII redacted) produces a span event with the rail name, action taken, and user message context.
# NeMo Guardrails config.yml
models:
  - type: main
    engine: openai
    model: gpt-4o
rails:
  input:
    flows:
      - self check input    # Jailbreak detection
      - check pii           # PII redaction
  output:
    flows:
      - self check output   # Hallucination prevention
      - check topic         # Off-topic redirect

# Each rail activation produces OTel span attributes:
# nemo.rail.name: "self check input"
# nemo.rail.action: "blocked" / "allowed" / "redirected"
# nemo.rail.confidence: 0.94

Map rail activations to Safety pillar
Each rail type maps to a Safety Eval Card metric. Block rates and confidence scores flow into the Quality Index.
POST /api/v1/eval-cards
{
  "name": "NeMo Rail Safety",
  "pillar": "safety",
  "rubric": [
    { "metric": "nemo.jailbreak_block_rate", "threshold": 0.98, "weight": 0.4 },
    { "metric": "nemo.pii_redaction_rate", "threshold": 0.99, "weight": 0.3 },
    { "metric": "nemo.topic_compliance_rate", "threshold": 0.95, "weight": 0.3 }
  ],
  "data_binding": { "source": "otel", "span_filter": "nemo.rail.*" },
  "gate": true
}

Weights & Biases
Sync W&B run metrics, model versions, and artifacts. Correlate experiment results with Quality Index releases.
Generate a W&B API key
In Weights & Biases, go to Settings → Danger Zone → API Keys.
export WANDB_API_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Connect in the dashboard
Add W&B as a connector. We sync run metrics, model versions, and artifact metadata from your selected projects.
# We validate and fetch projects:
GET https://api.wandb.ai/graphql
Headers: Authorization: Bearer <wandb_api_key>
# Query: list projects, runs, and metrics
# We import: run.summary metrics, run.config, model artifacts

Map experiments to Eval Cards
W&B run metrics map to Eval Card rubric criteria. Model registry versions link to Quality Index releases.
POST /api/v1/eval-cards
{
  "name": "Model Accuracy (W&B)",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "wandb.accuracy", "threshold": 0.90, "weight": 0.6 },
    { "metric": "wandb.f1_score", "threshold": 0.85, "weight": 0.4 }
  ],
  "data_binding": {
    "source": "wandb",
    "project": "my-project",
    "run_filter": "latest"
  },
  "gate": false
}

Link model versions to releases
When you promote a model version in W&B, we automatically create a release snapshot in qualityindex.ai.
# W&B Model Registry events trigger release snapshots:
# - Model linked to registry → new release in qualityindex.ai
# - Model promoted to "production" → release marked as production
# - Model metrics updated → Quality Index recalculated
# In the dashboard, you'll see:
# Release v3.2 (W&B model: my-model:v12) → QI: 87

MLflow
Import experiment metrics and model versions from MLflow. Correlate MLflow runs with Quality Index history for release-level quality tracking.
Configure MLflow tracking server URL
Point qualityindex.ai at your MLflow tracking server (self-hosted or Databricks-managed).
# In the dashboard: Settings → Connectors → Add → MLflow
# Enter your MLflow Tracking URI:
# - Self-hosted: http://mlflow.internal:5000
# - Databricks: https://<workspace>.cloud.databricks.com
# For Databricks, also provide a Personal Access Token.

Select experiments to sync
Choose which MLflow experiments should feed into your Quality Index. We import run metrics, parameters, and model version metadata.
# Under the hood:
POST http://mlflow.internal:5000/api/2.0/mlflow/runs/search
Body: {
  "experiment_ids": ["1", "2"],
  "filter_string": "metrics.accuracy > 0",
  "max_results": 100,
  "order_by": ["start_time DESC"]
}
# Imported fields: run_id, metrics.*, params.*, tags.*, artifacts

Map runs to Eval Cards and releases
MLflow run metrics map to Eval Card rubric criteria. Registered model versions link to Quality Index releases.
POST /api/v1/eval-cards
{
  "name": "MLflow Evaluation Metrics",
  "pillar": "task_quality",
  "rubric": [
    { "metric": "mlflow.accuracy", "threshold": 0.90, "weight": 0.5 },
    { "metric": "mlflow.latency_p95", "threshold": 500, "weight": 0.3 },
    { "metric": "mlflow.cost_per_req", "threshold": 0.05, "weight": 0.2 }
  ],
  "data_binding": {
    "source": "mlflow",
    "experiment_id": "1",
    "run_filter": "latest"
  }
}

Need a connector that's not listed?
Our connector framework is plugin-based. If your tool emits OpenTelemetry traces or has a REST API with export capabilities, we can build a connector for it. Request access and tell us what you need.
Ready to connect?
Most teams go from zero to a live Quality Index in under 30 minutes.