LLMOps
Improve quality, cut latency and cost.
Here are 10 numeric, per-run/per-route variables that are high-value in LLMOps platforms (e.g., LangSmith, Arize/Phoenix, Weights & Biases, Azure OpenAI Monitoring, AWS Bedrock Model Evaluations) and well suited to visualization with Immersion Analytics:
| # | Variable | What it is (numeric) | Why it matters | Good IA mapping (suggestion) |
|---|---|---|---|---|
| 1 | Task Success Rate (%) | Pass@1 / goal-completion on eval sets | Primary quality signal | Y-axis (higher → up) |
| 2 | Time to First Token (ms) | Latency to first streamed token | Perceived speed | X-axis (left = faster) |
| 3 | p95 End-to-End Latency (ms) | Slow-tail response time | SLO reliability | Z-depth (closer = lower) |
| 4 | Cost per 1K Tokens ($) | Effective $/1K input+output tokens | Unit economics | Color (cooler = cheaper) |
| 5 | Hallucination Rate (%) | % outputs judged unfaithful | Trustworthiness | Transparency (more hollow = worse) |
| 6 | Grounding Hit Rate (%) | RAG evidence coverage/recall | Factual support | Glow (brighter = higher) |
| 7 | Cache Hit Rate (%) | Prompt/embedding cache hits | Throughput + cost relief | Satellites (more/larger satellites = higher) |
| 8 | Error / Rate-Limit Rate (%) | 4xx/5xx + throttles per 100 calls | Stability/SRE health | Pulsation (faster = higher rate) |
| 9 | Safety Violation Rate (%) | Toxicity/policy breaches | Risk & compliance | Shimmer (stronger = riskier) |
| 10 | Throughput (req/min) | Successful requests per minute | Capacity under load | Size (bigger = higher) |
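A minimal sketch of how four of the columns above (Task Success Rate, Time to First Token, p95 Latency, Cost per 1K Tokens) might be rolled up from per-run logs. Field names such as `ttft_ms` and `cost_usd` are illustrative assumptions, not any specific platform's schema:

```python
import math

# Hypothetical per-run records; field names are illustrative,
# not any specific platform's schema.
runs = [
    {"route": "rag-v2", "ttft_ms": 180, "e2e_ms": 950,
     "in_tok": 800, "out_tok": 200, "cost_usd": 0.0031, "passed": True},
    {"route": "rag-v2", "ttft_ms": 220, "e2e_ms": 1400,
     "in_tok": 900, "out_tok": 250, "cost_usd": 0.0036, "passed": True},
    {"route": "rag-v2", "ttft_ms": 500, "e2e_ms": 3100,
     "in_tok": 850, "out_tok": 220, "cost_usd": 0.0034, "passed": False},
]

def route_summary(runs):
    """Roll per-run logs up into four of the table's per-route numbers."""
    n = len(runs)
    latencies = sorted(r["e2e_ms"] for r in runs)
    total_tokens = sum(r["in_tok"] + r["out_tok"] for r in runs)
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "task_success_rate_pct": 100.0 * sum(r["passed"] for r in runs) / n,
        "ttft_ms_mean": sum(r["ttft_ms"] for r in runs) / n,
        # nearest-rank p95 over sorted end-to-end latencies
        "p95_e2e_ms": latencies[math.ceil(0.95 * n) - 1],
        "cost_per_1k_tokens_usd": 1000.0 * total_cost / total_tokens,
    }

print(route_summary(runs))
```

Nearest-rank p95 is used here for simplicity; production monitoring typically relies on streaming quantile sketches over much larger windows.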
What quality gains and cost savings could you unlock by seeing all ten—simultaneously—across your prompts, routes, and models?
MLOps
Reduce drift and improve reliability.
Here are 10 numeric, per-model/per-deployment variables that are high-value in MLOps platforms (e.g., AWS SageMaker, Google Vertex AI, Azure ML, Databricks/MLflow, Weights & Biases, Arize, WhyLabs) and well suited to visualization with Immersion Analytics:
| # | Variable | What it is (numeric) | Why it matters | Good IA mapping (suggestion) |
|---|---|---|---|---|
| 1 | Model Quality (AUC/F1/PR-AUC) | Current evaluation metric (0–1) | Ensures the model is delivering value | Y-axis (higher → up) |
| 2 | p95 Latency (ms) | 95th percentile inference time | Protects UX/SLOs and tail performance | X-axis (right = slower) |
| 3 | Error Rate (%) | Failures/timeouts per request | Stability and reliability signal | Transparency (higher = more hollow) |
| 4 | Throughput (req/s) | Requests served per second | Capacity planning & scaling | Size (bigger = higher) |
| 5 | Drift Score (PSI/KL, 0–1) | Shift from training to serving | Early warning for silent failure | Glow (brighter = more drift) |
| 6 | Calibration Error (ECE, %) | Gap between predicted probs and reality | Trustworthy decisions & thresholds | Shimmer (stronger = worse calibration) |
| 7 | Cost per 1K Predictions ($) | Infra + model fees per 1K calls | Controls unit economics | Z-depth (closer = cheaper) |
| 8 | Data Freshness Lag (min) | Age of features at inference | Stale data degrades outcomes | Pulsation (faster = staler/urgent) |
| 9 | Feature Null Rate (%) | Missing/invalid feature values | Data quality at the point of use | Satellites (more satellites = more nulls) |
| 10 | Guardrail Violations (/1k) | Toxicity/PII/hallucinations or policy breaches | Safety/compliance risk | Color (hotter = more violations) |
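Two of the less familiar columns above, Drift Score (PSI) and Calibration Error (ECE), can be sketched as follows. The bucket fractions, bin count, and sample data are illustrative assumptions:

```python
import math

def psi(expected, actual):
    """Population Stability Index across matched histogram buckets.

    expected/actual: per-bucket fractions (each summing to 1) of the
    training vs. serving distributions of one feature.
    """
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error: the bin-weighted gap between mean
    predicted probability and observed accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(probs)
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        total += len(b) / n * abs(avg_p - acc)
    return total

# Uniform training distribution vs. a skewed serving distribution
# over the same 4 feature buckets (illustrative numbers).
print(psi([0.25, 0.25, 0.25, 0.25], [0.10, 0.20, 0.30, 0.40]))
# Overconfident classifier: high predicted probs, mediocre accuracy.
print(ece([0.9, 0.9, 0.1, 0.1], [1, 0, 0, 0]))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, which maps naturally onto the glow intensity suggested in the table.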
What uptime, quality, and cost savings could you unlock by seeing all ten—simultaneously—across every model, endpoint, and environment?