For AI engineers, ML platform teams, and compliance officers needing visibility into LLM application performance, cost, quality, and safety in production. Covers tracing depth, evaluation capabilities, prompt management, integration breadth, and — for compliance buyers — audit logging and data residency.
Last verified April 21, 2026
Editorial independence: aicompliancevendors.com does not accept vendor payment for inclusion or ranking. Every pick below is editor-selected against the criteria stated on this page, and every factual claim is traceable to a cited public source.
Top picks: Langfuse — Teams prioritizing MIT-licensed self-hosted LLM engineering with maximum integration flexibility; Arize AI — Teams managing mixed ML and LLM portfolios needing unified observability; LangSmith — LangChain and LangGraph teams needing native framework observability. Plus 4 more vendors reviewed below. Last updated April 21, 2026; every entry cites public sources.
Evaluation framework: LLM-as-judge, heuristic, or human annotation capabilities.
Integration breadth: supports major LLM providers and agent frameworks.
Active development: features shipped in the 12 months preceding April 2026.
At least one publicly documented pricing tier.
We reviewed vendor product pages and documentation, and verified pricing against each vendor's official pricing page. WhyLabs excluded: enterprise operations discontinued after Apple acquisition (September 2025, GeekWire). Ranking favors feature completeness, pricing transparency, and production-scale readiness.
Langfuse leads open-source LLM observability with 22,000+ GitHub stars and 10B+ observations/month. MIT licensing permits free commercial self-hosting. Covers the full LLM lifecycle: tracing, prompt management, evaluation, experiments, and annotation. OpenTelemetry-native with 80+ integrations. SOC 2 Type II and ISO 27001. Cloud: Hobby free (50k observations/month), Core $29/month, Pro $199/month, Enterprise $2,499/month.
Strengths
MIT-licensed self-hosting with all features included: no license fee at any scale (infrastructure costs still apply).
OpenTelemetry-native with 80+ integrations; no framework lock-in.
SOC 2 Type II and ISO 27001.
Limitations
Self-hosting requires infrastructure management.
Cloud free tier limited to 50k observations/month.
Arize provides open-source Phoenix (Apache 2.0, free self-hosted) and AX managed SaaS (Pro $50/month, Enterprise custom). Phoenix adds drift detection and embedding analysis for classic ML and LLM teams. LlamaIndex, LangChain, DSPy, and OpenTelemetry integrations are supported. AX Enterprise adds SOC 2 Type II, HIPAA, and Data Fabric (Snowflake and BigQuery). AX Free: 25,000 spans/month, 7-day retention.
Strengths
Phoenix OSS free with ML monitoring lineage — drift detection and embedding analysis.
AX Pro transparent pricing at $50/month.
Data Fabric integration with Snowflake/BigQuery for enterprise data workflows.
Limitations
AX Free limited to 25k spans/month, 7-day retention.
LangSmith provides the deepest native integration for LangChain and LangGraph workloads, with automatic trace clustering and failure mode detection. Framework-agnostic tracing via OpenTelemetry covers non-LangChain stacks. Managed cloud, BYOC, and self-hosted deployment cover data residency. Developer free: 5,000 traces/month; Plus: $39/seat/month. Maintained by LangChain Inc.
Strengths
Deepest native LangChain and LangGraph instrumentation.
Automatic trace clustering and failure mode detection.
BYOC and self-hosted for data residency flexibility.
Limitations
Maintained by LangChain Inc., so the roadmap may prioritize LangChain and LangGraph users over other stacks.
Developer free tier limited to 5,000 traces/month.
Braintrust's Loop AI agent automatically generates evaluation datasets, refines scorers, and optimizes prompts from production data — teams report 30%+ accuracy improvements within weeks. Brainstore database delivers 80x faster trace queries. Used by Notion, Stripe, Vercel, Airtable, and Instacart. Backed by a16z and Greylock. Starter free (1GB data, 10k scores); Pro $249/month.
Strengths
Loop AI agent for automated eval dataset generation and prompt optimization.
Brainstore database for 80x faster trace queries.
Strong cross-functional collaboration across engineering and non-technical teams.
Limitations
Self-hosting requires Enterprise plan commitment.
Pro at $249/month costs more than Arize AX Pro ($50/month) and, for teams of six seats or fewer, more than LangSmith Plus ($39/seat/month).
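The price comparison depends on team size, because LangSmith Plus is billed per seat while the other tiers quoted on this page are flat monthly fees. A minimal sketch of that arithmetic, using only the list prices cited in this guide. The names monthly_costs and langsmith_break_even are illustrative, and the sketch assumes no usage overages or annual discounts, which real invoices may include.

```python
# List prices quoted in this guide, in USD per month.
# Assumption: usage-based overages and annual discounts are ignored.
FLAT_MONTHLY = {
    "Langfuse Core": 29,
    "Arize AX Pro": 50,
    "Langfuse Pro": 199,
    "Braintrust Pro": 249,
}
LANGSMITH_PLUS_PER_SEAT = 39


def monthly_costs(seats: int) -> dict[str, int]:
    """Return the monthly list price of each tier for a team of `seats` people."""
    if seats < 1:
        raise ValueError("seats must be at least 1")
    costs = dict(FLAT_MONTHLY)
    # LangSmith Plus is the only per-seat price in the table.
    costs["LangSmith Plus"] = LANGSMITH_PLUS_PER_SEAT * seats
    return costs


def langsmith_break_even(flat_price: int) -> int:
    """Smallest seat count at which LangSmith Plus costs more than `flat_price`."""
    return flat_price // LANGSMITH_PLUS_PER_SEAT + 1


if __name__ == "__main__":
    # Print the tiers for a five-person team, cheapest first.
    for name, cost in sorted(monthly_costs(5).items(), key=lambda kv: kv[1]):
        print(f"{name}: ${cost}/month")
    print(langsmith_break_even(FLAT_MONTHLY["Braintrust Pro"]))
```

Under these assumptions, LangSmith Plus overtakes Braintrust Pro at seven seats (7 x $39 = $273), so which managed plan is cheaper depends on headcount rather than on the headline price alone.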
Fiddler is an AI Control Plane for agentic applications — observability, guardrails, and governance in one enterprise platform. Fiddler Trust Models provide built-in safety, faithfulness, and PII guardrails. Fiddler emphasizes auditable governance and compliance trails. Self-serve Lite tier available; Enterprise pricing on request.
Strengths
Built-in safety, faithfulness, and PII guardrails — no separate integration required.
Auditable governance for regulated industries.
Root cause analysis with full execution context and decision lineage.
Limitations
Less open-source transparency than Langfuse or Arize Phoenix.
Patronus AI focuses on automated hallucination detection, factuality checking, and AI safety evaluation. Developer plan free; evaluator API at $10 per 1,000 calls (small) and $20 per 1,000 calls (large). Enterprise: custom. Patronus AI is stronger on evaluation and red-teaming-adjacent capabilities than on production trace observability, so it fits better as a secondary evaluation pipeline than as a primary observability platform.
Strengths
Developer plan free to start for evaluation pipeline development.
Strong automated hallucination detection and factuality evaluation.
Evaluator API integrates into existing CI/CD pipelines.
Limitations
Weaker production trace observability than Arize, Langfuse, and LangSmith.
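Because the evaluator API is priced per call, the monthly bill scales with evaluation volume. A rough sketch of that estimate, using the two rates quoted above. The function name evaluator_cost is hypothetical, and the sketch assumes strictly linear billing with no free quota, minimums, or volume discounts.

```python
# Rates quoted in this guide, in USD per 1,000 evaluator calls.
# Assumption: billing is linear, with no free quota or volume discount.
RATE_PER_1K = {"small": 10.0, "large": 20.0}


def evaluator_cost(calls: int, size: str) -> float:
    """Estimate the API cost in USD for `calls` evaluator calls of one size."""
    if calls < 0:
        raise ValueError("calls must not be negative")
    if size not in RATE_PER_1K:
        raise ValueError(f"unknown evaluator size: {size!r}")
    return calls / 1000 * RATE_PER_1K[size]


if __name__ == "__main__":
    # 250,000 small-evaluator calls in a month.
    print(evaluator_cost(250_000, "small"))
```

For example, 250,000 small-evaluator calls would come to about $2,500 under these assumptions, which is worth comparing against flat-fee plans before wiring the API into every CI run.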
Criteria-based recommendations for the most common shortlist scenarios.
For free, unlimited, self-hosted observability, Langfuse (MIT licensed) is the default. For mixed ML and LLM portfolios, Arize Phoenix OSS provides the strongest ML monitoring lineage. For LangChain-native teams, LangSmith is the tightest integration. For eval automation, Braintrust's Loop is the differentiator. For regulated enterprises, Fiddler AI and Galileo are the most appropriate options.
What we did not include
Transparency about exclusions.
WhyLabs excluded: enterprise operations discontinued following Apple acquisition (September 2025, GeekWire). Open-source langkit continues as a community project. Arthur excluded: no current public product page with documented LLM observability pricing as of April 2026.
Frequently asked
What is the difference between LLM observability and LLM evaluation?
LLM observability monitors production systems in real time: tracing requests, tracking latency, cost, error rates, and quality metrics over live traffic. LLM evaluation focuses on pre-deployment testing using datasets, metrics, and human annotation. Most platforms blend both.
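The distinction can be illustrated with a toy sketch. This is not any vendor's API: TraceLog, evaluate, and toy_model are hypothetical names, and exact-match scoring stands in for the heuristic, LLM-as-judge, or human scoring a real platform would offer.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    """One production request as an observability tool might record it."""
    prompt: str
    output: str
    latency_s: float


@dataclass
class TraceLog:
    """In-memory stand-in for a trace store."""
    traces: list[Trace] = field(default_factory=list)

    def observe(self, model: Callable[[str], str], prompt: str) -> str:
        """Observability: wrap a live call and record what happened."""
        start = time.perf_counter()
        output = model(prompt)
        elapsed = time.perf_counter() - start
        self.traces.append(Trace(prompt, output, elapsed))
        return output


def evaluate(model: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Evaluation: score the model on a fixed dataset before deployment."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    # Exact match is the simplest heuristic scorer.
    correct = sum(model(prompt).strip() == expected for prompt, expected in dataset)
    return correct / len(dataset)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    return prompt.upper()


if __name__ == "__main__":
    log = TraceLog()
    log.observe(toy_model, "hello")  # live traffic, recorded as it happens
    print(len(log.traces))
    print(evaluate(toy_model, [("a", "A"), ("b", "x")]))  # offline test set
```

The observe path has no expected answer and records whatever live traffic produces, while evaluate needs a labeled dataset up front. Platforms that blend both typically let production traces be promoted into evaluation datasets.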
Which LLM observability platform has the most generous free tier?