AI Evaluator

v1.0.0 · Tool

AI-as-Judge evaluation engine for scoring agent outputs on Quality, Accuracy, and Efficiency with batch processing and output validation

evaluation, scoring, quality, validation, batch
Capabilities

Executes the defined tool functionality

Content

AI Evaluator Tool

Overview

AI-as-Judge evaluation engine that scores agent outputs on three dimensions (1-5 each): Quality, Accuracy, and Efficiency. Supports single evaluation, batch processing, output validation, human feedback, and aggregate reporting.

Available Operations

Single Evaluation

  • Score - Evaluate a single agent output against the judge prompt
  • Returns quality, accuracy, efficiency scores with reasoning
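The exact response fields are not documented here. As a minimal sketch, assuming a hypothetical `EvaluationScore` record that carries the three 1-5 scores plus the judge's reasoning, a result could be modeled and range-checked like this:

```python
from dataclasses import dataclass


# Hypothetical result shape; the tool's real response fields may differ.
@dataclass(frozen=True)
class EvaluationScore:
    quality: int
    accuracy: int
    efficiency: int
    reasoning: str

    def __post_init__(self) -> None:
        # Each dimension uses the 1-5 scale described in the overview.
        for name in ("quality", "accuracy", "efficiency"):
            value = getattr(self, name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")

    def mean(self) -> float:
        """Unweighted average of the three dimensions (an illustrative choice)."""
        return (self.quality + self.accuracy + self.efficiency) / 3
```

Rejecting out-of-range values at construction time keeps malformed judge responses from reaching later aggregation steps.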

Batch Evaluation

  • Batch - Score all unscored traces within a time window
  • Lookback window configurable via hours (default: 24)
  • Skips already-scored traces automatically
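The selection rule above (inside the lookback window, not yet scored) can be sketched as a simple filter. The trace layout is an assumption: each trace is taken to be a dict with a timezone-aware `timestamp` and an optional `scores` entry, and `select_unscored` is a hypothetical helper, not part of the tool.

```python
from datetime import datetime, timedelta, timezone


def select_unscored(traces, hours=24, now=None):
    """Return traces inside the lookback window that have no scores yet.

    Field names `timestamp` and `scores` are assumptions for illustration.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [
        trace
        for trace in traces
        # Keep only recent traces whose `scores` entry is missing or empty.
        if trace["timestamp"] >= cutoff and not trace.get("scores")
    ]
```

Passing `now` explicitly makes the window deterministic, which is convenient when testing the filter.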

Output Validation

  • Validate - Check output against quality criteria (format, content, safety, completeness)
  • Uses the gene-output-validator prompt
  • Returns pass/fail with specific issues and passed checks
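The checks themselves are performed by the validator prompt, so the following only illustrates how per-category outcomes might roll up into a pass/fail result with issues and passed checks. `ValidationResult` and `summarize` are hypothetical names, and the input mapping is an assumed intermediate form.

```python
from dataclasses import dataclass, field

# The four criteria named above.
CATEGORIES = ("format", "content", "safety", "completeness")


@dataclass
class ValidationResult:
    passed_checks: list = field(default_factory=list)
    issues: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # The output passes only when no category reported an issue.
        return not self.issues


def summarize(outcomes: dict) -> ValidationResult:
    """Roll up {category: issue text or None} into a single result.

    A category that is absent from `outcomes` is treated as passed,
    which is an assumption of this sketch.
    """
    result = ValidationResult()
    for category in CATEGORIES:
        issue = outcomes.get(category)
        if issue is None:
            result.passed_checks.append(category)
        else:
            result.issues.append(f"{category}: {issue}")
    return result
```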

Human Feedback

  • Feedback - Submit human feedback score (1-5) for a specific trace
  • Persists to Langfuse trace metadata

Reporting

  • Report - Aggregate scoring report by template, agent type, and cost tier
  • Reporting period configurable via hours (default: 7 days, i.e. 168 hours)
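Aggregation by a grouping key can be sketched as below. This is an illustration under assumptions: scored records are taken to be dicts holding the three dimension scores plus grouping fields such as `template`, and `aggregate` is a hypothetical helper rather than the tool's own implementation.

```python
from collections import defaultdict

DIMENSIONS = ("quality", "accuracy", "efficiency")


def aggregate(records, key):
    """Average each score dimension per distinct value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    report = {}
    for group, rows in groups.items():
        summary = {"count": len(rows)}
        for dim in DIMENSIONS:
            # Plain arithmetic mean over the records in this group.
            summary[dim] = sum(row[dim] for row in rows) / len(rows)
        report[group] = summary
    return report
```

Calling the same function with a different key (for example an agent-type or cost-tier field, if the records carry one) yields the other breakdowns.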

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Yes | One of: score, batch, validate, feedback, report |
| traceId | string | No | Trace ID (required for validate, feedback) |
| sessionId | string | No | Session ID (for single evaluation) |
| output | string | No | Agent output to evaluate |
| input | string | No | User input context |
| score | number | No | Human feedback score 1-5 (for feedback) |
| hours | number | No | Lookback window in hours (for batch, report) |
| expectedFormat | string | No | Expected output format (for validate) |
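Before calling the tool, a caller may want to check a request against the rules in the parameter list. A minimal sketch, assuming requests are plain dicts keyed by the parameter names above; `check_request` is a hypothetical client-side helper, and the tool may enforce additional rules of its own.

```python
OPERATIONS = {"score", "batch", "validate", "feedback", "report"}

# Only the conditional requirement stated in the parameter list is encoded here.
REQUIRED_BY_OPERATION = {
    "validate": ("traceId",),
    "feedback": ("traceId",),
}


def check_request(params: dict) -> list:
    """Return a list of problems found in `params` (empty when it looks valid)."""
    operation = params.get("operation")
    if operation not in OPERATIONS:
        return [f"operation must be one of {sorted(OPERATIONS)}"]

    errors = []
    for name in REQUIRED_BY_OPERATION.get(operation, ()):
        if params.get(name) is None:
            errors.append(f"{name} is required for {operation}")

    # Human feedback scores use the documented 1-5 range.
    score = params.get("score")
    if operation == "feedback" and score is not None and not 1 <= score <= 5:
        errors.append("score must be between 1 and 5")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue at once.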

Use Cases

  • Automated quality assurance for agent outputs
  • Batch evaluation of recent agent sessions
  • Output format and safety validation
  • Human-in-the-loop feedback collection
  • Performance reporting and trend analysis
Details

Version

1.0.0

Created

July 4, 2026

Updated

July 4, 2026

Tags

evaluation, scoring, quality, validation, batch
