Tags: evaluation, scoring, quality, validation, batch

AI Evaluator Tool

Overview

An AI-as-judge evaluation engine that scores agent outputs on three dimensions, each on a 1-5 scale: Quality, Accuracy, and Efficiency. It supports single evaluation, batch processing, output validation, human feedback, and aggregate reporting.

Available Operations

Single Evaluation

  • Score - Evaluate a single agent output against the judge prompt
  • Returns quality, accuracy, and efficiency scores with reasoning
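
As a rough illustration of the Score operation, the sketch below builds a request from the documented parameters and checks that a result stays within the 1-5 range. The result field names (`quality`, `accuracy`, `efficiency`, `reasoning`), the helper `check_scores`, and all sample values are assumptions made for this example, not the tool's confirmed schema.

```python
# Hypothetical request for a single evaluation. Parameter names come from the
# Parameters table; the values are invented for illustration.
score_request = {
    "operation": "score",
    "sessionId": "session-123",
    "input": "Summarize the release notes.",
    "output": "The release adds batch scoring and fixes two bugs.",
}

DIMENSIONS = ("quality", "accuracy", "efficiency")


def check_scores(result):
    """Return the three dimension scores, raising if any is outside 1-5."""
    for name in DIMENSIONS:
        value = result.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            raise ValueError(f"{name} must be a number from 1 to 5, got {value!r}")
    return {name: result[name] for name in DIMENSIONS}


# Assumed result shape (not confirmed by the tool itself).
example_result = {
    "quality": 4,
    "accuracy": 5,
    "efficiency": 3,
    "reasoning": "Concise and factually correct, with one redundant step.",
}
scores = check_scores(example_result)
```

Checking the range on the consumer side is a defensive choice: a judge model can return malformed values, and catching them early keeps bad scores out of later reports.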

Batch Evaluation

  • Batch - Score all unscored traces within a time window
  • Configurable lookback window in hours (default: 24)
  • Skips already-scored traces automatically
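
The selection logic described above (a lookback window plus skipping traces that already have scores) might be sketched as follows. The trace shape (`id`, `timestamp`, `scores`) and the function `select_unscored` are hypothetical. How the tool actually fetches and filters traces is not documented here.

```python
from datetime import datetime, timedelta, timezone


def select_unscored(traces, hours=24, now=None):
    """Return traces inside the lookback window that have no scores yet."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [
        trace
        for trace in traces
        if trace["timestamp"] >= cutoff and not trace.get("scores")
    ]


# Illustrative data only; real traces would come from the tracing backend.
reference_time = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
sample_traces = [
    {"id": "a", "timestamp": reference_time - timedelta(hours=2), "scores": []},
    {"id": "b", "timestamp": reference_time - timedelta(hours=3), "scores": [{"quality": 4}]},
    {"id": "c", "timestamp": reference_time - timedelta(hours=48), "scores": []},
]
pending = select_unscored(sample_traces, hours=24, now=reference_time)
```

With the default 24-hour window only trace `a` is selected: `b` is already scored and `c` falls outside the window.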

Output Validation

  • Validate - Check output against quality criteria (format, content, safety, completeness)
  • Uses the gene-output-validator prompt
  • Returns pass/fail with specific issues and passed checks
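
To make the pass/fail result concrete, here is a minimal sketch that collapses per-criterion findings into a verdict with issues and passed checks. The result keys (`passed`, `issues`, `passedChecks`) and the function `summarize_validation` are assumptions. In the actual tool the findings come from the gene-output-validator prompt, which this sketch does not reproduce.

```python
CRITERIA = ("format", "content", "safety", "completeness")


def summarize_validation(findings):
    """Build a pass/fail summary from per-criterion findings.

    `findings` maps a criterion name to an issue message, or to None when the
    criterion passed. Criteria missing from the mapping are treated as passed.
    """
    issues = [f"{name}: {findings[name]}" for name in CRITERIA if findings.get(name)]
    passed_checks = [name for name in CRITERIA if not findings.get(name)]
    return {"passed": not issues, "issues": issues, "passedChecks": passed_checks}


# Illustrative findings: the output was expected as JSON but arrived as text.
validation = summarize_validation(
    {
        "format": "expected JSON, got plain text",
        "content": None,
        "safety": None,
        "completeness": None,
    }
)
```

Listing the passed checks alongside the issues lets a caller see which criteria were evaluated, not only which ones failed.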

Human Feedback

  • Feedback - Submit human feedback score (1-5) for a specific trace
  • Persists to Langfuse trace metadata

Reporting

  • Report - Aggregate scoring report by template, agent type, and cost tier
  • Configurable period (default: 7 days)
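
The grouping behind such a report could look roughly like this: average each dimension per group, where the group key is the template, agent type, or cost tier. The field names `template`, `agentType`, and `costTier`, the function `aggregate_scores`, and the sample rows are all hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("quality", "accuracy", "efficiency")


def aggregate_scores(scored_traces, group_key):
    """Average each score dimension for every distinct value of group_key."""
    groups = defaultdict(list)
    for trace in scored_traces:
        groups[trace[group_key]].append(trace)

    report = {}
    for group, items in groups.items():
        row = {
            dim: round(sum(item[dim] for item in items) / len(items), 2)
            for dim in DIMENSIONS
        }
        row["count"] = len(items)
        report[group] = row
    return report


# Illustrative scored traces; real rows would carry the evaluator's scores.
scored = [
    {"template": "summarize", "agentType": "writer", "costTier": "low",
     "quality": 4, "accuracy": 5, "efficiency": 3},
    {"template": "summarize", "agentType": "writer", "costTier": "high",
     "quality": 5, "accuracy": 3, "efficiency": 3},
    {"template": "classify", "agentType": "router", "costTier": "low",
     "quality": 2, "accuracy": 4, "efficiency": 5},
]
by_template = aggregate_scores(scored, "template")
by_cost_tier = aggregate_scores(scored, "costTier")
```

Including a `count` per group helps readers judge how much weight an average deserves when a group contains only a few traces.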

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Yes | One of: score, batch, validate, feedback, report |
| traceId | string | No | Trace ID (required for validate and feedback) |
| sessionId | string | No | Session ID (for single evaluation) |
| output | string | No | Agent output to evaluate |
| input | string | No | User input context |
| score | number | No | Human feedback score, 1-5 (for feedback) |
| hours | number | No | Lookback window in hours (for batch and report) |
| expectedFormat | string | No | Expected output format (for validate) |
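
Because several parameters are only conditionally required, a caller may want to check a request before sending it. The sketch below encodes the rules stated in the table. The function `validate_params` is hypothetical, and two rules are assumptions for the sake of the example: that `score` is required for feedback, and that `hours` must be positive.

```python
OPERATIONS = {"score", "batch", "validate", "feedback", "report"}
TRACE_ID_REQUIRED = {"validate", "feedback"}  # stated in the table
HOURS_OPERATIONS = {"batch", "report"}        # stated in the table


def is_number(value):
    """True for int or float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_params(params):
    """Return a list of problems; an empty list means no problems were found."""
    operation = params.get("operation")
    if operation not in OPERATIONS:
        return [f"operation must be one of {sorted(OPERATIONS)}"]

    errors = []
    if operation in TRACE_ID_REQUIRED and not params.get("traceId"):
        errors.append(f"traceId is required for {operation}")
    if operation == "feedback":
        # Assumption: feedback without a score is rejected.
        score = params.get("score")
        if not is_number(score) or not 1 <= score <= 5:
            errors.append("score must be a number from 1 to 5")
    if operation in HOURS_OPERATIONS and "hours" in params:
        # Assumption: a zero or negative lookback window is rejected.
        hours = params["hours"]
        if not is_number(hours) or hours <= 0:
            errors.append("hours must be a positive number")
    return errors


problems = validate_params({"operation": "feedback", "traceId": "trace-1", "score": 4})
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a request at once.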

Use Cases

  • Automated quality assurance for agent outputs
  • Batch evaluation of recent agent sessions
  • Output format and safety validation
  • Human-in-the-loop feedback collection
  • Performance reporting and trend analysis