AI Evaluator

v1.0.0 (Tool)

AI-as-Judge evaluation engine for scoring agent outputs on Quality, Accuracy, and Efficiency with batch processing and output validation

evaluation, scoring, quality, validation, batch

Content

AI Evaluator Tool

Overview

AI-as-Judge evaluation engine that scores agent outputs on three dimensions (1-5 each): Quality, Accuracy, and Efficiency. Supports single evaluation, batch processing, output validation, human feedback, and aggregate reporting.

Available Operations

Single Evaluation

Score - Evaluate a single agent output against the judge prompt
Returns quality, accuracy, efficiency scores with reasoning
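As a minimal sketch of what a Score call might look like on the caller's side, the snippet below builds a request from the documented parameters and models the returned scores. The `ScoreResult` class, its field names, and `build_score_request` are hypothetical helpers for illustration; only the parameter names and the 1-5 range come from this page.

```python
from dataclasses import dataclass

DIMENSIONS = ("quality", "accuracy", "efficiency")


@dataclass
class ScoreResult:
    """Hypothetical container for the three scores plus the judge's reasoning."""
    quality: int
    accuracy: int
    efficiency: int
    reasoning: str

    def __post_init__(self):
        # Each dimension is documented as a 1-5 score.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def build_score_request(output, input_text=None, session_id=None):
    """Assemble a request dict using the parameter names from the table."""
    request = {"operation": "score", "output": output}
    if input_text is not None:
        request["input"] = input_text
    if session_id is not None:
        request["sessionId"] = session_id
    return request
```

Optional fields are left out of the request rather than sent as nulls, which is one reasonable convention; the tool itself may accept either.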

Batch Evaluation

Batch - Score all unscored traces within a time window
Configurable lookback window in hours (default: 24)
Skips already-scored traces automatically
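The selection logic described above can be sketched roughly as follows. This is an illustration only: the trace dictionaries, with their `timestamp` and `scored` keys, are an assumed shape and not the tool's actual storage format.

```python
from datetime import datetime, timedelta, timezone


def select_unscored(traces, hours=24, now=None):
    """Return traces inside the lookback window that have no score yet.

    `traces` is a list of dicts with a timezone-aware `timestamp` and an
    optional boolean `scored` flag (both hypothetical field names).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [
        trace
        for trace in traces
        if trace["timestamp"] >= cutoff and not trace.get("scored", False)
    ]
```

Passing `now` explicitly keeps the function deterministic, which is convenient when testing the window boundaries.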

Output Validation

Validate - Check output against quality criteria (format, content, safety, completeness)
Uses the gene-output-validator prompt
Returns pass/fail with specific issues and passed checks
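To make the pass/fail result concrete, here is a small sketch of how per-check outcomes could be folded into a single result. The key names `passed`, `issues`, and `passedChecks` are assumptions for illustration; the page does not specify the actual response fields.

```python
from typing import Dict, Optional


def summarize_validation(checks: Dict[str, Optional[str]]) -> dict:
    """Combine per-check outcomes into one pass/fail result.

    `checks` maps a check name (e.g. "format") to a problem description,
    or to None when that check passed.
    """
    issues = [f"{name}: {problem}" for name, problem in checks.items() if problem]
    passed_checks = [name for name, problem in checks.items() if not problem]
    return {
        "passed": not issues,          # overall pass only if nothing failed
        "issues": issues,              # specific problems found
        "passedChecks": passed_checks, # checks that succeeded
    }
```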

Human Feedback

Feedback - Submit human feedback score (1-5) for a specific trace
Persists to Langfuse trace metadata

Reporting

Report - Aggregate scoring report by template, agent type, and cost tier
Configurable period (default: 7 days)
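The grouping behind such a report might look like the sketch below, which averages each dimension per (template, agent type, cost tier) combination. The row keys `template`, `agentType`, and `costTier` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("quality", "accuracy", "efficiency")


def aggregate_scores(rows, keys=("template", "agentType", "costTier")):
    """Group scored rows by the given keys and average each dimension."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in keys)].append(row)

    report = {}
    for group, members in groups.items():
        summary = {"count": len(members)}
        for dimension in DIMENSIONS:
            summary[dimension] = mean(member[dimension] for member in members)
        report[group] = summary
    return report
```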

Parameters

Parameter	Type	Required	Description
`operation`	string	Yes	One of: score, batch, validate, feedback, report
`traceId`	string	No	Trace ID (required for validate, feedback)
`sessionId`	string	No	Session ID (for single evaluation)
`output`	string	No	Agent output to evaluate
`input`	string	No	User input context
`score`	number	No	Human feedback score 1-5 (for feedback)
`hours`	number	No	Lookback window in hours (for batch, report)
`expectedFormat`	string	No	Expected output format (for validate)
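A caller could check requests against the table before sending them. The sketch below encodes only what the table states (the allowed operations, `traceId` being required for validate and feedback, and the 1-5 score range). Treating `score` as mandatory for feedback and requiring a positive `hours` value are assumptions, and `check_params` is a hypothetical helper.

```python
OPERATIONS = {"score", "batch", "validate", "feedback", "report"}
TRACE_ID_REQUIRED = {"validate", "feedback"}
HOURS_OPERATIONS = {"batch", "report"}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_params(params):
    """Return a list of problems found in a request dict (empty if none)."""
    operation = params.get("operation")
    if operation not in OPERATIONS:
        return [f"operation must be one of: {', '.join(sorted(OPERATIONS))}"]

    errors = []
    if operation in TRACE_ID_REQUIRED and not params.get("traceId"):
        errors.append(f"traceId is required for {operation}")
    if operation == "feedback":
        score = params.get("score")
        # Assumption: feedback without a score is rejected.
        if not _is_number(score) or not 1 <= score <= 5:
            errors.append("score must be a number from 1 to 5")
    if operation in HOURS_OPERATIONS and "hours" in params:
        hours = params["hours"]
        # Assumption: a lookback window must be positive.
        if not _is_number(hours) or hours <= 0:
            errors.append("hours must be a positive number")
    return errors
```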

Use Cases

Automated quality assurance for agent outputs
Batch evaluation of recent agent sessions
Output format and safety validation
Human-in-the-loop feedback collection
Performance reporting and trend analysis


Details

Version

1.0.0

Created

July 4, 2026

Updated

July 4, 2026

Tags

evaluation, scoring, quality, validation, batch
