NOWAI-Bench

An open benchmarking suite for enterprise AI agents.

Benchmarks: 2
Last updated: 2026Q2

Overview

NOWAI-Bench is a coordinated, multi-benchmark effort by ServiceNow to measure whether AI agents perform reliably across the workflows, modalities, and governance demands of real enterprises. Rather than a single test, it is an expanding portfolio of benchmarks—each targeting a distinct slice of what enterprise agents are asked to do.

The current release covers two slices: EnterpriseOps-Gym evaluates long-horizon task agents across eight enterprise domains, and EVA-Bench evaluates voice agents on both task accuracy and conversational experience. Together they span text-based multi-step workflow execution and governed voice interaction—two of the most common deployment patterns for enterprise agents today.

This README describes the currently released benchmarks and how to read their results. It is a living document, updated as new benchmarks land.

Benchmark Leaderboard

A high-level overview of each NOWAI-Bench benchmark.

v1.0

EnterpriseOps-Gym

Long-horizon task agents evaluated across eight enterprise domains.

Metric: Task Success Rate (%), Oracle mode. A task passes only if all verification conditions are met.

1. Claude Opus 4.5 (Anthropic): 37.4
2. GPT-5.4: 34.8
3. Gemini 3 Pro: 31.2
4. Claude Sonnet 4.6: 28.6
5. Model Five (placeholder): 26.0
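
The all-or-nothing pass rule above can be illustrated with a minimal sketch. This is not the benchmark's own scoring code: the `TaskResult` type, its `checks` field, and the choice to count a task with no recorded checks as a failure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task run (not the benchmark's real schema)."""
    task_id: str
    checks: list[bool]  # one entry per verification condition


def task_passed(result: TaskResult) -> bool:
    # A task passes only if every verification condition is met.
    # Assumption: a task with no recorded checks counts as a failure,
    # since all([]) would otherwise return True.
    return len(result.checks) > 0 and all(result.checks)


def task_success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks that passed all of their checks."""
    if not results:
        raise ValueError("at least one task result is required")
    passed = sum(task_passed(r) for r in results)
    return 100.0 * passed / len(results)


if __name__ == "__main__":
    demo = [
        TaskResult("t1", [True, True, True]),   # passes
        TaskResult("t2", [True, False, True]),  # one failed condition
    ]
    print(f"{task_success_rate(demo):.1f}%")
```

Under this rule a task that meets most of its conditions still scores zero, which is one reason success rates on long-horizon tasks can sit well below per-step accuracy.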

EVA-Bench

Voice agents evaluated on task accuracy and conversational experience.

EVA-Accuracy: Pass@1 scores for accuracy. All values are normalized to 0–1 (higher is better). 95% bootstrap confidence intervals are shown for each value.

1. Nova + GPT-5.4 + Sonic (Mixed Models, Cascade): 0.41
2. Claude Opus 4.5: 0.32
3. Scribe+Gemini-3-Flash: 0.31

EVA-Experience: Pass@1 scores for conversational experience. All values are normalized to 0–1 (higher is better). 95% bootstrap confidence intervals are shown for each value.

1. Gemini Live (Google, Speech-to-Speech): 0.49
2. GPT-Realtime: 0.47
3. Whisper+Qwen 3.5: 0.43
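
To help read the 0–1 scores and their 95% bootstrap confidence intervals, here is a minimal sketch of a percentile bootstrap over per-scenario scores. It assumes one attempt per scenario with a score in [0, 1]; the function names, the resample count, and the percentile method are illustrative choices, not necessarily what EVA-Bench itself uses.

```python
import random


def mean_score(scores: list[float]) -> float:
    """Average per-scenario score; with one attempt each, this is Pass@1."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


def bootstrap_ci(
    scores: list[float],
    n_resamples: int = 2000,   # illustrative default, not from the benchmark
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample scenarios with replacement and record each resample's mean.
    means = sorted(
        mean_score(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical outcomes: 20 passes and 30 failures across 50 scenarios.
    outcomes = [1.0] * 20 + [0.0] * 30
    low, high = bootstrap_ci(outcomes)
    print(f"Pass@1 = {mean_score(outcomes):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

When two systems have overlapping intervals, as closely ranked entries may, the gap between their point scores should be read with caution.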