An open benchmarking suite for enterprise AI agents.
NOWAI-Bench is a coordinated, multi-benchmark effort by ServiceNow to measure whether AI agents perform reliably across the workflows, modalities, and governance demands of real enterprises. Rather than a single test, it is an expanding portfolio of benchmarks—each targeting a distinct slice of what enterprise agents are asked to do.
The current release covers two slices: EnterpriseOps-Gym evaluates long-horizon task agents across eight enterprise domains, and EVA-Bench evaluates voice agents on both task accuracy and conversational experience. Together they span text-based multi-step workflow execution and governed voice interaction—two of the most common deployment patterns for enterprise agents today.
This README describes the currently released benchmarks and how to read their results. It is intended to stay live: as new benchmarks land, it is updated to reflect them.
A high-level overview of each NOWAI-Bench benchmark.