Evaluations score traces so you can quantify improvements and catch regressions as models, prompts, and code change. Next: turn failures into reusable test data with Datasets and Queues.