Agentic evaluations: what frontier labs need from evaluators in 2026
Two years ago, evaluator work meant reading two model responses and clicking the better one. In 2026, half the briefs landing on our roster don't have a "model response" at all — they have a trajectory. The model opened a browser, ran three searches, hit a wrong API, recovered, queried a database, wrote a script, executed it, hit a stack trace, fixed the script, and finally produced an answer. The evaluator's job is to read that whole 40-step rollout and decide where it went wrong, what should have happened instead, and whether the final answer is good despite (or because of) the mess in the middle.
This is agentic evaluation, and it's where the rate ceiling has moved. A standard preference-comparison brief still pays $25–40/hr at the crowd tier. An agentic-trajectory review pays $80–160/hr at the expert tier, partly because few raters have trained themselves to do it yet, and partly because reading a 40-step rollout for hidden failures is genuinely hard.
What "agentic" actually means in a brief
Strip the marketing and an agentic eval is one of three shapes.
Tool-use rollout
The model is given a goal and a list of tools — `search`, `read_file`, `run_python`, `send_email` — and it picks which to call, in what order, with what arguments. The output is a JSON-ish log of every call, every return value, and the final answer. You're scoring four things at once: did it pick the right tools, did it pass them sensible arguments, did it interpret the returns correctly, and was the final answer right.
The hidden failure mode: the model can produce a correct final answer through a broken process. It guesses the answer, then back-fills tool calls that look like they support it. A surface-level rater marks this 5/5. A careful rater catches it and flags "right answer, fabricated reasoning" — which is the most useful signal a lab can collect, and the one that earns you the senior bracket.
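To make the log shape and the back-fill failure concrete, here is a minimal sketch in Python. The `ToolCall` type, its field names (`step`, `tool`, `args`, `returned`), the `answer_is_grounded` helper, and the sample rollout are all assumptions for illustration, not any lab's actual log format or rubric. The check is deliberately crude: it only asks whether a key fact from the final answer ever appeared in a tool return.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    step: int        # position in the rollout
    tool: str        # e.g. "search", "read_file", "run_python"
    args: dict       # arguments the model passed
    returned: str    # raw return value, as text


def answer_is_grounded(calls: list[ToolCall], key_fact: str) -> bool:
    """Return True if key_fact appears in at least one tool return.

    Hypothetical heuristic: a False result does not prove fabrication,
    it only marks the rollout for a closer manual read.
    """
    needle = key_fact.lower()
    return any(needle in call.returned.lower() for call in calls)


# Hypothetical two-step rollout, invented for this example.
rollout = [
    ToolCall(1, "search", {"query": "Q3 revenue ACME"}, "No results found."),
    ToolCall(2, "read_file", {"path": "report.txt"}, "Q3 revenue was 4.1M USD."),
]

grounded = answer_is_grounded(rollout, "4.1M")    # figure appears at step 2
fabricated = answer_is_grounded(rollout, "5.3M")  # figure appears in no return
```

A check like this can only triage. A final answer citing a figure that no tool returned is a prompt to reread the rollout step by step, which is where the "right answer, fabricated reasoning" call actually gets made.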
Computer-use rollout
The model is given a screenshot and keyboard/mouse control, and asked to do something a human would do — book a flight, fill out a form, find a setting buried four menus deep. The eval surface is a stream of screenshots plus the actions taken between them.
The job here is mostly visual and procedural. Did the model click the right thing? Did it get lost in a modal? Did it fall for a dark pattern (clicking "Allow notifications" when the task was "find privacy settings")? Computer-use briefs are absurdly time-consuming (one rollout can be 200 screenshots) and they pay accordingly. Expect roughly $70–170/hr depending on the software, sometimes quoted per rollout instead of per hour.
Multi-turn task with tool use and memory
The most expert-heavy form. The model is given a multi-day task — "investigate this codebase and write a security review", "audit these three quarters of financials and flag what doesn't add up" — and the trajectory includes memory writes, sub-goal planning, and human-in-the-loop checkpoints.
This is the work that needs domain experts. A non-engineer cannot tell a real security finding from a hallucinated one. A non-CPA cannot tell a real accounting irregularity from noise. Labs pay $90–180/hr for these reviews, with regulated domains at the top of that range, and most briefs cap at 4–6 hours/week per rater because they're cognitively brutal.
Which skills actually transfer
If you've been doing preference-comparison work, three habits transfer cleanly. Three don't.
Transfers
- Reading dense text under fatigue. Agentic transcripts are long. If you can stay sharp through a 6,000-word safety review, you can stay sharp through a 40-step rollout.
- Disagreeing with the rubric in writing. Senior raters flag rubric defects in their comments, and labs read those comments. The same instinct that lets you write "rubric ambiguous on case X" on a chat brief works here.
- Spotting confident-wrong. A model that asserts a falsehood with high confidence is the single most dangerous output, in both chat and agent settings. If you can spot it in chat, you can spot it in a tool-use log.
Doesn't transfer
- Speed. Crowd raters are paid per item and trained to be fast. Agentic raters are paid per hour or per rollout and rewarded for being thorough. The first week feels brutally slow.
- Reading prose only. You'll be reading JSON-ish tool logs, stack traces, raw HTML, and occasionally screenshots. If "reading code" is not in your skill set, the tool-use briefs will be a wall.
- Trusting the rubric on first read. Agentic rubrics are new and often broken. At least once per brief, you'll find an ambiguity the rubric author didn't anticipate. Senior raters flag it; junior raters guess and lose calibration points.
What labs actually need from a senior agentic rater
Three things, in order.
- Decompose the trajectory. A senior rater opens a 40-step rollout and immediately breaks it into phases — "exploration phase steps 1–8, hypothesis-testing 9–20, solution drafting 21–35, verification 36–40". Each phase gets its own score and its own comments. Labs use this structure to localise model failures.
- Distinguish process failures from outcome failures. The model can succeed despite a broken process, or fail despite a sound process. Both are training signal. Both need to be named.
- Write the counterfactual. Not "the model should have done better", but "step 17 should have been `read_file('config.toml')` instead of `search('config.toml')` because the file already existed in the workspace". Specific, actionable, attributable to a single step.
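The three habits above can be sketched as one review record. This is a minimal illustration in Python, assuming a 1-5 score per phase and a "below 3 means process failure" threshold; the class names, the scores, and the threshold are hypothetical, and only the phase boundaries and the step-17 counterfactual come from the examples in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Counterfactual:
    step: int          # the single step being corrected
    actual: str        # what the model did
    expected: str      # what it should have done
    reason: str        # why, in one sentence


@dataclass
class Phase:
    name: str
    first_step: int
    last_step: int
    score: int         # hypothetical 1-5 scale
    counterfactuals: list[Counterfactual] = field(default_factory=list)


def classify(process_ok: bool, outcome_ok: bool) -> str:
    """Name the process/outcome combination so neither signal is lost."""
    if process_ok and outcome_ok:
        return "sound process, correct outcome"
    if process_ok:
        return "sound process, wrong outcome"
    if outcome_ok:
        return "broken process, correct outcome"
    return "broken process, wrong outcome"


phases = [
    Phase("exploration", 1, 8, score=4),
    Phase("hypothesis-testing", 9, 20, score=2, counterfactuals=[
        Counterfactual(
            step=17,
            actual="search('config.toml')",
            expected="read_file('config.toml')",
            reason="the file already existed in the workspace",
        ),
    ]),
    Phase("solution drafting", 21, 35, score=4),
    Phase("verification", 36, 40, score=3),
]

# Sanity check: phases should cover steps 1-40 with no gaps or overlaps.
covered = [s for p in phases for s in range(p.first_step, p.last_step + 1)]
complete = covered == list(range(1, 41))

# Hypothetical threshold: any phase scoring below 3 counts as a process failure.
process_ok = all(p.score >= 3 for p in phases)
verdict = classify(process_ok, outcome_ok=True)
```

Keeping each counterfactual attached to one step inside one phase is the design choice that matters: it lets a reader trace a low phase score back to the exact call that caused it.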
The pay-by-shape map
Approximate 2026 ranges, before tax, for evaluator-side work on agentic briefs that frontier labs commission through OBG:
- Tool-use rollout, general domain: $60–95/hr
- Tool-use rollout, regulated domain (med / legal / fin): $90–150/hr
- Computer-use rollout, general productivity tasks: $70–110/hr
- Computer-use rollout, specialist software (CAD, EHR, trading): $110–170/hr
- Multi-turn agent with memory, software-engineering brief: $90–160/hr
- Multi-turn agent with memory, regulated domain: $120–180/hr
- Adversarial agentic red-team (capability uplift): $110–200/hr, depending on the brief
The variance is wider than chat eval because the briefs are bespoke. A computer-use brief on Epic (the EHR) pays the rare-skill premium because labs need raters who already know Epic — they're not paying for AI eval skill alone, they're paying for the intersection of AI eval skill and EHR fluency. The same applies to Bloomberg terminals, AutoCAD, and any other specialist software.
How to get on the agentic queue
If you're already on the OBG roster, you don't need to do anything — agentic briefs are matched off the same credentials and capacity signals as the chat briefs. If you're new, three things speed up the match.
- Name the specialist software you actually use daily. "I use Epic 7 hours a day" gets you on the EHR computer-use queue. "I know healthcare" doesn't.
- Mention any code/script comfort. Tool-use briefs filter heavily for raters who can read a Python traceback without freezing. You don't need to be a software engineer — you need to not be scared of code.
- Set realistic weekly capacity. Agentic briefs are 2–6 hours per rollout. If you have 4 hours/week available, say 4. Don't promise 15 and ghost the last 11 — your match rank takes the hit.
The work is already on our roster. The bottleneck is people. If you have a credential and an hour to opt in, the trajectories are waiting.