Behavioural Benchmark / Judgment
Don't just ask whether the patch passed. Watch how the model worked.
Behavioural Benchmark profiles model behaviour on coding tasks: decomposition, search, edits, test runs, recovery, consistency, and official outcomes.
Beyond Leaderboards
The same score can hide very different engineering behaviour.
SWE-bench tells you whether a task was resolved. Behavioural profiling adds the path: how the model investigated, changed code, used tools, responded to failures, and reached the outcome.
Decomposition
Does the model break work down, or jump straight into patches?
Search & context
Does it inspect relevant files and history, or guess from the prompt?
Recovery behaviour
When tests fail, does it adapt its strategy or repeat the same path?
Role fit
Fast prototyping, legacy maintenance, architecture work, and QA need different behavioural profiles.
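To make these questions concrete, here is a minimal sketch of how they could be turned into signals over a recorded trace. The `Event` type, the event kinds, and the `summarise` function are hypothetical names chosen for illustration, not the profiler's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event kinds: "plan", "search", "edit", "test_pass", "test_fail".
    kind: str
    # File path or search query; empty when not applicable.
    target: str = ""


def summarise(events: list[Event]) -> dict[str, object]:
    """Reduce a trace to a few behavioural signals (illustrative only)."""
    kinds = [e.kind for e in events]
    first_edit = kinds.index("edit") if "edit" in kinds else len(kinds)
    before_edit = kinds[:first_edit]

    failures = 0
    adapted = 0
    last_edit_target = None
    for i, event in enumerate(events):
        if event.kind == "edit":
            last_edit_target = event.target
        elif event.kind == "test_fail":
            failures += 1
            nxt = events[i + 1] if i + 1 < len(events) else None
            # Count as "adapted" when the next step is anything other than
            # re-editing the same target that was edited last.
            repeats = (
                nxt is not None
                and nxt.kind == "edit"
                and nxt.target == last_edit_target
            )
            if nxt is not None and not repeats:
                adapted += 1

    return {
        "planned_before_edit": "plan" in before_edit,      # decomposition
        "searched_before_edit": "search" in before_edit,   # search & context
        "edits": kinds.count("edit"),
        "test_failures": failures,
        "adapted_after_failure": adapted,                  # recovery behaviour
    }


# Two made-up traces that both end in a passing test run.
careful = [
    Event("plan"), Event("search", "auth.py"), Event("edit", "auth.py"),
    Event("test_fail"), Event("search", "session.py"),
    Event("edit", "session.py"), Event("test_pass"),
]
hasty = [
    Event("edit", "auth.py"), Event("test_fail"),
    Event("edit", "auth.py"), Event("test_fail"),
    Event("edit", "auth.py"), Event("test_pass"),
]

careful_summary = summarise(careful)
hasty_summary = summarise(hasty)
```

Both traces would score the same on a pass/fail leaderboard, yet the summaries differ on every behavioural signal. Which profile is preferable depends on the role the model is meant to fill.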
Current artifact
Behavioural Profiler
The current implementation imports SWE-bench tasks, runs models under a neutral profiler, captures patches, evaluates them with the official harness, and reports behaviour and outcome as separate evidence.
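The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: every type and function name here is hypothetical, and the model runner and the official harness are stood in for by plain callables rather than real APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    # Hypothetical fields; a real imported task carries more metadata.
    task_id: str
    problem: str


@dataclass(frozen=True)
class BehaviourEvidence:
    events: tuple[str, ...]  # what the profiler observed, in order
    patch: str               # the patch captured at the end of the run


@dataclass(frozen=True)
class OutcomeEvidence:
    resolved: bool           # verdict from the evaluation step


@dataclass(frozen=True)
class RunRecord:
    task: Task
    behaviour: BehaviourEvidence
    outcome: OutcomeEvidence


def profile_task(
    task: Task,
    run_model: Callable[[Task], BehaviourEvidence],
    evaluate_patch: Callable[[Task, str], bool],
) -> RunRecord:
    """Run the model, capture its patch, then evaluate the patch separately."""
    behaviour = run_model(task)
    outcome = OutcomeEvidence(resolved=evaluate_patch(task, behaviour.patch))
    # Behaviour and outcome stay in separate fields and are never merged
    # into a single score.
    return RunRecord(task=task, behaviour=behaviour, outcome=outcome)


def render(record: RunRecord) -> str:
    """Report the two kinds of evidence on separate lines."""
    steps = " -> ".join(record.behaviour.events)
    verdict = "resolved" if record.outcome.resolved else "unresolved"
    return f"behaviour: {steps}\noutcome: {verdict}"


# Stub callables standing in for the model run and the official harness.
def fake_run_model(task: Task) -> BehaviourEvidence:
    return BehaviourEvidence(events=("search", "edit", "test"), patch="(stub patch)")


def fake_evaluate_patch(task: Task, patch: str) -> bool:
    return bool(patch)


record = profile_task(
    Task(task_id="example-1", problem="(made-up problem statement)"),
    fake_run_model,
    fake_evaluate_patch,
)
report = render(record)
```

Passing the runner and evaluator in as callables keeps the profiler itself neutral: it records and reports, while the verdict comes from the evaluation step alone.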