Behavioural Benchmark / Judgment

Don't just ask whether the patch passed. Watch how the model worked.

Behavioural Benchmark profiles model behaviour on coding tasks: decomposition, search, edits, test runs, recovery, consistency, and official outcomes.


Beyond Leaderboards

The same score can hide very different engineering behaviour.

SWE-bench tells you whether a task was resolved. Behavioural profiling adds the path: how the model investigated, changed code, used tools, responded to failures, and reached the outcome.

Decomposition

Does the model break work down, or jump straight into patches?

Search & context

Does it inspect relevant files and history, or guess from the prompt?

Recovery behaviour

When tests fail, does it adapt its strategy or repeat the same path?

Role fit

Fast prototyping, legacy maintenance, architecture work, and QA need different behavioural profiles.
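
The first three questions can be made concrete by scoring them from a recorded action trace. The sketch below is illustrative only: the event kinds (`plan`, `search`, `read`, `edit`, `test_fail`, `test_pass`), the `Event` type, and the `profile` function are hypothetical names, not the profiler's actual schema. Its recovery signal is a crude proxy that only checks whether the edit after a failing test touches a different file from the edit before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str         # hypothetical kinds: plan, search, read, edit, test_fail, test_pass
    target: str = ""  # file path, when relevant


def profile(trace: list[Event]) -> dict[str, object]:
    """Summarise a trace into a few behavioural signals (illustrative only)."""
    kinds = [e.kind for e in trace]
    first_edit = kinds.index("edit") if "edit" in kinds else len(kinds)
    before_edit = trace[:first_edit]

    # Decomposition: did any planning step precede the first edit?
    planned_first = any(e.kind == "plan" for e in before_edit)

    # Search & context: distinct files inspected before the first edit.
    inspected = {e.target for e in before_edit if e.kind in ("search", "read")}

    # Recovery: after a failing test, does the next edit go somewhere new
    # (adapted) or back to the same file as the previous edit (repeated)?
    adapted = repeated = 0
    last_edit = None
    pending_failure = False
    for e in trace:
        if e.kind == "test_fail":
            pending_failure = True
        elif e.kind == "edit":
            if pending_failure and last_edit is not None:
                if e.target == last_edit:
                    repeated += 1
                else:
                    adapted += 1
            pending_failure = False
            last_edit = e.target

    return {
        "planned_first": planned_first,
        "files_inspected_before_edit": len(inspected),
        "adapted_after_failure": adapted,
        "repeated_after_failure": repeated,
    }
```

Two runs that both end in `test_pass` can produce very different dictionaries here, which is the gap a pass/fail score leaves open.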

Current artifact

Behavioural Profiler

The current implementation imports SWE-bench tasks, runs models under a neutral profiler, captures patches, evaluates with the official harness, and renders behaviour plus outcome as separate evidence.
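
As a rough sketch of what keeping behaviour and outcome as separate evidence could look like in data, the record below stores them in distinct fields and renders them as distinct blocks. Every name here (`Behaviour`, `Outcome`, `RunRecord`, `render`, and their fields) is an assumption made for illustration, not the profiler's real data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behaviour:
    """What the profiler observed during the run (hypothetical fields)."""
    tool_calls: int
    test_runs: int
    files_edited: tuple[str, ...]


@dataclass(frozen=True)
class Outcome:
    """What the evaluation harness reported for the captured patch."""
    resolved: bool


@dataclass(frozen=True)
class RunRecord:
    task_id: str
    model: str
    patch: str
    behaviour: Behaviour
    outcome: Outcome


def render(record: RunRecord) -> str:
    """Render behaviour and outcome as two blocks, never one blended score."""
    edited = ", ".join(record.behaviour.files_edited) or "none"
    lines = [
        f"{record.task_id} / {record.model}",
        "Behaviour",
        f"  tool calls: {record.behaviour.tool_calls}",
        f"  test runs: {record.behaviour.test_runs}",
        f"  files edited: {edited}",
        "Outcome",
        f"  resolved: {'yes' if record.outcome.resolved else 'no'}",
    ]
    return "\n".join(lines)


# Placeholder values only; not real benchmark data.
record = RunRecord(
    task_id="example-task-1",
    model="example-model",
    patch="(captured diff goes here)",
    behaviour=Behaviour(tool_calls=12, test_runs=3, files_edited=("a.py",)),
    outcome=Outcome(resolved=False),
)
print(render(record))
```

Keeping `Outcome` in its own type means a failed run still carries a full behavioural record, and a resolved run cannot hide how it got there.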