FrontierSWE

Benchmarking software engineering skill
at the edge of human ability.

Leaderboard

#  Model                          Avg rank  Dominance
1  GPT-5.5 (Codex)                2.35      83%
2  Claude Opus 4.7 (Claude Code)  3.29      71%
3  Claude Opus 4.6 (Claude Code)  3.82      65%
4  GPT-5.4 (Codex)                3.97      63%
5  Gemini 3.1 Pro (Gemini CLI)    5.26      47%
6  DeepSeek V4 Pro (Claude Code)  6.21      35%
7  Kimi K2.6 (Kimi CLI)           6.44      32%
8  Kimi K2.5 (Kimi CLI)           6.74      28%
9  Qwen3.6-Plus (Qwen Code)       6.91      26%

Rank: average position across tasks (lower is better).
Dominance: win rate against a random opponent on a task.
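
The two leaderboard metrics can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: the `scores` dictionary, the model and task names, and the values are hypothetical, and the tie handling (tied models share an averaged position, and a tie counts as half a win) and the uniform choice of opponent are assumptions.

```python
from statistics import mean

# Hypothetical per-task scores (higher is better); not real benchmark data.
scores = {
    "task_a": {"model_x": 0.9, "model_y": 0.5, "model_z": 0.1},
    "task_b": {"model_x": 0.2, "model_y": 0.7, "model_z": 0.2},
}

def average_rank(scores):
    """Mean position of each model across tasks (1 = best, lower is better)."""
    positions = {}
    for by_model in scores.values():
        values = list(by_model.values())
        for model, s in by_model.items():
            better = sum(1 for o in values if o > s)
            tied = sum(1 for o in values if o == s)  # counts the model itself
            # Assumption: tied models share the average of the positions they span.
            positions.setdefault(model, []).append(better + (tied + 1) / 2)
    return {m: mean(p) for m, p in positions.items()}

def dominance(scores):
    """Mean per-task win rate against a uniformly chosen other model."""
    rates = {}
    for by_model in scores.values():
        for model, s in by_model.items():
            others = [o for m, o in by_model.items() if m != model]
            if not others:
                continue
            # Assumption: a tie counts as half a win.
            wins = sum(1.0 if s > o else 0.5 if s == o else 0.0 for o in others)
            rates.setdefault(model, []).append(wins / len(others))
    return {m: mean(r) for m, r in rates.items()}

print(average_rank(scores))
print(dominance(scores))
```

Under these assumptions a model that wins every task has an average rank of 1 and a dominance of 100%, which is why the two columns move in opposite directions down the table.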

Task Performance

For each task, we show the best score achieved by any model.

Implementation (5)

Implementation tasks challenge agents to build complex software systems from scratch or to reimplement existing ones in a different language. No model successfully completed any of these tasks in any trial, so we used the best@5 test pass rate as a partial reward to rank models.
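
A best@5 test pass rate can be computed along these lines. This is a sketch under stated assumptions: best@5 is taken to mean the maximum pass rate over five trials, a trial is assumed to count as a success only when every test passes, and the trial counts below are hypothetical rather than taken from the benchmark.

```python
def pass_rate(passed, total):
    """Fraction of tests passed in one trial."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total

def best_at_k(trial_rates, k=5):
    """Highest pass rate among the first k trials."""
    if not trial_rates:
        raise ValueError("need at least one trial")
    return max(trial_rates[:k])

def success_rate(trial_rates, k=5):
    """Fraction of the first k trials that passed every test (assumed criterion)."""
    trials = trial_rates[:k]
    if not trials:
        raise ValueError("need at least one trial")
    return sum(1 for r in trials if r == 1.0) / len(trials)

# Hypothetical numbers of passing tests in five trials of a 200-test suite.
trials = [pass_rate(p, 200) for p in (12, 30, 0, 46, 18)]

print(f"success rate: {success_rate(trials):.0%}")
print(f"best@5 test pass rate: {best_at_k(trials):.0%}")
```

Taking the maximum rather than the mean rewards a model for its single strongest attempt, so a model with one good run and four failed ones can still rank well on a task.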

01 Implementation

PostgreSQL 18 on SQLite

Build a PostgreSQL 18 server in Zig that uses SQLite for storage.

0/5 success rate
16% test pass rate
Claude Opus 4.6 (Claude Code)
02 Implementation

Wan 2.1 on MAX/Mojo

With Modular

Implement Wan 2.1 text-to-video inference on Modular's MAX/Mojo stack.

0/5 success rate
50% workloads passed
Claude Opus 4.6 (Claude Code)
03 Implementation

Git to Zig

Reimplement git v2.47.0 in Zig.

0/5 success rate
23% test pass rate
Claude Opus 4.6 (Claude Code)
04 Implementation

Dart → Haskell

Reimplement the dart_style formatter as a standalone Haskell executable.

0/5 success rate
27% test pass rate
GPT-5.5 (Codex)
05 Implementation

Lua Native Compiler

Build a real AOT compiler from Lua 5.4 source to native x86-64 ELF.

0/5 success rate
89% test pass rate
Claude Opus 4.7 (Claude Code)

Research (3)

Research tasks require agents to design and train ML models or devise novel algorithms, evaluated on held-out data the agent never sees during development.

08 Research

Optimizer Design

Design a single optimizer that beats tuned AdamW across diverse ML workloads.

3.2x fewer steps vs tuned AdamW
GPT-5.5 (Codex)

Performance Optimization (9)

Performance optimization tasks ask agents to improve speed or compression without breaking existing behavior. Correctness here is a gate, not the goal: a failed correctness check means the agent's solution did not pass all functional tests. Models are ranked by 0.5 × correctness + 0.5 × speedup (or 1 - compression ratio), and the speedup shown is best@5.
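
The ranking formula above can be written out as a short sketch. The page does not say how correctness or speedup is normalized, so this example assumes both terms are already scaled to the range 0 to 1, and that the compression ratio is compressed size divided by original size. The function names and input values are hypothetical.

```python
def compression_gain(compressed_bytes, original_bytes):
    """1 - compression ratio, with the ratio assumed to be compressed / original."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 1 - compressed_bytes / original_bytes

def performance_score(correctness, gain):
    """0.5 x correctness + 0.5 x gain, where gain is a speedup or compression term.

    Assumption: both inputs are already normalized to [0, 1].
    """
    return 0.5 * correctness + 0.5 * gain

# Hypothetical run: half of the correctness checks pass and a 1000-byte
# input compresses to 250 bytes.
gain = compression_gain(250, 1000)
score = performance_score(correctness=0.5, gain=gain)
print(f"gain={gain:.3f} score={score:.3f}")
```

With equal weights, a solution that is fast but fails its correctness checks can score no higher than one that is fully correct with no improvement at all.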

12 Performance

Notebook Compression

Build a lossless domain-specific compressor for canonicalized Jupyter notebooks.

3/5 correctness
0.693 reduction
Claude Opus 4.6 (Claude Code)