Benchmarking software engineering skill at the edge of human ability.
For each task, we show the best score achieved by any model.
Implementation tasks challenge agents to build complex software systems from scratch or reimplement existing ones in a different language. No model fully completed any of these tasks in any trial, so we rank models by their best@5 test pass rate, which serves as a partial reward.
Build a PostgreSQL 18 server in Zig that uses SQLite for storage.
With Modular
Implement Wan 2.1 text-to-video inference on Modular's MAX/Mojo stack.
Reimplement git v2.47.0 in Zig.
Reimplement the dart_style formatter as a standalone Haskell executable.
Build a real AOT compiler from Lua 5.4 source to native x86-64 ELF.
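As a rough illustration of the best@5 test pass rate used to rank models on implementation tasks, the sketch below takes the highest pass rate across up to five trials. The function name `best_at_k`, the `(passed, total)` trial format, and the sample numbers are assumptions made for this example, not details from the benchmark.

```python
def best_at_k(trials, k=5):
    """Return the highest test pass rate among the first k trials.

    `trials` is a list of (passed, total) pairs, one per trial.
    This input format is hypothetical and chosen for illustration only.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    rates = []
    for passed, total in trials[:k]:
        if total <= 0:
            raise ValueError("each trial needs a positive test count")
        rates.append(passed / total)
    return max(rates)


# Made-up results for five trials of one model on one task.
example_trials = [(120, 400), (310, 400), (0, 400), (295, 400), (200, 400)]
best_rate = best_at_k(example_trials)
print(f"best@5 pass rate: {best_rate:.3f}")
```

Under this reading, a model gets credit for its single strongest trial, so one good run outweighs several failed ones.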
Research tasks require agents to design and train ML models or devise novel algorithms, evaluated on held-out data the agent never sees during development.
With Thoughtful Lab
Post-train Qwen3-8B to solve FrogsGame boards through tool use.
Train a 2D-only molecular graph regressor for a PCQM4Mv2-derived task.
Design a single optimizer that beats tuned AdamW across diverse ML workloads.
Performance optimization tasks ask agents to improve speed or compression without breaking existing behavior. Correctness here is a gate, not the goal: a failed correctness check means the agent's solution did not pass all functional tests. Tasks are ranked by 0.5 × correctness + 0.5 × speedup (or 1 - compression ratio, for compression tasks), and the speedup shown is best@5.
Reimplement the libexpat XML parser as a drop-in x86-64 assembly shared library.
Reimplement FFmpeg's libswscale scaler and pixel-format converter in Zig or Rust.
Make pyright faster without changing diagnostics.
Build a lossless domain-specific compressor for canonicalized Jupyter notebooks.
Speed up Revideo's rendering pipeline without changing video output.
Speed up Wasmtime's Cranelift backend without breaking correctness.
Implement a fast dependent type checker for a Martin-Löf-style core language.
With Prime Intellect
Make a pinned Granite hybrid Mamba2 layer faster on B200 without changing semantics.
Make SGLang serving for Qwen3.5-4B faster on a B200 GPU.
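The ranking rule stated for performance optimization tasks, 0.5 × correctness + 0.5 × speedup (or 1 - compression ratio), can be sketched as follows. This is a minimal illustration under stated assumptions: correctness and the speedup term are taken to be already normalized to the range 0 to 1, and compression ratio is taken to mean compressed size divided by original size. The source does not specify either normalization, and the function names are hypothetical.

```python
def performance_term(speedup=None, compression_ratio=None):
    """Return the performance half of the score.

    Exactly one argument must be given. Assumptions (not from the source):
    `speedup` is already normalized to [0, 1], and `compression_ratio`
    is compressed size / original size, so smaller is better.
    """
    if (speedup is None) == (compression_ratio is None):
        raise ValueError("provide exactly one of speedup or compression_ratio")
    if speedup is not None:
        return speedup
    return 1.0 - compression_ratio


def ranking_score(correctness, *, speedup=None, compression_ratio=None):
    """Combine correctness and performance with equal weights of 0.5."""
    return 0.5 * correctness + 0.5 * performance_term(speedup, compression_ratio)


# Made-up inputs: a fully correct compressor that shrinks data to 25% of its size,
# and a half-correct speed task with a normalized speedup of 0.5.
compression_score = ranking_score(1.0, compression_ratio=0.25)
speed_score = ranking_score(0.5, speedup=0.5)
print(compression_score, speed_score)
```

Because both halves carry equal weight, a fast but incorrect solution can score no higher than 0.5 under this sketch, which matches the idea that correctness acts as a gate.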