Motivation

Curator turns real GitHub pull requests into verified SWE-style coding tasks. It is the task-curation stage of the SWE-Lego-Live pipeline, sitting before trajectory generation, SFT, and RL.

GitHub PRs -> curator -> tracer -> trainer -> rl

Curator exists because high-quality agent training data needs more than a patch and a repository URL. Each task must have a reproducible container, a clear instruction, a bug-introducing patch, a ground-truth fix, and a test script that separates solved from unsolved attempts.

Curator provides:

Multi-language PR collection across Python, JavaScript, TypeScript, Go, C, C++, Java, and Rust
LLM-assisted PR filtering and task instruction generation
Harbor-compatible task directories with Docker environments and tests
NOP and Oracle validation before a task is exposed downstream
verifiable_tasks.txt manifests that downstream blocks can trust
Batch state and resume support for long-running data construction
Static difficulty scoring and task metadata for dataset analysis
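
The validation gate behind the manifest can be sketched as follows. This is a minimal illustration, not Curator's actual implementation. It assumes that NOP validation runs the test script with no changes applied, that Oracle validation runs it after applying the ground-truth fix, and that verifiable_tasks.txt lists one task id per line. The names ValidationResult and manifest_lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    """Outcome of running one task's test script under both checks (hypothetical type)."""
    task_id: str
    nop_passed: bool     # assumed: tests passed with no changes applied
    oracle_passed: bool  # assumed: tests passed after the ground-truth fix

    @property
    def verifiable(self) -> bool:
        # Assumed rule: the tests must fail on the untouched buggy state
        # and pass once the ground-truth fix is applied.
        return (not self.nop_passed) and self.oracle_passed


def manifest_lines(results):
    """Return the sorted ids of tasks that pass the assumed gate."""
    return sorted(r.task_id for r in results if r.verifiable)


if __name__ == "__main__":
    results = [
        ValidationResult("task-a", nop_passed=False, oracle_passed=True),
        ValidationResult("task-b", nop_passed=True, oracle_passed=True),
        ValidationResult("task-c", nop_passed=False, oracle_passed=False),
    ]
    # Only task-a separates solved from unsolved attempts.
    print("\n".join(manifest_lines(results)))
```

Under these assumptions, a task whose tests already pass without any fix (task-b) or still fail with the ground-truth fix (task-c) never reaches the manifest, which is the property that lets downstream blocks trust it.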

Where to go next

Getting Started - set up the block and run a smoke-test generation
Core Concepts - PR pools, task skeletons, validation, and manifests
Run Generation - run per-language generation scripts safely
Outputs - understand task directories, manifests, logs, and state packages
Dashboard - monitor Curator progress online
