Arcified.AI Winning Playbook for Strong Compute ARC AGI 2 Hackathon
ML Engineers at 2K Games and Google DeepMind built ARC Evolve, solving 80% of training puzzles—far surpassing frontier models.
Strong Compute hosted a 24‑hour sprint focused on the ARC AGI 2 challenge. When the dust settled, the overall prize in Competition A went to the two‑person team Arcified.AI.
Arcified’s members are Vijayraj Gohil, an ML engineer at 2K Games, and Aditya Shah, an ML engineer at Google DeepMind. Their final system, nicknamed ARC Evolve, reached an approximately 80% full‑solve rate on training puzzles, far outperforming the baseline numbers typically reported for large frontier models.
What is ARC AGI 2?
The ARC (Abstraction & Reasoning Corpus) tasks created by François Chollet test a model’s ability to infer symbolic transformations from tiny demonstration sets. ARC AGI 2 raises the bar with new transformation families and a strict “all‑or‑nothing” scoring rule: a task is counted only if the model reproduces the entire output grid perfectly. The benchmark has become a proving ground for methods that claim progress toward more general reasoning.
The Thinking Behind the Build
Vijayraj Gohil and Aditya Shah together sketched a strategy they called “small data, big search.” That ethos shaped every design choice that followed.
Strategy
From One‑Shot RLVR to AlphaEvolve‑Style Search
Arcified’s technical recipe fuses two complementary ideas drawn from very recent literature:
One‑Shot Reinforcement Learning with Verifiable Rewards (RLVR). A paper released in April 2025 showed that, for reasoning‑heavy datasets, fine‑tuning with just one carefully chosen example plus a binary “full‑solve” reward can match or exceed thousand‑sample runs. Arcified used this paradigm to initialize a compact 7‑billion‑parameter language model for ARC AGI 2.
AlphaEvolve search. Google DeepMind’s AlphaEvolve project demonstrated how an LLM‑guided evolutionary loop could improve on matrix‑multiplication algorithms that had stood for decades. Arcified adapted the same idea to iteratively refine chains‑of‑thought for ARC puzzles, letting a high‑precision evaluator provide graded feedback between generations.
By combining the two, the team produced a self‑improving loop: RLVR delivers an initial policy; AlphaEvolve‑style search mutates that policy’s reasoning trace until it converges on a stable program that maps input to output.
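To make the hand‑off concrete, here is a minimal sketch of that loop in Python. Every helper here (rlvr_finetune, propose_cot, evaluate, mutate, extract_program) is a hypothetical placeholder standing in for the team’s components, not their actual code.

```python
# Sketch of the two-stage loop: an RLVR-initialised policy proposes reasoning
# traces, and an AlphaEvolve-style search refines them against an evaluator.
# All helper functions are hypothetical placeholders.

def arc_evolve(task, base_model, generations=20, population=8):
    policy = rlvr_finetune(base_model)                      # stage 1: one-shot RLVR policy
    pool = [propose_cot(policy, task) for _ in range(population)]
    best_score, best_cot = -1.0, None

    for _ in range(generations):                            # stage 2: evolutionary search
        scored = sorted(((evaluate(cot, task), cot) for cot in pool),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_cot = scored[0]
        if best_score == 1.0:                               # perfect full-solve reached
            break
        elites = [cot for _, cot in scored[: population // 2]]
        pool = elites + [mutate(policy, cot, task) for cot in elites]

    return extract_program(best_cot)                        # deterministic grid program
```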
How It Works—A Closer Look
Task taxonomy and sampling
ARC AGI 2 examples fall into three geometric regimes:
No‑change (input and output are the same size),
Contraction (output is smaller), and
Expansion (output is larger).
Arcified built histograms to quantify the prevalence of each regime in the public training set, then repeated the analysis on held‑out evaluation tasks. They discovered that most puzzles clustered in the no‑change and contraction buckets. Using that insight, they curated ten “high‑entropy” samples—balanced across regime and across three difficulty bands (easy, medium, hard)—to act as the sole training pool.
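The regime analysis itself is easy to reproduce on the public data. The sketch below assumes the standard ARC task JSON layout (“train” pairs with “input”/“output” grids) and a hypothetical task directory path; it labels each task by the size relationship of its first demonstration pair and tallies a histogram.

```python
import json
import glob
from collections import Counter

def regime(pair):
    """Classify one demonstration pair by how the grid size changes."""
    in_h, in_w = len(pair["input"]), len(pair["input"][0])
    out_h, out_w = len(pair["output"]), len(pair["output"][0])
    if (in_h, in_w) == (out_h, out_w):
        return "no-change"
    return "contraction" if out_h * out_w < in_h * in_w else "expansion"

def regime_histogram(task_dir):
    """Tally tasks per regime, labelling each task by its first training pair."""
    counts = Counter()
    for path in glob.glob(f"{task_dir}/*.json"):
        with open(path) as f:
            task = json.load(f)
        counts[regime(task["train"][0])] += 1
    return counts

# e.g. regime_histogram("arc-agi-2/training")  # hypothetical path to the public tasks
```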
Group Relative Policy Optimisation (GRPO)
The ten samples were duplicated and permuted to form a synthetic mini‑corpus. GRPO fine‑tuning rewarded only perfect grid matches (a 1/0 signal), steadily raising the policy’s success on unchanged‑size puzzles to the mid‑80‑percent range.
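The reward itself is simple to state. The snippet below sketches the two ingredients, assuming grids are plain Python lists of lists: an all‑or‑nothing exact‑match reward, and the group‑relative advantage that gives GRPO its name (each sampled completion is scored against the other completions drawn for the same prompt).

```python
def exact_match_reward(predicted, target):
    """All-or-nothing reward: 1.0 only when every cell of the grid matches."""
    if len(predicted) != len(target):
        return 0.0
    for pred_row, tgt_row in zip(predicted, target):
        if list(pred_row) != list(tgt_row):
            return 0.0
    return 1.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO advantage: normalise each completion's reward against its sampling group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. group_relative_advantages([1.0, 0.0, 0.0, 1.0])  ≈  [1.0, -1.0, -1.0, 1.0]
```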
Evolutionary refinement
Each RL‑generated chain‑of‑thought (CoT) was passed to an evaluator LLM that produced fine‑grained scores on intermediate steps. Those scores fed an evolutionary loop that mutated, recombined, and re‑ranked CoTs, repeatedly bootstrapping better transforms until the evaluator’s reward plateaued.
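How the mutation and recombination were performed is not spelled out here, so the sketch below shows one plausible shape for a single generation: chains‑of‑thought are lists of step strings, per‑step evaluator scores are averaged into a fitness value, and elites are spliced and mutated to refill the pool. The mutate_fn callable is a placeholder for an LLM‑driven rewrite of a trace.

```python
import random

def fitness(step_scores):
    """Collapse the evaluator's per-step scores into one fitness value (the mean)."""
    return sum(step_scores) / len(step_scores)

def recombine(cot_a, cot_b):
    """Splice two chains-of-thought at a random step boundary (simple crossover)."""
    cut = random.randint(1, max(1, min(len(cot_a), len(cot_b)) - 1))
    return cot_a[:cut] + cot_b[cut:]

def next_generation(scored_cots, mutate_fn, keep=4):
    """Keep the best-scoring traces, then refill the pool via crossover + mutation."""
    ranked = sorted(scored_cots, key=lambda item: fitness(item[0]), reverse=True)
    elites = [cot for _, cot in ranked[:keep]]
    children = [mutate_fn(recombine(random.choice(elites), random.choice(elites)))
                for _ in range(keep)]
    return elites + children
```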
Deterministic program extraction
The final CoT was translated into concise, deterministic grid‑manipulation code, ensuring reproducibility for judging.
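In practice, “deterministic grid‑manipulation code” means a plain function that is accepted only if it reproduces every demonstration pair exactly. The illustrative snippet below uses a 180‑degree rotation as a stand‑in for an actual extracted program.

```python
def solve(grid):
    """Illustrative extracted program: a 180-degree rotation of the grid.
    A real run would emit whatever transform the final chain-of-thought encodes."""
    return [list(reversed(row)) for row in reversed(grid)]

def verify(program, task):
    """Accept a program only if it reproduces every demonstration output exactly."""
    return all(program(pair["input"]) == pair["output"] for pair in task["train"])
```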
Infrastructure Notes
They ran initial experiments on Strong Compute Burst Workstations; once those were validated, they scaled up training on the company’s ISC cluster of H100 GPUs, spun up on demand within minutes. Built‑in hot‑swap utilities and cycling_utils made it straightforward to patch issues without interrupting the 24‑hour clock.
Demo Day
During a ten‑minute slot, Arcified presented a concise slide deck: methodology overview, before‑and‑after solve counts, and a comparison showing their 85% success rate next to the single‑digit scores typical of Gemini 2.5 Pro and OpenAI o3 on the same training samples. Judges highlighted the rigorous data sampling strategy and clear empirical gains.
What Comes Next
Arcified.AI plans to release their ARC Evolve code once additional refactoring is complete, extend experiments to larger reasoning models with 300–400 RL steps, and continue pushing towards a full public entry in the broader ARC Grand Prize later this year. They also aim to investigate whether multiple parallel traces of longer chains‑of‑thought yield further gains.
Acknowledgements
Vijayraj Gohil and Aditya Shah thank Ben Sand, Adam Peaston, Tim Smoothy, and Rebecca Pham at Strong Compute for rapid infrastructure support and guidance throughout the event.
GitHub Repo - https://github.com/vraj130/ArcEvolve
Slides - https://docs.google.com/presentation/d/17f3aFA1XEIFqLk9RSQbdeNM7v0Pou0xetJ2CNeMIugo/edit?usp=sharing