This past weekend, my team, ClosedAI, participated in the ARC-AGI-2 track of Strong Compute's intense 24-hour hackathon, and we ended up winning! Here's a detailed look at our approach, the innovations we introduced, and the results we achieved.
What's ARC-AGI-2?
The ARC-AGI-2 benchmark, created by François Chollet, consists of 1,000 challenging visual puzzles designed to assess true abstract reasoning in AI. Human participants typically solve around 60% of these puzzles, whereas most existing AI models only manage between 10% and 20%. Each puzzle allows just two submission attempts, demanding high accuracy and generalization from minimal examples.
Our Strategy
Given the tight 24-hour constraint, we prioritized maximizing accuracy (pass@2) and computational efficiency. Our team divided the workload into two parallel streams: data augmentation and model architecture. Constant communication and rapid iteration allowed us to promptly resolve issues and share critical insights.
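For readers unfamiliar with the pass@2 metric mentioned above: a puzzle counts as solved if either of the two allowed submission attempts exactly matches the target grid. A minimal sketch of how such a scorer might look (the function names here are illustrative, not from our actual codebase):

```python
def pass_at_2(attempts, target):
    """A puzzle is solved if either of the first two submitted
    grids exactly matches the target grid."""
    return any(grid == target for grid in attempts[:2])

def score(tasks):
    """Fraction of tasks solved under pass@2.
    `tasks` is a list of (attempts, target) pairs, where each
    grid is a list of lists of color integers."""
    solved = sum(pass_at_2(attempts, target) for attempts, target in tasks)
    return solved / len(tasks)
```

Because only the best of two attempts counts, it pays to make the two submissions deliberately different rather than submitting the top candidate twice.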
Our Implementation
Synthetic Data Generation with LLMs
We built an automated data generation pipeline using large language models (LLMs). Starting from minimal human-provided examples, we generated hundreds of synthetic puzzle variations per task. These were then filtered and clustered to ensure a diverse and comprehensive training dataset.
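The pipeline above can be sketched roughly as follows. This is a simplified, runnable illustration, not our actual code: `llm_generate_variants` stands in for the real LLM call (here it just recolors grids so the example runs end to end), and the "clustering" is a cheap structural-signature dedup rather than a learned method.

```python
def llm_generate_variants(task, n=5):
    """Hypothetical stand-in for the LLM generation call; it
    mimics an LLM by producing recolored copies of the input
    so the pipeline below is runnable end to end."""
    grid = task["input"]
    variants = []
    for k in range(1, n + 1):
        recolored = [[(c + k) % 10 for c in row] for row in grid]
        variants.append({"input": recolored, "output": task["output"]})
    return variants

def signature(example):
    """Cheap structural signature used for dedup/clustering:
    grid shape plus the sorted multiset of colors."""
    grid = example["input"]
    colors = tuple(sorted(c for row in grid for c in row))
    return (len(grid), len(grid[0]), colors)

def build_dataset(tasks, per_task=5):
    """Generate variants for each seed task, then keep one
    example per signature cluster to encourage diversity."""
    clusters = {}
    for task in tasks:
        for example in [task] + llm_generate_variants(task, per_task):
            clusters.setdefault(signature(example), example)
    return list(clusters.values())
```

In practice the filtering step also needs to verify that each generated variant is still solvable by the intended rule, which is where most of the engineering effort goes.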
Custom Reasoning Token Blocks
To make our model’s reasoning transparent and easily debuggable, we introduced structured "token blocks." Each token block explicitly represented a distinct reasoning step, facilitating rapid error identification and correction.
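A token block can be as simple as a delimited span that names the reasoning step it represents, so a failing trace can be parsed and the offending step pinpointed. The format below is illustrative (not the exact syntax we used):

```python
import re

def emit_block(step_id, op, note):
    """Serialize one reasoning step as a delimited token block."""
    return f"<step id={step_id} op={op}>{note}</step>"

BLOCK_RE = re.compile(r"<step id=(\d+) op=(\w+)>(.*?)</step>")

def parse_trace(trace):
    """Recover (id, op, note) tuples from a reasoning trace so a
    failing step can be located quickly during debugging."""
    return [(int(i), op, note) for i, op, note in BLOCK_RE.findall(trace)]
```

The payoff is that when an output grid is wrong, you can diff the parsed trace against an expected sequence of operations instead of staring at raw tokens.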
The "Less Is More" Architecture (LIMO)
Inspired by recent research showing the effectiveness of minimal but precise prompts, we employed the LIMO architecture, consisting of:
A primitive encoder converting puzzle grids into structured embeddings.
A modular library of fundamental operations (rotate, mirror, count, color-match).
A neural scoring mechanism selecting the most plausible operation sequences.
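To make the three components concrete, here is a toy version of the operation library and sequence search. It substitutes exhaustive enumeration with an exact-match score for the neural scorer, and uses plain list-of-lists grids instead of learned embeddings, so treat it as a sketch of the idea rather than our implementation:

```python
from itertools import product

# Primitive operations on grids (lists of lists of ints): a tiny
# stand-in for the modular operation library described above.
OPS = {
    "identity": lambda g: g,
    "rotate90": lambda g: [list(row) for row in zip(*g[::-1])],
    "mirror_h": lambda g: [row[::-1] for row in g],
}

def score_sequence(seq, examples):
    """Fraction of training pairs the operation sequence maps
    input -> output exactly (replacing the neural scorer)."""
    hits = 0
    for inp, out in examples:
        grid = inp
        for name in seq:
            grid = OPS[name](grid)
        hits += (grid == out)
    return hits / len(examples)

def best_program(examples, max_len=2):
    """Exhaustively search short operation sequences and return
    the highest-scoring one with its score."""
    best, best_score = (), -1.0
    for n in range(1, max_len + 1):
        for seq in product(OPS, repeat=n):
            s = score_sequence(seq, examples)
            if s > best_score:
                best, best_score = seq, s
    return best, best_score
```

The real system replaces the brute-force loop with a learned scorer that prunes the search, which is what keeps per-puzzle runtime low.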
Results & Performance
Our combined approach achieved a 75% solve rate on the training puzzles, significantly outperforming the typical AI baseline of 10-20%. Each puzzle was solved in under one second, meeting the competition's strict efficiency criteria.
Infrastructure Utilization
Leveraging Strong Compute's Instant Super Computer (ISC) platform, we rapidly ran parameter sweeps and experiments across numerous A100 GPUs. Automated end-to-end submission checks surfaced issues quickly and kept the workflow running smoothly.
Lessons Learned and Future Directions
Early Automation: Integrating automated end-to-end tests early on saved significant debugging time.
Modular Design Advantages: Our modular and structured reasoning approach consistently outperformed monolithic models in accuracy and interpretability.
Future work will involve open-sourcing our synthetic data generation pipeline and reasoning token blocks, along with exploring meta-learning techniques for automatic reasoning strategy discovery.
Acknowledgments
We are grateful to Ben Sand, Adam Peaston, Tim Smoothy, and Rebecca Pham from Strong Compute for their invaluable support and mentorship throughout the event. Their assistance played a significant role in our success.
Written by Sanika Chavan, Mudit Sinha, Aman Priyanshu
Join us for our next ARC Prize Hackathon in SF and Sydney: https://lu.ma/strongcompute