<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Strong Words: Case Studies]]></title><description><![CDATA[Fast AI Training in action.]]></description><link>https://words.strongcompute.com/s/case-studies</link><image><url>https://substackcdn.com/image/fetch/$s_!Rmo5!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F731e7c1d-8da7-4129-9323-70d0f5f1e0f3_2046x2046.jpeg</url><title>Strong Words: Case Studies</title><link>https://words.strongcompute.com/s/case-studies</link></image><generator>Substack</generator><lastBuildDate>Sat, 02 May 2026 18:21:06 GMT</lastBuildDate><atom:link href="https://words.strongcompute.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Strong Compute]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[strongcomputewords@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[strongcomputewords@substack.com]]></itunes:email><itunes:name><![CDATA[Strong Compute]]></itunes:name></itunes:owner><itunes:author><![CDATA[Strong Compute]]></itunes:author><googleplay:owner><![CDATA[strongcomputewords@substack.com]]></googleplay:owner><googleplay:email><![CDATA[strongcomputewords@substack.com]]></googleplay:email><googleplay:author><![CDATA[Strong Compute]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Scaling from 5 to 256 GPUs with zero dev-ops in one week. 
]]></title><description><![CDATA[Accelerating Medical AI: How LayerJot Transformed Infrastructure Management with Strong Compute]]></description><link>https://words.strongcompute.com/p/scaling-from-5-to-256-gpus-with-zero</link><guid isPermaLink="false">https://words.strongcompute.com/p/scaling-from-5-to-256-gpus-with-zero</guid><dc:creator><![CDATA[Strong Compute]]></dc:creator><pubDate>Mon, 02 Feb 2026 00:05:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b4a501a0-c0e8-435d-a917-48861cfd1d24_2240x1260.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Without Strong Compute, this would have taken two full-time engineers 3-6 months.</p><p><strong>Before</strong></p><ul><li><p>On-premises compute hardware limited to 5 NVIDIA GPUs</p></li><li><p>Slow job migration and deployment between cloud providers</p></li><li><p>Limited visibility into resource utilization</p></li><li><p>High operational overhead managing compute resources</p></li></ul><p><strong>After</strong></p><ul><li><p>44 experiments run across 6 separate AI projects - 23 rapid iteration experiments, 21 long-run training experiments</p></li><li><p>6.5 hours total training time on 256 GPUs in 90 cloud machines across 3 different cloud providers - including H100 and A100 instances</p></li></ul><h2><strong>Challenge: Complex AI workloads, scarce hardware</strong></h2><p>LayerJot, a cutting-edge med-tech startup in Belmont, CA, faced a critical challenge common to AI-driven research teams: managing complex, compute-intensive workloads across multiple datasets and models.</p><p>LayerJot&#8217;s projects span:</p><ul><li><p>Computer vision for medical equipment catalog
processing</p></li><li><p>Multi-modal AI models like CLIP and Llama</p></li><li><p>Generalist robot policy models for surgical equipment handling</p></li></ul><h2><strong>Solution: Scaling from 5 to 256 GPUs with zero dev-ops</strong></h2><p>Strong Compute deployed an AI engineer on-site with LayerJot for a full week, working shoulder-to-shoulder with their team to optimize infrastructure and accelerate their AI workloads using the Strong Compute Instant Super Computer.</p><p></p><h2><strong>Technical Deep Dive: Datasets and Model Adaptation</strong></h2><h4>Data Ingested</h4><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/Il053/5/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25b56f68-5b09-458e-8b8c-abe8ff3d5795_1220x430.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0707cf03-5699-434f-a49e-e362e438597a_1220x500.png&quot;,&quot;height&quot;:205,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/Il053/5/" width="730" height="205" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h4>Models Adapted</h4><div id="datawrapper-iframe" class="datawrapper-wrap outer" 
data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/bC5bv/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d27af8b6-0ca8-411c-8d76-b2d0a7eb1298_1220x764.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e2a42a1-57c6-4501-bda9-16b75111a7f2_1220x764.png&quot;,&quot;height&quot;:372,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/bC5bv/2/" width="730" height="372" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2><strong>On-Site Collaboration: Beyond Infrastructure Management</strong></h2><p>For one intensive week, Strong Compute embedded an AI engineer directly at LayerJot&#8217;s Belmont, CA office. 
Our engineer worked side-by-side with LayerJot&#8217;s team, providing:</p><ul><li><p>Real-time infrastructure optimization</p></li><li><p>Hands-on model adaptation support</p></li><li><p>Direct troubleshooting of complex AI workload challenges</p></li><li><p>Custom infrastructure configuration tailored to LayerJot&#8217;s unique research needs</p></li></ul><h3><strong>Key Outcomes</strong></h3><ul><li><p>Resolved Dense Encoder code base issues and successfully ran experiments</p></li><li><p>Adapted CLIP-style model for Strong Compute checkpointing</p></li><li><p>Successfully trained VLA Robotics  repo in interactive containers</p></li><li><p>Integrated model checkpoints from ingested datasets</p></li><li><p>Demonstrated Claude Code&#8217;s capability to adapt complex legacy code bases for training on Strong Compute!</p></li></ul><h2><strong>Breakthrough Results</strong></h2><h3><strong>Performance Metrics</strong></h3><ul><li><p>Reduced job deployment time from hours to minutes</p></li><li><p>60GB/sec inter-cloud data transfer speed</p></li><li><p>7.8-second container launch times</p></li></ul><h3><strong>Operational Impact</strong></h3><ul><li><p>Resolved complex code base integration challenges</p></li><li><p>Enabled continuous experiment-based training</p></li><li><p>Simplified multi-provider infrastructure management</p></li></ul><h2><strong>Quote from the Customer</strong></h2><p>&#8220;Strong Compute transformed how we think about infrastructure. 
It&#8217;s not just a tool; they are a strategic partner in our AI development.&#8221; - Soren Harner, CEO, LayerJot</p><h2><strong>Looking Forward</strong></h2><p>LayerJot is now positioned to:</p><ul><li><p>Scale AI research more rapidly</p></li><li><p>Reduce infrastructure management overhead</p></li><li><p>Accelerate medical technology innovation</p></li></ul><p><a href="https://cp.strongcompute.ai/">Try Strong Compute Today</a></p><p><em>Strong Compute: Complete Command and Control for GPU Compute</em></p>]]></content:encoded></item><item><title><![CDATA[Arcified.AI Winning Playbook for Strong Compute ARC AGI 2 Hackathon]]></title><description><![CDATA[ML Engineers at 2K Games and Google DeepMind built ARC Evolve, solving 80% of training puzzles&#8212;far surpassing frontier models.]]></description><link>https://words.strongcompute.com/p/arcifiedai-winning-playbook-for-strong</link><guid isPermaLink="false">https://words.strongcompute.com/p/arcifiedai-winning-playbook-for-strong</guid><dc:creator><![CDATA[Strong Compute]]></dc:creator><pubDate>Wed, 18 Jun 2025 23:32:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AGjj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Strong&#8239;Compute hosted a 24&#8209;hour, round&#8209;the&#8209;clock sprint focused on the <strong><a href="https://arcprize.org/">ARC&#8239;AGI&#8239;2</a></strong><a href="https://arcprize.org/"> challenge</a>. 
When the dust settled, the overall prize in <em>Competition&#8239;A</em> went to a two&#8209;person team operating under <strong>Arcified&#8239;.AI</strong>.</p><p>Arcified&#8217;s members were <strong><a href="https://www.linkedin.com/in/vijaygohil/">Vijayraj Gohil</a></strong>, an ML engineer at 2K&#8239;Games, and <strong><a href="https://www.linkedin.com/in/aditya-shahh/">Aditya&#8239;Shah</a></strong>, an ML engineer at Google&#8239;DeepMind. Their final system, nicknamed <strong>ARC&#8239;Evolve</strong>, reached an <strong>&#8776;&#8239;80&#8239;% full&#8209;solve rate</strong> on training puzzles, far outperforming baseline numbers typically reported for large frontier models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AGjj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AGjj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AGjj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!AGjj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg 1272w,
https://substackcdn.com/image/fetch/$s_!AGjj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AGjj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15799189,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://words.strongcompute.com/i/166278873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AGjj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AGjj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!AGjj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!AGjj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c3ffed-e0c3-4404-bc8b-c3172d1daaf8_6000x4000.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><strong>Hackathon Winners: Vijay&#8239;raj Gohil and Aditya
Shah</strong></figcaption></figure></div><div><hr></div><h3><strong>What is ARC&#8239;AGI&#8239;2?</strong></h3><p>The ARC (Abstraction&#8239;&amp;&#8239;Reasoning Corpus) tasks created by Fran&#231;ois&#8239;Chollet test a model&#8217;s ability to infer symbolic transformations from tiny demonstration sets. ARC&#8239;AGI&#8239;2 raises the bar with new transformation families and a strict &#8220;all&#8209;or&#8209;nothing&#8221; scoring rule: a task is counted only if the model reproduces the entire output grid perfectly. The benchmark has become a proving ground for methods that claim progress toward more general reasoning.</p><div><hr></div><h3><strong>The Thought Behind the Build</strong></h3><p>Vijayraj Gohil and Aditya Shah together sketched a strategy they called <strong>&#8220;small data, big search.&#8221;</strong> That ethos shaped every design choice that followed.</p><div><hr></div><h2><strong>Strategy</strong></h2><h3><strong>From One&#8209;Shot RL&#8239;VR to AlphaEvolve&#8209;Style Search</strong></h3><p>Arcified&#8217;s technical recipe fuses two complementary ideas drawn from very recent literature:</p><ul><li><p><strong>One&#8209;Shot Reinforcement Learning with Verifiable Rewards (RL&#8239;VR)</strong>. A paper released in April&#8239;2025 showed that, for reasoning&#8209;heavy datasets, fine&#8209;tuning with just one carefully chosen example plus a binary &#8220;full&#8209;solve&#8221; reward can match or exceed thousand&#8209;sample runs. Arcified used this paradigm to initialize a compact 7&#8209;billion&#8209;parameter language model for ARC&#8239;AGI&#8239;2.</p></li><li><p><strong>AlphaEvolve search</strong>. Google&#8239;DeepMind&#8217;s AlphaEvolve project demonstrated how an LLM&#8209;guided evolutionary loop could discover matrix&#8209;multiplication breakthroughs after decades of stagnation.
Arcified adapted the same idea to iteratively refine chains&#8209;of&#8209;thought for ARC puzzles, letting a high&#8209;precision evaluator provide graded feedback between generations.</p></li></ul><p>By combining the two, the team produced a self&#8209;improving loop: RL&#8239;VR delivers an initial policy; AlphaEvolve&#8209;style search mutates that policy&#8217;s reasoning trace until it converges on a stable program that maps input to output.</p><div><hr></div><h3><strong>How It Works&#8212;A Closer Look</strong></h3><ol><li><p><strong>Task taxonomy and sampling<br></strong> ARC&#8239;AGI&#8239;2 examples fall into three geometric regimes:</p><ul><li><p><em>No&#8209;change</em> (input and output are the same size),</p></li><li><p><em>Contraction</em> (output is smaller), and</p></li><li><p><em>Expansion</em> (output is larger).<br></p></li></ul></li><li><p>Arcified built histograms to quantify the prevalence of each regime in the public training set, then repeated the analysis on held&#8209;out evaluation tasks. They discovered that most puzzles clustered in the no&#8209;change and contraction buckets. Using that insight, they curated <strong>ten &#8220;high&#8209;entropy&#8221; samples</strong>&#8212;balanced across regime and across three difficulty bands (easy, medium, hard)&#8212;to act as the sole training pool.<br></p></li><li><p><strong>Group&#8239;Relative Policy Optimisation (GRPO)<br><br></strong> The ten samples were duplicated and permuted to form a synthetic mini&#8209;corpus. GRPO fine&#8209;tuning rewarded only perfect grid matches (1/0 signal), steadily raising the policy&#8217;s success on unchanged&#8209;size puzzles to the mid&#8209;80&#8209;percent range.<br></p></li><li><p><strong>Evolutionary refinement<br><br></strong> Each RL&#8209;generated chain&#8209;of&#8209;thought (CoT) was passed to an evaluator LLM that produced fine&#8209;grained scores on intermediate steps. 
Those scores fed an evolutionary loop that mutated, recombined, and re&#8209;ranked CoTs, repeatedly bootstrapping better transforms until the evaluator&#8217;s reward plateaued.<br></p></li><li><p><strong>Deterministic program extraction<br><br></strong> The final CoT was translated into concise, deterministic grid&#8209;manipulation code, ensuring reproducibility for judging.<br></p></li></ol><div><hr></div><h3><strong>Infrastructure Notes</strong></h3><p>They ran initial experiments on <strong><a href="http://strongcompute.com/">Strong&#8239;Compute Burst Workstations</a></strong>; once tested, they scaled up training on the company&#8217;s <strong>ISC cluster of H100 GPUs</strong>, spun up on demand within minutes. Built&#8209;in hot&#8209;swap utilities and cycling_utils functionality made it straightforward to patch issues without interrupting the 24&#8209;hour clock.</p><div><hr></div><h3><strong>Demo Day</strong></h3><p>During a ten&#8209;minute slot, Arcified presented a <a href="https://docs.google.com/presentation/d/17f3aFA1XEIFqLk9RSQbdeNM7v0Pou0xetJ2CNeMIugo/edit?slide=id.g35a19fdc33b_0_173#slide=id.g35a19fdc33b_0_173">concise slide deck</a>: methodology overview, before&#8209;and&#8209;after solve counts, and a comparison showing their 85&#8239;% success rate next to the single&#8209;digit scores typical of Gemini&#8239;2.5&#8239;Pro and OpenAI&#8239;o3 on the same training samples. Judges highlighted the rigorous data sampling strategy and clear empirical gains.</p><div><hr></div><h3><strong>What Comes Next</strong></h3><p>Arcified&#8239;.AI plans to release their <strong>ARC&#8239;Evolve</strong> code once additional refactoring is complete, extend experiments to larger reasoning models with 300&#8211;400 RL steps, and continue pushing towards a full public entry in the broader ARC Grand&#8239;Prize later this year.
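</p><p>The binary &#8220;full&#8209;solve&#8221; reward and group&#8209;relative scoring described in the strategy above can be sketched in a few lines of Python. This is a hypothetical illustration, not the ARC Evolve code; the grid representation and function names are assumptions:</p>

```python
# Hypothetical sketch of the GRPO signal described above: each sampled
# solution earns a binary reward (1.0 only for a perfect grid match),
# and its advantage is measured relative to its sample group.
# Names and grid representation are illustrative, not from ARC Evolve.

def full_solve_reward(predicted_grid, target_grid):
    """Return 1.0 only if the entire output grid matches exactly."""
    return 1.0 if predicted_grid == target_grid else 0.0

def group_relative_advantages(rewards):
    """GRPO-style advantage: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0:  # all candidates tied: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one puzzle; only the first solves it fully.
target = [[1, 2], [3, 4]]
candidates = [
    [[1, 2], [3, 4]],  # perfect match -> reward 1.0
    [[1, 2], [4, 3]],  # nearly right -> still 0.0 (all-or-nothing)
    [[0, 0], [0, 0]],
    [[1, 2], [3, 0]],
]
rewards = [full_solve_reward(c, target) for c in candidates]
advantages = group_relative_advantages(rewards)
```

<p>Under the all&#8209;or&#8209;nothing rule, a near&#8209;miss earns the same zero reward as a blank grid, which is why careful sample curation carried so much weight in this setup.</p><p>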
They also aim to investigate whether multiple parallel traces of longer chains&#8209;of&#8209;thought yield further gains.</p><div><hr></div><h3><strong>Acknowledgements</strong></h3><p>Vijayraj Gohil and Aditya&#8239;Shah thank <strong>Ben&#8239;Sand, Adam&#8239;Peaston, Tim&#8239;Smoothy, and Rebecca&#8239;Pham</strong> at Strong&#8239;Compute for rapid infrastructure support and guidance throughout the event.</p><p>Github Repo -<a href="https://github.com/vraj130/ArcEvolve"> https://github.com/vraj130/ArcEvolve</a></p><p>Slides -<a href="https://docs.google.com/presentation/d/17f3aFA1XEIFqLk9RSQbdeNM7v0Pou0xetJ2CNeMIugo/edit?usp=sharing">https://docs.google.com/presentation/d/17f3aFA1XEIFqLk9RSQbdeNM7v0Pou0xetJ2CNeMIugo/edit?usp=sharing</a></p>]]></content:encoded></item><item><title><![CDATA[Text-to-Manim: Generating Visual Explanations using GRPO and Gemini Rewards]]></title><description><![CDATA[Automatically converting mathematical questions into visual animations using the Manim animation engine.]]></description><link>https://words.strongcompute.com/p/text-to-manim-generating-visual-explanations</link><guid isPermaLink="false">https://words.strongcompute.com/p/text-to-manim-generating-visual-explanations</guid><dc:creator><![CDATA[Strong Compute]]></dc:creator><pubDate>Mon, 16 Jun 2025 01:38:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Xg_N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xg_N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xg_N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Xg_N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Xg_N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Xg_N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xg_N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg" width="1024" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:274215,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://words.strongcompute.com/i/166034997?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xg_N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Xg_N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Xg_N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Xg_N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde210696-663a-490e-9c06-059bd2a8131a_1024x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hackathon Winners: Bernett Orlando, Ramprasadh Kumar and Karthik Ragunath Ananda Kumar</figcaption></figure></div><h3><strong>1. Introduction</strong></h3><p>Our overarching mission is to build a personalized AI tutor capable of delivering high-quality educational content to anyone, anywhere. We believe that true democratization of education can only be achieved by making learning deeply engaging, personalized, and universally accessible. 
A critical component of this vision is the ability to transform abstract questions into clear, visual explanations &#8212; a method proven to resonate more effectively with the way humans understand complex concepts.</p><p>In this hackathon project, we focused on one specific but essential challenge: <strong>automatically converting mathematical questions into visual animations</strong> using the Manim animation engine. The end goal is to empower students with dynamic visualizations that enhance understanding, retention, and conceptual clarity &#8212; especially in STEM education.</p><div><hr></div><h3><strong>2. Problem Statement</strong></h3><p>Humans are inherently visual learners. Concepts that are difficult to grasp through text-based explanations can often be instantly clarified through animation or visual demonstrations. Despite the power of this modality, creating educational animations remains a time-intensive and highly manual process. Our challenge was to automate this pipeline: given a natural language mathematical question, can we generate a <strong>Manim animation script</strong> that explains the solution visually?</p><div><hr></div><h3><strong>3. Initial Approach: Supervised Fine-Tuning (SFT)</strong></h3><p>We began by attempting a <strong>supervised fine-tuning (SFT)</strong> approach. Specifically, we fine-tuned the DeepSeek LLM using a dataset of input-output pairs, where:</p><ul><li><p>Input = a mathematical question</p></li><li><p>Output = the corresponding Manim script to animate the explanation</p></li></ul><p>We also attempted to incorporate <strong>Chain-of-Thought (CoT)</strong> reasoning in the outputs, guiding the model to not only solve the problem but also break it down into explanatory visual steps.</p><h4><strong>Challenges</strong></h4><p>However, we encountered two major limitations:</p><ol><li><p><strong>Lack of high-quality training data:</strong> Manim-query pairs are a highly niche and scarce dataset. 
Publicly available examples are limited in volume and diversity.</p></li><li><p><strong>Absence of Chain-of-Thought (CoT) annotations:</strong> Even where datasets exist, few contain intermediate reasoning steps essential for generating coherent explanatory animations.</p></li></ol><p>Due to these challenges, the SFT approach failed to generalize well and lacked visual accuracy and semantic coherence.</p><div><hr></div><h3><strong>4. Proposed Solution: GRPO with Reward Modeling via Gemini as External Judge</strong></h3><p>To address these limitations, we pivoted to a <strong>novel reinforcement learning framework</strong> based on <strong>GRPO (Group Relative Policy Optimization)</strong>. Instead of relying on static data, we introduced an <strong>external LLM-based reward model</strong> &#8212; built on top of Gemini &#8212; to act as a <strong>judge</strong> of the model&#8217;s outputs. This model provided feedback on the quality of generated animations, enabling us to train the base model using reward signals rather than hardcoded labels.</p><h4><strong>Reward Model Criteria</strong></h4><p>Our reward model evaluated each generated Manim animation based on the following five criteria:</p><ol><li><p><strong>Prompt Consistency<br></strong> <em>Does the animation match the original mathematical prompt in terms of objects involved, actions depicted, and conceptual correctness?<br></em></p></li><li><p><strong>Screen Fit<br></strong> <em>Do the visual elements stay within the canvas boundaries? Do any objects overflow or render off-screen?<br></em></p></li><li><p><strong>Non-overlapping Layout<br></strong> <em>Are the visual elements well-spaced? Do objects overlap in distracting or confusing ways?<br></em></p></li><li><p><strong>Semantic Coherence<br></strong> <em>Does the animation make logical sense? For example, do equations appear where expected?
Are objects used in appropriate ways?<br></em></p></li><li><p><strong>Clarity of Explanation<br></strong> <em>Is the final animation pedagogically effective? Would a student find it helpful in understanding the concept?<br></em></p></li></ol><p>These multi-dimensional reward signals allowed us to optimize for visual, spatial, and semantic quality &#8212; aspects that are difficult to enforce via traditional supervised learning.</p><div><hr></div><h3><strong>5. Results and Observations</strong></h3><p>With GRPO and Gemini-based reward modeling, our model demonstrated <strong>significantly better convergence</strong> compared to SFT. Not only did the animations become more visually accurate, but the overall explanatory coherence also improved. The model was able to generalize across a range of simple mathematical prompts and produce clear, legible Manim animations with minimal hallucinations or layout issues.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;f93446d8-aba1-48d6-83d0-cfb130b14d97&quot;,&quot;duration&quot;:null}"></div><p></p><div><hr></div><h3><strong>6. Future Directions</strong></h3><p>This project represents just the beginning of our journey toward building a fully autonomous AI tutor. Moving forward, we plan to:</p><ul><li><p>Expand the complexity and diversity of supported mathematical questions (algebra, calculus, geometry, etc.)</p></li><li><p>Integrate real-time preview and editing tools for generated animations</p></li><li><p>Incorporate user feedback and corrections into the reward signal (RLHF loop)</p></li><li><p>Extend support beyond Manim to other visual engines and modalities (e.g., interactive graphs, 3D geometry)<br></p></li></ul><p>We are excited to continue developing this project with the support of <strong>StrongCompute</strong>, and look forward to pushing the boundaries of personalized AI education.</p><div><hr></div><h3><strong>7. 
Acknowledgments</strong></h3><p>We thank the hackathon organizers and the community for providing a platform to explore such impactful ideas. We are especially grateful to Strong Compute for providing infrastructure and support.</p><div><hr></div><p>This post was written by Karthik Ragunath Ananda Kumar, AI Researcher @ Tavus Inc, Bernett Orlando, Senior ML SWE @ Google Research and Ramprasadh Kumar, Systems @ NVIDIA</p><p></p><p>Links:</p><ul><li><p>Presentation slides: <a href="https://docs.google.com/presentation/d/1wDSfzwl4mtj5r4oWJa_uXkyfGZJWdCPmQXv4D-Wll4M">Google Slides</a></p></li><li><p>Github repo: <a href="https://github.com/ramprasadhkumar/deepseek-video-gen">Link here</a></p></li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[How ClosedAI Won Strong Compute's ARC AGI2 Hackathon #9: Our Journey]]></title><description><![CDATA[This past weekend, my team, ClosedAI, participated in the ARC AGI2 track of Strong Compute&#8217;s intense 24-hour hackathon&#8212;and we ended up winning!]]></description><link>https://words.strongcompute.com/p/how-closedai-won-strong-computes</link><guid isPermaLink="false">https://words.strongcompute.com/p/how-closedai-won-strong-computes</guid><dc:creator><![CDATA[Strong Compute]]></dc:creator><pubDate>Tue, 06 May 2025 22:57:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WBT-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WBT-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WBT-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg 424w, https://substackcdn.com/image/fetch/$s_!WBT-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg 848w, https://substackcdn.com/image/fetch/$s_!WBT-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!WBT-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WBT-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:263968,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://words.strongcompute.com/i/163013307?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WBT-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg 424w, https://substackcdn.com/image/fetch/$s_!WBT-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg 848w, https://substackcdn.com/image/fetch/$s_!WBT-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!WBT-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49cd5d79-dd43-418f-8beb-903884bbad84_2048x1152.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Winners: Aman Priyanshu, Sinha, Sanika Chavan, Mudit Sinha</figcaption></figure></div><p></p><p>This past weekend, my team, ClosedAI, participated in the ARC AGI2 track of Strong Compute&#8217;s intense 24-hour hackathon&#8212;and we ended up winning! Here's a detailed look at our approach, the innovations we introduced, and the results we achieved.</p><p><strong>What's ARC-AGI-2?</strong></p><p>The <a href="https://arcprize.org/">ARC-AGI-2</a> benchmark, created by Fran&#231;ois Chollet, consists of 1,000 challenging visual puzzles designed to assess true abstract reasoning in AI. Human participants typically solve around 60% of these puzzles, whereas most existing AI models only manage between 10% and 20%. 
Each puzzle allows just two submission attempts, demanding high accuracy and generalization from minimal examples.</p><p><strong>Our Strategy</strong></p><p>Given the tight 24-hour constraint, we prioritized maximizing accuracy (pass@2) and computational efficiency. Our team divided the workload into two parallel streams: data augmentation and model architecture. Constant communication and rapid iteration allowed us to promptly resolve issues and share critical insights.</p><p><strong>Our Implementation</strong></p><p><strong>Synthetic Data Generation with LLMs</strong></p><p>We built an automated data generation pipeline using large language models (LLMs). Starting from minimal human-provided examples, we generated hundreds of synthetic puzzle variations per task. These were then filtered and clustered to ensure a diverse and comprehensive training dataset.</p><p><strong>Custom Reasoning Token Blocks</strong></p><p>To make our model&#8217;s reasoning transparent and easily debuggable, we introduced structured "token blocks." Each token block explicitly represented a distinct reasoning step, facilitating rapid error identification and correction.</p><p><strong>The "Less Is More" Architecture (LIMO)</strong></p><p>Inspired by recent research showing the effectiveness of minimal but precise prompts, we employed the LIMO architecture, consisting of:</p><ul><li><p>A primitive encoder converting puzzle grids into structured embeddings.</p></li><li><p>A modular library of fundamental operations (rotate, mirror, count, color-match).</p></li><li><p>A neural scoring mechanism selecting the most plausible operation sequences.</p></li></ul><p><strong>Results &amp; Performance</strong></p><p>Our combined approach achieved a 75% resolution rate on the training puzzles, significantly outperforming the typical AI baseline performance of 10-20%. 
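</p><p>To make the LIMO pipeline concrete, here is a minimal sketch of its selection step (searching short sequences of primitive grid operations and keeping the best scorer). The tiny primitive library and exact-match scorer below are illustrative stand-ins, not our actual modules:</p>

```python
# Illustrative LIMO-style selection step: apply short sequences of
# primitive grid operations and keep the sequence that best reproduces
# the training examples. Names and primitives are toy stand-ins.
from itertools import product

def rotate(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def mirror(grid):
    """Flip a grid left-to-right."""
    return [row[::-1] for row in grid]

PRIMITIVES = {"rotate": rotate, "mirror": mirror}

def score(candidate, examples):
    """Fraction of training pairs the candidate program reproduces exactly."""
    hits = sum(1 for inp, out in examples if candidate(inp) == out)
    return hits / len(examples)

def best_program(examples, max_len=2):
    """Search all primitive sequences up to max_len; return the top scorer."""
    best, best_s = None, -1.0
    for length in range(1, max_len + 1):
        for seq in product(PRIMITIVES, repeat=length):
            def prog(g, seq=seq):
                for name in seq:
                    g = PRIMITIVES[name](g)
                return g
            s = score(prog, examples)
            if s > best_s:
                best, best_s = seq, s
    return best, best_s

# One training pair: the output is the input rotated clockwise.
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
print(best_program(examples))  # -> (('rotate',), 1.0)
```

<p>In the actual system a neural scoring mechanism ranks operation sequences; exact match stands in for it here.</p><p>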
Each puzzle was solved in less than one second, meeting the competition&#8217;s strict efficiency criteria.</p><p><strong>Infrastructure Utilization</strong></p><p>Leveraging Strong Compute&#8217;s Instant Super Computer (ISC) platform, we rapidly conducted parameter sweeps and experiments across numerous A100 GPUs. Automated end-to-end submission checks ensured quick identification and resolution of issues, maintaining seamless workflow continuity.</p><p><strong>Lessons Learned and Future Directions</strong></p><ul><li><p><strong>Early Automation</strong>: Integrating automated end-to-end tests early on was critical in saving debugging time.</p></li><li><p><strong>Modular Design Advantages</strong>: Our modular and structured reasoning approach consistently outperformed monolithic models in accuracy and interpretability.</p></li></ul><p>Future work will involve open-sourcing our synthetic data generation pipeline and reasoning token blocks, along with exploring meta-learning techniques for automatic reasoning strategy discovery.</p><p><strong>Acknowledgments</strong></p><p>We are grateful to Ben Sand, Adam Peaston, Tim Smoothy, and Rebecca Pham from Strong Compute for their invaluable support and mentorship throughout the event. 
Their assistance played a significant role in our success.</p><p>Written by Sanika Chavan, Mudit Sinha, Aman Priyanshu</p><p>Github repo link: <a href="https://github.com/sanikac10/Annotating-ARC-AGI-2/tree/main/Annotating-ARC-AGI-2-main">https://github.com/sanikac10/Annotating-ARC-AGI-2/tree/main/Annotating-ARC-AGI-2-main </a></p><div><hr></div><p>Join us for our next ARC Prize Hackathon in SF and Sydney: <a href="https://lu.ma/strongcompute">https://lu.ma/strongcompute </a></p>]]></content:encoded></item><item><title><![CDATA[Strong Compute GPU Hackathon Recap: DeepCertainty: No Hallucinations, Just Results]]></title><description><![CDATA[We&#8217;ve been running GPU hackathons in San Francisco and Sydney to see what happens when you give smart people full access to compute.]]></description><link>https://words.strongcompute.com/p/strong-compute-gpu-hackathon-recap</link><guid isPermaLink="false">https://words.strongcompute.com/p/strong-compute-gpu-hackathon-recap</guid><dc:creator><![CDATA[Strong Compute]]></dc:creator><pubDate>Fri, 28 Mar 2025 03:50:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a1e6643c-b613-43cd-a739-dd6215b305fa_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve been running GPU hackathons in San Francisco and Sydney to see what happens when you give smart people full access to compute.</p><p>The most exciting projects aren&#8217;t just clever &#8212; they&#8217;re grounded. They tie model output to something you can <em>check</em>. A compile. A benchmark. A math proof. A correct answer, not just a convincing one.</p><p>That&#8217;s a subtle but powerful shift. A lot of machine learning treats model output like a good guess &#8212; probabilistic, fuzzy, often right but not always reproducible. 
These projects took a different approach: <strong>don&#8217;t just generate something &#8212; generate something you can verify.</strong></p><p>And the difference shows.</p><div><hr></div><p><strong>No Hallucinations</strong></p><p>We&#8217;ve seen a move away from the traditional &#8220;trust the model&#8221; mindset toward something more rigorous: <strong>can we prove this works?</strong></p><p>This is especially important in code generation, scientific reasoning, and anything where correctness matters. When you&#8217;re training or fine-tuning on tasks that involve real-world outcomes &#8212; not just vibes &#8212; you need more than confidence. You need certainty.</p><p>At our March hackathons, we saw CUDA and Math Fine Tunings that show provable deep learning is practical:</p><div><hr></div><p><strong>CUDA Codegen from PyTorch Modules</strong></p><p>One team built a smart transpiler that takes PyTorch modules and converts them into CUDA kernels. The model generates CUDA code and then evaluates each candidate across three dimensions:</p><ul><li><p><strong>Does it compile?</strong></p></li><li><p><strong>Does it produce the correct output?</strong></p></li><li><p><strong>Is it faster than the original?</strong></p></li></ul><p>This is a huge unlock. Because now, instead of relying on token-by-token loss or human labels, you can score the model&#8217;s output <em>based on reality</em>. Compilation success becomes a training signal. Runtime performance becomes a benchmark. 
And correctness becomes a pass/fail gate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n8rG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n8rG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!n8rG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg 848w, https://substackcdn.com/image/fetch/$s_!n8rG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!n8rG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n8rG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg" width="1456" height="1092" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n8rG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!n8rG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg 848w, https://substackcdn.com/image/fetch/$s_!n8rG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!n8rG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa466b977-617f-4855-a6d3-e217823ec753_1600x1200.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Winning team: Robert Zhang, JRH, our CEO, Ben Sand, and Rahman Hajiyev </figcaption></figure></div><p>They used a method inspired by DeepSeek &#8212; sampling multiple CUDA candidates, scoring them relatively, and feeding that back into training via group-relative policy optimization. 
It&#8217;s reinforcement learning with a feedback loop rooted in physics, not language.</p><p><strong>Results (from Fine Tuning Llama DeepSeek7B on 8x L4s through Strong Compute)</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3h9M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3h9M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png 424w, https://substackcdn.com/image/fetch/$s_!3h9M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png 848w, https://substackcdn.com/image/fetch/$s_!3h9M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png 1272w, https://substackcdn.com/image/fetch/$s_!3h9M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3h9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png" width="822" height="158" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:158,&quot;width&quot;:822,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17860,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://words.strongcompute.com/i/160041723?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3h9M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png 424w, https://substackcdn.com/image/fetch/$s_!3h9M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png 848w, https://substackcdn.com/image/fetch/$s_!3h9M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png 1272w, https://substackcdn.com/image/fetch/$s_!3h9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04617b40-2a42-4000-9cbc-814a4e88a5ce_822x158.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Check out the winning team&#8217;s presentation <a 
href="https://docs.google.com/presentation/d/14G9QH71z-XYlAO_YZ0Xs7mgH8keRca65reWBF1Vwe8M/edit#slide=id.g33b0a5e8485_3_37">here</a>.</p><div><hr></div><p><strong>Mathematical Reasoning with Python Tool-Calling</strong></p><p>Another project focused on mathematical reasoning &#8212; but with a twist.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3k_L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3k_L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3k_L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3k_L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3k_L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3k_L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg" width="1456" height="1092" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3k_L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3k_L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3k_L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3k_L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c645e6d-b3e4-4cc7-817b-d48389078a6c_1600x1200.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Runners up: Karthik Ragunath Ananda Kumar</figcaption></figure></div><p>Rather than having the model do all the work internally (and risking a hallucinated equation), it called out to Python tools mid-inference. For example, it might solve part of a problem itself, then delegate the numerical computation to a verified function.</p><p>This kind of delegation is exciting. It opens the door to integrating with formal verification tools like Lean&#8212; not just solving math problems, but producing <strong>verifiable, explainable</strong> proofs.</p><p>In practice, mathematicians don&#8217;t just want to know <em>if</em> something is true. They want to understand <em>why</em>. 
The model becomes a co-pilot, helping construct the steps &#8212; not just giving you a binary answer.</p><ul><li><p><a href="https://docs.google.com/presentation/d/1NRduRoailh_MvOQPNf2Dcj9tGKxSq2z6/edit">Check out Karthik and Divya&#8217;s presentation</a></p></li><li><p>GitHub Link For Fine-tuning: <a href="https://github.com/Karthik-Ragunath/isc-demos-karthik/tree/main/deepseek">https://github.com/Karthik-Ragunath/isc-demos-karthik/tree/main/deepseek</a></p></li><li><p>Inference Code: <a href="https://github.com/Karthik-Ragunath/isc-demos-karthik/blob/main/deepseek/inference_consolidated.py">https://github.com/Karthik-Ragunath/isc-demos-karthik/blob/main/deepseek/inference_consolidated.py</a></p></li></ul><div><hr></div><p><strong>Why This Matters</strong></p><p>Verifiable machine learning isn&#8217;t just a niche &#8212; it&#8217;s the direction the field needs to go.</p><p>We&#8217;ve all seen what happens when models are powerful but ungrounded. Outputs that look right but aren&#8217;t. Answers that sound convincing until you test them.</p><p>These projects &#8212; and the teams behind them &#8212; are showing what it looks like to go beyond that. To treat model outputs not as a final product, but as hypotheses. And then build systems that can validate them, at speed.</p><p>We want Strong Compute hackathons to keep pushing in this direction: ideas that are smart <em>and</em> measurable. Tools that show their work. 
Models that can be trusted <em>because</em> they&#8217;re tested.</p><div><hr></div><p><strong>Join to Hack on ARC Prize or Fine-Tune Deep Seek April 18&#8211;19.</strong></p><p>We&#8217;re bringing the GPUs and the hacker house energy back again.</p><p>Whether you choose to push the frontier on reasoning (ARC Prize) or scale a smarter distillation demo (Deep Seek), we&#8217;ve got clusters, food, desks, and a clean training setup ready for you.</p><p><strong>Previous Winners and Grantees:</strong></p><ul><li><p><strong>PyTorch &#8594; CUDA Fine-Tuning</strong>: Improved translation accuracy from 10% to 30%.</p></li><li><p><strong>ARC Prize</strong>: Our grantee placed 2nd in the 2024 ARC contest.</p></li><li><p><strong>Chess Bots</strong>: Trained from scratch to 2000 ELO in just 10 hours.</p></li></ul><p>For engineers, AI researchers, students &#8212; anyone comfortable with PyTorch.</p><p>We provide the Instant Super Computer (ISC), so you can start training multinode in under an hour. No setup headaches. No fuss.</p><p><em>Engineers only. All code. No slidegineers or recruiters. 
All applicants vetted for technical fit.</em></p><div><hr></div><p><strong>Competition A: ARC Prize Challenge</strong></p><ul><li><p>Compete to win compute for the 2025 ARC Prize</p></li><li><p>Work on unsolved ARC-AGI-2 tasks with full resources and benchmarks</p></li><li><p>Judged on research rigor, novelty, and benchmark performance</p></li></ul><p><strong>Competition B: Deep Seek Fine-Tuning</strong></p><ul><li><p>Fine-tune DeepSeek-R1 distill variants on your dataset</p></li><li><p>Show what your model can do that the base model can&#8217;t</p></li><li><p>Model sizes: 1.5B to 70B &#8212; all provided</p></li></ul><p><strong>Prize: $2.5K&#8211;$25K Research Compute Grant</strong></p><div><hr></div><p>Let&#8217;s push the frontier &#8212; together.</p><p><a href="https://lu.ma/eptxamdp?utm_source=strongwords">Apply now</a> &#8212; see you April 18-19.</p>]]></content:encoded></item><item><title><![CDATA[Scaling AI Research from 4 to 60+ GPUs: How Strong Compute Enabled InSite's AI for Construction Monitoring.]]></title><description><![CDATA[A Case Study with Insite Project Solutions]]></description><link>https://words.strongcompute.com/p/scaling-ai-research-from-4-to-60</link><guid isPermaLink="false">https://words.strongcompute.com/p/scaling-ai-research-from-4-to-60</guid><dc:creator><![CDATA[Strong Compute]]></dc:creator><pubDate>Wed, 04 Dec 2024 02:38:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Customer Situation</strong></h2><p><strong>Insite is developing AI monitoring for Construction Projects.</strong></p><p><strong>The development approach used many university research teams (160 developers) to prototype solutions.</strong></p><p><strong>A large scale up of compute management was 
needed.</strong></p><p>Cian, founder of InSite Project Solutions, had to coordinate 160 university students across 26 AI research teams - a massive expansion from previous years. He faced a critical infrastructure challenge. His existing in-house compute setup of 3-4 GPUs couldn't support this scale of concurrent AI development, putting the timeline and resources for developing computer vision models for construction sites at risk.</p><h2><strong>Project Goals</strong></h2><p>Radical improvement to construction site monitoring through AI-powered, ultra-high-resolution imagery analysis. The solution delivers:</p><ul><li><p>24/7 monitoring with unprecedented detail, capturing site activity up to 800 meters away</p></li><li><p>80% improvement in AI model performance using 64-megapixel imagery</p></li><li><p>6-10x cost reduction compared to traditional on-site project planning</p></li><li><p>Real-time analytics and comprehensive reporting for construction managers</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!phSu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!phSu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png 424w, https://substackcdn.com/image/fetch/$s_!phSu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png 848w, 
https://substackcdn.com/image/fetch/$s_!phSu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png 1272w, https://substackcdn.com/image/fetch/$s_!phSu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!phSu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png" width="1456" height="1205" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1205,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!phSu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png 424w, https://substackcdn.com/image/fetch/$s_!phSu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png 848w, 
https://substackcdn.com/image/fetch/$s_!phSu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png 1272w, https://substackcdn.com/image/fetch/$s_!phSu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F272ec38d-bec2-460e-9efe-235e263646e5_1600x1324.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Previous Infrastructure Limitations</strong></h2><p>Without the Strong Compute, InSite&#8217;s AI developers faced 
significant technical hurdles:</p><ul><li><p>Limited GPU availability creating research bottlenecks</p></li><li><p>No job scheduling system, leading to "first-come, first-served" chaos</p></li><li><p>Resource conflicts with teams frequently hitting "CUDA Out of Memory" errors</p></li><li><p>Performance degradation from concurrent workloads</p></li></ul><h2><strong>The Strong Compute Solution</strong></h2><p>Strong Compute expanded InSite's research capabilities by:</p><ul><li><p>Seamlessly scaling from 4 to 60+ GPUs</p></li><li><p>Eliminating infrastructure management overhead</p></li><li><p>Providing robust job scheduling and resource allocation</p></li><li><p>Enabling true parallel research across 26 teams</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ElZL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ElZL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png 424w, https://substackcdn.com/image/fetch/$s_!ElZL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png 848w, https://substackcdn.com/image/fetch/$s_!ElZL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ElZL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ElZL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png" width="1456" height="1069" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1069,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ElZL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png 424w, https://substackcdn.com/image/fetch/$s_!ElZL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png 848w, https://substackcdn.com/image/fetch/$s_!ElZL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ElZL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6741acb-43be-4c49-8b5c-ff4870f73a02_1600x1175.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rLN4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rLN4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png 424w, https://substackcdn.com/image/fetch/$s_!rLN4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png 848w, https://substackcdn.com/image/fetch/$s_!rLN4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png 1272w, https://substackcdn.com/image/fetch/$s_!rLN4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rLN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png" width="1456" height="868" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!rLN4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png 424w, https://substackcdn.com/image/fetch/$s_!rLN4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png 848w, https://substackcdn.com/image/fetch/$s_!rLN4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png 1272w, https://substackcdn.com/image/fetch/$s_!rLN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bf3631e-94ca-4fb9-aa1c-f8d3f3590d3a_1600x954.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Business Impact</strong></h2><p>"It just worked in the background," says Cian. "I didn't need to manage it. I just set up the accounts and the users went in and used it. I didn't need to get in and fix it if it broke or set it up or whatever. It just worked."</p><p>Strong Compute enabled InSite Project Solutions to:</p><ul><li><p>Successfully manage an 8x growth in AI researchers</p></li><li><p>Accelerate development of cutting-edge computer vision models</p></li><li><p>Avoid expensive cloud computing costs and complex cloud configurations</p></li><li><p>Focus on innovation instead of infrastructure management</p></li></ul><h2><strong>Why Strong Compute?</strong></h2><p>Strong Compute proved to be the perfect solution for scaling AI research operations:</p><ul><li><p>Zero infrastructure management overhead</p></li><li><p>Immediate access to massive GPU computing power</p></li><li><p>Cost-effective alternative to cloud providers</p></li><li><p>Built-in safeguards against runaway computing costs</p></li><li><p>Seamless onboarding for large research teams</p></li></ul><p>Using Strong Compute, InSite Project Solutions transformed a potential operational nightmare into a seamless research operation. 
This enabled breakthrough innovations in construction site monitoring while managing a record number of concurrent AI development teams.</p>]]></content:encoded></item><item><title><![CDATA[Inside Our Chess Bot Hackathons and Zero-Code Cluster]]></title><description><![CDATA[A few months ago, we kicked off our AI chess bot hackathons with a big question: How can we make AI training more accessible while showcasing our zero-code cluster management?]]></description><link>https://words.strongcompute.com/p/inside-our-chess-bot-hackathons-and</link><guid isPermaLink="false">https://words.strongcompute.com/p/inside-our-chess-bot-hackathons-and</guid><dc:creator><![CDATA[Strong Compute]]></dc:creator><pubDate>Wed, 30 Oct 2024 03:38:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/83af360e-00e4-4006-9b76-036c8185890f_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few months ago, we kicked off our AI chess bot hackathons with a big question: <em>How can we make AI training more accessible while showcasing our zero-code cluster management?</em> </p><p>Inspired to push the boundaries, we decided to build a chess bot in a weekend. </p><p>What started as an ambitious project has evolved into a proving ground for our capabilities, with 1,100 GPUs across five providers and 40 engineers running simultaneous training workloads.</p><h3><strong>The Power Behind the Hackathons: Our System&#8217;s Capabilities</strong></h3><p>Our system, refined over two years, can handle complex, large-scale workloads seamlessly. Here&#8217;s what sets it apart:</p><ul><li><p><strong>Up to 90GB/sec (720Gbps) on-cluster data read speed</strong></p></li><li><p><strong>Up to 60GB/sec (480Gbps) cloud-to-cloud data transfer</strong></p></li><li><p><strong>Up to 20GB/sec (160Gbps) to a single node for container loads</strong></p></li><li><p><strong>Integrated across 6 cloud providers</strong></p></li><li><p><strong>Scales to support 1,000+ GPUs and 40 developers simultaneously</strong></p></li><li><p><strong>Compatibility with GPUs (H100, A100, A10), scaling from 1 GPU to 16 GPUs per node</strong></p></li><li><p><strong>Infiniband &amp; Ethernet support for high-performance needs</strong></p></li></ul><p>With this setup, developers can scale from a single GPU to a full cluster in just an hour. 
We introduced Live Billing Systems and Real-Time Cost Controls to keep costs manageable, offering features like per-developer budgets and one-click stop controls.</p><p></p><h3><strong>Recap: Previous Hackathons</strong></h3><h4><strong>Hackathon 1 - Chess vision</strong></h4><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d61065ba-5973-474a-a462-3be2bf6f932b_1080x1620.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78fa754b-3d68-48e0-bb26-3441e512efef_1620x1080.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a13c2d7-1046-455a-a1b8-cec639fcbc16_1620x1080.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84a50f06-eda4-49a9-ae7e-1f0710735655_1620x1080.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b29713e-a0d3-487f-896d-c213dea4cdc4_1620x1080.jpeg&quot;}],&quot;caption&quot;:&quot;&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12f4a46a-9a37-4e7f-b864-f1a32ad2162f_1456x1210.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p>Our first cut of the chess hackathon concept formulated the task as a regression problem. What does a human do when deciding which move to make?&nbsp;</p><p>Well, who really knows what <em>humans</em> do, but what <em>we</em> do is consider a handful of potential moves (maybe even all possible moves) and develop a feeling for which are good and which are bad. 
Then we pick the move that feels like the best one.</p><p>To replicate this process with an AI, we train a neural network to calculate that &#8220;feeling&#8221; as a quantitative score for every potential move, and then we sample from the distribution described by those scores to select a move.</p><p>By &#8220;a move&#8221; we mean a potential board state that the player could move to: the state of the board at the end of the move. We encode the board as an 8x8 tensor of integers and pass that as input to our neural network to evaluate.</p><p>We also transform the board from being &#8220;white pieces&#8221; and &#8220;black pieces&#8221; to being &#8220;my pieces&#8221; and &#8220;opponent pieces&#8221;, orienting the board accordingly, such that the model is always asked to score the board from the perspective of the player about to move.</p><p>We included two example model architectures suitable for this task in the chess-hackathon repository: a ResNet-based Convolutional Neural Network (CNN) and a Transformer-based model.&nbsp;</p><p>Both model types relied on learned embeddings. In the case of the CNN, embeddings were used to convert the 2D tensor of integers into a 3D tensor of floats whose third dimension is analogous to the channels of an image. 
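</p><p>The encoding described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the chess-hackathon repository&#8217;s actual code: it parses the board field of a FEN string into an 8x8 grid of integers, with the mover&#8217;s pieces positive, the opponent&#8217;s pieces negative, and the board reoriented when black is to move.</p>

```python
# Illustrative sketch (not the chess-hackathon repo's actual code):
# encode a board as an 8x8 grid of integers, always from the
# perspective of the player about to move.
PIECE_CODES = {"p": 1, "n": 2, "b": 3, "r": 4, "q": 5, "k": 6}

def encode_board(piece_placement, white_to_move):
    """piece_placement: the board field of a FEN string.
    Returns an 8x8 list of ints: +code for the mover's pieces,
    -code for the opponent's pieces, 0 for empty squares."""
    rows = []
    for rank in piece_placement.split("/"):
        row = []
        for ch in rank:
            if ch.isdigit():                 # a digit means that many empty squares
                row.extend([0] * int(ch))
            else:
                code = PIECE_CODES[ch.lower()]
                sign = 1 if (ch.isupper() == white_to_move) else -1
                row.append(sign * code)      # "my" pieces positive, opponent negative
        rows.append(row)
    if not white_to_move:
        # flip ranks and files so the mover always plays "up" the board
        rows = [list(reversed(r)) for r in reversed(rows)]
    return rows

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
board = encode_board(start, white_to_move=True)
```

<p>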
For the Transformer model, embeddings were additively infused with positional information.</p><p>The strongest models from the first hackathon round were predominantly CNN-based models.</p><h4><strong>Hackathon 2 - ChessGPT</strong></h4><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dfbda91-b2e9-4a17-9d71-123639dbee6c_5712x4284.heic&quot;},{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03c80119-9d25-4a24-bb8c-a633820db0ee_5712x4284.heic&quot;},{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b37dd79-9bd3-437b-8e66-8755ca5a1197_4032x3024.heic&quot;}],&quot;caption&quot;:&quot;&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51e3282c-b542-4acd-ba4f-5fd858a00da4_1456x474.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p></p><p>Throughout the course of the first hackathon we got a lot of questions about LLMs. Can we bring them? Can we use them? Our answer was essentially &#8220;no&#8221;.&nbsp;</p><p>Firstly we had decided that all models must be trained from initialization (from scratch) throughout the course of the hackathon, no pre-trained model weights were allowed. Secondly the task that we had formulated did not seem at all amenable to LLMs. Perhaps this was a failure of imagination, but we also wanted to maximise the likelihood that everyone would be able to submit a functional model.</p><p>In any event we were inspired to look more into the potential to include an LLM track to the chess hackathon. 
After some searching we discovered the <a href="https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html">work of Adam Karvonen</a>, which demonstrated that an LLM (of modest size) can be trained from scratch on PGNs (historic chess games recorded in Portable Game Notation) to do next-character prediction in a GPT-like manner and thereby generate the next move to be made in the game.</p><p>We were fascinated by the apparent capability of the Transformer architecture, as shown in Adam&#8217;s work, to learn latent representations of a partially completed game which demonstrably encode details of the board state, despite the model never having been explicitly shown what a chess board even looks like.</p><p>The second hackathon sought to implement this formulation of the task, training &#8220;ChessGPT&#8221; models to do next-character prediction on a <a href="https://storage.lczero.org/files/training_pgns/test60/">dataset comprising PGNs</a> from recent training runs by Leela Chess Zero.</p><p>Rather than trust the models implicitly to generate valid moves, we generated all possible moves and asked the models to score each one by the probability that the game PGN continues with it.</p><p>One observation worth noting is that the ChessGPT models seemed weak at identifying and exploiting blunders made by their opponents. We speculate this might be due to our choice of training data: PGNs from games played by a highly competent chess engine, which contain very few if any serious blunders (hanging a queen, for example). 
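</p><p>The move-selection scheme described above (enumerate every legal move, then score each by the probability the model assigns to the PGN continuing with it) can be sketched with a stand-in model. Everything here is hypothetical: <code>char_log_prob</code> is a placeholder for a trained ChessGPT&#8217;s next-character distribution, and only the scoring loop reflects the approach described.</p>

```python
import math

# Hypothetical stand-in for a trained ChessGPT: a real model would
# return log P(next_char | context) from a Transformer trained on PGNs.
def char_log_prob(context, ch):
    return math.log(0.9) if ch.isalnum() else math.log(0.1)

def score_continuation(pgn, move_san):
    """Log-probability that the game PGN continues with move_san."""
    total, context = 0.0, pgn
    for ch in move_san:
        total += char_log_prob(context, ch)
        context += ch
    return total

def pick_move(pgn, legal_moves_san):
    # Greedy argmax over legal moves; the bots could instead sample
    # from a softmax over these scores.
    return max(legal_moves_san, key=lambda m: score_continuation(pgn, m))

best = pick_move("1. e4 e5 2. ", ["Nf3", "Qh5", "a3"])
# under this toy model the shortest all-alphanumeric move scores highest
```

<p>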
The model would therefore consider it very unlikely that a game would continue with a piece moving to take the queen at the particular stage of the game.&nbsp;</p><h4><strong>Hackathon 3 - Vision and ChessGPT</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-ayj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-ayj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-ayj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-ayj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-ayj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-ayj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg" width="1456" height="1018" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1018,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-ayj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-ayj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-ayj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-ayj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819639dc-cea5-476c-b5cb-6e781fcd7101_3024x2114.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>For the third and subsequent hackathons we unified the two formulations attempted for the first two hackathons.&nbsp;</p><p>At each move, models were required to take two inputs - the PGN of the game up to that move, and a short string representing the potential next move in Standard Algebraic Notation (SAN) - and return a score for that potential move.</p><p>ChessGPT models could proceed by appending the potential move string to the PGN and passing this sequence directly to the Transformer network.</p><p>Vision models were required to convert the PGN and potential move SAN into a representation of the potential board state and score that potential board state.</p><p>The strongest models from this hackathon were predominantly vision-based models, which were markedly more capable of identifying and exploiting blunders, but the single strongest model - check out the blog linked below - used interleaved convolutional and self-attention 
layers.</p><h2><strong>How to win the Chess Hackathon</strong></h2><p>There have been a couple of consistent features of the winning team approaches. We&#8217;ll detail a few of our thoughts below, but you might also like to <a href="https://words.strongcompute.com/p/case-study-how-our-team-won-the-mega">hear from the recent winners</a> themselves how they achieved victory.</p><ol><li><p><strong>Choose a simple model architecture and training approach</strong></p></li></ol><p>The chess-hackathon repository and provided datasets are generally more than enough to work with. If you do want to experiment with a novel architecture, make sure you have spent some time researching that architecture ahead of time and validate that the model input and output tensors are the correct type and shape. If you want to bring your own dataset, spend some time designing and testing your data pipeline ahead of time.</p><ol start="2"><li><p><strong>Validate your model early</strong></p></li></ol><p>Your model might be the strongest chess AI the world has ever seen, but if it takes a whole cluster of compute and an hour to make a move (or if we can&#8217;t run it for some other reason) then we just won&#8217;t let it play, and a surefire way <em>not</em> to win is to not be allowed to compete.</p><p>We publish a validation script with the chess-hackathon repository that checks your model meets our tournament specifications. 
Before you even launch your model to train, generate a checkpoint and validate that it will pass our pre-flight check.&nbsp;</p><p>We also publish super detailed instructions on how to develop your model so that it meets our compatibility requirements, so pay close attention to those and set your project up to be compatible from the beginning.</p><ol start="3"><li><p><strong>Start training early and train for as long as possible</strong></p></li></ol><p>Deep learning models take time to train; you are likely to run out of cluster time before your model stops improving in training. The winning teams have consistently been those whose models were able to train for many hours. Start training early, and train for as long as you can.</p><p>You might be wondering: what will I do with all the time while I wait for my model to train? Here are some suggestions.</p><p>Firstly, always be recovering your checkpoints and evaluating your models. Evaluating models is tricky when your training and target objectives are so loosely connected. How do I know if my model is good at chess? How does anyone know they&#8217;re good at chess? Play them off and see which one wins. Play against them yourself.</p><p>Secondly, be prepared for your training run to fail at some point. This might happen due to a hardware failure on the cluster you&#8217;re training on, or an internet or power outage. Interruptions are an inevitable fact of life when you&#8217;re training on hundreds of GPUs at a time. 
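Interruptions like these are why checkpointing matters. A minimal resume-from-latest pattern (a sketch with hypothetical paths and state, not our tooling) writes a numbered checkpoint periodically and, on startup, picks up from the highest-numbered one found on disk:</p>

```python
import json
from pathlib import Path

CKPT_DIR = Path("checkpoints")

def save_checkpoint(step: int, state: dict) -> None:
    CKPT_DIR.mkdir(exist_ok=True)
    # Write to a temp file, then rename: a crash mid-write never corrupts
    # the latest checkpoint, because rename is atomic on POSIX filesystems.
    tmp = CKPT_DIR / f"step_{step:08d}.json.tmp"
    tmp.write_text(json.dumps({"step": step, "state": state}))
    tmp.rename(CKPT_DIR / f"step_{step:08d}.json")

def load_latest_checkpoint():
    """Return the highest-numbered checkpoint, or None if none exist yet."""
    ckpts = sorted(CKPT_DIR.glob("step_*.json"))
    return json.loads(ckpts[-1].read_text()) if ckpts else None

# Resume-or-start: the training loop always begins from the latest checkpoint.
latest = load_latest_checkpoint()
start = latest["step"] + 1 if latest else 0
for step in range(start, start + 5):
    state = {"loss": 1.0 / (step + 1)}   # stand-in for real model/optimizer state
    if step % 2 == 0:
        save_checkpoint(step, state)

print("resumed from:", latest)
```

<p>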
When your training is interrupted, you&#8217;re going to want to recover the latest checkpoint and start training again.</p><h3><strong>Looking Ahead: The Next Mega Chess Hackathon</strong></h3><p>We&#8217;ve heard the feedback that a weekend may not be enough time to dive deep.&nbsp;</p><p>That&#8217;s why we&#8217;re opening up early access for <a href="https://lu.ma/strongcompute">our next event.&nbsp;</a></p><p>Participants can join virtually a week ahead for onboarding, system access, and experiment credits. Then, the hackathon weekend will open with burst access in San Francisco and Sydney.</p><p>Our next Mega Chess Hackathon promises to be our biggest yet. You&#8217;ll have the chance to leverage powerful tools, experiment with advanced models, and test your AI chess skills.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vPSH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vPSH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vPSH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vPSH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!vPSH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vPSH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:403236,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vPSH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vPSH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vPSH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!vPSH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39b876da-d6e6-4552-80f6-380296347656_1620x1080.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://words.strongcompute.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for 
reading Strong Words! Subscribe for free to receive new posts and support our work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Case Study: How Our Team Won the Mega Chess Hackathon with Deep Learning and Rapid Iteration]]></title><description><![CDATA[Our team recently competed in and won the Strong Compute Mega Chess Hackathon. Ten San Francisco and Sydney teams competed simultaneously to build the strongest possible chess-playing deep learning model in just two days. The event culminated in a model vs. model tournament, where the bots faced off to determine the final winner. It was an exciting and challenging experience. We would like to share some of the lessons learned along the way.]]></description><link>https://words.strongcompute.com/p/case-study-how-our-team-won-the-mega</link><guid isPermaLink="false">https://words.strongcompute.com/p/case-study-how-our-team-won-the-mega</guid><dc:creator><![CDATA[Strong Compute]]></dc:creator><pubDate>Tue, 24 Sep 2024 18:00:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Czik!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Czik!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Czik!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Czik!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Czik!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Czik!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Czik!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg" width="1456" height="1018" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1018,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!Czik!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Czik!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Czik!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Czik!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff486637e-2bb8-4cbc-a748-57de2f65e640_3024x2114.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Our team recently competed in and won the Strong Compute <a href="https://lu.ma/strongcompute">Mega Chess Hackathon</a>. Ten San Francisco and Sydney teams competed simultaneously to build the strongest possible chess-playing deep learning model in just two days. The event culminated in a model vs. model tournament, where the bots faced off to determine the final winner. It was an exciting and challenging experience. We would like to share some of the lessons learned along the way.&nbsp;</p><h2><strong>Team Formation and Strategy</strong></h2><p>Our team comprised <a href="https://www.linkedin.com/feed/">Justin F. Knoll</a>, <a href="https://www.linkedin.com/in/suryaprakash360/">Suryaprakash Senthil Kumar</a>, and <a href="https://www.linkedin.com/in/ashishmukharji/">Ashish Mukharji</a><strong>. </strong>We did not know each other before the event but made a point to connect via Zoom and share our backgrounds, possible technical approaches, working styles, and goals for the event beforehand. We were confident in our team and approach before the event started. Forming a team and getting a rough consensus on our approach in advance saved us precious hacking time and is a highly recommended tactic.</p><p>Once the event began, we dove in to familiarize ourselves with the Strong Compute ISC platform, the provided datasets and models, and the actual tournament gameplay example scripts.</p><p>Our first objective was to close the loop: to train a very basic model from randomized weights into a candidate model competing in a one-round mock tournament. It&#8217;s hard to overstate how valuable this was in ensuring we understood all parts of the stack, the submission requirements, and the tournament API.</p><p>We started a multi-hour training run and pulled one of the intermediate checkpoints to close the loop. Seeing even a very weak and undertrained model playing chess on a live-refreshing board was a magical moment! The gameplay test script gave us a way to evaluate models against each other heuristically.</p><p>We let the model train and monitored training loss, rank correlation, adaptive learning rate adjustments, etc. to gauge training performance.</p><h2><strong>Exploring a Range of Technical Approaches</strong></h2><p>Confident that we understood the full stack and submission requirements, and with a way to approximately evaluate model performance, we turned our attention to selecting our own moves as a team within the tournament.</p><p>Given the complexity of building a competitive chess-playing model, we explored two high-level approaches: a vision-like model and a GPT-based model. 
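To make the "vision-like" idea concrete, here is a minimal sketch (our own illustration, not the hackathon code) of the standard input layout for convolutional chess models: a 12-plane 8x8 one-hot array, one plane per piece type and colour, built here from a FEN piece-placement string:</p>

```python
# Encode a FEN piece-placement field into a 12x8x8 one-hot array,
# the typical input layout for convolutional ("vision") chess models.
PIECES = "PNBRQKpnbrqk"  # 6 white piece types, then 6 black

def fen_to_planes(placement: str):
    planes = [[[0] * 8 for _ in range(8)] for _ in PIECES]
    for rank, row in enumerate(placement.split("/")):
        file = 0
        for ch in row:
            if ch.isdigit():
                file += int(ch)  # a digit means that many empty squares
            else:
                planes[PIECES.index(ch)][rank][file] = 1
                file += 1
    return planes

# Starting position, piece-placement field only.
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
planes = fen_to_planes(start)
total = sum(sum(sum(r) for r in p) for p in planes)
print(total)  # 32 pieces on the board
```

<p>A network then convolves over these planes just as it would over image channels, which is why CNN tooling transfers so directly to chess.</p><p>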
One key inspiration for the GPT-based models was <strong>Adam Karvonen&#8217;s paper</strong> on &#8220;Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models.&#8221; Within the vision approach, we experimented with CNNs, transformers, and multi-head attention.</p><p>Adam&#8217;s mechanistic interpretability research on Chess-GPT models applied linear probes to the network&#8217;s activations and concluded that the network creates an emergent world model including the chessboard, piece positions, and even latent variables like player Elo rating. Learning about this research was a fascinating side quest, but the mechanistic interpretability results are ultimately about emergent world models rather than model performance.</p><p>At times, we worried about training a GPT-based model based on Leela Chess Zero self-play, since such a model is ultimately doing probabilistic next-token prediction over examples from the training corpus, and one presumes that some of the Leela Chess Zero self-play games from early training would be examples of spectacularly poor play! On the other hand, if one were to train a GPT-based model on only grandmaster games, it would never have seen the sort of blunders we expected to encounter in the tournament models, and so wouldn&#8217;t know how to exploit them. In general, GPT-based chess models are fascinating, but it seemed harder to reason about how to train them for high performance.</p><p>We did some initial hyperparameter tuning of the models by modifying values and observing training metrics over short &#8220;cycle&#8221; test runs, then using the selected parameters for longer &#8220;burst&#8221; runs. 
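The cycle-then-burst pattern can be sketched like this (hypothetical function names and a toy objective, purely for illustration): score every candidate configuration with a short cheap run, then spend the long run only on the winner:</p>

```python
import itertools

def short_cycle_run(lr: float, batch: int) -> float:
    """Stand-in for a brief training run; returns a proxy validation loss."""
    return abs(lr - 3e-4) * 100 + abs(batch - 64) / 64

def long_burst_run(lr: float, batch: int, hours: float) -> str:
    """Stand-in for the expensive multi-hour training job."""
    return f"training {hours}h with lr={lr}, batch={batch}"

grid = itertools.product([1e-3, 3e-4, 1e-4], [32, 64, 128])
# Cheap "cycle" runs over the whole grid...
best = min(grid, key=lambda cfg: short_cycle_run(*cfg))
# ...then one expensive "burst" run with the selected parameters.
print(long_burst_run(*best, hours=12))
```

<p>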
We stopped short of doing a more structured hyperparameter sweep.</p><h2><strong>Team-Level Divide and Conquer</strong></h2><p>We were able to divide and conquer by having individual team members focus on optimizing different model approaches and run long-lived &#8220;burst&#8221; jobs to train those models in parallel. This was key to parallelizing progress and maximizing how quickly we could validate &#8212; or discard &#8212; our hypotheses.</p><p>We had access to multi-gigabyte datasets, including historical grandmaster games and Leela Chess Zero self-play. We experimented with merging some of the provided datasets, which was not difficult and seemed effective. We also tried adding our own outside datasets (for example, a 29GB <a href="https://lichess.org/">Lichess</a> database export), but time constraints forced us to focus more on model tuning and training on the provided datasets than on converting and ingesting outside data.</p><h2><strong>Leveraging Strong Compute&#8217;s ISC for Training</strong></h2><p>All teams were provided with access to Strong Compute&#8217;s service, which made it possible for us to train our models using powerful 72xA100 clusters. This infrastructure was a game-changer for rapid iteration.</p><h2><strong>The Winning Model</strong></h2><p>On day two, we selected some of the longest-trained models with the best training metrics and played them against each other using the gameplay script, keeping an informal tally of performance. 
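Our informal playoff amounted to a round-robin with a win tally - roughly the following sketch, where the game function and checkpoint names are hypothetical stand-ins for the actual gameplay script and our models:</p>

```python
import itertools
import random
from collections import Counter

def play_game(white: str, black: str, rng: random.Random):
    """Stub for the real gameplay script: returns the winner's name, or None for a draw."""
    return rng.choice([white, black, None])

# Hypothetical checkpoint names, for illustration only.
candidates = ["cnn_attn_step_40k", "cnn_attn_step_60k", "gpt_small_step_30k"]
rng = random.Random(42)
tally = Counter()
# Every ordered pair plays once, so each pairing is played twice,
# once with each colour, to cancel out first-move advantage.
for white, black in itertools.permutations(candidates, 2):
    winner = play_game(white, black, rng)
    if winner is not None:
        tally[winner] += 1

print(tally.most_common())
```

<p>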
The vision approaches were dominant, so we did some final tuning and played a sub-tournament amongst the strongest two vision models, ultimately selecting a CNN-based model with multi-head attention and dilated convolution to expand the receptive field and capture relationships further across the board.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WByS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WByS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png 424w, https://substackcdn.com/image/fetch/$s_!WByS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png 848w, https://substackcdn.com/image/fetch/$s_!WByS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png 1272w, https://substackcdn.com/image/fetch/$s_!WByS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WByS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png" width="1456" height="829" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:829,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!WByS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png 424w, https://substackcdn.com/image/fetch/$s_!WByS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png 848w, https://substackcdn.com/image/fetch/$s_!WByS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png 1272w, https://substackcdn.com/image/fetch/$s_!WByS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c43214-b402-43cf-8392-4ed31b28c06b_1600x911.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Post-Hackathon Reflection</strong></h2><p>Participating in this hackathon was a valuable learning experience. Two days is not much time to build and coordinate as a team, and we didn&#8217;t get a chance to implement many of our ideas. Just as with building a commercial product, we had to strike a balance between speed and rigor and aim to maximize the rate of validated learning.</p><p>While overall time was scarce, the ability to easily do distributed training on a 72xA100 cluster was a game changer. More data, more epochs, deeper models: all of these were feasible. 
Ensuring that we were always using the cluster for some experiment and not letting it idle was an important tactic.</p><h2><strong>Acknowledgments</strong></h2><p>A huge thanks to my teammates and everyone who made this event possible, especially <strong>Ben Sand, Adam Peaston, Tim Smoothy, and Rebecca Pham from Strong Compute</strong>, for providing the infrastructure and support that allowed us to compete at this level.</p><h2><strong>Conclusion</strong></h2><p>This hackathon pushed our limits and taught us the value of rapid iteration, strategic model selection, and leveraging powerful computing resources. We&#8217;re thrilled with our success and glad to be able to share the lessons here.</p><p>This was written by Chess Hackathon participants, Ashish Mukharji, Justin F. Knoll and Suryaprakash Senthil Kumar.</p><p>Try Strong Compute at our next<a href="https://lu.ma/strongcompute"> event</a>.</p>]]></content:encoded></item></channel></rss>