Text-to-Manim: Generating Visual Explanations using GRPO and Gemini Rewards
Automatically converting mathematical questions into visual animations using the Manim animation engine.
1. Introduction
Our overarching mission is to build a personalized AI tutor capable of delivering high-quality educational content to anyone, anywhere. We believe that true democratization of education can only be achieved by making learning deeply engaging, personalized, and universally accessible. A critical component of this vision is the ability to transform abstract questions into clear, visual explanations, an approach that aligns closely with how humans build intuition for complex concepts.
In this hackathon project, we focused on one specific but essential challenge: automatically converting mathematical questions into visual animations using the Manim animation engine. The end goal is to empower students with dynamic visualizations that enhance understanding, retention, and conceptual clarity — especially in STEM education.
2. Problem Statement
Humans are inherently visual learners. Concepts that are difficult to grasp through text-based explanations can often be instantly clarified through animation or visual demonstrations. Despite the power of this modality, creating educational animations remains a time-intensive and highly manual process. Our challenge was to automate this pipeline: given a natural language mathematical question, can we generate a Manim animation script that explains the solution visually?
3. Initial Approach: Supervised Fine-Tuning (SFT)
We began by attempting a supervised fine-tuning (SFT) approach. Specifically, we fine-tuned the DeepSeek LLM using a dataset of input-output pairs, where:
Input = a mathematical question
Output = the corresponding Manim script to animate the explanation
We also attempted to incorporate Chain-of-Thought (CoT) reasoning in the outputs, guiding the model to not only solve the problem but also break it down into explanatory visual steps.
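To make the SFT setup concrete, here is a minimal sketch of how a question-to-script pair can be turned into a causal-LM training example, with the loss computed only on the Manim script tokens. The checkpoint name, prompt template, and helper function below are illustrative assumptions, not our exact training code.

```python
# Minimal SFT sketch: fine-tune a causal LM on (question -> Manim script) pairs.
# Checkpoint name and prompt format are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-llm-7b-base")

def build_example(question: str, manim_script: str) -> dict:
    """Concatenate the question and the target Manim script into one sequence."""
    prompt = f"### Question:\n{question}\n\n### Manim script:\n"
    full = prompt + manim_script + tokenizer.eos_token
    enc = tokenizer(full, return_tensors="pt", truncation=True, max_length=2048)
    labels = enc["input_ids"].clone()
    # Mask the prompt tokens so cross-entropy is computed only on the script
    prompt_len = len(tokenizer(prompt)["input_ids"])
    labels[:, :prompt_len] = -100
    return {"input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": labels}

batch = build_example(
    "Animate the area of a circle as the radius grows from 1 to 3.",
    "from manim import *\nclass AreaScene(Scene):\n    def construct(self): ...",
)
loss = model(**batch).loss  # standard causal-LM loss on the completion
loss.backward()
```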
Challenges
However, we encountered two major limitations:
Lack of high-quality training data: Manim-query pairs are a highly niche and scarce dataset. Publicly available examples are limited in volume and diversity.
Absence of Chain-of-Thought (CoT) annotations: Even where datasets exist, few contain intermediate reasoning steps essential for generating coherent explanatory animations.
Due to these challenges, the SFT approach generalized poorly, and its outputs often lacked visual accuracy and semantic coherence.
4. Proposed Solution: GRPO with Reward Modeling via Gemini as External Judge
To address these limitations, we pivoted to a reinforcement learning approach based on GRPO (Group Relative Policy Optimization). Instead of relying on static data, we introduced an external LLM-based reward model, built on top of Gemini, to act as a judge of the model's outputs. This judge scored the quality of generated animations, enabling us to train the base model using reward signals rather than fixed labels.
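The core idea in GRPO is to sample a group of candidate outputs per prompt and compare them against each other rather than against a learned value baseline. The sketch below shows the group-relative advantage computation under that scheme; `policy` and `reward_fn` are placeholders for the fine-tuned LLM and the Gemini judge described in the next subsection.

```python
# Group-relative advantages, the heart of GRPO: sample G candidate scripts per
# question, score each one, and normalize rewards within the group. The
# resulting advantages weight the clipped policy-gradient update.
import torch

def grpo_advantages(question, policy, reward_fn, group_size=8):
    scripts = [policy.generate(question) for _ in range(group_size)]   # G samples per prompt
    rewards = torch.tensor([reward_fn(question, s) for s in scripts])  # judge each sample
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # group-relative baseline
    return scripts, advantages
```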
Reward Model Criteria
Our reward model evaluated each generated Manim animation based on the following five criteria:
Prompt Consistency: Does the animation match the original mathematical prompt in terms of objects involved, actions depicted, and conceptual correctness?
Screen Fit: Do the visual elements stay within the canvas boundaries? Do any objects overflow or render off-screen?
Non-overlapping Layout: Are the visual elements well-spaced? Do objects overlap in distracting or confusing ways?
Semantic Coherence: Does the animation make logical sense? For example, do equations appear where expected? Are objects used in appropriate ways?
Clarity of Explanation: Is the final animation pedagogically effective? Would a student find it helpful in understanding the concept?
These multi-dimensional reward signals allowed us to optimize for visual, spatial, and semantic quality — aspects that are difficult to enforce via traditional supervised learning.
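Below is a hedged sketch of how a Gemini-based judge can turn these five criteria into a scalar reward for GRPO. The model id, rubric prompt, JSON schema, and equal-weight aggregation are illustrative assumptions; in practice, rendered frames of the animation could also be attached for the judge to inspect.

```python
# Gemini-as-judge reward sketch: score a generated Manim script on the five
# rubric criteria and collapse the scores into a single reward in [0, 1].
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model id

RUBRIC = """You are grading a Manim animation script for the question below.
Return JSON with integer scores from 1 to 5 for: prompt_consistency,
screen_fit, non_overlapping_layout, semantic_coherence, clarity_of_explanation.

Question: {question}

Manim script:
{script}
"""

def gemini_reward(question: str, script: str) -> float:
    response = judge.generate_content(RUBRIC.format(question=question, script=script))
    scores = json.loads(response.text)        # assumes the judge replies with bare JSON
    return sum(scores.values()) / (5 * 5)     # equal weights; the weighting is a design choice
```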
5. Results and Observations
With GRPO and Gemini-based reward modeling, our model demonstrated significantly better convergence compared to SFT. Not only did the animations become more visually accurate, but the overall explanatory coherence also improved. The model was able to generalize across a range of simple mathematical prompts and produce clear, legible Manim animations with minimal hallucinations or layout issues.
6. Future Directions
This project represents just the beginning of our journey toward building a fully autonomous AI tutor. Moving forward, we plan to:
Expand the complexity and diversity of supported mathematical questions (algebra, calculus, geometry, etc.)
Integrate real-time preview and editing tools for generated animations
Incorporate user feedback and corrections into the reward signal (RLHF loop)
Extend support beyond Manim to other visual engines and modalities (e.g., interactive graphs, 3D geometry)
We are excited to continue developing this project with the support of Strong Compute, and look forward to pushing the boundaries of personalized AI education.
7. Acknowledgments
We thank the hackathon organizers and the community for providing a platform to explore such impactful ideas. We are especially grateful to Strong Compute for providing infrastructure and support.
This post was written by Karthik Ragunath Ananda Kumar, AI Researcher @ Tavus Inc; Bernett Orlando, Senior ML SWE @ Google Research; and Ramprasadh Kumar, Systems @ NVIDIA.
Links:
Presentation slides: Google Slides
GitHub repo: Link here