Posted by u/skar01 4 months ago
Launch HN: Golpo (YC S25) – AI-generated explainer videos (video.golpoai.com)
Hey HN! We’re Shraman and Shreyas Kar, building Golpo (https://video.golpoai.com), an AI generator for whiteboard-style explainer videos, capable of creating videos from any document or prompt.

We’ve always made videos to explain concepts; it felt like the clearest way to communicate. But making a good video was time-consuming and tedious: planning, scripting, recording, editing, and syncing voice with visuals. Even a 2-minute video could take hours.

AI video tools are impressive at generating cinematic scenes and flashy content, but struggle to explain a product demo, walk through a complex workflow, or teach a technical topic. People still spend hours making explainer videos manually because existing AI tools aren’t built for learning or clarity.

Our solution is Golpo. Its video engine produces time-aligned graphics with spoken narration, well suited to onboarding, training, product walkthroughs, and education. It’s fast, scalable, and built from the ground up to help people understand complex ideas through simple storytelling.

Here’s a demo: https://www.youtube.com/watch?v=C_LGM0dEyDA#t=7.

Golpo is built specifically for use cases involving explaining, learning, and onboarding. In our (obviously biased!) opinion, it feels authentic and engaging in a way no other AI video generator does.

Golpo can generate videos in over 190 languages. Once a video is generated, you can fully customize the animations by describing, in natural language, the changes you want to see in each motion graphic.

It was challenging to get this to work! Initially, we used a code-generation approach with Manim, fine-tuning a language model to emit Python animation scripts directly from the input text. While promising for small examples, this approach quickly became brittle: the generated code usually contained broken imports, unsupported transforms, and poor timing alignment between narration and visuals. Debugging and regenerating these scripts was often slower than writing them manually.
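
For a sense of what that looked like, here’s a simplified illustration of the kind of script the model had to emit (illustrative code, not actual model output). A single hallucinated class name or transform anywhere in a script like this would break the whole render:

    from manim import Scene, Text, Arrow, Create, FadeIn, LEFT, RIGHT

    class CauseEffect(Scene):
        def construct(self):
            cause = Text("Input").shift(LEFT * 3)
            effect = Text("Output").shift(RIGHT * 3)
            arrow = Arrow(cause.get_right(), effect.get_left())
            self.play(FadeIn(cause))              # narration beat 1
            self.play(Create(arrow), run_time=2)  # must line up with the voiceover
            self.play(FadeIn(effect))             # narration beat 2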

We also explored training a custom diffusion-based video model, but found it impractical for our needs. Diffusion could produce high-fidelity cinematic scenes, but generating coherent sequences beyond about 30 seconds was unreliable without complex stitching, edits required regenerating large portions of the video, and the visuals frequently drifted from the instructional intent, especially for abstract or technical topics. We also didn't have the compute to scale this.

Existing state-of-the-art systems like Sora and Veo 3 face similar limitations: they are optimized for cinematic storytelling, not step-by-step educational content, and they lack both the deterministic control needed for time-aligned narration and the scalability for 5–10 minute explainers.

In the end, we took a different path: training a reinforcement learning agent to “draw” whiteboard strokes, step by step, optimized for clear, human-like explanations. This worked well because the action space was simple and the environment was not overly complex, allowing the agent to learn efficient, precise, and consistent drawing behaviors.
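
To give a flavor of the setup, here’s a heavily simplified toy sketch of that kind of environment (not our actual implementation; the real action space and reward are more involved):

    import numpy as np

    class WhiteboardEnv:
        """Toy stroke-drawing environment: the agent moves a pen over a
        canvas and is rewarded for matching a target drawing."""

        def __init__(self, target):
            self.target = target            # binary target image
            self.size = target.shape[0]
            self.reset()

        def reset(self):
            self.canvas = np.zeros_like(self.target, dtype=float)
            self.pen = np.array([self.size // 2, self.size // 2])
            return self.canvas.copy()

        def step(self, action):
            dx, dy, pen_down = action       # tiny, simple action space
            self.pen = np.clip(self.pen + [dx, dy], 0, self.size - 1)
            if pen_down:
                self.canvas[self.pen[1], self.pen[0]] = 1.0
            # simplest possible signal: negative pixel distance to the target
            reward = -np.abs(self.canvas - self.target).mean()
            return self.canvas.copy(), reward, False, {}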

Here are some sample videos that Golpo generated:

https://www.youtube.com/watch?v=33xNoWHYZGA (Whiteboard Gym - the tech behind Golpo itself)

https://www.youtube.com/watch?v=w_ZwKhptUqI (How do RNNs work?)

https://www.youtube.com/watch?v=RxFKo-2sWCM (function pointers in C)

https://golpo-podcast-inputs.s3.us-east-2.amazonaws.com/file... (basic intro to Gödel's theorem)

You can try Golpo here: https://video.golpoai.com, and we will set you up with 2 credits. We’d love your feedback, especially on what feels off, what you’d want to control, and how you might use it. Comments welcome!

typs · 4 months ago
If that demo video is how it actually works, this is a pretty amazing technical feat. I’m definitely going to try this out.

Edit: I've used it. It's amazing. I'm going to be using this a lot.

Masih77 · 4 months ago
I call bs on training an RL agent to literally output strokes. The way each image renders is a dead giveaway that this is just using a text-to-image model, then converting the output to SVG, and finally animating the SVG paths. They might even bypass the SVG conversion with clever mask reveals. I was able to achieve the same thing in about 5 mins. https://giphy.com/gifs/rFVxSxZMlflZUX4TqI
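
For anyone curious, the usual trick for that last step is the stroke-dashoffset reveal; here's a minimal sketch (the path data is made up):

    # Classic "line drawing" reveal: dasharray covers the path length,
    # then the offset animates to zero. Open reveal.svg in a browser.
    svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
      <style>
        path { stroke-dasharray: 400; stroke-dashoffset: 400;
               animation: draw 2s linear forwards; }
        @keyframes draw { to { stroke-dashoffset: 0; } }
      </style>
      <path d="M10,80 C40,10 80,10 110,80 S170,150 190,20"
            fill="none" stroke="black" stroke-width="3"/>
    </svg>"""

    with open("reveal.svg", "w") as f:
        f.write(svg)
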
skar01 · 4 months ago
Thank you!!
metalliqaz · 4 months ago
My suggestion would be to rethink the demo videos. I've only watched the "function pointers in C" example, most of the way through. If I didn't already know C well, I would not be able to follow it. The technical diagrams don't stay on the screen long enough for new learners to process the information. These videos probably look fantastic to the person who wrote the document being summarized, but to a newbie the information is fleeting and hard to follow. The machine doesn't understand that the screen shouldn't be completely wiped all the time as it follows the narrative. Some visuals should stay static for paragraphs at a time, or remain visible while detail is marked up around them. For a true master of the art, see 3blue1brown.
bangaladore · 4 months ago
> For a true master of the art, see 3blue1brown.

I agree. Rather than (what I assume is) E2E text -> video/audio output, it seems like training a model to use the community fork of manim, which 3blue1brown uses for his videos, would produce a better result.

[1] https://github.com/ManimCommunity/manim/

albumen · 4 months ago
Manim is awesome and I'd love to see that, but it doesn't easily offer the "hand-drawn whiteboard" look they've got currently.
delbronski · 4 months ago
Wow, I was skeptical at first, but the result was pretty awesome!

Congrats! Cool product.

Feedback: I tried making a product explainer video for a tree-planting rover I'm working on. The rover looked different in every scene. I can imagine this kind of consistency may be difficult to get right. Maybe it would have helped if I had uploaded a photo of what the rover looks like. In one scene the rover looks like an actual rover; in another it looks like a humanoid robot.

But still, super impressed!

skar01 · 4 months ago
Thanks! We are working on the consistency.
torlok · 4 months ago
Going by the example videos, this is nothing like what I'd expect a whiteboard video to look like. It fills in the slides erratically, even the text. No human does that. It's distracting more than anything. If a human teacher wants to show cause and effect, they'll draw the cause, then an arrow, then the effect, to emphasize what they're saying. Your videos resemble printing more than drawing.
grues-dinner · 4 months ago
It seems really strange that you wouldn't farm this kind of thing out to a non-AI function that animates the text properly into the available space, using parameters the AI generated. I mean, it's impressive that it works at all, let alone as well as it does, but are we going to get an AI to do the video encoding as well?
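
Something like this, where the model only emits layout and timing parameters and plain deterministic code does the animation (all names made up, just to illustrate the split):

    from dataclasses import dataclass

    @dataclass
    class TextCue:            # the part the AI would emit
        text: str
        start: float          # narration-aligned start time (s)
        duration: float       # how long the reveal should take (s)

    def reveal_frames(cue, fps=30):
        """Yield (time, visible_prefix) pairs: a predictable
        left-to-right reveal instead of an erratic learned one."""
        n = max(1, int(cue.duration * fps))
        for i in range(n + 1):
            yield cue.start + i / fps, cue.text[: len(cue.text) * i // n]
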
dtran · 4 months ago
Love this idea! The Whiteboard Gym explainer video seemed really text-heavy (although I did learn enough to guess that that's because text likely beat drawing/adding an image for these abstract concepts for the GRPO agent). I found Shraman's personal story video much more engaging! https://x.com/ShramanKar/status/1955404430943326239

Signed up and waiting on a video :)

Edit: here's a 58s explainer video for the concept of body doubling: https://video.golpoai.com/share/448557cc-cf06-4cad-9fb2-f56b...

addandsubtract · 4 months ago
The body doubling concept is something I've noticed myself, but never knew there was a term for it. TIL :)
albumen · 4 months ago
Love it. The tone is just right. A couple of suggestions:

Have you tried a "filled line" approach, rather than "outlined" strokes? Might feel more like individual marker strokes.

I made a demo video on the free tier and it did a great job explaining acoustic delay lines in an accessible fashion, after I fed it a catalog PDF with an overview of the historical artefact and photographs of an example unit. Unfortunately, the service invented its own idea of what the artefact looked like. Could you offer a storyboard view and let users erase the incorrect parts and sketch their own shapes? Or split the drawing into logical elements the user could redraw as needed, with the redrawn versions reused wherever that element appears in other frames?

skar01 · 4 months ago
Thank you!! We are working on the storyboarding feature right now!!
mclau157 · 4 months ago
I have used AI in the past to learn a topic by having it create a GUI with input sliders, so I could see how the output changes as I change the parameters. That could work here too: people could basically ask "what if x happens" and see the result, which also makes them feel in control of the learning.
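
Something like this toy example, but generated per topic (the decay curve is just a placeholder):

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.widgets import Slider

    # A slider the learner drags to ask "what if k changes?"
    x = np.linspace(0, 10, 200)
    fig, ax = plt.subplots()
    plt.subplots_adjust(bottom=0.25)
    line, = ax.plot(x, np.exp(-0.5 * x))

    k_slider = Slider(plt.axes([0.2, 0.1, 0.6, 0.03]), "k", 0.1, 2.0, valinit=0.5)

    def update(val):
        line.set_ydata(np.exp(-k_slider.val * x))  # redraw as the learner explores
        fig.canvas.draw_idle()

    k_slider.on_changed(update)
    plt.show()
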
skar01 · 4 months ago
Thank you!!
giorgioz · 4 months ago
I love the concept, but the implementation in the demo doesn't seem good enough to me. I think the black-and-white demo is quite ugly... 1) Explainer videos are not in black and white. 2) The images are usually not drawn live. 3) Text being drawn on the go is just a fake animation. In reality, most explainer videos show short, meaningful sentences appearing all at once so the viewer has more time to read them.

Keep up refining the generated demo! Best of luck

fxwin · 4 months ago
I'm also not the biggest fan of the white-on-black style, but there is definitely precedent (at least in the science YouTube space) for explainer videos "drawn live" [1-4].

[1] https://www.youtube.com/@Aleph0

[2] https://www.youtube.com/@MinutePhysics

[3] https://www.youtube.com/@12tone

[4] https://www.youtube.com/@SimplilearnOfficial