I'm stoked to share a project I've been working on called PDF to Podcast. It's a free, open-source tool that automatically converts PDF documents into engaging, informative podcast-style audio content using large language models and text-to-speech tech.
Inspiration: The idea for this project came from the NotebookLM demo at Google I/O, where they showcased generating audio dialogue from uploaded PDFs and other sources. However, that audio feature hasn't been publicly released yet, and I wanted to challenge myself to build something similar using existing tools and APIs.
How it works:
1. The user uploads a PDF.
2. The tool extracts the text and feeds it into Google's Gemini Flash language model.
3. Gemini Flash generates a natural, engaging podcast dialogue script based on the key information in the document.
4. This script is then converted to audio using OpenAI's text-to-speech API.
5. The user can listen to the generated "podcast episode" and read along with the transcript.
I chose to use Gemini Flash for the language model because it's good at writing high-quality prose while being fast and cheap. We use OpenAI's TTS API to then bring the dialogue to life.
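The flow described above can be sketched as a small pipeline. This is only an illustration, not the project's actual code: `DialogueLine`, `run_pipeline`, and the three callables passed in are hypothetical names, and the real text extraction, Gemini Flash, and OpenAI TTS calls are left as injected functions so the flow can be read and run without API keys.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DialogueLine:
    speaker: str  # hypothetical role label, e.g. "host" or "guest"
    text: str


def run_pipeline(
    pdf_bytes: bytes,
    extract_text: Callable[[bytes], str],
    write_script: Callable[[str], List[DialogueLine]],
    synthesize: Callable[[DialogueLine], bytes],
) -> Tuple[bytes, str]:
    """Return (audio, transcript) for one uploaded PDF."""
    # PDF -> plain text
    text = extract_text(pdf_bytes)
    # Plain text -> dialogue script (the language model call would go here)
    script = write_script(text)
    # Dialogue script -> audio, one clip per line, concatenated in order
    audio = b"".join(synthesize(line) for line in script)
    # Transcript shown next to the audio player
    transcript = "\n".join(f"{line.speaker}: {line.text}" for line in script)
    return audio, transcript
```

Keeping the three stages as plain callables means any one of them (a different model, a different TTS provider) can be swapped without touching the others.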
Under the hood, it's built with Python, FastAPI, Gradio for the web UI, and my own library, promptic, for calling the LLM and getting structured output. The code is open-source and available on GitHub.
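For readers wondering what "structured output" means in practice, here is a rough sketch using only the Python standard library. It does not use promptic or show its API; the JSON shape and the function name `parse_dialogue` are assumptions made up for illustration.

```python
import json
from typing import List, Tuple


def parse_dialogue(raw: str) -> List[Tuple[str, str]]:
    """Parse and validate a JSON dialogue script returned by a language model.

    Assumed shape: {"dialogue": [{"speaker": "...", "text": "..."}, ...]}
    Returns a list of (speaker, text) pairs.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    items = data.get("dialogue")
    if not isinstance(items, list) or not items:
        raise ValueError("expected a non-empty 'dialogue' list")

    lines = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"dialogue item {i} is not an object")
        speaker = item.get("speaker")
        text = item.get("text")
        if not isinstance(speaker, str) or not isinstance(text, str) or not text.strip():
            raise ValueError(f"dialogue item {i} needs a string 'speaker' and non-empty 'text'")
        lines.append((speaker, text.strip()))
    return lines
```

Validating the shape before it reaches the TTS step means a malformed model response fails early with a readable error instead of producing broken audio.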
Apart from the tool's practical utility, I'm hoping this project can serve as a helpful example for others looking to build applications on top of large language models. It demonstrates an end-to-end flow from document intake to language model usage to audio output, with a simple web interface on top.
I would love to hear any feedback or ideas from the HN community! I think there's a lot of potential to expand on this concept and make all sorts of written content more accessible and engaging through audio conversion. Let me know what you think :)
It starts like this:
The way this uses different OpenAI TTS voices for the different roles is really neat!
Hopefully not much, but I've heard horror stories about trailing spaces...
However, I find that when I realize a podcast is generated using AI and synthetic audio, I immediately lose interest. For me, the value of podcasts lies in authentic human conversations, and AI-generated content just doesn’t have the same appeal.
Probably it's just me being obsessed with old-school podcasts, though. I do believe there are listeners (not sure if many or few) who don't mind if a podcast is AI-generated.
I have tried to set up something similar with text-to-speech browser extensions, but I lose my place if I have to close and reopen.
Take some article or book written for adults. Maybe some archaeological discovery, interesting stuff from HN. Or science books from the 1960s.
Then have it turned into a conversation between a father and his curious seven-year-old daughter, and convert it to audio with two different speakers.
While it’s been fun to build this, I never ended up letting my kids use it. It just feels wrong. The educational equivalent of Harlow’s Monkeys.
Do you have any samples of the audio? It would be great to hear what it's like before trying it out.
Also, have you considered doing this all in client-side JS? Would be a good way to protect the API key (at least in this demo case).
Like others have mentioned, I’d be scared to accidentally upload a 100 page PDF only for it to cost me $100 without me really knowing up front.
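One way to address that worry would be to estimate the TTS cost from the script length and refuse to continue above a budget. A minimal sketch follows; the function names and the default budget are made up for illustration, and the per-character price is deliberately a parameter rather than a hard-coded number, so check the provider's current pricing before relying on it. It covers only the text-to-speech step, not the language model call.

```python
def estimate_tts_cost(num_chars: int, usd_per_million_chars: float) -> float:
    """Estimate the cost in USD of synthesizing num_chars characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1_000_000 * usd_per_million_chars


def check_budget(script: str, usd_per_million_chars: float, max_usd: float = 1.0) -> float:
    """Return the estimated cost, or raise if it exceeds max_usd.

    max_usd defaults to an arbitrary placeholder budget.
    """
    cost = estimate_tts_cost(len(script), usd_per_million_chars)
    if cost > max_usd:
        raise ValueError(
            f"estimated cost ${cost:.2f} exceeds budget ${max_usd:.2f}"
        )
    return cost
```

Calling `check_budget` before the audio step lets the UI show the estimate, or an error, before any paid request is made.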
https://simply-ai.podbean.com
https://www.simplynews.ai