I do question whether it's helpful to focus on audio without text. They're focusing on melody and rhythm, and listening without subtitles does seem better for that, but it doesn't get you understanding. Listening to comprehensible input with subtitles gets you melody, rhythm, and actual vocabulary and grammar at the same time. It also lets you stretch "comprehensible" a bit further, since the text gives you an extra source of context.
Babies also have a lot of context for what the words they're hearing actually mean. I suppose there's the "watch translated Peppa Pig" approach for that.
My project (https://nuenki.app) translates appropriate-difficulty sentences into your target language as you browse, so you casually pick it up over time through comprehensible input.