> Vocal Synthesis: This allows one to generate new audio that sounds like someone singing. One can write lyrics, as well as a melody, and have the AI generate audio that matches them. You can even specify how you want the voice to sound. Google has also presented models capable of vocal synthesis, such as googlesingsong.
Google's SingSong paper does the exact opposite: given human vocals, it produces a musical accompaniment.
Given that Google is mentioned out of the blue, the «also» suggests that the mistaken word is «vocal»: [You can have vocal synthesis given music as an input, and] Google has also presented models capable of _music_ synthesis [given vocals as an input], such as googlesingsong.
I got into AI music back in 2017, kind of sparked by AlphaGo. Started by looking at machine listening stuff, like Nick Collins' work. Always been really curious about AI doing music live coding.
In 2019, I built this thing called RaveForce [github.com/chaosprint/RaveForce]. It was a fun project.
Back then, GANSynth was a big deal, looked amazing. But the sound quality… felt a bit lossy, you know? And MIDI generation, well, didn't really feel like "music generation" to me.
Now, I'm thinking about these things differently. Maybe the sound quality thing is like MP3 at first, then it becomes "good enough" – like a "retina moment" for audio? Diffusion models seem to be pushing this idea too. And MIDI, if used the right way, could be a really powerful tool.
Vocal synthesis and conversion are super cool. Feels like plugins, but next level. Really useful.
But what I really want to see is AI understanding music from the ground up. Like, a robot learning how synth parameters work. Then we could do for 8-bit music what the DRL breakthrough did for games. Not just training on tons of copyrighted music, making variations, and selling them, which is very cheap.
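To make the "robot learning how synth parameters work" idea concrete, here is a toy sketch (not any existing system): a one-parameter "synth" (a sine oscillator) and a search that recovers the oscillator frequency of a target sound by comparing magnitude spectra. All names here are hypothetical illustration, not a real library.

```python
import numpy as np

SR = 8000                      # sample rate for the toy example
t = np.arange(SR) / SR         # one second of time stamps

def synth(freq):
    # A one-parameter "synth": a plain sine oscillator.
    return np.sin(2 * np.pi * freq * t)

target = synth(440.0)          # the sound whose parameter we want to learn

def loss(freq):
    # Compare magnitude spectra so phase differences don't matter.
    a = np.abs(np.fft.rfft(synth(freq)))
    b = np.abs(np.fft.rfft(target))
    return float(np.mean((a - b) ** 2))

# Brute-force parameter search; a real system would use gradients or RL.
candidates = np.arange(200.0, 800.0, 1.0)
best = min(candidates, key=loss)
```

A grid search stands in for whatever learner you'd actually use; the point is only that the model interacts with the synth itself rather than with recordings.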
Lots. For example, there are dozens of models that have been trained specifically on Bach MIDIs to generate new Bach-style compositions. However, the generated MIDIs definitely do not sound like Bach :)
I'd link to some specific examples (easy to Google or search on GitHub) but I can't recall which models were more successful than others.
Almost nobody remembers it, but if you go back far enough, there was a Sid Meier game on the 3DO that algorithmically generated music in the style of Bach called (appropriately enough) CPU Bach.
This. Generating audio en masse is everything that's wrong with LLMs, and people trying to use them this way demonstrate a fundamental misunderstanding of music. The whole attraction of music is separate generators in temporary harmony, whether rhythmic, tonal, or timbral. Generating premixed streams of audio ('mixed' implying more than one voice or instrument) completely misses the point of how music is constructed in the first place. Anyone advocating this approach is not worth listening to.
But there are lots of applications for music which parallel the applications of AI-generated images - things that are more commercial in nature. The media is functional, for use cases such as commercials or social-media videos, where people just need something for the ambiance and don't want to deal with copyright or anything like that.
I am not sure that the internal process could not work through conceiving «temporary harmony[...] rhythmic, tonal, timbral [etc.]».
Furthermore, the sound itself is crucial, so perfect calibration of a perfect sound is definitely part of what can clearly be sought (when you do not want to leave that to a secondary human process in the workflow).
While I mostly agree with you, we know that music is defined by the listener. Who are we to discern what is or isn't music? Do you have the same opinion of text or code generated by or with the assistance of AI?
I almost never use MIDI, and beyond chord charts, none of the musicians I know write scores. No one is preventing you from creating the way you like, so get off your high horse. Do whatever makes you happy.
One obvious area of improvement will be allowing you to tweak specific sections of an AI-generated song. I was recently playing around with Suno, and while the results with their latest models are really impressive, sometimes you just want a little bit more control over specific sections of a track. To give a concrete example: I used deepseek-r1 to generate lyrics for a song about assabiyyah, and then used Suno to generate the track [0]. The result was mostly fine, but it pronounced assabiyyah as ah-sa-BI-yah instead of ah-sah-BEE-yah. A relatively minor nitpick.
> Stem Splitting: This allows one to take an existing song, and split the audio into distinct tracks, such as vocals, guitar, drums and bass. Demucs by Meta is an AI model for stem splitting.
+1 for Demucs (free and open source).
Our band went back and used Demucs-GUI on a bunch of our really old pre-DAW stuff - all we had was the final WAVs and it did a really good job splitting out drums, piano, bass, vocals, etc. with the htdemucs_6s model. There was some slight bleed between some of the stems but other than that it was seamless.
I have used htdemucs_6s a bunch, but I prefer the 4-stem model. The dedicated guitar and piano stems are usually full of really bad artifacts in the 6s model. It's still useful if you want to transcribe the part to sheet music, however - just not useful to me in music production or as a backing track.
My primary use is for creating backing tracks I can play piano / keyboard along with (just for fun in my home). Most of the time I'll just use the 4s model and will keep drums, bass and vocals.
Yeah I could see that. We had better luck with the 6-stem, maybe it's because we had both rhythm and lead guitar in the mixes, but the 4-stem version didn't work as well for us.
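For reference, a quick sketch of the Demucs command line (assuming the pip-installable demucs package; flags and model names may differ by version, so check `demucs --help`):

```shell
# Default 4-stem split (drums / bass / other / vocals):
demucs song.wav

# 6-stem split that adds guitar and piano stems:
demucs -n htdemucs_6s song.wav

# Karaoke-style two-way split: vocals vs. everything else:
demucs --two-stems=vocals song.wav
```

By default the stems land under a `separated/` directory, organized by model and track name.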
In the future we may have music gen models that dynamically generate a soundtrack to our life, based off of ongoing events, emotions, etc. as well as our preferences.
If this happens, main character syndrome may get a bit worse :)
> code is now being written with the help of LLMs, and almost all graphic design uses photoshop.
AI models are tools, and engineers and artists should use them to do more per unit time.
Text prompted final results are lame and boring, but complex workflows orchestrated by domain practitioners are incredible.
We're entering an era where small teams will have big reach. Small studio movies will rival Pixar, electronic musicians will be able to conquer any genre, and indie game studios will take on AAA game releases.
The problem will be discovery. There will be a long tail of content that caters to diverse audiences, but not everyone will make it.
> But what I really want to see is AI understanding music from the ground up. Like, a robot learning how synth parameters work.
IMO this would be much more useful.
https://openai.com/index/musenet
Also, Synfire is a somewhat difficult-to-grok DAW designed around algorithmically generating MIDI motifs as building blocks for longer pieces.
https://www.youtube.com/watch?v=OrtJjEiWBtI
It's not particularly well-known but it's been around for many years.
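The motif-as-building-block idea can be sketched in a few lines. This is a generic illustration, not Synfire's actual engine or API: a motif is a list of MIDI note numbers, and a phrase is assembled from classical transformations of it.

```python
# Hypothetical illustration: a motif as MIDI note numbers, plus the
# standard transformations used to derive variations from it.
motif = [60, 62, 64, 67]  # C4, D4, E4, G4

def transpose(m, semitones):
    # Shift every note up or down by a fixed interval.
    return [n + semitones for n in m]

def invert(m):
    # Mirror intervals around the first note.
    pivot = m[0]
    return [2 * pivot - n for n in m]

def retrograde(m):
    # Play the motif backwards.
    return m[::-1]

# A longer line built from the seed motif and its variations.
phrase = motif + transpose(motif, 7) + invert(motif) + retrograde(motif)
```

A real tool layers rhythm, voicing, and harmony on top, but the building-block principle is the same.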
> I'd link to some specific examples (easy to Google or search on GitHub) but I can't recall which models were more successful than others.
https://www.youtube.com/watch?v=nJkPWSKuTHI
> people just need something for the ambiance and don't want to deal with copyright or anything like that
All that really matters is whether users like what the generator generates.
[0] https://suno.com/song/0caf26e0-073e-4480-91c4-71ae79ec0497
Fundamentally, a song can be represented as a 2D image without any loss.
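A minimal numpy sketch of that claim, with one caveat: it is the *complex* STFT (a 2D array of frames by frequency bins) that is invertible, so the round trip below is exact away from the edges; a magnitude-only spectrogram image discards phase and is lossy.

```python
import numpy as np

N, HOP = 512, 256  # periodic Hann at 50% overlap sums to exactly 1

def stft(x):
    # Window each frame and take the real FFT -> a 2D "image" of the song.
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)  # periodic Hann
    starts = range(0, len(x) - N + 1, HOP)
    return np.array([np.fft.rfft(x[s:s + N] * w) for s in starts])

def istft(spec):
    # Invert each frame and overlap-add; the window overlaps sum to 1,
    # so interior samples are reconstructed exactly.
    out = np.zeros(HOP * (len(spec) - 1) + N)
    for k, frame in enumerate(spec):
        out[k * HOP:k * HOP + N] += np.fft.irfft(frame, N)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)   # stand-in for audio samples
recon = istft(stft(x))          # 2D "image" -> audio
```

Interior samples of `recon` match `x` to floating-point precision; only the first and last half-window, where the overlap is incomplete, differ.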
> Our band went back and used Demucs-GUI on a bunch of our really old pre-DAW stuff
https://github.com/CarlGao4/Demucs-Gui
> music gen models that dynamically generate a soundtrack to our life, based off of ongoing events
https://en.wikipedia.org/wiki/IMUSE
> Small studio movies will rival Pixar, electronic musicians will be able to conquer any genre, and indie game studios will take on AAA game releases.
If you think Pixar is Pixar solely because they have an in-house software stack, you're missing the forest for a small shrub.
Good writing and good directing don't need hundreds of millions of dollars.
I disagree that engineers and artists should do more per unit time. Like we need more content per second....
....as if art and real inspiration would ever follow the chaotic beat of human progress