My cofounder and I trained an AI music generation model and after a month of testing we're launching 1.0 today. Ours is interesting because it's a latent diffusion model instead of a language model, which makes it more controllable: https://sonauto.ai/
Others do music generation by training a Vector Quantized Variational Autoencoder (VQ-VAE) like Descript Audio Codec (https://github.com/descriptinc/descript-audio-codec) to turn music into tokens, then training an LLM on those tokens. Instead, we ripped the tokenization part off and replaced it with a normal variational autoencoder bottleneck (along with some other important changes to enable insane compression ratios). This gave us a nice, normally distributed latent space on which to train a diffusion transformer (like Sora). Our diffusion model is also particularly interesting because it is the first audio diffusion model to generate coherent lyrics!
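For the curious, here's a minimal sketch of that swap, assuming a PyTorch-style API; the module and function names are illustrative stand-ins, not our actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousBottleneck(nn.Module):
    """Replaces a VQ codebook lookup with a Gaussian reparameterized
    bottleneck, yielding a smooth, roughly normally distributed latent space."""
    def __init__(self, channels: int, latent_dim: int):
        super().__init__()
        self.to_mu = nn.Conv1d(channels, latent_dim, 1)
        self.to_logvar = nn.Conv1d(channels, latent_dim, 1)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        # The KL term pulls latents toward N(0, I), the same distribution the
        # downstream diffusion model starts sampling from.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).mean()
        return z, kl

def latent_diffusion_loss(dit, z0):
    """Train a diffusion transformer on continuous latents instead of tokens.
    `dit` is a hypothetical model that predicts the added noise."""
    t = torch.rand(z0.shape[0], device=z0.device)       # random timesteps in [0, 1)
    alpha = torch.cos(t * torch.pi / 2).view(-1, 1, 1)  # simple cosine schedule
    sigma = torch.sin(t * torch.pi / 2).view(-1, 1, 1)
    noise = torch.randn_like(z0)
    zt = alpha * z0 + sigma * noise                     # noised latents
    return F.mse_loss(dit(zt, t), noise)                # standard epsilon objective
```

The point is just that the LLM's discrete next-token objective gets replaced by a denoising objective over continuous latents.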
We like diffusion models for music generation because they have some interesting properties that make controlling them easier (so you can make your own music instead of just taking what the machine gives you). For example, we have a rhythm control mode where you can upload your own percussion line or set a BPM. Very soon you'll also be able to generate proper variations of an uploaded or previously generated song (e.g., you could even sing into Voice Memos for a minute and upload that!). @Musicians of HN, try uploading your songs and using Rhythm Control, and let us know what you think! Our goal is to enable more of you, not replace you.
For example, we turned this drum line (https://sonauto.ai/songs/uoTKycBghUBv7wA2YfNz) into this full song (https://sonauto.ai/songs/KSK7WM1PJuz1euhq6lS7 skip to 1:05 if impatient) or this other song I like better (https://sonauto.ai/songs/qkn3KYv0ICT9kjWTmins - we accidentally compressed it with AAC instead of Opus which hurt quality, though)
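For those wondering how rhythm control can work mechanically, here's a hedged sketch: encode the uploaded percussion line into the same latent space and feed it to the denoiser as conditioning at every step. All names (`vae.encode`, `dit.denoise_step`, etc.) are hypothetical, not our real API:

```python
import torch

def generate_with_rhythm(dit, vae, percussion_audio, text_emb, steps=50):
    """Condition latent diffusion on an uploaded percussion line (sketch only;
    `dit` and `vae` are hypothetical model objects)."""
    rhythm_latent = vae.encode(percussion_audio)   # carries onset/beat structure
    z = torch.randn_like(rhythm_latent)            # start from pure noise
    for t in reversed(range(steps)):
        # The denoiser sees the rhythm latent at every step, so the generated
        # song locks onto the uploaded drum line's timing.
        z = dit.denoise_step(z, t, cond=[text_emb, rhythm_latent])
    return vae.decode(z)
```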
We also like diffusion models because while they're expensive to train, they're cheap to serve. We built our own efficient inference infrastructure instead of using those expensive inference-as-a-service startups that are all the rage. That's why we're making generations on our site free and unlimited for as long as possible.
We'd love to answer your questions. Let us know what you think of our first model! https://sonauto.ai/
Speaking as a musician who plays real instruments (as opposed to electronic production): how does this help me? And how does this enable more of me?
I am asking with an open mind, with no cynicism intended.
We want you to be able to upload recordings of your real instruments and do all sorts of cool things with them (e.g., transform them, generate vocals for your guitar riff, reuse the melody in a jazz song, or just get some inspiration for what to add next).
IMO AI alone will never be able to touch hearts like real people do, but people using AI will be able to like never before.
I've said it before: there is no consumer market for an infinity jukebox, because you can't sing along with songs you don't already know, there's already an overabundance of recorded music, and emotion in generative music (especially vocals) is fake. Nobody likes fakery for its own sake. Marketers like it because they want musical wallpaper, the same way commercials have it and it increasingly seeps into 'news' coverage. The market for fully-generated songs is background music in supermarkets, product launch videos, and in-group entertainment ('original songs for your company holiday party! Hilarious musical portraits of your favorite executives - us!').
If you want to innovate in this area (and you should, your diffusion model sounds interesting), make an AI band that can accompany solo musicians. Prioritize note data rather than fully produced tracks (you can have an AI mix engineer as well as an AI bass player or drummer). Give people tools to build something in stages and they'll get invested in it. People want interactivity, not a slot machine. Many musicians love sequencers, arpeggiators, chord generators, and other musical automata; what they don't love is a magic 8-ball that leaves them with nothing to do and makes them feel uncreative.
Otherwise your product will just end up on the cultural scrapheap, associated with lowest-common-denominator fakers spamming social media, as is already happening with imagery.
I'm just asking to try to build some intuition about where people who actually train SOTA models think capabilities are heading.
Either way, congrats on the launch :)
We are already at a stage where AI is touching hearts.
Hm... From my vantage point, it seems like a pretty weird choice of businesses if you think that.
> IMO AI alone will never be able to touch hearts like real people do, but people using AI will be able to like never before.
That's all very heartwarming but musicianship is also a profession, not just a human expression of creativity. Even if you're not charging yet, you're a business and plan on profiting from this, right? It seems to me that:
1) Generally, if people want music currently, they pay for musician-created music, even if it's wildly undervalued in venues like streaming services.
2) You took music, most of which people already paid musicians to create (and those musicians aren't getting paid any more because of this), and used it to build an automated service that people can pay for music instead of paying musicians.
3) Your service certainly doesn't hurt, and might even enhance people's ability to write and perform music without considering the economics of doing so. For example, hobbyists.
4) So you're not trying to replace musicians making music with people typing in prompts-- you're trying to replace musicians being paid to make music with you being paid to make music. Right? Your business isn't replacing musicianship as a human art form, but for it to succeed, it will have to replace it, in some amount, as a profession, right? Unless you are planning on creating an entirely new market for music, fundamentally, I'm not sure how it couldn't.
Am I wrong on the facts here? If not, well hey, this is capitalism and that's just how it works around here. If I'm mistaken, I'd like to hear how. Regardless, this is very consequential to a lot of people, and they deserve to have the people driving these changes be upfront about it, not gloss over it.
In this way it is a tool only useful to expert musicians.
It's a good muse, but I wouldn't trust what it makes out of the gate.
There's always going to be a balance between high-level tools like this with no dials and low-level tools with finer control, and while this touts itself as being "more controllable", it's clearly not there yet. But, the same way Adobe has integrated outpainting and generative fill into Photoshop, it's only a matter of time before products like this are built into Ableton and VSTs, where a creator can highlight a bar or two and ask your AI to make the snippet more ethereal, create a bridge between the verse and the sax solo, or help you with an outro.
That said, similar to generating basic copy for a marketing site, these tools will be great for generating cheap background music but not much else. Any musician, marketing agency, or film-maker worth their salt is going to need very specifically branded music for their needs, and they're likely willing to pay for a real licence to something audiences will recognize, using generative AI tools to remix the content to their specific need.
How long have you been working on this?
Look at current music production and compare it to past. Older music seems so much simpler. It was so much easier to come up with that 20% 'novel' when pop/recorded music was new. Ironically I think AI freeing people to focus on that 20% is going to add a lot of creativity to music, not reduce it.
I say this as someone who hates the concept of AI music. I'm actually really excited to see what it enables/creates (but I don't want to use it, even though I really could use it for vocals that I currently pay others to do for me).
I'll be here making my bad knockoffs of bad synth pop bands, having fun and taking weeks to do 5% of what kids these days will start off with as their entry point, with my 20% creativity ignored because my music sounds 'off' when I can't get the 80% familiar down.
People thought synthesizers were the end of music, yet Switched on Bach begot Jean Michel Jarre begot Kate Bush and on and on.
Also, our model specifically excels at songs from the era before overproduction. Try asking for a Johnny Cash or Ella Fitzgerald-style country or swing/jazz song!
Here's an example: https://sonauto.ai/songs/taJX3GrKZW7C5qOhjopr
Why diffuse an entire track? We should be building these models to create music the same way that humans do: diffuse samples, have the model arrange those samples in a proper sequencer, diffuse vocals separately, etc.
The problem with Suno etc., as other people have mentioned, is that you can't iterate or adjust anything. Saying "make the drums a little punchier and faster paced right after the chorus" is a really tough query to process if you've diffused the whole track rather than built it up.
Same thing with LLM story writing: the writing needs a good foundation. It's more like generating information about the world and its history first, then generating a story that takes that stuff into account, vs. a simple "write me a story about x".
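Worth noting: diffusion models are actually reasonably suited to that kind of local edit, via mask-based inpainting (the RePaint-style trick from image diffusion). A minimal sketch, with hypothetical `dit.denoise_step` / `dit.add_noise` methods rather than any real product's API:

```python
import torch

def inpaint_region(dit, z_song, region_mask, cond, steps=50):
    """Regenerate only a masked time span of a song's latents (e.g., the bars
    right after the chorus), keeping the rest fixed. `region_mask` is 1 where
    we regenerate and 0 where we keep the original."""
    z = torch.randn_like(z_song)
    for t in reversed(range(steps)):
        z = dit.denoise_step(z, t, cond=cond)       # denoise everything
        z_known = dit.add_noise(z_song, t)          # re-noise the kept material
        z = region_mask * z + (1 - region_mask) * z_known  # stitch per step
    return z
```

With a prompt like "punchier drums" as `cond`, only the masked span changes while staying coherent with its surroundings.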
I play guitar, but I'm not much of a guitarist or singer. I really like songwriting, not trying to be polished as a performer. So I intermittently look into the AI world to see whether it has tools I could use to generate a higher-quality song demo than I could do on my own.
I've been looking for something that could take a chord progression and style instructions and create a decent backing track for a singer to sing over.
But your saying "Very soon you'll also be able to generate proper variations of an uploaded or previously generated song (e.g., you could even sing into Voice Memos for a minute and upload that!)" is very intriguing. I mean, I can sing and play, it just isn't very professional. But if I could then have an AI take what I did and just... make it better... that would be kind of awesome.
In fact, I believe you could have a very big market among songwriters if you could do that. What I would love to see is this:
My guitar parts are typically not just strummed, but involve picking, sometimes fairly intricate. I'm just not that good at it. It would be fantastic to have an AI that would just take what I played and fix it so that it's more perfect.
And then to have a tool where I could say, "OK, now add a bass part," and "OK, now add drums" would be awesome.
https://www.pgmusic.com/
https://youtu.be/PCYTqDSUbvU
I think it's better to think of the process of finding the right song as a search algorithm through the space of all possible songs. The current approach just uses a "pick a random point in a general area". Once we find something that is roughly correct we need something that lets us iteratively tweak the aspects that are not quite right, decreasing the search space and allowing us to iteratively take smaller and smaller steps in defined directions.
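One concrete version of those "smaller and smaller steps": instead of resampling from scratch, partially re-noise the current song's latents and denoise again (the SDEdit/img2img trick). A sketch with a hypothetical denoiser API, not any shipping product's:

```python
import torch

def vary(dit, z_current, cond, strength=0.3, steps=50):
    """Produce a variation whose distance from the current song is controlled
    by `strength`: small values tweak details, large values jump further."""
    start = int(steps * strength)          # only re-noise part of the way
    z = dit.add_noise(z_current, start)    # hypothetical forward-noising call
    for t in reversed(range(start)):
        z = dit.denoise_step(z, t, cond=cond)
    return z
```

Shrinking `strength` as you converge is exactly the decreasing-search-space behavior described above.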
We just published a blog post today discussing this - https://montyanderson.net/writing/synthesis
I think the other missing pieces I've found are upscaling and stem splitting. While tools for splitting stems exist, my testing found that they didn't work well in practice (at least on Suno music), likely due to a combination of encoder-specific artifacts and the overall low sound quality. Existing upscaling approaches also faced similar issues.
My naive guess is that these are things that will benefit from being closely intertwined with the generation process. E.g., when splitting up stems, you could use the diffusion model(s) to help jointly converge the individual stems into reasonable standalone tracks.
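That joint-convergence idea resembles diffusion-based source separation (e.g., MSDM): denoise several stem latents in parallel while repeatedly projecting them so they sum to the mix. A naive sketch, assuming (as a simplification) that latents mix roughly additively like audio does, with hypothetical per-stem models:

```python
import torch

def separate_stems(dit_stems, z_mix, steps=50):
    """Jointly denoise one latent per stem model, enforcing mixture
    consistency at every step. `dit_stems` are hypothetical per-stem models
    (drums, bass, vocals, ...)."""
    zs = [torch.randn_like(z_mix) for _ in dit_stems]
    for t in reversed(range(steps)):
        zs = [dit.denoise_step(z, t) for dit, z in zip(dit_stems, zs)]
        residual = z_mix - sum(zs)                 # how far off the mix we are
        zs = [z + residual / len(zs) for z in zs]  # distribute the correction
    return zs  # one cleaned-up latent per stem
```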
I'm excited about the potential of these tools. I've definitely personally found use cases for small independent game projects where paying for musicians is far out of budget, and the style of music is not one I can execute on my own. But I'm not willing to sacrifice quality of results to do so.
At some point it's just not efficient to try and get the desired output purely through a prompt, and it would be helpful to download the output in a format you can plug into your DAW to tweak.
Variation in small details is fine, but you need control over larger scale structure.