Posted by u/zaptrem 2 years ago
Show HN: Sonauto – A more controllable AI music creator (sonauto.ai/...)
Hey HN,

My cofounder and I trained an AI music generation model and after a month of testing we're launching 1.0 today. Ours is interesting because it's a latent diffusion model instead of a language model, which makes it more controllable: https://sonauto.ai/

Others do music generation by training a Vector Quantized Variational Autoencoder like Descript Audio Codec (https://github.com/descriptinc/descript-audio-codec) to turn music into tokens, then training an LLM on those tokens. Instead, we ripped the tokenization part off and replaced it with a normal variational autoencoder bottleneck (along with some other important changes to enable insane compression ratios). This gave us a nice, normally distributed latent space on which to train a diffusion transformer (like Sora). Our diffusion model is also particularly interesting because it is the first audio diffusion model to generate coherent lyrics!
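For readers unfamiliar with the distinction, here's a minimal numpy sketch of the two bottlenecks being contrasted (shapes, sizes, and the toy codebook are purely illustrative, not Sonauto's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output: 8 latent frames, 4 dims each.
z = rng.standard_normal((8, 4))

# --- VQ-VAE route: snap each frame to its nearest codebook entry ---
codebook = rng.standard_normal((16, 4))  # 16 learned code vectors
tokens = np.argmin(
    ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1
)  # discrete ids, the kind of sequence an LLM would model

# --- plain-VAE route: keep a continuous, roughly Gaussian latent ---
mu, log_var = z, np.full_like(z, -1.0)  # stand-ins for encoder heads
z_cont = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

print(tokens.shape)   # (8,)   discrete token sequence
print(z_cont.shape)   # (8, 4) continuous latent, diffusion-friendly
```

The discrete tokens suit next-token prediction; the continuous, normally distributed latent is what a diffusion transformer can be trained to denoise.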

We like diffusion models for music generation because they have some interesting properties that make controlling them easier (so you can make your own music instead of just taking what the machine gives you). For example, we have a rhythm control mode where you can upload your own percussion line or set a BPM. Very soon you'll also be able to generate proper variations of an uploaded or previously generated song (e.g., you could even sing into Voice Memos for a minute and upload that!). @Musicians of HN, try uploading your songs and using Rhythm Control/let us know what you think! Our goal is to enable more of you, not replace you.

For example, we turned this drum line (https://sonauto.ai/songs/uoTKycBghUBv7wA2YfNz) into this full song (https://sonauto.ai/songs/KSK7WM1PJuz1euhq6lS7, skip to 1:05 if impatient), or this other song I like better (https://sonauto.ai/songs/qkn3KYv0ICT9kjWTmins, though we accidentally compressed that one with AAC instead of Opus, which hurt quality).

We also like diffusion models because while they're expensive to train, they're cheap to serve. We built our own efficient inference infrastructure instead of using one of the expensive inference-as-a-service startups that are all the rage. That's why we're making generations on our site free and unlimited for as long as possible.

We'd love to answer your questions. Let us know what you think of our first model! https://sonauto.ai/

adrianh · 2 years ago
I'm interested to hear more about your statement of "Our goal is to enable more of you, not replace you."

Speaking as a musician who plays real instruments (as opposed to electronic production): how does this help me? And how does this enable more of me?

I am asking with an open mind, with no cynicism intended.

zaptrem · 2 years ago
If the future of music were truly just typing some text into a box and taking or leaving what the machine gives you, that would be kinda depressing.

We want you to be able to upload recordings of your real instruments and do all sorts of cool things with them (e.g., transform them, generate vocals for your guitar riff, reuse the melody in a jazz song, or just get some inspiration for what to add next).

IMO AI alone will never be able to touch hearts like real people do, but people using AI will be able to like never before.

anigbrowl · 2 years ago
But then why are you going down the dead-end route of generating complete songs? Nobody wants this except marketing people.

I've said it before: there is no consumer market for an infinity jukebox, because you can't sing along with songs you don't already know, there's already an overabundance of recorded music, and emotion in generative music (especially vocals) is fake. Nobody likes fakery for its own sake. Marketers like it because they want musical wallpaper, the same way commercials have it and it increasingly seeps into 'news' coverage. The market for fully-generated songs is background music in supermarkets, product launch videos, and in-group entertainment ('original songs for your company holiday party! Hilarious musical portraits of your favorite executives - us!').

If you want to innovate in this area (and you should, your diffusion model sounds interesting), make an AI band that can accompany solo musicians. Prioritize note data rather than fully produced tracks (you can have an AI mix engineer as well as an AI bass player or drummer). Give people tools to build something in stages and they'll get invested in it. People want interactivity, not a slot machine. Many musicians love sequencers, arpeggiators, chord generators, and other musical automata; what they don't love is a magic 8-ball that leaves them with nothing to do and makes them feel uncreative.

Otherwise your product will just end up on the cultural scrapheap, associated with lowest-common-denominator fakers spamming social media, as is already happening with imagery.

Version467 · 2 years ago
Just to clarify: when you say never, do you actually mean never (or some practical equivalent, like ~100 years), or do you mean not right now, but possibly in 5-10 years?

I'm just asking to try to build some intuition about where people who actually train SOTA models think capabilities are heading.

Either way, congrats on the launch :)

manibatra · 2 years ago
Love what you are doing, but "never" is just not true. I used Suno to create a song about our daughter the other day which had my wife and me in tears.

We are already at a stage where AI is touching hearts.

chefandy · 2 years ago
> If the future of music was truly just typing some text into a box and taking or leaving what the machine gives you that would be kinda depressing.

Hm... From my vantage point, it seems like a pretty weird choice of businesses if you think that.

> IMO AI alone will never be able to touch hearts like real people do, but people using AI will be able to like never before.

That's all very heartwarming but musicianship is also a profession, not just a human expression of creativity. Even if you're not charging yet, you're a business and plan on profiting from this, right? It seems to me that:

1) Generally, if people want music currently, they pay for musician-created music, even if it's wildly undervalued in venues like streaming services.

2) You took music, most of which people already paid musicians to create (and who aren't getting paid anything more because of this), and used it to build an automated service that people can pay for music instead of paying musicians.

3) Your service certainly doesn't hurt, and might even enhance, the ability to write and perform music for people who aren't considering the economics of doing so: hobbyists, for example.

4) So you're not trying to replace musicians making music with people typing in prompts-- you're trying to replace musicians being paid to make music with you being paid to make music. Right? Your business isn't replacing musicianship as a human art form, but for it to succeed, it will have to replace it, in some amount, as a profession, right? Unless you are planning on creating an entirely new market for music, fundamentally, I'm not sure how it couldn't.

Am I wrong on the facts here? If not, well hey, this is capitalism and that's just how it works around here. If I'm mistaken, I'd like to hear how. Regardless, this is very consequential to a lot of people, and they deserve to have the people driving these changes be upfront about it, not gloss over it.

LZ_Khan · 2 years ago
Inspiration? You can generate hundreds of ideas in a day. The tracks will not be perfect, but that's where actual musicians can take the ideas/themes from the tracks and perfect them.

In this way it is a tool only useful to expert musicians.

jimmyjazz14 · 2 years ago
I mean, if you want inspiration there are literally millions of amazing songs on Spotify by real musicians. I have yet to hear an AI-composed song that was in the least bit musically inspiring.
93po · 2 years ago
When Suno came out I spent literally hours/days playing around with it to generate music, and came out with some that's really close to good, good enough that I've gone back to listen to a few. I'd love the tooling to take a premise and be able to tweak it to my liking, without spending 1000 hours learning specific software and without thousands of hours learning to play an instrument or learning to sing.
yoyohello13 · 2 years ago
I just don’t get this. Part of the joy of creating things is the work I put in. The easier something is to make, the less meaning it has to me. I feel like just asking a machine to make a bunch of songs is kind of meaningless.
suyash · 2 years ago
That is just 'marketing speak'. At the end of the day you are their customers: they need to make money from users who will be using their service to make music.
whoomp12341 · 2 years ago
same thing with AI code writing.

It's a good muse, but I wouldn't trust what it makes out of the gate.

cush · 2 years ago
There are a lot of negative comments here, but these are the earliest days, and generating entire songs is kind of the hello world of this tech.

There's always going to be a balance between creating high-level tools like this with no dials and low-level tools with finer control, and while this touts itself as being "more controllable", it's clearly not there yet. But, the same way Adobe has integrated outpainting and generative fill into Photoshop, it's only a matter of time before products like this are built into Ableton and VSTs, where a creator can highlight a bar or two and ask the AI to make the snippet more ethereal, create a bridge between the verse and the sax solo, or help with an outro.

That said, similar to generating basic copy for a marketing site, these tools will be great for generating cheap background music but not much else. Any musician, marketing agency, or film-maker worth their salt is going to need very specifically branded music for their needs, and they're likely willing to pay for a real licence to something audiences will recognize, using generative AI tools to remix the content to their specific need.

TheActualWalko · 2 years ago
If anyone here is interested in something that leans towards the Ableton end of the spectrum, we're building this: https://wavtool.com/
antidnan · 2 years ago
Wow, so cool, very interested. This is exactly what I wanted to see with next gen DAWs.

How long have you been working on this?

cush · 2 years ago
So rad!

Deleted Comment

boringg · 2 years ago
I want to say two things. One, congrats: I am sure your team has been working exceptionally hard to develop this, and the songs sound reasonably good for AI! Two, I am so completely unenthusiastic about AI music and its infiltration of the music world; all of it sounds like fingernails on a chalkboard. Just mainstream, overproduced, low-quality radio music. I know it's a stepping stone, but it kills me to listen to it right now.
visarga · 2 years ago
That's because you didn't listen to the MIT license song. Gen music has the potential to make even the driest texts sound good, I didn't realize that before. How about paper abstract music? https://suno.com/song/cb729eb6-4cc5-4c15-ab74-0cdbef779684
_DeadFred_ · 2 years ago
80% of music is familiarity, 20% novelty, yet the majority of peoples' time goes into getting the 80% down so that they can add their 20%.

Look at current music production and compare it to past. Older music seems so much simpler. It was so much easier to come up with that 20% 'novel' when pop/recorded music was new. Ironically I think AI freeing people to focus on that 20% is going to add a lot of creativity to music, not reduce it.

I say this as someone who hates the concept of AI music. I'm actually really excited to see what it enables/creates (but I don't want to use it, even though I really could use it for vocals that I currently pay others to do for me).

I'll be here making my bad knockoffs of bad synth pop bands having fun and taking weeks to do 5% of what kids these days will start off as their entry point, with my 20% creativity ignored because my music sounds 'off' when I can't get the 80% familiar down.

People thought synthesizers were the end of music, yet Switched on Bach begot Jean Michel Jarre begot Kate Bush and on and on.

mewpmewp2 · 2 years ago
I would agree when AI gets to a point where it's possible to do that 20%. It is just not possible yet to combine it in such ways. Right now you basically get whatever music, but there's no way to add that 20%. Same with image/video generation. AI advancements have obviously been amazing and far beyond what I would've expected, but there's still ways to go.
zaptrem · 2 years ago
Agreed. My thoughts on this are here: https://news.ycombinator.com/item?id=39992817#39994616

Also, our model specifically excels at songs from the era before overproduction. Try asking for a Johnny Cash or Ella Fitzgerald-style country or swing/jazz song!

Here's an example: https://sonauto.ai/songs/taJX3GrKZW7C5qOhjopr

cowboylowrez · 2 years ago
how does the model know how to do a johnny cash style? did you feed it johnny cash tracks? if so, what were the licensing terms? are you interested in answering these questions about training data or would this be too dodgy to chat about on a tech website?
fennecbutt · 2 years ago
I really feel like the popularity of diffusion has made it far too shallow.

Why diffuse an entire track? We should be building these models to create music the same way that humans do, by diffusing samples, then having the model build the song using samples in a proper sequencer, diffuse vocals etc.

Problem with Suno etc, is that as other people have mentioned, you can't iterate or adjust anything. Saying "make the drums a little punchier and faster paced right after the chorus" is a really tough query to process if you've diffused the whole track rather than built it up.

Same thing with LLM story writing, the writing needs a good foundation, more generating information about the world and history and then generating a story taking that stuff into account, vs a simple "write me a story about x"

zaptrem · 2 years ago
I completely agree on the editing aspect. However, if you want to generate five stem tracks, then all five tracks must have the full bandwidth of your autoencoder. Accordingly, each inference or training step would take much more compute for the same result. That's why we'd prefer to generate it all together and split after.
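A back-of-envelope sketch of that blow-up (all numbers made up for illustration, not our actual config):

```python
# Hypothetical latent sizes: F latent frames per second,
# a 90-second song, and 5 separate stems.
F, seconds, stems = 50, 90, 5

mixed = F * seconds           # positions per step, everything mixed together
split = F * seconds * stems   # positions per step, one sequence per stem

linear_cost = split / mixed        # 5x more positions to process...
attn_cost = (split / mixed) ** 2   # ...and up to 25x in full self-attention
print(linear_cost, attn_cost)
```

So generating one mixed track and source-separating afterward sidesteps a multiplicative cost on every diffusion step.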
saaaaaam · 2 years ago
How worried are you about being sued? Seems like your training data probably includes quite a bit of copyright protected stuff. Just listened to the “blue scoobie doo” example and the influences are fairly obvious. With record companies getting super litigious about this, is that a concern? Or did you licence your training data?
garyrob · 2 years ago
My hobby is songwriting. (Example: https://www.youtube.com/watch?v=Kjng3UoKkGk)

I play guitar, but I'm not much of a guitarist or singer. I really like songwriting, not trying to be polished as a performer. So I intermittently look into the AI world to see whether it has tools I could use to generate a higher-quality song demo than I could do on my own.

I've been looking for something that could take a chord progression and style instructions and create a decent backing track for a singer to sing over.

But your saying "Very soon you'll also be able to generate proper variations of an uploaded or previously generated song (e.g., you could even sing into Voice Memos for a minute and upload that!)" is very intriguing. I mean, I can sing and play, it just isn't very professional. But if I could then have an AI take what I did and just... make it better... that would be kind of awesome.

In fact, I believe you could have a very big market among songwriters if you could do that. What I would love to see is this:

My guitar parts are typically not just strummed, but involve picking, sometimes fairly intricate. I'm just not that good at it. It would be fantastic to have an AI that would just take what I played and fix it so that it's more polished.

And then to have a tool where I could say, "OK, now add a bass part," and "OK, now add drums" would be awesome.

maroonblazer · 2 years ago
If all you're looking for is polished backing tracks, why couldn't Band in a Box serve that function?

https://www.pgmusic.com/

garyrob · 2 years ago
It could, but I want it to be even easier and with better results! I think AI has that potential. I am absolutely sure it does, in fact, and that some AI product will obsolete Band In A Box within the next decade. Maybe within the next year. If the people who make BIAB aren't working on it, themselves, with full focus, they are making a big mistake.
mschulkind · 2 years ago
Check out this AI vocals plugin. It's pretty impressive already.

https://youtu.be/PCYTqDSUbvU

LastTrain · 2 years ago
That song is quite nice, and so is the performance. It would, IMO, be less good if it were 'fixed' to be more perfect.
zaptrem · 2 years ago
Awesome to hear this resonates with you! If you join our Discord server I'll ping @everyone when improvements are ready.

Dead Comment

dwallin · 2 years ago
I think the problem here is the same one as the other current music generation services. Iteration is so important to creativity and right now you can't really properly iterate. In order to get the right song you just spray and pray and keep generating until one that is sufficient arrives or you give up. I know you hint at this being a future direction of development but in my opinion it's a key feature to take these services beyond toys.

I think it's better to think of the process of finding the right song as a search algorithm through the space of all possible songs. The current approach amounts to "pick a random point in a general area". Once we find something that is roughly correct, we need something that lets us iteratively tweak the aspects that are not quite right, shrinking the search space and allowing us to take smaller and smaller steps in defined directions.
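One concrete mechanism for those smaller steps, borrowed from image diffusion (the SDEdit-style img2img trick): re-noise the existing latent only part-way, then denoise from there, so a "strength" knob controls step size. A toy numpy sketch, where the denoiser is a stand-in, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(1)

def vary(latent, strength, denoise, steps=10):
    """Partially re-noise a latent, then denoise it back.

    strength in (0, 1]: small values take small steps away from the
    original; strength=1.0 discards it and samples nearly from scratch.
    """
    noisy = (np.sqrt(1 - strength) * latent
             + np.sqrt(strength) * rng.standard_normal(latent.shape))
    for t in range(int(steps * strength), 0, -1):
        noisy = denoise(noisy, t / steps)
    return noisy

# Stand-in "denoiser" that just shrinks the latent a little each step;
# a real model would predict and remove the noise with a network.
def toy_denoise(x, t):
    return x * (1 - 0.1 * t)

song = rng.standard_normal(16)                          # an existing song latent
tweak = vary(song, strength=0.2, denoise=toy_denoise)   # nearby variation
fresh = vary(song, strength=1.0, denoise=toy_denoise)   # essentially a new song
```

With a trained denoiser, that one knob maps directly onto the search idea above: strength near 0 explores a tight neighborhood of the current song, strength near 1 just re-rolls the dice.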

Barneyhill · 2 years ago
Yep, I came to similar conclusions w/ text-to-audio models - in terms of creative work the ability to iterate is really lacking with the current interfaces. We've stopped working on text-to-audio models and are instead focusing on targeting a lower-level of abstraction by directly exposing an Ableton environment to LLM agents.

We just published a blog today discussing this - https://montyanderson.net/writing/synthesis

zaptrem · 2 years ago
Our variations feature coming very soon is exactly this! Rhythm Control is an early version of this.
dwallin · 2 years ago
I'll keep an eye out for that! The variations feature in Suno is a good example of what not to do here, as it effectively just makes another random iteration using existing settings.

I think the other missing pieces I've found are upscaling and stem splitting. While tools for splitting stems exist, my testing found that they didn't work well in practice (at least on Suno music), likely due to a combination of encoder-specific artifacts and the overall low sound quality. Existing upscaling approaches faced similar issues.

My naive guess is that these are things that will benefit from being closely intertwined with the generation process. E.g., when splitting up stems, you can use the diffusion model(s) to help jointly converge individual stems into reasonable standalone tracks.

I'm excited about the potential of these tools. I've definitely personally found use cases in small independent game projects where paying for musicians is far out of budget, and the style of music is not one I can execute on my own. But I'm not willing to sacrifice quality of results to do so.

SubiculumCode · 2 years ago
I uploaded a bit of a song that I recorded once (that I wrote, unpublished), and I am trying to get it to riff on it, generate something close to it, etc.
SubiculumCode · 2 years ago
More strength does what? More or less similar?
nomel · 2 years ago
Same with text models, for me. If I can't edit my query and the AI response, to retry/keep the context in check, then I have trouble finding use for it, in creation. I need to be able to directly influence the entire loop, and, most importantly, keep the context for the next token prediction clean and short.
skybrian · 2 years ago
Letting you edit the response is quite easy to do, technically speaking. Unfortunately, it's not done in the default UI for most AI chatbots; you'll need to look for alternative UIs.
ljm · 2 years ago
I've noticed that the output tends to suffer when you pass in longer lyrics, too. Lots of my experiments start off fairly strong, but then it's like the model starts to forget, and the lyrics lose any rhythmic structure or just become incoherent.

At some point it's just not efficient to try and get the desired output purely through a prompt, and it would be helpful to download the output in a format you can plug into your DAW to tweak.

p1esk · 2 years ago
But that’s not a problem when listening to Spotify? Why can’t we treat these music generation engines the same way we treat music streaming services?
darby_eight · 2 years ago
Idk what you're referring to specifically, but music discovery services are terrible across all of Spotify, Apple Music, Google Music, Tidal, etc. I don't expect these services to read your mind, but they also don't ask for many parameters to help with the search. Definitely a huge opportunity here for innovative new services.
ctrw · 2 years ago
Basically you need something like ComfyUI for music.

Variation in small details is fine, but you need control over larger scale structure.

rcarmo · 2 years ago
Nice, but Google login is a no-go for me (or any form of social login, really).
dubeux · 2 years ago
same.
anjel · 2 years ago
same.