I built something similar which is a lot faster, but on a large-scale test the Google software handily outperforms my own (92% accuracy versus 87% or so, and that is a huge difference because it translates into roughly 30% fewer errors).
Wow the Onsets and Frames algorithm is insanely interesting. It's like a mixture of run-length encoding of (vertical and horizontal) strings of (0dim/time) and (1d/time) structures (onsets as points in time, activations as lines in time). But..hm.. why stop at such low dimensionality structures..! :^)
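If it helps make that concrete, here's a minimal toy sketch of the points-and-lines idea (not the actual Onsets and Frames code; the 88-key layout, the 0.5 threshold, and the ~31 fps frame rate are my own assumptions), just pairing an onset "point" with the run of frame activations that follows it:

    import numpy as np

    def decode_notes(onset_probs, frame_probs, threshold=0.5, fps=31.25):
        """Pair each onset 'point' with the run of frame activations that follows it.

        onset_probs, frame_probs: (num_frames, 88) arrays of probabilities in [0, 1].
        Returns (midi_pitch, start_sec, end_sec) tuples.
        """
        num_frames, num_pitches = frame_probs.shape
        notes = []
        for pitch in range(num_pitches):
            t = 0
            while t < num_frames:
                if onset_probs[t, pitch] >= threshold:   # a note can only start at an onset
                    start = t
                    t += 1
                    # ...and it lasts for as long as the frame head stays on (the "line")
                    while t < num_frames and frame_probs[t, pitch] >= threshold:
                        t += 1
                    notes.append((pitch + 21, start / fps, t / fps))  # 21 = MIDI number of A0
                else:
                    t += 1
        return notes

    # Fake probabilities just to exercise the decoder.
    rng = np.random.default_rng(0)
    onsets = (rng.random((100, 88)) > 0.99).astype(float)
    frames = np.maximum(onsets, rng.random((100, 88)) * 0.6)
    print(decode_notes(onsets, frames)[:5])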
Shame that it uses quadratically scaling transformers - there are many sub-quadratic transformers that work about as well or better (https://github.com/lucidrains?tab=repositories) - because that 4-second sub-sample limitation seems quite unlike how I imagine most people experience music. Interesting, though. I wonder if I could take a stab at this..
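For the 4-second limitation, the obvious workaround is to chunk and stitch. A rough sketch of what I mean - transcribe_segment here is a hypothetical stand-in for whatever model call you'd actually wrap, and the overlap/de-duplication is deliberately naive:

    SEGMENT_SEC = 4.0   # the window size mentioned above
    HOP_SEC = 3.5       # a little overlap so notes at the seams aren't chopped

    def transcribe_long(audio, sr, transcribe_segment):
        """Chunk `audio` (a 1-D sample array) into ~4 s windows and stitch the results.

        `transcribe_segment(chunk, sr)` is a hypothetical callable that returns
        (pitch, start_sec, end_sec) tuples relative to the start of the chunk.
        """
        seg_len = int(SEGMENT_SEC * sr)
        hop = int(HOP_SEC * sr)
        notes = []
        for start in range(0, len(audio), hop):
            chunk = audio[start:start + seg_len]
            offset = start / sr
            notes.extend((p, s + offset, e + offset)
                         for p, s, e in transcribe_segment(chunk, sr))
        # Very naive de-duplication of notes seen in two overlapping windows.
        return sorted(set((p, round(s, 2), round(e, 2)) for p, s, e in notes))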
Also interesting that the absolute timing of onsets worked better than relative timing - that also seems kinda bizarre to me, since, when I listen to music, it is never in absolute terms (e.g. "wow I just loved how this connects to the start of the 12th bar" vs "wow I loved that transition from what was playing 2 bars ago").
Another thing on relative timing.. when I listen to music, very nuanced, gradual, and intentional deviations of tempo have significant sentimental effects for me - which suggests you need a 'covariant' description of how the tempo changes over time: not only do you need the relative timing of events, you also need the relative timing of the relative timing of events as well (there's a tiny numeric sketch after the examples below).
Some examples:
- Jonny Greenwood's Phantom Thread II from the Phantom Thread soundtrack [0]
- the breakdown in Holy Other's amazing "Touch" [1], where the song basically grinds to a halt before releasing all the pent-up emotional potential energy.
[0] https://www.youtube.com/watch?v=ztFmXwJDkBY, especially just before the violin starts at 1:04
[1] https://www.youtube.com/watch?v=OwyXSmTk9as, around 2:20
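To make the "relative timing of the relative timing" point concrete, a tiny made-up example: inter-onset intervals give you the local beat length, and the ratio of successive intervals is what actually captures a gradual slow-down like the ones above (the onset times below are invented, not taken from either track):

    import numpy as np

    # Made-up onset times (seconds) for a phrase that gradually slows down.
    onsets = np.array([0.00, 0.50, 1.02, 1.58, 2.20, 2.90, 3.70])

    ioi = np.diff(onsets)       # relative timing: inter-onset intervals (local beat length)
    bend = ioi[1:] / ioi[:-1]   # relative timing of the relative timing: how much each beat
                                # stretches versus the previous one (>1 = slowing down)

    print(np.round(ioi, 2))     # roughly [0.5, 0.52, 0.56, 0.62, 0.7, 0.8] - beats getting longer
    print(np.round(bend, 2))    # all a bit above 1.0 - a smooth, deliberate ritardando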
For anyone interested, I've transcribed this song [1] using the Replicate link the author provided (Colab throws errors for me), using mode music-piano-v2. It spits out mp3s there instead of midis, so you can hear how it did [2].
[1] https://www.youtube.com/watch?v=h-eEZGun2PM
[2] https://replicate.com/p/qr4lfzsqafc3rbprwmvg2cw5ve
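If anyone wants to script it rather than click through the web page, something along these lines should work with the Replicate Python client - but note the model reference, version hash, and input field names below are placeholders I haven't verified, so check the page linked above for the real ones:

    import replicate  # pip install replicate; needs REPLICATE_API_TOKEN in the environment

    # NB: model ref, version hash, and input field names are placeholders --
    # check the Replicate page linked above for the real ones.
    output = replicate.run(
        "someuser/omnizart:0123456789abcdef",    # hypothetical model reference
        input={
            "audio": open("touch.mp3", "rb"),    # hypothetical input name
            "mode": "music-piano-v2",            # the mode I used above
        },
    )
    print(output)  # this deployment apparently returns rendered mp3s rather than raw midi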
I can't help but feel it is heavily impacted by the ambience of the recording as well. The midi is of course a very rigid and literal interpretation of what the model is hearing as pitches over time, but it lacks the subtlety of realizing a pitch is sustaining because of an ambient effect, or that the attack actually lands a little bit before the beginning of the pitch, etc.
If it could be enhanced to consider such things, I bet you would get much cleaner, more machine-like midis, which are generally preferable.
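For example (just a sketch of what I mean by "cleaner", using pretty_midi, which has nothing to do with this project; the grid size, tempo, and minimum note length are guesses): snap onsets to a grid and drop very short ghost notes that are probably reverb rather than real attacks.

    import pretty_midi

    GRID = 60.0 / 120 / 4   # sixteenth-note grid at an assumed 120 BPM
    MIN_LEN = 0.05          # drop "notes" shorter than 50 ms - likely reverb tails, not real attacks

    pm = pretty_midi.PrettyMIDI("transcription.mid")
    for inst in pm.instruments:
        cleaned = []
        for note in inst.notes:
            if note.end - note.start < MIN_LEN:
                continue                                   # probably ambience bleeding into the pitch estimate
            dur = note.end - note.start
            note.start = round(note.start / GRID) * GRID   # snap the onset to the grid
            note.end = note.start + dur                    # keep the original duration
            cleaned.append(note)
        inst.notes = cleaned
    pm.write("transcription_cleaned.mid")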
Music's a lot more than a collection of notes ... and the timbre of one piano is about as far away from a mixture of reeds and dulcet electronics as you can get. (The Fulero is very nice.)
I'm still looking over this to see its capabilities, but if I'm reading this right, we can turn any mp3/wav into a set of midis, which we can then import into music editing software (like Finale). If this works, this is huge. Congrats to the team.
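One wrinkle: notation programs like Finale usually import MusicXML more gracefully than raw MIDI, so an intermediate conversion step might help. A possible sketch using music21 (not part of this project; the file names are made up):

    from music21 import converter

    # Re-export the transcriber's MIDI as MusicXML, which Finale (and MuseScore,
    # Sibelius, ...) tends to import more cleanly than raw MIDI.
    score = converter.parse("transcription.mid")
    score.write("musicxml", fp="transcription.musicxml")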
I don't have absolute pitch. I once tried to "reverse engineer" a guitar solo using a tuner. Too much work, not a very good result. Hope this finally brings music transcription to less skilled/gifted musicians/hobbyists.
Sounds incredible and I'm curious how well it works. For a quick intro to the state of the art in this space, watch the Melodyne videos on YouTube. In short, and without having tried Omnizart: I would not expect it to give perfect results without manual help. If it could aid transcription in a semi-automatic way, like Melodyne does, that would already be a big victory for an open source alternative.
https://piano-scribe.glitch.me/
I found this link to be more helpful than the GitHub repo for understanding what it does:
https://music-and-culture-technology-lab.github.io/omnizart-...
The Colab notebook is full of warnings and crashes with errors in the "Transcribe" box. Replicate.com does produce something, but the results are garbage.
What am I doing wrong?