Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.
But hopefully we can move towards that - standardised taxation (especially VAT and corporation tax, which would help massively here), the abolition of notaries, standardised requirements for document certification, and an EU-wide digital ID so there's no need to fly in and sign in person.
I did this as a subsidiary of a US company and literally had to email and call people every few days to move the process along. Mostly it was the banks, who somehow expected us to be a multinational company and wanted to charge an arm and a leg just to let us open an account; most banks outright refused us.
When the notary finally filed the paperwork with the court, the court replied a few weeks later with requests for clarification, for which we had to go AGAIN to the notary for the whole song and dance of them chanting at us in German at 1,000 words per minute.
Everything took painfully long and delayed investment for a while. People have absolutely no idea how painful it is merely to have the incorporated entity available. Then it takes a few more weeks to get your tax ID - only at that point can you start employing people, accepting payments, etc.
It is exactly what happens. You are confusing the task (classification vs. generation) with the learning paradigm (zero-shot).
In the voice cloning context, the class is the speaker's voice (not observed during training), samples of which are generated by the machine learning model.
The definition applies 1:1. During inference, it is predicting the conditional probability distribution of audio samples that belong to that unseen class. It is "predict[ing] the class that they belong to," which very same class was "not observed during training."
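To make the framing concrete, the setup being argued here - generation conditioned on an embedding of a speaker the model never saw during training - can be sketched with a deliberately toy example. Everything below (the averaging "encoder", the outer-product "decoder", all names) is an illustrative stand-in, not any real voice-cloning system:

```python
import numpy as np

def speaker_embedding(reference_audio: np.ndarray) -> np.ndarray:
    """Toy speaker encoder: summarise reference audio into a fixed-size
    vector. A real system would use a trained encoder network; here we
    just average per-frame values as a stand-in."""
    frames = reference_audio.reshape(-1, 160)  # 160-sample frames
    return frames.mean(axis=0)

def generate(text_features: np.ndarray, spk: np.ndarray) -> np.ndarray:
    """Toy 'decoder': output depends on BOTH the content features and the
    speaker embedding, mimicking p(audio | text, speaker). The unseen
    speaker enters only through the conditioning vector `spk`."""
    return np.outer(text_features, spk).flatten()

rng = np.random.default_rng(0)
# Reference clip from a speaker that was NOT in the training data:
unseen_speaker_ref = rng.normal(size=1600)
emb = speaker_embedding(unseen_speaker_ref)   # the "unseen class"
audio = generate(rng.normal(size=4), emb)     # samples of that class
```

The point of the sketch is only the structure: the "class" (speaker) is never enumerated at training time, and inference conditions on a description of it (the embedding), which is exactly the zero-shot setup being debated.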
You're getting hung up on the semantics.
Actually, the general definition fits this context perfectly. In machine learning terms, a specific 'speaker' is simply a 'class.' Therefore, a model generating audio for a speaker it never saw during training is the exact definition of the Zero-Shot Learning problem setup: "a learner observes samples from classes which were not observed during training," as I quoted.
Your explanation just rephrases the very definition you dismissed.
> a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.
That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.