Hi everyone! Creator of Transformers.js here :) ...
Thanks so much to everyone for sharing! It's awesome to see the positive feedback from the community. As you'll see from the demo, everything runs inside the browser!
As of 2023/03/16, the library supports BERT, ALBERT, DistilBERT, T5, T5v1.1, FLAN-T5, GPT2, BART, CodeGen, Whisper, CLIP, Vision Transformer, and VisionEncoderDecoder models, for a variety of tasks including: masked language modelling, text classification, text-to-text generation, translation, summarization, question answering, text generation, automatic speech recognition, image classification, zero-shot image classification, and image-to-text. Of course, we plan to add many more models and tasks in the near future!
Try out some of the other models/tasks from the "Task" dropdown (like the code-completion or speech-to-text demos).
---
To respond to some comments about poor translation/generation quality, many of the models are actually quite old (e.g., T5 is from 2020)... and if you run the same prompt through the PyTorch version of the model, you will get similar outputs. The purpose of the library/project is to bring these models to the browser; we didn't train the models, so, poor quality can (mostly) be blamed on the original model.
Also, be sure to play around with the generation parameters... as with many LLMs, generation parameters matter a lot.
Yes, there are some workarounds you can do to get it working in non-browser environments. I do aim to get a permanent solution, which will ideally work out-of-the-box for both browser and node/deno environments.
I really liked the suggestion that if it takes off, the web should consider trying to expose something like the OpenXLA intermediate model, which powers the new PyTorch 2.0, TensorFlow, Jax, and a bunch of other top tier ML frameworks.
It already is very well optimized for a ton of hardware (cpus, gpus, ml chips). The Intermediate Representation might already be a web-safe-ish model, effectively self-sandboxing, which could make it safe to expose.
Making each webapp target & optimize ML for every possible device target sounds terrible.
The purpose of MLIR is that most of the optimization can be done at lower levels. Instead of everyone figuring out & deciding on their own how best to target & optimize for js, wasm, webgl, and/or webgpu, you just use the industry standard intermediate representation & let the browser figure out the tradeoffs. If there is inboard hardware, neural cores, they might just work!
Good to see WebML has OpenXLA on their radar... but I'm also a bit afraid, expecting some half-assed excuses for why of course we're going to make some brand new other thing instead. The web & almost everyone else has such a bad NIH problem. WASI & web file APIs being totally different is one example, where there's just no common cause, even though it'd make all the difference. And with ML, the cost of having your own tech versus being able to re-use the work everyone else puts in makes it feel like a near suicidal decision to build an API that will never be good, never perform anywhere near where it could.
I don't think a high level representation is necessary for relatively straightforward FMA extensions (either outer products in the case of Apple AMX or matrix products in the case of CUDA/Intel AMX). WebGPU + tensor core support and WASM + AMX support would be simpler to implement, likely more future proof and wouldn't require maintaining a massive layer of abstraction.
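To make the outer-product vs. matrix-product distinction concrete, here is a minimal pure-Python sketch. It only illustrates that both loop orders reduce to the same fused multiply-adds; the function names are made up for illustration and it says nothing about how any particular hardware actually exposes these operations.

```python
def matmul_inner(a, b):
    """Matrix product computed entry by entry as row-by-column dot products."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_outer(a, b):
    """Same product, accumulated as a sum of k rank-1 outer products."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for p in range(k):
        # Outer product of column p of `a` with row p of `b`,
        # accumulated into the whole output tile.
        for i in range(n):
            for j in range(m):
                c[i][j] += a[i][p] * b[p][j]  # one multiply-add per element
    return c


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_inner(A, B))
print(matmul_outer(A, B))
```

Either way the arithmetic is the same set of multiply-adds; what differs is which slice of the output gets updated per step.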
The issue is, much of the performance of PyTorch, JAX, et al. comes from running a JIT that is tuned to the underlying HW and that comes with support for high-level intrinsic operations that were either hand-tuned or have extra hardware support, especially ops dealing with parallelizing computation across multiple cores.
You'd probably end up representing these as external library function calls in WASM, but then the WASM JIT would have to be taught that these are magic functions that are potentially treated specially, so at that point you're just embedding HLO ops as library functions, and then embedding an HLO translator into the WASM runtime. I'm not sure that's any better.
By analogy, would it be better to eliminate fragment and vertex shaders and just use WASM for sending shaders to the GPU, or is the domain-specific language and its constraints beneficial to the GPU drivers?
Check out https://mlc.ai/web-stable-diffusion, which builds on top of Apache TVM and brings in models from PyTorch 2.0, ONNX, and other sources into the ML compilation flow.
Hah, ChatGPT has successfully poisoned the well. Well done sama.
This lib is great work, a JS interface for running HF models. The comments about how "bad" the outputs are as surprising to me as they are alarming.
OAI has now set the zero-effort bar so high that even HNers (who click on .js headlines) fall into the gap they've left. That sucking sound you hear is market share being hoovered up.
No they're not mate, it's just you. I've read the guidelines (thanks for helpfully linking them). I see this on HN, people infer offense and cite the book rather than engage.
By not highlighting what you found "snarky" your response is a definitional "shallow dismissal". I see you just "picked the most provocative thing to complain about". Not a lot of being "kind" either.
So you know what would also be great? If you held yourself to the standards you're keen to police around here.
This is the third time that a candidate has been elected. In this article I will use the names of the candidates and the candidates.
In 2016 the following is a list of the current and former U.S. presidential candidates:
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush/Bush/Bush/Bush (with Republican presidential candidates)
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush/Bush/Bush/Bush/Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/B-1919191929
That's pretty neat. I'm personally wondering in how far ML compute will be done on consumer devices, rather than on servers. We're currently seeing a lot of models that are so large that it doesn't seem feasible to run them locally. But I think there is reason to believe that these models carry a lot of redundancy. Redundancy that could lead to an order of magnitude less memory/compute needed.
The trick here will be using large models as data generators to distill some sub task into a web computable model. (I’ve done it a few times for vision rather than text and it’s amazing how potent it is.)
Right! In a lot of cases, just having the synthetic responses plus human filtering for your sub task is enough for less essential tasks. I’m thinking of “procedural” content useful for less sensitive things like games.
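A toy sketch of the "large model as data generator" idea, in case it helps. Everything here is a stand-in: `teacher` is a hypothetical placeholder for an expensive model, the task is a made-up 2D classification problem, and the student is a tiny logistic regression. It only shows the shape of the workflow (label synthetic inputs with the teacher, fit a small student, check agreement), not any real distillation setup.

```python
import math
import random


def teacher(x):
    """Hypothetical stand-in for a large, expensive model."""
    return 1 if x[0] + x[1] > 1.0 else 0


def make_dataset(n, rng):
    """Generate synthetic inputs and let the teacher label them."""
    xs = [(rng.random(), rng.random()) for _ in range(n)]
    return [(x, teacher(x)) for x in xs]


def train_student(data, epochs=50, lr=0.5):
    """Fit a small logistic-regression student with plain SGD."""
    w0 = w1 = b = 0.0
    for _ in range(epochs):
        for (x0, x1), y in data:
            p = 1.0 / (1.0 + math.exp(-(w0 * x0 + w1 * x1 + b)))
            g = p - y  # gradient of the log loss w.r.t. the logit
            w0 -= lr * g * x0
            w1 -= lr * g * x1
            b -= lr * g
    return w0, w1, b


def student(params, x):
    """Cheap model that could plausibly run on-device."""
    w0, w1, b = params
    return 1 if w0 * x[0] + w1 * x[1] + b > 0 else 0


rng = random.Random(0)
params = train_student(make_dataset(500, rng))
held_out = make_dataset(200, rng)
agreement = sum(student(params, x) == y for x, y in held_out) / len(held_out)
print(f"student/teacher agreement: {agreement:.2f}")
```

In practice the human filtering step mentioned above would sit between `make_dataset` and `train_student`.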
It's possible to run a full GPT-3 style language model on any device with 4GB of RAM now, so running models on consumer devices is getting more and more feasible by the day. https://simonwillison.net/2023/Mar/11/llama/
> I'm personally wondering in how far ML compute will be done on consumer devices, rather than on servers.
Running ML on the device has been one of Apple's value propositions for a long time. They are currently silent on everything that's unfolding, but I expect them to at least mention something at WWDC (and to try to run that something on the device).
If I understand correctly, there was an all-company invited annual AI day which was silent on recent developments.
But then ~two weeks later there was what seemed like an on-background / press leak about the XDG group that specifically mentioned AI as a current discipline. (Gurman / Bloomberg)
It seems to me that the release of Core ML Stable Diffusion (mentioned ITT) is something of a comment in and of itself. At least in the read-between-the-lines / hiding-in-plain-sight style of Apple.
The company is unveiling a new and presumably next major computing platform at a quality level only they could possibly deliver.
So the relative quiet / lack of comment may be in deference to the gravity of that work.
That said, these changes are too big to ignore; we should at least hear language at WWDC that acknowledges the recent major developments in AI, and some idea of how Apple is thinking about them.
I feel like that's been the pretty consistent lesson in computing over the past decades. New technologies start out as expensive, exotic, and specialized and become cheap and commonplace over time. The more business value the technology provides, the faster it will happen as well I think.
The models will certainly get better (faster to train, less data needed, smaller param counts, etc.) too, though, just like compilers and software have evolved hugely alongside hardware.
they'll meet in the middle. that's what's already happening, and there will probably be co-processors added into consumer devices that excel specifically at the kind of processing that these models need.
Hi! Creator of the library here. If you change the generation parameters to be greedy (i.e., sample=no and top_k=0), you will get "Bonjour, comment êtes-vous?"
The top_k and sample generation parameters are just there to show that they are supported :), and are sometimes useful for the other tasks (like text generation w/gpt2, to get more variety)
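For anyone wondering what those two knobs do, here is a rough pure-Python sketch of greedy vs. top-k sampling over a toy logit vector. The function and argument names are hypothetical (chosen to mirror the demo's `sample` and `top_k` settings) and this is not the library's actual code, just the general technique.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, sample=False, top_k=0, rng=None):
    """Return the index of the next token.

    sample=False: greedy decoding, always the highest-scoring token.
    sample=True:  draw from the softmax distribution, restricted to the
                  top_k highest-scoring tokens when top_k > 0.
    """
    if not sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        candidates = candidates[:top_k]
    probs = softmax([logits[i] for i in candidates])
    return rng.choices(candidates, weights=probs, k=1)[0]


toy_logits = [2.0, 0.5, 1.5, -1.0]  # made-up scores for a 4-token vocabulary
greedy = pick_next_token(toy_logits)
sampled = [
    pick_next_token(toy_logits, sample=True, top_k=2, rng=random.Random(seed))
    for seed in range(20)
]
print("greedy:", greedy)
print("top_k=2 samples:", sampled)
```

Greedy decoding is deterministic, which is why it gives the same translation every time; sampling trades that for variety, which tends to matter more for open-ended generation than for translation.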
I understand there's reasons the translation is incorrect, but if the very first example you're showing on the page is wrong, most people (who are fluent enough) will just roll their eyes and leave it at that. Maybe showcase an example that works?
I uploaded the Windows XP desktop wallpaper into the image classifier. Just the raw image file. It gave me the labels "monitor", "computer screen", "desktop". "Field", "sky", "grass", that kind of thing, were nowhere to be found.
I know this is more of a comment on the state of AI models than Transformers.js. It's probably not even representative of state-of-the-art image classifier models. Just a fun example of how these things learn.
Haha very interesting! I assume it's because that type of image is only found on computer screens, so, the model thinks the grass "contributes to its idea of what a computer screen is".
... and of course, the library only ports those models to the browser; if you train a better model, you can always convert it to the ONNX format, then use it with the library.
Even the default example of "Hello, how are you?" from English to French yields an awfully wrong result ("Hello, what is your experience?")...
I wouldn't trust them for anything else.
The other models are not better, here's the text generation output from "I enjoy walking my cute dog":
> I enjoy walking with my cute dog, I have been going to the park, and I just happened to like walking with my cute dog. I like to play with the dog.
My dog (Hannah) has been on my way home since December and when she came home she told me to go out and stay back. I told her that she had been too busy. I had to start working and had to go outside and go see myself again.
It could just be an algorithm that generates random sentences, and it wouldn't make any less sense.
I think it's worth pointing out that the library just gets the models working in the browser. The correctness of the translation is dependent on the model itself.
If you run the model using HuggingFace's python library, you will also get the same results (I've tested it, since, I wasn't too happy with those default translations and generations).
With regards to the text generation output, this is also similar to what you will get from the PyTorch model. Check out this blog post from HuggingFace themselves which discusses this: https://huggingface.co/blog/how-to-generate.
> Even the default example of "Hello, how are you?" from English to French yields an awfully wrong result ("Hello, what is your experience?")...
Really? For me that gives "Bonjour, comment êtes-vous?" with the default settings.
> text generation output
Yeah, text generation is really something that requires a big model. The Llama 7B param model quantized to 4bit is 13G and that is the smallest model I'd actually attempt to use for unconstrained text generation.
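As a back-of-the-envelope check on model sizes, here is a small sketch of the weight-memory arithmetic. The 7 billion parameter count is a round-number assumption, and this ignores activations, the KV cache, and file-format overhead, so treat the numbers as rough.

```python
GIB = 2 ** 30
N_PARAMS = 7_000_000_000  # assumed round figure for a "7B" model


def weight_bytes(n_params, bits_per_param):
    """Bytes needed to store the weights alone at a given precision."""
    return n_params * bits_per_param / 8


for bits in (32, 16, 8, 4):
    size_gib = weight_bytes(N_PARAMS, bits) / GIB
    print(f"{bits:>2}-bit weights: {size_gib:.1f} GiB")
```

Under these assumptions the weights come to roughly 13 GiB at 16 bits and a bit over 3 GiB at 4 bits, which suggests the 13G figure corresponds to the 16-bit weights and is in line with the "4GB of RAM" claim made elsewhere in the thread for the 4-bit version.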
« Bonjour, comment êtes-vous? » barely translates to « Hi, how are you feeling today? » or, depending on the context, to something like « Hi, please describe yourself » to a native French speaker.
---
If you want to keep up-to-date with the development, check us out on twitter: https://twitter.com/xenovacom :)
Some other users also reported the issue (which stems from a bug in onnxruntime-web), and we were able to get it working in these cases:
1. https://github.com/xenova/transformers.js/issues/4
2. https://github.com/xenova/transformers.js/issues/19
https://news.ycombinator.com/item?id=35078410
Edit: There seems to be some progress on a WASM backend for OpenXLA here: https://github.com/openxla/iree/issues/8327
and a proposed WebML working group at W3C: https://www.w3.org/2023/03/proposed-webmachinelearning-chart... that references OpenXLA
It would be great if we all try to keep the tone respectful and avoid snarkiness to maintain a constructive discussion
https://news.ycombinator.com/newsguidelines.html
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 4142 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 1 2 3 4 5 6 7 8 9 10 11 12 13 15 15 16 16 18 19 20 21 22 23 24 25 25 26 27 28 29 30 31 32 32 33 34 35 36 37 38 39 41 42 44 45 46 47 48 50 51 53 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 85 86 87 88 89 90 92 93 94 95 97 98 99 100
Or perhaps hardware will catch up before.
There already are, e.g., Google Edge TPU, Apple Neural Engine, etc.
Which the model likes to translate to,
Which no French person would ever say to you, because that's a lot of words and doesn't really sound very... French. And of course, are you talking to someone familiar... so on and so forth.
The idiomatic translation here would be "Bonjour, comment allez-vous?"