md2rp · 2 years ago
A Visual Guide to Vision Transformers This is a visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks. Vision Transformers apply the transformer architecture, originally designed for natural language processing (NLP), to image data. This guide walks you through the key components of Vision Transformers in a scroll-story format, using visualizations and simple explanations to help you understand how these models work and how data flows through the model.
bArray · 2 years ago
Nice! A small piece of feedback: I would have the dimensions mentioned in the text also annotated on the diagram. It wasn't exactly clear how the input data was flattened for example.
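For what it's worth, the flattening step usually works something like this (a toy sketch with the dimensions from the original ViT paper; the guide's exact sizes may differ):

```python
import numpy as np

# Hypothetical dimensions: a 224x224 RGB image split into 16x16 patches.
H, W, C = 224, 224, 3
P = 16                      # patch size
image = np.random.rand(H, W, C)

# Split into non-overlapping PxP patches...
patches = image.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
# ...then flatten each patch into a single vector of length P*P*C.
flat = patches.reshape(-1, P * P * C)

print(flat.shape)  # (196, 768): 14*14 patches, each a 768-dim vector
```

Each of those 196 vectors then gets linearly projected into the model's embedding dimension before being fed to the transformer.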
byteknight · 2 years ago
Would also add, as a 100% math idiot: linear transformations, and how the model performs them, are not explained.

Entirely plausible this is intended for someone more "mathematical" than myself, but I appreciate the work regardless.

md2rp · 2 years ago
Thanks for the feedback! Will add it in the revision!
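In the meantime, a linear transformation is just a matrix multiply: each output value is a weighted sum of the input values, with learned weights. A toy sketch (the numbers here are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # input vector (3 features)
W = np.array([[1.0, 0.0, 0.0],       # hypothetical 2x3 weight matrix
              [0.0, 1.0, 1.0]])
y = W @ x                            # output vector (2 features)
print(y)  # [1. 5.]
```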
challenger-derp · 2 years ago
Very nice. I wish I could do this sort of scroll story in my digital notes. Is this done with a javascript library?
md2rp · 2 years ago
Yes, this was done with a combination of GSAP ScrollTrigger https://gsap.com/docs/v3/Plugins/ScrollTrigger/ and https://d3js.org/
TuringTest · 2 years ago
That kind of scroll is OK-ish for a background parallax effect, or maybe some pretty fade-in/out effects while elements scroll into view (without changing their relative position in the page).

When it interferes with the main functionality of the page, namely reading the content, it breaks accessibility, distracts from understanding a difficult topic, makes the content brittle against changes in the platform (different browsers or future standard updates), and, as others pointed out, makes it difficult or impossible to use alternative presentations.

With most comments here remarking on the presentation rather than the content, I think it's clear that it detracts from the experience more than it helps.

causal · 2 years ago
I like this, but think there is some crucial motivation missing in steps 10.1-10.3 regarding what query/key weights are and why they're needed.
abhgh · 2 years ago
They are like "continuous" databases. See slides 4-5 here [1] - this is from a talk I had given a while ago.

[1] https://drive.google.com/file/d/12uHo9QIfS-jBpVTs3lmQ3BEpxhD...

vikiomega9 · 2 years ago
this post made sense to me https://teltam.github.io/posts/soft-dictionary-keys.html

It helps to think of QKV as a form of lookup.
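Roughly, in code (a toy single-query sketch of the "soft lookup" idea, not the actual ViT implementation):

```python
import numpy as np

# Attention as a "soft" dictionary lookup: instead of matching one key
# exactly, the query scores every key and blends all values by score.
d = 4                                   # toy embedding size
keys = np.random.rand(5, d)             # 5 entries in the "database"
values = np.random.rand(5, d)
query = np.random.rand(d)

scores = keys @ query / np.sqrt(d)      # similarity of query to each key
weights = np.exp(scores) / np.exp(scores).sum()  # softmax, sums to 1
out = weights @ values                  # weighted mix of all the values

print(weights.sum())  # ~1.0
```

In the real model the queries, keys, and values are all produced from the same input tokens by separate learned linear projections, which is where the query/key weight matrices come in.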

ThouYS · 2 years ago
yes, same issue in all transformer tutorials
lordswork · 2 years ago
I suspect this is because most people (including people writing these tutorials) don't have a strong grasp on this piece as well.
causal · 2 years ago
The 3b1b video was the first to make it click for me
lyapunova · 2 years ago
To be honest, I actually really like the visual delivery here. It's especially helpful for understanding what's going on with computer vision problems. Please make more!
bilsbie · 2 years ago
This is great. Is it better to combine this with a language model so it can also apply knowledge about relationships between items?
SpaceManNabs · 2 years ago
Lucas Beyer has a lot of references and material as well that I recommend.
nothrowaways · 2 years ago
Neat
Jacob__ · 2 years ago
I like it a lot.