You are talking about latent space during inference, not weight space during training, and you are talking about interleaving tokens with random Gaussian tokens, not constraining values to lie on a manifold within a larger space. Whether or not the thing you are describing is meaningful or useful, it is basically unrelated to the original article, and you are not using the term "modular manifold" to refer to the same thing.
We're already working on pushing this construction deeper into the model, both the architecture and the training. Currently that's for fine-tuning, with the eventual goal of full architecture shrinkage / pruning and raw training rather than just fine-tuning, etc.
And it was just great to see someone else using modular manifolds, even if they're using them at the training stage rather than at inference. They're exploiting the modular form at training time; we're doing it at inference. Cool to see.
Suppose we use 3 codeword lanes per codeword, which is our default. Each lane of tokens is based on some prime p, so collectively the lanes form a CRT-driven codeword (Chinese Remainder Theorem). This is discretely equivalent to labeling every k content tokens with a single globally unique indexing grammar.
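Rough sketch of that CRT indexing in Python, purely for illustration; the lane primes, helper names, and block size here are hypothetical, not the ones we actually use:

    # Minimal sketch of the CRT-indexed codeword idea (hypothetical lane primes).
    from math import prod

    PRIMES = (251, 257, 263)  # 3 assumed lane primes, pairwise coprime by construction

    def codeword(block_index):
        # 3-lane codeword for the block_index-th group of k content tokens:
        # one residue per lane, i.e. (i mod p1, i mod p2, i mod p3).
        return tuple(block_index % p for p in PRIMES)

    def block_index_from_codeword(cw):
        # Invert the codeword back to a globally unique block index via CRT;
        # unique as long as block_index < p1 * p2 * p3 (~17M blocks here).
        N = prod(PRIMES)
        x = 0
        for r, p in zip(cw, PRIMES):
            Np = N // p
            x += r * Np * pow(Np, -1, p)  # pow(..., -1, p) = modular inverse (Python 3.8+)
        return x % N

    assert block_index_from_codeword(codeword(12345)) == 12345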
That interleaving also corresponds to a triple of adjacent orthogonal embeddings, since those tokens still retain random Gaussian embeddings. The net effect is that we similarly slice the latent space into a spaced chain of modular manifolds every k content tokens.
We also refer to that interleaving as Stiefel frames, for reasons similar to those in the post. We began work this spring or so to inject that net construction inside the model, with early results in a similar direction to what the post describes. That's another way of saying this sort of approach lets us make that chained atlas of modular manifolds as tight as possible within the dimensional limits of the embedding, floating-point precision, etc.
We somewhat tongue-in-cheek refer to this as the "retokenization group" at the prompt level, by analogy with the renormalization group / tensor nets / etc. "Relayering group", or perhaps "reconnection group", is the same net intuition at the architecture level.
Specifically, we linearize the emergent KVQ operations of an arbitrary prompt in any arbitrary model by way of an interleaved error-correcting code (ECC).
ECC tokens are out-of-band tokens, e.g., from Unicode's Private Use Area (PUA), interleaved with the raw context tokens. This construction induces an in-context associative memory.
Any sort of interleaved labeling basis, e.g., "A1, quick brown fox, A2, jumped lazy dog", induces a similar effect, chaining recall & reasoning more reliably (quick sketch below).
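Minimal sketch of that interleaving with plain A1/A2 labels; a PUA codeword or the CRT triple above slots into make_label the same way, and the function names and k=4 are just illustrative:

    def interleave_labels(words, k=4, make_label=lambda i: f"A{i}"):
        # Insert one label/codeword marker before every block of k content words.
        out = []
        for block, start in enumerate(range(0, len(words), k)):
            out.append(make_label(block + 1))
            out.extend(words[start:start + k])
        return " ".join(out)

    print(interleave_labels("the quick brown fox jumped over the lazy dog".split(), k=4))
    # -> "A1 the quick brown fox A2 jumped over the lazy A3 dog"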
This trick works because PUA tokens are generally untrained, hence their initial embeddings are still random Gaussian w.h.p. Similar effects can be achieved by simply using token combos unlikely to exist in training data, and these are often more effective in practice, since out-of-band characters like PUA codepoints, emojis, or Mandarin characters are often 2, 3, or 4 tokens each after tokenization, vs. codeword combos like zy-qu-qwerty inserted every k content tokens, where k can be variable.
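If you want to eyeball that tradeoff yourself, a quick count with any BPE tokenizer does it; tiktoken / cl100k_base below is just an assumption, swap in whatever tokenizer your model uses:

    # Rough sketch: compare token counts for a PUA marker vs. a rare ASCII codeword.
    import tiktoken  # assumed available; any tokenizer with an encode() method works the same way

    enc = tiktoken.get_encoding("cl100k_base")

    pua_marker = "\ue000\ue001\ue002"  # 3 Private Use Area codepoints
    rare_combo = "zy-qu-qwerty"        # rare-but-plain-ASCII codeword string

    print(len(enc.encode(pua_marker)))  # PUA chars often split into several byte-level tokens each
    print(len(enc.encode(rare_combo)))  # rare ASCII combos can come out more compact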
Building the attention architecture using modular manifolds in white / gray-box models, as this new work shows, vs. prompt-based black-box injection is a natural next step, and so we can at least anecdotally validate what they're building ahead of the next paper or two.
Which is all to say, absolutely great to see others building in this way!