cschmidt commented on Google boss says AI investment boom has 'elements of irrationality'   bbc.com/news/articles/cwy... · Posted by u/jillesvangurp
officeplant · a month ago
Currently? Wishing there was an S&P 500 that banned tech stocks.
cschmidt · a month ago
There are equal-weight S&P 500 ETFs, which avoid having a handful of stocks dominate. However, they do have to do a lot more rebalancing to keep the weights in line.
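To make the rebalancing point concrete, here is a rough Python sketch with made-up prices and share counts (not real index data): a cap-weighted portfolio lets a few mega-caps dominate and simply floats with the market, while an equal-weight portfolio targets 1/N per name and has to trade back toward that target as prices drift.

    # Toy illustration with made-up numbers: cap weighting vs. equal weighting.
    prices = {"AAPL": 230.0, "MSFT": 420.0, "XOM": 115.0, "KO": 62.0}
    shares_outstanding = {"AAPL": 15.2e9, "MSFT": 7.4e9, "XOM": 4.4e9, "KO": 4.3e9}

    # Cap weights: proportional to market cap, so the biggest names dominate.
    caps = {t: prices[t] * shares_outstanding[t] for t in prices}
    cap_weights = {t: caps[t] / sum(caps.values()) for t in caps}

    # Equal weights: 1/N per stock, regardless of size.
    n = len(prices)
    equal_weights = {t: 1.0 / n for t in prices}

    # After prices move, the equal-weight portfolio drifts away from 1/N and must
    # be rebalanced back; a cap-weighted portfolio just floats with the market.
    new_prices = {"AAPL": 260.0, "MSFT": 400.0, "XOM": 120.0, "KO": 60.0}
    value = {t: equal_weights[t] * new_prices[t] / prices[t] for t in prices}
    drifted = {t: value[t] / sum(value.values()) for t in prices}
    trades = {t: 1.0 / n - drifted[t] for t in prices}  # buy the laggards, trim the winners

    print({t: round(w, 3) for t, w in cap_weights.items()})
    print({t: round(w, 3) for t, w in drifted.items()})
    print({t: round(w, 3) for t, w in trades.items()})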
cschmidt commented on Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?   twitter.com/karpathy/stat... · Posted by u/JnBrymn
cschmidt · 2 months ago
There is other research that works with pixels of text, such as this recent paper I saw at COLM 2025: https://arxiv.org/abs/2504.02122
cschmidt commented on Eleven Music   elevenlabs.io/blog/eleven... · Posted by u/meetpateltech
RyanOD · 5 months ago
I'm dreading the day I get really into a new "band", only to discover it's entirely AI.
cschmidt · 5 months ago
I worry how often that is happening already on Spotify.
cschmidt commented on Stanford’s Department of Management Science and Engineering   poetsandquants.com/2025/0... · Posted by u/curioustock
TexanFeller · 5 months ago
> Management Science

It’s jarring and galling to see management and science put together in a way that’s suggestive of management being a science. It reeks of stolen valor.

Obligatory Feynman on “sciences”: https://youtu.be/tWr39Q9vBgo?si=SYTZSNA0G-RZDguA

cschmidt · 5 months ago
I think in this context Management Science is an older term that was synonymous with operations research. The flagship journal of INFORMS (the Institute for Operations Research and the Management Sciences) has the same name. The field studies how to optimize things, with lots of statistics and math, and Stanford was at its forefront from George Dantzig onwards. So it’s not trying to make management a “science” in this case.
cschmidt · 5 months ago
I’m not sure about this master’s program, but the undergrad program seems to be proper ORMS.
cschmidt commented on The bitter lesson is coming for tokenization   lucalp.dev/bitter-lesson-... · Posted by u/todsacerdoti
pas · 6 months ago
ah, okay, thanks!

so basically reverse notation has the advantage of keeping magnitude of numbers (digits!) relative to each other constant (or at least anchored to the beginning of the number)

doesn't attention help with this? (or, it does help, but not much? or it falls out of autoregressive methods?)

cschmidt · 6 months ago
Attention does help, which is why an LLM can learn arithmetic even with arbitrary tokenization. But if you put numbers in a standard form, such as right-to-left groups of three digits, you make the problem easier to learn, since every example it sees is in the same format. The issue here is that BLT operates in an autoregressive manner (strictly left to right), which makes it harder to tokenize the digits in a way that is easy for the LLM to learn. Making each digit its own token (Llama style), or flipping the digits, might be the best options.
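As a rough illustration (toy Python, not the actual BLT or Llama tokenizer code), here are the three digit-handling schemes mentioned above applied to the same number: right-to-left groups of three, one token per digit, and reversed digits.

    # Toy sketches of three ways to tokenize the digits of a number.
    # These illustrate the schemes discussed; they are not any model's real tokenizer.

    def groups_of_three_right_to_left(num: str) -> list[str]:
        """Split '1234567' into ['1', '234', '567'], grouping from the right."""
        out = []
        i = len(num)
        while i > 0:
            out.append(num[max(0, i - 3):i])
            i -= 3
        return list(reversed(out))

    def single_digits(num: str) -> list[str]:
        """Llama-style: every digit is its own token."""
        return list(num)

    def reversed_digits(num: str) -> list[str]:
        """Emit digits least-significant first, so place value is known immediately."""
        return list(reversed(num))

    n = "1234567"
    print(groups_of_three_right_to_left(n))  # ['1', '234', '567']
    print(single_digits(n))                  # ['1', '2', '3', '4', '5', '6', '7']
    print(reversed_digits(n))                # ['7', '6', '5', '4', '3', '2', '1']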
cschmidt commented on The bitter lesson is coming for tokenization   lucalp.dev/bitter-lesson-... · Posted by u/todsacerdoti
pas · 6 months ago
... why does reversing all the digits help? could you please explain it? many thanks!
cschmidt · 6 months ago
Arithmetic works right to left, from the least significant digit, while we write numbers left to right. So if you see the digits 123... in an autoregressive manner, you really don't know anything yet, since the number could be 12345 or 1234567. If you flip 12345 to 54321, you know the place value of each digit as you read it: the 5 you encounter first is in the ones place, the 4 is in the tens place, etc. That gives the LLM a better chance of learning arithmetic.
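A minimal sketch of that point in Python: with reversed digits, the place value of each token is fixed by its position as it streams in, whereas a normal-order prefix stays ambiguous until the number ends.

    # With reversed digits, the k-th digit seen is always worth 10**k,
    # so a value can be accumulated as tokens stream in one at a time.
    def value_from_reversed_stream(digits: str) -> int:
        value = 0
        for k, d in enumerate(digits):   # k = position in the stream
            value += int(d) * 10 ** k    # place value known immediately
        return value

    print(value_from_reversed_stream("54321"))  # 12345

    # In normal left-to-right order the same prefix is ambiguous:
    # "123" could be the start of 123, 12345, or 1234567 -- the place value of
    # the leading 1 isn't known until the number is finished.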
cschmidt commented on The bitter lesson is coming for tokenization   lucalp.dev/bitter-lesson-... · Posted by u/todsacerdoti
rryan · 6 months ago
Don't make me tap the sign: There is no such thing as "bytes". There are only encodings. UTF-8 is the encoding most people are using when they talk about modeling "raw bytes" of text. UTF-8 is just a shitty (biased) human-designed tokenizer of the unicode codepoints.
cschmidt · 6 months ago
Virtually all current tokenization schemes do work at the raw byte level, not at the UTF-8 character level. They do this to avoid the out-of-vocabulary (OOV), or unknown token, problem. In older models, if you came across something in the data you couldn't tokenize, you substituted an <UNK> token. But tokenization should be exactly reversible, so now people use subword tokenizers that include all 256 single bytes in the vocabulary. That way you can always represent any text by dropping down to the single-byte level. The other alternative would be to add all UTF-8 code points to the vocabulary, but there are more than 150k of those, and enough are rare that many would be undertrained. You'd have a lot of glitch tokens (https://arxiv.org/abs/2405.05417). That does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.
cschmidt · 6 months ago
And in regard to UTF-8 being a shitty, biased tokenizer, here is a recent paper trying to design a better style of encoding: https://arxiv.org/abs/2505.24689
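A small illustrative sketch of the byte-fallback idea in Python, using a hypothetical two-entry subword vocabulary and greedy longest-match in place of a real trained tokenizer: known subwords get their own ids, everything else falls back to raw UTF-8 bytes, so any string is representable and decoding is exactly reversible.

    # Illustrative byte-fallback tokenizer: a tiny hypothetical subword vocab plus
    # all 256 single-byte tokens, so any text can be encoded and exactly decoded.

    SUBWORDS = {"hello": 256, " world": 257}          # hypothetical learned subwords
    ID_TO_SUBWORD = {v: k for k, v in SUBWORDS.items()}
    # token ids 0..255 are reserved for raw bytes

    def encode(text: str) -> list[int]:
        data = text.encode("utf-8")
        ids, i = [], 0
        while i < len(data):
            # greedy longest match against the subword vocab (compared as UTF-8 bytes)
            for j in range(len(data), i, -1):
                piece = data[i:j].decode("utf-8", errors="ignore")
                if piece in SUBWORDS and piece.encode("utf-8") == data[i:j]:
                    ids.append(SUBWORDS[piece])
                    i = j
                    break
            else:
                ids.append(data[i])   # byte fallback: emit the raw byte's own id (0..255)
                i += 1
        return ids

    def decode(ids: list[int]) -> str:
        out = bytearray()
        for t in ids:
            if t < 256:
                out.append(t)                                 # raw byte
            else:
                out.extend(ID_TO_SUBWORD[t].encode("utf-8"))  # known subword
        # a model free to emit arbitrary byte ids isn't guaranteed to produce valid UTF-8
        return out.decode("utf-8", errors="replace")

    tokens = encode("hello worlds, José ☃")
    print(tokens)
    print(decode(tokens) == "hello worlds, José ☃")           # True: exactly reversible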

u/cschmidt

Karma: 2900 · Cake day: November 13, 2007
About
Craig Schmidt, email is firstname@firstnamelastname.com