bnjmn commented on Benjie's Humanoid Olympic Games   generalrobots.substack.co... · Posted by u/robobenjie
trhway · 2 months ago
> a baby some robot neglects to prevent from falling off the changing table

That is only a problem if we limit ourselves to two-handed robots. A six-handed robot can easily assign two or three hands to holding the baby securely. Humanoid robots are handicapped by their similarity to humans, which is really an artificial constraint. After all, we aren't building airplanes using birds as the blueprint.

On a similar note - while not about a baby, I was just rewatching an early Big Bang Theory season with the episode where Howard "falls right into the mechanical hand"

bnjmn · 2 months ago
> Humanoid robots are handicapped by their similarity to humans which is really an artificial constraint.

YES, and I wish people would stop pretending we've unlocked some new generality by promoting generic humanoid robots over task-specific ones.

You can probably Rube-Goldberg your way to a diaper-changing robotic enclosure with a 3D baby bidet that uses many low-force robot arms to subdue (most) babies, but a humanoid robot is a very poor substitute for a human here.

Plus, a human can take personal responsibility for the baby's safety, which is not something a robot can ever do, unless we somehow make the robot fear for its life/freedom/employment the way the overarching social/legal system makes humans fear for theirs when they sign contracts or accept highly accountable roles.

bnjmn commented on Benjie's Humanoid Olympic Games   generalrobots.substack.co... · Posted by u/robobenjie
bnjmn · 2 months ago
Here's a use case that seems more science fictional to me (as the parent of a 2yo) than warp drive: a robot that can gently restrain an uncooperative human baby while changing its diaper, with everything that entails: identifying and eliminating all traces of waste from all crevices, applying diaper cream as necessary, unfolding and positioning the new diaper correctly and quickly, always using enough but never too much force... not to mention the nightmare of providing any guarantees about safety at mass-market scale. Even one maimed baby, or even just a baby some robot neglects to prevent from falling off the changing table, is game over for that line of robots.

Is there any research program that could claim to tackle this? It's so far beyond folding laundry and doing dishes, which are already quite difficult.

I wouldn't bet my life on this tech _never_ materializing, but I would mistrust anyone who claimed it was feasible with today's tech. It calls for an entirely different kind of robotic perception, feedback, and control.

bnjmn commented on Claude can sometimes prove it   galois.com/articles/claud... · Posted by u/lairv
bnjmn · 3 months ago
Among other general advice, my CLAUDE.md insists that Claude prove to me that each unit of change works as expected; usually I'm just hoping for it to write tests and convince me they're actually running and passing. A proof assistant seems like overkill here, and yet Claude often struggles to assemble even these informal proofs. I can see the benefit of a more formal proof language, along with a source of programmatic feedback, compared to open-ended verbal proof.

"Overkill" of course is an editorial word, and if you know about https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspon... then you know many statically typed programming languages are essentially proof assistants, where the proof goal is producing well-typed programs. LLMs are already quite good at interacting with these programming language proof assistants, as you can see any time a competent LLM interacts with the Rust borrow checker, for example.

bnjmn commented on Palma 2   shop.boox.com/products/pa... · Posted by u/tosh
bnjmn · 9 months ago
They say the e-ink display has "Unmatched Speed, Never Seen on ePaper", so it would be nice to know the actual refresh rate.

This is not an endorsement, but https://daylightcomputer.com/ claims 60fps, so that's the bar to meet in my opinion. Caveat: the daylight display is not true e-ink, but an e-ink-like LCD, IIUC.

bnjmn commented on Tokenisation Is NP-Complete   arxiv.org/abs/2412.15210... · Posted by u/belter
mcyc · a year ago
NB: Can't edit my original reply.

Sorry, actually I misread part of your comment in relation to the paper and confused δ with another parameter, K.

To clarify, δ is the number of tokens in the tokenized corpus and K is the size of the vocabulary.

So, if you are asking why they would limit _K_, then my answer still applies (after swapping δ for K). But if you still mean "why do they pick some arbitrary δ as the limit of the size of the tokenized corpus", then I think the answer is just "because that makes it a decision problem".

bnjmn · a year ago
Thanks for these detailed replies! Now I really want to read your paper.
bnjmn commented on Tokenisation Is NP-Complete   arxiv.org/abs/2412.15210... · Posted by u/belter
immibis · a year ago
NP is a category of decision problems - problems with boolean answers. Saying that it's NP-complete to find the tokeniser that produces the fewest symbols is meaningless. You have to convert it to the form "is there a tokenizer that produces fewer than N symbols?" before it even makes sense to ask whether it's NP-complete.
bnjmn · a year ago
I fully agree with your final statement, but needing to constrain the problem in an artificial way to prove it's NP-complete doesn't mean the constraint was justified or realistic, because then you've only proved that the constrained version of the decision problem is NP-hard.

There might be plenty of perfectly "good" tokenizers (whatever that ends up meaning) that can be found or generated without formulating their design as an NP-complete decision problem. Claiming "tokenization is NP-complete" (paper title) in general seems like an overstatement.

bnjmn commented on Tokenisation Is NP-Complete   arxiv.org/abs/2412.15210... · Posted by u/belter
bnjmn · a year ago
> We still do not know, for instance, what makes a good tokeniser (Gowda and May, 2020; Cognetta et al., 2024): which characteristics should its produced subwords `s` have to be a good starting point for language modelling? If we knew this, then we could define an objective function which we could evaluate tokenisers with.

I don't see how the authors get past this true general statement from the first paragraph of the introduction. Finding a good tokenizer is not just NP-hard; we have no idea how hard it might be because we don't have theoretical agreement on what "good" means.

In order to have something to prove, the authors decide (somewhat arbitrarily):

> Specifically, we focus on finding tokenisers that maximise the compression of a text. Given this objective, we then define the tokenisation problem as the task of finding a tokeniser which compresses a dataset to at most δ symbols.
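
To make that definition concrete, here is a minimal sketch (my own construction, not the paper's; the vocabulary and δ below are made-up illustrative values): a greedy longest-match tokenizer, with "compresses the dataset to at most δ symbols" phrased as a yes/no check.

  // Hedged sketch: greedy longest-match tokenisation, used to phrase
  // compression as a decision problem ("at most delta symbols?").
  fn token_count(text: &str, vocab: &[&str]) -> usize {
      let mut rest = text;
      let mut count = 0;
      while !rest.is_empty() {
          // Take the longest vocabulary entry that prefixes the remainder,
          // falling back to a single character (the base alphabet).
          let step = vocab
              .iter()
              .filter(|tok| rest.starts_with(**tok))
              .map(|tok| tok.len())
              .max()
              .unwrap_or_else(|| rest.chars().next().unwrap().len_utf8());
          rest = &rest[step..];
          count += 1;
      }
      count
  }

  fn main() {
      let vocab = ["ban", "ana", "na"]; // K = 3 learned subwords (illustrative)
      let delta = 4; // made-up budget on the tokenised length
      let n = token_count("bananana", &vocab);
      println!("{n} tokens; within budget: {}", n <= delta);
  }

Running the tokenizer is of course the easy part; the hardness the paper studies is in choosing the vocabulary so the budget is met.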

Is a tokenizer that maximizes the compression of text (e.g. by identifying longer tokens that tend to be used whole) necessarily a better tokenizer, in terms of overall model performance? Compression might be a useful property for an objective function to consider... but then again maybe not, if it makes the problem NP-hard.

I'm also not sure how realistic the limitation to "at most δ symbols" is. I mean, that limit is undeniably useful for making the proof of NP-completeness go through, because it's a similar mechanism to the minimum number of satisfied clauses in the MAX-2-SAT definition. But why not just keep adding tokens as needed, rather than imposing any preordained limit? IIRC OpenAI's tokenizer has a vocabulary of around 52k subword strings. When that tokenizer was being designed, I don't imagine anyone would have worried much if the final number had ended up being 60k or even 100k. How could you possibly choose a meaningful δ from first principles?

To put that point a different way, imagine the authors had proven NP-completeness by reduction from the Knapsack Problem, where the knapsack you're packing has some maximum capacity. If you can easily swap your knapsack out for a larger knapsack whenever it gets (close to) full, then the problem becomes trivial.

If the authors had managed to prove that any arbitrary objective function leads to an NP-hard tokenizer optimization problem, then their result would be more general. If the paper proves that somehow, I missed it.

I suppose this paper suggests "here be dragons" in an interesting if incomplete way, but I would also say there's no need to hurt yourself with an expensive optimization problem when you're not even sure it delivers the results you want.

bnjmn commented on The longest word you can type on the first row   rubenerd.com/the-longest-... · Posted by u/nafnlj
bnjmn · 2 years ago
On any macOS computer (or replace /usr/share/dict/words with your own word list):

  grep '^[qwertyuiop]*$' /usr/share/dict/words | \
  awk '{ print length(), $0 }' | \
  sort -n
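
(Since sort -n sorts ascending, the longest qualifying words end up at the bottom; append | tail -1 if you only want the single longest. Note the pattern is case-sensitive, so capitalized dictionary entries are excluded; add -i to grep if you want them too.)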

bnjmn commented on How the most popular cars in the US track drivers   wired.com/story/car-data-... · Posted by u/arkadiyt
bnjmn · 2 years ago
Are there any new cars / car brands that credibly promise not to track their drivers?

Any car with a network connection for software updates seems likely to be harvesting driver data, or is at least capable of doing so.

u/bnjmn

Karma: 477 · Cake day: July 6, 2011

About: [ my public key: https://keybase.io/benjamn; my proof: https://keybase.io/benjamn/sigs/94-u377AG0oTnRb3A88rmP2WOk88eq9DrT9K2RGfsJk ]