Why would you then assume the reasoning tokens will "faithfully" include hints supplied in the prompt? The model may or may not mention the hints, depending on whether its activations treat those hints as necessary to arrive at the answer. In their experiments, they found that the models mentioned those hints between 20% and 40% of the time. Naively, that sounds unsurprising to me.
Even in the second experiment, where they trained the model to use hints, the optimization targeted the answer, not the reasoning tokens. I am not surprised the models did not mention the hints, because they were never trained to mention them.
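To make that point concrete, here is a toy sketch of my own, not the paper's actual setup: if the reward only scores the final answer, the objective gives identical credit whether or not the reasoning ever mentions the hint, so there is no training pressure to include it.

```python
# Toy illustration (not Anthropic's actual training setup): a reward
# that only inspects the final answer cannot reinforce "mention the
# hint in the reasoning tokens", because those tokens never affect it.
def reward(reasoning_tokens: str, final_answer: str, correct_answer: str) -> float:
    # reasoning_tokens is ignored entirely; only the answer is scored.
    return 1.0 if final_answer.strip() == correct_answer else 0.0

print(reward("I used the hint from the prompt.", "42", "42"))  # 1.0
print(reward("No mention of the hint at all.", "42", "42"))    # 1.0, same reward
```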
That said, even if I come across as a reader unsurprised by the result, it is a good experiment: now we have some experimental results to lean on.
Kudos to Anthropic for continuing to study these models.
Memorization is not a panacea. I never found memorizing l33t code problems to be edifying. I think it's because those kinds of tight, self-referential, clever programs are far removed from the activity of writing applications. Most working programmers run into a novel algorithm problem only once or twice in a career. Application programming has more the flavor of a human-mediated graph traversal: the programmer has access to a node's local state and improvises movement and mutation using only that local state plus a rapidly decaying stack. That is, there is no well-defined sequence for any given real-world problem, only heuristics.
Whether it's leet code or anything else.
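To give the analogy some shape, here is a minimal Python sketch; the graph, node states, and heuristic are all invented for illustration. Each step is chosen from the current node's local state plus a small, rapidly decaying stack of recent context, heuristics rather than a fixed sequence.

```python
from collections import deque

# Toy model of the "human-mediated graph traversal" analogy: every
# name and number here is made up for the sketch.
GRAPH = {
    "login":   {"bugs": 2, "edges": ["session", "ui"]},
    "session": {"bugs": 0, "edges": ["db"]},
    "ui":      {"bugs": 1, "edges": ["login", "db"]},
    "db":      {"bugs": 3, "edges": []},
}

def improvised_walk(start: str, max_steps: int = 6) -> list[str]:
    recent = deque(maxlen=3)        # the rapidly decaying stack
    node, path = start, [start]
    for _ in range(max_steps):
        recent.append(node)
        here = GRAPH[node]
        if here["bugs"] > 0:        # mutate local state as we pass through
            here["bugs"] -= 1
        # Heuristic move: visit the neighbor not seen recently with the
        # most outstanding bugs; stop when nothing local suggests a move.
        candidates = [n for n in here["edges"] if n not in recent]
        if not candidates:
            break
        node = max(candidates, key=lambda n: GRAPH[n]["bugs"])
        path.append(node)
    return path

print(improvised_walk("login"))     # e.g. ['login', 'ui', 'db']
```

The only point of the sketch is the shape of the loop: every decision comes from local state and a short memory, which is exactly why there is no single sequence worth memorizing.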