Nothing is out of spite; it is purely limited by the amount of effort required to support these models.
We are hopeful too: users can technically add models to Ollama directly, although there is definitely a learning curve.
I wonder if the author would be willing to try with another representation.
[1]: Does Prompt Formatting Have Any Impact on LLM Performance? https://arxiv.org/html/2411.10541v1
[2]: Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey https://arxiv.org/html/2402.17944v2
(Shameless plug: I am one of the developers of Thinky.gg (https://thinky.gg), a thinky puzzle game site with a 'shortest path style' game [Pathology] and a Sokoban variant [Sokopath].)
These games are typically NP-hard, so solvers for Sokoban (or Pathology) have mostly relied on brute-force search combined with various heuristics and optimizations (like BFS, deadlock detection, and Zobrist hashing). However, once levels get beyond a certain size with enough movable blocks, you end up exhausting memory pretty quickly.
These types of games are still "AI proof" so far, in that LLMs are absolutely awful at solving them while humans are very good (so they seem reasonable to consider for ARC-AGI benchmarks). Whenever a new reasoning model gets released, I typically try it on some basic Pathology levels (like 'One at a Time' https://pathology.thinky.gg/level/ybbun/one-at-a-time) and they fail miserably.
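To make the brute-force approach concrete, here is a minimal Python sketch of BFS over game states with Zobrist hashing for the visited set. The push rules are an assumption (Sokoban-style: walking into a block pushes it one cell if the cell behind it is free, and the goal is simply to reach the exit); they are not the official Pathology rules, and `parse` and `solve` are names made up for this example. It uses the level encoding from this thread (1 wall, 2 movable block, 3 exit, 4 start).

```python
import random
from collections import deque

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def parse(level):
    """Split a digit grid into walls, blocks, start and exit (row, col) cells."""
    grid = level.split()
    walls, blocks, start, goal = set(), set(), None, None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "1":
                walls.add((r, c))
            elif ch == "2":
                blocks.add((r, c))
            elif ch == "3":
                goal = (r, c)
            elif ch == "4":
                start = (r, c)
    return len(grid), len(grid[0]), walls, frozenset(blocks), start, goal

def solve(level, seed=0):
    """Return a shortest move string under the ASSUMED push rules, or None."""
    rows, cols, walls, blocks, start, goal = parse(level)
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # Zobrist keys: one random 64-bit number per (cell, piece kind).
    z_player = {cell: rng.getrandbits(64) for cell in cells}
    z_block = {cell: rng.getrandbits(64) for cell in cells}

    def free(cell, blocks):
        r, c = cell
        return (0 <= r < rows and 0 <= c < cols
                and cell not in walls and cell not in blocks)

    # A state's hash is the XOR of the keys of everything on the board.
    h = z_player[start]
    for b in blocks:
        h ^= z_block[b]
    seen = {h}  # stores hashes only; 64-bit collisions are possible but unlikely
    queue = deque([(start, blocks, h, "")])
    while queue:
        player, blocks, h, path = queue.popleft()
        if player == goal:
            return path
        for name, (dr, dc) in MOVES.items():
            nxt = (player[0] + dr, player[1] + dc)
            nblocks, nh = blocks, h
            if nxt in blocks:
                beyond = (nxt[0] + dr, nxt[1] + dc)
                if not free(beyond, blocks):
                    continue
                nblocks = (blocks - {nxt}) | {beyond}
                # Incremental update: XOR the block out of one cell, into the other.
                nh ^= z_block[nxt] ^ z_block[beyond]
            elif not free(nxt, blocks):
                continue
            nh ^= z_player[player] ^ z_player[nxt]
            if nh not in seen:
                seen.add(nh)
                queue.append((nxt, nblocks, nh, path + name))
    return None

print(solve("000\n020\n023\n041"))
```

The `seen` set and the queue are where the memory goes: every reachable combination of player position and block layout gets an entry, which is why this stops scaling once a level has many movable blocks.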
Simple level code for the above level (1 is a wall, 2 is a movable block, 4 is starting block, 3 is the exit):
000
020
023
041
Similar to OP, I've found that Claude can't manage rule dynamics, blocked paths, or game objectives well, and it spits out random results.
I've seen AI struggle with ASCII, but when presented as other data structures, it performs better.
Edit: e.g. JSON with structured coordinates, graph-based JSON, or a semantic representation with the coordinates.
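As a rough illustration of the "JSON with structured coordinates" idea, here is a small Python sketch that converts the digit-grid level code posted above (1 wall, 2 movable block, 3 exit, 4 start) into a JSON document. The field names (`walls`, `movable_blocks`, `start`, `exit`) and the x/y convention (x is the column, y is the row, origin at the top left) are assumptions for this example, not a format used by any particular game or model.

```python
import json

def level_to_json(level_code):
    """Convert a digit-grid level into JSON with explicit coordinates."""
    rows = level_code.split()
    doc = {
        "width": len(rows[0]),
        "height": len(rows),
        "walls": [],
        "movable_blocks": [],
        "start": None,
        "exit": None,
    }
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            pos = {"x": x, "y": y}
            if ch == "1":
                doc["walls"].append(pos)
            elif ch == "2":
                doc["movable_blocks"].append(pos)
            elif ch == "3":
                doc["exit"] = pos
            elif ch == "4":
                doc["start"] = pos
            # "0" is empty floor and is left implicit
    return json.dumps(doc, indent=2)

print(level_to_json("000\n020\n023\n041"))
```

The point of this form is that the model no longer has to count characters to work out where things are; whether that actually helps on these puzzles would need to be tested.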