Introducing WebAccessBench, a new benchmark that assesses AI language models on accessibility quality and WCAG conformance in generated web interfaces under realistic prompting conditions.

I did a bit of research and found that LLMs are remarkably bad at basic digital accessibility tasks. You can compare models and read the full white paper at conesible.de/wab.

Overall, the data show that guiding a model with expert-grade prompts has very little effect beyond a small nudge. The benchmark results suggest that the objective error counts are too high to rely on LLMs at all in digital accessibility work, even under explicit expert guidance. That has major implications for society at large and points to serious discrimination against people with disabilities.

The whitepaper says that the benchmark counted accessibility problems using axe-core (https://github.com/dequelabs/axe-core). It’s a pity that neither the site nor the paper contains an example of an LLM output together with its list of detected problems. I am curious about these aspects:
• Which of axe-core’s rules (https://github.com/dequelabs/axe-core/blob/develop/doc/rule-...) LLMs violate most often
• Which groups of users are most affected by those rule violations (e.g. blind users or deaf users)
• Whether it’s likely that I unintentionally violate those same rules in web pages I write
Examples of rule violations and statistics on the most-violated rules would make the website more convincing by showing that the detected accessibility errors reflect real problems. It would rule out the possibility that the detected errors all come from a single noisy, false-positive rule in axe-core. I suspect most readers are not familiar enough with axe-core to trust that it has no false-positive rules.
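To sketch the kind of statistic I mean: axe-core reports its results as JSON with a `violations` array, where each violation carries a rule `id`, an `impact` level, and one entry per affected node in `nodes`. Tallying those across many generated pages yields exactly the most-violated-rules table I'm asking for. A minimal Python sketch — the sample results below are invented for illustration, not taken from the benchmark:

```python
from collections import Counter

# Invented sample data in the shape axe-core returns: each run has a
# "violations" list; each violation has a rule "id", an "impact"
# level, and one entry per affected element in "nodes".
runs = [
    {"violations": [
        {"id": "image-alt", "impact": "critical", "nodes": [{}, {}]},
        {"id": "color-contrast", "impact": "serious", "nodes": [{}]},
    ]},
    {"violations": [
        {"id": "image-alt", "impact": "critical", "nodes": [{}]},
        {"id": "label", "impact": "critical", "nodes": [{}]},
    ]},
]

def tally(runs):
    """Count affected nodes per rule id across all runs."""
    counts = Counter()
    for run in runs:
        for violation in run["violations"]:
            counts[violation["id"]] += len(violation["nodes"])
    return counts

for rule_id, n in tally(runs).most_common():
    print(f"{rule_id}: {n}")
```

A table like this per model (plus one concrete violating HTML snippet per top rule) would make it obvious whether the errors cluster in a handful of rules or are spread across the whole rule set.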