https://www.nrc.gov/docs/ML0534/ML053410342.pdf
The NRC is a good place to start; they've been working on keeping technology from hurting people for a while.
I'm essentially pro-nuclear, I just don't trust the people who run it.
I attempted some similar VQ-VAE work, but aimed at tokenizing rendered text instead. I was curious whether I could build a visual LLM that works on text rendered in a 10 pt font, and I also tried PDF sources. The basic idea was to do what the more advanced diffusion image models already can do when they generate images of text, but with a dedicated image-text diffusion model that does completions. Beyond that, I wondered whether I could embed things like document type and language, so you'd end up with a latent representation of text more abstracted than current dictionary tokenizers. Learned a lot, and I thought it was all beautifully displayed in this post.
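Roughly the kind of thing I was playing with, as a toy sketch in PyTorch (not my actual code; the crop size, codebook size, and layers are placeholders): a small VQ-VAE that maps crops of rendered text to a grid of discrete code ids, which could then act as "visual tokens" for a downstream model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes=512, code_dim=64, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, code_dim)
            self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
            self.beta = beta

        def forward(self, z):  # z: (B, C, H, W) encoder features
            B, C, H, W = z.shape
            flat = z.permute(0, 2, 3, 1).reshape(-1, C)
            idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)  # nearest code
            zq = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)
            # codebook + commitment losses, straight-through estimator for gradients
            loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
            zq = z + (zq - z).detach()
            return zq, idx.view(B, H, W), loss

    class TextPatchVQVAE(nn.Module):
        """Encode grayscale crops of rendered text into discrete token ids."""
        def __init__(self, code_dim=64, num_codes=512):
            super().__init__()
            self.enc = nn.Sequential(
                nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, code_dim, 3, padding=1),
            )
            self.vq = VectorQuantizer(num_codes, code_dim)
            self.dec = nn.Sequential(
                nn.ConvTranspose2d(code_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, 3, padding=1),
            )

        def forward(self, x):  # x: (B, 1, H, W) rendered-text crops in [0, 1]
            zq, tokens, vq_loss = self.vq(self.enc(x))
            recon = self.dec(zq)
            return recon, tokens, F.mse_loss(recon, x) + vq_loss

    if __name__ == "__main__":
        model = TextPatchVQVAE()
        dummy = torch.rand(2, 1, 32, 128)  # stand-in for rendered text line crops
        recon, tokens, loss = model(dummy)
        print(recon.shape, tokens.shape, loss.item())  # tokens = grid of "visual text tokens"

The `tokens` grid is what a completion model would be trained on; conditioning signals like document type or language would be extra inputs on top of this, which this sketch doesn't cover.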