1. Replace the DDN layer with a flow between images and a latent variable. During training, compute in the direction image -> latent. During inference, compute in the direction latent -> image.
2. For your discrete options 1, ..., k, have trainable latent variables z_1, ..., z_k. This is a "code book".
Training looks like the following: Start with an image and run a flow from the image to the latent space (with conditioning, etc.). Find the closest option z_i and compute the L2 loss between z_i and your flowed latent variable. Additionally, add a loss corresponding to the log determinant of the Jacobian of the flow; this second loss is how a normalizing flow avoids mode collapse. Finally, I think you should divide the resulting gradient by the softmax of the negative L2 losses across all the latent variables. This gradient division serves the same purpose as dividing the gradient when training a mixture-of-experts model.
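Roughly, in PyTorch it might look like the sketch below. Everything here is assumption on my part: the flow interface (`forward(x, cond) -> (z, logdet)` and `inverse(z, cond) -> x`) and all the names are made up, and I'm reading the log-det term with the usual normalizing-flow sign, i.e. subtracting log|det J| of the image -> latent map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowCodebookHead(nn.Module):
    """Hypothetical DDN-layer replacement: an invertible flow between
    images and latents, plus a trainable code book z_1, ..., z_k."""

    def __init__(self, flow, k, latent_dim):
        super().__init__()
        self.flow = flow  # assumed: forward(x, cond) -> (z, logdet)
        self.codebook = nn.Parameter(torch.randn(k, latent_dim))

    def training_loss(self, image, cond=None):
        z, logdet = self.flow(image, cond)          # image -> latent, plus log|det J|

        d2 = torch.cdist(z, self.codebook).pow(2)   # (B, k) squared L2 to every code
        i = d2.argmin(dim=1)                        # index of the closest option z_i
        l2 = d2.gather(1, i.unsqueeze(1)).squeeze(1)

        # The L2 term pulls z toward its code; subtracting log|det J| is
        # the term that keeps the flow from collapsing everything onto one code.
        nll = 0.5 * l2 - logdet

        # Divide the gradient by softmax(-d2)_i, as in the MoE trick:
        # scaling the loss by a detached 1/p_i scales its gradient by exactly 1/p_i.
        p = F.softmax(-d2, dim=1).gather(1, i.unsqueeze(1)).squeeze(1)
        return (nll / p.detach()).mean()
```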
During inference, choose any latent variable z_i and flow from that to a generated image.
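Inference is then just the inverse direction, with the same assumed interface as above:

```python
@torch.no_grad()
def generate(head, i, cond=None):
    z = head.codebook[i].unsqueeze(0)  # pick discrete option i from the code book
    return head.flow.inverse(z, cond)  # latent -> generated image
```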
I tend to make them as Python servers which serve plain html/js/css with web components. I know this is a bit more involved than a single html file with inline js and css, but the tools I made were too complicated for the LLMs to get just right, and separating the logic out into separate js files as web components made it easy for me to fix the logic myself. I also deliberately prompted the LLMs to avoid React because I didn't want to need a build step.
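The server part really can be tiny. A stdlib-only sketch of the pattern (the directory name and port are arbitrary choices of mine, and a real tool would subclass the handler to add its own API routes):

```python
# Minimal sketch: serve the static html/js/css (with the web components)
# straight from a directory. "static" and the port are made-up choices.
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

Handler = functools.partial(SimpleHTTPRequestHandler, directory="static")
HTTPServer(("localhost", 8000), Handler).serve_forever()
```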
The only one I actually still use is the TODO app I made: https://github.com/cooljoseph1/todo-app. It stores everything in a JSON file, and you can have multiple TODO lists at once by specifying which JSON file to use when you launch it.
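For flavor, the storage pattern (this is not the actual code from that repo, just its general shape) is about this simple:

```python
# Hypothetical sketch of the pattern, not the repo's actual code:
# whichever JSON file you pass at launch is the whole data store.
import argparse
import json
import pathlib

parser = argparse.ArgumentParser()
parser.add_argument("store", help="path to the JSON file backing this list")
args = parser.parse_args()

path = pathlib.Path(args.store)
todos = json.loads(path.read_text()) if path.exists() else []
todos.append({"text": "example item", "done": False})
path.write_text(json.dumps(todos, indent=2))
```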