The minimum, though, is that any copyrighted work a supplier has legal access to can be copied, transformed arbitrarily, and used for training. They can also share those works, and transformed versions of them, with anyone else who already has legal access to that data. No contract, including terms of use, can override that. They can also scrape it freely, though perhaps with daily limits imposed to avoid destructive scraping.
That might be enough to collect, preprocess, and share datasets like The Pile, RefinedWeb, and uploaded content that the host shares (e.g., The Stack, YouTube). We can do a lot with big models trained that way. We can also synthesize other data from them with less risk.
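On the daily-limit idea above, here's a minimal sketch of what polite, capped scraping could look like, using libcurl. The cap, the delay, and the example.com URLs are assumptions for illustration only; real limits would come from the host's terms or robots policy.

```c
/*
 * Sketch of "non-destructive" scraping under an assumed daily cap.
 * MAX_REQUESTS_PER_DAY, DELAY_SECONDS, and the URL list are hypothetical.
 * Build with: cc scrape.c -lcurl
 */
#include <stdio.h>
#include <unistd.h>
#include <curl/curl.h>

#define MAX_REQUESTS_PER_DAY 1000   /* assumed polite daily cap */
#define DELAY_SECONDS 5             /* assumed spacing between requests */

int main(void)
{
    const char *urls[] = {
        "https://example.com/page1",
        "https://example.com/page2",
    };
    size_t n_urls = sizeof(urls) / sizeof(urls[0]);

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    size_t requests_today = 0;
    for (size_t i = 0; i < n_urls; i++) {
        if (requests_today >= MAX_REQUESTS_PER_DAY) {
            fprintf(stderr, "Daily cap reached; stopping until tomorrow.\n");
            break;
        }
        curl_easy_setopt(curl, CURLOPT_URL, urls[i]);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        CURLcode rc = curl_easy_perform(curl);  /* body goes to stdout by default */
        if (rc != CURLE_OK)
            fprintf(stderr, "Fetch failed: %s\n", curl_easy_strerror(rc));
        requests_today++;
        sleep(DELAY_SECONDS);  /* space out requests so the host isn't hammered */
    }

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```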
Well, those spec issues are usually either undocumented or scattered enough that new engineers won't know where to find a full list of them. That means the architecturally insecure OSes might be more secure in specific areas, thanks to all the investment put into them over time, while the cleaner design hasn't had its own issues found and fixed yet. So, recommending the "higher-security design" might actually lower security.
For techniques like Fil-C, the issues include abstraction-gap attacks and implementation problems. For the former, Fil-C's model might mismatch the legacy code in some ways (e.g., Ada/C FFI with trampolines). Also, the interactions between legacy code and Fil-C code might introduce new bugs, because the integration is essentially a new program. This problem has shown up in practice in a few research works.
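To make the abstraction-gap point concrete, here's a minimal sketch. Nothing in it is Fil-C's actual API or behavior; legacy_fill() and its 16-byte contract are hypothetical, and the legacy routine is defined in the same file only so it compiles standalone. The idea is that a bounds-enforcing build of main() still can't see what an uninstrumented legacy routine does with a pointer it hands over.

```c
/*
 * Illustrative sketch of an "abstraction gap" at a safe/legacy boundary.
 * Imagine main() built with a bounds-enforcing C implementation while
 * legacy_fill() comes from a library built with an ordinary compiler.
 */
#include <stdio.h>
#include <string.h>

/* Legacy routine: its documentation says "buf must hold at least 16 bytes",
 * but nothing in the type system carries that contract across the boundary. */
static void legacy_fill(char *buf)
{
    memset(buf, 'A', 16);   /* trusts its documented contract */
}

int main(void)
{
    char small[8];          /* the "safe" side allocates only 8 bytes */

    /* The instrumented side can police its own accesses, but once the pointer
     * crosses into uninstrumented code, the 16-byte assumption silently
     * overflows `small`. The combined program has a bug neither half exhibits
     * alone, which is the sense in which the integration is a new program. */
    legacy_fill(small);

    printf("%.8s\n", small);
    return 0;
}
```

The same pattern would apply to trampolines generated for an Ada/C FFI: the size contract lives in prose on one side of the boundary, so neither toolchain gets to check it.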
I haven't reviewed Fil-C myself: I've forgotten too much C, and the author is really clever. It might be hard to prove the absence of bugs in it. However, it might still be very helpful in securing C programs.