In the GPT-2 era I created CouldReads, a large dataset of book titles/synopses generated by a model trained on thousands of e-books. It was a fun project in the naivete of 2020, but it's less amusing now.
https://feder001.com/exploring-wikipedia-as-a-database-part-...
Bluesky API library spun off from the other project: https://github.com/tfederman/pysky
Haven't really started on it yet, but here's a master list of RSS feeds and the code I used to source them: https://github.com/tfederman/huge-rss-list
And also a new project to fetch all links seen in the Bluesky firehose and gather metadata to build a database of sites and pages at a more granular level than the domain. For example, is account X posting video links from one YT channel or many?
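To make the "one YT channel or many" question concrete, here is a rough sketch of the counting step. It is not code from the project: the `(account, url, channel)` tuples, the `channel` metadata field, and the function names are all hypothetical, and it assumes the channel has already been looked up from the page metadata, since a bare video URL does not reveal it.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_key(url, channel=None):
    """Return a key finer-grained than the bare domain.

    `channel` is a hypothetical metadata field (e.g. a YouTube channel
    name) assumed to have been fetched separately for the link.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.youtube.com and youtube.com alike
    return (host, channel)


def distinct_sources_per_account(posts):
    """Count distinct (host, channel) keys linked by each account."""
    seen = defaultdict(set)
    for account, url, channel in posts:
        seen[account].add(site_key(url, channel))
    return {account: len(keys) for account, keys in seen.items()}


# Made-up sample records standing in for links seen in the firehose.
posts = [
    ("account_x", "https://www.youtube.com/watch?v=aaa", "channel_1"),
    ("account_x", "https://youtube.com/watch?v=bbb", "channel_1"),
    ("account_y", "https://www.youtube.com/watch?v=ccc", "channel_1"),
    ("account_y", "https://www.youtube.com/watch?v=ddd", "channel_2"),
]
result = distinct_sources_per_account(posts)
print(result)
```

With this sample data, account_x links to a single channel and account_y to two, which is the kind of distinction a domain-level count would miss.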
https://feder001.com/exploring-wikipedia-as-a-database-part-...
https://feder001.com/exploring-wikipedia-as-a-database-part-...