Releasing YAHOO responsibly has been time-consuming, and we're relying on Drop Site to tackle the redactions. See this post for context.
We did an initial parsing pass of all four DOJ document batches on Friday. This takes a raw PDF and returns chunks containing typed blocks, each with a type (Title, Text, Figure, etc.), bounding boxes, content, and confidence scores. For PDFs that were just scans of photographs (which was like 90% of new content in Friday's release), it gave in-depth descriptions of those! You can type search terms like "door" at https://www.jmail.world/photos to see what I mean.
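To make the "chunks of typed blocks" idea concrete, here is a rough sketch of how that output could be modeled and searched. The class names, the bounding-box layout, and the `search_blocks` helper are all my own assumptions for illustration, not the parser's actual response format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    type: str                                 # e.g. "Title", "Text", "Figure"
    bbox: Tuple[float, float, float, float]   # assumed layout: (left, top, width, height)
    content: str                              # extracted text, or a description for photos
    confidence: float                         # assumed to be in the range 0..1


@dataclass
class Chunk:
    blocks: List[Block] = field(default_factory=list)


def search_blocks(chunks: List[Chunk], term: str, min_confidence: float = 0.5) -> List[Block]:
    """Return blocks whose content mentions `term`, skipping low-confidence blocks."""
    needle = term.lower()
    return [
        block
        for chunk in chunks
        for block in chunk.blocks
        if block.confidence >= min_confidence and needle in block.content.lower()
    ]


# Made-up sample data, only to show the shape.
chunks = [
    Chunk(blocks=[
        Block("Title", (0.1, 0.05, 0.8, 0.05), "Sample page", 0.98),
        Block("Figure", (0.1, 0.2, 0.8, 0.6), "A photograph of a wooden door in a hallway", 0.91),
        Block("Text", (0.1, 0.85, 0.8, 0.05), "door partially visible", 0.20),
    ])
]

hits = search_blocks(chunks, "door")
for block in hits:
    print(block.type, "-", block.content)
```

Filtering on the confidence score is just one plausible way to keep noisy blocks out of search results. The threshold of 0.5 is arbitrary.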
For apps like Jmail and JFlights we use their structured extraction endpoint instead—you define a schema (e.g. {from, to, subject, date, body} for emails or {departure_airport, arrival_airport, passengers[], date} for flights) and it pulls those fields directly into JSON.
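As a rough illustration of the schema idea, the two field lists above can be written down as plain Python mappings and used to sanity-check whatever JSON comes back. This is only a sketch: the `conforms` helper and the sample record are hypothetical, and the call to the extraction endpoint itself is left out because I don't want to guess at its request format.

```python
import json

# Field names come from the examples above; the Python types are my assumption.
EMAIL_SCHEMA = {"from": str, "to": str, "subject": str, "date": str, "body": str}
FLIGHT_SCHEMA = {
    "departure_airport": str,
    "arrival_airport": str,
    "passengers": list,
    "date": str,
}


def conforms(record: dict, schema: dict) -> bool:
    """True if the record has exactly the schema's fields, each with the expected type."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[key], expected) for key, expected in schema.items())


# Placeholder JSON standing in for an extraction result (not real data).
raw = """
{
  "departure_airport": "AAA",
  "arrival_airport": "BBB",
  "passengers": ["Passenger One", "Passenger Two"],
  "date": "2000-01-01"
}
"""
flight = json.loads(raw)

print(conforms(flight, FLIGHT_SCHEMA))
print(conforms(flight, EMAIL_SCHEMA))
```

A check like this is handy before records get loaded into an app, since a document that doesn't actually contain a flight log can still come back with missing or oddly typed fields.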
The JFlights example served as the best ad for Reducto and for how doc parsing technology can save hours of work in journalistic investigations like this.
See for yourself. Given this document:
https://www.jmail.world/drive/HOUSE_OVERSIGHT_002031
It inferred and enriched multiple flight cards on JFlights (https://www.jmail.world/flights). I was really shook when I first saw this.
Has anyone written a parser for the text messages? A messages-like UI to be able to read through all the texts would be super interesting too. The format DOJ released them in is impossible to follow.
https://michelcrypt4d4mus.github.io/epstein_text_messages/
He also shouted us out last month which was very kind of him
Do you have a page about each dataset you're sourcing and the background on them like you provide here?
The "EFTA00000468" saga has me distrusting the authenticity of most of these datasets.
Re: the DDoSecrets emails though (YAHOO dataset), I have more to share.
Drop Site News agreed to give us access to the Yahoo dataset discovered by DDoSecrets, but on the condition that we help redact it. It's a completely unfiltered dataset. It's literally just .eml files for jeeprojects@yahoo.com. It includes many attached documents. There is no illegal imagery, but it has photos of Epstein's extended family (nephews, nieces, etc) and headshots of many models that Epstein's executive assistant would send to him. I was quite shocked that this thing existed.
We built some internal redaction tools that the Drop Site team is now using to comb through all of this. We've released 5 batches of the Yahoo mail now, with the 1k+ Amazon receipts being the most recent.
A few thoughts on how we do redaction are here: https://www.jmail.world/about.
Unlike the DOJ, we've tried to minimize the ambiguity about what was redacted.
For example: all redacted images are replaced with a Gemini-generated description of that photograph.
Another example: we are aggressively redacting email addresses and phone numbers of normal people to avoid spamming them. Perhaps others would leave it all in, but Riley and I don't want to be responsible for these people's lives getting disrupted by this entire saga. For example, we redacted this guy's email but not his name: https://www.jmail.world/thread/4accfb5f3ed84656e9762740081a4...
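For a sense of what that kind of redaction looks like mechanically, here is a minimal sketch that masks email addresses and US-style phone numbers while leaving names alone. The regexes, the placeholder strings, and the `redact` function are assumptions for illustration, not the internal tools mentioned above. Real redaction also needs human review, since patterns like these miss plenty of formats.

```python
import re

# Deliberately simple patterns; they will not catch every address or number format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def redact(text: str) -> str:
    """Replace email addresses and phone numbers with visible placeholders."""
    text = EMAIL_RE.sub("[email redacted]", text)
    return PHONE_RE.sub("[phone redacted]", text)


# Fictional sample text.
sample = "John Doe <john.doe@example.com> called from (555) 010-4477."
print(redact(sample))
# prints: John Doe <[email redacted]> called from [phone redacted].
```

Using an explicit placeholder instead of silently deleting the text keeps it obvious to the reader that something was removed and what kind of thing it was.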
Riley and I were not expecting this type of scope when we first dropped Jmail. Jmail is an interesting side project for us, and this new dataset requires full-time attention. Thankfully we have help though. We're happy to take on this responsibility given how helpful, thoughtful and careful both the Drop Site and DDoSecrets teams have been here.
Specifically at https://www.jmail.world/photos
We found that Volume 2 and Volume 4 had the most never-before-seen stuff.
https://www.justice.gov/epstein/doj-disclosures/data-set-2-f...
https://www.justice.gov/epstein/doj-disclosures/data-set-4-f...
Also, this morning they quietly released volumes 5-7. Will have to find out how much of this is new.