This has been attempted many times. They all fail the same way.
These general data models start to become useful and interesting at around a trillion edges, give or take an order of magnitude. A mature graph model would be at least a few orders of magnitude larger, even if you aggressively curated what went into it. This is a simple consequence of the cardinality of the different kinds of entities that are included in most useful models.
No system described in open source can get anywhere close to even the base case of a trillion edges; they all suffer serious scaling and performance issues long before that point. It is a famously non-trivial computer science problem, and much of the serious R&D was historically not done in public.
This is why you only see toy or narrowly focused graph data models instead of a giant graph of All The Things. It would be cool to have something like this but that entails some hardcore deep tech R&D.
There are open source projects moving toward this scale. GraphBLAS, for example, uses an algebraic formulation over compressed sparse matrix representations of graphs that is designed to be portable across many architectures, including CUDA. It would be nice if companies like NVIDIA could get more behind our efforts, as our main bottleneck is access to development hardware.
To plug my project, I've wrapped the SuiteSparse GraphBLAS library in a Postgres extension [1] that fluidly blends algebraic graph theory with the relational model. The main flow is to use SQL to structure complex queries for the starting points, use GraphBLAS to flow through the graph to the endpoints, and then join back to tables to get the relevant metadata. On cheap Hetzner hardware (a 64-core AMD EPYC) we've achieved BFS at 7 billion edges per second over the largest graphs in the SuiteSparse collection (~10B edges). With our CUDA support we hope to push that kind of performance into graphs with trillions of edges.

[1] https://github.com/OneSparse/OneSparse
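For readers who haven't seen the algebraic formulation: the BFS above boils down to repeated vector-matrix products over the adjacency matrix. Here's a minimal sketch of that idea with scipy.sparse standing in for the actual GraphBLAS API, so the names and types here are illustrative only:

    # BFS as linear algebra: the frontier vector times the adjacency matrix
    # yields the next frontier. GraphBLAS does the same with masked sparse
    # operations over a semiring; this sketch uses plain scipy.sparse.
    import numpy as np
    from scipy.sparse import csr_matrix

    def bfs_levels(adj: csr_matrix, source: int) -> np.ndarray:
        """Return the BFS level of every node (-1 if unreachable)."""
        n = adj.shape[0]
        levels = np.full(n, -1)
        frontier = np.zeros(n)
        frontier[source] = 1.0
        level = 0
        while frontier.any():
            levels[frontier > 0] = level
            # One step: the frontier "flows" along all outgoing edges.
            reached = np.asarray(frontier @ adj).ravel()
            # Keep only newly discovered nodes as the next frontier.
            frontier = ((reached > 0) & (levels == -1)).astype(float)
            level += 1
        return levels

The per-step work is one sparse vector-matrix multiply, which is exactly the operation GraphBLAS optimizes across CPUs and GPUs.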
>These general data models start to become useful and interesting at around a trillion edges
That is a wild claim. Perhaps for some very specific definition of "useful and interesting"? This dataset is already interesting (hard to say whether it's useful) at a far smaller scale.
> It is a famously non-trivial computer science problem and much of the serious R&D was not done in public historically.
Could you point us to any public research on this issue? Or the history of the proprietary research? Just the names might help - maybe there are news articles, it's a section in someone's book, etc.
https://developer.nytimes.com/docs/semantic-api-product/1/ov...

The Guardian has similar:

https://open-platform.theguardian.com/documentation/tag

Either or both could be an interesting starting point for something like that. I tried to find something for the BBC and was surprised they didn't have anything. I would have figured public media would have been a great resource for this.
Given that six degrees of separation is rooted in reality, this means we can draw graphs connecting anyone (bad) to anyone (we don't like) and then invent specious reasons why "it's all connected, man".
That said, some networks with paths shorter than six are interesting. Right now, there's a 1:1 direct path from these documents to a bunch of people with an interest in confounding what evidentiary value they have in justice processes. That's more interesting to me than what the documents say.
Internet search engines have their origins in government projects, FWIW. There were search engines before AltaVista, used for searching data sets that pre-date the internet, and some of the people involved went on to work on the original commercial search engines.
I think you meant one shudders. And yeah, Snowden made it clear there's orders of magnitude more data than this graph explorer for them to sift through.
As "The Rest Is Politics" podcasts points out, the meagre consequences mostly came to Brits: Ghislaine Maxwell, Prince Andrew aka Andrew Mountbatten, and the former UK embassador to the US.
It's a bit too bad that the network visualisation relies on d3: it is really slow with big networks, and its force-directed algorithm is far from the best available.
Have you tried using JS libraries built specifically to visualise graph networks such as Sigma.js, Vivagraph or Cytoscape?
Shameless plug: if OP is looking to stay on d3, he could also try slotting in my C++/WASM versions [1] of the main d3 many-body forces. Not the best out there, but I've found a >3x speedup using them for periplus.app :)

[1] https://www.npmjs.com/package/d3-manybody-wasm
Yes, the dataset also has three entries for Virginia Giuffre: "Virginia L. Giuffre", "Virginia Roberts Giuffre", and "Jane Doe Number 3 (Virginia Roberts)".
I read a recent observation that people subject to discovery often make purposeful typos in key names so that the communication stays under the radar.
LLMs are awful for this. I've got a project that's doing structured extraction and half the work is deduplication.
I didn't go down the route of LLMs for the clean up, as you're getting into scale and context issues with larger datasets.
I got into semantic similarity networks for this use case. You can do efficient pairwise matching with Annoy, set a cutoff threshold, and your isolated subgraphs are merger candidates.
I wrapped up my code in a little library if you're into this sort of thing: github.com/specialprocedures/semnet
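For anyone curious, the core of that approach fits in a few lines. A rough sketch of the idea (not the semnet internals; the embed function, dim, cutoff, and k here are placeholders you'd tune):

    # Dedup via semantic similarity: embed entity names, find approximate
    # nearest neighbours with Annoy, keep pairs under a distance cutoff,
    # and treat each connected component as one merge candidate set.
    from annoy import AnnoyIndex
    import networkx as nx

    def merge_candidates(names, embed, dim, cutoff=0.3, k=10):
        # embed(name) -> vector of length dim; bring your own model.
        index = AnnoyIndex(dim, "angular")
        for i, name in enumerate(names):
            index.add_item(i, embed(name))
        index.build(10)  # more trees = better recall, slower build

        g = nx.Graph()
        g.add_nodes_from(range(len(names)))
        for i in range(len(names)):
            nbrs, dists = index.get_nns_by_item(i, k, include_distances=True)
            for j, d in zip(nbrs, dists):
                if i != j and d < cutoff:  # close enough to be the same entity
                    g.add_edge(i, j)

        # Isolated subgraphs with more than one node are merge candidates.
        return [[names[i] for i in comp]
                for comp in nx.connected_components(g) if len(comp) > 1]

With reasonable name embeddings, the three "Virginia ... Giuffre" variants mentioned upthread should land in a single component.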
>A force-directed graph is a technique for visualizing networks where nodes are treated like physical objects with forces acting between them to create a stable arrangement. Attractive forces (like springs) pull connected nodes together, while repulsive forces (like electric charges) push all nodes apart, resulting in a layout where connected nodes are closer and unconnected nodes are more separated
I think it would be better and faster if the website calculated the node positions in the background (with a sensible cap on iterations) and then showed the result. Animating 4k nodes and 25k edges (15k by default) is a waste of CPU and is laggy even on my high-end machine. But maybe the author was limited by the tools used.
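As a sketch of that precompute-then-render idea, with Python and networkx standing in for d3 (purely illustrative): run the force layout to a capped iteration budget off-screen, then draw the final positions once. d3 supports the same pattern by creating the simulation stopped and calling tick() in a loop before rendering.

    # Run the force-directed layout with a fixed iteration budget, then
    # render one static frame instead of animating every tick.
    import networkx as nx
    import matplotlib.pyplot as plt

    g = nx.gnm_random_graph(4000, 15000, seed=1)       # stand-in for the 4k/15k graph
    pos = nx.spring_layout(g, iterations=50, seed=1)   # layout computed off-screen
    nx.draw_networkx(g, pos, node_size=4, width=0.2, with_labels=False)
    plt.savefig("graph.png", dpi=150)                  # one draw, no animation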
If you look at this graph and your first thought is "haha, take that, MAGA", then you are a brainwashed ideologue. This graph gives a window into the layers of rot in our political system. The complexity is perfectly represented by its form, but it seems like your graph is just a big arrow that says "orange man bad".
He is, and so are a very large number of people associated with him.
That is not an exhaustive list. But people who want things to improve should also shut down their ability to confect scandals or distractions, like the "Obama tan suit" controversy. Once Americans have a reasonable selection of non-insane non-compromised candidates, things may get better. The election of Mamdani is a good start in that direction, because all the other (D) candidates were horribly compromised.
This is great work showing relationships and connections. The government gets scared by these types of efforts, as it has many members who are guilty of crimes related to this and other matters.
We need to expand this kind of network mapping to more data sets and subject areas as well.
After seeing this, I'm interested in a map of each person to help establish who they are, who they worked for during the period the emails cover, and who they work for now.
Just like here, you could get a timeline of key events, a graph of connected entities, and links to the original documents.
Newsrooms might already do this internally idk.
This code might work as a foundation. I love that it's RDF.
Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus
Aren't LLMs something like this?
https://www.gdeltproject.org/
UK: https://github.com/gchq/Gaffer
US: https://github.com/NationalSecurityAgency/lemongraph
Americans..?
It is very funny that the "unaccountability shield" stops at the US border, though, so it's taken out Prince Andrew.
I have a 4090 and 32 GB of RAM and this thing is chugging at like 2 FPS, with the UI being completely unresponsive.
https://observablehq.com/@d3/force-directed-graph/2