This has been attempted many times. They all fail the same way.
These general data models start to become useful and interesting at around a trillion edges, give or take an order of magnitude. A mature graph model would be at least a few orders of magnitude larger, even if you aggressively curated what went into it. This is a simple consequence of the cardinality of the different kinds of entities that are included in most useful models.
No system described in open source can get anywhere close to even the base case of a trillion edges; they all suffer serious scaling and performance issues long before that point. It is a famously non-trivial computer science problem, and much of the serious R&D was historically not done in public.
This is why you only see toy or narrowly focused graph data models instead of a giant graph of All The Things. It would be cool to have something like this but that entails some hardcore deep tech R&D.
There are open source projects moving toward this scale. GraphBLAS, for example, uses an algebraic formulation over compressed sparse matrix representations of graphs that is designed to be portable across many architectures, including CUDA. It would be nice if companies like NVIDIA could get more behind our efforts, as our main bottleneck is access to development hardware.
To plug my project, I've wrapped the SuiteSparse GraphBLAS library in a Postgres extension [1] that fluidly blends algebraic graph theory with the relational model. The main flow is to use SQL to structure complex queries for the starting points, use GraphBLAS to flow through the graph to the endpoints, and then join back to tables to get the relevant metadata. On cheap Hetzner hardware (a 64-core AMD EPYC) we've achieved BFS at 7 billion edges per second over the largest graphs in the SuiteSparse collection (~10B edges). With our CUDA support we hope to push that kind of performance into graphs with trillions of edges.

[1] https://github.com/OneSparse/OneSparse
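For readers who haven't seen the algebraic formulation: the BFS above boils down to repeated vector-matrix products over the adjacency matrix. Here's a minimal sketch of that idea with scipy.sparse standing in for the actual GraphBLAS API, so the names and types here are illustrative only:

    # BFS as linear algebra: the frontier vector times the adjacency matrix
    # yields the next frontier. GraphBLAS does the same with masked sparse
    # operations over a semiring; this sketch uses plain scipy.sparse.
    import numpy as np
    from scipy.sparse import csr_matrix

    def bfs_levels(adj: csr_matrix, source: int) -> np.ndarray:
        """Return the BFS level of every node (-1 if unreachable)."""
        n = adj.shape[0]
        levels = np.full(n, -1)
        frontier = np.zeros(n)
        frontier[source] = 1.0
        level = 0
        while frontier.any():
            levels[frontier > 0] = level
            # One step: the frontier "flows" along all outgoing edges.
            reached = np.asarray(frontier @ adj).ravel()
            # Keep only newly discovered nodes as the next frontier.
            frontier = ((reached > 0) & (levels == -1)).astype(float)
            level += 1
        return levels

The per-step work is one sparse vector-matrix multiply, which is exactly the operation GraphBLAS optimizes across CPUs and GPUs.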
>These general data models start to become useful and interesting at around a trillion edges
That is a wild claim. Perhaps for some very specific definition of "useful and interesting"? This dataset is already interesting (hard to say whether it's useful) at a far smaller scale.
> It is a famously non-trivial computer science problem and much of the serious R&D was not done in public historically.
Could you point us to any public research on this issue? Or the history of the proprietary research? Just the names might help - maybe there are news articles, it's a section in someone's book, etc.
https://developer.nytimes.com/docs/semantic-api-product/1/ov...

The Guardian has similar:

https://open-platform.theguardian.com/documentation/tag

Either or both could be an interesting starting point for something like that. I tried to find something for the BBC and was surprised they didn't have anything. I would have figured public media would have been a great resource for this.
Given that six degrees of separation is rooted in reality, this means we can draw graphs connecting anyone (bad) to anyone (we don't like) and then invent specious reasons why "it's all connected, man".
That said, some networks with paths shorter than six are interesting. Right now, there's a 1:1 direct path from these documents to a bunch of people with an interest in confounding what evidentiary value they have in justice processes. That's more interesting to me than what the documents say.
Internet search engines have their origins in government projects, FWIW. There were search engines before AltaVista, used for searching data sets that pre-date the internet, and some of the people involved went on to work on the original commercial search engines.
I think you meant one shudders. And yeah, Snowden made it clear there's orders of magnitude more data than this graph explorer for them to sift through.
As "The Rest Is Politics" podcasts points out, the meagre consequences mostly came to Brits: Ghislaine Maxwell, Prince Andrew aka Andrew Mountbatten, and the former UK embassador to the US.
It's a bit too bad that the network visualisation relies on d3: it is really slow with big networks, and its force-directed algorithm is far from the best available.
Have you tried using JS libraries built specifically to visualise graph networks such as Sigma.js, Vivagraph or Cytoscape?
Shameless plug: if OP is looking to stay on d3, he could also try slotting in my C++/WASM versions [1] of the main d3 many-body forces. Not the best out there, but I've found a >3x speedup using them for periplus.app :)

[1] https://www.npmjs.com/package/d3-manybody-wasm
Yes, the dataset also has three entries for Virginia Giuffre: "Virginia L. Giuffre", "Virginia Roberts Giuffre", and "Jane Doe Number 3 (Virginia Roberts)".
I read a recent observation that people subject to discovery often make purposeful typos in key names so that the communication stays under the radar.
LLMs are awful for this. I've got a project that's doing structured extraction and half the work is deduplication.
I didn't go down the route of LLMs for the clean up, as you're getting into scale and context issues with larger datasets.
I got into semantic similarity networks for this use case. You can do efficient pairwise matching with Annoy, set a cutoff threshold, and your isolated subgraphs are merger candidates.
I wrapped up my code in a little library if you're into this sort of thing: github.com/specialprocedures/semnet
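For anyone curious, the core of that approach fits in a few lines. A rough sketch of the idea (not the semnet internals; the embed function, dim, cutoff, and k here are placeholders you'd tune):

    # Dedup via semantic similarity: embed entity names, find approximate
    # nearest neighbours with Annoy, keep pairs under a distance cutoff,
    # and treat each connected component as one merge candidate set.
    from annoy import AnnoyIndex
    import networkx as nx

    def merge_candidates(names, embed, dim, cutoff=0.3, k=10):
        # embed(name) -> vector of length dim; bring your own model.
        index = AnnoyIndex(dim, "angular")
        for i, name in enumerate(names):
            index.add_item(i, embed(name))
        index.build(10)  # more trees = better recall, slower build

        g = nx.Graph()
        g.add_nodes_from(range(len(names)))
        for i in range(len(names)):
            nbrs, dists = index.get_nns_by_item(i, k, include_distances=True)
            for j, d in zip(nbrs, dists):
                if i != j and d < cutoff:  # close enough to be the same entity
                    g.add_edge(i, j)

        # Isolated subgraphs with more than one node are merge candidates.
        return [[names[i] for i in comp]
                for comp in nx.connected_components(g) if len(comp) > 1]

With reasonable name embeddings, the three "Virginia ... Giuffre" variants mentioned upthread should land in a single component.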
>A force-directed graph is a technique for visualizing networks where nodes are treated like physical objects with forces acting between them to create a stable arrangement. Attractive forces (like springs) pull connected nodes together, while repulsive forces (like electric charges) push all nodes apart, resulting in a layout where connected nodes are closer and unconnected nodes are more separated
I think it would be better and faster if the website calculated the node positions in the background (with a sensible cap on iterations) and then showed the result. Animating 4k nodes and 25k edges (15k by default) is a waste of CPU and is laggy even on my high-end machine. But maybe the author was limited by the tools used.
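As a sketch of that precompute-then-render idea, with Python and networkx standing in for d3 (purely illustrative): run the force layout to a capped iteration budget off-screen, then draw the final positions once. d3 supports the same pattern by creating the simulation stopped and calling tick() in a loop before rendering.

    # Run the force-directed layout with a fixed iteration budget, then
    # render one static frame instead of animating every tick.
    import networkx as nx
    import matplotlib.pyplot as plt

    g = nx.gnm_random_graph(4000, 15000, seed=1)       # stand-in for the 4k/15k graph
    pos = nx.spring_layout(g, iterations=50, seed=1)   # layout computed off-screen
    nx.draw_networkx(g, pos, node_size=4, width=0.2, with_labels=False)
    plt.savefig("graph.png", dpi=150)                  # one draw, no animation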
If you look at this graph and your first thought is "haha, take that, MAGA", then you are a brainwashed ideologue. This graph gives a window into the layers of rot in our political system. The complexity is perfectly represented by its form, but it seems like your graph is just a big arrow that says "orange man bad".
He is, and so are a very large number of people associated with him.
That is not an exhaustive list. But people who want things to improve should also shut down their ability to confect scandals or distractions, like the "Obama tan suit" controversy. Once Americans have a reasonable selection of non-insane non-compromised candidates, things may get better. The election of Mamdani is a good start in that direction, because all the other (D) candidates were horribly compromised.
This is great work showing relationships and connections. The government gets scared by these types of efforts, as it has many members who are guilty of crimes related to this and other matters.
We need to expand this kind of network mapping to more data sets and subject areas as well.
After seeing this, I'm interested in a map of each person to help establish who they are, who they worked for during the period the emails cover, and who they work for now.
Just like here, you could get a timeline of key events, a graph of connected entities, and links to the original documents.
Newsrooms might already do this internally idk.
This code might work as a foundation. I love that it's RDF.
Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus
Aren't LLMs something like this?
https://www.gdeltproject.org/
UK: https://github.com/gchq/Gaffer
US: https://github.com/NationalSecurityAgency/lemongraph
Americans..?
It is very funny that the "unaccountability shield" stops at the US border, though, so it's taken out Prince Andrew.
I have a 4090 and 32 GB of RAM and this thing is chugging at like 2 FPS, with the UI being completely unresponsive.
https://observablehq.com/@d3/force-directed-graph/2