X-ray: a Python library for finding bad redactions in PDF documents

Explain like I’m stupid: what is the most gracious interpretation of redaction when releasing files like this?

Why should anyone involved retain any anonymity?

I’m asking in good faith because naively it seems like this should not even exist. All of it should be exposed.

EDIT: I did not think about the innocent folks that might be caught in the crossfire. That checks out. Thanks everyone!

Iirc WikiLeaks took the position of withholding any information that would directly lead to the bodily harm of an individual (or something to that effect). The rationale being, "Yes, group A did something horrible that warrants investigation, but if we publish their GPS coordinates they will be blown to smithereens."

vlovich123 · 2 days ago

Unless those people impacted were friendly to US interests? If I recall correctly they published the names of collaborators and informants in Iraq. They also published military tactics that would help those trying to kill US soldiers. GPS coordinates by comparison generally go stale very quickly.

dragonwriter · 2 days ago

There was, to say the least, not a specific law mandating release of the material held by WikiLeaks and specifying what was to be, and what was not to be, redacted, so I don't see that as much of a guide here.

dragonwriter · 2 days ago

The law mandating release requires redaction of victim identities, information relating to investigations that are still active, child sexual abuse material, and information related to national security.

It generally prohibits other redactions, and expressly prohibits redactions for embarrassment, reputational harm, or political sensitivity.

Of course, there is considerable concern that the actual redactions do not appear to comply with the legal requirements.

supercheetah · 2 days ago

FWIW, a lot of the victims (possibly all) are saying they don't care about redactions if they end up being used to protect perpetrators. They want to make sure everyone is held accountable.

dragonwriter · 2 days ago

https://abcnews.go.com/US/epsteins-alleged-victims-accuse-do...

Specifically, a number of Epstein victims have complained that the release was unacceptable because it was incomplete, illegally redacted material other than victim names which was not excepted from release under the law mandating release, and because it failed to redact victim identities required to be protected under the law mandating release.

krapp · 2 days ago

Protecting the identity of victims, eyewitnesses or informants.

sawjet · 2 days ago

Don't forget the co-conspirators!

empath75 · 2 days ago

The files of a high profile and long running investigation are going to be full of false leads, hoaxes and other bullshit. The reason they don’t just always release the files after closing cases is that there genuinely are going to be innocent people caught in the crossfire who have privacy rights.

This case is so important and such a clusterfuck that the files need to be opened anyway.

ozim · 2 days ago

The person asking the question above says he doesn’t understand, so I guess he also doesn’t understand that prosecutors, lawyers, law enforcement and judges make mistakes.

So yes, this is the best explanation. Revealing everything might bring great harm to innocent people just because they were somehow mentioned in the documents.

Just add all the experience we already have with “internet investigators” who ruin people's lives for petty reasons.

Adobe Pro, when used properly, will redact anything in a PDF permanently.

Whoever did these "bad" redactions doesn't even know how to use a PDF Editor.

We have paralegals and lawyers "mark for redaction", then review the documents, then "apply redactions". It's literally been done by thousands of lawyers/paralegals for decades. This is just someone not following the process and procedure, and making mistakes. It's actually quite amateurish. You should never, ever screw up redactions if you follow the proper process. Good on the X-ray project for trying to find errors.

I just want to add that applying black highlights on top of text is, in fact, the "old" way of redaction: it was common to do this, then simply print the paper with the black bars and send the paper as the final product.

Whoever did it is probably old, and may have done it thinking they were going to print it on paper afterwards!! Just guessing as to why someone would do this.

tgsovlerkhgsel · 2 days ago

Or they may not understand how PDF works and think that it's the same as paper.

Especially with the "draw a black box over it" method, the text also stops being trivially mouse-selectable (even if CTRL+A might still work).

Another possibility is, of course, that whoever was responsible for this knew exactly what they were doing, but this way they can claim an honest mistake rather than intentionally leaking the data.

aidos · 2 days ago

A while back I did a little work with a company that were meant to help us improve our security posture. I terminated the contract after they sent me documents in which they’d redacted their own AWS keys using this method.

zahlman · 2 days ago

> Or they may not understand how PDF works and think that it's the same as paper.

Yes; that's presumably included in being "amateurish" and "not following proper process".

selectodude · 2 days ago

Any attorney or law enforcement that works for the US Federal Government receives very, very comprehensive instructions on how to redact information on basically the first day of training. There is absolutely zero doubt among any of my DOGE'd friends that this was 100 percent on purpose malicious compliance.

unfocused · 2 days ago

Agreed. I worked on the Canadian side of the legal world, and there is a very comprehensive process for redaction. Nobody does redaction without following the process. In 15+ years I never saw anyone do something silly like this in the office.

hsbauauvhabzb · 2 days ago

So you think it was done by Trump supporters, as opposed to in spite of Trump? Genuine question: who stands to gain? I don’t follow this enough to know.

mlissner · 2 days ago

Cool to see this here. It’s funny because we do so many huge, complex, multiyear projects at Free Law Project, but this is the most viral any of our work has ever gone!

Anyway, I made X-ray to analyze the millions of documents we have in CourtListener so that we can try to educate people about the issue.

The analysis was fun. We used S3 batch jobs to analyze millions of documents in a matter of minutes, but we haven’t done the hard part of looking at the results and reporting them out. One day.

thangalin · 2 days ago

https://www.argeliuslabs.com/deep-research-on-pdf-redaction-...

> Information Leaking from Redaction Marks: Even when content is properly removed, the redaction marks themselves can leak some information if not done carefully. For example, if you have a black box exactly covering a word, the length of that black box gives a clue to the word’s length (and potentially its identity).

Does X-ray employ glyph spacing attacks and try to exploit font metric leaks?

No, we worked with researchers that developed that kind of system, but didn't broadcast our work b/c the research was too sensitive. Seems the cat is out of the bag now though.

I think the combination of AI and font-metrics is going to be wild though. You ought to be able to make a system that can figure out likely words based on the unredacted ones and the redaction's size. I haven't seen any redaction system yet that protects against this.

Presumably with font kerning and pixel perfect recreation of the source, it would be possible to guess the word very accurately.

The strings oioioi and oooiii will have different widths in some fonts because character organisation matters a lot.
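A toy illustration of that point, as a minimal sketch: the glyph advances and kerning pairs below are made-up numbers, not taken from any real font. The two strings contain the same letters, but their widths differ once pair kerning is applied.

```python
# Hypothetical font metrics in font units; NOT from any real font.
ADVANCE = {"o": 556, "i": 222}
# Hypothetical adjustment applied between an adjacent (left, right) pair.
KERNING = {("o", "i"): -10, ("i", "o"): -6}


def text_width(text: str) -> int:
    """Sum the glyph advances plus the kerning adjustment for each adjacent pair."""
    width = sum(ADVANCE[ch] for ch in text)
    for left, right in zip(text, text[1:]):
        width += KERNING.get((left, right), 0)
    return width


print(text_width("oioioi"))  # 2292
print(text_width("oooiii"))  # 2324
```

Without the kerning table both strings would measure 2334 units, so in this toy model the difference comes entirely from character order.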

setopt · 2 days ago

I suppose it gets a bit more complex again if you enable stuff like microtype, but even then you can probably measure how much inter-letter and inter-word spacing has been adjusted by just scanning other text in the same line.

I think the conclusion is honestly that PDF is an outdated format for keeping records that might have to be redacted in the future, like court documents. Something reflowable like EPUB could instead have the text replaced with constant-width black squares, so that no hints are leaked, as someone mentioned in a parallel comment.
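For plain reflowable text, the constant-width idea could look something like the sketch below. The `redact` helper, the placeholder length, and the sample sentence are all made up for illustration; it is not how any particular redaction tool works.

```python
# Every redaction gets the same placeholder, whatever the original length was.
PLACEHOLDER = "█" * 8


def redact(text: str, secrets: list[str]) -> str:
    """Replace each secret with a fixed-size placeholder so its length is not leaked."""
    # Longest first, so a short secret cannot split a longer one that contains it.
    for secret in sorted(secrets, key=len, reverse=True):
        text = text.replace(secret, PLACEHOLDER)
    return text


print(redact("Wire sent to Al by Bartholomew.", ["Al", "Bartholomew"]))
```

Both names come out as the same eight squares, so the length of the box says nothing about which name was underneath. The surrounding context can of course still leak information.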

embedding-shape · 2 days ago

I haven't gone through more than just 10% of the files released today, but noticed that at least EFTA00037069.pdf for example has a `/Prev` pointer, meaning the previous revision of the file is available inside of the PDF itself. In this case, the difference is minor (stuff moved around), but I'm guessing if it's in one file, it could be more. You can run `qpdf --show-object=trailer EFTA00037069.pdf` on a PDF file to see for yourself if it's there.
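If you want to triage a whole folder instead of running qpdf by hand, a crude byte-level scan gives a first hint. This is only a sketch: `count_revisions` is a made-up helper, the sample bytes are synthetic, and linearized PDFs also contain `/Prev`, so a hit means "look closer with a real PDF parser", not "there is hidden content".

```python
import re


def count_revisions(pdf_bytes: bytes) -> dict:
    """Rough heuristic for incremental updates; not a substitute for a PDF parser."""
    return {
        # An incremental save appends a new trailer that points back via /Prev.
        "prev_pointers": len(re.findall(rb"/Prev\s+\d+", pdf_bytes)),
        # Each revision normally ends with its own %%EOF marker.
        "eof_markers": pdf_bytes.count(b"%%EOF"),
    }


# Synthetic stand-in for a file that was saved once and then updated once.
sample = (
    b"%PDF-1.7\n"
    b"trailer\n<< /Size 10 /Root 1 0 R >>\nstartxref\n123\n%%EOF\n"
    b"trailer\n<< /Size 12 /Root 1 0 R /Prev 123 >>\nstartxref\n456\n%%EOF\n"
)
print(count_revisions(sample))  # {'prev_pointers': 1, 'eof_markers': 2}
```

For a real file you would pass `open(path, "rb").read()` instead of the synthetic sample.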

I'm almost fully convinced that someone did this badly on purpose, together with the bad redactions, as surely people tasked with redacting a bunch of files receive some instructions on what to do/not to do?

victor9000 · 2 days ago

I looked into this specific file, and the history doesn't contain anything too interesting. The root file is already the fully redacted and flattened document, and the edit in question is the addition of a numbered footer to each page.

xhevahir · 2 days ago

> as surely people tasked with redacting a bunch of files receive some instructions on what to do/not to do?

You've phrased this as a question; I gather that you know better than to assume a modicum of competence from these people.

throwawaysleep · 2 days ago

All the reporting I have read suggests that they are roping anyone and everyone they can into doing redactions. So I suspect many simply lack the experience to do it well.

Ok, so say someone says "We're overloaded, we need more people" and someone else says "Ok, departments Q, R and T change priority to doing redaction". Then at least one person somewhere in this chain has to consider that every person from Q, R and T must go through at least a 3-slide PowerPoint or whatever saying what's happening, what to do, and what not to do, right?

blitzar · 2 days ago

They should all have been using the same redaction tooling.

If I were to hazard a guess, pure speculation, I would say the unretrievable parts were court / previously redacted and the retrievable parts are the latest round of panicked rushed redactions.

mmmlinux · 2 days ago

Give a room full of high school students instructions for a 3 step process. I guarantee at least 10% are going to screw it up somehow.

jmward01 · 2 days ago

Hmmm.. The more I think about this the more any font kerning is likely a major leak for redaction. Even if the boxes have randomness applied to them, the words around a blacked out area have exact positioning that constrains the text within so that only certain letter/space combinations could fit between them. With a little knowledge of the rendering algorithm and some educated guessing about the text, a brute-force search may be able to do a very credible job of discovering the actual text. This isn't my field. Anyone out there who has actually worked on this problem?
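To make the idea concrete, here is a rough sketch of the simplest version of such a search: measure the gap between the surrounding words, then keep only the candidates whose rendered width fits. The per-character advances, the tolerance, and the candidate list are hypothetical; a real attempt would need the document's actual font metrics, kerning and spacing rules.

```python
# Hypothetical per-character advances in points; NOT from any real font.
ADVANCES = {"y": 5.0, "e": 5.0, "s": 4.5, "n": 5.5, "o": 5.5, "m": 8.0, "a": 5.0, "b": 5.5}


def text_width(text: str, advances: dict, default: float = 5.0) -> float:
    """Width of the text as the sum of its character advances (kerning ignored)."""
    return sum(advances.get(ch, default) for ch in text)


def candidates_that_fit(gap_width: float, candidates: list, advances: dict,
                        tolerance: float = 0.25) -> list:
    """Keep the candidates whose width is within `tolerance` of the measured gap."""
    return [
        word for word in candidates
        if abs(text_width(word, advances) - gap_width) <= tolerance
    ]


# A redaction box measured at roughly 11.1 points wide.
print(candidates_that_fit(11.1, ["yes", "no", "maybe"], ADVANCES))  # ['no']
```

With a three-word dictionary this is trivial. The open question is how many candidates survive once the dictionary is every plausible name or phrase.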

worewood · 2 days ago

There was a recent vulnerability, where researchers were able to extract information from an encrypted chat session from an LLM, by analyzing packet size/timings of the underlying SSL connection. A classic side-channel attack. Seems possible to draw a parallel between the two.

dylan604 · 2 days ago

> the more any font kerning is likely a major leak for redaction

Now I want a font that randomly adjusts the kerning automagically to be used by people in standard word processors not some graphics app. In this way, every time the same word appears in the document, the kerning is different between each one.

chews · 2 days ago

My autism wants that idea straight into a dumpster fire.

Really depends on the length and predictability of the redaction, but yes. If it's short and contextually it's only likely to be either "yes" or "no", you've got it. If it's longer and could contain an unknown person's name along with some other words, well, that's harder.

I feel like this creates a hash value and the real question is how unique of a value does it represent and how easy it is to narrow it down given throwing a dictionary at it. Similarly, unknown names could likely be teased out like a one-time pad. If they appear in multiple sentences then their randomness quickly repeats and becomes something that potentially could be isolated from the rest of the words around them. This would probably be a fun problem for a cryptography class to work on.

IshKebab · 2 days ago

Unlikely to be possible except for the smallest redactions, like if you have a single name redacted and a list of candidates. But I think kerning wouldn't help you much more than just knowing the rough length anyway.

ComplexSystems · 18 hours ago

Kerning and perplexity together could probably solve quite a few of these.

brotchie · 2 days ago

You'd think the go-to workflow for releasing redacted PDFs would be to draw black rectangles and then rasterize to image-only PDFs :shrug:

selinkocalar · 2 days ago

As someone who's built an entire business on "anti-screenshots" this is brilliant.

PDF redaction fails are everywhere and it's usually because people don't understand that covering text with a black box doesn't actually remove the underlying data.

I see this constantly in compliance. People think they're protecting sensitive info but the original text is still there in the PDF structure.
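The failure mode is easy to picture as plain geometry. In the sketch below the word boxes and the black rectangle are invented coordinates standing in for what a PDF parser would give you, and the helper names are made up. It shows the general idea of flagging text that sits under a drawn rectangle, not X-ray's actual implementation.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True when the two axis-aligned boxes share some area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def text_under_boxes(words, black_boxes):
    """Return the words whose bounding box intersects any black rectangle."""
    return [text for text, box in words if any(overlaps(box, r) for r in black_boxes)]


# Invented coordinates: four words on one line, with a rectangle drawn over the last two.
words = [
    ("Paid", Box(72, 700, 95, 712)),
    ("to", Box(98, 700, 108, 712)),
    ("ACME", Box(111, 700, 135, 712)),
    ("Corp", Box(138, 700, 158, 712)),
]
black_boxes = [Box(110, 698, 160, 714)]
print(text_under_boxes(words, black_boxes))  # ['ACME', 'Corp']
```

If a check like this finds any words at all, the text was covered rather than removed.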

Not to mention some PDF editors preserve previous edits in the PDF file itself, which people also seem unaware of. A slightly more user-friendly description of the feature, without having to read the specification itself: https://developers.foxit.com/developer-hub/document/incremen...

shbooms · 2 days ago

Oftentimes you will have requirements that the documents you release be digitally searchable, and in those cases this would not be an option.

pottertheotter · 2 days ago

This made me think of something I came across recently that’s almost the opposite problem of requiring PDFs to be searchable. A local government would publish PDFs where the text is clearly readable on screen, but the selectable text layer is intentionally scrambled, so copy/paste or search returns garbage. It's a very hostile thing to do, especially with public data!

8note · 2 days ago

Run some OCR on them afterwards to recreate the text layer?

alessandroliva · 2 days ago

This being on top of the news about the Epstein files being badly redacted is pretty funny.

Are you under the impression that they're unconnected?

shrubble · 2 days ago

Shockingly, you can see redaction info from within your browser's PDF viewer. I am using Brave on Linux, and went here:

https://www.justice.gov/multimedia/Court%20Records/Matter%20...

As a test, select with your mouse the entire first line of paragraph number 90, and then paste it into a text editor or a shell. The unredacted text appears!

sroussey · 2 days ago

Why would “Financial Strategy Group, Ltd” be redacted?

ktpsns · 2 days ago

This is exactly the type of bad redactions which the X-ray software will also find.

Fnoord · 2 days ago

Yep, bingo:

$ uvx --from x-ray xray "https://www.justice.gov/multimedia/Court%20Records/Matter%20..."

You can X-ray a PDF?

belter · 2 days ago

There is no way, looking at most of the now-unredacted text, that it should have been redacted.

It's clear that the DOJ was paying overtime, based on the number of redactions, so the agents and lawyers just roamed free...