"git doesn't really work ... because docx is a binary blob."
Well, yes, but the binary blob is a zip archive of a directory of text XML files, and one could imagine tooling that wraps the git interaction in an unzip/zip bracket.
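To make the unzip half of that bracket concrete, here is a minimal Python sketch using only the standard library. The helper name `docx_parts_as_text` is made up for illustration, and it assumes the parts worth diffing are the `.xml` and `.rels` members of the archive (such as `word/document.xml`); a real tool would also have to re-zip the parts and deal with embedded media.

```python
import io
import zipfile
from xml.dom import minidom


def docx_parts_as_text(docx_bytes: bytes) -> dict[str, str]:
    """Return {member name: pretty-printed XML} for the XML parts of a .docx.

    Hypothetical helper: a .docx is a zip archive, so we open it with
    zipfile and re-indent each XML part so that a line-based diff has
    more than one giant line to work with.
    """
    parts = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith((".xml", ".rels")):
                continue  # skip images and other binary members
            raw = archive.read(name)
            parts[name] = minidom.parseString(raw).toprettyxml(indent="  ")
    return parts
```

The returned parts could be written to a directory and committed; whether the resulting diff means anything to a lawyer is a separate question.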
The real problem is that lawyers, like basically all other non-programmers, neither know nor care about the sequence of bytes that makes up a file in the minds of programmers. In their minds the file IS what they see when they open it in Word: a sequence of white rectangles with text laid out on them in specific ways, including tables with borders, etc. The fact that a lot of really complicated stuff goes on inside the file to get the WYSIWYG rendering is not only irrelevant to them, it's unknown.
Maybe the answer here will be along the lines of Karpathy's musings about making LLMs work directly with pixels (images of text), instead of encoded text and tokenizers [1]. An AI tool would take the document in its visually standard legal form, read it, and produce output with edits, redlines, etc. as directed by the user.
Diffing the XML is a complete nonstarter. I've spent years working with the OpenXML format and can assure you it is very complex even for a professional software engineer with 10 years of experience.
The diff of the document (referred to as a "redline") is what lawyers send to the client and their counterparties. It's essential that the redline is legible for all parties and reflects their professionalism.
Moreover, it is not enough to see the structural changes between the versions. A lawyer needs to see the formatting changes between the versions as well which cannot be accomplished by diffing XML files.
I saw this project in a recent Hacker News comment, and IIRC there were some comments there about how it does (or can) work decently with git features (https://news.ycombinator.com/item?id=46265811)
I am interested to hear your thoughts on recutils, and whether we could perhaps have a git+recutils-like workflow for Microsoft Word or similar.
I thought about it, and a tarred/zipped git folder, which can also contain images and other content referenced from recutils, feels like an interesting alternative to an OpenXML/Word document.
I am not sure, but I think OpenXML directly embeds data like pictures, which can definitely make it hard for git to work. Basically, I am interested in what you think about this, and in any feedback.
You don't seem to be aware of any of the work I'm doing on CSTML (built to replace HTML and XML, and yes, built to be useful for legal documents (even though IANAL)). If you're interested in collaborating to go after the law market, let's talk! You're trying to sneak in a side door. I'm planning to smash down the main gates, the ones you say are impregnable. My investigation says they're not unbreakable, but instead strong and brittle. Many attacks will bounce off, yes, but brittleness means that these are defenses that will shatter before they bend.
Something I've started doing in my workflow is using Pandoc to convert between Markdown and DOCX when authoring long documents. This lets me put the Markdown into Git and apply the Gemini CLI to it. When referencing other documents, I'll also convert them to MD and drop them into a folder so I can tell the AI to read them and cross-reference things.
At the start of the project the Markdown is authoritative, and the DOCX is just for previewing the styling. (Pandoc can insert the text into a layout template with place holders.)
Towards the end of a project I'll start treating the DOCX as authoritative but continue generating Markdown from it, so I can run the AI over it as a final proof-read or whatever.
This is similar to what people used to do with DocBook, but with a more friendly text format and a more AI-friendly "modern" workflow with Git, etc...
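A rough Python sketch of that round trip, which only builds the two Pandoc command lines. The file names are placeholders, `--reference-doc` is assumed to be the template mechanism in question (it takes styles from an existing DOCX), and `gfm` is just one possible Markdown flavor.

```python
import subprocess
from typing import Optional


def md_to_docx_cmd(markdown_path: str, docx_path: str,
                   template: Optional[str] = None) -> list[str]:
    """Build the pandoc command that renders Markdown to DOCX."""
    cmd = ["pandoc", markdown_path, "-o", docx_path]
    if template is not None:
        # --reference-doc makes pandoc take its styles from an existing DOCX
        cmd.append(f"--reference-doc={template}")
    return cmd


def docx_to_md_cmd(docx_path: str, markdown_path: str) -> list[str]:
    """Build the pandoc command that converts DOCX back to Markdown."""
    return ["pandoc", docx_path, "-f", "docx", "-t", "gfm", "-o", markdown_path]


def run(cmd: list[str]) -> None:
    """Run a command and raise if it fails (requires pandoc to be installed)."""
    subprocess.run(cmd, check=True)
```

Early in a project you would run the first command after each edit; later on, the second one regenerates the Markdown from the now-authoritative DOCX before committing.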
It's easy to think that Word's functionality is what you see on the Ribbon, mentally map that to Google Docs, and think that the latter can replace the former. But Word is extremely deep. The templating and style sheets allow for a level of fine-grained control that doesn't exist in alternatives. There are features that exist purely for the legal market, like Table of Authorities, and customizable line numbering and hyphenation.
Maybe one day there'll be a product to replace Word, but it won't succeed by claiming to be a generalist replacement but only as a niche product that solves a particularly painful problem for lawyers and then expands over time to capture more use cases.
Within the last ten years I have worked with lawyers and legal secretaries who were still using WordPerfect. I have to say that I was surprised to learn this.
I swear Google Docs also used to do a better job of replicating Word's ribbon, and has slowly pruned it of a lot of features that are individually niche, but cumulatively very important.
Word's "ribbon" is a shitshow, and has been since day one.
It's depressing to read about Word's entrenchment. This entire once-great application is now an execrable mess, with menus scattered under cryptic buttons (and abridged into dumbed-down menus that require you to expose yet another, collapsed one to access essential, frequently-used functions), a file... thing (not even a dialog, let alone a proper File dialog) that shows you a canned list of locations in a UI that appears to consist only of text...
The style-handling is even messed up, once one of Word's great strengths.
This is going to be very true right up until it isn't.
Yeah, I know that sounds fake-deep but we've seen this before; I'm old enough to remember when WordPerfect was the standard that wasn't going anywhere.
It will just be one of those inflection-point thingies.
I don't necessarily disagree with you, but I did want to point out that a big part of what made it possible for Word to displace WordPerfect in the legal world was, literally, the fact that Word implemented full support for WordPerfect's file format including all sorts of weird quirky edge cases.
So, an analogous "Word-killer" today would presumably have to implement all of the docx format's weird quirks etc. On the one hand, the file format is standardized and open, so in principle that should be possible; on the other hand, it's a pretty gnarly file format, with a lot of nooks and crannies. Ironically, I remember hearing once that some of the weirder nooks and crannies of the docx format have their roots in... Word's WordPerfect interoperability features.
And as somebody who recently spent far more time than he expected to trying to reliably get data _out_ of a set of mildly-complicated docx files, I can report that the various fiddly details that the OP notes as being particularly important in the legal domain --- very specific details of paragraph formatting, complex table structures, etc. --- are a huge PITA to deal with when working with the docx format.
Yes, exactly. A successor could theoretically replace Word, but first it needs to replicate all of its existing functionality.
For a competitor to supplant Word, it would need to:
- Be fully backwards compatible with .docx. Lawyers will inevitably receive .docx files from counterparties that they need to review, redline, and mark up. The new processor has to handle everything Word does flawlessly. (As an engineer who has spent considerable time building a high-quality docx comparison engine, I can tell you this is tremendously difficult.)
- If it introduces a new file format, support seamless comparison and conversion between that format and .docx. Not technically impossible, but also tremendously difficult with marginal upside.
- Defeat the Microsoft Office bundle in the market — meaning it either offers enough advantage that organizations pay for both, or it replaces Excel, PowerPoint, and Outlook too.
Given the enormous challenge of building a viable Word competitor and the marginal room for improvement that Microsoft has left on the table, I think it's very unlikely that a competitor will threaten its market position.
As the US government becomes more erratic and untrustworthy it will encourage large organisations to look for alternatives to American software and services.
The stated intent of the US National Security Strategy is to destabilise and undermine Europe. That is a big incentive for European organisations to replace Windows, Office, and any other Microsoft service.
Linux and LibreOffice usage will grow as a direct consequence of the US government's new antipathy to Europe.
Maybe, but imagine you are some EU commissioner, your choices look like this:
1. Fund a homegrown alternative. Spend millions of euros, all the while fighting off a barrage of complaints that EUWord doesn't do things, costs too much, is burning taxes and productivity, etc.
2. Spend a nominal sum, but kick the project into the long grass, and hope that the US retreats from its stance back to the norm. "Maybe Word will be ok in 2027 after midterms, right? or 2029? Maybe I stick my fingers in my ears and tough it out"
Option 2 is realistically what most politicians would do. Making tough, really difficult decisions is not something they like to do.
Imo, as long as companies are paying for E3 licenses, they won't pay for another solution. And they'll be paying Microsoft for licenses as long as they have Active Directory, right? Seems like the whole Microsoft ecosystem is built on AD (and probably Excel too)
Yes, AD is the value proposition. Your employees can get cloud-synced, multi-user real-time editing of documents. This is what kills the "but my Linux app can do it for free."
It runs on-premise, has all kinds of certificates, and has a history of half a century (give or take). That kills Google Docs.
It's cross-platform, killing whatever Apple thinks it has.
Too many people think Word is a text editor. I'd use Notepad++ if it had full AD integration. But it doesn't.
> And they'll be paying Microsoft for licenses as long as they have Active Directory, right?
They'll be paying long beyond on-prem AD as well. EntraID is becoming the new identity system. If you're already on E3/E5, you might as well make use of it, and making most use of it means being stuck in the whole Microsoft ecosystem.
Why bother looking for alternatives, even if one particular product might be better, when Microsoft gives you literally everything, at least at a mediocre level, for one price and pre-integrated?
I've been doing all my personal notes etc that I want a rich text format in .ODT for decades now and don't regret it one bit.
I do regret being overly paranoid in my 20s and not writing down my master passphrase to my personal documents -- I lost a huge chunk of diaries and writings due to that.
Fun fact: ODT used Blowfish encryption for a long time (I believe newer LibreOffice versions default to AES). Remember when we made Bruce Schneier a meme like Chuck Norris? He wrote it -- apparently it's faster than AES?
Anyways, if you save an .ODT file with a strong password, you've got a nice little self-contained encrypted volume that doesn't require "suspicious" software to open.
ANYWAYS, a bit of a tangent but... looking forward to death of Word.
I'm sure ODT works well for many personal use cases, but can guarantee it will never see adoption in the legal industry. Microsoft Word is the only viable option for lawyers.
>I'm sure ODT works well for many personal use cases, but can guarantee it will never see adoption in the legal industry. Microsoft Word is the only viable option for lawyers.
The legal industry also uses MD5 to certify digital evidence hasn't been tampered with, that too will eventually bite them in the ass.
> I'm sure ODT works well for many personal use cases, but can guarantee it will never see adoption in the legal industry. Microsoft Word is the only viable option for lawyers.
I'm a lawyer, though I'm practicing in a wholly different legal system (Romanic civil law) and another country. Why would you say that?
No issues with .docx and Word per se, but I hate that stupid ribbon with undying hatred. Thus I use LibreOffice as much as I can, while maintaining a licensed Office 365 setup under dual boot with Windows for cases when I have no other choice.
I'm surprised Google Docs doesn't support all the features lawyers need by now. Seems like a market they'd want to go after, and their .docx conversion seems decent enough for basic formatting, tables, etc.
Curious what the top 3 features are that are missing. The article only mentions multi-level decimal clause numbering (e.g. 9.1.2). Seems like it would be a very easy feature to add. I've heard that line numbering is also a big legal thing, but Docs already has that.
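For what it's worth, the counting logic behind decimal clause numbering is small. Here is a toy Python sketch, assuming each clause is given only as its nesting depth (1 for top level). The function name is made up, and this says nothing about how Word or Docs actually store or render list numbering, which may be where the real difficulty lies.

```python
def number_clauses(depths: list[int]) -> list[str]:
    """Turn a sequence of nesting depths into labels like '9.1.2'."""
    counters: list[int] = []
    labels = []
    for depth in depths:
        if depth < 1:
            raise ValueError("depth must be at least 1")
        # Drop counters for deeper levels; they restart under a new parent.
        del counters[depth:]
        # If a level was skipped on the way down, it stays at 0.
        while len(counters) < depth:
            counters.append(0)
        counters[depth - 1] += 1
        labels.append(".".join(str(n) for n in counters))
    return labels


print(number_clauses([1, 2, 2, 3, 1, 2]))
# → ['1', '1.1', '1.2', '1.2.1', '2', '2.1']
```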
It's also not nearly as scriptable as Word is. Word has had macros ("fields") since its first Windows versions and VBA for over 20 years now, and it's easy to develop complex add-ons - where I live we've had one for grammar checking for decades now (speaking of that, Google Docs' language features for less popular languages are far behind Word's). Various software supports export to Word and some programs even import from it. You'd be surprised what levels of automation have been achieved with Word.
Files are also easily shared (on physical media, email, no need for anyone to have a Google account to edit and send them back), encrypted, burned onto a CD for storage. DOC/DOCX are ubiquitous and stable file formats. No worries about data leaks in the cloud as it's all local by default...
So as far as formatting goes, it seems like it's only list formatting and small caps you've identified, am I missing anything else? (I am baffled by Docs' refusal to add small caps.)
But then as far as workflow is concerned, I'm not sure Docs is as unusable as you say it is -- the commit atomicity and comprehensive history aren't supported by Word either, are they? That's just a function of maintaining 20 separate copies of the file with each set of changes. You can still do that with Docs if you want to, rather than relying on the version history. And then "Tools > Compare documents" lets you merge in all the changes from another document, in an atomic way if you want. And if you want to use the revision history in a "master" version, you can use named versions as well.
Yes, everybody at the firm needs to use Docs. That's not unique to law -- every company that switches from MS365 to Google makes that kind of overnight transition, but it makes sense because you're paying one company or the other, not both.
It's the communication between firms that is going to be stuck in .docx basically forever though, so this is where Google needs to improve its conversion. Ideally Google would also build a "send a copy/transfer" feature so a firm can receive a Google Doc but know that from the moment it "opens" it, a new copy is made on their local Drive so the sending firm never sees edits or activity. But because that feels like it would be too easy to mess up, I think actual .docx file attachments will themselves be immortal, even if both sides used Docs.
Docx conversion isn’t great actually. I happened to open a docx with embedded PNG images and Google Docs couldn’t display them. If they whiff on a widely used image format like PNG, I imagine there are a lot of shortcomings.
Oof, oh yeah. I mainly deal with text, but I remember there are multiple image formats I think Word supports that Docs doesn't. Also basic vector drawings don't convert. I don't understand how stuff like that hasn't been fixed by now.
There is a real-world alternative to Markdown. The Ada programming language standard is defined in a text format that is converted to TXT, HTML, and PDF. Tools can compare different versions of the Ada standard, sometimes several at once, producing multi-colored documents.
> Additionally, every colleague, counterparty, outside-counsel, and client a lawyer ever works with uses docx. To introduce a new format into this ecosystem would introduce friction into every single interaction.
As an attorney, this is what kept me from switching to LibreOffice or Google Docs. I gave it a shot, but since the other attorneys I work with (both in and outside the US) and my clients all use Word, I ended up wasting a lot of time fixing files after converting between formats. In the end, it just wasn’t worth it.
I’m fairly tech-savvy, but many of my coworkers struggle with the mental effort required to switch to new software. Two colleagues I greatly respect still use WordPerfect and Word 2003 because they dislike change so much. It's too much of a lift for these people to wholesale switch word processors.
I'd go a few layers even broader than this article and say that the modern tech industry has an abysmal track record when building tools for non-software technical fields. Tech builds either their own software-oriented workflows or the most dumbed down consumer-oriented workflow they can. Law is an excellent example of a field with a very high degree of fidelity, philosophy, and process yet it can only ever have partial crossover with software development methodologies. Tech often treats someone like a lawyer as either a substandard developer or an advanced consumer without making a real effort to understand the context and needs of highly complicated yet non-software professions.
[1] https://x.com/karpathy/status/1980397031542989305
Perhaps OpenXML could be converted to CSV or something similar, which could then be converted to recutils.
Recutils can import mdb (Microsoft Access database) files and can convert CSV files to and from its own format.
On the Google Docs front — I wrote specifically about its viability as a Word successor in an earlier post, "Why Lawyers Will Never Use Google Docs".
https://theredline.versionstory.com/p/why-lawyers-will-never...
Google offers so many things free out of the box. And for serious spreadsheet sort of work, I use numpy.
Google pretty much won the Office software war.
Just not in any of the biggest markets for office software, like the legal industry, and government.
So was 2010 or 2011 the Year of the Linux Desktop?
That's a "citation needed" if ever there was one.
Cite, please?
It's looking like Windows will be more of an issue here than anything in Office. But either way they can only push people so far.
I've seen too many things that seemed to be around forever eventually disappear to believe in the permanence of Word.
https://theredline.versionstory.com/p/why-lawyers-will-never...
The short answer is Google Docs:
- Requires all-or-nothing adoption which is a non-starter for law-firms
- Does not support commit atomicity
- Does not store a comprehensive history of the document
Sources of standard: http://www.ada-auth.org/arm-files/2022-SRC.zip
Format tool can be found here: http://www.ada-auth.org/arm.html#Format_Tool
Sample output with edits: http://www.ada-auth.org/standards/22aarm/html/AA-5-2.html
Red color is for Ada 2012 -> Ada 2012 TC1 diff, green color is for Ada 2012 TC1 -> Ada 2022 diff.