There have been a few times I wanted the ability to select some text out of a Markdown doc. For example, a GitHub CI check to ensure that PRs / issues / etc are properly formatted.
This can be done to some extent with regex, but those expressions are brittle and hard to read or edit later. mdq uses a familiar pipe syntax to navigate the Markdown in a structured way.
It's in 0.x because I don't want to fully commit to the syntax being stable, in case real-world testing shows that the syntax needs tweaking. But I think the project is in a pretty good spot overall, and would be interested in feedback!
This is because GitHub is not building the features we need; instead, they are putting their energy towards the AI land grab. Bitbucket, by contrast, has a feature where you can block PRs using a checkbox list outside of the description box. There are better ways to solve this first example from OP's readme.
Cool project! I mainly write MDX these days; it would be cool to see support for that dialect.
I feel like a “Linear for GitHub” is due.
If you want to open an enhancement request issue, I'm happy to take a look (PRs also welcome, but not required). If you're not on GitHub, let me know and we can figure out some other way to get the request tracked.
Thanks for taking a look at the project!
I'm currently in a phase of trying to shed tools and added complexity, rather than add them.
Prior to GitLab ratcheting up the usability, features, and cost-effectiveness, I preferred hosted git for 99% of use cases.
You throw the ball to where it's going. GitLab might be delivering more value in the short term, but if things wind up looking significantly different in ten years, they might be in for a world of hurt. Innovator's dilemma is real.
It's a danger to ignore the tectonic changes happening. It's also incredibly risky to lean fully in, because we're not sure where the value accrues or which systems are the most important to build. It doesn't seem like foundation models are it.
It's smart to build basic scaffolding, let the first movers make all the expensive mistakes, then integrate the winning approaches into your platform. That requires a lot of energy though.
So do the plagiarism machines!
I'm not sure this is better. I like the idea of the full context of the PR being available in a small set of relatively standardised fields. Smaller, non-semantic sets are easier to standardise.
It has saved us from self-induced pain and is a great coordination point.
If a document is supposed to have structure - even something as simple as nested lists of paragraphs - it doesn't seem realistic to expect regular text manipulation tools to do a whole lot with them. Something like "remove the second paragraph of the third entry in the fourth bullet-point list" is well beyond any sane use of any regex dialect that might be powerful enough. (Keeping in mind that traditional regexes can't balance brackets; presumably they can't properly track indentation levels either.)
See also: TOML - generally quite human-editable, but still very much structured with potentially arbitrary nesting.
You're right: Regular expressions are equivalent to finite state machines[1], which lack the infinite memory needed to handle arbitrarily nested structures [2]. If there is a depth limit, however, it is possible (but painful) to craft a regex to describe the situation. For example, suppose you have a language where angle brackets serve as grouping symbols, like parentheses usually do elsewhere [3]. Ignoring other characters, you could verify balanced brackets up to one nesting level with
and two levels with
Don't do this when you have better options.
---
[1] https://reindeereffect.github.io/2018/06/24/index.html
[2] As do any machines I can afford, but my money can buy a pretty good illusion.
[3] < and > are not among the typical regex metacharacters, so they make for an easier discussion.
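A minimal sketch of the depth-limited idea, assuming Python's `re` module. The patterns here are my own illustration and not necessarily the ones the comment had in mind; `pattern_for_depth` and `balanced` are hypothetical helper names. It assumes every character other than `<` and `>` has already been stripped.

```python
import re

def pattern_for_depth(n):
    """Build a regex that accepts balanced < > nested at most n levels deep.

    Each extra level wraps the previous pattern in one more bracket pair,
    so the expression grows with the depth limit.
    """
    p = ""  # depth 0: only the empty string is balanced
    for _ in range(n):
        p = f"(?:<{p}>)*"
    return re.compile(p)

def balanced(s, depth):
    """True if s (brackets only) is balanced and nested at most `depth` deep."""
    return pattern_for_depth(depth).fullmatch(s) is not None

# One level: "(?:<>)*", two levels: "(?:<(?:<>)*>)*"
print(pattern_for_depth(1).pattern)
print(pattern_for_depth(2).pattern)
print(balanced("<<><>><>", 2))  # nested two deep, within the limit
print(balanced("<<<>>>", 2))    # nested three deep, rejected
```

Every additional level makes the pattern longer and harder to read, which is the practical argument for a real parser once nesting is involved.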
They were meant to be analyzable in some ways. Count lines, extract headers, maybe sed-replace some words. But being able to operate/analyze over multiline strings was never a strong point of unix tools.
We could have called it something like extended markdown language and used a wicked acronym like eXMaLa for it.
Shame no such technology exists and never did.
markdown -> xhtml -> sxml -> logic (racket)
[0]: https://markdowndb.com/
As a preview, two specific cases I've seen:
1) In PRs, some companies like to have semi-structured metadata, like a link to a related ticket under a heading "Ticket". In mdq, you could find that using `# Ticket | [](^https://issues.acme.com/)`
2) Many projects ask people who submit bugs to check off whether they've searched for existing bugs. `- [x] I've looked in the bug tracker for existing bugs`
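For comparison, here is a rough sketch of what those two checks might look like hand-rolled in Python. This is not how mdq works internally; `PR_BODY`, `section`, `ticket_links`, and `has_checked` are hypothetical names, and the sample PR body is made up for illustration. It only understands top-level `# ` headings and inline links, and a `# ` line inside a fenced code block would confuse it.

```python
import re

# Hypothetical PR description, used only for illustration.
PR_BODY = """\
# Summary

Fixes the login redirect.

# Ticket

[ACME-123](https://issues.acme.com/ACME-123)

# Checklist

- [x] I've looked in the bug tracker for existing bugs
"""

def section(markdown, title):
    """Return the lines under a top-level '# title' heading, up to the next '# ' heading."""
    lines, inside = [], False
    for line in markdown.splitlines():
        if line.startswith("# "):
            inside = line[2:].strip() == title
            continue
        if inside:
            lines.append(line)
    return lines

def ticket_links(markdown):
    """Find inline link URLs under '# Ticket' that start with the tracker URL."""
    text = "\n".join(section(markdown, "Ticket"))
    return re.findall(r"\[[^\]]*\]\((https://issues\.acme\.com/[^)\s]*)\)", text)

def has_checked(markdown, label):
    """True if a task-list item with exactly this label is checked."""
    pattern = re.compile(
        r"^\s*[-*+] \[[xX]\] " + re.escape(label) + r"\s*$", re.MULTILINE
    )
    return pattern.search(markdown) is not None

print(ticket_links(PR_BODY))
print(has_checked(PR_BODY, "I've looked in the bug tracker for existing bugs"))
```

Even this small version needs a mini heading parser plus two regexes, which is roughly the brittleness the one-line queries above are meant to avoid.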
After trying a bunch of the usual ones, the only "notes system" I've stuck with is just a directory of markdown files that's automatically committed to git on any change using watchexec.
I've wanted to add a little smarts to it so I could use it to track tasks (e.g. sort, prune completed, forward incomplete tasks over to the next day's journal, collect tasks from "projects", etc.), so I started writing some Rust code using markdown-rs. Then, to round-trip Markdown with changes, only the JavaScript version of the library currently supports serializing GitHub Flavored Markdown. So I actually dumped the Markdown AST to JSON from Rust and picked it up in JS to serialize it for a proof of concept. That's about as far as I've got so far. But while markdown-rs saves position information, it doesn't save source token information (like, * and - are both list items), so you can't reliably round-trip.
FWIW, the other thing I was hoping to do was treat Markdown documents as trees (based on headings) and use an XPath kind of language to pull out sections. Anyway, will check out your code, thanks for posting.
The core assumption here is that Markdown was/is designed to be serializable to HTML. This is why a Markdown document/AST is mostly not a tree structure for tree-ish elements such as sub-sections. Instead, it is flat: an array of elements in order of appearance in the document. Apparently this most closely matches the structure of HTML, at both the block and inline levels. Only lists and blockquotes (afair) support nesting.
Ex: h1 -> paragraph -> h2 -> paragraph is not nested, it is an array of four ordered elements.
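To make the flat-versus-tree point concrete, here is a minimal sketch of turning that flat sequence into a tree keyed on heading level, and then following a path of heading titles through it. It is not taken from mdq or markdown-rs; `Section`, `build_tree`, and `find` are hypothetical names. It assumes ATX (`#`) headings only and ignores setext headings and fenced code blocks.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Section:
    level: int
    title: str
    body: list = field(default_factory=list)      # non-heading, non-blank lines
    children: list = field(default_factory=list)  # nested Section objects

def build_tree(markdown):
    """Nest the flat line sequence under headings, using a stack of open sections."""
    root = Section(level=0, title="")
    stack = [root]
    for line in markdown.splitlines():
        m = re.match(r"(#{1,6}) +(.*)", line)
        if m:
            level = len(m.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            node = Section(level, m.group(2).strip())
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            stack[-1].body.append(line)
    return root

def find(node, *path):
    """Follow a path of heading titles from node; return None if any step is missing."""
    for title in path:
        node = next((c for c in node.children if c.title == title), None)
        if node is None:
            return None
    return node

doc = "# Intro\nfirst\n## Details\nsecond\n# Other\nthird"
tree = build_tree(doc)
print([c.title for c in tree.children])
print(find(tree, "Intro", "Details").body)
```

The nesting here is purely a convention layered on top of the flat element list: the h2 becomes a child of the h1 only because the builder decides that a deeper heading level means containment.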
Anyway, you might throw a task at Cursor or Copilot to see how an equivalent implementation using html fares against your test suite, you may be able to develop more quickly.