I like the idea of using select/put/delete (SQL-style syntax) to query non-RDB data storage. It sort of raises the question of, could there be 1 universal language to query relational databases, text file storage (JSON, CSV, etc), and anything else?
Or put another way, is there any data storage format that couldn’t be queried by SQL?
That's basically SQL. Many SQL systems have lots of built in connectivity to various data sources.
DuckDB is a good example of a (literally) serverless SQL-based tool for data processing. It is designed to be able to treat the common data serialization formats as though they are tables in a schema [1], and you can export to many of the same formats. With extensions, you can also connect to relational databases as foreign tables.
This connectivity is a big reason it has built a pretty avid following in the data science world.
> Or put another way, is there any data storage format that couldn’t be queried by SQL?
Is your SQL Turing-complete? If yes, then it could query anything. Whether or not you'd like the experience is another thing.
Queries are programs. Querying data from a fixed schema is easy. Hell, you could make a "universal query language" by just bolting together this dasel with SQL and Cypher, so you'd use the relevant facet when querying a specific data source. The real problem starts when your query structure isn't fixed - where what data you need depends on what the data says. When you're dealing with indirection. Once you start doing joins or conditionals or `foo[bar['baz']] if bar.hasProperty('baz') else 42` kind of indirection, you quickly land in the Turing tarpit[0] - whatever your query language is, some shapes of data will be super painful for it to deal with. Painful, but still possible.
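As a concrete illustration, here's roughly that `foo[bar['baz']]` indirection expressed in jq (a sketch; the data is made up):

```shell
echo '{"foo":{"x":7},"bar":{"baz":"x"}}' \
  | jq 'if (.bar | has("baz")) then .foo[.bar.baz] else 42 end'
# -> 7  (falls back to 42 when bar has no "baz" key)
```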
> It sort of raises the question of, could there be 1 universal language to query relational databases, text file storage (json, csv, etc), and anything else.
Sure there could be -- any Turing-complete language (which SQL is) can query anything.
But the reason we have different programming languages* is because they have different affordances and make it easy to express certain things at the cost of being less convenient for other things. Thus APL/Prolog/Lisp/C/Python can all coexist.
SQL is great for relational databases, but it's like commuting to work in a tank when it comes to key-value stores.
* and of course because programmers love building tools, and a language is the ultimate tool.
> could there be 1 universal language to query relational databases, text file storage (json, csv, etc), and anything else.
> Or put another way, is there any data storage format that couldn’t be queried by SQL?
We created PLDB.io (a Programming Language DataBase) and have studied nearly every language ever created and thought about this question a lot.
Yes, there could be 1 language to query everything, but there will always be a better DSL more relevant for particular kinds of data than others. It's sort of like how with a magnifying glass you can magnify anything, but if you want to look at bacteria you're going to want a microscope (and you wouldn't want a microscope to study an elephant).
Now it may turn out that there is 1 universal syntax that works best for everything (I'm sure people can guess what I would say), but I can't think of a case where you wouldn't want to have a DSL with semantics evolved to match a particular domain.
There are a lot of differences between storage formats. It would be incredibly difficult to create a universal query language. It would need to either a) change the storage formats so much that they're not really following their original standard, or b) create so many different versions of the query language that it's not really one standard.
Off the top of my head: SQL can't do lists as values and doesn't have simple key-value storage; JSON doesn't have tables or primary/foreign keys, and can have nested data.
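That said, SQLite's JSON functions can bridge part of the gap, storing a list as a JSON value and exploding it on demand (a sketch; the table and data are made up):

```shell
sqlite3 :memory: "
CREATE TABLE t(k TEXT, v TEXT);
INSERT INTO t VALUES ('a', '[1,2,3]');
SELECT value FROM t, json_each(t.v) WHERE k = 'a';"
# prints 1, 2, 3 on separate lines
```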
> Or put another way, is there any data storage format that couldn’t be queried by SQL?
Depends on how keen you are on pure SQL. For example, Postgres and SQLite have JSON extensions, but they also enhance the syntax for it. Similar can be done for all other formats too, but this means you need to learn special syntax and be aware of the storage format for every query. This is far off from a real universal language.
From what I understand SQL is or at least can be made Turing complete so in that sense you should be able to query any data store using it. However, that doesn’t mean it will be efficient to do so.
I suspect for most data structures you could construct an index to make querying faster. But think about querying something like a linked list: it is not going to be too efficient without an index but you should still be able to write an engine that will do so.
If you have something like a collection of arbitrary JSON objects without a set structure you should still be able to express what you are trying to do with SQL because Turing completeness means it can examine the object structure as well as contents before deciding what to do with it. But your SQL would look more like procedural code than you might be used to.
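For instance, SQLite's JSON functions can branch on the shape of each element before deciding what to extract (a sketch with made-up data):

```shell
sqlite3 :memory: "
SELECT CASE json_type(value)
         WHEN 'object' THEN json_extract(value, '\$.name')
         ELSE 'scalar: ' || value
       END
FROM json_each('[{\"name\":\"a\"}, 42]');"
# -> a
# -> scalar: 42
```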
> It sort of raises the question of, could there be 1 universal language to query relational databases...
Even if SQL and/or another query language could be Turing-complete, that doesn't mean that you can have 1 universal language to perform all possible queries in an efficient way. In basic computer science terms that means that your data structure is linked with the queries, and efficiency you want to achieve, and ad-hoc changes should be created for specific problems.
Tree and graph structures can be queried using SQL (with more or less difficulty depending on how you have chosen to encode and index them), but it's not a particularly simple and straightforward language to use for such a task.
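For the tree case, the usual tool is a recursive CTE; a minimal sketch in SQLite over a made-up parent-pointer table:

```shell
sqlite3 :memory: "
CREATE TABLE node(id INTEGER, parent INTEGER);
INSERT INTO node VALUES (1,NULL),(2,1),(3,1),(4,2);
WITH RECURSIVE sub(id) AS (
  SELECT 1
  UNION ALL
  SELECT n.id FROM node n JOIN sub ON n.parent = sub.id
)
SELECT id FROM sub ORDER BY id;"
# -> 1, 2, 3, 4  (the subtree rooted at node 1)
```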
interesting! that's a mathematically provable statement? or do you mean, most data can be represented as a graph? (I'm not asking antagonistically; I'm genuinely interested in the statement you made)
If entries can be relations themselves it is not possible afaik. For example
User | Telephone Numbers
-----+------------------
A | 123, 456 <- not atomic; more than 1 number (i.e. a set)
B | 789
Now there are academic operators to convert to and from a purely relational system, but I don't think they are implemented/in the standard. I forgot what they are called, however.
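One standard workaround is to normalize the set-valued column into one row per value and rebuild the sets with aggregation when needed; a sketch in SQLite (table and names made up):

```shell
sqlite3 :memory: "
CREATE TABLE phone(user_name TEXT, number TEXT);
INSERT INTO phone VALUES ('A','123'),('A','456'),('B','789');
SELECT user_name, group_concat(number)
FROM phone GROUP BY user_name ORDER BY user_name;"
# one row per user, numbers re-aggregated (e.g. A|123,456);
# note group_concat's element order is technically unspecified
```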
In general you don't want a universal query language. Depending on the shape of the data you want different things to be easily expressible. You can, for example, express queries on tree-shaped data with SQL (see the XPath accelerator), but it is quite cumbersome and its meaning is lost to the reader. I.e.: it's fine when computer-generated, but there is too much noise for a human to read/write themselves.
I'd be glad to be proven wrong here, but as time has shown, there is no one size fits all for programming languages. The requirements for different applications just vary too much.
I'm abusing `git config -f $PATH` as a poor man's portable INI parser/modifier in shell scripts, works pretty well but imposes some Git-specific syntax limitations.
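For anyone curious, the trick looks roughly like this (`settings.ini` and the key names are made up):

```shell
# git config treats any INI-ish file as a [section] key = value store
git config -f settings.ini db.host localhost    # set (creates the file)
git config -f settings.ini db.host              # get -> prints: localhost
git config -f settings.ini --unset db.host      # delete the key again
```

The Git-specific limitations: keys must be `section.key` shaped, and values are rewritten in Git's own formatting.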
jq is such an underrated tool: it is a fully Turing-complete functional language in and of itself. It opens up an ocean of possibilities in combination with scripting languages (Python has bindings), and it is ungodly fast, chewing through tons of JSON like it's nothing.
It is a bit perl-ish, but being pure and functional it is a little easier to reason about when you have to revisit your queries.
PS I am certainly bookmarking your tool as well =]
There is also amusing project jqjq that implements jq in jq itself that I love to point folks at to show how expressive the language is: https://github.com/wader/jqjq
Neat; seems about every quarter or so one of these types of tools is highlighted here.
Awaiting all the responses from people to show off or list what tool they've landed on to support their specific use cases; I always learn a lot from these.
Personally I think this is a problem better solved by fixing the shell. There are a few alt shells out there now: Nushell, Elvish, plus the one I help maintain, Murex (https://murex.rocks).
I’m obviously going to be biased here, but it’s definitely worth your time checking out some alt shells.
I did not know about Murex. This actually looks pretty cool! Or at least a definite improvement. How is it regarding using it as a scripting language in lieu of Bash?
The last shell I was intrigued by was es-shell which, despite being old, is still being updated and uses functional semantics while still looking like a shell language. I had ChatGPT generate a comparison of all these with Bash (take with a grain of salt, I already had to correct at least one thing):
I'm a bit confused as to the use case. Is it just a way to interact with json/yaml style documents as if they were a structured database, but from the command line? Kind of an in-between for those moments you don't want to write a quick script to batch modify files?
It looks really well done; I think I'm just failing to see how this is more beneficial than just opening a single file in the editor and making changes, or writing a quick functional script so you have the history of the changes that were made to a batch of files.
If someone could explain how I could (and why I should) add a new tool to my digital toolbelt, I'd greatly appreciate it.
I use jq for this kind of thing several times a week. It’s great for piped data - things like running curl to fetch JSON, then piping it through to reformat it in different ways.
Here’s a jq expression I used recently to turn a complete GitHub Issues thread into a single Markdown document:
curl -s "https://api.github.com/repos/simonw/shot-scraper/issues/1/comments" \
| jq -r '.[] | "## Comment by \(.user.login) on \(.created_at)\n\n\(.body)\n"'
I use this pattern a lot. Data often comes in slightly the wrong shape - being able to fix that with a one-liner terminal command is really useful.
PowerShell[0]'s built-in Microsoft.PowerShell.Utility[1] module has `ConvertFrom-...` and `ConvertTo-...` commands for JSON, CSV, a version of XML (CliXML), and custom key/value pairs (StringData), turning them into objects which can then be manipulated. Combined with IO cmdlets from the built-in Microsoft.PowerShell.Management[2] module such as `Get-Content` and `Set-Content`, a fair chunk of flat-file storage can be made mutable with PowerShell.
I used yq last week to scan through all the Java projects (i.e. Maven pom.xml files) within our org to check which ones inherit from the corporate pom.
For things that are mostly shell scripts and things in a similar family (Ansible playbooks, deployment pipelines etc.) and where you need to modify a structured file quickly, it's usually much faster to use the DSL provided by the tool than calling out to various scripts to extract or modify a single JSON key.
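e.g. a one-liner to bump a single key with jq rather than shelling out to a full scripting language (the file name and key are made up):

```shell
printf '{"name":"app","version":"1.0.0"}' > app.json
jq '.version = "2.0.0"' app.json > app.json.tmp && mv app.json.tmp app.json
jq -r '.version' app.json    # -> 2.0.0
```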
People often say that they'd prefer to write their shell scripts in Python or even Go these days, but the problem there is that the elements of structured programming make the overall steps difficult to follow. Typically, in use cases adjacent to shell scripts, the point is to be able to see what a script is doing without any abstraction.
the in-between mode that you mention but seem to dismiss is the way most traditional unixheads work with data most of the time: from the command line
editor? when i pull up emacs, 50% of the time it's to write emacs macros, and i do that because shell scripts don't easily go backward in the stream. (something rarely mentioned about teco was that it was a stream editor that would chew its way forward through files; you didn't need the memory to keep it all in core, and it could go backward within understandable limits)
writing an actual shellscript is only for when it's really hairy, you are going to be repeating it and/or you need the types of error handling that cloud up the clarity of the commandline
the commandline does provide rudimentary "records" in the saved history
one benefit (idk if it applies here) is if the select/put/delete statements didn’t require loading the data in memory; so you could query massive data files with limited RAM and not have to solve that problem yourself for each data storage format you’re working with
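jq has a mode along these lines: `--stream` emits [path, value] events instead of materializing the whole document, so huge files can be filtered with bounded memory. A tiny sketch:

```shell
# select(length == 2) keeps only the leaf events ([path, value] pairs)
echo '{"a":1,"b":[2,3]}' | jq -c --stream 'select(length == 2)'
# [["a"],1]
# [["b",0],2]
# [["b",1],3]
```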
I maintain dasel on Debian [1], and it's such a nice tool! I don't use most of its advanced capabilities, but being able to convert from/to any of those formats with a single tool is very useful.
Cool project -- but we need a standardized/spec'd query language in order to realize the goals in the "one tool to rule them all" section of this readme.
I have a hard time internalizing the jq query syntax, and am not overly excited to invest in learning all the quirks when it's not based on a widely-adopted open standard. Maybe `JMESPath` could be the way forward.
Sometimes `gron` can be a pretty great alternative approach, depending on your use case. At least it is very intuitive and plays nicely with other tools.
Ultimately JSON, TOML, YAML, XML, and properties files are all tree structures, and XPath-type syntax should roughly apply to them all, along with the hundred-odd "path expression" languages (Java has SpEL, Velocity, JSP EL, OGNL, and probably dozens of others).
XPath, although it had some clunky artifacts from XML (like namespaces, which were part of why we moved away from XML... ugh), was basically the apex of expression/path/navigation capabilities. It would be really nice to see XPath ported to a general nav language that is supported by all programming environments and handles all the relevant formats.
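Even today plain xmllint gets you a fair bit of that for XML (a sketch; the document is made up):

```shell
echo '<config><db host="db1"/><cache host="redis1"/></config>' \
  | xmllint --xpath 'string(//db/@host)' -
# -> db1
```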
I still like Xidel[0] for this reason; it may be a little older, but for a CLI scraper a lot of data transformations needs can be satisfied with Xpath/XQuery.
Speaking of trees, gron/ungron is an amazing transformer that allows one to use any query tool on the leaves of the tree and then turn the flattened structure back into a document (json).
I'd love to see gron/ungron implemented for all tree structures.
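In the meantime, a rough gron-style flattening can be faked in jq for JSON at least (a sketch; real gron also emits the intermediate `json.a = {};` lines and has an ungron inverse):

```shell
echo '{"a":{"b":[1,2]}}' | jq -r '
  paths(scalars) as $p
  | "json.\($p | map(tostring) | join(".")) = \(getpath($p))"'
# json.a.b.0 = 1
# json.a.b.1 = 2
```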
[1] https://duckdb.org/docs/data/overview
[2] https://duckdb.org/docs/extensions/json#json-importexport
[3] https://duckdb.org/docs/extensions/postgres
--
[0] - https://en.wikipedia.org/wiki/Turing_tarpit
I would really like to find a good workflow for idempotent modifications to INI files, but haven't stumbled across one yet.
https://gist.github.com/pmarreck/b7bd1c270cb77005205bf91f80c...
[0]: https://microsoft.com/powershell
[1]: https://learn.microsoft.com/powershell/module/microsoft.powe...
[2]: https://learn.microsoft.com/powershell/module/microsoft.powe...
It's exactly a quick functional script.
[1]: https://tracker.debian.org/pkg/dasel
[0]: https://github.com/benibela/xidel
https://github.com/tomnomnom/gron
https://github.com/dbohdan/structured-text-tools
In fact it's already on it 6 times.
Sometimes we don’t actually want to parse YAML, we just want to mutate it without needing to model the underlying objects.
Being able to select, replace, and add data in an existing YAML document is a huge win for automation.
I wrote about this a little in https://www.bbkane.com/blog/go-project-notes/#scripting-chan... and it's really helped me keep my GitHub workflows and various config files in sync across project repos