It's a clever format, especially if the focus is on machines generating it and humans or machines reading it. It might even work for humans occasionally making minor edits without having to load the file into a spreadsheet.
I think it can encode anything except for something matching the regex `(\t+\|)+` at the end of cells (*Update:* Maybe `\n?(\t+\|)+`, but that doesn't change my point much) including newlines and even newlines followed by `\` (with the newline extension, of course).
For a cell containing `cell<newline>\`, you'd have:
|cell<tab>|
\\<tab >|
(where `<tab >` represents a single tab character regardless of the number of spaces)
Moreover, if you really needed it, you could add another extension to specify tabs or pipes at the end of cells. For a POC, two cells with contents `a<tab>|` and `b<tab>|` could be represented as:
|a<tab ><tab>|b
~tab pipe<tab>|tab pipe
(with literal words "tab" and "pipe"). Something nicer might also be possible.
*Update:* Though, if the focus is on humans reading it, it might also make sense to allow a single row of the table to wrap and span multiple lines in the file, perhaps as another extension.
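To make the claimed limitation concrete, here is a minimal Python sketch that flags cell contents ending in a run of tabs followed by a pipe. It only applies the regex quoted above; the function name and the use of `\Z` to mean "end of the cell" are my assumptions, and it says nothing about what a real parser for the format does.

```python
import re

# The pattern from the comment above, anchored at the end of the cell text.
# Anchoring with \Z is an assumption about what "at the end of cells" means.
TRAILING_DELIMITER = re.compile(r"(\t+\|)+\Z")


def ends_with_delimiter_run(cell: str) -> bool:
    """Return True if the cell ends in one or more tab(s)+pipe groups."""
    return TRAILING_DELIMITER.search(cell) is not None


# Cells from the examples above: only the ones ending in <tab>| are flagged.
flagged = [c for c in ["a\t|", "b\t|", "cell\n\\", "a|", "a\t|b"]
           if ends_with_delimiter_run(c)]
```

Here `flagged` ends up holding just the two cells from the POC, which are exactly the ones that would need the extra "tab pipe" extension.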
For multiline cell contents, there is rule 7, the multiline extension. Newlines are otherwise not allowed in cells because of rule 2: it's a line-based format.
I personally use it to write tabular data by hand, to define our data model. Because this format is editor agnostic, colleagues can easily read and edit it as well. So in my case the focus is on human read/write and machine read.
Also, news transmissions from agencies to newspapers or TV stations used (and maybe still use in some places) a format called IPTC 7901, which also makes use of the SOH, STX, ETX and EOT codes:
This stems from them coming via a serial wire (which is why news updates are also called “wires” in that context) to a TTY.
(Nowadays, you’d have a server receiving everything over the Internet and spitting it out in this format via a serial port or Telnet connection if needed.)
According to Wikipedia, fancier news messages are possible using some more codes, but I’ve never seen them in the wild in recent years:
Is there a text format like TSV/CSV that can represent nested/repeating sub-structures?
We have YAML, but it's too complex. JSON is rather verbose with all the repeated keys and quoting, XML even more so. I'd also like to see a 'schema tree' corresponding to a header row in TSV/CSV. I'd even be fine with a binary format with standard decoding to see the plain-text contents. Something for XML like what MessagePack does for JSON would work, since we already have schema specifications.
Every textual data format that is not originally S-expressions eventually devolves into an informally-specified, bug-ridden, slow implementation of half of S-expressions.
I have been using TSV a LOT lately for batch inputs and outputs for LLMs. Imagine categorizing 100 items. Give it a 100-row TSV with an empty category column, and have it emit a 100-row TSV with the category column filled in.
It has some nice properties:
1) it’s many fewer tokens than JSON.
2) it’s easier to edit prompts and examples in something like Google Sheets, where the default format of a copied group of cells is TSV.
3) have I mentioned how many fewer tokens it is? It’s faster, cheaper, and less brittle than a format that requires the redefinition of every column name for every row.
Obviously this breaks down for nested object hierarchies or other data that is not easily represented as a 2D table, but otherwise we’ve been quite happy. I think this format solves some other things I’ve wanted, including header comments, inline comments, better alignment, and markdown support.
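A rough Python sketch of that batch workflow, for anyone who wants to try it. There is no LLM call here: `reply` is a made-up stand-in for what a model might send back, and the helper names are hypothetical. The size comparison at the end counts characters, not tokens, so treat it only as a hint at the savings.

```python
import json


def to_tsv(header, rows):
    """Serialize rows as plain TSV; refuses cells that would need escaping."""
    lines = []
    for row in [header] + rows:
        for cell in row:
            if "\t" in cell or "\n" in cell:
                raise ValueError(f"cell needs escaping: {cell!r}")
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"


def from_tsv(text):
    """Parse plain TSV back into (header, rows)."""
    if text.endswith("\n"):
        text = text[:-1]  # drop only the final line terminator
    header, *rows = (line.split("\t") for line in text.split("\n"))
    return header, rows


def check_reply(items, reply_text):
    """Reject replies whose first column no longer matches the request."""
    header, rows = from_tsv(reply_text)
    if [row[0] for row in rows] != items:
        raise ValueError("reply rows do not match the request")
    return rows


items = ["apple", "carrot", "salmon"]
request = to_tsv(["item", "category"], [[item, ""] for item in items])

# Hypothetical model output, written by hand for this example.
reply = "item\tcategory\napple\tfruit\ncarrot\tvegetable\nsalmon\tfish\n"
rows = check_reply(items, reply)

# The same data as JSON objects repeats every column name on every row.
as_json = json.dumps([{"item": i, "category": c} for i, c in rows])
```

Checking that the reply has the same items in the same order is the cheap guard that makes the round trip less brittle; the column names appear once in the TSV and once per row in `as_json`.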
Yeah :) Though I can think of two reasons why: it's not typable for most people on a keyboard, and most programs are not designed to deal with it, or render it properly in an aligned way, like tab characters.
Or we could use the actual characters for this purpose - the FS (file separator), GS (group separator), RS (record separator), and US (unit separator).
ASCII (and through it, Unicode) has these values specifically for this purpose.
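As a sketch of what that looks like in practice, here is a tiny Python encoder/decoder that puts US (0x1F) between cells and RS (0x1E) after each row. The function names and the choice to use RS as a terminator rather than a separator are my assumptions, not any standard.

```python
US = "\x1f"  # unit separator: between cells
RS = "\x1e"  # record separator: after every row


def dumps(rows):
    """Encode rows of strings; tabs and newlines need no escaping."""
    parts = []
    for row in rows:
        if not row:
            raise ValueError("rows need at least one cell")
        for cell in row:
            if US in cell or RS in cell:
                # Without an escaping rule, these cells cannot be stored.
                raise ValueError("cell contains a separator character")
        parts.append(US.join(row) + RS)
    return "".join(parts)


def loads(text):
    """Decode text produced by dumps back into rows of strings."""
    if text and not text.endswith(RS):
        raise ValueError("missing final record separator")
    records = text.split(RS)[:-1]  # the piece after the last RS is empty
    return [record.split(US) for record in records]


table = [["name", "note"], ["tab\tinside", "line one\nline two"]]
round_tripped = loads(dumps(table))
```

Tabs and newlines inside cells survive untouched, which is the whole appeal. A cell that itself contains US or RS is rejected, since storing it would require an escaping rule.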
I don't think popularizing these ASCII characters would solve the problem in its entirety.
If RS and US were in common use, there would be a need to have a visible representation for them in the terminal, and a way to enter RS on the keyboard. Pretty soon, strings that contain RS would become much more common in the wild.
Then, one day somebody would need to store one of those strings in a table, and there would be no way to do so without escaping.
I do think that having RS display in the terminal (like a newline followed by some graphic?) and using it would be an improvement over TSV's use of newline for this purpose, but considering that it's not a perfect solution, I can understand why people are not overly motivated to make this happen. The time for this may have been 40+ years ago, when a standard for how to display or type it would have been feasible to agree upon.
I did an ETL project for an ERP system that used these separators years ago. It was ridiculously easy because I didn't have to worry about escaping. Parsing was an easy state machine.
Notepad++ handles the display and entry of these characters fairly easily. I think they're nowhere near as unergonomic as people say they are.
I’m pretty sure part of the intent is that it should be easy to write (type) in this format. Separator characters are not that. Depending on the editor, they’re not especially readable either.
I like that there is plenty of room for comments, and the multiline extension is also cool. The backslash almost looks like what I would write on paper if I wanted to sneak something into the previous line :)
https://www.iptc.org/std/IPTC7901/1.0/specification/7901V5.p...
https://en.wikipedia.org/wiki/IPTC_7901#C0_control_codes
https://en.wikipedia.org/wiki/ASCII#Character_groups
https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C0_con...
> JSON is rather verbose with all the repeated keys and quoting, XML even more so.
But CSV represented as JSON is usually accomplished like so:
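The example that followed did not survive here. One common shape (my assumption, not necessarily what was posted) is an array of arrays with the header as the first row, so keys are not repeated. A small Python sketch, with a hypothetical helper name:

```python
import json


def records_to_table(records):
    """Turn a list of same-keyed dicts into [header, row, row, ...]."""
    if not records:
        return []
    header = list(records[0].keys())
    return [header] + [[record[key] for key in header] for record in records]


records = [
    {"id": 1, "name": "apple", "category": "fruit"},
    {"id": 2, "name": "carrot", "category": "vegetable"},
]

verbose = json.dumps(records)                    # keys repeated on every row
compact = json.dumps(records_to_table(records))  # keys appear once, up front
```

With this layout the JSON stays valid and typed, and the per-row overhead drops to brackets, commas, and quotes.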
> If RS and US were in common use, there would be a need to have a visible representation for them in the terminal, and a way to enter RS on the keyboard.
Both are already possible; they have official symbols representing them.
> Then, one day somebody would need to store one of those strings in a table, and there would be no way to do so without escaping.
Why? But also, yes, escaping exists, just like in the alternative formats.