Pinus · 3 years ago
CSV looks deceptively simple. It is far too easy to just write(','.join(whatever)), which sort of works, until it doesn’t, and then someone, sometimes I, has to sort out the resulting mess. PLEASE use a proper CSV library (Python comes with a CSV module in the standard library), or at least implement the entire format according to the RFC from the outset, even if you think you won’t need it!
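For illustration, a minimal sketch with Python's standard-library csv module (the field values are made up) showing where the naive join goes wrong:

    import csv
    import io

    row = ["42", "Acme, Inc.", 'says "hi"']

    # Naive join: the embedded comma silently produces a fourth column.
    print(",".join(row))           # 42,Acme, Inc.,says "hi"

    # csv.writer quotes and escapes as needed.
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    print(buf.getvalue(), end="")  # 42,"Acme, Inc.","says ""hi"""
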
chaps · 3 years ago
Oh yes. CSVs are deceptively challenging, especially if your use case is getting from Excel files to CSV. Excel will happily convert a worksheet to CSV, but it's a naive conversion. Headers that start on line 3, multi-line headers, inconsistent column counts, etc. It adds up really quickly!
zem · 3 years ago
I've also run into issues where I wrote some code that worked with CSV input and told users they could just export their data from Excel. Turns out Excel doesn't export UTF-8 by default; we had some weird issues until we figured that out.
kasajian · 3 years ago
This is a matter of developer education. The correct way to create and parse CSV files is to use a third-party library. They can get complicated. A field in a CSV can contain commas and quotes. In some cases, a single field can contain a line feed, and you'll need to ensure the parser you use supports that. This would allow an entire CSV file to be embedded inside a single field of another CSV file. At a minimum, a parser must support Excel's default parsing logic.

But if you pick the right parser and generator, then you're OK using it.
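As a quick sketch of that point (Python's standard csv module; the values are invented for illustration), a field containing commas, quotes, and a line feed still round-trips cleanly:

    import csv
    import io

    rows = [["id", "note"], ["1", 'line one\nline "two", with a comma']]

    buf = io.StringIO()
    csv.writer(buf).writerows(rows)       # the tricky field gets quoted and escaped

    buf.seek(0)
    assert list(csv.reader(buf)) == rows  # commas, quotes and the newline all survive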

Macha · 3 years ago
Right, but if you're picking CSV, you likely expect to interoperate with a provider that's not yourself. And then there's no way a parser can handle all CSV formats in the wild.

e.g. the example from my comment on the last CSV discussion (https://news.ycombinator.com/item?id=28223719)

What variant is this:

    1,5,Here is a string "" that does stuff,2021-1-1
What is the value of the third column?

Is this a CSV file without quoting? Then it's

    Here is a string "" that does stuff
Or is it a CSV file with double quote escaping? Then it's

    Here is a string " that does stuff
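For what it's worth, a small sketch of how Python's standard csv module reads that line with its default dialect (quotes inside an unquoted field are kept literally); a parser that unescapes doubled quotes unconditionally would disagree:

    import csv
    import io

    line = '1,5,Here is a string "" that does stuff,2021-1-1\n'
    print(next(csv.reader(io.StringIO(line)))[2])
    # -> Here is a string "" that does stuff
    # A consumer that always collapses "" to " would instead report:
    # -> Here is a string " that does stuff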

brundolf · 3 years ago
Yeah, but this is less of a problem if it's an internal API. You can stick to a stricter subset of the standard, and/or only handle the types of column values that you actually need

Still probably worth using a library, but it isn't a source of problems in my experience

ndsipa_pomu · 3 years ago
As much as I like and use CSV for database work, it has a problem with being poorly specified. The most common problems arise when processing CSVs produced elsewhere, which might not enclose text fields in quotes and thus break on data that includes commas or embedded newlines.
gregmac · 3 years ago
There is a spec (RFC 4180 [1]) but it's definitely not widely followed. Worse, for a lot of data there are no problems for potentially years, until your numbers get too big or a quote or comma gets into the data for the first time.

In my experience one of the biggest barriers I run into -- and the primary reason I hate using CSV -- is Microsoft Excel. It misinterprets numbers as dates, it converts big numeric identifiers to exponents, and more. Even merely opening an RFC 4180-compliant file and saving it changes the data, and even Excel itself will often have a different misinterpretation of the same file.

If humans never used Excel for CSV, it would be a viable format. At the same time, in most cases where humans aren't in the loop (machine-to-machine communication), there are better formats. You could spec "RFC 4180 CSV" and hope no developer just sees the "CSV" and assumes they understand. Or specify something like a JSON streaming format and avoid a whole lot of headache.

[1] https://www.ietf.org/rfc/rfc4180.txt

gregw2 · 3 years ago
Regarding excel and rfc4180…

I avoided CSV for quite a while because I had Excel-vs-CSV compatibility concerns like this.

However, when I tested this for myself a few years back, Excel output to my surprise was rfc4180 or darn near it (it might use CRLF rather than LF?) It emitted commas and quotes the same way as the rfc for all the test cases I checked.

That said, I agree with you that Excel is problematic as an input source. Usually the problems are the humans who touch the data in Excel, but what I've found is that the automation problems tend to come from Excel parsing and interpreting incoming data (before it ever goes to CSV): exponents, trimmed leading zeros, etc., as you say. But if the data is confirmed good in Excel before being exported, the CSV it emits is decent.

Counterexamples welcome.

CharlesW · 3 years ago
Yes, I feel like this would've been more helpful generalized as "Consider DSV" (delimiter-separated values) than CSV specifically, because of the interop issues that often come up. I'd have also mentioned using Parquet.
mason55 · 3 years ago
I generally find Avro to be a better replacement for CSV than Parquet. It’s a better drop-in for the typical CSV use case of “process this file row by row”.

Parquet is great, don’t get me wrong.

scrollaway · 3 years ago
Parquet has the opposite problem of CSV, though. It's so complex to work with that, unless you're specifically in data science, it's both unheard of and unusable.

To read a parquet file in Python, you need Apache Arrow and Pandas. And literally the second result for "parquet python libraries" is an article titled "How To Read Parquet Files In Python Without a Distributed Cluster".

I remember dealing with Parquet files for a job a while back and this same question came up: why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to CSV/JSON or read rows? Is it a limitation of the format itself?
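For reference, a minimal sketch of reading a Parquet file with pyarrow alone, no pandas and no cluster (the file name is hypothetical):

    import pyarrow.parquet as pq

    table = pq.read_table("example.parquet")  # hypothetical file
    data = table.to_pydict()                  # dict of column name -> list of values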

fatneckbeardz · 3 years ago
But which delimiter?

If you choose pipe, OK, now you have to make sure nobody typed a pipe into the input field or spreadsheet, and you cannot store Unix commands.

If you choose tab, OK, now people will get confused when they try to edit the text file to replace tabs with spaces, and now you have trouble putting code snippets into data fields because they have tabs.

This is the problem, and it's why XML/JSON exist.

In my particular domain, tab-separated works pretty well, but in the general context of the world at large, I feel like JSON has reasons it exists.

fbdab103 · 3 years ago
In the context of an API, so long as you're using a "real" CSV library, a lot of those inconsistencies do not appear. Problems happen when you have to interface with humans and desktop software (i.e. Excel), which has its own rules.
hermitcrab · 3 years ago
Yes, CSV is superior to JSON for tabular data, but it has its own issues. One issue is that the standard is not consistently applied. Another is that the approach to escaping makes it hard to parse a CSV file with multiple threads. You have to parse the entire file before you can be sure which " characters escape other characters.

I wrote an article about tabular formats and their strengths and weaknesses here: https://successfulsoftware.net/2022/04/30/why-isnt-there-a-d...

The resulting HN discussion is here: https://news.ycombinator.com/item?id=31220841

_dain_ · 3 years ago
>CSV is superior to JSON for tabular data

Can't you just do this?

  {
    "columns": ["col1", "col2", "col3"],
    "data": [
               [1,      2,      3],
               [4,      5,      6],
               [7,      8,      9]
    ]
  }
That's valid JSON but it's human-readable and human-editable rows of comma-separated data, just like CSV.

II2II · 3 years ago
> The most common problems are when processing CSVs produced elsewhere [...]

The limitations of CSV are certainly worth considering and, in the instances you mentioned, it may not be worth using CSV. (If you are going to be using a more complex parser anyway, you may as well use a format that is better defined and where you are less likely to encounter edge cases.) That being said, there remain many cases where CSV is far more efficient and far less error-prone.

ndsipa_pomu · 3 years ago
It tends to be the lowest common denominator, or a choice between CSV and Excel documents, which are trickier to automate.
Kamq · 3 years ago
Which works until you have a quote in your data. It also prevents chunking up the file to process in parallel, since you could have newlines between the quotes.

What I've found to work well is to just % encode your delimiter, the new line character, and the '%' character. Basically every language has utilities for this.
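Roughly, a sketch of that scheme with a comma delimiter (only '%', the delimiter, and newline need escaping, and '%' must be handled first so replacements aren't double-escaped):

    def encode_field(value: str) -> str:
        # Escape '%' first so its replacement isn't re-escaped.
        return value.replace("%", "%25").replace(",", "%2C").replace("\n", "%0A")

    def decode_field(value: str) -> str:
        # Reverse of encode_field, applied in the opposite order.
        return value.replace("%0A", "\n").replace("%2C", ",").replace("%25", "%")

    row = ["8fe96b88", "100% done, more\nto come"]
    line = ",".join(encode_field(f) for f in row)
    assert [decode_field(f) for f in line.split(",")] == row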

Doesn't solve the issue with accepting outside files though. You have to be pessimistic with those regardless.

majkinetor · 3 years ago
With gzip on the web server, the difference is not important at all.

CSV in general is problematic as there is no standard (RFC 4180 is not one). In certain contexts this surely can be a good solution, but it's definitely not good in the general scenario.

CharlesW · 3 years ago
As Wikipedia puts it, "CSV is widely used to refer to a large family of formats that differ in many ways". If there's a canonical standard, it appears to be RFC4180: https://www.rfc-editor.org/rfc/rfc4180
majkinetor · 3 years ago
It appears to be, but it's not. I have not found a single program so far that conforms to this RFC and nothing else.

From the RFC itself:

   Status of This Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

briHass · 3 years ago
That was my first thought: JSON is highly amenable to compression, precisely because of the repetition this blog is complaining about. It's a good lesson for junior devs: if you find yourself thinking about saving bits and bytes with custom protocols, you need to pull out of the rabbit hole and find the existing solution to your problem.

Sure, for a local data file or something where it's nice to be human-readable-ish, CSV can be a better choice than JSON (assuming you use a library for all the edge cases and string escapes). If you really want a super-small and fast serialization, that's what protobuf is for.
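A rough sketch of that compression point in Python (the row shape is invented; exact numbers will vary with the data):

    import gzip
    import json

    rows = [{"productId": f"{i:012x}", "quantity": i % 5, "customerId": f"{i:08x}"}
            for i in range(10_000)]

    raw = json.dumps(rows).encode("utf-8")
    print(len(raw), len(gzip.compress(raw)))  # the repeated keys mostly vanish under gzip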

thangalin · 3 years ago
CSV is also great for importing external data into documents. My text editor, KeenWrite[0], includes an R engine and a CSV-to-Markdown function[1]. This means you can write the following in a plain text R Markdown document:

    `r#csv2md('filename.csv')`
The editor will convert Markdown to XHTML in the preview panel (in real time), then ConTeXt can typeset the XHTML into a PDF file in various styles.[2][3] This avoids spending time fighting with table formatting/consistency in certain word processors while storing the data in a machine-friendly format. (Thereby upholding the DRY principle because the data can have a single source of truth, as opposed to copying data into documents, which could go stale/diverge.)

Using JSON would be possible, but it's not as easy to convert into a Markdown table.

[0]: https://github.com/DaveJarvis/keenwrite

[1]: https://github.com/DaveJarvis/keenwrite/blob/main/R/csv.R#L3...

[2]: https://i.ibb.co/6FLXKsD/keenwrite-csv.png

[3]: https://i.ibb.co/47h6zNx/keenwrite-table.png

gugagore · 3 years ago
The only reason, in my eyes, to use CSV is to have easy interoperability with spreadsheet software.

If you want streaming: https://jsonlines.org/

Karellen · 3 years ago
JSON lines looks kinda interesting, but the newline-delimited thing seems weird.

It seems to me that you could write a JSON streaming parser that, if the outer element is an Array, reads and outputs/processes one Array element at a time as its JSON value. Yeah, you can't get the array length, and if there's a parse error somewhere down the line then you have to figure out how to deal with that (a non-streaming parser would have rejected the whole input), but that's kind of inherent in using a streaming parser. The upside is that you can work with any valid JSON.
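That is roughly what incremental parsers offer; a sketch using the third-party ijson library (file name and handler are hypothetical):

    import ijson

    with open("big_array.json", "rb") as f:
        # "item" addresses each element of a top-level JSON array,
        # yielded one at a time without loading the whole document.
        for record in ijson.items(f, "item"):
            handle(record)  # hypothetical per-record handler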

Sure, if you're interoperating with shell tools, and don't have `jq` available, newline-delimited JSON might be helpful. But on the other hand, just install `jq`, dummy!

majkinetor · 3 years ago
Meh.

Excel compatibility really sucks, and Excel is by far the most used one. You can't really double-click a CSV effectively, as everything will be shown as a generic type, so you have to mess with the import wizard, which is also half-baked. I have to create tutorials for that for each service using it.

elcritch · 3 years ago
Sometimes CSV is nicer. Still, you can cut down on your JSON by formatting it with a similar header-plus-rows style:

    [
      ["productId", "quantity", "customerId"],
      ["5710031efdfe", 1, "8fe96b88"],
      ["479cd9744e5c", 2, "526ba6f5"]
    ]
This style also works well with the jsonlines format a sibling comment mentioned. Of course my favorite is MessagePack (or CBOR) using similar styles. MsgPack can be as small as gzipped JSON. :)
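A quick sketch of the MessagePack version (assuming the msgpack Python package); the same header-plus-rows layout packs and unpacks directly:

    import msgpack

    rows = [
        ["productId", "quantity", "customerId"],
        ["5710031efdfe", 1, "8fe96b88"],
        ["479cd9744e5c", 2, "526ba6f5"],
    ]

    packed = msgpack.packb(rows)            # compact binary, no repeated keys
    assert msgpack.unpackb(packed) == rows  # round-trips losslessly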

account-5 · 3 years ago
I think one of the issues is data types. JSON has them, CSV doesn't, so your program needs to be aware of which columns are which data type and do the conversion where needed.

It's similar to JSON vs. INI files for config files.

On a different note, I wouldn't nest JSON in a CSV column. I'd delimit with a pipe or something, then split the string on that. Much simpler if you're in control of the data.
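A tiny illustration of the type issue (the column meanings are invented): CSV hands back strings, JSON hands back typed values:

    import csv
    import io
    import json

    print(next(csv.reader(io.StringIO("42,3.14,true\n"))))
    # -> ['42', '3.14', 'true']  (all strings; the program must convert them)

    print(json.loads("[42, 3.14, true]"))
    # -> [42, 3.14, True]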

ajanuary · 3 years ago
How often is this a concern in practice? It’s a question I’ve been thinking about a bunch and the answer I keep coming back to is that most of the time, encoding the type in the data exchange format isn’t actually very useful. So I would be interested in use cases where it is.

Are you just trusting that the types coming in are going to be the correct ones? What happens if someone sends you `{"foo": "10"}` instead of `{"foo": 10}`? Do you validate with a schema up front? In which case your code already needs to know what types it expects and can convert them. Or are you letting the incorrect types run through your system until they hit a type error at runtime somewhere?
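For the schema-up-front option, a minimal sketch with the third-party jsonschema package, which rejects the wrong type before it reaches application code:

    from jsonschema import ValidationError, validate

    schema = {"type": "object", "properties": {"foo": {"type": "integer"}}}

    validate({"foo": 10}, schema)        # passes
    try:
        validate({"foo": "10"}, schema)  # "10" is a string, not an integer
    except ValidationError as err:
        print(err.message)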

account-5 · 3 years ago
> How often is this a concern in practice?

No idea really, but if you're using a JSON parsing library then it is going to automatically convert the data types. Which, provided you trust the data, saves you a job.

majkinetor · 3 years ago
JSON also has a schema language that can be used to verify it.
bufferoverflow · 3 years ago
The author didn't compare gzipped/brotli'd sizes.

The author didn't think of any examples with even a bit more complexity. If you have 2-level object nesting, now what?