I prefer to use clickhouse-local for all my CSV needs, as I don't need to learn a new language (or CLI flags) and can just leverage SQL.
clickhouse local --file medias.csv --query "SELECT edito, count() AS count FROM table GROUP BY ALL ORDER BY count FORMAT PrettyCompact"
┌─edito──────┬─count─┐
│ agence │ 1 │
│ agrégateur │ 10 │
│ plateforme │ 14 │
│ individu │ 30 │
│ media │ 423 │
└────────────┴───────┘
With clickhouse-local I can do a lot more, as I can leverage the full power of ClickHouse.
How does it compare with duckdb, which I usually resort to?
What I like with duckdb is that it's a single binary, no server needed, and it's been happy so far with all the CSV files I've thrown at it.
clickhouse-local is similar to duckdb in that respect: you don't need a clickhouse-server running in order to use clickhouse-local. You just download the clickhouse binary and start using it.
clickhouse local
ClickHouse local version 25.4.1.1143 (official build).
:)
There are a few benefits to using clickhouse-local, since ClickHouse can simply do a lot more than DuckDB. One example is handling compressed files: ClickHouse can read compressed files in formats ranging from zstd, lz4, snappy, gz, xz, and bz2 to zip, tar, and 7zip archives.
clickhouse local --query "SELECT count() FROM file('top-1m-2018-01-10.csv.zip :: *.csv')"
1000000
Also, clickhouse-local is much more efficient at handling big CSV files [0].
# Some medias of our corpus have the same ids on mediacloud.org
xan dedup -s mediacloud_ids medias.csv | xan count && xan count medias.csv
$medias | Select-Object -ExpandProperty mediacloud_ids -Unique | Measure-Object; $medias | Measure-Object -Property mediacloud_ids
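The same dedup-and-count comparison can be sketched in plain Python with the stdlib csv module. This is just an illustration: the inline data stands in for medias.csv, and the duplicate mediacloud_ids value is made up.

```python
import csv
import io

# Hypothetical stand-in for medias.csv; two rows share a mediacloud id.
data = """mediacloud_ids,edito
1,media
2,media
2,individu
3,plateforme
"""

rows = list(csv.DictReader(io.StringIO(data)))

# Deduplicate on the mediacloud_ids column, keeping the first occurrence,
# then compare the deduplicated count with the total row count.
seen = set()
deduped = []
for row in rows:
    if row["mediacloud_ids"] not in seen:
        seen.add(row["mediacloud_ids"])
        deduped.append(row)

print(len(deduped), len(rows))  # deduplicated count vs. total count
```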
Computing frequency tables
xan frequency -s edito medias.csv | xan view
$medias | Group-Object -Property edito | Sort-Object -Property Count -Descending
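For comparison, a frequency table like the one above can be sketched in plain Python with collections.Counter; the inline data here is a hypothetical stand-in for medias.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for medias.csv.
data = """edito,name
media,A
media,B
individu,C
media,D
plateforme,E
"""

# Count how often each value of the edito column appears,
# most frequent first (roughly what `xan frequency -s edito` reports).
counts = Counter(row["edito"] for row in csv.DictReader(io.StringIO(data)))
for value, n in counts.most_common():
    print(value, n)
```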
It's probably orders of magnitude slower, and of course, plotting graphs and so on gets tricky. But for the simple type of analysis I typically do, it's fast enough, I don't need to learn an extra tool, and the auto-completion of column/property names is very convenient.
Yes, I find PowerShell is criminally underrated for these types of tasks. Even though it's open source and cross-platform, the stigma from its Windows-centric days is hard to overcome.
I use Pandas for most of my CSV work. It's super fast and very powerful. There's a bit of a learning curve. I can then use Python scripts to manipulate massive CSV files.
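For the frequency-table style of question discussed above, the pandas version is short; a sketch, with a made-up in-memory CSV standing in for a real file path.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk.
data = io.StringIO("edito,name\nmedia,A\nmedia,B\nindividu,C\n")

df = pd.read_csv(data)
# Frequency table of the edito column, most frequent first.
print(df["edito"].value_counts())
```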
I tend to use csvkit for more complicated transformations, and OCaml's csvtool[0] for the simpler ones. For intermediate transformations I wrote my own csved[1] script, which, for every line of a CSV, reads it into @F, applies a Perl expression to that array, then writes it out. With the -h option you can also use the %F hash to access fields by name. It's very fast.
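The csved idea (read each row into an array, apply an expression, write the result back out) translates roughly like this in Python. This is a sketch of the pattern, not the actual script; the function name and example transform are made up.

```python
import csv
import io

def csved(reader, writer, expr):
    """Apply `expr` to each row (a list of fields, like Perl's @F)
    and write the transformed row back out."""
    for F in reader:
        writer.writerow(expr(F))

# Example: upper-case the second field of every row.
src = io.StringIO("a,x\nb,y\n")
dst = io.StringIO()
csved(csv.reader(src), csv.writer(dst), lambda F: [F[0], F[1].upper()])
print(dst.getvalue())
```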
It looks like xsv and xan are in the "csvkit but faster" niche, which is nice, but now I must learn another set of commands.
And there are now many more recent utilities called csvtool, including a Perl and a Python one.
Something that would be insanely useful is if your tool could be used to do validations.
For example, being able to define data types for each column and specify required columns, then run your tool as a validator and get the errors back as an array. That would be amazing!
Coming from a dev who’s just over processing CSV files back into my apps.
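A minimal sketch of what such a validator could look like, in Python; the schema, column names, and error format here are all made up for illustration, and errors come back as a plain list.

```python
import csv
import io

# Hypothetical schema: expected type per column, and which columns are required.
SCHEMA = {"id": int, "name": str, "score": float}
REQUIRED = {"id", "name"}

def validate(reader):
    """Return a list of error strings, one per missing or ill-typed value."""
    errors = []
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        for col, typ in SCHEMA.items():
            value = (row.get(col) or "").strip()
            if not value:
                if col in REQUIRED:
                    errors.append(f"row {lineno}: missing required column {col!r}")
                continue
            try:
                typ(value)  # try to coerce the cell to the declared type
            except ValueError:
                errors.append(f"row {lineno}: {col}={value!r} is not {typ.__name__}")
    return errors

# Second data row has a missing required id and a non-numeric score.
data = "id,name,score\n1,alice,3.5\n,bob,oops\n"
errors = validate(csv.DictReader(io.StringIO(data)))
print(errors)
```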
https://harelba.github.io/q/
[0]: https://www.vantage.sh/blog/clickhouse-local-vs-duckdb
Translating the examples from the ReadMe, having read the file with:
Previewing the file in the terminal
Reading a flattened representation of the first row
Searching for rows
Selecting some columns
Sorting the file
Deduplicating the file on some column
Computing frequency tables
I'm currently on my phone so can't go through all the examples, but knowing both PS and nu, nu has the better syntax.
EDIT:
Get data and view in table:
Get headers:
Get count of rows:
Get flattened, slightly more convoluted (caveat: there might be a better way):
Search rows:
Select columns:
Sort file:
Dedup based on column:
Computing frequency and histogram:
[0] https://github.com/Chris00/ocaml-csv
[1] https://github.com/gpvos/csved
Could definitely be done as a small bash script.