Readit News
qoega commented on Timescale Is Now TigerData   tigerdata.com/blog/timesc... · Posted by u/pbowyer
freilanzer · 6 months ago
DuckDB seems to be the most interesting there.
qoega · 6 months ago
It is meant for a single reader/writer workload, so it's not meant to be used as a service
qoega commented on Show HN: Turn CSV Files into SQL Statements for Quick Database Transfers   github.com/ryanwaldorf/ge... · Posted by u/ryanwaldorf
adammarples · 2 years ago
Yeah I'm not sure about redshift, but bigquery uses "autodetect" so something like

bq load --autodetect --source_format=CSV mydataset.mytable ./myfile.csv

And Snowflake uses INFER_SCHEMA; I believe you can do this

select * from table( infer_schema( location=>'@stage/my file.csv', file_format=>'my_csv_format' ) );

Although tbh I'm not sure if that's what you're looking for. You might enjoy looking at duckdb for stuff like this. My policy when starting data engineering was to bung everything into pandas dataframes, and now my policy is to try to avoid them at all costs because they're slow and memory hungry!
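For example, DuckDB can infer the schema straight from the file, something like this if I remember the syntax correctly (the file name is just a placeholder):

SELECT * FROM read_csv_auto('myfile.csv');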

qoega · 2 years ago
In ClickHouse it is just `INSERT INTO t FROM INFILE 'data.csv.gz'`. Any supported format and compression is autodetected from the file name, and column types, delimiters etc. are inferred from a sample of the data. Separate tools to convert CSV are not necessary if you can just import into the DB and export as SQL statements.

echo "name,age,city John,30,New York Jane,25,Los Angeles" > example.csv

clickhouse local -q "SELECT * FROM file('example.csv') FORMAT SQLInsert"

INSERT INTO table (`name`, `age`, `city`) VALUES ('John', 30, 'New York'), ('Jane', 25, 'Los Angeles');
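For the `INSERT ... FROM INFILE` form, a rough sketch (the table schema here is made up; the .gz compression is picked up from the file name):

clickhouse client -q "CREATE TABLE t (name String, age UInt8, city String) ENGINE = MergeTree ORDER BY name"

clickhouse client -q "INSERT INTO t FROM INFILE 'data.csv.gz' FORMAT CSVWithNames"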

qoega commented on Show HN: Postgres Columnstore index vs. ClickHouse for OLAP queries   tablespace.io/blog/postgr... · Posted by u/smythe123
qoega · 2 years ago
Are you guys comparing 16 vCPU/32 GB vs 8 vCPU/32 GB and saying yours is only 1.6x faster?
qoega commented on Ask HN: Does (or why does) anyone use MapReduce anymore?    · Posted by u/bk146
qoega · 2 years ago
Now you rarely use basic MapReduce primitives directly; you have another layer of abstraction that can run on the infrastructure that was previously running MR jobs. This infrastructure can efficiently allocate compute resources for "long"-running tasks in a large cluster, respecting memory/CPU/network and other constraints. Basically, the schedulers of MapReduce jobs and the cluster management tools became that good because the MR methodology had trivial abstractions but required an efficient implementation to work seamlessly.

Abstraction layers on top of this infrastructure can now optimize the pipeline as a whole, merging several steps into one when possible and adding combiners (partial reduce before the shuffle). This requires the whole processing pipeline to be defined in more specific operations. Some systems propose SQL to formulate the task, but it can be done with other primitives as well. Given such a pipeline, it is easy to implement optimizations that make the whole system much more user-friendly and efficient than plain MapReduce, where the user has to think about all the optimizations and implement them inside single map/reduce/(combine) operations.
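To make this concrete with a toy example: the classic word count, which once meant hand-written map, combine and reduce steps, can be stated declaratively, and the engine decides where to merge stages and where to insert the combiner-style partial aggregation before the shuffle (`words` here is just a hypothetical dataset):

SELECT word, count(*) AS cnt FROM words GROUP BY word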

qoega commented on ClickHouse Keeper: A ZooKeeper alternative written in C++   clickhouse.com/blog/click... · Posted by u/eatonphil
insanitybit · 2 years ago
So could I just point my Kafka at this thing and use it?
qoega · 2 years ago
You can even migrate your ZooKeeper to ClickHouse Keeper. It requires a small downtime, but all your ZooKeeper data will be carried over and your clients will just work once the keeper is back.
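If I remember correctly, the offline conversion is done with clickhouse-keeper-converter, roughly like this (paths are placeholders; check the docs for the exact flags):

clickhouse-keeper-converter --zookeeper-logs-dir /var/lib/zookeeper/version-2 --zookeeper-snapshots-dir /var/lib/zookeeper/version-2 --output-dir /var/lib/clickhouse/coordination/snapshots

After that you start ClickHouse Keeper with the generated snapshot and point the clients back at it.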
qoega commented on Uses and abuses of cloud data warehouses   materialize.com/blog/ware... · Posted by u/Malp
RyanHamilton · 2 years ago
For real-time and large historical data, open source there's TDengine/QuestDB; commercial, there's DolphinDB and kdb+. If you only need fast recent data and not large historical data, embedding is a good solution, which means H2/DuckDB/SQLite if open source, eXtremeDB if commercial. I've benchmarked and run applications on most of these databases, including running real-time analytics.
qoega · 2 years ago
Open-source ClickHouse also handles both real-time and large historical data.
qoega commented on Neeva acquired by Snowflake   snowflake.com/blog/snowfl... · Posted by u/danielcampos93
riku_iki · 3 years ago
That link doesn't provide many details on how they tried to test ClickHouse for joins, or whether they tried to test it at all.
qoega · 3 years ago
I think atwong just promotes his product https://news.ycombinator.com/threads?id=atwong
qoega commented on Show HN: Gitbi – Lightweight BI app based on Git repo   github.com/ppatrzyk/gitbi... · Posted by u/pieca
qoega · 3 years ago
It is nice to have an image of the expected dashboard in the readme.
qoega commented on Building ClickHouse Cloud from scratch in a year   clickhouse.com/blog/build... · Posted by u/techn00
mayank · 3 years ago
This is a wonderful article, architecture, and project. Can anyone from Clickhouse comment on any non-technical factors that allowed such a rapid pace of development, e.g. team size, structure, etc.?
qoega · 3 years ago
Passion and experience

u/qoega

Karma: 17 · Cake day: July 15, 2020