danking00 commented on The FastLanes File Format [pdf]   github.com/cwida/FastLane... · Posted by u/jandrewrogers
abirch · a month ago
What about Feather? This is on my to-do list, but I thought that Feather was a file format based on Arrow: https://docs.pola.rs/api/python/stable/reference/api/polars....

This is referenced in the link above. https://arrow.apache.org/docs/python/ipc.html

Unfortunately I'm stuck with CSV at work for now.

danking00 · a month ago
Feather appears to just be block compressed Arrow IPC [1]. Lightweight compression techniques generally achieve two orders of magnitude faster random access compared to block compression. That’s one of the benefits of formats like FastLanes, Vortex, DuckDB native, etc. DuckDB has a good blog post about it here: https://duckdb.org/2022/10/28/lightweight-compression.html

[1]: https://arrow.apache.org/docs/python/feather.html
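
For concreteness, here's a small pyarrow sketch (my illustration, not anything from the Feather docs): Feather V2 is Arrow IPC on disk, and passing a compression codec block-compresses the column buffers, so even a point read has to decompress the buffers that contain the value.

    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"x": list(range(1_000_000))})

    # Feather V2 is Arrow IPC; the codec block-compresses the column buffers
    # rather than applying a lightweight, random-access-friendly encoding.
    feather.write_feather(table, "data.feather", compression="zstd")

    # Fetching even a single value still decompresses the enclosing buffers.
    t = feather.read_table("data.feather")
    print(t.column("x")[0])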

danking00 commented on ALP Rust is faster than C++   blog.spiraldb.com/alp-rus... · Posted by u/gatesn
danking00 · 6 months ago
cppreference says [1]:

    A prvalue of floating-point type can be converted to a prvalue of any integer type. The
    fractional part is truncated, that is, the fractional part is discarded.

    - If the truncated value cannot fit into the destination type, the behavior is undefined (even
      when the destination type is unsigned, modulo arithmetic does not apply).
Ugh. Like, surely the C++ stdlib could provide a "give me the best int approximation" function without triggering nasal demons. Sometimes I feel like C++ is not just difficult to use but actively trying to harm me.

[1] https://en.cppreference.com/w/cpp/language/implicit_conversi...

danking00 commented on What if we just didn't decompress it?   blog.spiraldb.com/what-if... · Posted by u/gatesn
danking00 · 6 months ago
I think one of my favorite examples of avoiding decompression for great profit (really: health) is PLINK. If you’re not aware, it’s a tool for doing statistical genetics (things like quality control but also inferring population structure, computing relatedness between individuals, and fitting linear models). The code is mind bending to read but it’s blazing fast thanks to “compressing” genotypes into two bits (homozygous reference, heterozygous, homozygous alternate, N/A) and implementing all its operations on this 2-bit representation.

It’s exciting to see more work on compressed arrays make it out into the world. It would have been nice if more of PLINK’s bit-fiddling functionality were in an accessible native library rather than wrapped up with the rest of that tool.

https://www.cog-genomics.org/plink/2.0/
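
A toy numpy sketch of the idea (the code values and byte layout here are illustrative, not PLINK's actual .bed encoding): four genotypes pack into one byte, and operations like allele counting can run directly on the packed codes.

    import numpy as np

    # Illustrative 2-bit codes: 0 = hom. ref., 1 = het., 2 = hom. alt., 3 = missing.
    genotypes = np.array([0, 1, 2, 3, 1, 0, 2, 2], dtype=np.uint8)

    def pack_2bit(codes):
        # Pad to a multiple of four with the "missing" code, then pack 4 per byte.
        pad = (-len(codes)) % 4
        padded = np.pad(codes, (0, pad), constant_values=3).reshape(-1, 4)
        return (padded[:, 0] | (padded[:, 1] << 2) |
                (padded[:, 2] << 4) | (padded[:, 3] << 6)).astype(np.uint8)

    def unpack_2bit(packed, n):
        shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
        return ((packed[:, None] >> shifts) & 0b11).reshape(-1)[:n]

    packed = pack_2bit(genotypes)
    assert np.array_equal(unpack_2bit(packed, len(genotypes)), genotypes)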

danking00 commented on Show HN: Vortex – a high-performance columnar file format   github.com/spiraldb/vorte... · Posted by u/gatesn
xiaodai · a year ago
I see. Very nice. So it's a trade-off. I imagine the throughput of these light-weight compression schemes suffers a little. In analytical workloads, it's common to do things like compute the mean of a vector or compute the gradient for a batch of data, so random access appears to be less of an issue here.
danking00 · a year ago
We’ll post a blog post soon with specific, benchmarked numbers, but, in this case, you can have your cake and eat it too!

The compression and decompression throughputs of Vortex (and other lightweight compression schemes) are similar to or better than Parquet’s for many common datasets. Unlike Zstd or Blosc, the lightweight encodings are generally both computationally simple and SIMD-friendly. We’re seeing multiple gibibytes per second on an M2 MacBook Pro across various datasets in the PBI benchmark [1].

The key insight is that most data we all work with has common patterns that don’t require sophisticated, heavyweight compression algorithms. Let’s take advantage of that fact to free up more cycles for compute kernels!

[1] https://github.com/cwida/public_bi_benchmark
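
As a toy illustration of why these encodings are cheap (my sketch, not Vortex's actual implementation): delta-encoding a slowly increasing timestamp column shrinks it to roughly a byte per value, and both encode and decode are single vectorized passes with no entropy coding or per-value branching.

    import numpy as np

    # Monotone millisecond timestamps with small gaps between rows.
    steps = np.random.randint(0, 200, size=1_000_000)
    timestamps = np.int64(1_700_000_000_000) + np.cumsum(steps, dtype=np.int64)

    first = timestamps[0]
    deltas = np.diff(timestamps).astype(np.uint8)  # assumes every gap fits in a byte

    # Decoding is one cumulative sum -- simple, branch-free, and SIMD-friendly.
    decoded = np.concatenate(([first], first + np.cumsum(deltas, dtype=np.int64)))
    assert np.array_equal(decoded, timestamps)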

danking00 commented on Show HN: Vortex – a high-performance columnar file format   github.com/spiraldb/vorte... · Posted by u/gatesn
xiaodai · a year ago
There are a bunch of these including fst in the R ecosystem. JDF.jl in the julia ecosystem etc.
danking00 · a year ago
Thanks for introducing me to these other formats! I hadn't heard of them yet. All three of fst, JDF, and Vortex appear to share the goal of high-throughput (de)serialization of tabular data and random access to that data. However, it is not clear to me how JDF and fst permit random access on compressed data, because both appear to use block compression (Blosc and LZ4/Zstd, respectively). While both Blosc and Zstd are extremely fast, accessing a single value of a single row necessarily requires decompressing a whole block of data. Instead of O(1) random access you get O(N_ROWS_PER_BLOCK) random access.

In Vortex, we've specifically invested in high throughput compression techniques that admit O(1) random access. These kinds of techniques are also sometimes called "lightweight compression". The DuckDB folks have a good writeup [1] on the common ones.

[1] https://duckdb.org/2022/10/28/lightweight-compression.html
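
A minimal sketch of the difference, using frame-of-reference encoding as a stand-in lightweight scheme and zlib as a stand-in block compressor (neither is exactly what Vortex, fst, or JDF use):

    import numpy as np
    import zlib

    values = np.arange(1_000_000, dtype=np.int64) + 1_000_000_000

    # Block compression: reading row i means decompressing the whole block first.
    block = zlib.compress(values.tobytes())
    def block_get(i):
        return np.frombuffer(zlib.decompress(block), dtype=np.int64)[i]

    # Frame-of-reference: one reference plus narrow codes; row i is an O(1) lookup.
    reference = values.min()
    codes = (values - reference).astype(np.uint32)
    def for_get(i):
        return reference + codes[i]

    assert block_get(123_456) == for_get(123_456)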

danking00 commented on Show HN: Vortex – a high-performance columnar file format   github.com/spiraldb/vorte... · Posted by u/gatesn
sa46 · a year ago
Parquet also encodes the physical layout using footers [1], as does ORC [2]. Perhaps the author meant support for semi-structured data, like the spans you mention.

[1]: https://parquet.apache.org/docs/file-format/

[2]: https://orc.apache.org/specification/ORCv2/#file-tail

danking00 · a year ago
Yeah, we should be clearer in our description about how our footers differ from Parquet's. Parquet is a bit more prescriptive; for example, it requires row groups, which Vortex does not. If you have a column with huge values and another column of 8-bit ints, they can be paged separately, if you like.
danking00 commented on Show HN: Vortex – a high-performance columnar file format   github.com/spiraldb/vorte... · Posted by u/gatesn
kwillets · a year ago
Does this fragment columns into rowgroups like Parquet, or is it more of a pure columnstore? IME a data warehouse works much better if each column isn't split into thousands of fragments.
danking00 · a year ago
Yeah, we’re on the same page (heh). We don’t want the format to require row grouping. The file format has a layout schema written in a footer; a row-group-style layout is supported but not required. The layout specification will probably evolve, but currently the in-memory structure becomes the on-disk structure. So, if you have a ChunkedArray of StructArray of ChunkedArray, you’ll get row groups with pages within them. If you have a StructArray of ChunkedArray, you’ll just get per-column pages.

I’m working on the Python API now. I think we probably want the user to specify, on write, whether they want row groups or not and then we can enforce that as we write.
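
If it helps to picture the two shapes, here's a rough pyarrow analogy (Vortex's ChunkedArray and StructArray are its own types, not pyarrow's): a chunked array of structs slices the table horizontally like row groups, while a table of independently chunked columns pages each column on its own.

    import pyarrow as pa

    # "ChunkedArray of StructArray": each chunk is a horizontal slice across all
    # columns, which serializes as row groups with pages inside them.
    row_groupish = pa.chunked_array([
        pa.array([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]),
        pa.array([{"a": 3, "b": "z"}]),
    ])

    # "StructArray of ChunkedArray": each column is chunked independently, so the
    # file ends up with per-column pages and no row groups.
    columnar = pa.table({
        "a": pa.chunked_array([[1, 2], [3]]),
        "b": pa.chunked_array([["x", "y", "z"]]),
    })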

danking00 commented on Priced out of home ownership   bbc.co.uk/news/articles/c... · Posted by u/user20180120
whoknowsidont · a year ago
Because the U.S. doesn't actually have a supply problem, despite this often getting repeated.

U.S. construction of homes has actually kept up with population growth and moves, by every conceivable metric.

What IS happening though is that landlords (private and corporate) "warehouse" units constantly to artificially restrict supply and demand, and they do it in collaboration with each other.

Every single city and state in the U.S. has a vacancy problem. We DO have both houses and apartments that are ready and able to be rented and owned.

The problem is simply the price, and the prices aren't being driven by legitimate supply and demand.

You can see this effect happen where prices increase NOT to match population (and often in SPITE of it), but they increase to match the upper bounds of regional income.

You need a place to live more than landlords need a few months of rent. This strategy to prevent downward price pressure from the market allows them to justify the current prices while setting up a "baseline" for the next few years.

Some free reading:

* https://www.construction-physics.com/p/is-there-a-housing-sh...

* https://reventureconsulting.com/the-myth-of-the-us-housing-s...

* https://charleshughsmith.blogspot.com/2023/08/the-problem-is...

danking00 · a year ago
Unless I’m misreading, the first two references support the idea that vacancy rates are too low but don’t comment on why.

The last reference indeed argues in favor of too much financialization of housing units but that blog is also tripping a lot of my crackpot alarms.

Can you succinctly explain why you believe the low vacancy rates in major metros (which I think we agree is the cause of high rents / purchase costs) are caused primarily by units which are intentionally held empty despite demand?

In partial defense of your assertion, the FRED data does show that 1/4 to 1/3 of vacant units are for sale or rent at any given time.

https://fred.stlouisfed.org/graph/?g=1oeIe

Total vacancy rate seems to hover around 10%? Rentable and buyable unit rates are an order of magnitude lower.

I’m not convinced that the Fed owning a bunch of mortgages is evidence that private companies bought homes and aren’t renting them. Wasn’t that a bailout to prevent people from losing their homes? (Because the companies owning the homes weren’t solvent, and I guess if the company fails maybe you get foreclosed on? I’m not sure why we did things the way we did in ’08.)

Could not the explanation also be that a lot of homes are in places that lack demand and therefore the owners don’t bother putting them up for sale or rent?

https://fred.stlouisfed.org/graph/?g=1oeII
