My best guess so far is that I somehow embed a long text and then break the returned embedding up into multiple parts and search each separately? But that doesn't sound right.
Then, you want to partition the document into chunks. Late chunking pairs really well with semantic chunking, because semantic chunking can use late chunking's improved sentence embeddings to find more semantically cohesive chunks. In fact, you can cast the chunking step as a binary integer programming problem and find the ‘best’ chunks that way. See RAGLite [1] for an implementation of both techniques, including the formulation of semantic chunking as an optimization problem.
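To make that concrete, here is a minimal sketch of semantic chunking, assuming you already have one (late-chunked) embedding per sentence as a numpy array. This is a greedy approximation, not RAGLite's integer-programming formulation, and the similarity threshold and chunk-size cap are made-up parameters:

    import numpy as np

    def semantic_chunks(sentences: list[str], embeddings: np.ndarray,
                        similarity_threshold: float = 0.7,
                        max_sentences_per_chunk: int = 10) -> list[list[str]]:
        """Group consecutive sentences into chunks, starting a new chunk
        whenever the similarity to the previous sentence drops."""
        # Normalize rows so the dot product equals cosine similarity.
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        chunks, current = [], [sentences[0]]
        for i in range(1, len(sentences)):
            similarity = float(normed[i - 1] @ normed[i])
            if similarity < similarity_threshold or len(current) >= max_sentences_per_chunk:
                chunks.append(current)
                current = []
            current.append(sentences[i])
        chunks.append(current)
        return chunks

The optimization-based version instead chooses split points globally (e.g. minimizing similarity across chunk boundaries subject to chunk-size constraints), rather than deciding one boundary at a time like this greedy loop does.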
Finally, you have a sequence of document chunks, each represented as a multi-vector sequence of sentence embeddings. You could choose to pool these sentence embeddings into a single embedding vector per chunk. Or, you could leave the multi-vector chunk embeddings as-is and apply a more advanced querying technique like ColBERT's MaxSim [2].
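Here's a rough sketch of both querying options, assuming each chunk is just a numpy array of sentence embeddings; the mean pooling and the brute-force MaxSim computation are illustrative, not how RAGLite or ColBERT actually implement retrieval at scale:

    import numpy as np

    def pool_chunk(sentence_embeddings: np.ndarray) -> np.ndarray:
        """Option 1: collapse a chunk's sentence embeddings into one
        normalized vector by mean pooling."""
        pooled = sentence_embeddings.mean(axis=0)
        return pooled / np.linalg.norm(pooled)

    def maxsim_score(query_embeddings: np.ndarray, chunk_embeddings: np.ndarray) -> float:
        """Option 2: ColBERT-style MaxSim. For each query vector, take its
        best-matching chunk vector, then sum those maxima."""
        q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
        c = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
        return float((q @ c.T).max(axis=1).sum())

At query time you'd embed the query as a single vector and rank chunks by cosine similarity against their pooled embeddings, or embed it as multiple vectors and rank chunks by their MaxSim scores.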
[1] https://github.com/superlinear-ai/raglite
[2] https://huggingface.co/blog/fsommers/document-similarity-col...
The implementation learns to play Battleship in about 2000 steps, pretty neat!