LoRA Without Regret - Readit News

HumblyTossed · 2 months ago

The name gets me every single time. Always think it’s going to be about radio LoRa

papascrubs · 2 months ago

Not just me then. It's always the first thing that springs to mind.

apple4ever · 2 months ago

Nope not just you! Gets me everytime.

dannyfritz07 · 2 months ago

Dang it! Got me too! I've been wanting to hop into Meshtastic lately.

ijustlovemath · 2 months ago

Set up a node! Bare boards that work with the app are like $50 and take a few clicks to flash and setup. The basic antenna with no amp makes contacts up to 50mi away if the conditions are right. I have one in a window and one in a backpack at all times.

wkjagt · 2 months ago

I have a couple of nodes up, but not seeing a lot of traffic

halfmatthalfcat · 2 months ago

Same - sad it's not.

moffkalast · 2 months ago

No such thing as LoRa and LoRaWAN without regret I'm afraid, all the range but no throughput.

dvfjsdhgfv · 2 months ago

By the way, some time ago when I checked there were two cool applications of LoRa: (1) a mesh, for (hopefully) truly decentralized and more difficult to disrupt communication, (2) a gateway, so that you could get data from your sensors in remote places via standard internet protocols.

Both are very cool, but I wonder if I missed something else?

sifar · 2 months ago

And I thought you were going to say thinking machines :). Buy yeah LoRA trips me up too.

canadiantim · 2 months ago

I thought it was Lora the CRTD implementation, but then realized that Loro

Deleted Comment

mrandish · 2 months ago

Yeah, kinda disappointed it's just more AI stuff...

Dead Comment

kouteiheika · 2 months ago

> However, the literature is unclear on how well LoRA performs relative to FullFT.

I think the literature is clear on that?

"LoRA vs Full Fine-tuning: An Illusion of Equivalence" -- https://arxiv.org/abs/2410.21228v1

Quoting from the conclusions:

> The paper describes the finding that LoRA and full fine-tuning, with equal performance on the fine-tuning task, can have solutions with very different generalization behaviors outside the fine-tuning task distribution. We found that LoRA and full fine-tuning yield models with significant differences spectral properties of their weight matrices: LoRA models often containing “intruder dimensions”, high-ranking singular vectors approximately orthogonal to the singular vectors of pre-trained weight matrices. The existence of intruder dimensions correlates with the fine-tuned model forgetting more of the pre-training distribution as well as forgetting more when trained on tasks sequentially in a continual learning setup.

I'm surprised they didn't cite this; it's a well known paper.

adhi01 · 2 months ago

To say that the 'literature is clear on that' while citing a single paper, which has been rejected from ICLR, is a bit of an overstatement.

kouteiheika · 2 months ago

> which has been rejected from ICLR

Oh, you mean rejected just like these papers?

Efficient Estimation of Word Representations in Vector Space[1], one of the most influential papers in the space with tens of thousands of citations[2]? Or the RoBERTa[3] paper (dramatically improved upon BERT; RoBERTa and derived models currently have tens of millions of downloads on HF and still serve as a reliable industry workhorse)? Or the Mamba paper[4] (pretty much the only alternative to transformers that actually gets used)? Do you want me to keep going?

Honestly, I find that whether a paper gets rejected or not means diddly squat considering how broken the review system is, and through how much honestly terrible papers I have to wade through every time I'm looking through the conference submissions for anything good.

[1] -- https://openreview.net/forum?id=idpCdOWtqXd60

[2] -- https://scholar.google.com/scholar?cites=7447715766504981253

[3] -- https://openreview.net/forum?id=SyxS0T4tvS

[4] -- https://openreview.net/forum?id=AL1fq05o7H

muragekibicho · 2 months ago

Thanks for this comment.

p1esk · 2 months ago

Even that paper itself does not provide any "clear" conclusions about which method is better.

lelanthran · 2 months ago

> I'm surprised they didn't cite this; it's a well known paper.

I'm surprised you copied and pasted all of that without explaining what it means.

Does LoRA perform worse, better or statistically insignificantly different to FullFT?

You aren't able to tell from what you pasted, are you?

cheald · 2 months ago

Standard LoRA (W_delta = B@A with standard inits) generally underperforms FT, primarily because of "intruder dimensions" (new high-ranking singular vectors which misalign with the singular vectors of the underlying weights) as outlined in the paper.

There are techniques like PiCa and SVFT which can mitigate much of the loss, though.

crimsoneer · 2 months ago

If you're going to be snarky, could you at least clarify what the answer is for those of us who don't stay on top of ML research...?

richardvsu · 2 months ago

Why would they cite a paper that’s not helping with their Tinker API that was released soon after? :)

mountainriver · 2 months ago

> LoRA works well when not capacity constrained, i.e., the number of trainable parameters exceeds the amount of information to be learned, which can be estimated in terms of dataset size

I’m shocked they didn’t look at progressive merging of LoRAs. Research shows that’s the best way of improving its ability to model higher level features.

Seems like a massive miss, not to mention there is other research that contradicts a lot of their findings. This feels a bit like a researchers first pass at learning LoRA

let_tim_cook_ · 2 months ago

I'm not sure why progressive LoRa merging needs to be addressed here. They show there is a regime of problem where LoRa performs equivalently to FFT.

Progressive merging of LoRa is somewhere inbetween and categorically more complex than just LoRa so would be dominated by standard LoRa in that case.

While progressive merging could train faster as fewer params are trainable at any given time, it results in very larger adapter diffs OTO the size of the original model and doesn't retain the benefits of being able to deploy multiple adapters over the same base model idt.

yenepho · 2 months ago

I am curious, would you mind sharing a citation?

Mkengin · 2 months ago

https://arxiv.org/abs/2311.13600

https://arxiv.org/abs/2410.22911

https://arxiv.org/abs/2409.16167

logannyeMD · 2 months ago

Missed opportunity to title this "Lo-RAgrets"

sgt101 · 2 months ago

Question for dudes building modern nn's... what's the thinking on estimating structural capacity for real world problem? How should I estimate how many parameters to choose for the model?

p1esk · 2 months ago

You test different models on your real world problem, and pick the smallest one that works.

sgt101 · 2 months ago

I just think that there has to be some heuristic..

_def · 2 months ago

Took me a moment to realize this is not about LoRa.

ellisv · 2 months ago

I also mistook it to be about LoRa and not about LoRA

chrystalkey · 2 months ago

I too fell victim to mistaking LoRa for LoRa

markisus · 2 months ago

Can someone explain the bit counting argument in the reinforcement learning part?

I don’t get why a trajectory would provide only one bit of information.

Each step of the trajectory is at least giving information about what state transitions are possible.

An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.

navar · 2 months ago

I believe it's because the way you measure things in RL, each episode only tells you whether it was good (say reward +1) or bad (say 0 or negative reward), it does not tell you anything about the trace that was produced to get the outcome. This reward is the only thing measured to produce your gradients. Hence why the amount of info in it is O(1).

This is in contrast to more "supervised" forms of learning where you could get a loss for each token produced (e.g. cross entropy loss), and where you'd get, as a consequence O(number of tokens) information into your gradients.

Deleted Comment

mountainriver · 2 months ago

A fair amount of research has shown that RL doesn’t add knowledge to the base model it just optimizes paths that already exist. Now ProRL from Nvidia showed there are ways of adding knowledge, mostly through progressive merging.

I’m still not fully convinced of the 1bit claim, they made other mistakes in the blog post

lewtun · 2 months ago

For those interested in playing with an implementation of these ideas, my colleagues at HF made some recipes here: https://github.com/huggingface/trl/blob/main/docs/source/lor...