WinLychee (u/WinLychee)

WinLychee commented on Ask HN: Why are HN comments so cynical? · Posted by u/aarondf

dang · 2 years ago

That comment has been unfairly misinterpreted and did not deserve to turn into a meme of dismissal. BrandonM was sincerely trying to help Drew with his YC application (that's what "app" meant on HN in 2007), and if you read the rest you can see that they had quite a nice exchange.

It's a hobbyhorse but I'm on a mission about this:

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

WinLychee · 2 years ago

That's helpful context, thanks. Unfortunately I cannot edit my original comment, but good to know.

WinLychee commented on Ask HN: Why are HN comments so cynical? · Posted by u/aarondf

WinLychee · 2 years ago

Never forget https://news.ycombinator.com/item?id=8863

> 1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.

That said, right now the industry is going through some turmoil. We're coming off the high of low interest rates, and it's turning into a mighty hangover. Plus, we're trying to automate ourselves away with AI, and (working in) tech just isn't fun with Scrum/Agile/Meetings/Sprints/Bluh.

WinLychee commented on RAG is more than just embedding search jxnl.github.io/instructor... · Posted by u/jxnlco

darkteflon · 2 years ago

To someone not familiar with the space, search seems like an incredibly complex and difficult space to get right. In your view, is it reasonable for the average developer prepared to read both of those books to expect to come out the other side and construct something ready for production? Thanks!

WinLychee · 2 years ago

It's a problem with a long tail, and it very much depends on what objective you're optimizing for. In search at least, you aim for "good" and "better", but will never achieve "perfect". It's a pretty interesting space at the meeting point of software and data science. You probably don't necessarily need to read full books before diving in, but play around with "learning to rank" https://xgboost.readthedocs.io/en/latest/tutorials/learning_... and maybe check out https://www.microsoft.com/en-us/research/uploads/prod/2017/0... . Also https://www.tensorflow.org/recommenders/examples/basic_retri... .

WinLychee commented on Tracking Austrian grocery prices by scraping store sites mastodon.gamedev.place/@b... · Posted by u/boffbowsh

standardUser · 2 years ago

I'm not sure how other mastodon servers look, but this one is light years ahead of Twitter in terms of readability. I hope someone eventually writes a tell-all book about exactly what was wrong with that company and its approach to product design.

WinLychee · 2 years ago

Different incentives. A publicly traded corporation with thousands of workers trying to grow in perpetuity has different goals than a community project. While the former has more resources, the latter is more mission-driven.

WinLychee commented on An easy way to concurrency and parallelism with Python stdlib bitecode.dev/p/the-easy-w... · Posted by u/olsgaarddk

eachro · 2 years ago

So what is the consensus view on how to do parallelism in Python if you just have something that is embarrassingly parallel with no communication between processes necessary?

WinLychee · 2 years ago

If you have a task that is easy to split, make a Python script that runs on a subset of the task, split the work into N subsets, and write one output per process. Once they all complete, join the outputs together. Maybe https://docs.dask.org/en/stable/ is a good start if you want a framework. I don't think there's a consensus; it depends on the problem.
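A minimal sketch of that split / one-output-per-process / join pattern, using only the standard library. The names `process_chunk`, `split`, and `run_split` are hypothetical, and squaring each item is just a stand-in for the real work:

```python
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def process_chunk(chunk_id, items, out_dir):
    """Worker: handle one subset and write its own output file."""
    results = [x * x for x in items]  # placeholder for the real per-item work
    out_path = Path(out_dir) / f"part-{chunk_id:04d}.json"
    out_path.write_text(json.dumps(results))
    return str(out_path)


def split(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_split(items, n_workers, out_dir, executor_cls=ProcessPoolExecutor):
    """Fan the chunks out to workers, then join the per-chunk outputs in order."""
    chunks = split(items, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        futures = [
            pool.submit(process_chunk, i, chunk, out_dir)
            for i, chunk in enumerate(chunks)
        ]
        paths = [f.result() for f in futures]  # blocks until every worker is done
    merged = []
    for path in paths:
        merged.extend(json.loads(Path(path).read_text()))
    return merged
```

Call `run_split(list(range(10)), 3, some_dir)` from under an `if __name__ == "__main__":` guard, since multiprocessing re-imports the main module on platforms that spawn workers. The `executor_cls` parameter is there so the same code can run on threads for I/O-bound work or for quick testing.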

WinLychee commented on Nvidia’s AI supremacy is only temporary petewarden.com/2023/09/10... · Posted by u/sebg

sottol · 2 years ago

Research doesn't really have the sunk cost that industry does. New students are willing to try new things and supervisors don't necessarily need to rein them in.

I wonder what is holding AMD back in research? Their cards seem much less costly. I would have figured a nifty research student would figure out quickly how to port torch and run twice as many GPUs with his small budget to eke out a bit more performance.

WinLychee · 2 years ago

The software support just isn't there. The drivers need work, the whole ecosystem is built on CUDA not OpenCL, etc. Not to say someone that tries super hard can't do it, e.g. https://github.com/DTolm/VkFFT .

WinLychee commented on Earth had hottest 3-months on record; unprecedented sea temps & extreme weather public.wmo.int/en/media/p... · Posted by u/myshpa

badtension · 2 years ago

> It was the hottest August on record – by a large margin – and the second hottest ever month after July 2023, according to the Copernicus Climate Change Service ERA 5 dataset. August as a whole is estimated to have been around 1.5°C warmer than the preindustrial average for 1850-1900, according to the C3S monthly climate bulletin.

Thinking about getting to 1.5 C averaged over the planet is surreal and we are still 30 years out from the promised "net zero". We have some tough times ahead of us.

WinLychee · 2 years ago

What's even better is that net zero isn't doing anything for _already emitted carbon_, or the already accumulated heat energy. We can get to net zero and average temperature is still going to keep rising.

WinLychee commented on Do we think about vector storage wrong? hachyderm.io/@softwaredou... · Posted by u/softwaredoug

pradn · 2 years ago

The elephant in the room is the Curse of Dimensionality. As the number of vector dimensions increases, "nearest neighbor" search behaves in a surprising manner. In a high-dimensional cube, randomly sampled vectors are mostly near the edge of the cube, not at all what you expect based on two or three dimensions. There are other hitches, too.

When using vector search, you're making two leaps - that your document can really be summarized in a vector (keeping its important distinguishing semantic features), and that it's possible to retrieve similar documents using high-dimensional vectors. Both of these are lossy and have all sorts of caveats.

What would be great is if these new vector DB companies showed examples where the top-ten "search engine results page" based on semantic embedding / vector search is clearly better than a conventionally tuned, keyword-based system like Elasticsearch. I haven't seen that, but I have seen tons of numbers about precision/recall/etc. What's the point of taking 10ns to retrieve good vector results if it's unclear whether they're solving a real problem?

https://en.wikipedia.org/wiki/Curse_of_dimensionality
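To make the "mostly near the edge" claim concrete, here is a small sketch. It assumes points drawn uniformly from the unit cube and an arbitrary margin of 0.05 for what counts as "near a face"; both are illustrative choices, and real embedding vectors are not uniformly distributed:

```python
import random


def edge_fraction(dim, margin=0.05):
    """Exact probability that a uniform point in [0, 1]^dim lies within
    `margin` of at least one face of the cube."""
    return 1.0 - (1.0 - 2.0 * margin) ** dim


def sampled_edge_fraction(dim, margin=0.05, n=20000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    near = 0
    for _ in range(n):
        point = [rng.random() for _ in range(dim)]
        if any(x < margin or x > 1.0 - margin for x in point):
            near += 1
    return near / n


# The fraction climbs toward 1 as the dimension grows.
for dim in (2, 3, 100, 768):
    print(dim, round(edge_fraction(dim), 4))
```

In two dimensions only about a fifth of the points fall in that outer band; by a hundred dimensions essentially all of them do.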

WinLychee · 2 years ago

These vectors are lower-dimensional than traditional vectors though, aren't they? Vector embeddings have hundreds to low thousands of dimensions (roughly 128 to 1024), whereas a TF-IDF vector has as many dimensions as your vocabulary has terms. It's also not just about being flat-out better, but about increasing the recall of queries: you're grabbing content that doesn't contain the keywords directly but is still relevant. You are also free to mix the two approaches together in one result set, which gives the best of both.
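One common way to mix the two result lists is reciprocal rank fusion. This is a sketch of that general technique, not necessarily what any particular engine does; the document IDs are made up and `k=60` is just the conventional default:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by either retriever float toward the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword (e.g. TF-IDF/BM25) search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_d", "doc_a", "doc_b"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Here `doc_d` never matched the keywords but still makes it into the fused list, which is the recall gain, while documents found by both retrievers rank highest.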

WinLychee commented on Do we think about vector storage wrong? hachyderm.io/@softwaredou... · Posted by u/softwaredoug

osigurdson · 2 years ago

One thing I have been wondering about is how, concretely, embeddings will be used in practice. Will companies create embeddings for most / all content that they have? Will they create an embedding for every sentence, paragraph or page of text (or all of those)? Will they store hierarchies of embeddings? Do they then store the original text so they can invert the process (that seems obvious)?

WinLychee · 2 years ago

User/Item/Query embeddings are the most common. That way you can generate per-user recommendations, or search results for a given query (with personalization using side information). Video will be interesting, once we have video embeddings (maybe this exists already). It depends on the use-case but a few of your ideas are certainly possible. Generally I've seen them at a coarse rather than fine level, but I'm sure that's out there too.

This looks like a good overview if you want to read about it: https://recsysml.substack.com/p/two-tower-models-for-retriev...
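As a rough sketch of how user and item embeddings turn into recommendations: score every item by its dot product with the user vector and keep the top few. The vectors and IDs below are made up for illustration, and a real system would use learned embeddings plus an approximate nearest-neighbor index rather than a full scan:

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, item_vecs, k=2):
    """Return the k item IDs whose embeddings score highest against query_vec."""
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: dot(query_vec, item_vecs[item_id])
    )


# Hypothetical 3-dimensional item embeddings and one user embedding.
items = {
    "video_1": [0.9, 0.1, 0.0],
    "video_2": [0.1, 0.8, 0.1],
    "video_3": [0.7, 0.2, 0.1],
}
user = [1.0, 0.0, 0.0]

recs = top_k(user, items, k=2)
print(recs)
```

Swapping the user vector for a query vector gives search results instead of recommendations; the scoring step is the same.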

WinLychee commented on A case for dynamic scoring of high-skilled immigration slowboring.com/p/a-seriou... · Posted by u/btilly

ChadNauseam · 2 years ago

This is very similar to an idea proposed by the late economist William Vickrey. He suggested having an auction for visas, and distributing the revenue generated by the auction evenly between every citizen. This would somewhat align the incentives between citizens and immigrants without introducing feel-good steps like scholarship subsidies.

Realistically, there aren't very many Americans who could become engineers if only they had a little more scholarship money. Those Americans can mostly get loans to pay for college, then pay off their loans using all the extra money they make now that they're engineers. If you have any empirical data that contradicts this I'd be interested in seeing it.

WinLychee · 2 years ago

IMO there are many Americans who would work in IT/Tech if you paid them to do it, but the risk calculation doesn't currently make sense. If you're an adult making $25-35/hour in your current job, just meeting rent/utilities/obligations, it's hard to accept going back to school for several years to complete a Bachelor's degree, with zero guarantee of employment, but a definite guarantee of debt on the order of $60K (taking a cheaper option). This is also true for those lower on the socioeconomic totem pole, whose parents are not going to pay for them to go to school. We've seen the result of making student loans widely available: there are many under-employed Americans in debt.

Numerically I agree with you: the debt load is worth the risk if you're specifically going for software/IT, but the risk is not zero.