Readit News
abhgh commented on A linear-time alternative for Dimensionality Reduction and fast visualisation   medium.com/@roman.f/a-lin... · Posted by u/romanfll
lmeyerov · 3 days ago
We generally run UMAP on regular semi-structured data like database query results. We automatically feature encode that for dates, bools, low-cardinality vals, etc. If there is text, and the right libs available, we may also use text embeddings for those columns. (cucat is our GPU port of dirtycat/skrub, and pygraphistry's .featurize() wraps around that).
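
Roughly, the shape of that encode-then-reduce step (a minimal sketch, not our actual code, assuming skrub's TableVectorizer and umap-learn; the column names are made up):

    import numpy as np
    import pandas as pd
    import umap                      # umap-learn
    from skrub import TableVectorizer

    # Stand-in for a semi-structured query result.
    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "created_at": pd.to_datetime("2024-01-01")
                      + pd.to_timedelta(rng.integers(0, 90, n), unit="D"),
        "is_admin": rng.integers(0, 2, n).astype(bool),
        "country": rng.choice(["US", "DE", "IN"], n),   # low-cardinality value
        "bytes_sent": rng.lognormal(8, 1, n),
    })

    # Automatic feature encoding of dates, bools, categoricals, numerics.
    X = TableVectorizer().fit_transform(df)

    # 2-D layout for visualization.
    xy = umap.UMAP(n_components=2).fit_transform(X)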

My last sentence was about the more valuable problems: we are finding it makes sense to go straight to GNNs, LLMs, etc. and embed multidimensional data that way rather than via UMAP dim reductions. We can still use UMAP as a generic hammer to control further dimensionality reduction, but the 'hard' part would be handled by the model. With neural graph layouts, we can potentially skip UMAP for that too.

Re: PaCMAP, we have been eyeing several new tools here, but so far haven't felt the need internally to move from UMAP to them. We'd need to see significant improvements, given that the quality engineering in UMAP has set the bar high. In theory I can imagine some tools doing better in the future, but their creators haven't made the engineering investment, so internally we'd rather stay with UMAP. We make our API pluggable, so you can pass in results from other tools, and we haven't heard much from that path from others.

abhgh · 3 days ago
Thank you. Your comment about using LLMs to semantically parse diverse data as a first step makes sense. Come to think of it, in prompt optimization too - e.g., MIPROv2 [1] - the LLM is used to create initial prompt guesses based on its understanding of the data. And I agree that UMAP still works well out of the box and has remained pretty much unchanged since its introduction.

[1] Section C.1 in the Appendix here https://arxiv.org/pdf/2406.11695

abhgh commented on A linear-time alternative for Dimensionality Reduction and fast visualisation   medium.com/@roman.f/a-lin... · Posted by u/romanfll
lmeyerov · 3 days ago
Fwiw, we are heavy UMAP users (pygraphistry), and find UMAP on CPU fine for interactive use at up to 30K rows and on GPU at up to 100K rows; beyond 100K rows we generally switch to a trained mode. Our use case is often highly visual - seeing correlations, and linking similar entities together into explorable & interactive network diagrams. For headless use, like daily anomaly detection, we do this at much larger scales.

We see a lot of wide social, log, and cyber data where this works, anywhere from 5-200 dim. Our bio users are trickier, as we can have 1K+ dimensions pretty fast. We find success there too, and mostly get into preconditioning tricks for those.

At the same time, I'm increasingly thinking of learning neural embeddings in general for these instead of traditional clustering algorithms. As scales go up, the performance argument here goes up too.

abhgh · 3 days ago
I was not aware this existed and it looks cool! I am definitely going to take out some time to explore it further.

I have a couple of questions for now: (1) I am confused by your last sentence. It seems you're saying embeddings are a substitute for clustering. My understanding is that you usually apply a clustering algorithm over embeddings - good embeddings just ensure that the grouping produced by the clustering algo "makes sense".

(2) Have you tried PaCMAP [1]? I found it to produce high-quality results quickly when I tried it. I haven't tried it in a while though - and I vaguely remember it wouldn't install properly on my machine (a Mac) the last time I reached for it. Their group has some new stuff coming out too (on the linked page).

[1] https://github.com/YingfanWang/PaCMAP
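
In case it's useful, roughly how I'd wire the two together - PaCMAP to reduce, then a clustering algorithm over the result (just a sketch, assuming the pacmap and scikit-learn packages; the data is random):

    import numpy as np
    import pacmap
    from sklearn.cluster import KMeans

    X = np.random.rand(2000, 128)          # stand-in for some embeddings

    # Reduce with PaCMAP (their examples use a PCA init, IIRC).
    emb2d = pacmap.PaCMAP(n_components=2, n_neighbors=10).fit_transform(X, init="pca")

    # ...then cluster over the embeddings; the embedding quality is what
    # makes the resulting groups "make sense".
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(emb2d)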

abhgh commented on Algorithms for Optimization [pdf]   algorithmsbook.com/optimi... · Posted by u/Anon84
cchianel · 18 days ago
I haven't; from a quick reading, InfoBax is for when you have an expensive function and want to do limited evaluations. Timefold works with cheap functions and does many evaluations. Timefold does this via Constraint Streams, so a function like:

    // Naive scoring: penalize every pair of overlapping shifts
    // assigned to the same employee.
    var score = 0;
    for (var shiftA : solution.getShifts()) {
        for (var shiftB : solution.getShifts()) {
            if (shiftA != shiftB && shiftA.getEmployee() == shiftB.getEmployee() && shiftA.overlaps(shiftB)) {
                score -= 1;
            }
        }
    }
    return score;
would usually take shift * shift evaluations of overlaps; with Constraint Streams, we only check the shifts affected by a change (turning it from O(N^2) into O(1), usually).
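
As a toy illustration of the incremental idea (plain Python, not Timefold's API, with made-up Shift fields) - only the pairs involving the changed shift get re-examined:

    from dataclasses import dataclass

    @dataclass
    class Shift:
        employee: str
        start: int
        end: int

        def overlaps(self, other):
            return self.start < other.end and other.start < self.end

    def pair_conflicts(shift, shifts):
        """Other shifts that conflict with `shift` (same employee, overlapping)."""
        return sum(1 for other in shifts
                   if other is not shift
                   and other.employee == shift.employee
                   and shift.overlaps(other))

    def incremental_rescore(old_score, shifts, shift, apply_change):
        """Update the score after changing only `shift`.

        The naive double loop above redoes all |shifts|^2 checks; here only
        pairs involving the changed shift are re-examined. (The scan is still
        O(N); with an index keyed on employee it becomes ~O(1).) The factor
        of 2 is because the naive loop counts each conflicting pair once
        from each side.
        """
        before = pair_conflicts(shift, shifts)
        apply_change(shift)              # e.g. reassign it to another employee
        after = pair_conflicts(shift, shifts)
        return old_score + 2 * (before - after)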

That being said, it might be useful for a move selector. I need to give it a more in-depth read.

abhgh · 18 days ago
Thanks for the example. Yes, true, this is for expensive functions - to be precise, functions that depend on data that is hard to gather, so you interleave computing the function's value with strategically gathering just as much data as is needed. The video on their page [1] is quite illustrative: calculating the shortest path on a graph whose edge weights are expensive to obtain. Note how the edge weights they end up querying form a narrow band around the shortest path they find.

[1] https://willieneis.github.io/bax-website/

abhgh commented on Algorithms for Optimization [pdf]   algorithmsbook.com/optimi... · Posted by u/Anon84
cchianel · 18 days ago
Some additional optimization resources (for metaheuristics, where you only have the objective/score function and no derivative):

- "Essentials of Metaheuristics" by Sean Luke https://cs.gmu.edu/~sean/book/metaheuristics/

- "Clever Algorithms" by Jason Brownlee https://cleveralgorithms.com/

Timefold uses the metaheuristic algorithms in these books (Tabu Search, Late Acceptance, Simulated Annealing, etc.) to find near-optimal solutions quickly from a score function (typically defined in a Java stream-like/SQL-like syntax so the score can be calculated incrementally, which speeds up evaluation).
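
As an aside, these algorithms need surprisingly little beyond the score function; a toy sketch of Late Acceptance in Python (not Timefold's implementation) looks roughly like:

    import random

    def late_acceptance(initial, neighbor, score, steps=20_000, history=200):
        """Accept a move if it is no worse than the current score,
        or no worse than the accepted score from `history` steps ago."""
        current, current_score = initial, score(initial)
        best, best_score = current, current_score
        late = [current_score] * history
        for i in range(steps):
            candidate = neighbor(current)            # small random change
            candidate_score = score(candidate)
            idx = i % history
            if candidate_score >= current_score or candidate_score >= late[idx]:
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = current, current_score
            late[idx] = current_score
        return best, best_score

    # Toy usage: maximize -(x - 3)^2 with random local moves.
    best_x, _ = late_acceptance(0.0,
                                lambda x: x + random.uniform(-1, 1),
                                lambda x: -(x - 3) ** 2)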

You can see simplified diagrams of these algorithms in action in Timefold's docs: https://docs.timefold.ai/timefold-solver/latest/optimization....

Disclosure: I work for Timefold.

abhgh · 18 days ago
Timefold looks very interesting. This might be irrelevant but have you looked at stuff like InfoBax [1]?

[1] https://willieneis.github.io/bax-website/

abhgh commented on Terence Tao: At the Erdos problem website, AI assistance now becoming routine   mathstodon.xyz/@tao/11559... · Posted by u/dwohnitmok
fastasucan · 25 days ago
How do you know it's correct? And how do you learn to engage with a theory-heavy subject doing it this way?
abhgh · 24 days ago
You don't - the way I use LLMs for explanations is that I keep going back and forth between the LLM's explanation and Google search/Wikipedia. And of course, asking the LLM to cite sources helps.

This might sound cumbersome, but without the LLM I wouldn't have (1) known what to search for, (2) in a way that lets me incrementally build a mental model. So it's a net win for me. The only gap I see is coverage/recall: when asked for different techniques to accomplish something, the LLM might miss some - and what is missed depends on the specific LLM. My solution there is to ask multiple LLMs and go back to Google search.

abhgh commented on Awk Technical Notes (2023)   maximullaris.com/awk_tech... · Posted by u/signa11
dietrichepp · a month ago
Awk is still one of my favorite tools because its power is underestimated by nearly everyone I see using it.

    ls -l | awk '{print $3}'
That’s typical usage of Awk, where you use it in place of cut because you can’t be bothered to remember the right flags for cut.

But… Awk, by itself, can often replace entire pipelines. Reduce your pipeline to a single Awk invocation! The only drawback is that very few people know Awk well enough to do this, and this means that if you write non-trivial Awk code, nobody on your team will be able to read it.

Every once in a while, I write some tool in Awk or figure out how to rewrite some pipeline as Awk. It’s an enrichment activity for me, like those toys they put in animal habitats at the zoo.

abhgh · a month ago
Love awk. In the early days of my career I used to write ETL pipelines, and awk helped me condense a lot of stuff into a small number of LOC. I particularly prided myself on writing terse one-liners (some probably undecipherable, ha!), but did occasionally write scripts. Now I mostly reach for Python.
abhgh commented on Claude Haiku 4.5   anthropic.com/news/claude... · Posted by u/adocomplete
cromulen · 2 months ago
That's what you get when you use speculative decoding and focus/overfit the draft model on coding. When the answer is out of distribution for the draft model, you get more token rejections by the main model and throughput suffers. This probably still makes sense for them if they expect a lot of their load to come from Claude Code and they need to make it economical.
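
For anyone unfamiliar, a stripped-down sketch of the dynamic being described (greedy verification only, and assuming `target_next`/`draft_next` helpers; real implementations verify all draft tokens in one batched pass with a probabilistic acceptance rule). When the draft is out of distribution, few proposals survive, so each large-model round emits ~1 token and throughput drops:

    def speculative_round(target_next, draft_next, prefix, k=4):
        """One round of (greedy) speculative decoding.

        target_next / draft_next: assumed helpers that map a token list to
        the next token under the large / draft model respectively. Returns
        the tokens emitted this round - the more draft proposals the target
        accepts, the fewer large-model rounds you pay per emitted token.
        """
        # Draft model cheaply proposes k tokens.
        ctx, proposed = list(prefix), []
        for _ in range(k):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)

        # Target verifies; keep the longest agreeing prefix.
        # (In practice this is one batched forward pass, not a loop.)
        ctx, emitted = list(prefix), []
        for t in proposed:
            expected = target_next(ctx)
            if expected == t:
                emitted.append(t)         # accepted draft token, nearly free
                ctx.append(t)
            else:
                emitted.append(expected)  # rejection: fall back to the target
                break
        return emitted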
abhgh · 2 months ago
I'm curious to know if Anthropic mentions anywhere that they use speculative decoding. For OpenAI they do seem to use it based on this tweet [1].

[1] https://x.com/stevendcoffey/status/1853582548225683814

abhgh commented on Let's Take Esoteric Programming Languages Seriously   feelingof.com/episodes/07... · Posted by u/strombolini
gosub100 · 2 months ago
Forgive my ignorance about AI, but has anyone tried a "nondeterministic" language that somehow uses learning to approximate the answer? I'm not talking about the current cycle where you train your model on zillions of inputs, tune it, and release it. I mean a language where you tell it what a valid output looks like, and deploy it. And let it learn as it runs.

Ex: my car's heater doesn't work the moment you turn it on. So if I enter the car, one of my first tasks is to turn the blower down to 0 until the motor warms up. A learning language could be used here, given free rein over all the (non-safety-critical) controls, and told that its job is to minimize the number of "corrections" made by the user. Eventually its reward would be gained by initializing the fan blower to 0, but it might take 100 cycles to learn this. Rather than train it on a GPU, a language could express the reward and allow it to learn over time, even though its output would be "wrong" quite often.

That's an esoteric language I'd like to see.
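
(For what it's worth, a toy sketch of that reward setup in Python - assuming discrete blower levels and a simulated driver who corrects any nonzero startup level once:)

    import random

    def learn_startup_blower(levels=(0, 1, 2, 3, 4), episodes=200, eps=0.1):
        """Toy epsilon-greedy learner for the heater example.

        Reward = -(corrections the driver makes). The simulated driver here
        (an assumption for illustration) turns any nonzero startup level
        down once, so the learner converges on starting at 0.
        """
        value = {lvl: 0.0 for lvl in levels}
        count = {lvl: 0 for lvl in levels}
        for _ in range(episodes):
            lvl = (random.choice(levels) if random.random() < eps
                   else max(value, key=value.get))
            reward = 0 if lvl == 0 else -1       # corrections made this trip
            count[lvl] += 1
            value[lvl] += (reward - value[lvl]) / count[lvl]   # running mean
        return max(value, key=value.get)

    print(learn_startup_blower())                # -> 0 after enough trips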

abhgh · 2 months ago
Wouldn't this be an optimization problem - that is, something z3 should be able to do [1], [2]?

I was about to suggest probabilistic programming, e.g., PyMC [3], as well, but it looks like you want the optimization to occur autonomously after you've specified the problem - which is different from the program drawing insights from organically accumulated data.

[1] https://github.com/Z3Prover/z3?tab=readme-ov-file

[2] https://microsoft.github.io/z3guide/programming/Z3%20Python%...

[3] https://www.pymc.io/welcome.html
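
To make the z3 suggestion concrete, its Optimize interface lets you state constraints plus an objective and ask for a model (toy numbers; just a sketch):

    from z3 import Optimize, Int, sat

    opt = Optimize()
    x, y = Int("x"), Int("y")

    # Constraints describe what a valid output looks like...
    opt.add(x >= 0, y >= 0, x + y <= 10)
    # ...and the objective says what to optimize.
    opt.maximize(3 * x + 2 * y)

    if opt.check() == sat:
        print(opt.model())    # e.g. [x = 10, y = 0]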

abhgh commented on Show HN: Traceroute Visualizer   kriztalz.sh/traceroute-vi... · Posted by u/PranaFlux
thelastgallon · 3 months ago
Traceroute isn't real, or: Whoops! Everyone Was Wrong Forever: https://gekk.info/articles/traceroute.htm
abhgh · 3 months ago
Hadn't seen this before, very nice read, thank you!

u/abhgh

Karma: 829 · Cake day: July 6, 2012
About
https://blog.quipu-strands.com/