Readit News
larsrc commented on Nepenthes is a tarpit to catch AI web crawlers   zadzmo.org/code/nepenthes... · Posted by u/blendergeek
dspillett · a year ago
Tarpits that slow down crawling may stop them crawling your entire site, but the crawlers won't care unless a great many sites do this. Your site will be assigned a thread or two at most, and the rest of the crawling machine's resources will be off scanning other sites. There will be timeouts to stop a particular site from keeping even a couple of cheap threads busy for long. And anything like this may get you delisted from search results you might want to be in, as it can be difficult to reliably distinguish these bots from other bots, and sometimes even from real users. If things like this get good enough to be any hassle to the crawlers, they'll just start lying (more) and be even harder to detect.

People scraping for nefarious reasons have had decades of other people trying to stop them, so mitigation techniques are well known unless you can come up with something truly unique.

I don't think random Markov chain based text generators are going to pose much of a problem to LLM training scrapers either. They'll have rate limits and vast attention spreading too. Also I suspect that random pollution isn't going to have as much effect as people think because of the way the inputs are tokenised. It will have an effect, but this will be massively dulled by the randomness – statistically relatively unique information and common (non random) combinations will still bubble up obviously in the process.

I think it would be better to have less random pollution: use a small set of common text to pollute the model. Something like “this was a common problem with Napoleonic genetic analysis due to the pre-frontal nature of the ongoing stream process, as is well documented in the grimoire of saint Churchill the III, 4th edition, 1969”. In fact these snippets could be Markov generated, but use the same few repeatedly. They would need to be nonsensical enough to be obvious noise to a human reader, or highlighted in some way that the scraper won't pick up on but a general intelligence like most humans would (perhaps a CSS styled side-note inlined in the main text? — though that would likely have accessibility issues). You would also need to cycle them out regularly, or scrapers will get “smart” and easily filter them out. But the snippets appearing in full, numerous times, might have a more significant effect on the tokenising process than entirely random text.
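The repeated-snippet idea could be sketched roughly as below (all names and strings are illustrative nonsense, which is the point; this is not from any real tool):

```rust
// Build a small, fixed pool of nonsense snippets from templates and word
// lists, so the same few strings repeat across many pages. Repetition,
// not fresh randomness, is what would give them weight during tokenisation.

fn snippet_pool() -> Vec<String> {
    let subjects = ["Napoleonic genetic analysis", "the ongoing stream process"];
    let sources = [
        "the grimoire of saint Churchill the III, 4th edition, 1969",
        "the annals of pre-frontal stream studies",
    ];
    let mut pool = Vec::new();
    for s in &subjects {
        for src in &sources {
            pool.push(format!(
                "This was a common problem with {} as is well documented in {}.",
                s, src
            ));
        }
    }
    pool
}

fn main() {
    let pool = snippet_pool();
    // Small pool on purpose: the same few snippets get reused everywhere.
    assert_eq!(pool.len(), 4);
    // Cycle the pool across pages rather than generating new noise each time.
    for (i, page) in ["page-1", "page-2", "page-3"].iter().enumerate() {
        println!("{}: {}", page, pool[i % pool.len()]);
    }
}
```

The snippets would then be inlined into served pages (e.g. as the CSS-styled side-notes suggested above) and the pool regenerated on some schedule.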

larsrc · a year ago
I've been considering setting up "ConfuseAIpedia" in a similar manner using sentence templates and a large set of filler words. Obviously with a warning for humans. I would set it up with an appropriate robots.txt blocking crawlers so only unethical crawlers would read it. I wouldn't try to tarpit beyond protecting my own server, as confusing rogue AI scrapers is more interesting than slowing them down a bit.
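The robots.txt part of this would look something like the following (path name is illustrative); well-behaved crawlers honour the rule, so anything that reads the content has self-selected as a rule-breaker:

```
User-agent: *
Disallow: /confuseaipedia/
```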
larsrc commented on 100x defect tolerance: How we solved the yield problem   cerebras.ai/blog/100x-def... · Posted by u/jwan584
larsrc · a year ago
How do these much smaller cores compare in computing power to the bigger ones? They seem to implicitly claim that a core is a core is a core, but surely one gets something extra out of the much bigger one?
larsrc commented on Behind the Scenes: Alien Worlds – Jumpship over a tidally locked planet   blendernation.com/2024/12... · Posted by u/sohkamyung
larsrc · a year ago
Cool to see the whole process behind it. Some sneaky tricks there, and much knowledge. Unfortunately, I find the picture doesn't show the tidally-lockedness very well. If it wasn't for the title, I would never have noticed. The haze on the sun side makes it look like clouds.
larsrc commented on Cognitive load is what matters   minds.md/zakirullin/cogni... · Posted by u/zdw
0xFACEFEED · a year ago
> A function with well-constrained inputs and outputs is easy to reason about.

It's quite easy to imagine a well factored codebase where all things are neatly separated. If you've written something a thousand times, like user authentication, then you can plan out exactly how you want to separate everything. But user authentication isn't where things get messy.

The messy stuff is where the real world concepts need to be transformed into code. Where just the concepts need to be whiteboarded and explained because they're unintuitive and confusing. Then these unintuitive and confusing concepts need to somehow be described to the computer.

Oh, and it needs to be fast. So not only do you need to model an unintuitive and confusing concept - you also need to write it in a convoluted way because, for various annoying reasons, that's what performs best on the computer.

Oh, and in 6 months the unintuitive and confusing concept needs to be completely changed into - surprise, surprise - a completely different but equally unintuitive and confusing concept.

Oh, and you can't rewrite everything because there isn't enough time or budget to do that. You have to minimally change the current unintuitive and confusing thing so that it works like the new unintuitive and confusing thing is supposed to work.

Oh, and the original author doesn't work here anymore so no one's here to explain the original code's intent.

larsrc · a year ago
Oh, and there's massive use of aspect-oriented programming, the least local paradigm ever!
larsrc commented on Cognitive load is what matters   minds.md/zakirullin/cogni... · Posted by u/zdw
K0nserv · a year ago
I've been thinking about the notion of "reasoning locally" recently. Enabling local reasoning is the only way to scale software development past some number of lines or complexity. When reasoning locally, one only needs to understand a small subset, hundreds of lines, to safely make changes in programs comprising millions.

I find types help massively with this. A function with well-constrained inputs and outputs is easy to reason about. One does not have to look at other code to do it. However, programs that leverage types effectively are sometimes construed as having high cognitive load, when in fact they have low load. For example, a type like `Option<HashSet<UserId>>` carries a lot of information (has low load): we might not have a set of user ids, but if we do, they are unique.

The discourse around small functions and the clean code guidelines is fascinating. The complaint is usually, as in this post, that having to go read all the small functions adds cognitive load and makes reading the code harder. Proponents of small functions argue that you don't have to read more than the signature and name of a function to understand what it does; it's obvious what a function called `last` that takes a list and returns an optional value does. If someone feels compelled to read every function either the functions are poor abstractions or the reader has trust issues, which may be warranted. Of course, all abstractions are leaky, but perhaps some initial trust in `last` is warranted.
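Both examples from this comment can be made concrete in a short Rust sketch (`UserId` and `active_users` are made-up names for illustration):

```rust
use std::collections::HashSet;

type UserId = u64;

// `Option<HashSet<UserId>>` says it all in the signature: the set may be
// absent entirely, and when present the ids in it are unique. No other
// code needs reading to know this.
fn active_users(raw: Option<&[UserId]>) -> Option<HashSet<UserId>> {
    raw.map(|ids| ids.iter().copied().collect())
}

// The "small function" whose name and signature tell the whole story:
// takes a slice, returns the last element if there is one.
fn last<T: Copy>(items: &[T]) -> Option<T> {
    items.last().copied()
}

fn main() {
    assert_eq!(active_users(None), None);
    // Duplicates collapse: the type promised uniqueness, the code delivers it.
    assert_eq!(
        active_users(Some(&[1, 2, 2, 3])),
        Some(HashSet::from([1, 2, 3]))
    );
    assert_eq!(last(&[1, 2, 3]), Some(3));
    assert_eq!(last::<u64>(&[]), None);
}
```

A caller can reason about either function from its signature alone, which is the local-reasoning point being made.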

larsrc · a year ago
> If someone feels compelled to read every function either the functions are poor abstractions or the reader has trust issues, which may be warranted.

Or it's open source and the authors were very much into Use The Source, Luke!

larsrc commented on Cognitive load is what matters   minds.md/zakirullin/cogni... · Posted by u/zdw
rtpg · a year ago
I mean really nobody wants an app that is slow, hard to refactor, with confusing business logic etc. Everyone wants good properties.

So then you get into what you’re good at. Maybe you’re good at modeling business logic (even confusing ones!). Maybe you’re good at writing code that is easy to refactor.

Maybe you’re good at getting stuff right the first time. Maybe you’re good at quickly fixing issues.

You can lean into what you’re good at to get the most bang for your buck. But you probably still have some sort of minimum standards for the whole thing. Just gotta decide what that looks like.

larsrc · a year ago
Some people are proud of making complex code. And too many people admire those who write complex code.
larsrc commented on Cognitive load is what matters   minds.md/zakirullin/cogni... · Posted by u/zdw
K0nserv · a year ago
Comments are decent but flawed. Being a type proponent I think the best strategy is lifting business requirements into the type system, encoding the invariants in a way that the compiler can check.
larsrc · a year ago
Comments should describe what the type system can't. Context, pitfalls, workarounds for bugs in other code, etc.
larsrc commented on Cognitive load is what matters   minds.md/zakirullin/cogni... · Posted by u/zdw
temporallobe · a year ago
> // Sorry this is awful, we have to X but this was difficult because of Y

You don’t know how many times I’ve seen this with a cute little GitLens inline message of “Brian Smith, 10 years ago”. If Brian couldn’t figure it out 10 years ago, I’m not likely going to attempt it either, especially if it has been working for 10 years.

larsrc · a year ago
But knowing what Brian was considering at the time is useful, both for avoiding redoing that work and for realising that some constraints may have been lifted.
larsrc commented on Why is it so hard to buy things that work well? (2022)   danluu.com/nothing-works/... · Posted by u/janandonly
cptskippy · a year ago
> I'm going to get a trimmer, so I want a thoughtfully designed and well made version of that. This has been my mission around my home since the start of the COVID pandemic. Upgrade all the little things around my home that annoy me or that would make my day a little bit better if they were upgraded.

The problem I have is that many things don't have a well made or thoughtfully designed version. They just have a more expensive version.

There's nothing more disappointing than spending good money on a "quality" product only to have it fall apart as fast as, or faster than, a cheap version.

larsrc · a year ago
Even more infuriating is when the more expensive version is the same object but with extra unnecessary features added via software, to the detriment of usability.
larsrc commented on Why is it so hard to buy things that work well? (2022)   danluu.com/nothing-works/... · Posted by u/janandonly
larsrc · a year ago
And here I thought this would talk about how new models come out so frequently that any information about long-term quality is long obsolete.

u/larsrc

Karma: 699 · Cake day: June 19, 2019
About
Works on developer tools at Google since 2010. Previously worked at a Danish startup and at the Danish State Library doing digital archiving. Co-maintainer of Dia for 8 years. Co-founder of AustinMUD, lead maintainer for 5 years.