Readit News
m-i-l commented on Search My Site – open-source search engine for personal and independent websites   searchmysite.net... · Posted by u/OuterVale
1dom · 5 months ago
I like this, thank you! I just lost an hour of time to the exact sort of random but considered personal websites that I think made the Web great in the first place.
m-i-l · 5 months ago
Thanks for the great feedback:-) This is what searchmysite.net is attempting to do - help make "surfing the web" a fun leisure activity once more. It is good to see more people seem to get that point now. When it was on HN nearly 3 years ago[0], many people saw a search box and thought it must be a Google replacement, but were disappointed to find it wasn't. And I guess now more than ever it is useful to have a way of finding content on the web which has been made by humans rather than AI.

[0] https://news.ycombinator.com/item?id=31395231

m-i-l commented on Search My Site – open-source search engine for personal and independent websites   searchmysite.net... · Posted by u/OuterVale
1dom · 5 months ago
Best of both worlds:

> No results found for "digiatl". Did you mean to search for "digital" instead?

m-i-l · 5 months ago
At a big corporate, we had an Apache Solr based search with some reasonably clever lemmatization, stats analysis and spell check config to suggest alternative searches if not many results were found for the original query. One day someone reported an unfortunate edge case which caused a bit of a panic: if you searched "annual report" it returned "did you mean anal report?" (we were in the finance sector rather than the medical sector, but there were a lot more documents in the corpus containing words like analysts, analysis, analytics etc). Anyway, the point is yes, it is great to have that sort of functionality, but it does come at a cost, and a small project like this might prefer to keep it simple.
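
To illustrate the kind of "did you mean" logic described above, here is a minimal sketch. It is not Solr's spell check component: it uses Python's standard `difflib` for fuzzy matching, and the function name `did_you_mean`, the `min_hits` threshold and the toy term counts are all assumptions made for the example. It only suggests an alternative when the original query has too few hits, and it prefers the candidate that is most frequent in the corpus, which is exactly the step that can produce surprising suggestions when the corpus is skewed.

```python
import difflib
from collections import Counter


def did_you_mean(query, term_counts, min_hits=1, cutoff=0.8):
    """Return a suggested term, or None if no suggestion is needed or found.

    term_counts maps each indexed term to how often it occurs in the corpus
    (hypothetical data; a real search engine would read this from its index).
    """
    # Only suggest an alternative when the original query has too few hits.
    if term_counts.get(query, 0) >= min_hits:
        return None
    # Collect terms that look similar to the query.
    candidates = difflib.get_close_matches(query, list(term_counts), n=5, cutoff=cutoff)
    if not candidates:
        return None
    # Prefer the similar term that occurs most often in the corpus.
    return max(candidates, key=lambda term: term_counts[term])


# Toy corpus statistics, invented for illustration.
counts = Counter({"digital": 12, "digit": 3, "garden": 7})
suggestion = did_you_mean("digiatl", counts)
print(suggestion)
```

Because the final choice is driven by corpus frequency, the suggestions are only as sensible as the vocabulary they are drawn from.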
m-i-l commented on Search My Site – open-source search engine for personal and independent websites   searchmysite.net... · Posted by u/OuterVale
Sophira · 5 months ago
According to the site, the funding comes from its "Search as a Service" feature[0], where anybody can pay them in order to have a search service focused on their site (which does not have to be in the public index and thus doesn't have to be personal/independent).

So, in the sense that the funding comes (or aims to come) from larger companies, you are correct. It's not VC, but it does seem like it could end up relying on payments from large companies, making it potentially vulnerable.

[0] https://searchmysite.net/pages/about/#search-as-a-service

m-i-l · 5 months ago
That's right. Most search engines are funded by advertising, where there is a clear conflict of interest[0], not to mention an incentive for spam etc. Alternative models include a subscription fee (which I don't think would work for a small niche search like this) and donations (which may or may not be sustainable). Looking through some of the support forums for the big search engines, I'm pretty sure that enough site owners would pay a fee for support to cover the running costs of a large search engine, although for a smaller search engine like this there needs to be something more than just support, hence the search as a service features.

[0] "Advertising funded search engines will be inherently biased towards the advertisers and away from the needs of consumers", to quote Sergey Brin and Lawrence Page in their "The Anatomy of a Large-Scale Hypertextual Web Search Engine" paper from 1998.

m-i-l commented on Search My Site – open-source search engine for personal and independent websites   searchmysite.net... · Posted by u/OuterVale
ThinkBeat · 5 months ago
I am a bit confused. Solr is the search engine.

An LLM model is loaded. What does the LLM model add to the solution?

m-i-l · 5 months ago
The LLM was for an experiment in retrieval augmented generation, i.e. "a chat with your website" style interface, using Apache Solr as the vector store. Results (on a small self-hosted LLM to keep costs manageable) weren't good enough for the functionality to be fully rolled out, so the LLM has been disabled and is likely to be fully removed.
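
As a rough illustration of the retrieval step in that kind of "chat with your website" setup, here is a minimal sketch. It is not the searchmysite.net code: the in-memory `STORE` list stands in for the vector store, the two-dimensional vectors are invented toy embeddings, and the names `cosine`, `top_k` and `build_prompt` are hypothetical. The idea is simply to find the passages closest to the question and put them in front of the question before handing the prompt to an LLM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k passages whose vectors are closest to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical stand-in for the vector store, with toy 2-dimensional embeddings.
STORE = [
    ("About page: this site is a personal blog.", [0.9, 0.1]),
    ("Post: notes on growing tomatoes.", [0.1, 0.9]),
]

prompt = build_prompt("What is this site?", top_k([1.0, 0.0], STORE, k=1))
print(prompt)
```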
m-i-l commented on Search My Site – open-source search engine for personal and independent websites   searchmysite.net... · Posted by u/OuterVale
kreelman · 5 months ago
Thanks for putting this together. I wonder, is Postgres a bit of a large DB if it's just a personal website search tool? I'll have to give it a go. We need more tools like this.
m-i-l · 5 months ago
Postgres is just used for the site admin, i.e. keeping track of submissions, review status, subscriptions etc. The actual search index is in Apache Solr. In theory you could use Solr to store all the admin data, but it is generally not recommended to use a Solr style document store as the master copy of your data. I guess something more lightweight like SQLite could be used, but it is intended to be deployed on servers and Postgres isn't too resource intensive.
m-i-l commented on In Praise of Print: Reading Is Essential in an Era of Epistemological Collapse   lithub.com/in-praise-of-p... · Posted by u/bertman
m-i-l · 9 months ago
A couple of references to the Nazis, but no reference to the Nazi book burnings, an incredibly symbolic physical manifestation of knowledge and information destruction, which I'd have thought would be very relevant in this context, i.e. in the praise of physical books? Perhaps it wasn't mentioned because it doesn't quite fit in with the narrative of digital being all bad, given digital knowledge can be more resistant to suppression and physical destruction.

Also some great quotes from 30 years ago, e.g. Carl Sagan's "when awesome technological powers are in the hands of the very few" the nation would "slide, almost without noticing, back into superstition and darkness". But did it actually have to end up this way? And is it still possible (with enough collective willpower) to push Big Tech profiteering back enough to deliver some of the society enhancing changes originally envisioned in the mid-1990s? Just as it took decades for the full positive implications of the invention of the printing press to come to fruition, perhaps we still need more time before we decry the internet as a net negative?

m-i-l commented on Chi-fi tuning – Why it sounds piercing to Western ears (2020)   audioreviews.org/chi-fi-t... · Posted by u/userbinator
matthewmorgan · 9 months ago
I once saw a short YouTube clip of some kind of Chinese street music / singing performed by old men. It was ear piercing and weird and also strangely fascinating. I'll never be able to find it again.
m-i-l · 9 months ago
My children were given a soft toy a few years back from a relative who had bought it from a Chinese street market while on holiday in China. When it was switched on it jumped about frantically and sang a very loud and shrill song. Not 100% sure which language it was, but it is entirely possible it was some form of Chinese street music, and certainly fits the article's description of "Mainland Chinese recordings" as "shouty, harsh and ear-piercing". Normally my children love things that adults find annoying, but even they were afraid of this one.
m-i-l commented on Ask HN: Website with 6^16 subpages and 80k+ daily bots    · Posted by u/damir
superkuh · 10 months ago
I did a $ find . -type f | wc -l in my ~/www I've been adding to for 24 years and I have somewhere around 8,476,585 files (not counting the ~250 million 30kb png tiles I have for 24/7/365 radio spectrogram zoomable maps since 2014). I get about 2-3k bot hits per day.

Today's named bots: GPTBot => 726, Googlebot => 659, drive.google.com => 340, baidu => 208, Custom-AsyncHttpClient => 131, MJ12bot => 126, bingbot => 88, YandexBot => 86, ClaudeBot => 43, Applebot => 23, Apache-HttpClient => 22, semantic-visions.com crawler => 16, SeznamBot => 16, DotBot => 16, Sogou => 12, YandexImages => 11, SemrushBot => 10, meta-externalagent => 10, AhrefsBot => 9, GoogleOther => 9, Go-http-client => 6, 360Spider => 4, SemanticScholarBot => 2, DataForSeoBot => 2, Bytespider => 2, DuckDuckBot => 1, SurdotlyBot => 1, AcademicBotRTU => 1, Amazonbot => 1, Mediatoolkitbot => 1,

m-i-l · 10 months ago
Those are the good bots, which say who they are, probably respect robots.txt, and appear on various known bot lists. They are easy to deal with if you really want. But in my experience it is the bad bots you're more likely to want to deal with, and those can be very difficult, e.g. pretending to be browsers, coming from residential IP proxy farms, mutating their fingerprint too fast to appear on any known bot lists, etc.
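
To make the distinction concrete, here is a minimal sketch of the easy case only: spotting bots that announce themselves in the User-Agent header. The token list is a small sample of the names quoted above, and the function name `self_identified_bot` is hypothetical. A check like this does nothing against the bad bots described in this comment, since a bot pretending to be a browser simply falls through as "not a bot".

```python
from typing import Optional

# A few self-identifying bot tokens, taken from the list quoted above.
KNOWN_BOT_TOKENS = ("GPTBot", "Googlebot", "bingbot", "YandexBot", "ClaudeBot", "Applebot")


def self_identified_bot(user_agent: str) -> Optional[str]:
    """Return the matching bot token if the User-Agent names a known bot, else None."""
    ua = user_agent.lower()
    for token in KNOWN_BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None


googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"

print(self_identified_bot(googlebot_ua))
print(self_identified_bot(browser_ua))
```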
m-i-l commented on I built a 20k watt microwave oven [video]   youtube.com/watch?v=mg79n... · Posted by u/surprisetalk
m-i-l · 10 months ago
I used to work in an office which briefly had a commercially available 4kW microwave in the coffee area. I used to like it because it was fast. Unfortunately several other people failed to appreciate that you had to take the 800W timings and divide by 5, and it was quickly removed after several people set fire to their food.
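
The "divide by 5" rule is just the power ratio, 4000W / 800W = 5. As a small worked sketch, assuming the energy delivered to the food (power multiplied by time) should stay the same and ignoring real-world effects such as uneven heating, the function below (a hypothetical name, for illustration only) scales a packet timing to a different oven wattage.

```python
def scaled_seconds(label_seconds: float, label_watts: float, oven_watts: float) -> float:
    """Scale a cooking time so that power x time stays constant."""
    if label_watts <= 0 or oven_watts <= 0:
        raise ValueError("wattages must be positive")
    return label_seconds * label_watts / oven_watts


# 2.5 minutes at 800W becomes 30 seconds at 4kW.
print(scaled_seconds(150, 800, 4000))  # 30.0
```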

u/m-i-l

Karma: 3940
Cake day: July 30, 2014
About
Personal website: https://michael-lewis.com/

Side project: https://searchmysite.net/
