cipherself (u/cipherself)

cipherself commented on Capsudo: Rethinking sudo with object capabilities ariadne.space/2025/12/12/... · Posted by u/fanf2

Veserv · 6 days ago

No, you should run every program with only the privileges it needs. The very concept of running your programs with all your privileges as a user by default is wrong-headed to begin with. To strain the "user" model you should have a distinct "user" for every single program which has only the resources and privileges needed by/allocated to that program. The actual user can allocate their resources to these "users" as needed. This is a fairly primitive version of the idea due to having to torture fundamentally incompatible insecure building blocks to fit, but points in the direction of the correct idea.

cipherself · 6 days ago

I have used systemd services to do this before. I created a user specifically for the application, defined the capabilities the application needed via CapabilityBoundingSet and AmbientCapabilities [0], and used a lot of what is in [1] to restrict it, e.g. the sandboxing facilities, restricting the allowed syscalls [2], etc. systemd also comes with a useful command, `systemd-analyze security` [3].

[0] https://www.freedesktop.org/software/systemd/man/latest/syst...

[1] https://www.freedesktop.org/software/systemd/man/latest/syst...

[2] https://www.freedesktop.org/software/systemd/man/latest/syst...

[3] https://www.freedesktop.org/software/systemd/man/latest/syst...

cipherself commented on Building a Simple Search Engine That Works karboosx.net/post/4eZxhBo... · Posted by u/freediver

nmstoker · a month ago

Reminds me of reading Programming Collective Intelligence by Toby Segaran, which inspired me with a range of things, like building search, recommenders, classifiers etc.

cipherself · a month ago

That was a great book, I wonder what the 2025 equivalent of it is...

cipherself commented on Show HN: Spam classifier in Go using Naive Bayes github.com/igomez10/nspam... · Posted by u/igomeza

cipherself · a month ago

12 (13?) years ago I also wrote a Naïve Bayes classifier in Perl: https://github.com/cipherself/NaiveBayes_perl

IIRC, next thing on my TODO list was to add vectorization. Also (like OP) it uses log probabilities to avoid floating-point underflow.
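
To illustrate the log-probability point, here is a minimal Python sketch (not the code from either repo). The toy training set and the `train`, `log_scores`, and `classify` names are made up for the example, and add-one smoothing is an assumption rather than something either project is known to use.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (label, words). Returns label counts, word counts, vocabulary."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, words in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def log_scores(model, words):
    """Sum log prior and log likelihoods instead of multiplying probabilities."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in words:
            # Add-one (Laplace) smoothing so unseen words do not give log(0).
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return scores

def classify(model, words):
    scores = log_scores(model, words)
    return max(scores, key=scores.get)

# Hypothetical toy data, only for illustration.
docs = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["free", "money"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["lunch", "at", "noon"]),
]
model = train(docs)
print(classify(model, ["free", "money", "now"]))

# Why logs: multiplying many small probabilities underflows to 0.0,
# while the equivalent sum of logs stays a finite negative number.
product = 1.0
for _ in range(100):
    product *= 1e-5
log_sum = 100 * math.log(1e-5)
print(product, log_sum)
```

Comparing summed logs picks the same winner as comparing the raw products would, because the logarithm is monotonic.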

cipherself commented on Heartbeats in Distributed Systems arpitbhayani.me/blogs/hea... · Posted by u/sebg

__turbobrew__ · a month ago

Does anyone have recommendations on books/papers/articles which cover gossip protocols?

I have been more interested in learning about gossip protocols and how they are used, different tradeoffs, etc.

cipherself · a month ago

10 years ago I implemented SCAMP (a gossip protocol) in Clojure. You might find it interesting; the implementation is quite small: https://github.com/cipherself/gossip

cipherself commented on How often does Python allocate? zackoverflow.dev/writing/... · Posted by u/ingve

nu11ptr · a month ago

I admit it may just be because I'm a PL nerd, but I thought it was general knowledge that pretty much EVERYTHING in Python is an object, and an object in Python is always heap allocated AFAIK. This goes deeper than just integers. Things most think of as declarative (like classes, modules, etc.) are also objects, etc. etc. It is both the best thing (for dynamism and fun/tinkering) and worst thing (performance optimization) about Python.

If you've never done it, I recommend using the `dir` function in a REPL, finding interesting things inside your objects, doing `dir` on those, and keeping the recursion going. It is a very eye-opening experience as to just how deep the objects in Python go.
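
The same exploration can be scripted. This is a rough sketch; `public_names` and `walk` are hypothetical helpers, not part of the standard library, and the attribute names printed by `walk` vary between Python versions.

```python
def public_names(obj):
    """Names from dir() that are not double-underscore attributes."""
    return [n for n in dir(obj) if not (n.startswith("__") and n.endswith("__"))]

def walk(obj, depth=2, indent=0, limit=5):
    """Print a few attribute names of obj, then recurse into each attribute."""
    if depth == 0:
        return
    for name in public_names(obj)[:limit]:  # cap the fan-out to keep output short
        try:
            attr = getattr(obj, name)
        except Exception:
            continue  # some attributes raise on access; skip them
        print(" " * indent + f"{name}: {type(attr).__name__}")
        walk(attr, depth - 1, indent + 2, limit)

def greet(name):
    return "hello " + name

# A function is itself an object with attributes of its own.
print(type(greet).__name__)        # function
print("__code__" in dir(greet))    # True
print(greet.__code__.co_varnames)  # ('name',)

# Go one level deeper into the function's code object.
walk(greet.__code__, depth=1)
```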

cipherself · a month ago

> I recommend using the `dir` function in a REPL

A while back I wrote this: https://mohamed.computer/posts/python-internals-cpython-byte... Perhaps it's interesting for people who use `dir` and wonder what some of the weird things that show up are.

cipherself commented on Production RAG: what I learned from processing 5M+ documents blog.abdellatif.io/produc... · Posted by u/tifa2up

DSingularity · 2 months ago

I think he just means it should be assumed to be standard practice and considered baseline at this point.

cipherself · 2 months ago

Assuming that's what he meant, why would it be considered baseline versus anything else? I am genuinely curious because I'd like to know more about issues people face with this or that vector store in general.

cipherself commented on Code from MIT's 1986 SICP video lectures github.com/felipap/sicp-c... · Posted by u/felipap

725686 · 2 months ago

If you are into SICP, you would probably like a nicely formatted html version of the book:

https://sarabander.github.io/sicp/html/index.xhtml#SEC_Conte...

And also this:

https://eli.thegreenplace.net/tag/sicp

cipherself · 2 months ago

Moreover, you can have SICP inside Emacs by just installing a package from MELPA:

https://melpa.org/#/sicp

cipherself commented on Production RAG: what I learned from processing 5M+ documents blog.abdellatif.io/produc... · Posted by u/tifa2up

pamelafox · 2 months ago

Yes, AI Search has a new agentic retrieval feature that includes synthetic query generation: https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl... You can customize the model used and the max # of queries to generate, so latency depends on those factors, plus the length of the conversation history passed in. The model is usually gpt-4o or gpt-4.1 or the -mini of those, so it's the standard latency for those. A more recent version of that feature also uses the LLM to dynamically decide which of several indices to query, and executes the searches in parallel.

That query generation approach does not extract structured data. I do maintain another RAG template for PostgreSQL that uses function calling to turn the query into a structured query, such that I can construct SQL filters dynamically. Docs here: https://github.com/Azure-Samples/rag-postgres-openai-python/...

I'll ask the search team about SPLADE, not sure.

cipherself · 2 months ago

Got it. I think this might make sense for a "conversation" type of search, but not for an instant search feature, because even the lowest latency is going to be too high IMO.

cipherself commented on Production RAG: what I learned from processing 5M+ documents blog.abdellatif.io/produc... · Posted by u/tifa2up

hatmanstack · 2 months ago

Not here to shill for AWS, but S3 Vectors is hands down the SOTA here. That, combined with a Bedrock Knowledge Base to handle Discovery/Rebalance tasks, makes for the simplest implementation on the market.

Once Bedrock KB backed by S3 Vectors is released from Beta it'll eat everybody's lunch.

cipherself · 2 months ago

> S3 Vectors is hands down the SOTA here

SOTA for what? Isn't it just a vector store?

cipherself commented on Production RAG: what I learned from processing 5M+ documents blog.abdellatif.io/produc... · Posted by u/tifa2up

pamelafox · 2 months ago

At Microsoft, that's all baked into Azure AI Search: hybrid search does BM25, vector search, and re-ranking, just by setting booleans to true. It also has a new agentic retrieval feature that does the query rewriting and parallel search execution.

Disclosure: I work at MS and help maintain our most popular open-source RAG template, so I follow the best practices closely: https://github.com/Azure-Samples/azure-search-openai-demo/

So few developers realize that you need more than just vector search, so I still spend many of my talks emphasizing the FULL retrieval stack for RAG. It's also possible to do it on top of other DBs like Postgres, but takes more effort.
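
To make "more than just vector search" concrete, here is a minimal Python sketch of one common way to merge a keyword (BM25-style) ranking with a vector ranking: reciprocal rank fusion. It is an illustration only, not code from the linked template. The document IDs and the `rrf` helper are made up, and `k=60` is just the conventionally used constant.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists, best match first.
keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from a BM25 query
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from an embedding search

fused = rrf([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top. A re-ranking model would then typically re-score only the first few fused results.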

cipherself · 2 months ago

I am working on search too, though for text-to-image retrieval. Nevertheless, I am curious whether by "that's all baked into Azure AI Search" you also meant the synthetic query generation from the grandparent comment. If so, what's your latency for this? And do you extract structured data from the query? If so, do you use LLMs for that?

Moreover, I am curious why you guys use BM25 over SPLADE.