Readit News
barthelomew commented on Race optimization algorithms with good initializations (beat them with bonuses)   debargha.com... · Posted by u/barthelomew
barthelomew · a month ago
You can get bonuses though. The optimization algorithms do not get bonuses.
barthelomew commented on LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives   arxiv.org/abs/2510.03761... · Posted by u/oldfuture
barthelomew · 2 months ago
Paper LaTeX files often contain surprising details. When a paper lacks code, reading the LaTeX source has become part of my reproduction workflow. The comments often reveal non-trivial insights, and frequently a simpler version of the methodology section (which, for the sake of claimed "novelty", is deliberately obscured behind mathematical jargon).
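
For anyone curious, a minimal sketch of that part of the workflow, assuming the arxiv.org/e-print endpoint and the requests package (the paper ID is just an example; single-file submissions come back as a plain gzipped .tex rather than a tarball and would need gzip instead):

    # Minimal sketch: fetch a paper's LaTeX source from arXiv and print its comments.
    import io
    import tarfile
    import requests

    paper_id = "2409.17270"  # example ID
    resp = requests.get(f"https://arxiv.org/e-print/{paper_id}", timeout=60)
    resp.raise_for_status()

    # Most submissions arrive as a (gzipped) tarball of .tex files.
    with tarfile.open(fileobj=io.BytesIO(resp.content)) as tar:
        for member in tar.getmembers():
            if member.name.endswith(".tex"):
                tex = tar.extractfile(member).read().decode("utf-8", errors="ignore")
                for line in tex.splitlines():
                    if line.lstrip().startswith("%"):  # LaTeX comment lines
                        print(f"{member.name}: {line.strip()}")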
barthelomew commented on ProofOfThought: LLM-based reasoning using Z3 theorem proving   github.com/DebarghaG/proo... · Posted by u/barthelomew
tannhaeuser · 2 months ago
LLMs are statistical language models (d'uh), not reasoners, after all. I found generating logic programs, and Prolog source specifically, to work unreasonably well [1], though, perhaps because Prolog was introduced for symbolic natural language processing and there's a wealth of translation examples in the training set. Might be worth checking out Z3's alternative Datalog syntax [2] instead of its Lisp-ish SMT-LIB syntax.

[1]: https://quantumprolog.sgml.net/llm-demo/part1.html

[2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax

barthelomew · 2 months ago
Yep! The Datalog syntax for Z3 is pretty neat! We used SMT [1] in our grammars paper because it allowed the most interoperability across solvers, but our technique also works with Prolog, as we tested out at the behest of reviewers at NeurIPS. I would assume it should also work with Datalog [2].

[1] https://arxiv.org/abs/2505.20047
[2] https://github.com/antlr/grammars-v4/blob/master/datalog/dat...
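
For reference, here is roughly what the Datalog route looks like through Z3's Python bindings; the relations and facts below are made up for illustration and assume the z3-solver package:

    # Toy Datalog-style reachability query via Z3's Fixedpoint engine.
    from z3 import Fixedpoint, Function, BitVecSort, BoolSort, Const

    fp = Fixedpoint()
    fp.set(engine='datalog')

    node = BitVecSort(3)  # the Datalog engine wants finite-domain sorts
    edge = Function('edge', node, node, BoolSort())
    path = Function('path', node, node, BoolSort())
    a, b, c = Const('a', node), Const('b', node), Const('c', node)

    fp.register_relation(edge, path)
    fp.declare_var(a, b, c)
    fp.rule(path(a, b), edge(a, b))                # direct edge
    fp.rule(path(a, c), [edge(a, b), path(b, c)])  # transitive closure
    fp.fact(edge(1, 2))
    fp.fact(edge(2, 3))

    print(fp.query(path(1, 3)))  # sat -> node 1 reaches node 3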

barthelomew commented on ProofOfThought: LLM-based reasoning using Z3 theorem proving   github.com/DebarghaG/proo... · Posted by u/barthelomew
tonerow · 2 months ago
Cool research! I went to the repo to see what the DSL looked like but it was hard to find a clear example. It would be cool if you added a snippet to the README.
barthelomew · 2 months ago
Hey! Thank you for the interest! I shall do that. Meanwhile, check out page 11 onwards of the paper, where we walk through a lot of example situations (https://arxiv.org/pdf/2409.17270).
barthelomew commented on ProofOfThought: LLM-based reasoning using Z3 theorem proving   github.com/DebarghaG/proo... · Posted by u/barthelomew
sigmoid10 · 2 months ago
What are you talking about? OpenAI has supported structured JSON output in the API since 2023. The current Structured Outputs API was only introduced in summer 2024, but it was primarily a usability improvement that still runs on JSON behind the scenes.
barthelomew · 2 months ago
You're right about the 2023 JSON mode, but our project required enforcing a much more complex DSL grammar (see the appendix for details), not just ensuring a *valid JSON object*. The newer structured output APIs are a significant improvement, but the earlier tools weren't a fit for the specific constraints we were working under at the time.
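
To make the difference concrete: with Structured Outputs you can pin the model to a schema rather than just to "some valid JSON". A rough sketch assuming a recent openai SDK, where the ProofStep/ProofProgram fields are hypothetical stand-ins rather than our actual DSL:

    # Rough sketch of schema-constrained generation with the OpenAI Python SDK.
    # The ProofStep/ProofProgram fields are hypothetical, not the paper's DSL.
    from pydantic import BaseModel
    from openai import OpenAI

    class ProofStep(BaseModel):
        kind: str        # e.g. "fact", "rule", or "goal"
        expression: str  # an S-expression-like string

    class ProofProgram(BaseModel):
        steps: list[ProofStep]

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user",
                   "content": "Formalize: employees with 5+ years of tenure get a bonus; Sam has 6 years."}],
        response_format=ProofProgram,
    )
    program = completion.choices[0].message.parsed  # a validated ProofProgram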
barthelomew commented on ProofOfThought: LLM-based reasoning using Z3 theorem proving   github.com/DebarghaG/proo... · Posted by u/barthelomew
nakamoto_damacy · 2 months ago
LLMs lack logical constraints in the generative process; they only learn probabilistic constraints. If you apply logic verification post hoc, you're not "ensuring the correctness of your LLM's reasoning" (I went down this path a year ago); you're classifying whether the LLM's statistically driven pattern generation happens to correspond to correct logic or not. The LLM's output may be wrong 100% of the time, and your theorem prover simply acts as a classifier, ensuring nothing at all.
barthelomew · 2 months ago
Yep, this is a genuine problem, and it's what we term the autoformalization gap in our follow-up paper (https://arxiv.org/abs/2505.20047).

Some LLMs are more consistent between their text and their SMT than others (Table 1; Figures 14, 15).

You can do uncertainty quantification with selective verification to reduce the "risk", e.g. reported as the Area Under the Risk-Coverage Curve in Table 4.
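
Roughly, selective verification means only trusting the formalizations the model is confident about and tracking the risk you take on at each coverage level. A toy version of that bookkeeping (the confidences and labels below are made up; this is not the paper's metric code):

    # Toy risk-coverage bookkeeping for selective verification.
    # conf[i]  = confidence that formalization i is faithful (made-up numbers)
    # wrong[i] = 1 if formalization i was actually unfaithful, else 0
    conf = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]
    wrong = [0, 0, 0, 1, 0, 1]

    order = sorted(range(len(conf)), key=lambda i: -conf[i])  # most confident first
    aurc, errors = 0.0, 0
    for k, i in enumerate(order, start=1):
        errors += wrong[i]
        coverage, risk = k / len(conf), errors / k
        print(f"coverage={coverage:.2f}  risk={risk:.2f}")
        aurc += risk / len(conf)  # rectangle rule over coverage steps

    print(f"AURC ~ {aurc:.3f} (lower is better)")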

barthelomew commented on ProofOfThought: LLM-based reasoning using Z3 theorem proving   github.com/DebarghaG/proo... · Posted by u/barthelomew
ivanbakel · 2 months ago
The repo is sparse on the details unless you go digging, which perhaps makes sense if this is just meant as the artifact for the mentioned paper.

Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is in the ability to literally read the logical statement being evaluated to an answer, and running the solver to see if it holds or not.
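
Concretely, I'd expect the generated program to boil down to something like this toy z3py snippet (the domain is my own made-up example, not taken from the repo):

    # Toy example of the kind of program the LLM would emit (made-up domain).
    from z3 import (Solver, DeclareSort, Function, BoolSort, Const,
                    ForAll, Implies, Not, unsat)

    Thing = DeclareSort('Thing')
    Bird = Function('Bird', Thing, BoolSort())
    CanFly = Function('CanFly', Thing, BoolSort())
    tweety = Const('tweety', Thing)
    x = Const('x', Thing)

    s = Solver()
    s.add(ForAll([x], Implies(Bird(x), CanFly(x))))  # inference rule
    s.add(Bird(tweety))                              # known fact
    s.add(Not(CanFly(tweety)))                       # negated goal
    print("goal holds" if s.check() == unsat else "goal not entailed")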

The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?

The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at building logical models or keeps introducing unsoundness. Sadly, the evaluation is a bit thin on how this stacks up and what causes it to fall short.

barthelomew · 2 months ago
Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.

E.g. in https://arxiv.org/pdf/2505.20047, Table 1, we compare performance on text-only vs SMT-only reasoning. o3-mini does pretty well at mirroring its text reasoning in its SMT, whereas Gemini Flash 2.0 does not.

An illustration of this can be seen in Figures 14 and 15 on page 29.

In commercially available products like AWS Automated Reasoning Checks, you build a model of your domain (e.g. from a PDF policy document), cross-verify it for correctness, and then during answer generation you only check whether the LLM's Q/A pairs comply with the policy, using a solver with guarantees.

This means they can give you a 99%+ soundness guarantee: if the service says a Q/A pair is valid or guaranteed w.r.t. the policy, it is right more than 99% of the time.

https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
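
A stripped-down illustration of that kind of check (the refund policy below is made up, and this is in no way AWS's actual implementation): encode the policy as constraints, encode the claim extracted from the LLM's answer, and ask the solver whether the two can hold together.

    # Stripped-down policy-compliance check with Z3 (made-up refund policy).
    from z3 import Solver, Int, Bool, Implies, And, unsat

    days_since_purchase = Int('days_since_purchase')
    refund_granted = Bool('refund_granted')

    # Policy model: refunds are only allowed within 30 days of purchase.
    policy = Implies(refund_granted, days_since_purchase <= 30)

    # Claim extracted from the LLM's answer: "you can still get a refund at 45 days".
    claim = And(refund_granted, days_since_purchase == 45)

    s = Solver()
    s.add(policy, claim)
    print("answer is consistent with the policy" if s.check() != unsat
          else "answer violates the policy")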

barthelomew commented on ProofOfThought: LLM-based reasoning using Z3 theorem proving   github.com/DebarghaG/proo... · Posted by u/barthelomew
jssmith · 2 months ago
I see JSON parse errors on occasion when using OpenAI structured outputs that resolve upon retry. It seems it’s giving instructions to the LLM, but validation is still up to the caller. Wondering if others see this too.
barthelomew · 2 months ago
Hey, yes! This is because the DSL (domain-specific language) is pretty complex, and the LLM finds it hard to emit correctly. We prototyped a much more effective version using SMT in our NeurIPS 2025 paper (https://arxiv.org/abs/2505.20047). We will open-source that code soon!
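
In the meantime, the usual workaround is caller-side validate-and-retry, roughly like the sketch below (the prompt, model name, and helper are placeholders, not code from the repo):

    # Caller-side validate-and-retry sketch (placeholder prompt/model, not repo code).
    import json
    from openai import OpenAI

    client = OpenAI()

    def generate_json(prompt: str, retries: int = 3) -> dict:
        last_err = None
        for _ in range(retries):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "Respond with a single JSON object."},
                    {"role": "user", "content": prompt},
                ],
                response_format={"type": "json_object"},
            )
            try:
                return json.loads(resp.choices[0].message.content)
            except json.JSONDecodeError as err:
                last_err = err  # malformed output: try again
        raise RuntimeError(f"No valid JSON after {retries} attempts") from last_err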

u/barthelomew

Karma: 228 · Cake day: August 19, 2019
About
ML research; debargha@case.edu