Once GPT is tuned more heavily on Lean (proof assistant) -- the way it is on Python -- I expect its usefulness for research level math to increase.
I work in a field related to operations research (OR), and ChatGPT 4o has ingested enough of the OR literature that it's able to spit out very useful Mixed Integer Programming (MIP) formulations for many "problem shapes". For instance, I can give it a logic problem like "I need to put n items in k buckets based on a score, but I want to fill each bucket sequentially" and it actually spits out a very usable math formulation. I usually just need to tweak it a bit. It also warns against weak formulations where the logic might fail, which is tremendously useful for avoiding pitfalls. Compare this to the old way, which is to rack my brain over a weekend to figure out a water-tight formulation of a MIP optimization problem (which is often not straightforward for non-intuitive problems). GPT has saved me so much time in this corner of my world.
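To give a flavor of the kind of formulation involved (a hedged sketch with hypothetical symbols, not GPT's exact output): binary x_ij assigns item i to bucket j, binary y_j marks bucket j as open, and a linking constraint forces bucket j+1 to open only once bucket j is full:

```latex
\begin{aligned}
& \textstyle\sum_j x_{ij} = 1 && \forall i && \text{(each item goes in exactly one bucket)} \\
& \textstyle\sum_i x_{ij} \le C_j \, y_j && \forall j && \text{(respect capacity; closed buckets stay empty)} \\
& \textstyle\sum_i x_{ij} \ge C_j \, y_{j+1} && \forall j && \text{(bucket } j{+}1 \text{ opens only when bucket } j \text{ is full)} \\
& x_{ij}, \, y_j \in \{0, 1\}
\end{aligned}
```

The score-based objective sits on top of this; the point is that the sequential-fill logic reduces to one linking constraint.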
Yes, you probably wouldn't be able to use ChatGPT well for this purpose unless you understood MIP optimization in the first place -- and you do need to break down the problem into smaller chunks so GPT can reason in steps -- but for someone who can and does, the $20/month I pay for ChatGPT more than pays for itself.
side: a lot of people who complain on HN that (paid/good -- only Sonnet 3.5 and GPT4o are in this category) LLMs are useless to them probably (1) do not know how to use LLMs in a way that maximizes their strengths; (2) have expectations that are too high based on the hype, expecting one-shot magic bullets; or (3) work in a domain where LLMs are really not good. But many of the low-effort comments seem to mostly fall into (1) and (2) -- cynicism rather than cautious optimism.
Many of us who have discovered how to exploit LLMs in their areas of strength -- and know how to check for their mistakes -- often find them providing significant leverage in our work.
HN, and the internet in general, have become just an ocean of reactionary sandbagging and blather about how "useless" LLMs are.
Meanwhile, in the real world, I've found that I haven't written a line of code in weeks. Just paragraphs of text that specify what I want and then guidance through and around pitfalls in a simple iterative loop of useful working code.
It's entirely a learned skill; the models (and, very importantly, the tooling around them) have arrived at the baseline they needed.
Much, much more productive world by just knuckling down and learning how to do the work.
> Much, much more productive world by just knuckling down and learning how to do the work.
The funny thing is that everyone who says they've become more productive with LLMs won't say exactly how. I can talk about how Vim has made it more enjoyable to edit code (keybindings and motions), how Emacs is a good environment for text tooling (a lisp machine), and how I use technical books to further my learning (so many great books out there). But no one really shows how they're actually solving problems with LLMs and why the alternatives were worse for them. It's all claims that it's great, with no further elaboration on the workflows.
> I haven't written a line of code in weeks. Just paragraphs of text that specify what I want and then guidance through and around pitfalls in a simple iterative loop of useful working code.
Code is intent described in terms of machine actions. Those actions can be masked by abstracting them into more understandable units, so we don't have to write opcodes but can use Python instead. Programming is basically making the intent clear enough that we know which units to use. Software engineering is mostly selecting the units so as to do minimal work once the intent changes or the foundational actions do.
Chatting with an LLM looks to me like your intent is either vague or you don't know which units to use. If it's the former, then I guess you're assuming it is the expert and will guide you to the solution you seek, which means you believe it understands the problem better than you do. The latter is stranger, as it looks like playing around with car parts while ignoring the manuals they come with.
What about boilerplate and common scenarios? I agree that LLMs help a great deal with that, but the fact is that there were already perfectly good tools for it, like snippets, templates, and code generators.
In my view these models produce above-average code which is good enough for most jobs. But the Hacker News sampling could be biased towards the top tier of coders -- so their personal accounts of it not being good enough can also be true. For me the quality isn't anywhere close to good enough for my purposes; all of my easy code is already done, so I'm only left working on gnarly niche stuff which the LLMs are not yet helpful with.
As for the effect on the industry, I generally make the point that even if AI only replaces the below-average coder, it will put downward pressure on above-average coders' compensation expectations.
Personally, I find that humans appear to be getting dumber at the same time that AI is getting smarter, and while, for now, the crossover point is at a low threshold, that threshold will of course increase over time. I used to try to teach ontologies, stats, and SMT solvers to humans before giving up and switching to AI technologies, where success is not predicated on human understanding. I used to think that the inability of most humans to understand these topics was a matter of motivation, but I have rather recently come to understand that these limitations are generally innate.
What sort of problems do you solve? I tried to use it. I really did. I've been working on a tree edit distance implementation based on a paper from '95. Not novel stuff. I just can't get it to output anything coherent. The code rarely runs, it's written in absolutely terrible style, and it doesn't follow any good practices for performant code. I've struggled to get it to even implement the algorithm correctly, even though it's in the literature I'm sure it was trained on.
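For what it's worth, the core recursion for ordered-tree edit distance (the rightmost-root decomposition that the classic algorithms build on) is compact enough to sketch directly. This is a naive memoized version assuming unit costs and trees encoded as (label, children) tuples -- a correct reference implementation for small inputs, not the cleverly indexed dynamic program from the literature:

```python
from functools import lru_cache

def ted(f, g):
    """Edit distance between ordered forests f, g (tuples of (label, children) trees)."""
    def size(F):
        return sum(1 + size(kids) for _, kids in F)

    @lru_cache(maxsize=None)
    def d(F, G):
        if not F and not G:
            return 0
        if not F:
            return size(G)          # insert every node of G
        if not G:
            return size(F)          # delete every node of F
        v_label, v_kids = F[-1]     # rightmost root of F
        w_label, w_kids = G[-1]     # rightmost root of G
        return min(
            d(F[:-1] + v_kids, G) + 1,   # delete v: its children become roots
            d(F, G[:-1] + w_kids) + 1,   # insert w
            # match v with w: recurse on their child forests and on the rest
            d(v_kids, w_kids) + d(F[:-1], G[:-1]) + (v_label != w_label),
        )

    return d(f, g)
```

With forests as tuples everything is hashable, so `lru_cache` handles the memoization; it's only suitable for small trees, but it gives something exact to test a real implementation (or an LLM's output) against.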
Even test cases have brought me no luck. The code was poorly written, being too complicated and dynamic for test code in the best case and just wrong on average. It constantly generated test cases that would be fine for other definitions of "tree edit distance" but were nonsense for my version of a "tree edit distance".
What are you doing where any of this actually works? I'm not some jaded angry internet person, but I'm honestly so flabbergasted about why I just can't get anything good out of this machine.
That’s fine until your code makes its way to production, an unconsidered side effect occurs and then you have to face me.
You are still responsible for what you do regardless of the means you used to do it. And a lot of people use this not because it’s more productive but because it requires less effort and less thought because those are the hard bits.
I’m collecting stats at the moment, but the general trend in quality (measured as functional defects produced) is declining when an LLM is involved in the process.
So far it’s not a magic bullet but a push for mediocrity in an industry with a rather bad reputation. Never a good story.
> I've found that I haven't written a line of code in weeks
Which is great until your next job interview. Really, it's tempting in the short run but I made a conscious decision to do certain tasks manually only so that I don't lose my basic skills.
But "lines of code written" is a hollow metric to prove utility. Code literacy is more effective than code illiteracy.
Lines of natural language vs. discrete code is a kind of preference. Code is exact, which makes it harder to recall and master, but it provides information density.
> by just knuckling down and learning how to do the work?
This is the key for me. What work? If it's the years of learning and practice toward the proficiency to "know it when you see it", then I agree.
> I've found that I haven't written a line of code in weeks
How are people doing this? I never use any of the code that gpt4o/copilot/sonnet spits out because it never meets my standards. How are other people accepting the shit it spits out?
For someone who didn't study a STEM subject or CS in school, I've gone from 0 to publishing a production modern looking app in a matter of a few weeks (link to it on my profile).
Sure, it's not the best (most maintainable, non-redundant styling) code that's powering the app but it's more than enough to put an MVP out to the world and see if there's value/interest in the product.
I use Sonnet 3.5, and while it's actually usable for codegen (compared to gpt/copilot), it's still really not that great. It does well at tasks like "here's a stinky collection of tests that accrued over time -- clean this up in the style of x", but actually writing code still shows a fundamental lack of understanding of the underlying API and problem (the most banal example being constantly generating the `x || Array.isArray(x)` test).
> HN, and the internet in general, have become just an ocean of reactionary sandbagging and blather about how "useless" LLMs are.
Now imagine how profoundly depressing it is to visit a HN post like this one, and be immediately met with blatant tribalism like this at the very top.
Do you genuinely think that going on a performative tirade like this is what's going to spark a more nuanced conversation? Or would you rather just the common sentiment be the same as yours? How many rounds of intellectual dishonesty do we need to figure this out?
> Meanwhile, in the real world, I've found that I haven't written a line of code in weeks. Just paragraphs of text that specify what I want and then guidance through and around pitfalls in a simple iterative loop of useful working code.
could it be that you are mostly engaged in "boilerplate coding", where LLMs are indeed good?
People in general don't like change and naturally defend against it. And the older people get, the greater the percentage of people fighting against it. A very useful and powerful skill is to be flexible and adaptable. You've positioned yourself among the happy few.
> Meanwhile, in the real world, I've found that I haven't written a line of code in weeks. Just paragraphs of text that specify what I want and then guidance through and around pitfalls in a simple iterative loop of useful working code.
Comment on first principles:
Following the dictum that you can't prove the absence of bugs, only their presence, the idea of what constitutes "working code" deserves much more respect.
From an engineering perspective, either you understand the implementation or you don't. There's no meaning to an iterative loop of producing "working" code.
Stepwise refinement is a design process under the assumption that each step is understood, within a process of exploring how a solution matches a problem. The steps refine the definition of the problem, to which is applied an understanding of how to compute a solution. The meaning of working code lies in the appropriateness of the solution to the definition of the problem. Adjust either or both to unify and make sense of the matter.
The discipline of programming is rotting when the definition of "working" is copying code from an oracle and running it to see if it goes wrong.
The measure of works must be an engineering claim of understanding the chosen problem domain and solution. Understanding belongs to the engineer.
LLMs do not understand and cannot be relied upon to produce correct code.
If use of an LLM puts the engineer in contact with proven principles, materials and methods which he adapts to the job at hand, while the engineer maintains understanding of correctness, maybe that's a gain.
But if the engineer relies on the LLM transformer as an oracle, how does the engineer locate the needed understanding? He can't get it from the transformer: he's responsible for checking the output of the transformer!
OTOH if the engineer draws on understanding from elsewhere, what is the value of the transformer but as a catalog? As such, who has accountability for the contents of the catalog? It can't be the transformer because it can't understand. It can't be the developer of the transformer because he can't explain why the LLM produces any particular result! It has to be the user of the transformer.
So a system of production is being created whereby the engineer's going-in position is that he lacks the understanding needed to code a solution and he sees his work as integrating the output of an oracle that can't be relied upon.
The oracle is a peculiar kind of calculator with an unknown probability of generating relevant output, working at superhuman speeds, while the engineer is reduced to an operator verifying that output at human speeds.
This looks like a feedback system for risky results and a slippery slope towards heretofore unknown degrees of incorrectness and margins for error.
At the same time, the only common vernacular for tracking oracle veracity is in arcane version numbers, which are believed, based on rough experimentation, to broadly categorize the hallucinatory tendencies of the oracle.
The broad trend of adoption of this sketchy tech is in the context of industry which brags about seeking disruption and distortion, regards its engineers as cost centers to be exploited as "human resources", and is managed by a specialized class of idiot savants called MBAs.
Get this incredible technology into infrastructure and in control of life sustaining systems immediately!
I also do OR-adjacent work, but I've had much less luck using 4o for formulating MIPs. It tends to deliver correct-looking answers with handwavy explanations of the math, but the equations don't work and the reasoning doesn't add up.
It's a strange experience, like taking a math class where the proofs are weird and none of the lessons click for you, and you start feeling stupid, only to learn your professor is an escaped dementia patient and it was gobbledygook to begin with.
I had a similar experience yesterday using o1 to see if a simple path exists from s to t through v, using max flow. It gave me a very convincing-looking algorithm that was fundamentally broken. My working solution used some techniques from its failed attempt, but even after repeated hints it failed to figure out a working answer (it stubbornly kept finding s->t flows, rather than realizing v->{s,t} was the key).
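(For anyone curious about the trick being hinted at: a simple s-t path through v exists iff there are two internally vertex-disjoint paths from v to s and from v to t, which node-splitting turns into a single max-flow check. A rough sketch with networkx, assuming an undirected input graph:)

```python
import networkx as nx

def simple_path_through(G, s, t, v):
    """True iff some simple s-t path in undirected graph G passes through v.

    Equivalent to two internally vertex-disjoint paths v->s and v->t;
    enforce vertex capacity 1 by splitting each node into in/out halves."""
    H = nx.DiGraph()
    for u in G.nodes:
        H.add_edge((u, "in"), (u, "out"), capacity=1)
    for a, b in G.edges:
        H.add_edge((a, "out"), (b, "in"), capacity=1)
        H.add_edge((b, "out"), (a, "in"), capacity=1)
    H[(v, "in")][(v, "out")]["capacity"] = 2    # v is shared by both paths
    H.add_edge((s, "out"), "sink", capacity=1)  # one unit must reach s...
    H.add_edge((t, "out"), "sink", capacity=1)  # ...and one unit must reach t
    value, _ = nx.maximum_flow(H, (v, "in"), "sink")
    return value == 2
```

The vertex capacities are what o1 kept missing: without node splitting, two flow paths may reuse an intermediate vertex and the combined walk isn't simple.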
It's also extremely mentally fatiguing to check its reasoning. I almost suspect that RLHF has selected for obfuscating its reasoning, since obviously-wrong answers are easier to detect and penalize than subtly-wrong answers.
Yep. We need research into how long it takes experts to repair faulty answers vs. generate them on their own.
Benchmarking 10,000 attempts on an IQ test is irrelevant if, on most of those attempts, the time taken to repair an answer is longer than the time to complete the test yourself.
I find it's useful for generating exemplars in areas you're roughly familiar with but want some elaboration on, or a refresher. You can stitch it all together to get further, but when it comes time to actually build something -- you need to start from scratch.
The time taken to reproduce what it's provided, now that you understand it, is trivial compared to the time needed to repair its flaws.
I'm currently teaching a course on MIP, and out of interest I tried asking 4o some questions I ask students. It could give the 'basic building blocks' (how to do x!=y, how to do a knapsack), but as soon as I asked it a vaguely interesting question that wasn't "bookwork", I don't think any of its answers were right.
I'm interested in how you seem to be getting better answers than me (or maybe I just discard the answer and write it myself once I can see it's wrong?)
In fact, I just asked it to do (and explain) x!=y for x,y integer variables in the range {1..9}, and while the constraints are right, the explanation isn't.
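(For reference, the standard big-M encoding I'd expect here, written out so the correct explanation is on record: one binary b chooses which variable is larger, and M = 9 suffices because |x - y| <= 8 for x, y in {1..9}:)

```latex
x - y \ge 1 - 9\,b \qquad \text{(if } b = 0 \text{, forces } x \ge y + 1\text{)} \\
y - x \ge 1 - 9\,(1 - b) \qquad \text{(if } b = 1 \text{, forces } y \ge x + 1\text{)} \\
b \in \{0, 1\}
```

Whichever side is "switched off" relaxes to a bound of -8, which is vacuous over the given ranges.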
I also work in OR, and I have had the complete opposite experience with respect to MILP optimization (and the research actually agrees; there was a big survey paper published earlier this year showing LLMs were mostly correct on textbook problems but got more and more useless as complexity and novelty increased).
The results are boilerplate at best, but misleading and insidious at worst, especially when you get into detailed tasks. Ever try to ask an LLM what a specific constraint does, or worse, ask it to explain the mathematical model behind some proprietary CPLEX syntactic sugar? It hallucinates the math, the syntax, the explanation, everything.
Can you point me to that paper? What version of the model were they using?
Have you tried again with the latest LLMs? ChatGPT4 actually (correctly) explains what each constraint does in English -- it doesn't just provide the constraint when you ask it for the formulation. Also, not sure if CPLEX should be involved at all -- I usually just ask it for mathematical formulations, not CPLEX calling code (I don't use CPLEX). The OR literature primarily contains math formulations and that's where LLMs can best do pattern matching to problem shape.
I had the same experience with computational geometry.
Very good at giving a textbook answer ("give a Python/ Numpy function that returns the Voronoi diagram of set of 2d points").
Now, I ask for the Laguerre diagram, a variation that is not mentioned in textbooks but is very useful in practice. I can spend a lot of time spoon-feeding it the answer, and I still just get bullshitting-student answers.
I tried other problems like numerical approximation, physics simulation, same experience.
I don't get the hype. Maybe it's good at giving variations of glue code, i.e. Stack Overflow meets autocomplete? As a search tool it's bad because it's so confidently incorrect that you may be fooled by bad answers.
> But many of the low-effort comments seem to mostly fall into (1) and (2) -- cynicism rather than cautious optimism.
One good riposte to reflexive LLM-bashing is, "Isn't that just what a stochastic parrot would say?" Some HN'ers would dismiss a talking dog because the C code it wrote has a buffer overflow error.
It's understandable that people whose careers and lifelong skill sets are seemingly on the precipice of obsolescence are going to be extremely hostile to that threat.
How many more years is senior SWE work going to be a $175k/yr gig instead of a $75k check-what-the-robot-does gig?
It also doesn't help that Lean has had so many breaking changes in so little time. When I tried using GPT-4 for it, it mostly rendered old code that would fail to run unless you already knew the answer and how to fix it, which basically made it entirely unhelpful.
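For a concrete sense of the churn: even a one-line lemma reads differently across versions. Lean 3 would write it with lowercase `nat` and a `begin ... end` tactic block, while in Lean 4 the same thing is (a minimal sketch):

```lean
-- Lean 4: term-mode proof via the core library lemma
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A model trained mostly on Lean 3 era code tends to produce output that won't even parse under Lean 4.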
> people who complain on HN that (paid/good - only Sonnet 3.5 and GPT4o are in this category)
Correction: I complain that the only decent model in "Open"AI's arsenal, that is GPT-4, has been replaced by the cheaper GPT-4o, which gives subpar answers to most of my questions (I don't care that it does so faster). As they've moved it to the "old, legacy" models, I expect they will phase it out, at which point I'll cancel my OpenAI subscription and Sonnet 3.5 will become the clear leader for my daily tasks.
Kudos to Anthropic for their great work, you guys are going in the right direction.
I'm not sure the Lean coverage of pure math research is that extensive (maybe 1% is represented in mathlib). But I think a system like AlphaProof could even today be useful for mathematicians -- I mostly dislike systems like o1, which confidently say nonsense with such high frequency. But I think the value is already there.
I take cynicism over unbridled optimism. People speak as if we were on the cusp of technological singularity, but I've seen nothing to indicate we're not already past the inflection point of the logistic curve, and well into diminishing returns territory.
_can_ GPT be tuned more heavily on Lean?
It looks like the amount of Python code in the corpus outnumbers Lean something like 1000:1. Although I guess OpenAI could generate more Lean and train on that.
Or (4) LLMs simply do not work properly for many use cases, in particular where large volumes of training data don't exist in the corpus.
And in these scenarios, rather than saying "I don't know", they will over and over again gaslight you with incoherent answers.
But sure, condescendingly blame the user for their ignorance and inability to understand or use the tool properly. Or call their criticism low-effort.
Yeah I have been using them to help with learning graduate maths as a grad student. Claude Sonnet 3.5 was unparalleled and the first quite useful one. GPT4o preview seems about equal (based on cutting and pasting the past six months of prompts into it).
The first profession AI seems on track to decimate is programming -- in particular, the brilliant but remote individual contributor. There is an obvious conflict of interest in this forum.
I see this theory a lot, but mostly from people who haven’t tried pair coding with a quality LLM. In fact, these LLMs give experienced developers superpowers; you can be crazy productive with them.
If you think we are close to the maximum useful software in the world already, then maybe. I do not believe that. Seeing software production time and costs drop one to two orders of magnitude means we will have very different viable software production processes. I don’t believe for a second that it disenfranchises quality thinkers; it empowers them.
Before it can replace the brilliant programmer, it needs to be able to replace the mediocre programmer. There is so much programming and other tech/it related work that businesses or people want, but can't justify paying even low tech salaries in America for.
So far, there is little chance of a non-technical person developing a technical solution to their problems using AI.
The programmers who will find LLMs most useful are those who, prior to LLMs, were copying and pasting from Stack Overflow and asking questions online about everything they were doing -- tasks that LLMs have precisely replaced (they have now memorized all that boilerplate code, those consensus answers, and the API usage examples).
The developers who will find LLMs the least useful are the "brilliant" ones who never found any utility in any of that stuff, partly because they are not reinventing the wheel for the 1000th time, but instead addressing more challenging and novel problems.
The conflict of interest might have something to do with the fact that OpenAI's CEO/founder was once a major figure in Y Combinator. But I think you wanted to insinuate that the conflict of interest ran in the other direction.
Once ChatGPT can even come close to replacing a junior engineer, you can retry your claim. The progression of the tech underlying ChatGPT will be sub-linear.
I doubt it. It can do some impressive stuff for sure, but I very rarely get a perfectly working answer out of ChatGPT. Don't get me wrong, it's often extremely useful as a starting point and time saver, but it clearly isn't close to replacing anyone vaguely competent.
The important point, I feel, is that most people are not even at the level of intelligence of "a mediocre, but not completely incompetent, graduate student." A mediocre graduate science student, especially of the sort who graduates and doesn't quit, is a very impressive individual compared to the rest of us.
For "us", having such a level of intelligence available as an assistant throughout the day is a massive life upgrade, if we can just afford more tokens.
My sheer productivity boost from these models is miraculous. It's like upgrading from a text editor to a powerful IDE. I've saved a mountain of hours just by removing tedious time sinks -- one-off language syntax, remembering patterns for some framework, migrating code, etc. And this boost applies to nearly all of my knowledge work.
Then I see contrarians claiming that LLMs are literally never useful for anyone, and I get "don't believe your lying eyes" vibes. At this point, such sentiments feel either willfully ignorant, or said in bad faith. It's wild.
Anyone intelligent enough to make a living programming likely has more than enough IQ to become a mediocre, somewhat competent graduate student in math.
They just don't have the background, and probably lack the interest to dedicate a few years of study to get to that level.
>A mediocre graduate science student, especially of the sort who graduates and doesn't quit, is a very impressive individual compared to the rest of us.
Incorrect. Graduating from university shows a good work ethic, a certain character, and an ability to manage time. It's not a measure of being better than the rest of humanity, and it's not a good measure of intelligence either. If you insist on viewing the world through credentials: academics don't consider your intelligence until you have a Ph.D. and X years of work in your field, and industry only uses a degree as an entry requirement for junior roles, then favors and cares only about your years of experience.
Given that statement, I can only assume you haven't been to university. You are mistaken to think, especially in the time we are in now, that the elite class is any more knowledgeable than you are.
Which is why I think the AI era isn't hype but very much real. Jensen said AI has reached its iPhone era.
We won't have AGI or ASI (whatever definitions people attach to those terms) in the next 5-10 years. But I often like to read AI as Assisted or Augmented Intelligence. And it will provide enough value to drive computer and smartphone sales for at least another 5-10 years, or 3-4 cycles.
Terry is a genius that can get that value out of an LLM.
Average Joe can't do anything like that yet, both because he won't be as good at prompting the model, and because his problems in life aren't text-based anyway.
To be honest, I have gotten 100x more useful answers out of Siri's WolframAlpha integration than I ever have out of ChatGPT. People don't want a "not completely incompetent graduate student" responding to their prompts; they want NLP that reliably processes information. Last-generation voice assistants could at least do their job consistently; ChatGPT couldn't be trusted to flick a light switch on a regular basis.
I use both for different things. WolframAlpha is great for well-defined questions with well-defined answers. LLMs are often great for anything that doesn't fall into that.
I use Home Assistant with the Extended OpenAI integration from HACS. Let me tell you, it’s orders of magnitude better than generic voice assistants. It can understand my requests fairly flexibly without me having a literal memory of every device in the house. I can ask for complex tasks like turning every light in the basement on -- without there being a "basement" zone -- because it infers from the names. I have air quality sensors throughout, and I can ask it to turn on the fan in areas with low air quality, and it literally does it without my programming an automation.
Usually Alexa will order 10,000 rolls of toilet paper and ship them to my boss when I ask it to turn on the bathroom fan.
Personally, though, the utility of this level of skill (beginner grad in many areas) for me is in areas where I have undergraduate-level questions. While I literally never ask it questions in my own field, I do for many other fields I don’t know well, to help me learn. Over the summer my family traveled and I was home alone, so I fixed and renovated tons of stuff I didn’t know how to do. I wore a headset and had the voice mode of ChatGPT on. I just asked it questions as I went, and it answered. This enabled me to complete dozens of projects I didn’t know how to even start otherwise. If I had had to stop and search the web, sift through forums and SEO hellscapes, read loosely related instructions, and try to synthesize my answers, I would have gotten two rather than thirty projects done.
How does this square up with literally what Terence Tao (TFA) writes about O1? Is this meant to say there's a class of problems that O1 is still really bad at (or worse than intuition says it should be, at least)? Or is this "he says, she says" time for hot topics again on HN?
Then you have a skill issue. 10 million people pay for GPT monthly because a large share of them are getting useful value out of it. WolframAlpha has been out for a while and didn't take off, for a reason. "GPT couldn't be trusted to flick a light switch on a regular basis" pretty much implies you are not serious, or that your knowledge of the capabilities of LLMs is dated or derived from things you have read.
Even more amazing, there are plenty - PLENTY - of posters here who routinely either completely shit on LLMs or casually dismiss them as "hype", "useless", and what have you.
I've been saying this for quite some time now, but some people are in for a very rude awakening when the SOTA models 5-10 years from now are able to completely replace senior devs and engineers.
Better buckle up, and start diversifying your skills.
The way I see it, these models -- especially o1 -- are an intelligence booster. If you start with zero, it gives you back zero. Especially if you’re genuinely trying to use it and not just trying to do some gotcha stuff.
Diversifying into what? When AI can fully replace senior developers, the world as we know it is over. Best case, capitalism enters terminal decline: buy rifles. Worst case, hope that whatever comes out the other side is either benevolent or implodes quickly.
I mean, pay several hundred to thousands of grad students to do RLHF for several years and you get a corpus of grad-student text. I'm not surprised at all. AI companies hire grad students to do RLHF in every subject (chemistry, physics, math, etc).
The grad students write the prompts and correct the model, and all of that is fed into a "more advanced" model. It's corpora of text. Repeat this for every grade level and subject.
Ask a model that's been trained on chemistry grad-level work a simple math question and it will probably get it wrong. They aren't "smart". It's aggregations of text and ways to sample and then predict.
Except you’re talking about a general-purpose foundation model that’s doing all these subjects at once. It’s not like you choose a subject-specific model with Claude or gpt-o1.
The key isn’t whether these things are smart or not. The key is that they put something that can answer basic grad-level questions on almost any subject within everyone’s reach. For people who don’t have a graduate-level education in any subject, this is a remarkable tool.
I don’t know why the statement “wow, this is useful and a remarkable step forward” is always met with “yeah, but it’s not actually smart.” So? Half of all humans have an IQ below 100. They’re not smart either. Is that their value? For a machine, being able to produce accurate answers to most basic graduate-level questions is -science fiction- regardless of whether it’s “smart.”
The NLP feat alone is stunning, and going from basically one step above gibberish to “basic grad school” in two years is a jaw-dropping rate of change. I suspect folks who quibble over whether it’s “real intelligence” or simply a stochastic parrot have lost the ability to dream.
The o1 model is really remarkable. I was able to get very significant speedups to my already highly optimized Rust code in my fast vector similarity project, all verified with careful benchmarking and validation of correctness.
Not only that, it also helped me reimagine and conceptualize a new measure of statistical dependency based on Jensen-Shannon divergence that works very well. And it came up with a super fast implementation of normalized mutual information, something I originally tried to include in the library but for which I struggled to find anything fast enough when dealing with large vectors (say, 15,000 dimensions and up).
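(For anyone unfamiliar: the Jensen-Shannon divergence itself is easy to state -- with M = (P + Q)/2, it's the average of KL(P||M) and KL(Q||M). A minimal NumPy sketch of the base quantity, emphatically not the optimized Rust from the project:)

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):  # Kullback-Leibler divergence; eps guards log(0)
        return float(np.sum(a * np.log2((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logs it's bounded in [0, 1], which is part of what makes it attractive as a normalized dependency score.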
While it wasn’t able to give perfect Rust code that compiled on the very first try, it was able to fix all the bugs in one more try after I pasted in all the compiler problems from VS Code. In contrast, gpt-4o would usually take dozens of tries to fix the many Rust type errors, lifetime/borrowing errors, and so on that it would inevitably introduce. And Claude 3.5 Sonnet is just plain stupid when it comes to Rust for some reason.
I really have to say, this feels like a true game changer, especially when you have really challenging tasks that you would be hard pressed to find many humans capable of helping with (at least without shelling out $500k+/year in compensation).
And it’s not just the performance optimization and relatively bug free code— it’s the creative problem solving and synthesis of huge amounts of core mathematical and algorithmic knowledge plus contemporary research results, combined with a strong ability to understand what you’re trying to accomplish and making it happen.
Here is the diff to the code file showing the changes:
But a lot of what you pay humans $500k a year for is to work with enormous existing systems that an LLM cannot understand just yet. Optimizing small libraries and implementing fast functions though is a huge improvement in any programmer's toolbox.
Yes, that’s certainly true, and that’s why I selected that library in particular to try with it. The fact that it’s mathematical -- not many lines of code, but each line packs a lot of punch and requires careful thought to optimize -- makes it a perfect test bed for this model in particular. For larger projects that are simpler, you’re probably better off with Claude 3.5 Sonnet, since it has double the context window.
And sometimes it just bugs out and doesn't give any response? Faced that twice now, it "thought" for like 10-30s then no answer and I had to click regenerate and wait for it again.
Thinking about training LLMs on geometry. A lot of information in the sources would be contained in the diagrams accompanying the text. This model is not multi-modal, so maybe it wasn't trained on the accompanying diagrams at all.
I would really like it if people tried a set of geometry questions and a set of analysis questions and compared the difference.
It will be trash. I'll have to dig up a chat I had the weekend GPT4 was released, I was musing about dodecahedron packing problems and GPT4 started with an assertion that a line through a sphere intersects the surface 3 times.
Maybe if you fine-tuned it on Euclid's Elements and then allowed it to run experiments with Mathematica snippets, it could check its assumptions before spouting nonsense.
The novelty to me is that “the experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student” in so many subject areas! I have found great value in using LLMs to sort things out. In areas where I am very experienced, they can be really helpful with tons of small chores. Like Terence was pointing out in his third experiment -- if you break the problem down, it does solid work filling in the smaller blanks. You need the conceptual understanding. Part of this is prompting skill. If you go into an area you don’t know, you have to build the prompts up: when the answer is known, dive into something small and specific and work outward; when you’re starting from the outside in, start specific and focused. I have used this to cut through conceptual layers of very complex topics I have zero knowledge in, and then verify my concepts via experts on YT, research papers, and trusted sources. It is an amazing tool.
This has been my experience as well. I treat LLMs like an intern or junior who can do the legwork that I have no bandwidth to do myself. I have to supervise it and help it along, checking for mistakes, but I do get useful results in the end.
Attitudinally, I suspect the people who are able to get value out of LLMs (paid ones - free ones are no good) are those who have had experience supervising interns or mentoring juniors, rather than grizzled lone individual contributors who never learned how to coax value out of people -- a camp I was in myself for most of my early career.
One of the most interesting aspects of this thread is how it brings us back to the fundamentals of attention in machine learning [1]. This is a key point: while humans have intelligence, our attention is inherently limited. This is why the concept behind Attention Is All You Need [2] is so relevant to what we're discussing.
My 2 cents: our human intelligence is the glue that binds everything together.
‘Able to make the same creative mathematical leaps as Terence Tao’ seems like a pretty high bar to be setting for AI.
This is like when you’re being interviewed for a programming job and the interviewer explains some problem to you that it took their team months to figure out, and then they’re disappointed you can’t whiteboard out the solution they came up with in 40 minutes without access to google.
My experience of working with people like Terence Tao, and being nowhere near their standard, is that they are looking for any kind of creativity. Everything is accepted, and it doesn't have to be "at their level".
Having read what he's saying there, and with my experience, I think your characterisation is inaccurate.
And having been at the talk he gave for the IMO earlier this year he is impressed with some of the interactions, it's just that he feels that any kind of "creative spark" is still missing.
Right, Terence was hoping it would offer something new to think about, some new perspective, right or wrong. GPTs have the ability to process insane amounts of information across all branches of math, science, and art -- an ability that eclipses that of even the most motivated intellectuals, Terence included. It is thus a little disappointing that it was unable to find anything in its vast knowledge base to apply a new lens to the problem.
I wonder what the creative spark even is in the context of an autoregressive transformer.
Perhaps it’s an ability to confabulate facts into the context window which are not present in the training data but which are, in the context of maths, viable hypotheses? Every LLM can generate bullshit, but maybe we just need the right bullshit?
> ‘Able to make the same creative mathematical leaps as Terence Tao’ seems like a pretty high bar to be setting for AI.
There's no need to try to infer this kind of high bar, because what he says is actually very specific and concrete: "Here the result was mildly disappointing ... Essentially the model proposed the same strategy that was already identified in the most recent work on the problem (and which I restated in the blog post), but did not offer any creative variants of that strategy." Crucially the blog post in question was part of his input to ChatGPT.
Otherwise, he's been clear that while he anticipates a future where it is more useful, at present he only uses AI/ChatGPT for bibliography formatting and for writing out simple "Hello World" style code. (He is, after all, a mathematician and not a coder.) I've seen various claims online that he's using ChatGPT all the time to help with his research and, beyond the coding usage, that just seems to not be true.
(However, it's fair to say that "able to help Terence Tao with research" is actually a high bar.)
This has been observed by more people than just Terence Tao. Try using ChatGPT to program something more complex than tutorial code, or to write a basic blog post: it lacks creativity and the code is poorly designed.
Even for basic Rust programs it ties itself into countless borrow-checker issues and cannot get out of them -- both the OpenAI models and Sonnet (Anthropic).
It doesn't really get logic still, but it does small edits well when the code is very clear.
I think this will always remain a problem. Because it can never shut up, it keeps making stuff up and "hallucinating" (working normally, just incorrectly), digging itself further and further into a hole.
Autocomplete on steroids is what peak AI will look like till the time we can crack consciousness and AGI (which the modern versions are nothing even close to).
If arguably the person with the highest IQ currently living is impressed, but still not fully satisfied because a computer doesn’t produce Nobel-prize-winning mathematical reasoning, I think that’s a massive measure of progress in itself.
So what then should the first year maths PhD think? I believe Tao obliquely addresses this with his previous post with effectively “o1 is almost as good as a grad student”
> If arguably the person with the highest IQ currently living, is impressed but still not fully satisfied that a computer doesn’t give Nobel prize winning mathematical reasoning
No offense, but every part of this characterization is really unserious! He says "the model proposed the same strategy that was already identified in the most recent work on the problem (and which I restated in the blog post), but did not offer any creative variants of that strategy." That's very different from what you're suggesting.
The way you're talking, it sounds like it'd be actually impossible for him to meaningfully say anything negative about AI. Presumably, if he was directly critical of it, it would only be because his standards as the world's smartest genius must simply be too high!
In reality, he's very optimistic about it in the future but doesn't find it useful now except for basic coding and bibliography formatting. It's fascinating to see how this very concrete and easily understood sentiment is routinely warped by the Tao-as-IQ-genius mythos.
It's interesting that humans would also benefit from "chain of thought" type reasoning. In fact, I would argue all students studying math would greatly increase their competence if they were required to recall all relevant definitions and information before using them. We don't do this in practice (including teachers and mathematicians!) because recall is effortful, and we don't like to spend more effort than necessary to solve a problem. If recall fails, then we have to look up information, which takes even more effort. This is why, in practice, there is a tremendous incentive to just "wing it".
AI has no emotional barrier to wasted effort, which makes it a better reasoner than its innate ability would suggest.
Showing your work in tests is kind of like “chain of thought” reasoning, but there’s a slight difference. Both force you to break down your process step by step, making sure the logic holds and you aren’t skipping crucial steps. But while showing your work is more about demonstrating the correct procedure, “chain of thought” reasoning pushes you to recall relevant definitions and concepts as you go, ensuring a deeper understanding. In both cases, the goal is to avoid just “winging it,” but “chain of thought” really digs into the recall aspect, which humans tend to avoid because it’s effortful.
Wow! I love this take. Somehow, with all this evidence of CoT helping LLMs, I never thought about using it more myself. Sure, we kind of do it already, but definitely not to the degree LLMs do, at least not usually. Maybe that's why writing is so often admired as a way to do great thinking - it enables longer chains of thought with less effort.
I assumed that everybody did this when trying to solve a maths problem they are stuck on (thinking university type level maths rather than school maths) and when I was teaching I would always get people to go back to the definitions.
I wasn't amazing at maths research (did a PhD and post-doc and then gave up) but my experience was that it was partly thinking hard about things and grappling with what was going on and trying to break it down somehow, but also scanning everything you know related to the problem, trying to find other problems that resemble it in some way that you can steal ideas from etc.
I'm so excited in anticipation of my near-term return to studying math as an independent curiosity hobby. It's going to be epically fun this time around with LLMs to lean on. Coincidentally, like Terence Tao, I've also been asking complex analysis queries* of LLMs, things I was trying to understand better while working through textbooks. Their ability to interpret open-ended math questions, and quickly find distant conceptual links that are helpful and relevant, astonishes me. Fields laureate Professor Tao (naturally) looks down on the current crop of mathematics LLMs—"not completely incompetent graduate student..."—but at my current ability level that just means looking up.
*(I remember a specific impressive example from 6 months ago: I asked if certain definitions could be relaxed to allow complex analysis on a non-orientable manifold, like a Klein bottle, something I spent a lot of time puzzling over, and an LLM instantly figured out it would make the Cauchy-Riemann equations globally inconsistent. (In a sense the arbitrary sign convention in CR defines an orientation on a manifold: reversing manifold orientation is the same as swapping i with -i. I understand this now, solely because an LLM suggested looking at it). Of course, I'm sure this isn't original LLM thinking—the math's certainly written down somewhere in its training material, in some highly specific postgraduate textbook I have no knowledge of. That's not relevant to me. For me, it's absolutely impossible to answer this type of question, where I have very little idea where to start, without either an LLM or a PhD-level domain specialist. There is no other tool that can make this kind of semantic-level search accessible to me. I'm very carefully thinking how best to make use of such an, incredibly powerful but alien, tool...)
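For anyone who wants the concrete statement (standard textbook material, my own summary rather than the LLM's output): for f = u + iv the Cauchy-Riemann equations read

```latex
% Cauchy-Riemann equations for f = u + iv:
\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y},
\qquad
\frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}
% Swapping i with -i (i.e. passing to \bar{f} = u - iv) sends v to -v
% and flips both signs, so a globally consistent sign choice amounts
% to a choice of orientation -- impossible on a non-orientable manifold.
```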
I agree. Having access to a kind of semantic full search engine on basically all textbooks on Earth feels like a superpower. Even better would be if it could pinpoint the exact textbook references it found the answer in.
How will we even measure this? Benchmarks are gamed/trained on and there is no way that there is much signal in the chatbot arena for these types of queries?
I think in just a few months the average user will not be able to tell the difference in performance between the major models.
Meanwhile, in the real world, I've found that I haven't written a line of code in weeks. Just paragraphs of text that specify what I want and then guidance through and around pitfalls in a simple iterative loop of useful working code.
It's entirely a learned skill, the models (and very importantly the tooling around them) have arrived at the base line they needed.
A much, much more productive world, just by knuckling down and learning how to do the work.
edit: https://aider.chat/ + paid 3.5 sonnet
What gets me is that everyone who says they've become more productive with LLMs won't say how, exactly. I can talk about how Vim has made it more enjoyable to edit code (keybindings and motions), how Emacs is a good environment for text tooling (a Lisp machine), how I use technical books to further my learning (so many great books out there). But no one really shows how they're actually solving problems with LLMs and why the alternatives were worse for them. It's all claims that it's great, with no further elaboration on the workflows.
> I haven't written a line of code in weeks. Just paragraphs of text that specify what I want and then guidance through and around pitfalls in a simple iterative loop of useful working code.
Code is intent described in terms of machine actions. Those actions can be masked by abstracting them into more understandable units, so we don't have to write opcodes; we can use Python instead. Programming is basically making the intent clear enough that we know which units we can use. Software engineering is mostly selecting the units so as to do minimal work when the intent changes or the foundational actions do.
Chatting with an LLM looks to me like your intent is either vague or you don't know which units to use. If it's the former, then I guess you're assuming it is the expert and will guide you to the solution you seek, which means you believe it understands the problem better than you do. The latter is stranger, as it looks like playing around with car parts while ignoring the manuals they come with.
What about boilerplate and common scenarios? I agree that LLMs help a great deal with those, but the fact is that there were already perfectly good tools for that, like snippets, templates, and code generators.
For the effect on the industry, I generally make the point that even if AI only replaces the below-average coder, it will put downward pressure on above-average coders' compensation expectations.
Personally, I think humans appear to be getting dumber at the same time that AI is getting smarter, and while, for now, the crossover point is at a low threshold, that threshold will of course increase over time. I used to try to teach ontologies, stats, and SMT solvers to humans before giving up and switching to AI technologies, where success is not predicated on human understanding. I used to think that the inability of most humans to understand these topics was a matter of motivation, but I have recently come to understand that these limitations are generally innate.
Even test cases have brought me no luck. The code was poorly written, being too complicated and dynamic for test code in the best case and just wrong on average. It constantly generated test cases that would be fine for other definitions of "tree edit distance" but were nonsense for my version of a "tree edit distance".
What are you doing where any of this actually works? I'm not some jaded angry internet person, but I'm honestly so flabbergasted about why I just can't get anything good out of this machine.
You are still responsible for what you do regardless of the means you used to do it. And a lot of people use this not because it’s more productive but because it requires less effort and less thought because those are the hard bits.
I’m collecting stats at the moment, but the general trend in quality -- as in functional defects produced -- is declining when an LLM is involved in the process.
So far it’s not a magic bullet but a push for mediocrity in an industry with a rather bad reputation. Never a good story.
Which is great until your next job interview. Really, it's tempting in the short run but I made a conscious decision to do certain tasks manually only so that I don't lose my basic skills.
But "lines of code written" is a hollow metric to prove utility. Code literacy is more effective than code illiteracy.
Lines of natural language vs discrete code is a kind of preference. Code is exact which makes it harder to recall and master. But it provides information density.
> by just knuckling down and learning how to do the work?
This is the key for me. What work? If it's the years of learning and practice toward proficiency to "know it when you see it" then I agree.
How are people doing this? I never end up using any of the code that GPT-4o/Copilot/Sonnet spit out, because it never meets my standards. How are other people accepting the shit it spits out?
Sure, it's not the best (most maintainable, non-redundant styling) code that's powering the app but it's more than enough to put an MVP out to the world and see if there's value/interest in the product.
This is cult like behaviour that reminds me so much of the crypto space.
I don't understand why people are not allowed to be critical of a technology or not find it useful.
And if they are they are somehow ignorant, over-reacting or deficient in some way.
Please post a video of your workflow.
It’s incredibly valuable for people to see this in action, otherwise they, quite legitimately, will simply think this is not true.
Now imagine how profoundly depressing it is to visit a HN post like this one, and be immediately met with blatant tribalism like this at the very top.
Do you genuinely think that going on a performative tirade like this is what's going to spark a more nuanced conversation? Or would you rather just the common sentiment be the same as yours? How many rounds of intellectual dishonesty do we need to figure this out?
could it be that you are mostly engaged in "boilerplate coding", where LLMs are indeed good?
Comment on first principles:
Following the dictum that you can't prove the absence of bugs, only their presence, the idea of what constitutes "working code" deserves much more respect.
From an engineering perspective, either you understand the implementation or you don't. There's no meaning to an iterative loop of producing "working" code.
Stepwise refinement is a design process under the assumption that each step is understood in a process of exploration of the matching of a solution to a problem. The steps are the refinement of definition of a problem, to which is applied an understanding of how to compute a solution. The meaning of working code is in the appropriateness of the solution to the definition of the problem. Adjust either or both to unify and make sense of the matter.
The discipline of programming is rotting when the definition of "working" is copying code from an oracle and running it to see if it goes wrong.
The measure of works must be an engineering claim of understanding the chosen problem domain and solution. Understanding belongs to the engineer.
LLMs do not understand and cannot be relied upon to produce correct code.
If use of an LLM puts the engineer in contact with proven principles, materials and methods which he adapts to the job at hand, while the engineer maintains understanding of correctness, maybe that's a gain.
But if the engineer relies on the LLM transformer as an oracle, how does the engineer locate the needed understanding? He can't get it from the transformer: he's responsible for checking the output of the transformer!
OTOH if the engineer draws on understanding from elsewhere, what is the value of the transformer but as a catalog? As such, who has accountability for the contents of the catalog? It can't be the transformer because it can't understand. It can't be the developer of the transformer because he can't explain why the LLM produces any particular result! It has to be the user of the transformer.
So a system of production is being created whereby the engineer's going-in position is that he lacks the understanding needed to code a solution and he sees his work as integrating the output of an oracle that can't be relied upon.
The oracle is a peculiar kind of calculator with an unknown probability of generating relevant output that works at superhuman speeds, while the engineer is reduced to an operator in the position of verifying that output at human speeds.
This looks like a feedback system for risky results and a slippery slope towards heretofore unknown degrees of incorrectness and margins for error.
At the same time, the only common vernacular for tracking oracle veracity is in arcane version numbers, which are believed, based on rough experimentation, to broadly categorize the hallucinatory tendencies of the oracle.
The broad trend of adoption of this sketchy tech is in the context of industry which brags about seeking disruption and distortion, regards its engineers as cost centers to be exploited as "human resources", and is managed by a specialized class of idiot savants called MBAs.
Get this incredible technology into infrastructure and in control of life sustaining systems immediately!
It's a strange experience, like taking a math class where the proofs are weird and none of the lessons click for you, and you start feeling stupid, only to learn your professor is an escaped dementia patient and it was gobbledygook to begin with.
I had a similar experience yesterday using o1 to see if a simple path exists from s to t through v, using max flow. It gave me a very convincing-looking algorithm that was fundamentally broken. My working solution used some techniques from its failed attempt, but even after repeated hints it failed to figure out a working answer (it stubbornly kept finding s->t flows, rather than realizing v->{s,t} was the key).
It's also extremely mentally fatiguing to check its reasoning. I almost suspect that RLHF has selected for obfuscating its reasoning, since obviously-wrong answers are easier to detect and penalize than subtly-wrong answers.
Benchmarking 10,000 attempts on an IQ test is irrelevant if, on most of those attempts, the time taken to repair an answer is longer than the time to complete the test yourself.
I find it's useful for generating exemplars in areas you're roughly familiar with but want some elaboration on, or a refresher. You can stitch it all together to get further, but when it comes time to actually build something, you need to start from scratch.
The time taken to reproduce what it's provided, now that you understand it, is trivial compared to the time needed to repair its flaws.
I'm interested in how you seem to be getting better answers than me (or maybe I just discard the answer and write it myself once I can see it's wrong?).
In fact, I just asked it to do (and explain) x!=y for x,y integer variables in the range {1..9}, and while the constraints are right, the explanation isn't.
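For reference, the usual big-M encoding of x != y introduces one binary variable b (this is the standard textbook construction, not necessarily the chat's exact output). A quick brute-force check over the stated range confirms the constraints do capture x != y:

```python
from itertools import product

# Big-M linearization of x != y for integers x, y in {1..9}.
# One binary b selects a branch:
#   x >= y + 1 - M * (1 - b)   (b = 1 forces x > y)
#   y >= x + 1 - M * b         (b = 0 forces y > x)
# M = 9 is large enough here, since |x - y| never exceeds 8.
M = 9

def encodes_neq(x, y):
    # The pair (x, y) is feasible iff some value of b satisfies both rows.
    return any(
        x >= y + 1 - M * (1 - b) and y >= x + 1 - M * b
        for b in (0, 1)
    )

# Feasible exactly when x != y -- the constraints are right.
assert all(
    encodes_neq(x, y) == (x != y)
    for x, y in product(range(1, 10), repeat=2)
)
```

The explanation to check for is exactly what the comments above say: b picks which strict inequality is active, and M deactivates the other one.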
https://chatgpt.com/share/66e652e1-8e2c-800c-abaa-92e29e0550...
The results are boilerplate at best, but misleading and insidious at worst, especially when you get into detailed tasks. Ever tried asking an LLM what a specific constraint does, or worse, asking it to explain the mathematical model behind some proprietary CPLEX syntactic sugar? It hallucinates the math, the syntax, the explanation, everything.
Have you tried again with the latest LLMs? ChatGPT4 actually (correctly) explains what each constraint does in English -- it doesn't just provide the constraint when you ask it for the formulation. Also, not sure if CPLEX should be involved at all -- I usually just ask it for mathematical formulations, not CPLEX calling code (I don't use CPLEX). The OR literature primarily contains math formulations and that's where LLMs can best do pattern matching to problem shape.
Many of the standard formulations are in here:
https://msi-jp.com/xpress/learning/square/10-mipformref.pdf
All the LLM is doing is fitting the problem description to a combination of these formulations (and others).
Very good at giving a textbook answer ("give a Python/ Numpy function that returns the Voronoi diagram of set of 2d points").
Now I ask for the Laguerre diagram, a variation that is not mentioned in textbooks but is very useful in practice. I can spend a lot of time spoon-feeding it, and I still just get the bullshitting student's answers.
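For readers unfamiliar with it: the Laguerre (power) diagram just replaces squared Euclidean distance with the power distance |x - p|^2 - w for a weighted site (p, w). A minimal sketch of cell membership (my own illustration of the definition, not LLM output, and not a full diagram construction):

```python
def power_cell_owner(x, sites):
    # sites: list of ((px, py), weight) pairs.
    # Site i owns the query point x when its power distance
    # |x - p_i|^2 - w_i is minimal.
    def power_dist(site):
        (px, py), w = site
        return (x[0] - px) ** 2 + (x[1] - py) ** 2 - w
    return min(range(len(sites)), key=lambda i: power_dist(sites[i]))

# With equal weights this reduces to the ordinary Voronoi assignment.
sites = [((0.0, 0.0), 0.0), ((2.0, 0.0), 0.0)]
print(power_cell_owner((0.4, 0.0), sites))  # 0: closer to the first site
```

A large enough weight can pull a point into a farther site's cell, which is exactly what makes the variation useful in practice.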
I tried other problems like numerical approximation, physics simulation, same experience.
I don't get the hype. Maybe it's good at giving variations of glue code, i.e. Stack Overflow meets autocomplete? As a search tool it's bad because it's so confidently incorrect that you may be fooled by bad answers.
One good riposte to reflexive LLM-bashing is, "Isn't that just what a stochastic parrot would say?" Some HN'ers would dismiss a talking dog because the C code it wrote has a buffer overflow error.
How many more years is senior SWE work going to be a $175k/yr gig instead of a $75k check-what-the-robot-does gig?
Correction: I complain that the only decent model in "Open"AI's arsenal, GPT-4, has been replaced by the cheaper GPT-4o, which gives subpar answers to most of my questions (I don't care that it does so faster). Since they've moved it to the "old, legacy" models, I expect they will phase it out, at which point I'll cancel my OpenAI subscription and Sonnet 3.5 will become the clear leader for my daily tasks.
Kudos to Anthropic for their great work, you guys are going in the right direction.
I tried to use 4/4o for a MIP several months ago. Frequently, it would iterate through three or four bad implementations over and over.
Claude 3.5 has been a significant improvement. I don’t really use chatgpt for anything at this point.
Nothing is static in the way things are moving.
Would you be willing to pay even more, if it meant you were getting proportionally more valuable answers?
E.g. $200/month or $2,000/month (assuming the $2,000/month gets into employee/intern/contractor level of results.)
This might drive a positive feedback loop.
Or (4): LLMs simply do not work properly for many use cases, in particular where large volumes of training data don't exist in the corpus.
And in these scenarios rather than say "I don't know" it will over and over again gaslight you with incoherent answers.
But sure condescendingly blame on the user for their ignorance and inability to understand or use the tool properly. Or call their criticism low-effort.
“The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student.”
Compare that with interacting with the equivalent of Alexa. That's a remarkable difference over five years.
If you think we are close to the maximum useful software in the world already, then maybe. I do not believe that. Seeing software production and time costs drop one to two orders of magnitude means we will have very different viable software production processes. I don’t believe for a second that it disenfranchises quality thinkers; it empowers them.
So far, there is little chance of a non-technical person developing a technical solution to their problems using AI.
The developers who will find LLMs the least useful are the "brilliant" ones who never found any utility in any of that stuff, partly because they are not reinventing the wheel for the 1000th time, but instead addressing more challenging and novel problems.
AI just destroyed shutterstock.
What it may do is change the job requirements. Web/JS decimated (reduced by 90% or more) MFC C++ jobs, after all.
The programmer doesn't just write Python. That is the how... not the what.
Once ChatGPT can even come close to replacing a junior engineer, you can retry your claim. The progression of the tech underlying ChatGPT will be sub-linear.
For "us", having such a level of intelligence available as an assistant throughout the day is a massive life upgrade, if we can just afford more tokens.
Then I see contrarians claiming that LLMs are literally never useful for anyone, and I get "don't believe your lying eyes" vibes. At this point, such sentiments feel either willfully ignorant, or said in bad faith. It's wild.
They just don't have the background, and probably lack the interest to dedicate studying for a few years to get to that level.
Intelligence is probably a distant third.
Incorrect. A university degree shows a good work ethic, a certain character, and an ability to manage time. It's not a measure of being better than the rest of humanity, and it's not a good measure of intelligence either. If you only want to view the world through credentials: academics don't consider your intelligence until you have a Ph.D. and X years of work in your field, and industry only uses a degree as an entry requirement for junior roles, caring only about your years of experience after that. Given that statement, I can only assume you haven't been to university. You are mistaken to think, especially in the time we are in now, that the elite class is any more knowledgeable than you are.
We won't have AGI or ASI, whatever definition people attach to those terms, in the next 5-10 years. But I would often like to refer to AI as Assisted or Augmented Intelligence. And it will provide enough value to drive current computer and smartphone sales for at least another 5-10 years, or 3-4 cycles.
Average Joe can't do anything like that yet, both because he won't be as good at prompting the model, and because his problems in life aren't text-based anyway.
Usually Alexa will order 10,000 rolls of toilet paper and ship them to my boss when I ask it to turn on the bathroom fan.
Personally, though, the utility of this level of skill (beginning grad student in many areas) is, for me, in areas where I have undergraduate-level questions. While I literally never ask it questions in my own field, I do for many other fields I don't know well, to help me learn. Over the summer my family traveled and I was home alone, so I fixed and renovated tons of stuff I didn't know how to do. I wore a headset with ChatGPT's voice mode on, asked it questions as I went, and it answered. This enabled me to complete dozens of projects I wouldn't have known how to even start otherwise. If I had had to stop and search the web, sift through forums and SEO hellscapes, and read loosely related instructions to synthesize my own answers, I would have gotten two rather than thirty projects done.
I've been saying this for quite some time now, but some people are in for a very rude awakening when the SOTA models 5-10 years from now are able to completely replace senior devs and engineers.
Better buckle up, and start diversifying your skills.
The grad students write the prompts, correct the model, and all of that is fed into a "more advanced" model. It's corpora of text. Repeat this for every grade level and subject.
Ask the model that's being trained on grad-level chemistry work a simple math question and it will probably get it wrong. They aren't "smart." They're aggregations of text plus ways to sample and predict.
The key isn’t whether these things are smart or not. The key is that they put something that can answer basic grad-level questions on almost any subject within everyone’s reach. For people who don’t have a graduate-level education in any subject, this is a remarkable tool.
I don’t know why the statement that “wow this is useful and a remarkable step forward” is always met with “yeah but it’s not actually smart.” So? Half of all humans have an IQ less than 100. They’re not smart either. Is this their value? For a machine, being able to produce accurate answers to most basic graduate level questions is -science fiction- regardless of whether it’s “smart.”
The NLP feat alone is stunning, and going from basically one step above gibberish to “basic grad school” in two years is a mouth dropping rate of change. I suspect folks who quibble over whether it’s “real intelligence” or simply a stochastic parrot have lost the ability to dream.
Not only that, it also helped me reimagine and conceptualize a new measure of statistical dependency based on Jensen-Shannon divergence that works very well. And it came up with a super fast implementation of normalized mutual information, something I tried to include in the library originally but struggled to find something fast enough when dealing with large vectors (say, 15,000 dimensions and up).
While it wasn’t able to give perfect Rust code that compiled on the very first try, it was able to fix all the bugs in one more try after I pasted in all the compiler warnings from VS Code. In contrast, gpt-4o would usually take dozens of tries to fix all the many Rust type errors, lifetime/borrowing errors, and so on that it would inevitably introduce. And Claude 3.5 Sonnet is just plain stupid when it comes to Rust for some reason.
I really have to say, this feels like a true game changer, especially when you have really challenging tasks that you would be hard pressed to find many humans capable of helping with (at least without shelling out $500k+/year in compensation for).
And it’s not just the performance optimization and relatively bug free code— it’s the creative problem solving and synthesis of huge amounts of core mathematical and algorithmic knowledge plus contemporary research results, combined with a strong ability to understand what you’re trying to accomplish and making it happen.
Here is the diff to the code file showing the changes:
https://github.com/Dicklesworthstone/fast_vector_similarity/...
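For readers unfamiliar with the idea mentioned above: a Jensen-Shannon-divergence-based dependency measure can be framed as the JSD between the empirical joint distribution of two variables and the product of their marginals. Here's a minimal histogram-based sketch in Python; the function name, binning choice, and structure are my own illustration, not the library's actual API or algorithm:

```python
import numpy as np

def js_dependency(x, y, bins=16):
    # Empirical joint distribution from a 2D histogram
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()

    # Product of marginals: what the joint would look like under independence
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    q = p_x * p_y

    def kl(a, b):
        # KL divergence in bits, skipping zero-probability cells
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    m = 0.5 * (p_xy + q)
    # JSD is symmetric and, with log base 2, bounded in [0, 1]:
    # near 0 means (empirically) independent, larger means more dependent
    return 0.5 * kl(p_xy, m) + 0.5 * kl(q, m)
```

With log base 2 the score lies in [0, 1]: close to 0 for independent samples (up to finite-sample bias from the histogram), and large for a deterministic relationship like y = x.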
cough
And now we have a $number we can relate, and refer, to.
For example, I asked a pretty simple question here and it got completely confused:
https://moorier.com/math-chat-1.png
https://moorier.com/math-chat-2.png
https://moorier.com/math-chat-3.png
(Full chat should be here: https://chatgpt.com/share/66e5d2dd-0b08-8011-89c8-f6895f3217...)
I would really like it if people tested it on a set of geometry questions and a set of analysis questions and compared the difference.
Maybe if you fine-tuned it on Euclid's Elements and then allowed it to run experiments with Mathematica snippets, it could check its assumptions before spouting nonsense.
Attitudinally, I suspect the people able to get value out of LLMs (paid ones - free ones are no good) are those who have experience supervising interns or mentoring juniors, rather than grizzled lone individual contributors who don't know how to coax value out of people -- I was in that latter camp myself for most of my early career.
One of the most interesting aspects of this thread is how it brings us back to the fundamentals of attention in machine learning [1]. This is a key point: while humans have intelligence, our attention is inherently limited. This is why the concept behind Attention Is All You Need [2] is so relevant to what we're discussing.
My 2 cents: our human intelligence is the glue that binds everything together.
[1] https://en.wikipedia.org/wiki/Attention_(machine_learning)
[2] https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
This is like when you’re being interviewed for a programming job and the interviewer explains some problem to you that it took their team months to figure out, and then they’re disappointed you can’t whiteboard out the solution they came up with in 40 minutes without access to google.
Having read what he's saying there, and with my experience, I think your characterisation is inaccurate.
And having been at the talk he gave for the IMO earlier this year, I can say he is impressed with some of the interactions; it's just that he feels any kind of "creative spark" is still missing.
Perhaps it’s an ability to confabulate facts into the context window which are not present in the training data but which are, in the context of maths, viable hypotheses? Every LLM can generate bullshit, but maybe we just need the right bullshit?
There's no need to try to infer this kind of high bar, because what he says is actually very specific and concrete: "Here the result was mildly disappointing ... Essentially the model proposed the same strategy that was already identified in the most recent work on the problem (and which I restated in the blog post), but did not offer any creative variants of that strategy." Crucially the blog post in question was part of his input to ChatGPT.
Otherwise, he's been clear that while he anticipates a future where it is more useful, at present he only uses AI/ChatGPT for bibliography formatting and for writing out simple "Hello World" style code. (He is, after all, a mathematician and not a coder.) I've seen various claims online that he's using ChatGPT all the time to help with his research and, beyond the coding usage, that just seems to not be true.
(However, it's fair to say that "able to help Terence Tao with research" is actually a high bar.)
It doesn't really get logic still, but it does small edits well when the code is very clear.
I think this will always remain a problem. Because it can never shut up, it keeps making stuff up and "hallucinating" (working normally, just incorrectly), digging itself further and further into a hole.
Autocomplete on steroids is what peak AI will look like till the time we can crack consciousness and AGI (which the modern versions are nothing even close to).
If arguably the person with the highest IQ currently living is impressed, but still not fully satisfied that a computer doesn't give Nobel-prize-winning mathematical reasoning, I think that's a massive metric in itself.
So what then should the first-year maths PhD think? I believe Tao obliquely addresses this in his previous post with, effectively, "o1 is almost as good as a grad student."
No offense, but every part of this characterization is really unserious! He says "the model proposed the same strategy that was already identified in the most recent work on the problem (and which I restated in the blog post), but did not offer any creative variants of that strategy." That's very different from what you're suggesting.
The way you're talking, it sounds like it'd be actually impossible for him to meaningfully say anything negative about AI. Presumably, if he was directly critical of it, it would only be because his standards as the world's smartest genius must simply be too high!
In reality, he's very optimistic about it in the future but doesn't find it useful now except for basic coding and bibliography formatting. It's fascinating to see how this very concrete and easily understood sentiment is routinely warped by the Tao-as-IQ-genius mythos.
AI has no emotional barrier to wasted effort, which makes them better reasoners than their innate ability would suggest.
I wasn't amazing at maths research (I did a PhD and post-doc and then gave up), but my experience was that it was partly thinking hard about things, grappling with what was going on, and trying to break it down somehow, but also scanning everything you know related to the problem, trying to find other problems that resemble it in some way you can steal ideas from, etc.
(I remember a specific impressive example from 6 months ago: I asked if certain definitions could be relaxed to allow complex analysis on a non-orientable manifold, like a Klein bottle, something I had spent a lot of time puzzling over, and an LLM instantly figured out it would make the Cauchy-Riemann equations globally inconsistent. (In a sense the arbitrary sign convention in CR defines an orientation on a manifold: reversing manifold orientation is the same as swapping i with -i. I understand this now, solely because an LLM suggested looking at it.) Of course, I'm sure this isn't original LLM thinking -- the math is certainly written down somewhere in its training material, in some highly specific postgraduate textbook I have no knowledge of. That's not relevant to me. For me, it's absolutely impossible to answer this type of question, where I have very little idea where to start, without either an LLM or a PhD-level domain specialist. There is no other tool that can make this kind of semantic-level search accessible to me. I'm thinking very carefully about how best to make use of such an incredibly powerful but alien tool...)
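For readers curious about the sign convention mentioned above (my own paraphrase of the standard argument, not the LLM's output): writing f = u + iv, the Cauchy-Riemann equations are

```latex
\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y},
\qquad
\frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}.
```

Under an orientation-reversing coordinate change such as (x, y) → (x, −y), the minus sign migrates to the other equation, which is exactly the condition for the conjugate u − iv (i.e. i swapped with −i) to be holomorphic. So a globally consistent choice of CR sign amounts to a choice of orientation, and no such choice exists on a non-orientable surface like the Klein bottle.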
I think in just a few months the average user will not be able to tell the difference in performance between the major models.