Author of HumanifyJS here! I've created an LLM-based tool specifically for this, which uses LLMs at the AST level to guarantee that the code keeps working after the unminification step:
Would it be difficult to add a 'rename from scratch' feature? I mean a feature that takes normal code (as opposed to minified code) and (1) scrubs all the user's meaningful names, (2) chooses names based on the algorithm and remaining names (ie: the built-in names).
Sometimes when I refactor, I do this manually with an LLM. It is useful in at least two ways: it can reveal better (more canonical) terminology for names (eg: 'antiparallel_line' instead of 'parallel_line_opposite_direction'), and it can also reveal names that could be generalized (eg: 'find_instance_in_list' instead of 'find_animal_instance_in_animals').
What kind of question does it ask the LLM? Giving it a whole function and asking "What should we rename <variable 1>?" repeatedly until everything has been renamed?
Asking it to do it on the whole thing, then parsing the output and checking that the AST still matches?
Does it work with huge files? I'm talking about something like 50k lines.
Edit: I'm currently trying it with a mere 1.2k-line JS file (openai mode), and it's only 70% done after 20 minutes. Even if it theoretically works with a 50k LOC file, I don't think you should try.
It does work with any sized file, although it is quite slow if you're using the OpenAI API. HumanifyJS works by processing each variable name separately, which keeps the context size manageable for an LLM.
I'm currently working on parallelizing the rename process, which should give orders of magnitude faster processing times for large files.
> Large files may take some time to process and use a lot of tokens if you use ChatGPT. For a rough estimate, the tool takes about 2 tokens per character to process a file:
> echo "$((2 * $(wc -c < yourscript.min.js)))"
> So for reference: a minified bootstrap.min.js would take about $0.5 to un-minify using ChatGPT.
> Using humanify local is of course free, but may take more time, be less accurate, and may not be possible with your existing hardware.
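As a rough sanity check of that figure (my numbers, not the README's): bootstrap.min.js is on the order of 60 kB, so at ~2 tokens per character that's roughly 120k tokens; at the few-dollars-per-million-tokens pricing of GPT-4o-class models that lands in the tens of cents, which is consistent with the ~$0.5 estimate (the exact cost depends on the model and on output tokens).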
It uses smart feedback to fix the code when LLMs occasionally hiccup on the code. You could also have a "supervisor LLM" that asserts that the resulting code matches the specification, and gives feedback if it doesn't.
It's a shame this loses one of the most useful aspects of LLM un-minifying - making sure it's actually how a person would write it. E.g. GPT-4o directly gives the exact same code (+contextual comments) with the exception of writing the for loop in the example in a natural way:
for (var index = 0; index < inputLength; index += chunkSize) {
Comparing the ASTs is useful though. Perhaps there's a way to combine the approaches - have the LLM convert, compare the ASTs, have the LLM explain the practical differences (if any) in context of the actual implementation and give it a chance to make any changes "more correct". Still not guaranteed to be perfect but significantly more "natural" resulting code.
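For the comparison step, here's a minimal sketch of what "compare the ASTs" could look like using @babel/parser (an illustration of the idea, not any tool's built-in API): strip location metadata, normalize identifier names in order of appearance, and deep-compare what's left, so two versions that differ only in naming and formatting come out equal.

    import { parse } from "@babel/parser";

    // Recursively copy an AST node, dropping position/comment metadata and
    // replacing identifier names with canonical ones (id0, id1, ...).
    function normalize(node, names = new Map()) {
      if (Array.isArray(node)) return node.map((n) => normalize(n, names));
      if (node && typeof node === "object") {
        const out = {};
        for (const [key, value] of Object.entries(node)) {
          if (["start", "end", "loc", "range", "extra", "comments",
               "leadingComments", "trailingComments", "innerComments"].includes(key)) continue;
          out[key] = normalize(value, names);
        }
        if (out.type === "Identifier") {
          if (!names.has(out.name)) names.set(out.name, `id${names.size}`);
          out.name = names.get(out.name);
        }
        return out;
      }
      return node;
    }

    function sameStructure(codeA, codeB) {
      return JSON.stringify(normalize(parse(codeA).program)) ===
             JSON.stringify(normalize(parse(codeB).program));
    }

    console.log(sameStructure("var a=1;a+2", "var total = 1;\ntotal + 2")); // true

This is naive about scoping (it only tracks name order), but it's enough to flag when the LLM's "more natural" rewrite changed structure rather than just names.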
As someone who has spent countless hours and days deobfuscating malicious Javascript by hand (manually and with some scripts I wrote), your tool is really, really impressive. Running it locally on a high end system with a RTX 4090 and it's great. Good work :)
how do you make an LLM work on the AST level? do you just feed a normal LLM a text representation of the AST, or do you make an LLM where the basic data structure is an AST node rather than a character string (human-language word)?
The frontier models can all work with both source code and ASTs as a result of their standard training.
Knowing this raises the question: which is better to feed an LLM, source code or ASTs?
The answer really depends on the use case; there are tradeoffs. For example, keeping comments intact possibly gives the model hints to reason better. On the other hand, it can be argued that a pure AST has less noise for the model to be confused by.
There are other tradeoffs as well. For example, any analysis relating to coding styles would require the full source code.
On a structural level it's exactly 1:1: HumanifyJS only does renames, no refactoring. It may come up with better names for variables than the original code though.
Came here to say Humanify is awesome, both as a specific tool and, in my opinion, as a really great way to think about how to get the most out of inherently high-temperature activities like modern decoder nucleus sampling.
JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy. It would of course be tedious to do manually in general, but the transformations themselves are fairly limited, so it is possible to read minified code with only some notes to track mangled identifiers.
A more general unminification or unobfuscation still seems to be an open problem. I have written a handful of intentionally obfuscated programs in the past, and in my experience ChatGPT couldn't understand them even at the surface level. For example, a gist for my 160-byte-long Brainfuck interpreter in C had a comment trying to use GPT-4 to explain the code [1], but the "clarified version" bore zero similarity to the original code...
> JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy.
Just because a task is simple doesn't mean its inverse need be. Examples:
- multiplication / prime factorization
- deriving / integrating
- remembering the past / predicting the future
Code unobfuscation is clearly one of those difficult inverse problems, as it can be easily exacerbated by any of the following problems:
- bugs
- unused or irrelevant routines
- incorrect implementations that incidentally give the right results
In that sense, it would be fortunate if ChatGPT could give decent results at unobfuscating code, as there is no a priori expectation that it should be able to do so. It's good that you've also checked ChatGPT's code unobfuscation capabilities on a more difficult problem, but I think you've only discovered an upper limit. I wouldn't consider the example in the OP to be trivial.
Of course, it is not generalizable! In my experience though, most minifiers do only the following:
- Whitespace removal, which is trivially invertible.
- Comment removal, which we never expect to recover via unminification.
- Renaming to shorter names, which is tedious to track but still mechanical. And most minifiers have little understanding of underlying types anyway, so they are usually very conservative and rarely reuse the same mangled identifier for multiple uses. (Google Closure Compiler is a significant counterexample here, but it is also known to be much slower.)
- Constant folding and inlining, which is annoying but can be still tracked. Again, most minifiers are limited in their reasoning to do extensive constant folding and inlining.
- Language-specific transformations, like turning `a; b; c;` into `a, b, c;` and `if (a) b;` into `a && b;` whenever possible. They will be hard to understand if you don't know in advance, but there aren't too many of them anyway.
As a result, minified code still remains comparably human-readable with some note taking and perseverance. And since these transformations are mostly local, I would expect LLMs can pick them up on their own as well.
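To make the list above concrete, here's a hand-made before/after in the spirit of those transformations (a real minifier's output will differ in details):

    // Original:
    function sendGreeting(user) {
      // Skip users who opted out
      if (user.wantsEmail) {
        sendEmail(user.address, "Hello, " + user.name);
      }
    }

    // Typical minified form: whitespace and the comment are gone, the local
    // parameter is mangled, and `if (a) b;` became `a && b;` -- but property
    // names and the global `sendEmail` are left alone.
    function s(u){u.wantsEmail&&sendEmail(u.address,"Hello, "+u.name)}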
As a point of order, Code Minification != Code Obfuscation.
Minification does tend to obfuscate as a side effect, but it is not the goal, so reversing minification is much easier. Obfuscation, on the other hand, can minify code, but crucially that isn't where it starts from. As the goals of minification and obfuscation differ, reversing them takes different effort, and I'd much rather attempt to reverse minification than obfuscation.
I'd also readily believe there are hundreds or thousands of examples online of reverse code minification (or "here is code X, here is code X _after_ minification") that LLMs have ingested in their training data.
> JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy.
This is stated as if it's a truism, but I can't understand how you can actually believe this. Converting `let userSignedInTimestamp = new Date()` to `let x = new Date()` is trivial, but going the other way probably requires reading and understanding the rest of the surrounding code to see in what contexts `x` is being used. Also, the rest of the code is also minified, making this even more challenging. Even if you do all that right, at best it's still a lossy conversion, since the name of the variable could capture characteristics that aren't explicitly outlined in the code at all.
You are technically correct, but I think you should try some reverse engineering to see that it is usually possible to reconstruct much of it in spite of the number of transformations made. I do understand that this fact might be hard to believe without any prior experience.
EDIT: I think I got why some comments complain I downplayed the power of LLMs here. I never meant to; I wanted to say that unminification is a relatively easy task compared to other reverse engineering tasks. It is great that we can automate the easy task, but we still have to wait for a better model to do much more.
Because of how trivial that step is, it's likely pretty easy to just take lots of code and minify it. Then you have the training data you need to learn to generate full code from minified code. If your goal is to generate additional useful training data for your LLM, it could make sense to actually do that.
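A minimal sketch of that data-generation idea using terser (the minify options are real terser API; everything about where the files come from and how pairs are stored is assumed for illustration):

    import { minify } from "terser";
    import { readFile } from "node:fs/promises";

    // Turn one readable JS file into a (minified -> original) training pair.
    async function makePair(path) {
      const original = await readFile(path, "utf8");
      const { code: minified } = await minify(original, {
        mangle: true,                  // rename locals to short identifiers
        compress: true,                // constant folding, `if`->`&&` rewrites, ...
        format: { comments: false },   // drop comments
      });
      return { input: minified, target: original };
    }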
And I can’t understand why any reasonably intelligent human feels the need to be this abrasive. You could educate but instead you had to be condescending.
Converting a picture from color to black and white is a fairly simple task. Getting back the original in color is not easy. This is of course due to data lost in the process.
Minification works in the same way. A lot of information needed for understanding the code is lost. Getting back that information can be a very demanding task.
But it is not much different from reading through badly documented code without any comments or meaningful names. In fact, much of the code that gets minified is not that bad, so it is often possible to infer the original code just from its structure. It is still not a trivial task, but I think my comment never implied that.
The act of reducing the length of variable names by replacing something descriptive (like "timeFactor") with something much shorter ("i") may be mechanical and simple, but it is destructive, and reversing that is not relatively easy; in fact, it's impossible to do without a fairly sophisticated understanding of what the code does. That's what the LLM did for this; which isn't exactly surprising, but it is cool; being so immediately dismissive isn't cool.
I never meant to be dismissive, in fact my current job is to build a runtime for ML accelerator! I rather wanted to show that unminification is much easier than unobfuscation, and that the SOTA model is yet to do the latter.
Also, it should be noted that the name reconstruction is not a new problem and was already partly solved multiple times before the LLM era. LLM is great in that it can do this without massive retraining, but the reconstruction depends much on the local context (which was how earlier solutions approached the problem), so it doesn't really show its reasoning capability.
That's much better in that most of the original code remains present and the comments are not that far off, but its understanding of global variables is utterly wrong (to be expected though, as many of them serve multiple purposes).
Yep, I've tried to use LLMs to disassemble and decompile binaries (giving them the hex bytes as plaintext), they do OK on trivial/artificial cases but quickly fail after that.
The generative pretrained transformer was invented by OpenAI, and it seems reasonable for a company to use the name it gave to its invention in its branding.
Of course, they didn't invent Generative pretraining (GP) or transformers (T), but AFAIK they were the first to publicly combine them
It’s not only their core strength — it’s what transformers were designed to do and, arguably, it’s all they can do. Any other supposed ability to reason or even retain knowledge (rather than simply regurgitate text without ‘understanding’ its intended meaning) is just a side effect of this superhuman ability.
I see your point, but I think there's more to it. It's kind of like saying "all humans can do is perceive and produce sound, any other ability is just a side-effect". We might be focusing too much on their mechanism for "perception" and overlooking other capabilities they've developed.
this overlooks how they do it. we don't really know. it might be logical reasoning, it might be a very efficient content addressable human-knowledge-in-a-blob-of-numbers lookup table... it doesn't matter if they work, which they do, sometimes scarily well. dismissing their abilities because they 'don't reason' is missing the forest for the trees in that they'd be capable of reasoning if they were able to run sat solvers on their output mid generation.
Particularly those that are basically linear, that don’t involve major changes in the order of things or a deep consideration of relationships between things.
They can’t sort a list but they can translate languages, for instance, given that a list sorted almost right is wrong but that we will often settle for an almost right translation.
One potential benefit should be that, with the right tooling around it, it should be able to translate your code base to a different language and/or framework more or less at the push of a button. So if a team is wondering whether it would be worth it to switch a big chunk of the code base from Python to Elixir, they don't have to wonder anymore.
I tried translating a python script to javascript the other day and it was flawless. I would expect it to scale with a bit of hand-railing.
ChatGPT is trained well enough on all things AWS that it can do a decent job translating Python based SDK code to Node and other languages, translate between CloudFormation/Terraform/CDK (in various languages).
It does well at writing simple to medium complexity automation scripts around AWS.
If it gets something wrong, I tell it to “verify your answer using the documentation available on the web”
The problem is that the use case has to be one where you don't care about the risk of hallucinations, or where you can validate the output without already having the data in a useful format. Plus you need to lack the knowledge/skill to do it more quickly using awk/python/perl/whatever.
I think text transformation is a sufficiently predictable task that one could make a transformer that completely avoids hallucinations. Most LLMs run at high temperatures, which introduces randomness, and therefore hallucinations, into the result.
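For what it's worth, the sampling randomness is a knob you already control; here's a minimal sketch with the OpenAI Node SDK (the model name is just an example), pinning temperature to 0 for near-deterministic decoding -- though that alone doesn't rule out hallucinations:

    import OpenAI from "openai";

    const client = new OpenAI();
    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini",
      temperature: 0, // greedy-ish decoding: minimal sampling randomness
      messages: [
        { role: "user", content: "Rename the variables in: let x = new Date();" },
      ],
    });
    console.log(completion.choices[0].message.content);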
I'm sure there's some number greater than zero of developers who are upset because they use minification as a means of obfuscation.
Reminds me of the tool that was provided in older versions of ColdFusion that would "encrypt" your code. It was a very weak algorithm, and didn't take long for someone to write a decrypter. Nevertheless some people didn't like this, because they were using this tool, thinking it was safe for selling their code without giving access to source. (In the late 90s/early 2000s before open source was the overwhelming default)
This is an example of superior intellectual performance to humans.
There’s no denying it. This task is intellectual. Does not involve rote memorization. There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
The llm understands what it is unminifying and it is in general superior to humans in this regard. But only in this specific subject.
> There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
Are you sure about this? These can be easily generated from existing JS to use as a training set, not to mention the enormous amount of non-minified JS which is already used to train it.
I'm bullish on AI, but I'm not convinced this is an example of what you're describing.
The challenge of understanding minified code for a human comes from opaque variable names, awkward loops, minimal whitespacing, etc. These aren't things that a computer has trouble with: it's why we minify in the first place. Attention, as a scheme, should do great with it.
I'd also say there is tons of minified/non-minified code out there. That's the goal of a map file. Given that OpenAI has specifically invested in web browsing and software development, I wouldn't be surprised if part of their training involved minified/unminified data.
> These aren't things that a computer has trouble with
They are irrelevant for executing the code, but they're probably pretty relevant for an LLM that is ingesting the code and text and inferring its function based on other examples it has seen. It's definitely more impressive that an LLM can succeed at this without the context of (correct) variable names than with them.
Minification and unminification are heuristic processes, not algorithmic ones. It is akin to decompiling code or reverse engineering. It's a step beyond just your typical AI you see in a calculator.
I don’t claim expertise in AI or understanding intelligence, but could we also say that a pocket calculator really understands arithmetic and has superior intellectual performance compared to humans?
Why not count the fact that humans created a tool to help themselves at unminifying towards human score?
Having a computer multiplying 1000-digit numbers instantly is an example of humans succeeding at multiplying: by creating a tool for that first. Because what else is intellectually succeeding there? It’s not like the computer has created itself.
If one draws a boundary of human intelligence at the skull bone and does not count the tools that this very intelligence is creating and using as mere steps of problem solving process, then one will also have to accept that humans are not intelligent enough to fly into space or do surgery or even cook most of the meals.
> Does not involve rote memorization. There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
GPT-4 has consumed more code than your entire lineage ever will and understands the inherent patterns between code and minified versions. Recognizing the abstract shape of code sans variable names and mapping in some human readable variable names from a similar pattern you've consumed from the vast internet doesn't seem farfetched.
I think I’d agree with your statement, in the same sense that a chess simulator or AlphaGo are superior to human intellect for their specific problem spaces.
LLMs are very good at a surprisingly broad array of semi-structured-text-to-semi-structured-text transformations, particularly within the manifold of text that is widely available on the internet.
It just so happens that lots of code is widely available on the internet, so LLMs tend to outperform on coding tasks. There’s also lots of marketing copy, general “encyclopedic” knowledge, news, human commentary, and entertainment artifacts (scripts, lyrics, etc). LLMs traverse those spaces handily as well. The capabilities of AI ultimately boil down to their underlying dataset and its quality.
You're in denial. Nobody is worshipping a god here.
I'm simply saying the AI has superior performance to humans on this specific subject. That's all.
Why did you suddenly make this comment of "bowing before your god" when I didn't even mention anything remotely close to that?
I'll tell you why. Because this didn't come from me. It came from YOU. This is what YOU fear most. This is what YOU think about. And your fear of this is what blinds you to the truth.
That's interesting. It's gotten a lot better I guess. A little over a year ago, I tried to use GPT to assist me in deobfuscating malicious code (someone emailed me asking for help with their hacked WP site via custom plugin). I got much further just stepping through the code myself.
After reading through this article, I tried again [0]. It gave me something to understand, though it's obfuscated enough to essentially eval unreadable strings (via the Window object), so it's not enough on its own.
Here was an excerpt of the report I sent to the person:
> For what it’s worth, I dug through the heavily obfuscated JavaScript code and was able to decipher logic that it:
> - Listens for a page load
> - Invokes a facade of calculations which are in theory constant
> - Redirects the page to a malicious site (unk or something)
Anyone working on decompiler LLMs? Seems like we could render all code open source.
Training data would be easy to make in this case. Build tons of free GitHub code with various compilers and train on inverting compilation. This is a case where synthetic training data is appropriate and quite easy to generate.
You could train the decompiler to just invert compilation and then use existing larger code LLMs to do things like add comments.
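A rough sketch of that synthetic-data pipeline (standard gcc/objdump invocations; the dataset layout and file names are assumptions for illustration):

    import { execSync } from "node:child_process";
    import { readFileSync, writeFileSync } from "node:fs";

    // Compile one C translation unit and pair its disassembly with the source.
    function makeDecompilePair(cFile) {
      execSync(`gcc -O2 -c ${cFile} -o /tmp/unit.o`);
      const disassembly = execSync("objdump -d /tmp/unit.o").toString();
      return { input: disassembly, target: readFileSync(cFile, "utf8") };
    }

    writeFileSync("pair.json", JSON.stringify(makeDecompilePair("example.c"), null, 2));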
The potential implications of this are huge. Not just open sourcing, but imagine easily decompiling and modifying proprietary apps to fix bugs or add features. This could be a huge unlock, especially for long dead programs.
For legal reasons I bet this will become blocked behavior in major models.
I've never seen a law forbidding decompiling programs.
But some programs' license agreements forbid decompiling them. Further, you still don't have any rights to the source code. It depends on the license...
Unfortunately not really. Having the source is a first step, but you also need the rights to use it (read, modify, execute, redistribute the modifications), and only the authors of the code can grant these rights.
Doesn't it count as 'clean room' reverse engineering? Or alternatively, we could develop an LLM that's trained on the outputs and side-effects of any given function and learns to reproduce the source code from that.
Or, going back to the original idea, while the source code produced in such a way might be illegal, it's very likely 'clean' enough to train an LLM on it to be able to help in reproducing such an application.
> Seems like we could render all code open source.
I agree. I think "AI generating/understanding source code" is a huge red herring. If AI was any good at understanding code, it would just build (or fix) the binary.
And I believe that is how it will turn out: when we really have AI programmers, they will not bother with human-readable code, but will code everything in machine code (and if they are tasked with maintaining an existing system, they will understand it in its entirety, across the SW and HW stack). It's kind of like how diffusion models that generate images don't actually bother with learning drawing techniques.
Why wouldn't AIs benefit from using abstractions? At the very least it saves tokens. Fewer tokens means less time spent solving a problem, which means more problem solving throughput. That is true for machines and people alike.
If anything I expect AI-written programs in the not so distant future to be incomprehensible because they're too short. Something like reading an APL program.
There was a paper about this at CGO earlier this year [1]. Correctness is a problem that is hard to solve, though; 50% accuracy might not be enough for serious use cases, especially given that the relation to the original input for manual intervention is hard to preserve.
You could already break the law and open yourself up to lawsuits and prosecution by stealing intellectual property and violating its owners rights before there were LLMs. They just make it more convenient, not less illegal.
I think there's actually some potential here, considering LLMs are already very good at translating text between human languages. I don't think LLMs on their own would be very good, but a specially trained AI model perhaps, such as those trained for protein folding. I think what an LLM could do best is generate better decompiled code, giving better names to symbols, and generating code in a style a human is more likely to write.
I usually crap on things like chatgpt for being unreliable and hallucinating a lot. But in this particular case, decompilers already usually generate inaccurate code, and it takes a lot of work to fix the decompiled code to make it correct (I speak from experience). So introducing AI here may not be such a huge stretch. Just don't expect an AI/LLM to generate perfectly correct decompiled code and we're good (wishful thinking).
https://github.com/jehna/humanify
Sometimes when I refactor, I do this manually with an LLM. It is useful in at least two ways: it can reveal better (more canonical) terminology for names (eg: 'antiparallel_line' instead of 'parallel_line_opposite_direction'), and it can also reveal names that could be generalized (eg: 'find_instance_in_list' instead of 'find_animal_instance_in_animals').
1. I ask the LLM to describe the meaning of the variable in the surrounding code
2. Given just the description, I ask the LLM to come up with the best possible variable name
You can check the source code for the actual prompts:
https://github.com/jehna/humanify/blob/eeff3f8b4f76d40adb116...
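For a rough idea of the shape of that flow, here's a minimal sketch with the OpenAI Node SDK -- the prompt wording and model choice here are my own assumptions, not HumanifyJS's actual prompts (those are in the linked source):

    import OpenAI from "openai";

    const client = new OpenAI();

    async function suggestName(variableName, surroundingCode) {
      // Step 1: describe what the variable is used for in its context.
      const describe = await client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{
          role: "user",
          content: `Describe in one sentence what the variable \`${variableName}\` is used for in this code:\n\n${surroundingCode}`,
        }],
      });
      const description = describe.choices[0].message.content;

      // Step 2: name it from the description alone.
      const named = await client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{
          role: "user",
          content: `Suggest a concise, descriptive JavaScript variable name for: "${description}". Reply with the name only.`,
        }],
      });
      return named.choices[0].message.content.trim();
    }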
I'm still waiting for the AST level version control tbh
Asking it to do it on the whole thing, then parsing the output and checking that the AST still matches?
1. It asks the LLM to write a description of what the variable does
2. It asks for a good variable name based on the description from 1.
3. It uses a custom Babel plugin to do a scope-aware rename
This way the LLM only decides the name, but the actual renaming is done with traditional and reliable tools.
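A minimal sketch of what step 3 can look like with Babel (not HumanifyJS's actual plugin, just an illustration of why the rename itself is reliable): Babel's scope API renames only the references bound to that exact variable, so shadowed names elsewhere are untouched.

    import { parse } from "@babel/parser";
    import traverse from "@babel/traverse";
    import generate from "@babel/generator";

    // Rename a binding visible at program scope, leaving shadowing bindings alone.
    function renameBinding(code, oldName, newName) {
      const ast = parse(code);
      traverse(ast, {
        Program(path) {
          if (path.scope.hasBinding(oldName)) path.scope.rename(oldName, newName);
        },
      });
      return generate(ast).code;
    }

    console.log(renameBinding("var a = 1; function f(a) { return a + 1; }", "a", "counter"));
    // Only the outer `a` becomes `counter`; the parameter `a` inside `f` is a
    // different binding and keeps its name.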
Edit: I'm currently trying it with a mere 1.2k-line JS file (openai mode), and it's only 70% done after 20 minutes. Even if it theoretically works with a 50k LOC file, I don't think you should try.
I'm currently working on parallelizing the rename process, which should give orders of magnitude faster processing times for large files.
> Large files may take some time to process and use a lot of tokens if you use ChatGPT. For a rough estimate, the tool takes about 2 tokens per character to process a file:
> echo "$((2 * $(wc -c < yourscript.min.js)))"
> So for reference: a minified bootstrap.min.js would take about $0.5 to un-minify using ChatGPT.
> Using humanify local is of course free, but may take more time, be less accurate, and may not be possible with your existing hardware.
https://arxiv.org/pdf/2405.15793
It uses smart feedback to fix the code when LLMs occasionally hiccup on the code. You could also have a "supervisor LLM" that asserts that the resulting code matches the specification, and gives feedback if it doesn't.
Making the code, fully commenting it and also giving an example after that might cost three times as much
Knowing this raises the question: which is better to feed an LLM, source code or ASTs?
The answer really depends on the use case; there are tradeoffs. For example, keeping comments intact possibly gives the model hints to reason better. On the other hand, it can be argued that a pure AST has less noise for the model to be confused by.
There are other tradeoffs as well. For example, any analysis relating to coding styles would require the full source code.
Babel first parses the code to AST, and for each variable the tool:
1. Gets the variable name and surrounding scope as code
2. Asks the LLM to come up with a good name for the given variable name, by looking at the scope where the variable is
3. Uses Babel to make the context-aware rename in the AST based on the LLM's response
API access is needed for the ChatGPT mode, as there are many round trips and it uses advanced API-only tricks to force the LLM output.
For small scripts I've found the output to be very similar between small local models and GPT-4o (judging by a human eye).
+1
A more general unminification or unobfuscation still seems to be an open problem. I have written a handful of intentionally obfuscated programs in the past, and in my experience ChatGPT couldn't understand them even at the surface level. For example, a gist for my 160-byte-long Brainfuck interpreter in C had a comment trying to use GPT-4 to explain the code [1], but the "clarified version" bore zero similarity to the original code...
[1] https://gist.github.com/lifthrasiir/596667#gistcomment-47512...
Just because a task is simple doesn't mean its inverse need be. Examples:
- multiplication / prime factorization
- deriving / integrating
- remembering the past / predicting the future
Code unobfuscation is clearly one of those difficult inverse problems, as it can be easily exacerbated by any of the following problems:
- bugs
- unused or irrelevant routines
- incorrect implementations that incidentally give the right results
In that sense, it would be fortunate if ChatGPT could give decent results at unobfuscating code, as there is no a priori expectation that it should be able to do so. It's good that you've also checked ChatGPT's code unobfuscation capabilities on a more difficult problem, but I think you've only discovered an upper limit. I wouldn't consider the example in the OP to be trivial.
Of course, it is not generalizable! In my experience though, most minifiers do only the following:
- Whitespace removal, which is trivially invertible.
- Comment removal, which we never expect to recover via unminification.
- Renaming to shorter names, which is tedious to track but still mechanical. And most minifiers have little understanding of underlying types anyway, so they are usually very conservative and rarely reuse the same mangled identifier for multiple uses. (Google Closure Compiler is a significant counterexample here, but it is also known to be much slower.)
- Constant folding and inlining, which is annoying but can be still tracked. Again, most minifiers are limited in their reasoning to do extensive constant folding and inlining.
- Language-specific transformations, like turning `a; b; c;` into `a, b, c;` and `if (a) b;` into `a && b;` whenever possible. They will be hard to understand if you don't know in advance, but there aren't too many of them anyway.
As a result, minified code still remains comparably human-readable with some note taking and perseverance. And since these transformations are mostly local, I would expect LLMs can pick them up on their own as well.
(But why? Because I do inspect such programs fairly regularly, for example for comments like https://news.ycombinator.com/item?id=39066262)
Minification does tend to obfuscate as a side effect, but it is not the goal, so reversing minification is much easier. Obfuscation, on the other hand, can minify code, but crucially that isn't where it starts from. As the goals of minification and obfuscation differ, reversing them takes different effort, and I'd much rather attempt to reverse minification than obfuscation.
I'd also readily believe there are hundreds or thousands of examples online of reverse code minification (or "here is code X, here is code X _after_ minification") that LLMs have ingested in their training data.
This is stated as if it's a truism, but I can't understand how you can actually believe this. Converting `let userSignedInTimestamp = new Date()` to `let x = new Date()` is trivial, but going the other way probably requires reading and understanding the rest of the surrounding code to see in what contexts `x` is being used. Also, the rest of the code is also minified, making this even more challenging. Even if you do all that right, at best it's still a lossy conversion, since the name of the variable could capture characteristics that aren't explicitly outlined in the code at all.
EDIT: I think I got why some comments complain I downplayed the power of LLMs here. I never meant to; I wanted to say that unminification is a relatively easy task compared to other reverse engineering tasks. It is great that we can automate the easy task, but we still have to wait for a better model to do much more.
Minification works in the same way. A lot of information needed for understanding the code is lost. Getting back that information can be a very demanding task.
Also, it should be noted that the name reconstruction is not a new problem and was already partly solved multiple times before the LLM era. LLM is great in that it can do this without massive retraining, but the reconstruction depends much on the local context (which was how earlier solutions approached the problem), so it doesn't really show its reasoning capability.
I'm not on a PC, so it's not tested.
Of course, they didn't invent Generative pretraining (GP) or transformers (T), but AFAIK they were the first to publicly combine them
this overlooks how they do it. we don't really know. it might be logical reasoning, it might be a very efficient content addressable human-knowledge-in-a-blob-of-numbers lookup table... it doesn't matter if they work, which they do, sometimes scarily well. dismissing their abilities because they 'don't reason' is missing the forest for the trees in that they'd be capable of reasoning if they were able to run sat solvers on their output mid generation.
They can’t sort a list but they can translate languages, for instance, given that a list sorted almost right is wrong but that we will often settle for an almost right translation.
I tried translating a python script to javascript the other day and it was flawless. I would expect it to scale with a bit of hand-railing.
I think there's also a YC company recently focusing on the nasty, big migrations with LLM help.
It does well at writing simple to medium complexity automation scripts around AWS.
If it gets something wrong, I tell it to “verify your answer using the documentation available on the web”
Reminds me of the tool that was provided in older versions of ColdFusion that would "encrypt" your code. It was a very weak algorithm, and didn't take long for someone to write a decrypter. Nevertheless some people didn't like this, because they were using this tool, thinking it was safe for selling their code without giving access to source. (In the late 90s/early 2000s before open source was the overwhelming default)
A website offers some kind of contest which is partly dependent on obfuscated client-side code.
Clever contestants can now throw it into ChatGPT to improve their chances.
Now, it begins.
There’s no denying it. This task is intellectual. Does not involve rote memorization. There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
The llm understands what it is unminifying and it is in general superior to humans in this regard. But only in this specific subject.
> There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
Are you sure about this? These can be easily generated from existing JS to use as a training set, not to mention the enormous amount of non-minified JS which is already used to train it.
The challenge of understanding minified code for a human comes from opaque variable names, awkward loops, minimal whitespacing, etc. These aren't things that a computer has trouble with: it's why we minify in the first place. Attention, as a scheme, should do great with it.
I'd also say there is tons of minified/non-minified code out there. That's the goal of a map file. Given that OpenAI has specifically invested in web browsing and software development, I wouldn't be surprised if part of their training involved minified/unminified data.
They are irrelevant for executing the code, but they're probably pretty relevant for an LLM that is ingesting the code and text and inferring its function based on other examples it has seen. It's definitely more impressive that an LLM can succeed at this without the context of (correct) variable names than with them.
There was a time when winning in Chess was a proof of humans' superior intellect, and then it became just an algorithm. Then Go.
Having a computer multiplying 1000-digit numbers instantly is an example of humans succeeding at multiplying: by creating a tool for that first. Because what else is intellectually succeeding there? It’s not like the computer has created itself.
If one draws a boundary of human intelligence at the skull bone and does not count the tools that this very intelligence is creating and using as mere steps of problem solving process, then one will also have to accept that humans are not intelligent enough to fly into space or do surgery or even cook most of the meals.
GPT-4 has consumed more code than your entire lineage ever will and understands the inherent patterns between code and minified versions. Recognizing the abstract shape of code sans variable names and mapping in some human readable variable names from a similar pattern you've consumed from the vast internet doesn't seem farfetched.
LLMs are very good at a surprisingly broad array of semi-structured-text-to-semi-structured-text transformations, particularly within the manifold of text that is widely available on the internet.
It just so happens that lots of code is widely available on the internet, so LLMs tend to outperform on coding tasks. There’s also lots of marketing copy, general “encyclopedic” knowledge, news, human commentary, and entertainment artifacts (scripts, lyrics, etc). LLMs traverse those spaces handily as well. The capabilities of AI ultimately boil down to their underlying dataset and its quality.
You people are so weird.
I'm simply saying the AI has superior performance to humans on this specific subject. That's all.
Why did you suddenly make this comment of "bowing before your god" when I didn't even mention anything remotely close to that?
I'll tell you why. Because this didn't come from me. It came from YOU. This is what YOU fear most. This is what YOU think about. And your fear of this is what blinds you to the truth.
So too is multiplying _many_ large numbers together.
> The llm understands what it is unminifying and it is in general superior to humans in this regard.
Proof needed. I would grant in terms of speed as very likely to be true. In general, I do not know how accuracy would compare.
After reading through this article, I tried again [0]. It gave me something to understand, though it's obfuscated enough to essentially eval unreadable strings (via the Window object), so it's not enough on its own.
Here was an excerpt of the report I sent to the person:
> For what it’s worth, I dug through the heavily obfuscated JavaScript code and was able to decipher logic that it:
> - Listens for a page load
> - Invokes a facade of calculations which are in theory constant
> - Redirects the page to a malicious site (unk or something)
[0] https://chatgpt.com/share/f51fbd50-8df0-49e9-86ef-fc972bca6b...
Training data would be easy to make in this case. Build tons of free GitHub code with various compilers and train on inverting compilation. This is a case where synthetic training data is appropriate and quite easy to generate.
You could train the decompiler to just invert compilation and then use existing larger code LLMs to do things like add comments.
For legal reasons I bet this will become blocked behavior in major models.
Unfortunately not really. Having the source is a first step, but you also need the rights to use it (read, modify, execute, redistribute the modifications), and only the authors of the code can grant these rights.
Or, going back to the original idea, while the source code produced in such a way might be illegal, it's very likely 'clean' enough to train an LLM on it to be able to help in reproducing such an application.
I agree. I think "AI generating/understanding source code" is a huge red herring. If AI was any good at understanding code, it would just build (or fix) the binary.
And I believe that is how it will turn out: when we really have AI programmers, they will not bother with human-readable code, but will code everything in machine code (and if they are tasked with maintaining an existing system, they will understand it in its entirety, across the SW and HW stack). It's kind of like how diffusion models that generate images don't actually bother with learning drawing techniques.
If anything I expect AI-written programs in the not so distant future to be incomprehensible because they're too short. Something like reading an APL program.
Here is an LLM for x86 to C decompilation: https://github.com/albertan017/LLM4Decompile
It's just renaming variables and functions and inserting line breaks.
[1]: https://arxiv.org/abs/2305.12520
That's not how copyright and licensing works.
You could already break the law and open yourself up to lawsuits and prosecution by stealing intellectual property and violating its owners rights before there were LLMs. They just make it more convenient, not less illegal.
I usually crap on things like chatgpt for being unreliable and hallucinating a lot. But in this particular case, decompilers already usually generate inaccurate code, and it takes a lot of work to fix the decompiled code to make it correct (I speak from experience). So introducing AI here may not be such a huge stretch. Just don't expect an AI/LLM to generate perfectly correct decompiled code and we're good (wishful thinking).