Author of HumanifyJS here! I've created an LLM-based tool specifically for this, which uses LLMs at the AST level to guarantee that the code keeps working after the unminification step:
Would it be difficult to add a 'rename from scratch' feature? I mean a feature that takes normal code (as opposed to minified code) and (1) scrubs all the user's meaningful names, (2) chooses names based on the algorithm and remaining names (ie: the built-in names).
Sometimes when I refactor, I do this manually with an LLM. It is useful in at least two ways: it can reveal better (more canonical) terminology for names (eg: 'antiparallel_line' instead of 'parallel_line_opposite_direction'), and it can also reveal names that could be generalized (eg: 'find_instance_in_list' instead of 'find_animal_instance_in_animals').
What kind of question does it ask the LLM? Giving it a whole function and asking "What should we rename <variable 1>?" repeatedly until everything has been renamed?
Asking it to do it on the whole thing, then parsing the output and checking that the AST still matches?
Does it work with huge files? I'm talking about something like 50k lines.
Edit: I'm currently trying it with a mere 1.2k-line JS file (openai mode), and it's only 70% done after 20 minutes. Even if it theoretically works with a 50k LOC file, I don't think you should try.
It does work with any sized file, although it is quite slow if you're using the OpenAI API. HumanifyJS works by processing each variable name separately, which keeps the context size manageable for an LLM.
I'm currently working on parallelizing the rename process, which should give orders of magnitude faster processing times for large files.
> Large files may take some time to process and use a lot of tokens if you use ChatGPT. For a rough estimate, the tool takes about 2 tokens per character to process a file:
> echo "$((2 * $(wc -c < yourscript.min.js)))"
> So for reference: a minified bootstrap.min.js would take about $0.5 to un-minify using ChatGPT.
> Using humanify local is of course free, but may take more time, be less accurate, and may not be possible with your existing hardware.
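As a rough sanity check of that figure (my numbers, not the README's): bootstrap.min.js is on the order of 60 kB, so at ~2 tokens per character that's roughly 120k tokens; at the few-dollars-per-million-tokens pricing of GPT-4o-class models that lands in the tens of cents, which is consistent with the ~$0.5 estimate (the exact cost depends on the model and on output tokens).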
It uses smart feedback to fix the code when LLMs occasionally hiccup on the code. You could also have a "supervisor LLM" that asserts that the resulting code matches the specification, and gives feedback if it doesn't.
It's a shame this loses one of the most useful aspects of LLM un-minifying - making sure it's actually how a person would write it. E.g. GPT-4o directly gives the exact same code (+contextual comments) with the exception of writing the for loop in the example in a natural way:
for (var index = 0; index < inputLength; index += chunkSize) {
Comparing the ASTs is useful though. Perhaps there's a way to combine the approaches - have the LLM convert, compare the ASTs, have the LLM explain the practical differences (if any) in context of the actual implementation and give it a chance to make any changes "more correct". Still not guaranteed to be perfect but significantly more "natural" resulting code.
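For the comparison step, here's a minimal sketch of what "compare the ASTs" could look like using @babel/parser (an illustration of the idea, not any tool's built-in API): strip location metadata, normalize identifier names in order of appearance, and deep-compare what's left, so two versions that differ only in naming and formatting come out equal.

    import { parse } from "@babel/parser";

    // Recursively copy an AST node, dropping position/comment metadata and
    // replacing identifier names with canonical ones (id0, id1, ...).
    function normalize(node, names = new Map()) {
      if (Array.isArray(node)) return node.map((n) => normalize(n, names));
      if (node && typeof node === "object") {
        const out = {};
        for (const [key, value] of Object.entries(node)) {
          if (["start", "end", "loc", "range", "extra", "comments",
               "leadingComments", "trailingComments", "innerComments"].includes(key)) continue;
          out[key] = normalize(value, names);
        }
        if (out.type === "Identifier") {
          if (!names.has(out.name)) names.set(out.name, `id${names.size}`);
          out.name = names.get(out.name);
        }
        return out;
      }
      return node;
    }

    function sameStructure(codeA, codeB) {
      return JSON.stringify(normalize(parse(codeA).program)) ===
             JSON.stringify(normalize(parse(codeB).program));
    }

    console.log(sameStructure("var a=1;a+2", "var total = 1;\ntotal + 2")); // true

This is naive about scoping (it only tracks name order), but it's enough to flag when the LLM's "more natural" rewrite changed structure rather than just names.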
As someone who has spent countless hours and days deobfuscating malicious Javascript by hand (manually and with some scripts I wrote), your tool is really, really impressive. Running it locally on a high end system with a RTX 4090 and it's great. Good work :)
how do you make an LLM work on the AST level? do you just feed a normal LLM a text representation of the AST, or do you make an LLM where the basic data structure is an AST node rather than a character string (human-language word)?
The frontier models can all work with both source code and ASTs as a result of their standard training.
Knowing this raises the question: which is better to feed an LLM, source code or ASTs?
The answer really depends on the use case; there are tradeoffs. For example, keeping comments intact possibly gives the model hints to reason better. On the other hand, it can be argued that a pure AST has less noise for the model to be confused by.
There are other tradeoffs as well. For example, any analysis relating to coding styles would require the full source code.
On a structural level it's exactly 1:1: HumanifyJS only does renames, no refactoring. It may come up with better names for variables than the original code though.
Came here to say Humanify is awesome, both as a specific tool and, in my opinion, as a really great way to think about how to get the most out of inherently high-temperature activities like modern decoder nucleus sampling.
JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy. It would of course be tedious to do manually in general, but the transformations themselves are fairly limited, so it is possible to read minified code with only some notes to track mangled identifiers.
A more general unminification or unobfuscation still seems to be an open problem. I have written a handful of intentionally obfuscated programs in the past, and in my experience ChatGPT couldn't understand them even at the surface level. For example, a gist for my 160-byte-long Brainfuck interpreter in C had a comment trying to use GPT-4 to explain the code [1], but the "clarified version" bore zero similarity to the original code...
> JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy.
Just because a task is simple doesn't mean its inverse need be. Examples:
- multiplication / prime factorization
- deriving / integrating
- remembering the past / predicting the future
Code unobfuscation is clearly one of those difficult inverse problems, as it can be easily exacerbated by any of the following problems:
- bugs
- unused or irrelevant routines
- incorrect implementations that incidentally give the right results
In that sense, it would be fortunate if ChatGPT could give decent results at unobfuscating code, as there is no a priori expectation that it should be able to do so. It's good that you've also checked ChatGPT's code unobfuscation capabilities on a more difficult problem, but I think you've only discovered an upper limit. I wouldn't consider the example in the OP to be trivial.
Of course, it is not generalizable! In my experience though, most minifiers do only the following:
- Whitespace removal, which is trivially invertible.
- Comment removal, which we never expect to recover via unminification.
- Renaming to shorter names, which is tedious to track but still mechanical. And most minifiers have little understanding of underlying types anyway, so they are usually very conservative and rarely reuse the same mangled identifier for multiple uses. (Google Closure Compiler is a significant counterexample here, but it is also known to be much slower.)
- Constant folding and inlining, which is annoying but can be still tracked. Again, most minifiers are limited in their reasoning to do extensive constant folding and inlining.
- Language-specific transformations, like turning `a; b; c;` into `a, b, c;` and `if (a) b;` into `a && b;` whenever possible. They will be hard to understand if you don't know in advance, but there aren't too many of them anyway.
As a result, minified code still remains comparably human-readable with some note taking and perseverance. And since these transformations are mostly local, I would expect LLMs can pick them up on their own as well.
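To make the list above concrete, here's a hand-made before/after in the spirit of those transformations (a real minifier's output will differ in details):

    // Original:
    function sendGreeting(user) {
      // Skip users who opted out
      if (user.wantsEmail) {
        sendEmail(user.address, "Hello, " + user.name);
      }
    }

    // Typical minified form: whitespace and the comment are gone, the local
    // parameter is mangled, and `if (a) b;` became `a && b;` -- but property
    // names and the global `sendEmail` are left alone.
    function s(u){u.wantsEmail&&sendEmail(u.address,"Hello, "+u.name)}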
As a point of order, Code Minification != Code Obfuscation.
Minification does tend to obfuscate as a side effect, but it is not the goal, so reversing minification is much easier. Obfuscation, on the other hand, can minify code, but crucially that isn't where it starts from. As the goals of minification and obfuscation differ, reversing them takes different effort, and I'd much rather attempt to reverse minification than obfuscation.
I'd also readily believe there are hundreds or thousands of examples online of reverse code minification (or "here is code X, here is code X _after_ minification") that LLMs have ingested in their training data.
> JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy.
This is stated as if it's a truism, but I can't understand how you can actually believe this. Converting `let userSignedInTimestamp = new Date()` to `let x = new Date()` is trivial, but going the other way probably requires reading and understanding the rest of the surrounding code to see in what contexts `x` is being used. Also, the rest of the code is also minified, making this even more challenging. Even if you do all that right, at best it's still a lossy conversion, since the name of the variable could capture characteristics that aren't explicitly outlined in the code at all.
You are technically correct, but I think you should try some reverse engineering to see that it is usually possible to reconstruct much of it in spite of the number of transformations made. I do understand that this fact might be hard to believe without any prior experience.
EDIT: I think I got why some comments complain I downplayed the power of LLMs here. I never meant to; I wanted to say that unminification is a relatively easy task compared to other reverse engineering tasks. It is great that we can automate the easy task, but we still have to wait for a better model to do much more.
Because of how trivial that step is, it's likely pretty easy to just take lots of code and minify it. Then you have the training data you need to learn to generate full code from minified code. If your goal is to generate additional useful training data for your LLM, it could make sense to actually do that.
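A minimal sketch of that data-generation idea using terser (the minify options are real terser API; everything about where the files come from and how pairs are stored is assumed for illustration):

    import { minify } from "terser";
    import { readFile } from "node:fs/promises";

    // Turn one readable JS file into a (minified -> original) training pair.
    async function makePair(path) {
      const original = await readFile(path, "utf8");
      const { code: minified } = await minify(original, {
        mangle: true,                  // rename locals to short identifiers
        compress: true,                // constant folding, `if`->`&&` rewrites, ...
        format: { comments: false },   // drop comments
      });
      return { input: minified, target: original };
    }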
And I can’t understand why any reasonably intelligent human feels the need to be this abrasive. You could educate but instead you had to be condescending.
Converting a picture from color to black and white is a fairly simple task. Getting back the original in color is not easy. This is of course due to data lost in the process.
Minification works in the same way. A lot of information needed for understanding the code is lost. Getting back that information can be a very demanding task.
But it is not much different from reading through badly documented code without any comments or meaningful names. In fact, much of the code that gets minified is not that bad, so it is often possible to infer the original code just from its structure. It is still not a trivial task, but I think my comment never implied that.
The act of reducing the length of variable names by replacing something descriptive (like "timeFactor") with something much shorter ("i") may be mechanical and simple, but it is destructive, and reversing that is not relatively easy; in fact, it's impossible to do without a fairly sophisticated understanding of what the code does. That's what the LLM did for this; which isn't exactly surprising, but it is cool; being so immediately dismissive isn't cool.
I never meant to be dismissive, in fact my current job is to build a runtime for ML accelerator! I rather wanted to show that unminification is much easier than unobfuscation, and that the SOTA model is yet to do the latter.
Also, it should be noted that the name reconstruction is not a new problem and was already partly solved multiple times before the LLM era. LLM is great in that it can do this without massive retraining, but the reconstruction depends much on the local context (which was how earlier solutions approached the problem), so it doesn't really show its reasoning capability.
That's much better in that most of the original code remains present and the comments are not that far off, but its understanding of global variables is utterly wrong (to be expected though, as many of them serve multiple purposes).
Yep, I've tried to use LLMs to disassemble and decompile binaries (giving them the hex bytes as plaintext), they do OK on trivial/artificial cases but quickly fail after that.
The generative pretrained transformer was invented by OpenAI, and it seems reasonable for a company to use the name it gave to its invention in its branding.
Of course, they didn't invent Generative pretraining (GP) or transformers (T), but AFAIK they were the first to publicly combine them
It’s not only their core strength — it’s what transformers were designed to do and, arguably, it’s all they can do. Any other supposed ability to reason or even retain knowledge (rather than simply regurgitate text without ‘understanding’ its intended meaning) is just a side effect of this superhuman ability.
I see your point, but I think there's more to it. It's kind of like saying "all humans can do is perceive and produce sound, any other ability is just a side-effect". We might be focusing too much on their mechanism for "perception" and overlooking other capabilities they've developed.
this overlooks how they do it. we don't really know. it might be logical reasoning, it might be a very efficient content addressable human-knowledge-in-a-blob-of-numbers lookup table... it doesn't matter if they work, which they do, sometimes scarily well. dismissing their abilities because they 'don't reason' is missing the forest for the trees in that they'd be capable of reasoning if they were able to run sat solvers on their output mid generation.
Particularly those that are basically linear, that don’t involve major changes in the order of things or a deep consideration of relationships between things.
They can’t sort a list but they can translate languages, for instance, given that a list sorted almost right is wrong but that we will often settle for an almost right translation.
One potential benefit should be that, with the right tooling around it, it should be able to translate your code base to a different language and/or framework more or less at the push of a button. So if a team is wondering whether it would be worth it to switch a big chunk of the code base from Python to Elixir, they don't have to wonder anymore.
I tried translating a python script to javascript the other day and it was flawless. I would expect it to scale with a bit of hand-railing.
ChatGPT is trained well enough on all things AWS that it can do a decent job translating Python based SDK code to Node and other languages, translate between CloudFormation/Terraform/CDK (in various languages).
It does well at writing simple to medium complexity automation scripts around AWS.
If it gets something wrong, I tell it to “verify your answer using the documentation available on the web”
The problem is that the use case has to be one where you don't care about the risk of hallucinations, or where you can validate the output without already having the data in a useful format. Plus you need to lack the knowledge/skill to do it more quickly using awk/python/perl/whatever.
I think text transformation is a sufficiently predictable task that one could make a transformer that completely avoids hallucinations. Most LLMs run at high temperatures, which introduces randomness, and therefore hallucinations, into the result.
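For what it's worth, the sampling randomness is a knob you already control; here's a minimal sketch with the OpenAI Node SDK (the model name is just an example), pinning temperature to 0 for near-deterministic decoding -- though that alone doesn't rule out hallucinations:

    import OpenAI from "openai";

    const client = new OpenAI();
    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini",
      temperature: 0, // greedy-ish decoding: minimal sampling randomness
      messages: [
        { role: "user", content: "Rename the variables in: let x = new Date();" },
      ],
    });
    console.log(completion.choices[0].message.content);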
I'm sure there's some number greater than zero of developers who are upset because they use minification as a means of obfuscation.
Reminds me of the tool that was provided in older versions of ColdFusion that would "encrypt" your code. It was a very weak algorithm, and didn't take long for someone to write a decrypter. Nevertheless some people didn't like this, because they were using this tool, thinking it was safe for selling their code without giving access to source. (In the late 90s/early 2000s before open source was the overwhelming default)
This is an example of superior intellectual performance to humans.
There’s no denying it. This task is intellectual. Does not involve rote memorization. There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
The llm understands what it is unminifying and it is in general superior to humans in this regard. But only in this specific subject.
> There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
Are you sure about this? These can be easily generated from existing JS to use as a training set, not to mention the enormous amount of non-minified JS which is already used to train it.
I'm bullish on AI, but I'm not convinced this is an example of what you're describing.
The challenge of understanding minified code for a human comes from opaque variable names, awkward loops, minimal whitespacing, etc. These aren't things that a computer has trouble with: it's why we minify in the first place. Attention, as a scheme, should do great with it.
I'd also say there is tons of minified/non-minified code out there. That's the goal of a map file. Given that OpenAI has specifically invested in web browsing and software development, I wouldn't be surprised if part of their training involved minified/unminified data.
> These aren't things that a computer has trouble with
They are irrelevant for executing the code, but they're probably pretty relevant for an LLM that is ingesting the code and text and inferring its function based on other examples it has seen. It's definitely more impressive that an LLM can succeed at this without the context of (correct) variable names than with them.
Minification and unminification are heuristic processes, not algorithmic ones. It is akin to decompiling code or reverse engineering. It's a step beyond just your typical AI you see in a calculator.
I don’t claim expertise in AI or understanding intelligence, but could we also say that a pocket calculator really understands arithmetic and has superior intellectual performance compared to humans?
Why not count the fact that humans created a tool to help themselves at unminifying towards human score?
Having a computer multiplying 1000-digit numbers instantly is an example of humans succeeding at multiplying: by creating a tool for that first. Because what else is intellectually succeeding there? It’s not like the computer has created itself.
If one draws a boundary of human intelligence at the skull bone and does not count the tools that this very intelligence is creating and using as mere steps of problem solving process, then one will also have to accept that humans are not intelligent enough to fly into space or do surgery or even cook most of the meals.
> Does not involve rote memorization. There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
GPT-4 has consumed more code than your entire lineage ever will and understands the inherent patterns between code and minified versions. Recognizing the abstract shape of code sans variable names and mapping in some human readable variable names from a similar pattern you've consumed from the vast internet doesn't seem farfetched.
I think I’d agree with your statement, in the same sense that a chess simulator or AlphaGo are superior to human intellect for their specific problem spaces.
LLMs are very good at a surprisingly broad array of semi-structured-text-to-semi-structured-text transformations, particularly within the manifold of text that is widely available on the internet.
It just so happens that lots of code is widely available on the internet, so LLMs tend to outperform on coding tasks. There’s also lots of marketing copy, general “encyclopedic” knowledge, news, human commentary, and entertainment artifacts (scripts, lyrics, etc). LLMs traverse those spaces handily as well. The capabilities of AI ultimately boil down to their underlying dataset and its quality.
You're in denial. Nobody is worshipping a god here.
I'm simply saying the AI has superior performance to humans on this specific subject. That's all.
Why did you suddenly make this comment of "bowing before your god" when I didn't even mention anything remotely close to that?
I'll tell you why. Because this didn't come from me. It came from YOU. This is what YOU fear most. This is what YOU think about. And your fear of this is what blinds you to the truth.
That's interesting. It's gotten a lot better I guess. A little over a year ago, I tried to use GPT to assist me in deobfuscating malicious code (someone emailed me asking for help with their hacked WP site via custom plugin). I got much further just stepping through the code myself.
After reading through this article, I tried again [0]. It gave me something to understand, though it's obfuscated enough to essentially eval unreadable strings (via the Window object), so it's not enough on its own.
Here was an excerpt of the report I sent to the person:
> For what it’s worth, I dug through the heavily obfuscated JavaScript code and was able to decipher logic that it:
> - Listens for a page load
> - Invokes a facade of calculations which are in theory constant
> - Redirects the page to a malicious site (unk or something)
Anyone working on decompiler LLMs? Seems like we could render all code open source.
Training data would be easy to make in this case. Build tons of free GitHub code with various compilers and train on inverting compilation. This is a case where synthetic training data is appropriate and quite easy to generate.
You could train the decompiler to just invert compilation and then use existing larger code LLMs to do things like add comments.
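A rough sketch of that synthetic-data pipeline (standard gcc/objdump invocations; the dataset layout and file names are assumptions for illustration):

    import { execSync } from "node:child_process";
    import { readFileSync, writeFileSync } from "node:fs";

    // Compile one C translation unit and pair its disassembly with the source.
    function makeDecompilePair(cFile) {
      execSync(`gcc -O2 -c ${cFile} -o /tmp/unit.o`);
      const disassembly = execSync("objdump -d /tmp/unit.o").toString();
      return { input: disassembly, target: readFileSync(cFile, "utf8") };
    }

    writeFileSync("pair.json", JSON.stringify(makeDecompilePair("example.c"), null, 2));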
The potential implications of this are huge. Not just open sourcing, but imagine easily decompiling and modifying proprietary apps to fix bugs or add features. This could be a huge unlock, especially for long dead programs.
For legal reasons I bet this will become blocked behavior in major models.
I've never seen a law forbidding decompiling programs.
But some programs' license agreements forbid decompiling them. Further, you still don't have any rights to the source code. It depends on the license...
Unfortunately not really. Having the source is a first step, but you also need the rights to use it (read, modify, execute, redistribute the modifications), and only the authors of the code can grant these rights.
Doesn't it count as 'clean room' reverse engineering? Or alternatively, we could develop an LLM that's trained on the outputs and side-effects of any given function and learns to reproduce the source code from that.
Or, going back to the original idea, while the source code produced in such a way might be illegal, it's very likely 'clean' enough to train an LLM on it to be able to help in reproducing such an application.
> Seems like we could render all code open source.
I agree. I think "AI generating/understanding source code" is a huge red herring. If AI was any good at understanding code, it would just build (or fix) the binary.
And I believe that is how it will turn out: when we really have AI programmers, they will not bother with human-readable code, but will code everything in machine code (and if they are tasked with maintaining an existing system, they will understand it in its entirety, across the SW and HW stack). It's kind of like how diffusion models that generate images don't actually bother with learning drawing techniques.
Why wouldn't AIs benefit from using abstractions? At the very least it saves tokens. Fewer tokens means less time spent solving a problem, which means more problem solving throughput. That is true for machines and people alike.
If anything I expect AI-written programs in the not so distant future to be incomprehensible because they're too short. Something like reading an APL program.
There was a paper about this at CGO earlier this year [1]. Correctness is a problem that is hard to solve, though; 50% accuracy might not be enough for serious use cases, especially given that the relation to the original input for manual intervention is hard to preserve.
You could already break the law and open yourself up to lawsuits and prosecution by stealing intellectual property and violating its owners rights before there were LLMs. They just make it more convenient, not less illegal.
I think there's actually some potential here, considering LLMs are already very good at translating text between human languages. I don't think LLMs on their own would be very good, but a specially trained AI model perhaps, such as those trained for protein folding. I think what an LLM could do best is generate better decompiled code, giving better names to symbols, and generating code in a style a human is more likely to write.
I usually crap on things like chatgpt for being unreliable and hallucinating a lot. But in this particular case, decompilers already usually generate inaccurate code, and it takes a lot of work to fix the decompiled code to make it correct (I speak from experience). So introducing AI here may not be such a huge stretch. Just don't expect an AI/LLM to generate perfectly correct decompiled code and we're good (wishful thinking).
https://github.com/jehna/humanify
Sometimes when I refactor, I do this manually with an LLM. It is useful in at least two ways: it can reveal better (more canonical) terminology for names (eg: 'antiparallel_line' instead of 'parallel_line_opposite_direction'), and it can also reveal names that could be generalized (eg: 'find_instance_in_list' instead of 'find_animal_instance_in_animals').
1. I ask the LLM to describe the meaning of the variable in the surrounding code
2. Given just the description, I ask the LLM to come up with the best possible variable name
You can check the source code for the actual prompts:
https://github.com/jehna/humanify/blob/eeff3f8b4f76d40adb116...
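For a rough idea of the shape of that flow, here's a minimal sketch with the OpenAI Node SDK -- the prompt wording and model choice here are my own assumptions, not HumanifyJS's actual prompts (those are in the linked source):

    import OpenAI from "openai";

    const client = new OpenAI();

    async function suggestName(variableName, surroundingCode) {
      // Step 1: describe what the variable is used for in its context.
      const describe = await client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{
          role: "user",
          content: `Describe in one sentence what the variable \`${variableName}\` is used for in this code:\n\n${surroundingCode}`,
        }],
      });
      const description = describe.choices[0].message.content;

      // Step 2: name it from the description alone.
      const named = await client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{
          role: "user",
          content: `Suggest a concise, descriptive JavaScript variable name for: "${description}". Reply with the name only.`,
        }],
      });
      return named.choices[0].message.content.trim();
    }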
I'm still waiting for the AST level version control tbh
Asking it to do it on the whole thing, then parsing the output and checking that the AST still matches?
1. It asks the LLM to write a description of what the variable does
2. It asks for a good variable name based on the description from 1.
3. It uses a custom Babel plugin to do a scope-aware rename
This way the LLM only decides the name, but the actual renaming is done with traditional and reliable tools.
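A minimal sketch of what step 3 can look like with Babel (not HumanifyJS's actual plugin, just an illustration of why the rename itself is reliable): Babel's scope API renames only the references bound to that exact variable, so shadowed names elsewhere are untouched.

    import { parse } from "@babel/parser";
    import traverse from "@babel/traverse";
    import generate from "@babel/generator";

    // Rename a binding visible at program scope, leaving shadowing bindings alone.
    function renameBinding(code, oldName, newName) {
      const ast = parse(code);
      traverse(ast, {
        Program(path) {
          if (path.scope.hasBinding(oldName)) path.scope.rename(oldName, newName);
        },
      });
      return generate(ast).code;
    }

    console.log(renameBinding("var a = 1; function f(a) { return a + 1; }", "a", "counter"));
    // Only the outer `a` becomes `counter`; the parameter `a` inside `f` is a
    // different binding and keeps its name.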
Edit: I'm currently trying it with a mere 1.2k-line JS file (openai mode), and it's only 70% done after 20 minutes. Even if it theoretically works with a 50k LOC file, I don't think you should try.
I'm currently working on parallelizing the rename process, which should give orders of magnitude faster processing times for large files.
> Large files may take some time to process and use a lot of tokens if you use ChatGPT. For a rough estimate, the tool takes about 2 tokens per character to process a file:
> echo "$((2 * $(wc -c < yourscript.min.js)))"
> So for reference: a minified bootstrap.min.js would take about $0.5 to un-minify using ChatGPT.
> Using humanify local is of course free, but may take more time, be less accurate, and may not be possible with your existing hardware.
https://arxiv.org/pdf/2405.15793
It uses smart feedback to fix the code when LLMs occasionally hiccup on the code. You could also have a "supervisor LLM" that asserts that the resulting code matches the specification, and gives feedback if it doesn't.
Making the code, fully commenting it and also giving an example after that might cost three times as much
Knowing this raises the question: which is better to feed an LLM, source code or ASTs?
The answer really depends on the use case; there are tradeoffs. For example, keeping comments intact possibly gives the model hints to reason better. On the other hand, it can be argued that a pure AST has less noise for the model to be confused by.
There are other tradeoffs as well. For example, any analysis relating to coding styles would require the full source code.
Babel first parses the code to AST, and for each variable the tool:
1. Gets the variable name and surrounding scope as code
2. Asks the LLM to come up with a good name for the given variable name, by looking at the scope where the variable is
3. Uses Babel to make the context-aware rename in the AST based on the LLM's response
API access is needed for the ChatGPT mode, as there are many round trips and it uses advanced API-only tricks to force the LLM output.
For small scripts I've found the output to be very similar between small local models and GPT-4o (judging by a human eye).
+1
A more general unminification or unobfuscation still seems to be an open problem. I have written a handful of intentionally obfuscated programs in the past, and in my experience ChatGPT couldn't understand them even at the surface level. For example, a gist for my 160-byte-long Brainfuck interpreter in C had a comment trying to use GPT-4 to explain the code [1], but the "clarified version" bore zero similarity to the original code...
[1] https://gist.github.com/lifthrasiir/596667#gistcomment-47512...
Just because a task is simple doesn't mean its inverse need be. Examples:
- multiplication / prime factorization
- deriving / integrating
- remembering the past / predicting the future
Code unobfuscation is clearly one of those difficult inverse problems, as it can be easily exacerbated by any of the following problems:
- bugs
- unused or irrelevant routines
- incorrect implementations that incidentally give the right results
In that sense, it would be fortunate if ChatGPT could give decent results at unobfuscating code, as there is no a priori expectation that it should be able to do so. It's good that you've also checked ChatGPT's code unobfuscation capabilities on a more difficult problem, but I think you've only discovered an upper limit. I wouldn't consider the example in the OP to be trivial.
Of course, it is not generalizable! In my experience though, most minifiers do only the following:
- Whitespace removal, which is trivially invertible.
- Comment removal, which we never expect to recover via unminification.
- Renaming to shorter names, which is tedious to track but still mechanical. And most minifiers have little understanding of underlying types anyway, so they are usually very conservative and rarely reuse the same mangled identifier for multiple uses. (Google Closure Compiler is a significant counterexample here, but it is also known to be much slower.)
- Constant folding and inlining, which is annoying but can be still tracked. Again, most minifiers are limited in their reasoning to do extensive constant folding and inlining.
- Language-specific transformations, like turning `a; b; c;` into `a, b, c;` and `if (a) b;` into `a && b;` whenever possible. They will be hard to understand if you don't know in advance, but there aren't too many of them anyway.
As a result, minified code still remains comparably human-readable with some note taking and perseverance. And since these transformations are mostly local, I would expect LLMs can pick them up on their own as well.
(But why? Because I do inspect such programs fairly regularly, for example for comments like https://news.ycombinator.com/item?id=39066262)
Minification does tend to obfuscate as a side effect, but it is not the goal, so reversing minification is much easier. Obfuscation, on the other hand, can minify code, but crucially that isn't where it starts from. As the goals of minification and obfuscation differ, reversing them takes different effort, and I'd much rather attempt to reverse minification than obfuscation.
I'd also readily believe there are hundreds or thousands of examples online of reverse code minification (or "here is code X, here is code X _after_ minification") that LLMs have ingested in their training data.
This is stated as if it's a truism, but I can't understand how you can actually believe this. Converting `let userSignedInTimestamp = new Date()` to `let x = new Date()` is trivial, but going the other way probably requires reading and understanding the rest of the surrounding code to see in what contexts `x` is being used. Also, the rest of the code is also minified, making this even more challenging. Even if you do all that right, at best it's still a lossy conversion, since the name of the variable could capture characteristics that aren't explicitly outlined in the code at all.
EDIT: I think I got why some comments complain I downplayed the power of LLMs here. I never meant to; I wanted to say that unminification is a relatively easy task compared to other reverse engineering tasks. It is great that we can automate the easy task, but we still have to wait for a better model to do much more.
Minification works in the same way. A lot of information needed for understanding the code is lost. Getting back that information can be a very demanding task.
Also, it should be noted that the name reconstruction is not a new problem and was already partly solved multiple times before the LLM era. LLM is great in that it can do this without massive retraining, but the reconstruction depends much on the local context (which was how earlier solutions approached the problem), so it doesn't really show its reasoning capability.
I'm not on a PC, so it's not tested.
Of course, they didn't invent Generative pretraining (GP) or transformers (T), but AFAIK they were the first to publicly combine them
this overlooks how they do it. we don't really know. it might be logical reasoning, it might be a very efficient content addressable human-knowledge-in-a-blob-of-numbers lookup table... it doesn't matter if they work, which they do, sometimes scarily well. dismissing their abilities because they 'don't reason' is missing the forest for the trees in that they'd be capable of reasoning if they were able to run sat solvers on their output mid generation.
They can’t sort a list but they can translate languages, for instance, given that a list sorted almost right is wrong but that we will often settle for an almost right translation.
I tried translating a python script to javascript the other day and it was flawless. I would expect it to scale with a bit of hand-railing.
I think there's also a YC company recently focusing on the nasty, big migrations with LLM help.
It does well at writing simple to medium complexity automation scripts around AWS.
If it gets something wrong, I tell it to “verify your answer using the documentation available on the web”
Reminds me of the tool that was provided in older versions of ColdFusion that would "encrypt" your code. It was a very weak algorithm, and didn't take long for someone to write a decrypter. Nevertheless some people didn't like this, because they were using this tool, thinking it was safe for selling their code without giving access to source. (In the late 90s/early 2000s before open source was the overwhelming default)
A website offers some kind of contest which is partly dependent on obfuscated client-side code.
Clever contestants can now throw it into ChatGPT to improve their chances.
Now, it begins.
There’s no denying it. This task is intellectual. Does not involve rote memorization. There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
The llm understands what it is unminifying and it is in general superior to humans in this regard. But only in this specific subject.
> There are not tons and tons of data pairs on the web of minimized code and unminified code for llms to learn from.
Are you sure about this? These can be easily generated from existing JS to use as a training set, not to mention the enormous amount of non-minified JS which is already used to train it.
The challenge of understanding minified code for a human comes from opaque variable names, awkward loops, minimal whitespacing, etc. These aren't things that a computer has trouble with: it's why we minify in the first place. Attention, as a scheme, should do great with it.
I'd also say there is tons of minified/non-minified code out there. That's the goal of a map file. Given that OpenAI has specifically invested in web browsing and software development, I wouldn't be surprised if part of their training involved minified/unminified data.
They are irrelevant for executing the code, but they're probably pretty relevant for an LLM that is ingesting the code and text and inferring its function based on other examples it has seen. It's definitely more impressive that an LLM can succeed at this without the context of (correct) variable names than with them.
There was a time when winning in Chess was a proof of humans' superior intellect, and then it became just an algorithm. Then Go.
Having a computer multiplying 1000-digit numbers instantly is an example of humans succeeding at multiplying: by creating a tool for that first. Because what else is intellectually succeeding there? It’s not like the computer has created itself.
If one draws a boundary of human intelligence at the skull bone and does not count the tools that this very intelligence is creating and using as mere steps of problem solving process, then one will also have to accept that humans are not intelligent enough to fly into space or do surgery or even cook most of the meals.
GPT-4 has consumed more code than your entire lineage ever will and understands the inherent patterns between code and minified versions. Recognizing the abstract shape of code sans variable names and mapping in some human readable variable names from a similar pattern you've consumed from the vast internet doesn't seem farfetched.
LLMs are very good at a surprisingly broad array of semi-structured-text-to-semi-structured-text transformations, particularly within the manifold of text that is widely available on the internet.
It just so happens that lots of code is widely available on the internet, so LLMs tend to outperform on coding tasks. There’s also lots of marketing copy, general “encyclopedic” knowledge, news, human commentary, and entertainment artifacts (scripts, lyrics, etc). LLMs traverse those spaces handily as well. The capabilities of AI ultimately boil down to their underlying dataset and its quality.
You people are so weird.
I'm simply saying the AI has superior performance to humans on this specific subject. That's all.
Why did you suddenly make this comment of "bowing before your god" when I didn't even mention anything remotely close to that?
I'll tell you why. Because this didn't come from me. It came from YOU. This is what YOU fear most. This is what YOU think about. And your fear of this is what blinds you to the truth.
So too is multiplying _many_ large numbers together.
> The llm understands what it is unminifying and it is in general superior to humans in this regard.
Proof needed. I would grant in terms of speed as very likely to be true. In general, I do not know how accuracy would compare.
After reading through this article, I tried again [0]. It gave me something to understand, though it's obfuscated enough to essentially eval unreadable strings (via the Window object), so it's not enough on its own.
Here was an excerpt of the report I sent to the person:
> For what it’s worth, I dug through the heavily obfuscated JavaScript code and was able to decipher logic that it:
> - Listens for a page load
> - Invokes a facade of calculations which are in theory constant
> - Redirects the page to a malicious site (unk or something)
[0] https://chatgpt.com/share/f51fbd50-8df0-49e9-86ef-fc972bca6b...
Training data would be easy to make in this case. Build tons of free GitHub code with various compilers and train on inverting compilation. This is a case where synthetic training data is appropriate and quite easy to generate.
You could train the decompiler to just invert compilation and then use existing larger code LLMs to do things like add comments.
For legal reasons I bet this will become blocked behavior in major models.
Unfortunately not really. Having the source is a first step, but you also need the rights to use it (read, modify, execute, redistribute the modifications), and only the authors of the code can grant these rights.
Or, going back to the original idea, while the source code produced in such a way might be illegal, it's very likely 'clean' enough to train an LLM on it to be able to help in reproducing such an application.
I agree. I think "AI generating/understanding source code" is a huge red herring. If AI was any good at understanding code, it would just build (or fix) the binary.
And I believe that is how it will turn out: when we really have AI programmers, they will not bother with human-readable code, but will code everything in machine code (and if they are tasked with maintaining an existing system, they will understand it in its entirety, across the SW and HW stack). It's kind of like how diffusion models that generate images don't actually bother with learning drawing techniques.
If anything I expect AI-written programs in the not so distant future to be incomprehensible because they're too short. Something like reading an APL program.
Here is an LLM for x86 to C decompilation: https://github.com/albertan017/LLM4Decompile
It's just renaming variables and functions and inserting line breaks.
[1]: https://arxiv.org/abs/2305.12520
That's not how copyright and licensing works.
You could already break the law and open yourself up to lawsuits and prosecution by stealing intellectual property and violating its owners rights before there were LLMs. They just make it more convenient, not less illegal.
I usually crap on things like chatgpt for being unreliable and hallucinating a lot. But in this particular case, decompilers already usually generate inaccurate code, and it takes a lot of work to fix the decompiled code to make it correct (I speak from experience). So introducing AI here may not be such a huge stretch. Just don't expect an AI/LLM to generate perfectly correct decompiled code and we're good (wishful thinking).