Show HN: I wrote a Java decompiler in pure C language

gibibit · 3 months ago

I am always curious how different C programs decide how to manage memory.

In this case there is a custom string library. Functions return owned heap-allocated strings.

However, I think there's a problem where static strings are used interchangeably with heap-allocated strings, such as in the function `string class_simple_name(string full)` ( https://github.com/neocanable/garlic/blob/72357ddbcffdb75641... )

Sometimes it returns a static string like `g_str_int` and sometimes a newly heap-allocated string, such as returned by `class_type_array_name(g_str_int, depth)`.

Callers have no way to properly release the memory allocated by this function.
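One way to make the two cases distinguishable, if strings were freed individually rather than by a pool, is to return an ownership flag alongside the pointer. This is a minimal sketch under that assumption; `str_result`, `int_type_name`, and `str_release` are hypothetical names for illustration and are not taken from garlic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: a string plus a flag saying whether the
 * caller is responsible for freeing it. */
typedef struct {
    const char *data;
    bool owned;   /* true: heap-allocated, release with str_release() */
} str_result;

static const char *const STR_INT = "int";  /* static storage, never freed */

/* Returns "int" for depth 0 (static) or "int[][]..." for depth > 0 (heap). */
str_result int_type_name(int depth)
{
    if (depth <= 0)
        return (str_result){ STR_INT, false };

    size_t base = strlen(STR_INT);
    char *buf = malloc(base + 2 * (size_t)depth + 1);
    if (buf == NULL)
        return (str_result){ NULL, false };

    memcpy(buf, STR_INT, base);
    for (int i = 0; i < depth; i++)
        memcpy(buf + base + 2 * (size_t)i, "[]", 2);
    buf[base + 2 * (size_t)depth] = '\0';
    return (str_result){ buf, true };
}

/* Safe for both cases: frees only what the callee allocated. */
void str_release(str_result *s)
{
    assert(s != NULL);
    if (s->owned)
        free((void *)s->data);
    s->data = NULL;
    s->owned = false;
}
```

The caller calls `str_release` unconditionally and never needs to know which branch produced the string. The simpler alternative is to always return a heap copy, at the cost of an extra allocation for the static case.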

neocanable · 3 months ago

In multi-threaded mode, each thread creates its own memory pool. In single-threaded mode, a global memory pool is used. You can refer to https://github.com/neocanable/garlic/blob/72357ddbcffdb75641.... The x_alloc and x_alloc_in there indicate where the memory is allocated. When each task ends, the memory allocated in its memory pool is released, and the cycle repeats.
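The general idea of a pool that is released at the end of each task can be sketched as follows. This is not garlic's implementation: `mem_pool`, `pool_alloc`, and `pool_release` are hypothetical names, and the sketch simply links every allocation into a list so that all of them can be freed in one call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header placed in front of every allocation. The max_align_t member
 * keeps the payload that follows it suitably aligned for any type. */
typedef union pool_block {
    union pool_block *next;
    max_align_t align;
} pool_block;

typedef struct {
    pool_block *head;   /* most recent allocation */
    size_t count;       /* number of live allocations */
} mem_pool;

/* Allocates size bytes and records the block in the pool. */
void *pool_alloc(mem_pool *pool, size_t size)
{
    assert(pool != NULL);
    if (size > SIZE_MAX - sizeof(pool_block))
        return NULL;
    pool_block *b = malloc(sizeof(pool_block) + size);
    if (b == NULL)
        return NULL;
    b->next = pool->head;
    pool->head = b;
    pool->count++;
    return b + 1;       /* payload starts right after the header */
}

/* Frees everything allocated from the pool; the pool can then be reused. */
void pool_release(mem_pool *pool)
{
    assert(pool != NULL);
    pool_block *b = pool->head;
    while (b != NULL) {
        pool_block *next = b->next;
        free(b);
        b = next;
    }
    pool->head = NULL;
    pool->count = 0;
}
```

With one `mem_pool` per worker thread, individual strings never need to be freed, which is why mixing static and pool-allocated strings is harmless under this scheme.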

norir · 3 months ago

Many command line tools do not need memory management at all, at least to a first approximation. Free nothing and let the OS clean up on process exit. Most libraries can either use an arena internally and copy any values that get returned to the user to the heap at boundaries, or require the user to externally create and destroy the arena. This can be made ergonomic with one macro that injects an arena argument into function defs and another that replaces malloc by bumping the local arena data pointer that the prior macro injected.
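A rough sketch of the two macros described above, assuming a caller-provided buffer that is aligned for any type. The names `arena`, `arena_bump`, `ARENA_FN`, and `ALLOC` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char *base;  /* caller-provided, max-aligned storage */
    size_t used;
    size_t cap;
} arena;

/* Bump-allocates size bytes at the next max-aligned offset; NULL when full. */
static void *arena_bump(arena *a, size_t size)
{
    assert(a != NULL);
    const size_t align = _Alignof(max_align_t);
    size_t start = (a->used + align - 1) & ~(align - 1);
    if (start > a->cap || size > a->cap - start)
        return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Injects a leading `arena *arena_` parameter into a function definition.
 * As written it needs at least one further parameter. */
#define ARENA_FN(ret, name, ...) ret name(arena *arena_, __VA_ARGS__)

/* Stands in for malloc inside ARENA_FN functions: bumps the injected arena. */
#define ALLOC(size) arena_bump(arena_, (size))

/* Example function written against the macros: copies src into the arena. */
ARENA_FN(char *, str_dup, const char *src)
{
    size_t n = strlen(src) + 1;
    char *copy = ALLOC(n);
    if (copy != NULL)
        memcpy(copy, src, n);
    return copy;
}
```

A call site looks like `str_dup(&a, "text")`. Nothing is freed individually; the owner of the buffer resets `used` to 0 or discards the buffer when the work is done.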

1718627440 · 3 months ago

That might be true, but leaking is neither the most critical nor the hardest to find memory management issue, and good luck trying to adapt or even run valgrind with a codebase that mindlessly allocates and leaks everywhere.

IshKebab · 3 months ago

Interesting. Someone should come up with a language that prevents these sorts of mistakes!

cenamus · 3 months ago

Thank god Lisp is older than C, don't have to deal with such nonsense :-)

brabel · 3 months ago

That’s impossible. Just be more careful and everything should work, the author’s C was just a bit rusty!

kookamamie · 3 months ago

Yes, perhaps it could have a marketing slogan like "Write once, crash everywhere!"

uecker · 3 months ago

I think he is using memory pools, so this is ok.

pjmlp · 3 months ago

If only there were a couple of OSes implemented during the 1960s with such programming languages....

kazinator · 3 months ago

In the same file:

  static bool is_java_identifier_start(char c)
  {
    return (isalpha(c) || c == '_' || c == '$');
  }

Undefined behavior in isalpha if c happens to be negative (and not equal to EOF), like some UTF-8 byte.

I think some <ctype.h> implementations are hardened against this issue, but not all.
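The usual fix is to convert the argument to `unsigned char` before calling any `<ctype.h>` function. A minimal sketch follows; `is_java_identifier_part` and `identifier_prefix_len` are hypothetical helpers added here only to show the cast being applied to arbitrary bytes, and the behaviour for bytes above 0x7F depends on the current locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

static bool is_java_identifier_start(char c)
{
    /* isalpha() requires a value representable as unsigned char, or EOF.
     * The cast maps negative char values into the range 128..255. */
    return isalpha((unsigned char)c) || c == '_' || c == '$';
}

static bool is_java_identifier_part(char c)
{
    return isalnum((unsigned char)c) || c == '_' || c == '$';
}

/* Length of the leading run of identifier characters in s.
 * Safe to call on UTF-8 input: no byte reaches <ctype.h> as a negative int. */
static size_t identifier_prefix_len(const char *s)
{
    assert(s != NULL);
    if (!is_java_identifier_start(s[0]))
        return 0;
    size_t n = 1;
    while (is_java_identifier_part(s[n]))
        n++;
    return n;
}
```

In the default "C" locale the UTF-8 lead and continuation bytes are simply classified as non-alphabetic, so the scan stops at them instead of invoking undefined behavior.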

masfoobar · 3 months ago

> I am always curious how different C programs decide how to manage memory.

At a basic level, you can create memory on the stack or on the heap. Obviously I will focus on the heap as that is dynamically allocating memory of a certain size.

The C programming language does not dictate how you handle memory. You are pretty much on your own. Some C programmers (likely the more inexperienced ones) will malloc individual variables as if they were creating a 'new' instance in a typical OOP language like Java. This can be a telltale sign of a C programmer who comes from an OOP background. As they learn and improve their C skills, they realise they should allocate a chunk of memory of a certain type, but they may still be malloc(ing) and free(ing) all over the code, making it difficult to understand what is being used and where -- especially if you are looking at code you did not write.

You can also have programs that do not bother free(ing) memory. For example, a simple shell program that just does simple input->process->output and terminates. For these types of programs, just let the OS deal with freeing the memory.

Good C code (in my opinion) uses malloc and free in only a handful of functions. There are higher level functions for proper Allocators. One example is an Arena Allocator. Then if you want a function which may require dynamic memory, you can tell it which allocator to use. It gives you control, generally speaking. You can create a simple string library or builder with an allocator.

Of course an Allocator does not have to use memory on the heap. It can use memory on the stack as well.
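As a sketch of "tell it which allocator to use", here is a tiny allocator interface with two backends, one over a fixed buffer that can live on the stack and one over malloc. All names (`allocator`, `fixed_allocator`, `join_names`) are invented for this example, and the fixed backend does no alignment, so it is only suitable for byte strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator interface: one function pointer, state follows. */
typedef struct allocator {
    void *(*alloc)(struct allocator *self, size_t size);
} allocator;

/* Backend 1: hands out bytes from a fixed buffer (stack or static). */
typedef struct {
    allocator base;     /* must be first so the pointer cast below is valid */
    char *buf;
    size_t used, cap;
} fixed_allocator;

static void *fixed_alloc(allocator *self, size_t size)
{
    fixed_allocator *f = (fixed_allocator *)self;
    if (size > f->cap - f->used)
        return NULL;
    void *p = f->buf + f->used;
    f->used += size;
    return p;
}

/* Backend 2: plain heap allocation; the caller frees the result. */
static void *heap_alloc(allocator *self, size_t size)
{
    (void)self;
    return malloc(size);
}

/* Joins a and b with '.', allocating from whichever allocator is passed. */
char *join_names(allocator *al, const char *a, const char *b)
{
    assert(al != NULL && a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    char *out = al->alloc(al, la + 1 + lb + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, a, la);
    out[la] = '.';
    memcpy(out + la + 1, b, lb + 1);   /* copies the terminating '\0' too */
    return out;
}
```

The string function itself never calls malloc or free. The choice of where the bytes live, and when they go away, stays with the caller.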

There are various other patterns to use in the world of memory, especially in C.

SunlitCat · 3 months ago

Strings! The bane of C programming, and a big reason I prefer C++. :D

jbellis · 3 months ago

IntelliJ's FernFlower decompiler is the gold standard. I don't think it's available in a standalone repo, but it IS available as a standalone library: https://github.com/JetBrains/intellij-community/blob/master/... https://www.jetbrains.com/intellij-repository/releases

I guess there's some history there that I'm not familiar with because JBoss also has a FernFlower decompiler library https://mvnrepository.com/artifact/org.jboss.windup.decompil...

mudkipdev · 3 months ago

https://github.com/Vineflower/vineflower

jbellis · 3 months ago

cool, TIL!

> Examples of Vineflower's output, compared to other decompilers, can be found on the wiki.

[wiki is empty]

:-/

appendixv3 · 3 months ago

Very cool project! Love the idea of a Java decompiler written in C — the speed must be great.

Any plan to support `.dex` in the future? Also curious how you handle inner classes inside JARs.

mdaniel · 3 months ago

The "jikes" compiler from IBM <https://github.com/daveshields/jikespg> was written in C++ and was for the longest time screaming fast. It also had its own parser generator lpg which was fun to play with, if you're into those things <https://github.com/daveshields/jikespg>

It seems someone liked it and made a "v2" along with LSP support https://github.com/A-LPG/LPG2#lpg2

amiga386 · 3 months ago

Jikes also gave massively better error messages than the official Java compiler, from what I remember, and it certainly ran a lot faster on the Amiga (https://aminet.net/package/dev/lang/jikes) than trying to run javac via Kaffe (https://en.wikipedia.org/wiki/Kaffe) did.

pjmlp · 3 months ago

Certainly not everything on Jikes, given that it was one of the first bootstrapped Java toolchains.

https://www.jikesrvm.org/

neocanable · 3 months ago

I am writing the part that decompiles dex and apk. The current speed is about 10 times faster than that of Java, and it takes up fewer resources than Java. The compiled binary is also smaller, only about 300k. Thank you for your attention.

Koshkin · 3 months ago

> 10 times faster than that of Java

I was hoping that these days' Java would be "almost" as fast as C/C++. Oh well.

mdaniel · 3 months ago

This has been my life experience with things written in C/C++, so speed doesn't matter. Or, I guess from an alternative perspective, it ran very fast, but exited very fast, too :-D

  $ ./objdir/garlic $the_jar_file -o out-dir -t $(nproc)
  Progress : 85 (1024)Segmentation fault: 11

tslater2006 · 3 months ago

The readme shows support for dumping dex files. Edit: missed that it has a comment that says "unsupport for now", but at least it looks like something planned.

neocanable · 3 months ago

It processes inner classes recursively. First it reads all entries from the jar and analyzes the relationships between classes. Then it does the decompile job.

neocanable · 3 months ago

The project supports dex and apk now.

stefanos82 · 3 months ago

Nice job! I don't know whether you know https://github.com/java-decompiler/jd-gui or not, but in case you haven't seen it before, maybe you could use it as a reference, since it's written in Java, for extra fun with your adventure?

rafram · 3 months ago

Things may have changed, but my impression as of several years ago was that JD-GUI was far, far behind the state of the art (Fernflower, aka the built-in IntelliJ decompiler) in terms of correctness, re-sugaring, support for modern Java features, and so on. Fernflower is open source as part of IntelliJ: https://github.com/fesh0r/fernflower

GranPC · 3 months ago

Is there a good GUI for this a la jadx-gui that isn't an entire IDE?

neocanable · 3 months ago

The decompiler supports Android apk and dex now.

cosmolev · 3 months ago

How does the output compare to https://www.decompiler.com/ in terms of correctness?

keepamovin · 3 months ago

By hand or with AI? Fascinating. So much work! What was your motivation for this?

neocanable · 3 months ago

90% by hand, 10% AI. I do this for fun and to learn about jvm.

jebarker · 3 months ago

I think that sort of ratio is the sweet spot for learning. I've been writing an 8086 simulator in C++ and using an LLM for answering specific technical questions I come up with has drastically sped up my progress without it actually doing the work for me.

keepamovin · 3 months ago

Wow, impressive. A project of that scale and depth.

xandrius · 3 months ago

Irrelevant to me. People would never ask whether someone has created something looking at SO or not. If the thing works as advertised, good for them!

lyxell · 3 months ago

To some people the process leading to a finished project is the most interesting thing about posts like these.

Bjartr · 3 months ago

A great question to ask. We're in the middle of learning where AI can and can't be effective. Knowing where and how it's being used is quite useful.

ConanRus · 3 months ago

Can you also write a C decompiler in pure Java language?

dardeaup · 3 months ago

Of course it can be done! It wouldn't be as general purpose as the Java decompiler in C because the C decompiler would have to know about the CPU architecture of the executable code (just as the Java decompiler has to know about JVM opcodes).

mdaniel · 3 months ago

https://github.com/NationalSecurityAgency/ghidra/blob/Ghidra... (Apache 2)