aleclm commented on The rev.ng decompiler goes open source   rev.ng/blog/open-sourcing... · Posted by u/quic_bcain
flexagoon · 2 years ago
Are there any plans to support type inference? It seems like it currently shows all variables as generic64_t. Would be nice to automatically detect their types like Ghidra does (albeit sometimes incorrectly)
aleclm commented on The rev.ng decompiler goes open source   rev.ng/blog/open-sourcing... · Posted by u/quic_bcain
jcranmer · 2 years ago
One of the blog posts I keep meaning to write but never quite get around to is a post that C is not portable assembly. What is necessary is decompilation to a portable C-like assembly, but that target is not C, and I think focusing on creating valid C tends to drag you towards suboptimal decisions, even leaving aside issues like "should SLL decompile to x << y or x << (y % 32)?"

In my experience with Ghidra, I've just seen far too many times where Ghidra starts with wrong types for something and the result becomes gibberish--even just plain dropping stuff altogether. There are some cases where it's clear it's just poor analysis on Ghidra's part (e.g., it doesn't seem to understand stack slot reuse, and memcpy-via-xmm is very confusing to it). And Ghidra's type system lacks function pointer types, which is very annoying when you're doing vtable-heavy C++ code.

I do like the appeal of a recompilable target language. But that language need not be C--in fact, I'm actually sketching out the design of such a language for my own purposes in being able to read LLVM IR without going crazy (which means I need to distinguish between, e.g., add nuw and just plain add).

Analysis necessarily involves multiple levels. Given that a lot of the type analysis today tends to be crap, I'd rather have the ability to see a more solid first-level analysis that does variable recovery and works out function calling conventions so that it can inform my ability to reverse engineer structures or things like "does this C++ method return a non-trivial struct that is an implicit first parameter?"

(Also, since I'm largely looking at C++ code in practice, I'd absolutely love to be able to import C++ header files to fill in known structure types.)

aleclm · 2 years ago
> should SLL decompile to x << y or x << (y % 32)?

I think this is a bit of a misguided question. The hardware usually has precisely defined semantics. QEMU's << behaves like C's (undefined behavior for rhs >= 32 on 32-bit values), but this means that the lifter (still QEMU) will account for this and emit code preserving the semantics.

tl;dr: the code we emit should do the right thing depending on what the original instruction did, without making assumptions on what happens in case of C undefined behaviors.
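
To make the point concrete, here is a minimal Python sketch (not rev.ng or QEMU code; the function names are made up for illustration) of how two architectures define a 32-bit left shift for large counts. 32-bit x86 SHL masks the count to its low 5 bits, while AArch32 LSL by register uses the low byte of the count and yields 0 for amounts of 32 or more. A lifter has to reproduce whichever behavior the original instruction had.

```python
MASK32 = 0xFFFFFFFF


def shl_x86_32(x: int, count: int) -> int:
    """32-bit x86 SHL: the CPU masks the shift count to its low 5 bits."""
    return (x << (count & 31)) & MASK32


def lsl_arm32_reg(x: int, count: int) -> int:
    """AArch32 LSL by register: low byte of the count; 32 or more gives 0."""
    amount = count & 0xFF
    return (x << amount) & MASK32 if amount < 32 else 0


print(hex(shl_x86_32(1, 33)))     # count wraps around to 1 -> 0x2
print(hex(lsl_arm32_reg(1, 33)))  # every bit is shifted out -> 0x0
```

The same source-level `x << y` would be wrong for at least one of the two, which is why the emitted code has to spell out the masking or the zero case explicitly.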

> Ghidra's type system lacks function pointer types

Weird limitation, we support those.

> it doesn't seem to understand stack slot reuse

That's a tricky one. We're now re-designing certain parts of the pipeline to enable LLVM to promote stack accesses to SSA values, which basically solves the stack slot reuse problem. This is probably one of the most important features experienced reversers ask for.

> that language need not be C--

Making up your own language is a temptation one should resist.

Anyway, we're rewriting our backend using an MLIR dialect (we call it clift) which targets C but should be good enough to emit something "similar to C but slightly different". It might make sense to have a different backend there. But a "standard C" backend has to be the first use case.

We thought about emitting C++, it would make our life simpler. But I think targeting non-C as the first and foremost backend would be a mistake.

Also, a Python backend would be cool.

> Analysis necessarily involves...

I would be interested in discussing more what exactly you mean here. Why don't you join our discord server?

> I'd absolutely love to be able to import C++ header files to fill in known structure types

We have a project for importing from header files. Basically we want to use a compiler to turn them into DWARF debug symbols and then import those. Not too hard.

aleclm commented on The rev.ng decompiler goes open source   rev.ng/blog/open-sourcing... · Posted by u/quic_bcain
costco · 2 years ago
Congrats. Do you have any regrets about outsourcing lifting to the QEMU TCG or has it worked well?
aleclm · 2 years ago
Thanks!

It has been working very well. Two regrets:

1. Not rebasing our fork of QEMU for years has put us in a bad spot. But just today a member of our team managed to lift stuff with the latest QEMU. And he has also been able to lift Qualcomm Hexagon code, for which we helped to add support in QEMU. Eventually we'll be the first proper Hexagon decompiler :)

2. Focusing too much on QEMU led our frontend to be tightly coupled with QEMU. It will now take some effort to enable support for additional, non-QEMU-based frontends. But it's not impossible: our idea is to let users add support for a new architecture by defining, in C, a struct for the CPU state and a bunch of functions acting on it. That's it. No need to learn any internal representation.
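
As a rough illustration of the "state struct plus functions" idea, here is a toy model in Python rather than C. Everything in it (`ToyCPUState`, `op_movi`, `op_add`, the 4-byte instruction size) is hypothetical and not part of any rev.ng interface; it only shows the shape of such a description.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF


@dataclass
class ToyCPUState:
    """Hypothetical CPU state: four 32-bit registers and a program counter."""
    regs: list = field(default_factory=lambda: [0] * 4)
    pc: int = 0


def op_movi(cpu: ToyCPUState, rd: int, imm: int) -> None:
    """Load an immediate into a register, then advance to the next instruction."""
    cpu.regs[rd] = imm & MASK32
    cpu.pc += 4


def op_add(cpu: ToyCPUState, rd: int, rs1: int, rs2: int) -> None:
    """Add two registers with 32-bit wraparound, then advance."""
    cpu.regs[rd] = (cpu.regs[rs1] + cpu.regs[rs2]) & MASK32
    cpu.pc += 4


cpu = ToyCPUState()
op_movi(cpu, 0, 0xFFFFFFFF)
op_movi(cpu, 1, 2)
op_add(cpu, 2, 0, 1)
print(cpu.regs[2], cpu.pc)  # the sum wraps around to 1; pc is 12
```

Each instruction is just a function over the state, so whoever describes the architecture never touches an intermediate representation.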

tl;dr QEMU was a great choice, it worked so well that we didn't work on that part of the codebase for too long and now there's some technical debt there. But we're addressing it.

aleclm commented on The rev.ng decompiler goes open source   rev.ng/blog/open-sourcing... · Posted by u/quic_bcain
jcranmer · 2 years ago
Here's my issue with decompilers:

I don't want to look at assembly code. I'd rather see expression trees, expressed in C-like syntax, than trying to piece together variables from two-address or three-address instructions. Looking at assembly tends to lead to brain farts like "wait, was the first or second operand the output operand?" (really, fuck AT&T syntax) or "wait, does ja implement ugt or sgt?"

So that means I want to look at something vaguely C-like. But the problem is that the C type system is too powerful for decompilers to robustly lift to, and the resulting code is generally at best filled with distractions of wait-I-can-fix-this excessive casting and at worst just wrong. And when it's wrong, I have to resort to staring at the assembly, which (for Ghidra at least) means throwing away a lot of the notes I've accumulated because they don't correlate back to underlying assembly.

So what I really want isn't something that can emit recompilable C code; that's optimizing for something that doesn't help me in the end. What I want is robust decompilation to something that lets me ignore the assembly entirely. I'm a compiler writer, I can handle a language where integers aren't signed but the operations are.

aleclm · 2 years ago
I 120% agree with what you're saying, but emitting valid C is kinda part of what you're asking, in design terms.

Our goal is: omit all the casts that can be omitted without changing the semantics according to C. In fact, we have a PR doing exactly this (still on the old repo, hopefully it will go in soon).

But, how can you expect to be able to be strict with what C allows you to do implicitly, if you're not even emitting valid C? For instance, thanks to the fact that we emit valid C, we could test if the assembly emitted by a compiler is the same before and after removing redundant casts.
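
A rough sketch of such a check, as an illustration only and not rev.ng's actual test setup: compile the C text before and after the cast removal and compare the generated assembly. It assumes a gcc- or clang-compatible `cc` on PATH; the helper names and the two sample snippets are invented for the example.

```python
import shutil
import subprocess


def assembly_for(source: str, compiler: str = "cc") -> str:
    """Compile C source read from stdin and return the assembly text."""
    result = subprocess.run(
        [compiler, "-x", "c", "-O2", "-S", "-o", "-", "-"],
        input=source, capture_output=True, text=True, check=True,
    )
    return result.stdout


def same_assembly(before: str, after: str, compiler: str = "cc") -> bool:
    """True when both sources compile to identical assembly."""
    return assembly_for(before, compiler) == assembly_for(after, compiler)


WITH_CAST = "int f(int x) { return (int)x + 1; }\n"
WITHOUT_CAST = "int f(int x) { return x + 1; }\n"

# Only run the comparison when a compiler is actually available.
if shutil.which("cc"):
    print(same_assembly(WITH_CAST, WITHOUT_CAST))
```

This only works because both inputs are valid C: if either one failed to compile, there would be nothing to compare.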

My point is that emitting valid C is kind of a prerequisite for what you're asking: a rather low bar, but one that, in practice, no mainstream decompiler clears. It's pretty obvious the decompiled code will often be redundant and outright wrong if you don't even guarantee it's syntactically valid. Clearly it's not a panacea, but it's an important design criterion and shows the direction we want to go.

As for comments: we still haven't implemented inline comments, but they will be attached to program addresses, so they will be available both in disassembly and decompiled C. It's not very hard to do, but that needs some love.
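
One way to picture address-attached comments, as a sketch only (the feature is not implemented yet, so the data layout, addresses and names below are all invented): keep a single map from address to comment and consult it when rendering either view.

```python
# Hypothetical store: one comment per program address.
comments = {0x401004: "loop counter init"}

# Each rendered line remembers the address it came from.
disassembly = [(0x401000, "push rbp"), (0x401004, "xor eax, eax")]
decompiled = [(0x401004, "int i = 0;")]


def annotate(lines, comments, marker):
    """Append the comment for each line's address, using the view's comment marker."""
    rendered = []
    for address, text in lines:
        note = comments.get(address)
        rendered.append(f"{text}  {marker} {note}" if note else text)
    return rendered


print(annotate(disassembly, comments, ";"))
print(annotate(decompiled, comments, "//"))
```

Because the comment is keyed by address rather than by a line of C, it survives even when the decompiled output changes.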

aleclm commented on The rev.ng decompiler goes open source   rev.ng/blog/open-sourcing... · Posted by u/quic_bcain
JonChesterfield · 2 years ago
Not setting environment variables is indeed solved by not setting environment variables - but `source ./environment` is what's written on the announcement page at the top of this thread. './revng' doesn't appear anywhere on it.

You haven't set LD_LIBRARY_PATH but other people will. They'll also set LIBRARY_PATH, put other stuff on PATH and so forth. Module systems are especially prone to this, but ending up with .bashrc doing it happens too.

You have granted the user the ability to override parts of the toolchain with environment variables and moving files to various different directories. That's nice. Some compiler devs will appreciate it. Also it's doing the thing Linux recommends for things installed globally so that's defensible.

In exchange, you will get bug reports saying "your product does not work", where the root cause eventually turns out to be "my linker chose a different library to my loader for some internal component". You also lose however many people try the product once, see it immediately fall over and don't take the time to tell you about the experience.

I think that's a bad trade-off. Static linking is my preferred fix, but generally anything that stops forgotten environment variables breaking your software in confusing ways is worth considering.

aleclm · 2 years ago
> `source ./environment` is what's written on the announcement page at the top of this thread. './revng' doesn't appear anywhere on it.

You're right, but after that there's a link to the docs where we say to use `./revng`. The blog post is for the impatient :) In the long run the docs are what most people will look at.

I don't think we want to support use cases that might break system packages too. If you set LD_LIBRARY_PATH to a directory where you have an LLVM installation, that might break any system program using LLVM too... Why should we try to fix that using `DT_RPATH` (which is a deprecated way of doing things) when system components don't do it?

We might clean LD_LIBRARY_PATH and similar variables out of the environment, that might be a sensible default, yeah. We might also add a sanity check that prints a warning if weird libraries are pulled in.
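
A minimal sketch of what such a default could look like, assuming a Python launcher. The function name and the exact list of variables are illustrative choices, not something rev.ng ships.

```python
import os

# Variables that can redirect the linker or the loader to unexpected libraries.
LOADER_VARIABLES = ("LD_LIBRARY_PATH", "LD_PRELOAD", "LIBRARY_PATH")


def sanitized_environment(env):
    """Return a cleaned copy of env plus the sorted names that were dropped."""
    dropped = sorted(name for name in LOADER_VARIABLES if name in env)
    clean = {key: value for key, value in env.items() if key not in LOADER_VARIABLES}
    return clean, dropped


clean_env, dropped_names = sanitized_environment(dict(os.environ))
for name in dropped_names:
    print(f"warning: ignoring {name} from the caller's environment")
# clean_env can then be passed as env= to subprocess.run(...)
```

Printing the warning keeps the behavior discoverable for the compiler developers who set those variables on purpose.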

But it's hard to make a decision without a specific use case in mind. If you have an example, bring it forward and I'm happy to discuss what the right approach should be.

aleclm commented on The rev.ng decompiler goes open source   rev.ng/blog/open-sourcing... · Posted by u/quic_bcain
dark-star · 2 years ago
It doesn't work with my ELF file:

    [orchestra] [darkstar@shiina revng]$ ./revng artifact --analyze --progress decompile-to-single-file ../maytag.ko 
    [=======================================] 100% 0.57s Analysis list revng-initial-auto-analysis (5): import-binary
    [===================>                   ]  50% 0.57s Run analyses lists (2): revng-initial-auto-analysis
    [=========>                             ]  25% 0.57s revng-artifact (2): Run analyses
    Only ELF executables and ELF dynamic libraries are supported
    [orchestra] [darkstar@shiina revng]$ file ../maytag.ko 
    ../maytag.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (FreeBSD), not stripped
Does it not support FreeBSD binaries?

Edit: Ah, I missed that it doesn't support kernel modules. This probably has nothing to do with FreeBSD; the file is a relocatable object, not a simple executable.
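
For reference, the distinction that `file` reports lives in the `e_type` field of the ELF header (offset 16, right after the 16-byte `e_ident`). The Python sketch below reads it. The `elf_type` helper and the hand-built sample header are illustrative, and this is not how revng itself performs the check.

```python
import struct

ELF_TYPES = {1: "ET_REL", 2: "ET_EXEC", 3: "ET_DYN", 4: "ET_CORE"}


def elf_type(header: bytes) -> str:
    """Return the ELF object type name from the first bytes of a file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return ELF_TYPES.get(e_type, f"unknown ({e_type})")


# Hand-built header: 64-bit, little-endian, FreeBSD OS ABI (9), relocatable.
sample = b"\x7fELF" + bytes([2, 1, 1, 9]) + bytes(8) + struct.pack("<H", 1)
print(elf_type(sample))  # ET_REL
```

A kernel module like the one above is `ET_REL`, while executables and shared libraries are `ET_EXEC` and `ET_DYN`.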

aleclm · 2 years ago
Can you open an issue on GitHub and attach the binary? I don't think it should be too hard to load that.

u/aleclm

Karma: 867 · Cake day: January 15, 2014
About
Co-founder of rev.ng Labs.