
LLM4Decompile: Decompiling binary code with Large Language Models

Artificial intelligence has grown rapidly in recent years and keeps reaching into new domains; now it is decompilation's turn! Will binaries finally reveal all their secrets? Will AI open new perspectives in reverse engineering? Let’s explore what LLM4Decompile offers, a project for decompiling binary code using large language models.

A quick reminder about decompilation

Decompilation is the process of converting binary code or bytecode into source code understandable by developers. Used for software analysis, reverse engineering, or recovering lost code, it is essential for understanding the internal workings of a program when its source code is not available.

LLM4Decompile

LLM4Decompile is presented by its authors as the first open-source large language model (LLM) dedicated to decompilation. It comes out of a research project.

Until now, decompilation has been quite limited. Existing tools often produced source code that was hard for developers to read and understand (I hate assembly, and I am only comfortable with high-level languages…). The difficulty comes from how much information is lost during compilation: comments, most variable names, and much of the original structure do not survive in the binary. But that was before the arrival of LLM4Decompile!

LLM4Decompile has been trained on a dataset of over 4 billion tokens of C code and x86 assembly. The model has thus learned to translate assembly back into C in a more “intelligent” way than traditional decompilation. With its billions of parameters, it can capture code patterns and semantics that earlier tools struggle with.
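To make the idea of a C/assembly training pair more concrete, here is a minimal sketch of how one such pair could be produced with gcc. The function names (gcc_asm_command, make_pair), the optimization level, and the use of gcc -S are assumptions chosen for illustration; the project's actual data pipeline may well differ.

```python
import subprocess
import tempfile
from pathlib import Path


def gcc_asm_command(src, out, opt="O0", compiler="gcc"):
    """Build the command line that asks the compiler to emit assembly (-S)."""
    return [compiler, "-S", f"-{opt}", str(src), "-o", str(out)]


def make_pair(c_source, opt="O0"):
    """Compile a C snippet to assembly and return an (assembly, source) pair.

    Hypothetical helper: requires gcc on the PATH and raises
    subprocess.CalledProcessError if compilation fails.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "func.c"
        out = Path(tmp) / "func.s"
        src.write_text(c_source)
        subprocess.run(gcc_asm_command(src, out, opt),
                       check=True, capture_output=True)
        return {"asm": out.read_text(), "source": c_source}
```

Calling make_pair("int add(int a, int b) { return a + b; }") on a machine with gcc installed would give the assembly as the model input and the C text as the expected output.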

The researchers have also developed a standardized benchmark for decompilation, named Decompile-Eval. Based on real projects, this benchmark evaluates the models’ ability to regenerate code that can be recompiled and re-executed. LLM4Decompile manages to recompile about 90% of decompiled binaries! More than 20% of the regenerated code also passes unit tests, meaning it successfully restores a significant portion of the program’s logic. According to the authors, this is better than GPT-4 on the same benchmark.
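The two measures mentioned here, "does it recompile" and "does it pass the tests", can be sketched in a few lines. This is only an illustration under assumptions: the Outcome class, check_sample, and rates are hypothetical names, the compiler is assumed to be gcc, and each sample is assumed to contain a main() that runs the test assertions and exits with 0 on success. It is not the Decompile-Eval harness itself.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Outcome:
    recompiled: bool  # the decompiled C source compiled again
    reexecuted: bool  # the rebuilt program also passed its tests


def check_sample(c_source, compiler="gcc", timeout=10):
    """Compile one decompiled sample and run it (assumes gcc is installed)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.c"
        exe = Path(tmp) / "sample"
        src.write_text(c_source)
        build = subprocess.run([compiler, str(src), "-o", str(exe)],
                               capture_output=True)
        if build.returncode != 0:
            return Outcome(recompiled=False, reexecuted=False)
        try:
            run = subprocess.run([str(exe)], capture_output=True,
                                 timeout=timeout)
        except subprocess.TimeoutExpired:
            return Outcome(recompiled=True, reexecuted=False)
        return Outcome(recompiled=True, reexecuted=run.returncode == 0)


def rates(outcomes):
    """Return (recompile rate, re-execution rate) over all samples."""
    total = len(outcomes)
    if total == 0:
        return 0.0, 0.0
    recompiled = sum(o.recompiled for o in outcomes)
    reexecuted = sum(o.reexecuted for o in outcomes)
    return recompiled / total, reexecuted / total
```

With this kind of bookkeeping, the figures quoted above would correspond to rates(...) returning roughly (0.9, 0.2) over the benchmark samples.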

For now, LLM4Decompile is limited to C and x86 assembly. However, the prospects are promising: it could pave the way for malware analysis, porting old video games, transpilation, and more.

This project is open-source, so feel free to participate (or use it): LLM4Decompile on GitHub.



