{"id":885,"date":"2024-06-04T23:03:56","date_gmt":"2024-06-04T21:03:56","guid":{"rendered":"https:\/\/extendsclass.com\/blog\/?p=885"},"modified":"2024-06-02T20:36:14","modified_gmt":"2024-06-02T18:36:14","slug":"llm4decompile-decompiling-binary-code-with-large-language-models","status":"publish","type":"post","link":"https:\/\/extendsclass.com\/blog\/llm4decompile-decompiling-binary-code-with-large-language-models","title":{"rendered":"LLM4Decompile: Decompiling binary code with Large Language Models"},"content":{"rendered":"\n<p>Artificial intelligence has seen incredible growth in recent years, infiltrating more and more domains; now it&#8217;s the turn of decompilation! Will binaries finally reveal all their secrets? Will AI open new perspectives in the field of reverse engineering? Let&#8217;s explore what LLM4Decompile offers, a project for decompiling binary code using large language models.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"A_quick_reminder_about_decompilation\"><\/span>A quick reminder about decompilation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p><em><em>Decompilation is the process of converting binary code or bytecode into source code understandable by developers. Used for software analysis, reverse engineering, or recovering lost code, it is essential for understanding the internal workings of a program when its source code is not available.<\/em><\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"LLM4Decompile\"><\/span>LLM4Decompile<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>LLM4Decompile, the first open-source large language model (LLM) dedicated to decompilation, is the result of innovative research conducted by a team of researchers.<\/p>\n\n\n\n<p>Until now, decompilation has been quite limited. 
Existing tools often produced output that was unreadable and hard for developers to understand (I hate assembly, and I am only comfortable with high-level languages\u2026). The difficulty comes from how much information is lost during compilation: variable names, comments, and much of the code&#8217;s high-level structure are discarded. But that was before the arrival of LLM4Decompile!<\/p>\n\n\n\n<p>LLM4Decompile has been trained on a dataset of over 4 billion tokens of C code and x86 assembly. The model has thus learned to decode binaries in a more &#8220;intelligent&#8221; manner than traditional decompilers. Thanks to its billions of parameters, it can capture code patterns and semantics that rule-based tools miss.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1200\" height=\"1422\" src=\"https:\/\/extendsclass.com\/blog\/wp-content\/uploads\/2024\/06\/pipeline.jpg\" alt=\"\" class=\"wp-image-887\"\/><\/figure>\n\n\n\n<p>The researchers have also developed a standardized benchmark for decompilation, named Decompile-Eval. Based on real projects, this benchmark evaluates a model&#8217;s ability to regenerate code that can be recompiled and re-executed. About 90% of the code that LLM4Decompile produces can be recompiled! More than 20% of it also passes the unit tests, meaning that in those cases it successfully restores the program&#8217;s logic. On this benchmark, that is better than GPT-4.<\/p>\n\n\n\n<p>For now, LLM4Decompile is limited to C and x86 assembly. Still, the approach is promising: it could pave the way for malware analysis, porting old video games, transpilation, and more.<\/p>\n\n\n\n<p>This project is open-source, so feel free to participate (or use it): <a href=\"https:\/\/github.com\/albertan017\/LLM4Decompile\">LLM4Decompile on GitHub<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Artificial intelligence has seen incredible growth in recent years, infiltrating more and more domains; now it&#8217;s the turn of decompilation! 
Will binaries finally reveal all their secrets? Will AI open new perspectives in the field of reverse engineering? Let&#8217;s explore what LLM4Decompile offers, a project for decompiling binary code using large language models. A quick [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":893,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_sitemap_exclude":false,"_sitemap_priority":"","_sitemap_frequency":""},"categories":[4,2],"tags":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/posts\/885"}],"collection":[{"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/comments?post=885"}],"version-history":[{"count":1,"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/posts\/885\/revisions"}],"predecessor-version":[{"id":892,"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/posts\/885\/revisions\/892"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/media\/893"}],"wp:attachment":[{"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/media?parent=885"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/categories?post=885"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/extendsclass.com\/blog\/wp-json\/wp\/v2\/tags?post=885"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}