Overview

Welcome to the Decompiler Analysis Engine. It is a complete library for performing automated data-flow analysis on software, starting from the binary executable. This documentation is geared toward understanding the source code and starts with a brief discussion of the libraries capabilities and moves immediately into the design of the decompiler and the main code workflow.

The library provides its own Register Transfer Language (RTL), referred to internally as p-code, which is designed specifically for reverse engineering applications. The disassembly of processor specific machine-code languages, and subsequent translation into p-code, forms a major sub-system of the decompiler. There is a processor specification language, referred to as SLEIGH, which is dedicated to this translation task, and there is a corresponding section in the documentation for the classes and methods used to implement this language in the library (See SLEIGH). This piece of the code can be built as a standalone binary translation library, for use by other applications.

For getting up to speed quickly on the details of the source and the decompiler's main data structures, there is a specific documentation page describing the core classes and methods.

Finally there is a documentation page summarizing the simplification rules used in the core decompiler analysis.

Capabilities

Design

The main design elements of the decompiler come straight from standard Compiler Theory data structures and algorithms. This should come as no surprise, as both compilers and decompilers are concerned with translating from one coding language to another. They both follow a general work flow:

Parse/tokenize input language.
Build abstract syntax trees in an intermediate language.
Manipulate/optimize syntax trees.
Map intermediate language to output language constructs.
Emit final output language encoding.

With direct analogs to (forward engineering) compilers, the decompiler uses:

A Register Transfer Language (RTL) referred to as p-code.
Static Single Assignment (SSA) form.
Basic blocks and Control Flow Graphs.
Term rewriting rules.
Dead code elimination.
Symbol tables and scopes.

Despite these similarities, the differences between a decompiler and a compiler are substantial and run throughout the entire process. These all stem from the fact that, in general, descriptive elements and the higher-level organization of a piece of code can only be explicitly expressed in a high-level language. So the decompiler, working with a low-level language as input, can only infer this information.

The features mentioned above all have a decompiler specific slant to them, and there are other tasks that the decompiler must perform that have no real analog with a compiler. These include:

Variable merging (vaguely related to register coloring)
Type propagation
Control flow structuring
Function prototype recovery
Expression recovery

Main Work Flow

Here is an outline of the decompiler work flow.

Specify Entry Point

The user specifies a starting address for a particular function.

Generate Raw P-code

The p-code generation engine is called SLEIGH. Based on a processor specification file, it maps binary encoded machine instructions to sequences of p-code operations. P-code operations are generated for a single machine instruction at a specific address. The control flow through these p-code operations is followed to determine if control falls through, or if there are jumps or calls. A work list of new instruction addresses is kept and is continually revisited until there are no new instructions. After the control flow is traced, additional changes may be made to the p-code.

PIC constructions are checked for, now that the extent of the function is known. If a call is to a location that is still within the function, the call is changed to a jump.
Functions which are marked as inlined are filled in at this point, before basic blocks are generated. P-code for the inlined function is generated separately and control flow is carefully set up to link it in properly.

Generate Basic Blocks and the CFG

Basic blocks are generated on the p-code instructions (not the machine instructions) and a control flow graph of these basic blocks is generated. Control flow is normalized so that there is always a unique start block with no other blocks falling into it. In the case of subroutines which have branches back to their very first machine instruction, this requires the creation of an empty placeholder start block that flows immediately into the block containing the p-code for the first instruction.

Inspect Sub-functions

-# Addresses of direct calls are looked up in the
   database and any parameter information is
   recovered.
-# If there is information about an indirect call,
   parameter information can be filled in and the
   indirect call can be changed to a direct call.
-# Any call for which no prototype is found has a
   default prototype set for it.
-# Any global or default prototype recovered at this
   point can be overridden locally.

Adjust/Annotate P-code

The context database is searched for known values of memory locations coming into the function. These are implemented by inserting p-code COPY instructions that assign the correct value to the correct memory location at the beginning of the function.
The recovered prototypes may require that extra p-code is injected at the call site so that certain actions of the call are explicit to the analysis engine.
Other p-code may be inserted to indicate changes a call makes to the stack pointer. Its possible that the change to the stack pointer is unknown. In this case INDIRECT p-code instructions are inserted to indicate that the state of the stack pointer is unknown at that point, preparing for the extrapop action.
For each p-code call instruction, extra inputs are added to the instruction either corresponding to a known input for that call, or in preparation for the prototype recovery actions. If the (potential) function input is located on the stack, a temporary is defined for that input and a full p-code LOAD instruction, with accompanying offset calculation, is inserted before the call to link the input with the (currently unknown) stack offset. Similarly extra outputs are added to the call instructions either representing a known return value, or in preparation for parameter recovery actions.
Each p-code RETURN instruction for the current function is adjusted to hide the use of the return address and to add an input location for the return value. The return value is considered an input to the RETURN instruction.

The Main Simplification Loop

Generate SSA Form

This is very similar to forward engineering algorithms. It uses a fairly standard phi-node placement algorithm based on the control flow dominator tree and the so-called dominance frontier. A standard renaming algorithm is used for the final linking of variable defs and uses. The decompiler has to take into account partially overlapping variables and guard against various aliasing situations, which are generally more explicit to a compiler. The decompiler SSA algorithm also works incrementally. Many of the stack references in a function cannot be fully resolved until the main term rewriting pass has been performed on the register variables. Rather than leaving stack references as associated LOAD s and STORE s, when the references are finally discovered, they are promoted to full variables within the SSA tree. This allows full copy propagation and simplification to occur with these variables, but it often requires 1 or more additional passes to fully build the SSA tree. Local aliasing information and aliasing across subfunction calls can be annotated in the SSA structure via INDIRECT p-code operations, which holds the information that the output of the INDIRECT is derived from the input by some indirect (frequently unknown) effect.

Eliminate Dead Code

Dead code elimination is essential to the decompiler because a large percentage of machine instructions have side-effects on machine state, such as the setting of flags, that are not relevant to the function at a particular point in the code. Dead code elimination is complicated by the fact that its not always clear what variables are temporary, locals, or globals. Also, compilers frequently map smaller (1-byte or 2-byte) variables into bigger (4-byte) registers, and manipulation of these registers may still carry around left over information in the upper bytes. The decompiler detects dead code down to the bit, in order to appropriately truncate variables in these situations.

Propagate Local Types

The decompiler has to infer high-level type information about the variables it analyzes, as this kind of information is generally not present in the input binary. Some information can be gathered about a variable, based on the instructions it is used in (i.e. if it is used in a floating point instruction). Other information about type might be available from header files or from the user. Once this is gathered, the preliminary type information is allowed to propagate through the syntax trees so that related types of other variables can be determined.

Perform Term Rewriting

The bulk of the interesting simplifications happen in this section. Following Formal Methods style term rewriting, a long list of rules are applied to the syntax tree. Each rule matches some potential configuration in a portion of the syntax tree, and after the rule matches, it specifies a sequence of edit operations on the syntax tree to transform it. Each rule can be applied repeatedly and in different parts of the tree if necessary. So even a small set of rules can cause a large transformation. The set of rules in the decompiler is extensive and is tailored to specific reverse engineering needs and compiler constructs. The goal of these transformations is not to optimize as a compiler would, but to simplify and normalize for easier understanding and recognition by human analysts (and follow on machine processing). Typical examples of transforms include: copy propagation, constant propagation, collecting terms, cancellation of operators and other algebraic simplifications, undoing multiplication and division optimizations, commuting operators, ....

Adjust Control Flow Graph

The decompiler can recognize

unreachable code
unused branches
empty basic blocks
redundant predicates
...

It will remove branches or blocks in order to simplify the control flow.

Recover Control Flow Structure

The decompiler recovers higher-level control flow objects like loops, if/else blocks, and switch statements. The entire control flow of the function is built up hierarchically with these objects, allowing it to be expressed naturally in the final output with the standard control flow constructs of the high-level language. The decompiler recognizes common high-level unstructured control flow idioms, like break, and can use node-splitting in some situations to undo compiler flow optimizations that prevent a structured representation.

Perform Final P-code Transformations

During the main simplification loop, many p-code operations are normalized in specific ways for the term rewriting process that aren't necessarily ideal for the final output. This phase does transforms designed to enhance readability of the final output. A simple example is that all subtractions (INT_SUB) are normalized to be an addition on the twos complement in the main loop. This phase would convert any remaining additions of this form back into a subtraction operation.

Exit SSA Form and Merge Low-level Variables (phase 1)

The static variables of the SSA form need to be merged into complete high-level variables. The first part of this is accomplished by formally exiting SSA form. The SSA phi-nodes and indirects are eliminated either by merging the input and output variables or inserting extra COPY operations. Merging must guard against a high-level variable holding different values (in different memory locations) at the same time. This is similar to register coloring in compiler design.

Determine Expressions and Temporary Variables

A final determination is made of what the final output expressions are going to be, by determining which variables in the syntax tree will be explicit and which represent temporary variables. Certain terms must automatically be explicit, such as constants, inputs, etc. Other variables are forced to be explicit because they are read too many times or because making it implicit would propagate another variable too far. Any variables remaining are marked implicit.

Merge Low-level Variables (phase 2)

Even after the initial merging of variables in phase 1, there are generally still too many for normal C code. So the decompiler does additional, more speculative merging. It first tries to merge the inputs and outputs of copy operations, and then the inputs and outputs of more general operations. And finally, merging is attempted on variables of the same type. Each potential merge is subject to register coloring restrictions.

Add Type Casts

Type casts are added to the code so that the final output will be syntactically legal.

Establish Function's Prototype

The register/stack locations being used to pass parameters into the function are analyzed in terms of the parameter passing convention being used so that appropriate names can be selected and the prototype can be printed with the input variables in the correct order.

Select Variable Names

The high-level variables, which are now in their final form, have names assigned based on any information gathered from their low-level elements and the symbol table. If no name can be identified from the database, an appropriate name is generated automatically.

Do Final Control Flow Structuring

Order separate components
Order switch cases
Determine which unstructured jumps are breaks
Stick in labels for remaining unstructured jumps

Emit Final C Tokens

Following the recovered function prototype, the recovered control flow structure, and the recovered expressions, the final C tokens are generated. Each token is annotated with its syntactic meaning, for later syntax highlighting. And most tokens are also annotated with the address of the machine instruction with which they are most closely associated. This is the basis for the machine/C code cross highlighting capability. The tokens are passed through a standard Oppen pretty-printing algorithm to determine the final line breaks and indenting.

Table of Contents