In this article we explain virtualization obfuscators: why they are so widely used by malware, how they differ from ordinary obfuscators, and a step-by-step approach to deobfuscating them.
Nowadays almost every malware sample is protected in some way, and that protection has to be removed before advanced static analysis can even begin. Most protectors restore the actual code at some point during execution, and as a result several automated tools were built to capture it. Because these tools could beat such protectors, there was a need for a protector that does not have this weakness. Thus virtualization obfuscators were born.
Unlike general obfuscators, virtualization obfuscators convert portions of the x86 assembly into a custom language, and this code is then interpreted at run time. The main advantage of a virtualization obfuscator over a non-virtualizing one is that it never restores the code to its original form.
In a virtualization obfuscator, the language the interpreter understands is RISC-like (Reduced Instruction Set Computer), so one CISC instruction is broken into multiple RISC instructions.
e.g.: mov ebx, dword [eax + edx * 4 + 0x12455]
The above instruction will be transformed into multiple instructions: one computes edx*4, one fetches eax, another adds eax to edx*4, another adds 0x12455, and a final one loads the value at the resulting address into the ebx register. (x86 addressing only allows scale factors of 1, 2, 4 or 8, hence the 4 here.) Upon exiting the virtualized code, it is the responsibility of the interpreter to give control back to the native x86 portion.
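To make the decomposition concrete, here is a minimal sketch in Python. The micro-op names (push_reg, push_imm, mul, add, load, pop_reg) and the stack-based design are assumptions made for illustration, not the instruction set of any real protector. The example uses a scale of 4 because x86 addressing only supports scales of 1, 2, 4 and 8.

```python
MASK = 0xFFFFFFFF  # keep all arithmetic in 32 bits, like x86 registers


def lower_mov_load(dst, base, index, scale, disp):
    """Split `mov dst, [base + index*scale + disp]` into simple stack ops.

    The op names are hypothetical; real VMs use their own encodings.
    """
    return [
        ("push_reg", index),
        ("push_imm", scale),
        ("mul",),             # index * scale
        ("push_reg", base),
        ("add",),             # + base
        ("push_imm", disp),
        ("add",),             # + displacement
        ("load",),            # read memory at the computed address
        ("pop_reg", dst),     # store the loaded value in dst
    ]


def run(program, regs, memory):
    """Interpret the stack ops against a register dict and a memory dict."""
    stack = []
    for op, *args in program:
        if op == "push_reg":
            stack.append(regs[args[0]])
        elif op == "push_imm":
            stack.append(args[0] & MASK)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append((a * b) & MASK)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & MASK)
        elif op == "load":
            stack.append(memory[stack.pop()])
        elif op == "pop_reg":
            regs[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown op {op}")
    return regs


# Usage: one CISC instruction becomes nine micro-ops.
program = lower_mov_load("ebx", "eax", "edx", 4, 0x12455)
regs = {"eax": 0x1000, "edx": 2, "ebx": 0}
memory = {0x1000 + 2 * 4 + 0x12455: 0xDEADBEEF}
run(program, regs, memory)
```

A single x86 instruction turns into nine tiny operations here, which is why virtualized code is so much longer and harder to read than the original.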
So how do we deal with the obfuscated code?
One way is to convert the bytecode back to x86, after which we can analyze and reverse the code as usual. But this is easier said than done.
So the question becomes: how do we convert the bytecode back to its x86 counterpart?
Two observations point toward a solution:
1. If we had a compiler that accepts the interpreted language at its front end and emits x86 code at its back end, we could simply recompile the bytecode.
2. The obfuscator derives the language for each sample from a template language, so two different samples are likely to share many similarities. Determining the exact syntax of a sample's language is difficult in general, but we can find out whether the language is derived from a particular language family.
Combining the above two observations, we can conclude that in order to beat these protections we need a back end that translates some representation of the template language into x86 code, and a mechanism to generate a compiler front end that is specific to the language accepted by the protected sample.
There are different ways to deobfuscate this kind of protector, but we will explain a six-step approach.
Step-by-step approach to generating a compiler:
1. Reverse engineering the whole virtual machine:
In this step the reverse engineer examines the whole virtual machine and designs an intermediate language that captures the semantics of the VM language, together with a translator that maps VM bytecode instructions to intermediate language instructions. More analysis is usually still required to fully break the protection.
2. Detecting the entry point to the VM:
This is generally a hard problem to solve statically. In practice, however, it is usually easy for a reverse engineer to find the entry point: they know which portions of the code are likely to be protected, and with dynamic analysis they can follow execution to the point where the program jumps into the VM.
3. Producing disassembled code from a protected executable:
In addition to knowing the layout of one instance of the virtualization interpreter, the reverse engineer must know in which aspects two derivations from the template language are the same and in which aspects they differ.
To generate a disassembler, we must recognize the obfuscation in the handlers that take constant parameters, and then extract the sequence of arithmetic operations responsible for de-obfuscating those constants.
Another method is to perform pure symbolic execution on the VM opcode handlers, which yields a representation of each handler as a mathematical function. This lets us ignore the irrelevant details introduced by obfuscation. We can then use a theorem prover to determine whether the function computed by a handler matches one of the previously known handlers.
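The idea of matching a handler against known handlers by what it computes, rather than how it is written, can be sketched as follows. Note that this sketch swaps in a much weaker technique: instead of symbolic execution and a theorem prover, it compares outputs on random and edge-case inputs, which can only suggest equivalence and never prove it. The handler table and the obfuscated handler are hypothetical.

```python
import random

MASK = 0xFFFFFFFF  # 32-bit arithmetic

# Hypothetical reference semantics for handlers we already understand.
KNOWN_HANDLERS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "xor": lambda a, b: a ^ b,
}


def obfuscated_handler(a, b):
    """Stand-in for a recovered handler: a + b written as (a ^ b) + 2*(a & b)."""
    return ((a ^ b) + 2 * (a & b)) & MASK


def identify(handler, trials=1000, seed=0):
    """Return the names of known handlers that agree with `handler` on every sample.

    This is differential testing, not a proof: agreement on all samples
    is only evidence that the two functions are equal.
    """
    rng = random.Random(seed)
    samples = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(trials)]
    samples += [(0, 0), (MASK, MASK), (MASK, 1)]  # a few edge cases
    return [
        name
        for name, reference in KNOWN_HANDLERS.items()
        if all(handler(a, b) == reference(a, b) for a, b in samples)
    ]


matches = identify(obfuscated_handler)  # ['add']
```

A real implementation would replace the sampling with a solver query asking whether any input makes the two functions differ.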
4. Disassembling and converting bytecode to intermediate code:
With a custom disassembler we can disassemble the raw bytecode into VM instructions. But the disassembled code can sometimes be hard to read because of its complicated semantics, so it is convenient to convert each instruction to a simpler language.
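As an illustration of this step, the sketch below disassembles a made-up bytecode format and translates the result into a simpler three-address form. Everything about the format is an assumption: the opcode values, the mnemonics, the one-byte opcode followed by an optional 32-bit little-endian operand, and the r0, r1, ... register naming.

```python
import struct

# Hypothetical opcode table: opcode byte -> (mnemonic, has 32-bit operand)
OPCODES = {
    0x01: ("vpush_imm", True),
    0x02: ("vpush_reg", True),   # operand is a VM register index
    0x03: ("vadd", False),
    0x04: ("vpop_reg", True),    # operand is a VM register index
}


def disassemble(bytecode):
    """Decode raw bytes into a list of (mnemonic, operand) pairs."""
    pc = 0
    instructions = []
    while pc < len(bytecode):
        mnemonic, has_operand = OPCODES[bytecode[pc]]
        pc += 1
        operand = None
        if has_operand:
            (operand,) = struct.unpack_from("<I", bytecode, pc)
            pc += 4
        instructions.append((mnemonic, operand))
    return instructions


def to_ir(instructions):
    """Turn stack-based VM instructions into three-address-style IR strings."""
    ir = []
    stack = []      # symbolic stack holding operand names, not values
    next_temp = 0
    for mnemonic, operand in instructions:
        if mnemonic == "vpush_imm":
            stack.append(hex(operand))
        elif mnemonic == "vpush_reg":
            stack.append(f"r{operand}")
        elif mnemonic == "vadd":
            b, a = stack.pop(), stack.pop()
            temp = f"t{next_temp}"
            next_temp += 1
            ir.append(f"{temp} = {a} + {b}")
            stack.append(temp)
        elif mnemonic == "vpop_reg":
            ir.append(f"r{operand} = {stack.pop()}")
        else:
            raise ValueError(f"unknown mnemonic {mnemonic}")
    return ir


# Usage: r1 = r0 + 0x10 encoded as push/push/add/pop.
code = (
    bytes([0x02]) + struct.pack("<I", 0)
    + bytes([0x01]) + struct.pack("<I", 0x10)
    + bytes([0x03])
    + bytes([0x04]) + struct.pack("<I", 1)
)
listing = disassemble(code)
ir = to_ir(listing)  # ['t0 = r0 + 0x10', 'r1 = t0']
```

The stack traffic disappears in the IR, which is what makes the translated form easier to read than the raw VM listing.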
5. Applying compiler optimizations to the IR:
Now that we have simpler translated code, we can apply compiler optimizations locally to each basic block to transform the intermediate representation into something closer to x86 instructions.
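Two classic local optimizations that fit this step are constant folding and dead-temporary elimination. The sketch below applies both to one basic block. The IR shape (dst, op, a, b), the operation names, and the convention that temporaries are named t0, t1, ... and are not live outside the block are all assumptions made for this example.

```python
MASK = 0xFFFFFFFF  # 32-bit arithmetic

OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "xor": lambda a, b: a ^ b,
    "mul": lambda a, b: (a * b) & MASK,
}


def fold_constants(block):
    """Propagate and fold constants in one block of (dst, op, a, b) tuples.

    For a 'mov', only `a` is used and `b` is None.
    """
    known = {}  # name -> constant value currently held
    out = []
    for dst, op, a, b in block:
        a = known.get(a, a)  # replace names by constants where known
        b = known.get(b, b)
        if op == "mov" and isinstance(a, int):
            value = a
        elif op in OPS and isinstance(a, int) and isinstance(b, int):
            value = OPS[op](a, b)
        else:
            known.pop(dst, None)  # dst no longer holds a known constant
            out.append((dst, op, a, b))
            continue
        known[dst] = value
        out.append((dst, "mov", value, None))
    return out


def is_temp(name):
    return name[0] == "t" and name[1:].isdigit()


def drop_dead_temps(block):
    """Remove assignments to temporaries that are never read afterwards."""
    used = set()
    out = []
    for dst, op, a, b in reversed(block):
        if is_temp(dst) and dst not in used:
            continue  # result is never read: dead code
        used.discard(dst)
        used.update(x for x in (a, b) if isinstance(x, str))
        out.append((dst, op, a, b))
    out.reverse()
    return out


# Usage: an obfuscated constant 0x12000 + 0x455 is collapsed to 0x12455.
block = [
    ("t0", "mov", 0x12000, None),
    ("t1", "add", "t0", 0x455),
    ("t2", "add", "eax", "t1"),
    ("ebx", "mov", "t2", None),
]
optimized = drop_dead_temps(fold_constants(block))
```

After both passes the block shrinks from four instructions to two, with the split constant rejoined into a single immediate.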
6. Generate x86 code:
This is the final step in deobfuscating the protector. Code generation in this application differs in several ways from what a regular compiler does, because we want the output to be as similar as possible to the original, pre-protection code. There may also be cases where no single assembly instruction corresponds to a VM instruction; in such cases multiple x86 instructions are used to implement the VM instruction. We also sometimes have to generate instructions that standard compilers do not typically emit, such as the privileged in and out instructions.
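The case where one VM instruction needs several x86 instructions can be sketched with a small emitter. The IR shape (op, dst, src) and the nor VM instruction are hypothetical examples chosen for illustration. x86 has no single NOR instruction, so it is expanded into or followed by not.

```python
def emit_x86(ir):
    """Translate (op, dst, src) IR tuples into lines of x86 assembly text."""
    asm = []
    for op, dst, src in ir:
        if op in ("mov", "add", "xor"):
            # Direct one-to-one mapping onto an x86 instruction.
            asm.append(f"{op} {dst}, {src}")
        elif op == "nor":
            # No x86 NOR: dst = ~(dst | src) takes two instructions.
            asm.append(f"or {dst}, {src}")
            asm.append(f"not {dst}")
        elif op == "in":
            # Port input, rarely emitted by ordinary compilers.
            asm.append(f"in {dst}, {src}")
        else:
            raise ValueError(f"no x86 lowering for {op}")
    return asm


# Usage
lines = emit_x86([
    ("mov", "eax", "ecx"),
    ("nor", "eax", "ebx"),
    ("in", "eax", "dx"),
])
```

A production code generator would also need register allocation and instruction selection that favours the idioms the original compiler was likely to have produced.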
We discussed how virtualization obfuscators obfuscate code and a step-by-step approach to deobfuscating it. As mentioned earlier, there exist many other approaches to deobfuscating this kind of protector, and a lot of research is ongoing in this field.
- Unpacking Virtualization Obfuscators by Rolf Rolles.