pyREtic - Rich Smith [BlackHat/Defcon 2010]

In Memory Reverse Engineering for Obfuscated Python Bytecode

Dave Aitel

on 4 August 2010


Transcript of pyREtic - Rich Smith [BlackHat/Defcon 2010]

pyREtic: In Memory Reverse Engineering For Obfuscated Python Bytecode
Rich Smith
Questions ?

Why Reverse Python ?
Fun !
Needed to assess the 'security posture' of some closed source Python applications
Available toolsets did not work well, or at all, with obfuscated bytecode
No considerations for anti RE techniques

Anti-Reversing Techniques
Hide in a packager
Change the bytecode magic number
Sourcecode obfuscation
Opcode remapping
Encrypt the marshalled code object
Change the marshalling format
Use a modified runtime

What won't be discussed
No dropping of commercial application source code or discussion of bugs therein

General Trends - the bigger picture
Developer trends:
Moving away from C / C++
Moving towards higher level: Python, Ruby, Lua ....
Distribution trends:
Software as a service
Free applications - pay for subscription
Web Tu - Dot - Oh & "The Cloud"
Reverse Engineering needs to respond
Tools may work at a lower level than desired
Reliance on having access to the binaries
Often approach the problem with a 'C' centric POV

Other side effects:
Everything always in BETA

Less experienced developers doing product code

Time to market & new features more key than ever Their flip-side being: Bugs, Bugs, Bugs !

Large populations of users - often rapidly seeded

Cross-platform / Cross-architecture remote bugs

Why reverse at a higher layer ?
Reversing high-level languages at the C / Assembly layer quickly gets messy
Language specific bugs & flaws
Many layers between coder & code

File types: .py .pyc .pyo .pyd
Compile & Serialise

.py
Source code statements
Human readable
Run on any Python platform

.pyc
The 'standard' serialised form
Created by default anytime a .py is compiled by an operation such as compile()/__import__()
Does not speed up execution, only initialisation
X-Platform but not X-version
Purposely undocumented....

.pyo
Same structure as .pyc
-O have asserts removed
-OO also have the inline documentation removed for smaller file size
In some cases this could break the code e.g. PLY Python Lex/Yacc

.pyd
Complex 'Frozen' format
Created by the CPython script freeze.py
Creates a C based shared object that has the serialised Python objects inside
See AntiFreeze by Aaron Portnoy & Ali Rizvi-Santiago to easily access/modify the Python in .pyd's

Packaging
Optionally take the Python files to a native binary format such as ELF, PE, or Mach-O
Makes distribution easier as user doesn't have to have Python installed
Distributes the Python runtime and dependencies with the application code
Multiple different solutions: py2exe, py2app, cxFreeze, pyInstaller...

Format
4 Byte magic number: Used to do Python version check. Last 2 bytes 0xD,0xA to cause magic number corruption if edited as text
4 Byte timestamp: If this is not equal to the filesystem timestamp on the .py a new .pyc is compiled
Marshalled code object: The serialised code objects

Object Hierarchy
Module: No top level code object
Class: Superclasses in __bases__
Method: im_func holds the method's function object
Function: func_code holds code object
Code:
co_code: A string representing the bytecode
co_consts: A tuple containing constant values referenced by the bytecode
co_name: A string containing the name of the code object
co_filename: A string containing the filename from which this was compiled
co_firstlineno: An integer denoting the line number the object begins at in source
Constants / Variables / Definitions:
[co_argcount, co_nlocals, co_stacksize, co_flags, co_names, co_varnames, co_lnotab, co_freevars, co_cellvars]

Python Bytecode & Opcodes
opcode.py module defines all opcodes & their mnemonic mapping in opcode.opmap dictionary
All opcodes 1 byte: Python 2.6 has 113 defined
Some opcodes take a 2 byte argument, those that do noted by being greater than or equal to the opcode.HAVE_ARGUMENT value (HAVE_ARGUMENT = 90 by default)

Existing Python
reversing tools

Disassemblers
dis.py - a standard Python module
Takes bytecode back to its opcode mnemonics & args
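
As a rough illustration of the disassembly step, the sketch below uses the standard dis and opcode modules on a throwaway function. It is written for Python 3, where a function's code object is reached through __code__ rather than the func_code attribute of the Python 2.x runtimes the talk covers; the names sample and disassemble are made up for this example.

```python
import dis
import opcode


def sample():
    return "bugs"


def disassemble(func):
    """Return (mnemonic, argument) pairs for a function's bytecode."""
    return [(ins.opname, ins.arg) for ins in dis.get_instructions(func)]


code = sample.__code__  # func_code in Python 2.x
listing = disassemble(sample)

# A few of the code object attributes a decompiler relies on.
print(code.co_name, code.co_firstlineno, code.co_consts)

for mnemonic, argument in listing:
    # opcode.opmap maps each mnemonic back to its numeric opcode value.
    print(opcode.opmap.get(mnemonic), mnemonic, argument)
```

The exact mnemonics printed differ between Python versions, which is the same reason the talk notes that compiled bytecode is not cross-version.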
Relies on opcode.py module

Debuggers
pdb.py - a standard Python module
Itself extending bdb.py
Very much a development tool
Assumes .py availability

Bytecode Assemblers & Modifiers
Allow the direct assembly of Python opcodes
Allow the modification of an object's bytecode
e.g. BytePlay, BytecodeAssembler, AntiFreeze

Decompilers
Take .pyc/.pyo back to sourcecode
A few different ones available as both software & online service
Free & commercial
Quality varies, as do supported Python versions
All assume .pyc/.pyo file availability
All expect 'standard Python formats'
e.g. decompyle, UnPyc, depython

Commercial / closed source applications want to develop in Python but not allow access to the source code
A number of different techniques seen in use, often combined
Observed techniques tend to focus on hiding / changing the bytecode on disk (.pyc/.pyo)

General Approach
Remove reliance on access to the marshalled object on disk (.pyc/.pyo)
Let the application undo its own
Get in process at the Python layer
Use runtime object access and interrogation to construct a source code representation

Getting in process
Only .pyc's distributed by vendor
Need to use their interpreter
However, Python import rules still apply .... EASY !
Just replace any of the distributed .pyc's with a .py of the same name
The .py takes precedence because of its newer timestamp
Arbitrary Python code can now be run
[Diagram labels: import.c, foo_module.pyc, foo_module_orig.pyc, our_module.py, foo_module.py]

Magic number fix
If the first 4 bytes of the .pyc have been changed
but nothing else
Simply change them back to a valid version :)
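
A minimal sketch of that fix, assuming only the 4-byte magic number was altered and the rest of the file is standard. The helper names read_magic and fix_magic are hypothetical, and the default value restored here is the magic number of whichever interpreter runs the script (importlib.util.MAGIC_NUMBER); against a real target the value would have to match the Python version the application ships.

```python
import importlib.util
import os
import py_compile
import tempfile


def read_magic(path):
    """Return the first 4 bytes of a compiled file."""
    with open(path, "rb") as handle:
        return handle.read(4)


def fix_magic(path, magic=importlib.util.MAGIC_NUMBER):
    """Restore the 4-byte magic number in place; return True if it was changed."""
    if len(magic) != 4:
        raise ValueError("a magic number is exactly 4 bytes")
    if read_magic(path) == magic:
        return False
    with open(path, "r+b") as handle:  # r+b overwrites without truncating
        handle.write(magic)
    return True


# Demonstration on a throwaway .pyc whose magic number is deliberately broken.
with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "sample.py")
    pyc_path = os.path.join(workdir, "sample.pyc")
    with open(source_path, "w") as handle:
        handle.write("answer = 42\n")
    py_compile.compile(source_path, cfile=pyc_path)
    with open(pyc_path, "r+b") as handle:
        handle.write(b"\x00\x00\x00\x00")
    was_fixed = fix_magic(pyc_path)
    restored = read_magic(pyc_path)

print(was_fixed, restored)
```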
Simple but makes all standard decompilers fail

Non-standard marshalling
Can be a pain in the ass !

Opcode re-re-mapping

pyREtic toolkit
REpdb
opcode_remap
A pure Python implementation of the ideas discussed

Demo !

Clouds on the horizon

Future directions
Wing IDE plugin
Further bug fixes / work on UnPyc
Callgraph to source code GUI
Online service so you can send me bugs

Just use one of the packagers to hide the .pyc/.py files
Often seen with .pyd on Win32

Attempts to hide the logic behind the source
Seen a lot in Javascript and web malware
Never actually seen this used in reality for Python
Example

Original source:

"""
Source obfuscation demo
"""
import sys

class Test:
    """doc strings"""
    def __init__(self, val):
        self.val = val

t = Test(2)

Obfuscator output:

import sys
class fuscat1:
    def __init__(ordemo1,Chri1):

(because you are using old limited demo, some relatively limited-scope local symbols may be omitted):

Test : fuscat1
val : Chri1
t : ordemo2

Looks easy to reverse, but no time spent on it as I've never seen it (or similar techniques) used in real Python applications
Also from the same author !! !! Only $19.99 !!

import.c does a magic
value check when importing
If the magic number is different then import fails

Customise marshal.c to change how objects are serialised

Code objects that go into a .pyc are encrypted before hitting disk

A more complex obfuscation
Seen in some more recent obfuscation attempts
Take the standard opcode value to instruction mappings and rearrange them
opcode.h defines these for the compiled runtime
opcode.py not distributed with the application
Motivation being, even if the bytecode is accessed then relating it to instructions / source is not possible with standard tools - they rely on standard mappings

Combinations & multiple different techniques used together
Many degrees of freedom allowing almost any degree of obfuscation or encryption

Require work at the C layer to reverse the operations that were being done
Wanted to keep the solution as generic as possible
C-Layer reversing to see how the marshalling operates requires grunt work for each app
RE at Python layer means the unmarshalling is blindly done for you

For runtime analysis and decompilation the opcode mapping must be understood
It is this understanding that vendors employing remapping attempt to break
A purely Pythonic way to derive the new opcode mapping is through a fairly simple
'known plaintext' attack

Python has a pretty simple instruction format
All instructions : 1 byte
Arguments : 2 bytes (optional)

Python source:        print "bugs"
Python instructions:  0 LOAD_CONST 0 ('bugs')

Python bytecode       Remapped bytecode
0x64 0x0, 0x0         0x44 0x0, 0x0
0x64 0x1, 0x0         0x44 0x1, 0x0
0x53                  0x23

Resulting mapping:
0x64 -> 0x44
0x53 -> 0x23
Arguments stay the same

Remapped bytecode from known Python sourcecode can be diffed against standard bytecode to produce the new mapping

Step-by-step remapping
Find the version of Python the packaged application is using
With an unmodified Python runtime compile the standard library
modules to the .pyb format as a reference set
From in-process with the obfuscated runtime create .pyb files
For the standard library modules for which reference .pyb's are
Diff the two sets of .pyb's noting the bytes that have changed
Use the changed bytes to create a new opcode.py
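
The diffing step can be sketched as follows. This is an illustration under stated assumptions, not the pyREtic implementation: it works on raw bytecode strings in the Python 2.x layout described in the talk (1 byte opcode, optional 2 byte argument), it assumes only opcode values were remapped, and derive_opcode_map is a hypothetical name. The sample bytes are the ones from the slide.

```python
HAVE_ARGUMENT = 90  # CPython 2.x: opcodes >= this value carry a 2-byte argument


def derive_opcode_map(reference, remapped, have_argument=HAVE_ARGUMENT):
    """Diff two bytecode strings compiled from the same source.

    reference comes from an unmodified runtime, remapped from the obfuscated
    one. Returns {standard opcode: remapped opcode}.
    """
    if len(reference) != len(remapped):
        raise ValueError("bytecode lengths differ; same source and version?")
    mapping = {}
    offset = 0
    while offset < len(reference):
        standard, obfuscated = reference[offset], remapped[offset]
        # The same standard opcode must always map to the same new value.
        if mapping.setdefault(standard, obfuscated) != obfuscated:
            raise ValueError("inconsistent mapping for opcode %#x" % standard)
        # Instruction width is decided by the known, standard opcode.
        width = 3 if standard >= have_argument else 1
        if reference[offset + 1:offset + width] != remapped[offset + 1:offset + width]:
            raise ValueError("arguments differ at offset %d" % offset)
        offset += width
    return mapping


reference = bytes([0x64, 0x00, 0x00, 0x64, 0x01, 0x00, 0x53])
remapped = bytes([0x44, 0x00, 0x00, 0x44, 0x01, 0x00, 0x23])
mapping = derive_opcode_map(reference, remapped)
print({hex(old): hex(new) for old, new in mapping.items()})
```

Only opcodes that occur in the reference set are recovered, which is why the steps above diff the whole standard library rather than a single module.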
Use the new opcode.py for disassembly / decompilation

The pyREtic toolkit includes this functionality all prepackaged with other details already sorted out

Gotchas
Creation of .pyb files from different versions of Python for obfuscated and reference will mean a much lower remap hit-rate as the bytecode produced won't be 1:1 in all cases

.pyb file format
A file format created for this task to ease remapping when other obfuscations also in place
Removes the dependence on standard marshalling
An arbitrary raw bytecode format
The code objects from each of a module's applicable objects are just dumped to disk in the alphabetical order they appear in __dict__

Came from the real need to
assess Python apps for bugs

In memory decompilation vs static .pyc decompilation

3 main areas

In-memory decompilation & reconstruction
Using a modified & extended UnPyc as a basis for the decompiler
http://unpyc.sourceforge.net
Free & oo enough to allow easy modification
Made more 'fault tolerant'
Fixed a few other bugs
All changes will be sent to the developer & the modified version released with pyREtic
Not perfect, but good enough

Convenience Features

3 types of in-memory decompilation:
Filesystem traversal, module object decompile
Filesystem traversal, code object decompile
Object traversal, code object decompile
Each works better against different obfuscation situations
Differentiated by the objects that are being decompiled and the way that objects are inter-related

Filesystem traversal, module object decompile
Methodology
Walks the filesystem structure of .pyc's, reads modules found
Uses the obfuscated runtime's marshal to blindly unmarshal
Passes module bytecode stream to decompiler
Use when
Access to obfuscated .pyc's on disk
Non-standard marshalling on disk BUT marshal.dump() available at runtime
Source code quality analogous to that of the decompiler's normal usage
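
The 'blind unmarshal' idea can be sketched like this: code running inside the target interpreter asks that interpreter's own marshal module to load the code object, so any change to the marshalling format is undone for free. The sketch is an assumption-laden illustration rather than pyREtic code. The function names are hypothetical, and the header sizes are those of stock CPython (8 bytes in 2.x as described earlier in the talk, 12 in 3.3 to 3.6, 16 from 3.7); a vendor runtime may differ.

```python
import marshal
import os
import py_compile
import sys
import tempfile


def pyc_header_size(version=sys.version_info):
    """Bytes to skip before the marshalled code object in a stock CPython .pyc."""
    if version >= (3, 7):
        return 16  # magic, flags, timestamp (or hash), source size
    if version >= (3, 3):
        return 12  # magic, timestamp, source size
    return 8       # magic, timestamp


def load_code_from_pyc(path, header_size=None):
    """Let the running interpreter's marshal rebuild the module code object."""
    if header_size is None:
        header_size = pyc_header_size()
    with open(path, "rb") as handle:
        handle.seek(header_size)
        return marshal.load(handle)


# Demonstration against a freshly compiled throwaway module.
with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "sample.py")
    pyc_path = os.path.join(workdir, "sample.pyc")
    with open(source_path, "w") as handle:
        handle.write("answer = 6 * 7\n")
    py_compile.compile(source_path, cfile=pyc_path)
    code = load_code_from_pyc(pyc_path)

namespace = {}
exec(code, namespace)
print(code.co_name, namespace["answer"])
```

The resulting code object is what would then be handed to the decompiler.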
Filesystem traversal, code object decompile
Methodology
Walks the filesystem structure of .pyc's, imports modules found
Traverse the objects of a module, and access code objects of relevant objects
Passes multiple code objects to decompiler
Reconstructs the source code structure of top level objects & attributes
Use when
Access to obfuscated .pyc's on disk
Non-standard marshalling on disk and marshal module NOT available at runtime
Source code quality analogous to that of the decompiler's normal usage on code objects, less for top level objects

Object traversal, code object decompile
Methodology
Walks instantiated objects and finds related
Accesses code objects of objects found
Passes multiple code objects to decompiler
Reconstructs the source code structure of top level objects & attributes
Use when
No access to .pyc's, or even disk at all
Good for cloud / webservice
Source code quality analogous to that of the decompiler's normal usage on code objects, less for top level objects
Can only decompile objects that have been instantiated - fewer objects decompile
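
A minimal sketch of object traversal, assuming Python 3 attribute names (__code__ where the talk's Python 2.x runtimes use func_code and im_func). It walks a live module or class and collects every code object it can reach; collect_code_objects and the Sample class are hypothetical names for this illustration, and handing the results to a decompiler is left out.

```python
import inspect
import types


def collect_code_objects(owner, prefix="", _seen=None):
    """Map dotted names to code objects reachable from a live module or class."""
    seen = _seen if _seen is not None else set()
    found = {}
    for name, obj in sorted(vars(owner).items()):
        if isinstance(obj, (staticmethod, classmethod)):
            obj = obj.__func__  # unwrap to the underlying function
        if inspect.isfunction(obj):
            found[prefix + name] = obj.__code__
        elif inspect.isclass(obj) and id(obj) not in seen:
            seen.add(id(obj))  # guard against classes that reference each other
            found.update(collect_code_objects(obj, prefix + name + ".", seen))
    return found


class Sample:
    def method(self):
        return 1

    @staticmethod
    def helper():
        return 2


def top():
    return 3


# Stand-in for a module that is already imported in the target process.
module = types.ModuleType("live_demo")
module.Sample = Sample
module.top = top

codes = collect_code_objects(module)
for dotted_name in codes:
    print(dotted_name, codes[dotted_name].co_firstlineno)
```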
Implements the generation of reference .pyb files
Generate equivalent obfuscated .pyb's
Diff and get new opcode
Generate new opcode.py and opcodes.py [for UnPyc]

Command-line access to pyREtic
Based on pdb [ therefore bdb ]
Focused on bug finding & rapid assessment of application
Lots of convenience features & third party tool tie in: callgraphing, docstring greps etc, exechunt

Module/Call graphing
pylint - bad code is buggy code!
Dangerous function search - exec, system() ..
Obfuscation detection
__doc__ string 'grep' - 'ToDo's & FixMe's' etc
Project support

Runtime retrieval of source code may be increasingly useful in future
Python used as an access and container language for many 'cloud' environments
Approaches discussed just as applicable
for other reflective languages

Bytecode Decompilation vs Sourcecode Reconstruction
The in-memory 'decompilation' used by pyREtic only uses the UnPyc decompiler for part of the task
Python decompilers require a code object to operate upon to produce source
Python module objects do not have a code object
'A module object does not contain the code object used to initialize the module (since it isn’t needed once the initialization is done).' Python documentation: Section 3 - Data Model
This means once imported into a running Python thread they cannot be 'decompiled' using the normal methods
Instead 'source code reconstruction' is performed

Source code reconstruction
Runtime analysis and interrogation of objects to get an insight into their purpose and relationships
Only has to be done for module level objects
Classes, functions, methods, generators within a module object have code objects which can be truly decompiled

Because reconstruction occurs at runtime AFTER object instantiation the initial state of an object may be unknown, only its current state is known e.g.:

import os
from socket import socket

def test_func(first, *args, **kwargs ):
    """a doc string"""
    return first + 5

foo = 9
bar = test_func(3)
foo = 10

Where each part of the example is recovered from:
import os / from socket import socket: from __dict__ & querying to see if the string matches the module name
def test_func(first, *args, **kwargs): from test_func.func_code.co_argcount, co_varnames & co_flags
doc string: from test_func.__doc__
return first + 5: from decompiling test_func.func_code.co_code
foo, bar: from __dict__

From the context of a running Python:
Import a module
Query the namespace of the module which resides in __dict__
For attributes that do not have code objects query their values
For attributes that do have code objects, access the bytecode and decompile, query other attributes & try to determine their initial state

What reconstruction can't get
Initial values of reassigned variables - foo's value of 9 will not be known, only its value of 10
Calls to functions, only the returned values - bar will be 8 not test_func(3)
Functions that have been imported with a 'from' construct will appear as top level functions, classes can be distinguished via their __module__ attribute
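
These limits are easy to reproduce. The sketch below, written for Python 3, loads the example module from a source string and then looks only at its live namespace, the way an in-process tool would. load_module and reconstruct are hypothetical helpers, and the function body uses first + 5 so that bar evaluates to 8 as the slide states.

```python
import types

SOURCE = '''
import os
from socket import socket

def test_func(first, *args, **kwargs):
    """a doc string"""
    return first + 5

foo = 9
bar = test_func(3)
foo = 10
'''


def load_module(name, source):
    """Execute source in a fresh module object, as an import would."""
    module = types.ModuleType(name)
    exec(compile(source, name + ".py", "exec"), module.__dict__)
    return module


def reconstruct(module):
    """Sort a live module's namespace into imports, definitions and values."""
    imports, definitions, values = {}, {}, {}
    for name, obj in sorted(vars(module).items()):
        if name.startswith("__"):
            continue  # skip interpreter bookkeeping such as __name__
        if isinstance(obj, types.ModuleType):
            imports[name] = obj.__name__
        elif isinstance(obj, (types.FunctionType, type)):
            if obj.__module__ != module.__name__:
                imports[name] = obj.__module__  # arrived via 'from X import name'
            else:
                definitions[name] = obj  # has code objects: can be decompiled
        else:
            values[name] = obj  # only the current value is visible
    return imports, definitions, values


module = load_module("reconstruct_demo", SOURCE)
imports, definitions, values = reconstruct(module)
print(imports)
print(sorted(definitions), definitions["test_func"].__doc__)
print(values)
```

The values dictionary holds foo as 10 and bar as 8: the earlier assignment of 9 and the call test_func(3) leave no trace in the namespace.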
Imports using 'as' will not be distinguishable

Not perfect, but good enough in many cases !

Overflows aren't the only bugs !

Initial aims:
Allow rapid assessment and targeting of areas of likely bugs
Get back to a source representation
A general approach against protections used today

Python language 101

Lawyers don't seem to agree that what happens in Vegas stays in Vegas :(

Every language has its own quirks and easy to make, but non-obvious, mistakes. e.g.

Class vs Instance attributes:

class Test:
    var = []    # class attribute, shared by every instance
    def __init__(self, x):
        # body lost in the transcript; an append is implied by the output shown
        self.var.append(x)

foo = Test(10)
bar = Test(20)

print foo.var, bar.var
>> [10, 20] [10, 20]

Mutable default arguments:

def call_me(args_in = []):
    # append lost in the transcript; implied by the output shown
    args_in.append("foo")
    print args_in

call_me(["a", "b", "c"])
>> ['a', 'b', 'c', 'foo']
call_me()
>> ['foo']
call_me()
>> ['foo', 'foo']

default argument bound when the function object is created, not each time it is called

Conclusions
Development & distribution landscape changing
Need to evolve to maintain ROI
Python Anti-RE techniques being increasingly seen
Moving from static decompilation to in-memory & source
code reconstruction gives good leverage against them
pyREtic toolkit that pulls together the concepts into a POC toolkit to find bugs using these techniques for people to explore further [rich@immunityinc.com]

Lots of attack surface area in Python applications but not a huge amount of existing work
...and maybe there hasn't needed to be, but this is changing

ROI
Bug Snobs! If it gets you in, it gets you in
Lots of low hanging fruit in unexplored areas
Not everyone is a Nico ....

More secure than ever ?
Depends on your metrics:
Lines of code
People who think they can code
Pervasiveness of technology
Closer to the information