mirror of
https://github.com/JHUAPL/CodeCut.git
synced 2026-01-09 13:28:06 -05:00
143 lines
7.0 KiB
Plaintext
143 lines
7.0 KiB
Plaintext
##############################################################################################
|
|
# Copyright 2019 The Johns Hopkins University Applied Physics Laboratory LLC
|
|
# All rights reserved.
|
|
# Permission is hereby granted, free of charge, to any person obtaining a copy of this
|
|
# software and associated documentation files (the "Software"), to deal in the Software
|
|
# without restriction, including without limitation the rights to use, copy, modify,
|
|
# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
|
|
# permit persons to whom the Software is furnished to do so.
|
|
#
|
|
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
|
|
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
|
|
# PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
|
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
|
|
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
|
|
# OR OTHER DEALINGS IN THE SOFTWARE.
|
|
#
|
|
# HAVE A NICE DAY.
|
|
|
|
#################################################################
|
|
#### CodeCut - Detecting Object File Boundaries in IDA Pro ####
|
|
#################################################################
|
|
|
|
**** Terminology ****
|
|
|
|
I tend to use the term "module" for a set of related functions within a binary
|
|
that came from a single object file. So you will see the terms "module" and
|
|
"object file" used interchangeabley in the CC source and documentation.
|
|
|
|
**** Dependencies ****
|
|
|
|
CodeCut relies on:
|
|
Natural Language Toolkit (NLTK) - https://www.nltk.org
|
|
Snap.py - https://snap.stanford.edu/snappy/
|
|
|
|
**** Source Files ****
|
|
|
|
cc_main.py - Main entry point - simply load this up with the
|
|
"File -> Script file..." option in IDA.
|
|
|
|
lfa.py - Analysis engine for LFA.
|
|
|
|
mc.py - Analysis engine for MaxCut.
|
|
|
|
basicutils_7x.py - Provides an API to IDA - maybe one day we'll get this
|
|
ported to Ghidra!
|
|
|
|
map_read.py - For research purposes - compares a ground truth .map
|
|
file (from ld) to a .map file from CC and produces
|
|
a score. See RECON slides or the code itself for more
|
|
info. You need to add the option -Map=<target>.map to
|
|
the linker options in a Makefile to get a .map file.
|
|
|
|
The syntax to map_read is:
|
|
python map_read.py <ground truth file> <CC map file>
|
|
|
|
**** MaxCut Parameters ****
|
|
|
|
- Right now there is only one parameter for MaxCut, a value for the maximum
|
|
module size (currently set to 16K).
|
|
|
|
|
|
**** LFA Parameters & Interpolation ****
|
|
|
|
A couple areas for research:
|
|
|
|
- The idea behind LFA is that we throw out "external" calls - we can't
|
|
determine this exactly in a binary so we throw out calls that are above a
|
|
certain threshold. This is set to 4K in the code but it could be tweaked.
|
|
|
|
- There is a threshold set for edge detection - plus a little bit of extra
|
|
logic (value has to be positive and 2 of last 3 values were negative). You
|
|
can either vary this threshold or write your own edge_detect() function.
|
|
|
|
- Currently "calls to" affinity and "calls from" affinity are treated as
|
|
separate scores. If one of these scores is zero an interpolation from
|
|
the previous score is used - just a simple linear equation assuming
|
|
decreasing scores. This could be improved a number of ways but could
|
|
be replaced with an actual interpolation between scores.
|
|
|
|
- If both "calls to" affinity and "calls from" affinity for a function are 0
|
|
the function is skipped and is essentially treated like it's not there.
|
|
This happens for functions with no references or where all references are
|
|
above the "external" threshold. This means there can be gaps between the
|
|
modules in the output list.
|
|
|
|
- The portion of code that tries to name object files based on common strings
|
|
is completely researchy and open ended. Lots of things to play with there.
|
|
|
|
**** Output Files ****
|
|
|
|
CodeCut produces 7 files:
|
|
|
|
<target>_cc_results.csv - Raw score output from LFA and MaxCut, including where
|
|
edges are detected. Graphs can fairly easily be
|
|
generated in your favorite spreadsheet program.
|
|
|
|
<target>_{lfa,mc}_labels.py - Script that can be used to label your DB with CC's
|
|
output. After determining module boundaries, CC
|
|
attempts to guess the name (fun!) by looking at
|
|
common strings used by the module, for both the
|
|
LFA and MaxCut module lists. You can use this
|
|
script as a scratchpad to name unnamed modules as you
|
|
determine what they are, or you can also use other
|
|
functions in basicutils to change module names later.
|
|
|
|
<target>_{lfa,mc}_map.map - A .map file similar to the output from the ld. This is
|
|
for the purposes of comparing to a ground truth .map
|
|
file to test CC when you have source code.
|
|
|
|
<target>_{lfa,mc}_mod_graph.gv - a Graphviz graph file of the module relationships
|
|
This is a directed graph where a -> b indicates
|
|
that a function in module a calls a function in
|
|
module b. This may take a long time to render if
|
|
you have a large binary (more than a couple
|
|
hundred modules detected). For smaller binaries
|
|
this can pretty clearly communicate the software
|
|
architecture immediately. For larger binaries
|
|
this will show you graphically the most heavily
|
|
used modules in the binary.
|
|
|
|
You can use sfdp to render the graph into a PNG file with a command line like:
|
|
|
|
sfdp -x -Goverlap=scale -Tpng -Goutputorder=edgesfirst -Nstyle=filled -Nfillcolor=white <target>_lfa_mod_graph.gv > <target>.png
|
|
|
|
**** "Canonical" Names ****
|
|
NOTE on IDA and Canonical Names:
|
|
AFAICT IDA doesn't really have a concept of source file / object files in
|
|
the database (it does with source-level debugging but that's it I think).
|
|
In my ideal world, I'd write a nice GUI plugin to manage the object file
|
|
names and regions, and then you'd be able to select how to display object/
|
|
function names in the disassembly. For now though I have to save both the
|
|
object name and function name in the filename.
|
|
|
|
For now, my hacky workaround is to name modules and functions in camel case
|
|
(e.g. ReadNetworkString, or HtmlParsingEngine), and then combine them together
|
|
in a nasty snake case "canonical" format, that looks like:
|
|
|
|
<ObjectName>_<FunctionName>_<Address>
|
|
|
|
That way I can parse out function and object names to be able to rename
|
|
objects. I am open to suggestions on better ways to do this.
|
|
|