mirror of
https://github.com/JHUAPL/CodeCut.git
synced 2026-01-08 21:07:58 -05:00
23c4412561d6e8153290fb075f531de2b7d940c2
##############################################################################################
# Copyright 2019 The Johns Hopkins University Applied Physics Laboratory LLC
# All rights reserved.
# Permission is hereby granted, free of charge, to any person obtaining a copy of this
# software and associated documentation files (the "Software"), to deal in the Software
# without restriction, including without limitation the rights to use, copy, modify,
# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
# PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
# OR OTHER DEALINGS IN THE SOFTWARE.
#
# HAVE A NICE DAY.
#################################################################
#### CodeCut - Detecting Object File Boundaries in IDA Pro ####
#################################################################
**** Terminology ****
I tend to use the term "module" for a set of related functions within a binary
that came from a single object file. So you will see the terms "module" and
"object file" used interchangeably in the CC source and documentation.
**** Dependencies ****
CodeCut relies on:
Natural Language Toolkit (NLTK) - https://www.nltk.org
Snap.py - https://snap.stanford.edu/snappy/
**** Source Files ****
cc_main.py        - Main entry point - simply load this up with the
                    "File -> Script file..." option in IDA.
lfa.py            - Analysis engine for LFA.
mc.py             - Analysis engine for MaxCut.
basicutils_7x.py  - Provides an API to IDA - maybe one day we'll get this
                    ported to Ghidra!
map_read.py       - For research purposes - compares a ground truth .map
                    file (from ld) to a .map file from CC and produces
                    a score. See RECON slides or the code itself for more
                    info. You need to add the option -Map=<target>.map to
                    the linker options in a Makefile to get a .map file.
                    The syntax to map_read is:
                    python map_read.py <ground truth file> <CC map file>
**** MaxCut Parameters ****
- Right now there is only one parameter for MaxCut, a value for the maximum
module size (currently set to 16K).
**** LFA Parameters & Interpolation ****
A couple areas for research:
- The idea behind LFA is that we throw out "external" calls - we can't
determine this exactly in a binary so we throw out calls whose distance is
above a certain threshold. This is set to 4K in the code but it could be tweaked.
- There is a threshold set for edge detection - plus a little bit of extra
logic (value has to be positive and 2 of last 3 values were negative). You
can either vary this threshold or write your own edge_detect() function.
- Currently "calls to" affinity and "calls from" affinity are treated as
separate scores. If one of these scores is zero an interpolation from
the previous score is used - just a simple linear equation assuming
decreasing scores. This could be improved in a number of ways; for example,
it could be replaced with an actual interpolation between scores.
- If both "calls to" affinity and "calls from" affinity for a function are 0
the function is skipped and is essentially treated like it's not there.
This happens for functions with no references or where all references are
above the "external" threshold. This means there can be gaps between the
modules in the output list.
- The portion of code that tries to name object files based on common strings
is completely researchy and open ended. Lots of things to play with there.
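
As a starting point for experimenting with the items above, here is a minimal
sketch of an edge detector and a gap-filling interpolation. It is not the code
in lfa.py. The threshold value, the choice to apply the threshold to the jump
from the previous score, and the name interpolate_zeros() are all assumptions
made for illustration.

```python
from typing import List

# Hypothetical value, not taken from lfa.py.
EDGE_THRESHOLD = 2.0


def edge_detect(scores: List[float], i: int,
                threshold: float = EDGE_THRESHOLD) -> bool:
    """Return True if scores[i] looks like the start of a new module.

    Assumed rule: the current value is positive, at least 2 of the
    previous 3 values were negative, and the jump from the previous
    score exceeds the threshold.
    """
    if i < 3:
        return False  # not enough history to look back 3 values
    current = scores[i]
    negatives = sum(1 for s in scores[i - 3:i] if s < 0)
    jump = current - scores[i - 1]
    return current > 0 and negatives >= 2 and jump > threshold


def interpolate_zeros(scores: List[float]) -> List[float]:
    """Fill runs of zero scores by linear interpolation between the
    nearest non-zero neighbours. Runs touching either end of the list
    are left alone because they have a neighbour on one side only."""
    out = list(scores)
    n = len(out)
    i = 0
    while i < n:
        if out[i] != 0:
            i += 1
            continue
        start = i
        while i < n and out[i] == 0:
            i += 1
        end = i  # index of first non-zero value after the run, or n
        if start == 0 or end == n:
            continue
        left, right = out[start - 1], out[end]
        span = end - (start - 1)
        for k in range(start, end):
            out[k] = left + (right - left) * (k - (start - 1)) / span
    return out
```

You would call interpolate_zeros() once on the "calls to" series and once on
the "calls from" series, then run edge_detect() over the combined score.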
**** Output Files ****
CodeCut produces 7 files:
<target>_cc_results.csv - Raw score output from LFA and MaxCut, including where
                          edges are detected. Graphs can fairly easily be
                          generated in your favorite spreadsheet program.
<target>_{lfa,mc}_labels.py - Script that can be used to label your DB with CC's
                              output. After determining module boundaries, CC
                              attempts to guess the name (fun!) by looking at
                              common strings used by the module, for both the
                              LFA and MaxCut module lists. You can use this
                              script as a scratchpad to name unnamed modules as you
                              determine what they are, or you can also use other
                              functions in basicutils to change module names later.
<target>_{lfa,mc}_map.map - A .map file similar to the output from ld. This is
                            for the purposes of comparing to a ground truth .map
                            file to test CC when you have source code.
<target>_{lfa,mc}_mod_graph.gv - A Graphviz graph file of the module relationships.
                                 This is a directed graph where a -> b indicates
                                 that a function in module a calls a function in
                                 module b. This may take a long time to render if
                                 you have a large binary (more than a couple
                                 hundred modules detected). For smaller binaries
                                 this can pretty clearly communicate the software
                                 architecture immediately. For larger binaries
                                 this will show you graphically the most heavily
                                 used modules in the binary.
You can use sfdp to render the graph into a PNG file with a command line like:
sfdp -x -Goverlap=scale -Tpng -Goutputorder=edgesfirst -Nstyle=filled -Nfillcolor=white <target>_lfa_mod_graph.gv > <target>.png
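
To make the a -> b edge semantics concrete, here is a small sketch of how a
module graph in Graphviz DOT syntax can be built from function-level calls.
It is an illustration only, not how CC writes its .gv files. The function
module_graph_dot(), its inputs, and the choice to drop calls within a single
module are assumptions.

```python
from typing import Dict, Iterable, Set, Tuple


def module_graph_dot(calls: Iterable[Tuple[str, str]],
                     func_to_module: Dict[str, str]) -> str:
    """Build DOT text with one edge a -> b for every pair of modules
    where some function in module a calls some function in module b."""
    edges: Set[Tuple[str, str]] = set()
    for caller, callee in calls:
        src = func_to_module.get(caller)
        dst = func_to_module.get(callee)
        # Skip functions with no module, and (assumption) calls that
        # stay inside one module.
        if src is None or dst is None or src == dst:
            continue
        edges.add((src, dst))

    lines = ["digraph modules {"]
    for src, dst in sorted(edges):
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    calls = [("ReadNetworkString", "ParseHtml"), ("ParseHtml", "ParseTag")]
    modules = {
        "ReadNetworkString": "Network",
        "ParseHtml": "HtmlParsingEngine",
        "ParseTag": "HtmlParsingEngine",
    }
    print(module_graph_dot(calls, modules))
```

The resulting text can be saved to a .gv file and rendered with the sfdp
command above.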
**** "Canonical" Names ****
NOTE on IDA and Canonical Names:
AFAICT IDA doesn't really have a concept of source file / object files in
the database (it does with source-level debugging but that's it I think).
In my ideal world, I'd write a nice GUI plugin to manage the object file
names and regions, and then you'd be able to select how to display object/
function names in the disassembly. For now though I have to save both the
object name and function name in the function's IDA name.
My hacky workaround is to name modules and functions in camel case
(e.g. ReadNetworkString, or HtmlParsingEngine), and then combine them together
in a nasty snake case "canonical" format that looks like:
<ObjectName>_<FunctionName>_<Address>
That way I can parse out function and object names to be able to rename
objects. I am open to suggestions on better ways to do this.
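
For illustration, building and parsing the canonical format could look roughly
like the sketch below. This is not the code in basicutils. The helper names are
hypothetical, and it assumes the address is written in hex without a prefix and
that camel case names contain no underscores.

```python
import re
from typing import NamedTuple, Optional


class CanonicalName(NamedTuple):
    object_name: str
    function_name: str
    address: int


# Assumption: camel case parts have no underscores, address is bare hex.
_CANONICAL_RE = re.compile(r"^([A-Za-z0-9]+)_([A-Za-z0-9]+)_([0-9A-Fa-f]+)$")


def make_canonical(object_name: str, function_name: str, address: int) -> str:
    """Combine the parts into <ObjectName>_<FunctionName>_<Address>."""
    return f"{object_name}_{function_name}_{address:X}"


def parse_canonical(name: str) -> Optional[CanonicalName]:
    """Split a canonical name back into its parts, or return None if
    the name does not follow the three-part format."""
    m = _CANONICAL_RE.match(name)
    if m is None:
        return None
    obj, func, addr = m.groups()
    return CanonicalName(obj, func, int(addr, 16))


def rename_object(name: str, new_object_name: str) -> str:
    """Swap the object part of a canonical name, keeping the function
    name and address. Non-canonical names are returned unchanged."""
    parsed = parse_canonical(name)
    if parsed is None:
        return name
    return make_canonical(new_object_name, parsed.function_name, parsed.address)
```

Keeping underscores out of the object and function parts is what makes the
split unambiguous, which is why the camel case convention matters here.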