##############################################################################################
# Copyright 2018 The Johns Hopkins University Applied Physics Laboratory LLC
# All rights reserved.
# Permission is hereby granted, free of charge, to any person obtaining a copy of this 
# software and associated documentation files (the "Software"), to deal in the Software 
# without restriction, including without limitation the rights to use, copy, modify, 
# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 
# permit persons to whom the Software is furnished to do so.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR 
# PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE 
# OR OTHER DEALINGS IN THE SOFTWARE.
#
# HAVE A NICE DAY.

#################################################################################
####  Local Function Affinity - Detecting Object File Boundaries in IDA Pro  ####
#################################################################################

**** Terminology ****

I tend to use the term "module" for a set of related functions within a binary
that came from a single object file.  So you will see the terms "module" and
"object file" used interchangeabley in the LFA source and documentation.

**** Dependencies ****

The only dependency is Natural Language Toolkit (NLTK) - which you can get
from https://www.nltk.org. 

**** Source Files ****

lfa.py -                 Main program - written to run as a script from IDA
                         Pro.  By default, will analyze either the .text
                         section or the full database, and output the files
                         listed in the "Output Files" section. *** Make sure to
                         set IDA_VERSION before you start! ***

basicutils<_6x,_7x>.py - provides a version-agnostic API to IDA.  You need to
                         set a definition in lfa.py to select which version
                         of IDA you have.

map_read.py -            For research purposes - compares a ground truth .map
                         file (from ld) to a .map file from LFA and produces
                         a score.  See RECON slides or the code itself for more
                         info.  You need to add the option -Map=<target>.map to
                         the linker options in a Makefile to get a .map file.

                   The syntax to map_read is:
                         python map_read.py <ground truth file> <LFA map file>

**** LFA Parameters, Interpolation, and Output ****

A couple areas for research:

  - The idea behind LFA is that we throw out "external" calls - we can't 
    determine this exactly in a binary so we throw out calls that are above a 
    certain threshold.  This is set to 4K in the code but it could be tweaked.

  - There is a threshold set for edge detection - plus a little bit of extra
    logic (value has to be positive and 2 of last 3 values were negative). You
    can either vary this threshold or write your own edge_detect() function.

  - Currently "calls to" affinity and "calls from" affinity are treated as
    separate scores.  If one of these scores is zero an interpolation from
    the previous score is used - just a simple linear equation assuming
    decreasing scores.  This could be improved a number of ways but could
    be replaced with an actual interpolation between scores.

  - If both "calls to" affinity and "calls from" affinity for a function are 0
    the function is skipped and is essentially treated like it's not there.
    This happens for functions with no references or where all references are
    above the "external" threshold.  This means there can be gaps between the
    modules in the output list.

  - The portion of code that tries to name object files based on common strings
    is completely researchy and open ended.  Lots of things to play with there.

**** Output Files ****

LFA produces 4 files:

<target>_lfa_results.csv - Raw score output from LFA, including where edges 
                           are detected.  Graphs can fairly easily be 
                           generated in your favorite spreadsheet program.

<target>_lfa_labels.py - Script that can be used to label your DB with LFA's 
			 output.  After determining module boundaries, LFA
                         attempts to guess the name (fun!) by looking at
                         common strings used by the module.  Y can use this
                         script as a scratchpad to name unnamed modules as you
                         determine what they are, or you can also use other
			 functions in basicutils to change module names later.

<target>_lfa_map.map - A .map file similar to the output from the ld.  This is
                       for the purposes of comparing to a ground truth .map
                       file to test LFA when you have source code.

<target>_lfa_mod_graph.gv - a Graphviz graph file of the module relationships
                            This is a directed graph where a -> b indicates
                            that a function in module a calls a function in
                            module b.  This may take a long time to render if
                            you have a large binary (more than a couple
                            hundred modules detected).  For smaller binaries
                            this can pretty clearly communicate the software
                            architecture immediately.  For larger binaries
                            this will show you graphically the most heavily
                            used modules in the binary.

You can use sfdp to render the graph into a PNG file with a command line like:

sfdp -x -Goverlap=scale -Tpng -Goutputorder=edgesfirst -Nstyle=filled -Nfillcolor=white <target>_lfa_mod_graph.gv > <target>.png

**** "Canonical" Names ****
NOTE on IDA and Canonical Names:
AFAICT IDA doesn't really have a concept of source file / object files in
the database (it does with source-level debugging but that's it I think).
In my ideal world, I'd write a nice GUI plugin to manage the object file
names and regions, and then you'd be able to select how to display object/
function names in the disassembly.  For now though I have to save both the
object name and function name in the filename.

For now, my hacky workaround is to name modules and functions in camel case
(e.g. ReadNetworkString, or HtmlParsingEngine), and then combine them together
in a nasty snake case "canonical" format, that looks like:

<ObjectName>_<FunctionName>_<Address>

That way I can parse out function and object names to be able to rename
objects.  I am open to suggestions on better ways to do this.
			
Description
No description provided
Readme 12 MiB
Latest
2024-12-04 14:11:34 -05:00
Languages
Java 78%
Python 21.2%
CSS 0.6%
HTML 0.2%