Comprehensive guide for reverse engineering binaries. Choose the appropriate techniques based on what the user needs:
- Full reconstruction: Complete, compilable source code
- Function analysis: Understand how specific parts work
- Protocol documentation: API specs and wire formats
- Version comparison: What changed between releases
- Deepening existing RE: Fill gaps in partial reconstructions
Argument: $ARGUMENTS — path to the binary (or .gz-compressed binary), and optionally the target language (auto-detected if omitted).
The binary is ground truth. Every decision must be traceable to binary evidence—strings, symbols, disassembly, or runtime behavior. Your opinions about how code "should" look are irrelevant when they contradict what's in the binary.
Approach: Start with Phase 1 (census) for context, then select techniques from subsequent phases based on scope. For full reconstruction, work through all phases. For focused tasks (single function, protocol docs, etc.), use Phase 0.5 techniques to zero in on the target.
Install analysis tools as needed. Prefer what's available; install what's missing.
bash
# Check what's available
which readelf objdump strings nm file ldd strace ltrace 2>/dev/null

# Install essentials if missing
apt-get update && apt-get install -y binutils file strace ltrace

# For deeper analysis (install if needed for complex binaries)
# apt-get install -y ghidra radare2
# pip install capstone ropper
If the binary is compressed (.gz, .xz, .zst), decompress it to a temp
location first:
bash
BINARY="/tmp/target_binary"
# gzip -dk path/to/binary.gz -c > "$BINARY" && chmod +x "$BINARY"
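The decompression step can be wrapped in a small helper that picks the tool from the file extension. This is a minimal sketch, assuming gzip, xz, and zstd are installed as needed; the function name decompress_binary and the default output path are illustrative and not part of any tool.

```shell
# Hypothetical helper: decompress a binary to a temp path based on its extension.
# Usage: decompress_binary path/to/binary[.gz|.xz|.zst] [output_path]
decompress_binary() {
    local src="$1"
    local out="${2:-/tmp/target_binary}"
    case "$src" in
        *.gz)  gzip -dc "$src" > "$out" ;;
        *.xz)  xz -dc "$src" > "$out" ;;
        *.zst) zstd -dc "$src" > "$out" ;;
        *)     cp "$src" "$out" ;;   # not compressed: work on a copy
    esac || return 1
    chmod +x "$out"
    printf '%s\n' "$out"             # print the path so callers can capture it
}
```

Typical use would be BINARY=$(decompress_binary path/to/binary.gz), after which the rest of the commands can refer to "$BINARY" unchanged.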
Phase 0.5: Focused Analysis Techniques
Techniques for targeted reverse engineering when you need to understand or document specific parts of a binary without full reconstruction. Use these when the task is scoped to:
- Understanding how a specific function works
- Documenting an API/protocol from a component
- Comparing what changed between binary versions
- Deepening partial/existing reverse engineering
- Answering "how does it do X?" questions
These techniques complement the full reconstruction workflow—use them to zoom in on specific areas.
0.5.1 Identify target symbols/functions
bash
BINARY="/usr/local/bin/target"
TARGET="auth" # module, function name, or component

# Find all related symbols
nm "$BINARY" | grep -i "$TARGET" > /tmp/${TARGET}_symbols.txt

# Get function addresses and names
nm "$BINARY" | grep -i "$TARGET" | grep ' T ' | awk '{print $1, $3}'

# For unstripped binaries, get source file info
readelf --debug-dump=info "$BINARY" | grep -B5 -A10 "$TARGET"
0.5.2 Extract relevant strings
Strings reveal behavior: log messages, error paths, endpoints, formats:
bash
# All strings mentioning the target
strings "$BINARY" | grep -i "$TARGET" > /tmp/${TARGET}_strings.txt

# Categorize by type
strings "$BINARY" | grep -iE "(error|fail).*${TARGET}" > /tmp/${TARGET}_errors.txt
strings "$BINARY" | grep -iE "^/.*${TARGET}" > /tmp/${TARGET}_paths.txt # URLs/paths
strings "$BINARY" | grep "json:.*${TARGET}" > /tmp/${TARGET}_json.txt # JSON fields
0.5.3 Disassemble target functions
Extract implementations of key functions:
bash
# Get function bounds (nm prints addresses without a 0x prefix)
FUNC_START=$(nm "$BINARY" | awk '$3 ~ /MyFunction/ {print $1; exit}')
# Next symbol above FUNC_START: nm -n sorts by address, NF == 3 skips undefined symbols
FUNC_END=$(nm -n "$BINARY" | awk -v start="$FUNC_START" 'NF == 3 && ($1 "") > (start "") {print $1; exit}')

# Disassemble
objdump -d "$BINARY" --start-address="0x$FUNC_START" --stop-address="0x$FUNC_END" > /tmp/func.asm

# For Go binaries, use go tool objdump for better formatting
go tool objdump -s MyFunction "$BINARY" > /tmp/func.asm
0.5.4 Analyze data structures
For typed languages (Go, Rust, C++ with debug info):
bash
# Extract type definitions
readelf --debug-dump=info "$BINARY" | grep -A 50 "DW_TAG_structure_type" | grep -A 50 "$TARGET"

# For Go, look for reflect type metadata
strings "$BINARY" | grep "type\\..*${TARGET}"
strings "$BINARY" | grep "json:.*${TARGET}" # struct tags reveal wire formats
0.5.5 Compare binary versions (diff analysis)
When analyzing what changed:
bash
OLD_BINARY="app-v1.2"
NEW_BINARY="app-v1.3"

# Symbol diff
diff <(nm "$OLD_BINARY" | sort) <(nm "$NEW_BINARY" | sort) > /tmp/symbol_diff.txt

# String diff (reveals new features, changed messages)
diff <(strings "$OLD_BINARY" | sort) <(strings "$NEW_BINARY" | sort) > /tmp/string_diff.txt

# Size diff per section
diff <(readelf -S "$OLD_BINARY") <(readelf -S "$NEW_BINARY")

# For specific function changes, compare disassembly
diff <(objdump -d "$OLD_BINARY" --start-address=0xABCD) \
     <(objdump -d "$NEW_BINARY" --start-address=0xDEF0)
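A raw nm diff tends to be noisy because addresses shift between builds even for unchanged functions. One way to cut the noise is to compare symbol names only. The sketch below assumes the nm listings were saved to files first and that names are not demangled (demangled C++ names contain spaces and would need different parsing); the function names are hypothetical.

```shell
# Hypothetical helpers: compare symbol *names* between two saved nm listings,
# ignoring addresses. Inputs are files produced by e.g.
#   nm "$OLD_BINARY" > old_nm.txt

# Print the sorted, unique symbol names from one nm listing.
symbol_names() {
    # The name is the last whitespace-separated field of each nm line.
    awk 'NF >= 2 {print $NF}' "$1" | LC_ALL=C sort -u
}

# Print names that appear only in the second (new) listing.
added_symbols() {
    local old new
    old=$(mktemp) || return 1
    new=$(mktemp) || return 1
    symbol_names "$1" > "$old"
    symbol_names "$2" > "$new"
    LC_ALL=C comm -13 "$old" "$new"
    rm -f "$old" "$new"
}

# Print names that appear only in the first (old) listing.
removed_symbols() {
    added_symbols "$2" "$1"
}
```

Names reported by both helpers at once usually point to renames; names in neither list may still have changed bodies, which only the disassembly comparison reveals.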
0.5.6 Produce focused output
Choose output based on the task:
For "how does function X work?" → Write a detailed explanation with:
- Purpose (inferred from strings, call sites, context)
- Algorithm (from disassembly/decompilation)
- Error paths (from error strings)
- Dependencies (from calls to other functions)
For "document API/protocol" → Create specification with:
- Endpoints/interfaces (from strings, URL construction)
- Request/response formats (from JSON tags, marshal/unmarshal code)
- Authentication (from header-setting code)
- Examples (verified against live API if possible)
For "what changed between versions?" → Write a changelog with:
- New functions/symbols
- Modified functions (with before/after behavior)
- Removed functionality
- Changed constants/strings
For "deepen existing RE" → Add to existing reconstruction:
- Implement previously stubbed functions
- Add missing error paths
- Clarify ambiguous logic
- Verify against binary
Phase 1: Binary Census
Do not write ANY source code until this phase is complete. This phase
produces a written inventory that governs all subsequent work.
1.1 File identification
bash
file "$BINARY"
readelf -h "$BINARY" # ELF header: arch, endianness, entry point
readelf -d "$BINARY" # dynamic section: linked libraries
readelf -n "$BINARY" # build ID, notes
Determine:
- Language: Rust (look for rust_begin_unwind, core::fmt, mangled _ZN symbols
with h hash suffixes), Go (runtime.main, go.buildid), C/C++ (standard vtable
patterns, __cxa_throw), etc.
- Compiler version: Rust embeds version strings; Go embeds go1.x; GCC/Clang
often visible in .comment section
- Static vs dynamic linking
- Stripped vs unstripped (readelf -s for symbol table)
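The language markers listed above can be checked mechanically. This is a rough heuristic sketch: it reads a strings dump on stdin and reports the first family whose markers appear. The function name guess_language is hypothetical, a match is a hint rather than proof, and plain C falls through to "unknown" because the list above gives no C-only marker.

```shell
# Hypothetical heuristic: guess the source language from a strings dump on stdin.
# The markers are the ones named in the list above.
guess_language() {
    local dump lang
    dump=$(mktemp) || return 1
    cat > "$dump"                      # buffer stdin so it can be scanned repeatedly
    if grep -qE 'rust_begin_unwind|core::fmt' "$dump"; then
        lang="rust"
    elif grep -qE 'runtime\.main|go\.buildid' "$dump"; then
        lang="go"
    elif grep -q '__cxa_throw' "$dump"; then
        lang="c++"
    else
        lang="unknown"
    fi
    rm -f "$dump"
    echo "$lang"
}
```

It would typically be fed with strings -n 6 "$BINARY" | guess_language, and the result cross-checked against the .comment section before it goes into the inventory.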
1.2 String extraction
This is the single most important step. Every string in the binary witnesses a
code path that MUST exist in your source.
bash
strings -n 6 "$BINARY" | sort -u > /tmp/binary_strings_all.txt
wc -l /tmp/binary_strings_all.txt

# Categorize strings
strings -n 6 "$BINARY" | grep -i '\[debug\]\|error\|fail\|warn' > /tmp/strings_log.txt
strings -n 6 "$BINARY" | grep -iE '^\/' > /tmp/strings_paths.txt
strings -n 6 "$BINARY" | grep -E '\.(rs|go|py|c|cpp|h)' > /tmp/strings_source_refs.txt
strings -n 6 "$BINARY" | grep -iE 'http|socket|addr|port|listen|connect' > /tmp/strings_network.txt
strings -n 6 "$BINARY" | grep -iE 'usage|help|version|flag|arg|option' > /tmp/strings_cli.txt
1.3 Symbol analysis
bash
# Dynamic symbols (dynamically linked binaries keep these even when stripped)
nm -D "$BINARY" 2>/dev/null > /tmp/dynamic_symbols.txt
readelf --dyn-syms "$BINARY" > /tmp/dynsym.txt

# Full symbol table (if not stripped)
nm "$BINARY" 2>/dev/null > /tmp/symbols.txt

# Imported libraries and functions
ldd "$BINARY" 2>/dev/null
readelf -d "$BINARY" | grep NEEDED
1.4 Section analysis
bash
readelf -S "$BINARY" # all sections with sizes
readelf -p .rodata "$BINARY" # read-only data (constants, string literals)
readelf -p .comment "$BINARY" # compiler info
objdump -s -j .rodata "$BINARY" | head -200 # hex dump of rodata
1.5 Disassembly (selective)
Full disassembly of large binaries is impractical. Target specific areas:
bash
# Entry point and main
objdump -d "$BINARY" | grep -A 50 '<main>:'
objdump -d "$BINARY" | grep -A 50 '<_start>:'

# Function list (from symbols if available), largest first
nm -S --size-sort -r "$BINARY" | grep -i ' t ' | head -50

# Cross-reference: find code that references a specific string
# 1. Find the string's file offset (hex)
strings -t x "$BINARY" | grep "target string"
# 2. Convert the offset to a virtual address using the section's Address and
#    Offset columns (readelf -S), then search the disassembly for that address
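strings -t x reports file offsets, while disassembly refers to virtual addresses. For a string inside a section, the address is the section's Address plus the distance of the string from the section's Offset, both taken from readelf -S. A minimal sketch of that arithmetic, with a hypothetical function name:

```shell
# Hypothetical helper: convert a file offset (from `strings -t x`) to a virtual
# address. Arguments are hex values without the 0x prefix:
#   $1 = file offset of the string
#   $2 = Address column of the containing section (readelf -S)
#   $3 = Offset column of the containing section (readelf -S)
offset_to_vaddr() {
    local file_off="$1" sec_addr="$2" sec_off="$3"
    printf '%x\n' $(( 0x$sec_addr + 0x$file_off - 0x$sec_off ))
}
```

For example, a string at file offset 2010 in a section with Address 402000 and Offset 2000 lands at 402010. On x86-64, objdump usually annotates RIP-relative references with the resolved address, so something like objdump -d "$BINARY" | grep 402010 is a reasonable first search; other architectures may build the address across several instructions.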
1.6 Produce the inventory document
Before proceeding, write a structured inventory (as a comment block or separate
file) containing:
- Binary metadata: arch, language, compiler version, linking, stripped?
- String checklist: every application-level string (not stdlib/compiler
noise), each marked as UNCOVERED. This checklist is updated throughout
reconstruction. A string is COVERED when source code containing it is written.
- Dependency list: external crates/packages/libraries with versions
(inferred from strings, symbol names, .comment section)
- Module structure hypothesis: based on source file path strings
(e.g., src/main.rs, src/io.rs) and functional groupings
- Function inventory: known functions with approximate sizes, grouped by
module
Phase 2: Project Skeleton
2.1 Reconstruct the build configuration
- Rust: Cargo.toml with dependencies inferred from Phase 1. Match
versions from embedded strings (e.g., tokio-1.38.0 in panic messages).
Or, if using Bazel, the appropriate BUILD.bazel + Cargo.toml.
- Go: go.mod with dependencies from embedded module paths.
- C/C++: Makefile/CMakeLists.txt with library flags from ldd output.
2.2 Create module files
Create one empty file per module, named from the source path strings in the
binary (e.g., /build/src/control_server.rs reveals a module named
control_server). Give each file a doc comment recording:
- Binary offset range for functions in this module (if determinable)
- String references that belong to this module
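Candidate module names can be pulled out of the source-path strings collected in Phase 1. The sketch below is illustrative only: the function name is hypothetical, the extension list mirrors the one used for strings_source_refs.txt, and paths that belong to dependencies (registry or toolchain directories) will also show up and need to be filtered by hand.

```shell
# Hypothetical helper: derive candidate module names from source-path strings
# read on stdin (e.g. the contents of /tmp/strings_source_refs.txt).
module_names() {
    # Keep only path-like substrings under a src/ directory...
    grep -oE '[A-Za-z0-9_./-]*src/[A-Za-z0-9_/-]+\.(rs|go|c|cpp|h|py)' |
        # ...then strip the directory part and the file extension.
        sed -E 's#.*/##; s#\.[a-z]+$##' |
        sort -u
}
```

Given /build/src/control_server.rs this prints control_server; a panic string such as "panicked at src/main.rs:42" contributes main.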
2.3 String coverage tracking
Maintain a checklist (in a tracking file or structured comments) mapping every
application string to its source location. Format:
[x] "Failed to bind" -> src/main.rs:bind_listener()
[ ] "Invalid UTF-8 in request body" -> UNCOVERED
[x] "[DEBUG] Cgroup setup successful" -> src/cgroup.rs:setup_cgroup()
Update this as you write each function. Phase 4 verification will catch any
you missed, but proactive tracking is faster.
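The checklist format above is simple enough to maintain with a few shell helpers. This is a sketch under the assumption that the checklist lives in a plain text file with one string per line; the function names are hypothetical.

```shell
# Hypothetical helpers for the checklist format shown above.

# Create a checklist file ($1) with every string on stdin marked UNCOVERED.
init_checklist() {
    sed -e 's/.*/[ ] "&" -> UNCOVERED/' > "$1"
}

# Mark one string as covered by a source location.
# Usage: mark_covered checklist.txt 'Failed to bind' 'src/main.rs:bind_listener()'
mark_covered() {
    local file="$1" str="$2" loc="$3" tmp
    tmp=$(mktemp) || return 1
    # Values go through the environment so backslashes are not reinterpreted.
    S="$str" LOC="$loc" awk '
        BEGIN { s = ENVIRON["S"]; loc = ENVIRON["LOC"] }
        $0 == ("[ ] \"" s "\" -> UNCOVERED") { print "[x] \"" s "\" -> " loc; next }
        { print }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Print how many strings are still uncovered.
count_uncovered() {
    grep -c '^\[ \] ' "$1"
}
```

A checklist seeded from the application-level strings of Phase 1 can then be updated with one mark_covered call per function written, and count_uncovered gives a quick progress figure.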
Phase 3: Function-by-Function Reconstruction
Work through the function inventory from Phase 1. For each function:
3.1 Anchor on strings
Every string reference in the function is a structural anchor. The strings
dictate:
- What error paths exist
- What log messages are emitted
- What CLI flags/help text is defined
- What file paths are accessed
3.2 Trace control flow
From disassembly or decompiler output around string references:
- Identify branch conditions (what causes each error message)
- Identify loops (retry patterns, polling, iteration)
- Identify function calls (callees and their signatures)
- Identify resource lifecycle (open/close, alloc/free, lock/unlock)
3.3 Write the source
Write the function with:
- A doc comment citing binary evidence (offset, string refs)
- The exact string literals from the binary (character-for-character)
- Control flow matching the binary's structure
- Error handling matching every error string
3.4 Mark coverage
After writing each function, update the string checklist. Flag any strings you
couldn't place — they indicate missing code paths.
3.5 Mark incomplete reconstructions
Not every function can be fully reconstructed in a single pass. When you cannot
fully recover a function body, closure, type, or control flow path, you MUST
mark it with a TODO(re): comment so it's greppable and clearly incomplete.
Required markers (use exactly // TODO(re): or # TODO(re): prefix):
- Stub function bodies: // TODO(re): stub - <what the function should do>
- Empty goroutine/closure bodies:
// TODO(re): stub - <describe the closure's purpose from binary evidence>
- Placeholder types (interface{}, any, object where concrete type exists):
// TODO(re): concrete type not recovered - likely <best guess>
- Discarded values (_ = expr where the value is clearly used by the real
binary): // TODO(re): should be <how the value is consumed>
- Commented-out code standing in for unrecovered logic:
// TODO(re): not reconstructed - <brief description of what binary does>
- Hardcoded placeholders (zero values, empty strings, dummy data where the
binary has real logic): // TODO(re): placeholder - <what should be here>
- Incomplete error handling (errors swallowed or ignored where the binary
handles them): // TODO(re): error handling not reconstructed
Every TODO(re): must include a brief description of what the correct
implementation should do, based on binary evidence. Bare // TODO without
context is not acceptable.
Do not leave unmarked stubs. A function that returns nil where the binary
has real logic, a closure with an empty body, or a variable discarded with
_ = where the binary consumes it — all of these are reconstruction bugs if
left unmarked. The marker makes the gap visible and searchable.
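The rule against bare TODOs can be checked with a single grep. The sketch below flags comment lines where TODO or TODO(re): is followed by nothing at all; it does not judge whether a description is adequate. The function name is hypothetical.

```shell
# Hypothetical check: list TODO comments that carry no description, i.e.
# "// TODO", "// TODO:", "// TODO(re)" or "// TODO(re):" with nothing after
# (same for the "#" comment prefix). $1 is the source directory to scan.
find_bare_todos() {
    grep -rnE '(//|#) *TODO(\(re\))?:? *$' "$1"
}
```

An empty result from find_bare_todos src/ means every marker at least carries some text; the descriptions themselves still need a manual read.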
3.6 No dead code
Dead code does not exist in an optimized compiled binary. If you wrote code
that nothing calls, your reconstruction is wrong — find the caller.
#[allow(dead_code)], #pragma unused, or equivalent suppressions are
forbidden. They mean you gave up finding the call site.
Phase 4: Differential Verification
After the source compiles, verify against the reference binary.
4.1 String diff (mandatory, non-negotiable)
bash
# Extract strings from YOUR compiled binary
strings -n 6 "$YOUR_BINARY" | sort -u > /tmp/my_strings.txt

# Extract strings from the REFERENCE binary
strings -n 6 "$REFERENCE_BINARY" | sort -u > /tmp/ref_strings.txt

# Strings in reference but missing from yours = missing code paths
comm -23 /tmp/ref_strings.txt /tmp/my_strings.txt > /tmp/missing_strings.txt

# Strings in yours but not in reference = extra/wrong code
comm -13 /tmp/ref_strings.txt /tmp/my_strings.txt > /tmp/extra_strings.txt
Every entry in missing_strings.txt (excluding stdlib/compiler noise) is a
reconstruction bug. Fix before proceeding.
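Separating application strings from stdlib/compiler noise is easier with an explicit, reviewable list of noise patterns. The sketch below assumes such a file exists (one extended regex per line); that file, its contents, and the function name are assumptions for illustration, not something the binary or any tool provides.

```shell
# Hypothetical helper: list reference strings missing from the rebuilt binary,
# minus lines matching a user-maintained file of noise patterns.
#   $1 = sorted reference strings   (e.g. /tmp/ref_strings.txt)
#   $2 = sorted strings from yours  (e.g. /tmp/my_strings.txt)
#   $3 = noise patterns, one extended regex per line (may be empty)
missing_app_strings() {
    local ref="$1" mine="$2" noise="$3"
    if [ -s "$noise" ]; then
        comm -23 "$ref" "$mine" | grep -vE -f "$noise"
    else
        comm -23 "$ref" "$mine"
    fi
}
```

Keeping the noise patterns in a file makes each exclusion an explicit decision that can be reviewed, rather than a string that was silently ignored.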
4.2 Symbol diff
bash
nm -D "$YOUR_BINARY" 2>/dev/null | sort > /tmp/my_dynsym.txt
nm -D "$REFERENCE_BINARY" 2>/dev/null | sort > /tmp/ref_dynsym.txt
diff /tmp/ref_dynsym.txt /tmp/my_dynsym.txt
Dynamic symbol mismatches indicate wrong dependency versions or missing
functionality.
4.3 Behavioral diff (when possible)
bash
# Compare syscall sequences with identical inputs
# TEST_ARGS is a placeholder for the arguments under test. Run the traces one
# after the other so both finish before the diff (use timeout for servers).
strace -f -o /tmp/ref_strace.txt "$REFERENCE_BINARY" $TEST_ARGS
strace -f -o /tmp/my_strace.txt "$YOUR_BINARY" $TEST_ARGS

# Compare: same files opened? same sockets? same signals handled?
diff <(grep -E 'open|socket|bind|listen|connect|signal' /tmp/ref_strace.txt) \
     <(grep -E 'open|socket|bind|listen|connect|signal' /tmp/my_strace.txt)
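Two traces of identical behavior still differ in process IDs and memory addresses, which drowns real differences in the diff. A small normalizing filter helps; this sketch assumes the line format produced by strace -f -o (a leading PID column), and the function name is hypothetical.

```shell
# Hypothetical filter: normalize `strace -f -o` output on stdin so two runs can
# be diffed. Drops the leading PID column and masks hex addresses.
normalize_strace() {
    sed -E -e 's/^[0-9]+ +//' -e 's/0x[0-9a-f]+/ADDR/g'
}
```

Each trace can be piped through normalize_strace before the grep and diff step. File descriptor numbers and timestamps are left alone here and may need similar treatment depending on the strace options in use.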
4.4 Section size comparison
bash
# Compare section sizes; large discrepancies indicate missing/extra code
# (-W keeps each section on one line so the Size column survives the grep)
readelf -S -W "$REFERENCE_BINARY" | grep -E '\.text|\.rodata|\.data' > /tmp/ref_sections.txt
readelf -S -W "$YOUR_BINARY" | grep -E '\.text|\.rodata|\.data' > /tmp/my_sections.txt
diff /tmp/ref_sections.txt /tmp/my_sections.txt
4.5 Stub scan (mandatory before completion)
Before declaring reconstruction complete, scan for remaining incomplete work:
bash
# Find all TODO(re) markers; every one is an acknowledged gap
grep -rn 'TODO(re)' src/ | tee /tmp/todo_re.txt
wc -l /tmp/todo_re.txt

# Find potential unmarked stubs
grep -rn '_ = ' src/ | grep -v 'TODO' | tee /tmp/unmarked_discards.txt
grep -rn 'interface{}' src/ | grep -v 'TODO' | tee /tmp/unmarked_interfaces.txt
grep -rn '// Stub' src/ | grep -v 'TODO' | tee /tmp/unmarked_stubs.txt
Resolution requirements:
- Every _ = expr that discards a meaningful value must either be fixed
(value consumed correctly) or marked with TODO(re):.
- Every interface{} that stands in for a concrete type must either be
replaced with the correct type or marked with TODO(re):.
- Every function/closure with a stub body must be either reconstructed or
marked with TODO(re):.
- The TODO(re) count should be documented in the README or inventory so
the scope of remaining work is visible.
This scan catches gaps that slipped through Phase 3 without markers. It is
non-negotiable — unmarked stubs are worse than marked ones because they look
like intentional implementations.
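For the documented TODO(re) count, a per-file breakdown is often more useful than a single number. A minimal sketch, with a hypothetical function name, that prints one line per file containing markers and a final total:

```shell
# Hypothetical helper: print per-file TODO(re) counts and a total, suitable for
# pasting into the README or inventory. $1 is the source directory.
todo_re_summary() {
    grep -rc 'TODO(re)' "$1" |
        awk -F: '$NF > 0 { print; total += $NF }
                 END     { print "total:" (total + 0) }'
}
```

Running todo_re_summary src/ after each Phase 3 and Phase 4 iteration shows whether the remaining gaps are shrinking and which modules hold most of them. The count is of lines containing a marker, not of individual markers.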
Principles
- Binary is ground truth. If the binary says it, the source must say it.
- Strings are witnesses. Every string in the binary testifies to a code
path. Missing strings = missing logic. Extra strings = wrong logic.
- No dead code. The compiler already removed dead code. Everything in the
binary is reachable. Find the call path.
- No warning suppression. #[allow(dead_code)], // nolint, #pragma
suppression = you failed to reconstruct the call graph. Fix the graph.
- Evidence over opinion. Log your evidence. Every function should cite
which binary offsets, strings, or disassembly patterns informed it.
- Verify differentially. Compiling is necessary but not sufficient.
The output binary must match the reference in strings, symbols, and behavior.
- Iterate. Phase 4 will find gaps. Return to Phase 3 and fill them.
Repeat until the string diff is clean.
- Mark what you can't finish. Use // TODO(re): <description> for any
stub, placeholder type, discarded value, or unrecovered logic. Unmarked
stubs masquerade as correct implementations and are worse than acknowledged
gaps. The TODO(re): prefix is greppable and distinguishes reconstruction
gaps from normal development TODOs.