Core dump analysis
This page mostly consists of disorganized random ideas. Sorry.
Current task: make the canonical backtrace usable across different versions
of one package. This is obviously not possible in all cases, but should be
feasible in most cases where the packages/executables do not differ very much.
(This will obsolete the BUILD_ID+OFFSET idea, but I'd like to keep it as a fallback, because it should work quite well for packages of the same version where build ids are available.)
- Where available (e.g. shared libraries), we can use symbol names.
- We'll have to use GDB/MI, as the ordinary GDB output is a bit irregular/unpredictable.
- The .eh_frame section appeared to contain useful information, but it's probably not enough -- see the section Identifying functions by their .eh_frame entry below.
- Current idea: devise a function fingerprinting scheme. See the dedicated section below.
1. Determine if two coredumps describe the same crash. That is, given two coredumps (created on the same machine/OS), determine whether the crash occurred in a similar location (for some reasonable definition of similar) or for the same reason.
- If we only care about equality of some parts of the dump, we may consider using hashes to speed things up when comparing repeatedly.
2. For a given dump, determine whether the crash occurred in the main binary or in one of the linked libraries.
- We'll have to account for possibly different offsets of the segments (due to prelink/ASLR/...). This probably applies to point 1. as well.
In all cases, debugging symbols may or may not be available.
The code metadata (in the form of .eh_frame and other sections commonly available in stripped binaries) appears insufficient to reliably identify the same function compiled by different compiler versions. Hence, we'll have to look at the data, which is the code itself ;). Unfortunately, this means that we'll have to work with machine code, which has complex semantics, and implement our approach separately for every architecture we want to support.
We can use the exception handling data found in .eh_frame to obtain the start and end addresses of most functions. This partitions the .text segment into a number of instruction sequences, each hopefully corresponding to a C function. What we want to accomplish is, for every such sequence, to compute a fingerprint that is the same (or perhaps similar?) for two functions compiled from the same code under (slightly) different circumstances. (Note: if we decide to consider similarities, it would be nice if two functions compiled from similar sources had similar fingerprints.)
Possible code properties that could be used:
- Number of instructions
  - (-) Many functions (especially the smaller ones) have the same length
  - (-) Changes depending on compiler/optimization level
- Number of intra-function jumps / other CFG graph-theoretical properties
  - (-) Probably changes depending on compiler/optimization level
- Library functions called
  - We should be able to obtain the names of called library functions
  - For non-library calls, we can use the callee's fingerprint transitively
  - (+) Shouldn't be affected too much by optimization
  - (-) Some functions may not call any library function at all
  - The sequence of calls ought to be preserved
- (Number of) registers used
  - (-) Compiler/optimization dependent
- Stack frame size
  - (-) Compiler/optimization dependent
  - What about alloca() et al.?
- Constants used in the function
- Memory accessed from the function
- Presence/absence of special types of instructions
  - Like floating point, atomic, barriers, ...
- Types of arithmetic/branch/test instructions
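To make the idea concrete, here is a toy sketch combining a few of the properties above into a fingerprint. It operates on an already-decoded instruction list (mnemonics only); the disassembly step is out of scope here, and all names and the x86-flavoured heuristics are illustrative only.

```python
def fingerprint(mnemonics, callees):
    """Toy function fingerprint.

    mnemonics: instruction mnemonics of one function, in order
    callees:   ordered names of the library functions it calls
    """
    return (
        len(mnemonics),                             # number of instructions
        sum(m.startswith("j") for m in mnemonics),  # intra-function jumps (crude x86 heuristic)
        tuple(callees),                             # sequence of library calls
        any(m.startswith("f") for m in mnemonics),  # floating-point instructions present? (x87 heuristic)
    )
```

For example, `fingerprint(["push", "mov", "jne", "call", "fld", "ret"], ["printf"])` yields `(6, 1, ("printf",), True)`. A real scheme would have to weigh the robust components (call sequence) against the compiler-dependent ones (instruction count, jumps).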
Identifying functions by their .eh_frame entry
The .eh_frame section of an executable or library contains the information necessary to "pop" the stack frame of any function it contains. This is used for exception handling. Note that the section need not be present, though it appears to be available for most packages.
This section maps every function (identified by an address range) to a data structure called an FDE (Frame Description Entry), which mostly contains a short program in a special language that defines which CPU register is saved where, so that it can be restored when the frame is removed.
The idea is to identify functions by (part of) the program associated with them. Note that:
- FDEs are probably sensitive to the version and flags of the compiler used to compile the program.
- Every program I've seen so far contained 10-30% of FDEs composed solely of NOP instructions.
This script can be used to evaluate the usefulness of the FDEs contained in an ELF binary. It takes textual outputs of eu-readelf -e binary as parameters. For every parameter, it prints the number of FDEs, the number of distinct FDEs, and a list of (GROUPSIZE, COUNT) tuples, meaning that there are COUNT groups of GROUPSIZE FDEs that are equal. The more entries for lower sizes, the better. If given at least two parameters, it also prints how many distinct FDEs are found in both inputs.
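The grouping the script performs can be sketched as follows, assuming the FDE instruction programs have already been parsed out of the eu-readelf output as opaque strings (that parsing is omitted here):

```python
from collections import Counter

def fde_stats(fdes):
    """fdes: list of FDE instruction programs (opaque strings/bytes),
    one per function, e.g. parsed from `eu-readelf -e` output."""
    groups = Counter(fdes)            # identical FDE programs fall into one group
    sizes = Counter(groups.values())  # GROUPSIZE -> COUNT of groups of that size
    return len(fdes), len(groups), sorted(sizes.items())
```

For example, `fde_stats(["a", "a", "b", "c", "c", "c"])` returns `(6, 3, [(1, 1), (2, 1), (3, 1)])`: six FDEs, three distinct, and one group each of sizes 1, 2 and 3.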
Results obtained from running the script on several versions of packaged mutt (various Fedora versions, koji) and wget (from Debian snapshots) are not very encouraging (not many unique FDEs and not many FDEs common among versions) -- see the table.
The table lists the total number of FDEs (which should correspond to the number of functions), the number of distinct FDEs, the number of unique FDEs, and, for each pair of versions, the number of distinct FDEs common to both. The number of common FDEs is generally very low; my guess is that it is high where the same version of gcc with the same switches was used. I suppose we should confirm the unsuitability of FDEs on some more (Fedora) packages.
Conclusion: I don't think we can use this (information about start and length of functions will be useful though). Let's look into fingerprinting the functions through their assembly.
Obtaining canonical backtrace
The first step to answering one of the questions above is to generate a canonical backtrace for each core dump.
- Using GDB, we can obtain a backtrace for each thread. For every frame, we get:
  - The numeric address, in all cases.
  - The function/symbol name, if the frame is located in a shared library or if we have debuginfo. Note that GDB needs the original binary and libraries to extract the backtrace.
- Using unstrip -n, we can map each executable segment included in the dump to a library name and its Build ID.
- We can augment the backtrace so that we have the following information for each frame:
  - BUILD_ID+OFFSET, where BUILD_ID is the identifier of the segment the address points into and OFFSET denotes the distance from the start of that segment.
  - The name of the library, or an indication that the address belongs to the executable.
  - The symbol name, if the address points into a library or if we had debuginfo available in step 1.
The script canbt.py prints such a backtrace.
Such backtraces should be meaningful across different machines of the same architecture (or are they ... ?).
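The BUILD_ID+OFFSET mapping can be sketched as follows, assuming we already have the segment table from unstrip -n as (start, end, build_id, name) tuples sorted by start address (the field layout here is illustrative, not what canbt.py actually uses):

```python
import bisect

def canonicalize(addr, segments):
    """Map a numeric frame address to (build_id, offset, name).

    segments: list of (start, end, build_id, name) tuples sorted by
    start address, one per mapped executable segment."""
    starts = [s[0] for s in segments]
    i = bisect.bisect_right(starts, addr) - 1    # last segment starting at or below addr
    if i < 0 or addr >= segments[i][1]:
        return None                              # address outside any mapped segment
    start, _end, build_id, name = segments[i]
    return build_id, addr - start, name
```

Computing the offset relative to the segment start is what makes the result stable under prelink/ASLR-style relocation of whole segments.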
- How do we find out which thread caused the signal to be sent? - I wasn't able to find out, but it seems that it is always the thread with id 1 in GDB.
- Are these stack frames under main() real, or is it an artifact resulting from GDB not having debuginfo? - Most likely garbage. We can't use them for anything anyway.
Core dump processing in ABRT
The core dump is only used to generate a human-readable backtrace. If the user chooses to generate the backtrace locally, the script abrt-action-analyze-core is run to extract the build ids, which are then used to download the correct debuginfo packages. Then, gdb is used to create the backtrace.
What information can we obtain from a coredump
- Various loaded segments -- belonging to the executable or dynamic shared objects. If I understand this correctly, unchanged executable segments are truncated to one page (so that the build id can be extracted; the code is assumed to be unimportant).
- PRSTATUS note -- (general purpose) register contents and info about signals. One for each thread.
- PRPSINFO -- process data like uid, priority, etc.
- "Auxiliary vector" -- probably not important for us.
- FPREGSET -- contents of the floating-point registers (probably). One for each thread.
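As a minimal illustration of where these pieces sit in the core file, the following sketch lists the segment types from the program headers of a little-endian ELF64 image (PT_LOAD for the mapped segments, PT_NOTE for the PRSTATUS/PRPSINFO/FPREGSET notes). It is a bare-bones sketch for illustration; real code should use libelf/elfutils.

```python
import struct

PT_LOAD, PT_NOTE = 1, 4  # segment types of interest in a core dump

def list_segments(data):
    """Return (p_type, p_vaddr) for each program header of an ELF64
    little-endian image given as bytes. No validation beyond the magic."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    e_phoff, = struct.unpack_from("<Q", data, 32)           # program header table offset
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    segments = []
    for i in range(e_phnum):
        # Each ELF64 program header starts with p_type, p_flags, p_offset, p_vaddr
        p_type, _flags, _off, p_vaddr = struct.unpack_from(
            "<IIQQ", data, e_phoff + i * e_phentsize)
        segments.append((p_type, p_vaddr))
    return segments
```

Iterating the PT_NOTE segment contents would then yield the per-thread PRSTATUS/FPREGSET records described above.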
References
- System V application binary interface; contains detailed ELF description http://www.sco.com/developers/devspecs/gabi41.pdf
- System V ABI, AMD64 processor supplement; processor dependent details http://www.x86-64.org/documentation/abi.pdf
- elfutils project page https://fedorahosted.org/elfutils/
- Kernel code that produces ELF core dumps lives in fs/binfmt_elf.c in the kernel source.
- Ulrich Drepper: How to write shared libraries; contains some details about run-time linking but otherwise is rather focused on performance aspects, i.e. not really useful http://people.redhat.com/drepper/dsohowto.pdf
- Linkers and loaders, electronic version of the book http://www.iecc.com/linker/
- Libelf by example, tutorial http://sourceforge.net/projects/elftoolchain/files/Documentation/libelf-by-example/20101106/libelf-by-example.pdf/download
- ERESI project, contains fingerprinting library that is probably not useful to us http://www.eresi-project.org/
- Fast Library Identification and Recognition Technology in IDA disassembler, not really useful to us either http://www.hex-rays.com/idapro/flirt.htm
- x86 disassembly wikibook http://en.wikibooks.org/wiki/X86_Disassembly
- My question on stackoverflow http://stackoverflow.com/questions/7283702/assembly-level-function-fingerprint
- Blog post by Ian Lance Taylor briefly discussing differences between .eh_frame and .debug_frame http://www.airs.com/blog/archives/460
- Site hosting DWARF debugging format standard documents http://dwarfstd.org/
- Original Fedora feature wiki page http://fedoraproject.org/wiki/Releases/FeatureBuildId
- man ld contains some information on build ids
- More ideas about build id, mentions canonical backtrace http://fedoraproject.org/wiki/Summer_Coding_2010_ideas_-_Universal_Build-ID