Core dump analysis

This page mostly consists of disorganized random ideas. Sorry.

Current state

Current task: make the canonical backtrace usable across different versions of one package. This is obviously not possible in all cases, but should be feasible in most cases where the packages/executables do not differ very much. (This will obsolete the BUILD_ID+OFFSET idea, but I'd like to keep it as a fallback, because it should work quite well for packages of the same version where build ids are available.)

Random notes:

  • Where available (e.g. in shared libraries), we can use symbol names.
  • We'll have to use GDB/MI, as the ordinary GDB output is a bit irregular/unpredictable.
  • The .eh_frame section appeared to contain useful information, but it's probably not enough -- see Identifying functions by their .eh_frame entry below.
  • Current idea: devise a function fingerprinting scheme. See the Function fingerprinting section below.

Basic goals

  1. Determine if two coredumps describe the same crash. That is, given two coredumps (created on the same machine/OS), determine whether the crash occurred in a similar location (for some reasonable definition of similar) or for the same reason.

    • If we only care about equality of some parts of the dump, we may consider using hashes to speed things up when comparing repeatedly (see the sketch after this list).
  2. For a given dump, determine whether the crash occurred in the main binary or in one of the linked libraries.

    • We'll have to account for possibly different offsets of the segments (due to prelink/ASLR/...). This probably applies to point 1. as well.
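
For illustration, the hashing idea from point 1 could be as simple as digesting the part of the dump we care about -- here, hypothetically, a list of canonical frame strings (a sketch, not actual ABRT code):

    import hashlib

    def backtrace_hash(frames):
        # frames: list of canonical frame strings, e.g. "BUILD_ID+OFFSET".
        # Repeated pairwise comparisons then reduce to comparing short digests.
        return hashlib.sha1('\n'.join(frames).encode()).hexdigest()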

In all cases, debugging symbols may or may not be available.

Function fingerprinting

The code metadata (in the form of .eh_frame and other sections commonly available in stripped binaries) appears insufficient to reliably identify the same function compiled by different compiler versions. Hence, we'll have to look at the data, which is the code itself ;). Unfortunately, this means we'll have to work with machine code, which has complex semantics, and implement our approach separately for every architecture we want to support.

We can use the exception handling data found in .eh_frame to obtain the beginning and end addresses of most of the functions. This partitions the .text segments into a number of instruction sequences, each hopefully corresponding to a C function. What we want is, for every such sequence, to find a fingerprint that is the same (or perhaps similar?) for two functions compiled from the same code under (slightly) different circumstances. (Note: if we decide to consider similarities, it would be nice if two functions compiled from similar sources had similar fingerprints.)
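
A minimal sketch of the boundary extraction, assuming pyelftools is available (its EH_CFI_entries() decodes .eh_frame, and FDE headers carry the initial location and range):

    from elftools.elf.elffile import ELFFile
    from elftools.dwarf.callframe import FDE

    def function_ranges(path):
        # Returns sorted (start, length) pairs, one per FDE -- hopefully
        # one per C function. Assumes the binary actually has .eh_frame.
        with open(path, 'rb') as f:
            entries = ELFFile(f).get_dwarf_info().EH_CFI_entries()
            return sorted((e.header['initial_location'],
                           e.header['address_range'])
                          for e in entries if isinstance(e, FDE))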

Possible code properties that could be used (a toy combination of a few of these is sketched after the list):

  • Number of instructions
    • - Many functions (especially the smaller ones) have the same length
    • - Changes depending on compiler/optimization level
  • Number of intra-function jumps / other CFG graph-theoretical properties
    • - Probably changes depending on compiler/optimization level
  • Library functions called
    • We should be able to obtain names of library functions
    • For non-library calls, we can transitively use the fingerprint
    • + Shouldn't be affected too much by optimization
    • - Some functions may not call any library function at all
    • Sequence of calls ought to be preserved
  • (Number of) registers used
    • - compiler/optimization dependent
  • Stack frame size
    • - compiler/optimization dependent
    • what about alloca() et al.?
  • Constants used in the function
  • Memory accessed from the function
  • Presence/absence of special types of instructions
    • Like floating point, atomic, barriers, ...
  • Types of arithmetical/branch/test instructions
    • ?
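
A toy combination of a few of the properties above (instruction count, calls, intra-function jumps), using the Capstone disassembler and hardcoding x86_64; the choice of properties is purely illustrative, not validated:

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    def fingerprint(code, addr):
        # code: raw bytes of one function (boundaries taken from .eh_frame),
        # addr: its start address. Returns a crude feature tuple.
        md = Cs(CS_ARCH_X86, CS_MODE_64)
        n_insns = n_calls = n_jumps = 0
        for insn in md.disasm(code, addr):
            n_insns += 1
            if insn.mnemonic == 'call':
                n_calls += 1
            elif insn.mnemonic.startswith('j'):   # jmp, je, jne, ...
                n_jumps += 1
        return (n_insns, n_calls, n_jumps)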

Identifying functions by their .eh_frame entry

The .eh_frame section of a binary contains the information necessary to "pop" the stack frame of any function the binary contains. This is used for exception handling. Note that the section need not be present, though it appears to be available for most packages.

This section maps every function (identified by an address range) to a data structure called an FDE, which mostly consists of a short program in a special language that defines which CPU register is saved where, so that the registers can be restored when the frame is removed.

The idea is to identify functions by (part of) the program associated with them. Note that:

  • FDEs are probably sensitive to the compiler version and flags used to compile the program.
  • In every binary I've seen so far, 10-30% of the FDEs were composed solely of NOP instructions.

This script can be used to evaluate the usefulness of the FDEs contained in an ELF binary. It takes textual outputs of eu-readelf -e binary as parameters. For every parameter, it prints the number of FDEs, the number of distinct FDEs, and a list of GROUPSIZE: COUNT pairs, meaning that there are COUNT groups of GROUPSIZE equal FDEs each. The more entries with low group sizes, the better. If given at least two parameters, it also prints how many distinct FDEs are found in both files.
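
For illustration, a rough sketch of such a script -- the line patterns are guessed from typical eu-readelf -e output, and absolute addresses are masked so that only the shape of each CFA program is compared:

    import re, sys
    from collections import Counter

    def fde_programs(dump_path):
        # Split a textual `eu-readelf -e BINARY` dump into per-FDE programs.
        programs, current = [], None
        with open(dump_path) as f:
            for line in f:
                s = line.strip()
                if s.startswith('[') and ' FDE ' in s:
                    # FDE header, e.g. "[  18] FDE length=36 cie=[   0]"
                    if current is not None:
                        programs.append('\n'.join(current))
                    current = []
                elif s.startswith('['):      # a CIE header ends the current FDE
                    if current is not None:
                        programs.append('\n'.join(current))
                        current = None
                elif current is not None and s:
                    # Mask addresses so identical programs at different
                    # locations compare as equal.
                    current.append(re.sub(r'0x[0-9a-f]+', '0x?', s))
        if current is not None:
            programs.append('\n'.join(current))
        return programs

    for path in sys.argv[1:]:
        progs = fde_programs(path)
        groups = Counter(progs)
        print(path, 'FDEs:', len(progs), 'distinct:', len(groups),
              'group sizes:', sorted(Counter(groups.values()).items()))
    if len(sys.argv) > 2:
        common = set(fde_programs(sys.argv[1])) & set(fde_programs(sys.argv[2]))
        print('distinct FDEs common to the first two files:', len(common))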

Results obtained from running the script on several versions of packaged mutt (various Fedora versions, koji) and wget (from Debian snapshots) are not very encouraging (few unique FDEs and few FDEs common among versions) -- see the tables below.

The tables list the total number of FDEs (which should correspond to the number of functions), the number of distinct FDEs, the number of unique FDEs, and, for each pair of versions, the number of distinct FDEs common to both. The number of common FDEs is generally very low; my guess is that it is high where the same version of gcc with the same switches was used. I suppose we should confirm the unsuitability of using FDEs on some more (Fedora) packages.

mutt           total  distinct  unique  1.5.20-3.fc14  1.5.21-1.fc15  1.5.21-2.fc15  1.5.21-3.fc15  1.5.21-5.fc15  1.5.21-6.fc14  1.5.21-6.fc15
1.5.19-6.fc12   1223      1042    1009            432            428             64             55             56             64             55
1.5.20-3.fc14   1264      1072    1038                          1029             93             73             71             93             70
1.5.21-1.fc15   1264      1073    1039                                           97             77             75             97             74
1.5.21-2.fc15   1264      1068    1030                                                         375            369           1067            366
1.5.21-3.fc15   1257      1063    1027                                                                        989            376            982
1.5.21-5.fc15   1264      1066    1029                                                                                       370           1057
1.5.21-6.fc14   1266      1069    1031                                                                                                      367
1.5.21-6.fc15   1264      1066    1029


wget       total  distinct  unique  1.10.2-3  1.11.3-1  1.11.4-1  1.12-1  1.12-2.1  1.12-5
1.9.1-12     442         8       3         3         2         2       2         1       1
1.10.2-3     396       207     170                  95        94      84         8       2
1.11.3-1     384       209     171                           208     158         8       2
1.11.4-1     384       209     171                                   158         8       2
1.12-1       463       230     190                                              10       2
1.12-2.1     464       255     219                                                       3
1.12-5       484       369     353

Conclusion: I don't think we can use this (the information about function starts and lengths will be useful, though). Let's look into fingerprinting the functions through their assembly instead.

Obtaining canonical backtrace

The first step towards answering one of the questions above is to generate a canonical backtrace for each core dump.

  1. Using GDB, we can obtain a backtrace for each thread. For every frame, we get the following information:
    • Numeric address in all cases.
    • Function/symbol name if the address is located in a shared library or if we have debuginfo. Note that GDB needs the original binary and libraries to extract the backtrace.
  2. Using unstrip -n, we can map each executable segment included in the dump to a library name and its Build ID.
  3. We can augment the backtrace so that we have the following information for each frame:
    • BUILD_ID+OFFSET, where BUILD_ID is the identifier of the segment the address points into, and OFFSET denotes distance from the start of that segment.
    • Name of the library, or an indication that the address belongs to the executable.
    • Symbol name if the address points to a library, or if we had debuginfo available in step 1.

The script canbt.py prints such a backtrace.
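
For illustration, the BUILD_ID+OFFSET translation from step 3 might look like the sketch below (not canbt.py itself); it assumes eu-unstrip -n prints one module per line as START+SIZE BUILDID[@ADDR] FILE DEBUGFILE MODULENAME:

    import re, subprocess

    def core_segments(core):
        # One (start, size, build_id, module) tuple per eu-unstrip line.
        out = subprocess.check_output(['eu-unstrip', '-n', '--core=' + core],
                                      universal_newlines=True)
        segs = []
        for line in out.splitlines():
            m = re.match(r'(0x[0-9a-f]+)\+(0x[0-9a-f]+)\s+'
                         r'([0-9a-f]+)\S*\s+\S+\s+\S+\s+(\S+)', line)
            if m:
                segs.append((int(m.group(1), 16), int(m.group(2), 16),
                             m.group(3), m.group(4)))
        return segs

    def canonical_frame(addr, segs):
        # Translate a raw frame address into BUILD_ID+OFFSET plus module name.
        for start, size, build_id, module in segs:
            if start <= addr < start + size:
                return '%s+0x%x (%s)' % (build_id, addr - start, module)
        return '0x%x (unmapped)' % addr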

Such backtraces should be meaningful across different machines of the same architecture (or are they ... ?).

  • How do we find out which thread caused the signal to be sent? I wasn't able to find a definitive answer, but it seems to always be thread 1 in GDB (see the sketch after this list).
  • Are these stack frames under main() real, or are they an artifact of GDB not having debuginfo? Most likely garbage; we can't use them for anything anyway.
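
One way to check the thread question directly from the core: every thread has an NT_PRSTATUS note, and the kernel appears to write the note of the signalled thread first, which would explain GDB's thread 1. A pyelftools sketch, with struct offsets assumed from the x86_64 struct elf_prstatus layout (si_signo is the first int, pr_pid sits at byte 32):

    import struct
    from elftools.elf.elffile import ELFFile

    def prstatus_threads(core):
        # Yields (pid, signo) per NT_PRSTATUS note, in file order; the
        # first entry should be the thread that received the signal.
        with open(core, 'rb') as f:
            for seg in ELFFile(f).iter_segments():
                if seg.header.p_type != 'PT_NOTE':
                    continue
                for note in seg.iter_notes():
                    if note['n_type'] != 'NT_PRSTATUS':
                        continue
                    desc = note['n_desc']
                    if isinstance(desc, str):   # older pyelftools returns str
                        desc = desc.encode('latin-1')
                    signo = struct.unpack_from('<i', desc, 0)[0]
                    pid = struct.unpack_from('<i', desc, 32)[0]
                    yield pid, signo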

Core dump processing in ABRT

The core dump is only used to generate a human-readable backtrace. If the user chooses to generate the backtrace locally, the script abrt-action-analyze-core is run to extract build ids, which are then used to download the correct debuginfo packages. Then gdb is used to create the backtrace.

What information can we obtain from a coredump

  • Various loaded segments -- belonging to the executable or to dynamic shared objects. If I understand this correctly, unchanged executable segments are truncated to one page (so that the build id can be extracted; the code itself is assumed to be unimportant).
  • PRSTATUS note -- (general purpose) register contents and info about signals. One for each thread.
  • PRPSINFO -- process data like uid, priority, etc.
  • "Auxiliary vector" -- probably not important for us.
  • FPREGSET -- contents of floating-point registers (probably). One for each thread.
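
To see which of these notes a particular core actually carries, it should suffice to walk the PT_NOTE segments (a pyelftools sketch; the 'core' path is illustrative):

    from elftools.elf.elffile import ELFFile

    with open('core', 'rb') as f:
        for seg in ELFFile(f).iter_segments():
            if seg.header.p_type == 'PT_NOTE':
                for note in seg.iter_notes():
                    # Prints e.g.: CORE NT_PRSTATUS, CORE NT_PRPSINFO, ...
                    print(note['n_name'], note['n_type'])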

Resources

ELF format
Dis/assembly related
.eh_frame
Build ID
prelink