This thread has been locked.

Compiler: cl6x compiler to produce bitwise reproducible output

Tool/software: TI C/C++ Compiler

Hello,

I need the cl6x compiler to produce bitwise reproducible output (see also https://reproducible-builds.org), i.e. multiple compilations of the same source base (done by different users, in their own directories) should give exactly the same binary. I am using CGT 7.3.23.

Two issues found:

Issue #1

I've found that random bytes change in the binary. The compiler creates a temporary file, which it then compiles, and that temporary file's name is included in the binary's .symtab section:

$ readelf -s myobject.obj

Symbol table '.symtab' contains 999 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 00000000     0 FILE    LOCAL  HIDDEN   ABS 07894VfUHkc
[...]

How can I get rid of this? The entry is meaningless, since the temporary file is removed after compilation anyway.

I did some reverse engineering, and it seems that the compiler uses some sort of gen_tempname function (https://github.molgen.mpg.de/git-mirror/glibc/blob/master/sysdeps/posix/tempname.c), as I can see getpid and gettimeofday syscalls when I run the compiler under strace. But I am unable to use LD_PRELOAD, because the compiler is statically linked...

Issue #2

The build path is included in the binary's debugging symbols. I would like to be able to map that variable string to an arbitrary one. GCC provides a similar option, -ffile-prefix-map (see the description here: https://gcc.gnu.org/onlinedocs/gcc/Overall-Options.html).

  • I am interested in this topic, because it is one of the blockers to delivering substantial build time savings by improving the build cache hit ratio.
  • I have a similar problem in https://e2e.ti.com/support/microcontrollers/hercules/f/312/t/743189. I have found the following workaround: I set "Keep generated assembly language (.asm) file (--keep_asm, -k)" in Assembler Options. This results in deterministic names.

  • I saw that request, but did not see any solution, and the topic was locked, so I created my own (also, my issue is with cl6x, not the ARM compiler).

    Marvelous! That somewhat works around the first issue I mentioned. It is not production-ready, though... Those assembly files will significantly increase the build directory size (about 4 times the object file size). With thousands of objects (yes, quite a big project), that adds up to gigabytes...

    In the worst case, I'll just implement yet another cl6x wrapper (alongside the "line buffering" wrapper and similar ones) that removes those files immediately afterwards. Or I'll use --asm_directory=$(BUILDDIR)/trash and remove it after the build. Need to rethink...

    Thanks anyway, that is a good initial approach until we get something production-ready from the TI experts.
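
The wrapper idea above could be sketched roughly as follows. This is only an illustration, not a TI-provided tool: the function name `compile_and_discard_asm` is hypothetical, and the only compiler options used are the two named in this thread (--keep_asm and --asm_directory). It assumes the caller passes a fixed directory for the assembly files, since a randomly named one would reintroduce a varying path.

```python
import subprocess
from pathlib import Path


def compile_and_discard_asm(compiler_cmd, asm_dir):
    """Run the compiler with --keep_asm, then delete the kept .asm files.

    compiler_cmd: the full command line as a list, e.g. ["cl6x", "-c", "foo.c"].
    asm_dir: a fixed directory for the kept assembly files (hypothetical
             "trash" location; it is created if missing).
    Returns the compiler's exit code.
    """
    target = Path(asm_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Append the two options mentioned in the thread to the original command.
    cmd = list(compiler_cmd) + ["--keep_asm", f"--asm_directory={target}"]
    result = subprocess.run(cmd)

    # Remove only the assembly files left behind in the trash directory.
    for asm_file in target.glob("*.asm"):
        asm_file.unlink()

    return result.returncode
```

A build system would call this in place of the compiler, for example with `["cl6x", "-c", "foo.c"]` and a per-build trash directory. Note that parallel builds sharing one directory could delete each other's files mid-compile, so one directory per invocation may be safer.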

  • Now, when using --keep_asm, I also get some numbers in the binary that match the last-modification timestamp of the compiled *.asm file.

    Looking at binary hexdump, I see strings like the following:
    /path/to/asm/file.asm:$C$L6:1546604415

    Looking at file.asm:
    $ stat /path/to/asm/file.asm -c "%Y"
    1546604415

    So it turns out that setting --keep_asm does not solve my issue...
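
One possible post-processing step for strings like the one above is to overwrite the timestamp digits with zeros of the same length, so that file offsets do not move. The sketch below is an assumption-heavy illustration: the pattern is inferred solely from the single string shown in this post, the function name `zero_asm_timestamps` is hypothetical, and whether the tools tolerate the modified value has not been verified.

```python
import re

# Matches ".asm:$C$L<number>:<timestamp>" as seen in the hexdump above.
# The 9-10 digit range is an assumption about Unix timestamps in seconds.
ASM_STAMP = re.compile(rb"(\.asm:\$C\$L\d+:)(\d{9,10})")


def zero_asm_timestamps(data: bytes) -> bytes:
    """Return a copy of data with embedded .asm timestamps zeroed out.

    The replacement has the same length as the original digits, so the
    overall size and all offsets in the file stay unchanged.
    """
    def blank(match):
        return match.group(1) + b"0" * len(match.group(2))

    return ASM_STAMP.sub(blank, data)
```

Applied to the bytes of an object file before hashing, two builds that differ only in these timestamps would then yield the same content.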

  • A good summary on this topic is in this forum thread.  

    Consider using the utility objdiff from the cg_xml package.  By default, it ignores the debug information and the symbols.  This reduces the constraints imposed on the build.

    Thanks and regards,

    -George

  • Hello George,

    Thanks for the answer. It shed some light on the topic.

    I agree that some aspects of the build process are not the compiler's or linker's responsibility (e.g. maintaining the order of inputs), but some others are, and I think my request addresses exactly those.
    When I execute the same command, on the same host, in the same directory, I'd expect exactly the same result (i.e. md5sum/sha256sum should match in both).
    Or I'd at least expect some easy method to fake the build environment, so that the compiler gives predictable results...

    In that case, I can keep only a fingerprint (e.g. an MD5 hash) of an executable plus an environment description (a few kilobytes), and compare rebuilt binaries against it to make sure I got exactly the same content (using tools that are available on any Linux box). I cannot imagine how to achieve this efficiently with objdiff...

    The argument that "we don't test something, thus not delivering" does not seem relevant to this discussion. It's not a matter of testing or not, but of willingness to support this kind of use case and to actually start doing something about it. And, based on the number of questions similar to mine, it seems there are some people who would be interested in bitwise identical binaries.

    So, maybe the question should be: will you add tests (and support) for this?
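
The fingerprint workflow described in this post (store a hash once, compare rebuilt binaries against it) can be sketched with the Python standard library alone. The function names are hypothetical, and SHA-256 is used here simply as one of the hashes mentioned above; it only gives a useful answer once the build output itself is bitwise reproducible.

```python
import hashlib


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_baseline(path, baseline_hex):
    """Compare a rebuilt file against a previously stored fingerprint."""
    return fingerprint(path) == baseline_hex.strip().lower()
```

The stored baseline is then just a short hex string kept next to the environment description, equivalent to what sha256sum would print for the same file.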

  • The solution currently provided by TI compilers does not work this way ...

    Bartlomiej Kucharczyk said:
    When I execute the same command, on the same host, in the same directory, I'd expect exactly the same result (i.e. md5sum/sha256sum should match in both).

    Instead, some executable or library is established as the baseline, and then objdiff is used to test whether subsequent builds are the same.

    Bartlomiej Kucharczyk said:
    maybe the question should be: will you add tests (and support) for this?

    Unfortunately, that is not on our roadmap.

    Thanks and regards,

    -George

  • Hmm... that's sad news.

    Can anything be done so that you add this topic to your roadmap?

    Anyway, how could I compute a fingerprint (e.g. an MD5 hash) of an executable or library that could be used later to compare against a newly built executable or library?

    It would also be acceptable for me to get some way to strip those debugging symbols (the strip6x tool did not work for me -- some build paths were still in the objects).

  • Bartlomiej Kucharczyk said:
    Can anything be done so that you add this topic to your roadmap?

    I filed CODEGEN-5738 in the SDOWP system.  This does not report a bug, but requests support in the compiler for reproducible builds.  You are welcome to follow it with the SDOWP link below in my signature.  (However, it seems SDOWP is having problems today.  It should be resolved soon.)

    Thanks and regards,

    -George

  • Thank you! In the meantime, I've made a tool that erases some of the useless data from the binary:
    github.com/.../erase.py
  • Bartlomiej Kucharczyk said:
    I've made a tool that erases some of the useless data from the binary

    Thank you for the contribution.  But I don't see the advantage of this approach over the one used by objdiff.  objdiff doesn't erase anything, it just skips over the "useless data".

    Thanks and regards,

    -George

  • For me:

    • I'm able to compute and store a fingerprint using md5sum (or another standard hash tool available on Linux).
    • It was easier to write this tool than to examine the objdiff source code and figure out what counts as "important data" in order to compute a hash of it. Most probably, working with some ELF library would be better.
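
As a rough illustration of the "hash only the important data" idea, the sketch below walks the section headers of a 32-bit ELF file with the standard `struct` module and hashes every section except symbol tables, string tables, no-bits sections, and sections whose names start with ".debug". That selection is an assumption made for this example and is not taken from objdiff; the function name `content_fingerprint` is hypothetical.

```python
import hashlib
import struct

SHT_NULL, SHT_SYMTAB, SHT_STRTAB, SHT_NOBITS = 0, 2, 3, 8


def content_fingerprint(data: bytes) -> str:
    """Hash the non-debug, non-symbol sections of a 32-bit ELF image."""
    if data[:4] != b"\x7fELF" or data[4] != 1:
        raise ValueError("expected a 32-bit ELF file")
    endian = "<" if data[5] == 1 else ">"

    # ELF32 header fields that follow the 16-byte e_ident array.
    header = struct.unpack_from(endian + "HHIIIIIHHHHHH", data, 16)
    shoff, shentsize, shnum, shstrndx = header[5], header[10], header[11], header[12]

    # Each ELF32 section header consists of ten 32-bit words.
    sections = [
        struct.unpack_from(endian + "10I", data, shoff + i * shentsize)
        for i in range(shnum)
    ]
    names_offset = sections[shstrndx][4]

    def section_name(sh):
        start = names_offset + sh[0]
        return data[start:data.index(b"\0", start)].decode("ascii", "replace")

    digest = hashlib.sha256()
    for sh in sections:
        name, sh_type, offset, size = section_name(sh), sh[1], sh[4], sh[5]
        if sh_type in (SHT_NULL, SHT_SYMTAB, SHT_STRTAB, SHT_NOBITS):
            continue
        if name.startswith(".debug"):
            continue
        digest.update(name.encode() + b"\0")
        digest.update(data[offset:offset + size])
    return digest.hexdigest()
```

Unlike erasing bytes in place, this leaves the file untouched, but the resulting value cannot be reproduced with plain md5sum, which is the trade-off discussed in this post.
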
  • "When I execute the same command, on the same host, in the same directory, I'd expect exactly the same result (i.e. md5sum/sha256sum should match in both)."

    I think this runs afoul of the C standard, since at the very least __DATE__ and __TIME__ are populated with the time and date of the build.
  • You are right, thanks for pointing that out.

    However, I do not understand how it contradicts the C standard. I haven't said anywhere that "all C source code will be 100% bitwise reproducible". :-)

    We simply try to avoid those macros in source code (and other pitfalls that make builds irreproducible). If someone does use them, we'll catch that immediately (the md5sum will differ even with no change in the source code or build environment).

    But let's go one step back and ask: do you have any reason to use the varying __DATE__ and __TIME__ values? Personally, I cannot see any, so that could be a good education for me. ;-)

  • Typically as part of a revision string. Someone must think it's a good idea, or it would not be an ancient part of the standard. 8^)
  • Maybe that was useful when programming was done by carving rocks, and there was no git. ;-) I don't know...
    If you use any version control system, the revision string can be generated from version control, deterministically.

    Topic is also discussed here:
    reproducible-builds.org/.../
    reproducible-builds.org/.../
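
A deterministic revision string of the kind mentioned above could, for instance, be generated from git and written into a small C header in place of __DATE__ and __TIME__. This is a minimal sketch assuming a git checkout; the function names and the macro name BUILD_REVISION are hypothetical.

```python
import subprocess


def git_revision(repo_dir="."):
    """Return the output of `git describe --always --dirty` for repo_dir."""
    result = subprocess.run(
        ["git", "describe", "--always", "--dirty"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def render_revision_header(revision, macro="BUILD_REVISION"):
    """Render a C header that defines the revision as a string literal."""
    # Escape backslashes and quotes so the result is a valid C string.
    escaped = revision.replace("\\", "\\\\").replace('"', '\\"')
    return f'#ifndef {macro}\n#define {macro} "{escaped}"\n#endif\n'
```

Writing `render_revision_header(git_revision())` to a generated header before compilation gives the same bytes for the same commit, so it does not disturb a hash comparison of the resulting binaries.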