sharkcz (sharkcz) wrote,

How to debug weird build issues

When working on a secondary arch Fedora like s390x, we sometimes witness interesting build issues, like a sudden test failure in e2fsprogs in rawhide. No issue with the previous build, no issue with the same sources in F-22. So we started to look at what had changed, and one thing in Rawhide was enabling the hardened builds globally for all packages. With the hardening disabled the test case passed. That points to two possible causes: either the code is somehow bad, or there is a bug in the compiler. And when a new major gcc version is released we usually find a couple of bugs, sometimes even general ones, not specific to our architecture.

When the issue is in gcc, it often depends on the optimization level, so I tried to switch from the Fedora default -O2 to -O1. And voilà, the test passed again. But this is still a global option, and we need to find the piece of code that might be mis-compiled. We call the procedure that follows "bisecting", inspired by bisecting in git as a method to find an offending commit. Here it means limiting the lower optimization level to a specific directory, then to one source file, and then to a single function. It is a time-consuming process and requires modifying compiler flags in the buildsystem, using #pragma GCC optimize("O1") in files or adding __attribute__((optimize("O1"))) to functions.

In the case of the test in e2fsprogs we were quite sure it should be either the resize2fs binary or the e2fsck binary. In the end we identified three functions in the rehash.c source file of e2fsprogs that had to be built with -O1 for the test case to pass. It looked a bit strange to me, since usually it is a single function that gcc mis-compiles. But from the past I knew that another possible cause of interesting failures is strict aliasing in combination with wrong code, like here. A quick test build with -fno-strict-aliasing also made the problem go away.
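As a rough sketch of the source-level part of the bisecting described above, this is how the pragma and the attribute could be placed in a file. The functions here are made up for illustration and are not taken from e2fsprogs; only the pragma and attribute syntax is the point. The push_options/pop_options pair is an addition that keeps the pragma from affecting the rest of the file.

```c
#include <assert.h>
#include <stddef.h>

/* File-level narrowing: everything between push_options and pop_options
 * is compiled with -O1, the rest of the file keeps the command-line level.
 * Without the push/pop pair the pragma applies to all functions defined
 * after it in this file. */
#pragma GCC push_options
#pragma GCC optimize("O1")
unsigned checksum(const unsigned char *buf, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + buf[i];
    return sum;
}
#pragma GCC pop_options

/* Function-level narrowing: only this one suspect is built with -O1. */
__attribute__((optimize("O1")))
size_t count_zero_bytes(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0)
            zeros++;
    return zeros;
}
```

Moving the attribute from function to function, and rerunning the failing test each time, is what narrows the suspects down. For the directory and file steps the same effect comes from the build flags: when several -O options are given, gcc uses the last one.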
The gcc maintainer then identified some pieces of the code that are clearly not aliasing safe. After a short discussion with the e2fsprogs developer we decided to disable strict aliasing for this package as an interim solution, because the code is complex and it will take time to fix it properly. And the conclusion? Using non-mainstream architectures helps in discovering bugs in applications. And also in the toolchain, but that will be another story :-)
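For readers who have not met the problem before, here is a minimal sketch of what "not aliasing safe" means. It is a generic illustration, not the actual code from rehash.c, and both function names are invented. With strict aliasing enabled (gcc turns it on at -O2), the compiler may assume that a uint32_t pointer and a uint16_t pointer never refer to the same memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fine when the two pointers refer to different objects. If a caller
 * passes half = (uint16_t *)word, the behavior is undefined: with
 * -fstrict-aliasing gcc may assume the store through 'half' cannot
 * change *word and return 0x11111111 without reloading it. */
uint32_t store_then_read(uint32_t *word, uint16_t *half)
{
    *word = 0x11111111u;
    *half = 0x2222;
    return *word;
}

/* Aliasing-safe way to overwrite the first two bytes of *word:
 * memcpy accesses the object as bytes, which is always allowed. */
uint32_t patch_first_half_safe(uint32_t *word, uint16_t value)
{
    *word = 0x11111111u;
    memcpy(word, &value, sizeof value);
    return *word;
}
```

The result of the safe variant depends on byte order: 0x11112222 on little-endian machines and 0x22221111 on big-endian ones such as s390x. Building with -fno-strict-aliasing makes gcc drop the assumption for the whole package, which is the interim solution chosen here.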

EDIT 2016-03-01:
Other useful things to try are
  • __attribute__((noinline, noclone)) to make sure a function is neither inlined nor cloned
  • the -mno-lra option to disable LRA in case the code is miscompiled due to register allocation

EDIT 2016-03-22:
  • -fno-delete-null-pointer-checks and/or -fno-lifetime-dse (or -flifetime-dse=1) for detecting potentially buggy C/C++ code

EDIT 2017-02-09:
  • GCC has its own FAQ entry
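
The noinline, noclone hint from the list above could look like the following sketch. The function names are hypothetical. The idea is that a suspect function stays a separate, single copy in the binary, so changing its flags or reading its disassembly really covers all the code generated from it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* noinline keeps the body out of its callers, noclone stops gcc from
 * creating specialized copies of it. */
__attribute__((noinline, noclone))
unsigned name_hash(const char *name, size_t len)
{
    unsigned h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * 33u + (unsigned char)name[i];
    return h;
}

/* The caller is compiled as usual; it always calls the one real
 * name_hash() instead of getting an inlined copy. */
int same_hash(const char *a, const char *b)
{
    return name_hash(a, strlen(a)) == name_hash(b, strlen(b));
}
```

The attribute list can also carry optimize("O1") alongside noinline and noclone when a single function is being bisected.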