Comment 5 for bug 1640518

Andrew Morrow (acmorrow) wrote :

The following are reproduction instructions for the behavior that we are observing on Ubuntu 16.04 ppc64le. Note that we have run this same test on RHEL 7.1 ppc64le, and we do not observe any stack corruption there. Note also that building and running this repro may depend on certain system libraries (SSL, etc.) or Python libraries being available on the system. Please install as needed. The particular commit here is fairly recent, just one that I happen to know demonstrates the issue.

- git clone https://github.com/mongodb/mongo.git
- cd mongo
- git checkout 3220495083b0d678578a76591f54ee1d7a5ec5df
- git apply acm.nov9.patch
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++ CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-cache --build-fast-and-loose -j$(echo "$(grep -c processor /proc/cpuinfo)/2" | bc) ./mongo ./mongod ./mongos
- ulimit -c unlimited && python buildscripts/resmoke.py --suites=concurrency_sharded --storageEngine=wiredTiger --excludeWithAnyTags=requires_mmapv1 --dbpathPrefix=... --repeat=500 --continueOnFailure

Note that you should provide an actual path for the --dbpathPrefix option in the last step, as this is where the running database instances will store their data.

You will need to leave this running for several hours, perhaps overnight. In our runs, we find that about 1% of the repeated runs of the test fail, dropping a core.

The core files are typically (but not always!) associated with crashes of the mongos binary inside one of the several mongo::bsonExtractXXX functions, where we find our hand-rolled stack canary to be corrupted. A typical stack trace of a crashing thread looks like:

$ gdb ./mongos core.2016-11-09T23:11:56+00:00
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./mongos...done.
[New LWP 3821]
...
[New LWP 3736]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Core was generated by `/home/acm/opt/src/mongo/mongos --configdb test-configRS/ubuntu1604-ppc-dev.pic.'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00003fff779ff21c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x3fff5d98f140 (LWP 3821))]
(gdb) bt
#0 0x00003fff779ff21c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00003fff77a01894 in __GI_abort () at abort.c:89
#2 0x000000005d899504 in mongo::fassertFailedWithLocation (msgid=<optimized out>, file=0x5e6f9570 "src/mongo/bson/util/bson_extract.cpp", line=<optimized out>) at src/mongo/util/assert_util.cpp:172
#3 0x000000005d95cb64 in mongo::fassertWithLocation (line=46, file=0x5e6f9570 "src/mongo/bson/util/bson_extract.cpp", testOK=<optimized out>, msgid=100001) at src/mongo/util/assert_util.h:273
#4 mongo::(anonymous namespace)::Canary::Canary (t=0x3fff5d98be90 '\315' <repeats 199 times>, <incomplete sequence \315>..., this=<synthetic pointer>) at src/mongo/bson/util/bson_extract.cpp:46
#5 mongo::bsonExtractTypedField (object=owned BSONObj 34 bytes @ 0x10008d85a37, fieldName=..., type=mongo::Bool, outElement=0x3fff5d98c730) at src/mongo/bson/util/bson_extract.cpp:83
#6 0x000000005d95cc5c in mongo::bsonExtractBooleanField (object=..., fieldName=..., out=0x3fff5d98c798) at src/mongo/bson/util/bson_extract.cpp:101
#7 0x000000005df5e7ec in mongo::AutoSplitSettingsType::fromBSON (obj=...) at src/mongo/s/balancer_configuration.cpp:400

The hand-rolled canary is implemented as follows:

#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;numeric&gt;

namespace {

class Canary {
public:
    static constexpr size_t kSize = 2048;

    // Fill the guard region with a known pattern, then immediately read it
    // back and verify the checksum.
    __attribute__((always_inline)) explicit Canary(volatile unsigned char* const t) noexcept
        : _t(t) {
        __builtin_memset(const_cast<unsigned char*>(t), kBits, kSize);
        fassert(100001, std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
    }

    // Verify the pattern again on scope exit; any stray write into the
    // region between construction and destruction trips this check.
    __attribute__((always_inline)) ~Canary() {
        fassert(100002, std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
    }

private:
    static constexpr uint8_t kBits = 0xCD;
    static constexpr size_t kChecksum = kSize * size_t(kBits);

    const volatile unsigned char* const _t;
};

} // namespace

The setup of the Canary in mongo::bsonExtractField is here:

Status bsonExtractField(const BSONObj& object, StringData fieldName, BSONElement* outElement) {

    // Reserve a 2 KiB guard region on this frame; the Canary fills and
    // checks it on entry, and checks it again when the function returns.
    volatile unsigned char* const cookie = static_cast<unsigned char*>(alloca(Canary::kSize));
    const Canary c(cookie);

    ...

}

In the crash above, the hand-rolled canary detected stack corruption in the Canary constructor: we memset the bytes, then immediately read them back to checksum them, and they are not the same. We also sometimes see the checksum fail in the Canary destructor, but the constructor case is more interesting. What could have happened to the memory between the memset and the read-back? One hypothesis is that we leaked a pointer to a local into another thread, which then wrote through it; but if that were the case we would expect to see crashes all the time, and on other systems, and we do not. More details on that below.

Looking at the corrupted memory, we see:

(gdb) frame 4
#4 mongo::(anonymous namespace)::Canary::Canary (t=0x3fff5d98be90 '\315' <repeats 199 times>, <incomplete sequence \315>..., this=<synthetic pointer>) at src/mongo/bson/util/bson_extract.cpp:46
46 fassert(100001, std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
(gdb) print t
$1 = (volatile unsigned char * const) 0x3fff5d98be90 '\315' <repeats 199 times>, <incomplete sequence \315>...
(gdb) x /2048ub t
0x3fff5d98be90: 205 205 205 205 205 205 205 205
...
0x3fff5d98c010: 205 205 205 205 205 205 205 205
0x3fff5d98c018: 205 205 205 205 205 205 205 205
0x3fff5d98c020: 205 205 205 205 205 205 1 0
0x3fff5d98c028: 205 205 205 205 205 205 205 205
...
0x3fff5d98c680: 205 205 205 205 205 205 205 205
0x3fff5d98c688: 205 205 205 205 205 205 205 205

Interestingly, the corruption is always exactly two bytes, the corrupted bytes are always either 0x00 or 0x01, and the corruption always starts at an offset aligned to 0xe.

We have tried several things to narrow down the range of possible causes.

- We have reproduced with our home built GCC 5.4.
- We have reproduced with the system GCC 5.4.
- We have reproduced with clang-3.9.
- We have run the testcase under the clang address sanitizer, with ASAN_OPTIONS=detect_stack_use_after_return=1.
- We have run on different hardware to ensure that this is not bad memory.
- We have run on bare metal to ensure that this is not related to the virtualization layer on which most of our ppc64le Ubuntu 16.04 instances run.
- The same test case is part of our continuous integration loop and runs nightly across dozens of operating systems and compiler variations, including Windows, OS X, and Linux on x86_64 (including Ubuntu 16.04).
- We have run the same test case on RHEL 7.1 ppc64le.

Across all of these variations, the issue appears only on Ubuntu 16.04, and only when running that OS on POWER8.