MongoDB Memory corruption

Bug #1640518 reported by bugproxy
Affects           Status         Importance   Assigned to    Milestone
GLibC             Fix Released   Medium
glibc (Ubuntu)    Fix Released   Undecided    Adam Conrad
  Xenial          Confirmed      Undecided    Adam Conrad
  Yakkety         Confirmed      Undecided    Adam Conrad

Bug Description

== Comment: #0 - Calvin L. Sze <email address hidden> - 2016-11-01 23:09:10 ==
The team has switched to bare-metal Ubuntu 16.04. The problem still exists, so it is not related to virtualization.

Since the bug is complicated to reproduce, could we use a set of tools to collect data when this happens?

---Problem Description---
MongoDB has a memory corruption issue that occurs only on Ubuntu 16.04; it does not occur on Ubuntu 15.

Contact Information =Calvin Sze/Austin/IBM

---uname output---
Linux master 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:00:57 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Model: 2.1 (pvr 004b 0201) Model name: POWER8E (raw), altivec supported

---System Hang---
 the system is still alive

---Debugger---
A debugger is not configured

---Steps to Reproduce---
 Unfortunately, not very easily. I had a test case that I was running on ubuntu1604-ppc-dev.pic.build.10gen.cc and xxxx-ppc-dev.pic.build.10gen.cc. I understand these to be two VMs running on the same physical host.

About 3.5% of the test runs on ubuntu1604-ppc-dev.pic.build.10gen.cc would fail, but all of the runs on the other machine passed. Originally, this failure manifested as the GCC stack protector (from -fstack-protector-strong) claiming stack corruption.

Hoping to be able to see the data that was being written and corrupting the stack, I manually injected a guard region into the stack of the failing functions as follows:

+namespace {
+
+class Canary {
+public:
+
+    static constexpr size_t kSize = 1024;
+
+    explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+        ::memset(const_cast<unsigned char*>(_t), kBits, kSize);
+    }
+
+    ~Canary() {
+        _verify();
+    }
+
+private:
+    static constexpr uint8_t kBits = 0xCD;
+    static constexpr size_t kChecksum = kSize * size_t(kBits);
+
+    void _verify() const noexcept {
+        invariant(std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
+    }
+
+    const volatile unsigned char* const _t;
+};
+
+} // namespace
+

 Status bsonExtractField(const BSONObj& object, StringData fieldName, BSONElement* outElement) {
+
+    volatile unsigned char* const cookie = static_cast<unsigned char*>(alloca(Canary::kSize));
+    const Canary c(cookie);
+

When running with this, the invariant would sometimes fire. Examining the stack cookie under the debugger would show two consecutive bytes, always at an offset ending 0x...e, written as either 0 0, or 0 1, somewhere at random within the middle of the cookie.
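(For reference, such an examination can be done along the lines of "(gdb) x/1024xb cookie", assuming the `cookie` local is still visible in the failing frame.)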

This indicated that it was not a conventional stack smash, where we were writing past the end of a contiguous buffer. Instead, it appeared that the currently running thread had reached some arbitrary, random distance up the stack and done either two one-byte writes or an unaligned two-byte write. Another possibility was that a local variable had been transferred to another thread, which had written to it.

However, while looking at the code to find such a thing, I realized that there was another possibility, which was that the bytes had never been written correctly in the first place. I changed the stack canary constructor to be:

+    explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+        ::memset(const_cast<unsigned char*>(_t), kBits, kSize);
+        _verify();
+    }

So that immediately after writing the byte pattern to the stack buffer, we verified the contents we wrote. Amazingly, this *failed*, with the same corruption as seen before. This means that either something overwrote the stack cookie region in the window between the memset and the read-back, or the bytes were never written correctly by memset in the first place, or memset wrote the bytes but the underlying physical memory never took the write.
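One way to tease these cases apart (a rough sketch, not the code we actually run; the function name and logging are illustrative) is to verify the canary byte by byte and immediately re-read any byte that mismatches, so a value that was never written can be distinguished from one that was written and later clobbered or read back stale:

#include <cstddef>
#include <cstdio>

// Returns true if every byte of the canary holds `expected`.  On a mismatch,
// re-read the byte: if the re-read comes back as `expected` without our having
// rewritten it, the store did land and the first read was stale or racing; if
// the bad value persists, the byte was either never written or was clobbered
// in the window between memset and this check.
bool verifyCanary(const volatile unsigned char* p, std::size_t n, unsigned char expected) {
    bool ok = true;
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char first = p[i];
        if (first != expected) {
            const unsigned char again = p[i];  // volatile re-read of the same byte
            std::fprintf(stderr, "canary[%zu] = 0x%02x (re-read 0x%02x), expected 0x%02x\n",
                         i, (unsigned)first, (unsigned)again, (unsigned)expected);
            ok = false;
        }
    }
    return ok;
}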

Stack trace output:
 no

Oops output:
 no

Userspace tool common name: MongoDB

Userspace rpm: mongod

The userspace tool has the following bit modes: 64bit

System Dump Info:
  The system is not configured to capture a system dump.

Userspace tool obtained from project website: na

*Additional Instructions for Lilian Romero/Austin/IBM:
-Post a private note with access information for the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.
-Attach ltrace and strace of the userspace application.

== Comment: #1 - Luciano Chavez <email address hidden> - 2016-11-02 08:41:47 ==
Normally for userspace memory corruption problems I would recommend Valgrind's memcheck tool, though since this works on other versions of Linux, one would also want to compare the differences, such as whether or not you are using the same versions of mongodb, gcc, glibc and the kernel.

Has a standalone testcase been produced that shows the issue without mongodb?

== Comment: #2 - Steven J. Munroe <email address hidden> - 2016-11-02 10:27:40 ==
We really need that standalone test case.

We need to look at WHAT the C++ compiler is doing with memset. I suspect the compiler is short-circuiting the function and inlining it. That is what you would want for optimization, but we need to know so we can steer this to the correct team.
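A quick way to answer that question without MongoDB is a small probe translation unit compiled with the same flags (a sketch; `consume` is just a stand-in so the buffer isn't optimized away), then checking the generated assembly for a call to memset versus inline stores:

// memset_probe.cpp
#include <cstring>

void consume(unsigned char*);  // defined elsewhere, keeps the buffer live

void probe() {
    unsigned char buf[1024];
    std::memset(buf, 0xCD, sizeof(buf));  // same size and fill byte as the canary
    consume(buf);
}

// Build and inspect, e.g.:
//   g++ -O2 -mcpu=power8 -S memset_probe.cpp
//   grep -n memset memset_probe.s        # a "bl memset" indicates an out-of-line glibc call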

== Comment: #3 - Calvin L. Sze <email address hidden> - 2016-11-02 13:17:30 ==
Hi Luciano and Steve, thanks for the advice.

They don't have a standalone test case without MongoDB; I could imagine it would take a while and probably not be easy to produce. I am seeking your advice on how to approach this. The failure takes at least 24-48 hours of running to reproduce. Steve, do you have what you need for the C++ test, or is there something I need to ask the Mongo development team?

Thanks

== Comment: #4 - William J. Schmidt <email address hidden> - 2016-11-02 16:29:26 ==
(In reply to comment #3)
> Hi Luciano and Steve, thanks for the advice.
>
> They don't have a standalone test case without MongoDB; I could imagine it
> would take a while and probably not be easy to produce. I am seeking your
> advice on how to approach this. The failure takes at least 24-48 hours of
> running to reproduce. Steve, do you have what you need for the C++ test, or
> is there something I need to ask the Mongo development team?
>
> Thanks

It's unclear to me yet that we have evidence of this being a problem in the toolchain. Does the last experiment (revised Canary constructor) ALWAYS fail, or does it also fail only every 24-48 hours? If the latter, then all we know is that stack corruption happens. There's no indication of where the wild pointer is coming from (application problem, compiler problem, etc.). If it does always fail, however, then I question the assertion that they can't provide a standalone test case.

We need something more concrete to work with.

Bill

== Comment: #5 - Calvin L. Sze <email address hidden> - 2016-11-03 18:08:33 ==
Could this ticket be viewed by an external customer/ISV?
I am thinking about how to establish direct communication between the MongoDB development team and the experts/owner of this ticket, to bypass the middle man, me :-)

Here are the answers from Andrew, MongoDB's development director, to my 3 questions. In addition he added some comments.

Basically, there are 3 questions,

> 1. Is the MongoDB binary built with the gcc that comes with the Linux distribution, or with the IBM Advance Toolchain gcc?

We build our own GCC, but we have reproduced the issue with both our custom GCC and the Linux distribution's GCC. We have also reproduced with clang 3.9 built from source on the Ubuntu 16.04 POWER machine, so we do not think that this is a compiler issue (it could still be a std library issue).

> 2. Does the last experiment (revised Canary constructor) ALWAYS fail, or does it also fail only every 24-48 hours?

No, we have never been able to construct a deterministic repro. We are only able to get it to fail after running the test a very large number of times.

> 3. Is there any way we can have a standalone test case without MongoDB?

We do not have such a repro at this time.

I do understand the position they are taking - it isn't a lot of information to go on, and most of the time the correct response to a mysterious software crash is to blame the software itself, not the surrounding ecosystem. However, we have a lot of *indirect* evidence that has made us skeptical that this is our bug. We would love to be proved wrong!

- The stack corruption has not reproduced on any other systems. We are running these same tests on every commit across dozens of Linux variants, and across four cpu architectures (x86_64, POWER, zSeries, ARMv8).
- We don't see crashes on other POWER distributions, but we do on Ubuntu on POWER.
- We don't see crashes on Windows, Solaris, OS X
- We have run under the clang address sanitizer, with no reports.
- We have enabled the clang address sanitizer use-after-return detector, and found no results.

If this were a wild pointer in the MongoDB server process that was writing to the stack of other threads, we would expect to see corruption show up elsewhere, but we simply do not.

However, let's assume that this is a bug in our code that for whatever reason only reveals itself on POWER, and only on Ubuntu. We would still be interested in learning from the kernel team whether there are additional POWER-specific debugging techniques that we might be able to apply. In particular, the ability to programmatically set/unset hardware watchpoints over the stack canary. Another possibility would be to mprotect the stack canary, but it is not clear to us whether it is valid to mprotect part of the stack, either in general, or on POWER.
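For instance, an mprotect-based canary might look roughly like the sketch below (untested; it assumes Linux, that write-protecting ordinary anonymous thread-stack pages is permitted, and that burning a whole page -- 64 KiB on ppc64le -- per check is acceptable):

#include <alloca.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

void runWithProtectedCanary() {
    const std::uintptr_t page = static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));

    // Over-allocate on the stack so one whole, page-aligned page can be carved out.
    unsigned char* raw = static_cast<unsigned char*>(alloca(2 * page));
    unsigned char* cookie = reinterpret_cast<unsigned char*>(
        (reinterpret_cast<std::uintptr_t>(raw) + page - 1) & ~(page - 1));

    std::memset(cookie, 0xCD, page);

    // Make the canary page read-only: any write to it, from this thread or any
    // other, should now fault at the offending store instead of silently
    // corrupting the pattern.
    if (mprotect(cookie, page, PROT_READ) != 0) {
        return;  // kernel refused; fall back to the plain canary
    }

    // ... run the code under suspicion here ...

    mprotect(cookie, page, PROT_READ | PROT_WRITE);  // restore before unwinding
}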

We would be happy to hear any suggestions on how to proceed.

Thanks,
Andrew

== Comment: #6 - Steven J. Munroe <email address hidden> - 2016-11-03 18:34:30 ==
You could tell us what specific GCC version you are based on, and the configure options.

You could provide the disassembly of the canary code.

== Comment: #7 - William J. Schmidt <email address hidden> - 2016-11-03 23:01:55 ==
It would be useful to see what the Canary is compiled into, as Steve suggested. Let's make sure it's doing what we think it is.

Given we have multiple compilers producing the same results, we may want to think more about the runtime environment -- are you using the same glibc and libstdc++ in all cases? Clang at least would pick up the distro versions, as it doesn't provide its own.

One reason you see this on Ubuntu 16.04 and not on the other Linux distro is likely the glibc level. The other distro's glibc is quite old by comparison. glibc 2.23, which appears on Ubuntu 16.04, is the first version to be compiled with -fstack-protector-strong by default. So this doesn't necessarily mean that the bug doesn't exist elsewhere; it just means that the stack protector code isn't enabled to spot the problem. If the stack corruption is benign, then it wouldn't be noticed otherwise.

I assume that glibc 2.23 was compiled with Ubuntu's version of gcc 5 that ships with the system, in case that becomes relevant.

I don't personally have a lot of experience with trying to debug something of this nature, in case we don't see something obvious from the disassembly of the canary. CCing Ulrich Weigand in case he has some ideas of other approaches to try.

== Comment: #9 - Ulrich Weigand <email address hidden> - 2016-11-04 12:21:48 ==
I don't really have any other great ideas either. Just two comments:

- Even though the original reporter mentioned they already tried clang's address sanitizer, I'd definitely still also try reproducing the problem under valgrind -- the two are different in what exactly they detect, and using both tools on a complex problem can only help.

- The Canary code sample above has strictly speaking undefined behavior, I think: it is calling memset on a const *. (The const_cast makes the warning go away, but doesn't actually cure the undefined behavior.) I don't *think* this will cause codegen changes in this example, but it cannot hurt to try to fix this and see if anything changes.
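One way to avoid the cast on the const member is to write through the non-const constructor parameter instead, so only volatile (not const) needs to be cast away, e.g.:

+    explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+        ::memset(const_cast<unsigned char*>(t), kBits, kSize);
+        _verify();
+    }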

== Comment: #12 - Calvin L. Sze <email address hidden> - 2016-11-06 10:32:25 ==
Hi Bill, Thanks

I have asked Andrew, waiting for his confirmation.

== Comment: #14 - Calvin L. Sze <email address hidden> - 2016-11-06 10:56:49 ==
Hi Calvin -

I can provide the assembly of the function that contains the canary (the canary itself gets inlined), but I think it might just be easier if I uploaded a binary and an associated corefile? That way your engineers could disassemble the crashing function themselves in the debugger and see exactly what the state was at the time of the crash.

What is the best way for me to get that information to you?

Thanks,
Andrew

== Comment: #15 - Calvin L. Sze <email address hidden> - 2016-11-06 10:58:54 ==
Provided the binary and core information.

Note from Mongo:

   I've uploaded a sample core file and the associated binary to your ftp
server as detailed above. The binary is named `mongod.power` and the core is
named `mongod.power.core`.

   You should expect to see a backtrace on the faulting thread which looks
like this (for the first few frames):

(gdb) bt
#0 0x00003fff997be5d0 in __libc_signal_restore_set (set=0x3fff5814c1f0)
    at ../sysdeps/unix/sysv/linux/nptl-signals.h:79
#1 __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:55
#2 0x00003fff997c0c00 in __GI_abort () at abort.c:89
#3 0x00000000223c33e8 in mongo::invariantFailed (expr=<optimized out>,
    file=0x24131b38 "src/mongo/bson/util/bson_extract.cpp",
    line=<optimized out>) at src/mongo/util/assert_util.cpp:154
#4 0x00000000224bbc48 in mongo::(anonymous namespace)::Canary::_verify (
    this=<optimized out>) at src/mongo/bson/util/bson_extract.cpp:58

The "Canary::_verify" frame (number 4) has a local variable "_t" which is an
on-the-stack array and filled with "0xcd" for a span of 1024 bytes. Near the
end of this block we see two bytes of poisoned memory which were altered:

0x3fff5814c858: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
0x3fff5814c860: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
0x3fff5814c868: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0x01 0x00
0x3fff5814c870: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
0x3fff5814c878: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd

Note the two bytes set to values "0x01" and "0x00".

At the time of core-dump all the other threads seemed to be paused on system
calls such as "recv" or "__pthread_cond_wait". The verify function is called
when setting up our software canary, and checks the memory immediately after
its setup. We do not run any other functions on this thread between the
memory poisoning and the verification of the poisoning. All other threads
appear to be paused at this time.

== Comment: #16 - Calvin L. Sze <email address hidden> - 2016-11-06 10:59:40 ==
A follow up message from Mongo

The function calling the canary code, which you'll possibly want to disassemble, is in frame 6:

#6 mongo::bsonExtractStringField (object=..., fieldName=...,
    out=0x3fff5814caa8) at src/mongo/bson/util/bson_extract.cpp:138

   The lower numbered frames deal with the canary code itself.

== Comment: #17 - Calvin L. Sze <email address hidden> - 2016-11-06 11:03:46 ==
From Andrew,

>Given we have multiple compilers producing the same results, we may want to
>think more about the runtime environment -- are you using the same glibc and
>libstdc++ in all cases? Clang at least would pick up the distro versions, as
>it doesn't provide its own.

We have repro'd with three compilers:

- The system GCC, using system libstdc++ and system glibc
- Our hand-rolled GCC, using its own libstdc++, and system glibc
- One off clang-3.9 build, using system libstdc++, and system glibc.

Coincidentally, both system and hand-rolled GCC are 5.4.0, so there may not be as much variation there as hoped. We could try building with clang and libc++ to at least rule out libstdc++ as a factor.

>One reason you see this on Ubuntu 16.04 and not on the other linux distro is likely because of
>glibc level. The other linux distro's glibc is quite old by comparison. glibc 2.23, which
>appears on Ubuntu 16.04, is the first version to be compiled with
>-fstack-protector-strong by default.

I'm not sure I follow. Our software has been built with -fstack-protector-strong on both platforms, whether or not glibc has been, and the invocation of the __stack_chk_fail function is always from our code, not from glibc, or libstdc++. So, I'd expect that if there were stack corruption taking place as a result of our code, we would see the stack protector trip on both platforms. Or are you saying that on platforms where glibc itself wasn't built with -fstack-protector-whatever that user code built with that same flag won't report errors?

>So this doesn't necessarily mean that the
>bug doesn't exist elsewhere; it just means that the stack protector code isn't
>enabled to spot the problem. If the stack corruption is benign, then it
>wouldn't be noticed otherwise.

Yeah, still confused. I can definitely make the other linux distro box report a stack corruption:

[<email address hidden> ~]$ cat > boom.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct no_chars {
    unsigned int len;
    unsigned int data;
};

int main(int argc, char * argv[])
{
    struct no_chars info = { };

    if (argc < 3) {
        fprintf(stderr, "Usage: %s LENGTH DATA...\n", argv[0]);
        return 1;
    }

    info.len = atoi(argv[1]);
    memcpy(&info.data, argv[2], info.len);

    return 0;
}
[<email address hidden> ~]$ gcc -Wall -O2 -U_FORTIFY_SOURCE -fstack-protector-strong boom.c -o boom

[<email address hidden> ~]$ ./boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
*** stack smashing detected ***: ./boom terminated
Segmentation fault

>I assume that glibc 2.23 was compiled with Ubuntu's version of gcc 5 that ships
>with the system, in case that becomes relevant.

Correct, we have not made any changes to glibc - we are using the stock version that ships on the system.

== Comment: #18 - Calvin L. Sze <email address hidden> - 2016-11-06 11:04:24 ==
From Andrew

Also, I want to reiterate that while we have definitely observed cases where the stack protector detects the stack corruption, we have also observed stack corruption within our own hand-rolled stack buffer, per the code posted earlier. The core dump that Adam provided is of this latter sort. So to some extent, this is independent of -fstack-protector-strong.

One thing that I have not yet ruled out is whether -fstack-protector-strong could itself be at fault, somehow, though I find that unlikely given that we have reproduced with clang as well.

Still, it sounds like a worthwhile experiment, so I will see if I can still detect corruption in our hand-rolled stack canary when building without any form of -fstack-protector enabled.

== Comment: #19 - Calvin L. Sze <email address hidden> - 2016-11-06 11:05:58 ==
From Andrew,

I've performed this experiment, replacing our use of -fstack-protector-strong with -fno-stack-protector when building MongoDB, and I can confirm that we still observe stack corruption in our hand-rolled canary, per the code posted earlier.

I have a core file and executable. Let me know if you would be interested in my providing those in addition to the files provided yesterday by Adam.

== Comment: #21 - William J. Schmidt <email address hidden> - 2016-11-07 11:10:54 ==
Andrew, thanks for all the details, and for the binary and core file! I'll start poking through them this morning. I've just been absorbing all the notes that Calvin dumped into our bug tracking system yesterday.

You can ignore what I was saying about -fstack-protector-strong. My thought at the time was that *if* the flow of control entered glibc, that whether or not the code *there* was compiled with -fstack-protector-strong might prove to make a difference. Reading back through today, I see that was off base, so sorry for the distraction.

While I'm looking at the binary, there are a couple of other things you might want to try:
 - Replace ::memset with __builtin_memset with GCC to see whether that makes any difference;
 - Try Ulrich Weigand's suggestions from comment #9;
 - As you suggested, try clang + libc++ to try to rule libstdc++ in or out.

A couple of questions that may or may not prove relevant:
 - You've mentioned you don't get the crashes on the other linux distro. Have you tried your modified canary on the other linux distro anyway? If we're certain the two systems behave differently with the canary that may help us in narrowing things down.
 - Which version of the C++ standard are you compiling against? Is it just the default on all systems, or are you forcing a specific -std=...?

== Comment: #22 - William J. Schmidt <email address hidden> - 2016-11-07 12:18:41 ==
I'm having some difficulties with core file compatibility. I put your files on an Ubuntu 16.04.1 system, but I don't see quite the same results as you report under gdb, with libc and libgcc shared libs not at the correct address and a problem with the stack. There's a transcript below. I'm particularly concerned about the warning that the core file and executable may not match. Note also the report of stack corruption above frame #4, so I can't get to frame #6 to look at the register state. The library frames at #0-#3 are reporting the wrong information, which I assume to be because the libraries are at the wrong address.

For debug purposes it would probably be best to use the system compiler, just in case that wasn't the case here.

$ ls -l
total 1950688
-rw-r--r-- 1 wschmidt wschmidt 700141992 Nov 7 14:37 mongod.power
-rw-r--r-- 1 wschmidt wschmidt 1297350656 Nov 7 14:39 mongod.power.core
$ gdb mongod.power mongod.power.core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mongod.power...done.

warning: core file may not match specified executable file.
[New LWP 101461]
...
[New LWP 101440]

warning: .dynamic section for "/lib/powerpc64le-linux-gnu/libgcc_s.so.1" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib/powerpc64le-linux-gnu/libc.so.6" is not at the expected address (wrong library or version mismatch?)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Core was generated by `/home/pic1user/proj/mongo-repro/mongod --oplogSize 1024 --port 30012 --nopreall'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00003fff997be5d0 in __copysign (y=<optimized out>, x=<optimized out>)
    at ../sysdeps/generic/math_private.h:233
233 ../sysdeps/generic/math_private.h: No such file or directory.
[Current thread is 1 (Thread 0x3fff5814ec20 (LWP 101461))]
(gdb) bt
#0 0x00003fff997be5d0 in __copysign (y=<optimized out>, x=<optimized out>)
    at ../sysdeps/generic/math_private.h:233
#1 __modf_power5plus (x=-6.2774385622041925e+66, iptr=0x3fff5814c1f0)
    at ../sysdeps/powerpc/power5+/fpu/s_modf.c:44
#2 0x00003fff997be4f0 in ?? () from /lib/powerpc64le-linux-gnu/libc.so.6
#3 0x00003fff997c0c00 in ?? () at ../signal/allocrtsig.c:45
   from /lib/powerpc64le-linux-gnu/libc.so.6
#4 0x00000000223c33e8 in mongo::invariantFailed (expr=<optimized out>,
    file=0x24131b38 "src/mongo/bson/util/bson_extract.cpp",
    line=<optimized out>) at src/mongo/util/assert_util.cpp:154
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) quit
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc64le-linux-gnu/5/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-ppc64el/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-ppc64el --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-ppc64el --with-arch-directory=ppc64le --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-secureplt --with-cpu=power8 --enable-targets=powerpcle-linux --disable-multilib --enable-multiarch --disable-werror --with-long-double-128 --enable-checking=release --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2)
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
$

I'll disassemble the binary and see if I can spot anything without the state information.

Oh, still waiting on permission to mirror the bug.

== Comment: #23 - William J. Schmidt <email address hidden> - 2016-11-07 13:39:45 ==
A little more information:

I've been looking at bsonExtractStringField's disassembly. It appears that this binary inlines the call to the Canary constructor as well as the call to _verify. As evidence, I see the PLT call to glibc's memset:

  8ebb3c: 71 c9 06 48 bl 9584ac <00000d72.plt_call.memset@@GLIBC_2.17>

And later I see the call to invariantFailed:

  8ebc44: e9 75 f0 4b bl 7f322c <_ZN5mongo15invariantFailedEPKcS1_j+0x8>

So we've answered Steve's initial question about which memset we're using. This isn't being inlined by the compiler, but does an out-of-line dynamic call to the GLIBC_2.17 version.

I'm not sure whether GCC would inline a 1024-byte memset using __builtin_memset, or just end up calling out the same way, but it might be worth trying out that replacement, and disassembling bsonExtractStringField again to see if the PLT call has gone away.
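A check along these lines on the rebuilt binary (the binary and stub names here are illustrative and may differ) would show whether it has:

  objdump -d -C mongod | grep 'plt_call.memset'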

== Comment: #24 - William J. Schmidt <email address hidden> - 2016-11-07 13:50:04 ==
I forgot to mention that the ensuing code generation to accumulate the checksum and test it is completely straightforward and looks correct. So this looks like pretty strong evidence that the problem is in the GLIBC memset implementation.

  8ebb3c: 71 c9 06 48 bl 9584ac <00000d72.plt_call.memset@@GLIBC_2.17>
  8ebb40: 18 00 41 e8 ld r2,24(r1)
  8ebb44: 00 04 40 39 li r10,1024
  8ebb48: 00 00 20 39 li r9,0
  8ebb4c: a6 03 49 7d mtctr r10
  8ebb50: 00 00 43 89 lbz r10,0(r3)
  8ebb54: 01 00 63 38 addi r3,r3,1
  8ebb58: 14 52 29 7d add r9,r9,r10
  8ebb5c: f4 ff 00 42 bdnz 8ebb50 <_ZN5mongo22bsonExtractStringFieldERKNS_7BSONObjENS_10StringDataEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x80>
  8ebb60: 03 00 40 3d lis r10,3
  8ebb64: 00 34 4a 61 ori r10,r10,13312
  8ebb68: 00 50 a9 7f cmpd cr7,r9,r10
  8ebb6c: c4 00 9e 40 bne cr7,8ebc30 <_ZN5mongo22bsonExtractStringFieldERKNS_7BSONObjENS_10StringDataEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x160>

...

  8ebc30: 44 ff 82 3c addis r4,r2,-188
  8ebc34: 44 ff 62 3c addis r3,r2,-188
  8ebc38: 3a 00 a0 38 li r5,58
  8ebc3c: 38 aa 84 38 addi r4,r4,-21960
  8ebc40: 60 aa 63 38 addi r3,r3,-21920
  8ebc44: e9 75 f0 4b bl 7f322c <_ZN5mongo15invariantFailedEPKcS1_j+0x8>

== Comment: #28 - William J. Schmidt <email address hidden> - 2016-11-08 11:02:18 ==
Recording some information from email discussions.

(1) The customer is planning to attempt to use valgrind memcheck.
(2) The const cast problem with the canary has been fixed without changing the results.
(3) Prior to that fix, the canary was used on the RHEL system with no corruption detected, so this does seem to be Ubuntu-specific.
(4) -std=c++11 is used everywhere.
(5) The core and binary compatibility issues appear to be that they were generated on 16.10, not 16.04. New ones coming.
(6) The canary code now looks like:

+namespace {
+
+class Canary {
+public:
+
+    static constexpr size_t kSize = 2048;
+
+    explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+        __builtin_memset(const_cast<unsigned char*>(t), kBits, kSize);
+        _verify();
+    }
+
+    ~Canary() {
+        _verify();
+    }
+
+private:
+    static constexpr uint8_t kBits = 0xCD;
+    static constexpr size_t kChecksum = kSize * size_t(kBits);
+
+    void _verify() const noexcept {
+        invariant(std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
+    }
+
+    const volatile unsigned char* const _t;
+};
+
+} // namespace
+

And its application in bsonExtractTypedField looks like:

@@ -47,6 +82,10 @@ Status bsonExtractTypedField(const BSONObj& object,
                              StringData fieldName,
                              BSONType type,
                              BSONElement* outElement) {
+
+    volatile unsigned char* const cookie = static_cast<unsigned char*>(alloca(Canary::kSize));
+    const Canary c(cookie);
+
     Status status = bsonExtractField(object, fieldName, outElement);

(7) Steve Munroe investigated memset and he and Andrew are in agreement that we can rule it out:

I looked at the memset_power8 code (memset is just an IFUNC resolver stub), and I don't see how this problem is caused by memset_power8.

First some observations:

The canary is allocated with alloca for a large power of 2 (1024 bytes).
Alloca returns quadword aligned memory as required to maintain quadword stack alignment.
For this case memset_power8 will quickly jump to the vector store loop (quadword x 8) all from the same register (a vector splat of the fill char).

With this code the failure modes could only be:
Overwrite by N*quadwords,
Underwrite by N*quadwords,
A repeated pattern every quadword.

But we are not seeing this. I also think we are back to a clobber by some other code.

== Comment: #29 - William J. Schmidt <email address hidden> - 2016-11-08 11:03:33 ==
From Andrew, difficulties with Valgrind:

I did try the valgrind repro. However, I'm not able to make valgrind work:

The first try resulted in lots of "mismatched free/delete" reports, which is sort of odd, because they all seem to be from within the standard library:

> valgrind --soname-synonyms=somalloc=NONE --track-origins=yes --leak-check=no ./mongos
==17387== Memcheck, a memory error detector
==17387== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==17387== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==17387== Command: ./mongos
==17387==
==17387== Mismatched free() / delete / delete []
==17387== at 0x4895888: free (in /usr/lib/valgrind/vgpreload_memcheck-ppc64le-linux.so)
==17387== by 0x59514F: deallocate (new_allocator.h:110)
==17387== by 0x59514F: deallocate (alloc_traits.h:517)
==17387== by 0x59514F: _M_deallocate_buckets (hashtable_policy.h:2010)
==17387== by 0x59514F: _M_deallocate_buckets (hashtable.h:356)
==17387== by 0x59514F: _M_deallocate_buckets (hashtable.h:361)
==17387== by 0x59514F: _M_rehash_aux (hashtable.h:1999)
==17387== by 0x59514F: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_rehash(unsigned long, unsigned long const&) (hashtable.h:1953)
==17387== by 0x595253: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, true>*) (hashtable.h:1600)
==17387== by 0x5954D3: std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (hashtable_policy.h:600)
==17387== by 0x593693: operator[] (unordered_map.h:668)
==17387== by 0x593693: mongo::InitializerDependencyGraph::addInitializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<mongo::Status (mongo::InitializerContext*)> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (initializer_dependency_graph.cpp:58)
==17387== by 0x591057: mongo::GlobalInitializerRegisterer::GlobalInitializerRegisterer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<mongo::Status (mongo::InitializerContext*)> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (global_initializer_registerer.cpp:44)
==17387== by 0x52D46F: __static_initialization_and_destruction_0(int, int) [clone .constprop.34] (mongos_options_init.cpp:39)
==17387== by 0x137FED3: __libc_csu_init (in /home/acm/opt/src/mongo/mongos)
==17387== by 0x4F830A7: generic_start_main.isra.0 (libc-start.c:247)
==17387== by 0x4F83337: (below main) (libc-start.c:116)
==17387== Address 0x5151fb0 is 0 bytes inside a block of size 16 alloc'd
==17387== at 0x48951D4: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-ppc64le-linux.so)
==17387== by 0x59328F: allocate (new_allocator.h:104)
==17387== by 0x59328F: allocate (alloc_traits.h:491)
==17387== by 0x59328F: std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, true> > >::_M_allocate_buckets(unsigned long) [clone .isra.108] (hashtable_policy.h:1996)
==17387== by 0x595093: _M_allocate_buckets (hashtable.h:347)
==17387== by 0x595093: _M_rehash_aux (hashtable.h:1974)
==17387== by 0x595093: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_rehash(unsigned long, unsigned long const&) (hashtable.h:1953)
==17387== by 0x595253: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, true>*) (hashtable.h:1600)
==17387== by 0x5954D3: std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mongo::InitializerDependencyGraph::NodeData> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (hashtable_policy.h:600)
==17387== by 0x59356B: operator[] (unordered_map.h:668)
==17387== by 0x59356B: mongo::InitializerDependencyGraph::addInitializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<mongo::Status (mongo::InitializerContext*)> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (initializer_dependency_graph.cpp:46)
==17387== by 0x591057: mongo::GlobalInitializerRegisterer::GlobalInitializerRegisterer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<mongo::Status (mongo::InitializerContext*)> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (global_initializer_registerer.cpp:44)
==17387== by 0x52D46F: __static_initialization_and_destruction_0(int, int) [clone .constprop.34] (mongos_options_init.cpp:39)
==17387== by 0x137FED3: __libc_csu_init (in /home/acm/opt/src/mongo/mongos)
==17387== by 0x4F830A7: generic_start_main.isra.0 (libc-start.c:247)
==17387== by 0x4F83337: (below main) (libc-start.c:116)

So, that is a puzzle. However, I can instruct valgrind to ignore that. But it still fails to start, now with something more odd:

$ valgrind --show-mismatched-frees=no --soname-synonyms=somalloc=NONE --track-origins=yes --leak-check=no ./mongos
==19834== Memcheck, a memory error detector
==19834== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==19834== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==19834== Command: ./mongos
==19834==
MC_(get_otrack_shadow_offset)(ppc64)(off=1688,sz=8)

Memcheck: mc_machine.c:329 (get_otrack_shadow_offset_wrk): the 'impossible' happened.

host stacktrace:
==19834== at 0x3808D9B8: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3808DB5F: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3808DCDB: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x38078CE3: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x38076FAB: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x380BAA2B: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x381B9BB7: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x380BE19F: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3810D04F: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3810FFEF: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)
==19834== by 0x3812BB97: ??? (in /usr/lib/valgrind/memcheck-ppc64le-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 19834)
==19834== at 0x4F3AC14: __lll_lock_elision (elision-lock.c:60)
==19834== by 0x4F2BBC7: pthread_mutex_lock (pthread_mutex_lock.c:92)
==19834== by 0x602753: mongo::DBConnectionPool::DBConnectionPool() (connpool.cpp:196)
==19834== by 0x5319EB: __static_initialization_and_destruction_0 (global_conn_pool.cpp:35)
==19834== by 0x5319EB: _GLOBAL__sub_I__ZN5mongo14globalConnPoolE (global_conn_pool.cpp:39)
==19834== by 0x137FED3: __libc_csu_init (in /home/acm/opt/src/mongo/mongos)
==19834== by 0x4F830A7: generic_start_main.isra.0 (libc-start.c:247)
==19834== by 0x4F83337: (below main) (libc-start.c:116)

Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using. Thanks.

I'm not really sure what to make of that, except that I did see something die in the same place, once or twice (__lll_lock_elision), when running with clang ASAN with the stack-use-after-return checking enabled. I wasn't really sure what to make of that, but it is interesting that this has turned up twice. I presume this is related to hardware lock elision?

Anyway, it doesn't seem like I can get this running with valgrind. Happy to try again if anyone is aware of a workaround.

== Comment: #30 - William J. Schmidt <email address hidden> - 2016-11-08 11:06:00 ==
CCing Carl Love. Carl, have you seen this sort of interaction between valgrind and lock elision before? (Comment #29, you can ignore the rest of this bugzilla for now.)

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-148069 severity-critical targetmilestone-inin16045
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → gcc-4.8 (Ubuntu)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-11-09 10:52 EDT-------
Hello Canonical,
Sending this bug to you for awareness and advice.

Matthias Klose (doko)
affects: gcc-4.8 (Ubuntu) → gcc-5 (Ubuntu)
Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

For what it's worth, GCC is just a placeholder package for now. The compiler and libraries aren't directly implicated, at least as of now. The origin of the stack corruption remains unknown.

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Andrew, according to the valgrind community, the error

  Memcheck: mc_machine.c:329 (get_otrack_shadow_offset_wrk): the 'impossible' happened.

is probably due to the lock elision code accessing a hardware register that valgrind doesn't know about, so there isn't a shadow register to consult. Our valgrind guys will look into that aspect of things. In the meantime, they tell us that you can circumvent this problem by not using --track-origins. Could you please give this a try?
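That is, something along the lines of the earlier invocation with --track-origins dropped:

  valgrind --show-mismatched-frees=no --soname-synonyms=somalloc=NONE --leak-check=no ./mongos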

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

I had another thought this evening. In case this is a threading problem, have you tried building with Clang and using ThreadSanitizer? Support for this was added to ppc64el in 2015.
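(Typically that just means adding -fsanitize=thread to both the compile and link flags, e.g. clang++ -std=c++11 -O1 -g -fsanitize=thread ...; a generic example, not a MongoDB build recipe.)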

Revision history for this message
Andrew Morrow (acmorrow) wrote :
Download full text (7.6 KiB)

The following are reproduction instructions for the behavior that we are observing on Ubuntu 16.04 ppc64le. Note that we have run this same test on RHEL 7.1 ppc64le, and we do not observe any stack corruption. Note also that building and running this repro may depend on certain system libraries (SSL, etc) or python libraries being available on the system. Please install as needed. The particular commit here is fairly recent, just one that I happen to know demonstrates the issue.

- git clone https://github.com/mongodb/mongo.git
- cd mongo
- git checkout 3220495083b0d678578a76591f54ee1d7a5ec5df
- git apply acm.nov9.patch
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++ CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-cache --build-fast-and-loose -j$(echo "$(grep -c processor /proc/cpuinfo)/2" | bc) ./mongo ./mongod ./mongos
- ulimit -c unlimited && python buildscripts/resmoke.py --suites=concurrency_sharded --storageEngine=wiredTiger --excludeWithAnyTags=requires_mmapv1 --dbpathPrefix=... --repeat=500 --continueOnFailure

Note that you should provide an actual argument for the --dbpathPrefix argument in the last step, as this is where the running database instances will store data.

You will need to leave this running for several hours, perhaps overnight. In our runs, we find that about 1% of the repeated runs of the test fail, dropping a core.

The core files are typically (but not always!) associated with crashes of the mongos binary inside one of the several mongo::bsonExtractXXX functions, where we find our hand-rolled stack canary to be corrupted. A typical stack trace of a crashing thread looks like:

$ gdb ./mongos core.2016-11-09T23:11:56+00:00
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./mongos...done.
[New LWP 3821]
...
[New LWP 3736]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Core was generated by `/home/acm/opt/src/mongo/mongos --configdb test-configRS/ubuntu1604-ppc-dev.pic.'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00003fff779ff21c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x3fff5d98f140 (LWP 3821))]
(gdb) bt
#0 0x00003fff779ff21c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00003fff77a01894 in __GI_abort () at abort.c:89
#2 ...

Read more...

Revision history for this message
Andrew Morrow (acmorrow) wrote :

Here is the patch for the above comment

Revision history for this message
bugproxy (bugproxy) wrote :

- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++ CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-cache --build-fast-and-loose -j ./mongo ./mongod ./mongos

Revision history for this message
Andrew Morrow (acmorrow) wrote :

Bill -

I will try again with valgrind without --track-origins=yes and post any interesting findings. Re ThreadSanitizer, we have tried before without success. The last time we tried, it didn't work because clang TSAN didn't support exceptions. Perhaps that has changed? We really like the sanitizers (we run ASAN and UBSAN in our CI loop), but our experience has been that there is a significant period of scrubbing out false positives and adding suppressions before meaningful signal can be extracted. I'm happy to add it to the list of things to try though.

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Hi Andrew -- not sure about Clang TSAN supporting exceptions at this point; it is probable that we don't have a solution there as I would expect that to require target support, and I've not heard of that happening for POWER. That said, I've been less connected to the Clang community for the last year or so, so anything's possible.

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "Apply to 3220495083b0d678578a76591f54ee1d7a5ec5df" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Revision history for this message
Andrew Morrow (acmorrow) wrote :
Download full text (3.2 KiB)

Overnight, I ran this test case on both an Ubuntu 16.04 ppc64le system and a RHEL 7.1 ppc64le system.

The test ran 219 times on Ubuntu, producing 15 core files, for a failure rate of around 5%. Most of the time the corruption was detected in the Canary ctor (before doing other work), but a few times in the dtor:

$ grep "10000[12]" resmoke.log
[js_test:fsm_all_sharded_replication] 2016-11-09T21:06:42.140+0000 s40019| 2016-11-09T21:06:42.140+0000 I - [conn59] Fatal Assertion 100002 at src/mongo/bson/util/bson_extract.cpp 50
[js_test:fsm_all_sharded_replication] 2016-11-09T23:11:56.413+0000 s40019| 2016-11-09T23:11:56.412+0000 I - [conn1] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T03:49:43.577+0000 s40019| 2016-11-10T03:49:43.576+0000 I - [conn47] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T04:54:20.748+0000 s40019| 2016-11-10T04:54:20.745+0000 I - [conn28] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T04:56:43.929+0000 s40020| 2016-11-10T04:56:43.929+0000 I - [conn30] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T07:26:14.013+0000 s40019| 2016-11-10T07:26:14.013+0000 I - [conn3] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T07:52:45.732+0000 s40019| 2016-11-10T07:52:45.698+0000 I - [conn1] Fatal Assertion 100002 at src/mongo/bson/util/bson_extract.cpp 50
[js_test:fsm_all_sharded_replication] 2016-11-10T07:54:45.606+0000 s40020| 2016-11-10T07:54:45.606+0000 I - [conn30] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T08:28:04.278+0000 s40020| 2016-11-10T08:28:04.277+0000 I - [conn13] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T09:28:24.221+0000 s40020| 2016-11-10T09:28:24.221+0000 I - [conn32] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T09:41:14.880+0000 s40019| 2016-11-10T09:41:14.800+0000 I - [conn48] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T10:35:58.143+0000 s40019| 2016-11-10T10:35:58.118+0000 I - [conn2] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T11:06:01.027+0000 s40019| 2016-11-10T11:06:01.027+0000 I - [conn1] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T12:37:43.100+0000 s40019| 2016-11-10T12:37:43.100+0000 I - [conn1] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46
[js_test:fsm_all_sharded_replication] 2016-11-10T13:31:31.912+0000 s40020| 2016-11-10T13:31:31.912+0000 I - [conn1] Fatal Assertion 100001 at src/mongo/bson/util/bson_extract.cpp 46

The test ran 227 times ...

Revision history for this message
Andrew Morrow (acmorrow) wrote :

I tried valgrind as suggested above.

By adding --show-mismatched-frees=no and removing --track-origins=yes I was able to get the process to start up without a lot of false positives. However, the server process fails to open its listening socket, because valgrind reports an unsupported syscall:

[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.496+0000 s40019| --17827-- WARNING: unhandled ppc64le-linux syscall: 326
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.496+0000 s40019| --17827-- You may be able to write your own handler.
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.496+0000 s40019| --17827-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.497+0000 s40019| --17827-- Nevertheless we consider this a bug. Please report
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.497+0000 s40019| --17827-- it at http://valgrind.org/support/bug_reports.html.
[js_test:fsm_all_sharded_replication] 2016-11-10T14:24:04.509+0000 s40019| 2016-11-10T14:24:04.509+0000 I - [mongosMain] Assertion: 15863:listen(): invalid socket? Function not implemented src/mongo/util/net/listen.cpp 184

The code around listen.cpp:184 looks like:

    182     SOCKET sock = ::socket(me.getType(), SOCK_STREAM, 0);
    183     ScopeGuard socketGuard = MakeGuard(&closesocket, sock);
    184     massert(15863,
    185             str::stream() << "listen(): invalid socket? " << errnoWithDescription(),
    186             sock >= 0);

It looks as if valgrind on ppc64le doesn't support the socket syscall? Note that I happened to be subscribed to the valgrind-developers list so I saw the other post there, and I've followed up with this same information already.

That discussion is here: https://sourceforge.net/p/valgrind/mailman/message/35483420/

In any event, it doesn't look like valgrind is going to be able to help here.

Revision history for this message
Andrew Morrow (acmorrow) wrote :

OK, I upgraded valgrind to 3.12 on the power machine and I can now get it to run meaningfully. We are seeing many error reports of the following form:

[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== Thread 50:
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== Invalid read of size 2
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== at 0x4F2AD20: __lll_unlock_elision (elision-unlock.c:36)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== by 0x4F1DB07: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:64)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== by 0x4F1DB07: pthread_mutex_unlock (pthread_mutex_unlock.c:314)

Or

[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== Invalid write of size 2
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== at 0x4F2AD30: __lll_unlock_elision (elision-unlock.c:37)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== by 0x4F1DB07: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:64)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== by 0x4F1DB07: pthread_mutex_unlock (pthread_mutex_unlock.c:314)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.999+0000 s40019| ==34604== by 0xD803C7: operator()<const mongo::executor::TaskExecutor::RemoteCommandCallbackArgs&, long unsigned int&, void> (functional:600)

In all cases, the invalid access appears to be into a freed block. Frequently, the reported address ends in 0xe ('Address 0x...e'). So, this is very interesting.

Another engineer and I took a close look at one of these instances, and we do not believe there is any way that the mutex could be accessed after it was deleted.

Is there a way we can disable the libc lock elision code? An environment variable or other similar setting? We would like to see if we still see these sorts of reports after disabling lock elision. If so, then it would almost certainly be a logic error in our code that we are just missing. On the other hand, if the valgrind reports go away when we disable lock elision, then it would be evidence that lock elision might be at fault for the stack corruption we are observing, at which point I would re-try our original repro.

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Hi Andrew,

That indeed looks suspicious. I've been talking with our libc team. It appears that the existing patch that provides for disabling lock elision dynamically isn't present in the libc on Ubuntu 16.04, which is very unfortunate. They are thinking about other possible solutions.

This seems a likely culprit, as lock elision is not enabled on the other configurations where you've seen success, to my knowledge.

We may need to get a special test build of glibc with --disable-lock-elision for you to test with. I will ask the Ubuntu toolchain team whether this is a possibility.

Thanks for persevering through the valgrind concerns to get to this clue!

Bill

Revision history for this message
Peter Bergner (pbergner) wrote :

The following might override the HTM lock elision. Can someone try it to see if it works?

bergner@ampere:~$ cat pthread_mutex_lock.c
#include <pthread.h>

#define PTHREAD_MUTEX_NO_ELISION_NP 512
extern int __pthread_mutex_lock (pthread_mutex_t *);

int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
  mutex->__data.__kind |= PTHREAD_MUTEX_NO_ELISION_NP;
  return __pthread_mutex_lock (mutex);
}
bergner@ampere:~$ gcc -c -fPIC pthread_mutex_lock.c
bergner@ampere:~$ gcc -shared -Wl,-soname,libfoo.so.1 -o libfoo.so.1 pthread_mutex_lock.o
bergner@ampere:~$ LD_PRELOAD=./libfoo.so.1 ./a.out

...replacing ./a.out with the binary you want to run without lock elision.

Revision history for this message
Andrew Morrow (acmorrow) wrote :

I have the libfoo.so.1 interposer running, I will let it run overnight and report back tomorrow with any interesting findings.

Revision history for this message
Andrew Morrow (acmorrow) wrote :

I don't think the interposition is working, or I'm doing something wrong.

I changed pthread_mutex_lock.c to the following:

$ cat pthread_mutex_lock.c
#include <pthread.h>
#include <stdlib.h>

#define PTHREAD_MUTEX_NO_ELISION_NP 512
extern int __pthread_mutex_lock (pthread_mutex_t *);

int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
  abort();
  mutex->__data.__kind |= PTHREAD_MUTEX_NO_ELISION_NP;
  return __pthread_mutex_lock (mutex);
}

$ gcc -c -fPIC pthread_mutex_lock.c
$ gcc -shared -Wl,-soname,libfoo.so.1 -o libfoo.so.1

I then wrote a small C++ program:

$ cat use_foo.cpp
#include <mutex>

int main(int argc, char* argv[]) {
    std::mutex m;
    std::lock_guard<std::mutex> guard(m);
    return EXIT_SUCCESS;
}

Compiled it to a.out:

$ g++ -std=c++11 -pthread ./use_foo.cpp

When run, it does *not* terminate in abort:

LD_PRELOAD=$(realpath ./libfoo.so.1 ) ./a.out ; echo $?

Under gdb, I can see that the libfoo.so.1 library is loaded:

$ LD_PRELOAD=$(realpath ./libfoo.so.1 ) gdb ./a.out
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...(no debugging symbols found)...done.
(gdb) sta
Temporary breakpoint 1 at 0x10000a24
Starting program: /home/acm/opt/src/1640518/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".

Temporary breakpoint 1, 0x0000000010000a24 in main ()
(gdb) info inferior
  Num Description Executable
* 1 process 14046 /home/acm/opt/src/1640518/a.out
(gdb) !lsof -p 14046 | grep lib
a.out 14046 amorrow mem REG 8,2 74328 2621621 /lib/powerpc64le-linux-gnu/libgcc_s.so.1
a.out 14046 amorrow mem REG 8,2 856616 2622143 /lib/powerpc64le-linux-gnu/libm-2.23.so
a.out 14046 amorrow mem REG 8,2 1851512 2622139 /lib/powerpc64le-linux-gnu/libc-2.23.so
a.out 14046 amorrow mem REG 8,2 171632 2622098 /lib/powerpc64le-linux-gnu/libpthread-2.23.so
a.out 14046 amorrow mem REG 8,2 2042040 2752992 /usr/lib/powerpc64le-linux-gnu/libstdc++.so.6.0.21
a.out 14046 amorrow mem REG 8,2 69136 6182484 /home/acm/opt/src/1640518/libfoo.so.1
a.out 14046 amorrow mem REG 8,2 268976 2622140 /lib/powerpc64le-linux-gnu/ld-2.23.so

Setting a breakpoint in pthread_mutex_lock lands me in __GI___pthread_mutex_lock. Perhaps the interposition symbol needs to be changed?

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-11-10 19:54 EDT-------
(In reply to comment #49)

Something doesn't look right in your commands. They all have "LD_PRELOAD=" with nothing after the '=':
Comment #49 has:
> $ LD_PRELOAD= gdb ./a.out

That should be:
$ LD_PRELOAD=/path-to.../libfoo.so.1 gdb ./a.out

Also, if you use LD_PRELOAD (without export) for the gdb command, I wouldn't expect it to carry forward to the inferior. Although, as you say, gdb shows the library loaded.

Revision history for this message
Peter Bergner (pbergner) wrote :

When I add the abort and use your C++ test case, I see the abort:

bergner@ampere:~$ cat pthread_mutex_lock.c
#include <stdlib.h>
#include <pthread.h>

#define PTHREAD_MUTEX_NO_ELISION_NP 512
extern int __pthread_mutex_lock (pthread_mutex_t *);

int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
  abort();
  mutex->__data.__kind |= PTHREAD_MUTEX_NO_ELISION_NP;
  return __pthread_mutex_lock (mutex);
}
bergner@ampere:~$ gcc -fPIC -c pthread_mutex_lock.c
bergner@ampere:~$ gcc -shared -Wl,-soname,libbar.so.1 -o libbar.so.1 pthread_mutex_lock.o
bergner@ampere:~$ cat foo.cpp
#include <mutex>

int main(int argc, char* argv[]) {
    std::mutex m;
    std::lock_guard<std::mutex> guard(m);
    return EXIT_SUCCESS;
}
bergner@ampere:~$ g++ -std=c++11 -pthread foo.cpp
bergner@ampere:~$ ./a.out
bergner@ampere:~$ LD_PRELOAD=./libbar.so.1 ./a.out
Aborted

Revision history for this message
Peter Bergner (pbergner) wrote :

gdb shows the abort is from the shim library too:

bergner@ampere:~$ gdb -q ./a.out
Reading symbols from ./a.out...(no debugging symbols found)...done.
(gdb) set environment LD_PRELOAD=./libbar.so.1
(gdb) run
Starting program: /home/bergner/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".

Program received signal SIGABRT, Aborted.
0x00003fffb7b3f27c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00003fffb7b3f27c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00003fffb7b418f4 in __GI_abort () at abort.c:89
#2 0x00003fffb7f507e4 in pthread_mutex_lock () from ./libbar.so.1
#3 0x0000000010000954 in __gthread_mutex_lock(pthread_mutex_t*) ()
#4 0x0000000010000b10 in std::mutex::lock() ()
#5 0x0000000010000be8 in std::lock_guard<std::mutex>::lock_guard(std::mutex&) ()
#6 0x0000000010000a80 in main ()
(gdb)

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-11-10 20:44 EDT-------
(In reply to comment #51)
> (In reply to comment #49)
>
> Something doesn't look right in your commands. They all have "LD_PRELOAD="
> with nothing after the '=':
> Comment #49 has:
> > $ LD_PRELOAD= gdb ./a.out

It's a mirroring problem to our bugzilla. The command looks correct in the Launchpad comment. I recommend reading/adding comments in Launchpad rather than from our bugzilla.

Revision history for this message
Adam Conrad (adconrad) wrote :

A test build of glibc with lock elision disabled is in progress here:

https://launchpad.net/~adconrad/+archive/ubuntu/nole/+packages

That said, the above trace looks suspiciously like a double-unlock. That breaks pthread rules, but the software implementation has historically let you do it anyway (or, rather, it fails to abort there because it would be expensive to do so, but your code is still buggy, prone to hanging, etc).

There's a thread on debian-devel right now about this same issue[1] as it related to ghostscript, where people noticed that with hardware lock elision, the world exploded, and without, it seemed to work, maybe, ish (though, as I said, the ghostscript bug was likely the cause of some mysterious hangs).

[1] https://lists.debian.org/debian-devel/2016/11/msg00210.html

Revision history for this message
Paul Clarke (7-pc) wrote :

from "man pthread_mutex_unlock":
       If the mutex type is PTHREAD_MUTEX_ERRORCHECK, then error checking
       shall be provided. If a thread attempts to relock a mutex that it has
       already locked, an error shall be returned. If a thread attempts to
       unlock a mutex that it has not locked or a mutex which is unlocked, an
       error shall be returned.

That might be something to try.

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Per the blog post mentioned from the thread in #35, this sort of problem should also manifest on a Broadwell or Skylake processor. Andrew, have you tried running on such machines?

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

From that debian thread:

"Per logs from message #15 on bug #842796:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15

SIGSEGV on __lll_unlock_elision is a signature (IME with very high
confidence) of an attempt to unlock an already unlocked lock while
running under hardware lock elision.

Well, unlocking an already unlocked lock is a pthreads API rule
violation, and it is going to crash the process on something that
implements hardware lock elision."

So I think we have some pretty good evidence of an application problem. I think that using Paul Clarke's suggestion may be necessary for you to figure out where the double-unlock is occurring. I'm not confident that valgrind will spot this.

We're going to continue trying to reproduce on our side and disable TLE to confirm that this segv goes away. Hard to know if this is related to the original reported problem, of course, but perhaps losing TLE will allow valgrind to find that if it's a separate issue.

Revision history for this message
Andrew Morrow (acmorrow) wrote :

First, I'm not sure what I was doing wrong yesterday, but I now have the LD_PRELOAD lock-elision-disablement running. And, when running under valgrind, we no longer see the reports from valgrind. I'm now running without valgrind to see whether we still observe stack corruption.

A few comments on the most recent comments.

Re #40: Bill, we have never actually seen a natural SEGV in __lll_unlock_elision in our tests - instead we see random stack corruption, per my description in #5. We saw some complaints about __lll_lock_elision when running under valgrind, but some comments in this thread have made me question whether valgrind actually understands what is happening there.

Re #38: We were thinking the same thing. I will ask around to see if we have any appropriate machines.

Re #37: I don't think it is an option for us as almost all of our locks are wrapped pthread locks via std::mutex.

Re #35: Almost all of our lock management is via C++ RAII types (std::unique_lock, etc), and we run these tests across many many systems. I'm almost certain that a double unlock would have been caught by now, or would cause crashes elsewhere. In particular I would expect that the windows debug runtime would alert us. However, I will see if there is some easy way we can check for this.

I'm running the original repro right now with HLE disabled. Usually it takes a few hours before we see a stack corruption event, so I will follow up later today with more results.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-11-11 14:14 EDT-------
I've been able to set up and reproduce this on a p8 16.04 system here. Seems to be the identical signature, 2 bytes at address 0xXXXXXe in the middle of the canary block set to 0x00.

Revision history for this message
Andrew Morrow (acmorrow) wrote :

That is good news that you have been able to reproduce the issue. I'm currently running the reproducer with the LD_PRELOAD disable-lock-elision hack in place, without valgrind, and I'm currently at 55 runs with no crashes. I will let it run overnight.

Also, per the earlier comment about double unlocking: I confirmed with one of the other engineers here that the Windows Debug CRT does check for double unlocks, and this test suite runs in our CI loop in that configuration without issue.

Revision history for this message
Peter Bergner (pbergner) wrote :

I'll note that the LD_PRELOAD interposer library is only needed for binaries that are already compiled and you want to override the pthread_mutex_lock() routine. If you can recompile your source, then you can place the interposer directly into your source and there is no need for LD_PRELOADing anything. Ie, you could just add the following to any of your source files:

#define PTHREAD_MUTEX_NO_ELISION_NP (512)

extern int __pthread_mutex_lock (pthread_mutex_t *);
int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
  mutex->__data.__kind |= PTHREAD_MUTEX_NO_ELISION_NP;
  return __pthread_mutex_lock (mutex);
}

Revision history for this message
Peter Bergner (pbergner) wrote :

...and possibly wrap the above in:

#ifdef __powerpc__
...
#endif

so it's only used on POWER?

Revision history for this message
Aaron Sawdey (acsawdey) wrote :

Question: Is there any magic I can do to this test case:

python buildscripts/resmoke.py --suites=concurrency_sharded --storageEngine=wiredTiger --excludeWithAnyTags=requires_mmapv1 --dbpathPrefix=... --repeat=500 --continueOnFailure

that would allow me to run multiple copies on the same machine?

Revision history for this message
Aaron Sawdey (acsawdey) wrote :

Found it, looks like the --basePort option to resmoke is what I want.

Revision history for this message
Andrew Morrow (acmorrow) wrote :

Aaron, re #50, yes, you can run as many copies as you want simultaneously, as long as:

1) The --dbpathPrefix argument points to distinct paths. So resmoke.py ... --dbpathPrefix=/var/tmp/run1 and resmoke.py ... --dbpathPrefix=/var/tmp/run2, etc.

2) You specify disjoint "port ranges" with the --basePort argument to resmoke.py. I don't happen to know how deep into each port range one instance of the test needs to go, so I usually separate them by 10k but that is probably way too generous. I'll bet 2k is sufficient.

Does that answer your question?

Revision history for this message
Andrew Morrow (acmorrow) wrote :

Peter, re #47, yes, that is certainly true. However, I'm actually finding it advantageous to load it via LD_PRELOAD exactly because I don't need to recompile. So I can toggle back and forth between lock elision on/off without needing to recompile.

Revision history for this message
Adam Conrad (adconrad) wrote :

Ahh, if you're never actually seeing a SEGV in __lll_unlock_elision, this may indeed be something far more subtle than a double unlock and, in fact, I'd be even MORE interested in having you run the same batch of testing on an otherwise identical (ie: Ubuntu 16.04, blah blah blah) Skylake system.

It may well be a subtle glibc bug on powerpc, or a whackadoo silicon bug, or a more generic glibc bug or still a mongo bug, but the more data we can scrape up (I mean, unless you accidentally stumble on the bug in the process and fix it, then yay), the better.

Revision history for this message
Aaron Sawdey (acsawdey) wrote :

Andrew,
  Yes, that is working nicely; with separate DB dirs and basePort I'm running multiple copies on one machine. Thanks!

Revision history for this message
Andrew Morrow (acmorrow) wrote :

Adam I agree on all points.

So far, my repro running with the LD_PRELOAD hack is at 118 iterations with no crashes and going strong. Given that we had an ~5% repro rate without the LD_PRELOAD hack, this is looking very encouraging, but I'm going to let it run all weekend just to be sure.

As for skylake, we are definitely going to work on tracking one down next week.

And of course, our default stance has always been that it is more likely our bug than anything else, but you never know! We will continue to run experiments and gather data.

Thanks again for your help.

Revision history for this message
Aaron Sawdey (acsawdey) wrote :

This is the other thing I am trying. I've modified the Canary object to use a 128k stack zone and then use mprotect to mark the aligned 64k page that's in the middle of it read-only. When the destructor is called, it changes it back to read-write. This should cause any write to this region to get a segv, and give us an idea of what is writing on the stack in the resulting coredump.
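
A minimal sketch of that approach, assuming 64 KiB pages (as is typical on ppc64le); the class name and error handling here are invented for illustration, not the code actually used:

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

class MprotectCanary {
public:
    // 128 KiB zone carved out of the stack (e.g. via alloca) so that one
    // aligned 64 KiB page is guaranteed to fall entirely inside it.
    static constexpr std::size_t kZoneSize = 128 * 1024;
    static constexpr std::size_t kPageSize = 64 * 1024;

    explicit MprotectCanary(void* zone) {
        // Round up to the first 64 KiB-aligned address inside the zone and
        // make that page read-only; any stray store into it now faults.
        const std::uintptr_t base =
            (reinterpret_cast<std::uintptr_t>(zone) + kPageSize - 1) &
            ~std::uintptr_t(kPageSize - 1);
        _page = reinterpret_cast<void*>(base);
        if (::mprotect(_page, kPageSize, PROT_READ) != 0)
            std::abort();
    }

    ~MprotectCanary() {
        // Restore write access so the stack frame can be reused normally.
        if (::mprotect(_page, kPageSize, PROT_READ | PROT_WRITE) != 0)
            std::abort();
    }

private:
    void* _page;
};

It would be used like the earlier memset canary, e.g. MprotectCanary c(alloca(MprotectCanary::kZoneSize)); at the top of a suspect function. Each mprotect call can split the stack mapping, which is why /proc/sys/vm/max_map_count may need raising, as noted in the following comment.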

Revision history for this message
Aaron Sawdey (acsawdey) wrote :

One other thing, if you use the mprotect thing, it may be necessary to bump up the value of /proc/sys/vm/max_map_count, depending on how many of these Canary objects get constructed.

Revision history for this message
Andrew Morrow (acmorrow) wrote :

An engineer on our side did some Canary+mprotect experiments as well, but I don't happen to have details on what the approach/results were right now. I'll ask them to update this ticket with any interesting findings they may have.

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Here's another interesting data point. The original bug description specifies that the memory corruption is not seen on Ubuntu 15. Per https://bugzilla.linux.ibm.com/show_bug.cgi?id=117535, however, transactional lock elision has been enabled by default since Ubuntu 15.04 (glibc 2.21). Yet on 16.04, the use of TLE is sufficient to cause the stack corruption.

This seems to deepen the mystery more than it illuminates it. Have there been changes to TLE between the releases that could be at fault? Is another unknown component involved?

I'm told this morning that so far no failures are observed using the mprotect canary, the working theory being that the syscall disturbs the timing too much. Otherwise our results are consistent with yours on 16.04: failures with TLE enabled, no failures with the LD_PRELOAD workaround.

Revision history for this message
Andrew Morrow (acmorrow) wrote :

Regarding Ubuntu 15, I think that was a miscommunication somewhere along the line.

The only versions of Ubuntu that we build for are the LTS releases (12.04, 14.04, and 16.04), and the only one of those we have ever built on POWER is 16.04. Other than Ubuntu 16.04, the only other POWER distro we target is RHEL 7.1. So what we are seeing is I think still consistent with either a latent locking bug in MongoDB exposed by the new-to-us lock elision support, or one of the other sorts of issues as mentioned by Adam above.

Also, my test run with lock elision disabled just reached 500 successful runs. I'm going to switch lock elision back on and make sure I start seeing corruption again, but I'm nearly certain that I will.

Thanks for the update on the mprotect experiment. It would have been great if that had worked. I think we will be able to share some details on our similar experiments on Monday.

Revision history for this message
Aaron Sawdey (acsawdey) wrote :

An update on my experiments:
* 500 runs, no failures, with TLE disabled
* 500 runs, no failures, with TLE enabled but the mprotect() syscall in the Canary constructor/destructor
* 500 runs with TLE enabled, 11 failures, so about a 2% fail rate
* Tried switching SMT off and, interestingly, got 200 runs with no failures with TLE enabled.

This suggests to me that the timing of this race condition is rather tight. A lot of the failures are in the checksum right after the memset in the Canary constructor, which means the other thread comes back and writes the stack after memset wrote it but before the checksum read it. Also, if strace's syscall timing is to be believed, mprotect() takes under 8 microseconds 14% of the time, and even that seems to be sufficient to prevent the problem. Finally, forcing other pthreads to always run on a different processor core (by disabling SMT) also apparently eliminates the problem.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Hello, could bugproxy please be silenced, and/or prevented from reposting the same python command line over and over again? It seems there are sometimes attachments/comments that get amplified and re-posted with every other comment.

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

We'll try to get to the bottom of the bugproxy oddity. It seems to be echoing a chunk out of comment #5 for no reason that I can see.

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Given the results of Aaron's experiments over the weekend, here's a summary of what we think we're seeing:
 - There is an unsafe interaction between two threads.
 - This interaction can only be observed in a very small time window, and then relatively rarely.
 - The interaction is observable only when TLE is enabled. We posit this is because TLE improves lock/unlock speed enough that the narrow time window can actually be hit.
 - The time window is sufficiently small that a relatively fast syscall to mprotect is sufficient to disrupt the timing.
 - If all threads are forced to run on separate processors (SMT=1), this is sufficient to disrupt the timing.

We believe this is an application problem. Further "evidence" against a problem in TLE itself is that TLE has been enabled for ppc64el on Ubuntu for over 18 months without any similar reports.

Unfortunately, I don't think we have any way to directly debug the problem, due to the narrow time window. We have considered hardware watchpoints. There is a DAWR (data address watch register) for each thread, but using it in this case seems impractical. GDB's implementation of hardware watchpoints in a multithreaded environment is such that, when a hardware watchpoint is set, it is set to the same address for all threads. So even if we were to script the setting of a hardware watchpoint under GDB, the time required for GDB to set up the watchpoint address on the fly would surely exceed the critical time window. You could try it, but I wouldn't expect much.

Further debugging seems to require application knowledge or a code crawl of some kind. The setting of two flag bytes is the only clue we have. To my knowledge we have never seen only a single byte clobbered in the canary, and the two bytes appear to be aligned on a 16-bit boundary. So the code doing the clobbering very likely contains a store-halfword (sth or, less likely, sthx or sthu) instruction. You could examine the disassembly of the application for occurrences of sth to narrow the field of search.

It could be two individual stores, in which case you'd be looking for two instructions very close together of the form:

  stb r<x>,<n>(r<b>)
  stb r<y>,<n+1>(r<b>)

Example:

  stb r5,0(r9)
  stb r6,1(r9)

Obviously this is a huge application so this doesn't help much in and of itself, but perhaps if you've already narrowed the problem down somewhat, this could be helpful.

I am running low on ideas for how we can help you debug the problem. We'll continue discussing it; if any better thoughts arise, we'll be sure to let you know.

Bill

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Well, let me back that off a little. We're going to look into the TLE code a little more. The various __lll_*_elision routines are handed a pointer to short that they update, which certainly looks suspicious. So it's certainly possible that something in the pthreads implementation that calls this code is providing a bad pointer to TLE.

However, we would expect to see the same problem on x86 or s390 if that is the case, unless there is some POWER-specific code in the pthreads implementation. So again a Skylake experiment would be helpful.

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Ulrich Weigand made an interesting comment on the glibc code in our internal bug (no longer mirrored), so I'm mirroring it by hand here.

--- Comment #77 from Ulrich Weigand <email address hidden> ---
According to comment #45, we see an invalid read at __lll_unlock_elision
(elision-unlock.c:36) and an invalid write at the next line. Looking at the
code, those are:

int
__lll_unlock_elision (int *lock, short *adapt_count, int pshared)
{
  /* When the lock was free we're in a transaction. */
  if (*lock == 0)
    __libc_tend (0);
  else
    {
      lll_unlock ((*lock), pshared);

      /* Update the adapt count AFTER completing the critical section.
         Doing this here prevents unneeded stalling when entering
         a critical section. Saving about 8% runtime on P8. */
      if (*adapt_count > 0)
        (*adapt_count)--;
    }
  return 0;
}

exactly the two lines accessing *adapt_count (which is a short).

What seems suspicious here is the fact that these lines happen (deliberately)
*outside* of the critical section protected by the mutex. Now this should
indeed normally be OK. However, I understand it is allowed to use that
critical section to also protect the lifetime of the memory object holding the
mutex itself. If this is done, I'm wondering whether those two lines outside
the critical section may now access that *memory* outside of its protected
lifetime ...

Just as a theoretical example, assume you have thread 1 executing a function
that allocates a mutex on its stack, locks the mutex, and passes its address to
a thread 2. Thread 2 now does some work and then unlocks the mutex. Thread 1,
in the meantime, attempts to lock the mutex again, which will block until it
was unlocked by thread 2. After it got the lock, thread 1 now immediately
unlocks the mutex again and then the function returns.

Now in this scenario, the cross-thread lock-unlock pair cannot use TM, so it
will fall back to regular locks, which means the unlock in thread 2 will
perform the *adapt_count update. On the other hand, the second lock-unlock
pair in thread 1 might just happen to go through with TM, and thus not touch
*adapt_count at all. This means there is no synchronization between the update
of *adapt_count in thread 2, and the re-use of the stack memory underlying the
mutex in thread 1, which could lead to the *adapt_count update being seen
*after* that reuse in thread 1, clobbering memory ...

Am I missing something here? Of course, I'm not sure if MongoDB shows that
exact pattern. But I guess different patterns might show similar races too.

One simple test might be to move the *adapt_count update to before the
lll_unlock and check whether the issue still reproduces. Not sure about
performance impacts, however.
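
For illustration, a minimal sketch of the stack-mutex hand-off pattern Ulrich describes (the names are invented; this is not MongoDB code). The waiter only synchronizes with the worker through the mutex itself before letting its stack frame be reused, so under TLE the worker's trailing *adapt_count store could land after that reuse:

#include <pthread.h>

void* worker(void* arg) {
    pthread_mutex_t* m = static_cast<pthread_mutex_t*>(arg);
    /* ... do some work under the lock handed over by waiter() ... */
    pthread_mutex_unlock(m);   // real (non-elided) unlock; decrements *adapt_count afterwards
    return nullptr;
}

void waiter() {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;   // mutex lives in this stack frame
    pthread_mutex_lock(&m);

    pthread_t t;
    pthread_create(&t, nullptr, worker, &m);
    pthread_detach(t);

    pthread_mutex_lock(&m);    // blocks until worker() has unlocked
    pthread_mutex_unlock(&m);  // this pair may run as a transaction and never touch *adapt_count
    pthread_mutex_destroy(&m);
    // Returning here lets the stack memory that held 'm' be reused while the
    // worker's adapt_count decrement may still be in flight.
}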

Revision history for this message
Andrew Morrow (acmorrow) wrote :

I think it is very likely that we are doing the sort of stack-based mutex pattern described above, or something similar. In particular, I'd expect that we certainly have states where we wait on a stack mutex, and then immediately unwind and destroy the mutex after we unblock.

I'm working on getting access to a Skylake machine - we will definitely do that test if we can get access.

Revision history for this message
Aaron Sawdey (acsawdey) wrote :

So, I rebuilt glibc 2.23 from the 16.04 sources and modified the values written to the adapt_count parameter in the lock elision code. It's a short, and the original code may store the values 0, 1, 2, or 3. We were seeing either 1 (canary hit in the constructor) or 0 (canary hit in the destructor). I changed it to use the values 0x3333, 0x2222, 0x1111, and 0, and I just saw the constructor canary hit with the expected values 0x11, 0x11 in the changed bytes. So, this is a race condition in the lock elision code when a mutex located on the stack is reused quickly by another hardware thread on the same processor core.

Revision history for this message
Andrew Morrow (acmorrow) wrote :

Aaron, thank you very much for running that experiment and confirming that this is an issue in libc. I think the affected component on this bug should probably be updated?

Also, would you like us to try to continue to repro on a Skylake machine, or is this all architecture neutral code and therefore the POWER repro is conclusive?

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Hi Andrew,

This is a POWER-specific "optimization" that dates to last December (so it in fact wouldn't show up in Ubuntu 15.04 or 15.10, it appears). The decrement used to be attached to the lock rather than the unlock and it was apparently moved to the present location because it showed a good performance improvement. Aaron is running some experiments with moving it ahead of the unlock rather than following it, which we believe will remove the problem. That may or may not be the final version of the fix, but it looks like the simplest solution.
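
For concreteness, the reordering being tested would look roughly like this when applied to the __lll_unlock_elision code quoted above; this is only a sketch of the idea, not the final upstream patch:

int
__lll_unlock_elision (int *lock, short *adapt_count, int pshared)
{
  /* When the lock was free we're in a transaction. */
  if (*lock == 0)
    __libc_tend (0);
  else
    {
      /* Update adapt_count BEFORE releasing the lock, so the store cannot
         land after another thread has already reused the memory holding
         the mutex.  */
      if (*adapt_count > 0)
        (*adapt_count)--;

      lll_unlock ((*lock), pshared);
    }
  return 0;
}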

The only reason for you to test on a Skylake machine would be for your own peace of mind, in case any TLE-specific issues are hiding in the x86 implementation. This specific problem, at least, is a POWER issue.

affects: gcc-5 (Ubuntu) → glibc (Ubuntu)
Revision history for this message
Andrew Morrow (acmorrow) wrote :

Hi Bill -

Thanks for the update, and for clarifying that this is POWER 16.04 only. We are very happy to be at a root cause for this issue - it had us pretty worried! We really appreciate all the help from everyone involved here.

Will there be an upstream glibc bug associated with this ticket that we can follow?

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Hi Andrew,

Aaron has just opened https://sourceware.org/bugzilla/show_bug.cgi?id=20822. An odd confluence of vacations, world holidays, and family leave has conspired to take all of our glibc experts out of the office tomorrow, so we may be a little delayed with testing and submitting the final fix, though the rest of us will do what we can with it. We are likewise very relieved to have proof of a cause! And we're certainly sorry for all the trouble this has caused you.

Our team will work with the Ubuntu toolchain guys to get a solution in the field as soon as possible. Thanks again for your patience and professionalism!

Bill

Revision history for this message
Andrew Morrow (acmorrow) wrote :

Hi Bill -

Thanks for the glibc bug link.

Totally understand about people being out, not a problem. However, I'm not very familiar with the development process for upstream glibc fixes to make their way into an LTS release.

Do you have a rough estimate of the timeline for that landing in something that we can pull in via apt-get update? Are we talking days, weeks, months? My expectation right now is that we are going to continue forward with issuing the MongoDB 3.4 GA on POWER Ubuntu 16.04, and relnote this issue, probably stating that MongoDB 3.4 should not be run in production on POWER Ubuntu 16.04 unless glibc has been upgraded to version X or newer.

We might even add a server startup warning, if it will be possible for us to detect at runtime that we are running with an affected glibc. I'd imagine that might prove difficult if this comes in as an Ubuntu patch to the existing libc version though. Thoughts on that?

Andrew

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Hi Andrew -

I don't work directly in the glibc community, so I'm not completely familiar with their policies. However, the first step is to get a fix approved upstream so that it can be backported to the 2.23 release (in this case). Adam Conrad at Canonical has volunteered to help us shepherd the patch along, so we hope to be able to expedite it.

We have two possible patches to fix the problem; we are trying to determine which will have the least performance impact. I am hopeful that we can get a decision on that tomorrow so the patch can be put on the libc-alpha list and, we hope, approved without too much difficulty.

I will have to defer to Canonical folks (Adam, Matthias) on the process for getting the fix available, but I'm sure this isn't their first rodeo and this shouldn't take too long. My best guess is that the whole process may get done within a week, if all goes well with the community, and a little longer if not. We are definitely not talking months.

Revision history for this message
William J. Schmidt (wschmidt-g) wrote :
Adam Conrad (adconrad)
Changed in glibc (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Adam Conrad (adconrad)
Changed in glibc (Ubuntu Xenial):
assignee: nobody → Adam Conrad (adconrad)
Changed in glibc (Ubuntu Yakkety):
assignee: nobody → Adam Conrad (adconrad)
Revision history for this message
William J. Schmidt (wschmidt-g) wrote :

Hi Andrew,

Canonical's plans for handling this in the short term are described here: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1642390. We will continue to work the POWER patch, but the SRU described there is all you should need to track for your relnote.

Bill

Revision history for this message
Brian W Hart (hartb) wrote :

Howdy, I'm the originator of:

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1641241

(which is dup'd to this bug)

I tested the new ("ubuntu5") libc6 packages from xenial-proposed. They prevent the crash with TensorFlow, and I have not noticed any other problems.

Will update:

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1642390

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in glibc (Ubuntu Xenial):
status: New → Confirmed
Changed in glibc (Ubuntu Yakkety):
status: New → Confirmed
Changed in glibc (Ubuntu):
status: New → Confirmed
Changed in glibc:
importance: Unknown → Medium
status: Unknown → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glibc - 2.24-9ubuntu2

---------------
glibc (2.24-9ubuntu2) zesty; urgency=medium

  * debian/patches/any/cvs-resolv-internal-qtype.diff: Revert to avoid
    failure in name resolution on upgrades from yakkety (LP: #1674532)

 -- Adam Conrad <email address hidden> Tue, 21 Mar 2017 15:27:15 -0600

Changed in glibc (Ubuntu):
status: Confirmed → Fix Released