assert(status == epicsMutexLockOK) dbEvent.c line 531

Bug #1773373 reported by Dirk Zimoch
This bug affects 1 person
Affects      Status         Importance   Assigned to    Milestone
EPICS Base (Status tracked in 7.0)
  3.14       Fix Released   Medium       Ralph Lange
  3.15       Fix Released   Undecided    Ralph Lange
  3.16       Fix Released   Undecided    Unassigned
  7.0        Fix Released   Undecided    mdavidsaver

Bug Description

Maybe related to bug 1762543.

I am using R3.14.12.7.

This is the output:

CAS: request from 172.26.120.14:57787 => bad resource ID
CAS: Request from 172.26.120.14:57787 => cmmd=15 cid=0x26f7 type=10 count=1 postsize=0
CAS: Request from 172.26.120.14:57787 => available=0x76635 N=0 paddr=(nil)
CAS: forcing disconnect from 172.26.120.14:57787

A call to 'assert(status == epicsMutexLockOK)'
    by thread 'CAS-client' failed in ../dbEvent.c line 531.

Dirk Zimoch (dirk.zimoch) wrote :

My analysis so far:

This is the stack of the failing CAS-client thread:

#0 0xb7be13bc in __pthread_cond_wait (cond=0xb5c588c0, mutex=0xb5c588a8) at pthread_cond_wait.c:153
#1 0xb7cf5978 in __pthread_cond_wait (cond=<optimized out>, mutex=<optimized out>) at forward.c:139
#2 0x0fd3e374 in condWait (mutexId=0xb5c588a8, condId=0xb5c588c0) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#3 epicsEventWait (pevent=0xb5c588a8) at ../../../src/libCom/osi/os/posix/osdEvent.c:137
#4 0x0fd3c68c in epicsThreadSuspendSelf () at ../../../src/libCom/osi/os/Linux/osdThread.c:580
#5 0x0fd3a4fc in epicsAssert (pFile=0xfe870e0 "../dbEvent.c", line=531, pExp=0xfe85d68 "status == epicsMutexLockOK", pAuthorName=0xfd4c894 "the author") at ../../../src/libCom/osi/os/default/osdAssert.c:50
#6 0x0fe75ba4 in db_cancel_event (es=0x10d0d990) at ../dbEvent.c:531
#7 0x0ff05610 in destroyAllChannels (client=0x10d09e28, pList=0x10d09e68) at ../caservertask.c:653
#8 0x0ff06508 in destroy_tcp_client (client=0x10d09e28) at ../caservertask.c:697
#9 0x0ff073f8 in camsgtask (pParm=0x10d09e28) at ../camsgtask.c:160
#10 0x0fd3bbcc in start_routine (arg=0xb5c08258) at ../../../src/libCom/osi/os/Linux/osdThread.c:408
#11 0xb7bdc82c in start_thread (arg=0xb22ba480) at pthread_create.c:306
#12 0xb7ce70e4 in clone () from /lib/libc.so.6

camsgtask (camsgtask.c:41) exits due to an error:
151: epicsPrintf ("CAS: forcing disconnect from %s\n", buf);
152: break;
153: }
154: }
155:
156: LOCK_CLIENTQ;
157: ellDelete ( &clientQ, &client->node );
158: UNLOCK_CLIENTQ;
159:
160: destroy_tcp_client ( client );

destroy_tcp_client (caservertask.c:676) calls destroyAllChannels
679: destroyAllChannels ( client, & client->chanList );

destroyAllChannels (caservertask.c:619) calls db_cancel_event
640: while ( TRUE ) {
641: /*
642: * AS state change could be using this list
643: */
644: epicsMutexMustLock ( client->eventqLock );
645: pevext = (struct event_ext *) ellGet ( &pciu->eventq );
646: epicsMutexUnlock ( client->eventqLock );
647:
648: if ( ! pevext ) {
649: break;
650: }
651:
652: if ( pevext->pdbev ) {
653: db_cancel_event (pevext->pdbev);

db_cancel_event (dbEvent.c:518) locks the event queue, which fails:
531: LOCKEVQUE ( pevent->ev_que )
This macro expands to: epicsMutexMustLock((pevent->ev_que)->writelock);
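epicsMutexMustLock() is itself a macro that locks and then asserts, so line 531 effectively runs (paraphrased, not verbatim from the headers):

    epicsMutexLockStatus status = epicsMutexLock(pevent->ev_que->writelock);
    assert(status == epicsMutexLockOK);    /* reported as dbEvent.c line 531 */

Any error return from epicsMutexLock(), for example on a mutex that has already been destroyed, therefore surfaces as exactly the assertion shown above.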

So at this time the mutex seems to be invalid. Why?
Content of the mutex:
(gdb) print *(( struct evSubscrip * ) es) -> ev_que->writelock
$6 = {node = {next = 0x10459d08, previous = 0xb5c58410}, id = 0x11013848, pFileName = 0xfe870e0 "../dbEvent.c", lineno = 298}

The only possibility that makes sense is that it had been destroyed.

event_task (dbEvent.c:915) ("CAS-event" thread) destroys writelock mutexes when it exits:
928: do{
...
961: pendexit = evUser->pendexit;
962: epicsMutexUnlock ( evUser->lock );
963:
964: } while( ! pendexit );
965:
966: epicsMutexDestroy(evUser->firstque.writelock);
967:
968: {
969: struct event_que *nextque;
970:
971: ev_que...

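Putting the two halves together, a minimal, timing-dependent sketch of the suspected use-after-destroy, written against the public epicsMutex/epicsThread API (illustrative only, not code from the IOC; the thread name and delays are made up):

#include <stddef.h>
#include <epicsMutex.h>
#include <epicsThread.h>

static epicsMutexId writelock;        /* stands in for ev_que->writelock */

static void casClientLike(void *arg)
{
    epicsThreadSleep(0.1);            /* lose the race on purpose */
    epicsMutexMustLock(writelock);    /* may fail the same assert as dbEvent.c:531 */
    epicsMutexUnlock(writelock);
}

int main(void)
{
    writelock = epicsMutexCreate();
    epicsThreadCreate("CAS-client-like", epicsThreadPriorityMedium,
                      epicsThreadGetStackSize(epicsThreadStackMedium),
                      casClientLike, NULL);
    epicsMutexDestroy(writelock);     /* like event_task tearing down writelock */
    epicsThreadSleep(1.0);            /* let the other thread hit the stale mutex */
    return 0;
}

Because the mutex is gone by the time it is locked, the outcome is undefined: depending on timing and platform it may trip the assert, crash, or appear to work.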

mdavidsaver (mdavidsaver) wrote :

This seems to me to be related, if not the same as, lp:1762543.

mdavidsaver (mdavidsaver) wrote :

Dirk, have you been able to trigger this crash?

Also, if you still have the core dump I'd be interested in where the other threads are. Run "thread apply all backtrace" and attach the resulting (long) output.

mdavidsaver (mdavidsaver) wrote :

If you can replicate it, please try the 3.15 branch or later with d2b0e920012d4c9bc0eaecc0e092cedd776db578, which moves event cleanup from the dbEvent worker into db_close_events().

Dirk Zimoch (dirk.zimoch) wrote :

I have never been able to trigger the problem. The condition is rare and seems to be related to high network load. It seems that only one particular client (not under my control) causes this fault by triggering "bad resource ID" errors.

I don't have a core dump. All other threads were still functional and the IOC seemed to work just fine. However, it has since been restarted.

I cannot easily switch a production system to 3.15. There are too many implications and those are not my systems. But I will apply the change from commit d2b0e920012d4c9bc0eaecc0e092cedd776db578 to 3.14.12.7 to see if it still happens.

Dirk Zimoch (dirk.zimoch) wrote :

Michael, this patch targets exactly the problem I found. It makes sure all threads have finished before destroying the mutexes. Thus I expect this patch to fix my problem. Applying it to 3.14.12 was no problem at all. Thanks.
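In schematic form (not the actual patch, just the ordering it enforces as described above; all names here are made up), the idea is that the owner waits for the worker thread to announce it is finished before destroying anything the worker might still lock:

#include <stddef.h>
#include <epicsEvent.h>
#include <epicsMutex.h>
#include <epicsThread.h>

static epicsMutexId queueLock;
static epicsEventId workerDone;

static void worker(void *arg)
{
    epicsMutexMustLock(queueLock);    /* normal use of the shared mutex */
    epicsMutexUnlock(queueLock);
    epicsEventSignal(workerDone);     /* last thing before exiting */
}

int main(void)
{
    queueLock  = epicsMutexCreate();
    workerDone = epicsEventCreate(epicsEventEmpty);
    epicsThreadCreate("worker", epicsThreadPriorityMedium,
                      epicsThreadGetStackSize(epicsThreadStackMedium),
                      worker, NULL);
    epicsEventMustWait(workerDone);   /* wait until the worker is done ... */
    epicsMutexDestroy(queueLock);     /* ... only then tear the mutex down */
    epicsEventDestroy(workerDone);
    return 0;
}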

mdavidsaver (mdavidsaver) wrote :

If you're cherry-picking patches, you'll probably also want https://git.launchpad.net/epics-base/commit/?id=e794639e31559c0b4054ce5c31f46194d93325f1, which addresses a rarer race introduced by the first.

> I cannot easily switch a production system to 3.15 ...

The fact that this issue was identified and fixed back in 2012 might, I hope, encourage PSI to begin moving beyond 3.14.x.

Andrew Johnson (anj) wrote :

Should we back-port these fixes to 3.14?
RL: Both patches are simple, MD should apply.

Changed in epics-base:
assignee: nobody → mdavidsaver (mdavidsaver)
status: New → Triaged
milestone: none → 3.14.12.8
mdavidsaver (mdavidsaver) wrote :

Please don't assign me any 3.14-only issues. I have no support for 3.14 maintenance work, nor the inclination to donate my time for this.
