assert(status == epicsMutexLockOK) dbEvent.c line 531

Bug #1773373 reported by Dirk Zimoch
This bug affects 1 person
Affects      Status         Importance   Assigned to    Milestone
EPICS Base (Status tracked in 7.0)
  3.14       Fix Released   Medium       Ralph Lange
  3.15       Fix Released   Undecided    Ralph Lange
  3.16       Fix Released   Undecided    Unassigned
  7.0        Fix Released   Undecided    mdavidsaver

Bug Description

Maybe related to bug 1762543.

I am using R3.14.12.7.

This is the output:

CAS: request from 172.26.120.14:57787 => bad resource ID
CAS: Request from 172.26.120.14:57787 => cmmd=15 cid=0x26f7 type=10 count=1 postsize=0
CAS: Request from 172.26.120.14:57787 => available=0x76635 N=0 paddr=(nil)
CAS: forcing disconnect from 172.26.120.14:57787

A call to 'assert(status == epicsMutexLockOK)'
    by thread 'CAS-client' failed in ../dbEvent.c line 531.

Dirk Zimoch (dirk.zimoch) wrote :

My analysis so far:

This is the stack of the failing CAS-client thread:

#0 0xb7be13bc in __pthread_cond_wait (cond=0xb5c588c0, mutex=0xb5c588a8) at pthread_cond_wait.c:153
#1 0xb7cf5978 in __pthread_cond_wait (cond=<optimized out>, mutex=<optimized out>) at forward.c:139
#2 0x0fd3e374 in condWait (mutexId=0xb5c588a8, condId=0xb5c588c0) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#3 epicsEventWait (pevent=0xb5c588a8) at ../../../src/libCom/osi/os/posix/osdEvent.c:137
#4 0x0fd3c68c in epicsThreadSuspendSelf () at ../../../src/libCom/osi/os/Linux/osdThread.c:580
#5 0x0fd3a4fc in epicsAssert (pFile=0xfe870e0 "../dbEvent.c", line=531, pExp=0xfe85d68 "status == epicsMutexLockOK", pAuthorName=0xfd4c894 "the author") at ../../../src/libCom/osi/os/default/osdAssert.c:50
#6 0x0fe75ba4 in db_cancel_event (es=0x10d0d990) at ../dbEvent.c:531
#7 0x0ff05610 in destroyAllChannels (client=0x10d09e28, pList=0x10d09e68) at ../caservertask.c:653
#8 0x0ff06508 in destroy_tcp_client (client=0x10d09e28) at ../caservertask.c:697
#9 0x0ff073f8 in camsgtask (pParm=0x10d09e28) at ../camsgtask.c:160
#10 0x0fd3bbcc in start_routine (arg=0xb5c08258) at ../../../src/libCom/osi/os/Linux/osdThread.c:408
#11 0xb7bdc82c in start_thread (arg=0xb22ba480) at pthread_create.c:306
#12 0xb7ce70e4 in clone () from /lib/libc.so.6

camsgtask (camsgtask.c:41) exits due to an error:
151: epicsPrintf ("CAS: forcing disconnect from %s\n", buf);
152: break;
153: }
154: }
155:
156: LOCK_CLIENTQ;
157: ellDelete ( &clientQ, &client->node );
158: UNLOCK_CLIENTQ;
159:
160: destroy_tcp_client ( client );

destroy_tcp_client (caservertask.c:676) calls destroyAllChannels
679: destroyAllChannels ( client, & client->chanList );

destroyAllChannels (caservertask.c:619) calls db_cancel_event
640: while ( TRUE ) {
641: /*
642: * AS state change could be using this list
643: */
644: epicsMutexMustLock ( client->eventqLock );
645: pevext = (struct event_ext *) ellGet ( &pciu->eventq );
646: epicsMutexUnlock ( client->eventqLock );
647:
648: if ( ! pevext ) {
649: break;
650: }
651:
652: if ( pevext->pdbev ) {
653: db_cancel_event (pevext->pdbev);

db_cancel_event (dbEvent.c:518) locks the event queue, which fails:
531: LOCKEVQUE ( pevent->ev_que )
This macro expands to: epicsMutexMustLock((pevent->ev_que)->writelock);
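epicsMutexMustLock() is itself a macro that locks and then asserts, so line 531 effectively runs (paraphrased, not verbatim from the headers):

    epicsMutexLockStatus status = epicsMutexLock(pevent->ev_que->writelock);
    assert(status == epicsMutexLockOK);    /* reported as dbEvent.c line 531 */

Any error return from epicsMutexLock(), for example on a mutex that has already been destroyed, therefore surfaces as exactly the assertion shown above.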

So at this time the mutex seems to be invalid. Why?
Content of the mutex:
(gdb) print *(( struct evSubscrip * ) es) -> ev_que->writelock
$6 = {node = {next = 0x10459d08, previous = 0xb5c58410}, id = 0x11013848, pFileName = 0xfe870e0 "../dbEvent.c", lineno = 298}

The only possibility that makes sense is that it had been destroyed.

event_task (dbEvent.c:915) ("CAS-event" thread) destroys writelock mutexes when it exits:
928: do{
...
961: pendexit = evUser->pendexit;
962: epicsMutexUnlock ( evUser->lock );
963:
964: } while( ! pendexit );
965:
966: epicsMutexDestroy(evUser->firstque.writelock);
967:
968: {
969: struct event_que *nextque;
970:
971: ev_que...

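Putting the two halves together, a minimal, timing-dependent sketch of the suspected use-after-destroy, written against the public epicsMutex/epicsThread API (illustrative only, not code from the IOC; the thread name and delays are made up):

#include <stddef.h>
#include <epicsMutex.h>
#include <epicsThread.h>

static epicsMutexId writelock;        /* stands in for ev_que->writelock */

static void casClientLike(void *arg)
{
    epicsThreadSleep(0.1);            /* lose the race on purpose */
    epicsMutexMustLock(writelock);    /* may fail the same assert as dbEvent.c:531 */
    epicsMutexUnlock(writelock);
}

int main(void)
{
    writelock = epicsMutexCreate();
    epicsThreadCreate("CAS-client-like", epicsThreadPriorityMedium,
                      epicsThreadGetStackSize(epicsThreadStackMedium),
                      casClientLike, NULL);
    epicsMutexDestroy(writelock);     /* like event_task tearing down writelock */
    epicsThreadSleep(1.0);            /* let the other thread hit the stale mutex */
    return 0;
}

Because the mutex is gone by the time it is locked, the outcome is undefined: depending on timing and platform it may trip the assert, crash, or appear to work.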

mdavidsaver (mdavidsaver) wrote :

This seems to me to be related, if not the same as, lp:1762543.

mdavidsaver (mdavidsaver) wrote :

Dirk, have you been able to trigger this crash?

Also, if you still have the core dump I'd be interested in where the other threads are. Run "thread apply all backtrace" and attach the resulting (long) output.

mdavidsaver (mdavidsaver) wrote :

If you can replicate it, please try the 3.15 branch or later with d2b0e920012d4c9bc0eaecc0e092cedd776db578, which moves event cleanup from the dbEvent worker into db_close_events().

Dirk Zimoch (dirk.zimoch) wrote :

I have never been able to trigger the problem. The condition is rare and seems to be related to high network load. It seems that only one particular client (not under my control) causes this fault by triggering "bad resource ID" errors.

I don't have a core dump. All other threads were still functional and the IOC seemed to work just fine. However, it has since been restarted.

I cannot easily switch a production system to 3.15. There are too many implications and those are not my systems. But I will apply the change from commit d2b0e920012d4c9bc0eaecc0e092cedd776db578 to 3.14.12.7 to see if it still happens.

Dirk Zimoch (dirk.zimoch) wrote :

Michael, this patch targets exactly the problem I found. It makes sure all threads have finished before destroying the mutexes. Thus I expect this patch to fix my problem. Applying it to 3.14.12 was no problem at all. Thanks.
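In schematic form (not the actual patch, just the ordering it enforces as described above; all names here are made up), the idea is that the owner waits for the worker thread to announce it is finished before destroying anything the worker might still lock:

#include <stddef.h>
#include <epicsEvent.h>
#include <epicsMutex.h>
#include <epicsThread.h>

static epicsMutexId queueLock;
static epicsEventId workerDone;

static void worker(void *arg)
{
    epicsMutexMustLock(queueLock);    /* normal use of the shared mutex */
    epicsMutexUnlock(queueLock);
    epicsEventSignal(workerDone);     /* last thing before exiting */
}

int main(void)
{
    queueLock  = epicsMutexCreate();
    workerDone = epicsEventCreate(epicsEventEmpty);
    epicsThreadCreate("worker", epicsThreadPriorityMedium,
                      epicsThreadGetStackSize(epicsThreadStackMedium),
                      worker, NULL);
    epicsEventMustWait(workerDone);   /* wait until the worker is done ... */
    epicsMutexDestroy(queueLock);     /* ... only then tear the mutex down */
    epicsEventDestroy(workerDone);
    return 0;
}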

mdavidsaver (mdavidsaver) wrote :

If you're cherry-picking patches, you'll probably also want https://git.launchpad.net/epics-base/commit/?id=e794639e31559c0b4054ce5c31f46194d93325f1, which addresses a rarer race introduced by the first.

> I cannot easily switch a production system to 3.15 ...

The fact that this issue was identified and fixed back in 2012 might, I hope, encourage PSI to begin moving beyond 3.14.x.

Andrew Johnson (anj) wrote :

Should we back-port these fixes to 3.14?
RL: Both patches are simple, MD should apply.

Changed in epics-base:
assignee: nobody → mdavidsaver (mdavidsaver)
status: New → Triaged
milestone: none → 3.14.12.8
mdavidsaver (mdavidsaver) wrote :

Please don't assign me any 3.14-only issues. I have no support for 3.14 maintenance work, nor the inclination to donate my time for this.
