Percona Server 5.6 RC with thread pool crashes under load

Bug #1191375 reported by Vadim Tkachenko
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Server moved to https://jira.percona.com/projects/PS
Fix Released
High
Unassigned
5.1
Invalid
Undecided
Unassigned
5.5
Invalid
Undecided
Unassigned
5.6
Fix Released
High
Unassigned

Bug Description

Running PS 5.6 with pool of threads I observe regular crashes.
There is gdb backtrace

http://pastebin.com/qiPKPN2x

Tags: upstream tp
Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

consider following scenario
* thread t1 is running
* t1 calls wait_begin -> wake_or_create_thread -> create_worker ->
                      -> mysql_thread_create -> spawn_thread_v1
  current thread's PFS_thread* remembered here as parent thread's
  PFS_thread* for newly created thread t2
* t1 finishes it's wait, sends result to client and finishes itself
* t2 is just about to start, it tries to attach instrumentation to
  itself. It wants to access PFS_thread* of t1 which is no longer
  available.
this scenario is impossible in Percona Server 5.5 as pointers
to parent thread's structures are not used
I'll try to reproduce it with DEBUG_SYNC or using other tricks

Changed in percona-server:
assignee: nobody → Sergei Glushchenko (sergei.glushchenko)
Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

It is usual for threadpool's worker threads to spawn new treads. Also parent thread can easily be terminated shortly after it spawned a child. So, the proper fix would be to introduce some kind of synchronization here.

I see couple options.

1. Add atomic variable to PFS_thread which would indicate that structure is in use. Just before child thread is spawned we increment this variable. Once new thread copied all needed information, we decrement this counter. When it is time to destroy parent thread we wait for counter became 0. We can use spin and check for it, or we can use introduce conditional variable.

2. Extend PFS_spawn_thread_arg structure and copy all the needed information from parent thread to it just before creating the new thread. pfs_spawn_thread will use copy stored in PFS_spawn_thread_arg. Additional fields for PFS_spawn_thread_arg are:

  /** Internal thread identifier, unique. */
  ulonglong m_thread_internal_id;
  /** Parent internal thread identifier. */
  ulonglong m_parent_thread_internal_id;
  /** User name. */
  char m_username[USERNAME_LENGTH];
  /** Length of @c m_username. */
  uint m_username_length;
  /** Host name. */
  char m_hostname[HOSTNAME_LENGTH];
  /** Length of @c m_hostname. */
  uint m_hostname_length;

I like second approach more.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Option 2) looks like it'd be less error-prone to implement as well.

tags: added: tp
tags: added: 56qual
Revision history for this message
Roel Van de Paar (roel11) wrote :

--56qual as tp is beta

tags: removed: 56qual
Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

    The stack trace in description points to access violation at the
    pfs->m_parent_thread_internal_id= parent->m_thread_internal_id; in
    pfs_spawn_thread.

    I've tracked this bug down and can explain it pretty well. Since
    PFS designed to handle only limited amount of threads all the
    threads above this number will not have THR_PFS assigned
    (i.e. create_thread in pfs_spawn_thread can return NULL and it
    will be assigned to thread-specific THR_PFS). This however did not
    taken into account neither in spawn_thread_v1 (where NULL can be
    assigned to psi_arg->m_parent_thread) nor pfs_spawn_thread itself
    (where NULL-valued psi_arg->m_parent_thread can be accessed).
    This probably not the issue for upstream since new threads
    commonly spawned by "old" threads which likely have their THR_PFS
    keys assigned. But with threadpool enabled every worker thread can
    spawn new thread when it about to wait. I saw this kind of
    failures, but it is hard to construct MTR test case for them.

    My previous assumption was that parent spawned a child thread,
    finished it's wait and stopped, while child was too slow to finish
    it's initialization in the meantime. This is also possible but
    I've never seen this in practice due to thread_pool_idle_timeout
    and threadpool internal logic.

    This bug was fixed upstream (rev. 0.18870.1) with following
    message:

    Bug#16939689 VALGRIND ERROR IN PFS_SPAWN_THREAD

    Before this fix, valgrind errors could be reported under stress tests.

    In particular, the following line in pfs_spawn_thread()
    memcpy(pfs->m_username, parent->m_username, sizeof(pfs->m_username));

    could cause memcpy to be called for the same source and destination,
    which should not happen.

    The root cause, by analysis, is that the code is reading attributes
    from the parent thread, while the parent thread instrumentation is already
    destroyed (and was reused for the child, triggering valgrind complaints).

    The fix is to capture all the parent attributes needed:
    - the thread internal id
    - the thread user name
    - the thread host name
    in the parent thread, before spawning a child,
    and use that to initialize the child instrumentation,
    instead of pasing a pointer that might become invalid.

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

Marking as Invalid because fix already released upstream and incorporated to PS.

tags: added: upstream
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-677

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.