sssd fails with 'Exiting the SSSD. Could not restart critical service [tpad].'

Bug #1684295 reported by Pamela Skutnik
This bug affects 1 person
Affects         Status         Importance   Assigned to        Milestone
sssd (Ubuntu)   Fix Released   High         Andreas Hasenack
Xenial          Fix Released   High         Andreas Hasenack

Bug Description

[Impact]
In this particular configuration, when ldap_rfc2307_fallback_to_local_users is set to true in /etc/sssd/sssd.conf and a local user is a member of an LDAP group but does not exist in the directory (other scenarios are possible), the sssd_be process segfaults and logins might be prevented.

The original scenario is a bit more complex and involves setting up an Active Directory server, but with help from the bug reporter (thanks @pam-s!) we managed to narrow it down to this simple test case.
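
To check whether a given host matches the configuration described above, a small shell sketch can look for the option in sssd.conf. The function name fallback_enabled is hypothetical, and the pattern assumes the option is written on its own line as "name = true" in any letter case; it does not parse sections or handle other spellings of a boolean.

```shell
# Hypothetical helper, not part of sssd: succeeds if the given config file
# sets ldap_rfc2307_fallback_to_local_users to true (any letter case).
# Assumption: the option sits on its own line; sections are not parsed.
fallback_enabled() {
    conf="${1:-/etc/sssd/sssd.conf}"
    grep -Eiq '^[[:space:]]*ldap_rfc2307_fallback_to_local_users[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$conf"
}
```

Since the test case makes sssd.conf readable by root only, the function has to run in a root shell, for example as: fallback_enabled && echo "fallback enabled"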

[Test Case]

# Install the packages. When prompted, choose any password for the ldap admin
$ sudo apt update; sudo apt install sssd slapd

# create the sssd config
$ sudo tee /etc/sssd/sssd.conf <<EOF
[sssd]
config_file_version = 2
services = nss, pam
domains = LDAP

[domain/LDAP]
id_provider = ldap
ldap_uri = ldap://localhost
ldap_search_base = dc=example,dc=com
ldap_rfc2307_fallback_to_local_users = True
EOF

$ sudo chmod 0600 /etc/sssd/sssd.conf
# reconfigure slapd for domain example.com, organization example. For the rest, accept defaults
$ sudo dpkg-reconfigure slapd

# add the base ldif. When prompted, use the password you chose when reconfiguring slapd earlier
$ ldapadd -x -D cn=admin,dc=example,dc=com -W <<EOF
dn: ou=People,dc=example,dc=com
ou: People
objectClass: organizationalUnit

dn: ou=Group,dc=example,dc=com
ou: Group
objectClass: organizationalUnit

dn: cn=ldapusers,ou=Group,dc=example,dc=com
cn: ldapusers
objectClass: posixGroup
gidNumber: 10000
memberUid: localuser
EOF

adding new entry "ou=People,dc=example,dc=com"

adding new entry "ou=Group,dc=example,dc=com"

adding new entry "cn=ldapusers,ou=Group,dc=example,dc=com"

# create a localuser with that name
$ sudo useradd -M localuser

# restart sssd
$ sudo service sssd restart

# take note of the sssd_be process id:
$ pidof sssd_be
15474

# in one terminal, keep tailing /var/log/syslog
$ sudo tail -f /var/log/syslog

# in another terminal, run this id command. It will possibly hang for a bit, and won't show the "ldapusers" group membership
$ id localuser
(hangs a bit)
uid=1001(localuser) gid=1001(localuser) groups=1001(localuser)

# /var/log/syslog will emit messages like these, about a crash and sssd_be restarting (if you don't have apport installed, you will just see the "starting up" bit about sssd_be):
Nov 6 17:17:08 xenial-sssd-bad-initgroups-result-1684295 systemd[1]: Starting Apport crash forwarding receiver...
Nov 6 17:17:08 xenial-sssd-bad-initgroups-result-1684295 sssd[be[LDAP]]: Starting up
Nov 6 17:17:08 xenial-sssd-bad-initgroups-result-1684295 systemd[1]: Started Apport crash forwarding receiver.

# verify that the sssd_be process id changed, confirming that it crashed and was restarted:
$ pidof sssd_be
15485

# install the fixed packages from proposed
$ apt install/dist-upgrade ....

# repeat the id command. Now it finishes quickly, shows the "ldapusers" group membership, and there won't be any sign of an sssd_be restart in /var/log/syslog:
$ id localuser
uid=1001(localuser) gid=1001(localuser) groups=1001(localuser),10000(ldapusers)
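
The two manual checks in the test case above (did the sssd_be PID change, and is "ldapusers" in the group list) can be sketched as shell functions. Both function names are hypothetical, and the sketch assumes a single configured domain so that pidof returns exactly one sssd_be PID.

```shell
# Hypothetical helpers for the test case; names are illustrative only.

# Succeeds if the space-separated group list in $1 (as printed by
# "id -Gn <user>") contains the group name $2 as a whole word.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if both PIDs are non-empty and differ, which in this test
# case means sssd_be was restarted between the two measurements.
pid_changed() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}
```

A possible run: store before=$(pidof sssd_be), run "id localuser", then call pid_changed "$before" "$(pidof sssd_be)". With the fixed package, has_group "$(id -Gn localuser)" ldapusers would be expected to succeed and pid_changed to fail.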

[Regression Potential]
The patch is very specific, but given how many different ways sssd can be configured, it would really help if users tested the package from proposed in their own deployments, especially considering that it is a login service.

That being said, the patch has been applied upstream in the 1.13, 1.14 and current 1.15 series and is more than a year old by now. It could still rely on other changes that I missed, and there is at least one that I chose to ignore (see [Other Info]).

[Other Info]
The exact upstream patch (https://pagure.io/SSSD/sssd/c/5a0fb268e836e600d864ded7de5d935946ae6c61) wasn't applied, because it relied on dropping an unused parameter from sdap_fallback_local_user(), namely the *opts struct pointer (https://pagure.io/SSSD/sssd/c/77f960ab32c2d2245fed55671f24af287ea0ba50). It is indeed not used, but I would rather not drop it for an SRU, because I don't know whether some library could be using it, and also because a new upstream version for this series (1.13.5) hasn't been released with this change yet.

Pamela Skutnik (pam-s) wrote :
Andreas Hasenack (ahasenack) wrote :

Thanks for filing this bug in Ubuntu.

This one will be a bit complex for me to reproduce locally, as your sssd.conf file has a lot of customizations and multiple servers for each domain. It would be best if we could reduce this complexity as much as possible and still reproduce the problem.

Have you tried with just one of your two domains to see if the crash still happens? For example, list just "gc" or just "tpad" in [sssd]->domains and then log in with credentials from each.

Changed in sssd (Ubuntu):
status: New → Incomplete
Pamela Skutnik (pam-s) wrote :

Andreas,
thank you for looking into this. It has been a while since I looked at this problem, since our temporary fix of holding the sssd packages at version 1.12.5-2 was working. I see that in the meantime, new versions of the sssd packages have been released. I have tested with the 1.13.4-1ubuntu1.5 and 1.13.4-1ubuntu1.6 and I am not able to reproduce the problem with those versions. It would be nice to know what fixed the issue but I do not see anything obvious in the package changelogs.

Launchpad Janitor (janitor) wrote :

[Expired for sssd (Ubuntu) because there has been no activity for 60 days.]

Changed in sssd (Ubuntu):
status: Incomplete → Expired
Pamela Skutnik (pam-s) wrote :

Sorry, finally got back to working on this. Can I continue to post here or do I need to reopen?

I have set up a test server configured with only the one domain, tpad.
I also updated the sssd packages to 1.13.4-1ubuntu1.8, the highest I had.
It still crashes as soon as I log in with any id (local or tpad).

root@dcmilphlum127:~# systemctl status sssd
● sssd.service - System Security Services Daemon
   Loaded: loaded (/lib/systemd/system/sssd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2017-10-04 15:52:08 EDT; 24min ago
  Process: 13480 ExecStart=/usr/sbin/sssd -i -f (code=exited, status=1/FAILURE)
 Main PID: 13480 (code=exited, status=1/FAILURE)

Oct 04 15:50:48 dcmilphlum127.edc.nam.gm.com systemd[1]: Started System Security Services Daemon.
Oct 04 15:51:49 dcmilphlum127.edc.nam.gm.com sssd[be[13528]: Starting up
Oct 04 15:51:52 dcmilphlum127.edc.nam.gm.com sssd[be[13530]: Starting up
Oct 04 15:51:58 dcmilphlum127.edc.nam.gm.com sssd[be[13531]: Starting up
Oct 04 15:52:08 dcmilphlum127.edc.nam.gm.com sssd[13480]: Exiting the SSSD. Could not restart critical service [tpad].
Oct 04 15:52:08 dcmilphlum127.edc.nam.gm.com sssd[13484]: Shutting down
Oct 04 15:52:08 dcmilphlum127.edc.nam.gm.com sssd[13483]: Shutting down
Oct 04 15:52:08 dcmilphlum127.edc.nam.gm.com systemd[1]: sssd.service: Main process exited, code=exited, status=1/FAILURE
Oct 04 15:52:08 dcmilphlum127.edc.nam.gm.com systemd[1]: sssd.service: Unit entered failed state.
Oct 04 15:52:08 dcmilphlum127.edc.nam.gm.com systemd[1]: sssd.service: Failed with result 'exit-code'.

root@dcmilphlum127:~# tail /var/log/all/kern.log

Oct 4 15:51:50 dcmilphlum127 kernel: [ 6814.060738] sssd_be[13528]: segfault at 0 ip 00007f302c49db94 sp 00007fff0b589cb0 error 4 in libsss_util.so[7f302c489000+6c000]
Oct 4 15:51:54 dcmilphlum127 kernel: [ 6818.117855] sssd_be[13530]: segfault at 0 ip 00007f6e98ac0b94 sp 00007fffc35cae40 error 4 in libsss_util.so[7f6e98aac000+6c000]
Oct 4 15:52:08 dcmilphlum127 kernel: [ 6832.177545] sssd_be[13531]: segfault at 0 ip 00007fb2b6ccdb94 sp 00007fff392e1e00 error 4 in libsss_util.so[7fb2b6cb9000+6c000]

I attached my modified sssd.conf file.
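
In the kernel segfault lines above, the "ip" value is the faulting instruction pointer and the bracketed pair after libsss_util.so is the library's load base and size. Subtracting the base from the ip gives the offset inside the library, which is a quick way to see whether repeated crashes hit the same instruction. A minimal sketch (the function name fault_offset is hypothetical):

```shell
# Hypothetical helper: print the offset of a faulting instruction inside
# a shared library, given the "ip" value ($1) and the load base ($2) from
# a kernel segfault line. Both are hex strings without a 0x prefix.
fault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

fault_offset 7f302c49db94 7f302c489000   # sssd_be[13528]
fault_offset 7f6e98ac0b94 7f6e98aac000   # sssd_be[13530]
fault_offset 7fb2b6ccdb94 7fb2b6cb9000   # sssd_be[13531]
```

All three calls print 0x14b94, so the three restarts crashed at the same offset in libsss_util.so.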

Andreas Hasenack (ahasenack) wrote :

Continuing on this bug is fine, thanks for getting back to us with a simplified configuration file. I reopened the bug so that it will be picked up again.

Changed in sssd (Ubuntu):
status: Expired → New
Pamela Skutnik (pam-s) wrote :

I turned on debugging and collected the sssd logs at the time of the failure.

Pamela Skutnik (pam-s) wrote :
Andreas Hasenack (ahasenack) wrote :

Thanks for these

Andreas Hasenack (ahasenack) wrote :

Looking at this again.

Andreas Hasenack (ahasenack) wrote :

I think the best way forward here is to get a core dump, so we can have a better idea of where the crash is happening.

I induced a crash in my test sssd container, and since I have apport installed, a crash file was produced in /var/crash:
# ll /var/crash/
total 644
drwxrwxrwt 2 root root 4 Nov 1 20:34 ./
drwxr-xr-x 13 root root 15 Sep 19 19:18 ../
-rwxr-xr-x 1 root root 0 Nov 1 20:34 .lock*
-rw-r----- 1 root root 593417 Nov 1 20:34 _usr_lib_x86_64-linux-gnu_sssd_sssd_be.0.crash

Could you please check if you have a recent crash file related to sssd in that directory?

If not, do this:
sudo apt install apport

# check the kernel core_pattern:

# sysctl kernel.core_pattern
kernel.core_pattern = |/usr/share/apport/apport %p %s %c %P

And then restart sssd and induce the crash again, and then hopefully you will have a crash file and we can go from there.
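
The core_pattern check above can be sketched as a shell function. The name uses_apport is hypothetical; the test is simply that the pattern starts with a pipe and mentions apport, as in the output shown above.

```shell
# Hypothetical helper: succeeds if the core pattern in $1 pipes core
# dumps to a program whose path or name contains "apport".
uses_apport() {
    case "$1" in
        "|"*apport*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: uses_apport "$(sysctl -n kernel.core_pattern)" || echo "core dumps are not routed to apport"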

Pamela Skutnik (pam-s) wrote :

Crash file attached.

Andreas Hasenack (ahasenack) wrote :

Thanks, I can track this down more easily now.

(gdb) bt
#0 sysdb_attrs_get_el_ext (attrs=attrs@entry=0x0, name=name@entry=0x7f1e14bb504c "stamp", alloc=alloc@entry=true, el=el@entry=0x7ffc041e48a8) at ../src/db/sysdb.c:326
#1 0x00007f1e2d283d0d in sysdb_attrs_get_el (attrs=attrs@entry=0x0, name=name@entry=0x7f1e14bb504c "stamp", el=el@entry=0x7ffc041e48a8) at ../src/db/sysdb.c:360
#2 0x00007f1e14b6dda6 in sdap_attrs_get_sid_str (mem_ctx=mem_ctx@entry=0x1664b40, idmap_ctx=0x1682ba0, sysdb_attrs=sysdb_attrs@entry=0x0, sid_attr=0x7f1e14bb504c "stamp",
    _sid_str=_sid_str@entry=0x7ffc041e4998) at ../src/providers/ldap/ldap_common.c:897
#3 0x00007f1e14b7a878 in sdap_save_user (memctx=memctx@entry=0x1bc3c20, opts=0x1679b20, dom=0x167aa80, attrs=0x0, _usn_value=_usn_value@entry=0x0, now=now@entry=0)
    at ../src/providers/ldap/sdap_async_users.c:160
#4 0x00007f1e14b8be07 in sdap_get_initgr_user (subreq=0x0) at ../src/providers/ldap/sdap_async_initgroups.c:2896
#5 0x00007f1e14b75428 in generic_ext_search_handler (subreq=0x0, opts=<optimized out>) at ../src/providers/ldap/sdap_async.c:1668
#6 0x00007f1e14b77908 in sdap_get_generic_op_finished (op=<optimized out>, reply=<optimized out>, error=<optimized out>, pvt=<optimized out>) at ../src/providers/ldap/sdap_async.c:1561
#7 0x00007f1e14b7638d in sdap_process_message (ev=<optimized out>, sh=<optimized out>, msg=0x1664ae0) at ../src/providers/ldap/sdap_async.c:352
#8 sdap_process_result (ev=<optimized out>, pvt=<optimized out>) at ../src/providers/ldap/sdap_async.c:196
#9 0x00007f1e2df90613 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#10 0x00007f1e2df8eb57 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#11 0x00007f1e2df8ad3d in _tevent_loop_once () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#12 0x00007f1e2df8aedb in tevent_common_loop_wait () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#13 0x00007f1e2df8eaf7 in ?? () from /usr/lib/x86_64-linux-gnu/libtevent.so.0
#14 0x00007f1e2d2aff83 in server_loop (main_ctx=0x15ec060) at ../src/util/server.c:692
#15 0x0000000000406412 in main (argc=8, argv=<optimized out>) at ../src/providers/data_provider_be.c:2994
(gdb) frame 0
#0 sysdb_attrs_get_el_ext (attrs=attrs@entry=0x0, name=name@entry=0x7f1e14bb504c "stamp", alloc=alloc@entry=true, el=el@entry=0x7ffc041e48a8) at ../src/db/sysdb.c:326
326 for (i = 0; i < attrs->num; i++) {

Andreas Hasenack (ahasenack) wrote :
Changed in sssd (Ubuntu):
status: New → Triaged
importance: Undecided → High
tags: added: bitesize server-next
Pamela Skutnik (pam-s) wrote :

My coworker has recompiled the 1.13.4-1ubuntu1.8 sssd packages with that patch and we are testing.

Andreas Hasenack (ahasenack) wrote :

I got a small reproducer case. With a simple "id <user>" command I get sssd_be to segfault, and with the above patch applied it no longer segfaults and also produces the correct result. I'll use that for the SRU test plan.

Andreas Hasenack (ahasenack) wrote :
Changed in sssd (Ubuntu):
assignee: nobody → Andreas Hasenack (ahasenack)
status: Triaged → In Progress
Andreas Hasenack (ahasenack) wrote :

It's only xenial that is affected, that is, version 1.13.4 and perhaps earlier. Trusty, zesty and higher are OK.

description: updated
description: updated
description: updated
Pamela Skutnik (pam-s) wrote :

I only have xenial and trusty and I can confirm that trusty is not affected.
For Xenial, both the amd64 and s390x architectures are affected.

Andreas Hasenack (ahasenack) wrote :

@pam-s, as soon as you can confirm this patch fixes your problem (feel free to use my PPA packages), please let us know so we can proceed with the SRU.

My test case reproduces the segfault, but I would like to be sure it also fixes it in your environment before continuing.

Thanks again

description: updated
Pamela Skutnik (pam-s) wrote :

I tested with your PPA packages and it does fix my problem. Please proceed and thank you!

Changed in sssd (Ubuntu Xenial):
status: New → In Progress
assignee: nobody → Andreas Hasenack (ahasenack)
Changed in sssd (Ubuntu):
status: In Progress → Fix Released
Changed in sssd (Ubuntu Xenial):
importance: Undecided → High
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Pamela, or anyone else affected,

Accepted sssd into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/sssd/1.13.4-1ubuntu1.9 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in sssd (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Andreas Hasenack (ahasenack) wrote :

xenial verification:

Reproducing the crash:
ubuntu@xenial-sssd-1684295:~$ pidof sssd_be
3516

ubuntu@xenial-sssd-1684295:~$ id localuser
(stuck for 30s or more)
uid=1001(localuser) gid=1001(localuser) groups=1001(localuser)

syslog shows the crash:
Nov 13 13:05:55 xenial-sssd-1684295 systemd[1]: Starting Apport crash forwarding receiver...
Nov 13 13:06:40 xenial-sssd-1684295 sssd: Killing service [LDAP], not responding to pings!
Nov 13 13:06:54 xenial-sssd-1684295 sssd[be[LDAP]]: Starting up

pid changed:
ubuntu@xenial-sssd-1684295:~$ pidof sssd_be
4639

With the new package it works:
  Version table:
 *** 1.13.4-1ubuntu1.9 500
        500 http://br.archive.ubuntu.com/ubuntu xenial-proposed/main amd64 Packages

ubuntu@xenial-sssd-1684295:~$ id localuser
uid=1001(localuser) gid=1001(localuser) groups=1001(localuser),10000(ldapusers)

xenial verification succeeded.
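
On xenial, whether an installed sssd already carries this fix can be sketched as a version comparison against 1.13.4-1ubuntu1.9. The function name has_fix is hypothetical, and the comparison is only meaningful for the xenial package series, since other releases ship different sssd versions.

```shell
# Hypothetical helper, xenial only: succeeds if the package version in $1
# is at least 1.13.4-1ubuntu1.9, the version that fixed this bug.
has_fix() {
    dpkg --compare-versions "$1" ge "1.13.4-1ubuntu1.9"
}
```

For example: has_fix "$(dpkg-query -W -f='${Version}' sssd)" && echo "fixed package installed"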

tags: added: verification-done-xenial
removed: verification-needed-xenial
Pamela Skutnik (pam-s) wrote :

Yes, the sssd-1.13.4-1ubuntu1.9 fixed the problem. I did not see any other issues as a result.

thanks,
Pam

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package sssd - 1.13.4-1ubuntu1.9

---------------
sssd (1.13.4-1ubuntu1.9) xenial; urgency=medium

  * debian/patches/bad-initgroups-results-3045.patch: sdap: Fix
    ldap_rfc_2307_fallback_to_local_users. Thanks to Michal Židek
    <email address hidden>. Closes LP: #1684295.

 -- Andreas Hasenack <email address hidden> Mon, 06 Nov 2017 12:15:20 -0200

Changed in sssd (Ubuntu Xenial):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for sssd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
