ovs-vswitchd crashed with SIGSEGV in nl_attr_get_size()

Bug #1352570 reported by James Page
This bug affects 6 people
Affects               Status        Importance  Assigned to  Milestone
openvswitch (Ubuntu)  Invalid       High        Unassigned
  Trusty              Fix Released  High        Liam Young
  Utopic              Invalid       High        Unassigned

Bug Description

[Impact]
The userspace daemon (ovs-vswitchd) dies and all flows are lost during instance teardown on OpenStack hypervisor nodes. As a result, all instances on the impacted server lose network access.

[Test Case]
Deploy OpenStack (sounds easy, right?) and use it regularly; at some point instance termination will result in the userspace daemon dying.

[Regression potential]
Limited; the fix is included in a new upstream point release which is covered by the usual upstream testing.

[Original Bug Report]
This crash has been observed a few times on this particular OpenStack cloud; it results in loss of network connectivity to instances running on the hypervisor.

ProblemType: Crash
DistroRelease: Ubuntu 14.04
Package: openvswitch-switch 2.0.1+git20140120-0ubuntu2
ProcVersionSignature: Ubuntu 3.13.0-24.46-generic 3.13.9
Uname: Linux 3.13.0-24-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.2
Architecture: amd64
Date: Mon Aug 4 18:10:16 2014
ExecutablePath: /usr/sbin/ovs-vswitchd
ExecutableTimestamp: 1393166598
ProcCmdline: ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach --monitor
ProcCwd: /
ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
SegvAnalysis:
 Segfault happened at: 0x459110 <nl_attr_get_size>: movzwl (%rdi),%eax
 PC (0x00459110) ok
 source "(%rdi)" (0x00000000) not located in a known VMA region (needed readable region)!
 destination "%eax" ok
SegvReason: reading NULL VMA
Signal: 11
SourcePackage: openvswitch
StacktraceTop:
 nl_attr_get_size (nla=nla@entry=0x0) at ../lib/netlink.c:506
 format_generic_odp_key (a=a@entry=0x0, ds=ds@entry=0x7fff9005e1e0) at ../lib/odp-util.c:767
 format_odp_key_attr (a=a@entry=0xc5e990, ma=ma@entry=0x0, ds=ds@entry=0x7fff9005e1e0, verbose=verbose@entry=true) at ../lib/odp-util.c:1331
 odp_flow_format (key=key@entry=0xc5e920, key_len=key_len@entry=120, mask=mask@entry=0x0, mask_len=mask_len@entry=0, ds=ds@entry=0x7fff9005e1e0, verbose=verbose@entry=true) at ../lib/odp-util.c:1401
 log_flow_message (error=error@entry=2, operation=operation@entry=0x4d0b93 "flow_del", key=0xc5e920, key_len=120, mask=mask@entry=0x0, mask_len=mask_len@entry=0, stats=0x0, actions=actions@entry=0x0, actions_len=actions_len@entry=0, dpif=<optimized out>) at ../lib/dpif.c:1354
Title: ovs-vswitchd crashed with SIGSEGV in nl_attr_get_size()
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
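
For readers unfamiliar with the failing call: the faulting instruction `movzwl (%rdi),%eax` is a 16-bit load from the address held in the first argument, which is consistent with nl_attr_get_size() reading the length field at the start of a Netlink attribute header while `nla` is NULL. The sketch below models that in Python, assuming the standard Linux `struct nlattr` layout (a 16-bit `nla_len` followed by a 16-bit `nla_type`, giving a 4-byte header). It is an illustration only, not the Open vSwitch source and not the upstream fix; `attr_payload_size` and its `None` check are hypothetical names chosen here.

```python
import struct

NLA_HDRLEN = 4  # assumed size of struct nlattr: u16 nla_len + u16 nla_type


def attr_payload_size(attr):
    """Return the payload size of a Netlink attribute, or None if it is missing.

    `attr` is a bytes object that starts at the attribute header, or None to
    model the NULL pointer seen in the trace (nla=0x0).
    """
    if attr is None:
        # The C code in the trace dereferences the pointer at this point and
        # crashes; this hypothetical guard reports the missing attribute.
        return None
    nla_len, _nla_type = struct.unpack_from("=HH", attr, 0)
    return nla_len - NLA_HDRLEN


# A well-formed attribute: total length 8, type 1, 4-byte payload.
attr = struct.pack("=HH", 8, 1) + b"\x00\x00\x00\x00"
print(attr_payload_size(attr))  # payload size of the well-formed attribute
print(attr_payload_size(None))  # the missing-attribute case from the trace
```

In the trace, the flow key itself is present (key_len=120) while the mask is absent (mask=0x0, mask_len=0), so the NULL pointer passed down to nl_attr_get_size() plausibly comes from the mask side of the formatting path.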

James Page (james-page) wrote :
Apport retracing service (apport) wrote : Stacktrace.txt
Apport retracing service (apport) wrote : StacktraceSource.txt
Apport retracing service (apport) wrote : ThreadStacktrace.txt
Changed in openvswitch (Ubuntu):
importance: Undecided → Medium
tags: removed: need-amd64-retrace
James Page (james-page) wrote :
information type: Private → Public
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openvswitch (Ubuntu):
status: New → Confirmed
James Page (james-page)
Changed in openvswitch (Ubuntu):
importance: Medium → High
assignee: nobody → James Page (james-page)
James Page (james-page) wrote :

I've spoken with upstream about this and the commit identified probably resolves this issue (testing on our internal cloud appears to confirm this with packages from ppa:james-page/openvswitch).

I've requested a new point release, however if that does not appear in a reasonable period of time, I'm proposing we take a snapshot of the current tip of the 2.0 branch upstream and run with that.

Changed in openvswitch (Ubuntu):
status: Confirmed → Triaged
Changed in openvswitch (Ubuntu Trusty):
status: New → Triaged
importance: Undecided → High
assignee: nobody → James Page (james-page)
Changed in openvswitch (Ubuntu Utopic):
assignee: James Page (james-page) → nobody
James Page (james-page) wrote :

New point release including fix for this bug uploaded to trusty-proposed for SRU team review.

description: updated
Changed in openvswitch (Ubuntu Utopic):
status: Triaged → Invalid
Changed in openvswitch (Ubuntu Trusty):
status: Triaged → In Progress
James Page (james-page)
Changed in openvswitch (Ubuntu Trusty):
assignee: James Page (james-page) → Liam Young (gnuoy)
Chris J Arges (arges) wrote : Please test proposed package

Hello James, or anyone else affected,

Accepted openvswitch into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/openvswitch/2.0.2-0ubuntu0.14.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will help us get this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in openvswitch (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
James Page (james-page) wrote :

We've not seen any problems since upgrading to the version in trusty-proposed; marking verification done.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openvswitch - 2.0.2-0ubuntu0.14.04.1

---------------
openvswitch (2.0.2-0ubuntu0.14.04.1) trusty; urgency=medium

  * New upstream stable release (LP: #1357229):
    - Includes fix for SIGSEGV in nl_attr_get_size() due to use of
      de-allocated object during flow teardown (LP: #1352570).
 -- James Page <email address hidden> Fri, 15 Aug 2014 09:05:46 +0100

Changed in openvswitch (Ubuntu Trusty):
status: Fix Committed → Fix Released
Scott Kitterman (kitterman) wrote : Update Released

The verification of the Stable Release Update for openvswitch has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Dave Chiluk (chiluk) wrote :

There are numerous reports that this is still not resolved in the latest packages available in trusty.
http://openvswitch.org/pipermail/discuss/2014-December/015931.html

tags: added: cts
sridhar basam (sri-7) wrote :

I am the person who sent the email above about crashes on the 2.0.2 branch. To add some context on the frequency of crashes on 2.0.2: we have seen crashes on 2.0.2 a couple of times so far, compared to 100s of crashes over the past month on the 2.0.1 code. I will update this bug report and the ovs mailing list if the frequency changes.

If there is any additional information I can provide, let me know. Thanks.

Ryan Beisner (1chb1n)
tags: added: openstack uosci
Ryan Beisner (1chb1n) wrote :

We're seeing ovs crashes on a private deployment. They seem to surface only after being up for some time (>1mo) and after creating/deleting a lot of instances over that time period (>20K), but all with a sane system resource load.

$ apt-cache policy openvswitch-common
openvswitch-common:
  Installed: 2.0.2-0ubuntu0.14.04.1
  Candidate: 2.0.2-0ubuntu0.14.04.1
  Version table:
 *** 2.0.2-0ubuntu0.14.04.1 0
        500 http://archive.ubuntu.com//ubuntu/ trusty-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     2.0.1+git20140120-0ubuntu2 0
        500 http://archive.ubuntu.com//ubuntu/ trusty/main amd64 Packages
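
When checking many hypervisors, the installed version can be pulled out of `apt-cache policy` output mechanically. The following is a minimal sketch, assuming the output format shown above; `installed_version` and `FIXED_VERSION` are hypothetical names introduced here, and the check is a plain string match against the version named in the changelog earlier in this report, not a proper version ordering.

```python
import re

# Version that the changelog in this report says contains the fix.
FIXED_VERSION = "2.0.2-0ubuntu0.14.04.1"


def installed_version(policy_output):
    """Extract the 'Installed:' version from `apt-cache policy` output."""
    match = re.search(r"^\s*Installed:\s*(\S+)", policy_output, re.MULTILINE)
    return match.group(1) if match else None


sample = """openvswitch-common:
  Installed: 2.0.2-0ubuntu0.14.04.1
  Candidate: 2.0.2-0ubuntu0.14.04.1
"""

version = installed_version(sample)
print(version, version == FIXED_VERSION)
```

To decide whether one Debian version is older or newer than another, `dpkg --compare-versions` is the safer tool; comparing the strings directly only answers whether the two are identical.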

Ryan Beisner (1chb1n) wrote :

So from a user symptom / impact standpoint: when new instances are nova booted, they are able to send a DHCP DISCOVER packet through the corresponding bridge, but the returning DHCP OFFER traffic never reaches the new instance.

In all cases that I have seen, the neutron net, subnet, and port statuses all report A-OK via cli queries. In some cases, inspecting the new underlying bridge with brctl results in error(s), but not always.

## Symptomatic info re: bridge:
$ sudo brctl show qvob744fc12-71
bridge name	bridge id		STP enabled	interfaces
qvob744fc12-71		can't get info Operation not supported

$ sudo brctl showmacs qvob744fc12-71
read of forward table failed: Operation not supported

## Symptomatic info from nova console-log:
 * Starting configure network device   [ OK ]
cloud-init-nonet[13.52]: waiting 120 seconds for network device
cloud-init-nonet[133.52]: gave up waiting for a network device.
Cloud-init v. 0.7.5 running 'init' at Tue, 06 Jan 2015 16:03:14 +0000. Up 133.72 seconds.
ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | Device | Up | Address | Mask | Hw-Address |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . |
ci-info: | eth0 | True | . | . | fa:16:3e:f9:18:4f |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
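
Affected instances can be spotted by scanning the console log for the two symptoms shown above. This is a rough sketch, assuming the log text has already been fetched (for example with nova console-log) into a string; `has_no_network` and the marker list are hypothetical and only cover the messages quoted in this comment.

```python
def has_no_network(console_log):
    """Return True if the console log shows cloud-init giving up on the network."""
    markers = (
        "gave up waiting for a network device",
        "Route info failed",
    )
    return any(marker in console_log for marker in markers)


sample = (
    "cloud-init-nonet[13.52]: waiting 120 seconds for network device\n"
    "cloud-init-nonet[133.52]: gave up waiting for a network device.\n"
)
print(has_no_network(sample))
```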

Ryan Beisner (1chb1n) wrote :

For the life of me, I couldn't get ubuntu-bug or apport to automagically add the .crash and stack trace to this bug. Here it is, though, via attachment. ~9mb

Ryan Beisner (1chb1n) wrote :

Update, fyi:
Nova booted 110 instances. 16 had no net.
Deleted the instances.
Nova booted 110 more instances. 17 had no net.
Deleted the instances.
Consistent with the ~15% no net we saw last time around.
v
Deleted neutron nets and subnets, then re-added them.
^
Nova booted 110 instances. All had network.
Deleted the instances.
Nova booted 110 instances. All had network.
Deleted the instances.

Turned our CI engine back on (which will use this undercloud to instantiate a few hundred short-lived instances per day to test other code such as juju charms).

I predict recurrence in about 20K instances, short of a solid lead on a fix.
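
The "~15% no net" figure can be checked against the two failing runs reported in this update. A small sketch of the arithmetic, using only the counts given above (`runs` and `failure_rate` are names introduced here for illustration):

```python
# (instances booted, instances without network) for the two failing runs
runs = [(110, 16), (110, 17)]


def failure_rate(booted, failed):
    """Percentage of booted instances that came up without network."""
    return 100.0 * failed / booted


for booted, failed in runs:
    print(f"{failed}/{booted} = {failure_rate(booted, failed):.1f}%")

total_booted = sum(booted for booted, _ in runs)
total_failed = sum(failed for _, failed in runs)
print(f"combined: {failure_rate(total_booted, total_failed):.1f}%")
```

The two runs come out at roughly 14.5% and 15.5%, or 15.0% combined (33 of 220), which matches the earlier observation.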

James Page (james-page) wrote :

Please can any further discussion happen on bug 1336555.

Thanks!
