Unrecoverable error in the card

Bug #1078184 reported by Igor Ajdisek
This bug affects 1 person
Affects              Status         Importance  Assigned to  Milestone
The Emulex project   Fix Released   High        Jesse Sung
linux (Ubuntu)       Fix Released   High        Jesse Sung
  Precise            Fix Released   High        Jesse Sung
  Quantal            Fix Released   High        Jesse Sung

Bug Description

On an HP BL460c Gen8 with an HP FlexFabric 10Gb 2-port 554FLB Adapter running Ubuntu 12.04 Precise, the error "Unrecoverable error in the card" is reported on random occasions. This has a severe impact on the entire network, as it seems a network loop is created, which disrupts the network.

Nov 13 07:09:56 xx kernel: [1688054.612553] be2net 0000:04:00.0: Unrecoverable error in the card
Nov 13 07:09:56 xx kernel: [1688054.623011] be2net 0000:04:00.0: UE: PMEM bit set
Nov 13 07:09:56 xx kernel: [1688054.633419] be2net 0000:04:00.0: UE: TXP bit set

After the server is rebooted, the network is restored to its previous state.

Samantha Jian-Pielak (samantha-jian) wrote :

There were a few updates on be2net since 12.04.
Could you please provide the kernel version?

Igor Ajdisek (iajdisek) wrote :

Yes, I saw that, but I believe we are running the latest.

uname -a
Linux xx 3.2.0-32-generic #51-Ubuntu SMP Wed Sep 26 21:33:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Sarveshwar Bandi (sarveshwar-bandi) wrote :

A few questions:
- Every time you see an unrecoverable error, do you see the same errors (UE: PMEM, TXP) in the logs, or are the errors different every time?

- Can you give me the modinfo of be2net.ko and ethtool -i <intf-name> output (will give the firmware version)?

- Also, please run ethtool -d <intf-name> raw on and redirect the output to a file and attach it.

- What kind of traffic is running? Is it only IPv4, or do you have IPv6 too?

- What is the traffic throughput when you hit the issue?

As you know, there was a lockup issue, fixed as part of bug 1035348, that used to happen when the stack requested the driver to transmit a packet larger than 65535. I am not sure if that issue is recurring. Will you be able to compile and run a driver with debug logs on your setup?

Igor Ajdisek (iajdisek) wrote :

1) "UE: PMEM bit set" was missing a couple of times, but "UE: TXP bit set" and "Unrecoverable error in the card" always come as a pair.

2) Here you go:

ethtool -i <intf-name>

driver: be2net
version: 4.0.100u
firmware-version: 4.1.450.7
bus-info: 0000:04:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes

modinfo be2net

filename: /lib/modules/3.2.0-32-generic/kernel/drivers/net/ethernet/emulex/benet/be2net.ko
license: GPL
author: ServerEngines Corporation
description: ServerEngines BladeEngine 10Gbps NIC Driver 4.0.100u
version: 4.0.100u
srcversion: C8106BC01A381B8FC6B93E8
alias: pci:v000010DFd00000720sv*sd*bc*sc*i*
alias: pci:v000010DFd0000E228sv*sd*bc*sc*i*
alias: pci:v000010DFd0000E220sv*sd*bc*sc*i*
alias: pci:v000019A2d00000710sv*sd*bc*sc*i*
alias: pci:v000019A2d00000700sv*sd*bc*sc*i*
alias: pci:v000019A2d00000221sv*sd*bc*sc*i*
alias: pci:v000019A2d00000211sv*sd*bc*sc*i*
depends:
intree: Y
vermagic: 3.2.0-32-generic SMP mod_unload modversions
parm: num_vfs:Number of PCI VFs to initialize (uint)
parm: rx_frag_size:Size of a fragment that holds rcvd data. (ushort)

3) attached

4) only ipv4

5) We first thought it was related to high bandwidth utilization, because the first servers to show symptoms were static content servers serving images and videos. But then other servers with really low throughput (up to 10 Mbit/s) experienced the same problem.

Sarveshwar Bandi (sarveshwar-bandi) wrote :

One other question: is this VLAN traffic? Will you be able to run a debug driver on your setup?

Igor Ajdisek (iajdisek) wrote :

It is a regular network; the VLAN is assigned on the switch.
However, one detail worth mentioning is that bonding is used, and we experienced the issues while running in mode 6 (alb). After the last issue we switched to mode 1 (active/standby), and since then we have been running for 3 days without the issue occurring again.

According to the HP Support Document both mode 6 and mode 1 should be supported on their VirtualConnect.
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02957870&lang=en&cc=us&taskId=&prodSeriesId=4144084&prodTypeId=3709945

I will get back to you regarding the debug driver, because I need to get authorization first.

Igor Ajdisek (iajdisek) wrote :

Hey,

Unfortunately I couldn't get my supervisor to agree to running the debug version in the production environment, because of all the downtime we have already had in the past weeks due to this issue. With the different bond mode, things are running stable, at least for now, so they are unwilling to make any changes on production servers at the moment.

I spoke with a few colleagues and came across information that kernel 3.6.x contains driver version 4.4.31.0u.
Were there some significant changes/bugfixes between 4.0.100u and 4.4.31.0u that might solve our issue?

Sarveshwar Bandi (sarveshwar-bandi) wrote :

Hi,
We haven't put any fix specific to the issue you are observing into 4.4.31.0u. We are trying to reproduce the issue in our lab with the bonding mode that you used. I will let you know if we have any luck with that.

Thanks,
Sarvesh

Sarveshwar Bandi (sarveshwar-bandi) wrote :

Igor,
We were able to reproduce the bug in our setup. The issue is with the kernel bonding driver. Though the Emulex be2net driver instructs the kernel to limit the size of TSO packets, the bonding driver does not apply that limit to the bond device. When the stack transmits a packet of size greater than 65535, the network controller raises a UE (unrecoverable error). This issue can occur with any bonding mode.

I have posted a patch for the bonding driver to the upstream Linux kernel community. Once it is accepted, I will push the same patch to Ubuntu too.

Thanks,
Sarvesh
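
The mismatch described in the comment above can be illustrated with a small userspace model. This is only a sketch, not kernel code: `struct model_dev`, `fits_gso_limit`, `bond_overshoots_slave` and the `NIC_GSO_LIMIT` value are hypothetical names chosen for illustration, and the real stack applies the limit in more places than a single comparison. `GSO_MAX_SIZE` is the kernel's default of 65536.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Kernel default for a device that sets no limit of its own. */
#define GSO_MAX_SIZE 65536u

/* Assumed NIC limit for illustration: per the report, the controller
 * raises a UE for packets larger than 65535. */
#define NIC_GSO_LIMIT 65535u

/* Hypothetical stand-in for the one net_device field that matters here. */
struct model_dev {
    uint32_t gso_max_size;
};

/* True when a packet of pkt_len fits within the limit the device advertises. */
bool fits_gso_limit(const struct model_dev *dev, uint32_t pkt_len)
{
    assert(dev != NULL);
    return pkt_len <= dev->gso_max_size;
}

/* True when the bond would accept a packet that its slave cannot take,
 * which is the situation that ends in the unrecoverable error. */
bool bond_overshoots_slave(const struct model_dev *bond,
                           const struct model_dev *slave,
                           uint32_t pkt_len)
{
    return fits_gso_limit(bond, pkt_len) && !fits_gso_limit(slave, pkt_len);
}
```

In this model, a bond left at `GSO_MAX_SIZE` over a slave limited to `NIC_GSO_LIMIT` overshoots for a 65536-byte packet, while a bond that carries the slave's limit does not.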

Igor Ajdisek (iajdisek) wrote :

Sarvesh,

many thanks for all your help and very prompt resolution of this issue.

Br,
Igor

Sarveshwar Bandi (sarveshwar-bandi) wrote :

The following patch for the bonding driver has been accepted upstream (net tree) and needs to be pulled into Ubuntu.

From: Sarveshwar Bandi <email address hidden>

Patch sets the lowest gso_max_size and gso_max_segs values of the slave devices during enslave and detach.

Signed-off-by: Sarveshwar Bandi <email address hidden>
---
 drivers/net/bonding/bond_main.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b2530b0..5f5b69f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1379,6 +1379,8 @@ static void bond_compute_features(struct bonding *bond)
  struct net_device *bond_dev = bond->dev;
  netdev_features_t vlan_features = BOND_VLAN_FEATURES;
  unsigned short max_hard_header_len = ETH_HLEN;
+ unsigned int gso_max_size = GSO_MAX_SIZE;
+ u16 gso_max_segs = GSO_MAX_SEGS;
  int i;
  unsigned int flags, dst_release_flag = IFF_XMIT_DST_RELEASE;

@@ -1394,11 +1396,16 @@ static void bond_compute_features(struct bonding *bond)
   dst_release_flag &= slave->dev->priv_flags;
   if (slave->dev->hard_header_len > max_hard_header_len)
    max_hard_header_len = slave->dev->hard_header_len;
+
+  gso_max_size = min(gso_max_size, slave->dev->gso_max_size);
+  gso_max_segs = min(gso_max_segs, slave->dev->gso_max_segs);
  }

 done:
  bond_dev->vlan_features = vlan_features;
  bond_dev->hard_header_len = max_hard_header_len;
+ bond_dev->gso_max_segs = gso_max_segs;
+ netif_set_gso_max_size(bond_dev, gso_max_size);

  flags = bond_dev->priv_flags & ~IFF_XMIT_DST_RELEASE;
  bond_dev->priv_flags = flags | dst_release_flag;
--
1.7.9.5
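
For readers who want to try the patch's logic outside the kernel, the following is a minimal userspace sketch of the same minimum computation. It is an illustration under stated assumptions, not the kernel implementation: `struct model_dev` and `model_compute_gso_limits` are hypothetical names, and only the two GSO fields are modeled. The defaults mirror the kernel's `GSO_MAX_SIZE` (65536) and `GSO_MAX_SEGS` (65535).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Kernel defaults used when no slave imposes a tighter limit. */
#define GSO_MAX_SIZE 65536u
#define GSO_MAX_SEGS 65535u

/* Hypothetical stand-in for the two net_device fields the patch touches. */
struct model_dev {
    uint32_t gso_max_size;
    uint16_t gso_max_segs;
};

/* Give the bond the lowest gso_max_size and gso_max_segs of its slaves,
 * as the patch does in bond_compute_features(). With no slaves, the
 * bond keeps the kernel defaults. */
void model_compute_gso_limits(struct model_dev *bond,
                              const struct model_dev *slaves, size_t n)
{
    uint32_t size = GSO_MAX_SIZE;
    uint16_t segs = GSO_MAX_SEGS;
    size_t i;

    assert(bond != NULL);
    assert(slaves != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (slaves[i].gso_max_size < size)
            size = slaves[i].gso_max_size;
        if (slaves[i].gso_max_segs < segs)
            segs = slaves[i].gso_max_segs;
    }

    bond->gso_max_size = size;
    bond->gso_max_segs = segs;
}
```

Recomputing on every enslave and detach, as the patch description states, means the bond's limits also relax again once a restrictive slave is removed.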

Chris Van Hoof (vanhoof)
Changed in emulex:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Jesse Sung (wenchien)
information type: Proprietary → Public
Jesse Sung (wenchien)
Changed in emulex:
status: Confirmed → In Progress
Jesse Sung (wenchien)
Changed in linux (Ubuntu):
status: New → In Progress
assignee: nobody → Jesse Sung (wenchien)
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Precise):
assignee: nobody → Jesse Sung (wenchien)
status: New → Fix Committed
Changed in linux (Ubuntu Quantal):
assignee: nobody → Jesse Sung (wenchien)
status: New → In Progress
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Quantal):
status: In Progress → Fix Committed
Chris Van Hoof (vanhoof)
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Precise):
importance: Undecided → High
Changed in linux (Ubuntu Quantal):
importance: Undecided → High
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel for Precise in -proposed solves the problem (3.2.0-35.55). Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-precise' to 'verification-done-precise'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-precise
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel for Quantal in -proposed solves the problem (3.5.0-20.31). Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-quantal' to 'verification-done-quantal'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-quantal
Igor Ajdisek (iajdisek) wrote :

Sarvesh,

are you able to test this kernel in your testing environment?

Sarveshwar Bandi (sarveshwar-bandi) wrote :

Not yet. We will do it in the next couple of days.

Thanks,
Sarvesh

Luis Henriques (henrix) wrote :

Is there any chance of getting this bug verified for Precise and Quantal? We need to have this closed this week. Thanks.

Sarveshwar Bandi (sarveshwar-bandi) wrote :

Verification done.

Thanks,
Sarvesh

tags: added: verification-done-precise verification-done-quantal
removed: verification-needed-precise verification-needed-quantal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.2.0-35.55

---------------
linux (3.2.0-35.55) precise-proposed; urgency=low

  [Luis Henriques]

  * Release Tracking Bug
    - LP: #1086856

  [ Andy Whitcroft ]

  * Revert "SAUCE: ata_piix: add a disable_driver option"
    - LP: #1079084
  * Revert "SAUCE: ata_piix: defer disks to the Hyper-V drivers by default"
    - LP: #1079084
  * SAUCE: ata_piix: add a disable_driver option
    - LP: #1079084, #994870

  [ Upstream Kernel Changes ]

  * libata: add a host flag to ignore detected ATA devices
    - LP: #1079084
  * ata_piix: defer disks to the Hyper-V drivers by default
    - LP: #1079084, #929545, #942316

linux (3.2.0-35.54) precise-proposed; urgency=low

  [Luis Henriques]

  * Release Tracking Bug
    - LP: #1086349

  [ Kees Cook ]

  * Revert "SAUCE: SECCOMP: audit: always report seccomp violations"
    - LP: #1079469

  [ Luis Henriques ]

  * SAUCE: SECCOMP: audit: fix build on archs without CONFIG_AUDITSYSCALL
    - LP: #1079469

  [ Upstream Kernel Changes ]

  * seccomp: forcing auditing of kill condition
    - LP: #1079469
  * Bluetooth: Avoid calling undefined smp_conn_security()
    - LP: #1081676
  * x86: Remove the ancient and deprecated disable_hlt() and enable_hlt()
    facility
    - LP: #1081676
  * drm/nouveau: silence modesetting spam on pre-gf8 chipsets
    - LP: #1081676
  * drm/nouveau: fix suspend/resume when in headless mode
    - LP: #1081676
  * drm/nouveau: headless mode by default if pci class != vga display
    - LP: #1081676
  * nfsd: add get_uint for u32's
    - LP: #1081676
  * ALSA: PCM: Fix some races at disconnection
    - LP: #1081676
  * ALSA: usb-audio: Fix races at disconnection
    - LP: #1081676
  * ALSA: usb-audio: Use rwsem for disconnect protection
    - LP: #1081676
  * ALSA: usb-audio: Fix races at disconnection in mixer_quirks.c
    - LP: #1081676
  * ALSA: Add a reference counter to card instance
    - LP: #1081676
  * ALSA: Avoid endless sleep after disconnect
    - LP: #1081676
  * drm/radeon: fix typo in evergreen_mc_resume()
    - LP: #1081676
  * USB: mos7840: remove unused variable
    - LP: #1081676
  * rtnetlink: Fix problem with buffer allocation
    - LP: #1081676
  * rtnetlink: fix rtnl_calcit() and rtnl_dump_ifinfo()
    - LP: #1081676
  * gpio-timberdale: fix a potential wrapping issue
    - LP: #1081676
  * cfg80211: fix antenna gain handling
    - LP: #1081676
  * drm/i915: fix overlay on i830M
    - LP: #1081676
  * drm/i915: fixup infoframe support for sdvo
    - LP: #1081676
  * drm/i915: clear the entire sdvo infoframe buffer
    - LP: #1081676
  * crypto: cryptd - disable softirqs in cryptd_queue_worker to prevent
    data corruption
    - LP: #1081676
  * ARM: at91: at91sam9g10: fix SOC type detection
    - LP: #1081676
  * ARM: at91/i2c: change id to let i2c-gpio work
    - LP: #1081676
  * mac80211: Only process mesh config header on frames that RA_MATCH
    - LP: #1081676
  * mac80211: don't inspect Sequence Control field on control frames
    - LP: #1081676
  * mac80211: fix SSID copy on IBSS JOIN
    - LP: #1081676
  * wireless: drop invalid mesh address extension frames
    - LP: #1081676
  * mac80211: check managem...


Changed in linux (Ubuntu Precise):
status: Fix Committed → Fix Released
Adam Conrad (adconrad) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.5.0-21.32

---------------
linux (3.5.0-21.32) quantal-proposed; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1088979
  * SAUCE: i915_hsw: move i915_hsw_enabled symbol to intel_ips
    - LP: #1087622

linux (3.5.0-20.31) quantal-proposed; urgency=low

  [Luis Henriques]

  * Release Tracking Bug
    - LP: #1086759

  [ Ben Widawsky ]

  * SAUCE: i915_hsw: Include #define I915_PARAM_HAS_WAIT_TIMEOUT
    - LP: #1085245
  * SAUCE: i915_hsw: Include #define DRM_I915_GEM_CONTEXT_[CREATE,DESTROY]
    - LP: #1085245
  * SAUCE: i915_hsw: drm/i915: add register read IOCTL
    - LP: #1085245
  * SAUCE: i915_hsw: Include #define i915_execbuffer2_[set,get]_context_id
    - LP: #1085245

  [ Chris Wilson ]

  * SAUCE: i915_hsw: Include #define I915_GEM_PARAM_HAS_SEMAPHORES
    - LP: #1085245
  * SAUCE: i915_hsw: Include #define I915_PARAM_HAS_SECURE_BATCHES
    - LP: #1085245

  [ Daniel Vetter ]

  * SAUCE: i915_hsw: drm/i915: call intel_enable_gtt
    - LP: #1085245
  * SAUCE: i915_hsw: drm: add helper to sort panels to the head of the
    connector list
    - LP: #1085245
  * SAUCE: i915_hsw: drm: extract dp link bw helpers
    - LP: #1085245
  * SAUCE: i915_hsw: drm: extract drm_dp_max_lane_count helper
    - LP: #1085245
  * SAUCE: i915_hsw: drm: dp helper: extract drm_dp_channel_eq_ok
    - LP: #1085245
  * SAUCE: i915_hsw: drm: extract helpers to compute new training values
    from sink request
    - LP: #1085245
  * SAUCE: i915_hsw: drm: dp helper: extract drm_dp_clock_recovery_ok
    - LP: #1085245

  [ Dave Airlie ]

  * SAUCE: i915_hsw: Include #define I915_PARAM_HAS_PRIME_VMAP_FLUSH
    - LP: #1085245

  [ Leann Ogasawara ]

  * SAUCE: i915_hsw: Provide an ubuntu/i915 driver for Haswell graphics
    - LP: #1085245
  * SAUCE: i915_hsw: Revert "drm: Make the .mode_fixup() operations mode
    argument a const pointer" for ubuntu/i915 driver
    - LP: #1085245
  * SAUCE: i915_hsw: Rename ubuntu/i915 driver i915_hsw
    - LP: #1085245
  * SAUCE: i915_hsw: Only support Haswell with ubuntu/i915 driver
    - LP: #1085245
  * SAUCE: i915_hsw: Include #define DRM_I915_GEM_WAIT
    - LP: #1085245
  * SAUCE: i915_hsw: drm: extract dp link train delay functions from radeon
    - LP: #1085245
  * SAUCE: i915_hsw: drm/dp: Update DPCD defines
    - LP: #1085245
  * SAUCE: i915_hsw: Update intel_ips.h file location
    - LP: #1085245
  * SAUCE: i915_hsw: Provide updated drm_mm.h and drm_mm.c for ubuntu/i915
    - LP: #1085245
  * SAUCE: i915_hsw: drm/i915: Replace the array of pages with a
    scatterlist
    - LP: #1085245
  * SAUCE: i915_hsw: drm/i915: Replace the array of pages with a
    scatterlist
    - LP: #1085245
  * SAUCE: i915_hsw: drm/i915: Stop using AGP layer for GEN6+
    - LP: #1085245
  * SAUCE: i915_hsw: Add i915_hsw_gpu_*() calls for ubuntu/i915
    - LP: #1085245
  * i915_hsw: [Config] Enable CONFIG_DRM_I915_HSW=m
    - LP: #1085245

  [ Paulo Zanoni ]

  * SAUCE: drm/i915: fix hsw_fdi_link_train "retry" code
    - LP: #1085245
  * SAUCE: drm/i915: reject modes the LPT FDI receiver can't handle
    - LP: #1085245
  * SAUCE: drm/i915: add support for mPHY destination on i...

Changed in linux (Ubuntu Quantal):
status: Fix Committed → Fix Released
Jesse Sung (wenchien)
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Changed in emulex:
status: In Progress → Fix Released