network manager 0.6 should use 9 or 8 seconds as a scan timeout to workaround driver bugs

Bug #361602 reported by Alexander Sack
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
network-manager (Ubuntu)
Fix Released
High
Alexander Sack
Hardy
Fix Released
High
Alexander Sack

Bug Description

Binary package hint: network-manager

Reporting on behalf of Dan Langlois <dan clockwerk net>
(see http://mail.gnome.org/archives/networkmanager-list/2009-January/msg00245.html)

I have recently installed Ubuntu 8.04, Hardy Heron, on my laptop. This is
my first exposure to NetworkManager. For the purposes of this discussion,
I always connect to the same WPA wifi AP, named "darknet". At any given
moment 2 or 3 other WEP APs, and 1 or 2 unprotected APs are visible, but
I don't connect to them. My wifi hardware is an Intel 4965agn using the
iwl4965 driver.

Occasionally I'd boot up to find NM disconnected. When this happens, APs
are visible, but not darknet in particular. If I wait long enough (over
20min sometimes), eventually NM finds it and connects. This is way too
long to wait for an AP that I know is up and running and has a strong
signal where I'm sitting.

I started to feel that NM needed a manual "rescan" feature. Googling this
I found discussions from 2006 about this, including comments from
Dan Williams about how just touching the nm-applet should schedule a rescan.
(http://mail.gnome.org/archives/networkmanager-list/2006-November/msg00127.html)
So why was it not finding darknet?

I decided to get the source and experiment (given that I use HH,
I had to get the v0.6.6 source). What I found was that while in the
disconnected state, NM was indeed still scanning, but it just couldn't seem
to find darknet. So I dug a little deeper into the scanning guts of NM.

Already you might be thinking I should use a more recent version of NM.
However, I'm not sure that would work with Ubuntu HH. Since HH is the
Long Term Support version of Ubuntu, I thought I'd look into fixing it
up. This is why I'm investigating the older source and am keen to
improve it.

I'll skip along here as my email is getting long, my hypothesis is this:
NM waits too long from the time it initiates the scan. I believe the
driver (or more precisely, the Wireless Extensions (WE) code) is expiring
the scan result before NM gets them.

I propose that NM should read and process scan results sooner. The
current time of 10 seconds is right on the threshold of results
getting expired (as I understand it).

I'm no NM or WE expert, so I'll leave Dan Williams and other experts
to mull over my findings. Here are the details.

I added some nm_info()'s to NM source to see what was happening:

Jan 22 14:43:45 talon NetworkManager: <info> dan, wlan0 iw scan issued
Jan 22 14:43:47 talon NetworkManager: <info> dan, helper buf:8 SIOCGIWSCAN
Jan 22 14:43:55 talon NetworkManager: <info> dan, schedule_scan(0x80a0110,20000ms)
Jan 22 14:43:55 talon NetworkManager: <info> dan, process_scan_results(0x80a0110,0x80a1800,211) WE version 22
Jan 22 14:43:55 talon NetworkManager: <info> dan, found AP 'BELL669' ret=0

Jan 22 14:44:15 talon NetworkManager: <info> dan, wlan0 iw scan issued
Jan 22 14:44:17 talon NetworkManager: <info> dan, helper buf:8 SIOCGIWSCAN
Jan 22 14:44:25 talon NetworkManager: <info> dan, schedule_scan(0x80a0110,120000ms)

Jan 22 14:46:25 talon NetworkManager: <info> dan, wlan0 iw scan issued
Jan 22 14:46:27 talon NetworkManager: <info> dan, helper buf:8 SIOCGIWSCAN
Jan 22 14:46:35 talon NetworkManager: <info> dan, schedule_scan(0x80a0110,120000ms)
Jan 22 14:48:35 talon NetworkManager: <info> dan, wlan0 iw scan issued
Jan 22 14:48:37 talon NetworkManager: <info> dan, helper buf:8 SIOCGIWSCAN
Jan 22 14:48:45 talon NetworkManager: <info> dan, schedule_scan(0x80a0110,120000ms)
Jan 22 14:48:45 talon NetworkManager: <info> dan, process_scan_results(0x80a0110,0x80a1800,422) WE version 22
Jan 22 14:48:45 talon NetworkManager: <info> dan, found AP 'BELL479' ret=1
Jan 22 14:48:45 talon NetworkManager: <info> dan, found AP 'BELL669' ret=0

So scans are triggered via iw_set_ext() and we setup to collect the results
in 10 seconds with schedule_scan_results_timeout(self, 10).

2 seconds later, wireless_event_helper() is called with SIOCGIWSCAN
indicating results are ready. But the code there doesn't respond
right away, clamining that this might be the first of a burst of
such messages. So instead we continue to wait the full 10 seconds.

At the 10 second mark, results are fetched using iw_get_ext() and the
next scan is scheduled. Results (if any) are then processed.
In the log above you can see that some scans had no results, others
had just 1 or two. None had darknet.

Curious about how iw_set_ext() and iw_get_ext() worked, I dug a
little and found this interesting bit in net/mac80211/ieee80211_sta.c
function ieee80211_sta_scan_result():

if (time_after(jiffies, bss->last_update + IEEE80211_SCAN_RESULT_EXPIRE))
    return current_ev;

Note that IEEE80211_SCAN_RESULT_EXPIRE is defined to be (10 * HZ).

It seems that ieee80211_sta_scan_result() won't put a given AP
into the scan result list if it was last seen over 10 seconds ago.

I changed the code in wireless_event_helper() to always call
schedule_scan_results_timeout(self, 2) upon receipt of SIOCGIWSCAN
and I've been running for a few days now without ever being stuck
in the NM disconnected state. I've also found this connects much
faster (reasonable since I dropped a 6 second wait time) and
surprisingly I found this also improved waking from suspend, which
has been reliable since I made this change.

I'd be interested to hear what Dan Williams and others
think of this analysis. Is NM waiting too long to collect scan
results?

cheers,
Dan

Revision history for this message
Alexander Sack (asac) wrote :

importance high as it prevents to see scan results for some.

Changed in network-manager (Ubuntu):
importance: Undecided → High
milestone: none → ubuntu-8.04.3
status: New → Triaged
Revision history for this message
Alexander Sack (asac) wrote :

patch from http://mail.gnome.org/archives/networkmanager-list/2009-February/msg00074.html

--- src/nm-device-802-11-wireless.c 2008-03-05 09:50:55 +0000
+++ src/nm-device-802-11-wireless.c 2009-01-27 23:49:47 +0000
@@ -2110,7 +2110,7 @@
    nm_warning ("(%s): could not trigger wireless scan", iface);
    goto reschedule;
   }
- schedule_scan_results_timeout (self, 10);
+ schedule_scan_results_timeout (self, 8);
  } else {
   NMSock *sk = NULL;
   struct iwreq wrq;
@@ -2130,7 +2130,7 @@
    nm_warning ("(%s): could not trigger wireless scan: %s", iface, strerror (errno));
    goto reschedule;
   }
- schedule_scan_results_timeout (self, 10);
+ schedule_scan_results_timeout (self, 8);
  }
  return FALSE;

Changed in network-manager (Ubuntu):
assignee: nobody → Alexander Sack (asac)
status: Triaged → In Progress
Changed in network-manager (Ubuntu Hardy):
assignee: nobody → Alexander Sack (asac)
importance: Undecided → High
milestone: none → ubuntu-8.04.3
status: New → In Progress
Revision history for this message
Alexander Sack (asac) wrote :

hardy only bug.

Changed in network-manager (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Alexander Sack (asac) wrote :

bzr commit -m '* fix LP: #361602 - network manager 0.6 should use 9 or 8 seconds as
  a scan timeout to workaround driver bugs; applying patch from
  Dan Langlois <dan at clockwerk dot net> that reduces scan result
  timeout to 8 seconds
  - add debian/patches/50_lp361602_reduce_scan_timeout.patch
  - update debian/patches/series' --fixes 'lp:361602'
Committing to: bzr+ssh://<email address hidden>/~ubuntu-core-dev/network-manager/ubuntu.0.6.x.hardy/
modified debian/changelog
added debian/patches/50_lp361602_reduce_scan_timeout.patch
modified debian/patches/series
Committed revision 123.

Changed in network-manager (Ubuntu Hardy):
status: In Progress → Fix Committed
Revision history for this message
Martin Pitt (pitti) wrote :

I rejected the hardy-proposed upload because it says

  * fix LP: #203016 - Spelling Mistake when shutting down/restarting
    - add debian/patches/43_fix_termination_spelling_error.patch
    - update debian/patches/series

This doesn't meet the SRU criteria, but it's harmless enough, so I'd wave it through. However, that's not the correct bug. Please refer to the correct bug number and reupload (same version number).

Revision history for this message
Alexander Sack (asac) wrote :

oops. good catch Martin. I committed that long ago as a ride-along for an eventual SRU. the bzr merge actually had the right bug id; must have been a copy paste bug :).

Reuploaded.

Revision history for this message
Martin Pitt (pitti) wrote :

Accepted into hardy-proposed; please test and give feedback here. Please see https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

tags: added: verification-needed
Revision history for this message
Dan Langlois (dan-clockwerk) wrote :

I've verified this fix.

I definitely see the problem with 0.6.6-0ubuntu5.8.04.1, but after upgrading to the proposed 0.6.6-0ubuntu5.8.04.2 I didn't hit it. It's a bit tricky to know for sure that the problem is resolved, given the nature of the bug, but I can vouch that NM is still working (so the patch certainly hasn't broken it).

This fix gets my thumbs up.

Martin Pitt (pitti)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Martin Pitt (pitti) wrote :

Can someone else who is affected please test this as well? One piece of feedback doesn't sound enough for this rather "delicate" (to avoid saying "brittle") workaround.

tags: added: verification-needed
removed: verification-done
Revision history for this message
Dan Langlois (dan-clockwerk) wrote :

Just to recap:
If you're affected by this bug, then you'll find that your desired AP doesn't show up in the list of available networks. You'll find yourself wanting a refresh/rescan button (because you know it's there). After several minutes of waiting patiently, you're AP *may* finally appear and you'll connect. With this patch, your AP should always appear and connect immediately.

I tried to suggest a low-impact patch. I can't see how this patch could break anything, in fact, it fixes something. I discussed this on the NM list to see if anyone could find any fault with it, no red flags. I agree it should be verified by a few more people just to be sure, but I'd hardly call it delicate.

Revision history for this message
Alexander Sack (asac) wrote : Re: [Bug 361602] Re: network manager 0.6 should use 9 or 8 seconds as a scan timeout to workaround driver bugs

On Tue, Apr 28, 2009 at 02:37:55PM -0000, Dan Langlois wrote:
> Just to recap:
> If you're affected by this bug, then you'll find that your desired AP doesn't show up in the list of available networks. You'll find yourself wanting a refresh/rescan button (because you know it's there). After several minutes of waiting patiently, you're AP *may* finally appear and you'll connect. With this patch, your AP should always appear and connect immediately.
>
> I tried to suggest a low-impact patch. I can't see how this patch could
> break anything, in fact, it fixes something. I discussed this on the NM
> list to see if anyone could find any fault with it, no red flags. I
> agree it should be verified by a few more people just to be sure, but
> I'd hardly call it delicate.
>

The only regression potential i see here is for drivers that need
between 8-10 seconds to get their results collected. Those that need
that much time are probably flaky anyway and I don't think we have
considerable regression potential here.

 - Alexander

Revision history for this message
Martin Pitt (pitti) wrote :

Comment 8 verified that the update works and fixes this bug, and this has sat in -proposed for three months, so I'd think regressions should have been spotted by now. Steve Beattie tested the updated network-manager in bug 214860, so I consider this sufficiently tested now.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package network-manager - 0.6.6-0ubuntu5.8.04.2

---------------
network-manager (0.6.6-0ubuntu5.8.04.2) hardy-proposed; urgency=low

  [ Andy Matteson <email address hidden> ]
  * fix LP: #214860 - Spelling Mistake when shutting down/restarting
    - add debian/patches/43_fix_termination_spelling_error.patch
    - update debian/patches/series

  [ Alexander Sack <email address hidden> ]
  * fix LP: #361602 - network manager 0.6 should use 9 or 8 seconds as
    a scan timeout to workaround driver bugs; applying patch from
    Dan Langlois <dan at clockwerk dot net> that reduces scan result
    timeout to 8 seconds
    - add debian/patches/50_lp361602_reduce_scan_timeout.patch
    - update debian/patches/series

 -- Alexander Sack <email address hidden> Wed, 15 Apr 2009 12:15:45 +0200

Changed in network-manager (Ubuntu Hardy):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.