SSH connection killed by eth_speed script [natty]

Bug #730210 reported by Mathieu Bérard
This bug affects 2 people
Affects            Status        Importance  Assigned to       Milestone
powernap (Ubuntu)  Fix Released  High        Andres Rodriguez

Bug Description

Binary package hint: powernap

On my machine, changing the Ethernet link speed (as done by the /etc/pm/power.d/eth_speed script) seems to drop TCP connections.
Among other things, this causes remote SSH sessions to be lost immediately.

This is extra nasty if, as I did, you put:
[TCPMonitor]
ssh = 22
in /etc/powernap/config

Then the following events occur:
* a user opens a remote SSH session
* powernap reacts by running the action.d scripts
* the eth_speed script changes the Ethernet speed setting
* the SSH connection is lost, and thus opening the remote session fails.

The workaround is to attempt a reconnect very quickly, before another power transition occurs.
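The reconnect workaround can be automated with a small retry loop. The following is a minimal Python sketch, not part of PowerNap: the function name wait_for_port, the timeout values, and the host name in the usage comment are all illustrative assumptions.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Retry a TCP connect until it succeeds or `timeout` seconds elapse.

    Returns True as soon as one connection attempt succeeds, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Each attempt waits at most `interval` seconds for the connect.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or timed out: pause, then try again.
            time.sleep(interval)
    return False

# Hypothetical usage (host name is an assumption):
# if wait_for_port("server.example", 22):
#     print("port 22 reachable again, retry ssh now")
```

The default timeout of 30 seconds is only a guess; it should be longer than the outage seen during the speed transition.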

The NIC is a standard RTL8111/8168B

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Mathieu,

Thank you for taking the time to report bugs and trying to make PowerNap better.

Now, I don't quite understand your scenario, so let me get this straight.

1. You configured PowerNap to listen to a TCP Monitor.
2. You started PowerNap, it became idle, and after 300 seconds of inactivity it performed the action.
3. The action is *powersave*, which runs all the scripts in /etc/pm/power.d/ and /usr/lib/pm-utils/power.d/ to reduce power consumption while the machine is running.
4. You SSH into the machine and PowerNap takes the recover action by running the scripts again, *but* your SSH session fails?

Please clarify your scenario. Please also enable DEBUG=3 in the config, try to reproduce the bug, and attach the following:
  - /etc/powernap/config
  - /var/log/powernap.log
  - /var/log/powernap.err

Now, if you configure a TCP monitor for port 22 and you SSH into that machine *before* PowerNap puts the machine into PowerSave mode (ACTION_METHOD=0), then the machine will *never* enter PowerSave mode until the SSH session ends and the machine is detected as inactive for the configured time.

Now, if you have the TCPMonitor configured for ssh and the machine *is* in PowerSave mode, then whenever you SSH in, the machine should take the recover action by running the scripts, and you should be able to continue using your machine without any issue, even over SSH.

I've tested this again (and it has been tested on various machines previously):

1. Configure PowerNap for TCPMonitor.
2. Wait until PowerNap enters PowerSave mode.
3. SSH into the machine.
4. PowerNap takes the recover action after detecting the SSH connection.
5. The SSH connection continues as normal (without the issues you mention above).

Now, try to reproduce the issue again and run 'watch ethtool eth0' in a different terminal and observe the behavior.

When PowerNap enters into PowerSave mode (2), you should see something like:

 Supported ports: [ TP MII ]
 Supported link modes: 10baseT/Half 10baseT/Full
                         100baseT/Half 100baseT/Full
                         1000baseT/Half 1000baseT/Full
 Supports auto-negotiation: Yes
 Advertised link modes: Not reported

When you SSH into the machine after it is in PowerSave mode (3), you should see something like:

 Supported ports: [ TP MII ]
 Supported link modes: 10baseT/Half 10baseT/Full
                         100baseT/Half 100baseT/Full
                         1000baseT/Half 1000baseT/Full
 Supports auto-negotiation: Yes
 Advertised link modes: 10baseT/Half 10baseT/Full
                         100baseT/Half 100baseT/Full
                         1000baseT/Half 1000baseT/Full
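
The difference between the two outputs above is the "Advertised link modes" field. As a rough way to check it automatically, here is a minimal Python sketch that parses text in the format shown above. The function name advertised_modes and the sample strings are illustrative assumptions; real ethtool output contains more fields.

```python
def advertised_modes(ethtool_output):
    """Return the advertised link modes, or [] if they are 'Not reported'."""
    modes = []
    collecting = False
    for line in ethtool_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Advertised link modes:"):
            collecting = True
            stripped = stripped.split(":", 1)[1].strip()
        elif ":" in stripped:
            # Any other "Field: value" line ends the advertised-modes block.
            collecting = False
        if collecting:
            # Keep only tokens that look like link modes, e.g. 100baseT/Full.
            modes.extend(tok for tok in stripped.split() if "base" in tok)
    return modes

POWERSAVE = """\
 Supports auto-negotiation: Yes
 Advertised link modes: Not reported
"""

NORMAL = """\
 Supports auto-negotiation: Yes
 Advertised link modes: 10baseT/Half 10baseT/Full
                         100baseT/Half 100baseT/Full
                         1000baseT/Half 1000baseT/Full
"""

print(len(advertised_modes(POWERSAVE)))  # prints 0
print(len(advertised_modes(NORMAL)))     # prints 6
```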

I'm marking this bug as Incomplete until the requested information is provided!

Changed in powernap (Ubuntu):
status: New → Incomplete
Revision history for this message
Mathieu Bérard (mathieu-berard) wrote :

Hello Andres,

First, your description of my scenario is quite accurate:

1. PowerNap is configured to listen to a TCP Monitor (port 22).
2. PowerNap became idle and, after 300 seconds of inactivity, performed the action.
3. The action is *powersave*, which runs all the scripts in /etc/pm/power.d/ and /usr/lib/pm-utils/power.d/ to reduce power consumption while the machine is running, including setting the NIC from 1000Mb/s to 100Mb/s.
4. I attempt to SSH into the machine; PowerNap sees the connection on port 22 and takes the recover action by running the scripts again, including setting the NIC back from 100Mb/s to 1000Mb/s.
5. That speed transition makes my NIC drop packets for about 18 sec, causing the SSH session to fail.

As requested, you will find the following attached in pm-logs-config.tar.gz:
- /etc/powernap/config
- /var/log/powernap.log
- /var/log/powernap.err (empty)
- /var/log/pm-powersave.log*
The logs were taken with DEBUG=3 in the config.

I have captured the output of ethtool eth0 and it shows exactly what you describe, as intended.

As a further piece of information, here is the output of the ping command directed at the affected machine just when PowerNap sets the NIC from 100Mb/s to 1000Mb/s (triggered by an SSH login attempt):

PING perenold.nat.mberard.eu (192.168.0.1) 56(84) bytes of data.
64 bytes from perenold.nat.mberard.eu (192.168.0.1): icmp_req=1 ttl=64 time=0.287 ms
64 bytes from perenold.nat.mberard.eu (192.168.0.1): icmp_req=2 ttl=64 time=0.188 ms
64 bytes from perenold.nat.mberard.eu (192.168.0.1): icmp_req=3 ttl=64 time=0.302 ms
64 bytes from perenold.nat.mberard.eu (192.168.0.1): icmp_req=4 ttl=64 time=0.200 ms
64 bytes from perenold.nat.mberard.eu (192.168.0.1): icmp_req=22 ttl=64 time=0.379 ms
64 bytes from perenold.nat.mberard.eu (192.168.0.1): icmp_req=23 ttl=64 time=0.120 ms
64 bytes from perenold.nat.mberard.eu (192.168.0.1): icmp_req=24 ttl=64 time=0.109 ms
...
--- perenold.nat.mberard.eu ping statistics ---
34 packets transmitted, 17 received, 50% packet loss, time 33008ms
rtt min/avg/max/mdev = 0.072/0.181/0.383/0.104 ms

As you can see, 17 ping packets (icmp_req 5 through 21) are lost during the speed transition, a gap of roughly 18 seconds; during that time the SSH connection is lost.
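
The size of the outage can be read directly from the gap in icmp_req numbers. The following is a minimal Python sketch of that arithmetic; the function name find_gaps and the shortened sample lines are illustrative assumptions, and the sketch assumes ping's default interval of one packet per second.

```python
import re

def find_gaps(ping_output):
    """Return (first_missing, last_missing) pairs for gaps in icmp_req numbers."""
    seqs = [int(m) for m in re.findall(r"icmp_req=(\d+)", ping_output)]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

# Two replies around the gap, shortened from the ping output above.
sample = """\
64 bytes from 192.168.0.1: icmp_req=4 ttl=64 time=0.200 ms
64 bytes from 192.168.0.1: icmp_req=22 ttl=64 time=0.379 ms
"""

for first, last in find_gaps(sample):
    # prints: lost icmp_req 5..21 (17 packets)
    print(f"lost icmp_req {first}..{last} ({last - first + 1} packets)")
```

This agrees with the ping statistics: 34 packets transmitted and 17 received leaves 17 lost.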

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Mathieu,

Thanks for the info. I was confused by the scenario you described in the report, but I can now see that it was the same scenario I was describing :)

Now, I've been trying to reproduce this bug on other machines and I have been unsuccessful. The only thing that I can think of right now which might be causing the issue, according to [1], is:

"Note, however, that if the network device on the other side has auto-negotiation enabled (which is very common) and you turn auto-negotiation off, the other side will assume half-duplex mode and you will experience a significant loss of performance"

This means that if the router/switch you are connecting to has auto-negotiation enabled, and PowerNap disables it to change the speed, then the router/switch does something to adapt to the change in the machine, which might indeed cause the packet loss (router/switch cannot forward the packets till the NIC in the network device changes to half-duplex).
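
The behavior quoted from [1] can be written down as a simplified model: a side with auto-negotiation enabled that faces a peer with auto-negotiation disabled cannot learn the peer's duplex and falls back to half duplex. The Python sketch below is only an illustration of that rule under stated assumptions. The function name link_duplex is hypothetical, both sides are assumed to support full duplex, and whether this is what actually happens on the reporter's hardware is not established in this report.

```python
def link_duplex(local_autoneg, local_duplex, peer_autoneg, peer_duplex):
    """Return (local, peer) duplex after link-up under a simplified model.

    `local_duplex` / `peer_duplex` are the forced settings ("half" or "full")
    and are only used for a side whose auto-negotiation is off.
    """
    if local_autoneg and peer_autoneg:
        # Both sides negotiate; assumed here to agree on full duplex.
        return ("full", "full")
    # A negotiating side facing a forced peer falls back to half duplex;
    # a forced side simply uses its configured value.
    local = "half" if local_autoneg else local_duplex
    peer = "half" if peer_autoneg else peer_duplex
    return (local, peer)

# Machine forced to full duplex, switch still auto-negotiating:
print(link_duplex(False, "full", True, "full"))  # prints ('full', 'half')
```

In this model the two ends disagree ('full' versus 'half'), which is the duplex mismatch the quote warns about.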

From previous studies, I recall that such changes might result in unreachable devices because of a speed mismatch and the inability to establish a link (more info in [2]). I personally believe that this might in fact be the issue (router/switch configuration). For now, you can disable the script with:

sudo powernap-action --disable eth_speed

I will, however, keep this bug open and try to reproduce this issue again. Thanks again for the information. If you need anything else, please feel free to contact me.
Regards,

[1]: http://www.cyberciti.biz/tips/linux-ethernet-card-power-saving.html
[2]: http://www.cisco.com/en/US/products/hw/switches/ps700/products_tech_note09186a00800a7af0.shtml#why_do_auto

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Note that the end of the table in [2] says:

"A duplex mismatch can result in performance issues, intermittent connectivity, and loss of communication."

So yes, I firmly believe that this issue is caused by a duplex mismatch. I'll improve the documentation to make people aware of this.

Thank you again!

Changed in powernap (Ubuntu):
assignee: nobody → Andres Rodriguez (andreserl)
importance: Undecided → Low
Revision history for this message
Mathieu Bérard (mathieu-berard) wrote :

Hi Andres,
thank you for your investigations!

It was quite clear to me from the beginning that this problem may depend on a particular combination of NIC model and/or the switches in between.

However, I would definitely not characterize what I see as a loss of performance in powersave mode, as the machine behaves normally (no packet loss, normal ping times, etc.) both in normal mode (auto-neg advertised) and in powersave mode (auto-neg not advertised).

What I see is that *right when a transition occurs* (normal -> powersave or powersave -> normal), the NIC drops all packets for about 18 sec, just as if you had pulled the network plug for that amount of time. That of course disturbs established connections, apparently beyond what the recovery mechanisms in place (TCP retransmit, etc.) can cope with.

I will try to run some further tests and keep you informed if I find anything interesting. I will look particularly closely at the duplex settings.

Changed in powernap (Ubuntu):
importance: Low → High
status: Incomplete → Confirmed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package powernap - 2.15-0ubuntu1

---------------
powernap (2.15-0ubuntu1) oneiric; urgency=low

  * actions/eth_speed: Disable by default to not experience connectivity
    loss on certain network configs. (LP: #730210)
 -- Andres Rodriguez <email address hidden> Fri, 09 Sep 2011 15:29:31 -0400

Changed in powernap (Ubuntu):
status: Confirmed → Fix Released