Make jenkins-slave more resilient, ship out systemd service to retry

Bug #1847939 reported by Haw Loeung
This bug affects 1 person
Affects: Jenkins CI Agent Charm
Status: Fix Released
Importance: Low
Assigned to: Haw Loeung

Bug Description

Hi,

When migrating neutron routers between hosts/neutron-gateways, jenkins-slave dies as follows:

| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: INFO: Setting up agent: jenkins-slave-xenial-12
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: WARNING: No Working Directory. Using the legacy JAR Cache location: /var/lib/jenkins/.jenkins/cache/jars
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: INFO: Locating server among [https://jenkins.ols.canonical.com/online-services/, http://jenkins-be.internal:8080/online-services/]
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: Oct 13, 2019 11:30:26 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: Oct 13, 2019 11:30:26 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: Oct 13, 2019 11:30:26 PM org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: INFO: [JNLP4-connect connection to 10.25.200.124/10.25.200.124:48484] Local headers refused by remote: jenkins-slave-xenial-12 is already connected to this master. Rejecting this connection.
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: jenkins-slave-xenial-12 is already connected to this master. Rejecting this connection.
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: jenkins-slave-xenial-12 is already connected to this master. Rejecting this connection.
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:378)
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:433)
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:48)
| Oct 13 23:30:26 juju-manual-jenkaas-4 bash[31920]: #011at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:247)

We should have the service script retry when this happens, rather than leaving jenkins-slave dead.


Haw Loeung (hloeung)
Changed in jenkins-slave-charm:
assignee: nobody → Haw Loeung (hloeung)
description: updated
Haw Loeung (hloeung)
Changed in jenkins-slave-charm:
status: New → In Progress
Junien F (axino) wrote :

In bionic deploys, there's already a systemd unit file.
It has:
Restart=on-failure

I think that means an instant restart. What I've seen is that right after the error above, jenkins-slave tries to start a bunch of times, fails each time, and reaches the "max-restart" state.

I think we just need to add a pause between restarts in the systemd unit file.
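
For illustration, a pause between restarts could be added with a drop-in like the sketch below. This is not the change that was actually committed to the charm. The unit name (jenkins-slave.service), the drop-in path, and the 30 second delay are all assumptions. RestartSec= is the systemd directive that sets the delay before a restart; it defaults to 100ms, which is why restarts look instant.

```ini
# Hypothetical drop-in: /etc/systemd/system/jenkins-slave.service.d/restart.conf
# The unit name and path are assumptions, not taken from the charm.
[Service]
# Already present in the bionic unit file, repeated here for context.
Restart=on-failure
# Wait before each restart attempt. 30 seconds is an illustrative value.
RestartSec=30
```

With systemd's default start rate limit (5 starts within 10 seconds), a delay this long should keep the unit from hitting the limit, since restarts are spaced further apart than the limit window. After adding or changing a drop-in, run `systemctl daemon-reload` so systemd picks it up.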

Haw Loeung (hloeung)
Changed in jenkins-slave-charm:
importance: Undecided → Low
Haw Loeung (hloeung)
Changed in jenkins-slave-charm:
status: In Progress → Fix Committed
Haw Loeung (hloeung)
Changed in jenkins-slave-charm:
status: Fix Committed → Fix Released