incorrect time on node causes failed oauth

Bug #978127 reported by Scott Moser
This bug affects 3 people
Affects              Status        Importance  Assigned to  Milestone
MAAS                 Fix Released  Critical    Scott Moser
cloud-init           Fix Released  Medium      Scott Moser
cloud-init (Ubuntu)  Fix Released  High        Unassigned
  Precise            Fix Released  High        Unassigned

Bug Description

=== Begin SRU Information ===
[Impact]
 * For systems that have a broken or incorrectly set hardware clock,
   enlistment and commissioning into MAAS will fail. This is because
   Ubuntu's system clock is initially seeded by the hardware clock, and
   OAUTH is used for authentication with the MAAS server. If the client
   clock differs by more than 5 minutes from the server clock,
   authentication will fail, and subsequently enlistment or commissioning
   will fail.
   This is also a problem after installation of the system as the same
   process for authentication is used.
 * There is a need to backport this change in order to fully utilize 12.04
   and MAAS.
 * The change in cloud-init is essentially this:
   If a request for access to the MAAS metadata service returns 401 or
   403 (unauthorized), then subsequent re-tries will modify the
   timestamp in the OAUTH request so that it matches the server.
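
A minimal sketch of why a bad clock breaks authentication, assuming the server simply compares the timestamp in the request against its own clock with the 5-minute tolerance described above. The function name and constant are hypothetical and are not taken from the MAAS source.

```python
# Hypothetical sketch of a server-side OAuth timestamp check; not MAAS code.
MAX_SKEW_SECONDS = 5 * 60  # the 5-minute window described in the text


def timestamp_acceptable(client_ts, server_now, max_skew=MAX_SKEW_SECONDS):
    """Return True if the client's timestamp is within max_skew of the server clock."""
    return abs(server_now - client_ts) <= max_skew


server_now = 1355344595  # arbitrary example epoch value
print(timestamp_acceptable(server_now - 120, server_now))  # client 2 minutes behind
print(timestamp_acceptable(server_now - 600, server_now))  # client 10 minutes behind
```

With these inputs the first request would be accepted and the second rejected, which is the failure a node with a stale hardware clock runs into.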

[Test Case]
 * To recreate the bug, you first need to get MAAS set up
   (http://maas.ubuntu.com/docs), and start a system for enlistment that would
   have an invalid clock. To force an invalid clock, do one of:
   * boot to a system bios and change the bios clock
   * modify the ephemeral image so that the clock is broken during boot.
     This can be accomplished by appending the following to
     /etc/init/cloud-init.conf inside an ephemeral image.
     | pre-start script
     | offset="10 minutes ago"
     | past=$(date -R --date "$offset")
     | date --set "$past" &&
     | echo ===== "set date to $past [$offset]" ===== ||
     | echo ===== "failed to set date to $past [$offset]" ====
     | end script
   This is actually more complex than that, because the ephemeral images
   already have this fix inside of them. So in order to reproduce, you have
   to downgrade the version of cloud-init inside the 12.04 ephemeral image
   to the version available in the ubuntu archive (0.6.3-0ubuntu1.1)
 * After a sufficiently broken system is obtained, boot the system.
   If this fix is not present, enlistment or commissioning will fail
   to do anything as it will not have access to the metadata.
 * Errors will be written to the MAAS server's /var/log/apache/error.log
 * When the fix is applied, a single failure will occur, and then cloud-init
   will modify future requests.

[Regression Potential]
 * Regression is limited to the MAAS datasource, which is not enabled by
   default for cloud-init. Thus, only a user that is using MAAS or otherwise
   takes explicit action to enable it will be affected.

[Other Info]
 * This bug has essentially been fixed in maas enlistment and commissioning
   environments outside of the SRU process. The "ephemeral images" downloaded
   for MAAS have 12.10's version of cloud-init installed inside them.
   This all works reliably. We want to properly SRU the change so that
   installed systems will also be resilient to a bad hardware clock.

=== End SRU Information ===

=== original bug report ===
In this simple scenario:
 a. hardware installed
 b. hardware booted and enlisted
 c. commissioning
 d. install to hardware
 e. cloud-init boot

At this point steps 'b' and 'e' do OAUTH to get user-data.

If the clock on the system is sufficiently off, then oauth will fail as shown in the attached screenshot.

It seems to make sense that 'b' would set the clock. Once the user enlists the system to MAAS, it seems OK to start changing their hardware settings.

There is still a potential for really bad hardware clock that could forget its settings on reboot, or somehow get off between 'b' and 'c'. If we were really interested in fixing that, cloud-init could read a kernel command line parameter pointing to a system that ran an ntp server and just run that very early in boot to set the local date.

Revision history for this message
Scott Moser (smoser) wrote :
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Urgh this is nasty. We'd need to provide an NTP service on the network, right? Otherwise, can MAAS send an on the spot timestamp to the node to do a rough-and-ready initialisation of its clock? Won't be accurate but will be good enough to get past the oauth nonce check.

Changed in maas:
status: Confirmed → Triaged
importance: Undecided → High
tags: added: api provisioning
Revision history for this message
Scott Moser (smoser) wrote :

so we just hit this, and it sucked.
I worked around by, in the ephemeral image, modifying /etc/init/cloud-init.conf to append:
  pre-start script
  ntpdate -p 8 10.155.32.1
  hwclock -w
  end script

Revision history for this message
Jeff Anderson-Lee (jonah-2) wrote :

This is really problematic. New nodes do not come with synchronized clocks.
And client nodes may not be on a public facing network interface (at least not at boot time).

What about changing the configuration of the MAAS master to automatically run a time service
and have the client nodes set their clocks from it (+1 for #2)?

Failing that, allow the tickets or whatever timestamped credential is being passed to be valid
for +/-24 hours? I like this less than having the clock set accurately at boot, since if you got a
batch of 100+ nodes with ill-set clocks, it would be *way* too much manual labor to set it in the bios.

This bug nearly had me willing to walk away from MAAS as DOA.

Revision history for this message
Jeff Anderson-Lee (jonah-2) wrote :

Thanks for the work around Scott. I installed ntp on the MAAS-server and tweaked the ephemeral image to use ntpdate from it:

apt-get -q -y install ntpdate
ntpdate -p 8 SOME.PUBLIC.SERVER
apt-get -q -y install ntp ntp-doc

IFACE=eth1
IPADDR=$(ifconfig "$IFACE" | grep 'inet addr' | sed -e 's/.*addr://' -e 's/ .*//')
export IPADDR
# Then tweak the ephemeral image to set the clock
mount /var/lib/maas/ephemeral/precise/ephemeral/amd64/20120424/disk.img /mnt/
chroot /mnt
apt-get update
cat >> /etc/init/cloud-init.conf << END

pre-start script
ntpdate -p 8 ${IPADDR}
hwclock -w
end script
END
exit
umount /mnt

# after that, the nodes boot and configure!

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Jeff, we'll address this pretty soon. It's been a pain point for a lot of people.

Revision history for this message
Gui Maluf Balzana (guimalufb) wrote :

Thanks Jeff. Your solution works great!

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Via Scott:

I think a reasonable and SRU-able solution is below. Note, that in order
to deliver this, we have to deliver updated ephemeral images (which was
always expected, just pointing out that this fix comes in a ~600M
download).

The way this works right now is the following:
 A. ephemeral instance is booted with a 'url=' parameter on the kernel
   command line something like this:
      url=http://maasserver/cblr/svc/op/ks/system/node-XXXX
 B. as described at [1], cloud-init pulls that un-authed url, and stores it
   as local configuration. Currently the payload looks like this:
      #cloud-config
      datasource:
       MAAS:
         metadata_url: http://mass-host.localdomain/source
         consumer_key: Xh234sdkljf
         token_key: kjfhgb3n
         token_secret: 24uysdfx1w4
 C. cloud-init then continues on and uses that maas datasource as if it
    were locally configured to do so. It pulls user-data from
    the derivative url, and then executes it.
 D. The user-data provided is read from
    /etc/maas/commissioning-user-data [2]. cloud-init executes this code
    which makes api calls back to the configured maas server in 'B' to
    post commissioning status.

The issue that we see in bug 978127 is that the http requests done in 'C'
fail because of out of sync clock on the ephemeral node.

The solution that I suggest is:
 i.) modify 'B' above to include 'time_sync_url' field under 'MAAS'
 ii.) Before cloud-init does oauthed requests for user-data in 'C'
    above, it will first do an un-authed request to the value of
    'time_sync_url' which will return data like:
      Wed, 27 Jun 2012 10:13:29 -0400
 iii.) cloud-init will then set the system clock (not the hardware clock)
    to the given date. The subsequent oauth requests will succeed as
    they'll have a reasonable system clock at that point.
 iv.) if possible make cloud-init log the failure in 'C' above more
      obviously on the console. I believe this is less than
      straightforward unfortunately due to the console switching around
      that is done on boot.
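
Step ii above can be sketched as follows. This is only an illustration, assuming the 'time_sync_url' body is a single RFC 2822 date string like the example; `parse_time_sync` is a hypothetical name, and actually setting the system clock (step iii) needs root and is left out.

```python
from email.utils import parsedate_to_datetime


def parse_time_sync(body):
    """Convert an RFC 2822 date string into seconds since the epoch (UTC)."""
    return int(parsedate_to_datetime(body.strip()).timestamp())


# The sample value from the proposal; the -0400 offset is honoured,
# so this is 14:13:29 UTC.
server_epoch = parse_time_sync("Wed, 27 Jun 2012 10:13:29 -0400\n")
print(server_epoch)
```

Because the string carries its own UTC offset, the result does not depend on the node's local timezone setting.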

Note, I considered skipping the 'time_sync_url' by simply directly providing a
'time_sync' like:
  time_sync: Wed, 27 Jun 2012 10:13:29 -0400
If that is seen as desirable it could probably be accommodated. The thing
that I do not like about it is that it writes that data to a local config
file, and obviously the current time stamp very quickly becomes incorrect.
Hiding it behind a url that has dynamic and correct content removes that.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

I see a couple of potential problems with the first approach:

 * There may not be a time-sync url (ntp service) available
 * Unless you're talking about adding an API call to MAAS? This obviously adds more complexity to the SRU.

The key thing is that the node clocks need to be "near enough" to the MAAS server's clock for oauth to work. We could just put the time right in the config payload but I guess that depends on whether having a more accurate clock time available for other purposes is desirable or not? If not, we can just make the config payload item be called "oauth_time_sync" instead so it's a bit more obvious what it's used for (and would discourage other use).

Revision history for this message
Scott Moser (smoser) wrote :

I was suggesting the "api" call to maas. That makes sense to me, as the maas clock is actually the "right" clock (if maas server was out of date, syncing against anything else is just going to cause other confusion).

I don't like putting the 'oauth_time_sync' in the config payload as that is stored locally in cloud-init as regular config. In the ephemeral image, every boot is "first boot" (as there is no way of storing an already-booted state). Having a hard-coded static time in a config file seems broken. I'd rather it point to a reference where a dynamic time can be obtained.

The api call in maas can be as simple as:
 print time.time()
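
For illustration only, the one-liner above could be wrapped in a tiny un-authenticated WSGI endpoint. This is a Python 3 sketch under assumptions: `time_app` is a hypothetical name and this is not the API that MAAS ships.

```python
import time


def time_app(environ, start_response):
    """Hypothetical un-authenticated endpoint returning the server's epoch time."""
    body = ("%d\n" % time.time()).encode("ascii")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


# Exercise the app without a web server by passing a stub start_response.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status


body = b"".join(time_app({}, fake_start_response))
print(captured["status"], body.decode().strip())
```

Serving the value dynamically avoids the stale-timestamp problem of writing a fixed time into a config file.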

If this is the right solution, then it's the right path to SRU. Cobbling together a solution that is different from what is done in trunk so it is more appropriate for SRU doesn't seem like a good idea. We should focus first on fixing trunk the right way.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

I have no intention of doing a different fix in trunk vs 1.0, don't worry. This discussion is about the pros and cons of each approach. I think we've settled on a reasonable one now, so I'll get this API call in place ready for when the change is added to cloud-init.

Revision history for this message
Scott Moser (smoser) wrote :

I did some more thinking on this, and I think the easiest solution is to set the hardware clock during enlistment.
The good thing about this is that we're already executing arbitrary code (via early_command) during enlistment and preseed for enlistment is not oauth protected.

There is still the possibility for failure in one of the following cases:
 a.) really bad hardware clock or dead bios battery. Ie, the clock loses minutes in a day or completely loses time across reboot/power cycle. This means that between enlistment and commissioning, the system could get a bad clock and fail oauth for commissioning.
 b.) virtual machine. I think that setting the hwclock is basically a no-op in a virtual machine, and is not likely persistent across shutdown/startup. For virtual machines, the host clock is just going to have to be relied upon across shutdown.

Note, however, that the suggestion given above (setting prior to use of the api during commissioning) could also fail in either of the above cases between commissioning and installation. The only complete fix would then be to have each stage somehow be provided via unauthed access a good timestamp and set the system clock via that.

At the very least, I think we should take this path now, as it should be really easy and would generally cut out issues with incorrectly set "real hardware" clocks.

Revision history for this message
Andrew Glen-Young (aglenyoung) wrote :

The HTTP/1.1 RFC dictates that an origin server MUST include a valid Date field in its responses [1]. Could this field not be used?

Example HTTP/1.1 request/response (truncated for brevity):

    GET /maas/+bug/978127 HTTP/1.1
    Host: bugs.launchpad.net

    HTTP/1.1 200 Ok
    Date: Tue, 31 Jul 2012 15:48:51 GMT

[1]: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.18
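
A rough sketch of reading that field from a response's header block, using only the Python standard library. `server_time_from_headers` is a hypothetical helper, and the header text is the example from above.

```python
from email.parser import Parser
from email.utils import parsedate_to_datetime

RAW_HEADERS = "Date: Tue, 31 Jul 2012 15:48:51 GMT\n\n"


def server_time_from_headers(header_text):
    """Return the Date header as epoch seconds, or None if it is absent."""
    date_value = Parser().parsestr(header_text, headersonly=True).get("Date")
    if date_value is None:
        return None
    return int(parsedate_to_datetime(date_value).timestamp())


print(server_time_from_headers(RAW_HEADERS))
```

The appeal of this approach is that it needs no new server-side API, since the Date header is already present on error responses as well as successful ones.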

dann frazier (dannf)
Changed in lomond:
status: New → Triaged
Revision history for this message
Robie Basak (racb) wrote :
Robie Basak (racb)
tags: added: arm
Scott Moser (smoser)
Changed in cloud-init:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Scott Moser (smoser)
Revision history for this message
Scott Moser (smoser) wrote :

I just committed a fix to trunk that I believe should address this.
It's based on the suggestion in comment 13.

Basically, if we fail by 403 unauthorized, we set a "skew" (skew = local_current_time - server_response_time) and future requests will have the oauth header time adjusted by that skew.
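
The idea can be sketched roughly as below. This is not the cloud-init code: `SkewedClock` and its methods are hypothetical, the 401 case is included because the SRU description mentions "401 or 403", and the sign convention (server time minus local time, so that adding the skew to the local clock approximates the server's) is a choice made for this sketch.

```python
import time


class SkewedClock:
    """Hypothetical helper tracking the offset between server and local time."""

    def __init__(self):
        self.skew = 0  # seconds to add to local time to approximate server time

    def note_failure(self, status, server_epoch, local_epoch=None):
        """Record the skew after an authorization failure (401 or 403)."""
        if status not in (401, 403):
            return False
        if local_epoch is None:
            local_epoch = time.time()
        self.skew = int(server_epoch - local_epoch)
        return True

    def oauth_timestamp(self, local_epoch=None):
        """Timestamp to place in the next OAuth header."""
        if local_epoch is None:
            local_epoch = time.time()
        return int(local_epoch + self.skew)


clock = SkewedClock()
# Example values: the server's Date header is 602 seconds ahead of the node.
clock.note_failure(401, server_epoch=1355344595, local_epoch=1355343993)
print(clock.skew)                                     # 602
print(clock.oauth_timestamp(local_epoch=1355343994))  # 1355344596
```

Only the timestamp placed in the request is adjusted; the system clock itself is left alone, so one failed request is expected before the retries succeed.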

Scott Moser (smoser)
Changed in cloud-init (Ubuntu):
status: New → Triaged
importance: Undecided → High
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.0~bzr677-0ubuntu1

---------------
cloud-init (0.7.0~bzr677-0ubuntu1) quantal; urgency=low

  * add CloudStack to DataSources listed by dpkg-reconfigure (LP: #1002155)
  * New upstream snapshot.
    * 0440 permissions on /etc/sudoers.d files rather than 0644
    * get host ssh keys to the console (LP: #1055688)
    * MAAS DataSource adjust timestamp in oauth header to one based on the
      timestamp in the response of a 403. This accounts for a bad local
      clock. (LP: #978127)
    * re-start the salt daemon rather than start to ensure config changes
      are taken.
    * allow for python unicode types in yaml that is loaded.
    * cleanup in how config modules get at users and groups.
 -- Scott Moser <email address hidden> Sun, 30 Sep 2012 14:29:04 -0400

Changed in cloud-init (Ubuntu):
status: Triaged → Fix Released
Scott Moser (smoser)
Changed in cloud-init:
status: In Progress → Fix Released
mahmoh (mahmoh)
Changed in lomond:
status: Triaged → Fix Released
Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
Scott Moser (smoser)
Changed in cloud-init (Ubuntu Precise):
status: New → Triaged
importance: Undecided → High
Revision history for this message
Robie Basak (racb) wrote :

This seems to be fixed in cloud-init based on the latest daily ephemeral image. However, it seems that maas-signal (embedded in /etc/maas/commissioning-user-data) also has the same problem. Bumping this to Critical, as it means that it is very difficult for highbank users to deploy at all, since they don't necessarily know how to set their hardware clock.

Changed in maas:
status: Fix Released → Triaged
importance: High → Critical
assignee: nobody → Scott Moser (smoser)
Changed in maas:
status: Triaged → In Progress
tags: added: missing-in-quantal
no longer affects: maas (Ubuntu)
no longer affects: maas (Ubuntu Precise)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Ryan Finnie (fo0bar) wrote :

I just encountered this in precise MAAS, on brand new metal nodes which had dates sometime in 2011. Adding the NTP hack to the ephemeral image helped get around it. (Thanks for the recipe, Jeff! FYI, a public NTP server can work just as well, ntp.ubuntu.com in my case.)

Scott Moser: "Basically, if we fail by 403 unauthorized, we set a "skew" (skew = local_current_time - server_response_time) and future requests will have the oauth header time adjusted by that skew."

Here's the problem though. From what I saw, the MAAS server was returning 401 because of the skewed oauth, not 403. I can provide tcpdump captures if desired.

Revision history for this message
Scott Moser (smoser) wrote :

There are 2 pieces to this fix, both of which are fixed in 12.10.
a.) cloud-init portion. This fix is in the latest set of precise "released" ephemeral images. They include cloud-init from 12.10. It accounts for 401 or 403 return.
b.) "maas-signal" portion. maas-signal is used during commissioning, and it reports commissioning results back. maas-signal is part of maas (/etc/maas/commissioning-user-data). This was fixed in revision 1242 [1].

The behavior that I would expect in 12.04 with broken hardware clocks at this point is for enlistment to work, commissioning to get user-data, but then to fail to report home as done. I think for 12.04 I would recommend the modification of the ephemeral image to include the running of 'ntpdate' as described in comment 5.

--
[1] http://bazaar.launchpad.net/~maas-maintainers/maas/trunk/revision/1242

Scott Moser (smoser)
tags: removed: missing-in-quantal
Scott Moser (smoser)
description: updated
Revision history for this message
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Scott, or anyone else affected,

Accepted cloud-init into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.6.3-0ubuntu1.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Precise):
status: Triaged → Fix Committed
tags: added: verification-needed
Revision history for this message
Scott Moser (smoser) wrote :

I've verified this by setting up maas, then:

bzr branch lp:~smoser/+junk/backdoor-image
cd backdoor-image
img=/var/lib/maas/ephemeral/precise/ephemeral/amd64/20121008/disk.img
sudo ./mount-callback-umount --system-resolvconf -v $img -- \
  sh -c 'chroot "$1" /bin/sh' -- < ~/bin/modify-image

sudo ./backdoor-image --password=ubuntu $

where modify-image is attached patch.

Then started a machine (kvm) and watched it boot. It successfully enlisted. This is actually normal.
Then, accept the node, and boot again

The node successfully commissions, and shows:
2012-12-12 20:36:35,682 - util.py[WARNING]: 'http://10.55.60.142/MAAS/metadata//2012-03-01/meta-data/instance-id' failed [0/120s]: http error [401]
2012-12-12 20:36:35,690 - DataSourceMAAS.py[WARNING]: set oauth clockskew to 602

in the /var/log/cloud-init.log, indicating that the fix was applied.
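
If it helps when repeating this verification, a small script along these lines could pull the recorded skew out of the log. It is a sketch that assumes the message format shown in the two lines above; `find_clockskews` is a hypothetical name.

```python
import re

LOG_LINES = """\
2012-12-12 20:36:35,682 - util.py[WARNING]: 'http://10.55.60.142/MAAS/metadata//2012-03-01/meta-data/instance-id' failed [0/120s]: http error [401]
2012-12-12 20:36:35,690 - DataSourceMAAS.py[WARNING]: set oauth clockskew to 602
"""

SKEW_RE = re.compile(r"set oauth clockskew to (-?\d+)")


def find_clockskews(log_text):
    """Return every clock skew value (in seconds) reported in the log text."""
    return [int(match.group(1)) for match in SKEW_RE.finditer(log_text)]


print(find_clockskews(LOG_LINES))  # [602]
```

A value of about 600 seconds is what one would expect if the node's clock had been set back with the "10 minutes ago" offset from the test case.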

tags: added: verification-done
removed: verification-needed
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Hello Scott, or anyone else affected,

Accepted cloud-init into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.6.3-0ubuntu1.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-done
tags: added: verification-needed
Scott Moser (smoser)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Colin Watson (cjwatson) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.6.3-0ubuntu1.3

---------------
cloud-init (0.6.3-0ubuntu1.3) precise-proposed; urgency=low

  * debian/patches/lp-1070345-landscape-restart-after-change.patch,
    debian/patches/lp-1066115-landscape-install-fix-perms.patch:
    fix missing or incorrect imports (LP: #1070345, LP: #1066115).

cloud-init (0.6.3-0ubuntu1.2) precise-proposed; urgency=low

  * debian/patches/lp-978127-maas-oauth-fix-bad-clock.patch: fix usage of
    oauth in maas data source if local system has a bad clock (LP: #978127)
  * debian/cloud-init.preinst: fix bug where user data scripts re-ran on
    upgrade from 10.04 versions (LP: #1049146)
  * debian/patches/lp-974509-detect-dns-server-redirection.patch: detect dns
    server redirection and disable searching dns for a mirror named
    'ubuntu-mirror' (LP: #974509)
  * debian/patches/lp-1018554-shutdown-message-to-console.patch: write a
    message to the console on system shutdown. (LP: #1018554)
  * debian/patches/lp-1066115-landscape-install-fix-perms.patch: install
    landscape package if needed which will ensure proper permissions on config
    file (LP: #1066115).
  * debian/patches/lp-1070345-landscape-restart-after-change.patch: restart
    landscape after modifying config (LP: #1070345)
  * debian/patches/lp-1073077-zsh-workaround-for-locale_warn.patch: avoid
    warning when user's shell is zsh (LP: #1073077)
  * debian/patches/rework-mirror-selection.patch: improve mirror selection by:
    * allowing region/availability-zone to be part of mirror (LP: #1037727)
    * making mirror selection arch aware (LP: #1028501)
    * allow specification of a security mirror (LP: #1006963)
 -- Scott Moser <email address hidden> Thu, 13 Dec 2012 12:16:56 -0500

Changed in cloud-init (Ubuntu Precise):
status: Fix Committed → Fix Released
Revision history for this message
James Falcon (falcojr) wrote :