Comment 3 for bug 1746605

Revision history for this message
Mike Gerdts (mgerdts) wrote : Re: stack trace when sdc:* not defined

I think I've sorted out what is going wrong here. I modified JoyentMetadataLcient in DataSourceSmartOS.py so that it retries when an exchange is corrupted.

On a particular boot, cloud-init logged:

Writing "V2 29 98686c19 c36fd530 GET c2RjOnJvdXRlcw==\n" to metadata transport.
Read "lV2 21 1f8b55fb c36fd530 SUCCESS W10=" from metadata transport.
      ^ Extra 'l' carachter
Writing "V2 29 e2c45a1f 9492baa0 GET c2RjOnJvdXRlcw==\n" to metadata transport.
Read "V2 21 c1a6af86 9492baa0 SUCCESS W10=" from metadata transport.
Value "[]" found.

Notice my annotation of the extra character in the input stream.

In the host I was tracing the reads and writes by the metadata server. I saw:

R 'V2 29 98686c19 c36fd530 GET c2RjOnJvdXRlcw==\n\n\0'
                                                ^ ^ two newlines
W 'invalid command\n'
   ^^^ this, perhaps with the exception of a single l, does not appear in guest
       cloud-init log.
W 'V2 21 1f8b55fb c36fd530 SUCCESS W10=\n'
R 'V2 29 e2c45a1f 9492baa0 GET c2RjOnJvdXRlcw==\n'
W 'V2 21 c1a6af86 9492baa0 SUCCESS W10=\n'

On a related note, mdata-get flushes the serial port before it does its work. That is, reads any pending data, it writes a newline, and then reads a line. It discards whatever was read, then sends the real request and reads the real reply. My test system is one that has typically done a bunch of configuration using mdata-get from rc.local and perhaps other places. I think I have pretty strong evidence that there's a race between cloud-init and these other bits that are also trying to use the metadata service.

The extra newline that is seen by the host would correspond to the flush performed by mdata-get. The fact that cloud-init sees an extra character (l) but not the other chacaraters in "invalid command" suggests that mdata-get and cloud-init are both reading from the serial port at the same time.

The first stack trace that I posted showed:

Read "aV2 21 0e6e7ec8 d133c055 SUCCESS W10=" from metadata transport.

Notice the extra 'a' that it read. Looking through my terminal history, I see this.

Read "nnV2 21 ff55207a 8bb85dc9 SUCCESS W10=" from metadata transport.

In each case where there are extra characters, those extra characters are some subset of "invalid command".

The second stack trace that I posted timed out because it never got a newline. That could also be explained by a rogue mdata-get reading a line from the serial port.

There are a several of approaches to dealing this:

1) If it hurts, don't do it. (close as not-a-bug)
2) Do some sort of locking on /dev/ttyS1 to prevent crosstalk
3) When a command fails, retry a time or two.

Option 2 requires that everyone that may open /dev/ttyS1 agree on the locking strategy and implement it correctly. mdata-get tries to take a lock, but clearly that's not working out so well.

# strace mdata-get sdc:hostname
...
open("/dev/ttyS1", O_RDWR|O_EXCL|O_NOCTTY) = 5
fcntl(5, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0

Linux seems to be working as designed. http://www.tldp.org/HOWTO/Serial-HOWTO-13.html

Since there's no way to keep other actors from opening the serial device that cloud-init is using, it would probably be beneficial to implement option 3.