On 10.02.12 12:38, Stewart Smith wrote:
> I have a number of thoughts:
> - mostly with my xtrabackup-also-in-drizzle hat on, i'd prefer to see
> innodb OS abstraction file IO routines used instead of mysys. (also from
> a general dislike of mysys being some weird third thing and not posix or
> win32 io). With the datasink API this makes it a bit easier to just use
> native IO calls... have a POSIX local sink and a Win32 local sink.

As discussed on IRC, I would love to use the InnoDB OS abstraction routines if they allowed writing to a specified file descriptor rather than a path name to begin with. That was essential for my purposes, though I did ponder the Drizzle port and the idea of avoiding dependencies on mysys. But I also needed other portability wrappers from mysys, so I would be creating that dependency anyway. And not everything from mysys has alternatives on CCAN, for example. It is a big problem, and I don't have a solution for it yet. The idea of POSIX- and Windows-specific local datasinks is good, but in general we'll most likely end up creating our own portability library. Hopefully with some help from the Drizzle community ;)

> - CRC32C instead of adler for XBSTREAM chunk checksum lets us use CRC32
> instructions on Intel CPUs to compute checksums really fast.

The server tree doesn't have a bundled CRC-32C implementation, and I couldn't find a suitable one after 5 minutes of searching the web. So I replaced Adler-32 with crc32(). As I understand, there's also an x86 instruction to compute plain CRC-32, so we can optimize that later.

> - I think we should have a flags for XBSTREAM chunk. e.g. if unknown
> header a flag can be set to say if it's ignorable or not. This would
> allow us to add things like informational chunks in the future without
> breaking backwards compatibility.

Done. With XB_STREAM_FLAG_IGNORABLE set, unknown chunk types and their payloads are ignored, but checksums are still verified.
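On the checksum point above: for reference, a minimal software CRC-32C (Castagnoli polynomial, reflected form 0x82F63B78) is small enough to bundle, and could later be swapped for a table-driven or SSE4.2 hardware version. This is just an illustrative sketch, not the actual xbstream code:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow but tiny; a table-driven or SSE4.2 (_mm_crc32_u8) version
 * would be the optimized drop-in replacement. The standard check
 * value for the 9-byte string "123456789" is 0xE3069283. */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
    }
    return ~crc;
}
```

A hardware version would keep the same interface, so the stream format would not need to change.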
Note that there's also a format version field "embedded" into the chunk header magic.

> - for xb_rstream_chunk_t, if data member is at the end, we can use the
> zero length array trick for reading it. Even if we don't use zero length
> array, it makes things a little nicer. I think this structure should
> also more mirror what's on disk (pathlength first).

The zero-length array trick would work if the payload were the only variable-length part of the chunk. But that's not the case: the path is also variable-length, so the header itself is variable-length too. I did, however, move path_len before the path to match the stream format. I also changed the format so that the checksum goes before the payload, to minimize write() calls.

> - I wonder if we should have a incrementing number in the XBSTREAM chunk
> header. for disaster recovery, this may let the backup file be more
> easily reassembled from parts, or at least help figure out what's
> missing.

We already have the file offset as a part of every chunk.

> - i think the mutexes around writing the chunks out could be around less
> code

Fixed.

> - the F_WRITE macro calls should be replaced with a single write call.
> doing multiple small write() calls is less efficient than one big
> one... actually... if we store the checksum along with the header rather
> than at the tail of the chunk, we save a write() call and even without
> much code modification we go from 3 to 2 write() calls in write_chunk().

It doesn't seem feasible to replace them with a single call (again, due to the variable header length). But yes, moving the checksum field saves 1 write() call. Done.

> - we should have defines for xbstream chunk payload type

Done.

> in xb_stream_write_chunk()
> - tmpbuf should be malloc() instead of on stack - will be easier to
> catch overruns

Well, it's a small fixed-size buffer, so using malloc() doesn't make much sense. I added a debug assertion.
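To make the variable-length-header point above concrete, here is a sketch of how such a chunk header could be assembled in one buffer so that write_chunk() needs only two write() calls: one for the header (with the checksum at its tail) and one for the user-supplied payload. All field names, sizes, and the magic value are illustrative assumptions, not the actual xbstream wire format:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical chunk header layout (illustrative only):
 *   magic[8] | flags(1) | type(1) | path_len(4 LE) | path |
 *   payload_len(8 LE) | checksum(4 LE)
 * Because path is variable-length, the header itself is
 * variable-length, which is why a single fixed struct with a
 * zero-length array at the end cannot describe the whole chunk. */

#define CHUNK_MAGIC "XBSTCK01"   /* format version embedded in the magic */

static size_t pack_u32le(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++) p[i] = (uint8_t)(v >> (8 * i));
    return 4;
}

static size_t pack_u64le(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) p[i] = (uint8_t)(v >> (8 * i));
    return 8;
}

/* Assemble the variable-length header into hdr; returns its size.
 * The caller then issues write(fd, hdr, n) followed by
 * write(fd, payload, payload_len): 2 calls total instead of 3. */
static size_t pack_chunk_header(uint8_t *hdr, uint8_t flags, uint8_t type,
                                const char *path, uint32_t path_len,
                                uint64_t payload_len, uint32_t checksum)
{
    size_t n = 0;

    memcpy(hdr + n, CHUNK_MAGIC, 8);      n += 8;
    hdr[n++] = flags;
    hdr[n++] = type;
    n += pack_u32le(hdr + n, path_len);   /* path_len before path ...   */
    memcpy(hdr + n, path, path_len);      n += path_len;
    n += pack_u64le(hdr + n, payload_len);
    n += pack_u32le(hdr + n, checksum);   /* checksum ahead of payload  */
    return n;
}
```

A POSIX writev() with two iovecs would collapse this to one system call, but the two-write version above already matches the "3 to 2" improvement discussed in the thread.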
> xb_stream_write_data()
> - is it intentional that a series of writes of 100bytes, 1mb will be in
> 2 chunks (one 100bytes, the other 1mb)? It's certainly the simpler
> option :)

Yes, the intention was to get by with user-supplied buffers to avoid (re)allocating and copying data around. We do group small writes into 64K blocks just in case, but even that is probably excessive, as the upper layers mostly write in much larger blocks.

Thanks!
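The small-write grouping described above can be sketched as follows. Writes below the buffer size accumulate into a 64K block; any buffer too large to fit is emitted as its own chunk directly from the user-supplied memory, with no copying. The names and the emit_chunk() placeholder are illustrative assumptions, not the actual xb_stream_write_data() implementation:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XB_STREAM_BUFFER_SIZE (64 * 1024)  /* 64K grouping buffer */

typedef struct {
    uint8_t buf[XB_STREAM_BUFFER_SIZE];
    size_t  used;             /* bytes currently buffered           */
    size_t  chunks_emitted;   /* stands in for write_chunk() calls  */
} stream_file_t;

/* Placeholder for the real chunk writer. */
static void emit_chunk(stream_file_t *f, const void *data, size_t len)
{
    (void)data; (void)len;
    f->chunks_emitted++;
}

static void stream_write(stream_file_t *f, const void *data, size_t len)
{
    /* Flush buffered small writes if the new data won't fit. */
    if (f->used + len > XB_STREAM_BUFFER_SIZE && f->used > 0) {
        emit_chunk(f, f->buf, f->used);
        f->used = 0;
    }
    if (len >= XB_STREAM_BUFFER_SIZE) {
        /* Large write: emit the user-supplied buffer as-is,
         * no (re)allocation or copying. */
        emit_chunk(f, data, len);
    } else {
        memcpy(f->buf + f->used, data, len);
        f->used += len;
    }
}

static void stream_flush(stream_file_t *f)
{
    if (f->used > 0) {
        emit_chunk(f, f->buf, f->used);
        f->used = 0;
    }
}
```

With this logic, the 100-byte-then-1MB sequence from the quoted question does end up as 2 chunks: the 1MB write first flushes the buffered 100 bytes as one chunk, then goes out directly as another.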