[cvsnt] Re: check CVS repository integrity
Tony Hoyle
tony.hoyle at march-hare.com
Fri May 19 20:21:38 BST 2006
Michael Wojcik wrote:
> Look, however the copy happens, it's going to map every page of the new
> file at some point. Doesn't matter whether you're using private-buffer
> I/O with kernel copies (conventional read(2)/write(2) I/O), or
> memory-mapping and copying. Running a cumulative checksum over those
> pages will take very little additional time; you're already pulling them
> into cache. It's not like CVS can DMA from one file to another.
It's just not that simple. The file isn't read like that: parts of it are
skipped over during reading, and there's no single point at which you could
even make sense of a checksum if you had one. If you can't verify a checksum,
why spend the time writing it in the first place?
When it writes, it just remembers the last point it read from and does a block
copy of the rest; *however*, the headers are normally reconstructed. This
reconstruction is distributed all over the RCS code, so there is no single
point at which a checksum could be calculated.
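For contrast, the simple case Michael describes would look something like
this (a rough sketch, not CVSNT code - the function name and the use of
zlib's crc32() are my own choices for illustration): a single copy loop where
every byte passes through one place, so a running checksum costs almost
nothing extra.

/* Rough sketch, not CVSNT code: an incremental checksum only works
 * when every byte flows through a single copy loop like this one. */
#include <stdio.h>
#include <zlib.h>

/* Copy src to dst, folding every block into a running CRC-32. */
static int copy_with_crc(FILE *src, FILE *dst, unsigned long *crc_out)
{
    unsigned char buf[65536];
    unsigned long crc = crc32(0L, Z_NULL, 0);
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
        crc = crc32(crc, buf, (unsigned)n);
        if (fwrite(buf, 1, n, dst) != n)
            return -1;                  /* write error */
    }
    *crc_out = crc;
    return ferror(src) ? -1 : 0;        /* read error check */
}

That only works because the whole file funnels through one buffer; once the
headers are rebuilt in a dozen different places, there's nowhere sensible to
hang the crc32() calls.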
Checksumming individual revisions is slated for 3.0/3.1 if the technical
difficulties can be overcome, but those versions won't use RCS files anyway.
> I wasn't suggesting that. I'd run a checksum over the whole file, then
> append the result as another RCS section.
That would also kill performance. Any new data would need to go at the
beginning, or in the header area, since seeking to the end of the file is a
major performance drag under many operating systems - this is why the older,
rarely-read revisions are stored at the end, btw.
Storing the checksum in the RCS file at all would negate the point of it
anyway - it would always be invalid because writing it in the file changes the
file.
> If a checksum takes noticeable time, that's an algorithm problem. CVSNT
> shouldn't have any problem getting on the order of tens of megabytes per
> second (or better) throughput on a checksum on standard hardware. It
> wouldn't be noticeable.
I wouldn't accept an algorithm that couldn't process *hundreds* of megabytes a
second. RCS files get *big*. And it definitely would be noticeable if it took
more than a second per file: 1000 files at 1 second each adds more than 16
minutes to your checkin time. Not acceptable.
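For a sense of scale, here's a rough timing harness (again mine, not anything
in CVSNT - the buffer size and the choice of zlib's CRC-32 are arbitrary)
that shows whether a given algorithm clears that bar:

/* Rough timing harness, not part of CVSNT: fills a 64MB buffer and
 * times one CRC-32 pass over it, printing throughput in MB/s. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <zlib.h>

int main(void)
{
    const size_t size = 64 * 1024 * 1024;       /* 64MB test buffer */
    unsigned char *buf = malloc(size);
    if (!buf)
        return 1;
    memset(buf, 0xAB, size);                    /* arbitrary contents */

    clock_t t0 = clock();
    unsigned long crc = crc32(crc32(0L, Z_NULL, 0), buf, (unsigned)size);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        secs = 1.0 / CLOCKS_PER_SEC;            /* avoid divide-by-zero */

    printf("crc=%08lx  %.0f MB/s\n", crc,
           (size / (1024.0 * 1024.0)) / secs);
    free(buf);
    return 0;
}

Swap the crc32() call for whatever digest you like and you'll see quickly
whether it keeps up.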
Believe me, I *have* worked on this, and the tradeoff simply isn't worth it.
Tony