[cvsnt] Re: check CVS repository integrity
Tony Hoyle
tony.hoyle at march-hare.com
Fri May 19 20:21:38 BST 2006
Michael Wojcik wrote:
> Look, however the copy happens, it's going to map every page of the new
> file at some point. Doesn't matter whether you're using private-buffer
> I/O with kernel copies (conventional read(2)/write(2) I/O), or
> memory-mapping and copying. Running a cumulative checksum over those
> pages will take very little additional time; you're already pulling them
> into cache. It's not like CVS can DMA from one file to another.
It's just not that simple. The file isn't read like that: parts of it are
skipped over during reading, and there's no single point at which you could
even make sense of a checksum if you had one. If you can't verify a checksum,
why spend the time writing it in the first place?
When it writes, it just remembers the last point it read from and does a block
copy of the rest; *however*, the headers are normally reconstructed. This
reconstruction is distributed all over the RCS code, so there is no single
point at which a checksum could be calculated.
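For contrast, the simple case Michael describes would look something like
this (a rough sketch, not CVSNT code - the function name and the use of
zlib's crc32() are my own choices for illustration): a single copy loop where
every byte passes through one place, so a running checksum costs almost
nothing extra.

/* Rough sketch, not CVSNT code: an incremental checksum only works
 * when every byte flows through a single copy loop like this one. */
#include <stdio.h>
#include <zlib.h>

/* Copy src to dst, folding every block into a running CRC-32. */
static int copy_with_crc(FILE *src, FILE *dst, unsigned long *crc_out)
{
    unsigned char buf[65536];
    unsigned long crc = crc32(0L, Z_NULL, 0);
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
        crc = crc32(crc, buf, (unsigned)n);
        if (fwrite(buf, 1, n, dst) != n)
            return -1;                  /* write error */
    }
    *crc_out = crc;
    return ferror(src) ? -1 : 0;        /* read error check */
}

That only works because the whole file funnels through one buffer; once the
headers are rebuilt in a dozen different places, there's nowhere sensible to
hang the crc32() calls.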
Checksumming individual revisions is slated for 3.0/3.1 if the technical
difficulties can be overcome, but those versions won't use RCS files anyway.
> I wasn't suggesting that. I'd run a checksum over the whole file, then
> append the result as another RCS section.
That would also kill performance. Any new data would need to go at the
beginning, or in the header area, since seeking to the end of the file is a
major performance drag under many operating systems - this is why the older,
rarely-read revisions are stored at the end, btw.
Storing the checksum in the RCS file at all would negate the point of it
anyway - it would always be invalid because writing it in the file changes the
file.
> If a checksum takes noticeable time, that's an algorithm problem. CVSNT
> shouldn't have any problem getting on the order of tens of megabytes per
> second (or better) throughput on a checksum on standard hardware. It
> wouldn't be noticeable.
I wouldn't accept an algorithm that couldn't process *hundreds* of megabytes a
second. RCS files get *big*. And it definitely would be noticeable if it took
more than a second per file: 1000 files at 1 second each adds more than 16
minutes to your checkin time. Not acceptable.
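For a sense of scale, here's a rough timing harness (again mine, not anything
in CVSNT - the buffer size and the choice of zlib's CRC-32 are arbitrary)
that shows whether a given algorithm clears that bar:

/* Rough timing harness, not part of CVSNT: fills a 64MB buffer and
 * times one CRC-32 pass over it, printing throughput in MB/s. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <zlib.h>

int main(void)
{
    const size_t size = 64 * 1024 * 1024;       /* 64MB test buffer */
    unsigned char *buf = malloc(size);
    if (!buf)
        return 1;
    memset(buf, 0xAB, size);                    /* arbitrary contents */

    clock_t t0 = clock();
    unsigned long crc = crc32(crc32(0L, Z_NULL, 0), buf, (unsigned)size);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        secs = 1.0 / CLOCKS_PER_SEC;            /* avoid divide-by-zero */

    printf("crc=%08lx  %.0f MB/s\n", crc,
           (size / (1024.0 * 1024.0)) / secs);
    free(buf);
    return 0;
}

Swap the crc32() call for whatever digest you like and you'll see quickly
whether it keeps up.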
Believe me, I *have* worked on this, and the tradeoff simply isn't worth it.
Tony