[cvsnt] Re: UTF conversion issues after upgrade to 2.0.34
Tony Hoyle
tmh at nodomain.org
Wed Apr 7 23:31:55 BST 2004
Olaf Groeger wrote:
> http://www.unicode.org/faq/utf_bom.html#28). To give an example from my
> work: we save the content of a database in CVS. In the DB the Unicode string
> has no BOM, and when we save it to a file and put it into CVS we don't add a
> BOM. All four variations are fully legal.
If you're saving the DB to a file the context is lost, so you should use a
BOM. Otherwise it's just a binary file, not a Unicode text file.
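For illustration only (this isn't cvsnt code), a rough Python sketch of
writing DB text out with an explicit BOM so the encoding survives outside the
database; the filename and content are invented:

  # Write database text as UTF-16LE with an explicit BOM so any tool
  # reading the file later can tell what it is.
  text = "example row from the database"   # hypothetical content

  with open("export.txt", "wb") as f:
      f.write(b"\xff\xfe")                  # UTF-16LE BOM
      f.write(text.encode("utf-16-le"))     # payload, no second BOM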
> What cvsnt seems to do during commit is to cut off the first two bytes, where
> it expects to have the BOM. And during checkout/update, it adds "0xFF 0xFE"
Basically the output of a -ku file is always a correct UTF-16LE file (actually
UCS-2; UTF-16 is just an abstraction and not used in practice), as this is
what Windows uses. Internally it's stored as UTF-8 anyway, so there's
absolutely no difference between the types once it's in the repository (in
fact, if you do update -kkv on the -ku file, it'll give you a valid UTF-8 file
instead).
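As a rough illustration of that round trip (again Python, not the actual cvsnt
implementation - the variable names are invented):

  # The repository copy is UTF-8; a -ku checkout is the same text
  # re-encoded as UTF-16LE with a leading BOM, a -kkv checkout is not.
  repo_bytes = "some versioned text".encode("utf-8")   # what the server stores

  ku_checkout = b"\xff\xfe" + repo_bytes.decode("utf-8").encode("utf-16-le")
  kkv_checkout = repo_bytes

  # Both decode back to the same text.
  assert ku_checkout.decode("utf-16") == kkv_checkout.decode("utf-8")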
The ability to check out into different types may be added someday, but it's
not there at the moment. This just requires client-side changes and some more
-k options. Support for UCS-4 is needed for a complete implementation.
If you don't have a BOM you get ambiguities, which CVS does its best to
resolve, but it's not really a supported configuration - an automated tool
like CVS can't be expected to guess what the file actually is 100% of the time
(if you have any Japanese/Arabic/etc. then it's got no chance).
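To make the ambiguity concrete, here's a hedged sketch of the kind of guessing
a tool is reduced to without a BOM (plain Python, not how cvsnt actually
does it):

  def guess_encoding(data: bytes) -> str:
      # A BOM is the only unambiguous signal; everything below it is a guess.
      if data.startswith(b"\xef\xbb\xbf"):
          return "utf-8-sig"
      if data.startswith(b"\xff\xfe"):
          return "utf-16-le"
      if data.startswith(b"\xfe\xff"):
          return "utf-16-be"
      # No BOM: try UTF-8 and hope. This regularly misfires on
      # Japanese/Arabic/etc. text in legacy encodings.
      try:
          data.decode("utf-8")
          return "utf-8"
      except UnicodeDecodeError:
          return "unknown"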
Tony