[cvsnt] Re: UTF conversion issues after upgrade to 2.0.34
Tony Hoyle
tmh at nodomain.org
Wed Apr 7 23:31:55 BST 2004
Olaf Groeger wrote:
> http://www.unicode.org/faq/utf_bom.html#28). To give an example from my
> work: we save the content of a database in CVS. In the DB the Unicode string
> has no BOM, and when we save it to a file and put it into CVS we don't add a
> BOM. All four variations are fully legal.
If you're saving the DB to a file the context is lost, so you should use a
BOM. Otherwise it's just a binary file, not a Unicode text file.
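For illustration only (this isn't cvsnt code), a rough Python sketch of
writing DB text out with an explicit BOM so the encoding survives outside the
database; the filename and content are invented:

  # Write database text as UTF-16LE with an explicit BOM so any tool
  # reading the file later can tell what it is.
  text = "example row from the database"   # hypothetical content

  with open("export.txt", "wb") as f:
      f.write(b"\xff\xfe")                  # UTF-16LE BOM
      f.write(text.encode("utf-16-le"))     # payload, no second BOM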
> What cvsnt seems to do during commit is to cut off the first two bytes, where
> it expects to have the BOM. And during checkout/update, it adds "0xFF 0xFE"
Basically the output of a -ku file is always a correct UTF-16LE file (actually
UCS-2; UTF-16 is just an abstraction and not used in practice), as this is
what Windows uses. Internally it's stored as UTF-8 anyway, so there's
absolutely no difference between the types once it's in the repository (in
fact, if you do update -kkv on the -ku file, it'll give you a valid UTF-8 file
instead).
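As a rough illustration of that round trip (again Python, not the actual cvsnt
implementation - the variable names are invented):

  # The repository copy is UTF-8; a -ku checkout is the same text
  # re-encoded as UTF-16LE with a leading BOM, a -kkv checkout is not.
  repo_bytes = "some versioned text".encode("utf-8")   # what the server stores

  ku_checkout = b"\xff\xfe" + repo_bytes.decode("utf-8").encode("utf-16-le")
  kkv_checkout = repo_bytes

  # Both decode back to the same text.
  assert ku_checkout.decode("utf-16") == kkv_checkout.decode("utf-8")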
The ability to check out into different types may be added someday, but it's
not there at the moment. This just requires client-side changes and some more
-k options. Support for UCS-4 is needed for a complete implementation.
If you don't have a BOM you get ambiguities, which CVS does its best to
resolve, but it's not really a supported configuration - an automated tool
like CVS can't be expected to guess what the file actually is 100% of the time
(if you have any Japanese/Arabic/etc. then it's got no chance).
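To make the ambiguity concrete, here's a hedged sketch of the kind of guessing
a tool is reduced to without a BOM (plain Python, not how cvsnt actually
does it):

  def guess_encoding(data: bytes) -> str:
      # A BOM is the only unambiguous signal; everything below it is a guess.
      if data.startswith(b"\xef\xbb\xbf"):
          return "utf-8-sig"
      if data.startswith(b"\xff\xfe"):
          return "utf-16-le"
      if data.startswith(b"\xfe\xff"):
          return "utf-16-be"
      # No BOM: try UTF-8 and hope. This regularly misfires on
      # Japanese/Arabic/etc. text in legacy encodings.
      try:
          data.decode("utf-8")
          return "utf-8"
      except UnicodeDecodeError:
          return "unknown"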
Tony