36782 – #pragma setlocale("C") is needed with localized MS VS .NET compilers

Status:	CLOSED FIXED

Alias:	None

Product:	porting
Classification:	Code
Component:	code
Version:	current
Hardware:	All All

Importance:	P4 Trivial
Target Milestone:	OOo 2.0
Assignee:	daniel.rentz
QA Contact:	issues@porting

URL:
Keywords:

Depends on:	42348 42350
Blocks:

Reported:	2004-11-06 18:23 UTC by tora3
Modified:	2007-05-01 16:02 UTC
CC List:	3 users

See Also:
Issue Type:	PATCH
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
cd sal; patch -p0 < 36782.patch.txt (577 bytes, patch) 2004-11-06 18:30 UTC, tora3
a list of the files (7.04 KB, text/plain) 2005-02-03 23:30 UTC, tora3
A sample code that shows a relation between setlocale and 0x80-0xff characters (684 bytes, text/plain) 2005-02-07 11:36 UTC, tora3
Works for both single and double quotation marks. (150 bytes, text/plain) 2005-02-07 12:28 UTC, tora3
suspicious files on comment lines (9.81 KB, text/plain) 2005-02-12 09:29 UTC, tora3

Description tora3 2004-11-06 18:23:50 UTC

The Japanese Microsoft Visual Studio .NET compilers misinterpret some source text 
containing bytes in the range 0x80 - 0xff and consequently fail to compile it.

For instance, in the source file sc/source/filter/excel/xistyle.cxx , 
a line 
 { 32, NF_NUMBER_STANDARD, "[$-0412]H\354\213\234 MM\353\266\204",
LANGUAGE_KOREAN },
will be interpreted as
 { 32, NF_NUMBER_STANDARD, "[$-0412]H(A)(B)MM(C)(D), LANGUAGE_KOREAN },
and produce syntax errors reporting unbalanced double quotation marks. 
The second double quotation mark is mistakenly treated as part of a 
two-byte Japanese character consisting of the bytes \204 and ".
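
The mechanism described above can be reproduced with a small Python sketch. It is an illustration only: `split_dbcs` is a hypothetical helper, not the compiler's actual lexer, and it assumes that the compiler pairs every code page 932 lead byte (0x81-0x9F, 0xE0-0xFC) with the following byte without checking whether that byte is a valid trail byte.

```python
def is_cp932_lead(byte: int) -> bool:
    """Return True for bytes that start a two-byte character in code page 932."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def split_dbcs(data: bytes) -> list[bytes]:
    """Split bytes into one- and two-byte units the way a naive DBCS reader would.

    Assumption: any byte that follows a lead byte is consumed as a trail byte,
    even if it is not a valid one (for example a space or a quotation mark).
    """
    units = []
    i = 0
    while i < len(data):
        if is_cp932_lead(data[i]) and i + 1 < len(data):
            units.append(data[i:i + 2])
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units


# The line from xistyle.cxx, with the UTF-8 bytes of the Korean text written out.
line = b'"[$-0412]H\xec\x8b\x9c MM\xeb\xb6\x84", LANGUAGE_KOREAN },'
units = split_dbcs(line)

# Count the quotation marks that remain separate one-byte units.
standalone_quotes = [u for u in units if u == b'"']
print(len(standalone_quotes))

# Check whether the closing quote was absorbed into a unit starting with 0x84.
print(b'\x84"' in units)
```

Under these assumptions the four two-byte units correspond to (A), (B), (C) and (D) in the interpreted line above, and the closing quotation mark is part of (D).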

To avoid misinterpretation, the following statement could be added somewhere 
appropriate.
 #pragma setlocale("C")

Any idea?

Comment 1 tora3 2004-11-06 18:30:15 UTC

Created attachment 19011 [details]
cd sal; patch -p0 < 36782.patch.txt

Comment 2 Martin Hollmichel 2005-01-21 17:10:34 UTC

reassign for review.

Comment 3 ooo 2005-02-03 15:17:12 UTC

Tora,

Is this the only place where you encountered the problem? If so, I would suggest
to use UTF-7 instead of UTF-8 encoding for these strings or to put the #pragma
setlocale("C") just into this file. Who knows the nasty side effects it might
have when put into sal/config.h, we'd have to compile the entire suite and try
it out..

Cc'ed Martin (for porting and release engineering) and assigned to Daniel, the
code owner.

Eike

Comment 4 daniel.rentz 2005-02-03 15:38:16 UTC

Accepted. BTW: The file in question is nowadays: sc/source/filter/excel/xlstyle.cxx

Comment 5 tora3 2005-02-03 15:47:06 UTC

I am going to provide you shortly with names of the files that might have a 
problem with the localized .NET Compilers.

Tora

Comment 6 daniel.rentz 2005-02-03 15:49:44 UTC

Fixed in SRC680/dr32 (OOo 2.0) by adding the #pragma locally to the cxx file.

Comment 7 tora3 2005-02-03 17:01:16 UTC

cvs co -r SRC680_m74 OpenOffice

find . -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' |
xargs perl -ne 'print "$ARGV\t$.\t$_" if m/"[^\x00-\x1f]*[\x80-\xff]"/; close(ARGV) if eof'

The results of the above command will be attached soon; they will show which 
files need attention.

Comment 8 ooo 2005-02-03 19:32:37 UTC

Tora,

Bear in mind that also high-bit characters within comments are caught that way.
Something that doesn't really hinder compiling the source (I hope). And yes,
8-bit characters are to be avoided nevertheless.

Eike

Comment 9 tora3 2005-02-03 23:28:02 UTC

OK, here is a new version of the command, which ignores comment lines:

find . -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' |
xargs perl -ne 's{//.*}{}; print "$ARGV\t$.\t$_" if m/"[^\x00-\x1f]*[\x80-\xff]"/;
close(ARGV) if eof' | tee 36782-list_of_files_UTF-8.cvs

And a list of the files:
awk '{print $1}' 36782-list_of_files_UTF-8.cvs | sed -e 's/\.\///' | sort -u
binfilter/bf_sc/source/filter/excel/sc_biffdump.cxx
binfilter/bf_sc/source/filter/excel/sc_xistyle.cxx
binfilter/bf_sw/source/filter/sw6/sw_sw6par.cxx
binfilter/bf_sw/source/filter/w4w/sw_w4wgraf.cxx
binfilter/bf_sw/source/filter/ww8/sw_ww8par5.cxx
fpicker/source/win32/filepicker/workbench/Test_fps.cxx
sal/qa/osl/module/osl_Module_Const.h
sal/qa/rtl/ostring/rtl_string.cxx
sal/qa/rtl/uri/rtl_Uri.cxx
sc/source/filter/excel/biffdump.cxx
svx/source/dialog/hangulhanjadlg.cxx
sw/source/filter/sw6/sw6par.cxx
sw/source/filter/w4w/w4wgraf.cxx
sw/source/filter/ww8/ww8par5.cxx
transex3/source/wtratree.cxx
vcl/unx/source/app/keysymnames.cxx

For detail, see 36782-list_of_files_UTF-8.cvs
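
For readers who prefer Python, the Perl one-liner above can be sketched roughly as follows. This is a port under assumptions, not the script that produced the attached list: `suspicious_lines` is a hypothetical name, and, like the Perl version, stripping everything after `//` will also cut a string literal that happens to contain `//`.

```python
import re

# Same pattern as the Perl one-liner: a double quote, any run of non-control
# bytes, one byte in 0x80-0xff, then a double quote.
HIGH_BYTE_BEFORE_CLOSING_QUOTE = re.compile(rb'"[^\x00-\x1f]*[\x80-\xff]"')
LINE_COMMENT = re.compile(rb"//.*")


def suspicious_lines(data: bytes, ignore_line_comments: bool = True) -> list[int]:
    """Return 1-based numbers of lines with a high byte just before a closing quote."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        if ignore_line_comments:
            line = LINE_COMMENT.sub(b"", line)
        if HIGH_BYTE_BEFORE_CLOSING_QUOTE.search(line):
            hits.append(number)
    return hits


sample = (
    b'a = "ok";\n'
    b'b = "M\xe6\x9c\x88D\xe6\x97\xa5";\n'
    b'// "x\xa5"\n'
)
print(suspicious_lines(sample, ignore_line_comments=False))
print(suspicious_lines(sample, ignore_line_comments=True))
```

Files should be read in binary mode (`open(path, "rb")`) so that the bytes are matched as stored on disk rather than after decoding.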

Comment 10 tora3 2005-02-03 23:30:34 UTC

Created attachment 22170 [details]
a list of the files

Comment 11 tora3 2005-02-03 23:40:32 UTC

Daniel,
sc/source/filter/excel/xlstyle.cxx seems innocent now.
We do not need to insert #pragma setlocale("C") in the file.

Although this file has lines like the ones below,
// Special UTF-8 characters
#define UTF8_EURO       "\342\202\254"
the expression \254 is just a sequence of the ASCII characters \ 2 5 and 4.
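
The difference between the escaped and the literal form can be checked with a short Python sketch. The helper name `has_high_bytes` is hypothetical; the two byte strings model how each variant of the `#define` would be stored on disk.

```python
# The source line as stored on disk when octal escapes are used:
# a backslash followed by ASCII digits.
escaped_source = rb'#define UTF8_EURO       "\342\202\254"'

# The same literal with the three UTF-8 bytes stored directly in the file.
raw_source = b'#define UTF8_EURO       "\xe2\x82\xac"'


def has_high_bytes(source: bytes) -> bool:
    """Return True if the source text contains any byte in 0x80-0xff."""
    return any(byte >= 0x80 for byte in source)


print(has_high_bytes(escaped_source))
print(has_high_bytes(raw_source))

# After the compiler has processed the octal escapes, both forms denote the
# same three bytes: the UTF-8 encoding of the euro sign.
euro_bytes = bytes([0o342, 0o202, 0o254])
print(euro_bytes == b"\xe2\x82\xac")
```

Only the second variant puts bytes above 0x7f in front of the compiler's reader, so only that variant can be affected by the source locale.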

Comment 12 daniel.rentz 2005-02-04 08:29:43 UTC

Just ignore sc/source/filter/biffdump.cxx, it is not compiled at all without special 
settings. Why should "\342\202\254" become a char sequence with backslash? 
This happens only if there is something like "\\342...". Does your script find 
escaped illegal characters ("\342") or only the "hardcoded" ones?

Comment 13 tora3 2005-02-04 09:11:39 UTC

The script is aimed at finding hardcoded ones.

For instance, the 'less' command displays 0x80-0xff characters in hexadecimal 
format, sandwiched between < and >.

LANG=C less binfilter/bf_sc/source/filter/excel/sc_xistyle.cxx
    {   56,     NF_NUMBER_STANDARD,            
"[$-0411]M<E6><9C><88>D<E6><97><A5>",         LANGUAGE_JAPANESE },
    {   57,     NF_NUMBER_STANDARD,             "[$-030411]GE.M.D",        
LANGUAGE_JAPANESE },
    {   58,     NF_NUMBER_STANDARD,            
"[$-030411]GGGE<E5><B9><B4>M<E6><9C><88>D<E6><97><A5>",LANGUAGE_JAPANESE }
} 

Those byte sequences, A5 22, could be interpreted as a Japanese 2 byte Kanji
character by the .NET Compiler. 22 is the double quotation mark.

Comment 14 daniel.rentz 2005-02-04 12:22:48 UTC

DR->TORA: Just to make it clear: Does the localized compiler have problems with 
*escaped* non-ASCII characters, i.e. with the text "\342\202\254" in 
sc/source/filter/excel/xlstyle.cxx, or do we *only* talk about the illegal literal 
characters like in sc_xistyle.cxx in the binfilter module, i.e. "<E6><97><A5>"? 
This is not clear to me from your initial comment in this issue, where it seems 
that you talk about the *escaped* characters. Anyway, for now I reopen the issue, 
and I will create sub tasks for all the affected modules.

Comment 15 daniel.rentz 2005-02-04 14:42:58 UTC

DR->TORA: It seems that your script does not look for single characters in 
apostrophes, e.g. /transex3/source/wtratree.cxx, line 108.

Comment 16 tora3 2005-02-07 11:25:31 UTC

Tora->DR: The localized compiler does not have a problem with *escaped* non-ASCII 
characters. We only need to talk about the characters whose byte code is in the 
range from 0x80 to 0xff.

Here is a sample file sample.c that can be compiled in the following way:
 perl guw.pl /cygdrive/c/PROGRA~1/MICROS~1.NET/Vc7/bin/cl.exe \
 -I/cygdrive/c/PROGRA~1/MICROS~1.NET/Vc7/include -DSETLOCALE_JA sample.c

Adding the command line option -DSETLOCALE_JA may show you what happens: 
it should let your compiler act as a localized one by using 
#pragma setlocale("ja")

Comment 17 tora3 2005-02-07 11:36:30 UTC

Created attachment 22285 [details]
A sample code that shows a relation between setlocale and 0x80-0xff characters

Comment 18 tora3 2005-02-07 12:24:05 UTC

Tora->DR: Thank you for pointing that out. As you mentioned, the script does not 
find single characters in apostrophes. A revised script that can detect such 
cases will be attached.

find . -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' |
xargs perl looking_for_0x80-0xff_characters_just_before_a_quotation_mark.pl

Here are additional files which can be taken care of:
binfilter/bf_sw/source/filter/excel/sw_exctools.cxx
sfx2/source/control/macro.cxx
svtools/source/control/reginfo.cxx
sw/source/ui/utlui/attrdesc.cxx
tools/source/fsys/dosmsc.cxx

Comment 19 tora3 2005-02-07 12:28:17 UTC

Created attachment 22289 [details]
Works for both single and double quotation marks.

Comment 20 daniel.rentz 2005-02-09 18:27:14 UTC

sfx2/* -> issue 42367
transex3/* -> issue 42367
svx/* -> issue 42367
sc/* -> issue 42367
sw/* -> issue 42367
binfilter/* -> issue 42367
fpicker/* -> issue 42348
sal/qa/* -> issue 42350
vcl/unx/* : Unix-only code
svtools/source/control/reginfo.cxx : comments or unused code
tools/source/fsys/dosmsc.cxx : unused code

Comment 21 daniel.rentz 2005-02-09 18:28:46 UTC

dependent issues

Comment 22 daniel.rentz 2005-02-09 18:29:34 UTC

parent issue -> prio to P4

Comment 23 tora3 2005-02-09 20:37:42 UTC

Great! I hope you all had a great time on Rosenmontag and Faschingsdienstag.

Comment 24 tora3 2005-02-10 02:16:01 UTC

Yoshiyuki Masutomi, who first pointed out this problem, let me know about
another file today.

find . -name '*.cpp' |
xargs perl looking_for_0x80-0xff_characters_just_before_a_quotation_mark.pl

hwpfilter/source/fontmap.cpp

Comment 25 daniel.rentz 2005-02-10 10:47:17 UTC

dependent issue added

Comment 26 daniel.rentz 2005-02-10 16:07:43 UTC

added "#pragma setlocale" to hwpfilter/source/fontmap.cpp

all child tasks are fixed -> this task is fixed

Comment 27 tora3 2005-02-12 08:39:28 UTC

An additional problem has been found. It can happen with comment lines.

The following comment lines, for example, cause a problem with the Japanese 
version of Microsoft Visual C++ .NET 2002 and 2003.

SRC680_m74/src/hwpfilter/source/hpara.h (172)
======================================================
// layout<C0><BB> <C0><A7><C7><D1> <C7><D4><BC><F6>
/**
 * Get line information of given line
 */
======================================================

The compilers interpret them in the following way:
======================================================
// layout<C0><BB> <C0><A7><C7><D1> <C7><D4><BC><F6>/**
 * Get line information of given line
 */
======================================================

As a result, the compilers complain that 
  Syntax Error: ';' is needed before the identifier 'line'
It means that the compilers probably expect 
 int * Get;

To understand such unbelievable behavior, see the encoding called codepage 932:
http://www.microsoft.com/globaldev/reference/dbcs/932.htm

A lead byte in the range 0x80 - 0x9f or 0xe0 - 0xff is followed by a trail byte.
In the example mentioned above the control code 0x0a, line feed, is 
treated as a trail byte, and thus the compiler misinterprets the line.
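
The effect on the comment line can be simulated with a short Python sketch. It is a model under assumptions, not the compiler's real behavior: `logical_lines` is a hypothetical helper, it uses the documented code page 932 lead-byte ranges 0x81-0x9F and 0xE0-0xFC (slightly narrower than the approximate ranges given above), and it assumes that whatever byte follows a lead byte is consumed as its trail byte.

```python
def is_cp932_lead(byte: int) -> bool:
    """Return True for bytes that start a two-byte character in code page 932."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def logical_lines(data: bytes) -> list[bytes]:
    """Split data into lines as a naive DBCS reader would see them.

    A 0x0A byte that directly follows a lead byte is consumed as a trail
    byte and therefore does not end the line.
    """
    lines = []
    current = bytearray()
    i = 0
    while i < len(data):
        if is_cp932_lead(data[i]) and i + 1 < len(data):
            current += data[i:i + 2]  # lead byte plus whatever follows it
            i += 2
            continue
        current.append(data[i])
        if data[i] == 0x0A:
            lines.append(bytes(current))
            current = bytearray()
        i += 1
    if current:
        lines.append(bytes(current))
    return lines


# The lines from hpara.h, with the high bytes written out.
source = (
    b"// layout\xc0\xbb \xc0\xa7\xc7\xd1 \xc7\xd4\xbc\xf6\n"
    b"/**\n"
    b" * Get line information of given line\n"
    b" */\n"
)
for line in logical_lines(source):
    print(line)
```

In this model the final byte 0xF6 of the comment pairs with the line feed, so the `/**` that follows becomes part of the `//` comment and the next line is left outside any comment.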

Other Asian languages have individual ranges of the leading byte. 
According to the MSDN web page 
http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/dnarvc/html/msdn_mbcssg.asp

=========================================================================
Lead Byte Ranges. Each code page may have different lead byte ranges.
Japan 	932 	0x81-0x9F
Korea 	949 	0xA1-0xFE
China 	936 	0xA1-0xFE
Taiwan 	950 	0xA1-0xFE, 0x8E-0xA0, 0x81-0x8D
=========================================================================

Comment 28 tora3 2005-02-12 09:24:28 UTC

Here is a script for finding suspicious files. Note that it works only for 
the Japanese version of Microsoft Visual C++ .NET 2002 and 2003.

cd OOo_1.9.m77_src
find * -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' -o -name '*.cpp' |
xargs perl -ne '$line=$_; s/\r\n/\n/; printf("%s\t%s\t%s", $ARGV, $., $line)
if s/[\x80-\x9f\xe0-\xff](.|\n)//g and not m/\n\Z/; close(ARGV) if eof' |
tee 36782-suspicious_files_on_comments.cvs
awk '{print $1}' 36782-suspicious_files_on_comments.cvs  | sort -u

binfilter/bf_sch/source/core/sch_chtmode2.cxx
binfilter/bf_svx/source/engine3d/svx_scene3d.cxx
binfilter/bf_sw/source/filter/w4w/sw_w4wpar1.cxx
binfilter/bf_sw/source/ui/app/sw_docsh.cxx
hwpfilter/source/drawdef.h
hwpfilter/source/drawing.h
hwpfilter/source/hbox.cpp
hwpfilter/source/hbox.h
hwpfilter/source/hcode.cpp
hwpfilter/source/hinfo.cpp
hwpfilter/source/hinfo.h
hwpfilter/source/hpara.cpp
hwpfilter/source/hpara.h
hwpfilter/source/hwpeq.cpp
hwpfilter/source/hwpfile.h
hwpfilter/source/hwpread.cpp
hwpfilter/source/hwpreader.cxx
hwpfilter/source/ksc5601.h
sch/source/core/chtmode2.cxx
so3/source/inplace/ipenv.cxx
svtools/source/filter.vcl/filter/sgvmain.cxx
svtools/source/filter.vcl/filter/sgvspln.cxx
svtools/source/items/itemset.cxx
svx/source/engine3d/scene3d.cxx
svx/source/svdraw/svdobj.cxx
svx/source/svdraw/svdoedge.cxx
sw/source/filter/w4w/w4wpar1.cxx
sw/source/ui/app/docsh.cxx
sw/source/ui/table/tablepg.hxx
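
The logic of the Perl one-liner in this comment can be sketched in Python as follows. The function names are hypothetical and the port is meant as a reading aid rather than a replacement for the script that produced the list above. It keeps the script's byte ranges (0x80-0x9f and 0xe0-0xff) and its approach: remove every lead byte together with the byte after it, then report the line if it no longer ends in a line feed.

```python
import re

# Lead-byte ranges used by the Perl one-liner, followed by any byte at all
# (DOTALL lets "." match the line feed as well).
LEAD_AND_NEXT = re.compile(rb"[\x80-\x9f\xe0-\xff].", re.DOTALL)


def newline_swallowed(line: bytes) -> bool:
    """Return True if a lead byte would consume the line feed ending this line."""
    line = line.replace(b"\r\n", b"\n")
    stripped, pairs_removed = LEAD_AND_NEXT.subn(b"", line)
    return pairs_removed > 0 and not stripped.endswith(b"\n")


def scan(data: bytes) -> list[int]:
    """Return the 1-based numbers of the lines flagged by newline_swallowed."""
    return [
        number
        for number, line in enumerate(data.splitlines(keepends=True), start=1)
        if newline_swallowed(line)
    ]


sample = (
    b"// ok \x82\xa0\n"      # lead byte 0x82 pairs with 0xA0, line feed survives
    b"// bad \xbc\xf6\n"     # 0xF6 pairs with the line feed
    b"int x;\n"
)
print(scan(sample))
```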


For hwpfilter/* , how about simply inserting #pragma setlocale("C") in the 
following files?

hwpfilter/source/fontmap.cpp
hwpfilter/source/mzstring.h
hwpfilter/source/precompile.h

Comment 29 tora3 2005-02-12 09:29:47 UTC

Created attachment 22501 [details]
suspicious files on comment lines

Comment 30 tora3 2005-02-12 09:39:23 UTC

Here is a script for finding out suspicious files for all other languages that 
use double-byte character set (DBCS).

cd OOo_1.9.m77_src
find * -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' -o -name '*.cpp' |
xargs perl -ne 's/\r\n/\n/; printf("%s\t%s\t%s", $ARGV, $., $_) if
m/[\x80-\xff]\Z/; close(ARGV) if eof' > z

awk '{print $1}' z | sort -u | wc -l
      58

awk '{print $1}' z | sort -u 

binfilter/bf_forms/source/misc/forms_services.cxx
binfilter/bf_sch/source/core/sch_chtmode2.cxx
binfilter/bf_sfx2/source/dialog/sfx2_filtergrouping.cxx
binfilter/bf_starmath/source/starmath_xchar.cxx
binfilter/bf_svx/source/engine3d/svx_scene3d.cxx
binfilter/bf_svx/source/svrtf/svx_rtfitem.cxx
binfilter/bf_sw/source/filter/w4w/sw_w4wpar1.cxx
binfilter/bf_sw/source/ui/app/sw_docsh.cxx
connectivity/source/commontools/FValue.cxx
connectivity/source/drivers/evoab/LColumnAlias.cxx
connectivity/source/drivers/evoab/LConfigAccess.cxx
connectivity/source/drivers/mozab/MConfigAccess.cxx
forms/source/misc/services.cxx

hwpfilter/*

sal/qa/osl/file/osl_File_Const.h
sch/source/core/chtmode2.cxx
sfx2/source/dialog/filtergrouping.cxx
so3/source/inplace/ipenv.cxx
starmath/source/xchar.cxx
svtools/source/filepicker/iodlg.cxx
svtools/source/filepicker/pickerhistory.cxx
svtools/source/filter.vcl/filter/sgvmain.cxx
svtools/source/filter.vcl/filter/sgvspln.cxx
svtools/source/filter.vcl/filter/sgvtext.cxx
svtools/source/items/itemset.cxx
svx/inc/svdio.hxx
svx/inc/svdmodel.hxx
svx/inc/svdomeas.hxx
svx/inc/svdtrans.hxx
svx/inc/sxmlhitm.hxx
svx/source/engine3d/scene3d.cxx
svx/source/form/formcontrolling.cxx
svx/source/svdraw/svdobj.cxx
svx/source/svdraw/svdoedge.cxx
svx/source/svrtf/rtfitem.cxx
sw/source/filter/w4w/w4wpar1.cxx
sw/source/ui/app/docsh.cxx
sw/source/ui/table/tablepg.hxx
toolkit/source/controls/dialogcontrol.cxx
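
The language-independent check in this comment is simpler than the Japanese-specific one: it reports every line whose last byte before the line ending is in 0x80-0xff, whichever code page is in use. A rough Python equivalent, with hypothetical function names, could look like this:

```python
def ends_with_high_byte(line: bytes) -> bool:
    """Return True if the last byte before the line ending is in 0x80-0xff."""
    body = line.rstrip(b"\r\n")
    return len(body) > 0 and body[-1] >= 0x80


def scan(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that end with a high byte."""
    return [
        number
        for number, line in enumerate(data.splitlines(keepends=True), start=1)
        if ends_with_high_byte(line)
    ]


sample = (
    b"// caf\xe9\n"          # ends with a high byte
    b"int x; // ok\n"        # plain ASCII
    b"// \xc0\xa7 \r\n"      # high bytes, but a space precedes the line ending
)
print(scan(sample))
```

Because it does not model any particular lead-byte range, this check reports more lines than a given compiler would actually trip over, which is consistent with the longer file list above.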

Comment 31 tora3 2005-02-12 09:40:37 UTC

To fix this kind of problem, there are at least two options:
 1. inserting #pragma setlocale("C") in the suspicious files or header files.
 2. appending a space character at the end of the comment lines that end with 
    one of the 0x80-0xff characters.

Comment 32 daniel.rentz 2005-02-14 12:39:18 UTC

I will take care of the new files.

Adding spaces at the end of the line is a bad idea, some editors may strip them 
away when editing or saving the file.

Comment 33 daniel.rentz 2005-02-15 14:49:33 UTC

all sub tasks fixed

Comment 34 daniel.rentz 2005-04-04 14:08:14 UTC

all sub tasks verified

Comment 35 tora3 2005-04-04 16:07:01 UTC

Thanks. 
Some Japanese engineers are going to try to build the revised sources 
with Microsoft Visual Studio .NET Compiler Japanese version.

Comment 36 tora3 2005-04-05 03:38:22 UTC

A highly experienced builder, Yoshiyuki Masutomi <curvirgo@eos.ocn.ne.jp>, 
has already confirmed the fix with around SRC680_m86 or m87.

Thank you to all who tackled this issue. Now we can bring this issue to a close.

Comment 37 daniel.rentz 2005-05-20 13:19:35 UTC

closed