Issue 36782 - #pragma setlocale("C") is needed with localized MS VS .NET compilers
Summary: #pragma setlocale("C") is needed with localized MS VS .NET compilers
Status: CLOSED FIXED
Alias: None
Product: porting
Classification: Code
Component: code
Version: current
Hardware: All All
Importance: P4 Trivial
Target Milestone: OOo 2.0
Assignee: daniel.rentz
QA Contact: issues@porting
URL:
Keywords:
Depends on: 42348 42350
Blocks:
Reported: 2004-11-06 18:23 UTC by tora3
Modified: 2007-05-01 16:02 UTC

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
cd sal; patch -p0 < 36782.patch.txt (577 bytes, patch)
2004-11-06 18:30 UTC, tora3
a list of the files (7.04 KB, text/plain)
2005-02-03 23:30 UTC, tora3
A sample code that shows a relation between setlocale and 0x80-0xff characters (684 bytes, text/plain)
2005-02-07 11:36 UTC, tora3
Works for both single and double quotation marks. (150 bytes, text/plain)
2005-02-07 12:28 UTC, tora3
suspicious files on comment lines (9.81 KB, text/plain)
2005-02-12 09:29 UTC, tora3

Description tora3 2004-11-06 18:23:50 UTC
The Japanese Microsoft Visual Studio .NET compilers misinterpret some text 
containing bytes in the range 0x80 - 0xff and consequently fail to compile it.

For instance, in the source file sc/source/filter/excel/xistyle.cxx , 
a line 
 { 32, NF_NUMBER_STANDARD, "[$-0412]H\354\213\234 MM\353\266\204",
LANGUAGE_KOREAN },
will be interpreted as
 { 32, NF_NUMBER_STANDARD, "[$-0412]H(A)(B)MM(C)(D), LANGUAGE_KOREAN },
and produce syntax errors reporting unbalanced double quotation marks. 
The second double quotation mark is mistakenly treated as part of a 
two-byte Japanese character consisting of the bytes \204 and ".
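
The mechanism described above can be sketched as a small simulation. This is not how the Microsoft compiler is implemented; it only assumes that, under a double-byte locale, a lead byte and whatever byte follows it are consumed together. The lead-byte ranges used for code page 932 (0x81-0x9F and 0xE0-0xFC) and the names closing_quote, CP932_LEAD and C_LOCALE_LEAD are assumptions made for illustration. The high bytes are written as raw bytes, as they appear in the source file.

```python
# Assumed lead-byte ranges for code page 932 (Japanese).
CP932_LEAD = ((0x81, 0x9F), (0xE0, 0xFC))
# The "C" locale has no multi-byte characters, hence no lead bytes.
C_LOCALE_LEAD = ()


def is_lead(byte, lead_ranges):
    """True if byte falls into one of the (low, high) lead-byte ranges."""
    return any(low <= byte <= high for low, high in lead_ranges)


def closing_quote(line, lead_ranges):
    """Return the index of the quote closing the first string literal, or None."""
    start = line.find(b'"')
    if start < 0:
        return None
    i = start + 1
    while i < len(line):
        byte = line[i]
        if is_lead(byte, lead_ranges):
            i += 2          # lead byte plus the byte after it, whatever it is
        elif byte == 0x5C:  # backslash: skip the escaped character
            i += 2
        elif byte == 0x22:  # unconsumed double quotation mark ends the literal
            return i
        else:
            i += 1
    return None


# The Korean format string from the report, with its raw UTF-8 bytes.
line = b'"[$-0412]H\xec\x8b\x9c MM\xeb\xb6\x84", LANGUAGE_KOREAN },'
```

With C_LOCALE_LEAD the function finds the closing quotation mark right after the byte 0x84. With CP932_LEAD the bytes are paired as EC 8B, 9C 20, EB B6 and 84 22, the quotation mark is consumed as a trail byte, and the function returns None, which corresponds to the unbalanced-quote error.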

To avoid misinterpretation, the following statement could be added somewhere 
appropriate.
 #pragma setlocale("C")

Any idea?
Comment 1 tora3 2004-11-06 18:30:15 UTC
Created attachment 19011 [details]
cd sal; patch -p0 < 36782.patch.txt
Comment 2 Martin Hollmichel 2005-01-21 17:10:34 UTC
reassign for review.
Comment 3 ooo 2005-02-03 15:17:12 UTC
Tora,

Is this the only place where you encountered the problem? If so, I would suggest
using UTF-7 instead of UTF-8 encoding for these strings, or putting the #pragma
setlocale("C") just into this file. Who knows what nasty side effects it might
have when put into sal/config.h; we'd have to compile the entire suite and try
it out.

Cc'ed Martin (for porting and release engineering) and assigned to Daniel, the
code owner.

Eike
Comment 4 daniel.rentz 2005-02-03 15:38:16 UTC
Accepted. BTW: The file in question is nowadays: sc/source/filter/excel/xlstyle.cxx
Comment 5 tora3 2005-02-03 15:47:06 UTC
I am going to provide you shortly with names of the files that might have a 
problem with the localized .NET Compilers.

Tora
Comment 6 daniel.rentz 2005-02-03 15:49:44 UTC
Fixed in SRC680/dr32 (OOo 2.0) by adding the #pragma locally to the cxx file.
Comment 7 tora3 2005-02-03 17:01:16 UTC
cvs co -r SRC680_m74 OpenOffice

find . -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' | xargs perl
-ne 'print "$ARGV\t$.\t$_" if m/"[^\x00-\x1f]*[\x80-\xff]"/; close(ARGV) if eof'

The results of the above command will be attached soon. They will show which 
files need to be taken care of.
Comment 8 ooo 2005-02-03 19:32:37 UTC
Tora,

Bear in mind that high-bit characters within comments are also caught that way,
something that doesn't really hinder compiling the source (I hope). And yes,
8-bit characters are to be avoided nevertheless.

Eike
Comment 9 tora3 2005-02-03 23:28:02 UTC
OK, here is a new version of the command, which ignores comment lines:

find . -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' | xargs perl
-ne 's{//.*}{}; print "$ARGV\t$.\t$_" if m/"[^\x00-\x1f]*[\x80-\xff]"/;
close(ARGV) if eof' | tee 36782-list_of_files_UTF-8.cvs

And a list of the files:
awk '{print $1}' 36782-list_of_files_UTF-8.cvs | sed -e 's/\.\///' | sort -u
binfilter/bf_sc/source/filter/excel/sc_biffdump.cxx
binfilter/bf_sc/source/filter/excel/sc_xistyle.cxx
binfilter/bf_sw/source/filter/sw6/sw_sw6par.cxx
binfilter/bf_sw/source/filter/w4w/sw_w4wgraf.cxx
binfilter/bf_sw/source/filter/ww8/sw_ww8par5.cxx
fpicker/source/win32/filepicker/workbench/Test_fps.cxx
sal/qa/osl/module/osl_Module_Const.h
sal/qa/rtl/ostring/rtl_string.cxx
sal/qa/rtl/uri/rtl_Uri.cxx
sc/source/filter/excel/biffdump.cxx
svx/source/dialog/hangulhanjadlg.cxx
sw/source/filter/sw6/sw6par.cxx
sw/source/filter/w4w/w4wgraf.cxx
sw/source/filter/ww8/ww8par5.cxx
transex3/source/wtratree.cxx
vcl/unx/source/app/keysymnames.cxx

For details, see 36782-list_of_files_UTF-8.cvs
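
For readers who do not use Perl, the check performed by the one-liner above can be restated roughly as follows. This is a sketch, not the script that produced the attached list; the function name suspicious_lines is made up for illustration. Like the Perl version, it strips everything after // (even inside a string literal) and does not handle /* */ comments.

```python
import re

# Same pattern as the Perl one-liner: a double-quoted run without control
# characters whose last byte before the closing quote is in 0x80-0xff.
HIGH_BYTE_BEFORE_QUOTE = re.compile(rb'"[^\x00-\x1f]*[\x80-\xff]"')
# Naive removal of // comments, as in the Perl s{//.*}{} step.
LINE_COMMENT = re.compile(rb'//.*')


def suspicious_lines(lines):
    """Yield (line_number, line) for lines with a high byte right before a '"'."""
    for number, line in enumerate(lines, start=1):
        code = LINE_COMMENT.sub(b'', line)
        if HIGH_BYTE_BEFORE_QUOTE.search(code):
            yield number, line
```

The input should be byte strings, for example the lines of a file opened with mode 'rb', so that the 0x80-0xff bytes are not decoded before the check runs.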
Comment 10 tora3 2005-02-03 23:30:34 UTC
Created attachment 22170 [details]
a list of the files
Comment 11 tora3 2005-02-03 23:40:32 UTC
Daniel,
sc/source/filter/excel/xlstyle.cxx seems innocent now.
We do not need to insert #pragma setlocale("C") in the file.

The file does have lines like these:
// Special UTF-8 characters
#define UTF8_EURO       "\342\202\254"
but in the source file the expression \254 is just a sequence of the ASCII 
characters \ 2 5 and 4.
Comment 12 daniel.rentz 2005-02-04 08:29:43 UTC
Just ignore sc/source/filter/excel/biffdump.cxx, it is not compiled at all without special 
settings. Why should "\342\202\254" become a char sequence with a backslash? 
That happens only if there is something like "\\342...". Does your script find 
escaped illegal characters ("\342") or only the "hardcoded" ones?
Comment 13 tora3 2005-02-04 09:11:39 UTC
The script is aimed at finding hard-coded ones.

For instance, the 'less' command displays 0x80-0xff characters in hexadecimal 
format, enclosed in < and >.

LANG=C less binfilter/bf_sc/source/filter/excel/sc_xistyle.cxx
    {   56,     NF_NUMBER_STANDARD,            
"[$-0411]M<E6><9C><88>D<E6><97><A5>",         LANGUAGE_JAPANESE },
    {   57,     NF_NUMBER_STANDARD,             "[$-030411]GE.M.D",        
LANGUAGE_JAPANESE },
    {   58,     NF_NUMBER_STANDARD,            
"[$-030411]GGGE<E5><B9><B4>M<E6><9C><88>D<E6><97><A5>",LANGUAGE_JAPANESE }
} 

A byte sequence such as A5 22 could be interpreted as a Japanese 2-byte Kanji
character by the .NET compiler. 22 is the double quotation mark.
Comment 14 daniel.rentz 2005-02-04 12:22:48 UTC
DR->TORA: Just to make it clear: Does the localized compiler have problems with 
*escaped* non-ASCII characters, i.e. with the text "\342\202\254" in 
sc/source/filter/excel/xlstyle.cxx, or are we talking *only* about the illegal literal 
characters like those in sc_xistyle.cxx in the binfilter module, i.e. "<E6><97><A5>"? 
This is not clear to me from your initial comment in this issue, where it seems that 
you are talking about the *escaped* characters. Anyway, for now I reopen the issue, 
and I will create sub tasks for all the affected modules.
Comment 15 daniel.rentz 2005-02-04 14:42:58 UTC
DR->TORA: It seems that your script does not look for single characters in 
apostrophes, e.g. /transex3/source/wtratree.cxx, line 108.
Comment 16 tora3 2005-02-07 11:25:31 UTC
Tora->DR: The localized compiler does not have a problem with *escaped* non-ASCII 
characters. We only need to talk about the characters whose byte code is in the 
range from 0x80 to 0xff.

Here is a sample file sample.c that can be compiled in the following way:
 perl guw.pl /cygdrive/c/PROGRA~1/MICROS~1.NET/Vc7/bin/cl.exe
-I/cygdrive/c/PROGRA~1/MICROS~1.NET/Vc7/include -DSETLOCALE_JA sample.c

Adding the command line option -DSETLOCALE_JA may show you what happens; 
it should let your compiler act as a localized one by using 
#pragma setlocale("ja")
Comment 17 tora3 2005-02-07 11:36:30 UTC
Created attachment 22285 [details]
A sample code that shows a relation between setlocale and 0x80-0xff characters
Comment 18 tora3 2005-02-07 12:24:05 UTC
Tora->DR: Thank you for pointing that out. As you mentioned, the script does not 
find single characters in apostrophes. A revised script that can detect such 
cases will be attached.

find . -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' | xargs perl
looking_for_0x80-0xff_characters_just_before_a_quotation_mark.pl 

Here are additional files that need to be taken care of:
binfilter/bf_sw/source/filter/excel/sw_exctools.cxx
sfx2/source/control/macro.cxx
svtools/source/control/reginfo.cxx
sw/source/ui/utlui/attrdesc.cxx
tools/source/fsys/dosmsc.cxx
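
The revised script is only available as an attachment and is not reproduced in this thread. A hypothetical check in the same spirit, flagging any 0x80-0xff byte that sits directly before a single or double quotation mark, might look like this. The function name and the exact pattern are assumptions and are not taken from the attachment.

```python
import re

# A byte in 0x80-0xff immediately followed by '"' (0x22) or "'" (0x27).
HIGH_BYTE_THEN_QUOTE = re.compile(rb'[\x80-\xff][\x22\x27]')


def has_high_byte_before_quote(line):
    """True if a high byte directly precedes a quotation mark or apostrophe."""
    return HIGH_BYTE_THEN_QUOTE.search(line) is not None
```

Unlike the earlier double-quote pattern, this also catches character constants such as a single high byte between apostrophes, which is the case pointed out for transex3/source/wtratree.cxx.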
Comment 19 tora3 2005-02-07 12:28:17 UTC
Created attachment 22289 [details]
Works for both single and double quotation marks.
Comment 20 daniel.rentz 2005-02-09 18:27:14 UTC
sfx2/* -> issue 42367
transex3/* -> issue 42367
svx/* -> issue 42367
sc/* -> issue 42367
sw/* -> issue 42367
binfilter/* -> issue 42367
fpicker/* -> issue 42348
sal/qa/* -> issue 42350
vcl/unx/* : Unix-only code
svtools/source/control/reginfo.cxx : comments or unused code
tools/source/fsys/dosmsc.cxx : unused code
Comment 21 daniel.rentz 2005-02-09 18:28:46 UTC
dependent issues
Comment 22 daniel.rentz 2005-02-09 18:29:34 UTC
parent issue -> prio to P4
Comment 23 tora3 2005-02-09 20:37:42 UTC
Great! I hope you all had a great time on Rosenmontag and Faschingsdienstag.
Comment 24 tora3 2005-02-10 02:16:01 UTC
Yoshiyuki Masutomi, who first pointed out this problem, let me know about
another file today.

find . -name '*.cpp' | xargs perl
looking_for_0x80-0xff_characters_just_before_a_quotation_mark.pl 

hwpfilter/source/fontmap.cpp
Comment 25 daniel.rentz 2005-02-10 10:47:17 UTC
dependent issue added
Comment 26 daniel.rentz 2005-02-10 16:07:43 UTC
added "#pragma setlocale" to hwpfilter/source/fontmap.cpp

all child tasks are fixed -> this task is fixed
Comment 27 tora3 2005-02-12 08:39:28 UTC
An additional problem has been found. It can happen with comment lines.

The following comment lines, for example, cause a problem with the Japanese version 
of Microsoft Visual C++ .NET 2002 and 2003.

SRC680_m74/src/hwpfilter/source/hpara.h (172)
======================================================
// layout<C0><BB> <C0><A7><C7><D1> <C7><D4><BC><F6>
/**
 * Get line information of given line
 */
======================================================

The compilers interpret them in the following way:
======================================================
// layout<C0><BB> <C0><A7><C7><D1> <C7><D4><BC><F6>/**
 * Get line information of given line
 */
======================================================

As a result, the compilers complain that 
  Syntax Error: ';' is needed before the identifier 'line'
It means that the compilers probably expect 
 int * Get;

To understand such unbelievable behavior, see the encoding called codepage 932:
http://www.microsoft.com/globaldev/reference/dbcs/932.htm

A lead byte in the range 0x80 - 0x9f or 0xe0 - 0xff is followed by a trail byte.
In the example above, the control code 0x0a (line feed) is treated as a trail 
byte, and thus the compiler misinterprets the line.
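
The effect on comment lines can be reproduced with a small simulation. It is only a sketch under the assumption that a lead byte always consumes the following byte, including a line feed; the lead-byte ranges 0x81-0x9F and 0xE0-0xFC and the names logical_lines and CP932_LEAD are assumptions for illustration, not the compiler's actual implementation.

```python
# Assumed lead-byte ranges for code page 932 (Japanese).
CP932_LEAD = ((0x81, 0x9F), (0xE0, 0xFC))


def is_lead(byte, lead_ranges):
    """True if byte falls into one of the (low, high) lead-byte ranges."""
    return any(low <= byte <= high for low, high in lead_ranges)


def logical_lines(data, lead_ranges):
    """Split data into lines the way a double-byte-aware reader might see them."""
    lines, current, i = [], bytearray(), 0
    while i < len(data):
        if is_lead(data[i], lead_ranges) and i + 1 < len(data):
            current += data[i:i + 2]  # the lead byte swallows the next byte, even 0x0a
            i += 2
            continue
        current.append(data[i])
        if data[i] == 0x0A:           # an unconsumed line feed ends the line
            lines.append(bytes(current))
            current = bytearray()
        i += 1
    if current:
        lines.append(bytes(current))
    return lines


# The lines from hpara.h quoted above, with their raw bytes.
source = (b'// layout\xc0\xbb \xc0\xa7\xc7\xd1 \xc7\xd4\xbc\xf6\n'
          b'/**\n'
          b' * Get line information of given line\n'
          b' */\n')
```

With an empty tuple as lead_ranges the four lines stay separate. With CP932_LEAD the final byte 0xF6 consumes the line feed, so the // comment also covers the /** line and the next line, " * Get line information of given line", is left outside any comment.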

Other Asian languages have their own lead byte ranges. 
According to the MSDN web page 
http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/dnarvc/html/msdn_mbcssg.asp

=========================================================================
Lead Byte Ranges. Each code page may have different lead byte ranges.
Region  Code page  Lead byte ranges
Japan   932        0x81-0x9F, 0xE0-0xFC
Korea   949        0xA1-0xFE
China   936        0xA1-0xFE
Taiwan  950        0xA1-0xFE, 0x8E-0xA0, 0x81-0x8D
=========================================================================
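
The ranges can be turned into a small lookup that tells, for a given byte, which code pages would treat it as a lead byte. This is an illustrative sketch; the names are made up, and the 932 entry includes 0xE0-0xFC, which the 0xF6 example above depends on, so treat that range as an assumption if it differs from the table you are working from.

```python
# Lead-byte ranges per code page as (low, high) pairs.
LEAD_BYTE_RANGES = {
    932: ((0x81, 0x9F), (0xE0, 0xFC)),  # Japan; second range is an assumption
    949: ((0xA1, 0xFE),),               # Korea
    936: ((0xA1, 0xFE),),               # China
    950: ((0xA1, 0xFE), (0x8E, 0xA0), (0x81, 0x8D)),  # Taiwan
}


def code_pages_treating_as_lead(byte):
    """Return the sorted code pages in which byte is a lead byte."""
    return sorted(
        code_page
        for code_page, ranges in LEAD_BYTE_RANGES.items()
        if any(low <= byte <= high for low, high in ranges)
    )
```

A byte such as 0xF6 at the end of a comment line is a lead byte in all four code pages, while 0x85 is one only in 932 and 950, so which lines are affected depends on the compiler's language version.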
Comment 28 tora3 2005-02-12 09:24:28 UTC
Here is a script for finding suspicious files. Note that it works 
only for the Japanese version of Microsoft Visual C++ .NET 2002 and 2003.

cd OOo_1.9.m77_src
find * -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' -o -name
'*.cpp' | xargs perl -ne '$line=$_; s/\r\n/\n/; printf("%s\t%s\t%s", $ARGV, $.,
$line) if s/[\x80-\x9f\xe0-\xff](.|\n)//g and not m/\n\Z/; close(ARGV) if eof' |
tee 36782-suspicious_files_on_comments.cvs 
awk '{print $1}' 36782-suspicious_files_on_comments.cvs  | sort -u

binfilter/bf_sch/source/core/sch_chtmode2.cxx
binfilter/bf_svx/source/engine3d/svx_scene3d.cxx
binfilter/bf_sw/source/filter/w4w/sw_w4wpar1.cxx
binfilter/bf_sw/source/ui/app/sw_docsh.cxx
hwpfilter/source/drawdef.h
hwpfilter/source/drawing.h
hwpfilter/source/hbox.cpp
hwpfilter/source/hbox.h
hwpfilter/source/hcode.cpp
hwpfilter/source/hinfo.cpp
hwpfilter/source/hinfo.h
hwpfilter/source/hpara.cpp
hwpfilter/source/hpara.h
hwpfilter/source/hwpeq.cpp
hwpfilter/source/hwpfile.h
hwpfilter/source/hwpread.cpp
hwpfilter/source/hwpreader.cxx
hwpfilter/source/ksc5601.h
sch/source/core/chtmode2.cxx
so3/source/inplace/ipenv.cxx
svtools/source/filter.vcl/filter/sgvmain.cxx
svtools/source/filter.vcl/filter/sgvspln.cxx
svtools/source/items/itemset.cxx
svx/source/engine3d/scene3d.cxx
svx/source/svdraw/svdobj.cxx
svx/source/svdraw/svdoedge.cxx
sw/source/filter/w4w/w4wpar1.cxx
sw/source/ui/app/docsh.cxx
sw/source/ui/table/tablepg.hxx
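
The per-line test done by the Perl one-liner at the top of this comment can be restated as follows. It is a sketch of the same idea, not the script that produced the attached list: remove every byte in 0x80-0x9f or 0xe0-0xff together with the byte after it, and report the line if its line feed disappeared in the process. The function name newline_swallowed is made up for illustration.

```python
import re

# A lead byte (ranges used by the one-liner) followed by any byte at all,
# including a line feed.
LEAD_PLUS_NEXT = re.compile(rb'[\x80-\x9f\xe0-\xff].', re.DOTALL)


def newline_swallowed(line):
    """True if pairing lead bytes with their successors consumes the line feed."""
    line = line.replace(b'\r\n', b'\n')           # normalize CR LF first
    remainder, count = LEAD_PLUS_NEXT.subn(b'', line)
    return count > 0 and not remainder.endswith(b'\n')
```

A line whose last byte before the line feed is 0xF6 is reported, whereas a line ending in E6 97 A5 is not, because 0xE6 pairs with 0x97 and the line feed survives.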


For hwpfilter/* , how about simply inserting #pragma setlocale("C") in the 
following files?

hwpfilter/source/fontmap.cpp
hwpfilter/source/mzstring.h
hwpfilter/source/precompile.h
Comment 29 tora3 2005-02-12 09:29:47 UTC
Created attachment 22501 [details]
suspicious files on comment lines
Comment 30 tora3 2005-02-12 09:39:23 UTC
Here is a script for finding out suspicious files for all other languages that 
use double-byte character set (DBCS).

cd OOo_1.9.m77_src
find * -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' -o -name
'*.cpp' | xargs perl -ne 's/\r\n/\n/; printf("%s\t%s\t%s", $ARGV, $., $_) if
m/[\x80-\xff]\Z/; close(ARGV) if eof' > z

awk '{print $1}' z | sort -u | wc -l
      58

awk '{print $1}' z | sort -u 

binfilter/bf_forms/source/misc/forms_services.cxx
binfilter/bf_sch/source/core/sch_chtmode2.cxx
binfilter/bf_sfx2/source/dialog/sfx2_filtergrouping.cxx
binfilter/bf_starmath/source/starmath_xchar.cxx
binfilter/bf_svx/source/engine3d/svx_scene3d.cxx
binfilter/bf_svx/source/svrtf/svx_rtfitem.cxx
binfilter/bf_sw/source/filter/w4w/sw_w4wpar1.cxx
binfilter/bf_sw/source/ui/app/sw_docsh.cxx
connectivity/source/commontools/FValue.cxx
connectivity/source/drivers/evoab/LColumnAlias.cxx
connectivity/source/drivers/evoab/LConfigAccess.cxx
connectivity/source/drivers/mozab/MConfigAccess.cxx
forms/source/misc/services.cxx

hwpfilter/*

sal/qa/osl/file/osl_File_Const.h
sch/source/core/chtmode2.cxx
sfx2/source/dialog/filtergrouping.cxx
so3/source/inplace/ipenv.cxx
starmath/source/xchar.cxx
svtools/source/filepicker/iodlg.cxx
svtools/source/filepicker/pickerhistory.cxx
svtools/source/filter.vcl/filter/sgvmain.cxx
svtools/source/filter.vcl/filter/sgvspln.cxx
svtools/source/filter.vcl/filter/sgvtext.cxx
svtools/source/items/itemset.cxx
svx/inc/svdio.hxx
svx/inc/svdmodel.hxx
svx/inc/svdomeas.hxx
svx/inc/svdtrans.hxx
svx/inc/sxmlhitm.hxx
svx/source/engine3d/scene3d.cxx
svx/source/form/formcontrolling.cxx
svx/source/svdraw/svdobj.cxx
svx/source/svdraw/svdoedge.cxx
svx/source/svrtf/rtfitem.cxx
sw/source/filter/w4w/w4wpar1.cxx
sw/source/ui/app/docsh.cxx
sw/source/ui/table/tablepg.hxx
toolkit/source/controls/dialogcontrol.cxx
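
The broader check used in this comment, which flags any line whose last byte is in 0x80-0xff regardless of code page, is simpler. A sketch, with a made-up function name:

```python
import re

# A high byte at the very end of the line, or just before the final line feed.
HIGH_BYTE_AT_EOL = re.compile(rb'[\x80-\xff]$')


def ends_with_high_byte(line):
    """True if the last byte of the line (ignoring the line ending) is 0x80-0xff."""
    return HIGH_BYTE_AT_EOL.search(line.replace(b'\r\n', b'\n')) is not None
```

This over-reports compared with the code-page-specific check, since not every high byte is a lead byte in every code page, but it does not depend on knowing which localized compiler will be used.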

Comment 31 tora3 2005-02-12 09:40:37 UTC
To fix this kind of problem, there are at least two options:
 1. inserting #pragma setlocale("C") in the suspicious files or header files.
 2. appending a space character at the end of the comment lines that end with 
    one of the 0x80-0xff characters.
Comment 32 daniel.rentz 2005-02-14 12:39:18 UTC
I will take care of the new files.

Adding spaces at the end of the line is a bad idea, some editors may strip them 
away when editing or saving the file.
Comment 33 daniel.rentz 2005-02-15 14:49:33 UTC
all sub tasks fixed
Comment 34 daniel.rentz 2005-04-04 14:08:14 UTC
all sub tasks verified
Comment 35 tora3 2005-04-04 16:07:01 UTC
Thanks. 
Some Japanese engineers are going to try to build the revised sources 
with Microsoft Visual Studio .NET Compiler Japanese version.
Comment 36 tora3 2005-04-05 03:38:22 UTC
A highly experienced builder, Yoshiyuki Masutomi <curvirgo@eos.ocn.ne.jp>, 
has already confirmed the fix with around SRC680_m86 or m87.

Thank you to all who tackled this issue. Now we can let this issue come to a close.
Comment 37 daniel.rentz 2005-05-20 13:19:35 UTC
closed