Apache OpenOffice (AOO) Bugzilla – Issue 36782
#pragma setlocale("C") is needed with localized MS VS .NET compilers
Last modified: 2007-05-01 16:02:14 UTC
Japanese Microsoft Visual Studio .NET compilers misinterpret some text containing bytes in the range 0x80 - 0xff and consequently fail to compile it. For instance, in the source file sc/source/filter/excel/xistyle.cxx, the line

    { 32, NF_NUMBER_STANDARD, "[$-0412]H\354\213\234 MM\353\266\204", LANGUAGE_KOREAN },

will be interpreted as

    { 32, NF_NUMBER_STANDARD, "[$-0412]H(A)(B)MM(C)(D), LANGUAGE_KOREAN },

and produces syntax errors reporting unbalanced double quotation marks. The second double quotation mark is mistakenly treated as part of a Japanese character that consists of two bytes, \204 and ".

To avoid the misinterpretation, the following statement could be added somewhere appropriate:

    #pragma setlocale("C")

Any idea?
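The misreading described above can be reproduced with a small model. The following Python sketch is only an illustration under stated assumptions: it assumes that the source file holds the raw UTF-8 bytes that the octal escapes in the report stand for, and that the localized compiler treats every byte in the code page 932 lead-byte ranges given later in this issue (0x80 - 0x9f and 0xe0 - 0xff) as the first half of a two-byte character, swallowing whatever byte follows. The names split_characters and count_quote_characters are hypothetical; they are not part of any compiler or of the OOo sources.

```python
# Assumed model of a DBCS-aware lexer: any lead byte swallows the next byte.
# The ranges are the code page 932 ranges as stated later in this issue.
CP932_LEAD_RANGES = [(0x80, 0x9F), (0xE0, 0xFF)]


def is_lead_byte(byte, ranges=CP932_LEAD_RANGES):
    """True if the byte falls into one of the assumed lead-byte ranges."""
    return any(lo <= byte <= hi for lo, hi in ranges)


def split_characters(data, ranges=CP932_LEAD_RANGES):
    """Split bytes into 1- or 2-byte 'characters' as the assumed model would."""
    chars = []
    i = 0
    while i < len(data):
        step = 2 if is_lead_byte(data[i], ranges) and i + 1 < len(data) else 1
        chars.append(data[i:i + step])
        i += step
    return chars


def count_quote_characters(data, ranges=CP932_LEAD_RANGES):
    """Count the double quotation marks that survive as stand-alone characters."""
    return sum(1 for ch in split_characters(data, ranges) if ch == b'"')


# The Korean format string as raw bytes. Octal escapes are used here only so
# that this example itself stays pure ASCII.
LITERAL = b'"[$-0412]H\354\213\234 MM\353\266\204"'

if __name__ == "__main__":
    for ch in split_characters(LITERAL):
        print(ch)
    print("stand-alone quotes:", count_quote_characters(LITERAL))
```

Under this model the six high bytes and the two bytes after them pair up into four two-byte characters, matching (A)(B) and (C)(D) in the report: the space after \234 and the closing quotation mark after \204 are both consumed as trailing bytes. Passing an empty list of ranges stands in for the "C" locale, where no byte is a lead byte.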
Created attachment 19011 [details] cd sal; patch -p0 < 36782.patch.txt
reassign for review.
Tora, Is this the only place where you encountered the problem? If so, I would suggest using UTF-7 instead of UTF-8 encoding for these strings, or putting the #pragma setlocale("C") just into this file. Who knows what nasty side effects it might have when put into sal/config.h; we'd have to compile the entire suite and try it out. Cc'ed Martin (for porting and release engineering) and assigned to Daniel, the code owner. Eike
Accepted. BTW: The file in question is nowadays: sc/source/filter/excel/xlstyle.cxx
I am going to provide you shortly with names of the files that might have a problem with the localized .NET Compilers. Tora
Fixed in SRC680/dr32 (OOo 2.0) by adding the #pragma locally to the cxx file.
    cvs co -r SRC680_m74 OpenOffice
    find . -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' | xargs perl -ne 'print "$ARGV\t$.\t$_" if m/"[^\x00-\x1f]*[\x80-\xff]"/; close(ARGV) if eof'

The results of the above command will be attached soon; they will show which files need to be taken care of.
Tora, Bear in mind that also high-bit characters within comments are caught that way. Something that doesn't really hinder compiling the source (I hope). And yes, 8-bit characters are to be avoided nevertheless. Eike
OK, here is a new version of the command, which ignores comment lines:

    find . -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' | xargs perl -ne 's{//.*}{}; print "$ARGV\t$.\t$_" if m/"[^\x00-\x1f]*[\x80-\xff]"/; close(ARGV) if eof' | tee 36782-list_of_files_UTF-8.cvs

And a list of the files:

    awk '{print $1}' 36782-list_of_files_UTF-8.cvs | sed -e 's/\.\///' | sort -u

    binfilter/bf_sc/source/filter/excel/sc_biffdump.cxx
    binfilter/bf_sc/source/filter/excel/sc_xistyle.cxx
    binfilter/bf_sw/source/filter/sw6/sw_sw6par.cxx
    binfilter/bf_sw/source/filter/w4w/sw_w4wgraf.cxx
    binfilter/bf_sw/source/filter/ww8/sw_ww8par5.cxx
    fpicker/source/win32/filepicker/workbench/Test_fps.cxx
    sal/qa/osl/module/osl_Module_Const.h
    sal/qa/rtl/ostring/rtl_string.cxx
    sal/qa/rtl/uri/rtl_Uri.cxx
    sc/source/filter/excel/biffdump.cxx
    svx/source/dialog/hangulhanjadlg.cxx
    sw/source/filter/sw6/sw6par.cxx
    sw/source/filter/w4w/w4wgraf.cxx
    sw/source/filter/ww8/ww8par5.cxx
    transex3/source/wtratree.cxx
    vcl/unx/source/app/keysymnames.cxx

For details, see 36782-list_of_files_UTF-8.cvs
Created attachment 22170 [details] a list of the files
Daniel, sc/source/filter/excel/xlstyle.cxx seems innocent now. We do not need to insert #pragma setlocale("C") in the file, even though it has lines like the ones below:

    // Special UTF-8 characters
    #define UTF8_EURO "\342\202\254"

The expression \254 is just a sequence of the ASCII characters \, 2, 5 and 4.
Just ignore sc/source/filter/biffdump.cxx, it is not compiled at all without special settings. Why should "\342\202\254" become a char sequence with backslash? This happens only if there is something like "\\342...". Does your script find escaped illegal characters ("\342") or only the "hardcoded" ones?
The script is aimed at finding hardcoded ones. For instance, the 'less' command displays 0x80-0xff characters in hexadecimal format, sandwiched between < and >.

    LANG=C less binfilter/bf_sc/source/filter/excel/sc_xistyle.cxx

    { 56, NF_NUMBER_STANDARD, "[$-0411]M<E6><9C><88>D<E6><97><A5>", LANGUAGE_JAPANESE },
    { 57, NF_NUMBER_STANDARD, "[$-030411]GE.M.D", LANGUAGE_JAPANESE },
    { 58, NF_NUMBER_STANDARD, "[$-030411]GGGE<E5><B9><B4>M<E6><9C><88>D<E6><97><A5>",LANGUAGE_JAPANESE }
    }

Those byte sequences, A5 22, could be interpreted as a Japanese 2-byte Kanji character by the .NET compiler. 22 is the double quotation mark.
DR->TORA: Just to make it clear: Does the localized compiler have problems with *escaped* non-ASCII characters, i.e. with the text "\342\202\254" in sc/source/filter/excel/xlstyle.cxx; or do we *only* talk about the illegal literal characters like in sc_xistyle.cxx in the binfilter module, i.e. "<E6><97><A5>"? This is not clear to me from your initial comment in this issue, where it seems that you talk about the *escaped* characters. Anyway, for now I reopen the issue, and I will create sub tasks for all the affected modules.
DR->TORA: It seems that your script does not look for single characters in apostrophes, e.g. /transex3/source/wtratree.cxx, line 108.
Tora->DR: The localized compiler does not have a problem with *escaped* non-ASCII characters. We only need to talk about the characters whose byte code is in the range from 0x80 to 0xff. Here is a sample file sample.c that can be compiled in the following way:

    perl guw.pl /cygdrive/c/PROGRA~1/MICROS~1.NET/Vc7/bin/cl.exe -I/cygdrive/c/PROGRA~1/MICROS~1.NET/Vc7/include -DSETLOCALE_JA sample.c

Adding the command line option -DSETLOCALE_JA may show you what happens: it should let your compiler act as a localized one by using

    #pragma setlocale("ja")
Created attachment 22285 [details] A sample code that shows a relation between setlocale and 0x80-0xff characters
Tora->DR: Thank you for pointing that out. As you mentioned, the script does not find single characters in apostrophes. A revised script that can detect such cases will be attached.

    find . -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' | xargs perl looking_for_0x80-0xff_characters_just_before_a_quotation_mark.pl

Here are additional files that need to be taken care of:

    binfilter/bf_sw/source/filter/excel/sw_exctools.cxx
    sfx2/source/control/macro.cxx
    svtools/source/control/reginfo.cxx
    sw/source/ui/utlui/attrdesc.cxx
    tools/source/fsys/dosmsc.cxx
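For readers without access to the attachment, the check that the script's name describes (a 0x80-0xff byte directly before a single or double quotation mark) can be sketched in Python as follows. This is not the attached Perl script, whose contents are not shown in this issue; it is a minimal approximation, and the function name suspicious_lines is hypothetical. Unlike the earlier one-liner, it does not strip // comments.

```python
import re

# A byte 0x80-0xff immediately followed by " (0x22) or ' (0x27).
HIGH_BYTE_BEFORE_QUOTE = re.compile(rb"[\x80-\xff][\x22\x27]")


def suspicious_lines(data):
    """Return 1-based numbers of lines where a high byte directly precedes a quote."""
    hits = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if HIGH_BYTE_BEFORE_QUOTE.search(line):
            hits.append(number)
    return hits


# Hypothetical sample: line 1 ends a string literal with a raw high byte,
# line 2 is plain ASCII, line 3 holds a raw high byte in a character literal.
SAMPLE = b'char a[] = "\xe6\x97\xa5";\nchar b[] = "ok";\nchar c = \'\xa5\';\n'

if __name__ == "__main__":
    print(suspicious_lines(SAMPLE))
```

Escaped sequences such as "\342\202\254" consist of ASCII characters only, so they are not reported, which matches the clarification above that only hardcoded bytes matter.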
Created attachment 22289 [details] Works for both single and double quotation marks.
sfx2/* -> issue 42367
transex3/* -> issue 42367
svx/* -> issue 42367
sc/* -> issue 42367
sw/* -> issue 42367
binfilter/* -> issue 42367
fpicker/* -> issue 42348
sal/qa/* -> issue 42350
vcl/unx/* : Unix-only code
svtools/source/control/reginfo.cxx : comments or unused code
tools/source/fsys/dosmsc.cxx : unused code
dependent issues
parent issue -> prio to P4
Great! I hope you all had a great time on Rosenmontag and Faschingsdienstag.
Yoshiyuki Masutomi, who first pointed out this problem, let me know about another file today.

    find . -name '*.cpp' | xargs perl looking_for_0x80-0xff_characters_just_before_a_quotation_mark.pl

    hwpfilter/source/fontmap.cpp
dependent issue added
Added "#pragma setlocale" to hwpfilter/source/fontmap.cpp.
All child tasks are fixed -> this task is fixed.
An additional problem has been found. It can happen with comment lines. The following comment lines, for example, cause a problem with the Japanese version of Microsoft Visual C++ .NET 2002 and 2003.

SRC680_m74/src/hwpfilter/source/hpara.h (172)
======================================================
// layout<C0><BB> <C0><A7><C7><D1> <C7><D4><BC><F6>
/**
 * Get line information of given line
 */
======================================================

The compilers interpret them in the following way:
======================================================
// layout<C0><BB> <C0><A7><C7><D1> <C7><D4><BC><F6>/**
 * Get line information of given line
 */
======================================================

As a result, the compilers complain:

    Syntax Error: ';' is needed before the identifier 'line'

This means that the compilers probably expect

    int * Get;

To understand such unbelievable behavior, see the encoding called codepage 932:
http://www.microsoft.com/globaldev/reference/dbcs/932.htm

A leading byte in the range 0x80 - 0x9f or 0xe0 - 0xff is followed by a trailing byte. In the example above, the control code 0x0a, line feed, is treated as a trailing byte, so the end of the // comment line is lost and the compiler misinterprets what follows.

Other Asian languages have their own ranges of leading bytes. According to the MSDN web page
http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/dnarvc/html/msdn_mbcssg.asp
=========================================================================
Lead Byte Ranges. Each code page may have different lead byte ranges.

    Japan   932   0x81-0x9F
    Korea   949   0xA1-0xFE
    China   936   0xA1-0xFE
    Taiwan  950   0xA1-0xFE, 0x8E-0xA0, 0x81-0x8D
=========================================================================
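The effect on the hpara.h comment can be illustrated with a short Python sketch. It is a model only, under the same assumption as the explanation above: every byte in 0x80 - 0x9f or 0xe0 - 0xff is taken as a lead byte that swallows the following byte, even a line feed. The function name lines_as_seen is hypothetical, and the real compiler's lexer is certainly more involved.

```python
# Assumed lead-byte ranges for code page 932, as stated in this comment.
CP932_LEAD_RANGES = [(0x80, 0x9F), (0xE0, 0xFF)]


def lines_as_seen(data, ranges=CP932_LEAD_RANGES):
    """Split data into lines, letting each assumed lead byte swallow the next byte."""
    lines = []
    current = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if any(lo <= byte <= hi for lo, hi in ranges) and i + 1 < len(data):
            current += data[i:i + 2]  # two-byte character; may contain 0x0a
            i += 2
        elif byte == 0x0A:
            lines.append(bytes(current))  # a real end of line
            current = bytearray()
            i += 1
        else:
            current.append(byte)
            i += 1
    if current:
        lines.append(bytes(current))
    return lines


# The four lines from hpara.h, with the high bytes written as escapes.
SAMPLE = (
    b"// layout\xc0\xbb \xc0\xa7\xc7\xd1 \xc7\xd4\xbc\xf6\n"
    b"/**\n"
    b" * Get line information of given line\n"
    b" */\n"
)

if __name__ == "__main__":
    for line in lines_as_seen(SAMPLE):
        print(line)
```

In this model only the final byte F6 of the comment is a lead byte. It absorbs the line feed, so the opening /** becomes part of the // comment and the line " * Get line information of given line" is left to be parsed as code. Passing an empty range list, which stands in for the "C" locale, keeps all four lines separate.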
Here is a script for finding suspicious files. Note that it works only for the Japanese version of Microsoft Visual C++ .NET 2002 and 2003.

    cd OOo_1.9.m77_src
    find * -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' -o -name '*.cpp' | xargs perl -ne '$line=$_; s/\r\n/\n/; printf("%s\t%s\t%s", $ARGV, $., $line) if s/[\x80-\x9f\xe0-\xff](.|\n)//g and not m/\n\Z/; close(ARGV) if eof' | tee 36782-suspicious_files_on_comments.cvs

    awk '{print $1}' 36782-suspicious_files_on_comments.cvs | sort -u

    binfilter/bf_sch/source/core/sch_chtmode2.cxx
    binfilter/bf_svx/source/engine3d/svx_scene3d.cxx
    binfilter/bf_sw/source/filter/w4w/sw_w4wpar1.cxx
    binfilter/bf_sw/source/ui/app/sw_docsh.cxx
    hwpfilter/source/drawdef.h
    hwpfilter/source/drawing.h
    hwpfilter/source/hbox.cpp
    hwpfilter/source/hbox.h
    hwpfilter/source/hcode.cpp
    hwpfilter/source/hinfo.cpp
    hwpfilter/source/hinfo.h
    hwpfilter/source/hpara.cpp
    hwpfilter/source/hpara.h
    hwpfilter/source/hwpeq.cpp
    hwpfilter/source/hwpfile.h
    hwpfilter/source/hwpread.cpp
    hwpfilter/source/hwpreader.cxx
    hwpfilter/source/ksc5601.h
    sch/source/core/chtmode2.cxx
    so3/source/inplace/ipenv.cxx
    svtools/source/filter.vcl/filter/sgvmain.cxx
    svtools/source/filter.vcl/filter/sgvspln.cxx
    svtools/source/items/itemset.cxx
    svx/source/engine3d/scene3d.cxx
    svx/source/svdraw/svdobj.cxx
    svx/source/svdraw/svdoedge.cxx
    sw/source/filter/w4w/w4wpar1.cxx
    sw/source/ui/app/docsh.cxx
    sw/source/ui/table/tablepg.hxx

For hwpfilter/*, how about simply inserting

    #pragma setlocale("C")

in the following files?

    hwpfilter/source/fontmap.cpp
    hwpfilter/source/mzstring.h
    hwpfilter/source/precompile.h
Created attachment 22501 [details] suspicious files on comment lines
Here is a script for finding suspicious files for all other languages that use a double-byte character set (DBCS).

    cd OOo_1.9.m77_src
    find * -name '*.cxx' -o -name '*.c' -o -name '*.hxx' -o -name '*.h' -o -name '*.cpp' | xargs perl -ne 's/\r\n/\n/; printf("%s\t%s\t%s", $ARGV, $., $_) if m/[\x80-\xff]\Z/; close(ARGV) if eof' > z

    awk '{print $1}' z | sort -u | wc -l
    58

    awk '{print $1}' z | sort -u

    binfilter/bf_forms/source/misc/forms_services.cxx
    binfilter/bf_sch/source/core/sch_chtmode2.cxx
    binfilter/bf_sfx2/source/dialog/sfx2_filtergrouping.cxx
    binfilter/bf_starmath/source/starmath_xchar.cxx
    binfilter/bf_svx/source/engine3d/svx_scene3d.cxx
    binfilter/bf_svx/source/svrtf/svx_rtfitem.cxx
    binfilter/bf_sw/source/filter/w4w/sw_w4wpar1.cxx
    binfilter/bf_sw/source/ui/app/sw_docsh.cxx
    connectivity/source/commontools/FValue.cxx
    connectivity/source/drivers/evoab/LColumnAlias.cxx
    connectivity/source/drivers/evoab/LConfigAccess.cxx
    connectivity/source/drivers/mozab/MConfigAccess.cxx
    forms/source/misc/services.cxx
    hwpfilter/*
    sal/qa/osl/file/osl_File_Const.h
    sch/source/core/chtmode2.cxx
    sfx2/source/dialog/filtergrouping.cxx
    so3/source/inplace/ipenv.cxx
    starmath/source/xchar.cxx
    svtools/source/filepicker/iodlg.cxx
    svtools/source/filepicker/pickerhistory.cxx
    svtools/source/filter.vcl/filter/sgvmain.cxx
    svtools/source/filter.vcl/filter/sgvspln.cxx
    svtools/source/filter.vcl/filter/sgvtext.cxx
    svtools/source/items/itemset.cxx
    svx/inc/svdio.hxx
    svx/inc/svdmodel.hxx
    svx/inc/svdomeas.hxx
    svx/inc/svdtrans.hxx
    svx/inc/sxmlhitm.hxx
    svx/source/engine3d/scene3d.cxx
    svx/source/form/formcontrolling.cxx
    svx/source/svdraw/svdobj.cxx
    svx/source/svdraw/svdoedge.cxx
    svx/source/svrtf/rtfitem.cxx
    sw/source/filter/w4w/w4wpar1.cxx
    sw/source/ui/app/docsh.cxx
    sw/source/ui/table/tablepg.hxx
    toolkit/source/controls/dialogcontrol.cxx
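For those who prefer not to use the Perl one-liner, the same language-independent check (a line whose last byte before the line ending is in 0x80-0xff) might look like this in Python. It is a rough equivalent, not a port that has been run against the OOo tree; the names lines_ending_with_high_byte and scan_tree are hypothetical, and the list of suffixes simply mirrors the find command above.

```python
from pathlib import Path

# File suffixes taken from the find command in this comment.
SOURCE_SUFFIXES = {".cxx", ".c", ".hxx", ".h", ".cpp"}


def lines_ending_with_high_byte(data):
    """Return 1-based numbers of lines whose last byte is in 0x80-0xff."""
    hits = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if line.endswith(b"\r"):
            line = line[:-1]  # treat CRLF like LF, as the Perl version does
        if line and line[-1] >= 0x80:
            hits.append(number)
    return hits


def scan_tree(root):
    """Map each suspicious source file under root to its suspicious line numbers."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            hits = lines_ending_with_high_byte(path.read_bytes())
            if hits:
                report[str(path)] = hits
    return report


if __name__ == "__main__":
    for name, numbers in scan_tree(".").items():
        print(name, numbers)
```

Because it flags every high byte at the end of a line, the check is deliberately broader than any single code page's lead-byte range and will also report lines that a given localized compiler would handle correctly.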
To fix this kind of problem, there are at least two options:
1. inserting #pragma setlocale("C") in the suspicious files or header files.
2. appending a space character at the end of the comment lines that end with one of the 0x80-0xff characters.
I will take care of the new files. Adding spaces at the end of the line is a bad idea, some editors may strip them away when editing or saving the file.
all sub tasks fixed
all sub tasks verified
Thanks. Some Japanese engineers are going to try to build the revised sources with Microsoft Visual Studio .NET Compiler Japanese version.
A highly experienced builder, Yoshiyuki Masutomi <curvirgo@eos.ocn.ne.jp>, has already confirmed the fix with around SRC680_m86 or m87. Thank you to all who tackled this issue. Now we can let this issue come to a close.
closed