Is this Regular Expression for UTF-8 Correct??


The solution is based on the GREEN portions of the first chart shown
on this link:
  http://www.w3.org/2005/03/23-lex-U

A semantically identical regular expression is also found at
the above link under "Validating lex Template":

1    ['\u0000'-'\u007F']
2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
3 | ( '\u00E0'           ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED'           ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0'           ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
9 | ( '\u00F4'           ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

Here is my version. The syntax is different, but the UTF8 
portion should be semantically identical.

UTF8_BYTE_ORDER_MARK   [\xEF][\xBB][\xBF]

ASCII     [\x0-\x7F]

U1          [a-zA-Z_]
U2          [\xC2-\xDF][\x80-\xBF]
U3          [\xE0][\xA0-\xBF][\x80-\xBF]
U4          [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5          [\xED][\x80-\x9F][\x80-\xBF]
U6          [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7          [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8          [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9          [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]

UTF8   {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}

// This identifies the "Letter" portion of an Identifier.
L          {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}

I guess that most of the analysis may simply boil down to 
whether or not the original source from the link is 
considered reliable. I had forgotten this original source 
when I first asked this question, which is why I am 
reposting the same question.
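
As a cross-check on the nine rows above, the same byte ranges can be tested directly in C++ without a regex engine. The following is only a sketch under assumptions: the names `utf8_sequence_length` and `is_valid_utf8` are hypothetical and do not come from the post or from the linked page, and the ranges are copied from rows 1 through 9 as written.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the post): returns the length in bytes (1-4)
// of the well-formed UTF-8 sequence starting at s[i], or 0 if the bytes
// there match none of the nine rows.
std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    // Fetch byte k as an unsigned value; 0x100 stands for "past the end"
    // and falls outside every range tested below.
    auto byte = [&s](std::size_t k) -> unsigned {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0x100u;
    };
    auto in_range = [](unsigned b, unsigned lo, unsigned hi) {
        return lo <= b && b <= hi;
    };
    // Generic continuation byte, 0x80-0xBF.
    auto tail = [&](std::size_t k) { return in_range(byte(k), 0x80, 0xBF); };

    const unsigned b0 = byte(i);
    const unsigned b1 = byte(i + 1);
    if (in_range(b0, 0x00, 0x7F)) return 1;                                          // row 1
    if (in_range(b0, 0xC2, 0xDF)) return tail(i + 1) ? 2 : 0;                        // row 2
    if (b0 == 0xE0) return (in_range(b1, 0xA0, 0xBF) && tail(i + 2)) ? 3 : 0;        // row 3
    if (in_range(b0, 0xE1, 0xEC)) return (tail(i + 1) && tail(i + 2)) ? 3 : 0;       // row 4
    if (b0 == 0xED) return (in_range(b1, 0x80, 0x9F) && tail(i + 2)) ? 3 : 0;        // row 5
    if (in_range(b0, 0xEE, 0xEF)) return (tail(i + 1) && tail(i + 2)) ? 3 : 0;       // row 6
    if (b0 == 0xF0)
        return (in_range(b1, 0x90, 0xBF) && tail(i + 2) && tail(i + 3)) ? 4 : 0;     // row 7
    if (in_range(b0, 0xF1, 0xF3))
        return (tail(i + 1) && tail(i + 2) && tail(i + 3)) ? 4 : 0;                  // row 8
    if (b0 == 0xF4)
        return (in_range(b1, 0x80, 0x8F) && tail(i + 2) && tail(i + 3)) ? 4 : 0;     // row 9
    return 0;
}

// True if the whole string is a concatenation of sequences from rows 1-9.
bool is_valid_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) return false;
        assert(n <= s.size() - i);  // a matched sequence never runs past the end
        i += n;
    }
    return true;
}
```

With these ranges, `is_valid_utf8("\xC0\x80")` (an overlong encoding) and `is_valid_utf8("\xED\xA0\x80")` (a UTF-16 surrogate) are both rejected, which is what the narrowed second-byte ranges in rows 3, 5, 7 and 9 are there for.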


Reply Peter 5/13/2010 8:06:36 PM

"Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d@giganews.com...
> Is this Regular Expression for UTF-8 Correct??
>
> The solution is based on the GREEN portions of the first chart shown
> on this link:
>  http://www.w3.org/2005/03/23-lex-U
>
> A semantically identical regular expression is also found on the above 
> link underValidating lex Template
>
> 1    ['\u0000'-'\u007F']
> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
> 3 | ( '\u00E0'           ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
> 5 | ( '\u00ED'           ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
> 7 | ( '\u00F0'           ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] 
> ['\u0080'-'\u00BF'])
> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] 
> ['\u0080'-'\u00BF'])
> 9 | ( '\u00F4'           ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] 
> ['\u0080'-'\u00BF'])
>
> Here is my version, the syntax is different, but the UTF8 portion should 
> be semantically identical.
>
> UTF8_BYTE_ORDER_MARK   [\xEF][\xBB][\xBF]
>
> ASCII     [\x0-\x7F]
>
> U1          [a-zA-Z_]
> U2          [\xC2-\xDF][\x80-\xBF]
> U3          [\xE0][\xA0-\xBF][\x80-\xBF]
> U4          [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
> U5          [\xED][\x80-\x9F][\x80-\xBF]
> U6          [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
> U7          [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
> U8          [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
> U9          [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>
> UTF8   {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>
> // This identifies the "Letter" portion of an Identifier.
> L          {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>
> I guess that most of the analysis may simply boil down to whether or not 
> the original source from the link is considered reliable. I had forgotten 
> this original source when I first asked this question, that is why I am 
> reposting this same question again.

What has this got to do with C++?  What is your C++ language question?

/Leigh 

Reply Leigh 5/13/2010 8:27:39 PM


"Leigh Johnston" <leigh@i42.co.uk> wrote in message 
news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d@giganews.com...
> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d@giganews.com...
>> Is this Regular Expression for UTF-8 Correct??
>>
>> The solution is based on the GREEN portions of the first 
>> chart shown
>> on this link:
>>  http://www.w3.org/2005/03/23-lex-U
>>
>> A semantically identical regular expression is also found 
>> on the above link underValidating lex Template
>>
>> 1    ['\u0000'-'\u007F']
>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>> 3 | ( '\u00E0'           ['\u00A0'-'\u00BF'] 
>> ['\u0080'-'\u00BF'])
>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] 
>> ['\u0080'-'\u00BF'])
>> 5 | ( '\u00ED'           ['\u0080'-'\u009F'] 
>> ['\u0080'-'\u00BF'])
>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] 
>> ['\u0080'-'\u00BF'])
>> 7 | ( '\u00F0'           ['\u0090'-'\u00BF'] 
>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] 
>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>> 9 | ( '\u00F4'           ['\u0080'-'\u008F'] 
>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>
>> Here is my version, the syntax is different, but the UTF8 
>> portion should be semantically identical.
>>
>> UTF8_BYTE_ORDER_MARK   [\xEF][\xBB][\xBF]
>>
>> ASCII     [\x0-\x7F]
>>
>> U1          [a-zA-Z_]
>> U2          [\xC2-\xDF][\x80-\xBF]
>> U3          [\xE0][\xA0-\xBF][\x80-\xBF]
>> U4          [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>> U5          [\xED][\x80-\x9F][\x80-\xBF]
>> U6          [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>> U7          [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>> U8          [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>> U9          [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>
>> UTF8   {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>
>> // This identifies the "Letter" portion of an Identifier.
>> L          {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>
>> I guess that most of the analysis may simply boil down to 
>> whether or not the original source from the link is 
>> considered reliable. I had forgotten this original source 
>> when I first asked this question, that is why I am 
>> reposting this same question again.
>
> What has this got to do with C++?  What is your C++ 
> language question?
>
> /Leigh

I will be implementing a utf8string to supplement 
std::string and will be using a regular expression to 
quickly divide up UTF-8 bytes into Unicode CodePoints.
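
For illustration, dividing UTF-8 bytes into code points can also be done with a small hand-written decoder. This is a sketch under assumptions, not the poster's planned utf8string: `decode_utf8` is a hypothetical name, and malformed input is reported by throwing `std::invalid_argument`, which is one possible policy among several.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into one char32_t per code point.
// Throws std::invalid_argument on malformed input.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte fixes the sequence length and supplies the top bits.
        if (b0 <= 0x7F)                    { len = 1; cp = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (s.size() - i < len) throw std::invalid_argument("truncated UTF-8 sequence");
        assert(i + len <= s.size());

        // Each continuation byte must look like 10xxxxxx and adds six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3Fu);
        }

        // Reject overlong forms, surrogates, and values above U+10FFFF.
        const bool overlong = (len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000);
        const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (overlong || surrogate || cp > 0x10FFFF)
            throw std::invalid_argument("ill-formed UTF-8 sequence");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The checks after the loop reject the same inputs as the narrowed second-byte ranges for lead bytes E0, ED, F0 and F4 in the regular expression; they are applied to the decoded value instead of the raw bytes.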

Since there are no UTF-8 groups, or even Unicode groups, I 
must post these questions to groups that are at most 
indirectly related to this subject matter. 


Reply Peter 5/13/2010 8:36:24 PM


"Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d@giganews.com...
>>
>> What has this got to do with C++?  What is your C++ language question?
>>
>> /Leigh
>
> I will be implementing a utf8string to supplement std::string and will be 
> using a regular expression to quickly divide up UTF-8 bytes into Unicode 
> CodePoints.
>
> Since there are no UTF-8 groups, or even Unicode groups I must post these 
> questions to groups that are at most indirectly related to this subject 
> matter.

Wrong: off-topic is off-topic.  If I chose to write a Tetris game in C++ it 
would be inappropriate to ask about the rules of Tetris in this newsgroup 
even if there was not a more appropriate newsgroup.

/Leigh 

Reply Leigh 5/13/2010 8:41:16 PM

"Leigh Johnston" <leigh@i42.co.uk> wrote in message 
news:v7CdnY8dPrNy_nHWnZ2dnUVZ8t-dnZ2d@giganews.com...
>
>
> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
> news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d@giganews.com...
>>>
>>> What has this got to do with C++?  What is your C++ 
>>> language question?
>>>
>>> /Leigh
>>
>> I will be implementing a utf8string to supplement 
>> std::string and will be using a regular expression to 
>> quickly divide up UTF-8 bytes into Unicode CodePoints.
>>
>> Since there are no UTF-8 groups, or even Unicode groups I 
>> must post these questions to groups that are at most 
>> indirectly related to this subject matter.
>
> Wrong: off-topic is off-topic.  If I chose to write a 
> Tetris game in C++ it would be inappropriate to ask about 
> the rules of Tetris in this newsgroup even if there was 
> not a more appropriate newsgroup.
>
> /Leigh

I think that posting to the next most relevant group(s) 
where a directly relevant group does not exist is right, and 
thus you are simply wrong. 


Reply Peter 5/13/2010 8:54:08 PM

"Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
news:L76dnYQBH-5s-3HWnZ2dnUVZ_sqdnZ2d@giganews.com...
>
> "Leigh Johnston" <leigh@i42.co.uk> wrote in message 
> news:v7CdnY8dPrNy_nHWnZ2dnUVZ8t-dnZ2d@giganews.com...
>>
>>
>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
>> news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d@giganews.com...
>>>>
>>>> What has this got to do with C++?  What is your C++ language question?
>>>>
>>>> /Leigh
>>>
>>> I will be implementing a utf8string to supplement std::string and will 
>>> be using a regular expression to quickly divide up UTF-8 bytes into 
>>> Unicode CodePoints.
>>>
>>> Since there are no UTF-8 groups, or even Unicode groups I must post 
>>> these questions to groups that are at most indirectly related to this 
>>> subject matter.
>>
>> Wrong: off-topic is off-topic.  If I chose to write a Tetris game in C++ 
>> it would be inappropriate to ask about the rules of Tetris in this 
>> newsgroup even if there was not a more appropriate newsgroup.
>>
>> /Leigh
>
> I think that posting to the next most relevant group(s) where a directly 
> relevant group does not exist is right, and thus you are simply wrong.

From this newsgroup's FAQ:

"Only post to comp.lang.c++ if your question is about the C++ language 
itself."

Thus I am simply correct?

/Leigh 

Reply Leigh 5/13/2010 9:11:40 PM

On 05/14/10 08:06 AM, Peter Olcott wrote:
> Is this Regular Expression for UTF-8 Correct??

It's a fair bet you are off-topic in all the groups you have cross 
posted to.  Why don't you pick a group for a language with built in UTF8 
and regexp support (PHP?) and badger them?

-- 
Ian Collins
Reply Ian 5/13/2010 9:33:29 PM

"Leigh Johnston" <leigh@i42.co.uk> wrote in message 
news:TNydnVD6sciN9nHWnZ2dnUVZ7oWdnZ2d@giganews.com...
> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
> news:L76dnYQBH-5s-3HWnZ2dnUVZ_sqdnZ2d@giganews.com...
>>
>> "Leigh Johnston" <leigh@i42.co.uk> wrote in message 
>> news:v7CdnY8dPrNy_nHWnZ2dnUVZ8t-dnZ2d@giganews.com...
>>>
>>>
>>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
>>> news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d@giganews.com...
>>>>>
>>>>> What has this got to do with C++?  What is your C++ 
>>>>> language question?
>>>>>
>>>>> /Leigh
>>>>
>>>> I will be implementing a utf8string to supplement 
>>>> std::string and will be using a regular expression to 
>>>> quickly divide up UTF-8 bytes into Unicode CodePoints.
>>>>
>>>> Since there are no UTF-8 groups, or even Unicode groups 
>>>> I must post these questions to groups that are at most 
>>>> indirectly related to this subject matter.
>>>
>>> Wrong: off-topic is off-topic.  If I chose to write a 
>>> Tetris game in C++ it would be inappropriate to ask 
>>> about the rules of Tetris in this newsgroup even if 
>>> there was not a more appropriate newsgroup.
>>>
>>> /Leigh
>>
>> I think that posting to the next most relevant group(s) 
>> where a directly relevant group does not exist is right, 
>> and thus you are simply wrong.
>
> From this newsgroup's FAQ:
>
> "Only post to comp.lang.c++ if your question is about the 
> C++ language itself."
>
> Thus I am simply correct?
>
> /Leigh

I can not accept that the "correct" answer is that some 
questions can not be asked. 


Reply Peter 5/13/2010 9:59:12 PM

"Ian Collins" <ian-news@hotmail.com> wrote in message 
news:8539h9F7f1U1@mid.individual.net...
> On 05/14/10 08:06 AM, Peter Olcott wrote:
>> Is this Regular Expression for UTF-8 Correct??
>
> It's a fair bet you are off-topic in all the groups you 
> have cross posted to.  Why don't you pick a group for a 
> language with built in UTF8 and regexp support (PHP?) and 
> badger them?
>
> -- 
> Ian Collins

What does this question have to do with the C++ language?

At least my question is indirectly related to C++ by making 
a utf8string for the C++ language from the regular 
expression.

Your question is not even indirectly related to the C++ 
language. 


Reply Peter 5/13/2010 10:04:52 PM

Peter Olcott wrote:

> 
> "Leigh Johnston" <leigh@i42.co.uk> wrote in message
> news:TNydnVD6sciN9nHWnZ2dnUVZ7oWdnZ2d@giganews.com...
>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message
[... on topicality of regular expressions ...]
>> From this newsgroup's FAQ:
>>
>> "Only post to comp.lang.c++ if your question is about the
>> C++ language itself."
>>
>> Thus I am simply correct?
>>
>> /Leigh
> 
> I can not accept that the "correct" answer is that some
> questions can not be asked.

a) Whether you can accept that answer or not, it could still be correct.

b) However, it isn't: the question can be asked. It should just be asked 
_elsewhere_. May I suggest comp.programming, where it seems to be on-topic.


Best

Kai-Uwe Bux

Reply Kai 5/13/2010 10:12:38 PM

See below...
On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>
>"Leigh Johnston" <leigh@i42.co.uk> wrote in message 
>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d@giganews.com...
>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d@giganews.com...
>>> Is this Regular Expression for UTF-8 Correct??
>>>
>>> The solution is based on the GREEN portions of the first 
>>> chart shown
>>> on this link:
>>>  http://www.w3.org/2005/03/23-lex-U
****
Note that in the "green" areas, we find

U0482 Cyrillic thousands sign
U055A Armenian apostrophe
U055C Armenian exclamation mark
U05C3 Hebrew punctuation SOF Pasuq
U060C Arabic comma
U066B Arabic decimal separator
U0700-U0709 Assorted Syriac punctuation marks
U0966-U096F Devanagari digits 0..9
U09E6-U09EF Bengali digits 0..9
U09F2-U09F3 Bengali rupee marks
U0A66-U0A6F Gurmukhi digits 0..9
U0AE6-U0AEF Gujarati digits 0..9
U0B66-U0B6F Oriya digits 0..9
U0BE6-U0BEF Tamil digits 0..9
U0BF0-U0BF2 Tamil indicators for 10, 100, 1000
U0BF3-U0BFA Tamil punctuation marks
U0C66-U0C6F Telugu digits 0..9
U0CE6-U0CEF Kannada digits 0..9
U0D66-U0D6F Malayalam digits 0..9
U0E50-U0E59 Thai digits 0..9
U0ED0-U0ED9 Lao digits 0..9
U0F20-U0F29 Tibetan digits 0..9
U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
U1040-U1049 Myanmar digits 0..9
U1360-U1368 Ethiopic punctuation marks
U1369-U137C Ethiopic numeric values (digits, tens of digits, etc.)
U17E0-U17E9 Khmer digits 0..9
U1800-U180E Mongolian punctuation marks
U1810-U1819 Mongolian digits 0..9
U1946-U194F Limbu digits 0..9
U19D0-U19D9 New Tai Lue digits 0..9
...at which point I realized I was wasting my time, because I was attempting to disprove
what is a Really Dumb Idea, which is to write applications that actually work on UTF-8
encoded text.

You are free to convert these to UTF-8, but in addition, if I've read some of the
encodings correctly, the non-green areas preclude what are clearly "letters" in other
languages.  

Forget UTF-8.  It is a transport mechanism used at input and output edges.  Use Unicode
internally.
****
>>>
>>> A semantically identical regular expression is also found 
>>> on the above link underValidating lex Template
>>>
>>> 1    ['\u0000'-'\u007F']
>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>> 3 | ( '\u00E0'           ['\u00A0'-'\u00BF'] 
>>> ['\u0080'-'\u00BF'])
>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] 
>>> ['\u0080'-'\u00BF'])
>>> 5 | ( '\u00ED'           ['\u0080'-'\u009F'] 
>>> ['\u0080'-'\u00BF'])
>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] 
>>> ['\u0080'-'\u00BF'])
>>> 7 | ( '\u00F0'           ['\u0090'-'\u00BF'] 
>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] 
>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>> 9 | ( '\u00F4'           ['\u0080'-'\u008F'] 
>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>
>>> Here is my version, the syntax is different, but the UTF8 
>>> portion should be semantically identical.
>>>
>>> UTF8_BYTE_ORDER_MARK   [\xEF][\xBB][\xBF]
>>>
>>> ASCII     [\x0-\x7F]
>>>
>>> U1          [a-zA-Z_]
>>> U2          [\xC2-\xDF][\x80-\xBF]
>>> U3          [\xE0][\xA0-\xBF][\x80-\xBF]
>>> U4          [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>> U5          [\xED][\x80-\x9F][\x80-\xBF]
>>> U6          [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>> U7          [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>> U8          [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>> U9          [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>
>>> UTF8   {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>
>>> // This identifies the "Letter" portion of an Identifier.
>>> L          {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>
>>> I guess that most of the analysis may simply boil down to 
>>> whether or not the original source from the link is 
>>> considered reliable. I had forgotten this original source 
>>> when I first asked this question, that is why I am 
>>> reposting this same question again.
>>
>> What has this got to do with C++?  What is your C++ 
>> language question?
>>
>> /Leigh
>
>I will be implementing a utf8string to supplement 
>std::string and will be using a regular expression to 
>quickly divide up UTF-8 bytes into Unicode CodePoints.
***
For someone who had an unholy fixation on "performance", why would you choose such a slow
mechanism for doing recognition?

I can imagine a lot of alternative approaches, including having a table of 65,536
"character masks" for Unicode characters, including on-the-fly updating of the table, and
extensions to support surrogates, which would outperform any regular expression based
approach.
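
A minimal sketch of the 65,536-entry "character mask" table idea, under stated assumptions: the class name `BmpCharTable`, the flag names, and the handful of ranges loaded in the constructor are all hypothetical and purely illustrative. A real table would be generated from the Unicode Character Database, and this version covers only the Basic Multilingual Plane, so the surrogate extension mentioned above is not attempted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative category flags; one bit per class so they can be combined.
enum CharMask : std::uint8_t { kLetter = 1, kDigit = 2, kPunct = 4 };

// Hypothetical lookup table with one mask byte per BMP code point.
class BmpCharTable {
public:
    BmpCharTable() : masks_(0x10000) {  // 65,536 entries, all zero
        // A few sample ranges only; not a complete classification.
        add_range(U'a', U'z', kLetter);
        add_range(U'A', U'Z', kLetter);
        add_range(U'0', U'9', kDigit);
        add_range(0x0966, 0x096F, kDigit);  // Devanagari digits
        add_range(0x060C, 0x060C, kPunct);  // Arabic comma
    }

    // "On-the-fly" update: OR a mask into every entry of [first, last].
    void add_range(char32_t first, char32_t last, std::uint8_t mask) {
        assert(first <= last);
        for (char32_t c = first; c <= last && c <= 0xFFFF; ++c) {
            masks_[c] |= mask;
        }
    }

    // Constant-time classification; anything above U+FFFF is reported as
    // having no flags because this sketch stores the BMP only.
    bool has(char32_t c, std::uint8_t mask) const {
        return c <= 0xFFFF && (masks_[c] & mask) != 0;
    }

private:
    std::vector<std::uint8_t> masks_;
};
```

Classification is then a single indexed load, for example `table.has(cp, kLetter)`, applied to code points that have already been decoded from UTF-8.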

What is your criterion for what constitutes a "letter"?  Frankly, I have no interest in
decoding something as bizarre as UTF-8 encodings to see if you covered the foreign
delimiters, numbers, punctuation marks, etc. properly, and it makes no sense to do so.  So
there is no way I would waste my time trying to understand an example that should not
exist at all.

Why do you seem to choose the worst possible choice when there is more than one way to do
something?  The choices are (a) work in 8-bit ANSI (b) work in UTF-8 (c) work in Unicode.
Of these, the worst possible choice is (b), followed by (a).  (c) is clearly the winner.

So why are you using something as bizarre as UTF-8 internally?  UTF-8 has ONE role, which
is to write Unicode out in an 8-bit encoding, and read Unicode in an 8-bit encoding.  You
do NOT want to write the program in terms of UTF-8!  
					joe
****
>
>Since there are no UTF-8 groups, or even Unicode groups I 
>must post these questions to groups that are at most 
>indirectly related to this subject matter. 
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Reply Joseph 5/13/2010 10:31:56 PM

Actually, what it does is give us another opportunity to point out how really bad this
design choice is, and thus Peter can tell us all we are fools for not answering a question
that should never have been asked, not because it is inappropriate for the group, but
because it represents the worst-possible-design decision that could be made.
				joe

On Thu, 13 May 2010 22:11:40 +0100, "Leigh Johnston" <leigh@i42.co.uk> wrote:

>"Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
>news:L76dnYQBH-5s-3HWnZ2dnUVZ_sqdnZ2d@giganews.com...
>>
>> "Leigh Johnston" <leigh@i42.co.uk> wrote in message 
>> news:v7CdnY8dPrNy_nHWnZ2dnUVZ8t-dnZ2d@giganews.com...
>>>
>>>
>>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
>>> news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d@giganews.com...
>>>>>
>>>>> What has this got to do with C++?  What is your C++ language question?
>>>>>
>>>>> /Leigh
>>>>
>>>> I will be implementing a utf8string to supplement std::string and will 
>>>> be using a regular expression to quickly divide up UTF-8 bytes into 
>>>> Unicode CodePoints.
>>>>
>>>> Since there are no UTF-8 groups, or even Unicode groups I must post 
>>>> these questions to groups that are at most indirectly related to this 
>>>> subject matter.
>>>
>>> Wrong: off-topic is off-topic.  If I chose to write a Tetris game in C++ 
>>> it would be inappropriate to ask about the rules of Tetris in this 
>>> newsgroup even if there was not a more appropriate newsgroup.
>>>
>>> /Leigh
>>
>> I think that posting to the next most relevant group(s) where a directly 
>> relevant group does not exist is right, and thus you are simply wrong.
>
>From this newsgroup's FAQ:
>
>"Only post to comp.lang.c++ if your question is about the C++ language 
>itself."
>
>Thus I am simply correct?
>
>/Leigh 
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Reply Joseph 5/13/2010 10:33:37 PM

On 5/13/2010 6:04 PM, Peter Olcott wrote:
> "Ian Collins"<ian-news@hotmail.com>  wrote in message
> news:8539h9F7f1U1@mid.individual.net...
>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>> Is this Regular Expression for UTF-8 Correct??
>>
>> It's a fair bet you are off-topic in all the groups you
>> have cross posted to.  Why don't you pick a group for a
>> language with built in UTF8 and regexp support (PHP?) and
>> badger them?
>>
>> --
>> Ian Collins
>
> What does this question have to do with the C++ language?

It does not have to have anything to do with C++.  A post on the 
topicality of another post is *always on topic*.

> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.

<sarcasm>
I am about to hold a party where I expect my colleagues to show up. 
They are all C++ programmers.  Would the question on what to feed them, 
or whether 1970s pop music is going to be appropriate, be on topic in 
comp.lang.c++?  It's *indirectly related* to C++, isn't it?
</sarcasm>

> Your question is not even indirectly related to the C++
> language.

See above.

V
-- 
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Reply Victor 5/13/2010 10:41:32 PM

Victor Bazarov writes:

> On 5/13/2010 6:04 PM, Peter Olcott wrote:
>> "Ian Collins"<ian-news@hotmail.com>  wrote in message
>> news:8539h9F7f1U1@mid.individual.net...
>>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>>> Is this Regular Expression for UTF-8 Correct??
>>>
>>> It's a fair bet you are off-topic in all the groups you
>>> have cross posted to.  Why don't you pick a group for a
>>> language with built in UTF8 and regexp support (PHP?) and
>>> badger them?
>>>
>>> --
>>> Ian Collins
>>
>> What does this question have to do with the C++ language?
> 
> It does not have to have anything to do with C++.  A post on the 
> topicality of another post is *always on topic*.
> 
>> At least my question is indirectly related to C++ by making
>> a utf8string for the C++ language from the regular
>> expression.
> 
> <sarcasm>
> I am about to hold a party where I expect my colleagues to show up. 
> They are all C++ programmers.  Would the question on what to feed them, 
> or whether 1970s pop music is going to be appropriate, be on topic in 
> comp.lang.c++?  It's *indirectly related* to C++, isn't it?
> </sarcasm>
> 
>> Your question is not even indirectly related to the C++
>> language.
> 
> See above.

This guy is a tool. He re-posted this question a second time because when he 
first posted that snippet nobody cared either. But after watching the 
struggle in the original thread, the ugly carnage appealed to the 
infinitesimally small humanitarian aspect of my psyche sufficiently enough 
to motivate myself into actually looking at the regexp monstrosity. But 
after I explained why that spaghetti of a regexp does not jibe with RFC 
2279, he got all huffy about it. He was confident that I was wrong, and that 
the regular expression was right. But I was able to explain my reasoning, by 
referring directly to the contents of RFC 2279, and he was unable to 
explain why he thought I was wrong, instead sprinkling more URLs to some 
apparently orphaned web pages that said something else.

Which raised an obvious question: if he was so sure that his regular 
expressions were correct, why was he asking? What exactly is the part of RFC 
2279 that he didn't understand?

It seems to be his personality trait: when he asks a question, he thinks he 
knows what the answer is, and every other answer is wrong. I can't figure 
out what the real reason for asking the question must be, but I think I 
really don't want to know the answer.

It remains to be seen how long it will take him to figure out that the 
difficulty he has in getting someone to answer this might be, just might be, 
due to the simple fact that this is one of these things that can be answered 
simply by RTFMing. Really, UTF-8 is not some patented trade secret. Its 
specifications are openly available, to anyone who wants to read them. And 
anyone who reads them should be able to figure out the correct regexp for 
themselves. It's not rocket science.

Amusingly, he's been trying to find the answer to this question longer than 
it took me, originally, to read RFC 2279, and implement encoding and 
decoding of Unicode using UTF-8. In C++. Well, in C actually, but it's still 
technically valid C++. Which, I guess, makes this on-topic, under the new 
rules that just came down, by fiat.
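
For reference, the encoding direction is short enough to sketch here. This is an illustration under assumptions, not the code described above: `encode_utf8` and `to_utf8` are hypothetical names. RFC 2279 has since been obsoleted by RFC 3629, which limits UTF-8 to four bytes and to code points up to U+10FFFF, and the sketch follows that limit, matching the four-byte ceiling of the regular expression under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: writes the UTF-8 encoding of cp into out (room for
// 4 bytes required) and returns the number of bytes written, or 0 if cp is
// a surrogate or lies above U+10FFFF.
std::size_t encode_utf8(char32_t cp, char* out) {
    assert(out != nullptr);
    if (cp <= 0x7F) {                       // 0xxxxxxx
        out[0] = static_cast<char>(cp);
        return 1;
    }
    if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
        out[0] = static_cast<char>(0xC0 | (cp >> 6));
        out[1] = static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates are not encodable
        out[0] = static_cast<char>(0xE0 | (cp >> 12));
        out[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<char>(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = static_cast<char>(0xF0 | (cp >> 18));
        out[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[3] = static_cast<char>(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

// Convenience wrapper returning the bytes as a std::string (empty on error).
std::string to_utf8(char32_t cp) {
    char buf[4];
    const std::size_t n = encode_utf8(cp, buf);
    return std::string(buf, n);
}
```

The first function uses only a raw buffer and integer arithmetic, so apart from the `assert` precondition and the `nullptr` keyword it would read much the same in C.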



Reply Sam 5/13/2010 11:12:48 PM

Ah, so in other words, an extremely verbose "I don't know".
Let me take a different approach. Can postings on www.w3.org 
generally be relied upon?

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi@4ax.com...
> See below...
> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott" 
> <NoSpam@OCR4Screen.com> wrote:
>
>>
>>"Leigh Johnston" <leigh@i42.co.uk> wrote in message
>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d@giganews.com...
>>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message
>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d@giganews.com...
>>>> Is this Regular Expression for UTF-8 Correct??
>>>>
>>>> The solution is based on the GREEN portions of the 
>>>> first
>>>> chart shown
>>>> on this link:
>>>>  http://www.w3.org/2005/03/23-lex-U
> ****
> Note that in the "green" areas, we find
>
> U0482 Cyrillic thousands sign
> U055A Armenian apostrophe
> U055C Armenian exclamation mark
> U05C3 Hebrew punctuation SOF Pasuq
> U060C Arabic comma
> U066B Arabic decimal separator
> U0700-U0709 Assorted Syriac punctuation  marks
> U0966-U096F Devanagari digits 0..9
> U09E6-U09EF Bengali digits 0..9
> U09F2-U09F3 Bengali rupee marks
> U0A66-U0A6F Gurmukhi digits 0..9
> U0AE6-U0AEF Gujarati digits 0..9
> U0B66-U0B6F Oriya digits 0..9
> U0BE6-U0BEF Tamil digits 0..9
> U0BF0-U0BF2  Tamil indicators for 10, 100, 1000
> U0BF3-U0BFA Tamil punctuation marks
> U0C66-U0C6F  Telugu digits 0..9
> U0CE6-U0CEF Kannada digits 0..9
> U0D66-U0D6F Malayam digits 0..9
> U0E50-U0E59 Thai digits 0..9
> U0ED0-U0ED9  Lao digits 0..9
> U0F20-U0F29 Tibetan digits 0..9
> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
> U1040-U1049 - Myanmar digits 0..9
> U1360-U1368 Ethiopic punctuation marks
> U1369-U137C Ethiopic numeric values (digits, tens of 
> digits, etc.)
> U17E0-U17E9 Khmer digits 0..9
> U1800-U180E Mongolian punctuation marks
> U1810-U1819 Mongolian digits 0..9
> U1946-U194F Limbu digits 0..9
> U19D0-U19D9  New Tai Lue digits 0..9
> ...at which point I realized I was wasting my time, 
> because I was attempting to disprovde
> what is a Really Dumb Idea, which is to write applications 
> that actually work on UTF-8
> encoded text.
>
> You are free to convert these to UTF-8, but in addition, 
> if I've read some of the
> encodings correctly, the non-green areas preclude what are 
> clearly "letters" in other
> languages.
>
> Forget UTF-8.  It is a transport mechanism used at input 
> and output edges.  Use Unicode
> internally.
> ****
>>>>
>>>> A semantically identical regular expression is also 
>>>> found
>>>> on the above link underValidating lex Template
>>>>
>>>> 1    ['\u0000'-'\u007F']
>>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>>> 3 | ( '\u00E0'           ['\u00A0'-'\u00BF']
>>>> ['\u0080'-'\u00BF'])
>>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF']
>>>> ['\u0080'-'\u00BF'])
>>>> 5 | ( '\u00ED'           ['\u0080'-'\u009F']
>>>> ['\u0080'-'\u00BF'])
>>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF']
>>>> ['\u0080'-'\u00BF'])
>>>> 7 | ( '\u00F0'           ['\u0090'-'\u00BF']
>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF']
>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>> 9 | ( '\u00F4'           ['\u0080'-'\u008F']
>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>
>>>> Here is my version, the syntax is different, but the 
>>>> UTF8
>>>> portion should be semantically identical.
>>>>
>>>> UTF8_BYTE_ORDER_MARK   [\xEF][\xBB][\xBF]
>>>>
>>>> ASCII     [\x0-\x7F]
>>>>
>>>> U1          [a-zA-Z_]
>>>> U2          [\xC2-\xDF][\x80-\xBF]
>>>> U3          [\xE0][\xA0-\xBF][\x80-\xBF]
>>>> U4          [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>>> U5          [\xED][\x80-\x9F][\x80-\xBF]
>>>> U6          [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>>> U7          [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>>> U8          [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>>> U9          [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>>
>>>> UTF8   {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>
>>>> // This identifies the "Letter" portion of an 
>>>> Identifier.
>>>> L          {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>
>>>> I guess that most of the analysis may simply boil down 
>>>> to
>>>> whether or not the original source from the link is
>>>> considered reliable. I had forgotten this original 
>>>> source
>>>> when I first asked this question, that is why I am
>>>> reposting this same question again.
>>>
>>> What has this got to do with C++?  What is your C++
>>> language question?
>>>
>>> /Leigh
>>
>>I will be implementing a utf8string to supplement
>>std::string and will be using a regular expression to
>>quickly divide up UTF-8 bytes into Unicode CodePoints.
> ***
> For someone who had an unholy fixation on "performance", 
> why would you choose such a slow
> mechanism for doing recognition?
>
> I can imagine a lot of alternative approaches, including 
> having a table of 65,536
> "character masks" for Unicode characters, including 
> on-the-fly updating of the table, and
> extensions to support surrogates, which would outperform 
> any regular expression based
> approach.
>
> What is your crtiterion for what constitutes a "letter"? 
> Frankly, I have no interest in
> decoding something as bizarre as UTF-8 encodings to see if 
> you covered the foreign
> delimiters, numbers, punctuation marks, etc. properly, and 
> it makes no sense to do so.  So
> there is no way I would waste my time trying to understand 
> an example that should not
> exist at all.
>
> Why do you seem to choose the worst possible choice when 
> there is more than one way to do
> something?  The choices are (a) work in 8-bit ANSI (b) 
> work in UTF-8 (c) work in Unicode.
> Of these, the worst possible choice is (b), followed by 
> (a).  (c) is clearly the winner.
>
> So why are you using something as bizarre as UTF-8 
> internally?  UTF-8 has ONE role, which
> is to write Unicode out in an 8-bit encoding, and read 
> Unicode in an 8-bit encoding.  You
> do NOT want to write the program in terms of UTF-8!
> joe
> ****
>>
>>Since there are no UTF-8 groups, or even Unicode groups I
>>must post these questions to groups that are at most
>>indirectly related to this subject matter.
>>
> Joseph M. Newcomer [MVP]
> email: newcomer@flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm 


0
Reply Peter 5/13/2010 11:14:47 PM

On 5/13/2010 6:12 PM, Sam wrote:
> Victor Bazarov writes:
>
>> On 5/13/2010 6:04 PM, Peter Olcott wrote:
>>> "Ian Collins"<ian-news@hotmail.com> wrote in message
>>> news:8539h9F7f1U1@mid.individual.net...
>>>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>
>>>> It's a fair bet you are off-topic in all the groups you
>>>> have cross posted to. Why don't you pick a group for a
>>>> language with built in UTF8 and regexp support (PHP?) and
>>>> badger them?
>>>>
>>>> --
>>>> Ian Collins
>>>
>>> What does this question have to do with the C++ language?
>>
>> It does not have to have anything to do with C++. A post on the
>> topicality of another post is *always on topic*.
>>
>>> At least my question is indirectly related to C++ by making
>>> a utf8string for the C++ language from the regular
>>> expression.
>>
>> <sarcasm>
>> I am about to hold a party where I expect my colleagues to show up.
>> They are all C++ programmers. Would the question on what to feed them,
>> or whether 1970s pop music is going to be appropriate, be on topic in
>> comp.lang.c++? It's *indirectly related* to C++, isn't it?
>> </sarcasm>
>>
>>> Your question is not even indirectly related to the C++
>>> language.
>>
>> See above.
>
> This guy is a tool. He re-posted this question a second time because
> when he first posted that snippet nobody cared either. But after
> watching the struggle in the original thread, the ugly carnage appealed
> to the infinitesimally small humanitarian aspect of my psyche
> sufficiently enough to motivate myself into actually looking at the
> regexp monstrosity. But after I explained why that spaghetti of a regexp
> does not jive with RFC 2279, he got all huffy about it. He was confident
> that I was wrong, and that the regular expression was right. But I was
> able to explain my reasoning, by referencing directly to the contents of
> RFC 2279, and he was unable to explain why he thought I was wrong,
> instead sprinkling more URLs to some apparently orphaned web pages that
> said something else.
>
> Which raised an obvious question: if he was so sure that his regular
> expressions were correct, why was he asking? What exactly is the part of
> RFC 2279 that he didn't understand?
>
> It seems to be his personality trait: when he asks a question, he thinks
> he knows what the answer is, and every other answer is wrong. I can't
> figure out what the real reason for asking the question must be, but I
> think I really don't want to know the answer.
>
> It remains to be seen how long it will take him to figure out that the
> difficulty he has in getting someone answer this might be, just might
> be, due to the simple fact that this is one of these things that can be
> answered simply by RTFMing. Really, UTF-8 is not some patented trade
> secret. Its specifications are openly available, to anyone who wants to
> read them. And anyone who reads them should be able to figure out the
> correct regexp for themselves. It's not rocket science.
>
> Amusingly, he's been trying to find the answer to this question longer
> than it took myself, originally, to read RFC 2279, and implement
> encoding and decoding of Unicode using UTF-8. In C++. Well, in C
> actually, but it's still technically valid C++. Which, I guess, makes
> this on-topic, under the new rules that just came down, by fiat.
>
>

This time I found the original source of a semantically identical 
regular expression that you berated so rudely.
   http://www.w3.org/2005/03/23-lex-U

Who knows, maybe www.w3.org is wrong and you are right?



0
Reply Peter 5/13/2010 11:22:01 PM

Peter Olcott writes:

> This time I found the original source of a semantically identical 
> regular expression that you berated so rudely.
>    http://www.w3.org/2005/03/23-lex-U
> 
> Who knows, maybe www.w3.org is wrong and you are right?

And as I wrote in the first thread, I suspected that the regular expression 
mish-mash's actual purpose was to validate some defined subset of the 
entire Unicode range, as encoded in UTF-8.

See message <cone.1273539693.340713.2085.500@commodore.email-scan.com>, 
where I wrote:

> I think what that regexp really does is match a subset of all valid
> UTF-8 sequences that corresponds with a subset of Unicodes that the
> author was interested in. It doesn't match all valid UTF-8 sequences,
> which the non-regexp version does.

And reading the "www.w3.org" link, it's clear that's exactly what it does, 
and what the criteria are. Still, you replied as follows, in 
<cvydnfcDPJR5MHXWnZ2dnUVZ_vOdnZ2d@giganews.com>:

> I think that your understanding might be less than complete. If you read 
> the commentary you will see that your view is not supported.

Obviously, it's your thoughts that turned out to be "less than complete". That 
regular expression does not validate whether an arbitrary octet stream is a 
UTF-8-encoded Unicode value sequence. That regular expression checks whether 
an arbitrary octet stream is a UTF-8-encoded Unicode value sequence 
and whether all Unicode values belong to a specific, defined subset of the 
entire Unicode value range.


0
Reply Sam 5/13/2010 11:40:06 PM

On 5/13/2010 6:40 PM, Sam wrote:
>> This time I found the original source of a semantically identical
>> regular expression that you berated so rudely.
>> http://www.w3.org/2005/03/23-lex-U
>>
>> Who knows, maybe www.w3.org is wrong and you are right?
>
> And as I wrote in the first thread, I suspected that the regular
> expression mish-mash's actual purpose was to validate some defined
> subset of the entire Unicode range, as encoded in UTF-8.

And this view is clearly incorrect. It validates the entire set of 
UTF-8 encodings. Here is a quote:

    "This pattern does not restrict to the set of
    defined UCS characters, instead to the set that
    is permitted by UTF-8 encoding."

The difference is the missing D800-DFFF High and Low surrogates that are 
not legal in UTF-8. All of the other CodePoints from 0-10FFFF are 
represented.
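
The claim above can be checked mechanically. What follows is only a minimal sketch, not the W3C page's code: the nine rows of the pattern are transcribed by hand into C++ (the function names utf8_sequence_length, is_valid_utf8 and encode_raw are made up for this illustration), and encode_raw deliberately applies the plain UTF-8 bit layout to any value, surrogates included, so that the rows can be probed with it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Length (1-4) of the well-formed UTF-8 sequence starting at s[i],
// or 0 if none of the nine rows of the pattern matches there.
std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    // Reading past the end yields 0x100, which is outside every byte range.
    auto byte = [&](std::size_t k) -> unsigned {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0x100u;
    };
    auto in = [](unsigned b, unsigned lo, unsigned hi) { return b >= lo && b <= hi; };
    auto tail = [&](std::size_t k) { return in(byte(k), 0x80, 0xBF); };

    const unsigned b0 = byte(i);
    if (in(b0, 0x00, 0x7F)) return 1;                                                       // row 1
    if (in(b0, 0xC2, 0xDF)) return tail(i + 1) ? 2 : 0;                                     // row 2
    if (b0 == 0xE0) return in(byte(i + 1), 0xA0, 0xBF) && tail(i + 2) ? 3 : 0;              // row 3
    if (in(b0, 0xE1, 0xEC) || in(b0, 0xEE, 0xEF)) return tail(i + 1) && tail(i + 2) ? 3 : 0; // rows 4, 6
    if (b0 == 0xED) return in(byte(i + 1), 0x80, 0x9F) && tail(i + 2) ? 3 : 0;              // row 5
    if (b0 == 0xF0) return in(byte(i + 1), 0x90, 0xBF) && tail(i + 2) && tail(i + 3) ? 4 : 0; // row 7
    if (in(b0, 0xF1, 0xF3)) return tail(i + 1) && tail(i + 2) && tail(i + 3) ? 4 : 0;       // row 8
    if (b0 == 0xF4) return in(byte(i + 1), 0x80, 0x8F) && tail(i + 2) && tail(i + 3) ? 4 : 0; // row 9
    return 0;
}

// True if the whole string is a concatenation of sequences matched by the rows.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}

// Applies the UTF-8 bit layout to any value up to 0x10FFFF WITHOUT rejecting
// surrogates; intended only for probing the validator above.
std::string encode_raw(std::uint32_t cp) {
    std::string s;
    if (cp < 0x80) {
        s += static_cast<char>(cp);
    } else if (cp < 0x800) {
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return s;
}
```

Looping cp from 0 to 0x10FFFF and comparing is_valid_utf8(encode_raw(cp)) against "cp is not in D800-DFFF" is one way to test the statement for every code point rather than arguing about it.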
0
Reply Peter 5/13/2010 11:58:23 PM

Peter Olcott writes:

> On 5/13/2010 6:40 PM, Sam wrote:
>>> This time I found the original source of a semantically identical
>>> regular expression that you berated so rudely.
>>> http://www.w3.org/2005/03/23-lex-U
>>>
>>> Who knows, maybe www.w3.org is wrong and you are right?
>>
>> And as I wrote in the first thread, I suspected that the regular
>> expression mish-mash's actual purpose was to validate some defined
>> subset of the entire Unicode range, as encoded in UTF-8.
> 
> And this view is clearly incorrect. It validates the entire set of 
> UTF-8 encodings. Here is a quote:
> 
>     "This pattern does not restrict to the set of
>     defined UCS characters, instead to the set that
>     is permitted by UTF-8 encoding."
> 
> The difference is the missing D800-DFFF High and Low surrogates that are 
> not legal in UTF-8. All of the other CodePoints from 0-10FFFF are 
> represented.

Since you claim to know so much about UTF-8 encoding and decoding -- even 
more than RFC 2279 -- it's a wonder you had to ask your question at all. It 
seems that you already knew the answer to the question.

Good luck UTF-8 encoding and decoding.


0
Reply Sam 5/14/2010 1:01:36 AM

On 5/13/2010 8:01 PM, Sam wrote:
> Peter Olcott writes:
>
>> On 5/13/2010 6:40 PM, Sam wrote:
>>>> This time I found the original source of a semantically identical
>>>> regular expression that you berated so rudely.
>>>> http://www.w3.org/2005/03/23-lex-U
>>>>
>>>> Who knows, maybe www.w3.org is wrong and you are right?
>>>
>>> And as I wrote in the first thread, I suspected that the regular
>>> expression mish-mash's actual purpose was to validate some defined
>>> subset of the entire Unicode range, as encoded in UTF-8.
>>
>> And this view is clearly incorrect. It validates the entire set of
>> UTF-8 encodings. Here is a quote:
>>
>> "This pattern does not restrict to the set of
>> defined UCS characters, instead to the set that
>> is permitted by UTF-8 encoding."
>>
>> The difference is the missing D800-DFFF High and Low surrogates that
>> are not legal in UTF-8. All of the other CodePoints from 0-10FFFF are
>> represented.
>
> Since you claim to know so much about UTF-8 encoding and decoding --
> even more than RFC 2279 -- it's a wonder you had to ask your question at
> all. It seems that you already knew the answer to the question.

   http://tools.ietf.org/html/rfc3629
   This memo obsoletes and replaces RFC 2279.

>
> Good luck UTF-8 encoding and decoding.
>

Thanks.

0
Reply Peter 5/14/2010 1:24:23 AM

Yes, you have to accept that there are questions that should not be asked.

There are two right now: (a) how to write a parser that works on UTF-8 input (b) how to
disable mouse clicks during a lengthy computation.  The correct answers to both questions
are "don't even try to do it that way!  Redesign it so these are no longer problems".
Alternatively, think of it as "do not ask questions of how to solve problems which are the
direct result of incorrect design choices; change the design so the problem no longer
exists, then the question does not need to be asked"

Using UTF-8 is a particularly poor design choice.  Doing long computations in the main GUI
thread is a particularly poor design choice.  The questions would not arise if the poor
design choices had not been made.  This is reality.  The questions should not be asked,
because they indicate that poor choices have been made which make the questions necessary.

Answering the questions by giving an answer that solves what the questioner asked is not a
service to the person asking the question; what it allows is that a bad design decision is
allowed to stand, which will in turn lead to more problems, which produce more questions.
By doing the redesign and getting rid of the bad decisions, the problem goes away and
cannot return in the foreseeable future.  The problem of using a MBCS like UTF-8 is not
limited to something like an r.e.; the problems of handling the character set are
pervasive and very complex, and will continue to plague the implementation.  The problems
of doing long computations in the main GUI thread merely introduce more and more failure
modes that will have to be kludged around, so the correct answer is "redesign it".  Poor
design does not go away by making one patch.  The patches eventually grow scar tissue, as
one kludge leads to another which leads to another, and so on.

So get rid of the UTF-8 internally, write it for Unicode, and then the problem goes away.
					joe
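
One way to read "get rid of the UTF-8 internally" is to decode once at the input edge and keep code points in a std::u32string from then on. The following is only a sketch under that assumption; utf8_to_utf32 is a name invented here, and the checks (overlong forms, surrogates, values above 10FFFF) follow the RFC 3629 restrictions rather than any particular library.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into one char32_t per code point.
// Returns false on malformed input; 'out' then holds the code points decoded so far.
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min_cp = 0;  // smallest value that legitimately needs 'len' bytes
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;                       // stray continuation or invalid lead byte

        if (in.size() - i < len) return false;   // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogates are not encodable
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

After this step, indexing, length and per-character classification all work on whole code points, which is the property being argued for here.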

On Thu, 13 May 2010 16:59:12 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>
>"Leigh Johnston" <leigh@i42.co.uk> wrote in message 
>news:TNydnVD6sciN9nHWnZ2dnUVZ7oWdnZ2d@giganews.com...
>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
>> news:L76dnYQBH-5s-3HWnZ2dnUVZ_sqdnZ2d@giganews.com...
>>>
>>> "Leigh Johnston" <leigh@i42.co.uk> wrote in message 
>>> news:v7CdnY8dPrNy_nHWnZ2dnUVZ8t-dnZ2d@giganews.com...
>>>>
>>>>
>>>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
>>>> news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d@giganews.com...
>>>>>>
>>>>>> What has this got to do with C++?  What is your C++ 
>>>>>> language question?
>>>>>>
>>>>>> /Leigh
>>>>>
>>>>> I will be implementing a utf8string to supplement 
>>>>> std::string and will be using a regular expression to 
>>>>> quickly divide up UTF-8 bytes into Unicode CodePoints.
>>>>>
>>>>> Since there are no UTF-8 groups, or even Unicode groups 
>>>>> I must post these questions to groups that are at most 
>>>>> indirectly related to this subject matter.
>>>>
>>>> Wrong: off-topic is off-topic.  If I chose to write a 
>>>> Tetris game in C++ it would be inappropriate to ask 
>>>> about the rules of Tetris in this newsgroup even if 
>>>> there was not a more appropriate newsgroup.
>>>>
>>>> /Leigh
>>>
>>> I think that posting to the next most relevant group(s) 
>>> where a directly relevant group does not exist is right, 
>>> and thus you are simply wrong.
>>
>> From this newsgroup's FAQ:
>>
>> "Only post to comp.lang.c++ if your question is about the 
>> C++ language itself."
>>
>> Thus I am simply correct?
>>
>> /Leigh
>
>I can not accept that the "correct" answer is that some 
>questions can not be asked. 
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
0
Reply Joseph 5/14/2010 3:17:38 AM

No, an extremely verbose "You are going about this completely wrong".
				joe

On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>Ah so in other words an extremely verbose, "I don't know".
>Let me take a different approach. Can postings on www.w3.org 
>generally be relied upon?
>
>"Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
>message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi@4ax.com...
>> See below...
>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott" 
>> <NoSpam@OCR4Screen.com> wrote:
>>
>>>
>>>"Leigh Johnston" <leigh@i42.co.uk> wrote in message
>>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d@giganews.com...
>>>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message
>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d@giganews.com...
>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>
>>>>> The solution is based on the GREEN portions of the 
>>>>> first
>>>>> chart shown
>>>>> on this link:
>>>>>  http://www.w3.org/2005/03/23-lex-U
>> ****
>> Note that in the "green" areas, we find
>>
>> U0482 Cyrillic thousands sign
>> U055A Armenian apostrophe
>> U055C Armenian exclamation mark
>> U05C3 Hebrew punctuation SOF Pasuq
>> U060C Arabic comma
>> U066B Arabic decimal separator
>> U0700-U0709 Assorted Syriac punctuation  marks
>> U0966-U096F Devanagari digits 0..9
>> U09E6-U09EF Bengali digits 0..9
>> U09F2-U09F3 Bengali rupee marks
>> U0A66-U0A6F Gurmukhi digits 0..9
>> U0AE6-U0AEF Gujarati digits 0..9
>> U0B66-U0B6F Oriya digits 0..9
>> U0BE6-U0BEF Tamil digits 0..9
>> U0BF0-U0BF2  Tamil indicators for 10, 100, 1000
>> U0BF3-U0BFA Tamil punctuation marks
>> U0C66-U0C6F  Telugu digits 0..9
>> U0CE6-U0CEF Kannada digits 0..9
>> U0D66-U0D6F Malayalam digits 0..9
>> U0E50-U0E59 Thai digits 0..9
>> U0ED0-U0ED9  Lao digits 0..9
>> U0F20-U0F29 Tibetan digits 0..9
>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>> U1040-U1049 - Myanmar digits 0..9
>> U1360-U1368 Ethiopic punctuation marks
>> U1369-U137C Ethiopic numeric values (digits, tens of 
>> digits, etc.)
>> U17E0-U17E9 Khmer digits 0..9
>> U1800-U180E Mongolian punctuation marks
>> U1810-U1819 Mongolian digits 0..9
>> U1946-U194F Limbu digits 0..9
>> U19D0-U19D9  New Tai Lue digits 0..9
>> ...at which point I realized I was wasting my time, 
>> because I was attempting to disprove
>> what is a Really Dumb Idea, which is to write applications 
>> that actually work on UTF-8
>> encoded text.
>>
>> You are free to convert these to UTF-8, but in addition, 
>> if I've read some of the
>> encodings correctly, the non-green areas preclude what are 
>> clearly "letters" in other
>> languages.
>>
>> Forget UTF-8.  It is a transport mechanism used at input 
>> and output edges.  Use Unicode
>> internally.
>> ****
>>>>>
>>>>> A semantically identical regular expression is also 
>>>>> found
>>>>> on the above link under Validating lex Template
>>>>>
>>>>> 1    ['\u0000'-'\u007F']
>>>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>>>> 3 | ( '\u00E0'           ['\u00A0'-'\u00BF']
>>>>> ['\u0080'-'\u00BF'])
>>>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF']
>>>>> ['\u0080'-'\u00BF'])
>>>>> 5 | ( '\u00ED'           ['\u0080'-'\u009F']
>>>>> ['\u0080'-'\u00BF'])
>>>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF']
>>>>> ['\u0080'-'\u00BF'])
>>>>> 7 | ( '\u00F0'           ['\u0090'-'\u00BF']
>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF']
>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>> 9 | ( '\u00F4'           ['\u0080'-'\u008F']
>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>
>>>>> Here is my version, the syntax is different, but the 
>>>>> UTF8
>>>>> portion should be semantically identical.
>>>>>
>>>>> UTF8_BYTE_ORDER_MARK   [\xEF][\xBB][\xBF]
>>>>>
>>>>> ASCII     [\x0-\x7F]
>>>>>
>>>>> U1          [a-zA-Z_]
>>>>> U2          [\xC2-\xDF][\x80-\xBF]
>>>>> U3          [\xE0][\xA0-\xBF][\x80-\xBF]
>>>>> U4          [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>>>> U5          [\xED][\x80-\x9F][\x80-\xBF]
>>>>> U6          [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>>>> U7          [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>>>> U8 
>>>>> [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>>>> U9          [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>>>
>>>>> UTF8   {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>
>>>>> // This identifies the "Letter" portion of an 
>>>>> Identifier.
>>>>> L          {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>
>>>>> I guess that most of the analysis may simply boil down 
>>>>> to
>>>>> whether or not the original source from the link is
>>>>> considered reliable. I had forgotten this original 
>>>>> source
>>>>> when I first asked this question, that is why I am
>>>>> reposting this same question again.
>>>>
>>>> What has this got to do with C++?  What is your C++
>>>> language question?
>>>>
>>>> /Leigh
>>>
>>>I will be implementing a utf8string to supplement
>>>std::string and will be using a regular expression to
>>>quickly divide up UTF-8 bytes into Unicode CodePoints.
>> ***
>> For someone who had an unholy fixation on "performance", 
>> why would you choose such a slow
>> mechanism for doing recognition?
>>
>> I can imagine a lot of alternative approaches, including 
>> having a table of 65,536
>> "character masks" for Unicode characters, including 
>> on-the-fly updating of the table, and
>> extensions to support surrogates, which would outperform 
>> any regular expression based
>> approach.
>>
>> What is your criterion for what constitutes a "letter"? 
>> Frankly, I have no interest in
>> decoding something as bizarre as UTF-8 encodings to see if 
>> you covered the foreign
>> delimiters, numbers, punctuation marks, etc. properly, and 
>> it makes no sense to do so.  So
>> there is no way I would waste my time trying to understand 
>> an example that should not
>> exist at all.
>>
>> Why do you seem to choose the worst possible choice when 
>> there is more than one way to do
>> something?  The choices are (a) work in 8-bit ANSI (b) 
>> work in UTF-8 (c) work in Unicode.
>> Of these, the worst possible choice is (b), followed by 
>> (a).  (c) is clearly the winner.
>>
>> So why are you using something as bizarre as UTF-8 
>> internally?  UTF-8 has ONE role, which
>> is to write Unicode out in an 8-bit encoding, and read 
>> Unicode in an 8-bit encoding.  You
>> do NOT want to write the program in terms of UTF-8!
>> joe
>> ****
>>>
>>>Since there are no UTF-8 groups, or even Unicode groups I
>>>must post these questions to groups that are at most
>>>indirectly related to this subject matter.
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer@flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm 
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
0
Reply Joseph 5/14/2010 3:18:58 AM

> I can imagine a lot of alternative approaches, including having a table of
> 65,536 "character masks" for Unicode characters

As we know, 65,536 entries (0000 through FFFF) are not enough; Unicode code points go up to 10FFFF :-)
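
To put numbers on that: the code space 0..10FFFF has 0x110000 = 1,114,112 positions, and anything above FFFF takes two 16-bit units (a surrogate pair) in UTF-16, so a table indexed by a single 16-bit unit cannot reach it. A small sketch of the pair arithmetic follows; the names kCodeSpaceSize and to_utf16 are just for this illustration.

```cpp
#include <string>

// Number of positions in the Unicode code space, 0x0000..0x10FFFF.
constexpr unsigned long kCodeSpaceSize = 0x110000UL;  // 1,114,112

// Encodes one code point as UTF-16 code units.
// Returns an empty string for surrogate values and values above 0x10FFFF.
std::u16string to_utf16(char32_t cp) {
    std::u16string units;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return units;  // not a Unicode scalar value
    }
    if (cp < 0x10000) {
        units.push_back(static_cast<char16_t>(cp));  // BMP: a single unit
    } else {
        const char32_t v = cp - 0x10000;             // 20 bits remain
        units.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}
```

For example, U+1F600 comes out as the pair D83D DE00, while U+20AC stays a single unit.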



> What is your criterion for what constitutes a "letter"?

The best way to attack the identification is by using Unicode properties.
Each code point has attributes indicating whether it is a letter
(General Category).

A good starting point is this:
    http://unicode.org/reports/tr31/tr31-1.html

But this only shows that basing that on some UTF-8 kind of thing is not
the way. And how are you going to deal with combining characters?
Normalization?
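
To make the property-based approach concrete, here is a rough sketch of a lookup that works on decoded code points instead of byte patterns. The table is a hand-picked illustrative sample (basic Latin, Greek capitals Alpha through Rho, Cyrillic А through я), not the real General Category data; a real implementation would generate its ranges from the Unicode Character Database or call a library such as ICU. Range, kSampleLetterRanges and is_sample_letter are names made up for this example.

```cpp
#include <algorithm>
#include <array>

struct Range {
    char32_t lo;
    char32_t hi;
};

// ILLUSTRATIVE SAMPLE ONLY, sorted by 'lo'. Real data would come from the
// Unicode Character Database (General Category L*), not from this list.
constexpr std::array<Range, 4> kSampleLetterRanges{{
    {0x0041, 0x005A},  // A-Z
    {0x0061, 0x007A},  // a-z
    {0x0391, 0x03A1},  // Greek capital Alpha..Rho
    {0x0410, 0x044F},  // Cyrillic capital A..small ya
}};

// Binary search: find the last range whose 'lo' is <= cp, then check its 'hi'.
bool is_sample_letter(char32_t cp) {
    auto it = std::upper_bound(
        kSampleLetterRanges.begin(), kSampleLetterRanges.end(), cp,
        [](char32_t value, const Range& r) { return value < r.lo; });
    if (it == kSampleLetterRanges.begin()) return false;
    --it;
    return cp <= it->hi;
}
```

With a table like this, a Devanagari digit such as U+0966 is simply absent from the letter ranges, whereas a byte-range pattern over UTF-8 accepts it along with everything else. Combining characters and normalization are not addressed by this sketch.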

There are very good reasons why the rule of thumb is:
 - UTF-16 or UTF-32 for processing
 - UTF-8 for storage/exchange
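
As a sketch of the storage/exchange half of that rule, assuming the program keeps its text as UTF-32 in a std::u32string, the conversion back to UTF-8 happens only when writing out. utf32_to_utf8 is a name invented for this illustration.

```cpp
#include <string>

// Encodes UTF-32 text as UTF-8 for storage or exchange.
// Returns false if the input contains a surrogate or a value above 0x10FFFF;
// 'out' then holds the bytes produced so far.
bool utf32_to_utf8(const std::u32string& in, std::string& out) {
    out.clear();
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return true;
}
```
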


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Reply Mihai 5/14/2010 8:30:00 AM

> Can postings on www.w3.org generally be relied upon?

For official documents, in general yes.
Unless it is some private post that says something like:
   "It is not endorsed by the W3C members, team, or any working group."
(see http://www.w3.org/2005/03/23-lex-U)

And it also does not mean that a solution that is enough to do some basic
UTF-8 validation for HTML is the right tool for writing a compiler.


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Reply Mihai 5/14/2010 8:40:39 AM

On 2010-05-13, Peter Olcott <NoSpam@OCR4Screen.com> wrote:
>
> "Ian Collins" <ian-news@hotmail.com> wrote in message 
> news:8539h9F7f1U1@mid.individual.net...
>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>> Is this Regular Expression for UTF-8 Correct??
>>
>> It's a fair bet you are off-topic in all the groups you 
>> have cross posted to.  Why don't you pick a group for a 
>> language with built in UTF8 and regexp support (PHP?) and 
>> badger them?
>>
>> -- 
>> Ian Collins
>
> What does this question have to do with the C++ language?
>
> At least my question is indirectly related to C++ by making 
> a utf8string for the C++ language from the regular 
> expression.

Just use iconv.
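
For reference, a minimal sketch of what that could look like with the POSIX iconv interface (iconv_open, iconv, iconv_close from <iconv.h>). Assumptions: the platform's iconv knows the encoding names "UTF-8" and "UTF-32LE" (glibc does) and declares the input buffer parameter as char**; some systems use const char** there, and some need -liconv at link time. The wrapper name iconv_utf8_to_utf32le is made up for this example.

```cpp
#include <iconv.h>

#include <cstddef>
#include <string>
#include <vector>

// Converts UTF-8 bytes to UTF-32LE bytes (four bytes per code point, no BOM).
// Returns false if the converter is unavailable or the input is malformed.
bool iconv_utf8_to_utf32le(const std::string& in, std::string& out) {
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");  // (tocode, fromcode)
    if (cd == reinterpret_cast<iconv_t>(-1)) return false;

    // Each input byte yields at most one code point, i.e. at most 4 output bytes.
    std::vector<char> buf(in.size() * 4 + 4);
    char* src = const_cast<char*>(in.data());  // iconv does not modify the input
    std::size_t src_left = in.size();
    char* dst = buf.data();
    std::size_t dst_left = buf.size();

    const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) return false;  // EILSEQ, EINVAL or E2BIG

    out.assign(buf.data(), buf.size() - dst_left);
    return true;
}
```
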

and don't cross post off-topic.


--- news://freenews.netfront.net/ - complaints: news@netfront.net ---
0
Reply Jasen 5/14/2010 9:01:40 AM

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph@4ax.com...
> No, an extremely verbose "You are going about this 
> completely wrong".
> joe

Which still avoids rather than answers my question. This was 
at one time a very effective ruse to hide the fact that you 
don't know the answer. I can see through this ruse now, so 
there is no sense in my attempting to justify my design 
decision to you. That would simply be a waste of time.

>
> On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott" 
> <NoSpam@OCR4Screen.com> wrote:
>
>>Ah so in other words an extremely verbose, "I don't know".
>>Let me take a different approach. Can postings on 
>>www.w3.org
>>generally be relied upon?
>>
>>"Joseph M. Newcomer" <newcomer@flounder.com> wrote in
>>message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi@4ax.com...
>>> See below...
>>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
>>> <NoSpam@OCR4Screen.com> wrote:
>>>
>>>>
>>>>"Leigh Johnston" <leigh@i42.co.uk> wrote in message
>>>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d@giganews.com...
>>>>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in 
>>>>> message
>>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d@giganews.com...
>>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>>
>>>>>> The solution is based on the GREEN portions of the
>>>>>> first
>>>>>> chart shown
>>>>>> on this link:
>>>>>>  http://www.w3.org/2005/03/23-lex-U
>>> ****
>>> Note that in the "green" areas, we find
>>>
>>> U0482 Cyrillic thousands sign
>>> U055A Armenian apostrophe
>>> U055C Armenian exclamation mark
>>> U05C3 Hebrew punctuation SOF Pasuq
>>> U060C Arabic comma
>>> U066B Arabic decimal separator
>>> U0700-U0709 Assorted Syriac punctuation  marks
>>> U0966-U096F Devanagari digits 0..9
>>> U09E6-U09EF Bengali digits 0..9
>>> U09F2-U09F3 Bengali rupee marks
>>> U0A66-U0A6F Gurmukhi digits 0..9
>>> U0AE6-U0AEF Gujarati digits 0..9
>>> U0B66-U0B6F Oriya digits 0..9
>>> U0BE6-U0BEF Tamil digits 0..9
>>> U0BF0-U0BF2  Tamil indicators for 10, 100, 1000
>>> U0BF3-U0BFA Tamil punctuation marks
>>> U0C66-U0C6F  Telugu digits 0..9
>>> U0CE6-U0CEF Kannada digits 0..9
>>> U0D66-U0D6F Malayalam digits 0..9
>>> U0E50-U0E59 Thai digits 0..9
>>> U0ED0-U0ED9  Lao digits 0..9
>>> U0F20-U0F29 Tibetan digits 0..9
>>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>>> U1040-U1049 - Myanmar digits 0..9
>>> U1360-U1368 Ethiopic punctuation marks
>>> U1369-U137C Ethiopic numeric values (digits, tens of
>>> digits, etc.)
>>> U17E0-U17E9 Khmer digits 0..9
>>> U1800-U180E Mongolian punctuation marks
>>> U1810-U1819 Mongolian digits 0..9
>>> U1946-U194F Limbu digits 0..9
>>> U19D0-U19D9  New Tai Lue digits 0..9
>>> ...at which point I realized I was wasting my time,
>>> because I was attempting to disprove
>>> what is a Really Dumb Idea, which is to write 
>>> applications
>>> that actually work on UTF-8
>>> encoded text.
>>>
>>> You are free to convert these to UTF-8, but in addition,
>>> if I've read some of the
>>> encodings correctly, the non-green areas preclude what 
>>> are
>>> clearly "letters" in other
>>> languages.
>>>
>>> Forget UTF-8.  It is a transport mechanism used at input
>>> and output edges.  Use Unicode
>>> internally.
>>> ****
>>>>>>
>>>>>> A semantically identical regular expression is also
>>>>>> found
>>>>>> on the above link under Validating lex Template
>>>>>>
>>>>>> 1    ['\u0000'-'\u007F']
>>>>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>>>>> 3 | ( '\u00E0'           ['\u00A0'-'\u00BF']
>>>>>> ['\u0080'-'\u00BF'])
>>>>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF']
>>>>>> ['\u0080'-'\u00BF'])
>>>>>> 5 | ( '\u00ED'           ['\u0080'-'\u009F']
>>>>>> ['\u0080'-'\u00BF'])
>>>>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF']
>>>>>> ['\u0080'-'\u00BF'])
>>>>>> 7 | ( '\u00F0'           ['\u0090'-'\u00BF']
>>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF']
>>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 9 | ( '\u00F4'           ['\u0080'-'\u008F']
>>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>>
>>>>>> Here is my version, the syntax is different, but the
>>>>>> UTF8
>>>>>> portion should be semantically identical.
>>>>>>
>>>>>> UTF8_BYTE_ORDER_MARK   [\xEF][\xBB][\xBF]
>>>>>>
>>>>>> ASCII     [\x0-\x7F]
>>>>>>
>>>>>> U1          [a-zA-Z_]
>>>>>> U2          [\xC2-\xDF][\x80-\xBF]
>>>>>> U3          [\xE0][\xA0-\xBF][\x80-\xBF]
>>>>>> U4          [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>>>>> U5          [\xED][\x80-\x9F][\x80-\xBF]
>>>>>> U6          [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>>>>> U7          [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>> U8
>>>>>> [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>> U9          [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>>>>
>>>>>> UTF8 
>>>>>> {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>
>>>>>> // This identifies the "Letter" portion of an
>>>>>> Identifier.
>>>>>> L 
>>>>>> {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>
>>>>>> I guess that most of the analysis may simply boil 
>>>>>> down
>>>>>> to
>>>>>> whether or not the original source from the link is
>>>>>> considered reliable. I had forgotten this original
>>>>>> source
>>>>>> when I first asked this question, that is why I am
>>>>>> reposting this same question again.
>>>>>
>>>>> What has this got to do with C++?  What is your C++
>>>>> language question?
>>>>>
>>>>> /Leigh
>>>>
>>>>I will be implementing a utf8string to supplement
>>>>std::string and will be using a regular expression to
>>>>quickly divide up UTF-8 bytes into Unicode CodePoints.
>>> ***
>>> For someone who had an unholy fixation on "performance",
>>> why would you choose such a slow
>>> mechanism for doing recognition?
>>>
>>> I can imagine a lot of alternative approaches, including
>>> having a table of 65,536
>>> "character masks" for Unicode characters, including
>>> on-the-fly updating of the table, and
>>> extensions to support surrogates, which would outperform
>>> any regular expression based
>>> approach.
>>>
>>> What is your criterion for what constitutes a "letter"?
>>> Frankly, I have no interest in
>>> decoding something as bizarre as UTF-8 encodings to see 
>>> if
>>> you covered the foreign
>>> delimiters, numbers, punctuation marks, etc. properly, 
>>> and
>>> it makes no sense to do so.  So
>>> there is no way I would waste my time trying to 
>>> understand
>>> an example that should not
>>> exist at all.
>>>
>>> Why do you seem to choose the worst possible choice when
>>> there is more than one way to do
>>> something?  The choices are (a) work in 8-bit ANSI (b)
>>> work in UTF-8 (c) work in Unicode.
>>> Of these, the worst possible choice is (b), followed by
>>> (a).  (c) is clearly the winner.
>>>
>>> So why are you using something as bizarre as UTF-8
>>> internally?  UTF-8 has ONE role, which
>>> is to write Unicode out in an 8-bit encoding, and read
>>> Unicode in an 8-bit encoding.  You
>>> do NOT want to write the program in terms of UTF-8!
>>> joe
>>> ****
>>>>
>>>>Since there are no UTF-8 groups, or even Unicode groups 
>>>>I
>>>>must post these questions to groups that are at most
>>>>indirectly related to this subject matter.
>>>>
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer@flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>
> Joseph M. Newcomer [MVP]
> email: newcomer@flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm 


0
Reply Peter 5/14/2010 1:27:45 PM

"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D78F42C2233MihaiN@207.46.248.16...
>
>> I can imagine a lot of alternative approaches, including 
>> having a table of
>> 65,536 "character masks" for Unicode characters
>
> As we know, 65,536 entries (0..FFFF) are not enough; Unicode 
> code points go up to 10FFFF :-)
>
>
>
>> What is your criterion for what constitutes a "letter"?
>
> The best way to attack the identification is by using 
> Unicode properties
> Each code point has attributes indicating if it is a 
> letter
> (General Category)
>
> A good starting point is this:
>    http://unicode.org/reports/tr31/tr31-1.html
>
> But this only shows that basing that on some UTF-8 kind of 
> thing is not the way. And how are you going to deal with 
> combining characters? Normalization?

I am going to handle this simplistically. Every code point 
above the ASCII range will be considered an alphanumeric 
character.

Eventually I will augment this to divide these code points 
into smaller categories. Unicode is supposed to have a way 
to do this, but I have never found anything as simple as a 
table mapping Unicode code points to their category.
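For readers following along: many language runtimes already ship such a table. Below is a minimal sketch, assuming Python's standard unicodedata module, which follows whatever Unicode version the interpreter was built with and not necessarily 5.2. The helper name is_identifier_letter is hypothetical. It shows how a General Category lookup differs from treating everything above ASCII as alphanumeric.

```python
import unicodedata


def is_identifier_letter(ch: str) -> bool:
    """Hypothetical helper: accept '_' or any code point whose
    General Category starts with 'L' (Lu, Ll, Lt, Lm, Lo)."""
    return ch == "_" or unicodedata.category(ch).startswith("L")


# A Latin letter, an accented letter, a Devanagari digit, an Arabic comma.
for ch in ["a", "\u00e9", "\u0966", "\u060c"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {is_identifier_letter(ch)}")
```

Under the "everything above ASCII is a letter" rule, the Devanagari digit and the Arabic comma would both be accepted as identifier letters; the category lookup rejects them.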

>
> There are very good reasons why the rule of thumb is:
> - UTF-16 or UTF-32 for processing
> - UTF-8 for storage/exchange
>
>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Reply Peter 5/14/2010 1:36:21 PM

"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D781110C7D27MihaiN@207.46.248.16...
>> Can postings on www.w3.org generally be relied upon?
>
> For official documents, in general yes.
> Unless it is some private post that says something like:
>   "It is not endorsed by the W3C members, team, or any 
> working group."
> (see http://www.w3.org/2005/03/23-lex-U)
>
> And also does not mean that a solution that is enough to 
> do some basic
> utf-8 validation for html is the right tool for writing a 
> compiler.

I am internationalizing the language that I am creating 
within the timeframe that I have.

UTF-8 is the standard encoding for internet applications. It 
works across every platform equally well without adaptation. 
It does not depend on little- or big-endian byte order; it 
simply works correctly everywhere.
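The byte-order point can be illustrated with a short sketch using Python's built-in codecs; the sample string is arbitrary. UTF-8 has a single byte sequence for a given text, while UTF-16 has two, depending on byte order.

```python
text = "h\u00e9llo"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")  # little-endian code units
utf16_be = text.encode("utf-16-be")  # big-endian code units

print("UTF-8    ", utf8.hex(" "))
print("UTF-16-LE", utf16_le.hex(" "))
print("UTF-16-BE", utf16_be.hex(" "))
```

This only shows that the serialized form is unambiguous; it says nothing either way about which encoding is more convenient for internal processing.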

>
>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Reply Peter 5/14/2010 1:41:37 PM

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message 
news:jfvou5ll41i9818ub01a4mgbfvetg4giu1@4ax.com...
> Actually, what it does is give us another opportunity to point out how 
> really bad this
> design choice is, and thus Peter can tell us all we are fools for not 
> answering a question
> that should never have been asked, not because it is inappropriate for the 
> group, but
> because it represents the worst-possible-design decision that could be 
> made.
> joe

Come on Joe, give Mr. Olcott some credit. I'm sure that he could dream up 
an even worse design, as he did with his OCR project, once he is given (and 
ignores) input from the professionals whose input he claims to seek.  ;)


-Pete


0
Reply Pete 5/14/2010 4:38:57 PM

"Pete Delgado" <Peter.Delgado@NoSpam.com> wrote in message 
news:uU4O0P48KHA.1892@TK2MSFTNGP05.phx.gbl...
>
> "Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
> message news:jfvou5ll41i9818ub01a4mgbfvetg4giu1@4ax.com...
>> Actually, what it does is give us another opportunity to 
>> point out how really bad this
>> design choice is, and thus Peter can tell us all we are 
>> fools for not answering a question
>> that should never have been asked, not because it is 
>> inappropriate for the group, but
>> because it represents the worst-possible-design decision 
>> that could be made.
>> joe
>
> Come on Joe, give Mr. Olcott some credit. I'm sure that he 
> could dream up an even worse design as he did with his OCR 
> project once he is given (and ignores) input from the 
> professionals whose input he claims to seek.  ;)
>
>
> -Pete
>
>

Most often I am not looking for "input from professionals", 
I am looking for answers to specific questions.

I now realize that every non-answer response tends to be a 
mask for the true answer of "I don't know". 


0
Reply Peter 5/14/2010 4:53:30 PM

"Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
news:v5OdnTkHxKqRHXDWnZ2dnUVZ_uGdnZ2d@giganews.com...
>
> "Pete Delgado" <Peter.Delgado@NoSpam.com> wrote in message 
> news:uU4O0P48KHA.1892@TK2MSFTNGP05.phx.gbl...
>>
>> "Joseph M. Newcomer" <newcomer@flounder.com> wrote in message 
>> news:jfvou5ll41i9818ub01a4mgbfvetg4giu1@4ax.com...
>>> Actually, what it does is give us another opportunity to point out how 
>>> really bad this
>>> design choice is, and thus Peter can tell us all we are fools for not 
>>> answering a question
>>> that should never have been asked, not because it is inappropriate for 
>>> the group, but
>>> because it represents the worst-possible-design decision that could be 
>>> made.
>>> joe
>>
>> Come on Joe, give Mr. Olcott some credit. I'm sure that he could dream up 
>> an even worse design as he did with his OCR project once he is given (and 
>> ignores) input from the professionals whose input he claims to seek.  ;)
>>
>>
>> -Pete
>>
>>
>
> Most often I am not looking for "input from professionals", I am looking 
> for answers to specific questions.

Which is one reason why your projects consistently fail. If you have a few 
days, take a look at the book "Programming Pearls" by Jon 
Bentley, specifically the first chapter. Sometimes making sure you are 
asking the *right* question is more important than getting an answer to a 
question. You seem to have a problem with that particular concept.

>
> I now realize that every non-answer response tends to be a mask for the 
> true answer of "I don't know".

In my case, you should change "I don't know" in your sentence above to: "I 
don't care"...

To clarify:

* I don't care to answer off-topic questions
* I don't care to answer questions where the answer will be ignored
* I don't care to have to justify a correct answer against an incorrect 
answer
* I don't care to answer questions where the resident SME (Mihai) has 
already guided you
* I don't care to feed the trolls

HTH

-Pete 


0
Reply Pete 5/14/2010 6:12:38 PM

"Pete Delgado" <Peter.Delgado@NoSpam.com> wrote in message 
news:O8vhKE58KHA.980@TK2MSFTNGP04.phx.gbl...
>
>> Most often I am not looking for "input from 
>> professionals", I am looking for answers to specific 
>> questions.
>
> Which is one reason why your projects consistently fail. 
> If you have a few

None of my projects have ever failed. Some of my projects 
inherently take an enormous amount of time to complete.

> days, take a look at the book "Programming Pearls" by Jon 
> Bentley -specifically the first chapter. Sometimes making 
> sure you are asking the *right* question is more important 
> than getting an answer to a question. You seem to have a 
> problem with that particular concept.

Yes, especially in those cases where I have already thought 
the problem through completely using categorically 
exhaustively complete reasoning.

In those rare instances, anything at all besides a direct 
answer to a direct question can only be a waste of time for 
me.



0
Reply Peter 5/14/2010 6:44:56 PM

"Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
news:f-6dnTCV1ce3B3DWnZ2dnUVZ_hCdnZ2d@giganews.com...
>
> "Pete Delgado" <Peter.Delgado@NoSpam.com> wrote in message 
> news:O8vhKE58KHA.980@TK2MSFTNGP04.phx.gbl...
>>
>>> Most often I am not looking for "input from professionals", I am looking 
>>> for answers to specific questions.
>>
>> Which is one reason why your projects consistently fail. If you have a 
>> few
>
> None of my projects have ever failed. Some of my projects inherently take 
> an enormous amount of time to complete.

ROTFL

OK Peter... If you say so...  :-)  I suppose that is the benefit of doing 
development solely for your own amusement. You can take inordinate amounts of 
time and not have to care if the market passes you by or if the relevancy of 
the software is diminished.


>
>> days, take a look at the book "Programming Pearls" by Jon 
>> Bentley -specifically the first chapter. Sometimes making sure you are 
>> asking the *right* question is more important than getting an answer to a 
>> question. You seem to have a problem with that particular concept.
>
> Yes especially on those cases where I have already thought the problem 
> through completely using categorically exhaustively complete reasoning.

That *sounds* nice, but if one considers your recent questions here as a 
gauge of your success at reasoning out the problem and coming up with a 
realistic, workable solution, it seems that your words and deeds do not 
match.

>
> In those rare instances anything at all besides a direct answer to a 
> direct question can only be a waste of time for me.

...which is why, long ago, I suggested that you simply hire a consultant.

-Pete


0
Reply Pete 5/14/2010 9:24:27 PM

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph@4ax.com...
> No, an extremely verbose "You are going about this 
> completely wrong".
> joe
>
> On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott" 
> <NoSpam@OCR4Screen.com> wrote:
>
>>Ah so in other words an extremely verbose, "I don't know".
>>Let me take a different approach. Can postings on 
>>www.w3.org
>>generally be relied upon?
>>
>>"Joseph M. Newcomer" <newcomer@flounder.com> wrote in
>>message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi@4ax.com...
>>> See below...
>>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
>>> <NoSpam@OCR4Screen.com> wrote:
>>>
>>>>
>>>>"Leigh Johnston" <leigh@i42.co.uk> wrote in message
>>>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d@giganews.com...
>>>>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in 
>>>>> message
>>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d@giganews.com...
>>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>>
>>>>>> The solution is based on the GREEN portions of the
>>>>>> first
>>>>>> chart shown
>>>>>> on this link:
>>>>>>  http://www.w3.org/2005/03/23-lex-U
>>> ****
>>> Note that in the "green" areas, we find
>>>
>>> U0482 Cyrillic thousands sign
>>> U055A Armenian apostrophe
>>> U055C Armenian exclamation mark
>>> U05C3 Hebrew punctuation SOF Pasuq
>>> U060C Arabic comma
>>> U066B Arabic decimal separator
>>> U0700-U0709 Assorted Syriac punctuation  marks
>>> U0966-U096F Devanagari digits 0..9
>>> U09E6-U09EF Bengali digits 0..9
>>> U09F2-U09F3 Bengali rupee marks
>>> U0A66-U0A6F Gurmukhi digits 0..9
>>> U0AE6-U0AEF Gujarati digits 0..9
>>> U0B66-U0B6F Oriya digits 0..9
>>> U0BE6-U0BEF Tamil digits 0..9
>>> U0BF0-U0BF2  Tamil indicators for 10, 100, 1000
>>> U0BF3-U0BFA Tamil punctuation marks
>>> U0C66-U0C6F  Telugu digits 0..9
>>> U0CE6-U0CEF Kannada digits 0..9
>>> U0D66-U0D6F Malayalam digits 0..9
>>> U0E50-U0E59 Thai digits 0..9
>>> U0ED0-U0ED9  Lao digits 0..9
>>> U0F20-U0F29 Tibetan digits 0..9
>>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>>> U1040-U1049 - Myanmar digits 0..9
>>> U1360-U1368 Ethiopic punctuation marks
>>> U1369-U137C Ethiopic numeric values (digits, tens of
>>> digits, etc.)
>>> U17E0-U17E9 Khmer digits 0..9
>>> U1800-U180E Mongolian punctuation marks
>>> U1810-U1819 Mongolian digits 0..9
>>> U1946-U194F Limbu digits 0..9
>>> U19D0-U19D9  New Tai Lue digits 0..9

Do you know anywhere where I can get a table that maps all 
of the code points to their category?

>>> ...at which point I realized I was wasting my time,
>>> because I was attempting to disprove
>>> what is a Really Dumb Idea, which is to write 
>>> applications
>>> that actually work on UTF-8
>>> encoded text.
>>>
>>> You are free to convert these to UTF-8, but in addition,
>>> if I've read some of the
>>> encodings correctly, the non-green areas preclude what 
>>> are
>>> clearly "letters" in other
>>> languages.
>>>
>>> Forget UTF-8.  It is a transport mechanism used at input
>>> and output edges.  Use Unicode
>>> internally.

That is how I intend to use it. To internationalize my GUI 
scripting language the interpreter will accept UTF-8 input 
as its source code files. It is substantially implemented 
using Lex and Yacc specifications for "C" that have been 
adapted to implement a subset of C++.

It was far easier (and far less error prone) to add the C++ 
that I needed to the "C" specification than it would have 
been to remove what I do not need from the C++ 
specification.

The actual language itself will store its strings as 32-bit 
codepoints. The SymbolTable will not bother to convert its 
strings from UTF-8. It turns out that UTF-8 byte sort order 
is identical to Unicode code point sort order.
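The sort-order claim can be checked with a short sketch. It relies on Python comparing str values by code point and bytes values byte by byte; the sample strings are arbitrary. UTF-16 is included for contrast, since surrogate pairs break the equivalence there.

```python
# One sample each from the 1-, 2-, 3- and 4-byte UTF-8 ranges, plus U+FF21.
words = ["z", "\u00e9", "\u4e2d", "\U0001F600", "\uFF21"]

by_code_point = sorted(words)  # str comparison is by code point
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))

print("UTF-8 order matches code point order: ", by_utf8 == by_code_point)
print("UTF-16 order matches code point order:", by_utf16 == by_code_point)
```

In UTF-16 the supplementary character U+1F600 is stored as a surrogate pair starting with D83D, so it sorts before U+FF21 even though its code point is larger.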

I am implementing a utf8string that will provide the most 
useful subset of the std::string interface. I need the 
regular expression for Lex, and it can also be easily 
converted into a DFA that very quickly and correctly 
breaks a UTF-8 string up into its constituent code 
points.
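As a rough illustration of the idea, the nine alternatives from the original question can be transcribed into a Python bytes pattern and used to walk a buffer one code point at a time. This is a sketch, not the Lex specification itself: the names UTF8_CHAR and split_code_points are hypothetical, and Python's re module stands in for the generated DFA.

```python
import re

# ASCII plus U2..U9 from the question, transcribed as a bytes pattern.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7F]"
    rb"|[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"
    rb"|[\xE1-\xEC][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|[\xEE-\xEF][\x80-\xBF]{2}"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
)


def split_code_points(data: bytes) -> list:
    """Split UTF-8 bytes into code points; raise ValueError on malformed input."""
    code_points = []
    pos = 0
    while pos < len(data):
        match = UTF8_CHAR.match(data, pos)
        if match is None:
            raise ValueError(f"invalid UTF-8 at byte offset {pos}")
        seq = match.group()
        n = len(seq)
        if n == 1:
            cp = seq[0]
        else:
            # Payload bits of the lead byte: 5, 4 or 3 bits for n = 2, 3, 4.
            cp = seq[0] & (0x7F >> n)
            for b in seq[1:]:
                cp = (cp << 6) | (b & 0x3F)  # 6 payload bits per trailing byte
        code_points.append(cp)
        pos = match.end()
    return code_points


print([hex(cp) for cp in split_code_points("a\u00e9\u4e2d\U0001F600".encode("utf-8"))])
```

One way to gain confidence in the pattern is to compare it with a strict decoder: every code point that the built-in codec can encode should match, and overlong forms such as C0 80 or encoded surrogates such as ED A0 80 should be rejected.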

Do you know anywhere where I can get a table that maps all 
of the code points to their category?

It is a shame that Microsoft will be killing this group next 
month. Where will we go?



0
Reply Peter 5/15/2010 3:42:16 AM

> Do you know anywhere where I can get a table that maps all 
> of the code points to their category?

ftp://ftp.unicode.org/Public/5.2.0/ucd

UnicodeData.txt
The main guide for that is ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
(if you don't want to go through the standard, which is the advisable thing)

And when you bump your head, remember that Joe and I warned you about UTF-8.
It was not designed for this kind of usage.



-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Reply Mihai 5/15/2010 10:21:47 AM

"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D7922352F422MihaiN@207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps 
>> all
>> of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd
>
> UnicodeData.txt
> The main guide for that is 
> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
> (if you don't want to go through the standard, which is the 
> advisable thing)
>
> And when you bump your head, remember that Joe and I warned 
> you about UTF-8.
> It was not designed for this kind of usage.
>
>
Joe also said that UTF-8 was designed for data interchange 
which is how I will be using it. Joe also falsely assumed 
that I would be using UTF-8 for my internal representation. 
I will be using UTF-32 for my internal representation.

I will be using UTF-8 as the source code for my language 
interpreter, which has the advantage of simply being ASCII 
for the English language, and working across every platform 
without requiring adaptations such as Little Endian and Big 
Endian. UTF-8 will also be the output of my OCR4Screen DFA 
recognizer.

>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Reply Peter 5/15/2010 2:12:09 PM

"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D7922352F422MihaiN@207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps 
>> all
>> of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd
>

What I am looking for is a mapping from Unicode code points 
(compressed into code point ranges where possible) to 
General Category values, given as two-character 
abbreviations. I will look through this first link to see 
if I can find this. Initially I saw a lot of things that 
were not this.

> UnicodeData.txt
> The main guide for that is 
> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
> (if you don't want to go through the standard, which is the 
> advisable thing)
>
> And when you bump your head, remember that Joe and I warned 
> you about UTF-8.
> It was not designed for this kind of usage.
>
>
>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Reply Peter 5/15/2010 3:08:25 PM

"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D7922352F422MihaiN@207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps 
>> all
>> of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd

I found the table that I was looking for here:
   ftp://ftp.unicode.org/Public/5.2.0/ucd/UnicodeData.txt
Thanks for all your help.
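For anyone who wants the compressed range form described earlier in the thread, here is a minimal sketch of one way to build it. UnicodeData.txt is semicolon-separated, with the code point in the first field, the name in the second and the General Category in the third; large blocks are given as a pair of "<..., First>" and "<..., Last>" lines. The function name category_ranges is hypothetical, and the sample lines are a hand-picked excerpt in the file's layout, not the whole file.

```python
SAMPLE_LINES = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
""".splitlines()


def category_ranges(lines):
    """Return (first, last, category) tuples, merging adjacent code
    points that share the same General Category."""
    ranges = []
    block_start = None  # set while inside a First/Last pair
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        cp, name, cat = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            block_start = cp
            continue
        if name.endswith(", Last>") and block_start is not None:
            first = block_start
        else:
            first = cp
        block_start = None
        if ranges and ranges[-1][2] == cat and ranges[-1][1] + 1 == first:
            ranges[-1] = (ranges[-1][0], cp, cat)  # extend the previous range
        else:
            ranges.append((first, cp, cat))
    return ranges


for first, last, cat in category_ranges(SAMPLE_LINES):
    print(f"{first:04X}..{last:04X} {cat}")
```

Code points that do not appear in the file at all are unassigned, so the gaps between ranges carry meaning and are deliberately not merged.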

>
> UnicodeData.txt
> The main guide for that is 
> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
> (if you don't want to go through the standard, which is the 
> advisable thing)
>
> And when you bump your head, remember that Joe and I warned 
> you about UTF-8.
> It was not designed for this kind of usage.
>
>
>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Reply Peter 5/15/2010 3:48:25 PM

How about a non-answer is a substitute for "this is the most incredibly stupid idea I have
seen in decades, and I'm not going to waste my time pointing out the obvious silliness of
it"?

You are again spending massive effort to solve an artificial problem of your own creation,
caused by making poor initial design choices, and supported by nonsensical
rationalizations.  A professional programmer knows certain patterns (that is our
strength!) and among these is the recognition that if you have to implement complex
solutions to simple problems, you have made a bad design choice.  You are best served by
re-examining the design and making choices that eliminate the need for complex
solutions, particularly when the complexity simply goes away if a different set of
solutions is postulated.

Personally, if I had to do a complex parser design, I'd want to eliminate the need to deal
with UTF-16 surrogates, and I'd write my code in terms of UTF-32.  Much simpler, and it
isolates the complexity at the input and output edges, not making it uniformly
distributed throughout the code.  And I'd know not to make childish decisions such as "it
costs too much to do the conversion" because I outgrew those kinds of arguments certainly
by 1980 (that's thirty years ago).  My first instance of this was a typesetting program I
did around 1970 where I stored the text as 9-bit rather than 7-bit bytes because I could
encode font information more readily in the upper two bits.  And I didn't even CONSIDER the
size and performance issues of 9-bit vs. 7-bit bytes because I knew they didn't matter in
the slightest.  So I guess I learned this lesson 40 years ago.  It greatly simplified the
internal coding.

But you are sounding like a first-semester programmer who was taught by some old PDP-11
programmer, and I don't buy either the size or the conversion performance arguments.  You
don't even have NUMBERS to argue your position! Optimization decisions that are argued
without quantitative supporting measurements are almost always wrong.  But we've had this
discussion before, and your view is "My mind is made up, don't require me to get FACTS to
support my decision!"  In the Real World, before we can justify wasting lots of programmer
time to implement bad decisions, we require justification. But maybe that's just my
project management experience talking.  Horrible, this dependence on reality that I have.

If someone came to me with such a design, and was as insistent as you will be, my first
requirement would be "Write a program that reads UTF-8 files of the expected size, then
writes them back out.  Measure its performance reading several dozen different files, and
run each experiment 100 times, measuring the time-to-completion".  Then "modify the
program to convert the data to UTF-16, convert it back to UTF-8, and run the same
experiment set.  Demonstrate that the change in the mean time is statistically
significant".  Hell, the time to LOAD the PROGRAM is going to vary from
experiment to experiment by a variance several orders of magnitude greater than the
conversion cost!  So don't try to make the case that the conversion cost matters; the
truth, based on actual performance measurements end-to-end, is that it does not.  But, not
having actually done performance measurement, you don't understand that.  Those of us who
devoted nontrivial parts of our lives to optimizing program performance KNOW what the
problems are, and know that the conversion cannot possibly matter.
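A rough sketch of the in-memory half of that experiment follows. It times only the conversion step; file I/O and process start-up, which the argument above expects to dominate, would have to be measured around it. The synthetic SAMPLE buffer, the function names and the run count are all assumptions made for illustration.

```python
import statistics
import timeit

# Synthetic stand-in for "UTF-8 files of the expected size" (an assumption).
SAMPLE = ("h\u00e9llo \u4e2d\u6587 \U0001F600\n" * 20000).encode("utf-8")


def pass_through(data: bytes) -> bytes:
    """Baseline: hand the UTF-8 bytes back unchanged."""
    return data


def via_utf16(data: bytes) -> bytes:
    """Convert UTF-8 to UTF-16, then back to UTF-8."""
    utf16 = data.decode("utf-8").encode("utf-16-le")
    return utf16.decode("utf-16-le").encode("utf-8")


def measure(func, runs=20):
    """Return (mean, standard deviation) in seconds over `runs` single calls."""
    times = timeit.repeat(lambda: func(SAMPLE), number=1, repeat=runs)
    return statistics.mean(times), statistics.stdev(times)


for func in (pass_through, via_utf16):
    mean, stdev = measure(func)
    print(f"{func.__name__}: mean {mean * 1e3:.3f} ms, stdev {stdev * 1e3:.3f} ms")
```

Comparing the difference in means against the run-to-run standard deviation of a full read-convert-write cycle is what would show whether the conversion cost is statistically significant.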
						joe
****
On Fri, 14 May 2010 11:53:30 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>
>"Pete Delgado" <Peter.Delgado@NoSpam.com> wrote in message 
>news:uU4O0P48KHA.1892@TK2MSFTNGP05.phx.gbl...
>>
>> "Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
>> message news:jfvou5ll41i9818ub01a4mgbfvetg4giu1@4ax.com...
>>> Actually, what it does is give us another opportunity to 
>>> point out how really bad this
>>> design choice is, and thus Peter can tell us all we are 
>>> fools for not answering a question
>>> that should never have been asked, not because it is 
>>> inappropriate for the group, but
>>> because it represents the worst-possible-design decision 
>>> that could be made.
>>> joe
>>
>> Come on Joe, give Mr. Olcott some credit. I'm sure that he 
>> could dream up an even worse design as he did with his OCR 
>> project once he is given (and ignores) input from the 
>> professionals whose input he claims to seek.  ;)
>>
>>
>> -Pete
>>
>>
>
>Most often I am not looking for "input from professionals", 
>I am looking for answers to specific questions.
>
>I now realize that every non-answer response tends to be a 
>mask for the true answer of "I don't know". 
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
0
Reply Joseph 5/17/2010 4:03:39 AM

See below...
On Fri, 14 May 2010 08:27:45 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
>message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph@4ax.com...
>> No, an extremely verbose "You are going about this 
>> completely wrong".
>> joe
>
>Which still avoids rather than answers my question. This was 
>at one time a very effective ruse to hide the fact that you 
>don't know the answer. I can see through this ruse now, so 
>there is no sense in my attempting to justify my design 
>decision to you. That would simply be a waste of time.
****
I think I answered part of it.  The part that matters.  The part that says "this is
wrong".  I did this by pointing out some counterexamples.

I know the answer: Don't Do It That Way.  You are asking for a specific answer that will
allow you to pursue a Really Bad Design Decision.  I'm not going to answer a bad question;
I'm going to tell you what the correct solution is.  I'm avoiding the question because it
is a really bad question, because you should be able to answer it yourself, and because
giving an answer simply justifies a poor design.  I don't justify poor designs, I try to
kill them.

Only you could make a bad design decision and feel you have to justify it.  Particularly
when the experts have already all told you it is a bad design decision, and you should not
go that way.
				joe

>
>>
>> On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott" 
>> <NoSpam@OCR4Screen.com> wrote:
>>
>>>Ah so in other words an extremely verbose, "I don't know".
>>>Let me take a different approach. Can postings on 
>>>www.w3.org
>>>generally be relied upon?
>>>
>>>"Joseph M. Newcomer" <newcomer@flounder.com> wrote in
>>>message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi@4ax.com...
>>>> See below...
>>>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
>>>> <NoSpam@OCR4Screen.com> wrote:
>>>>
>>>>>
>>>>>"Leigh Johnston" <leigh@i42.co.uk> wrote in message
>>>>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d@giganews.com...
>>>>>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in 
>>>>>> message
>>>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d@giganews.com...
>>>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>>>
>>>>>>> The solution is based on the GREEN portions of the
>>>>>>> first
>>>>>>> chart shown
>>>>>>> on this link:
>>>>>>>  http://www.w3.org/2005/03/23-lex-U
>>>> ****
>>>> Note that in the "green" areas, we find
>>>>
>>>> U0482 Cyrillic thousands sign
>>>> U055A Armenian apostrophe
>>>> U055C Armenian exclamation mark
>>>> U05C3 Hebrew punctuation SOF Pasuq
>>>> U060C Arabic comma
>>>> U066B Arabic decimal separator
>>>> U0700-U0709 Assorted Syriac punctuation  marks
>>>> U0966-U096F Devanagari digits 0..9
>>>> U09E6-U09EF Bengali digits 0..9
>>>> U09F2-U09F3 Bengali rupee marks
>>>> U0A66-U0A6F Gurmukhi digits 0..9
>>>> U0AE6-U0AEF Gujarati digits 0..9
>>>> U0B66-U0B6F Oriya digits 0..9
>>>> U0BE6-U0BEF Tamil digits 0..9
>>>> U0BF0-U0BF2  Tamil indicators for 10, 100, 1000
>>>> U0BF3-U0BFA Tamil punctuation marks
>>>> U0C66-U0C6F  Telugu digits 0..9
>>>> U0CE6-U0CEF Kannada digits 0..9
>>>> U0D66-U0D6F Malayalam digits 0..9
>>>> U0E50-U0E59 Thai digits 0..9
>>>> U0ED0-U0ED9  Lao digits 0..9
>>>> U0F20-U0F29 Tibetan digits 0..9
>>>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>>>> U1040-U1049 - Myanmar digits 0..9
>>>> U1360-U1368 Ethiopic punctuation marks
>>>> U1369-U137C Ethiopic numeric values (digits, tens of
>>>> digits, etc.)
>>>> U17E0-U17E9 Khmer digits 0..9
>>>> U1800-U180E Mongolian punctuation marks
>>>> U1810-U1819 Mongolian digits 0..9
>>>> U1946-U194F Limbu digits 0..9
>>>> U19D0-U19D9  New Tai Lue digits 0..9
>>>> ...at which point I realized I was wasting my time,
>>>> because I was attempting to disprove
>>>> what is a Really Dumb Idea, which is to write 
>>>> applications
>>>> that actually work on UTF-8
>>>> encoded text.
>>>>
>>>> You are free to convert these to UTF-8, but in addition,
>>>> if I've read some of the
>>>> encodings correctly, the non-green areas preclude what 
>>>> are
>>>> clearly "letters" in other
>>>> languages.
>>>>
>>>> Forget UTF-8.  It is a transport mechanism used at input
>>>> and output edges.  Use Unicode
>>>> internally.
>>>> ****
>>>>>>>
>>>>>>> A semantically identical regular expression is also
>>>>>>> found
>>>>>>> on the above link under Validating lex Template
>>>>>>>
>>>>>>> 1    ['\u0000'-'\u007F']
>>>>>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>>>>>> 3 | ( '\u00E0'           ['\u00A0'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'])
>>>>>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'])
>>>>>>> 5 | ( '\u00ED'           ['\u0080'-'\u009F']
>>>>>>> ['\u0080'-'\u00BF'])
>>>>>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'])
>>>>>>> 7 | ( '\u00F0'           ['\u0090'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF']
>>>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>>> 9 | ( '\u00F4'           ['\u0080'-'\u008F']
>>>>>>> ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>>>
>>>>>>> Here is my version, the syntax is different, but the
>>>>>>> UTF8
>>>>>>> portion should be semantically identical.
>>>>>>>
>>>>>>> UTF8_BYTE_ORDER_MARK   [\xEF][\xBB][\xBF]
>>>>>>>
>>>>>>> ASCII     [\x0-\x7F]
>>>>>>>
>>>>>>> U1          [a-zA-Z_]
>>>>>>> U2          [\xC2-\xDF][\x80-\xBF]
>>>>>>> U3          [\xE0][\xA0-\xBF][\x80-\xBF]
>>>>>>> U4          [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>>>>>> U5          [\xED][\x80-\x9F][\x80-\xBF]
>>>>>>> U6          [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>>>>>> U7          [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>>> U8
>>>>>>> [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>>> U9          [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>>>>>
>>>>>>> UTF8 
>>>>>>> {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>>
>>>>>>> // This identifies the "Letter" portion of an
>>>>>>> Identifier.
>>>>>>> L 
>>>>>>> {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>>
>>>>>>> I guess that most of the analysis may simply boil 
>>>>>>> down
>>>>>>> to
>>>>>>> whether or not the original source from the link is
>>>>>>> considered reliable. I had forgotten this original
>>>>>>> source
>>>>>>> when I first asked this question, that is why I am
>>>>>>> reposting this same question again.
>>>>>>
>>>>>> What has this got to do with C++?  What is your C++
>>>>>> language question?
>>>>>>
>>>>>> /Leigh
>>>>>
>>>>>I will be implementing a utf8string to supplement
>>>>>std::string and will be using a regular expression to
>>>>>quickly divide up UTF-8 bytes into Unicode CodePoints.
>>>> ***
>>>> For someone who had an unholy fixation on "performance",
>>>> why would you choose such a slow
>>>> mechanism for doing recognition?
>>>>
>>>> I can imagine a lot of alternative approaches, including
>>>> having a table of 65,536
>>>> "character masks" for Unicode characters, including
>>>> on-the-fly updating of the table, and
>>>> extensions to support surrogates, which would outperform
>>>> any regular expression based
>>>> approach.
>>>>
>>>> What is your criterion for what constitutes a "letter"?
>>>> Frankly, I have no interest in
>>>> decoding something as bizarre as UTF-8 encodings to see 
>>>> if
>>>> you covered the foreign
>>>> delimiters, numbers, punctuation marks, etc. properly, 
>>>> and
>>>> it makes no sense to do so.  So
>>>> there is no way I would waste my time trying to 
>>>> understand
>>>> an example that should not
>>>> exist at all.
>>>>
>>>> Why do you seem to choose the worst possible choice when
>>>> there is more than one way to do
>>>> something?  The choices are (a) work in 8-bit ANSI (b)
>>>> work in UTF-8 (c) work in Unicode.
>>>> Of these, the worst possible choice is (b), followed by
>>>> (a).  (c) is clearly the winner.
>>>>
>>>> So why are you using something as bizarre as UTF-8
>>>> internally?  UTF-8 has ONE role, which
>>>> is to write Unicode out in an 8-bit encoding, and read
>>>> Unicode in an 8-bit encoding.  You
>>>> do NOT want to write the program in terms of UTF-8!
>>>> joe
>>>> ****
>>>>>
>>>>>Since there are no UTF-8 groups, or even Unicode groups 
>>>>>I
>>>>>must post these questions to groups that are at most
>>>>>indirectly related to this subject matter.
>>>>>
>>>> Joseph M. Newcomer [MVP]
>>>> email: newcomer@flounder.com
>>>> Web: http://www.flounder.com
>>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>>
>> Joseph M. Newcomer [MVP]
>> email: newcomer@flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm 
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Reply Joseph 5/17/2010 4:33:31 AM

See below...
On Fri, 14 May 2010 22:42:16 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
>message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph@4ax.com...
>> No, an extremely verbose "You are going about this 
>> completely wrong".
>> joe
>>
>> On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott" 
>> <NoSpam@OCR4Screen.com> wrote:
>>
>>>Ah so in other words an extremely verbose, "I don't know".
>>>Let me take a different approach. Can postings on 
>>>www.w3.org
>>>generally be relied upon?
>>>
>>>"Joseph M. Newcomer" <newcomer@flounder.com> wrote in
>>>message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi@4ax.com...
>>>> See below...
>>>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
>>>> <NoSpam@OCR4Screen.com> wrote:
>>>>
>>>>>
>>>>>"Leigh Johnston" <leigh@i42.co.uk> wrote in message
>>>>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d@giganews.com...
>>>>>> "Peter Olcott" <NoSpam@OCR4Screen.com> wrote in 
>>>>>> message
>>>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d@giganews.com...
>>>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>>>
>>>>>>> The solution is based on the GREEN portions of the
>>>>>>> first
>>>>>>> chart shown
>>>>>>> on this link:
>>>>>>>  http://www.w3.org/2005/03/23-lex-U
>>>> ****
>>>> Note that in the "green" areas, we find
>>>>
>>>> U0482 Cyrillic thousands sign
>>>> U055A Armenian apostrophe
>>>> U055C Armenian exclamation mark
>>>> U05C3 Hebrew punctuation SOF Pasuq
>>>> U060C Arabic comma
>>>> U066B Arabic decimal separator
>>>> U0700-U0709 Assorted Syriac punctuation  marks
>>>> U0966-U096F Devanagari digits 0..9
>>>> U09E6-U09EF Bengali digits 0..9
>>>> U09F2-U09F3 Bengali rupee marks
>>>> U0A66-U0A6F Gurmukhi digits 0..9
>>>> U0AE6-U0AEF Gujarati digits 0..9
>>>> U0B66-U0B6F Oriya digits 0..9
>>>> U0BE6-U0BEF Tamil digits 0..9
>>>> U0BF0-U0BF2  Tamil indicators for 10, 100, 1000
>>>> U0BF3-U0BFA Tamil punctuation marks
>>>> U0C66-U0C6F  Telugu digits 0..9
>>>> U0CE6-U0CEF Kannada digits 0..9
>>>> U0D66-U0D6F Malayalam digits 0..9
>>>> U0E50-U0E59 Thai digits 0..9
>>>> U0ED0-U0ED9  Lao digits 0..9
>>>> U0F20-U0F29 Tibetan digits 0..9
>>>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>>>> U1040-U1049 - Myanmar digits 0..9
>>>> U1360-U1368 Ethiopic punctuation marks
>>>> U1369-U137C Ethiopic numeric values (digits, tens of
>>>> digits, etc.)
>>>> U17E0-U17E9 Khmer digits 0..9
>>>> U1800-U180E Mongolian punctuation marks
>>>> U1810-U1819 Mongolian digits 0..9
>>>> U1946-U194F Limbu digits 0..9
>>>> U19D0-U19D9  New Tai Lue digits 0..9
>
>Do you know anywhere where I can get a table that maps all 
>of the code points to their category?
****
You don't need to.  There's an API that does that.  Go read the Unicode support.  You can
also read the code for my Locale Explorer, or you can download the table from
www.unicode.org.  
				joe
****
>
>>>> ...at which point I realized I was wasting my time,
>>>> because I was attempting to disprove
>>>> what is a Really Dumb Idea, which is to write 
>>>> applications
>>>> that actually work on UTF-8
>>>> encoded text.
>>>>
>>>> You are free to convert these to UTF-8, but in addition,
>>>> if I've read some of the
>>>> encodings correctly, the non-green areas preclude what 
>>>> are
>>>> clearly "letters" in other
>>>> languages.
>>>>
>>>> Forget UTF-8.  It is a transport mechanism used at input
>>>> and output edges.  Use Unicode
>>>> internally.
>
>That is how I intend to use it. To internationalize my GUI 
>scripting language the interpreter will accept UTF-8 input 
>as its source code files. It is substantially implemented 
>using Lex and Yacc specifications for "C" that have been 
>adapted to implement a subset of C++.
*****
So why does the question matter?  Accepting UTF-8 input makes perfect sense, but the first
thing you should do with it is convert it to UTF-16, or better still UTF-32.
****
>
>It was far easier (and far less error prone) to add the C++ 
>that I needed to the "C" specification than it would have 
>been to remove what I do not need from the C++ 
>specification.
***
Huh?  What's this got to do with the encoding?
***
>
>The actual language itself will store its strings as 32-bit 
>codepoints. The SymbolTable will not bother to convert its 
>strings from UTF-8. It turns out that UTF-8 byte sort order 
>is identical to Unicode code point sort order.
****
Strange.  I thought sort order was locale-specific and independent of code points.  But
then, maybe I just understand what is going on.
*****
>
>I am implementing a utf8string that will provide the most 
>useful subset of the std::string interface. I need the 
>regular expression for Lex, and it also can be easily 
>converted into a DFA to very quickly and completely 
>correctly break up a UTF-8 string into its code point 
>constituent parts.
>
>Do you know anywhere where I can get a table that maps all 
>of the code points to their category?
*****
www.unicode.org

Also, there is an API call that does this, and you can check the source of my Locale
Explorer to find it.
				joe
****
>
>It is a shame that Microsoft will be killing this group next 
>month, where will we go?
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Reply Joseph 5/17/2010 4:39:09 AM

See below...
On Sat, 15 May 2010 09:12:09 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>
>"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
>news:Xns9D7922352F422MihaiN@207.46.248.16...
>>
>>> Do you know anywhere where I can get a table that maps 
>>> all
>>> of the code points to their category?
>>
>> ftp://ftp.unicode.org/Public/5.2.0/ucd
>>
>> UnicodeData.txt
>> The main guide for that is 
>> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
>> (if you don't want to go thru the standard, which is the 
>> advisable thing)
>>
>> And when you bump your head, remember that joe and I warned 
>> you about utf-8.
>> It was not designed for this kind of usage.
>>
>>
>Joe also said that UTF-8 was designed for data interchange 
>which is how I will be using it. Joe also falsely assumed 
>that I would be using UTF-8 for my internal representation. 
>I will be using UTF-32 for my internal representation.
****
But then, you would not need the UTF-8 regexps!  You would only need those if you were
storing the data in UTF-8.  To give an external grammar to your language, you should give
the UTF-32 regexps, and if necessary, you can TRANSLATE those to UTF-8, but you don't
start with UTF-8.  The lex input would need to be in terms of UTF-32, so you would not be
using UTF-8 there, either.
****
>
>I will be using UTF-8 as the source code for my language 
>interpreter, which has the advantage of simply being ASCII 
>for the English language, and working across every platform 
>without requiring adaptations such as Little Endian and Big 
>Endian. UTF-8 will also be the output of my OCR4Screen DFA 
>recognizer.
>
>>
>> -- 
>> Mihai Nita [Microsoft MVP, Visual C++]
>> http://www.mihai-nita.net
>> ------------------------------------------
>> Replace _year_ with _ to get the real email
>> 
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Reply Joseph 5/17/2010 4:42:45 AM

See below..
On Fri, 14 May 2010 01:30:00 -0700, "Mihai N." <nmihai_year_2000@yahoo.com> wrote:

>
>> I can imagine a lot of alternative approaches, including having a table of
>> 65,536 "character masks" for Unicode characters
>
>As we know, 65,536 (FFFF) is not enough, Unicode codepoints go to 10FFFF :-)
****
Yes, but this leads to questions of how to build sparse encodings or handle surrogates
with secondary tables, and I did not want to confuse the issue.  A first cut for performance
would be to use a 64K-entry table and, for values above FFFF, decode to a secondary table.

But this would be too much reality to absorb at once.
****
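
The two-level lookup described above can be sketched roughly as follows. This is only
an illustration, not code from this thread: the class name `CategoryTable`, the mask
constants, and the choice of `std::unordered_map` for the secondary table are all
assumptions, and a real table would be filled from UnicodeData.txt rather than by hand.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical category bits; real values would be generated from UnicodeData.txt.
constexpr std::uint8_t kNone   = 0;
constexpr std::uint8_t kLetter = 1;
constexpr std::uint8_t kDigit  = 2;
constexpr std::uint8_t kPunct  = 4;

class CategoryTable {
public:
    // 65,536 zero-initialized entries cover U+0000..U+FFFF directly.
    CategoryTable() : bmp_(0x10000) {}

    void set(char32_t cp, std::uint8_t mask) {
        if (cp <= 0xFFFF) bmp_[cp] = mask;
        else supplementary_[cp] = mask;      // sparse secondary table
    }

    std::uint8_t get(char32_t cp) const {
        if (cp <= 0xFFFF) return bmp_[cp];   // one array index for the common case
        auto it = supplementary_.find(cp);
        return it == supplementary_.end() ? kNone : it->second;
    }

private:
    std::vector<std::uint8_t> bmp_;
    std::unordered_map<char32_t, std::uint8_t> supplementary_;
};
```

Lookups in the first 64K cost a single array index; only code points above FFFF pay
for the secondary lookup.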
>
>
>
>> What is your criterion for what constitutes a "letter"?
>
>The best way to attack the identification is by using Unicode properties
>Each code point has attributes indicating if it is a letter
>(General Category)
>
>A good starting point is this:
>    http://unicode.org/reports/tr31/tr31-1.html
>
>But this only shows that basing that on some UTF-8 kind of thing is not
>the way. And how are you going to deal with combining characters?
>Normalization?
****
Ahh, that old concept, "reality" again.  This is what I meant by the question about what
constitutes a letter; for example, the trivial case of a byte sequence that encodes a
nonspacing accent mark with a letter that follows requires a separate lexical rule because
lex only works on actual input characters (even if modified to support UTF-32), and
therefore the overly-simplistic regexp shown is clearly untenable.  But again, I did not
want to point out the subtleties because I would not have been using exhaustive
categorical reasoning to derive why the question was a stupid question.  So I pointed out
just the most trivial of failure modes, and asked a fundamental question, for which, alas,
you gave the answer (thus cheating me out of further annoying Peter by forcing him to
actually think the problem through).  Vowel marks in some languages (e.g., Hebrew) are
another counterexample.

Even UTF-32 doesn't solve the "what is a letter" question!  Which is why the regexp rules
are clearly bad!
					joe
****
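
To make the counterexamples concrete, here is a small sketch that tests byte pairs
against the U2 rule from the original post. The function name `matchesU2` is made up
for illustration. It suggests that the rule, used as part of the "Letter" class, accepts
a bare combining accent and a punctuation mark as readily as an accented letter.

```cpp
#include <string>

// True when s is exactly two bytes matching the U2 rule from the original post:
//   [\xC2-\xDF][\x80-\xBF]
bool matchesU2(const std::string& s) {
    if (s.size() != 2) return false;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    const unsigned char b1 = static_cast<unsigned char>(s[1]);
    return b0 >= 0xC2 && b0 <= 0xDF && b1 >= 0x80 && b1 <= 0xBF;
}

// Sample inputs, written as raw UTF-8 bytes:
//   "\xC3\xA9"  U+00E9  Latin small letter e with acute (a letter)
//   "\xCC\x81"  U+0301  combining acute accent (a nonspacing mark)
//   "\xD7\x83"  U+05C3  Hebrew punctuation sof pasuq (punctuation)
// All three satisfy matchesU2, so the byte ranges alone cannot tell them apart.
```

Telling these apart needs the General Category of the decoded code point, which is a
property lookup and not something the byte ranges encode.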
>
>There are very good reasons why the rule of thumb is:
> - UTF-16 or UTF-32 for processing
> - UTF-8 for storage/exchange
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Reply Joseph 5/17/2010 4:54:07 AM

On May 13, 2:59 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

> I can not accept that the "correct" answer is that some
> questions can not be asked.

Nobody said that. That's your straw man.

If you go to an alcoholics anonymous meeting and want to talk about
your sex addiction, they will politely explain to you this particular
group is for people who have problems with alcohol. It matters not
whether there is or isn't some group for sex addicts around. Simply
put, AA folks deal with alcohol addiction and they shouldn't even have
to think about whether there is or isn't some other group for sex
addicts that you may or may not find suitable. That's not their
problem.

DS
Reply David 5/17/2010 9:13:29 AM

On May 13, 3:04 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

> What does this question have to do with the C++ language?
>
> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.
>
> Your question is not even indirectly related to the C++
> language.

Unfortunately, no better way is known to keep conversations on topic.
If you know a better way, we'd all love to hear it. If you don't
respond immediately in the forum and point out that something is off
topic, other people browsing the forum will think the question was on
topic. Other ways have been tried in the past (such as private mails
where possible, monthly posts about topicality rather than replying to
each off-topic post, and so on). None have been shown to be effective.

Painful experience has shown that the most effective technique is to
verbally berate and ridicule people who post off topic. Thus others
will see the negative response by the group and not want their posts
to be met with a similar response.

Again, this wasn't anyone's first choice, and if you know a better
way, please tell us. (In the appropriate forum, of course!)

DS
Reply David 5/17/2010 9:17:03 AM

On 5/16/2010 11:03 PM, Joseph M. Newcomer wrote:
> How about a non-answer is a substitute for "this is the most incredibly stupid idea I have
> seen in decades, and I'm not going to waste my time pointing out the obvious silliness of
> it"?
>
> You are again spending massive effort to solve an artificial problem of your own creation,
> caused by making poor initial design choices, and supported by nonsensical
> rationalizations.  A professional programmer knows certain patterns (that is our
> strength!) and among these are the recognition that if you have to implement complex
> solutions to simple problems, you  have made a bad design choice and are best served by
> re-examining the design choices and making design choices that eliminate the need for
> complex solutions, particularly when the complexity simply goes away if a different set of
> solutions is postulated.
>
> Personally, if I had to do a complex parser design, I'd want to eliminate the need to deal
> with UTF-16 surrogates, and I'd write my code in terms of UTF-32.  Much simpler, and
> isolates the complexity at the input and output edges, not making it uniformly
> distributed throughout the code.  And I'd know not to make childish decisions such as "it
> costs too much to do the conversion" because I outgrew those kinds of arguments certainly
> by 1980 (that's thirty years ago).  My first instance of this was a typesetting program I
> did around 1970 where I stored the text as 9-bit rather than 7-bit bytes because I could
> encode font information more readily in the upper two bits.  And I didn't even CONSIDER the
> size and performance issues of 9-bit vs. 7-bit bytes because I knew they didn't matter in
> the slightest.  So I guess I learned this lesson 40 years ago.  It greatly simplified the
> internal coding.
>
> But you are sounding like a first-semester programmer who was taught by some old PDP-11
> programmer, and I don't buy either the size or the conversion performance arguments.  You
> don't even have NUMBERS to argue your position! Optimization decisions that are argued
> without quantitative supporting measurements are almost always wrong.  But we've had this
> discussion before, and your view is "My mind is made up, don't require me to get FACTS to
> support my decision!"  In the Real World, before we can justify wasting lots of programmer
> time to implement bad decisions, we require justification. But maybe that's just my
> project management experience talking.  Horrible, this dependence on reality that I have.
>
> If someone came to me with such a design, and was as insistent as you will be, my first
> requirement would be "Write a program that reads UTF-8 files of the expected size, then
> writes them back out.  Measure its performance reading several dozen different files, and
> run each experiment 100 times, measuring the time-to-completion".  Then "modify the
> program to convert the data to UTF-16, convert it back to UTF-8, and run the same
> experiment set.  Demonstrate that the change in the mean time is statistically
> significant".  Hell, the variation of LOADING the PROGRAM is going to differ from
> experiment to experiment by a variance several orders of magnitude greater than the
> conversion cost!  So don't try to make the case that the conversion cost matters; the
> truth, based on actual performance measurements end-to-end, is that it does not.  But, not
> having actually done performance measurement, you don't understand that.  Those of us who
> devoted nontrivial parts of our lives to optimizing program performance KNOW what the
> problems are, and know that the conversion cannot possibly matter.
> 						joe

You probably have a point here. My "devil's advocate" counter argument 
is showing up all of the nuances of the alternative design decisions.

Where am I going to be able to talk to you when Microsoft shuts down the 
microsoft.public.* hierarchy?

> *****
> 					joe
> ****
> On Fri, 14 May 2010 11:53:30 -0500, "Peter Olcott"<NoSpam@OCR4Screen.com>  wrote:
>
>>
>> "Pete Delgado"<Peter.Delgado@NoSpam.com>  wrote in message
>> news:uU4O0P48KHA.1892@TK2MSFTNGP05.phx.gbl...
>>>
>>> "Joseph M. Newcomer"<newcomer@flounder.com>  wrote in
>>> message news:jfvou5ll41i9818ub01a4mgbfvetg4giu1@4ax.com...
>>>> Actually, what it does is give us another opportunity to
>>>> point out how really bad this
>>>> design choice is, and thus Peter can tell us all we are
>>>> fools for not answering a question
>>>> that should never have been asked, not because it is
>>>> inappropriate for the group, but
>>>> because it represents the worst-possible-design decision
>>>> that could be made.
>>>> joe
>>>
>>> Come on Joe, give Mr. Olcott some credit. I'm sure that he
>>> could dream up an even worse design as he did with his OCR
>>> project once he is given (and ignores) input from the
>>> professionals whose input he claims to seek.  ;)
>>>
>>>
>>> -Pete
>>>
>>>
>>
>> Most often I am not looking for "input from professionals",
>> I am looking for answers to specific questions.
>>
>> I now realize that every non-answer response tends to be a
>> mask for the true answer of "I don't know".
>>
> Joseph M. Newcomer [MVP]
> email: newcomer@flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

Reply Peter 5/17/2010 1:34:25 PM

On 5/16/2010 11:33 PM, Joseph M. Newcomer wrote:
> See below...
> On Fri, 14 May 2010 08:27:45 -0500, "Peter Olcott"<NoSpam@OCR4Screen.com>  wrote:
>
>>
>> "Joseph M. Newcomer"<newcomer@flounder.com>  wrote in
>> message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph@4ax.com...
>>> No, an extremely verbose "You are going about this
>>> completely wrong".
>>> joe
>>
>> Which still avoids rather than answers my question. This was
>> at one time a very effective ruse to hide the fact that you
>> don't know the answer. I can see through this ruse now, so
>> there is no sense in my attempting to justify my design
>> decision to you. That would simply be a waste of time.
> ****
> I think I answered part of it.  The part that matters.  The part that says "this is
> wrong".  I did this by pointing out some counterexamples.
>
> I know the answer: Don't Do It That Way.  You are asking for a specific answer that will
> allow you to pursue a Really Bad Design Decision.  I'm not going to answer a bad question;
> I'm going to tell you what the correct solution is.  I'm avoiding the question because it
> is a really bad question, because you should be able to answer it yourself, and because
> giving an answer simply justifies a poor design.  I don't justify poor designs, I try to
> kill them.
>
> Only you could make a bad design decision and feel you have to justify it.  Particularly
> when the experts have already all told you it is a bad design decision, and you should not
> go that way.
> 				joe

If a decision is truly bad, then there must be dysfunctional results 
that make the decision a bad one. If dysfunctional results can not be 
provided, then the statement that it is a bad decision lacks sufficient 
support. My original intention was to use UTF-32 as my internal 
representation. I have not yet decided to alter this original decision.

The fact that someone provided an example where UTF-8 strings would 
often substantially vary in length provides the best counter example 
showing that your view is likely correct about internal representation.

In fact I will simply state that I am now convinced that UTF-32 is the 
best way to go.

I still MUST have a correct UTF-8 RegEx because my interpreter is 75% 
completed using Lex and Yacc. Besides this I need a good way to parse 
UTF-8 to convert it to UTF-32.
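
One possible shape for the UTF-8 to UTF-32 step, hand-written as a small state machine
instead of generated from the regular expression. It is a sketch under assumptions: the
name `decodeUtf8` and the return-false-on-error policy are illustrative choices, and the
accepted byte ranges are copied from rules U2 through U9 in the original post.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32 code points. Accepts exactly the byte ranges of
// ASCII and rules U2..U9; returns false at the first ill-formed sequence.
bool decodeUtf8(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;       // number of continuation bytes expected
        unsigned char lo = 0x80;     // allowed range of the FIRST continuation byte
        unsigned char hi = 0xBF;
        char32_t cp = 0;
        if (b0 <= 0x7F) {                          // ASCII
            cp = b0;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {     // U2
            extra = 1; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {     // U3..U6
            extra = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;             // excludes overlong forms
            if (b0 == 0xED) hi = 0x9F;             // excludes surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {     // U7..U9
            extra = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;             // excludes overlong forms
            if (b0 == 0xF4) hi = 0x8F;             // excludes values above U+10FFFF
        } else {
            return false;                          // byte can never start a sequence
        }
        if (in.size() - i <= extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if (b < lo || b > hi) return false;
            cp = (cp << 6) | (b & 0x3F);
            lo = 0x80; hi = 0xBF;                  // later bytes use the plain range
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

The Lex rules would still be needed for tokenizing; this only covers the conversion
step, one pass over the bytes with no backtracking.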
Reply Peter 5/17/2010 1:57:58 PM

On 5/16/2010 11:39 PM, Joseph M. Newcomer wrote:
>>
>> That is how I intend to use it. To internationalize my GUI
>> scripting language the interpreter will accept UTF-8 input
>> as its source code files. It is substantially implemented
>> using Lex and Yacc specifications for "C" that have been
>> adapted to implement a subset of C++.
> *****
> So why does the question matter?  Accepting UTF-8 input makes perfect sense, but the first
> thing you should do with it is convert it to UTF-16, or better still UTF-32.
> ****
>>
>> It was far easier (and far less error prone) to add the C++
>> that I needed to the "C" specification than it would have
>> been to remove what I do not need from the C++
>> specification.
> ***
> Huh?  What's this got to do with the encoding?
(1) Lex requires a RegEx

(2) I still must convert from UTF-8 to UTF-32, and I don't think that a 
faster or simpler way to do this besides a regular expression 
implemented as a finite state machine can possibly exist.


>> The actual language itself will store its strings as 32-bit
>> codepoints. The SymbolTable will not bother to convert its
>> strings from UTF-8. It turns out that UTF-8 byte sort order
>> is identical to Unicode code point sort order.
> ****
> Strange.  I thought sort order was locale-specific and independent of code points.  But
> then, maybe I just understand what is going on.

The SymbolTable only needs to be able to find its symbols in a std::map. 
Accounting for locale specific sort order is a waste of time in this case.
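
To illustrate the byte-order point, here is a minimal C++ sketch. It assumes the symbol table is a plain std::map keyed by UTF-8 std::string; the names symbol_table and sorted_symbols and the sample keys are made up for illustration. std::string compares its chars as unsigned values, so well-formed UTF-8 keys should come out in code point order with no conversion step.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Returns the keys of a small UTF-8 keyed map in the map's iteration order.
// symbol_table is a hypothetical stand-in for the interpreter's SymbolTable.
std::vector<std::string> sorted_symbols() {
    std::map<std::string, int> symbol_table;
    symbol_table["\xF0\x9F\x98\x80"] = 4;   // U+1F600, four bytes
    symbol_table["\xE2\x82\xAC"] = 3;       // U+20AC, three bytes
    symbol_table["\xC3\xA9"] = 2;           // U+00E9, two bytes
    symbol_table["z"] = 1;                  // U+007A, one byte
    std::vector<std::string> keys;
    for (const auto& entry : symbol_table) keys.push_back(entry.first);
    return keys;
}

// Usage check: the keys were inserted in descending code point order,
// but the map hands them back in ascending code point order.
void check_sorted_symbols() {
    const std::vector<std::string> keys = sorted_symbols();
    assert(keys.front() == "z");
    assert(keys.back() == "\xF0\x9F\x98\x80");
}
```

This only shows that lookup and iteration are well defined for UTF-8 keys; it says nothing about locale-aware collation, which is a separate problem.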
Reply Peter 5/17/2010 2:04:07 PM

On 5/16/2010 11:42 PM, Joseph M. Newcomer wrote:
> See below...
> On Sat, 15 May 2010 09:12:09 -0500, "Peter Olcott"<NoSpam@OCR4Screen.com>  wrote:
>
>> Joe also said that UTF-8 was designed for data interchange
>> which is how I will be using it. Joe also falsely assumed
>> that I would be using UTF-8 for my internal representation.
>> I will be using UTF-32 for my internal representation.
> ****
> But then, you would not need the UTF-8 regexps!  You would only need those if you were
> storing the data in UTF-8.  To give an external grammar to your language, you should give
> the UTF-32 regexps, and if necessary, you can TRANSLATE those to UTF-8, but you don't
> start with UTF-8.  The lex input would need to be in terms of UTF-32, so you would not be
> using UTF-8 there, either.

My source code encoding will be UTF-8. My interpreter is written in Yacc 
and Lex, and 75% complete. This makes a UTF-8 regular expression mandatory.

> ****
>>
>> I will be using UTF-8 as the source code for my language
>> interpreter, which has the advantage of simply being ASCII
>> for the English language, and working across every platform
>> without requiring adaptations such as Little Endian and Big
>> Endian. UTF-8 will also be the output of my OCR4Screen DFA
>> recognizer.
>>
>>>
>>> --
>>> Mihai Nita [Microsoft MVP, Visual C++]
>>> http://www.mihai-nita.net
>>> ------------------------------------------
>>> Replace _year_ with _ to get the real email
>>>
>>
> Joseph M. Newcomer [MVP]
> email: newcomer@flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

Reply Peter 5/17/2010 2:07:20 PM

See below...
On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott <NoSpam@OCR4Screen.com> wrote:

>On 5/16/2010 11:39 PM, Joseph M. Newcomer wrote:
>>>
>>> That is how I intend to use it. To internationalize my GUI
>>> scripting language the interpreter will accept UTF-8 input
>>> as its source code files. It is substantially implemented
>>> using Lex and Yacc specifications for "C" that have been
>>> adapted to implement a subset of C++.
>> *****
>> So why does the question matter?  Accepting UTF-8 input makes perfect sense, but the first
>> thing you should do with it is convert it to UTF-16, or better still UTF-32.
>> ****
>>>
>>> It was far easier (and far less error prone) to add the C++
>>> that I needed to the "C" specification than it would have
>>> been to remove what I do not need from the C++
>>> specification.
>> ***
>> Huh?  What's this got to do with the encoding?
>(1) Lex requires a RegEx
***
Your regexp should be in terms of UTF32, not UTF8.
****
>
>(2) I still must convert from UTF-8 to UTF-32, and I don't think that a 
>faster or simpler way to do this besides a regular expression 
>implemented as a finite state machine can possibly exist.
****
Actually, there is; you obviously know nothing about UTF-8, or you would know that the
high-order bits of the first byte tell you the length of the encoding, and the FSM is
written entirely in terms of the actual encoding, and is never written as a regexp.

RTFM.

You are expected to have spent a LITTLE time reading about a subject before asking a
question.
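
As a rough illustration of the point about the high-order bits, here is a small C++ sketch. The function name utf8_sequence_length is made up for this example, and the lead-byte ranges are the well-formed ones (0xC0, 0xC1 and anything above 0xF4 are rejected), so treat it as one possible reading, not the only way to write it.

```cpp
#include <cassert>
#include <cstdint>

// Returns the total sequence length implied by a lead byte, or 0 if the byte
// cannot start a well-formed sequence (a continuation byte, 0xC0, 0xC1, or
// anything above 0xF4).
int utf8_sequence_length(std::uint8_t lead) {
    if (lead <= 0x7F) return 1;                   // 0xxxxxxx
    if (lead >= 0xC2 && lead <= 0xDF) return 2;   // 110yyyyy
    if (lead >= 0xE0 && lead <= 0xEF) return 3;   // 1110zzzz
    if (lead >= 0xF0 && lead <= 0xF4) return 4;   // 11110uuu
    return 0;
}

// Usage check on one lead byte of each length and one continuation byte.
void check_sequence_lengths() {
    assert(utf8_sequence_length(0x41) == 1);
    assert(utf8_sequence_length(0xC3) == 2);
    assert(utf8_sequence_length(0xE2) == 3);
    assert(utf8_sequence_length(0xF0) == 4);
    assert(utf8_sequence_length(0x80) == 0);
}
```

Note that the length alone does not validate the continuation bytes; those still have to be checked separately.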
****
>
>
>>> The actual language itself will store its strings as 32-bit
>>> codepoints. The SymbolTable will not bother to convert its
>>> strings from UTF-8. It turns out that UTF-8 byte sort order
>>> is identical to Unicode code point sort order.
>> ****
>> Strange.  I thought sort order was locale-specific and independent of code points.  But
>> then, maybe I just understand what is going on.
>
>The SymbolTable only needs to be able to find its symbols in a std::map. 
>Accounting for locale specific sort order is a waste of time in this case.
****
OK, then it is not sort order, and the fact that the byte-encoded sort is in the same
order is irrelevant, so why did you mention it as if it had meaning?  std::map doesn't
care about what YOU mean by "sort order", it only requires byte sequences for keys, where
the interpretation of the byte sequence is a function of the data type.

But the text should already be in UTF-32!  Why are you wasting time worrying about UTF-8?
					joe
****
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Reply Joseph 5/17/2010 4:24:43 PM

See below...
On Mon, 17 May 2010 08:57:58 -0500, Peter Olcott <NoSpam@OCR4Screen.com> wrote:

>On 5/16/2010 11:33 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Fri, 14 May 2010 08:27:45 -0500, "Peter Olcott"<NoSpam@OCR4Screen.com>  wrote:
>>
>>>
>>> "Joseph M. Newcomer"<newcomer@flounder.com>  wrote in
>>> message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph@4ax.com...
>>>> No, an extremely verbose "You are going about this
>>>> completely wrong".
>>>> joe
>>>
>>> Which still avoids rather than answers my question. This was
>>> at one time a very effective ruse to hide the fact that you
>>> don't know the answer. I can see through this ruse now, so
>>> there is no sense in my attempting to justify my design
>>> decision to you. That would simply be a waste of time.
>> ****
>> I think I answered part of it.  The part that matters.  The part that says "this is
>> wrong".  I did this by pointing out some counterexamples.
>>
>> I know the answer: Don't Do It That Way.  You are asking for a specific answer that will
>> allow you to pursue a Really Bad Design Decision.  I'm not going to answer a bad question;
>> I'm going to tell you what the correct solution is.  I'm avoiding the question because it
>> is a really bad question, because you should be able to answer it yourself, and because
>> giving an answer simply justifies a poor design.  I don't justify poor designs, I try to
>> kill them.
>>
>> Only you could make a bad design decision and feel you have to justify it.  Particularly
>> when the experts have already all told you it is a bad design decision, and you should not
>> go that way.
>> 				joe
>
>If a decision is truly bad, then there must be dysfunctional results 
>that make the decision a bad one. If dysfunctional results can not be 
>provided, then the statement that it is a bad decision lacks sufficient 
>support. My original intention was to use UTF-32 as my internal 
>representation. I have not yet decided to alter this original decision.
****
Dysfunctional results:
	Horrible costs to do regexp manipulation when none is needed
	Added complexity distributed uniformly over the entire implementation
	Actually not correct because it ignores
		-localized punctuation
		-localized numbers
		-bidirectional text
Other than it is needlessly complex, horribly inefficient, and wrong, what more do you
need to know?
***
>
>The fact that someone provided an example where UTF-8 strings would 
>often substantially vary in length provides the best counter example 
>showing that your view is likely correct about internal representation.
****
But is that not obvious at the beginning?  You should have realized that!
****
>
>In fact I will simply state that I am now convinced that UTF-32 is the 
>best way to go.
>
>I still MUST have a correct UTF-8 RegEx because my interpreter is 75% 
>completed using Lex and Yacc. Besides this I need a good way to parse 
>UTF-8 to convert it to UTF-32.
****
No, it sucks.  For reasons I have pointed out.  You can easily write a UTF-32 converter
just based on the table in the Unicode 5.0 manual!  

I realized that I have this information on a slide in my course, which is on my laptop, so
with a little copy-and-paste-and-reformat, here's the table.  Note that no massive FSM
recognition is required to do the conversion, and it is even questionable as to whether an
FSM is required at all!

All symbols represent bits, and x, y, u, z and w are metasymbols for bits that can be
either 0 or 1

UTF-32		00000000 00000000 00000000 0xxxxxxx
UTF-16		00000000 0xxxxxxx
UTF-8		0xxxxxxx

UTF-32		00000000 00000000 00000yyy yyxxxxxx
UTF-16		00000yyy yyxxxxxx
UTF-8		110yyyyy 10xxxxxx

UTF-32		00000000 00000000 zzzzyyyy yyxxxxxx
UTF-16		zzzzyyyy yyxxxxxx
UTF-8		1110zzzz 10yyyyyy 10xxxxxx

UTF-32		00000000 000uuuuu zzzzyyyy yyxxxxxx
UTF-16		110110ww wwzzzzyy 110111yy yyxxxxxx*
UTF-8		11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

* uuuuu = wwww + 1
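
The bit layouts in the table can be turned into a converter fairly directly. The following is a sketch in C++, not a definitive implementation: the function name utf8_to_utf32 is made up, and it assumes the input has already been validated elsewhere (for example by the lexer), so malformed or truncated input simply stops the conversion early instead of being diagnosed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes UTF-8 into UTF-32 code points using the bit layouts in the table.
// Assumption: the input is well formed; on a stray or truncated sequence the
// loop stops and returns whatever was decoded so far.
std::vector<char32_t> utf8_to_utf32(const std::string& in) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }          // 0xxxxxxx
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }   // 110yyyyy
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }   // 1110zzzz
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }   // 11110uuu
        else break;                                   // not a lead byte
        if (i + len > in.size()) break;               // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            // Each continuation byte 10xxxxxx contributes its low six bits.
            cp = (cp << 6) | (static_cast<std::uint8_t>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Usage check: "A", U+00E9, U+20AC and U+1F600 in one string.
void check_utf8_to_utf32() {
    const std::vector<char32_t> cps =
        utf8_to_utf32("A\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80");
    assert(cps.size() == 4);
    assert(cps[3] == 0x1F600);
}
```

The loop is driven purely by the lead byte and shifts, which is the sense in which no regular expression is needed for the conversion itself.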
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Reply Joseph 5/17/2010 4:51:07 PM

"Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
news:lI2dnTWeE-MF0GzWnZ2dnUVZ_gadnZ2d@giganews.com...
> My source code encoding will be UTF-8. My interpreter is written in Yacc 
> and Lex, and 75% complete. This makes a UTF-8 regular expression 
> mandatory.

Peter,
You keep mentioning that your interpreter is 75% complete. Forgive my morbid 
curiosity but what exactly does that mean?

-Pete 


Reply Pete 5/17/2010 5:42:19 PM

On 5/17/2010 11:24 AM, Joseph M. Newcomer wrote:
> See below...
> On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott<NoSpam@OCR4Screen.com>  wrote:
>
>>> Huh?  What's this got to do with the encoding?
>> (1) Lex requires a RegEx
> ***
> Your regexp should be in terms of UTF32, not UTF8.

Wrong, Lex can not handle data larger than bytes.

> ****
>>
>> (2) I still must convert from UTF-8 to UTF-32, and I don't think that a
>> faster or simpler way to do this besides a regular expression
>> implemented as a finite state machine can possibly exist.
> ****
> Actually, there is; you obviously know nothing about UTF-8, or you would know that the
> high-order bits of the first byte tell you the length of the encoding, and the FSM is
> written entirely in terms of the actual encoding, and is never written as a regexp.

Ah I see, if I don't know everything that I must know nothing, I think 
that this logic is flawed. None of the docs that I read mentioned this 
nuance. It may prove to be useful. It looks like this will be most 
helpful when translating from UTF-32 to UTF-8, and not the other way 
around.

It would still seem to be slower and more complex than a DFA based 
finite state machine for validating a UTF-8 byte sequence. It also looks 
like it would be slower for translating from UTF-8 to UTF-32.
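
For comparison, here is a hand-written validator sketch in C++ that uses the same byte ranges as the ASCII and U2 through U9 patterns from the first post in this thread. The names in_range and is_valid_utf8 are made up for this example, and no claim is made here about how its speed compares with a Lex-generated DFA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

namespace {
bool in_range(std::uint8_t b, std::uint8_t lo, std::uint8_t hi) {
    return b >= lo && b <= hi;
}
}  // namespace

// Returns true when every byte of s belongs to a well-formed UTF-8 sequence.
// Only the second byte of a sequence ever has a narrowed range (lo..hi);
// all later bytes must be plain continuation bytes 0x80..0xBF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }                  // ASCII
        std::size_t len = 0;
        std::uint8_t lo = 0x80, hi = 0xBF;
        if (in_range(b0, 0xC2, 0xDF)) len = 2;              // U2
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }        // U3
        else if (in_range(b0, 0xE1, 0xEC)) len = 3;         // U4
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }        // U5
        else if (in_range(b0, 0xEE, 0xEF)) len = 3;         // U6
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }        // U7
        else if (in_range(b0, 0xF1, 0xF3)) len = 4;         // U8
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }        // U9
        else return false;             // 0x80..0xC1 or 0xF5..0xFF as lead byte
        if (i + len > n) return false;                      // truncated sequence
        if (!in_range(static_cast<std::uint8_t>(s[i + 1]), lo, hi)) return false;
        for (std::size_t k = 2; k < len; ++k) {
            if (!in_range(static_cast<std::uint8_t>(s[i + k]), 0x80, 0xBF)) return false;
        }
        i += len;
    }
    return true;
}

// Usage check: one accepted sequence and one encoded surrogate (rejected by U5).
void check_is_valid_utf8() {
    assert(is_valid_utf8("\xE2\x82\xAC"));
    assert(!is_valid_utf8("\xED\xA0\x80"));
}
```

The narrowed second-byte ranges are what reject overlong forms, encoded surrogates, and values above U+10FFFF, which is the same job the U3, U5, U7 and U9 patterns do in the regular expression.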

Reply Peter 5/17/2010 7:05:54 PM

On 5/17/2010 12:42 PM, Pete Delgado wrote:
> "Peter Olcott"<NoSpam@OCR4Screen.com>  wrote in message
> news:lI2dnTWeE-MF0GzWnZ2dnUVZ_gadnZ2d@giganews.com...
>> My source code encoding will be UTF-8. My interpreter is written in Yacc
>> and Lex, and 75% complete. This makes a UTF-8 regular expression
>> mandatory.
>
> Peter,
> You keep mentioning that your interpreter is 75% complete. Forgive my morbid
> curiosity but what exactly does that mean?
>
> -Pete
>
>

The Yacc and Lex are done and working and correctly translate all input 
into a corresponding abstract syntax tree. The control flow portion of 
the code generator is done and correctly translates control flow 
statements into corresponding jump code with minimum branches. The 
detailed design for everything else is complete. The everything else 
mostly involves how to handle all of the data types including objects, 
and also including elemental operations upon these data types and objects.
Reply Peter 5/17/2010 7:31:06 PM

"Peter Olcott" <NoSpam@OCR4Screen.com> wrote in message 
news:a_WdndCS8YLmBGzWnZ2dnUVZ_o6dnZ2d@giganews.com...
> On 5/17/2010 12:42 PM, Pete Delgado wrote:
>> "Peter Olcott"<NoSpam@OCR4Screen.com>  wrote in message
>> news:lI2dnTWeE-MF0GzWnZ2dnUVZ_gadnZ2d@giganews.com...
>>> My source code encoding will be UTF-8. My interpreter is written in Yacc
>>> and Lex, and 75% complete. This makes a UTF-8 regular expression
>>> mandatory.
>>
>> Peter,
>> You keep mentioning that your interpreter is 75% complete. Forgive my 
>> morbid
>> curiosity but what exactly does that mean?
>>
>> -Pete
>>
>>
>
> The Yacc and Lex are done and working and correctly translate all input 
> into a corresponding abstract syntax tree. The control flow portion of the 
> code generator is done and correctly translates control flow statements 
> into corresponding jump code with minimum branches. The detailed design 
> for everything else is complete. The everything else mostly involves how 
> to handle all of the data types including objects, and also including 
> elemental operations upon these data types and objects.

Peter,
Correct me if I'm wrong, but it sounds to me as if you are throwing 
something at Yacc and Lex and are getting something you think is 
"reasonable" out of them but you haven't been able to test the validity of 
the output yet since the remainder of the coding to your spec is yet to be 
done. Is that a true statement?


-Pete 


Reply Pete 5/17/2010 7:47:21 PM

On 5/17/2010 11:51 AM, Joseph M. Newcomer wrote:
> See below...
> On Mon, 17 May 2010 08:57:58 -0500, Peter Olcott<NoSpam@OCR4Screen.com>  wrote:
>>
>>
>> I still MUST have a correct UTF-8 RegEx because my interpreter is 75%
>> completed using Lex and Yacc. Besides this I need a good way to parse
>> UTF-8 to convert it to UTF-32.
> ****
> No, it sucks.  For reasons I have pointed out.  You can easily write a UTF-32 converter
> just based on the table in the Unicode 5.0 manual!

Lex can ONLY handle bytes. Lex apparently can handle the RegEx that I 
posted. I am basically defining every UTF-8 byte sequence above the 
ASCII range as a valid Letter that can be used in an Identifier.
[A-Za-z_] can also be used as a Letter.

>
> I realized that I have this information on a slide in my course, which is on my laptop, so
> with a little copy-and-paste-and-reformat, here's the table.  Note that no massive FSM
> recognition is required to do the conversion, and it is even questionable as to whether an
> FSM is required at all!
>
> All symbols represent bits, and x, y, u, z and w are metasymbols for bits that can be
> either 0 or 1
>
> UTF-32		00000000 00000000 00000000 0xxxxxxx
> UTF-16		00000000 0xxxxxxx
> UTF-8		0xxxxxxx
>
> UTF-32		00000000 00000000 00000yyy yyxxxxxx
> UTF-16		00000yyy yyxxxxxx
> UTF-8		110yyyyy 10xxxxxx
>
> UTF-32		00000000 00000000 zzzzyyyy yyxxxxxx
> UTF-16		zzzzyyyy yyxxxxxx
> UTF-8		1110zzzz 10yyyyyy 10xxxxxx
>
> UTF-32		00000000 000uuuuu zzzzyyyy yyxxxxxx
> UTF-16		110110ww wwzzzzyy 110111yy yyxxxxxx*
> UTF-8		11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
>
> * uuuuu = wwww + 1
> Joseph M. Newcomer [MVP]
> email: newcomer@flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

I was aware of the encodings between UTF-8 and UTF-32, the encoding to 
UTF-16 looks a little clumsy when we get to four UTF-8 bytes.
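
The four-byte case that looks clumsy is the surrogate pair. A minimal C++ sketch of that step follows; the function name utf32_to_utf16 is made up, and it assumes the code point is a valid scalar value (not itself a surrogate and not above U+10FFFF).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes one code point as UTF-16. Code points above U+FFFF become a
// surrogate pair; subtracting 0x10000 is the "wwww = uuuuu - 1" step
// from the table and leaves a 20-bit value split into two 10-bit halves.
// Assumption: cp is a valid Unicode scalar value.
std::vector<char16_t> utf32_to_utf16(char32_t cp) {
    if (cp < 0x10000) return { static_cast<char16_t>(cp) };
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
    return {
        static_cast<char16_t>(0xD800 | (v >> 10)),     // 110110ww wwzzzzyy
        static_cast<char16_t>(0xDC00 | (v & 0x3FF))    // 110111yy yyxxxxxx
    };
}

// Usage check: U+1F600 becomes the pair D83D DE00.
void check_utf32_to_utf16() {
    const std::vector<char16_t> units = utf32_to_utf16(0x1F600);
    assert(units.size() == 2);
    assert(units[0] == 0xD83D);
    assert(units[1] == 0xDE00);
}
```

None of this is needed if the interpreter goes straight from UTF-8 to UTF-32; it only matters when UTF-16 is the target.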
Reply Peter 5/17/2010 7:53:39 PM

On 5/17/2010 2:47 PM, Pete Delgado wrote:
> "Peter Olcott"<NoSpam@OCR4Screen.com>  wrote in message
> news:a_WdndCS8YLmBGzWnZ2dnUVZ_o6dnZ2d@giganews.com...
>> On 5/17/2010 12:42 PM, Pete Delgado wrote:
>>> "Peter Olcott"<NoSpam@OCR4Screen.com>   wrote in message
>>> news:lI2dnTWeE-MF0GzWnZ2dnUVZ_gadnZ2d@giganews.com...
>>>> My source code encoding will be UTF-8. My interpreter is written in Yacc
>>>> and Lex, and 75% complete. This makes a UTF-8 regular expression
>>>> mandatory.
>>>
>>> Peter,
>>> You keep mentioning that your interpreter is 75% complete. Forgive my
>>> morbid
>>> curiosity but what exactly does that mean?
>>>
>>> -Pete
>>>
>>>
>>
>> The Yacc and Lex are done and working and correctly translate all input
>> into a corresponding abstract syntax tree. The control flow portion of the
>> code generator is done and correctly translates control flow statements
>> into corresponding jump code with minimum branches. The detailed design
>> for everything else is complete. The everything else mostly involves how
>> to handle all of the data types including objects, and also including
>> elemental operations upon these data types and objects.
>
> Peter,
> Correct me if I'm wrong, but it sounds to me as if you are throwing
> something at Yacc and Lex and are getting something you think is
> "reasonable" out of them but you haven't been able to test the validity of
> the output yet since the remainder of the coding to your spec is yet to be
> done. Is that a true statement?
>
>
> -Pete
>
>

That is just as true as the statement that no one is more Jewish than 
the pope.

I started with a known good Yacc and Lex specification for "C"
   http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
   http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

I carefully constructed code to build a very clean abstract syntax tree. 
I made absolute minimal transformations to the original Yacc and Lex 
spec to attain the subset of C++ that I will be implementing.
(Basically the "C" language plus classes and minus pointers).

I used this abstract syntax tree to generate jump code with the absolute 
minimum number of branches for the control flow statements that this 
C++ subset will be providing. Testing has indicated that these results 
have been correctly achieved.

Then I designed the whole infrastructure to handle every type of 
elemental operation upon the fundamental data types, as well as the 
aggregate data types.
Reply Peter 5/17/2010 8:24:03 PM
