Is this UTF-8 regular expression semantically correct?

BYTE_ORDER_MARK   [0\xEF][0\xBB][0\xBF]
ASCII [\x0-\x7f]
U2    [\xC2-\xDF][\x80-\xBF]
U3    [\xE0][\xA0-\xBF][\x80-\xBF]
U4    [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5    [\xED][\x80-\x9F][\x80-\xBF]
U6    [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7    [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8    [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9    [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
U     {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9} 


0
Peter
5/11/2010 2:38:33 PM

> BYTE_ORDER_MARK   [0\xEF][0\xBB][0\xBF]
> ASCII [\x0-\x7f]
> U2    [\xC2-\xDF][\x80-\xBF]
> U3    [\xE0][\xA0-\xBF][\x80-\xBF]
> U4    [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
> U5    [\xED][\x80-\x9F][\x80-\xBF]
> U6    [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
> U7    [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
> U8    [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
> U9    [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
> U     {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9} 

I did not thoroughly check all regular expressions, because it is a bad idea.

But quickly:
  1. No, at a quick look it is not correct. Too many options for 3- and 4-byte
     sequences, 80 is never valid as trailing byte, why zeros in BOM,
     probably more.
  2. Try using a Unicode-aware regular expression engine, forget this crap.
  3. In general regexp are not strong enough to fully validate UTF-8
     (if validation is what you want)


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Mihai
5/12/2010 7:50:43 AM
"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D76899C99B4MihaiN@207.46.248.16...
>
>> BYTE_ORDER_MARK   [0\xEF][0\xBB][0\xBF]
>> ASCII [\x0-\x7f]
>> U2    [\xC2-\xDF][\x80-\xBF]
>> U3    [\xE0][\xA0-\xBF][\x80-\xBF]
>> U4    [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>> U5    [\xED][\x80-\x9F][\x80-\xBF]
>> U6    [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>> U7    [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>> U8    [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>> U9    [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>> U     {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>
> I did not thoroughly check all regular expressions, 
> because it is a bad idea.
>
> But quickly:
>  1. No, at a quick look it is not correct. Too many options 
> for 3- and 4-byte
>     sequences, 80 is never valid as trailing byte, why 
> zeros in BOM,
>     probably more.

I think the number of options for three- and four-byte 
sequences may be required. Here is my original source; it 
has the same number of options for a four-byte sequence.
  http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex

This says that 80-BF is the required range for a trailing 
byte.
  http://en.wikipedia.org/wiki/UTF-8

>  2. Try using a Unicode-aware regular expression engine, 
> forget this crap.
>  3. In general regexp are not strong enough to fully 
> validate UTF-8
>     (if validation is what you want)
>

I am writing a compiler that takes UTF-8 input, so I must 
have a correct regular expression to be used by the lexical 
analyzer.

>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Peter
5/12/2010 2:16:05 PM
> I am writing a compiler that takes UTF-8 input, so I must 
> have a correct regular expression to be used by the lexical 
> analyzer.

The more I look at them, the wronger they seem :-)
Really, if you write a compiler, then forger regexp.



-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Mihai
5/13/2010 10:00:01 AM
"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D771E851DEFBMihaiN@207.46.248.16...
>
>> I am writing a compiler that takes UTF-8 input, so I must
>> have a correct regular expression to be used by the 
>> lexical
>> analyzer.
>
> The more I look at them, the wronger they seem :-)
> Really, if you write a compiler, then forger regexp.
>

I have no idea what you are saying about forger regexp.
I have been able to derive a process for reverse-engineering 
and empirically validating the correct regular expression.

I guess that you didn't bother to look at an almost 
identical regular expression that has been published and 
critiqued for years.
  http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex

>
>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Peter
5/13/2010 11:47:27 AM
"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D771E851DEFBMihaiN@207.46.248.16...
>
>> I am writing a compiler that takes UTF-8 input, so I must
>> have a correct regular expression to be used by the 
>> lexical
>> analyzer.
>
> The more I look at them, the wronger they seem :-)
> Really, if you write a compiler, then forger regexp.

Ah maybe you are saying forget regexp, I can't because the 
compiler is based on lex and yacc. I am writing a simplified 
C++ interpreter by slightly modifying the correct lex and 
yacc syntax for "C".

>
>
>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Peter
5/13/2010 11:50:50 AM
Regular expressions are often used to define the lexical components of a language.

This does not suggest that using a regexp recognizer is a sensible implementation of a
compiler.

In general, we build FSMs to recognize lexical elements, and PDAs (Push Down Automata, not
pocket-sized little computers) to recognize syntactic elements.

Often these are generated by programs such as Bison and YACC, and in many cases are just
hand-written.  Personally, I write my lexers as a switch-based FSM, and use recursive
descent to write my parser.  I throw exceptions when there are lexical or syntactic
errors.

I thought I understood the question until the phrase "writing a compiler" appeared.

Taking UTF-8 input is not the same as saying "I am doing lexical analysis on UTF-8 text".
Typically, what I would do is take UTF-8 input and immediately convert it to Unicode, and
work in terms of Unicode internally.  

The problem here is defining the "alphabetic" and "numeric" characters; fortunately,
isalpha, isalnum, isdigit, etc. seem to be locale-aware, and you could always use the
Unicode-related APIs to determine a character class.  Also, check out the Unicode tab in
my Locale Explorer.
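
As a rough illustration of the "decode first, then classify" approach described above, here is a minimal Python sketch. It is not the C/Win32 route mentioned in the post: it assumes Python's built-in strict UTF-8 decoder and the standard unicodedata module as stand-ins, and the names decode_source and classify are made up for this example.

```python
import unicodedata

def decode_source(raw: bytes) -> str:
    # Strict decoding raises UnicodeDecodeError on malformed UTF-8.
    # The "utf-8-sig" codec also drops a leading byte order mark if present.
    return raw.decode("utf-8-sig", errors="strict")

def classify(ch: str) -> str:
    # Hypothetical coarse character classes built on Unicode general categories.
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "letter"
    if cat == "Nd":
        return "digit"
    if cat.startswith("M"):
        return "mark"
    return "other"

text = decode_source(b"\xEF\xBB\xBFx = 1")
classes = [classify(ch) for ch in text if not ch.isspace()]
```

With this split, the lexer rules can be written over code points and character classes, and the byte-level encoding question is settled once at the input boundary.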
				joe


On Thu, 13 May 2010 03:00:01 -0700, "Mihai N." <nmihai_year_2000@yahoo.com> wrote:

>
>> I am writing a compiler that takes UTF-8 input, so I must 
>> have a correct regular expression to be used by the lexical 
>> analyzer.
>
>The more I look at them, the wronger they seem :-)
>Really, if you write a compiler, then forger regexp.
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
0
Joseph
5/13/2010 3:21:48 PM
But lex does not use regexp recognizers at all!  Instead, it takes a regexp as a
specification and GENERATES information for doing lexical analysis with a FSM recognizer,
which becomes a component of a compiler based most commonly on yacc.
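
To make the regexp-versus-FSM distinction concrete, here is a hand-written sketch of the kind of state machine that the UTF-8 rules in this thread describe. It is not what lex actually emits (lex generates table-driven C code); the function name utf8_fsm_accepts and the state layout are assumptions made for this example only.

```python
def utf8_fsm_accepts(data: bytes) -> bool:
    # State: number of trailing bytes still expected, plus the byte range
    # allowed for the next trailing byte (tightened after E0, ED, F0, F4).
    remaining = 0
    lo, hi = 0x80, 0xBF
    for b in data:
        if remaining == 0:
            if b <= 0x7F:
                continue                              # ASCII
            elif 0xC2 <= b <= 0xDF:
                remaining, lo, hi = 1, 0x80, 0xBF     # U2
            elif b == 0xE0:
                remaining, lo, hi = 2, 0xA0, 0xBF     # U3
            elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
                remaining, lo, hi = 2, 0x80, 0xBF     # U4, U6
            elif b == 0xED:
                remaining, lo, hi = 2, 0x80, 0x9F     # U5
            elif b == 0xF0:
                remaining, lo, hi = 3, 0x90, 0xBF     # U7
            elif 0xF1 <= b <= 0xF3:
                remaining, lo, hi = 3, 0x80, 0xBF     # U8
            elif b == 0xF4:
                remaining, lo, hi = 3, 0x80, 0x8F     # U9
            else:
                return False                          # C0, C1, F5..FF, stray trail byte
        else:
            if not lo <= b <= hi:
                return False
            remaining -= 1
            lo, hi = 0x80, 0xBF                       # later trail bytes use the full range
    return remaining == 0                             # reject truncated sequences
```

The machine has a small, fixed number of states, which is what makes the same rules expressible as a regular expression in the first place.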

So are you talking about input to lex?  If so, say so. Otherwise, the assumption is that
you are working with some regexp recognizer subroutine such as those found in codeproject.
Or the one found in the FreeBSD library.  Do a google search for CRegExp and find all the
available libraries.  Otherwise, you would have said "I'm using lex, and I have this
question about its input".  And the answer is, if you are not talking about lex, you are
going after the problem in the wrong way.

If you are working with lex and yacc, there are newsgroups devoted to these topics in the
comp hierarchy.  If you ask about regexp in this NG, we will assume you are talking about
the kinds of libraries that exist that you can call from C++.
					joe

On Thu, 13 May 2010 06:50:50 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>
>"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
>news:Xns9D771E851DEFBMihaiN@207.46.248.16...
>>
>>> I am writing a compiler that takes UTF-8 input, so I must
>>> have a correct regular expression to be used by the 
>>> lexical
>>> analyzer.
>>
>> The more I look at them, the wronger they seem :-)
>> Really, if you write a compiler, then forger regexp.
>
>Ah maybe you are saying forget regexp, I can't because the 
>compiler is based on lex and yacc. I am writing a simplified 
>C++ interpreter by slightly modifying the correct lex and 
>yacc syntax for "C".
>
>>
>>
>>
>> -- 
>> Mihai Nita [Microsoft MVP, Visual C++]
>> http://www.mihai-nita.net
>> ------------------------------------------
>> Replace _year_ with _ to get the real email
>> 
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
0
Joseph
5/13/2010 3:35:40 PM
"Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
message news:e96ou5taaufi8fqmmmaf0dl7m2hkn69po8@4ax.com...
> If you are working with lex and yacc, there are newsgroups 
> devoted to these topics in the
> comp hierarchy.  If you ask about regexp in this NG, we 
> will assume you are talking about
> the kinds of libraries that exist that you can call from 
> C++.
> joe

I already tried comp.compilers, and am aware of no lex or 
yacc specific groups.


0
Peter
5/13/2010 4:35:13 PM
Well, it has been a long number of years since I last was there.  They may have evaporated
due to lack of activity.
				joe

On Thu, 13 May 2010 11:35:13 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newcomer@flounder.com> wrote in 
>message news:e96ou5taaufi8fqmmmaf0dl7m2hkn69po8@4ax.com...
>> If you are working with lex and yacc, there are newsgroups 
>> devoted to these topics in the
>> comp hierarchy.  If you ask about regexp in this NG, we 
>> will assume you are talking about
>> the kinds of libraries that exist that you can call from 
>> C++.
>> joe
>
>I already tried comp.compilers, and am aware of no lex or 
>yacc specific groups.
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
0
Joseph
5/13/2010 5:39:52 PM
On May 12, 2:50 am, "Mihai N." <nmihai_year_2...@yahoo.com> wrote:
> > BYTE_ORDER_MARK   [0\xEF][0\xBB][0\xBF]
> > ASCII [\x0-\x7f]
> > U2    [\xC2-\xDF][\x80-\xBF]
> > U3    [\xE0][\xA0-\xBF][\x80-\xBF]
> > U4    [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
> > U5    [\xED][\x80-\x9F][\x80-\xBF]
> > U6    [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
> > U7    [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
> > U8    [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
> > U9    [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
> > U     {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>
> I did not thoroughly check all regular expressions, because it is a bad idea.
>
> But quickly:
>   1. No, at a quick look it is not correct. Too many options for 3- and 4-byte
>      sequences, 80 is never valid as trailing byte, why zeros in BOM,
>      probably more.
>   2. Try using a Unicode-aware regular expression engine, forget this crap.
>   3. In general regexp are not strong enough to fully validate UTF-8
>      (if validation is what you want)
>
> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email


The solution is based on the GREEN portions of the first chart shown
at this link:
  http://www.w3.org/2005/03/23-lex-U

Here is the corrected regular expression for UTF-8. A semantically
identical regular expression is also found at the above link.

UTF8_BYTE_ORDER_MARK   [\xEF][\xBB][\xBF]

ASCII     [\x0-\x7F]

U1          [a-zA-Z_]
U2          [\xC2-\xDF][\x80-\xBF]
U3          [\xE0][\xA0-\xBF][\x80-\xBF]
U4          [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5          [\xED][\x80-\x9F][\x80-\xBF]
U6          [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7          [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8          [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9          [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]

UTF8   {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}

{U1} is used instead of {ASCII} to restrict matching to an alphabetic
letter (or underscore).
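
For readers who want to try the definitions without lex, here is a minimal sketch that transcribes them into a single Python bytes pattern and spot-checks a few sequences. The names UTF8_CHAR, UTF8_STRING and is_valid_utf8 are made up for this example, and the test inputs are my own choices, not taken from the linked chart.

```python
import re

# Transcription of the ASCII and U2..U9 definitions above into one bytes pattern.
UTF8_CHAR = (
    rb"[\x00-\x7F]"                                     # ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"                         # U2
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"                     # U3
    rb"|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]"              # U4
    rb"|\xED[\x80-\x9F][\x80-\xBF]"                     # U5
    rb"|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]"              # U6
    rb"|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]"          # U7
    rb"|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]"   # U8
    rb"|\xF4[\x80-\x8F][\x80-\xBF][\x80-\xBF]"          # U9
)
UTF8_STRING = re.compile(rb"(?:" + UTF8_CHAR + rb")*")

def is_valid_utf8(data: bytes) -> bool:
    # The whole input must be a sequence of well-formed characters.
    return UTF8_STRING.fullmatch(data) is not None

samples = {
    "two-byte U+0080 (trailing byte 80)": b"\xC2\x80",
    "overlong encoding of NUL": b"\xC0\x80",
    "surrogate U+D800": b"\xED\xA0\x80",
    "beyond U+10FFFF": b"\xF4\x90\x80\x80",
}
results = {name: is_valid_utf8(seq) for name, seq in samples.items()}
```

The split of the three-byte and four-byte cases into several rules is what excludes overlong forms, surrogates, and values above U+10FFFF; a single rule per length would accept them.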
0
SeeScreen
5/13/2010 7:47:00 PM
> Ah maybe you are saying forget regexp
Just a typo, and your guess is right, it was "forget"

> I can't because the 
> compiler is based on lex and yacc. I am writing a simplified 
> C++ interpreter by slightly modifying the correct lex and 
> yacc syntax for "C".
Ok, see joe's comments about using regexp for compilers.


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Mihai
5/14/2010 8:00:29 AM
> I can't because the 
> compiler is based on lex and yacc. I am writing a simplified 
> C++ interpreter by slightly modifying the correct lex and 
> yacc syntax for "C".

Saying that the expression is used in a lex/yacc context makes a big 
difference, because that implies that there is a state machine
somewhere that can track the context.

Otherwise it is like saying
"I am writing a compiler that takes C input"
and then showing a regular expression like if|else|while|do
that you use to detect the C keywords.
A regexp using that will accept "bif" as input, lex will not :-)
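
The "bif" point can be sketched in a few lines of Python. The first function is a bare regexp search; the second is a toy tokenizer that imitates lex's matching policy (longest match wins, and on a tie the rule listed first wins). The rule set and the names bare_search and tokenize are assumptions for this example, not part of any real lex specification.

```python
import re

KEYWORDS = re.compile(r"if|else|while|do")

def bare_search(text):
    # A plain search is happy to find a keyword inside a longer word.
    m = KEYWORDS.search(text)
    return m.group(0) if m else None

# Toy rule list; order matters only to break ties between equal-length matches.
RULES = [
    ("KEYWORD", re.compile(r"if|else|while|do")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("SPACE", re.compile(r"\s+")),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Keep the longest match; an earlier rule wins a tie.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"lexical error at offset {pos}")
        name, m = best
        if name != "SPACE":
            tokens.append((name, m.group(0)))
        pos = m.end()
    return tokens
```

Here bare_search("bif") returns "if", while tokenize("bif") yields a single IDENT token, because the identifier rule matches more characters from the start of the input.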


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Mihai
5/14/2010 8:18:26 AM
"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D78D4CBF598MihaiN@207.46.248.16...
>> I can't because the
>> compiler is based on lex and yacc. I am writing a 
>> simplified
>> C++ interpreter by slightly modifying the correct lex and
>> yacc syntax for "C".
>
> Saying that the expression is used in a lex/yacc context 
> makes a big
> difference, because that implies that there is a state 
> machine
> somewhere that can track the context.
>

Possibly, but, I was really only looking for a yes or no 
answer.

Also I am unaware of any reasonable alternative to a finite 
state machine for processing regular expressions. From my 
point of view regular expressions and finite state machines 
are mutually dependent upon each other. I see no other view 
that could possibly be correct.

I guess no one here, or anywhere else knows whether or not 
the regular expression is correct.  This leaves me with the 
much more time consuming option of empirical validation.
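
The empirical validation need not take long. Below is a sketch of one way to do it in Python, under two assumptions of mine: that Python's own UTF-8 encoder is a trustworthy reference, and that the rule table is a faithful copy of the ASCII and U2..U9 definitions earlier in the thread. It checks that every Unicode scalar value encodes to a sequence the pattern accepts, and that the pattern accepts exactly as many sequences as there are scalar values. Because distinct code points have distinct encodings, the two checks together imply the pattern accepts the well-formed single-character sequences and nothing else.

```python
import re

# Each rule is a list of (low, high) byte ranges, one per byte position.
RULES = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), (0x80, 0xBF)],
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    [(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
]

def rule_to_regex(rule):
    # Build e.g. [\xE0-\xE0][\xA0-\xBF][\x80-\xBF] from the range list.
    return b"".join(b"[\\x%02X-\\x%02X]" % (lo, hi) for lo, hi in rule)

PATTERN = re.compile(b"|".join(rule_to_regex(rule) for rule in RULES))

def accepted_count():
    # The lead-byte ranges are disjoint, so per-rule counts simply add up.
    total = 0
    for rule in RULES:
        n = 1
        for lo, hi in rule:
            n *= hi - lo + 1
        total += n
    return total

def all_scalar_values_match():
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values and have no UTF-8 form
        if PATTERN.fullmatch(chr(cp).encode("utf-8")) is None:
            return False
    return True

# Scalar values: 0x110000 code points minus 0x800 surrogates.
EXPECTED = 0x110000 - 0x800
```

If accepted_count() equals EXPECTED and all_scalar_values_match() returns True, the answer to the yes/no question is "yes" for well-formedness of individual characters. It says nothing about which characters count as letters.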

> Otherwise it is like saying
> "I am writing a compiler that takes C input"
> and then showing a regular expression like if|else|while|do
> that you use to detect the C keywords.
> A regexp using that will accept "bif" as input, lex will 
> not :-)
>
>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Peter
5/14/2010 6:38:56 PM
> Possibly, but, I was really only looking for a yes or no 
> answer.

If you wanted a yes/no answer you should give complete info
(like the fact that you are talking about a lex context).
Otherwise you will very likely get a wrong answer.


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Mihai
5/15/2010 10:23:17 AM
"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
news:Xns9D7922762D9EBMihaiN@207.46.248.16...
>
>
>> Possibly, but, I was really only looking for a yes or no
>> answer.
>
> If you wanted a yes/no answer you should give complete 
> info
> (like the fact that you are talking about a lex context).
> Otherwise you will very likely get a wrong answer.

I don't see why this would be the case for a yes or no 
question.

>
>
> -- 
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
> 


0
Peter
5/15/2010 2:20:42 PM
Because without the context it is not a valid question.

For example, since this is a C++/MFC group, the question might have been in terms of a
regexp library, which suggests you are using UTF-8 internally, which would be wrong.

But as stated, the question is wrong, because you are presuming an over-simplified concept
of "letter", for which I have already pointed out there are failures (numbers in other
languages).  You would have to deal with all accent marks, and while some languages have
e-umlaut, i-umlaut and y-umlaut, these are not letters in German, and so you have to take
into account the localization context to determine if they really are "letters".  And in
Chinese, a single glyph may be a "word" and thus two of these in sequence would be
syntactically illegal.  So how do you define "letter"?  And in some cases, the accent mark
is a separate codepoint, so a separate UTF-8 encoding, but you can't combine that accent
mark with any but a few letters, so the regexp does not account for these at all!

What about RTL encodings?  In Hebrew, which I will simplify for NG syntax, if I wanted to
write ABC it would appear as CBA because of the right-to-left nature of that language. But
if I wanted to write ABC 123 DEF it would appear as FED*123$CBA where the $ represents the
token that says "change to RTL" and the * represents the token that says "change to LTR".
Read the Unicode documentation!  (RTFM!)  So if you are parsing this into tokens, is it
"FED" "123" "CBA" or "ABC" "321" "DEF" or "ABC" "123" "DEF"?  If you can't answer this
question, then you can't ask the one about the regexp being correct.  What if I have a
lexically illegal sequence of accent marks and characters?  What if I have the sequence
'`a?  If 'a means à and `a means á (I'm not talking about the ANSI characters, here '
means U0300 and ` means U0301), what does '`a or `'a mean?  Whoops, lexical error.  There
is no rule in your regexp that detects this, therefore, it is wrong.  (In UTF-32 these would
be U00000300 U00000061 and U00000301 U00000061, and in UTF-8 these would be "cc 80 61 cc 81
61".)
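
The byte values quoted above can be checked with a short Python sketch, which also shows that a rule about accent placement lives in a different layer from byte-level well-formedness. One caveat that I am adding: in Unicode text a combining mark follows its base letter, so "a" followed by U+0300 is what composes to à. The function find_orphan_marks and its definition of "orphan" are hypothetical, purely for illustration.

```python
import unicodedata

GRAVE, ACUTE = "\u0300", "\u0301"   # combining grave and combining acute

# The sequence as written in the post: mark, letter, mark, letter.
as_posted = GRAVE + "a" + ACUTE + "a"
posted_bytes = as_posted.encode("utf-8")

# Base letter first, then the mark: this is the order that composes under NFC.
composed = unicodedata.normalize("NFC", "a" + GRAVE)

def find_orphan_marks(text):
    # Report indices of combining marks that do not follow a letter, a digit,
    # or another mark that is itself attached to one.
    orphans = []
    attachable = False
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        is_mark = cat.startswith("M")
        if is_mark and not attachable:
            orphans.append(i)
        attachable = (is_mark and attachable) or cat.startswith(("L", "N"))
    return orphans
```

Both orderings are well-formed UTF-8, so a byte-level regexp accepts them either way; whether a stray mark is a lexical error is a decision for a later rule such as the orphan check sketched here.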

So the simple answer is "It is completely and utterly insufficient, and its correctness is
problematic, and it does not define even what a letter is", and even if you convert to
UTF-32 you have not solved this problem.
					joe

So the simplest answer is "No", under no imaginable conditions is this collection of
regexps even CLOSE to being usable, and even if expressed in UTF-32, there is no possible
way something this overly-simplistic could be construed to make sense, and the real
problem is vastly more complicated than you have imagined!
					joe

On Sat, 15 May 2010 09:20:42 -0500, "Peter Olcott" <NoSpam@OCR4Screen.com> wrote:

>
>"Mihai N." <nmihai_year_2000@yahoo.com> wrote in message 
>news:Xns9D7922762D9EBMihaiN@207.46.248.16...
>>
>>
>>> Possibly, but, I was really only looking for a yes or no
>>> answer.
>>
>> If you wanted a yes/no answer you should give complete 
>> info
>> (like the fact that you are talking about a lex context).
>> Otherwise you will very likely get a wrong answer.
>
>I don't see why this would be the case for a yes or no 
>question.
>
>>
>>
>> -- 
>> Mihai Nita [Microsoft MVP, Visual C++]
>> http://www.mihai-nita.net
>> ------------------------------------------
>> Replace _year_ with _ to get the real email
>> 
>
Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
0
Joseph
5/17/2010 5:40:41 AM
On 5/17/2010 12:40 AM, Joseph M. Newcomer wrote:
> Because without the context it is not a valid question.
>
> For example, since this is a C++/MFC group, the question might have been in terms of a
> regexp library, which suggests you are using UTF-8 internally, which would be wrong.
>
> But as stated, the question is wrong, because you are presuming an over-simplified concept
> of "letter", for which I have already pointed out there are failures (numbers in other
> languages).  You would have to deal with all accent marks, and while some languages have
> e-umlaut, i-umlaut and y-umlaut, these are not letters in German, and so you have to take
> into account the localization context to determine if they really are "letters".  And in
> Chinese, a single glyph may be a "word" and thus two of these in sequence would be
> syntactically illegal.  So how do you define "letter"?  And in some cases, the accent mark
> is a separate codepoint, so a separate UTF-8 encoding, but you can't combine that accent
> mark with any but a few letters, so the regexp does not account for these at all!
>
> What about RTL encodings?  In Hebrew, which I will simplify for NG syntax, if I wanted to
> write ABC it would appear as CBA because of the right-to-left nature of that language. But
> if I wanted to write ABC 123 DEF it would appear as FED*123$CBA where the $ represents the
> token that says "change to RTL" and the * represents the token that says "change to LTR".
> Read the Unicode documentation!  (RTFM!)  So if you are parsing this into tokens, is it
> "FED" "123" "CBA" or "ABC" "321" "DEF" or "ABC" "123" "DEF"?  If you can't answer this
> question, then you can't ask the one about the regexp being correct.  What if I have a
> lexically illegal sequence of accent marks and characters?  What if I have the sequence
> '`a?  If 'a means à and `a means á (I'm not talking about the ANSI characters, here '
> means U0300 and ` means U0301), what does '`a or `'a mean?  Whoops, lexical error.  There
> is no rule in your regexp that detects this, therefore, it is wrong.  (In UTF-32 these would
> be U00000300 U00000061 and U00000301 U00000061, and in UTF-8 these would be "cc 80 61 cc 81
> 61".)
>
> So the simple answer is "It is completely and utterly insufficient, and its correctness is
> problematic, and it does not define even what a letter is", and even if you convert to
> UTF-32 you have not solved this problem.
> 					joe
>
> So the simplest answer is "No", under no imaginable conditions is this collection of
> regexps even CLOSE to being usable, and even if expressed in UTF-32, there is no possible
> way something this overly-simplistic could be construed to make sense, and the real
> problem is vastly more complicated than you have imagined!
> 					joe

Your approach is incorrect: it assumes that if a solution does not 
provide support for every possible issue then it does not solve the 
problem. The failure in this approach is that for many problems most 
of these issues are entirely moot.

For the purpose of creating an interpreted GUI scripting language that 
permits people to write GUI scripts in their native language I only need 
to be able to handle UTF-8 input and make sure that it is valid UTF-8. 
There is no need for me to validate this any further.

>
> On Sat, 15 May 2010 09:20:42 -0500, "Peter Olcott"<NoSpam@OCR4Screen.com>  wrote:
>
>>
>> "Mihai N."<nmihai_year_2000@yahoo.com>  wrote in message
>> news:Xns9D7922762D9EBMihaiN@207.46.248.16...
>>>
>>>
>>>> Possibly, but, I was really only looking for a yes or no
>>>> answer.
>>>
>>> If you wanted a yes/no answer you should give complete
>>> info
>>> (like the fact that you are talking about a lex context).
>>> Otherwise you will very likely get a wrong answer.
>>
>> I don't see why this would be the case for a yes or no
>> question.
>>
>>>
>>>
>>> --
>>> Mihai Nita [Microsoft MVP, Visual C++]
>>> http://www.mihai-nita.net
>>> ------------------------------------------
>>> Replace _year_ with _ to get the real email
>>>
>>
> Joseph M. Newcomer [MVP]
> email: newcomer@flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

0
Peter
5/17/2010 2:17:41 PM
Why not go to the root of the problem?

This is what you need:
  > For the purpose of creating an interpreted GUI scripting language that 
  > permits people to write GUI scripts in their native language

Then expose the whole thing using a COM model, and it would allow
anyone to automate using any .NET language, Perl, JScript, you name it.
Solid languages, some of them supporting Unicode out of the box, way
more popular. You stop wasting your time developing a compiler,
and people will not be forced to waste time learning another
programming language (C-like but not quite C).




-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Mihai
5/18/2010 8:10:07 AM
On 5/18/2010 3:10 AM, Mihai N. wrote:
>
>
> Why not go to the root of the problem?
>
> This is what you need:
>    >  For the purpose of creating an interpreted GUI scripting language that
>    >  permits people to write GUI scripts in their native language
>
> Then expose the whole thing using a COM model, and it would allow
> anyone to automate using any .NET language, Perl, JScript, you name it.
> Solid languages, some of them supporting Unicode out of the box, way
> more popular. You stop wasting your time developing a compiler,
> and people will not be forced to waste time learning another
> programming language (C-like but not quite C).
>
>
>
>
I considered that, but rejected it for two reasons:
(1) Not sufficiently platform independent.
(2) Makes my success too dependent upon Microsoft.

0
Peter
5/18/2010 7:03:56 PM
> (2) Makes my success too dependent upon Microsoft.

That's a good reason, indeed.
Your stuff is so great, that Microsoft might actually drag you down :-)


-- 
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

0
Mihai
5/19/2010 5:57:29 AM
On 5/19/2010 12:57 AM, Mihai N. wrote:
>> (2) Makes my success too dependent upon Microsoft.
>
> That's a good reason, indeed.
> Your stuff is so great, that Microsoft might actually drag you down :-)
>
>
How well does ActiveX run on Linux or Apple?

There are other reasons why I considered and rejected this otherwise 
excellent idea. In fact I will still be offering my OCR4Screen technology 
as an ActiveX component at some point.
0
Peter
5/19/2010 2:26:35 PM
Reply:

Similar Artilces:

Outlook Express:
Can outlook express be set up to send an away mesage/vacation response. Any help is appreciated from anyone. "jason Pope" <mr_anderson1979@hotmail.com> wrote in message news:03dc01c3b3a4$432fa430$a501280a@phx.gbl... > Can outlook express be set up to send an away > mesage/vacation response. Any help is appreciated from > anyone. This newsgroup is for support of Outlook 97/98/2000/2002/2003 from the Office suite of products. Outlook Express is actually a separate program despite the similar name. For help with your OE questions, try an OE newsgroup such as ...

Reading and writing UTF-8 files
I have to do some simple text editing to large-ish (2 Mbyte) html files generated by Word. They are, I believe, in UTF-8. It is the !%^&* problem where single apostrophes become sequences of funny characters, some spaces are shown as unprintable characters, etc. The following code just reads and writes a file, and shows the problem. It applies whether I use a String or a StringBuilder, and whether or not I explicitly force UTF-8 encoding. Can somebody just tell me basically how to copy an html file by reading it in and then writing it out, which is all the following meth...

Read Receipts #8
Does anyone know why when I read a Message in Outlook 2003 the read receipt is sent but with the TO field blank. This causes some SPAM filters to return the message thinking its SPAM? I have tried with several different people and they all get the blank TO fields in the receipts, but when I read a message in OWA it sends it OK... Any comments... PLEASE HELP Thanks ...

Anyway to retrieve lost messages after synchronization in Outlook Express?
Setup for my hotmail messages to be downloaded to Outlook Express. When tried to connect to see an old message, didn't realize that it was setup to auto synch. My messages were deleted from hotmail, and are now lost on Outlook Express after synching. Is there anyway to retrieve the lost messages from Outlook Express? I really needed some of the items in there, and had downloaded them so they wouldn't be lost. Thanks! probably not, but I would try running dbxtract.exe on the deleted folder. if the folder was compacted, they are gone forever. look for dbxtract at tomsterdam.co...

pcnetsecurity@gmail.com =?UTF-8?B?QXNzaXN0w6puY2lhIFTDqWM=?= =?UTF-8?B?bmljYSAgbWFudXRlbsOnw6M=?= =?UTF-8?B?byBkZSBjb21wdXRhZG9y?= =?UTF-8?B?ZXMgaW5mb3JtYXRpY2Eg?= =?UTF-8?B?Vml0w7NyaWEtZXMgNjc5NDA=?=
Contato: pcnetsecurity@gmail.com Contato: pcnetsecurity @ gmail.com Planos a partir de R$ 250,00 . Assist�ncia T�cnica Prestamos assist�ncia t�cnica nos computadores de sua empresa ou resid�ncia, e tamb�m possu�mos uma equipe qualificada para fazer a manuten��o no pr�prio local. - Contratos de Suporte e Manuten��o Reduza os custos de sua empresa com solicita��es de visitas t�cnicas para seus computadores, elaboramos um contrato de manuten��o integrado para sua empresa onde disponibilizamos: t�cnicos, equipamentos de suporte e substitui��o, e atendimento no hor�rio comercial ou ...

Publisher 2000 #8
When i start up publisher the page area shows as a blueish tint and all surrounding areas are fuschia like color. When i click on a text or picture frame took and drag the mouse arrow drags behind it and image as if i was in a paint mode or something. Opening existing files are ok. Uninstalled and reinstalled and sme problem. Any help on this problem would be appreciated greatly. Norm After managing to set up OE-QuoteFix on his new PC, Ed reads a message from Norm <anonymous@discussions.microsoft.com>... > When i start up publisher the page area shows as a > blueish tint and...

=?Utf-8?B?2YPZitmB2YrYqSDYudmF2YQg2YXZitiy2KfZhtmK2Ycg2KrZgtiv2YrYsdmK2Yc=?= =?Utf-8?B?INmE2KrZiNi42YrZgSDYudiv2K8g2KfZg9io2LEg2YXZhiDYp9mE2YXZiNi4?= =?Utf-8?B?2YHZitmGINmI2KfZhNi52KfYptivINmF2YY=?=
...

Filter list not correctly refreshed
Hi, I created a pivot table in Excel, linked to a database. So I added some filters to this table, displaying lists. The fact is that when I try to find a value in this list (my filter list), I don't find this value. I explain with more details. I add a new value to a list A in TFS (value x). I wait for the OLAP cube to refresh. I open my excel file, I check in the A list (used in my file as a filter) and try to select the value x, but I don't find it. My OLAP cube has correcty refreshed the data (I can see it with the displayed data in my tables). What should I d...

CF Stop If True not working correctly
Hi, I have a problem with Stop If True for Conditional Formatting not working correctly in Excel 2007. It seems as though Stop If True is always on even when the check box is not checked. I have a very simple demonstration, at least on my machine. Open a new workbook. Enter the values 1, 2, 3 in cells A1, A2, and A3 respectively. Select those three cells. Select New Rule from the Conditional Formatting menu Create a rule of "Format only cells that contain" Make the rule as follows: Cell Value Is greater than 1 Fill is (select ...

publisher 97 #8
I have installed MS Publisher 97 on my machine running Windows ME but every time I try to run the program I get the following error message - "Can't load Custom Control DLL.'C:\WINDOWS\SYSTEM\MSPUBG40.VBX" and the program locks up. I have tried repairing the program in Add and Remove programs in control panel and have unistalled it and reinstalled it - all without success. Is there a way to overcome this problem? Rename the MSPUBG40.VBX to MSPUBG40.old and drag a new instance from your CD. Be sure the file is in the c:\windows\system folder. If the VBX file is damage...

Help #8
I upgraded Windows, and when I re-installed Outlook, I lost all my messages and my address book. I really want to get these back. Is there any way to get them? Please help me if you can. Thank you ...

Does not include the specified expression as part of an aggregate function
Running Access 2003 on Windows XP. I'm starting a new structure parallel to a current database. The tables are fully linked (no local tables), and I'm trying to create a simple query, but I keep getting this message. The current query has one table in it; the general idea is to calculate consumption of raw materials in production. Fields: - Product Code (Text) // These are alphanumeric codes. - Inventory deductions // Numeric I'm trying to sum, min and max the deductions. My formula : "Total : Sum([Inventory deductions]) // Low : Min([Inventory deductions]) // H...

pcnetsecurity@gmail.com Technical support, computer maintenance, IT, Vitória-ES 43379
Contact: pcnetsecurity@gmail.com Contact: pcnetsecurity @ gmail.com Plans starting at R$ 250,00. Technical Support: We provide technical support for the computers at your company or home, and we also have a qualified team to carry out maintenance on site. - Support and Maintenance Contracts: Reduce your company's costs for technical call-outs for your computers; we draw up an integrated maintenance contract for your company in which we provide: technicians, support and replacement equipment, and service during business hours or ...

CRM and Active Directory Synchronisation (Error Correction?)
Following a successful(!) Pilot Test Install and Data Migration for MSCRM, recently, when generating new Business Units and attempting to populate them with Roles and/or Users, the following message (or one of very similar style) appears in the CRM applogs: Error: A privilege change was dropped after the maximum number of tries. -2147016672 (0x80072020) Description: An operations error occurred. Comments: The privilege {78777C10-09AB-4326-B4C8-CF5729702937} could not be changed for business {FBCC65E2-38E8-4FCF-AD20-34DEDE432A51}. This may be normal if this business has been deleted. If this busin...

Gift cards #8
Can you help with these, please? 1. How to retrieve the value of a certain Gift Card which has been sold. 2. How to generate a report on the value of the Gift Cards which have been sold but not used. 3. When the sale is complete, I get this notice: “XML Render – Compiler Error: Token ‘GST’ was not found.” After that, the detailed receipt is printed. This is a new error coming up after every sale. Hi Angie.....(may I call you Angie?) If you are using VOUCHER items and VOUCHER tender types to sell and redeem your gift certificates, then 1 - Use CTRL-SHIFT-F3 VOUCHER to lookup a sold vo...

Question for experts IE 8 on Win 7
When I have an IE 8 window open, I experience a flickering in the taskbar. This stops when I minimize the window or run the cursor over it. Have run antivirus, spyware, malware... everything's good. Should I worry? Hi, Do you have the G'Mail Taskbar notifier installed? or the Google Task bar search box installed? On Win7 you can click on the Task Bar Notification area and then choose which 'services' you want displayed on the Task Bar. (other OS versions you need to disable these items from msconfig>Startup.) Ensure you have the latest versions of Goog...

Outlook Express takes forever to load
Running Windows XP (Home Ed.). Outlook Express takes up to 1 minute to load (the dreaded hourglass) and the same amount of time to retrieve new messages. Funny thing is that on the same computer, on another user's desktop, OE loads normally. My wife gets her e-mail fast on her desktop but mine is slow. Options under "Accounts", "Properties" are set exactly the same for both accounts. Any ideas? Try disabling the preview pane and msn messenger.... If you have a lot of folders or email, the files might be too large or damaged for a fast load of OE "Dave...

MS Windows XP SP3 with IE 8 Keyboard Issues
I am running a Windows XP w/SP3 workstation with IE 8. It is fully service-packed to the current configuration. I have an issue where, when you type an address in IE 8, the first character is always typed twice. In addition, when typing one character, a different one actually appears on screen, both in IE 8 and in chat programs. When the screen saver comes on, or in Remote Desktop when trying to type a password after a period of time, the keystrokes do not produce the characters I am typing; the output starts from the top row of the keyboard and progresses down the line. Has anyone seen this issue as...

How do I programmatically add/remove/modify rules in Outlook & Outlook Express
I have read the groups but didn't find how to programmatically modify rules in Outlook or Outlook Express. I have found that there is a way to modify rules on the Exchange server, but what I want is to modify rules in the client. Does anyone have an idea? Thanks. You would need to upgrade to Outlook 2007, which is the first version with rules programmability. -- Sue Mosher, Outlook MVP Author of Configuring Microsoft Outlook 2003 http://www.turtleflock.com/olconfig/index.htm and Microsoft Outlook Programming - Jumpstart for Administrators, Power Users, and Developers http://w...

Want to replace the special char (•) while reading xml by XmlTextReader Class
Hi, I am using the XmlTextReader class to read XML from an HTTP link. My problem is that I want to replace the special characters (•) in the XML before reading its tags, because they cause an error in the XmlTextReader reader.Read() method. Please suggest any solutions to this. Thanks in advance, Ashish ashu2409 wrote: > i am using XmlTextReader class to read xml from http link ,now my problem is > that i want to replace the special characters(•) from xml before reading > > its tag, coz it is giving the error in XmlTextReader reader.Read() method. > > Please suggest any solu...

Backupexec 8.6 Exch2k3?
Do Backup Exec 8.6 and Exchange 2003 (on Windows 2003) work together? Trying to find out so as to prepare the CFO for a $6000 upgrade with all the agents and stuff we'll have to buy with 9.1. Don't want to spend $89 on a tech support call after I've spent all day on it just so they can tell me it doesn't work. Thanks. Jason jason b wrote: > does backupexec 8.6 and exchange2003 (on windows2003) work? > > Trying to find out so to prepare the CFO for a $6000 upgrade with all > the agents and stuff we'll have to buy with 9.1. > > Don't want to spend $89 on a te...

Import address book from Outlook Express to Outlook
I am trying to import old addresses from Outlook Express to Outlook 2007; any help would be appreciated. File > Import and Export > Internet Mail and Addresses... -- Russ Valentine "rconar" <rconar@discussions.microsoft.com> wrote in message news:4D441E96-90F7-48FE-A111-69CC3B18B563@microsoft.com... >I am trying to import old addresses from Outlook Express to Outlook 2007, >any > help would be appreciated > ...

Dilma Roussef does not believe in GOD 51576
If you believe everything the candidate says out there, watch the video that proves she does not believe in GOD. Be careful about throwing your vote in the trash! http://www.youtube.com/watch?v=24fqyh-kvvk Dilma Roussef does not believe in GOD ...

outlook express #2
Help, I cannot receive emails in Outlook Express. Can you provide more detail? Did it happen after you installed the Email Router? "riverbollin" wrote: > help i cannot receive emails on outlook express ...