HTML Scraping using XmlTextReader

Greetings.

Just wondering if it is possible to use XmlTextReader to 
read an HTML document, e.g.:

XmlTextReader tr = new XmlTextReader("http://localhost/test.xml");

where test.xml contains the following:

<table cellspacing="1" cellpadding="1" width="100%">
<tr valign="top">
    <td class="head" width="20%">test heading1</td>
    <td class="head" width="10%">test heading2</td>
</tr>
<tr valign="top">
    <td class="content" width="20%">content1</td>
    <td class="content" width="10%">
        <table cellspacing="0" width="100%">
        <tr>
            <td align="left">test</td>
            <td nowarp align="right">
                <nobr>0.123456</nobr>
            </td>
        </tr>
        </table>
    </td>
</tr>
</table>

It seems to work for the first few seconds, and then it 
crashes my Windows app when the XmlTextReader comes across 
certain content during an XmlTextReader.Read() call. Does 
this have to do with the well-formedness of this HTML doc? 
Also, is there a way to detect &nbsp; and convert it to its 
numeric character reference (something like #1390, though I 
can't remember the exact number) on the fly, i.e. without 
saving the HTML to disk?

Any thoughts will be appreciated.
9/10/2003 3:55:25 AM

Daniel wrote:

> Just wondering if it is possible to use XmlTextReader to 
> read off a html doc:
Not really, because HTML is not XML. Some HTML docs happen to be well-formed, so 
they can be read by XmlTextReader, but in general a single <br> tag or an 
&nbsp; entity, both ubiquitous in HTML, will stop the reading.

> e.g. XmlTextReader tr = new XmlTextReader
> ("http://localhost/test.xml");
> 
> where test.xml contains the following:
> 
> <table cellspacing="1" cellpadding="1" width="100%">
> <tr valign="top">
>     <td class="head" width="20%">test heading1</td>
>     <td class="head" width="10%">test heading2</td>
> </tr>
> <tr valign="top">
>     <td class="content" width="20%">content1</td>
>     <td class="content" width="10%">
>         <table cellspacing="0" width="100%">
>         <tr>
>             <td align="left">test</td>
>             <td nowarp align="right">

Watch nowrap: it's a so-called boolean attribute (an attribute with no value), which XML doesn't support.
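
The crash described in the question is consistent with an unhandled XmlException, which XmlTextReader throws as soon as it meets markup that is not well-formed. A minimal sketch, assuming the markup is already in a string (the class name HtmlAsXmlDemo and the helper TryRead are made up for illustration): it reads each fragment to the end and reports the parser's message instead of letting the exception escape. The samples cover the unclosed <br> and the valueless attribute.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical demo class, not part of any library.
public static class HtmlAsXmlDemo
{
    // Reads the markup to the end with XmlTextReader.
    // Returns null if it parsed cleanly, otherwise the parser's error message.
    public static string TryRead(string markup)
    {
        try
        {
            using (var reader = new XmlTextReader(new StringReader(markup)))
            {
                while (reader.Read())
                {
                    // Nothing to do: we only care whether Read() throws.
                }
            }
            return null;
        }
        catch (XmlException ex)
        {
            // Catching this is what keeps the application from crashing.
            return ex.Message;
        }
    }

    public static void Main()
    {
        string[] samples =
        {
            "<td align=\"left\">test</td>",          // well-formed
            "<td>line one<br>line two</td>",         // <br> is never closed
            "<td nowrap align=\"right\">test</td>"   // attribute without a value
        };

        foreach (string sample in samples)
        {
            string error = TryRead(sample);
            Console.WriteLine(sample + " => " + (error ?? "OK"));
        }
    }
}
```

Wrapping the Read() loop in a try/catch like this does not make the HTML readable; it only turns the crash into an error message that shows where parsing stopped.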

Try SGMLReader instead of XmlTextReader 
http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC
-- 
Oleg Tkachenko
http://www.tkachenko.com/blog
Multiconn Technologies, Israel

oleg7603 (294)
9/10/2003 9:46:27 AM
Thanks Oleg,

The URL you provided looks very interesting, and judging by 
the replies SgmlReader has received, people are definitely 
finding it useful. I will download it and have a play with it.

However, I do want to learn more about reading HTML using 
XmlTextReader. Do you (or anybody out there) know of a 
good URL to get me started?

Cheers.

9/10/2003 10:16:39 PM
Daniel wrote:

> However, I do want to learn more about reading html using 
> the XmlTextReader. Do you (or anybody out there) know of a 
> good url to get me started?
Not really. It's technically impossible to read arbitrary HTML with XmlTextReader 
without some sort of preprocessing (i.e. converting the HTML to XML or 
XHTML). Tidy is often used for that too; Google for "HTML Tidy".
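
One narrow form of such preprocessing, sketched under assumptions: if &nbsp; is the only thing keeping otherwise well-formed markup from parsing, it can be swapped in memory for the numeric character reference &#160; (U+00A0, the no-break space) before the reader sees it, so nothing is written to disk. The names NbspPreprocessDemo, ReplaceNbsp and ReadText are hypothetical. A plain string replace is not a general HTML-to-XML conversion and does nothing for unclosed tags or valueless attributes.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical demo class, not part of any library.
public static class NbspPreprocessDemo
{
    // Replaces the HTML-only entity &nbsp; with the numeric reference &#160;,
    // which an XML parser accepts without any DTD.
    public static string ReplaceNbsp(string markup)
    {
        return markup.Replace("&nbsp;", "&#160;");
    }

    // Parses the preprocessed markup from memory and concatenates its text nodes.
    public static string ReadText(string markup)
    {
        var text = new StringBuilder();
        using (var reader = new XmlTextReader(new StringReader(ReplaceNbsp(markup))))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        return text.ToString();
    }

    public static void Main()
    {
        // The trailing &nbsp; comes back as the single character U+00A0.
        string result = ReadText("<td><nobr>0.123456&nbsp;</nobr></td>");
        Console.WriteLine("Characters read: " + result.Length);
    }
}
```

The markup string here is a literal; in the scenario from the question it would be downloaded first (for example with WebClient.DownloadString) and passed through the same replace before parsing.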
-- 
Oleg Tkachenko
http://www.tkachenko.com/blog
Multiconn Technologies, Israel

oleg7603 (294)
9/11/2003 10:43:11 AM