Read XHTML into XML

Hi all,

I need to read/parse XHTML aspx pages and look for certain tokens and
content. How can I use an XmlTextReader for this? If that is not possible, any other ideas?

Thanks in advance,

JA Reyes.
Reyes (2)
6/28/2007 8:04:12 AM
dotnet.xml

Jose Antonio Reyes wrote:

> I need to read/parse XHTML aspx pages and look for certain tokens and
> content. How can I use an XmlTextReader for this? If that is not possible, any other ideas?

If the pages are well-formed XHTML then it is possible to use XmlReader 
(in .NET 2.0/3.0) or XmlTextReader (in .NET 1.x) to parse the XHTML 
documents. You can also use the other XML APIs .NET provides so using 
XPathNavigator and/or XmlDocument might offer more comfort than XmlReader.

Here is an example using XmlReader that prints out all heading elements 
(h1 .. h6 elements) assuming they have no child elements:

     // Requires: using System; using System.Xml;
     static public void PrintHeadings (string path) {
       XmlReaderSettings settings = new XmlReaderSettings();
       // Allow the DOCTYPE so that entities declared in the XHTML DTD resolve.
       settings.ProhibitDtd = false;
       using (XmlReader xmlReader = XmlReader.Create(path, settings)) {
         while (xmlReader.Read()) {
           // Only consider elements in the XHTML namespace.
           if (xmlReader.NodeType == XmlNodeType.Element &&
               xmlReader.NamespaceURI == "http://www.w3.org/1999/xhtml") {
             switch (xmlReader.LocalName) {
               case "h1":
               case "h2":
               case "h3":
               case "h4":
               case "h5":
               case "h6":
                 Console.Out.WriteLine(
                   "{0} heading has InnerText: \"{1}\".",
                   xmlReader.LocalName, xmlReader.ReadString());
                 break;
             }
           }
         }
       }
     }

     // Call it from elsewhere, for example from Main:
     PrintHeadings("doc.xhtml");
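
For comparison, the XmlDocument/XPath route mentioned above could look roughly like the sketch below. It is only an illustration: the class name and the inline sample markup are made up, and the sample has no DOCTYPE so that it runs without fetching a DTD. Unlike the XmlReader version, InnerText also copes with headings that contain child elements.

```csharp
using System;
using System.Xml;

class XPathHeadings   // hypothetical class name, for illustration only
{
    static void Main()
    {
        // Inline sample XHTML (an assumption) so the sketch needs no file.
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<h1>Title</h1><p>Text</p><h2>Sub <em>heading</em></h2>" +
            "</body></html>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // XPath 1.0 has no default namespace, so bind a prefix ("x")
        // to the XHTML namespace and use it in the expression.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        XmlNodeList headings = doc.SelectNodes(
            "//x:h1 | //x:h2 | //x:h3 | //x:h4 | //x:h5 | //x:h6", ns);

        foreach (XmlNode heading in headings)
        {
            // InnerText concatenates the text of all descendant nodes.
            Console.WriteLine("{0}: {1}", heading.LocalName, heading.InnerText);
        }
    }
}
```

The trade-off is memory: XmlDocument loads the whole page, while XmlReader streams it.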
-- 

	Martin Honnen --- MVP XML
	http://JavaScript.FAQTs.com/
mahotrash (1778)
6/28/2007 12:02:37 PM
Thanks Martin,

but how can I load the aspx page's DTD? I need to deal with special symbols 
like &nbsp; and so on...

For example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

Thanks in advance,

Jose Antonio Reyes.

6/28/2007 7:18:05 PM
Jose Antonio Reyes wrote:

> but how can I load the aspx page's DTD? I need to deal with special symbols 
> like &nbsp; and so on...
> 
> For example:
> 
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

That is an SGML DTD; don't expect an XML parser to consume it.
If the document is an XHTML document (not an HTML 4 document) then you 
can parse it with XmlReader. I have already included the settings for that:
     static public void PrintHeadings (string path) {
       XmlReaderSettings settings = new XmlReaderSettings();
       settings.ProhibitDtd = false;
       using (XmlReader xmlReader = XmlReader.Create(path, settings)) {
-- 

	Martin Honnen --- MVP XML
	http://JavaScript.FAQTs.com/
mahotrash (1778)
6/29/2007 11:55:12 AM
Unfortunately, I might find some &nbsp; items or JavaScript in the aspx page.

Would it be a good solution to parse the aspx page afterwards and include CDATA sections?

Thanks.
6/29/2007 12:52:00 PM
Jose Antonio Reyes wrote:
> Unfortunately, I might find some &nbsp; items or JavaScript in the aspx page.
> 
> Would it be a good solution to parse the aspx page afterwards and include CDATA sections?

If the document is an XHTML document and the entity nbsp is defined in 
the DTD then the XML parser can parse it.
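
To illustrate, here is a minimal sketch in which nbsp is declared in an internal DTD subset, standing in for the declaration the XHTML DTD would supply. The class name and the sample markup are assumptions made for the example. It also assumes .NET 4.0 or later, where DtdProcessing.Parse replaces the ProhibitDtd = false setting used earlier in this thread.

```csharp
using System;
using System.IO;
using System.Xml;

class EntityDemo   // hypothetical class name, for illustration only
{
    static void Main()
    {
        // The internal subset declares nbsp; a real XHTML page would get
        // this declaration from the XHTML DTD named in its DOCTYPE.
        string xhtml =
            "<!DOCTYPE html [ <!ENTITY nbsp \"&#160;\"> ]>" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<p>a&nbsp;b</p></body></html>";

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse; // allow the DOCTYPE
        settings.XmlResolver = null;                  // nothing external to fetch here

        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "p")
                {
                    // Readers from XmlReader.Create expand declared entities,
                    // so the text holds U+00A0 in place of &nbsp;.
                    string text = reader.ReadElementContentAsString();
                    Console.WriteLine("Contains U+00A0: {0}", text.IndexOf('\u00A0') >= 0);
                }
            }
        }
    }
}
```

If the DOCTYPE points at the external XHTML DTD instead, the reader needs a resolver that can actually retrieve it; without any declaration for nbsp the parse fails with an undeclared-entity error.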


-- 

	Martin Honnen --- MVP XML
	http://JavaScript.FAQTs.com/
mahotrash (1778)
6/29/2007 1:27:45 PM