reading text out of word docs

I posted a question in the microsoft.public.dotnet.languages.vb newsgroup 
yesterday with the same subject as this one (reading text out of word docs). 
It's about how to read Word docs using VB.NET. I haven't gotten any 
responses. I figured it wouldn't be very hard for someone who knows what 
they're doing. I spent hours online and in the help file yesterday and got 
almost nowhere. Maybe someone here would know how to handle this. I'm not 
going to put all the info here because it's not VBA and I don't want to 
cross-post. 


Reply Keith 12/24/2009 3:02:54 PM

Okay. Not getting help in the other newsgroup, so I'm moving this here:

I started working on a program to read text out of some well-organized Word
docs. I've done this sort of thing in VBA, but not quite this extensively, and
I'm not great with Word automation. I know enough to be dangerous. LOL. I
need to open the doc (got that part done), locate certain phrases that are
in all of them, and then read some text after those phrases into variables so
I can post them to a SQL db. The part I'm struggling with is how to read the
doc. I'm not changing the docs in any way. They are deposited into a folder
on the network and I open and read them as they arrive. Setting up the
watcher for this in general is not a problem. I just need help reading the
docs in VB.NET.

Here's some of what I have so far:

oWord = CreateObject("Word.Application")
oWord.Visible = True
oDoc = oWord.Documents.Open("C:\SomeWordDoc.doc", , True)

Dim rng As Word.Range

With oWord.Selection
.HomeKey(wdStory)
rng = .Range
End With

>>> Here's a point where I'm stuck. I can find the phrase "Issue date:"
>>> but then I need to read the text AFTER that (but not including the
>>> phrase itself)
>>> For example, the line in the doc might read "Issue date: March 25, 2009"
>>> I need to extract the "March 25, 2009" part.

rng.Find.Text = "Issue date:"
If rng.Find.Execute() Then
'MsgBox("found")
rng = oWord.Selection.Range
rng.End = rng.Next(wdLine, 1).End ' rng.MoveEnd(wdLine)
MsgBox(rng)
Else
MsgBox("Not found")
End If

>>> Then the next line below that doesn't have anything to cue me into that
>>> line. I just need the entire line below the date noted above. How do
>>> I move to the next line and read the entire line?

'move to line below "Issue Date:" to get county
>>> The line below "Issue Date:" would be like this: "Orange County"

I decided that it might be best to read the entire text into a string 
variable and use RegEx to get the pieces I need. But there's a problem with 
that. There are some places in the text where that will work and I know how 
to do that. But the bigger problem for me is how to read specific lines. For 
example, I need to read the 4th line of each document. There is no specific 
text in the 4th line that I can use RegEx to find it with, so I have to read 
the 4th line by position. I found this idea somewhere:

rng.Start = oDoc.Paragraphs(4).Range.Start
rng.End = oDoc.Paragraphs(4).Range.End


It seems to work, but I'm not sure if that's the best way.
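
For the read-it-all-into-a-string route, one thing that may help: the text Word returns from a Range marks the end of each paragraph with a carriage return (Chr(13)), so splitting on that character gives the paragraphs in order and the 4th one is simply the 4th item. Below is a rough sketch of that idea, written in Python rather than VB.NET so it can be run without Word. The sample text and the name nth_paragraph are made up for illustration, and it assumes each "line" in these docs is its own paragraph (a line that only wraps on screen would not be split this way). The same logic should carry over to String.Split in VB.NET.

```python
def nth_paragraph(doc_text: str, n: int) -> str:
    """Return the n-th paragraph (1-based) of text whose paragraphs
    end with a carriage return, as in text read from a Word Range."""
    paragraphs = doc_text.split("\r")
    if n < 1 or n > len(paragraphs):
        raise IndexError(f"document has no paragraph {n}")
    return paragraphs[n - 1]


# Hypothetical stand-in for the text Word would return (e.g. oDoc.Content.Text).
sample = (
    "Notice of Sale\r"
    "Issue date: March 25, 2009\r"
    "Orange County\r"
    "Case 12345\r"
    "Body text\r"
)

fourth = nth_paragraph(sample, 4)
print(fourth)
```

This is equivalent in spirit to Paragraphs(4), just done on the string, so it only helps if the whole text is already being read into a variable anyway.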

Then the last thing is that there is a large block of text in the middle of 
these documents that I will need to read. I know the line it starts on but 
have no idea which line it will stop on. But there is a line that follows it 
that I can find using RegEx. Not sure how to grab that text based on those 
ideas.
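
One possible way to grab that block from the string, sketched under a few assumptions: the text has been read into a string with carriage returns between paragraphs, the block starts at a known paragraph number, and the line that follows it can be recognized with a regular expression. This is Python rather than VB.NET so it runs without Word; the function name block_between, the sample text, and the "Signed:" end marker are all hypothetical.

```python
import re


def block_between(doc_text: str, start_paragraph: int, end_pattern: str) -> str:
    """Collect paragraphs from start_paragraph (1-based) up to, but not
    including, the first paragraph that matches end_pattern."""
    paragraphs = doc_text.split("\r")
    end_re = re.compile(end_pattern)
    collected = []
    for paragraph in paragraphs[start_paragraph - 1:]:
        if end_re.search(paragraph):
            return "\n".join(collected)
        collected.append(paragraph)
    # Reaching here means the end marker never appeared.
    raise ValueError("end marker not found after the start paragraph")


# Hypothetical document text; the real docs will differ.
sample = (
    "Notice of Sale\r"
    "Issue date: March 25, 2009\r"
    "Orange County\r"
    "First line of the long description.\r"
    "Second line of the long description.\r"
    "Signed: J. Smith\r"
)

block = block_between(sample, 4, r"^Signed:")
print(block)
```

Raising an error when the end marker is missing is a deliberate choice here, so a malformed document does not silently post a truncated or runaway block to the database.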

Help with the above will really get me started well on this. I'd really
appreciate it.

Thanks,

Keith





Reply Keith 12/26/2009 5:43:44 PM


See if any of this helps:
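Sub ScratchMacro()
Dim oRng As Word.Range
Set oRng = ActiveDocument.Range
With oRng.Find
  .Text = "Issue date:"
  'A successful Execute redefines oRng to the found text.
  If .Execute Then
    'Collapse to just after the label, then extend to the end of the
    'paragraph (stopping before the paragraph mark).
    oRng.Collapse wdCollapseEnd
    oRng.MoveEndUntil Chr(13)
    If IsDate(oRng.Text) Then
      MsgBox oRng.Text
      'Move to the start of the next paragraph and read it the same way.
      oRng.Collapse wdCollapseStart
      oRng.Move wdParagraph
      oRng.MoveEndUntil Chr(13)
      MsgBox "County is: " & oRng.Text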
      On Error Resume Next
      MsgBox ActiveDocument.Paragraphs(4).Range.Text
      'Or specifically the fourth line.
      ActiveDocument.Range(0, 0).Select
      Dim i As Long
      For i = 1 To 3
        With Selection
          .MoveDown unit:=wdLine
          .Bookmarks("\line").Select
        End With
      Next i
      MsgBox Selection.Text
      'GoTo a specific line e.g., line 8:
      Selection.GoTo What:=wdGoToLine, Count:=8
      'Set a range equal to the complete paragraph range.
      Set oRng = Selection.Paragraphs(1).Range
      MsgBox oRng.Text
    Else
      MsgBox "No date found on this line"
    End If
  End If
End With
End Sub
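
If the text ends up in a string instead, the same find-the-label-then-read-to-the-paragraph-mark idea can be done with a single regular expression. A rough sketch follows, in Python rather than VB.NET so it runs without Word. It assumes paragraphs are separated by carriage returns (Chr(13)) as in the macro above; the pattern, the function name issue_date_and_county, and the sample text are illustrative only.

```python
import re

# Label, optional spaces, the rest of that paragraph, the paragraph mark,
# then the whole of the next paragraph.
ISSUE_RE = re.compile(r"Issue date:[ \t]*([^\r]*)\r([^\r]*)")


def issue_date_and_county(doc_text: str):
    """Return (date_text, next_paragraph), or None if the label is missing."""
    match = ISSUE_RE.search(doc_text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


# Hypothetical document text; the real docs will differ.
sample = "Notice of Sale\rIssue date: March 25, 2009\rOrange County\rCase 12345\r"

result = issue_date_and_county(sample)
print(result)
```

Unlike the macro, this sketch does not check that the captured text is a valid date; that check would still need to be added.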


Keith G Hicks wrote:
> Okay. Not getting help in the other newsgroup so I'm moving this here:
> [snip]


Reply Greg 12/26/2009 6:40:11 PM

Very helpful. Thank you.


"Greg Maxey" <gmaxey@mIKEvICTORpAPAsIERRA.oSCARrOMEOgOLF> wrote in message 
news:uHKZarlhKHA.2780@TK2MSFTNGP05.phx.gbl...
> See if any of this helps:
> [snip]


Reply Keith 12/26/2009 7:04:30 PM
