The problem seems relatively straightward to the human eye but I imagine it's
quite difficult to do in VBA without prior knowledge of what the strings
contain.
If I have two strings like:
The quick red fox jumped over the lazy brown cow
The quick red fox just stood there
How would I compare them and identify "quick red fox" as the longest common
sub-string?
I suppose I would have to start by reading one of the strings into an array
and comparing its elements against the second string? Or somehow use the
Filter function to send the matching items to another array? But then I can
only seem to be able compare each element against one word ("quick", "red" or
"fox") which I would have to know in advance.
There's no doubt some better way.
Any advice greatly appreciated.
Sub CompareStrings()
Dim Array1() As String
Dim string1 As String
Dim string2 As String
Dim InString() As String
Dim i As Integer
Dim j As Integer
string1 = "The quick red fox jumped over the lazy brown cow"
string2 = "The quick red fox just stood there"
For i = 0 To UBound(Split(string1, " ")) - 1
ReDim Preserve Array1(i)
Array1(i) = Split(string1, " ")(i)
'MsgBox Array1(i)
Next
InString = Filter(Array1, "red", True)
For j = 0 To UBound(InString)
MsgBox InString(j)
Next j
End Sub
|
|
0
|
|
|
|
Reply
|
Utf
|
2/3/2010 2:04:06 PM |
|
David Turner wrote:
> The problem seems relatively straightward to the human eye but I imagine it's
> quite difficult to do in VBA without prior knowledge of what the strings
> contain.
> If I have two strings like:
> The quick red fox jumped over the lazy brown cow
> The quick red fox just stood there
> How would I compare them and identify "quick red fox" as the longest common
> sub-string?
Perhaps it's not even very easy to the human eye? (The longest common
substring was actually "The quick red fox j", not "quick red fox".)
> I suppose I would have to start by reading one of the strings into an array
> and comparing its elements against the second string? Or somehow use the
> Filter function to send the matching items to another array? But then I can
> only seem to be able compare each element against one word ("quick", "red" or
> "fox") which I would have to know in advance.
> There's no doubt some better way.
> Any advice greatly appreciated.
I'd suggest you start by clearly enumerating your actual requirements.
Is this a character comparison or a word comparison? Do the matches
need to start at the same position, or may they be anywhere within the
string. Etc, etc, etc...
If you can't define the problem, you'll never solve it.
--
..NET: It's About Trust!
http://vfred.mvps.org
|
|
0
|
|
|
|
Reply
|
Karl
|
2/3/2010 6:39:35 PM
|
|
I would put them in a For Next loop incremented from 1 to the length of the
shortest string.
Use the Mid() function to extract the characters in the shortest string one
by one starting at the first.
Use the Instr() function to search for the character in the longer string.
As long as the strings are a match, increment and keep looping. When they
don't match, use the index of the loop to tell you the position.
Use the Left() function to return the portion of the string that matches.
This will only work if the two string start out the same. It gets more
complex if the patterns match in the middle.
"David Turner" wrote:
> The problem seems relatively straightward to the human eye but I imagine it's
> quite difficult to do in VBA without prior knowledge of what the strings
> contain.
> If I have two strings like:
> The quick red fox jumped over the lazy brown cow
> The quick red fox just stood there
> How would I compare them and identify "quick red fox" as the longest common
> sub-string?
> I suppose I would have to start by reading one of the strings into an array
> and comparing its elements against the second string? Or somehow use the
> Filter function to send the matching items to another array? But then I can
> only seem to be able compare each element against one word ("quick", "red" or
> "fox") which I would have to know in advance.
> There's no doubt some better way.
> Any advice greatly appreciated.
>
> Sub CompareStrings()
> Dim Array1() As String
> Dim string1 As String
> Dim string2 As String
> Dim InString() As String
> Dim i As Integer
> Dim j As Integer
>
> string1 = "The quick red fox jumped over the lazy brown cow"
> string2 = "The quick red fox just stood there"
>
> For i = 0 To UBound(Split(string1, " ")) - 1
> ReDim Preserve Array1(i)
> Array1(i) = Split(string1, " ")(i)
> 'MsgBox Array1(i)
> Next
>
> InString = Filter(Array1, "red", True)
>
> For j = 0 To UBound(InString)
> MsgBox InString(j)
> Next j
>
> End Sub
|
|
0
|
|
|
|
Reply
|
Utf
|
2/3/2010 6:48:08 PM
|
|
I would agree with Karl. As, technically, the largest sub-string is indeed
"The quick red fox j", you need to spell out EXACTLY the requirements.
Neil: "This will only work if the two string start out the same. It gets more
complex if the patterns match in the middle."
That is putting it mildly.
This is a very complex task. Quite likely do-able, but precise requirements
are needed.
Neil Humphries wrote:
>I would put them in a For Next loop incremented from 1 to the length of the
>shortest string.
>Use the Mid() function to extract the characters in the shortest string one
>by one starting at the first.
>Use the Instr() function to search for the character in the longer string.
>As long as the strings are a match, increment and keep looping. When they
>don't match, use the index of the loop to tell you the position.
>Use the Left() function to return the portion of the string that matches.
>
>This will only work if the two string start out the same. It gets more
>complex if the patterns match in the middle.
>
>> The problem seems relatively straightward to the human eye but I imagine it's
>> quite difficult to do in VBA without prior knowledge of what the strings
>[quoted text clipped - 36 lines]
>>
>> End Sub
--
Message posted via OfficeKB.com
http://www.officekb.com/Uwe/Forums.aspx/word-programming/201002/1
|
|
0
|
|
|
|
Reply
|
Fumei2
|
2/3/2010 7:06:14 PM
|
|
"Karl E. Peterson" wrote:
> I'd suggest you start by clearly enumerating your actual requirements.
> Is this a character comparison or a word comparison? Do the matches
> need to start at the same position, or may they be anywhere within the
> string. Etc, etc, etc...
>
> If you can't define the problem, you'll never solve it.
Sorry. Word comparison, the matches can start anywhere in the strings and
the strings could be of variable length, possibly ending in punctuation: two
sentences for example.
I now realise it's a much more complex problem than I first thought.
|
|
0
|
|
|
|
Reply
|
Utf
|
2/3/2010 8:59:01 PM
|
|
It is probably several orders of magnitude more complex than you may have
first though.
I am wondering however is some of the purpose-built plagiarism detection
software might not be the way to go.
--
Hope this helps,
Doug Robbins - Word MVP
Please reply only to the newsgroups unless you wish to obtain my services on
a paid professional basis.
"David Turner" <DavidTurner@discussions.microsoft.com> wrote in message
news:2A1C80E7-9C34-49A7-A13E-B1ECB60C46FA@microsoft.com...
>
>
> "Karl E. Peterson" wrote:
>
>> I'd suggest you start by clearly enumerating your actual requirements.
>> Is this a character comparison or a word comparison? Do the matches
>> need to start at the same position, or may they be anywhere within the
>> string. Etc, etc, etc...
>>
>> If you can't define the problem, you'll never solve it.
>
> Sorry. Word comparison, the matches can start anywhere in the strings and
> the strings could be of variable length, possibly ending in punctuation:
> two
> sentences for example.
> I now realise it's a much more complex problem than I first thought.
>
|
|
0
|
|
|
|
Reply
|
Doug
|
2/3/2010 10:20:36 PM
|
|
Doug Robbins - Word MVP wrote:
> It is probably several orders of magnitude more complex than you may have
> first though.
It's sounding fairly simple, actually, as brute force tasks go. And,
likewise, as brute force tasks go, it's going to take exponentially
longer to execute as the string size(s) grow. It's not unlike cracking
the combination to a lock without knowing how many numbers are in the
combination, although we are rewarded with partial success results
which actually makes it possible.
You'd "just" start comparing the first character to every character in
the other string, noting if you have a match of one. If you don't find
the first char, you scan the second string for the 2nd char in the
first string, and so on. Once you have a match of one, you start
looking for a match of two. You'd scan the second string for the first
two chars in the first. If they're not there, you'd start scanning the
second string for the 2nd and 3rd chars in the first, and so on.
Simple, huh? <g> Assuming you know the rules. Must the case match?
Etc, etc, etc...
> I am wondering however is some of the purpose-built plagiarism detection
> software might not be the way to go.
That's not a bad thought.
--
..NET: It's About Trust!
http://vfred.mvps.org
|
|
0
|
|
|
|
Reply
|
Karl
|
2/3/2010 10:33:05 PM
|
|
In the simple case where you match character by character, white space
has to match exactly, etc. you're probably better off starting by
comparing the endpoints of the longest possible match, then reducing the
comparison length. If the endpoints don't match, the contents can't
match so you do not actually need to compare them.
Can't prove it though. But I would have thought there was good coverage
of this kind of algorithm in standard works such as Knuth's Sorting and
Searching volume. I don't know where you find that stuff online.
Peter Jamieson
http://tips.pjmsn.me.uk
On 03/02/2010 22:33, Karl E. Peterson wrote:
> Doug Robbins - Word MVP wrote:
>> It is probably several orders of magnitude more complex than you may
>> have first though.
>
> It's sounding fairly simple, actually, as brute force tasks go. And,
> likewise, as brute force tasks go, it's going to take exponentially
> longer to execute as the string size(s) grow. It's not unlike cracking
> the combination to a lock without knowing how many numbers are in the
> combination, although we are rewarded with partial success results which
> actually makes it possible.
>
> You'd "just" start comparing the first character to every character in
> the other string, noting if you have a match of one. If you don't find
> the first char, you scan the second string for the 2nd char in the first
> string, and so on. Once you have a match of one, you start looking for a
> match of two. You'd scan the second string for the first two chars in
> the first. If they're not there, you'd start scanning the second string
> for the 2nd and 3rd chars in the first, and so on. Simple, huh? <g>
> Assuming you know the rules. Must the case match? Etc, etc, etc...
>
>> I am wondering however is some of the purpose-built plagiarism
>> detection software might not be the way to go.
>
> That's not a bad thought.
>
|
|
0
|
|
|
|
Reply
|
Peter
|
2/4/2010 12:53:22 AM
|
|
Peter Jamieson wrote:
> In the simple case where you match character by character, white space has to
> match exactly, etc. you're probably better off starting by comparing the
> endpoints of the longest possible match, then reducing the comparison length.
> If the endpoints don't match, the contents can't match so you do not actually
> need to compare them.
Yeah, that's a potentially good option, too. A bit more complex, but
that happens when you start trying to finesse brute force attacks. :-)
> Can't prove it though. But I would have thought there was good coverage of
> this kind of algorithm in standard works such as Knuth's Sorting and
> Searching volume. I don't know where you find that stuff online.
You'd think, yeah. But there aren't many folks who still think in 2N
versus N^2 terms. Most just throw a framework at it, and see what
sticks.
--
..NET: It's About Trust!
http://vfred.mvps.org
|
|
0
|
|
|
|
Reply
|
Karl
|
2/4/2010 1:09:21 AM
|
|
> You'd think, yeah. But there aren't many folks who still think in 2N
> versus N^2 terms. Most just throw a framework at it, and see what
> sticks.
Yeah. For the "build it from the ground up" types (which unfortunately
probably includes me),
http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml
could be handy. I got there via
http://www.itl.nist.gov/div897/sqg/dads/
Peter Jamieson
http://tips.pjmsn.me.uk
On 04/02/2010 01:09, Karl E. Peterson wrote:
> Peter Jamieson wrote:
>> In the simple case where you match character by character, white space
>> has to match exactly, etc. you're probably better off starting by
>> comparing the endpoints of the longest possible match, then reducing
>> the comparison length. If the endpoints don't match, the contents
>> can't match so you do not actually need to compare them.
>
> Yeah, that's a potentially good option, too. A bit more complex, but
> that happens when you start trying to finesse brute force attacks. :-)
>
>> Can't prove it though. But I would have thought there was good
>> coverage of this kind of algorithm in standard works such as Knuth's
>> Sorting and Searching volume. I don't know where you find that stuff
>> online.
>
> You'd think, yeah. But there aren't many folks who still think in 2N
> versus N^2 terms. Most just throw a framework at it, and see what sticks.
>
|
|
0
|
|
|
|
Reply
|
Peter
|
2/4/2010 9:39:18 AM
|
|
I cobble this together no idea where to go, if anywhere(!), from here:
Sub LongestMatch()
Dim strArray1() As String
Dim strArray2() As String
Dim string1 As String
Dim string2 As String
Dim i As Integer
Dim j As Integer
Dim m As Integer
Dim n As Integer
string1 = "quick red fox jumped"
string2 = "red fox stood"
For m = 0 To UBound(Split(string1, " ")) - 1
ReDim Preserve strArray1(m)
strArray1(m) = Split(string1, " ")(m)
'MsgBox strArray1(m)
Next
For n = 0 To UBound(Split(string2, " ")) - 1
ReDim Preserve strArray2(n)
strArray2(n) = Split(string2, " ")(n)
'MsgBox strArray2(n)
Next
For m = 0 To UBound(strArray1)
For n = 0 To UBound(strArray2)
If strArray1(m) = strArray2(n) Then
MsgBox strArray1(m) ' don't know what to do next
ElseIf strArray1(m) <> strArray2(n) Then
MsgBox strArray2(n) ' don't know what to do next
End If
Next
Next
End Sub
|
|
0
|
|
|
|
Reply
|
Utf
|
2/4/2010 11:58:01 AM
|
|
This should do the trick...
Option Explicit
Public Function Main()
Dim aStr As String, bStr As String
aStr = "The quick red fox jumped over the lazy brown cow"
bStr = "The quick red fox just stood there"
Dim longestPhrases As Collection, longPhrase As Variant
Set longestPhrases = FindLongestMatchingPhrase(aStr, bStr)
For Each longPhrase In longestPhrases
Debug.Print "Longest Matching Phrase: " & longPhrase
Next
End Function
Public Function FindLongestMatchingPhrase(aStr As String, bStr As String) As
Collection
Dim longestPhrases As Collection
Set longestPhrases = New Collection
Dim aWords As Variant, bWords As Variant
aWords = Split(aStr, " ")
bWords = Split(bStr, " ")
Dim aPhrases As Collection, bPhrases As Collection
Set aPhrases = BuildPhrases(aWords)
Set bPhrases = BuildPhrases(bWords)
Dim curPhrase As Variant, longestPhraseLen As Integer
longestPhraseLen = 0
For Each curPhrase In aPhrases
If (FoundInOtherPhrases(curPhrase, bPhrases)) Then
If (Len(curPhrase) > longestPhraseLen) Then
longestPhraseLen = Len(curPhrase)
Set longestPhrases = New Collection
AddDistinctPhrase curPhrase, longestPhrases
ElseIf (Len(curPhrase) = longestPhraseLen) Then
AddDistinctPhrase curPhrase, longestPhrases
End If
End If
Next
For Each curPhrase In bPhrases
If (FoundInOtherPhrases(curPhrase, aPhrases)) Then
If (Len(curPhrase) > longestPhraseLen) Then
longestPhraseLen = Len(curPhrase)
Set longestPhrases = New Collection
AddDistinctPhrase curPhrase, longestPhrases
ElseIf (Len(curPhrase) = longestPhraseLen) Then
AddDistinctPhrase curPhrase, longestPhrases
End If
End If
Next
Set FindLongestMatchingPhrase = longestPhrases
End Function
Public Function BuildPhrases(wordList As Variant) As Collection
Dim phrases As Collection
Set phrases = New Collection
Dim firstIndex As Integer, secondIndex As Integer, curPhrase As Variant
For firstIndex = 0 To UBound(wordList)
curPhrase = ""
For secondIndex = firstIndex To UBound(wordList)
If (secondIndex > firstIndex) Then
curPhrase = curPhrase & " "
End If
curPhrase = curPhrase & wordList(secondIndex)
AddDistinctPhrase curPhrase, phrases
Next
Next
Set BuildPhrases = phrases
End Function
Public Sub AddDistinctPhrase(curPhrase As Variant, phrases As Collection)
On Error GoTo NotFound
Dim existingPhrase As Variant
existingPhrase = phrases(curPhrase)
Exit Sub
NotFound:
phrases.Add Key:=curPhrase, Item:=curPhrase
Exit Sub
End Sub
Public Function FoundInOtherPhrases(curPhrase As Variant, phrases As
Collection) As Boolean
On Error GoTo NotFound
Dim existingPhrase As Variant
existingPhrase = phrases(curPhrase)
FoundInOtherPhrases = True
Exit Function
NotFound:
FoundInOtherPhrases = False
Exit Function
End Function
"David Turner" wrote:
> The problem seems relatively straightward to the human eye but I imagine it's
> quite difficult to do in VBA without prior knowledge of what the strings
> contain.
> If I have two strings like:
> The quick red fox jumped over the lazy brown cow
> The quick red fox just stood there
> How would I compare them and identify "quick red fox" as the longest common
> sub-string?
> I suppose I would have to start by reading one of the strings into an array
> and comparing its elements against the second string? Or somehow use the
> Filter function to send the matching items to another array? But then I can
> only seem to be able compare each element against one word ("quick", "red" or
> "fox") which I would have to know in advance.
> There's no doubt some better way.
> Any advice greatly appreciated.
>
> Sub CompareStrings()
> Dim Array1() As String
> Dim string1 As String
> Dim string2 As String
> Dim InString() As String
> Dim i As Integer
> Dim j As Integer
>
> string1 = "The quick red fox jumped over the lazy brown cow"
> string2 = "The quick red fox just stood there"
>
> For i = 0 To UBound(Split(string1, " ")) - 1
> ReDim Preserve Array1(i)
> Array1(i) = Split(string1, " ")(i)
> 'MsgBox Array1(i)
> Next
>
> InString = Filter(Array1, "red", True)
>
> For j = 0 To UBound(InString)
> MsgBox InString(j)
> Next j
>
> End Sub
|
|
0
|
|
|
|
Reply
|
Utf
|
2/4/2010 8:28:06 PM
|
|
Awesome! Only takes a second even with quite long strings. Will do my best to
try and understand the code. Many thanks.
"Office PC Developer" wrote:
> This should do the trick...
>
> Option Explicit
>
> Public Function Main()
>
> Dim aStr As String, bStr As String
> aStr = "The quick red fox jumped over the lazy brown cow"
> bStr = "The quick red fox just stood there"
>
> Dim longestPhrases As Collection, longPhrase As Variant
> Set longestPhrases = FindLongestMatchingPhrase(aStr, bStr)
> For Each longPhrase In longestPhrases
> Debug.Print "Longest Matching Phrase: " & longPhrase
> Next
>
> End Function
>
> Public Function FindLongestMatchingPhrase(aStr As String, bStr As String) As
> Collection
>
> Dim longestPhrases As Collection
> Set longestPhrases = New Collection
>
> Dim aWords As Variant, bWords As Variant
> aWords = Split(aStr, " ")
> bWords = Split(bStr, " ")
>
> Dim aPhrases As Collection, bPhrases As Collection
> Set aPhrases = BuildPhrases(aWords)
> Set bPhrases = BuildPhrases(bWords)
>
> Dim curPhrase As Variant, longestPhraseLen As Integer
> longestPhraseLen = 0
> For Each curPhrase In aPhrases
> If (FoundInOtherPhrases(curPhrase, bPhrases)) Then
> If (Len(curPhrase) > longestPhraseLen) Then
> longestPhraseLen = Len(curPhrase)
> Set longestPhrases = New Collection
> AddDistinctPhrase curPhrase, longestPhrases
> ElseIf (Len(curPhrase) = longestPhraseLen) Then
> AddDistinctPhrase curPhrase, longestPhrases
> End If
> End If
> Next
> For Each curPhrase In bPhrases
> If (FoundInOtherPhrases(curPhrase, aPhrases)) Then
> If (Len(curPhrase) > longestPhraseLen) Then
> longestPhraseLen = Len(curPhrase)
> Set longestPhrases = New Collection
> AddDistinctPhrase curPhrase, longestPhrases
> ElseIf (Len(curPhrase) = longestPhraseLen) Then
> AddDistinctPhrase curPhrase, longestPhrases
> End If
> End If
> Next
>
> Set FindLongestMatchingPhrase = longestPhrases
>
> End Function
>
> Public Function BuildPhrases(wordList As Variant) As Collection
>
> Dim phrases As Collection
> Set phrases = New Collection
>
> Dim firstIndex As Integer, secondIndex As Integer, curPhrase As Variant
> For firstIndex = 0 To UBound(wordList)
> curPhrase = ""
> For secondIndex = firstIndex To UBound(wordList)
> If (secondIndex > firstIndex) Then
> curPhrase = curPhrase & " "
> End If
> curPhrase = curPhrase & wordList(secondIndex)
> AddDistinctPhrase curPhrase, phrases
> Next
> Next
>
> Set BuildPhrases = phrases
>
> End Function
>
> Public Sub AddDistinctPhrase(curPhrase As Variant, phrases As Collection)
>
> On Error GoTo NotFound
> Dim existingPhrase As Variant
> existingPhrase = phrases(curPhrase)
> Exit Sub
>
> NotFound:
> phrases.Add Key:=curPhrase, Item:=curPhrase
> Exit Sub
>
> End Sub
>
> Public Function FoundInOtherPhrases(curPhrase As Variant, phrases As
> Collection) As Boolean
>
> On Error GoTo NotFound
> Dim existingPhrase As Variant
> existingPhrase = phrases(curPhrase)
> FoundInOtherPhrases = True
> Exit Function
>
> NotFound:
> FoundInOtherPhrases = False
> Exit Function
>
> End Function
>
>
> "David Turner" wrote:
>
> > The problem seems relatively straightward to the human eye but I imagine it's
> > quite difficult to do in VBA without prior knowledge of what the strings
> > contain.
> > If I have two strings like:
> > The quick red fox jumped over the lazy brown cow
> > The quick red fox just stood there
> > How would I compare them and identify "quick red fox" as the longest common
> > sub-string?
> > I suppose I would have to start by reading one of the strings into an array
> > and comparing its elements against the second string? Or somehow use the
> > Filter function to send the matching items to another array? But then I can
> > only seem to be able compare each element against one word ("quick", "red" or
> > "fox") which I would have to know in advance.
> > There's no doubt some better way.
> > Any advice greatly appreciated.
> >
> > Sub CompareStrings()
> > Dim Array1() As String
> > Dim string1 As String
> > Dim string2 As String
> > Dim InString() As String
> > Dim i As Integer
> > Dim j As Integer
> >
> > string1 = "The quick red fox jumped over the lazy brown cow"
> > string2 = "The quick red fox just stood there"
> >
> > For i = 0 To UBound(Split(string1, " ")) - 1
> > ReDim Preserve Array1(i)
> > Array1(i) = Split(string1, " ")(i)
> > 'MsgBox Array1(i)
> > Next
> >
> > InString = Filter(Array1, "red", True)
> >
> > For j = 0 To UBound(InString)
> > MsgBox InString(j)
> > Next j
> >
> > End Sub
|
|
0
|
|
|
|
Reply
|
Utf
|
2/4/2010 11:01:02 PM
|
|
|
12 Replies
1020 Views
(page loaded in 0.195 seconds)
|