Compare two strings and find longest common sub-string

  • Follow


The problem seems relatively straightward to the human eye but I imagine it's 
quite difficult to do in VBA without prior knowledge of what the strings 
contain.
If I have two strings like:
The quick red fox jumped over the lazy brown cow
The quick red fox just stood there
How would I compare them and identify "quick red fox" as the longest common 
sub-string?
I suppose I would have to start by reading one of the strings into an array 
and comparing its elements against the second string? Or somehow use the 
Filter function to send the matching items to another array? But then I can 
only seem to be able compare each element against one word ("quick", "red" or 
"fox") which I would have to know in advance.
There's no doubt some better way.
Any advice greatly appreciated.

Sub CompareStrings()
    Dim Array1() As String
    Dim string1 As String
    Dim string2 As String
    Dim InString() As String
    Dim i As Integer
    Dim j As Integer

    string1 = "The quick red fox jumped over the lazy brown cow"
    string2 = "The quick red fox just stood there"

    For i = 0 To UBound(Split(string1, " ")) - 1
        ReDim Preserve Array1(i)
        Array1(i) = Split(string1, " ")(i)
        'MsgBox Array1(i)
    Next

    InString = Filter(Array1, "red", True)

    For j = 0 To UBound(InString)
        MsgBox InString(j)
    Next j

End Sub
0
Reply Utf 2/3/2010 2:04:06 PM

David Turner wrote:
> The problem seems relatively straightward to the human eye but I imagine it's 
> quite difficult to do in VBA without prior knowledge of what the strings 
> contain.
> If I have two strings like:
> The quick red fox jumped over the lazy brown cow
> The quick red fox just stood there
> How would I compare them and identify "quick red fox" as the longest common 
> sub-string?

Perhaps it's not even very easy to the human eye?  (The longest common 
substring was actually "The quick red fox j", not "quick red fox".)

> I suppose I would have to start by reading one of the strings into an array 
> and comparing its elements against the second string? Or somehow use the 
> Filter function to send the matching items to another array? But then I can 
> only seem to be able compare each element against one word ("quick", "red" or 
> "fox") which I would have to know in advance.
> There's no doubt some better way.
> Any advice greatly appreciated.

I'd suggest you start by clearly enumerating your actual requirements.  
Is this a character comparison or a word comparison?  Do the matches 
need to start at the same position, or may they be anywhere within the 
string.  Etc, etc, etc...

If you can't define the problem, you'll never solve it.

-- 
..NET: It's About Trust!
http://vfred.mvps.org


0
Reply Karl 2/3/2010 6:39:35 PM


I would put them in a For Next loop incremented from 1 to the length of the 
shortest string.
Use the Mid() function to extract the characters in the shortest string one 
by one starting at the first.
Use the Instr() function to search for the character in the longer string.
As long as the strings are a match, increment and keep looping. When they 
don't match, use the index of the loop to tell you the position.
Use the Left() function to return the portion of the string that matches.

This will only work if the two string start out the same. It gets more 
complex if the patterns match in the middle.

"David Turner" wrote:

> The problem seems relatively straightward to the human eye but I imagine it's 
> quite difficult to do in VBA without prior knowledge of what the strings 
> contain.
> If I have two strings like:
> The quick red fox jumped over the lazy brown cow
> The quick red fox just stood there
> How would I compare them and identify "quick red fox" as the longest common 
> sub-string?
> I suppose I would have to start by reading one of the strings into an array 
> and comparing its elements against the second string? Or somehow use the 
> Filter function to send the matching items to another array? But then I can 
> only seem to be able compare each element against one word ("quick", "red" or 
> "fox") which I would have to know in advance.
> There's no doubt some better way.
> Any advice greatly appreciated.
> 
> Sub CompareStrings()
>     Dim Array1() As String
>     Dim string1 As String
>     Dim string2 As String
>     Dim InString() As String
>     Dim i As Integer
>     Dim j As Integer
> 
>     string1 = "The quick red fox jumped over the lazy brown cow"
>     string2 = "The quick red fox just stood there"
> 
>     For i = 0 To UBound(Split(string1, " ")) - 1
>         ReDim Preserve Array1(i)
>         Array1(i) = Split(string1, " ")(i)
>         'MsgBox Array1(i)
>     Next
> 
>     InString = Filter(Array1, "red", True)
> 
>     For j = 0 To UBound(InString)
>         MsgBox InString(j)
>     Next j
> 
> End Sub
0
Reply Utf 2/3/2010 6:48:08 PM

I would agree with Karl.  As, technically, the largest sub-string is indeed
"The quick red fox j", you need to spell out EXACTLY the requirements.

Neil: "This will only work if the two string start out the same. It gets more
complex if the patterns match in the middle."

That is putting it mildly.

This is a very complex task.  Quite likely do-able, but precise requirements
are needed.

Neil Humphries wrote:
>I would put them in a For Next loop incremented from 1 to the length of the 
>shortest string.
>Use the Mid() function to extract the characters in the shortest string one 
>by one starting at the first.
>Use the Instr() function to search for the character in the longer string.
>As long as the strings are a match, increment and keep looping. When they 
>don't match, use the index of the loop to tell you the position.
>Use the Left() function to return the portion of the string that matches.
>
>This will only work if the two string start out the same. It gets more 
>complex if the patterns match in the middle.
>
>> The problem seems relatively straightward to the human eye but I imagine it's 
>> quite difficult to do in VBA without prior knowledge of what the strings 
>[quoted text clipped - 36 lines]
>> 
>> End Sub

-- 
Message posted via OfficeKB.com
http://www.officekb.com/Uwe/Forums.aspx/word-programming/201002/1

0
Reply Fumei2 2/3/2010 7:06:14 PM


"Karl E. Peterson" wrote:

> I'd suggest you start by clearly enumerating your actual requirements.  
> Is this a character comparison or a word comparison?  Do the matches 
> need to start at the same position, or may they be anywhere within the 
> string.  Etc, etc, etc...
> 
> If you can't define the problem, you'll never solve it.

Sorry. Word comparison, the matches can start anywhere in the strings and 
the strings could be of variable length, possibly ending in punctuation: two 
sentences for example.
I now realise it's a much more complex problem than I first thought.

0
Reply Utf 2/3/2010 8:59:01 PM

It is probably several orders of magnitude more complex than you may have 
first though.

I am wondering however is some of the purpose-built plagiarism detection 
software might not be the way to go.

-- 
Hope this helps,

Doug Robbins - Word MVP

Please reply only to the newsgroups unless you wish to obtain my services on
a paid professional basis.

"David Turner" <DavidTurner@discussions.microsoft.com> wrote in message 
news:2A1C80E7-9C34-49A7-A13E-B1ECB60C46FA@microsoft.com...
>
>
> "Karl E. Peterson" wrote:
>
>> I'd suggest you start by clearly enumerating your actual requirements.
>> Is this a character comparison or a word comparison?  Do the matches
>> need to start at the same position, or may they be anywhere within the
>> string.  Etc, etc, etc...
>>
>> If you can't define the problem, you'll never solve it.
>
> Sorry. Word comparison, the matches can start anywhere in the strings and
> the strings could be of variable length, possibly ending in punctuation: 
> two
> sentences for example.
> I now realise it's a much more complex problem than I first thought.
> 

0
Reply Doug 2/3/2010 10:20:36 PM

Doug Robbins - Word MVP wrote:
> It is probably several orders of magnitude more complex than you may have 
> first though.

It's sounding fairly simple, actually, as brute force tasks go.  And, 
likewise, as brute force tasks go, it's going to take exponentially 
longer to execute as the string size(s) grow.  It's not unlike cracking 
the combination to a lock without knowing how many numbers are in the 
combination, although we are rewarded with partial success results 
which actually makes it possible.

You'd "just" start comparing the first character to every character in 
the other string, noting if you have a match of one.  If you don't find 
the first char, you scan the second string for the 2nd char in the 
first string, and so on.  Once you have a match of one, you start 
looking for a match of two.  You'd scan the second string for the first 
two chars in the first.  If they're not there, you'd start scanning the 
second string for the 2nd and 3rd chars in the first, and so on.  
Simple, huh? <g>  Assuming you know the rules.  Must the case match?  
Etc, etc, etc...

> I am wondering however is some of the purpose-built plagiarism detection 
> software might not be the way to go.

That's not a bad thought.

-- 
..NET: It's About Trust!
http://vfred.mvps.org


0
Reply Karl 2/3/2010 10:33:05 PM

In the simple case where you match character by character, white space 
has to match exactly, etc. you're probably better off starting by 
comparing the endpoints of the longest possible match, then reducing the 
comparison length. If the endpoints don't match, the contents can't 
match so you do not actually need to compare them.

Can't prove it though. But I would have thought there was good coverage 
of this kind of algorithm in standard works such as Knuth's Sorting and 
Searching volume. I don't know where you find that stuff online.


Peter Jamieson

http://tips.pjmsn.me.uk

On 03/02/2010 22:33, Karl E. Peterson wrote:
> Doug Robbins - Word MVP wrote:
>> It is probably several orders of magnitude more complex than you may
>> have first though.
>
> It's sounding fairly simple, actually, as brute force tasks go. And,
> likewise, as brute force tasks go, it's going to take exponentially
> longer to execute as the string size(s) grow. It's not unlike cracking
> the combination to a lock without knowing how many numbers are in the
> combination, although we are rewarded with partial success results which
> actually makes it possible.
>
> You'd "just" start comparing the first character to every character in
> the other string, noting if you have a match of one. If you don't find
> the first char, you scan the second string for the 2nd char in the first
> string, and so on. Once you have a match of one, you start looking for a
> match of two. You'd scan the second string for the first two chars in
> the first. If they're not there, you'd start scanning the second string
> for the 2nd and 3rd chars in the first, and so on. Simple, huh? <g>
> Assuming you know the rules. Must the case match? Etc, etc, etc...
>
>> I am wondering however is some of the purpose-built plagiarism
>> detection software might not be the way to go.
>
> That's not a bad thought.
>
0
Reply Peter 2/4/2010 12:53:22 AM

Peter Jamieson wrote:
> In the simple case where you match character by character, white space has to 
> match exactly, etc. you're probably better off starting by comparing the 
> endpoints of the longest possible match, then reducing the comparison length. 
> If the endpoints don't match, the contents can't match so you do not actually 
> need to compare them.

Yeah, that's a potentially good option, too.  A bit more complex, but 
that happens when you start trying to finesse brute force attacks.  :-)

> Can't prove it though. But I would have thought there was good coverage of 
> this kind of algorithm in standard works such as Knuth's Sorting and 
> Searching volume. I don't know where you find that stuff online.

You'd think, yeah.  But there aren't many folks who still think in 2N 
versus N^2 terms.  Most just throw a framework at it, and see what 
sticks.

-- 
..NET: It's About Trust!
http://vfred.mvps.org


0
Reply Karl 2/4/2010 1:09:21 AM

 > You'd think, yeah. But there aren't many folks who still think in 2N
 > versus N^2 terms. Most just throw a framework at it, and see what
 > sticks.

Yeah. For the "build it from the ground up" types (which unfortunately 
probably includes me),

http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml

could be handy. I got there via

http://www.itl.nist.gov/div897/sqg/dads/

Peter Jamieson

http://tips.pjmsn.me.uk

On 04/02/2010 01:09, Karl E. Peterson wrote:
> Peter Jamieson wrote:
>> In the simple case where you match character by character, white space
>> has to match exactly, etc. you're probably better off starting by
>> comparing the endpoints of the longest possible match, then reducing
>> the comparison length. If the endpoints don't match, the contents
>> can't match so you do not actually need to compare them.
>
> Yeah, that's a potentially good option, too. A bit more complex, but
> that happens when you start trying to finesse brute force attacks. :-)
>
>> Can't prove it though. But I would have thought there was good
>> coverage of this kind of algorithm in standard works such as Knuth's
>> Sorting and Searching volume. I don't know where you find that stuff
>> online.
>
> You'd think, yeah. But there aren't many folks who still think in 2N
> versus N^2 terms. Most just throw a framework at it, and see what sticks.
>
0
Reply Peter 2/4/2010 9:39:18 AM

I cobble this together no idea where to go, if anywhere(!), from here:

Sub LongestMatch()
    Dim strArray1() As String
    Dim strArray2() As String
    Dim string1 As String
    Dim string2 As String
    Dim i As Integer
    Dim j As Integer
    Dim m As Integer
    Dim n As Integer


    string1 = "quick red fox jumped"
    string2 = "red fox stood"

    For m = 0 To UBound(Split(string1, " ")) - 1
        ReDim Preserve strArray1(m)
        strArray1(m) = Split(string1, " ")(m)
        'MsgBox strArray1(m)
    Next

    For n = 0 To UBound(Split(string2, " ")) - 1
        ReDim Preserve strArray2(n)
        strArray2(n) = Split(string2, " ")(n)
        'MsgBox strArray2(n)
    Next

    For m = 0 To UBound(strArray1)
        For n = 0 To UBound(strArray2)
            If strArray1(m) = strArray2(n) Then
                MsgBox strArray1(m) ' don't know what to do next
            ElseIf strArray1(m) <> strArray2(n) Then
                MsgBox strArray2(n) ' don't know what to do next
            End If
        Next
    Next

End Sub
0
Reply Utf 2/4/2010 11:58:01 AM

This should do the trick...

Option Explicit

Public Function Main()

    Dim aStr As String, bStr As String
    aStr = "The quick red fox jumped over the lazy brown cow"
    bStr = "The quick red fox just stood there"
    
    Dim longestPhrases As Collection, longPhrase As Variant
    Set longestPhrases = FindLongestMatchingPhrase(aStr, bStr)
    For Each longPhrase In longestPhrases
        Debug.Print "Longest Matching Phrase: " & longPhrase
    Next
    
End Function

Public Function FindLongestMatchingPhrase(aStr As String, bStr As String) As 
Collection

    Dim longestPhrases As Collection
    Set longestPhrases = New Collection
    
    Dim aWords As Variant, bWords As Variant
    aWords = Split(aStr, " ")
    bWords = Split(bStr, " ")
    
    Dim aPhrases As Collection, bPhrases As Collection
    Set aPhrases = BuildPhrases(aWords)
    Set bPhrases = BuildPhrases(bWords)
    
    Dim curPhrase As Variant, longestPhraseLen As Integer
    longestPhraseLen = 0
    For Each curPhrase In aPhrases
        If (FoundInOtherPhrases(curPhrase, bPhrases)) Then
            If (Len(curPhrase) > longestPhraseLen) Then
                longestPhraseLen = Len(curPhrase)
                Set longestPhrases = New Collection
                AddDistinctPhrase curPhrase, longestPhrases
            ElseIf (Len(curPhrase) = longestPhraseLen) Then
                AddDistinctPhrase curPhrase, longestPhrases
            End If
        End If
    Next
    For Each curPhrase In bPhrases
        If (FoundInOtherPhrases(curPhrase, aPhrases)) Then
            If (Len(curPhrase) > longestPhraseLen) Then
                longestPhraseLen = Len(curPhrase)
                Set longestPhrases = New Collection
                AddDistinctPhrase curPhrase, longestPhrases
            ElseIf (Len(curPhrase) = longestPhraseLen) Then
                AddDistinctPhrase curPhrase, longestPhrases
            End If
        End If
    Next
    
    Set FindLongestMatchingPhrase = longestPhrases
    
End Function

Public Function BuildPhrases(wordList As Variant) As Collection

    Dim phrases As Collection
    Set phrases = New Collection
    
    Dim firstIndex As Integer, secondIndex As Integer, curPhrase As Variant
    For firstIndex = 0 To UBound(wordList)
        curPhrase = ""
        For secondIndex = firstIndex To UBound(wordList)
            If (secondIndex > firstIndex) Then
                curPhrase = curPhrase & " "
            End If
            curPhrase = curPhrase & wordList(secondIndex)
            AddDistinctPhrase curPhrase, phrases
        Next
    Next
    
    Set BuildPhrases = phrases
    
End Function

Public Sub AddDistinctPhrase(curPhrase As Variant, phrases As Collection)

    On Error GoTo NotFound
    Dim existingPhrase As Variant
    existingPhrase = phrases(curPhrase)
    Exit Sub
    
NotFound:
    phrases.Add Key:=curPhrase, Item:=curPhrase
    Exit Sub
    
End Sub

Public Function FoundInOtherPhrases(curPhrase As Variant, phrases As 
Collection) As Boolean

    On Error GoTo NotFound
    Dim existingPhrase As Variant
    existingPhrase = phrases(curPhrase)
    FoundInOtherPhrases = True
    Exit Function
    
NotFound:
    FoundInOtherPhrases = False
    Exit Function
    
End Function


"David Turner" wrote:

> The problem seems relatively straightward to the human eye but I imagine it's 
> quite difficult to do in VBA without prior knowledge of what the strings 
> contain.
> If I have two strings like:
> The quick red fox jumped over the lazy brown cow
> The quick red fox just stood there
> How would I compare them and identify "quick red fox" as the longest common 
> sub-string?
> I suppose I would have to start by reading one of the strings into an array 
> and comparing its elements against the second string? Or somehow use the 
> Filter function to send the matching items to another array? But then I can 
> only seem to be able compare each element against one word ("quick", "red" or 
> "fox") which I would have to know in advance.
> There's no doubt some better way.
> Any advice greatly appreciated.
> 
> Sub CompareStrings()
>     Dim Array1() As String
>     Dim string1 As String
>     Dim string2 As String
>     Dim InString() As String
>     Dim i As Integer
>     Dim j As Integer
> 
>     string1 = "The quick red fox jumped over the lazy brown cow"
>     string2 = "The quick red fox just stood there"
> 
>     For i = 0 To UBound(Split(string1, " ")) - 1
>         ReDim Preserve Array1(i)
>         Array1(i) = Split(string1, " ")(i)
>         'MsgBox Array1(i)
>     Next
> 
>     InString = Filter(Array1, "red", True)
> 
>     For j = 0 To UBound(InString)
>         MsgBox InString(j)
>     Next j
> 
> End Sub
0
Reply Utf 2/4/2010 8:28:06 PM

Awesome! Only takes a second even with quite long strings. Will do my best to 
try and understand the code. Many thanks.

"Office PC Developer" wrote:

> This should do the trick...
> 
> Option Explicit
> 
> Public Function Main()
> 
>     Dim aStr As String, bStr As String
>     aStr = "The quick red fox jumped over the lazy brown cow"
>     bStr = "The quick red fox just stood there"
>     
>     Dim longestPhrases As Collection, longPhrase As Variant
>     Set longestPhrases = FindLongestMatchingPhrase(aStr, bStr)
>     For Each longPhrase In longestPhrases
>         Debug.Print "Longest Matching Phrase: " & longPhrase
>     Next
>     
> End Function
> 
> Public Function FindLongestMatchingPhrase(aStr As String, bStr As String) As 
> Collection
> 
>     Dim longestPhrases As Collection
>     Set longestPhrases = New Collection
>     
>     Dim aWords As Variant, bWords As Variant
>     aWords = Split(aStr, " ")
>     bWords = Split(bStr, " ")
>     
>     Dim aPhrases As Collection, bPhrases As Collection
>     Set aPhrases = BuildPhrases(aWords)
>     Set bPhrases = BuildPhrases(bWords)
>     
>     Dim curPhrase As Variant, longestPhraseLen As Integer
>     longestPhraseLen = 0
>     For Each curPhrase In aPhrases
>         If (FoundInOtherPhrases(curPhrase, bPhrases)) Then
>             If (Len(curPhrase) > longestPhraseLen) Then
>                 longestPhraseLen = Len(curPhrase)
>                 Set longestPhrases = New Collection
>                 AddDistinctPhrase curPhrase, longestPhrases
>             ElseIf (Len(curPhrase) = longestPhraseLen) Then
>                 AddDistinctPhrase curPhrase, longestPhrases
>             End If
>         End If
>     Next
>     For Each curPhrase In bPhrases
>         If (FoundInOtherPhrases(curPhrase, aPhrases)) Then
>             If (Len(curPhrase) > longestPhraseLen) Then
>                 longestPhraseLen = Len(curPhrase)
>                 Set longestPhrases = New Collection
>                 AddDistinctPhrase curPhrase, longestPhrases
>             ElseIf (Len(curPhrase) = longestPhraseLen) Then
>                 AddDistinctPhrase curPhrase, longestPhrases
>             End If
>         End If
>     Next
>     
>     Set FindLongestMatchingPhrase = longestPhrases
>     
> End Function
> 
> Public Function BuildPhrases(wordList As Variant) As Collection
> 
>     Dim phrases As Collection
>     Set phrases = New Collection
>     
>     Dim firstIndex As Integer, secondIndex As Integer, curPhrase As Variant
>     For firstIndex = 0 To UBound(wordList)
>         curPhrase = ""
>         For secondIndex = firstIndex To UBound(wordList)
>             If (secondIndex > firstIndex) Then
>                 curPhrase = curPhrase & " "
>             End If
>             curPhrase = curPhrase & wordList(secondIndex)
>             AddDistinctPhrase curPhrase, phrases
>         Next
>     Next
>     
>     Set BuildPhrases = phrases
>     
> End Function
> 
> Public Sub AddDistinctPhrase(curPhrase As Variant, phrases As Collection)
> 
>     On Error GoTo NotFound
>     Dim existingPhrase As Variant
>     existingPhrase = phrases(curPhrase)
>     Exit Sub
>     
> NotFound:
>     phrases.Add Key:=curPhrase, Item:=curPhrase
>     Exit Sub
>     
> End Sub
> 
> Public Function FoundInOtherPhrases(curPhrase As Variant, phrases As 
> Collection) As Boolean
> 
>     On Error GoTo NotFound
>     Dim existingPhrase As Variant
>     existingPhrase = phrases(curPhrase)
>     FoundInOtherPhrases = True
>     Exit Function
>     
> NotFound:
>     FoundInOtherPhrases = False
>     Exit Function
>     
> End Function
> 
> 
> "David Turner" wrote:
> 
> > The problem seems relatively straightward to the human eye but I imagine it's 
> > quite difficult to do in VBA without prior knowledge of what the strings 
> > contain.
> > If I have two strings like:
> > The quick red fox jumped over the lazy brown cow
> > The quick red fox just stood there
> > How would I compare them and identify "quick red fox" as the longest common 
> > sub-string?
> > I suppose I would have to start by reading one of the strings into an array 
> > and comparing its elements against the second string? Or somehow use the 
> > Filter function to send the matching items to another array? But then I can 
> > only seem to be able compare each element against one word ("quick", "red" or 
> > "fox") which I would have to know in advance.
> > There's no doubt some better way.
> > Any advice greatly appreciated.
> > 
> > Sub CompareStrings()
> >     Dim Array1() As String
> >     Dim string1 As String
> >     Dim string2 As String
> >     Dim InString() As String
> >     Dim i As Integer
> >     Dim j As Integer
> > 
> >     string1 = "The quick red fox jumped over the lazy brown cow"
> >     string2 = "The quick red fox just stood there"
> > 
> >     For i = 0 To UBound(Split(string1, " ")) - 1
> >         ReDim Preserve Array1(i)
> >         Array1(i) = Split(string1, " ")(i)
> >         'MsgBox Array1(i)
> >     Next
> > 
> >     InString = Filter(Array1, "red", True)
> > 
> >     For j = 0 To UBound(InString)
> >         MsgBox InString(j)
> >     Next j
> > 
> > End Sub
0
Reply Utf 2/4/2010 11:01:02 PM

12 Replies
1020 Views

(page loaded in 0.195 seconds)


Reply: