How do I find the overlap of two strings in bash?

Question

I have two strings. For the sake of the example they are set like this:

string1="test toast"
string2="test test"

What I want is to find the overlap starting at the beginning of the strings. With overlap I mean the string "test t" in my above example.

# I look for the command 
command "$string1" "$string2"
# that outputs:
"test t"

If the strings were string1="atest toast"; string2="test test" they would have no overlap since the check starts form the beginning and the "a" at the start of string1 .

Please do not cross post! How do I find the overlap of two strings in bash? — Caleb, Aug 07 '11 at 17:16
This is exactly the reason people aren't supposed to cross-post; now it has multiple answers on each site that are different, and it's on-topic for both sites. I think I'm just going to leave it here anyway — Michael Mrozek, Aug 08 '11 at 14:15

score 12 · Accepted Answer · answered Aug 07 '11 at 14:13

12

You can think of a function like this, with some error check to add

common_prefix() {
  local n=0
  while [[ "${1:n:1}" == "${2:n:1}" ]]; do
    ((n++))
  done
  echo "${1:0:n}"
}

answered Aug 07 '11 at 14:13

enzotib

51,661

I just noticed that when run with two empty/null args it enters an ∞ loop. [[ -z "$1$2" ]] && return fixes it. – Peter.O Feb 14 '12 at 14:28
This method is exponentially slower (rather than linearly). As the string doubles in length, the time increases by a factor of 4 (approx). Here are some string-length/time comparisons to Gilles' binary-split: .. 64 0m0.005s vs 0m0.003s - 128 0m0.013s vs 0m0.003s - 256 0m0.041s vs 0m0.003s - 512 0m0.143s vs 0m0.005s - 1024 0m0.421s vs 0m0.009s - 2048 0m1.575s vs 0m0.012s - 4096 0m5.967s vs 0m0.022s - 8192 0m24.693s vs 0m0.049s - 16384 1m34.004s vs 0m0.085s - 32768 6m34.721s vs 0m0.168s - 65536 27m34.012s vs 0m0.370s – Peter.O Aug 29 '12 at 06:15
2

@Peter.O Quadratically, not exponentially. – Gilles 'SO- stop being evil' Aug 29 '12 at 11:27
I guess bash stores strings internally with implicit length, so getting the nth character requires scanning n characters to check that they're not the string-terminating zero-byte. This is consistent with bash being unable to store a zero-byte in a variable. – Peter Cordes Sep 05 '15 at 05:14

Gilles 'SO- stop being evil' · Answer 2 · 2012-08-29T12:36:55.660

This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that is logarithmic in the number of shell operations, so pure bash is a viable option even for long strings.

longest_common_prefix () {
  local prefix= n
  ## Truncate the two strings to the minimum of their lengths
  if [[ ${#1} -gt ${#2} ]]; then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  ## Binary search for the first differing character, accumulating the common prefix
  while [[ ${#1} -gt 1 ]]; do
    n=$(((${#1}+1)/2))
    if [[ ${1:0:$n} == ${2:0:$n} ]]; then
      prefix=$prefix${1:0:$n}
      set -- "${1:$n}" "${2:$n}"
    else
      set -- "${1:0:$n}" "${2:0:$n}"
    fi
  done
  ## Add the one remaining character, if common
  if [[ $1 = $2 ]]; then prefix=$prefix$1; fi
  printf %s "$prefix"
}

The standard toolbox includes cmp to compare binary files. By default, it indicates the byte offset of the first differing bytes. There is a special case when one string is a prefix of the other: cmp produces a different message on STDERR; an easy way to deal with this is to take whichever string is the shortest.

longest_common_prefix () {
  local LC_ALL=C offset prefix
  offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}

Note that cmp operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for examples locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.

longest_common_prefix () {
  local offset prefix LC_CTYPE="${LC_ALL:=$LC_CTYPE}"
  offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32) \
                                           <(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset/4-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}

Revisiting this question (1 year on), I've have re-assessed the best answer. It's all quite simple: rock breaks scissors, scissors cut paper, paper wraps rock. and binary eats sequential!.. even for quite short strings.. and as for a moderate 10000 char string being processed sequentially via while char-by-char, I'm still waiting for it as I write this.. time passes.. still waiting (maybe there's something wrong with my system).. time passes.. there must be something wrong; it is only 10,000 itterations! Ah! patience is a virtue (perhaps a curse in this case).. 13m53.755s.. vs, 0m0.322s — Peter.O, Aug 29 '12 at 02:49
The 3 methods given here are the outright fastest of all the presented answers.. Basically, cmp is the fastest (but is not char based). The next is iconv and then the very respectibly fast binary-split answer. Thanks Gilles. It took me a year to get to this point, but better late than never. (PS. 2 typo mods in iconv code: $ in =$LC_CTYPE} and \ in UTF-32) \)
... PPS. actually the string I mentioned above was longer than 10,000 characters. It was the result of {1..10000} which is, 48,894, but that doesnt' change the differential — Peter.O, Aug 29 '12 at 12:11

score 6 · Answer 3 · edited Mar 07 '13 at 10:42

6

In sed, assuming the strings don't contain any newline characters:

string1="test toast"
string2="test test"
printf "%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'

edited Mar 07 '13 at 10:42

Stéphane Chazelas

544,893

answered Aug 07 '11 at 15:24

jfg956

6,336

But duplicate with this. – jfg956 Aug 07 '11 at 15:25
Brilliant! goes directly to my tips&tricks library :-) – hmontoliu Aug 08 '11 at 11:27
Or, for a bash string, which can't contain \0. Using tr and \0, the method can handle newlines in the string, .... { printf "%s" "$string1" |tr \\n \\0; echo; printf "%s" "$string2" |tr \\n \\0; echo; } | sed -e 'N;s/^$.*$.*\n\1.*$/\1/' |tr \\0 \\n – Peter.O Aug 29 '12 at 01:25
I've just tested this sed method a bit further,and it seems that using back-references this way (in the search pattern) is hugely expensive. It still outperforms the sequential byte-by-byte looping (by approx a factor of 3), but here is an example: for two 32kb strings (with the last byte being different), it takes 2m4.880s, as compared to Gilles' binary-split method 0m0.168s – Peter.O Aug 29 '12 at 07:54

score 2 · Answer 4 · answered Aug 07 '11 at 14:21

2

This seems crude to me, but you can do it via brute force:

#!/bin/bash

string1="test toast"
string2="test test"

L=1  # Prefix length

while [[ ${string1:0:$L} == ${string2:0:$L} ]]
do
    ((L = L + 1))
done

echo Overlap: ${string1:0:$((L - 1))}

I want some clever algorithm to exist, but I can't find any with a short search.

answered Aug 07 '11 at 14:21

2

compare half and repeat is n*log(n) rather than n^2. – Gilles 'SO- stop being evil' Aug 13 '11 at 21:57
2

For general reference, it is a bit on the slow side. Two 32768 character strings (the last char being different) took 6m27.689s. – Peter.O Aug 29 '12 at 08:34

How do I find the overlap of two strings in bash?

4 Answers4

Linked