TryAlgo

Searching a substring

Statement

A basic problem in string algorithms is to locate a given substring needle in a given string haystack. Formally the goal is to find an index i such that the j-th character in needle is equivalent to the (i+j)-th character in haystack.

A naïve implementation would have worst case time complexity $\Theta(n m)$ where $n,m$ are the respective sizes of the given strings. Here is a tight example.

Given positive integer n, locate the string $a^nb$ in the string $a^{2n}$.

The naïve algorithm would try $\Theta(n)$ locations (candidates for the index $i$), compare the characters one by one and only at the end observe the inequality $a \neq b$ and realize a fail. Hence the naïve algorithm has running time $\Theta(n^2)$ on this example.

However there are algorthms of complexity $O(n + m)$, the most famous being Knuth-Morris-Pratt, see Rechercher des mots dans un texte par l’algorithme de Knuth-Morris-Pratt or tryalgo.knuth_morris_pratt.

Experiments

We did experiments in Java, Python and C++. The latter languages implement the linear time search algorithm, but Java implements the naïve quadratic time algorithm. This has been confirmed by looking at the code of the String.indexOf(String) method. Experiments have been done with various versions of Java, up to version 10. Here is the code we used.

class Test_str_find {
    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);
        int n2 = 2*n;
        StringBuilder needle = new StringBuilder();
        StringBuilder haystack = new StringBuilder();
        for (int i=0; i<n; i++)
            needle.append('a');
        needle.append('b');
        for (int i=0; i<n2; i++) {
            haystack.append('a');
        }
        String s_needle = needle.toString();
        String s_haystack = haystack.toString();
        s_haystack.indexOf(s_needle);
    }
}

Workaround

In Java however there is the possibility to search a regular expression in a given string, and this implementation is efficient. Hence you can use it to search for a substring. The following code has linear time complexity.

public int my_indexOf(String needle, String haystack) {
    Pattern p = Pattern.compile(String.quote(needle));
    Matcher m = p.matcher(haystack);
    if (m.find()) 
        return m.start();
    else
        return -1;
}