Suffix Array
Given a string s, sort all cyclic shifts of s. Formally produce a table p such that p[j]=i if s[i:]+s[:i] has rank j among all cyclic shifts.
Example
On input “bobocel”, here are all cyclic shifts, together with by how much they are shifted.
bobocel 0
obocelb 1
bocelbo 2
ocelbob 3
celbobo 4
elboboc 5
lboboce 6
And the same list, but lexicographically sorted. Together with the rank in the new order.
0 bobocel 0
1 bocelbo 2
2 celbobo 4
3 elboboc 5
4 lboboce 6
5 obocelb 1
6 ocelbob 3
Hence on this example we should output p=[0, 2, 4, 5, 6, 1, 3]
.
Complexity
The naïve algorithm for this problem stores all cyclic shifts in a list and sorts it. This takes time $O(n^2 \log n)$, because the list has size $n$, and comparing two strings of length $n$ takes time $O(n)$.
The problem can be solved in time $O(n)$, under some conditions on the alphabet. But we present an $O(n \log^2 n)$ implementation, which is good enough for most programming contests.
The key operation
The algorithm relies on a simple sorting function sort_class
, which not only sorts a given string or list s
, but also returns additional informations. It returns two tables p
and c
such that
- p[j]=i if s[i] has rank j in
sorted(s)
. - c[i]=j if s[i] has rank j in
sorted(set(s))
.
Note that the second table groups identical elements in s, and gives a rank only to the equivalence classes. For example
index = 0123456
input s = "bobocel"
after sorting s we have for example (because identical letters can be ordered arbitrarily within each other)
original index = 0245613
sorted s = "bbceloo"
If we rank all distinct letters of the input string we have
rank = 0 1 2 3 4
sorted(set(s)) = b c e l o
Hence our function returns
p = [0, 2, 4, 5, 6, 1, 3]
c = [0, 4, 0, 4, 1, 2, 3]
s = b o b o c e l # for comparison
The cyclic version
We present here the cyclic version of the problem. If we want to sort the suffixes of a given string s, then we can just sort the cyclic shifts of s + special, where special is a dummy character, smaller than all characters in s.
Iteration
The idea is that for K being every integer power of 2, we want to sort the K-lengths prefixes of all cyclic shifts. For example for K=2 and s=”bobocel” we want to sort the following strings.
bo
ob
bo
oc
ce
el
lb
But we already have the order and equivalence classes of all (K/2)-length prefixes of all cyclic shifts. Now every K-lengths prefix is the concatenation of two (K/2)-length prefixes, say xy
. Let i be the equivalence class of x
and j be the equivalence class of y
. Then the pair (i,j) represents the equivalence class of xy
. We can use sort_class
to translate the pairs into rank integers.
The strings of the example above correspond to the following pairs, where the table c
from the previous iteration makes the correspondance.
(0, 4)
(4, 0)
(0, 4)
(4, 1)
(1, 2)
(2, 3)
(3, 0)
In total we have a logarithmic number of outer iterations, each costs O(n log n)
, leading to the claimed complexity.
Implementation
Update September 2024
Programming is often a question of compromise. The implementation of sort_class
is quite short. But its usage of a dictionary makes it a bit slow. The construction of array c
can be improved by processing all items in order, as given by p, and keeping track of the distinct values seen so far. This gives an improvement of about 30% in a test we made. In many implementations the sorting of S_index
is done by two stages of bucket sort. Since the keys are couples of ranks, we can sort first by the second rank, and then do a stable sort on the first rank. This will generate quite some lines of code in Python, and we don’t feel ready yet for so much compromise.
References
- an O(nlogn) implementation in CP-algorithms – describes applications
- Visualization in action
- Quest for the quickest implementation in Python