American Flag String Sort

In-place MSD radix string sort that partitions strings by character using American flag style bucket permutation.

American Flag String Sort

American flag string sort is an in-place MSD radix sort for strings. It partitions strings by the character at the current position, then recursively sorts each bucket by the next character.

It uses the same bucket permutation idea as American flag sort, but the digit source is a string character rather than an integer digit.

Problem

Given an array $A$ of strings, sort the strings in lexicographic order.

For a string $s$, let:

$$ s[d] $$

be the character at position $d$. If $d$ reaches the end of the string, use a sentinel value smaller than every real character.

Idea

At each recursion level:

  1. Count characters at position $d$
  2. Compute bucket ranges
  3. Move strings into their correct bucket in place
  4. Recursively sort nontrivial buckets at position $d + 1$

The sentinel bucket handles shorter strings, so prefix order is correct.

Algorithm

american_flag_string_sort(A, lo, hi, d, alphabet):
    if hi - lo <= 1:
        return

    count = [0] * alphabet

    for i from lo to hi - 1:
        c = char_at(A[i], d)
        count[c] += 1

    start = [0] * alphabet
    start[0] = lo

    for c from 1 to alphabet - 1:
        start[c] = start[c - 1] + count[c - 1]

    next = copy(start)
    end = [start[c] + count[c] for c in 0..alphabet-1]

    i = lo
    while i < hi:
        c = char_at(A[i], d)

        if start[c] <= i < end[c]:
            i += 1
        else:
            j = next[c]
            swap A[i], A[j]
            next[c] += 1

    for c from 1 to alphabet - 1:
        if end[c] - start[c] > 1:
            american_flag_string_sort(A, start[c], end[c], d + 1, alphabet)

The loop skips recursive sorting of the sentinel bucket because strings that have ended need no more characters processed.

Example

Sort:

$$ ["car", "cat", "dog", "cart"] $$

At character position 0:

bucket strings
c car, cat, cart
d dog

The c bucket is sorted by position 1, then position 2:

prefix strings
car car, cart
cat cat

The sentinel makes:

$$ "car" < "cart" $$

Final result:

$$ ["car", "cart", "cat", "dog"] $$

Correctness

At each level, bucket order follows character order. All strings in an earlier bucket are lexicographically smaller than all strings in a later bucket at the first position where they differ.

Within a bucket, all strings share the same current character, so recursive sorting by the next position decides their remaining order. The sentinel bucket correctly places shorter prefix strings before longer extensions.

By induction on character position, the final array is lexicographically sorted.

Complexity

Let:

  • $n$ be the number of strings
  • $L$ be the average string length
  • $\sigma$ be the alphabet size

Typical time:

$$ O(nL) $$

Extra space per recursion level:

$$ O(\sigma) $$

The algorithm is in place with respect to the string array.

Properties

property value
stable no
in place yes
comparison based no
direction MSD
best for strings and byte sequences

When to Use

Use American flag string sort when sorting many strings or byte sequences and auxiliary arrays of size $n$ are undesirable. It is useful when strings are large, prefix structure matters, and in-place operation is more important than stability.

Avoid it when stability is required, when the alphabet is large and sparse, or when a simpler comparison sort is sufficient.

Implementation

def american_flag_string_sort(a):
    alphabet = 257

    def char_at(s, d):
        if d == len(s):
            return 0
        return ord(s[d]) + 1

    def sort(lo, hi, d):
        if hi - lo <= 1:
            return

        count = [0] * alphabet

        for i in range(lo, hi):
            count[char_at(a[i], d)] += 1

        start = [0] * alphabet
        start[0] = lo

        for c in range(1, alphabet):
            start[c] = start[c - 1] + count[c - 1]

        end = [start[c] + count[c] for c in range(alphabet)]
        next_pos = start.copy()

        i = lo
        while i < hi:
            c = char_at(a[i], d)

            if start[c] <= i < end[c]:
                i += 1
            else:
                j = next_pos[c]
                a[i], a[j] = a[j], a[i]
                next_pos[c] += 1

        for c in range(1, alphabet):
            if end[c] - start[c] > 1:
                sort(start[c], end[c], d + 1)

    sort(0, len(a), 0)
    return a