A clear explanation of finding the longest duplicate substring using binary search on length combined with Rabin-Karp rolling hash.
Problem Restatement
We are given a string s.
We need to return the longest substring that appears at least twice.
If no such substring exists, return an empty string.
The official constraints state that 2 <= s.length <= 3 * 10^4 and s consists only of lowercase English letters.
Input and Output
| Item | Meaning |
|---|---|
| Input | String s |
| Output | Longest duplicate substring (any valid answer accepted) |
Function shape:
def longestDupSubstring(s: str) -> str:
...Examples
Example 1:
s = "banana""ana" appears at indices 1 and 3.
Answer:
"ana"Example 2:
s = "abcd"No duplicate substring of length ≥ 1.
Answer:
""Key Insight
Use binary search on the length of the duplicate substring.
For a fixed length L, use Rabin-Karp rolling hash to check if any substring of length L appears at least twice.
Since “if length L has a duplicate then any shorter length also has a duplicate”, the problem is monotone and binary search applies.
Algorithm
- Binary search on length
Lfrom0ton-1. - For each
L, compute rolling hashes of all substrings of lengthL. - If any hash appears twice, record the substring as a candidate.
- Return the longest found candidate.
Correctness
Rolling hash detects duplicate substrings in O(n) per check (with small probability of false positives, but we verify matches). Binary search reduces the search space logarithmically.
Edge Cases
- Check the smallest and largest valid search values; off-by-one errors usually appear at the boundaries.
- Decide whether the predicate is looking for the first true value, the last true value, or an exact match.
- When the target is absent, the loop should still terminate and return the problem-specific failure value.
Complexity
| Metric | Value | Why |
|---|---|---|
| Time | O(n log n) | Binary search × rolling hash check |
| Space | O(n) | Hash set storage |
Common Pitfalls
- Do not mix inclusive and half-open bounds inside the same loop.
- Make sure each branch strictly reduces the search interval.
Implementation
class Solution:
def longestDupSubstring(self, s: str) -> str:
n = len(s)
nums = [ord(c) - ord('a') for c in s]
MOD = (1 << 61) - 1
BASE = 31
def check(L):
if L == 0:
return ""
h = 0
power = pow(BASE, L, MOD)
for i in range(L):
h = (h * BASE + nums[i]) % MOD
seen = {h: [0]}
for i in range(1, n - L + 1):
h = (h * BASE - nums[i - 1] * power + nums[i + L - 1]) % MOD
if h in seen:
for j in seen[h]:
if s[i:i+L] == s[j:j+L]:
return s[i:i+L]
seen[h].append(i)
else:
seen[h] = [i]
return ""
lo, hi = 0, n - 1
result = ""
while lo <= hi:
mid = (lo + hi) // 2
found = check(mid)
if found:
result = found
lo = mid + 1
else:
hi = mid - 1
return resultTesting
def run_tests():
s = Solution()
assert s.longestDupSubstring("banana") == "ana"
assert s.longestDupSubstring("abcd") == ""
assert s.longestDupSubstring("aa") == "a"
print("all tests passed")
run_tests()| Test | Expected | Why |
|---|---|---|
"banana" | "ana" | "ana" appears at two positions |
"abcd" | "" | No duplicates |
"aa" | "a" | Single char duplicate |