Linear Sorting and Order Statistics
Bart Massey
Review: Comparison Sorting is n lg n
Recall that comparison sorting requires Ω(n lg n) comparisons in the worst case
- A decision tree distinguishing the n! possible orderings must have depth at least lg(n!) >= lg((n/2)^(n/2)) = (n/2) lg(n/2), which is in Ω(n lg n)
We are only counting comparisons here
- Recall that selection sort requires O(n) swaps
- Maybe we can find a more powerful operation than comparison?
Motivation: Comparison sorting is not "natural"
- Normally consider values of items as well as pairwise comparisons
Preliminary: "Counting Sort"
Suppose that we are sorting objects whose keys are small non-negative integers.
Specifically, suppose that an array of size n is guaranteed to contain only keys in the range 0..k-1, where k is in O(n)
"Sort" the keys in linear time by just counting the number of occurrences of each, then spitting them back out in order
Can then use dynamic programming to compute the number of keys less than k in O(n) time
Can then swap the actual elements into place
Stable sort
Note: we quietly assume O(1) operations on lg k-bit keys (but comparison sorts make the same assumption)
Does constant-time array allocation help here?
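The counting-sort steps above can be sketched as follows (a Python sketch, assuming integer keys in range(k) and an optional key function so stability is observable):

```python
def counting_sort(a, k, key=lambda x: x):
    """Stably sort a by integer keys in range(k), in O(n + k) time."""
    # Count occurrences of each key value.
    count = [0] * k
    for x in a:
        count[key(x)] += 1
    # Prefix sums: count[v] becomes the number of keys less than v,
    # i.e. the first output position for key v.
    total = 0
    for v in range(k):
        count[v], total = total, total + count[v]
    # Place each element at its key's next output position, preserving
    # input order among equal keys (stability).
    out = [None] * len(a)
    for x in a:
        out[count[key(x)]] = x
        count[key(x)] += 1
    return out
```

For example, `counting_sort([3, 1, 0, 3, 1], 4)` yields `[0, 1, 1, 3, 3]`; sorting key-value pairs by key shows that equal keys keep their input order.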
Preliminary: "Bucket Sort"
Suppose that we are sorting objects drawn uniform-random from 1..n
Bucket sort is the obvious O(n) expected-case sort:
- Allocate an array of n "buckets"
- Put the keys near i into the i-th bucket
- Sort each bucket's entries with some simple O(n^2) sort, e.g. insertion sort
- Output everything in order
Expected complexity (pp. 202-204) is O(n)
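A sketch of the scheme above, assuming keys drawn uniformly from [0, 1) and scaled by n to pick a bucket (the 1..n integer case is analogous); Python's built-in sort stands in for the simple per-bucket sort:

```python
import random

def bucket_sort(a):
    """Expected O(n) sort, assuming keys uniform-random in [0, 1)."""
    n = len(a)
    buckets = [[] for _ in range(n)]
    for x in a:
        # Key x lands in bucket floor(n * x); uniform keys spread out
        # so each bucket holds O(1) elements in expectation.
        buckets[int(n * x)].append(x)
    out = []
    for b in buckets:
        b.sort()  # stand-in for the simple per-bucket sort
        out.extend(b)
    return out
```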
LSD Radix Sort
Let's examine these ideas one more time:
- Counting sort will let us stably sort on small sets
- Let's talk about the set of digits of a number
- If we counting-sort rightmost digits of a set of numbers:
- We can use stability to preserve this ordering
- We can start sorting on the next-left digit
- This is least-significant-digit radix sort
What radix to use? Take digits of b = floor(lg n) bits, i.e. a radix of roughly n, so each counting-sort pass runs in O(n)
Downside: a d-digit key means d full passes over the data, plus extra memory for the counts and output array on each pass
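Putting the pieces together, LSD radix sort is just repeated stable counting sort, one digit per pass. A sketch for non-negative integers, with a fixed digit width in bits (8 bits here, rather than the floor(lg n) tuning discussed above):

```python
def radix_sort(a, bits=8):
    """LSD radix sort of non-negative ints, `bits` bits per digit."""
    if not a:
        return a
    radix = 1 << bits
    mask = radix - 1
    shift = 0
    while max(a) >> shift:  # more digits remain
        # Stable counting sort on the current digit.
        count = [0] * radix
        for x in a:
            count[(x >> shift) & mask] += 1
        total = 0
        for d in range(radix):
            count[d], total = total, total + count[d]
        out = [0] * len(a)
        for x in a:
            out[count[(x >> shift) & mask]] = x
            count[(x >> shift) & mask] += 1
        a = out
        shift += bits
    return a
```

Stability of each pass is what lets the later (more significant) passes preserve the order established by the earlier ones.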
MSD Radix Sort
Not in the book?
Bucket sort recursively from leftmost to rightmost digit
Still a linear algorithm
Easy to "stop early" and use a conventional sort
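The recursive bucketing with early stopping can be sketched like this (Python sketch; the `cutoff` threshold for falling back to a conventional sort is an illustrative parameter, not from the notes):

```python
def msd_radix_sort(a, bits=8, shift=None, cutoff=16):
    """MSD radix sort: bucket on the leading digit, recurse per bucket."""
    if shift is None:
        # Position of the most significant digit of the largest key.
        shift = 0 if not a else max(max(a).bit_length() - 1, 0) // bits * bits
    if len(a) <= cutoff or shift < 0:
        return sorted(a)  # "stop early": fall back to a conventional sort
    radix = 1 << bits
    buckets = [[] for _ in range(radix)]
    for x in a:
        buckets[(x >> shift) & (radix - 1)].append(x)
    out = []
    for b in buckets:
        # Everything in a bucket shares its leading digits, so sorting
        # each bucket on the remaining digits sorts the whole array.
        out.extend(msd_radix_sort(b, bits, shift - bits, cutoff))
    return out
```

Unlike the LSD variant, no stability argument is needed: buckets are emitted in digit order, and recursion handles ties on the leading digit.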
Linear Time Order Statistics
Order Statistic = value at position i in sorted sequence
- Obvious way to compute this:
- sort the sequence
- look at the i-th position
For some order statistics, can do something better and/or simpler
Obvious O(n) algorithms for minimum (first), maximum (last)
- Cute variant for simultaneous min/max:
- compare elements in pairs, then compare the larger against max and the smaller against min
- special cases for odd n and for equal pairs
- goes from 2n-2 to ceil(3n/2)-2 comparisons
For any constant k, can find k-th in O(n) by doing at most k passes
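The pairwise min/max trick above can be sketched as follows (Python sketch): each pair costs one comparison against itself plus one each against the running min and max, so three comparisons cover two elements instead of four.

```python
def min_max(a):
    """Find (min, max) together in about ceil(3n/2) - 2 comparisons."""
    n = len(a)
    if n == 0:
        raise ValueError("empty sequence")
    # Seed from the first element (odd n) or the first pair (even n).
    if n % 2:
        lo = hi = a[0]
        start = 1
    else:
        lo, hi = (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])
        start = 2
    # Each remaining pair: 1 comparison within the pair, then the
    # smaller vs. lo and the larger vs. hi (3 total, not 4).
    for j in range(start, n, 2):
        x, y = (a[j], a[j + 1]) if a[j] < a[j + 1] else (a[j + 1], a[j])
        if x < lo:
            lo = x
        if y > hi:
            hi = y
    return lo, hi
```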
Randomized Select
Basically quicksort, but only has to explore the partition in which the i-th statistic lies
Achieves O(n) expected time for any i, when elements are distinct
- p. 217-219; understand this proof
Not very satisfying: want O(n) worst-case
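A sketch of randomized select (quickselect with a Lomuto-style partition, assuming distinct elements and 1-based rank i): after partitioning, only the side containing the i-th statistic is explored.

```python
import random

def randomized_select(a, i):
    """Return the i-th smallest (1-based) element; expected O(n)."""
    a = list(a)
    lo, hi = 0, len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        # Random pivot: no fixed input can force worst-case behavior.
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot = a[hi]
        q = lo
        for j in range(lo, hi):  # Lomuto partition around the pivot
            if a[j] < pivot:
                a[q], a[j] = a[j], a[q]
                q += 1
        a[q], a[hi] = a[hi], a[q]
        rank = q - lo + 1  # pivot's rank within a[lo..hi]
        if i == rank:
            return a[q]
        if i < rank:
            hi = q - 1  # statistic lies in the left partition
        else:
            i -= rank   # lies in the right partition; adjust rank
            lo = q + 1
```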
Linear Select
Modification of randomized select for distinct elements
O(n) worst-case
Fairly magic: instead of picking the pivot randomly, pick it to be the recursive median-of-medians of groups of 5 values
Why 5? Well, 5 is the minimum odd number that makes the math come out; 7, 9 etc work too
Basic analysis idea: at each step of the recursion, the partition searched recursively has at most (7n/10)+6 elements
- the resulting recurrence solves to O(n) (p. 222)
- thus select is linear
Detail: Linear Select
To find the i-th smallest element in a one-based array a of n elements:
if n ≤ 5
    sort a
    return a[i]
n′ ← ceil(n / 5)
b ← new array of n′ elements
j′ ← 1
for j ← 1 to n - 4 by 5s
    c ← a[j] .. a[j+4]
    b[j′] ← third element of sorted c
    j′ ← j′ + 1
m ← n mod 5
if m > 0
    c ← a[n-m+1] .. a[n]
    b[j′] ← element ceil(m / 2) of sorted c
p ← select element ceil(n′ / 2) from b
partition a around p
q ← index of p in the partitioned a
if i = q
    return a[q]
if i < q
    return select element i from a[1] .. a[q-1]
return select element i - q from a[q+1] .. a[n]
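The pseudocode above translates to Python directly (a sketch, assuming distinct elements and 1-based rank i; for brevity it partitions by building lists rather than in place):

```python
def select(a, i):
    """Worst-case O(n) selection of the i-th smallest (1-based),
    using the median-of-medians of groups of 5 as the pivot."""
    if len(a) <= 5:
        return sorted(a)[i - 1]
    # Medians of the groups of 5 (the tail group may be smaller).
    groups = [a[j:j + 5] for j in range(0, len(a), 5)]
    medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]
    # Pivot is the recursive median of the group medians.
    p = select(medians, (len(medians) + 1) // 2)
    # Partition around the pivot.
    less = [x for x in a if x < p]
    greater = [x for x in a if x > p]
    rank = len(less) + 1  # pivot's rank in a
    if i == rank:
        return p
    if i < rank:
        return select(less, i)
    return select(greater, i - rank)
```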
Analysis of Select
Because the elements are distinct, the median-of-medians pivot is greater than at least about (1/2)(3/5)n = 3n/10 of the elements, and less than at least 3n/10 of them, modulo some fiddling for the tail group; so the recursive call gets at most about 7n/10 elements
Thus, for large enough n, we get
T[n] <= T[ceil(n/5)] + T[7n/10 + 6] + O(n)
(the T[ceil(n/5)] term is the recursive median-of-medians call)
The Master Theorem doesn't apply, because there are two recursive calls of different sizes (plus the "+ 6"), so essentially re-prove by substitution for this case
End up with T[n] in O(n)