# Linear Sorting and Order Statistics

Bart Massey

### Review: Comparison Sorting is n lg n

• Recall that comparison sorting requires Ω(n lg n) comparisons in the worst case

• Building a decision tree to distinguish all n! possible orderings requires at least lg (n!) comparisons, and lg (n!) is in Ω(n lg n) (a short derivation follows this list)
• We are only counting comparisons here

• Recall that selection sort requires O(n) swaps
• Maybe we can find a more powerful operation than comparison?
• Motivation: Comparison sorting is not "natural"

• Normally consider values of items as well as pairwise comparisons
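
A quick way to see the lg (n!) bound without the full Stirling approximation: the largest half of the factors of n! are each at least n/2, so

$$
\lg(n!) \;=\; \sum_{j=1}^{n} \lg j \;\ge\; \sum_{j=\lceil n/2 \rceil}^{n} \lg j \;\ge\; \frac{n}{2}\,\lg\frac{n}{2} \;=\; \frac{n}{2}\,(\lg n - 1) \;\in\; \Omega(n \lg n)
$$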

### Preliminary: "Counting Sort"

• Suppose that we are sorting objects whose keys are small non-negative integers.

• Specifically, suppose that an array of size n is guaranteed to contain only keys drawn from a range of at most k(n) values, where k is in O(n)

• "Sort" the keys in linear time by just counting the number of occurrences of each, then spitting them back out in order

• Can then use a prefix-sum (dynamic programming) pass to compute, for each key value, the number of keys less than or equal to it, in O(n + k) time

• Can then copy the actual elements into their final positions in an output array (see the sketch after this list)

• Stable sort

• Note: there is a slightly disingenuous assumption that operations on lg k-bit keys cost O(1) (but comparison sorts make the same assumption)

• Does constant-time array allocation help here?
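
A minimal counting-sort sketch in Python, assuming integer keys in 0..k-1; the function name, signature, and `key` parameter are illustrative, not from the notes:

```python
def counting_sort(a, k, key=lambda x: x):
    """Stably sort `a`, whose keys are integers in 0..k-1, in O(n + k) time."""
    count = [0] * k
    for x in a:                      # count occurrences of each key value
        count[key(x)] += 1
    total = 0
    for v in range(k):               # prefix sums: first output slot for each key value
        count[v], total = total, total + count[v]
    out = [None] * len(a)
    for x in a:                      # copy elements to their slots, preserving input order
        out[count[key(x)]] = x
        count[key(x)] += 1
    return out
```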

### Preliminary: "Bucket Sort"

• Suppose that we are sorting objects whose keys are drawn uniformly at random from 1..n

• Bucket sort is the obvious O(n) expected-case sort:

• Allocate an array of n "buckets"
• Drop each key into the bucket corresponding to its value
• Sort each bucket's entries with some simple O(n^2) sort (e.g., insertion sort)
• Output the buckets in order
• Expected complexity (p. 202-204) is O(n); see the sketch after this list
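
A bucket-sort sketch in Python, assuming n keys roughly uniform over 1..n; the bucket-index formula and helper names are illustrative:

```python
def insertion_sort(b):
    """Simple O(len(b)^2) sort; fine because each bucket is expected O(1) long."""
    for i in range(1, len(b)):
        x, j = b[i], i - 1
        while j >= 0 and b[j] > x:
            b[j + 1] = b[j]
            j -= 1
        b[j + 1] = x

def bucket_sort(a):
    """Expected O(n) sort of n keys assumed roughly uniform over 1..n."""
    n = len(a)
    buckets = [[] for _ in range(n)]
    for x in a:
        i = min(n - 1, int(x) - 1)   # a key near value i + 1 lands in bucket i
        buckets[i].append(x)
    out = []
    for b in buckets:                # buckets are already in key order
        insertion_sort(b)
        out.extend(b)
    return out
```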

### Radix Sort

• Let's examine these ideas one more time:

• Counting sort will let us stably sort on small sets
• Let's talk about the set of digits of a number
• If we counting-sort a set of numbers on their rightmost digit:
• We can use stability to preserve this ordering
• We can then sort on the next digit to the left
• This is least-significant-digit radix sort
• What radix to use? Roughly 2^floor(lg n), i.e., the largest power of 2 not exceeding n

• Memory efficiency and constant factors are not great (a d-digit key means d full passes); see the sketch below
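
An LSD radix sort sketch that reuses the `counting_sort` sketch above as its stable per-digit pass; the decimal radix default and names are illustrative:

```python
def radix_sort_lsd(a, radix=10):
    """Sort non-negative integers with one stable counting-sort pass per digit,
    least-significant digit first."""
    if not a:
        return a
    place = 1
    while max(a) // place > 0:       # one pass per digit position
        a = counting_sort(a, radix,
                          key=lambda x, p=place: (x // p) % radix)
        place *= radix
    return a
```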

• Not in the book?

• Bucket sort recursively from leftmost to rightmost digit

• Still a linear algorithm

• Easy to "stop early" and finish each small bucket with a conventional sort (see the sketch below)
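
A most-significant-digit-first sketch: bucket on the current digit, recurse on the next digit to the right, and fall back to a conventional sort for small buckets; the cutoff and names are illustrative:

```python
def radix_sort_msd(a, place, radix=10, cutoff=16):
    """Sort non-negative integers by bucketing on the digit at `place`
    (a power of the radix), recursing toward less-significant digits."""
    if len(a) <= cutoff or place == 0:
        return sorted(a)             # "stop early" with a conventional sort
    buckets = [[] for _ in range(radix)]
    for x in a:
        buckets[(x // place) % radix].append(x)
    out = []
    for b in buckets:
        out.extend(radix_sort_msd(b, place // radix, radix, cutoff))
    return out
```

Call it with `place` set to the most significant digit position of the largest key, e.g. `radix_sort_msd(a, 10 ** (len(str(max(a))) - 1))` for decimal keys.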

### Linear Time Order Statistics

• Order Statistic = value at position i in sorted sequence

• Obvious way to compute this is
• sort the sequence
• look at the i-th position
• For some order statistics, can do something better and/or simpler

• Obvious O(n) algorithms for minimum (first), maximum (last)

• Cute variant for simultaneous min/max:
• compare each pair, then compare the larger against the running max and the smaller against the running min
• special case for equal pairs
• goes from 2n - 2 to ceil(3n/2) - 2 comparisons (see the sketch after this list)
• For any constant k, can find k-th in O(n) by doing at most k passes
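
A sketch of the pairwise min/max variant in Python, assuming a non-empty sequence; seeding the odd-length case from the first element is an illustrative choice:

```python
def min_max(a):
    """Find min and max simultaneously in about ceil(3n/2) - 2 comparisons."""
    n = len(a)
    if n == 0:
        raise ValueError("empty sequence")
    if n % 2:                        # odd n: seed both from the first element
        lo = hi = a[0]
        start = 1
    else:                            # even n: seed from the first pair (1 comparison)
        lo, hi = (a[1], a[0]) if a[0] > a[1] else (a[0], a[1])
        start = 2
    for i in range(start, n, 2):     # 3 comparisons per remaining pair
        x, y = (a[i + 1], a[i]) if a[i] > a[i + 1] else (a[i], a[i + 1])
        if x < lo:
            lo = x
        if y > hi:
            hi = y
    return lo, hi
```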

### Randomized Select

• Basically quicksort, but only has to recurse into the side of the partition in which the i-th order statistic lies (see the sketch below)

• Achieves O(n) expected time for any i, when elements are distinct

• p. 217-219; understand this proof
• Not very satisfying: want O(n) worst-case
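
A randomized-select sketch, assuming distinct elements and one-based i; list comprehensions replace in-place partitioning for clarity:

```python
import random

def randomized_select(a, i):
    """Return the i-th smallest (1-based) of a sequence of distinct elements,
    in O(n) expected time."""
    if len(a) == 1:
        return a[0]
    pivot = random.choice(a)         # random pivot, quicksort-style
    less = [x for x in a if x < pivot]
    more = [x for x in a if x > pivot]
    if i <= len(less):
        return randomized_select(less, i)
    if i == len(less) + 1:
        return pivot
    return randomized_select(more, i - len(less) - 1)
```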

### Linear Select

• Modification of randomized select for distinct elements

• O(n) worst-case

• Fairly magic: instead of picking the pivot randomly, pick it to be the recursive median-of-medians of groups of 5 values

• Why 5? Well, 5 is the minimum odd number that makes the math work out; 7, 9, etc. work too

• Basic analysis idea: at each step of the recursion, the recursive call on the larger side of the partition covers at most (7n/10) + 6 elements

• the recursion's total work solves to O(n), i.e. amortized O(1) per element (p. 222)
• thus select is linear

### Detail: Linear Select

To find the i-th smallest element of a one-based array a of n elements:
    if n ≤ 5
        sort a
        return a[i]
    b ← new array of n′ = ceil(n/5) elements
    j′ ← 1
    for j ← 1 to n - 4 by 5s
        c ← a[j]..a[j+4]
        b[j′] ← select element 3 (the median) from c
        j′ ← j′ + 1
    m ← n mod 5
    if m ≠ 0
        c ← a[j]..a[j+m-1]
        b[j′] ← select element (m + 1) div 2 from c
    p ← select element (n′ + 1) div 2 from b
    partition a around p; let q be the resulting position of p in a
    if i = q
        return a[i]
    if i < q
        return select element i from a[1]..a[q-1]
    return select element i - q from a[q+1]..a[n]
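
The same algorithm as a runnable Python sketch, assuming distinct elements and one-based i; list slices replace in-place partitioning for clarity, so constant factors are worse than the pseudocode suggests:

```python
def select(a, i):
    """Return the i-th smallest (1-based) of a list of distinct elements,
    in O(n) worst-case time via median-of-medians."""
    if len(a) <= 5:
        return sorted(a)[i - 1]
    # Median of each group of 5 (the last group may be shorter).
    groups = [a[j:j + 5] for j in range(0, len(a), 5)]
    medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]
    # The pivot is the recursive median of the medians.
    p = select(medians, (len(medians) + 1) // 2)
    less = [x for x in a if x < p]
    more = [x for x in a if x > p]
    if i <= len(less):
        return select(less, i)
    if i == len(less) + 1:
        return p
    return select(more, i - len(less) - 1)
```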


### Analysis of Select

• Because the elements are distinct, the median-of-medians is greater than at least 3 of the 5 elements in each of about half of the groups, i.e. at least (1/2)(3n/5) = 3n/10 elements, and likewise less than at least 3n/10 elements, minus some fiddling for the pivot's own group and the short tail group

• Thus, counting both the recursive median-of-medians call and the recursive call on the larger side of the partition, for large enough n we get

T[n] <= T[ceil(n/5)] + T[7n/10 + 6] + O(n)

• The Master Theorem doesn't apply here (two subproblems of unequal size, plus the "+ 6"), so we essentially re-prove linearity for this case by substitution (sketched below)

• End up with T[n] in O(n)
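
A sketch of the substitution argument, with an standing for the O(n) partitioning cost: guess T[n] <= cn and check

$$
\begin{aligned}
T(n) &\le T(\lceil n/5 \rceil) + T(7n/10 + 6) + an \\
     &\le c\lceil n/5 \rceil + c(7n/10 + 6) + an \\
     &\le cn/5 + c + 7cn/10 + 6c + an \\
     &= cn + (-cn/10 + 7c + an) \;\le\; cn,
\end{aligned}
$$

provided c is chosen large enough that cn/10 >= 7c + an for all n above some threshold (for example c >= 20a and n >= 140 works, as in the book's argument).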