Linear Sorting and Order Statistics
Bart Massey
Review: Comparison Sorting is n lg n
Recall that comparison sorting requires Ω(n lg n) comparisons in the worst case
- A decision tree distinguishing the n! possible orderings must have depth at least lg(n!) >= lg((n/2)^(n/2)) = (n/2) lg(n/2), which is in Ω(n lg n)
We are only counting comparisons here
- Recall that selection sort requires O(n) swaps
- Maybe we can find a more powerful operation than comparison?
Motivation: Comparison sorting is not "natural"
- Normally consider values of items as well as pairwise comparisons
Preliminary: "Counting Sort"
Suppose that we are sorting objects whose keys are small non-negative integers.
Specifically, suppose that an array of size n is guaranteed to contain only keys in the range 0..k-1, where k is in O(n)
"Sort" the keys in linear time by just counting the number of occurrences of each, then spitting them back out in order
Can then use dynamic programming to compute the number of keys less than k in O(n) time
Can then swap the actual elements into place
Stable sort
Note: we quietly assume O(1) operations on lg k-bit keys (but comparison sorts make the same assumption)
Does constant-time array allocation help here?
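The counting-sort steps above can be sketched as follows (a Python sketch, assuming integer keys in range(k) and an optional key function so stability is observable):

```python
def counting_sort(a, k, key=lambda x: x):
    """Stably sort a by integer keys in range(k), in O(n + k) time."""
    # Count occurrences of each key value.
    count = [0] * k
    for x in a:
        count[key(x)] += 1
    # Prefix sums: count[v] becomes the number of keys less than v,
    # i.e. the first output position for key v.
    total = 0
    for v in range(k):
        count[v], total = total, total + count[v]
    # Place each element at its key's next output position, preserving
    # input order among equal keys (stability).
    out = [None] * len(a)
    for x in a:
        out[count[key(x)]] = x
        count[key(x)] += 1
    return out
```

For example, `counting_sort([3, 1, 0, 3, 1], 4)` yields `[0, 1, 1, 3, 3]`; sorting key-value pairs by key shows that equal keys keep their input order.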
Preliminary: "Bucket Sort"
Suppose that we are sorting objects drawn uniform-random from 1..n
Bucket sort is the obvious O(n) expected-case sort:
- Allocate an array of n "buckets"
- Put the keys near i into the i-th bucket
- Sort each bucket's entries with some simple O(n^2) sort, e.g. insertion sort
- Output everything in order
Expected complexity (pp. 202-204) is O(n)
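A sketch of the scheme above, assuming keys drawn uniformly from [0, 1) and scaled by n to pick a bucket (the 1..n integer case is analogous); Python's built-in sort stands in for the simple per-bucket sort:

```python
import random

def bucket_sort(a):
    """Expected O(n) sort, assuming keys uniform-random in [0, 1)."""
    n = len(a)
    buckets = [[] for _ in range(n)]
    for x in a:
        # Key x lands in bucket floor(n * x); uniform keys spread out
        # so each bucket holds O(1) elements in expectation.
        buckets[int(n * x)].append(x)
    out = []
    for b in buckets:
        b.sort()  # stand-in for the simple per-bucket sort
        out.extend(b)
    return out
```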
LSD Radix Sort
Let's examine these ideas one more time:
- Counting sort will let us stably sort on small sets
- Let's talk about the set of digits of a number
- If we counting-sort rightmost digits of a set of numbers:
- We can use stability to preserve this ordering
- We can start sorting on the next-left digit
- This is least-significant-digit radix sort
What radix to use? Take digits of b = floor(lg n) bits, i.e. a radix of roughly n, so each counting-sort pass runs in O(n)
Downside: a d-digit key means d full passes over the data, plus extra memory for the counts and output array on each pass
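Putting the pieces together, LSD radix sort is just repeated stable counting sort, one digit per pass. A sketch for non-negative integers, with a fixed digit width in bits (8 bits here, rather than the floor(lg n) tuning discussed above):

```python
def radix_sort(a, bits=8):
    """LSD radix sort of non-negative ints, `bits` bits per digit."""
    if not a:
        return a
    radix = 1 << bits
    mask = radix - 1
    shift = 0
    while max(a) >> shift:  # more digits remain
        # Stable counting sort on the current digit.
        count = [0] * radix
        for x in a:
            count[(x >> shift) & mask] += 1
        total = 0
        for d in range(radix):
            count[d], total = total, total + count[d]
        out = [0] * len(a)
        for x in a:
            out[count[(x >> shift) & mask]] = x
            count[(x >> shift) & mask] += 1
        a = out
        shift += bits
    return a
```

Stability of each pass is what lets the later (more significant) passes preserve the order established by the earlier ones.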
MSD Radix Sort
Not in the book?
Bucket sort recursively from leftmost to rightmost digit
Still a linear algorithm
Easy to "stop early" and use a conventional sort
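The recursive bucketing with early stopping can be sketched like this (Python sketch; the `cutoff` threshold for falling back to a conventional sort is an illustrative parameter, not from the notes):

```python
def msd_radix_sort(a, bits=8, shift=None, cutoff=16):
    """MSD radix sort: bucket on the leading digit, recurse per bucket."""
    if shift is None:
        # Position of the most significant digit of the largest key.
        shift = 0 if not a else max(max(a).bit_length() - 1, 0) // bits * bits
    if len(a) <= cutoff or shift < 0:
        return sorted(a)  # "stop early": fall back to a conventional sort
    radix = 1 << bits
    buckets = [[] for _ in range(radix)]
    for x in a:
        buckets[(x >> shift) & (radix - 1)].append(x)
    out = []
    for b in buckets:
        # Everything in a bucket shares its leading digits, so sorting
        # each bucket on the remaining digits sorts the whole array.
        out.extend(msd_radix_sort(b, bits, shift - bits, cutoff))
    return out
```

Unlike the LSD variant, no stability argument is needed: buckets are emitted in digit order, and recursion handles ties on the leading digit.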
Linear Time Order Statistics
Order Statistic = value at position i in sorted sequence
- Obvious way to compute this:
- sort the sequence
- look at the i-th position
For some order statistics, can do something better and/or simpler
Obvious O(n) algorithms for minimum (first), maximum (last)
- Cute variant for simultaneous min/max:
- compare elements in pairs, then compare the larger against max and the smaller against min
- special cases for odd n and for equal pairs
- goes from 2n-2 to ceil(3n/2)-2 comparisons
For any constant k, can find k-th in O(n) by doing at most k passes
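The pairwise min/max trick above can be sketched as follows (Python sketch): each pair costs one comparison against itself plus one each against the running min and max, so three comparisons cover two elements instead of four.

```python
def min_max(a):
    """Find (min, max) together in about ceil(3n/2) - 2 comparisons."""
    n = len(a)
    if n == 0:
        raise ValueError("empty sequence")
    # Seed from the first element (odd n) or the first pair (even n).
    if n % 2:
        lo = hi = a[0]
        start = 1
    else:
        lo, hi = (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])
        start = 2
    # Each remaining pair: 1 comparison within the pair, then the
    # smaller vs. lo and the larger vs. hi (3 total, not 4).
    for j in range(start, n, 2):
        x, y = (a[j], a[j + 1]) if a[j] < a[j + 1] else (a[j + 1], a[j])
        if x < lo:
            lo = x
        if y > hi:
            hi = y
    return lo, hi
```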
Randomized Select
Basically quicksort, but only has to explore the partition in which the i-th statistic lies
Achieves O(n) expected time for any i, when elements are distinct
- p. 217-219; understand this proof
Not very satisfying: want O(n) worst-case
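A sketch of randomized select (quickselect with a Lomuto-style partition, assuming distinct elements and 1-based rank i): after partitioning, only the side containing the i-th statistic is explored.

```python
import random

def randomized_select(a, i):
    """Return the i-th smallest (1-based) element; expected O(n)."""
    a = list(a)
    lo, hi = 0, len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        # Random pivot: no fixed input can force worst-case behavior.
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot = a[hi]
        q = lo
        for j in range(lo, hi):  # Lomuto partition around the pivot
            if a[j] < pivot:
                a[q], a[j] = a[j], a[q]
                q += 1
        a[q], a[hi] = a[hi], a[q]
        rank = q - lo + 1  # pivot's rank within a[lo..hi]
        if i == rank:
            return a[q]
        if i < rank:
            hi = q - 1  # statistic lies in the left partition
        else:
            i -= rank   # lies in the right partition; adjust rank
            lo = q + 1
```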
Linear Select
Modification of randomized select for distinct elements
O(n) worst-case
Fairly magic: instead of picking the pivot randomly, pick it to be the recursive median-of-medians of groups of 5 values
Why 5? Well, 5 is the minimum odd number that makes the math come out; 7, 9 etc work too
Basic analysis idea: at each step of the recursion, the partition searched recursively has at most (7n/10)+6 elements
- the resulting recurrence solves to O(n) (p. 222)
- thus select is linear
Detail: Linear Select
To find the i-th smallest element in a one-based array a of n elements:
if n ≤ 5
    sort a
    return a[i]
n′ ← ceil(n / 5)
b ← new array of n′ elements
j′ ← 1
for j ← 1 to n - 4 by 5s
    c ← a[j] .. a[j+4]
    b[j′] ← third element of sorted c
    j′ ← j′ + 1
m ← n mod 5
if m > 0
    c ← a[n-m+1] .. a[n]
    b[j′] ← element ceil(m / 2) of sorted c
p ← select element ceil(n′ / 2) from b
partition a around p
q ← index of p in the partitioned a
if i = q
    return a[q]
if i < q
    return select element i from a[1] .. a[q-1]
return select element i - q from a[q+1] .. a[n]
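The pseudocode above translates to Python directly (a sketch, assuming distinct elements and 1-based rank i; for brevity it partitions by building lists rather than in place):

```python
def select(a, i):
    """Worst-case O(n) selection of the i-th smallest (1-based),
    using the median-of-medians of groups of 5 as the pivot."""
    if len(a) <= 5:
        return sorted(a)[i - 1]
    # Medians of the groups of 5 (the tail group may be smaller).
    groups = [a[j:j + 5] for j in range(0, len(a), 5)]
    medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]
    # Pivot is the recursive median of the group medians.
    p = select(medians, (len(medians) + 1) // 2)
    # Partition around the pivot.
    less = [x for x in a if x < p]
    greater = [x for x in a if x > p]
    rank = len(less) + 1  # pivot's rank in a
    if i == rank:
        return p
    if i < rank:
        return select(less, i)
    return select(greater, i - rank)
```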
Analysis of Select
Because the elements are distinct, the median-of-medians pivot is greater than at least about (1/2)(3/5)n = 3n/10 of the elements, and less than at least 3n/10 of them, modulo some fiddling for the tail group; so the recursive call gets at most about 7n/10 elements
Thus, for large enough n, we get
T[n] <= T[ceil(n/5)] + T[7n/10 + 6] + O(n)
(the T[ceil(n/5)] term is the recursive median-of-medians call)
The Master Theorem doesn't apply, because there are two recursive calls of different sizes (plus the "+ 6"), so essentially re-prove by substitution for this case
End up with T[n] in O(n)