Linear Sorting and Order Statistics

Linear Sorting and Order Statistics

Bart Massey


Review: Comparison Sorting is n lg n

  • Recall that comparison sorting requires a minimum of n lg n comparisons

    • Building a decision tree to distinguish n! possible orderings requires lg (n!) >= lg (n^n) = n lg n comparisons
  • We are only counting comparisons here

    • Recall that selection sort requires O(n) swaps
    • Maybe we can find a more powerful operation than comparison?
  • Motivation: Comparison sorting is not "natural"

    • Normally consider values of items as well as pairwise comparisons

Preliminary: "Counting Sort"

  • Suppose that we are sorting objects whose keys are small non-negative integers.

    • Specifically, suppose that an array of size n is guaranteed to have no more than k(n) keys in it, where k is in O(n)

    • "Sort" the keys in linear time by just counting the number of occurrences of each, then spitting them back out in order

    • Can then use dynamic programming to compute the number of keys less than k in O(n) time

    • Can then swap the actual elements into place

  • Stable sort

  • Note: disingenuous assumption about lg k costs (but also for comp sort)

  • Does constant-time array allocation help here?

Preliminary: "Bucket Sort"

  • Suppose that we are sorting objects drawn uniform-random from 1..n

  • Bucket sort is the obvious O(n) expected-case sort:

    • Allocate an array of n "buckets"
    • Dump keys around i in the i-th bucket
    • Sort each bucket's entries some n^2 way
    • Output everything in order
  • Expected complexity (p. 202-204) is O(n)

LSD Radix Sort

  • Let's examine these ideas one more time:

    • Counting sort will let us stably sort on small sets
    • Let's talk about the set of digits of a number
    • If we counting-sort rightmost digits of a set of numbers:
      • We can use stability to preserve this ordering
      • We can start sorting on the next-left digit
    • This is least-significant-digit radix sort
  • What radix to use? largest power of 2 < floor(lg n)

  • Memory efficiency and stuff is terrible (d-digit number = d passes)

MSD Radix Sort

  • Not in the book?

  • Bucket sort recursively from leftmost to rightmost digit

  • Still a linear algorithm

  • Easy to "stop early" and use a conventional sort

Linear Time Order Statistics

  • Order Statistic = value at position i in sorted sequence

    • Obvious way to compute this is
      • sort the sequence
      • look at the k-th position
  • For some order statistics, can do something better and/or simpler

    • Obvious O(n) algorithms for minimum (first), maximum (last)

      • Cute variant for simultaneous min/max:
        • compare pairs, then compare larger and smaller against max / min
        • special case for equal pairs
        • goes from 2n-2 to floor(3n/2)-2 comparisons
    • For any constant k, can find k-th in O(n) by doing at most k passes

Randomized Select

  • Basically quicksort, but only has to explore the partition in which the i-th statistic lies

  • Achieves O(n) expected time for any i, when elements are distinct

    • p. 217-219; understand this proof
  • Not very satisfying: want O(n) worst-case

Linear Select

  • Modification of randomized select for distinct elements

  • O(n) worst-case

  • Fairly magic: instead of picking the pivot randomly, pick it to be the recursive median-of-medians of groups of 5 values

    • Why 5? Well, 5 is the minimum odd number that makes the math come out; 7, 9 etc work too

    • Basic analysis idea: at each step in the recursion, at most (7n/10)+6 elements in the partition

      • recursion converges on amortized O(1) (p. 222)
      • thus select is linear

Detail: Linear Select

To find the i-th element in a one-based array of n elements:
    if n ≤ 5
        sort a
        return a[i]
    b ← new array of n′ = ceil(n/5) elements
    j′ ← 1
    c ← new array of 5 elements
    for j ← 1 to n - 4 by 5s
        c ← a[j]..a[j+4] 
        b[j′] ← select third element from c
        j′ ← j′ + 1
    m ← n mod 5
    c ← a[j]..a[m]
    b[j′] ← select element m div 2 from c
    m ← n′ div 2
    p ← select element m from b
    partition a around p
    if i = m
        return a[i]
    if i < m
        p ← select element i from a[1]..a[k]
        return p
    p ← select element i - m from a[m+1]..a[n]
    return p

Analysis of Select

  • Because the elements are distinct, each median-of-medians must be greater than the overall median by at most (1/2)(3n/5) = 3n/10 and less than by at most the same amount, except some fiddling for the tail

  • Thus, for large enough n, we get

    T[n] <= T[7n/10 + 6] + O(n)
    
  • The Master Theorem doesn't quite apply, because "+ 6", so essentially re-prove for this case

  • End up with T[n] in O(n)

Last modified: Monday, 12 May 2014, 4:08 PM