Friday, March 11, 2011

Just what kind of sort are you?

More fun with sorting...

In the last post I discussed how an incorrect type of sort (alphabetic vs. numeric) led to bad data. A more common concern related to sorting is performance, either from coding inefficiency or an inappropriate choice of algorithm for the application. This is a surprisingly complex subject, and I will just scratch the surface of it here.

The key characteristics of a sort algorithm are:
  • typical number of operations for various data sets (fully random to already sorted)
  • typical amount of data access (a swap is more expensive than a compare)
  • amount of memory usage (some algorithms work within the original array space, while others expand into additional, potentially large, work areas)
And so the things to consider when evaluating a sort algorithm for your application are:
  • the size of your data
  • whether your data is always random, or partially ordered
  • your available memory
Let's consider some popular algorithms. First the simple classics.

Bubble sort
Popular in introductory programming courses, probably because it is easy to understand and very easy to code. The basic method involves repeatedly looping through the list, comparing adjacent members and swapping them whenever the first is greater than the second, until a complete pass yields no swaps.

swapped = 1;
while (swapped) {
    swapped = 0;
    for ( i = 0; i < max - 1; i++ )      /* stop before the last element */
        if ( A[i] > A[i+1] ) {
            tmp = A[i];                  /* swap the out-of-order pair */
            A[i] = A[i+1];
            A[i+1] = tmp;
            swapped = 1;
        }
}

Memory efficient. Speed is not so bad for a small, mostly ordered list; bad for everything else. Performance is O(n^2), i.e., proportional to the square of the number of elements in the list. For the novice only.

Selection sort
Here, an outer loop through the list contains an inner loop which finds the smallest value amongst the remaining members, swapping it into the outer loop's position.

for ( i = 0; i < max - 1; i++ ) {
    min = i;                             /* index of smallest remaining value */
    for ( j = i + 1; j < max; j++ )
        if ( A[j] < A[min] )
            min = j;
    tmp = A[i];  A[i] = A[min];  A[min] = tmp;
}

Logical, consistent, memory efficient - but still O(n^2). For large data, you can do better.

Insertion sort
Creates a new list from the original by taking members from the first, one at a time, and inserting each in its sorted position in the second. Simple, uses some extra memory in this two-list form, but not efficient for large data.
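
A minimal sketch of the common in-place variant, which shifts elements within the array rather than building a second list:

for ( i = 1; i < max; i++ ) {
    key = A[i];                          /* next value to insert */
    for ( j = i - 1; j >= 0 && A[j] > key; j-- )
        A[j+1] = A[j];                   /* shift larger values right */
    A[j+1] = key;
}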


Now let's move on to the serious contenders.


Quicksort
Clever programmers love Quicksort, which uses a divide-and-conquer approach. It's probably your fastest choice for a large random dataset, with an average case of O(n*log(n)). So what's the catch? The worst case for the textbook algorithm, with its naive pivot choice, is a dataset that is already sorted, where performance creeps back to O(n^2).
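
A minimal sketch using the last element as the pivot (the Lomuto partitioning scheme); this naive pivot choice is exactly what makes presorted input degenerate, since every partition then comes out completely lopsided:

void quicksort( int A[], int lo, int hi ) {
    if ( lo >= hi ) return;                      /* 0 or 1 elements: done */
    int pivot = A[hi], p = lo;                   /* last element as pivot */
    for ( int i = lo; i < hi; i++ )
        if ( A[i] < pivot ) {                    /* smaller values go left */
            int tmp = A[i]; A[i] = A[p]; A[p] = tmp;
            p++;
        }
    int tmp = A[p]; A[p] = A[hi]; A[hi] = tmp;   /* pivot into final place */
    quicksort( A, lo, p - 1 );
    quicksort( A, p + 1, hi );
}

The initial call is quicksort(A, 0, max - 1). On sorted input every element lands on one side of the pivot, so the recursion runs n levels deep instead of log(n).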

But, hey, how often does that happen? And how bad can it be? Well, pretty often, and pretty bad. It's a common occurrence to have a small amount of new data added to a large ordered set, requiring a reshuffle to restore the order. This also brings to mind a situation I ran into some years back, working on a pretty compute-intensive circuit analysis program that tackled an NP-complete problem. (That means it could not be guaranteed to complete in a reasonable time.) A new analysis routine was introduced that we estimated could add maybe 10% additional processing.

Run-time doubled.

We eventually determined the reason was back-to-back sorts in the code. The new analysis required the data to be in order, so the developer began the new function with a sort - not realizing that the preceding function had modified the data and so ended with a sort to restore the order. The first sort was very fast. The second sort was more than 100X slower. Ironically, this same developer had implemented the quicksort function some months earlier, recognizing that performance would be crucial given our large datasets. The fix was a single comment symbol to remove the redundant sort. Now sensitized to the issue, we went on to remove another unnecessary sort, further improving performance.


Continuing on our survey...


Merge sort
Does a recursive divide-and-conquer sort and merge. While Quicksort may be a little faster on large random lists, Merge sort has no Achilles heel: performance is O(n log n) for both average and worst case, and natural-merge variants approach O(n) on nearly sorted lists. It does use some extra memory (an n-element work buffer, so 2n in total). All-in-all, a pretty darn good choice for most applications, and merge-based sorts are used in the standard library sort routines of a number of modern programming languages.
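
A minimal top-down sketch, assuming the caller supplies a work buffer tmp the same size as A (the extra memory noted above); ranges are half-open, so the initial call is merge_sort(A, tmp, 0, max):

void merge_sort( int A[], int tmp[], int lo, int hi ) {
    if ( hi - lo < 2 ) return;                   /* 0 or 1 elements: sorted */
    int mid = lo + (hi - lo) / 2;
    merge_sort( A, tmp, lo, mid );               /* sort each half... */
    merge_sort( A, tmp, mid, hi );
    int i = lo, j = mid, k = lo;
    while ( i < mid && j < hi )                  /* ...then merge them */
        tmp[k++] = ( A[i] <= A[j] ) ? A[i++] : A[j++];
    while ( i < mid ) tmp[k++] = A[i++];
    while ( j < hi )  tmp[k++] = A[j++];
    for ( k = lo; k < hi; k++ ) A[k] = tmp[k];   /* copy the merged run back */
}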

Heapsort
Combines a selection sort with a type of binary tree called a heap. Competitive with Quicksort and Merge sort for speed (performance is O(n log n) for average and worst case), it is a good choice when memory is tight.
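
A minimal sketch: build a max-heap in the array itself, then repeatedly swap the maximum to the end and restore the heap, using no memory beyond the original array:

void sift_down( int A[], int i, int n ) {        /* restore heap order below i */
    for ( int c = 2*i + 1; c < n; i = c, c = 2*i + 1 ) {
        if ( c + 1 < n && A[c+1] > A[c] ) c++;   /* pick the larger child */
        if ( A[i] >= A[c] ) break;
        int tmp = A[i]; A[i] = A[c]; A[c] = tmp;
    }
}

void heapsort( int A[], int n ) {
    for ( int i = n/2 - 1; i >= 0; i-- )         /* build the max-heap */
        sift_down( A, i, n );
    for ( int i = n - 1; i > 0; i-- ) {          /* move the max to the end */
        int tmp = A[0]; A[0] = A[i]; A[i] = tmp;
        sift_down( A, 0, i );                    /* re-heap the remainder */
    }
}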

Shell sort
An evolved combination of bubble and insertion sort that compares elements a gap apart, shrinking the gap with each pass. Generally quite efficient and easy to code, it's a good choice for many applications; the main gotcha is that performance depends heavily on the gap sequence chosen.
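
A minimal sketch using Shell's original gap sequence, halving each pass (better sequences exist, which is part of the gotcha mentioned above):

for ( gap = max / 2; gap > 0; gap /= 2 )         /* shrink the gap each pass */
    for ( i = gap; i < max; i++ ) {              /* gapped insertion sort */
        key = A[i];
        for ( j = i - gap; j >= 0 && A[j] > key; j -= gap )
            A[j + gap] = A[j];
        A[j + gap] = key;
    }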


Some endnotes:

  • Given the popularity of Quicksort, this vulnerability to presorted data is sometimes used in Denial of Service attacks.
  • Many versions (especially early ones) of the C library function qsort() exhibit this behavior as well. The function qsort() is not required to be implemented as a Quicksort, though it often is.
  • Some variations on the textbook algorithm can alleviate Quicksort's vulnerability to the presorted-data worst case; one common fix is sketched below.
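
One such variation is median-of-three pivot selection. A minimal sketch, written to pair with the quicksort example above: it moves the median of the first, middle, and last elements into the last slot, where the partition step picks up its pivot, so a presorted list now yields balanced partitions:

void median_of_three( int A[], int lo, int hi ) {
    int mid = lo + (hi - lo) / 2, t;
    if ( A[mid] < A[lo] )  { t = A[mid]; A[mid] = A[lo];  A[lo]  = t; }
    if ( A[hi]  < A[lo] )  { t = A[hi];  A[hi]  = A[lo];  A[lo]  = t; }
    if ( A[hi]  < A[mid] ) { t = A[hi];  A[hi]  = A[mid]; A[mid] = t; }
    t = A[mid]; A[mid] = A[hi]; A[hi] = t;       /* median becomes the pivot */
}

Call it at the top of quicksort(), just before partitioning.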

References:

There are a lot of great resources out there. A variety of sort and other algorithms are covered in detail in these books:

  • Robert Sedgewick, "Algorithms in C"
  • Mark Allen Weiss, "Data Structures and Algorithm Analysis"
  • Tenenbaum, Langsam, Augenstein, "Data Structures Using C" 


Monday, March 7, 2011

Sordid sorting

Here's a quick testing tip based on something I ran into this week. If you are testing an application that uses ordinal numbers, make sure your dataset:
  • crosses a decade boundary (e.g., 7, 8, 9, 10, 11, 12)
  • includes a 9, or numbers that have 9 as a digit
  • but the last / highest number should not contain a 9
Why? A common programming error is to use an ASCII sort when a numerical sort was intended. If you have the numbers 1 through 12, you want a sort to produce
{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

A numerical sort will give that. But an ASCII or alphabetical sort will produce
{ 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9}
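
A small illustration of the mismatch, assuming the values are handled as strings and sorted with the standard qsort(); the comparator names here are hypothetical:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int cmp_ascii( const void *a, const void *b ) {      /* lexicographic order */
    return strcmp( *(const char **)a, *(const char **)b );
}

int cmp_numeric( const void *a, const void *b ) {    /* numeric order; the
                            subtraction is safe for small values like these */
    return atoi( *(const char **)a ) - atoi( *(const char **)b );
}

int main( void ) {
    const char *v[] = { "1","2","3","4","5","6","7","8","9","10","11","12" };
    size_t i, n = sizeof v / sizeof v[0];
    qsort( v, n, sizeof v[0], cmp_ascii );           /* 1 10 11 12 2 ... 9 */
    for ( i = 0; i < n; i++ ) printf( "%s ", v[i] );
    printf( "\n" );
    qsort( v, n, sizeof v[0], cmp_numeric );         /* 1 2 3 ... 10 11 12 */
    for ( i = 0; i < n; i++ ) printf( "%s ", v[i] );
    printf( "\n" );
    return 0;
}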

If your dataset was 1 through 9, or 4203 through 4207, you would not notice a problem. In my case the data was sufficient to trigger the problem, which was especially apparent because a developer had put a sanity check in the code:

if ( current_version > highest_version )
    printf( "ERROR: current_version > highest_version\n" );

We had to track down what created the condition, which turned out to be the bad sort. But if not for the error message, who knows what odd behavior this erroneous data would have caused - and whether I would have caught it. Everybody makes mistakes; good developers work to catch them.

There is a lot that could be said about datasets for testing ... but there are a couple of interesting points on sorting, test, and performance issues that I would like to cover next time.