Sebastian Wild
/ Fachbereich Informatik, TU Kaiserslautern
/ wild@cs.uni-kl.de

Ref:
https://arxiv.org/abs/1511.01138

I discuss the new dual-pivot Quicksort that is nowadays used to sort arrays of primitive types in Java. I sketch theoretical analyses of this algorithm that offer a possible, and in my opinion plausible, explanation why (a) dual-pivot Quicksort is faster than the previously used (classic) Quicksort and (b) why this improvement was not already found much earlier.

Quicksort became popular shortly after its original presentation by Hoare [7] and many authors contributed variations of and implementation tricks for the basic algorithm. From a practical point of view, the most notable improvements appear in [17, 15, 2, 14].

After the improvements to Quicksort in the 1990’s, almost all programming libraries used almost identical versions of the algorithm: the classic Quicksort implementation had reached calm waters.

It was not until 2009, over a decade later, that previously unknown Russian developer Vladimir Yaroslavskiy caused a sea change upon releasing the outcome of his free-time experiments to the public: a dual-pivot Quicksort implementation that clearly outperforms the classic Quicksort in Oracle’s Java 6. The core innovation is the arguably natural ternary partitioning algorithm given to the right.

Yaroslavskiy’s ﬁnding was so surprising that people were initially reluctant to believe him, but his Quicksort has ﬁnally been deployed to millions of devices with the release of Java 7 in 2011.

How could this substantial improvement to the well-studied Quicksort algorithm escape the eyes of researchers around the world for nearly 50 years?

For programs as heavily used as library sorting methods, it is advisable to back up experimental data with mathematically proven properties. The latter consider however only a *model* of reality, which may or may not reﬂect accurately enough the behavior of actual machines.

The answer to above question is in part a tale of the pitfalls of theoretical models, so we start with a summary of the mathematical analysis of Quicksort and the underlying model in Section 2. We then brieﬂy discuss the “memory wall” metaphor and its implications for Quicksort in Section 3, and ﬁnally propose an alternative model in Section 4 that offers an explanation for the superiority of Yaroslavskiy’s Quicksort.

The classical model for the analysis of sorting algorithm considers the average number of key comparisons on random permutations. Quicksort has been extensively studied under this model, including variations like choosing the pivot as median of a sample [7, 4, 15, 10, 3]: Let \(c_n\) denote the expected number of comparisons used by classic Quicksort (as given in [16]), when each pivot is chosen as median of a sample of \(2t+1\) random elements. \(c_n\) fulﬁlls the recurrence

$$ \begin{equation}
c_n = n - 1 + \sum_{\substack{0 \leq j_1, j_2 \leq n-1 \\ j_1 + j_2= n-1}} {\frac{\left(\begin{array}{ccc}j_1 \\ t\end{array}\right)\left(\begin{array}{ccc}j_2 \\ t\end{array}\right)}{\left(\begin{array}{ccc}n \\ 2t+1\end{array}\right)}\left( c_{j_1} + c_{j_2} \right)}
\tag{1} \end{equation} $$

since \(n − 1\) comparisons are needed in the ﬁrst partitioning step, and we have two recursive calls of random sizes, where the probability to have sizes \(j_1\) and \(j_2\) is given by the fraction of binomials (see [9] for details). This recurrence can be solved asymptotically [4, 15] to

$$ \begin{equation}
c_n \sim \frac{1}{H_{2(t+1)}-H_{t+1}} \cdot n \ln{n} ,
\end{equation} $$

where \( H_n = \sum_{i=1}^{n} {1/i} \) is the \(n\)th harmonic number and \(f(n) \sim g(n)\) means \(\lim_{n \rightarrow \infty } f(n)/g(n) = 1\). The mathematical details are beyond the scope of this abstract, but a rather elementary derivation is possible [10]. Large values of \(t\) are impractical; a good compromise in practice is given by the “ninther”, the median of three medians, each chosen from three elements [2]. This scheme can be analyzed similarly to the above [3].

The author generalized Equation (1) to Yaroslavskiy’s Quicksort [19, 18, 9]. Note that unlike for classic Quicksort, the comparison count of Yaroslavskiy’s partitioning depends on pivot values, so its expected value has to be computed over the choices for the pivots. We obtain for tertiles-of-\((3t+2)\)

$$ \begin{equation}
c_n = \left( \frac{5}{3} - \frac{1}{9t+12} \right) \cdot n + O(1) + \sum_{\substack{0 \leq j_1,j_2,j_3 \leq n-2 \\ j_1+j_2+j_3=n-2}}
{
\frac
{ \left( \begin{array}{ccc} j_1 \\ t \end{array} \right) \left( \begin{array}{ccc} j_2 \\ t \end{array} \right) \left( \begin{array}{ccc} j_3 \\ t \end{array} \right) }
{ \left( \begin{array}{ccc} n \\ 3t+2 \end{array} \right) }
\cdot
\left( c_{j_1}+c_{j_2}+c_{j_3} \right)
};
\tag{2} \end{equation} $$

with solution

$$ \begin{equation}
c_n \sim \frac{ \frac{5}{3} - \frac{1}{9t+12} } { H_{3(t+1)} - H_{t+1} } \cdot n \ln{n}.
\end{equation} $$

Oracle’s Java runtime library previously used classic Quicksort with ninther, and now uses Yaroslavskiy’s Quicksort with tertiles-of-ﬁve; the average number of comparisons are asymptotically \(1.5697 n \ln{n}\) vs. \(1.7043 n \ln{n}\). According to the comparison model, Yaroslavskiy’s algorithm is signiﬁcantly *worse* than classic Quicksort! Moreover, this result does not change qualitatively if we consider *all* primitive instructions of a machine instead of only comparisons [19, 9]. It is thus not surprising that researchers found the use of two pivots not promising [15, 6].

But if Yaroslavskiy’s algorithm actually uses more comparisons and instructions, how comes it is still faster in practice? And why was this discrepancy between theory and practice not noticed earlier? The reason is most likely related to a phenomenon known as the “memory wall” [20, 13] or the “von-Neumann bottleneck” [1]: Processor speed has been growing considerably faster than memory bandwidth for a long time.

Based on the extensive data for the STREAM benchmark [12, 11], CPU speed has increased with an average annual growth rate of 46% over the last 25 years, whereas memory bandwidth, the amount of data transferable between RAM and CPU in a given amount of time, has increased by 37% per year. Even though one should not be too strict about the exact numbers as they are averages over very different architectures, a signiﬁcant increase in *imbalance* is undeniable. Figure 1 gives a direct quantitative view of this trend.

If the imbalance between CPU and memory transfer speed continues to grow exponentially, at some point in the future any further improvements of CPUs will be futile: the processor is waiting for data all the time; we hit a “memory wall”.

It is debatable if and when this extreme will be reached [5, 13], and consequences certainly depend on the application. In any case, however, the (average) relative costs of memory accesses have increased signiﬁcantly over the last 25 years.

So when Quicksort with two pivots was ﬁrst studied, researchers correctly concluded that it does not pay off. But computers have changed since then, and so must our models.

Our new cost model for sorting counts the number of “*scanned elements*”. An element scan is essentially an accesses “\(A[i]\)” to the input array \(A\), but we count all accesses as one that use the same index variable \(i\) and the same value for \(i\). For example, a linear scan over \(A\) entails \(n\) scanned elements, and several interleaved scans (with different index variables) cost the traveled distance, summed up over all indices, even when the scanned ranges overlap. We do not distinguish read and write accesses.

We claim that for algorithms built on potentially interleaved sequential scans, in particular for classic and dual-pivot Quicksort, the number of scanned elements is asymptotically proportional to the amount of data transfered between CPU and main memory [9].

Scanned elements are related to cache misses [8], but the latter is a machine-dependent quantity, whereas the former is a clean, abstract cost measure that is easy to analyze: One partitioning step of classic Quicksort scans \(A\) exactly once, resulting in \(n\) scanned elements. In Yaroslavskiy’s partitioning, indices \(k\) and \(g\) together scan \(A\) once, but index \(l\) scans the leftmost segment a second time. On average, the latter contains a third of all elements, yielding \( \frac{4}{3} n \) scanned elements in total.

Using these in recurrences (1) resp. (2) yields \(1.5697 n \ln{n}\) vs. \(1.4035 n \ln(n)\) scanned elements; the Java 7 Quicksort saves 12% of the element scans over the version in Java 6, which matches the roughly 10% speedup observed in running time studies.

Memory speed has not fully kept pace with improvements in processing power. This growing imbalance forces us to economize on memory accesses in algorithms that were almost entirely CPU-bound in the past, and calls for new cost models for the analysis of algorithms. For sorting algorithms that build on sequential scans over their input, the proposed “scanned elements” counts serve as such a model and give a good indication of the amount of memory trafﬁc caused by an algorithm. It is exactly this data trafﬁc where dual-pivot outclasses classic Quicksort, offering a plausible explanation for its superiority in practice.

- J. BACKUS, Can Programming Be Liberated from the Von Neumann Style? A Functional Style and Its Algebra of Programs. Communications of the ACM 21 (1978) 8, 613–641.
- J. L. B ENTLEY, M. D. M C I LROY, Engineering a sort function. Software: Practice and Experience 23 (1993) 11, 1249–1265.
- M. D URAND, Asymptotic analysis of an optimized quicksort algorithm. Information Processing Letters 85 (2003) 2, 73–77.
- M. H. VAN E MDEN, Increasing the efﬁciency of quicksort. Communications of the ACM (1970), 563–567.
- M. A. E RTL, The Memory Wall Fallacy. 2001. https://www.complang.tuwien.ac.at/anton/memory-wall.html
- P. H ENNEQUIN, Analyse en moyenne d’algorithmes : tri rapide et arbres de recherche. PhD Thesis, Ecole Politechnique, Palaiseau, 1991.
- C. A. R. H OARE, Quicksort. The Computer Journal 5 (1962) 1, 10–16.
- S. K USHAGRA, A. L ÓPEZ -O RTIZ, A. Q IAO, J. I. M UNRO, Multi-Pivot Quicksort: Theory and Experiments. In: ALENEX 2014. SIAM, 2014, 47–60.
- C. M ARTÍNEZ, M. E. N EBEL, S. W ILD, Analysis of Pivot Sampling in Dual-Pivot Quicksort. Algorithmica (2015). http://arxiv.org/abs/1412.0193
- C. M ARTÍNEZ, S. ROURA, Optimal Sampling Strategies in Quicksort and Quickselect. SIAM Journal on Computing 31 (2001) 3, 683–705.
- J. D. M C C ALPIN, STREAM: Sustainable Memory Bandwidth in High Performance Computers. Technical report, University of Virginia, Charlottesville, Virginia, 1991-2007. Continually updated technical report. http://www.cs.virginia.edu/~mccalpin/papers/bandwidth/bandwidth.html
- J. D. M C C ALPIN, Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (1995), 19–25. http://www.cs.virginia.edu/~mccalpin/papers/balance/index.html
- S. A. M C K EE, Reﬂections on the Memory Wall. In: Proceedings of the ﬁrst conference on computing frontiers. 2004, 162–167.
- D. R. M USSER, Introspective Sorting and Selection Algorithms. Software: Practice and Experience 27 (1997) 8, 983–993.
- R. S EDGEWICK, Quicksort. Ph. D. Thesis, Stanford University, 1975.
- R. S EDGEWICK, Implementing Quicksort programs. Communications of the ACM 21 (1978) 10, 847–857.
- R. C. S INGLETON, Algorithm 347: an efﬁcient algorithm for sorting with minimal storage [M1]. Communications of the ACM 12 (1969) 3, 185–186.
- S. W ILD, Java 7’s Dual Pivot Quicksort. Master thesis, University of Kaiserslautern, 2012. http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:hbz:386-kluedo-34638
- S. W ILD, M. E. N EBEL, Average Case Analysis of Java 7’s Dual Pivot Quicksort. In: L. E PSTEIN, P. F ERRAGINA (eds.), ESA 2012. LNCS 7501, Springer, 2012, 825–836. http://arxiv.org/abs/1310.7409
- W. A. W ULF, S. A. M C K EE, Hitting the Memory Wall: Implications of the Obvious. 1995.