@@ -318,65 +318,104 @@ merging must be done as (A+B)+C or A+(B+C) instead.
  So merging is always done on two consecutive runs at a time, and in-place,
  although this may require some temp memory (more on that later).

- When a run is identified, its base address and length are pushed on a stack
- in the MergeState struct. merge_collapse() is then called to potentially
- merge runs on that stack. We would like to delay merging as long as possible
- in order to exploit patterns that may come up later, but we like even more to
- do merging as soon as possible to exploit that the run just found is still
- high in the memory hierarchy. We also can't delay merging "too long" because
- it consumes memory to remember the runs that are still unmerged, and the
- stack has a fixed size.
-
- What turned out to be a good compromise maintains two invariants on the
- stack entries, where A, B and C are the lengths of the three rightmost not-yet
- merged slices:
-
- 1. A > B+C
- 2. B > C
-
- Note that, by induction, #2 implies the lengths of pending runs form a
- decreasing sequence. #1 implies that, reading the lengths right to left,
- the pending-run lengths grow at least as fast as the Fibonacci numbers.
- Therefore the stack can never grow larger than about log_base_phi(N) entries,
- where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice
- for very large arrays.
-
- If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
- freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
- e.g., if the last 3 entries are
-
-     A:30 B:20 C:10
-
- then B is merged with C, leaving
-
-     A:30 BC:30
-
- on the stack. Or if they were
-
-     A:500 B:400 C:1000
-
- then A is merged with B, leaving
-
-     AB:900 C:1000
-
- on the stack.
-
- In both examples, the stack configuration after the merge still violates
- invariant #2, and merge_collapse() goes on to continue merging runs until
- both invariants are satisfied. As an extreme case, suppose we didn't do the
- minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
- and 2. Nothing would get merged until the final 2 was seen, and that would
- trigger 7 perfectly balanced merges.
-
- The thrust of these rules when they trigger merging is to balance the run
- lengths as closely as possible, while keeping a low bound on the number of
- runs we have to remember. This is maximally effective for random data,
- where all runs are likely to be of (artificially forced) length minrun, and
- then we get a sequence of perfectly balanced merges (with, perhaps, some
- oddballs at the end).
-
- OTOH, one reason this sort is so good for partly ordered data has to do
- with wildly unbalanced run lengths.
+ When a run is identified, its length is passed to found_new_run() to
+ potentially merge runs on a stack of pending runs. We would like to delay
+ merging as long as possible in order to exploit patterns that may come up
+ later, but we like even more to do merging as soon as possible to exploit
+ that the run just found is still high in the memory hierarchy. We also can't
+ delay merging "too long" because it consumes memory to remember the runs that
+ are still unmerged, and the stack has a fixed size.
+
+ The original version of this code used the first thing I made up that didn't
+ obviously suck ;-) It was loosely based on invariants involving the Fibonacci
+ sequence.
+
+ It worked OK, but it was hard to reason about, and was subtle enough that the
+ intended invariants weren't actually preserved. Researchers discovered that
+ when trying to complete a computer-generated correctness proof. That was
+ easily-enough repaired, but the discovery spurred quite a bit of academic
+ interest in truly good ways to manage incremental merging on the fly.
+
+ At least a dozen different approaches were developed, some provably having
+ near-optimal worst case behavior with respect to the entropy of the
+ distribution of run lengths. Some details can be found in bpo-34561.
+
+ The code now uses the "powersort" merge strategy from:
+
+     "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods
+     That Optimally Adapt to Existing Runs"
+     J. Ian Munro and Sebastian Wild
+
+ The code is pretty simple, but the justification is quite involved, as it's
+ based on fast approximations to optimal binary search trees, which are
+ substantial topics on their own.
+
+ Here we'll just cover some pragmatic details:
+
+ The `powerloop()` function computes a run's "power". Say two adjacent runs
+ begin at index s1. The first run has length n1, and the second run (starting
+ at index s1+n1, called "s2" below) has length n2. The list has total length n.
+ The "power" of the first run is a small integer, the depth of the node
+ connecting the two runs in an ideal binary merge tree, where power 1 is the
+ root node, and the power increases by 1 for each level deeper in the tree.
+
+ The power is the least integer L such that the "midpoint interval" contains
+ a rational number of the form J/2**L. The midpoint interval is the
+ semi-closed interval:
+
+     ((s1 + n1/2)/n, (s2 + n2/2)/n]
+
+ Yes, that's brain-busting at first ;-) Concretely, if (s1 + n1/2)/n and
+ (s2 + n2/2)/n are computed to infinite precision in binary, the power L is
+ the first position at which the 2**-L bit differs between the expansions.
+ Since the left end of the interval is less than the right end, the first
+ differing bit must be a 0 bit in the left quotient and a 1 bit in the right
+ quotient.
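+
+ A concrete example may help (made up here, not from the paper). Say n=16,
+ and two adjacent runs start at s1=0 with lengths n1=4 and n2=4 (so s2=4).
+ The midpoint quotients are (0 + 4/2)/16 = 0.0010... and (4 + 4/2)/16 =
+ 0.0110... in binary. The expansions first differ at the 2**-2 bit, so the
+ first run's power is 2. Equivalently, the midpoint interval (2/16, 6/16]
+ contains 4/16 = 1/2**2, but no fraction with denominator 2**1.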
+
+ `powerloop()` emulates these divisions, 1 bit at a time, using comparisons,
+ subtractions, and shifts in a loop.
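+
+ For illustration, here is a sketch in C of how such a bit-at-a-time
+ emulation can go. It follows the description above, but it's only a sketch;
+ the actual helper in listobject.c may differ in its exact details:
+
+     static int
+     powerloop(Py_ssize_t s1, Py_ssize_t n1, Py_ssize_t n2, Py_ssize_t n)
+     {
+         int result = 0;
+         /* The midpoints are a = s1 + n1/2 and b = s1 + n1 + n2/2, which
+          * may not be integers. So work with 2*a and 2*b instead, which
+          * are. That merely shifts every bit of the quotients' expansions
+          * one position, so doesn't change where they first differ.
+          */
+         Py_ssize_t a = 2 * s1 + n1;        /* 2*a */
+         Py_ssize_t b = a + n1 + n2;        /* 2*b */
+         /* Emulate a/n and b/n, 1 bit at a time, until the bits differ. */
+         for (;;) {
+             ++result;                      /* at bit position 2**-result */
+             if (a >= n) {                  /* both quotient bits are 1 */
+                 a -= n;
+                 b -= n;
+             }
+             else if (b >= n) {             /* a/n bit is 0, b/n bit is 1 */
+                 break;                     /* found the first difference */
+             }
+             /* else both quotient bits are 0 */
+             a <<= 1;                       /* bring up the next bit */
+             b <<= 1;
+         }
+         return result;
+     }
+
+ On the example above (s1=0, n1=4, n2=4, n=16) this starts with a=4 and
+ b=12, doubles them to 8 and 24 on the first trip around the loop, and
+ breaks on the second trip because b >= n while a < n: the returned power
+ is 2, as expected.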
+
+ You'll notice the paper uses an O(1) method instead, but that relies on two
+ things we don't have:
+
+ - An O(1) "count leading zeroes" primitive. We can find such a thing as a C
+   extension on most platforms, but not all, and there's no uniform spelling
+   on the platforms that support it.
+
+ - Integer division on an integer type twice as wide as needed to hold the
+   list length. For us the list length is a Py_ssize_t, and that's typically
+   already the widest native signed integer type the platform supports.
+
+ But since runs in our algorithm are almost never very short, the once-per-run
+ overhead of `powerloop()` seems lost in the noise.
+
+ Detail: why is Py_ssize_t "wide enough" in `powerloop()`? We do, after all,
+ shift integers of that width left by 1. How do we know that won't spill into
+ the sign bit? The trick is that we have some slop. `n` (the total list
+ length) is the number of list elements, which is at most 4 times (on a
+ 32-bit box, with 4-byte pointers) smaller than the largest size_t. On such
+ a box that means n < 2**30, so at least the leading two bits of the integers
+ we're using are clear, and doubling them can't reach the sign bit.
+
+ Since we can't compute a run's power before seeing the run that follows it,
+ the most recently identified run is never merged by `found_new_run()`.
+ Instead a new run is used only to compute the 2nd-most-recent run's power.
+ Then adjacent runs are merged so long as their saved power (tree depth) is
+ greater than that newly computed power. Only when found_new_run() returns
+ is the new run pushed onto the stack of pending runs.
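+
+ In outline, the policy looks like the following C sketch. It assumes,
+ purely for illustration, that pending runs are kept in an array of
+ {start, len, power} records hanging off the MergeState struct; the real
+ bookkeeping in listobject.c is more involved:
+
+     /* A new run of length n2 has just been found. Compute the power of
+      * the run currently on top of the stack (possible now that the run
+      * following it is known), and merge pending runs whose saved power
+      * exceeds it. Return 0 on success, -1 if a merge failed.
+      */
+     static int
+     found_new_run(MergeState *ms, Py_ssize_t n2)
+     {
+         if (ms->n) {                /* at least one run is pending */
+             struct s_slice *p = ms->pending;
+             Py_ssize_t s1 = p[ms->n - 1].start;  /* topmost pending run */
+             Py_ssize_t n1 = p[ms->n - 1].len;
+             int power = powerloop(s1, n1, n2, ms->listlen);
+             while (ms->n > 1 && p[ms->n - 2].power > power) {
+                 if (merge_at(ms, ms->n - 2) < 0)
+                     return -1;      /* e.g., a comparison raised */
+             }
+             p[ms->n - 1].power = power;
+         }
+         return 0;   /* the caller pushes the new run afterward */
+     }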
+
+ A key invariant is that powers on the run stack are strictly decreasing
+ (starting from the run at the top of the stack).
+
+ Note that even powersort's strategy isn't always truly optimal. It can't be.
+ Computing an optimal merge sequence can be done in time quadratic in the
+ number of runs, which is very much slower, and also requires finding &
+ remembering _all_ the runs' lengths (of which there may be billions) in
+ advance. It's remarkable, though, how close to optimal this strategy gets.
+
+ Curious factoid: of all the alternatives I've seen in the literature,
+ powersort's is the only one that's always truly optimal for a collection of 3
+ run lengths (for three lengths A B C, it's always optimal to first merge the
+ shorter of A and C with B).


  Merge Memory