doc/todo/smarter_sorting.mdwn

   1 I benchmarked a build of a large wiki (my home wiki), and it was spending
   2 quite a lot of time sorting; `CORE::sort` was called only 1138 times, but
   3 still flagged as the #1 time sink. (I'm not sure I trust NYTProf fully
   4 about that FWIW, since it also said 27238263 calls to `cmp_age` were
   5 the #3 timesink, and I suspect it may not entirely accurately measure
   6 the overhead of so many short function calls.)
   7
   8 `pagespec_match_list` currently always sorts *all* pages first, and then
   9 finds the top M that match the pagespec. That's innefficient when M is
  10 small (as for example in a typical blog, where only 20 posts are shown,
  11 out of maybe thousands).
  12
  13 As [[smcv]] noted, It could be flipped, so the pagespec is applied first,
  14 and then sort the smaller matching set. But, checking pagespecs is likely
  15 more expensive than sorting. (Also, influence calculation complicates
  16 doing that.)
  17
  18 Another option, when there is a limit on M pages to return, might be to
  19 cull the M top pages without sorting the rest. This could be done using
  20 a variant of Ye Olde Bubble Sort. Take the first M pages, and (quick)sort.
  21 Then for each of the rest, check if it is higher than the Mth page.
  22 If it is, bubble it up so it's sorted.
  23 If not, throw it out (that's the fast bit and why this is not O(N^2)).
  24
  25 > The patch below implements this, perhaps not as efficiently as possible.
  26 > It speeds up building just the top page of my blog by 1 second (out of
  27 > 18).
  28 >
  29 > But, I have not thought enough about influence calculation.
  30 > I need to figure out which pagespec matches influences need to be
  31 > accumulated for in order to determine all possible influences of a
  32 > pagespec are known.
  33 >
  34 > The old code accumulates influences from matching all successful pages
  35 > up to the num cutoff, as well as influences from an arbitrary (sometimes
  36 > zero) number of failed matches. New code does not accumulate influences
  37 > from all the top successful matches, only an arbitrary group of
  38 > successes and some failures.
  39
  40 <pre>
  41 diff --git a/IkiWiki.pm b/IkiWiki.pm
  42 index 1730e47..bc8b23d 100644
  43 --- a/IkiWiki.pm
  44 +++ b/IkiWiki.pm
  45 @@ -2122,36 +2122,54 @@ sub pagespec_match_list ($$;@) {
  46         my $num=$params{num};
  47         delete @params{qw{num deptype reverse sort filter list}};
  48
  49 -       # when only the top matches will be returned, it's efficient to
  50 -       # sort before matching to pagespec,
  51 -       if (defined $num && defined $sort) {
  52 -               @candidates=IkiWiki::SortSpec::sort_pages(
  53 -                       $sort, @candidates);
  54 -       }
  55 -
  56 +       # Find the first num matches (or all), before sorting.
  57         my @matches;
  58 -       my $firstfail;
  59         my $count=0;
  60         my $accum=IkiWiki::SuccessReason->new();
  61 -       foreach my $p (@candidates) {
  62 -               my $r=$sub->($p, %params, location => $page);
  63 +       my $i;
  64 +       for ($i=0; $i < @candidates; $i++) {
  65 +               my $r=$sub->($candidates[$i], %params, location => $page);
  66                 error(sprintf(gettext("cannot match pages: %s"), $r))
  67                         if $r->isa("IkiWiki::ErrorReason");
  68                 $accum |= $r;
  69                 if ($r) {
  70 -                       push @matches, $p;
  71 +                       push @matches, $candidates[$i];
  72                         last if defined $num && ++$count == $num;
  73                 }
  74         }
  75
  76 +       # We have num natches, but they may not be the best.
  77 +       # Efficiently find and add the rest, without sorting the full list of
  78 +       # candidates.
  79 +       if (defined $num && defined $sort) {
  80 +               @matches=IkiWiki::SortSpec::sort_pages($sort, @matches);
  81 +
  82 +               for ($i++; $i < @candidates; $i++) {
  83 +                       # Comparing candidate with lowest match is cheaper,
  84 +                       # so it's done before testing against pagespec.
  85 +                       if (IkiWiki::SortSpec::cmptwo($candidates[$i], $matches[-1], $sort) < 0 &&
  86 +                           $sub->($candidates[$i], %params, location => $page)
  87 +                       ) {
  88 +                               # this could be done less expensively
  89 +                               # using a binary search
  90 +                               for (my $j=0; $j < @matches; $j++) {
  91 +                                       if (IkiWiki::SortSpec::cmptwo($candidates[$i], $matches[$j], $sort) < 0) {
  92 +                                               splice @matches, $j, $#matches-$j+1, $candidates[$i],
  93 +                                                       @matches[$j..$#matches-1];
  94 +                                               last;
  95 +                                       }
  96 +                               }
  97 +                       }
  98 +               }
  99 +       }
 100 +
 101         # Add simple dependencies for accumulated influences.
 102 -       my $i=$accum->influences;
 103 -       foreach my $k (keys %$i) {
 104 -               $depends_simple{$page}{lc $k} |= $i->{$k};
 105 +       my $inf=$accum->influences;
 106 +       foreach my $k (keys %$inf) {
 107 +               $depends_simple{$page}{lc $k} |= $inf->{$k};
 108         }
 109
 110 -       # when all matches will be returned, it's efficient to
 111 -       # sort after matching
 112 +       # Sort if we didn't already.
 113         if (! defined $num && defined $sort) {
 114                 return IkiWiki::SortSpec::sort_pages(
 115                         $sort, @matches);
 116 @@ -2455,6 +2473,12 @@ sub sort_pages {
 117         sort $f @_
 118  }
 119
 120 +sub cmptwo {
 121 +       $a=$_[0];
 122 +       $b=$_[1];
 123 +       $_[2]->();
 124 +}
 125 +
 126  sub cmp_title {
 127         IkiWiki::pagetitle(IkiWiki::basename($a))
 128         cmp
 129 </pre>
 130
 131 This would be bad when M is very large, and particularly, of course, when
 132 there is no limit and all pages are being matched on. (For example, an
 133 archive page shows all pages that match a pagespec specifying a creation
 134 date range.) Well, in this case, it *does* make sense to flip it, limit by
 135 pagespe first, and do a (quick)sort second. (No influence complications,
 136 either.)
 137
 138 > Flipping when there's no limit implemented, and it knocked 1/3 off
 139 > the rebuild time of my blog's archive pages. --[[Joey]]
 140
 141 Adding these special cases will be more complicated, but I think the best
 142 of both worlds. --[[Joey]]