X-Git-Url: https://sipb.mit.edu/gitweb.cgi/ikiwiki.git/blobdiff_plain/15f9bb7ce5034ca9a6c813a0bf7ddb55d73c67fc..ef9bf2ea764bc9a77db720c07e612a4dce0460dc:/doc/todo/different_search_engine.mdwn

diff --git a/doc/todo/different_search_engine.mdwn b/doc/todo/different_search_engine.mdwn
index 592c159b3..9d0fc92c9 100644
--- a/doc/todo/different_search_engine.mdwn
+++ b/doc/todo/different_search_engine.mdwn
@@ -1,10 +1,78 @@
-After using it for a while, my feeling is that hyperestradier, as used in
+[[done]], using xapian-omega! --[[Joey]]
+
+After using it for a while, my feeling is that hyperestraier, as used in
 the [[plugins/search]] plugin, is not robust enough for ikiwiki. It doesn't
-upgrade well, and it has a habit of sig-11 on certian input from time to
+upgrade well, and it has a habit of sig-11 on certain input from time to
 time.
 
-So some other engine should be found and used instead. Enrico had one that
-he was using for debtags stuff that looked pretty good.
+So some other engine should be found and used instead.
+
+Enrico had one that he was using for debtags stuff that looked pretty good.
+That was [Xapian](http://www.xapian.org/), which has perl bindings in
+libsearch-xapian-perl. The nice thing about xapian is that it does a ranked
+search so it understands what words are most important in a search. (So
+does Lucene..) Another nice thing is it supports "more documents like this
+one" kind of search. --[[Joey]]
+
+## xapian
+
+I've investigated xapian briefly. I think a custom xapian indexer and use
+of omega for cgi searches could work well for ikiwiki. --[[Joey]]
+
+### indexer
+
+A custom indexer is needed because omindex isn't good enough for ikiwiki's
+needs for incremental rendering. (And because, since ikiwiki has page info
+in memory, it's silly to write it to disk and have omindex read it back.)
+
+The indexer would run as an ikiwiki hook. It needs to be passed the page
+name, and the content. Which hook to use is an open question.
+Possibilities:
+
+* `filter` - Since this runs before preprocess, only the actual text
+  written on the page would be indexed. Not text generated by directives,
+  pulled in by inlining, etc. There's something to be said for that. And
+  something to be said against it. It would also get markdown formatted
+  content, mostly, though it would still need to strip html, and also
+  probably strip preprocessor directives too.
+* `sanitize` - Would get the htmlized content, so would need to strip html.
+  Preprocessor directive output would be indexed. Doesn't get a destpage
+  parameter, making optimisation hard.
+* `format` - Would get the entire html page, including the page template.
+  Probably not a good choice as indexing the same template for each page
+  is unnecessary.
+
+The hook would remove any html from the content, and index it.
+It would need to add the same document data that omindex would.
+
+The indexer (and deleter) will need a way to figure out the ids in xapian
+of the documents to delete. One way is storing the id of each page in the
+ikiwiki index.
+
+The other way would be adding a special term to the xapian db that can be
+used with replace_document_by_term/delete_document_by_term.
+Hmm, let's use a term named "P".
+
+The hook should try to avoid re-indexing pages that have not changed since
+they were last indexed. One problem is that, if a page with an inline is
+built, every inlined item will get each hook run. And so a naive hook would
+index each of those items, even though none of them have necessarily
+changed. Date stamps are one possibility. Another would be to have
+the hook not do any indexing when `%preprocessing` is set (Ikiwiki.pm would
+need to expose that variable.) Another approach would be to use a
+needsbuild hook and only index the pages that are being built.
+
+#### cgi
+
+The cgi hook would exec omega to handle the searching, much as is done
+with estseek in the current search plugin.
+
+It would first set `OMEGA_CONFIG_FILE=.ikiwiki/omega.conf`; that omega.conf
+would set `database_dir=.ikiwiki/xapian` and probably also set a custom
+`template_dir`, which would have modified templates branded for ikiwiki. So
+the actual xapian db would be in `.ikiwiki/xapian/default/`.
+
+## lucene
 
 >> I've done a bit of prototyping on this. The current hip search library is [Lucene](http://lucene.apache.org/java/docs/). There's a Perl port called [Plucene](http://search.cpan.org/~tmtm/Plucene-1.25/). Given that it's already packaged, as `libplucene-perl`, I assumed it would be a good starting point. I've written a **very rough** patch against `IkiWiki/Plugin/search.pm` to handle the indexing side (there's no facility to view the results yet, although I have a command-line interface working). That's below, and should apply to SVN trunk.
 
@@ -17,6 +85,12 @@ he was using for debtags stuff that looked pretty good.
 
 >> If this seems a sensible approach, I'll write the CGI interface, and clean up the plugin. -- Ben
 
+>>> The weird thing about lucene is that these are all reimplementations of
+>>> it. Thank you java.. The C++ version seems like a better choice to me
+>>> (packages are trivial). --[[Joey]]
+
+> Might I suggest renaming the "search" plugin to "hyperestraier", and then creating new search plugins for different engines? No reason to pick a single replacement. --[[JoshTriplett]]
+
 Index: IkiWiki/Plugin/search.pm
 ===================================================================
@@ -52,30 +126,30 @@ Index: IkiWiki/Plugin/search.pm
 +  $PLUCENE_DIR = $config{wikistatedir}.'/plucene';  
 +}
 +
- sub import { #{{{
+ sub import {
 -       hook(type => "getopt", id => "hyperestraier",
--               call => \&getopt);
+-               call => \&getopt);
 -       hook(type => "checkconfig", id => "hyperestraier",
 +       hook(type => "checkconfig", id => "plucene",
-                call => \&checkconfig);
+                call => \&checkconfig);
 -       hook(type => "pagetemplate", id => "hyperestraier",
--               call => \&pagetemplate);
+-               call => \&pagetemplate);
 -       hook(type => "delete", id => "hyperestraier",
 +       hook(type => "delete", id => "plucene",
-                call => \&delete);
+                call => \&delete);
 -       hook(type => "change", id => "hyperestraier",
 +       hook(type => "change", id => "plucene",
-                call => \&change);
+                call => \&change);
 -       hook(type => "cgi", id => "hyperestraier",
--               call => \&cgi);
- } # }}}
+-               call => \&cgi);
+ }
  
--sub getopt () { #{{{
+-sub getopt () {
 -        eval q{use Getopt::Long};
 -       error($@) if $@;
 -        Getopt::Long::Configure('pass_through');
 -        GetOptions("estseek=s" => \$config{estseek});
--} #}}}
+-}
  
 +sub writer {
 +  init();
@@ -91,20 +165,20 @@ Index: IkiWiki/Plugin/search.pm
 +    grep { defined pagetype($_) } @_;
 +}
 +
- sub checkconfig () { #{{{
+ sub checkconfig () {
         foreach my $required (qw(url cgiurl)) {
                 if (! length $config{$required}) {
 @@ -36,112 +58,55 @@
         }
- } #}}}
+ }
  
 -my $form;
--sub pagetemplate (@) { #{{{
+-sub pagetemplate (@) {
 -       my %params=@_;
 -       my $page=$params{page};
 -       my $template=$params{template};
 +#my $form;
-+#sub pagetemplate (@) { #{{{
++#sub pagetemplate (@) {
 +#      my %params=@_;
 +#      my $page=$params{page};
 +#      my $template=$params{template};
@@ -119,7 +193,7 @@ Index: IkiWiki/Plugin/search.pm
 +#
 +#              $template->param(searchform => $form);
 +#      }
-+#} #}}}
++#}
  
 -       # Add search box to page header.
 -       if ($template->query(name => "searchform")) {
@@ -131,9 +205,9 @@ Index: IkiWiki/Plugin/search.pm
 -
 -               $template->param(searchform => $form);
 -       }
--} #}}}
+-}
 -
- sub delete (@) { #{{{
+ sub delete (@) {
 -       debug(gettext("cleaning hyperestraier search index"));
 -       estcmd("purge -cl");
 -       estcfg();
@@ -145,9 +219,9 @@ Index: IkiWiki/Plugin/search.pm
 +    $reader->delete_term( Plucene::Index::Term->new({ field => "id", text => $_ }));
 +  }
 +  $reader->close;
- } #}}}
+ }
  
- sub change (@) { #{{{
+ sub change (@) {
 -       debug(gettext("updating hyperestraier search index"));
 -       estcmd("gather -cm -bc -cl -sd",
 -               map {
@@ -176,9 +250,9 @@ Index: IkiWiki/Plugin/search.pm
 +    $doc->add(Plucene::Document::Field->UnStored('text' => $data));
 +    $writer->add_document($doc);
 +  }
- } #}}}
+ }
 -
--sub cgi ($) { #{{{
+-sub cgi ($) {
 -       my $cgi=shift;
 -
 -       if (defined $cgi->param('phrase') || defined $cgi->param("navi")) {
@@ -186,10 +260,10 @@ Index: IkiWiki/Plugin/search.pm
 -               chdir("$config{wikistatedir}/hyperestraier") || error("chdir: $!");
 -               exec("./".IkiWiki::basename($config{cgiurl})) || error("estseek.cgi failed");
 -       }
--} #}}}
+-}
 -
 -my $configured=0;
--sub estcfg () { #{{{
+-sub estcfg () {
 -       return if $configured;
 -       $configured=1;
 -
@@ -227,9 +301,9 @@ Index: IkiWiki/Plugin/search.pm
 -       unlink($cgi);
 -       my $estseek = defined $config{estseek} ? $config{estseek} : '/usr/lib/estraier/estseek.cgi';
 -       symlink($estseek, $cgi) || error("symlink $estseek $cgi: $!");
--} # }}}
+-}
 -
--sub estcmd ($;@) { #{{{
+-sub estcmd ($;@) {
 -       my @params=split(' ', shift);
 -       push @params, "-cl", "$config{wikistatedir}/hyperestraier";
 -       if (@_) {
@@ -249,7 +323,7 @@ Index: IkiWiki/Plugin/search.pm
 -               open(STDOUT, "/dev/null"); # shut it up (closing won't work)
 -               exec("estcmd", @params) || error("can't run estcmd");
 -       }
--} #}}}
+-}
 -
 -1
 +1;
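
As a rough illustration of the indexer notes in the xapian section of this diff, here is a minimal Perl sketch of the date-stamp idea for skipping unchanged pages, combined with the "P" term. The helper names `pageterm` and `pages_to_index`, the term layout ("P" followed by the page name), and the two hashes of timestamps are assumptions made for this example. None of them are ikiwiki or Xapian APIs, and no Xapian call is made here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a unique per-page term from the "P" prefix
# suggested in the notes. The exact layout is an assumption.
sub pageterm {
    my ($page) = @_;
    return "P" . $page;
}

# Hypothetical helper: pick the pages whose modification time is newer
# than the time they were last indexed, plus pages never indexed.
# Both hashes map page names to epoch seconds and stand in for state
# that ikiwiki would keep in its own index.
sub pages_to_index {
    my ($mtime, $indexed) = @_;
    my @todo = grep {
        !exists $indexed->{$_} || $mtime->{$_} > $indexed->{$_}
    } keys %$mtime;
    return sort @todo;
}

my %mtime   = (index => 100, "todo/search" => 250, sandbox => 300);
my %indexed = (index => 200, "todo/search" => 200);

for my $page (pages_to_index(\%mtime, \%indexed)) {
    # A real hook would hand this term to replace_document_by_term,
    # so the old document for the page is replaced rather than duplicated.
    print pageterm($page), "\n";
}
```

With a term like this, the deleter can call delete_document_by_term with the same string, so no xapian document ids need to be stored in the ikiwiki index.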