SWISH::Split - Perl interface to split index variant of Swish-e

-= SWISH::Split perl module with test and essential documentation
WARNING: this is an experimental version. It requires swish-e Perl bindings compiled with --enable-incremental.
SWISH-Split-0.03.tar.gz 8.5 Kb
-=> Latest source is always available from Subversion repository

NAME

SWISH::Split - Perl interface to split index variant of Swish-e

SYNOPSIS

  use SWISH::Split;

DESCRIPTION

This is an alternative interface for indexing data with swish-e. It's designed to split an index over multiple files (slices), so that records can be updated by re-indexing only the slices which contain changes.

Data is stored in the index using an interface which is somewhat similar to the Plucene::Simple manpage. This could make your migration (or supporting two index engines) easier.

In the background, it forks swish-e binaries (one for each index slice) and produces UTF-8 encoded XML files for them. So, if your input charset isn't ISO-8859-1, you will have to specify it (see the codepage option to open_index).
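The recoding step mentioned above can be illustrated with the core Encode module. This is only a sketch of the general technique, not necessarily how SWISH::Split does it internally; the helper name recode_to_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: recode a byte string from the given codepage to UTF-8.
sub recode_to_utf8 {
    my ($bytes, $codepage) = @_;
    $codepage ||= 'ISO-8859-1';              # default named in this documentation
    my $chars = decode($codepage, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);          # characters -> UTF-8 bytes
}

# 0xE8 is "c with caron" in ISO-8859-2 (U+010D)
my $utf8 = recode_to_utf8("\xE8", 'ISO-8859-2');
print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";   # prints: C4 8D
```

The same input byte read as ISO-8859-1 would be a different character, which is why the codepage has to be specified when it isn't the default.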

Methods used for indexing

open_index

Create new object for index.

  my $i = SWISH::Split->open_index({
        index => '/path/to/index',
        slice_name => \&slice_on_path,
        slices => 30,
        merge => 0,
        codepage => 'ISO-8859-2',
        swish_config => qq{
                PropertyNames from date
                PropertyNamesDate date
        },
        memoize_to_xml => 0,
  );
  # slice on the first component of the path
  sub slice_on_path {
        my ($first) = split(/\//, $_[0]);
        return $first;
  }

The options to open_index are as follows:

index

path to (existing) directory in which index slices will be created.

slice_name

coderef to a function which returns the slice name for a given path.

slices

maximum number of index slices. See in_slice for more explanation.

merge

(planned) option to merge all index slices into one index at the end.

codepage

data codepage (needed for conversion to UTF-8). By default, it's ISO-8859-1.

swish_config

additional parameters which will be inserted into the swish-e configuration file. See swish-config.

memoize_to_xml

speed up indexing of repetitive data; see to_xml.
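As an illustration of the slice_name option, here is a minimal sketch of a slicing function that can be run on its own. The function name and the 'default' fallback are assumptions made for this example, not part of SWISH::Split.

```perl
use strict;
use warnings;

# Hypothetical slicing function: use the first non-empty path component
# as the slice name, falling back to 'default' when there is none.
sub slice_on_first_component {
    my ($path) = @_;
    my ($first) = grep { length } split m{/}, $path;
    return defined $first ? $first : 'default';
}

print slice_on_first_component('news/2004/item.html'), "\n";   # prints: news
print slice_on_first_component('/docs/readme.txt'), "\n";      # prints: docs
```

A reference to such a function (\&slice_on_first_component) could then be passed as slice_name. Note that a path component named "0" would yield a false slice name, which in_slice warns against.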

add

Add document to index.

  $i->add($swishpath, {
        headline => 'foobar result',
        property => 'data',
  });

delete

Delete documents from index.

  $i->delete(@swishpath);

This function is not implemented.

done

Finish indexing and close index file(s).

  $i->done;

This is the most time-consuming operation. When it's called, it will re-index all entries which haven't changed in all slices.

Returns number of slices updated.

This method should really be called close or finish, but both of those are already used.

Reporting methods

These methods return statistics about your index.

swishpaths

Return array of swishpaths in index.

  my @p = $i->swishpaths;

swishpaths_updated

Return array with updated swishpaths.

  my @d = $i->swishpaths_updated;

swishpaths_deleted

Return array with deleted swishpaths.

  my @n = $i->swishpaths_deleted;

slices

Return array with all slice names.

  my @s = $i->slices;

Helper methods

These methods are used internally, but they might be useful.

in_slice

Takes a path and returns the slice to which this path belongs.

  my $s = $i->in_slice('path/to/document/in/index');

If the slices parameter was given to open_index, it will use an MD5 hash to spread documents across slices. That will produce a random distribution of your documents in slices, which might or might not be best for your data. If you have to re-index a large number of slices on each run, think about creating your own slice function and distributing documents manually across slices.

The slice number must always be a true value, or various sanity checks will fail.

This function is memoized (using Memoize) for performance reasons.
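The idea of spreading paths across a fixed number of slices with an MD5 hash can be sketched as follows. This is an assumption about the general technique, not a copy of the module's code; the helper name md5_slice and the exact way the digest is reduced to a number are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: map a path to a slice number in 1..$slices.
sub md5_slice {
    my ($path, $slices) = @_;
    # Take the first 8 hex digits (32 bits) of the digest as an integer.
    my $n = hex(substr(md5_hex($path), 0, 8));
    # Add 1 so that the result is never 0 (a false value in Perl).
    return ($n % $slices) + 1;
}

# Count how many of 3000 sample paths land in each of 30 slices.
my %count;
$count{ md5_slice("doc/$_.html", 30) }++ for 1 .. 3000;
printf "slice %2d: %d documents\n", $_, $count{$_} for sort { $a <=> $b } keys %count;
```

The same path always maps to the same slice, but related paths end up scattered, which is why a custom slice function may mean fewer slices to re-index when changes are clustered.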

find_paths

Return array of swishpaths for given swish-e query.

  my @p = $i->find_paths("headline=test*");

Useful in combination with delete to remove documents which haven't changed for a while (and so have expired).

make_config

Create swish-e configuration file for given slice.

  my $config_filename = $i->make_config('slice name');

It returns the configuration filename. If no swish_config was defined in open_index, the default swish-e configuration will be used. It indexes all data for searching, but stores none as properties.

If you want to see what is already defined for swish-e in the configuration, take a look at the source code for DEFAULT_SWISH_CONF.

It uses stdin as IndexDir to communicate with swish-e.

create_slice

On the first call, starts swish-e. On subsequent calls, just returns its handles using Memoize.

  my $s = $i->create_slice('/path/to/document');

You shouldn't need to call create_slice directly because it will be called from put_slice when needed.

put_slice

Pass XML data to swish.

  my $slice = $i->put_slice('/swish/path', '<xml>data</xml>');

Returns slice in which XML ended up.

slice_output

Prints to STDERR output and errors from swish-e.

  my $slice = $i->slice_output($s);

Normally, you don't need to call it.

This is a dummy placeholder function for very old code which assumes that this module uses IPC::Run, which it no longer does.

close_slice

Close slice (terminates swish-e process for that slice).

  $i->close_slice($s);

Returns true if slice is closed, false otherwise.

to_xml

Convert (binary safe, I hope) your data into XML for swish-e. Data will not yet be recoded to UTF-8. put_slice will do that.

  my $xml = $i->to_xml({ foo => 'bar' });

This function is extracted from the add method so that you can Memoize it. If your data set has a lot of repetitive data, and memory is not a problem, you can add the memoize_to_xml option to open_index.
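To show why memoizing the conversion helps with repetitive data, here is a minimal sketch using the core Memoize module. The hash_to_xml function is a hypothetical stand-in for to_xml, and its XML layout is an assumption made for the example. Because the argument is a hash reference, the sketch supplies a NORMALIZER so the cache is keyed on the hash contents.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;   # counts how many real conversions happen

# Hypothetical stand-in for to_xml: turns a hash reference into simple XML.
sub hash_to_xml {
    my ($data) = @_;
    $calls++;
    my $xml = '';
    for my $key (sort keys %$data) {
        my $value = $data->{$key};
        # escape characters which are special in XML
        $value =~ s/&/&amp;/g;
        $value =~ s/</&lt;/g;
        $value =~ s/>/&gt;/g;
        $xml .= "<$key>$value</$key>\n";
    }
    return $xml;
}

# Key the cache on the hash contents, not on the reference address.
memoize('hash_to_xml', NORMALIZER => sub {
    my ($data) = @_;
    return join "\x1C", map { ($_, $data->{$_}) } sort keys %$data;
});

my $first  = hash_to_xml({ foo => 'bar' });
my $second = hash_to_xml({ foo => 'bar' });   # served from the cache
print "conversions performed: $calls\n";      # prints: conversions performed: 1
```

The cache grows with every distinct record, which is the memory cost mentioned above.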

Searching

Searching is still conducted using the SWISH::API manpage, but you have to glob index names.

    use SWISH::API;
    # SWISH::API->new takes a single space-separated string of index files
    my $swish = SWISH::API->new( join(' ', glob('index.swish-e/*')) );

Alternatively, you can create a merged index (using the merge option) and not change your source code at all.

That would also benefit performance, but it increases indexing time because merged indexes must be re-created on each indexing run.

EXPORT

Nothing by default.

EXAMPLES

The test script for this module uses all parts of the API. It's also a nice example of how to use SWISH::Split.

SEE ALSO

the SWISH::API manpage, http://www.swish-e.org/

AUTHOR

Dobrica Pavlinusic, <dpavlin@rot13.org>

COPYRIGHT AND LICENSE

Copyright (C) 2004 by Dobrica Pavlinusic

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

ChangeLog

2005-04-29 23:25:02 dpavlin r13

/trunk/Split.pm: added warning about unimplemented delete

2005-04-29 22:51:58 dpavlin r12

/trunk: added some files to ignore

2005-04-29 22:50:16 dpavlin r11

/trunk/MANIFEST, /trunk/Split.pm: some cleanups

2005-04-29 22:38:00 dpavlin r10

/trunk/Changes: created from Subversion log

2005-04-29 22:35:21 dpavlin r9

/trunk/Split.pm: API 0.03: works with current swish-e version from CVS (with --enable-incremental!), but still doesn't support incremental indexing

2004-12-19 03:06:01 dpavlin r8

/trunk/Split.pm, /trunk/t/01api.t: new api:
- renamed open to open_index
- removed dependency on IPC::Run
- tests which all pass

2004-12-17 18:32:34 dpavlin r7

/trunk/Split.pm, /trunk/Makefile.PL, /trunk/t/01api.t: a lot of changes:
- better testing framework
- changed put_slice API (to actually confirm with documentation)
- use swish-e stdin instead of external cat utility
- added tags target

2004-12-08 20:35:49 dpavlin r6

/trunk/MANIFEST, /trunk/Split.pm, /trunk/Makefile.PL, /trunk/t/99pod.t, /trunk/MANIFEST.SKIP, /trunk/t/SWISH-Split.t, /trunk/t/01api.t: better distribution packaging and html target

2004-08-11 14:28:40 dpavlin r5

/trunk/Split.pm, /trunk/t/SWISH-Split.t: smaller improvements

2004-08-08 19:22:56 dpavlin r4

/trunk/Split.pm, /trunk/t/SWISH-Split.t: first version which passes 51 test. It still doesn't update documents, just insert.

2004-08-08 10:53:04 dpavlin r3

/trunk/Split.pm: one more planned call: find_paths

2004-08-08 10:27:27 dpavlin r2

/trunk/t/SWISH-Split.t: better tests

2004-08-08 10:09:55 dpavlin r1

/trunk/t/SWISH-Split.t, /trunk/README, /trunk, /trunk/t, /trunk/MANIFEST, /trunk/Split.pm, /trunk/Makefile.PL, /trunk/Changes: initial import of SWISH::Split. Lot of documentation, less code.