Indexing with IndexService

From the Neo4j Wiki

The integrated index framework is set to replace the IndexService; see Transitioning To Index Framework.

Neo4j is a powerful database, but it has no built-in indexing features, because the graph structure of the stored data removes much of the need for the indexes that other data models rely on. Simple key-based lookup of nodes, however, isn't easily done through the graph structure; without help you would have to manage a lookup index manually. IndexService provides those indexing capabilities for a Neo4j graph and integrates them as tightly as possible. With an IndexService you can associate any number of key-value pairs with any node and do fast lookups given such key-value pairs.

If you're looking for information about indexing with LuceneIndexBatchInserter, go to Indexing with BatchInserter.

Note that you don't need to instantiate more than one IndexService to go with a GraphDatabaseService instance. Instantiating more than one of any given type (e.g. LuceneIndexService) will just point to the same internal index data source. Also note that LuceneFulltextIndexService has its own XA resource ID and will therefore have its own index data source, so it can co-exist with a LuceneIndexService.


IndexService implementations

Our main implementation is the LuceneIndexService which (as the name implies) uses Lucene as its backend.

Once Neo4j gets an event framework (planned for soon after the 1.0 release), it will be possible to integrate these features so that indexing feels nearly native to Neo4j.

General behaviour

So you have your GraphDatabaseService and you want to index stuff from your graph in one or more indices. Well, you instantiate an IndexService with your GraphDatabaseService as an argument and you're good to go.

GraphDatabaseService graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" );
IndexService index = new LuceneIndexService( graphDb );

Indexing your nodes

The IndexService indexes your nodes with key-value pairs, just like the properties on the nodes. So if you'd like to have a certain property key indexed you basically set the value as a property on the node and index the value in your index service. Any node can have any number of key-value pairs associated with it.

Wachowski brothers example

GraphDatabaseService graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" );
IndexService index = new LuceneIndexService( graphDb );

Node andy = graphDb.createNode();
Node larry = graphDb.createNode();

andy.setProperty( "name", "Andy Wachowski" );
andy.setProperty( "title", "Director" );
larry.setProperty( "name", "Larry Wachowski" );
larry.setProperty( "title", "Director" );
index.index( andy, "name", andy.getProperty( "name" ) );
index.index( andy, "title", andy.getProperty( "title" ) );
index.index( larry, "name", larry.getProperty( "name" ) );
index.index( larry, "title", larry.getProperty( "title" ) );

You see here that you'll have to both set the properties and manually add that data to the index. It is recommended to wrap this kind of functionality, e.g. in your domain model.

Note that even though the IndexService.index(Node, String, Object) method takes an Object as argument for the value to index, the LuceneIndexService implementation will convert it to a String. Of course the same goes for the getNodes(String, Object) method.

Looking up your nodes

Once you've indexed nodes in your IndexService you can query it and get back one or more nodes given a key and a value. For example, in the previous example there's only one node with the name "Andy Wachowski", but two with the title "Director".

IndexService index = // your arbitrary LuceneIndexService instance

// This will return the andy node.
index.getSingleNode( "name", "Andy Wachowski" );

// This will return an IndexHits<Node> containing only the larry node
for ( Node hit : index.getNodes( "name", "Larry Wachowski" ) )
{
    // do something
}

// This will return an IndexHits<Node> containing both andy and larry
for ( Node hit : index.getNodes( "title", "Director" ) )
{
    // do something
}

As you can see there are two methods for looking up nodes from an index service. One is getSingleNode, which returns one hit, or null if there wasn't any hit. It will however throw an exception if there's more than one hit. The other one is getNodes, which returns all the nodes found for that query. Its return type is IndexHits, which is an Iterable with a (precalculated) size on it.

Note that if you don't loop through the entire result you'll have to call close yourself when you're done.

Updating values in the index

Remember that any node can be associated with any number of key-value pairs, including several pairs with the same key. Consequently, when a property value changes and you'd like the index to reflect it, indexing the new value is not enough: you also have to remove the old value. Look at the code below:

IndexService indexService = ...
Node node = ...

// First index the property value
node.setProperty( "name", "Thomas Anderson" );
indexService.index( node, "name", node.getProperty( "name" ) );

// When the property value changes, do as you usually do,
// but remove the old value first
indexService.removeIndex( node, "name", node.getProperty( "name" ) );
node.setProperty( "name", "Thomas A. Anderson" );
indexService.index( node, "name", node.getProperty( "name" ) );

Similarly, your application has to take care of removing the corresponding index entries when a node is deleted.
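To see why the old value has to be removed, here is a minimal sketch of a key-value node index kept in plain Java maps. ToyKeyValueIndex and its methods are hypothetical stand-ins that only mirror the index/removeIndex/getNodes calls used above; this is not the Lucene-backed implementation, and nodes are represented by plain ids.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy in-memory stand-in for a key-value node index (hypothetical, not the Neo4j API).
class ToyKeyValueIndex {
    // key -> value (as String) -> ids of the nodes indexed under that pair
    private final Map<String, Map<String, Set<Long>>> entries = new HashMap<>();

    // Adds one (key, value) entry for the node; earlier entries are left untouched.
    void index(long nodeId, String key, Object value) {
        entries.computeIfAbsent(key, k -> new HashMap<>())
               .computeIfAbsent(String.valueOf(value), v -> new HashSet<>())
               .add(nodeId);
    }

    // Removes exactly one (key, value) entry for the node, if present.
    void removeIndex(long nodeId, String key, Object value) {
        Map<String, Set<Long>> byValue = entries.get(key);
        if (byValue == null) {
            return;
        }
        Set<Long> ids = byValue.get(String.valueOf(value));
        if (ids != null) {
            ids.remove(nodeId);
        }
    }

    // Returns the ids of all nodes indexed under the given pair.
    Set<Long> getNodes(String key, Object value) {
        Map<String, Set<Long>> byValue = entries.get(key);
        if (byValue == null) {
            return Collections.<Long>emptySet();
        }
        Set<Long> ids = byValue.get(String.valueOf(value));
        if (ids == null) {
            return Collections.<Long>emptySet();
        }
        return Collections.unmodifiableSet(ids);
    }
}
```

In this toy model, indexing "Thomas A. Anderson" for a node that is already indexed under "Thomas Anderson" leaves both entries in place, so a lookup for the old name still finds the node until removeIndex is called for the old value.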

Fulltext indexing

Given the example above, let's say that you'd really like to get back both Andy and Larry with a query for "Wachowski". This would be a problem in LuceneIndexService since it can only match whole values, not parts of them. This is where LuceneFulltextIndexService comes in. It extends LuceneIndexService, both class-wise and format-wise. To provide fulltext capabilities it stores more data for each entry (node, key, value). This makes it incompatible with LuceneIndexService, so it has been given its own XA resource ID. As a result, it can co-exist with a LuceneIndexService instance for a given GraphDatabaseService.

The default implementation of LuceneFulltextIndexService uses Lucene's whitespace analyzer to analyze each value and split it up into words. This can easily be extended with your own custom implementation. So take a look at the Wachowski brothers example again and assume that you use a LuceneFulltextIndexService instead of the LuceneIndexService. You could then query the index service like this:

IndexService index = // your LuceneFulltextIndexService

index.getNodes( "name", "wachowski" ); // --> andy and larry
index.getNodes( "name", "andy" ); // --> andy
index.getNodes( "name", "Andy" ); // --> andy
index.getNodes( "name", "larry Wachowski" ); // --> larry
index.getNodes( "name", "wachowski larry" ); // --> larry

This is quite nice. Also notice that the queries are case-insensitive.
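The matching behaviour seen in the queries above can be modelled in a few lines of plain Java. This is only a sketch under assumptions read off those examples: values and queries are split on whitespace, compared case-insensitively, and every query word must occur in the value. ToyFulltextMatch is a hypothetical helper and does not involve Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of whitespace-tokenised, case-insensitive matching (not the Lucene analyzer).
class ToyFulltextMatch {
    // Splits text on whitespace and lower-cases each word.
    static Set<String> words(String text) {
        String normalised = text.trim().toLowerCase(Locale.ROOT);
        return new HashSet<>(Arrays.asList(normalised.split("\\s+")));
    }

    // True if every word of the query occurs among the words of the indexed value.
    static boolean matches(String indexedValue, String query) {
        return words(indexedValue).containsAll(words(query));
    }
}
```

With this model, matches( "Andy Wachowski", "wachowski" ) and matches( "Larry Wachowski", "wachowski larry" ) hold, while matches( "Andy Wachowski", "larry Wachowski" ) does not, which agrees with the results listed above.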

Fulltext indexing with Lucene query syntax

The LuceneFulltextQueryIndexService takes this even further: it supports queries written in the Lucene query syntax, with the restriction that a query cannot span different keys (which the Lucene query syntax would otherwise allow). Note that the default operator between terms in the Lucene query syntax is OR; this behaviour can be changed by overriding the getDefaultQueryOperator method.

IndexService index = // your LuceneFulltextQueryIndexService

index.getNodes( "name", "wachow* andy" ); // --> andy and larry
index.getNodes( "name", "Andy" ); // --> andy
index.getNodes( "name", "andy" ); // --> andy
index.getNodes( "name", "wachowski" ); // --> andy and larry
index.getNodes( "name", "+wachow* +larry" ); // --> larry
index.getNodes( "name", "andy AND larry" ); // --> (no hits)
index.getNodes( "name", "andy OR larry" ); // --> andy and larry
index.getNodes( "name", "Wachowski AND larry" ); // --> larry

Note that the LuceneFulltextQueryIndexService doesn't change the store format from that of LuceneFulltextIndexService, making them compatible.

Big search results

One straightforward aspect of performance is caching, but that's for smaller result sets. If a result set is bigger than a certain configurable threshold, the lazy iterator kicks in. This makes the query return very fast; each hit is examined and its node fetched lazily at each step of the iteration over the search result. A side effect is that such big results won't be cached. The returned search result also has a known size, so the results don't need to be looped through in order to get the size; hence getting it is always fast.

To be able to implement this feature we've added a close method which must be called if the result isn't looped through entirely (it closes automatically if the entire result is looped through).
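The close contract described above can be illustrated with a small stand-alone model. ToyHits is a hypothetical class written for this page, not the IndexHits type itself: it knows its size up front, closes itself once iteration is exhausted, and otherwise relies on the caller to call close, typically in a finally block.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Toy model of a lazily iterated result with a known size (hypothetical, not IndexHits).
class ToyHits<T> implements Iterable<T>, Iterator<T>, AutoCloseable {
    private final Iterator<T> source;
    private final int size;
    private boolean closed;

    ToyHits(List<T> hits) {
        this.source = hits.iterator();
        this.size = hits.size();
    }

    // The size is known without iterating.
    int size() { return size; }

    boolean isClosed() { return closed; }

    @Override public Iterator<T> iterator() { return this; }

    @Override public boolean hasNext() {
        boolean more = !closed && source.hasNext();
        if (!more) {
            close(); // exhausted: release resources automatically
        }
        return more;
    }

    @Override public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return source.next();
    }

    @Override public void close() { closed = true; }

    // Usage pattern: take only the first hit, then close explicitly.
    static <T> T firstOrNull(ToyHits<T> hits) {
        try {
            return hits.hasNext() ? hits.next() : null;
        } finally {
            hits.close();
        }
    }
}
```

The same try/finally shape applies whenever a loop over a search result may exit early, for instance through a break or an exception.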

Sorting

You can control the way results are sorted by taking advantage of Lucene's sorting features. You can do this on any LuceneIndexService derivative with the overloaded getNodes method which also takes a sorting argument.

LuceneIndexService index = // your arbitrary LuceneIndexService
index.getNodes( "name", "wachowski", Sort.RELEVANCE );
// OR to sort by "name"
index.getNodes( "name", "wachowski", new Sort( new SortField( "name", SortField.STRING ) ) );

Caching

If your LuceneIndexService becomes a performance bottleneck you can enable caching for even faster lookups. The caching is implemented with an LRU cache so that only the most recently accessed results are kept in the cache (a "result" here means the result of a getNodes/getSingleNode query, not a single node). You can control the size of the cache (the maximum number of results) per index key. You're advised to enable caching right after instantiating your IndexService.

Caching will work well on smaller result sets, but in addition to this there are lazy search result iterators for big search results. These two features complement each other.

LuceneIndexService index = new LuceneIndexService( graphDb );

// This means that you enable caching for results for the key "name"
// and that the 100000 latest results will be cached for faster lookups
index.enableCache( "name", 100000 );

In performance critical areas such a cache can increase performance for lookups significantly.

Note that for the LuceneFulltextQueryIndexService, cache enabling is not supported, since cache overhead is too big.
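For readers unfamiliar with the LRU policy mentioned in this section, here is a minimal sketch of the idea using java.util.LinkedHashMap in access-order mode. LruCache is a hypothetical class for illustration only and makes no claim about how the cache inside LuceneIndexService is actually written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: the least recently accessed entry is evicted first.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    LruCache(int maxSize) {
        // The third argument "true" keeps entries in access order, least recent first.
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
    }
}
```

With a maximum size of 2, putting "a" and "b", reading "a", and then putting "c" evicts "b", because "a" was accessed more recently.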

Range queries

Using the fulltext search, you can even do range queries on node properties, along the lines of:

IndexService index = new LuceneFulltextQueryIndexService( graphDb );
index.index( trinity, "name", "Trinity" );
index.index( neo, "name", "Neo" );
index.index( mouse, "name", "Mouse" );
index.index( morpheus, "name", "Morpheus" );

// This will return morpheus, mouse and neo
index.getNodes( "name", "[Morpheus TO Neo]" );

However, Lucene treats all integers as strings, so ranges are compared as text:

IndexService index = new LuceneFulltextQueryIndexService( graphDb );
index.index( myNode1, "someKey", 1 );
index.index( myNode2, "someKey", 2 );
index.index( myNode3, "someKey", 12 );
index.index( myNode4, "someKey", 21 );

// This will return myNode1, myNode2 and myNode3, NOT myNode4
index.getNodes( "someKey", "[1 TO 2]" );

To make Lucene behave more like you'd expect when dealing with numerical values, use padding, i.e. add zeros at the beginning of the values so that they all have the same length. See more information.
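The effect of padding can be checked in plain Java, using String.compareTo as a stand-in for the string ordering applied to range bounds. PaddedNumbers, pad and inStringRange are hypothetical helpers, and the width of 5 used in the note below is an arbitrary assumption; pick a width that fits the largest value you expect to index.

```java
import java.util.Locale;

// Shows why string-ordered ranges surprise for numbers, and how zero padding helps.
class PaddedNumbers {
    // Left-pads a non-negative number with zeros to a fixed width.
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled by this sketch");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    // True if candidate lies in the inclusive range [low, high] under plain string ordering.
    static boolean inStringRange(String candidate, String low, String high) {
        return candidate.compareTo(low) >= 0 && candidate.compareTo(high) <= 0;
    }
}
```

Unpadded, inStringRange( "12", "1", "2" ) is true, which is why the node indexed with 12 shows up in the range above. Padded to width 5, pad( 12, 5 ) gives "00012", which no longer falls between "00001" and "00002". The values passed to both index and getNodes would need the same padding.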
