IMDB Search Engine

Neo4j Wiki から

Here we'll create a simple customized search engine.

目次

[edit] Indexing and searching

Note: There are images of the graph layout in the #Graph layout and indexing section of this page.

In the IMDB example we use two different strategies for indexing the graph:

  1. Index the full string: every index item points to a movie- or actor node. This is provided by the built-in index service.
  2. Index every word in the name/title: every index item points to a special "index node". This is provided by a small search engine, built on top of Neo4j.

So we will have to index the nodes, and add index nodes for the words and add these to an index.

To make the search perform better, it will use the word with the least number of relationships as a starting point. This decreases the total number of relationships that have to be followed during the search. We will come back to this when we look at the code that implements the search.

Note: this example doesn't show "how indexing should be used in Neo4j" or something like that. It is merely an example with some advantages and disadvantages. Many other options are possible. You could for instance use Lucene in various ways. You have to find out what will work for your domain. What the example does show is how easy you can create your own search engine on top of Neo4j, tailored for the domain at hand!

Advantages of the index/search implementation used in the example:

  • simple to understand
  • simple to implement
  • works fairly well for "and"-searches (all words should be included)
  • performs well for low numbers of search terms even with much data

Disadvantages:

  • doesn't perform very good when searching for a higher number of search terms
  • doesn't handle the problem with very common words (like "the")
  • doesn't handle "or"-searches that well when it comes to ranking the results

This is an overview of what we are going to do here:

Image:Imdb.domain.service.search.png

To begin with we use a simple string splitting function to prepare the input for indexing:

    private String[] splitSearchString( final String value )
    {
        return value.toLowerCase( Locale.ENGLISH ).split( "[^\\w]+" );
    }

Having this as a separate method is made to ensure splitting the input strings in the same way both when indexing and when searching.

As seen from the diagram, we have introduced one more set of relationshiptypes, like this:

public enum ImdbSearchRelTypes implements RelationshipType
{
    PART_OF_NAME, PART_OF_TITLE
}

The interface of the search engine to implement:

public interface ImdbSearchEngine
{
    void indexActor( Actor actor );
    void indexMovie( Movie movie );
    Node searchActor( String name );
    Node searchMovie( String title );
}

[edit] Constants and dependencies

To start with, the class defines some constants we will make use of, and lest Spring inject the services we depend on here.

public class ImdbSearchEngineImpl implements ImdbSearchEngine
{
    private static final String NAME_PART_INDEX = "name.part";
    private static final String WORD_PROPERTY = "word";
    private static final String COUNT_PROPERTY = "count_uses";
    private static final String TITLE_PART_INDEX = "title.part";

    @Autowired
    private GraphDatabaseService graphDbService;

    @Autowired
    private IndexService indexService;


[edit] インデクシング

Now let's add our actor and movie nodes to the index.

To make life easier for the consumer, the input is simply the Actor or Movie that should get indexed.

As the two cases are very similar, the public methods will only act as wrappers for a private workhorse method:

    public void indexActor( Actor actor )
    {
        index( actor.getName(), ( ( ActorImpl ) actor ).getUnderlyingNode(),
            NAME_PART_INDEX, ImdbSearchRelTypes.PART_OF_NAME );
    }

    public void indexMovie( Movie movie )
    {
        index( movie.getTitle(), ( ( MovieImpl ) movie ).getUnderlyingNode(),
            TITLE_PART_INDEX, ImdbSearchRelTypes.PART_OF_TITLE );
    }

A look at the index() method will explain what this is about.

The caller provides index() with:

  1. the value to index on
  2. the node to index
  3. the index name - name of the key for this index
  4. the relationship type used for relationships from the word node to the indexed node
    private void index( final String value, final Node node, final String partIndexName, final ImdbSearchRelTypes relType )
    {
        for ( String part : splitSearchString( value ) )
        {
            Node wordNode = indexService.getSingleNode( partIndexName, part );
            if ( wordNode == null )
            {
                wordNode = neoService.createNode();
                indexService.index( wordNode, partIndexName, part );
                wordNode.setProperty( WORD_PROPERTY, part );
            }
            wordNode.createRelationshipTo( node, relType );
            wordNode.setProperty( COUNT_PROPERTY, ( ( Integer ) wordNode.getProperty( COUNT_PROPERTY, 0 ) ) + 1 );
        }
    }

The idea is to split the string, and then index every part of it. For every such "word" a word node is created, that has a relationship to the actor/movie that is getting indexed.

We don't want duplicate word nodes, so we'll first try to get an existing one, and just create a new when needed.

For convenience we add the search term as a property on the node. No functionality depends on this, but it's useful for debugging activites like when inspecting the graph.

Finally the relationship counter on the word node is incremented. If there was no such property, the default value given (zero) will be used and incremented to one. More on the counter later.


[edit] Searching

The search depends on finding the word nodes that correspond to the search terms given, so we'll start in that end. Other than finding the nodes, they should also be listed ordered by their number of relationships in ascending order. That's where the counter property comes in handy.

As when indexing the public methods only wrap the actual search method:

    public Node searchActor( String name )
    {
        return searchSingle( name, NAME_PART_INDEX, ImdbSearchRelTypes.PART_OF_NAME );
    }

    public Node searchMovie( String title )
    {
        return searchSingle( title, TITLE_PART_INDEX, ImdbSearchRelTypes.PART_OF_TITLE );
    }

Lets dive into the method where the search actually takes place.

    private Node searchSingle( final String value, final String indexName, final ImdbSearchRelTypes wordRelType )
    {
        final List<Node> wordList = findSearchWords( value, indexName );
        if ( wordList.isEmpty() )
        {
            return null;
        }

First we call a method that will find the nodes of the search terms for us. If none is found, we return a negative search result.

We take the node with the least number of relationships and use it as our start node, from which we will start looking into the graph. After setting up the start node, we choose a random relationship (the one that happens to be returned as the first) and store it in case we won't find anything better than this.

        final Node startNode = wordList.remove( 0 );
        Node match = startNode.getRelationships( wordRelType ).iterator().next().getEndNode();
        if ( wordList.isEmpty() )
        {
            return match;
        }

If there was only one word in the list, we return the match that we just set up.

Now we set up the bestCount variable to keep track of how many word hits was the best match encountered so far. The listSize variable is used to confirm when we have matched all words in the word list.

Our method is to find all movies/actors connected to the start node, and then look if they in turn are connected to one or more of our search words.

        int bestCount = 0;
        final int listSize = wordList.size();
        for ( Relationship targetRel : startNode.getRelationships( wordRelType ) )
        {
            Node targetNode = targetRel.getEndNode();
            int hitCount = 0;
            for ( Relationship wordRel : targetNode.getRelationships( wordRelType ) )
            {

Here we use the hitCount variable to count how many of our search words were found for the current target node (a movie/actor). When finding a full match, it's returned immediately, as we only want to return a single node.

Note: pre-increment implies that we compare to the value after incrementing the counter.

                if ( wordList.contains( wordRel.getStartNode() ) )
                {
                    if ( ++hitCount == listSize )
                    {
                        return targetNode;
                    }
                }
            }
            if ( hitCount > bestCount )
            {
                match = targetNode;
                bestCount = hitCount;
            }
        }
        return match;
    }

At the end, we save the "best so far" node, when it had a higher hit count than the best one so far. Finally, it's time to return the best match we could find.

Now lets see how the word nodes are fetched and sorted according to the number of relationships they have.

    private List<Node> findSearchWords( final String userInput, final String partIndexName )
    {
        final List<Node> wordList = new ArrayList<Node>();

Loop over the words and add any nodes fulfilling the criteria to the list.

        for ( String part : splitSearchString( userInput ) )
        {
            Node wordNode = indexService.getSingleNode( partIndexName, part );
            if ( wordNode == null || !wordNode.hasRelationship() || wordList.contains( wordNode ) )
            {
                continue;
            }
            wordList.add( wordNode );
        }
        if ( wordList.isEmpty() )
        {
            return Collections.emptyList();
        }

The next task is to sort the nodes according to their number of relationships. This is performed by implementing the Comparator interface. As we have saved the relationship count when we added the relationships, we can simply fetch those numbers and calculate the difference. That's it, and we'll get a sorted list with the least connective nodes at the beginning.

        Collections.sort( wordList, new Comparator<Node>()
        {
            public int compare( final Node left, final Node right )
            {
                int leftCount = ( Integer ) left.getProperty( COUNT_PROPERTY, 0 );
                int rightCount = ( Integer ) right.getProperty( COUNT_PROPERTY, 0 );
                return leftCount - rightCount;
            }
        } );
        return wordList;
    }

[edit] Graph layout and indexing

The index service adds a lucene/ path in your Neo4j store directory, since we use the LuceneIndexService. It can keep track of nodes so even if you have disconnected subgraphs ("islands") you can still reach them this way.

Looking from the perspective of an indexed movie node we get the following:

Image:Imdb.screenshot.movie.index.png

As you can see, the movie is connected to both actors and part-of-title-words.

Taking the actor angle will look similar to the following image:

Image:Imdb.screenshot.actor.index.png

The actor has relationships to movies and to part-of-name-words.

The next image has it's focus on actors that share the name Johan. Here you can see how the part-of-name-word "johan" is connected to different actors, that are connected to other part-of-name-words as well as movies.

Image:Imdb.screenshot.johan.png

Points of interest:
  • use relationship types to overlay multiple structures on top of each other in the graph
  • create structures that fit to your domain
  • using some creativity you can create amazing ways to search your data


Next part: IMDB Transaction handling Index page: overview

Neo4j のサイト
ツールボックス