IMDB Inserting Data

From Neo4j Wiki

This part uses the domain layer and services provided to insert IMDB data into the graph.

Overview

To load the data we use a simple parser and an ImdbReader implementation, structured as follows:

Image:Imdb.parser.png

ImdbParser is a push-based parser for the IMDB data files. It expects the client to provide an ImdbReader. The ImdbReader implementation then relies on an ImdbService to actually inject the data into the graph.

The ImdbReader interface:

public interface ImdbReader
{
    void newMovies( List<MovieData> movieList );
    void newActors( List<ActorData> actorList );
}

The MovieData, ActorData and RoleData classes are simple containers used for the data from the parser:

Image:Imdb.parser.data.png
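
For readers without the diagram, the containers can be sketched roughly as follows. Only the getter names (getTitle(), getYear(), getName(), getMovieRoles(), getRole()) are taken from the code on this page; the fields, constructors and comments are assumptions, and the real classes may look different.

```java
// Hypothetical sketch of the parser's data containers. Only the getter
// names come from the code on this page; everything else is assumed.
final class MovieData
{
    private final String title;
    private final int year;

    MovieData( final String title, final int year )
    {
        this.title = title;
        this.year = year;
    }

    String getTitle() { return title; }
    int getYear() { return year; }
}

final class RoleData
{
    private final String title; // title of the movie the role belongs to
    private final String role;  // name of the part played in that movie

    RoleData( final String title, final String role )
    {
        this.title = title;
        this.role = role;
    }

    String getTitle() { return title; }
    String getRole() { return role; }
}

final class ActorData
{
    private final String name;
    private final RoleData[] movieRoles;

    ActorData( final String name, final RoleData[] movieRoles )
    {
        this.name = name;
        this.movieRoles = movieRoles;
    }

    String getName() { return name; }
    RoleData[] getMovieRoles() { return movieRoles; }
}
```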

This is the overall flow when injecting the data:

  1. the caller provides the ImdbReader implementation to the ImdbParser
  2. the caller provides the location of the file to parse to the parser
  3. the parser goes through the file and buffers the output of the parsing
  4. the buffered data is sent to the ImdbReader at some interval

The parser output is buffered because we do not want to start a transaction for every single item when inserting bulk data. Buffering a few hundred items per transaction performs much better. Depending on the available heap space, you can buffer an even larger number of items.
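
The buffering in steps 3 and 4 above can be sketched as a small helper. BatchBuffer, its batchSize parameter and the Consumer callback are hypothetical names that stand in for the parser's internal buffer and for calls such as ImdbReader.newMovies(); the real ImdbParser is not shown on this page and may be built differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: collects parsed items and hands them to a callback
// in batches, so that the receiver can use one transaction per batch.
final class BatchBuffer<T>
{
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    BatchBuffer( final int batchSize, final Consumer<List<T>> sink )
    {
        if ( batchSize < 1 )
        {
            throw new IllegalArgumentException( "batchSize must be at least 1" );
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Called once for every parsed item; flushes when the buffer is full.
    void add( final T item )
    {
        buffer.add( item );
        if ( buffer.size() >= batchSize )
        {
            flush();
        }
    }

    // Sends whatever is buffered; call this at end of file as well,
    // otherwise the last partial batch is lost.
    void flush()
    {
        if ( !buffer.isEmpty() )
        {
            sink.accept( new ArrayList<>( buffer ) );
            buffer.clear();
        }
    }
}
```

With a batch size of 500, feeding 1200 items followed by a final flush() results in three callbacks (500, 500 and 200 items) instead of 1200 separate ones.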

Next we take a closer look at the ImdbReader implementation. We skip the ImdbParser source code, as its only relevance in this context is that it calls the ImdbReader API.

ImdbReader implementation

The ImdbReader interface is implemented in the ImdbReaderImpl class. In this section we'll walk through the code showing how the domain layer is used to insert the data.

To start with, the class lets Spring inject the ImdbService, which we use to work with the graph without having to touch the underlying details.

class ImdbReaderImpl implements ImdbReader
{
    @Autowired
    private ImdbService imdbService;

Here comes an important point: every set of new actors or movies is handled inside a single transaction. For this reason the public newActors() and newMovies() methods wrap the actual insert methods (which are private).

Apart from that, these two methods merely unwrap the data needed to create the entities in the graph.

    @Transactional
    public void newActors( final List<ActorData> actorList )
    {
        for ( ActorData actorData : actorList )
        {
            newActor( actorData.getName(), actorData.getMovieRoles() );
        }
    }

    @Transactional
    public void newMovies( final List<MovieData> movieList )
    {
        for ( MovieData movieData : movieList )
        {
            newMovie( movieData.getTitle(), movieData.getYear() );
        }
    }

Now let's look at how the domain layer is used to add new entities. First comes the newMovie() method, which simply forwards the request to the ImdbService.

    private void newMovie( final String title, final int year )
    {
        imdbService.createMovie( title, year );
    }

Note: the newMovie() and newActor() methods cannot be executed outside of a transaction, as all Neo4j operations occur inside of transactions.

When inserting actors, we start out by creating a new actor, just as with the movies. The difference is that we now have to insert the actor's roles as well.

For each movie role, we have to look the movie up by title. If there is a hit, we go on and add the role too; if not, the role is skipped.

    private void newActor( final String name, final RoleData[] movieRoles )
    {
        final Actor actor = imdbService.createActor( name );
        for ( RoleData movieRole : movieRoles )
        {
            final Movie movie = imdbService.getMovie( movieRole.getTitle() );
            if ( movie != null )
            {
                imdbService.createRole( actor, movie, movieRole.getRole() );
            }
        }
    }
}
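
To see the effect of the movie lookup in newActor() in isolation, here is a minimal sketch that uses an in-memory map instead of the graph. InMemoryImdb and its methods are hypothetical stand-ins, not the real ImdbService; the sketch only mirrors the rule that a role is stored when its movie is found and skipped otherwise.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the graph; not the real ImdbService.
final class InMemoryImdb
{
    private final Map<String, Integer> moviesByTitle = new HashMap<>();
    private final List<String> roles = new ArrayList<>();

    void createMovie( final String title, final int year )
    {
        moviesByTitle.put( title, year );
    }

    // Mirrors newActor(): each entry is { movie title, role name }, and a
    // role is stored only if a movie with that title already exists.
    void createActor( final String name, final String[][] titleAndRole )
    {
        for ( String[] entry : titleAndRole )
        {
            if ( moviesByTitle.containsKey( entry[0] ) )
            {
                roles.add( name + " as " + entry[1] + " in " + entry[0] );
            }
        }
    }

    List<String> getRoles()
    {
        return roles;
    }
}
```

Under this logic, a role whose movie has not been inserted yet is dropped without any error, which suggests loading the movies before the actors.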

Important points:
  • avoid transactions that are too small when inserting bulk data
  • the @Transactional annotations are easy to overlook in the code, but they are indispensable

Next part: IMDB Finding the Path
Index page: overview
