Batch Insert

Note: if you run the batch inserter and fail to invoke the shutdown method, the store may be corrupted. The batch inserter is great for an initial import of data, but it should not be used for normal operation on an already existing store.

Neo4j has a batch insert mode that drops support for transactions and concurrency in favor of insertion speed. This is useful when you have a big dataset that needs to be loaded once. In our experience, the batch inserter will typically inject data around five times faster than running in normal transactional mode.

Be aware that the BatchInserter:

  1. is intended for the initial import of data
  2. is not thread safe
  3. is not transactional
  4. will leave the database files in a corrupted state if shutdown is not invoked successfully

Getting Started

Creating a batch inserter is similar to how you create a GraphDatabaseService. After it has been created you can directly create nodes, relationships and properties. You don't have to open transactions, but remember that you can't have multiple threads using the same batch inserter concurrently without external synchronization. To get started, include the "neo4j-kernel" component version "1.2" in your pom, or check out the kernel trunk and build the jar manually.

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

...

// create the batch inserter
BatchInserter inserter = new BatchInserterImpl( "neo4j-db/", BatchInserterImpl.loadProperties( "neo4j.props" ) );

// inject some data
Map<String,Object> properties = new HashMap<String,Object>();

properties.put( "name", "Mr. Andersson" );
properties.put( "age", 29 );
long node1 = inserter.createNode( properties );

properties.put( "name", "Trinity" );
properties.remove( "age" );
long node2 = inserter.createNode( properties );

inserter.createRelationship( node1, node2, DynamicRelationshipType.withName( "KNOWS" ), null );

// shutdown, makes sure all changes are written to disk
inserter.shutdown();

The batch inserter can be created with a configuration that is optimized for the work you are about to perform. The "neo4j.props" file is just a normal Java properties file (see further down for an explanation of how to configure the batch inserter).
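
For illustration, such a file could contain nothing more than the memory mapped buffer settings discussed in the configuration section below (the values here are only placeholders and need to be adapted to your data set):

neostore.nodestore.db.mapped_memory=90M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M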

For highest injection speed you should pass in all the properties for a node or relationship when it is created. If the node or relationship doesn't have any properties or you need to set its properties at a later time just pass in null. All changes go to memory (when available) so it is very important that you call BatchInserter.shutdown() when you're done since that will force all changes to be written to disk. Failing to do so may result in some of the changes getting lost or the store may even be left in a corrupted state!
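
If it suits your import flow better, you can also create the node with null and attach the properties afterwards. The sketch below assumes that the BatchInserter interface in this version exposes a setNodeProperties method:

// create the node without any properties
long node = inserter.createNode( null );

// ... later on, attach the properties in one call
// (setNodeProperties is assumed to be available on this version of BatchInserter)
Map<String,Object> nodeProps = new HashMap<String,Object>();
nodeProps.put( "name", "Neo" );
inserter.setNodeProperties( node, nodeProps );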

Using the batch inserter together with indexing

Often you have some property that needs to be indexed (typically a URI-like property), and for that we have created an index batch inserter using Lucene. This index service works like the normal transactional LuceneIndexService (again, with no transactions or thread safety). To access the Lucene index batch inserter, include the "neo4j-index" component version "1.1" in your pom, or check out the index trunk and build the jar manually.

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.index.lucene.LuceneIndexBatchInserter;
import org.neo4j.index.lucene.LuceneIndexBatchInserterImpl;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

...

BatchInserter inserter = new BatchInserterImpl( "neo4j-db/", BatchInserterImpl.loadProperties( "neo4j.props" ) );
// create the batch index service
LuceneIndexBatchInserter indexService = new LuceneIndexBatchInserterImpl( inserter );

// ... create nodes and index them
Map<String,Object> properties = new HashMap<String,Object>();
while ( haveNodesToCreate )
{
    properties.put( "uri", nextUri );
    long node = inserter.createNode( properties );
    indexService.index( node, "uri", nextUri );
}

// optimize the index
indexService.optimize();


// create relationships, make use of the index to find the right nodes
while ( haveRelationshipsToCreate )
{
    long node1 = indexService.getNodes( "uri", uri1 ).iterator().next();
    long node2 = indexService.getNodes( "uri", uri2 ).iterator().next();
    inserter.createRelationship( node1, node2, DynamicRelationshipType.withName( "KNOWS" ), null );
}

indexService.shutdown();
inserter.shutdown();

For best performance when using the index batch inserter, everything that needs to be indexed should be indexed first followed by a call to indexService.optimize(). After that the index can be used to find nodes. Depending on how your data looks a small LRU cache for the index may speed things up even more.
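
One way to build such a cache is sketched below; this is plain Java and not part of the Neo4j API, and the cache size of 10000 entries is just an arbitrary example:

import java.util.LinkedHashMap;
import java.util.Map;

...

// a small LRU map from indexed uri to node id, consulted before hitting the Lucene index
final int cacheSize = 10000;
Map<String,Long> lruCache = new LinkedHashMap<String,Long>( cacheSize, 0.75f, true )
{
    @Override
    protected boolean removeEldestEntry( Map.Entry<String,Long> eldest )
    {
        return size() > cacheSize;
    }
};

// look in the cache first, fall back to the index batch inserter on a miss
Long cached = lruCache.get( uri1 );
long node1 = cached != null ? cached : indexService.getNodes( "uri", uri1 ).iterator().next();
lruCache.put( uri1, node1 );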

If you want to batch insert using a fulltext index, simply use code like this to create the corresponding index service:

LuceneFulltextIndexBatchInserter fulltextIndexService = new LuceneFulltextIndexBatchInserter( inserter );
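
Assuming the fulltext batch inserter exposes the same index, optimize and shutdown methods as the regular LuceneIndexBatchInserter, it is used in the same way, for example:

// index some free text on a previously created node (key and text are just examples)
fulltextIndexService.index( node, "description", "A long text about Mr. Andersson" );

// as with the regular index batch inserter, optimize before querying and shut down when done
fulltextIndexService.optimize();
fulltextIndexService.shutdown();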

Configuring the batch inserter for optimal performance

Please note that the following recommendations apply only to the batch inserter. For normal operation, there are different guidelines.

The most important rule is to give the Java process running the batch inserter as much heap as possible.

As an optimizing step, if you know the number of relationships and nodes to be inserted, you can configure the memory mapped buffer settings like so:

neostore.nodestore.db.mapped_memory=<expected number of nodes * 9 bytes>
neostore.relationshipstore.db.mapped_memory=<expected number of relationships * 33 bytes>

There is an important caveat to be aware of. During non-batch operation, these buffers are allocated outside the heap. However, during batch operation, this memory will be allocated inside the heap. Therefore, you need to ensure that the sum of the memory you configure as mapped fits inside your allocated Java heap. Additionally, some memory needs to be left for the Java process itself.

Example:

  • 3M nodes and 100M relationships
  • 27MB mapped for nodes and 3300MB mapped for relationships, plus an additional 1GB for the Java process itself
  • Therefore, allocate a Java heap of at least 4327MB, rounding up to 5GB for good measure.


If your Java heap is not big enough to fit the Java process and all of the memory mapped buffers, shrink the buffer sizes so that they fit into the heap while still leaving about 1GB free for the Java process itself.
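
As a rough sketch of that arithmetic (plain Java, not part of the Neo4j API; all numbers are illustrative), both buffers can be shrunk proportionally so that they fit into whatever heap remains after reserving 1GB for the process:

// illustrative figures: 3M nodes, 100M relationships, a 4GB heap
long nodeBytes = 3000000L * 9;                   // ~27MB needed for the node store buffer
long relBytes = 100000000L * 33;                 // ~3300MB needed for the relationship store buffer
long heapBytes = 4L * 1024 * 1024 * 1024;        // total Java heap, e.g. started with -Xmx4g
long processReserve = 1024L * 1024 * 1024;       // leave about 1GB for the Java process itself

long available = heapBytes - processReserve;
long needed = nodeBytes + relBytes;

// if everything fits, use the full sizes; otherwise scale both buffers down proportionally
double scale = needed <= available ? 1.0 : (double) available / needed;
long nodeBufferBytes = (long) ( nodeBytes * scale );
long relBufferBytes = (long) ( relBytes * scale );

The resulting sizes are what you would then put into neostore.nodestore.db.mapped_memory and neostore.relationshipstore.db.mapped_memory.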
