opal

Loading DBPedia into Neo4j with Clojure

DBPedia is a community project to extract structured data from Wikipedia articles. The data is freely available for download.

Currently, the DBPedia datasets describe 3.77 million things. The information is encoded as relationships between resources. This is a natural fit for graph databases like Neo4j, where DBPedia resources are vertices and relationships are edges.

This article describes how to load over 82 million relationships from the DBPedia datasets into Neo4j with Clojure.

Why Clojure

Clojure is a Lisp dialect that runs on the JVM. It has excellent Java interoperability, and we were able to use Neo4j’s native Java libraries with no problems.

Clojure is a very expressive language, capable of doing a lot with relatively few lines of code. All the code to read the dataset files, parse them into tuples, and insert them as nodes and relationships takes fewer than 80 lines of Clojure. While we certainly could have done this loading in Java, it would have taken longer to write and resulted in more code.

The Raw Data

First, we downloaded 11 datasets from the DBPedia 3.8 downloads page that were relevant to our application. The datasets come in a variety of formats. We chose to use the Turtle (.ttl) format because the OWL API had a well-documented Turtle parser with a clean interface.

The datasets we used totalled about 13 GB and had just over 82 million tuples.

 ubuntu@host:/dbp/dbpedia/tuples$ wc -l *.ttl
  15115486 article_categories_en.ttl
    862828 category_labels_en.ttl
   1900006 geo_coordinates_en.ttl
    476978 homepages_en.ttl
   7370587 images_en.ttl
  13225167 instance_types_en.ttl
   9442540 labels_en.ttl
  20516861 mappingbased_properties_en.ttl
   5959457 persondata_en.ttl
   3769928 short_abstracts_en.ttl
   3458049 skos_categories_en.ttl
  82097887 total

The Code

You can see the entire file here. We describe each chunk of code below.

Parsing the Tuples

The TurtleParser reads an input stream and returns one token at a time. The parse-file function below uses the get-next-tuple and seq-of-parser functions to read a file and return a lazy sequence of tuples.

Since Turtle tuples can have three or more elements, seq-of-parser has to pop elements off the parser until it finds a “.” (the end-of-tuple character) or an empty string (representing the end of the file).
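The token-popping logic described above can be sketched without the real parser. This is a minimal sketch, not the article's code: next-token! is a hypothetical zero-argument function standing in for the TurtleParser, token-source is a hypothetical helper for trying it out, and the split of work between get-next-tuple and seq-of-parser is a guess.

```clojure
;; Sketch only: `next-token!` is a hypothetical zero-argument function that
;; returns the next token as a string, standing in for the real TurtleParser.

(defn get-next-tuple
  "Reads tokens until the \".\" terminator and returns them as a vector.
  Returns nil once the token source yields an empty string (end of file)."
  [next-token!]
  (loop [tuple []]
    (let [token (next-token!)]
      (cond
        (= token "") nil                    ; end of input
        (= token ".") tuple                 ; end of tuple
        :else (recur (conj tuple token))))))

(defn seq-of-parser
  "Lazy sequence of tuples, realized one tuple at a time."
  [next-token!]
  (lazy-seq
    (when-let [tuple (get-next-tuple next-token!)]
      (cons tuple (seq-of-parser next-token!)))))

(defn token-source
  "Hypothetical helper: turns a collection of tokens into a `next-token!`
  function that returns \"\" once the collection is exhausted."
  [tokens]
  (let [remaining (atom (seq tokens))]
    (fn []
      (let [token (first @remaining)]
        (swap! remaining next)
        (or token "")))))
```

Because the sequence is lazy, only the tuples actually consumed are held in memory, which matters for files with millions of lines. Calling (seq-of-parser (token-source ["s" "p" "o" "."])) yields a sequence containing the single tuple ["s" "p" "o"].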

Inserting and Connecting Nodes

Each tuple in a dataset represents one edge in the graph. They have the format start-resource relationship-type end-resource.

Most resources have more than one edge associated with them, so they appear multiple times in the dataset. Because of this, we had to keep track of which resources had already been inserted and which were new. We accomplished this with the help of an in-memory hash map: resource -> node id. We used a transient data structure so that modifications to the map could be made in place, which helped to conserve memory.

 (def id-map (atom (transient {}))) 

When inserting a resource, we check the id-map to see if the resource has already been inserted, in which case we just return the id. If the resource has not been seen before, we insert it, add the new id to the hash map, and return the id.
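The check-then-insert step might look roughly like the sketch below. It is not the article's code: create-node! is a hypothetical function standing in for the call to the Neo4j batch inserter, node-ids is a hypothetical name, and a plain persistent map is used in place of the transient map to keep the example simple. It also assumes single-threaded loading.

```clojure
;; Sketch only: `create-node!` is a hypothetical one-argument function that
;; inserts a resource and returns its new node id. A persistent map is used
;; here instead of the transient map from the article.

(def node-ids (atom {}))

(defn insert-resource-node!
  "Returns the node id for `resource`, inserting it first if it is new."
  [create-node! resource]
  (if-let [id (get @node-ids resource)]
    id                                      ; already inserted: reuse the id
    (let [id (create-node! resource)]
      (swap! node-ids assoc resource id)    ; remember the id for next time
      id)))
```

Note that if-let treats only nil and false as missing, so a node id of 0 is still found correctly.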

We don’t have to worry about duplicate edges, so connect-resource-nodes! is quite simple. The only fancy part is that we have to use DynamicRelationshipType/withName to get or create the relationship type.

Putting it all together, we can use insert-resource-node! and connect-resource-nodes! to insert tuples in a straightforward way.
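As a rough illustration of how the two functions combine, here is a sketch in which they are passed in as arguments so the example runs without Neo4j. The names insert-tuple!, insert-node! and connect! are hypothetical, and only the first three elements of a tuple are used; how the original code treats longer tuples is not shown here.

```clojure
;; Sketch only: `insert-node!` (resource -> node id) and `connect!`
;; (start id, end id, relationship type) are hypothetical stand-ins for
;; insert-resource-node! and connect-resource-nodes!.

(defn insert-tuple!
  "Inserts both endpoints of a tuple and connects them with one edge.
  Elements after the third are ignored in this sketch."
  [insert-node! connect! [start rel-type end]]
  (let [start-id (insert-node! start)
        end-id   (insert-node! end)]
    (connect! start-id end-id rel-type)))
```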

The -main function takes a path where the graph lives and a list of Turtle files. It creates the batchInserter that gets used to insert and connect nodes, then iterates over all tuples in all files and inserts them, logging a heartbeat every 10,000 nodes. Lastly, it shuts down the graph, an operation that sometimes takes 20 minutes because it can involve flushing lots of data to disk.
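The iterate-and-log loop can be sketched on its own. This is an assumption-laden sketch, not the original -main: insert-all! and its arguments are hypothetical, println stands in for the real logging, and the heartbeat here counts tuples.

```clojure
;; Sketch only: `insert-tuple!` is a hypothetical one-argument function, and
;; `println` stands in for whatever logging the real loader uses.

(defn insert-all!
  "Inserts every tuple in `tuples`, printing a heartbeat every `interval`
  tuples. Returns the number of tuples inserted."
  [insert-tuple! tuples interval]
  (reduce (fn [n tuple]
            (insert-tuple! tuple)
            (let [n (inc n)]
              (when (zero? (mod n interval))
                (println "inserted" n "tuples"))
              n))
          0
          tuples))
```

Using reduce consumes a lazy sequence of tuples one element at a time, so the whole dataset never needs to be in memory at once.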

JVM Tweaks

We tweaked a few JVM arguments. The maximum heap size was set to 14 GB (-Xmx14g), and we used the concurrent mark-and-sweep garbage collector (-XX:+UseConcMarkSweepGC) instead of the default, as recommended by Neo4j.

The following line in project.clj accomplishes this:

 :jvm-opts ["-Xmx14g" "-XX:+UseConcMarkSweepGC"] 

Results!

On an m2.xlarge instance, loading all 82 million edges took only 2.7 hours.

The following graph shows the edge loading times for consecutive sets of 10,000 edges.

[Figure: Edge loading times]

You can see a very slow upward trend, punctuated by brief periods of slowness, most likely caused by the garbage collector. The high bump towards the right of the graph occurred when we were inserting the abstracts, which involved a lot of text and significantly more memory. After that file, the loading speed settled back down.

Posted by nlacasse
January 28, 2013