Towards a scalable Solr-based RDF Store

SolRDF (i.e., Solr + RDF) is a set of Solr extensions for managing (indexing and searching) RDF data.

In a preceding post, I described how to set up a standalone SolRDF instance in two minutes; in this post, I’ll describe how to run SolRDF in a simple cluster. The required steps are similar to what you (hopefully) already did for the standalone instance.

All you need

  • A shell (in case you are on the dark side of the moon, all steps can be easily done in Eclipse or whatever IDE)
  • Java 7 or higher
  • Apache Maven (3.x)
  • Apache Zookeeper (I’m using version 3.4.6)
  • git (optional, you can also download the repository from GitHub as a zipped file)

Step #1: Start Zookeeper

Open a shell and type the following:

				
> cd $ZOOKEEPER_HOME/bin
> ./zkServer.sh start

That will start Zookeeper in the background (use start-foreground for foreground mode). By default it will listen on localhost:2181.
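As a quick sanity check, you can ask Zookeeper whether it is up by sending it the ruok four-letter command; a healthy server replies with imok:

> echo ruok | nc localhost 2181
imok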

Step #2: Checkout SolRDF

If this is the first time you’ve heard about SolRDF, you need to clone the repository. Open another shell and type the following:

				
> cd /tmp
> git clone https://github.com/agazzarini/SolRDF.git solrdf-download

Alternatively, if you’ve already cloned the repository, just pull the latest version; or, if you don’t have git, you can download the whole repository from GitHub as a zipped file.
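For the record, updating an existing clone is just a pull (assuming the clone lives in /tmp/solrdf-download, as above):

> cd /tmp/solrdf-download
> git pull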

Step #3: Build and Run SolRDF Nodes

For this example we will set up a simple cluster consisting of one collection with two shards.

				
> cd solrdf-download/solrdf
> mvn -DskipTests \
    -Dlisten.port=$PORT \
    -Dindex.data.dir=$DATA_DIR \
    -Dulog.dir=$ULOG_DIR \
    -Dzk=$ZOOKEEPER_HOST_PORT \
    -Pcloud \
    clean package cargo:run

Where

  • $PORT is the listen port of the hosting servlet engine;
  • $DATA_DIR is the directory where Solr will store its data files (i.e. the index);
  • $ULOG_DIR is the directory where Solr will store its transaction logs;
  • $ZOOKEEPER_HOST_PORT is the Zookeeper listen address (e.g. localhost:2181).
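For instance, a first node could be started with values like these (the directories below are just an example; any writable paths will do):

> cd solrdf-download/solrdf
> mvn -DskipTests \
    -Dlisten.port=8080 \
    -Dindex.data.dir=/tmp/solrdf/node1/data \
    -Dulog.dir=/tmp/solrdf/node1/ulog \
    -Dzk=localhost:2181 \
    -Pcloud \
    clean package cargo:run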

The very first time you run this command, a lot of things will be downloaded, Solr included. At the end you should see something like this:

				
[INFO] Jetty 7.6.15.v20140411 Embedded started on port [8080]
[INFO] Press Ctrl-C to stop the container...
The first node of SolRDF is up and running! Note that the command above assumes the node is running on localhost:8080.

The second node can be started by opening another shell and re-executing the command above, with a different listen port, data directory, and transaction log directory.
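For instance, a second node listening on localhost:8081 could be started like this (again, the directories are just an example; clean package is omitted so the artifacts already built for the first node are reused):

> cd solrdf-download/solrdf
> mvn -DskipTests \
    -Dlisten.port=8081 \
    -Dindex.data.dir=/tmp/solrdf/node2/data \
    -Dulog.dir=/tmp/solrdf/node2/ulog \
    -Dzk=localhost:2181 \
    -Pcloud \
    cargo:run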

Step #4: Distributed Indexing

Open another shell and index some data. Assuming a node is running on localhost:8080, an N-Triples file can be sent to SolRDF’s bulk update endpoint (the file used here contains 5007 triples):

				
> curl -v "http://localhost:8080/solr/store/update/bulk?commit=true" \
  -H "Content-Type: application/n-triples" \
  --data-binary @/path/to/your/data.nt

Wait a moment… ok! You just added 5007 triples! They’ve been distributed across the cluster: you can see that by opening the Solr admin console.
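If you prefer the shell to the admin console, you can ask each node for its local document count: distrib=false is a standard Solr parameter that restricts the search to the receiving core, so the two numbers should sum up to the total number of indexed triples (this assumes the default /select handler is enabled):

> curl "http://localhost:8080/solr/store/select?q=*:*&rows=0&distrib=false&wt=json"
> curl "http://localhost:8081/solr/store/select?q=*:*&rows=0&distrib=false&wt=json"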

Step #5: Querying

Open another shell and type the following:

				
> curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+json"
...

In the examples above I used only the node running on localhost:8080, for both indexing and querying, but you can send requests to any node in the cluster. For instance, you can re-execute the query above against the other node (assuming it is running on localhost:8081):

				
> curl "http://127.0.0.1:8081/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+json"
...

You will get the same results.
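Since the endpoint speaks the SPARQL protocol, you can also negotiate a different result format through the Accept header; for example, assuming the standard SPARQL media types are honoured, XML results can be requested like this:

> curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+xml"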

Is that ready for a production scenario? No, absolutely not. I think a lot still needs to be done on the indexing and querying optimization side. At the moment, only the functional side has been covered: the integration test suite includes about 150 SPARQL queries (ASK, CONSTRUCT, SELECT, and DESCRIBE) and updates (e.g. INSERT, DELETE) taken from the Learning SPARQL book [1], which work regardless of whether the target service runs as a standalone or as a clustered instance.
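As an illustration of a non-SELECT form, an ASK query can be issued against the same endpoint just by changing the query text (the triple pattern below is arbitrary):

> curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=ASK { ?s ?p ?o }" \
  -H "Accept: application/sparql-results+json"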

I will run the first benchmarks as soon as possible but, honestly, at the moment I don’t believe I’ll see high throughput.
