Towards a scalable Solr-based RDF Store

SolRDF (i.e., Solr + RDF) is a set of Solr extensions for managing (indexing and searching) RDF data.

In a preceding post, I described how to set up a standalone SolRDF instance in two minutes; in this post, I’ll describe how to run SolRDF in a simple cluster. The required steps are similar to what you (hopefully) already did for the standalone instance.

All you need

  • A shell (in case you are on the dark side of the moon, all steps can be easily done in Eclipse or whatever IDE)
  • Java 7 or higher
  • Apache Maven (3.x)
  • Apache Zookeeper (I’m using version 3.4.6)
  • git (optional, you can also download the repository from GitHub as a zipped file)

Step #1: Start Zookeeper

Open a shell and type the following:

				
> cd $ZOOKEEPER_HOME/bin
> ./zkServer.sh start

That will start Zookeeper in the background (use start-foreground for foreground mode). By default it will listen on localhost:2181.
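As a quick sanity check, you can ask Zookeeper whether it is up by sending it the ruok four-letter command; a healthy server replies with imok:

> echo ruok | nc localhost 2181
imok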

Step #2: Checkout SolRDF

If this is the first time you’ve heard about SolRDF, you need to clone the repository. Open another shell and type the following:

				
> cd /tmp
> git clone https://github.com/agazzarini/SolRDF.git solrdf-download

Alternatively, if you’ve already cloned the repository, just pull the latest version; or, if you don’t have git, you can download the whole repository from GitHub as a zipped file.
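For the record, updating an existing clone is just a pull (assuming the clone lives in /tmp/solrdf-download, as above):

> cd /tmp/solrdf-download
> git pull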

Step #3: Build and Run SolRDF Nodes

For this example we will set up a simple cluster consisting of one collection with two shards.

				
> cd solrdf-download/solrdf
> mvn -DskipTests \
    -Dlisten.port=$PORT \
    -Dindex.data.dir=$DATA_DIR \
    -Dulog.dir=$ULOG_DIR \
    -Dzk=$ZOOKEEPER_HOST_PORT \
    -Pcloud \
    clean package cargo:run

Where

  • $PORT is the listen port of the hosting servlet engine;
  • $DATA_DIR is the directory where Solr will store its data files (i.e. the index);
  • $ULOG_DIR is the directory where Solr will store its transaction logs;
  • $ZOOKEEPER_HOST_PORT is the Zookeeper listen address (e.g. localhost:2181).
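For instance, a first node could be started with values like these (the directories below are just an example; any writable paths will do):

> cd solrdf-download/solrdf
> mvn -DskipTests \
    -Dlisten.port=8080 \
    -Dindex.data.dir=/tmp/solrdf/node1/data \
    -Dulog.dir=/tmp/solrdf/node1/ulog \
    -Dzk=localhost:2181 \
    -Pcloud \
    clean package cargo:run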

The very first time you run this command, a lot of things will be downloaded, Solr included. At the end you should see something like this:

				
[INFO] Jetty 7.6.15.v20140411 Embedded started on port [8080]
[INFO] Press Ctrl-C to stop the container...
The first node of SolRDF is up and running! Note that the command above assumes the node is running on localhost:8080.

The second node can be started by opening another shell and re-executing the command above, with a different listen port, data directory, and transaction log directory.
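For instance, a second node listening on localhost:8081 could be started like this (again, the directories are just an example; clean package is omitted so the artifacts already built for the first node are reused):

> cd solrdf-download/solrdf
> mvn -DskipTests \
    -Dlisten.port=8081 \
    -Dindex.data.dir=/tmp/solrdf/node2/data \
    -Dulog.dir=/tmp/solrdf/node2/ulog \
    -Dzk=localhost:2181 \
    -Pcloud \
    cargo:run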

Step #4: Distributed Indexing

Open another shell and index some data. Assuming a node is running on localhost:8080, an N-Triples file can be sent to SolRDF’s bulk update endpoint (the file used here contains 5007 triples):

				
> curl -v "http://localhost:8080/solr/store/update/bulk?commit=true" \
  -H "Content-Type: application/n-triples" \
  --data-binary @/path/to/your/data.nt

Wait a moment… ok! You just added 5007 triples! They’ve been distributed across the cluster: you can see that by opening the Solr admin console.
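If you prefer the shell to the admin console, you can ask each node for its local document count: distrib=false is a standard Solr parameter that restricts the search to the receiving core, so the two numbers should sum up to the total number of indexed triples (this assumes the default /select handler is enabled):

> curl "http://localhost:8080/solr/store/select?q=*:*&rows=0&distrib=false&wt=json"
> curl "http://localhost:8081/solr/store/select?q=*:*&rows=0&distrib=false&wt=json"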

Step #5: Querying

Open another shell and type the following:

				
> curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+json"
...

In the examples above I used only the node running on localhost:8080, for both indexing and querying, but you can send requests to any node in the cluster. For instance, you can re-execute the query above against the other node (assuming it is running on localhost:8081):

				
> curl "http://127.0.0.1:8081/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+json"
...

You will get the same results.
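Since the endpoint speaks the SPARQL protocol, you can also negotiate a different result format through the Accept header; for example, assuming the standard SPARQL media types are honoured, XML results can be requested like this:

> curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+xml"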

Is that ready for a production scenario? No, absolutely not. I think a lot still needs to be done on the indexing and querying optimization side. At the moment, only the functional side has been covered: the integration test suite includes about 150 SPARQL queries (ASK, CONSTRUCT, SELECT, and DESCRIBE) and updates (e.g. INSERT, DELETE) taken from the Learning SPARQL book [1], which work regardless of whether the target service runs as a standalone or as a clustered instance.
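As an illustration of a non-SELECT form, an ASK query can be issued against the same endpoint just by changing the query text (the triple pattern below is arbitrary):

> curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=ASK { ?s ?p ?o }" \
  -H "Accept: application/sparql-results+json"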

I will run the first benchmarks as soon as possible but, honestly, at the moment I don’t believe I’ll see high throughput.
