2014-12-29 14:28:37 -05:00
2014-07-23 15:57:34 -04:00
2014-09-30 09:41:54 -04:00
2014-06-04 13:36:21 -04:00
2014-12-29 14:13:00 -05:00

AccumuloGraph

Build Status

This is an implementation of the TinkerPop Blueprints 2.6 API using Apache Accumulo as the backend. This combines the benefits and flexibility of Blueprints with the scalability and performance of Accumulo.

In addition to the basic Blueprints functionality, we provide additional features that harness more of Accumulo's power.

Some features include...

Benchmarks

Indexing via the IndexableGraph and KeyIndexableGraph interfaces.

Benchmarking

Feel free to email with suggestions for improvements. Please submit issues for any bugs you find or features you want. We are also open to pull requests.

This implementation provides easy to use, easy to write, and easy to read access to an arbitrarily large graph that is stored in Accumulo.

We implement the following Blueprints interfaces:
1. Graph
2. KeyIndexableGraph
3. IndexableGraph

Benchmarking.

Getting Started

First, include AccumuloGraph as a Maven dependency. Releases are deployed to Maven Central.

<dependency>
	<groupId>edu.jhuapl.tinkerpop</groupId>
	<artifactId>blueprints-accumulo-graph</artifactId>
	<version>0.0.2</version>
</dependency>

For non-Maven users, the binary jars can be found in the releases section in this GitHub repository, or you can get them from Maven Central.

Creating an AccumuloGraph involves setting a few parameters in an AccumuloGraphConfiguration object, and opening the graph. The defaults are sensible for using an Accumulo cluster. We provide some simple examples below. Javadocs for AccumuloGraphConfiguration explain all the other parameters in more detail.

First, to instantiate an in-memory graph:

Configuration cfg = new AccumuloGraphConfiguration()
  .setInstanceType(InstanceType.Mock)
  .setGraphName("graph");
return GraphFactory.open(cfg);

This creates a "Mock" instance which holds the graph in memory. You can now use all the Blueprints and AccumuloGraph-specific functionality with this in-memory graph. This is useful for getting familiar with AccumuloGraph's functionality, or for testing or prototyping purposes.

To use an actual Accumulo cluster, use the following:

Configuration cfg = new AccumuloGraphConfiguration()
  .setInstanceType(InstanceType.Distributed)
  .setZooKeeperHosts("zookeeper-host")
  .setInstanceName("instance-name")
  .setUser("user").setPassword("password")
  .setGraphName("graph")
  .setCreate(true);
return GraphFactory.open(cfg);

This directs AccumuloGraph to use a "Distributed" Accumulo instance, and sets the appropriate ZooKeeper parameters, instance name, and authentication information, which correspond to the usual Accumulo connection settings. The graph name is used to create several backing tables in Accumulo, and the setCreate option tells AccumuloGraph to create the backing tables if they don't already exist.

Improving Performance

This section describes various configuration parameters that greatly enhance AccumuloGraph's performance. Brief descriptions of each option are provided here, but refer to the AccumuloGraphConfiguration Javadoc for fuller explanations.

Disable consistency checks

The Blueprints API specifies a number of consistency checks for various operations, and requires errors if they fail. Some examples of invalid operations include adding a vertex with the same id as an existing vertex, adding edges between nonexistent vertices, and setting properties on nonexistent elements. Unfortunately, checking the above constraints for an Accumulo installation entails significant performance issues, since these require extra traffic to Accumulo using inefficient non-batched access patterns.

To remedy these performance issues, AccumuloGraph exposes several options to disable various of the above checks. These include:

  • setAutoFlush - to disable automatically flushing changes to the backing Accumulo tables
  • setSkipExistenceChecks - to disable element existence checks, avoiding trips to the Accumulo cluster
  • setIndexableGraphDisabled - to disable indexing functionality, which improves performance of element removal

Set Accumulo performance parameters

Accumulo itself features a number of performance-related parameters, and we allow configuration of these. Generally, these relate to write buffer sizes, multithreading, etc. The settings include:

  • setMaxWriteLatency - max time prior to flushing element write buffer
  • setMaxWriteMemory - max size for element write buffer
  • setMaxWriteThreads - max threads used for element writing
  • setMaxWriteTimeout - max time to wait before failing element buffer writes
  • setQueryThreads - number of query threads to use for fetching elements, properties etc.

Caching and preloading

AccumuloGraph contains a number of

  • setPropertyCacheTimeout

  • setEdgeCacheParams

  • setVertexCacheParams

  • setPreloadedEdgeLabels

  • setPreloadedProperties

Bulk Ingest

Hadoop Integration

Table Structure

##Code Examples ###Creating a new or connecting to an existing distributed graph

Configuration cfg = new AccumuloGraphConfiguration()
	.setInstanceName("accumulo").setUser("user").setZookeeperHosts("zk1")
    .setPassword("password".getBytes()).setGraphName("myGraph");
Graph graph = GraphFactory.open(cfg.getConfiguration());

###Creating a new Mock Graph

Setting the instance type to mock allows for in-memory processing with a MockAccumulo instance.
There is also support for Mini Accumulo.

Configuration cfg = new AccumuloGraphConfiguration().setInstanceType(InstanceType.Mock)
	.setGraphName("myGraph");
Graph graph = GraphFactory.open(cfg);

###Accessing a graph

Vertex v1 = graph.addVertex("1");
v1.setProperty("name", "Alice");
Vertex v2 = graph.addVertex("2");
v2.setProperty("name", "Bob");

Edge e1 = graph.addEdge("E1", v1, v2, "knows");
e1.setProperty("since", new Date());

###Creating indexes

((KeyIndexableGraph)graph)
	.createKeyIndex("name", Vertex.class);

###MapReduce Integration

####In the tool

AccumuloConfiguration cfg = new AccumuloGraphConfiguration()
	.setInstanceName("accumulo").setZookeeperHosts("zk1").setUser("root")
	.setPassword("secret".getBytes()).setGraphName("myGraph");

Job j = new Job();
j.setInputFormatClass(VertexInputFormat.class);
VertexInputFormat.setAccumuloGraphConfiguration(j,
	cfg.getConfiguration());

####In the mapper

public void map(Text k, Vertex v, Context c) {
    System.out.println(v.getId().toString());
}

##Table Design ###Vertex Table

Row ID Column Family Column Qualifier Value
VertexID Label Flag Exists Flag [empty]
VertexID INVERTEX OutVertexID_EdgeID Edge Label
VertexID OUTVERTEX InVertexID_EdgeID Edge Label
VertexID Property Key [empty] Serialized Value
###Edge Table
Row ID Column Family Column Qualifier Value
--- --- --- ---
EdgeID Label Flag InVertexID_OutVertexID Edge Label
EdgeID Property Key [empty] Serialized Value
###Edge/Vertex Index
Row ID Column Family Column Qualifier Value
--- --- --- ---
Serialized Value Property Key VertexID/EdgeID [empty]

###Metadata Table

Row ID Column Family Column Qualifier Value
Index Name Index Class [empty] [empty]
##Advanced Configuration
###Graph Configuration
  • setGraphName(String name)
  • setCreate(boolean create) - Sets if the backing graph tables should be created if they do not exist.
  • setClear(boolean clear) - Sets if the backing graph tables should be reset if they exist.
  • autoFlush(boolean autoFlush) - Sets if each graph element and property change will be flushed to the server.
  • skipExistenceChecks(boolean skip) - Sets if you want to skip existance checks when creating graph elemenets.
  • setAutoIndex(boolean ison) - Turns on/off automatic indexing.

###Accumulo Control

  • setUser(String user) - Sets the user to use when connecting to Accumulo
  • setPassword(byte[] password | String password) - Sets the password to use when connecting to Accumulo
  • setZookeeperHosts(String zookeeperHosts) - Sets the Zookeepers to connect to.
  • setInstanceName(String instance) - Sets the Instance name to use when connecting to Zookeeper
  • setInstanceType(InstanceType type) - Sets the type of Instance to use : Distrubuted, Mini, or Mock. Defaults to Distrubuted
  • setQueryThreads(int threads) - Specifies the number of threads to use in scanners. Defaults to 3
  • setMaxWriteLatency(long latency) - Sets the latency to be used for all writes to Accumulo
  • setMaxWriteTimeout(long timeout) - Sets the timeout to be used for all writes to Accumulo
  • setMaxWriteMemory(long mem) - Sets the memory buffer to be used for all writes to Accumulo
  • setMaxWriteThreads(int threads) - Sets the number of threads to be used for all writes to Accumulo
  • setAuthorizations(Authorizations auths) - Sets the authorizations to use when accessing the graph
  • setColumnVisibility(ColumnVisibility colVis) - TODO
  • setSplits(String splits | String[] splits) - Sets the splits to use when creating tables. Can be a space sperated list or an array of splits
  • setMiniClusterTempDir(String dir) - Sets directory to use as the temp directory for the Mini cluster

###Caching

  • setLruMaxCapacity(int max) - TODO
  • setVertexCacheTimeout(int millis) - Sets the vertex cache timeout. A value <=0 clears the value
  • setEdgeCacheTimeout(int millis) - Sets the edge cache timeout. A value <=0 clears the value

###Preloading

  • setPropertyCacheTimeout(int millis) - Sets the element property cache timeout. A value <=0 clears the value
  • setPreloadedProperties(String[] propertyKeys) - Sets the property keys that should be preloaded. Requiers a positive timout.
  • setPreloadedEdgeLabels(String[] edgeLabels) - TODO
Description
No description provided
Readme Apache-2.0 948 KiB
Latest
2015-02-20 12:32:04 -05:00
Languages
Java 100%