
Monday, May 6, 2019

A Guide to Apache Accumulo

1. Overview

In this tutorial, we will learn about Apache Accumulo and its API for processing large data sets as part of the Big Data ecosystem.

Apache Accumulo is based on the design of Google's Bigtable. It is written in Java and built on top of Hadoop, ZooKeeper and Apache Thrift. After Cassandra and HBase, it is one of the most widely used column-oriented NoSQL data stores. It is designed mainly for storing and processing structured data.

Accumulo provides an efficient storage model that is optimized for fast data retrieval.
Its tables can be queried with optional conditions, and they can also be passed as input to MapReduce jobs.



The latest version of Accumulo can be downloaded from the Apache Accumulo website.



2. Accumulo Design

Accumulo is built around key/value pairs, which makes it simple and straightforward to understand. Its data model is richer than that of a plain key-value store, but it is not a fully relational database model.

Data is represented as key-value pairs, where the key and value are comprised of the following elements:

[Figure: Data Model]


The following rules apply to all tables in Accumulo.

  • Accumulo stores data as key-value pairs.
  • Every key consists of a row ID, a column and a timestamp. Every column has a Family, a Qualifier and a Visibility. Google's original Bigtable does not have the Visibility element; it was introduced in Accumulo.
  • All elements of the Key and the Value are represented as byte arrays except for Timestamp, which is a Long.
  • Accumulo sorts keys by element and lexicographically in ascending order.
  • Timestamps are sorted in descending order so that the latest versions of the same Key appear first in a sequential scan.
  • Once data has been loaded, a table consists of a set of sorted key-value pairs.
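The sort order described above can be sketched in plain Java. This is only an illustrative model: `KeyOrderSketch` and its nested `SketchKey` are hypothetical names and not part of the Accumulo API, and strings are used in place of byte arrays (for ASCII text the ordering is the same).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class KeyOrderSketch {
    // Hypothetical stand-in for an Accumulo key; not org.apache.accumulo.core.data.Key.
    static final class SketchKey {
        final String row;
        final String family;
        final String qualifier;
        final String visibility;
        final long timestamp;

        SketchKey(String row, String family, String qualifier, String visibility, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.visibility = visibility;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + ":" + family + ":" + qualifier + "@" + timestamp;
        }
    }

    // Row, family, qualifier and visibility ascending; timestamp descending.
    static final Comparator<SketchKey> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.family.compareTo(b.family);
        if (c == 0) c = a.qualifier.compareTo(b.qualifier);
        if (c == 0) c = a.visibility.compareTo(b.visibility);
        if (c == 0) c = Long.compare(b.timestamp, a.timestamp); // newest version first
        return c;
    };

    // Sorts a few sample keys and returns their string forms in scan order.
    static List<String> sortedKeys() {
        List<SketchKey> keys = new ArrayList<>(Arrays.asList(
            new SketchKey("row2", "attr", "age", "public", 100L),
            new SketchKey("row1", "attr", "name", "public", 100L),
            new SketchKey("row1", "attr", "name", "public", 200L),
            new SketchKey("row1", "attr", "age", "public", 100L)));
        keys.sort(ORDER);
        List<String> out = new ArrayList<>();
        for (SketchKey k : keys) {
            out.add(k.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String k : sortedKeys()) {
            System.out.println(k);
        }
    }
}
```

In this model, the two versions of row1:attr:name end up next to each other with the newer timestamp first, which is what lets a sequential scan meet the latest version of a key before older ones.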

The Visibility element is vital: it controls which users can view and retrieve an entry through Scanners. You will see more about Scanners later in this post.
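To make the idea concrete, here is a deliberately simplified sketch of how a visibility expression might be checked against the labels a user holds. `VisibilitySketch` is a hypothetical class, not Accumulo's `ColumnVisibility`; it only understands a single label, labels joined by `&`, or labels joined by `|`, whereas real visibility expressions also allow parentheses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class VisibilitySketch {
    // Returns true if a user holding the given labels may see an entry
    // protected by the expression. Assumption: no parentheses and no
    // mixing of & and | in one expression.
    static boolean isVisible(String expression, Set<String> auths) {
        if (expression.isEmpty()) {
            return true; // an empty visibility is readable by everyone
        }
        boolean hasAnd = expression.contains("&");
        boolean hasOr = expression.contains("|");
        if (hasAnd && hasOr) {
            throw new IllegalArgumentException("mixed operators are not handled by this sketch");
        }
        if (hasAnd) {
            // every label is required
            for (String label : expression.split("&")) {
                if (!auths.contains(label)) {
                    return false;
                }
            }
            return true;
        }
        // a single label, or any one of several labels
        for (String label : expression.split("\\|")) {
            if (auths.contains(label)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> auths = new HashSet<>(Arrays.asList("public", "audit"));
        System.out.println(isVisible("public", auths));
        System.out.println(isVisible("admin&audit", auths));
        System.out.println(isVisible("admin|audit", auths));
    }
}
```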

All write operations are logged to a write-ahead log (WAL). The core configuration file of an instance is accumulo-site.xml.

3. Components

An instance of Accumulo includes many TabletServers, one Garbage Collector process, one Master server and many Clients. Let us have a look at these components.
  • Tablet Server
  • Garbage Collector
  • Master
  • Tracer
  • Monitor
  • Client

3.1 Tablet Server

Each TabletServer manages a subset of the tablets (partitions of tables). Its responsibilities include:

  • Receiving writes from clients and persisting them to a write-ahead log
  • Sorting new key-value pairs in memory
  • Periodically flushing sorted key-value pairs to new files in HDFS
  • Responding to reads from clients
  • Forming a merge-sorted view of all keys and values from all the files it has created and the sorted in-memory store
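The write path and the merge-sorted read view listed above can be modelled in a few lines of plain Java. This is a toy sketch under stated assumptions: `TabletSketch` is a hypothetical class, a "file" is just an in-memory `TreeMap` rather than a file in HDFS, and keys are plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class TabletSketch {
    // Sorted in-memory store for new key-value pairs.
    private TreeMap<String, String> memory = new TreeMap<>();
    // Immutable sorted "files", oldest first (stand-ins for files in HDFS).
    private final List<TreeMap<String, String>> files = new ArrayList<>();

    void write(String key, String value) {
        memory.put(key, value);
    }

    // Moves the sorted in-memory pairs into a new "file" (a minor compaction in Accumulo terms).
    void flush() {
        if (memory.isEmpty()) {
            return;
        }
        files.add(memory);
        memory = new TreeMap<>();
    }

    // Merge-sorted view over all files plus the in-memory store; newer data wins.
    TreeMap<String, String> scan() {
        TreeMap<String, String> view = new TreeMap<>();
        for (TreeMap<String, String> file : files) {
            view.putAll(file);
        }
        view.putAll(memory);
        return view;
    }

    int fileCount() {
        return files.size();
    }

    public static void main(String[] args) {
        TabletSketch tablet = new TabletSketch();
        tablet.write("b", "1");
        tablet.flush();
        tablet.write("a", "2");
        tablet.write("b", "3");
        System.out.println(tablet.scan());
    }
}
```

A reader of this model always sees one sorted view, regardless of whether a given pair currently lives in memory or in a flushed file.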

3.2 Garbage Collector

Periodically, the Garbage Collector will identify files that are no longer needed by any process, and delete them. Multiple garbage collectors can be run to provide hot-standby support.

3.3 Master

The Accumulo Master is responsible for detecting and responding to TabletServer failure.

3.4 Tracer

The Accumulo Tracer process supports the distributed timing API provided by Accumulo. One to many of these processes can be run on a cluster which will write the timing information to a given Accumulo table for future reference.

3.5 Monitor

The Accumulo Monitor is a web application that provides a wealth of information about the state of an instance.  Additionally, the Monitor should always be the first point of entry when attempting to debug an Accumulo problem as it will show high-level problems in addition to aggregated errors from all nodes in the cluster.

4. Fault-Tolerance

If a TabletServer fails, the Master detects it and automatically reassigns the tablets that were assigned to the failed server to other servers. Any key-value pairs that were in memory at the time the TabletServer failed are automatically reapplied from the write-ahead log (WAL) to prevent any loss of data.
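The recovery idea can be sketched in plain Java as follows. This is a simplified model, not Accumulo code: `WalRecoverySketch` is a hypothetical class, the log is an in-memory list standing in for a durable log, and the same object plays both the failed server and the server that takes over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class WalRecoverySketch {
    // One logged write; hypothetical record layout.
    static final class LogEntry {
        final String key;
        final String value;

        LogEntry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Stand-in for the durable write-ahead log.
    private final List<LogEntry> wal = new ArrayList<>();
    // Volatile in-memory store that is lost on failure.
    private TreeMap<String, String> memory = new TreeMap<>();

    void write(String key, String value) {
        wal.add(new LogEntry(key, value)); // log first, then apply in memory
        memory.put(key, value);
    }

    // Simulates a server failure: everything held in memory disappears.
    void crash() {
        memory = new TreeMap<>();
    }

    // Replays the log in order to rebuild the in-memory state.
    void recover() {
        for (LogEntry entry : wal) {
            memory.put(entry.key, entry.value);
        }
    }

    TreeMap<String, String> contents() {
        return new TreeMap<>(memory);
    }

    public static void main(String[] args) {
        WalRecoverySketch server = new WalRecoverySketch();
        server.write("row1", "v1");
        server.write("row1", "v2");
        server.crash();
        server.recover();
        System.out.println(server.contents());
    }
}
```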

5. Accumulo Shell

Accumulo provides a simple shell that can be used to verify the configuration and inspect tables. Table data can be updated or deleted from this shell, and configuration settings can be modified.

The shell can be started with the following command:

$ACCUMULO_HOME/bin/accumulo shell -u [username]


This command will prompt for the password of the specified user.

To see all the tables:

root@myinstance> tables
accumulo.metadata
accumulo.root

The tables command prints all tables available to the current user.

root@myinstance> createtable my-first-table

The createtable command creates a new table. If the command succeeds, no output is printed. Run the tables command again to check whether the new table was created.

root@myinstance my-first-table> tables
accumulo.metadata
accumulo.root
my-first-table

The newly created table now appears in the list.

The deletetable command deletes a table:

root@myinstance my-first-table> deletetable my-first-table

6. Running Client Code

The client code can be executed in several ways, including:
  • using java executable
  • using the accumulo script
  • using the tool script

To run our program, we must add the dependencies to the classpath. The required jar files are the Hadoop client jar, all of the jars in the Hadoop lib directory and conf directory, and the ZooKeeper jar.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.0</version>
</dependency>


<dependency>
    <groupId>org.apache.zookeeper</groupId>
    <artifactId>zookeeper</artifactId>
    <version>3.4.14</version>
</dependency>

The latest hadoop-client and zookeeper jar files can be downloaded from the Maven repository.

To see the accumulo classpath:

$ACCUMULO_HOME/bin/accumulo classpath

Once you have built your jar file, place it under $ACCUMULO_HOME/lib/ext.

To run the program:

$ACCUMULO_HOME/bin/accumulo com.foo.Client

To run a MapReduce program, use the tool.sh script:

$ACCUMULO_HOME/bin/tool.sh 

7. Connecting

The following code connects to an Accumulo instance through ZooKeeper. PasswordToken is an implementation of AuthenticationToken.

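import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

String instanceName = "myinstance";
String zooServers = "zooserver-one,zooserver-two"; // comma-separated ZooKeeper hosts
Instance inst = new ZooKeeperInstance(instanceName, zooServers);

Connector conn = inst.getConnector("user", new PasswordToken("passwd"));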

To read the password from a Java keystore, use CredentialProviderToken in place of PasswordToken.

8. Writing Data

To write data into Accumulo, we must create a Mutation object, which represents all the changes to the columns of a single row.

8.1 Mutation creation

Mutations can be created as shown below. The changes they hold are applied later by a TabletServer.

Text rowID = new Text("row1");
Text colFam = new Text("myColFam");
Text colQual = new Text("myColQual");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();

Value value = new Value("myValue".getBytes());

Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);

8.2 BatchWriter

We send these mutations to a BatchWriter, which submits them to the appropriate TabletServers.

The BatchWriter is highly optimized to send Mutations to multiple TabletServers. It automatically batches Mutations destined for the same TabletServer to reduce network overhead.
Take care not to change the contents of a Mutation after passing it to the BatchWriter, because the BatchWriter keeps the objects in memory while processing.

BatchWriterConfig config = new BatchWriterConfig();
config.setMaxMemory(10000000L); // bytes available to batchwriter for buffering mutations

BatchWriter writer = conn.createBatchWriter("table", config);

writer.addMutation(mutation);

writer.close(); // flushes any buffered mutations

8.3 ConditionalWriter

In some scenarios, we may want to apply a mutation only when a condition holds, rather than unconditionally. The ConditionalWriter enables efficient, atomic read-modify-write operations on rows. It writes special Mutations which have a list of per-column conditions that must all be met before the mutation is applied.
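The "all conditions must hold, then apply atomically" behaviour can be illustrated with a small plain-Java model. This is only a sketch of the concept: `ConditionalWriteSketch` and `writeIf` are hypothetical names, the model covers a single row held in memory, and it does not use the Accumulo ConditionalWriter API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class ConditionalWriteSketch {
    // Column name -> current value for one row.
    private final Map<String, String> row = new HashMap<>();

    // Applies all updates only if every condition matches the current value.
    // A null expected value means "the column must be absent".
    synchronized boolean writeIf(Map<String, String> conditions, Map<String, String> updates) {
        for (Map.Entry<String, String> condition : conditions.entrySet()) {
            String current = row.get(condition.getKey());
            if (!Objects.equals(current, condition.getValue())) {
                return false; // rejected: nothing is written
            }
        }
        row.putAll(updates);
        return true;
    }

    synchronized String get(String column) {
        return row.get(column);
    }

    public static void main(String[] args) {
        ConditionalWriteSketch account = new ConditionalWriteSketch();
        Map<String, String> absent = new HashMap<>();
        absent.put("balance", null);

        // Create the column only if it does not exist yet.
        System.out.println(account.writeIf(absent, Collections.singletonMap("balance", "100")));
        // Read-modify-write: succeeds only if the balance is still 100.
        System.out.println(account.writeIf(
            Collections.singletonMap("balance", "100"),
            Collections.singletonMap("balance", "90")));
        System.out.println(account.get("balance"));
    }
}
```

A caller whose condition is rejected would typically re-read the row and retry with fresh expected values.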


9. Reading Data

Accumulo is optimized to quickly retrieve the value associated with a given key, and to efficiently return ranges of consecutive keys and their associated values.

9.1 Scanner

A Scanner instance is used to retrieve key/value pairs from TabletServers. A Scanner acts as an Iterator over the key/value pairs.

A Scanner can be restricted to a range of keys and can return only the wanted columns instead of all columns in the row. This gives a simple way to access the values.

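import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

Authorizations auths = new Authorizations("public");

Scanner scan =
    conn.createScanner("table", auths);

scan.setRange(new Range("harry","john")); // rows from "harry" to "john"
scan.fetchColumnFamily(new Text("attributes"));

// Entry must be typed, otherwise getKey() returns a plain Object
for (Entry<Key,Value> entry : scan) {
    Text row = entry.getKey().getRow();
    Value value = entry.getValue();
}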
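Why a range scan over sorted keys is cheap can be seen in a small plain-Java model. This sketch makes several assumptions: `RangeScanSketch` is a hypothetical class, the table is a `TreeMap` from row to a map of column family to value, and both ends of the row range are treated as inclusive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RangeScanSketch {
    // Returns "row=value" for every row in [startRow, endRow] that has the given column family.
    static List<String> scan(TreeMap<String, Map<String, String>> table,
                             String startRow, String endRow, String family) {
        List<String> out = new ArrayList<>();
        // subMap seeks directly to the start row instead of reading the whole table
        for (Map.Entry<String, Map<String, String>> e
                : table.subMap(startRow, true, endRow, true).entrySet()) {
            String value = e.getValue().get(family);
            if (value != null) {
                out.add(e.getKey() + "=" + value);
            }
        }
        return out;
    }

    // Builds a tiny sample table; rows and values are made up for illustration.
    static TreeMap<String, Map<String, String>> sampleTable() {
        TreeMap<String, Map<String, String>> table = new TreeMap<>();
        table.put("bob", Collections.singletonMap("attributes", "b"));
        table.put("harry", Collections.singletonMap("attributes", "h"));
        table.put("ian", Collections.singletonMap("other", "i"));
        Map<String, String> john = new HashMap<>();
        john.put("attributes", "j");
        john.put("other", "x");
        table.put("john", john);
        table.put("zed", Collections.singletonMap("attributes", "z"));
        return table;
    }

    public static void main(String[] args) {
        System.out.println(scan(sampleTable(), "harry", "john", "attributes"));
    }
}
```

Rows outside the range are never visited, and rows inside it contribute only the requested column family.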

9.2 Isolated Scanner

Accumulo supports the ability to present an isolated view of rows when scanning. There are three possible ways that a row could change in Accumulo:

  • A mutation applied to a table
  • Iterators executed as part of a minor or major compaction
  • Bulk import of new files

For example if a mutation modifies three columns, it is possible that you will only see two of those modifications. With the isolated scanner either all three of the changes are seen or none.

Connector conn = opts.getConnector();
if (!conn.tableOperations().exists(opts.getTableName()))
  conn.tableOperations().create(opts.getTableName());

Thread writer = new Thread(new Writer(conn.createBatchWriter(opts.getTableName(), bwOpts.getBatchWriterConfig()), opts.iterations));
writer.start();
Reader r;
if (opts.isolated)
  r = new Reader(new IsolatedScanner(conn.createScanner(opts.getTableName(), opts.auths)));
else
  r = new Reader(conn.createScanner(opts.getTableName(), opts.auths));

9.3 BatchScanner

A BatchScanner is similar to a Scanner, but it can be configured to retrieve a subset of columns for multiple ranges at once. A BatchScanner accepts a set of Ranges.

Note:
Keys returned by a BatchScanner are not in sorted order, since the keys are streamed from multiple TabletServers in parallel.

List<Range> ranges = new ArrayList<>();
// populate list of ranges ...

BatchScanner bscan =
    conn.createBatchScanner("table", auths, 10); // 10 query threads
bscan.setRanges(ranges);
bscan.fetchColumnFamily(new Text("attributes"));

for (Entry<Key,Value> entry : bscan) {
    System.out.println(entry.getValue());
}
bscan.close(); // release the query threads

10. Conclusion

In this article, we have seen the Accumulo data model and its components.
We have also seen how data is written to Accumulo through TabletServers, and the different ways to store and retrieve data.
