HBase on Azure: Import/Export snapshots from/to ADLS
by Apekshit Sharma, HBase Committer.
Overview
Azure Data Lake Store (ADLS) is Microsoft’s cloud alternative to Apache HDFS. In this blog, we’ll see how to use it as a backup destination for snapshots of Apache HBase tables. You can export snapshots to ADLS for safekeeping; for recovery, import a snapshot back to HDFS and use it to clone or restore the table. In this post, we’ll go over the configuration changes needed to make the HDFS client talk to ADLS, and the commands to copy HBase table snapshots from HDFS to ADLS and vice versa.
Introduction
“The Azure Data Lake store is an Apache Hadoop file system compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem.”
ADLS can be treated like any other HDFS service, except that it lives in the cloud. But then how do applications talk to it? That’s where the hadoop-azure-datalake module comes into the picture. It enables an HDFS client to talk to ADLS whenever the following access path syntax is used:

adl://<account name>.azuredatalakestore.net/<path>

For example: adl://appy.azuredatalakestore.net/hbase
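Once authentication is configured (next section), the usual HDFS client commands work unchanged against such paths. A minimal illustration, using the appy account from the examples later in this post:

$ hdfs dfs -ls adl://appy.azuredatalakestore.net/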
However, before it can access any data in ADLS, the module needs to be able to authenticate to Azure. That requires a few configuration changes, which we describe in the next section.
Configuration changes
ADLS requires an OAuth2 bearer token to be present in each request’s HTTPS headers. Users who have access to an ADLS account can obtain this token from the Azure Active Directory (Azure AD) service. To allow an HDFS client to authenticate to ADLS and access data, you’ll need to specify these tokens in core-site.xml using four dfs.adls.oauth2.* configurations.
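Here is a sketch of those four properties, assuming the ClientCredential (service principal) OAuth2 flow; the client id, secret, and tenant id placeholders must be replaced with your own application’s values from Azure AD:

<!-- core-site.xml -->
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>YOUR_CLIENT_ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>YOUR_CLIENT_SECRET</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token</value>
</property>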
To find the values for the dfs.adls.oauth2.* configurations, refer to Microsoft’s documentation on Azure Active Directory service-to-service authentication.
Since all files and folders in ADLS are owned by the account owner, its ACL model doesn’t mesh well with that of HDFS, where there can be many users. Because the user issuing commands through the HDFS client will differ from the owner recorded in Azure AD, any operation that checks ACLs will fail. To work around this issue, use the following configuration, which tells the HDFS client to assume that the current user owns all files when talking to ADLS.
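A sketch of that setting; the adl.debug.override.localuserasfileowner property below is the switch the hadoop-azure-datalake module provides for this purpose:

<property>
  <name>adl.debug.override.localuserasfileowner</name>
  <value>true</value>
</property>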
Make sure to deploy the above configuration changes to the cluster.
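How you deploy depends on your distribution and management tooling; as a minimal sketch, assuming a plain Apache Hadoop install with configuration under /etc/hadoop/conf and hypothetical worker hosts node1 through node3:

# push the updated core-site.xml to every node
$ for host in node1 node2 node3; do scp /etc/hadoop/conf/core-site.xml $host:/etc/hadoop/conf/; done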
Export snapshot to ADLS
Here are the steps to export a snapshot from HDFS to ADLS.
- Create a new directory in ADLS to store snapshots, then list it to verify:

$ hdfs dfs -mkdir adl://appy.azuredatalakestore.net/hbase
$ hdfs dfs -ls adl://appy.azuredatalakestore.net/
Found 1 items
drwxr-xr-x   - systest hdfs          0 2017-03-21 23:43 adl://appy.azuredatalakestore.net/hbase
- Create the snapshot. To learn more about this feature and how to create/list/restore snapshots, refer to the HBase Snapshots section in the HBase reference guide; a minimal shell example follows.
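This sketch assumes the table t and snapshot name snapshot_1 that appear in the examples below; substitute your own names:

hbase> snapshot 't', 'snapshot_1'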
- Export the snapshot to ADLS:

$ sudo -u hbase hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot_1 -copy-to adl://appy.azuredatalakestore.net/hbase

[Output]
17/03/21 23:50:24 INFO snapshot.ExportSnapshot: Copy Snapshot Manifest
…
…
17/03/21 23:50:48 INFO snapshot.ExportSnapshot: Export Completed: snapshot_1
- Verify that the snapshot was copied to ADLS:

$ hbase snapshotinfo -snapshot snapshot_1 -remote-dir adl://appy.azuredatalakestore.net/hbase
Snapshot Info
----------------------------------------
   Name: snapshot_1
   Type: FLUSH
  Table: t
 Format: 2
Created: 2017-03-21T23:42:56
- It’s now safe to delete the local snapshot (the one in HDFS), as shown below.
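A minimal sketch of the cleanup, from the HBase shell:

hbase> delete_snapshot 'snapshot_1'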
Restore/Clone table from a snapshot in ADLS
If you have a snapshot in ADLS which you want to use, either to restore the original table to a previous state or to create a new table by cloning, follow the steps below.
- Copy the snapshot back from ADLS to HDFS. Make sure to copy it into the ‘hbase’ directory on HDFS, because that’s where the HBase service looks for snapshots.

$ sudo -u hbase hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot_1 -copy-from adl://appy.azuredatalakestore.net/hbase -copy-to hdfs:///hbase
- Verify that the snapshot exists in HDFS. (Note that there is no -remote-dir parameter.)
$ hbase snapshotinfo -snapshot snapshot_1
Snapshot Info
----------------------------------------
   Name: snapshot_1
   Type: FLUSH
  Table: t
 Format: 2
Created: 2017-03-21T23:42:56
- Follow the instructions in the HBase Snapshots section of the HBase reference guide to restore or clone from the snapshot; a sketch follows.
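For illustration, from the HBase shell: restoring overwrites the existing table t (which must be disabled first), while cloning creates a new table; the name t2 here is hypothetical:

hbase> disable 't'
hbase> restore_snapshot 'snapshot_1'
hbase> enable 't'
hbase> clone_snapshot 'snapshot_1', 't2'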
Summary
The Azure module in HDFS makes it easy to interact with ADLS. We can keep using the commands we already know, and applications that use the HDFS client need only a few configuration changes. What a seamless integration! In this blog, we got a glimpse of HBase’s integration with Azure: using ADLS as a backup for storing snapshots. Let’s see what the future has in store for us. Maybe an HBase cluster fully backed by ADLS!