Data Engineering

Hive to HBase Migration
Siva Kumar Reddy
March 24, 2023

This blog describes how a client, a large transaction processing company handling a high volume of transactions from many issuers, needed to process those transactions much more frequently. The client stored data on HDFS and accessed it through Hive tables, a setup optimized for batch processing of large data volumes and poorly suited to real-time or interactive workloads. To solve this problem, the client migrated from Hive to HBase, a NoSQL database optimized for real-time access and designed to handle large volumes of structured and unstructured data.

The client is one of the largest transaction processing companies, handling a high volume of transactions from different issuers. The client runs loyalty programs that offer rewards to customers, using a scoring system that assigns points or rewards based on the volume and frequency of transactions. For example, a customer who uses their credit card for frequent purchases at a specific retailer may earn more points than one who uses the card only occasionally. The ultimate goal of the scoring system and loyalty program is to encourage customer loyalty and increase the volume of transactions the company processes: by offering rewards and incentives, issuers can build a more loyal customer base and increase revenue.

The client receives data such as user details, account details, and household details. This data is stored on HDFS in Parquet format and accessed through Hive tables. Since Parquet files cannot be queried directly, Hive provides an SQL-like abstraction on top of the HDFS files to ease querying and processing. Hive uses its metastore for table definitions and YARN to run its queries.
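For example, a batch job can read one of these Hive tables through Spark SQL. The database, table, and column names below are illustrative, not the client's actual schema:

import org.apache.spark.sql.SparkSession

//Enable Hive support so Spark can resolve tables from the Hive metastore
val spark = SparkSession.builder()
  .appName("daily-scoring-job")
  .enableHiveSupport()
  .getOrCreate()

//Query a Hive table whose data lives as Parquet files on HDFS
val transactions = spark.sql("select account_id, amount, txn_ts from transactions_db.transactions")
transactions.show(10)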

The client processes transactions once every 24 hours: a job runs daily for scoring and exporting. We receive approximately 500 million transactions a day. As long as data is processed only once every 24 hours, Hive poses no problem.

Hive is a data warehousing system built on top of HDFS and designed to process large volumes of data. It is optimized for batch processing and is not well suited for real-time or interactive workloads. The main reason Hive cannot support frequent processing is precisely this batch orientation: Hive processes data in a distributed, parallel manner using MapReduce, which involves reading data from disk, processing it in memory, and writing the results back to disk, so a single query can take minutes to hours to complete.

Coming to the problem statement:

We now need to process transactions from different issuers very frequently, i.e., batch-process transactions every second.

Example:

If we consider 500 million transactions per 24 hours:

500,000,000 transactions / 86,400 seconds ≈ 5,800 transactions per second.

At this rate, HDFS access is too slow for fast reads and processing, so we switched to HBase, which provides much faster access and processing.

Hive to HBase Migration:

In this project, the data resides on HDFS, and we have to migrate from Hive to HBase for low-latency data access.

HBase is a NoSQL database designed to handle large volumes of structured and unstructured data in real time. It is optimized for random reads and writes and is typically used in applications that require low-latency data access, such as real-time analytics. HBase is designed for real-time access to data and handles frequent, fast data updates well.

HBase uses a distributed architecture that allows it to scale horizontally, and it is designed to handle very large tables with billions of rows and millions of columns. HBase also supports automatic sharding and replication, which help ensure high availability and performance.
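To make the access pattern concrete, here is a minimal sketch of a point lookup by row key with the standard HBase client API. The table, column family, and qualifier names are assumptions for illustration:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

//Connect to HBase using the cluster configuration on the classpath
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("accounts"))

//Fetch a single record directly by its row key (no scan, no MapReduce job)
val result = table.get(new Get(Bytes.toBytes("account-12345")))
val balance = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("balance")))

table.close()
connection.close()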

Why is HBase efficient for certain use cases?

1. Low latency: HBase is optimized for low-latency data access, which makes it ideal for real-time applications.

2. Random access: HBase is designed for random access to data, meaning it can quickly retrieve an individual record or a subset of records based on its key (row key).

3. Data model: HBase is a column-family-based NoSQL database that allows for flexible schema design. It can be more efficient for certain types of data, such as nested or hierarchical data structures.

4. Data volume: HBase can handle large volumes of data, so it is better suited to applications that require real-time access to large datasets.

5. Fault tolerance: HBase provides built-in fault tolerance features such as automatic replication of data across multiple nodes, which ensures data availability and integrity even in the event of hardware or network failures.

6. Integration with other big data tools: HBase is part of the Hadoop ecosystem and integrates easily with other big data tools such as Spark and Kafka. This allows seamless data processing and analysis in big data streaming projects.

7. High scalability: HBase is designed to scale horizontally and can handle petabytes of data, making it ideal for big data streaming projects.

In this project, the migration involved two things:

1. Streaming data is currently written to HDFS, so the existing scripts were changed to write streaming data to both HDFS and HBase. Before writing to HBase, the tables must be created with the required column families (a sketch follows this list).

2. To migrate the existing data on HDFS, DataFrames were created over the data to be migrated by fetching it from Hive tables using Spark SQL. The resulting DataFrames were written to HBase tables using the HBase API.
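As mentioned in step 1, the target tables with their column families must exist before the streaming job writes to them. Below is a minimal sketch using the HBase Admin API; the table and column family names are assumptions:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val admin = connection.getAdmin

//Describe the table with the column family the writers will use
val tableDescriptor = TableDescriptorBuilder
  .newBuilder(TableName.valueOf("accounts"))
  .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
  .build()

//Create the table only if it does not already exist
if (!admin.tableExists(tableDescriptor.getTableName)) {
  admin.createTable(tableDescriptor)
}

admin.close()
connection.close()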

Pseudo Code to create DataFrame from Hive Tables and Write to HBase:

//Create DataFrame from Hive Table:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

def getDataFrame(spark: SparkSession, hiveDbName: String, hiveTableName: String): DataFrame = {
  spark.sql(s"select * from ${hiveDbName}.${hiveTableName}")
}

//Write to HBase:

def writeToHBase(config: Configurations, dataFrame: DataFrame): Unit = {
  dataFrame.foreachPartition { partition: Iterator[Row] =>
    //Establish connection to HBase (once per partition, not per row)
    val connectionPool = HBaseConnectionPool(config.hbaseSchemaName, config.zookeeperQuorum, config.zookeeperPort)
    val connection = connectionPool.connection

    //Get a handle to the HBase table
    val hbaseTable = connection.getTable(TableName.valueOf(config.hbaseSchemaName))
    val columnFamily = Bytes.toBytes(config.columnFamily) //column family name assumed to come from config

    //Write each row to the HBase table
    partition.foreach { row =>
      val tablePut = new Put(Bytes.toBytes(row.getAs[String]("rowkey")))
      if (!row.isNullAt(row.fieldIndex("col1")))
        tablePut.addColumn(columnFamily, Bytes.toBytes("col1"), Bytes.toBytes(row.getAs[String]("col1")))
      hbaseTable.put(tablePut)
    }

    //Close the table and release the connection after the whole partition is written
    hbaseTable.close()
    connectionPool.release()
  }
}
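Putting the two functions together, a migration driver could look like the sketch below. The database and table names are placeholders, and loadConfig is a hypothetical helper for the project-specific Configurations:

val spark = SparkSession.builder().appName("hive-to-hbase-migration").enableHiveSupport().getOrCreate()
val config: Configurations = loadConfig() //hypothetical helper; construction of Configurations is project-specific

//Migrate one table: read from Hive, write to HBase
val accountsDf = getDataFrame(spark, "warehouse_db", "accounts")
writeToHBase(config, accountsDf)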

Total number of records moved from Hive to HBase in the Stage environment:

Accounts table: 15,562,769

User table: 153,323,060

Household table: 22,556,384

Time taken for migration: < 33 minutes

HBase is efficient for these use cases because it is optimized for low-latency, random access and can handle large volumes of data, while remaining fault-tolerant, highly scalable, and well integrated with other big data tools. For the migration, the client did two things: changed the streaming pipeline to write to both HDFS and HBase, and migrated the existing data on HDFS by building DataFrames from the Hive tables with Spark SQL and writing them to HBase tables through the HBase API, as illustrated in the pseudo code above.

We hope you found our blog post informative. If you have any project inquiries or would like to discuss your data and analytics needs, please don't hesitate to contact us at info@predera.com. We're here to help! Thank you for reading.