This is part 5 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

1TB Dataset in a Single Storage

The performance of the 1TB dataset on HDD and SSD is shown in Figure 6 and Figure 7. Due to the limited memory capacity, the 1TB dataset is not tested in RAMDISK.

Figure 6. YCSB throughput of a single storage type with 1TB dataset

Figure 7. YCSB latency of a single storage type with 1TB dataset

Both throughput and latency on SSD are better than on HDD (134% of the throughput and 35% of the latency). This is consistent with the 50GB dataset test.

The throughput benefit gained by using SSD differs between the 50GB and 1TB datasets (128% vs. 134%); SSD gains more in the 1TB test. This is because many more I/O intensive events, such as compactions, occur in the 1TB dataset test than in the 50GB test, which shows the advantage of SSD in large data scenarios. Figure 8 shows how the network throughput changes during the tests.

Figure 8. Network throughput measured for case 1T_HDD and 1T_SSD

In the 1T_HDD case the network throughput is lower than 10Gbps, while in the 1T_SSD case the network throughput can be much larger than 10Gbps. This means that if we used a 10Gbps switch in the 1T_SSD case, the network would become the bottleneck.

Figure 9. Disk throughput measured for case 1T_HDD and 1T_SSD

In Figure 9, we can see that the bottleneck for these two cases is the disk bandwidth.

  • In the 1T_HDD case, the throughput is almost 1000 MB/s at the beginning of the test, but after a while it drops because regions hit the memstore limit, which is caused by slow flushes.

  • In the 1T_SSD case, the throughput seems to be capped at around 1300 MB/s, nearly the same as the bandwidth limit of the SATA controllers. To further improve the throughput, more SATA controllers (e.g. an HBA card) are needed rather than more SSDs.

During the 1T_SSD test, we observe that the operation latencies of the eight SSDs per node are very different, as shown in the following chart. In Figure 10 we only include the latency of two disks: sdb represents the disks with high latency and sdf represents the disks with low latency.

Figure 10. I/O await time measured for different disks

Four of them have better latency than the others. This is caused by a hardware design issue; you can find the details in Disk I/O Bandwidth and Latency Varies for Ports. With the existing VolumeChoosingPolicy, a disk with higher latency may be given the same workload as the disks with lower latency, which slows down the overall performance. We suggest implementing a latency-aware VolumeChoosingPolicy in HDFS, as sketched below.
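As an illustration only, here is a minimal sketch of what such a policy could look like. It assumes the Hadoop 2.x VolumeChoosingPolicy interface (the exact method signature varies across Hadoop versions), and the latency-reporting part (reportLatency) is hypothetical: the await times would have to be fed in by a separate monitor, e.g. parsed from iostat.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeSpi;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.VolumeChoosingPolicy;
import org.apache.hadoop.util.DiskChecker.DiskOutOfSpaceException;

/**
 * Sketch of a latency-aware volume choosing policy: among the volumes that
 * have enough space for the replica, pick the one with the lowest recently
 * observed I/O latency.
 */
public class LatencyAwareVolumeChoosingPolicy<V extends FsVolumeSpi>
    implements VolumeChoosingPolicy<V> {

  // Hypothetical store of the latest average await time (ms) per volume.
  private final Map<V, Double> observedLatencyMs = new ConcurrentHashMap<>();

  /** Called by an external monitor that tracks per-disk await times. */
  public void reportLatency(V volume, double awaitMs) {
    observedLatencyMs.put(volume, awaitMs);
  }

  @Override
  public V chooseVolume(List<V> volumes, long replicaSize) throws IOException {
    V best = null;
    double bestLatency = Double.MAX_VALUE;
    for (V volume : volumes) {
      if (volume.getAvailable() < replicaSize) {
        continue; // not enough free space on this volume
      }
      double latency = observedLatencyMs.getOrDefault(volume, 0.0);
      if (latency < bestLatency) {
        bestLatency = latency;
        best = volume;
      }
    }
    if (best == null) {
      throw new DiskOutOfSpaceException(
          "No volume has enough space for a replica of size " + replicaSize);
    }
    return best;
  }
}
```

Such a policy could then be plugged in via dfs.datanode.fsdataset.volume.choosing.policy, the same configuration key used to select the built-in AvailableSpaceVolumeChoosingPolicy.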

Performance Estimation for RAMDISK with 1TB Dataset

We cannot measure the performance of RAMDISK with the 1TB dataset due to the limited capacity of RAMDISK. Instead we have to estimate its performance by analyzing the results of the HDD and SSD cases.

The performance of the 1TB and 50GB datasets is quite close for both HDD and SSD.

The throughput difference between the 50GB and 1TB datasets for HDD is

|242801/250034 - 1| × 100% = 2.89%

While for SSD the value is

|325148/320616 - 1| × 100% = 1.41%

If we take the average of these values as the throughput difference for RAMDISK between the 50GB and 1TB datasets, it is around 2.15% ((2.89% + 1.41%)/2 = 2.15%); thus the estimated throughput for RAMDISK with the 1TB dataset is

406577×(1+2.15%)=415318 (ops/sec)

Figure 11.  YCSB throughput estimation for RAMDISK with 1TB dataset

Please note: the throughput doesn’t drop much in the 1TB dataset cases compared to the 50GB dataset cases because they do not use the same number of pre-split regions. The table is pre-split into 18 regions in the 50GB dataset cases and into 210 regions in the 1TB dataset cases (see the pre-split sketch below).
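For reference, pre-splitting is done at table creation time. The following is a minimal sketch using the HBase 1.x Java admin API; the table name follows the YCSB default (usertable), while the column family name and key range are assumptions and must match how the YCSB workload is actually configured.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("usertable"));
      // Column family name must match the YCSB configuration.
      desc.addFamily(new HColumnDescriptor("family"));
      // Pre-split into 210 regions over an assumed key range; the real split
      // points depend on how the YCSB row keys are distributed.
      byte[] startKey = Bytes.toBytes("user0000000000");
      byte[] endKey   = Bytes.toBytes("user9999999999");
      admin.createTable(desc, startKey, endKey, 210);
    }
  }
}
```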

Performance for Tiered Storage

In this section, we study the HBase write performance on tiered storage (i.e. different storage types mixed together in one test). This shows what performance can be achieved by mixing fast and slow storage, and helps us find the best balance between performance and cost.

Figure 12 and Figure 13 show the performance for tiered storage. You can find the description of each case in Table 1.

Most of the cases that introduce fast storage have better throughput and latency. Unsurprisingly, 1T_RAM_SSD has the best performance among them. The real surprise is that, after introducing RAMDISK, the throughput of 1T_RAM_HDD is worse than 1T_HDD (-11%) and 1T_RAM_SSD_All_HDD is worse than 1T_SSD_All_HDD (-2%); also, 1T_SSD is worse than 1T_SSD_HDD (-2%).

Figure 12.  YCSB throughput data for tiered storage

Figure 13.  YCSB latency data for tiered storage

We also investigate how much data is written to different storage types by collecting information from one DataNode.

Figure 14. Distribution of data blocks on each storage of HDFS in one DataNode

As shown in Figure 14, in general more data is written to disks for the test cases with higher throughput. Fast storage accelerates flushes and compactions, which leads to more flushes and compactions and thus more data written to disks. In some RAMDISK-related cases, only the WAL can be written to RAMDISK, and there are 1216 GB of WALs written to one DataNode.

For the tests without SSD (1T_HDD and 1T_RAM_HDD), we purposely limit the number of flush and compaction actions by using fewer flushers and compactors. This is due to the limited IOPS capability of HDD: too many concurrent reads and writes hurt HDD performance, which eventually slows down the overall performance. A sketch of the relevant configuration is shown below.
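As an illustration only, the flusher and compactor thread counts can be tuned through standard HBase configuration properties; the values below are placeholders, not the ones used in this report.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushCompactionTuning {
  public static void main(String[] args) {
    // These properties would normally be set in hbase-site.xml on the
    // RegionServers; they are shown programmatically here for brevity.
    Configuration conf = HBaseConfiguration.create();

    // Number of memstore flush threads per RegionServer.
    conf.setInt("hbase.hstore.flusher.count", 1);

    // Number of threads for small and large compactions per RegionServer.
    conf.setInt("hbase.regionserver.thread.compaction.small", 1);
    conf.setInt("hbase.regionserver.thread.compaction.large", 1);
  }
}
```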

In 1T_RAM_HDD, many DataNode threads can be BLOCKED, sometimes for tens of seconds. We observe this in the other cases as well, but it happens most often in 1T_RAM_HDD. This is because each DataNode holds one big lock when creating/finalizing HDFS blocks, and these methods can sometimes take tens of seconds (see Long-time BLOCKED threads in DataNode); the more these methods are used (in HBase they are used by the flusher, the compactor, and the WAL), the more often threads are BLOCKED. Writing the WAL in HBase needs to create/finalize blocks, which can be blocked, and consequently user writes are blocked. Multiple WALs with a large number of groups, or a WAL per region, might also encounter this problem, especially on HDD.

With the written data distribution in mind, let’s look back at the performance results in Figure 12 and Figure 13. From these, we make the following observations:

  1. Mixing SSD and HDD can greatly improve the performance compared to pure HDD (136% throughput and 35% latency). But fully replacing HDD with SSD shows no improvement over mixing SSD/HDD (98% throughput and 99% latency). This is because the hardware design cannot evenly split the I/O bandwidth across all eight disks, and 94% of the data is written to SSD while only 6% is written to HDD in the SSD/HDD mixing case. This strongly hints that a mixed use of SSD/HDD can achieve the best balance between performance and cost (see the storage-policy sketch after this list). More information is in Disk Bandwidth Limitation and Disk I/O Bandwidth and Latency Varies for Ports.

  2. Including RAMDISK in the SSD/HDD tiered storage gives different results for 1T_RAM_SSD_All_HDD and 1T_RAM_SSD_HDD. The case 1T_RAM_SSD_HDD shows the result when only a small amount of data is written to HDD, which improves performance over the SSD/HDD mixing cases. The result of 1T_RAM_SSD_All_HDD, where a large amount of data is written to HDD, is worse than the SSD/HDD mixing cases. This means that if we distribute the data appropriately between SSD and HDD in HBase, we can achieve good performance when mixing RAMDISK/SSD/HDD.

  3. The RAMDISK/SSD tiered storage is the winner in both throughput and latency (109% throughput and 67% latency of the pure SSD case). So, if cost is not an issue and maximum performance is needed, RAMDISK/SSD should be chosen.
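For reference, this kind of SSD/HDD tiering in HDFS is typically expressed through storage policies. The following is a minimal sketch; it assumes the DataNode data directories are already tagged with [SSD] and [DISK] storage types in dfs.datanode.data.dir, that fs.defaultFS points to HDFS, and that /hbase is the HBase root directory, which may differ in an actual deployment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SetTieredStoragePolicy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at the HDFS cluster, so the cast succeeds.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    // ONE_SSD keeps one replica on SSD and the others on HDD;
    // ALL_SSD would place every replica on SSD.
    dfs.setStoragePolicy(new Path("/hbase"), "ONE_SSD");
  }
}
```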

Comparing 1T_RAM_HDD to 1T_HDD, the throughput decreases by 11%. This is firstly because 1T_RAM_HDD uses RAMDISK, which consumes part of the RAM and leaves the OS buffer cache less memory to cache data.

Further, with 1T_RAM_HDD the YCSB client can push data at very high speed, so cells accumulate in the memstore very quickly while flushes and compactions on HDD are slow; RegionTooBusyException occurs more often (the figure below shows a much larger memstore in 1T_RAM_HDD than in 1T_HDD), and we observe much longer GC pauses in 1T_RAM_HDD than in 1T_HDD, up to 20 seconds per minute.

Figure 15. Memstore size in 1T_RAM_HDD and 1T_HDD

Finally, when we try to increase the number of flushers and compactors, the performance gets even worse, for the reasons mentioned above when explaining why we use fewer flushers and compactors in the HDD-related tests (see Long-time BLOCKED threads in DataNode).

The performance reduction of 1T_RAM_SSD_All_HDD compared to 1T_SSD_All_HDD (-2%) is due to the same reasons mentioned above.

We suggest:

  1. Use reasonable configurations for flusher and compactor, especially in HDD-related cases.

  2. Don’t use the storage that has large performance gaps, such as directly mixing RAMDISK and HDD together.

  3. In many cases, we observe long GC pauses of around 10 seconds per minute. An off-heap memstore needs to be implemented in HBase to solve the long GC pause issue.

  4. Implement a finer grained lock mechanism in DataNode.

Go to part 6, Issues