This is part 4 of a 7-part report by HBase Contributor Jingcheng Du and HDFS Contributor Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Experimentation

Performance for Single Storage Type

First, we study HBase write performance on each single storage type (i.e. no mix of storage types). This test shows the maximum performance each storage type can achieve and provides a baseline for the hierarchical storage analysis that follows.

We study single-storage-type performance with two datasets: 50GB and 1TB of inserted data. We believe 1TB is a reasonable data size in practice, while 50GB is used to evaluate the upper bound of performance, since HBase typically performs better when the data size is small. The 50GB size was also chosen to avoid running out of space during the test; in particular, the limited capacity of RAMDISK forces us to use a small dataset when all data is stored in RAMDISK.
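As a rough illustration of how these dataset sizes map to YCSB parameters, the sketch below assumes YCSB's default record layout of 10 fields x 100 bytes (about 1KB per record); the record size used in the actual tests may differ.

```python
# Rough sketch: how many YCSB records correspond to a target dataset size.
# Assumes YCSB's default record layout (fieldcount=10, fieldlength=100, ~1KB/record);
# the record size used in the actual tests may differ.

RECORD_BYTES = 10 * 100  # YCSB defaults: 10 fields x 100 bytes

def records_for(target_bytes, record_bytes=RECORD_BYTES):
    """Number of records needed to insert roughly target_bytes of data."""
    return target_bytes // record_bytes

GB = 1024 ** 3
TB = 1024 ** 4

print(records_for(50 * GB))  # ~54 million records for the 50GB dataset
print(records_for(1 * TB))   # ~1.1 billion records for the 1TB dataset
```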

50GB Dataset in a Single Storage

The throughput and latency measured by YCSB for the 50GB dataset are shown in Figure 2 and Figure 3. As expected, storing 50GB of data in RAMDISK gives the best performance, whereas storing it on HDD gives the worst.

Figure 2. YCSB throughput of a single storage with 50GB dataset

Note: The performance data for SSD and HDD may be better than their real capability, because the OS can buffer/cache data in memory.

Figure 3. YCSB latency of a single storage with 50GB dataset

Note: The performance data for SSD and HDD may be better than their real capability, because the OS can buffer/cache data in memory.

In terms of throughput, RAMDISK and SSD are higher than HDD (163% and 128% of HDD throughput, respectively), while their average latencies are dramatically lower (only 11% and 32% of HDD latency). This is expected, as HDD has a long seek latency. The latency improvements of RAMDISK and SSD over HDD are larger than the throughput improvements, which is explained by the huge gap in raw access latency between the storage types: the latencies we measured for accessing 4KB of raw data from RAM, SSD and HDD are 89ns, 80µs and 6.7ms respectively (RAM and SSD are about 75000x and 84x faster than HDD).
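The speed-up factors quoted above follow directly from these raw access latencies; a minimal sketch of the arithmetic:

```python
# Speed-up of RAM and SSD over HDD, derived from the measured 4KB access latencies.
ram_latency = 89e-9   # 89 ns
ssd_latency = 80e-6   # 80 us
hdd_latency = 6.7e-3  # 6.7 ms

print(f"RAM vs HDD: {hdd_latency / ram_latency:,.0f}x faster")  # ~75,000x
print(f"SSD vs HDD: {hdd_latency / ssd_latency:,.0f}x faster")  # ~84x
```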

Now consider the hardware data (disk, network and CPU on each DataNode) collected during the tests.

Disk Throughput

| Case | Theoretical throughput (MB/s) | Measured throughput (MB/s) | Utilization |
| --- | --- | --- | --- |
| 50G_HDD | 127 x 8 = 1016 | 870 | 86% |
| 50G_SSD | 447 x 8 = 3576 | 1300 | 36% |
| 50G_RAM | >11194 | 1800 | <16% |

Table 10. Disk utilization of test cases

Note: The data listed in the table for 50G_HDD and 50G_SSD is the total throughput of the 8 disks in each DataNode.

Disk throughput is determined by factors such as access pattern, block size and I/O depth. The theoretical disk throughput is measured under settings that are favorable for these factors; in real workloads they are less favorable and limit the achievable throughput to lower values. In fact, we observe that the disk utilization usually goes up to 100% for HDD.
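The utilization column in Table 10 is simply the measured throughput divided by the theoretical aggregate throughput of the disks in one DataNode; a minimal sketch of that calculation, using the per-disk figures quoted above:

```python
# Disk utilization as in Table 10: measured throughput / theoretical throughput.
cases = {
    "50G_HDD": {"theoretical": 127 * 8, "measured": 870},   # 8 HDDs per node
    "50G_SSD": {"theoretical": 447 * 8, "measured": 1300},  # 8 SSDs per node
    "50G_RAM": {"theoretical": 11194,   "measured": 1800},  # lower bound, hence "<16%"
}

for name, c in cases.items():
    print(f"{name}: {c['measured'] / c['theoretical']:.0%}")  # 86%, 36%, 16%
```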

Network Throughput

Each DataNode is connected by a 20Gbps (2560MB/s) full-duplex network, which means the receive and transmit speeds can both reach 2560MB/s simultaneously. In our tests, the receive and transmit throughput are almost identical, so only the receive throughput is listed in Table 11.

| Case | Theoretical throughput (MB/s) | Measured throughput (MB/s) | Utilization |
| --- | --- | --- | --- |
| 50G_HDD | 2560 | 760 | 30% |
| 50G_SSD | 2560 | 1100 | 43% |
| 50G_RAM | 2560 | 1400 | 55% |

Table 11. Network utilization of test cases
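The theoretical 2560MB/s figure and the utilization numbers in Table 11 come from the same kind of arithmetic, with the 20Gbps link converted to MB/s using 1024-based units:

```python
# Network utilization as in Table 11.
link_gbps = 20
theoretical_mb_s = link_gbps * 1024 / 8  # 2560 MB/s

measured = {"50G_HDD": 760, "50G_SSD": 1100, "50G_RAM": 1400}
for name, mb_s in measured.items():
    print(f"{name}: {mb_s / theoretical_mb_s:.0%}")  # 30%, 43%, 55%
```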

CPU Utilization

| Case | CPU utilization |
| --- | --- |
| 50G_HDD | 36% |
| 50G_SSD | 55% |
| 50G_RAM | 60% |

Table 12. CPU utilization of test cases

Clearly, the performance on HDD (50G_HDD) is bottlenecked by disk throughput. However, for SSD and RAMDISK, neither the disk, the network, nor the CPU is the bottleneck, so the bottleneck must be somewhere else: either in software (e.g. HBase compaction, the memstores in regions, etc.) or in hardware components other than disk, network and CPU.

To further understand why SSD utilization is so low and its throughput is not as high as expected (only 132% of HDD, even though SSD has a much higher theoretical throughput of 447MB/s vs. 127MB/s per disk), we ran an additional test that writes 50GB of data on four SSDs per node instead of eight.

Figure 4. YCSB throughput of a single storage type with 50 GB dataset stored in different number of SSDs

Figure 5. YCSB latency of a single storage type with 50 GB dataset stored in different number of SSDs

We can see that although the number of SSDs is doubled, the throughput and latency of the eight-SSDs-per-node case (50G_8SSD) improve only modestly over the four-SSDs-per-node case (50G_4SSD): 104% of its throughput and 79% of its latency. This means that the SSDs are far from fully utilized.

Digging further, we found that this is caused by current mainstream server hardware design. A mainstream server today has two SATA controllers that together support up to 12Gbps of bandwidth, which means the total disk bandwidth is limited to around 1.5GB/s, roughly the bandwidth of 3.4 SSDs. You can find the details in the section Disk Bandwidth Limitation. The SATA controllers therefore become the bottleneck for eight SSDs per node, which explains why there is almost no improvement in YCSB throughput when going from four to eight SSDs per node: in the 50G_8SSD test, the disk subsystem is still the bottleneck.
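A minimal sketch of the arithmetic behind the 3.4-SSD figure, using the 12Gbps controller limit and the 447MB/s per-SSD throughput quoted above:

```python
# SATA controller bandwidth vs. aggregate SSD throughput (figures from the text).
controller_gbps = 12                          # two SATA controllers, ~12 Gbps total
controller_mb_s = controller_gbps * 1000 / 8  # ~1500 MB/s, i.e. ~1.5 GB/s

ssd_mb_s = 447  # theoretical throughput of one SSD
print(f"{controller_mb_s / ssd_mb_s:.1f} SSDs saturate the controllers")  # ~3.4

# With eight SSDs per node, the controllers cap the aggregate disk bandwidth well
# below 8 x 447 MB/s, so adding SSDs beyond ~3-4 yields little extra throughput.
```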

Go to part 5, Experiment (continued)