This is part 6 of a 7-part report by HBase contributor Jingcheng Du and HDFS contributor Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel).

Issues and Improvements

This section describes the issues we found during the tests and the improvements we suggest.

Software

Long-time BLOCKED threads in DataNode

In the 1T_RAM_HDD test, the YCSB client shows a long period of zero throughput. After examining the thread stacks, we found that many DataXceiver threads in the DataNode stay in the BLOCKED state for a long time. The same behavior appears in other cases too, but it is most frequent in 1T_RAM_HDD.

Each DataNode has a single instance of FsDatasetImpl, which has many synchronized methods. DataXceiver threads synchronize on this instance when creating or finalizing blocks, so a slow create or finalize operation in one DataXceiver thread can block the same operations in all other DataXceiver threads. The following table shows the time consumed by these operations in 1T_RAM_HDD:

Synchronized methods	Max exec time (ms) in light load	Max wait time (ms) in light load	Max exec time (ms) in heavy load	Max wait time (ms) in heavy load
finalizeBlock	0.0493	0.1397	17652.9761	21249.0423
addBlock	0.0089	0.4238	15685.7575	57420.0587
createRbw	0.0480	1.7422	21145.8033	56918.6570

Table 13. DataXceiver threads and time consumed

Both the execution time and the wait time of the synchronized methods increase dramatically as the system load increases. The time spent waiting for locks can reach tens of seconds. Slow operations usually come from slow storage such as HDD. Because all volumes share the same lock, they also delay concurrent create and finalize operations on fast storage, so HDFS/HBase cannot make full use of tiered storage.

A finer-grained lock mechanism in the DataNode is needed to fix this issue. We are working on this improvement now (HDFS-9668).
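The effect of lock granularity can be illustrated with a small sketch. It is not the DataNode code and not the design of HDFS-9668; the class names `CoarseDataset` and `PerVolumeDataset` and the helper `can_write_while_busy` are hypothetical, and per-volume locking is only one assumed way to make the lock finer grained.

```python
import threading
from collections import defaultdict


class CoarseDataset:
    """One lock guards every volume, like a single synchronized object."""

    def __init__(self):
        self._lock = threading.Lock()

    def lock_for(self, volume):
        # Every volume maps to the same lock.
        return self._lock


class PerVolumeDataset:
    """Hypothetical finer-grained variant: one lock per volume."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def lock_for(self, volume):
        with self._guard:
            return self._locks[volume]


def can_write_while_busy(dataset, busy_volume, other_volume):
    """Return True if other_volume can be locked while busy_volume is held."""
    with dataset.lock_for(busy_volume):
        other = dataset.lock_for(other_volume)
        acquired = other.acquire(blocking=False)
        if acquired:
            other.release()
        return acquired


# A slow HDD operation holds the lock; can a RAMDISK operation proceed?
print(can_write_while_busy(CoarseDataset(), "hdd0", "ram0"))     # False
print(can_write_while_busy(PerVolumeDataset(), "hdd0", "ram0"))  # True
```

With the single lock, work on the fast volume has to wait for the slow one. With one lock per volume, only operations on the same volume contend with each other.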

Load Imbalance in HDFS with Tiered Storage

In the tiered storage cases, we found that utilization is not the same among volumes of the same storage type when the RoundRobinVolumeChoosingPolicy is used. The root cause is that RoundRobinVolumeChoosingPolicy uses one shared counter to choose volumes for all storage types. A request for one storage type advances the counter that the next request for another storage type will use, so the choice within a given storage type is no longer a true round robin, and the volumes at the tail of the configured data directories have a lower chance of being written.

The situation becomes even worse when there are different numbers of volumes of different storage types. We have filed this issue in JIRA (HDFS-9608) and provided a patch.
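The imbalance can be reproduced with a simplified model. The sketch below is not the HDFS source: `SharedCounterPolicy` imitates a round-robin chooser with one counter for all storage types, and `PerTypeCounterPolicy` is an assumed alternative that keeps one counter per storage type (not necessarily what the HDFS-9608 patch does). The volume names and the strictly alternating request pattern are made up for illustration.

```python
from collections import Counter


class SharedCounterPolicy:
    """Round robin with a single counter shared by all storage types."""

    def __init__(self):
        self.cur = 0

    def choose(self, storage_type, volumes):
        if self.cur >= len(volumes):
            self.cur = 0
        chosen = volumes[self.cur]
        self.cur = (self.cur + 1) % len(volumes)
        return chosen


class PerTypeCounterPolicy:
    """Hypothetical variant: an independent counter per storage type."""

    def __init__(self):
        self.cur = {}

    def choose(self, storage_type, volumes):
        index = self.cur.get(storage_type, 0) % len(volumes)
        self.cur[storage_type] = index + 1
        return volumes[index]


# Made-up layout: two SSD volumes and four HDD volumes on one DataNode.
VOLUMES = {
    "SSD": ["ssd0", "ssd1"],
    "DISK": ["hdd0", "hdd1", "hdd2", "hdd3"],
}


def simulate(policy, rounds=100):
    """Alternate SSD and DISK requests and count writes per volume."""
    counts = Counter()
    for _ in range(rounds):
        for storage_type in ("SSD", "DISK"):
            counts[policy.choose(storage_type, VOLUMES[storage_type])] += 1
    return counts


shared = simulate(SharedCounterPolicy())
per_type = simulate(PerTypeCounterPolicy())
print(sorted(shared.items()))    # [('hdd1', 100), ('ssd0', 100)]
print(sorted(per_type.items()))
```

In this extreme pattern the shared counter sends every SSD write to `ssd0` and every HDD write to `hdd1`, while per-type counters spread the writes evenly. Real workloads are less regular, so the skew is milder, but the mechanism is the same.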

Asynchronous File Movement Across Storage When Renaming in HDFS

Currently, data blocks in HDFS are placed on storage media according to the storage policy in effect when they are created. After that, the blocks stay where they are until Mover, an external HDFS tool, is run. Mover scans the whole namespace and moves the data blocks that are not stored on the storage media their policy specifies.

In tiered storage, when a file or directory is renamed from a location with one storage policy to a location with a different one, the blocks of that file, or of all files under that directory, have to be moved to the right storage. HDFS does not currently do this as part of the rename.

Non-configurable Storage Type and Policy

Currently in HDFS, both the storage types and the storage policies are predefined in the source code. This makes it inconvenient to add support for new devices and new policies. It would be better to make them configurable.

No Optimization for Certain Storage Type

Currently the execution path is the same for all storage types. As more high-performance storage devices are adopted, the performance gap between storage types will grow, and optimizations for specific storage types will be needed.

Take writing a given amount of data into HDFS as an example. If users want to minimize the total write time, the best approach for HDD may be to compress the data to save disk I/O, while for RAMDISK writing directly is more suitable because it avoids the compression overhead. This scenario requires per-storage-type configuration, which the current implementation does not support.
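A rough estimate shows why the best choice differs by storage type. The sketch below uses the single-disk write bandwidths from Table 14; the compressor throughput (500 MB/s), the compression ratio (2x), and the serial compress-then-write model are assumptions made only for illustration, and `write_seconds` is a hypothetical helper.

```python
def write_seconds(size_mb, write_bw_mb_s, compress_mb_s=None, ratio=1.0):
    """Estimate the time to write size_mb, optionally compressing first.

    Assumes compression and writing happen one after the other.
    """
    if compress_mb_s is None:
        return size_mb / write_bw_mb_s
    compress_time = size_mb / compress_mb_s
    write_time = (size_mb / ratio) / write_bw_mb_s
    return compress_time + write_time


SIZE_MB = 10_000        # amount of data to write (assumed)
COMPRESS_MB_S = 500.0   # compressor throughput (assumed)
RATIO = 2.0             # compression ratio (assumed)

# Single-disk write bandwidth in MB/s; RAMDISK uses its lower bound.
WRITE_BW = {"HDD": 127.0, "RAMDISK": 11194.0}

for name, bw in WRITE_BW.items():
    raw = write_seconds(SIZE_MB, bw)
    packed = write_seconds(SIZE_MB, bw, COMPRESS_MB_S, RATIO)
    best = "compress" if packed < raw else "write directly"
    print(f"{name}: raw {raw:.1f} s, compressed {packed:.1f} s -> {best}")
```

Under these assumptions compression pays off on HDD, where disk I/O dominates, and costs time on RAMDISK, where the compressor becomes the bottleneck.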

Hardware

Disk Bandwidth Limitation

In the section 50GB Dataset in a Single Storage, the performance difference between four SSDs and eight SSDs is very small. The root cause is that the total bandwidth available to the eight SSDs is limited by the upper-level hardware controllers. Figure 16 illustrates the motherboard design. The eight disk slots connect to two different SATA controllers (Ports 0:3 - SATA and Port 0:3 - sSATA). As highlighted in the red rectangle, the maximum bandwidth available for the two controllers in the server is 2*6 Gbps = 1536 MB/s.

Figure 16. Hardware design of server motherboard

The maximum throughput of a single disk was measured with the FIO tool.

	Read BW (MB/s)	Write BW (MB/s)
HDD	140	127
SSD	481	447
RAMDISK	>12059	>11194

Table 14. Maximum throughput of storage media

Note: RAMDISK is essentially memory and does not go through the same controllers as the SSDs and HDDs, so the 2*6 Gbps limitation does not apply to it. The RAMDISK figures are listed in the table for ease of comparison.

According to Table 14, the aggregate write bandwidth of eight SSDs is 447 x 8 = 3576 MB/s. This exceeds the controllers’ 1536 MB/s physical limit, so only 1536 MB/s is available to all eight SSDs. The HDDs are not affected by this limit because their total bandwidth (127 x 8 = 1016 MB/s) is below it. This limit greatly impacts the performance of the storage system.
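The arithmetic above can be checked with a short script. It is a simplified model that treats the two controllers as one shared 12 Gbps budget and converts with 1 Gbps = 1024 Mbps, which is the conversion that reproduces the 1536 MB/s figure used here; the helper names are hypothetical.

```python
CONTROLLERS = 2
LINK_GBPS = 6

# 2 * 6 Gbps converted to MB/s (1 Gbps = 1024 Mbps, 8 bits per byte).
CONTROLLER_LIMIT_MB_S = CONTROLLERS * LINK_GBPS * 1024 / 8

# Single-disk write bandwidth in MB/s, from Table 14.
WRITE_BW_MB_S = {"HDD": 127, "SSD": 447}


def aggregate_mb_s(disk_type, disks):
    """Bandwidth the disks could deliver without any controller limit."""
    return WRITE_BW_MB_S[disk_type] * disks


def usable_mb_s(disk_type, disks):
    """Bandwidth left once the controller limit is applied."""
    return min(aggregate_mb_s(disk_type, disks), CONTROLLER_LIMIT_MB_S)


print(CONTROLLER_LIMIT_MB_S)     # 1536.0
print(aggregate_mb_s("SSD", 8))  # 3576
print(usable_mb_s("SSD", 8))     # 1536.0
print(usable_mb_s("HDD", 8))     # 1016
```

Under this model, eight SSDs can use less than half of their combined write bandwidth, while eight HDDs stay below the limit.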

We suggest one of the following:

Enhance the hardware by adding more HBA cards to remove this design limitation.
Use SSDs and HDDs together in an appropriate ratio (for example, four SSDs and four HDDs) to achieve a better balance between performance and cost.

Disk I/O Bandwidth and Latency Varies for Ports

As described in the section Disk Bandwidth Limitation, four of the eight disks connect to Ports 0:3 - SATA and the other four connect to Port 0:3 - sSATA, and the total bandwidth of the two controllers is 12 Gbps. We found that this bandwidth is not evenly divided among the disk channels.

To test this, we wrote to each SSD with its own FIO process (sda, sdb, sdc and sdd connect to Port 0:3 - sSATA; sde, sdf, sdg and sdh connect to Ports 0:3 - SATA). We expected all eight disks to be written at the same speed, but according to the IOSTAT output the 1536 MB/s bandwidth is not evenly divided between the two controllers and the eight disk channels. As shown in Figure 17, the four SSDs connected to Ports 0:3 - SATA obtain more I/O bandwidth (213.5 MB/s * 4) than the others (107 MB/s * 4).

We suggest that you consider the controller limitation and the storage bandwidth when setting up a cluster. Using four SSDs and four HDDs in a node is a reasonable choice, and it is better to install the four SSDs on Ports 0:3 - SATA.

Additionally, with the existing VolumeChoosingPolicy, a disk with higher latency might receive the same workload as disks with lower latency, which slows down overall performance. We suggest implementing a latency-aware VolumeChoosingPolicy in HDFS.
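One possible shape for such a policy is sketched below. It is not an existing HDFS class: the function names are hypothetical, the await values are invented (they are not the measurements in Figure 17), and weighting each volume by the inverse of its observed await time is just one assumed strategy.

```python
import random


def latency_weights(await_ms):
    """Give each volume a weight proportional to 1 / await time."""
    inverse = {volume: 1.0 / ms for volume, ms in await_ms.items()}
    total = sum(inverse.values())
    return {volume: w / total for volume, w in inverse.items()}


def choose_volume(await_ms, rng):
    """Pick a volume at random, favoring volumes with lower latency."""
    weights = latency_weights(await_ms)
    volumes = list(weights)
    return rng.choices(volumes, weights=[weights[v] for v in volumes], k=1)[0]


# Invented await times in ms: sda is twice as slow as sde.
observed_await_ms = {"sda": 20.0, "sde": 10.0}

weights = latency_weights(observed_await_ms)
print(weights)  # sde gets about twice the share of sda

rng = random.Random(0)  # seeded so repeated runs pick the same volumes
print(choose_volume(observed_await_ms, rng))
```

A real policy would also need to refresh the latency measurements over time and still respect free space and storage type, which this sketch leaves out.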

Figure 17. Different write speed and await time of disks

Go to part 7, Conclusions