HDFS-14859 Bug Reproduction

Introduction: HDFS is the distributed file system of Apache Hadoop, modeled on the Google File System. When HDFS starts up, it enters "safe mode," a read-only state in which it does not accept requests that modify the file system. To exit safe mode, Hadoop requires two conditions to be met: 1) a sufficient fraction of blocks has reached minimal replication, and 2) enough datanodes have come online.

Video of description: https://youtu.be/BjCnDy8Jp58

Video of code walkthrough: https://youtu.be/mqoahSxyZMM

Link to issue: https://issues.apache.org/jira/browse/HDFS-14859

Issue: What is the performance issue? Hadoop always checks both conditions. Even when the first condition (blocks) is not satisfied, Hadoop still goes on to check the second condition (datanodes), although the outcome is already decided: safe mode cannot be exited. The second check can therefore be skipped whenever the first condition is not met.

Impact: Checking the exit conditions becomes very time-consuming when the cluster has a large number of datanodes. This overhead can significantly slow down the startup process and delay the cluster from becoming operational.

The Fix: Addressing this issue is straightforward. When the first condition (the block condition) is not met, Hadoop can skip counting datanodes altogether. That is, datanodes are counted only when the first condition is met. Avoiding the unnecessary datanode counting can greatly improve the performance of the safe mode exit check.
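The idea behind the fix can be illustrated with a minimal Java sketch. This is not Hadoop's actual code: the class SafeModeSketch, its methods, and the boolean array standing in for the cluster's datanodes are all hypothetical names chosen for illustration. The sketch only assumes that counting live datanodes is the expensive step and uses a call counter to show when that step runs.

```java
import java.util.Arrays;

/** Hypothetical sketch of the two exit checks; names are illustrative, not Hadoop's API. */
final class SafeModeSketch {
    /** Counts how often the (assumed expensive) datanode count was computed. */
    static int datanodeCountCalls = 0;

    /** Stand-in for counting live datanodes; cost grows with cluster size. */
    static int countLiveDatanodes(boolean[] alive) {
        datanodeCountCalls++;
        int live = 0;
        for (boolean a : alive) {
            if (a) {
                live++;
            }
        }
        return live;
    }

    /** Before the fix: both conditions are always evaluated. */
    static boolean canLeaveEager(long safeBlocks, long blockThreshold,
                                 boolean[] alive, int datanodeThreshold) {
        boolean blocksOk = safeBlocks >= blockThreshold;
        boolean nodesOk = countLiveDatanodes(alive) >= datanodeThreshold;
        return blocksOk && nodesOk;
    }

    /** After the fix: datanodes are counted only if the block condition holds. */
    static boolean canLeaveShortCircuit(long safeBlocks, long blockThreshold,
                                        boolean[] alive, int datanodeThreshold) {
        if (safeBlocks < blockThreshold) {
            return false; // block condition not met, so skip the datanode count
        }
        return countLiveDatanodes(alive) >= datanodeThreshold;
    }

    public static void main(String[] args) {
        boolean[] alive = new boolean[1000];
        Arrays.fill(alive, true);

        // Block condition fails in both runs: 5 safe blocks, 10 required.
        datanodeCountCalls = 0;
        boolean eager = canLeaveEager(5, 10, alive, 500);
        System.out.println("eager: result=" + eager
                + ", datanode counts=" + datanodeCountCalls);

        datanodeCountCalls = 0;
        boolean lazy = canLeaveShortCircuit(5, 10, alive, 500);
        System.out.println("short-circuit: result=" + lazy
                + ", datanode counts=" + datanodeCountCalls);
    }
}
```

Both variants return the same answer for every input. They differ only in whether the datanode count runs while the block condition is still unmet, which is the state a starting cluster is in while blocks are still being reported.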

May 13, 2024, 8:20 PM
