Avoiding Boilerplate with PureConfig

PureConfig is a Scala library that avoids boilerplate when loading configurations, a task that has always been error-prone and monotonous. The common approach is to deserialize each field of the configuration by hand, which is cumbersome: the more fields there are, the more code has to be written, tested, and maintained. This kind of code is boilerplate because, most of the time, it can be generated automatically by the compiler based on what must be loaded.
PureConfig lets us separate what to load from how it is loaded.

Now, let's look at Typesafe Config code and PureConfig code to see the difference between the two:

TypeSafe code:
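The snippet in the original post was an image; a minimal sketch of the manual style it showed, with hypothetical keys and field names, might look like this:

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical settings class; each field is read and converted by hand.
case class HttpSettings(host: String, port: Int)

val config: Config = ConfigFactory.load()

// Every key path is repeated as a string, and every type is chosen manually.
val http = HttpSettings(
  config.getString("app.http.host"),
  config.getInt("app.http.port")
)
```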

PureConfig code:
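Again, the original snippet was an image; a sketch of the equivalent PureConfig version, with the same hypothetical names and assuming the pre-0.12 loadConfig API the post refers to, could be:

```scala
import pureconfig._
import pureconfig.generic.auto._ // on PureConfig 0.10+; older versions derive readers without it

// The same hypothetical settings, split into small case classes.
case class HttpSettings(host: String, port: Int)
case class AppConfig(http: HttpSettings)

// No Config in sight: PureConfig derives the reader from the case classes
// and returns Either[ConfigReaderFailures, AppConfig].
val appConfig: Either[error.ConfigReaderFailures, AppConfig] = loadConfig[AppConfig]
```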

There's no longer any reference to a Config, no repetition of names, and no tedious work when adding new fields and defining how they should be loaded. As an added benefit, the configuration has been split into multiple classes, so different parts of our application can cleanly depend on subsets of the configuration by requiring the appropriate settings type.

We can now simply add a new field with an appropriate type and be sure that PureConfig generates, at compile time, whatever is necessary to load it. As you can see, loadConfig gives us back an Either instance, which means it accounts for possible errors during loading, errors the first version simply ignored.
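Because loadConfig returns an Either, a bad configuration can be handled with an ordinary pattern match instead of an exception; a small sketch with a hypothetical AppConfig:

```scala
import pureconfig._
import pureconfig.generic.auto._

case class AppConfig(appName: String)

loadConfig[AppConfig] match {
  case Right(cfg)     => println(s"loaded configuration for ${cfg.appName}")
  case Left(failures) => failures.toList.foreach(f => println(f.description))
}
```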

Resolving the Failure Issue of NameNode

In the previous blog, “Smattering of HDFS“, we learnt that the NameNode is a Single Point of Failure for the HDFS cluster. Each cluster had a single NameNode, and if that machine became unavailable, the whole cluster would be unavailable until the NameNode was restarted or brought up on a different machine. In this blog, we will learn about resolving the failure issue of the NameNode.

Issues that arise when the NameNode fails or crashes
The HDFS metadata, such as namespace information and block information, needs to be kept in main memory while in use, but must be written to disk for persistent storage. The NameNode maintains two types of information:
1. in-memory fsimage – the latest, up-to-date snapshot of the Hadoop filesystem namespace.
2. editLogs – the sequence of changes made to the filesystem after the NameNode started.
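Conceptually, the current namespace is the result of replaying the editLog on top of the last fsimage snapshot; a toy Scala model of that idea (not actual Hadoop code):

```scala
// Toy model of the fsimage/editLog relationship, not actual Hadoop code.
sealed trait Edit
case class Create(path: String) extends Edit
case class Delete(path: String) extends Edit

// Replaying the edit sequence over the snapshot rebuilds the current namespace.
def replay(fsimage: Set[String], editLog: List[Edit]): Set[String] =
  editLog.foldLeft(fsimage) {
    case (namespace, Create(p)) => namespace + p
    case (namespace, Delete(p)) => namespace - p
  }
```

This is why losing the editLog is as damaging as losing the fsimage: without the edits, only the stale snapshot can be recovered.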

The total availability of the HDFS cluster is reduced in two major ways:
1. If the NameNode machine crashes, the cluster is unavailable until the machine is restarted.
2. Planned maintenance on the NameNode machine also brings cluster downtime.

Standby NameNode – the solution to NameNode failure
The HDFS High Availability feature makes it possible to run two NameNodes in the same cluster in an active-passive architecture: if the active NameNode goes down, the passive NameNode, also known as the Standby NameNode, takes over within a few seconds. At any point in time, one of the NameNodes is in an Active state and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a slave, maintaining enough state to provide a fast failover if necessary.

For namespace information backup, the fsImage is stored along with the editLog. The editLog is like the journal ledger of the NameNode: from it, the in-memory fsImage can be reconstructed. This is why the editLog needs to be backed up.

In the second-generation (Hadoop 2.x) architecture, the Quorum Journal Manager (QJM) provides this backup: a set of at least three machines, known as JournalNodes, on which the editLogs are stored. To minimize the time needed to bring up the passive NameNode when the active one crashes, the standby machine is pre-configured and ready to take over the NameNode role.
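As a sketch, the pieces described above are wired together in hdfs-site.xml roughly like this (the nameservice ID and hostnames are placeholders):

```xml
<!-- One logical nameservice served by two NameNodes, nn1 and nn2 -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<!-- Both NameNodes read and write edits through three JournalNodes -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
```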

The Standby NameNode keeps reading the editLogs from the JournalNodes and keeps itself up to date, which makes it ready to take over the active role in case of failure. All the DataNodes are configured to send their block reports to both NameNodes. Thus, when the active NameNode fails, the Standby NameNode becomes active within a short time.

Smattering Of HDFS

Hadoop is an open-source framework for storing and processing big data in a distributed environment across clusters of computers. HDFS, its file system, has many similarities with existing distributed file systems, but the differences are significant: it provides high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing pools of big data and supporting big-data analytics applications, and it is the primary storage system used by Hadoop applications.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
HDFS uses a master/slave architecture in which the master is a single NameNode that manages the file system metadata, and the slaves are one or more DataNodes that store the actual data.

What are NameNodes and DataNodes?

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. The NameNode is a Single Point of Failure for the HDFS Cluster. When the NameNode goes down, the file system goes offline.

The DataNode is responsible for storing the files in HDFS. It manages the file blocks within the node. It sends information to the NameNode about the files and blocks stored in that node and responds to the NameNode for all filesystem operations. A functional filesystem has more than one DataNode, with data replicated across them.

Within HDFS, a given name node manages file system namespace operations like opening, closing, and renaming files and directories. A name node also maps data blocks to data nodes, which handle read and write requests from HDFS clients. Data nodes also create, delete, and replicate data blocks according to instructions from the governing name node.
HDFS consists of interconnected clusters of nodes where files and directories reside. An HDFS cluster has a NameNode that manages the file system namespace and regulates client access to files, while data nodes (DataNodes) store data as blocks within files.
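From a client's point of view, this division of labour is invisible; a sketch using the Hadoop FileSystem API from Scala (the hostname and paths are hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Namespace operations (create, rename) go to the NameNode over RPC,
// while the file's block data is streamed to/from DataNodes directly.
def uploadAndRename(): Unit = {
  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020") // hypothetical host
  val fs = FileSystem.get(conf)
  fs.copyFromLocalFile(new Path("report.txt"), new Path("/data/report.txt"))
  fs.rename(new Path("/data/report.txt"), new Path("/data/report-final.txt"))
}
```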

The main goals of HDFS are:
  • Fault detection and recovery: detection of faults and quick, automatic recovery from them is a core architectural goal.
  • Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.
  • Simple coherency model: HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends.
  • Large data sets: since HDFS is tuned to support large files, it should support tens of millions of files in a single instance.