Big Data Storage Technology used by IT Giants

TechBoutique Official
Oct 15, 2020 · 12 min read


Ever wondered how Facebook, Twitter, or any other social media platform never loses your data, be it a post you made a day ago or an age-old childhood pic? With hundreds of millions of users, how do these social media giants manage to keep your data intact? Is it the high-quality hardware they use, or is there something else at play behind the curtains?

No doubt they use better-quality hardware, but that’s not all they do; after all, no hardware is failure-proof. They rely on an intelligent storage approach called distributed storage, and in particular on HDFS clusters built on top of it. Today we’ll take a peek into the world of HDFS, experiment with some of its concepts, and bust a few myths along the way. So, buckle up.

Task Description:

  • Configuring the Name Node, Data Node and Client.
  • Uploading a file through the Client.
  • Finding which Data Node is chosen for storing data by tracking the data packet transfer.
  • Finding out what happens when a Data Node goes down.

Prerequisites: Java and Hadoop should be pre-installed on the instances, and all the following steps require root login. Follow the article below to install and configure your instances:

https://medium.com/@techboutique.official/world-of-big-data-with-hadoop-af48acc16a90

What is a Name Node?

The Name Node, also known as the Master, is the centrepiece of HDFS. It stores the metadata, i.e. the list of blocks for any given file in HDFS, their locations, and the locations of their replicas.

Name Node(NN) Configuration

To configure the Name Node we primarily need to work on two files, core-site.xml and hdfs-site.xml, present in the directory /etc/hadoop/ created during the installation of the Hadoop package.
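
Before editing anything, it’s worth quickly confirming that both files are actually present (a minimal check; the exact configuration directory can differ between Hadoop packages):

#confirm the configuration files exist
- ls /etc/hadoop/core-site.xml /etc/hadoop/hdfs-site.xml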

Updating core-site.xml

The core-site.xml file tells the Hadoop services/daemons where the Name Node is located in the HDFS cluster. It contains the configuration settings for Hadoop Core.

To specify the location of the Name Node inside core-site.xml we have to follow the XML format and add the following lines between the configuration tags:

#xml code inside core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:9001</value>
</property>

Here,

  • fs represents the file system
  • name represents the Name Node
  • hdfs is the protocol name/type
  • 0.0.0.0 acts as a universal IP for the Name Node (it listens on all interfaces)
  • 9001 is the port on which HDFS operates

The name and value tags act like a key-value pair. The format inside the value tag is:

<protocol_name>://<master’s_IP>:<port_number>

Updating hdfs-site.xml

The hdfs-site.xml file contains the configuration settings for the HDFS services/daemons, mainly the location of the directory reserved for the HDFS cluster.

To specify the location of the virtual storage inside hdfs-site.xml we have to add the following lines between the configuration tags:

<property>
<name>dfs.name.dir</name>
<value>/nn</value>
</property>

Here,

  • dfs represents the distributed file system
  • name represents the Name Node
  • dir represents directory
  • /nn is the directory where the Name Node keeps the metadata for the cluster’s virtual storage

Creating a directory and formatting it

Remember the directory location that we provided above in the hdfs-site.xml file? We also need to actually create it in the root of the main drive, i.e. ‘/’.

#command to create directory in root
- cd /
- mkdir /nn

Once the directory is created we also need to format it.

#To format hadoop directory
hadoop namenode -format

It’ll prompt for confirmation. Press ‘Y’ and hit ‘Enter’.

Troubleshooting: If there’s an error while formatting the directory, remove it (using rm -rf /nn), recreate it, and format it again.

Starting the Name Node

Command to start the Name Node daemon:

#To start a new daemon
hadoop-daemon.sh start namenode

Now our Name Node is active and we can check its report using:

#To check hadoop status
hadoop dfsadmin -report

Since we haven’t connected any Data Nodes yet, no nodes will be displayed. So let’s hop on to the next part, i.e. the configuration of the Data Node. But first, let’s quickly understand what it actually is.

What is a Data Node?

A Data Node, also known as a slave, is the system where the data actually gets stored in the HDFS cluster. Data Nodes are responsible for block creation, deletion and replication, and also for transferring data to the client, as instructed by the Name Node.

Data Node(DN) Configuration

Here too we have to update the same two files, i.e. core-site.xml and hdfs-site.xml.

Updating core-site.xml

This configuration is almost the same as the Name Node’s core-site.xml, except for the IP in the value tag: in place of 0.0.0.0, we have to provide the public IP of the Name Node.
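
For illustration, assuming the Name Node’s public IP were 13.233.10.25 (a placeholder; use your own Name Node’s IP), the Data Node’s core-site.xml entry would look something like this:

#xml code inside the Data Node's core-site.xml (placeholder IP)
<property>
<name>fs.default.name</name>
<value>hdfs://13.233.10.25:9001</value>
</property>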

Updating hdfs-site.xml

The changes go something like this; we have to add the following lines between the configuration tags:

#Updating hdfs-site.xml
<property>
<name>dfs.data.dir</name>
<value>/dn1</value>
</property>

Creating directory

Also, we have to create the directory we mentioned in hdfs-site.xml in the root of the main drive, as shown below.
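
A minimal sketch, assuming the same /dn1 directory configured in hdfs-site.xml above:

#command to create the Data Node directory in root
- mkdir /dn1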

You don’t need to format the Data Node’s directory.

Starting the Data Node

Command to start the Data Node daemon:

#command to start Datanode
- hadoop-daemon.sh start datanode

We can use the jps command to check the status of the Data Node daemon:

#command to check status
- jps

What is a Client?

Any system using the services provided by the HDFS cluster is known as a Client. Even the Name Node or any of the Data Nodes can act as a client.

When the client issues a command to upload a file, the Name Node selects the Data Nodes where the blocks are to be stored and sends their IPs to the client. From there, the data is distributed among the Data Nodes.

Client System Configuration

Here we need to work on only one file, i.e. core-site.xml.

Updating core-site.xml

We don’t need to configure the hdfs-site.xml file because the client is not contributing any storage; it is only going to use the cluster’s storage, and for that we just need to tell the client the address of the Name Node.

This configuration is exactly the same as the Data Node’s core-site.xml. Here, too, we have to provide the public IP of the Name Node.

In this version of Hadoop the block size is 64 MiB and the replication factor is three by default. If you want to provide custom values, you can do so by updating the client’s hdfs-site.xml file.

To provide custom value for block size:

#providing custom value for block size
<property>
<name>dfs.block.size</name>
<value>1024</value>
</property>

Here,

  • block.size represents the size of a block
  • 1024 is the block size in bytes

To provide custom value for replication factor:

#providing custom value for replication
<property>
<name>dfs.replication</name>
<value>4</value>
</property>
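
As an aside, the replication factor of a file that is already stored in HDFS can also be changed from the command line with the setrep subcommand (here /myfile is a placeholder path; -w waits until the new replication level is reached):

#set the replication factor of an existing file to 2
- hadoop fs -setrep -w 2 /myfile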

Finally, you are done with the configuration and ready to use Big Data storage of your own!

Now that we have successfully configured the Hadoop cluster consisting of a Name Node, a Data Node and a client, we can start using the cluster for storing and retrieving data. Our main focus here will be to find out which slave/Data Node the client chooses to store the file that we upload into the cluster. But before getting into the task, let’s discuss some of the important concepts.

What happens when you upload a file?

Whenever you upload a file to the Hadoop Distributed File System, the file gets divided into fixed-size blocks, and these blocks of data are then stored on various slave nodes. Splitting data into blocks is normal in almost all file systems. By default the blocks are 64 MB in Hadoop 1 and 128 MB in Hadoop 2, which means every block obtained after dividing a file is 64 MB or 128 MB, except possibly the last one. You can manually change the block size in the hdfs-site.xml file, as we saw above.

Let’s understand this concept of breaking a file into blocks with an example. Suppose you upload a 400 MB file to HDFS with a 128 MB block size. The file gets divided as 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, which means 4 blocks are created, each of 128 MB except the last one.
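
The same arithmetic can be checked quickly in the shell (purely illustrative; sizes are in MB):

#number of 128MB blocks needed for a 400MB file, and the size of the last block
- echo $(( (400 + 127) / 128 ))    #prints 4
- echo $(( 400 % 128 ))            #prints 16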

Why are these blocks so large?

There are mainly two reasons, as discussed below:

  • Hadoop file blocks are large because if the blocks were smaller, there would be an enormous number of blocks in the Hadoop File System (HDFS). Keeping track of the metadata for such a huge number of small blocks becomes messy and puts extra load on the Name Node and the network.
  • Blocks are made large to minimize the cost of seeking: with small blocks, the time taken to seek to the start of a block can become comparable to, or even greater than, the time taken to actually transfer its data from disk.

Before uploading our file, let’s check whether any files are already present in the cluster. To do that we will use the following command:

#cmd to check files in the cluster
- hadoop fs -ls /

After checking the existing files, we will upload our file into the cluster using the following command:

#command to upload a file
- hadoop fs -put <filename> /

Uploading the file will take some time, depending on the network speed of the machine you are uploading from.

If you want to read the contents of the file, use the following command:

#command to fetch file
- hadoop fs -cat /<filename>

NOTE: you can run the hadoop fs command on its own to list all the Hadoop filesystem commands and their usage.

While the file is being uploaded into the cluster, you need to keep tracking the packets arriving at your data/slave nodes to know which node the file is being uploaded to. To trace the incoming packets on the data/slave nodes we use “tcpdump”.

What is tcpdump?

Tcpdump is a command-line utility that allows you to capture and analyze network traffic going through your system. It is a powerful and versatile tool with many options and filters, so it can be used in a variety of cases. Since it’s a command-line tool, it is ideal for running on remote servers or devices without a GUI, to collect data that can be analyzed later.

To distinguish the packets coming from the client from other traffic, we need to identify the Ethernet interface our data/slave node is using and then run tcpdump against port 50010, the port Data Nodes use for data transfer. To find the Ethernet interface, run the following command:

#to check the ethernet config
- ifconfig

Here our Ethernet interface is “eth0”. Therefore, the command to run to monitor the incoming packets is:

#To monitor the in/out packets on the data transfer port
- tcpdump -i eth0 tcp port 50010 -n

Once this command is running on our data/slave nodes, packets will start arriving at the node chosen for saving the file as soon as the upload starts at the client end.

You might get confused by seeing an enormous number of packets arriving at all the data/slave nodes almost at the same time, since everything happens within a fraction of a second. This happens because the file uploaded by the client is small, so it uploads quickly and starts replicating almost simultaneously onto the rest of the data/slave nodes (depending on the replication factor set on the client node).
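
As a cross-check on what tcpdump shows, HDFS’s own fsck tool can report exactly which Data Nodes ended up holding each block of the uploaded file (an optional aside; <filename> is whatever you uploaded earlier):

#list the blocks of a file and the Data Nodes they are stored on
- hadoop fsck /<filename> -files -blocks -locations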

What are data replication and replication factor?

Replication ensures the availability of the data. Replication is nothing but making copies of something, and the number of copies kept of a particular thing is its replication factor. As we saw with file blocks, HDFS stores data as a set of blocks, and Hadoop is also configured to make copies of those blocks. By default the replication factor in Hadoop is 3, and it can be changed manually as per your requirement by editing the client’s hdfs-site.xml file, as shown earlier.

If the replication factor were set to 1, for example, the files uploaded by the client would be stored only once, with no additional copies.

Goals of data replication

Data replication is done with an aim to:

  • Increase the availability of data.
  • Speed up the query evaluation.

How does data replication help the Name Node recover data in case of a Data Node failure?

The Namenode periodically receives a heartbeat and a block report from each Datanode in the cluster. Every Datanode sends a heartbeat message to the Namenode every 3 seconds. The heartbeat simply tells the Namenode whether a particular Datanode is working properly, in other words, whether that Datanode is alive or not.

A block report from a particular Datanode contains information about all the blocks that reside on that Datanode. When the Namenode doesn’t receive any heartbeat from a Datanode for 10 minutes (by default), that Datanode is considered dead or failed. Since its blocks are now under-replicated, the system starts replicating them from one Datanode to another, using the block information from that Datanode’s block report. The data for replication is transferred directly between Datanodes without passing through the Namenode.

Hadoop web user interface

HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations. By default, this is exposed on port 50070 on the NameNode. Accessing “http://namenode:50070/” with a web browser will return a page containing overview information about the health, capacity, and usage of the cluster (similar to the information returned by bin/hadoop dfsadmin -report).

The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hadoop-site.xml. It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0.
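
For example, keeping the default port but listening on all addresses would look something like this (a sketch of the setting described above):

#xml code to set the web interface address
<property>
<name>dfs.http.address</name>
<value>0.0.0.0:50070</value>
</property>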

From this interface, you can browse HDFS itself with a basic file-browser interface. Each DataNode exposes its file browser interface on port 50075. You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075. Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.

Future scope of Hadoop technology

Hadoop is among the major big data technologies and has vast scope in the future. Being cost-effective, scalable and reliable, it is employed by many of the world’s biggest organizations to deal with their massive data for research and production.

This includes storing data on a cluster that tolerates machine or hardware failures, adding new hardware to the nodes, and so on.

Newcomers to the IT sector often ask what the scope of Hadoop will be in the future. Well, it can be gauged from the fact that the amount of data generated through social networking and other means has increased and keeps increasing as the world moves towards digitalization.

This generation of massive data is what brings Hadoop into play, and it is widely adopted compared to other big data technologies. However, other technologies do compete with Hadoop, as it has not yet fully stabilised its position in the big data market. It is still in the adoption phase and will take some time to mature and lead the market.

Conclusion

Big Data has taken the world by storm. It is said that the next decade will be dominated by Big Data, with companies using the data available to them to understand their ecosystems and improve their shortcomings. Major universities and companies have started investing in tools that help them understand and create useful insights from the data they have access to. One such tool for analysing and processing Big Data is Hadoop.

Congrats, you made it to the end! We hope you learned something new today.

Please leave your comments down below for any queries and suggestions.

Keep exploring! 👦💻❤️😇

Written by: Ankit Kumar & Yashraj
