Sunday, July 15, 2018

Hadoop Interview Questions with Answers





---------------------------------------------------------------------------
Hadoop Interview Questions with Answers
Presented By 
Dineshkumar Selvaraj
---------------------------------------------------------------------------


1. What is Big Data? Give some examples.
Big Data is a collection of data so large and complex that it becomes very difficult to capture, store, process, retrieve and analyze with on-hand database management tools or traditional data processing techniques.

There are many real-life examples of Big Data: Facebook generates 500+ terabytes of data per day, the NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airliner collects 10 terabytes of sensor data for every 30 minutes of flying time.

As of December 31, 2012, there were 1.06 billion monthly active users on Facebook and 680 million mobile users. On average, 3.2 billion likes and comments are posted every day on Facebook, and 72% of the web audience is on Facebook. There are many activities on Facebook: wall posts, sharing images and videos, writing comments, liking posts, and so on. Facebook started using Hadoop in mid-2009 and was one of the initial users of Hadoop.

2. What are the four characteristics of Big Data? 
According to IBM, the four characteristics of Big Data are:
     Volume: Facebook generating 500+ terabytes of data per day. 
     Velocity: Analyzing 2 million records each day to identify the reason for losses. 
     Variety: images, audio, video, sensor data, log files, etc. 
     Veracity: biases, noise and abnormality in data 

3. What is Hadoop? Describe some of its characteristics.
Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.

The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data (e.g. petabytes). The programming model is based on Google's MapReduce, and the storage layer is based on the Google File System (GFS). Hadoop handles large files with high data throughput and supports data-intensive distributed applications. Hadoop is scalable, as more nodes can easily be added to a cluster.

4. Why do we need Hadoop?
Every day a large amount of unstructured data is dumped into our machines. The major challenge is not storing large data sets but retrieving and analyzing them, especially when the data sits on different machines at different locations. This is where Hadoop becomes necessary. Hadoop can analyze data spread across different machines and locations quickly and cost-effectively. It uses MapReduce, which divides a query into small parts and processes them in parallel. This is also known as parallel computing.

5. Give a brief overview of Hadoop's history.
In 2002, Doug Cutting created an open source web crawler project. In 2003 and 2004, Google published the GFS and MapReduce papers. In 2006, Doug Cutting developed the open source MapReduce and HDFS project. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for Hadoop.

6. What are the key features of HDFS?
HDFS is highly fault-tolerant, offers high throughput, suits applications with large data sets, provides streaming access to file system data, and can be built out of commodity hardware.

High Availability – HDFS is a highly available file system. Data is replicated among the nodes in the HDFS cluster by creating replicas of the blocks on the other slaves in the cluster. Hence, when clients want to access their data, they can read it from the nearest slave that holds its blocks. If a node fails, clients can still easily access their data from other nodes.

Data Reliability – HDFS is a distributed file system that provides reliable data storage. HDFS can store data in the range of hundreds of petabytes. It stores data reliably by creating a replica of every block on the nodes, and hence provides fault tolerance.

Replication – Data replication is one of the most important and unique features of HDFS. In HDFS, data is replicated to prevent data loss under unfavorable conditions such as a node crash, hardware failure and so on.

Scalability – HDFS stores data on multiple nodes in the cluster, so when requirements increase we can scale the cluster. There are two scalability mechanisms available: vertical (adding resources to existing nodes) and horizontal (adding more nodes).

Distributed Storage – In HDFS all of these features are achieved via distributed storage and replication. Data is stored in a distributed manner across the nodes in the HDFS cluster.

7. What is Fault Tolerance?
Suppose you have a file stored on a single system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data in that file back. To avoid such situations, HDFS provides fault tolerance. When we store a file in Hadoop, it is automatically replicated at two other locations as well (with the default replication factor of 3). So even if one or two of the systems collapse, the file is still available on the third system.

8. Replication causes data redundancy, so why is it pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has a high chance of crashing at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates data and stores it in different places. With the default replication factor, any data on HDFS is stored at 3 different locations. So even if one of them is corrupted and another is unavailable for some time for any reason, the data can still be accessed from the third one. Hence, the chance of losing data is very small. This replication factor gives Hadoop its fault tolerance.

9. What is throughput? How does HDFS get good throughput?
Throughput is the amount of work done per unit of time. It describes how fast data can be accessed from the system and is usually used to measure the performance of the system. In HDFS, when we want to perform a task or an action, the work is divided and shared among different systems. All the systems execute the tasks assigned to them independently and in parallel, so the work is completed in a very short period of time. In this way, HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read the data tremendously.
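
The effect of parallel reads can be shown with a little arithmetic. The sketch below is only an illustration: the 1 TB data size, the 100 MB/s per-disk read rate and the disk counts are made-up numbers, and it assumes the data is spread evenly and every disk is read at the same time.

```python
def read_time_seconds(total_bytes, disk_bytes_per_sec, num_disks):
    """Time to scan total_bytes when the data is spread evenly over
    num_disks and all disks are read simultaneously."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    per_disk_bytes = total_bytes / num_disks
    return per_disk_bytes / disk_bytes_per_sec


TB = 10**12  # hypothetical data size unit
MB = 10**6   # hypothetical per-disk rate unit

# One disk reading 1 TB versus 100 disks each reading 1/100 of it.
single_disk = read_time_seconds(1 * TB, 100 * MB, 1)
hundred_disks = read_time_seconds(1 * TB, 100 * MB, 100)
print(single_disk, hundred_disks)
```

Under these assumptions the scan drops from 10,000 seconds to 100 seconds. Real clusters see less than this ideal speed-up because of network and scheduling overhead.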

10. Why do we use HDFS for applications with large data sets and not when there are a lot of small files?
HDFS is better suited to a large amount of data in a single file than to a small amount of data spread across many files. This is because the Namenode keeps the metadata for every file and block in its memory, so it is not prudent to fill that memory with the metadata generated by many small files. When a large amount of data is stored in a single file, it occupies less space in the Namenode. Hence, for optimized performance, HDFS favors large data sets over many small files.
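
A rough estimate makes the small-files cost visible. This is a sketch under stated assumptions: the 150 bytes per object is a commonly quoted rule of thumb rather than an exact figure, the 64MB block size is taken from the examples elsewhere in this post, and the total size is assumed to divide evenly into the files.

```python
BYTES_PER_OBJECT = 150  # assumption: rule-of-thumb NameNode memory per file or block


def namenode_objects(total_bytes, file_size, block_size):
    """Count the file and block objects the NameNode must track when
    total_bytes is stored as equal-size files of file_size bytes."""
    num_files = -(-total_bytes // file_size)       # ceiling division
    blocks_per_file = -(-file_size // block_size)  # ceiling division
    return num_files * (1 + blocks_per_file)       # 1 file object + its blocks


MB = 1024 * 1024
one_big_file = namenode_objects(1024 * MB, 1024 * MB, 64 * MB)
many_small_files = namenode_objects(1024 * MB, 1 * MB, 64 * MB)

print(one_big_file * BYTES_PER_OBJECT, many_small_files * BYTES_PER_OBJECT)
```

With these numbers the same 1 GB costs 17 objects as one file but 2048 objects as 1 MB files, so the metadata grows by more than a hundred times while the stored data is unchanged.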

11. What is a Secondary Namenode? Is it a substitute for the Namenode?
The Secondary Namenode periodically reads the namespace image and edit log from the Namenode, merges them into a new checkpoint, and writes it to the hard disk or the file system. It is not a substitute for the Namenode, so if the Namenode fails (and High Availability is not configured), the entire Hadoop system goes down.

12. What is the difference between MR1 and MR2 Hadoop with regard to the Namenode?
In MR1 (Hadoop 1), the Namenode is a single point of failure.
In MR2 (Hadoop 2), we have an Active and Passive (Standby) Namenode structure. If the Active Namenode fails, the Passive Namenode takes over.

13. How do the Hadoop Namenode failover process and High Availability work?
In a typical High Availability cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace.

In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs. In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
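
The single-writer rule can be sketched in a few lines. This is a simplified model, not the actual HDFS code: the class and method names are hypothetical, and it assumes the JournalNodes tell writers apart by an increasing epoch number, so that a newer writer automatically fences out the older one. Edits count as durable only when a majority of JournalNodes accept them.

```python
class JournalNode:
    """Hypothetical JournalNode: remembers the newest writer epoch it promised."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        # Accept a writer only if its epoch is newer than any seen so far.
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def journal(self, epoch, edit):
        # Reject edits from a writer whose epoch is stale (it was fenced).
        if epoch < self.promised_epoch:
            return False
        self.edits.append(edit)
        return True


class NameNodeWriter:
    """Hypothetical NameNode side: needs a majority of JournalNodes to agree."""

    def __init__(self, journal_nodes):
        self.jns = journal_nodes
        self.epoch = None

    def _has_majority(self, acks):
        return sum(acks) > len(self.jns) // 2

    def become_active(self, epoch):
        acks = [jn.new_epoch(epoch) for jn in self.jns]
        if self._has_majority(acks):
            self.epoch = epoch
            return True
        return False

    def log_edit(self, edit):
        if self.epoch is None:
            return False
        acks = [jn.journal(self.epoch, edit) for jn in self.jns]
        return self._has_majority(acks)


jns = [JournalNode() for _ in range(3)]
nn1, nn2 = NameNodeWriter(jns), NameNodeWriter(jns)

nn1.become_active(1)
first_write = nn1.log_edit("mkdir /a")    # accepted: nn1 is the only writer
nn2.become_active(2)                      # failover: nn2 takes over writing
stale_write = nn1.log_edit("mkdir /b")    # rejected: nn1 has been fenced
new_write = nn2.log_edit("mkdir /c")      # accepted from the new Active
print(first_write, stale_write, new_write, jns[0].edits)
```

Once the second NameNode registers a higher epoch, the old Active can no longer get a majority for its edits, which is the property that prevents the split-brain scenario.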

14. How can we initiate a manual failover when automatic failover is configured? 
Even if automatic failover is configured, you may initiate a manual failover using the hdfs haadmin command (its -failover option). It will perform a coordinated failover.

15. In what kinds of situations should we not use Hadoop?
1. Real Time Analytics: If you want to do real-time analytics, where you expect results quickly, Hadoop should not be used directly. Hadoop works on batch processing, hence its response time is high.
2. Not a Replacement for Existing Infrastructure: Hadoop is not a replacement for your existing data processing infrastructure. However, you can use Hadoop along with it.
3. Multiple Smaller Datasets: The Hadoop framework is not recommended for small, structured datasets, as other tools on the market, such as MS Excel or an RDBMS, can do this work easily and faster than Hadoop. For small data analytics, Hadoop can be costlier than other tools.
4. Novice Hadoopers: Unless you understand the Hadoop framework well, it is not advisable to use Hadoop in production. Hadoop is a technology that should come with a disclaimer: “Handle with care”. You should know it before you use it.
5. Security is the Primary Concern: Many enterprises, especially within highly regulated industries dealing with sensitive data, aren’t able to move as quickly as they would like towards implementing Big Data projects and Hadoop.

16. In what kinds of situations can we use Hadoop?
1. Data Size and Data Diversity: When you are dealing with huge volumes of data coming from various sources and in a variety of formats, you are dealing with Big Data. In this case, Hadoop is the right technology for you.
2. Future Planning: It is all about getting ready for challenges you may face in the future. If you anticipate Hadoop as a future need, you should plan accordingly. To implement Hadoop on your data, you should first understand the level of complexity of the data and the rate at which it is going to grow. So you need cluster planning. It may begin with building a small or medium cluster sized for the data available at present (in GBs or a few TBs) and scaling up the cluster in the future depending on the growth of your data.
3. Multiple Frameworks for Big Data: There are various tools for various purposes. Hadoop can be integrated with multiple analytic tools to get the best out of it, such as Mahout for machine learning, R and Python for analytics and visualization, Spark for real-time processing, MongoDB and HBase for NoSQL databases, Pentaho for BI, etc.
4. Lifetime Data Availability: When you want your data to be live and running forever, it can be achieved using Hadoop’s scalability. There is no limit to the size of cluster that you can have. You can increase the size anytime as per your need by adding datanodes to it. The bottom line is: use the right technology for your need.

17. How to fix this warning message?
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable...
This warning usually appears because the native Hadoop library $HADOOP_HOME/lib/native/libhadoop.so.1.0.0 was compiled for 32-bit while you are running on a 64-bit system. It is only a warning and does not affect Hadoop's functionality. If you do want to eliminate it, download the Hadoop source code, recompile libhadoop.so.1.0.0 on a 64-bit system, and replace the 32-bit one.

18. What platforms and Java versions does Hadoop run on?
Hadoop requires Java 1.6.x or higher. Linux and Windows are the supported operating systems, but BSD, Mac OS/X, and OpenSolaris are known to work. (Windows requires the installation of Cygwin.)

19. Hadoop is highly scalable. How well does it scale?
Hadoop has been demonstrated on clusters of up to 4000 nodes. Sort performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 1.8 hours) and improves when using these non-default configuration values:

dfs.block.size = 134217728
dfs.namenode.handler.count = 40
mapred.reduce.parallel.copies = 20
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 200
io.sort.factor = 100
io.sort.mb = 200
io.file.buffer.size = 131072

Sort performance on 1400 nodes and 2000 nodes is pretty good too: sorting 14TB of data on a 1400-node cluster takes 2.2 hours; sorting 20TB on a 2000-node cluster takes 2.5 hours. The updates to the above configuration are:
mapred.job.tracker.handler.count = 60
mapred.reduce.parallel.copies = 50
tasktracker.http.threads = 50
mapred.child.java.opts = -Xmx1024m
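
To compare the three sort runs quoted above, it helps to convert them into rates. The sketch below uses only the figures from this answer and assumes 1 TB = 1000 GB; the function name is made up for this example.

```python
def sort_rates(terabytes, nodes, hours):
    """Return (aggregate TB per hour, per-node GB per hour) for one sort run."""
    aggregate_tb_per_hour = terabytes / hours
    per_node_gb_per_hour = terabytes * 1000 / nodes / hours  # assumes 1 TB = 1000 GB
    return aggregate_tb_per_hour, per_node_gb_per_hour


# (data in TB, number of nodes, hours) as quoted in the answer above
runs = [(9, 900, 1.8), (14, 1400, 2.2), (20, 2000, 2.5)]
for tb, nodes, hours in runs:
    aggregate, per_node = sort_rates(tb, nodes, hours)
    print(f"{nodes} nodes: {aggregate:.2f} TB/h total, {per_node:.2f} GB/h per node")
```

The total sort rate keeps rising as nodes are added (about 5, 6.4 and 8 TB per hour), while the rate per node falls only moderately, which is what scaling out is expected to look like.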

20. What kind of hardware is best suited for Hadoop?
The short answer is dual-processor/dual-core machines with 4-8GB of ECC RAM, depending upon workflow needs. Machines should be moderately high-end commodity machines to be most cost-effective; they typically cost 1/2 to 2/3 as much as normal production application servers but are not desktop-class machines.

21. How does the Hadoop/Parallel Distributed Processing community define "commodity"?
"Commodity" hardware is best defined as consisting of standardized, easily available components which can be purchased from multiple distributors/retailers. Given this definition, there are still ranges of quality that can be purchased for your cluster. As mentioned above, users generally avoid the low-end, cheap solutions. The primary motivation for avoiding low-end solutions is "real" cost: cheap parts mean a greater number of failures, requiring more maintenance and cost. Many users spend $2K-$5K per machine.

More specifics:
Multi-core boxes tend to give more computation per dollar, per watt and per unit of operational maintenance. But the highest clockrate processors tend to not be cost-effective, as do the very largest drives. So moderately high-end commodity hardware is the most cost-effective for Hadoop today.

Some users use cast-off machines that were not reliable enough for other applications. These machines originally cost about 2/3 of what normal production boxes cost and achieve almost exactly 1/2 as much. Production boxes are typically dual CPUs with dual cores.

RAM:
Many users find that most Hadoop applications are very small in memory consumption. Users tend to have 4-8 GB machines, with 2GB probably being too little. ECC memory is not low-end, but Hadoop benefits greatly from it, so using ECC memory is recommended.

22. I have a new node I want to add to a running Hadoop cluster; how do I start services on just one node? 
This also applies to the case where a machine has crashed and rebooted, etc, and you need to get it to rejoin the cluster. You do not need to shutdown and/or restart the entire cluster in this case.
First, add the new node's DNS name to the conf/slaves file on the master node.
Then log in to the new slave node and execute:

$ cd path/to/hadoop
$ bin/hadoop-daemon.sh start datanode
$ bin/hadoop-daemon.sh start tasktracker

If you are using the dfs.include/mapred.include functionality, you will need to additionally add the node to the dfs.include/mapred.include file, then issue hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes so that the NameNode and JobTracker know of the additional node that has been added. 

23. How to see the status and health of a cluster?
You can see some basic HDFS cluster health data by running the command below:
$ bin/hadoop dfsadmin -report 

24. Does Hadoop require SSH?
The Hadoop-provided scripts (e.g., start-mapred.sh and start-dfs.sh) use ssh to start and stop the various daemons and some other utilities. The Hadoop framework itself does not require ssh. Daemons (e.g. TaskTracker and DataNode) can also be started manually on each node without the scripts' help.

25.If I add new DataNodes to the cluster will HDFS move the blocks to the newly added nodes in order to balance disk space utilization between the nodes? 
No, HDFS will not move blocks to new nodes automatically. However, newly created files will likely have their blocks placed on the new nodes. 

There are several ways to rebalance the cluster manually. 
Select a subset of files that take up a good percentage of your disk space; copy them to new locations in HDFS; remove the old copies of the files; rename the new copies to their original names. 
A simpler way, with no interruption of service, is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down.

Yet another way to rebalance blocks is to turn off the data-node that is full, wait until its blocks are replicated, and then bring it back again. The over-replicated blocks will be randomly removed from different nodes, so the blocks really get rebalanced, not just removed from the current node.

Finally, you can use the bin/start-balancer.sh command to run a balancing process to move blocks around the cluster automatically. 

26.What is the purpose of the secondary name-node? 
The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event can it replace the primary name-node in case of its failure.

The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads current name-node image and edits log files, joins them into new image and uploads the new image back to the (primary and the only) name-node. 
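
The checkpoint step described above, joining the image with the edits log into a new image, can be sketched as follows. This is a toy model: the real fsimage and edits files are binary formats, and the dictionary layout, the operation names and the function name here are all invented for illustration.

```python
def apply_edits(fsimage, edits):
    """Return a new namespace image with the logged edits replayed in order.

    fsimage: dict mapping path -> replication factor (toy stand-in for the image)
    edits:   list of tuples such as ("create", path, replication)
    """
    image = dict(fsimage)  # copy, so the previous checkpoint stays intact
    for op, path, *args in edits:
        if op == "create":
            image[path] = args[0]
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[args[0]] = image.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return image


old_image = {"/a.txt": 3}
edit_log = [
    ("create", "/b.txt", 3),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_image = apply_edits(old_image, edit_log)
print(new_image)
```

After the merge, the new image already contains every logged change, so the edits log can start again from empty. That is why checkpoints keep name-node restarts short.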

So if the name-node fails and you can restart it on the same physical node, there is no need to shut down the data-nodes; only the name-node needs to be restarted. If you cannot use the old node anymore, you will need to copy the latest image somewhere else. The latest image can be found either on the node that used to be the primary before the failure, if available, or on the secondary name-node. The latter will be the latest checkpoint without subsequent edit logs, so the most recent namespace modifications may be missing there. You will also need to restart the whole cluster in this case.

27. If the name-node is in safe mode, can under-replicated files be replicated in HDFS?
No. During safe mode, replication of blocks is prohibited. The name-node waits until all or a majority of data-nodes report their blocks.
Depending on how the safe mode parameters are configured, the name-node will stay in safe mode until a specific percentage of the blocks in the system is minimally replicated (dfs.replication.min). If the safe mode threshold dfs.safemode.threshold.pct is set to 1, then all blocks of all files should be minimally replicated.
Minimal replication does not mean full replication. Some replicas may be missing and in order to replicate them the name-node needs to leave safe mode.
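
The safe mode exit condition can be written as a small check. This is a sketch, not the name-node's actual code: the function and argument names are made up, with min_replication standing in for dfs.replication.min and threshold_pct for dfs.safemode.threshold.pct.

```python
def can_leave_safe_mode(replica_counts, min_replication, threshold_pct):
    """replica_counts holds the number of reported replicas for each block.

    A block counts as safe once it has at least min_replication replicas,
    even if it is still below its full target replication.
    """
    if not replica_counts:
        return True  # an empty file system has nothing to wait for
    safe_blocks = sum(1 for count in replica_counts if count >= min_replication)
    return safe_blocks / len(replica_counts) >= threshold_pct


# Four blocks with a target replication of 3: one fully replicated,
# two under-replicated but reported, one not reported at all.
reported = [3, 1, 2, 0]
strict = can_leave_safe_mode(reported, min_replication=1, threshold_pct=1.0)
relaxed = can_leave_safe_mode(reported, min_replication=1, threshold_pct=0.75)
print(strict, relaxed)
```

The blocks with 1 or 2 replicas count as safe here although they are under-replicated, which is exactly the difference between minimal and full replication.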

28.What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it? 
A file will appear in the name space as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, then the original writer will get an IOException either when it finishes writing to the current block or when it closes the file.

29. Do wildcard characters work correctly in FsShell?
When you issue a command in FsShell, you may want to apply that command to more than one file. FsShell provides a wildcard character to help you do so. The * (asterisk) character can be used to take the place of any set of characters. For example, if you would like to list all the files in your account which begin with the letter x, you could use the ls command with the * wildcard:
bin/hadoop dfs -ls x*
Sometimes the native OS wildcard support causes unexpected results, because the local shell expands the pattern before FsShell sees it. To avoid this problem, enclose the expression in single or double quotes and it should work correctly.
bin/hadoop dfs -ls 'in*'

30. Can we load multiple files in HDFS using different block sizes? 
Yes. HDFS provides an API to specify the block size when you create a file.
See FileSystem.create(Path, overwrite, bufferSize, replication, blockSize, progress)

31. What happens when two clients try to write into the same HDFS file? 
HDFS supports exclusive writes only.
When the first client contacts the name-node to open the file for writing, the name-node grants a lease to the client to create this file. When the second client tries to open the same file for writing, the name-node will see that the lease for the file is already granted to another client, and will reject the open request for the second client.
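
The lease check can be modelled in a few lines. This is a minimal sketch with hypothetical class and method names; it leaves out lease expiry and renewal, which a real name-node also has to handle.

```python
class LeaseManager:
    """Toy model of the name-node's write leases: one writer per file."""

    def __init__(self):
        self._holders = {}  # path -> client currently holding the lease

    def open_for_write(self, path, client):
        holder = self._holders.get(path)
        if holder is not None and holder != client:
            raise PermissionError(f"{path} is already leased to {holder}")
        self._holders[path] = client

    def close(self, path, client):
        # Only the lease holder can release the lease.
        if self._holders.get(path) == client:
            del self._holders[path]

    def holder(self, path):
        return self._holders.get(path)


leases = LeaseManager()
leases.open_for_write("/data/log.txt", "client-1")
try:
    leases.open_for_write("/data/log.txt", "client-2")
    second_rejected = False
except PermissionError:
    second_rejected = True

leases.close("/data/log.txt", "client-1")
leases.open_for_write("/data/log.txt", "client-2")  # succeeds once the lease is free
print(second_rejected, leases.holder("/data/log.txt"))
```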

32. How to limit a DataNode's disk usage?
Use the dfs.datanode.du.reserved configuration value in $HADOOP_HOME/conf/hdfs-site.xml to limit disk usage. It sets the space, in bytes per volume, that the DataNode leaves free for non-HDFS use.
value = 182400

33. What does "file could only be replicated to 0 nodes, instead of 1" mean? 
The NameNode does not have any available DataNodes. This can be caused by a wide variety of reasons. 
Check the DataNode logs, the NameNode logs and network connectivity.

34. If the NameNode loses its only copy of the fsimage file, can the file system be recovered from the DataNodes? 
No. This is why it is very important to configure dfs.namenode.name.dir to write to two filesystems on different physical hosts, use the SecondaryNameNode, etc. 

35.If a block size of 64MB is used and a file is written that uses less than 64MB, will 64MB of disk space be consumed? 
Short answer: No.
Longer answer: Since HDFS does not do raw disk block storage, there are two block sizes in use when writing a file in HDFS: the HDFS block size and the underlying file system's block size. HDFS will create files up to the size of the HDFS block size, as well as a meta file that contains CRC32 checksums for that block. The underlying file system stores that file in increments of its block size on the actual raw disk, just as it would any other file.
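
A worked example shows how little space a small file takes. The sketch assumes a local file system block size of 4096 bytes (a hypothetical but typical value), counts one replica only, and ignores the small checksum meta files.

```python
def local_disk_usage(file_bytes, hdfs_block_bytes, fs_block_bytes):
    """Approximate bytes used on disk by one replica of a file.

    The file is cut into HDFS blocks; each block is an ordinary local file
    that the underlying file system rounds up to its own block size.
    """
    full_blocks, tail = divmod(file_bytes, hdfs_block_bytes)
    block_file_sizes = [hdfs_block_bytes] * full_blocks
    if tail:
        block_file_sizes.append(tail)
    return sum(
        (size + fs_block_bytes - 1) // fs_block_bytes * fs_block_bytes
        for size in block_file_sizes
    )


MB = 1024 * 1024
small_file = local_disk_usage(1_000_000, 64 * MB, 4096)
print(small_file)
```

A file of 1,000,000 bytes occupies 1,003,520 bytes here (245 local blocks of 4096 bytes), nowhere near the 64MB of the HDFS block size.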

36. What are the Layers in Hadoop framework ? 
The Hadoop framework works on the following two core components:
1) Storage Layer (HDFS): The Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks, and it operates on a master-slave architecture.

2) Processing Layer (MapReduce): This is a Java-based programming paradigm of the Hadoop framework that provides scalability across Hadoop clusters. MapReduce distributes the workload into tasks that can run in parallel. Hadoop jobs perform 2 separate tasks: the map job and the reduce job. The map job breaks down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job is executed.
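
The map and reduce steps can be imitated in plain Python with the classic word count. This is only a single-process sketch of the idea: it does not use the Hadoop API, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one input line into (word, 1) key-value pairs."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a smaller result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
print(counts)
```

In a real cluster each call to map_phase could run on a different node, next to the block that holds its input lines, and the grouped keys would be spread over several reducers.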

37. Explain the difference between NameNode, Backup Node and Checkpoint NameNode? 
NameNode: The NameNode is at the heart of the HDFS file system. It manages the metadata: the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace:
fsimage file: keeps track of the latest checkpoint of the namespace.
edits file: a log of the changes that have been made to the namespace since the checkpoint.

Checkpoint Node: 
Checkpoint Node keeps track of the latest checkpoint in a directory that has same structure as that of NameNode’s directory. Checkpoint node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage file from the NameNode and merging it locally. The new image is then again updated back to the active NameNode.

BackupNode: 
Backup Node also provides check pointing functionality like that of the checkpoint node but it also maintains its up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.

38. How can you overwrite the replication factor in HDFS?
The replication factor in HDFS can be modified or overwritten in 2 ways:
1) Using the Hadoop FS Shell, the replication factor can be changed on a per-file basis with the command below:
$ hadoop fs -setrep -w 4 /my/test_file (test_file is the file whose replication factor will be set to 4)

2) Using the Hadoop FS Shell, the replication factor of all files under a given directory can be modified with the command below:
$ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the directory; all the files in it will have their replication factor set to 5)

39.Explain about the indexing process in HDFS? 
Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the address where the next part of data chunk is stored.

40. What happens when a user submits a Hadoop job while the NameNode is down: does the job get put on hold or does it fail?

The Hadoop job fails when the NameNode is down.

41. Whenever a client submits a Hadoop job, who receives it?
In MR1, the JobTracker receives the Hadoop job. It contacts the NameNode, which looks up the data requested by the client and provides the block information. The JobTracker then takes care of resource allocation for the job to ensure timely completion.

42. What do you understand by Edge nodes in Hadoop? 
Edge nodes are the interface between the Hadoop cluster and the external network. They are used for running cluster administration tools and client applications. Edge nodes are also referred to as gateway nodes.

43. In which modes can Hadoop be run?
Hadoop can run in three modes:

Standalone Mode:
The default mode of Hadoop. It uses the local file system for input and output operations. This mode is mainly used for debugging, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.

Pseudo-Distributed Mode (Single Node Cluster):
In this case, you need configuration for all three files mentioned above. All daemons run on one node, and thus the Master and Slave node are the same.

Fully Distributed Mode (Multiple Cluster Node):
This is the production mode of Hadoop (what Hadoop is known for), where data is distributed across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slave.

44. Explain the major difference between an HDFS block and an InputSplit.
In simple terms, a block is the physical representation of data, while a split is the logical representation of the data present in the blocks. The split acts as an intermediary between block and mapper.
Suppose we have two blocks:
Block 1: ii dineshkumarS
Block 2: Ii kumar
A map reading Block 1 would read from ii to arS, but a record that starts at the end of Block 1 may continue in Block 2, and the map does not know how to process the second block at the same time. This is where the split comes into play: it forms a logical group of Block 1 and Block 2 as a single unit.
It then forms key-value pairs using the InputFormat and RecordReader and sends them to the map for further processing. With InputSplit, if you have limited resources, you can increase the split size to limit the number of maps. For instance, if there is a 640MB file stored as 10 blocks (64MB each) and resources are limited, you can set the split size to 128MB. This forms logical groups of 128MB, so only 5 maps are executed.

However, if splitting is disabled for the input (for example, by an input format whose isSplitable() returns false), the whole file forms one InputSplit and is processed by a single map, which takes more time when the file is bigger.
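
The split arithmetic from the 640MB example can be checked with a short calculation. The formula below mirrors how FileInputFormat picks a split size from the block size and the configured minimum and maximum; the rest is a simplification that ignores the small slack the real implementation allows on the last split, and the Python function names are made up.

```python
def compute_split_size(block_size, min_size, max_size):
    """Split size = max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def num_splits(file_size, split_size):
    """Number of splits, and so map tasks, for one splittable file."""
    return -(-file_size // split_size)  # ceiling division


MB = 1024 * 1024
file_size = 640 * MB
block_size = 64 * MB

# Default-like settings: the split size follows the block size.
default_split = compute_split_size(block_size, min_size=1, max_size=2**63 - 1)
# Raising the minimum split size to 128MB groups two blocks per split.
larger_split = compute_split_size(block_size, min_size=128 * MB, max_size=2**63 - 1)

print(num_splits(file_size, default_split), num_splits(file_size, larger_split))
```

With a 64MB split the file needs 10 maps; raising the minimum split size to 128MB halves that to 5, at the price of each map reading one of its two blocks from another location.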

45. What are the most common Input Formats in Hadoop?
The three most common input formats in Hadoop are:
Text Input Format: the default input format in Hadoop; it reads plain text files line by line.
Key Value Input Format: used for plain text files where each line is split into a key and a value.
Sequence File Input Format: used for reading sequence files.

46. What is Speculative Execution in Hadoop? 
One limitation of Hadoop is that, because tasks are distributed over several nodes, a few slow nodes can hold up the rest of the program. There are various reasons for tasks to be slow, which are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches another equivalent task as a backup. This backup mechanism in Hadoop is Speculative Execution.

It creates a duplicate task on another node. The same input can be processed multiple times in parallel. When most tasks in a job come to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks on nodes that are currently free. When a task finishes, it notifies the JobTracker. If other copies are still executing speculatively, Hadoop tells the TaskTrackers to quit those tasks and reject their output.

48. What is a heartbeat in HDFS? 
A heartbeat is a signal from a node indicating that it is alive. A DataNode sends heartbeats to the NameNode, and a TaskTracker sends heartbeats to the JobTracker. If the NameNode or JobTracker does not receive a heartbeat, it decides that there is some problem with the DataNode, or that the TaskTracker is unable to perform the assigned task.
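
Heartbeat tracking on the master side can be sketched like this. The class name, the 30-second timeout and the node names are all hypothetical; times are passed in explicitly so the example does not depend on a real clock.

```python
class HeartbeatMonitor:
    """Toy master-side tracker: a node is dead if it has been silent too long."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # node name -> time of its last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # hypothetical timeout
monitor.heartbeat("datanode-1", now=0)
monitor.heartbeat("datanode-2", now=0)
monitor.heartbeat("datanode-1", now=25)  # datanode-2 stays silent

silent = monitor.dead_nodes(now=40)
print(silent)
```

At time 40 only datanode-2 has been silent for longer than the timeout, so it is the one the master would treat as failed and whose blocks or tasks would be handled elsewhere.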

49. How to keep HDFS cluster balanced? 
When copying data into HDFS, it’s important to consider cluster balance. HDFS works best when the file blocks are evenly spread across the cluster, so you want to ensure that distcp doesn’t disrupt this. For example, if you specified -m 1, a single map would do the copy, which — apart from being slow and not using the cluster resources efficiently — would mean that the first replica of each block would reside on the node running the map (until the disk filled up). The second and third replicas would be spread across the cluster, but this one node would be unbalanced. By having more maps than nodes in the cluster, this problem is avoided. For this reason, it’s best to start by running distcp with the default of 20 maps per node. 
However, it’s not always possible to prevent a cluster from becoming unbalanced. Perhaps you want to limit the number of maps so that some of the nodes can be used by other jobs. In this case, you can use the balancer tool (hdfs balancer) to subsequently even out the block distribution across the cluster. 

50. How to deal with small files in Hadoop? 
Hadoop Archives (HAR) offers an effective way to deal with the small files problem.
Hadoop Archives or HAR is an archiving facility that packs files into HDFS blocks efficiently, and hence HAR can be used to tackle the small files problem in Hadoop. A HAR is created from a collection of files, and the archiving tool (a simple command) runs a MapReduce job to process the input files in parallel and create an archive file.

HAR command 
hadoop archive -archiveName myhar.har -p /input/location /output/location 
Once a .har file is created, you can list it and you will see that it is made up of index files and part files. Part files are simply the original files concatenated together into one big file. Index files are lookup files used to locate the individual small files inside the big part files. 
hadoop fs -ls /output/location/myhar.har 
/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-0

51. How to copy file from HDFS to the local file system . ? 
bin/hadoop fs -get /hdfs/source/path /localfs/destination/path
bin/hadoop fs -copyToLocal /hdfs/source/path /localfs/destination/path
Alternatively, point your web browser to the HDFS web UI (namenode_machine:50070 in Hadoop 2; the default port is 9870 in Hadoop 3), browse to the file you intend to copy, scroll down the page and click on download the file.

52. What's the difference between “hadoop fs” shell commands and “hdfs dfs” shell commands? Are they supposed to be equal? but, why the "hadoop fs" commands show the hdfs files while the "hdfs dfs" commands show the local files? 
The following three commands appear the same but have minor differences: 
hadoop fs {args}
hadoop dfs {args}
hdfs dfs {args}
hadoop fs {args}
FS relates to a generic file system which can point to any file systems like local, HDFS etc. So this can be used when you are dealing with different file systems such as Local FS, HFTP FS, S3 FS, and others

hadoop dfs {args}
dfs is specific to HDFS and works only for operations related to HDFS. This form has been deprecated and we should use hdfs dfs instead.

hdfs dfs {args}
Same as the second, i.e. it works for all operations related to HDFS, and it is the recommended command instead of hadoop dfs.

Below is the list of subcommands categorized as HDFS commands.

**#hdfs commands**
namenode|secondarynamenode|datanode|dfs|dfsadmin|fsck|balancer|fetchdt|oiv|dfsgroups
So even if you use hadoop dfs, it will locate hdfs and delegate the command to hdfs dfs.

53. How to check HDFS Directory size .? 
hdfs dfs -du [-s] [-h] URI [URI …]

54. Difference between hadoop fs -put and hadoop fs -copyFromLocal? 
copyFromLocal is similar to put command, except that the source is restricted to a local file reference.
So, basically you can do with put, all that you do with copyFromLocal, but not vice-versa. 
Similarly, 
copyToLocal is similar to get command, except that the destination is restricted to a local file reference. 
Hence, you can use get instead of copyToLocal, but not the other way round.

55. How to specify username when putting files on HDFS from a remote machine?
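There is no -username option on -put. On a cluster that uses simple (non-Kerberos) authentication, you can set the HADOOP_USER_NAME environment variable for the command:
HADOOP_USER_NAME=user1 hadoop fs -put file1 /user/user1/testFolder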

56. Where HDFS stores files locally by default? 
You need to look in your hdfs-default.xml configuration file for the dfs.data.dir setting (named dfs.datanode.data.dir in Hadoop 2 and later). The default setting is ${hadoop.tmp.dir}/dfs/data; note that ${hadoop.tmp.dir} itself is defined in core-default.xml and defaults to /tmp/hadoop-${user.name}. 
The description for this setting is: 
Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. 

57. Is there any command for hadoop that can change the name of a file (in the HDFS) from its old name to a new name? 
Use the following : hadoop fs -mv oldname newname 

58. Explain the hadoop list command with all possible options..?
Usage: hadoop fs -ls [-d] [-h] [-R] [-t] [-S] [-r] [-u] [args] 

Options: 
-d: Directories are listed as plain files. 
-h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864). 
-R: Recursively list subdirectories encountered. 
-t: Sort output by modification time (most recent first). 
-S: Sort output by file size. 
-r: Reverse the sort order. 
-u: Use access time rather than modification time for display and sorting. 
So you can easily sort the files, for example by combining -t or -S with -r to reverse the order.

59. How to unzip .gz files in a new directory in hadoop?
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt 
Here the gzipped file is Links.txt.gz. 

60.What is Apache Hadoop YARN .? 
YARN stands for 'Yet Another Resource Negotiator'. It was rolled out as part of Hadoop 2.0. YARN is the cluster resource management layer, a large-scale distributed system for running big data applications. 

61. What are the core concepts in YARN.? 
Resource Manager: roughly equivalent to the JobTracker (cluster-wide resource management and scheduling)
Node Manager: equivalent to the TaskTracker
Application Master: one per application (job); it handles the per-job coordination
Containers: equivalent to slots
YARN child: after the application is submitted, the Application Master dynamically launches YARN child processes to run the MapReduce tasks.

62. What are the additional benefits YARN brings in to Hadoop? 
YARN makes effective use of resources, because multiple applications can run on YARN and share a common pool of resources. YARN is backward compatible, so existing MapReduce jobs can still run on it. Using YARN, one can also run applications that are not based on the MapReduce model. 

63. How can native libraries be included in YARN jobs.? 
There are two ways to include native libraries in YARN jobs:
1) By setting -Djava.library.path on the command line
2) By setting LD_LIBRARY_PATH in the .bashrc file

64. What is a container in YARN? Is it same as the child JVM in which the tasks on the nodemanager run or is it different? 
It represents a bundle of resources (such as memory and CPU) on a single node of the cluster.
A container is
supervised by the node manager
scheduled by the resource manager
One MR task runs in such container(s).

65. Explain about the process of inter cluster data copying.?  or uses of DistCP command..?
HDFS provides a distributed data copying facility through DistCp, which copies data from a source to a destination using a MapReduce job. If the copy is within a single Hadoop cluster it is referred to as intra-cluster copying; if it is between two clusters it is inter-cluster copying. DistCp requires both source and destination to run the same or a compatible version of Hadoop.

66. What is Rack Awareness .?
Rack Awareness reduces network traffic while reading and writing files: the NameNode prefers DataNodes on the same rack as the client, or on a nearby rack. The NameNode gets this information by maintaining the rack ID of each DataNode and chooses DataNodes based on it. In HDFS, the NameNode also makes sure that the replicas of a block are not all stored on a single rack. The Rack Awareness algorithm thus reduces latency and improves fault tolerance.

With the default replication factor of 3, the default placement policy stores the first replica on the writer's node (or on a random node if the writer is outside the cluster), the second replica on a node in a different rack, and the third replica on a different node in the same rack as the second.
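
The following Python sketch shows one way such a placement could be computed. It assumes the stock HDFS default policy (writer's node first, then a node on a remote rack, then another node on that same remote rack); the topology, rack IDs and node names are made up, and the real NameNode also weighs free space and load, which is ignored here.

```python
import random

def place_replicas(writer, topology, rng=random):
    """Pick three DataNodes for a block written from `writer`.

    `topology` maps a rack id to the list of DataNodes on that rack.
    Simplified model: replica 1 on the writer, replicas 2 and 3 on two
    different nodes of one remote rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    # A remote rack must offer at least two nodes to host replicas 2 and 3.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("no remote rack with two or more DataNodes")
    remote = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote], 2)
    return [writer, second, third]

# Hypothetical two-rack cluster.
topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
print(place_replicas("dn1", topology, random.Random(0)))
```

The block survives the loss of a whole rack, while only one of the two pipeline hops has to cross racks.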

In Hadoop, we need Rack Awareness because it improves:
  • Data availability and reliability.
  • The performance of the cluster.
  • Network bandwidth usage.

67. What is RAID?
RAID is a way of combining multiple disk drives into a single entity to improve performance and/or reliability. There are a variety of different RAID levels. For example, RAID level 1 keeps a copy of the same data on two disks, which can also increase read performance by reading alternately from each disk in the mirror.

68. What are the different types of schedulers, and which scheduler did you use?
Capacity Scheduler:
It is designed to run Hadoop applications on a shared, multi-tenant cluster while maximizing the throughput and utilization of the cluster.

Fair Scheduler:
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time.

69) Compare Hadoop 2 and Hadoop 3.
  • In Hadoop 2, the minimum supported version of Java is Java 7, while in Hadoop 3 it is Java 8.
  • Hadoop 2 handles fault tolerance by replication (which costs extra storage space), while Hadoop 3 can also use erasure coding.
  • In Hadoop 2, some default ports fall within the Linux ephemeral port range, so the daemons can fail to bind to them at startup. In Hadoop 3 these ports have been moved out of the ephemeral range.
  • In Hadoop 2, HDFS has 200% storage overhead with the default 3x replication, while Hadoop 3 with erasure coding has about 50% overhead.
  • Hadoop 2 overcomes the NameNode SPOF (single point of failure) with a single standby NameNode, so the cluster recovers automatically when the active NameNode fails. Hadoop 3 supports more than one standby NameNode, so it can tolerate more failures without manual intervention.
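
The 200% and 50% figures follow from simple arithmetic, sketched below in Python. The erasure-coding case assumes a Reed-Solomon layout with 6 data units and 3 parity units, which is one of the policies shipped with Hadoop 3; the function names are only illustrative.

```python
def replication_overhead(factor):
    """Extra storage, in percent, for keeping `factor` full copies of the data."""
    return (factor - 1) * 100.0

def erasure_coding_overhead(data_units, parity_units):
    """Extra storage, in percent, for `parity_units` parity blocks per `data_units` data blocks."""
    return parity_units / data_units * 100.0

print(replication_overhead(3))          # 3x replication
print(erasure_coding_overhead(6, 3))    # Reed-Solomon 6 data + 3 parity (assumed policy)
```

Both layouts can survive the loss of any two (replication) or three (RS 6+3) units of a block group, but erasure coding pays for it with CPU time during encoding and recovery instead of disk space.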

70) What are configuration files in Hadoop?
core-site.xml – It contains configuration settings for Hadoop core, such as I/O settings that are common to HDFS and MapReduce. It specifies the hostname and port of the default file system; the most commonly used port is 9000.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

hdfs-site.xml – This file contains the configuration settings for the HDFS daemons. hdfs-site.xml also specifies the default block replication and permission checking on HDFS.
 <configuration>
 <property>
 <name>dfs.replication</name>
 <value>1</value>
 </property>
 </configuration>

mapred-site.xml – In this file, we specify the framework name for MapReduce by setting mapreduce.framework.name.
 <configuration>
 <property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 </property>
 </configuration>

yarn-site.xml – This file provides configuration settings for the NodeManager and ResourceManager.
 <configuration>
 <property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
 </property>
 <property>
 <name>yarn.nodemanager.env-whitelist</name>
 <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
 </property>
 </configuration>
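
All four files share the same name/value layout, so they are easy to inspect with a script. The Python sketch below reads such a file into a dictionary using only the standard library. The SAMPLE string is a made-up example that mixes one property from core-site.xml with one from hdfs-site.xml, and load_properties is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the same layout as the *-site.xml files.
SAMPLE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""

def load_properties(xml_text):
    """Return a {name: value} dict for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            # Strip whitespace that often surrounds hand-edited values.
            props[name.strip()] = value.strip()
    return props

print(load_properties(SAMPLE))
```

A parse like this is also a quick way to catch a malformed file, such as an unclosed <property> tag, before restarting the daemons.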

71) Explain Data Locality in Hadoop?
Hadoop's major drawback was cross-switch network traffic due to the huge volume of data. To overcome this drawback, data locality came into the picture. It refers to moving the computation close to the node where the data actually resides, instead of moving large data to the computation. Data locality increases the overall throughput of the system.
In Hadoop, HDFS stores datasets. Datasets are divided into blocks and stored across the DataNodes in the Hadoop cluster. When a user runs a MapReduce job, the framework uses the block locations reported by the NameNode to schedule the map tasks, and ships the job code to the nodes that hold the relevant data.

Data locality has three categories:
Data local – The data is on the same node as the mapper working on it, so the data is as close to the computation as possible. This is the most preferred scenario.
Intra-rack – The mapper runs on a different node but on the same rack, because it is not always possible to execute the mapper on the same DataNode due to resource constraints.
Inter-rack – The mapper runs on a different rack, because it is not possible to execute the mapper on any node in the same rack due to resource constraints.
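
The three categories can be expressed as a small classification function. This is a Python sketch with hypothetical node and rack names, meant only to show the order in which the cases are checked.

```python
def locality(task_node, replica_nodes, rack_of):
    """Classify where a map task runs relative to the replicas of its input block."""
    if task_node in replica_nodes:
        return "data-local"
    replica_racks = {rack_of[node] for node in replica_nodes}
    if rack_of[task_node] in replica_racks:
        return "intra-rack"
    return "inter-rack"

# Hypothetical cluster layout.
rack_of = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack3"}
replicas = ["dn1", "dn3"]

for node in ("dn1", "dn2", "dn4"):
    print(node, locality(node, replicas, rack_of))
```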

72) What are cron jobs? How do you set them up?
A cron job is a command that the cron daemon runs automatically on a schedule. In Ubuntu, go to the terminal and type:
$ crontab -e
This opens your personal crontab (cron configuration file). The first lines of that file explain the format. Each line defines one command to run, and the format is quite simple:
minute hour day-of-month month day-of-week command
For all the numeric fields you can use lists; for example, 5,34,55 in the first field means run at 5, 34 and 55 minutes past whatever hour is defined.
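
To see how a crontab line breaks down into these fields, here is a small Python sketch that expands each field into the values it matches. It only handles `*`, single numbers, comma lists, ranges such as 1-5 and steps such as */15; month and weekday names are not supported, and the backup script path is a made-up example.

```python
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 7),   # both 0 and 7 mean Sunday
]

def parse_field(field, lo, hi):
    """Expand one cron field into a sorted list of matching integers."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            first, last = part.split("-")
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return sorted(values)

def parse_crontab_line(line):
    """Split a crontab line into a schedule dict and the command to run."""
    parts = line.split(None, 5)
    if len(parts) != 6:
        raise ValueError("expected five time fields followed by a command")
    schedule = {name: parse_field(text, lo, hi)
                for (name, lo, hi), text in zip(FIELDS, parts)}
    return schedule, parts[5]

schedule, command = parse_crontab_line("5,34,55 2 * * * /usr/local/bin/backup.sh --full")
print(schedule["minute"], schedule["hour"], command)
```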

73) Does the name-node stay in safe mode till all under-replicated files are fully replicated?
No. The name-node waits until all or a majority of data-nodes report their blocks, and it stays in safe mode until a specific percentage of the blocks in the system is minimally replicated. Minimally replicated is not the same as fully replicated.
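
The exit condition can be sketched in a few lines of Python. This is a simplified model: it assumes the default threshold of 0.999 from dfs.namenode.safemode.threshold-pct, and it ignores the extra extension period that the real NameNode waits after the threshold is reached.

```python
def can_leave_safemode(total_blocks, minimally_replicated, threshold_pct=0.999):
    """True once enough blocks have at least the minimum number of replicas.

    "Minimally replicated" means the block has reached the minimum replica
    count (1 by default), not the full replication factor.
    """
    if total_blocks == 0:
        return True
    return minimally_replicated / total_blocks >= threshold_pct

print(can_leave_safemode(1000, 998))    # still waiting for block reports
print(can_leave_safemode(1000, 1000))   # every block has at least one replica
```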

74) How will you combine the 4 part-r files of a mapreduce job?
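Use hadoop fs -getmerge <src> <localdst>, which concatenates the files in the source directory into a single file on the local file system.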

75) How will you view the compressed files via HDFS command?
Use the -text command. It takes a source file and outputs the file in text format.
The allowed formats are zip and TextRecordInputStream.
hadoop fs -text <src>

76) What is expunge in HDFS ?
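It empties the trash in HDFS: hadoop fs -expunge permanently deletes the files in the trash that are older than the retention threshold.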

77) What is the limitation of Sequence files?
It is mainly supported through the Java API; there is little support for reading or writing it from other languages.

78) Where is the schema of an Avro file stored if the file is transferred from one host to another?
In the file itself, in a header section.

79) Can we append data records to an existing file in HDFS?
Yes by command $ hdfs dfs -appendToFile … Appends single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and appends to destination file system.

80) How do we handle small files in HDFS?
Just merge into sequence/avro file or archive them into har files.

81) What is the importance of dfs.namenode.name.dir?
It specifies where the NameNode stores the fsimage file. It should be configured to write to at least two file systems on different physical hosts (such as the NameNode and Secondary NameNode hosts), because if we lose the fsimage file we lose the entire HDFS file system, and there is no other recovery mechanism if no fsimage file is available.

82) What is the need for fsck in hadoop?
It can be used to determine the files with missing blocks.

83) How do we handle record boundaries in text files or sequence files in MapReduce InputSplits?
In MapReduce, an InputSplit's RecordReader starts and ends at a record boundary. In SequenceFiles, roughly every 2k bytes there is a 20-byte sync mark between records. These sync marks allow the RecordReader to seek to the start of the InputSplit (which consists of a file, an offset and a length) and find the first sync mark after the start of the split. The RecordReader continues processing records until it reaches the first sync mark after the end of the split. 

Text files are handled similarly, using newlines instead of sync marks.
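
For text files the rule can be sketched in Python as follows. This is a simplified model of what a line-based RecordReader does, not Hadoop's actual code: every split except the first discards everything up to and including its first newline, and every split keeps reading as long as a line starts at or before its end offset. The sample data and split sizes are made up.

```python
def read_split(data, start, length):
    """Return the lines a reader assigned bytes [start, start + length) would emit."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: discard everything up to and including the next
        # newline. The previous split is responsible for that line.
        newline = data.find(b"\n", pos)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # Keep reading while a line starts at or before the end of the split,
    # even if that line runs past the end.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        records.append(data[pos:newline])
        pos = newline + 1
    return records

data = b"alpha\nbravo\ncharlie\ndelta\n"
splits = [(0, 7), (7, 7), (14, 7), (21, 5)]   # deliberately cut through lines
for start, length in splits:
    print((start, length), read_split(data, start, length))
```

Although the split boundaries fall in the middle of lines, each line is emitted by exactly one split and none is lost or duplicated.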

84) How do we change the queue of currently running hadoop job without restarting it?
If you are running YARN, you can change the current job's queue with:
yarn application -movetoqueue <app_id> -queue <queue_name>

85) What is Safemode in Hadoop .?
Safemode in Apache Hadoop is a maintenance state of the NameNode, during which the NameNode does not allow any modifications to the file system. During Safemode, the HDFS cluster is read-only and does not replicate or delete blocks. At startup, the NameNode:
  • Loads the file system namespace from the last saved FsImage and the edits log file into main memory.
  • Merges the edits log into the FsImage, which results in a new file system namespace.
  • Then receives block reports containing information about block locations from all DataNodes.

In Safemode the NameNode collects block reports from the DataNodes. The NameNode enters Safemode automatically during startup and leaves it after the DataNodes have reported that most blocks are available.
Use the commands:
hdfs dfsadmin -safemode get: To know the status of Safemode
hdfs dfsadmin -safemode enter: To enter Safemode
hdfs dfsadmin -safemode leave: To come out of Safemode
(hadoop dfsadmin also works but is deprecated.)
The NameNode front page shows whether Safemode is on or off.

86) If a map task is failed once during mapreduce job execution will job fail immediately?
No, the framework retries the failed task up to the maximum number of attempts allowed for map/reduce tasks, which is 4 by default.


87) List the namenode and secondary namenode of a cluster from any node?
Try this. Not every user has permission to run dfsadmin.
hdfs getconf -confKey "fs.defaultFS"
hdfs://XYZ
or 
hdfs getconf -namenodes
hdfs getconf -secondaryNameNodes

Use the dfsadmin command to get a report:
bin/hadoop dfsadmin -report  (deprecated)
bin/hdfs dfsadmin -report

88) What is the solution for this error "There are 0 datanode(s) running and no node(s) are excluded in this operation" ?
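STEP 1: stop Hadoop and clean the temp files for hduser
sudo rm -R /tmp/*

STEP 2: format the namenode
hdfs namenode -format

Note that formatting the NameNode erases the existing HDFS metadata, so this fix only suits a new or test cluster.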

89) What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in YARN?
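mapreduce.map.memory.mb is the upper memory limit that Hadoop allows to be allocated to a mapper, in megabytes. The default depends on the Hadoop version and distribution (1024 in stock Hadoop 2; the example below comes from a cluster configured with 512). If this limit is exceeded, Hadoop will kill the mapper with an error like the one below: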
Container[pid=container_1406552545451_0009_01_000002,containerID=container_234132_0001_01_000001] is running beyond physical memory limits. Current usage: 569.1 MB of 512 MB physical memory used; 970.1 MB of 1.0 GB virtual memory used. Killing container.

A Hadoop mapper is a Java process, and each Java process has its own maximum heap size, configured via mapred.map.child.java.opts (or mapreduce.map.java.opts in Hadoop 2+). If the mapper process runs out of heap memory, the mapper throws a Java out-of-memory exception like the one below:
Error: java.lang.RuntimeException: java.lang.OutOfMemoryError

The Hadoop and Java settings are related. The Hadoop setting is more of a resource enforcement or control setting, and the Java setting is more of a resource configuration one.
The Java heap size should be smaller than the Hadoop container memory limit, because we need to reserve memory for the JVM's non-heap usage (code, thread stacks and so on). Usually, it is recommended to reserve about 20% of the memory for this. If the settings are correct, Java-based Hadoop tasks should never get killed by Hadoop, so you should never see the "Killing container" error above.

If you experience Java out of memory errors, you have to increase both memory settings.
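
As a worked example of the 20% rule, the Python sketch below derives a heap setting from a container size. The helper name and the 0.8 fraction are assumptions taken from the rule of thumb above; tune the fraction for your own workload.

```python
def java_opts_for_container(container_mb, heap_fraction=0.8):
    """Return a -Xmx option that leaves (1 - heap_fraction) of the container for non-heap memory."""
    heap_mb = int(container_mb * heap_fraction)
    return "-Xmx%dm" % heap_mb

for container_mb in (512, 1024, 2048):
    # Pair the first value with mapreduce.map.memory.mb and the second with mapreduce.map.java.opts.
    print(container_mb, java_opts_for_container(container_mb))
```

So a 1024 MB container would be paired with a heap of about 819 MB, and raising one setting without the other either wastes memory or brings back the "Killing container" error.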

90) Spark Job Application is in ACCEPTED State which is never changed for long time, what we have to do .?
One fix is to edit /etc/hadoop/conf/capacity-scheduler.xml and change the property yarn.scheduler.capacity.maximum-am-resource-percent from 0.1 to 0.5.

Changing this setting increases the fraction of the resources that can be allocated to application masters, which increases the number of masters that can run at once and hence the number of possible concurrent applications.

If you are running on a small cluster where resources are limited (e.g. 3GB per node), you can also solve this problem by changing the minimum memory allocation to a sufficiently low number.

From:
yarn.scheduler.minimum-allocation-mb: 1g
yarn.scheduler.increment-allocation-mb: 512m

To:
yarn.scheduler.minimum-allocation-mb: 256m
yarn.scheduler.increment-allocation-mb: 256m
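
To see why the smaller values help, here is a rough Python model of how a memory request turns into a container size. It is a simplification and not the scheduler's real code: the exact rounding rules differ between schedulers and versions, and the 3072 MB node and 300 MB request are made-up numbers.

```python
import math

def normalize_request(requested_mb, minimum_mb, increment_mb):
    """Round a request up to a multiple of increment_mb, and never below minimum_mb."""
    rounded = math.ceil(requested_mb / increment_mb) * increment_mb
    return max(minimum_mb, rounded)

def containers_per_node(node_mb, requested_mb, minimum_mb, increment_mb):
    """How many containers of the normalized size fit on one node."""
    return node_mb // normalize_request(requested_mb, minimum_mb, increment_mb)

node_mb, request_mb = 3072, 300   # hypothetical 3 GB node and 300 MB request
print(containers_per_node(node_mb, request_mb, minimum_mb=1024, increment_mb=512))
print(containers_per_node(node_mb, request_mb, minimum_mb=256, increment_mb=256))
```

Under this model the same node fits twice as many containers after the change, which leaves room for both the application master and its executors.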

91) How to prevent Spark Executors from getting Lost when using YARN client mode?
If you are using YARN, the solution is to set 
--conf spark.yarn.executor.memoryOverhead=600
(this property is named spark.executor.memoryOverhead from Spark 2.3 onwards).

Alternatively, if your cluster uses Mesos, you can try 
--conf spark.mesos.executor.memoryOverhead=600 instead.

92) What is a container in YARN? Is it same as the child JVM in which the tasks on the nodemanager run or is it different?
A container is where a YARN application runs. Containers are available on each node. The Application Master negotiates containers with the scheduler (one of the components of the Resource Manager). Containers are launched by the Node Manager.

It represents a bundle of resources (such as memory and CPU) on a single node of the cluster.
A container is
  • supervised by the node manager
  • scheduled by the resource manager
One MR task runs in such container(s).

A container is where the application runs its tasks. If you want to know the total number of running containers in a cluster, you can check the YARN UI of your cluster.

Yarn URL: http://Your-Active-ResourceManager-IP:45020/cluster/apps/RUNNING

The "Running containers" column shows the number of running containers.



...Thank You...
