Saturday, August 18, 2018

Sqoop Interview Questions and Answers



Welcome to BigDatapedia


Please subscribe to our new YouTube channel "BigDatapedia"; all our tips videos are available there.
Click the link below and subscribe:
https://www.youtube.com/channel/BigDatapedia

Telegram Link: 
https://t.me/BigDatapedia
---------------------------------------------------------------------------

Sqoop Interview Questions with Answers

1) What is SQOOP..?

The name is a contraction of SQL + Hadoop (SQL-to-Hadoop).

It is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

Sqoop is mainly intended for:
  • System and application programmers
  • System administrators
  • Database administrators
  • Data analysts
  • Data engineers

2) Why is the default number of mappers 4 in Sqoop? 
The default of four mappers keeps the number of concurrent tasks against the source database modest for a single machine. You can set a higher number of mappers for greater parallelism, which can result in faster job completion, at the cost of more concurrent load on the database.

3) Is it possible to set speculative execution in Sqoop?
In Sqoop, speculative execution is off by default, because if multiple mappers ran for a single task we would get duplicate data in HDFS; it is turned off to avoid this discrepancy. Also, the number of reducers for a Sqoop job is 0, since it is merely a map-only job that dumps data into HDFS; we are not aggregating anything.

4) What causes Hadoop to throw a ClassNotFoundException during Sqoop integration?
The most common cause is that a supporting library (such as a JDBC connector) was not placed in Sqoop's library path, so we need to add the required JAR to that specific path.

5) How to view all the databases and tables in RDBMS from SQOOP..?
We can use the below commands:
  • sqoop-list-databases
  • sqoop-list-tables
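For example, assuming a MySQL server reachable as mysqlhost with a database mydb and user dinesh (all hypothetical names), the commands would look like:
$ sqoop list-databases --connect jdbc:mysql://mysqlhost/ --username dinesh -P
$ sqoop list-tables --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P
(-P prompts for the password instead of passing it on the command line.)
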
6) How to view table column details in an RDBMS from Sqoop?
Unfortunately there is no command like sqoop-list-columns, but we can achieve this via a free-form query against the information schema of the particular RDBMS.

Here is an example:
$ sqoop eval --connect jdbc:mysql://nameofmyserver/nameofmydatabase --username dineshkumar --password dineshkumar --query "SELECT column_name, data_type FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'mytableofinterest'"

7) I am getting a FileAlreadyExists exception in Sqoop while importing data from an RDBMS into a Hive table. How do we resolve it?
You can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced; after the data is imported into HDFS, it is loaded into Hive over the old contents.

8) What is the default file format to import data using Apache Sqoop? 
Sqoop allows data to be imported using two file formats

i) Delimited Text File Format
This is the default file format to import data using Sqoop. This file format can be explicitly specified using the --as-textfile argument to the import command in Sqoop. Passing this as an argument to the command will produce the string-based representation of all the records to the output files with the delimiter characters between rows and columns.

ii) Sequence File Format
It is a binary file format where records are stored in custom record-specific data types which are shown as Java classes. Sqoop automatically creates these data types and manifests them as java classes.
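A rough sketch of both options (server, database, table and credentials are hypothetical):
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --as-textfile
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --as-sequencefile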

9) How do I resolve a Communications Link Failure when connecting to MySQL? 
Verify that you can connect to the database from the node where you are running Sqoop:
$ mysql --host=<IP address> --database=test --user=<username> --password=<password>
Add the network port for the server to your my.cnf file.
Set up a user account to connect via Sqoop, and grant permissions to that user to access the database over the network:
Log into MySQL as root: mysql -u root -p
Issue the following command: mysql> grant all privileges on test.* to 'testuser'@'%' identified by 'testpassword';

10) How do I resolve an IllegalArgumentException when connecting to Oracle? 
This can be caused by a non-owner trying to connect to the table, so prefix the table name with the schema, for example SchemaName.OracleTableName.

11) What's causing this Exception in thread main java.lang.IncompatibleClassChangeError when running non-CDH Hadoop with Sqoop? 

Try building Sqoop 1.4.1-incubating with the command line property -Dhadoopversion=20. 

12) I have around 300 tables in a database. I want to import all the tables from the database except the tables named Table298, Table 123, and Table299. How can I do this without having to import the tables one by one? 
This can be accomplished using the import-all-tables command in Sqoop and specifying the --exclude-tables option with it, as follows:
sqoop import-all-tables --connect <jdbc-uri> --username <user> --password <pass> --exclude-tables Table298,Table123,Table299

13) Does Apache Sqoop have a default database? 

Yes, MySQL is the default database. 

14) How can I import large objects (BLOB and CLOB objects) in Apache Sqoop? 
The Apache Sqoop import command does not support direct import of BLOB and CLOB large objects. To import large objects, JDBC-based imports have to be used, without the direct argument to the import utility.

15) How can you execute a free form SQL query in Sqoop to import the rows in a sequential manner? 
This can be accomplished using the -m 1 option in the Sqoop import command. It will create only one MapReduce task, which will then import the rows serially.

16) What is the difference between Sqoop and DistCP command in Hadoop? 
Both distCP (Distributed Copy in Hadoop) and Sqoop transfer data in parallel but the only difference is that distCP command can transfer any kind of data from one Hadoop cluster to another whereas Sqoop transfers data between RDBMS and other components in the Hadoop ecosystem like HBase, Hive, HDFS, etc. 

17) What is Sqoop metastore? 
Sqoop metastore is a shared metadata repository that lets remote users define and execute saved jobs created with sqoop job. The sqoop-site.xml file should be configured to connect to the metastore.

18) What is the significance of using –split-by clause for running parallel import tasks in Apache Sqoop? 
The --split-by clause is used to specify the column of the table used to generate splits for the import. It specifies the column that will be used for splitting the data into parallel tasks when importing into the Hadoop cluster, so --split-by helps achieve improved performance through greater parallelism. Apache Sqoop creates splits based on the values present in the column specified in the --split-by clause of the import command. If the --split-by clause is not specified, then the primary key of the table is used to create the splits during data import. At times the primary key of the table might not have evenly distributed values between the minimum and maximum range; under such circumstances --split-by can be used to specify some other column that has an even distribution of data, so that the import is efficient.
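As an illustration (all names are hypothetical), splitting the import of an EMPLOYEES table on its dept_id column across 4 mappers might look like:
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --split-by dept_id -m 4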

19) You use the --split-by clause but it still does not give optimal performance; how will you improve the performance further?
Using the --boundary-query clause. Generally, Sqoop uses the query SELECT MIN(split_column), MAX(split_column) FROM table to find the boundary values for creating splits. If this query is not optimal, the --boundary-query argument lets you supply any query that returns the two boundary values.
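A hedged sketch of how --boundary-query is typically supplied (table and column names are made up):
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --split-by emp_id --boundary-query "SELECT MIN(emp_id), MAX(emp_id) FROM EMPLOYEES WHERE emp_id IS NOT NULL"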

20) During a Sqoop import, you use the clause -m or --num-mappers to specify the number of mappers as 8 so that it can run eight parallel MapReduce tasks; however, Sqoop runs only four parallel MapReduce tasks. Why? 
The Hadoop MapReduce cluster is configured to run a maximum of 4 parallel MapReduce tasks, so the Sqoop import can be configured with a number of parallel tasks less than or equal to 4 but not more.

21) You successfully imported a table using Apache Sqoop to HBase but when you query the table it is found that the number of rows is less than expected. What could be the likely reason? 
If the imported records have rows that contain null values for all the columns, then probably those records might have been dropped off during import because HBase does not allow null values in all the columns of a record. 

22) The incoming value from HDFS for a particular column is NULL. How will you load that row into RDBMS in which the columns are defined as NOT NULL? 
Using the --input-null-string parameter, a default value can be specified so that the row gets inserted with the default value for the column that has a NULL value in HDFS.

23) How will you synchronize the data in HDFS that is imported by Sqoop? 
Data can be synchronised using the --incremental parameter with the data import. The --incremental parameter can be used with one of two options:
i) append - If the table is continuously getting new rows with increasing row id values, incremental import with the append option should be used; the column to check is specified with --check-column, and only rows whose value exceeds the last imported value are brought in as new rows.
ii) lastmodified - In this kind of incremental import, the source has a date column which is checked; any records that have been updated after the last import, based on the lastmodified column in the source, are imported.
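For illustration (table, columns and values are hypothetical), the two modes look roughly like this:
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table ORDERS --incremental append --check-column order_id --last-value 10000
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table ORDERS --incremental lastmodified --check-column updated_at --last-value "2018-08-01 00:00:00" --merge-key order_id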

24) What are the relational databases supported in Sqoop? 
Below is the list of RDBMSs that are currently supported by Sqoop:
MySQL
PostgreSQL
Oracle
Microsoft SQL Server
IBM Netezza
Teradata

25) What are the destination types allowed in Sqoop Import command? 
Currently Sqoop supports data import into the below services:
HDFS
Hive
HBase
HCatalog
Accumulo 

26)  Is Sqoop similar to distcp in hadoop? 
Partially, yes. Hadoop's distcp command is similar to the Sqoop import command in that both submit parallel map-only jobs.
But distcp is used to copy any type of files from the local FS/HDFS to HDFS, whereas Sqoop transfers data records only between an RDBMS and Hadoop ecosystem services such as HDFS, Hive and HBase.

27) What are the most widely used commands in Sqoop? 
In Sqoop, the import and export commands are used the most, but the below commands are also useful at times:
codegen
eval
import-all-tables
job
list-databases
list-tables
merge
metastore 

28) While loading tables from MySQL into HDFS, if we need to copy tables with maximum possible speed, what can you do ? 
We need to use the --direct argument in the import command to use the direct import fast path; --direct can be used only with MySQL and PostgreSQL as of now.
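A minimal sketch, assuming a MySQL source and hypothetical names and paths:
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --direct --target-dir /user/dinesh/employees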

29) While connecting to MySQL through Sqoop, I am getting Connection Failure exception what might be the root cause and fix for this error scenario? 
This might be due to insufficient permissions to access your MySQL database over the network. To confirm this, we can try the below command to connect to the MySQL database from Sqoop's client machine.

$ mysql --host=<MySQL node> --database=test --user=<username> --password
If this is the case, then we need to grant permissions to the user at the Sqoop client machine, as shown in the answer to the Communications Link Failure question (9) earlier in this post.

30) What is the importance of eval tool? 
It allows users to run sample SQL queries against the database and preview the results on the console.

31) What is the process to perform an incremental data load in Sqoop? 
The process of performing an incremental data load in Sqoop is to synchronize the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data can be loaded through the incremental load command in Sqoop.

Incremental load can be performed by using the Sqoop import command or by loading the data into Hive without overwriting it. The different attributes that need to be specified during an incremental load in Sqoop are:
1) Mode (incremental) - The mode defines how Sqoop will determine what the new rows are. The mode can have the value Append or Last Modified.
2) Col (check-column) - This attribute specifies the column that should be examined to find out the rows to be imported.
3) Value (last-value) - This denotes the maximum value of the check column from the previous import operation.

32) What is the significance of using the --compression-codec parameter? 
To get the output files of a Sqoop import in a compression format other than the default .gz (for example .bz2), we use the --compression-codec parameter along with --compress.
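For example (connection details are hypothetical), importing with bzip2 compression might look like:
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec --target-dir /user/dinesh/employees_bz2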

33) Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used? 
Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.

34) What is the purpose of sqoop-merge? 
The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset preserving only the newest version of the records between both the data sets. 
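A rough sketch of the merge tool (directories, class name and key column are hypothetical; the jar and class normally come from sqoop codegen):
$ sqoop merge --new-data /user/dinesh/emp_new --onto /user/dinesh/emp_old --target-dir /user/dinesh/emp_merged --jar-file EMPLOYEES.jar --class-name EMPLOYEES --merge-key emp_id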

35) How do you clear the data in a staging table before loading it by Sqoop? 
By specifying the --clear-staging-table option we can clear the staging table before it is loaded; any data left over from a previous (possibly failed) run is removed first.

36) How will you update the rows that are already exported? 
The parameter --update-key can be used to update existing rows. A comma-separated list of columns is used that uniquely identifies a row; all of these columns are used in the WHERE clause of the generated UPDATE query, while all other table columns are used in the SET part of the query.

37) What is the role of JDBC driver in a Sqoop set up? 
To connect to different relational databases, Sqoop needs a connector. Almost every DB vendor makes this connector available as a JDBC driver specific to that DB, so Sqoop needs the JDBC driver of each of the databases it needs to interact with.

38) When to use --target-dir and --warehouse-dir while importing data? 
To specify a particular directory in HDFS, use --target-dir; to specify the parent directory for all Sqoop jobs, use --warehouse-dir. In the latter case, Sqoop will create a directory under the parent directory with the same name as the table.
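To illustrate the difference (paths and names are hypothetical):
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --target-dir /user/dinesh/employees
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --warehouse-dir /user/dinesh/sqoop_imports
In the second form the data lands under /user/dinesh/sqoop_imports/EMPLOYEES.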

39) When the source data keeps getting updated frequently, what is the approach to keep it in sync with the data in HDFS imported by sqoop? 
Sqoop has two approaches:

  • To use the --incremental parameter with append option where value of some columns are checked and only in case of modified values the row is imported as a new row.
  • To use the --incremental parameter with lastmodified option where a date column in the source is checked for records which have been updated after the last import.
40) Is it possible to add a parameter while running a saved job? 
Yes, we can add an argument to a saved job at runtime by using the --exec option
sqoop job --exec jobname -- --newparameter <value>

41) Sqoop takes a long time to retrieve the minimum and maximum values of the column mentioned in the --split-by parameter. How can we make it efficient? 
We can use the --boundary-query parameter, in which we specify the min and max values for the column on which the split into multiple MapReduce tasks will happen. This makes it faster because the query inside the --boundary-query parameter is executed first, and the job knows how many MapReduce tasks to create before executing the main query.

42) How will you implement all-or-nothing load using sqoop ? 
Using the --staging-table option, we first load the data into a staging table and then load it into the final target table only if the staging load is successful.
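A hedged example of an export using a staging table (table names and paths are made up):
$ sqoop export --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --export-dir /user/dinesh/employees --staging-table EMPLOYEES_STG --clear-staging-table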

43) How will you update the rows that are already exported ?
The parameter --update-key can be used to update existing rows. A comma-separated list of columns is used that uniquely identifies a row; all of these columns are used in the WHERE clause of the generated UPDATE query, while all other table columns are used in the SET part of the query.

44) How can you sync an exported table with HDFS data in which some rows are deleted? 
Truncate the target table and load it again. 

45) How can we load to a column in a relational table which is not null but the incoming value from HDFS has a null value.? 
By using the --input-null-string parameter we can specify a default value, and that will allow the row to be inserted into the target table.

46) How can you schedule a sqoop job using Oozie? 
Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed. 

47) Sqoop imported a table successfully to HBase but it is found that the number of rows is fewer than expected. What can be the cause? 
Some of the imported records might have null values in all the columns. As HBase does not store a row whose columns are all null, those rows get dropped.

48) How can you force sqoop to execute a free form Sql query only once and import the rows serially. ?
By using the -m 1 clause in the import command, Sqoop creates only one MapReduce task, which will import the rows sequentially.

49) In a sqoop import command you have mentioned to run 8 parallel Mapreduce task but sqoop runs only 4. What can be the reason? 
The MapReduce cluster is configured to run 4 parallel tasks, so the Sqoop command must request a number of parallel tasks less than or equal to what the MapReduce cluster allows.

50) What happens when a table is imported into a HDFS directory which already exists using the –append parameter? 
Using the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory. 

51) How to import only the updated rows from a table into HDFS using Sqoop, assuming the source has last-update timestamp details for each row? 
By using the lastmodified mode. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported. 

52) Give a Sqoop command to import all the records from employee table divided into groups of records by the values in the column department_id. 
$ sqoop import --connect jdbc:mysql://<host>/DineshDB --table EMPLOYEES --split-by department_id -m 2

53) What does the following query do? 
$ sqoop import --connect jdbc:mysql://<host>/DineshDB --table sometable --where "id > 1000" --target-dir /home/dinesh/sqoopincremental --append 
It performs an incremental import of new data, after having already imported the first 1000 rows of a table 

54) What is the importance of $CONDITIONS in Sqoop..?
Sqoop performs highly efficient data transfers by inheriting Hadoop’s parallelism.

  • To help Sqoop split your query into multiple chunks that can be transferred in parallel, you need to include the $CONDITIONS placeholder in the where clause of your query.
  • Sqoop will automatically substitute this placeholder with the generated conditions specifying which slice of data should be transferred by each individual task.
  • While you could skip $CONDITIONS by forcing Sqoop to run only one job using the --num-mappers 1 parameter, such a limitation would have a severe performance impact.

For example:-
If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. One mapper may execute "SELECT * FROM TblDinesh WHERE (salary >= 0 AND salary < 10000)", the next mapper may execute "SELECT * FROM TblDinesh WHERE (salary >= 10000 AND salary < 20000)", and so on.
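A minimal sketch of such a parallel free-form import (names and paths are hypothetical; note that $CONDITIONS is escaped inside double quotes):
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --query "SELECT * FROM TblDinesh WHERE \$CONDITIONS" --split-by salary --target-dir /user/dinesh/tbldinesh -m 4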

55) Can Sqoop run without a Hadoop cluster?
To run Sqoop commands, Hadoop is a mandatory prerequisite. You cannot run sqoop commands without the Hadoop libraries.

56) Is it possible to import a file in fixed column length from the database using sqoop import? 
To import columns of a fixed length from any database, you can use a free-form query as below:

sqoop import --connect jdbc:oracle:* --username Dinesh --password pwd 
-e "select substr(COL1,1,4000),substr(COL2,1,4000) from table where \$CONDITIONS" 
--target-dir /user/dineshkumar/table_name --as-textfile -m 1

57) How to use sqoop validation?
You can use this parameter (--validate) to validate the counts between what’s imported/exported between RDBMS and HDFS.

58) How do you pass Sqoop command arguments in a file?
To use an options file, simply create one in a convenient location and pass it on the command line via the --options-file argument.
eg: sqoop --options-file /users/homer/work/import.txt --table TEST
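The options file itself simply lists one option or value per line. A sketch of what /users/homer/work/import.txt might contain (connection details are placeholders):
import
--connect
jdbc:mysql://mysqlhost/mydb
--username
dinesh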

59) Is it possible to import data into targets other than HDFS and Hive?
Yes, Sqoop supports additional import targets beyond HDFS and Hive; it can also import records into a table in HBase or Accumulo.

60) Is it possible to use the Sqoop --direct option with HBase?
No, HBase import is incompatible with direct import. Sqoop can instead do bulk loading as opposed to direct writes; to use bulk loading, enable it using --hbase-bulkload.
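For illustration (the HBase table, column family and row key are hypothetical), a bulk-load import might look like:
$ sqoop import --connect jdbc:mysql://mysqlhost/mydb --username dinesh -P --table EMPLOYEES --hbase-table employees --column-family info --hbase-row-key emp_id --hbase-bulkload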

61) Can I configure two sqoop command so that they are dependent on each other? Like if the first sqoop job is successful, second gets triggered. If first fails, second should not run..?
No, it is not possible using Sqoop commands alone, but you can use Oozie for this: create an Oozie workflow and execute the second action only if the first action succeeds.

62) What is Uber mode and where are the settings to enable it in Hadoop?
Normally, mappers and reducers are launched by the ResourceManager (RM), which creates a separate container for each mapper and reducer. The Uber configuration allows the mappers and reducers to run in the same process as the ApplicationMaster (AM).

Uber jobs :
Uber jobs are jobs that are executed within the MapReduce ApplicationMaster, rather than communicating with the RM to create separate mapper and reducer containers. The AM runs the map and reduce tasks within its own process and avoids the overhead of launching and communicating with remote containers.

Why :
If you have a small dataset or you want to run MapReduce on a small amount of data, the Uber configuration will help you by reducing the additional time that MapReduce normally spends launching mappers and reducers.

Can I configure Uber mode for all MapReduce jobs?
As of now, map-only jobs and jobs with one reducer are supported.

Settings:
An Uber job occurs when the mappers and reducers are combined to run inside a single container. There are four core settings around the configuration of Uber jobs in mapred-site.xml:
mapreduce.job.ubertask.enable
mapreduce.job.ubertask.maxmaps
mapreduce.job.ubertask.maxreduces
mapreduce.job.ubertask.maxbytes

...All The Best...

Monday, August 13, 2018

Hive Interview Questions with Answers



---------------------------------------------------------------------------
Hive Interview Questions with Answers

1) What is Hive?
Apache Hive is an open source data warehouse system built on top of Hadoop. Its queries are similar to SQL. We can use Hive for analyzing and querying large data sets stored in Hadoop.

2) Why do we need Hive?
Hive is a tool in the Hadoop ecosystem which provides an interface to organize and query data in a database-like fashion and write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.

3) What is a metastore in Hive?
It is a relational database storing the metadata of Hive tables, partitions, Hive databases, etc.
When you create a table, this metastore gets updated with the information related to the new table, which gets queried when you issue queries on that table.

4) Is Hive suitable to be used for OLTP systems? Why?
No. Hive does not provide insert and update at the row level, so it is not suitable for OLTP systems.

5) Can you explain ACID transactions in Hive?
Hive supports ACID transactions. The full form of ACID is Atomicity, Consistency, Isolation and Durability.
ACID transactions are provided at the row level: Insert, Delete and Update operations are supported, which is how Hive offers ACID semantics.

6) What are the types of tables in Hive? 
There are two types of tables in Hive : Internal Table(aka Managed Table) and External table.

7) What kind of data warehouse application is suitable for Hive?
Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS put restrictions on what Hive can do. Hive is most suitable for data warehouse applications
where the data being analyzed is relatively static, fast response times are not required, and the data is not changing rapidly.

Hive does not provide fundamental features required for OLTP (Online Transaction Processing). Hive is suitable for data warehouse applications in large data sets.

8) Explain what is a Hive variable. What do we use it for?
A Hive variable is created in the Hive environment and referenced by Hive scripts. It provides a way to pass values to Hive queries when the query starts executing, and it can be used together with the source command.

9) How to change the warehouse.dir location for older tables? 
To change the base location of Hive tables, edit the hive.metastore.warehouse.dir parameter. This will not affect older tables; for them the metadata needs to be changed in the metastore database (MySQL or Derby), where the location of Hive tables is kept in the SDS table, LOCATION column.

10) What are the types of metastore available in Hive .?
There are three types of metastores available in Hive:
Embedded metastore (Derby)
Local metastore
Remote metastore

11) Is it possible to use same metastore by multiple users, in case of embedded hive? 
No, it is not possible to share the metastore in embedded mode. It is recommended to use a standalone real database like MySQL or PostgreSQL.

12) If you run hive server, what are the available mechanism for connecting it from application? 
You can connect to the Hive Server in the following ways:
1. Thrift Client: using Thrift you can call Hive commands from various programming languages, e.g. C++, Java, PHP, Python and Ruby.
2. JDBC Driver: it supports the JDBC protocol for Java applications.
3. ODBC Driver: It supports ODBC protocol.

13) What is SerDe in Apache Hive ? 
A SerDe is a short name for a Serializer Deserializer. 

Hive uses a SerDe, together with a FileFormat, to read and write data from tables. An important concept behind Hive is that it DOES NOT own the Hadoop File System format that data is stored in. Users are able to write files to HDFS with whatever tools or mechanism takes their fancy (CREATE EXTERNAL TABLE or LOAD DATA INPATH) and use Hive to correctly parse that file format in a way that can be used by Hive. A SerDe is a powerful and customizable mechanism that Hive uses to parse data stored in HDFS so it can be used by Hive.

14)Which classes are used by the Hive to Read and Write HDFS Files ?
Following classes are used by Hive to read and write HDFS files

TextInputFormat or HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format.

SequenceFileInputFormat or SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.

15) Give examples of the SerDe classes which hive uses to Serialize and Deserialize data ? 
Hive currently use these SerDe classes to serialize and Deserialize data:

MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records like CSV, tab-separated control-A separated records (quote is not supported yet.)

ThriftSerDe: This SerDe is used to read or write thrift serialized objects. The class file for the Thrift object must be loaded first.

DynamicSerDe: This SerDe also read or write thrift serialized objects, but it understands thrift DDL so the schema of the object can be provided at runtime. Also it supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol, TCTLSeparatedProtocol(which writes data in delimited records).

16) How do you write your own custom SerDe and what is the need for that? 
In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read their own data format instead of writing to it.

For example, the RegexDeserializer will deserialize the data using the configuration parameter regex, and possibly a list of column names.

If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a protocol based on DynamicSerDe instead of writing a SerDe from scratch. The reason is that the framework passes DDL to the SerDe through the Thrift DDL format, and it's non-trivial to write a Thrift DDL parser.

Depending on the nature of data the user has, the inbuilt SerDe may not satisfy the format of the data. So users need to write their own java code to satisfy their data format requirements. 

17) What is ObjectInspector functionality ? 
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:
Instance of a Java class (Thrift or native Java)
A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map)
A lazily-initialized object (For example, a Struct of string fields stored in a single Java string object with starting offset for each field)
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.

In simple terms, ObjectInspector functionality in Hive is used to analyze the internal structure of the columns, rows, and complex objects. It allows to access the internal fields inside the objects.

18) What is the functionality of Query Processor in Apache Hive ? 
This component implements the processing framework for converting SQL into a graph of map/reduce jobs, and the execution-time framework to run those jobs in the order of their dependencies with the help of metastore details.

19) What is the limitation of Derby database for Hive metastore? 
With the Derby database, you cannot have multiple connections or multiple sessions instantiated at the same time. Derby runs in local mode and creates a lock file, so multiple users cannot access Hive simultaneously.

20) What are managed and external tables? 
We have got two things: the data, present in HDFS, and the metadata, present in some database. There are two categories of Hive tables, namely managed and external tables.
In managed tables, both the data and the metadata are managed by Hive, and if you drop a managed table, both data and metadata are deleted. There are situations where your data will be controlled by some other application and you want Hive to read that data, but you must not allow Hive to delete it. In such a case, you can create an external table in Hive. 

In the external table, metadata is controlled by Hive but the actual data will be controlled by some other application. So, when you delete a table accidentally, only the metadata will be lost and the actual data will reside wherever it is.

21) What are the complex data types in Hive? 
MAP: The Map contains a key-value pair where you can search for a value using the key.

STRUCT: A Struct is a collection of elements of different data types. For example, if you take the address, it can have different data types. For example, pin code will be in Integer format.

ARRAY: An Array will have a collection of homogeneous elements. For example, if you take your skillset, you can have N number of skills

UNIONTYPE: It represents a column which can have a value that can belong to any of the data types of your choice.

22) How does partitioning help in the faster execution of queries?
With the help of partitioning, a subdirectory is created for each value of the partition column, and when you run a query with a WHERE clause on that column, only the relevant subdirectories are scanned instead of the whole table. This gives you faster execution of queries.

23) How to enable dynamic partitioning in Hive? 
Related to partitioning there are two types of partitioning Static and Dynamic. In the static partitioning, you will specify the partition column while loading the data.

Whereas in dynamic partitioning, you push the data into Hive and Hive decides which value should go into which partition. To enable dynamic partitioning, you have to set the below properties:

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

Example: insert overwrite table emp_details_partitioned partition(location) 

select * from emp_details;

24) What is bucketing ?
The values in a column are hashed into a number of buckets defined by the user. It is a way to avoid too many partitions or nested partitions while still optimizing query performance.

25) How does bucketing help in the faster execution of queries? 
If you have to join two large tables, you can go for a reduce-side join. But if both tables have the same number of buckets (or multiples of each other) and are also sorted on the same column, there is the possibility of a Sort-Merge Bucket Map Join (SMBMJ), in which all the joins take place in the map phase itself by matching the corresponding buckets. Buckets are basically files that are created inside the HDFS directory.

There are different properties which you need to set for bucket map joins and they are as follows:
     set hive.enforce.sortmergebucketmapjoin = false;
     set hive.auto.convert.sortmerge.join = false;
     set hive.optimize.bucketmapjoin = true;

     set hive.optimize.bucketmapjoin.sortedmerge = true;

26) How to enable bucketing in Hive? 
By default bucketing is disabled in Hive; you can enable it by setting the below property:

     set hive.enforce.bucketing = true;
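A small sketch of how a bucketed table might be defined and populated (table and column names are made up):
CREATE TABLE emp_bucketed (id INT, name STRING, dept_id INT) CLUSTERED BY (dept_id) INTO 4 BUCKETS;
INSERT OVERWRITE TABLE emp_bucketed SELECT id, name, dept_id FROM emp_details;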

27) What are the different file formats in Hive?
Every file format has its own characteristics, and Hive allows you to easily choose the file format you want to use.

There are different file formats supported by Hive
Text File format
Sequence File format
Parquet
Avro
RC file format
ORC

28) How is SerDe different from File format in Hive? 
SerDe stands for Serializer and Deserializer. It determines how to encode and decode the field values (the column values) of a record, i.e. how you serialize and deserialize the values of a column. The file format, on the other hand, determines how records are stored in and retrieved from the files backing the table.

29) What is RegexSerDe? 
Regex stands for regular expression. Whenever you want to extract fields from a line using pattern matching, you can use RegexSerDe to define how the line is split into columns.

RegexSerDe is present in org.apache.hadoop.hive.contrib.serde2.RegexSerDe.

In the SerDeproperties, you have to define your input pattern and output fields. For example, you have to get the column values from line xyz/pq@def if you want to take xyz, pq and def separately.
To extract the pattern, you can use:
input.regex = (.*)/(.*)@(.*)
To specify how to store them, you can use
output.format.string = %1$s%2$s%3$s;
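Putting it together, a hedged example table using this SerDe (table and column names are made up):
CREATE TABLE parsed_line (part1 STRING, part2 STRING, part3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.*)/(.*)@(.*)",
  "output.format.string" = "%1$s %2$s %3$s"
)
STORED AS TEXTFILE;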

30) How is ORC file format optimised for data storage and analysis? 
ORC stores collections of rows in one file and within the collection the row data will be stored in a columnar format. With columnar format, it is very easy to compress, thus reducing a lot of storage cost.
While querying also, it queries the particular column instead of querying the whole row as the records are stored in columnar format.
ORC has got indexing on every block based on the statistics min, max, sum, count on columns so when you query, it will skip the blocks based on the indexing.

31) How to access HBase tables from Hive? 
Using Hive-HBase storage handler, you can access the HBase tables from Hive and once you are connected, you can query HBase using the SQL queries from Hive. You can also join multiple tables in HBase from Hive and retrieve the result. 

32) When running a JOIN query, I see out-of-memory errors.? 
This is usually caused by the order of the JOIN tables. Instead of [FROM tableA a JOIN tableB b ON ], try [FROM tableB b JOIN tableA a ON ]. Note that if you are using LEFT OUTER JOIN, you might want to change to RIGHT OUTER JOIN. This trick usually solves the problem; the rule of thumb is to always put the table with a lot of rows having the same value in the join key on the rightmost side of the JOIN.

33) Have you used MySQL as the metastore and faced errors like com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure? 
This is usually caused by MySQL servers closing connections after the connection has been idle for some time. Running the following command on the MySQL server will solve the problem: set global wait_timeout=120;
When using MySQL as a metastore I see the error [com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes].
This is a known limitation of MySQL 5.0 and UTF8 databases. One option is to use another character set, such as latin1, which is known to work.

34) Does Hive support Unicode? 
You can use Unicode strings in data or comments, but you cannot use Unicode for database, table or column names.
You can use UTF-8 encoding for Hive data. However, other encodings are not supported (HIVE-7142 introduces encoding support for LazySimpleSerDe; however, the implementation is not complete and does not address all cases).

35) Are Hive SQL identifiers (e.g. table names, columns, etc) case sensitive? 
No, Hive is case insensitive. 

36) What is the best way to load xml data into hive ?
The easiest way is to use the Hive XML SerDe (com.ibm.spss.hive.serde2.xml.XmlSerDe), which will allow you to directly import and work with XML data. 

37) When Hive is not suitable? 
Hive does not provide OLTP transaction support, only OLAP-style processing. If the application requires OLTP, switch to a NoSQL database. HQL queries also have higher latency, due to MapReduce.

38) Mention what are the different modes of Hive? 
Depending on the size of the data and the Hadoop cluster, Hive can operate in two modes.
These modes are local mode and MapReduce mode.

39) Mention what is (HS2) HiveServer2? 
It is a server interface that performs the following functions:
  • It allows remote clients to execute queries against Hive
  • It retrieves the results of those queries
  • Its latest version, based on Thrift RPC, includes advanced features such as:
  • Multi-client concurrency
  • Authentication
40) Mention what Hive query processor does? 
The Hive query processor converts the query into a graph of MapReduce jobs together with the execution-time framework, so that the jobs can be executed in the order of their dependencies.

41) Mention what are the steps of Hive in query processor? 
The components of a Hive query processor include,
  • Logical Plan Generation
  • Physical Plan Generation
  • Execution Engine
  • Operators
  • UDFs and UDAFs
  • Optimizer
  • Parser
  • Semantic Analyzer
  • Type Checking
42) Explain how can you change a column data type in Hive? 
You can change a column data type in Hive by using command,
ALTER TABLE table_name CHANGE column_name column_name new_datatype; 

43) Mention what is the difference between order by and sort by in Hive? 
SORT BY will sort the data within each reducer. You can use any number of reducers for SORT BY operation.
ORDER BY will sort all of the data together, which has to pass through one reducer. Thus, ORDER BY in Hive uses a single reducer.

44) Explain when to use explode in Hive? 
Hadoop developers sometimes take an array as input and need to convert it into separate table rows. To convert complex data types into the desired table format, we can use the explode function.

45) Mention how you can stop a partition from being queried? 
You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.

46) Can we rename a Hive table ?
Yes, using the below command:
ALTER TABLE table_name RENAME TO new_name;

47) What is the default location where hive stores table data?
hdfs://namenode_server/user/hive/warehouse

48) Is there a date data type in Hive?
Yes. The TIMESTAMP data type stores dates in the java.sql.Timestamp format.

49) Can we run unix shell commands from hive? Give example.
Yes, using the ! mark just before the command.
For example !pwd at hive prompt will list the current directory.

50) Can hive queries be executed from script files? How?
Using the source command.
Example −
Hive> source /path/to/file/file_with_query.hql

51) What is the importance of .hiverc file?
It is a file containing a list of commands that need to run when the Hive CLI starts, for example setting strict mode to true.

52) What are the default record and field delimiter used for hive text files?
The default record delimiter is − \n
And the field delimiters are − \001, \002, \003

53) What do you mean by schema on read?
The schema is validated against the data when reading it, rather than being enforced when the data is written.

54) How do you list all databases whose name starts with p?
SHOW DATABASES LIKE 'p.*';

55) What does the “USE” command in hive do?
With the use command you fix the database on which all the subsequent hive queries will run.

56) How can you delete the DBPROPERTY in Hive?
There is no way you can delete the DBPROPERTY.

57) What is the significance of the line.?
set hive.mapred.mode = strict;
It sets MapReduce jobs to strict mode, in which queries on partitioned tables cannot run without a WHERE clause on the partition column. This prevents a very large job from running for a long time.

58) How do you check if a particular partition exists?
This can be done with following query
SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value');

59) Which java class handles the Input and Output records encoding into files in Hive tables ?
For Input:  org.apache.hadoop.mapred.TextInputFormat
For Output: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

60) What is the significance of ‘IF EXISTS” clause while dropping a table?
When we issue the command DROP TABLE IF EXISTS table_name,
Hive does not throw an error even if the table being dropped does not exist in the first place.

61) When you point a partition of a hive table to a new directory, what happens to the data?
The data stays in the old location. It has to be moved manually.
Write a query to insert a new column (new_col INT) into a Hive table (htab) at a position before an existing column (x_col):
ALTER TABLE htab
CHANGE COLUMN new_col new_col INT
BEFORE x_col;

62) Does archiving Hive tables save any space in HDFS?
No. It only reduces the number of files, which makes it easier for the NameNode to manage them.

63) While loading data into a hive table using the LOAD DATA clause, how do you specify it is a hdfs file and not a local file ?
By omitting the LOCAL clause in the LOAD DATA statement.

64) If you omit the OVERWRITE clause while loading data into a Hive table, what happens to files which are new and files which already exist?
The new incoming files are just added to the target directory, and existing files with the same names are simply overwritten. Other files whose names do not match any of the incoming files will continue to exist. If you add the OVERWRITE clause, then all the existing data in the directory will be deleted before the new data is written.

65) What does the following query do?
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;
It creates partition on table employees with partition values coming from the columns in the select clause. It is called Dynamic partition insert.

66) What is a Table generating Function on hive?
A table generating function is a function which takes a single column as argument and expands it to multiple columns or rows. An example is explode().

67) How can Hive avoid mapreduce?
If we set the property hive.exec.mode.local.auto to true, then for small inputs Hive will try to run the query locally instead of launching a full MapReduce job on the cluster to fetch query results.
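For example:
hive> SET hive.exec.mode.local.auto = true;
Related thresholds such as hive.exec.mode.local.auto.inputbytes.max control when local mode kicks in.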

68) What is the difference between LIKE and RLIKE operators in Hive?
The LIKE operator behaves the same way as the regular SQL operators used in select queries. Example −
street_name LIKE '%Chi'
But the RLIKE operator uses more advanced regular expressions which are available in Java.
Example − street_name RLIKE '.*(Chi|Oho).*' , which will select any value that has either Chi or Oho in it.

69) Is it possible to create Cartesian join between 2 tables, using Hive?
No. As this kind of Join can not be implemented in map reduce

70) What should be the order of table size in a join query?
In a join query, the smallest table should be placed in the first position and the largest table in the last position.

71) What is the usefulness of the DISTRIBUTE BY clause in Hive?
It controls how the map output is distributed among the reducers. It is useful in the case of streaming data.

72) How will you convert the string ’51.2’ to a float value in the price column?
Select cast(price as FLOAT)

73) What will be the result when you do cast(‘abc’ as INT)?
Hive will return NULL

74) Can we LOAD data into a view?
No. A view cannot be the target of an INSERT or LOAD statement.

75) What types of costs are associated in creating index on hive tables?
Indexes occupy space and there is a processing cost in arranging the values of the column on which the index is created.

Give the command to see the indexes on a table:
SHOW INDEX ON table_name;
This will list all the indexes created on any of the columns in the table table_name.

76) What does /*streamtable(table_name)*/ do?
It is a query hint to stream a table into memory before running the query. It is a query optimization technique.

77) Can a partition be archived? What are the advantages and Disadvantages?
Yes, a partition can be archived. The advantage is that it decreases the number of files stored in the NameNode, and the archived files can still be queried using Hive. The disadvantage is that queries become less efficient, and archiving does not offer any space savings.

78) What is a generic UDF in hive?
It is a UDF which is created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of its input arguments programmatically and provide an appropriate response.

79) The following statement failed to execute. What can be the cause?
LOAD DATA LOCAL INPATH ‘${env:HOME}/country/state/’
OVERWRITE INTO TABLE address;
The local inpath should point to a file and not a directory. The ${env:HOME} reference is a valid variable available in the Hive environment.

80) How do you specify the table creator name when creating a table in Hive?
The TBLPROPERTIES clause is used to add the creator name while creating a table.
The TBLPROPERTIES is added like −
TBLPROPERTIES('creator'='Joan')

81) Which method has to be overridden when we use custom UDF in Hive? 
Whenever you write a custom UDF in Hive, you have to extend the UDF class and you have to override the evaluate() function.


82) Suppose I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time? 
The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. Local or remote metastore configuration in Apache Hive for allowing access to multiple clients concurrently.
Following are the steps to configure MySQL database as the local metastore in Apache Hive:
One should make the following changes in hive-site.xml:
javax.jdo.option.ConnectionURL should be set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true.
javax.jdo.option.ConnectionDriverName property should be set to com.mysql.jdbc.Driver.
One should also set the username and password as:
javax.jdo.option.ConnectionUserName is set to desired username.
javax.jdo.option.ConnectionPassword is set to the desired password.
The JDBC driver JAR file for MySQL must be on the Hive classpath, i.e. The jar file should be copied into the Hive lib directory.
Now, after restarting the Hive shell, it will automatically connect to the MySQL database which is running as a standalone metastore.

83) Is it possible to change the default location of a managed table? 
Yes, it is possible to change the default location of a managed table. It can be achieved by using the clause LOCATION [hdfs_path].

84) When should we use SORT BY instead of ORDER BY? 
We should use SORT BY instead of ORDER BY when we have to sort huge datasets because SORT BY clause sorts the data using multiple reducers whereas ORDER BY sorts all of the data together using a single reducer. Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute. 

85) What is dynamic partitioning and when is it used? 
In dynamic partitioning, the values for the partition columns are known only at runtime, i.e. they are determined while the data is being loaded into the Hive table.

One may use dynamic partition in following two cases:
  1. Loading data from an existing non-partitioned table to improve the sampling and therefore, decrease the query latency.
  2. When one does not know all the values of the partitions before hand and therefore, finding these partition values manually from a huge data sets is a tedious task.
86) Suppose I create a table that contains details of all the transactions done by the customers during the year 2016: CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Now, after inserting 50,000 tuples in this table, I want to know the total revenue generated for each month. But, Hive is taking too much time in processing this query. How will you solve this problem and list the steps that I will be taking in order to do so?
We can solve this problem of query latency by partitioning the table according to each month. So, for each month we will be scanning only the partitioned data instead of whole data sets.
As we know, we can not partition an existing non-partitioned table directly. 
So, we will be taking following steps to solve the very problem:

1.Create a partitioned table, say partitioned_transaction:
CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
2. Enable dynamic partitioning in Hive:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
3. Transfer the data from the non – partitioned table into the newly created partitioned table:
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;
Now, we can perform the query using each partition and therefore, decrease the query time.

87) How can you add a new partition for the month December in the above partitioned table? 
For adding a new partition to the above table partitioned_transaction, we will issue the command given below:
ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec') LOCATION '/partitioned_transaction';

88) What is the default maximum dynamic partition that can be created by a mapper/reducer? How can you change it? 
By default the number of maximum partition that can be created by a mapper or reducer is set to 100. One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode = value

89) I am inserting data into a table based on partitions dynamically. But, I received an error FAILED ERROR IN SEMANTIC ANALYSIS: Dynamic partition strict mode requires at least one static partition column. How will you remove this error? 

To remove this error one has to execute following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

90) Suppose, I have a CSV file sample.csv present in temp directory with the following entries:
id first_name last_name email gender ip_address
1 Hugh Jackman hughjackman@cam.ac.uk Male 136.90.241.52
2 David Lawrence dlawrence1@gmail.com Male 101.177.15.130
3 Andy Hall andyhall2@yahoo.com Female 114.123.153.64
4 Samuel Jackson samjackson231@sun.com Male 89.60.227.31
5 Emily Rose rose.emily4@surveymonkey.com Female 119.92.21.19
How will you consume this CSV file into the Hive warehouse using a built-in SerDe? 

SerDe stands for serializer or deserializer. A SerDe allows us to convert the unstructured bytes into a record that we can process using Hive. SerDes are implemented using Java. Hive comes with several built-in SerDes and many other third-party SerDes are also available.

Hive provides a specific SerDe for working with CSV files. We can use this SerDe for the sample.csv by issuing following commands:
CREATE EXTERNAL TABLE sample
(id int, first_name string,
last_name string, email string,
gender string, ip_address string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE LOCATION '/temp';
Now, we can perform any query on the table sample:
SELECT first_name FROM sample WHERE gender = 'Male';

91) Suppose, I have a lot of small CSV files present in input directory in HDFS and I want to create a single Hive table corresponding to these files. The data in these files are in the format: {id, name, e-mail, country}. Now, as we know, Hadoop performance degrades when we use lots of small files.
So, how will you solve this problem where we want to create a single Hive table for lots of small files without degrading the performance of the system?
One can use the SequenceFile format which will group these small files together to form a single sequence file. The steps that will be followed in doing so are as follows:

Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Load the data into temp_table:
LOAD DATA INPATH 'input' INTO TABLE temp_table;

Create a table that will store data in SequenceFile format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
Transfer the data from the temporary table into the sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;

Hence, a single SequenceFile is generated which contains the data present in all of the input files and therefore, the problem of having lots of small files is finally eliminated.

92) Can We Change settings within Hive Session? If Yes, How? 
Yes, we can change the settings within Hive session, using the SET command. It helps to change Hive job settings for an exact query.

Example: The following command ensures buckets are populated according to the table definition.
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with the property name. SET will list all the properties with their values set by Hive.
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
This list will not include the Hadoop defaults, so for those we should use:
SET -v
It will list all the properties including the Hadoop defaults in the system.

93) Is it possible to add 100 nodes when we have 100 nodes already in Hive? How? 
Yes, we can add the nodes by following the below steps.
Take a new system; create a new username and password on it.
Install SSH and set up SSH connections with the master node.
Add the SSH public rsa id key to the authorized_keys file.
Add the new data node's host name, IP address and other details to the /etc/hosts slaves file, e.g.
192.168.1.102 slave3.in slave3
Start the DataNode on the new node.
Log in to the new node, e.g. su hadoop or ssh -X hadoop@192.168.1.103.
Start HDFS on the newly added slave node by using the following command:
./bin/hadoop-daemon.sh start datanode
Check the output of the jps command on the new node.

94) Explain the concatenation function in Hive with an example? 
The concat function joins the input strings. We can specify any number of strings separated by commas.
Example:
CONCAT('It','-','is','-','a','-','eLearning','-','provider');
Output:
It-is-a-eLearning-provider
So every time we have to put the separator '-' between the strings ourselves. If the separator is common to all the strings, Hive provides another function, CONCAT_WS, in which we specify the separator first:
CONCAT_WS('-','It','is','a','eLearning','provider');
Output: It-is-a-eLearning-provider

95) Explain Trim and Reverse function in Hive with examples? 
The TRIM function removes the leading and trailing spaces from a string.
Example:
TRIM(' BHAVESH ');
Output:
BHAVESH
To remove only the leading spaces:
LTRIM(' BHAVESH');
To remove only the trailing spaces:
RTRIM('BHAVESH ');
The REVERSE function reverses the characters in a string.
Example:
REVERSE('BHAVESH');
Output:
HSEVAHB

96) Explain the process to access subdirectories recursively in Hive queries? 
By setting the below properties we can access subdirectories recursively in Hive:
hive> Set mapred.input.dir.recursive=true;
hive> Set hive.mapred.supports.subdirectories=true;
Hive tables can be pointed to the higher-level directory, which is suitable for a directory structure like /data/country/state/city/.
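A minimal hedged sketch (the table name, single string column and path are assumptions for illustration) of pointing an external table at the top-level directory; with the two settings above, Hive reads every file in the nested subdirectories:
hive> CREATE EXTERNAL TABLE city_data (record STRING)
    > LOCATION '/data/country/';
hive> SELECT COUNT(*) FROM city_data;    -- counts rows from every file under /data/country/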

97) How to skip header rows from a table in Hive? 
Consider a log file whose first few lines are header records, for example:
System=
Version=
Sub-version=
We do not want to include these three header lines in our Hive query results. To skip header lines from our tables in Hive, we set a table property that tells Hive how many leading lines to ignore.
CREATE EXTERNAL TABLE employee (
name STRING,
job STRING,
dob STRING,
id INT,
salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES("skip.header.line.count"="3");

98) What is the maximum size of the string data type supported by Hive? Mention the binary formats Hive supports.
The maximum size of the string data type supported by Hive is 2 GB. 
Hive supports the text file format by default, and it also supports the binary formats Sequence files, ORC files, Avro data files and Parquet files.
Sequence files: a general-purpose binary format that is splittable, compressible and row oriented.
ORC files: ORC stands for Optimized Row Columnar format. It is a record-columnar, column-oriented storage file. It divides the table into row splits (stripes); within each split, the values of each column are stored together, column after column.
AVRO data files: like sequence files they are splittable, compressible and row oriented, with the addition of schema evolution and multi-language binding support.
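As a hedged quick reference (the table and column names are made up; STORED AS AVRO and STORED AS PARQUET require reasonably recent Hive versions), each format can be chosen with the STORED AS clause:
CREATE TABLE logs_seq (id INT, msg STRING) STORED AS SEQUENCEFILE;
CREATE TABLE logs_orc (id INT, msg STRING) STORED AS ORC;
CREATE TABLE logs_avro (id INT, msg STRING) STORED AS AVRO;
CREATE TABLE logs_parquet (id INT, msg STRING) STORED AS PARQUET;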

99) What is the precedence order of HIVE configuration? 
The precedence hierarchy for setting properties is (highest priority first):
1. The SET command in Hive
2. The command line --hiveconf option
3. hive-site.xml
4. hive-default.xml
5. hadoop-site.xml
6. hadoop-default.xml
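For example (a hedged illustration; hive.exec.parallel is just a convenient property to demonstrate with), a value passed on the command line overrides hive-site.xml, while a SET inside the session overrides the command line value:
hive --hiveconf hive.exec.parallel=true -e "SET hive.exec.parallel;"
hive> SET hive.exec.parallel=false;    -- overrides the --hiveconf value for the rest of this session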

100) If you run a SELECT * query in Hive, why does it not run MapReduce? 
The hive.fetch.task.conversion property allows Hive to skip the MapReduce overhead for simple queries. For queries such as SELECT * with an optional LIMIT and simple filters, Hive reads the data directly from HDFS using a fetch task instead of launching a MapReduce job, which lowers the latency.
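A small hedged example (the employee table name is reused from the earlier example only for illustration); the property accepts the values none, minimal and more:
hive> SET hive.fetch.task.conversion=more;
hive> SELECT * FROM employee LIMIT 10;    -- served by a fetch task, no MapReduce job is launched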

101) How Hive can improve performance with ORC format tables? 
We can store Hive data in a highly efficient manner using the Optimized Row Columnar (ORC) file format. It overcomes many of the limitations of the other Hive file formats. We can improve performance by using ORC files while reading, writing and processing the data.
SET hive.compute.query.using.stats=true;
SET hive.stats.dbclass=fs;
CREATE TABLE orc_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':' LINES TERMINATED BY '\n'
STORED AS ORC;

102) Explain about the different types of join in Hive? 
HiveQL has 4 different types of joins –
JOIN - similar to an Inner Join in SQL: returns only the records that satisfy the join condition in both tables.
FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.
RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.
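A minimal hedged sketch of the four join types (the orders and customers tables and their columns are assumptions for illustration):
SELECT o.order_id, c.name FROM orders o JOIN customers c ON o.cust_id = c.cust_id;
SELECT o.order_id, c.name FROM orders o LEFT OUTER JOIN customers c ON o.cust_id = c.cust_id;
SELECT o.order_id, c.name FROM orders o RIGHT OUTER JOIN customers c ON o.cust_id = c.cust_id;
SELECT o.order_id, c.name FROM orders o FULL OUTER JOIN customers c ON o.cust_id = c.cust_id;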

103) How can you configure remote metastore mode in Hive? 
To configure a remote metastore in Hive, the hive-site.xml file has to be configured with the below property, which points to the host name (or IP address) and port of the metastore host:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://node1:9083</value>
  <description>IP address (or host name) and port of the metastore host</description>
</property>

104) What happens on executing the below query? After executing the below query, if you modify the column how will the changes be tracked? 
Hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
The query creates an index named index_bonuspay which points to the bonus column in the employee table. The index is not updated automatically; whenever the value of bonus is modified, the changes are tracked by rebuilding the index so that the new values are stored against it.
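As a hedged follow-up (Hive index support was removed in Hive 3.0, so this applies to older versions), the rebuild is typically done with:
ALTER INDEX index_bonuspay ON employee REBUILD;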

105) How to load Data from a .txt file to Table Stored as ORC in Hive? 
LOAD DATA just copies the files into the table's directory; Hive does not do any transformation while loading data into tables.
So, in this case the input file /home/user/test_details.txt needs to be in ORC format if you are loading it into an ORC table.
A possible workaround is to create a temporary table with STORED AS TEXTFILE, then LOAD DATA into it, and then copy the data from this table to the ORC table.
Here is an example:
CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE;
CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load into the text table
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
-- Copy to the ORC table
INSERT INTO TABLE test_details_orc SELECT * FROM test_details_txt;

106) How to create a Hive table with a multi-character delimiter?
FIELDS TERMINATED BY does not support multi-character delimiters. The easiest way to do this is to use RegexSerDe: 

CREATE EXTERNAL TABLE tableex(id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(\\d+)~\\*(.*)$")
STORED AS TEXTFILE LOCATION '/user/myusername';

107) Is there any way to get the column names along with the output while executing a query in Hive? 
If we want to see the column names of a table in the query output, the following Hive configuration property should be set to true.
hive> set hive.cli.print.header=true;
If you prefer to always see the column names, then add the above setting to the first line of the $HOME/.hiverc file.
Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands it contains, if any.
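A minimal sketch of such a $HOME/.hiverc file (the second property is an optional extra shown only for illustration):
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;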

108) How to Improve Hive Query Performance With Hadoop? 
  • Use Tez Engine
Apache Tez Engine is an extensible framework for building high-performance batch processing and interactive data processing. It is coordinated by YARN in Hadoop. Tez improved the MapReduce paradigm by increasing the processing speed and maintaining the MapReduce ability to scale to petabytes of data.
Tez engine can be enabled in your environment by setting hive.execution.engine to tez:
set hive.execution.engine=tez;
  • Use Vectorization
Vectorization improves performance by fetching 1,024 rows in a single operation instead of fetching a single row at a time. It improves the performance of operations like filter, join, aggregation, etc.
Vectorization can be enabled in the environment by executing below commands.
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
  • Use ORCFile
The Optimized Row Columnar format provides a highly efficient way of storing Hive data, reducing the storage size by up to 75% of the original. The ORCFile format is better than the other Hive file formats when it comes to reading, writing, and processing the data. It uses techniques like predicate push-down, compression, and more to improve the performance of the query.
  • Use Partitioning
With partitioning, data is stored in separate individual folders on HDFS. Instead of scanning the whole dataset, queries that filter on the partition column read only the relevant partitions. The typical workflow is below, with a short sketch after this list:
1)Create Temporary Table and Load Data Into Temporary Table
2)Create Partitioned Table
3)Enable Dynamic Hive Partition
4)Import Data From Temporary Table To Partitioned Table
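A hedged sketch of those four steps (the table names, the country partition column and the input path are assumptions for illustration):
CREATE TABLE temp_users (id INT, name STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA INPATH '/input/users.csv' INTO TABLE temp_users;
CREATE TABLE part_users (id INT, name STRING)
PARTITIONED BY (country STRING) STORED AS ORC;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE part_users PARTITION (country)
SELECT id, name, country FROM temp_users;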
  • Use Bucketing
A Hive table can be divided into a number of partitions (Hive partitioning). Each partition can be further subdivided into clusters or buckets; this is called bucketing or clustering and is declared with the CLUSTERED BY clause, as in the sketch below.
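A minimal hedged sketch (the table, bucket column and bucket count are illustrative assumptions):
CREATE TABLE bucketed_users (id INT, name STRING)
PARTITIONED BY (country STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC;
SET hive.enforce.bucketing=true;    -- needed on older Hive versions so inserts honour the bucket definition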
  • Cost-Based Query Optimization
Hive optimizes the logical and physical execution plan of each query before submitting it for final execution. However, in the initial versions of Hive this optimization was not based on the cost of the query.
In later versions of Hive, queries are optimized according to their cost (for example, which type of join to perform, how to order the joins, the degree of parallelism, etc.), typically with the settings sketched below.
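A hedged sketch of the settings commonly used to enable cost-based optimization (exact defaults vary by Hive version; the orders table is the assumed, non-partitioned table from the join sketch above):
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS;    -- column statistics feed the cost model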

109) How do I query from a horizontal output to vertical output?
The easiest way is to use a map together with the explode function (via LATERAL VIEW), selecting the data for the prev and next columns as separate rows, as in the sketch below.
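A hedged sketch, assuming a table wide_table with columns id, prev and next (all of these names are assumptions for illustration):
SELECT id, t.col_name, t.col_value
FROM wide_table
LATERAL VIEW explode(map('prev', prev, 'next', next)) t AS col_name, col_value;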

110) Is there a simple way in Hive to strip out non-numeric characters, excluding '-', so that only positive and negative numbers are allowed?
We can try to use regexp_extract:
regexp_extract('abcd-9090','.*(-[0-9]+)',1)

111) What is the maximum character limit for a Hive table name?
The limit is 128 characters. The metastore service uses an RDBMS as its back end, and the standard MySQL metastore schema defines the table name column as varchar(128):

CREATE TABLE IF NOT EXISTS TBLS (...
 TBL_NAME varchar(128) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
...

112) How can I convert a date in string format (“April 25, 2018”) to timestamp in hive?
Use from_unixtime in conjunction with unix_timestamp:

select from_unixtime(unix_timestamp(`date`,'MMM dd, yyyy'),'yyyy-MM-dd')

113) How to drop a Hive database when it still contains some tables?
Use the CASCADE keyword while dropping the database.
Example:
hive> drop database sampleDB cascade;

114) I dropped and recreated a Hive external table, but no data is shown. What should I do?
This is because the table you created is a partitioned table. The insert you ran earlier would have created a partition such as partition_column='abcxyz'. When you drop and re-create the table, Hive loses the information about the partitions; it only knows about the table.

Run the command below to get Hive to re-create the partitions based on the data already in HDFS.
MSCK REPAIR TABLE user_dinesh.sampletable;


...Thank You...