The title "Big Data using Hadoop.pdf" suggests that the document is likely a PDF file that focuses on the utilization of Hadoop technology in the context of Big Data. Hadoop is a popular open-source framework for distributed storage and processing of large datasets. The document is expected to cover various aspects of working with big data, emphasizing the role of Hadoop in managing and analyzing vast amounts of information.
1. Pre-requisites:
➢ Java Programming
➢ Database Knowledge
2. Course Educational Objectives:
This course provides practical, foundation level training that enables immediate and
effective participation in Big Data and other Analytics projects using Hadoop and R.
3. Course Outcomes:
After the completion of this course, the students will be able to:
CO1: Prepare data for summarization, query, and analysis.
CO2: Apply data modelling techniques to large data sets.
CO3: Create applications for Big Data analytics.
CO4: Improve individual/teamwork skills, communication, and report-writing skills with ethical values.
4. Course Articulation Matrix:
COs  PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
CO1   3    3    3    2    3    -    -    -    -    -     -     -     2     3     -
CO2   3    2    2    2    3    -    -    -    -    -     -     -     2     2     -
CO3   3    3    3    2    3    1    -    -    -    -     -     -     2     3     -
CO4   -    -    -    -    -    -    -    2    2    2     -     -     -     -     -
1 - Low   2 - Medium   3 - High
LIST OF EXPERIMENTS
Week-1:
Refreshing Linux Commands and Installation of Hadoop
Week-2:
Implementation of a basic Word Count MapReduce program.
Week-3:
Implementation of Matrix Multiplication with Hadoop MapReduce.
Week-4:
Implementation of Weather mining by taking a weather dataset using MapReduce.
Week-5:
Installation of Sqoop along with practice examples.
Week-6:
Installation of Hive along with practice examples.
WEEK 01: Refreshing Linux Commands and Installation of Hadoop
Linux commands:
• ls: This command is used to list all the files. It will print all the directories present in HDFS.
$hadoop fs -ls /dir
• mkdir: To create a directory. In Hadoop DFS there is no home directory by default, so let's first create it.
$hadoop fs -mkdir /directory_name
• touchz: It creates an empty file.
$hadoop fs -touchz /filename
• copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is the most important command. Local file system means the files present on the OS.
$hadoop fs -put filename(which you want to put) /path
$hadoop fs -copyFromLocal filename(which you want to put) /path
• copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
$hadoop fs -get /file(path)
$hadoop fs -copyToLocal /file(path)
• cat: To print file contents.
$hadoop fs -cat /file(path)
• moveFromLocal: This command will move a file from the local file system to HDFS.
$hadoop fs -moveFromLocal file_name(which you want to move) /path
• cp: This command is used to copy files within HDFS.
$hadoop fs -cp /path1/file /path2/file
• mv: This command is used to move files within HDFS. It cut-pastes a file.
$hadoop fs -mv /path1/file /path2/file
• rmr: This command deletes a file from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
$hadoop fs -rmr /file(path)
• du: It will give the size of each file in the directory.
$hadoop fs -du /file(path)
• dus: This command will give the total size of the directory/file.
$hadoop fs -dus /file(path)
• stat: It will give the last modified time of the directory or path. In short, it will give the stats of the directory or file.
$hadoop fs -stat /dir(file)
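For example, a short session combining these commands (file and directory names are just placeholders):
$hadoop fs -mkdir /demo
$hadoop fs -put notes.txt /demo
$hadoop fs -ls /demo
$hadoop fs -cat /demo/notes.txt
$hadoop fs -get /demo/notes.txt /home/hduser/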
Installation of Hadoop:
Open a terminal in Ubuntu and execute the following commands one by one:
1) sudo apt update
2) sudo apt install openjdk-8-jdk -y
3) java -version; javac -version
4) sudo apt install openssh-server openssh-client -y
5) ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
6) cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
7) chmod 0600 ~/.ssh/authorized_keys
8) ssh localhost
9) wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
10) tar xzf hadoop-3.2.1.tar.gz
11) sudo nano .bashrc
#Hadoop Related Options
export HADOOP_HOME=/home/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
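These options are usually accompanied by a few more environment variables; a commonly used continuation for a Hadoop 3.x single-node setup (assumed here, following the standard pattern), after which the file is reloaded and the install verified:
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
$ source ~/.bashrc
$ hadoop version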
WEEK 02: Implementation of a basic Word Count MapReduce program to understand the MapReduce paradigm.
Mapper Logic:
package com.lbrce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    @Override
    public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
    {
        String line = value.toString();
        // Split on whitespace and emit each word with a count of 1.
        String[] words = line.split("\\s+");
        for (String s : words)
        {
            con.write(new Text(s), new IntWritable(1));
        }
    }
}
Reducer Logic:
package com.lbrce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException
    {
        // Sum the 1s emitted by the mapper for this word.
        int sum = 0;
        for (IntWritable i : values)
        {
            sum = sum + i.get();
        }
        con.write(key, new IntWritable(sum));
    }
}
Driver Logic:
package com.lbrce.wordcount;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver
{
    public static void main(String[] args) throws Exception
    {
        if (args.length != 2)
        {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(WordCountDriver.class);
        job.setJobName("Word Count");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
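A typical packaging-and-run sequence for this job (the jar and path names are assumptions):
$hadoop fs -mkdir /wcinput
$hadoop fs -put input.txt /wcinput
$hadoop jar wordcount.jar com.lbrce.wordcount.WordCountDriver /wcinput /wcoutput
$hadoop fs -cat /wcoutput/part-r-00000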
OUTPUT:
Local file
Step 1: Create a text file, place it in Hadoop, and run the job with the exported jar.
Step 2: Run the Mapper and Reducer.
Step 3: Output file generation and the execution part are shown below.
Hadoop file
WEEK 03: Implementation of Matrix Multiplication with Hadoop MapReduce.
Mapper Logic:
package com.lbrce.matrixmul;

import java.io.IOException;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class MatrixMapper extends Mapper<LongWritable, Text, Text, Text>
{
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));
        int p = Integer.parseInt(conf.get("p"));
        String line = value.toString();
        // Each input line has the form: tag,row,col,value.
        String[] indicesAndValue = line.split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M"))
        {
            // An M(i,j) entry contributes to every output cell (i,k).
            for (int k = 0; k < p; k++)
            {
                outputKey.set(indicesAndValue[1] + "," + k);
                outputValue.set("M," + indicesAndValue[2] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
        else
        {
            // An N(j,k) entry contributes to every output cell (i,k).
            for (int i = 0; i < m; i++)
            {
                outputKey.set(i + "," + indicesAndValue[2]);
                outputValue.set("N," + indicesAndValue[1] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
    }
}
Reducer Logic:
package com.lbrce.matrixmul;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class MatrixReducer extends Reducer<Text, Text, Text, Text>
{
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
    {
        String[] value;
        HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
        for (Text val : values)
        {
            value = val.toString().split(",");
            if (value[0].equals("M"))
            {
                hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            }
            else
            {
                hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            }
        }
        // Dot product over the shared dimension j: result = sum_j M(i,j) * N(j,k).
        int n = Integer.parseInt(context.getConfiguration().get("n"));
        float result = 0.0f;
        float a_ij;
        float b_jk;
        for (int j = 0; j < n; j++)
        {
            a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
            b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
            result += a_ij * b_jk;
        }
        if (result != 0.0f)
        {
            context.write(null, new Text(key.toString() + "," + Float.toString(result)));
        }
    }
}
Driver Logic:
package com.lbrce.matrixmul;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatrixDriver
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        // M is an m-by-n matrix; N is an n-by-p matrix.
        conf.set("m", "2");
        conf.set("n", "2");
        conf.set("p", "2");
        Job job = Job.getInstance(conf, "MatrixMultiplication");
        job.setJarByClass(MatrixDriver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(MatrixMapper.class);
        job.setReducerClass(MatrixReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.submit();
    }
}
Execution Part in Termius:
Step 1: Check whether the jar is present in the directory.
Step 02: Now, to execute the matrix multiplication operation, we need an input file (i.e., MN as given in the logic) containing the entries of the two matrices M and N.
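For reference, the mapper parses each line as tag,row,col,value. With the driver's m = n = p = 2 configuration, a small input encoding 2 x 2 matrices M and N could look like this (values assumed):
M,0,0,1
M,0,1,2
M,1,0,3
M,1,1,4
N,0,0,5
N,0,1,6
N,1,0,7
N,1,1,8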
Step 03: After successfully copying the input file to the Hadoop file system, we now have to execute the jar file. Additionally, an output directory called "MMOutput" must be specified.
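A representative execution (the jar and directory names are assumptions):
$hadoop jar matrixmul.jar com.lbrce.matrixmul.MatrixDriver /MMInput /MMOutput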
Step 04: Verify whether the job completed successfully by opening the output directory.
Step 05: Now open the output file in read mode to print the output of the operation.
The matrix multiplication of the two matrices given in the input file is executed and displayed.
Week 04: Implementation of Weather mining by taking a weather dataset using MapReduce
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyMaxMin {

    public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, Text>
    {
        public static final int MISSING = 9999;

        public void map(LongWritable arg0, Text Value, Context context) throws IOException, InterruptedException
        {
            String line = Value.toString();
            if (!(line.length() == 0))
            {
                // The date occupies columns 6-14 of each record; the min and max
                // temperatures occupy fixed column ranges further along the line.
                String date = line.substring(6, 14);
                float temp_Min = Float.parseFloat(line.substring(39, 45).trim());
                float temp_Max = Float.parseFloat(line.substring(47, 53).trim());
                if (temp_Max > 35.0 && temp_Max != MISSING)
                {
                    context.write(new Text("Hot Day " + date), new Text(String.valueOf(temp_Max)));
                }
                if (temp_Min < 10 && temp_Min != MISSING)
                {
                    context.write(new Text("Cold Day " + date), new Text(String.valueOf(temp_Min)));
                }
            }
        }
    }

    public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text>
    {
        // Note: the parameter must be Iterable (not Iterator) for this method to
        // override the framework's reduce and actually be invoked.
        public void reduce(Text Key, Iterable<Text> Values, Context context) throws IOException, InterruptedException
        {
            String temperature = Values.iterator().next().toString();
            context.write(Key, new Text(temperature));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "weather example");
        job.setJarByClass(MyMaxMin.class);
        job.setMapOutputKeyClass(Text.class);
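        // A minimal completion of the driver (assumed; it follows the standard
        // pattern, using the mapper and reducer classes defined above):
        job.setMapOutputValueClass(Text.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}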
Week 05: Installation of Sqoop along with Practice examples.
Step 1: Connect to the MySQL database, then create a database (20761A5433) in which to create the required tables.
Step 2: Create two tables called employee and student, where the employee table has a primary key.
Step 3: Insert values into the tables created.
View the tables with the inserted values using the select command.
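A representative sequence for these steps (the table schemas and values are assumptions):
mysql> CREATE DATABASE 20761A5433;
mysql> USE 20761A5433;
mysql> CREATE TABLE employee (id INT PRIMARY KEY, name VARCHAR(50), salary INT);
mysql> CREATE TABLE student (id INT, name VARCHAR(50));
mysql> INSERT INTO employee VALUES (1, 'abc', 30000);
mysql> SELECT * FROM employee;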
Step 2: Hadoop directory with files.
Step 3: Connect the database with Sqoop, access the created tables as files, and run the mapper and reducer logic with commands like the one below.
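A representative import command (connection details and paths are assumptions):
$ sqoop import --connect jdbc:mysql://localhost/20761A5433 --username root --password <password> --table employee --target-dir /sqoopdata/employee -m 1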
Step 4: Now check the directory, which consists of both files.
Step 5: Display the contents of the file using the "cat" command and check the output.
Step 6: All tables can be imported with the import-all-tables command, as in the example below.
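For example (credentials assumed):
$ sqoop import-all-tables --connect jdbc:mysql://localhost/20761A5433 --username root --password <password> -m 1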
Step 7: Retrieve columns based on the specified criteria, for example:
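(The column names and filter are assumptions.)
$ sqoop import --connect jdbc:mysql://localhost/20761A5433 --username root --password <password> --table employee --columns "id,name" --where "salary > 20000" --target-dir /sqoopdata/emp_filtered -m 1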
Step 8: Our directory after performing all the above commands.
Week 06: Installation of Hive along with Practice Examples.
This Hive tutorial provides basic and advanced concepts of Hive and is designed for beginners and professionals alike.
Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries called HQL (Hive Query Language), which are internally converted to MapReduce jobs. Hive was developed by Facebook. It supports Data Definition Language, Data Manipulation Language, and user-defined functions.
The tutorial covers the main topics of Apache Hive: Hive installation, Hive data types, Hive table partitioning, Hive DDL commands, Hive DML commands, Hive sort by vs. order by, Hive joining tables, etc.
Step 1: Open the Hive workspace and create a directory and a database with a sample table.
Step 2: Insert values into a table in two ways: by creating a file and loading the data, and by inserting values one after another.
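A representative sequence for these two steps (the database, table, and file names are assumptions):
hive> CREATE DATABASE sampledb;
hive> USE sampledb;
hive> CREATE TABLE marks (sno INT, name STRING, m1 INT, m2 INT, m3 INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/hduser/marks.txt' INTO TABLE marks;
hive> INSERT INTO TABLE marks VALUES (1, 'abc', 65, 70, 75);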
Step 3: Creating an internal table in Hive:
Step 4: Loading data into the internal table.
Step 5: Creating an external table in Hive.
Step 6: An external table doesn't need a load command, so we can verify it directly.
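A sketch of an external table declaration (the schema and HDFS location are assumptions):
hive> CREATE EXTERNAL TABLE ext_marks (sno INT, name STRING, m1 INT, m2 INT, m3 INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/user/hive/ext_marks';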
Step 7: Hive user-defined function:
package com.lbrce.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class Percentage extends UDF {
    // Returns the average of three marks as a float.
    public float evaluate(int s1, int s2, int s3)
    {
        return (s1 + s2 + s3) / 3.0f;
    }
}
Step 8: Prepare a jar file and load it into Hive for execution.
Step 9: Create a temporary function for execution.
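A typical registration and use of the UDF (the jar path and table name are assumptions; the class is the one defined above):
hive> ADD JAR /home/hduser/hiveudf.jar;
hive> CREATE TEMPORARY FUNCTION percentage AS 'com.lbrce.hiveudf.Percentage';
hive> SELECT name, percentage(m1, m2, m3) FROM marks;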
Step 10: Create another table and check the result.
Step 11: Another user-defined function, to place some words between the fields of the input file.
Step 12: Datatypes in Hive, with examples (a representative declaration follows the list below).
a. Array:
b. Map
c. Struct
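Representative declarations covering the three complex types above (the schema and delimiters are assumptions):
hive> CREATE TABLE complex_demo (
    >   subjects ARRAY<STRING>,
    >   marks MAP<STRING, INT>,
    >   address STRUCT<city:STRING, pin:INT>
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY '#'
    > MAP KEYS TERMINATED BY ':';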
Step 13: Partitioning of tables in Hive:
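A minimal partitioned-table sketch (the table and partition names are assumptions):
hive> CREATE TABLE marks_part (sno INT, name STRING, total INT)
    > PARTITIONED BY (dept STRING);
hive> INSERT INTO TABLE marks_part PARTITION (dept='cse') VALUES (1, 'abc', 210);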