APACHE HADOOP & SPARK
DESCRIPTION
Course Content
APACHE HADOOP & SPARK TRAINING
Introduction to HADOOP and Big Data Ecosystem
Distributed computing and cloud computing
Big Data Basics and the Need for Parallel Processing
How Hadoop Works
Introduction to HDFS and MapReduce
Hadoop Architecture Details
NameNode (master) Details
DataNode and Storage
Secondary NameNode, FSImage, and the Edit Log
JobTracker and TaskTracker
Safe Mode Details and Configuration
HDFS (Hadoop Distributed File System)
Background: the Google File System (GFS) and HDFS Design
Data Replication – Static and Dynamic configuration
Data Storage – Block Size details
Additional HDFS commands
HDFS API for Automation (a real-world project need; see the sketch below)
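As a taste of the automation topic above, here is a minimal sketch that drives HDFS from Python by shelling out to the standard hdfs dfs commands. It assumes a configured Hadoop client on the PATH; the helper name and the /user/demo paths are illustrative only.

    import subprocess

    def hdfs(*args):
        # Illustrative helper: run an 'hdfs dfs' subcommand and return its stdout.
        result = subprocess.run(["hdfs", "dfs", *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    hdfs("-mkdir", "-p", "/user/demo/input")             # hypothetical target directory
    hdfs("-put", "-f", "local_data.txt", "/user/demo/input/")
    print(hdfs("-ls", "/user/demo/input"))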
MapReduce Programming
MapReduce Background
Writing MapReduce Programs
Input Format, Output Format
JobConf and JobClient API
Number of Mappers and Reducers
Pre-built Mappers and Reducers
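Before writing real Hadoop jobs (which the course does against the Java JobConf/JobClient API), the model itself is easy to see in plain Python. The word-count sketch below simulates the map, shuffle, and reduce phases in-process; it illustrates the model only and does not touch the Hadoop API.

    from collections import defaultdict

    def mapper(line):
        # Map phase: emit a (word, 1) pair for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(word, counts):
        # Reduce phase: sum every count emitted for one key.
        return word, sum(counts)

    lines = ["the quick brown fox", "the lazy dog"]

    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)

    print([reducer(w, c) for w, c in sorted(groups.items())])
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]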
Hadoop Streaming
Introduction to Hadoop Streaming
Streaming API details and use cases
Streaming Lab
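Hadoop Streaming runs any executable that reads lines on stdin and writes tab-separated key/value lines on stdout, so the word-count job becomes two small Python scripts. The file names are illustrative.

    # mapper.py -- emits "word<TAB>1" for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

    # reducer.py -- Streaming sorts by key, so counts for a word arrive together
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A job like this is submitted with the hadoop-streaming jar, passing -input, -output, -mapper, and -reducer (plus -file to ship the scripts); the exact jar path depends on the installation.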
Apache Sqoop
Introduction and Basics
Sqoop Installation with Oracle DB/MySQL
Sqoop Export and Import features
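Sqoop itself is a command-line tool; a typical import invocation, wrapped here in Python for scripting, looks like the sketch below. The JDBC URL, credentials, and table name are all hypothetical.

    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",   # hypothetical database
        "--username", "demo",
        "--password-file", "/user/demo/.sqoop_pwd",      # safer than --password on the CLI
        "--table", "orders",                             # hypothetical source table
        "--target-dir", "/user/demo/orders",             # HDFS destination
        "--num-mappers", "4",
    ], check=True)

Export runs the other way: sqoop export with --export-dir pointing at the HDFS data and --table naming the destination table.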
Apache Hive
Hive Installation
Hive Shell Description
Metastore Details
Hive QL Basics
Working with Tables, Databases, etc.
Hive JDBC programming
Hands-on Exercises and Assignments
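The JDBC topic is taught against Hive's Java JDBC driver; the Python equivalent goes through HiveServer2, for example with the third-party PyHive package (an assumption here, not part of the course kit). Host, port, and table are illustrative.

    from pyhive import hive   # third-party: pip install 'pyhive[hive]'

    # Hypothetical HiveServer2 endpoint and database.
    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()

    cursor.execute("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING)")
    cursor.execute("SELECT id, name FROM employees LIMIT 10")
    for row in cursor.fetchall():
        print(row)

    cursor.close()
    conn.close()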
Introduction to Spark
What is Spark?
Review: From Hadoop MapReduce to Spark
Introduction: HDFS
Introduction: YARN and Mesos
Spark Architecture Details
Spark Modules
Spark and Scala Installation
Apache Spark Installation (version 2.x)
Scala Installation and Configuration
Using the Spark Shell – Scala
Using the PySpark Shell
Spark Labs and Exercises
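The pyspark shell pre-creates a SparkSession as spark and a SparkContext as sc; in a standalone script you build them yourself. A minimal sketch for a local Spark 2.x install:

    from pyspark.sql import SparkSession

    # Equivalent to what the pyspark shell hands you as `spark`.
    spark = (SparkSession.builder
             .appName("demo")
             .master("local[*]")      # run locally on all cores
             .getOrCreate())

    sc = spark.sparkContext
    print(spark.version)                       # e.g. 2.x.y
    print(sc.parallelize(range(10)).sum())     # 45

    spark.stop()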
Resilient Distributed Datasets (RDD)
Working with RDDs in Spark
Creating RDDs
Accumulators and Broadcast variables
RDD – Transformations
RDD – Actions
RDD Labs
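A compact sketch of the RDD topics above, assuming an existing SparkContext sc (e.g., from the pyspark shell); the sample words are made up.

    rdd = sc.parallelize(["spark", "hadoop", "spark", "hive"])

    # Transformations are lazy; nothing executes until an action is called.
    counts = rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    # Actions trigger execution.
    print(counts.collect())    # [('spark', 2), ('hadoop', 1), ('hive', 1)] -- order may vary
    print(rdd.count())         # 4

    # Accumulator: a write-only counter that tasks update and the driver reads.
    seen = sc.accumulator(0)
    rdd.foreach(lambda w: seen.add(1))
    print(seen.value)          # 4

    # Broadcast variable: read-only data shipped once to every executor.
    stop = sc.broadcast({"hive"})
    print(rdd.filter(lambda w: w not in stop.value).collect())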
Spark SQL and DataFrames
Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames
DataFrames and RDDs
Comparing Spark SQL, Impala and Hive-on-Spark
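A short sketch of the DataFrame topics above, assuming an existing SparkSession spark; the sample rows are made up.

    df = spark.createDataFrame(
        [(1, "alice", 34), (2, "bob", 45), (3, "cara", 29)],
        ["id", "name", "age"],
    )

    # DataFrame API: transform and query without writing SQL.
    df.filter(df.age > 30).select("name", "age").show()

    # Or register a temporary view and query through the SQL interface.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY age").show()

    # DataFrames and RDDs interoperate: .rdd exposes the underlying Row objects.
    print(df.rdd.map(lambda row: row.name).collect())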
Spark MLlib (Machine Learning)
Basic Principles of Machine Learning
Spark ML Setup
Transformation and Correlation Algorithms
Example: K-means
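A minimal K-means sketch with the DataFrame-based pyspark.ml API, assuming an existing SparkSession spark; the four 2-D points are made up so the two clusters are obvious.

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    data = spark.createDataFrame(
        [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)],
        ["x", "y"],
    )

    # MLlib expects a single vector column; assemble the raw columns into it.
    features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(data)

    model = KMeans(k=2, seed=1).fit(features)
    print(model.clusterCenters())    # two centers, near (0.5, 0.5) and (8.5, 8.5)
    model.transform(features).select("x", "y", "prediction").show()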
Study Materials and Labs
1) A complete virtual machine is shared with students, with Java, Oracle DB, Mozilla
Firefox, and other components pre-installed.
2) The VM can be used even after the training is done. Note that it is NOT a remote-lab
environment: you keep the VM and all labs after the training is completed.
3) A certification question dump for Hadoop will be provided.
4) Interview questions on Spark, Hadoop, MapReduce, and other ecosystem components
are included.
5) The course is 30 hours in duration. All materials are shared via Google Drive.