DURATION  100 hrs


Hadoop & Spark Developer Training

What Will You Get from This Course?

  • In-depth understanding of Big Data, Hadoop, and the Hadoop Ecosystem
  • Practical, real-world view of Hadoop Development
  • Detailed Course Materials
  • Free Core Java and UNIX Fundamentals
  • Interview Oriented Discussions

Get Ready for the Hadoop & Spark Developer (CCA175) Certification Exam

Overall Course Structure:

  • UNIX/LINUX Basic Commands
  • Basic UNIX Shell Scripting
  • Basic Java Programming – Core Java OOP Concepts
  • Introduction to Big Data and Hadoop
  • Working With HDFS
  • Hadoop Map Reduce Concepts & Features
  • Developing Map Reduce Applications
  • Hadoop Eco System Components:
    o HIVE
    o PIG
  • Detailed SPARK with SCALA Programming
  • Detailed Kafka Streaming
  • Overview of MongoDB
  • Overview of Spark with Python Programming (Time Permitting)
  • Industry Tools such as PuTTY, WinSCP, Eclipse, Hue, and Cloudera Manager


Prerequisites:

  • Basic SQL Knowledge
  • Computer with Minimum 4GB RAM (8GB RAM Preferred)
  • Basic UNIX & Java Programming knowledge is an added advantage

Detailed Course Structure:

Introduction to Big Data & Hadoop

  • The Big Data Problem
  • What is Big Data?
  • Challenges in processing Big Data
  • What is Hadoop?
  • Why Hadoop?
  • History of Hadoop
  • Hadoop Components Overview
    o HDFS
    o Map Reduce
  • Hadoop Eco System Introduction
  • Database Introduction

Understanding Hadoop Architecture

  • Hadoop 2.x Architecture
  • Introduction to YARN
  • Hadoop Daemons
  • YARN Architecture
    o Resource Manager
    o Application Master
    o Node Manager

Introduction to HDFS (Hadoop Distributed File System)

  • Rack Awareness
  • HDFS Daemons
  • Writing Files to HDFS
    o Blocks & Splits
    o Input Splits
    o Data Replication
  • Reading Files from HDFS
  • Introduction to HDFS Configuration Files

Working with HDFS

  • HDFS Commands
  • Accessing HDFS
    o CLI Approach
    o JAVA Approach [Introducing HDFS JAVA API]

Introduction to Map Reduce Paradigm

  • What is Map Reduce?
  • Detailed Map Reduce Flow
  • Introduction to Key/Value Approach
    o Detailed Mapper Functionality
    o Detailed Reducer Functionality
    o Details of Partitioner
    o Shuffle & Sort Process
  • Understanding Map Reduce Flow with Word Count Example
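
As a rough preview of the word count flow listed above, the three stages can be sketched in plain Python with no Hadoop involved. This is only an illustration under stated assumptions: the function names (mapper, shuffle_and_sort, reducer, word_count) are hypothetical and are not part of the Hadoop API, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle & sort phase: group values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle & sort, and reduce over an iterable of text lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle_and_sort(mapped))
```

On a real cluster the mapper and reducer run as separate tasks on different nodes, and the framework performs the shuffle and sort between them.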

Map Reduce Programming

  • Introduction to Map Reduce API [New Map Reduce API]
  • Map Reduce Data Types
  • File Formats
  • Input Formats – Input Splits & Records, text input, binary input
  • Output Formats – Text Output, Binary Output
  • Configuring Development Environment – Eclipse
  • Developing a Map Reduce Application
    o Default Functionality
    o Identity Mapper
    o Identity Reducer
    o ToolRunner API Introduction
  • Developing Word Count Application
  • Mapper, Reducer & Driver Code
  • Building Application
  • Deploying
  • Running the Map Reduce Application
  • Local Mode of Execution
  • Cluster Mode of Execution
  • Monitoring Map Reduce Application
  • Map Reduce Combiner
  • Map Reduce Counters
  • Map Reduce Partitioner
  • File Merge Utility
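
The partitioner topic above can be previewed with a small Python sketch. Hadoop's default HashPartitioner picks a reducer as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the version below mimics that rule, with zlib.crc32 standing in for Java's hashCode() (an assumption made only so the example is deterministic). The names hash_partition and partition_pairs are hypothetical helpers, not Hadoop APIs.

```python
import zlib


def hash_partition(key, num_reducers):
    """Pick a reducer index for a string key.

    crc32 is a stand-in for Java's key.hashCode(); the mask keeps the
    value non-negative, mirroring the & Integer.MAX_VALUE step.
    """
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def partition_pairs(pairs, num_reducers):
    """Distribute (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash_partition(key, num_reducers)].append((key, value))
    return buckets
```

Because the partition depends only on the key, every value for a given key reaches the same reducer, which is what lets the reducer compute a complete total for that key.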

Programming with HIVE

  • Introduction to HIVE
  • Hive Architecture
  • Types of Metastore
  • Introduction to Hive Configuration Files
  • Hive Data Types
    o Simple Data Types
    o Collection Data Types
  • Types of Hive Tables
    o Managed Table
    o External Table
  • Hive Query Language (HQL or HIVE QL)
    o Creating Databases
    o Creating Tables
    o Loading Data into Tables
    o Joins in Hive
    o Group BY and Distinct Operations
    o Partitioning
      • Static Partitioning
      • Dynamic Partitioning
    o Bucketing
    o Lateral View & Explode [Introduction to Hive UDFs → UDF, UDAF & UDTF]
    o XML Processing in HIVE
    o JSON Processing in HIVE
    o URL Processing in HIVE
  • Hive File Formats [Introduction to Hive SERDE]
    o Parquet
    o ORC
  • Storage Formats
  • Introduction to HIVE Query Optimizations
  • Developing Hive UDFs in JAVA
  • Hive Views

Programming with PIG

  • Introduction to PIG
  • PIG Architecture
  • Introduction to PIG Configuration Files
  • PIG vs. HIVE vs. Map Reduce
  • Introduction to Data Flow Language
  • Pig Data Types
  • Pig Programming Modes
  • Pig Access Modes
  • Detailed PIG Latin Programming
  • PIG UDFs & UDF Development in JAVA
  • Hive – PIG Integration → Introduction to HCATALOG
  • Introduction to PIG Optimization


NoSQL Databases & HBASE

  • Introduction to NoSQL Databases
  • Types of Databases
  • Introduction To HBASE
  • HBASE Architecture
  • HBASE Shell Interface
    o Creating Databases and Tables
    o Inserting Data in Tables
    o Accessing Data from Tables
    o HBase Filters

  • Hive & HBASE Integration
  • PIG & HBASE Integration
  • Document Store – MongoDB Overview

Introduction to Streaming & FLUME

  • Introduction to Streaming
  • Introduction to FLUME
  • FLUME Architecture
  • Flume Agent Setup
  • Types of Source, Channel & Sinks
  • Developing Sample Flume Applications


Apache SQOOP

  • Introduction to SQOOP
  • Connecting to RDBMS Using SQOOP
  • SQOOP Import
    o Import to HDFS
    o Import to HIVE
    o Import to HBASE
    o Bulk Import
      • Full Table
      • Subset of a Table
      • All Tables in DB
    o Incremental Import
      • Incremental Append
      • Incremental Last Modified
  • SQOOP Export
    o Export from HDFS
    o Export from Hive


Apache Zookeeper

  • Introduction to Zookeeper
  • Distributed Coordination
  • Zookeeper Data Model
  • Zookeeper Service
  • Zookeeper Commands

Apache Kafka

  • Introduction to Kafka
  • Kafka Internals
  • Kafka Cluster Architecture
  • Kafka Producer
  • Kafka Consumer
  • Kafka Broker
  • Introduction to Kafka API
  • Kafka Stream Processing
  • Integrating Kafka with various Hadoop Systems

Introduction to Scala Programming

  • Introduction to Functional Programming & Scala
  • Comparing Java and Scala
  • Setting Up Scala in UNIX
  • Setting Up SBT
  • Introduction to Scala REPL
  • Setting up Scala on Eclipse (Scala IDE)

Scala Programming Fundamentals

  • Scala Data Types
  • Variable Declarations
  • Variable Type Inference
  • Operators
  • Scala Control Structures
  • Scala Looping Structures
  • Scala Functions
  • Scala Collections
    o Array
    o List
    o Map
    o Tuples
    o Set

Functional Programming in Scala

  • Introduction to Functional Programming
  • Difference between OOP & Functional Programming
  • Higher Order Functions
  • Anonymous Functions
  • Closures and Currying
  • Functional Programming on Collections
    o Iteration, Mapping, Filtering and Reduce
    o Maps, Sets, Group By, Flatten and Flat Map
  • File Access and File Processing
  • Scala Pattern Matching

Object Oriented Programming in Scala

  • Concept of Classes in Scala
  • Implementing Getters and Setters
  • Concept of Objects in Scala
  • Singleton Objects
  • Companion Objects
  • Case Classes
  • Primary Constructor
  • Auxiliary Constructor
  • Overriding Methods
  • Apply Method
  • Traits and Abstract Classes
  • Exception Handling in Scala

Introduction to Spark

  • What is Apache Spark
  • Spark Unified Stack
    o Spark Core
    o Spark SQL
    o Spark Streaming
    o MLlib
    o GraphX
    o Cluster Managers

  • Users of Spark
  • Spark vs. Map Reduce
  • Introduction to Spark Shell
  • Introduction to Spark Core API for Spark Application Development

Programming With Spark RDDs

  • Introduction to RDDs
  • Creating RDDs
  • RDD Operations
    o Transformations
    o Actions
    o Lazy Evaluation

  • Passing Functions to Spark
  • Common Transformations and Actions on RDDs
  • Concept of Pair RDDs
  • Transformation and Actions on Paired RDDs
  • Data Partitioning in RDDs
  • Concept of Persistence/Caching in RDDs
  • Accumulators and Broadcast Variables
  • Loading and Saving Data Using RDDs
  • File Formats:
    o Text Files
    o CSV and Tab Separated Files
    o JSON
    o Sequence Files
    o Parquet Files
  • Compression Techniques – Snappy, Gzip

Programming with Spark Data Frames & Spark SQL

  • Introduction to Spark Data Frames
  • Data Frames vs. RDDs
  • Introduction to Spark SQL
  • Understanding HiveContext
  • Operations on Data Frames
  • Schema RDDs and Converting Schema RDDs to DataFrames (Custom Case Classes)
  • Temp Tables vs. Persistent Tables
  • Loading and Saving Data in DFs
    o Apache Hive
    o JSON
    o Parquet
    o ORC Files
  • User Defined Functions (UDFs)
    o Spark SQL UDF
    o Hive UDF

Spark Streaming

  • Introduction to Spark Streaming Architecture
  • Introduction to Discrete Streams (DStreams)
  • Streaming Operations
  • Integrate Spark Streaming with Kafka

PySpark Overview (Time Permitting)

  • Introduction to PySpark & PySpark Shell
  • Using Python to develop Spark Applications
  • Running PySpark Application
