BIG DATA
Description
- Real-time, hands-on view of Hadoop Development
- In-depth understanding of the entire Big Data Hadoop Ecosystem
- Detailed Course Materials
- Free Core Java and UNIX Fundamentals
- Interview Oriented Discussions
- Get Ready for Hadoop & Spark Developer (CCA175) Certification Exam
UNIX/LINUX Basic Commands
Basic UNIX Shell Scripting
Basic Java Programming – Core JAVA OOPS Concepts
Introduction to Big Data and Hadoop
Working With HDFS
Hadoop Map Reduce Concepts & Features
Developing Map Reduce Applications
Hadoop Eco System Components:
- HIVE
- PIG
- HBASE
- FLUME
- SQOOP
- Zookeeper
Detailed SPARK with SCALA Programming
Detailed Kafka Streaming
Overview of MongoDB
Overview of Spark with Python Programming (Time Permitting)
Real-Time Tools like PuTTY, WinSCP, Eclipse, Hue, Cloudera Manager
- Basic SQL Knowledge
- Computer with Minimum 4GB RAM (8GB RAM Preferred)
- Basic UNIX & Java Programming knowledge is an added advantage
Detailed Course Structure:
Introduction to Big Data & Hadoop
- The Big Data Problem
- What is Big Data?
- Challenges in processing Big Data
- What is Hadoop?
- Why Hadoop?
- History of Hadoop
- Hadoop Components Overview
- HDFS
- Map Reduce
- Hadoop Eco System Introduction
- NoSQL Database Introduction
Understanding Hadoop Architecture
- Hadoop 2.x Architecture
- Introduction to YARN
- Hadoop Daemons
- YARN Architecture
- Resource Manager
- Application Master
- Node Manager
Introduction to HDFS (Hadoop Distributed File System)
- Rack Awareness
- HDFS Daemons
- Writing Files to HDFS
- Blocks & Splits
- Input Splits
- Data Replication
- Reading Files from HDFS
- Introduction to HDFS Configuration Files
Working with HDFS
- HDFS Commands
- Accessing HDFS
- CLI Approach
- JAVA Approach [Introducing HDFS JAVA API]
Introduction to Map Reduce Paradigm
- What is Map Reduce?
- Detailed Map Reduce Flow
- Introduction to Key/Value Approach
- Detailed Mapper Functionality
- Detailed Reducer Functionality
- Details of Partitioner
- Shuffle & Sort Process
- Understanding Map Reduce Flow with Word Count Example
Map Reduce Programming
- Introduction to Map Reduce API [New Map Reduce API]
- Map Reduce Data Types
- File Formats
- Input Formats – Input Splits & Records, text input, binary input
- Output Formats – Text Output, Binary Output
- Configuring Development Environment – Eclipse
- Developing a Map Reduce Application using Default Functionality
- Identity Mapper
- Identity Reducer
- ToolRunner API Introduction
- Developing Word Count Application
- Writing Mapper, Reducer & Driver Code
- Building Application
- Deploying Application
- Running the Map Reduce Application
- Local Mode of Execution
- Cluster Mode of Execution
- Monitoring Map Reduce Application
- Map Reduce Combiner
- Map Reduce Counters
- Map Reduce Partitioner
- File Merge Utility
Programming with HIVE
- Introduction to HIVE
- Hive Architecture
- Types of Metastore
- Introduction to Hive Configuration Files
- Hive Data Types
- Simple Data Types
- Collection Data Types
- Types of Hive Tables
- Managed Table
- External Table
- Hive Query Language (HQL or HIVE QL)
- Creating Databases
- Creating Tables
- Loading Data into table
- Joins in Hive
- Group BY and Distinct operations
- Partitioning
- Static Partitioning
- Dynamic Partitioning
- Bucketing
- Lateral View & Explode [Introduction to Hive UDFs: UDF, UDAF & UDTF]
- XML Processing in HIVE
- JSON processing in HIVE
- URL Processing in HIVE
- Hive File Formats [Introduction to Hive SERDE]
- Parquet
- ORC
- AVRO
- Storage Formats
- Introduction to HIVE Query Optimizations
- Developing Hive UDFs in JAVA
- Hive Views
Programming with PIG
- Introduction to PIG
- PIG Architecture
- Introduction to PIG Configuration Files
- PIG vs. HIVE vs. Map Reduce
- Introduction to Data Flow Language
- Pig Data Types
- Pig Programming Modes
- Pig Access Modes
- Detailed PIG Latin Programming
- PIG UDFs & UDF Development in JAVA
- Hive – PIG Integration: Introduction to HCATALOG
- Introduction to PIG Optimization
NoSQL & HBASE
- Introduction to NoSQL Databases
- Types of NoSQL Databases
- Introduction To HBASE
- HBASE Architecture
- HBASE Shell Interface
- Creating Databases and Tables
- Inserting Data into Tables
- Accessing data from Tables
- HBase Filters
- Hive & HBASE Integration
- PIG & HBASE Integration
- Document Store – MongoDB Overview
Introduction to Streaming & FLUME
- Introduction to Streaming
- Introduction to FLUME
- FLUME Architecture
- Flume Agent Setup
- Types of Source, Channel & Sinks
- Developing Sample Flume Applications
SQOOP
- Introduction to SQOOP
- Connecting to RDBMS Using SQOOP
- SQOOP Import
- Import to HDFS
- Import to HIVE
- Import to HBASE
- Bulk Import
- Full Table
- Subset of a Table
- All tables in DB
- Incremental Import
- Incremental Append
- Incremental Last Modified
- SQOOP Export
- Export from HDFS
- Export from Hive
Zookeeper
- Introduction to Zookeeper
- Distributed Coordination
- Zookeeper Data Model
- Zookeeper Service
- Zookeeper Commands
Apache Kafka
- Introduction to Kafka
- Kafka Internals
- Kafka Cluster Architecture
- Kafka Producer
- Kafka Consumer
- Kafka Broker
- Introduction to Kafka API
- Kafka Stream Processing
- Integrating Kafka with various Hadoop Systems
Introduction to Scala Programming
- Introduction to Functional Programming & Scala
- Comparing Java and Scala
- Setting Up Scala in UNIX
- Setting Up SBT
- Introduction to Scala REPL
- Setting up Scala on Eclipse (Scala IDE)
Scala Programming Fundamentals
- Scala Data Types
- Variable Declarations
- Variable Type Inference
- Operators
- Scala Control Structures
- Scala Looping Structures
- Scala Functions
- Scala Collections
- Array
- List
- Map
- Tuples
- Set
Functional Programming in Scala
- Introduction to Functional Programming
- Difference between OOPs & Functional Programming
- Higher Order Functions
- Anonymous Functions
- Closures and Currying
- Functional Programming on Collections
- Iteration, Mapping, Filtering and Reduce
- Maps, Sets, Group By, Flatten and Flat Map
- File Access and File Processing
- Scala Pattern Matching
Object Oriented Programming in Scala
- Concept of Classes in Scala
- Implementing Getters and Setters
- Concept of Objects in Scala
- Singleton Objects
- Companion Objects
- Case Classes
- Primary Constructor
- Auxiliary Constructor
- Overriding Methods
- Apply Method
- Traits and Abstract Classes
- Exception Handling in Scala
Introduction to Spark
- What is Apache Spark
- Spark Unified Stack
- Spark Core
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
- Cluster Managers
- Users of Spark
- Spark vs. Map Reduce
- Introduction to Spark Shell
- Introduction to Spark Core API for Spark Application Development
Programming With Spark RDDs
- Introduction to RDDs
- Creating RDDs
- RDD Operations
- Transformations
- Actions
- Lazy Evaluation
- Passing Functions to Spark
- Common Transformations and Actions on RDDs
- Concept of Pair RDDs
- Transformations and Actions on Pair RDDs
- Data Partitioning in RDDs
- Concept of Persistence/Caching in RDDs
- Accumulators and Broadcast Variables
- Loading and Saving Data Using RDDs
- File Formats:
- Text Files
- CSV and Tab Separated Files
- JSON
- Sequence Files
- Parquet Files
- Compression Techniques – Snappy, Gzip
Programming with Spark Data Frames & Spark SQL
- Introduction to Spark Data Frames
- Data Frames vs. RDDs
- Introduction to Spark SQL
- Understanding HiveContext
- Operations on Data Frames
- Schema RDDs and Converting Schema RDDs to DataFrames (Custom Case Classes)
- Temp Tables vs. Persistent Tables
- Loading and Saving Data in DFs
- Apache Hive
- JSON
- Parquet
- ORC Files
- User Defined Functions (UDFs)
- Spark SQL UDF
- Hive UDF
Spark Streaming
- Introduction to Spark Streaming Architecture
- Introduction to Discrete Streams (DStreams)
- Streaming Operations
- Integrate Spark Streaming with Kafka
PySpark Overview (Time Permitting)
- Introduction to PySpark & PySpark Shell
- Using Python to develop Spark Applications
- Running PySpark Application