DATA SCIENCE
Description
Introduction to BIG Data Science/Data Analytics
- What background is required?
- What is Data Science?
- Why Data Science?
- BIG Data Science/Analytics trend
- What is Machine Learning?
- Data Science Life Cycle
Tools for Data Science/Analytics
- Anaconda Distribution package
- Open Source: Python/R
- Visualization tools: Matplotlib, Seaborn, introduction to Tableau
Data Analytics Problems/Use-cases
- From Kaggle competitions
- Types of Data: Structured, Unstructured (Image, Text, …)
- Predictive Analytics Problems: Classification, Regression, Recommenders
- Descriptive Analytics Problems: Clustering, Market Basket Analysis, PCA
- Business Verticals: Retail, Real Estate, Banking, Financial, Social, Web, Medical, Scientific, Logistics
Visualization tools:
- Matplotlib, Seaborn
- Introduction to Tableau
Statistics for Data Scientists
- Descriptive Statistics for single variables
- Mean, Median, Mode, Quartile, Percentile
- Interquartile Range
- Standard Deviation
- Variance
- Descriptive Statistics for two variables
- Z-Score
- Covariance / Correlation
- Chi-squared Analysis / Hypothesis Testing
- Limits
- Derivatives
- Partial Derivatives
- Gradients
- Significance of Gradients
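As a rough illustration of the single-variable descriptive statistics listed above, the following sketch uses only Python's standard `statistics` module. The sample values are hypothetical, and the quartiles depend on the interpolation method chosen (here the module's default).

```python
import statistics

# Hypothetical sample data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # arithmetic mean
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequent value
variance = statistics.pvariance(data)   # population variance
std_dev = statistics.pstdev(data)       # population standard deviation

# Quartiles using the module's default method; other methods
# (for example method="inclusive") give slightly different cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                           # interquartile range

# Z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / std_dev for x in data]

print(mean, median, mode, variance, std_dev, iqr)
```

The population formulas (`pvariance`, `pstdev`) are used here; for a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` would be the usual choice.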
Probability for Data Scientists
- Basic Probability
- Conditional Probability
- Properties of Random Variables
- Expectations
- Variance
- Entropy and cross-entropy
- Covariance and correlation
- Estimating probability of Random variable
- Understanding standard random processes
Data Distributions
- Normal Distribution
- Binomial Distribution
- Multinomial Distribution
- Bernoulli Distribution
- Probability, Prior probability, Posterior probability
- Bayes Theorem
- Naive Bayes
- Naive Bayes Algorithm
- Normal Distribution
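To make the prior/posterior vocabulary above concrete, here is a small worked example of Bayes' theorem. The scenario and all numbers (a 1% prior, a 90% true-positive rate, a 5% false-positive rate) are hypothetical and chosen only for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem for a binary hypothesis H given a positive observation E.

    P(H | E) = P(E | H) * P(H) / P(E), where
    P(E) = P(E | H) * P(H) + P(E | not H) * P(not H).
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prior, 90% true-positive rate, 5% false-positive rate.
p = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(f"Posterior probability: {p:.4f}")
```

Even with a fairly accurate test, the posterior stays well below the true-positive rate because the prior is small; this is the kind of intuition the prior/posterior topics aim to build.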
Mastering Python/R Language
- How to install Python (Anaconda)
- How to install scikit-learn (Anaconda)
- How to work with Jupyter Notebook
- How to work with Spyder IDE
- Strings
- Lists
- Tuples
- Sets
- Dictionaries
- Control Flow
- Functions
- Formal/Positional/Keyword arguments
- Predefined functions (range, len, enumerate, etc.)
- Data Frames
- Packages required for Data Science in R/Python
- Lab/Coding
Introduction to NumPy
- One-dimensional Array
- Two-dimensional Array
- Predefined functions (arange, reshape, zeros, ones, empty)
- Basic Matrix operations
- Scalar addition, subtraction, multiplication, division
- Matrix addition, subtraction, multiplication, division and transpose
- Slicing
- Indexing
- Looping
- Shape Manipulation
- Stacking
Introduction to Pandas
- Series
- DataFrame
- df.groupby
- pd.crosstab
- df.apply
- df.map
Decision Trees
- What are Decision Trees?
- Gini and Entropy criteria
- Decision trees in Classification
- Decision trees in Regression
- Ensembles
- Random Forest
- Boosting (Ada, Gradient, Extreme Gradient)
- SVM
- Understand what overfitting and underfitting models are
- Visualize overfitting and underfitting models
- How do you handle overfitting?
Data Preparation Techniques
- Structured Data Preparation
- Data Type Conversion
- Category to Numeric Conversion
- Numeric to Category Conversion
- Data Normalization: 0-1, Z-Score
- Handling Skewed Data: Box-Cox Transformation
- Handling Missing Data
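The two normalization schemes named above (0-1 scaling and Z-score) can be sketched in plain Python as follows. The input values are hypothetical, and the population standard deviation is assumed for the Z-score.

```python
import statistics

# Hypothetical feature values, for illustration only.
values = [10.0, 20.0, 30.0, 40.0, 50.0]

# 0-1 (min-max) normalization: rescale so the minimum maps to 0 and the maximum to 1.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: subtract the mean and divide by the standard deviation,
# giving a feature with mean 0 and standard deviation 1.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
z_scores = [(v - mu) / sigma for v in values]

print(min_max)
print([round(z, 3) for z in z_scores])
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range; Z-scores are usually less affected, which is one reason both options appear in the list.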
Re-sampling Techniques
- K-fold
- Repeated Hold-out Data Bootstrap aggregation sampling
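A minimal sketch of how K-fold splitting partitions sample indices, written in plain Python. The helper name `k_fold_indices` is hypothetical, and no shuffling is applied; in practice the data is often shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k consecutive folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample appears in exactly one test fold, so every observation is used for both training and evaluation across the k rounds.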
Exploratory Data Analysis (EDA)
- Statistical Data Analysis
- Data Visualization (Matplotlib, Seaborn)
- Exploring Individual Features
- Exploring Bi-Feature Relationships
- Exploring Multi-feature Relationships
- Feature/Dimension Reduction: PCA
- Intuition behind PCA
- Covariance & Correlation
- Relating PCA to Covariance/Correlation
- Intuition to math
- Applications of PCA: Dimensionality Reduction
Feature Engineering (FE)
- Combine Features
- Split Features
Data Visualization
- Bar Chart
- Histogram
- Box-and-whisker plot
- Line plot
- Scatter Plot
- Heat Map
Tree Based Algorithms
- Gini Index
- Entropy
- Information Gain
- Tree Pruning
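The split criteria listed above can be sketched in a few lines of plain Python. The function names and the toy labels are hypothetical; tree libraries compute the same quantities internally with more optimized code.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with a perfectly separating split.
parent = ["yes"] * 4 + ["no"] * 4
left, right = ["yes"] * 4, ["no"] * 4

print(gini(parent), entropy(parent), information_gain(parent, left, right))
```

A balanced two-class node has the maximum impurity for two classes, and a split that separates the classes completely recovers all of that entropy as information gain.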
Classification (Supervised Learning)
- What is Classification?
- Finding Patterns/Fixed Patterns
- Problems with Fixed Patterns
- Machine learning approach over fixed pattern approach
- Decision Tree based classification
- Ensemble Based Classification
- Logistic Regression (SGD Classifier)
- Accuracy measurements
- Confusion Matrix
- ROC Curve
- AUC Score
- Multi-class Classification
- Softmax Regression Classifier
- Multi-label Classification
- Multi-output Classification
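As an illustration of the accuracy measurements listed above, the sketch below builds a binary confusion matrix by hand and derives accuracy, precision, and recall from it. The label vectors are hypothetical.

```python
# Hypothetical true labels and predictions for a binary classifier (1 = positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of the predicted positives, how many were correct
recall = tp / (tp + fn)      # of the actual positives, how many were found

print("Confusion matrix [[TN, FP], [FN, TP]]:", [[tn, fp], [fn, tp]])
print(accuracy, precision, recall)
```

An ROC curve is obtained by recomputing the true-positive and false-positive rates at many decision thresholds; the AUC score is the area under that curve.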
Ensemble models
- Random Forest
- Bagging
- Boosting
- Adaptive Boosting
- Gradient Boosting
- Extreme Gradient Boosting
- Heterogeneous Ensemble Models
- Stacking / Voting
Regression (Supervised Learning)
- What is regression?
- Regression examples in business verticals
- Solution strategies for Regression
- Linear Regression
- Explanation of statistics
- Evaluation metrics
- Root Mean Square Error (RMSE)
- R-Squared
- Adjusted R-Squared
- Feature selection methods
- Linear regression
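The evaluation metrics named above can be computed directly from their definitions. The target values, predictions, and the assumption of a single predictor (`n_features = 1`) are hypothetical.

```python
import math

# Hypothetical targets and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
n_features = 1  # assumed number of predictors in the model

n = len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
y_mean = sum(y_true) / n
ss_tot = sum((t - y_mean) ** 2 for t in y_true)             # total sum of squares

rmse = math.sqrt(ss_res / n)
r_squared = 1.0 - ss_res / ss_tot
# Adjusted R-squared penalizes adding predictors that do not improve the fit.
adj_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - n_features - 1)

print(rmse, r_squared, adj_r_squared)
```

RMSE is expressed in the units of the target, while R-squared and adjusted R-squared are unitless, which makes them easier to compare across problems.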
Multiple/Polynomial Regression (scikit-learn)
- Multiple Linear Regression (SGD Regressor)
- Gradient Descent (calculus-based way of solving linear equations)
- Feature Scaling (Min-Max vs Mean Normalization)
- Feature Transformation
- Polynomial Regression
- Matrix addition, subtraction, multiplication and transpose
- Optimization theory for data scientists
Optimisation Theory (Gradient Descent Algorithm)
- Modelling ML problems with optimization requirements
- Solving unconstrained optimization problems
- Solving optimization problems with linear constraints
- Gradient descent ideas
- Gradient descent
- Batch gradient descent
- Stochastic gradient descent
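A minimal sketch of batch gradient descent, fitting a line `y = w*x + b` by minimizing mean squared error. The data, learning rate, and iteration count are hypothetical choices for illustration; the stochastic variant would update the parameters from one sample (or a small batch) at a time instead of the full dataset.

```python
# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def batch_gradient_descent(xs, ys, learning_rate=0.05, n_iters=2000):
    """Minimize mean squared error of y = w*x + b using the full dataset per step."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = batch_gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

The learning rate controls the step size: too large and the updates diverge, too small and convergence is slow, which is why feature scaling is usually paired with gradient descent.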
Model Evaluation and Error Analysis
- Train/Validation/Test split
- K-Fold Cross Validation
- The Problem of Over-fitting (Bias-Variance trade-off)
- Learning Curve
- Regularization (Ridge, Lasso and Elastic-Net)
- Hyper Parameter Tuning (GridSearchCV)
Recommendation Problem
- What is a Recommendation System?
- Top-N Recommender
- Rating Prediction
- Content based Recommenders
- Limitations of Content based recommenders
- Machine Learning Approaches for Recommenders
- User-User KNN model, Item-Item KNN model
- Factorization or latent factor model
- Hybrid Recommenders
- Evaluation Metrics for Recommendation Algorithms
- Top-N Recommender: Accuracy, Error Rate
- Rating Prediction: RMSE
Clustering (Unsupervised Learning)
- Finding pattern and Fixed Pattern Approach
- Limitations of Fixed Pattern Approach
- Machine Learning Approaches for Clustering
- Iterative based K-Means Approaches
- Density based DBSCAN Approach
- Evaluation Metrics for Clustering
- Cohesion, Coupling Metrics
- Correlation Metric
Support Vector Machine (SVM)
- SVM Classifier (Soft/Hard Margin)
- Linear SVM
- Non-Linear SVM
- Kernel SVM
- SVM Regression
PCA (Unsupervised Learning)
- Dimensionality Reduction
- Choosing Number of Dimensions or Principal Components
- Incremental PCA
- Kernel PCA
- When to apply PCA?
- Eigenvectors
- Eigenvalues
Model Deployment
- Pickle (pkl file)
- Model load from pkl file and prediction
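The save/load/predict cycle described above can be sketched with the standard `pickle` module. The "model" here is a hypothetical stand-in (a dictionary of linear coefficients with a small `predict` helper); a trained scikit-learn estimator can be saved and loaded with the same `pickle.dump` / `pickle.load` calls.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: parameters of y = weight * x + bias.
model = {"weight": 2.0, "bias": 1.0}

def predict(params, x):
    """Apply the stored linear parameters to a new input."""
    return params["weight"] * x + params["bias"]

# Save the model to a .pkl file (a temporary directory is used for this sketch).
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, or in another process: load the model back and predict.
with open(path, "rb") as f:
    loaded_model = pickle.load(f)

prediction = predict(loaded_model, 3.0)
print(prediction)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a trusted source.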
Association Rules
- Apriori Algorithm
- Collaborative Filtering (User-Item based)
- Collaborative Filtering (User-User based)
- Collaborative Filtering (Item-Item based)
Deep Learning:
- Introduction to Deep Learning
- TensorFlow
- Keras
- Setting up new environment for Deep Learning
- Perceptron model for classification and regression
- Perceptron Learning
- Limitations of Perceptron model
- Multi-layer FF NN model for classification and regression
- ML-FF-NN Learning with backpropagation
- Applying ML-FF-NN and parameter tuning
- Pros and Cons of the Model
Image classification
- Image Data Preparation
- Converting to grayscale
- Pixel Value Normalization
- Building Pixel Intensity Matrix
- Neural Networks
- Fully connected Neural Networks
- Feed Forward Neural Networks
- Convolutional Neural Networks
- Filters, Max Pooling
- Functional APIs
Text analytics:
- Bag of words
- GloVe Dictionary
- Text Data Preparation
- Normalizing Text
- Stop word Removal
- Whitespace Removal
- Stemming
- Building Document Term Matrix
- NLP (Natural Language Processing)
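A small bag-of-words sketch covering several of the preparation steps above: lowercasing, whitespace handling, stop word removal, and building a document-term matrix. The documents and the tiny stop word list are hypothetical, and stemming is left out to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "is", "a", "on"}

def tokenize(text):
    """Lowercase, split on any run of whitespace, and drop stop words."""
    tokens = text.lower().split()  # split() also discards extra whitespace
    return [t for t in tokens if t not in STOP_WORDS]

# Hypothetical documents.
docs = ["The cat sat on the mat", "The dog   sat", "A cat is a cat"]

tokenized = [tokenize(d) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Document-term matrix: one row per document, one column per vocabulary term,
# each cell holding the count of that term in that document.
matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Word order is discarded in this representation, which is the main limitation that embedding approaches such as GloVe try to address.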