Big Data Hadoop

Big Data Hadoop Certification Training from “The Dev Masters” is designed to make you job-ready for Big Data assignments. The training covers the essential skills of Hadoop and builds working knowledge of Big Data through the development and implementation of sample techniques and scripts.


In this workshop you’ll learn the end-to-end data science process:

  • Collect data from a variety of sources (e.g., Excel, web-scraping, APIs and others)
  • Explore large data sets
  • Clean and “munge” the data to prepare it for analysis
  • Apply machine learning algorithms to gain insight from the data
  • Visualize the results of your analysis

This is a practical, hands-on workshop with many class exercises. You’ll build your own library of Python scripts that you can reuse after you’re done with the course.

Prereqs & Preparation

You must bring a laptop with a text editor.

Sublime Text is recommended and has a free trial version.

In addition, students should install Anaconda, a free distribution that includes Python and a number of tools that will be used in class.

Day 1

Session I: Introduction

  • Introduction
  • How to Use This Course

Session II: Getting Started

  • Installing Enthought Canopy
  • Installing MRJob
  • Downloading the MovieLens Data Set
  • Run Your First MapReduce Job

Session III: Understanding MapReduce

  • MapReduce Basic Concepts
  • Walkthrough of Rating Histogram Code
  • Understanding How MapReduce Scales / Distributed Computing
  • Average Friends by Age Example: Part 1
  • Average Friends by Age Example: Part 2
  • Minimum Temperature By Location Example
  • Maximum Temperature By Location Example
  • Word Frequency in a Book Example
  • Making the Word Frequency Mapper Better with Regular Expressions
  • Sorting the Word Frequency Results Using Multi-Stage MapReduce Jobs
  • Activity: Design a Mapper and Reducer for Total Spent by Customer
  • Activity: Write Code for Total Spent by Customer
  • Compare Your Code to Mine
  • Activity: Sort Results by Amount Spent
  • Compare Your Code to Mine for Sorted Results
  • Combiners
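
To give a feel for the word frequency exercises in this session, here is a minimal single-machine sketch of the map, shuffle, and reduce steps in plain Python. It is not the course's MRJob code: the function names (`mapper`, `shuffle`, `reducer`, `word_frequency`, `sort_by_count`) and the regular expression are assumptions chosen for illustration, and the "shuffle" is simulated with a dictionary rather than performed by a cluster.

```python
import re
from collections import defaultdict

# Assumed word pattern: runs of letters, digits, underscores, or apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce step: sum all the counts collected for one word."""
    return word, sum(counts)


def word_frequency(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(mapped).items())


def sort_by_count(freqs):
    """Second stage: order (word, count) pairs by count, then by word."""
    return sorted(freqs.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    sample = ["The cat saw the dog.", "The dog's bone"]
    for word, count in sort_by_count(word_frequency(sample)):
        print(count, word)
```

Splitting the work into a counting stage and a separate sorting stage mirrors the idea of a multi-stage job: the second stage consumes the output of the first.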

Day 2

Session IV: Advanced MapReduce Examples

  • Example: Most Popular Movie
  • Including Ancillary Lookup Data in the Example
  • Example: Most Popular Superhero, Part 1
  • Example: Most Popular Superhero, Part 2
  • Example: Degrees of Separation: Concepts
  • Degrees of Separation: Preprocessing the Data
  • Degrees of Separation: Code Walkthrough
  • Degrees of Separation: Running and Analyzing the Results
  • Example: Similar Movies Based on Ratings: Concepts
  • Similar Movies: Code Walkthrough
  • Similar Movies: Running and Analyzing the Results
  • Learning Activity: Improving our Movie Similarities MapReduce Job
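
The degrees of separation problem in this session is, at its core, a shortest-path search over a graph of connections. As a rough single-machine sketch (not the distributed version covered in class), breadth-first search finds the answer by exploring one level of connections at a time. The graph layout (a dictionary mapping each node to a list of neighbors) and the function name `degrees_of_separation` are assumptions made for this example.

```python
from collections import deque


def degrees_of_separation(graph, start, target):
    """Return the number of hops from start to target, or None if unreachable.

    graph is assumed to be a dict mapping each node to an iterable of neighbors.
    """
    if start == target:
        return 0
    distances = {start: 0}          # nodes seen so far and their hop counts
    queue = deque([start])          # frontier of nodes still to expand
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                if neighbor == target:
                    return distances[neighbor]
                queue.append(neighbor)
    return None


if __name__ == "__main__":
    # Hypothetical toy graph; the class exercise uses a much larger data set.
    toy_graph = {
        "A": ["B"],
        "B": ["A", "C"],
        "C": ["B", "D"],
        "D": ["C"],
        "E": [],
    }
    print(degrees_of_separation(toy_graph, "A", "D"))
```

A distributed version would typically expand one frontier level per job pass instead of using an in-memory queue, which is why preprocessing the data into a node-per-line format matters.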

Session V: Using Hadoop and Elastic MapReduce

  • Fundamental Concepts of Hadoop
  • The Hadoop Distributed File System (HDFS)
  • Apache YARN
  • Hadoop Streaming: How Hadoop Runs your Python Code
  • Setting Up Your Amazon Elastic MapReduce Account
  • Linking Your EMR Account with MRJob
  • Exercise: Run Movie Recommendations on Elastic MapReduce
  • Analyze the Results of Your EMR Job
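
Hadoop Streaming runs a mapper and a reducer as ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, with the reducer receiving its input sorted by key. The sketch below imitates that contract in plain Python for a rating histogram. It assumes tab-separated input lines of the form user ID, movie ID, rating, timestamp; the function names (`map_lines`, `reduce_lines`, `run_locally`) are hypothetical, and `run_locally` replaces the cluster's sort-and-shuffle with a simple in-memory `sorted` call.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'rating<TAB>1' for each well-formed input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                 # skip malformed lines
        yield f"{fields[2]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each rating; input must be sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for rating, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{rating}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce on one machine."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__":
    # Made-up sample rows in the assumed format, for illustration only.
    sample = [
        "1\t10\t5\t881250949",
        "2\t10\t3\t881250950",
        "3\t11\t5\t881250951",
    ]
    for output_line in run_locally(sample):
        print(output_line)
```

Because the mapper and reducer only consume and produce lines of text, the same logic can be tested locally before it is submitted to a cluster.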

Day 3

Session VI: Advanced Hadoop and EMR

  • Distributed Computing Fundamentals
  • Activity: Running Movie Similarities on Four Machines
  • Analyzing the Results of the 4-Machine Job
  • Troubleshooting Hadoop Jobs with EMR and MRJob, Part 1
  • Troubleshooting Hadoop Jobs, Part 2
  • Analyzing One Million Movie Ratings Across 16 Machines, Part 1
  • Analyzing One Million Movie Ratings Across 16 Machines, Part 2

Day 4

Session VII: Other Hadoop Technologies

  • Introducing Apache Hive
  • Introducing Apache Pig
  • Apache Spark: Concepts
  • Spark Example: Part 1
  • Spark Example: Part 2