      Course Details

      BSP-2341
      5 Days
      $ 3,095.00

      Hadoop Programming on the Cloudera Platform

      This training course introduces students to Apache Hadoop and key Hadoop ecosystem projects: Pig, Hive, Sqoop, Impala, Oozie, HBase, and Spark.

      This intensive course combines lectures with hands-on labs, so students learn the theory behind Apache Hadoop and related Apache projects and gain practical experience with them.

      Course Outline

      CHAPTER 1. MAPREDUCE OVERVIEW
      • The Client – Server Processing Pattern
      • Distributed Computing Challenges
      • MapReduce Defined
      • Google's MapReduce
      • The Map Phase of MapReduce
      • The Reduce Phase of MapReduce
      • MapReduce Explained
      • MapReduce Word Count Job
      • MapReduce Shared-Nothing Architecture
      • Similarity with SQL Aggregation Operations
      • Example of Map & Reduce Operations using JavaScript
      • Problems Suitable for Solving with MapReduce
      • Typical MapReduce Jobs
      • Fault-tolerance of MapReduce
      • Distributed Computing Economics
      • MapReduce Systems
      • Summary
      CHAPTER 2. HADOOP OVERVIEW
      • Apache Hadoop
      • Apache Hadoop Logo
      • Typical Hadoop Applications
      • Hadoop Clusters
      • Hadoop Design Principles
      • Hadoop Versions
      • Hadoop's Main Components
      • Hadoop Simple Definition
      • Side-by-Side Comparison: Hadoop 1 and Hadoop 2
      • Hadoop-based Systems for Data Analysis
      • Other Hadoop Ecosystem Projects
      • Hadoop Caveats
      • Hadoop Distributions
      • Cloudera Distribution of Hadoop (CDH)
      • Cloudera Distributions
      • Hortonworks Data Platform (HDP)
      • MapR
      • Summary
      CHAPTER 3. HADOOP DISTRIBUTED FILE SYSTEM OVERVIEW
      • Hadoop Distributed File System (HDFS)
      • HDFS High Availability
      • HDFS "Fine Print"
      • Storing Raw Data in HDFS
      • Hadoop Security
      • HDFS Rack-awareness
      • Data Blocks
      • Data Block Replication Example
      • HDFS NameNode Directory Diagram
      • Accessing HDFS
      • Examples of HDFS Commands
      • Other Supported File Systems
      • WebHDFS
      • Examples of WebHDFS Calls
      • Client Interactions with HDFS for the Read Operation
      • Read Operation Sequence Diagram
      • Client Interactions with HDFS for the Write Operation
      • Communication inside HDFS
      • Summary
      CHAPTER 4. MAPREDUCE WITH HADOOP
      • Hadoop's MapReduce
      • MapReduce 1 and MapReduce 2
      • Why Discuss the Old MapReduce?
      • MapReduce v1 ("Classic MapReduce")
      • JobTracker and TaskTracker (the "Classic MapReduce")
      • YARN (MapReduce v2)
      • YARN vs MR1
      • YARN As Data Operating System
      • MapReduce Programming Options
      • Java MapReduce API
      • The Structure of a Java MapReduce Program
      • The Mapper Class
      • The Reducer Class
      • The Driver Class
      • Compiling Classes
      • Running the MapReduce Job
      • The Structure of a Single MapReduce Program
      • Combiner Pass (Optional)
      • Hadoop's Streaming MapReduce
      • Python Word Count Mapper Program Example
      • Python Word Count Reducer Program Example
      • Setting up Java Classpath for Streaming Support
      • Streaming Use Cases
      • The Streaming API vs Java MapReduce API
      • Amazon Elastic MapReduce
      • Apache Tez
      • Summary
      CHAPTER 5. APACHE PIG SCRIPTING PLATFORM
      • What is Pig?
      • Pig Latin
      • Apache Pig Logo
      • Pig Execution Modes
      • Local Execution Mode
      • MapReduce Execution Mode
      • Running Pig
      • Running Pig in Batch Mode
      • What is Grunt?
      • Pig Latin Statements
      • Pig Programs
      • Pig Latin Script Example
      • SQL Equivalent
      • Differences between Pig and SQL
      • Statement Processing in Pig
      • Comments in Pig
      • Supported Simple Data Types
      • Supported Complex Data Types
      • Arrays
      • Defining Relation's Schema
      • Not Matching the Defined Schema
      • The bytearray Generic Type
      • Using Field Delimiters
      • Loading Data with TextLoader()
      • Referencing Fields in Relations
      • Summary
      CHAPTER 6. APACHE PIG HDFS INTERFACE
      • The HDFS Interface
      • FSShell Commands (Short List)
      • Grunt's Old File System Commands
      • Summary
      CHAPTER 7. APACHE PIG RELATIONAL AND EVAL OPERATORS
      • Pig Relational Operators
      • Example of Using the JOIN Operator
      • Example of Using the Order By Operator
      • Caveats of Using Relational Operators
      • Pig Eval Functions
      • Caveats of Using Eval Functions (Operators)
      • Example of Using Single-column Eval Operations
      • Example of Using Eval Operators For Global Operations
      • Summary
      CHAPTER 8. APACHE PIG MISCELLANEOUS TOPICS
      • Utility Commands
      • Handling Compression
      • User-Defined Functions
      • Filter UDF Skeleton Code
      • Summary
      CHAPTER 9. APACHE PIG PERFORMANCE
      • Apache Pig Performance
      • Performance Enhancer - Use the Right Schema Type
      • Performance Enhancer - Apply Data Filters
      • Use the PARALLEL Clause
      • Examples of the PARALLEL Clause
      • Performance Enhancer - Limiting the Data Sets
      • Displaying Execution Plan
      • Compress the Results of Intermediate Jobs
      • Example of Running Pig with LZO Compression Codec
      • Summary
      CHAPTER 10. HIVE
      • What is Hive?
      • Apache Hive Logo
      • Hive's Value Proposition
      • Who uses Hive?
      • Hive's Main Sub-Systems
      • Hive Features
      • The "Classic" Hive Architecture
      • The New Hive Architecture
      • HiveQL
      • Where are the Hive Tables Located?
      • Hive Command-line Interface (CLI)
      • The Beeline Command Shell
      • Summary
      CHAPTER 11. HIVE COMMAND-LINE INTERFACE
      • Hive Command-line Interface (CLI)
      • The Hive Interactive Shell
      • Running Host OS Commands from the Hive Shell
      • Interfacing with HDFS from the Hive Shell
      • Hive in Unattended Mode
      • The Hive CLI Integration with the OS Shell
      • Executing HiveQL Scripts
      • Comments in Hive Scripts
      • Variables and Properties in Hive CLI
      • Setting Properties in CLI
      • Example of Setting Properties in CLI
      • Hive Namespaces
      • Using the SET Command
      • Setting Properties in the Shell
      • Setting Properties for the New Shell Session
      • Setting Alternative Hive Execution Engines
      • The Beeline Shell
      • Connecting to the Hive Server in Beeline
      • Beeline Command Switches
      • Beeline Internal Commands
      • Summary
      CHAPTER 12. HIVE DATA DEFINITION LANGUAGE
      • Hive Data Definition Language
      • Creating Databases in Hive
      • Using Databases
      • Creating Tables in Hive
      • Supported Data Type Categories
      • Common Numeric Types
      • String and Date / Time Types
      • Miscellaneous Types
      • Example of the CREATE TABLE Statement
      • Working with Complex Types
      • Table Partitioning
      • Table Partitioning on Multiple Columns
      • Viewing Table Partitions
      • Row Format
      • Data Serializers / Deserializers
      • File Format Storage
      • File Compression
      • More on File Formats
      • The ORC Data Format
      • Converting Text to ORC Data Format
      • The EXTERNAL DDL Parameter
      • Example of Using EXTERNAL
      • Creating an Empty Table
      • Dropping a Table
      • Table / Partition(s) Truncation
      • Alter Table/Partition/Column
      • Views
      • Create View Statement
      • Why Use Views?
      • Restricting Amount of Viewable Data
      • Examples of Restricting Amount of Viewable Data
      • Creating and Dropping Indexes
      • Describing Data
      • Summary
      CHAPTER 13. HIVE DATA MANIPULATION LANGUAGE
      • Hive Data Manipulation Language (DML)
      • Using the LOAD DATA statement
      • Example of Loading Data into a Hive Table
      • Loading Data with the INSERT Statement
      • Appending and Replacing Data with the INSERT Statement
      • Examples of Using the INSERT Statement
      • Multi Table Inserts
      • Multi Table Inserts Syntax
      • Multi Table Inserts Example
      • Summary
      CHAPTER 14. HIVE SELECT STATEMENT
      • HiveQL
      • The SELECT Statement Syntax
      • The WHERE Clause
      • Examples of the WHERE Statement
      • Partition-based Queries
      • Example of an Efficient SELECT Statement
      • The DISTINCT Clause
      • Supported Numeric Operators
      • Built-in Mathematical Functions
      • Built-in Aggregate Functions
      • Built-in Statistical Functions
      • Other Useful Built-in Functions
      • The GROUP BY Clause
      • The HAVING Clause
      • The LIMIT Clause
      • The ORDER BY Clause
      • The JOIN Clause
      • The CASE … Clause
      • Example of CASE … Clause
      • Summary
      CHAPTER 15. APACHE SQOOP
      • What is Sqoop?
      • Apache Sqoop Logo
      • Sqoop Import / Export
      • Sqoop Help
      • Examples of Using Sqoop Commands
      • Data Import Example
      • Fine-tuning Data Import
      • Controlling the Number of Import Processes
      • Data Splitting
      • Helping Sqoop Out
      • Example of Executing Sqoop Load in Parallel
      • A Word of Caution: Avoid Complex Free-Form Queries
      • Using Direct Export from Databases
      • Example of Using Direct Export from MySQL
      • More on Direct Mode Import
      • Changing Data Types
      • Example of Default Types Overriding
      • File Formats
      • The Apache Avro Serialization System
      • Binary vs Text
      • More on the SequenceFile Binary Format
      • Generating the Java Table Record Source Code
      • Data Export from HDFS
      • Export Tool Common Arguments
      • Data Export Control Arguments
      • Data Export Example
      • Using a Staging Table
      • INSERT and UPDATE Statements
      • INSERT Operations
      • UPDATE Operations
      • Example of the Update Operation
      • Failed Exports
      • Sqoop2
      • Sqoop2 Architecture
      • Summary
      CHAPTER 16. CLOUDERA IMPALA
      • What is Cloudera Impala?
      • Impala's Logo
      • Impala Architecture
      • Benefits of Using Impala
      • Key Features
      • How Impala Handles SQL Queries
      • Impala Programming Interfaces
      • Impala SQL Language Reference
      • Differences Between Impala and HiveQL
      • Impala Shell
      • Impala Shell Main Options
      • Impala Shell Commands
      • Impala Common Shell Commands
      • Cloudera Web Admin UI
      • Impala Browser-based Query Editor
      • Summary
      CHAPTER 17. INTRODUCTION TO FUNCTIONAL PROGRAMMING
      • What is Functional Programming (FP)?
      • Terminology: First-Class and Higher-Order Functions
      • Terminology: Lambda vs Closure
      • A Short List of Languages that Support FP
      • FP with Java
      • FP With JavaScript
      • Imperative Programming in JavaScript
      • The JavaScript map (FP) Example
      • The JavaScript reduce (FP) Example
      • Using reduce to Flatten an Array of Arrays (FP) Example
      • The JavaScript filter (FP) Example
      • Common Higher-Order Functions in Python
      • Common Higher-Order Functions in Scala
      • Elements of FP in R
      • Summary
      CHAPTER 18. INTRODUCTION TO APACHE SPARK
      • What is Spark
      • A Short History of Spark
      • Where to Get Spark?
      • The Spark Platform
      • Spark Logo
      • Common Spark Use Cases
      • Languages Supported by Spark
      • Running Spark on a Cluster
      • The Driver Process
      • Spark Applications
      • Spark Shell
      • The spark-submit Tool
      • The spark-submit Tool Configuration
      • The Executor and Worker Processes
      • The Spark Application Architecture
      • Interfaces with Data Storage Systems
      • Limitations of Hadoop's MapReduce
      • Spark vs MapReduce
      • Spark as an Alternative to Apache Tez
      • The Resilient Distributed Dataset (RDD)
      • Spark Streaming (Micro-batching)
      • Spark SQL
      • Example of Spark SQL
      • Spark Machine Learning Library
      • GraphX
      • Spark vs R
      • Summary
      CHAPTER 19. THE SPARK SHELL
      • The Spark Shell
      • The Spark Shell UI
      • Spark Shell Options
      • Getting Help
      • The Spark Context (sc) and SQL Context (sqlContext)
      • The Shell Spark Context
      • Loading Files
      • Saving Files
      • Basic Spark ETL Operations
      • Summary
      CHAPTER 20. SPARK RDDS
      • The Resilient Distributed Dataset (RDD)
      • Ways to Create an RDD
      • Custom RDDs
      • Supported Data Types
      • RDD Operations
      • RDDs are Immutable
      • Spark Actions
      • RDD Transformations
      • Other RDD Operations
      • Chaining RDD Operations
      • RDD Lineage
      • The Big Picture
      • What May Go Wrong
      • Checkpointing RDDs
      • Local Checkpointing
      • Parallelized Collections
      • More on parallelize() Method
      • The Pair RDD
      • Where do I use Pair RDDs?
      • Example of Creating a Pair RDD with Map
      • Example of Creating a Pair RDD with keyBy
      • Miscellaneous Pair RDD Operations
      • RDD Caching
      • RDD Persistence
      • The Tachyon Storage
      • Summary
      CHAPTER 21. PARALLEL DATA PROCESSING WITH SPARK
      • Running Spark on a Cluster
      • Spark Stand-alone Option
      • The High-Level Execution Flow in Stand-alone Spark Cluster
      • Data Partitioning
      • Data Partitioning Diagram
      • Single Local File System RDD Partitioning
      • Multiple File RDD Partitioning
      • Special Cases for Small-sized Files
      • Parallel Data Processing of Partitions
      • Spark Application, Jobs, and Tasks
      • Stages and Shuffles
      • The "Big Picture"
      • Summary
      CHAPTER 22. SHARED VARIABLES IN SPARK
      • Shared Variables in Spark
      • Broadcast Variables
      • Creating and Using Broadcast Variables
      • Example of Using Broadcast Variables
      • Accumulators
      • Creating and Using Accumulators
      • Example of Using Accumulators
      • Custom Accumulators
      • Summary
      CHAPTER 23. INTRODUCTION TO SPARK SQL
      • What is Spark SQL?
      • Uniform Data Access with Spark SQL
      • Hive Integration
      • Hive Interface
      • Integration with BI Tools
      • Spark SQL is No Longer an Experimental Developer API!
      • What is a DataFrame?
      • The SQLContext Object
      • The SQLContext API
      • Changes from Spark SQL 1.3 to 1.4
      • Example of Spark SQL (Scala Example)
      • Example of Working with a JSON File
      • Example of Working with a Parquet File
      • Using JDBC Sources
      • JDBC Connection Example
      • Performance & Scalability of Spark SQL
      • Summary
      CHAPTER 24. GRAPH PROCESSING WITH GRAPHX
      • What is GraphX?
      • Supported Languages
      • Vertices and Edges
      • Graph Terminology
      • Example of Property Graph
      • The GraphX API
      • The GraphX Views
      • The Triplet View
      • Graph Algorithms
      • Graphs and RDDs
      • Constructing Graphs
      • Graph Operators
      • Example of Using GraphX Operators
      • GraphX Performance Optimization
      • The PageRank Algorithm
      • GraphX Support for PageRank
      • Summary
      CHAPTER 25. MACHINE LEARNING ALGORITHMS
      • Supervised vs Unsupervised Machine Learning
      • Supervised Machine Learning Algorithms
      • Unsupervised Machine Learning Algorithms
      • Choose the Right Algorithm
      • Life-cycles of Machine Learning Development
      • Classifying with k-Nearest Neighbors (SL)
      • k-Nearest Neighbors Algorithm
      • The Error Rate
      • Decision Trees (SL)
      • Random Forests
      • Unsupervised Learning Type: Clustering
      • K-Means Clustering (UL)
      • K-Means Clustering in a Nutshell
      • Regression Analysis
      • Logistic Regression
      • Summary
      CHAPTER 26. THE SPARK MACHINE LEARNING LIBRARY
      • What is MLlib?
      • Supported Languages
      • MLlib Packages
      • Dense and Sparse Vectors
      • Labeled Point
      • Python Example of Using the LabeledPoint Class
      • LIBSVM format
      • An Example of a LIBSVM File
      • Loading LIBSVM Files
      • Local Matrices
      • Example of Creating Matrices in MLlib
      • Distributed Matrices
      • Example of Using a Distributed Matrix
      • Classification and Regression Algorithms
      • Clustering
      • Summary
      LAB EXERCISES
      • Lab 1. Learning the Lab Environment
      • Lab 2. The Hadoop Distributed File System
      • Lab 3. Hadoop Streaming MapReduce
      • Lab 4. Programming Java MapReduce Jobs on Hadoop
      • Lab 5. Getting Started with Apache Pig
      • Lab 6. Apache Pig HDFS Command-Line Interface
      • Lab 7. Working with Data Sets in Apache Pig
      • Lab 8. Using Relational Operators in Apache Pig
      • Lab 9. The Hive and Beeline Shells
      • Lab 10. Hive Data Definition Language
      • Lab 11. Using Select Statement in HiveQL
      • Lab 12. Table Partitioning in Hive
      • Lab 13. Data Import and Export with Sqoop
      • Lab 14. Using Impala
      • Lab 15. Elements of Functional Programming with Python
      • Lab 16. Using the spark-submit Tool
      • Lab 17. The Spark Shell
      • Lab 18. RDD Performance Improvement Techniques
      • Lab 19. Spark ETL and HDFS Interface
      • Lab 20. Using Broadcast Variables
      • Lab 21. Using Accumulators
      • Lab 22. Common Map / Reduce Programs in Spark
      • Lab 23. Spark SQL
      • Lab 24. Getting Started with GraphX
      • Lab 25. PageRank with GraphX
      • Lab 26. Using Random Forests for Classification with Spark MLlib
      • Lab 27. Using k-means Algorithm from MLlib

      Audience

      Business Analysts, IT Architects, Technical Managers and Developers

      College Credit, CEUs, PDUs and CDUs
      When you take courses with Babbage Simmel, be sure you get the credit you deserve. Curriculum offered by Babbage Simmel can earn you college credit, CEUs, PDUs, or CDUs.

      College Credit
      Select curriculum offered by Babbage Simmel is part of the undergraduate course catalogs of the accredited University of Findlay. For questions, please email info@babsim.com or call 614-481-4345.

      Continuing Education Units (CEUs)
      Continuing Education Units (CEUs) are nationally recognized standard units of measurement earned for satisfactory completion of qualified continuing education programs. If you need more information about CEUs, please email info@babsim.com or call 614-481-4345.

      Professional Development Units (PDUs)
      Professional Development Units (PDUs) can be issued by PMI® for formal learning activities related to project management. Project Management Professionals (PMPs®) are required to earn a minimum of 60 PDUs every 3 years to maintain certification. For more information about this program, go to the PMI® web site or call 1-855-746-4849.

      Continuing Development Units (CDUs)
      CDUs may be earned by attending professional development (e.g. courses, seminars) offered by organizations that IIBA® has endorsed and designated as EEP vendors. Because Babbage Simmel is an IIBA® Endorsed Education Provider (EEP), its IIBA®-endorsed courses qualify for CDU credit. For more information about CDUs, go to the IIBA® web site or call 1-647-426-3735.

      Our babsimLIVE distance learning brings the classroom experience to you by seating you virtually in a real-life, instructor-led classroom taught by award-winning, world-class instructors alongside other IT professionals like yourself. From the comfort of your home, your workplace, or the Babbage Simmel Columbus Campus, you get the training you need, when you want it, in the environment where you are most comfortable and most likely to succeed.

      © Copyright 2018 • Babbage Simmel. All Rights Reserved.