Course Outline

  • Introduction
    • The history and core concepts of Hadoop
    • Overview of the Hadoop ecosystem
    • Understanding different distributions
    • High-level architecture
    • Addressing common Hadoop myths
    • Key challenges (both hardware and software)
    • Labs: Discussing participants' Big Data projects and challenges
  • Planning and installation
    • Choosing software and Hadoop distributions
    • Cluster sizing and growth planning
    • Selecting appropriate hardware and network infrastructure
    • Rack topology considerations
    • Installation procedures
    • Managing multi-tenancy
    • Directory structure and log management
    • Benchmarking performance
    • Labs: Installing a cluster and running performance benchmarks
  • HDFS operations
    • Core concepts (horizontal scaling, replication, data locality, rack awareness)
    • Understanding nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
    • Health monitoring techniques
    • Administration via command-line and browser interfaces
    • Adding storage and replacing defective drives
    • Labs: Getting familiar with the HDFS command line
  • Data ingestion
    • Using Flume for logs and other data ingestion into HDFS
    • Using Sqoop for importing data from SQL databases to HDFS and exporting back to SQL
    • Implementing Hadoop data warehousing with Hive
    • Copying data between clusters using distcp
    • Utilizing S3 as a complementary solution to HDFS
    • Best practices and architectures for data ingestion
    • Labs: Setting up and utilizing Flume and Sqoop
  • MapReduce operations and administration
    • Parallel computing prior to MapReduce: Comparing HPC with Hadoop administration
    • Managing MapReduce cluster loads
    • Nodes and daemons (JobTracker, TaskTracker)
    • Walkthrough of the MapReduce UI
    • MapReduce configuration
    • Job configuration settings
    • Optimizing MapReduce performance
    • Ensuring robustness in MapReduce: guidance for programmers
    • Labs: Executing MapReduce examples
  • YARN: New architecture and capabilities
    • Design goals and implementation architecture of YARN
    • Key components: ResourceManager, NodeManager, ApplicationMaster
    • Installing YARN
    • Job scheduling within YARN
    • Labs: Investigating job scheduling mechanisms
  • Advanced topics
    • Hardware monitoring
    • Cluster monitoring
    • Scaling operations: Adding and removing servers, and upgrading Hadoop
    • Backup, recovery, and business continuity planning
    • Oozie job workflows
    • Hadoop high availability (HA)
    • Hadoop Federation
    • Securing your cluster with Kerberos
    • Labs: Setting up monitoring systems
  • Optional tracks
    • Cloudera Manager for cluster administration, monitoring, and routine tasks: Installation and usage. All exercises and labs in this track are conducted within the Cloudera distribution environment (CDH5).
    • Ambari for cluster administration, monitoring, and routine tasks: Installation and usage. All exercises and labs in this track are conducted within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).
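To give a flavor of the HDFS administration topics above, here is a sketch of command-line operations of the kind typically exercised in the labs. This is illustrative only: it assumes a running cluster with the `hdfs` client on the PATH, and the paths shown are placeholders.

```shell
# Cluster-wide capacity, live/dead DataNodes, and per-node usage:
hdfs dfsadmin -report

# Check filesystem integrity (missing or under-replicated blocks):
hdfs fsck / -files -blocks -locations

# Everyday file operations mirror familiar POSIX tools:
hdfs dfs -mkdir -p /user/student/data
hdfs dfs -put access.log /user/student/data/
hdfs dfs -ls /user/student/data

# Inspect or override the replication factor for a file
# (-w waits until the new replication level is reached):
hdfs dfs -setrep -w 2 /user/student/data/access.log
```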

Requirements

  • Proficiency in basic Linux system administration
  • Fundamental scripting capabilities

Prior knowledge of Hadoop or distributed computing is not required; these topics are introduced and explained during the course.

Lab environment

Zero Install: Students do not need to install Hadoop software on their own devices. A fully functional Hadoop cluster will be provided for use during the training.

Participants must have access to the following:

  • An SSH client (built into Linux and macOS; Windows users are advised to use PuTTY)
  • A web browser for accessing the cluster's web interfaces (we recommend Firefox with the FoxyProxy extension installed)
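Browser access to the cluster's web UIs is commonly routed through an SSH SOCKS tunnel, which FoxyProxy is then pointed at. A minimal sketch, assuming a reachable gateway host; the username, hostname, and port are placeholders for your lab credentials:

```shell
# Open a local SOCKS proxy on port 8157, tunneled through the cluster
# gateway. -N means "no remote command", just forwarding.
ssh -D 8157 -N student@cluster-gateway

# Then configure FoxyProxy to use SOCKS v5 at localhost:8157 and browse
# to the web UIs, e.g. the NameNode or ResourceManager pages
# (exact ports vary by Hadoop version and distribution).
```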
Duration

21 hours
