Course Outline
Introduction
- The history and core concepts of Hadoop
- Overview of the Hadoop ecosystem
- Understanding different distributions
- High-level architecture
- Addressing common Hadoop myths
- Key challenges (both hardware and software)
- Labs: Discussing participants' Big Data projects and challenges
Planning and installation
- Choosing software and Hadoop distributions
- Cluster sizing and growth planning
- Selecting appropriate hardware and network infrastructure
- Rack topology considerations
- Installation procedures
- Managing multi-tenancy
- Directory structure and log management
- Benchmarking performance
- Labs: Installing a cluster and running performance benchmarks
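For orientation, the benchmarking lab typically uses the test jars that ship with Hadoop. The commands below are illustrative command fragments to be run on the lab cluster; jar paths and file locations vary by distribution and version.

```shell
# Illustrative only: jar paths vary by distribution (CDH, HDP) and version.
# TestDFSIO measures raw HDFS read/write throughput.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar \
    TestDFSIO -write -nrFiles 10 -fileSize 128MB
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar \
    TestDFSIO -read -nrFiles 10 -fileSize 128MB

# TeraGen/TeraSort exercise the full MapReduce pipeline end to end.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    teragen 10000000 /benchmarks/tera-in
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    terasort /benchmarks/tera-in /benchmarks/tera-out
```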
HDFS operations
- Core concepts (horizontal scaling, replication, data locality, rack awareness)
- Understanding nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
- Health monitoring techniques
- Administration via command-line and browser interfaces
- Adding storage and replacing defective drives
- Labs: Working with the HDFS command-line interface
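A taste of the commands covered in this lab; the paths are illustrative, and all commands assume a running cluster.

```shell
# Everyday HDFS file operations (paths are illustrative).
hdfs dfs -mkdir -p /data/raw                # create a directory
hdfs dfs -put access.log /data/raw/         # upload a local file
hdfs dfs -ls /data/raw                      # list contents
hdfs dfs -setrep -w 2 /data/raw/access.log  # change replication factor

# Cluster health from the command line.
hdfs dfsadmin -report         # DataNode status and capacity
hdfs fsck / -files -blocks    # filesystem integrity check
```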
Data ingestion
- Using Flume for logs and other data ingestion into HDFS
- Using Sqoop for importing data from SQL databases to HDFS and exporting back to SQL
- Implementing Hadoop data warehousing with Hive
- Copying data between clusters using distcp
- Utilizing S3 as a complementary solution to HDFS
- Best practices and architectures for data ingestion
- Labs: Setting up and utilizing Flume and Sqoop
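The ingestion lab works with tools along these lines. Everything below is an illustrative sketch: the database connection string, credentials, table names, agent names, and NameNode addresses are placeholders, not values from the course environment.

```shell
# Sqoop: import a SQL table into HDFS, then export results back.
# (Connection string, credentials, and table names are illustrative.)
sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl --password-file /user/etl/.dbpass \
    --table orders --target-dir /data/orders -m 4
sqoop export \
    --connect jdbc:mysql://dbhost/sales \
    --username etl --password-file /user/etl/.dbpass \
    --table order_summaries --export-dir /data/order_summaries

# Flume: a minimal agent that tails a log file into HDFS.
cat > flume.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/access.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/flume/logs
a1.sinks.k1.channel = c1
EOF
flume-ng agent --conf-file flume.conf --name a1

# distcp: copy data between clusters (NameNode addresses are illustrative).
hadoop distcp hdfs://nn1:8020/data/orders hdfs://nn2:8020/backup/orders
```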
MapReduce operations and administration
- Parallel computing prior to MapReduce: Comparing HPC with Hadoop administration
- Managing MapReduce cluster loads
- Nodes and daemons (JobTracker, TaskTracker)
- Walkthrough of the MapReduce UI
- MapReduce configuration
- Job configuration settings
- Optimizing MapReduce performance
- Ensuring robustness in MR: Guidance for programmers
- Labs: Executing MapReduce examples
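The examples lab runs the sample jobs bundled with Hadoop, along these lines (jar path and input files are illustrative and vary by distribution):

```shell
# Run the bundled wordcount example against some XML config files.
hdfs dfs -mkdir -p /user/student/input
hdfs dfs -put /etc/hadoop/conf/*.xml /user/student/input/
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount /user/student/input /user/student/output

# Inspect the output and the job queue.
hdfs dfs -cat /user/student/output/part-r-00000 | head
mapred job -list
```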
YARN: New architecture and capabilities
- Design goals and implementation architecture of YARN
- Key components: ResourceManager, NodeManager, Application Master
- Installing YARN
- Job scheduling within YARN
- Labs: Investigating job scheduling mechanisms
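The scheduling lab is driven largely from the YARN command line; a few representative commands (the application ID shown is a placeholder):

```shell
# YARN CLI basics (run on a cluster node).
yarn node -list           # NodeManagers and their state
yarn application -list    # running applications and their queues
yarn application -kill application_1400000000000_0001  # stop a job
yarn rmadmin -refreshQueues   # reload scheduler queue configuration
```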
Advanced topics
- Hardware monitoring
- Cluster monitoring
- Scaling operations: Adding and removing servers, and upgrading Hadoop
- Backup, recovery, and business continuity planning
- Oozie job workflows
- Hadoop high availability (HA)
- Hadoop Federation
- Securing your cluster with Kerberos
- Labs: Setting up monitoring systems
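Alongside a full monitoring system, a few spot checks from the command line tie together several topics in this section. Hostnames, principal names, and the NameNode web port (50070 in Hadoop 2.x) are illustrative.

```shell
# Spot checks that complement a monitoring system.
hdfs haadmin -getServiceState nn1   # active/standby state of an HA NameNode
hdfs dfsadmin -safemode get         # is the NameNode in safe mode?

# Pull NameNode metrics over JMX (the same data monitoring tools scrape).
curl 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'

# On a Kerberos-secured cluster, obtain a ticket before running any command.
kinit hdfs-admin@EXAMPLE.COM
```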
Optional tracks
- Cloudera Manager for cluster administration, monitoring, and routine tasks: Installation and usage. All exercises and labs in this track are conducted within the Cloudera distribution environment (CDH5).
- Ambari for cluster administration, monitoring, and routine tasks: Installation and usage. All exercises and labs in this track are conducted within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).
Requirements
- Proficiency in basic Linux system administration
- Fundamental scripting capabilities
Prior knowledge of Hadoop or Distributed Computing is not required, as these topics will be introduced and explained during the course.
Lab environment
Zero Install: Students do not need to install Hadoop software on their own devices. A fully functional Hadoop cluster will be provided for use during the training.
Participants must have access to the following:
- An SSH client (Linux and Mac users already have SSH clients; Windows users are advised to use PuTTY)
- A web browser to access the cluster. We recommend using Firefox with the FoxyProxy extension installed.
21 Hours
Testimonials (1)
Hands on exercises. Class should have been 5 days, but the 3 days helped to clear up a lot of questions that I had from working with NiFi already