Recommended course

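Course description:

This four-day course covers the fundamentals of Apache Spark and how it integrates with the wider Hadoop ecosystem.
The course reviews the basics of HDFS and teaches how to ingest data with Sqoop and Flume, process distributed data with Spark, model data in Impala and Hive, and apply best practices for data storage.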

 

 

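Intended audience:

Business managers, CIOs, CTOs, government IT officials, project (development) managers, and consultants; IT managers, IT consultants, and IT support specialists; systems engineers, data center administrators, cloud administrators, and anyone looking to move into cloud computing.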

 

 

Prerequisites:

Basic Linux system administration experience; no prior knowledge of Hadoop is required.
 

 

Certification:

Passing the exam earns the Cloudera Certified Administrator for Apache Hadoop (CCAH) certification.

 

 

Course objectives:
 

How data is distributed, stored, and processed in a Hadoop cluster  
How to use Sqoop and Flume to ingest data  
How to process distributed data with Apache Spark  
How to model structured data as tables in Impala and Hive  
How to choose the best data storage format for different data usage patterns  
Best practices for data storage

 

Course content:


Introduction to Hadoop and the Hadoop Ecosystem  
Problems with Traditional Large-scale Systems  
Hadoop!  
The Hadoop Ecosystem  

 

Hadoop Architecture and HDFS  
Distributed Processing on a Cluster  
Storage: HDFS Architecture  
Storage: Using HDFS  
Resource Management: YARN Architecture  
Resource Management: Working with YARN  

 

Importing Relational Data with Apache Sqoop  
Sqoop Overview  
Basic Imports and Exports  
Limiting Results  
Improving Sqoop’s Performance  
Sqoop 2  

 

Introduction to Impala and Hive  
Introduction to Impala and Hive  
Why Use Impala and Hive?  
Comparing Hive to Traditional Databases  
Hive Use Cases  

 

Modeling and Managing Data with Impala and Hive  
Data Storage Overview  
Creating Databases and Tables  
Loading Data into Tables  
HCatalog  
Impala Metadata Caching  

 

Data Formats  
Selecting a File Format
Hadoop Tool Support for File Formats  
Avro Schemas  
Using Avro with Hive and Sqoop  
Avro Schema Evolution  
Compression  

 

Data Partitioning  
Partitioning Overview  
Partitioning in Impala and Hive  

 

Capturing Data with Apache Flume  
What is Apache Flume?  
Basic Flume Architecture  
Flume Sources  
Flume Sinks  
Flume Channels  
Flume Configuration  

 

Spark Basics  
What is Apache Spark?  
Using the Spark Shell  
RDDs (Resilient Distributed Datasets)  
Functional Programming in Spark  

 

Working with RDDs in Spark  
A Closer Look at RDDs  
Key-Value Pair RDDs  
MapReduce  
Other Pair RDD Operations  
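
As a rough illustration of the MapReduce pattern on key-value pairs listed in this module, here is a minimal word-count sketch in plain Python. It does not use the Spark API; the function names `map_phase`, `reduce_by_key`, and `word_count` are hypothetical and only mirror the idea of mapping records to pairs and then combining values per key.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit one (word, 1) pair per word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def word_count(lines):
    """Count word occurrences by summing the 1s emitted for each word."""
    return reduce_by_key(map_phase(lines), lambda a, b: a + b)


if __name__ == "__main__":
    sample = ["the cat sat", "the dog sat down"]
    print(word_count(sample))
```

In a cluster setting the pairs would be spread across partitions and combined per key in parallel; this sketch runs everything in a single process to keep the pattern visible.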

 

Writing and Deploying Spark Applications  
Spark Applications vs. Spark Shell  
Creating the SparkContext  
Building a Spark Application (Scala and Java)  
Running a Spark Application  
The Spark Application Web UI  
Configuring Spark Properties  
Logging  

 

Parallel Programming with Spark  
Review: Spark on a Cluster  
RDD Partitions  
Partitioning of File-based RDDs  
HDFS and Data Locality  
Executing Parallel Operations  
Stages and Tasks  

 

Spark Caching and Persistence  
RDD Lineage  
Caching Overview  
Distributed Persistence  

 

Common Patterns in Spark Data Processing  
Common Spark Use Cases  
Iterative Algorithms in Spark  
Graph Processing and Analysis  
Machine Learning  
Example: k-means  
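
The k-means example named above is a typical iterative algorithm: each pass assigns points to the nearest center, then recomputes the centers. The following is a small single-machine sketch in plain Python, not the course's Spark implementation; the helper names (`closest`, `mean`, `kmeans`) and the parameters `max_iter` and `tol` are assumptions made for illustration.

```python
import math


def closest(point, centers):
    """Return the index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centers, max_iter=20, tol=1e-9):
    """Repeat assign-then-update until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: bucket every point under its nearest center.
        clusters = {}
        for p in points:
            clusters.setdefault(closest(p, centers), []).append(p)
        # Update step: a center with no points keeps its old position.
        new_centers = [
            mean(clusters[i]) if i in clusters else centers[i]
            for i in range(len(centers))
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

Because every iteration rereads the same set of points, this is the kind of workload where keeping the data in memory between passes pays off.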

 

Preview: Spark SQL  
Spark SQL and the SQL Context  
Creating DataFrames  
Transforming and Querying DataFrames  
Saving DataFrames  
Comparing Spark SQL with Impala

Course: Cloudera Apache Hadoop Developer (CCA)