Course description

  • What are the course objectives?

    Spark is an open-source query engine for processing large datasets, and it integrates well with the Python programming language. PySpark is the interface that gives you access to Spark from Python. This course starts with an overview of the Spark stack and shows you how to leverage the functionality of Python as you deploy it in the Spark ecosystem. The course then takes a deeper look at the Apache Spark architecture and shows you how to set up a Python environment for Spark. You'll learn about various techniques for collecting data, how RDDs contrast with DataFrames, how to read data from files and HDFS, and how to work with schemas.
    Finally, the course will teach you how to use SQL to interact with DataFrames. By the end of this PySpark course, you will know how to process data using Spark DataFrames and will have mastered distributed data processing techniques.

  • What skills will you learn?

    Upon successful completion of our PySpark online course, you will:    

    • Get an overview of Apache Spark and the Spark 2.0 architecture
    • Obtain a comprehensive knowledge of the various tools that fall under the Spark ecosystem, such as Spark SQL, Spark MLlib, Sqoop, Kafka, Flume, and Spark Streaming
    • Understand RDD schemas, lazy execution, and transformations, and learn how to change the schema of a DataFrame
    • Build and interact with Spark DataFrames using Spark SQL
    • Create and explore various APIs to work with Spark DataFrames
    • Learn how to aggregate, transform, filter, and sort data with DataFrames
       

  • Who should enroll in this PySpark Training Course?

    The global market for Big Data analytics is booming, opening up exciting opportunities for IT professionals. The following professional roles are ideal candidates for this course:

    • Freshers willing to start a career in Big Data
    • Developers and architects
    • BI/ETL/DW professionals
    • Mainframe professionals
    • Big Data architects, engineers, and developers
    • Data scientists and analytics professionals
       

  • What are the career benefits of this course?

    The career benefits of this course reflect the increasing popularity and adoption rate of Big Data tools like Spark. A quick highlight of the trends:  

    • The annual average salary of Spark developers is INR 700K in India (Source: Payscale) and $180K worldwide
    • The Big Data analytics market is expected to rise at a CAGR of 45.36% by 2025 (Source: Market Reach)
       

  • What are the prerequisites for this PySpark Online Training Course?

    There are no prerequisites for this PySpark training course. However, prior knowledge of Python programming and SQL is beneficial.
     

Course preview

    • Lesson 1 A Brief Primer on PySpark

      14:52
      • 1.1 A Brief Primer on PySpark
        05:52
      • 1.2 Brief Introduction to Spark
        02:04
      • 1.3 Apache Spark Stack
        01:38
      • 1.4 Spark Execution Process
        01:26
      • 1.5 Newest Capabilities of PySpark 2.0+
        01:56
      • 1.6 Cloning GitHub Repository
        01:56
    • Lesson 2 Resilient Distributed Datasets

      38:44
      • 2.1 Resilient Distributed Datasets
        01:49
      • 2.2 Creating RDDs
        04:38
      • 2.3 Schema of an RDD
        02:17
      • 2.4 Understanding Lazy Execution
        02:11
      • 2.5 Introducing Transformations – .map(…)
        03:57
      • 2.6 Introducing Transformations – .filter(…)
        02:23
      • 2.7 Introducing Transformations – .flatMap(…)
        06:14
      • 2.8 Introducing Transformations – .distinct(…)
        03:27
      • 2.9 Introducing Transformations – .sample(…)
        03:15
      • 2.10 Introducing Transformations – .join(…)
        04:17
      • 2.11 Introducing Transformations – .repartition(…)
        04:16
    • Lesson 3 Resilient Distributed Datasets and Actions

      35:27
      • 3.1 Resilient Distributed Datasets and Actions
        05:43
      • 3.2 Introducing Actions – .collect(…)
        02:15
      • 3.3 Introducing Actions – .reduce(…) and .reduceByKey(…)
        02:59
      • 3.4 Introducing Actions – .count()
        02:36
      • 3.5 Introducing Actions – .foreach(…)
        01:51
      • 3.6 Introducing Actions – .aggregate(…) and .aggregateByKey(…)
        04:55
      • 3.7 Introducing Actions – .coalesce(…)
        02:05
      • 3.8 Introducing Actions – .combineByKey(…)
        03:11
      • 3.9 Introducing Actions – .histogram(…)
        01:50
      • 3.10 Introducing Actions – .sortBy(…)
        02:38
      • 3.11 Introducing Actions – Saving Data
        03:10
      • 3.12 Introducing Actions – Descriptive Statistics
        02:14
    • Lesson 4 DataFrames and Transformations

      32:33
      • 4.1 DataFrames and Transformations
        01:35
      • 4.2 Creating DataFrames
        04:16
      • 4.3 Specifying Schema of a DataFrame
        06:00
      • 4.4 Interacting with DataFrames
        01:36
      • 4.5 The .agg(…) Transformation
        03:19
      • 4.6 The .sql(…) Transformation
        03:57
      • 4.7 Creating Temporary Tables
        02:31
      • 4.8 Joining Two DataFrames
        03:54
      • 4.9 Performing Statistical Transformations
        03:55
      • 4.10 The .distinct(…) Transformation
        01:30
    • Lesson 5 Data Processing with Spark DataFrames

      27:16
      • 5.1 Data Processing with Spark DataFrames
        06:29
      • 5.2 Filtering Data
        01:31
      • 5.3 Aggregating Data
        02:34
      • 5.4 Selecting Data
        02:24
      • 5.5 Transforming Data
        01:40
      • 5.6 Presenting Data
        01:34
      • 5.7 Sorting DataFrames
        01:00
      • 5.8 Saving DataFrames
        04:28
      • 5.9 Pitfalls of UDFs
        03:38
      • 5.10 Repartitioning Data
        01:58
    • Lesson 1 - Welcome

      02:28
      • Welcome
        02:28
      • Learning Objectives
    • Lesson 2 - Python Basics

      11:55
      • 2.1 Learning Objectives
      • 2.2 Your first program
        01:15
      • 2.3 Types
        02:57
      • 2.4 Expressions and Variables
        03:50
      • 2.5 Write your First Python Code
      • 2.6 String Operations
        03:53
      • 2.7 String Operations
    • Lesson 3 - Python Data Structures

      16:22
      • 3.1 Learning Objectives
      • 3.2 Lists and Tuples
        08:46
      • 3.3 Lists and Tuples
      • 3.4 Sets
        05:12
      • 3.5 Sets
      • 3.6 Dictionaries
        02:24
      • 3.7 Dictionaries
    • Lesson 4 - Python Programming Fundamentals

      41:08
      • 4.1 Learning Objectives
      • 4.2 Conditions and Branching
        10:13
      • 4.3 Conditions and Branching
      • 4.4 Loops
        06:40
      • 4.5 Loops
      • 4.6 Functions
        13:28
      • 4.7 Functions
      • 4.8 Objects and Classes
        10:47
      • 4.9 Objects and Classes
    • Lesson 5 - Working with Data in Python

      12:35
      • 5.1 Learning Objectives
      • 5.2 Reading files with open
        03:38
      • 5.3 Reading Files
      • 5.4 Writing files with open
        02:49
      • 5.5 Writing Files
      • 5.6 Loading data with Pandas
        04:07
      • 5.7 Working with and Saving data with Pandas
        02:01
      • 5.8 Loading Data and Viewing Data
    • Lesson 6 - Working with Numpy Arrays

      18:26
      • 6.1 Learning Objectives
      • 6.2 Numpy One-Dimensional Arrays
        11:18
      • 6.3 Working with One-Dimensional Numpy Arrays
      • 6.4 Numpy Two-Dimensional Arrays
        07:08
      • 6.5 Working with Two-Dimensional Numpy Arrays
    • Lesson 7 - Course Summary

      01:13
      • Course Summary
        01:13
      • Unlocking IBM Certificate


    FAQs

    • What is PySpark?

      Apache Spark is an open-source cluster computing framework used for real-time processing and streaming analytics. Python is an open-source programming language with a plethora of libraries that support diverse applications. PySpark is the integration of Python and Spark used for Big Data analytics. This Python API for Spark enables programmers to harness the simplicity of Python and the power of Apache Spark.
       

    • What is RDD in PySpark?

      RDD is an abbreviation for Resilient Distributed Dataset, the primary building block of Apache Spark. An RDD is a fundamental data structure of Apache Spark: an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions that may be computed on different nodes of the cluster.
       

    • Is PySpark a programming language?

      PySpark is not a programming language. It is a Python API for Apache Spark deployments that Python developers can leverage to create in-memory processing applications. 

    • Who are the instructors and how are they selected?

      All of our highly qualified trainers are industry experts with years of relevant industry experience working with these technologies. Each of them has gone through a rigorous selection process that includes profile screening, technical evaluation, and a training demo before they are certified to train for us. We also ensure that only those trainers with a high alumni rating remain on our faculty.

    • How do I enroll in this online training?

      You can enroll in this training on our website and make an online payment using any of the following options:
      • Visa Credit or Debit Card
      • MasterCard
      • American Express
      • Diners Club
      • PayPal
      Once payment is received, you will automatically receive a payment receipt and access information via email.
       

    • Can I cancel my enrollment? Will I get a refund?

      Yes, you can cancel your enrollment if necessary. We will refund the course price after deducting an administration fee. To learn more, you can view our refund policy.

    Contact Us

    +1-844-532-7688

    (Toll Free)

    • Disclaimer
    • PMP, PMI, PMBOK, CAPM, PgMP, PfMP, ACP, PBA, RMP, SP, and OPM3 are registered marks of the Project Management Institute, Inc.