PySpark – Big Data Analysis in Databricks (PYSPARK1)

Databases, NoSQL and Big Data

Do you work with Excel, Power Query, SQL or Pandas but need to process gigabytes to terabytes of data? PySpark is the Python interface to Apache Spark, a big data engine for scalable, distributed processing that can handle datasets larger than one machine's memory and speed up analysis.

The workshop runs entirely in Databricks Community Edition in your browser - no local setup required. You'll learn the DataFrame API and Spark SQL, work with notebooks, clusters and data uploads, and apply familiar SQL skills to scale analyses to large datasets.

Location, current course term

Contact us

Customized training (date, location, content, duration)

The course:

  • Getting started with Databricks
    1. What PySpark is and when to use it
    2. Creating an account in Databricks Community Edition
    3. Navigating the environment – workspace, notebooks, cluster
    4. Uploading data to Databricks
  • DataFrame – basic operations
    1. Creating DataFrames
    2. Schema and data types
    3. Selecting columns (select)
    4. Filtering rows (filter, where)
    5. Adding and transforming columns (withColumn)
  • Spark SQL
    1. Registering a DataFrame as a table (createTempView)
    2. Running SQL queries on data (spark.sql)
    3. Combining DataFrame API and SQL
    4. Using SQL functions in the DataFrame API
  • Data sources
    1. CSV files
    2. Parquet – optimal format for Spark
    3. JSON files
    4. Delta Lake (basics)
  • Data processing
    1. Column and type transformations
    2. Handling missing values (null)
    3. Joining tables (join)
    4. Combining datasets (union)
  • Data aggregation
    1. Grouping (groupBy)
    2. Aggregate functions (count, sum, avg, min, max)
    3. Multiple aggregations at once (agg)
    4. Pivot tables
  • Troubleshooting
    1. Reading PySpark error messages
    2. Common errors: data types, missing columns
    3. Data checks and debugging
  • Outputs and exporting data
    1. Saving to files (CSV, Parquet)
    2. Visualizations in Databricks
    3. Downloading results
Assumed knowledge:
Basic Python (variables, loops, functions); experience with SQL, Excel, Power Query or Pandas is advantageous.
Schedule:
2 days (9:00 AM - 5:00 PM)
Course price:
432.00 € (522.72 € incl. 21% VAT)
Language: