Prepare your Datasets at Scale using Apache Spark and SageMaker Data Wrangler [Level 300]

June 9, 2021
Pandas doesn’t scale to large datasets. Apache Spark is an open-source, distributed processing engine that scales across many cluster instances to handle them. In this session, I will demonstrate several ways to use Apache Spark on AWS to analyze large datasets, perform data quality checks, transform raw data into machine learning features, and train predictive models. I will also demonstrate how SageMaker Data Wrangler uses Apache Spark to detect bias in our datasets.
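Purely as an illustration (not code from the session), here is a minimal PySpark sketch of the kinds of steps the abstract describes: loading a dataset from S3, running a simple null-count data-quality check, and turning raw columns into machine learning features. The bucket path and column names (star_rating, product_category, helpful_votes, total_votes) are hypothetical placeholders.

```python
# Minimal PySpark sketch: data-quality check plus feature transformation.
# Dataset path, schema, and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("prepare-dataset").getOrCreate()

# Load a large CSV dataset from S3 (placeholder path).
df = spark.read.csv("s3://my-bucket/raw/reviews.csv", header=True, inferSchema=True)

# Simple data-quality check: count null values per column.
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
)
null_counts.show()

# Drop rows missing the columns we care about.
clean = df.dropna(subset=["star_rating", "product_category"])

# Transform raw columns into ML features: index a categorical column,
# then assemble numeric columns into a single feature vector.
indexer = StringIndexer(inputCol="product_category", outputCol="category_idx")
indexed = indexer.fit(clean).transform(clean)

assembler = VectorAssembler(
    inputCols=["category_idx", "helpful_votes", "total_votes"],
    outputCol="features",
)
features = assembler.transform(indexed)

features.select("features", "star_rating").show(5)
```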

Speaker: Chris Fregly, AWS Senior Advocate, AI/ML