Speed up data preparation for ML pipelines on AWS

Data Science Milan
May 7, 2021

“Life is easier with AWS services”

On 21st April 2021, Data Science Milan organized a web meetup hosting Francesco Marelli, who talked about data manipulation pipelines on AWS.

“Speed up data preparation for ML pipelines on AWS”, by Francesco Marelli, Senior Solutions Architect at AWS.

To exploit huge amounts of data, companies move all their data from various silos into a single location, called a data lake, to perform analytics and Machine Learning. Francesco showed the Lake House Architecture on AWS. The idea behind this architecture is to build a central repository of data upon which different analytics services can operate, from a data warehouse to a Machine Learning service. This allows you to build scalable data lakes and data warehouses, analyse data with purpose-built data services, and enable unified governance and easy data movement.

Lake Formation collects and catalogues data from databases and object storage, then moves the data into an Amazon S3 data lake. AWS Glue provides the capabilities needed for data integration. On top of these sit further layers offering a broad portfolio of purpose-built data services, including Amazon Athena for interactive queries, Amazon EMR for big data processing, Amazon Elasticsearch Service for log and search analytics, Amazon Kinesis for real-time analytics, Amazon Redshift for data warehousing and Amazon SageMaker for Machine Learning.

Francesco showed a use case of AWS Glue, a serverless data integration service for complex workloads that connects to hundreds of data sources and can process data in real time. All of this makes it easier to discover, prepare, and combine data for analytics, Machine Learning, and application development.

AWS Glue runs in a serverless environment: there is no infrastructure to maintain, and it allocates the compute power needed to run your data integration jobs. It is presented as a cheaper alternative to other cloud data integration options. AWS Glue automates much of the effort required for data integration: it crawls your data sources, identifies data formats, and suggests schemas to store your data. It also prepares raw data for Machine Learning. Different groups across your organization can use AWS Glue to work together on data integration tasks, reducing the time required to analyse your data.
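To give a feeling for what "suggesting a schema" means, here is a minimal sketch in plain Python. It is not how AWS Glue crawlers work internally; it is only a toy illustration that reads a small CSV sample and proposes a column type for each field. The function names (`infer_type`, `suggest_schema`), the sample data and the type labels are assumptions made for this example.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type label that fits every non-empty value."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "string"
    # Try the narrowest type first, then fall back to wider ones.
    for name, cast in (("bigint", int), ("double", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def suggest_schema(csv_text):
    """Map each column of a CSV sample to a suggested type label."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {col: infer_type([row[col] for row in rows]) for col in columns}


# Hypothetical sample, standing in for a file found by a crawler.
sample = "id,price,city\n1,9.5,Milan\n2,12,Rome\n"
schema = suggest_schema(sample)
```

A real crawler also has to deal with file formats, partitions and nested data, which this sketch ignores.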

As much as 80% of time is spent on tasks associated with data preparation: extraction and loading, cleaning and normalization, and orchestrating data preparation in workflows. AWS Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code. It helps to reduce the time needed to prepare data for analytics and Machine Learning (ML). You can choose from over 250 ready-made transformations to automate data preparation tasks, such as filtering anomalies, converting data to standard formats, and correcting invalid values. With the intuitive DataBrew interface, you can interactively discover, visualize, clean, and transform raw data. In this way, business analysts, data scientists, and data engineers can more easily collaborate to get insights from raw data.
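DataBrew itself requires no code, but it can help to see what transformations of this kind amount to. The following is a small sketch in plain Python, not DataBrew's implementation: it converts dates to a standard format, replaces invalid values with missing ones, and normalizes a text field. The record layout (`signup_date`, `age`, `country`) and the valid age range are assumptions made for this example.

```python
from datetime import datetime

# Input date formats this sketch accepts (an assumption for the example).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def to_iso_date(text):
    """Convert a date string to ISO format, or None if it cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_record(record):
    """Apply three simple cleaning steps to one raw record."""
    age = record.get("age")
    country = (record.get("country") or "").strip().upper()
    return {
        # Convert to a standard format.
        "signup_date": to_iso_date(record.get("signup_date") or ""),
        # Correct invalid values by marking them as missing.
        "age": age if isinstance(age, int) and 0 <= age <= 120 else None,
        # Normalize free text.
        "country": country or None,
    }


raw = {"signup_date": "07/05/2021", "age": -3, "country": " it "}
cleaned = clean_record(raw)
```

In DataBrew the same steps would be chosen from the list of ready-made transformations and recorded as a reusable recipe.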

Another tool for data wrangling is AWS Data Wrangler, which extends the power of the pandas library to AWS services, connecting data coming from different sources.
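As a rough idea of how this looks in practice, the sketch below reads a CSV file from Amazon S3 into a pandas DataFrame and writes it back as a Parquet dataset, using the `awswrangler` package. It assumes `awswrangler` is installed and AWS credentials are configured. The bucket and key names are hypothetical, and `s3_uri` and `csv_to_parquet` are helper names invented for this example.

```python
def s3_uri(bucket, key):
    """Build an S3 URI from a bucket name and an object key."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def csv_to_parquet(bucket, src_key, dst_prefix):
    """Read a CSV from S3, drop fully empty rows, store it as Parquet."""
    # Imported here so the helper above works without the package installed.
    import awswrangler as wr

    df = wr.s3.read_csv(path=s3_uri(bucket, src_key))
    df = df.dropna(how="all")  # ordinary pandas from here on
    wr.s3.to_parquet(df=df, path=s3_uri(bucket, dst_prefix), dataset=True)
    return df


# Hypothetical locations; calling csv_to_parquet needs real AWS access.
source = s3_uri("my-data-lake", "/raw/customers.csv")
```

The point is that the data lands in a regular DataFrame, so any existing pandas code can be reused between the read and the write.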

The last service shown was SageMaker Data Wrangler, presented as the fastest and easiest way to prepare data for Machine Learning. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. SageMaker Data Wrangler contains many built-in data transformations to convert raw data into features for Machine Learning. You can quickly detect outliers or extreme values within a data set, and identify inconsistencies and potential data preparation issues that could hinder model accuracy. Once your data is prepared, you can build fully automated ML workflows with Amazon SageMaker Pipelines and save the features in the Amazon SageMaker Feature Store, so that your team and others can reuse them for their own models and analysis.
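The outlier detection mentioned above can be illustrated with a common rule of thumb, the interquartile range (IQR) rule. This is a minimal sketch in plain Python and not necessarily the method SageMaker Data Wrangler applies. The function name `iqr_outliers`, the multiplier `k=1.5` and the sample values are assumptions made for this example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)
```

Here the quartiles are 11 and 13, so anything below 8 or above 16 is flagged, which singles out the value 95.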

References

Harness the power of your data with AWS Analytics

Recording & Slides:

video

slides

Written by Claudio G. Giancaterino

