Scheduling data updates with AWS Lambda

Working as the solo data specialist on a team often means being resourceful and solving problems at every stage of the data science pipeline. During my first year as a data analyst, one of the most challenging yet rewarding tasks was preparing the cloud infrastructure for our web apps. This post records a key takeaway from my troubleshooting while building an event-driven data processing pipeline within AWS.

Background

Our team is launching a pilot web app that provides an integrated monitoring system for the training we deliver both online and offline. After collecting, cleaning, and loading the data into a MySQL database deployed on Amazon RDS, we wanted the database to be updated regularly and automatically through an e-learning website’s API.

Challenge

Since the aim is to provide an up-to-date dashboard to our clients, we needed a server to run, on a schedule, a Python script that fetches data from our web-based data source.

Approach

Since our web application and database are already deployed on AWS, I opted for AWS Lambda to create a triggered data flow. AWS Lambda is a compute service that runs uploaded code in response to trigger events. It is easy to set up because it is serverless: you don’t have to define or manage any of the underlying infrastructure.
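
For the scheduling part, the trigger can be configured in the console (EventBridge / CloudWatch Events) or scripted. As a reference, here is a minimal sketch with boto3; the rule name, schedule, and function name below are placeholders, not the project’s actual values:

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Create a rule that fires once a day (placeholder name and schedule)
rule = events.put_rule(
    Name="daily-data-refresh",
    ScheduleExpression="rate(1 day)",
)

# Allow the rule to invoke the function
lambda_client.add_permission(
    FunctionName="get_activeusers",
    StatementId="allow-eventbridge-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# Point the rule at the function
function_arn = lambda_client.get_function(
    FunctionName="get_activeusers"
)["Configuration"]["FunctionArn"]
events.put_targets(
    Rule="daily-data-refresh",
    Targets=[{"Id": "get-activeusers", "Arn": function_arn}],
)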

STEP 1: Upload .py file to AWS Lambda

Prepare a .py file (in the case of Python) containing your function and zip it, then upload the archive as a Lambda function. After a successful upload, check the Runtime settings and make sure the Python version matches the one your code needs. For example:

Runtime: Python 3.8
Handler: get_activeusers.lambda_handler

The handler follows the format python_file_name.function_name. In this example, I kept AWS Lambda’s default function name, lambda_handler, and packaged it inside a file named get_activeusers.py.
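
For reference, here is a minimal sketch of what such a file can look like. The API URL, token, and response handling are placeholders, not the actual implementation:

# get_activeusers.py
import json
import requests

def lambda_handler(event, context):
    # Fetch the latest records from the e-learning API (placeholder URL and token)
    response = requests.get(
        "https://elearning.example.com/api/active_users",
        headers={"Authorization": "Bearer YOUR_API_TOKEN"},
        timeout=30,
    )
    response.raise_for_status()
    records = response.json()
    # ...transform with pandas and write to the RDS MySQL database here...
    return {"statusCode": 200, "body": json.dumps({"fetched": len(records)})}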

STEP 2: Add Lambda layers

Layers define the dependencies required to run the function. If you test the uploaded function without adding layers, you’ll probably get an import error; in my case it was: No module named ‘pandas’. There are online tutorials and discussions about how to get rid of this error: all the libraries the function requires have to be archived into a .zip and uploaded from Amazon Simple Storage Service (Amazon S3) or from your local machine.

The breakthrough for me was to start a container from a base Docker image and install the required libraries inside it. The installed packages were then zipped and uploaded to an S3 bucket.

Here are the commands I used to get rid of the No module named ‘pandas’ error. I referred to this post from Stack Overflow.

Create a directory and a docker container

mkdir layer_pandas
cd layer_pandas
mkdir python                            # packages must sit under python/ in the layer .zip
docker run --rm -it -v $PWD/python:/layer amazonlinux:1 bash    # mount ./python as /layer

Within the docker container

yum install -y python38                 # install Python 3.8 (as of Oct 2021)
cd layer
python3 -m pip install --upgrade pip    # upgrade pip if required
python3 -m pip install pandas -t .      # install pandas (numpy comes with it) into the mounted directory
python3 -m pip install requests -t .    # install other libraries the function depends on
rm -r *.dist-info __pycache__           # remove files the layer doesn't need
exit                                    # exit the container

Zip the installed packages

zip -r layer_pandas.zip .               # run on the host from inside layer_pandas/, so python/ sits at the zip root

Once the .zip file is created, upload it to your S3 bucket and copy its object URL to create a new layer in the AWS Lambda console.
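
If you prefer scripting this step, the upload and layer creation can also be done with boto3. A minimal sketch, assuming a placeholder bucket name (my-layer-bucket) and layer name (pandas-layer):

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

# Upload the archive to S3 (bucket name is a placeholder)
s3.upload_file("layer_pandas.zip", "my-layer-bucket", "layer_pandas.zip")

# Publish a new layer version from the uploaded object
layer = lambda_client.publish_layer_version(
    LayerName="pandas-layer",
    Content={"S3Bucket": "my-layer-bucket", "S3Key": "layer_pandas.zip"},
    CompatibleRuntimes=["python3.8"],
)
print(layer["LayerVersionArn"])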

Summary of workflow

Install the requirements inside a Docker container and zip the result -> Upload the .zip to a dedicated S3 bucket -> Create a new layer from the .zip in S3 -> Add the new layer to the function
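
That last step can be done in the console or scripted. A minimal sketch with boto3, assuming the LayerVersionArn returned when the layer was published (the ARN below is a placeholder):

import boto3

lambda_client = boto3.client("lambda")

# Attach the published layer version to the function.
# Note: Layers replaces the function's entire layer list.
lambda_client.update_function_configuration(
    FunctionName="get_activeusers",
    Layers=["arn:aws:lambda:ap-northeast-2:123456789012:layer:pandas-layer:1"],  # placeholder ARN
)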