1. Summary

If you’re working with Python batch jobs that are heavy on mathematical computations, consider the IntelPython Docker base image. In our experience with data science jobs that typically finish in about ten minutes, IntelPython consistently outperforms standard Python base images when running on Intel CPUs. The difference is striking – a reduction of three to five minutes on our ten-minute jobs.

For general Python batch jobs, we recommend using the official Python Docker base image.

2. AWS Batch

Most data-heavy data science workloads are batch (i.e., not real-time) jobs, and we prefer running them on AWS Batch for two reasons:

  • cost – with AWS Fargate and Spot Instances, the jobs run virtually for free.
  • performance – AWS Batch can parallelize jobs and their dependencies by launching a virtually unlimited number of parallel processes.
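
When such a job needs to be launched programmatically, boto3's `submit_job` call takes a job name, a job queue, and a job definition. The helper below just assembles those keyword arguments; the queue and job-definition names are placeholders, not real resources:

```python
def make_job_request(name, queue, definition, command):
    """Assemble keyword arguments for boto3's batch.submit_job call."""
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "containerOverrides": {"command": command},
    }

request = make_job_request(
    "benchmark-run",           # placeholder job name
    "my-spot-queue",           # placeholder job queue
    "my-benchmark-jobdef",     # placeholder job definition
    ["python3", "./benchmark.py"],
)
# With real AWS credentials: boto3.client("batch").submit_job(**request)
print(request["jobName"])
```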

AWS Batch is based on running Docker images on either ARM or Intel-compatible CPUs. For data science jobs, we prefer Intel CPUs. This choice is primarily driven by:

  • library availability – most of the libraries we rely on are compatible only with Intel CPUs, which ensures smoother operations and seamless integration with our existing tools and workflows.
  • single-threaded data science code – Intel CPUs excel at running single-threaded code.
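
A quick way to confirm which architecture a container actually landed on is `platform.machine()`, which reports `x86_64` (or `AMD64`) on Intel-compatible hosts and `aarch64` on ARM:

```python
import platform

arch = platform.machine()
print(f"CPU architecture: {arch}")
if arch not in ("x86_64", "AMD64"):
    print("Warning: not an Intel-compatible CPU.")
```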

3. Typical Data Science Docker Base Images for Python

The usual Docker Base images for Python on Intel CPUs used in data science are:

  1. Python Docker image built on Ubuntu
  2. Official Python Docker Image
  3. IntelPython Docker Image

While each base image varies in size, in the context of AWS Batch, disk size becomes a secondary concern. The crucial factor to focus on is the runtime performance – this aspect is where the real efficiency and effectiveness of these images are truly evaluated.

3.1 Ubuntu Docker Image

The Ubuntu image is usually preferred for local development, since it ships with the majority of common Linux tools pre-installed. Performance-wise, however, it may not be the best choice.

3.2 Official Python Docker Image

There are dozens of official Python Docker images; we prefer the latest one. With each Python release generally faster than the previous one, the latest image offers the best performance for general Python operations.

3.3 IntelPython Docker Image

While the release cadence of IntelPython cannot match that of the official Python releases, IntelPython delivers superior performance for math-heavy data science jobs, largely thanks to Intel's optimized compilers and MKL-accelerated numerical libraries.
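
One way to see where the difference comes from is to inspect which BLAS/LAPACK backend NumPy was linked against; on the IntelPython image this typically reports MKL, while the official Python image usually ships OpenBLAS:

```python
import numpy as np

# Prints NumPy's build configuration, including the linked
# BLAS/LAPACK libraries (MKL on IntelPython, usually OpenBLAS
# on the official Python image).
np.show_config()
```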

4. Benchmarking

It’s important to remember that there isn’t a one-size-fits-all benchmark for evaluating Docker base images; the most reliable measure is how they perform with your specific code. I recommend experimenting with various Docker base images to see which one optimizes your script’s performance most effectively.

The heuristic rule we recommend: if your Python code uses only the pandas library, without heavy use of NumPy, SciPy, or other machine learning libraries, use the latest official Python Docker base image. Otherwise, use the IntelPython Docker base image.

To illustrate this rule, I have prepared two straightforward benchmarks for this article. The first covers general pandas data frame operations; the second covers numerically intensive ones.

4.1 Benchmark #1 – General pandas dataframe operations

This test focuses on simple arithmetic and on merging and filtering data frames.

The benchmark.py file is as follows:

import pandas as pd
import numpy as np
import timeit

number_of_tests = 5000

def benchmark_operation01():
    df = pd.DataFrame({'A': range(1, 1000000),
                       'B': range(1, 1000000),
                       'C': range(1, 1000000)})

    for i in range(number_of_tests):
        df['D'] = df['A'] * df['B'] + df['C']

def benchmark_operation02():
    n = 10 ** 6
    df1 = pd.DataFrame({'A': np.random.random(n), 'B': np.random.random(n), 'key': range(n)})
    df2 = pd.DataFrame({'C': np.random.random(n), 'D': np.random.random(n), 'key': range(1, n+1)})
    for i in range(number_of_tests):
        df = pd.merge(df1, df2, on='key')
        # Assign to a new name so df2 is not overwritten and every
        # iteration merges the same inputs.
        df_filtered = df1[df1['A'] > 0.5]

if __name__ == "__main__":
    start_time = timeit.default_timer()

    benchmark_operation01()
    benchmark_operation02()

    end_time = timeit.default_timer()
    execution_time = end_time - start_time

    print(f"Execution time: {execution_time} seconds")

I have run this script five times for each of the tested Docker base images and calculated the median times in seconds:

  • Python: 531.7914 seconds
  • IntelPython: 560.0930 seconds
  • Ubuntu: 564.2625 seconds
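
For reference, the medians were computed with `statistics.median` over the five runs; the run times below are illustrative placeholders, not the actual measurements:

```python
import statistics

# Five hypothetical run times in seconds for a single image.
runs = [529.9, 530.2, 531.7914, 531.8, 533.0]
print(f"median: {statistics.median(runs):.4f} seconds")
```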

4.2 Benchmark #2 – Math-extensive pandas dataframe operations

This test focuses on advanced math operations.

import pandas as pd
import numpy as np
import timeit

number_of_tests = 5000

def benchmark_operation01():
    n = 10 ** 6

    for i in range(number_of_tests):
        df = pd.DataFrame({'A': np.random.random(n),
                           'B': np.random.random(n)})
        df['C'] = df['A'] + df['B']
        df['D'] = df['A'] * df['B']
        df['E'] = np.sin(df['A'])
        df['F'] = np.log(df['B'])


if __name__ == "__main__":
    start_time = timeit.default_timer()
    
    benchmark_operation01()

    end_time = timeit.default_timer()
    execution_time = end_time - start_time

    print(f"Execution time: {execution_time} seconds")

I have run this script five times for each of the tested Docker base images and calculated the median times in seconds:

  • IntelPython: 82.17445 seconds
  • Python: 96.15572 seconds
  • Ubuntu: 109.4516 seconds
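
Expressed relative to IntelPython, the medians above translate into the following slowdowns for the other two images:

```python
# Median times (seconds) from the runs above.
intelpython = 82.17445
python = 96.15572
ubuntu = 109.4516

print(f"Python is {python / intelpython - 1:.1%} slower than IntelPython")
print(f"Ubuntu is {ubuntu / intelpython - 1:.1%} slower than IntelPython")
```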

Appendix 1 – requirements.txt

The requirements.txt is:

pandas
numpy

Appendix 2 – Dockerfile for each Docker Base Image

The Dockerfile for the Ubuntu image is:

FROM ubuntu:latest
ENV PYTHONUNBUFFERED=1

RUN mkdir /app
WORKDIR /app

RUN apt-get update && apt-get install -y python3 python3-pip

COPY requirements.txt /app/
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . /app/
CMD ["python3", "./benchmark.py"]

The Dockerfile for the official Python image is:

FROM python:latest
ENV PYTHONUNBUFFERED=1

RUN mkdir /app
WORKDIR /app

COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app/

CMD ["python3", "./benchmark.py"]

The Dockerfile for the IntelPython image is:

FROM intelpython/intelpython3_core
ENV PYTHONUNBUFFERED=1

RUN mkdir /app
WORKDIR /app

RUN python3 -m pip install --upgrade pip

COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app/

CMD ["python", "./benchmark.py"]
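
The three images can then be built and run like this; the image tags and per-image Dockerfile names (`Dockerfile.ubuntu`, etc.) are conventions chosen for this article, not requirements:

```shell
# Build one image per base (run from the directory containing
# benchmark.py, requirements.txt, and the Dockerfiles).
docker build -t benchmark-ubuntu -f Dockerfile.ubuntu .
docker build -t benchmark-python -f Dockerfile.python .
docker build -t benchmark-intelpython -f Dockerfile.intelpython .

# Run each benchmark; the script prints its execution time on exit.
docker run --rm benchmark-ubuntu
docker run --rm benchmark-python
docker run --rm benchmark-intelpython
```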
