5 min read

Dask for Scaling Data Science

In the world of data science and machine learning, Python has established itself as the primary language for data analysis, with libraries like Pandas serving as the workhorse for data manipulation. However, as datasets grow larger, tasks like exploratory data analysis become impractically slow, and the single-threaded architecture of Pandas starts to show its limitations. This is where Dask comes into the picture: it is designed for parallel processing, making better use of modern multi-core CPUs.

Dask is a powerful parallel computing library that extends familiar Python libraries like Pandas and NumPy to work with much larger datasets and scale across multiple cores or even clusters of machines. Being able to run the same code on a laptop or on a huge server is a compelling value proposition.

Let's compare Dask with Pandas with respect to performance. For evidence, I am using a 10-million-row dataset of the following format, a simple comma-separated file.

    9999990,0.8650280748158613,0.6763992426693638
    9999991,0.796799926631457,0.7838066873984045
    9999992,0.8636286586877612,0.5482658133594334
    9999993,0.8404724061815126,0.5149522843098209
    9999994,0.4386663275449709,0.0418925190391195
    9999995,0.9056175079617191,0.1972167141108574
    9999996,0.7288087233312789,0.03340284275017302
    9999997,0.7016828033150314,0.49312869614744226
    9999998,0.4561628140545252,0.7843156099465182
    9999999,0.14553608403846574,0.6449094807828858

Performance Run

I set up a quick performance run to compare the two libraries. Remember that both dask and dask-expr need to be installed before running the following scripts.

Synthetic Dataset Generation

The script below generates a dataset of 10 million rows with random values in three columns, viz. id, value1, and value2, which we will use only for our load-performance testing.

import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv


# Function to generate a 10 million row dataset
def generate_large_dataset(num_rows=10_000_000):
    load_dotenv()
    # Set a random seed for reproducibility
    np.random.seed(0)
    
    # Generate random data
    data = {
        'id': np.arange(num_rows),
        'value1': np.random.rand(num_rows),
        'value2': np.random.rand(num_rows)
    }
    
    # Create a DataFrame
    df = pd.DataFrame(data)
    
    # Save the DataFrame to a CSV file
    file_name = os.getenv('DATASETS_PATH') +'/10_million_rows.csv' 
    print(f"file_name is {file_name}")
    df.to_csv(file_name, index=False)

# Generate the dataset
generate_large_dataset()
print("10 million rows dataset generated and saved as '10_million_rows.csv'.")

For group-by performance testing we make a subtle change to the script, giving us our second dataset. Note the n_unique_values and n_rows arguments passed to np.random.randint: this change is necessary so that the value1 column contains repeated values that a group-by operation can actually group on.


import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv


# Function to generate a 10 million row dataset
def generate_large_dataset(num_rows=10_000_000):
    n_unique_values = 1000
    n_rows = num_rows
    load_dotenv()
    # Set a random seed for reproducibility
    np.random.seed(0)
    
    # Generate random data
    data = {
        'value1': np.random.randint(0, n_unique_values, n_rows),
        'value2': np.random.rand(num_rows),
        'value3': np.random.rand(num_rows)
    }
    
    # Create a DataFrame
    df = pd.DataFrame(data)
    
    # Save the DataFrame to a CSV file
    file_name = os.getenv('DATASETS_PATH') +'/10_million_rows_for_groupby.csv' 
    print(f"file_name is {file_name}")
    df.to_csv(file_name, index=False)
    print("10 million rows dataset generated and saved as '10_million_rows_for_groupby.csv'.")

# Generate the dataset
generate_large_dataset()

Now you have the datasets and the code to start your own performance comparisons between Dask and Pandas! These are quick-and-dirty scripts that I put together; feel free to modify the dataset generation code to suit your specific needs.

Experiments and Measurements

After generating these datasets, I used them to experiment with Dask and Pandas, timing various operations such as loading the dataset, performing group-bys, and executing joins. In each case the timing is simply the wall-clock time measured around the operation.
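All the timings below follow the same pattern: record time.time() before and after the operation. A tiny helper like this (my own convenience wrapper, not part of the scripts that produced the numbers) expresses that pattern once:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the enclosed block of code took."""
    start = time.time()
    yield
    print(f"{label}: {time.time() - start:.2f} seconds")

# Usage: any code inside the with-block gets timed.
with timed("sum of a million ints"):
    sum(range(1_000_000))
```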

  1. The Load Performance
import time
import dask.dataframe as dd
import pandas as pd

# Timing Pandas loading
start_time = time.time()
pd_large_df = pd.read_csv('/run/media/rajivg/WORK/Code/Datasets/10_million_rows.csv')
print(f"Pandas loading time: {time.time() - start_time} seconds")

# Timing Dask loading
start_time = time.time()
dd_large_df = dd.read_csv('/run/media/rajivg/WORK/Code/Datasets/10_million_rows.csv')
print(f"Dask loading time: {time.time() - start_time} seconds")

The results after running the script are as follows:

Pandas loading time: 3.487161874771118 seconds

Dask loading time: 0.007143259048461914 seconds

  2. Group-by performance

This time let’s make separate scripts. Here is the script I made for the group-by operation using Pandas.

import pandas as pd
import time

# Load the dataset with Pandas
start_time = time.time()

df = pd.read_csv('/run/media/rajivg/WORK/Code/Datasets/10_million_rows_for_groupby.csv')
result = df.groupby('value1').sum()  # Perform groupby operation

end_time = time.time()

print("Pandas GroupBy Processing Time: {:.2f} seconds".format(end_time - start_time))

Pandas GroupBy Processing Time: 3.46 seconds

Now let’s do the same thing with Dask. 

import dask.dataframe as dd
import time

# Load the dataset with Dask
start_time = time.time()

ddf = dd.read_csv('/run/media/rajivg/WORK/Code/Datasets/10_million_rows_for_groupby.csv')
result = ddf.groupby('value1').sum().compute()  # Perform groupby operation

end_time = time.time()

print("Dask GroupBy Processing Time: {:.2f} seconds".format(end_time - start_time))

Dask GroupBy Processing Time: 1.64 seconds

Analysis

Pandas:

We notice that Pandas takes much more time to load the file, because it performs a single-threaded read of the entire dataset into memory.

We also notice that Pandas takes significantly more time for the group-by operation, because it processes the entire dataset in memory using a single core.

Dask:

It’s clear from my experiments that Dask performs the same operations faster, as it divides the data into smaller partitions and processes them in parallel, utilizing multiple cores. One caveat on the loading numbers: dd.read_csv is lazy, so the tiny time reported above mostly reflects building the task graph; the data is actually read only when a computation is triggered. Because most computers today have multiple cores, Dask can usually exploit them to make processing faster.

Conclusion

In conclusion, it is clear from my experiments that even on my outdated relic of a machine, a Dell Micro 3070 (9th-gen Intel), Dask is significantly faster than Pandas. While Pandas is an excellent tool for smaller datasets, its single-threaded nature becomes a bottleneck when working with large datasets, leading to slower operations and potential memory limitations.

The group-by operation in Dask was faster in my tests, and I expect it to scale even better with increasing data size. This demonstrates that Dask is a powerful alternative to Pandas when dealing with big data, offering parallelism and out-of-core computation without sacrificing the familiar syntax I am comfortable with in Pandas.

I therefore found Dask to be a valuable tool that complements Pandas, particularly for larger datasets where performance and memory usage are critical concerns. Going forward, I’ll consider using Dask in situations where scalability and speed are essential (for example with a framework like Kedro), while still using Pandas for smaller, simpler tasks.

All code is available on my GitHub Repo for you to use as you wish.