In the world of data science and machine learning, Python has established itself as the primary language for data analysis, with libraries like Pandas serving as the workhorse for data manipulation and analysis. However, as datasets grow larger, the time taken to perform tasks like exploratory data analysis becomes impractical, and the single-threaded architecture of Pandas starts to show its limitations. This is where Dask comes into the picture: it is designed for parallel processing, making better use of modern multi-core CPUs.
Dask is a powerful parallel computing library that extends familiar Python libraries like Pandas and NumPy to work with much larger datasets, scaling across multiple cores or even clusters of machines. The ability to run the same code on a laptop or on a huge server is a compelling value proposition.
Let's talk about Dask and compare its performance with Pandas. For evidence, I am using a 10-million-row dataset stored as a simple comma-separated file; its last few rows look like this:
9999990,0.8650280748158613,0.6763992426693638
9999991,0.796799926631457,0.7838066873984045
9999992,0.8636286586877612,0.5482658133594334
9999993,0.8404724061815126,0.5149522843098209
9999994,0.4386663275449709,0.0418925190391195
9999995,0.9056175079617191,0.1972167141108574
9999996,0.7288087233312789,0.03340284275017302
9999997,0.7016828033150314,0.49312869614744226
9999998,0.4561628140545252,0.7843156099465182
9999999,0.14553608403846574,0.6449094807828858
Performance Run
I set up a quick performance run to compare the two libraries. Remember that both dask and dask-expr need to be installed before running the following scripts.
Synthetic Dataset Generation
The script below generates a dataset of 10 million rows with random values in three columns, viz. id, value1, and value2, which we will use only for our load-performance testing.
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv

# Function to generate a 10 million row dataset
def generate_large_dataset(num_rows=10_000_000):
    load_dotenv()
    # Set a random seed for reproducibility
    np.random.seed(0)
    # Generate random data
    data = {
        'id': np.arange(num_rows),
        'value1': np.random.rand(num_rows),
        'value2': np.random.rand(num_rows)
    }
    # Create a DataFrame
    df = pd.DataFrame(data)
    # Save the DataFrame to a CSV file
    file_name = os.getenv('DATASETS_PATH') + '/10_million_rows.csv'
    print(f"file_name is {file_name}")
    df.to_csv(file_name, index=False)

# Generate the dataset
generate_large_dataset()
print("10 million rows dataset generated and saved as '10_million_rows.csv'.")
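As a quick sanity check before timing anything, you can run the same generation logic with a much smaller row count and confirm the file's shape. Here is a sketch that writes to a temporary directory instead of the DATASETS_PATH environment variable assumed above:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Same generation logic as the script above, scaled down to 1,000 rows
num_rows = 1_000
np.random.seed(0)
df = pd.DataFrame({
    'id': np.arange(num_rows),
    'value1': np.random.rand(num_rows),
    'value2': np.random.rand(num_rows),
})

# Write to a temporary directory rather than DATASETS_PATH
with tempfile.TemporaryDirectory() as tmp:
    file_name = os.path.join(tmp, 'sample_rows.csv')
    df.to_csv(file_name, index=False)
    check = pd.read_csv(file_name)
    print(check.shape)          # (1000, 3)
    print(list(check.columns))  # ['id', 'value1', 'value2']
```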
For group-by performance testing we make a subtle change to the script; this produces our second dataset. Note the n_unique_values and n_rows arguments passed to np.random.randint: this change is necessary so that the value1 column contains repeated values that can serve as keys for a group-by operation.
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv

# Function to generate a 10 million row dataset
def generate_large_dataset(num_rows=10_000_000):
    n_unique_values = 1000
    n_rows = num_rows
    load_dotenv()
    # Set a random seed for reproducibility
    np.random.seed(0)
    # Generate random data
    data = {
        'value1': np.random.randint(0, n_unique_values, n_rows),
        'value2': np.random.rand(num_rows),
        'value3': np.random.rand(num_rows)
    }
    # Create a DataFrame
    df = pd.DataFrame(data)
    # Save the DataFrame to a CSV file
    file_name = os.getenv('DATASETS_PATH') + '/10_million_rows_for_groupby.csv'
    print(f"file_name is {file_name}")
    df.to_csv(file_name, index=False)
    print("10 million rows dataset generated and saved as '10_million_rows_for_groupby.csv'.")

# Generate the dataset
generate_large_dataset()
Now you have the datasets and the code to start your own performance comparisons between Dask and Pandas! These are quick-and-dirty scripts I put together; feel free to modify the dataset generation code to suit your specific needs.
Experiments and Measurements
After generating these datasets, I used them to experiment with Dask and Pandas, timing operations such as loading the dataset and performing group-bys.
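All of these measurements boil down to wrapping an operation between two clock reads. A small reusable helper keeps that pattern tidy; this sketch uses time.perf_counter, which is better suited to interval timing than time.time:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block took, labelled for readability."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} seconds")

# Usage: any block of work can be timed the same way
with timed("Summing a million integers"):
    total = sum(range(1_000_000))
```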
- The Load Performance
import time
import dask.dataframe as dd
import pandas as pd
# Timing Pandas loading
start_time = time.time()
pd_large_df = pd.read_csv('/run/media/rajivg/WORK/Code/Datasets/10_million_rows.csv')
print(f"Pandas loading time: {time.time() - start_time} seconds")
# Timing Dask loading
start_time = time.time()
dd_large_df = dd.read_csv('/run/media/rajivg/WORK/Code/Datasets/10_million_rows.csv')
print(f"Dask loading time: {time.time() - start_time} seconds")
The results after running the script are as follows:
Pandas loading time: 3.487161874771118 seconds
Dask loading time: 0.007143259048461914 seconds
- Group-by performance
This time let’s make separate scripts. Here is the script I made for the group-by operation using Pandas:
import pandas as pd
import time
# Load the dataset with Pandas
start_time = time.time()
df = pd.read_csv('10_million_rows_for_groupby.csv')
result = df.groupby('value1').sum()  # Perform groupby operation
end_time = time.time()
print("Pandas GroupBy Processing Time: {:.2f} seconds".format(end_time - start_time))
Pandas GroupBy Processing Time: 3.46 seconds
Now let’s do the same thing with Dask.
import dask.dataframe as dd
import time
# Load the dataset with Dask
start_time = time.time()
ddf = dd.read_csv('/run/media/rajivg/WORK/Code/Datasets/10_million_rows_for_groupby.csv')
result = ddf.groupby('value1').sum().compute() # Perform groupby operation
end_time = time.time()
print("Dask GroupBy Processing Time: {:.2f} seconds".format(end_time - start_time))
Dask GroupBy Processing Time: 1.64 seconds
Analysis
Pandas:
We noticed that Pandas takes much more time than Dask for the load, because Pandas eagerly parses the entire file on a single thread. Dask's near-instant load figure deserves a caveat, though: dd.read_csv is lazy and only builds a task graph, deferring the actual read until .compute() is called, so the two load numbers are not measuring the same work.
We also noticed that Pandas takes significantly more time for the group-by operation, because it processes the entire dataset in memory using a single core.
Dask:
It’s clear from my experiments that Dask performs the same operation faster, as it divides the data into smaller partitions and processes them in parallel, utilizing multiple cores. Since virtually all computers today have multiple cores, Dask can usually put them all to work to speed up processing.
Conclusion
In conclusion, it is clear from my experiments that even on my outdated relic of a machine, a Dell Micro 3070 with a 9th-gen CPU, Dask is significantly faster than Pandas. While Pandas is an excellent tool for smaller datasets, its single-threaded nature becomes a bottleneck when working with large datasets, leading to slower operations and potential memory limitations.
The group-by operation in Dask was faster in my tests, and I am sure it will scale better with increasing data size. This demonstrates that Dask is a powerful alternative to Pandas when dealing with big data, offering parallelism and out-of-core computation without sacrificing the familiar syntax I am comfortable with in Pandas.
I, therefore, found Dask to be a valuable tool that complements Pandas, particularly for larger datasets where performance and memory usage are critical concerns. Going forward, I’ll consider using Dask in situations where scalability and speed are essential (for example using a framework like Kedro), while still using Pandas for smaller, simpler tasks.
All code is available on my GitHub Repo for you to use as you wish.