When doing data analysis with big data, scaling is often a concern as the files we are working with are large. Hence we want to select file formats that are appropriate - have low on-disk usage and having fast input-output (i.e. read-write). This workbook does a benchmark assessment of a few well known file types. It is quite similar to other benchmark studies, such as this ‘towards data science’ format study by Ilia Zaitsev in 2017.

File formats analyzed

Old school file formats: 1. Pain CSV 2. Excel (xlsx)

Apache Arrow formats: 3. Parquet 4. Feather

Python specific formats: 5. Pickle 6. Compressed pickle (using zip format)

Other data formats: 7. HDF5

Variables used in the analysis

time to save a file (output to stored memory);
time to load a file (input from stored memory);
space on disk

Dataset utilized in analyis

To analyze each file format, the relatively large dataset from Dominiks’ Fine foods scanner dataset was utilized as it contains a range of variables - numeric, string, and integer – as well as being large (7.3M rows).

Note

As a side comment, while this analysis was done on Python, the conclusions are mostly applicable to R or other langauges exclusing the use of Pickle and compressed pickle formats, which are python formats.

# %matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import os

Setup a simple timer to return time taken for a task

import datetime

class Timer:
    """
    Simple timer. When first initiated, it starts, has one method stop(), it prints the time taken
    
    `
    t = Timer()
    t.stop()
    `
    
    By default, the Timer() will just print the length of time taken, however if you specify
    The stop(return_time=True), it will instead return a datetime object of time taken 
    """
    def __init__(self):
        # when first initiated, start the clock
        self.t_start = datetime.datetime.now()
        
    def stop(self, return_time=False):
        # end timer
        self.t_end = datetime.datetime.now()
        # return or print the length of time taken
        if return_time == True:
            return self.t_end - self.t_start
        else:
            print("Task took {t}".format(t=self.t_end - self.t_start))

Load the dataset used in the demo and see how long it is

NOTE: at this step, any .csv demo dataset can be substituted

demo_dataset = "https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/movement_csv-files/wana_csv.zip"

df = pd.read_csv(demo_dataset)
len(df)

Considering that there are 7.3M rows of data, this invalidates xlsx as the simple (or full) output type as this can only handle 1 million rows. If we had still wanted to work with excel for such a large dataset, we would have to split it into several 1M row files. Hence to simulate excel, we will just focus on saving or reading one 1M row file. To equalize the analysis at the end of the day, we will simply multiply the time taken to load/save/store 1m rows by 7.3

excel_muliplier = len(df)/1000000

Setup Process to automate the tests

def execute_command(message, command):
    print("starting test:",message)
    total = None
    for each in range(0,4):
        t = Timer()
        exec(command)
        if total == None:
            total = t.stop(return_time=True)
        else:
            total = total + t.stop(return_time=True)
    print("took {l}".format(l=total/5))
    time_av = total/5
    # since excel can only store 1M records, multiply the excel estimate by the datasize ratio
    if message[:4] == "xlsx":
        time_av = time_av * excel_muliplier
    return message, time_av.total_seconds()

formats = {
    "pickle - write":"df.to_pickle('{}'.format(file_paths['pickle']))",
    "pickle - read":"df = pd.read_pickle('{}'.format(file_paths['pickle']))",
    "compressed pickle - write":"df.to_pickle('{}'.format(file_paths['compressed pickle']), compression='zip')",
    "compressed pickle - read":"df = pd.read_pickle('{}'.format(file_paths['compressed pickle']))",
    "csv - write":"df.to_csv('{}'.format(file_paths['csv']))",
    "csv - read":"df = pd.read_csv('{}'.format(file_paths['csv']))",
    "parquet - write":"df.to_parquet('{}'.format(file_paths['parquet']))",
    "parquet - read":"df = pd.read_parquet('{}'.format(file_paths['parquet']))",
    "feather - write":"df.to_feather('{}'.format(file_paths['feather']))",
    "feather - read":"df = pd.read_feather('{}'.format(file_paths['feather']))",
    "hdf5 - write":"df.to_hdf('{}'.format(file_paths['hdf5']), key='df')",
    "hdf5 - read":"df = pd.read_hdf('{}'.format(file_paths['hdf5']))",
    "xlsx - write":"df[:1000000].to_excel('{}'.format(file_paths['xlsx']))",
    "xlsx - read":"df = pd.read_excel('{}'.format(file_paths['xlsx']))"
}

file_paths = {
    "csv":"wana.csv",
    "pickle":"wana.pkl",
    "compressed pickle":"wana.pkl.zip",
    "feather":"wana.feather",
    "parquet":"wana.parquet.gzip",
    "hdf5":"wana.h5",
    "xlsx":"wana.xlsx"
}

Run the process on the downloaded demo file

times = []
for _format in formats:
    times.append(execute_command(_format, formats[_format]))

df_processing_test = pd.DataFrame(times, columns=["task", "time taken"])

starting test: pickle - write
took 0:00:03.344310
starting test: pickle - read
took 0:00:01.997607
starting test: compressed pickle - write
took 0:00:07.953267
starting test: compressed pickle - read
took 0:00:12.442646
starting test: csv - write
took 0:00:40.201943
starting test: csv - read
took 0:00:06.481310
starting test: parquet - write
took 0:00:04.330219
starting test: parquet - read
took 0:00:03.246392
starting test: feather - write
took 0:00:02.894822
starting test: feather - read
took 0:00:01.323537
starting test: hdf5 - write

C:\Users\gouss\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py:2449: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index(['SALE', 'PRICE_HEX', 'PROFIT_HEX'], dtype='object')]

  encoding=encoding,

took 0:00:03.991776
starting test: hdf5 - read
took 0:00:05.148564
starting test: xlsx - write
took 0:02:35.857042
starting test: xlsx - read
took 0:01:36.240834

file_size_test = []

for file in file_paths:
    fsize = os.path.getsize(file_paths[file])
    # since excel can only store 1M records, multiply the excel estimate by the datasize ratio
    if file[:4] == "xlsx":
        fsize = fsize * excel_muliplier
    file_size_test.append((file, fsize/1024/1024))

df_file_size_test = pd.DataFrame(file_size_test, columns=['type','file size (MB)'])

Visualize and print results for analysis

Now that all the results are availible, lets visualize them a bit prior to qualitative analysis

plt.figure(figsize=(15,7))
ax = sns.barplot(x='type', y='file size (MB)', data=df_file_size_test)
_ = ax.set_xlabel('File Format')
_ = ax.set_ylabel('Megabytes')
_ = ax.set_title('File size of the data that is saved')
ax.grid(True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize='x-large')

[Text(0, 0, 'csv'),
 Text(0, 0, 'pickle'),
 Text(0, 0, 'compressed pickle'),
 Text(0, 0, 'feather'),
 Text(0, 0, 'parquet'),
 Text(0, 0, 'hdf5'),
 Text(0, 0, 'xlsx')]

plt.figure(figsize=(20,10))
ax = sns.barplot(x='task', y='time taken', data=df_processing_test)
_ = ax.set_xlabel('File format')
_ = ax.set_ylabel('Time Taken (seconds)')
_ = ax.set_title('File formats and the time taken for IO')
ax.grid(True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize='x-large', horizontalalignment='right')

[Text(0, 0, 'pickle - write'),
 Text(0, 0, 'pickle - read'),
 Text(0, 0, 'compressed pickle - write'),
 Text(0, 0, 'compressed pickle - read'),
 Text(0, 0, 'csv - write'),
 Text(0, 0, 'csv - read'),
 Text(0, 0, 'parquet - write'),
 Text(0, 0, 'parquet - read'),
 Text(0, 0, 'feather - write'),
 Text(0, 0, 'feather - read'),
 Text(0, 0, 'hdf5 - write'),
 Text(0, 0, 'hdf5 - read'),
 Text(0, 0, 'xlsx - write'),
 Text(0, 0, 'xlsx - read')]

Okay… so hdf5 and xlsx are clearly not favourable. If we drop them, what does the result look like

df_file_size_test_without_outliers = df_file_size_test.drop([5,6])
plt.figure(figsize=(15,7))
ax = sns.barplot(x='type', y='file size (MB)', data=df_file_size_test_without_outliers)
_ = ax.set_xlabel('File Format')
_ = ax.set_ylabel('Megabytes')
_ = ax.set_title('File size of the data that is saved\n(dropped hdf5 as outlier)')
ax.grid(True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize='x-large')

[Text(0, 0, 'csv'),
 Text(0, 0, 'pickle'),
 Text(0, 0, 'compressed pickle'),
 Text(0, 0, 'feather'),
 Text(0, 0, 'parquet')]

df_processing_test_without_outliers = df_processing_test.drop([10,11,12,13])
plt.figure(figsize=(20,10))
ax = sns.barplot(x='task', y='time taken', data=df_processing_test_without_outliers)
_ = ax.set_xlabel('File format')
_ = ax.set_ylabel('Time Taken (seconds)')
_ = ax.set_title('File formats and the time taken for IO\n(dropped excel as extreme outlier)')
ax.grid(True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize='x-large', horizontalalignment='right')

[Text(0, 0, 'pickle - write'),
 Text(0, 0, 'pickle - read'),
 Text(0, 0, 'compressed pickle - write'),
 Text(0, 0, 'compressed pickle - read'),
 Text(0, 0, 'csv - write'),
 Text(0, 0, 'csv - read'),
 Text(0, 0, 'parquet - write'),
 Text(0, 0, 'parquet - read'),
 Text(0, 0, 'feather - write'),
 Text(0, 0, 'feather - read')]

Detailed results

We can print out the dataframes of interst to see the detailed data for all the files, including the outliers

df_file_size_test

	type	file size (MB)
0	csv	547.119179
1	pickle	579.979185
2	compressed pickle	14.979172
3	feather	104.476992
4	parquet	20.218006
5	hdf5	2088.185272
6	xlsx	288.665235

df_processing_test

	task	time taken
0	pickle - write	3.344310
1	pickle - read	1.997607
2	compressed pickle - write	7.953267
3	compressed pickle - read	12.442646
4	csv - write	40.201943
5	csv - read	6.481310
6	parquet - write	4.330219
7	parquet - read	3.246392
8	feather - write	2.894822
9	feather - read	1.323537
10	hdf5 - write	3.991776
11	hdf5 - read	5.148564
12	xlsx - write	1143.868652
13	xlsx - read	706.332365

Conclusion

A few observations from the above test: * The worst thing that can be used is xlsx, csv and hdf5! CSV and Excel formats takes an insanely long time to read and write. HSF5 is quick to load, but is a whooping 2GB on disk! * A compressed pickle seems the ideal for space constrained IO – is the second slowest to load and save (although not by a large margin), but only takes up 15mb; * Feather is the ideal one for a tradeoff of both on disk space and IO. It is the fastest when it comes to IO (2.44x faster to read than second best which is parquet, and 1.5x faster to write than parquet), but takes up 100mb on disk, which isn’t small. * If space/size is a bit more of a consideration but speed is more important, then the best is parquet. However the ballance between IO speed and size makes it best compared to a compressed pickle. Parquet is a bit larger in size (20mb instead of 14mb) but it is much faster.

We can summarize some main tradeoffs that you could make:

type	parquet compared to compressed pickle	Feather compared to parquet	Parquet compared to excel	Parquet compared to csv
write/save	1.8 faster	1.5x faster	264x faster	9.3x faster
read/load	2.8x faster	2.46x faster	217x faster	2x faster
file size	1.3x larger	5.2x larger	14.3x smaller	27x smaller

As noted above, pickle is a python format – meaning that the tradeoff partly depends on the language you use (R for ex?). However with R Markdown and other solutions, this may not be a real limitation.