Polish scanner dataset

Jacek Bialek recently uploaded this scanner dataset of polish data to zenodo. This blog is my personal exploration of this dataset.

init_notebook_modetrusted

Basic dataset overview

First off, let’s look at the raw dataset

Show the code

df = pd.read_csv("https://zenodo.org/records/18342253/files/dataRSM.csv?download=1", sep=";", index_col=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8090 entries, 0 to 8089
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   time           8090 non-null   object
 1   prices         8090 non-null   object
 2   quantities     8090 non-null   int64 
 3   retID          8090 non-null   object
 4   description    8090 non-null   object
 5   retailer_code  8090 non-null   int64 
 6   EAN_code       8090 non-null   object
 7   category       8090 non-null   object
 8   subcategory    8090 non-null   object
dtypes: int64(2), object(7)
memory usage: 569.0+ KB

Rendering it in its raw format isn’t very useful as prices and time (for example), are objects. Hence some light processing would be in order.

Show the code

my_dtypes = {
    'prices': 'float64', 
    'description': 'str'
}

df = pd.read_csv(
    "https://zenodo.org/records/18342253/files/dataRSM.csv?download=1", 
    sep=";", 
    decimal=",",
    dtype=my_dtypes,
    parse_dates=['time'],
    index_col=False
)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8090 entries, 0 to 8089
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   time           8090 non-null   datetime64[ns]
 1   prices         8090 non-null   float64       
 2   quantities     8090 non-null   int64         
 3   retID          8090 non-null   object        
 4   description    8090 non-null   object        
 5   retailer_code  8090 non-null   int64         
 6   EAN_code       8090 non-null   float64       
 7   category       8090 non-null   object        
 8   subcategory    8090 non-null   object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 569.0+ KB

Exploring the dataset

Using the itables library, have a way to play with the dataset

Loading ITables v2.6.2 from the init_notebook_mode cell... (need help?)

Trends

Numbers of time

First off, let’s see how many unique rows (i.e. number of transaction observations) were observed over time

Show the code

fig = px.histogram(df, 
                   x="time", 
                   color="category", 
                   title="Number of observations over time (by category)",
                   barmode='stack')
fig.update_layout(bargap=0.3)
fig.show()

We can also look at the number of unique products over time (focusing on EANs, common to European stores)

Show the code

df_unique = (df.groupby([pd.Grouper(key='time', freq='ME'), 'category'])
             ['EAN_code']
             .nunique()
             .reset_index(name='unique EANs'))

fig = px.bar(df_unique, 
             x="time", 
             y="unique EANs", 
             color="category",
             title="Unique Products (i.e. EANs) across timee",
             barmode='stack')

fig.update_layout(bargap=0.4)

fig.show()

Price trends

We can also look at price trends by category

Show the code

df_avg = (df.groupby([pd.Grouper(key='time', freq='ME'), 'category'])['prices']
            .mean()
            .reset_index(name='price'))

fig = px.line(df_avg,
              x="time",
              y="price",
              facet_col="category",
              facet_col_wrap=2,
              title="Average Price by Category",
              markers=True)

fig.show()

As well as by subcategory

Show the code

df_avg = (df.groupby([pd.Grouper(key='time', freq='ME'), 'subcategory'])['prices']
            .mean()
            .reset_index(name='price'))

fig = px.line(df_avg,
              x="time",
              y="price",
              facet_col="subcategory",
              facet_col_wrap=4,
              title="Average Price by Subcategory",
              markers=True)

fig.show()