Polish scanner dataset

scanner data
Publication date

February 2, 2026

Jacek Bialek recently uploaded this Polish scanner dataset to Zenodo. This post is my personal exploration of it.

Basic dataset overview

First off, let’s look at the raw dataset:

import pandas as pd

df = pd.read_csv("https://zenodo.org/records/18342253/files/dataRSM.csv?download=1", sep=";", index_col=False)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8090 entries, 0 to 8089
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   time           8090 non-null   object
 1   prices         8090 non-null   object
 2   quantities     8090 non-null   int64 
 3   retID          8090 non-null   object
 4   description    8090 non-null   object
 5   retailer_code  8090 non-null   int64 
 6   EAN_code       8090 non-null   object
 7   category       8090 non-null   object
 8   subcategory    8090 non-null   object
dtypes: int64(2), object(7)
memory usage: 569.0+ KB

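Loaded as-is, the dataset isn’t very useful: prices and time, for example, come in as generic object columns rather than numbers and dates. Some light processing is in order: read the prices with a decimal comma and parse the time column as dates.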

my_dtypes = {
    'prices': 'float64', 
    'description': 'str'
}

df = pd.read_csv(
    "https://zenodo.org/records/18342253/files/dataRSM.csv?download=1", 
    sep=";", 
    decimal=",",
    dtype=my_dtypes,
    parse_dates=['time'],
    index_col=False
)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8090 entries, 0 to 8089
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   time           8090 non-null   datetime64[ns]
 1   prices         8090 non-null   float64       
 2   quantities     8090 non-null   int64         
 3   retID          8090 non-null   object        
 4   description    8090 non-null   object        
 5   retailer_code  8090 non-null   int64         
 6   EAN_code       8090 non-null   float64       
 7   category       8090 non-null   object        
 8   subcategory    8090 non-null   object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 569.0+ KB
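
To see what each read_csv option does in isolation, here is a minimal sketch on a few made-up rows laid out like the file (semicolon-separated, decimal comma). The sample values are invented for illustration and are not taken from the dataset. In the output above, EAN_code ends up as float64; if you would rather keep barcodes as text, one option (my assumption, not something the dataset requires) is to pass str for that column.

```python
import io

import pandas as pd

# Made-up rows in the same layout as dataRSM.csv (semicolon-separated,
# decimal comma); the values are illustrative, not taken from the dataset.
sample = io.StringIO(
    "time;prices;quantities;EAN_code\n"
    "2018-12-01;3,49;12;5900000000017\n"
    "2019-01-01;3,59;8;5900000000024\n"
)

demo = pd.read_csv(
    sample,
    sep=";",
    decimal=",",              # read "3,49" as the float 3.49
    parse_dates=["time"],     # convert the time column to datetimes
    dtype={"EAN_code": str},  # assumption: keep barcodes as text, not numbers
)

print(demo.dtypes)
```

Keeping an identifier as text avoids the rounding and scientific notation that a float column can introduce, at the cost of not being able to do arithmetic on it, which you rarely want for a barcode anyway.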

Exploring the dataset

The itables library renders the DataFrame as an interactive table, which gives us a way to play with the dataset.
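
For reference, a minimal sketch of how such a table can be produced in a Jupyter notebook. The helper name explore is hypothetical, and this assumes itables is installed; it is not necessarily the exact cell used for this post.

```python
import pandas as pd


def explore(frame: pd.DataFrame) -> None:
    """Render `frame` as an interactive table inside a Jupyter notebook.

    `explore` is a hypothetical helper name used only for this sketch.
    """
    # Imported lazily so the rest of the notebook still runs without itables.
    from itables import init_notebook_mode, show

    init_notebook_mode()  # load the table assets into the notebook
    show(frame)           # searchable, sortable, paginated view of the frame


# In a notebook cell, with `df` loaded as above:
# explore(df)
```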
