Jacek Bialek recently uploaded this scanner dataset of polish data to zenodo. This blog is my personal exploration of this dataset.
Basic dataset overview
First off, let’s look at the raw dataset
Show the code
df = pd.read_csv("https://zenodo.org/records/18342253/files/dataRSM.csv?download=1", sep=";", index_col=False)
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8090 entries, 0 to 8089
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 8090 non-null object
1 prices 8090 non-null object
2 quantities 8090 non-null int64
3 retID 8090 non-null object
4 description 8090 non-null object
5 retailer_code 8090 non-null int64
6 EAN_code 8090 non-null object
7 category 8090 non-null object
8 subcategory 8090 non-null object
dtypes: int64(2), object(7)
memory usage: 569.0+ KB
Rendering it in its raw format isn’t very useful as prices and time (for example), are objects. Hence some light processing would be in order.
Show the code
my_dtypes = {
'prices': 'float64',
'description': 'str'
}
df = pd.read_csv(
"https://zenodo.org/records/18342253/files/dataRSM.csv?download=1",
sep=";",
decimal=",",
dtype=my_dtypes,
parse_dates=['time'],
index_col=False
)
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8090 entries, 0 to 8089
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 8090 non-null datetime64[ns]
1 prices 8090 non-null float64
2 quantities 8090 non-null int64
3 retID 8090 non-null object
4 description 8090 non-null object
5 retailer_code 8090 non-null int64
6 EAN_code 8090 non-null float64
7 category 8090 non-null object
8 subcategory 8090 non-null object
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 569.0+ KB
Exploring the dataset
Using the itables library, have a way to play with the dataset
Loading ITables v2.6.2 from the init_notebook_mode cell...
(need help?) |
Trends
Numbers of time
First off, let’s see how many unique rows (i.e. number of transaction observations) were observed over time
Show the code
fig = px.histogram(df,
x="time",
color="category",
title="Number of observations over time (by category)",
barmode='stack')
fig.update_layout(bargap=0.3)
fig.show()We can also look at the number of unique products over time (focusing on EANs, common to European stores)
Show the code
df_unique = (df.groupby([pd.Grouper(key='time', freq='ME'), 'category'])
['EAN_code']
.nunique()
.reset_index(name='unique EANs'))
fig = px.bar(df_unique,
x="time",
y="unique EANs",
color="category",
title="Unique Products (i.e. EANs) across timee",
barmode='stack')
fig.update_layout(bargap=0.4)
fig.show()Price trends
We can also look at price trends by category
Show the code
df_avg = (df.groupby([pd.Grouper(key='time', freq='ME'), 'category'])['prices']
.mean()
.reset_index(name='price'))
fig = px.line(df_avg,
x="time",
y="price",
facet_col="category",
facet_col_wrap=2,
title="Average Price by Category",
markers=True)
fig.show()As well as by subcategory
Show the code
df_avg = (df.groupby([pd.Grouper(key='time', freq='ME'), 'subcategory'])['prices']
.mean()
.reset_index(name='price'))
fig = px.line(df_avg,
x="time",
y="price",
facet_col="subcategory",
facet_col_wrap=4,
title="Average Price by Subcategory",
markers=True)
fig.show()