Explore Italian web scraped grocery dataset

web scraped data
Publication date

14 September 2025

Daniele Sasso and a few others made their dataset available on Zenodo - https://doi.org/10.5281/zenodo.14927602 - daily web-scraped data from different shops of an Italian supermarket chain. This blog summarizes the dataset and explores its various facets. A detailed overview of the data is available in the Price Stats Catalogue record for this dataset, and some of the explorations below are summarized there.

Tip: Reproduce this blog

This blog is a Jupyter article under the hood - have a look at the source. Save a copy of the data to /data/bronze/ and re-render.

Dataset structure

General overview of the dataset

First off, let's look at the data itself, its columns, and some statistics about the web scraping.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4033211 entries, 0 to 4033210
Data columns (total 8 columns):
 #   Column      Dtype  
---  ------      -----  
 0   date        object 
 1   price       float64
 2   product_id  int64  
 3   store_id    int64  
 4   region      object 
 5   product     object 
 6   COICOP5     object 
 7   COICOP4     object 
dtypes: float64(1), int64(2), object(5)
memory usage: 246.2+ MB
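Note that the `date` column is stored as plain `object` dtype. When reproducing this, it can help to parse dates at load time instead. A minimal sketch, assuming the data is a CSV with the eight columns above (a tiny inline sample stands in for the file in /data/bronze/; the sample values are illustrative, not from the dataset):

```python
import io

import pandas as pd

# Hypothetical two-row sample with the dataset's eight columns
sample = io.StringIO(
    "date,price,product_id,store_id,region,product,COICOP5,COICOP4\n"
    "2021-01-01,1.99,101,1,Lombardia,Apples,01.1.6.1,Fruit\n"
    "2021-01-02,2.49,102,1,Lombardia,Carrots,01.1.7.1,Vegetable\n"
)

# Parse dates up front and pin the integer id columns
df = pd.read_csv(
    sample,
    parse_dates=["date"],
    dtype={"product_id": "int64", "store_id": "int64"},
)
print(df.dtypes)
```

With `parse_dates`, the `date` column arrives as `datetime64[ns]`, so the later `pd.to_datetime` conversion step becomes unnecessary.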

Detailed info about the dataset

As this is web-scraped data covering several years, it is all saved in one analytical table.

Let’s look at it in a bit more detail:

Show the code
from datetime import datetime

import pandas as pd

stats = {}
stats['Number of unique products'] = df['product'].nunique()
stats['Number of unique stores'] = df['store_id'].nunique()
stats['Number of unique regions'] = df['region'].nunique()
stats['Number of COICOP5 categories'] = df['COICOP5'].nunique()
stats['Number of unique scrapes'] = df['date'].nunique()
stats['Average number of unique products per store per date'] = round(
    df.groupby(["date", "store_id"])["product_id"].nunique().mean(), 1
)
d_end = datetime.fromisoformat(df['date'].max())
d_start = datetime.fromisoformat(df['date'].min())
stats['Number of days in sample'] = (d_end - d_start).days + 1
pd.DataFrame.from_dict(stats, orient='index', columns=['statistic'])

It seems that there are 863 days but only 841 scrapes - that means there were no scrapes on 22 days:

Show the code
# Compare the current scrape list
scrape_dates = pd.DatetimeIndex(df['date'].unique())

# Against an uninterrupted list of dates
start_date = scrape_dates.min()
end_date = scrape_dates.max()

# Create a complete, continuous date range
full_date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Find the dates that are in the full range but not in the scrape list
missing_dates = full_date_range.difference(scrape_dates)
missing_dates
DatetimeIndex(['2021-02-13', '2021-02-14', '2021-03-23', '2021-03-28',
               '2021-06-24', '2021-08-22', '2021-08-23', '2021-08-24',
               '2021-08-25', '2021-08-26', '2021-08-27', '2021-09-30',
               '2021-12-02', '2022-01-23', '2022-03-16', '2022-06-21',
               '2022-10-01', '2022-10-07', '2022-10-10', '2022-10-22',
               '2022-10-23', '2022-11-04'],
              dtype='datetime64[ns]', freq=None)
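These 22 missing days are not evenly scattered - note the unbroken late-August 2021 stretch. One way to see the outages at a glance is to group consecutive missing dates into runs; a sketch, using a subset of the dates above:

```python
import pandas as pd

# Subset of the missing dates found above
missing = pd.Series(pd.to_datetime([
    "2021-02-13", "2021-02-14",
    "2021-08-22", "2021-08-23", "2021-08-24",
    "2021-08-25", "2021-08-26", "2021-08-27",
]))

# A new run starts wherever the gap to the previous missing date exceeds one day
run_id = (missing.diff() > pd.Timedelta(days=1)).cumsum()

# One row per outage: first day, last day, and length in days
runs = missing.groupby(run_id).agg(start="min", end="max", days="count")
print(runs)
```

For this subset, the result is two runs: a two-day gap in February and a six-day outage in August.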

If we pivot the raw data and show the number of prices captured per store per region - it looks like this:

Show the code
# Count prices (not every column) per store per date
df.pivot_table(index='date', columns=['region', 'store_id'], values='price', aggfunc='count')

Let's also look at the number of stores per region (i.e. the above, but visually):

Show the code
import matplotlib.pyplot as plt

df_number_of_stores_per_region = df.groupby(["region"])["store_id"].nunique().to_frame()
df_number_of_stores_per_region.plot(kind='bar', color='green', figsize=(10, 4))
plt.title('Number of stores per region')
plt.xlabel('region')
plt.ylabel('Number of stores')
plt.show()

Price and Product analysis

Geographic distribution of unique products by region

There is some example code on the Zenodo page for the dataset that nicely shows some of the price/product information captured:

Show the code
# Convert date column
df['date'] = pd.to_datetime(df['date'])  # Format: YYYY-MM-DD

# Define category colors
category_colors = {"Fruit": "blue", "Vegetable": "green", "Meat": "red"}

geo = df.groupby(["region", "COICOP4"])["product_id"].nunique().reset_index()
pivot_geo = geo.pivot(index="region", columns="COICOP4", values="product_id").fillna(0)
pivot_geo["Total"] = pivot_geo.sum(axis=1)
pivot_geo = pivot_geo.sort_values("Total", ascending=False).drop(columns="Total")
pivot_geo = pivot_geo[["Fruit", "Meat", "Vegetable"]]

pivot_geo.plot(kind="bar", stacked=True, figsize=(10,6), color=["blue", "red", "green"])
plt.ylabel("Number of Unique Products")
plt.title("Geographic Distribution by Region and Category (Sorted)")
plt.xticks(rotation=45, ha="right")
plt.legend(title="Category")
plt.tight_layout()
plt.show()

Basic analysis: average price trend over time (by COICOP4)

We can also look at average prices by COICOP4 category over time:

Show the code
import seaborn as sns

price_trend = df.groupby(["date", "COICOP4"])["price"].mean().reset_index()

plt.figure(figsize=(10,5))
sns.lineplot(data=price_trend, x="date", y="price", hue="COICOP4", palette=category_colors)
plt.title("Average Price Over Time by COICOP4 Category")
plt.xlabel("Date")
plt.ylabel("Average Price (€)")
plt.legend(title="Category")
plt.tight_layout()
plt.show()
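Daily category averages like these tend to be noisy; a time-based rolling mean within each category smooths the trend before plotting. A sketch, using a hypothetical mini version of `price_trend` (the real one, computed above, has the same three columns):

```python
import pandas as pd

# Hypothetical stand-in for price_trend (date, COICOP4, price)
price_trend = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "COICOP4": ["Fruit"] * 3 + ["Meat"] * 3,
    "price": [1.0, 2.0, 3.0, 4.0, 4.0, 4.0],
})

# 3-day time-based rolling mean, computed within each category
smoothed = (
    price_trend
    .set_index("date")
    .groupby("COICOP4")["price"]
    .rolling("3D")
    .mean()
    .reset_index()
)
print(smoothed)
```

On the real data a "7D" window is a natural choice, since it also averages out any weekly scraping pattern; the `smoothed` frame can be passed to `sns.lineplot` exactly like `price_trend` above.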
