In 2019, Frances Krsinich and Donal Lynch of StatsNZ (working with Harpal Shergill of UNSD) published a synthetic scanner dataset that could be used for various types of research topics, including on multilateral and for quality adjustment methods. The dataset is hosted on the UN Global Platform GitLab and is provided for researchers.
This short blog is a short exploration of the dataset to better understand it.
This blog is a jupyter article under the hood - have a look at the source. Save a copy of the data to /data/bronze/ and re-render.
Original dataset
The data is tabular and it shows products sold per period with a large number of characteristics already pre-cleaned
Loading ITables v2.5.2 from the init_notebook_mode cell...
(need help?) |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5509 entries, 0 to 5508
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 month_num 5509 non-null object
1 char11 5509 non-null object
2 char1 5509 non-null float64
3 char2 5509 non-null int64
4 char3 5509 non-null object
5 char4 5509 non-null object
6 char5 5509 non-null object
7 char6 5509 non-null object
8 char7 5509 non-null object
9 char8 5509 non-null object
10 char9 5509 non-null object
11 char10 5509 non-null object
12 prodid_num 5509 non-null int64
13 quantity 5509 non-null int64
14 value 5509 non-null int64
dtypes: float64(1), int64(4), object(10)
memory usage: 645.7+ KB
The key question is how to interpret all these feature columns and what the overall information is in the data
Show the code
stats = {}
stats['Number of unique products (prodid_num column)'] = df['prodid_num'].nunique()
stats['Number of months in sample'] = df['month_num'].nunique()
stats['First month in sample'] = df['month_num'].min()
stats['Last month in sample'] = df['month_num'].max()
stats['Char11 unique count (brands)'] = df['char11'].nunique()
stats['Char1 unique count (possibly screen size)'] = df['char1'].nunique()
stats['Char10 unique count'] = df['char10'].nunique()
pd.DataFrame.from_dict(stats, orient='index', columns=['statistic'])
Loading ITables v2.5.2 from the init_notebook_mode cell...
(need help?) |
Issue with unique product identifiers in the dataset
Overview of the issue
Every period has products that seem to be duplicated - i.e. products are identical in all but sale information. We can see this by comparing the number of unique products (if we use the prodid_num column) and comparing it with just a count of the same products (without de-duplication).
Show the code
series_unique = df.groupby(['month_num'])['prodid_num'].nunique()
series_count = df.groupby(['month_num'])['prodid_num'].count()
df2 = pd.DataFrame({'Unique': series_unique, 'Count': series_count}).reset_index()
fig = px.line(df2, x='month_num', y=['Unique', 'Count'],
title='Number of unique products per period and number of total products per period',
labels={'month_num': 'Time period', 'value': 'Count of the number of products'})
fig.update_layout(yaxis_title='Number of products', xaxis_title='Month')
fig.show()Examples
Looking at a few examples - it is clear that all product features are identical except the quanity and the value counts
Show the code
df_first_period = df[df['month_num'] == df['month_num'].min()].copy()
df_first_period.groupby(['prodid_num']).filter(lambda x: x['char11'].count() > 1).head(2)
Loading ITables v2.5.2 from the init_notebook_mode cell...
(need help?) |
Lets look at another example
Show the code
df_first_period.groupby(['prodid_num']).filter(lambda x: x['char11'].count() > 1).iloc[2:4]
Loading ITables v2.5.2 from the init_notebook_mode cell...
(need help?) |
And another
Show the code
df_first_period.groupby(['prodid_num']).filter(lambda x: x['char11'].count() > 1).iloc[4:6]
Loading ITables v2.5.2 from the init_notebook_mode cell...
(need help?) |
As the dataset was created by taking a subset of variables of real data and modifying it, its possible there were variables that differentiated these products that were not included in the synthetic dataset (prodid_num was generated after the modification process).
To make use of the data then, it will be necessary to aggregate these duplicate products by summing the quantity and value columns when calculating unit prices. This is an assumption but it is a workable one.
Fixing the issue and creating a clean dataset
To simplify downstream analysis, and upload this simpified one to Zenodo - fix the issue and create a new version of the dataste (i.e. do a group by in a way that keeps the other key info).
# Define the aggregation dictionary (i.e. logic of the groupby) - the idea is to
# ensure that unique product id is kept, with scanner info summed.
agg_dict = {
'char11': 'first',
'char1':'first',
'char2':'first',
'char3':'first',
'char4':'first',
'char5':'first',
'char6':'first',
'char7':'first',
'char8':'first',
'char9':'first',
'char10':'first',
'quantity':'sum',
'value':'sum'
}
# fix the uniqueness issue using
df_aggregated = df.groupby(['month_num','prodid_num']).agg(agg_dict)
df_aggregated.reset_index(inplace=True)
# Create a unit price column
df_aggregated['unit_price'] = df_aggregated['value']/df_aggregated['quantity']
df_aggregated.to_csv("../data/gold/NZ_dataset_historic_aggregated_secure.csv")Analysis of the new cleaned dataset
As this is the dataset that we can upload to Zenodo, and document on the open data catalogue - the following helps explain the dataset a bit more.
Overall info
Loading ITables v2.5.2 from the init_notebook_mode cell...
(need help?) |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3843 entries, 0 to 3842
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 month_num 3843 non-null object
1 prodid_num 3843 non-null int64
2 char11 3843 non-null object
3 char1 3843 non-null float64
4 char2 3843 non-null int64
5 char3 3843 non-null object
6 char4 3843 non-null object
7 char5 3843 non-null object
8 char6 3843 non-null object
9 char7 3843 non-null object
10 char8 3843 non-null object
11 char9 3843 non-null object
12 char10 3843 non-null object
13 quantity 3843 non-null int64
14 value 3843 non-null int64
15 unit_price 3843 non-null float64
dtypes: float64(2), int64(4), object(10)
memory usage: 480.5+ KB
Show the code
stats = {}
stats['Number of unique products (prodid_num column)'] = df_aggregated['prodid_num'].nunique()
stats['Number of months in sample'] = df_aggregated['month_num'].nunique()
stats['First month in sample'] = df_aggregated['month_num'].min()
stats['Last month in sample'] = df_aggregated['month_num'].max()
stats['Char11 unique count (brands)'] = df_aggregated['char11'].nunique()
stats['Char1 unique count (possibly screen size)'] = df_aggregated['char1'].nunique()
stats['Char10 unique count'] = df_aggregated['char10'].nunique()
pd.DataFrame.from_dict(stats, orient='index', columns=['statistic'])
Loading ITables v2.5.2 from the init_notebook_mode cell...
(need help?) |
Tracking number of unique products over time
Show the code
import plotly.graph_objects as go
series_count = df_aggregated.groupby(['month_num'])['prodid_num'].count().to_frame().reset_index()
series_count['Average'] = series_count['prodid_num'].mean()
fig = px.bar(series_count, x='month_num', y='prodid_num', title='Number of unique products per month')
# add moving-average line
fig.add_trace(
go.Scatter(
x=series_count['month_num'],
y=series_count['Average'],
mode='lines',
name='Average',
line=dict(color='red', width=1.5 , dash='dash')
)
)
fig.update_layout(xaxis_title='Month', yaxis_title='Count')
fig.show()Average prices across time
Show the code
price_trend = df_aggregated.groupby(["month_num"])["unit_price"].mean().reset_index()
fig2 = px.line(price_trend, x='month_num', y='unit_price', title='Average price per month', range_y=[0, None])
fig2.update_layout(xaxis_title='Month', yaxis_title='Price (in NZD)')
fig2.show()Churn in products
Show the code
df_aggregated2 = df_aggregated.reset_index()
n = round(df_aggregated2.groupby(['prodid_num'])['month_num'].count().mean(),1)
"The average length of time products are in sample: {n}".format(n=n)'The average length of time products are in sample: 8.4'
Product lifetimes (Gantt-like)
We can represent visually the longevity of products in sample using a gantt -like view where a line represents a product’s time in the sample based on when it entered and existed.