In 2019, Frances Krsinich and Donal Lynch of StatsNZ (working with Harpal Shergill of UNSD) published a synthetic scanner dataset that could be used for various types of research topics, including on multilateral and for quality adjustment methods. The dataset is hosted on the UN Global Platform GitLab and is provided for researchers.

This short blog is a short exploration of the dataset to better understand it.

Reproduce this blog

This blog is a jupyter article under the hood - have a look at the source. Save a copy of the data to /data/bronze/ and re-render.

Original dataset

The data is tabular and it shows products sold per period with a large number of characteristics already pre-cleaned

init_notebook_modetrusted

Loading ITables v2.5.2 from the init_notebook_mode cell... (need help?)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5509 entries, 0 to 5508
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   month_num   5509 non-null   object 
 1   char11      5509 non-null   object 
 2   char1       5509 non-null   float64
 3   char2       5509 non-null   int64  
 4   char3       5509 non-null   object 
 5   char4       5509 non-null   object 
 6   char5       5509 non-null   object 
 7   char6       5509 non-null   object 
 8   char7       5509 non-null   object 
 9   char8       5509 non-null   object 
 10  char9       5509 non-null   object 
 11  char10      5509 non-null   object 
 12  prodid_num  5509 non-null   int64  
 13  quantity    5509 non-null   int64  
 14  value       5509 non-null   int64  
dtypes: float64(1), int64(4), object(10)
memory usage: 645.7+ KB

The key question is how to interpret all these feature columns and what the overall information is in the data

Show the code

stats = {}
stats['Number of unique products (prodid_num column)'] = df['prodid_num'].nunique()
stats['Number of months in sample'] = df['month_num'].nunique()
stats['First month in sample'] = df['month_num'].min()
stats['Last month in sample'] = df['month_num'].max()
stats['Char11 unique count (brands)'] = df['char11'].nunique()
stats['Char1 unique count (possibly screen size)'] = df['char1'].nunique()
stats['Char10 unique count'] = df['char10'].nunique()

pd.DataFrame.from_dict(stats, orient='index', columns=['statistic'])

Loading ITables v2.5.2 from the init_notebook_mode cell... (need help?)

Issue with unique product identifiers in the dataset

Overview of the issue

Every period has products that seem to be duplicated - i.e. products are identical in all but sale information. We can see this by comparing the number of unique products (if we use the prodid_num column) and comparing it with just a count of the same products (without de-duplication).

Show the code

series_unique = df.groupby(['month_num'])['prodid_num'].nunique()
series_count = df.groupby(['month_num'])['prodid_num'].count()

df2 = pd.DataFrame({'Unique': series_unique, 'Count': series_count}).reset_index()

fig = px.line(df2, x='month_num', y=['Unique', 'Count'], 
              title='Number of unique products per period and number of total products per period',
              labels={'month_num': 'Time period', 'value': 'Count of the number of products'})

fig.update_layout(yaxis_title='Number of products', xaxis_title='Month')
fig.show()

Examples

Looking at a few examples - it is clear that all product features are identical except the quanity and the value counts

Show the code

df_first_period = df[df['month_num'] == df['month_num'].min()].copy()
df_first_period.groupby(['prodid_num']).filter(lambda x: x['char11'].count() > 1).head(2)

Loading ITables v2.5.2 from the init_notebook_mode cell... (need help?)

Lets look at another example

Show the code

df_first_period.groupby(['prodid_num']).filter(lambda x: x['char11'].count() > 1).iloc[2:4]

Loading ITables v2.5.2 from the init_notebook_mode cell... (need help?)

And another

Show the code

df_first_period.groupby(['prodid_num']).filter(lambda x: x['char11'].count() > 1).iloc[4:6]

Loading ITables v2.5.2 from the init_notebook_mode cell... (need help?)

As the dataset was created by taking a subset of variables of real data and modifying it, its possible there were variables that differentiated these products that were not included in the synthetic dataset (prodid_num was generated after the modification process).

To make use of the data then, it will be necessary to aggregate these duplicate products by summing the quantity and value columns when calculating unit prices. This is an assumption but it is a workable one.

Fixing the issue and creating a clean dataset

To simplify downstream analysis, and upload this simpified one to Zenodo - fix the issue and create a new version of the dataste (i.e. do a group by in a way that keeps the other key info).

# Define the aggregation dictionary (i.e. logic of the groupby) - the idea is to 
# ensure that unique product id is kept, with scanner info summed.
agg_dict = {
    'char11': 'first',
    'char1':'first',
    'char2':'first',
    'char3':'first',
    'char4':'first',
    'char5':'first',
    'char6':'first',
    'char7':'first',
    'char8':'first',
    'char9':'first',
    'char10':'first',
    'quantity':'sum',
    'value':'sum'
}

# fix the uniqueness issue using 
df_aggregated = df.groupby(['month_num','prodid_num']).agg(agg_dict)
df_aggregated.reset_index(inplace=True)
# Create a unit price column
df_aggregated['unit_price'] = df_aggregated['value']/df_aggregated['quantity']
df_aggregated.to_csv("../data/gold/NZ_dataset_historic_aggregated_secure.csv")

Analysis of the new cleaned dataset

As this is the dataset that we can upload to Zenodo, and document on the open data catalogue - the following helps explain the dataset a bit more.

Overall info

Loading ITables v2.5.2 from the init_notebook_mode cell... (need help?)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3843 entries, 0 to 3842
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   month_num   3843 non-null   object 
 1   prodid_num  3843 non-null   int64  
 2   char11      3843 non-null   object 
 3   char1       3843 non-null   float64
 4   char2       3843 non-null   int64  
 5   char3       3843 non-null   object 
 6   char4       3843 non-null   object 
 7   char5       3843 non-null   object 
 8   char6       3843 non-null   object 
 9   char7       3843 non-null   object 
 10  char8       3843 non-null   object 
 11  char9       3843 non-null   object 
 12  char10      3843 non-null   object 
 13  quantity    3843 non-null   int64  
 14  value       3843 non-null   int64  
 15  unit_price  3843 non-null   float64
dtypes: float64(2), int64(4), object(10)
memory usage: 480.5+ KB

Show the code

stats = {}
stats['Number of unique products (prodid_num column)'] = df_aggregated['prodid_num'].nunique()
stats['Number of months in sample'] = df_aggregated['month_num'].nunique()
stats['First month in sample'] = df_aggregated['month_num'].min()
stats['Last month in sample'] = df_aggregated['month_num'].max()
stats['Char11 unique count (brands)'] = df_aggregated['char11'].nunique()
stats['Char1 unique count (possibly screen size)'] = df_aggregated['char1'].nunique()
stats['Char10 unique count'] = df_aggregated['char10'].nunique()

pd.DataFrame.from_dict(stats, orient='index', columns=['statistic'])

Loading ITables v2.5.2 from the init_notebook_mode cell... (need help?)

Tracking number of unique products over time

Show the code

import plotly.graph_objects as go

series_count = df_aggregated.groupby(['month_num'])['prodid_num'].count().to_frame().reset_index()
series_count['Average'] = series_count['prodid_num'].mean()

fig = px.bar(series_count, x='month_num', y='prodid_num', title='Number of unique products per month')

# add moving-average line
fig.add_trace(
    go.Scatter(
        x=series_count['month_num'],
        y=series_count['Average'],
        mode='lines',
        name='Average',
        line=dict(color='red', width=1.5 , dash='dash')
    )
)

fig.update_layout(xaxis_title='Month', yaxis_title='Count')
fig.show()

Average prices across time

Show the code

price_trend = df_aggregated.groupby(["month_num"])["unit_price"].mean().reset_index()

fig2 = px.line(price_trend, x='month_num', y='unit_price', title='Average price per month', range_y=[0, None])
fig2.update_layout(xaxis_title='Month', yaxis_title='Price (in NZD)')
fig2.show()

Churn in products

Show the code

df_aggregated2 = df_aggregated.reset_index()
n = round(df_aggregated2.groupby(['prodid_num'])['month_num'].count().mean(),1)

"The average length of time products are in sample: {n}".format(n=n)

'The average length of time products are in sample: 8.4'

Product lifetimes (Gantt-like)

We can represent visually the longevity of products in sample using a gantt -like view where a line represents a product’s time in the sample based on when it entered and existed.