Time Series Forecast of Walmart Sales Data

May Shen, j.shen33@gmail.com

April 14th 2019

Time series forecasting is an important technique that is widely used in business settings such as stock and sales. In this project, I’ll use a powerful tool, the autoregressive integrated moving average (ARIMA) model, to forecast the Walmart sales data.

Data Import and Preprocessing
Data Visualization
Forecast Model
Conclusions

Data Import and Preprocessing

The data contains historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and the task is to predict the department-wide sales for each store. The data can be downloaded from Kaggle.

import pandas as pd

train = pd.read_csv('train.csv')

train.head()

	Store	Dept	Date	Weekly_Sales	IsHoliday
0	1	1	2010-02-05	24924.50	False
1	1	1	2010-02-12	46039.49	True
2	1	1	2010-02-19	41595.55	False
3	1	1	2010-02-26	19403.54	False
4	1	1	2010-03-05	21827.90	False

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 5 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Date            421570 non-null object
Weekly_Sales    421570 non-null float64
IsHoliday       421570 non-null bool
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 13.3+ MB

train.describe()

	Store	Dept	Weekly_Sales
count	421570.000000	421570.000000	421570.000000
mean	22.200546	44.260317	15981.258123
std	12.785297	30.492054	22711.183519
min	1.000000	1.000000	-4988.940000
25%	11.000000	18.000000	2079.650000
50%	22.000000	37.000000	7612.030000
75%	33.000000	74.000000	20205.852500
max	45.000000	99.000000	693099.360000

Here is some basic information about the data.

train.csv

This is the historical training data, which covers to 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

Store - the store number
Dept - the department number
Date - the week
Weekly_Sales - sales for the given department in the given store
IsHoliday - whether the week is a special holiday week

Now, let’s focus on the time aspect of the data. Let’s narrow it down to only 1 store and 1 department for illustrative purpose.

What’s the date range of the data?

dat = train.loc[(train.Store == 1) & (train.Dept == 1),['Date','Weekly_Sales']]
print('Earliest date: %s; Latest date: %s' % (dat['Date'].min(), dat['Date'].max()))

Earliest date: 2010-02-05; Latest date: 2012-10-26

Let’s take a look at the data that we’ll focus on.

dat.head()

	Date	Weekly_Sales
0	2010-02-05	24924.50
1	2010-02-12	46039.49
2	2010-02-19	41595.55
3	2010-02-26	19403.54
4	2010-03-05	21827.90

dat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 2 columns):
Date            143 non-null object
Weekly_Sales    143 non-null float64
dtypes: float64(1), object(1)
memory usage: 3.4+ KB

dat.describe()

	Weekly_Sales
count	143.000000
mean	22513.322937
std	9854.349032
min	14537.370000
25%	16494.630000
50%	18535.480000
75%	23214.215000
max	57592.120000

dat.Date.describe()

count            143
unique           143
top       2011-12-23
freq               1
Name: Date, dtype: object

dat.set_index('Date', inplace=True)

dat.groupby('Date')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f4fd3787e50>

dat.index

Index([u'2010-02-05', u'2010-02-12', u'2010-02-19', u'2010-02-26',
       u'2010-03-05', u'2010-03-12', u'2010-03-19', u'2010-03-26',
       u'2010-04-02', u'2010-04-09',
       ...
       u'2012-08-24', u'2012-08-31', u'2012-09-07', u'2012-09-14',
       u'2012-09-21', u'2012-09-28', u'2012-10-05', u'2012-10-12',
       u'2012-10-19', u'2012-10-26'],
      dtype='object', name=u'Date', length=143)

Data Visualization

Now that we have some basic understanding of the data, let’s plot the sales over time to see the trend.

%matplotlib inline
# plot data
dat.plot(figsize=(15, 6),title="Walmart Sales in One Department of One Store Feb 2010 to Oct 2012")

<matplotlib.axes._subplots.AxesSubplot at 0x7f4fd37bb9d0>

png

We can see that there are some clear patterns in the data. Through simple eye-balling, there are four spikes in sales around November, December, February, and April.

dat.index = pd.to_datetime(dat.index)

from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(dat, model='multiplicative')
fig = result.plot()

png

From the plot above we can clearly see the seasonal component of the data, and we can also see the separated bumpy trend of the data.

Forecast Model

Now let’s build an ARIMA model to predict sales over time. ARIMA (autoregressive integrated moving average) models, also called Box-Jenkins models, are a class of powerful and popular models to forecast time series data. The ARIMA models may possibly include autoregressive terms, moving average terms, and differencing operations.

AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations. AR(1): a linear model to predict the value at the present time using the value at the previous time.

I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from an observation at the previous time step) in order to make the time series stationary.

MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

ARIMA models are commonly denoted as ARIMA(p,d,q) where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used. A value of 0 can be used for a parameter, which indicates to not use that element of the model. This way, the ARIMA model can be configured to perform the function of an ARMA model, and even a simple AR, I, or MA model.

from pyramid.arima import auto_arima
stepwise_model = auto_arima(dat, start_p=1, d=1, start_q=1,
                           max_p=3, max_q=3, m=12,
                           start_P=0, seasonal=True,
                           D=1, trace=True,
                           error_action='ignore',  
                           suppress_warnings=True)
print(stepwise_model.aic())

Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=2760.510, BIC=2774.848, Fit time=1.505 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=2842.204, BIC=2847.939, Fit time=0.048 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=2806.564, BIC=2818.034, Fit time=0.202 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=2759.891, BIC=2771.361, Fit time=1.477 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=2789.475, BIC=2803.813, Fit time=0.486 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=2842.065, BIC=2850.668, Fit time=0.145 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=2784.482, BIC=2801.687, Fit time=12.198 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=2758.284, BIC=2766.887, Fit time=0.798 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(1, 1, 1, 12); AIC=2760.066, BIC=2771.537, Fit time=1.015 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(1, 1, 2, 12); AIC=2790.007, BIC=2804.345, Fit time=4.098 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=2760.133, BIC=2771.603, Fit time=1.594 seconds
Total fit time: 23.614 seconds
2758.2843325496183

Now we can pick the model with the best performance and validate the model performance by splitting the data into a train and a test set.

train = dat.loc[:'2012-05-01']
test = dat.loc['2012-05-01':]
stepwise_model.fit(train)

ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(0, 1, 0),
   out_of_sample_size=0, scoring='mse', scoring_args={},
   seasonal_order=(0, 1, 1, 12), solver='lbfgs', start_params=None,
   suppress_warnings=True, transparams=True, trend='c')

future_forecast = stepwise_model.predict(n_periods=len(test))

future_forecast = pd.DataFrame(future_forecast,index = test.index,columns=['Prediction'])
pd.concat([test,future_forecast],axis=1).plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f4ffc078490>

png

pd.concat([dat,future_forecast],axis=1).plot(figsize=(15, 6),
                                             title="Walmart Sales in One Department of One Store Feb 2010 to Oct 2012")

<matplotlib.axes._subplots.AxesSubplot at 0x7f4fd2f3c0d0>

png

Conclusion

Voilà! Now we have a forecast model that can predict the sales of the coming weeks. With real-world data, the predictions will almost never be 100% in sync with the data, but we can see that the predictions stay in a reasonable range for off-holiday season sales. We can use the forecast information to optimize staffing and other resources.

Share on

Twitter Facebook Google+ LinkedIn

Jianhong (May) Shen 沈建虹