Time Series Forecast of Walmart Sales Data

May Shen, j.shen33@gmail.com

April 14th 2019

Time series forecasting is an important technique that is widely used in business settings such as stock and sales. In this project, I’ll use a powerful tool, the autoregressive integrated moving average (ARIMA) model, to forecast the Walmart sales data.

Table of contents

  1. Data Import and Preprocessing

  2. Data Visualization

  3. Forecast Model

  4. Conclusions

Data Import and Preprocessing

The data contains historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and the task is to predict the department-wide sales for each store. The data can be downloaded from Kaggle.

import pandas as pd
train = pd.read_csv('train.csv')
train.head()
StoreDeptDateWeekly_SalesIsHoliday
0112010-02-0524924.50False
1112010-02-1246039.49True
2112010-02-1941595.55False
3112010-02-2619403.54False
4112010-03-0521827.90False
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 5 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Date            421570 non-null object
Weekly_Sales    421570 non-null float64
IsHoliday       421570 non-null bool
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 13.3+ MB
train.describe()
StoreDeptWeekly_Sales
count421570.000000421570.000000421570.000000
mean22.20054644.26031715981.258123
std12.78529730.49205422711.183519
min1.0000001.000000-4988.940000
25%11.00000018.0000002079.650000
50%22.00000037.0000007612.030000
75%33.00000074.00000020205.852500
max45.00000099.000000693099.360000

Here is some basic information about the data.

train.csv

This is the historical training data, which covers to 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

  1. Store - the store number
  2. Dept - the department number
  3. Date - the week
  4. Weekly_Sales - sales for the given department in the given store
  5. IsHoliday - whether the week is a special holiday week

Now, let’s focus on the time aspect of the data. Let’s narrow it down to only 1 store and 1 department for illustrative purpose.

What’s the date range of the data?

dat = train.loc[(train.Store == 1) & (train.Dept == 1),['Date','Weekly_Sales']]
print('Earliest date: %s; Latest date: %s' % (dat['Date'].min(), dat['Date'].max()))
Earliest date: 2010-02-05; Latest date: 2012-10-26

Let’s take a look at the data that we’ll focus on.

dat.head()
DateWeekly_Sales
02010-02-0524924.50
12010-02-1246039.49
22010-02-1941595.55
32010-02-2619403.54
42010-03-0521827.90
dat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 2 columns):
Date            143 non-null object
Weekly_Sales    143 non-null float64
dtypes: float64(1), object(1)
memory usage: 3.4+ KB
dat.describe()
Weekly_Sales
count143.000000
mean22513.322937
std9854.349032
min14537.370000
25%16494.630000
50%18535.480000
75%23214.215000
max57592.120000
dat.Date.describe()
count            143
unique           143
top       2011-12-23
freq               1
Name: Date, dtype: object
dat.set_index('Date', inplace=True)
dat.groupby('Date')
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f4fd3787e50>
dat.index
Index([u'2010-02-05', u'2010-02-12', u'2010-02-19', u'2010-02-26',
       u'2010-03-05', u'2010-03-12', u'2010-03-19', u'2010-03-26',
       u'2010-04-02', u'2010-04-09',
       ...
       u'2012-08-24', u'2012-08-31', u'2012-09-07', u'2012-09-14',
       u'2012-09-21', u'2012-09-28', u'2012-10-05', u'2012-10-12',
       u'2012-10-19', u'2012-10-26'],
      dtype='object', name=u'Date', length=143)

Data Visualization

Now that we have some basic understanding of the data, let’s plot the sales over time to see the trend.

%matplotlib inline
# plot data
dat.plot(figsize=(15, 6),title="Walmart Sales in One Department of One Store Feb 2010 to Oct 2012")
<matplotlib.axes._subplots.AxesSubplot at 0x7f4fd37bb9d0>

png

We can see that there are some clear patterns in the data. Through simple eye-balling, there are four spikes in sales around November, December, February, and April.

dat.index = pd.to_datetime(dat.index)
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(dat, model='multiplicative')
fig = result.plot()

png

From the plot above we can clearly see the seasonal component of the data, and we can also see the separated bumpy trend of the data.

Forecast Model

Now let’s build an ARIMA model to predict sales over time. ARIMA (autoregressive integrated moving average) models, also called Box-Jenkins models, are a class of powerful and popular models to forecast time series data. The ARIMA models may possibly include autoregressive terms, moving average terms, and differencing operations.

AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations. AR(1): a linear model to predict the value at the present time using the value at the previous time.

I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from an observation at the previous time step) in order to make the time series stationary.

MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

ARIMA models are commonly denoted as ARIMA(p,d,q) where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used. A value of 0 can be used for a parameter, which indicates to not use that element of the model. This way, the ARIMA model can be configured to perform the function of an ARMA model, and even a simple AR, I, or MA model.

from pyramid.arima import auto_arima
stepwise_model = auto_arima(dat, start_p=1, d=1, start_q=1,
                           max_p=3, max_q=3, m=12,
                           start_P=0, seasonal=True,
                           D=1, trace=True,
                           error_action='ignore',  
                           suppress_warnings=True)
print(stepwise_model.aic())
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=2760.510, BIC=2774.848, Fit time=1.505 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=2842.204, BIC=2847.939, Fit time=0.048 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=2806.564, BIC=2818.034, Fit time=0.202 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=2759.891, BIC=2771.361, Fit time=1.477 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=2789.475, BIC=2803.813, Fit time=0.486 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=2842.065, BIC=2850.668, Fit time=0.145 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=2784.482, BIC=2801.687, Fit time=12.198 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=2758.284, BIC=2766.887, Fit time=0.798 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(1, 1, 1, 12); AIC=2760.066, BIC=2771.537, Fit time=1.015 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(1, 1, 2, 12); AIC=2790.007, BIC=2804.345, Fit time=4.098 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=2760.133, BIC=2771.603, Fit time=1.594 seconds
Total fit time: 23.614 seconds
2758.2843325496183

Now we can pick the model with the best performance and validate the model performance by splitting the data into a train and a test set.

train = dat.loc[:'2012-05-01']
test = dat.loc['2012-05-01':]
stepwise_model.fit(train)
ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(0, 1, 0),
   out_of_sample_size=0, scoring='mse', scoring_args={},
   seasonal_order=(0, 1, 1, 12), solver='lbfgs', start_params=None,
   suppress_warnings=True, transparams=True, trend='c')
future_forecast = stepwise_model.predict(n_periods=len(test))
future_forecast = pd.DataFrame(future_forecast,index = test.index,columns=['Prediction'])
pd.concat([test,future_forecast],axis=1).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ffc078490>

png

pd.concat([dat,future_forecast],axis=1).plot(figsize=(15, 6),
                                             title="Walmart Sales in One Department of One Store Feb 2010 to Oct 2012")
<matplotlib.axes._subplots.AxesSubplot at 0x7f4fd2f3c0d0>

png

Conclusion

Voilà! Now we have a forecast model that can predict the sales of the coming weeks. With real-world data, the predictions will almost never be 100% in sync with the data, but we can see that the predictions stay in a reasonable range for off-holiday season sales. We can use the forecast information to optimize staffing and other resources.