Forecasting the Incidence of Breast, Colorectal and Bladder Cancers in North of Iran Using Time Series Models; Comparing Bayesian, ARIMA and Bootstrap Approaches

Introduction: Cancer is the second leading cause of death worldwide. The prevalence and incidence of cancers are increasing with population aging and growth. This study aims to predict the incidence of breast, colorectal and bladder cancers in the north of Iran until 2020 using time series models.
Methods: The number of breast, colorectal and bladder cancer cases from April 2014 to March 2016 was extracted. The time variable was each month of the study years, and the time series of monthly incident cases was built from the number of cancers registered daily in each month. Three methods of time series analysis, Box-Jenkins, Bayesian and Bootstrap, were then applied to predict the incidence of these cancers until March 2020.
Results: The number of bladder cancer cases in April 2014 was 6. The number of bladder cancer cases in March 2020 is predicted to increase to 15, 15 and 26 based on the ARIMA, Bootstrap and Bayesian methods respectively. In addition, the incident cases of breast cancer are predicted to increase from 32 in 2014 to 65 (ARIMA method), 47 (Bootstrap method) and 364 (Bayesian method). The corresponding figures for colorectal cancer were 30, 30 and 95 respectively.
Conclusion: The increasing trend of breast, bladder and colorectal cancers is expected to continue, and the increase is considerable according to the Bayesian results. Considering the limited reliable data available over a short period, the forecasting results of this model appear acceptable.
Keywords: Cancer, Forecasting, Time series

Introduction
Colorectal cancer is the third most prevalent cancer worldwide, with approximately 150,000 new cases annually. In Iran, during 2003-2008, the age-standardized incidence of colorectal cancer increased from 5.47 to 11.12 per 100,000 in women and from 5.56 to 12.7 per 100,000 in men [7,8]. Bladder cancer is another common cancer, especially among men. According to the results of a meta-analysis, the standardized incidence of this cancer among Iranian men and women has been estimated at 10.92 per 100,000 and 2.80 per 100,000 respectively [9].
apjec.waocp.com Ghasem Janbabaee, et al: Forecasting the Incidence of Breast, Colorectal and Bladder Cancers in North of Iran Using
Control of cancer, as one of the three main health priorities in Iran, requires a clear road map for all stakeholders. Current global evidence shows the importance of this strategy. For example, the treatment costs of cancer are 19% higher than those of cardiovascular disorders. Meanwhile, one-third to half of the cancer-associated deaths that occur in low- and middle-income countries are preventable by early diagnosis and treatment. Growth in public health knowledge helps policymakers plan suitable evidence-based strategies. It is therefore useful to project the future burden of cancer from the present situation and associated factors such as changes in population and risk factors [10].
Time series analysis is one of the most commonly used techniques in futures studies. A time series is a set of observations of a specified variable ordered in time. In general, the aim of time series studies is to identify probable models of the data-generating process and to predict future values. These techniques facilitate the statistical analysis of variables over time [11]. A time series study examines the past behavior of the series to identify the model that best describes how the data were generated. Then, assuming similar behavior in the future, upcoming values of the series are predicted. Such analyses apply to dependent data, in which observations are associated with each other over time. This dependence between sequential observations is the basic principle of time series analysis [12].
This study aims to forecast the incidence of breast cancer, colorectal cancer and bladder cancer in the northern part of Iran (Mazandaran province) until 2020 using different approaches to time series analysis.

Materials and Methods
This cross-sectional study was carried out using recorded data. The monthly numbers of incident cases of bladder cancer, breast cancer and colorectal cancer from April 2014 to March 2016 were extracted. Because complete information was available only for the most recent two years, only 24 time points were established. Sampling was conducted using the census method.
The main source of data was the cancer registry of Mazandaran University of Medical Sciences, Sari, Iran (IR.MAZUMS.REC.96.2730). Data were extracted without patients' names.
Three methods of analysis, Box-Jenkins, Bayesian and Bootstrap, were applied to predict the incidence of breast, colorectal and bladder cancers until March 2020. The time variable in this time series analysis was each month of the study years. The time series of monthly incident cases was built from the daily number of incident cancers registered in each month.
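The aggregation of daily registrations into a monthly series can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `monthly_counts` and the example dates are hypothetical, and the registry's actual record format is not described in the source.

```python
from collections import Counter
from datetime import date


def monthly_counts(registration_dates, start, end):
    """Count registrations per calendar month from start to end (inclusive).

    start and end are (year, month) tuples. Months without registrations
    are kept as zeros so the series has one value per time point.
    """
    per_month = Counter((d.year, d.month) for d in registration_dates)
    series = []
    year, month = start
    while (year, month) <= end:
        series.append(((year, month), per_month.get((year, month), 0)))
        # Advance one calendar month.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return series


# Hypothetical registration dates, for illustration only (not study data).
example_dates = [date(2014, 4, 3), date(2014, 4, 20), date(2014, 6, 1)]
series = monthly_counts(example_dates, start=(2014, 4), end=(2016, 3))
print(len(series))  # number of monthly time points, April 2014 to March 2016
print(series[:3])
```

Keeping empty months as explicit zeros matters here because the later models assume equally spaced time points.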

Modeling approaches
In the Box-Jenkins approach, time series plots, including the ACF (autocorrelation function) and PACF (partial autocorrelation function), were drawn to investigate the nature of the data. Each series was assessed for stationarity of the mean and variance and for the presence of a trend. Model building began by identifying a tentative ARIMA model from the observed data. The orders of the model (p, d, q) were then determined from the ACF and PACF plots, and the final ARIMA model was fitted. Goodness of fit was assessed using the AIC, the Ljung-Box test and evaluation of the model residuals (Q-Q plot).
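The differencing and autocorrelation steps described above can be illustrated with a short sketch. It assumes nothing about the authors' R code: the helper names `difference` and `acf` are hypothetical, the counts are invented, and the 1.96/sqrt(n) band is the usual large-sample approximation rather than anything stated in the source.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing: y[t] - y[t-1]."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(v * v for v in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


# Hypothetical monthly counts, for illustration only (not study data).
counts = [6, 9, 8, 12, 25, 14, 11, 16, 13, 18, 15, 20]
diffed = difference(counts, d=1)      # d = 1, as in the selected models
r = acf(diffed, max_lag=3)
bound = 1.96 / len(diffed) ** 0.5     # approximate 95% significance band

for lag, value in enumerate(r):
    flag = "outside band" if lag > 0 and abs(value) > bound else ""
    print(lag, round(value, 3), flag)
```

Lags whose autocorrelation falls outside the band are the ones that would typically inform the choice of p and q.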
For each observation, the Q-Q plot shows the observed value (X axis) against the value expected if the sample data were normally distributed (Y axis). When the data are normally distributed, the points fall close to a straight line.
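As a rough sketch of how the points of such a plot are obtained, the snippet below pairs sorted residuals with normal quantiles. The residual values are invented, the function name `qq_points` is hypothetical, and the plotting positions (i - 0.5)/n are one common convention assumed here, since the source does not say which was used.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (observed, expected) pairs for a normal Q-Q plot."""
    ordered = sorted(residuals)
    n = len(ordered)
    # Normal distribution with the sample mean and standard deviation.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, expected))


# Hypothetical model residuals, for illustration only.
residuals = [-2.1, 0.4, 1.3, -0.7, 0.2, 2.5, -1.6, 0.0]
for observed, expected in qq_points(residuals):
    print(f"{observed:6.2f} {expected:6.2f}")
```

If the two columns track each other closely, the points lie near the straight line that the text describes.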
In the Bootstrap approach, 1000 resamples were drawn for each series according to the order selected for the ARIMA model. The model was then fitted and its goodness of fit assessed on every resample, and finally the number of cancer cases was predicted for the future.
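The source does not specify the resampling scheme, so the following is only a simplified stand-in: a residual bootstrap of an AR(1) model fitted to the first-differenced series, with no moving average term. All names (`fit_ar1`, `bootstrap_forecast`) and the counts are hypothetical.

```python
import random


def fit_ar1(x):
    """Least-squares AR(1) without intercept: x[t] = phi * x[t-1] + e[t]."""
    num = sum(a * b for a, b in zip(x, x[1:]))
    den = sum(a * a for a in x[:-1])
    phi = num / den
    resid = [b - phi * a for a, b in zip(x, x[1:])]
    return phi, resid


def bootstrap_forecast(counts, horizon, n_boot=1000, seed=0):
    """Mean forecast path from a residual bootstrap of an AR(1) on differences."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(counts, counts[1:])]
    phi, resid = fit_ar1(diffs)
    totals = [0.0] * horizon
    for _ in range(n_boot):
        # Rebuild a pseudo-series of differences from resampled residuals.
        sim = [diffs[0]]
        for _ in range(len(diffs) - 1):
            sim.append(phi * sim[-1] + rng.choice(resid))
        phi_b, resid_b = fit_ar1(sim)  # refit the model on the resample
        # Simulate one future path and integrate the differences back to counts.
        level, last = counts[-1], diffs[-1]
        for h in range(horizon):
            last = phi_b * last + rng.choice(resid_b)
            level += last
            totals[h] += level
    return [t / n_boot for t in totals]


# Hypothetical monthly counts, for illustration only (not study data).
counts = [6, 9, 8, 12, 25, 14, 11, 16, 13, 18, 15, 20]
forecast = bootstrap_forecast(counts, horizon=6, n_boot=1000)
print([round(v, 1) for v in forecast])
```

With a series this short the refitted coefficient varies a lot between resamples, which is one reason forecasts from few time points carry wide uncertainty.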
In the Bayesian approach, the probable trend of each series was investigated using the bsts package. Posterior distributions were then obtained using appropriate prior distributions, and forecasting was conducted for the next 48 months.
The statistical analyses were performed using R version 3.5.3. The tseries and forecast packages were used for Box-Jenkins and Bootstrap modeling, and the bsts package was used for the Bayesian approach.
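The source does not state which state components were specified in bsts. Purely as an illustration, and as an assumption rather than the authors' confirmed model, a local linear trend (one of the trend specifications bsts offers) for the monthly count y_t can be written as follows, where mu_t is the level, delta_t the slope, and the three variances receive prior distributions:

```latex
\[
\begin{aligned}
y_t &= \mu_t + \varepsilon_t, & \varepsilon_t &\sim \mathcal{N}(0, \sigma_\varepsilon^2),\\
\mu_{t+1} &= \mu_t + \delta_t + \eta_{\mu,t}, & \eta_{\mu,t} &\sim \mathcal{N}(0, \sigma_\mu^2),\\
\delta_{t+1} &= \delta_t + \eta_{\delta,t}, & \eta_{\delta,t} &\sim \mathcal{N}(0, \sigma_\delta^2).
\end{aligned}
\]
```

Because the slope itself follows a random walk, forecasts from a model of this kind can extrapolate a recent upward trend strongly over a 48-month horizon.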

Results
In total, 367 bladder cancer cases were recorded, with monthly counts ranging from 6 cases in April 2014 to 25 cases in August 2014 (appendix 1). The corresponding distribution by month ranged from 18 in March to 38 in February.
The model orders were identified from the ACF and PACF plots after differencing. As illustrated in the graphs of appendices 3 and 4, the p and q parameters were both estimated as 1. Considering the graphs in appendices 2-4, the ARIMA model with p=1, d=1 and q=1 appears to be the best model; that is, the model includes both autoregressive and moving average components (ARIMA (p,d,q) = (1,1,1)). This model also had the lowest AIC (149.45). The Ljung-Box test was applied to assess the ARIMA model for forecasting bladder cancer incidence, and the statistic indicated that the final selected model was appropriate (X-squared=0.015718, p-value=0.900). The residuals were normally distributed, supporting the adequacy of the model (appendix 5).
As illustrated in the graph in appendix 6, the average number of incident bladder cancer cases in the north of Iran in 2020 is predicted to be 15 per month. The graph in appendix 7 illustrates the forecast following 1000 Bootstrap resamples based on the ARIMA model. Based on the Bayesian approach, the number of new bladder cancer cases was estimated at 30 per month in 2020 (Table 1).
In total, 1113 breast cancer cases were investigated, with the minimum and maximum observed in April 2015 (25 cases) and March 2016 (99 cases) respectively (appendix 9). These cases were distributed from 57 in March to 163 in February. The time series was non-stationary and was made stationary by first-order differencing (appendix 10). The autoregressive and moving average parameters were both estimated as 1 (appendices 11 and 12). Therefore, ARIMA (p,d,q) = (1,1,1) was selected as the best model, with an AIC of 196.13. The Ljung-Box test indicated that the ARIMA model was appropriate (X-squared=0.014429, p-value=0.9044). According to the graph in appendix 13, the residuals were normally distributed.
The graph in appendix 14 shows that the average number of breast cancer cases in 2020 predicted by the ARIMA model is 65 per month. The corresponding figures for the Bootstrap approach (appendix 15) and the Bayesian approach (appendix 16) were 47 and 358 cases per month respectively (Table 1).
The total number of colorectal cancer cases was 722, with the minimum and maximum observed in September-October 2015 (21 cases) and April 2014 (86 cases) respectively (appendix 17). These cases were distributed from March (44 cases) to February (86 cases). Because the time series was non-stationary, first-order differencing was performed (appendix 18). The autoregressive and moving average parameters were estimated from the ACF and PACF graphs (appendices 19 and 20); both were estimated as 1, and ARIMA (p,d,q) = (1,1,1) was selected as the best model, with an AIC of 160.34. Based on the results of the Ljung-Box test (X-squared=0.0063381, p-value=0.936), the selected model was appropriate. In addition, the residuals were normally distributed (appendix 21).
The number of colorectal cancer cases in the north of Iran based on the ARIMA model (appendix 22) was estimated at 30 cases per month. The corresponding figures for the Bootstrap (appendix 23) and Bayesian models (appendix 24) were 30 and 89 cases respectively (Table 1).
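The reported Ljung-Box results can be cross-checked with a short sketch. It assumes the test used a single lag, so that the statistic is compared with a chi-square distribution with 1 degree of freedom; the source does not state the lag. Under that assumption the computed p-values come out close to the reported 0.900, 0.9044 and 0.936. The function names are hypothetical.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def ljung_box_q(residuals, lag=1):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..lag} r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    c0 = sum(v * v for v in dev)
    q = 0.0
    for k in range(1, lag + 1):
        r_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q


# Statistics as reported in the Results; 1 degree of freedom is an assumption.
reported = [("bladder", 0.015718), ("breast", 0.014429), ("colorectal", 0.0063381)]
for label, stat in reported:
    print(label, round(chi2_sf_df1(stat), 4))
```

A large p-value means the residual autocorrelation at the tested lag is not distinguishable from zero, which is why the text reads these results as supporting the selected models.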

Discussion
In this study, the incidence of breast, colorectal and bladder cancers until March 2020 was predicted using time series analysis based on three modeling approaches (ARIMA, Bootstrap and Bayesian). The results showed that the number of incident bladder cancer cases is predicted to increase from six cases in 2014 to 15, 15 and 26 cases in March 2020 based on the ARIMA, Bootstrap and Bayesian approaches respectively. The number of patients diagnosed with breast cancer is predicted to increase from 32 to 65, 47 and 364 cases in 2020 based on the same three approaches respectively. In addition, the corresponding figures for colorectal cancer are 30, 30 and 95 cases respectively.
Time series models are used in different fields of medical sciences, for example to forecast the number of patients and deaths. In the study conducted by Nikbakht et al, the trend of colon cancer in Southeast Iran until 206 was predicted and showed an increasing trend [13].
Alvaro-Meca et al applied time series models to forecast breast cancer mortality in Spain using data from 1981-2007 and obtained an ARIMA (0,2,0) model, which was used for 15-year forecasting. In that study, an increasing trend in breast cancer mortality was observed for all age groups until 1995, followed by a decline, so that the overall pattern of deaths during the 15-year study period was decreasing [14].
Bae et al in 2002 found an increasing trend in mortality from all cancers during 1983-2000 based on time series models [15].
Fazeli et al investigated breast cancer mortality among four age groups of Iranian women during 1995-2004. They found an increasing trend from 1995 to 2002 and a decreasing trend during 2002-2004 [16].
Time series models are widely used in forecasting cancers [17][18]. As in the current study, the results of other forecasting studies are consistent with an increasing rate of cancers. One of the main characteristics of time series analysis is that it is best suited to short forecasting horizons; over long periods, uncertainty increases [18].
Bayesian theory, first proposed by Thomas Bayes, is widely used in the medical sciences [19]. Using the Bayesian method in time series analysis has several benefits. When a model is built on data covering a short period, overfitting may occur: the model fits the current data well but cannot precisely predict new data.
One way to address overfitting is to use the Bayesian method, in which prior distributions are placed on the model parameters. In the Bayesian approach, these parameters are treated as random variables and are used to obtain the posterior distribution and more precise estimates [20][21][22][23][24]. When little prior data are available, the Bayesian approach is an appropriate method for prediction [25]. Ribes et al predicted cancer incidence until 2020 using the Bayesian approach. They reported that the incidence of cancer in 2020 will reach 26455 in men and 18345 in women, corresponding to increases of 22.5% and 24.5% among men and women respectively [26]. This increasing trend is similar to the results of the present study.
One of the limitations of the current study is that complete information was available for only 24 months (2014-2015). It should be noted that at least 50 time points are required for a precise time series analysis. To mitigate this limitation in the ARIMA model, 1000 resamples were drawn using the Bootstrap approach before forecasting. As another limitation of our study, Bootstrapping ignores the time of cancer onset, whereas time series are time dependent. Moreover, the ARIMA model has a short forecasting horizon, and in long-term predictions the confidence intervals become wide. This limitation can be addressed by the Bayesian method.
In conclusion, our study predicted an increasing trend for breast, bladder and colorectal cancers in the northern part of Iran. We also found higher estimates based on the Bayesian approach.