RDP 2021-05: Central Bank Communication: One Size Does Not Fit All Appendix C: Model Tuning Process

C.1 Feature selection process

In this study we adopt an automatic feature selection method, called recursive feature elimination (RFE) (Guyon et al 2002), to select the relevant features for each model. This helps ensure that each feature included in the final model has a minimum degree of predictive power; otherwise, the models may mistake ‘noise’ for ‘signal’. The algorithm is configured to evaluate progressively smaller subsets of the features, retaining the most important features at each step. The key steps of the process are shown in Table C1.

Table C1: Key Steps of a Recursive Feature Elimination Process
1.1 Train the model on training dataset using all features {X1,X2,…,Xn}
1.2 Calculate model performance
1.3 Calculate variable importance
1.4 For each subset size Si, i = 1…n do
  1. Keep the Si most important features
  2. Train the model on the training dataset using top Si features
  3. Calculate the model performance
1.5 End
1.6 Calculate the performance profile over the Si
1.7 Determine the appropriate number of predictors
1.8 Use the model corresponding to the optimal Si

Source: https://topepo.github.io/caret/recursive-feature-elimination.html

Our model includes 292 features in total, so in the first step of the RFE process we include all features. Then we run the model using 30 different feature subset sizes, that is, (10, 20, …, 290, 292). To minimise overfitting due to feature selection, we use a cross-validation resampling method, running the process listed in Table C1 on the training dataset only and calculating the model performance using the validation dataset. We run this process 10 times and calculate the model performance (accuracy) for each subset of features as the average of the results from those 10 runs.
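To illustrate, the RFE set-up described above could be run with the caret package in R along the following lines. This is a sketch only: the objects x (a data frame of the 292 features) and y (the outcome) are hypothetical placeholders, and the exact resampling settings used for the paper may differ.

    library(caret)

    # Hypothetical inputs: x holds the 292 text features, y the outcome to predict
    subset_sizes <- c(seq(10, 290, by = 10), 292)   # the 30 candidate subset sizes

    rfe_ctrl <- rfeControl(
      functions = rfFuncs,   # random forest helper functions supplied by caret
      method    = "cv",      # cross-validation resampling on the training data
      number    = 10         # 10 resamples, averaged to score each subset size
    )

    set.seed(2021)
    rfe_fit <- rfe(x, y, sizes = subset_sizes, rfeControl = rfe_ctrl)

    rfe_fit$optsize       # number of features retained
    predictors(rfe_fit)   # names of the selected features

The performance profile stored in rfe_fit$results gives the cross-validated accuracy for each subset size, which is the basis for choosing the optimal number of features.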

C.2 Tuning parameters process

To improve model performance, we tune 2 parameters:

  • the number of trees that will be built for each model (ntree), and
  • the optimal number of variables that will be selected for each node in a tree (mtry).

The default value of ntree is 500, and the default value of mtry is the square root of the number of features. Different values of these 2 parameters may affect model performance. To find the optimal settings, we employ a grid search approach.

For the grid search, we choose 11 different ntree values (10, 100, 200, 300, …, 1,000) and, for mtry, as suggested by Breiman (2001), we choose 3 values: the default value (mtry = 17), half the default (mtry = 9) and twice the default (mtry = 34). For each combination, we build 10 models using 10-fold cross-validation and repeat the process 3 times. The combination of ntree and mtry that returns the highest accuracy is selected.
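A minimal sketch of this grid search with caret is given below. Because caret's built-in ‘rf’ method tunes only mtry, the sketch assumes one common workaround of looping over the candidate ntree values and passing each through to randomForest; the objects x and y are again hypothetical placeholders.

    library(caret)

    ntree_grid <- c(10, seq(100, 1000, by = 100))    # the 11 candidate ntree values
    mtry_grid  <- expand.grid(mtry = c(9, 17, 34))   # half, default and twice sqrt(292)

    tr_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

    results <- lapply(ntree_grid, function(nt) {
      fit <- train(x, y,
                   method    = "rf",
                   tuneGrid  = mtry_grid,
                   trControl = tr_ctrl,
                   ntree     = nt)        # passed through to randomForest()
      cbind(fit$results, ntree = nt)      # accuracy for each mtry at this ntree
    })
    results <- do.call(rbind, results)

    results[which.max(results$Accuracy), ]   # best-performing ntree/mtry pair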

C.3 Top ten features for four models

Table C2: Top Ten Features for Four RF Models
Rank Reasoning model   Readability model
Features Importance(a) Features Importance(a)
Economist
1 Proportion of VB 6.2   Proportion of CC 13.1
2 Proportion of NNS 4.6   Proportion of RB 9.7
3 Proportion of MD 4.5   Proportion of VB 7.6
4 Count of digits 4.1   Proportion of VBP 7.4
5 Count of VB 3.9   Count of NN 6.9
6 Proportion of NN 3.6   Count of NP 6.8
7 Count of MD 3.5   Count of punctuation 5.9
8 Proportion of IN 3.5   Proportion of MD 5.5
9 Proportion of CD 3.5   Count of commas 4.6
10 Proportion of VBN 2.8   Count of SBAR 4.6
Non-economist
1 Proportion of VB 10.6   Proportion of DT 5.2
2 Proportion of MD 9.0   Proportion of JJ 5.1
3 Proportion of JJ 7.3   FK grade level 4.7
4 Proportion of IN 6.1   Count of NP 4.7
5 Proportion of NN 5.9   Count of syllables 4.7
6 Count of MD 5.3   Proportion of NN 4.4
7 Proportion of VBN 5.3   Proportion of CC 4.4
8 Count of VB 5.2   Proportion of VB 4.3
9 Proportion of TO 5.1   Proportion of IN 4.3
10 Proportion of CC 5.1   Proportion of NNS 4.2
Note:
  (a) Feature importance is extracted as part of the model output generated by the caret package in R; the importance value for each variable reflects its contribution to the model, measured by the mean decrease in node impurity (Gini).
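For reference, importance values of this kind can be read off a fitted caret random forest with varImp(); in the sketch below rf_fit is a hypothetical object returned by caret::train().

    library(caret)

    imp <- varImp(rf_fit, scale = FALSE)   # mean decrease in Gini for each feature
    plot(imp, top = 10)                    # plot the ten most important features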