spFSR - feature selection and ranking by simultaneous perturbation stochastic approximation

Feature selection can be loosely defined as finding an optimal subset of available features in a dataset that are associated with the response variable. There are three broad categories of feature selection methods: filter methods, wrapper methods, and embedded methods. In this vignette, we introduce the Simultaneous Perturbation Stochastic Approximation for Feature Selection and Ranking (SPSA-FSR) algorithm as one of the wrapper methods and show how to use the spFSR
package
which implements this algorithm. The spFSR
package is built
upon the works by Aksakalli and Malekipirbazari
(2016) <arXiv:1508.07630> and Yenice
et al. (2018) <arXiv:1804.05589>.
As the spFSR
package depends upon the mlr
package, we shall follow the mlr
workflow to define a learner and a task. The spFSR
package supports classification and regression problems. We show how to
perform feature selection with the spFSR
package with two
applications - one classification problem and one regression problem.
The spFSR
package does not support unsupervised learning
(such as clustering), cost-sensitive classification, and survival
analysis.
Let X be an n × p data matrix of p features and n observations, and let Y denote the n × 1 response vector; together they constitute the dataset. Let X := {X1, X2, ..., Xp} denote the feature set where Xj represents the jth feature in X. For a nonempty subset X′ ⊂ X, we define ℒC(X′, Y) as the true value of the performance criterion of a wrapper classifier (the model) C on the dataset. As ℒC is not known, we train the classifier C and compute the error rate, which is denoted by yC(X′, Y). Therefore, yC = ℒC + ε. The wrapper feature selection problem is defined as determining the non-empty feature set X*:
$$X^{*} := \arg\min_{X' \subset X} \; y_{C}(X', Y)$$
It is a stochastic optimisation problem because the functional form of yC is unknown; it can be solved using stochastic optimisation algorithms, including Simultaneous Perturbation Stochastic Approximation (SPSA). Introduced by Spall (1992), SPSA is a pseudo-gradient descent stochastic optimisation algorithm. It starts with a random solution (vector) and moves toward the optimal solution in successive iterations in which the current solution is perturbed simultaneously by random offsets generated from a specified probability distribution. The SPSA-FSR algorithm is therefore an application of SPSA to feature selection.
In the context of SPSA, given w ∈ D ⊂ ℝp, the loss function is given as:
ℒ(w) : D ↦ ℝ
where its functional form is unknown but one can observe noisy measurement:
y(w) := ℒ(w) + ε(w)
where ε is the noise. Let g(w) denote the gradient of ℒ:
$$g(w): = \nabla \mathcal{L} = \frac{\partial \mathcal{L} }{\partial w}$$
The SPSA-FSR algorithm uses a binary version of SPSA where w ∈ ℤp. Hence, the loss function becomes ℒ : {0, 1}p ↦ ℝ. The SPSA-FSR algorithm starts with an initial solution ŵ0 and iterates following the recursion below in search of a local minimum w*:
$$\hat{w}_{k+1} := \hat{w}_k - a_k \hat{g}(\hat{w}_k)$$
where ak is a nonnegative iteration gain sequence and ĝ(ŵk) is the estimate of the gradient g at iteration k.
Let Δk ∈ ℝp be a simultaneous perturbation vector at iteration k. Spall (1992) suggests imposing certain regularity conditions on Δk: its components Δkj must be mutually independent, generated from a symmetric distribution with zero mean, and have finite inverse moments.
The finite inverse moment requirement precludes generating Δk from uniform or normal distributions. A good candidate is a symmetric zero-mean Bernoulli distribution, say ±1 with probability 0.5 each. The SPSA-FSR algorithm “perturbs” the current iterate ŵk by the amount ckΔk in each direction, yielding ŵk + ckΔk and ŵk − ckΔk respectively. Hence, the simultaneous perturbations around ŵk are defined as:
$$\hat{w}_k^{\pm} := \hat{w}_k \pm c_k \Delta_k$$
where ck is a nonnegative gradient gain sequence. The noisy measurements of ŵk± at iteration k become:
$$y^{+}_k:=\mathcal{L}(\hat{w}_k + c_k \Delta_k) + \varepsilon_{k}^{+} \\ y^{-}_k:=\mathcal{L}(\hat{w}_k - c_k \Delta_k) + \varepsilon_{k}^{-}$$
where 𝔼(εk+ − εk− | ŵ0, ŵ1, ..., ŵk, Δk) = 0 for all k. At each iteration, ŵk± are bounded and rounded before yk± are evaluated. Therefore, ĝk is computed as:
$$\hat{g}_k(\hat{w}_k):=\bigg[ \frac{y^{+}_k-y^{-}_k}{w^{+}_{k1}-w^{-}_{k1}},...,\frac{y^{+}_k-y^{-}_k}{w^{+}_{kp}-w^{-}_{kp}} \bigg]^{T} = \bigg[ \frac{y^{+}_k-y^{-}_k}{2c_k \Delta_{k1}},...,\frac{y^{+}_k-y^{-}_k}{2c_k \Delta_{kp}} \bigg]^{T} = \frac{y^{+}_k-y^{-}_k}{2c_k}[\Delta_{k1}^{-1},...,\Delta_{kp}^{-1}]^{T}$$
For convergence, Spall (1992) proposes decaying iteration and gradient gain sequences governed by the pre-defined parameters A, a, α, c, and γ; these parameters must be fine-tuned properly. In the SPSA-FSR algorithm, we set γ = 1 so that ck is a constant. Details of the fine-tuned values are found in Aksakalli and Malekipirbazari (2016). Yenice et al. (2018) propose a nonmonotone Barzilai-Borwein method (Barzilai and Borwein 1988) to smooth the gain via
$$\hat{b}_k = \frac{\sum_{n=k-t}^k{\hat{a}_{n}^{'}}}{t+1}$$
The role of b̂k is to eliminate irrational fluctuations in the gains and ensure the stability of the SPSA-FSR algorithm. It averages the gains at the current and last two iterations, i.e. t = 2. Gain smoothing results in a decrease in convergence time. Due to the stochastic nature of the algorithm and the noisy measurements, the gradient estimates ĝ(ŵ) can be approximated wrongly and hence distort the convergence direction of the SPSA-FSR algorithm. To mitigate this side effect, the current and the previous m gradients are averaged to form the gradient estimate at the current iteration:
$$\hat{g}_k(\hat{w}_k) = \frac{\sum_{n=k-m}^{k}\hat{g}_n(\hat{w}_n)}{m+1}$$
The SPSA-FSR algorithm does not have an automatic stopping rule, so we specify a maximum number of iterations as the stopping criterion. The SPSA-FSR algorithm is summarised as follows (a minimal R sketch of one iteration is given after the list):
- Generate Δkj ∼ Bernoulli(−1, +1) with ℙ(Δkj = 1) = ℙ(Δkj = −1) = 0.5 for j = 1, ..., p
- Compute ŵk± := ŵk ± cΔk
- Bound and then round:
- ŵk± ↦ B(ŵk±) where B(•) is the component-wise operator that bounds each element to the interval [0, 1].
- B(ŵk±) ↦ R(ŵk±) where R(•) is the component-wise rounding operator.
- Evaluate yk± := ℒ(ŵk ± cΔk) + εk±
- Compute the gradient estimate:
$$\hat{g}_k(\hat{w}_k):=\bigg( \frac{y^{+}_k-y^{-}_k}{2c}\bigg)[\Delta_{k1}^{-1},...,\Delta_{kp}^{-1}]^{T}$$
- Update ŵk+1 := ŵk − ak ĝk(ŵk)
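To make the recursion concrete, below is a minimal R sketch of a single SPSA-FSR iteration; it illustrates the update rule only and is not the spFSR package internals. The function and argument names (spsa_fsr_step, loss, a_k, c_k) are hypothetical.

# One illustrative SPSA-FSR iteration (a sketch, not the package implementation).
# 'w_hat' is the current solution in [0, 1]^p, 'loss' is a hypothetical function
# returning a noisy error measurement for a 0/1 feature-inclusion vector,
# and 'a_k' and 'c_k' are fixed iteration and gradient gains.
spsa_fsr_step <- function(w_hat, loss, a_k = 0.1, c_k = 0.05) {
  p <- length(w_hat)
  # Simultaneous perturbation vector: symmetric +/-1 Bernoulli components
  delta <- sample(c(-1, 1), size = p, replace = TRUE)
  # Bound each component to [0, 1] and round to obtain a 0/1 vector
  bound_round <- function(w) round(pmin(pmax(w, 0), 1))
  w_plus  <- bound_round(w_hat + c_k * delta)
  w_minus <- bound_round(w_hat - c_k * delta)
  # Noisy loss measurements at the two perturbed points
  y_plus  <- loss(w_plus)
  y_minus <- loss(w_minus)
  # Simultaneous perturbation gradient estimate
  g_hat <- (y_plus - y_minus) / (2 * c_k) * (1 / delta)
  # Gradient-descent style update of the current solution
  w_hat - a_k * g_hat
}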
The spFSR package can be installed from CRAN as follows:
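install.packages("spFSR")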
If it is installed manually from an archive (tar.gz), then the following dependency and imported packages must be installed first:
if(!require("mlr") ){ install.packages("mlr") } # mlr (>= 2.11)
if(!require("parallelMap") ){ install.packages("parallelMap") } # parallelMap (>= 1.3)
if(!require("parallel") ){ install.packages("parallel")} # parallel (>= 3.4.2)
if(!require("tictoc") ){ install.packages("tictoc") } # tictoc (>= 1.0)
if(!require("ggplot2") ){ install.packages("ggplot2") } # ggplot2
The mlr package depends on other packages. Although only some of them are utilised by spFSR, it is highly recommended to install mlr's suggested packages:
ada, adabag, bartMachine, batchtools, brnn, bst, C50, care, caret (>= 6.0-57), class, clue, cluster, clusterSim (>= 0.44-5), clValid, cmaes, CoxBoost, crs, Cubist, deepnet, DiceKriging, DiceOptim, DiscriMiner, e1071, earth, elasticnet, elmNN, emoa, evtree, extraTrees, flare, fields, FNN, fpc, frbs, FSelector, gbm, GenSA, ggvis, glmnet, h2o (>= 3.6.0.8), GPfit, Hmisc, ipred, irace (>= 2.0), kernlab, kknn, klaR, knitr, kohonen, laGP, LiblineaR, lqa, MASS, mboost, mco, mda, mlbench, mldr, mlrMBO, modeltools, mRMRe, nnet, nodeHarvest (>= 0.7-3), neuralnet, numDeriv, pamr, party, penalized (>= 0.9-47), pls, PMCMR (>= 4.1), pROC (>= 1.8), randomForest, randomForestSRC (>= 2.2.0), ranger (>= 0.6.0), RCurl, Rfast, rFerns, rjson, rknn, rmarkdown, robustbase, ROCR, rotationForest, rpart, RRF, rrlda, rsm, RSNNS, RWeka, sda, shiny, smoof, sparsediscrim, sparseLDA, stepPlr, SwarmSVM, svglite, testthat, tgp, TH.data, xgboost (>= 0.6-2), XML
To see why, suppose we would like to apply k-nearest neighbours (knn) to a classification problem. In mlr, this can be done by defining a learner as mlr::makeLearner("classif.knn", k = 5) in R. Note that "classif.knn" is called from the class package via mlr. So if the class package has not been installed, this learner cannot be defined. To get the full list of learners from the mlr package, see listLearners() for more details.
Using the classical iris data, the goal is to choose 3 of 4 features
(Sepal.Length
, Sepal.Width
,
Petal.Length
, and Petal.Width
) that give the
highest mean accuracy rate in predicting Species
.
After loading the mlr
package, we create a
wrapper
which is a knn learner with k = 5. Then, we make a
classification task
by specifying Species
as
the response or target variable we would like to predict. Lastly, we
specify acc
(accuracy) to evaluate the wrapper’s
performance.
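A minimal sketch of this setup, using the mlr API and the object names (knnWrapper, classifTask, and perf.measure) that appear in the spFeatureSelection call below:

library(mlr)

# Wrapper: a k-nearest neighbour learner with k = 5
knnWrapper <- makeLearner("classif.knn", k = 5)

# Classification task on the iris data with Species as the target
classifTask <- makeClassifTask(data = iris, target = "Species")

# Performance measure: accuracy
perf.measure <- acc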
spFeatureSelection
The spFeatureSelection function requires four main arguments:

- task: a task object created using the mlr package. In this example, task = classifTask.
- wrapper: a Learner object created using the mlr package. In this example, wrapper = knnWrapper.
- measure: a performance measure supported by task; here, measure = perf.measure.
- num.features.selected: the number of features to be selected. In this example, we aim to choose three features (num.features.selected = 3).

In addition, due to the stochastic nature of the SPSA-FSR algorithm, we recommend running it for multiple iterations by specifying iters.max in spFeatureSelection. The default value of iters.max is 25. For illustration, we shall run up to 10 iterations by specifying iters.max = 10. To speed up the run, users can specify how many processor cores to use via num.cores. The default value is 2 or the minimum number of cores available on the computer. In this example, we use a single core (num.cores = 1).
library(spFSR)
#> Loading required package: parallelMap
#> Loading required package: parallel
#> Loading required package: tictoc
set.seed(123)
spsaMod <- spFeatureSelection(
task = classifTask,
wrapper = knnWrapper,
measure = perf.measure,
num.features.selected = 3,
iters.max = 10,
num.cores = 1)
#> SPSA-FSR begins:
#> Wrapper = knn
#> Measure = acc
#> Number of selected features = 3
#>
#> iter value st.dev num.ft best.value
#> 1 0.96 0.03137 3 0.96 *
#> 2 0.95333 0.0276 3 0.96
#> 3 0.96222 0.02779 3 0.96222 *
#> 4 0.95556 0.03488 3 0.96222
#> 5 0.95556 0.04115 3 0.96222
#> 6 0.95556 0.02412 3 0.96222
#> 7 0.95556 0.02999 3 0.96222
#> 8 0.95778 0.03443 3 0.96222
#> 9 0.96222 0.02477 3 0.96222
#> 10 0.95111 0.03534 3 0.96222
#>
#> Best iteration = 3
#> Number of selected features = 3
#> Best measure value = 0.96222
#> Std. dev. of best measure = 0.02779
#> Run time = 0.84 minutes.
The output above shows the result produced by spFeatureSelection. At each iteration (iter), it reports the mean accuracy rate (value) and its standard deviation (st.dev) on three features (num.ft = 3). The best.value column represents the best mean accuracy rate produced by spFeatureSelection so far. At the first iteration (iter = 1), the best mean accuracy rate is 0.96 and it is denoted by '*'. At the second iteration, the mean accuracy rate is lower and hence the accuracy rate from the first iteration remains the best value. At the third iteration, the accuracy rate improves to 0.96222, which is higher than the previous best value. The accuracy rate of the third iteration therefore becomes the best value.
The spFSR package supports three S3 generic methods: print, summary, and plot. The usages of print and summary are quite straightforward. The summary method returns the following information:
summary(spsaMod)
#> $target
#> [1] "Species"
#>
#> $importance
#> features importance
#> 1 Sepal.Width 0.49887
#> 2 Petal.Length 0.49873
#> 3 Petal.Width 0.49871
#>
#> $nfeatures
#> [1] 3
#>
#> $niters
#> [1] 10
#>
#> $name
#> [1] "k-Nearest Neighbor"
#>
#> $best.iter
#> [1] 3
#>
#> $best.value
#> [1] 0.96222
#>
#> $best.std
#> [1] 0.02779
#>
#> attr(,"class")
#> [1] "summary.spFSR"
We can produce a scatterplot of mean accuracy rate at each iteration
by calling the plot
function on spsaMod
. We
can also add an error bar of ±1
standard deviation around the mean accuracy rate at each iteration by
specifying errorBar = TRUE
. Other graphical parameters such
as pch
, type
, ylab
, and
col
are allowed.
The spFSR package has:

- getImportance which returns the importance ranks of the best performing features as a data.frame object
- plotImportance which plots the importance ranks of the best performing features
- getBestModel which returns the trained or wrapped model based on the set of best performing features

getImportance(spsaMod)
#> features importance
#> 1 Sepal.Width 0.49887
#> 2 Petal.Length 0.49873
#> 3 Petal.Width 0.49871
plotImportance(spsaMod)
The vertical bar chart generated by plotImportance shows that Sepal.Width is ranked as the most important feature, although the importance values of the three selected features are very close. We can obtain the best performing model by calling getBestModel.
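For example:

bestMod <- getBestModel(spsaMod)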
bestMod is an object of the WrappedModel class from the mlr package. It inherits all methods of this class, including predict. The predict function can be used to predict out-of-sample data by setting newdata to a test dataset. It can also be used to return the predicted responses on task = spsaMod$task.spfs (which contains only the best performing features). The code chunk below illustrates how the predicted responses can be obtained and used to calculate the confusion matrix by calling calculateConfusionMatrix from the mlr package.
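A minimal sketch of that chunk, assuming the bestMod object obtained above (the variable name pred is hypothetical):

# Predict on the task restricted to the best performing features
pred <- predict(bestMod, task = spsaMod$task.spfs)

# Confusion matrix of the predicted responses
calculateConfusionMatrix(pred)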
The goal is to select 10 out of 13 features which predict the median house price (medv, in USD 1000's) from the BostonHousing data. The data is loaded from the mlbench package:
if( !require(mlbench) ){install.packages('mlbench')}
library(mlbench)
data("BostonHousing")
head(BostonHousing)
#> crim zn indus chas nox rm age dis rad tax ptratio b lstat
#> 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
#> 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
#> 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
#> 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
#> 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
#> 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
#> medv
#> 1 24.0
#> 2 21.6
#> 3 34.7
#> 4 33.4
#> 5 36.2
#> 6 28.7
spFeatureSelection
We start by configuring a regression task and a linear regression (lm) wrapper:
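A sketch of this configuration, using the mlr API and the object names regTask and regWrapper referenced in the call below:

# Regression task on the BostonHousing data with medv as the target
regTask <- makeRegrTask(data = BostonHousing, target = "medv")

# Wrapper: a linear regression learner
regWrapper <- makeLearner("regr.lm")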
For a regression problem, stratified sampling is not supported and so
cv.stratify
must be FALSE
. We use mean squared
error (mse
) to evaluate the linear regression’s
performance. Similar to the previous example, we shall run up to 10
iterations by specifying iters.max = 10
on a single core
(num.cores = 1
).
regSPSA <- spFeatureSelection(
task = regTask, wrapper = regWrapper,
measure = mse, num.features.selected = 10,
cv.stratify = FALSE,
iters.max = 10,
num.cores = 1
)
#> SPSA-FSR begins:
#> Wrapper = lm
#> Measure = mse
#> Number of selected features = 10
#>
#> iter value st.dev num.ft best.value
#> 1 25.41895 7.13893 10 25.41895 *
#> 2 24.44128 5.996 10 24.44128 *
#> 3 25.38896 7.46722 10 24.44128
#> 4 24.36343 5.90703 10 24.36343 *
#> 5 24.97904 6.06904 10 24.36343
#> 6 24.83484 5.76259 10 24.36343
#> 7 25.69833 8.62474 10 24.36343
#> 8 24.86113 5.28723 10 24.36343
#> 9 24.81996 6.32071 10 24.36343
#> 10 24.71374 6.64379 10 24.36343
#>
#> Best iteration = 4
#> Number of selected features = 10
#> Best measure value = 24.36343
#> Std. dev. of best measure = 5.90703
#> Run time = 1.64 minutes.
The methods and importance functions can also be used for regression problems.
summary(regSPSA)
#> $target
#> [1] "medv"
#>
#> $importance
#> features importance
#> 1 lstat 1.00000
#> 2 dis 0.98933
#> 3 chas 0.95279
#> 4 ptratio 0.93678
#> 5 rm 0.65468
#> 6 indus 0.57742
#> 7 zn 0.50451
#> 8 nox 0.47618
#> 9 tax 0.28519
#> 10 rad 0.05062
#>
#> $nfeatures
#> [1] 10
#>
#> $niters
#> [1] 10
#>
#> $name
#> [1] "Simple Linear Regression"
#>
#> $best.iter
#> [1] 4
#>
#> $best.value
#> [1] 24.36343
#>
#> $best.std
#> [1] 5.90703
#>
#> attr(,"class")
#> [1] "summary.spFSR"
getImportance(regSPSA)
#> features importance
#> 1 lstat 1.00000
#> 2 dis 0.98933
#> 3 chas 0.95279
#> 4 ptratio 0.93678
#> 5 rm 0.65468
#> 6 indus 0.57742
#> 7 zn 0.50451
#> 8 nox 0.47618
#> 9 tax 0.28519
#> 10 rad 0.05062
plotImportance(regSPSA)
The importance plot reveals lstat and dis to be the two most important features in predicting the median housing value. We can also obtain the best model via getBestModel and make predictions, as sketched below.
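A brief sketch of that step, assuming the same task.spfs slot as in the classification example (the object names bestRegMod and predictions are hypothetical):

# Best performing model for the regression task
bestRegMod <- getBestModel(regSPSA)

# Predictions on the task restricted to the selected features
predictions <- predict(bestRegMod, task = regSPSA$task.spfs)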
Leveraging the mlr
package, the spFSR
package implements the SPSA-FSR Algorithm for feature selection. Given a
wrapper, the spFSR
package can determine a subset of
features which predicts the response variable while optimising a
specified performance measure.