Next, we discuss different approaches to quantify the robustness of feature selection or ranking algorithms by (1) a conventional analysis and (2) a visual-based study.
2.4.1. Conventional analysis
To study the stability of feature ranking or selection techniques, several metrics have been proposed.
Consider $r$ and $r'$, the outputs of a feature ranking technique applied to two subsamples of $D$. The most widely used metric to measure the similarity between two ranking lists is Spearman's rank correlation coefficient (SR). The SR between two ranked lists $r$ and $r'$ is defined by
$$ SR(r, r') = 1 - \frac{6 \sum_{i=1}^{p} (r_i - r'_i)^2}{p (p^2 - 1)} $$
where $r_i$ is the rank of feature $i$ in $r$, $r'_i$ is its rank in $r'$, and $p$ is the number of features. SR values range from −1 to 1: the value is 1 when the rankings are identical, 0 when there is no correlation, and −1 when one ranking is the exact reverse of the other.
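As a minimal sketch of the rank correlation described above, the function below assumes the standard tie-free form of Spearman's coefficient, 1 − 6 Σ d_i² / (p(p² − 1)), where d_i is the difference between the two ranks of feature i and p is the number of features. The function name and the example rankings are hypothetical and are not taken from the study.

```python
def spearman_rank_correlation(r, r_prime):
    """Spearman's rank correlation between two rankings of the same p features.

    r[i] and r_prime[i] are the ranks (1..p, no ties assumed) that two runs of
    a feature ranker assigned to feature i.
    """
    if len(r) != len(r_prime):
        raise ValueError("both rankings must cover the same features")
    p = len(r)
    if p < 2:
        raise ValueError("at least two features are needed")
    # Sum of squared rank differences over all features.
    squared_diffs = sum((a - b) ** 2 for a, b in zip(r, r_prime))
    return 1.0 - 6.0 * squared_diffs / (p * (p ** 2 - 1))


# Hypothetical rankings of five features obtained on two subsamples.
identical = spearman_rank_correlation([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
reversed_order = spearman_rank_correlation([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
one_swap = spearman_rank_correlation([1, 2, 3, 4, 5], [2, 1, 3, 4, 5])
print(identical, reversed_order, one_swap)
```

When rankings contain ties, this closed form no longer applies exactly; a library routine such as `scipy.stats.spearmanr` is a more robust choice in that case.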
To measure the distance between two top-$k$ lists $s$ and $s'$ containing the $k$ most relevant features, several metrics have been proposed. In this work we use the Jaccard stability index (JI), which can be defined as
$$ JI(s, s') = \frac{|s \cap s'|}{|s \cup s'|} = \frac{r}{l + r} \tag{4} $$
where $s$ and $s'$ are the two feature subsets, $r$ is the number of features common to both lists, and $l$ is the number of features that appear in only one of the two lists. The JI lies in the range [0, 1].
The stability for a set of rankings or lists
When it comes to evaluating the stability of a feature selection (or ranking) algorithm that provides several results A, the usual approach is to compute pairwise similarities and average the results, which leads to a single scalar value. The similarity can be Spearman's rank correlation coefficient, the Jaccard stability index [22,30] or Kuncheva's stability index, for example.
2.4.2. Visual based stability analysis
The outcome of a feature ranking algorithm can be interpreted as a point in a high-dimensional space (with $p$ dimensions). The stability of a ranking feature selector is commonly measured as the similarity or distance between different outcomes of the same feature selector on slightly different datasets. As mentioned, stability is assessed by computing pairwise similarities between points in that high-dimensional space and averaging the results. In this case, the ranking data is turned into a single number [...]

[...] of environmental exposures and their interaction with genetic factors in common tumors in Spain (prostate, breast, colorectal, gastroesophageal and chronic lymphocytic leukemia). All participants signed an informed consent. Approval for the study was obtained from the ethical review boards of all recruiting centers. Instances with missing values have been removed, leading to a dataset with 3295 instances: 2230 are controls, while the other [...] genetic variables (single nucleotide polymorphisms, SNPs), 48 environmental factors including red meat, vegetable consumption, BMI, physical activity and alcohol consumption, and 5 variables regarding family history of CRC, sex, age, level of education and race.
The variables considered in this study are listed next.
• Environmental factors: physical activity, BMI, alcohol consumption, smoking. Dietary factors: consumption of vegetable, red meat, legume, fruit, cereals, fish, dairy, oil, calcium, carotenoids, cholesterol, edible, total energy, ethanol in the past decade, ethanol in the present, monounsaturated fats, polyunsaturated fats, saturated fats, total fats, folic acid, glucids, total intake in grams, iron, magnesium, niacin, phosphorus, potassium, fiber, animal protein, vegetable protein, total protein, retinoids,