Performs feature selection on positioned n-gram data using Fisher's permutation test.

test_features(
  target,
  features,
  criterion = "ig",
  adjust = "BH",
  threshold = 1,
  quick = TRUE,
  times = 1e+05
)

Arguments

target

integer vector with target information (e.g. class labels).

features

integer matrix of features with number of rows equal to the length of the target vector.

criterion

criterion used in the permutation test. See Details for the list of possible criteria.

adjust

name of p-value adjustment method. See p.adjust for the list of possible values. If NULL, p-values are not adjusted.

threshold

integer. Features that occur fewer than threshold times or more than nrow(features)-threshold times are discarded from the permutation test.

quick

logical. If TRUE, the Quick Permutation Test (QuiPT) is used. If FALSE, a normal permutation test is performed.

times

number of times the permutation procedure should be repeated. Ignored if quick is TRUE.
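To illustrate what quick = FALSE and times control, the following is a minimal base R sketch of a plain permutation test for one binary feature. It is not the package's implementation: perm_pvalue_sketch and mean_diff are hypothetical names, mean_diff is a simple stand-in rather than one of the package's criteria, and the p-value is taken as the plain proportion of permuted values at least as large as the observed one.

```r
# Hypothetical stand-in criterion: absolute difference in the share of
# positive targets between rows with and without the feature.
# Assumes the feature takes both values 0 and 1.
mean_diff <- function(target, feature) {
  abs(mean(target[feature == 1]) - mean(target[feature == 0]))
}

# Sketch of a permutation test: shuffle the target `times` times and
# count how often the permuted criterion reaches the observed one.
perm_pvalue_sketch <- function(target, feature, crit_fun, times = 1000) {
  observed <- crit_fun(target, feature)
  permuted <- replicate(times, crit_fun(sample(target), feature))
  # small tolerance guards against floating-point ties
  mean(permuted >= observed - 1e-12)
}

set.seed(42)
target <- rep(c(0L, 1L), each = 50)
feat_assoc <- target                       # perfectly associated with target
feat_noise <- rep(c(0L, 1L), times = 50)   # equally frequent in both classes

p_assoc <- perm_pvalue_sketch(target, feat_assoc, mean_diff, times = 200)
p_noise <- perm_pvalue_sketch(target, feat_noise, mean_diff, times = 200)
```

A larger times gives a finer resolution of the estimated p-value at the cost of run time, which is the trade-off that QuiPT is meant to avoid.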

Value

an object of class feature_test.

Details

Since the procedure involves multiple testing, it is advisable to use one of the available p-value adjustment methods. Such methods can be used directly by specifying the adjust parameter.
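The effect of the adjustment can be previewed with p.adjust from the stats package on its own. The sketch below uses made-up p-values and a hypothetical helper, adjust_sketch, that mirrors the documented behaviour of adjust (no adjustment when it is NULL); it does not call test_features.

```r
# Made-up raw p-values, for illustration only
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Hypothetical helper: return p-values unchanged when adjust is NULL,
# otherwise pass the method name on to stats::p.adjust
adjust_sketch <- function(p, adjust = "BH") {
  if (is.null(adjust)) p else p.adjust(p, method = adjust)
}

p_bh   <- adjust_sketch(p_raw, "BH")   # Benjamini-Hochberg adjustment
p_none <- adjust_sketch(p_raw, NULL)   # left as they are
```

Adjusted p-values are never smaller than the raw ones, so fewer features pass a fixed significance level after adjustment.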

Available criteria:

ig

Information Gain: calc_ig.

kl

Kullback-Leibler divergence: calc_kl.

cs

Chi-squared-based measure: calc_cs.
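As a rough illustration of the first criterion, information gain for a binary target and a binary feature can be written in base R as the entropy of the target minus its conditional entropy given the feature. The names entropy_bits and info_gain_sketch are hypothetical, the base-2 logarithm is an assumption, and the result is not guaranteed to match calc_ig.

```r
# Shannon entropy (in bits) of a discrete vector
entropy_bits <- function(x) {
  p <- table(x) / length(x)   # only observed values, so every p > 0
  -sum(p * log2(p))
}

# Information gain: H(target) - H(target | feature)
info_gain_sketch <- function(target, feature) {
  groups <- split(target, feature)
  cond_entropy <- sum(vapply(
    groups,
    function(g) length(g) / length(target) * entropy_bits(g),
    numeric(1)
  ))
  entropy_bits(target) - cond_entropy
}

target <- c(0, 0, 1, 1)
ig_same  <- info_gain_sketch(target, c(0, 0, 1, 1))  # feature equals target
ig_indep <- info_gain_sketch(target, c(0, 1, 0, 1))  # feature unrelated to target
```

A feature that determines the target recovers the full entropy of the target, while a feature that is balanced within each class yields no gain.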

Note

Both target and features must be binary, i.e. contain only 0 and 1 values.

Features occurring too often or too rarely are considered uninformative and may be removed using the threshold parameter.
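The filtering rule can be sketched in base R as follows. This is an illustration of the rule as worded in the description of threshold, not the package's code: keep_features_sketch is a hypothetical name, and treating counts exactly equal to the bounds as retained is an assumption.

```r
# Toy binary feature matrix with six observations
features <- cbind(
  never  = c(0, 0, 0, 0, 0, 0),
  once   = c(1, 0, 0, 0, 0, 0),
  half   = c(1, 1, 1, 0, 0, 0),
  always = c(1, 1, 1, 1, 1, 1)
)

# Hypothetical helper: TRUE for features that would stay in the test
keep_features_sketch <- function(features, threshold = 1) {
  occurrences <- colSums(features)
  occurrences >= threshold & occurrences <= nrow(features) - threshold
}

keep_default <- keep_features_sketch(features, threshold = 1)
keep_strict  <- keep_features_sketch(features, threshold = 2)
```

With the default threshold only features that never occur or always occur are dropped. Raising the threshold also removes features that are nearly constant.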

References

Radivojac P, Obradovic Z, Dunker AK, Vucetic S. Feature selection filters based on the permutation test. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning. Springer, 2004.

See also

binarize - binarizes input data.

calc_criterion - computes selected criterion.

distr_crit - distribution of criterion used in QuiPT.

summary.feature_test - summary of results.

cut.feature_test - aggregates test results into groups based on the features' p-values.

Examples

# significant feature
tar_feat1 <- create_feature_target(10, 390, 0, 600)
# significant feature
tar_feat2 <- create_feature_target(9, 391, 1, 599)
# insignificant feature
tar_feat3 <- create_feature_target(198, 202, 300, 300)
test_res <- test_features(tar_feat1[, 1], cbind(tar_feat1[, 2], tar_feat2[, 2],
                                                tar_feat3[, 2]))
summary(test_res)
#> Total number of features: 3
#> Number of significant features: 2
#> Criterion used: Information Gain
#> Feature test: QuiPT
#> p-values adjustment method: BH
cut(test_res)
#> $`[0,0.0001]`
#> character(0)
#>
#> $`(0.0001,0.01]`
#> [1] "feature1" "feature2"
#>
#> $`(0.01,0.05]`
#> character(0)
#>
#> $`(0.05,1]`
#> [1] "feature3"
#>
# real data example
# we will analyze only a subsample of a dataset to make analysis quicker
ids <- c(1L:100, 701L:800)
deg_seqs <- degenerate(human_cleave[ids, 1L:9],
                       list(`a` = c(1, 6, 8, 10, 11, 18),
                            `b` = c(2, 5, 13, 14, 16, 17, 19, 20),
                            `c` = c(3, 4, 7, 9, 12, 15)))
# positioned n-grams example
bigrams_pos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = TRUE)
test_features(human_cleave[ids, 10], bigrams_pos)
#>      1_a.a_0      2_a.a_0      3_a.a_0      4_a.a_0      5_a.a_0      6_a.a_0
#> 5.314503e-02 5.963485e-05 3.800925e-04 2.863615e-03 1.078651e-03 8.567134e-01
#>      7_a.a_0      8_a.a_0      1_b.a_0      2_b.a_0      3_b.a_0      4_b.a_0
#> 1.087544e-01 8.606368e-01 2.507408e-01 1.042950e-02 7.326150e-01 1.876294e-02
#>      5_b.a_0      6_b.a_0      7_b.a_0      8_b.a_0      1_c.a_0      2_c.a_0
#> 8.618417e-01 1.000000e+00 1.000000e+00 5.799196e-01 1.393980e-01 9.872458e-01
#>      3_c.a_0      4_c.a_0      5_c.a_0      6_c.a_0      7_c.a_0      8_c.a_0
#> 2.507408e-01 7.615584e-01 7.697187e-02 5.882914e-01 1.000000e+00 1.000000e+00
#>      1_a.b_0      2_a.b_0      3_a.b_0      4_a.b_0      5_a.b_0      6_a.b_0
#> 7.326150e-01 7.863582e-01 2.507408e-01 1.000000e+00 2.027335e-01 1.676915e-01
#>      7_a.b_0      8_a.b_0      1_b.b_0      2_b.b_0      3_b.b_0      4_b.b_0
#> 6.290592e-01 8.852186e-01 1.000000e+00 1.296950e-01 3.837051e-01 6.924186e-03
#>      5_b.b_0      6_b.b_0      7_b.b_0      8_b.b_0      1_c.b_0      2_c.b_0
#> 7.697187e-02 2.802138e-02 1.000000e+00 1.000000e+00 6.656136e-01 7.863582e-01
#>      3_c.b_0      4_c.b_0      5_c.b_0      6_c.b_0      7_c.b_0      8_c.b_0
#> 1.932067e-03 1.000000e+00 1.078651e-03 1.245531e-01 1.847110e-01 1.000000e+00
#>      1_a.c_0      2_a.c_0      3_a.c_0      4_a.c_0      5_a.c_0      6_a.c_0
#> 2.751687e-01 1.024237e-02 1.087544e-01 1.676915e-01 1.676915e-01 7.326150e-01
#>      7_a.c_0      8_a.c_0      1_b.c_0      2_b.c_0      3_b.c_0      4_b.c_0
#> 6.812614e-01 3.279697e-01 1.000000e+00 1.142154e-01 4.090253e-01 1.078651e-03
#>      5_b.c_0      6_b.c_0      7_b.c_0      8_b.c_0      1_c.c_0      2_c.c_0
#> 1.000000e+00 6.812614e-01 4.766643e-01 6.112035e-01 2.335240e-01 3.473833e-02
#>      3_c.c_0      4_c.c_0      5_c.c_0      6_c.c_0      7_c.c_0      8_c.c_0
#> 4.635896e-02 3.473833e-02 1.087544e-01 1.000000e+00 9.442481e-01 1.000000e+00
# unpositioned n-grams example, binarization required
# note: the call below passes pos = TRUE, so the output shown matches the
# positioned example; pass pos = FALSE to count unpositioned n-grams
bigrams_notpos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = TRUE)
test_features(human_cleave[ids, 10], binarize(bigrams_notpos))
#>      1_a.a_0      2_a.a_0      3_a.a_0      4_a.a_0      5_a.a_0      6_a.a_0
#> 5.314503e-02 5.963485e-05 3.800925e-04 2.863615e-03 1.078651e-03 8.567134e-01
#>      7_a.a_0      8_a.a_0      1_b.a_0      2_b.a_0      3_b.a_0      4_b.a_0
#> 1.087544e-01 8.606368e-01 2.507408e-01 1.042950e-02 7.326150e-01 1.876294e-02
#>      5_b.a_0      6_b.a_0      7_b.a_0      8_b.a_0      1_c.a_0      2_c.a_0
#> 8.618417e-01 1.000000e+00 1.000000e+00 5.799196e-01 1.393980e-01 9.872458e-01
#>      3_c.a_0      4_c.a_0      5_c.a_0      6_c.a_0      7_c.a_0      8_c.a_0
#> 2.507408e-01 7.615584e-01 7.697187e-02 5.882914e-01 1.000000e+00 1.000000e+00
#>      1_a.b_0      2_a.b_0      3_a.b_0      4_a.b_0      5_a.b_0      6_a.b_0
#> 7.326150e-01 7.863582e-01 2.507408e-01 1.000000e+00 2.027335e-01 1.676915e-01
#>      7_a.b_0      8_a.b_0      1_b.b_0      2_b.b_0      3_b.b_0      4_b.b_0
#> 6.290592e-01 8.852186e-01 1.000000e+00 1.296950e-01 3.837051e-01 6.924186e-03
#>      5_b.b_0      6_b.b_0      7_b.b_0      8_b.b_0      1_c.b_0      2_c.b_0
#> 7.697187e-02 2.802138e-02 1.000000e+00 1.000000e+00 6.656136e-01 7.863582e-01
#>      3_c.b_0      4_c.b_0      5_c.b_0      6_c.b_0      7_c.b_0      8_c.b_0
#> 1.932067e-03 1.000000e+00 1.078651e-03 1.245531e-01 1.847110e-01 1.000000e+00
#>      1_a.c_0      2_a.c_0      3_a.c_0      4_a.c_0      5_a.c_0      6_a.c_0
#> 2.751687e-01 1.024237e-02 1.087544e-01 1.676915e-01 1.676915e-01 7.326150e-01
#>      7_a.c_0      8_a.c_0      1_b.c_0      2_b.c_0      3_b.c_0      4_b.c_0
#> 6.812614e-01 3.279697e-01 1.000000e+00 1.142154e-01 4.090253e-01 1.078651e-03
#>      5_b.c_0      6_b.c_0      7_b.c_0      8_b.c_0      1_c.c_0      2_c.c_0
#> 1.000000e+00 6.812614e-01 4.766643e-01 6.112035e-01 2.335240e-01 3.473833e-02
#>      3_c.c_0      4_c.c_0      5_c.c_0      6_c.c_0      7_c.c_0      8_c.c_0
#> 4.635896e-02 3.473833e-02 1.087544e-01 1.000000e+00 9.442481e-01 1.000000e+00