Performs feature selection on positioned n-gram data using Fisher's permutation test.
test_features(target, features, criterion = "ig", adjust = "BH", threshold = 1, quick = TRUE, times = 1e+05)
target | |
---|---|
features | |
criterion | criterion used in the permutation test. See Details for the list of possible criteria. |
adjust | name of p-value adjustment method. See |
threshold | |
quick | |
times | number of times procedure should be repeated. Ignored if |
An object of class feature_test.
Since the procedure involves multiple testing, it is advisable to use one of the available p-value adjustment methods. Such methods can be used directly by specifying the adjust parameter.
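To illustrate why adjustment matters, the sketch below applies base R's p.adjust to a handful of made-up p-values. It assumes that adjust accepts the method names understood by p.adjust (the default "BH" is one of them); the numbers are purely illustrative and do not come from test_features.

```r
# Hypothetical raw p-values for five features (illustrative numbers only).
p_raw <- c(0.001, 0.012, 0.04, 0.2, 0.75)

# Benjamini-Hochberg adjustment, the method named by adjust = "BH".
p_bh <- p.adjust(p_raw, method = "BH")

# Indices of features below a 0.05 cut-off before and after adjustment.
which(p_raw < 0.05)
which(p_bh < 0.05)
```

With these values the third feature passes the cut-off on its raw p-value but no longer passes once the adjustment is applied.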
Available criteria:
Information Gain: calc_ig.
Kullback-Leibler divergence: calc_kl.
Chi-squared-based measure: calc_cs.
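As a rough illustration of how a criterion feeds into a permutation test, the sketch below computes information gain for binary vectors in base R and estimates a p-value by shuffling the target. The helpers entropy, info_gain and perm_pvalue are hypothetical names written for this example; they are not the package's calc_ig and do not reproduce QuiPT.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# Information gain of a binary feature with respect to a binary target:
# H(target) - H(target | feature).
info_gain <- function(target, feature) {
  cond <- 0
  for (v in unique(feature)) {
    idx <- feature == v
    cond <- cond + mean(idx) * entropy(target[idx])
  }
  entropy(target) - cond
}

# Plain permutation test: shuffle the target `times` times and report the
# fraction of shuffles whose criterion reaches the observed value.
perm_pvalue <- function(target, feature, times = 1000) {
  observed <- info_gain(target, feature)
  permuted <- replicate(times, info_gain(sample(target), feature))
  mean(permuted >= observed)
}

set.seed(1)
target  <- rep(c(1, 0), each = 50)
related <- target                 # feature identical to the target
noise   <- rbinom(100, 1, 0.5)    # feature generated independently of the target

info_gain(target, related)
perm_pvalue(target, related, times = 200)
perm_pvalue(target, noise, times = 200)
```

A feature that mirrors the target reaches the maximum information gain of one bit for a balanced target, so almost no shuffle can match it and its estimated p-value is close to zero.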
Both target and features must be binary, i.e. contain only 0 and 1 values.
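A quick way to check this requirement before calling the function is sketched below. The helper is_binary is a hypothetical name, and the presence/absence conversion is a base R stand-in shown for illustration, not the package's binarize.

```r
# TRUE when x contains only 0 and 1 (missing values make it FALSE).
is_binary <- function(x) all(x %in% c(0, 1))

# A small matrix of n-gram counts (illustrative values).
counts <- matrix(c(0, 2, 1,
                   3, 0, 1), nrow = 2, byrow = TRUE)
is_binary(counts)        # FALSE

# Turn counts into presence/absence indicators.
presence <- (counts > 0) * 1
is_binary(presence)      # TRUE
```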
Features that occur too often or too rarely are considered uninformative and may be removed using the threshold parameter.
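The sketch below shows one way such a filter could work on a binary feature matrix. The exact rule applied by test_features is not spelled out here, so the rule in the code (keep a feature when its number of occurrences lies between threshold and the number of rows minus threshold) is an assumption, and filter_by_occurrence is a hypothetical helper.

```r
# Hypothetical helper: keep columns of a binary matrix whose number of 1s lies
# between `threshold` and nrow(features) - `threshold` (assumed rule).
filter_by_occurrence <- function(features, threshold = 1) {
  occurrences <- colSums(features)
  keep <- occurrences >= threshold & occurrences <= nrow(features) - threshold
  features[, keep, drop = FALSE]
}

features <- cbind(never  = c(0, 0, 0, 0),
                  always = c(1, 1, 1, 1),
                  mixed  = c(1, 0, 1, 0))

# Only the column that is neither constant 0 nor constant 1 survives.
colnames(filter_by_occurrence(features, threshold = 1))
```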
Radivojac P, Obradovic Z, Dunker AK, Vucetic S, Feature selection filters based on the permutation test in Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Springer, 2004.
binarize - binarizes input data.
calc_criterion - computes selected criterion.
distr_crit - distribution of criterion used in QuiPT.
summary.feature_test - summary of results.
cut.feature_test - aggregates test results in groups based on feature's p-value.
# significant feature
tar_feat1 <- create_feature_target(10, 390, 0, 600)
# significant feature
tar_feat2 <- create_feature_target(9, 391, 1, 599)
# insignificant feature
tar_feat3 <- create_feature_target(198, 202, 300, 300)
test_res <- test_features(tar_feat1[, 1],
                          cbind(tar_feat1[, 2], tar_feat2[, 2], tar_feat3[, 2]))
summary(test_res)
#> Total number of features: 3
#> Number of significant features: 2
#> Criterion used: Information Gain
#> Feature test: QuiPT
#> p-values adjustment method: BH
cut(test_res)
#> $`[0,0.0001]`
#> character(0)
#>
#> $`(0.0001,0.01]`
#> [1] "feature1" "feature2"
#>
#> $`(0.01,0.05]`
#> character(0)
#>
#> $`(0.05,1]`
#> [1] "feature3"
#>

# real data example
# we will analyze only a subsample of a dataset to make analysis quicker
ids <- c(1L:100, 701L:800)
deg_seqs <- degenerate(human_cleave[ids, 1L:9],
                       list(`a` = c(1, 6, 8, 10, 11, 18),
                            `b` = c(2, 5, 13, 14, 16, 17, 19, 20),
                            `c` = c(3, 4, 7, 9, 12, 15)))

# positioned n-grams example
bigrams_pos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = TRUE)
test_features(human_cleave[ids, 10], bigrams_pos)
#>      1_a.a_0      2_a.a_0      3_a.a_0      4_a.a_0      5_a.a_0      6_a.a_0
#> 5.314503e-02 5.963485e-05 3.800925e-04 2.863615e-03 1.078651e-03 8.567134e-01
#>      7_a.a_0      8_a.a_0      1_b.a_0      2_b.a_0      3_b.a_0      4_b.a_0
#> 1.087544e-01 8.606368e-01 2.507408e-01 1.042950e-02 7.326150e-01 1.876294e-02
#>      5_b.a_0      6_b.a_0      7_b.a_0      8_b.a_0      1_c.a_0      2_c.a_0
#> 8.618417e-01 1.000000e+00 1.000000e+00 5.799196e-01 1.393980e-01 9.872458e-01
#>      3_c.a_0      4_c.a_0      5_c.a_0      6_c.a_0      7_c.a_0      8_c.a_0
#> 2.507408e-01 7.615584e-01 7.697187e-02 5.882914e-01 1.000000e+00 1.000000e+00
#>      1_a.b_0      2_a.b_0      3_a.b_0      4_a.b_0      5_a.b_0      6_a.b_0
#> 7.326150e-01 7.863582e-01 2.507408e-01 1.000000e+00 2.027335e-01 1.676915e-01
#>      7_a.b_0      8_a.b_0      1_b.b_0      2_b.b_0      3_b.b_0      4_b.b_0
#> 6.290592e-01 8.852186e-01 1.000000e+00 1.296950e-01 3.837051e-01 6.924186e-03
#>      5_b.b_0      6_b.b_0      7_b.b_0      8_b.b_0      1_c.b_0      2_c.b_0
#> 7.697187e-02 2.802138e-02 1.000000e+00 1.000000e+00 6.656136e-01 7.863582e-01
#>      3_c.b_0      4_c.b_0      5_c.b_0      6_c.b_0      7_c.b_0      8_c.b_0
#> 1.932067e-03 1.000000e+00 1.078651e-03 1.245531e-01 1.847110e-01 1.000000e+00
#>      1_a.c_0      2_a.c_0      3_a.c_0      4_a.c_0      5_a.c_0      6_a.c_0
#> 2.751687e-01 1.024237e-02 1.087544e-01 1.676915e-01 1.676915e-01 7.326150e-01
#>      7_a.c_0      8_a.c_0      1_b.c_0      2_b.c_0      3_b.c_0      4_b.c_0
#> 6.812614e-01 3.279697e-01 1.000000e+00 1.142154e-01 4.090253e-01 1.078651e-03
#>      5_b.c_0      6_b.c_0      7_b.c_0      8_b.c_0      1_c.c_0      2_c.c_0
#> 1.000000e+00 6.812614e-01 4.766643e-01 6.112035e-01 2.335240e-01 3.473833e-02
#>      3_c.c_0      4_c.c_0      5_c.c_0      6_c.c_0      7_c.c_0      8_c.c_0
#> 4.635896e-02 3.473833e-02 1.087544e-01 1.000000e+00 9.442481e-01 1.000000e+00

# unpositioned n-grams example, binarization required
bigrams_notpos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = FALSE)
test_features(human_cleave[ids, 10], binarize(bigrams_notpos))