"- test the variability of the ranking across train-test splits\n",
"- change the granularity of the output ranking with `n_features_to_select` and `step`\n",
"- modify the scaling for the input generation to see how the ranking changes."
]
},
{
"cell_type": "markdown",
"id": "c26064a9-17a9-44d6-9c66-b85676c94fc8",
"metadata": {},
"source": [
"## Back to general case and combining all options for the classification"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc559465-2c51-4e62-9b40-d683b91f1eac",
"metadata": {},
"outputs": [],
"source": [
"# generate inputs and labels\n",
"m = 50 # input dimensionality\n",
"s0 = 100 # number of samples for class 0\n",
"s1 = 100 # number of samples for class 1\n",
"X, y = gen_inputs(m, s0, s1, scaling=0.5)"
]
},
{
"cell_type": "markdown",
"id": "6febb8bf-a89c-4c1d-8383-feffd43f98c2",
"metadata": {},
"source": [
"Here the inputs are modified compared to the previous notebooks, with a gradient of contrast that increases with the feature index $j$."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "edb43b31-62f0-4d0f-98d8-ac3e3c062405",
"metadata": {},
"metadata": {},
"outputs": [],
"outputs": [],
"source": [
"source": [
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0024c6aa",
   "metadata": {},
   "source": [
"## Parameterization of classifier\n",
"\n",
"We then build a pipeline for the classifier. The outer cross-validation corresponds to the train-test splitting as before.\n",
"We then build a pipeline for the classifier. The outer cross-validation corresponds to the train-test splitting as before.\n",
"\n",
"\n",
"The inner crosss-validation corresponds to the optimization of the hyperparameter `C` of the classifier (logistic regression in the pipeline)."
"The inner crosss-validation corresponds to the optimization of the hyperparameter `C` of the classifier (logistic regression in the pipeline)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"# hyperparameter for regularization\n",
"# hyperparameter for regularization\n",
"Cs = [0.01,0.1,1.0,10.0,100.0]\n",
"Cs = [0.01,0.1,1.0,10.0,100.0]\n",
"\n",
"\n",
"# classifier in pipeline and wrapper for RFE\n",
"# classifier in pipeline\n",
"clf = Pipeline([('scl',StandardScaler()),\n",
"clf = Pipeline([('scl',StandardScaler()),\n",
" ('mlr',LogisticRegression())])\n",
" ('mlr',LogisticRegression())])\n",
"\n",
"\n",
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"# check names of parameters for pipeline (for grid search)\n",
"print(clf.get_params())\n",
"\n",
"# quick fix to get coefficients from mlr estimator in pipeline\n",
"# quick fix to get coefficients from mlr estimator in pipeline\n",
"def get_coef(clf_pipeline):\n",
"def get_coef(clf_pipeline):\n",
" return clf_pipeline['mlr'].coef_"
" return clf_pipeline['mlr'].coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3f096d7",
   "metadata": {},
   "source": [
"## Optimization involving the tuning of hyperparameter\n",
"\n",
"We use `GridSearchCV` to optimize the hyperparameter, the use the best classifier pipeline on the test set and perform recursive feature elimination (RFE) to identify informative features that contribute to the correct classification. The latter gives a ranking where low ranks correspond to informative features."
"We use `GridSearchCV` to optimize the hyperparameter, the use the best classifier pipeline on the test set and perform recursive feature elimination (RFE) to identify informative features that contribute to the correct classification. The latter gives a ranking where low ranks correspond to informative features."
"The ranking estimated from classification can be compared to the \"ground truth\", which is in theory the order of input indices. Note that the randomness in the generation of the inputs adds some noise that affects the precision of the estimated ranking."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ee2f8f7-3f55-41bd-b991-fefed23fe20f",
"metadata": {},
"metadata": {},
"outputs": [],
"source": [
"source": [
"Modify the scaling for the input generation to see how the ranking changes."
# check names of parameters for pipeline (for grid search)
print(clf.get_params())
# quick fix to get coefficients from mlr estimator in pipeline
# quick fix to get coefficients from mlr estimator in pipeline
defget_coef(clf_pipeline):
defget_coef(clf_pipeline):
returnclf_pipeline['mlr'].coef_
returnclf_pipeline['mlr'].coef_
```
```
%% Cell type:markdown id:e3f096d7 tags:
%% Cell type:markdown id:e3f096d7 tags:
## Optimization involving the tuning of hyperparameter
We use `GridSearchCV` to optimize the hyperparameter, the use the best classifier pipeline on the test set and perform recursive feature elimination (RFE) to identify informative features that contribute to the correct classification. The latter gives a ranking where low ranks correspond to informative features.
We use `GridSearchCV` to optimize the hyperparameter, the use the best classifier pipeline on the test set and perform recursive feature elimination (RFE) to identify informative features that contribute to the correct classification. The latter gives a ranking where low ranks correspond to informative features.
Recall that the inputs are geneterated such that "the shift across classes linearly depends on the feature index". This means that inputs with large index should have low (informative) ranking.
Recall that the inputs are geneterated such that "the shift across classes linearly depends on the feature index". This means that inputs with large index should have low (informative) ranking.
Modify the scaling for the input generation to see how the ranking changes.
The ranking estimated from classification can be compared to the "ground truth", which is in theory the order of input indices. Note that the randomness in the generation of the inputs adds some noise that affects the precision of the estimated ranking.
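  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of this comparison (using the `rankings` array built in the hypothetical cell above), we can plot the mean rank over the train-test splits against the feature index: it should decrease with $j$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# mean estimated rank per feature, averaged over the train-test splits\n",
    "mean_rank = rankings.mean(axis=0)\n",
    "\n",
    "plt.figure()\n",
    "plt.plot(np.arange(m), mean_rank)\n",
    "plt.xlabel('feature index $j$')\n",
    "plt.ylabel('mean RFE rank (low = informative)')\n",
    "plt.show()"
   ]
  },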