"We generate synthetic data with 2 classes to separate (`s0` and `s1` samples, respectively). The input dimensionality corresponds `m` features. See the notebooks on supervised learning for details."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4673d31d",
"metadata": {},
"outputs": [],
"source": [
"# create synthetic dataset where 2 classes of s0+s1 samples of m-dimensional inputs with controlled contrast\n",
"def gen_inputs(m, # input dimensionality\n",
" s0, # number of samples for class 0\n",
" s1, # number of samples for class 1\n",
" scaling, # scaling factor to separate classes\n",
" type): # type of scaling (0=bimodal or 1=linear)\n",
"\n",
" # labels\n",
" lbl = np.zeros([s0+s1], dtype=int)\n",
" # inputs\n",
" X = np.zeros([s0+s1,m])\n",
"\n",
" # create s0 and s1 samples for the 2 classes\n",
" for i in range(s0+s1):\n",
" # label\n",
" lbl[i] = int(i<s0)\n",
" # inputs are random noise plus a shift\n",
" for j in range(m):\n",
" # positive/negative shift for 1st/2nd class, only for indices j larger than m/2\n",
" if type==0:\n",
" if j>=m/2:\n",
" if i<s0:\n",
" a = -scaling\n",
" else:\n",
" a = scaling\n",
" else:\n",
" a = 0.0\n",
" else:\n",
" if i<s0:\n",
" a = -j / m * scaling\n",
" else:\n",
" a = j / m * scaling\n",
" # the shift linearly depends on the feature index j\n",
" X[i,j] = a + np.random.randn()\n",
" \n",
" return X, lbl"
]
},
{
"cell_type": "markdown",
"id": "2052e9ec-c51e-429a-8ec0-d67eb9ec24ed",
"metadata": {},
"source": [
"For features with indices smaller than $m/2$, there is no contrast between the two classes (in red and blue). For indices larger than $m/2$, the contrast is determined by the scaling parameters (separation between the two disributions). Each feature involve independent noise in the generation of the samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "970d0a6e-d894-45f5-a1a0-5752ee222688",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# input properties\n",
"m = 1 # input dimensionality (try 1 and 10)\n",
"s0 = 50 # number of samples for class 0\n",
"s1 = 50 # number of samples for class 1\n",
"scaling = 0.5 # class contrast\n",
"\n",
"# generate inputs\n",
"X, y = gen_inputs(m, s0, s1, scaling, 0)"
]
},
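{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration (a sketch, not the notebook's original plotting code), the generated inputs can be displayed as a (sample, feature) matrix, assuming `matplotlib` is available; the dashed line marks the boundary between the first `s0` samples and the remaining `s1` samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch: display the (sample, feature) matrix with the class boundary\n",
"import matplotlib.pyplot as plt\n",
"\n",
"plt.figure(figsize=(6, 4))\n",
"plt.imshow(X, aspect='auto', cmap='coolwarm', interpolation='nearest')\n",
"plt.colorbar(label='input value')\n",
"plt.axhline(s0 - 0.5, color='k', linestyle='--')  # boundary between the two classes\n",
"plt.xlabel('feature index')\n",
"plt.ylabel('sample index')\n",
"plt.title('synthetic inputs (m = %d features)' % m)\n",
"plt.show()"
]
},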
{
"cell_type": "markdown",
"id": "a2b44140-a4fb-4f9c-9fd1-ff3934a61c3c",
"metadata": {},
"source": [
"We see that the left half of vertical columns in the matrix (sample, feature) have similar values for the red and blue classes (top and bottom half of rows, respectively). However, the right columns show differentiated values for the two classes: lower values for the red class than for the blue class. Importantly, note that these data involve noise: for each sample, not the same input feature (among high indices) allows for a clear prediction of the class of the sample."
"We can build a linear model that take all features together. To each feature corresponds a regression coefficient that evaluates its contribution in a multivariate context. Note that this is equivalent to predicting the class index from all features together."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3fb8fe3c-7ff4-4e33-b565-113bdd97cb9c",
"metadata": {},
"outputs": [],
"source": [
"# create a dataframe with predicted variable as class index and all features\n",
"The goal of classification is to extract the relevant information to perform the classification. Here we use cross-validationwith 80%-20% for training-testing."
"When plotting the accuracy results, we see that the classifier performs very well on the training set as illustrated by the accuracy in brown. In comparison the accuracy on the testing set in orange is lower, but still way above the chance level in gray (for the scaling factor of $0.7$ used in generating the samples). The comparison between these three accuracy distributions (because we use several repetitions of the splitting scheme) provides a quantitative manner to understand whether the classifier training is successful. A more complete check-up would consider the evolution of these accuracies during the iterative training of the classifier.\n",
"\n",
"Here we have two classes with the same number of samples in each, so the chance-level is in theory 50%. The point of evaluating the chance level from the data is to quantify the variability of the corresponding accuracy."
"The `weights` dataframe has to the trained weights for the classifier in the two conditions ('train' and 'shuf'), together with the index of the feature. We can plot them to see which features are used by the trained classifier, indicating informative features.\n",
"\n",
"We indeed see large weights (in absolute value) for high features indices and small weights for low feature indices for the trained classifier. In contrast, the classifier trained with shuffled labels has all weights close to zero, indicating an absence of specialization."