Skip to content
Snippets Groups Projects
Commit f2469e7c authored by GILSON Matthieu's avatar GILSON Matthieu
Browse files

add notebook comparison stats ml

parent ad354f79
No related branches found
No related tags found
No related merge requests found
%% Cell type:markdown id:00cde4c7 tags:
This notebook compares statistical testing and classification on synthetic data.
%% Cell type:code id:8c9c8432 tags:
``` python
# import librairies
import numpy as np
import scipy.signal as spsg
import pandas as pd
import scipy.stats as stt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sb
font = {'family' : 'DejaVu Sans',
'weight' : 'regular',
'size' : 18}
mpl.rc('font', **font)
```
%% Cell type:markdown id:0d12a052 tags:
## Generate synthetic data to classify
We generate synthetic data with 2 classes to separate (`s0` and `s1` samples, respectively). The input dimensionality corresponds `m` features. See the notebooks on supervised learning for details.
%% Cell type:code id:4673d31d tags:
``` python
# create synthetic dataset where 2 classes of s0+s1 samples of m-dimensional inputs with controlled contrast
def gen_inputs(m, # input dimensionality
s0, # number of samples for class 0
s1, # number of samples for class 1
scaling, # scaling factor to separate classes
type): # type of scaling (0=bimodal or 1=linear)
# labels
lbl = np.zeros([s0+s1], dtype=int)
# inputs
X = np.zeros([s0+s1,m])
# create s0 and s1 samples for the 2 classes
for i in range(s0+s1):
# label
lbl[i] = int(i<s0)
# inputs are random noise plus a shift
for j in range(m):
# positive/negative shift for 1st/2nd class, only for indices j larger than m/2
if type==0:
if j>=m/2:
if i<s0:
a = -scaling
else:
a = scaling
else:
a = 0.0
else:
if i<s0:
a = -j / m * scaling
else:
a = j / m * scaling
# the shift linearly depends on the feature index j
X[i,j] = a + np.random.randn()
return X, lbl
```
%% Cell type:markdown id:2052e9ec-c51e-429a-8ec0-d67eb9ec24ed tags:
For features with indices smaller than $m/2$, there is no contrast between the two classes (in red and blue). For indices larger than $m/2$, the contrast is determined by the scaling parameters (separation between the two disributions). Each feature involve independent noise in the generation of the samples.
%% Cell type:code id:970d0a6e-d894-45f5-a1a0-5752ee222688 tags:
``` python
# input properties
m = 1 # input dimensionality (try 1 and 10)
s0 = 50 # number of samples for class 0
s1 = 50 # number of samples for class 1
scaling = 0.5 # class contrast
# generate inputs
X, y = gen_inputs(m, s0, s1, scaling, 0)
```
%% Cell type:markdown id:a2b44140-a4fb-4f9c-9fd1-ff3934a61c3c tags:
We see that the left half of vertical columns in the matrix (sample, feature) have similar values for the red and blue classes (top and bottom half of rows, respectively). However, the right columns show differentiated values for the two classes: lower values for the red class than for the blue class. Importantly, note that these data involve noise: for each sample, not the same input feature (among high indices) allows for a clear prediction of the class of the sample.
%% Cell type:code id:e1d2470e-44a2-4fbd-834e-e898eb7f649b tags:
``` python
plt.figure(figsize=[8,6])
plt.axes([0.2,0.2,0.55,0.7])
plt.imshow(X, aspect=0.2, interpolation='nearest')
plt.xticks(range(0,m,2))
plt.xlabel('feature')
plt.ylabel('sample')
plt.axes([0.6,0.2,0.3,0.7])
plt.imshow(y.reshape([s0+s1,1]), aspect=0.2, interpolation='nearest', cmap='bwr')
plt.xticks([0],[' '])
plt.xlabel('label')
plt.show()
```
%% Cell type:markdown id:9089a6d5-ff78-4c1c-9bd0-add0f7997e43 tags:
## Statistical testing
The goal of statistical testing is to compare two distrbutions, each for each class. We can start with comparing features individually.
%% Cell type:code id:fcdbcf9f-73ba-4010-893f-7a3083450c99 tags:
``` python
# bins for istograms
vbins = np.linspace(-3,3,30)
# plot
plt.figure(figsize=[6,7])
for j, i in enumerate(np.unique([0, int((m-1)*0.33), int((m-1)*0.66), m-1])):
plt.subplot(4,1,j+1)
plt.hist(X[:s0,i], histtype='step', bins=vbins, color='r')
plt.hist(X[s0:,i], histtype='step', bins=vbins, color='b')
plt.axis(xmin=-3, xmax=3)
plt.legend(['class 0', 'class 1'], fontsize=10)
plt.title('input {}'.format(i), loc='left')
plt.xlabel('X values')
plt.tight_layout()
plt.show()
```
%% Cell type:code id:dea3710d-4929-456a-95e2-23a1445e3b03 tags:
``` python
p_val_indiv = np.zeros([m])
for i in range(m):
# results of t tests
print(stt.ttest_ind(X[:s0,i], X[s0:,i], equal_var=True))
p_val_indiv[i] = stt.ttest_ind(X[:s0,i], X[s0:,i], equal_var=True)[1]
```
%% Cell type:markdown id:8b834ae4-32e2-48b8-aebb-e1d4a62c803b tags:
We can build a linear model that take all features together. To each feature corresponds a regression coefficient that evaluates its contribution in a multivariate context. Note that this is equivalent to predicting the class index from all features together.
%% Cell type:code id:3fb8fe3c-7ff4-4e33-b565-113bdd97cb9c tags:
``` python
# create a dataframe with predicted variable as class index and all features
df = pd.DataFrame()
df['y'] = y
for i in range(m):
tag = 'x{}'.format(i)
df[tag] = X[:,i]
# define linear model
if m==1:
lm = smf.ols('y ~ x0', df)
else:
lm = smf.ols('y ~ x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9', df)
# fit model to data
lmf = lm.fit()
# summary
print(lmf.summary())
```
%% Cell type:code id:14b15b16-d7ee-40ec-b39f-da6cd11dea68 tags:
``` python
print(sm.stats.anova_lm(lmf, typ=2))
```
%% Cell type:markdown id:3cdb2b17-ae91-4c68-9048-a704d26a0e41 tags:
Importantly, the p-values evaluated in a multivariate context differ from those calculated individually (compare `m=1` and `m=10`).
%% Cell type:code id:7f88dbf8-05fa-4182-bff0-4409bdd19c2e tags:
``` python
p_val_multi = lmf.pvalues[1:]
plt.figure()
plt.scatter(-np.log(p_val_indiv)/np.log(10), -np.log(p_val_multi)/np.log(10))
plt.xlabel('p-value (log 10) indiv')
plt.ylabel('p-value (log 10) multi')
plt.tight_layout()
```
%% Cell type:markdown id:089b6b07-3866-4c89-ab04-c7fe8c32ad2b tags:
## Classification
The goal of classification is to extract the relevant information to perform the classification. Here we use cross-validationwith 80%-20% for training-testing.
%% Cell type:code id:1187b5f5-1078-4241-83d3-1717edac5ee1 tags:
``` python
# number of split repetitions
n_rep = 10
# outer cross-validation scheme
cvs = StratifiedShuffleSplit(n_splits=n_rep, test_size=0.2)
```
%% Cell type:code id:6f2a4e24 tags:
``` python
# default value fhyperparameter for regularization
C = 10000.0
# classifier in pipeline and wrapper for RFE
clf = Pipeline([('scl',StandardScaler()),
('mlr',LogisticRegression(C=C))])
```
%% Cell type:code id:07cdf8f2 tags:
``` python
# results: accuracy and trained weights
d_acc = {'type': [], 'score': []}
d_weights = {'type': [], 'feat_idx': [], 'value': []}
# repeat classification
for train_ind, test_ind in cvs.split(X, y):
# train the classifier
clf.fit(X[train_ind,:], y[train_ind])
# accuracy on train set
score = clf.score(X[train_ind,:], y[train_ind])
d_acc['type'] += ['train']
d_acc['score'] += [score]
# accuracy on test set
score = clf.score(X[test_ind,:], y[test_ind])
d_acc['type'] += ['test']
d_acc['score'] += [score]
# trained weights
for idx in range(m):
d_weights['type'] += ['train']
d_weights['feat_idx'] += [idx]
d_weights['value'] += [clf[-1].coef_[0,idx]]
# shuffling labels to evaluate the chance-level classification, retrain and test
train_ind_rand = np.random.permutation(train_ind)
clf.fit(X[train_ind,:], y[train_ind_rand])
score = clf.score(X[test_ind,:], y[test_ind])
d_acc['type'] += ['shuf']
d_acc['score'] += [score]
# trained weights
for idx in range(m):
d_weights['type'] += ['shuf']
d_weights['feat_idx'] += [idx]
d_weights['value'] += [clf[-1].coef_[0,idx]]
# transform lists of dictionnaries in dataframes
acc = pd.DataFrame(d_acc, columns=['type', 'score'])
weights = pd.DataFrame(d_weights, columns=['type', 'feat_idx', 'value'])
```
%% Cell type:markdown id:2702e677 tags:
When plotting the accuracy results, we see that the classifier performs very well on the training set as illustrated by the accuracy in brown. In comparison the accuracy on the testing set in orange is lower, but still way above the chance level in gray (for the scaling factor of $0.7$ used in generating the samples). The comparison between these three accuracy distributions (because we use several repetitions of the splitting scheme) provides a quantitative manner to understand whether the classifier training is successful. A more complete check-up would consider the evolution of these accuracies during the iterative training of the classifier.
Here we have two classes with the same number of samples in each, so the chance-level is in theory 50%. The point of evaluating the chance level from the data is to quantify the variability of the corresponding accuracy.
%% Cell type:code id:03237341 tags:
``` python
chance_level = 0.5
plt.figure()
sb.violinplot(data=acc, y='score', x='type', hue='type', cut=1.0,
palette=['brown','orange','gray'], density_norm='width', inner=None, legend=False)
sb.swarmplot(data=acc, y='score', x='type', hue='type',
palette=['black','black','black'])
plt.plot([-1,3], [chance_level]*2, '--k')
plt.yticks([0,1])
plt.xlabel('')
plt.ylabel('accuracy')
plt.tight_layout()
plt.title('train-test accuracies')
plt.show()
```
%% Cell type:markdown id:4dead486-aa61-42fe-af61-12862fce1100 tags:
The `weights` dataframe has to the trained weights for the classifier in the two conditions ('train' and 'shuf'), together with the index of the feature. We can plot them to see which features are used by the trained classifier, indicating informative features.
We indeed see large weights (in absolute value) for high features indices and small weights for low feature indices for the trained classifier. In contrast, the classifier trained with shuffled labels has all weights close to zero, indicating an absence of specialization.
%% Cell type:code id:0427e876-3a2d-4df4-9c7f-3105e8ca339d tags:
``` python
plt.figure()
sb.lineplot(data=weights, y='value', x='feat_idx', hue='type')
plt.xlabel('feature index')
plt.ylabel('weight value')
plt.tight_layout()
plt.title('trained weights', fontsize=16)
plt.legend(fontsize=14, loc='lower left')
plt.show()
```
%% Cell type:markdown id:173c52a4-52af-4a20-afa2-0e6f958d8995 tags:
## Compare p-values and trained weights
%% Cell type:code id:3ab07ed6-8fab-4ed5-816b-2d5c8c3b9dcd tags:
``` python
abs_weight = np.abs(weights[weights['type']=='train'].groupby('feat_idx')['value'].apply(func=np.mean))
```
%% Cell type:code id:c03b4ffa-375d-48b3-b9bb-d27f2fb72bc2 tags:
``` python
plt.figure()
plt.scatter(-np.log(p_val_indiv)/np.log(10), abs_weight)
plt.xlabel('p-value (log 10)')
plt.ylabel('absolute weight')
plt.title('individual')
plt.tight_layout()
```
%% Cell type:code id:a7dc35c7-3689-4c2c-bafc-d70a0baa5f77 tags:
``` python
plt.figure()
plt.scatter(-np.log(p_val_multi)/np.log(10), abs_weight)
plt.xlabel('p-value (log 10)')
plt.ylabel('absolute weight')
plt.title('multivariate')
plt.tight_layout()
```
%% Cell type:markdown id:c0f55fdc-d11a-4188-9e3a-3d793431ed54 tags:
## Exercices:
- change from $m=1$ to $10$ inputs
- modify the values of the scaling factor to increase/decrease the contrast between the two classes
- modify the value of the regularization parameter to see the effect of the trained weights
- change the input structure: `gen_inputs(m, s0, s1, scaling, 1)` instead of `gen_inputs(m, s0, s1, scaling, 0)`
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment