Commit 477ce028 authored by GILSON Matthieu

update stats

parent e9338602
%% Cell type:code id:6d3a1742-1df0-4cb6-87d5-40e35272387c tags:
``` python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
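# set style and font scale in one call: a separate sb.set(font_scale=...) would reset the style to its default
sb.set(style='whitegrid', font_scale=1.5)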
```
%% Cell type:markdown id:8e4cf7fe-9ca6-451d-b43c-d3baf02fc463 tags:
The datasets below come from the Rdatasets collection [https://vincentarelbundock.github.io/Rdatasets/articles/data.html](https://vincentarelbundock.github.io/Rdatasets/articles/data.html). They can be loaded with the `statsmodels` package, as described in [https://www.statsmodels.org/stable/datasets/index.html#r-datasets-function-reference](https://www.statsmodels.org/stable/datasets/index.html#r-datasets-function-reference): after `import statsmodels.api as sm`, call for example `df = sm.datasets.get_rdataset('gss_wages', package='stevedata').data`.
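The notebook alternates between downloading a dataset with `get_rdataset` and reading a saved CSV copy. A minimal sketch of combining the two steps is shown here. The helper name `load_rdataset_cached` and its arguments are hypothetical and not part of `statsmodels`. The only library call assumed is `sm.datasets.get_rdataset(name, package=...)`, as in the example above.

```python
from pathlib import Path

import pandas as pd


def load_rdataset_cached(name, package, csv_path):
    """Return a dataset as a DataFrame, preferring a local CSV copy.

    Hypothetical helper: reads csv_path if it exists, otherwise downloads
    the dataset through statsmodels and saves it for later runs.
    """
    csv_path = Path(csv_path)
    if csv_path.exists():
        # local copy available: no network access needed
        return pd.read_csv(csv_path, sep=',')
    # imported lazily so that reading a cached copy does not require statsmodels
    import statsmodels.api as sm
    df = sm.datasets.get_rdataset(name, package=package).data
    df.to_csv(csv_path, sep=',', index=False)
    return df
```

A call such as `load_rdataset_cached('gss_wages', 'stevedata', 'gss_wages.csv')` would download the table on first use and read the CSV afterwards. The file name here is only an illustration.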
%% Cell type:markdown id:37bca353-bef6-455b-8219-dba0bff6a057 tags:
# NeuroCog dataset
Dataset 'NeuroCog' from package 'heplots': https://vincentarelbundock.github.io/Rdatasets/doc/heplots/NeuroCog.html
%% Cell type:code id:c6b9a1f8-d9fc-4ecc-b388-d0396647c1dc tags:
``` python
# load NeuroCog data
df = pd.read_csv('NeuroCog_dataset.csv', sep=',')
df.head()
```
%% Cell type:code id:efbfa176-a144-4ab3-aba3-59d66fe2b317 tags:
``` python
plt.figure()
sb.scatterplot(data=df, y='Speed', x='Memory')
plt.show()
```
%% Cell type:code id:0585462e-ea90-4ebd-8789-8743a4463918 tags:
``` python
plt.figure()
sb.scatterplot(data=df, y='Speed', x='Age')
plt.show()
```
%% Cell type:code id:f8e71fbd-a408-4aad-8c7f-3aba7b0edddb tags:
``` python
plt.figure()
sb.scatterplot(data=df, y='Speed', x='Age', hue='Dx')
plt.legend(fontsize=10)
plt.show()
```
%% Cell type:markdown id:2bc8ca41-1403-4b1f-b6ac-659be994b79d tags:
# Gender Gap Wage dataset
Dataset 'gss_wages' from package 'stevedata': https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/gss_wages.html
Target variable: 'realrinc'
%% Cell type:code id:0a46bc09-9afb-4ebb-aba2-20773ff0191d tags:
``` python
# gender-gap wage dataset
df = pd.read_csv('GenderGapWage_dataset.csv', sep=',')
df.head()
```
%% Cell type:code id:e4cb2cc8-be2f-4ca3-af28-e9a775c9aa99 tags:
``` python
plt.figure()
sb.violinplot(data=df, y='realrinc', x='gender')
plt.show()
```
%% Cell type:code id:b86cc9fc-4714-4694-b145-f9a10ac1607b tags:
``` python
plt.figure()
sb.scatterplot(data=df, y='realrinc', x='age')
plt.show()
```
%% Cell type:code id:d27a22af-5c9b-400b-b086-4a248cbf641e tags:
``` python
plt.figure()
sb.violinplot(data=df, y='realrinc', x='educcat')
plt.xticks(rotation=70)
plt.show()
```
%% Cell type:markdown id:34dbeeb0-914d-4706-89c7-1bf9dff1fa63 tags:
# Fertility dataset
Dataset 'Fertility' from package 'AER': https://vincentarelbundock.github.io/Rdatasets/doc/AER/Fertility.html
Target variable: 'mkids'
%% Cell type:code id:2e3266ab-fd0d-4ab8-b44b-9b3a1bc25304 tags:
``` python
# fertility dataset
df = pd.read_csv('Fertility_dataset.csv', sep=',')
df.head()
```
%% Cell type:code id:ae4c0ab9-32a7-41bc-aa68-d840104a62a1 tags:
``` python
plt.figure()
plt.subplot(1,2,1)
sb.violinplot(data=df, y='mkids', x='gender1')
plt.subplot(1,2,2)
sb.violinplot(data=df, y='mkids', x='gender2')
plt.tight_layout()
plt.show()
```
%% Cell type:code id:67b86097-2537-42cc-9d62-027eb38498c5 tags:
``` python
plt.figure()
sb.violinplot(data=df, y='mkids', x='age')
plt.show()
```
%% Cell type:markdown id:4e8e903c-52a4-4b5f-b81c-d807a68ebac0 tags:
# Covid dataset
An example of Simpson's paradox: https://en.wikipedia.org/wiki/Simpson%27s_paradox
Dataset 'simpsons_paradox_covid' from package 'openintro': https://vincentarelbundock.github.io/Rdatasets/doc/openintro/simpsons_paradox_covid.html
Target variable: 'outcome'
%% Cell type:raw id:4d805d0d-6088-4e93-b969-e3c424eedaaf tags:
import statsmodels.api as sm
df = sm.datasets.get_rdataset('simpsons_paradox_covid', package='openintro').data
%% Cell type:raw id:fc9e7e5c-2878-45f2-b2fc-36dcaa2f1dd3 tags:
df.to_csv('Covid.csv', sep=',', index=False)
%% Cell type:code id:b9eecf78-2042-4bdf-afbc-ae365d801cc1 tags:
``` python
# covid dataset
df = pd.read_csv('Covid.csv', sep=',')
df.head()
```
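To make the Simpson's paradox mentioned for this dataset concrete, here is a small sketch on toy counts. The numbers are invented for illustration and are not taken from the Covid data. The column names `age_group` and `vaccine_status` are assumptions about how the real table is organised and should be checked against `df.head()`.

```python
import pandas as pd

# Toy counts, invented for illustration only (NOT the real dataset).
counts = pd.DataFrame({
    'age_group': ['under 50', 'under 50', '50 +', '50 +'],
    'vaccine_status': ['vaccinated', 'unvaccinated', 'vaccinated', 'unvaccinated'],
    'n': [1000, 4000, 4000, 1000],
    'deaths': [1, 8, 80, 40],
})

# death rate within each age group (rows: age group, columns: vaccine status)
by_age = counts.set_index(['age_group', 'vaccine_status'])
rate_by_age = (by_age['deaths'] / by_age['n']).unstack('vaccine_status')

# death rate after pooling the two age groups
pooled = counts.groupby('vaccine_status')[['deaths', 'n']].sum()
rate_pooled = pooled['deaths'] / pooled['n']

print(rate_by_age)
print(rate_pooled)
```

In this toy table the vaccinated group has the lower death rate within each age group but the higher rate once the groups are pooled, because most vaccinated individuals fall in the older, higher-risk group. If the real table has the assumed columns, `pd.crosstab(df['vaccine_status'], df['outcome'], normalize='index')` gives the pooled comparison.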
%% Cell type:markdown id:04c0d5ba-e4bd-42df-a63c-64c84bbe9c1c tags:
# Trump Vote dataset
Dataset 'TV16' from package 'stevedata': https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/TV16.html
Target variable: 'votetrump'
%% Cell type:raw id:5c92008a-a06c-4e7d-9fa7-5a9235de9fd8 tags:
import statsmodels.api as sm
df = sm.datasets.get_rdataset('TV16', package='stevedata').data
df.head()
%% Cell type:raw id:c2c424c9-340f-47d0-af14-75425b52b981 tags:
df.to_csv('TV16.csv', sep=',', index=False)
%% Cell type:code id:2bdac1bb-4171-415e-a644-c956d8305e9d tags:
``` python
# Trump vote dataset (TV16)
df = pd.read_csv('TV16.csv', sep=',')
df.head()
```
%% Cell type:code id:85e6a573-a80b-43f7-8d06-c334bec52fe6 tags:
``` python
plt.figure()
sb.violinplot(data=df, y='votetrump', x='racef')
plt.xticks(rotation=60)
plt.tight_layout()
plt.show()
```
%% Cell type:code id:e567922c-dfad-44dc-b9f7-5fbd4f714ba9 tags:
``` python
plt.figure()
sb.violinplot(data=df, y='votetrump', x='religimp')
plt.xticks(rotation=60)
plt.tight_layout()
plt.show()
```
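If `votetrump` is a 0/1 indicator (an assumption to verify with `df['votetrump'].unique()`), a per-category share may be easier to read than a violin plot. The sketch below uses toy rows with placeholder category labels, not the real data, and reports the share of 1s together with the group size.

```python
import pandas as pd

# Toy rows invented for illustration; 'A' and 'B' are placeholder categories.
toy = pd.DataFrame({
    'racef': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'votetrump': [1, 0, 1, 0, 0, 1, None],
})

# drop rows with a missing target, then take the mean of the 0/1 indicator per group
share = (
    toy.dropna(subset=['votetrump'])
       .groupby('racef')['votetrump']
       .agg(['mean', 'size'])
       .rename(columns={'mean': 'share', 'size': 'n'})
)
print(share)
```

The same chain could be applied to the loaded table by replacing `toy` with `df`, and `'racef'` with `'religimp'` for the second plot.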