update stats

Commit 477ce028 authored 10 months ago by GILSON Matthieu
--- a/data/ex_stats/load_data.ipynb
+++ b/data/ex_stats/load_data.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d3a1742-1df0-4cb6-87d5-40e35272387c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "import pandas as pd\n",
+    "\n",
+    "import matplotlib.pyplot as plt\n",
+    "%matplotlib inline\n",
+    "import seaborn as sb\n",
+    "\n",
+    "# set style and font scale in one call: a later sb.set() would reset the style to its default\n",
+    "sb.set(style='whitegrid', font_scale=1.5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8e4cf7fe-9ca6-451d-b43c-d3baf02fc463",
+   "metadata": {},
+   "source": [
+    "We consider the following datasets from the Rdatasets collection [https://vincentarelbundock.github.io/Rdatasets/articles/data.html](https://vincentarelbundock.github.io/Rdatasets/articles/data.html), which can be accessed via the `statsmodels` package as described in [https://www.statsmodels.org/stable/datasets/index.html#r-datasets-function-reference](https://www.statsmodels.org/stable/datasets/index.html#r-datasets-function-reference), for example `df = sm.datasets.get_rdataset('gss_wages', package='stevedata').data` after `import statsmodels.api as sm`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "37bca353-bef6-455b-8219-dba0bff6a057",
+   "metadata": {},
+   "source": [
+    "# NeuroCog dataset\n",
+    "\n",
+    "Dataset 'NeuroCog' from package 'heplots': https://vincentarelbundock.github.io/Rdatasets/doc/heplots/NeuroCog.html"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c6b9a1f8-d9fc-4ecc-b388-d0396647c1dc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load NeuroCog data\n",
+    "df = pd.read_csv('NeuroCog_dataset.csv', sep=',')\n",
+    "\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "efbfa176-a144-4ab3-aba3-59d66fe2b317",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "sb.scatterplot(data=df, y='Speed', x='Memory')\n",
+    "\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0585462e-ea90-4ebd-8789-8743a4463918",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "sb.scatterplot(data=df, y='Speed', x='Age')\n",
+    "\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f8e71fbd-a408-4aad-8c7f-3aba7b0edddb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "sb.scatterplot(data=df, y='Speed', x='Age', hue='Dx')\n",
+    "plt.legend(fontsize=10)\n",
+    "\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2bc8ca41-1403-4b1f-b6ac-659be994b79d",
+   "metadata": {},
+   "source": [
+    "# Gender Gap Wage dataset\n",
+    "\n",
+    "Dataset 'gss_wages' from package 'stevedata': https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/gss_wages.html\n",
+    "\n",
+    "Target variable: 'realrinc'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0a46bc09-9afb-4ebb-aba2-20773ff0191d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# gender-gap wage dataset\n",
+    "df = pd.read_csv('GenderGapWage_dataset.csv', sep=',')\n",
+    "\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e4cb2cc8-be2f-4ca3-af28-e9a775c9aa99",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "sb.violinplot(data=df, y='realrinc', x='gender')\n",
+    "\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b86cc9fc-4714-4694-b145-f9a10ac1607b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "sb.scatterplot(data=df, y='realrinc', x='age')\n",
+    "\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d27a22af-5c9b-400b-b086-4a248cbf641e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "sb.violinplot(data=df, y='realrinc', x='educcat')\n",
+    "plt.xticks(rotation=70)\n",
+    "\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "34dbeeb0-914d-4706-89c7-1bf9dff1fa63",
+   "metadata": {},
+   "source": [
+    "# Fertility dataset\n",
+    "\n",
+    "Dataset 'Fertility' from package 'AER': https://vincentarelbundock.github.io/Rdatasets/doc/AER/Fertility.html\n",
+    "\n",
+    "Target variable: 'mkids'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2e3266ab-fd0d-4ab8-b44b-9b3a1bc25304",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# fertility dataset\n",
+    "df = pd.read_csv('Fertility_dataset.csv', sep=',')\n",
+    "\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ae4c0ab9-32a7-41bc-aa68-d840104a62a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "plt.subplot(1,2,1)\n",
+    "sb.violinplot(data=df, y='mkids', x='gender1')\n",
+    "plt.subplot(1,2,2)\n",
+    "sb.violinplot(data=df, y='mkids', x='gender2')\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "67b86097-2537-42cc-9d62-027eb38498c5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "sb.violinplot(data=df, y='mkids', x='age')\n",
+    "\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4e8e903c-52a4-4b5f-b81c-d807a68ebac0",
+   "metadata": {},
+   "source": [
+    "# Covid dataset\n",
+    "\n",
+    "An example of Simpson's paradox: https://en.wikipedia.org/wiki/Simpson%27s_paradox\n",
+    "\n",
+    "Dataset 'simpsons_paradox_covid' from package 'openintro': https://vincentarelbundock.github.io/Rdatasets/doc/openintro/simpsons_paradox_covid.html\n",
+    "\n",
+    "Target variable: 'outcome'"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "id": "4d805d0d-6088-4e93-b969-e3c424eedaaf",
+   "metadata": {},
+   "source": [
+    "import statsmodels.api as sm\n",
+    "df = sm.datasets.get_rdataset('simpsons_paradox_covid', package='openintro').data"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "id": "fc9e7e5c-2878-45f2-b2fc-36dcaa2f1dd3",
+   "metadata": {},
+   "source": [
+    "df.to_csv('Covid.csv', sep=',', index=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b9eecf78-2042-4bdf-afbc-ae365d801cc1",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# covid dataset\n",
+    "df = pd.read_csv('Covid.csv', sep=',')\n",
+    "\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "04c0d5ba-e4bd-42df-a63c-64c84bbe9c1c",
+   "metadata": {},
+   "source": [
+    "# Trump Vote dataset\n",
+    "\n",
+    "Dataset 'TV16' from package 'stevedata': https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/TV16.html\n",
+    "\n",
+    "Target variable: 'votetrump'"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "id": "5c92008a-a06c-4e7d-9fa7-5a9235de9fd8",
+   "metadata": {},
+   "source": [
+    "import statsmodels.api as sm\n",
+    "df = sm.datasets.get_rdataset('TV16', package='stevedata').data\n",
+    "\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "id": "c2c424c9-340f-47d0-af14-75425b52b981",
+   "metadata": {},
+   "source": [
+    "df.to_csv('TV16.csv', sep=',', index=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2bdac1bb-4171-415e-a644-c956d8305e9d",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Trump vote (TV16) dataset\n",
+    "df = pd.read_csv('TV16.csv', sep=',')\n",
+    "\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "85e6a573-a80b-43f7-8d06-c334bec52fe6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "sb.violinplot(data=df, y='votetrump', x='racef')\n",
+    "plt.xticks(rotation=60)\n",
+    "plt.tight_layout()\n",
+    "\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e567922c-dfad-44dc-b9f7-5fbd4f714ba9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure()\n",
+    "sb.violinplot(data=df, y='votetrump', x='religimp')\n",
+    "plt.xticks(rotation=60)\n",
+    "plt.tight_layout()\n",
+    "\n",
+    "plt.show()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
+%% Cell type:code id:6d3a1742-1df0-4cb6-87d5-40e35272387c tags:
+``` python
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+%matplotlib inline
+import seaborn as sb
+# set style and font scale in one call: a later sb.set() would reset the style to its default
+sb.set(style='whitegrid', font_scale=1.5)
+```
+%% Cell type:markdown id:8e4cf7fe-9ca6-451d-b43c-d3baf02fc463 tags:
+We consider the following datasets from the Rdatasets collection [https://vincentarelbundock.github.io/Rdatasets/articles/data.html](https://vincentarelbundock.github.io/Rdatasets/articles/data.html), which can be accessed via the `statsmodels` package as described in [https://www.statsmodels.org/stable/datasets/index.html#r-datasets-function-reference](https://www.statsmodels.org/stable/datasets/index.html#r-datasets-function-reference), for example `df = sm.datasets.get_rdataset('gss_wages', package='stevedata').data` after `import statsmodels.api as sm`.
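The notebook alternates between downloading a dataset with `get_rdataset` and reading a saved CSV copy. A minimal sketch of combining the two steps is given here; the helper name `load_rdataset_cached` and its arguments are hypothetical and not part of this commit, and it assumes the CSV was written with `index=False`, as in the raw cells of the notebook.

```python
from pathlib import Path

import pandas as pd


def load_rdataset_cached(name, package, csv_path):
    """Return an Rdatasets table as a DataFrame, preferring a local CSV copy.

    Hypothetical helper for illustration; not part of the notebook.
    """
    csv_path = Path(csv_path)
    if csv_path.exists():
        # local copy found: no download (and no statsmodels) needed
        return pd.read_csv(csv_path, sep=',')
    # imported lazily so that statsmodels is only required for the download
    import statsmodels.api as sm
    df = sm.datasets.get_rdataset(name, package=package).data
    # save a copy so that later runs work offline
    df.to_csv(csv_path, sep=',', index=False)
    return df
```

For instance, `df = load_rdataset_cached('TV16', 'stevedata', 'TV16.csv')` would download the table on the first call and read `TV16.csv` on later calls.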
+%% Cell type:markdown id:37bca353-bef6-455b-8219-dba0bff6a057 tags:
+# NeuroCog dataset
+Dataset 'NeuroCog' from package 'heplots': https://vincentarelbundock.github.io/Rdatasets/doc/heplots/NeuroCog.html
+%% Cell type:code id:c6b9a1f8-d9fc-4ecc-b388-d0396647c1dc tags:
+``` python
+# load NeuroCog data
+df = pd.read_csv('NeuroCog_dataset.csv', sep=',')
+df.head()
+```
+%% Cell type:code id:efbfa176-a144-4ab3-aba3-59d66fe2b317 tags:
+``` python
+plt.figure()
+sb.scatterplot(data=df, y='Speed', x='Memory')
+plt.show()
+```
+%% Cell type:code id:0585462e-ea90-4ebd-8789-8743a4463918 tags:
+``` python
+plt.figure()
+sb.scatterplot(data=df, y='Speed', x='Age')
+plt.show()
+```
+%% Cell type:code id:f8e71fbd-a408-4aad-8c7f-3aba7b0edddb tags:
+``` python
+plt.figure()
+sb.scatterplot(data=df, y='Speed', x='Age', hue='Dx')
+plt.legend(fontsize=10)
+plt.show()
+```
+%% Cell type:markdown id:2bc8ca41-1403-4b1f-b6ac-659be994b79d tags:
+# Gender Gap Wage dataset
+Dataset 'gss_wages' from package 'stevedata': https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/gss_wages.html
+Target variable: 'realrinc'
+%% Cell type:code id:0a46bc09-9afb-4ebb-aba2-20773ff0191d tags:
+``` python
+# gender-gap wage dataset
+df = pd.read_csv('GenderGapWage_dataset.csv', sep=',')
+df.head()
+```
+%% Cell type:code id:e4cb2cc8-be2f-4ca3-af28-e9a775c9aa99 tags:
+``` python
+plt.figure()
+sb.violinplot(data=df, y='realrinc', x='gender')
+plt.show()
+```
+%% Cell type:code id:b86cc9fc-4714-4694-b145-f9a10ac1607b tags:
+``` python
+plt.figure()
+sb.scatterplot(data=df, y='realrinc', x='age')
+plt.show()
+```
+%% Cell type:code id:d27a22af-5c9b-400b-b086-4a248cbf641e tags:
+``` python
+plt.figure()
+sb.violinplot(data=df, y='realrinc', x='educcat')
+plt.xticks(rotation=70)
+plt.show()
+```
+%% Cell type:markdown id:34dbeeb0-914d-4706-89c7-1bf9dff1fa63 tags:
+# Fertility dataset
+Dataset 'Fertility' from package 'AER': https://vincentarelbundock.github.io/Rdatasets/doc/AER/Fertility.html
+Target variable: 'mkids'
+%% Cell type:code id:2e3266ab-fd0d-4ab8-b44b-9b3a1bc25304 tags:
+``` python
+# fertility dataset
+df = pd.read_csv('Fertility_dataset.csv', sep=',')
+df.head()
+```
+%% Cell type:code id:ae4c0ab9-32a7-41bc-aa68-d840104a62a1 tags:
+``` python
+plt.figure()
+plt.subplot(1,2,1)
+sb.violinplot(data=df, y='mkids', x='gender1')
+plt.subplot(1,2,2)
+sb.violinplot(data=df, y='mkids', x='gender2')
+plt.tight_layout()
+plt.show()
+```
+%% Cell type:code id:67b86097-2537-42cc-9d62-027eb38498c5 tags:
+``` python
+plt.figure()
+sb.violinplot(data=df, y='mkids', x='age')
+plt.show()
+```
+%% Cell type:markdown id:4e8e903c-52a4-4b5f-b81c-d807a68ebac0 tags:
+# Covid dataset
+An example of Simpson's paradox: https://en.wikipedia.org/wiki/Simpson%27s_paradox
+Dataset 'simpsons_paradox_covid' from package 'openintro': https://vincentarelbundock.github.io/Rdatasets/doc/openintro/simpsons_paradox_covid.html
+Target variable: 'outcome'
+%% Cell type:raw id:4d805d0d-6088-4e93-b969-e3c424eedaaf tags:
+import statsmodels.api as sm
+df = sm.datasets.get_rdataset('simpsons_paradox_covid', package='openintro').data
+%% Cell type:raw id:fc9e7e5c-2878-45f2-b2fc-36dcaa2f1dd3 tags:
+df.to_csv('Covid.csv', sep=',', index=False)
+%% Cell type:code id:b9eecf78-2042-4bdf-afbc-ae365d801cc1 tags:
+``` python
+# covid dataset
+df = pd.read_csv('Covid.csv', sep=',')
+df.head()
+```
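The Covid dataset is introduced above as an example of Simpson's paradox. As a rough illustration of the effect, the sketch below uses small made-up counts (not taken from the dataset) with hypothetical column names `age_group`, `status`, `n` and `deaths`: the death rate is lower for vaccinated people within each age group, yet higher once the groups are pooled, because the two groups differ in size and baseline risk.

```python
import pandas as pd

# made-up counts for illustration only (NOT the openintro data)
toy = pd.DataFrame({
    'age_group': ['young', 'young', 'old', 'old'],
    'status': ['vaccinated', 'unvaccinated', 'vaccinated', 'unvaccinated'],
    'n': [100, 1000, 1000, 100],
    'deaths': [1, 20, 100, 20],
})

# death rate within each age group: vaccinated is lower in both groups
toy['rate'] = toy['deaths'] / toy['n']
within = toy.pivot(index='age_group', columns='status', values='rate')

# death rate after pooling the age groups: the ordering flips
pooled = toy.groupby('status')[['deaths', 'n']].sum()
pooled['rate'] = pooled['deaths'] / pooled['n']

print(within)
print(pooled)
```

The reversal arises here because most vaccinated people in the toy table belong to the high-risk `old` group, so the pooled comparison mixes vaccination status with age.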
+%% Cell type:markdown id:04c0d5ba-e4bd-42df-a63c-64c84bbe9c1c tags:
+# Trump Vote dataset
+Dataset 'TV16' from package 'stevedata': https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/TV16.html
+Target variable: 'votetrump'
+%% Cell type:raw id:5c92008a-a06c-4e7d-9fa7-5a9235de9fd8 tags:
+import statsmodels.api as sm
+df = sm.datasets.get_rdataset('TV16', package='stevedata').data
+df.head()
+%% Cell type:raw id:c2c424c9-340f-47d0-af14-75425b52b981 tags:
+df.to_csv('TV16.csv', sep=',', index=False)
+%% Cell type:code id:2bdac1bb-4171-415e-a644-c956d8305e9d tags:
+``` python
+# Trump vote (TV16) dataset
+df = pd.read_csv('TV16.csv', sep=',')
+df.head()
+```
+%% Cell type:code id:85e6a573-a80b-43f7-8d06-c334bec52fe6 tags:
+``` python
+plt.figure()
+sb.violinplot(data=df, y='votetrump', x='racef')
+plt.xticks(rotation=60)
+plt.tight_layout()
+plt.show()
+```
+%% Cell type:code id:e567922c-dfad-44dc-b9f7-5fbd4f714ba9 tags:
+``` python
+plt.figure()
+sb.violinplot(data=df, y='votetrump', x='religimp')
+plt.xticks(rotation=60)
+plt.tight_layout()
+plt.show()
+```
--- a/stats/Statistics_Regression.pdf
+++ b/stats/Statistics_Regression.pdf
--- a/stats/exam_stats.ipynb
+++ b/stats/exam_stats.ipynb