{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data analysis: Velib"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Author: O. Roustant, INSA Toulouse. February 2022.\n",
"\n",
"We consider the Vélib data set, related to the bike-sharing system of Paris. The data are the loading profiles of the bike stations over one week, collected every hour, during the period Monday 2nd Sept. - Sunday 7th Sept., 2014. The loading profile of a station, or simply its loading, is defined as the number of available bikes divided by the number of bike docks. A loading of 1 means that the station is fully loaded, i.e. every dock holds an available bike. A loading of 0 means that the station is empty, i.e. all bikes have been rented.\n",
"\n",
"From the viewpoint of data analysis, the individuals are the stations. The variables are the 168 time steps (hours in the week). The aim is to detect clusters in the data, corresponding to common usage patterns. This clustering should then be used to predict the loading profile.\n"
]
},
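{
"cell_type": "markdown",
"metadata": {},
"source": [
"The definition of the loading given above can be illustrated with a minimal sketch. The helper `loading_ratio` and the station with 26 docks are hypothetical and chosen only for illustration; they are not part of the data set or of the original analysis.

```python
def loading_ratio(available_bikes, docks):
    '''Loading of a station: available bikes divided by number of docks.'''
    if docks <= 0:
        raise ValueError('docks must be positive')
    if not 0 <= available_bikes <= docks:
        raise ValueError('available_bikes must lie between 0 and docks')
    return available_bikes / docks


# Hypothetical station with 26 docks (numbers chosen for illustration only).
print(loading_ratio(26, 26))            # fully loaded station -> 1.0
print(loading_ratio(0, 26))             # empty station -> 0.0
print(round(loading_ratio(1, 26), 6))   # one bike left -> 0.038462
```

The loading is therefore always a number between 0 and 1, which makes stations of different sizes comparable."
]
},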
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Lun-00</th>\n",
" <th>Lun-01</th>\n",
" <th>Lun-02</th>\n",
" <th>Lun-03</th>\n",
" <th>Lun-04</th>\n",
" <th>Lun-05</th>\n",
" <th>Lun-06</th>\n",
" <th>Lun-07</th>\n",
" <th>Lun-08</th>\n",
" <th>Lun-09</th>\n",
" <th>...</th>\n",
" <th>Dim-14</th>\n",
" <th>Dim-15</th>\n",
" <th>Dim-16</th>\n",
" <th>Dim-17</th>\n",
" <th>Dim-18</th>\n",
" <th>Dim-19</th>\n",
" <th>Dim-20</th>\n",
" <th>Dim-21</th>\n",
" <th>Dim-22</th>\n",
" <th>Dim-23</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.038462</td>\n",
" <td>0.038462</td>\n",
" <td>0.076923</td>\n",
" <td>0.038462</td>\n",
" <td>0.038462</td>\n",
" <td>0.038462</td>\n",
" <td>0.038462</td>\n",
" <td>0.038462</td>\n",
" <td>0.107143</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.296296</td>\n",
" <td>0.111111</td>\n",
" <td>0.111111</td>\n",
" <td>0.148148</td>\n",
" <td>0.307692</td>\n",
" <td>0.076923</td>\n",
" <td>0.115385</td>\n",
" <td>0.076923</td>\n",
" <td>0.153846</td>\n",
" <td>0.153846</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.478261</td>\n",
" <td>0.478261</td>\n",
" <td>0.478261</td>\n",
" <td>0.434783</td>\n",
" <td>0.434783</td>\n",
" <td>0.434783</td>\n",
" <td>0.434783</td>\n",
" <td>0.434783</td>\n",
" <td>0.260870</td>\n",
" <td>0.043478</td>\n",
" <td>...</td>\n",
" <td>0.043478</td>\n",
" <td>0.000000</td>\n",
" <td>0.217391</td>\n",
" <td>0.130435</td>\n",
" <td>0.045455</td>\n",
" <td>0.173913</td>\n",
" <td>0.173913</td>\n",
" <td>0.173913</td>\n",
" <td>0.260870</td>\n",
" <td>0.391304</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.218182</td>\n",
" <td>0.145455</td>\n",
" <td>0.127273</td>\n",
" <td>0.109091</td>\n",
" <td>0.109091</td>\n",
" <td>0.109091</td>\n",
" <td>0.090909</td>\n",
" <td>0.090909</td>\n",
" <td>0.054545</td>\n",
" <td>0.109091</td>\n",
" <td>...</td>\n",
" <td>0.259259</td>\n",
" <td>0.259259</td>\n",
" <td>0.203704</td>\n",
" <td>0.129630</td>\n",
" <td>0.148148</td>\n",
" <td>0.296296</td>\n",
" <td>0.314815</td>\n",
" <td>0.370370</td>\n",
" <td>0.370370</td>\n",
" <td>0.407407</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.952381</td>\n",
" <td>0.952381</td>\n",
" <td>0.952381</td>\n",
" <td>0.952381</td>\n",
" <td>0.952381</td>\n",
" <td>0.952381</td>\n",
" <td>0.952381</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.904762</td>\n",
" <td>0.857143</td>\n",
" <td>0.857143</td>\n",
" <td>0.857143</td>\n",
" <td>0.761905</td>\n",
" <td>0.761905</td>\n",
" <td>0.761905</td>\n",
" <td>0.761905</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.927536</td>\n",
" <td>0.811594</td>\n",
" <td>0.739130</td>\n",
" <td>0.724638</td>\n",
" <td>0.724638</td>\n",
" <td>0.724638</td>\n",
" <td>0.724638</td>\n",
" <td>0.724638</td>\n",
" <td>0.753623</td>\n",
" <td>0.971014</td>\n",
" <td>...</td>\n",
" <td>0.227273</td>\n",
" <td>0.454545</td>\n",
" <td>0.590909</td>\n",
" <td>0.833333</td>\n",
" <td>1.000000</td>\n",
" <td>0.818182</td>\n",
" <td>0.636364</td>\n",
" <td>0.712121</td>\n",
" <td>0.621212</td>\n",
" <td>0.575758</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 168 columns</p>\n",
"</div>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"path='' # leave empty if the data files are already in the current directory\n",
"loading = pd.read_csv(path+'velibLoading.csv', sep = \" \")\n",
"loading.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>bonus</th>\n",
" <th>names</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.377389</td>\n",
" <td>48.886300</td>\n",
" <td>0</td>\n",
" <td>EURYALE DEHAYNIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.317591</td>\n",
" <td>48.890020</td>\n",
" <td>0</td>\n",
" <td>LEMERCIER</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2.330447</td>\n",
" <td>48.850297</td>\n",
" <td>0</td>\n",
" <td>MEZIERES RENNES</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2.271396</td>\n",
" <td>48.833734</td>\n",
" <td>0</td>\n",
" <td>FARMAN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2.366897</td>\n",
" <td>48.845887</td>\n",
" <td>0</td>\n",
" <td>QUAI DE LA RAPEE</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"velibAdds = pd.read_csv(path+'velibAdds.csv', sep = \" \")\n",
"velibAdds.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">&lt;</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff; font-weight: bold\">Figure</span><span style=\"color: #000000; text-decoration-color: #000000\"> size 432x288 with </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span><span style=\"color: #000000; text-decoration-color: #000000\"> Axes</span><span style=\"font-weight: bold\">&gt;</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m<\u001b[0m\u001b[1;95mFigure\u001b[0m\u001b[39m size 432x288 with \u001b[0m\u001b[1;36m1\u001b[0m\u001b[39m Axes\u001b[0m\u001b[1m>\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": ""
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Plot the loading of the first station\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"p = loading.columns.size  # number of hourly time steps\n",
"Time = np.linspace(1, p, p)\n",
"plt.plot(Time, loading.loc[1], linewidth = 2, color = 'blue')\n",
"plt.xlabel('Time'); plt.ylabel('Loading'); plt.title(velibAdds.names[1])\n",
"# Dotted vertical lines mark the day boundaries\n",
"plt.vlines(x = np.linspace(1, p, 8), ymin = 0, ymax = 1, colors = \"black\", linestyle = \"dotted\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Descriptive statistics.\n",
"\n",
Some ideas:

1. Draw a 4 x 4 matrix of plots corresponding to the first 16 stations. (Do not forget the vertical lines marking the days.)
2. Draw the boxplots of the variables, sorted in time order.
What can you say about the distribution of the variables?
Position, dispersion, symmetry? Can you see a difference between days?
3. Plot the average hourly loading for each day (on a single graph).
Comments?
4. Plot the station coordinates on a 2D map (latitude versus longitude). Use the R package ggmap (function 'qmplot') to visualize the average loading for a given hour (6h, 12h, 23h) as a color scale.
Comments?
5. Use a different color for the stations located on a hill. (Use the basic 'plot' function, and the function 'qmplot' of the R package ggmap.)
6. Redo questions 1-3 for the subset of stations located on a hill, then for those that are not. Comments?"
]
}
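,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Idea 3 in the list above (average hourly loading for each day) can be sketched as follows. This is only one possible approach, not a reference solution. It assumes that the 168 columns are ordered from Lun-00 to Dim-23; the function name `daily_hourly_means`, the day prefixes other than Lun and Dim, and the random stand-in table are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumed day prefixes; only Lun and Dim are visible in the table preview.
DAYS = ['Lun', 'Mar', 'Mer', 'Jeu', 'Ven', 'Sam', 'Dim']


def daily_hourly_means(loading):
    '''Return a 7 x 24 table: mean loading over stations, per day and hour.'''
    # Mean over the stations for each of the 168 hourly columns.
    col_means = loading.mean(axis=0).to_numpy()
    # The columns are assumed to be in time order, so a plain reshape
    # groups them into 7 days of 24 hours.
    table = col_means.reshape(len(DAYS), 24)
    return pd.DataFrame(table, index=DAYS, columns=range(24))


# Synthetic stand-in for velibLoading.csv (random values, not real data).
rng = np.random.default_rng(0)
cols = [f'{d}-{h:02d}' for d in DAYS for h in range(24)]
fake_loading = pd.DataFrame(rng.uniform(0, 1, size=(10, 168)), columns=cols)

daily = daily_hourly_means(fake_loading)
print(daily.shape)
```

With the real data, `daily_hourly_means(loading).T.plot()` should draw one curve per day on a single graph, with the hour of the day on the horizontal axis."
]
}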
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}