ACEA Water, Intro to Time Series Forecasting

Page content

강의 홍보

Overview

Can you build a model to predict the amount of water in each waterbody to help preserve this natural resource? This is an Analytics competition where your task is to create a Notebook that best addresses the Evaluation criteria below. Submissions should be shared directly with host and will be judged by the Acea Group based on how well they addrss:

Methodology/Completeness (min 0 points, max 5 points)

  • Are the statistical models appropriate given the data?
  • Did the author develop one or more machine learning models?
  • Did the author provide a way of assessing the performance and accuracy of their solution?
  • What is the Mean Absolute Error (MAE) of the models?
  • What is the Root Mean Square Error (RMSE) of the models?

Presentation (min 0 points, max 5 points)

  • Does the notebook have a compelling and coherent narrative?
  • Does the notebook contain data visualizations that help to communicate the author’s main points?
  • Did the author include a thorough discussion on the intersection between features and their prediction? For example between rainfall and amount/level of water.
  • Was there discussion of automated insight generation, demonstrating what factors to take into account?
  • Is the code documented in a way that makes it easy to understand and reproduce?
  • Were all external sources of data made public and cited appropriately?

Application (min 0 points, max 5 points)

  • Is the provided model useful/able to forecast water availability in terms of level or water flow in a time interval of the year?
  • Is the provided methodology applicable also on new datasets belong to another waterbody?

1. Problem Definition

  • It’s generally difficult to define problem. Since the goal of this competition is to build a model predict the depth to groundwater or river of an aquifer located in Petrignano, Italy, the problem definition could be related with the future prediction.

1.1 📊 Quick EDA 📈

  • To explore data, let’s import relevant libraries
import numpy as np # Linear Algebra

import pandas as pd # Data Processing, CSV File I/O
import seaborn as sns # Visualisation
import matplotlib.pyplot as plt # Visualisation
from matplotlib.patches import Rectangle # Helper
import os # PATH


PATH = "../input/acea-water-prediction/"

aquifer_auser = pd.read_csv(f"{PATH}Aquifer_Auser.csv")
aquifer_doganella = pd.read_csv(f"{PATH}Aquifer_Doganella.csv")
aquifer_luco = pd.read_csv(f"{PATH}Aquifer_Luco.csv")
aquifer_petrignano = pd.read_csv(f"{PATH}Aquifer_Petrignano.csv")
lake_biliancino = pd.read_csv(f"{PATH}Lake_Bilancino.csv")
river_arno = pd.read_csv(f"{PATH}River_Arno.csv")
water_spring_amiata = pd.read_csv(f"{PATH}Water_Spring_Amiata.csv")
water_spring_lupa = pd.read_csv(f"{PATH}Water_Spring_Lupa.csv")
water_spring_madonna = pd.read_csv(f"{PATH}Water_Spring_Madonna_di_Canneto.csv")
  • To understand each dataset, we will make temp_df dataset, given column category such as Date, Rainfall, Depth, Volume, Hydrometry, Lake_Level, Flow_Rate. Through this, We will know more about the common and difference between datasets.
datasets = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if '.csv' in filename:
            datasets += list([filename])

print("The datasets:", datasets)
datasets_df = pd.DataFrame(columns=['filename'], data=datasets)
datasets_df['waterbody_type'] = datasets_df.filename.apply(lambda x: x.split('_')[0])
print("The datasets_df:", datasets_df)

f, ax = plt.subplots(nrows=1, ncols=2, figsize=(24,6),  gridspec_kw={'width_ratios': [1,2]})
sns.countplot(datasets_df.waterbody_type, palette=['lightblue'], ax= ax[0])
ax[0].set_title('Datasets', fontsize=16)
ax[0].set_ylabel('Number of Datasets', fontsize=14)
ax[0].set_ylim(0, 5)
ax[0].set_xlabel('Waterbody Type', fontsize=14)
ax[0].set_xticklabels(labels=['Aquifer', 'Water Spring', 'Lake', 'River'], fontsize=14)


def get_column_category(x):
    if 'Date' in x:
        return 'Date'
    elif 'Rainfall' in x:
        return 'Rainfall'
    elif 'Depth' in x:
        return 'Depth to Groundwater'
    elif 'Temperature' in x:
        return 'Temperature'
    elif 'Volume' in x:
        return 'Volume'
    elif 'Hydrometry' in x:
        return 'Hydrometry'
    elif 'Lake_Level' in x:
        return 'Lake Level'
    elif 'Flow_Rate' in x:
        return 'Flow Rate'
    else:
        return x

temp_df = pd.DataFrame({'column_name' : aquifer_auser.columns, 'waterbody_type':'Aquifer Auser'})
temp_df = temp_df.append(pd.DataFrame({'column_name' : aquifer_doganella.columns, 'waterbody_type':'Aquifer Doganella'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : aquifer_luco.columns, 'waterbody_type':'Aquifer Luco'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : aquifer_petrignano.columns, 'waterbody_type':'Aquifer Petrignano'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : lake_biliancino.columns, 'waterbody_type':'Lake Biliancino'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : river_arno.columns, 'waterbody_type':'River Arno'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : water_spring_amiata.columns, 'waterbody_type':'Water Spring Amiata'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : water_spring_lupa.columns, 'waterbody_type':'Water Spring Lupa'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : water_spring_madonna.columns, 'waterbody_type':'Water Spring Madonna'}))
temp_df['column_category'] = temp_df.column_name.apply(lambda x: get_column_category(x))
temp_df = temp_df.groupby('waterbody_type').column_category.value_counts().to_frame()
temp_df.columns = ['counts']
temp_df = temp_df.reset_index(drop=False)
temp_df = temp_df.pivot(index='waterbody_type', columns='column_category')['counts']
temp_df['Number of Features in Dataset'] = temp_df.sum(axis=1)
display(temp_df)

sns.heatmap(temp_df, cmap='Blues', linewidth=1, ax=ax[1], vmin=0, vmax=10, annot=True)

#ax[1].set_ylabel('Waterbody Type', fontsize=16)
ax[1].set_ylabel(' ', fontsize=16)
ax[1].set_xlabel('Column Category', fontsize=16)
ax[1].set_title('Features, Number of Columns and Target Variables', fontsize=16)
ax[1].add_patch(Rectangle((1, 0), 1, 4, fill=False, alpha=1, color='dodgerblue', lw=3))
ax[1].add_patch(Rectangle((2, 4), 1, 1, fill=False, alpha=1, color='dodgerblue', lw=3))
ax[1].add_patch(Rectangle((4, 4), 1, 1, fill=False, alpha=1, color='dodgerblue', lw=3))
ax[1].add_patch(Rectangle((3, 5), 1, 1, fill=False, alpha=1, color='dodgerblue', lw=3))
ax[1].add_patch(Rectangle((2, 6), 1, 3, fill=False, alpha=1, color='dodgerblue', lw=3))
for tick in ax[1].xaxis.get_major_ticks():
    tick.label.set_fontsize(14) 
for tick in ax[1].yaxis.get_major_ticks():
    tick.label.set_fontsize(14)
    #tick.label.set_rotation('horizontal')
    tick.label.set_rotation(45) 
    
plt.show()
The datasets: ['Aquifer_Doganella.csv', 'Aquifer_Auser.csv', 'Water_Spring_Amiata.csv', 'Lake_Bilancino.csv', 'Water_Spring_Madonna_di_Canneto.csv', 'Aquifer_Luco.csv', 'Aquifer_Petrignano.csv', 'Water_Spring_Lupa.csv', 'River_Arno.csv']
The datasets_df:                               filename waterbody_type
0                Aquifer_Doganella.csv        Aquifer
1                    Aquifer_Auser.csv        Aquifer
2              Water_Spring_Amiata.csv          Water
3                   Lake_Bilancino.csv           Lake
4  Water_Spring_Madonna_di_Canneto.csv          Water
5                     Aquifer_Luco.csv        Aquifer
6               Aquifer_Petrignano.csv        Aquifer
7                Water_Spring_Lupa.csv          Water
8                       River_Arno.csv          River


/opt/conda/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
column_categoryDateDepth to GroundwaterFlow RateHydrometryLake LevelRainfallTemperatureVolumeNumber of Features in Dataset
waterbody_type
Aquifer Auser1.05.0NaN2.0NaN10.04.05.027.0
Aquifer Doganella1.09.0NaNNaNNaN2.02.08.022.0
Aquifer Luco1.04.0NaNNaNNaN10.04.03.022.0
Aquifer Petrignano1.02.0NaN1.0NaN1.02.01.08.0
Lake Biliancino1.0NaN1.0NaN1.05.01.0NaN9.0
River Arno1.0NaNNaN1.0NaN14.01.0NaN17.0
Water Spring Amiata1.03.04.0NaNNaN5.03.0NaN16.0
Water Spring Lupa1.0NaN1.0NaNNaN1.0NaNNaN3.0
Water Spring Madonna1.0NaN1.0NaNNaN1.01.0NaN4.0

png

  • Both Date and Rainfall appear in all datasets
  • Temperature appear in all datasets except Water Spring Lupa.
  • Target Variables are shared in the right figure.
    • The Rectangle figure indicates target variable - Depth To Groundwater, Flow Rate, Hydrometry Lake Level
  • Since I have no good knowledge of Water Engineering, it’s necessary to define each target variables.
    • Depth to Groundwater: groundwater level detected by the piezometer (m from the ground floor)
    • Flow Rate (water spring/aquifer): flow rate (l/s)
    • Flow Rate (lake): flow rate ($m^3$/s)
    • Hydrometry: groundwater level detected by the hydrometric station (m)
    • Lake Level: river level (m)

1.2 Picking Data ❗

  • Since this Notebook is for newbie studying time series analysis and forecasting, it will begin with one dataset, Lake Biliancino
  • Let’s explore Water Spring Lupa data
Lake_Bilancino = pd.read_csv(f"{PATH}Lake_Bilancino.csv")
display(Lake_Bilancino)
Lake_Bilancino.info()
DateRainfall_S_PieroRainfall_MangonaRainfall_S_AgataRainfall_CavallinaRainfall_Le_CrociTemperature_Le_CrociLake_LevelFlow_Rate
003/06/2002NaNNaNNaNNaNNaNNaN249.430.31
104/06/2002NaNNaNNaNNaNNaNNaN249.430.31
205/06/2002NaNNaNNaNNaNNaNNaN249.430.31
306/06/2002NaNNaNNaNNaNNaNNaN249.430.31
407/06/2002NaNNaNNaNNaNNaNNaN249.440.31
..............................
659826/06/20200.00.00.00.00.022.50250.850.60
659927/06/20200.00.00.00.00.023.40250.840.60
660028/06/20200.00.00.00.00.021.50250.830.60
660129/06/20200.00.00.00.00.023.20250.820.60
660230/06/20200.00.00.00.00.022.75250.800.60

6603 rows × 9 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6603 entries, 0 to 6602
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Date                  6603 non-null   object 
 1   Rainfall_S_Piero      6026 non-null   float64
 2   Rainfall_Mangona      6026 non-null   float64
 3   Rainfall_S_Agata      6026 non-null   float64
 4   Rainfall_Cavallina    6026 non-null   float64
 5   Rainfall_Le_Croci     6026 non-null   float64
 6   Temperature_Le_Croci  6025 non-null   float64
 7   Lake_Level            6603 non-null   float64
 8   Flow_Rate             6582 non-null   float64
dtypes: float64(8), object(1)
memory usage: 464.4+ KB
  • Missing Value appears in some variables.
Lake_Bilancino = Lake_Bilancino.dropna().reset_index(drop=True)
Lake_Bilancino.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6025 entries, 0 to 6024
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Date                  6025 non-null   object 
 1   Rainfall_S_Piero      6025 non-null   float64
 2   Rainfall_Mangona      6025 non-null   float64
 3   Rainfall_S_Agata      6025 non-null   float64
 4   Rainfall_Cavallina    6025 non-null   float64
 5   Rainfall_Le_Croci     6025 non-null   float64
 6   Temperature_Le_Croci  6025 non-null   float64
 7   Lake_Level            6025 non-null   float64
 8   Flow_Rate             6025 non-null   float64
dtypes: float64(8), object(1)
memory usage: 423.8+ KB
  • Data Type of Date is object, let’s convert it to datetime
Lake_Bilancino['Date_dt'] = pd.to_datetime(Lake_Bilancino.Date, format = '%d/%m/%Y')
  • I have no idea what to do with Rainfall_* variables. So will delete it.
Lake_Bilancino.drop(['Date', 
                     'Rainfall_S_Piero', 
                     'Rainfall_Mangona', 
                     'Rainfall_S_Agata', 
                     'Rainfall_Cavallina', 
                     'Rainfall_Le_Croci'], axis = 1, inplace=True)
display(Lake_Bilancino)
Temperature_Le_CrociLake_LevelFlow_RateDate_dt
06.50251.210.52004-01-02
14.45251.280.52004-01-03
22.00251.350.52004-01-04
30.90251.370.52004-01-05
42.25251.420.52004-01-06
...............
602022.50250.850.62020-06-26
602123.40250.840.62020-06-27
602221.50250.830.62020-06-28
602323.20250.820.62020-06-29
602422.75250.800.62020-06-30

6025 rows × 4 columns

  • Let’s Draw Time Series Data
fig, ax = plt.subplots(nrows = 3, ncols = 1, figsize=(10, 10))
ax[0].plot(Lake_Bilancino['Date_dt'], Lake_Bilancino['Temperature_Le_Croci'])
ax[0].set_title("Temperature_Le_Croci")
ax[1].plot(Lake_Bilancino['Date_dt'], Lake_Bilancino['Lake_Level'])
ax[1].set_title("Lake_Level")
ax[2].plot(Lake_Bilancino['Date_dt'], Lake_Bilancino['Flow_Rate'])
ax[2].set_title("Flow_Rate")

plt.show()

png

Reference Notebook

https://www.kaggle.com/iamleonie/eda-quenching-the-thirst-for-insights