Project -- Dry Wells

Predicting water shortages in California’s Central Valley using periodic water level data

California’s Central Valley produces a quarter of the nation’s food and supplies a fifth of its groundwater demand.1 However, the region is prone to drought and water shortages, and predicting those shortages is difficult. Climate change has thrown off the statistical models, producing large errors in official water supply forecasts, including a 68% error in this year’s forecast for the Sacramento Valley region.2 Perhaps in response to the poor performance of these complex hydrological models, the California Department of Water Resources recently released its Drought and Water Shortage Risk Explorer, an interactive map of the shortage risk of wells in the state. However, this risk assessment does not incorporate past data into a statistical model.

This project uses the Household Water Supply Shortage Reporting System, which logs 3,765 reports of water shortages in California from 2014 onwards. We combine this with a dataset of all wells in the state. The predictors are the periodic groundwater level measurements from the Department of Water Resources.

from plotly import express as px
import pandas as pd
import numpy as np

Let’s have a look at the (cleaned) data. Here are the wells in our database: 289,419 wells in total.

wells = pd.read_csv('../cleaned/wells.csv')
print(wells.shape)
print(wells.columns)
(289419, 3)
Index(['latitude', 'longitude', 'gis_gama_study_area'], dtype='object')
fig = px.scatter_mapbox(wells.dropna(subset=['gis_gama_study_area']).sample(frac=0.2),
                        mapbox_style="carto-positron",
                        lat='latitude',
                        lon='longitude',
                        color='gis_gama_study_area',
                        opacity=0.5,
                        zoom=5)
# fig.show()
stations = pd.read_csv('../cleaned/st.csv')
print(stations.shape)
print(stations.columns)
(45269, 3)
Index(['site_code', 'latitude', 'longitude'], dtype='object')

Here are the stations which can be linked to periodic groundwater level measurements.

fig = px.scatter_mapbox(stations,
                        mapbox_style="carto-positron",
                        lat='latitude',
                        lon='longitude',
                        opacity=0.5,
                        zoom=4)
# fig.show()
shortages = pd.read_csv('../cleaned/sh.csv')
print(shortages.shape)
print(shortages.columns)
(3765, 4)
Index(['latitude', 'longitude', 'date', 'y'], dtype='object')
wells['type'] = 'well'
stations['type'] = 'station'
shortages['type'] = 'shortage'
stacked = pd.concat([wells, stations, shortages], join='inner')

Here is the geographical distribution of the wells, shortages, and stations. The map shows a 20% random sample.

fig = px.scatter_mapbox(stacked.sample(frac=0.2),
                        mapbox_style="carto-positron",
                        lat='latitude',
                        lon='longitude',
                        opacity=0.3,
                        color='type',
                        color_discrete_sequence=["green", "blue", "red"],
                        zoom=5)
fig.write_html("../output/stacked.html")

The boundaries of our study are defined by the GAMA study areas that fall within the Central Valley. The region was chosen for its economic significance and its density of measurement stations.

For the first step of the analysis, we explore the spatial correlation between measurements and shortages. The data is processed in the following steps (sketched in code after the list).

  • Merge the stations dataset (which only has geographical coordinates) with the measurements via an identifier code.
  • Take the difference of the groundwater elevation between measurements.
  • Spatially aggregate the measurements by rounding off geographical coordinates to form a grid.
  • Take the mean across time.
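
Here is a minimal sketch of those steps. The measurements file and its column names ('msmt_date', 'gwe' for groundwater elevation) are assumptions that may differ from the actual cleaned data; the 0.05-degree grid spacing is inferred from the coordinates in the aggregated data below.

measurements = pd.read_csv('../cleaned/measurements.csv')  # hypothetical file name
m = measurements.merge(stations, on='site_code')           # attach coordinates to each reading
m = m.sort_values(['site_code', 'msmt_date'])
m['d_gwe'] = m.groupby('site_code')['gwe'].diff()          # change in elevation between measurements
m['latitude'] = (m['latitude'] * 20).round() / 20          # snap to a 0.05-degree grid
m['longitude'] = (m['longitude'] * 20).round() / 20
m_agg = (m.groupby(['latitude', 'longitude'])['d_gwe']
          .mean()                                          # average across time per grid square
          .reset_index())
# Well and shortage counts per square would be joined on the same rounded coordinates.
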
d = pd.read_csv('../cleaned/m_agg.csv')
d.columns
Index(['latitude', 'longitude', 'd_gwe', 'wse_ch_av', 'wells', 'shortages',
       'sh_frac'],
      dtype='object')
d.head()
   latitude  longitude     d_gwe  wse_ch_av  wells  shortages  sh_frac
0     35.00    -118.95  0.484028  -1.094127     23        0.0      0.0
1     35.00    -118.90 -0.736640  -5.352953     27        0.0      0.0
2     35.00    -118.85 -0.607778  -5.317659     17        0.0      0.0
3     35.05    -119.25  0.019417   0.514444     15        0.0      0.0
4     35.05    -119.15  0.236467   0.026113     27        0.0      0.0

Here’s a look at the frequency of shortages relative to the number of wells. This is the outcome that we aim to predict.
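
Judging from the column names, the outcome appears to be the number of shortage reports divided by the number of wells in each grid square; this reconstruction is an assumption.

d['sh_frac'] = d['shortages'] / d['wells']  # assumed: shortage reports per well in each square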

fig = px.scatter_mapbox(d,
                        mapbox_style="carto-positron",
                        lat='latitude',
                        lon='longitude',
                        color='sh_frac',
                        opacity=0.8,
                        zoom=5,
                        labels={"sh_frac": "Frequency of shortages"}
                       )

# fig.show()

For the model, we use PySAL, a package for spatial analysis.

After the data processing and aggregation, each unit of observation in our dataset is roughly a 3-mile-by-3-mile square. The predictors are the average change in groundwater level in that square since 2013, and the dependent variable is the relative frequency of shortages.

As a benchmark, here are the predictions given by a linear regression model without accounting for spatial correlations.
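
The model is fit outside this notebook (the predictions are read from CSV); here is a hedged sketch of what the baseline might look like using PySAL’s spreg module, with the predictor columns assumed from the aggregated data above.

from spreg import OLS

y = d[['sh_frac']].values               # outcome: relative frequency of shortages
X = d[['d_gwe', 'wse_ch_av']].values    # predictors: groundwater level changes
m1 = OLS(y, X, name_y='sh_frac', name_x=['d_gwe', 'wse_ch_av'])
print(m1.summary)                       # reports R-squared and coefficient estimates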

m_preds = pd.read_csv('../cleaned/m_preds.csv')
fig = px.scatter_mapbox(m_preds,
                        mapbox_style="carto-positron",
                        lat='latitude',
                        lon='longitude',
                        color='m1preds',
                        zoom=5,
                        title='Baseline linear regression model')
# fig.show()

This model doesn’t seem to produce very good predictions. The R-squared statistic is only 0.0095, suggesting a poor fit, and the map of the predictions also fails the eye test. However, one feature of the dataset that PySAL can exploit is spatial correlation: the predictors at one location may contain useful signal for the outcome at nearby locations. Indeed, the spatial dependence diagnostic tests in PySAL reject the null hypothesis of no spatial dependence. Thus, to improve the model’s accuracy, we include spatial lags; that is, we add as predictors a weighted average of the predictors at each square’s neighbors. As our data is (roughly) carved into a grid, it makes the most sense to include the lags from the 4 and 8 nearest neighbors. A sketch of the 4-neighbor specification follows.
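
Here is a minimal sketch of that specification, assuming libpysal for the weights and spreg for the regression; whether the original model enters the lags exactly this way is an assumption.

from libpysal.weights import KNN, lag_spatial
from spreg import OLS

coords = d[['longitude', 'latitude']].values
w4 = KNN.from_array(coords, k=4)        # each square's 4 nearest neighbors
w4.transform = 'r'                      # row-standardize so each lag is a neighbor mean
X = d[['d_gwe', 'wse_ch_av']].values
WX = np.column_stack([lag_spatial(w4, X[:, j]) for j in range(X.shape[1])])
m2 = OLS(d[['sh_frac']].values, np.hstack([X, WX]), w=w4, spat_diag=True,
         name_x=['d_gwe', 'wse_ch_av', 'W_d_gwe', 'W_wse_ch_av'])

With spat_diag=True, the summary includes the Lagrange Multiplier tests for spatial dependence referred to above.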

fig = px.scatter_mapbox(m_preds,
                        mapbox_style="carto-positron",
                        lat='latitude',
                        lon='longitude',
                        color='m2preds',
                        zoom=5,
                        labels={"m2preds": "predicted values"},
                        title='Model including lags from 4 nearest neighbors')
# fig.show()
fig = px.scatter_mapbox(m_preds,
                        mapbox_style="carto-positron",
                        lat='latitude',
                        lon='longitude',
                        color='m3preds',
                        zoom=5,
                        labels={"m3preds": "predicted values"},
                        title='Model including lags from 8 nearest neighbors'
                       )
# fig.show()

The models with 4 and 8 nearest neighbors produce predictions that look closer to the shortage map, and have a higher R-squared of around 0.024. That said, the R-squared is still predictably very low given the parsimony of a model with only 2 predictors and their spatial lags. The coefficients for the spatial lags are negative and statistically significant. To give a rough sense of the magnitudes, the coefficient for groundwater elevation change is -0.0031306 per foot, so a 10-foot decrease in groundwater level over the years (-10 × -0.0031306 ≈ 0.031) is accompanied by a roughly 3 percentage point increase in the risk of water shortage.

For the model to be useful for making predictions ahead of time, we need to exploit the time dimension of our data. Now, instead of averaging the measurements across time, we aggregate the measurements (which are taken sporadically, at roughly one-month intervals) into monthly data, fill missing months by carrying the earlier value forward, and take the first difference; a sketch of this construction follows the paragraph. We also include the first and second lags (measurements from the two prior months) to augment our predictors. After running the model, we obtain predicted values for every month-coordinate combination in the dataset. The average across time of the predicted values is mapped below, but the full set of predictions contains a lot more information. Looking at the regression summary, both the water level measurement and its first lag are statistically significant. In fact, the first lag of the water level measurement is statistically significant (p-value of 0.01) regardless of whether the measurement for the month itself is included, suggesting that water level changes may predict shortages before they occur.
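
As mentioned, here is a hedged sketch of the monthly panel construction; the long-format frame m and its columns carry over from the earlier sketch and remain assumptions.

m['month'] = pd.to_datetime(m['msmt_date']).dt.to_period('M')
panel = (m.groupby(['latitude', 'longitude', 'month'])['gwe'].mean()
          .unstack('month')
          .ffill(axis=1))               # carry the earlier value into missing months
diffs = panel.diff(axis=1)              # first difference: month-over-month change
long = diffs.stack().rename('d_gwe').reset_index()
long = long.sort_values(['latitude', 'longitude', 'month'])
g = long.groupby(['latitude', 'longitude'])['d_gwe']
long['d_gwe_l1'] = g.shift(1)           # first lag: previous month's change
long['d_gwe_l2'] = g.shift(2)           # second lag: change from two months prior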

mt_preds = pd.read_csv('../cleaned/mt_preds.csv')
fig = px.scatter_mapbox(mt_preds,
                        mapbox_style="carto-positron",
                        lat='latitude',
                        lon='longitude',
                        color='preds',
                        zoom=5)
# fig.show()

In conclusion, this exercise has demonstrated that periodic groundwater level measurements contain useful signals for well shortages. In particular, this prediction task calls for incorporating both spatial and temporal lags. To extend the work of this project, future efforts could improve the method of geographical aggregation, take into account climate (especially seasonality), and include more detailed information on local hydrology such as the proximity to water features. As a concrete recommendation, the relevant authorities would do well to improve data availability and linkage, and to judiciously incorporate hydrological data into their existing risk-prediction systems.

  1. https://ca.water.usgs.gov/projects/central-valley/about-central-valley.html 

  2. https://calmatters.org/newsletters/whatmatters/2022/02/california-water-inaccurate-forecasts/ 

Written on March 8, 2022