---
title: "Population Downscaling Using Areal Interpolation - A Comparative Analysis"
date: "`r Sys.Date()`"
author: "Marios Batsaris"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Population Downscaling Using Areal Interpolation - A Comparative Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

Areal Interpolation may be defined as the process of transforming data reported over one set of spatial units (source) to another (target). Its application to population data has attracted considerable attention during the last few decades, and a large number of methods have been reported in the scientific literature. Most of them focus on improving accuracy through more sophisticated techniques rather than on developing standardized methods. As a result, only a few implementation tools exist within the R community.

One of the most common and straightforward methods of Areal Interpolation is Areal Weighting Interpolation (AWI). AWI proportionately interpolates the population values of the source features based on areal (or spatial) weights calculated from the area of intersection between the source and the target zones.

The [`sf`](https://cran.r-project.org/package=sf/) and [`areal`](https://cran.r-project.org/package=areal/) packages provide Areal Interpolation functionality within the R ecosystem. Both packages implement AWI. `sf` offers extensive and intensive interpolation options and calculates the areal weights based on the total area of the source features (total weights). This makes `sf` suitable for completely overlapping data. `areal` extends the existing functionality of the `sf` package by introducing an additional formula for data without complete overlap. In this case, weights are calculated using the sum of the remaining source areas after the intersection (sum weights).

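The difference between total and sum weights can be illustrated with a small worked example. The sketch below uses made-up numbers (one source block and three target buildings, none taken from the case study) and hypothetical variable names prefixed with `toy_`; it only mirrors the two weighting formulas described above and is not the internal code of either package.

```{r}
# Toy data (hypothetical values, for illustration only)
toy_source_area <- 1000                          # area of one source block (m^2)
toy_source_pop  <- 100                           # residents reported for that block
toy_inter_area  <- c(b1 = 100, b2 = 150, b3 = 250)  # building footprints inside the block (m^2)

# total weights: intersection area divided by the full area of the source feature
toy_w_total <- toy_inter_area / toy_source_area

# sum weights: intersection area divided by the sum of all intersection areas
toy_w_sum <- toy_inter_area / sum(toy_inter_area)

# extensive interpolation: source population times weight
toy_est_total <- toy_source_pop * toy_w_total
toy_est_sum   <- toy_source_pop * toy_w_sum

# only the sum weights preserve the source total when targets cover part of the source
sum(toy_est_total)
sum(toy_est_sum)
```

Because the buildings cover only half of the block in this toy setting, total weights allocate half of the residents, whereas sum weights allocate all of them.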
When Areal Interpolation is applied to urban population data (small scale applications), the source features (such as city blocks or census tracts) are larger than the target features (such as buildings) in terms of footprint area. In such cases the `sf` functionality (total weights) is unable to calculate areal weights properly and is therefore not ideal. The `areal` functionality may be confusing for novice R (or GIS) users, as it is not obvious that the weight option should be set to ``sum`` to calculate areal weights correctly.

To overcome these limitations, [`populR`](https://cran.r-project.org/package=populR) is introduced. `populR` is designed for Areal Interpolation of urban population. It provides an AWI approach that matches the existing functionality of `areal` using ``sum weights`` and additionally proposes a Volume Weighting Interpolation (VWI) approach which, to our knowledge, extends the existing Areal Interpolation functionality within the R ecosystem. VWI uses the area of intersection between source and target features multiplied by the building height or number of floors (volume) to guide the interpolation process.

In this vignette, a comparative analysis of the Areal Interpolation alternatives available in R is carried out. `sf`, `areal` and `populR` results are obtained and then compared to a more realistic population distribution.

## Case Study

A small part of the city of Mytilini, Lesvos, Greece was chosen as the case study (figure below). The study area consists of 9 city blocks (source) counting 911 residents and 179 building units (target) including floor number information. These data are included in the `populR` package for further experimentation.

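Before turning to the built-in data, the volume weighting idea described earlier can be sketched with toy numbers. The example below assumes that volume is the footprint area multiplied by the number of floors and that weights are normalised by the sum of volumes within the block, by analogy with sum weights; this normalisation, the values and the `toy_` variable names are assumptions made for illustration and are not taken from the `populR` source code.

```{r}
# Toy block with three buildings (hypothetical values)
toy_block_pop <- 100                                # residents of the block
toy_footprint <- c(b1 = 100, b2 = 150, b3 = 250)    # footprint areas (m^2)
toy_floors    <- c(b1 = 5,   b2 = 2,   b3 = 1)      # number of floors

# area-based weights (AWI with sum weights)
toy_w_area <- toy_footprint / sum(toy_footprint)

# volume-based weights (VWI): footprint area times number of floors
toy_volume   <- toy_footprint * toy_floors
toy_w_volume <- toy_volume / sum(toy_volume)

# compare the two allocations side by side
round(rbind(awi = toy_block_pop * toy_w_area,
            vwi = toy_block_pop * toy_w_volume), 1)
```

Both allocations sum to the block population, but the volume-based one shifts residents towards the tall building `b1` even though it has the smallest footprint.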
```{r sarea, fig.height = 5, fig.width = 5, fig.align = "center"}
# attach library
library(populR)

# load data
data('src')
data('trg')

source <- src
target <- trg

# plot data
plot(source['geometry'], col = "#634B56", border = NA)
plot(target['geometry'], col = "#FD8D3C", add = TRUE)
```

## Implementation

This section demonstrates the `sf`, `areal` and `populR` packages. First, the packages are attached and the `populR` built-in data are loaded. Then, the Areal Interpolation function of each package is executed.

The ``st_interpolate_aw()`` function of the `sf` package takes:

1. ``x``: an object of class `sf` with data to be interpolated
2. ``to``: the target geometries (sf object)
3. ``extensive``: whether to use extensive (TRUE) or intensive interpolation (FALSE)

`areal` provides the ``aw_interpolate()`` function which requires:

1. ``.data``: an sf object to be used as target
2. ``tid``: target identification numbers
3. ``source``: an sf object with data to be interpolated
4. ``sid``: source identification numbers
5. ``weight``: may be either ``sum`` or ``total`` for extensive interpolation and ``sum`` for intensive interpolation
6. ``output``: whether to return an `sf` object or a `tibble`
7. ``extensive``: a vector of quoted (extensive) variable names - optional if intensive is specified
8. ``intensive``: a vector of quoted (intensive) variable names - optional if extensive is specified

Finally, `populR` offers the ``pp_estimate()`` function which takes:

1. ``target``: an sf object to be used as target
2. ``source``: an sf object with data to be interpolated
3. ``sid``: source identification number
4. ``spop``: source population values to be interpolated
5. ``volume``: target volume information (number of floors or height) - required for the vwi approach
6. ``point``: whether to return point geometries (TRUE) or not (FALSE) - optional
7. ``method``: whether to use awi or vwi

Evidently, the ``st_interpolate_aw()`` function of the `sf` package requires only 3 arguments, which makes it very easy to use, while `populR` requires at least 5 and `areal` at least 7 arguments, which potentially increases the implementation complexity. On the other hand, only `areal` may be used for multiple interpolations at once, as the ``extensive`` or ``intensive`` argument takes a vector of quoted variable names (not covered in this vignette).

For the reader's convenience, names were shortened as follows:

* ``awi``: populR awi approach
* ``vwi``: populR vwi approach
* ``aws``: areal using extensive interpolation and sum weights
* ``awt``: areal using extensive interpolation and total weights
* ``sf``: sf using extensive interpolation

```{r setup, message=FALSE, warning=FALSE}
# attach libraries
library(populR)
library(areal)
library(sf)

# load data
data('src')
data('trg')

source <- src
target <- trg

# populR - awi
awi <- pp_estimate(target = target, source = source, spop = pop, sid = sid,
                   method = awi)

# populR - vwi
vwi <- pp_estimate(target = target, source = source, spop = pop, sid = sid,
                   volume = floors, method = vwi)

# areal - sum weights
aws <- aw_interpolate(target, tid = tid, source = source, sid = 'sid',
                      weight = 'sum', output = 'sf', extensive = 'pop')

# areal - total weights
awt <- aw_interpolate(target, tid = tid, source = source, sid = 'sid',
                      weight = 'total', output = 'sf', extensive = 'pop')

# sf - total weights
sf <- st_interpolate_aw(source['pop'], target, extensive = TRUE)
```

## Results

As mentioned in the previous section, the study area counts 911 residents. The code chunk below shows that ``awi``, ``vwi`` and ``aws`` preserve this total, as their estimates sum to 911, while the ``awt`` and ``sf`` estimates sum to less. This is expected, as both of these approaches divide by the total area of the source features during the interpolation process and are therefore appropriate only when source and target features completely overlap.

```{r}
# sum initial values
sum(source$pop)

# populR - awi
sum(awi$pp_est)

# populR - vwi
sum(vwi$pp_est)

# areal - awt
sum(awt$pop)

# areal - aws
sum(aws$pop)

# sf
sum(sf$pop)
```

Moreover, the ``awi`` and ``aws`` approaches produce identical results, while ``vwi`` produces somewhat different results, as shown in the code block below.

```{r}
# order values using tid
awi <- awi[order(awi$tid),]
vwi <- vwi[order(vwi$tid),]

# get values and create a df
awi_values <- awi$pp_est
vwi_values <- vwi$pp_est

awt_values <- awt$pop
aws_values <- aws$pop

sf_values <- sf$pop

df <- data.frame(vwi = vwi_values, awi = awi_values, aws = aws_values,
                 awt = awt_values, sf = sf_values)

df[1:20,]
```

### Comparison to Reference Data

Due to confidentiality concerns, population data at building level are not available in Greece. Therefore, an alternate population distribution previously published in [Batsaris et al. 2019](https://doi.org/10.4018/ijagr.2019100103) was used as the reference data set to compare the results.

These reference population values are included in the built-in data set in the field ``rf``, as shown below.

```{r}
target
```

In the code chunk below the first 20 features are presented for comparison.

```{r}
rf <- awi$rf

df <- cbind(rf, df)

df[1:20,]
```

`populR` provides a function (``pp_compare()``) to compare the results with alternate population data. ``pp_compare()`` produces a scatter diagram, a linear regression model, the coefficient of determination ($R^2$), the MAE (Mean Absolute Error) and the RMSE (Root Mean Squared Error) to investigate the relationship of the results with the reference (or other) data.

Generally, the diagrams suggest strong and positive relationships in all cases. However, ``vwi`` shows the strongest relationship, with the highest $R^2$ value and the smallest MAE among the methods, as shown below.

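For readers unfamiliar with these measures, the sketch below computes MAE, RMSE and $R^2$ by hand in base R. The two vectors are made-up values with hypothetical names, and the exact way ``pp_compare()`` derives its numbers is assumed here rather than taken from the package source; the $R^2$ is read from a simple linear model of reference on estimated values.

```{r}
# hypothetical reference and estimated values for five buildings
toy_ref <- c(4, 10, 7, 12, 3)
toy_est <- c(5, 8, 7, 13, 2)

# estimation errors
toy_err <- toy_est - toy_ref

# Mean Absolute Error and Root Mean Squared Error
toy_mae  <- mean(abs(toy_err))
toy_rmse <- sqrt(mean(toy_err^2))

# R^2 of a simple linear regression of reference on estimated values
toy_fit <- lm(toy_ref ~ toy_est)
toy_r2  <- summary(toy_fit)$r.squared

c(MAE = toy_mae, RMSE = toy_rmse, R2 = toy_r2)
```

RMSE penalises large individual errors more heavily than MAE, which is why the two measures are usually read together.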
```{r scatter, fig.height = 7, fig.width = 7.2, fig.align = "center", message=FALSE, warning=FALSE}
awi_error <- pp_compare(df, estimated = awi, actual = rf, title = "awi vs actual")
awi_error

vwi_error <- pp_compare(df, estimated = vwi, actual = rf, title = "vwi vs actual")
vwi_error

sf_error <- pp_compare(df, estimated = sf, actual = rf, title = "sf vs actual")
sf_error

awt_error <- pp_compare(df, estimated = awt, actual = rf, title = "awt vs actual")
awt_error

aws_error <- pp_compare(df, estimated = aws, actual = rf, title = "aws vs actual")
aws_error
```

The RMSE (Root Mean Squared Error) is also reported in the output above. Again, ``vwi`` yields the smallest error value.

## Comparison on Performance

Finally, a performance comparison (execution times) is carried out in this section using the [microbenchmark](https://cran.r-project.org/package=microbenchmark/) package.

Execution time measurements suggest that the `populR` functions execute much faster than those of `areal` and `sf`, as shown below. Both ``awi`` and ``vwi`` achieved the best mean execution time (about 76.74 milliseconds). ``aws`` follows with 136.67 milliseconds and finally ``awt`` with 180.53 milliseconds.

```{r}
library(microbenchmark)

# performance comparison
microbenchmark(
  suppressWarnings(pp_estimate(target = target, source = source, spop = pop, sid = sid,
                               method = awi)),
  suppressWarnings(pp_estimate(target = target, source = source, spop = pop, sid = sid,
                               volume = floors, method = vwi)),
  aw_interpolate(target, tid = tid, source = source, sid = 'sid',
                 weight = 'sum', output = 'sf', extensive = 'pop'),
  aw_interpolate(target, tid = tid, source = source, sid = 'sid',
                 weight = 'total', output = 'sf', extensive = 'pop'),
  suppressWarnings(st_interpolate_aw(source['pop'], target, extensive = TRUE))
)
```

## Summary

This vignette presented a demonstration and a comparative analysis of areal interpolation packages applied to urban population data.
Both the `sf` and `areal` packages provide general purpose AWI functionality, while the `populR` package focuses on areal interpolation of population data. Additionally, `populR` provides VWI, which extends R's existing functionality.

The city of Mytilini, Greece was used as the case study to investigate three main pillars: a) implementation, b) results, c) performance. Notes on implementation indicate that the `sf` package requires only 3 arguments, while `populR` requires at least 5 and `areal` at least 7. The results indicate that ``sf`` and ``awt`` may not be ideal for data that do not completely overlap. Moreover, ``aws`` and ``awi`` obtained the same results, while ``vwi`` outperformed the others in comparison to the reference data set. Finally, `populR` performs much faster than the `sf` and `areal` packages.