
The postlink package provides a unified suite of statistical tools
that rigorously account for record linkage errors in post-linkage
data analysis.
Record linkage is often error-prone, particularly when identifiers
used for matching records are noisy or non-unique. Mismatches (false
matches) act as a contaminant in the linked data, typically leading to
attenuated estimates in downstream analysis. The postlink R
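package currently supports three statistical frameworks to account for
potential mismatch errors during downstream regression modeling: the
Exchangeable Linkage Error (ELE) model, a frequentist mixture model, and
a Bayesian mixture model.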
The postlink package currently focuses on methods for
secondary analysis, where the individual files that were linked are not
accessible. For the primary analysis setting, when individual files are
accessible, methods that perform record linkage and analysis jointly
with direct propagation of uncertainty would be more suitable.
The long-term goal of postlink is to extend support for
a wide array of linkage and post-linkage analysis scenarios.
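The attenuation caused by mismatches can be illustrated with a small base-R simulation. This is only a sketch under assumed values (a 20% mismatch rate, a simple linear model, and mismatches generated by shuffling responses at random); it does not use postlink and is not one of the package's estimators.

```r
set.seed(1)

n     <- 5000
alpha <- 0.2                       # assumed mismatch rate (illustrative only)

x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)          # true slope is 2

# Create mismatches: pick a random subset of records and shuffle
# their responses among themselves, breaking the link to x.
mism <- sample(n, size = round(alpha * n))
y_linked <- y
y_linked[mism] <- y[sample(mism)]

# Slope from correctly linked data vs. naive slope from contaminated data
slope_true  <- unname(coef(lm(y ~ x))["x"])
slope_naive <- unname(coef(lm(y_linked ~ x))["x"])

print(c(true = slope_true, naive = slope_naive))
```

In this setup the naive slope is expected to shrink toward roughly (1 - alpha) times the true slope, which is the kind of bias the adjustment methods aim to correct.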
The package is built on a modular, object-oriented S3 architecture that decouples the specification of linkage error from the substantive statistical modeling. This provides a familiar, formula-based modeling interface.
Phase 1: Adjustment Specification
First, we define the linked data and the chosen adjustment methodology using a constructor function. These constructors validate the data and return a lightweight S3 adjustment object.
- adjELE(): Specifies the Exchangeable Linkage Error (ELE) model, using known or audited mismatch rates.
- adjMixture(): Specifies a frequentist mixture model approach that treats match status as a latent variable, estimating linkage error rates directly from data (e.g., using an overall mismatch rate, safe matches, predictors of match status, or predicted correct match probabilities). If no record linkage information is available, a constant mismatch rate is assumed.
- adjMixBayes(): Specifies a Bayesian mixture model approach, with default or informative priors, enabling parameter estimation and multiple imputation of latent match statuses using Stan.
Phase 2: Estimation & Inference
The adjustment object is subsequently passed to a standard modeling wrapper, integrating the linkage error correction into the familiar R modeling syntax:
- plglm() for generalized linear models (linear, logistic, Poisson, Gamma).
- plcoxph() for Cox proportional hazards regression.
- plsurvreg() for parametric survival models.
- plctable() for contingency table analysis.
The estimation and inference procedures supported vary by type of adjustment object. Please refer to the adj* or pl* documentation for the models currently supported.
Standard R generics (e.g., summary(), predict(), vcov(), and confint())
can be used to display and process results. These methods are derived
specifically to account for the additional steps introduced by the
linkage error adjustment.
Note: While the two-phase workflow is recommended for standard
analyses, the package's architecture isolates the core logic of each
method into independent internal routines. If preferred, the underlying
computational functions (e.g., coxphELE(), glmMixture(),
survregMixBayes()) can be used directly by supplying pre-computed design
matrices and response vectors.
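As a rough sketch of what "pre-computed design matrices and response vectors" can look like, the base-R snippet below builds both from a formula. The data frame `linked` and its columns are hypothetical and simulated for illustration; the exact arguments expected by the internal routines are not shown here and should be taken from their documentation.

```r
set.seed(42)

# Hypothetical linked data set, simulated purely for illustration
linked <- data.frame(
  y   = rpois(100, lambda = 3),
  age = rnorm(100, mean = 50, sd = 10),
  sex = factor(sample(c("F", "M"), 100, replace = TRUE))
)

# Build the model frame once, then extract the pieces from it
mf <- model.frame(y ~ age + sex, data = linked)
X  <- model.matrix(attr(mf, "terms"), data = mf)  # design matrix with intercept
y  <- model.response(mf)                          # response vector

dim(X)
head(X, 3)
```

Building `X` and `y` from the same model frame keeps rows aligned after any missing-value handling, which matters when the two are passed separately to a fitting routine.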
The development version of postlink can be installed
from GitHub or locally:
# Using devtools:
# install.packages("devtools")
devtools::install_github("postlink-group/postlink")
# Or, using pak:
# install.packages("pak")
pak::pkg_install("postlink-group/postlink")
Because postlink includes Bayesian mixture models
powered by rstan, installing the development version of the
package from source requires a working C++ toolchain to compile the
underlying models.
Depending on your operating system, please ensure you have the following installed:
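- Windows: Rtools.
- macOS: the Xcode Command Line Tools (xcode-select --install).
- Linux: a C++ compiler and the R development files (e.g., sudo apt-get install r-base-dev on Ubuntu/Debian).
(Note: Once postlink is officially released on CRAN,
Windows and macOS users will be able to download pre-compiled binaries,
bypassing this requirement.)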
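Before installing from source, it can help to check whether build tools are visible from R. The snippet below is a coarse base-R heuristic, not an official check: it only looks for `make` and a common C++ compiler name on the PATH, and the compiler R actually uses is whatever `R CMD config CXX` reports.

```r
# Look up common build tools on the PATH; missing tools come back as ""
tools <- Sys.which(c("make", "g++", "clang++"))

has_make <- nzchar(tools[["make"]])
has_cxx  <- any(nzchar(tools[c("g++", "clang++")]))

if (has_make && has_cxx) {
  message("A make program and a C++ compiler were found on the PATH.")
} else {
  message("Build tools were not found on the PATH; installation from source may fail.")
}
```

A negative result does not necessarily mean compilation will fail, since a toolchain may be configured for R without being on the PATH.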
Below is a brief example illustrating the typical workflow using
postlink.
We analyze the relationship between age at death and year of birth
using historical records linked as part of the LIFE-M project. The linked
dataset contains a mix of hand-linked records (assumed correct) and
purely machine-linked records subject to a mismatch rate of
approximately 5%. Instead of fitting a standard glm model that ignores
the mismatch errors, we use postlink to adjust for potential
mismatches using the entire linked dataset.
library(postlink)
# Load the LIFE-M demo dataset
data(lifem)
# Phase 1: Adjustment Specification
# We model the correct match indicator via logistic regression using
# name commonness scores (commf, comml) and a 5% expected mismatch rate.
adj_object <- adjMixture(
  linked.data = lifem,
  m.formula = ~ commf + comml,
  m.rate = 0.05,
  safe.matches = hndlnk
)
# Check specified adjustment
print(adj_object)
# Phase 2: Estimation & Inference
# Fit a Gaussian regression model utilizing a cubic polynomial for year of birth.
fit <- plglm(
  age_at_death ~ poly(unit_yob, 3, raw = TRUE),
  family = "gaussian",
  adjustment = adj_object
)
# View model results
summary(fit)
confint(fit)