From Administrative Chaos to Analytical Cohorts: A Three-Stage Normalisation Pipeline for Longitudinal University Administrative Records
By: H. R. Paz
Potential Business Impact:
Cleans messy student records for better learning insights.
The growing use of longitudinal university administrative records in data-driven decision-making often overlooks a critical layer: how raw, inconsistent data are normalised before modelling. This article presents a three-stage normalisation pipeline for a dataset of 24,133 engineering students at a Latin American public university, spanning four decades (1980-2019). The pipeline comprises: (i) N1 CENSAL, harmonising demographics into a single person-level layer; (ii) N1b IDENTITY RESOLUTION, consolidating duplicate identifiers into a canonical ID while preserving an audit trail; and (iii) N1c GEO and SECONDARY-SCHOOL NORMALISATION, which builds reference tables, classifies school types (state national, state provincial, private secular, private religious), and flags irrecoverable cases as DATA_MISSING. The pipeline preserves 100% of students, achieves full geocoding, and yields valid school types for 56.6% of the population. The remaining 43.4% are identified as structurally missing due to legacy enrolment practices rather than stochastic non-response. Forensic analysis (chi-square, logistic regression) shows missingness is highly predictable from entry decade and geography, confirming a structural, historically induced mechanism. The article contributes: (a) a transparent, reproducible normalisation pipeline tailored to higher education; (b) a framework for treating structurally missing information without speculative imputation; and (c) guidance on defining analytically coherent cohorts (full population vs. secondary-school-informed subcohorts) for downstream learning analytics and policy evaluation.
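The three stages described above can be illustrated with a minimal sketch. This is not the author's implementation: every function name, field name (`raw_id`, `school`), label string, and the rule of taking the smallest raw ID as the canonical ID are assumptions made for illustration. The abstract does not say how duplicates are detected, so the sketch takes the duplicate pairs as given input. It shows canonical-ID consolidation with an audit trail, explicit DATA_MISSING flagging instead of imputation, and the split into a full cohort and a secondary-school-informed subcohort.

```python
from collections import defaultdict

DATA_MISSING = "DATA_MISSING"
# The four school types named in the abstract; the label strings are assumed.
VALID_SCHOOL_TYPES = {
    "state_national", "state_provincial", "private_secular", "private_religious",
}


def resolve_identities(records, duplicate_pairs):
    """Map each raw ID to a canonical ID and return an audit trail.

    duplicate_pairs holds (raw_id_a, raw_id_b) tuples already judged to be
    the same person. The canonical ID is the smallest raw ID in each group
    (an arbitrary rule assumed for this sketch).
    """
    parent = {rec["raw_id"]: rec["raw_id"] for rec in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low  # the smaller ID stays as the group root

    canonical_of = {raw_id: find(raw_id) for raw_id in parent}
    audit = defaultdict(list)
    for raw_id, canonical_id in canonical_of.items():
        audit[canonical_id].append(raw_id)
    return canonical_of, {cid: sorted(ids) for cid, ids in audit.items()}


def normalise_school_type(raw_value, reference):
    """Look up a cleaned school name; flag unrecoverable values explicitly."""
    key = (raw_value or "").strip().lower()
    school_type = reference.get(key)
    return school_type if school_type in VALID_SCHOOL_TYPES else DATA_MISSING


def build_person_layer(records, duplicate_pairs, reference):
    """Collapse raw records into one row per canonical person."""
    canonical_of, audit = resolve_identities(records, duplicate_pairs)
    persons = {}
    for rec in records:
        cid = canonical_of[rec["raw_id"]]
        person = persons.setdefault(
            cid, {"canonical_id": cid, "school_type": DATA_MISSING}
        )
        # Keep the first valid school type seen; never impute a missing one.
        if person["school_type"] == DATA_MISSING:
            person["school_type"] = normalise_school_type(rec.get("school"), reference)
    return persons, audit


def define_cohorts(persons):
    """Return (full population, secondary-school-informed subcohort)."""
    full = sorted(persons)
    informed = sorted(
        cid for cid, p in persons.items() if p["school_type"] != DATA_MISSING
    )
    return full, informed


# Toy data, invented for illustration only.
records = [
    {"raw_id": "A001", "school": "Colegio Nacional 1"},
    {"raw_id": "A002", "school": None},  # duplicate identifier of A001
    {"raw_id": "B001", "school": "  instituto san jose "},
    {"raw_id": "C001", "school": ""},  # legacy record with no school data
]
duplicate_pairs = [("A001", "A002")]
reference = {
    "colegio nacional 1": "state_national",
    "instituto san jose": "private_religious",
}

persons, audit = build_person_layer(records, duplicate_pairs, reference)
full_cohort, informed_cohort = define_cohorts(persons)
```

In this toy run every person remains in the full cohort, while the person with no recoverable school record is excluded only from the secondary-school-informed subcohort, mirroring the distinction the abstract draws between the full population and the 56.6% with valid school types.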