AI Summary
This study addresses inferential bias in regressions on topic-model output. The usual plug-in workflow treats estimated document-level topic shares as observed covariates, yet at fixed document length a document's topic mixture cannot be consistently recovered from its own words, and the workflow does not propagate topic-estimation uncertainty into the regression. The authors propose a supervised regression method based on corrected spectral moments that identifies the regression coefficients directly from response-weighted word moments, without estimating document-level topic shares. The correction depends on the unknown total Dirichlet concentration, which the authors identify through operator commutativity for three or more topics under a generic finite-probe condition. In the asymptotic regime where the number of documents grows while document length stays fixed, the estimator is asymptotically linear, and sandwich standard errors built from document-level moment contributions carry the uncertainty in the estimated concentration through to inference on the coefficients. Simulations show near-nominal confidence-interval coverage in settings where plug-in topic-share regressions can undercover, and an application to top economics journals illustrates contrast inference for latent topic effects.
Abstract
Topic models are often used as dimension-reduction tools before regression, with estimated document-level topic shares treated as observed covariates. This plug-in workflow creates two inferential difficulties: valid inference requires a regular first-stage-to-second-stage expansion that propagates topic-estimation uncertainty, and, at fixed document length, a document's topic mixture cannot be consistently recovered from its own words even when the population topic matrix is known. Corrected spectral moment methods for latent Dirichlet allocation (LDA) offer a starting point: when the total Dirichlet concentration is known, low-order word moments can be corrected to yield operators diagonal in the latent topic basis. We extend this to downstream regression. Under a finite LDA model with response residuals orthogonal to the low-order token moments used for identification, response-weighted word moments admit the same correction, and the resulting supervised operator identifies the regression coefficient $\beta$ directly, without estimating document-level topic shares. The main obstacle is that the correction depends on the unknown total concentration $\alpha_0$. We show that, for $k \ge 3$ topics and under a generic finite-probe condition, $\alpha_0$ is identified by commutativity: at the true value a family of corrected word-moment operators commute, whereas away from it they generically do not. This yields a feasible estimator and lets uncertainty in $\hat\alpha_0$ propagate into inference for $\beta$. The estimator is asymptotically linear as the number of documents grows with fixed document length, with sandwich standard errors from document-level moment contributions. Simulations show near-nominal coverage where plug-in topic-share regressions can undercover, and an application to top economics journals illustrates contrast inference for latent topic effects.
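To make the phrase "corrected to yield operators diagonal in the latent topic basis" concrete, the sketch below checks the standard second-order correction from the spectral LDA literature at the population level. It is not the paper's supervised operator or its commutativity-based estimator. It only illustrates why the correction needs the right total concentration. The vocabulary size, number of topics, Dirichlet parameters, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def population_moments(topics, alpha):
    """Exact E[x1] and E[x1 x2^T] for two distinct tokens of one LDA document.

    `topics` is a (V x k) matrix whose columns are topic-word distributions;
    `alpha` is the Dirichlet parameter vector (illustrative values only).
    """
    alpha0 = alpha.sum()
    mean_theta = alpha / alpha0
    # Dirichlet second moment: E[theta theta^T]
    second_theta = (np.outer(alpha, alpha) + np.diag(alpha)) / (alpha0 * (alpha0 + 1.0))
    return topics @ mean_theta, topics @ second_theta @ topics.T

def corrected_second_moment(pair_moment, mean_moment, a):
    """Pair moment minus a/(a+1) times the outer product of the first moment."""
    return pair_moment - a / (a + 1.0) * np.outer(mean_moment, mean_moment)

def offdiag_norm(mat):
    """Frobenius norm of the off-diagonal part of a square matrix."""
    return np.linalg.norm(mat - np.diag(np.diag(mat)))

rng = np.random.default_rng(0)
V, k = 8, 3                                    # hypothetical vocabulary size and topic count
topics = rng.dirichlet(np.ones(V), size=k).T   # (V x k), each column sums to one
alpha = np.array([0.5, 1.0, 1.5])              # hypothetical Dirichlet parameters
alpha0 = alpha.sum()

m1, m2 = population_moments(topics, alpha)
to_topic_coords = np.linalg.pinv(topics)       # left inverse when topics has full column rank

def in_topic_basis(a):
    """Corrected second moment at candidate concentration `a`, in topic coordinates."""
    return to_topic_coords @ corrected_second_moment(m2, m1, a) @ to_topic_coords.T

at_truth = in_topic_basis(alpha0)              # correction uses the true concentration
at_wrong = in_topic_basis(alpha0 + 1.0)        # correction uses a misspecified value
expected = np.diag(alpha) / (alpha0 * (alpha0 + 1.0))

print("off-diagonal norm at true alpha0: ", offdiag_norm(at_truth))
print("off-diagonal norm at wrong alpha0:", offdiag_norm(at_wrong))
```

At the true concentration the corrected moment is diagonal in topic coordinates, while a misspecified value leaves a rank-one off-diagonal term proportional to the outer product of the Dirichlet parameters with themselves. With only sample moments and an unknown topic matrix this check is not available directly, which is the gap the abstract's commutativity argument is meant to close.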