Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data

📅 2025-10-01
🤖 AI Summary
Problem: Online job postings data are hard to access and are not built in a standard or transparent way, while the O*NET occupational database is updated infrequently and relies on small survey samples. Method: This study uses O*NET as the framework for a set of NLP extraction tools, released as the open-source Job Ad Analysis Toolkit (JAAT), and applies them to more than 155 million online job ads from the NLx Corpus. The tools extract O*NET tasks, occupation (SOC) codes, tools and technologies, wages, skills, industry, and other attributes, yielding more than 10 billion data points. Reliability and accuracy are assessed with out-of-sample testing and an LLM-as-a-Judge evaluation. Contribution/Results: The extracted features are aggregated by monthly active jobs at the occupation, state, and industry level for 2015 to 2025 and released as public-use data, with illustrated applications to research in education and workforce development.

📝 Abstract
Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and other features. We describe the construction of a dataset of occupation-, state-, and industry-level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the potential for research and future uses in education and workforce development.

Problem

Research questions and friction points this paper is trying to address.

Extracting structured O*NET features from online job postings
Addressing infrequent updates and small samples in occupational data
Building transparent public labor market data from job ads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using O*NET framework for NLP extraction tools
Developing open-source Job Ad Analysis Toolkit
Aggregating job data into public monthly datasets
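
The monthly aggregation named in the last item can be illustrated with a minimal sketch. It is not the paper's pipeline and does not use JAAT: the record fields (`posted`, `closed`, `soc`, `state`, `skills`), the helper names, and the sample records are all hypothetical, and it assumes a posting counts as active in every calendar month between its posting date and its closing date, which the abstract does not specify.

```python
from collections import defaultdict
from datetime import date


def months_active(start, end):
    """Yield (year, month) for each calendar month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def aggregate(postings):
    """Count active postings and skill shares per (month, SOC code, state) cell."""
    active = defaultdict(int)
    skill_counts = defaultdict(lambda: defaultdict(int))
    for p in postings:
        for ym in months_active(p["posted"], p["closed"]):
            key = (ym, p["soc"], p["state"])
            active[key] += 1
            # set() so a skill listed twice in one ad is counted once
            for skill in set(p["skills"]):
                skill_counts[key][skill] += 1
    return {
        key: {
            "active_jobs": n,
            # share of active postings in the cell that mention each skill
            "skill_share": {s: c / n for s, c in skill_counts[key].items()},
        }
        for key, n in active.items()
    }


# Made-up records standing in for already-extracted posting features.
postings = [
    {"posted": date(2024, 1, 15), "closed": date(2024, 2, 10),
     "soc": "15-1252", "state": "OH", "skills": ["python", "sql"]},
    {"posted": date(2024, 2, 1), "closed": date(2024, 2, 20),
     "soc": "15-1252", "state": "OH", "skills": ["sql"]},
]
table = aggregate(postings)
```

Each key of `table` is one (month, occupation, state) cell. Publishing only such cells, rather than individual postings, is what makes the output an aggregate dataset; an industry dimension could be added to the key in the same way.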