Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline

📅 2026-02-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the inadequate coverage of small and medium-sized enterprises (SMEs) in existing business databases, particularly sub-tier suppliers and firms in emerging niche markets, a gap that weakens supply-chain resilience. The authors propose an iterative Web–Knowledge–Web framework that crawls domain-specific web sources to identify candidate supplier entities, consolidates them into a heterogeneous knowledge graph, and closes the loop by using the graph's topology and coverage signals to guide subsequent crawling. Ecological species-richness estimators (Chao1 and ACE) are adapted to estimate entity coverage. Evaluated on the semiconductor equipment manufacturing sector (NAICS 333242), the approach reaches peak recall by iteration 3 with only 112 crawled pages and, under a 213-page budget, builds a knowledge graph of 765 entities and 586 relations, attaining the highest precision (0.138) and F1 score (0.118) among the compared methods.

📝 Abstract

Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet existing business databases suffer from substantial coverage gaps, particularly for sub-tier suppliers and firms in emerging niche markets. We propose a Web–Knowledge–Web (W→K→W) pipeline that iteratively (1) crawls domain-specific web sources to discover candidate supplier entities, (2) extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3) uses the knowledge graph's topology and coverage signals to guide subsequent crawling toward under-represented regions of the supplier space. To quantify discovery completeness, we introduce a coverage estimation framework inspired by ecological species-richness estimators (Chao1, ACE) adapted for web-entity populations. Experiments on the semiconductor equipment manufacturing sector (NAICS 333242) demonstrate that the W→K→W pipeline achieves the highest precision (0.138) and F1 (0.118) among all methods using the same 213-page crawl budget, building a knowledge graph of 765 entities and 586 relations while reaching peak recall by iteration 3 with only 112 pages.
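
To make the coverage idea concrete, here is a minimal sketch of the standard bias-corrected Chao1 estimator applied to entity sightings. The abstract does not say how the authors adapt Chao1 or ACE to web-entity populations, so this shows only the textbook ecological formula. The function names `chao1` and `coverage`, and the assumption that each list element is one sighting of an entity on a crawled page, are illustrative and not taken from the paper.

```python
from collections import Counter


def chao1(sightings):
    """Bias-corrected Chao1 estimate of the total number of distinct entities.

    `sightings` is a list with one element per (page, entity) observation;
    this input format is an assumption made for illustration.
    """
    counts = Counter(sightings)        # entity -> number of times it was seen
    freq = Counter(counts.values())    # k -> number of entities seen exactly k times
    s_obs = len(counts)                # distinct entities observed so far
    f1 = freq.get(1, 0)                # singletons: entities seen exactly once
    f2 = freq.get(2, 0)                # doubletons: entities seen exactly twice
    # Bias-corrected form; defined even when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def coverage(sightings):
    """Observed richness divided by estimated richness (0.0 for empty input)."""
    estimate = chao1(sightings)
    return len(set(sightings)) / estimate if estimate else 0.0


# Hypothetical supplier names, not data from the paper.
example = ["A", "A", "A", "B", "B", "C", "D", "E"]
print(chao1(example))      # 5 observed + 3*2 / (2*2) = 6.5
print(coverage(example))   # 5 / 6.5
```

Many singletons relative to doubletons push the estimate well above the observed count, which signals that many entities remain undiscovered. A coverage-aware crawler could use a signal of this kind to decide whether further crawling is likely to pay off.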

Problem

Research questions and friction points this paper is trying to address.

supplier discovery

coverage gaps

small and medium-sized enterprises

domain-specific crawling

supply-chain resilience

Innovation

Methods, ideas, or system contributions that make the work stand out.

Web-Knowledge-Web pipeline

coverage-aware crawling

heterogeneous knowledge graph