Retrofitting Service Dependency Discovery in Distributed Systems

📅 2025-10-17

🤖 AI Summary

In modern distributed systems, complex network routing mechanisms such as NAT obscure the hosts that actually run services, so run-time dependency inference based on passively observed network metadata breaks down, which impedes root-cause analysis of performance degradation. To address this, we propose XXXX, a run-time system that builds process-level service dependency graphs by injecting metadata into TCP packet headers. The approach requires no source-code instrumentation and infers dependencies across NAT boundaries. It preserves TCP protocol correctness and supports partial deployment: if no receiving agent is present, existing connections are unaffected. Evaluated against three state-of-the-art systems across nine scenarios (three network configurations, namely NAT-free, internal-NAT, and external-NAT, each combined with three microservice benchmarks), XXXX was the only approach that performed consistently across network configurations, matching or exceeding the alternatives with 100% precision and recall in the majority of scenarios.

📝 Abstract

Modern distributed systems rely on complex networks of interconnected services, creating direct or indirect dependencies that can propagate faults and cause cascading failures. To localize the root cause of performance degradation in these environments, constructing a service dependency graph is highly beneficial. However, building an accurate service dependency graph is impaired by complex routing techniques, such as Network Address Translation (NAT), an essential mechanism for connecting services across networks. NAT obfuscates the actual hosts running the services, causing existing run-time approaches that passively observe network metadata to fail in accurately inferring service dependencies. To this end, this paper introduces XXXX, a novel run-time system for constructing process-level service dependency graphs. It operates without source code instrumentation and remains resilient under complex network routing mechanisms, including NAT. XXXX implements a non-disruptive method of injecting metadata into a TCP packet's header that maintains protocol correctness across host boundaries. In other words, if no receiving agent is present, the instrumentation leaves existing TCP connections unaffected, ensuring non-disruptive operation when it is partially deployed across hosts. We evaluated XXXX extensively against three state-of-the-art systems across nine scenarios, involving three network configurations (NAT-free, internal-NAT, external-NAT) and three microservice benchmarks. XXXX was the only approach that performed consistently across networking configurations. With regard to correctness, it performed on par with, or better than, the state of the art, with precision and recall values of 100% in the majority of the scenarios.
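
The abstract does not say which part of the TCP header XXXX uses. As a rough illustration only, the sketch below assumes the metadata travels as a TCP option with an experimental option kind (253, reserved for experiments by RFC 4727) and an RFC 6994-style experiment ID. The names `encode_tag_option`, `find_tag`, and the value `EXID` are hypothetical and do not come from the paper. The parser skips options it does not recognise by their length field, which is the standard behaviour that lets a host without a receiving agent ignore such an option.

```python
import struct

EXPERIMENTAL_KIND = 253  # option kind reserved for experiments (RFC 4727)
EXID = 0xBEEF            # hypothetical 16-bit experiment ID (RFC 6994 style)


def encode_tag_option(tag: bytes) -> bytes:
    """Build one option laid out as: kind | length | ExID | tag."""
    length = 2 + 2 + len(tag)  # kind byte, length byte, ExID, payload
    if length > 40:
        raise ValueError("TCP options are limited to 40 bytes in total")
    return struct.pack("!BBH", EXPERIMENTAL_KIND, length, EXID) + tag


def find_tag(options: bytes):
    """Walk a TCP options field and return the tag, or None if absent."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:          # End of Option List
            break
        if kind == 1:          # No-Operation, a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break              # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break              # malformed option, stop parsing
        body = options[i + 2 : i + length]
        if kind == EXPERIMENTAL_KIND and len(body) >= 2:
            (exid,) = struct.unpack("!H", body[:2])
            if exid == EXID:
                return body[2:]
        i += length            # skip options we do not recognise
    return None


# Usage: an MSS option, a NOP, then the hypothetical tag option.
mss_option = bytes([2, 4, 0x05, 0xB4])
options = mss_option + b"\x01" + encode_tag_option(b"pid42")
tag = find_tag(options)
```

Here `tag` holds the injected identifier, while `find_tag(mss_option)` returns `None` because no tag option is present.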

Problem

Research questions and friction points this paper is trying to address.

Accurately discovering service dependencies in distributed systems

Overcoming NAT obfuscation in service dependency mapping

Building resilient dependency graphs under complex network routing
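
The NAT obfuscation problem above can be illustrated with a toy example. The sketch below uses invented addresses, service names, and flows (none of them from the paper) to show why mapping source IPs to services loses the real caller behind a NAT gateway, and how precision and recall over dependency edges would expose that. The `tag` field stands in for injected metadata identifying the calling service; it is an assumption made for illustration.

```python
def precision_recall(inferred: set, truth: set):
    """Precision and recall of inferred dependency edges against ground truth."""
    true_positives = len(inferred & truth)
    precision = true_positives / len(inferred) if inferred else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical ground truth: (caller, callee) edges.
truth = {("frontend", "cart"), ("frontend", "catalog")}

# Hypothetical address book; 10.0.1.1 is the NAT gateway.
ip_to_service = {
    "10.0.0.5": "frontend",
    "10.0.1.1": "nat-gateway",
    "10.0.1.7": "cart",
    "10.0.1.8": "catalog",
}

# Flows as seen on the receiving hosts: (source IP, destination IP, tag).
# NAT has rewritten the source address, so it names the gateway.
flows = [
    ("10.0.1.1", "10.0.1.7", "frontend"),
    ("10.0.1.1", "10.0.1.8", "frontend"),
]

# Address-based inference attributes every call to the gateway.
by_address = {(ip_to_service[src], ip_to_service[dst]) for src, dst, _ in flows}

# Tag-based inference takes the caller from the injected metadata.
by_tag = {(tag, ip_to_service[dst]) for _, dst, tag in flows}

address_scores = precision_recall(by_address, truth)
tag_scores = precision_recall(by_tag, truth)
```

In this toy setup, address-based inference yields only edges from `nat-gateway`, so none match the ground truth, while the tag-based edges match it exactly. These numbers describe the invented data only, not the paper's evaluation.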

Innovation

Methods, ideas, or system contributions that make the work stand out.

Injecting metadata into TCP headers for dependency tracking

Constructing process-level service dependency graphs without code instrumentation

Maintaining protocol correctness across NAT-enabled network boundaries
