AI Summary
This work addresses critical limitations in existing large language model (LLM) evaluation tools, namely poor usability, reliance on external services, and inadequate support for privacy and compliance requirements. To overcome these challenges, the authors propose LLM-FACETS, an open-source, browser-accessible, local-first evaluation framework designed for three key user groups: technical experts, domain experts, and compliance officers. The framework is organized around these practitioner roles and uses a plugin architecture that enables plug-and-play integration of metrics and datasets. It runs deterministic metrics such as BLEU, ROUGE, and BERTScore on the self-hosted server, while optionally incorporating external LLM judges, with clear separation between local and remote evaluation pathways. Key contributions include RAG Triad metrics (Faithfulness, Answer Relevance, Context Relevance), token-level log-probability visualization, and a multi-judge consensus mechanism, which together improve transparency, reproducibility, and auditability. The authors cross-validate 18 metric implementations against canonical reference libraries, and new components can be added without modifying the evaluation pipeline, decoupling AI development from independent evaluation.
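The plug-and-play integration described above can be illustrated with a minimal registry sketch. This is not the framework's actual API: `register_metric`, `METRICS`, `evaluate`, and the two toy metrics are hypothetical names chosen for illustration, assuming each metric is a function from a prediction and a reference to a score.

```python
from typing import Callable, Dict

# Hypothetical registry (not the framework's API): metric name -> scoring function.
MetricFn = Callable[[str, str], float]
METRICS: Dict[str, MetricFn] = {}


def register_metric(name: str) -> Callable[[MetricFn], MetricFn]:
    """Decorator that adds a metric to the registry under `name`."""
    def decorator(fn: MetricFn) -> MetricFn:
        if name in METRICS:
            raise ValueError(f"metric {name!r} is already registered")
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the stripped strings are identical, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


@register_metric("token_overlap")
def token_overlap(prediction: str, reference: str) -> float:
    """Toy unigram recall: fraction of reference tokens found in the prediction."""
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    pred_tokens = set(prediction.lower().split())
    return sum(tok in pred_tokens for tok in ref_tokens) / len(ref_tokens)


def evaluate(prediction: str, reference: str) -> Dict[str, float]:
    """Run every registered metric; this function never changes when metrics are added."""
    return {name: fn(prediction, reference) for name, fn in METRICS.items()}
```

In this sketch, adding a metric means writing one decorated function; `evaluate` itself stays untouched, which is the property the summary attributes to the plugin design.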
Abstract
Assessing whether Large Language Model (LLM) outputs are factually grounded, epistemically calibrated, and methodologically reproducible is a prerequisite for responsible AI deployment. Yet auditing LLMs remains inaccessible to non-technical practitioners: existing tools require programming expertise and non-trivial environment setup, and cloud-hosted platforms transmit evaluation data to external services, creating barriers for domain experts and compliance officers legally responsible for AI oversight. We introduce LLM-FACETS (LLM FActuality Cross-EvaluaTion System): an open-source framework with a browser-accessible interface and a plugin architecture, structured around three practitioner profiles (technical experts, domain experts, compliance officers) that mirror the stakeholder categories identified in the EU AI Act and the NIST AI Risk Management Framework. The architecture makes data flows explicit: deterministic metrics (BLEU, ROUGE, BERTScore) run entirely within the self-hosted server with no outbound transmission; LLM-judge metrics contact external APIs explicitly, with users retaining full credential control. The framework operationalizes transparency through three mechanisms: token-level log-probability visualization for epistemic uncertainty, multi-judge consensus to mitigate judge bias, and RAG Triad metrics (Faithfulness, Answer Relevance, Context Relevance) to detect and localize hallucinations. A plugin architecture allows any new metric or dataset to be integrated without modifying the evaluation pipeline. The open-source implementation enables cross-checking across multiple metrics targeting the same property, ensuring reproducibility and decoupling AI accountability from the teams building the systems assessed. We verify the framework through cross-validation of 18 metric implementations against canonical reference libraries.
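Two of the transparency mechanisms named above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the framework's implementation: the function names, the 0.5 confidence threshold, the use of the median as the consensus value, and the 0.3 disagreement threshold are all hypothetical choices; the abstract does not specify how confidence is displayed or how judge scores are aggregated.

```python
import math
from statistics import median
from typing import Dict, List, Tuple


def token_confidences(
    logprobs: List[Tuple[str, float]], threshold: float = 0.5
) -> List[Tuple[str, float, bool]]:
    """Convert per-token log-probabilities to probabilities.

    Returns (token, probability, is_low_confidence) triples. The threshold
    is an assumed cut-off for highlighting uncertain tokens.
    """
    result = []
    for token, logprob in logprobs:
        prob = math.exp(logprob)  # p = e^(log p)
        result.append((token, prob, prob < threshold))
    return result


def judge_consensus(scores: Dict[str, float], max_spread: float = 0.3) -> Dict[str, object]:
    """Aggregate scores from several judges (assumed to lie in [0, 1]).

    Uses the median as the consensus value (an assumption) and flags
    disagreement when the max-min spread exceeds `max_spread`.
    """
    if not scores:
        raise ValueError("at least one judge score is required")
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "consensus": median(values),
        "spread": spread,
        "disagreement": spread > max_spread,
    }
```

A median is used here because a single outlying judge moves it less than it would move a mean; other aggregation rules (majority vote, weighted mean) would fit the same interface.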