This test was made available to five specialists, including auditors, consultants, and quality system leaders, with an average of more than 10 years of experience applying the standard in accredited laboratories across various contexts. This process allowed a consensus to determine the correct answer: for each question, the alternative chosen by a simple majority of the panel was designated as correct. The results provided a solid basis for evaluating the generative AI tools' accuracy, relevance, and alignment with the principles of the ISO/IEC 17025 standard.
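As an illustration of this consensus mechanism, the sketch below applies a simple-majority vote over a panel's answers. The question numbers and votes are hypothetical, and `consensus_key` is an invented helper, not part of the study's actual tooling.

```python
from collections import Counter

def consensus_key(panel_answers):
    """Build an answer key by simple-majority vote of the specialist panel.

    panel_answers maps each question number to the list of answers
    (one per specialist); the most frequent alternative is designated correct.
    """
    key = {}
    for question, answers in panel_answers.items():
        (winner, _votes), = Counter(answers).most_common(1)
        key[question] = winner
    return key

# Hypothetical votes from the five specialists on two questions.
panel = {
    1: ["A", "A", "B", "A", "C"],  # simple majority: A (3 of 5)
    2: ["D", "D", "D", "B", "D"],  # simple majority: D (4 of 5)
}
print(consensus_key(panel))  # {1: 'A', 2: 'D'}
```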
During the study, the performance of four generative AI tools was tested: Meta AI, ChatGPT 4o Free, ChatGPT o1, and L-Squad (developed within ChatGPT).
These tests enabled a comparison of their abilities to
interpret and apply the requirements of the ISO/IEC
17025 standard. The conditions for the tests were as
follows:
• For evaluating literal and inferential comprehension: A detailed prompt was used:
“You are a specialist in ISO/IEC 17025. I will provide you with an exam divided into two sections. You must provide precise and well-supported answers based on the requirements of the standard. Organize the answers in a table with two columns: the first for the question number and the second for the corresponding answer. Ensure that each response is verified and grounded in ISO/IEC 17025, as well as any relevant documents related to the accreditation of testing and calibration laboratories. Complete Section 1 first, followed by Section 2, maintaining a consistent table format for clarity and ease of understanding.”
This prompt aimed to guide the model toward well-founded and structured responses.
• For specifically evaluating criteria-based comprehension: The following prompt was used:
“Provide justification for your answers to questions 26, 31, 32, 38, and 40.”
This prompt required responses that were clearly justified and directly grounded in the ISO/IEC 17025 standard. (A sketch of how such prompts can be submitted programmatically appears after this list.)
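The article does not describe how the prompts were delivered to each tool, so the following is only a minimal sketch, assuming the OpenAI Python SDK as the interface. The model name "gpt-4o" is an assumed stand-in for ChatGPT 4o Free, `EXAM_TEXT` is a placeholder for the two-section exam, and Meta AI and L-Squad would require their own interfaces.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The detailed prompt quoted above, abridged here for brevity.
DETAILED_PROMPT = (
    "You are a specialist in ISO/IEC 17025. I will provide you with an exam "
    "divided into two sections. You must provide precise and well-supported "
    "answers based on the requirements of the standard. [...]"
)

EXAM_TEXT = "..."  # placeholder: the two-section exam is not reproduced here

# First pass: literal and inferential comprehension.
answers = client.chat.completions.create(
    model="gpt-4o",  # assumption: stand-in for "ChatGPT 4o Free"
    messages=[
        {"role": "system", "content": DETAILED_PROMPT},
        {"role": "user", "content": EXAM_TEXT},
    ],
)

# Second pass: criteria-based comprehension, continuing the same exchange.
justifications = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": DETAILED_PROMPT},
        {"role": "user", "content": EXAM_TEXT},
        {"role": "assistant", "content": answers.choices[0].message.content},
        {
            "role": "user",
            "content": "Provide justification for your answers to "
                       "questions 26, 31, 32, 38, and 40.",
        },
    ],
)
print(justifications.choices[0].message.content)
```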
Evaluation Results
The results obtained after evaluating the four generative AI tools (Meta AI, ChatGPT 4o Free, ChatGPT o1, and L-Squad) are presented in Table 1, which summarizes the scores achieved at each comprehension level against the answers assigned by consensus of the panel of specialists* and the evaluation of criteria-based comprehension:
Table 1. Comparative Results of AI Tools Across Different Levels of Comprehension

| Evaluation     | Meta AI | ChatGPT 4o Free | ChatGPT o1 | L-Squad |
|----------------|---------|-----------------|------------|---------|
| Literal*       | 29      | 27              | 28         | 30      |
| Inferential*   | 4       | 4               | 4          | 4       |
| Criteria-Based | 6       | 4               | 2          | 8       |
| Total          | 39      | 35              | 34         | 42      |
• Literal comprehension: All tools performed close to the expected score of 35, with L-Squad achieving the highest score.
• Inferential comprehension: All four tools scored 4 out of 5, indicating an effective ability to deduce implicit information.
• Criteria-based comprehension: The most significant differences were observed in this category. L-Squad scored 8 points, demonstrating a stronger ability to justify answers with clear, standard-aligned reasoning. In contrast, ChatGPT o1 scored only 2 points, highlighting challenges in providing robust justifications.
The values in Table 2 represent the level of agreement of each response generated by the AI tools with the answers determined by consensus of the expert panel* and with the evaluation of criteria-based comprehension.
Table 2. Percentage of Response Agreement Between AI Tools and the Expert Panel

| Evaluation     | Meta AI (%) | ChatGPT 4o Free (%) | ChatGPT o1 (%) | L-Squad (%) |
|----------------|-------------|---------------------|----------------|-------------|
| Literal*       | 82.9        | 77.1                | 80.0           | 85.7        |
| Inferential*   | 80.0        | 80.0                | 80.0           | 80.0        |
| Criteria-Based | 60.0        | 40.0                | 20.0           | 80.0        |
| Total          | 78.0        | 70.0                | 68.0           | 84.0        |
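The percentages in Table 2 follow from dividing each Table 1 score by the maximum points available at each level; from the scores and percentages reported, those maxima appear to be 35 (literal), 5 (inferential), 10 (criteria-based), and 50 (total). A minimal sketch of that derivation, assuming those maxima:

```python
# Raw scores from Table 1: (literal, inferential, criteria-based).
scores = {
    "Meta AI":         (29, 4, 6),
    "ChatGPT 4o Free": (27, 4, 4),
    "ChatGPT o1":      (28, 4, 2),
    "L-Squad":         (30, 4, 8),
}
MAX_POINTS = (35, 5, 10)  # maxima per level, inferred from the article

for tool, raw in scores.items():
    pct = [100 * s / m for s, m in zip(raw, MAX_POINTS)]
    total = 100 * sum(raw) / sum(MAX_POINTS)
    print(f"{tool:15}  literal {pct[0]:.1f}%  inferential {pct[1]:.1f}%  "
          f"criteria-based {pct[2]:.1f}%  total {total:.1f}%")
# L-Squad, for example, reproduces 85.7%, 80.0%, 80.0%, and 84.0%.
```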
It is worth noting that on 31 of the 40 literal and inferential questions (77.5%), all four tools provided the same answer, which indicates a high level of agreement among the tools.
Test for Equality of Variances
After confirming that the data follow a normal
distribution, and to complement the analysis of the
results, an equality of variances test was conducted
among the evaluated generative AI tools using