observed in the results could be attributed to randomness rather than inherent differences in the capabilities of the tools. However, the descriptive analysis and confidence intervals highlight that L-Squad achieved a higher mean and a lower standard deviation, which may reflect a more consistent and robust performance in the context of the evaluation.
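For readers who want to reproduce this kind of comparison, the sketch below shows one common way to obtain the descriptive statistics and confidence intervals mentioned above: a sample mean, a sample standard deviation, and a t-based 95% interval for each tool. It is illustrative only; the score vectors, the 95% level, and the t-interval are assumptions, not the study's actual data or procedure.

```python
# Minimal sketch (not the authors' code): descriptive statistics and a
# t-based 95% confidence interval for each tool's mean per-question score.
# The score lists below are hypothetical placeholders, not the study data.
import statistics
from math import sqrt

from scipy import stats  # t-distribution for the critical value


def mean_ci(scores, confidence=0.95):
    """Return (mean, stdev, ci_lower, ci_upper) for the mean of `scores`."""
    n = len(scores)
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)        # sample standard deviation
    sem = stdev / sqrt(n)                   # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, stdev, mean - t_crit * sem, mean + t_crit * sem


# Hypothetical per-question scores (0-10) for two tools, illustrative only.
l_squad_scores = [8, 9, 8, 7, 9, 8, 8, 9]
chatgpt_o1_scores = [6, 9, 4, 8, 5, 9, 3, 8]

for name, scores in [("L-Squad", l_squad_scores), ("ChatGPT o1", chatgpt_o1_scores)]:
    m, s, low, high = mean_ci(scores)
    print(f"{name}: mean={m:.2f}, stdev={s:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

A narrower interval around a higher mean would point to the more consistent performance described for L-Squad, whereas heavily overlapping intervals would caution against attributing the observed difference to anything other than chance.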
Discussion

1. Differences in Criterion-Based Understanding Reflect Specific Configurations and Required Improvements

L-Squad performed significantly better in criterion-based understanding, achieving a score of 8 out of 10 compared to other tools like Meta AI and ChatGPT o1. This performance can be attributed to its personalized configuration based on Reinforcement Learning from Human Feedback (RLHF). According to Naik et al. (2024), this approach enables models to be adjusted to specific expectations through iterative refinement cycles. L-Squad exemplifies the impact of aligning a model with technical and normative contexts.

However, while L-Squad demonstrated outstanding performance in literal and criterion-based understanding, it is not without limitations, which must be addressed to enhance its effectiveness in future developments:

• The accuracy and relevance of L-Squad's responses are highly dependent on the quality and specificity of the instructions provided during its configuration. Any omission or ambiguity in the guidelines can limit its ability to interpret complex cases or specific contexts.

• Personalization based on a specific normative context, such as ISO/IEC 17025, may constrain L-Squad's ability to adapt to scenarios requiring flexibility beyond the standard. This presents a challenge for its applicability in multidisciplinary contexts or complementary standards.

2. Limitations Observed in ChatGPT o1 Despite Its Advanced Reasoning Capabilities

Despite being designed for advanced reasoning, ChatGPT o1 achieved a notably low score in criterion-based understanding (2 points). This result can be attributed to a lack of alignment with the specific instructions of the ISO/IEC 17025 standard. Although the "Uses advanced reasoning" engine can generate complex responses, the model lacked optimization for justifying answers with precise normative references—an essential feature in technical contexts. Fernández-Samos Gutiérrez (2023) emphasizes that, in normative applications, AI must prioritize human verification and technical coherence—factors that may have limited ChatGPT o1's performance due to insufficient focus on these aspects during its configuration.

Additionally, the lower score may reflect a reduced ability to justify responses based on explicit requirements or solid recommendations. This contrasts with tools like L-Squad, whose customization included specific directives guiding its reasoning toward well-founded normative interpretations.

3. Appropriate Concordance in Literal and Inferential Understanding

Across 31 questions, all four tools generated the same response, suggesting that generative models have a solid understanding in the literal and inferential dimensions. This demonstrates the general capacity of the models to identify and contextualize explicit
The evaluation of generative AI tools reveals
important implications for applying AI in normative
contexts, particularly with respect to the ISO/IEC
17025 standard.
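To make the concordance result in point 3 concrete, the sketch below counts the questions on which every tool returned the same (normalized) answer and reports the share of full agreement. The answer matrix and the exact-match criterion are hypothetical placeholders; the study's own answer-matching procedure is not reproduced here.

```python
# Minimal sketch (not the authors' code): concordance across tools, measured
# as the fraction of questions on which all tools gave the same answer.
# The answer matrix below is a hypothetical placeholder, not the study data.
from typing import Sequence


def full_agreement_rate(answers_by_tool: Sequence[Sequence[str]]) -> float:
    """Fraction of questions on which every tool gave the same answer.

    answers_by_tool[t][q] is tool t's normalized answer to question q.
    """
    n_questions = len(answers_by_tool[0])
    agreed = sum(
        1 for q in range(n_questions)
        if len({tool[q] for tool in answers_by_tool}) == 1
    )
    return agreed / n_questions


# Hypothetical data: 4 tools x 5 questions, answers already reduced to labels.
answers = [
    ["A", "B", "A", "C", "A"],   # tool 1
    ["A", "B", "A", "C", "B"],   # tool 2
    ["A", "B", "A", "C", "A"],   # tool 3
    ["A", "B", "A", "C", "A"],   # tool 4
]
print(f"Full agreement on {full_agreement_rate(answers):.0%} of questions")
```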