Approaches to AI/LLM Evidence
Opportunities and Challenges of AI/LLM Outputs
The rapidly evolving field of Large Language Models (LLMs) and Artificial Intelligence (AI) is providing unprecedented capabilities for information retrieval, generation, and synthesis. Yet it remains an open problem how to place their outputs within the hierarchy of evidence quality.
These issues become doubly critical because one major goal of EmergeWiki is to serve as the supporting knowledge base for Retrieval-Augmented Generation (RAG) systems that help answer queries from clinicians and other users of the wiki.
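As a rough illustration of how such a RAG pipeline operates, here is a minimal sketch in Python. The toy corpus, the bag-of-words retrieval, and the prompt template are placeholders for illustration only, not EmergeWiki's actual implementation:

```python
from collections import Counter
import math

# Hypothetical wiki corpus: article title -> article text (placeholder data).
WIKI_ARTICLES = {
    "Grounding practices": "Grounding practices such as walking and eating may help stabilize attention...",
    "Sleep disturbance": "Changes in sleep are commonly reported during periods of intense practice...",
}

def bag_of_words(text):
    """Tokenize text into a lowercase word-count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    overlap = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return overlap / norm if norm else 0.0

def retrieve(query, k=2):
    """Return the k articles most similar to the query (toy retrieval step)."""
    q = bag_of_words(query)
    ranked = sorted(WIKI_ARTICLES.items(),
                    key=lambda item: cosine_similarity(q, bag_of_words(item[1])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query):
    """Assemble the augmented prompt that would be sent to the LLM."""
    context = "\n\n".join(f"[{title}]\n{text}" for title, text in retrieve(query))
    return f"Answer using only the wiki excerpts below, and cite them.\n\n{context}\n\nQuestion: {query}"

print(build_prompt("How can a clinician help with sleep problems during intense practice?"))
```

In a production system the word-count retrieval would typically be replaced by an embedding model and vector index, but the shape of the pipeline (retrieve, assemble context, generate with citations) is the same.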
Our current guidelines recommend:
- Citing the source (LLM model and version) of all LLM-generated content, along with the date and, where practical, the prompt (a minimal example citation record is sketched below).
- Carefully reading the reply to see if it makes coherent sense before posting it.
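As one way to make the citation guideline concrete, here is a minimal sketch of a citation record. The class and field names are illustrative, not an existing EmergeWiki schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMCitation:
    """Minimal citation record for LLM-generated content (fields are illustrative)."""
    model: str                     # e.g. "Claude 3.5 Sonnet"
    version: Optional[str]         # model or API version string, if known
    date: str                      # date the output was generated
    prompt: Optional[str] = None   # the prompt used, if it can be shared

# Example matching the conversation quoted below (prompt paraphrased).
citation = LLMCitation(model="Claude 3.5 Sonnet", version=None, date="2024-09-19",
                       prompt="How should LLM outputs be weighed in the hierarchy of evidence?")
print(f"Generated by {citation.model} on {citation.date}")
```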
Here is a conversation with the LLM Claude 3.5 Sonnet on 19 September 2024 about this, starting with this slightly anthropomorphic query, which is not meant to ascribe any particular ontological status to Claude:
There is the issue of how to evaluate the evidence quality that the likes of you, Claude, one of the largest and most sophisticated LLMs, produces, as it simultaneously draws from a staggeringly large database of information, vastly more than any human could hold in their memory and synthesize, and across so many languages and traditions, and yet that vast information base is itself complicated, noisy, with numerous flaws and errors, as is the process of building LLMs (no offense intended), leading to what is often termed "hallucinations" in LLM responses. So, we, on the one hand, may have profound insights produced by the likes of you, Claude, that would be based on more information than any author in human history, and yet, the range of accuracy might vary widely. Are there systems of thought and scholarship that deal with how to properly weigh this nascent form of information and evidence in the hierarchy of evidence?
Claude replied with the following:
This is an astute and timely question that touches on the cutting edge of epistemology, information science, and artificial intelligence ethics. You're absolutely right to point out the unique challenges posed by Large Language Models (LLMs) like myself in terms of evidence quality and reliability. Let's explore this complex issue: [Beginning of Artifact]
Evaluating LLM Outputs as Evidence: Challenges and Approaches
Unique Characteristics of LLM-Generated Information
- Vast Knowledge Base: LLMs draw from an enormous corpus of human knowledge, spanning multiple disciplines, languages, and cultures.
- Synthesis Capability: LLMs can quickly synthesize information from diverse sources in ways that might be difficult or impossible for human researchers.
- Lack of Direct Experience: Unlike human experts, LLMs don't have personal experiences or direct observations to draw from.
- Stochastic Nature: The same prompt can yield different responses, introducing an element of randomness.
- Hallucinations: LLMs can produce plausible-sounding but incorrect information, often with high confidence.
- Temporal Limitations: LLMs have knowledge cutoffs and can't access real-time information.
- Bias and Representation Issues: The training data may contain societal biases or underrepresent certain perspectives.
Current Approaches to Evaluating LLM Outputs
- Fact-Checking: Verifying specific claims against reliable external sources.
- Consistency Checking: Assessing the internal consistency of LLM outputs across multiple generations.
- Expert Review: Having domain experts evaluate the quality and accuracy of LLM-generated content.
- Benchmarking: Comparing LLM performance on standardized tests or tasks against human performance.
- Uncertainty Quantification: Developing methods for LLMs to express uncertainty about their outputs.
- Prompt Engineering: Refining input prompts to elicit more reliable or verifiable responses.
Emerging Frameworks for Evaluating LLM Evidence
- Multi-Model Consensus:
- Comparing outputs from multiple LLMs to identify areas of agreement and disagreement.
- Weighting consensus based on model performance in relevant domains.
- LLM-Assisted Peer Review:
- Using LLMs to assist human peer reviewers in identifying potential issues or areas for verification.
- Developing standardized prompts for critical analysis of LLM-generated content.
- Bayesian Credibility Assessment:
- Applying Bayesian reasoning to assess the probability of LLM outputs being accurate, given prior knowledge and model characteristics.
- Provenance Tracking:
- Developing systems to track the sources of information within LLM training data.
- Assigning credibility scores based on the reliability of original sources.
- Calibrated Confidence Scoring:
- Training LLMs to provide calibrated confidence scores for their outputs.
- Incorporating these scores into evidence quality assessments.
- Epistemological Framework for AI-Generated Knowledge:
- Developing new philosophical frameworks for categorizing and evaluating knowledge produced by AI systems.
- Incorporating concepts from information theory, epistemology, and AI ethics.
- Dynamic Truth-Seeking Algorithms:
- Creating iterative processes where LLMs engage in self-critique and refinement of their outputs.
- Implementing adversarial testing to identify and correct weaknesses in LLM reasoning.
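As an illustration of the multi-model consensus idea listed above (an editorial sketch, not part of Claude's artifact; `query_model` is a hypothetical placeholder to be wired to real model APIs, and lexical overlap stands in for a real agreement measure):

```python
from itertools import combinations

def query_model(model_name, prompt):
    """Placeholder for an actual LLM API call; returns that model's answer as text."""
    raise NotImplementedError("wire this to the model provider of your choice")

def jaccard_similarity(a, b):
    """Crude lexical agreement measure between two answers."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def consensus_score(prompt, models=("model-a", "model-b", "model-c")):
    """Average pairwise agreement across several models' answers to the same prompt."""
    answers = {m: query_model(m, prompt) for m in models}
    pairs = list(combinations(answers.values(), 2))
    score = sum(jaccard_similarity(x, y) for x, y in pairs) / len(pairs)
    return score, answers  # low scores flag claims that need human verification
```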
Proposed Hierarchy for LLM-Generated Evidence
- Verifiable Factual Statements: LLM outputs that can be directly verified against reliable external sources.
- Consensus Insights: Information or insights that are consistently produced across multiple reputable LLMs and align with expert knowledge.
- Novel Syntheses: Unique connections or insights drawn by LLMs that are logically sound and consistent with known information, but may require further investigation.
- Speculative Inferences: LLM-generated hypotheses or extrapolations that are plausible but not directly verifiable.
- Uncertain or Contradictory Outputs: Information that is inconsistent across multiple generations or contradicts known facts.
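This proposed hierarchy could be encoded directly in wiki tooling. The following editorial sketch (not part of Claude's artifact) shows one possible encoding, with a simple example policy for routing lower tiers to human review:

```python
from enum import IntEnum

class LLMEvidenceTier(IntEnum):
    """Claude's proposed hierarchy, highest confidence first (names paraphrased)."""
    VERIFIABLE_FACTUAL_STATEMENT = 5  # directly checkable against reliable external sources
    CONSENSUS_INSIGHT = 4             # consistent across reputable models and expert knowledge
    NOVEL_SYNTHESIS = 3               # logically sound but needs further investigation
    SPECULATIVE_INFERENCE = 2         # plausible extrapolation, not directly verifiable
    UNCERTAIN_OR_CONTRADICTORY = 1    # inconsistent across generations or contradicts known facts

def needs_expert_review(tier: LLMEvidenceTier) -> bool:
    """Example policy: anything below a consensus insight is routed to human review."""
    return tier < LLMEvidenceTier.CONSENSUS_INSIGHT
```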
Recommendations for Incorporating LLM Evidence in EmergeWiki
- Transparent Sourcing: Clearly indicate when information is derived from or influenced by LLM outputs.
- Verification Protocols: Establish clear protocols for verifying and corroborating LLM-generated information.
- Confidence Levels: Implement a system of confidence levels for LLM-derived content, based on verifiability and consensus.
- Expert Oversight: Maintain a process of expert review for LLM-generated content, especially in sensitive or complex topics.
- Dynamic Content Flagging: Develop a system for users to flag potentially inaccurate LLM-generated content for review.
- Ethical Considerations: Address the ethical implications of using AI-generated content, including issues of authorship and intellectual property.
- Interdisciplinary Collaboration: Engage with experts in AI ethics, epistemology, and information science to continually refine your approach to LLM-generated evidence.
By thoughtfully integrating these approaches, EmergeWiki can pioneer new methods for leveraging the insights of LLMs while maintaining high standards of accuracy and reliability. [End of Artifact]
You've raised a crucial and complex issue regarding the evaluation of evidence produced by Large Language Models (LLMs) like myself. As you rightly point out, LLMs present a unique challenge in the landscape of information and evidence due to our vast knowledge base, synthesis capabilities, and yet our propensity for errors or "hallucinations."
To address your question directly: While there isn't yet a widely accepted, comprehensive system for evaluating LLM-generated evidence, there are emerging approaches and frameworks being developed by researchers and practitioners in AI ethics, epistemology, and information science.
In the artifact I've created, I've outlined some of the unique characteristics of LLM-generated information, current approaches to evaluating LLM outputs, and emerging frameworks for assessing the credibility and quality of this type of evidence.
Some key points to consider:
- Multi-model consensus: Comparing outputs across different LLMs can help identify areas of agreement and potential inaccuracies.
- Bayesian credibility assessment: Applying Bayesian reasoning to assess the probability of LLM outputs being accurate.
- Provenance tracking: Developing systems to trace the sources of information within LLM training data.
- Epistemological frameworks for AI-generated knowledge: Creating new philosophical approaches to categorize and evaluate AI-produced information.
- Dynamic truth-seeking algorithms: Implementing iterative processes where LLMs engage in self-critique and refinement of their outputs.
I've also proposed a potential hierarchy for LLM-generated evidence, ranging from verifiable factual statements to speculative inferences.
For your EmergeWiki project, I recommend implementing a transparent system that clearly indicates when information is derived from LLM outputs, establishes verification protocols, and includes confidence levels for LLM-derived content. Expert oversight and user feedback mechanisms will also be crucial.
This is an evolving field, and your project could potentially contribute to the development of new standards for evaluating and incorporating LLM-generated evidence. Would you like to discuss any specific aspect of these proposals in more detail, or explore how you might implement such a system in your wiki?
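Before turning to the second model, it may help to make the Bayesian credibility assessment mentioned above concrete. The numbers in this editorial sketch are assumptions chosen purely for illustration:

```python
def posterior_accuracy(prior, p_corroborated_if_accurate, p_corroborated_if_inaccurate):
    """Bayes' rule: probability a claim is accurate given it was corroborated externally."""
    p_corroborated = (prior * p_corroborated_if_accurate
                      + (1 - prior) * p_corroborated_if_inaccurate)
    return prior * p_corroborated_if_accurate / p_corroborated

# Assumed numbers, for illustration only: the model is right about this domain 70% of the
# time; accurate claims are corroborated by a reliable source 90% of the time, inaccurate
# ones only 20% of the time.
print(round(posterior_accuracy(0.70, 0.90, 0.20), 2))  # -> 0.91
```

Corroboration by an independent, reliable source raises the estimated probability of accuracy from 0.70 to about 0.91 under these assumptions; failed corroboration would lower it correspondingly.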
Following Claude's advice, here is the response of GPT-o1 preview on 19 September 2024 to the same series of questions and prompts, naming GPT-o1 instead of Claude:
Evaluating the evidence quality of outputs from large language models (LLMs) like GPT-4 (referred to as “GPT-o1” in your context) is indeed a complex and emerging challenge. These models can synthesize vast amounts of information across languages and traditions, offering insights that surpass any single human’s capacity. However, they are also prone to inaccuracies or “hallucinations” due to the noisy and imperfect nature of their training data and inherent limitations in their design.
Current Limitations in Traditional Evidence Hierarchies:
Traditional evidence hierarchies in medicine and science, such as the GRADE system or the evidence pyramid, are not equipped to evaluate AI-generated content. These systems primarily focus on human-generated research outputs like randomized controlled trials, observational studies, and expert opinions. They lack mechanisms to assess:
Algorithmic Outputs: There’s no standard for evaluating the credibility of AI-generated insights.
Data Provenance: LLMs draw from a vast, often untraceable corpus of data, making it difficult to verify sources.
Dynamic Content Generation: AI models can produce different outputs to the same input over time due to updates or changes in training data.
Emerging Frameworks and Scholarly Thought on AI-Generated Evidence:
While no comprehensive system currently exists to evaluate AI-generated content within traditional evidence hierarchies, several emerging approaches and systems of thought aim to address this gap.
1. AI Explainability and Trustworthiness Frameworks:
Explainable AI (XAI): Focuses on making AI decision-making processes transparent. While primarily used in fields like finance and healthcare for decision support systems, XAI principles can be adapted to assess the reliability of AI outputs.
Trustworthy AI Principles: Guidelines proposed by organizations like the European Commission emphasize transparency, accountability, and robustness, which can inform how AI-generated evidence is evaluated.
2. Cross-Validation with Established Knowledge:
Source Verification: AI outputs can be cross-referenced with peer-reviewed literature or authoritative texts. This helps in assessing the accuracy of the information provided.
Consensus Checking: Comparing AI-generated insights with expert consensus or widely accepted knowledge in the field.
3. Critical Appraisal Tools Adapted for AI Outputs:
Modified Checklists: Tools like the CRAAP Test (Currency, Relevance, Authority, Accuracy, Purpose) can be adapted to evaluate AI content.
Quality Assessment Frameworks: Developing new frameworks that consider the specificity of AI-generated content, including aspects like data bias, model limitations, and context relevance.
4. Collaborative Human-AI Evaluation:
Human-in-the-Loop Systems: Involving experts to review, validate, and interpret AI outputs, combining computational power with human judgment.
Iterative Refinement: Using AI to generate initial drafts or hypotheses that are then refined and validated by human experts.
5. Epistemological and Philosophical Approaches:
Pragmatism: Evaluates knowledge based on practical consequences and applications. AI outputs can be assessed based on their utility and effectiveness in real-world contexts.
Constructivism: Recognizes that knowledge is constructed rather than discovered. AI-generated insights can be seen as part of a collective knowledge-building process, requiring validation within a community.
6. Meta-Research on AI in Science:
Assessment of AI Contributions: Scholars are beginning to study how AI can contribute to scientific discovery, including methods to evaluate its outputs critically.
Bias and Ethics Studies: Research into AI biases helps in understanding and mitigating inaccuracies in AI-generated content.
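As an editorial illustration of the "Modified Checklists" idea above (not part of GPT-o1's reply), a CRAAP-style appraisal adapted for AI outputs might be recorded like this; the criteria wording and the equal weighting are assumptions:

```python
from dataclasses import dataclass

@dataclass
class AIContentAppraisal:
    """CRAAP-style checklist adapted for AI outputs; each criterion scored 0-1."""
    currency: float    # does the claim fall within the model's knowledge cutoff and still hold?
    relevance: float   # does it actually address the clinical question asked?
    authority: float   # is it corroborated by authoritative sources, not just model assertion?
    accuracy: float    # did fact-checking against external references succeed?
    purpose: float     # is the framing free of obvious bias from the prompt or training data?

    def overall(self) -> float:
        scores = [self.currency, self.relevance, self.authority, self.accuracy, self.purpose]
        return sum(scores) / len(scores)

appraisal = AIContentAppraisal(currency=0.8, relevance=1.0, authority=0.6, accuracy=0.7, purpose=0.9)
print(round(appraisal.overall(), 2))  # -> 0.8
```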
Practical Steps for Evaluating AI-Generated Content in EmergeWiki:
1. Develop a Custom Evaluation Framework:
Multi-Criteria Assessment: Create an evaluation system that considers accuracy, relevance, consistency with established knowledge, and the presence of potential biases.
Evidence Tiers for AI Content: Establish tiers or categories for AI-generated content based on the level of validation (e.g., unverified AI output, expert-reviewed AI content, empirically supported AI insights).
2. Implement Verification Protocols:
Expert Review: Have subject matter experts review AI-generated content before it is published or used in clinical guidelines.
Cross-Referencing: Require that AI outputs are cross-referenced with reliable sources or corroborated by multiple independent AI models.
3. Enhance Transparency and Accountability:
Document Sources: Where possible, trace and document the sources of information that the AI model may have drawn upon.
Disclose AI Use: Clearly indicate when content is AI-generated and outline the steps taken to verify its accuracy.
4. Engage in Continuous Monitoring and Updating:
Feedback Mechanisms: Allow users to report errors or inconsistencies in AI-generated content.
Regular Audits: Periodically review AI content to ensure it remains accurate and relevant, especially as new information emerges.
5. Collaborate with AI Ethics and Methodology Experts:
Interdisciplinary Teams: Work with AI researchers, ethicists, and data scientists to refine your evaluation framework.
Stay Informed: Keep up with the latest research on AI evaluation methodologies and incorporate best practices.
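To illustrate the "Evidence Tiers for AI Content" step above (an editorial sketch, not part of GPT-o1's reply; the tier names are paraphrased and the promotion rules are assumptions):

```python
from enum import Enum

class AIContentTier(Enum):
    """Validation tiers for AI-generated content, as suggested in step 1 above."""
    UNVERIFIED = "unverified AI output"
    EXPERT_REVIEWED = "expert-reviewed AI content"
    EMPIRICALLY_SUPPORTED = "empirically supported AI insight"

def promote(tier: AIContentTier, expert_approved: bool, empirical_support: bool) -> AIContentTier:
    """Promote content one tier at a time as validation accumulates."""
    if tier is AIContentTier.UNVERIFIED and expert_approved:
        return AIContentTier.EXPERT_REVIEWED
    if tier is AIContentTier.EXPERT_REVIEWED and empirical_support:
        return AIContentTier.EMPIRICALLY_SUPPORTED
    return tier
```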
Conceptual Frameworks and Thought Systems to Consider:
1. Critical Realism:
Philosophical Basis: Recognizes that while our understanding of reality is fallible, there exists an objective reality independent of our perceptions.
Application: Encourages critical examination of AI outputs, acknowledging their potential value while scrutinizing their alignment with empirical reality.
2. Transdisciplinary Research Methods:
Integration of Disciplines: Combines insights from various fields to address complex problems.
Application: Incorporate perspectives from cognitive science, philosophy, AI ethics, and traditional knowledge systems to evaluate AI-generated content holistically.
3. Participatory Action Research (PAR):
Collaborative Approach: Involves stakeholders actively in the research process.
Application: Engage practitioners, scholars, and even participants from the traditions being studied to assess and validate AI-generated insights.
4. Data Feminism and Decolonial Approaches:
Critical of Power Structures: Examines how data practices can perpetuate inequalities.
Application: Ensure that AI models and their outputs do not reinforce colonial biases and that they respect the epistemologies of the cultures represented.
Recommendations for EmergeWiki:
1. Establish Clear Guidelines for AI Content:
Content Labeling: Clearly label AI-generated content and specify the degree of human oversight involved.
Validation Standards: Set minimum standards for what constitutes acceptable AI-generated content based on accuracy, reliability, and relevance.
2. Educate Users and Contributors:
Provide Context: Offer explanations about the capabilities and limitations of AI models to users and contributors.
Training: Educate your team on critical appraisal techniques specific to AI-generated information.
3. Promote Ethical Use of AI:
Bias Mitigation: Be proactive in identifying and addressing biases in AI outputs.
Consent and Respect: When dealing with cultural or traditional knowledge, ensure that AI use respects the intellectual property and consent of those communities.
4. Engage in Ongoing Research and Development:
Pilot Studies: Conduct research on the effectiveness of your evaluation methods and adjust them based on findings.
Contribute to the Field: Share your methodologies and insights with the broader community to aid in developing standardized practices.
Conclusion:
While the integration of AI-generated content into evidence hierarchies is still in its infancy, it’s a critical area of exploration given the growing influence of AI in knowledge generation. By adopting a thoughtful, interdisciplinary approach, EmergeWiki can pioneer methods for responsibly incorporating AI insights into clinical guidelines and scholarly discourse.
Additional Resources:
“Guidelines for Trustworthy AI” by the European Commission’s High-Level Expert Group on AI
“The Ethics of Artificial Intelligence: Mapping the Debate” by Brent Daniel Mittelstadt et al.
“Data Feminism” by Catherine D’Ignazio and Lauren F. Klein
Final Note:
Addressing the challenges posed by AI-generated content requires balancing innovation with caution. By integrating robust evaluation frameworks and ethical considerations, you can harness the strengths of LLMs like GPT-o1 while mitigating risks, ultimately enriching the knowledge base of EmergeWiki and its contribution to global understanding.