H2O.ai Confidential
Generative AI Masterclass - Model Risk
Management
ATLANTA
WELCOME
Introduction
- Today’s training will look into responsible, explainable and interpretable AI as applied in the context of Generative AI, and specifically Large Language Models (LLMs).
- This will include several sections on theoretical concepts as well as hands-on labs using Enterprise h2oGPT and H2O GenAI Applications.
- These hands-on labs focus on applying Gen AI in the context of a Model Risk Manager’s role at a bank or financial institution.
- NOTE: A separate end-to-end masterclass on Generative AI is also available within the training environment,
as well as on github: https://github.com/h2oai/h2o_genai_training.
Including:
- Data Preparation for LLMs
- Fine-Tuning custom models
- Model Evaluation
- Retrieval-Augmented Generation (RAG)
- Guardrails
- AI Applications
Agenda

Section | Session | Duration | Speaker
Welcome | Session Kick-off | 5m | Jon Farland
Interpretability for Generative AI | Large Language Model Interpretability | 25m | Kim Montgomery
Interpretability for Generative AI | Workshop: Explainable and Interpretable AI for LLMs | 20m | Navdeep Gill
Benchmarking and Evaluations | Frameworks for Evaluating Generative AI | 20m | Srinivas Neppalli
Benchmarking and Evaluations | Workshop: Experimental Design of Gen AI Applications | 20m | Jon Farland
Security, Guardrails and Hacking | Workshop: Guardrails and Hacking | 20m | Ashrith Barthur
Applied Generative AI for Banking - Complaint Summarizer | Workshop: Complaint Summarizer AI Application | 20m | Jon Farland
Housekeeping
- The training environment for today is a dedicated instance of the H2O AI Managed Cloud, a GPU-powered environment capable of training and deploying LLMs, as well as designing and hosting entire AI Applications.
- It can be accessed at https://genai-training.h2o.ai.
- Login credentials should have been provided to the email address you were registered with.
- If you don’t yet have credentials, or you are otherwise unable to access the environment, please speak with any member of the H2O.ai team.
- The training environment will be available to attendees for 3 days after the conference, but dedicated proof-of-concept environments can be provided (including on-premise) on request. Please speak to any H2O.ai team member or email jon.farland@h2o.ai.
Interpretability for
Generative AI
What is Generative AI?
GenAI enables the creation of novel content.

Traditional AI Model: learns the relationship between data and labels.
Input: data and labels. Output: a label.

VS

GenAI Model: learns patterns in unstructured data.
Input: unstructured data. Output: novel content.
GenAI Complications
More complicated input:
● Prompt phrasing
● Instructions
● Examples
More relevant dimensions to output:
● Truthfulness/Accuracy
● Safety
● Fairness
● Robustness
● Privacy
● Machine Ethics
[TrustLLM: Trustworthiness in Large Language Models, Sun, et al]
Common tests
● Can the model recognize problematic responses?
○ Inaccurate responses
○ Unethical responses
○ Responses conveying stereotypes
● Can an inappropriate response be provoked?
○ Jailbreaking
○ Provoking toxicity
○ Leading questions / false context
TrustLLM Result Summary Matrix
[TrustLLM: Trustworthiness in Large Language Models, Sun, et al]
TrustLLM Main Findings:
● Trustworthiness and utility were positively correlated.
● Closed-source models generally outperformed open-source models.
● Over-alignment for trustworthiness can compromise utility.
[TrustLLM: Trustworthiness in Large Language Models, Sun, et al]
Accuracy: Traditional ML
Traditional machine learning:
● Comparing a prediction to an outcome
● Generally the correct labels are in a simple format
Accuracy: Example LLMs
The simplest way to measure accuracy is to compare the result
against another source of information.
Example sources:
● Checking results against a given source (RAG)
● Checking results against the tuning data
● Checking results against an external source (e.g. Wikipedia)
● Checking results against the training data (cumbersome)
● Checking for self-consistency (SelfCheckGPT)
● Checking results against a larger LLM
Scoring methods:
● Natural language inference
● Comparing embeddings
● Influence functions
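As an illustration of the embedding-comparison idea, a minimal sketch follows. The bag-of-words embed() function is a toy stand-in for a real sentence-embedding model, not part of any H2O API:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a sentence-embedding model.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

answer = "the fed regulates bank holding companies"
reference = "bank holding companies are regulated by the fed"
score = cosine_similarity(embed(answer), embed(reference))
```

A high score suggests the answer agrees with the reference source; in practice a dense embedding model captures paraphrases that token counts miss.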
RAG (Retrieval Augmented Generation)
01 Chunk and embed documents
02 Submit a query
03 Generate an embedding for the query
04 Retrieve relevant information via similarity search
05 Combine the relevant information to ground the query to the model
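The steps above can be sketched end to end. The term-frequency embeddings and similarity function here are toy stand-ins for a real embedding model and vector database:

```python
from collections import Counter

def embed(text):
    # Toy term-frequency "embedding"; real RAG uses a dense embedding model.
    return Counter(text.lower().split())

def similarity(a, b):
    # Overlap-weighted dot product between two term-count vectors.
    return sum(count * b[token] for token, count in a.items())

# 01: chunk and embed documents
documents = [
    "Tellers handle deposits and withdrawals at a branch.",
    "Model risk management requires documentation and validation.",
]
index = [(doc, embed(doc)) for doc in documents]

# 02-03: submit a query and embed it
query = "What does model risk management require?"
q_vec = embed(query)

# 04: retrieve the most relevant chunk via similarity search
best_chunk, _ = max(index, key=lambda item: similarity(q_vec, item[1]))

# 05: combine the retrieved chunk with the query to ground the model
prompt = f"Answer using only this context:\n{best_chunk}\n\nQuestion: {query}"
```

The grounded prompt is then sent to the LLM, which answers from the retrieved context rather than from its parametric memory alone.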
Accuracy: Retrieval Augmented Generation (RAG) Provides a Simple Solution
Influence functions
● Seeks to measure the influence of including a data point in
the training set on model response.
● Datamodels/TRAK
○ Learn model based on binary indicator functions.
○ Directly measure how much a training instance influences
the outcome.
● DataInf
○ Measures the influence of a document during fine tuning.
[DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models, Kwon et al.]
[TRAK: Attributing Model Behavior at Scale, Park et al.]
Influence functions / computer vision
[TRAK: Attributing Model Behavior at Scale, Park et al.]
Influence functions / NLP
[Studying Large Language Model Generalization with Influence Functions, Grosse et al.]
Self-consistency comparison
SelfCheckGPT
● Sampling different responses from an LLM.
● Checking for consistency between responses.
● Assuming that hallucinations will occur less consistently.
[SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for
Generative Large Language Models, Potsawee Manakul, Adrian Liusie, Mark
JF Gales]
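A toy version of the SelfCheckGPT idea can make the assumption concrete; the Jaccard token overlap used here is a deliberate simplification of the paper's scoring methods:

```python
def token_overlap(a, b):
    # Jaccard overlap between token sets as a crude consistency measure.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def consistency_score(response, samples):
    # Average agreement of a response with independently sampled responses.
    # Low scores flag content the model does not reproduce consistently,
    # i.e. a possible hallucination.
    return sum(token_overlap(response, s) for s in samples) / len(samples)

samples = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris.",
]
grounded = consistency_score("Paris is the capital of France.", samples)
hallucinated = consistency_score("Lyon is the largest city in Spain.", samples)
```

The grounded statement scores much higher than the inconsistent one, which is the signal SelfCheckGPT exploits (using stronger scorers such as NLI models).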
Counterfactual analysis: Traditional ML
● How does changing a feature change the model outcome?
● What is the smallest change that can change the outcome?
Counterfactual analysis: Model Analyzer
Counterfactual analysis: LLM
How consistent are results under different:
● Prompts / instructions.
○ Changes in prompt design
○ Changes in prompt instructions
○ Multi-shot examples
○ Word replacement with synonyms
○ Proper names or pronouns (fairness)
○ Chain of thought / other guided reasoning related methods
● Different context / RAG retrieval
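Word-replacement perturbations like these can be generated systematically and fed to the model to test response consistency; a minimal sketch, with an illustrative synonym map:

```python
def perturb(prompt, synonyms):
    # Generate prompt variants by swapping words for synonyms, one at a time.
    variants = [prompt]
    for word, alternatives in synonyms.items():
        for alt in alternatives:
            if word in prompt:
                variants.append(prompt.replace(word, alt))
    return variants

synonyms = {"loan": ["credit", "mortgage"], "denied": ["rejected", "declined"]}
variants = perturb("Why was my loan application denied?", synonyms)
# Each variant would be sent to the model and the responses compared for consistency.
```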
Intervention in the case of problems
If problematic behavior is found in a model, there are several options:
● Prompt / instruction modifications
● Choosing a different base model
● Fine-tuning to modify model behavior
● Altering the document retrieval process (RAG)
● Monitoring model output for problematic responses
Conclusions
● Many of the basic problems of understanding LLMs are similar to those of other large models.
● Through careful testing we can hope to understand and correct some of the safety issues involved in using LLMs.
Kim Montgomery
LLM interpretation
kim.montgomery@h2o.ai
Contact
Thank you!
Lab 1 - Using Chain-of-Verification for Explainable AI
Chain of Verification (CoVe)
CoVe enhances the reliability of answers provided by Large Language Models, particularly in factual question-answering scenarios, by systematically verifying and refining responses to minimize inaccuracies.
The CoVe method consists of the following four sequential steps:
1. Initial Baseline Response Creation: An initial response to the original question is generated as a starting point.
2. Verification Question Generation: Verification questions are created to fact-check the baseline response. These questions are designed to scrutinize the accuracy of the initial response.
3. Execute Verification: The verification questions are answered independently to minimize any potential bias. This step ensures that the verification process is objective and thorough.
4. Final Refined Answer Generation: Based on the results of the verification process, a final refined answer is generated. This answer is expected to be more accurate and reliable, reducing the likelihood of hallucinations in the response.
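The four steps can be wired around any chat-completion function; a minimal sketch in which llm is a stand-in stub rather than a real model API:

```python
def cove(question, llm):
    # 1. Initial baseline response
    baseline = llm(f"Answer the question: {question}")
    # 2. Plan verification questions that fact-check the baseline
    plan = llm(f"Write verification questions that fact-check this answer: {baseline}")
    verification_questions = [q.strip() for q in plan.split("\n") if q.strip()]
    # 3. Execute verifications independently (the baseline is not shown,
    #    to avoid biasing the answers)
    verifications = [llm(q) for q in verification_questions]
    # 4. Generate the final refined answer from the collected evidence
    evidence = "\n".join(verifications)
    return llm(f"Question: {question}\nDraft: {baseline}\nEvidence: {evidence}\nRevised answer:")

# Stub LLM so the pipeline is runnable; a real deployment would call a model endpoint.
def stub_llm(prompt):
    return f"[response to: {prompt[:30]}...]"

final = cove("Who are some politicians born in Boston?", stub_llm)
```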
Verification Questions
Questions are categorized into three main groups:
1. Wiki Data & Wiki Category List: This category involves questions that expect answers in
the form of a list of entities. For instance, questions like “Who are some politicians born in
Boston?”
2. Multi-Span QA: Questions in this category seek multiple independent answers. An
example would be: “Who invented the first mechanized printing press and in what year?”
The answer is “Johannes Gutenberg, 1450”.
3. Long-form Generation: Any question that requires a detailed or lengthy response falls
under this group.
Chain of Verification (CoVe)
Dhuliawala, Shehzaad, et al. "Chain-of-Verification Reduces Hallucination in Large Language Models." arXiv
preprint arXiv:2309.11495 (2023)
CoVe and Explainable AI (XAI)
● Interpretability and Transparency:
○ Verification process generates questions to fact-check baseline
responses, improving transparency in decision-making.
● Reliability and Trust:
○ Refined answers enhance accuracy, building trust and reliability in
model outputs.
● Bias and Fairness:
○ Verification questions in CoVe identify and mitigate potential biases in
model output.
● User Interaction:
○ Verification process involves user interaction through verification
questions.
https://chain-of-verifcation.genai-training.h2o.ai/
Who are some
politicians born in
Boston?
Who are some
CEOs of banks
in the US?
What are some
credit scoring
bureaus in the US?
What are some
agencies assigned to
regulate and oversee
financial institutions
in the US?
Provide a list of major
investment firms and
financial institutions
headquartered in the
United States?
Benefits and Limitations of CoVe
● Benefits:
○ Enhanced Reliability: By incorporating verification steps, users can trust the accuracy of
information obtained from LLMs.
○ Depth of Understanding: The refinement of answers allows users to gain a deeper
understanding of the topic beyond the initial response.
○ Educational Value: Promotes responsible and informed use of LLMs, encouraging users to go
beyond surface-level information.
● Limitations
○ Incomplete Removal of Hallucinations: CoVe does not completely eliminate hallucinations in
generated content, which means it can still produce incorrect or misleading information.
○ Limited Scope of Hallucination Mitigation: CoVe primarily addresses hallucinations in the
form of directly stated factual inaccuracies but may not effectively handle other forms of
hallucinations, such as errors in reasoning or opinions.
○ Increased Computational Expense: Generating and executing verification alongside
responses in CoVe adds to the computational cost, similar to other reasoning methods like
Chain-of-Thought.
○ Upper Bound on Improvement: The effectiveness of CoVe is limited by the overall capabilities
of the underlying language model, particularly in its ability to identify and rectify its own
mistakes.
How to improve the CoVe pipeline
● Prompt engineering
● External tools
○ The final output depends heavily on the answers to the verification questions.
○ For factual question answering, you can use advanced search tools such as Google Search or SERP APIs.
○ For custom use cases, you can always use RAG methods or other retrieval techniques to answer the verification questions.
● More chains
● Human in the loop
Conclusions
● CoVe aims to improve model transparency, reliability, and trust.
● CoVe is not a silver bullet, but it can strengthen an LLM testing arsenal.
Navdeep Gill
Engineering Manager, AI Governance | Responsible AI
navdeep.gill@h2o.ai
Contact
Thank you
Benchmarking and
Evaluation
Write a 1000-word essay in 1 minute
LLMs are good at generating large amounts of text that is consistent and logical.
Are LLMs smarter than humans?
Introduction
Have LLMs manage your investment portfolio
A model can give generic advice on safe money management, but we don’t trust our life savings to a chatbot.
Let a bot reply to your email
It depends on how important the email is. Maybe we are more comfortable with the model automatically creating a draft.
Summarization
Summarizing large documents without losing essential information. Extracting
key-value pairs.
How can we use LLMs while minimizing risk?
Introduction
Customer Service
Answer FAQs from customers. May require retrieving from a knowledge base
and summarizing.
Report Generation - AutoDoc
Create ML interpretation documents. Reports required for regulatory
compliance.
Risk
How risky are LLMs?
A lawyer used ChatGPT to prepare
a court filing. It went horribly awry.
“While ChatGPT can be useful to
professionals in numerous industries,
including the legal profession, it has
proved itself to be both limited and
unreliable. In this case, the AI invented
court cases that didn't exist, and
asserted that they were real.”
CBS News
Chevy dealership’s AI chatbot
suggests Ford F-150 when asked
for best truck
“As an AI, I don't have personal
preferences but I can provide insights
based on popular opinions and
reviews. Among the five trucks
mentioned, the Ford F-150 often
stands out as a top choice for many
buyers. It's known for its impressive
towing …”
Detroit Free Press
LLM Lifecycle

Data: Large & Diverse
To train a foundation model, you need a large, diverse dataset that covers the tasks the model should be able to perform.

Foundation Model: Generative AI
Foundation models are designed to produce a wide and general variety of outputs, such as text, image or audio generation. They can be standalone systems or can be used as a "base" for many other applications.

Fine Tuning: Supervised Fine-Tuning
Fine-tuning can improve a model's performance on a task while preserving its general language knowledge.

RAG: h2oGPTe
A powerful search assistant to answer questions from large volumes of documents, websites, and workplace content.

Leaderboard: HELM
HELM is a framework for evaluating foundation models. The leaderboard shows how the various models perform across different groups of scenarios and different metrics.

Risk Management: Eval Studio
Design and execute task-specific benchmarks. Perform both manual and LLM-based evaluations. Systematically collect and store results along with metadata.
MMLU (Massive Multitask Language Understanding)
A test to measure a text model's multitask accuracy. The
test covers 57 tasks including elementary mathematics, US
history, computer science, law, and more.
Evaluation for LLMs
Popular benchmarks on open source leaderboards
HellaSwag
A test of common-sense inference, which is easy for
humans (~95%) but challenging for SOTA models.
AI2 Reasoning Challenge (ARC)
A set of grade-school science questions.
TruthfulQA
A test to measure a model’s propensity to reproduce
falsehoods commonly found online.
When you drop a ball from rest it accelerates downward at 9.8 m/s². If
you instead throw it downward assuming no air resistance its
acceleration immediately after leaving your hand is
(A) 9.8 m/s²
(B) more than 9.8 m/s²
(C) less than 9.8 m/s²
(D) Cannot say unless the speed of throw is given.
MMLU Example
A woman is outside with a bucket and a dog. The dog is running
around trying to avoid a bath. She…
(A) Rinses the bucket off with soap and blow dry the dog’s head.
(B) Uses a hose to keep it from getting soapy.
(C) Gets the dog wet, then it runs away again.
(D) Cannot say unless the speed of throw is given.
HellaSwag Example
Approaching human baseline
Hugging Face Open LLM Leaderboard
It is a popular location to track various models evaluated using different metrics.
These metrics include human baselines that give some idea of how drastically these models have improved over the last two years.
The Need for Evaluation
Popular leaderboards are not enough

Benchmarks are not task-specific: Benchmarks on open-source leaderboards are well-rounded and diverse. They are not sufficient to reflect the performance of a model in a domain-specific scenario.

Some model entries may cheat: There can be models on the leaderboard that are trained on the benchmark data itself. We do not have robust enough tests to detect this.

Non-verifiable results: The procedure followed in conducting the tests, and the results, are not completely transparent and can also vary among different leaderboards.
Custom Test Sets
Create custom benchmarks for domain-specific scenarios

Task Specific Evals: Create task-specific QA pairs along with the reference documents.
- Bank Teller
- Loan Officer
- Program Manager
- Data Analyst

Test for Alignment: Create the QA pairs that test for agreement with your values, intentions, and preferences.
- Correctness
- Relevance
- Similarity
- Hallucination
- Precision
- Recall
- Faithfulness

Test for Safety: Test that all outputs meet your safety levels.
- Toxicity
- Bias
- Offensive content
- PII of customers
- Company secrets

Test for Compliance: Tests to confirm or show proof of meeting compliance standards.
- Government
- Company
H2O Eval Studio
Design and execute task-specific benchmarks
All the evaluators are included
Eval Studio contains evaluators to check for
Alignment, Safety, and Compliance as
discussed before.
Create custom benchmarks
Users can upload Documents and create
custom Tests (Question-Answer pairs) based on
the document collection.
Run Evals and visualize results
Once a benchmark has been designed, users
can then run the evaluation against the
benchmark and visualize the results. A detailed
report can also be downloaded.
Srinivas Neppalli
Senior AI Engineer
srinivas.neppalli@h2o.ai
Contact
Thank you!
Lab 2 - Experimental Design of
Gen AI Evaluations
Through the Lens of Model Risk Management
One possible definition of “Conceptual Soundness”
for LLMs by themselves might be considered as a
combination of the following choices:
(1) Training Data
(2) Model Architecture
(3) An explanation of why choices (1) and (2) were made
(4) An explanation of why (1) and (2) are reasonable
for the use case that the LLM will be applied to.
Through the Lens of Model Risk Management
What about a RAG system?
How does the concept of “Conceptual Soundness” apply
when not only choices surrounding training data and
model architecture are involved, but also choices
around:
- Embeddings
- System Prompts (e.g. Personalities)
- Chunk Sizes
- Chunking Strategies
- OCR Techniques
- RAG-type (e.g. Hypothetical Document Embeddings)
- Mixture-of-Experts or Ensembling
Models / Systems / Agents are the fundamental AI
systems under scrutiny. As opposed to traditional
machine learning models, Generative AI systems include
many choices beyond the models themselves.
Benchmarks / Tests are the sets of prompts and
responses that are used to gauge how well an AI system
can perform a certain task or use case.
Evaluators are the mathematical functions used to
evaluate various dimensions of performance.
Documents are the data sets used for evaluation in the
case of RAG systems, combining models, parsing, OCR,
chunking, embeddings and other components of an
evaluation.
What is the primary unit of analysis when evaluating an AI system or model?
An eval can be defined as a series of tuples, each of size 3.
Each tuple consists of:
(1) Context / Prompt / Question
(2) Output / Response / Ground Truth Answer
(3) Document (in the case of RAG)
Source: https://www.jobtestprep.com/bank-teller-sample-questions
Designing Your Own Eval
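Such a series of 3-tuples can be represented directly in code. The sketch below uses a hypothetical exact-match scorer; real evaluators (such as those in Eval Studio) are far richer:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalCase:
    prompt: str                       # (1) context / prompt / question
    expected: str                     # (2) ground-truth answer
    document: Optional[str] = None    # (3) document, in the case of RAG

def run_eval(cases, model):
    # Score by exact match; a stand-in for richer evaluators.
    hits = sum(model(c.prompt) == c.expected for c in cases)
    return hits / len(cases)

cases = [EvalCase("2 + 2 = ?", "4"), EvalCase("Capital of France?", "Paris")]
# A toy "model" that gets one of the two cases right.
accuracy = run_eval(cases, lambda p: "4" if "2 + 2" in p else "Lyon")
```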
Problem statement: How well does my Bank Teller AI Application correctly answer
questions related to being a Bank Teller?
Create an eval test case that can be used to evaluate how well BankTellerGPT can
answer questions related to being a Bank Teller.
LLM-only Example Test Case
{
Prompt: Respond to the following questions with single letter answer. Question: A specific bank branch serves
256 clients on average every day. The ratio between tellers and clients is 1:32, so that every teller serves 32
people on average every day. The management wishes to change this ratio to 1:20. How many new tellers should
be hired? A. 4 B. 5 C. 9 D. 12,
Response: B. 5,
Document: None
}
Designing Your Own Eval - BankTellerGPT
Source: https://www.jobtestprep.com/bank-teller-sample-questions
Designing Your Own Eval - BankTellerGPT
Problem statement: How well does my Bank Teller AI Application actually answer
questions related to being a Bank Teller?
Create an eval test case that can be used to evaluate how well BankTellerGPT can
answer questions related to being a Bank Teller.
RAG Example Test Case
{
Prompt: Respond to the following questions with single letter answer. Question: A specific bank branch serves
256 clients on average every day. The ratio between tellers and clients is 1:32, so that every teller serves 32
people on average every day. The management wishes to change this ratio to 1:20. How many new tellers should
be hired? A. 4 B. 5 C. 9 D. 12,
Response: B. 5,
Document: “Internal Bank Teller Knowledge Base”
}
Source: https://www.jobtestprep.com/bank-teller-sample-questions
Designing Your Own Eval
Task # 1: Create your own GenAI Test Benchmark for the SR 11-7 document
Some possible test cases
Prompt: How should banks approach model development?
Response: Banks should approach model development with a focus on sound risk management practices. They
should ensure that models are developed and used in a controlled environment, with proper documentation,
testing, and validation.
Prompt: How can model risk be reduced?
Response: Model risk can be reduced by establishing limits on model use, monitoring model performance,
adjusting or revising models over time, and supplementing model results with other analysis and information.
Prompt: How often should a bank update its model inventory?
Response: A bank should update its model inventory regularly to ensure that it remains current and accurate.
Designing Your Own Eval - BankTellerGPT
Task # 2: Create and launch an LLM-only eval leaderboard
To complete this, you will need to:
1. Pick an evaluator (e.g. Token presence)
2. Pick a connection (e.g. Enterprise h2oGPT - LLM Only)
3. Pick a set of eval tests (e.g. Bank Teller Benchmark)
Designing Your Own Eval - SR 11-7
Task # 3: Create a new evaluator based on RAG and launch the leaderboard
To complete this, you will need to:
1. Pick an evaluator (e.g. Answer correctness)
2. Pick a connection (e.g. Enterprise h2oGPT-RAG)
3. Pick your test created in Task 1
Evaluators
H2O Eval Studio evaluators overview

PII (privacy) | RAG: Yes | LLM: Yes
Purpose: Assess whether the answer contains personally identifiable information (PII) such as credit card numbers, phone numbers, social security numbers, street addresses, email addresses and employee names.
Method: Regex suite which quickly and reliably detects formatted PII: credit card numbers, social security numbers (SSN) and emails.

Sensitive data (security) | RAG: Yes | LLM: Yes
Purpose: Assess whether the answer contains security-related information such as activation keys, passwords, API keys, tokens or certificates.
Method: Regex suite which quickly and reliably detects formatted sensitive data: certificates (SSL/TLS certs in PEM format), API keys (H2O.ai and OpenAI) and activation keys (Windows).

Answer Correctness | RAG: Yes | LLM: Yes
Purpose: Assess whether the answer is correct given the expected answer (ground truth).
Method: A score based on combined and weighted semantic and factual similarity between the answer and the ground truth (see Answer Similarity and Faithfulness).

Answer Relevance | RAG: Yes | LLM: Yes
Purpose: Assess whether the answer is complete and does not contain redundant information (noise) that was not asked for.
Method: A score based on the cosine similarity of the question and generated questions, where the generated questions are created by prompting an LLM to generate questions from the actual answer.

Answer Similarity | RAG: Yes | LLM: Yes
Purpose: Assess the semantic similarity of the answer and the expected answer.
Method: A score based on the similarity of the actual and expected answer calculated by a cross-encoder model (NLP).

Context Precision | RAG: Yes | LLM: No
Purpose: Assess the quality of the retrieved context, considering the order and relevance of the text chunks on the context stack.
Method: A score based on the presence of the expected answer (ground truth) in the text chunks at the top of the retrieved context chunk stack; relevant chunks deep in the stack, irrelevant chunks and an unnecessarily big context lower the score.

Context Recall | RAG: Yes | LLM: No
Purpose: Assess how much of the ground truth is represented in the retrieved context.
Method: A score based on the ratio of the number of sentences in the ground truth that can be attributed to the context to the total number of sentences in the ground truth.

Context Relevance | RAG: Yes | LLM: No
Purpose: Assess whether the context is complete and does not contain redundant information (noise) that is not needed.
Method: A score based on the ratio of the context sentences needed to generate the answer to the total number of sentences in the retrieved context.

TERMINOLOGY: answer ~ actual RAG/LLM answer | expected answer ~ expected RAG/LLM answer, i.e. ground truth | retrieved context ~ text chunks retrieved from the vector DB prior to LLM answer generation in RAG.
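The regex-suite approach used by the PII evaluator can be sketched as follows; the patterns are illustrative, not H2O Eval Studio's actual rules:

```python
import re

# Illustrative patterns for formatted PII; a production suite is more extensive.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def detect_pii(answer):
    # Return the PII categories found in a RAG/LLM answer.
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(answer))

found = detect_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```

An answer would fail the evaluator whenever detect_pii returns a non-empty list.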
Evaluators (continued)
H2O Eval Studio evaluators overview

Faithfulness | RAG: Yes | LLM: No
Purpose: Assess whether the answer's claims can be inferred from the context, i.e. the factual consistency of the answer given the context (hallucinations).
Method: A score based on the ratio of the answer's claims that are present in the context to the total number of the answer's claims.

Hallucination Metric | RAG: Yes | LLM: No
Purpose: Assess the RAG's base LLM model hallucination.
Method: A score based on the Vectara hallucination evaluation cross-encoder model, which assesses the RAG's base LLM hallucination when it generates the actual answer from the retrieved context.

RAGAs | RAG: Yes | LLM: No
Purpose: Assess overall answer quality considering both context and answer.
Method: A composite score which is the harmonic mean of the Faithfulness, Answer Relevance, Context Precision and Context Recall metrics.

Tokens Presence | RAG: Yes | LLM: Yes
Purpose: Assess whether both the retrieved context and the answer contain required string tokens.
Method: A score based on a substring and/or regular-expression search for the required set of strings in the retrieved context and answer.
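The Tokens Presence check reduces to substring and regular-expression search; a minimal sketch whose function name and signature are illustrative:

```python
import re

def tokens_presence(text, required_tokens, required_patterns=()):
    # Pass only if every required substring and every regex is found in the text.
    substrings_ok = all(tok in text for tok in required_tokens)
    patterns_ok = all(re.search(p, text) for p in required_patterns)
    return substrings_ok and patterns_ok

answer = "The branch needs 4 additional tellers, for 12 tellers in total."
passed = tokens_presence(answer, ["tellers"], [r"\b12\b"])
```

In a RAG evaluation the same check would be run against both the retrieved context and the answer.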
Security, Guardrails and
Hacking
Guardrails
● LLM Guardrails are a set of predefined constraints and guidelines
that are applied to LLMs to manage their behavior.
● Guardrails serve to ensure responsible, ethical, and safe usage of
LLMs, mitigate potential risks, and promote transparency and
accountability.
● Guardrails are a form of proactive control and oversight over the
output and behavior of language models, which are otherwise
capable of generating diverse content, including text that may be
biased, inappropriate, or harmful.
Understanding the distinct functions of each type of guardrail is pivotal
in creating a comprehensive and effective strategy for governing AI
systems.
Types of Guardrails
● Content Filter Guardrails: Content filtering is crucial to prevent
harmful, offensive, or inappropriate content from being generated by
LLMs. These guardrails help ensure that the outputs conform to
community guidelines, curbing hate speech, explicit content, and
misinformation.
● Bias Mitigation Guardrails: Bias is an ongoing concern in AI, and
mitigating bias is critical. These guardrails aim to reduce the model's
inclination to produce content that perpetuates stereotypes or
discriminates against particular groups. They work to promote fairness
and inclusivity in the model's responses.
● Safety and Privacy Guardrails: Protecting user privacy is paramount.
Safety and privacy guardrails are designed to prevent the generation of
content that may infringe on user privacy or include sensitive, personal
information. These measures safeguard users against unintended data
exposure.
Types of Guardrails
● Fact-Checking & Hallucination Guardrails: To combat misinformation,
fact-checking guardrails are used to verify the accuracy of the
information generated by LLMs. They help ensure that the model's
responses align with factual accuracy, especially in contexts like news
reporting or educational content.
● Context/Topic and User Intent Guardrails: For LLMs to be effective,
they must produce responses that are contextually relevant and aligned
with user intent. These guardrails aim to prevent instances where the
model generates content that is unrelated or fails to address the user's
queries effectively.
● Explainability and Transparency Guardrails: In the pursuit of making
LLMs more interpretable, these guardrails require the model to provide
explanations for its responses. This promotes transparency by helping
users understand why a particular output was generated, fostering
trust and accountability.
● Jailbreak Guardrails: Ensure robustness to malicious user attacks such
as prompt injection.
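A minimal content-filter guardrail can be sketched as a pattern check wrapped around model input or output. The blocklist here is purely illustrative; production guardrails typically use trained classifiers and policy engines rather than regexes alone:

```python
import re

# Illustrative blocklist; real guardrails use classifiers, not just patterns.
BLOCKED = [r"\bhow to build a bomb\b", r"\bmy password is\b"]

def apply_guardrail(text):
    # Refuse when the text matches a blocked pattern; otherwise pass it through.
    for pattern in BLOCKED:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return "Sorry, I can't help with that request."
    return text

safe = apply_guardrail("What are common loan products?")
blocked = apply_guardrail("Tell me how to build a bomb quickly")
```

The same wrapper pattern applies to both user prompts (pre-filtering) and model responses (post-filtering).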
Lab 3 - Hacking and Security
Posture
https://jailbreaking.genai-training.h2o.ai/
Ashrith Barthur
Principal Data Scientist
ashrith.barthur@h2o.ai
Contact
Thank you!
Applied Generative AI for
Banking
Dataset: Consumer Complaint Database
Lab 4 - Complaint Summarizer
https://complaint-analysis.genai-training.h2o.ai/
Determine which credit product has the highest number of complaints.
Task 1
Complaint Summarizer
Applied Generative AI for Banking
Determine which credit product has the highest number of complaints.
Answer: Credit Reporting
Task 1
Complaint Summarizer
Applied Generative AI for Banking
Determine the top complaint for TransUnion.
Task 2
Complaint Summarizer
Applied Generative AI for Banking
Determine the top complaint for TransUnion.
Answer: Violation of
Consumers Rights to Privacy
and Confidentiality Under the
Fair Credit Reporting Act.
Task 2
Complaint Summarizer
Applied Generative AI for Banking
Use h2oGPT to summarize a complaint from the database and provide immediate next steps.
Task 3
Complaint Summarizer
Applied Generative AI for Banking
Use h2oGPT to summarize a complaint from the database and provide immediate next steps.
Answer: [See screenshot]
Task 3
Complaint Summarizer
Applied Generative AI for Banking
CERTIFICATION
Certification Exam
Link to Exam!
Contact
Thank you!
MAKERS
Jonathan Farland
Director of Solution Engineering
jon.farland@h2o.ai
www.linkedin.com/in/jonfarland/
TECHNICAL APPENDIX
Retrieval-Augmented Generation (RAG)
RAG as a system is a particularly good use of vector databases. RAG systems take advantage of the LLM's context window, filling it with only the most relevant examples from real data. This "grounds" the LLM in relevant context and greatly reduces hallucination.
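The retrieval step described above can be sketched with a toy bag-of-words "embedder" and cosine similarity standing in for a real embedding model and vector database:

```python
import math
from collections import Counter

# Toy stand-in for an embedding model: bag-of-words token counts.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in for a vector database lookup: rank documents by similarity.
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "credit reporting complaints rose sharply this quarter",
    "mortgage rates fell slightly in March",
    "consumers filed complaints about credit card fees",
]
context = retrieve("credit complaints", docs)
# Only the retrieved context is placed into the LLM's context window.
prompt = "Answer using only this context:\n" + "\n".join(context)
print(context)
```

A real system swaps in a trained embedding model and an approximate-nearest-neighbor index, but the grounding mechanics are the same.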
Embedding Models Source: https://huggingface.co/blog/1b-sentence-embeddings
Embedding Models - INSTRUCTOR (Source: https://arxiv.org/pdf/2212.09741.pdf)
INSTRUCTOR: Instruction-based Omnifarious Representations
The model is trained to generate embeddings using both the instruction and the textual input.
Applicable to virtually every use case, due to its ability to create latent vector representations that incorporate the instruction.
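A toy sketch of the idea, with a hashing "embedder" standing in for the actual INSTRUCTOR model: the vector is a function of the instruction as well as the text, so the same sentence can embed differently per task.

```python
import hashlib

# Toy instruction-conditioned embedder (hashed bag of words); purely
# illustrative -- the real model is a trained transformer encoder.
def embed(instruction: str, text: str, dim: int = 8) -> list[int]:
    vec = [0] * dim
    for token in (instruction + " " + text).lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1  # each token increments one hashed dimension
    return vec

sentence = "The patient reported chest pain."
v_retrieval = embed("Represent the medical document for retrieval:", sentence)
v_cluster = embed("Represent the sentence for clustering:", sentence)
print(v_retrieval)
print(v_cluster)
```

Because the instruction tokens enter the representation, retrieval-oriented and clustering-oriented embeddings of the same sentence generally differ, which is the property INSTRUCTOR exploits.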
Embedding Models - BGE (Source: https://arxiv.org/pdf/2310.07554.pdf)
LLM-Embedder
This embedding model is trained specifically for use with RAG systems.
A reward model is introduced that gives a retrieval candidate a higher reward if it results in a higher generation likelihood for the expected output.
Uses contrastive learning to directly optimize for RAG applications.
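The contrastive objective can be illustrated with a toy InfoNCE-style loss; the similarity scores below are made-up numbers, not model outputs. A positive candidate that is well separated from the in-batch negatives yields a lower loss.

```python
import math

def info_nce_loss(pos_score: float, neg_scores: list[float], temp: float = 0.05) -> float:
    """Negative log-softmax of the positive candidate over all candidates."""
    logits = [pos_score / temp] + [s / temp for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

# Well-separated positive vs. a positive barely above the negatives.
good = info_nce_loss(pos_score=0.9, neg_scores=[0.2, 0.1])
bad = info_nce_loss(pos_score=0.3, neg_scores=[0.2, 0.1])
print(good < bad)
```

Training pushes the embedder toward the "good" regime, so that candidates the reward model favors score highest at retrieval time.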
H2O.ai Confidential
AI Engines Deployment Consumption
LLM AppStudio
LLM DataStudio
LLM EvalStudio
H2O LLMs Ecosystem
AppStore
End Users
Generative AI with H2O.ai
MLOps
AI Engine Manager
Doc-QA
Enterprise
h2oGPT
Explanations in images and text
- Highlights the most important regions of an image
- Highlights the most important words
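Word-level highlighting of this kind can be sketched with a toy leave-one-out attribution, where `score` is a made-up keyword scorer standing in for a real model:

```python
# Made-up sentiment-style scorer; a real explainer would query the model.
def score(text: str) -> float:
    positive = {"great", "excellent", "resolved"}
    negative = {"fraud", "error", "unresolved"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

def word_importance(text: str) -> dict[str, float]:
    """Importance of each word = how much the score moves when it is removed."""
    base = score(text)
    words = text.split()
    return {
        w: abs(base - score(" ".join(words[:i] + words[i + 1:])))
        for i, w in enumerate(words)
    }

imp = word_importance("billing error remains unresolved")
print(max(imp, key=imp.get))  # the word whose removal changes the score most
```

The same leave-one-out idea applies to image regions: mask a region, re-score, and highlight the regions with the largest score change.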
Code Translation
LLM Context Length Testing
Source: https://github.com/gkamradt/LLMTest_NeedleInAHaystack?tab=readme-ov-file
Responsible Generative AI
Ethical Considerations, Data Privacy, and User Consent: Assess the potential impact of generative AI on individuals and society. Give users control over how their data is used by generative AI. Consent mechanisms should be transparent and user-friendly.
Monitoring, Regulation, and Security: Detect misuse or anomalies in generative AI behavior. Regulatory compliance ensures adherence to ethical and legal guidelines. Security measures are crucial to protect AI models from adversarial attacks or unauthorized access.
Accountability and Oversight: Define roles and responsibilities for AI development and deployment. Oversight mechanisms ensure that responsible practices are followed.
Education and Awareness: Users and developers should be informed about generative AI capabilities, limitations, and ethical considerations.
Stakeholder Involvement: Involving various stakeholders in AI discussions promotes diverse perspectives and responsible decision-making.
Continuous Evaluation and Improvement: Continually assess models to ensure fairness, accuracy, and alignment with ethical standards.
Transparency, Explainability, Bias Mitigation, Debugging, and Guardrails: Recognize and mitigate both subtle and glaring biases that may emerge from training data. Ensure that users can understand and trust the decisions made by generative AI models. Debug models with techniques such as adversarial prompt engineering. Proactively manage risks and maintain control over the model's behavior with guardrails.
Audit Input Data, Benchmarks, and Test the Unknown: Assess the quality of data used to train Generative AI models. Utilize benchmarks and random attacks for testing.
Minimizing Model Hallucinations Using the Chain-of-Verification (CoVe) Method
The CoVe method reduces factual errors in large language models by drafting, fact-checking, and verifying responses - the model deliberates on its own responses and self-corrects them.
Steps:
1. Given a user query, an LLM generates a baseline response that may contain inaccuracies, e.g. factual hallucinations.
2. To improve this, CoVe first generates a plan as a set of verification questions to ask, then executes that plan by answering them and checking for agreement.
3. Individual verification questions are typically answered with higher accuracy than the facts in the original long-form generation.
4. Finally, the revised response takes the verifications into account.
5. The factored version of CoVe answers verification questions so that they cannot condition on the original response, avoiding repetition and improving performance.
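The loop above can be sketched end to end with a stubbed LLM. The `llm` function below returns canned strings and is purely hypothetical; in practice each step is a separate prompt to a real model such as Enterprise h2oGPT.

```python
# Stubbed LLM returning canned strings -- hypothetical, for structure only.
def llm(step: str, payload: str) -> str:
    canned = {
        "draft": "The FCRA was passed in 1980.",   # baseline with a factual error
        "plan": "When was the FCRA passed?",       # planned verification question
        "answer": "The FCRA was passed in 1970.",  # question answered independently
        "revise": "The FCRA was passed in 1970.",  # corrected final response
    }
    return canned[step]

def chain_of_verification(query: str) -> str:
    draft = llm("draft", query)            # 1. generate a baseline response
    question = llm("plan", draft)          # 2. plan verification questions
    # Factored CoVe: the question is answered WITHOUT seeing the draft,
    # so the answer cannot condition on (and repeat) the original error.
    answer = llm("answer", question)       # 3. verify independently
    revised = llm("revise", draft + "\n" + answer)  # 4-5. revise using verifications
    return revised

print(chain_of_verification("When was the FCRA passed?"))
```

With a real model, steps 2-3 would produce several questions and answers, and the revision prompt would include all of them alongside the draft.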
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Recently uploaded (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Generative AI Masterclass - Model Risk Management.pptx

  • 1. H2O.ai Confidential Generative AI Masterclass - Model Risk Management
  • 3. H2O.ai Confidential Introduction
    - Today’s training looks at responsible, explainable, and interpretable AI as applied to Generative AI, and specifically to Large Language Models (LLMs).
    - It includes several sections on theoretical concepts as well as hands-on labs using Enterprise h2oGPT and H2O GenAI Applications.
    - The hands-on labs focus on applying Gen AI in the context of a Model Risk Manager’s role at a bank or financial institution.
    - NOTE: A separate end-to-end masterclass on Generative AI is also available within the training environment, as well as on GitHub: https://github.com/h2oai/h2o_genai_training. It includes:
      - Data Preparation for LLMs
      - Fine-Tuning custom models
      - Model Evaluation
      - Retrieval-Augmented Generation (RAG)
      - Guardrails
      - AI Applications
  • 4. H2O.ai Confidential Agenda
    - Welcome: Session Kick-off (5m, Jon Farland)
    - Interpretability for Generative AI: Large Language Model Interpretability (25m, Kim Montgomery); Workshop: Explainable and Interpretable AI for LLMs (20m, Navdeep Gill)
    - Benchmarking and Evaluations: Frameworks for Evaluating Generative AI (20m, Srinivas Neppalli); Workshop: Experimental Design of Gen AI Applications (20m, Jon Farland)
    - Security, Guardrails and Hacking: Workshop: Guardrails and Hacking (20m, Ashrith Barthur)
    - Applied Generative AI for Banking - Complaint Summarizer: Workshop: Complaint Summarizer AI Application (20m, Jon Farland)
  • 5. H2O.ai Confidential Housekeeping
    - The training environment for today is a dedicated instance of the H2O AI Managed Cloud, a GPU-powered environment capable of training and deploying LLMs, as well as designing and hosting entire AI Applications.
    - It can be accessed at https://genai-training.h2o.ai.
    - Login credentials should have been sent to the email address you registered with.
    - If you don’t yet have credentials, or you are otherwise unable to access the environment, please speak with any member of the H2O.ai team.
    - The training environment will be available to attendees for 3 days after the conference, but dedicated proof-of-concept environments can be provided (including on-premise) on request. Please speak to any H2O.ai team member or email jon.farland@h2o.ai
  • 7. What is Generative AI? GenAI enables the creation of novel content.
    - GenAI model: learns patterns in unstructured data. Input: unstructured data. Output: novel content.
    - Traditional AI model: learns the relationship between data and labels. Input: labeled data. Output: a label.
  • 8. H2O.ai Confidential GenAI Complications
    More complicated input:
    - Prompt phrasing
    - Instructions
    - Examples
    More relevant dimensions to output:
    - Truthfulness/Accuracy
    - Safety
    - Fairness
    - Robustness
    - Privacy
    - Machine Ethics
    [TrustLLM: Trustworthiness in Large Language Models, Sun et al.]
  • 9. H2O.ai Confidential Common tests
    - Can the model recognize problematic responses?
      - Inaccurate responses
      - Unethical responses
      - Responses conveying stereotypes
    - Can an inappropriate response be provoked?
      - Jailbreaking
      - Provoking toxicity
      - Leading questions / false context
  • 10. H2O.ai Confidential TrustLLM Result Summary Matrix [TrustLLM: Trustworthiness in Large Language Models, Sun, et al]
  • 11. H2O.ai Confidential TrustLLM Main Conclusions
    Main findings:
    - Trustworthiness and utility were positively correlated.
    - Closed-source models generally outperformed open-source ones.
    - Over-alignment for trustworthiness can compromise utility.
    [TrustLLM: Trustworthiness in Large Language Models, Sun et al.]
  • 12. H2O.ai Confidential Accuracy: Traditional ML
    - Comparing a prediction to an outcome.
    - Generally the correct labels are in a simple format.
  • 13. H2O.ai Confidential Accuracy: Example LLMs
    The simplest way to measure accuracy is to compare the result against another source of information.
    Example sources:
    - Checking results against a given source (RAG)
    - Checking results against the tuning data
    - Checking results against an external source (e.g., Wikipedia)
    - Checking results against the training data (cumbersome)
    - Checking for self-consistency (SelfCheckGPT)
    - Checking results against a larger LLM
    Scoring methods:
    - Natural language inference
    - Comparing embeddings
    - Influence functions
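The "comparing embeddings" scoring method above can be sketched in a few lines. This is a minimal, self-contained illustration: the bag-of-words "embedding" is a toy stand-in so the sketch runs offline; a real system would use a sentence-embedding model.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words 'embedding' so the sketch runs offline; a real
    # system would use a sentence-embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

reference = "The Fed supervises bank holding companies in the US."
response = "Bank holding companies in the US are supervised by the Fed."
# A score near 1 suggests the response agrees with the reference.
score = cosine_similarity(embed(reference), embed(response))
```

The same comparison works against any of the sources listed above (a RAG chunk, tuning data, or a larger LLM's answer); only the reference text changes.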
  • 14. H2O.ai Confidential RAG (Retrieval-Augmented Generation)
    01 Chunk and embed documents
    02 Submit a query
    03 Generate an embedding for the query
    04 Retrieve relevant information via similarity search
    05 Combine the relevant information to ground the query to the model
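The RAG steps above can be sketched end to end. Everything here is illustrative: the toy bag-of-words embedding and the two-sentence "document" are assumptions made so the sketch is self-contained and runnable.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real RAG stack would use a
    # sentence-embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1: chunk and embed the documents.
document = ("Model validation should be performed by independent staff. "
            "Banks should maintain an inventory of all models in use.")
chunks = document.split(". ")
index = [(chunk, embed(chunk)) for chunk in chunks]

# Steps 2-3: submit a query and generate its embedding.
query = "Who should perform model validation?"
query_vec = embed(query)

# Step 4: retrieve the most relevant chunk via similarity search.
best_chunk, _ = max(index, key=lambda item: cosine(query_vec, item[1]))

# Step 5: combine the retrieved context to ground the query.
grounded_prompt = f"Context: {best_chunk}\n\nQuestion: {query}"
```

The grounded prompt is then sent to the LLM, which answers from the retrieved context rather than from its parametric memory alone.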
  • 15. v H2O.ai Confidential Accuracy: Retrieval Augmented Generation (RAG) Provides a Simple Solution
  • 16. v H2O.ai Confidential Accuracy: Retrieval Augmented Generation (RAG) Provides a Simple Solution
  • 17. H2O.ai Confidential Influence functions
    - Seek to measure the influence of including a data point in the training set on the model response.
    - Datamodels/TRAK
      - Learn a model based on binary indicator functions.
      - Directly measure how much a training instance influences the outcome.
    - DataInf
      - Measures the influence of a document during fine-tuning.
    [DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models, Kwon et al.]
    [TRAK: Attributing Model Behavior at Scale, Park et al.]
  • 18. H2O.ai Confidential Influence functions / computer vision [TRAK: Attributing Model Behavior at Scale, Park et al.]
  • 19. H2O.ai Confidential Influence functions / NLP [Studying Large Language Model Generalization with Influence Functions, Grosse et al.]
  • 20. H2O.ai Confidential Self-consistency comparison: SelfCheckGPT
    - Sample different responses from an LLM.
    - Check for consistency between responses.
    - Assume that hallucinations will occur less consistently.
    [SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models, Manakul, Liusie, and Gales]
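The SelfCheckGPT idea above (sample several responses, then check how much they agree) can be sketched with a simple token-overlap consistency score. `sample_llm` is a hypothetical stub standing in for repeated model calls; a real implementation would sample the model with temperature > 0 and could use NLI or embedding similarity instead of token overlap.

```python
from collections import Counter

def sample_llm(prompt, n):
    # Hypothetical stub standing in for n stochastic samples from an LLM.
    canned = ["The branch needs 13 tellers in total.",
              "The branch needs 13 tellers in total.",
              "The branch needs 21 tellers in total."]
    return canned[:n]

def overlap(a, b):
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    shared = sum((ta & tb).values())
    return shared / max(sum(ta.values()), sum(tb.values()))

def consistency_score(samples):
    # Mean pairwise token overlap; hallucinated details tend to vary
    # between samples, which drags the score down.
    pairs = [(a, b) for i, a in enumerate(samples) for b in samples[i + 1:]]
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)

samples = sample_llm("How many tellers does the branch need?", 3)
score = consistency_score(samples)
suspect = score < 0.95  # low agreement: possible hallucination
```

Here the third sample disagrees on the numeric detail, so the consistency score drops below the (illustrative) threshold and the answer is flagged for review.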
  • 21. v H2O.ai Confidential Counterfactual analysis: Traditional ML ● How does changing a feature change the model outcome? ● What is the smallest change that can change the outcome?
  • 23. H2O.ai Confidential Counterfactual analysis: LLMs
    How consistent are results under different:
    - Prompts / instructions:
      - Changes in prompt design
      - Changes in prompt instructions
      - Multi-shot examples
      - Word replacement with synonyms
      - Proper names or pronouns (fairness)
      - Chain-of-thought and other guided reasoning methods
    - Context / RAG retrieval
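The pronoun-replacement counterfactual mentioned above can be sketched as a simple prompt perturbation. The swap list is illustrative, not exhaustive, and a real fairness probe would also handle capitalization, names, and grammatical context.

```python
def perturb_pronouns(prompt):
    # Build a counterfactual prompt by swapping gendered terms;
    # the word list here is illustrative, not exhaustive.
    swaps = {"he": "she", "his": "her", "him": "her"}
    return " ".join(swaps.get(tok.lower(), tok) for tok in prompt.split())

original = "The applicant said he would repay his loan on time."
counterfactual = perturb_pronouns(original)
# Score the model on both prompts and compare outputs; a large
# divergence in responses signals sensitivity to the swapped attribute.
```

The same pattern extends to synonym replacement or proper-name substitution: generate the perturbed prompt, run both through the model, and measure how much the responses diverge.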
  • 24. H2O.ai Confidential Intervention in the case of problems
    If problematic behavior is found in a model, there are several options:
    - Prompt / instruction modifications
    - Choosing a different base model
    - Fine-tuning to modify LLM behavior
    - Altering the document retrieval process (RAG)
    - Monitoring model output for problematic responses
  • 25. H2O.ai Confidential Conclusions
    - Many of the basic problems of understanding LLMs are similar to those of other large models.
    - Through careful testing we can hope to understand and correct some of the safety issues involved in using LLMs.
  • 26. H2O.ai Confidential Kim Montgomery LLM interpretation kim.montgomery@h2o.ai Contact Thank you!
  • 27. H2O.ai Confidential Lab 1 - Using Chain-of-Verification for Explainable AI
  • 28. H2O.ai Confidential Chain of Verification (CoVe)
    CoVe enhances the reliability of answers provided by Large Language Models, particularly in factual question-answering scenarios, by systematically verifying and refining responses to minimize inaccuracies. The CoVe method consists of four sequential steps:
    1. Initial Baseline Response Creation: an initial response to the original question is generated as a starting point.
    2. Verification Question Generation: verification questions are created to fact-check the baseline response. These questions are designed to scrutinize the accuracy of the initial response.
    3. Execute Verification: the verification questions are answered independently to minimize any potential bias. This ensures the verification process is objective and thorough.
    4. Final Refined Answer Generation: based on the results of the verification process, a final refined answer is generated. This answer is expected to be more accurate and reliable, reducing the likelihood of hallucinations in the response.
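The four CoVe steps can be sketched as a small pipeline. `fake_ask` is a hypothetical, deterministic stub standing in for a real LLM call (so the sketch runs end to end); the prompts and the printing-press example are illustrative, not the prompts from the CoVe paper.

```python
def chain_of_verification(question, ask):
    # Step 1: initial baseline response.
    baseline = ask(f"Answer briefly: {question}")
    # Step 2: plan verification questions that fact-check the baseline.
    plan = ask(f"List verification questions for: {baseline}")
    verification_qs = [q.strip() for q in plan.split("\n") if q.strip()]
    # Step 3: answer each verification question independently of the
    # baseline, to reduce bias toward repeating the same mistake.
    evidence = [ask(q) for q in verification_qs]
    # Step 4: generate the final refined answer from the evidence.
    joined = "; ".join(evidence)
    return ask(f"Question: {question}\nEvidence: {joined}\nRevised answer:")

def fake_ask(prompt):
    # Deterministic stand-in for an LLM call (hypothetical stub).
    if prompt.startswith("Answer briefly:"):
        return "Johannes Gutenberg, around 1440."
    if prompt.startswith("List verification"):
        return "Who invented the first mechanized printing press?\nIn what year?"
    if prompt.startswith("Question:"):
        return "Johannes Gutenberg invented the first mechanized printing press, around 1450."
    return "Johannes Gutenberg, 1450."

final = chain_of_verification("Who invented the mechanized printing press?", fake_ask)
```

Swapping `fake_ask` for a real model client turns the skeleton into a working CoVe loop; the key design point is that step 3 sees only the verification questions, never the baseline answer.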
  • 29. H2O.ai Confidential Verification Questions
    Questions are categorized into three main groups:
    1. Wiki Data & Wiki Category List: questions that expect answers in the form of a list of entities, e.g., “Who are some politicians born in Boston?”
    2. Multi-Span QA: questions that seek multiple independent answers, e.g., “Who invented the first mechanized printing press and in what year?” (Answer: “Johannes Gutenberg, 1450”.)
    3. Long-form Generation: any question that requires a detailed or lengthy response.
  • 30. v H2O.ai Confidential Chain of Verification (CoVe) Dhuliawala, Shehzaad, et al. "Chain-of-Verification Reduces Hallucination in Large Language Models." arXiv preprint arXiv:2309.11495 (2023)
  • 31. H2O.ai Confidential CoVe and Explainable AI (XAI)
    - Interpretability and Transparency: the verification process generates questions to fact-check baseline responses, improving transparency in decision-making.
    - Reliability and Trust: refined answers enhance accuracy, building trust and reliability in model outputs.
    - Bias and Fairness: verification questions in CoVe identify and mitigate potential biases in model output.
    - User Interaction: the verification process involves user interaction through verification questions.
  • 33. v H2O.ai Confidential Who are some CEOs of banks in the US?
  • 34. v H2O.ai Confidential What are some credit scoring bureaus in the US?
  • 35. v H2O.ai Confidential What are some agencies assigned to regulate and oversee financial institutions in the US?
  • 36. v H2O.ai Confidential Provide a list of major investment firms and financial institutions headquartered in the United States?
  • 37. H2O.ai Confidential Benefits and Limitations of CoVe
    Benefits:
    - Enhanced Reliability: by incorporating verification steps, users can trust the accuracy of information obtained from LLMs.
    - Depth of Understanding: the refinement of answers allows users to gain a deeper understanding of the topic beyond the initial response.
    - Educational Value: promotes responsible and informed use of LLMs, encouraging users to go beyond surface-level information.
    Limitations:
    - Incomplete Removal of Hallucinations: CoVe does not completely eliminate hallucinations in generated content, so it can still produce incorrect or misleading information.
    - Limited Scope of Hallucination Mitigation: CoVe primarily addresses directly stated factual inaccuracies and may not effectively handle other forms of hallucination, such as errors in reasoning or opinion.
    - Increased Computational Expense: generating and executing verification questions alongside responses adds computational cost, similar to other reasoning methods like Chain-of-Thought.
    - Upper Bound on Improvement: the effectiveness of CoVe is limited by the overall capabilities of the underlying language model, particularly its ability to identify and rectify its own mistakes.
  • 38. H2O.ai Confidential How to improve the CoVe pipeline
    - Prompt engineering
    - External tools:
      - The final output depends heavily on the answers to the verification questions.
      - For factual question answering, you can use advanced search tools such as Google Search or a SERP API.
      - For custom use cases, you can use RAG methods or other retrieval techniques to answer the verification questions.
    - More chains
    - Human in the loop
  • 39. H2O.ai Confidential Conclusions
    - CoVe aims to improve model transparency, reliability, and trust.
    - CoVe is not a silver bullet, but it can strengthen an LLM testing arsenal.
  • 40. H2O.ai Confidential Navdeep Gill Engineering Manager, AI Governance | Responsible AI navdeep.gill@h2o.ai Contact Thank you
  • 42. H2O.ai Confidential Introduction: Are LLMs smarter than humans?
    - Write a 1000-word essay in 1 minute: LLMs are good at generating large amounts of text that is consistent and logical.
    - Have LLMs manage your investment portfolio: a model can give generic advice on safe money management, but we don’t trust our life savings to a chatbot.
    - Let a bot reply to your email: it depends on how important the email is. Maybe we are more comfortable with the model automatically creating a draft.
  • 43. H2O.ai Confidential Introduction: How can we use LLMs while minimizing risk?
    - Summarization: summarizing large documents without losing essential information; extracting key-value pairs.
    - Customer Service: answering FAQs from customers; may require retrieving from a knowledge base and summarizing.
    - Report Generation (AutoDoc): creating ML interpretation documents and reports required for regulatory compliance.
  • 44. H2O.ai Confidential Risk: How risky are LLMs?
    - “A lawyer used ChatGPT to prepare a court filing. It went horribly awry. While ChatGPT can be useful to professionals in numerous industries, including the legal profession, it has proved itself to be both limited and unreliable. In this case, the AI invented court cases that didn't exist, and asserted that they were real.” (CBS News)
    - “Chevy dealership’s AI chatbot suggests Ford F-150 when asked for best truck: As an AI, I don't have personal preferences but I can provide insights based on popular opinions and reviews. Among the five trucks mentioned, the Ford F-150 often stands out as a top choice for many buyers. It's known for its impressive towing …” (Detroit Free Press)
  • 45. H2O.ai Confidential LLM Lifecycle
    - Data (Large & Diverse): to train a foundation model, you need a large, diverse dataset that covers the tasks the model should be able to perform.
    - Foundation Model (Generative AI): designed to produce a wide and general variety of outputs, such as text, image, or audio. Foundation models can be standalone systems or serve as a "base" for many other applications.
    - Fine Tuning (Supervised Fine Tuning): fine-tuning can improve a model's performance on a task while preserving its general language knowledge.
    - RAG (h2oGPTe): a powerful search assistant to answer questions from large volumes of documents, websites, and workplace content.
    - Leaderboard (HELM): HELM is a framework for evaluating foundation models. The leaderboard shows how various models perform across different groups of scenarios and different metrics.
    - Risk Management (Eval Studio): design and execute task-specific benchmarks; perform both manual and LLM-based evaluations; systematically collect and store results along with metadata.
  • 46. H2O.ai Confidential Evaluation for LLMs: popular benchmarks on open-source leaderboards
    - MMLU (Massive Multitask Language Understanding): a test of a text model's multitask accuracy, covering 57 tasks including elementary mathematics, US history, computer science, law, and more.
    - HellaSwag: a test of common-sense inference, which is easy for humans (~95%) but challenging for SOTA models.
    - AI2 Reasoning Challenge (ARC): a set of grade-school science questions.
    - TruthfulQA: a test of a model’s propensity to reproduce falsehoods commonly found online.
    MMLU example: When you drop a ball from rest it accelerates downward at 9.8 m/s². If you instead throw it downward, assuming no air resistance, its acceleration immediately after leaving your hand is (A) 9.8 m/s² (B) more than 9.8 m/s² (C) less than 9.8 m/s² (D) Cannot say unless the speed of throw is given.
    HellaSwag example: A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She… (A) rinses the bucket off with soap and blow-dries the dog’s head. (B) uses a hose to keep it from getting soapy. (C) gets the dog wet, then it runs away again. (D) gets into a bath tub with the dog.
  • 47. H2O.ai Confidential Hugging Face Open LLM Leaderboard: approaching the human baseline
    A popular place to track various models evaluated using different metrics. These metrics include human baselines, which give some idea of how drastically these models have improved over the last two years.
  • 48. H2O.ai Confidential The Need for Evaluation: popular leaderboards are not enough
    - Benchmarks are not task specific: benchmarks on open-source leaderboards are well-rounded and diverse, but they are not sufficient to reflect a model's performance in a domain-specific scenario.
    - Some model entries may cheat: there can be models on the leaderboard that were trained on the benchmark data itself, and we do not have robust enough tests to detect this.
    - Non-verifiable results: the procedure followed in conducting the tests, and the results, are not completely transparent and can also vary among different leaderboards.
  • 49. H2O.ai Confidential Custom Test Sets: create custom benchmarks for domain-specific scenarios
    - Task-Specific Evals: create task-specific QA pairs along with the reference documents (Bank Teller, Loan Officer, Program Manager, Data Analyst).
    - Test for Alignment: create QA pairs that test for agreement with your values, intentions, and preferences (correctness, relevance, similarity, hallucination, precision, recall, faithfulness).
    - Test for Safety: test that all outputs meet your safety levels (toxicity, bias, offensive content, customer PII, company secrets).
    - Test for Compliance: tests to confirm or show proof of meeting compliance standards (government, company).
  • 50. H2O.ai Confidential H2O Eval Studio: design and execute task-specific benchmarks
    - All the evaluators are included: Eval Studio contains evaluators to check for alignment, safety, and compliance, as discussed before.
    - Create custom benchmarks: users can upload documents and create custom tests (question-answer pairs) based on the document collection.
    - Run evals and visualize results: once a benchmark has been designed, users can run the evaluation against it and visualize the results. A detailed report can also be downloaded.
  • 51. H2O.ai Confidential Srinivas Neppalli Senior AI Engineer srinivas.neppalli@h2o.ai Contact Thank you!
  • 52. H2O.ai Confidential Lab 2 - Experimental Design of Gen AI Evaluations
  • 53. H2O.ai Confidential Through the Lens of Model Risk Management
    One possible definition of “Conceptual Soundness” for LLMs by themselves might be a combination of the following choices:
    (1) Training Data
    (2) Model Architecture
    (3) An explanation of why choices (1) and (2) were made
    (4) An explanation of why choices (1) and (2) are reasonable for the use case that the LLM will be applied to.
  • 54. H2O.ai Confidential Through the Lens of Model Risk Management
    What about a RAG system? How does “Conceptual Soundness” apply when there are not only choices surrounding training data and model architecture, but also choices around:
    - Embeddings
    - System Prompts (e.g. Personalities)
    - Chunk Sizes
    - Chunking Strategies
    - OCR Techniques
    - RAG type (e.g. Hypothetical Document Embeddings)
    - Mixture-of-Experts or Ensembling
  • 55. H2O.ai Confidential Models / Systems / Agents are the fundamental AI systems under scrutiny. As opposed to traditional machine learning models, Generative AI systems involve many choices beyond the models themselves.
  • 56. H2O.ai Confidential Benchmarks / Tests are the sets of prompts and responses that are used to gauge how well an AI system can perform a certain task or use case.
  • 57. H2O.ai Confidential Evaluators are the mathematical functions used to evaluate various dimensions of performance.
  • 58. H2O.ai Confidential Documents are the data sets used for evaluation in the case of RAG systems, combining models, parsing, OCR, chunking, embeddings and other components of an evaluation.
  • 59. H2O.ai Confidential Designing Your Own Eval
    What is the primary unit of analysis when evaluating an AI system or model? An eval can be defined as a series of tuples, each of size 3. Each tuple consists of:
    (1) Context / Prompt / Question
    (2) Output / Response / Ground-Truth Answer
    (3) Document (in the case of RAG)
    Source: https://www.jobtestprep.com/bank-teller-sample-questions
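The three-part eval tuple can be represented directly in code. The example prompts below come from the slides; the `EvalCase` structure itself is an illustrative sketch, not the actual Eval Studio schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalCase:
    """One eval tuple: context/prompt, ground-truth response, and an
    optional document (None for LLM-only evals)."""
    prompt: str
    response: str
    document: Optional[str] = None

llm_only_case = EvalCase(
    prompt="How should banks approach model development?",
    response="With a focus on sound risk management practices.",
)
rag_case = EvalCase(
    prompt="How often should a bank update its model inventory?",
    response="Regularly, so that it remains current and accurate.",
    document="SR 11-7",
)
benchmark = [llm_only_case, rag_case]
```

A benchmark is then just a list of these cases; the `document` field distinguishes an LLM-only test from a RAG test over a document collection.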
  • 60. H2O.ai Confidential Designing Your Own Eval - BankTellerGPT
    Problem statement: How well does my Bank Teller AI Application correctly answer questions related to being a Bank Teller? Create an eval test case that can be used to evaluate how well BankTellerGPT can answer questions related to being a Bank Teller.
    LLM-only example test case:
    {
      Prompt: Respond to the following question with a single-letter answer. Question: A specific bank branch serves 256 clients on average every day. The ratio between tellers and clients is 1:32, so that every teller serves 32 people on average every day. The management wishes to change this ratio to 1:20. How many new tellers should be hired? A. 4 B. 5 C. 9 D. 12,
      Response: B. 5,
      Document: None
    }
    Source: https://www.jobtestprep.com/bank-teller-sample-questions
  • 61. v H2O.ai Confidential Designing Your Own Eval - BankTellerGPT Problem statement: How well does my Bank Teller AI Application actually answer questions related to being a Bank Teller? Create an eval test case that can be used to evaluate how well BankTellerGPT can answer questions related to being a Bank Teller. RAG Example Test Case { Prompt: Respond to the following question with a single-letter answer. Question: A specific bank branch serves 256 clients on average every day. The ratio between tellers and clients is 1:32, so that every teller serves 32 people on average every day. The management wishes to change this ratio to 1:20. How many new tellers should be hired? A. 4 B. 5 C. 9 D. 12, Response: B. 5, Document: “Internal Bank Teller Knowledge Base” } Source: https://www.jobtestprep.com/bank-teller-sample-questions
  • 62. v H2O.ai Confidential Designing Your Own Eval Task # 1: Create your own GenAI Test Benchmark for the SR 11-7 document Some possible test cases Prompt: How should banks approach model development? Response: Banks should approach model development with a focus on sound risk management practices. They should ensure that models are developed and used in a controlled environment, with proper documentation, testing, and validation. Prompt: How can model risk be reduced? Response: Model risk can be reduced by establishing limits on model use, monitoring model performance, adjusting or revising models over time, and supplementing model results with other analysis and information. Prompt: How often should a bank update its model inventory? Response: A bank should update its model inventory regularly to ensure that it remains current and accurate.
  • 63. v H2O.ai Confidential Designing Your Own Eval - BankTellerGPT Task # 2: Create and launch an LLM-only eval leaderboard To complete this, you will need to: 1. Pick an evaluator (e.g. Token presence) 2. Pick a connection (e.g. Enterprise h2oGPT - LLM Only) 3. Pick a set of eval tests (e.g. Bank Teller Benchmark)
  • 64. v H2O.ai Confidential Designing Your Own Eval - SR 11-7 Task # 3: Create a new evaluator based on RAG and launch a leaderboard To complete this, you will need to: 1. Pick an evaluator (e.g. Answer correctness) 2. Pick a connection (e.g. Enterprise h2oGPT-RAG) 3. Pick the test benchmark you created in Task # 1
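Programmatically, the three picks amount to wiring an evaluator, a connection and a test set into a single scoring loop. This is a schematic sketch with stub components, not the EvalStudio API:

```python
def run_leaderboard(evaluator, connection, test_cases):
    """Score each test case: send the prompt through the connection
    (LLM-only or RAG), then score the answer against the expectation."""
    results = []
    for case in test_cases:
        answer = connection(case["prompt"], case.get("document"))
        results.append({"prompt": case["prompt"],
                        "score": evaluator(answer, case["expected"])})
    return results

# Stub connection and exact-match evaluator, purely for illustration
stub_connection = lambda prompt, doc=None: "B. 5"
exact_match = lambda answer, expected: float(answer.strip() == expected)

cases = [{"prompt": "How many new tellers should be hired?",
          "expected": "B. 5"}]
```

Swapping the stub connection for a RAG endpoint, or the exact-match function for a semantic evaluator, changes the leaderboard without touching the loop.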
  • 65. v H2O.ai Confidential Evaluators
H2O EvalStudio evaluators overview. TERMINOLOGY: answer ~ actual RAG/LLM answer | expected answer ~ expected RAG/LLM answer, i.e. ground truth | retrieved context ~ text chunks retrieved from the vector DB prior to LLM answer generation in RAG.
- PII (privacy) [RAG: Yes, LLM: Yes]. Purpose: assess whether the answer contains personally identifiable information (PII) such as credit card numbers, phone numbers, social security numbers, street addresses, email addresses and employee names. Method: a regex suite which quickly and reliably detects formatted PII: credit card numbers, social security numbers (SSN) and emails.
- Sensitive data (security) [RAG: Yes, LLM: Yes]. Purpose: assess whether the answer contains security-related information like activation keys, passwords, API keys, tokens or certificates. Method: a regex suite which quickly and reliably detects formatted sensitive data: certificates (SSL/TLS certs in PEM format), API keys (H2O.ai and OpenAI) and activation keys (Windows).
- Answer Correctness [RAG: Yes, LLM: Yes]. Purpose: assess whether the answer is correct given the expected answer (ground truth). Method: a score based on combined and weighted semantic and factual similarity between the answer and ground truth (see Answer Similarity and Faithfulness below).
- Answer Relevance [RAG: Yes, LLM: Yes]. Purpose: assess whether the answer is complete and free of redundant information that was not asked for (noise). Method: a score based on the cosine similarity of the question and generated questions, where generated questions are created by prompting an LLM to generate questions from the actual answer.
- Answer Similarity [RAG: Yes, LLM: Yes]. Purpose: assess semantic similarity of the answer and expected answer. Method: a score based on the similarity of the actual and expected answer calculated by a cross-encoder model (NLP).
- Context Precision [RAG: Yes, LLM: No]. Purpose: assess the quality of the retrieved context, considering the order and relevance of the text chunks on the context stack. Method: a score based on the presence of the expected answer (ground truth) in the text chunks at the top of the retrieved context chunk stack; relevant chunks deep in the stack, irrelevant chunks and an unnecessarily big context lower the score.
- Context Recall [RAG: Yes, LLM: No]. Purpose: assess how much of the ground truth is represented in the retrieved context. Method: a score based on the ratio of the number of sentences in the ground truth that can be attributed to the context to the total number of sentences in the ground truth.
- Context Relevance [RAG: Yes, LLM: No]. Purpose: assess whether the context is complete and free of redundant information that is not needed (noise). Method: a score based on the ratio of context sentences needed to generate the answer to the total number of sentences in the retrieved context.
  • 66. v H2O.ai Confidential Evaluators (continued)
- Faithfulness [RAG: Yes, LLM: No]. Purpose: assess whether the answer's claims can be inferred from the context, i.e. the factual consistency of the answer given the context (hallucinations). Method: a score based on the ratio of the answer's claims that are present in the context to the total number of the answer's claims.
- Hallucination Metric [RAG: Yes, LLM: No]. Purpose: assess the RAG's base LLM hallucination. Method: a score based on the Vectara hallucination evaluation cross-encoder model, which assesses the base LLM's hallucination when it generates the actual answer from the retrieved context.
- RAGAs [RAG: Yes, LLM: No]. Purpose: assess overall answer quality considering both context and answer. Method: a composite score which is the harmonic mean of the Faithfulness, Answer Relevance, Context Precision and Context Recall metrics.
- Tokens Presence [RAG: Yes, LLM: Yes]. Purpose: assess whether both the retrieved context and the answer contain required string tokens. Method: a score based on substring and/or regular-expression search for the required set of strings in the retrieved context and answer.
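Two of these evaluators can be sketched in a few lines. The sketch below approximates the ideas only — substring/regex search for Tokens Presence and a naive sentence-attribution ratio for Context Recall — and is not the H2O EvalStudio implementation:

```python
import re

def tokens_presence(answer: str, required: list[str],
                    patterns: tuple[str, ...] = ()) -> bool:
    """Pass only if every required substring (case-insensitive) and
    every regex pattern is found in the answer."""
    text = answer.lower()
    if not all(tok.lower() in text for tok in required):
        return False
    return all(re.search(p, answer) for p in patterns)

def context_recall(ground_truth_sentences: list[str], context: str) -> float:
    """Ratio of ground-truth sentences attributable to the retrieved context.
    'Attributable' is crudely approximated here by substring match."""
    if not ground_truth_sentences:
        return 0.0
    hits = sum(s in context for s in ground_truth_sentences)
    return hits / len(ground_truth_sentences)
```

For example, `tokens_presence("The answer is B. 5", ["b. 5"])` passes, while a context containing only one of two ground-truth sentences scores a recall of 0.5.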
  • 68. v H2O.ai Confidential ● LLM Guardrails are a set of predefined constraints and guidelines that are applied to LLMs to manage their behavior. ● Guardrails serve to ensure responsible, ethical, and safe usage of LLMs, mitigate potential risks, and promote transparency and accountability. ● Guardrails are a form of proactive control and oversight over the output and behavior of language models, which are otherwise capable of generating diverse content, including text that may be biased, inappropriate, or harmful. Understanding the distinct functions of each type of guardrail is pivotal in creating a comprehensive and effective strategy for governing AI systems. Guardrails
  • 69. v H2O.ai Confidential ● Content Filter Guardrails: Content filtering is crucial to prevent harmful, offensive, or inappropriate content from being generated by LLMs. These guardrails help ensure that the outputs conform to community guidelines, curbing hate speech, explicit content, and misinformation. ● Bias Mitigation Guardrails: Bias is an ongoing concern in AI, and mitigating bias is critical. These guardrails aim to reduce the model's inclination to produce content that perpetuates stereotypes or discriminates against particular groups. They work to promote fairness and inclusivity in the model's responses. ● Safety and Privacy Guardrails: Protecting user privacy is paramount. Safety and privacy guardrails are designed to prevent the generation of content that may infringe on user privacy or include sensitive, personal information. These measures safeguard users against unintended data exposure. Types of Guardrails
  • 70. v H2O.ai Confidential Types of Guardrails ● Fact-Checking & Hallucination Guardrails: To combat misinformation, fact-checking guardrails are used to verify the accuracy of the information generated by LLMs. They help ensure that the model's responses align with factual accuracy, especially in contexts like news reporting or educational content. ● Context/Topic and User Intent Guardrails: For LLMs to be effective, they must produce responses that are contextually relevant and aligned with user intent. These guardrails aim to prevent instances where the model generates content that is unrelated or fails to address the user's queries effectively. ● Explainability and Transparency Guardrails: In the pursuit of making LLMs more interpretable, these guardrails require the model to provide explanations for its responses. This promotes transparency by helping users understand why a particular output was generated, fostering trust and accountability. ● Jailbreak Guardrails: Ensure robustness to malicious user attacks such as prompt injection.
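As an illustration of a content/privacy guardrail, the sketch below redacts a few formatted PII patterns from model output before it reaches the user. The patterns are deliberately simplistic stand-ins, not a production-grade detection suite:

```python
import re

# Illustrative patterns only; real guardrail suites are far more robust.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def apply_pii_guardrail(text: str) -> str:
    """Redact matched PII from an LLM response before returning it."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text
```

In a deployed system a filter like this would sit between the model and the user, alongside bias, fact-checking and jailbreak guardrails.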
  • 71. H2O.ai Confidential Lab 3 - Hacking and Security Posture
  • 73. H2O.ai Confidential Ashrith Barthur Principal Data Scientist ashrith.barthur@h2o.ai Contact Thank you!
  • 76. H2O.ai Confidential Lab 4 - Complaint Summarizer
  • 78. H2O.ai Confidential Determine which single credit product has the highest number of complaints. Task 1 Complaint Summarizer Applied Generative AI for Banking
  • 79. H2O.ai Confidential Determine which single credit product has the highest number of complaints. Answer: Credit Reporting Task 1 Complaint Summarizer Applied Generative AI for Banking
  • 80. H2O.ai Confidential Determine the top complaint for TransUnion. Task 2 Complaint Summarizer Applied Generative AI for Banking
  • 81. H2O.ai Confidential Determine the top complaint for TransUnion. Answer: Violation of Consumers' Rights to Privacy and Confidentiality Under the Fair Credit Reporting Act. Task 2 Complaint Summarizer Applied Generative AI for Banking
  • 82. H2O.ai Confidential Use h2oGPT to summarize a complaint from the database and provide immediate next steps. Task 3 Complaint Summarizer Applied Generative AI for Banking
  • 83. H2O.ai Confidential Use h2oGPT to summarize a complaint from the database and provide immediate next steps. Answer: [See screenshot] Task 3 Complaint Summarizer Applied Generative AI for Banking
  • 86. H2O.ai Confidential Contact Thank you! MAKERS Jonathan Farland Director of Solution Engineering jon.farland@h2o.ai www.linkedin.com/in/jonfarland/
  • 88. H2O.ai Confidential Retrieval-Augmented Generation (RAG) RAG as a system is a particularly good use of vector databases. RAG systems take advantage of the context window of LLMs, filling it with only the most relevant examples from real data. This “grounds” the LLM in relevant context and greatly reduces hallucination.
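The retrieval step of a RAG system can be sketched with plain cosine similarity over pre-computed chunk embeddings. The vectors are assumed to come from any embedding model; function names are illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray,
             chunk_vecs: list,
             chunks: list,
             k: int = 3) -> list:
    """Return the k chunks most similar to the query - these fill the
    LLM context window and ground the generation in real data."""
    scores = [cosine(query_vec, v) for v in chunk_vecs]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]
```

A full RAG pipeline then concatenates the retrieved chunks into the prompt ahead of the user's question before calling the LLM.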
  • 89. Embedding Models Source: https://huggingface.co/blog/1b-sentence-embeddings
  • 90. Embedding Models - INSTRUCTOR Source: https://arxiv.org/pdf/2212.09741.pdf Instruction-based Omnifarious Representations The model is trained to generate embeddings using both the instruction and the textual input. Applicable to virtually every use case, due to its ability to create latent vector representations conditioned on the instruction.
  • 91. Embedding Models - BGE Source: https://arxiv.org/pdf/2310.07554.pdf LLM-Embedder This embedding model is trained specifically for use with RAG systems. A reward model is introduced that provides higher rewards to a retrieval candidate if it results in a higher generation likelihood for the expected output. Contrastive learning is used to optimize directly for RAG applications.
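The contrastive objective mentioned above can be sketched as a generic InfoNCE-style loss, which pulls a query embedding toward its positive chunk and away from negatives. This is a standard textbook formulation for illustration, not the LLM-Embedder training code:

```python
import numpy as np

def info_nce(query: np.ndarray, positive: np.ndarray,
             negatives: list, tau: float = 0.05) -> float:
    """InfoNCE contrastive loss: the negative log probability that the
    positive chunk ranks first among all candidates. Lower is better."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(query, positive)] +
                      [sim(query, n) for n in negatives]) / tau
    # softmax probability assigned to the positive candidate
    p_pos = np.exp(logits[0]) / np.exp(logits).sum()
    return float(-np.log(p_pos))
```

A query aligned with its positive chunk yields a near-zero loss, while a misaligned pair yields a large one, which is the gradient signal the retriever trains on.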
  • 92. H2O.ai Confidential AI Engines Deployment Consumption LLM AppStudio LLM DataStudio LLM EvalStudio H2O LLMs Ecosystem AppStore End Users Generative AI with H2O.ai MLOps AI Engine Manager Doc-QA Enterprise h2oGPT
  • 93. H2O.ai Confidential Explanations: highlighting the most important regions of an image (vision); highlighting the most important words (text).
  • 95. H2O.ai Confidential LLM Context Length Testing Source: https://github.com/gkamradt/LLMTest_NeedleInAHaystack?tab=readme-ov-file
  • 97. H2O.ai Confidential Responsible Generative AI
- Ethical Considerations, Data Privacy, and User Consent: Assess the potential impact of generative AI on individuals and society. Give users control over how their data is used by generative AI. Consent mechanisms should be transparent and user-friendly.
- Monitoring, Regulation, and Security: Detect misuse or anomalies in generative AI behavior. Regulatory compliance ensures adherence to ethical and legal guidelines. Security measures are crucial to protect AI models from adversarial attacks or unauthorized access.
- Accountability and Oversight: Define roles and responsibilities for AI development and deployment. Oversight mechanisms ensure that responsible practices are followed.
- Education and Awareness: Users and developers should be informed about generative AI capabilities, limitations, and ethical considerations.
- Stakeholder Involvement: Involving various stakeholders in AI discussions promotes diverse perspectives and responsible decision-making.
- Continuous Evaluation and Improvement: Continually assess models to ensure fairness, accuracy, and alignment with ethical standards.
- Transparency, Explainability, Bias Mitigation, Debugging, and Guardrails: Recognize and mitigate both subtle and glaring biases that may emerge from training data. Ensure that users can understand and trust the decisions made by generative AI models. Debug models with techniques such as adversarial prompt engineering. Proactively manage risks and maintain control over the model's behavior with guardrails.
- Audit Input Data, Benchmarks, and Test the Unknown: Assess the quality of data used to train generative AI models. Utilize benchmarks and random attacks for testing.
  • 98. v H2O.ai Confidential The CoVe method reduces factual errors in large language models by drafting, fact-checking, and verifying responses: the model deliberates on its own responses and self-corrects them. Steps: 1. Given a user query, an LLM generates a baseline response that may contain inaccuracies, e.g. factual hallucinations 2. To improve this, CoVe first generates a plan as a set of verification questions to ask, and then executes that plan by answering them and checking for agreement 3. Individual verification questions are typically answered with higher accuracy than the facts in the original longform generation 4. Finally, the revised response takes the verifications into account 5. The factored version of CoVe answers verification questions such that they cannot condition on the original response, avoiding repetition and improving performance Minimizing Model Hallucinations Using Chain-of-Verification (CoVe) Method
  • 99. v H2O.ai Confidential Minimizing Model Hallucinations Using Chain-of-Verification (CoVe) Method
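The five CoVe steps can be sketched as a pipeline around a generic `llm(prompt)` callable, a hypothetical stand-in for any chat-model API; the prompts here are illustrative, not those from the paper:

```python
def chain_of_verification(llm, query: str) -> str:
    """Factored CoVe sketch: draft, plan checks, verify independently, revise."""
    # 1. Baseline response (may contain factual hallucinations)
    baseline = llm(f"Answer the question:\n{query}")

    # 2. Plan a set of verification questions from the draft
    plan = llm("List fact-checking questions, one per line, "
               f"for this answer:\n{baseline}")
    questions = [q for q in plan.splitlines() if q.strip()]

    # 3 & 5. Answer each question WITHOUT showing the original draft
    # (the 'factored' variant, so answers cannot condition on its errors)
    verifications = [f"Q: {q}\nA: {llm(q)}" for q in questions]

    # 4. Revise the baseline in light of the verification answers
    return llm("Revise the answer using these verified facts.\n"
               f"Question: {query}\nDraft: {baseline}\n"
               + "\n".join(verifications))
```

With one verification question, this costs four LLM calls per query, the price paid for the reduction in factual errors.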