Amazon SageMaker Clarify makes it easier to evaluate and select foundation models (preview)

November 29, 2023 Antje Barth

I’m happy to share that Amazon SageMaker Clarify now supports foundation model (FM) evaluation (preview). As a data scientist or machine learning (ML) engineer, you can now use SageMaker Clarify to evaluate, compare, and select FMs in minutes based on metrics such as accuracy, robustness, creativity, factual knowledge, bias, and toxicity. This new capability adds to SageMaker Clarify’s existing ability to detect bias in ML data and models and explain model predictions.

The new capability provides both automatic and human-in-the-loop evaluations for large language models (LLMs) anywhere, including LLMs available in SageMaker JumpStart, as well as models trained and hosted outside of AWS. This removes the heavy lifting of finding the right model evaluation tools and integrating them into your development environment. It also simplifies the complexity of trying to adopt academic benchmarks to your generative artificial intelligence (AI) use case.

Evaluate FMs with SageMaker Clarify
With SageMaker Clarify, you now have a single place to evaluate and compare any LLM based on predefined criteria during model selection and throughout the model customization workflow. In addition to automatic evaluation, you can also use the human-in-the-loop capabilities to set up human reviews for more subjective criteria, such as helpfulness, creative intent, and style, by using your own workforce or managed workforce from SageMaker Ground Truth.

To get started with model evaluations, you can use curated prompt datasets that are purpose-built for common LLM tasks, including open-ended text generation, text summarization, question answering (Q&A), and classification. You can also extend the model evaluation with your own custom prompt datasets and metrics for your specific use case. Human-in-the-loop evaluations can be used for any task and evaluation metric. After each evaluation job, you receive an evaluation report that summarizes the results in natural language and includes visualizations and examples. You can download all metrics and reports and also integrate model evaluations into SageMaker MLOps workflows.

In SageMaker Studio, you can find Model evaluation under Jobs in the left menu. You can also select Evaluate directly from the model details page of any LLM in SageMaker JumpStart.

Evaluate foundation models with Amazon SageMaker Clarify

Select Evaluate a model to set up the evaluation job. The UI wizard will guide you through the selection of automatic or human evaluation, model(s), relevant tasks, metrics, prompt datasets, and review teams.

Evaluate foundation models with Amazon SageMaker Clarify

Once the model evaluation job is complete, you can view the results in the evaluation report.

Evaluate foundation models with Amazon SageMaker Clarify

In addition to the UI, you can also start with example Jupyter notebooks that walk you through step-by-step instructions on how to programmatically run model evaluation in SageMaker.

Evaluate models anywhere with the FMEval open source library
To run model evaluation anywhere, including models trained and hosted outside of AWS, use the FMEval open source library. The following example demonstrates how to use the library to evaluate a custom model by extending the ModelRunner class.

For this demo, I choose GPT-2 from the Hugging Face model hub and define a custom HFModelConfig and HuggingFaceCausalLLMModelRunner class that works with causal decoder-only models from the Hugging Face model hub such as GPT-2. The example is also available in the FMEval GitHub repo.

!pip install fmeval

# ModelRunners invoke FMs
from amazon_fmeval.model_runners.model_runner import ModelRunner

# Additional imports for custom model
import warnings
from dataclasses import dataclass
from typing import Tuple, Optional
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@dataclass
class HFModelConfig:
    model_name: str
    max_new_tokens: int
    normalize_probabilities: bool = False
    seed: int = 0
    remove_prompt_from_generated_text: bool = True

class HuggingFaceCausalLLMModelRunner(ModelRunner):
    def __init__(self, model_config: HFModelConfig):
        self.config = model_config
        self.model = AutoModelForCausalLM.from_pretrained(self.config.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_name)

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        input_ids = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        generations = self.model.generate(
            **input_ids,
            max_new_tokens=self.config.max_new_tokens,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        generation_contains_input = (
            input_ids["input_ids"][0] == generations[0][: input_ids["input_ids"].shape[1]]
        ).all()
        if self.config.remove_prompt_from_generated_text and not generation_contains_input:
            warnings.warn(
                "Your model does not return the prompt as part of its generations. "
                "`remove_prompt_from_generated_text` does nothing."
            )
        if self.config.remove_prompt_from_generated_text and generation_contains_input:
            output = self.tokenizer.batch_decode(generations[:, input_ids["input_ids"].shape[1] :])[0]
        else:
            output = self.tokenizer.batch_decode(generations, skip_special_tokens=True)[0]

        with torch.inference_mode():
            input_ids = self.tokenizer(self.tokenizer.bos_token + prompt, return_tensors="pt")["input_ids"]
            model_output = self.model(input_ids, labels=input_ids)
            probability = -model_output[0].item()

        return output, probability

Next, create an instance of HFModelConfig and HuggingFaceCausalLLMModelRunner with the model information.

hf_config = HFModelConfig(model_name="gpt2", max_new_tokens=32)
model = HuggingFaceCausalLLMModelRunner(model_config=hf_config)

Then, select and configure the evaluation algorithm.

# Let's evaluate the FM for FactualKnowledge
from amazon_fmeval.fmeval import get_eval_algorithm
from amazon_fmeval.eval_algorithms.factual_knowledge import FactualKnowledgeConfig

eval_algorithm_config = FactualKnowledgeConfig("<OR>")
eval_algorithm = get_eval_algorithm("factual_knowledge", eval_algorithm_config)

Let’s first test with one sample. The evaluation score is the percentage of factually correct responses.

model_output = model.predict("London is the capital of")[0]
print(model_output)

eval_algo.evaluate_sample(
    target_output="UK<OR>England<OR>United Kingdom", 
	model_output=model_output
)
the UK, and the UK is the largest producer of food in the world.

The UK is the world's largest producer of food in the world.
[EvalScore(name='factual_knowledge', value=1)]

Although it’s not a perfect response, it includes “UK.”

Next, you can evaluate the FM using built-in datasets or define your custom dataset. If you want to use a custom evaluation dataset, create an instance of DataConfig:

config = DataConfig(
    dataset_name="my_custom_dataset",
    dataset_uri="dataset.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer",
)

eval_output = eval_algorithm.evaluate(
    model=model, 
    dataset_config=config, 
    prompt_template="$feature", #$feature is replaced by the input value in the dataset 
    save=True
)

The evaluation results will return a combined evaluation score across the dataset and detailed results for each model input stored in a local output path.

Join the preview
FM evaluation with Amazon SageMaker Clarify is available today in public preview in AWS Regions US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland). The FMEval open source library is available on GitHub. To learn more, visit Amazon SageMaker Clarify.

Get started
Log in to the AWS Management Console and start evaluating your FMs with SageMaker Clarify today!

— Antje

Previous Article
Introducing highly durable Amazon OpenSearch Service clusters with 30% price/performance improvement
Introducing highly durable Amazon OpenSearch Service clusters with 30% price/performance improvement

You can use the new OR1 instances to create Amazon OpenSearch Service clusters that use Amazon Simple Stora...

Next Article
Evaluate, compare, and select the best foundation models for your use case in Amazon Bedrock (preview)
Evaluate, compare, and select the best foundation models for your use case in Amazon Bedrock (preview)

I’m happy to share that you can now evaluate, compare, and select the best foundation models (FMs) for your...