LLM Evaluations

Project description

arize-phoenix-evals

Phoenix provides tooling to evaluate LLM applications, including tools to determine whether documents retrieved by a retrieval-augmented generation (RAG) application are relevant, whether a response is toxic, and more.

Phoenix's approach to LLM evals is notable for the following reasons:

Includes pre-tested templates and convenience functions for a set of common Eval “tasks”
Data science rigor applied to the testing of model and template combinations
Designed to run as fast as possible on batches of data
Includes benchmark datasets and tests for each eval function

Installation

Install the arize-phoenix-evals sub-package via pip:

pip install arize-phoenix-evals

You will also need to install the SDK of the LLM vendor you want to use with LLM Evals. For example, to use OpenAI's GPT-4, install the OpenAI Python SDK:

pip install 'openai>=1.0.0'

Usage

Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers:

import os
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, ConfusionMatrixDisplay

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Download the benchmark golden dataset
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
# Sample and re-name the columns to match the template
df = df.sample(100)
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)
model = OpenAIModel(
    model="gpt-4",
    temperature=0.0,
)


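# The rails are the allowed output labels for this template
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
# llm_classify returns a DataFrame; keep its "label" column as the predicted class
df["eval_relevance"] = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, rails)["label"]
# The golden dataset stores ground truth as True/False in the "relevant" column.
# Map True -> "relevant" and False -> "irrelevant" so it has the same format as
# the template output and can be compared with scikit-learn.
y_true = df["relevant"].map({True: "relevant", False: "irrelevant"})
y_pred = df["eval_relevance"]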

# Compute Per-Class Precision, Recall, F1 Score, Support
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
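
With the default arguments, precision_recall_fscore_support returns one value per class, ordered by sorted label, so if only the two rail labels appear, index 0 is "irrelevant" and index 1 is "relevant". To make the meaning of those numbers concrete, here is a minimal sketch that computes the same per-class quantities by hand on made-up labels. The helper name per_class_scores and the toy data are assumptions for illustration only; they are not part of Phoenix or scikit-learn.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Hypothetical helper: per-class precision, recall, F1 and support."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class received a wrong example
            fn[truth] += 1   # true class missed one of its examples

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]  # times the label was predicted
        actual = tp[label] + fn[label]     # times the label is the truth
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return scores


# Made-up labels standing in for the golden labels and the eval output
toy_true = ["relevant", "relevant", "relevant", "irrelevant", "irrelevant", "irrelevant"]
toy_pred = ["relevant", "relevant", "irrelevant", "irrelevant", "irrelevant", "irrelevant"]

toy_scores = per_class_scores(toy_true, toy_pred)
for label, values in toy_scores.items():
    print(label, values)
```

In this toy data one relevant document is labeled irrelevant, so "relevant" has perfect precision but lower recall, while "irrelevant" has perfect recall but lower precision. Reading both classes side by side shows which kind of mistake the eval makes.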

To learn more about LLM Evals, see the LLM Evals documentation.

Release history

Version	Release date
0.9.2	May 21, 2024
0.9.1	May 21, 2024
0.9.0	May 17, 2024
0.8.2	May 14, 2024
0.8.1	May 4, 2024
0.8.0	Apr 22, 2024
0.7.0	Apr 13, 2024
0.6.1	Apr 4, 2024
0.6.0	Mar 29, 2024
0.5.0	Mar 20, 2024
0.4.0	Mar 20, 2024
0.3.1	Mar 16, 2024
0.3.0	Mar 13, 2024
0.2.0	Mar 7, 2024
0.1.0	Mar 5, 2024
0.0.5	Feb 24, 2024
0.0.4	Feb 24, 2024
0.0.3	Feb 23, 2024
0.0.2	Feb 23, 2024

Download files

Source Distribution

arize_phoenix_evals-0.9.2.tar.gz (35.7 kB)

Uploaded May 21, 2024 Source

Built Distribution

arize_phoenix_evals-0.9.2-py3-none-any.whl (47.4 kB)

Uploaded May 21, 2024 Python 3

Hashes for arize_phoenix_evals-0.9.2.tar.gz
Algorithm	Hash digest
SHA256	`8d96952a1395cda9c5464412d82049f15b5b4281005556b2978dc69be48af3a7`
MD5	`afe681f1ff85f58d3b1d9b8c46da1542`
BLAKE2b-256	`afd1b129d8e8acef104692a341da3c4918c84ae4d5242baab6e00e6e4336a28c`

Hashes for arize_phoenix_evals-0.9.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`20c8db2ec20ac83a9fe958a6d6f3c7ec0c79a6508296b2701fd04fe32d2eb1d3`
MD5	`3ebe5947f029e8f29772502604d17042`
BLAKE2b-256	`fe39fd88ea634b60ac0f5345d58edd463f609fb91bfd3cddbd56d985c05a273f`