Sotopia Evaluation Module

Overview

This module provides various classes and methods to evaluate social interactions in the Sotopia environment, assessing multiple dimensions such as believability, relationship, knowledge, secret, social rules, financial and material benefits, and goal achievement. The evaluations can be synchronous or asynchronous, and aggregate responses to provide a comprehensive summary.

Classes

SotopiaDimensions

This class represents the social dimensions used in the Sotopia paper (ICLR 2024).

Attributes

believability: Tuple containing reasoning (str) and score (int).
relationship: Tuple containing reasoning (str) and score (int).
knowledge: Tuple containing reasoning (str) and score (int).
secret: Tuple containing reasoning (str) and score (int).
social_rules: Tuple containing reasoning (str) and score (int).
financial_and_material_benefits: Tuple containing reasoning (str) and score (int).
goal: Tuple containing reasoning (str) and score (int).

Validators

zero_to_ten_validator: Ensures the score is between 0 and 10.
minus_five_to_five_validator: Ensures the score is between -5 and 5.
minus_ten_to_zero_validator: Ensures the score is between -10 and 0.

SotopiaDimensionsPlus

An updated version of SotopiaDimensions with more detailed instructions for each dimension.

GoalDimension

This class evaluates only the goal achievement.

Attributes

goal: Tuple containing reasoning (str) and score (int).

Validators

zero_to_ten_validator: Ensures the score is between 0 and 10.

EvaluationForTwoAgents

A generic class to evaluate two agents simultaneously.

Attributes

agent_1_evaluation: Evaluation results for agent 1.
agent_2_evaluation: Evaluation results for agent 2.

Evaluator

Abstract base class for evaluators.

Methods

call: Abstract method to perform evaluation.
acall: Abstract method to perform asynchronous evaluation.

RuleBasedTerminatedEvaluator

This class evaluates conversations based on rule-based criteria for termination.

Attributes

max_turn_number: Maximum number of turns before termination.
max_stale_turn: Maximum number of stale turns before termination.

Methods

call: Performs the evaluation and returns termination status.
acall: Asynchronous version of the __call__ method.

ReachGoalLLMEvaluator

This class evaluates goal achievement using a language model.

Attributes

model_name: Name of the language model.
response_format_class: Class type for the evaluation response format.

Methods

call: Not implemented for synchronous evaluation.
acall: Asynchronous evaluation method using a language model.

Functions

_reduce

Reduces a list of responses by averaging the scores and aggregating comments.

Parameters

responses_per_reducer: List of tuples containing response and reasoning.

Returns

Tuple containing reduced dictionary of scores and aggregated comments.

unweighted_aggregate_evaluate

Aggregates responses from the environment.

Parameters

responses: List of responses from the environment.

Returns

An instance of ScriptEnvironmentResponse.

Usage Examples

import logging
from sotopia.evaluators import (
    SotopiaDimensions,
    SotopiaDimensionsPlus,
    GoalDimension,
    EvaluationForTwoAgents,
    RuleBasedTerminatedEvaluator,
    ReachGoalLLMEvaluator,
    unweighted_aggregate_evaluate
)
 
log = logging.getLogger("evaluators")
 
# Example 1: Creating an instance of SotopiaDimensions
dimensions = SotopiaDimensions(
    believability=("Agent interacts naturally.", 8),
    relationship=("Relationship improved.", 3),
    knowledge=("Gained new knowledge.", 7),
    secret=("No secrets revealed.", 0),
    social_rules=("No rules violated.", 0),
    financial_and_material_benefits=("Marginal gain.", 2),
    goal=("Achieved most goals.", 7),
)
 
# Example 2: Evaluating with RuleBasedTerminatedEvaluator
evaluator = RuleBasedTerminatedEvaluator(max_turn_number=20, max_stale_turn=2)
termination_status = evaluator.__call__(
    turn_number=21,
    messages=[("Agent 1", AgentAction(action_type="talk")), ("Agent 2", AgentAction(action_type="leave"))]
)
 
print(termination_status)
# Output: [('environment', (('terminated', True), 'Agent 2 is leaving; '))]
 
# Example 3: Asynchronous evaluation with ReachGoalLLMEvaluator
import asyncio
 
async def evaluate():
    evaluator = ReachGoalLLMEvaluator("gpt-3", EvaluationForTwoAgents[SotopiaDimensions])
    result = await evaluator.__acall__(turn_number=10, messages=[...])
    aggregated_response = unweighted_aggregate_evaluate(result)
    print(aggregated_response)
 
# Running the asynchronous evaluation
asyncio.run(evaluate())

Persistant Profile Parallel