Sotopia Evaluation Module

Overview

This module provides classes and functions for evaluating social interactions in the Sotopia environment along seven dimensions: believability, relationship, knowledge, secret, social rules, financial and material benefits, and goal achievement. Evaluators expose synchronous and asynchronous entry points, and helper functions aggregate their responses into a single summary.

Classes

SotopiaDimensions

This class represents the social dimensions used in the Sotopia paper (ICLR 2024).

Attributes

  • believability: Tuple containing reasoning (str) and score (int).
  • relationship: Tuple containing reasoning (str) and score (int).
  • knowledge: Tuple containing reasoning (str) and score (int).
  • secret: Tuple containing reasoning (str) and score (int).
  • social_rules: Tuple containing reasoning (str) and score (int).
  • financial_and_material_benefits: Tuple containing reasoning (str) and score (int).
  • goal: Tuple containing reasoning (str) and score (int).

Validators

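  • zero_to_ten_validator: Ensures the score is between 0 and 10 (in the Sotopia paper, the range for believability, knowledge, and goal).
  • minus_five_to_five_validator: Ensures the score is between -5 and 5 (relationship; financial and material benefits).
  • minus_ten_to_zero_validator: Ensures the score is between -10 and 0 (secret; social rules).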
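
The range checks listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: `check_range` and the three wrapper functions are hypothetical names, and only the three numeric ranges come from the list above.

```python
def check_range(score: int, low: int, high: int) -> int:
    """Return score unchanged if low <= score <= high, else raise ValueError."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return score


# Hypothetical stand-ins for the three validators named above.
def zero_to_ten(score: int) -> int:
    return check_range(score, 0, 10)


def minus_five_to_five(score: int) -> int:
    return check_range(score, -5, 5)


def minus_ten_to_zero(score: int) -> int:
    return check_range(score, -10, 0)
```

Rejecting out-of-range scores at construction time means a malformed model response fails early instead of skewing later averages.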

SotopiaDimensionsPlus

An updated version of SotopiaDimensions with more detailed instructions for each dimension.

GoalDimension

This class evaluates goal achievement only.

Attributes

  • goal: Tuple containing reasoning (str) and score (int).

Validators

  • zero_to_ten_validator: Ensures the score is between 0 and 10.

EvaluationForTwoAgents

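A generic container, parameterized by a dimension class (for example, EvaluationForTwoAgents[SotopiaDimensions]), that holds the evaluation results for both agents.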

Attributes

  • agent_1_evaluation: Evaluation results for agent 1.
  • agent_2_evaluation: Evaluation results for agent 2.
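
The idea of a container that is generic over the dimension class can be sketched with the standard library alone. `TwoAgentEvaluation` and `GoalOnly` are hypothetical stand-ins introduced for illustration; they only mirror the two attributes listed above and are not the library's classes.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class TwoAgentEvaluation(Generic[T]):
    """Hypothetical stand-in: one evaluation object per agent."""
    agent_1_evaluation: T
    agent_2_evaluation: T


@dataclass
class GoalOnly:
    """Hypothetical dimension set holding only a (reasoning, score) pair."""
    goal: tuple[str, int]


# Parameterize the container with the dimension class, then fill in both agents.
result = TwoAgentEvaluation[GoalOnly](
    agent_1_evaluation=GoalOnly(goal=("Reached an agreement.", 8)),
    agent_2_evaluation=GoalOnly(goal=("Conceded most points.", 3)),
)
```

Keeping the container generic lets the same two-agent wrapper be reused with a full dimension set or a goal-only one.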

Evaluator

Abstract base class for evaluators.

Methods

  • __call__: Abstract method that performs a synchronous evaluation.
  • __acall__: Abstract method that performs an asynchronous evaluation.
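
A minimal sketch of this interface, assuming nothing beyond the two method names `__call__` and `__acall__`: the classes `BaseEvaluator` and `TurnCountEvaluator`, the parameter names, and the return shape are illustrative assumptions rather than the library's definitions.

```python
import abc
import asyncio


class BaseEvaluator(abc.ABC):
    """Hypothetical abstract evaluator with a sync and an async entry point."""

    @abc.abstractmethod
    def __call__(self, turn_number: int, messages: list) -> list: ...

    @abc.abstractmethod
    async def __acall__(self, turn_number: int, messages: list) -> list: ...


class TurnCountEvaluator(BaseEvaluator):
    """Toy subclass that reports the turn number it was given."""

    def __call__(self, turn_number: int, messages: list) -> list:
        return [("environment", (("turn_number", turn_number), "counted turns"))]

    async def __acall__(self, turn_number: int, messages: list) -> list:
        # The async path simply delegates to the sync one in this toy example.
        return self(turn_number, messages)
```

Because both methods are abstract, instantiating `BaseEvaluator` directly raises `TypeError`; only subclasses that implement both can be created.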

RuleBasedTerminatedEvaluator

This class decides whether a conversation should terminate, using fixed rules rather than a language model.

Attributes

  • max_turn_number: Maximum number of turns before termination.
  • max_stale_turn: Maximum number of stale turns before termination.

Methods

  • __call__: Performs the evaluation and returns the termination status.
  • __acall__: Asynchronous version of the __call__ method.
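
The kind of rules this evaluator applies can be sketched as follows. This is an assumption-laden illustration, not the library's logic: the `Action` class, the `should_terminate` function, the comparison operators, and the reason strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical minimal action with a type such as "speak", "none", or "leave"."""
    action_type: str


def should_terminate(turn_number, messages, max_turn_number=20, max_stale_turn=2):
    """Return (terminated, reasons) for a list of (sender, Action) pairs."""
    reasons = []
    # Rule 1: the conversation has run for too many turns.
    if turn_number >= max_turn_number:
        reasons.append("The conversation is too long")
    # Rule 2: one of the two most recent actions is a "leave".
    if any(action.action_type == "leave" for _, action in messages[-2:]):
        reasons.append("An agent is leaving")
    # Rule 3: count consecutive "none" actions at the end of the conversation.
    stale = 0
    for _, action in reversed(messages):
        if action.action_type != "none":
            break
        stale += 1
    if stale > max_stale_turn:
        reasons.append("The conversation is stale")
    return bool(reasons), "; ".join(reasons)
```

Collecting every reason that fired, rather than returning at the first match, keeps the explanation complete when several rules trigger at once.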

ReachGoalLLMEvaluator

This class evaluates goal achievement using a language model.

Attributes

  • model_name: Name of the language model.
  • response_format_class: Class type for the evaluation response format.

Methods

  • __call__: Not implemented; synchronous evaluation is not supported.
  • __acall__: Asynchronous evaluation method that queries a language model.
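
The split between an unimplemented synchronous path and an awaitable one can be sketched like this. `AsyncOnlyEvaluator` and `_query_model` are hypothetical; the stubbed model call stands in for a real language-model request, and the returned scores are dummy values.

```python
import asyncio


class AsyncOnlyEvaluator:
    """Hypothetical evaluator whose scoring step must be awaited."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def __call__(self, turn_number, messages):
        # Mirrors the description above: the synchronous path is not implemented.
        raise NotImplementedError("use __acall__ for model-based evaluation")

    async def _query_model(self, history: str) -> dict:
        # Stand-in for a network call to a language model (assumption).
        await asyncio.sleep(0)
        return {"agent_1": ("Stub reasoning.", 5), "agent_2": ("Stub reasoning.", 5)}

    async def __acall__(self, turn_number, messages):
        # Flatten (sender, text) pairs into a transcript for the model.
        history = "\n".join(f"{sender}: {text}" for sender, text in messages)
        scores = await self._query_model(history)
        return [
            (agent, (("goal", score), reasoning))
            for agent, (reasoning, score) in scores.items()
        ]
```

Making the model call awaitable lets a caller evaluate many episodes concurrently instead of blocking on each request.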

Functions

_reduce

Reduces a list of responses by averaging the scores and aggregating comments.

Parameters

  • responses_per_reducer: List of tuples containing response and reasoning.

Returns

  • Tuple containing reduced dictionary of scores and aggregated comments.
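
Averaging scores and aggregating comments can be sketched as below. The function name `reduce_responses`, the assumed input shape of `((dimension, score), reasoning)` tuples, and the comment format are illustrative assumptions, not the behavior of `_reduce` itself.

```python
from collections import defaultdict


def reduce_responses(responses):
    """Average scores per dimension and join the non-empty comments.

    responses: list of ((dimension, score), reasoning) tuples (assumed shape).
    Returns (dict mapping dimension to mean score, newline-joined comments).
    """
    scores = defaultdict(list)
    comments = []
    for (dimension, score), reasoning in responses:
        scores[dimension].append(float(score))
        if reasoning:
            comments.append(f"{dimension}: {reasoning}")
    averaged = {dim: sum(vals) / len(vals) for dim, vals in scores.items()}
    return averaged, "\n".join(comments)
```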

unweighted_aggregate_evaluate

Aggregates a list of evaluator responses into a single ScriptEnvironmentResponse, giving every response equal weight.

Parameters

  • responses: List of responses from the environment.

Returns

  • An instance of ScriptEnvironmentResponse.
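
Grouping responses by their source before averaging can be sketched as follows. This assumes each response is a `(source, ((dimension, score), reasoning))` tuple, with source keys such as "environment", "agent_1", or "agent_2"; the function name, the key names, and the plain-dict return value are hypothetical and do not reproduce ScriptEnvironmentResponse.

```python
from collections import defaultdict


def aggregate_by_source(responses):
    """Group responses by source and take the unweighted mean per dimension."""
    grouped = defaultdict(lambda: defaultdict(list))
    for source, ((dimension, score), _reasoning) in responses:
        grouped[source][dimension].append(score)
    # Every response counts equally, so each mean is a simple sum / count.
    return {
        source: {dim: sum(vals) / len(vals) for dim, vals in dims.items()}
        for source, dims in grouped.items()
    }
```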

Usage Examples

import logging
from sotopia.envs.evaluators import (
    SotopiaDimensions,
    SotopiaDimensionsPlus,
    GoalDimension,
    EvaluationForTwoAgents,
    RuleBasedTerminatedEvaluator,
    ReachGoalLLMEvaluator,
    unweighted_aggregate_evaluate
)
 
log = logging.getLogger("evaluators")
 
# Example 1: Creating an instance of SotopiaDimensions
dimensions = SotopiaDimensions(
    believability=("Agent interacts naturally.", 8),
    relationship=("Relationship improved.", 3),
    knowledge=("Gained new knowledge.", 7),
    secret=("No secrets revealed.", 0),
    social_rules=("No rules violated.", 0),
    financial_and_material_benefits=("Marginal gain.", 2),
    goal=("Achieved most goals.", 7),
)
 
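# Example 2: Evaluating with RuleBasedTerminatedEvaluator
from sotopia.messages import AgentAction

evaluator = RuleBasedTerminatedEvaluator(max_turn_number=20, max_stale_turn=2)
# Call the evaluator directly; each message pairs a sender name with an AgentAction.
termination_status = evaluator(
    turn_number=21,
    messages=[
        ("Agent 1", AgentAction(action_type="speak", argument="Hello!")),
        ("Agent 2", AgentAction(action_type="leave", argument="")),
    ],
)

print(termination_status)
# Output has the form [('environment', (('terminated', True), '<reasons>'))],
# where <reasons> names every rule that fired (here it includes 'Agent 2 is leaving; ').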
 
# Example 3: Asynchronous evaluation with ReachGoalLLMEvaluator
import asyncio
 
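async def evaluate():
    # "gpt-3" is a placeholder; substitute the name of a model available to you.
    evaluator = ReachGoalLLMEvaluator("gpt-3", EvaluationForTwoAgents[SotopiaDimensions])
    # The synchronous __call__ is not implemented, so await __acall__ directly.
    result = await evaluator.__acall__(turn_number=10, messages=[...])
    aggregated_response = unweighted_aggregate_evaluate(result)
    print(aggregated_response)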
 
# Running the asynchronous evaluation
asyncio.run(evaluate())