Displaying evaluation results

  1. Navigate to Infor OS > GenAI > Factory > Evaluation.
  2. Click the View Details… button next to the relevant evaluation.
  3. Select an evaluation job from the list.
  4. Review the evaluation results, which include:
    • Executed by: the user who ran the evaluation job
    • Executed on: when the evaluation job was run
    • Type: the endpoint that the evaluation was tested against
    • Status: the current progress of the evaluation job
    • Steps: the test steps and their scores for answer correctness, answer similarity, answer relevance, and agent trajectory

    The Steps section lists each evaluation parameter as a separate test step. Each step represents one test scenario.

    Each step displays these values:

    • Input: the question or prompt that the user sent to the GenAI Assistant during the test.
    • Scoring indicators: four color-coded circular indicators that show how the GenAI Assistant performed on each scoring dimension. Each indicator shows a score on a scale from 1 to 5. Green indicates a passing score, and red indicates a failing score. The four scoring dimensions are:
      • Answer correctness
      • Answer similarity
      • Answer relevance
      • Agent trajectory

    When you select a step, the Steps result details panel opens on the right. The panel provides a detailed breakdown of that test scenario:

    • Average Score: the combined average of all scoring dimensions for the step, displayed as a value from 1 to 5. This value serves as a quick indicator of the overall quality of the response.
    • Results: the individual scores for each selected scoring dimension.
    • Input: the exact question or prompt that the user submitted to the GenAI Assistant for the test scenario.
    • Ground Truth: the expected answer that is defined for the evaluation parameter.
    • Model Generated Output: the response that the GenAI Assistant produced during the evaluation run.
    • Agent trajectory trace: a section that contains a View evaluation trace link. The trace shows the ground truth trajectory alongside the actual trajectory.
    • Answer correctness: an explanation from the Judge Model that describes whether the response was factually accurate and whether the response fully addressed the input question.
    • Answer similarity: an explanation from the Judge Model that describes how closely the structure, content, and wording of the response matched the ground truth.
    • Answer relevance: an assessment from the Judge Model that describes whether the response directly and comprehensively addressed the input question.
    • Agent trajectory: the most technically detailed scoring dimension in the evaluation.
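To illustrate how an Average Score relates to the individual results, the following Python sketch assumes the Average Score is the arithmetic mean of the scores for the scoring dimensions selected for a step. The guide states only that it is a combined average, so this formula is an assumption. The names `average_score` and `step_scores` are hypothetical, and the scores are invented for illustration.

```python
# Hypothetical scores for one step, one value per selected scoring dimension
# (scale 1 to 5). The values are invented for illustration only.
step_scores = {
    "Answer correctness": 4,
    "Answer similarity": 3,
    "Answer relevance": 5,
    "Agent trajectory": 4,
}


def average_score(scores: dict[str, float]) -> float:
    """Return the arithmetic mean of the per-dimension scores.

    Assumption: the Average Score is a plain mean of the selected
    dimensions; the actual product may combine scores differently.
    """
    if not scores:
        raise ValueError("at least one scoring dimension is required")
    for name, value in scores.items():
        # Each dimension is scored on a scale from 1 to 5.
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score {value} is outside the 1 to 5 scale")
    return sum(scores.values()) / len(scores)


print(average_score(step_scores))  # prints 4.0
```

Because only the selected scoring dimensions contribute, a step evaluated on fewer dimensions is averaged over just those scores.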