Displaying evaluation results

Navigate to Infor OS > GenAI > Factory > Evaluation.

Click the View Details… button next to the relevant evaluation.

Select a specific evaluation job from the list.

Review the evaluation results, which include:

Executed by: the user who ran the evaluation job
Executed on: when the evaluation job was run
Type: the endpoint that the evaluation was tested against
Status: the current progress of the evaluation job
Steps: the answer scoring for correctness, similarity, and relevance

The Steps section lists each evaluation parameter as a separate test step. Each step represents one test scenario.

Each step displays these values:

Input: the question or prompt that the user sent to the GenAI Assistant during the test.
Scoring indicators: four color-coded circular score indicators that show how the GenAI Assistant performed for each scoring dimension. Green indicates a passing score. Red indicates a failing score. Each indicator is scored on a scale from 1 to 5. The four scoring dimensions are:
- Answer correctness
- Answer similarity
- Answer relevance
- Agent trajectory

When you select a step, the Steps result details panel opens on the right. The panel provides a detailed breakdown of that test scenario:

Average Score: the combined average of all scoring dimensions for the step, displayed as a value from 1 to 5. The value serves as a quick indicator of overall quality for the step.
Results: the individual scores for each selected scoring dimension.
Input: the exact question or prompt that the user submitted to the GenAI Assistant for the test scenario.
Ground Truth: the expected answer that is defined for the evaluation parameter.
Model Generated Output: the response that the GenAI Assistant produced during the evaluation run.
Agent trajectory trace: a section that contains a View evaluation trace link. The trace shows the ground truth trajectory and the actual trajectory.
Answer correctness: an explanation from the Judge Model that describes whether the response was factually accurate and whether the response fully addressed the input question.
Answer similarity: an explanation from the Judge Model that describes how closely the structure, content, and wording of the response matched the ground truth.
Answer relevance: an assessment from the Judge Model that describes whether the response directly and comprehensively addressed the input question.
Agent trajectory: the most technically detailed scoring dimension in the evaluation.
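
As a rough illustration of the Average Score value, the following Python sketch assumes that the score is the plain arithmetic mean of the selected scoring dimensions. The exact formula that the product uses is not stated here, and the function name, dictionary keys, and sample scores are hypothetical.

```python
# Hypothetical per-dimension scores for one step, each on the 1-to-5 scale.
step_scores = {
    "answer_correctness": 5,
    "answer_similarity": 4,
    "answer_relevance": 5,
    "agent_trajectory": 3,
}


def average_score(scores: dict[str, int]) -> float:
    """Return the arithmetic mean of the selected dimension scores.

    Assumption: the Average Score is an unweighted mean. The product
    may weight or round the dimensions differently.
    """
    if not scores:
        raise ValueError("At least one scoring dimension is required.")
    for name, value in scores.items():
        # Each indicator is scored on a scale from 1 to 5.
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}.")
    return sum(scores.values()) / len(scores)


print(average_score(step_scores))  # 4.25
```

If only some scoring dimensions are selected for an evaluation, pass only those dimensions to the function so that the mean covers the same scores that the Results field lists.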