Strengthen agent evaluation with subset matching, argument validation, and improved tool call tracking in ReliabilityEval

ReliabilityEval now evaluates tool usage more precisely. Expected tool calls are matched as a subset of the actual calls instead of requiring an exact full match, argument values are validated against the expected parameters, and expected tool calls that were never made are tracked explicitly and surfaced in the results.

This change also fixes three bugs: multi-round tool call collection now gathers calls from every round, the evaluation no longer mutates the original RunOutput.messages in place, and arun() now uses the correct ID when saving evaluation files.
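
For reviewers, the subset matching, argument validation, and missing-call tracking described above can be sketched roughly as follows. This is a minimal illustration of the idea, not ReliabilityEval's actual implementation: ToolCall, MatchResult, and check_tool_calls are hypothetical names introduced only for this example, and the real result fields may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    # Hypothetical stand-in for a recorded tool call; not the library's type.
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


@dataclass
class MatchResult:
    # Hypothetical result container for this sketch.
    passed: List[str] = field(default_factory=list)   # called with matching arguments
    failed: List[str] = field(default_factory=list)   # called, but arguments differ
    missing: List[str] = field(default_factory=list)  # expected but never called

    @property
    def ok(self) -> bool:
        return not self.failed and not self.missing


def check_tool_calls(
    actual: List[ToolCall],
    expected_names: List[str],
    expected_args: Optional[Dict[str, Dict[str, Any]]] = None,
) -> MatchResult:
    """Subset match: every expected call must appear; extra actual calls are ignored."""
    expected_args = expected_args or {}
    result = MatchResult()
    for name in expected_names:
        calls = [c for c in actual if c.name == name]
        if not calls:
            result.missing.append(name)
            continue
        wanted = expected_args.get(name, {})
        # Only the listed keys are compared; unlisted arguments are not checked.
        matched = any(
            all(k in c.arguments and c.arguments[k] == v for k, v in wanted.items())
            for c in calls
        )
        (result.passed if matched else result.failed).append(name)
    return result


# Calls gathered across all rounds of a run (illustrative data).
actual_calls = [
    ToolCall("add", {"a": 1, "b": 2}),
    ToolCall("multiply", {"a": 3, "b": 4}),
    ToolCall("search", {"q": "x"}),  # extra call, ignored by subset matching
]
result = check_tool_calls(
    actual_calls,
    expected_names=["add", "multiply", "divide"],
    expected_args={"multiply": {"a": 3, "b": 5}},
)
print(result.passed, result.failed, result.missing)
```

In this sketch, "add" passes because it was called and no arguments were required, "multiply" fails because b is 4 rather than the expected 5, and "divide" is reported as missing. The unexpected "search" call does not affect the outcome, which is the point of subset matching.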