LLM Reasoning Evaluation Report
Overview
Robustness
Language Pairs
Confidence
Sample Patterns
Model Comparison
Accuracy by Model & Language
Task Vulnerability (Avg. Accuracy Drop)
Summary Table
Robustness Heatmap
Per-Task Radar
Metric Comparison
Pairwise Consistency Matrix (Spearman ρ)
Positive Transfer Matrix (directed)
Negative Transfer Matrix (directed)
Average Pairwise Consistency
Average Transfer Rates
Confidence Distribution (max logprob)
Choice Entropy Distribution
Top-k Accuracy
Calibration Curve
Flip Hardness Distribution
Task Vulnerability by Language
Sample Explorer
Task Accuracy Comparison
Agreement