LLM Reasoning Evaluation Report

Accuracy by Model & Language
Task Vulnerability (Avg. Accuracy Drop)
Summary Table
Robustness Heatmap
Per-Task Radar
Metric Comparison
Pairwise Consistency Matrix (Spearman ρ)
Positive Transfer Matrix (directed)
Negative Transfer Matrix (directed)
Average Pairwise Consistency
Average Transfer Rates
Confidence Distribution (max logprob)
Choice Entropy Distribution
Top-k Accuracy
Calibration Curve
Flip Hardness Distribution
Task Vulnerability by Language
Sample Explorer
Task Accuracy Comparison
Agreement