Telescope Leaderboards
Benchmarking leading language models across critical capabilities for finance.
Overall Benchmark
| # | Model | Score | Range |
|---|---|---|---|
| 1 | claude-sonnet-4 | 98.1 | ±1.0 |
| 2 | claude-3.7-sonnet | 97.8 | ±1.0 |
| 3 | gpt-4.1 | 97.5 | ±1.1 |
| 4 | gpt-5-chat | 96.8 | ±0.9 |
| 5 | claude-opus-4 | 96.7 | ±0.8 |
| 6 | gpt-4o | 96.6 | ±1.3 |
| 7 | ernie-4.5-300b-a47b | 95.1 | ±0.9 |
| 8 | o3 | 94.0 | ±1.5 |
| 9 | llama-4-maverick | 93.7 | ±1.9 |
| 10 | kimi-k2 | 93.3 | ±1.4 |
| 11 | qwen3-235b-a22b-thinking-2507 | 93.1 | ±1.6 |
| 12 | magistral-medium-2506 | 91.7 | ±1.2 |
| 13 | llama-3.3-70b-instruct | 91.4 | ±1.2 |
| 14 | llama-3.1-405b-instruct | 90.4 | ±1.9 |
| 15 | grok-4 | 90.0 | ±1.5 |
| 16 | o1 | 89.9 | ±1.4 |
| 17 | deepseek-r1-0528 | 89.0 | ±1.8 |
| 18 | hunyuan-a13b-instruct:free | 85.3 | ±2.4 |
| 19 | o4-mini-high | 83.3 | ±2.2 |
| 20 | gpt-5 | 74.0 | ±1.7 |
| 21 | gemini-2.5-pro | 68.4 | ±2.3 |
Behavioural Benchmarks
Evaluating the behavioural traits and decision-making patterns that determine whether an AI can be a reliable financial partner.
Epistemic Humility
Acknowledging uncertainty and limits of knowledge
Replicability
Maintaining consistent responses across similar scenarios
Biases
Avoiding systematic cognitive errors that distort judgment
Market Stress
Performing reliably during volatile and uncertain market conditions
Rationality
Applying sound logic and probability-based reasoning