Telescope Leaderboards

Benchmarking leading language models across capabilities critical for finance.

Overall Benchmark

#    Model                            Range   Score
1    claude-sonnet-4                  ±1.0    98.1
2    claude-3.7-sonnet                ±1.0    97.8
3    gpt-4.1                          ±1.1    97.5
4    gpt-5-chat                       ±0.9    96.8
5    claude-opus-4                    ±0.8    96.7
6    gpt-4o                           ±1.3    96.6
7    ernie-4.5-300b-a47b              ±0.9    95.1
8    o3                               ±1.5    94.0
9    llama-4-maverick                 ±1.9    93.7
10   kimi-k2                          ±1.4    93.3
11   qwen3-235b-a22b-thinking-2507    ±1.6    93.1
12   magistral-medium-2506            ±1.2    91.7
13   llama-3.3-70b-instruct           ±1.2    91.4
14   llama-3.1-405b-instruct          ±1.9    90.4
15   grok-4                           ±1.5    90.0
16   o1                               ±1.4    89.9
17   deepseek-r1-0528                 ±1.8    89.0
18   hunyuan-a13b-instruct:free       ±2.4    85.3
19   o4-mini-high                     ±2.2    83.3
20   gpt-5                            ±1.7    74.0
21   gemini-2.5-pro                   ±2.3    68.4
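
The Range column appears to give an uncertainty interval around each score, though the leaderboard does not state how it is computed. Assuming it is a symmetric confidence interval, one quick way to judge whether two models are meaningfully separated is to check whether their intervals overlap. The following Python sketch illustrates that check; the interval semantics are an assumption, not the benchmark's documented methodology.

```python
# Illustrative sketch: treat each score's ± range as a symmetric
# uncertainty interval and check whether two models' intervals overlap.
# ASSUMPTION: the leaderboard does not specify how Range is computed;
# the interval interpretation here is hypothetical.

def intervals_overlap(score_a: float, range_a: float,
                      score_b: float, range_b: float) -> bool:
    """Return True if [score_a ± range_a] and [score_b ± range_b] overlap."""
    return abs(score_a - score_b) <= range_a + range_b

# Ranks 1 and 2 above: claude-sonnet-4 (98.1 ±1.0) vs. claude-3.7-sonnet (97.8 ±1.0).
print(intervals_overlap(98.1, 1.0, 97.8, 1.0))  # True: the gap is within the ranges
```

By this reading, many adjacent ranks in the table (for example, ranks 1 through 3) fall within each other's ranges, so small rank differences should not be over-interpreted.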

Behavioural Benchmarks

Evaluating the personality and decision-making patterns that determine whether an AI can be a reliable financial partner.

Epistemic Humility

Acknowledging uncertainty and limits of knowledge

Replicability

Maintaining consistent responses across similar scenarios

Biases

Avoiding systematic errors that distort judgment

Market Stress

Performing reliably during uncertain market conditions

Rationality

Applying sound logic and probability-based reasoning
