Telescope Leaderboards

Benchmarking leading language models across capabilities critical for finance.

Overall Benchmark

#    Model                            Range   Score
1    claude-sonnet-4                  ±1.0    98.1
2    claude-3.7-sonnet                ±1.0    97.8
3    gpt-4.1                          ±1.1    97.5
4    gpt-5-chat                       ±0.9    96.8
5    claude-opus-4                    ±0.8    96.7
6    gpt-4o                           ±1.3    96.6
7    ernie-4.5-300b-a47b              ±0.9    95.1
8    o3                               ±1.5    94.0
9    llama-4-maverick                 ±1.9    93.7
10   kimi-k2                          ±1.4    93.3
11   qwen3-235b-a22b-thinking-2507    ±1.6    93.1
12   magistral-medium-2506            ±1.2    91.7
13   llama-3.3-70b-instruct           ±1.2    91.4
14   llama-3.1-405b-instruct          ±1.9    90.4
15   grok-4                           ±1.5    90.0
16   o1                               ±1.4    89.9
17   deepseek-r1-0528                 ±1.8    89.0
18   hunyuan-a13b-instruct:free       ±2.4    85.3
19   o4-mini-high                     ±2.2    83.3
20   gpt-5                            ±1.7    74.0
21   gemini-2.5-pro                   ±2.3    68.4
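
The Range column appears to give an uncertainty interval around each score, though the leaderboard does not state how it is computed. Assuming it is a symmetric confidence interval, one quick way to judge whether two models are meaningfully separated is to check whether their intervals overlap. The following Python sketch illustrates that check; the interval semantics are an assumption, not the benchmark's documented methodology.

```python
# Illustrative sketch: treat each score's ± range as a symmetric
# uncertainty interval and check whether two models' intervals overlap.
# ASSUMPTION: the leaderboard does not specify how Range is computed;
# the interval interpretation here is hypothetical.

def intervals_overlap(score_a: float, range_a: float,
                      score_b: float, range_b: float) -> bool:
    """Return True if [score_a ± range_a] and [score_b ± range_b] overlap."""
    return abs(score_a - score_b) <= range_a + range_b

# Ranks 1 and 2 above: claude-sonnet-4 (98.1 ±1.0) vs. claude-3.7-sonnet (97.8 ±1.0).
print(intervals_overlap(98.1, 1.0, 97.8, 1.0))  # True: the gap is within the ranges
```

By this reading, many adjacent ranks in the table (for example, ranks 1 through 3) fall within each other's ranges, so small rank differences should not be over-interpreted.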

Behavioural Benchmarks

Evaluating the personality and decision-making patterns that determine whether an AI can be a reliable financial partner.

Epistemic Humility

Acknowledging uncertainty and limits of knowledge

Replicability

Maintaining consistent responses across similar scenarios

Biases

Avoiding systematic errors that distort judgment

Market Stress

Performing reliably during uncertain market conditions

Rationality

Applying sound logic and probability-based reasoning
