AI Model Rankings — Artificial Analysis LLM Leaderboard

The rankings on this page come from Artificial Analysis, which compares and ranks the performance of more than 100 AI models (LLMs) on metrics including intelligence and price. The leaderboard also aggregates results from other authoritative AI benchmarks for reference.

AI Model Rankings (based on Artificial Analysis)

Model Info | Artificial Analysis Benchmark Results | Other AI Benchmark Results
Rank | Model | Organization | Composite Index | Coding | Math | Price ($/1M) | MMLU Pro | GPQA | HLE | LiveCodeBench | SciCode | Math 500 | AIME
1 Gemini 3 Pro Preview (high) Google 72.8 62.3 95.7 $4.5 0.898 0.908 0.372 0.917 0.561 - -
2 Claude Opus 4.5 (Reasoning) Anthropic 69.8 60.2 91.3 $10 0.895 0.866 0.284 0.871 0.495 - -
3 GPT-5.1 (high) OpenAI 69.7 57.5 94 $3.438 0.87 0.873 0.265 0.868 0.433 - -
4 GPT-5 (high) OpenAI 68.5 52.7 94.3 $3.438 0.871 0.854 0.265 0.846 0.429 0.994 0.957
5 GPT-5 Codex (high) OpenAI 68.5 53.5 98.7 $3.438 0.865 0.837 0.256 0.84 0.409 - -
6 Kimi K2 Thinking Moonshot AI 67 52.2 94.7 $1.075 0.848 0.838 0.223 0.853 0.424 - -
7 GPT-5.1 Codex (high) OpenAI 66.9 52.5 95.7 $3.438 0.86 0.86 0.234 0.849 0.402 - -
8 GPT-5 (medium) OpenAI 66.4 49.2 91.7 $3.438 0.867 0.842 0.235 0.703 0.411 0.991 0.917
9 DeepSeek V3.2 (Reasoning) DeepSeek 65.9 52.8 92 $0.315 0.862 0.84 0.222 0.862 0.389 - -
10 o3 OpenAI 65.5 52.2 88.3 $3.5 0.853 0.827 0.2 0.808 0.41 0.992 0.903
11 Grok 4 xAI 65.3 55.1 92.7 $6 0.866 0.877 0.239 0.819 0.457 0.99 0.943
12 o3-pro OpenAI 65.3 - - $35 - 0.845 - - - - -
13 Gemini 3 Pro Preview (low) Google 64.5 55.8 86.7 $4.5 0.895 0.887 0.276 0.857 0.499 - -
14 GPT-5 mini (high) OpenAI 64.3 51.4 90.7 $0.688 0.837 0.828 0.197 0.838 0.392 - -
15 Grok 4.1 Fast (Reasoning) xAI 64.1 49.7 89.3 $0.275 0.854 0.853 0.176 0.822 0.442 - -
16 Claude 4.5 Sonnet (Reasoning) Anthropic 62.7 49.8 88 $6 0.875 0.834 0.173 0.714 0.447 - -
17 Nova 2.0 Pro Preview (medium) Amazon 62.4 46.1 89 $3.438 0.83 0.785 0.089 0.73 0.427 - -
18 GPT-5.1 Codex mini (high) OpenAI 62.3 52.5 91.7 $0.688 0.82 0.813 0.169 0.836 0.426 - -
19 GPT-5 (low) OpenAI 61.8 46.8 83 $3.438 0.86 0.808 0.184 0.763 0.391 0.987 0.83
20 MiniMax-M2 MiniMax 61.4 47.6 78.3 $0.525 0.82 0.777 0.125 0.826 0.361 - -
21 GPT-5 mini (medium) OpenAI 60.8 45.7 85 $0.688 0.828 0.803 0.146 0.692 0.41 - -
22 gpt-oss-120B (high) OpenAI 60.5 49.6 93.4 $0.263 0.808 0.782 0.185 0.878 0.389 - -
23 Grok 4 Fast (Reasoning) xAI 60.3 48.4 89.7 $0.275 0.85 0.847 0.17 0.832 0.442 - -
24 Claude Opus 4.5 (Non-reasoning) Anthropic 59.9 53 62.7 $10 0.889 0.81 0.129 0.738 0.47 - -
25 Gemini 2.5 Pro Google 59.6 49.3 87.7 $3.438 0.862 0.844 0.211 0.801 0.428 0.967 0.887
26 o4-mini (high) OpenAI 59.6 48.9 90.7 $1.925 0.832 0.784 0.175 0.859 0.465 0.989 0.94
27 Claude 4.1 Opus (Reasoning) Anthropic 59.3 46.1 80.3 $30 0.88 0.809 0.119 0.654 0.409 - -
28 DeepSeek V3.2 Speciale DeepSeek 58.6 55.4 96.7 $0.315 0.863 0.871 0.261 0.896 0.44 - -
29 Nova 2.0 Lite (medium) Amazon 57.7 39.8 88.7 $0.85 0.813 0.768 0.086 0.663 0.368 - -
30 DeepSeek V3.1 Terminus (Reasoning) DeepSeek 57.7 49.6 89.7 $0.8 0.851 0.792 0.152 0.798 0.406 - -
31 Nova 2.0 Pro Preview (low) Amazon 57.6 39.6 63.3 $3.438 0.822 0.751 0.052 0.638 0.387 - -
32 Qwen3 235B A22B 2507 (Reasoning) Alibaba 57.5 44.6 91 $2.625 0.843 0.79 0.15 0.788 0.424 0.984 0.94
33 Grok 3 mini Reasoning (high) xAI 57.1 42.2 84.7 $0.35 0.828 0.791 0.111 0.696 0.406 0.992 0.933
34 Doubao Seed Code ByteDance Seed 57.1 47.4 79.3 $0.407 0.854 0.764 0.133 0.766 0.407 - -
35 DeepSeek V3.2 Exp (Reasoning) DeepSeek 56.9 48.6 87.7 $0.315 0.85 0.797 0.138 0.789 0.377 - -
36 Claude 4 Sonnet (Reasoning) Anthropic 56.5 45.1 74.3 $6 0.842 0.777 0.096 0.655 0.4 0.991 0.773
37 GLM-4.6 (Reasoning) Z AI 56 43.8 86 $1 0.829 0.78 0.133 0.695 0.384 - -
38 Nova 2.0 Omni (medium) Amazon 56 35.5 89.7 $0.85 0.809 0.76 0.068 0.66 0.362 - -
39 Qwen3 Max Thinking Alibaba 55.8 36.2 82.3 $2.4 0.824 0.776 0.12 0.535 0.387 - -
40 Qwen3 Max Alibaba 55.1 44.7 80.7 $2.4 0.841 0.764 0.111 0.767 0.383 - -
41 Claude 4.5 Haiku (Reasoning) Anthropic 54.6 43.4 83.7 $2 0.76 0.672 0.097 0.615 0.433 - -
42 Gemini 2.5 Flash Preview (Sep '25) (Reasoning) Google 54.4 42.5 78.3 $0.85 0.842 0.793 0.127 0.713 0.405 - -
43 Qwen3 VL 235B A22B (Reasoning) Alibaba 54.4 38.4 88.3 $2.625 0.836 0.772 0.101 0.646 0.399 - -
44 Qwen3 Next 80B A3B (Reasoning) Alibaba 54.3 42.1 84.3 $1.875 0.824 0.759 0.117 0.784 0.388 - -
45 Claude 4 Opus (Reasoning) Anthropic 54.2 44.2 73.3 $30 0.873 0.796 0.117 0.636 0.398 0.982 0.757
46 Gemini 2.5 Pro Preview (Mar '25) Google 54.1 46.7 - $3.438 0.858 0.836 0.171 0.778 0.395 0.98 0.87
47 DeepSeek V3.1 (Reasoning) DeepSeek 54 47.2 89.7 $0.654 0.851 0.779 0.13 0.784 0.391 - -
48 Gemini 2.5 Pro Preview (May '25) Google 53.2 - - $3.438 0.837 0.822 0.154 0.77 0.416 0.986 0.843
49 DeepSeek V3.2 (Non-reasoning) DeepSeek 52.4 42.8 59 $0.315 0.837 0.751 0.105 0.593 0.387 - -
50 gpt-oss-20B (high) OpenAI 52.1 40.7 89.3 $0.1 0.748 0.688 0.098 0.777 0.344 - -
51 Magistral Medium 1.2 Mistral 52 42.3 82 $2.75 0.815 0.739 0.096 0.75 0.392 - -
52 DeepSeek R1 0528 (May '25) DeepSeek 52 44.1 76 $2.362 0.849 0.813 0.149 0.77 0.403 0.983 0.893
53 Qwen3 VL 32B (Reasoning) Alibaba 51.9 36.4 84.7 $2.625 0.818 0.733 0.096 0.738 0.285 - -
54 Seed-OSS-36B-Instruct ByteDance Seed 51.6 39.8 84.7 $0.3 0.815 0.726 0.091 0.765 0.365 - -
55 Apriel-v1.5-15B-Thinker ServiceNow 51.6 39.2 87.5 $0 0.773 0.713 0.12 0.728 0.348 - -
56 GLM-4.5 (Reasoning) Z AI 51.3 43.3 73.7 $0.98 0.835 0.782 0.122 0.738 0.348 0.979 0.873
57 Gemini 2.5 Flash (Reasoning) Google 51.2 40.5 73.3 $0.85 0.832 0.79 0.111 0.695 0.394 0.981 0.823
58 GPT-5 nano (high) OpenAI 51 42.3 83.7 $0.138 0.78 0.676 0.082 0.789 0.366 - -
59 o3-mini (high) OpenAI 50.8 42.1 - $1.925 0.802 0.773 0.123 0.734 0.398 0.985 0.86
60 Kimi K2 0905 Moonshot AI 50.4 38.1 57.3 $1.2 0.819 0.767 0.063 0.61 0.307 - -
61 Claude 3.7 Sonnet (Reasoning) Anthropic 49.9 35.8 56.3 $6 0.837 0.772 0.103 0.473 0.403 0.947 0.487
62 Claude 4.5 Sonnet (Non-reasoning) Anthropic 49.6 42.9 37 $6 0.86 0.727 0.071 0.59 0.428 - -
63 GPT-5 nano (medium) OpenAI 49.3 42.1 78.3 $0.138 0.772 0.67 0.076 0.763 0.338 - -
64 GLM-4.5-Air Z AI 48.8 39.4 80.7 $0.425 0.815 0.733 0.068 0.684 0.306 0.965 0.673
65 Nova 2.0 Omni (low) Amazon 48.7 32.3 56 $0.85 0.798 0.699 0.04 0.592 0.343 - -
66 Grok Code Fast 1 xAI 48.6 39.4 43.3 $0.525 0.793 0.727 0.075 0.657 0.362 - -
67 Qwen3 Max (Preview) Alibaba 48.5 40.2 75 $2.4 0.838 0.764 0.093 0.651 0.37 - -
68 o3-mini OpenAI 48.1 39.4 - $1.925 0.791 0.748 0.087 0.717 0.399 0.973 0.77
69 Kimi K2 Moonshot AI 48.1 35 57 $1.075 0.824 0.766 0.07 0.556 0.345 0.971 0.693
70 o1-pro OpenAI 48 - - $262.5 - - - - - - -
71 Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) Google 47.9 36.5 68.7 $0.175 0.808 0.709 0.066 0.688 0.287 - -
72 gpt-oss-120B (low) OpenAI 47.5 37.2 66.7 $0.263 0.775 0.672 0.052 0.707 0.36 - -
73 o1 OpenAI 47.2 38.6 - $26.25 0.841 0.747 0.077 0.679 0.358 0.97 0.723
74 Nova 2.0 Lite (low) Amazon 46.8 27.9 46.7 $0.85 0.788 0.698 0.042 0.469 0.333 - -
75 Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) Google 46.7 37.8 56.7 $0.85 0.836 0.766 0.078 0.625 0.375 - -
76 Qwen3 30B A3B 2507 (Reasoning) Alibaba 46.4 36.3 56.3 $0.75 0.805 0.707 0.098 0.707 0.333 0.976 0.907
77 DeepSeek V3.2 Exp (Non-reasoning) DeepSeek 46.3 39.6 57.7 $0.315 0.836 0.738 0.086 0.554 0.399 - -
78 Sonar Reasoning Pro Perplexity 46.3 - - $0 - - - - - 0.957 0.79
79 MiniMax M1 80k MiniMax 46.2 37.1 61 $0.825 0.816 0.697 0.082 0.711 0.374 0.98 0.847
80 Gemini 2.5 Flash Preview (Reasoning) Google 45.8 - - $0 0.8 0.698 0.116 0.505 0.359 0.981 0.843
81 DeepSeek V3.1 Terminus (Non-reasoning) DeepSeek 45.7 38.3 53.7 $0.8 0.836 0.751 0.084 0.529 0.321 - -
82 Qwen3 235B A22B 2507 Instruct Alibaba 45.3 34.2 71.7 $1.225 0.828 0.753 0.106 0.524 0.36 0.98 0.717
83 Qwen3 VL 30B A3B (Reasoning) Alibaba 45.3 34.5 82.3 $0.75 0.807 0.72 0.087 0.697 0.288 - -
84 Grok 3 xAI 45.3 30 58 $6 0.799 0.693 0.051 0.425 0.368 0.87 0.33
85 Llama Nemotron Super 49B v1.5 (Reasoning) NVIDIA 45.2 37.8 76.7 $0.175 0.814 0.748 0.068 0.737 0.348 0.983 0.86
86 o1-preview OpenAI 44.9 34 - $28.875 - - - - - 0.924 -
87 Qwen3 Next 80B A3B Instruct Alibaba 44.8 35.4 66.3 $0.875 0.819 0.738 0.073 0.684 0.307 - -
88 Ling-1T InclusionAI 44.8 37.6 71.3 $0.998 0.822 0.719 0.072 0.677 0.352 - -
89 DeepSeek V3.1 (Non-reasoning) DeepSeek 44.8 39 49.7 $0.84 0.833 0.735 0.063 0.577 0.367 - -
90 GLM-4.6 (Non-reasoning) Z AI 44.7 38.7 44.3 $1 0.784 0.632 0.052 0.561 0.331 - -
91 Claude 4.1 Opus (Non-reasoning) Anthropic 44.6 - - $30 - - - - - - -
92 Claude 4 Sonnet (Non-reasoning) Anthropic 44.4 35.9 38 $6 0.837 0.683 0.04 0.449 0.373 0.934 0.407
93 gpt-oss-20B (low) OpenAI 44.3 34.5 62.3 $0.1 0.718 0.611 0.051 0.652 0.34 - -
94 Qwen3 VL 235B A22B Instruct Alibaba 44.1 33.9 70.7 $1.225 0.823 0.712 0.063 0.594 0.359 - -
95 DeepSeek R1 (Jan '25) DeepSeek 43.8 34.4 68 $2.362 0.844 0.708 0.093 0.617 0.357 0.966 0.683
96 GPT-5 (minimal) OpenAI 43.5 37.4 31.7 $3.438 0.806 0.673 0.054 0.558 0.388 0.861 0.367
97 Qwen3 4B 2507 (Reasoning) Alibaba 43.4 30.4 82.7 $0 0.743 0.667 0.059 0.641 0.256 - -
98 GPT-4.1 OpenAI 43.4 32.2 34.7 $3.5 0.806 0.666 0.046 0.457 0.381 0.913 0.437
99 KAT-Coder-Pro V1 KwaiKAT 43.3 33.2 65 $0 0.814 0.709 0.071 0.534 0.355 - -
100 Magistral Small 1.2 Mistral 43 37.2 80.3 $0.75 0.768 0.663 0.061 0.723 0.352 - -
101 GPT-5.1 (Non-reasoning) OpenAI 42.9 35.7 38 $3.438 0.801 0.643 0.052 0.494 0.365 - -
102 EXAONE 4.0 32B (Reasoning) LG AI Research 42.6 37.5 80 $0.7 0.818 0.739 0.105 0.747 0.344 0.977 0.843
103 GPT-4.1 mini OpenAI 42.5 31.9 46.3 $0.7 0.781 0.664 0.046 0.483 0.404 0.925 0.43
104 Claude 4 Opus (Non-reasoning) Anthropic 42.3 - 36.3 $30 0.86 0.701 0.059 0.542 0.409 0.941 0.563
105 Qwen3 Coder 480B A35B Instruct Alibaba 42.3 37.4 39.3 $3 0.788 0.618 0.044 0.585 0.359 0.942 0.477
106 Nova 2.0 Pro Preview (Non-reasoning) Amazon 41.9 30.3 30.7 $3.438 0.772 0.636 0.04 0.473 0.281 - -
107 GPT-5 (ChatGPT) OpenAI 41.8 34.7 48.3 $3.438 0.82 0.686 0.058 0.543 0.378 - -
108 Ring-1T InclusionAI 41.8 35.8 89.3 $0.998 0.806 0.595 0.102 0.643 0.367 - -
109 Qwen3 235B A22B (Reasoning) Alibaba 41.7 35.9 82 $2.625 0.828 0.7 0.117 0.622 0.399 0.93 0.84
110 Claude 4.5 Haiku (Non-reasoning) Anthropic 41.7 37 39 $2 0.8 0.646 0.043 0.511 0.344 - -
111 GPT-5 mini (minimal) OpenAI 41.6 35 46.7 $0.688 0.775 0.687 0.05 0.545 0.369 - -
112 Hermes 4 - Llama-3.1 405B (Reasoning) Nous Research 41.6 34.8 69.7 $1.5 0.829 0.727 0.103 0.686 0.252 - -
113 Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) Google 41.6 33.2 46.7 $0.175 0.796 0.651 0.046 0.641 0.285 - -
114 Grok 3 Reasoning Beta xAI 41.4 - - $0 - - - - - - -
115 DeepSeek V3 0324 DeepSeek 41.3 30.2 41 $1.25 0.819 0.655 0.052 0.405 0.358 0.942 0.52
116 Claude 3.7 Sonnet (Non-reasoning) Anthropic 41.1 32.3 21 $6 0.803 0.656 0.048 0.394 0.376 0.85 0.223
117 Qwen3 VL 32B Instruct Alibaba 41 29.8 68.3 $1.225 0.791 0.671 0.063 0.514 0.301 - -
118 Gemini 2.5 Flash (Non-reasoning) Google 40.4 30 60.3 $0.85 0.809 0.683 0.051 0.495 0.291 0.932 0.5
119 Gemini 2.5 Flash-Lite (Reasoning) Google 40.1 27.6 53.3 $0.175 0.759 0.625 0.064 0.593 0.193 0.969 0.703
120 Qwen3 Omni 30B A3B (Reasoning) Alibaba 40 34 74 $0.43 0.792 0.726 0.073 0.679 0.306 - -
121 MiniMax M1 40k MiniMax 40 35.2 13.7 $0.825 0.808 0.682 0.075 0.657 0.378 0.972 0.813
122 Ring-flash-2.0 InclusionAI 39.5 28.9 83.7 $0.247 0.793 0.725 0.089 0.628 0.168 - -
123 o1-mini OpenAI 39.2 - - $0 0.742 0.603 0.049 0.576 0.323 0.944 0.603
124 Hermes 4 - Llama-3.1 70B (Reasoning) Nous Research 39.2 34.6 68.7 $0.198 0.811 0.699 0.079 0.653 0.341 - -
125 Qwen3 32B (Reasoning) Alibaba 38.7 30.9 73 $2.625 0.798 0.668 0.083 0.546 0.354 0.961 0.807
126 Grok 4 Fast (Non-reasoning) xAI 38.6 28.1 41.3 $0.275 0.73 0.606 0.05 0.401 0.329 - -
127 Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) NVIDIA 38.5 33.7 63.7 $0.9 0.825 0.728 0.081 0.641 0.347 0.952 0.747
128 Qwen3 VL 30B A3B Instruct Alibaba 38.5 28 72.3 $0.35 0.764 0.695 0.064 0.476 0.308 - -
129 GPT-4.5 (Preview) OpenAI 38.4 - - $0 - - - - - - -
130 Mistral Large 3 Mistral 38.4 32.5 38 $0.75 0.807 0.68 0.041 0.465 0.362 - -
131 Ling-flash-2.0 InclusionAI 38.3 32.6 65.3 $0.247 0.777 0.657 0.063 0.589 0.289 - -
132 Grok 4.1 Fast (Non-reasoning) xAI 38.3 27.7 34.3 $0.275 0.743 0.637 0.05 0.399 0.296 - -
133 QwQ 32B Alibaba 37.9 - 29 $0.473 0.764 0.593 0.082 0.631 0.358 0.957 0.78
134 Gemini 2.0 Flash Thinking Experimental (Jan '25) Google 37.7 24.1 - $0 0.798 0.701 0.071 0.321 0.329 0.944 0.5
135 Solar Pro 2 (Reasoning) Upstage 37.7 31.5 61.3 $0.5 0.805 0.687 0.07 0.616 0.302 0.967 0.69
136 NVIDIA Nemotron Nano 9B V2 (Reasoning) NVIDIA 37.2 31.9 69.7 $0.07 0.742 0.57 0.046 0.724 0.22 - -
137 Qwen3 30B A3B 2507 Instruct Alibaba 37 29.2 66.3 $0.35 0.777 0.659 0.068 0.515 0.304 0.975 0.727
138 GLM-4.5V (Reasoning) Z AI 37 29.2 73 $0.85 0.788 0.684 0.059 0.604 0.221 - -
139 Qwen3 30B A3B (Reasoning) Alibaba 36.7 27.1 72.3 $0.75 0.777 0.616 0.066 0.506 0.285 0.959 0.753
140 OLMo 3 32B Think Allen Institute for AI 36.3 32.4 73.7 $0.237 0.759 0.61 0.059 0.672 0.286 - -
141 NVIDIA Nemotron Nano 9B V2 (Non-reasoning) NVIDIA 36.1 30.6 62.3 $0.07 0.739 0.557 0.04 0.701 0.209 - -
142 Solar Pro 2 (Preview) (Reasoning) Upstage 36.1 - - $0 0.768 0.578 0.057 0.462 0.164 0.9 0.663
143 Qwen3 14B (Reasoning) Alibaba 36 29.1 55.7 $1.313 0.774 0.604 0.043 0.523 0.316 0.961 0.763
144 Llama 4 Maverick Meta 35.8 26.4 19.3 $0.422 0.809 0.671 0.048 0.397 0.331 0.889 0.39
145 GPT-4o (March 2025, chatgpt-4o-latest) OpenAI 35.6 - 25.7 $7.5 0.803 0.655 0.05 0.425 0.366 0.893 0.327
146 Nova 2.0 Lite (Non-reasoning) Amazon 35.6 21.6 33.7 $0.85 0.743 0.603 0.03 0.346 0.24 - -
147 Llama 3.3 Nemotron Super 49B v1 (Reasoning) NVIDIA 35.5 18.7 54.7 $0 0.785 0.643 0.065 0.277 0.282 0.959 0.583
148 Mistral Medium 3.1 Mistral 35.4 28.1 38.3 $0.8 0.683 0.588 0.044 0.406 0.338 - -
149 Gemini 2.0 Pro Experimental (Feb '25) Google 34.6 25.5 - $0 0.805 0.622 0.068 0.347 0.312 0.923 0.36
150 Sonar Reasoning Perplexity 34.2 - - $2 - 0.623 - - - 0.921 0.77
151 Nova 2.0 Omni (Non-reasoning) Amazon 34.1 21.6 37 $0.85 0.719 0.555 0.039 0.305 0.279 - -
152 Gemini 2.5 Flash Preview (Non-reasoning) Google 34.1 - - $0 0.783 0.594 0.05 0.406 0.233 0.926 0.433
153 Gemini 2.0 Flash (Feb '25) Google 33.6 23.4 21.7 $0.175 0.779 0.623 0.053 0.334 0.333 0.93 0.33
154 Mistral Medium 3 Mistral 33.6 25.6 30.3 $0.8 0.76 0.578 0.043 0.4 0.331 0.907 0.44
155 Qwen3 Coder 30B A3B Instruct Alibaba 33.4 27.4 29 $0.9 0.706 0.516 0.04 0.403 0.278 0.893 0.297
156 Magistral Medium 1 Mistral 33.2 30.3 40.3 $2.75 0.753 0.679 0.095 0.527 0.297 0.917 0.7
157 ERNIE 4.5 300B A47B Baidu 32.9 27.9 41.3 $0.485 0.776 0.811 0.035 0.467 0.315 0.931 0.493
158 DeepSeek R1 Distill Qwen 32B DeepSeek 32.7 - 63 $0.285 0.739 0.615 0.055 0.27 0.376 0.941 0.687
159 Hermes 4 - Llama-3.1 405B (Non-reasoning) Nous Research 32.6 32.8 15.3 $1.5 0.729 0.536 0.042 0.546 0.346 - -
160 DeepSeek V3 (Dec '24) DeepSeek 32.5 25.9 26 $0.625 0.752 0.557 0.036 0.359 0.354 0.887 0.253
161 Nova Premier Amazon 32.3 22 17.3 $5 0.733 0.569 0.047 0.317 0.279 0.839 0.17
162 Qwen3 VL 8B (Reasoning) Alibaba 32.1 20.3 30.7 $0.66 0.749 0.579 0.033 0.353 0.219 - -
163 Magistral Small 1 Mistral 31.9 26.6 41.3 $0.75 0.746 0.641 0.072 0.514 0.241 0.963 0.713
164 OLMo 3 7B Think Allen Institute for AI 31.9 27.9 70.7 $0.14 0.655 0.516 0.057 0.617 0.212 - -
165 Gemini 2.0 Flash (experimental) Google 31.8 - - $0 0.782 0.636 0.047 0.21 0.34 0.911 0.3
166 DeepSeek R1 0528 Qwen3 8B DeepSeek 31 24.4 63.7 $0.068 0.739 0.612 0.056 0.513 0.204 0.932 0.65
167 Qwen2.5 Max Alibaba 30.7 - - $2.8 0.762 0.587 0.045 0.359 0.337 0.835 0.233
168 Ministral 14B (Dec '25) Mistral 30.5 21 30 $0.2 0.693 0.572 0.046 0.351 0.236 - -
169 Qwen3 4B 2507 Instruct Alibaba 30.4 20 52.3 $0 0.672 0.517 0.047 0.377 0.181 - -
170 EXAONE 4.0 32B (Non-reasoning) LG AI Research 30.3 24.6 39.3 $0.7 0.768 0.628 0.049 0.472 0.252 0.939 0.47
171 Qwen3 Omni 30B A3B Instruct Alibaba 30.2 20.8 52.3 $0.43 0.725 0.62 0.051 0.422 0.186 - -
172 Solar Pro 2 (Non-reasoning) Upstage 30.2 23.8 30 $0.5 0.75 0.561 0.038 0.424 0.248 0.889 0.407
173 Gemini 2.5 Flash-Lite (Non-reasoning) Google 30.1 19.9 35.3 $0.175 0.724 0.474 0.037 0.4 0.177 0.926 0.5
174 Solar Pro 2 (Preview) (Non-reasoning) Upstage 30 - - $0 0.725 0.544 0.038 0.385 0.272 0.871 0.297
175 Gemini 1.5 Pro (Sep '24) Google 30 23.6 - $0 0.75 0.589 0.049 0.316 0.295 0.876 0.23
176 Claude 3.5 Sonnet (Oct '24) Anthropic 29.9 30.2 - $6 0.772 0.599 0.039 0.381 0.366 0.771 0.157
177 DeepSeek R1 Distill Llama 70B DeepSeek 29.9 19.7 53.7 $0.875 0.795 0.402 0.061 0.266 0.312 0.935 0.67
178 Qwen3 235B A22B (Non-reasoning) Alibaba 29.9 23.3 23.7 $1.225 0.762 0.613 0.047 0.343 0.299 0.902 0.327
179 DeepSeek R1 Distill Qwen 14B DeepSeek 29.7 - 55.7 $0.15 0.74 0.484 0.044 0.376 0.239 0.949 0.667
180 Qwen3 14B (Non-reasoning) Alibaba 29.2 19.8 58 $0.613 0.675 0.47 0.042 0.28 0.265 0.871 0.28
181 Mistral Small 3.2 Mistral 29.1 20.1 27 $0.15 0.681 0.505 0.043 0.275 0.264 0.883 0.323
182 GPT-5 nano (minimal) OpenAI 29.1 27.5 27.3 $0.138 0.556 0.428 0.041 0.47 0.291 - -
183 GPT-4o (Aug '24) OpenAI 29 - - $4.375 - 0.521 0.029 0.317 - 0.795 0.117
184 Qwen2.5 Instruct 72B Alibaba 29 19.5 14 $0 0.72 0.491 0.042 0.276 0.267 0.858 0.16
185 Sonar Perplexity 28.8 - - $1 0.689 0.471 0.073 0.295 0.229 0.817 0.487
186 Qwen3 8B (Reasoning) Alibaba 28.3 21.8 19 $0.66 0.743 0.589 0.042 0.406 0.226 0.904 0.747
187 Sonar Pro Perplexity 28.2 - - $6 0.755 0.578 0.079 0.275 0.226 0.745 0.29
188 Ministral 8B (Dec '25) Mistral 28.2 18.4 31.7 $0.15 0.642 0.471 0.043 0.303 0.208 - -
189 Llama 3.1 Instruct 405B Meta 28.1 22.2 3 $4.188 0.732 0.515 0.042 0.305 0.299 0.703 0.213
190 Llama 4 Scout Meta 28.1 16.1 14 $0.241 0.752 0.587 0.043 0.299 0.17 0.844 0.283
191 QwQ 32B-Preview Alibaba 28 - - $0.135 0.648 0.557 0.048 0.337 0.038 0.91 0.453
192 Devstral Medium Mistral 27.9 23.9 4.7 $0.8 0.708 0.492 0.038 0.337 0.294 0.707 0.067
193 Llama 3.3 Instruct 70B Meta 27.9 19.2 7.7 $0.62 0.713 0.498 0.04 0.288 0.26 0.773 0.3
194 Ling-mini-2.0 InclusionAI 27.8 19 49.3 $0.122 0.671 0.562 0.05 0.429 0.135 - -
195 GPT-4.1 nano OpenAI 27.3 20.7 24 $0.175 0.657 0.512 0.039 0.326 0.259 0.848 0.237
196 Qwen3 VL 4B (Reasoning) Alibaba 27.3 16.8 25.7 $0 0.7 0.494 0.044 0.32 0.171 - -
197 Devstral Small (Jul '25) Mistral 27.2 18.5 29.3 $0.15 0.622 0.414 0.037 0.254 0.243 0.635 0.003
198 Qwen3 VL 8B Instruct Alibaba 27.1 17.6 27.3 $0.31 0.686 0.427 0.029 0.332 0.174 - -
199 GPT-4o (Nov '24) OpenAI 27 24 6 $4.375 0.748 0.543 0.033 0.309 0.333 0.759 0.15
200 Command A Cohere 26.9 19.2 13 $4.375 0.712 0.527 0.046 0.287 0.281 0.819 0.097
201 Mistral Large 2 (Nov '24) Mistral 26.8 21.4 14 $3 0.697 0.486 0.04 0.293 0.292 0.736 0.11
202 Gemini 2.0 Flash-Lite (Feb '25) Google 26.8 - - $0.131 0.724 0.535 0.036 0.185 0.25 0.873 0.277
203 Exaone 4.0 1.2B (Reasoning) LG AI Research 26.7 20.3 50.3 $0 0.588 0.515 0.058 0.516 0.093 - -
204 Llama Nemotron Super 49B v1.5 (Non-reasoning) NVIDIA 26.6 18.8 8 $0.175 0.692 0.481 0.043 0.29 0.238 0.77 0.137
205 Qwen3 30B A3B (Non-reasoning) Alibaba 26.5 21.6 21.7 $0.35 0.71 0.515 0.046 0.322 0.264 0.863 0.26
206 Qwen3 32B (Non-reasoning) Alibaba 26.4 - 19.7 $1.225 0.727 0.535 0.043 0.288 0.28 0.869 0.303
207 GPT-4o (May '24) OpenAI 26.3 24.2 - $7.5 0.74 0.526 0.028 0.334 0.309 0.791 0.11
208 Gemini 2.0 Flash-Lite (Preview) Google 26.3 - - $0.131 - 0.542 0.044 0.179 0.247 0.873 0.303
209 Kimi Linear 48B A3B Instruct Moonshot AI 26.1 22.8 36.3 $0 0.585 0.412 0.027 0.378 0.199 - -
210 Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) NVIDIA 26.1 - 50 $0 0.556 0.408 0.051 0.493 0.101 0.947 0.707
211 GLM-4.5V (Non-reasoning) Z AI 26 20.1 15.3 $0.9 0.751 0.573 0.036 0.352 0.188 - -
212 Llama 3.3 Nemotron Super 49B v1 (Non-reasoning) NVIDIA 25.9 17 7.7 $0 0.698 0.517 0.035 0.28 0.229 0.775 0.193
213 Reka Flash 3 Reka AI 25.9 23.4 33.7 $0.35 0.669 0.529 0.051 0.435 0.267 0.893 0.51
214 Qwen3 4B (Reasoning) Alibaba 25.6 - 22.3 $0.398 0.696 0.522 0.051 0.465 0.035 0.933 0.657
215 Llama 3.1 Tulu3 405B Allen Institute for AI 25.4 - - $0 0.716 0.516 0.035 0.291 0.302 0.778 0.133
216 Claude 3.5 Sonnet (June '24) Anthropic 25.4 26 - $6 0.751 0.56 0.037 - 0.316 0.695 0.097
217 GPT-4o (ChatGPT) OpenAI 25.3 - - $7.5 0.773 0.511 0.037 - 0.334 0.797 0.103
218 Qwen3 VL 4B Instruct Alibaba 25.2 14.2 37 $0 0.634 0.371 0.037 0.29 0.137 - -
219 Nova Pro Amazon 25 16.6 7 $1.4 0.691 0.499 0.034 0.233 0.208 0.786 0.107
220 Pixtral Large Mistral 25 - 2.3 $3 0.701 0.505 0.036 0.261 0.292 0.714 0.07
221 Mistral Small 3.1 Mistral 24.9 18.3 3.7 $0.15 0.659 0.454 0.048 0.212 0.265 0.707 0.093
222 Grok 2 (Dec '24) xAI 24.7 - - $4 0.709 0.51 0.038 0.267 0.285 0.778 0.133
223 Gemini 1.5 Flash (Sep '24) Google 24.4 - - $0 0.68 0.463 0.035 0.273 0.267 0.827 0.18
224 GPT-4 Turbo OpenAI 24.2 21.5 - $15 0.694 - 0.033 0.291 0.319 0.737 0.15
225 Hermes 4 - Llama-3.1 70B (Non-reasoning) Nous Research 23.8 18.2 11.3 $0.198 0.664 0.491 0.036 0.269 0.277 - -
226 Llama 3.1 Nemotron Instruct 70B NVIDIA 23.6 14.8 11 $0.6 0.69 0.465 0.046 0.169 0.233 0.733 0.247
227 Grok Beta xAI 23 - - $0 0.703 0.471 0.047 0.241 0.295 0.737 0.103
228 Qwen3 8B (Non-reasoning) Alibaba 22.9 13 24.3 $0.31 0.643 0.452 0.028 0.202 0.168 0.828 0.243
229 Qwen2.5 Instruct 32B Alibaba 22.9 - - $0 0.697 0.466 0.038 0.248 0.229 0.805 0.11
230 Phi-4 Microsoft Azure 22.7 17.6 18 $0.219 0.714 0.575 0.041 0.231 0.26 0.81 0.143
231 Granite 4.0 H Small IBM 22.7 16.1 13.7 $0.107 0.624 0.416 0.037 0.251 0.209 - -
232 Llama 3.1 Instruct 70B Meta 22.6 17.6 4 $0.56 0.676 0.409 0.046 0.232 0.267 0.649 0.173
233 Qwen3 1.7B (Reasoning) Alibaba 22.4 11.7 38.7 $0.398 0.57 0.356 0.048 0.308 0.043 0.894 0.51
234 Mistral Large 2 (Jul '24) Mistral 22.3 - 0 $3 0.683 0.472 0.032 0.267 0.271 0.714 0.093
235 OLMo 3 7B Instruct Allen Institute for AI 22.2 12.3 41.3 $0.125 0.522 0.4 0.058 0.266 0.103 - -
236 Gemma 3 27B Instruct Google 22.1 12.8 20.7 $0 0.669 0.428 0.047 0.137 0.212 0.883 0.253
237 Ministral 3B (Dec '25) Mistral 21.8 13 22 $0.1 0.524 0.358 0.053 0.247 0.144 - -
238 Qwen2.5 Coder Instruct 32B Alibaba 21.8 - - $0.141 0.635 0.417 0.038 0.295 0.271 0.767 0.12
239 GPT-4 OpenAI 21.5 13.1 - $37.5 - - - - - - -
240 Nova Lite Amazon 21.5 10.4 7 $0.105 0.59 0.433 0.046 0.167 0.139 0.765 0.107
241 GPT-4o mini OpenAI 21.2 - 14.7 $0.263 0.648 0.426 0.04 0.234 0.229 0.789 0.117
242 Mistral Small 3 Mistral 21.2 - 4.3 $0.15 0.652 0.462 0.041 0.252 0.236 0.715 0.08
243 Jamba Reasoning 3B AI21 Labs 20.9 9.2 10.7 $0 0.577 0.333 0.046 0.21 0.059 - -
244 Jamba 1.7 Large AI21 Labs 20.8 13 2.3 $3.5 0.577 0.39 0.038 0.181 0.188 0.6 0.057
245 Qwen3 4B (Non-reasoning) Alibaba 20.7 - - $0.188 0.586 0.398 0.037 0.233 0.167 0.843 0.213
246 DeepSeek-V2.5 (Dec '24) DeepSeek 20.7 - - $0 - - - - - 0.763 -
247 Claude 3 Opus Anthropic 20.6 19.5 - $30 0.696 0.489 0.031 0.279 0.233 0.641 0.033
248 Exaone 4.0 1.2B (Non-reasoning) LG AI Research 20.5 12.2 24 $0 0.5 0.424 0.058 0.293 0.074 - -
249 Gemma 3 12B Instruct Google 20.4 10.6 18.3 $0 0.595 0.349 0.048 0.137 0.174 0.853 0.22
250 DeepSeek-V2.5 DeepSeek 20.2 - - $0 - - - - - - -
251 Gemini 2.0 Flash Thinking Experimental (Dec '24) Google 20.2 - - $0 - - - - - 0.48 -
252 Claude 3.5 Haiku Anthropic 20.2 - - $1.6 0.634 0.408 0.035 0.314 0.274 0.721 0.033
253 Devstral Small (May '25) Mistral 19.6 - - $0.15 0.632 0.434 0.04 0.258 0.245 0.684 0.067
254 Mistral Saba Mistral 19.6 - - $0 0.611 0.424 0.041 - 0.241 0.677 0.13
255 DeepSeek R1 Distill Llama 8B DeepSeek 19.5 - 41.3 $0 0.543 0.302 0.042 0.233 0.119 0.853 0.333
256 Gemini 1.5 Pro (May '24) Google 19.2 19.8 - $0 0.657 0.371 0.039 0.244 0.274 0.673 0.08
257 R1 1776 Perplexity 19.1 - - $0 - - - - - 0.954 -
258 Qwen2.5 Turbo Alibaba 19.1 - - $0.087 0.633 0.41 0.042 0.163 0.153 0.805 0.12
259 Reka Flash (Sep '24) Reka AI 19.1 - - $0.35 - - - - - 0.529 -
260 Solar Mini Upstage 18.9 - - $0.15 - - - - - 0.331 -
261 Llama 3.2 Instruct 90B (Vision) Meta 18.9 - - $0.72 0.671 0.432 0.049 0.214 0.24 0.629 0.05
262 Grok-1 xAI 18.2 - - $0 - - - - - - -
263 Qwen2 Instruct 72B Alibaba 18.1 - - $0 0.622 0.371 0.037 0.159 0.229 0.701 0.147
264 Nova Micro Amazon 17.7 8.3 6 $0.061 0.531 0.358 0.047 0.14 0.094 0.703 0.08
265 LFM2 8B A1B Liquid AI 17.4 7.3 25.3 $0 0.505 0.344 0.049 0.151 0.068 - -
266 Llama 3.1 Instruct 8B Meta 16.9 8.5 4.3 $0.1 0.476 0.259 0.051 0.116 0.132 0.519 0.077
267 Gemini 1.5 Flash-8B Google 16.3 - - $0 0.569 0.359 0.045 0.217 0.229 0.689 0.033
268 Granite 4.0 Micro IBM 16.2 10.4 6 $0 0.447 0.336 0.051 0.18 0.119 - -
269 Phi-4 Mini Instruct Microsoft Azure 15.7 7.8 6.7 $0 0.465 0.331 0.042 0.126 0.108 0.696 0.03
270 Gemma 3n E4B Instruct Google 15.5 8.3 14.3 $0.025 0.488 0.296 0.044 0.146 0.081 0.771 0.137
271 Llama 3.2 Instruct 11B (Vision) Meta 15.5 7.7 1.7 $0.16 0.464 0.221 0.052 0.11 0.112 0.516 0.093
272 DeepHermes 3 - Mistral 24B Preview (Non-reasoning) Nous Research 15.5 - - $0 0.58 0.382 0.039 0.195 0.228 0.595 0.047
273 Granite 3.3 8B (Non-reasoning) IBM 15.2 7.6 6.7 $0.085 0.468 0.338 0.042 0.127 0.101 0.665 0.047
274 Jamba 1.5 Large AI21 Labs 14.8 - - $3.5 0.572 0.427 0.04 0.143 0.163 0.606 0.047
275 Jamba 1.7 Mini AI21 Labs 14.8 5.1 0.3 $0.25 0.388 0.322 0.045 0.061 0.093 0.258 0.013
276 Hermes 3 - Llama-3.1 70B Nous Research 14.7 - - $0.3 0.571 0.401 0.041 0.188 0.231 0.538 0.023
277 Gemma 3 4B Instruct Google 14.7 6.4 12.7 $0 0.417 0.291 0.052 0.112 0.073 0.766 0.063
278 DeepSeek-Coder-V2 DeepSeek 14.5 - - $0 - - - - - 0.743 -
279 Phi-3 Medium Instruct 14B Microsoft Azure 14.4 8.9 1.3 $0.297 0.543 0.326 0.045 0.15 0.118 0.463 0.013
280 OLMo 2 32B Allen Institute for AI 14.4 4.9 3.3 $0 0.511 0.328 0.037 0.068 0.08 - -
281 Qwen3 1.7B (Non-reasoning) Alibaba 14.4 6.5 7.3 $0.188 0.411 0.283 0.052 0.126 0.069 0.717 0.097
282 Jamba 1.6 Large AI21 Labs 14.3 - - $3.5 0.565 0.387 0.04 0.172 0.184 0.58 0.047
283 Qwen3 0.6B (Reasoning) Alibaba 14.2 5 18 $0.398 0.347 0.239 0.057 0.121 0.028 0.75 0.1
284 Gemini 1.5 Flash (May '24) Google 14 - - $0 0.574 0.324 0.042 0.196 0.181 0.554 0.093
285 Granite 4.0 H 1B IBM 13.7 6.6 6.3 $0 0.277 0.263 0.05 0.115 0.082 - -
286 Granite 4.0 1B IBM 13.3 4.5 6.3 $0 0.325 0.281 0.051 0.047 0.087 - -
287 Claude 3 Sonnet Anthropic 13.3 - - $6 0.579 0.4 0.038 0.175 0.229 0.414 0.047
288 Llama 3 Instruct 70B Meta 13 - - $0.88 0.574 0.379 0.044 0.198 0.189 0.483 0
289 Mistral Small (Sep '24) Mistral 13 - - $0.3 0.529 0.381 0.043 0.141 0.156 0.563 0.063
290 Gemini 1.0 Ultra Google 12.8 17.6 - $0 - - - - - - -
291 Phi-3 Mini Instruct 3.8B Microsoft Azure 12.7 6.9 0.3 $0.228 0.435 0.319 0.044 0.116 0.09 0.457 0.04
292 Gemma 3n E4B Instruct Preview (May '25) Google 12.5 - - $0 0.483 0.278 0.049 0.138 0.086 0.749 0.107
293 Phi-4 Multimodal Instruct Microsoft Azure 12.4 - - $0 0.485 0.315 0.044 0.131 0.11 0.693 0.093
294 Qwen2.5 Coder Instruct 7B Alibaba 12.2 - - $0 0.473 0.339 0.048 0.126 0.148 0.66 0.053
295 Mistral Large (Feb '24) Mistral 11.9 - - $6 0.515 0.351 0.034 0.178 0.208 0.527 0
296 LFM2 2.6B Liquid AI 11.8 3.8 8.3 $0 0.298 0.306 0.052 0.081 0.025 - -
297 Mixtral 8x22B Instruct Mistral 11.7 - - $0 0.537 0.332 0.041 0.148 0.188 0.545 0
298 Gemma 3n E2B Instruct Google 11.3 5.2 10.3 $0 0.378 0.229 0.04 0.095 0.052 0.691 0.09
299 Llama 2 Chat 7B Meta 11.3 - - $0.1 0.164 0.227 0.058 0.002 0 0.059 0
300 Llama 3.2 Instruct 3B Meta 11.2 - 3.3 $0.06 0.347 0.255 0.052 0.083 0.052 0.489 0.067
301 Qwen3 0.6B (Non-reasoning) Alibaba 11 3.8 10.3 $0.188 0.231 0.231 0.052 0.073 0.041 0.521 0.017
302 Qwen1.5 Chat 110B Alibaba 10.5 - - $0 - 0.289 - - - - -
303 LFM2 1.2B Liquid AI 9.7 1.5 3.3 $0 0.257 0.228 0.057 0.02 0.025 - -
304 Claude 2.1 Anthropic 9.7 14 - $0 0.495 0.319 0.042 0.195 0.184 0.374 0.033
305 Claude 3 Haiku Anthropic 9.6 - - $0.5 - - - 0.154 0.186 0.394 0.01
306 OLMo 2 7B Allen Institute for AI 9.5 2.6 0.7 $0 0.282 0.288 0.055 0.041 0.037 - -
307 Molmo 7B-D Allen Institute for AI 9.3 2.5 0 $0 0.371 0.24 0.051 0.039 0.036 - -
308 Llama 3.2 Instruct 1B Meta 8.9 1.2 0 $0.053 0.2 0.196 0.053 0.019 0.017 0.14 0
309 DeepSeek-V2-Chat DeepSeek 8.6 - - $0 - - - - - - -
310 DeepSeek R1 Distill Qwen 1.5B DeepSeek 8.6 - 22 $0 0.269 0.098 0.033 0.07 0.066 0.687 0.177
311 Claude 2.0 Anthropic 8.6 12.9 - $0 0.486 0.344 - 0.171 0.194 - 0
312 Mistral Small (Feb '24) Mistral 8.5 - - $1.5 0.419 0.302 0.044 0.111 0.134 0.562 0.007
313 Mistral Medium Mistral 8.4 - - $4.088 0.491 0.349 0.034 0.099 0.118 0.405 0.037
314 GPT-3.5 Turbo OpenAI 8.3 10.7 - $0.75 0.462 0.297 - - - 0.441 -
315 Granite 4.0 H 350M IBM 8.2 1.2 1.3 $0 0.127 0.257 0.064 0.019 0.017 - -
316 Granite 4.0 350M IBM 7.7 1.1 0 $0 0.124 0.261 0.057 0.024 0.009 - -
317 Arctic Instruct Snowflake 7.6 - - $0 - - - - - - -
318 Qwen Chat 72B Alibaba 7.6 - - $0 - - - - - - -
319 LFM 40B Liquid AI 7.3 - - $0 0.425 0.327 0.049 0.096 0.071 0.48 0.023
320 Llama 3 Instruct 8B Meta 7 - - $0.07 0.405 0.296 0.051 0.096 0.119 0.499 0
321 Gemma 3 1B Instruct Google 6.8 0.8 3.3 $0 0.135 0.237 0.052 0.017 0.007 0.484 0
322 PALM-2 Google 6.6 4.6 - $0 - - - - - - -
323 Gemini 1.0 Pro Google 6.2 - - $0 0.431 0.277 0.046 0.116 0.117 0.403 0.007
324 DeepSeek Coder V2 Lite Instruct DeepSeek 6.1 - - $0 0.429 0.319 0.053 0.158 0.139 - -
325 Gemma 3 270M Google 5.6 0.1 2.3 $0 0.055 0.224 0.042 0.003 0 - -
326 DeepSeek LLM 67B Chat (V1) DeepSeek 5.6 - - $0 - - - - - - -
327 Llama 2 Chat 70B Meta 5.6 - - $0 0.406 0.327 0.05 0.098 - 0.323 0
328 Command-R+ (Apr '24) Cohere 5.5 - - $6 0.432 0.323 0.045 0.122 0.118 0.279 0.007
329 Llama 2 Chat 13B Meta 5.5 - - $0 0.406 0.321 0.047 0.098 0.118 0.329 0.017
330 OpenChat 3.5 (1210) OpenChat 5.4 - - $0 0.31 0.23 0.048 0.115 - 0.307 0
331 DBRX Instruct Databricks 5.3 - - $0 0.397 0.331 0.066 0.093 0.118 0.279 0.03
332 Jamba 1.5 Mini AI21 Labs 4 - - $0.25 0.371 0.302 0.051 0.062 0.08 0.357 0.01
333 Jamba 1.6 Mini AI21 Labs 3.3 - - $0.25 0.367 0.3 0.046 0.071 0.101 0.257 0.033
334 Mixtral 8x7B Instruct Mistral 2.6 - - $0.54 0.387 0.292 0.045 0.066 0.028 0.299 0
335 DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning) Nous Research 1.8 - - $0 0.365 0.27 0.043 0.085 0.091 0.218 0
336 Llama 65B Meta 1 - - $0 - - - - - - -
337 Claude Instant Anthropic 1 7.8 - $0 0.434 0.33 0.038 0.109 - 0.264 0
338 Mistral 7B Instruct Mistral 1 - - $0.25 0.245 0.177 0.043 0.046 0.024 0.121 0
339 Command-R (Mar '24) Cohere 1 - - $0.75 0.338 0.284 0.048 0.048 0.062 0.164 0.007
340 Qwen Chat 14B Alibaba 1 - - $0 - - - - - - -
341 GPT-4o Realtime (Dec '24) OpenAI - - - $0 - - - - - - -
342 GPT-3.5 Turbo (0613) OpenAI - - - $0 - - - - - - -
343 Cogito v2.1 (Reasoning) Deep Cogito - 41.8 72.7 $1.25 0.849 0.768 0.11 0.688 0.41 - -
344 GPT-4o mini Realtime (Dec '24) OpenAI - - - $0 - - - - - - -
345 DeepSeek-OCR DeepSeek - - - $0.048 - - - - - - -

* Prices are blended prices per million tokens (3:1 input/output ratio).

About the Artificial Analysis AI Model Rankings

Artificial Analysis is an independent AI benchmarking and analysis company whose benchmarks and analysis support developers, researchers, enterprises, and other AI users. It tests both proprietary and open-weights models and focuses on the end-to-end user experience, measuring real-world response time, output speed, and cost.

Its quality benchmarks cover language understanding and reasoning, while its performance benchmarks focus on user-perceptible metrics such as time to first token, output speed, and end-to-end response time. To enable consistent, fair comparisons across models, Artificial Analysis distinguishes OpenAI tokens from each model's native tokens and computes blended prices at a 3:1 input/output ratio. Benchmark subjects include models, endpoints, systems, and providers, spanning language models, speech, image generation, and more, with the goal of helping users understand the real-world performance and value of different AI services.

Artificial Analysis Benchmark Metrics

Context Window

The maximum combined number of input and output tokens. The limit on output tokens is usually much lower (the exact limit varies by model).

Output Speed

Tokens received per second while the model is generating tokens (i.e., after the first chunk is received from the API, for models that support streaming).

Latency (Time to First Token)

The time, in seconds, from sending the API request to receiving the first token. For reasoning models that stream reasoning tokens, this is the first reasoning token. For models that do not support streaming, this is the time to receive the completed response.
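The two metrics above (time to first token and output speed) can be sketched for a generic streaming client. This is a minimal illustration, not any particular vendor's API: `fake_stream` below is a hypothetical stand-in for a real streaming response iterator, and real client libraries expose streaming differently.

```python
import time

def measure_stream(stream):
    """Return (ttft_seconds, tokens_per_second) for a streaming response.

    `stream` is any iterator yielding token chunks. TTFT is the delay until
    the first chunk arrives; output speed is measured over the chunks that
    follow it, matching the "after the first chunk" definition above.
    """
    start = time.monotonic()
    first_at = None
    last_at = start
    count = 0
    for _chunk in stream:
        last_at = time.monotonic()
        if first_at is None:
            first_at = last_at  # first (reasoning) token arrives here
        count += 1
    if first_at is None:
        raise ValueError("stream yielded no tokens")
    ttft = first_at - start
    gen_time = last_at - first_at
    speed = (count - 1) / gen_time if gen_time > 0 else 0.0
    return ttft, speed

# Hypothetical stand-in for a streaming API response:
def fake_stream(n=5, delay=0.01):
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, speed = measure_stream(fake_stream())
print(f"TTFT: {ttft:.3f}s, speed: {speed:.0f} tok/s")
```

Note that the first token is excluded from the speed calculation, so TTFT (which includes queueing and prefill) does not distort the generation rate.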

Price

Price per token, expressed in USD per million tokens. The price is a blend of the input and output token prices at a 3:1 input/output ratio.
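The 3:1 blend is a simple weighted average. A minimal sketch (the per-token prices below are illustrative inputs, not quotes taken from the table):

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blend input/output $/1M-token prices at a 3:1 input:output ratio."""
    return (3 * input_per_m + output_per_m) / 4

# Illustrative: $1.25/1M input and $10/1M output blend to $3.4375/1M,
# which a table like the one above would show rounded to $3.438.
print(blended_price(1.25, 10.0))  # -> 3.4375
```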

Overview of Common LLM Benchmarks

MMLU Pro

Massive Multitask Language Understanding Professional. An enhanced version of MMLU designed to evaluate the reasoning abilities of large language models. It addresses the limitations of the original MMLU by filtering out easy questions, expanding the number of answer options from 4 to 10, and emphasizing complex multi-step reasoning. It covers roughly 12,000 questions across 14 domains.

GPQA

Graduate-Level Google-Proof Q&A Benchmark. A challenging graduate-level question-answering benchmark that evaluates an AI system's ability to provide accurate information in complex scientific domains such as physics, chemistry, and biology. The questions are designed to be "Google-proof": they require deep understanding and reasoning rather than simple factual recall.

HLE

Humanity's Last Exam. A comprehensive evaluation framework that tests AI systems on human-level reasoning, problem solving, and knowledge integration. It contains 2,500 to 3,000 expert-level questions spanning more than 100 subjects, emphasizing multi-step reasoning and the ability to handle novel scenarios.

LiveCodeBench

A contamination-free benchmark for evaluating the coding abilities of LLMs. It continuously collects new problems from contests on platforms such as LeetCode, AtCoder, and Codeforces to prevent training-data contamination. Beyond code generation, it also evaluates self-repair, code execution, and test-output prediction.

SciCode

A benchmark that evaluates a language model's ability to generate code for real scientific research problems. It covers 16 subfields across 6 domains, including physics, mathematics, materials science, biology, and chemistry. The problems are drawn from real scientific workflows and typically require knowledge recall, reasoning, and code synthesis.

Math 500

A benchmark for evaluating a language model's mathematical reasoning and problem-solving ability. It contains 500 challenging problems drawn from high-level high-school math competitions such as AMC and AIME, covering algebra, combinatorics, geometry, number theory, and precalculus.

AIME

American Invitational Mathematics Examination. A benchmark based on problems from the American Invitational Mathematics Examination, regarded as one of the most challenging AI tests of advanced mathematical reasoning. It contains 30 "olympiad-level" integer-answer math problems that test multi-step reasoning, abstraction, and problem solving.