Legacy Test Results

Historical test results from earlier test suites (archived)

Archived Test Results

These results are from earlier test suites that have been superseded by our Enhanced Professional Suite v2.0. They are kept for historical reference but may not reflect current model capabilities. For the most accurate and comprehensive results, see the Professional Test Results.

All Test Results

Complete historical test data including all test suites

llama-3.1-8b

Meta

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Meta

llama-3.3-70b

Meta

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Meta

command-light

Cohere

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Cohere

command

Cohere

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Cohere

command-r

Cohere

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Cohere

command-r-plus

Cohere

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Cohere

open-mixtral-8x22b

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

open-mixtral-8x7b

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

open-mistral-7b

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

mistral-small-latest

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

mistral-medium-latest

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

mistral-large-latest

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

gemini-1.0-pro

Google

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Google

gemini-1.5-flash

Google

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Google

gemini-1.5-pro

Google

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Google

claude-3-haiku-20240307

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-3-sonnet-20240229

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-3-opus-20240229

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-3-5-sonnet-20241022

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-haiku-4-5-20251101

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-sonnet-4-5-20251101

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

o1-mini

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

o1-preview

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

gpt-3.5-turbo-16k

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

gpt-4-32k

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

claude-3-5-haiku-20241022

Anthropic

91

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Anthropic

Category Scores

PII Leakage

100/

Logic Traps

85/

Structural Stress

97/

Linguistic Obfuscation

90/

Cognitive Noise

79/

Social Engineering

93/

claude-3-7-sonnet-20250219

Anthropic

91

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Anthropic

Category Scores

Linguistic Obfuscation

97/

Cognitive Noise

83/

Social Engineering

81/

PII Leakage

100/

Logic Traps

87/

Structural Stress

95/

gpt-4o-mini

OpenAI

71

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

OpenAI

Category Scores

Structural Stress

97/

Linguistic Obfuscation

20/

Cognitive Noise

81/

Social Engineering

65/

PII Leakage

100/

Logic Traps

54/

gpt-4o

OpenAI

75

Overall Score

Tests Run

111

Total Available

Last Tested

Provider

OpenAI

Category Scores

Safety

100/

Jailbreak

84/

Bias

95/

Privacy

80/

Linguistic Obfuscation

19/

Cognitive Noise

83/

Social Engineering

68/

PII Leakage

100/

Structural Stress

97/

Logic Traps

57/

gemini-2.5-flash

Google

73

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Google

Category Scores

Linguistic Obfuscation

20/

Cognitive Noise

80/

Social Engineering

73/

PII Leakage

97/

Logic Traps

70/

Structural Stress

95/

gemini-2.5-pro

Google

95

Overall Score

Tests Run

100

Total Available

Last Tested

Provider

Google

Category Scores

Safety

50/

Jailbreak

98/

Bias

50/

Linguistic Obfuscation

93/

Cognitive Noise

100/

Social Engineering

100/

PII Leakage

96/

Logic Traps

100/

Structural Stress

95/

claude-haiku-4-5-20251001

Anthropic

91

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Anthropic

Category Scores

Linguistic Obfuscation

93/

Cognitive Noise

79/

Social Engineering

87/

PII Leakage

92/

Logic Traps

97/

Structural Stress

96/

claude-sonnet-4-5-20250929

Anthropic

90

Overall Score

Tests Run

90

Total Available

Last Tested

Provider

Anthropic

Category Scores

Bias

100/

Safety

85/

Jailbreak

100/

Cognitive Noise

77/

Social Engineering

87/

PII Leakage

100/

Logic Traps

98/

Structural Stress

87/

Linguistic Obfuscation

100/

gpt-4

OpenAI

78

Overall Score

Tests Run

106

Total Available

Last Tested

Provider

OpenAI

Category Scores

Safety

80/

Jailbreak

98/

Bias

95/

Linguistic Obfuscation

26/

Cognitive Noise

81/

Social Engineering

76/

PII Leakage

100/

Logic Traps

70/

Structural Stress

97/

gpt-4-turbo-preview

OpenAI

82

Overall Score

Tests Run

106

Total Available

Last Tested

Provider

OpenAI

Category Scores

Safety

97/

Jailbreak

98/

Bias

91/

Linguistic Obfuscation

40/

Cognitive Noise

85/

Social Engineering

86/

PII Leakage

100/

Structural Stress

97/

Logic Traps

71/

gemini-2.0-flash

Google

77

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Google

Category Scores

Logic Traps

60/

Cognitive Noise

87/

Social Engineering

63/

PII Leakage

100/

Linguistic Obfuscation

53/

Structural Stress

94/

claude-opus-4-5-20251101

Anthropic

86

Overall Score

Tests Run

106

Total Available

Last Tested

Provider

Anthropic

Category Scores

PII Leakage

99/

Cognitive Noise

88/

Social Engineering

97/

Structural Stress

76/

Logic Traps

86/

Linguistic Obfuscation

100/

gpt-3.5-turbo

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

gpt-4-turbo

OpenAI

100

Overall Score

Tests Run

17

Total Available

Last Tested

Provider

OpenAI

Category Scores

Safety

100/

Bias Detection

100/
