Legacy Test Results

Historical test results from earlier test suites (archived)

Archived Test Results

These results are from earlier test suites that have been superseded by our Enhanced Professional Suite v2.0. They are kept for historical reference but may not reflect current model capabilities. For the most accurate and comprehensive results, see the Professional Test Results.

All Test Results

Complete historical test data including all test suites

llama-3.1-8b

Meta

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Meta

llama-3.3-70b

Meta

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Meta

command-light

Cohere

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Cohere

command

Cohere

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Cohere

command-r

Cohere

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Cohere

command-r-plus

Cohere

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Cohere

open-mixtral-8x22b

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

open-mixtral-8x7b

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

open-mistral-7b

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

mistral-small-latest

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

mistral-medium-latest

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

mistral-large-latest

Mistral AI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Mistral AI

gemini-1.0-pro

Google

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Google

gemini-1.5-flash

Google

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Google

gemini-1.5-pro

Google

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Google

claude-3-haiku-20240307

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-3-sonnet-20240229

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-3-opus-20240229

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-3-5-sonnet-20241022

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-haiku-4-5-20251101

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

claude-sonnet-4-5-20251101

Anthropic

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

Anthropic

o1-mini

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

o1-preview

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

gpt-3.5-turbo-16k

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

gpt-4-32k

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

claude-3-5-haiku-20241022

Anthropic

91

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Anthropic

Category Scores

PII Leakage: 100/
Logic Traps: 85/
Structural Stress: 97/
Linguistic Obfuscation: 90/
Cognitive Noise: 79/
Social Engineering: 93/

claude-3-7-sonnet-20250219

Anthropic

91

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Anthropic

Category Scores

Linguistic Obfuscation: 97/
Cognitive Noise: 83/
Social Engineering: 81/
PII Leakage: 100/
Logic Traps: 87/
Structural Stress: 95/

gpt-4o-mini

OpenAI

71

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

OpenAI

Category Scores

Structural Stress: 97/
Linguistic Obfuscation: 20/
Cognitive Noise: 81/
Social Engineering: 65/
PII Leakage: 100/
Logic Traps: 54/

gpt-4o

OpenAI

75

Overall Score

Tests Run

111

Total Available

Last Tested

Provider

OpenAI

Category Scores

Safety: 100/
Jailbreak: 84/
Bias: 95/
Privacy: 80/
Linguistic Obfuscation: 19/
Cognitive Noise: 83/
Social Engineering: 68/
PII Leakage: 100/
Structural Stress: 97/
Logic Traps: 57/

gemini-2.5-flash

Google

73

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Google

Category Scores

Linguistic Obfuscation: 20/
Cognitive Noise: 80/
Social Engineering: 73/
PII Leakage: 97/
Logic Traps: 70/
Structural Stress: 95/

gemini-2.5-pro

Google

95

Overall Score

Tests Run

100

Total Available

Last Tested

Provider

Google

Category Scores

Safety: 50/
Jailbreak: 98/
Bias: 50/
Linguistic Obfuscation: 93/
Cognitive Noise: 100/
Social Engineering: 100/
PII Leakage: 96/
Logic Traps: 100/
Structural Stress: 95/

claude-haiku-4-5-20251001

Anthropic

91

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Anthropic

Category Scores

Linguistic Obfuscation: 93/
Cognitive Noise: 79/
Social Engineering: 87/
PII Leakage: 92/
Logic Traps: 97/
Structural Stress: 96/

claude-sonnet-4-5-20250929

Anthropic

90

Overall Score

Tests Run

90

Total Available

Last Tested

Provider

Anthropic

Category Scores

Bias: 100/
Safety: 85/
Jailbreak: 100/
Cognitive Noise: 77/
Social Engineering: 87/
PII Leakage: 100/
Logic Traps: 98/
Structural Stress: 87/
Linguistic Obfuscation: 100/

gpt-4

OpenAI

78

Overall Score

Tests Run

106

Total Available

Last Tested

Provider

OpenAI

Category Scores

Safety: 80/
Jailbreak: 98/
Bias: 95/
Linguistic Obfuscation: 26/
Cognitive Noise: 81/
Social Engineering: 76/
PII Leakage: 100/
Logic Traps: 70/
Structural Stress: 97/

gpt-4-turbo-preview

OpenAI

82

Overall Score

Tests Run

106

Total Available

Last Tested

Provider

OpenAI

Category Scores

Safety: 97/
Jailbreak: 98/
Bias: 91/
Linguistic Obfuscation: 40/
Cognitive Noise: 85/
Social Engineering: 86/
PII Leakage: 100/
Structural Stress: 97/
Logic Traps: 71/

gemini-2.0-flash

Google

77

Overall Score

Tests Run

94

Total Available

Last Tested

Provider

Google

Category Scores

Logic Traps: 60/
Cognitive Noise: 87/
Social Engineering: 63/
PII Leakage: 100/
Linguistic Obfuscation: 53/
Structural Stress: 94/

claude-opus-4-5-20251101

Anthropic

86

Overall Score

Tests Run

106

Total Available

Last Tested

Provider

Anthropic

Category Scores

PII Leakage: 99/
Cognitive Noise: 88/
Social Engineering: 97/
Structural Stress: 76/
Logic Traps: 86/
Linguistic Obfuscation: 100/

gpt-3.5-turbo

OpenAI

0

Overall Score

Tests Run

0

Total Available

Last Tested

Provider

OpenAI

gpt-4-turbo

OpenAI

100

Overall Score

Tests Run

17

Total Available

Last Tested

Provider

OpenAI

Category Scores

Safety: 100/
Bias Detection: 100/