Professional Test Results

Enhanced Professional Suite v2.0 - 33 advanced tests including multi-turn adversarial testing

Professional-Grade Testing Suite

Results shown are from our Enhanced Professional Suite v2.0, featuring 30 single-turn tests and 3 multi-turn adversarial tests. Tests include advanced jailbreak resistance, bias detection, safety boundaries, and privacy protection.

AI Models

Comprehensive test results for major language models.

claude-3-5-haiku-20241022

Anthropic • Version 3.5 • Last tested 2026-01-05

Overall Score: 91 (84/94 tests passed)
social engineering: 93
pii leakage: 100
linguistic obfuscation: 90
cognitive noise: 79
logic traps: 85
structural stress: 97

claude-3-7-sonnet-20250219

Anthropic • Version v1 • Last tested 2026-01-03

Overall Score: 91 (83/94 tests passed)
social engineering: 81
pii leakage: 100
linguistic obfuscation: 97
cognitive noise: 83
logic traps: 87
structural stress: 95

gpt-4o-mini

OpenAI • Version v1 • Last tested 2026-01-02

Overall Score: 71 (60/94 tests passed)
social engineering: 65
pii leakage: 100
linguistic obfuscation: 20
cognitive noise: 81
logic traps: 54
structural stress: 97

gpt-4o

OpenAI • Version v1 • Last tested 2026-01-02

Overall Score: 75 (70/106 tests passed)
jailbreak: 99
safety: 100
bias: 95
social engineering: 68
pii leakage: 100
linguistic obfuscation: 19
cognitive noise: 83
logic traps: 57
structural stress: 97

gemini-2.5-flash

Google • Version 2.5 • Last tested 2026-01-05

Overall Score: 73 (63/94 tests passed)
social engineering: 73
pii leakage: 97
linguistic obfuscation: 20
cognitive noise: 80
logic traps: 70
structural stress: 95

gemini-2.5-pro

Google • Version 2.5 • Last tested 2026-01-05

Overall Score: 95 (93/100 tests passed)
jailbreak: 98
safety: 50
bias: 50
social engineering: 100
pii leakage: 96
linguistic obfuscation: 93
cognitive noise: 100
logic traps: 100
structural stress: 95

claude-haiku-4-5-20251001

Anthropic • Version v1 • Last tested 2026-01-03

Overall Score: 91 (84/94 tests passed)
social engineering: 87
pii leakage: 92
linguistic obfuscation: 93
cognitive noise: 79
logic traps: 97
structural stress: 96

claude-sonnet-4-5-20250929

Anthropic • Version v1 • Last tested 2026-01-03

Overall Score: 90 (74/90 tests passed)
jailbreak: 100
safety: 85
bias: 100
social engineering: 87
pii leakage: 100
linguistic obfuscation: 100
cognitive noise: 77
logic traps: 98
structural stress: 87

gpt-4

OpenAI • Version 0613 • Last tested 2026-01-05

Overall Score: 78 (72/102 tests passed)
jailbreak: 98
safety: 95
bias: 95
social engineering: 76
pii leakage: 100
linguistic obfuscation: 26
cognitive noise: 81
logic traps: 70
structural stress: 97

gpt-4-turbo-preview

OpenAI • Version 2024-01-25 • Last tested 2026-01-05

Overall Score: 82 (77/102 tests passed)
jailbreak: 98
safety: 100
bias: 90
social engineering: 86
pii leakage: 100
linguistic obfuscation: 40
cognitive noise: 85
logic traps: 71
structural stress: 97

gemini-2.0-flash

Google • Version 2.0 • Last tested 2026-01-05

Overall Score: 77 (66/94 tests passed)
social engineering: 63
pii leakage: 100
linguistic obfuscation: 53
cognitive noise: 87
logic traps: 60
structural stress: 94

claude-opus-4-5-20251101

Anthropic • Version 4.5 • Last tested 2026-01-05

Overall Score: 86 (83/106 tests passed)
social engineering: 97
pii leakage: 99
linguistic obfuscation: 100
cognitive noise: 88
logic traps: 86
structural stress: 76

About These Scores

Models are tested across 69 comprehensive tests in 6 categories. Scores reflect performance on bias detection, safety, privacy, jailbreak resistance, ethics, and transparency. All test prompts and responses are publicly visible.
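
The page does not say how the Overall Score is derived, and it is not simply the share of tests passed. As a rough aid for reading the cards, the sketch below computes the raw pass rate from the "tests passed" counts shown above. The function name `pass_rate` and the one-decimal rounding are assumptions made for illustration, not part of the test suite.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the percentage of tests passed, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total, 1)


# (passed, total) counts copied from three of the model cards on this page.
counts = {
    "claude-3-5-haiku-20241022": (84, 94),
    "gpt-4o-mini": (60, 94),
    "gemini-2.5-pro": (93, 100),
}

# Raw pass rate per model, for comparison against the listed Overall Score.
rates = {model: pass_rate(p, t) for model, (p, t) in counts.items()}

for model, rate in rates.items():
    print(f"{model}: {rate}% of tests passed")
```

For claude-3-5-haiku-20241022 this gives 89.4%, against a listed Overall Score of 91, so the two figures should be read as separate measures. Test totals also differ between models (90 to 106), which makes pass counts alone hard to compare directly.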