Z.ai’s GLM-5.2 is turning a technical benchmark fight into a bigger enterprise question: how much cybersecurity capability should companies trust in an open-weight AI model?
Independent tests from Semgrep and Graphistry suggest the Chinese AI lab’s latest model can perform competitively on specific vulnerability-detection and security-investigation tasks. The evidence supports a narrow claim around assisted security analysis, not broad parity with Anthropic’s restricted Claude Mythos models for exploit generation or autonomous zero-day research.
Benchmark tests show a narrow win
Z.ai, formerly known internationally as Zhipu AI, describes GLM-5.2 as a long-context model released under an MIT license. The company’s GLM-5.2 model card says it supports a 1 million-token context window and reports scores of 81.0 on Terminal-Bench 2.1 and 62.1 on SWE-bench Pro.
Those are vendor-reported coding benchmarks, not direct proof of cybersecurity performance. The stronger evidence comes from outside security tests, though those tests also show why the “rival Mythos” framing needs limits.
In Semgrep’s IDOR benchmark, GLM-5.2 scored 39% F1 on insecure direct object reference detection. That placed it ahead of two Claude Code configurations, which scored 37% and 28%, but behind Semgrep’s own multimodal pipeline at 53% to 61%.
The comparison is useful, but it is not a clean model-to-model victory: Semgrep’s higher-scoring pipeline used a purpose-built workflow rather than a general-purpose model on its own.
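For readers unfamiliar with the metric, F1 is the harmonic mean of precision (how many flagged findings were real) and recall (how many real issues were flagged). The sketch below shows the standard calculation. The counts are hypothetical and chosen only to land on 39%; they are not Semgrep’s data, and the function name `f1_score` is an illustrative choice, not part of any benchmark tooling.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if true_pos == 0:
        # No correct findings means precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, NOT taken from the Semgrep benchmark:
# 39 real IDOR bugs found, 61 false alarms, 61 real bugs missed.
score = f1_score(true_pos=39, false_pos=61, false_neg=61)
print(f"F1 = {score:.0%}")
```

Because F1 penalizes both false alarms and missed bugs, a model can reach the same score by being noisy but thorough or by being precise but incomplete, which is one reason a single F1 figure says little about how a tool would behave in a real review queue.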
Graphistry’s CyBT-CTF evaluation found a similar pattern. GLM-5.2 solved 28 of 59 challenges (about 47%), making it the top open-weight model in that test and tying some proprietary model setups. Graphistry also found that its Louie.ai harness with Claude Opus solved 35 of 59 (about 59%), showing that tooling and orchestration can change the outcome.
The claim also fits a broader APAC pattern. Z.ai is not the only regional player positioning AI systems against Mythos-class cybersecurity tools; another Chinese cybersecurity company recently made a similar claim around AI-assisted vulnerability discovery.
Taken together, the public results support GLM-5.2 as a serious model for assisted vulnerability analysis. They do not prove that it matches Mythos across exploit creation, weaponization, or end-to-end offensive research.
Open-weight deployment creates enterprise risk
The open-weight design is the enterprise issue behind the benchmark debate. Unlike proprietary API systems, GLM-5.2 can be downloaded and run inside an organization’s own environment, which may appeal to teams handling sensitive source code, logs, or vulnerability reports that cannot easily leave internal systems. But the model’s value depends less on headline benchmark parity than on whether it can improve supervised workflows without weakening governance.
Local deployment also shifts responsibility. Organizations running GLM-5.2 must manage access controls, logging, monitoring, red-team testing, and policy enforcement themselves. Provider-side guardrails matter less once a model is downloaded, modified, or connected to internal tools, which is why restricted-access AI releases are becoming part of the broader debate over cyber-capable models.
That trade-off cuts both ways. Defenders can use open-weight models to speed up code review, vulnerability triage, and security investigations. Attackers can use the same accessibility to test targets, automate reconnaissance, or remove safeguards. Axios reported that US officials recently allowed a limited return of Anthropic’s Mythos models after earlier restrictions tied to cybersecurity concerns.
Procurement teams should treat GLM-5.2 as a model to evaluate, not a Mythos replacement. The strongest near-term use cases are human-supervised code review, vulnerability triage, and internal security analysis. Recent AI distillation allegations also show why model provenance, access controls, and safety claims are becoming procurement issues, not just research-lab concerns.
What to watch next: independent exploit-focused evaluations of GLM-5.2, real enterprise deployment reports, and policy moves aimed at open-weight model distribution.