
GLM-4.7: Z.ai's Frontier Agentic Reasoning Model

A comprehensive technical analysis of GLM-4.7, the 358B parameter Mixture-of-Experts model pushing the boundaries of coding, reasoning, and agentic AI capabilities.

Executive Summary

GLM-4.7 represents Z.ai's latest iteration of its General Language Model series, building upon GLM-4.6 with substantial improvements across coding, reasoning, and tool-use benchmarks. Released under the MIT license, the model is available both through Z.ai's cloud API and for full local deployment via HuggingFace.

Key highlights:

  • 358 billion total parameters with 32 billion active parameters (Mixture-of-Experts)
  • 200K context window for extensive document and code analysis
  • 73.8% on SWE-bench Verified (+5.8 points over GLM-4.6)
  • 66.7% on SWE-bench Multilingual (+12.9 points over GLM-4.6)
  • Three distinct thinking modes for flexible reasoning control

Model Architecture Deep Dive

GLM-4.7 employs a sophisticated Mixture-of-Experts (MoE) architecture that balances computational efficiency with model capacity. The architecture is designated as Glm4MoeForCausalLM in the HuggingFace ecosystem.

GLM-4.7 Architecture Overview

Core Transformer Configuration

| Parameter | Value | Description |
|---|---|---|
| Hidden Size | 5,120 | Dimensionality of hidden representations |
| Number of Layers | 92 | Total transformer blocks |
| Vocabulary Size | 151,552 | Token vocabulary including special tokens |
| Max Position Embeddings | 202,752 | Maximum sequence length (~200K tokens) |
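
If the checkpoint is published on HuggingFace, these values can be read straight from the model configuration without downloading the weights. The sketch below is illustrative: the repository id zai-org/GLM-4.7 is an assumption based on Z.ai's naming for earlier releases, and the field names are the standard HuggingFace config keys.

```python
from transformers import AutoConfig

# Repo id is an assumption; adjust to the actual HuggingFace repository.
config = AutoConfig.from_pretrained("zai-org/GLM-4.7", trust_remote_code=True)

print(config.architectures)             # expected: ["Glm4MoeForCausalLM"]
print(config.hidden_size)               # 5120
print(config.num_hidden_layers)         # 92
print(config.vocab_size)                # 151552
print(config.max_position_embeddings)   # 202752
```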

Mixture-of-Experts Configuration

The MoE layer is the defining architectural feature of GLM-4.7:

| Parameter | Value | Description |
|---|---|---|
| Routed Experts | 160 | Total number of expert networks |
| Shared Experts | 1 | Always-active expert for common patterns |
| Experts Per Token | 8 | Active experts per forward pass |
| First K Dense Replace | 3 | First 3 layers use dense FFN (no MoE) |

Parameter calculation: Each token activates 8 out of 160 experts + 1 shared expert = 9 active experts. Active parameters: ~32B (8.9% of total capacity). Total parameters: ~358B. This sparsity ratio enables the model to maintain a massive knowledge capacity while keeping inference costs comparable to a ~32B dense model.
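
To make the routing concrete, the sketch below re-creates generic top-k expert selection with the numbers from the table above (160 routed experts, 8 selected per token). It is a plain PyTorch illustration of the technique, not GLM-4.7's actual routing implementation.

```python
import torch

NUM_ROUTED_EXPERTS = 160  # routed experts per MoE layer
TOP_K = 8                 # routed experts activated per token
HIDDEN_SIZE = 5120

# Router: one logit per routed expert for each token.
router = torch.nn.Linear(HIDDEN_SIZE, NUM_ROUTED_EXPERTS, bias=False)

tokens = torch.randn(4, HIDDEN_SIZE)                   # a small batch of token states
gate_probs = router(tokens).softmax(dim=-1)            # shape (4, 160)
weights, expert_ids = torch.topk(gate_probs, TOP_K, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the selected 8

# Each token is then processed by its 8 selected experts plus the always-on shared
# expert, which is why only ~32B of the ~358B parameters run on any forward pass.
print(expert_ids)  # 8 expert indices per token
```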

Thinking Modes

GLM-4.7 introduces three thinking modes designed for agentic coding workflows:

1. Interleaved Thinking

The model reasons before each response and before each tool call. This provides fine-grained reasoning that adapts to the current context, making it particularly useful for complex multi-step tasks.

2. Preserved Thinking

Reasoning blocks are maintained across multi-turn conversations, reducing redundant computation and maintaining coherent reasoning chains across tool calls.

3. Turn-level Thinking Control

Per-turn control allows lightweight requests to skip reasoning overhead entirely.
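
In practice, this maps to a per-request switch on the API. The sketch below assumes an OpenAI-compatible endpoint, the model name glm-4.7, and a thinking request field; the exact URL and parameter names should be confirmed against Z.ai's API reference.

```python
from openai import OpenAI

# Endpoint, model name, and the "thinking" field are assumptions -- check Z.ai's docs.
client = OpenAI(base_url="https://api.z.ai/api/paas/v4", api_key="YOUR_ZAI_API_KEY")

# A lightweight request that skips the reasoning phase for this turn.
response = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Rename the variable x to count in: x = 0"}],
    extra_body={"thinking": {"type": "disabled"}},  # set to "enabled" for hard problems
)
print(response.choices[0].message.content)
```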

Benchmark Performance

Reasoning Benchmarks

GLM-4.7 Reasoning Benchmarks

| Benchmark | GLM-4.7 | GLM-4.6 | Delta (pts) |
|---|---|---|---|
| MMLU-Pro | 84.3 | 83.2 | +1.1 |
| GPQA-Diamond | 85.7 | 81.0 | +4.7 |
| HLE (w/ Tools) | 42.8 | 30.4 | +12.4 |
| AIME 2025 | 95.7 | 93.9 | +1.8 |

Coding Benchmarks

GLM-4.7 Coding Benchmarks

| Benchmark | GLM-4.7 | GLM-4.6 | Delta (pts) |
|---|---|---|---|
| SWE-bench Verified | 73.8 | 68.0 | +5.8 |
| SWE-bench Multilingual | 66.7 | 53.8 | +12.9 |
| Terminal Bench 2.0 | 41.0 | 24.5 | +16.5 |
| LiveCodeBench v6 | 84.9 | 78.2 | +6.7 |

GLM Version Improvements

Agent and Tool-Use Benchmarks

GLM-4.7 Agent Benchmarks

Hardware Requirements

GLM-4.7 Hardware Requirements

| Configuration | VRAM Required | Recommended Hardware |
|---|---|---|
| FP16 (Full Precision) | ~716 GB | 8x H100 80GB or equivalent |
| FP8 Quantized | ~358 GB | 4x H100 80GB |
| Q4_K_M Quantized | ~220 GB | High-end workstation (256GB+ unified memory) |
| Cloud (Z.ai/Cerebras) | N/A | API access, 1000+ tok/s |
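
The VRAM figures are roughly parameter count times bytes per weight, before KV-cache and activation overhead. The sketch below reproduces that back-of-the-envelope arithmetic; the ~4.9 bits/weight used for Q4_K_M is an approximation, and real deployments need extra headroom.

```python
TOTAL_PARAMS = 358e9  # total parameters across all experts

def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, ignoring KV cache and activations."""
    return params * bits_per_weight / 8 / 1e9

print(f"FP16:   ~{weight_memory_gb(TOTAL_PARAMS, 16):.0f} GB")   # ~716 GB
print(f"FP8:    ~{weight_memory_gb(TOTAL_PARAMS, 8):.0f} GB")    # ~358 GB
print(f"Q4_K_M: ~{weight_memory_gb(TOTAL_PARAMS, 4.9):.0f} GB")  # ~219 GB (approx.)
```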

Integration Ecosystem

GLM-4.7 integrates with major coding agents through Z.ai's Anthropic-compatible API (a client sketch follows the list below):

  • Claude Code - Anthropic's official CLI
  • Kilo Code - VS Code extension
  • Cline - AI coding assistant
  • Roo Code - Cursor alternative
  • OpenCode - Open-source coding agent
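
Because the endpoint is Anthropic-compatible, any Anthropic SDK client can be pointed at Z.ai by overriding the base URL, and CLI agents such as Claude Code are typically redirected the same way through their base-URL and auth-token environment variables. The sketch below uses the official anthropic Python SDK; the base URL and model name are assumptions to be confirmed against Z.ai's integration documentation.

```python
import anthropic

# Base URL and model name are assumptions -- confirm with Z.ai's integration docs.
client = anthropic.Anthropic(
    base_url="https://api.z.ai/api/anthropic",
    api_key="YOUR_ZAI_API_KEY",
)

message = client.messages.create(
    model="glm-4.7",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a unit test for a FizzBuzz function."}],
)
print(message.content[0].text)
```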

Last updated: December 2025