Grok 4 DeepDive


Dear Readers,

With Grok 4, xAI is going on the offensive: the new model is meant not only to break benchmarks, but also to catapult artificial intelligence into a new dimension. The Heavy version is particularly impressive, with its multi-agent architecture and a record score of 44.4% on “Humanity's Last Exam” – a test that makes even top models like GPT-4 sweat. Grok combines a huge context window with real-time web search and multimodality, and aims to be more than just a chatbot: a digital thinker for the future. But can it deliver what Musk promises?

All the best,


“Grok 4 is at the point where it essentially never gets math/physics exam questions wrong, unless they are skillfully adversarial. It can identify errors or ambiguities in questions, then fix the error in the question or answer each variant of an ambiguous question.”

— Elon Musk


The TLDR
xAI's Grok 4, unveiled by Elon Musk, is a powerful new model built on an efficient Mixture-of-Experts architecture with a massive 256k context window. While it sets new records on difficult reasoning benchmarks and introduces features like real-time web search, its launch has been marred by significant alignment issues, including the reappearance of offensive content and a noted tendency to echo Elon Musk's personal biases, raising serious concerns about its neutrality and safety.

Imagine an AI that doesn't just reason from past data but reflects in real time, like an intelligent conversation partner who takes current information into account – and develops enough depth to reveal philosophical trains of thought in its answers. That's exactly what Grok 4, the new model from xAI, promises. It was unveiled by Elon Musk in July 2025. With its claim to be the “smartest AI in the world,” Grok 4 has unmistakably entered the debate about what AI can be beyond the conventional chatbot: a model that works multimodally, speaks, researches live, reasons, and codes – and does so with astonishing speed, drawing on the latest information.

This ambition raises a central question: Can Grok 4 actually push the boundaries of existing AI models – both in terms of technical excellence and social relevance – or will it become entangled in new challenges such as bias, influence, and responsibility? I explore this tension throughout the article – between architecture, benchmarking, application, and reflection.


1. Technological foundation: architecture & multimodality

Unlike many of its predecessors, Grok 4 is described as using a mixture-of-experts structure in which specialized submodels are activated only when the context calls for them. One summary of the model puts it this way: "The mixture-of-experts architecture activates specialized subnetworks only when needed. This saves a lot of computing power and costs." This modular approach lets Grok 4 route complex calculations, deductive reasoning, or code processing to the most suitable subnetwork – an elegant gain in efficiency.
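To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is an illustration of the general mixture-of-experts technique only: xAI has not published Grok 4's routing code, and the toy experts, the gate scores, and the function names `route_top_k` and `moe_forward` are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are the softmax probabilities of the selected experts,
    renormalized so that they sum to 1.
    """
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, gate_scores, k=2):
    """Evaluate only the selected experts and mix their outputs by weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))

# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_scores = [0.1, 2.0, 2.0, -1.0]  # illustrative router scores, not real data

print(route_top_k(gate_scores))
print(moe_forward(3.0, experts, gate_scores))
```

Only two of the four experts are evaluated for this input, which is where the claimed savings in computing power come from: the cost per input scales with k rather than with the total number of experts.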

This is complemented by a very large context window of up to 256k tokens – twice the 128k offered by GPT-4o – allowing for long conversations with full context or the processing of complete documents. In addition, Grok 4 is becoming multimodal: it is meant to understand text, images, and eventually video in combination. For the Heavy version, there is talk of up to 32 agents working on a problem in parallel and debating with one another, enabling “multi-agent orchestration.”
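As a rough feel for what 256k tokens means in practice, the sketch below estimates whether a document fits in the window. The four-characters-per-token ratio is a common rule of thumb for English text, not a property of Grok's tokenizer, and reading "256k" as 256,000 tokens, the reply budget, and the function names are assumptions for illustration.

```python
CONTEXT_WINDOW_TOKENS = 256_000  # figure quoted in the text, read as 256,000
CHARS_PER_TOKEN = 4              # assumed rule of thumb, not a Grok tokenizer value

def estimated_tokens(text):
    """Rough token estimate from the character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)

def fits_in_context(text, reserved_for_reply=4_000):
    """True if the text plus a reply budget fits within the window."""
    return estimated_tokens(text) + reserved_for_reply <= CONTEXT_WINDOW_TOKENS

document = "word " * 200_000  # one million characters of filler text
print(estimated_tokens(document), fits_in_context(document))
```

Under these assumptions, a text of about one million characters, roughly the length of several novels, still fits with room left for an answer.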

Another standout feature is DeepSearch – a native, real-time web search that can also draw on X posts. This means answers are not only generated from what the model has learned, but can also be backed up with the latest information. Live research is a step toward intelligent, reflective dialogue at an academic level.

2. Benchmark successes and performance

In terms of performance, Grok 4 outperforms many of its competitors. According to xAI, the model scores between 25% (without tools) and 38.6% (with tools) on “Humanity's Last Exam” (HLE), a PhD-level test of 2,500 questions, and up to 50.7% in Heavy mode (reportedly on the text-only subset) – significantly better than GPT-4o or Claude Opus. On competition-level math and coding benchmarks (LiveCodeBench, HMMT, USAMO), Grok 4 Heavy posts top scores between 61% and 97%. In the “Vending-Bench” retail simulation, Grok 4 ends with more than twice the net worth of Claude and more than five times that of human participants. These figures speak to the model's distinctive combination of skills: knowledge plus planning plus strategic action.

3. Applications & developer focus

Grok 4 offers extended support for developers with its specialized variant Grok 4 Code: more than 20 programming languages, native editor integration, and debugging assistants. Initial tests show 5–8% higher coding accuracy compared to competing solutions.

The model's voice mode impresses with natural, emotionally expressive output and low latency (~250 ms); those who want more accuracy can switch to “think mode,” which reveals the model's step-by-step reasoning. Multimodal creativity is extended with Aurora (text-to-image).

Image editing, video detection, and finally video generation are scheduled to be integrated between August and October – an ambitious roadmap.

4. Transparency, risks, and influence

But despite its impressive technology, Grok 4 is not without its drawbacks. After its launch, anti-Semitic content reappeared – apparently in connection with adjustments to the filters and Musk's input into the data. Even more worrying, Grok 4 draws, unprompted, on Elon Musk's social media positions when asked about controversial topics, which seriously undermines its neutrality. An AI model that becomes more of an echo chamber than a reflective intelligence raises questions about objectivity, transparency, and disclosure.


5. Grok 4 Heavy – The multi-agent machine

For $300/month, Grok 4 Heavy (“SuperGrok Heavy”) offers an innovative architecture: multiple reasoning agents work on tasks in parallel, and their answers are combined – an approach reminiscent of teamwork.
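The "parallel agents whose answers are combined" pattern can be sketched in a few lines. xAI has not disclosed how Grok 4 Heavy merges its agents' outputs, so the majority vote used here is an assumed stand-in, and the agents are hypothetical placeholder functions rather than real model calls.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def run_agents(agents, task):
    """Run every agent on the same task in parallel and collect the answers."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        return list(pool.map(lambda agent: agent(task), agents))

def combine_by_majority(answers):
    """Pick the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in agents; a real system would call a model here.
agents = [
    lambda task: "42",
    lambda task: "42",
    lambda task: "41",
]

answers = run_agents(agents, "What is 6 * 7?")
print(combine_by_majority(answers))
```

Majority voting is only one possible combination rule. A system in which agents compare notes or critique each other's drafts would replace `combine_by_majority` with a further model call, at correspondingly higher cost.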

The result: a massive leap in performance.

HLE: 44.4% with tools – almost twice as strong as Grok 4 Standard and well ahead of Gemini Pro (26.9%) and OpenAI o3 (24.9%).
AIME: 100% – perfect score.
ARC-AGI-2: 15.9% – once again the clear leader.
Vending-Bench: $4,694 net worth vs. $2,077 for Claude and $844 for humans – evidence of economic reasoning and strategic planning skills.
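To put the size of these gaps in perspective, the short calculation below turns the figures from the list above into ratios. The numbers are copied from the list as quoted; the helper name `ratios` is just an illustrative choice.

```python
# Figures as quoted in the list above.
hle = {"Grok 4 Heavy": 44.4, "Gemini Pro": 26.9, "OpenAI o3": 24.9}
vending = {"Grok 4 Heavy": 4694, "Claude": 2077, "Humans": 844}

def ratios(scores, leader):
    """Return how many times larger the leader's score is than each other entry."""
    top = scores[leader]
    return {
        name: round(top / value, 2)
        for name, value in scores.items()
        if name != leader
    }

print(ratios(hle, "Grok 4 Heavy"))      # HLE score ratios
print(ratios(vending, "Grok 4 Heavy"))  # net worth ratios
```

On HLE, Grok 4 Heavy scores about 1.65 times Gemini Pro and about 1.78 times o3. In the vending simulation it ends with about 2.26 times Claude's net worth and about 5.56 times that of the human baseline.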

Conclusion

Grok 4 claims to be the “smartest AI in the world” – and in its core profile, which combines massive context, multimodal thinking, live web access, and multi-agent reasoning, it largely lives up to that claim. The benchmarks and real-world application scenarios point to a qualitative leap forward – especially when compared to GPT-4o, Gemini, or Claude Opus.

At the same time, the crux lies in its influence:

If an AI model adopts Elon Musk's positions unprompted – and anti-Semitic content resurfaces – this undermines trust, objectivity, and public acceptance. This leads to perhaps the most striking finding: Grok 4 is technologically revolutionary – but socially, it is a balancing act between autonomy and authority.

Can xAI further open up its model technically – for example, through systemic transparency, third-party reviews, and independent audit mechanisms – to make Grok 4 a truly reliable “reflection machine” rather than a propaganda machine? This is the balancing act that will determine whether Grok 4 can endure not only in the lab, but also as a democratic discourse tool.

Sources:

🔗 xAI: “Introducing Grok 4” – https://x.ai/news/grok-4

🔗 God of Prompt: “Grok 4 Update” – https://www.godofprompt.ai/blog/grok-4-update

🔗 Scientific American: “Elon Musk's Grok 4 Takes on Humanity's Last Exam” – https://www.scientificamerican.com/article/elon-musks-new-grok-4-takes-on-humanitys-last-exam-as-the-ai-race-heats-up

🔗 DataCamp: “Grok 4 Benchmarks” – https://www.datacamp.com/blog/grok-4

🔗 AP News: “Grok 4 Bias Concerns” – https://apnews.com/article/14d575fb490c2b679ed3111a1c83f857

🔗 Tom's Guide: “Grok 4: AI or Musk's Echo Chamber?” – https://www.tomsguide.com/ai/grok-4-is-here-elon-musk-says-its-the-same-model-physicists-use

Chubby’s Opinion Corner

Grok 4 has the potential to radically change the playing field of artificial intelligence. With the heavy version, we are seeing for the first time a mainstream model that uses multi-agent architectures to crack complex tasks such as Humanity's Last Exam with a score of almost 45% – a figure that previously seemed unattainable. While other models still think linearly, Grok 4 orchestrates entire teams of agents that act like a digital think tank. This is more than a technological advance – it is a paradigm shift. Musk has understood that the future lies not in pure scaling, but in intelligent collaboration. The AI race has restarted – and Grok 4 is setting the bar.


