Tech / AI / IT Intelligence Briefing
Period: March 24–25, 2026 | Generated from Twitter/X feed
Executive Summary
The biggest story in the AI developer tooling space is the rapid rise of Hermes Agent v0.4.0, a major open-source release from NousResearch with 300 merged PRs, featuring background self-improvement, OpenAI Responses API support, and self-improving memory/skills — drawing direct comparisons to Claude Code and gaining thousands of users organically in under 48 hours. Simultaneously, NVIDIA's Nemotron Cascade 2 (a Mamba-2 architecture model) is generating significant benchmark buzz, achieving 187 tok/s on a single RTX 3090 with flat performance from 4K to 625K context — outperforming Alibaba's Qwen 3.5 35B-A3B (deltanet architecture) on identical hardware. Google's TurboQuant compression algorithm, promising 6x reduction in LLM KV cache memory, was formally introduced and is already being implemented in MLX by community developers. OpenAI's Sora video platform has been shut down, with its API also discontinued. Ollama experienced scaling issues due to demand spikes and launched an annual Pro plan at $200/year to power OpenClaw, Claude Code, and similar tools.
Key Events
- Hermes Agent v0.4.0 released — NousResearch's biggest release to date with 300 merged PRs; includes background self-improvement, OpenAI Responses API support, and self-improving memory and skills. Community describes it as superior to OpenClaw for agentic coding sessions. The MiniMax team reached out to signal collaboration interest. → link
- NVIDIA Nemotron Cascade 2 benchmark results — Mamba-2 architecture hits 187 tok/s on a single RTX 3090, flat from 4K to 625K context with no speed loss and minimal flags (-ngl 99 -np 1). Beats Alibaba's Qwen 3.5 35B-A3B (deltanet) at 112 tok/s on the same hardware. Community note: Q4_K_M (24.5GB) doesn't fit in 24GB of VRAM; use bartowski IQ4_XS (18.17GB) instead. → link
- Google TurboQuant introduced — New KV cache compression algorithm reducing memory by at least 6x; already implemented in MLX by @Prince_Canuma and tested on Qwen3.5-35B-A3B with needle-in-a-haystack benchmarks. → link
- OpenAI Sora shut down — The Sora video platform and its API have been discontinued. Reaction from the tech community is largely muted, with some noting the GPU resources could be better allocated. → link
- MiniMax M2.7 model card incoming — Following Hermes Agent's integration with MiniMax M2.5, Ryan from MiniMax confirmed an M2.7 model card is being prepared for HuggingFace. Described as "the closest we've ever gotten to a fully local Claude Code + Opus 4.6 experience." → link
- Ollama scaling issues + annual Pro plan — Ollama's cloud hit capacity limits due to demand; resolved within hours. Launched a $200/year annual Pro plan (2 months free) powering OpenClaw, Claude Code, and open-model inference. → link
- Apple Siri reportedly switching to Google's Gemini foundation model — Signals that Apple's internal AI efforts have been abandoned in favor of Google's model to power Siri. → link
- Code.Storage: a new Git provider for machines — New service from @pierrecomputer designed specifically for AI/machine-generated repos, emerging at a time when GitHub reports ~230 new repos/day from AI agents. → link
- llm-d joins CNCF as an incubating project — llm-d, focused on evolving Kubernetes into state-of-the-art AI inference infrastructure, officially joins the Cloud Native Computing Foundation. → link
- Tekton joins CNCF as an incubating project — The Kubernetes-native CI/CD pipeline framework is now officially incubating under CNCF. → link
- Karpathy on LLM memory personalization problems — Notes that LLMs persist irrelevant memory artifacts disproportionately, causing "trying too hard" behavior in personalization. Sparked discussion of RL-based post-training memory as the more viable long-term approach. → link
- Tinygrad hiring for LLM runner improvements — Seeking a developer to improve the LLM runner with USB GPU support and high BS=1 tok/s throughput; explicitly rejecting the bounty model, citing "AI slop" code-quality concerns. → link
- OpenCode desktop update — Git and branch-change visibility added to the review panel. → link
- OpenClaw new beta released — Better Microsoft Teams integration and OpenWebUI support added. → link
- Exo distributed inference gaining traction — Users are wiring up MiniMax M2.5 to dual Mac Studios (512GB) via exo, integrated with OpenClaw; DGX Spark + Mac Studio clusters are being tested. → link
- AWS 20th anniversary reflections — Amazon's CTO reflects on the origins of AWS and nearly not answering Amazon's call: "It's an online bookstore. How hard could their scaling be?" → link
- Anthropic license criticism for skills — Community criticism that Anthropic has "hardass licenses" for skills while most Codex skills are Apache 2.0. → link
- OpenAI nonprofit to spend $1B in first year — Sam Altman RT'd an announcement of the new OpenAI nonprofit's first-year $1B spending commitment. → link
- AI agent self-evaluation problem highlighted — Badlogicgames highlights research finding that agents "confidently praise" their own mediocre work; AI tools fail on design, documentation, and maintainability while performing adequately on functional correctness. → link
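The quantization note in the Cascade 2 entry comes down to simple VRAM arithmetic: the quantized weights plus runtime overhead must leave room for KV cache and context. A minimal sketch, using the figures reported in the feed; the 1 GB runtime-overhead figure is an assumption for illustration, not from the source:

```python
def vram_headroom_gb(vram_gb: float, quant_gb: float, overhead_gb: float = 1.0) -> float:
    """GB left for KV cache and context after loading quantized weights.

    Negative means the quant does not fit at all. overhead_gb is an
    assumed allowance for the runtime's own allocations.
    """
    return vram_gb - quant_gb - overhead_gb

# RTX 3090 = 24 GB VRAM, figures from the community note
print(vram_headroom_gb(24, 24.5))   # Q4_K_M: negative, doesn't fit
print(vram_headroom_gb(24, 18.17))  # bartowski IQ4_XS: several GB of headroom
```

This is why the IQ4_XS quant is recommended despite its slightly more aggressive compression: the smaller file trades a little weight precision for usable context on a 24GB card.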
Analysis
Patterns
Hermes Agent vs. OpenClaw is emerging as the defining open-source agentic coding rivalry of this period. The organic growth narrative (1,777 community "heralds" in 48 hours with zero paid promotion) and the MiniMax collaboration signal position Hermes as a serious challenger. The community sentiment ("Hermes > OpenClaw") is consistent across multiple independent accounts.
Architecture wars at the model level: The Mamba-2 (NVIDIA Cascade 2) vs. Deltanet (Alibaba Qwen 3.5) comparison on identical consumer hardware is the most concrete head-to-head architecture benchmark visible in this feed. Mamba-2's 67% throughput advantage with simpler flag requirements at consumer VRAM tiers could have significant downstream adoption implications, especially for local inference advocates.
Google TurboQuant gaining immediate community implementation (MLX port in days) suggests it addresses a real bottleneck. 6x KV cache memory reduction is significant enough to change what's runnable on prosumer hardware.
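Why a 6x KV cache reduction matters for prosumer hardware can be seen with back-of-envelope sizing. A hedged sketch: the model dimensions below (32 layers, 8 KV heads of dim 128, fp16 cache) are hypothetical illustration values, not TurboQuant's test configuration, and the 6x factor is the announcement's claimed minimum:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Uncompressed KV cache size for batch 1: a K and a V tensor of
    shape [kv_heads, seq_len, head_dim] per layer, fp16 by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 32-layer GQA model, 8 KV heads of dim 128, 128K context
base = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=128_000)
print(f"fp16 cache: {base / 2**30:.1f} GiB")             # ~15.6 GiB
print(f"at 6x compression: {base / 6 / 2**30:.1f} GiB")  # ~2.6 GiB
```

Under these assumptions, a long-context cache that would crowd out the weights on a 24GB card shrinks to a footprint that fits comfortably alongside them, which is exactly the "what's runnable on prosumer hardware" shift noted above.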
GitHub reliability has surfaced as a concern multiple times across different accounts — one described it as "on the brink of becoming the first SaaS with zero nines of availability," and notification-volume graphs show sharp AI-driven spikes. The emergence of Code.Storage as a machine-optimized Git alternative is a direct response to this.
Apple + Google Gemini partnership to power Siri is a significant strategic signal: Apple has effectively conceded its AI foundation model position.
What to Watch Next
- MiniMax M2.7 release on HuggingFace — Could validate or undercut the "local Claude Code" narrative.
- Hermes Agent + Cascade 2 integration results — sudoingX announced live testing at 187 tok/s for autonomous coding sessions; results pending.
- Google TurboQuant broader adoption — Whether other frameworks (llama.cpp, vLLM) implement it quickly.
- OpenClaw ecosystem response — New beta just dropped; whether OpenClaw recovers community sentiment against Hermes.
- GitHub reliability trajectory — AI-driven repo creation volume is straining infrastructure; a major outage could accelerate Code.Storage-type alternatives.
- Sora shutdown fallout — Whether OpenAI redeploys compute meaningfully (e.g., toward reasoning or coding).
Tweet Feed
🤖 Hermes Agent / NousResearch
@Teknium · 2026-03-25T01:37
We are seriously cooking 🔥🧑🍳 → tweet link
@louszbd · 2026-03-25T03:31
RT @Teknium: Hermes Agent v0.4.0 — 300 merged PRs this week. Biggest release we've done. Background self-improvement, OpenAI Responses API… → tweet link
@TheAhmadOsman · 2026-03-25T02:10
Just spent a couple hours playing with Hermes Agent (MiniMax M2.5 on a 2× RTX PRO 6000 node). Genuinely impressive experience. MiniMax M2.7 weights will be the closest we've ever gotten to a fully local "Claude Code + Opus 4.6" experience. Running on your own hardware at home. → tweet link
@Teknium · 2026-03-25T00:13
RT @851277048Li: @Teknium Hi, Teknium, I am Ryan from MiniMax. Hermes's project is truly impressive. I look forward to further collaboration… → tweet link
@sudoingX · 2026-03-25T13:33
now comes my favorite part. installing the majestic hermes agent for cascade 2. did you install it? do you have it too? what are you doing with it? → tweet link
@sudoingX · 2026-03-25T15:59
wow we are 1,777 heralds now in 48 hours. no ads, no giveaways, no follow for follow. just open source and people who build. tomorrow i'm running hermes agent on nvidia's cascade 2 at 187 tok/s. autonomous coding sessions, tool calls, the full test. results will be posted here first. → tweet link
@Teknium · 2026-03-25T18:35
RT @deemoowoor: Got it working yesterday, imported a few skills I use with claude code, a few new tools that hermes has as better alternatives… → tweet link
@Teknium · 2026-03-25T16:38
RT @Rahatcodes: Hermes Agent is WAAAAY better experience than Open Claw by far → tweet link
@Teknium · 2026-03-25T16:42
RT @thejayesh: I spun off one of my test beds to this and to say it's impressive is understating it. The memory just works out of the box… → tweet link
@Teknium · 2026-03-25T17:44
RT @fancylancer3991: After reading it, this should be bigger news. Hermes agent = self-improving memory & skills… → tweet link
⚡ NVIDIA Nemotron Cascade 2 / Model Benchmarks
@sudoingX · 2026-03-25T09:39
if you're about to download nvidia's nemotron cascade 2 at Q4_K_M for a single RTX 3090, stop... the fix: bartowski IQ4_XS at 18.17GB. imatrix quantization... leaves you 5.4GB of headroom for KV cache and context. → tweet link
@sudoingX · 2026-03-25T13:19
nvidia's 3B mamba destroyed alibaba's 3B deltanet on the same RTX 3090... nemotron cascade 2: 187 tok/s. flat from 4K to 625K context. zero speed loss... qwen 3.5 35B-A3B: 112 tok/s. flat from 4K to 262K context... nvidia cooked. → tweet link
@TheAhmadOsman · 2026-03-25T16:34
guys don't get too excited anything intel GPU is dead on arrival for LLMs... NVIDIA owns a 10% stake in intel so they don't compete → tweet link
@TheAhmadOsman · 2026-03-25T18:39
32GB VRAM card that has Unified Memory-class bandwidth, lacks software support & adoption, DOES NOT have CUDA — should not be sold for $1,000 USD. You'll be better off buying an M5 Max lol → tweet link
🔬 Google TurboQuant / KV Cache Compression
@louszbd · 2026-03-25T04:03
RT @GoogleResearch: Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers… → tweet link
@victormustar · 2026-03-25T09:02
RT @Prince_Canuma: Just implemented Google's T