Executive Summary
This reporting period saw significant activity in the local AI ecosystem, with EXO releasing version 1.0.69 adding Qwen3.5 support and continuous batching, while the Hermes Agent framework continues gaining traction over competitor OpenClaw. Key technical debates emerged around local LLM hardware selection—emphasizing that software stack maturity matters more than raw VRAM specs—and a critical analysis of how RLHF training creates overconfident, sycophantic model outputs. Anthropic faced backlash over changes to Claude Code subscription policies, while the broader community showed growing interest in building autonomous agents on consumer hardware. Levels.io demonstrated rapid MVP creation using AI coding tools, completing projects in under 30 minutes.
Key Events
- EXO 1.0.69 Released — Major update from @exolabs adds Qwen3.5 support, continuous batching, and M5 Pro/Max chip support → link
- Hermes Agent Gains Momentum — Community shift from OpenClaw to Hermes Agent documented, with users reporting better performance on the same hardware → link
- Critical Analysis of Local AI Hardware — Comprehensive thread debunking "VRAM-only" thinking, emphasizing software stack maturity, interconnect bandwidth, and the CUDA ecosystem → link
- RLHF Creates Overconfidence, Not Accuracy — Technical deep-dive explaining how reinforcement learning from human feedback optimizes for confident-sounding responses rather than correct ones → link
- Anthropic Limits Claude Code Subscriptions — Company facing criticism for restricting third-party access to Claude subscriptions → link
- Qwen 3.5 27B Dense Performance Metrics — Detailed benchmarks showing 35 tok/s on an RTX 3090 with flat performance from 4K to 300K+ context → link
- NVIDIA Nemotron Cascade 2 Testing — New MoE model achieving 187 tok/s on an RTX 3090, with quality comparing favorably to dense models → link
- 26 Essential Papers for Mastering LLMs Released — Curated reading list spanning foundational Transformer papers through modern reasoning and MoE research → link
- Pi Extensions Ecosystem Growing — Community building tools for sharing agent sessions as GitHub gists and custom extensions → link
- Google 3.1 Flash Live API — Real-time speech generation API released as a competitor to ElevenLabs → link
Analysis
Agent Framework Competition Intensifies: The Hermes Agent vs. OpenClaw narrative dominated local AI discussions. Multiple users documented switching experiences, citing Hermes's simpler execution, better memory management, and superior tool call reliability. This suggests the market for local agent harnesses is maturing, with community preference shifting toward lightweight, well-documented solutions over feature-heavy alternatives.
Hardware-Software Co-optimization: The detailed thread on why "VRAM is not all that matters" reflects a growing sophistication in the local AI community. Discussions now routinely include memory bandwidth, PCIe vs. NVLink interconnect, and inference engine selection (vLLM, TensorRT-LLM, SGLang). The emerging consensus favors NVIDIA hardware due to CUDA ecosystem maturity, with Blackwell adoption continuing despite incomplete software stack maturity.
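The bandwidth point can be made concrete with a back-of-the-envelope roofline estimate: during single-stream decoding, every generated token requires streaming the full set of model weights from memory, so peak throughput is bounded by memory bandwidth divided by the weight footprint. A minimal sketch (the GB figures are approximate assumptions for illustration, not numbers from the thread):

```python
def max_decode_tps(weights_gb: float, bandwidth_gbps: float) -> float:
    """Upper bound on single-stream decode speed in tokens/s.

    Each decoded token reads the full weight set once, so
    throughput <= memory bandwidth / weight footprint.
    Ignores KV-cache reads and kernel overhead, so real-world
    numbers land well below this ceiling.
    """
    return bandwidth_gbps / weights_gb

# RTX 3090: ~936 GB/s memory bandwidth (spec-sheet value).
# A 27B model at a ~4.5 bit/weight quant occupies roughly 16 GB.
ceiling = max_decode_tps(weights_gb=16.0, bandwidth_gbps=936.0)
print(f"theoretical ceiling: {ceiling:.0f} tok/s")
```

The reported 35 tok/s for Qwen 3.5 27B dense on a 3090 sits comfortably under this roughly 59 tok/s ceiling, which is also why adding VRAM capacity without adding bandwidth does not make decoding faster.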
Model Quality vs. Speed Tradeoffs: Community benchmarks increasingly distinguish between dense and MoE architectures for different use cases. Dense models like Qwen 3.5 27B are preferred for agent work requiring depth, while MoE models like Qwen 3.5 35B offer speed advantages for simpler tasks. NVIDIA Nemotron Cascade 2's hybrid performance profile generated significant interest.
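The dense/MoE speed gap follows from the same bandwidth argument: an MoE layer only reads the weights of the experts its router activates, so per-token memory traffic scales with active parameters rather than total parameters. A hedged sketch (active/total parameter counts come from the benchmarks quoted below; the bytes-per-weight figure is an assumed ~4.5 bit quant level):

```python
def per_token_traffic_gb(active_params_billions: float) -> float:
    """Approximate weight bytes streamed per decoded token.

    At ~4.5 bits/weight, each parameter costs ~0.56 bytes of
    memory traffic. Dense models read all parameters per token;
    MoE models read only the activated experts.
    """
    return active_params_billions * 0.56

dense_27b = per_token_traffic_gb(27.0)  # dense: all 27B params per token
moe_35b = per_token_traffic_gb(3.0)     # MoE: only ~3B active params per token
print(f"MoE reads ~{dense_27b / moe_35b:.0f}x less weight data per token")
```

In practice the observed speedup (112 vs. 35 tok/s, roughly 3x in the community benchmarks) is smaller than this 9x traffic ratio, because attention, shared layers, and routing overhead are paid regardless of how few experts fire.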
LLM Reliability Concerns: The thread on RLHF creating overconfident but potentially inaccurate outputs, combined with Karpathy's demonstration of LLMs arguing both sides equally well, highlights ongoing concerns about model calibration. This aligns with community interest in open-source models where users have visibility into training decisions.
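Calibration has a standard quantitative form: a well-calibrated model's stated confidence should match its empirical accuracy, and the gap is commonly summarized as expected calibration error (ECE). A minimal sketch with made-up data, illustrating the score an overconfident model receives:

```python
def expected_calibration_error(confs, corrects, n_bins=10):
    """Bin predictions by confidence; ECE is the bin-size-weighted
    mean |accuracy - confidence| gap across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confs, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# Overconfident model: claims ~95% confidence but is right half the time.
confs = [0.95] * 10
corrects = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(f"ECE = {expected_calibration_error(confs, corrects):.2f}")  # ECE = 0.45
```

An RLHF objective that rewards confident-sounding answers pushes stated confidence up without moving accuracy, which shows up directly as a larger ECE.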
What to Watch: Continued development of multi-hardware orchestration (EXO positioning), Anthropic's response to subscription policy backlash, and whether Hermes Agent can maintain its momentum against OpenClaw's feature set.
Tweet Feed
AI Model Releases & Benchmarks
@sudoingX · 2026-03-28T10:28
okay let me say this out loud again. if you want to run local models on a single RTX 3090, your best option right now is qwen 3.5 27B dense Q4_K_M. 35 tok/s, flat from 4K to 300K+ context, zero speed degradation. thinking mode works. 262K native context on 24GB. slower than MoE but the quality per token is unmatched on a single card. dense means every layer processes every token. no routing, no skipping. you feel it in the output. qwen 3.5 35B MoE is faster at 112 tok/s but only activates 3B parameters per token. NVIDIA cascade 2 hits 187 tok/s same architecture class. MoE gives you speed, dense gives you depth. for agent work and long coding sessions where every token matters, 27B dense at 35 tok/s beats 35B MoE at 112 tok/s in output quality. i tested both extensively. for those on RTX 3060 12GB, qwen 3.5 9B Q4_K_M. 50 tok/s, 128K context sweet spot, 5.3GB on disk. this model built a full game from a single prompt. 2,699 lines across 11 files in 11 minutes. your 12GB card from 2020 is not obsolete. and now that i've been testing NVIDIA nemotron cascade 2 on the same 3090, 187 tok/s IQ4_XS, 625K context, i like it a lot. it just gets it and even one shotted a full UI build that qwen MoE needed an iteration for. this MoE feels dense. i think i'll experiment with tuning it. more data coming. → tweet
@TheAhmadOsman · 2026-03-28T18:05
People keep saying "VRAM is all that matters" for local LLMs > It's not just wrong, it's misleading. When running LLMs locally, the bottleneck is NOT just "VRAM size". It's: memory bandwidth, interconnect (PCIe vs NVLink vs RDMA), inference engine (vLLM, TensorRT-LLM, SGLang). (Also, Unified Memory ≠ VRAM - and it's much slower) → tweet
@TheAhmadOsman · 2026-03-28T03:33
NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 has been growing on me as well recently → tweet
@FinansowyUmysl · 2026-03-28T06:59
Google has added an update to its new model: 3.1 Flash Live. It handles real-time speech generation excellently. Serious competition for ElevenLabs. Their API is already available. Google also released updates to its translator, which now translates speech in real time. They're moving fast! → tweet
Agent Frameworks & Developer Tools
@alexocheema · 2026-03-28T17:22
RT @ivanfioravanti: EXO 1.0.69 is out! BIG ONE from @exolabs 🙏 - Add support for Qwen3.5 - Continuous batching - support for M5 Pro/Max chi… → tweet
@alexocheema · 2026-03-28T17:49
We're entering the era of multi-hardware, multi-agent, multi-model. EXO is the orchestration layer for this era. → tweet
@sudoingX · 2026-03-28T10:51
hear this from our x/localllama community admin to any gamer on the street saying this, qwen 3.5 27B dense paired with hermes agent is something else. i've tested the same model on openclaw bloat and it becomes useless. tool calls fail, model chains break, sessions crash. move away from that bloat and migrate to hermes agent. watch the same model start performing miracles. fastest growing agent harness btw, fully open source in and out from head to toe. no corporation behind mining your thinking. the community is proof. → tweet
@sudoingX · 2026-03-28T17:50
ima sleep now anon. hermes agent isn't. reports and logs will be ready by morning. this is the life i built. co-evolving with hermes agent. → tweet
@Teknium · 2026-03-28T13:58
RT @cocktailpeanut: Quick tip: you can actually click directly on the preview pane and the inspector will auto-scroll to the corresponding… → tweet
@steipete · 2026-03-28T04:14
RT @NickADobos: Codex has hooks finally!!! https://t.co/2LKZD0yhft → tweet
@steipete · 2026-03-28T02:56
Another sick upcoming feature: /acp spawn codex --bind here. LOOK AT ME, I AM CODEX NOW. You could bind codex/claude code/opencode already in threads, now you can take over your current session as well. → tweet
@jsuarez · 2026-03-28T17:42
After much consideration, my users have convinced me to add documentation to PufferLib for use by LLMs. It will politely inform them to RTFM → tweet
@jsuarez · 2026-03-28T16:55
Reinforcement Learning dev with Joseph Suarez https://t.co/VOoi9Fz31l → tweet
@badlogicgames · 2026-03-28T13:28
i love my little pi. too many images in context for a specific provider? cool, write a bespoke image pruning extension ad-hoc in 2 minutes, reload, continue. https://t.co/ApD9Q1nLXj → tweet
@badlogicgames · 2026-03-28T08:38
we as software engineers are becoming beholden to a handful of well funded corporations. while they are our "friends" now, that may change due to incentives. i'm very uncomfortable with that. i believe we need to band together as a community and create a public, free to use repository of real-world (coding) agent sessions/traces. I want small labs, startups, and tinkerers to have access to the same data the big folks currently gobble up from all of us. So we, as a community, can do what e.g. Cursor does below, and take back a little bit of control again. Who's with me? → tweet
@badlogicgames · 2026-03-28T16:33
anytime i finish a blog post, i feed it to an LLM asking it to produce 20-40 HN or Reddit comments. immensely effective. stole that idea from @mitsuhiko → tweet
@levelsio · 2026-03-28T14:53
✨ To inspire more people to go build something now that we have AI to help us (especially non-tech people, cause I still know so many who are scared of building something): I added a [ BUILD IT ] button to https://t.co/WNLj4eGmaq. It's like a mini-Lovable/Replit/v0: Any idea you see you like, you can click [ BUILD IT ], and it will use Opus 4.6 to build a landing page for it. And then you can download the code it generated. It's not a full startup of course, but a nice preview of what it can be, to give you an idea and inspire you to build it out further. The code is live streamed also so you can see it being built 😊. Ironically this itself took me 1 hour to build with AI too. Completely free and I pay for the tokens (please don't abuse it :D) → tweet
@levelsio · 2026-03-28T01:22
✨ To prove my friend @StevieZollo (who's visiting me in Brazil) you don't need an idea, or even a lot of time these days to ship a little app that might make money. I took the top idea from https://t.co/WNLj4eGmaq: "A startup that uses AI to generate personalized bedtime stories for kids based on their interests, family photos, and daily activities, delivered via a voice app." So I copy pasted it into Claude Code and asked it to build it. The first version of course didn't work, and I had to tell it some endpoints didn't work properly but then it fixed it. The bedtime stories are generated by @xAI Grok 4.1, then sent to TTS with @GoogleAI Gemini and payment with @Stripe Checkout. Total time from start to live: 24 minutes → tweet
AI Research & Technical Analysis
@louszbd · 2026-03-28T17:32
cool, from a model training perspective, want