Agent Mixing Without Theater: DeepSeek Pro, Flash, Gemma4, and the Law of Diminishing Returns
Published: June 3, 2026
Last updated: June 4, 2026
Series: Agent mixing / Hermes routing, Part 1
This article started as a blunt answer to one question: when you mix DeepSeek Pro as the orchestrator, DeepSeek Flash as parallel delegates, and Gemma4 as a local side channel, where does the law of diminishing returns kick in?
The answer is still simple:
More agents buy more independent context windows, more parallel search paths, and more chances to catch a mistake. They do not automatically buy better judgment.
This is Part 1 of the series. It sets the baseline rule: Pro leads hard synthesis, Flash multiplies bounded work, and Gemma4 stays useful as a local junior-engineer sanity pass.
Two follow-up drafts now extend the argument:
- Part 2 — What changes when GPT-5.5 is the orchestrator
- Part 3 — The expanded formula: Zen/M3, Puter panels, Gemma4, and specialist rotators
Those draft links are private until promoted. The public principle is already stable: if the next agent does not get a distinct job, the next agent is probably noise.
The short answer
For serious coding, architecture, and repo surgery:
DeepSeek Pro = orchestrator, final synthesizer, master coder
DeepSeek Flash = scouts, reviewers, cheap implementers, context packers
Gemma4 = local junior engineer sanity pass
Hermes/code = deterministic router wherever the routing rules are already known
Do not make Flash the main orchestrator for hard coding work unless the orchestration is basically a scripted router.
Use Flash as a lead only when the job is predictable:
- classify the task
- choose a profile
- summarize logs
- extract TODOs
- run a known checklist
- pack context for a stronger model
- produce a first-pass draft that Pro will review
Use Pro as the lead when the work has ambiguity, risk, architecture judgment, cross-file dependencies, or final merge authority.
That is the practical split.
What changed with DeepSeek V4
DeepSeek’s V4 release makes this question more interesting because both Pro and Flash are now long-context and agent-capable. The official pricing page lists both V4 Flash and V4 Pro with 1M context, JSON output, tool calls, and thinking-mode support. As of this update, the official prices are far enough apart to matter: Flash is listed at $0.14 per 1M cache-miss input tokens and $0.28 per 1M output tokens, while Pro is listed at $0.435 per 1M cache-miss input tokens and $0.87 per 1M output tokens, roughly 3× the Flash price on both.
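To make the price gap concrete, here is a minimal cost sketch. It uses only the per-token prices quoted above, assumes every input token is a cache miss, and the 200k-in / 8k-out workload is a made-up example, not a measurement. The names `PRICES` and `call_cost` are mine, not part of any DeepSeek API.

```python
# USD per 1M tokens, copied from the prices quoted in this article.
# Treat them as a snapshot; the official page may change.
PRICES = {
    "flash": {"input": 0.14, "output": 0.28},
    "pro": {"input": 0.435, "output": 0.87},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate one call's cost, assuming all input tokens are cache misses."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 200k tokens of repo context in, 8k tokens out.
flash_cost = call_cost("flash", 200_000, 8_000)
pro_cost = call_cost("pro", 200_000, 8_000)
print(f"flash=${flash_cost:.4f} pro=${pro_cost:.4f} ratio={pro_cost / flash_cost:.2f}x")
```

The ratio stays near 3× for any input/output mix, because both Pro prices are the same multiple of the Flash prices. The gap is real, but it is a constant factor, not an order of magnitude.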
So the temptation is obvious:
If Flash is much cheaper and reasonably capable, should Flash orchestrate everything and call Pro only as the expensive master coder?
Sometimes. But not by default.
DeepSeek’s own V4 release notes position Pro as the stronger agentic coding and reasoning model, while Flash is the faster, more economical model that can approach Pro on simpler agent tasks. That points to the shape of the system: Flash should multiply throughput, but Pro should own hard judgment.
The manager is not just the person who sends messages. In an agent workflow, the manager defines the problem, slices it, judges conflicts, and decides what gets merged. That is usually the cognitively expensive part.
Where diminishing returns starts
The law of diminishing returns starts at the point where the next agent no longer receives an independent job.
Here is the test:
Add another delegate only if it can:
1. inspect a different source,
2. test a different hypothesis,
3. own a different risk,
4. search a different part of the repo,
5. or compress context for a later model.
If the new agent is just reading the same prompt and producing another opinion, skip it.
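The test above can be sketched as a small guard that runs before spawning a delegate. This is an illustration under my own assumptions: the `DelegateJob` shape, the `is_distinct` helper, and the example paths are hypothetical, and "independent" is reduced to "different source or different hypothesis".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelegateJob:
    role: str        # e.g. "regression_risk"
    source: str      # what the delegate inspects (files, logs, docs)
    hypothesis: str  # the question or risk it owns


def is_distinct(candidate: DelegateJob, existing: list[DelegateJob]) -> bool:
    """Return True if no existing delegate covers the same source and hypothesis."""
    key = (candidate.source, candidate.hypothesis)
    return all(key != (job.source, job.hypothesis) for job in existing)


# Hypothetical team and candidates.
team = [
    DelegateJob("implementation_path", "src/", "smallest patch that works"),
    DelegateJob("regression_risk", "tests/", "what breaks across files"),
]
second_opinion = DelegateJob("reviewer_2", "tests/", "what breaks across files")
docs_scout = DelegateJob("docs_summary", "docs/", "what the docs promise")

print(is_distinct(second_opinion, team))  # same source, same question
print(is_distinct(docs_scout, team))      # new source, new question
```

A real check would need fuzzier matching than exact string equality, but even this crude version forces you to write down what the next agent is for.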
Anthropic’s multi-agent research writeup is useful here because it is blunt about the cost curve. It reports that multi-agent work can help by spending more tokens across separate context windows, but also that single agents used about 4× the tokens of normal chat interactions, while multi-agent systems used about 15×. The same article notes that most coding tasks have fewer truly parallelizable pieces than research.
That is the painful little truth: multi-agent workflows are often strongest for broad research and audits, not for every normal coding task.
The useful agent-count curve
For coding work, I would use this curve:
| Agent count | Usually worth it? | Best use |
|---|---|---|
| 1 | Yes | focused implementation, bugfixing, simple refactors |
| 2–3 | Usually | lead + reviewer + test/risk scout |
| 4–6 | Sometimes | architecture alternatives, repo audits, code/test/docs split |
| 7–10 | Rarely | broad research, large codebase scans, parallel discovery |
| 10+ | Almost never | only when each agent has a clearly separate corpus or job |
The sweet spot for your Hermes-style work is probably:
C = Pro + Flash * 2 + Gemma4
That gives you a strong lead, two cheap independent passes, and one local sanity check without turning the job into a committee.
For major architecture work, use the plus version:
C = Pro + Flash * 3 or 4 + Gemma4
But make every Flash delegate own a different lane.
Why Flash should not usually be the top orchestrator
The orchestrator’s job is not only to distribute tasks. It has to decide what matters.
A weak orchestrator can waste a strong coder by asking the wrong question. That is the bad pattern:
Flash lead misunderstands scope
→ Pro receives narrow or incorrect delegate task
→ Pro solves the wrong thing well
→ system returns a polished mistake
This is why “cheap model as manager, expensive model as worker” is only safe when the route is deterministic.
A Flash-led workflow is fine for this:
Task arrives
→ classify as bugfix / article / audit / prompt pack
→ pick known profile
→ pack repo context
→ call Pro only if risk threshold is high
But for this:
PixelBoats projection architecture
SvelteKit PHP adapter correctness
multi-file refactor
security-sensitive deployment
AI Wiki search stack design
Pro should lead.
OpenAI’s handoff model and Google Cloud’s agent design guidance both point toward the same idea: handoffs are useful when specialists own distinct tasks, and the architecture should be selected based on complexity, latency, cost, autonomy, and workload shape. Do not use an agent hierarchy when a plain workflow would do.
That is the anti-theater rule.
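The deterministic case is small enough to write as plain code, which is the whole point: no model needs to "manage" it. A minimal sketch, assuming a task has already been classified and given a risk score between 0 and 1. The `route` function, the `KNOWN_PROFILES` set, and the 0.6 threshold are illustrative assumptions; only the two model names come from the profiles used in this article.

```python
# Task kinds whose routing rules are already known (from the Flash-led example).
KNOWN_PROFILES = {"bugfix", "article", "audit", "prompt_pack"}

PRO = "deepseek-v4-pro"
FLASH = "deepseek-v4-flash"


def route(task_kind: str, risk: float, risk_threshold: float = 0.6) -> str:
    """Pick the lead model with fixed rules instead of an agent hierarchy."""
    if task_kind not in KNOWN_PROFILES:
        # Unknown shape means ambiguity, so the stronger model leads.
        return PRO
    if risk >= risk_threshold:
        # Known shape, but the stakes are high enough to pay for Pro.
        return PRO
    return FLASH


print(route("bugfix", risk=0.2))
print(route("bugfix", risk=0.9))
print(route("projection_architecture", risk=0.1))
```

Anything that does not match a known profile falls through to Pro by default. That keeps the failure mode expensive rather than wrong.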
What each model should do
DeepSeek Pro
Use Pro for:
- task decomposition
- architecture decisions
- final patch design
- final code generation
- conflict resolution between delegates
- root-cause debugging
- long-context synthesis
- “should we do this at all?” decisions
Pro should be the voice that says:
“This is the plan, these are the tradeoffs, this is the minimal safe patch, and these are the tests that matter.”
That is the role worth paying for.
DeepSeek Flash
Use Flash for:
- first-pass repo scan
- source extraction
- codebase inventory
- TODO/risk gathering
- alternate implementation sketch
- test plan draft
- docs summary
- prompt/context packing
- cheap review pass
- “what did Pro miss?” checks
Flash should not be vague. Give it a lane.
Bad Flash prompt:
Review this architecture.
Better Flash prompt:
You are the regression-risk delegate.
Only look for test gaps, cross-file breakage, and migration hazards.
Return:
1. likely breakpoints
2. missing tests
3. smallest verification plan
4. anything that should block merge
Do not rewrite the architecture.
That makes Flash useful.
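If you write lane prompts often, a tiny template keeps them from drifting back into "review this". This is a sketch, not a Hermes feature: `lane_prompt` and its parameters are hypothetical names, and the output simply mirrors the shape of the better prompt above.

```python
def lane_prompt(role: str, focus: str, returns: list[str], forbidden: str) -> str:
    """Build a bounded delegate prompt: one role, one focus, numbered outputs, one ban."""
    if not returns:
        # A delegate with no required outputs is an open-ended essay waiting to happen.
        raise ValueError("a lane prompt needs at least one required output")
    lines = [
        f"You are the {role} delegate.",
        f"Only look for {focus}.",
        "Return:",
    ]
    lines += [f"{i}. {item}" for i, item in enumerate(returns, start=1)]
    lines.append(f"Do not {forbidden}.")
    return "\n".join(lines)


prompt = lane_prompt(
    role="regression-risk",
    focus="test gaps, cross-file breakage, and migration hazards",
    returns=[
        "likely breakpoints",
        "missing tests",
        "smallest verification plan",
        "anything that should block merge",
    ],
    forbidden="rewrite the architecture",
)
print(prompt)
```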
Gemma4
Use Gemma4 as a local junior engineer delegate.
Not a skeptic. Not the boss. Not the final reviewer.
Give it jobs like:
- explain the implementation in simpler terms
- spot obvious readability problems
- suggest pseudocode
- flag “this will be hard to maintain”
- check whether a visual/UI plan makes sense
- propose small practical alternatives
- notice if the workflow is overbuilt
Gemma4 is useful precisely because it is local and cheap enough to use as a side channel. Google DeepMind describes Gemma open models as deployable across cloud servers, laptops, and phones, which maps well to an inexpensive second-pass lane rather than final authority.
The better orchestration pattern
I would design the default Hermes routes like this:
profiles:
  flash_triage:
    lead: deepseek-v4-flash
    use_when:
      - classify task
      - summarize logs
      - extract TODOs
      - pack context
      - decide whether Pro is needed
  normal_coding:
    lead: deepseek-v4-pro
    delegates:
      - deepseek-v4-flash: regression_review
    optional:
      - gemma4-e4b-local: junior_sanity_pass
  serious_coding:
    lead: deepseek-v4-pro
    delegates:
      - deepseek-v4-flash: implementation_path
      - deepseek-v4-flash: regression_risk
      - gemma4-e4b-local: junior_sanity_pass
  architecture_plus:
    lead: deepseek-v4-pro
    delegates:
      - deepseek-v4-flash: implementation_path
      - deepseek-v4-flash: testing_and_failure_modes
      - deepseek-v4-flash: context_reduction
      - deepseek-v4-flash: alternative_design
      - gemma4-e4b-local: junior_sanity_pass
For the earlier formula:
C = Pro + Flash * x + G4 + ?Y
I would set the default values this way:
normal:
  C = Pro + Flash * 1 + optional G4
serious:
  C = Pro + Flash * 2 + G4
plus:
  C = Pro + Flash * 3 or 4 + G4 + optional outside specialist
Keep ?Y rare. It should mean “different model family for a real reason,” not “one more opinion because we can.”
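The three defaults can be expressed as a small team builder that refuses to exceed them. A rough sketch under my own naming: `DEFAULTS`, `build_team`, and the short model labels are hypothetical, and I am reading "optional G4" as off unless requested and `?Y` as allowed only in the plus profile.

```python
# Allowed Flash delegate range and Gemma4 default for each profile.
DEFAULTS = {
    "normal": {"flash": (1, 1), "gemma4": False},   # G4 is optional here
    "serious": {"flash": (2, 2), "gemma4": True},
    "plus": {"flash": (3, 4), "gemma4": True},
}


def build_team(profile, flash=None, gemma4=None, outside_specialist=None):
    """Expand C = Pro + Flash * x + G4 + ?Y into a concrete list of agents."""
    spec = DEFAULTS[profile]
    low, high = spec["flash"]
    count = low if flash is None else flash
    if not low <= count <= high:
        raise ValueError(f"{profile} allows {low} to {high} Flash delegates, got {count}")
    if outside_specialist is not None and profile != "plus":
        raise ValueError("?Y is reserved for the plus profile")

    team = ["pro"] + ["flash"] * count
    use_gemma = spec["gemma4"] if gemma4 is None else gemma4
    if use_gemma:
        team.append("gemma4")
    if outside_specialist is not None:
        team.append(outside_specialist)
    return team


print(build_team("normal"))
print(build_team("serious"))
print(build_team("plus", flash=4))
```

Raising on an out-of-range count is deliberate. If you want a fifth Flash delegate, you should have to change the table, not just pass a bigger number.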
The merge-cost problem
Every delegate creates a merge problem.
You have to read the result, judge it, reconcile contradictions, decide whether it affects the plan, and carry forward the useful bits. That synthesis cost is real.
A multi-agent system fails when all delegates are allowed to produce open-ended essays.
Make delegates return structured outputs instead:
role:
finding:
evidence:
confidence:
risk:
recommended_action:
blocks_merge: yes/no
Or for coding:
files_to_touch:
files_to_avoid:
likely_breakpoints:
tests_to_run:
smallest_safe_patch:
open_questions:
This is boring. Good. Boring structure is how you stop the agent room from becoming improv night.
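One way to enforce the first schema is to parse every delegate reply into a typed record and reject anything incomplete. A minimal sketch: the field names come from the list above, while the `Finding` class, the `parse_finding` and `merge_blockers` helpers, and the choice of a 0.0 to 1.0 float for confidence are my assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class Finding:
    role: str
    finding: str
    evidence: str
    confidence: float        # assumption: 0.0 to 1.0
    risk: str
    recommended_action: str
    blocks_merge: bool       # parsed from "yes" / "no"


def parse_finding(raw: dict) -> Finding:
    """Turn a delegate's key/value reply into a Finding, or fail loudly."""
    names = [f.name for f in fields(Finding)]
    missing = [name for name in names if name not in raw]
    if missing:
        raise ValueError(f"delegate output missing fields: {missing}")
    data = {name: raw[name] for name in names}
    data["confidence"] = float(data["confidence"])
    data["blocks_merge"] = str(data["blocks_merge"]).strip().lower() in {"yes", "true"}
    return Finding(**data)


def merge_blockers(findings: list[Finding]) -> list[Finding]:
    """Return only the findings the lead must resolve before merging."""
    return [f for f in findings if f.blocks_merge]


# Hypothetical delegate reply.
reply = {
    "role": "regression_risk",
    "finding": "no test covers the migration rollback path",
    "evidence": "tests/ has no rollback case",
    "confidence": "0.7",
    "risk": "high",
    "recommended_action": "add a rollback test before merge",
    "blocks_merge": "yes",
}
print(merge_blockers([parse_finding(reply)]))
```

The lead then reads a short list of blockers first and the rest only if it has budget left.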
The context-sharing rule
Do not start delegates with no context.
But also do not dump the full conversation into every delegate.
Use a compact shared brief:
Project:
Current goal:
Known decisions:
Non-negotiables:
Files/surfaces in scope:
Files/surfaces out of scope:
What the lead already believes:
Your delegate role:
Output format:
That is enough to prevent token creep without starving the delegate.
For PixelBoats-style work, this matters a lot. If the Flash delegate does not know the projection source of truth lives in the Perspective Lab winner specs, it may confidently re-litigate the wrong thing. If Gemma4 does not know it is a junior implementation/readability pass, it may drift into general brainstorming.
Agent systems are only as good as their context contracts.
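A context contract is easier to keep when the brief is rendered by code that refuses to send an incomplete or bloated one. A sketch under stated assumptions: the field labels are the ones from the shared brief above, while `render_brief`, the `max_chars` cap of 2000, and the example values are hypothetical.

```python
BRIEF_FIELDS = [
    "Project",
    "Current goal",
    "Known decisions",
    "Non-negotiables",
    "Files/surfaces in scope",
    "Files/surfaces out of scope",
    "What the lead already believes",
    "Your delegate role",
    "Output format",
]


def render_brief(values: dict[str, str], max_chars: int = 2000) -> str:
    """Render the shared brief, rejecting missing fields and oversized text."""
    missing = [name for name in BRIEF_FIELDS if not values.get(name)]
    if missing:
        raise ValueError(f"brief is missing: {missing}")
    text = "\n".join(f"{name}: {values[name]}" for name in BRIEF_FIELDS)
    if len(text) > max_chars:
        # A brief this long is a conversation dump, not a brief.
        raise ValueError(f"brief is {len(text)} chars, cap is {max_chars}")
    return text


# Hypothetical values for a PixelBoats-style task.
example = {name: "n/a" for name in BRIEF_FIELDS}
example["Project"] = "PixelBoats"
example["Known decisions"] = "projection source of truth is the Perspective Lab winner specs"
example["Your delegate role"] = "regression-risk delegate"
print(render_brief(example))
```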
The simplest budget law
Use this rule:
Run another delegate only if:
expected independent value
>
token cost + latency + merge cost + contradiction cost + context-packing cost
The moment duplicate findings dominate, stop adding agents.
The moment every delegate needs the same full context, stop adding agents.
The moment Pro spends more time cleaning up Flash outputs than using them, reduce the delegate count.
The goal is not to use all available tools every turn. The goal is to use the cheapest tool that can safely move the work forward.
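The budget law is an inequality, so it can be written down directly. This is a sketch, not a calibrated model: all quantities are rough estimates in one shared unit of your choosing (dollars, minutes, or points), and the `DelegateEstimate` class and `should_run` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DelegateEstimate:
    expected_value: float        # independent value you expect from this delegate
    token_cost: float
    latency_cost: float
    merge_cost: float            # lead's time to read, judge, and reconcile
    contradiction_cost: float    # cost of resolving conflicts with other delegates
    context_packing_cost: float  # cost of preparing the brief

    def total_cost(self) -> float:
        return (
            self.token_cost
            + self.latency_cost
            + self.merge_cost
            + self.contradiction_cost
            + self.context_packing_cost
        )


def should_run(estimate: DelegateEstimate) -> bool:
    """Run the delegate only if expected value strictly exceeds total cost."""
    return estimate.expected_value > estimate.total_cost()


# Hypothetical estimates in arbitrary points.
scout = DelegateEstimate(5.0, 1.0, 1.0, 1.0, 0.5, 0.5)
echo = DelegateEstimate(3.0, 1.0, 1.0, 1.0, 0.5, 0.5)
print(should_run(scout), should_run(echo))
```

The numbers will always be guesses. The value of writing them down is that merge cost and contradiction cost stop being invisible.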
Recommended defaults
Here is the blunt default table.
| Work type | Recommended route |
|---|---|
| Small bugfix | Flash triage or Pro solo |
| Normal feature | Pro + 1 Flash reviewer |
| Risky feature | Pro + implementation Flash + regression Flash + Gemma4 |
| Architecture decision | Pro + 2–4 Flash delegates + Gemma4 |
| Repo-wide audit | Pro lead + 4–8 Flash scouts |
| Prompt pack / docs | Flash lead, Pro review only if published or high-stakes |
| PixelBoats rendering/world systems | Pro lead, Flash delegates by lane, Gemma4 second-to-last sanity pass |
| Deployment/security/client-facing output | Pro lead, Flash risk review, no cheap-model final authority |
That is the shape I would actually run.
My bottom line
Use Flash as a multiplier, not the judge.
Use Gemma4 as a local junior engineer, not a veto authority.
Use Pro as the lead whenever the task has ambiguity, architecture risk, or final-code responsibility.
The best workflow is not:
many agents → magic
It is:
clear lead
+ bounded delegates
+ compact shared context
+ structured outputs
+ final synthesis by the strongest available model
That is the point where multi-agent work becomes useful instead of theatrical.
Series navigation
This article is now Part 1 of a longer agent-routing series:
- Part 1: Agent Mixing Without Theater — this baseline article.
- Part 2 draft: What changes when GPT-5.5 is the orchestrator.
- Part 3 draft: The expanded formula: Zen/M3, Puter panels, Gemma4, and specialist rotators.
- Planned mini-note: why the formula style is useful, but the corrected formula matters.
- Planned Part 4: Cathedral Edition / Prompt Operations framing — how Hermes makes the pattern operational instead of theoretical.
- Planned Part 5: field history, test runs, results, mistakes, and what changed after using the system for real.
Small footer note: links to Part 2 and Part 3 currently point at private draft previews. They should be switched to public article routes when those drafts are promoted.