Agent Mixing Without Theater: DeepSeek Pro, Flash, Gemma4, and the Law of Diminishing Returns

Published: June 3, 2026
Last updated: June 4, 2026
Series: Agent mixing / Hermes routing, Part 1

This article started as a blunt answer to one question: when you mix DeepSeek Pro as the orchestrator, DeepSeek Flash as parallel delegates, and Gemma4 as a local side channel, where does the law of diminishing returns kick in?

The answer is still simple:

More agents buy more independent context windows, more parallel search paths, and more chances to catch a mistake. They do not automatically buy better judgment.

This is Part 1 of the series. It sets the baseline rule: Pro leads hard synthesis, Flash multiplies bounded work, and Gemma4 stays useful as a local junior-engineer sanity pass.

Two follow-up drafts, Parts 2 and 3 in the series list at the end of this article, now extend the argument. Their links stay private until the drafts are promoted. The public principle is already stable: if the next agent does not get a distinct job, the next agent is probably noise.

Figure: Agent routing map. DeepSeek Pro leads hard synthesis, Flash handles bounded scouting and review, and Gemma4 provides a local junior-engineer sanity pass.

The short answer

For serious coding, architecture, and repo surgery:

DeepSeek Pro = orchestrator, final synthesizer, master coder
DeepSeek Flash = scouts, reviewers, cheap implementers, context packers
Gemma4 = local junior engineer sanity pass
Hermes/code = deterministic router wherever the routing rules are already known

Do not make Flash the main orchestrator for hard coding work unless the orchestration is basically a scripted router.

Use Flash as a lead only when the job is predictable:

classify the task
choose a profile
summarize logs
extract TODOs
run a known checklist
pack context for a stronger model
produce a first-pass draft that Pro will review

Use Pro as the lead when the work has ambiguity, risk, architecture judgment, cross-file dependencies, or final merge authority.

That is the practical split.

What changed with DeepSeek V4

DeepSeek’s V4 release makes this question more interesting because both Pro and Flash now have a long-context, agent-capable shape. The official pricing page lists both V4 Flash and V4 Pro with 1M context, JSON output, tool calls, and thinking-mode support. As of this update, the official prices are far apart enough to matter: Flash is listed at $0.14 per 1M cache-miss input tokens and $0.28 per 1M output tokens, while Pro is listed at $0.435 per 1M cache-miss input tokens and $0.87 per 1M output tokens.
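To make the price gap concrete, here is a minimal arithmetic sketch using the per-1M-token prices quoted above. The workload size (200k input tokens, 20k output tokens) is a made-up example, and the sketch assumes every input token is a cache miss.

```python
# Prices in USD per 1M tokens, copied from the figures quoted in this article.
PRICES = {
    "flash": {"input": 0.14, "output": 0.28},
    "pro": {"input": 0.435, "output": 0.87},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, assuming all input tokens are cache misses."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 200k tokens in, 20k tokens out.
flash_cost = run_cost("flash", 200_000, 20_000)
pro_cost = run_cost("pro", 200_000, 20_000)
print(f"flash=${flash_cost:.4f} pro=${pro_cost:.4f} ratio={pro_cost / flash_cost:.2f}x")
```

Because the input and output prices differ by the same factor, Pro comes out at roughly 3.1 times the cost of Flash for any token mix under these assumptions.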

So the temptation is obvious:

If Flash is much cheaper and reasonably capable, should Flash orchestrate everything and call Pro only as the expensive master coder?

Sometimes. But not by default.

DeepSeek’s own V4 release notes position Pro as the stronger agentic coding and reasoning model, while Flash is the faster, more economical model that can approach Pro on simpler agent tasks. That points to the shape of the system: Flash should multiply throughput, but Pro should own hard judgment.

The manager is not just the person who sends messages. In an agent workflow, the manager defines the problem, slices it, judges conflicts, and decides what gets merged. That is usually the cognitively expensive part.

Where diminishing returns starts

Diminishing returns start at the point where the next agent no longer receives an independent job.

Here is the test:

Add another delegate only if it can:
  1. inspect a different source,
  2. test a different hypothesis,
  3. own a different risk,
  4. search a different part of the repo,
  5. or compress context for a later model.

If the new agent is just reading the same prompt and producing another opinion, skip it.
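The admission test can be sketched as a small check. This is one possible encoding, not a Hermes feature: the `Job` type, the job-kind names, and the idea of comparing normalized targets are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# The five job kinds from the test above (names are illustrative).
JOB_KINDS = {
    "inspect_source",
    "test_hypothesis",
    "own_risk",
    "search_repo_area",
    "compress_context",
}


@dataclass(frozen=True)
class Job:
    kind: str    # one of JOB_KINDS
    target: str  # the source, hypothesis, risk, or repo area it owns


def _key(job: Job) -> tuple:
    """Normalize so that 'Regression ' and 'regression' count as the same target."""
    return (job.kind, job.target.strip().lower())


def should_add(candidate: Job, existing: List[Job]) -> bool:
    """Admit a delegate only if it has a recognized job nobody else already owns."""
    if candidate.kind not in JOB_KINDS:
        return False  # "another opinion on the same prompt" is not a job
    return _key(candidate) not in {_key(job) for job in existing}
```

A candidate described only as "second opinion" is rejected outright, and so is a second regression-risk delegate when one is already assigned.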

Anthropic’s multi-agent research writeup is useful here because it is blunt about the cost curve. Their research system found that multi-agent work can help by spending more tokens across separate context windows, but it also reported that agents used about 4× the tokens of normal chat interactions, while multi-agent systems used about 15×. The same article also notes that most coding tasks have fewer truly parallelizable pieces than research.

That is the painful little truth: multi-agent workflows are often strongest for broad research and audits, not for every normal coding task.

Figure: Diminishing returns curve for agent count, with a useful zone around two to four agents and a noisy zone past that. Agent count has a useful middle zone. Past that point, duplicate findings, merge cost, and contradiction handling can exceed the value of another delegate.

The useful agent-count curve

For coding work, I would use this curve:

Agent count	Usually worth it?	Best use
1	Yes	focused implementation, bugfixing, simple refactors
2–3	Usually	lead + reviewer + test/risk scout
4–6	Sometimes	architecture alternatives, repo audits, code/test/docs split
7–10	Rarely	broad research, large codebase scans, parallel discovery
10+	Almost never	only when each agent has a clearly separate corpus or job

The sweet spot for your Hermes-style work is probably:

C = Pro + Flash * 2 + Gemma4

That gives you a strong lead, two cheap independent passes, and one local sanity check without turning the job into a committee.

For major architecture work, use the plus version:

C = Pro + Flash * 3 or 4 + Gemma4

But make every Flash delegate own a different lane.

Why Flash should not usually be the top orchestrator

The orchestrator’s job is not only to distribute tasks. It has to decide what matters.

A weak orchestrator can waste a strong coder by asking the wrong question. That is the bad pattern:

Flash lead misunderstands scope
→ Pro receives narrow or incorrect delegate task
→ Pro solves the wrong thing well
→ system returns a polished mistake

This is why “cheap model as manager, expensive model as worker” is only safe when the route is deterministic.

A Flash-led workflow is fine for this:

Task arrives
→ classify as bugfix / article / audit / prompt pack
→ pick known profile
→ pack repo context
→ call Pro only if risk threshold is high

But for this:

PixelBoats projection architecture
SvelteKit PHP adapter correctness
multi-file refactor
security-sensitive deployment
AI Wiki search stack design

Pro should lead.
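The split between scripted routing and judgment-heavy work can be written as a plain function, which is the point: no model is needed to make this decision. The sketch below is an assumption-laden illustration. The task-kind labels, the numeric risk score, and the 0.5 threshold are hypothetical; only the model names come from the profiles used in this article.

```python
# Task kinds where the route is predictable enough for a Flash lead (illustrative labels).
SCRIPTED = {"classify", "summarize_logs", "extract_todos", "pack_context", "run_checklist"}

# Task kinds that need judgment, so Pro leads regardless of the risk score.
HIGH_JUDGMENT = {"architecture", "multi_file_refactor", "security", "deployment"}


def pick_lead(task_kind: str, risk: float, risk_threshold: float = 0.5) -> str:
    """Return the lead model for a task. `risk` is a hypothetical 0..1 score."""
    if task_kind in HIGH_JUDGMENT or risk >= risk_threshold:
        return "deepseek-v4-pro"
    if task_kind in SCRIPTED:
        return "deepseek-v4-flash"
    # Unknown task kinds are ambiguous by definition, so default to the strong lead.
    return "deepseek-v4-pro"
```

Defaulting unknown task kinds to Pro is a deliberate choice in this sketch: a misrouted cheap lead is the failure mode described above, so the fallback errs toward the stronger model.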

OpenAI’s handoff model and Google Cloud’s agent design guidance both point toward the same idea: handoffs are useful when specialists own distinct tasks, and the architecture should be selected based on complexity, latency, cost, autonomy, and workload shape. Do not use an agent hierarchy when a plain workflow would do.

That is the anti-theater rule.

What each model should do

DeepSeek Pro

Use Pro for:

task decomposition
architecture decisions
final patch design
final code generation
conflict resolution between delegates
root-cause debugging
long-context synthesis
“should we do this at all?” decisions

Pro should be the voice that says:

“This is the plan, these are the tradeoffs, this is the minimal safe patch, and these are the tests that matter.”

That is the role worth paying for.

DeepSeek Flash

Use Flash for:

first-pass repo scan
source extraction
codebase inventory
TODO/risk gathering
alternate implementation sketch
test plan draft
docs summary
prompt/context packing
cheap review pass
“what did Pro miss?” checks

Flash should not be vague. Give it a lane.

Bad Flash prompt:

Review this architecture.

Better Flash prompt:

You are the regression-risk delegate.
Only look for test gaps, cross-file breakage, and migration hazards.
Return:
1. likely breakpoints
2. missing tests
3. smallest verification plan
4. anything that should block merge
Do not rewrite the architecture.

That makes Flash useful.

Gemma4

Use Gemma4 as a local junior engineer delegate.

Not a skeptic. Not the boss. Not the final reviewer.

Give it jobs like:

explain the implementation in simpler terms
spot obvious readability problems
suggest pseudocode
flag “this will be hard to maintain”
check whether a visual/UI plan makes sense
propose small practical alternatives
notice if the workflow is overbuilt

Gemma is useful precisely because it is local and cheap enough to use as a side-channel. Google DeepMind describes Gemma open models as deployable across cloud servers, laptops, and phones, which maps well to an inexpensive second-pass lane rather than final authority.

The better orchestration pattern

I would design the default Hermes routes like this:

profiles:
  flash_triage:
    lead: deepseek-v4-flash
    use_when:
      - classify task
      - summarize logs
      - extract TODOs
      - pack context
      - decide whether Pro is needed
 
  normal_coding:
    lead: deepseek-v4-pro
    delegates:
      - deepseek-v4-flash: regression_review
    optional:
      - gemma4-e4b-local: junior_sanity_pass
 
  serious_coding:
    lead: deepseek-v4-pro
    delegates:
      - deepseek-v4-flash: implementation_path
      - deepseek-v4-flash: regression_risk
      - gemma4-e4b-local: junior_sanity_pass
 
  architecture_plus:
    lead: deepseek-v4-pro
    delegates:
      - deepseek-v4-flash: implementation_path
      - deepseek-v4-flash: testing_and_failure_modes
      - deepseek-v4-flash: context_reduction
      - deepseek-v4-flash: alternative_design
      - gemma4-e4b-local: junior_sanity_pass

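These routes are all instances of one general formula, where x is the number of Flash delegates, G4 is Gemma4, and ?Y is an optional outside specialist:

C = Pro + Flash * x + G4 + ?Y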

I would set the default values this way:

normal:
  C = Pro + Flash * 1 + optional G4
 
serious:
  C = Pro + Flash * 2 + G4
 
plus:
  C = Pro + Flash * 3 or 4 + G4 + optional outside specialist

Keep ?Y rare. It should mean “different model family for a real reason,” not “one more opinion because we can.”
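As a rough sketch, the three defaults can be expanded into concrete crews. The function name, its parameters, and the choice to reject ?Y outside the plus tier are my reading of the defaults above, not an existing Hermes API.

```python
from typing import List, Optional

PRO = "deepseek-v4-pro"
FLASH = "deepseek-v4-flash"
GEMMA = "gemma4-e4b-local"


def build_crew(
    tier: str,
    gemma: bool = False,               # only consulted for the "normal" tier
    plus_flash: int = 3,               # the "plus" tier allows 3 or 4 Flash delegates
    specialist: Optional[str] = None,  # the ?Y slot
) -> List[str]:
    """Expand C = Pro + Flash * x + G4 + ?Y for one of the default tiers."""
    if tier == "normal":
        flash_count, use_gemma = 1, gemma
    elif tier == "serious":
        flash_count, use_gemma = 2, True
    elif tier == "plus":
        if plus_flash not in (3, 4):
            raise ValueError("plus tier uses 3 or 4 Flash delegates")
        flash_count, use_gemma = plus_flash, True
    else:
        raise ValueError(f"unknown tier: {tier}")

    crew = [PRO] + [FLASH] * flash_count
    if use_gemma:
        crew.append(GEMMA)
    if specialist is not None:
        if tier != "plus":
            raise ValueError("?Y is reserved for the plus tier")
        crew.append(specialist)
    return crew
```

For example, `build_crew("serious")` yields one Pro lead, two Flash delegates, and the local Gemma4 pass.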

The merge-cost problem

Every delegate creates a merge problem.

You have to read the result, judge it, reconcile contradictions, decide whether it affects the plan, and carry forward the useful bits. That synthesis cost is real.

A multi-agent system fails when all delegates are allowed to produce open-ended essays.

Make delegates return structured outputs instead:

role:
finding:
evidence:
confidence:
risk:
recommended_action:
blocks_merge: yes/no

Or for coding:

files_to_touch:
files_to_avoid:
likely_breakpoints:
tests_to_run:
smallest_safe_patch:
open_questions:

This is boring. Good. Boring structure is how you stop the agent room from becoming improv night.
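One payoff of structured outputs is that part of the merge can be done in code before the lead reads anything. The sketch below mirrors the first field list as a dataclass and collapses duplicate findings. Treating confidence as a 0..1 float and matching duplicates by normalized finding text are assumptions; real duplicate detection would need something smarter.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Finding:
    role: str
    finding: str
    evidence: str
    confidence: float  # assumed to be a 0..1 score
    risk: str
    recommended_action: str
    blocks_merge: bool


def merge_findings(findings: Iterable[Finding]) -> List[Finding]:
    """Keep the highest-confidence copy of each finding, merge blockers first."""
    best: Dict[str, Finding] = {}
    for item in findings:
        # Naive duplicate key: lowercase text with whitespace collapsed.
        key = " ".join(item.finding.lower().split())
        if key not in best or item.confidence > best[key].confidence:
            best[key] = item
    # Blockers sort first, then higher confidence before lower.
    return sorted(best.values(), key=lambda f: (not f.blocks_merge, -f.confidence))
```

The lead model then reads a short, ordered list instead of several overlapping essays.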

The context-sharing rule

Do not start delegates with no context.

But also do not dump the full conversation into every delegate.

Use a compact shared brief:

Project:
Current goal:
Known decisions:
Non-negotiables:
Files/surfaces in scope:
Files/surfaces out of scope:
What the lead already believes:
Your delegate role:
Output format:

That is enough to prevent token creep without starving the delegate.
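A minimal sketch of enforcing that brief in code, assuming the lead fills in a plain dictionary keyed by the field names above. The function name and the decision to fail on any empty field are illustrative choices, not part of Hermes.

```python
from typing import Dict

# Field names copied from the shared brief above, in order.
BRIEF_FIELDS = [
    "Project",
    "Current goal",
    "Known decisions",
    "Non-negotiables",
    "Files/surfaces in scope",
    "Files/surfaces out of scope",
    "What the lead already believes",
    "Your delegate role",
    "Output format",
]


def render_brief(values: Dict[str, str]) -> str:
    """Render the brief as 'Field: value' lines, refusing to send an incomplete one."""
    missing = [name for name in BRIEF_FIELDS if not values.get(name)]
    if missing:
        raise ValueError("brief is missing: " + ", ".join(missing))
    return "\n".join(f"{name}: {values[name]}" for name in BRIEF_FIELDS)
```

Failing loudly on a missing field is the cheap part of the contract: a delegate that never learns its role or its out-of-scope files is the one most likely to drift.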

For PixelBoats-style work, this matters a lot. If the Flash delegate does not know the projection source of truth lives in the Perspective Lab winner specs, it may confidently re-litigate the wrong thing. If Gemma4 does not know it is a junior implementation/readability pass, it may drift into general brainstorming.

Agent systems are only as good as their context contracts.

The simplest budget law

Use this rule:

Run another delegate only if:
  expected independent value
  >
  token cost + latency + merge cost + contradiction cost + context-packing cost

The moment duplicate findings dominate, stop adding agents.

The moment every delegate needs the same full context, stop adding agents.

The moment Pro spends more time cleaning up Flash outputs than using them, reduce the delegate count.

The goal is not to use all available tools every turn. The goal is to use the cheapest tool that can safely move the work forward.
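The budget rule is simple enough to write down directly. In this sketch every term is an estimate in the same made-up unit (dollars, minutes, or points); the class and field names are assumptions, and the hard part in practice is producing honest estimates, not evaluating the inequality.

```python
from dataclasses import dataclass


@dataclass
class DelegateEstimate:
    expected_value: float        # expected independent value of the delegate
    token_cost: float
    latency_cost: float
    merge_cost: float
    contradiction_cost: float
    context_packing_cost: float

    def total_cost(self) -> float:
        """Sum of every cost term on the right-hand side of the rule."""
        return (
            self.token_cost
            + self.latency_cost
            + self.merge_cost
            + self.contradiction_cost
            + self.context_packing_cost
        )


def worth_running(estimate: DelegateEstimate) -> bool:
    """Run the delegate only if expected value strictly exceeds total cost."""
    return estimate.expected_value > estimate.total_cost()
```

Note the strict inequality: a delegate that merely breaks even is not worth the added coordination.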

Recommended defaults

Here is the blunt default table.

Work type	Recommended route
Small bugfix	Flash triage or Pro solo
Normal feature	Pro + 1 Flash reviewer
Risky feature	Pro + implementation Flash + regression Flash + Gemma4
Architecture decision	Pro + 2–4 Flash delegates + Gemma4
Repo-wide audit	Pro lead + 4–8 Flash scouts
Prompt pack / docs	Flash lead, Pro review only if published or high-stakes
PixelBoats rendering/world systems	Pro lead, Flash delegates by lane, Gemma4 second-to-last sanity pass
Deployment/security/client-facing output	Pro lead, Flash risk review, no cheap-model final authority

That is the shape I would actually run.

My bottom line

Use Flash as a multiplier, not the judge.

Use Gemma4 as a local junior engineer, not a veto authority.

Use Pro as the lead whenever the task has ambiguity, architecture risk, or final-code responsibility.

The best workflow is not:

many agents → magic

It is:

clear lead
+ bounded delegates
+ compact shared context
+ structured outputs
+ final synthesis by the strongest available model

That is the point where multi-agent work becomes useful instead of theatrical.


Series navigation

This article is now Part 1 of a longer agent-routing series:

Part 1: Agent Mixing Without Theater — this baseline article.
Part 2 draft: What changes when GPT-5.5 is the orchestrator.
Part 3 draft: The expanded formula: Zen/M3, Puter panels, Gemma4, and specialist rotators.
Planned mini-note: why the formula style is useful, but the corrected formula matters.
Planned Part 4: Cathedral Edition / Prompt Operations framing — how Hermes makes the pattern operational instead of theoretical.
Planned Part 5: field history, test runs, results, mistakes, and what changed after using the system for real.

Small footer note: links to Part 2 and Part 3 currently point at private draft previews. They should be switched to public article routes when those drafts are promoted.
