AI-Native Work 2026: The Trust Gap, Harness Engineering, and the Knowledge Flywheel

Creator: Tencent Research Institute
Published: 2026-06-05T00:00:00.000Z
Keywords: AI-native work, harness engineering, trust gap, multi-agent, knowledge engineering, deskilling, context engineering

Tencent Research Institute decodes the human–AI collaboration dilemma through ten keywords: it isn't a capability problem but a system-design problem—and what ultimately compounds into a moat is the knowledge network you weave in practice.

Tencent Research Institute's AI-Native Work Report 2026 uses the Trust Gap as its prism to trace the learning curve from 'using AI' to 'mastering AI': adoption of AI coding tools rose to 84% while trust in them fell to 29%, and users who felt 20% faster were actually 19% slower. The report argues the human–AI collaboration dilemma is not a capability problem but a system-design problem—the way out is Harness Engineering (designing the environment AI works in, not babysitting its output), rebuilt on four pillars (memory, skills, evaluation, context) across end-to-end workflows. What ultimately compounds into a competitive moat is the Knowledge Flywheel: writing practice down, remembering it, and connecting it.

29%

Trust in AI coding tools — while adoption rose to 84% (the trust scissors) · Stack Overflow

39pts

Gap between feeling '20% faster' and actually being '19% slower' · METR

1.9×

Startups that redesigned their workflow earned 1.9× the control group · INSEAD

79%

Share of multi-agent production failures rooted in spec and coordination · UC Berkeley

Panorama: A Learning Curve Disguised as a Trust Problem

Tencent Research Institute uses the Trust Gap as a prism and ten keywords—harness engineering, memory, skills, evaluation, context, workflow, multi-agent, addition bias, deskilling, knowledge engineering—to draw the learning curve from 'using AI' to 'mastering AI.' The thesis is one sentence: the human–AI collaboration dilemma is not a capability problem but a system-design problem. What ultimately compounds into a moat is not a model or a tool, but the knowledge network a person weaves through practice.

The report's spine: five layers from diagnosis to distillation

Diagnosis: the trust gap

Adoption ≠ trust → runaway behavior → distorted perception → organizational fault lines.

Method: harness engineering

Design the environment AI runs in, not the output. Four pillars: memory · skills · evaluation · context.

System: workflow · multi-agent

Redesign end-to-end, don't just speed up single tasks. Master one agent before scaling to many.

Caution: addition bias · deskilling

AI amplifies our instinct to add; judgment quietly atrophies.

Distillation: the knowledge flywheel

Write it down → remember it → connect it → spin the flywheel.

Diagnosis: The Trust Gap

The trust gap shows up as four escalating symptoms: adoption and trust moving in opposite directions, a 'say one thing, do another' behavior gap, a misperception of one's own speed, and a fault line between leadership and the front line. Together they show the problem isn't that AI is too weak, but that we haven't yet learned to work with a probabilistic colleague.

The Trust Scissors: The More We Use It, The Less We Trust It

Stack Overflow's annual developer survey shows adoption of AI coding tools rising from 70% in 2023 to 84% in 2025, while trust in them fell from 40% to 29%—a pair of scissors opening ever wider.

AI coding tools: adoption↑ and trust↓ (2023 → 2025) (%)

The Behavior Paradox: 96% Don't Trust It, 48% Don't Check It

Say they distrust it vs actually check it (Sonar survey)

Don't fully trust AI code correctness

96%

Always review AI code before committing

48%

Find reviewing AI code harder than human code

38%

Almost everyone says 'I don't trust AI code,' yet half of them click commit anyway. Behind it is a hidden cost transfer: AI removes the burden of writing but adds the burden of checking; when checking costs more than expected, many simply stop checking.

The Perception–Reality Crack: More Confident, but Worse

Speed with AI: self-perception vs reality (METR 2025)

Self-perception

+20%

felt faster

Reality

-19%

actually slower

METR had 16 senior open-source developers use frontier models in repos they'd contributed to for years: they felt 20% faster but were measured 19% slower—a 39-percentage-point gap between perception and reality. An earlier Stanford study (CCS 2023) found people using an AI assistant wrote more insecure code yet rated the AI higher. A telling case: AI produced 300 lines of syntactically perfect infrastructure code whose referenced resources and config were mostly fabricated.

Trust Dynamics in Three Phases: Stronger After the Collapse

Trust forms → is shocked → is repaired (Seoul National University, n = 189 + 294)

Form

Based on capability cues

On first contact, trust is usually high.

Shock

One visible error

Trust drops off a cliff—people tolerate AI errors far less than human ones (the 'perfect automation schema').

Repair

Explain + bound

Explaining the cause and naming the limits can push trust above its original baseline (the 'trust acceleration paradox').

The Organizational Fault Line: Executive Enthusiasm vs Frontline Coolness

Enthusiasm for AI: what executives assume vs what the front line feels (BCG + Columbia Business School)

Executives think employees are enthusiastic

76%

Frontline employees who actually feel that way

31%

95%

Enterprise AI pilots with no measurable business return (MIT Sloan)

7%

Truly integrated into the business (88% use AI, only 7% integrate it)

42%

Executives admitting AI is 'tearing the company apart'

The fault line runs deeper than perception: only 44% of employees have had AI training, yet 57% won't tell their team they use AI and 31% actively undermine AI rollouts. The MIT researchers' conclusion is blunt: the core barrier to scaling is not infrastructure, not regulation, not talent—it's learning.

Four Kinds of Organizational Resistance: Not Conservatism, but Signal

Type	What it's really saying	Example
Tool resistance	Tried it, found it didn't work	A legal team refuses a contract-analysis AI to protect the company
Strategy resistance	AI is deployed where the value isn't	Half the budget goes to sales/marketing; the highest return is in back-office automation
Trust resistance	Leaders say 'augment you' while announcing layoffs	Not technophobia, but a rational response to contradictory signals
Capability resistance	Using it but unsure if correctly	44% trained, 57% won't admit they use AI

Employee resistance is often the organization sending a signal (Forbes Technology Council).

Method: Harness Engineering

If the trust gap is a system-design problem, the answer is a different way of working: stop optimizing 'how you talk to AI' and instead design the environment AI works in—constraints, feedback, verification, state management. That is the keyword of 2026: harness engineering.

Three Generations of Paradigm: From Talking to Building the Environment

Prompt → context → harness engineering

2023

Prompt engineering

Optimize how you talk to AI—wording and examples. Low ceiling: it only tunes a one-shot input.

2025

Context engineering

Optimize what background AI sees (pushed by Karpathy). Still about 'what to show the AI.'

2026

Harness engineering

Design AI's runtime environment: constraints, feedback, verification, state. A qualitative shift—AI works inside the office you built.

The Harness Quadrant: Most People Only Did Feedforward

Feedforward × feedback · deterministic × reasoning (Boeckeler / ThoughtWorks)

Direction axis: feedforward ↔ feedback

Feedforward × deterministic

Templates, specs, scaffolding

Feedforward × reasoning

AGENTS.md, design principles, values

Feedback × deterministic

Linters, tests, pre-commit hooks

Feedback × reasoning

AI cross-review, expert review

Execution axis: deterministic ↔ reasoning

Four Stages of Human–AI Collaboration: From 'In the Loop' to 'On the Loop'

Where the human sits in the collaboration (Fowler's team)

Outside the Loop

Hand off the task and hope for the best—'vibe coding.'

In the Loop

Review every line of output; the human becomes the bottleneck—the common 'AI saved no time' complaint.

On the Loop

Don't fix the output; fix the system that produces it—shift from doer to environment designer.

Agentic Flywheel

Use the agent to improve the harness itself; the harness begins to self-iterate.

The Three Economics of Constraints

Rules, tools, architecture—three different games

Rule constraints

zero-sum

Text instructions eat context: hand-written ≤60 rules +4%, AI-generated 200+ rules −3% (ETH Zurich + Tsinghua)

Tool constraints

positive-sum

Run outside the context window, cost no attention, execute deterministically—each violation is a 'micro-training'

Architectural constraints

multiplier

3 engineers, 5 months, 0 hand-written lines → a million-line production product (OpenAI Codex)
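A tool constraint of the kind described above can be sketched as a small deterministic check that runs outside the model's context window and returns a message the agent can act on. This is only an illustration: the `FORBIDDEN_MODULES` policy and the function names are hypothetical, not taken from the report or from any specific harness.

```python
import ast

# Hypothetical policy: modules the generated code must not import.
FORBIDDEN_MODULES = {"subprocess", "pickle"}


def find_forbidden_imports(source: str) -> list[str]:
    """Return the sorted forbidden top-level modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in FORBIDDEN_MODULES:
                found.add(root)
    return sorted(found)


def check(source: str) -> tuple[bool, str]:
    """Deterministic gate: pass/fail plus feedback that can be returned to the agent."""
    violations = find_forbidden_imports(source)
    if violations:
        return False, "forbidden imports: " + ", ".join(violations)
    return True, "ok"
```

Because the rule lives in code rather than in the prompt, it takes up no context and gives the same answer every run; only the short failure message needs to be shown to the agent.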

Four Pillars: Memory · Skills · Evaluation · Context

Memory ① Precise Forgetting: Cut 45% of Storage, Keep 82% of Key Facts

The core challenge of a memory system is not 'how to remember more' but 'how to forget more precisely.' FadeMem mimics the Ebbinghaus forgetting curve (long-term half-life ~11 days, short-term ~5 days), keeping the highest key-fact retention while sharply cutting storage.

Key-fact retention by memory method (LTI-Bench) (%)

82.1%

FadeMem key-fact retention (using only 55% of storage)

45%

Reduction in storage

29.43

Multi-hop F1 with memory management (vs just 5.17 without, a roughly 5.7× gap)
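The forgetting-curve idea behind the half-lives quoted above can be sketched with plain exponential decay. This is a simplified assumption for illustration, not FadeMem's actual formula; the `prune` helper, the memory record layout, and the 0.25 threshold are all hypothetical.

```python
# Half-lives quoted in the text (days); everything else here is illustrative.
LONG_TERM_HALF_LIFE_DAYS = 11.0
SHORT_TERM_HALF_LIFE_DAYS = 5.0


def retention(age_days: float, half_life_days: float) -> float:
    """Exponential decay: strength halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def prune(memories: list[dict], threshold: float = 0.25) -> list[dict]:
    """Keep memories whose decayed strength is still at or above `threshold`."""
    kept = []
    for m in memories:
        half_life = (
            LONG_TERM_HALF_LIFE_DAYS if m["long_term"] else SHORT_TERM_HALF_LIFE_DAYS
        )
        if retention(m["age_days"], half_life) >= threshold:
            kept.append(m)
    return kept
```

Under this sketch, a 12-day-old short-term memory has decayed below the threshold and is dropped, while a long-term memory of the same age is kept. A real system would presumably also weight importance and access frequency.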

Memory ② Three Engineering Routes, No One-Size-Fits-All

Route	Mechanism	Cognitive-science analog
A · Selective fact extraction (Mem0)	Auto-extract discrete facts from dialogue, dedupe and update—'what I know'	Semantic memory
B · Document self-management (Anthropic)	The agent maintains its own document set, deciding what to write and how to organize it—'what happened'	Episodic memory (many small focused files > a few big ones)
C · Structured knowledge graph (Neo4j / Zep)	Entities + relations + temporal graph, including inferential memory—'what relates to what'	Relational memory

The three routes trust different things and map to different memory types in cognitive science; Tencent's approach is one underlying dataset with different mechanisms at call time.

Skills: From Bloat to Lean—Over 60% Is Noise

SkillReducer analyzed 55,315 public Skills and found over 60% of their content is noise. By composition, truly actionable rules make up only 38.5%; background explanation is actually the largest slice at 40.7%.

Composition of public Skills (SkillReducer) (%)

Actionable rules: 38.5%
Background explanation: 40.7%
Examples: 12.9%
Templates & redundancy: 7.9%

Tool-selection accuracy falls off a cliff with tool count (Berkeley BFCL)

4 tools

43%

51 tools

2%

a cliff-edge collapse

Evaluation: Separate the Generator from the Judge

Same model: self-rating in the same context vs independent review in a new context

Self-rating, same context

100%

perfect pass (107 training samples)

Independent review, new context

5.5/10

exposes 5 serious flaws

The fix, inspired by GAN-style adversarial feedback, is the PGE three-role architecture: Planner + Generator + Evaluator. In Anthropic's engineering practice, each round runs 5–15 iterations, sometimes for up to 4 hours, with the evaluator wired to browser-automation tests. But AI judges aren't reliable: in controlled experiments they agree with humans >80% of the time, yet error rates exceed 50% in production (four systematic biases). RAND's 2026 conclusion: no AI judge stays consistently reliable across benchmarks.
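The Planner + Generator + Evaluator split described above can be sketched as a loop in which the evaluator sees only the spec and the artifact, never the generator's working context. This is a minimal sketch under stated assumptions, not Anthropic's implementation: `Verdict`, `run_pge`, and the `toy_*` stand-ins for model calls are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    feedback: str


def run_pge(
    task: str,
    plan: Callable[[str], str],
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Verdict],
    max_iterations: int = 15,
) -> tuple[str, int, bool]:
    """Plan once, then loop generate -> evaluate until the evaluator passes.

    The evaluator receives only the spec and the artifact, so it cannot
    inherit the generator's framing of its own work.
    """
    spec = plan(task)
    feedback = ""
    artifact = ""
    for i in range(1, max_iterations + 1):
        artifact = generate(spec, feedback)
        verdict = evaluate(spec, artifact)
        if verdict.passed:
            return artifact, i, True
        feedback = verdict.feedback
    return artifact, max_iterations, False


# Toy stand-ins for model calls (hypothetical), so the loop can be run as-is.
def toy_plan(task: str) -> str:
    return f"spec: {task}"


def toy_generate(spec: str, feedback: str) -> str:
    return "draft v2" if feedback else "draft v1"


def toy_evaluate(spec: str, artifact: str) -> Verdict:
    return Verdict(artifact.endswith("v2"), "add error handling")
```

With the toy functions, `run_pge("build a form", toy_plan, toy_generate, toy_evaluate)` fails on the first draft and passes on the second. In a real harness the evaluator would be a separate model call in a fresh context, ideally backed by deterministic tests.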

Three-Layer Trust Gradient (the Swiss-Cheese Model)

Deterministic checks + AI review + human judgment—no single layer is enough (Anthropic)

Human judgment · roof

Widest coverage, slowest and most expensive—fix the system, not the output.

AI review · floors

Catches semantic issues, but carries position, verbosity, self-preference and family biases.

Deterministic checks · foundation

Most trustworthy, narrowest coverage—hard constraints as a backstop.

Context: Degrading From the Very First Token

Chroma tested 18 frontier models, fixing task difficulty and varying only input length, and found performance degrading from the first output token—no exceptions; shuffling sentence order even made every model perform better (structured filler distracts the model). SWE-rebench maintainers observed a performance ceiling around 1 million tokens. This is the n² cost of the Transformer: 10 tokens make 100 pairs, 10,000 tokens make 100 million.
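The quadratic growth quoted above is easy to verify with a few lines of arithmetic. The sketch assumes full (dense) self-attention where every token attends to every token; the function name is illustrative.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over `num_tokens` tokens."""
    return num_tokens * num_tokens


for n in (10, 10_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} pairs")
```

Growing the input 1,000-fold grows the number of pairs a million-fold, which is why trimming context tends to pay off more than intuition suggests.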

Token usage before vs after MCP lazy-loading (Claude Code 2.1.7)

Before lazy-loading

~77,000

tokens

After lazy-loading

~8,700

85% less—and accuracy rose too
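The general idea of lazy-loading can be sketched as a registry that puts only short tool names in the prompt and fetches a full description the first time a tool is actually needed. This is a hypothetical illustration of the pattern, not how Claude Code or MCP implements it; `LazyToolRegistry` and its methods are invented for the example.

```python
from typing import Callable


class LazyToolRegistry:
    """Expose short tool names up front; load full descriptions only on demand."""

    def __init__(self) -> None:
        self._loaders: dict[str, Callable[[], str]] = {}
        self._loaded: dict[str, str] = {}

    def register(self, name: str, loader: Callable[[], str]) -> None:
        """Store a loader without calling it, so registration stays cheap."""
        self._loaders[name] = loader

    def index(self) -> list[str]:
        """Cheap listing that would go into the prompt by default."""
        return sorted(self._loaders)

    def describe(self, name: str) -> str:
        """Load (and cache) the full description the first time a tool is needed."""
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded_count(self) -> int:
        """How many full descriptions have actually been pulled into memory."""
        return len(self._loaded)
```

The default context cost then scales with the number of tool names rather than with the total size of every tool's schema.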

System: Workflows and Multi-Agent

Workflow Redesign: INSEAD's 1.9× Gap

INSEAD and Harvard Business School ran a 10-week RCT on 515 global startups: same people, same tools, same training. The only difference was whether they re-examined the whole process.

Treatment vs control (INSEAD + Harvard RCT, 515 startups)

Total revenue

1.9×

1.9× the control group

AI use-cases found

+44%

Probability of a paying customer

+18pt

Capital required

39.5%

lower—same people, less money

The Task Length AI Can Handle Is Growing Exponentially

Task length vs success rate (METR, today's strongest models)

Tasks under 4 minutes

~100%

near-100% success

Tasks over 4 hours

<10%

success drops below 10%

Over the past six years, the task length AI can complete independently has doubled roughly every 7 months. Tellingly, three independent projects with no knowledge of each other—Manus (task_plan.md), OpenClaw (MEMORY.md), and Claude Code (CLAUDE.md + Skills)—all converged on a 'file system' solution to manage long tasks.
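To make the doubling claim concrete, the arithmetic can be written out. The sketch assumes a perfectly steady 7-month doubling period, which is a simplification of the observed trend; the function names are illustrative.

```python
import math


def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Multiplier on achievable task length after `months` of steady doubling."""
    return 2.0 ** (months / doubling_months)


def months_to_reach(factor: float, doubling_months: float = 7.0) -> float:
    """Months of steady doubling needed to multiply task length by `factor`."""
    return doubling_months * math.log2(factor)


six_years = growth_factor(72)        # a bit over ten doublings
to_60x = months_to_reach(60)         # e.g. from a 4-minute task to a 4-hour task
```

Six years at this rate is a little more than ten doublings, on the order of a thousandfold increase. The second helper shows the kind of question the trend invites (how long until tasks 60× longer are in reach), but any such extrapolation assumes the trend simply continues.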

The Four Multi-Agent Traps

1,642

Multi-agent trajectories analyzed (UC Berkeley)

79%

Failures rooted in spec and coordination (production failure rate 41%–86.7%)

3–5

Anthropic's recommended span of control for research tasks (1–2 for coding)

Trap	What it looks like
① Over-delegation	A single well-equipped agent handles most cases; splitting blindly only adds chaos
② Under-specification	Write each brief like a ticket for a first-day junior engineer; otherwise sub-agents re-investigate the same ground
③ Coordination overhead	Dispatch + execute + synthesize can mean 5–10 API calls; a single agent needs just 1–2
④ Telephone-game effect	Information decays at each handoff—have sub-agents write to the file system directly

Across 1,642 trajectories, 79% of failures come from spec and coordination—not model capability.
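The remedy for the telephone-game effect in the table above (sub-agents writing to the file system directly) can be sketched as follows. Each sub-agent persists its full findings and hands back only a path, so the orchestrator reads the original artifact rather than a relayed summary. The function names and the sample findings are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def sub_agent(workdir: Path, name: str, findings: dict) -> Path:
    """Write full findings to disk and hand back only the path (a lightweight reference)."""
    out = workdir / f"{name}.json"
    out.write_text(json.dumps(findings), encoding="utf-8")
    return out


def orchestrator(paths: list[Path]) -> dict:
    """Read each artifact verbatim instead of relying on relayed summaries."""
    return {p.stem: json.loads(p.read_text(encoding="utf-8")) for p in paths}


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    refs = [
        sub_agent(workdir, "pricing", {"claim": "tier B is cheapest", "sources": 3}),
        sub_agent(workdir, "latency", {"claim": "p95 under 200 ms", "sources": 2}),
    ]
    merged = orchestrator(refs)
```

Only the short paths travel between agents, so nothing is paraphrased at each handoff and the full detail remains available to whoever needs it later.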

The real value of multi-agent isn't 'smarter' but more parallel compute: on OpenAI's BrowseComp benchmark, Anthropic found that token usage alone explains 80% of the performance variance. So the rule is: master one agent, then scale to many.

Caution: Addition Bias and Deskilling

Addition Bias: Humans 60% vs AI 88–100%

Share that reaches for 'adding' rather than 'subtracting' (Uhler 2026)

Humans

~60%

GPT-4o

88–100%

amplifies the human bias further

English corpus word frequency: add vs subtract (University of Birmingham 2023)

add / more words

361,246

subtract / less words

1,802

Nature 2021's eight experiments found that without a prompt only 41% of people thought of subtraction, rising to 61% after an 8-word hint that 'removing is free.' AI is trained on language, and language itself leans toward 'adding', which produces a double bias: the human instinct plus the AI amplification. Worse, given a problem where subtraction is more efficient, GPT-4o used addition even more: the efficiency signal backfired.

Deskilling: The AI Group Scored 17 Points Lower

Quiz score after learning a new Python library (Anthropic RCT 2026.01, n=52)

AI group

50%

Control group

67%

17 points higher

The difference isn't 'AI or not' but how much cognitive engagement there is. The three lowest-scoring modes were: AI delegation (hand everything to AI), gradual dependence (give it all up step by step), and iterative debugging (paste the error without asking why). The three highest-scoring modes: concept queries (ask only concepts, write it yourself), mixed code-and-explanation (ask for both), and generate-then-understand (let AI generate, then probe until you get it).

Colonoscopy detection rate: after 3 months with AI, then with AI removed (Lancet 2025)

With AI assistance

28.4%

After removing AI

22.4%

the skill never truly took root

Distillation: The Knowledge Flywheel

Methods date, tools iterate—what remains and compounds is the knowledge network you weave. Knowledge engineering is a spiral: write down the ways of working you taught the AI (Skills, solving 're-teaching'), remember decisions and lessons from failure (Memory, solving 're-forgetting'), then connect the accumulation into relationships and causal chains (connection is where the real value is).

Write it down → remember it → connect it → flywheel

Write

Skill

Turn the ways of working you taught AI into reusable templates.

Remember

Memory

Leave a trace of decisions, preferences, and lessons from failure.

Connect

Knowledge Eng.

Let the accumulation form relationships and causal chains.

Spin

Flywheel

Context → knowledge base; instructions → Skills; workflow → runbook; failure → guardrail.

A structured knowledge graph raises accuracy (OpenKG, same model — DeepSeek)

Without graph

80.7%

With graph

86.1%

better structured knowledge = higher accuracy

8×

Growth in AI-generated code duplication (GitClear)—assets rot

86%

Less drift from incremental updates vs full rebuilds (Stanford ACE)

10–15

A sensible number of personal Skills (not 89)

The era of big data is over; the era of big knowledge is beginning.
— KPMG 2026 white paper

Conclusion: The Moat Isn't the Model—It's the Network You Weave

String the ten keywords into one line: the trust gap forces us to admit this is a system-design problem; harness engineering moves us from babysitting output to designing the environment; memory, skills, evaluation and context are the four pillars; workflows and multi-agent string them into a system; addition bias and deskilling warn us not to quietly lose our judgment; and knowledge engineering is the flywheel that distills all of it into a moat.

What ultimately compounds into a moat is never a model or a tool—those all date. What truly belongs to you is the knowledge network you write down, remember and connect across every round of practice. Every data point is already the seed of a chart.

Evidence Pool: Key Figures at a Glance

Metric	Value	Period	Source
AI coding tool adoption	70% → 84%	2023 → 2025	Stack Overflow
Trust in AI coding tools	40% → 29%	2023 → 2025	Stack Overflow
Don't fully trust AI code correctness	96%	2025	Sonar
Always review AI code before committing	48%	2025	Sonar
METR actual speed / self-perception	-19% / +20%	2025	METR
Perception–reality gap	39 pts	2025	METR
AI pilots with no measurable return	95%	2026	MIT Sloan
Use AI / truly integrated	88% / 7%	2026	MIT Sloan
Executives assume / front line feels	76% / 31%	2026	BCG + Columbia
Executives admit AI tears the company apart	42%	2026	BCG + Columbia
Hand-written ≤60 rules / AI 200+ rules	+4% / -3%	2026	ETH Zurich + Tsinghua
Tencent rule slimming	200 → 50 lines	2026	Tencent RI
Codex team: engineers / time / hand code	3 / 5 mo / 0 lines	2026	OpenAI Codex
FadeMem key-fact retention / storage cut	82.1% / 45%	2026	FadeMem
F1 with / without memory management	29.43 / 5.17	2026	FadeMem
Public Skills analyzed	55,315	2026	SkillReducer
Compressed / original capability score	0.742 / 0.722	2026	SkillReducer
Tool-selection accuracy (4 / 51 tools)	43% / 2%	2026	Berkeley BFCL
LLM-as-Judge agreement / production error	>80% / >50%	2026	Multiple studies
MCP tokens before / after lazy-loading	~77,000 / ~8,700	2026	Claude Code
INSEAD treatment-group revenue multiple	1.9×	2026	INSEAD + Harvard
INSEAD treatment-group capital reduction	39.5%	2026	INSEAD + Harvard
Use AI / think workflow is efficient	84% / 21%	2025	Telerik
AI task-length doubling period	~7 months	past 6 yrs	METR
Multi-agent trajectories / production failure	1,642 / 41%–86.7%	2026	UC Berkeley
Multi-agent failures from spec & coordination	79%	2026	UC Berkeley
Humans / GPT-4o addition-strategy rate	~60% / 88%–100%	2026	Uhler / Nature
add / subtract word frequency	361,246 / 1,802	corpus	U. of Birmingham
Anthropic learning RCT: AI / control	50% / 67%	2026.01	Anthropic
Lancet colonoscopy: with AI / after removal	28.4% / 22.4%	2025	Lancet
Knowledge graph lifts DeepSeek accuracy	80.7% → 86.1%	2026	OpenKG
Incremental update vs rebuild drift cut	86%	2026	Stanford ACE

All figures are from the sources the report cites and were checked against the original before publishing; for contrasts like 'adoption vs trust' or 'perception vs reality,' read them with the context above.

FAQ

What is Harness Engineering?

It is the key 2026 paradigm: instead of optimizing how you talk to AI (prompt engineering) or what you show it (context engineering), you design the environment AI works in—constraints, feedback, verification, state management. The formula is Agent = model + harness. LangChain showed that with the same frontier model, changing only the surrounding infrastructure raised its TerminalBench rank by 20+ places.

Why do people trust AI coding tools less the more they use them?

Stack Overflow found adoption rising from 70% (2023) to 84% (2025) while trust fell from 40% to 29%. It doesn't mean AI got worse—users increasingly know where it breaks. It's a learning curve disguised as a trust problem.

Does using AI actually make you faster?

Not necessarily. METR had 16 senior developers use frontier models in familiar repos; they felt 20% faster but were measured 19% slower—a 39-percentage-point perception–reality gap. Speed gains depend heavily on whether the task and workflow have been redesigned.

Is multi-agent always better than a single agent?

No. UC Berkeley analyzed 1,642 trajectories with production failure rates of 41%–86.7%, of which 79% came from spec and coordination rather than model capability. The rule is 'master one, then scale'; Anthropic suggests a span of control of 3–5 for research tasks and 1–2 for coding.

How much difference does redesigning the workflow make?

INSEAD and Harvard ran a 10-week RCT on 515 startups: with identical people, tools and training, the group that re-examined the whole process earned 1.9× the control group's revenue, found 44% more AI use-cases, and needed 39.5% less capital. The only variable was a workflow-redesign mindset.

Does AI cause deskilling, and how do you avoid it?

It can, depending on cognitive engagement. An Anthropic RCT (n=52) found the AI group scored 17 points lower; in a Lancet study, after three months with AI, colonoscopists' detection rate fell from 28.4% to 22.4% once AI was removed. Avoid it with high-engagement modes—ask concepts and write it yourself, demand code plus explanation, generate then probe until you understand—and above all protect constitutive capacities like judgment.