
AI-Native Work 2026: The Trust Gap, Harness Engineering, and the Knowledge Flywheel

Tencent Research Institute decodes the human–AI collaboration dilemma through ten keywords: it isn't a capability problem but a system-design problem—and what ultimately compounds into a moat is the knowledge network you weave in practice.

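Published 2026-06-05 · 10 min read · Source: Tencent Research Institute (腾讯研究院)
Tags: AI-native work · harness engineering · Harness · trust gap · multi-agent · knowledge engineering · deskilling · context engineering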

Tencent Research Institute's AI-Native Work Report 2026 uses the Trust Gap as its prism to trace the learning curve from 'using AI' to 'mastering AI': adoption of AI coding tools rose to 84% while trust in them fell to 29%, and users who felt 20% faster were actually 19% slower. The report argues the human–AI collaboration dilemma is not a capability problem but a system-design problem—the way out is Harness Engineering (designing the environment AI works in, not babysitting its output), rebuilt on four pillars (memory, skills, evaluation, context) across end-to-end workflows. What ultimately compounds into a competitive moat is the Knowledge Flywheel: writing practice down, remembering it, and connecting it.

29%
Trust in AI coding tools — while adoption rose to 84% (the trust scissors) · Stack Overflow
39pts
Gap between feeling '20% faster' and actually being '19% slower' · METR
1.9×
Startups that redesigned their workflow earned 1.9× the control group · INSEAD
79%
Share of multi-agent production failures rooted in spec and coordination · UC Berkeley

Panorama: A Learning Curve Disguised as a Trust Problem

Tencent Research Institute uses the Trust Gap as a prism and ten keywords—harness engineering, memory, skills, evaluation, context, workflow, multi-agent, addition bias, deskilling, knowledge engineering—to draw the learning curve from 'using AI' to 'mastering AI.' The thesis is one sentence: the human–AI collaboration dilemma is not a capability problem but a system-design problem. What ultimately compounds into a moat is not a model or a tool, but the knowledge network a person weaves through practice.

The report's spine: five layers from diagnosis to distillation
Diagnosis: the trust gap
Adoption ≠ trust → runaway behavior → distorted perception → organizational fault lines.
Method: harness engineering
Design the environment AI runs in, not the output. Four pillars: memory · skills · evaluation · context.
System: workflow · multi-agent
Redesign end-to-end, don't just speed up single tasks. Master one agent before scaling to many.
Caution: addition bias · deskilling
AI amplifies our instinct to add; judgment quietly atrophies.
Distillation: the knowledge flywheel
Write it down → remember it → connect it → spin the flywheel.

Diagnosis: The Trust Gap

The trust gap shows up as four escalating symptoms: adoption and trust moving in opposite directions, a 'say one thing, do another' behavior gap, a misperception of one's own speed, and a fault line between leadership and the front line. Together they show the problem isn't that AI is too weak, but that we haven't yet learned to work with a probabilistic colleague.

The Trust Scissors: The More We Use It, The Less We Trust It

Stack Overflow's annual developer survey shows adoption of AI coding tools rising from 70% in 2023 to 84% in 2025, while trust in them fell from 40% to 29%—a pair of scissors opening ever wider.

Chart: AI coding tools, adoption↑ and trust↓ (2023 → 2025, %). Series: Adoption, Trust.

The Behavior Paradox: 96% Don't Trust It, 48% Don't Check It

Say they distrust it vs actually check it (Sonar survey)
Don't fully trust AI code correctness
96%
Always review AI code before committing
48%
Find reviewing AI code harder than human code
38%

Almost everyone says 'I don't trust AI code,' yet half of them click commit anyway. Behind it is a hidden cost transfer: AI removes the burden of writing but adds the burden of checking; when checking costs more than expected, many simply stop checking.

The Perception–Reality Crack: More Confident, but Worse

Speed with AI: self-perception vs reality (METR 2025)
Self-perception
+20%
felt faster
Reality
-19%
actually slower

METR had 16 senior open-source developers use frontier models in repos they'd contributed to for years: they felt 20% faster but were measured 19% slower—a 39-percentage-point gap between perception and reality. An earlier Stanford study (CCS 2023) found people using an AI assistant wrote more insecure code yet rated the AI higher. A telling case: AI produced 300 lines of syntactically perfect infrastructure code whose referenced resources and config were mostly fabricated.

Trust Dynamics in Three Phases: Stronger After the Collapse

Trust forms → is shocked → is repaired (Seoul National University, n = 189 + 294)
Form
Based on capability cues
On first contact, trust is usually high.
Shock
One visible error
Trust drops off a cliff—people tolerate AI errors far less than human ones (the 'perfect automation schema').
Repair
Explain + bound
Explaining the cause and naming the limits can push trust above its original baseline (the 'trust acceleration paradox').

The Organizational Fault Line: Executive Enthusiasm vs Frontline Coolness

Enthusiasm for AI: what executives assume vs what the front line feels (BCG + Columbia Business School)
Executives think employees are enthusiastic
76%
Frontline employees who actually feel that way
31%
95%
Enterprise AI pilots with no measurable business return (MIT Sloan)
7%
Truly integrated into the business (88% use AI, only 7% integrate it)
42%
Executives admitting AI is 'tearing the company apart'

The fault line runs deeper than perception: only 44% of employees have had AI training, yet 57% won't tell their team they use AI and 31% actively undermine AI rollouts. The MIT researchers' conclusion is blunt: the core barrier to scaling is not infrastructure, not regulation, not talent—it's learning.

Four Kinds of Organizational Resistance: Not Conservatism, but Signal

Type | What it's really saying | Example
Tool resistance | Tried it, found it didn't work | A legal team refuses a contract-analysis AI to protect the company
Strategy resistance | AI is deployed where the value isn't | Half the budget goes to sales/marketing; the highest return is in back-office automation
Trust resistance | Leaders say 'augment you' while announcing layoffs | Not technophobia, but a rational response to contradictory signals
Capability resistance | Using it but unsure if correctly | 44% trained, 57% won't admit they use AI
Employee resistance is often the organization sending a signal (Forbes Technology Council).

Method: Harness Engineering

If the trust gap is a system-design problem, the answer is a different way of working: stop optimizing 'how you talk to AI' and instead design the environment AI works in—constraints, feedback, verification, state management. That is the keyword of 2026: harness engineering.

Three Generations of Paradigm: From Talking to Building the Environment

Prompt → context → harness engineering
2023
Prompt engineering
Optimize how you talk to AI—wording and examples. Low ceiling: it only tunes a one-shot input.
2025
Context engineering
Optimize what background AI sees (pushed by Karpathy). Still about 'what to show the AI.'
2026
Harness engineering
Design AI's runtime environment: constraints, feedback, verification, state. A qualitative shift—AI works inside the office you built.

The Harness Quadrant: Most People Only Did Feedforward

Feedforward × feedback · deterministic × reasoning (Boeckeler / ThoughtWorks)
Axes: Direction (feedforward ↔ feedback) × Execution (deterministic ↔ reasoning)
Feedforward × deterministic
Templates, specs, scaffolding
Feedforward × reasoning
AGENTS.md, design principles, values
Feedback × deterministic
Linters, tests, pre-commit hooks
Feedback × reasoning
AI cross-review, expert review

Four Stages of Human–AI Collaboration: From 'In the Loop' to 'On the Loop'

Where the human sits in the collaboration (Fowler's team)
Outside
Outside the Loop
Hand off the task and hope for the best—'vibe coding.'
In
In the Loop
Review every line of output; the human becomes the bottleneck—the common 'AI saved no time' complaint.
On
On the Loop
Don't fix the output; fix the system that produces it—shift from doer to environment designer.
Flywheel
Agentic Flywheel
Use the agent to improve the harness itself; the harness begins to self-iterate.

The Three Economics of Constraints

Rules, tools, architecture—three different games
Rule constraints
zero-sum
Text instructions eat context: hand-written ≤60 rules +4%, AI-generated 200+ rules −3% (ETH Zurich + Tsinghua)
Tool constraints
positive-sum
Run outside the context window, cost no attention, execute deterministically—each violation is a 'micro-training'
Architectural constraints
multiplier
3 engineers, 5 months, 0 hand-written lines → a million-line production product (OpenAI Codex)
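
The "tool constraints" idea above can be sketched as a small deterministic check that runs outside the model's context window and returns each violation as a feedback message. The rule itself (no bare print() calls), the function name check_no_print, and the message wording are illustrative assumptions, not something the report specifies.

```python
import ast


def check_no_print(source: str) -> list[str]:
    """Return one feedback message per print() call found in the source.

    Hypothetical project rule for illustration: use logging, not print().
    """
    line_numbers = []
    for node in ast.walk(ast.parse(source)):
        is_print_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        )
        if is_print_call:
            line_numbers.append(node.lineno)
    # Sort numerically so the feedback reads top to bottom.
    return [
        f"line {n}: use the logger instead of print()"
        for n in sorted(line_numbers)
    ]


# Example: AI-generated code is checked before it is accepted.
generated = "import logging\nprint('debug')\nlogging.info('ok')\n"
feedback = check_no_print(generated)
```

Because the check lives in code rather than in a rules file, it costs no context tokens, and its messages can be handed back to the agent as the next round's input.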

Four Pillars: Memory · Skills · Evaluation · Context

Memory ① Precise Forgetting: Cut 45% of Storage, Keep 82% of Key Facts

The core challenge of a memory system is not 'how to remember more' but 'how to forget more precisely.' FadeMem mimics the Ebbinghaus forgetting curve (long-term half-life ~11 days, short-term ~5 days), keeping the highest key-fact retention while sharply cutting storage.
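
To make the half-life figures concrete, here is a minimal sketch of forgetting as simple exponential decay. It is a stand-in for illustration, not FadeMem's actual algorithm: the pruning threshold FORGET_BELOW and both function names are assumptions, and only the two half-life values come from the text above.

```python
LONG_TERM_HALF_LIFE = 11.0   # days, figure quoted in the text
SHORT_TERM_HALF_LIFE = 5.0   # days, figure quoted in the text
FORGET_BELOW = 0.1           # hypothetical pruning threshold


def retention(age_days: float, half_life_days: float) -> float:
    """Memory strength under exponential decay: halves every half-life."""
    return 0.5 ** (age_days / half_life_days)


def should_forget(age_days: float, half_life_days: float,
                  threshold: float = FORGET_BELOW) -> bool:
    """A memory is pruned once its strength drops below the threshold."""
    return retention(age_days, half_life_days) < threshold


# A 20-day-old memory: pruned from the short-term tier, kept in long-term.
short_term_pruned = should_forget(20, SHORT_TERM_HALF_LIFE)
long_term_pruned = should_forget(20, LONG_TERM_HALF_LIFE)
```

Under these assumptions the same 20-day-old item is dropped from the short-term tier but survives in the long-term tier, which is the sense in which forgetting can be "precise" rather than uniform.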

Key-fact retention by memory method (LTI-Bench, %)
Fixed-16K 50.2% · LangChain 71.2% · MemGPT 75.6% · Mem0 78.4% · FadeMem 82.1%
82.1%
FadeMem key-fact retention (using only 55% of storage)
45%
Reduction in storage
29.43
Multi-hop F1 with memory management (vs just 5.17 without, a ~5.7× gap)

Memory ② Three Engineering Routes, No One-Size-Fits-All

Route | Mechanism | Cognitive-science analog
A · Selective fact extraction (Mem0) | Auto-extract discrete facts from dialogue, dedupe and update: 'what I know' | Semantic memory
B · Document self-management (Anthropic) | The agent maintains its own document set, deciding what to write and how to organize it: 'what happened' | Episodic memory (many small focused files > a few big ones)
C · Structured knowledge graph (Neo4j / Zep) | Entities + relations + temporal graph, including inferential memory: 'what relates to what' | Relational memory
The three routes trust different things and map to different memory types in cognitive science; Tencent's approach is one underlying dataset with different mechanisms at call time.

Skills: From Bloat to Lean—Over 60% Is Noise

SkillReducer analyzed 55,315 public Skills and found that over 60% of their content is noise. By composition, truly actionable rules make up only 38.5%; background explanation is the largest slice at 40.7%.

Composition of public Skills (SkillReducer, %)
  • Actionable rules: 38.5%
  • Background explanation: 40.7%
  • Examples: 12.9%
  • Templates & redundancy: 7.9%
Tool-selection accuracy falls off a cliff with tool count (Berkeley BFCL)
4 tools
43%
51 tools
2%
a cliff-edge collapse

Evaluation: Separate the Generator from the Judge

Same model: self-rating in the same context vs independent review in a new context
Self-rating, same context
100%
perfect pass (107 training samples)
Independent review, new context
5.5/10
exposes 5 serious flaws

The fix, inspired by GAN-style adversarial feedback, is the PGE three-role architecture: Planner + Generator + Evaluator. In Anthropic's engineering practice, each round runs 5–15 iterations, sometimes for up to 4 hours, with the evaluator wired to browser-automation tests. But AI judges aren't reliable: in controlled experiments they agree with humans >80% of the time, yet error rates exceed 50% in production (four systematic biases). RAND's 2026 conclusion: no AI judge stays consistently reliable across benchmarks.
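
The Planner + Generator + Evaluator split can be sketched as a loop in which the evaluator's feedback drives the next generation round. This is a minimal illustration under stated assumptions: run_pge and the three stub roles are hypothetical names, the stubs replace real model calls so the sketch runs on its own, and the 15-iteration cap simply reuses the upper bound quoted above.

```python
from typing import Callable


def run_pge(
    task: str,
    planner: Callable[[str], str],
    generator: Callable[[str, str], str],
    evaluator: Callable[[str, str], tuple[bool, str]],
    max_iterations: int = 15,
) -> tuple[str, int]:
    """Plan once, then generate and evaluate until the evaluator passes."""
    plan = planner(task)
    feedback = ""
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = generator(plan, feedback)
        passed, feedback = evaluator(plan, draft)
        if passed:
            return draft, iteration
    return draft, max_iterations


# Stub roles; in practice each would run in its own fresh model context.
def stub_planner(task: str) -> str:
    return f"spec: {task}, must include tests"


def stub_generator(plan: str, feedback: str) -> str:
    return "code + tests" if "tests" in feedback else "code"


def stub_evaluator(plan: str, draft: str) -> tuple[bool, str]:
    if "tests" in draft:
        return True, ""
    return False, "missing tests"


result, rounds = run_pge("add login", stub_planner, stub_generator, stub_evaluator)
```

The design point is that the evaluator never sees the generator's reasoning, only the plan and the draft, which is what keeps its judgment independent.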

Three-Layer Trust Gradient (the Swiss-Cheese Model)

Deterministic checks + AI review + human judgment—no single layer is enough (Anthropic)
Human judgment · roof
Widest coverage, slowest and most expensive—fix the system, not the output.
AI review · floors
Catches semantic issues, but carries position, verbosity, self-preference and family biases.
Deterministic checks · foundation
Most trustworthy, narrowest coverage—hard constraints as a backstop.
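
One way to read the three layers is as a triage order: cheap deterministic checks run first, AI review second, and human judgment is reserved for what the lower layers cannot settle. The sketch below assumes a hypothetical Change record, a 0-10 review score, and a threshold of 7; none of these come from the report.

```python
from dataclasses import dataclass


@dataclass
class Change:
    tests_pass: bool         # outcome of deterministic checks (linters, tests)
    ai_review_score: float   # 0-10 score from an independent AI reviewer
    high_risk: bool          # hypothetical flag, e.g. touches auth or billing


def triage(change: Change, min_score: float = 7.0) -> str:
    """Route a change through the layers; returns the resulting action."""
    # Foundation: deterministic checks are a hard backstop.
    if not change.tests_pass:
        return "reject"
    # Floors: AI review catches semantic issues but is biased, so a low
    # score sends the change back rather than deciding its fate alone.
    if change.ai_review_score < min_score:
        return "revise"
    # Roof: human judgment is slow and expensive, so only risky work
    # that passed both lower layers is escalated.
    if change.high_risk:
        return "escalate"
    return "accept"
```

For example, `triage(Change(True, 5.5, False))` returns "revise", while a high-scoring but risky change is escalated to a person.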

Context: Degrading From the Very First Token

Chroma tested 18 frontier models, fixing task difficulty and varying only input length, and found performance degrading from the first output token—no exceptions; shuffling sentence order even made every model perform better (structured filler distracts the model). SWE-rebench maintainers observed a performance ceiling around 1 million tokens. This is the n² cost of the Transformer: 10 tokens make 100 pairs, 10,000 tokens make 100 million.
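
The quadratic arithmetic in the last sentence is easy to check. The helper name attention_pairs is an assumption for illustration; it counts every ordered token pair, which is how the figures in the text are computed.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of token-to-token pairs full self-attention must score."""
    return num_tokens * num_tokens


small = attention_pairs(10)        # 100
large = attention_pairs(10_000)    # 100,000,000
growth = large // small            # 1,000x the tokens costs 1,000,000x the pairs
```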

Token usage before vs after MCP lazy-loading (Claude Code 2.1.7)
Before lazy-loading
~77,000
tokens
After lazy-loading
~8,700
85% less—and accuracy rose too
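
The idea behind lazy-loading can be sketched as a registry that exposes only a one-line summary per tool up front and loads a full schema when a tool is actually chosen. This is not Claude Code's implementation: LazyToolRegistry, the toy tool definitions, and the whitespace-based rough_tokens estimate are all assumptions made for illustration.

```python
class LazyToolRegistry:
    """Holds full tool schemas but only shows summaries until asked."""

    def __init__(self, tools: dict[str, dict[str, str]]):
        self._tools = tools
        self.loaded: set[str] = set()

    def index(self) -> list[str]:
        """Cheap up-front listing: tool name plus a one-line summary."""
        return [f"{name}: {spec['summary']}" for name, spec in self._tools.items()]

    def load(self, name: str) -> str:
        """Fetch the full schema only when the tool is selected."""
        self.loaded.add(name)
        return self._tools[name]["schema"]


def rough_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words."""
    return len(text.split())


# Toy tool set with deliberately long schemas.
TOOLS = {
    "search": {"summary": "search the web", "schema": "query string required " * 50},
    "calendar": {"summary": "read calendar events", "schema": "start end timezone " * 50},
}

registry = LazyToolRegistry(TOOLS)
eager_cost = sum(rough_tokens(spec["schema"]) for spec in TOOLS.values())
lazy_cost = sum(rough_tokens(line) for line in registry.index())
lazy_cost_after_use = lazy_cost + rough_tokens(registry.load("search"))
```

In this toy setup the up-front cost is a handful of words instead of every schema, and the context only grows by the one schema the task needs.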

System: Workflows and Multi-Agent

Workflow Redesign: INSEAD's 1.9× Gap

INSEAD and Harvard Business School ran a 10-week RCT on 515 global startups: same people, same tools, same training—the only difference was whether they re-examined the whole process. Here is the result.

Treatment vs control (INSEAD + Harvard RCT, 515 startups)
Total revenue
1.9×
1.9× the control group
AI use-cases found
+44%
Probability of a paying customer
+18pt
Capital required
39.5%
lower—same people, less money

The Task Length AI Can Handle Is Growing Exponentially

Task length vs success rate (METR, today's strongest models)
Tasks under 4 minutes
~100%
near-100% success
Tasks over 4 hours
<10%
success drops below 10%

Over the past six years, the task length AI can complete independently has doubled roughly every 7 months. Tellingly, three independent projects with no knowledge of each other—Manus (task_plan.md), OpenClaw (MEMORY.md), and Claude Code (CLAUDE.md + Skills)—all converged on a 'file system' solution to manage long tasks.
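
As a worked check of what a 7-month doubling period implies, the sketch below compounds it over six years. The function name growth_factor is an assumption; the 7-month and six-year figures are the ones quoted above.

```python
def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """How many times longer tasks get after the given number of months."""
    return 2 ** (months / doubling_months)


# Six years is 72 months, or a little over ten doublings.
six_year_growth = growth_factor(72)
```

Under that assumption, six years compounds to roughly a 1,250-fold increase in manageable task length.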

The Four Multi-Agent Traps

1,642
Multi-agent trajectories analyzed (UC Berkeley)
79%
Failures rooted in spec and coordination (production failure rate 41%–86.7%)
3–5
Anthropic's recommended span of control (research; 1–2 for coding)
Trap | What it looks like
① Over-delegation | A single well-equipped agent handles most cases; splitting blindly only adds chaos
② Under-specification | Brief each worker as you would write a ticket for a first-day junior engineer; otherwise sub-agents re-investigate the same direction
③ Coordination overhead | Dispatch + execute + synthesize can mean 5–10 API calls; a single agent needs just 1–2
④ Telephone-game effect | Information decays at each handoff; have sub-agents write to the file system directly
Across 1,642 trajectories, 79% of failures come from spec and coordination—not model capability.
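
The remedy for the telephone-game trap, having sub-agents write to the file system directly, can be sketched as follows. The function names, the JSON file layout, and the sample findings are hypothetical and exist only to make the sketch runnable.

```python
import json
import tempfile
from pathlib import Path


def subagent_write(workdir: Path, name: str, findings: dict) -> Path:
    """A sub-agent persists its full findings instead of summarizing them."""
    path = workdir / f"{name}.json"
    path.write_text(json.dumps(findings), encoding="utf-8")
    return path


def orchestrator_collect(workdir: Path) -> dict[str, dict]:
    """The orchestrator reads every file verbatim, so nothing is paraphrased."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(workdir.glob("*.json"))
    }


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    # Hypothetical findings from two sub-agents.
    subagent_write(workdir, "pricing", {"finding": "prices rose", "source": "Q3 filing"})
    subagent_write(workdir, "churn", {"finding": "churn flat", "source": "survey"})
    collected = orchestrator_collect(workdir)
```

Each handoff goes through a file rather than through another agent's summary, so details such as the source of a finding survive intact.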

The real value of multi-agent isn't 'smarter' but more parallel compute: on OpenAI's BrowseComp benchmark, Anthropic found that token usage alone explains 80% of the performance variance. So the rule is: master one agent, then scale to many.


Caution: Addition Bias and Deskilling

Addition Bias: Humans 60% vs AI 88–100%

Share that reaches for 'adding' rather than 'subtracting' (Uhler 2026)
Humans
~60%
GPT-4o
88–100%
amplifies the human bias further
English corpus word frequency: add vs subtract (University of Birmingham 2023)
add / more words
361,246
subtract / less words
1,802

Eight experiments published in Nature in 2021 found that, unprompted, only 41% of people thought of subtraction, rising to 61% after an 8-word hint that 'removing is free.' AI is trained on language, and language itself leans toward 'adding,' producing a double bias: the human instinct plus the AI amplification. Worse, given a problem where subtraction is more efficient, GPT-4 used addition even more often, so the efficiency signal backfired.

Deskilling: The AI Group Scored 17 Points Lower

Quiz score after learning a new Python library (Anthropic RCT 2026.01, n=52)
AI group
50%
Control group
67%
17 points higher

The difference isn't 'AI or not' but how much cognitive engagement there is. The three lowest-scoring modes were: AI delegation (hand everything to AI), gradual dependence (give it all up step by step), and iterative debugging (paste the error without asking why). The three highest-scoring modes: concept queries (ask only concepts, write it yourself), mixed code-and-explanation (ask for both), and generate-then-understand (let AI generate, then probe until you get it).

Colonoscopy detection rate: three months with AI, then AI removed (Lancet 2025)
With AI assistance
28.4%
After removing AI
22.4%
the skill never truly took root

Distillation: The Knowledge Flywheel

Methods date, tools iterate—what remains and compounds is the knowledge network you weave. Knowledge engineering is a spiral: write down the ways of working you taught the AI (Skills, solving 're-teaching'), remember decisions and lessons from failure (Memory, solving 're-forgetting'), then connect the accumulation into relationships and causal chains (connection is where the real value is).

Write it down → remember it → connect it → flywheel
Write
Skill
Turn the ways of working you taught AI into reusable templates.
Remember
Memory
Leave a trace of decisions, preferences, and lessons from failure.
Connect
Knowledge Eng.
Let the accumulation form relationships and causal chains.
Spin
Flywheel
Context → knowledge base; instructions → Skills; workflow → runbook; failure → guardrail.
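
The "connect" step can be illustrated with a tiny in-memory graph in which notes are linked by named relations and a causal chain is recovered by following one relation. The class KnowledgeNet, the relation name led_to, and the sample notes are hypothetical; a real system would more likely use a graph database.

```python
from collections import defaultdict


class KnowledgeNet:
    """Minimal store of (subject, relation, object) links."""

    def __init__(self) -> None:
        self._edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def connect(self, subject: str, relation: str, obj: str) -> None:
        self._edges[subject].append((relation, obj))

    def chain(self, start: str, relation: str) -> list[str]:
        """Follow one relation type from start until no new link remains."""
        path = [start]
        seen = {start}
        current = start
        while True:
            following = next(
                (obj for rel, obj in self._edges.get(current, [])
                 if rel == relation and obj not in seen),
                None,
            )
            if following is None:
                return path
            path.append(following)
            seen.add(following)
            current = following


net = KnowledgeNet()
net.connect("flaky deploy", "led_to", "rollback guardrail")
net.connect("rollback guardrail", "led_to", "runbook entry")
causal_chain = net.chain("flaky deploy", "led_to")
```

Written and remembered notes stay isolated until links like these exist; once they do, a single failure can be traced forward to the guardrail and runbook it produced.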
A structured knowledge graph raises accuracy (OpenKG, same model — DeepSeek)
Without graph
80.7%
With graph
86.1%
better structured knowledge = higher accuracy
8×
Growth in AI-generated code duplication (GitClear)—assets rot
86%
Less drift from incremental updates vs full rebuilds (Stanford ACE)
10–15
A sensible number of personal Skills (not 89)
The era of big data is over; the era of big knowledge is beginning.
KPMG 2026 white paper

Conclusion: The Moat Isn't the Model—It's the Network You Weave

String the ten keywords into one line: the trust gap forces us to admit this is a system-design problem; harness engineering moves us from babysitting output to designing the environment; memory, skills, evaluation and context are the four pillars; workflows and multi-agent string them into a system; addition bias and deskilling warn us not to quietly lose our judgment; and knowledge engineering is the flywheel that distills all of it into a moat.

What ultimately compounds into a moat is never a model or a tool—those all date. What truly belongs to you is the knowledge network you write down, remember and connect across every round of practice. Every data point is already the seed of a chart.

Evidence Pool: Key Figures at a Glance

Metric | Value | Period | Source
AI coding tool adoption | 70% → 84% | 2023 → 2025 | Stack Overflow
Trust in AI coding tools | 40% → 29% | 2023 → 2025 | Stack Overflow
Don't fully trust AI code correctness | 96% | 2025 | Sonar
Always review AI code before committing | 48% | 2025 | Sonar
METR actual speed / self-perception | -19% / +20% | 2025 | METR
Perception–reality gap | 39 pts | 2025 | METR
AI pilots with no measurable return | 95% | 2026 | MIT Sloan
Use AI / truly integrated | 88% / 7% | 2026 | MIT Sloan
Executives assume / front line feels | 76% / 31% | 2026 | BCG + Columbia
Executives admit AI tears the company apart | 42% | 2026 | BCG + Columbia
Hand-written ≤60 rules / AI 200+ rules | +4% / -3% | 2026 | ETH Zurich + Tsinghua
Tencent rule slimming | 200 → 50 lines | 2026 | Tencent RI
Codex team: engineers / time / hand code | 3 / 5 mo / 0 lines | 2026 | OpenAI Codex
FadeMem key-fact retention / storage cut | 82.1% / 45% | 2026 | FadeMem
F1 with / without memory management | 29.43 / 5.17 | 2026 | FadeMem
Public Skills analyzed | 55,315 | 2026 | SkillReducer
Compressed / original capability score | 0.742 / 0.722 | 2026 | SkillReducer
Tool-selection accuracy (4 / 51 tools) | 43% / 2% | 2026 | Berkeley BFCL
LLM-as-Judge agreement / production error | >80% / >50% | 2026 | Multiple studies
MCP tokens before / after lazy-loading | ~77,000 / ~8,700 | 2026 | Claude Code
INSEAD treatment-group revenue multiple | 1.9× | 2026 | INSEAD + Harvard
INSEAD treatment-group capital reduction | 39.5% | 2026 | INSEAD + Harvard
Use AI / think workflow is efficient | 84% / 21% | 2025 | Telerik
AI task-length doubling period | ~7 months | past 6 yrs | METR
Multi-agent trajectories / production failure | 1,642 / 41%–86.7% | 2026 | UC Berkeley
Multi-agent failures from spec & coordination | 79% | 2026 | UC Berkeley
Humans / GPT-4o addition-strategy rate | ~60% / 88%–100% | 2026 | Uhler / Nature
add / subtract word frequency | 361,246 / 1,802 | corpus | U. of Birmingham
Anthropic learning RCT: AI / control | 50% / 67% | 2026.01 | Anthropic
Lancet colonoscopy: with AI / after removal | 28.4% / 22.4% | 2025 | Lancet
Knowledge graph lifts DeepSeek accuracy | 80.7% → 86.1% | 2026 | OpenKG
Incremental update vs rebuild drift cut | 86% | 2026 | Stanford ACE
All figures are from the sources the report cites and were checked against the original before publishing; for contrasts like 'adoption vs trust' or 'perception vs reality,' read them with the context above.

FAQ

What is Harness Engineering?

It is the key 2026 paradigm: instead of optimizing how you talk to AI (prompt engineering) or what you show it (context engineering), you design the environment AI works in—constraints, feedback, verification, state management. The formula is Agent = model + harness. LangChain showed that with the same frontier model, changing only the surrounding infrastructure raised its TerminalBench rank by 20+ places.

Why do people trust AI coding tools less the more they use them?

Stack Overflow found adoption rising from 70% (2023) to 84% (2025) while trust fell from 40% to 29%. It doesn't mean AI got worse—users increasingly know where it breaks. It's a learning curve disguised as a trust problem.

Does using AI actually make you faster?

Not necessarily. METR had 16 senior developers use frontier models in familiar repos; they felt 20% faster but were measured 19% slower—a 39-percentage-point perception–reality gap. Speed gains depend heavily on whether the task and workflow have been redesigned.

Is multi-agent always better than a single agent?

No. UC Berkeley analyzed 1,642 trajectories with production failure rates of 41%–86.7%, of which 79% came from spec and coordination rather than model capability. The rule is 'master one, then scale'; Anthropic suggests a span of control of 3–5 for research tasks and 1–2 for coding.

How much difference does redesigning the workflow make?

INSEAD and Harvard ran a 10-week RCT on 515 startups: with identical people, tools and training, the group that re-examined the whole process earned 1.9× the control group's revenue, found 44% more AI use-cases, and needed 39.5% less capital. The only variable was a workflow-redesign mindset.

Does AI cause deskilling, and how do you avoid it?

It can, depending on cognitive engagement. An Anthropic RCT (n=52) found the AI group scored 17 points lower; in a Lancet study, after three months with AI, colonoscopists' detection rate fell from 28.4% to 22.4% once AI was removed. Avoid it with high-engagement modes—ask concepts and write it yourself, demand code plus explanation, generate then probe until you understand—and above all protect constitutive capacities like judgment.

Source · TD 2026
Tencent Research Institute (腾讯研究院)
AI-Native Work Report 2026 (AI原生工作报告2026) · 2026
AI-NATIVE-WORK-2026-TENCENT

An independent synthesis, distillation and data visualization based on a public report; the original data and views belong to the source author.
