AI-Native Work 2026: The Trust Gap, Harness Engineering, and the Knowledge Flywheel
Tencent Research Institute decodes the human–AI collaboration dilemma through ten keywords: it isn't a capability problem but a system-design problem—and what ultimately compounds into a moat is the knowledge network you weave in practice.
Tencent Research Institute's AI-Native Work Report 2026 uses the Trust Gap as its prism to trace the learning curve from 'using AI' to 'mastering AI': adoption of AI coding tools rose to 84% while trust in them fell to 29%, and users who felt 20% faster were actually 19% slower. The report argues the human–AI collaboration dilemma is not a capability problem but a system-design problem—the way out is Harness Engineering (designing the environment AI works in, not babysitting its output), rebuilt on four pillars (memory, skills, evaluation, context) across end-to-end workflows. What ultimately compounds into a competitive moat is the Knowledge Flywheel: writing practice down, remembering it, and connecting it.
Panorama: A Learning Curve Disguised as a Trust Problem
Tencent Research Institute uses the Trust Gap as a prism and ten keywords—harness engineering, memory, skills, evaluation, context, workflow, multi-agent, addition bias, deskilling, knowledge engineering—to draw the learning curve from 'using AI' to 'mastering AI.' The thesis is one sentence: the human–AI collaboration dilemma is not a capability problem but a system-design problem. What ultimately compounds into a moat is not a model or a tool, but the knowledge network a person weaves through practice.
Diagnosis: The Trust Gap
The trust gap shows up as four escalating symptoms: adoption and trust moving in opposite directions, a 'say one thing, do another' behavior gap, a misperception of one's own speed, and a fault line between leadership and the front line. Together they show the problem isn't that AI is too weak, but that we haven't yet learned to work with a probabilistic colleague.
The Trust Scissors: The More We Use It, The Less We Trust It
Stack Overflow's annual developer survey shows adoption of AI coding tools rising from 70% in 2023 to 84% in 2025, while trust in them fell from 40% to 29%—a pair of scissors opening ever wider.
The Behavior Paradox: 96% Don't Trust It, 48% Don't Check It
Almost everyone says 'I don't trust AI code,' yet half of them click commit anyway. Behind it is a hidden cost transfer: AI removes the burden of writing but adds the burden of checking; when checking costs more than expected, many simply stop checking.
The Perception–Reality Crack: More Confident, but Worse
METR had 16 senior open-source developers use frontier models in repos they'd contributed to for years: they felt 20% faster but were measured 19% slower—a 39-percentage-point gap between perception and reality. An earlier Stanford study (CCS 2023) found people using an AI assistant wrote more insecure code yet rated the AI higher. A telling case: AI produced 300 lines of syntactically perfect infrastructure code whose referenced resources and config were mostly fabricated.
Trust Dynamics in Three Phases: Stronger After the Collapse
The Organizational Fault Line: Executive Enthusiasm vs Frontline Coolness
The fault line runs deeper than perception: only 44% of employees have had AI training, yet 57% won't tell their team they use AI and 31% actively undermine AI rollouts. The MIT researchers' conclusion is blunt: the core barrier to scaling is not infrastructure, not regulation, not talent—it's learning.
Four Kinds of Organizational Resistance: Not Conservatism, but Signal
| Type | What it's really saying | Example |
|---|---|---|
| Tool resistance | Tried it, found it didn't work | A legal team refuses a contract-analysis AI to protect the company |
| Strategy resistance | AI is deployed where the value isn't | Half the budget goes to sales/marketing; the highest return is in back-office automation |
| Trust resistance | Leaders say 'augment you' while announcing layoffs | Not technophobia, but a rational response to contradictory signals |
| Capability resistance | Using it but unsure if correctly | 44% trained, 57% won't admit they use AI |
Method: Harness Engineering
If the trust gap is a system-design problem, the answer is a different way of working: stop optimizing 'how you talk to AI' and instead design the environment AI works in—constraints, feedback, verification, state management. That is the keyword of 2026: harness engineering.
Three Generations of Paradigm: From Talking to Building the Environment
The Harness Quadrant: Most People Only Did Feedforward
Four Stages of Human–AI Collaboration: From 'In the Loop' to 'On the Loop'
The Three Economics of Constraints
Four Pillars: Memory · Skills · Evaluation · Context
Memory ① Precise Forgetting: Cut 45% of Storage, Keep 82% of Key Facts
The core challenge of a memory system is not 'how to remember more' but 'how to forget more precisely.' FadeMem mimics the Ebbinghaus forgetting curve (long-term half-life ~11 days, short-term ~5 days), retaining 82.1% of key facts while cutting storage by 45%.
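To make the half-life idea concrete, here is a minimal sketch of decay-based pruning. It assumes plain exponential decay with the two half-lives quoted above; the function names, the 0.25 threshold, and the sample memories are hypothetical and are not taken from FadeMem itself.

```python
# Half-lives come from the text (approximate); everything else is illustrative.
HALF_LIFE_DAYS = {"long_term": 11.0, "short_term": 5.0}


def retention(age_days: float, layer: str) -> float:
    """Exponential decay: strength halves once per half-life."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS[layer])


def prune(memories, threshold=0.25):
    """Keep only memories whose decayed strength is still at or above the threshold."""
    return [m for m in memories if retention(m["age_days"], m["layer"]) >= threshold]


# Hypothetical sample data: two memories of the same age in different layers.
memories = [
    {"fact": "prefers TypeScript", "age_days": 20, "layer": "long_term"},
    {"fact": "asked about lunch", "age_days": 20, "layer": "short_term"},
]
kept = prune(memories)
print([m["fact"] for m in kept])
```

After 20 days the long-term item has decayed through fewer than two half-lives and survives, while the short-term item has gone through four and is dropped. A real system would presumably also reinforce memories on access, which this sketch leaves out.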
Memory ② Three Engineering Routes, No One-Size-Fits-All
| Route | Mechanism | Cognitive-science analog |
|---|---|---|
| A · Selective fact extraction (Mem0) | Auto-extract discrete facts from dialogue, dedupe and update—'what I know' | Semantic memory |
| B · Document self-management (Anthropic) | The agent maintains its own document set, deciding what to write and how to organize it—'what happened' | Episodic memory (many small focused files > a few big ones) |
| C · Structured knowledge graph (Neo4j / Zep) | Entities + relations + temporal graph, including inferential memory—'what relates to what' | Relational memory |
Skills: From Bloat to Lean—Over 60% Is Noise
SkillReducer analyzed 55,315 public Skills and found over 60% of their content is noise. By composition, truly actionable rules make up under 40%; background explanation is actually the largest slice.
- Actionable rules: 38.5%
- Background explanation: 40.7%
- Examples: 12.9%
- Templates & redundancy: 7.9%
Evaluation: Separate the Generator from the Judge
The premise is that an agent should not grade its own work. The fix, inspired by GAN-style adversarial feedback, is the PGE three-role architecture: Planner + Generator + Evaluator. In Anthropic's engineering practice, each round runs 5–15 iterations, sometimes for up to 4 hours, with the evaluator wired to browser-automation tests. But AI judges aren't reliable either: in controlled experiments they agree with humans more than 80% of the time, yet their error rates exceed 50% in production, driven by four systematic biases. RAND's 2026 conclusion: no AI judge stays consistently reliable across benchmarks.
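The Planner + Generator + Evaluator split can be sketched as a small control loop. This is an illustration under stated assumptions, not Anthropic's implementation: `run_pge` and the three toy functions are hypothetical stand-ins for model calls, and the cap of 15 rounds simply mirrors the upper end of the iteration range quoted above.

```python
from typing import Callable, List, Tuple


def run_pge(
    task: str,
    planner: Callable[[str], List[str]],
    generator: Callable[[str, str], str],
    evaluator: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 15,
) -> List[str]:
    """Plan once, then let a separate evaluator gate each generated step."""
    results = []
    for step in planner(task):
        feedback, draft = "", ""
        for _ in range(max_rounds):
            draft = generator(step, feedback)          # generator never judges itself
            passed, feedback = evaluator(step, draft)  # evaluator never writes the draft
            if passed:
                break
        results.append(draft)
    return results


# Toy stand-ins (hypothetical): the "quality bar" is simply upper-case output.
def toy_planner(task: str) -> List[str]:
    return [part.strip() for part in task.split(" and ")]


def toy_generator(step: str, feedback: str) -> str:
    return step.upper() if feedback else step


def toy_evaluator(step: str, draft: str) -> Tuple[bool, str]:
    ok = draft == step.upper()
    return ok, ("" if ok else "use upper case")


output = run_pge("build page and write tests", toy_planner, toy_generator, toy_evaluator)
print(output)
```

The point of the structure is that the evaluator's feedback is the only channel by which a draft improves, so the evaluator can be swapped for something stricter (for example, automated tests) without touching the generator.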
Three-Layer Trust Gradient (the Swiss-Cheese Model)
Context: Degrading From the Very First Token
Chroma tested 18 frontier models, fixing task difficulty and varying only input length, and found performance degrading from the first output token—no exceptions; shuffling sentence order even made every model perform better (structured filler distracts the model). SWE-rebench maintainers observed a performance ceiling around 1 million tokens. This is the n² cost of the Transformer: 10 tokens make 100 pairs, 10,000 tokens make 100 million.
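The pair counts quoted above follow directly from the quadratic relationship. A short sketch, assuming full self-attention where every token is paired with every token (the function name is illustrative):

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of token-to-token pairs under full self-attention: n squared."""
    return n_tokens * n_tokens


for n in (10, 10_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} pairs")
```

Going from 10,000 to 1,000,000 tokens multiplies the input by 100 but the pair count by 10,000, which is why context length is expensive to scale.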
System: Workflows and Multi-Agent
Workflow Redesign: INSEAD's 1.9× Gap
INSEAD and Harvard Business School ran a 10-week RCT on 515 global startups: same people, same tools, same training; the only difference was whether they re-examined the whole process. The group that did earned 1.9× the control group's revenue, found 44% more AI use-cases, and needed 39.5% less capital.
The Task Length AI Can Handle Is Growing Exponentially
Over the past six years, the task length AI can complete independently has doubled roughly every 7 months. Tellingly, three independent projects with no knowledge of each other—Manus (task_plan.md), OpenClaw (MEMORY.md), and Claude Code (CLAUDE.md + Skills)—all converged on a 'file system' solution to manage long tasks.
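To see what a 7-month doubling period implies over six years, here is a small worked calculation. It assumes the doubling rate is constant across the whole period, which the text states only approximately; the function name is illustrative.

```python
def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_months)


six_years = growth_factor(6 * 12)
print(f"Growth over six years: about {six_years:,.0f}x")
```

Seventy-two months is a little over ten doublings, so the task length grows by roughly three orders of magnitude under this assumption.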
The Four Multi-Agent Traps
| Trap | What it looks like |
|---|---|
| ① Over-delegation | A single well-equipped agent handles most cases; splitting blindly only adds chaos |
| ② Under-specification | Briefs are too thin, so sub-agents re-investigate the same direction; write each brief like a ticket for a first-day junior engineer |
| ③ Coordination overhead | Dispatch + execute + synthesize can mean 5–10 API calls; a single agent needs just 1–2 |
| ④ Telephone-game effect | Information decays at each handoff—have sub-agents write to the file system directly |
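The remedy named for the telephone-game effect, having sub-agents write to the file system directly, can be sketched as follows. All names here (`sub_agent`, `coordinator`, the JSON layout, the sample findings) are hypothetical; the sketch only shows the handoff pattern, not any particular framework.

```python
import json
import tempfile
from pathlib import Path


def sub_agent(name: str, finding: str, workdir: Path) -> Path:
    """Hypothetical sub-agent: writes its full result to a file, returns only the path."""
    path = workdir / f"{name}.json"
    path.write_text(json.dumps({"agent": name, "finding": finding}), encoding="utf-8")
    return path


def coordinator(paths):
    """Reads each artifact verbatim instead of relying on a relayed summary."""
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    paths = [
        sub_agent("researcher", "API rate limit is 100 requests/min", workdir),
        sub_agent("tester", "3 of 40 tests fail on retry logic", workdir),
    ]
    findings = coordinator(paths)

print(findings)
```

Because only file paths cross the handoff, the coordinator sees exactly what each sub-agent wrote, with no paraphrasing step in between.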
The real value of multi-agent isn't 'smarter' but more parallel compute: on OpenAI's BrowseComp benchmark, Anthropic found that token usage alone explains 80% of the performance variance. So the rule is: master one agent, then scale to many.
Caution: Addition Bias and Deskilling
Addition Bias: Humans 60% vs AI 88–100%
Eight experiments published in Nature in 2021 found that, unprompted, only 41% of people thought of subtraction, rising to 61% after an 8-word hint that 'removing is free.' AI is trained on language, and language itself leans toward 'adding,' which produces a double bias: human instinct plus AI amplification. Worse, when given a problem where subtraction is the more efficient solution, GPT-4 chose addition even more often, so the efficiency signal backfired.
Deskilling: The AI Group Scored 17 Points Lower
In an Anthropic RCT (n=52), the AI group scored 50% against the control group's 67%. The difference isn't 'AI or not' but how much cognitive engagement there is. The three lowest-scoring modes were: AI delegation (hand everything to AI), gradual dependence (give it all up step by step), and iterative debugging (paste the error without asking why). The three highest-scoring modes: concept queries (ask only concepts, write it yourself), mixed code-and-explanation (ask for both), and generate-then-understand (let AI generate, then probe until you get it).
Distillation: The Knowledge Flywheel
Methods date, tools iterate—what remains and compounds is the knowledge network you weave. Knowledge engineering is a spiral: write down the ways of working you taught the AI (Skills, solving 're-teaching'), remember decisions and lessons from failure (Memory, solving 're-forgetting'), then connect the accumulation into relationships and causal chains (connection is where the real value is).
The era of big data is over; the era of big knowledge is beginning.
Conclusion: The Moat Isn't the Model—It's the Network You Weave
String the ten keywords into one line: the trust gap forces us to admit this is a system-design problem; harness engineering moves us from babysitting output to designing the environment; memory, skills, evaluation and context are the four pillars; workflows and multi-agent string them into a system; addition bias and deskilling warn us not to quietly lose our judgment; and knowledge engineering is the flywheel that distills all of it into a moat.
What ultimately compounds into a moat is never a model or a tool—those all date. What truly belongs to you is the knowledge network you write down, remember and connect across every round of practice. Every data point is already the seed of a chart.
Evidence Pool: Key Figures at a Glance
| Metric | Value | Period | Source |
|---|---|---|---|
| AI coding tool adoption | 70% → 84% | 2023 → 2025 | Stack Overflow |
| Trust in AI coding tools | 40% → 29% | 2023 → 2025 | Stack Overflow |
| Don't fully trust AI code correctness | 96% | 2025 | Sonar |
| Always review AI code before committing | 48% | 2025 | Sonar |
| METR actual speed / self-perception | -19% / +20% | 2025 | METR |
| Perception–reality gap | 39 pts | 2025 | METR |
| AI pilots with no measurable return | 95% | 2026 | MIT Sloan |
| Use AI / truly integrated | 88% / 7% | 2026 | MIT Sloan |
| Executives assume / front line feels | 76% / 31% | 2026 | BCG + Columbia |
| Executives admit AI tears the company apart | 42% | 2026 | BCG + Columbia |
| Hand-written ≤60 rules / AI 200+ rules | +4% / -3% | 2026 | ETH Zurich + Tsinghua |
| Tencent rule slimming | 200 → 50 lines | 2026 | Tencent RI |
| Codex team: engineers / time / hand code | 3 / 5 mo / 0 lines | 2026 | OpenAI Codex |
| FadeMem key-fact retention / storage cut | 82.1% / 45% | 2026 | FadeMem |
| F1 with / without memory management | 29.43 / 5.17 | 2026 | FadeMem |
| Public Skills analyzed | 55,315 | 2026 | SkillReducer |
| Compressed / original capability score | 0.742 / 0.722 | 2026 | SkillReducer |
| Tool-selection accuracy (4 / 51 tools) | 43% / 2% | 2026 | Berkeley BFCL |
| LLM-as-Judge agreement / production error | >80% / >50% | 2026 | Multiple studies |
| MCP tokens before / after lazy-loading | ~77,000 / ~8,700 | 2026 | Claude Code |
| INSEAD treatment-group revenue multiple | 1.9× | 2026 | INSEAD + Harvard |
| INSEAD treatment-group capital reduction | 39.5% | 2026 | INSEAD + Harvard |
| Use AI / think workflow is efficient | 84% / 21% | 2025 | Telerik |
| AI task-length doubling period | ~7 months | past 6 yrs | METR |
| Multi-agent trajectories / production failure | 1,642 / 41%–86.7% | 2026 | UC Berkeley |
| Multi-agent failures from spec & coordination | 79% | 2026 | UC Berkeley |
| Humans / GPT-4o addition-strategy rate | ~60% / 88%–100% | 2026 | Uhler / Nature |
| add / subtract word frequency | 361,246 / 1,802 | corpus | U. of Birmingham |
| Anthropic learning RCT: AI / control | 50% / 67% | 2026.01 | Anthropic |
| Lancet colonoscopy: with AI / after removal | 28.4% / 22.4% | 2025 | Lancet |
| Knowledge graph lifts DeepSeek accuracy | 80.7% → 86.1% | 2026 | OpenKG |
| Incremental update vs rebuild drift cut | 86% | 2026 | Stanford ACE |
FAQ
What is Harness Engineering?
It is the key 2026 paradigm: instead of optimizing how you talk to AI (prompt engineering) or what you show it (context engineering), you design the environment AI works in—constraints, feedback, verification, state management. The formula is Agent = model + harness. LangChain showed that with the same frontier model, changing only the surrounding infrastructure raised its TerminalBench rank by 20+ places.
Why do people trust AI coding tools less the more they use them?
Stack Overflow found adoption rising from 70% (2023) to 84% (2025) while trust fell from 40% to 29%. It doesn't mean AI got worse—users increasingly know where it breaks. It's a learning curve disguised as a trust problem.
Does using AI actually make you faster?
Not necessarily. METR had 16 senior developers use frontier models in familiar repos; they felt 20% faster but were measured 19% slower—a 39-percentage-point perception–reality gap. Speed gains depend heavily on whether the task and workflow have been redesigned.
Is multi-agent always better than a single agent?
No. UC Berkeley analyzed 1,642 trajectories with production failure rates of 41%–86.7%, of which 79% came from spec and coordination rather than model capability. The rule is 'master one, then scale'; Anthropic suggests a span of control of 3–5 for research tasks and 1–2 for coding.
How much difference does redesigning the workflow make?
INSEAD and Harvard ran a 10-week RCT on 515 startups: with identical people, tools and training, the group that re-examined the whole process earned 1.9× the control group's revenue, found 44% more AI use-cases, and needed 39.5% less capital. The only variable was a workflow-redesign mindset.
Does AI cause deskilling, and how do you avoid it?
It can, depending on cognitive engagement. An Anthropic RCT (n=52) found the AI group scored 17 points lower; in a Lancet study, after three months with AI, colonoscopists' detection rate fell from 28.4% to 22.4% once AI was removed. Avoid it with high-engagement modes—ask concepts and write it yourself, demand code plus explanation, generate then probe until you understand—and above all protect constitutive capacities like judgment.