The New Engineering Judgment: Five Skills That Survive the Agentic Transition

0 batti cinque

A software researcher at University College London built a production-grade, multi-module data analysis platform (118 Python files, 31 REST API endpoints, 200+ unit tests, a four-layer architecture employing six design patterns) in five days, without writing a single line of code. The question this raises is not whether AI can build software. It can. The question is: what did the human actually contribute, and is that contribution something that can be systematically learned?

The dominant narrative around AI-assisted development collapses into two positions. One camp treats every capability demonstration as evidence that developers are becoming redundant. The other dismisses experiments like this as prototypes disconnected from production reality. Giuseppe Destefanis’s case study at UCL (Destefanis, 2026) cuts through both positions with something rare in this conversation: systematic, prompt-level empirical data on what the human did for five consecutive days, and precisely why those contributions made the difference between a working system and a failed experiment.

What makes this study worth examining carefully is not the scale of the artifact but the granularity of the record. Most published accounts of AI-assisted development are retrospective assessments or controlled productivity trials conducted with students under lab conditions. This is a practitioner’s live, prompt-level documentation of a five-day engineering effort, preserved with sufficient resolution to reconstruct what the human contributed at each stage. That ecological validity is what makes the findings actionable rather than merely suggestive.

Subscribe to newsletter (Si apre in una nuova finestra)

The Empirical Backbone

The Destefanis experiment produced a non-trivial artifact: a modular Flask web application for mining software repository data, built across 12 Claude Code sessions totaling 552 prompts (Destefanis, 2026). The system handles file import, variable classification, interactive visualization, cross-dataset comparison with threshold management, and exportable PDF and Excel report generation. By conventional engineering standards, this would require a small team over several weeks.

“79 out of 552 prompts (14.3%) contained direct technical or logical guidance: root cause diagnoses, logic fixes expressed in plain language, data model corrections, and implementation direction. The most persistent bug in the project was ultimately resolved not by Claude’s debugging, but by the human providing the solution directly: ‘it’s a simple if: if session_id equals selected session, show it.’” (Destefanis, 2026)

This 14.3% is the most instructive number in the study. In a project built “without writing a single line of code,” nearly one in seven interactions required the human to provide the cognitive core of the solution: not syntax, but logical reasoning, domain knowledge, and diagnostic precision. The code was generated by the machine. The thinking that directed it was irreducibly human.

This finding makes concrete a broader structural shift already underway: the bottleneck in software production has moved from implementation speed to specification quality. What Destefanis’s data provide is not a prediction about a future state but a prompt-level record of what that shift looks like in five days of live practice. The five skills that follow are derived directly from what the human in this experiment actually did.

The Five Skills

The 552 prompts distribute across categories in ways that map directly to the competencies required to operate effectively in an agentic development context. Three emerge from the Destefanis data; two from the broader structural transformation are equally consequential.

1. Specification Engineering (Architecture-First Thinking)

The single most consequential investment in the experiment was Phase 1: 98 prompts, representing 18% of the total, spent entirely on requirements analysis, architectural documentation, and sprint planning before a single line of code was produced (Destefanis, 2026). The resulting 68-page ARCHITECTURE.md (Si apre in una nuova finestra) and detailed sprint plan were not bureaucratic artefacts. They were active control mechanisms that Claude re-read after every context overflow, and that Destefanis used to verify architectural adherence throughout the project.

The document served a less obvious function as well: it compressed the cognitive overhead of re-establishing context after each session reset, transforming what would otherwise be a costly verbal re-briefing into a single, stable artifact that both parties could reference. Most developers treat documentation as a downstream output of completed work. In this experiment, it functioned as upstream input to every subsequent session. The human who can write a specification precise enough to survive context overflow is the human who ships working software. The rest are managing entropy.

2. AI Behavioral Management (Steering the System)

Approximately 8% of the 552 prompts were dedicated to managing Claude’s behavioral tendencies: preventing over-engineering (“are you sure the DI container is needed?”), demanding honest assessment (“Don’t please me”), preventing regressions (“Don’t change that!”), and controlling scope creep (Destefanis, 2026). These were not conversational adjustments. They were precision corrections to a system with predictable failure modes.

Claude defaulted to blaming browser caching as the root cause at least five separate times across the project, a pattern Destefanis identifies as “hypothesis anchoring”: latching onto a plausible-sounding diagnosis and returning to it even after it has been disproven. Without active human intervention, this pattern would have consumed unbounded time. AI behavioral management is not intuition; it is a learnable technical skill with identifiable counter-prompts and measurable outcomes.

The “hypothesis anchoring” failure mode corresponds to well-documented behavior in large language models. The practical implication extends beyond individual practice: teams that develop a shared taxonomy of these patterns, with corresponding counter-prompts, will consistently outperform teams where each developer rediscovers the same failure modes independently. Behavioral management treated as a team-level competency, rather than individual improvisation, is a source of durable organizational advantage that few engineering organizations have yet formalized.

3. Domain-Grounded Logical Reasoning (Root Cause Diagnosis)

The most persistent bug in the project, session isolation for plots, was resolved not by escalating console output to Claude, but by reducing the problem to a single conditional: “if session_id equals selected session, show it” (Destefanis, 2026). This is not code. It is logical specification derived from understanding how the system should behave.

Problems that had consumed 15+ prompts across multiple context continuations were resolved immediately when the human stated the logical solution in plain language. The ability to identify the conceptual flaw in a broken system, rather than describe the symptom, is precisely the systems thinking that AI cannot substitute.

There is a temptation to treat this as simply “understanding your domain.” It is more precise than that. The skill is the ability to translate domain understanding into a logical specification tight enough that the implementation follows without ambiguity. Domain expertise without this translational capacity still produces vague prompts and accumulating entropy, regardless of how sophisticated the model receiving them.

4. Context Engineering (Managing the AI’s Information Environment)

The 222-prompt marathon session in Phase 2 produced 10+ context overflows, each requiring a 200–400 line summary that compounded token overhead and progressively reduced effective productivity (Destefanis, 2026). The shorter, domain-scoped sessions in Phase 3 were measurably more efficient. Managing what information the AI holds at any given moment is a core competency: curating the information environment through structured prompts, memory management, and deliberate session design.

This skill operates at the session level, not the prompt level. It requires understanding how language models degrade with context dilution and compensating through explicit re-anchoring, modular session boundaries, and documentation as persistent memory. The counterintuitive aspect is that it runs against the natural instinct of most developers, who want richer context, longer histories, and more complete information. The discipline of excluding details that introduce noise rather than signal is harder to develop than knowing what to include, and considerably rarer to find in practice. The developer who treats every session as a fresh conversation starting from scratch will consistently underperform the developer who treats context as a managed resource.

5. Acceptance Criteria Design (Quality Judgment)

Before every commit in the Destefanis experiment, one pattern was invariant: “run the system so I test it, before we commit the changes” (Destefanis, 2026). The human remained the quality arbiter throughout. This was not optional overhead; it was the mechanism by which regressions were caught and Claude’s tendency to introduce tangential fixes was contained.

If you cannot write three sentences that an independent observer could use to verify the output, you do not understand the task well enough to delegate it. This applies whether the delegate is a junior engineer or an AI agent. The critical difference is that the AI will produce confident-looking output that satisfies the literal prompt while violating the intent, and confidence is not a signal of correctness. Quality judgment, in this context, is not a soft skill. It is the last line of defense against a system optimized to appear helpful.

The 30-Minute Skill Audit

Run this audit once a week for four weeks. It is designed to surface the gap between current practice and the five skills above.

Specification test. Take a project you are currently working on. Write a specification, not a ticket, that an unfamiliar colleague could execute without asking a follow-up question. If it takes less than 15 minutes, you have likely not been precise enough.
Behavioral management audit. Review your last AI-assisted session. Count every instance of “still not working” that appeared without an accompanying logic fix. That number is your behavioral management debt.
Diagnostic reasoning practice. Take a recent bug. Write a differential diagnosis: “this component works, that one does not, therefore the fault is in the interface between them.” If you cannot write this without reading code, the reasoning skill requires deliberate development.
Context mapping. Before your next AI session, list explicitly what the agent needs to know, what it must not change, and what the acceptance criterion for completion is. Write it down before opening the chat.
Acceptance criteria check. Define “done” for your next feature in three sentences. Show those sentences to a colleague with no context on the task. Ask whether they could verify completion. If they cannot, rewrite.

Next Moves by Role

The Builder (Individual Contributor)

The bottleneck in your workflow has shifted from how fast you can write code to how precisely you can describe a problem. This week: take one active bug and practice. Provide the logical fix in plain language before asking the AI to implement it. Track resolution time against your baseline from the prior week. The difference will be instructive.

The Manager (Team Lead)

The 14.3% technical intervention rate in the Destefanis study is a hiring and onboarding signal. Your team’s effectiveness with AI agents correlates with their capacity to perform root cause diagnosis, write acceptance criteria, and manage context, not with framework fluency. Restructure your code review practice: stop reviewing syntax and start reviewing specifications and acceptance criteria. The code will follow.

The Roadmap Owner (Executive/CTO)

Junior developer job postings in the US have declined by 67%, and UK graduate tech roles fell 46% in 2024 with a further 53% decline projected by 2026. The apprenticeship model that transferred institutional knowledge through implementation is structurally broken. Your organization’s capacity to develop senior judgment internally is at risk. The appropriate response is an explicit “residency” model: junior engineers develop judgment by reviewing and steering AI output rather than writing routine endpoints. Invest in this before the institutional knowledge gap compounds.

The phrase “without writing a single line of code” is factually accurate and simultaneously misleading. What the experiment demonstrates is that the nature of the human contribution has shifted from writing syntax to providing the cognitive infrastructure that makes correct software possible: the architecture, the diagnosis, the behavioral steering, the quality judgment (Destefanis, 2026). These are not softer contributions than writing code. They are harder, because they require understanding the system at a level of abstraction that syntax practice alone will never build.

The organizational dimension of this transition deserves equal attention. Most performance evaluation systems, hiring rubrics, and career ladders in software organizations were calibrated for a world where implementation capacity was the primary bottleneck. Recalibrating those systems to recognize specification quality, behavioral management, and diagnostic reasoning as first-order competencies is a non-trivial institutional change. The developers treating this transition as a reduction of their role are misreading the data. Those treating it as a change in where their expertise must be applied are reading it correctly.

Daniel Russo, Ph.D., is a Professor of Software Engineering whose research examines the intersection of human cognition and artificial intelligence. Through “Software Insights,” he translates empirical research into actionable guidance for software practitioners and organizations.

Partner with Daniel to transform your organization through evidence-based approaches that bridge academic rigor with practical implementation. His consulting work helps organizations to adopt scientifically validated practices that improve software development outcomes, team performance, and innovation capacity.

Learn more about his approach to evidence-based organizational change: https://www.danielrusso.org/evidence-based-organizational-change/ (Si apre in una nuova finestra)

References

Destefanis, G. (2026). Building complex software using natural language: A case study in AI-assisted software development. University College London.

View all posts (Si apre in una nuova finestra)

Data 03/03/2026

0 batti cinque