
In Part 1, I described what my workday looks like now: five agents running in parallel, me toggling between approvals, QA, and direction-setting. The thesis: managing AI agents well requires the same skills as managing people well, because an agent is a Harvard brain with zero context. The gap between disappointing AI output and genuinely useful output is almost entirely a management gap.
That was the worldview. This is the field manual.
I'm not the only one seeing this pattern. As recruiting leader Glen Cathey put it bluntly: "AI skills are really the same skills as managing people." And Wharton's Ethan Mollick has been making the case that "great AI management, not great models, creates competitive advantage." The models will keep improving. The management gap is the differentiator.
What follows is the operating system I've built over two years of daily agent work: the loops, templates, and habits that make the difference between "AI is overhyped" and "I can't imagine working without this." If you read Part 1 and thought "okay, but what do I actually do?", this is the answer.
The Operating System (Five Steps, One Loop)
Every task I delegate to an AI follows the same rhythm. Whether it's a one-off research request or a standing workflow running three times a day, the pattern holds:
1. Brief. Define the outcome, context, constraints, and what "done" looks like. This is the Manager Brief from Part 1. Two to three minutes. It replaces the vague "can you write me an email?" that produces vague output.
2. Plan-first checkpoint. Before the agent drafts anything, ask it to restate the task, list its assumptions, and surface any gaps. This is the equivalent of a manager saying "before you start, walk me through your approach." It catches misunderstandings before they become wasted drafts.
3. Draft. The agent produces output. This is the part most people think is the whole process. It's actually the smallest management decision in the loop.
4. QA. Run the Trust-but-Verify check (more on this below). Score it. Give targeted feedback. The agent revises.
5. Log and reuse. Every correction becomes a rule. Every strong output becomes a template. The playbook compounds. This is how institutional knowledge gets built: the same way a sales org builds playbooks from what top performers do differently.
Brief → Plan → Draft → QA → Log. That's it. The steps are simple. The discipline of doing them every time is what separates the people who get real value from AI from the people who gave up after a month.
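If it helps to see the loop as logic, here's a minimal sketch in Python. Everything in it is a stand-in: "agent" is whatever callable wraps your platform, "qa_fn" is your own rubric, "playbook" is wherever you log. No real API is being referenced.

```python
# A minimal sketch of the loop, assuming "agent" is any callable that
# takes a prompt string and returns text, and "qa_fn" returns an
# (A/B/C score, feedback) pair. None of these names are a real API.

def run_task(brief, agent, qa_fn, playbook, max_revisions=3):
    # 1. Brief is the input: outcome, context, constraints, definition of done.
    # 2. Plan-first checkpoint: the agent restates the task before drafting.
    plan = agent("Before drafting: restate the task, list assumptions, flag gaps.\n" + brief)
    draft, feedback = "", ""
    for _ in range(max_revisions):
        # 3. Draft, folding in the plan and any prior feedback.
        draft = agent(brief + "\nPlan:\n" + plan + "\nFeedback:\n" + feedback)
        # 4. QA: run Trust-but-Verify, score it, coach with targeted feedback.
        score, feedback = qa_fn(draft)
        if score == "A":
            break  # ships with minor edits
        if score == "C":
            raise ValueError("C score: the brief or context was wrong; start over")
    # 5. Log and reuse: corrections become rules, strong outputs become templates.
    playbook.append({"brief": brief, "output": draft, "rules": feedback})
    return draft
```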
The Trust-But-Verify Loop (90 Seconds)
This is the QA habit I run on every agent output before it ships anywhere. It takes about 90 seconds once you've done it a few times. Five checks:
Sanity scan. Read it once, fast. Anything obviously wrong, missing, or off-audience?
Evidence check. Are factual claims cited or sourced? If the agent stated something as fact, can you trace it back to a real document, a real data point, a real quote? Anything unsourced gets flagged or cut.
Constraint check. Does it follow the rules you set? Brand voice, compliance, claims you can't make, scope boundaries. This is where most "the AI said something embarrassing" stories come from: the agent wasn't given constraints, so it made something up that sounded plausible.
Score it. A, B, or C. "A" means it ships with minor edits. "B" means structurally sound but needs revision in specific sections. "C" means the brief or context was probably wrong β start over.
Coach it. Give feedback the way a good manager gives feedback: "Keep the structure. Revise the third section; the risk framing is too generic. Use this rubric: accuracy first, then clarity, then persuasion. Cite the source for the revenue claim or remove it."
The key phrase I use constantly: "Keep X, change Y, because Z." That's targeted feedback. It tells the agent (or the intern) exactly what worked, what didn't, and why, so the next draft improves instead of just being different.
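For the checklist-minded, here's the same pass as a sketch you could keep next to your QA doc. The three checks mirror the ones above; the A/B/C mapping and the coach() helper are my own illustration, not a standard.

```python
# The 90-second QA pass as a checklist. The checks mirror the ones above;
# the pass logic is illustrative, not gospel.

CHECKS = {
    "sanity":     "Anything obviously wrong, missing, or off-audience?",
    "evidence":   "Is every factual claim traceable to a real source?",
    "constraint": "Voice, compliance, banned claims, and scope all respected?",
}

def score(results: dict) -> str:
    """Map check results (check name -> passed?) onto the A/B/C scale."""
    if all(results.get(name, False) for name in CHECKS):
        return "A"  # ships with minor edits
    if results.get("evidence") and results.get("constraint"):
        return "B"  # structurally sound; revise specific sections
    return "C"      # the brief or context was probably wrong; start over

def coach(keep: str, change: str, because: str) -> str:
    """Targeted feedback in the 'Keep X, change Y, because Z' shape."""
    return f"Keep {keep}. Change {change}, because {because}."
```

coach("the structure", "the risk framing in section three", "it's too generic") reads back exactly like the feedback example above. That's the point: the shape forces you to say what worked, not just what didn't.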
Workflow #1: Account Research → Personalized Outreach
This is the workflow I run most often. The goal: turn a target account into a one-page research brief and a set of personalized email drafts, in about 15 minutes of my time.
What the agent needs from you (the brief):
Target account name and website.
Your ICP definition: who you sell to, who you don't.
Your value prop and two or three approved proof points.
Deal stage and meeting goal (book a discovery call, re-engage a closed-lost, etc.).
Voice and tone examples: two good, one bad.
Constraints: claims you can't make, competitor rules, privacy boundaries.
What the agent produces:
A company snapshot (verified facts only).
Two or three strategic initiatives with citations.
Three pain hypotheses and three priority hypotheses, explicitly labeled as hypotheses, because the agent is guessing based on available signals.
A buying committee map.
Three messaging angles tied to your proof points.
Eight to ten discovery questions.
Risks, unknowns, and a verification plan.
A confidence rating.
What you QA:
The firmographic facts: are they real?
The hypotheses: are they grounded in something observable (job postings, press releases, earnings calls), or did the agent just confabulate?
The messaging angles: would a skeptical buyer find these relevant, or do they sound like generic marketing?
My standing rule: if the accuracy score isn't an A, the brief doesn't leave my desk. Hypotheses can be wrong; that's fine, they're hypotheses. But stated facts must be cited. "Unknown" is always an acceptable answer. Confident fabrication is never acceptable.
The management analog: this is exactly how you'd review a research brief from a new analyst. You wouldn't send it to a client without checking the sources. You'd push back on overconfident claims. You'd ask "how do you know that?" That's the same instinct.
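If you'd rather keep the brief as a fill-in-the-blanks structure, here's one way to lay it out. The field names are my own labels for the inputs above, nothing more; adapt them to however you store briefs.

```python
# The Workflow #1 brief as a reusable template. The field names are my
# own labels for the inputs listed above; fill them in per account.

ACCOUNT_RESEARCH_BRIEF = {
    "account": {"name": "", "website": ""},
    "icp": "Who you sell to, who you don't.",
    "value_prop": "",
    "proof_points": [],  # two or three, approved only
    "deal_stage": "",
    "meeting_goal": "",  # e.g. book a discovery call, re-engage a closed-lost
    "voice_examples": {"good": [], "bad": []},  # two good, one bad
    "constraints": [],  # claims you can't make, competitor rules, privacy boundaries
    "standing_rules": [
        "Stated facts must be cited.",
        "'Unknown' is always an acceptable answer.",
        "Label every hypothesis as a hypothesis.",
    ],
}
```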
Workflow #2: Call Analysis → MEDDPICC Scoring → Deal Risks
This is the workflow that surprised me most with how much time it saves. The goal: turn a sales call transcript into a qualification scorecard, a risk register, and a coaching plan.
What the agent needs from you:
The transcript (with speaker labels).
The deal stage and next meeting goal.
Known stakeholders.
Your scoring methodology. I use MEDDPICC, but SPICED, BANT, or any framework works if you define the components and the scoring scale.
The key instruction: score only what has evidence in the transcript. If a MEDDPICC component wasn't discussed, the score is zero and the gap becomes a question for the next call.
What the agent produces:
A five-bullet call summary (no new facts, only what was said).
A MEDDPICC evidence table: for each component, what we know, the supporting quote or timestamp, a confidence rating, the gaps, and the next questions to ask.
A deal risk register: the top five risks with evidence, severity, and mitigation steps.
Three next best actions, with exact questions for the follow-up call.
What you QA:
Evidence grounding: did the agent pull real quotes, or did it paraphrase loosely and inflate the confidence rating?
Gap identification: did it actually flag what's missing, or did it paper over weak areas?
Coaching tone: is the output neutral and useful, or did it drift into judgment?
I shared numbers in Part 1: across a recent batch, the portfolio MEDDPICC average was 19.2 out of 40, and in a stricter buyer-evidence-only pass, only 31.3% of deals qualified. Those numbers told my client exactly where qualification was breaking down across their team, and which patterns were systemic versus deal-specific.
The management analog: this is a deal review meeting, compressed to 10 minutes. The agent does the prep. You do the interpretation and the coaching. That split is the whole point.
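The "score only what has evidence" rule is the part worth enforcing mechanically, and it's simple to encode. A sketch, assuming eight MEDDPICC components scored 0-5 each (which is what produces the 40-point scale above) and an evidence table shaped roughly like the one the agent returns; the exact shape is my illustration.

```python
# Evidence-gated MEDDPICC scoring: a component with no supporting quote
# scores zero, and the gap becomes a next-call question. The 0-5 scale per
# component (8 x 5 = 40) matches the 40-point scale cited above; the
# evidence-table shape is a hypothetical example of the agent's output.

MEDDPICC = ["metrics", "economic_buyer", "decision_criteria", "decision_process",
            "paper_process", "identify_pain", "champion", "competition"]

def score_deal(evidence_table: dict) -> tuple:
    """Return (total score out of 40, questions for the next call)."""
    total, next_questions = 0, []
    for component in MEDDPICC:
        entry = evidence_table.get(component)
        if entry and entry.get("quote"):  # only score what was actually said
            total += min(entry.get("score", 0), 5)
        else:
            next_questions.append(f"{component}: not discussed; ask about it next call")
    return total, next_questions
```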
The Management Layer That Doesn't Exist Yet
In Part 1 I mentioned the cognitive weirdness of running multiple agents in parallel: forgetting a task is running, scrolling past a tab and realizing output has been sitting there for 20 minutes. That's a real problem. And it's about to get worse for everyone.
Right now, my agent crew is spread across Claude Cowork, ChatGPT Codex, ChatGPT deep research, Perplexity, Gemini, and Make.com. Each platform has its own interface, its own notification system, its own way of telling me what's done, what's waiting, and what broke. There's no single place where I can see: here are my five running agents, here's their status, here's what needs me.
I'm managing a team with no project management tool. No org chart. No shared status board. Just tabs.
And the research says this is expensive. UC Irvine found it takes an average of 23 minutes and 15 seconds to return to a task after an interruption. If you're checking on five agents across five platforms, you're triggering that penalty repeatedly.
a16z captured it well in their "Notes on AI Apps in 2026": all the tools we use for knowledge work are focused on execution. When it comes to tools that help us think β tools for managing, reviewing, and directing the work β we don't really have any modern products.
The enterprise market is starting to build this layer. Salesforce introduced Agentforce with observability tools and a "command center" for monitoring agent fleets. OpenAI introduced Frontier as an enterprise platform for building, deploying, and overseeing agents. Microsoft added analytics and monitoring for agent performance in Copilot Studio. A growing category of "AgentOps" vendors is racing to fill the gaps.
But for the individual knowledge worker? The consultant, the AE, the founder running a small team with an agent crew? That layer doesn't exist yet. You're duct-taping it together with browser tabs and memory.
What I do in the meantime: I run a manual 5-minute review at the start of each work block. Open every platform. Check what's done, what's waiting on me, what went silent. I built this habit because I know my attention: I'll get absorbed in one agent's output and forget three others are waiting. The 5-minute sweep isn't optional for me. I'd argue it's not optional for anyone running parallel workflows.
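Until the control tower ships, even a dumb prompt-yourself script beats tabs and memory. A sketch of the 5-minute sweep; the platform list is mine from above, and the statuses are whatever you find when you open each one.

```python
# The 5-minute sweep as a prompt-yourself script. No integrations, no
# real APIs: you open each platform and record what you find.

from datetime import datetime

PLATFORMS = ["Claude Cowork", "ChatGPT Codex", "ChatGPT deep research",
             "Perplexity", "Gemini", "Make.com"]

def sweep():
    """Walk every platform; return the ones that need you."""
    needs_me = []
    for platform in PLATFORMS:
        status = input(f"{platform}: done / waiting / running / silent? ").strip()
        if status in ("done", "waiting", "silent"):
            needs_me.append(platform)
    print(f"[{datetime.now():%H:%M}] Needs attention: {', '.join(needs_me) or 'nothing'}")
    return needs_me
```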
This is a temporary workaround. The control tower for individual agent management is coming, and when it arrives, it'll change the daily experience of managing agents as much as project management software changed managing people. But until then, the discipline has to be manual. And that's fine. Good managers operated before Salesforce existed. They'll operate before the agent dashboard ships too.
The Promotion Ladder (How To Scale Trust Safely)
When you onboard a real intern, you don't hand them the client relationship on day one. You start with low-risk tasks, review everything, and gradually increase responsibility as they prove reliability.
Same principle with agents. I use a four-level ladder (a sketch of the gating logic follows the list):
Level 1: Draft only. The agent produces output. You review and edit everything before it goes anywhere. This is where every agent starts.
Level 2: Draft plus recommendations. The agent produces output and suggests next steps or decisions. You still approve everything, but the agent is starting to show judgment.
Level 3: Tool access (read). The agent can pull from your CRM, your docs, your research sources. It has context beyond what you paste in. QA becomes more important here because the agent is working with real data.
Level 4: Tool access (write) and action-taking. The agent can update a CRM field, send a draft to your review queue, or trigger a workflow. This is where you need hard approval gates: nothing customer-facing or system-writing happens without human sign-off.
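Here's that gating logic as the promised sketch. The four levels mirror the list; the action names are hypothetical placeholders for whatever your tools actually expose.

```python
# The promotion ladder as explicit permission gating. The levels mirror
# the list above; the action names are hypothetical placeholders.

from enum import IntEnum

class Level(IntEnum):
    DRAFT_ONLY = 1   # you review and edit everything
    RECOMMEND = 2    # drafts plus suggested next steps, still fully reviewed
    READ_TOOLS = 3   # read access to CRM, docs, research sources
    WRITE_TOOLS = 4  # write access and actions, behind hard approval gates

READ_ACTIONS = {"read_crm", "read_docs", "search_sources"}
WRITE_ACTIONS = {"update_crm_field", "queue_draft_for_review", "trigger_workflow"}

def allowed(level: Level, action: str, human_approved: bool = False) -> bool:
    """Nothing customer-facing or system-writing happens without sign-off."""
    if action in READ_ACTIONS:
        return level >= Level.READ_TOOLS
    if action in WRITE_ACTIONS:
        return level >= Level.WRITE_TOOLS and human_approved
    return False  # unknown actions are denied by default
```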
Other teams are landing on the same instinct. One team profiled by CIO.com built an AI agent called "Jerri" and gave it a literal 30-day probationary period: a job description, a reporting line to a human manager, and a requirement to learn through feedback rather than prompt edits. That's the pattern.
Most agents should stay at Level 1 or 2 for weeks before you consider promoting them. The ones that consistently score "A" on your rubric, with declining edit rates and fewer corrections needed; those earn Level 3. Level 4 is reserved for workflows where you've stress-tested the failure modes and built guardrails around them.
One more thing about the ladder: some tasks stay at Level 1 or 2 forever. Not because the agent isn't capable, but because the stakes are too high. You wouldn't send an intern to handle a client pitch alone, no matter how smart they are. Same here. Customer-facing communications, pricing decisions, anything with legal or compliance exposure: those warrant permanent human review. As the models get smarter and as you get better at managing them, you can give agents more responsibility on more workflows. But trust-but-verify never goes away for the work that matters most.
The rule of thumb: promote by failure domain. The more irreversible or customer-facing the failure, the more human gating you keep. And be honest about your monitoring bandwidth β every agent at Level 3 or 4 is another thing you need to actively track. ADHD taught me that my monitoring capacity has a hard ceiling. Yours does too, whether you've named it or not.
The management analog is obvious: you promote people who've earned trust through demonstrated competence. Same principle, same pace.
The 7-Day Challenge
If you've read both parts and you're thinking "I should try this," start small. Ten to fifteen minutes a day for one week:
Day 1: Write a Manager Brief for one real task. Run it. Review the output using the Trust-but-Verify loop.
Day 2: Same workflow, different task. Tighten the brief based on what you learned yesterday.
Day 3: Run a call analysis on one transcript. Score it against your methodology. Note where the agent inflated confidence or missed gaps.
Day 4: Revise your call analysis brief to enforce evidence-only scoring and explicit "unknown" flagging.
Day 5: Build a persistent agent (CustomGPT, Claude Project, Gemini Gem) using your best brief, rubric, and voice examples as standing instructions. This is your Intern Handbook.
Day 6: Run both workflows on real opportunities. Compare your time investment versus doing it manually.
Day 7: Log what worked. Save your best outputs as templates. Write down the rules you added. Version your playbook.
That's one week. By the end, you'll have a working system: two workflows, a QA habit, and the start of a playbook that compounds every time you use it.
The Thesis, One More Time
The technology will keep changing. New models every quarter. New tools every month. New agent platforms every week.
The management discipline won't change. Clear expectations, structured context, tight feedback, gradual trust, compounding playbooks: that's how you get reliable output from smart people who are new to your business. It's how you get reliable output from AI agents. Same skills. Same discipline. Different medium.
Start with one workflow. Brief it like a manager. QA it like a manager. Build the playbook like a manager.
The agents are ready. The question is whether you're ready to manage them.
If you want the full template pack (Manager Brief, QA checklist, rubric, feedback phrases, and both workflow briefs as copy/paste documents), reply to this and I'll send it over.
And if you're thinking about how to build this management discipline into your team's workflow, not just your own, that's the work I do with sales organizations. Reply and tell me what you're working on.
Victor Adefuye is the founder of Dana Consulting, where he helps B2B sales teams improve productivity through AI adoption, sales methodology, and coaching. He writes the Superintelligent Sales newsletter.
