Managing Token Cost Is a Skill Issue

Getting your work done without hitting the wall
Why you keep running out
The models, and why most work doesn't need the sma …
The habit that saves the most: be specific
Pick the right place to work, and set it up once
Your first automation: the morning brief
The habits that keep the bill flat
What this adds up to

Getting your work done without hitting the wall

You're deep in a deadline and you run out of tokens. You've hit your five-hour cap on Claude, or the equivalent on ChatGPT and Codex, and the work stops cold. The deck is due in an hour, and your access won't reset for another three.

A few months ago the worry ran the other way. The move then was token-maxing, using as much AI as you could, on the idea that capacity you left unused was intelligence left on the table, like a team of interns sitting idle while you gave them nothing to do.

That has flipped fast to conservation. It was the topic of the month at a go-to-market community I belong to, which gave its entire monthly meeting to it and drew a packed room. Sales teams were running dry in the middle of account research, and revenue ops hit the same wall in the middle of an analysis.

Uber is the example everyone points to. The company put its engineers on Claude Code and ran an internal leaderboard ranking teams by how much AI they used, then watched that usage burn through its entire 2026 budget for those tools by April, four months in. Leadership couldn't connect the soaring spend to better products, so the same company that gamified maximum usage now caps each engineer at $1,500 a month per tool.

Whether you're the one hitting the five-hour cap mid-task or the one watching the company bill climb, the cause is the same. Almost no one was taught how to manage this. It's a skill issue, and the good news about skill issues is that they're fixable.

The skill works at two levels, the individual and the team. This issue is about the individual.

This guide exists to help you avoid hitting that wall. Not by turning you into an engineer, and not by making you hoard tokens, but by getting your everyday work done inside the limits you have and spending what you use well. A few habits and a handful of solid use cases carry most of the load, and the rest of this issue walks through them in plain language.

Why you keep running out

To understand why you keep running out, it helps to know a little about how these tools actually work. Don't worry, we're not getting too technical.

The AI tools you use are large language models, and they operate in tokens. A token is the small unit of text the model reads and writes in, roughly a word or a piece of one. The model spends tokens in three stages: what goes in, what it thinks, and what comes out.

The first stage is inputs, and it's bigger than people expect. Inputs are your prompt, the documents and context you hand over, and the context the model goes and finds on its own when it searches the web or calls a tool. You pay for all of it, including tokens you never typed.

The second stage is processing, the thinking, and only reasoning models do it. It's the logic the model works through over what it found, written out in words, which is why a reasoning trace is the closest thing we have to reading a mind on the page. A plain model blurts the first thing that comes to it, while a reasoning model sits with the problem for a beat and then answers.

The third stage is outputs, what the model produces for you: the email, the article, the block of code, the analysis. It's the only stage you actually see, and which of the three runs up the most tokens depends entirely on the task.

Now you can see why you run out. The more useful the task, the more of all three stages it runs. A genuinely useful job pulls in a lot of context, thinks hard about it, and produces something substantial, and every word of that is metered.

Look at why the sales reps and the revenue-ops teams from that meeting hit the wall, and you see two of those stages at work. The reps ran dry on inputs, all that reading of sites, filings, and call notes. The revenue-ops people ran dry on processing, grinding through pipeline and forecast analysis.

The model doesn't actually remember your conversation, which catches people off guard. Each time you send a message, it works through the whole thread from the top again to decide what to say next. So a long chat isn't free to keep going. The cost climbs as the thread grows, and a tidy thread beats a sprawling one.
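If you like seeing the arithmetic, here is a rough sketch of that rereading effect. It assumes every turn adds the same number of tokens and that the whole thread is read again on each turn. The function name and the 500-token figure are made up for illustration, and real tools may discount repeated context through caching, which this ignores.

```python
def tokens_read_over_thread(turn_sizes):
    """Total tokens read if the whole thread is reread on every turn.

    turn_sizes: tokens added to the thread at each turn, in order.
    This is a simplified model, not how any specific product bills.
    """
    total_read = 0
    thread_length = 0
    for size in turn_sizes:
        thread_length += size        # the thread grows by this turn
        total_read += thread_length  # and the whole thread is read again
    return total_read


# Ten turns versus twenty turns, 500 tokens each (illustrative numbers).
print(tokens_read_over_thread([500] * 10))  # 27500
print(tokens_read_over_thread([500] * 20))  # 105000
```

Under these assumptions, doubling the length of the chat roughly quadruples what gets read, which is why a long thread costs more than it feels like it should.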

That covers how many tokens move through, and it's only half of what you spend. The other half is the price per token, set by which model you put on the job. That's where we go next.

The models, and why most work doesn't need the smartest one

The other factor that drives your bill is the model's intelligence. The smarter the model, the more it charges per token, and models sort into tiers:

Cheap and fast, good at straightforward work.
Workhorses, strong on most everyday professional tasks.
Frontier, the newest and most capable, able to handle real ambiguity and hard reasoning.
Super-frontier, an emerging tier above even that, priced and gated for the rare hardest jobs.

When people say "frontier," they mean the third group, the smartest models generally available right now.

The gap between the tiers is real. A frontier model like Opus 4.8 runs around $5 per million input tokens and about $25 per million output.

Gemini 3.1 Pro, also frontier, lands near $2 in and $12 out, and a cheap model like Haiku is closer to $1 in and $5 out. Treat the exact figures as a snapshot, since they move, but the spread is the point.

Two patterns hold across all of them. Output costs several times more than input, and the model's thinking is billed as output, so a reasoning-heavy task on a frontier model is the most expensive combination there is. The hidden monologue you read in the trace runs up the output meter the whole time the model seems to be sitting quietly.
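To make the spread concrete, here is a small sketch that prices one request using the snapshot figures quoted above. The token counts in the example are invented, the function name is hypothetical, and the prices will drift, so treat the output as an illustration of the pattern and not as a quote.

```python
# Dollars per million tokens, copied from the snapshot figures in the text.
PRICES = {
    "Opus 4.8": {"input": 5.00, "output": 25.00},
    "Gemini 3.1 Pro": {"input": 2.00, "output": 12.00},
    "Haiku": {"input": 1.00, "output": 5.00},
}


def request_cost(model, input_tokens, thinking_tokens, output_tokens):
    """Estimate the dollar cost of one request.

    Thinking tokens are added to output tokens, following the text's
    point that the model's thinking is billed as output.
    """
    price = PRICES[model]
    billed_output = thinking_tokens + output_tokens
    dollars = input_tokens * price["input"] + billed_output * price["output"]
    return dollars / 1_000_000


# Illustrative job: 20,000 tokens in, 5,000 of thinking, 1,000 out.
for name in PRICES:
    print(f"{name}: ${request_cost(name, 20_000, 5_000, 1_000):.3f}")
```

With those made-up token counts the same job comes to about 25 cents on the frontier model and about 5 cents on the cheap one, and most of the frontier bill is thinking you never see.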

So here's the rule the rest of this turns on. Use the cheapest model that clears the bar, and reach for a smarter one only when the task genuinely calls for it: real reasoning, real ambiguity, high stakes, or a decision that matters. Most work doesn't.

Take the range of what people actually do. Reading a 10-K or a company website to grasp its priorities is comprehension, and a cheap or workhorse model handles it. You don't need the frontier to summarize a filing. Building a complex ROI model leans harder on reasoning and earns a stronger model, though even there the math is algebra and statistics, nowhere near the research-grade problems where the top models actually separate.

That's the move you'll make over and over. Size the thinking a task needs, then buy the cheapest intelligence that meets it. The next section is the single habit that makes every one of those tasks cheaper still.
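For readers who think in rules, the move can be sketched in a few lines. The tier names come from the list earlier in this section, but the numeric levels and the ratings given to the two example tasks are assumptions made up for illustration, not a published scale.

```python
# Tiers from the text, ordered cheapest first. The numeric levels are an
# illustrative assumption.
TIERS = [("cheap", 1), ("workhorse", 2), ("frontier", 3), ("super-frontier", 4)]


def cheapest_tier_that_clears(required_level):
    """Return the first (cheapest) tier whose level meets the task's need."""
    for name, level in TIERS:
        if level >= required_level:
            return name
    raise ValueError("No tier is rated high enough for this task")


# Hypothetical ratings for the two tasks discussed above.
print(cheapest_tier_that_clears(1))  # summarizing a 10-K -> cheap
print(cheapest_tier_that_clears(2))  # complex ROI model -> workhorse
```

The judgment is all in the rating you give the task. The lookup itself is trivial, which is the point: once you have sized the thinking, the choice of model follows.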

The habit that saves the most: be specific

The single habit that saves the most tokens is being specific. A vague prompt is the most expensive thing you can type.

Here's why. When you're vague, the model has to guess what you meant, and a guess is usually generic or a little off.

So you run it again with a correction, then maybe again, and every rerun pays the full bill: inputs, processing, and output, charged fresh each time.

A large share of the tokens people spend get burned this way, on reruns of prompts that were never clear enough to get it right the first time. Specificity is how you get most of that back.

Being specific has a shape, and the one I teach is PTCF: Persona, Task, Context, Format. It turns a one-line question into a complete assignment.

Persona, who the model should be. "You are a sales coach with fifteen years working with small B2B teams."
Task, exactly what you want done. "Write a follow-up email after a discovery call where the buyer was interested but hesitant about budget."
Context, the relevant details. "The prospect runs a three-person firm, loved the conversation, and said she needs to think about the investment."
Format, how you want it back. "Two short paragraphs, warm with a light sense of urgency, no jargon."

Compare that to "write a follow-up email." Same model, very different output, and the vague version is the one that sends you back for a second and third try.
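If you ever want to reuse the structure, a PTCF prompt is easy to assemble from its four parts. The sketch below is one possible way to do it. The function name and the "Label: text" layout are assumptions for illustration and not a required format for any tool.

```python
def build_ptcf_prompt(persona, task, context, format_spec):
    """Assemble a prompt from Persona, Task, Context, and Format.

    Raises ValueError if any part is blank, since a missing part is
    exactly the vagueness PTCF is meant to prevent.
    """
    parts = [
        ("Persona", persona),
        ("Task", task),
        ("Context", context),
        ("Format", format_spec),
    ]
    missing = [label for label, text in parts if not text.strip()]
    if missing:
        raise ValueError("Missing PTCF parts: " + ", ".join(missing))
    return "\n".join(f"{label}: {text.strip()}" for label, text in parts)


# The example from the text, built as one assignment.
prompt = build_ptcf_prompt(
    persona="You are a sales coach with fifteen years working with small B2B teams.",
    task="Write a follow-up email after a discovery call where the buyer "
         "was interested but hesitant about budget.",
    context="The prospect runs a three-person firm, loved the conversation, "
            "and said she needs to think about the investment.",
    format_spec="Two short paragraphs, warm with a light sense of urgency, no jargon.",
)
print(prompt)
```

The check for blank parts is the useful bit. It refuses to send the one-line version.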

The shift is from asking questions to giving assignments. Stop asking the model what it thinks, and start handing it the job the way you would a sharp new hire who needs the brief spelled out.

One caution on the Context line. It means the relevant details, not everything you own.

One team I worked with pasted an entire twenty-tab workbook into a prompt and waited while the model chewed through all of it. Pointing it at the one tab that mattered ran faster and cost a fraction. Give it what the task needs and leave the rest out.

A strong PTCF prompt gets you most of the way on the first pass. Refining from there gets you the rest, and the last stretch is your judgment, the part the model can't do for you. Each refinement still costs tokens, so the cleaner your first prompt, the less you spend getting to good.

There's one catch. Every time you open a new chat, the model forgets all of it: your persona, your context, your format. You have to type it again. The next section is how you stop repeating yourself.

Pick the right place to work, and set it up once

The fix for that is to stop living in disposable chats and give your work a permanent home.

Think of three places your work can live. A quick chat is for one-offs, where you ask, get an answer, and move on. A project, which your platform may call a Project, a GPT, or a Gem, is a workspace that holds your context for good.

A scheduled task is the third place, work that runs on its own, on a timer, without you. Most people only ever use the first.

The point of a project is that you load your context once (your role, your methodology, your client details, your key documents) and every prompt inside it starts informed. You stop pasting the same background into a cold chat and paying to send it again. It compounds, because the workspace gets sharper about your world the more you put into it.

As for which brand, they are closer than the marketing suggests for everyday work. Use the one your company already pays for, and build your workspace there.

Before you build or buy anything, check what is already native in the tools you own. The cheapest token is the one you never spend.

One security-conscious firm I worked with skipped a separate scheduling tool once it realized the booking feature was already built into the office suite it paid for. Free beats cheap.

A workspace is the foundation. The next step is letting a piece of your work run on its own, which is exactly what a morning brief does.

Your first automation: the morning brief

The morning brief is the simplest automation worth building, and it puts every habit so far to work at once. Each weekday, before you start, it hands you a short, prioritized rundown: who you're meeting, what matters about each one, and what to do about it.

It does more than read your calendar. It pulls today's meetings, checks your CRM for the history on each account, gathers a little fresh context on the people and companies you're seeing, and compiles all of it onto one page. You walk in prepared without doing the digging yourself.

And it's cheap to run, because none of that needs a frontier model. Reading a calendar, pulling records, and summarizing are routine jobs, so the brief runs on a cheap model, on a schedule, scoped to exactly what you want to see. That's a strong result at the lowest sufficient cost.

You set it up once. Be specific about what you want each morning, and build it inside the workspace you already have so it knows your accounts and your style. From then on it runs on its own.
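Most platforms let you set this up by filling in a form, so you won't need code. Still, writing the brief down as a spec shows how little there is to decide. Everything in the sketch below is hypothetical: the field names, the 07:30 run time, the persona line, and the sample meeting are placeholders, and it doesn't connect to any real calendar or CRM.

```python
from dataclasses import dataclass


@dataclass
class MorningBrief:
    """What the brief should do each weekday. All defaults are assumptions."""
    run_days: tuple = ("Mon", "Tue", "Wed", "Thu", "Fri")
    run_time: str = "07:30"      # the text only says "before you start"
    model_tier: str = "cheap"    # no frontier model needed for this job
    sources: tuple = ("calendar", "CRM", "web")
    max_points_per_meeting: int = 3


def brief_prompt(spec, meetings):
    """Turn the spec and today's meetings into one specific assignment."""
    lines = [
        "You are my chief of staff. Prepare a short, prioritized morning brief.",
        f"Sources to use: {', '.join(spec.sources)}.",
        f"For each meeting give at most {spec.max_points_per_meeting} points: "
        "who I'm meeting, what matters about them, and what to do about it.",
        "Today's meetings:",
    ]
    for meeting in meetings:
        lines.append(f"- {meeting['time']} {meeting['who']} ({meeting['account']})")
    return "\n".join(lines)


# Made-up sample data, standing in for what a calendar would supply.
spec = MorningBrief()
today = [{"time": "09:00", "who": "Dana Lee", "account": "Acme"}]
print(brief_prompt(spec, today))
```

Notice that the spec encodes the habits from earlier sections: a cheap model, a fixed format, and only the sources the job needs.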

Because the brief is personal, you set it up on your own account. If your whole team wants one, though, it's worth building centrally on the cheapest tokens, which is a question for the next issue.

The habits that keep the bill flat

Automations like the brief run themselves, but most of your day is still hands-on, and a few habits in the chat window keep your spend from creeping up. One matters more than the rest, so start there.

When an answer misses, fix your original prompt and run it again instead of stacking "no, I meant this" on top. Remember the model rereads the whole thread every turn, so a pile of corrections makes it reprocess the misfire and all your fixes on every pass. Editing keeps the thread short and clean, and the answer is usually better because the bad attempt isn't sitting there steering the next one.
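Here is a back-of-the-envelope comparison of the two approaches, counting only the tokens the model has to read. It assumes the whole thread is reread on every turn, that every reply is the same length, and that an edited prompt stays about the same size. The function names and all the numbers are invented for illustration.

```python
def input_tokens_when_stacking(prompt, reply, correction, retries):
    """Tokens read when each correction is piled on top of the thread."""
    total = prompt              # the first attempt reads only the prompt
    thread = prompt + reply     # the thread after the first answer
    for _ in range(retries):
        thread += correction    # "no, I meant this"
        total += thread         # the whole thread is reread
        thread += reply         # and the new answer joins the thread
    return total


def input_tokens_when_editing(prompt, retries):
    """Tokens read when you edit the original prompt and rerun it."""
    return prompt * (1 + retries)


# A 300-token prompt, 600-token replies, 50-token corrections, 3 retries.
print(input_tokens_when_stacking(300, 600, 50, 3))  # 5100
print(input_tokens_when_editing(300, 3))            # 1200
```

With these numbers, stacking three corrections makes the model read more than four times as much as editing and rerunning, and the gap widens with every extra round.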

The rest are quick:

Read the reasoning when something goes wrong. The trace shows where it took a wrong turn, so you can fix that instead of guessing at a whole new prompt.
In a long session, summarize what you've decided and start a fresh chat. You keep the conclusions and drop the bloated history the model would otherwise reread.
Ask for less. "Give me ten bullets" costs a fraction of "tell me everything," and it's usually what you wanted anyway.
Say what format you want. A clear shape stops the tool from over-building an elaborate artifact you never asked for.

What this adds up to

Put it together and the skill is simple to name. You don't need the biggest budget or the smartest model. You need to spend your thinking deliberately: the right amount, from the right model, on the work that matters, with prompts clear enough to get it right the first time.

Do that and the wall stops being a wall. You get your work done inside your limits, and the meter stops running your day.

Managing your own spend is one job. Rolling this out across a team, deciding what to build, what to fund, and how to keep a hundred people's usage from blowing the budget, is a different one. That's the next issue.