GotGemini

What a language model actually is

Tokens, autoregressive generation, attention, and why the abstraction matters before the API does.

v1 · gemini-3.5-flash / ai-studio v1beta · June 12, 2026

Most people start using a language model exactly as they would a search engine, only to notice within the hour that something is fundamentally different. The answers are incredibly fluent, yet sometimes wrong in ways that are hard to spot. Ask the exact same question twice, and you get two slightly different replies. Let a conversation run long enough, and the model begins to "forget" what was said at the start.

None of this is a malfunction. All of it follows directly from what a language model actually is.

Before touching an API or writing a complex prompt, there is a fundamental question to internalize: what is the model actually doing? You can ship a lot of code without ever forming a clear picture of the underlying mechanics. But every hard bug you hit—every "why did it do that?", every sudden cost spike, every latency cliff—makes perfect sense once that picture is in your head.

This chapter provides that picture. It isn't a dense technical paper. It is the minimum mental model you need so that, in the chapters to come, you understand why a technique works, rather than just memorizing that it does.

The Core Mechanism: Tokens, not Letters or Words

A language model does not read your message as a sentence, nor does it read it as a sequence of words. It reads it as a sequence of tokens: short fragments of text produced by a piece of software called a tokenizer. Tokens can be whole words, parts of words, or single characters.

Here are a few concrete examples:

  • "The Gemini API" is roughly four tokens: The, Gem, ini, API.

  • "unbelievable" is around three tokens: un, believ, able.

  • A single space, a comma, or an emoji can each be a distinct token of its own.

You do not need to memorize the exact splitting rules. You only need to remember that tokens, not words, are the unit the model sees, counts, and is limited by. Context limits, costs, and rate limits are typically measured in tokens. A context window of one million tokens holds roughly 700,000 words of ordinary English. And when a model struggles with a prompt like "how many letters are in 'strawberry'?", the reason is structural: it sees a few tokens standing for chunks of the word, not the individual letters inside them.
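To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary is hypothetical and hand-picked to reproduce the examples above; a real tokenizer learns its fragments from data and will split text differently. The 0.7 words-per-token ratio is only the rough figure quoted in this chapter.

```python
# Hypothetical vocabulary for illustration only. Note the leading spaces:
# many real tokenizers attach the space to the fragment that follows it.
TOY_VOCAB = {"un", "believ", "able", "The", " Gem", "ini", " API"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown characters become one-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate fragment first, then shrink.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Nothing in the vocabulary matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def estimate_words(token_count, words_per_token=0.7):
    """Rough word estimate from a token count (ratio is an approximation)."""
    return round(token_count * words_per_token)
```

With this toy vocabulary, `toy_tokenize("unbelievable")` yields three tokens and `toy_tokenize("The Gemini API")` yields four, while `estimate_words(1_000_000)` gives 700000.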

Autoregressive generation, one token at a time

At its most fundamental level, a language model is a statistical prediction engine. Think of smartphone autocomplete, which suggests the next word based on what you have typed. A language model operates on the same principle at a vastly larger scale. It produces its reply one token at a time using a process called autoregressive generation. At each step it looks at everything written so far (your prompt plus every token it has already generated) and uses that entire history to estimate how probable each possible next token is. It then picks a token, appends it, and repeats the process.

This mechanism creates three highly visible behaviors in any modern language model:

  • Streaming Output: You see words appear one after another because that is the exact order they are being produced.

  • Proportional Cost and Time: Each token requires a separate mathematical decision. A thousand-token answer is a thousand small decisions stacked end to end, so longer answers take proportionally longer and consume more compute than shorter ones.

  • No Revisions: The model cannot revise an earlier sentence once it is writing a later one. Past tokens are fixed inputs for future tokens; nothing in the architecture lets the model reach back and edit what it has already said. If a paragraph starts going off-track, the model's only options are to continue down that same path or contradict itself further on. This is why asking the AI to "start over" with a fresh response from scratch is often more effective than asking it to fix a long answer in place.
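The generation loop and the behaviors listed above can be sketched in a few lines. The lookup table here is a hypothetical stand-in for a real model's predictions, and unlike a real model it only looks at the last token instead of the whole history. What matters is the shape of the loop: predict, append, repeat.

```python
# Hypothetical next-token table standing in for a real model's predictions.
NEXT_TOKEN_PROBS = {
    "<start>": {"The": 0.9, "A": 0.1},
    "The": {"model": 0.7, "cat": 0.3},
    "model": {"predicts": 0.8, "sleeps": 0.2},
    "predicts": {"tokens": 0.6, "<end>": 0.4},
    "tokens": {"<end>": 1.0},
}


def predict_next(history):
    """Return the most probable next token.

    A real model conditions on the entire history; this toy uses only the last token.
    """
    candidates = NEXT_TOKEN_PROBS.get(history[-1], {"<end>": 1.0})
    return max(candidates, key=candidates.get)


def generate(prompt_tokens, max_new_tokens=10):
    """Run the autoregressive loop and count how many decisions it took."""
    history = list(prompt_tokens)
    steps = 0
    for _ in range(max_new_tokens):
        token = predict_next(history)  # one decision per generated token
        steps += 1
        if token == "<end>":
            break
        history.append(token)  # append only: earlier tokens are never edited
    return history, steps
```

Calling `generate(["<start>"])` walks the table one step at a time and stops at the end marker. The step count grows with the length of the output, and nothing in the loop ever rewrites an earlier entry in `history`.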

A related setting, temperature, controls how predictable these token-by-token choices are. At a temperature of zero, the model picks the single most likely next token every time, producing consistent but sometimes flat output. Higher temperatures allow the model to sample from a wider range of probable candidates, which is why asking the same question twice can yield two different—yet both plausible—answers. Most chat products use a moderate default, though many developer tools allow you to adjust it.
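One common way to implement temperature is to divide the model's raw scores by the temperature before converting them to probabilities. The sketch below assumes that approach; the scores are invented for illustration, and real products may combine temperature with other sampling controls.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy selection for 0")
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def pick_token(logits, temperature, rng=random):
    """Pick the top token at temperature 0, otherwise sample from the distribution."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = apply_temperature(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

With made-up scores such as `{"blue": 2.0, "grey": 1.0, "green": 0.5}`, a temperature of 0 always returns "blue". A low temperature concentrates almost all probability on it, and a higher one spreads probability across the alternatives.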

Attention — every token looks at every other token

The mechanism that lets a modern language model write coherently across a long conversation is called self-attention. When the model is deciding the next token, every token in its current input — your instructions, the document you pasted, the conversation up to this point — gets weighed against every other token, and the model decides which ones matter most for the choice at hand.
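As a rough illustration of that weighing step, the sketch below computes scaled dot-product attention for a single token in plain Python. The vectors are invented. Real models use learned projections, many attention heads, and (in chat models) a mask that hides future tokens, none of which is shown here.

```python
import math


def attention_weights(query, keys):
    """Score one query against every key, then normalize the scores with softmax."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Blend the value vectors according to the attention weights."""
    weights = attention_weights(query, keys)
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
```

The weights always sum to one, and keys that point in the same direction as the query receive the largest share. That is the sense in which the model decides which tokens matter most for the choice at hand.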

The practical consequence for a user is reassuring. You do not have to put your most important instruction immediately next to the part of the input it applies to. You can give a clear instruction at the top of the message, paste a long document underneath it, and trust that the model will connect the two. You do not have to repeat yourself throughout a long prompt. Clear structure — labelled sections, a brief lead-in, a question at the end — is enough.

There is a cost, though, and it is worth knowing about. The amount of work the model does to consider all those connections grows faster than the length of the input. Doubling the size of a prompt does not double the work; it roughly quadruples it. This is why very long requests are slower and more expensive than short ones, and why "just paste everything in" is not always the best strategy even when it fits.
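The arithmetic behind "doubling quadruples" is short enough to write down. This sketch counts only the pairwise comparisons described above. It is a simplification: real serving stacks apply optimizations that change the constants, and other parts of the model scale differently.

```python
def attention_pairs(num_tokens):
    """Comparisons needed when every token is weighed against every token."""
    return num_tokens * num_tokens


def relative_cost(short_len, long_len):
    """How many times more pairwise work the longer input needs."""
    return attention_pairs(long_len) / attention_pairs(short_len)
```

Here `relative_cost(1000, 2000)` is 4.0 and `relative_cost(1000, 10000)` is 100.0: ten times the input means a hundred times the pairwise comparisons.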

Context windows — the working memory

A model's context window is the total amount of text it can hold in view at once. It includes your current message, the conversation so far, any files or images you have attached (which are converted to tokens too), and the space reserved for the answer it is about to produce. Gemini's general-purpose tier ships with 1 million tokens of context across 3.5 Flash, 3.1 Pro, and Flash-Lite.

Two clarifications are important, because both are common sources of confusion.

The context window is not memory across separate conversations. When you start a new chat, the window is empty. Anything the model appeared to "remember" from yesterday's session is, from its point of view, gone — unless you paste it back in, or unless you are using a product feature (such as saved memories or persistent chat history) that puts it back in for you behind the scenes. The model itself does not remember; the product around it can be configured to remind it.

The context window is also not infinite. If a conversation grows past the limit, the earliest parts will start to fall out of view. In some interfaces this happens silently. The first symptom is usually that the model begins ignoring an instruction you gave at the start of the chat, or contradicts a fact you established earlier. When that begins to happen, the right response is not to scold the model but to restate the key points in a fresh message, or to start a new conversation with a tight summary.
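One simple way a chat interface might handle an overflowing conversation is to keep the newest messages that fit and drop the oldest. The sketch below assumes that policy. Both the whitespace-based `count_tokens` helper and the truncation strategy are hypothetical; real products may summarize or compress instead.

```python
def count_tokens(text):
    """Hypothetical stand-in: approximates tokens as whitespace-separated words."""
    return len(text.split())


def fit_to_window(messages, window_tokens, reserved_for_reply):
    """Keep the most recent messages that fit; older ones fall out of view."""
    budget = window_tokens - reserved_for_reply
    kept = []
    used = 0
    # Walk backwards from the newest message so recent context wins.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    dropped = len(messages) - len(kept)
    return kept, dropped
```

Notice which message disappears first under this policy: the one at the very start, which is often where the standing instructions live. That matches the symptom described above.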

The mental image to carry: the context window is the model's working memory for this single request. Persistent memory across requests is a feature of the product around the model (saved memories, conversation history, uploaded files in a workspace), not of the model itself.
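The split between a stateless model and a product that reminds it can be sketched as follows. Both `stateless_model` and `ChatProduct` are hypothetical stand-ins written for this illustration, not a real API; the only point is that "memory" is text the product puts back into the prompt.

```python
def stateless_model(prompt):
    """Hypothetical stand-in for a model call: it can only use what is in this prompt."""
    if "name is Ada" in prompt:
        return "Your name is Ada."
    return "I don't know your name."


class ChatProduct:
    """Hypothetical product layer that stores memories and re-sends them with each request."""

    def __init__(self):
        self.saved_memories = []

    def remember(self, fact):
        self.saved_memories.append(fact)

    def ask(self, question):
        # The "memory" is just text placed back into the prompt.
        prompt = "\n".join(self.saved_memories + [question])
        return stateless_model(prompt)
```

Calling `stateless_model("What is my name?")` directly gets no useful answer. After `remember("The user's name is Ada.")`, the same question sent through `ChatProduct.ask` succeeds, because the product re-inserted the fact.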

Where the abstraction ends and the product begins

The last idea in this chapter is the one that will save you the most confusion in the chapters to come.

The model is a function. Tokens in, tokens out, with attention and temperature shaping the trajectory in between. Everything else — function calling, tools, retrieval, agent loops — is code you (or Google) write around the model to make it useful at production scale.

This distinction matters because the two layers fail in different ways and require different fixes.

A model problem looks like: the answer is fluent but wrong, the reasoning skips a step, the model misreads a subtle question. How to fix: Provide better context, rephrase your prompt, or switch to a more capable model version.

A system problem looks like: the model "forgets" what you said earlier (the conversation exceeded the window), an uploaded file does not seem to be considered (it was not attached correctly), a web search returns nothing useful (the search tool, not the model, had a bad day), or a response is refused on a topic the model can clearly handle (a safety filter intercepted it). How to fix: Start a new chat with a short summary to clear the context window, re-upload the file, retry or rephrase the search, or rephrase the request and review the safety settings.

Many frustrations blamed on "the AI" are actually system problems in disguise. Learning to tell the difference is half the diagnosis.

The mental shortcut that saves debugging time: the model is not "wrong" in the way a buggy function is wrong. It is completing a distribution under constraints you supplied, constraints you forgot, and constraints it inferred.

When output quality changes, look first at the hidden distribution shift: prompt wording, retrieved context, message order, model snapshot, safety settings, and temperature.
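A practical way to apply this checklist is to record what went into each request and compare a good run with a bad one. The sketch below assumes a hypothetical `RequestSnapshot` record whose fields mirror the list above; the field names and example values are illustrative, not part of any real SDK.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RequestSnapshot:
    """Hypothetical record of everything that shaped one request."""
    prompt: str
    retrieved_context: tuple
    message_order: tuple
    model_snapshot: str
    safety_settings: str
    temperature: float


def changed_fields(before, after):
    """Return the names of the fields that differ between two requests."""
    old, new = asdict(before), asdict(after)
    return [name for name in old if old[name] != new[name]]
```

If the output got worse and `changed_fields` reports only `model_snapshot` and `temperature`, those are the first two things to revert and test one at a time.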

Takeaways

  • Tokens, not words. Count before you call; budget in tokens.
  • One token at a time. Cost and latency scale roughly linearly with output length; output rate limits are real.
  • Self-attention is what makes long context coherent and what makes it expensive.
  • Context is working memory for one request. Persistent memory is a product feature, not a built-in property of the language model itself.
  • Illusion of understanding: The system relies on statistical probability and deep learning, not sentient comprehension or emotion.
  • Model vs. System: Knowing whether you are fighting the core prediction engine or the product interface wrapped around it is the key to effective troubleshooting.

References

  • Token counter API — ai.google.dev/api/tokens/countTokens
  • Context window docs — ai.google.dev/models
  • Attention Is All You Need (Vaswani et al., 2017) — the original transformer paper
  • Chapter 4 covers prompting + context engineering in depth.
