Your voice is an embedding: how Phoenix encodes creator identity

"Voice" is the kind of word that gets used so loosely it starts to feel unfalsifiable. Phoenix, the 2026 X ranker, makes the word concrete. Inside the model, a creator is not a username. The username is one hashed token in a 1M-entry author vocabulary (phoenix/README.md), and even that token only appears alongside the creator's posts in two places: the candidate tower when scoring a new post, and the viewer's history sequence as part of past posts the viewer has seen. The model's representation of "who you are as a creator" is the spatial distribution your posts trace through the embedding space, anchored by your author-hash token. That distribution is what we are pointing at when we say "voice." This article walks the embedding mechanism for non-ML readers, in the specific terms of phoenix/recsys_model.py and phoenix/recsys_retrieval_model.py. Companion to A1 of this series, which establishes the architecture, and A5, which walks the negative-signal pipeline that voice drift feeds.

From username to hashed token

Phoenix's input vocabulary, per the public mini checkpoint, has 1,000,000 entries for users, items, and authors, with two hashes per entity (phoenix/README.md). The two-hash design is a common embedding-table technique: each entity gets two integer hash indices into the embedding matrix and the model learns a useful representation by summing the two rows. The technique is sometimes called "hashing trick" embeddings, and it lets a finite-size table represent a much larger entity space at the cost of occasional collisions.
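The two-hash lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not Phoenix's code: the salted SHA-256 hash, the `hash_indices`, `table_row`, and `embed` helpers, and the pseudo-random stand-in for learned rows are all assumptions made for the example. Only the 1M vocabulary size, the 128-d width, the two hashes per entity, and the summing of the two rows come from the text above.

```python
import hashlib
import random

VOCAB_SIZE = 1_000_000  # hash vocabulary size, per phoenix/README.md (mini checkpoint)
EMBED_DIM = 128         # embedding width, per phoenix/README.md (mini checkpoint)
NUM_HASHES = 2          # two hashes per entity


def hash_indices(entity_id: str) -> list[int]:
    """Map an entity to NUM_HASHES row indices.

    The salted SHA-256 scheme is hypothetical; the real hash function
    is not described in the article.
    """
    indices = []
    for salt in range(NUM_HASHES):
        digest = hashlib.sha256(f"{salt}:{entity_id}".encode("utf-8")).digest()
        indices.append(int.from_bytes(digest[:8], "big") % VOCAB_SIZE)
    return indices


def table_row(index: int) -> list[float]:
    """Stand-in for one learned embedding row (deterministic pseudo-random values)."""
    rng = random.Random(index)
    return [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]


def embed(entity_id: str) -> list[float]:
    """Sum the rows selected by the entity's hash indices."""
    rows = [table_row(i) for i in hash_indices(entity_id)]
    return [sum(values) for values in zip(*rows)]


alice = embed("author:alice")
bob = embed("author:bob")
```

Two different accounts can still land on the same row for one of their hashes; using a second hash makes a collision on both rows at once much less likely.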

Production Phoenix is documented as larger than the mini checkpoint, but no specific production numbers are disclosed. The mini's parameters (128-d embeddings, 4 transformer layers, 4 attention heads, a history sequence length of 127, a candidate sequence length of 64, and a 1M hash vocabulary) describe the released artefact. The root README and the Phoenix README diverge on some of these numbers; we take the values in phoenix/README.md as authoritative for the mini checkpoint, since that is the file describing the artefact actually shipped via Git LFS.

The practical implication of the hashing scheme: your account does not have a unique slot in the model's memory. It has two hash indices that share their slots with other accounts. What identifies you specifically is not the slot itself; it is the way the two rows, plus the context your posts arrive in, combine across many viewer history sequences. That combination is what the model can lean on when scoring your next post.

The post-plus-author projection in the candidate tower

When a candidate post is being considered for a viewer, Phoenix builds a candidate representation by combining the post tokens with the author hash and position information. The construction lives in phoenix/recsys_retrieval_model.py for the retrieval tower (which produces the dense vector used for nearest-neighbour search against viewer history) and in phoenix/recsys_model.py for the ranking model (which produces the 19 action probabilities). In both cases, the author hash sits adjacent to the post tokens before the transformer layers process the combined sequence.

What this means concretely: your post embedding is not produced from the post alone. It is produced from the post-plus-author combination. Two creators posting word-for-word identical text would get different embeddings, because the author-hash component differs. The model's representation of the post is, by construction, conditional on who wrote it. That conditioning is the smallest unit of "voice" the architecture knows about.
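A toy sketch of that conditioning, assuming mean pooling as a stand-in for the transformer layers. The `toy_embedding` and `candidate_vector` helpers and the 8-d width are hypothetical; the point is only that adding an author token to the input changes the output for identical text.

```python
import random

EMBED_DIM = 8  # toy width for readability; the mini checkpoint uses 128


def toy_embedding(token: str) -> list[float]:
    """Deterministic pseudo-random vector standing in for a learned embedding."""
    rng = random.Random(token)
    return [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]


def candidate_vector(post_tokens: list[str], author_id: str) -> list[float]:
    """Pool post-token embeddings together with the author embedding.

    Mean pooling is an assumption; Phoenix runs the combined sequence
    through transformer layers instead.
    """
    vectors = [toy_embedding(f"tok:{t}") for t in post_tokens]
    vectors.append(toy_embedding(f"author:{author_id}"))
    return [sum(column) / len(vectors) for column in zip(*vectors)]


text = ["ship", "small", "ship", "often"]
by_alice = candidate_vector(text, "alice")
by_bob = candidate_vector(text, "bob")
```

The two vectors share every post-token contribution and differ only in the author term, which is enough to separate them.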

The model learns useful author embeddings during training by observing which posts each author actually produces and which viewer histories those posts get engaged from. Authors whose posts cluster tightly in the embedding space (because their posts share consistent stylistic and semantic structure) get distinctive author embeddings. Authors whose posts spread broadly across the embedding space (because their content varies widely) get more generic author embeddings, since the model has no consistent pattern to anchor on.

Figure: a simplified two-tower view. The candidate tower combines post tokens with the author hash; the user tower encodes the viewer's history sequence. The dot product decides retrieval.

The candidate tower and user tower share a learned embedding space. Posts that the user historically engaged with sit close to the user's history-embedding centroid. Candidates with high dot-product similarity to the user representation get retrieved as recommended posts. The design belongs to the same two-tower family that large recommenders for music, film, and images (Spotify, Netflix, Pinterest) are commonly described as using. What is specific to X is that the candidate tower's input includes the author-hash signal, which makes the embedding conditional on authorship in a way that simpler text-only retrieval is not.
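The retrieval step can be sketched with hand-made 2-d vectors. This is an illustration under stated assumptions: the user vector here is a plain average of history embeddings (Phoenix uses a transformer over the history sequence), and the `user_vector` and `retrieve` helpers and the post ids are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def user_vector(history):
    """Average the embeddings of posts the viewer engaged with (assumed pooling)."""
    return [sum(column) / len(history) for column in zip(*history)]


def retrieve(history, candidates, top_k):
    """Return the top_k candidate ids ranked by dot product with the user vector."""
    user = user_vector(history)
    ranked = sorted(candidates, key=lambda cid: dot(user, candidates[cid]), reverse=True)
    return ranked[:top_k]


history = [[1.0, 0.0], [0.8, 0.2]]
candidates = {
    "post_a": [1.0, 0.0],  # points the same way as the history
    "post_b": [0.0, 1.0],  # points away from it
    "post_c": [0.5, 0.5],  # in between
}
top_two = retrieve(history, candidates, top_k=2)
```

With this toy data the candidate aligned with the history ranks first and the orthogonal one is dropped.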

The history sequence: where your voice actually lives

The user tower's input is a sequence of past posts the viewer encountered, each tagged with the action the viewer took on it. In the mini checkpoint, that sequence is 127 tokens long; production is longer. Each token entry effectively says "the viewer saw this post by this author and did this thing." Over many such entries, the model builds an embedding that summarises the viewer's engagement history.
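As a data-structure sketch, a history entry can be pictured as a small record of post, author, and action. The `HistoryEntry` fields, the action labels, and the keep-the-most-recent truncation rule are assumptions for illustration; only the length of 127 comes from the mini checkpoint.

```python
from dataclasses import dataclass

HISTORY_LEN = 127  # mini checkpoint history length, per phoenix/README.md


@dataclass(frozen=True)
class HistoryEntry:
    post_id: str
    author_id: str
    action: str  # e.g. "favorite", "reply", "dwell" (labels are illustrative)


def build_history(events: list[HistoryEntry]) -> list[HistoryEntry]:
    """Keep the most recent HISTORY_LEN entries (truncation rule is an assumption)."""
    return events[-HISTORY_LEN:]


# 200 synthetic events, oldest first, spread across five authors.
events = [HistoryEntry(f"post{i}", f"author{i % 5}", "favorite") for i in range(200)]
history = build_history(events)
from_author0 = sum(1 for entry in history if entry.author_id == "author0")
```

Counting how often one author appears in the truncated sequence gives the "volume" axis discussed in the next section.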

If you have posted thirty times in the last six months and a particular follower has engaged with twenty of those posts (favorited some, dwelt on others, replied to a few), your posts appear in their history sequence twenty times, each paired with the action they took. The model can attend to all twenty positions when scoring your next post. The strength of the "I am one of those familiar posts" prior depends on:

The number of times your posts appear in that viewer's history (sparse follower vs heavy reader).
The consistency of the embeddings across your posts (tight cluster vs scattered spread).
The diversity of actions the viewer took (mix of engagement types vs one repeated type).

The first axis is volume. The second is voice consistency. The third is the viewer's engagement style. Voice consistency is the only one of the three that the creator directly controls. A creator with a tight embedding cluster across their post history compounds the per-viewer prior across all viewers who have engaged with them. A creator with a scattered embedding cluster cannot lean on any single viewer's history the same way, because the new candidate's embedding is uncorrelated with the historical pattern.

This is the core claim of "voice as an embedding." The voice-fidelity score, in Phoenix-native terms, is the tightness of the cluster the creator's post embeddings trace through the embedding space, conditional on their author-hash anchor.
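One simple way to put a number on cluster tightness is the mean cosine similarity between each post embedding and the cluster centroid. This metric is an assumption chosen for illustration, not the formula behind any product's voice-fidelity score, and the 2-d vectors are made up.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def tightness(vectors):
    """Mean cosine similarity of each post embedding to the cluster centroid."""
    center = centroid(vectors)
    return sum(cosine(v, center) for v in vectors) / len(vectors)


tight_posts = [[1.0, 0.1], [1.0, -0.1], [0.9, 0.0]]
scattered_posts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]

tight_score = tightness(tight_posts)
scattered_score = tightness(scattered_posts)
```

The tight cluster scores close to 1, while the scattered one scores far lower because its posts point in unrelated directions.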

Candidate isolation: why one good post does not rescue an off-voice one

The Phoenix attention mask is documented as candidate-isolated (phoenix/README.md). In plain language: when Phoenix scores a batch of candidates, each candidate can attend to the viewer's history sequence but not to the other candidates in the batch. The mask zeros out cross-candidate attention.
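A candidate-isolated mask can be sketched as a boolean matrix. The layout (history positions first, then candidates), the choice to let a candidate attend to itself, and the absence of causal masking within the history are assumptions of this sketch; the function name is hypothetical.

```python
def candidate_isolated_mask(history_len: int, num_candidates: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Positions 0..history_len-1 are history; the remaining positions are candidates.
    """
    total = history_len + num_candidates
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < history_len:
                # Every position may attend to the history (no causal masking assumed).
                mask[q][k] = True
            elif q == k:
                # A candidate may attend to itself, but to no other candidate.
                mask[q][k] = True
    return mask


mask = candidate_isolated_mask(history_len=3, num_candidates=2)
```

In the resulting matrix the bottom-right candidate block is diagonal, which is what keeps each candidate's score independent of the others in the batch.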

The architectural consequence: each post you publish is scored on its own merits against the viewer's history pattern. Three of your high-performing posts sitting in the same candidate batch contribute no extra signal when Phoenix scores a fourth. The fourth stands on its own and is judged against the viewer's prior, not against your other candidates in the batch. Earlier posts influence the score only through the viewer's history sequence.

This is the property that turns voice consistency into a per-post multiplier, not a per-streak one. Consistent voice anchors the per-viewer prior across every individual post. Inconsistent voice weakens the prior, and Phoenix falls back on the candidate-only signal, which is much weaker for out-of-network (OON) candidates and only modestly stronger for in-network ones. The cost of an off-voice post is a weaker prior for every subsequent post, viewer by viewer. The benefit of an on-voice post is that the prior stays intact.

Voice drift, in embedding terms

A cluster of post embeddings in the high-dimensional space (128 dimensions in the mini checkpoint) does not visualise cleanly, but the two-dimensional projection below sketches the qualitative shape. The left panel shows a creator with a tight historical voice. The right panel shows one whose posts have drifted widely.

Figure: schematic 2D projection of post embeddings. Left, tight on-voice cluster. Right, drifted scatter. The author-hash anchor is the same in both panels; the post embeddings around it differ.

Two reading rules for the diagram. First, both creators have the same author-hash anchor. The difference is in where their post embeddings sit relative to that anchor. On the left, the model sees a coherent "this is what posts from this author look like." On the right, the author embedding has to compromise across many disparate patterns; the representation it lands on is closer to the embedding-space centroid (a generic point), which means the anchor itself contributes less discriminative signal.

Second, both creators may produce the same total engagement count over the same period. Volume is not the variable in this picture. The variable is the spread of the embeddings around the anchor. Three hundred tightly-clustered posts produce a stronger per-viewer prior than three hundred widely-scattered posts, even if total engagement is identical.

The mechanical consequence for a creator: voice consistency is the property that turns thirty posts into a model-legible identity. The same thirty posts in a scattered cluster do not. The thresholds for "enough corpus" depend on the underlying embedding geometry, which is not in the public source, but the qualitative shape is documented in the architecture choices.

The 100 to 200 piece corpus problem

A common pitch from competitor writing tools is "train on twenty of your tweets and we will write in your voice." The embedding-geometry view explains why twenty is structurally inadequate.

A 128-dimensional embedding space requires enough data points to estimate a meaningful distribution. Twenty posts give you twenty data points; the cluster shape estimated from twenty samples in 128-d is dominated by sampling noise, not by the underlying distribution. Two hundred samples is still small for 128-d statistics, but it is qualitatively different: the cluster's principal-component structure starts to stabilise, and the distance from the cluster centroid to its outer envelope can be estimated within useful bounds.
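One way to see the difference between twenty and two hundred samples is the rank of the centred sample matrix, which bounds how many directions of spread the data can describe: n samples span at most n - 1 directions after centring. The sketch below uses synthetic Gaussian vectors rather than real post embeddings, and `centred_samples` and `matrix_rank` are helper names made up for the example (the latter is plain Gaussian elimination with partial pivoting).

```python
import random

DIM = 128  # embedding width of the mini checkpoint


def centred_samples(n, seed):
    """Draw n synthetic DIM-d vectors and subtract the per-dimension mean."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]
    means = [sum(column) / n for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def matrix_rank(rows, tol=1e-8):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        pivot_row = m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / pivot_row[col]
            for c in range(col, n_cols):
                m[r][c] -= factor * pivot_row[c]
        rank += 1
    return rank


rank_20 = matrix_rank(centred_samples(20, seed=0))
rank_200 = matrix_rank(centred_samples(200, seed=0))
```

With 20 samples the rank cannot exceed 19, so at least 109 of the 128 directions have no estimated spread at all. With 200 samples all 128 directions can be covered, although each is still estimated from relatively few points.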

The corpus VoiceMoat trains Auden on, per current product spec, is 100 to 200 content pieces across posts, replies, threads, and images, spanning 10 signals (tone, vocabulary, hook style, pacing, formatting, quirks, persona, authority, topic surface, register). That count is not chosen to sound impressive in marketing copy. It is the smallest corpus that produces a stable cluster geometry against which new drafts can be scored. Below 100, the cluster is too noisy for the comparison to be reliable. Above 200, returns diminish quickly; the cluster shape estimated from 200 samples is close enough to the asymptote that additional samples do not move the score much.

The 10 signals Auden trains on, mapped to the embedding properties they shape

Source: VoiceMoat product spec + voice-fidelity score components

Signal	What it captures	Embedding property it shapes
Tone	register, formality, warmth, irony	stylistic-cluster orientation
Vocabulary	specific lexical choices, idioms, jargon	lexical-component density
Hook style	opening-line structure across posts	first-token attention patterns
Pacing	sentence and paragraph rhythm	sequence-length distribution
Formatting	line breaks, bullet conventions, capitalisation choices	structural-token patterns
Quirks	recurring tics, micro-patterns	high-confidence anchor tokens
Persona	first-person presentation, stance, register	author-hash interaction
Authority	claim density, hedge frequency, citation patterns	epistemic-language distribution
Topic surface	subject matter the creator actually writes about	topic-cluster centroid
Register	casual to formal range across contexts	spread of stylistic cluster

The table maps an editorial vocabulary (tone, voice, pacing) onto an architectural vocabulary (cluster orientation, density, attention patterns). The mapping is approximate; the architecture does not have named features for each signal. What it has is a learned representation that integrates them. The voice-fidelity score's job is to flag drafts whose embedding sits outside the cluster geometry estimated from the creator's training corpus.
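Flagging a draft that sits outside the estimated cluster can be sketched as a distance check against the corpus centroid. Everything here is a hypothetical stand-in: the Euclidean distance, the mean-plus-two-standard-deviations threshold, the helper names, and the 2-d toy corpus are assumptions, not the actual voice-fidelity score.

```python
import math
import statistics


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_cluster(corpus, num_std=2.0):
    """Return (centroid, threshold), with threshold = mean + num_std * std of corpus distances."""
    center = centroid(corpus)
    distances = [distance(v, center) for v in corpus]
    threshold = statistics.mean(distances) + num_std * statistics.pstdev(distances)
    return center, threshold


def is_off_voice(draft, center, threshold):
    """Flag a draft whose embedding lies farther from the centroid than the threshold."""
    return distance(draft, center) > threshold


corpus = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [1.0, 0.8]]
center, threshold = fit_cluster(corpus)

on_voice_flag = is_off_voice([1.05, 0.95], center, threshold)
off_voice_flag = is_off_voice([3.0, -1.0], center, threshold)
```

A draft near the corpus centroid passes, while one far outside the observed spread is flagged; a real system would need a far larger corpus and a more careful threshold.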

How template-driven writing tools sit in the embedding space

The bulk of competitor AI writing tools fall into one of two categories relative to the embedding geometry above. The first is template-driven: the tool maintains a set of viral-post templates and pattern-fills the creator's topic into the template. Tweet Hunter and similar products are canonical examples. The second is general-LLM-driven: the tool prompts a general writing model with light context (the creator's recent posts, sometimes a tone preset) and ships whatever the model returns. Both categories have a predictable failure mode in the embedding-geometry view.

Template-driven output occupies a tight cluster in the embedding space, but it is the wrong cluster. The cluster centroid is wherever the template family sits, and that centroid is shared across every creator using the same templates. Two creators with very different voices, both running their content through the same engagement-hook template, produce outputs that cluster near each other in the embedding space and far from each individual creator's historical cluster. The follower reading either creator's feed sees output that does not match the historical pattern the follow decision was anchored on. The author-hash anchor is intact; the post embeddings around it have migrated to the template cluster. The voice-fidelity score plummets.

General-LLM-driven output is more spread out in the embedding space (because the general model's pretraining distribution is broad), but the spread is centred on the helpful-assistant pretraining centroid, not on any individual creator. The cluster around any specific creator's author-hash anchor is built from posts the creator wrote, not posts the LLM produced. Output from a general LLM, regardless of prompt, sits closer to the helpful-assistant centroid than to the creator's cluster. The mechanical reason: the general model has never seen enough of the creator's writing to estimate their cluster geometry. Better prompts move the output around inside the helpful-assistant cluster (more terse, more conversational, more technical) but do not relocate it to the creator's cluster.

The third category, voice-trained, is what the embedding-geometry architecture rewards. A model that has seen the creator's 100 to 200 piece corpus and learned a representation conditioned on those samples can produce output that lands inside the creator's historical cluster rather than near the broad-internet centroid. The product category here is small. VoiceMoat occupies it; Brandled covers two platforms at adjacent depth; the others are template-driven or general-LLM-driven under the hood. The full comparison sits in A10.

What this means for tools that do not model voice

The architecture above is the structural reason general-purpose AI writing assistants converge on a single helpful-assistant register. A general model has no access to a creator's specific embedding cluster. It has access to the broad-internet pretraining distribution, which is dominated by helpful-assistant text because that is the largest source of conversational training data on the modern web. Output from such a model lands at the embedding-space centroid of that pretraining distribution. Every general-LLM output, by every creator, sits at roughly the same point in the embedding space the X ranker uses. A follower scrolling past sees not just "this is AI-shaped" but "this is the same AI-shaped, regardless of the byline."

The fix is not better prompts. Better prompts pull the output from the helpful-assistant centroid toward a different generic location (more academic, more casual, more vivid), but they cannot pull it toward a creator-specific cluster the prompt has no information about. The fix is training, not prompting. A model that has seen a creator's 100 to 200 pieces and learned the embedding geometry of their voice can score drafts against that geometry. A model that has not seen the corpus cannot.

What changes for creators in practice

The architecture does not change the editorial advice creators have heard for years: write specifically, write consistently, write things only you would write. What it does is supply the mechanism. "Write consistently" maps onto cluster tightness in the embedding space. "Write specifically" maps onto distance from the broad-internet centroid. "Write things only you would write" maps onto the author-hash anchor having meaningful discriminative signal.

The negative-signal economy walked in A5 operates on exactly this geometry. Voice drift increases predicted-mute probability across the follower base because the off-voice embedding sits outside the cluster the followers' history sequences anchored on. The recency window walked in A9 intersects this too: only the 48-hour Thunder window's posts are in the in-network candidate pool, but the embedding-geometry effect is cumulative across all your historical posts in viewer histories, not limited to the current Thunder window.

The structural conclusion: voice is now an architectural property of the X ranker, not a brand-marketing one. The model has a geometric representation of who you are. Drift, and the representation generalises toward the broad-internet centroid. Hold, and the representation anchors a per-viewer prior that compounds across every post you ship.

The final piece of this series, A10, scores the four major writing tools (VoiceMoat, Tweet Hunter, Typefully, Hypefury) against six 2026-algorithm criteria, one of which is the embedding-geometry argument from this article. None of the other three tools models voice in this sense; the cluster geometry described above is unavailable to any tool that does not train on a per-creator corpus.