GPT Musings I: Technical Question Marks

Is GPT-4o’s tokenization off? The bigger question may lie in insufficient model training.

This is an unstructured, indirectly tech-oriented ramble.

Background

What are ‘tokens’ in an LLM?

Machines can't process text directly. They need preprocessing steps such as tokenization: splitting text into tokens, each of which is then mapped to a vector representation. Tokenization shows up in many fields, but in language models it means breaking text down into these fundamental units.
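As a rough illustration of that pipeline, here is a toy sketch; the vocabulary, IDs, and vectors are all made up for demonstration and bear no relation to any real model:

```python
# Toy example of text -> tokens -> IDs -> vectors (all values invented).
vocab = {"I": 0, "love": 1, "machine": 2, "learning": 3}
embeddings = {
    0: [0.1, 0.3],  # each ID maps to a small vector the model can compute with
    1: [0.7, 0.2],
    2: [0.4, 0.9],
    3: [0.5, 0.5],
}

text = "I love machine learning"
tokens = text.split()                   # real tokenizers are far more elaborate
ids = [vocab[t] for t in tokens]        # tokens become integer IDs
vectors = [embeddings[i] for i in ids]  # IDs become vector representations

print(tokens)   # ['I', 'love', 'machine', 'learning']
print(ids)      # [0, 1, 2, 3]
print(vectors)
```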

What algorithm is used for LLM tokenization?

A common method is Byte Pair Encoding (BPE), originally developed as a compression algorithm. In NLP the logic is similar: start with individual characters and repeatedly merge the most frequent adjacent pair of symbols into a single new token, continuing until the vocabulary reaches a target size. (For a more precise and clearer explanation, watch a dedicated video tutorial.)
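To make that loop concrete, here is a minimal sketch of learning BPE merges on a tiny toy corpus. It is simplified: it works on characters and stops after a fixed number of merges, whereas production tokenizers operate on bytes, use far larger corpora, and stop at a target vocabulary size.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Repeatedly fuse the most frequent adjacent pair of symbols (toy BPE)."""
    words = [list(w) for w in corpus]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        fused = best[0] + best[1]
        # Replace every occurrence of the best pair with one merged symbol.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(fused)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(learn_bpe_merges(["lower", "lowest", "newer", "newest"], num_merges=5))
# The first merge is ('w', 'e'); frequent pairs become new vocabulary entries.
```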

So should we add more vocabulary?

Leaving aside the question of how to tokenize Mandarin, should we expand the vocabulary? At first glance, expanding the vocabulary might help the model “recognize” more commonly used characters or words.

But indiscriminately increasing the vocabulary isn’t necessarily beneficial.

While expanding the vocabulary may look like simply adding words, if the newly added tokens aren't seen often enough during training, their representations remain poorly learned and the model can behave erratically when it encounters them. Training data dedicated to these new tokens might also destabilize the model, leaving no clear advantage over the original. (This is an oversimplified conclusion, of course.)

Moreover, overly long tokens could affect tokenization granularity. Sometimes, adding very long tokens might cause the model to overlook semantic similarities between tokens. Suppose “Taipei Medical University” and “Kaohsiung Medical University” are each treated as a single token. The model might fail to notice their structural and semantic parallels—unless it’s specifically trained to understand these relationships.
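One rough way to see that intuition in code: the two tokenizations below are hypothetical, chosen purely to illustrate the granularity trade-off, and are not how any real tokenizer splits these names.

```python
# Hypothetical tokenizations, purely to illustrate the granularity trade-off.
coarse = {
    "Taipei Medical University":    ["Taipei Medical University"],     # one long token
    "Kaohsiung Medical University": ["Kaohsiung Medical University"],  # one long token
}
fine = {
    "Taipei Medical University":    ["Taipei", "Medical", "University"],
    "Kaohsiung Medical University": ["Kaohsiung", "Medical", "University"],
}

def shared_tokens(tokenization):
    a, b = tokenization.values()
    return set(a) & set(b)

print(shared_tokens(coarse))  # set(): no surface overlap for the model to exploit
print(shared_tokens(fine))    # {'Medical', 'University'}: the parallel is visible
```

With the coarse vocabulary, any similarity between the two names has to be learned from scratch during training; with the finer one, the shared pieces are visible directly in the token sequence.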

Thus, while expanding the vocabulary might seem great in theory, sometimes not expanding it and instead focusing on more training yields better results. The key is not merely the raw number or length of tokens, but whether the tokens are cleverly defined so that the model can absorb syntactic and structural knowledge.

A Related Tangent

Quality Traditional Mandarin corpora are undoubtedly important. But what would be the best tokenization algorithm for Traditional Mandarin specifically? Or is there something about Taiwanese Traditional Mandarin that common methods like BPE can't fully capture? This might be worth further exploration or development.

Why Suspect That GPT-4o’s New Content Is Undertrained

Back to current events: recent GPT-4o demos have sparked discussion in online communities.

Both Meta’s Llama 3 and OpenAI’s GPT-4o have introduced updates to their tokenizers. If you use Python to call OpenAI’s open-source “tiktoken,” you’ll see just how oddly the GPT-4o tokenizer handles Chinese tokens.
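For instance, here is a small sketch comparing the GPT-4 encoding (cl100k_base) with the GPT-4o encoding (o200k_base) on a Traditional Chinese sample string; the exact token counts and splits you see will depend on your tiktoken version:

```python
import tiktoken  # pip install tiktoken (o200k_base requires a recent version)

text = "台北醫學大學與高雄醫學大學"  # sample Traditional Chinese string

for name in ("cl100k_base", "o200k_base"):  # GPT-4 vs. GPT-4o encodings
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    # Decode token by token; some pieces may be partial UTF-8 byte sequences.
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(name, len(ids), pieces)
```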

Beyond the strange, “flattened” vocabulary (for lack of a better term), tests suggest data cleaning might be incomplete and model training insufficient.

So What’s the Point?

I'm not writing this to mock OpenAI or to express pessimism. Looking at GPT-4o's demos in context, I'm hopeful for a broader range of applications that can empower different communities and unlock new productivity.

From a technologist's or researcher's perspective, there's still much about LLMs worth discussing. These issues merit scrutiny and serve as reminders for ongoing improvement.

From a user's viewpoint, if I were considering using or recommending ChatGPT as a tool within an organization, especially for tasks like bilingual (Chinese-English) translation, I'd still advise sticking with GPT-4 for the time being.

In the next installment (if there is one), I’ll talk about how GenAI might be applied in organizational contexts.