Generative AI exists because of the transformer
Over the past few years, we have taken a gigantic leap forward in our decades-long quest to build intelligent machines: the advent of the large language model, or LLM.
This technology, based on research that tries to model the human brain, has led to a new field known as generative AI — software that can create plausible and sophisticated text, images and computer code at a level that mimics human ability.
Businesses around the world have begun to experiment with the new technology in the belief it could transform media, finance, law and professional services, as well as public services such as education. The LLM is underpinned by a scientific development known as the transformer model, made by Google researchers in 2017.
“While we’ve always understood the breakthrough nature of our transformer work, several years later, we’re energised by its enduring potential across new fields, from healthcare to robotics and security, enhancing human creativity, and more,” says Slav Petrov, a senior researcher at Google, who works on building AI models, including LLMs.
LLMs’ touted benefits — the ability to increase productivity by writing and analysing text — are also why the technology poses a threat to humans. According to Goldman Sachs, it could expose the equivalent of 300mn full-time workers across big economies to automation, leading to widespread unemployment.
As the technology is rapidly woven into our lives, understanding how LLMs generate text means understanding why these models are such versatile cognitive engines — and what else they can help create.
To write text, LLMs must first translate words into a language they understand. Take our example sentence: “We go to work by train.”
First a block of words is broken into tokens — basic units that can be encoded. Tokens often represent fractions of words, but we’ll turn each full word into a token.
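To make the idea concrete, here is a minimal sketch of that word-level tokenisation in Python. Real LLM tokenisers split text into sub-word fractions using learned rules (such as byte-pair encoding); the whole-word approach below simply mirrors the simplification used in this article.

```python
# Word-level tokenisation sketch: each full word becomes one token,
# and each token is encoded as an integer ID from a toy vocabulary.
sentence = "We go to work by train"

tokens = sentence.lower().split()
vocab = {token: idx for idx, token in enumerate(dict.fromkeys(tokens))}

encoded = [vocab[token] for token in tokens]
print(tokens)   # ['we', 'go', 'to', 'work', 'by', 'train']
print(encoded)  # [0, 1, 2, 3, 4, 5]
```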
In order to grasp a word’s meaning, work in our example, LLMs first observe it in context using enormous sets of training data, taking note of nearby words. These datasets are based on collating text published on the internet, with new LLMs trained using billions of words.
Eventually, we end up with a huge set of the words found alongside work in the training data, as well as those that weren’t found near it.
As the model processes this set of words, it produces a vector, or list of values, and adjusts it based on each word’s proximity to work in the training data. This vector is known as a word embedding.
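As a rough illustration of how proximity can be turned into numbers, the sketch below builds a simple co-occurrence vector for work from a three-sentence toy corpus. This is not how GloVe or word2vec embeddings are actually trained, just a hint at the underlying intuition; the corpus and window size are invented for the example.

```python
# Count how often each word appears within a small window of "work";
# the resulting count vector is a crude stand-in for a learned embedding.
corpus = [
    "we go to work by train",
    "she walks to work each morning",
    "the train was late so work began late",
]
target, window = "work", 2  # hypothetical context window

counts: dict[str, int] = {}
for line in corpus:
    words = line.split()
    for i, word in enumerate(words):
        if word != target:
            continue
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for neighbour in words[lo:i] + words[i + 1:hi]:
            counts[neighbour] = counts.get(neighbour, 0) + 1

print(counts)  # e.g. {'go': 1, 'to': 2, 'by': 1, 'train': 1, ...}
```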
A word embedding can have hundreds of values, each representing a different aspect of a word’s meaning. Just as you might describe a house by its characteristics — type, location, bedrooms, bathrooms, storeys — the values in an embedding quantify a word’s linguistic features.

The way these characteristics are derived means we don’t know exactly what each value represents, but words we expect to be used in comparable ways often have similar-looking embeddings.

A pair of words like sea and ocean, for example, may not be used in identical contexts (‘all at ocean’ isn’t a direct substitute for ‘all at sea’), but their meanings are close to each other, and embeddings allow us to quantify that closeness.
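One common way to quantify that closeness is cosine similarity between embeddings. The 4-value vectors below are invented purely for illustration (real embeddings, such as the 50D GloVe vectors used for this piece, are much longer), but the comparison works the same way.

```python
import numpy as np

# Hypothetical embeddings: "sea" and "ocean" point in similar directions,
# so their cosine similarity is close to 1; "train" points elsewhere.
embeddings = {
    "sea":   np.array([0.81, 0.12, -0.40, 0.55]),
    "ocean": np.array([0.79, 0.10, -0.35, 0.60]),
    "train": np.array([-0.20, 0.90, 0.30, -0.10]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["sea"], embeddings["ocean"]))  # high
print(cosine_similarity(embeddings["sea"], embeddings["train"]))  # low
```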
By reducing the hundreds of values each embedding represents to just two, we can see the distances between these words more clearly.
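A sketch of that reduction step, assuming we already have high-dimensional embeddings to hand. The article's visualisation used UMAP; PCA from scikit-learn is a simpler stand-in here, and the random vectors are placeholders for real embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Collapse 50-value "embeddings" down to 2 values so they can be plotted.
rng = np.random.default_rng(0)
words = ["he", "she", "they", "car", "train", "bus"]
high_dim = rng.normal(size=(len(words), 50))  # placeholder embeddings

coords = PCA(n_components=2).fit_transform(high_dim)
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```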

We might spot clusters of pronouns, or modes of transportation, and being able to quantify words in this way is the first step in a model generating text.
But this alone is not what makes LLMs so clever. What unlocked their abilities to parse and write as fluently as they do today is a tool called the transformer, which radically sped up and augmented how computers understood language.

Transformers process an entire sequence at once — be that a sentence, paragraph or an entire article — analysing all its parts and not just individual words. This allows the software to capture context and patterns better, and to translate — or generate — text more accurately. This simultaneous processing also makes LLMs much faster to train, in turn improving their efficiency and ability to scale.

Research outlining the transformer model was first published by a group of eight AI researchers at Google in June 2017. Their 11-page research paper marked the start of the generative AI era.

A key concept of the transformer architecture is self-attention. This is what allows LLMs to understand relationships between words.
Self-attention looks at each token in a body of text and decides which others are most important to understanding its meaning.
Before transformers, the state-of-the-art AI translation methods were recurrent neural networks (RNNs), which scanned each word in a sentence and processed it sequentially.

With self-attention, the transformer computes all the words in a sentence at the same time. Capturing this context gives LLMs far more sophisticated capabilities to parse language.
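The computation at the heart of this is scaled dot-product attention. Below is a minimal NumPy sketch with tiny dimensions and random matrices standing in for learned weights; a real transformer uses many attention heads and far larger matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model = 5, 8  # 5 tokens, 8-value embeddings (toy sizes)
x = rng.normal(size=(seq_len, d_model))  # one embedding per token

# Learned projections produce queries, keys and values for every token.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every token scores every other token at once: no sequential scanning.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
output = weights @ V  # context-aware representation of each token

print(weights.round(2))  # each row sums to 1: one token's attention over all tokens
```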

In this example, assessing the whole sentence at once means the transformer is able to understand that interest is being used as a noun to explain an individual’s take on politics.
If we tweak the sentence . . .

. . . the model understands interest is now being used in a financial sense.
And when we combine the sentences, the model is still able to recognise the correct meaning of each word thanks to the attention it gives the accompanying text. For the first use of interest, it is no and in that are most attended to.
For the second, it is rate and bank.
This functionality is crucial for advanced text generation. Without it, words that can be interchangeable in some contexts but not others can be used incorrectly.

Effectively, self-attention means that if a summary of this sentence was produced, you wouldn’t have enthusiasm used when you were writing about interest rates.

This capability goes beyond words, like interest, that have multiple meanings. In the following sentence, self-attention is able to calculate that it is most likely to be referring to dog.
And if we alter the sentence, swapping hungry for delicious, the model is able to recalculate, with it now most likely to refer to bone.
The benefits of self-attention for language processing increase the more you scale things up. It allows LLMs to take context from beyond sentence boundaries, giving the model a greater understanding of how and when a word is used.
One of the world’s largest and most advanced LLMs is GPT-4, OpenAI’s latest artificial intelligence model which the company says exhibits “human-level performance” on several academic and professional benchmarks such as the US bar exam, advanced placement tests and the SAT school exams.
GPT-4 can generate and ingest large volumes of text: users can feed in up to 25,000 English words, which means it could handle detailed financial documentation, literary works or technical manuals.
The product has reshaped the tech industry, with the world’s biggest technology companies — including Google, Meta and Microsoft, who have backed OpenAI — racing to dominate the space, alongside smaller start-ups.
The LLMs they have released include Google’s PaLM model, which powers its chatbot Bard, Anthropic’s Claude model, Meta’s LLaMA and Cohere’s Command, among others.
While these models are already being adopted by an array of businesses, some of the companies behind them are facing legal battles around their use of copyrighted text, images and audio scraped from the web.
The reason for this is that current LLMs are trained on most of the English-language internet — a volume of information that makes them far more powerful than previous generations.
From this enormous corpus of words and images, the models learn how to recognise patterns and eventually predict the next best word.
After tokenising and encoding a prompt, we’re left with a block of data representing our input as the machine understands it, including meanings, positions and relationships between words.

At its simplest, the model’s aim is now to predict the next word in a sequence and do this repeatedly until the output is complete.

To do this, the model gives a probability score to each token, which represents the likelihood of it being the next word in the sequence.
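In practice the model produces a raw score (a logit) for every token in its vocabulary, and a softmax turns those scores into the probabilities described here. The tokens and numbers below are invented for illustration.

```python
import numpy as np

tokens = ["work", "school", "town", "banana"]
logits = np.array([3.1, 2.4, 1.8, -2.0])  # hypothetical raw model scores

probs = np.exp(logits) / np.exp(logits).sum()  # softmax: positive, sums to 1
for token, p in zip(tokens, probs):
    print(f"{token}: {p:.2f}")  # highest logit -> highest probability
```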
And it continues to do this until it is happy with the text it has produced.

But this method of predicting the following word in isolation — known as “greedy search” — can introduce problems. Sometimes, while each individual token might be the next best fit, the full phrase can be less relevant. Not necessarily always wrong, but perhaps not what you’d expect either.
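A toy version of greedy search, using a hand-written table of next-token probabilities in place of a real model. Each step picks the locally best token, which is exactly what can make the final phrase less relevant overall.

```python
# Hypothetical next-token probabilities keyed by the sequence so far.
probs = {
    (): {"we": 0.6, "dogs": 0.4},
    ("we",): {"went": 0.55, "go": 0.45},
    ("we", "went"): {"away": 0.6, "home": 0.4},
}

sequence = ()
while sequence in probs:
    options = probs[sequence]
    sequence += (max(options, key=options.get),)  # greedy: best single token

print(" ".join(sequence))  # 'we went away'
```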

Transformers use a number of approaches to address this problem and enhance the quality of their output. One example is called beam search. Rather than focusing only on the next word in a sequence, it looks at the probability of a larger set of tokens as a whole.

With beam search, the model is able to consider multiple routes and find the best option.
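Here is a minimal beam search over the same kind of hand-written probability table, keeping the two most probable partial sequences alive at each step and comparing whole phrases by their summed log-probabilities. A real decoder does this over a vocabulary of tens of thousands of tokens.

```python
from math import log

# Hypothetical next-token probabilities keyed by the sequence so far.
probs = {
    (): {"we": 0.6, "dogs": 0.4},
    ("we",): {"go": 0.5, "went": 0.5},
    ("dogs",): {"bark": 0.9, "go": 0.1},
    ("we", "go"): {"home": 0.7, "to": 0.3},
    ("we", "went"): {"home": 0.4, "away": 0.6},
    ("dogs", "bark"): {"loudly": 1.0},
    ("dogs", "go"): {"home": 1.0},
}

def beam_search(beam_width: int = 2, steps: int = 3):
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = [
            (seq + (token,), score + log(p))
            for seq, score in beams
            for token, p in probs.get(seq, {}).items()
        ]
        # Keep only the most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search():
    print(" ".join(seq), f"(log-prob {score:.2f})")
```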

This produces better results, ultimately leading to more coherent, human-like text.

But things don’t always go to plan. While the text may seem plausible and coherent, it isn’t always factually correct. LLMs are not search engines looking up facts; they are pattern-spotting engines that guess the next best option in a sequence.
Because of this inherent predictive nature, LLMs can also fabricate information in a process that researchers call “hallucination”. They can generate made-up numbers, names, dates, quotes — even web links or entire articles.
Users of LLMs have shared examples of links to non-existent news articles on the FT and Bloomberg, made-up references to research papers, the wrong authors for published books and biographies riddled with factual mistakes.
In one high-profile incident in New York, a lawyer used ChatGPT to create a brief for a case. When the defence interrogated the report, they discovered it was littered with made-up judicial opinions and legal citations. “I did not comprehend that ChatGPT could fabricate cases,” the lawyer later told a judge during his own court hearing.
Although researchers say hallucinations will never be completely erased, Google, OpenAI and others are working on limiting them through a process known as “grounding”. This involves cross-checking an LLM’s outputs against web search results and providing citations to users so they can verify.
Humans are also used to provide feedback and fill gaps in information — a process known as reinforcement learning from human feedback (RLHF) — which further improves the quality of the output. But it is still a big research challenge to understand which queries might trigger these hallucinations, as well as how they can be predicted and reduced.
Despite these limitations, the transformer has resulted in a host of cutting-edge AI applications. Apart from powering chatbots such as Bard and ChatGPT, it drives autocomplete on our mobile keyboards and speech recognition in our smart speakers.
Its real power, however, lies beyond language. Its inventors discovered that transformer models could recognise and predict any repeating motifs or patterns. From pixels in an image, using tools such as Dall-E, Midjourney and Stable Diffusion, to computer code using generators like GitHub CoPilot. It could even predict notes in music and DNA in proteins to help design drug molecules.
For decades, researchers built specialised models to summarise, translate, search and retrieve. The transformer unified all those actions into a single structure capable of performing a huge variety of tasks.
“Take this simple model that predicts the next word and it . . . can do anything,” says Aidan Gomez, chief executive of AI start-up Cohere, and a co-author of the transformer paper.
Now they have one type of model that is “trained on the entire internet and what falls out the other side does all of that and better than anything that came before”, he says.
“That is the magical part of the story.”
Madhumita Murgia is the FT’s artificial intelligence editor.
Visual storytelling team: Dan Clark, Sam Learner, Irene de la Torre Arenas, Sam Joiner, Eade Hemingway and Oliver Hawkins.
With thanks to Slav Petrov, Jakob Uszkoreit, Aidan Gomez and Ashish Vaswani.
To generate the 50D word embeddings we used the GloVe 6B 50D pre-trained model and converted to Word2Vec format. To generate the 2D representation of word embeddings we used the BERT large language model and reduced dimensionality using UMAP. The self-attention values and the probability scores in the beam search section are conceptual.
We used the free version of ChatGPT-3.5 to generate some of the example sentences used in the visual part of the word embedding and self attention section.
- Author: dongmu
- Link: https://tangly1024.com/article/3afc2251-3b7b-475f-a398-ff76afcbb293
- Notice: this article is published under the CC BY-NC-SA 4.0 licence; please credit the source when republishing.