Let’s go over the architecture of the transformer.

In this diagram, the left side is the encoder and the right side is the decoder. Notice the $\text{N}\times$ beside both parts. Right below the diagram, the paper states that the encoder and the decoder each consist of $N = 6$ identical layers. So on the encoder side the output of one layer becomes the input to the next, six layers in a row, and the same happens on the decoder side.
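
The stacking described above can be sketched in a few lines. This is only an illustration of the data flow, not the real layer: `make_layer` and `run_stack` are hypothetical names, the layer body is a placeholder linear map with a residual connection, and the small $3 \times 8$ input is chosen arbitrarily.

```python
import numpy as np

N = 6  # number of identical layers, as stated in the paper


def make_layer(seed, width=8):
    """Hypothetical stand-in for one layer: any map that preserves the input shape."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.02, size=(width, width))

    def layer(x):
        # Placeholder computation (residual + linear map), NOT real attention.
        return x + x @ w

    return layer


def run_stack(x, layers):
    """Feed the output of each layer into the next one."""
    for layer in layers:
        x = layer(x)
    return x


# "Identical" refers to structure: each layer here has its own weights.
layers = [make_layer(seed) for seed in range(N)]
x = np.ones((3, 8))  # arbitrary toy input: 3 positions, 8 features
y = run_stack(x, layers)
print(y.shape)  # (3, 8)
```

Because every layer returns an array of the same shape it received, the layers can be chained any number of times.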

To make the discussion concrete, we are going to use a translation from English to French.

$\text{English: “I ate an apple.”}$

$\text{French: “J'ai mangé une pomme.”}$

Encoder Process

The Input Embeddings are the tokens of the English text encoded as vectors. In this paper, the embeddings are 512-dimensional. If the sentence is broken down into the six tokens shown below, the input is a $6 \times 512$ array.

$\text{ [ I | \_ate | \_an | \_apple | . | <EOS> ]}$
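
To make the $6 \times 512$ shape concrete, here is a minimal sketch of an embedding lookup. It assumes a toy vocabulary built only from the six example tokens and a randomly initialized table standing in for learned weights; `vocab` and `embedding_table` are illustrative names, not anything from the paper.

```python
import numpy as np

d_model = 512  # embedding size used in the paper
tokens = ["I", "_ate", "_an", "_apple", ".", "<EOS>"]

# Hypothetical toy vocabulary: one id per token of the example sentence.
vocab = {tok: idx for idx, tok in enumerate(tokens)}

# Random stand-in for the learned embedding matrix (one row per vocabulary entry).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# Look up one 512-dimensional row per token.
token_ids = np.array([vocab[t] for t in tokens])
input_embeddings = embedding_table[token_ids]
print(input_embeddings.shape)  # (6, 512)
```

Each row of `input_embeddings` corresponds to one token, in sentence order.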

The encoder takes these initial embeddings and adds the Positional Encoding, which carries information about each token's position. The gray box is the main processing component. It represents one layer, and the $\text{N}\times$ next to it means that layer is stacked 6 times: the output of the first layer becomes the input to the second, and so on. Throughout this process, the encoder maintains the same shape: a $6 \times 512$ array.
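
As a rough sketch of the positional step, the code below assumes the sinusoidal form of positional encoding and adds it element-wise to the embeddings. The embeddings are random stand-ins, and `sinusoidal_positional_encoding` is an illustrative helper name.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe


seq_len, d_model = 6, 512
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))  # random stand-in for the input embeddings

pe = sinusoidal_positional_encoding(seq_len, d_model)
encoder_input = embeddings + pe  # element-wise sum, so the shape is unchanged
print(encoder_input.shape)  # (6, 512)
```

Since the positional encoding has the same $6 \times 512$ shape as the embeddings, adding it leaves the shape intact, which is why the array can pass through every layer unchanged in size.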