Borgeaud, Sebastian, et al. "Improving language models by retrieving from trillions of tokens." International conference on machine learning. PMLR, 2022.

Improving language models by retrieving from trillions of tokens

Abstract

We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on **local similarity** with the preceding tokens. With a 2 trillion token database, our **Retrieval-Enhanced Transformer (Retro)** obtains performance comparable to GPT-3 and Jurassic-1 on the Pile benchmark, despite using 25× fewer parameters. After fine-tuning, Retro performance carries over to knowledge-intensive downstream tasks such as question answering. Retro combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on far more data than is typically consumed during training. We typically train Retro from scratch, but we can also rapidly Retrofit pre-trained Transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.

1. Introduction

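**Language modelling (LM)** is an unsupervised task that consists of modelling the probability of text, usually by factorising it into conditional next-token predictions $p\left(x_{1}, \ldots, x_{n}\right)=\prod_{i} p\left(x_{i} \mid x_{<i}\right)$. Neural networks have proven to be powerful language models, first in the form of recurrent architectures (Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010) and more recently in the form of Transformers (Vaswani et al., 2017), which use attention to contextualise the past. Performance improvements have come from increasing the amount of data, the training compute, or the number of model parameters. Over the last two years, Transformers have been scaled from the 100 million parameter models of seminal work to models with hundreds of billions of parameters (Brown et al., 2020; Radford et al., 2019), which has led to models that perform very well on a wide array of tasks in a zero-shot or few-shot formulation. Increasing model size predictably improves performance on a wide range of downstream tasks (Kaplan et al., 2020). The benefits of increasing the number of parameters come from two factors: additional computation at training and inference time, and increased memorisation of the training data.

In this work, we decouple these factors and explore an efficient means of augmenting language models with a massive-scale memory without significantly increasing computation. Specifically, we propose retrieval from a large text database as a complementary path to scaling language models. Instead of increasing the size of the model and training on more data, we equip models with the ability to directly access a large database when making predictions. This is a semi-parametric approach. At a high level, our Retrieval Transformer (Retro) model splits the input sequence into chunks and retrieves text similar to the previous chunk to improve the predictions in the current chunk. Existing retrieval work for language modelling considers only small Transformers (100 million parameters) and databases of limited size (up to billions of tokens) (Guu et al., 2020; Khandelwal et al., 2020; Lewis et al., 2020; Yogatama et al., 2021). To our knowledge, our work is the first to show the benefits of scaling the retrieval database to trillions of tokens for large parametric language models. Our main contributions are the following.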

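Figure 1 | Scaling of Retro. The performance gain of our retrieval models remains constant with model scale (left) and is comparable to multiplying the parametric model size by about 10×. On the C4 validation set, the gain increases with the size of the retrieval database (middle) and with the number of retrieved neighbours (right), when using up to 40 neighbours. Past this point, performance begins to degrade, perhaps due to reduced neighbour quality. At evaluation time Retro can also be used without retrieval data (Retro [OFF]), which brings only a limited performance degradation compared to baseline Transformers.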

We introduce Retro, a retrieval-enhanced autoregressive language model (§2.2). We use a chunked cross-attention module to incorporate the retrieved text (§2.4), with time complexity linear in the amount of retrieved data. We show that retrieval based on a pre-trained frozen Bert model (§2.3) works at scale, removing the need to train and update a retriever network.
We show that our method scales well with model size and database size (Fig. 1): Retro provides a constant gain for models ranging from 150M to 7B parameters, and Retro can be improved at evaluation time by increasing the database size and the number of retrieved neighbours. Our largest model obtains state-of-the-art results on a range of downstream evaluation datasets including Wikitext103 (Merity et al., 2017) and the Pile (Gao et al., 2020) (§4). We show that Retro can be fine-tuned to achieve competitive performance on downstream tasks such as question answering (§4.3).
We propose an evaluation that accounts for the proximity of test documents to the training set (§2.6), addressing the problem of test set leakage (Lee et al., 2021). This matters for all language models, and especially for retrieval-enhanced models, since they have direct access to the training dataset during evaluation. Using this methodology, we show that the performance of Retro comes from both explicit neighbour copying and general knowledge extraction (§4.4).

2. Method

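We design our retrieval-enhanced architecture to be capable of retrieving from a database with trillions of tokens. To this end, we retrieve at the level of contiguous token chunks instead of individual tokens, which greatly reduces storage and computation requirements.
Our method first constructs a key-value database, where values store raw chunks of text tokens and keys are frozen Bert embeddings (Devlin et al., 2019). We use a frozen model so that we do not have to periodically re-compute embeddings over the entire database during training.
Each training sequence is then split into chunks, which are augmented with their $k$-nearest neighbours retrieved from the database. An encoder-decoder architecture integrates the retrieved chunks into the model's predictions. We summarise the Retro architecture in Figure 2 and detail it in this section. We end the section by introducing a new methodology for evaluating language models when the evaluation set is partially present in the training set.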

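Figure 2 | Retro architecture. Left: simplified version in which a sequence of length $n=12$ is split into $l=3$ chunks of size $m=4$. For each chunk, we retrieve $k=2$ neighbours of $r=5$ tokens each. The retrieval pathway is shown on top. Right: details of the interactions in the Cca operator. Causality is maintained because the neighbours of the first chunk only affect the last token of the first chunk and the tokens of the second chunk.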

2.1. Training dataset

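We use a multilingual version of MassiveText (Rae et al., 2021) for both training and retrieval data. The dataset consists of text documents from multiple sources and multiple languages, totalling over 5 trillion tokens (detailed in Table 1). Sequences are sampled from subsets of the training data, with sampling weights given in the right-most column of Table 1. We tokenise the dataset using SentencePiece (Kudo and Richardson, 2018) with a vocabulary of 128,000 tokens. During training (unless otherwise specified), we retrieve from 600B tokens of the training data. The training retrieval database is made of the same subsets as the training data, in proportion to the training sampling frequency. At evaluation time, the retrieval database consists of the full union of these datasets, with the exception of books, for which we use a 4% sub-sample. The evaluation retrieval database thus contains 1.75T tokens. To limit test set leakage, we compute the 13-gram Jaccard similarity between train and test documents using the MinHash scheme and remove every training document with high similarity (0.8 or higher) to a validation or test set document. Additionally, we remove all validation and test articles of Wikitext103 (Merity et al., 2017) from our Wikipedia training data.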
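The leakage filter described above can be illustrated with a minimal sketch. It computes the exact 13-gram Jaccard similarity rather than the MinHash estimate used in the paper, and it assumes documents are already given as lists of tokens; the function names and the whitespace-style tokens are illustrative, not taken from the paper.

```python
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def ngram_set(tokens: List[str], n: int = 13) -> Set[NGram]:
    """Return the set of all contiguous n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Exact Jaccard similarity |a & b| / |a | b| (0.0 if both sets are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_leaked(train_doc: List[str], eval_doc: List[str],
              n: int = 13, threshold: float = 0.8) -> bool:
    """True if the training document should be dropped (similarity >= threshold)."""
    return jaccard(ngram_set(train_doc, n), ngram_set(eval_doc, n)) >= threshold


# Toy usage: a near-duplicate is flagged, an unrelated document is not.
doc = [str(i) for i in range(100)]
near_duplicate = doc[:-1] + ["x"]
unrelated = [str(i) for i in range(1000, 1100)]
print(is_leaked(near_duplicate, doc), is_leaked(unrelated, doc))
```

At the scale of trillions of tokens an exact pairwise comparison would be impractical, which is presumably why the paper relies on MinHash to approximate the same quantity.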

2.2. Retrieval-enhanced autoregressive token models

Our approach uses **retrieval** as a way to augment input examples at the granularity of small chunks of tokens.
Formally, we consider sequences of integer tokens in $\mathbb{V}=[1, v]$, obtained using a text tokeniser.
We split each $n$-token-long example $X=\left(x_{1}, \ldots, x_{n}\right)$ into a sequence of $l$ chunks $(C_{1}, \ldots, C_{l})$ of size $m=\frac{n}{l}$, i.e. $C_{1} \triangleq\left(x_{1}, \ldots, x_{m}\right), \ldots, C_{l} \triangleq\left(x_{n-m+1}, \ldots, x_{n}\right) \in \mathbb{V}^{m}$.
We use $n=2048$ and $m=64$.
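As a small illustration of the chunking just defined, the sketch below splits a token sequence into $l = n/m$ chunks. The function name is hypothetical, and raising an error when $n$ is not a multiple of $m$ is an assumption made for simplicity.

```python
from typing import List, Sequence


def split_into_chunks(tokens: Sequence[int], m: int) -> List[List[int]]:
    """Split a sequence of n tokens into l = n / m contiguous chunks of size m."""
    n = len(tokens)
    if n % m != 0:
        raise ValueError(f"sequence length {n} is not a multiple of chunk size {m}")
    return [list(tokens[u * m:(u + 1) * m]) for u in range(n // m)]


# With n = 2048 and m = 64 this yields l = 32 chunks;
# C_1 starts at x_1 and C_l ends at x_n.
chunks = split_into_chunks(list(range(1, 2049)), m=64)
print(len(chunks), chunks[0][0], chunks[-1][-1])  # 32 1 2048
```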
Each chunk $C_{u}$ is augmented with a set $\operatorname{Ret}_{\mathcal{D}}\left(C_{u}\right)$ of $k$ neighbours from the database $\mathcal{D}$. $\operatorname{Ret}_{\mathcal{D}}$ (or simply Ret) is a non-trainable operator specified in §2.3.
Token likelihoods are provided by a model, parameterised by $\theta$, that takes as input both the previous tokens and their retrieved neighbours. This defines the following retrieval-enhanced sequence log-likelihood:

$$L(X \mid \theta, \mathcal{D}) \triangleq \sum_{u=1}^{l} \sum_{i=1}^{m} \ell_{\theta}\left(x_{(u-1) m+i} \mid\left(x_{j}\right)_{j<(u-1) m+i},\left(\operatorname{Ret}_{\mathcal{D}}\left(C_{u^{\prime}}\right)\right)_{u^{\prime}<u}\right). \tag{1}$$

We set $\operatorname{Ret}\left(C_{1}\right)=\emptyset$, i.e. the likelihood of tokens from the first chunk does not depend on any retrieval data.
This likelihood definition preserves autoregressivity: the probability of the $i$-th token of the $u$-th chunk, $x_{(u-1) m+i}$, only depends on previously seen tokens $\left(x_{j}\right)_{1 \leqslant j<(u-1) m+i}$ and on the data retrieved from the previous chunks $\left(\operatorname{Ret}\left(C_{u^{\prime}}\right)\right)_{u^{\prime}<u}$.
We can therefore sample directly with log-probability $\ell$, where sampling within the chunk $C_{u}$ is conditioned on the neighbours $\left(\operatorname{Ret}\left(C_{u^{\prime}}\right)\right)_{u^{\prime}<u}$. This makes retrieval-enhanced models directly comparable with the largest language models that are evaluated by sampling.

2.3. Nearest neighbour retrieval

Retrieval neighbours
Our database consists of a key-value memory. Each value consists of **two contiguous chunks of tokens**, which we denote $[N, F]$, where $N$ is the neighbour chunk used to compute the key and $F$ is its continuation in the original document. The corresponding key is the Bert embedding of $N$, averaged over time, which we denote $\operatorname{Bert}(N)$.
For each chunk $C$, we retrieve its **approximate $k$-nearest neighbours** from the key-value database using the $L_2$ distance on Bert embeddings, $d(C, N)=\|\operatorname{Bert}(C)-\operatorname{Bert}(N)\|_{2}^{2}$. The model receives the corresponding values $\operatorname{Ret}(C) \triangleq\left(\left[N^{1}, F^{1}\right], \ldots,\left[N^{k}, F^{k}\right]\right)$.
Both neighbour chunks and their continuations provide meaningful improvements, as shown in our ablation study (Appendix D). We use a length of 64 for both $N^j$ and $F^j$, so $\operatorname{Ret}(C)$ has shape $k \times r$ with $r=128$. To avoid retrieving the chunk $C_{u+1}$ in the retrieval set $\operatorname{Ret}\left(C_{u}\right)$, which would break causality during training, we filter out neighbours originating from the same document as the training sequence $X$.
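The retrieval step above can be sketched with an exact brute-force search. This is only an illustration under loud assumptions: `toy_embed` is a made-up stand-in for the time-averaged frozen Bert embedding, the search is exact rather than approximate, and the database layout and function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
# One database entry: (key, neighbour chunk N, continuation F, source document id).
Entry = Tuple[Vector, List[int], List[int], str]


def toy_embed(chunk: Sequence[int], dim: int = 4) -> Vector:
    """Stand-in for Bert(chunk): per-token vectors averaged over time (NOT real Bert)."""
    per_token = [[((t * (j + 1)) % 7) / 7.0 for j in range(dim)] for t in chunk]
    return [sum(col) / len(chunk) for col in zip(*per_token)]


def squared_l2(a: Vector, b: Vector) -> float:
    """d(C, N) = ||Bert(C) - Bert(N)||_2^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_database(docs: Dict[str, List[int]], m: int) -> List[Entry]:
    """Store every chunk N that has a full continuation F, keyed by the embedding of N."""
    db: List[Entry] = []
    for doc_id, tokens in docs.items():
        for start in range(0, len(tokens) - 2 * m + 1, m):
            n_chunk = tokens[start:start + m]
            f_chunk = tokens[start + m:start + 2 * m]
            db.append((toy_embed(n_chunk), n_chunk, f_chunk, doc_id))
    return db


def retrieve(chunk: Sequence[int], db: List[Entry], k: int,
             exclude_doc: Optional[str] = None) -> List[Tuple[List[int], List[int]]]:
    """Return the k values [N, F] closest to the chunk, skipping the query's own document."""
    query = toy_embed(chunk)
    candidates = [e for e in db if e[3] != exclude_doc]
    candidates.sort(key=lambda e: squared_l2(query, e[0]))
    return [(e[1], e[2]) for e in candidates[:k]]


docs = {"a": list(range(1, 13)), "b": [1, 2, 3, 4, 50, 51, 52, 53, 60, 61, 62, 63]}
db = build_database(docs, m=4)
print(retrieve([1, 2, 3, 4], db, k=1, exclude_doc="a"))
```

Filtering on the document id mirrors the same-document filter described above: without it, the continuation $F$ of a neighbour taken from the training sequence itself could contain the very tokens the model is asked to predict.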

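For a database of $T$ elements, we can query the approximate nearest neighbours in $O(\log T)$ time. We use the **SCaNN library (Guo et al., 2020)** to achieve this. This means that we can query our 2 trillion token database in 10 ms while evaluating or sampling from the model; this expense is amortised over a chunk length.
Performing retrieval on the fly is too slow to keep up with the training calculations. We therefore leverage the frozen aspect of the embedding operator Bert to precompute all approximate nearest neighbours and save the results as part of the data. In Fig. 9 in the Appendix, we show results where we only retrieve neighbours within Wikipedia. We find that neighbours tend to come from 2-3 links away from a given article, whereas random articles are more than 5 links apart.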

Table 1 | MassiveText. The last column indicates the sampling weight during training. The multilingual subsets include documents in 10 languages. Full details are given in §A.1.

Source	Token count (M)	Documents (M)	Multilingual	Sampling frequency
Web	977,563	1,208	Yes	55%
Books	3,423,740	20	No	25%
News	236,918	398	No	10%
Wikipedia	13,288	23	Yes	5%
GitHub	374,952	143	No	5%

2.4. Retro model architecture

Our model relies on an encoder-decoder Transformer architecture, integrating the retrieved data through a cross-attention mechanism as introduced in Vaswani et al. (2017).
First, the retrieved tokens $\operatorname{Ret}(C)$ are fed into an encoder Transformer, which computes the encoded neighbours set $E$.
Denoting the intermediate activations by $H$, our Transformer decoder then interleaves Retro-blocks $\operatorname{Retro}(H, E)$ and standard Transformer blocks $\operatorname{Lm}(H)$ (the hyperparameter $P \subseteq[1, L]$ determines at which layers we use a Retro-block).
These blocks are built from three different residual operators:

a fully-connected layer Ffw,
the standard sequence-level self-attention layer Attn,
a chunked cross-attention layer Cca($\cdot, E$) that incorporates information from the retrieval encoder.

They are defined as follows:

$$\operatorname{Retro}(H, E) \triangleq \operatorname{Ffw}(\operatorname{Cca}(\operatorname{Attn}(H), E)), \quad \text{and} \quad \operatorname{Lm}(H) \triangleq \operatorname{Ffw}(\operatorname{Attn}(H))$$

Since Ffw, Attn and Cca are all autoregressive operators whose output at position $i$ only depends on $\left(h_{j}\right)_{j \leqslant i}$, any succession of Retro and Lm layers, followed by a token classification head, defines the **autoregressive log-likelihood (1)**.
An overview of the model architecture is given in Algorithm 1 and in Fig. 2.
We next describe the retrieval encoder and the chunked cross-attention layer in more detail, and explain how to sample from Retro.

Encoding retrieval neighbours
For each chunk $C_u$, the $k$ retrieval neighbours $\operatorname{Ret}(C_u)$ are fed into a bi-directional Transformer Encoder, yielding the outputs $E_u^j \triangleq \operatorname{Encoder}(\operatorname{Ret}(C_u)^j, H_u) \in \mathbb{R}^{r \times d'}$, where $j \in[1, k]$ indexes each neighbour.
The retrieval encoder is a non-causal Transformer. It is conditioned on $H_u$, the activations of chunk $C_u$, through cross-attention layers. This allows the representations of the retrieval encoder to be modulated by the retrieving chunk in a differentiable way.
More precisely, the encoding of the $j$-th neighbour of the $u$-th chunk, $\operatorname{Ret}(C_u)^j$, depends on the attended activation $H_u \triangleq (h_{(u-1)m+i})_{i \in[1,m]} \in \mathbb{R}^{m \times d}$ of chunk $C_u$ at layer $\min(P)$.
All neighbours of all chunks are encoded in parallel, yielding the full encoded set $E \triangleq (E_u^j)_{u \in[1,l], j \in[1,k]} \in \mathbb{R}^{l \times k \times r \times d'}$.
We denote by $E_u \in \mathbb{R}^{k \times r \times d'}$ the encoded neighbours for chunk $u \in[1,l]$.

Chunked cross-attention
To perform the Cca operation, we first split a given intermediate activation $H \in \mathbb{R}^{n \times d}$ into $l-1$ attending chunks $\left(H_{u}^{+} \triangleq\left(h_{u m+i-1}\right)_{i \in[1, m]} \in \mathbb{R}^{m \times d}\right)_{u \in[1, l-1]}$, as depicted on the right of Fig. 2.
$H_u^+$ holds the intermediate embeddings of the last token of chunk $C_u$ and of the first $m-1$ tokens of $C_{u+1}$.
We compute the cross-attention between $H_u^+$ and $E_u$, the encoded retrieval set obtained from chunk $C_u$.
Attention is computed across time and across neighbours simultaneously, as we merge the neighbour and time dimensions of $E_u$ before applying cross-attention.
Since there is a notion of alignment between data chunks and retrieval neighbours, we use the **relative positional encodings** described in §B.1.2.

We concatenate the $l-1$ outputs of the per-chunk cross-attentions (each of shape $m \times d$) across time, and pad the result appropriately.
This forms the output activation $\operatorname{Cca}(H, E) \in \mathbb{R}^{n \times d}$.
Formally, for each chunk $C_u$ and for each token $i \in[1, m]$ we set:

$$\operatorname{Cca}(H, E)_{u m+i-1} \triangleq \operatorname{Ca}\left(h_{u m+i-1}, E_{u}\right)$$

Footnote 2: The last token of chunk $C_u$ is the first token able to access the retrieved content $E_u$ while maintaining autoregressivity in (1). Hence, there is a one-token overlap between chunk $C_u=\left(x_{(u-1) m+i}\right)_{i \in[1, m]}$ and the corresponding attending chunk $C_u^{+} \triangleq\left(x_{u m+i-1}\right)_{i \in[1, m]}$.

Algorithm 1: Overview of the Retro model architecture
Hyperparameters: $P$ and $P_{\text{enc}}$, indices of the layers with cross-attention in the decoder and in the encoder respectively
Hyperparameters: $L$ and $L_{\text{enc}}$, number of decoder layers and number of encoder layers
Input: $X \in \mathbb{V}^{n}$: sequence of tokens. $\left(\operatorname{Ret}\left(C_{u}\right)\right)_{1 \leqslant u \leqslant l}$: the retrieved neighbours
Output: $O \in \mathbb{R}^{n \times|\mathbb{V}|}$: the output logits

def \(\operatorname{Encoder}\left(\operatorname{Ret}\left(C_{u}\right)_{1 \leqslant u \leqslant l}, H\right)\):
    \(\left(H_{u}\right)_{u \in[1, l]} \leftarrow \operatorname{Split}(H)\)
    for \(j \in[1, k], u \in[1, l]\) do  // Encoder shared across neighbours and chunks
        \(E_{u}^{j} = \operatorname{Emb}_{\text{enc}}\left(\operatorname{Ret}\left(C_{u}\right)^{j}\right)\)  // May be shared with the decoder Emb
        for \(p^{\prime} \in\left[1, L_{\text{enc}}\right]\) do
            \(E_{u}^{j} \leftarrow \operatorname{Attn}_{\text{enc}}\left(E_{u}^{j}\right)\)  // Bi-directional attention
            if \(p^{\prime} \in P_{\text{enc}}\) then
                \(E_{u}^{j} \leftarrow \operatorname{Ca}_{\text{enc}}\left(E_{u}^{j}, H_{u}\right)\)
            \(E_{u}^{j} \leftarrow \operatorname{Ffw}_{\text{enc}}\left(E_{u}^{j}\right)\)
    return \(E\)

\(H \leftarrow \operatorname{Emb}(X)\)
for \(p \in[1, L]\) do
    \(H \leftarrow \operatorname{Attn}(H)\)  // Causal attention
    if \(p=\min (P)\) then
        // The neighbour Encoder is conditioned with the decoder activations of
        // the last layer before the first cross-attention
        \(E = \operatorname{Encoder}\left(\operatorname{Ret}\left(C_{u}\right)_{1 \leqslant u \leqslant l}, H\right)\)
    if \(p \in P\) then
        \(H \leftarrow \operatorname{Cca}(H, E)\)
    \(H \leftarrow \operatorname{Ffw}(H)\)
\(O \leftarrow \operatorname{Read}(H)\)

Here $\operatorname{Ca}$ is the cross-attention residual operator over time-concatenated encoded neighbours.
In its simplest version, this operator is defined by three parameter matrices $K \in \mathbb{R}^{d \times c}$, $Q \in \mathbb{R}^{d \times c}$ and $V \in \mathbb{R}^{d \times d}$.
For all $h \in \mathbb{R}^{d}$ and $Y \in \mathbb{R}^{T \times d}$, we define:

$$\operatorname{Ca}(h, Y) \triangleq \operatorname{softmax}\left(Y K Q^{T} h\right) Y V,$$

where the softmax is performed on the second dimension and all products are matrix products.
We use multi-head cross-attention and add **positional encodings** to the softmax (see §B.1.2 for details).

The first $m-1$ tokens cannot attend to any neighbour of a previous chunk. At these positions we define Cca as the identity, setting $\operatorname{Cca}(H, E)_j \triangleq h_j$ for all tokens $j \in[1, m-1]$.
Finally, the last token $h_{lm}$ attends to the last retrieval set $E_l$, and we set $h_{lm} \triangleq \operatorname{Ca}(h_{lm}, E_l)$ (not shown in Fig. 2).
Listing 1 contains a simplified implementation of Cca.
Note that chunked cross-attention is autoregressive: the output of Cca at position $i$ depends only on the sequence of tokens from 0 to $i$ that is input to Cca.
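The index arithmetic of chunked cross-attention can be made concrete with a short NumPy sketch. It is not the paper's Listing 1: it assumes a single attention head, no positional encodings, $d' = d$, and it applies the simplest form of $\operatorname{Ca}$ exactly as written in the formula above; the function names are illustrative.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def cross_attention(h, Y, K, Q, V):
    """Ca(h, Y) = softmax(Y K Q^T h) Y V for one query h of shape (d,) and Y of shape (T, d)."""
    scores = (Y @ K) @ (Q.T @ h)        # shape (T,)
    return softmax(scores) @ (Y @ V)    # shape (d,)


def chunked_cross_attention(H, E, K, Q, V, m):
    """H: (n, d) activations, E: (l, k, r, d) encoded neighbours, m: chunk size."""
    n, d = H.shape
    l, k, r, _ = E.shape
    out = H.copy()                      # positions 1..m-1 keep the identity
    for pos in range(m, n + 1):         # 1-indexed token position
        u = pos // m                    # position u*m + i - 1 attends to E_u
        Y = E[u - 1].reshape(k * r, d)  # merge neighbour and time dimensions
        out[pos - 1] = cross_attention(H[pos - 1], Y, K, Q, V)
    return out


# Shapes from Fig. 2: n = 12, m = 4, l = 3, k = 2, r = 5 (d and c chosen arbitrarily).
rng = np.random.default_rng(0)
H = rng.normal(size=(12, 3))
E = rng.normal(size=(3, 2, 5, 3))
K, Q, V = rng.normal(size=(3, 2)), rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
print(chunked_cross_attention(H, E, K, Q, V, m=4).shape)
```

In this sketch, perturbing $E_2$ leaves the outputs at positions 1 to $2m-1$ unchanged, which is the causality property stated in the text: neighbours retrieved for a chunk only influence its last token and the tokens that follow.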

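With Retro models, even though each Cca cross-attention attends only to the neighbours of the preceding chunk $\operatorname{Ret}(C_{u-1})$, the dependencies on earlier neighbours are propagated through the self-attention operations.
The activations of the $i$-th token in the $u$-th chunk therefore potentially depend on the set of all previous neighbours $\operatorname{Ret}(C_{u'})_{u'<u}$, without incurring the quadratic cost of cross-attending to that set.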

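Sampling
When sampling, at the end of a chunk $C_u$ we use SCaNN to retrieve the neighbours $\operatorname{Ret}(C_u)$, based on the embedding $\operatorname{Bert}(C_u)$.
The encoded neighbours $E_u = \operatorname{Encoder}(\operatorname{Ret}(C_u))$ are then used to condition the generation of the next chunk $C_{u+1}$, which we do incrementally.
Overall, the cost of sampling is quadratic in the size of the sampled sequence, as when sampling from regular Transformers.
The added cost of retrieval is linear in the number of chunks $l$ and is negligible compared to the token sampling cost in practice.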
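The sampling procedure above can be sketched as a loop that retrieves once per completed chunk. This is a schematic illustration only: `next_token` and `retrieve` are hypothetical callables standing in for the Retro decoder and the SCaNN lookup, and the neighbour encoding step is folded into `next_token`.

```python
from typing import Callable, List, Sequence, Tuple

Neighbours = object  # placeholder type for whatever the retriever returns


def sample(prompt: Sequence[int], num_tokens: int, m: int,
           next_token: Callable[[List[int], List[Neighbours]], int],
           retrieve: Callable[[List[int]], Neighbours]
           ) -> Tuple[List[int], List[Neighbours]]:
    """Generate num_tokens tokens, retrieving neighbours at the end of each chunk."""
    tokens = list(prompt)
    neighbours: List[Neighbours] = []   # neighbours[u - 1] holds Ret(C_u)
    target = len(prompt) + num_tokens
    while len(tokens) < target:
        # Retrieve for every chunk that is now complete (one lookup per chunk).
        while len(neighbours) < len(tokens) // m:
            u = len(neighbours)
            neighbours.append(retrieve(tokens[u * m:(u + 1) * m]))
        # The next token may depend on all previous tokens and all Ret(C_u') so far.
        tokens.append(next_token(tokens, neighbours))
    return tokens, neighbours


# Toy stand-ins: the "model" emits the number of retrieved sets it can see.
out, ret = sample([0, 0, 0], num_tokens=6, m=4,
                  next_token=lambda toks, nbrs: len(nbrs),
                  retrieve=lambda chunk: tuple(chunk))
print(out, ret)
```

The number of calls to `retrieve` equals the number of completed chunks, which matches the statement that the retrieval overhead is linear in $l$.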

2.5. Baseline Transformer architecture

We use a Transformer (Vaswani et al., 2017) similar to the one described in Radford et al. (2019), with some minimal changes: we replace LayerNorm with RMSNorm (Zhang and Sennrich, 2019) and use relative position encodings (Dai et al., 2019).
As baselines, we train retrieval-free Transformers with 132M, 368M, 1.3B and 7.0B parameters (embedding matrices are excluded from parameter counts). The hyperparameters we used are detailed in Table 2.
All retrieval models use the same size encoder for the retrieval data, with $d'=896$ and 2 layers, which adds roughly 19M parameters. The encoder uses relative positional encodings.
The retrieval models contain one Retro-block every 3 blocks, starting from layer 6. For our smallest model, Cca is applied in layers 6, 9 and 12 of the main pathway and also once for query conditioning in the encoder, which adds a further 12M parameters. The relative number of extra parameters decreases as the baseline model size increases.
All models are implemented using JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020).
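The placement rule for Retro-blocks ("one every 3 blocks, starting from layer 6") can be written as a one-line helper. The function name is hypothetical, and the outputs for the deeper models follow from applying the same rule to the layer counts in Table 2; only the 12-layer case is spelled out in the text.

```python
from typing import List


def retro_block_layers(num_layers: int, first: int = 6, every: int = 3) -> List[int]:
    """Indices P (1-based) of the decoder layers that use a Retro-block."""
    return list(range(first, num_layers + 1, every))


for depth in (12, 24, 32):
    print(depth, retro_block_layers(depth))
# 12 [6, 9, 12]
# 24 [6, 9, 12, 15, 18, 21, 24]
# 32 [6, 9, 12, 15, 18, 21, 24, 27, 30]
```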

2.6. Quantifying dataset leakage exploitation

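Retro models may arguably benefit more easily from evaluation dataset leakage, i.e. the fact that we evaluate on data that were also present in the training set. To better understand how retrieval improves language modelling performance, we therefore quantify evaluation likelihood as a function of the overlap between the evaluation and training datasets.

The following approach can be used with any language model and depends only on the frozen retriever system presented in §2.3. We split the evaluation sequences $(X_i)_i$ into chunks of length $m \leq 64$, and we see the training data as a set of chunks $\mathcal{C}$. For each evaluation chunk $C \in \mathcal{C}$, we retrieve its 10 closest neighbours (of length up to 128) in the training data. We then compute the longest token substring common to both the evaluation chunk and its neighbours. This gives a number $s \in [0, m]$. The value $r(C) = \frac{s}{m}$, ranging from 0 (chunk never seen) to 1 (chunk entirely seen), gives a reliable indication of how much overlap there is between the evaluation chunk and the training data. For a given model, we then obtain the log-likelihood $\ell(C)$ of each chunk $C$ and the number of bytes $N(C)$ it encodes. We then consider the filtered bits-per-bytes of the model: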

$$\forall \alpha \in[0,1], \quad \mathcal{C}_{\alpha} \triangleq\{C \in \mathcal{C}, r(C) \leqslant \alpha\}, \quad \operatorname{bpb}(\alpha) \triangleq \frac{\sum_{C \in \mathcal{C}_{\alpha}} \ell(C)}{\sum_{C \in \mathcal{C}_{\alpha}} N(C)},$$

Table 2 | Number of parameters for our baseline and Retro models, excluding embeddings, along with the corresponding hyperparameters.

Baseline parameters	Retro	$d$	$d_{\text{ffw}}$	# heads	Head size	# layers
132M	172M (+30%)	896	3,584	16	64	12
368M	425M (+15%)	1,536	6,144	12	128	12
1,309M	1,451M (+11%)	2,048	8,192	16	128	24
6,982M	7,532M (+8%)	4,096	16,384	32	128	32

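which corresponds to the bits-per-bytes on the set of chunks whose overlap with the training chunks is at most $\alpha$. Note that the full evaluation bits-per-bytes performance is recovered by $\operatorname{bpb}(1)$. The function $\operatorname{bpb}(\cdot)$ allows us to evaluate the impact of evaluation leakage on predictive performance: for low $\alpha$, $\operatorname{bpb}(\alpha)$ indicates how the model performs on chunks that are entirely new, while the **slope** of $\operatorname{bpb}(\cdot)$ shows how much the model exploits evaluation leakage.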
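The leakage measure $r(C)$ and the filtered metric $\operatorname{bpb}(\alpha)$ can be sketched in plain Python. The sketch assumes each chunk record already carries its overlap ratio, its loss expressed in bits (our reading of $\ell(C)$ for the purpose of this example) and its byte count; all names are illustrative.

```python
from typing import List, Sequence, Tuple


def longest_common_substring(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest contiguous run of tokens shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def overlap_ratio(chunk: Sequence[int], neighbours: List[Sequence[int]]) -> float:
    """r(C) = s / m, with s the longest substring shared with any retrieved neighbour."""
    s = max((longest_common_substring(chunk, nb) for nb in neighbours), default=0)
    return s / len(chunk)


def filtered_bpb(chunks: List[Tuple[float, float, int]], alpha: float) -> float:
    """bpb(alpha) over chunks given as (r(C), loss in bits, number of bytes N(C))."""
    kept = [(bits, nbytes) for r, bits, nbytes in chunks if r <= alpha]
    total_bytes = sum(nbytes for _, nbytes in kept)
    if total_bytes == 0:
        return float("nan")  # no chunk satisfies r(C) <= alpha
    return sum(bits for bits, _ in kept) / total_bytes


records = [(0.0, 10.0, 5), (0.5, 4.0, 4), (1.0, 1.0, 1)]
print([filtered_bpb(records, a) for a in (0.0, 0.5, 1.0)])
```

In this toy example the metric decreases as $\alpha$ grows, the kind of negative slope the text associates with a model that benefits from leaked chunks.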

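3. Related Work

We first review existing work on using retrieval for language modelling and compare Retro to these works (see Table 3). Because Retro models are trained on a large dataset containing a substantial section of the internet, our work raises potential privacy, safety and fairness issues, which we review next.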

3.1. Retrieval for language modelling

Brants et al. (2007) show that scaling the training data to trillions of tokens improves the machine translation performance of $n$-gram models. More recently, GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020) and Jurassic-1 (Lieber et al., 2021) show that scaling up language models leads to massive improvements on many downstream tasks. At the same time, Carlini et al. (2021) demonstrate that large-scale language models can perfectly memorise parts of their training data, suggesting that enhancing models with retrieval may lead to further improvements. However, significant leakage between train and test datasets (Lee et al., 2021; Lewis et al., 2021) makes it difficult to compare and evaluate large models trained on large datasets, especially once retrieval capabilities over the training dataset are added.

Historically, **information retrieval** for text has relied on inverted index matching such as TF-IDF and BM25 (Robertson and Zaragoza, 2009). Foundational work uses latent topic modelling approaches like LDA (Blei et al., 2003) to identify relevant neighbours (Wei and Croft, 2006). Work in machine translation such as Zhang et al. (2018) and Gu et al. (2018) retrieves translation pairs based on the edit distance between source sentences and guides the translation output using the closest retrieved target sentences. The retrieval database may also be structured: for example, Ahn et al. (2016) use a symbolic knowledge graph to improve an RNN language model.

With the success of deep learning, retrieving systems have partly switched to dense learned representations based on a neural network's activations. Continuous cache (Grave et al., 2017) adds probability mass to tokens for which previous activations resemble the current activation vector, extending the model's context to the local history. $k$NN-LM (Khandelwal et al., 2020) applies this idea to Transformers and extends the retrieval database to English Wikipedia, resulting in substantial improvements on Wikitext103 evaluation.

Table 3 | Comparison of Retro with existing retrieval approaches.

Method	# Retrieval tokens	Granularity	Retriever training	Retrieval integration
Continuous Cache	$O(10^{3})$	Token	Frozen (LSTM)	Add to probs
$k$NN-LM	$O(10^{9})$	Token	Frozen (Transformer)	Add to probs
SPALM	$O(10^{9})$	Token	Frozen (Transformer)	Gated logits
DPR	$O(10^{9})$	Prompt	Contrastive proxy	Extractive QA
REALM	$O(10^{9})$	Prompt	End-to-End	Prepend to prompt
RAG	$O(10^{9})$	Prompt	Fine-tuned DPR	Cross-attention
FiD	$O(10^{9})$	Prompt	Frozen DPR	Cross-attention
EMDR$^{2}$	$O(10^{9})$	Prompt	End-to-End (EM)	Cross-attention
Retro (ours)	$O(10^{12})$	Chunk	Frozen (Bert)	Chunked cross-attention

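Continuous cache and $k$NN-LM do not modify the underlying neural network model, but **interpolate at inference time between the language model's output and distributions computed from the retrieved tokens**. These methods can therefore be plugged into any model without additional training, although this limits the model's ability to reason about the retrieved text. SPALM (Yogatama et al., 2021) addresses this limitation by adding an extra gating network to post-process the retrieved data, yet most of the network is unaffected by the retrieval during inference.

The retrieval representations may also be trained directly instead of relying on a pre-trained model. Retriever systems have been developed for this purpose, primarily for open-domain question answering. For example, DPR (Karpukhin et al., 2020) trains two Bert models (for queries and keys respectively) using a contrastive loss to align the representations of a question and of its answers. Lee et al. (2019) use an inverse cloze task to find semantic representations of passages for retrieval. These works differ from continuous cache and $k$NN-LM in that they embed passages (or chunks) of text together, as opposed to each token individually. The retriever network is trained in isolation from the downstream task that uses the retrieval data. This potential issue is specifically addressed by REALM (Guu et al., 2020), which trains the retrieval system end-to-end to maximise the final training cross-entropy. This comes with the extra complexity of searching the database during training and periodically updating the embedding table, which severely limits the scale at which it can operate. RAG (Lewis et al., 2020) and FiD (Izacard and Grave, 2021) build upon DPR to set the state of the art on question answering benchmarks by training encoder-decoder Transformer models. More recently, EMDR$^2$ (Sachan et al., 2021) extends FiD by using an expectation-maximisation algorithm to train the retriever end-to-end and achieves state-of-the-art results compared to similarly sized models.

In the open-domain dialogue setting, BlenderBot 2.0 (Komeili et al., 2021) learns to issue textual internet queries, outperforming dense retrieval methods when evaluated on a task measuring how close model responses are to those of humans. This involves collecting a dataset of human dialogues with associated search queries, which limits the scalability of this approach. Hashemi et al. (2020) introduce the Guided Transformer, a modified Transformer similar to Retro, for document retrieval and question selection. Although effective on question answering and other tasks with strong conditioning, none of these methods is designed to model arbitrary text sequences, in contrast with Retro.

Retro shares components with $k$NN-LM and DPR in that it uses frozen retrieval representations. Retro models longer sequences than QA examples; this requires reasoning at a sub-sequence level and retrieving different documents for the different chunks of a sequence. Similar to FiD, Retro processes the retrieved neighbours separately in the encoder and assembles them in the chunked cross-attention. This differs from, for example, REALM, which prepends retrieved documents to the prompt. Using chunks allows for repeated retrieval while generating a sequence, as opposed to retrieving only once based on the prompt alone. Furthermore, retrieval is done during the whole pre-training process in Retro and is not simply plugged in to solve a certain downstream task. Finally, previous methods based on dense query vectors use small models and retrieval datasets with fewer than 3B tokens (English Wikipedia). Table 3 summarises the differences between Retro and existing approaches.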

3.2. Privacy, safety and fairness

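Bender et al. (2021) and Weidinger et al. (2021) highlight several dangers of large language models. These stem from their ability to memorise training data, their high training cost, the static nature of their training data (Lazaridou et al., 2021), their tendency to amplify biases inherent in the training data, and their ability to generate toxic language (Gehman et al., 2020). In this section we inspect these dangers, focusing on how retrieval-augmented language models may exacerbate or mitigate them.

Large language models can perfectly memorise parts of their training data (Carlini et al., 2021). When coupled with large training datasets gathered from the web or other sources, this has clear privacy and safety implications. Retrieval models such as Retro, which have access to the entire training dataset during inference, exacerbate these privacy issues because they can directly copy training data. However, retrieval systems also offer a path towards mitigating these concerns through obliteration of the retrievable data at inference time. In addition, **differential privacy training (Abadi et al., 2016)** of retrieval models could guarantee that no private information is stored in the model weights, while individualisation on private data could be achieved by updating the retrieval database at inference time.

Because of their high training cost, regularly re-training large language models to incorporate new data, languages and norms is prohibitively expensive. To keep a retrieval model up to date, it may be sufficient to update the retrieval database, which is orders of magnitude cheaper than re-training a model from scratch. In addition to the benefits of updating models in terms of fairness and bias, simply training large language models has a significant energy cost (Schwartz et al., 2020; Strubell et al., 2019). Retrieval mechanisms offer a path to reducing the compute requirements needed to train and update language models that reach a given level of performance.

Large language models are prone to generating toxic outputs, as shown in Gehman et al. (2020). Bender et al. (2021) and Jo and Gebru (2020) advocate for the importance of better training data curation and documentation. Moreover, if portions of the training data are found after training to elicit biased or toxic outputs, retrieval allows for some correction, since the offending retrieval data can be retroactively filtered. However, without careful analysis and intervention, retrieval models may also exacerbate biases present in the training data. Retrieval models can add a further source of bias through the selection mechanism for retrieval documents. Further work in this area is required to better understand how retrieval affects the bias and toxicity of model outputs.

Finally, samples from large models are difficult to interpret, which makes mitigating these issues all the more challenging (Belinkov et al., 2020; Jain and Wallace, 2019). Retrieval provides more insight into the outputs of a model, as one can directly visualise or modify the neighbours that are being used. The examples in Tables 6, 7, 20 and 21 illustrate how retrieval makes language models more factual and interpretable by providing more transparent outputs.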

4. Results

We first report results on language modelling benchmarks. Second, we show how to Retrofit pre-trained Transformer language models into retrieval models with few additional FLOPs. Next, we report Retro results on **question answering**. Finally, we report evaluation metrics with leakage filtering, to better understand the source of the gains brought by retrieval.

4.1. Language modelling

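Datasets
We evaluate our models on C4 (Raffel et al., 2020), Wikitext103 (Merity et al., 2017), Curation Corpus (Curation, 2020), Lambada (Paperno et al., 2016) and the Pile (Gao et al., 2020). We also evaluate on a set of manually selected Wikipedia articles that were added or heavily edited in September 2021, months after our pre-training and retrieval dataset was collected (details are given in §A.2). We construct this dataset from "future" articles and manually remove new articles that strongly overlap with documents in our training data. This ensures that the evaluation documents are not leaked in our training data.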

Figure 3 | Scaling with respect to model size. (a) LAMBADA top-1 accuracy. (b) Evaluation loss on Curation Corpus. (c) Perplexity on Wikitext103 valid. (d) Bits-per-byte on selected Wikipedia articles from September 2021.

For C4, Wikitext103, the Pile and our Wikipedia dataset, we evaluate the language modelling performance on entire documents and measure the bits-per-byte (bpb). We favour bits-per-byte over loss because it is tokeniser-agnostic. We evaluate with a sequence length of 2048 tokens but use a stride of 1024 within documents to mitigate boundary effects. On Curation Corpus we concatenate the article, the "TL; DR:" string and the summary, but only evaluate the bpb on the summary. For Lambada we evaluate the accuracy on the last word, using greedy generation.
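To make the metric concrete, the sketch below converts a summed token loss in nats into bits-per-byte and enumerates evaluation windows of 2048 tokens with a stride of 1024. The scoring rule (each token is scored exactly once, in the first window whose end reaches it) is an assumption of this sketch: the text only states the window length and the stride. The function names are illustrative.

```python
import math
from typing import List, Tuple


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood in nats to bits per UTF-8 byte."""
    num_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * num_bytes)


def sliding_windows(n_tokens: int, length: int = 2048,
                    stride: int = 1024) -> List[Tuple[int, int, int]]:
    """Return (start, end, first_scored): the model reads tokens[start:end]
    and only tokens[first_scored:end] contribute to the loss."""
    windows = []
    start, scored_up_to = 0, 0
    while scored_up_to < n_tokens:
        end = min(start + length, n_tokens)
        windows.append((start, end, scored_up_to))
        scored_up_to = end
        start += stride
    return windows


print(sliding_windows(4096))
# [(0, 2048, 0), (1024, 3072, 2048), (2048, 4096, 3072)]
print(bits_per_byte(8 * math.log(2), "abcd"))  # 8 bits over 4 bytes
```

Because the loss is normalised by bytes of raw text rather than by tokens, two models with different vocabularies can be compared on the same scale, which is the tokeniser-agnostic property mentioned above.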

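Model scaling
In Figure 1 (left) and Figure 3 we show the language modelling performance as we scale models from 150 million to 7 billion (non-embedding) parameters. On all datasets, Retro outperforms the baseline at all model sizes. Furthermore, we observe that the improvements do not diminish as we scale the models. The performance is dataset dependent, with the largest gains on Wikitext103 and C4. Wikipedia articles and other web pages are similar to Wikitext103 documents even when they are not exact copies (§4.4), so we obtain dramatic improvements on Wikitext103 because our retrieval model can directly exploit these overlaps. The smallest gains are for Curation Corpus, where Retro only slightly outperforms the baseline. This is expected, since Curation Corpus summaries are designed to contain only information from the source article and are not included in our retrieval database. On our "future" Wikipedia September 2021 dataset, we also observe consistent gains for all model sizes.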

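Data scaling
Figure 1 (middle) shows how scaling the retrieval database at evaluation time improves the language modelling performance. We observe dramatic gains as the retrieval data is increased from Wikipedia (4 billion tokens) to all of MassiveText (1.7T tokens). Figure 1 (right) shows how performance scales as we increase the number of retrieved chunks. Despite being trained with only 2 neighbours, all models improve consistently when the number of neighbours is increased from 1 to 10. Furthermore, we observe that larger models are able to make better use of more neighbours: the 172M model improves with up to 10 neighbours, whereas the 7B model improves with up to 40 neighbours.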

Pile
We evaluate our 7B models on the Pile test sets and compare against the 178B parameter Jurassic-1 model (Lieber et al., 2021) and the 280B parameter Gopher model (Rae et al., 2021). We do not compare against GPT-3, as it is outperformed by Jurassic-1 and Gopher on almost all subsets. Figure 4 shows the relative improvements in bits-per-byte over our 7B Transformer baseline for our 7.5B Retro model, Jurassic-1 and Gopher.

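Figure 4 | The Pile: comparison of our 7B baseline against Jurassic-1, Gopher and Retro. We observe that the retrieval model outperforms the baseline on all test sets and outperforms Jurassic-1 on a majority of them, despite being over an order of magnitude smaller.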

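Jurassic-1 outperforms the baseline on all datasets except for books, likely due to the inclusion of books in our training data. Gopher and Retro outperform the baseline on all test sets. Overall, Retro 7.5B outperforms Jurassic-1 and Gopher on a majority of the test sets. On the dm_mathematics and ubuntu_irc subsets, our Retro model does not outperform our 7B baseline and underperforms Jurassic-1. We hypothesise that the retrieved neighbours on these datasets are not helpful, due to a combination of what is in our retrieval dataset and the efficacy of the nearest-neighbour search.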

Wikitext103
To validate our approach in a controlled setting, we compare our method with $k$NN-LM (Khandelwal et al., 2020) on the Wikitext103 dataset in Table 4. We train a baseline Transformer on the training set of Wikitext103. This Transformer has 24 layers, 1024 hidden units, 16 heads and a key size of 64, as in Baevski and Auli (2019). Our baseline does not use adaptive inputs, and our tokeniser has an open vocabulary, unlike that of Baevski and Auli (2019), which makes our baseline perplexities slightly higher. Full experiment details and hyperparameters are given in §C.2 and Table 11.

Table 4 | Perplexities on Wikitext103. When using the Wikipedia dataset for retrieval, Retro performs similarly to our implementation of $k$NN-LM. As we scale the retrieval dataset, Retro performs much better. The perplexities when retrieving from full MassiveText are quite low, which is partly due to partial overlap with Wikitext103 that is not caught by our deduplication.

Model	Retrieval Set	# Database tokens	# Database keys	Valid	Test
Adaptive Inputs (Baevski and Auli, 2019)	-	-	-	17.96	18.65
SPALM (Yogatama et al., 2021)	Wikipedia	3B	3B	17.20	17.60
$k$NN-LM (Khandelwal et al., 2020)	Wikipedia	3B	3B	16.06	16.12
Megatron (Shoeybi et al., 2019)	-	-	-	-	10.81
Baseline transformer (ours)	-	-	-	21.53	22.96
$k$NN-LM (ours)	Wikipedia	4B	4B	18.52	19.54
Retro	Wikipedia	4B	0.06B	18.46	18.97
Retro	C4	174B	2.9B	12.87	10.23
Retro	MassiveText (1%)	18B	0.8B	18.92	20.33
Retro	MassiveText (10%)	179B	4B	13.54	14.95
Retro	MassiveText (100%)	1792B	28B	3.21	3.92

We re-implement $k$NN-LM with our tokeniser and baseline Transformer to produce embeddings of size 1024 for every token in Wikitext103. $k$NN-LM has probabilities $p_{k\text{NN-LM}}=\lambda p_{k\text{NN}}+(1-\lambda) p_{\text{LM}}$ with $p_{k\text{NN}}\left(n_{k}\right) \propto \exp \left(-\alpha d_{k}\right)$. We tune $\lambda=0.118$ and $\alpha=0.00785$ on the validation set (Figure 7) and report performance for these hyperparameters on both the validation and test sets.
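The $k$NN-LM interpolation can be illustrated with a few lines of Python. The sketch assumes that each retrieved neighbour comes with its next token and its distance $d_k$, and that weights of neighbours sharing the same token are summed before normalising; the function names and the toy numbers are illustrative.

```python
import math
from typing import Dict, List, Tuple


def knn_distribution(neighbours: List[Tuple[int, float]], alpha: float) -> Dict[int, float]:
    """p_kNN over tokens, with each neighbour (token, distance) weighted by exp(-alpha * d)."""
    weights: Dict[int, float] = {}
    for token, dist in neighbours:
        weights[token] = weights.get(token, 0.0) + math.exp(-alpha * dist)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm: Dict[int, float], p_knn: Dict[int, float], lam: float) -> Dict[int, float]:
    """p = lam * p_kNN + (1 - lam) * p_LM, over the union of both supports."""
    tokens = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0) for t in tokens}


# Toy example with the tuned values lambda = 0.118 and alpha = 0.00785.
p_knn = knn_distribution([(7, 10.0), (7, 12.0), (9, 30.0)], alpha=0.00785)
p = interpolate({7: 0.5, 9: 0.3, 11: 0.2}, p_knn, lam=0.118)
print(p, sum(p.values()))
```

Because both inputs are probability distributions and the mixture weights sum to one, the result is again a distribution, which is why this scheme can be added to a trained model without further training.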

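We fine-tune our baseline Transformer into a Retro model (Figure 7), using the Wikitext103 training data and retrieving from Wikipedia with 2 neighbours. We only train the new weights, as explained in §4.2, and share the embedding weights between the encoder and the main pathway. This is necessary because Wikitext103 is quite small: training Retro from scratch in this setting leads to over-fitting.

We evaluate the fine-tuned Retro models with different retrieval sets. We use 10 neighbours at evaluation for both Retro and $k$NN-LM. When retrieving from Wikipedia, we obtain results comparable to our $k$NN-LM implementation. Furthermore, scaling the retrieval database to MassiveText yields dramatic improvements, though this is partly due to leakage (see §4.4). For reproducibility, we also include results when retrieving from C4, which are close to the previous state of the art and comparable to using 10% of MassiveText.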

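It is worth noting that $k$NN-LM requires 1024 floats for every token in the retrieval dataset, totalling 15 terabytes (TB) for the 4 billion tokens in Wikipedia. $k$NN-LM and other token-level retrieval approaches therefore do not scale to retrieval databases with trillions of tokens such as MassiveText. In comparison, Retro only requires 215 GB to index our Wikipedia dataset, and 93 TB for MassiveText. Inspecting the number of retrieval database entries in Table 4 makes it clear why retrieving at the chunk level is necessary when scaling to datasets with trillions of tokens.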
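The storage figure for token-level retrieval can be checked with a short calculation. It assumes 4-byte (32-bit) floats, which the text does not state explicitly; under that assumption the result lands at roughly the 15 terabytes quoted above. The helper name is illustrative.

```python
def index_bytes(num_keys: int, floats_per_key: int, bytes_per_float: int = 4) -> int:
    """Raw size of a dense key table (assumes 4-byte floats unless stated otherwise)."""
    return num_keys * floats_per_key * bytes_per_float


wikipedia_tokens = 4_000_000_000

# Token-level retrieval (kNN-LM): one 1024-float key per token.
token_level = index_bytes(wikipedia_tokens, 1024)
print(f"{token_level / 2**40:.1f} TiB")  # about 14.9 TiB

# Chunk-level retrieval: one key per 64-token chunk, i.e. 64x fewer keys.
chunk_keys = wikipedia_tokens // 64
print(f"{chunk_keys / 1e9:.2f}B keys")  # about 0.06B keys, as listed in Table 4
```

The 64-fold reduction in the number of keys is the main reason chunk-level retrieval remains tractable at the scale of MassiveText.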

4.2. Retro-fitting baseline models

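We extend baseline models into Retro models by freezing the pre-trained weights and training only the chunked cross-attention and neighbour encoder parameters (less than 10% of the weights for the 7B model) (Fig. 5). This offers an efficient alternative path to enhancing Transformers with retrieval, requiring only 6 million sequences (3% of the pre-training sequences that we used). Additionally, by training only the new weights we ensure that the original model performance is exactly maintained when evaluated without retrieval. Retrofitted models quickly surpass the performance of baseline models and even achieve performance close to that of Retro models trained from scratch. The experiment hyperparameters are given in §C.3.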

4.3. Question answering

We fine-tune our retrieval models on the Natural Questions dataset (Kwiatkowski et al., 2019) to demonstrate that our retrieval pathway can be used to inject information from arbitrary data sources. We use the version provided by Izacard and Grave (2021), which is augmented with retrieved passages from DPR (Karpukhin et al., 2020). We fine-tune all the weights of our 7.5B pre-trained Retro model for 25,000 steps, using the top 20 retrieved passages. We format the data as "question: {question} \nanswer: {answer}" and left-pad the data so that "answer:" coincides with the end of the first chunk of 64 tokens and thus aligns with the first retrieving chunk. The model has access to the question via the previous tokens in the sequence, as well as to the top 20 DPR Wikipedia passages and their titles via the chunked cross-attention mechanism.
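The left-padding scheme can be sketched as follows. The example works on already-tokenised lists and uses a made-up `<pad>` symbol and whitespace-style tokens; the actual tokeniser, padding token and handling of over-long questions are not specified in the text and are assumptions here.

```python
from typing import List


def format_qa(question_tokens: List[str], answer_tokens: List[str],
              m: int = 64, pad: str = "<pad>") -> List[str]:
    """Left-pad 'question: ... \\nanswer:' to exactly m tokens, then append the answer."""
    prefix = ["question:"] + question_tokens + ["\nanswer:"]
    if len(prefix) > m:
        raise ValueError("question does not fit in the first chunk")
    return [pad] * (m - len(prefix)) + prefix + answer_tokens


seq = format_qa("who wrote hamlet ?".split(), ["Shakespeare"])
print(len(seq), repr(seq[63]), seq[64])  # 65 '\nanswer:' Shakespeare
```

With this layout the answer tokens start exactly at the second chunk, so they can attend, through chunked cross-attention, to the passages attached to the first chunk.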

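Figure 5 | Retro-fitting a baseline Transformer. Any Transformer can be fine-tuned into a retrieval-enhanced Transformer by randomly initialising and training only the chunked cross-attention and retrieval encoder weights. Fine-tuning in this way quickly recovers and surpasses the non-retrieval performance, and almost achieves the same performance as training a retrieval model from scratch (shown by the arrow on the right-hand side of each plot). We find good performance when Retro-fitting our models while training on only 3% of the number of tokens seen during pre-training.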

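The exact match scores are shown in Table 5, and the full fine-tuning details are given in §C.4. Our method is competitive with previous approaches such as Realm, RAG and DPR, but underperforms the more recent FiD. In contrast with the FiD work, we find that increasing the number of neighbours past 20 does not improve Retro performance on this task. We hypothesise that the encoder-decoder structure of T5, the base model in FiD, and the T5 pre-training objective lead to a model that relies more on the encoder output than Retro does, which is important in the QA setting. To compete with T5-fine-tuned models, future work should consider ways of forcing Retro to rely further on the retrieval encoder output when producing tokens.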

4.4. Relating retrieval performance to dataset leakage

We report the filtered eval losses, as detailed in §2.6, on C4, Curation Corpus and Wikitext103 in Figure 6. On C4 and Wikitext103, for which there is leakage into the training set, the slope is negative for both baseline and Retro models. Retro models exploit leakage more strongly than baseline models, as indicated by the more negative slope. This is due to their explicit ability to copy-paste existing training chunks to predict leaked evaluation chunks (see Table 19 for a qualitative example of this behaviour on a Wikitext103 article). On Curation Corpus, retrieval provides a constant offset, which is expected as there is by design no leakage between Curation Corpus and the training dataset.

Table 5 | Question answering results. Exact match accuracy on Natural Questions.

Model	Test Accuracy
Realm (Guu et al., 2020)	40.4
DPR (Karpukhin et al., 2020)	41.5
RAG (Lewis et al., 2020)	44.5
EmDR $^{2}$ (Sachan et al., 2021)	52.5
FiD (Izacard and Grave, 2021)	51.4
FiD + Distill. (Izacard et al., 2020)	$\mathbf{54.7}$
Baseline 7B (closed book)	30.4
Retro 7.5B (DPR retrieval)	45.5

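Figure 6 | Performance vs. longest common retrieval substring.
Evaluation loss as a function of the allowed longest common substring between evaluation data chunks and their nearest neighbours. Retrieval still helps when considering chunks with no more than 8 contiguous tokens overlapping with training dataset chunks.

On the other hand, Retro outperforms baseline models at all leakage levels, down to $\alpha=12.5\%$. At this level, the loss is computed on chunks that share fewer than 8 contiguous tokens with the closest matching chunk in the training dataset; this is a reasonable level of overlap at which we can consider that there is no local leakage. Retrieval thus improves predictions both on chunks that are syntactically similar to chunks in the training set and on chunks that are syntactically different from all training chunks. This points toward a non-trivial ability of Retro to generalise based on both model parameters and the retrieval database. Similar results are found on the Pile dataset (see Figure 12, §F.3).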
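To make the filtering threshold concrete, the following sketch computes the longest run of contiguous tokens shared by two chunks and applies the $\alpha=12.5\%$ cut-off. It assumes a 64-token chunk (so the threshold is 8 tokens) and that overlap is measured on token sequences; the example sentences and the function name are made up for illustration.

```python
def longest_common_substring(a, b):
    """Length of the longest run of contiguous tokens shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous pair of positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


CHUNK_LEN = 64                   # assumed chunk length
ALPHA = 0.125                    # leakage level from the text
max_shared = ALPHA * CHUNK_LEN   # 8 contiguous tokens

eval_chunk = "the beaver builds a dam across the stream".split()
train_chunk = "a beaver builds a dam near the river".split()
overlap = longest_common_substring(eval_chunk, train_chunk)
keep = overlap < max_shared      # chunk counts as "not leaked" at this level
print(overlap, keep)
```

Here the two toy chunks share the 4-token run "beaver builds a dam", which is below the 8-token threshold, so the evaluation chunk would be kept at this leakage level.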

4.5. Using Retro for sampling

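We show examples of samples obtained using the 7.5B Retro model in Table 6, Table 7 and Appendix E. For each chunk (the first one being the prompt), we juxtapose the sampled chunk $C_u$ with the retrieved neighbours $\operatorname{Ret}(C_u)$.
To give an indication of local overlap, we colour each token of the sampled chunk $C_u$ by the length of the longest common prefix (LCP) found in the retrieved chunks $\operatorname{Ret}(C_{u-1})$. Similarly, we colour the retrieved chunks by their LCP with the sampled chunk.
For the sample in Table 6, for which we chose the prompt, we observe that the retrieved chunks influence the sample, as there is overlap between the sampled tokens and the neighbour tokens. Overall, compared with samples produced with retrieval disabled, retrieval reduces hallucinations (in line with the findings of Shuster et al. (2021)) and makes the model more knowledgeable.
In the sample in Table 7, the model recognises that the prompt is the beginning of the first scene of Hamlet and leverages the retrieval data to continue it with only a few mistakes.
Appendix E provides further examples, including examples from the evaluation sets, as well as the detailed procedure used for colouring the tables.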

5. Conclusion

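We present Retrieval-Enhanced Transformers (Retro), a method for modelling arbitrary text sequences while retrieving from databases with trillions of tokens. This scales the data available to models by an order of magnitude compared to what is typically consumed during training. Retro model gains do not diminish for models with up to at least 7B parameters, and on certain datasets correspond to non-retrieval models with 10× more parameters. On Wikitext103 and the Pile, Retro outperforms previous models trained on large-scale datasets. We also show that Retro is competitive on retrieval-intensive downstream tasks such as question answering.

Retro models are flexible and can be used without retrieval at evaluation time while still achieving performance comparable to baseline models. Conversely, baseline models can be rapidly fine-tuned into Retro models to obtain nearly the same performance as if trained from scratch. Careful analysis shows that only a modest fraction of the gains obtained by Retro are due to test set leakage. In general, we caution against such leakage in large-scale language datasets and suggest further work to better understand the role of test set leakage in the performance of large-scale language models.

Overall, our work demonstrates at an unprecedented scale that semi-parametric approaches can provide an orthogonal, more efficient approach than raw parameter scaling as we seek to build more powerful language models.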

Acknowledgements

We would like to thank Nikolai Grigorev, Marc'aurelio Ranzato, Cyprien de Masson d'Autume, Po-Sen Huang, Johannes Welbl, Lisa Anne Hendricks, Ethan Perez, Jeff Stanway, Eric Noland, Gregory Wayne, John Jumper, Julian Schrittwieser, Lorrayne Bennett, Devang Agrawal, Dani Yogatama, Susannah Young, Nando de Freitas, Demis Hassabis and Koray Kavukcuoglu for their help, advice and reviews. We would also like to thank Zonglin Li, David Simcha and the ScaNN developers.

Table 6 | Sample - Beavers are interesting animals. The Retro[Off] sample quickly diverges to other animals, while the Retro[On] sample tends to stay focused on the beaver topic thanks to neighbour conditioning.

Prompt and sample of Retro[Off]	Prompt and sample of Retro[On], colored by LCP with $\operatorname{Ret}(C_{u-1})$ (LCP $=0,1,2,3,4, \geqslant 5$)	$[N_{u}^{1}, F_{u}^{1}]$, colored by LCP with $C_{u+1}$ (LCP $=0,1,2,3,4, \geqslant 5$)	$[N_{u}^{2}, F_{u}^{2}]$, colored by LCP with $C_{u+1}$ (LCP $=0,1,2,3,4, \geqslant 5$)
비버는 강 근처에 사는 흥미로운 동물입니다. 그들은	비버는 강 근처에 사는 흥미로운 동물입니다. 그들은	.비버는 숲이 우거진 지역에 만든 연못에 굴을 짓습니다. 자연의 많은 것들처럼, 야생의 생물들 사이에는 연결 고리가 있습니다. 비버 연못은 나무를 익사시키지만, 죽은 나무는 종종 매년 돌아오는 큰 푸른 왜가리를 유인합니다. 시간이 지남에 따라, 비버 연못은 군집을 이루는 50개 이상의 둥지를 유인할 수 있으며, 이를 rookery라고 부릅니다. 이에 대한 예시는 Boxford 1ine 근처 Carter Fields의 Bradford Road 옆 큰 연못에서 찾을 수 있습니다. 매사추세츠 오듀본 협회의 전문가인 Chris Leahy는	그것들을 더 작은 조각으로 썰어 물속으로 끌고 갑니다. 연못의 엔지니어 비버는 그들이 사는 서식지를 변화시키기 때문에 흥미로운 동물입니다. 비버는 개울을 막아 연못을 만듭니다. 그런 다음 그들은 굴이라고 불리는 집을 이 연못에 짓습니다. 비버의 몸은 수중 건축에 잘 적응되어 있습니다. 특별한 근육은 코, 귀, 목을 막아 물이 들어오지 못하게 합니다. 비버의 넓은 꼬리는 방향타처럼 작용합니다. 그들의 매우 크고 주황색 앞니 두 개는 나무를 갉아먹는 데 사용됩니다. 그들은 댐을 짓기 시작합니다.
댐을 지어 연못을 만듭니다. 개구리는 양서류이므로 1and와 물 모두에서 살 수 있습니다. 그들은 포식자로부터 숨기기 위해 훌륭한 위장을 가지고 있습니다. 골든 리트리버, 또는 골든은 매우 큰 개입니다. 그들은 매우 강하고, 두꺼운 털을 가지고 있어	강바닥에 비버 댐이라고 불리는 집을 짓습니다. 그들은 또한 육지에서도 삽니다. 비버는 강한 이빨과 강한 턱을 사용하여 나무와 가지를 잘라 집을 짓습니다. 그들은 또한 이빨과 턱을 사용하여 음식을 씹습니다. 비버는 크고 평평한 꼬리를 사용하여 수영합니다. 그들은	, 그렇다면 당신은 매우 바쁘다는 뜻입니다. 비버는 개울에서 쉽게 수영하며, 돌과 나뭇가지를 주워 댐을 짓습니다. 그들은 큰 앞니로 나무를 갉아먹어 쓰러뜨립니다. 그런 다음 나무의 일부를 사용하여 집을 짓습니다. 비버는 영리한 건축가입니다. 그들은 비버 댐을 짓는 데 필요한 것이 무엇인지 정확히 알고 있습니다. 그들은 개울의 진흙을 사용하여 댐을 함께 유지합니다. 그들은 꼬리를 사용하여 진흙을 다집니다. 비버는 아기들을 위해 댐 꼭대기에 아늑한 방을 만듭니다. 그들은 음식을 물속에 저장합니다. 비버는	귀와 같은 꼬리, 그리고 나무를 갉아먹을 만큼 강한 두 개의 튀어나온 이빨을 가지고 있습니다. 비버는 나무, 가지, 진흙을 사용하여 강과 개울을 가로질러 댐을 짓습니다. 이 댐은 비버가 집을 짓는 깊은 웅덩이를 만듭니다. 비버의 집은 굴이라고 불립니다. 아기 비버 또는 "새끼"는 두 살이 될 때까지 가족 굴에 머무릅니다. 펠트라고 알려진 비버 모피는 한때 모자와 코트의 장식으로 매우 인기가 있었습니다. 비버 모피의 인기가 뉴욕 식민지화에 어떻게 기여했을까요?wwwWw
추운 지역에서 살 수 있도록 돕습니다. 캥거루는 매우 큰 동물입니다. 매우 높이 뛸 수 있습니다. 매우 빠르게 달릴 수 있습니다. 미어캣은 아프리카에 사는 작은 동물입니다. 그들은 최대 백 마리의 군집을 이루어 삽니다. 그들은 나무를 매우 쉽게 오를 수 있습니다.	그들의 날카로운 이빨로 바위를 갉아먹고 나무를 자릅니다. 그들은 평평한 꼬리를 사용하여 수영할 때 방향을 잡습니다. 겨울에는 비버가 비버 굴이라고 불리는 집을 지어야 합니다. 그들은 매우 튼튼한 집을 짓습니다. 벽은 나뭇가지로 만들어집니다. 지붕은	, 그렇다면 당신은 매우 바쁘다는 뜻입니다. 비버는 개울에서 쉽게 수영하며, 돌과 나뭇가지를 주워 댐을 짓습니다. 그들은 큰 앞니로 나무를 갉아먹어 쓰러뜨립니다. 그런 다음 나무의 일부를 사용하여 집을 짓습니다. 비버는 영리한 건축가입니다. 그들은 비버 댐을 짓는 데 필요한 것이 무엇인지 정확히 알고 있습니다. 그들은 개울의 진흙을 사용하여 댐을 함께 유지합니다. 그들은 꼬리를 사용하여 진흙을 다집니다. 비버는 아기들을 위해 댐 꼭대기에 아늑한 방을 만듭니다. 그들은 음식을 물속에 저장합니다. 비버는	큰 앞니로 나무를 갉아먹어 쓰러뜨립니다. 그런 다음 나무의 일부를 사용하여 집을 짓습니다. 비버는 영리한 건축가입니다. 그들은 비버 댐을 짓는 데 필요한 것이 무엇인지 정확히 알고 있습니다. 그들은 개울의 진흙을 사용하여 댐을 함께 유지합니다. 그들은 꼬리를 사용하여 진흙을 다집니다. 비버는 아기들을 위해 댐 꼭대기에 아늑한 방을 만듭니다. 그들은 음식을 물속에 저장합니다. 비버는 그들이 쓰러뜨린 나무의 껍질을 먹습니다! 1. 첫 번째 단락의 주요 아이디어는 무엇입니까? 2. 두 번째 단락의 주요 아이디어는 무엇입니까?
쥐는 육지에 사는 작은 포유류입니다. 매우 잘 오르고 매우 빠르게 달릴 수 있습니다. 펭귄은 남극에 사는 새입니다. 그들은 몸을 따뜻하게 유지하기 위해 두꺼운 털을 가지고 있습니다. 토끼는 땅에 사는 작은 동물입니다. 그들은	가지와 다른 나무 부분으로 만들어집니다. 그들은 또한 강한 턱을 사용하여 나무를 자릅니다. 그들은 그것들을 집으로 가져옵니다. 그들은 또한 날카로운 이빨을 사용하여 나무 부분을 씹습니다. 그들은 평평한 꼬리를 사용하여 집 꼭대기까지 수영합니다. 그런 다음 그들은 이빨과 턱을 사용하여 나무를 씹습니다.

Table 7 | Sample - Hamlet, Act 1, Scene 1. The Retro[Off] sample has correct syntax but is hallucinated, and ends with the repetition of one character (FRANCISCO Approach me not). The Retro[On] sample is the correct continuation of the original text, and is robust to formatting differences between our prompt and the retrieved data.

Prompt and sample of Retro[Off]	Prompt and sample of Retro[On], colored by LCP with $\operatorname{Ret}(C_{u-1})$ (LCP $=0,1,2,3,4, \geqslant 5$)	$[N_{u}^{1}, F_{u}^{1}]$, colored by LCP with $C_{u+1}$ (LCP $=0,1,2,3,4, \geqslant 5$)	$[N_{u}^{2}, F_{u}^{2}]$, colored by LCP with $C_{u+1}$ (LCP $=0,1,2,3,4, \geqslant 5$)
1막 1장. 엘시노어. 성 앞의 플랫폼. FRANCISCO가 그의 초소에 있다. 그에게 BERNARDO가 들어온다. BERNARDO 누구냐? FRANCISCO 아니, 내게 대답해: 서서, 자신을 밝혀라. BERNARDO	1막 1장. 엘시노어. 성 앞의 플랫폼. FRANCISCO가 그의 초소에 있다. 그에게 BERNARDO가 들어온다. BERNARDO 누구냐? FRANCISCO 아니, 내게 대답해: 서서, 자신을 밝혀라. BERNARDO	1막 1장. 엘시노어. 성 앞의 플랫폼. (FRANCISCO가 그의 초소에 있다. 그에게 BERNARDO가 들어온다.) BERNARDO 누구냐? FRANCISCO 아니, 내게 대답해: 서서, 자신을 밝혀라. BERNARDO 왕이여 만수무강하소서! FRANCISCO 베르나르도? BERNARDO 나다. FRANCISCO 자네는 제시간에 아주 조심스럽게 오는군. BERNARDO 지금 12시를 쳤네; 프란시스코, 자러 가게. FRANCISCO 이 안도에 감사하네: 몹시 춥고, 마음이 아프네. BERNARDO 자네는	> <TEXT>1막 1장. 엘시노어. 성 앞의 플랫폼. FRANCISCO가 그의 초소에 있다. 그에게 BERNARDO가 들어온다. BERNARDO 누구냐? FRANCISCO 아니, 내게 대답해: 서서, 자신을 밝혀라. BERNARDO 왕이여 만수무강하소서! FRANCISCO 베르나르도? BERNARDO 나다. FRANCISCO 자네는 제시간에 아주 조심스럽게 오는군. BERNARDO 지금 12시를 쳤네; 프란시스코, 자러 가게. FRANCISCO 이 안도에 감사하네: 몹시 춥고, 마음이 아프네.
누가 부르느냐? FRANCISCO 내가 여기 주인이다; 나, 프란시스코, 슬픔에 지쳐 있다. [ 혼잣말. BERNARDO 왕이여! FRANCISCO 나도 그것에 지쳤다. BERNARDO 나의 주군? FRANCISCO 내게 다가오지 마라. BERNARDO	왕이여 만수무강하소서! FRANCISCO 베르나르도? BERNARDO 나다. FRANCISCO 자네는 제시간에 아주 조심스럽게 오는군. BERNARDO 지금 12시를 쳤네; 프란시스코, 자러 가게. FRANCISCO 이 안도에 감사하네: 몹시 춥고, 마음이 아프네. B	왕이여 만수무강하소서! FRANCISCO 베르나르도? BERNARDO 나다. FRANCISCO 자네는 제시간에 아주 조심스럽게 오는군. BERNARDO 지금 12시를 쳤네; 프란시스코, 자러 가게. FRANCISCO 이 안도에 감사하네: 몹시 춥고, 마음이 아프네.</TEXT></DOC><DOC><DO CNO>romeo</DOCNO><TEXT>1막 서곡 두 가문, 둘 다 존엄함에 있어 동등하며, 아름다운 베로나에서, 우리의 무대가 펼쳐지고, 오랜 원한이 새로운 반란으로 터져 나오네,	왕이여 만수무강하소서! FRANCISCO 베르나르도? BERNARDO 나다. FRANCISCO 자네는 제시간에 아주 조심스럽게 오는군. BERNARDO 지금 12시를 쳤네: 프란시스코, 자러 가게. FRANCISCO 이 안도에 감사하네: 몹시 춥고, 마음이 아프네. BERNARDO 조용히 경계했는가? FRANCISCO 쥐 한 마리도 움직이지 않았다. BERNARDO 잘 자게. 호레이쇼와 마르셀루스를 만나거든, 내 경계의 라이벌2, 서두르라고 전하게. FRANCISCO 그들이 오는 소리가 들리는 것 같군. 멈춰! 누구냐? EN
프란시스코, 자네와 이야기하고 싶네. FRANCISCO 내게 다가오지 마라, 하지만 말해라. BERNARDO 자네의 손, 자네의 목소리 FRANCISCO 나는 자네의 말을 듣지 않을 것이다. BERNARDO 프란시스코, 자네의 손, 간청하네. FRANCISCO 내게 다가오지 마라. BERNARDO 프란시스코 FRANC	ERNARDO 조용히 경계했는가? FRANCISCO 쥐 한 마리도 움직이지 않았다. BERNARDO 잘 자게. 호레이쇼와 마르셀루스를 만나거든, 내 경계의 라이벌, 서두르라고 전하게. FRANCISCO 그들이 오는 소리가 들리는 것 같군. 멈춰! 누구냐? Enter	조용히 경계했는가? FRANCISCO 쥐 한 마리도 움직이지 않았다. BERNARDO 잘 자게. 호레이쇼와 마르셀루스를 만나거든, 내 경계의 라이벌, 서두르라고 전하게. FRANCISCO 그들이 오는 소리가 들리는 것 같군. 멈춰! 누구냐? (HORATIO와 MARCELLUS 등장) HORATIO 이 땅의 친구들. MARCELLUS 그리고 덴마크 왕의 신하들. FRANCISCO 잘 자게. MARCELLUS 오, 잘 가시오, 정직한 병사: 누가 자네를 교대했는가? FRANCISCO 베르나르도가 내 자리를 맡았네. 잘 자게. (퇴장)	ARDO 조용히 경계했는가? FRANCISCO 쥐 한 마리도 움직이지 않았다. BERNARDO 잘 자게. 호레이쇼와 마르셀루스를 만나거든, 내 경계의 라이벌2, 서두르라고 전하게. FRANCIS CO 그들이 오는 소리가 들리는 것 같군. 멈춰! 누구냐? HORATIO와 MARCELLUS 등장. HORATIO 이 땅의 친구들. MARCELLUS 그리고 덴마크 왕의 신하들. 3 FRANCISCO 잘 자게. M ARCELLUS 오, 잘 가시오, 정직한 병사: 누가 자네를 교대했는가? FRANCISCO 베르나르도가 내 자리를 맡았네. 잘 자게.
ISCO 내게 다가오지 마라. BERNARDO 편지가 있다 FRANCISCO 내게 다가오지 마라. BERNARDO 왕을 위한 것이다. FRANCISCO 내게 다가오지 마라. BERNARDO 그 안에 반역은 없다. FRANCISCO 내게 다가오지 마라. BERNARDO 나는	HORATIO와 MARCELLUS HORATIO 이 땅의 친구들. MARCELLUS 그리고 덴마크 왕의 신하들. FRANCISCO 잘 자게. MARCELLUS 오, 잘 가시오, 정직한 병사: 누가 자네를 교대했는가? FRANCISCO 베르나르도가 내 자리를 맡았네. 잘 자게.

A. Datasets

We provide a full description of MassiveText and of our extract of recent Wikipedia articles.

A.1. Full description of MassiveText

The full breakdown of MassiveText by source and language is given in Table 8. For a full description and analysis of MassiveText, see Rae et al. (2021).

Source	Language	Token count (M)	Documents	Sampling weight
Web	En	483,002	604,938,816	0.314
	Ru	103,954	93,004,882	0.033
	Es	95,762	126,893,286	0.033
	Zh	95,152	121,813,451	0.033
	Fr	59,450	76,612,205	0.033
	De	57,546	77,242,640	0.033
	Pt	44,561	62,524,362	0.033
	It	35,255	42,565,093	0.033
	Sw	2,246	1,971,234	0.0044
	Ur	631	455,429	0.0011
Books	En	3,423,740	20,472,632	0.25
News	En	236,918	397,852,713	0.1
Wikipedia	En	3,977	6,267,214	0.0285
	De	2,155	3,307,818	0.003
	Fr	1,783	2,310,040	0.003
	Ru	1,411	2,767,039	0.003
	Es	1,270	2,885,013	0.003
	It	1,071	2,014,291	0.003
	Zh	927	1,654,772	0.003
	Pt	614	1,423,335	0.003
	Ur	61	344,811	0.0001
	Sw	15	58,090	0.0004
Github	-	374,952	142,881,832	0.05
Total	-	5,026,463	1,792,260,998	1

Table 8 | MassiveText dataset.
The final column indicates the sampling weight of each dataset during training. For the retrieval database, the entire dataset is used, with the exception of books, for which we use a 4% sub-sample.

A.2. Wikipedia September 2021

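We create an evaluation dataset consisting of 23 Wikipedia articles that were added or heavily edited in September 2021, after we collected our training dataset. In addition, we filter out articles that rely too heavily on templated content, using the method detailed in §2.6 to identify articles with chunks that have a high overlap with their retrieved neighbours. Figure 10 shows that little overlap remains between our test dataset and the neighbours retrieved from the training dataset. The full list of included articles is given in Table 9.

Table 9 | Full set of articles included in our Wikipedia Sept. 2021 evaluation dataset.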

Megan Rohrer	Aakashavaani
Emma Raducanu	Junior Eurovision Song Contest 2021
Ambra Sabatini	Pavilion Bukit Jalil
WhyDonate	Blake Desjarlais
The Juggernaut (company)	2021 All-Ireland Senior Football Championship Final
Angela Diaz	Drift-barrier hypothesis
2020 Summer Paralympics	Venomics
2021 Afghan protests	Great Circle (novel)
Rexh Xhakli	Hurricane Ida
Julia Laskin	2021 Montenegrin episcopal enthronement protests
Cuijk	At War With the Silverfish
Ghoubet Wind Power Station

We first parse articles using mwparserfromhell. We then remove sections with the following titles: "references", "external links", "sources", "further reading", "see also", "citations", and "note". In the remaining sections, we remove Wikilinks and the following templates: "reflist", "notelist", "notelist-ua", "notelist-lr", "notelist-ur", and "notelist-lg". We also exclude objects with the "ref" or "table" tag and clean the remaining text with the strip_code function. Finally, we concatenate the title and all the sections, using "\n\n" as the delimiter.
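The section filtering and final concatenation step can be sketched in plain Python. This is only an illustration: it assumes the article has already been parsed and cleaned into (section title, text) pairs, it does not call mwparserfromhell, and the case-insensitive title matching and the function name are assumptions.

```python
# Section titles to drop, as listed in the text.
# Assumption: titles are compared case-insensitively after trimming whitespace.
DROPPED_SECTIONS = {
    "references", "external links", "sources",
    "further reading", "see also", "citations", "note",
}


def assemble_article(title, sections):
    """Join the title and the kept sections with a blank line between them.

    `sections` is a list of (section_title, cleaned_text) pairs that are
    assumed to have been parsed and stripped of markup beforehand.
    """
    kept = [
        text for name, text in sections
        if name.strip().lower() not in DROPPED_SECTIONS
    ]
    return "\n\n".join([title] + kept)


article = assemble_article(
    "Example",
    [("Overview", "intro text"), ("References", "[1] ..."), ("History", "body text")],
)
print(article)
```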

B. Details on the retrieval architecture

We give details on the Retro architecture and on the fine-tuning procedure we use for Retrofitting existing language models.

B.1. Retro architecture and implementation

B.1.1. Feed-forward architecture

As mentioned in the main text, the overall encoder-decoder architecture is fully feed-forward. We start with a sequence $X \in \mathbb{V}^{n}=\left(C_{u}\right)_{1 \leqslant u \leqslant l}$ and its pre-computed neighbours $\left(\operatorname{Ret}\left(C_{u}\right)\right)_{1 \leqslant u \leqslant l}$, and return logits in $\mathbb{R}^{n \times|\mathbb{V}|}$. Along with the Attn, Ffw, Cca and Ca operators introduced in the main text, we define the decoder embedding layer $\operatorname{Emb}: \mathbb{V}^{n} \rightarrow \mathbb{R}^{n \times d}$, the Split operator that extracts chunked intermediary embeddings, $\operatorname{Split}(H) \triangleq\left(H_{u}\right)_{1 \leqslant u \leqslant l} \in \mathbb{R}^{l \times m \times d}$, and the read-out layer $\operatorname{Read}: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times|\mathbb{V}|}$. We then describe the forward pass in Algorithm 1. In addition to the usual Transformer hyperparameters, the Retro architecture hyperparameters include the layer indices $P_{\text{enc}}$ and $P$ at which the encoder and the decoder perform cross-attention.

B.1.2. Relative positional encoding in the chunked cross-attention layer

The Ca operator uses relative positional logits that are computed from a specific relative distance separating data tokens from retrieval tokens. Indeed, we expect any retrieval neighbour $\operatorname{Ret}\left(C_{u}\right)^{j}$ and the chunk $C_{u}$ to be relatively well aligned, and assume that they start at the same position. Therefore, when computing $\operatorname{Ca}\left(H_{u}^{+}, E_{u}\right)$, we set the distance between data tokens $i \in[1, l]$ of chunk $C_{u}^{+}$ and retrieval tokens $i^{\prime} \in[1,2 l]$ of $\operatorname{Ret}\left(C_{u}\right)^{j}$ (here $l$ denotes the chunk length, written $m$ in §B.1.1) to be

$$d\left(i, i^{\prime}\right) \triangleq i-i^{\prime}+l-1 .$$

When computing the encoder cross-attention $\operatorname{Ca}\left(\operatorname{Ret}\left(C_{u}\right)^{j}, H_{u}\right)$, we set the distance between retrieval tokens $i^{\prime} \in[1,2 l]$ and data tokens $i \in[1, l]$ to be

$$d_{\mathrm{enc}}\left(i^{\prime}, i\right) \triangleq i^{\prime}-i .$$

Positional logits are obtained as a linear transform of a cosine vector computed from $\left(d\left(i, i^{\prime}\right)\right)_{i, i^{\prime}}$, and are added to the content logits, as in a regular self-attention block.
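As a small worked example of the two distance definitions in this subsection, the sketch below tabulates $d(i, i')$ and $d_{\mathrm{enc}}(i', i)$ for a toy chunk length of 4. The function names and the toy length are illustrative choices, and the subsequent cosine featurisation and linear transform are not shown.

```python
def decoder_distance(i, i_prime, l):
    """d(i, i') = i - i' + l - 1, with 1-based token indices."""
    return i - i_prime + l - 1


def encoder_distance(i_prime, i):
    """d_enc(i', i) = i' - i, with 1-based token indices."""
    return i_prime - i


l = 4  # toy chunk length; retrieval tokens then range over [1, 2l]
dist = [
    [decoder_distance(i, ip, l) for ip in range(1, 2 * l + 1)]
    for i in range(1, l + 1)
]
enc_dist = [
    [encoder_distance(ip, i) for i in range(1, l + 1)]
    for ip in range(1, 2 * l + 1)
]
for row in dist:
    print(row)
```

With these definitions the decoder distances range from $-l$ to $2l-2$ and the encoder distances from $1-l$ to $2l-1$.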

B.1.3. Chunked cross-attention implementation

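Our implementation of the Cca operator, shown in Listing 1, is based on a vectorised application of a cross-attention layer. For simplicity, we omit the multi-head attention logic and use the simplest Q, K, V attention. We also omit the relative positional logits computation described above.

By default, we use different embeddings for the encoder and the decoder. This allows the encoder to use $d_{\text{ENC}}=896$ dimensions while the decoder can use up to $d=8192$ dimensions. As shown in the ablation section, it is also possible to share the embeddings, with little difference in training.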

B.2. Baseline to Retro model fine-tuning

As shown in Figure 5, we found that we were able to take a pre-trained baseline Transformer and add Retro through fine-tuning. In all cases, we froze all the pre-trained weights and freshly initialised the retrieval encoder and cross-attention weights. In all cases, the cross-attention is added every third layer starting at layer six.
The learning rate for the three smaller models was set to $2 \times 10^{-4}$, and to half that value for the larger model. We also experimented with allowing the entire model to resume training during fine-tuning, but consistently found that the best approach was to freeze the pre-trained model. This kept the retrieval-off performance frozen, whereas it tended to degrade when all the weights were tuned.

C. Training details and hyperparameters

We provide the hyperparameters used in the various experiments of §4.

C.1. Language model pre-training

Table 10 shows the hyperparameters of the different models we train. In all cases, we train for 419,430,400,000 training tokens. The three smaller models are trained with a batch size of 256 and the largest model with a batch size of 1024. The minimum learning rate is set to 0.1 times the maximum learning rate, which is given in Table 10. The learning rate is decayed using a cosine cycle length that matches the total number of training tokens. All models are trained using AdamW (Loshchilov and Hutter, 2019) with a weight decay parameter of 0.1. The learning rate increases linearly from $10^{-7}$ to the maximum learning rate over the first 750 steps of training. All models use ZeRO (Rajbhandari et al., 2020) to shard the optimiser state. Additional infrastructure details can be found in Rae et al. (2021).
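The schedule described above (linear warm-up followed by cosine decay to 0.1 times the maximum) can be sketched as follows. This is a minimal illustration, not the training code: it assumes a sequence length of 2048 tokens to convert the token budget into steps, and assumes the cosine decay spans the steps that follow the warm-up; both choices are labelled in the code.

```python
import math


def learning_rate(step, max_lr, total_steps,
                  warmup_steps=750, start_lr=1e-7, min_ratio=0.1):
    """Linear warm-up from start_lr to max_lr, then cosine decay to min_ratio * max_lr."""
    if step < warmup_steps:
        return start_lr + (max_lr - start_lr) * step / warmup_steps
    min_lr = min_ratio * max_lr
    # Assumption: the cosine cycle covers the steps remaining after warm-up.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_TOKENS = 419_430_400_000
SEQ_LEN = 2048  # assumed sequence length, not stated in this section
steps_small = TOTAL_TOKENS // (256 * SEQ_LEN)    # batch size 256
steps_large = TOTAL_TOKENS // (1024 * SEQ_LEN)   # batch size 1024

for step in (0, 750, steps_small // 2, steps_small):
    print(step, learning_rate(step, 2e-4, steps_small))
```

Under the assumed sequence length, the token budget corresponds to 800,000 steps at batch size 256 and 200,000 steps at batch size 1024.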

Listing 1 | Jax implementation of the chunked cross attention, simplified.

import jax
import jax.numpy as jnp

n = 128  # Sequence length
m = 16  # Chunk length
r = 32  # Retrieval length
k = 4  # Number of neighbours
d = 16  # Embedding size
l = n // m  # Number of chunks

# Parameters
Q = jnp.zeros((d, d))
K = jnp.zeros((d, d))
V = jnp.zeros((d, d))


def relative_positional_encodings(attending_length, attended_length):
    # Classical relative positional encodings (body omitted in the listing).
    # A zero placeholder of the right shape keeps the listing runnable.
    return jnp.zeros((attending_length, attended_length))


def cross_attention(chunk, neighbour):
    # Single-head Q, K, V attention between one chunk and one neighbour.
    m, d = chunk.shape
    r, d = neighbour.shape
    queries = chunk @ Q
    keys = neighbour @ K
    logits = queries @ keys.T
    values = neighbour @ V
    return logits, values


def multi_neighbour_cross_attention(chunk, neighbours):
    m, d = chunk.shape
    k, r, d = neighbours.shape
    # Vectorise over the k neighbours of this chunk.
    logits, values = jnp.vectorize(cross_attention,
                                   signature='(m,d),(r,d)->(m,r),(r,d)')(
                                       chunk, neighbours)
    assert logits.shape == (k, m, r)
    assert values.shape == (k, r, d)
    logits += relative_positional_encodings(m, r)[None, :, :]
    # Flatten neighbours and retrieval positions into a single attended axis.
    logits = jnp.moveaxis(logits, 0, -1).reshape((m, r * k))
    values = jnp.moveaxis(values, 0, 1).reshape((r * k, d))
    return jax.nn.softmax(logits) @ values


def multi_chunk_cross_attention(observation, neighbours):
    # Shift by m - 1 so that each attending chunk ends with the first token
    # of the next chunk, then pad back to n tokens.
    attending_chunks = jnp.pad(observation[m - 1:],
                               ((0, m - 1), (0, 0)),
                               mode='constant').reshape(l, m, d)
    chunked_output = jnp.vectorize(multi_neighbour_cross_attention,
                                   signature='(m,d),(k,r,d)->(m,d)')(
                                       attending_chunks, neighbours)
    assert chunked_output.shape == (l, m, d)
    # Undo the shift: the first m - 1 tokens receive no retrieval output.
    output = jnp.pad(chunked_output.reshape(n, d),
                     ((m - 1, 0), (0, 0)),
                     mode='constant')[:n]
    return output


observation = jnp.zeros((n, d))  # Input
neighbours = jnp.zeros((l, k, r, d))
h = multi_chunk_cross_attention(observation, neighbours)
assert h.shape == (n, d)  # Output

Table 10 | Retro model hyperparameters, along with the size of the decoder.

Baseline	$d_{\text {model }}$	$d_{f f w}$	# heads	Head size	# layers	$P$	$P_{\text {ENC }}$	Max LR
247 M	896	3584	16	64	12	$[6,9,12]$	$[1]$	$2 \times 10^{-4}$
564 M	1536	6144	12	128	12	$[6,9,12]$	$[1]$	$2 \times 10^{-4}$
$1,574 \mathrm{M}$	2048	8192	16	128	24	$[9,12, \ldots, 24]$	$[1]$	$2 \times 10^{-4}$
$7,505 \mathrm{M}$	4096	16384	32	128	32	$[9,12, \ldots, 32]$	$[1]$	$1 \times 10^{-4}$

Table 11 | Hyperparameters for the Wikitext103 experiments presented in Table 4. We use the same learning rate schedule for the baseline and the Retro-fitting. For Retro-fitting, we reset the schedule, i.e. the schedule starts from step 0, not from step 35,000.

Model	Number of layers	18
	$d$	1024
	$d_{\text {FFW }}$	4096
	Key size	64
	Value size	64
	Number of heads	16
Training data	Dataset	Wikitext103train
	Sequence length	3072
	Batch size	128
	Tokenizer vocabulary size	128,000
Optimisation	optimiser	Adam
	Adam's $\beta_{1}$	0.9
	Adam's $\beta_{2}$	0.95
	Adam's $\varepsilon$	$10^{-8}$
	Dropout rate	0.25
Schedule	Learning rate start	$10^{-7}$
	Learning rate max	$2.5 \times 10^{-4}$
	Learning rate min	$2 \times 10^{-5}$
	Warmup steps	4,000
	Cosine cycle steps	100,000
Evaluation	Overlapping proportion	87.5 %

C.2. Wikitext103 comparison

We provide more details on our Wikitext103 results presented in §4.1 and Table 4. We train a baseline Transformer on the Wikitext103 training set with the hyperparameters given in Table 11. The learning rate ramps linearly from $1 \times 10^{-7}$ to $2.5 \times 10^{-4}$ over the first 4,000 steps, then decays to $2 \times 10^{-5}$ at 100,000 steps using a cosine schedule. The baseline checkpoint at step 35,000 has the lowest perplexity on Wikitext103 valid, 21.58, for an overlapping proportion of 75% (sliding-window evaluation that only uses the probabilities of tokens that have at least 75% of the sequence length as context, when available). We use this checkpoint for all our baseline and $k$NN-LM numbers reported in Table 4, except that Table 4 reports results for an overlapping proportion of 87.5%, which slightly lowers the perplexity of our baseline to 21.53 on Wikitext103 valid.
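One way to read the overlapping proportion is as a stride for sliding-window evaluation. The sketch below is an assumption about how such an evaluation could be organised rather than the authors' evaluation code: each window after the first scores only the tokens that have at least the required amount of context, and the function name and window layout are illustrative.

```python
def sliding_windows(num_tokens, seq_len, overlap):
    """Return the stride and a list of (start, end, score_from) windows.

    Tokens in [score_from, end) are scored in each window; earlier tokens in
    the window only serve as context. The first window scores all its tokens,
    since no more context is available for them.
    """
    stride = seq_len - int(seq_len * overlap)
    if stride <= 0:
        raise ValueError("overlap must leave at least one new token per window")
    windows = []
    start = 0
    while True:
        end = min(start + seq_len, num_tokens)
        score_from = start if start == 0 else start + seq_len - stride
        windows.append((start, end, score_from))
        if end == num_tokens:
            break
        start += stride
    return stride, windows


# Sequence length from Table 11 with the 87.5% overlapping proportion.
stride, _ = sliding_windows(num_tokens=10_000, seq_len=3072, overlap=0.875)
print(stride)  # new tokens scored per window

# Toy example that is easy to check by hand.
toy_stride, toy_windows = sliding_windows(num_tokens=10, seq_len=8, overlap=0.75)
print(toy_stride, toy_windows)
```

With a sequence length of 3072 and an overlapping proportion of 87.5%, each window after the first scores 384 new tokens, each with at least 2688 tokens of context.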

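We also use the 35,000-step baseline checkpoint to initialise the Retrofit, which otherwise uses the same optimiser and schedule hyperparameters but trains only the new retrieval weights, as explained in §4.2. Our best Retrofit checkpoint has a Wikitext103 valid perplexity of 18.46 when retrieving from Wikipedia. We use this Retro checkpoint in Table 4 for all other retrieval sets. The evaluation curves of our baseline and Retrofit are shown in Fig. 7 (left). In this particular case, because Wikitext103 is quite small, training a Retro model from scratch led to weaker results than the baseline, as we could not find an effective way to mitigate the increased over-fitting due to the additional weights of Retro.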

We also re-implement $k$NN-LM using the same tokenizer and dataset that we use for our baseline and Retrofitting experiments. $k$NN-LM has probabilities $p_{k\text{NN-LM}}=\lambda p_{\text{LM}}+(1-\lambda) p_{k\text{NN}}$ with $p_{k\text{NN}}\left(n_{k}\right) \propto \exp \left(-\alpha d_{k}\right)$. To tune $\lambda$ and $\alpha$, we begin with $\alpha=0.0012$, which corresponds to the inverse of the standard deviation of the norm of the embeddings that we use as keys and queries for $k$NN-LM. We find the best $\lambda=0.118$. We then find the best $\alpha=0.00785$ for that value of $\lambda$. Fig. 7 (centre and right) shows the perplexity of $k$NN-LM as a function of $\lambda$ and $\alpha$, respectively.

Figure 7 | Wikitext103 valid perplexities. Left: baseline and Retrofit (initialised from the baseline checkpoint at 35,000 steps) perplexities as a function of training steps. Centre and right: $k$NN-LM perplexity as a function of $\lambda$ (for $\alpha=0.0012$) and $\alpha$ (for $\lambda=0.12$), respectively.
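The interpolation can be illustrated with a toy example that follows the formula exactly as written in this section. The sketch assumes that the weights of neighbours sharing the same next token are summed before normalisation (as in the original $k$NN-LM formulation); the token names, distances and LM probabilities are made up.

```python
import math


def knn_distribution(neighbour_tokens, distances, alpha):
    """p_kNN(n_k) proportional to exp(-alpha * d_k), aggregated per next token."""
    weights = [math.exp(-alpha * dist) for dist in distances]
    total = sum(weights)
    p_knn = {}
    for token, weight in zip(neighbour_tokens, weights):
        # Assumption: neighbours voting for the same token pool their mass.
        p_knn[token] = p_knn.get(token, 0.0) + weight / total
    return p_knn


def interpolate(p_lm, p_knn, lam):
    """p = lam * p_LM + (1 - lam) * p_kNN, following the formula in the text."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_lm.get(t, 0.0) + (1.0 - lam) * p_knn.get(t, 0.0) for t in vocab}


# Toy values: three neighbours at equal distance, two of them voting for "a".
p_knn = knn_distribution(["a", "b", "a"], [10.0, 10.0, 10.0], alpha=0.00785)
p_lm = {"a": 0.5, "b": 0.5}
mixed = interpolate(p_lm, p_knn, lam=0.118)
print(mixed)
```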

C.3. Retrofitting baseline models experiments

Table 12 gives the hyperparameters used for Retrofitting the models on the MassiveText dataset.

Table 12 | Hyperparameters for the Retrofitting experiments.

Model	Layers with Retro-block ($P$)	Learning rate	Batch size
172 M	Every 3rd from 6	$2 \times 10^{-4} \rightarrow 2 \times 10^{-5}$	256
425 M	Every 3rd from 6	$2 \times 10^{-4} \rightarrow 2 \times 10^{-5}$	256
1.5 B	Every 3rd from 6	$2 \times 10^{-4} \rightarrow 2 \times 10^{-5}$	256
7.5 B	Every 3rd from 6	$1 \times 10^{-4} \rightarrow 1 \times 10^{-5}$	256

C.4. Question answering experiments

We fine-tune our 7.5B Retro model for 25,000 steps, using a batch size of 128 and a learning rate cosine-scheduled from $10^{-6}$ to $10^{-7}$, with a linear ramp of 750 steps. We use dropout in the decoder only, as this performs better than using dropout in both the encoder and the decoder. Each neighbour is formatted as title: {title}, source: {source}. We use the top 20 neighbours from DPR during training and evaluation.

Table 13 | Performance of Retro for different variants. Model performance on the C4 evaluation set, measured in bits-per-byte, for a 247M parameter model trained with a 157 billion token schedule.

Ablation group	Ablation	C4 eval bpb
Model	Retro	0.822
	No query conditioning	0.829
	No CA positional encodings	0.826
	Shared embeddings	0.823
	6-layer encoder	0.821
Retrieval values	Neighbours N	0.950
	Continuations F	0.895
	No retrieval	0.987
Training neighbours	1 training neighbours	0.858
	4 training neighbours	0.847
Cross attention position	CA top layer (1/12)	0.827
	CA mid layer (6/12)	0.823
	CA top layer (12/12)	0.831
	CA all layers	0.860
	CA every 3 from 1	0.823

D. Model ablations

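We validate important design choices by evaluating what happens when we do not include them. We use the 247M parameter model for all experiments, and train on a compressed 157 billion token schedule for all ablation experiments. We describe results relative to the default settings presented in the main text. We report the C4 evaluation loss at the end of the training process, and also compare how the evaluation loss decreases with training time, measured relative to the baseline training time. Results are reported in Figure 8 and Table 13.

Using relative encodings in cross-attention. Using relative encodings in cross-attention, as described in §B.1.2, provides a pure improvement both in the number of steps needed to reach a given performance and in computational efficiency.

Conditioning the encoder on the previous chunk. Conditioning the encoder on the intermediate embeddings of the previous chunk, as described in §B.1.1, provides a pure improvement both in the number of steps and in computational efficiency.

Sharing embeddings. Sharing embeddings across the encoder and the decoder does not affect performance. This motivates using separate embeddings, as it allows the encoder to be narrower than the decoder as we scale up the decoder size.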

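Figure 8 | Computational efficiency for different variants. We report the training curves plotting C4 evaluation bits-per-byte against time, measured relative to the time taken to train the baseline Retro model. Overall, our design choices are optimal in terms of computational efficiency.

Attending to neighbours and their continuation. Retro models are trained by attending, for a given chunk, to both the neighbours of the preceding chunk and their continuation in time. We measure how training and evaluating Retro models on neighbours only, or on their continuation only, affects performance. Overall, attending to neighbours only provides 22% of the performance improvement due to retrieval in Retro, while attending to the future of the neighbours gives 56% of it. Attending to both neighbours and their continuation is the most efficient choice, both in terms of final performance and training efficiency.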
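The 22% and 56% figures can be recovered from the C4 bits-per-byte values in Table 13, taking the gap between the no-retrieval model and the full Retro model as the total gain due to retrieval. The short calculation below uses only those table values; the dictionary and function names are illustrative.

```python
# C4 eval bits-per-byte values copied from Table 13.
C4_BPB = {
    "Retro": 0.822,
    "Neighbours N": 0.950,
    "Continuations F": 0.895,
    "No retrieval": 0.987,
}


def fraction_of_gain(variant):
    """Share of the full retrieval gain recovered by a given variant."""
    full_gain = C4_BPB["No retrieval"] - C4_BPB["Retro"]
    return (C4_BPB["No retrieval"] - C4_BPB[variant]) / full_gain


neighbours_only = fraction_of_gain("Neighbours N")
continuations_only = fraction_of_gain("Continuations F")
print(f"neighbours only: {neighbours_only:.0%}")
print(f"continuations only: {continuations_only:.0%}")
```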

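Training a deeper encoder. All models in the main text use a relatively small Retro encoder. We experimented with a 3× deeper encoder. It yields a tiny decrease in loss (0.15%) at the cost of a longer training time (+20%). Overall, using a shallow encoder is the best choice in terms of training efficiency.

Training with multiple neighbours. We measure the effect of training on a single retrieved neighbour, as well as training on 4 neighbours (Retro uses 2 neighbours in training). Training on a single neighbour results in a large decrease in performance, while training on 4 neighbours does not give a substantial performance improvement at the end of training but induces a large computational overhead. Overall, we find that using 2 neighbours is the best choice in terms of training efficiency. Furthermore, evaluation can be done with additional neighbours.

Frequency of cross-attention. We measure how the frequency of cross-attention in the decoder affects performance. Overall, attending only once at the top or the bottom layer is a bad choice, while attending once at a mid-depth layer is relatively sound. We choose to have cross-attention every 3 layers, as this provides a good trade-off between performance and run-time.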

E. Qualitative experiments

We illustrate the usage of Retro models by looking at the perplexity of evaluation samples and by producing samples autoregressively.

E.1. Inspecting neighbours and perplexities on evaluation data

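To build an intuition of what kind of information is leveraged by Retro models, we suggest having a closer look at a few evaluation documents and the corresponding retrieved data in Tables 16, 17, 18 and 19. In these tables, the 4 rows correspond to the first 4 chunks of the documents.
The left-most column shows the chunk $C_u$ from the document being evaluated, where each token is coloured by the difference in negative cross-entropy loss, $L_{\text{Retro[Off]}}-L_{\text{Retro}}$. Positive values, coloured in yellow, indicate that Retro performs better when it has access to the neighbours' data.
The second column also shows the evaluated chunk $C_u$, but each token $i$ is coloured by the length of the longest common prefix (LCP) with the preceding neighbours, i.e. the largest integer $j$ such that the prefix $(x_{i-j-1}, \ldots, x_i)$ also appears in $\operatorname{Ret}(C_{u-1})$.
Conversely, the third and fourth columns show the first two neighbours and their continuations, $[N_u^1, F_u^1]$ and $[N_u^2, F_u^2]$ respectively, coloured by LCP with the subsequent chunk $C_{u+1}$. LCP colouring helps to visually identify where the evaluated document overlaps with the retrieved data.
Note that the first chunk $C_1$ in the second column is not coloured, as it has no preceding neighbours with which to compute an LCP. Similarly, we do not show the neighbours of the fourth chunk, as these are not used to condition any of the first four chunks.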
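The per-token LCP colouring can be sketched as follows. This is an illustrative reading of the definition above, not the authors' colouring code: for each token it counts the length of the longest run of tokens ending at that token that also appears contiguously in the retrieved tokens, and then clips the value to the colour buckets 0 to 5 used in the table headers. The example sentences are made up.

```python
def lcp_lengths(chunk, retrieved):
    """For each position in `chunk`, length of the longest run of tokens ending
    there that also appears contiguously in `retrieved`."""
    lengths = []
    prev = [0] * (len(retrieved) + 1)
    for token in chunk:
        cur = [0] * (len(retrieved) + 1)
        for j, other in enumerate(retrieved, start=1):
            if token == other:
                # Extend the match that ended one token earlier in both sequences.
                cur[j] = prev[j - 1] + 1
        lengths.append(max(cur))
        prev = cur
    return lengths


def colour_bucket(length, max_bucket=5):
    """Map an LCP length to one of the buckets 0, 1, 2, 3, 4, >=5."""
    return min(length, max_bucket)


chunk = "beavers build dams in rivers".split()
retrieved = "the beavers build dams quickly".split()
lengths = lcp_lengths(chunk, retrieved)
buckets = [colour_bucket(x) for x in lengths]
print(list(zip(chunk, buckets)))
```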

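Our qualitative analysis exhibits two major behaviours.
Firstly, we observe that sometimes specific facts in $C_u$ can be extracted from the preceding neighbours $\operatorname{Ret}(C_{u-1})$, and that this can correspond to a significant reduction in loss from the Retro model for the corresponding tokens. Examples of such behaviour include the journal name Publishers Weekly in Table 16, the football team name Tyrone in Table 17, and the event dates 25 August to 6 September 2020 in Table 18. In these three examples, the evaluated data consists of recent Wikipedia articles written in September 2021, after we built our retrieval dataset (see §A.2). Yet, relevant information for predicting this new data was available in the pre-existing retrieval data, and the Retro model seems to be able to leverage it correctly.

On the other hand, we also observe that some of the evaluation data can partially leak into our training and retrieval data, despite the use of deduplication. Retro can dramatically exploit such leakage. Table 19 illustrates this behaviour: the chunks $C_2$ and $C_3$ largely overlap with $\operatorname{Ret}(C_1)$ and $\operatorname{Ret}(C_2)$ respectively, up to small formatting differences, which leads to a much lower Retro loss for all the corresponding tokens. Fig. 6 shows that it is possible to quantify how much of the Retro loss reduction is due to each of these two behaviours, by filtering out evaluation chunks that overlap with the retrieval set.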

E.2. Inspecting samples

We can follow the same procedure as above on samples generated using Retro models, in order to better understand how the retrieved data influenced the sampling. Examples of samples obtained using the 7.5B Retro model are shown in Tables 6, 7, 20 and 21.

E.3. Neighbour quantification

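To quantify a notion of distance between the source document and the retrieved chunks, we can measure the distance between source articles when retrieving only from Wikipedia. Consonni et al. (2019) provide a Wikipedia link dataset which, for each article, contains a list of neighbouring articles. Using this, we construct a directed graph and compute the distance from one page to another. In Figure 9 we compute the link distance between training sequences and their retrieved neighbours. We find that retrieved documents tend to come from articles that are quite close to the article containing the target. Furthermore, we find that on average the distance increases with rank, suggesting that our neighbours are both useful and that the order is reasonable. This provides confidence for our larger-scale experiments, where document distance is less well defined.

Figure 9 | Wikipedia link-distance between retrieved articles. For each sequence-chunk combination, we compute the link distance between the target and the top-5 neighbours using only Wikipedia. The rank shows the relative neighbour distance, where rank-1 is the first neighbour and rank-5 is the fifth. The different colours represent link distance. Because we do not retrieve from the same document, 1 is the smallest value. We find that the average distance between random articles with a path between them is over 5.0.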
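The link distance used in this section can be computed with a breadth-first search over the directed link graph. The sketch below is a generic illustration under the assumption that the graph is stored as an adjacency dictionary; the toy graph and the function name are hypothetical and do not reflect the format of the Consonni et al. (2019) dataset.

```python
from collections import deque


def link_distance(graph, source, target):
    """Shortest number of directed links from source to target, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Toy directed graph: A links to B, B links to C.
toy_graph = {"A": ["B"], "B": ["C"], "C": []}
print(link_distance(toy_graph, "A", "C"))
print(link_distance(toy_graph, "C", "A"))
```

Because the graph is directed, the distance from one page to another need not equal the distance in the opposite direction, and some pairs have no path at all.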

F. Complementary quantitative results

We report the tables corresponding to the quantitative figures of the main text, as well as additional filtered language model results on the Pile.

F.1. Main text datasets

We report the performance of Retro and baseline models, measured in bits-per-byte on the evaluation set, in Table 14.

F.2. The Pile

In Fig. 4, we compare Retro against Jurassic-1 (Lieber et al., 2021). The full bits-per-byte results are reported in Table 15.

F.3. Filtered results

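Distribution of leaked chunks in our main evaluation sets. We evaluate leakage between the evaluation sets and the training set by measuring the proportion of evaluation chunks with a certain overlap $r(C)$. We show histograms in Figure 10.

Table 14 | Full results for the main language modelling datasets. The first three sets of rows correspond to Figure 1, and the last set of rows to Figure 3.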

	Baseline				Retro [Off]				Retro[On]
	172 M	425 M	1.5 B	7.5B	172 M	425 M	1.5 B	7.5B	172 M	425 M	1.5 B	7.5B
C4 Eval bpb	0.98	0.92	0.84	0.78	0.98	0.92	0.84	0.78	0.82	0.77	0.71	0.66
C4 Eval bpb (900B)	-	-	-	-	-	-	-	-	0.88	0.83	0.76	0.71
C4 Eval bpb (360B)	-	-	-	-	-	-	-	-	0.92	0.87	0.80	0.74
C4 Eval bpb (180B)	-	-	-	-	-	-	-	-	0.94	0.89	0.81	0.75
C4 Eval bpb (90B)	-	-	-	-	-	-	-	-	0.95	0.89	0.82	0.76
C4 Eval bpb (36B)	-	-	-	-	-	-	-	-	0.96	0.90	0.83	0.77
C4 Eval bpb (18B)	-	-	-	-	-	-	-	-	0.96	0.91	0.83	0.77
C4 Eval bpb (9B)	-	-	-	-	-	-	-	-	0.96	0.91	0.83	0.77
C4 Eval bpb (4B)	-	-	-	-	-	-	-	-	0.97	0.91	0.84	0.78
C4 Eval bpb (2B)	-	-	-	-	-	-	-	-	0.97	0.91	0.84	0.78
C4 Eval bpb ( $k=1$ )	-	-	-	-	-	-	-	-	0.84	0.79	0.73	0.67
C4 Eval bpb ( $k=2$ )	-	-	-	-	-	-	-	-	0.83	0.78	0.72	0.67
C4 Eval bpb ( $k=3$ )	-	-	-	-	-	-	-	-	0.82	0.78	0.71	0.66
C4 Eval bpb ( $k=4$ )	-	-	-	-	-	-	-	-	0.82	0.77	0.71	0.66
C4 Eval bpb ( $k=5$ )	-	-	-	-	-	-	-	-	0.82	0.77	0.71	0.66
C4 Eval bpb ( $k=10$ )	-	-	-	-	-	-	-	-	0.82	0.77	0.71	0.66
C4 Eval bpb ( $k=20$ )	-	-	-	-	-	-	-	-	0.82	0.77	0.71	0.66
C4 Eval bpb ( $k=30$ )	-	-	-	-	-	-	-	-	0.82	0.77	0.71	0.65
C4 Eval bpb ( $k=40$ )	-	-	-	-	-	-	-	-	0.83	0.77	0.71	0.65
C4 Eval bpb ( $k=50$ )	-	-	-	-	-	-	-	-	0.83	0.78	0.71	0.66
C4 Eval bpb ( $k=60$ )	-	-	-	-	-	-	-	-	0.84	0.78	0.72	0.66
C4 Eval bpb ( $k=70$ )	-	-	-	-	-	-	-	-	0.84	0.79	0.72	0.66
C4 Eval bpb ( $k=80$ )	-	-	-	-	-	-	-	-	0.85	0.79	0.73	0.66
C4 Eval bpb ( $k=90$ )	-	-	-	-	-	-	-	-	0.85	0.79	0.73	0.66
C4 Eval bpb ( $k=100$ )	-	-	-	-	-	-	-	-	0.85	0.79	-	0.67
Lambada Accuracy	0.42	0.51	0.61	0.69	0.47	0.54	0.63	0.70	0.52	0.60	0.67	0.73
Curation Corpus bpb	0.69	0.63	0.56	0.52	0.68	0.64	0.57	0.51	0.66	0.61	0.55	0.50
Wikitext103 Perplexity	25.62	19.29	13.98	10.65	25.88	19.78	13.89	10.40	3.32	2.96	2.53	2.22
Wikipedia Sept. 2021 bpb	0.85	0.78	0.71	0.65	0.86	0.79	0.71	0.65	0.79	0.73	0.66	0.61

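As Figure 10 shows, there is some overlap between the training and evaluation sets of C4. Similarly, chunks of Wikitext103 appear in the training set, even though we removed the actual Wikitext103 evaluation documents from the training set. On the other hand, our Wikipedia September 21 dataset shows almost no leakage (the data being original documents that did not exist when the training data was created), and neither does Curation Corpus.

Filtered results on the Pile. We report the chunk overlap distribution and the filtered performance curves on the Pile in Figure 12 and Figure 11, respectively. The qualitative interpretation of the filtered curves is the same: Retro models exploit leakage more, but the performance improvement remains significant even on original chunks that have not been observed in the training set.

Table 15 | Full results on The Pile, measured in bits-per-byte. Jurassic-1 and GPT-3 numbers are taken from Lieber et al. (2021). Gopher numbers are taken from Rae et al. (2021).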

Subset	7B Baseline (Ours)	GPT-3	Jurassic-1	Gopher	7.5B Retro
arxiv	0.742	0.838	0.680	0.641	0.714
books3	0.792	0.802	0.835	0.706	0.653
dm_mathematics	1.177	1.371	1.037	1.135	1.164
freelaw	0.576	0.612	0.514	0.506	0.499
github	0.420	0.645	0.358	0.367	0.199
gutenberg_pg_19	0.803	1.163	0.890	0.652	0.400
hackernews	0.971	0.975	0.869	0.888	0.860
nih_exporter	0.650	0.612	0.590	0.590	0.635
opensubtitles	0.974	0.932	0.879	0.894	0.930
philpapers	0.760	0.723	0.742	0.682	0.699
pile_cc	0.771	0.698	0.669	0.688	0.626
pubmed_abstracts	0.639	0.625	0.587	0.578	0.542
pubmed_central	0.588	0.690	0.579	0.512	0.419
stackexchange	0.714	0.773	0.655	0.638	0.624
ubuntu_irc	1.200	0.946	0.857	1.081	1.178
uspto_backgrounds	0.603	0.566	0.537	0.545	0.583
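For readers comparing the bits-per-byte figures above with token-level losses, the standard conversion is sketched below. It assumes the loss is a mean cross-entropy in nats per token; the function name `bits_per_byte` and its arguments are illustrative and do not come from the paper.

```python
import math


def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a mean per-token loss in nats into bits per UTF-8 byte.

    total bits = mean_loss_nats * n_tokens / ln(2); dividing by the byte
    count of the same text makes the result independent of the tokenizer.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

Because the denominator counts bytes and not tokens, models with different tokenizers can be compared on the same footing, which is why this unit is used when setting results side by side with GPT-3, Jurassic-1 and Gopher.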

Figure 10 | Distribution of the overlap between evaluation and training chunks for C4, Curation Corpus, Wikitext103 and Wikipedia Sept. 2021.

Figure 11 | Filtered evaluation losses on the Pile, with baseline Transformers and Retro.

Figure 12 | Distribution of the overlap between evaluation and training chunks for the Pile evaluation sets.

Table 16 | Great Circle (novel), from Wikipedia September 21. The article is about a recent novel, and chunks $C_{3}$ and $C_{4}$ are specifically about its reception. The name Publishers Weekly, the journal that reviewed the novel, appears both in the neighbours $\left[N_{3}^{1}, F_{3}^{1}\right],\left[N_{3}^{2}, F_{3}^{2}\right]$ of chunk $C_{3}$ and in the subsequent chunk $C_{4}$, where the loss on those tokens is greatly reduced by Retro.

$C_{u}$ colored by loss difference $L_{\text{Retro[Off]}}-L_{\text{Retro}}$: $\leqslant -0.5$, $=0$, $\geqslant 0.5$	$C_{u}$ colored by LCP with $\operatorname{Ret}\left(C_{u-1}\right)$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{1}, F_{u}^{1}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{2}, F_{u}^{2}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$
Great Circle (novel)Great Circle i s a 2021 novel by Maggie Shipstead, published on May 4, 2021, by Alfred A. Knopf.The novel has been shortl isted for the 2021 Booker Prize.Sy nopsis The novel consists of two pa rallel narratives about two fictiona l women. One is	Great Circle (novel) Great Circle i s a 2021 novel by Maggie Shipstead, published on May 4, 2021, by Alfred A. Knopf. The novel has been shortl isted for the 2021 Booker Prize. Sy nopsis The novel consists of two pa rallel narratives about two fictiona l women. One is	The Dutch House (novel)The Dutch H ouse is a 2019 novel by Ann Patchett . It was published by Harper on Sept ember 24, 2019. It tells the story o f a brother and sister over the cour se of five decades.The novel was a finalist for the 2020 Pulitzer Priz e for Fiction.PlotThe Dutch House is a mansion located in Elkins Park , Pennsylvania, a suburb of Philadel phia. It was built in 1922 by the Va nHoebeek family, a husband and wife originally from the Netherlands who made their fortune in the tobacco in dustry. Cyril Conroy, a self-made re al estate mogul	The Dutch House (novel)The Dutch H ouse is a 2019 novel by Ann Patchett . It was published by Harper on Sept ember 24, 2019. It tells the story o f a brother and sister over the cour se of five decades. [2]The novel wa s a finalist for the 2020 Pulitzer P rize for Fiction.[3]Plot[edit]Th e Dutch House is a mansion located i n Elkins Park, Pennsylvania, a subur b of Philadelphia. It was built in 1 922 by the VanHoebeek family, a husb and and wife originally from the Net herlands who made their fortune in t he tobacco industry. Cyril Conroy, a self-
about the disappeared 20th-century aviator Marian Graves, while the oth er is about the struggling 21st-cent ury Hollywood actress Hadley Baxter, who is attempting to make a film ab out Marian. Hadley's narrative is to ld in the first-person, while Marian 's sections are told in the third-pe rson	about the disappeared 20th-century aviator Marian Graves, while the oth er is about the struggling 21st-cent ury Hollywood actress Hadley Baxter, who is attempting to make a film ab out Marian. Hadley's narrative is to Id in the first-person, while Marian 's sections are told in the third-pe rson	on becoming a filmmaker. She has fo und a subject for her film project, an obscure African American actress credited only as "the watermelon wom an" in old Hollywood films, and the subsequent film recounts her search for this woman even as it covers, in the manner of the earlier Dunyement aries, Dunye's friendships and her 1 ove life. InThe Watermelon Woman, D unye makes the film she set out to m ake in 1990 about African American w omen artists, a film that both inven ts an artistic predecessor with whom she can identify and also "finds" C heryl herself as the artist that she seeks. As Dunye identifies herself	based closely on her own youthful e xperiences. (She plans the film to $b$ e the first of two parts, the second dealing with the aftermath of the f irst's events.) Byrne plays a young film student named Julie (Hogg's ava tar), who starts her artistic educat ion with high hopes of making a movi e about a boy named Tony, living in working-class Sunderland, who adores his mother - "is almost obsessed wi th her," as eager Julie tells her ad visers. Her idealism is evident from the start.The advisers are skepti cal, and no wonder; Julie's family i s posh, with a comfortable country e state and
.Reception Great Circle received very favorable reviews, with a cumul ative "Rave" rating at the review ag gregator website Book Marks, based o n 22 book reviews from mainstream li terary critics. The novel debuted at number fourteen on The New York Tim es Hardcover fiction best-seller lis t for the week ending May	.Reception Great Circle received very favorable reviews, with a cumul ative "Rave" rating at the review ag gregator website Book Marks, based o n 22 book reviews from mainstream li terary critics. The novel debuted at number fourteen on The New York Tim es Hardcover fiction best-seller lis t for the week ending May	first edition hardcoverReception The novel debuted at number one on T he New York Times fiction best-selle r list. As of the week ending Februa ry 20,2021 , the novel has spent 38 weeks on the list.At the review ag gregator website Book Marks, which a ssigns individual ratings to book re views from mainstream literary criti cs, the novel received a cumulative "Rave" rating based on 38 reviews, w ith only one "mixed" review. Publish ers Weekly wrote, "Bennett renders h er characters and their struggles wi th great compassion, and explores th e complicated state of mind that Ste lla finds herself in while passing a s white." In its	The book also debuted at number tw o on The New York Times Hardcover No nfiction best-sellers list on July 2 8, 2019.[5] It spent eleven weeks on the list.[6]Reception[edit]At t he review aggregator website Book Ma rks, which assigns individual rating s to book reviews from mainstream li terary critics, the book received a cumulative "Positive" rating based o n 29 reviews: 12 "Rave" reviews, 6 " Positive" reviews, 9 "Mixed" reviews , and 2 "Pan" reviews.[7]Publisher s Weekly gave the book a mixed revie w, writing, "Unfortunately, all thre e
8, 2021. Critics praised the novel for sustaining its length and for Sh ipstead's research and intricate nov el structure for perfectly interweav ing the parallel narratives, despite the time and circumstances separati ng them.In its starred review, Pub lishers Weekly wrote, "Shipstead man ages to portray both Marian's and Ha dley's	8, 2021. Critics praised the novel for sustaining its length and for Sh ipstead's research and intricate nov el structure for perfectly interweav ing the parallel narratives, despite the time and circumstances separati ng them.In its starred review, Pub lishers Weekly wrote, "Shipstead man ages to portray both Marian's and Ha dley's
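The LCP colouring named in the column headers of these qualitative tables can be approximated with the sketch below. It rests on one assumed reading of LCP: for each token position, the length of the longest run of tokens ending at that position that also occurs contiguously in some neighbour, capped at 5 to match the $\geqslant 5$ bucket. The function name `lcp_lengths` and the `cap` parameter are hypothetical.

```python
from typing import List, Sequence


def lcp_lengths(chunk: Sequence[str], neighbours: List[Sequence[str]], cap: int = 5) -> List[int]:
    """Per-token match length against a set of neighbours (assumed LCP reading).

    out[i] is the length of the longest run ending at chunk[i] that appears
    contiguously in at least one neighbour, clipped to `cap`.
    """
    out = [0] * len(chunk)
    for neighbour in neighbours:
        prev = [0] * (len(neighbour) + 1)
        for i, token in enumerate(chunk):
            curr = [0] * (len(neighbour) + 1)
            for j, other in enumerate(neighbour, start=1):
                if token == other:
                    # Extend the run that ended at the previous token of both sequences.
                    curr[j] = prev[j - 1] + 1
                    out[i] = max(out[i], curr[j])
            prev = curr
    return [min(value, cap) for value in out]
```

Under this reading, tokens copied verbatim from a neighbour quickly reach the top bucket, while tokens with no counterpart in any neighbour stay at 0.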

Table 17 | All-Ireland Senior Football Championship Final, from Wikipedia September 21. The name of the team Tyrone appears both in the second neighbour $\left[N_{1}^{2}, F_{1}^{2}\right]$ of chunk $C_{1}$ and in the subsequent chunk $C_{2}$, where the loss on those tokens is greatly reduced by Retro.

$C_{u}$ colored by loss difference $L_{\text{Retro[Off]}}-L_{\text{Retro}}$: $\leqslant -0.5$, $=0$, $\geqslant 0.5$	$C_{u}$ colored by LCP with $\operatorname{Ret}\left(C_{u-1}\right)$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{1}, F_{u}^{1}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{2}, F_{u}^{2}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$
2021 All-Ireland Senior Football Cha mpionship FinalThe 2021 All-Irelan d Senior Football Championship Final was the 134th final of the All-Irel and Senior Football Championship and the culmination of the 2021 All-Ire land Senior Football Championship. T he match was played at Croke Park in Dublin on 11 September 2021. It was originally scheduled	2021 All-Ireland Senior Football Cha mpionship Final The 2021 All-Irelan d Senior Football Championship Final was the 134th final of the All-Irel and Senior Football Championship and the culmination of the 2021 All-Ire land Senior Football Championship. T he match was played at Croke Park in Dublin on 11 September 2021. It was originally scheduled	2018 All-Ireland Senior Football Cha mpionship FinalThe 2018 All-Irelan d Senior Football Championship Final was the 131st final of the All-Irel and Senior Football Championship and the culmination of the 2018 All-Ire land Senior Football Championship in Gaelic football. The match was play ed at Croke Park in Dublin on 2 Sept ember 2018. [3]It was the second ti me the teams had met in the final; D ublin won the first encounter in 199 5.The final was shown live in Irel and on RTÉ Two as part of The Sunday Game live programme, presented by M ichael Lyster from Croke Park, with studio analysis from Joe Brolly,	2018 All-Ireland Senior Football Cha mpionship FinalThe 2018 All-Irelan d Senior Football Championship Final was the 131st final of the All-Irel and Senior Football Championship and the culmination of the 2018 All-Ire land Senior Football Championship in Gaelic football. The match was play ed at Croke Park in Dublin on 2 Sept ember 2018.It was the second time the teams had met in the final; Dubl in won the first encounter in 1995. It was the third consecutive year th <br> at a team qualified under the system of second chances introduced in 200 1; Tyrone qualified despite defeat i n its provincial championship.Dubl in won the final by a margin of six points
for 28 August but had to be postpon ed by two weeks when the -semi-fina 1 was postponed due to a COVID-19 ou tbreak. Ulster champions Tyrone took on Connacht champions Mayo, in what was their first ever meeting in a f inal, winning their 4th title after a $2-14$ to $0-15$ win. Mayo lost	for 28 August but had to be postpon ed by two weeks when the - semi-fina 1 was postponed due to a COVID-19 ou tbreak. Ulster champions Tyrone took on Connacht champions Mayo, in what was their first ever meeting in a f inal, winning their 4th title after a $2-14$ to $0-15$ win. Mayo lost	game 23-23 after extra time, howeve r Ulster progressed under the compet ition rules as they scored three tir es in the match against Leinster's t wo. The semi-finals took place in mi d November and saw both the away tea ms win, as Ulster beat Glasgow and E dinburgh beat Connacht. The final wa s held on Saturday December 20 at Mu rrayfield Stadium and saw Ulster bea t Edinburgh 21-27 to win the Celtic Cup.2004-05 seasonThe format of the competition was changed for the second edition of the competition. T he competition was moved to April an d May to run after the conclusion of the Celtic League competition, with only eight	with a last-ditch plan of action play the Munster/Ulster Semi-Final o n March 16th, with the winners to pl ay Connacht in the following day's F inal.On March 16th then Munster ha d an easy win over Ulster ( $9-07$ to 0 -00) but thankfully for the Munster players, the pitch cut up so badly d uring the game, it was decided to po stpone the following day's hurling F inal (until Easter Sunday) with the football Final going ahead on its ow n on St. Patrick's Day.Less than a week later, on March 23rd, seven
their 11th consecutive final since 1989, losing 6 finals in 9 years, wi th this latest defeat on an identica l scoreline to 2020, when Mayo lost to Dublin.Background were aiming to win their fourth title and first All-Ireland since 1951. Since then, they had lost ten finals (1989, 1996 , 1997, 2004, 2006,	their 11th consecutive final since 1989, losing 6 finals in 9 years, wi th this latest defeat on an identica 1 scoreline to 2020, when Mayo lost to Dublin.Background were aiming to win their fourth title and first All-Ireland since 1951. Since then, they had lost ten finals (1989, 1996 1997, 2004, 2006,	$1-16$ to $0-15$ winners to qualify for their 10th league final in the past 13 years. They have won seven of $t$ heir previous league finals under Co dy since 2002, losing the other two to Waterford (2007) and Dublin (201 1 ).Despite the defeat there were some distinct positives from a Galwa y perspective- most notably the soli d displays of Daithí Burke at centre -back, Joseph Cooney at wing-back an d Ronan Burke at full-back. Colm Cal lanan continued his excellent form i n goal and also hit a stunning free from distance.Indeed it was not th e Galway defence that was the proble m	which Dublin won by 0-12 to 0-9.D ublin are going for an unprecedented fourth successive Championship win over Kerry. Prior to their current $\mathbf{r}$ un, which started with the 2011 AllIreland final, they had only managed two consecutive victories over them on two separate occasions - 1909 an d '24, 1976 and '77.The longest wi nning sequence in the rivalry was se t by Kerry between 1941 and 1975, wh en they won each of the six Champion ship meetings. Kerry went nine games unbeaten between 1978 and 2009, wit h four victories either side of a dr amatic draw at the quarter-final sta ge in Thurles in 2001.Sunday will mark their 11th
2012, 2013, 2016, 2017, 2020). app eared in their seventh final, winnin g on three occasions in 2003, 2005 a nd 2008. This final was the fifth to be contested by county teams from C onnacht and Ulster, the other finals were 1925 (Galway beat Cavan), 1943 (Roscommon beat Cavan), 1948 (Cavan beat	2012, 2013, 2016, 2017, 2020). app eared in their seventh final, winnin g on three occasions in 2003, 2005 a nd 2008. This final was the fifth to be contested by county teams from C onnacht and Ulster, the other finals were 1925 (Galway beat Cavan), 1943 (Roscommon beat Cavan), 1948 (Cavan beat

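Table 18 | 2020 Summer Paralympics, from Wikipedia September 21. The original dates of the event, 25 August to 6 September 2020, appear both in the neighbours $\left[N_{1}^{1}, F_{1}^{1}\right],\left[N_{1}^{2}, F_{1}^{2}\right]$ of chunk $C_{1}$ and in the subsequent chunk $C_{2}$, where the loss on those tokens is greatly reduced by Retro. Interestingly, in this case the neighbours were written at a time when the event had not yet been postponed.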

$C_{u}$ colored by loss difference $L_{\text{Retro[Off]}}-L_{\text{Retro}}$: $\leqslant -0.5$, $=0$, $\geqslant 0.5$	$C_{u}$ colored by LCP with $\operatorname{Ret}\left(C_{u-1}\right)$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{1}, F_{u}^{1}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{2}, F_{u}^{2}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$
2020 Summer ParalympicsThe , brand ed as the Tokyo 2020 Paralympic Game s , was an international multi-sport parasports event held from 24 August to 5 September 2021 in Tokyo, Japan . They were the 16th Summer Paralymp ic Games as organized by the Interna tional Paralympic Committee (IPC).	2020 Summer Paralympics The , brand ed as the Tokyo 2020 Paralympic Game s , was an international multi-sport parasports event held from 24 August to 5 September 2021 in Tokyo, Japan . They were the 16th Summer Paralymp ic Games as organized by the Interna tional Paralympic Committee (IPC).	pics Games.* The 2020 Summer Paraly mpics are an upcoming major internat ional multi-sport event for athletes with disabilities governed by the I nternational Paralympic Committee. S cheduled as the 16th Summer Paralymp ic Games, it is planned to be held i n Tokyo, Japan from 25 August to 6 S eptember 2020.3. 2019 BWF Para-Bad minton World Championships- The 20 19 BWF Para-Badminton World Champion ships was held from 20 to 25 August 2019 in Basel, Switzerland.- Men's event: Gold Medal: Pramod Bhagat in Singles SL3 Event and Pramod Bhagat and Manoj	2020 Summer ParalympicsThe are an upcoming major international multisport event for athletes with disabi lities governed by the International Paralympic Committee. Scheduled as the 16th Summer Paralympic Games, th ey are scheduled to be held in Tokyo , Japan between 24 August and 5 Sept ember 2021. Originally due to take p lace between 25 August and 6 Septemb er 2020. On 24 March 2020, the IOC a nd the Tokyo Organizing Committee of ficially announced that the 2020 Sum mer Olympics and 2020 Summer Paralym pics would be postponed to 2021, due to the COVID-19 pandemic, marking t he first time that the Paralympics h as been postponed. They will still b e publicly marketed as
Originally scheduled to take place $f$ rom 25 August to 6 September 2020, i n March 2020 both the 2020 Summer Ol ympics and Paralympics were postpone d by one year due to the COVID-19 pa ndemic, with the rescheduled Games $s$ till referred to as Tokyo 2020 for m arketing and branding purposes. As with the Olympics, the Games were la rgely held behind	Originally scheduled to take place f rom 25 August to 6 September 2020, i n March 2020 both the 2020 Summer Ol ympics and Paralympics were postpone d by one year due to the COVID-19 pa ndemic, with the rescheduled Games s till referred to as Tokyo 2020 for m arketing and branding purposes. As with the Olympics, the Games were la rgely held behind	once submitted.This process was u ndertaken following the postponement of the Tokyo 2020 Games due to the COVID-19 pandemic, with both the Oly mpics and Paralympics pushed back a year.Now, the Tokyo 2020 Olympics are scheduled for July 23 to August 8 while the Paralympics are due to f ollow from August 24 to September 5. The refund process is separate for ticketholders outside of Japan, who purchased tickets through authorise d ticket resellers (ATR).Each ATR has its own individual refund proced ure.Early figures from the refund process for the Tokyo 2020 Olympics stated that around 18 per cent	Olympiad, have now been postponed a nd rescheduled for 23 July to 8 Augu st 2021 in Tokyo, Japan. The Games were postponed in March 2020 as a re sult of the worldwide Covid-19 pande mic, although they will still keep t he name Tokyo 2020 for marketing and branding purposes. This will be th e first time the Olympic Games have been postponed rather than cancelled
closed doors with no outside specta tors due to a state of emergency in the Greater Tokyo Area and other pre fectures. The Games were the second Summer Paralympics hosted by Tokyo s ince 1964, and the third Paralympics held in Japan overall since the 199 8 Winter Paralympics in Nagano. Th e Games featured	closed doors with no outside specta tors due to a state of emergency in the Greater Tokyo Area and other pre fectures. The Games were the second Summer Paralympics hosted by Tokyo $s$ ince 1964, and the third Paralympics held in Japan overall since the 199 8 Winter Paralympics in Nagano. Th e Games featured	has been rescheduled to May 1-4 bec ause of travel restrictions under th e current state of emergency in Toky o and other 10 prefectures across Ja pan.The Tokyo 2020 organizing comm ittee announced that the first of 18 test events for the Olympic and Par alympic Games will involve wheelchai r rugby, which will be held in Yoyog i National Stadium from April 3 to 4 .The FINA Diving World Cup will fo llow from April 18 to 23 at the Toky o Aquatics Centre, which will also s erve as an Olympic qualifying event. The spread of the COVID-19 pandemi c has slowed down in Tokyo three wee ks after the Japanese capital entere d a state of emergency on	Olympic Games, when Tokyo became th e first city in Asia to host the Oly mpic and Paralympic Games, but unfor tunately strong winds made it an imp ossible task this time around.Memb ers of the Tokyo Organising Committe e of the Olympic and Paralympic Game s (Tokyo 2020), Tokyo Metropolitan G overnment officials, Tokyo 2020 Torc h Relay Official Ambassadors and rep resentatives from Miyagi Prefecture joined the arrival ceremony.FLAME OF RECOVERYThe Olympic flame will now be put on display at various loc ations in the Tohoku region, to high light the message of hope in the are as worst affected by the 2011 Great East Japan Earthqu
539 medal events in 22 sports, with badminton and taekwondo both making their Paralympic debut to replace f ootball 7-a-side and sailing. China topped the medal table for the fifth consecutive Paralympics, with 96 go lds and 207 total medals. Great Brit ain finished second for the ninth t ime,	539 medal events in 22 sports, with badminton and taekwondo both making their Paralympic debut to replace f ootball 7-a-side and sailing. China topped the medal table for the fifth consecutive Paralympics, with 96 go lds and 207 total medals. Great Brit ain finished second for the ninth t ime,

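Table 19 | Daniel Radcliffe, from Wikitext103Valid, with retrieval data from C4. Chunks $C_{2}$ and $C_{3}$ are almost entirely retrieved from the neighbours $\left[N_{1}, F_{1}\right]$ and $\left[N_{2}, F_{2}\right]$ respectively, which are identical up to formatting differences, and this dramatically reduces the loss on these tokens. This example illustrates that when training data leaks into evaluation sets despite deduplication, Retro models can directly exploit the leakage.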

$C_{u}$ colored by loss difference $L_{\text{Retro[Off]}}-L_{\text{Retro}}$: $\leqslant -0.5$, $=0$, $\geqslant 0.5$	$C_{u}$ colored by LCP with $\operatorname{Ret}\left(C_{u-1}\right)$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{1}, F_{u}^{1}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{2}, F_{u}^{2}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$
= Daniel Radcliffe = Daniel Jacob R adcliffe ( born 23 July 1989 ) is an English actor who rose to prominenc e as the title character in the Harr y Potter film series. He made his ac ting debut at 10 years of age in BBC One's 1999 television film David Co pperfield, followed by his cinematic debut	= Daniel Radcliffe = Daniel Jacob R adcliffe ( born 23 July 1989 ) is an English actor who rose to prominenc e as the title character in the Harr y Potter film series. He made his ac ting debut at 10 years of age in BBC One's 1999 television film David Co pperfield, followed by his cinematic debut	Daniel Jacob Radcliffe (born 23 July 1989) is an English actor who rose to prominence as the title character in the Harry Potter film series. He made his acting debut at 10 years o f age in BBC One's 1999 television f ilm David Copperfield, followed by h is cinematic debut in 2001's The Tai lor of Panama. At age 11, he was cas t as Harry Potter in the first Harry Potter film, and starred in the ser ies for 10 years until the release o f the eighth and final film in 2011. Radcliffe began to branch out to s tage acting in 2007, starring in the London and New York productions of Equus, and	Daniel Jacob Radcliffe (born 23 July 1989) is an English actor who rose to prominence as the title character in the Harry Potter film series. He made his acting debut at 10 years o f age in BBC One's 1999 television m ovie David Copperfield, followed by his film debut in 2001's The Tailor of Panama. At age 11, he was cast as Harry Potter in the first Harry Pot ter film, and starred in the series for 10 years until the release of th e eighth and final film in 2011. Rad cliffe began to branch out to stage acting in 2007, starring in the Lond on and New York productions of Equus , and in the
in 2001's The Tailor of Panama. At age 11, he was cast as Harry Potter in the first Harry Potter film, and starred in the series for 10 years u ntil the release of the eighth and f inal film in 2011.Radcliffe began to branch out to stage acting in 200 7, starring in the London and New	in 2001's The Tailor of Panama. At age 11, he was cast as Harry Potter in the first Harry Potter film, and starred in the series for 10 years u ntil the release of the eighth and f inal film in 2011.Radcliffe began to branch out to stage acting in 200 7, starring in the London and New	in 2001's The Tailor of Panama. At age 11, he was cast as Harry Potter in the first Harry Potter film, and starred in the series for 10 years u ntil the release of the eighth and f inal film in 2011.Radcliffe began to branch out to stage acting in 200 7, starring in the London and New Yo rk productions of Equus, and in the 2011 Broadway revival of the musical How to Succeed in Business Without Really Trying. He starred in the 201 2 horror film The Woman in Black, an d played beat poet Allen Ginsberg in the 2013 independent film Kill Your Darlings.He has contributed to ma ny charities	of Panama. At age 11, he was cast a s Harry Potter in the first Harry Po tter film, and starred in the series for 10 years until the release of t he eighth and final film in 2011.R adcliffe began to branch out to stag e acting in 2007, starring in the Lo ndon and New York productions of Equ us, and in the 2011 Broadway revival of the musical How to Succeed in Bu siness Without Really Trying. He sta rred in the 2012 horror film The Wom an in Black, and played beat poet Al len Ginsberg in the 2013 independent film Kill Your Darlings. He has con tributed to many charities, includin g Demelza House Children's
York productions of Equus, and in t he 2011 Broadway revival of the musi cal How to Succeed in Business Witho ut Really Trying. He starred in the 2012 horror film The Woman in Black, and played beat poet Allen Ginsberg in the 2013 independent film Kill Y our <unk>.He has contributed to ma ny charities,	York productions of Equus, and in t he 2011 Broadway revival of the musi cal How to Succeed in Business Witho ut Really Trying. He starred in the 2012 horror film The Woman in Black, and played beat poet Allen Ginsberg in the 2013 independent film Kill Y our <unk>.He has contributed to ma ny charities,	York productions of Equus, and in t he 2011 Broadway revival of the musi cal How to Succeed in Business Witho ut Really Trying. He starred in the 2012 horror film The Woman in Black, and played beat poet Allen Ginsberg in the 2013 independent film Kill Y our Darlings. He has contributed to many charities, including Demelza H ouse Children's Hospice and The Trev or Project. He also made public serv ice announcements for the latter. In 2011, he was awarded the Trevor Pro ject's "Hero Award."Sources disagr ee about Radcliffe's personal wealth ; he was reported to have earned $£ 1$ million for the first Harry Potter	in the 2011 Broadway revival of the musical How to Succeed in Business Without Really Trying. He starred in the 2012 horror film The Woman in B lack, and played beat poet Allen Gin sberg in the 2013 independent film K ill Your Darlings. He has contribute d to many charities, including Demel za House Children's Hospice and The Trevor Project. He also made public service announcements for the latter . In 2011, he was awarded the Trevor Project's "Hero Award."
including <unk> Hospice Care for Ch ildren, and The Trevor Project for s uicide prevention among LGBTQ youth, which gave him its Hero Award in 20 11. $==$ Early life $==$ Radcliffe w as born in West London, England. He is the only child of Alan George Rad cliffe, a literary agent, and	including <unk> Hospice Care for Ch ildren, and The Trevor Project for s uicide prevention among LGBTQ youth, which gave him its Hero Award in 20 11. $==$ Early life $==$ Radcliffe w as born in West London, England. He is the only child of Alan George Rad cliffe, a literary agent, and

Table 20 | Sample - Déclaration des droits de l'homme: Article premier. The Retro[Off] sample has correct syntax and is almost plausible, but it is hallucinated. The Retro[On] sample is correctly copied from the neighbour data and robustly re-formatted according to our prompt.

Prompt and sample of Retro[Off]	Prompt and sample of Retro[On] colored by LCP with $\operatorname{Ret}\left(C_{u-1}\right)$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{1}, F_{u}^{1}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$	$\left[N_{u}^{2}, F_{u}^{2}\right]$ colored by LCP with $C_{u+1}$: LCP $=0, 1, 2, 3, 4, \geqslant 5$
Article premier - Les hommes naissent et demeurent libres e t égaux en droits. Les distinc tions sociales ne peuvent être fondées que sur l'utilité com mune. Article 2. - Le but de toute association politique e st la conservation des droits naturels et	Article premier - Les hommes naissent et demeurent libres e t égaux en droits. Les distinc tions sociales ne peuvent être fondées que sur l'utilité com mune. Article 2. - Le but de toute association politique e st la conservation des droits naturels et	de l'homme et du citoyen.Ar ticle 1erLes hommes naissent et demeurent libres et égaux en droits. Les distinctions so ciales ne peuvent être fondées que sur l'utilité commune.A rticle 2Le but de toute asso ciation politique est la conse rvation des droits naturels et imprescriptibles de l'homme. Ces droits sont la liberté, la propriété, la sûreté, et la r ésistance à l'oppression.Art icle 3Le principe de toute s ouveraineté réside essentielle ment dans la nation. Nul corps , nul individu ne peut exercer d'autorité qui n'en	Les hommes naissent et demeur ent libres et égaux en droits. Les distinctions sociales ne peuvent être fondées que sur 1 'utilité commune.Art. 2. Le but de toute association po litique est la conservation de s droits naturels et imprescri ptibles de l'Homme. Ces droits sont la liberté, la propriété , la sûreté, et la résistance à l'oppression.Art. 3. -Le principe de toute Souverainet é réside essentiellement dans la Nation. Nul corps, nul indi vidu ne peut exercer d'autorit é qui n'en émane expressément. Art
imprescriptibles de l'homme, et par conséquent la garantie à chacun des droits suivants : Article 3. - La propriété est un droit inviolable et sacré. Toute personne a le droit de procéder à sa propre conservation. Article 4. - Le	imprescriptibles de l'homme. Ces droits sont la liberté, la propriété, la sûreté et la résistance à l'oppression.Article 3.- Le principe de toute souveraineté réside essentiellement dans la nation. Nul corps, nul individu ne peut exercer d'autorité qui n	criptibles del'homme. Ces droits sont la liberté, la propriété, la sûretéet la résistance à l'oppression.Article 3 - Le principe de toute souveraineté résideessentiellement dans la Nation. Nul corps, nul individu nepeut exercer d'autorité qui n'en émane expressément.Article 4 - La liberté consiste à pouvoir faire tout ce quine nuit pas à autrui : ainsi, l'exercice des droits naturelsde chaque homme n'a de bornes que celles qui assurent auxautres membres de la société la jouissance de	et imprescriptibles de l'homme. Ces droits sont la liberté, la propriété, la sûreté et la résistance à l'oppression.Article 3 - Le principe de toute souveraineté réside essentiellement dans la Nation. Nul corps, nul individu ne peut exercer d'autorité qui n'en émane expressément.Article 4 - La liberté consiste à pouvoir faire tout ce qui ne nuit pas à autrui : ainsi, l'exercice des droits naturels de chaque homme n'a de bornes que celles qui assurent aux autres membres de la société la jouissance de ces mêmes droits. Ces bornes
but de toute association est la défense des droits de l'homme et du citoyen. Tout citoyen a le droit de participer à la direction des affaires publiques. Article 5. - L'impunité n'a jamais été et ne sera jamais une fin en elle-même. L'imp	'en émane expressément.Article 4.- La liberté consiste à pouvoir faire tout ce qui ne nuit pas à autrui : ainsi, l'exercice des droits naturels de chaque homme n'a de bornes que celles qui assurent aux autres membres de la société la jouissance de ces mêmes	mane expressément.Article 4 La liberté consiste à pouvoir faire tout ce qui ne nuit pas à autrui : ainsi, l'exercice des droits naturels de chaque homme n'a de bornes que celles qui assurent aux autres membres de la société la jouissance de ces mêmes droits. Ces bornes ne peuvent être déterminées que par la loi.Article 5 La loi n'a le droit de défendre que les actions nuisibles à la société. Tout ce qui n'est pas défendu par la loi ne peut être empêché, et nul ne peut être contraint à faire ce qu'elle n	mane expressément.Article 4 - La liberté consiste à pouvoir faire tout ce qui ne nuit pas à autrui : ainsi, l'exercice des droits naturels de chaque homme n'a de bornes que celles qui assurent aux autres membres de la société la jouissance de ces mêmes droits. Ces bornes ne peuvent être déterminées que par la loi.Article 5 La loi n'a le droit de défendre que les actions nuisibles à la société. Tout ce qui n'est pas défendu par la loi ne peut être empêché, et nul ne peut être contraint à faire ce qu'elle n
unité, comme le despotisme, s'est toujours révélée être un instrument d'oppression. La tyrannie qui s'est élue juge suprême de la conscience des autres ne peut être jugée. La justice se trouve dans la conscience de chaque citoyen,	droits. Ces bornes ne peuvent être déterminées que par la loi.Article 5.- La loi n'a le droit de défendre que les actions nuisibles à la société. Tout ce qui n'est pas défendu par la loi ne peut être empêché, et nul ne peut être

Table 21 | Sample - Decimals of $\pi$. The Retro[Off] sample diverges quickly, two digits after the end of the prompt, whereas the Retro[On] sample correctly outputs a large number of $\pi$ digits, copied directly from the neighbour data.

Prompt and sample of Retro[Off]	Prompt and sample of Retro[On], colored by LCP with $\operatorname{Ret}(C_{u-1})$; LCP $= 0, 1, 2, 3, 4, \geqslant 5$	$[N_{u}^{1}, F_{u}^{1}]$, colored by LCP with $C_{u+1}$; LCP $= 0, 1, 2, 3, 4, \geqslant 5$	$[N_{u}^{2}, F_{u}^{2}]$, colored by LCP with $C_{u+1}$; LCP $= 0, 1, 2, 3, 4, \geqslant 5$
Pi = 3. 1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679	Pi = 3. 1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679	"14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593344612847564823378678​3	4626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669​2346
829404960289884960699858349065987324637996447894358628730709654015907959440698105992965913709537841269378359	821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284	6513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273724587006606315588174881520920962829254091715364	470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903​60
1069403720457088679512856123085790464612909276642155565460326956561287986366475705629495474158863353395765​7	75648233786783165271201909145648233786783165271201909148213393607260249141273724587006606315588174881520920962829254091715	23378678316527120190914564856692346034861045432664821339360726024914127372458700660631558817488152092096282925409171536436789259036001133053054882046652138414695194151160943305727036575959195309218611738193261179310511854807446237996274​95	16527120190914564856692346034861045432664821339360726024914127372458700660631558817488152092096282925409171536436789259036001133053054882046652138414695194151160943305727036575959195309218611738193261179310511854807446237996274956735188575272489
763455770886953798887691079661697456493974637634580155066635428546333764630635628427178853398045672434	3643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744
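
The LCP coloring used in the table header can be illustrated with a small sketch. It is not the authors' code: the function name `lcp_lengths` and the parameter `cap` are hypothetical, and the matching is done on characters, whereas the paper colors tokens. For each position of a sample, it reports the length of the longest run ending at that position that also appears verbatim in a neighbour, capped at 5 to mirror the legend LCP $= 0, 1, 2, 3, 4, \geqslant 5$.

```python
from typing import List


def lcp_lengths(sample: str, neighbours: List[str], cap: int = 5) -> List[int]:
    """Per-position match length against a set of neighbours (assumed definition).

    For each index i, return the length of the longest substring of `sample`
    ending at i that occurs in at least one neighbour, capped at `cap`.
    """
    lengths = []
    for i in range(len(sample)):
        best = 0
        # Grow the match backwards from position i. Once a length fails,
        # every longer candidate contains it as a suffix and must fail too.
        for n in range(1, min(cap, i + 1) + 1):
            piece = sample[i - n + 1 : i + 1]
            if any(piece in nb for nb in neighbours):
                best = n
            else:
                break
        lengths.append(best)
    return lengths


# Short excerpts taken from the table: the digits of pi that follow the
# prompt (neighbour), and the first 8 characters of each sample.
neighbour = "8214808651328230"
retro_on = "82148086"   # copies the neighbour
retro_off = "82940496"  # diverges after two digits

print(lcp_lengths(retro_on, [neighbour]))   # [1, 2, 3, 4, 5, 5, 5, 5]
print(lcp_lengths(retro_off, [neighbour]))  # [1, 2, 0, 1, 1, 1, 0, 1]
```

Under this assumed definition, the Retro[On] excerpt saturates at the cap after a few characters, while the Retro[Off] excerpt drops to short, incidental matches as soon as it diverges, which matches the behaviour described in the caption.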