MambaByte: Token-free Selective State Space Model (2024)

Junxiong Wang  Tushaar Gangavarapu  Jing Nathan Yan  Alexander M. Rush
Cornell University
{jw2544,tg352,jy858,arush}@cornell.edu

Abstract

Token-free language models learn directly from raw bytes and remove the bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences, and standard autoregressive Transformers scale poorly in such settings. We experiment with MambaByte, a token-free adaptation of the Mamba state space model, trained autoregressively on byte sequences. Our experiments indicate that MambaByte is more compute-efficient than other byte-level models. We also find MambaByte to be competitive with, and even able to outperform, state-of-the-art subword Transformers. Furthermore, owing to linear scaling in sequence length, MambaByte benefits from fast inference compared to Transformers. Our findings establish the viability of MambaByte in enabling token-free language modeling.

[Figure 1]

1 Introduction

When defining a language model, a base tokenization is typically used—either words (Bengio et al., 2000), subwords (Schuster and Nakajima, 2012; Sennrich et al., 2015; Wu et al., 2016; Wang et al., 2020), or characters (Gao et al., 2020a). Of these, subword tokenization has been the most popular choice, as it achieves a natural compromise between training efficiency and the ability to handle out-of-vocabulary words. However, several works (e.g., Xue et al. (2022)) have noted issues with subword tokenizers, such as a lack of robustness to typos, spelling and capitalization variations, and morphological changes.

Researchers (Clark et al., 2022; Xue et al., 2022; Yu et al., 2023) have employed an alternative approach of using byte sequences, i.e., an end-to-end mapping from raw data to predictions without any intermediate tokenization. Compared to subword models, byte-level language models can generalize more easily across orthographic and morphological variants. Of course, modeling text as bytes means that the resultant sequences are significantly longer than their subword counterparts. This pushes the efficiency issues upstream into the architecture itself.

Efficiency issues are particularly pronounced for autoregressive Transformers (Vaswani et al., 2017), which dominate language modeling (Brown et al., 2020; Touvron et al., 2023). Due to the quadratic cost of attention, Transformers scale poorly to long (byte) sequences (Brown et al., 2020; Zhang et al., 2022). Researchers have compressed the internal Transformer representation to work with long sequences, for instance, developing length-aware modeling approaches (Dai et al., 2020; Nawrot et al., 2022), where groups of tokens are merged within the intermediate layers. Recently, Yu et al. (2023) proposed the MegaByte Transformer, which uses compression in the form of fixed-size patches of bytes as a subword analog. As a result, MegaByte enables lower computational costs (although our experiments, summarized in Figure 1, indicate that patching can also lower model performance compared to the standard Transformer).

In this work, we introduce MambaByte, an efficient and simple byte-level language model. The model is a straightforward adaptation of the recently introduced Mamba architecture (Gu and Dao, 2023), a linear-time approach for sequence modeling. Mamba builds off the approach pioneered by state space models (SSMs) (Gu et al., 2021; Gupta et al., 2022; Gu et al., 2022; Smith et al., 2023) by introducing a selection mechanism that is more effective for discrete data such as text and providing an efficient GPU implementation. Our simple observation is that using Mamba (without modifications) relieves the main computational bottleneck in language modeling, thus allowing for the elimination of patching and effective use of the available compute budget.

Experiments compare MambaByte to Transformers, SSMs, and MegaByte (patching) architectures in a fixed parameter and fixed compute setting on several long-form text datasets. Figure 1 summarizes our main findings. Compared to byte-level Transformers, MambaByte achieves better performance faster and is significantly more compute-efficient. We also consider the viability of token-free language models compared to the existing state-of-the-art subword models. In this regard, we find MambaByte to be competitive with various subword baselines despite handling significantly longer sequences. Our results establish MambaByte as a strong alternative to existing tokenizer-dependent models and advocate its use to facilitate end-to-end learning.

2 Background: Selective state space sequence models

SSMs model the evolution of a hidden state across time through a first-order differential equation. Linear time-invariant SSMs (Gu et al., 2021; Gupta et al., 2022; Gu et al., 2022; Smith et al., 2023) have shown promising results in deep learning across several modalities. However, Gu and Dao (2023) have recently argued that the constant dynamics of these approaches lack input-dependent context selection in the hidden state, which may be necessary for tasks such as language modeling. To this end, they proposed Mamba, which defines the time-varying continuous state dynamics for a given input $x(t) \in \mathbb{R}$, hidden state $h(t) \in \mathbb{R}^n$, and output $y(t) \in \mathbb{R}$ at time $t$ as:

$$\frac{\mathrm{d}h(t)}{\mathrm{d}t} = \mathrm{A}\,h(t) + \mathrm{B}(t)\,x(t); \quad y(t) = \mathrm{C}(t)\,h(t), \tag{1}$$

which is parameterized by a diagonal time-invariant system matrix $\mathrm{A} \in \mathbb{R}^{n \times n}$ and time-dependent input and output matrices $\mathrm{B}(t) \in \mathbb{R}^{n \times 1}$ and $\mathrm{C}(t) \in \mathbb{R}^{1 \times n}$.

To model discrete-time sequences such as bytes, the continuous-time dynamics in (1) must be approximated through discretization. This results in a discrete-time hidden state recurrence with new matrices at each timestep, $\overline{\mathrm{A}}$, $\overline{\mathrm{B}}$, and $\overline{\mathrm{C}}$, such that

$$h[k] = \overline{\mathrm{A}}[k]\,h[k-1] + \overline{\mathrm{B}}[k]\,x[k]; \quad y[k] = \overline{\mathrm{C}}[k]\,h[k]. \tag{2}$$
[Figure 2]
[Figure 3]

Observe that (2) resembles a linear version of a recurrent neural network and can be applied in this recurrent form during language model generation. The discretization requires a timestep, $\Delta[k]$, for each input position, corresponding to treating $x[k] = x(t_k)$ for $t_k = \sum_{j=1}^{k} \Delta[j]$. The discrete-time matrices $\overline{\mathrm{A}}$, $\overline{\mathrm{B}}$, and $\overline{\mathrm{C}}$ can then be computed from $\Delta[k]$. Figure 2 illustrates how Mamba models discrete sequences.
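As a concrete illustration of the recurrence in (2), the sketch below applies it sequentially for a single input channel with a diagonal $\overline{\mathrm{A}}$; the array shapes are hypothetical and this is not the reference implementation.

```python
import numpy as np

def ssm_recurrence(A_bar, B_bar, C_bar, x):
    """Sequential form of Eq. (2): h[k] = A_bar[k] h[k-1] + B_bar[k] x[k]; y[k] = C_bar[k] h[k].

    A_bar, B_bar, C_bar: (L, n) arrays holding the per-step (diagonal) matrices;
    x: (L,) input sequence for one channel. Returns y: (L,).
    """
    L, n = A_bar.shape
    h = np.zeros(n)
    y = np.empty(L)
    for k in range(L):
        h = A_bar[k] * h + B_bar[k] * x[k]  # diagonal A_bar: elementwise product
        y[k] = C_bar[k] @ h
    return y
```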

In Mamba, the SSM terms are input-selective, i.e., $\mathrm{B}$, $\mathrm{C}$, and $\Delta$ are defined as functions of the input $x[k] \in \mathbb{R}^d$:

$$\Delta[k] = \operatorname{softplus}(W_{\Delta}(W_R\,x[k])); \quad \mathrm{B}(t_k) = W_{\mathrm{B}}\,x[k], \tag{3}$$

where $W_{\mathrm{B}} \in \mathbb{R}^{n \times d}$ ($\mathrm{C}$ is similarly defined), $W_{\Delta} \in \mathbb{R}^{d \times r}$ and $W_R \in \mathbb{R}^{r \times d}$ (for some $r \ll d$) are learnable weights, and softplus ensures positivity. Note that the SSM parameters $\mathrm{A}$, $\mathrm{B}$, and $\mathrm{C}$ are identical for each input dimension $d$, but the timesteps $\Delta$ are distinct; this results in a hidden state size of $n \times d$ per timestep $k$. (See Appendix D for specifics on discretization and selectivity.)
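A minimal sketch of the selection mechanism in (3); the weight names and shapes follow the notation above, but the function itself is only illustrative and not the released implementation.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_parameters(x_k, W_R, W_Delta, W_B, W_C):
    """Compute input-dependent SSM terms for one position, following Eq. (3).

    x_k: (d,) input at position k
    W_R: (r, d), W_Delta: (d, r)  -> low-rank projection defining the timesteps
    W_B, W_C: (n, d)              -> input/output matrices
    """
    delta_k = softplus(W_Delta @ (W_R @ x_k))  # (d,): one timestep per channel
    B_k = W_B @ x_k                            # (n,)
    C_k = W_C @ x_k                            # (n,)
    return delta_k, B_k, C_k
```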

Mamba embeds this SSM layer into a full neural network language model. Specifically, the model utilizes a stack of gated layers inspired by the previous gated SSM (Mehta et al., 2023). Figure 3 shows the Mamba architecture combining the SSM layer with a gated neural network.

Parallel scans for linear recurrences.

At training time, we have access to the entire sequence $x$, allowing us to compute the linear recurrence more efficiently. Smith et al. (2023) demonstrated the use of work-efficient parallel scans (Blelloch, 1990) for efficiently computing the sequential recurrence in linear SSMs. For Mamba, we first map the recurrence to a sequence of $L$ tuples, with $e_k = (A_k, b_k) \coloneqq (\overline{\mathrm{A}}[k], \overline{\mathrm{B}}[k]\,x[k])$, then define an associative operator $\bullet$ such that $e_j \bullet e_k = (A_k A_j, A_k b_j + b_k)$. Finally, we apply a parallel scan to compute the sequence $[(\overline{\mathrm{A}}[1], h[1]), (\overline{\mathrm{A}}[2]\overline{\mathrm{A}}[1], h[2]), \ldots]$. In general, this requires $\mathcal{O}(T_{\bullet} \log_2(L))$ time, using $L/2$ processors, where $T_{\bullet}$ is the cost of a matrix-matrix multiplication. Noting $\overline{\mathrm{A}}$ to be a diagonal matrix, the linear recurrence can be computed in parallel in $\mathcal{O}(n \log_2(L))$ time and $\mathcal{O}(nL)$ space. A parallel scan with a diagonal matrix is also efficient in operation, requiring $\mathcal{O}(nL)$ FLOPs.
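To make the associative operator concrete, the sketch below runs a Hillis–Steele-style inclusive scan over the tuples $e_k$, storing the diagonal $\overline{\mathrm{A}}$ matrices as vectors. It is a didactic NumPy sketch (each pass is elementwise and hence parallelizable), not the fused CUDA scan used in practice.

```python
import numpy as np

def combine(e1, e2):
    """Associative operator: (A_j, b_j) • (A_k, b_k) = (A_k A_j, A_k b_j + b_k).

    The A terms are diagonal and stored as vectors, so matrix products
    reduce to elementwise multiplication.
    """
    A1, b1 = e1
    A2, b2 = e2
    return A2 * A1, A2 * b1 + b2

def scan_recurrence(A_bar, Bx):
    """Inclusive scan over e_k = (A_bar[k], B_bar[k] x[k]); returns h with h[k] as in Eq. (2).

    A_bar, Bx: (L, n) arrays. The loop runs ceil(log2(L)) passes, and every pass
    is a purely elementwise update across all positions.
    """
    A, b = A_bar.copy(), Bx.copy()
    L = A.shape[0]
    offset = 1
    while offset < L:
        # prepend identity elements (A=1, b=0) for positions without a left neighbor
        A_left = np.concatenate([np.ones_like(A[:offset]), A[:-offset]])
        b_left = np.concatenate([np.zeros_like(b[:offset]), b[:-offset]])
        A, b = combine((A_left, b_left), (A, b))
        offset *= 2
    return b  # b[k] now holds the hidden state h[k]
```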

3 Experimental setup

Table 1: MegaByte and MambaByte model pairs and their FLOPs per train byte (MegaByte : MambaByte).

Experiment      Models                                    FLOPs per train byte
Medium-scale    MegaByte-758M+262M vs. MambaByte-353M     1.02 : 1
Large-scale     MegaByte-1.3B+350M vs. MambaByte-972M     0.54 : 1
                MegaByte-1.3B+218M vs. MambaByte-972M     0.40 : 1

Our experiments compare MambaByte to other byte-level Transformers and SSMs. All our models employ the same training recipes (see Appendix C for details). We utilize a set of diverse long-form text datasets: PG19 (Rae et al., 2020), Stories (Trinh and Le, 2018), Books (Gao et al., 2020b), ArXiv (Gao et al., 2020b), and Code (Gao et al., 2020b). Dataset sizes and average document lengths are included in Appendix A.

Performance comparison across architectures requires care. To this end, we consider two settings: compute-matched and parameter-matched. This setup is necessary as the default MegaByte Transformer employs a global module that works with 8×-patched representations of the input, thus using 8× fewer feed-forward FLOPs per byte than a raw Transformer, while having significantly more parameters. Table 1 shows the MegaByte and MambaByte model sizes employed in our experiments. The (forward pass) FLOPs computation for various model architectures and the associated hyperparameters employed are detailed in Appendix B.

All MambaByte models were trained using the open-source Mamba code base (https://github.com/state-spaces/mamba). At training, we shuffle the documents and use contiguous sequences of 8,192 bytes (one per document), starting from a random position. We enable mixed-precision training using BF16 for training efficiency at scale. The optimizer, learning rate scheduler, and other training details are specified in Appendix C.

Press et al. (2021) proposed using a sliding window to trade off speed for performance during inference. Following this, we employ a sliding window (with a stride of $L_{\text{ctx}}/2$ for a byte sequence of length $L_{\text{ctx}}$) when comparing with the state-of-the-art subword models in Table 3.
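A sketch of such a sliding-window evaluation; `score_bits` is a hypothetical helper returning per-byte losses (in bits) under a trained model, and the windowing details here are only an assumption of how this kind of evaluation is typically set up, not the exact procedure used in our experiments.

```python
def sliding_window_bpb(seq, score_bits, L_ctx=8192):
    """Evaluate a long byte sequence with windows of length L_ctx and stride L_ctx/2.

    score_bits(window) is assumed to return a list of per-byte losses (in bits),
    one per position of `window`. Only the second half of each window (positions
    with at least L_ctx/2 bytes of context) contributes to the total, except for
    the first window, which is scored in full. Trailing bytes shorter than a
    stride are ignored in this sketch.
    """
    stride = L_ctx // 2
    total_bits, total_bytes = 0.0, 0
    for start in range(0, max(len(seq) - L_ctx, 0) + 1, stride):
        window = seq[start:start + L_ctx]
        losses = score_bits(window)
        keep = losses if start == 0 else losses[stride:]
        total_bits += sum(keep)
        total_bytes += len(keep)
    return total_bits / total_bytes
```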

4 Results

Table 2 shows the bits per byte ($\operatorname{BPB}$) across each dataset. For this experiment, the MegaByte-758M+262M and MambaByte models use the same number of FLOPs per byte (see Table 1). We observe MambaByte to outperform MegaByte consistently across all datasets. Furthermore, we note that we could not train MambaByte for the full 80B bytes due to monetary constraints, but MambaByte outperforms MegaByte with 0.63× less compute and training data. Additionally, MambaByte-353M also outperforms the byte-level Transformer and PerceiverAR.

Table 2: Test BPB (↓) of byte-level models across datasets.

Byte-level model                    Context   Bytes trained   PG19    Stories   Books   ArXiv   Code
Transformer-320M                    1,024     80B             1.057   1.064     1.097   0.816   0.575
PerceiverAR-248M                    8,192     80B             1.104   1.070     1.104   0.791   0.546
MegaByte-758M+262M (patch: 8)       8,192     80B             1.000   0.978     1.007   0.678   0.411
MambaByte-353M                      8,192     30B*            0.930   0.908     0.966   0.663   0.396

How is MambaByte performing better than a much larger model in so few training steps? Figure 1 further explores this relationship by looking at models with the same number of parameters. The graphs indicate that for MegaByte models of the same parameter size, models with less input patching perform better, but when compute-normalized, they perform similarly. In fact, a full-length Transformer, while slow in an absolute sense, also performs similarly to MegaByte when compute-normalized. In contrast, switching to the Mamba architecture significantly improves both the compute usage and the model performance.

Table 3: Observed word-level perplexities on PG19.

(#Layers) Model                                       Vocab   Effective context (bytes)³   Effective bytes trained³   Val PPL ↓   Test PPL ↓

Subword
(36) Transformer-XL (Rae et al., 2020)                32K     2,048 / 4,096                400B                       45.5        36.3
(36) Compressive (Rae et al., 2020)                   32K     2,048 / 2×2,048              400B                       43.4        33.6
(22) Routing-490M⁴ (Roy et al., 2021)                 82K     32,768                       330B                       -           33.2
(60) PerceiverAR-974.6M (Hawthorne et al., 2022)      32K     8,192                        1.68T                      45.9        28.9
(24) Block-Recurrent-1.3B (Hutchins et al., 2022)     32K     4,096 / recurrence           -                          -           26.5

Byte
(-) Transformer-320M (Yu et al., 2023)                256     8,192                        400B                       81.6        69.4
(-) PerceiverAR-248M (Yu et al., 2023)                256     8,192                        400B                       119.1       88.8
(24+24) MegaByte-1.3B+350M (Yu et al., 2023)          256     8,192 / patch: 8             400B                       42.8        36.4
(48) MambaByte-972M                                   256     8,192⁵                       150B*                      39.5        33.0

³ For subword models, we use one subword as being equivalent to four bytes.
⁴ The number of parameters is noted from Hutchins et al. (2022).
⁵ For inference, we use a context of 32,768 bytes.

Table 4: Text generation speed on an A100 80GB PCIe GPU.

Model                               Bytes trained   Context   Test BPB ↓   Generation time (s) ↓
Transformer-350M                    -               1,024     1.064        132
MegaByte-1.3B+218M (patch: 8)       -               8,192     0.991        93
MegaByte-1.3B+218M (patch: 8)⁶      -               8,192     -            265
MambaByte-972M                      75B*            8,192     0.883        29
  w/ sliding window (2× bytes)                                0.863        58
MambaByte-1.6B                      -               8,192     -            36

⁶ Open-source implementation: https://github.com/lucidrains/MEGABYTE-pytorch.

Following these findings, Table 3 compares a larger version of these models on the PG19 dataset. For this experiment, we compare MambaByte-972M with MegaByte-1.3B+350M and other byte-level models, as well as several state-of-the-art subword models. (The conversion from $\operatorname{BPB}$ to perplexity ($\operatorname{PPL}$) is detailed in Appendix E.) We find that MambaByte-972M, even when trained for only 150B bytes, outperforms all the byte-level models and achieves competitive performance with subword models.

Text generation.

Autoregressive inference in Transformer models requires caching the entire context, which can significantly affect the generation speed. MambaByte does not suffer from this bottleneck as it maintains a single hidden state per layer that evolves with time, enabling constant time per generation step. Table 4 compares the text generation speeds of MambaByte-972M and MambaByte-1.6B with MegaByte-1.3B+350M on an A100 80GB PCIe GPU. While MegaByte significantly reduces the generation cost through patching, we observe MambaByte to be 2.6× faster in a parameter-matched setting due to its use of recurrent generation. Appendix F includes more information about the generation process.
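The constant-memory generation loop can be sketched as follows; `step` is a hypothetical function that advances all layers' recurrent states by one byte and returns next-byte logits, and the sampling details are illustrative rather than the released implementation.

```python
import numpy as np

def generate(step, prompt, n_new, temperature=1.0, seed=0):
    """Byte-level recurrent generation: the state is updated in place, so each
    step costs O(1) in sequence length (no growing key-value cache).

    step(state, byte) -> (state, logits), with logits over the 256 byte values.
    Assumes a non-empty prompt.
    """
    rng = np.random.default_rng(seed)
    state, logits = None, None
    for b in prompt:                      # ingest the prompt byte by byte
        state, logits = step(state, b)
    out = bytearray(prompt)
    for _ in range(n_new):
        scores = (logits - np.max(logits)) / temperature
        probs = np.exp(scores)
        probs /= probs.sum()
        b = int(rng.choice(256, p=probs))  # sample the next byte
        out.append(b)
        state, logits = step(state, b)
    return bytes(out)
```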

5 Conclusion

We introduce MambaByte, a token-free SSM for modeling long byte sequences. MambaByte outperforms other byte-level models over several datasets and shows competitive results with subword Transformers, thus serving as a promising tokenization alternative. SSMs also enable significantly faster text generation due to their recurrent nature, making byte models practical. Our findings establish the possibility of token-free language modeling in future large models.

References

  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 2017.
  • Su et al. [2021] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv e-prints, 2021.
  • Yu et al. [2023] Lili Yu, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis. MegaByte: Predicting Million-byte Sequences with Multiscale Transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=JTmO2V9Xpz.
  • Mehta et al. [2023] Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long Range Language Modeling via Gated State Spaces. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=5MkYIYCbva.
  • Bengio et al. [2000] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A Neural Probabilistic Language Model. Advances in Neural Information Processing Systems, 13, 2000.
  • Schuster and Nakajima [2012] Mike Schuster and Kaisuke Nakajima. Japanese and Korean Voice Search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE, 2012.
  • Sennrich et al. [2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909, 2015.
  • Wu et al. [2016] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144, 2016.
  • Wang et al. [2020] Changhan Wang, Kyunghyun Cho, and Jiatao Gu. Neural Machine Translation with Byte-Level Subwords. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9154–9160, 2020.
  • Gao et al. [2020a] Yingqiang Gao, Nikola I. Nikolov, Yuhuang Hu, and Richard H. R. Hahnloser. Character-Level Translation with Self-attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1591–1604, 2020a.
  • Xue et al. [2022] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022.
  • Clark et al. [2022] Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Transactions of the Association for Computational Linguistics, 10:73–91, 2022.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023.
  • Zhang et al. [2022] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068, 2022.
  • Dai et al. [2020] Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. Advances in Neural Information Processing Systems, 33:4271–4282, 2020.
  • Nawrot et al. [2022] Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser, Yuhuai Wu, Christian Szegedy, and Henryk Michalewski. Hierarchical Transformers Are More Efficient Language Models. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1559–1571, 2022.
  • Gu and Dao [2023] Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752, 2023.
  • Gu et al. [2021] Albert Gu, Karan Goel, and Christopher Ré. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv preprint arXiv:2111.00396, 2021.
  • Gupta et al. [2022] Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal State Spaces are as Effective as Structured State Spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022.
  • Gu et al. [2022] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the Parameterization and Initialization of Diagonal State Space Models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.
  • Smith et al. [2023] Jimmy T. H. Smith, Andrew Warrington, and Scott Linderman. Simplified State Space Layers for Sequence Modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Ai8Hw3AXqks.
  • Ramachandran et al. [2017] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for Activation Functions. arXiv preprint arXiv:1710.05941, 2017.
  • Blelloch [1990] Guy E. Blelloch. Prefix Sums and Their Applications. Technical Report CMU-CS-90-190, November 1990. URL https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf.
  • Rae et al. [2020] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive Transformers for Long-Range Sequence Modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH.
  • Trinh and Le [2018] Trieu H. Trinh and Quoc V. Le. A Simple Method for Commonsense Reasoning. arXiv preprint arXiv:1806.02847, 2018.
  • Gao et al. [2020b] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027, 2020b.
  • Press et al. [2021] Ofir Press, Noah A. Smith, and Mike Lewis. Shortformer: Better Language Modeling using Shorter Inputs. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5493–5505, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.427. URL https://aclanthology.org/2021.acl-long.427.
  • Roy et al. [2021] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient Content-Based Sparse Attention with Routing Transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021. doi: 10.1162/tacl_a_00353. URL https://aclanthology.org/2021.tacl-1.4.
  • Hawthorne et al. [2022] Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, Joao Carreira, and Jesse Engel. General-purpose, long-context autoregressive modeling with Perceiver AR. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 8535–8558. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/hawthorne22a.html.
  • Hutchins et al. [2022] DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. Block-Recurrent Transformers. Advances in Neural Information Processing Systems, 35:33248–33261, 2022.
  • Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
  • Orvieto et al. [2023] Antonio Orvieto, Samuel L. Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. arXiv preprint arXiv:2303.06349, 2023.
  • Gu et al. [2023] Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Re. How to Train your HiPPO: State Space Models with Generalized Orthogonal Basis Projections. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=klK17OQ3KB.
  • Nguyen et al. [2022] Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4ND: Modeling Images and Videos as Multidimensional Signals with State Spaces. Advances in Neural Information Processing Systems, 35:2846–2861, 2022.
  • Holtzman et al. [2020] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH.

Appendix A Dataset specifics

Table 5: Dataset statistics.

Dataset    Total bytes   Total docs    Bytes/doc
PG19       11.74GB       28,752        4,082,210
Stories    34.18GB       948,247       36,045
Books      108.38GB      196,640       551,179
ArXiv      60.27GB       1,264,405     47,665
Code       677GB         56,626,342    11,958

We benchmark our results on various long-form text datasets. The PG19 dataset [Rae et al., 2020] is an extensive collection of full-length English books (written before 1919) from the Project Gutenberg online library. The PG19 dataset is ideal to test for long-distance context modeling [Gao et al., 2020b]. The Stories dataset [Trinh and Le, 2018] is a subset of the CommonCrawl data used for commonsense reasoning and language modeling. The Books dataset [Gao et al., 2020b] is another collection of English books. The ArXiv dataset [Gao et al., 2020b] comprises technical publications in LaTeX from the arXiv online archive. Finally, the Code dataset [Gao et al., 2020b] is a large dataset of publicly available open-source code (under Apache, MIT, or BSD licenses). Dataset statistics are tabulated in Table 5.

For the PG19 dataset, we employ the train, validation, and test data splits as indicated by Rae et al. [2020]. For the Stories, Books, ArXiv, and Code datasets, we randomly sample 40M consecutive bytes for testing and use the rest to train MambaByte.

Appendix B Compute-constrained modeling

[Figure 4]

As noted earlier, we evaluate and benchmark MambaByte in a compute-controlled setting. To this end, we estimate the FLOPs per byte incurred by various byte-level model architectures. We parameterize the architectures using the following hyperparameters: the number of layers $n$ ($n_g/n_l$ for global/local layers), the dimension $d$ ($d_g/d_l$) of the (global/local) residual stream, the expansion factor $e$ of linear layers, the patch size $p$ in MegaByte, the state dimension $n_{\text{state}}$ in SSMs, the 1D convolution kernel size $k$, and the low-rank projection dimension $r$ in Mamba. We also include $L_{\text{ctx}}$ bytes in the input context. Detailed component-wise compute counts for the forward pass are included in Table 6.

Table 6: Component-wise forward-pass FLOPs per byte.

Model                                Component                                                      FLOPs per byte
Transformer [Vaswani et al., 2017]   Multi-head attention                                           $2n(4d^2 + 2L_{\text{ctx}}d)$
                                     Pointwise feed-forward                                         $2n(2ed^2)$
MegaByte [Yu et al., 2023]           Embedding projection                                           $2d_g^2$
                                     Global transformer model                                       $2n_g(4d_g^2 + 2d_gL_{\text{ctx}}/p + 2ed_g^2)/p$
                                     Global-to-local projection                                     $2d_gd_l$
                                     Local transformer model                                        $2n_l(4d_l^2 + 2pd_l + 2ed_l^2)$
Gated-S4D (Figure 4)                 Linear projections                                             $2n(3ed^2 + d^2)$
                                     Kernel via Vandermonde $v(\overline{\mathrm{A}})$              $n(\alpha_{v}ed(n_{\text{state}} + L_{\text{ctx}})\log_2^2(n_{\text{state}} + L_{\text{ctx}})/L_{\text{ctx}})$
                                     S4D SSM with convolution                                       $n(\alpha_{\text{fft}}\log(L_{\text{ctx}})ed + ed)$
                                     Element-wise gating                                            $ned$
MambaByte (Figure 3)                 Linear projections                                             $2n(3ed^2)$
                                     Pre-SSM 1D convolution                                         $2nked$
                                     $\Delta, \mathrm{B}, \mathrm{C}$ from input $x$                $2n(2edr + 2edn_{\text{state}})$
                                     Discretization, pre-scan: $\overline{\mathrm{A}}$, $\overline{\mathrm{B}}x$   $n(3edn_{\text{state}})$
                                     Recurrence with parallel scan                                  $n(edn_{\text{state}})$
                                     Output: $y = \overline{\mathrm{C}}h + \overline{\mathrm{D}}x$  $2nedn_{\text{state}} + ned$
                                     Element-wise gating                                            $ned$

For the medium-scale language modeling experiments (Table 1, §5 of Yu et al. [2023]), Yu et al. [2023] employ the MegaByte-758M+262M model, with a context length of 8,192 and patch size of 8, trained for 80B bytes. As shown in Figure 5, MambaByte-353M ($n=53$, $d=1{,}024$, $e=2$) and MegaByte-758M+262M use the same total compute in FLOPs; hence, we employ MambaByte-353M to benchmark against MegaByte-758M+262M in Table 2 of §4.
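For reference, the per-byte formulas in Table 6 can be transcribed directly into code; the function below does so for MambaByte (using the MambaByte-353M hyperparameters from Table 7 as an example input). It is a sketch of the bookkeeping only: terms not itemized in Table 6 (e.g., embeddings or normalization) are ignored, so small discrepancies with the ratios reported in Table 1 are expected.

```python
def mambabyte_flops_per_byte(n, d, e, n_state, k, r):
    """Sum of the MambaByte rows of Table 6 (forward pass, per input byte)."""
    return (
        2 * n * (3 * e * d**2)                           # linear projections
        + 2 * n * k * e * d                              # pre-SSM 1D convolution
        + 2 * n * (2 * e * d * r + 2 * e * d * n_state)  # Delta, B, C from input x
        + n * (3 * e * d * n_state)                      # discretization, pre-scan
        + n * (e * d * n_state)                          # recurrence with parallel scan
        + 2 * n * e * d * n_state + n * e * d            # output y = C h + D x
        + n * e * d                                      # element-wise gating
    )

# MambaByte-353M (Table 7): n=53, d=1024, e=2, n_state=16, k=4, r=64
print(f"{mambabyte_flops_per_byte(53, 1024, 2, 16, 4, 64):.3e} forward FLOPs per byte")
```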

[Figure 5]

For the PG19 scaling experiment (Table 2, §5 and Appendix D.3 of Yu et al. [2023]), Yu et al. [2023] use MegaByte-1.3B+350M (context length of 8,192 and patch size of 8) trained for 400B bytes to benchmark the observed word-level perplexity against several state-of-the-art subword models. Owing to our hardware limitations, we train MambaByte-972M ($n=48$, $d=1{,}792$, $e=2$) and control for the total compute used (see Figure 5 for the associated computational costs). All the model sizes and associated hyperparameters employed in this work are tabulated in Table 7.

Table 7: Model sizes and hyperparameters.

Model         Parameters                     n (n_g/n_l)   d (d_g/d_l)     e   L_ctx    Others
Transformer   320M [Yu et al., 2023]         22            1,024           4   1,024    heads: -
              350M [Yu et al., 2023]         24            1,024           4   1,024    heads: 16
              361M                           28            1,024           4   8,192    heads: 16
PerceiverAR   248M [Yu et al., 2023]         17            1,024           4   8,192    latents: 1,024
MegaByte      193M+177M⁷                     14/14         1,024/1,024     4   8,192    p=4, 8; heads: 16/16
              758M+262M [Yu et al., 2023]    14/18         2,048/1,024     4   8,192    p=8; heads: 16/16
              1.3B+218M [Yu et al., 2023]    24/15         2,048/1,024     4   8,192    p=8; heads: 32/-
              1.3B+350M [Yu et al., 2023]    24/24         2,048/1,024     4   8,192    p=8; heads: 32/16
Gated-S4D     368M                           26            1,024           4   8,192    n_state=64
MambaByte     353M                           53            1,024           2   8,192    k=4; n_state=16; r=64
              972M                           48            1,792           2   8,192    k=4; n_state=16; r=112
              1.6B                           48            2,304           2   8,192    k=4; n_state=16; r=144

⁷ We used the open-source implementation: https://github.com/lucidrains/MEGABYTE-pytorch.

Appendix C Training recipes

All the models in this study were trained using an AdamW optimizer with $\beta = (0.9, 0.95)$. We used a linear learning rate warm-up (for the first 500 steps) followed by cosine annealing. Keeping consistent with MegaByte training [Yu et al., 2023], we used a batch size of 48 across all our experiments. Additionally, we do not use dropout with any of our models.

For the experiments in Figure 1, we conducted a hyperparameter search using peak learning rates of 0.0002, 0.0006, and 0.0008 and clipped the gradient norm to 1.0 for all the models. The best-observed performance curve for each model is reported in Figure 1. Furthermore, we use an improved Transformer recipe that uses RMSNorm instead of LayerNorm, rotary positional encodings [Su et al., 2021], and linear terms without bias (same as Yu et al. [2023]).

In our medium-scale experiments shown in Table 2, we set the peak learning rate to 0.0004 and clipped the gradient norm to 0.1. We trained MambaByte-353M for a total of 80K steps, equivalent to $80{,}000 \times 48 \times 8{,}192 \approx 30$B bytes.

In the large-scale experiment on PG19, we use a similar setting to that in the medium-scale experiments: the peak learning rate is set to 0.0004, and the gradient norm is clipped to 0.1. MambaByte-972M is trained for 380K steps, equivalent to $380{,}000 \times 48 \times 8{,}192 \approx 150$B bytes.
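A minimal PyTorch sketch of the optimization setup described in this appendix; the warm-up start factor and the annealing endpoint are assumptions, as they are not specified above.

```python
import torch

def configure_optimization(model, peak_lr=4e-4, warmup_steps=500, total_steps=80_000):
    """AdamW with beta=(0.9, 0.95), linear warm-up for 500 steps, then cosine annealing."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, betas=(0.9, 0.95))
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1e-3, end_factor=1.0, total_iters=warmup_steps)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_steps - warmup_steps)
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])
    return optimizer, scheduler

# Inside the training loop, the gradient norm would be clipped before each step, e.g.:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
#   optimizer.step(); scheduler.step()
```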

Appendix D Discretization and selection

Discretization has deep connections to continuous-time systems, which allows for desirable properties such as model normalization [Orvieto et al., 2023, Gu et al., 2023] and resolution invariance [Nguyen et al., 2022]. In this section, we show how zero-order hold discretization of a selective SSM can be viewed as a generalization of the gating mechanism in recurrent networks.

Zero-order hold discretization.

For a given input $x(t) \in \mathbb{R}$, we wish to discretize a continuous-time SSM defined by (1) in §2. To this end, we sample the system at different time intervals such that $x[k] = x(t_k)$ for $t_k = \sum_{j=1}^{k}\Delta[j]$ and assume a zero-order hold, i.e., $x(t)$ is constant between samples: $x(t_k + \xi) = x(t_k) = x[k]$ for any $\xi \in [t_k, t_{k+1})$. (In Mamba [Gu and Dao, 2023], $\overline{\mathrm{B}}$ is obtained through a simplified Euler, as opposed to zero-order hold, discretization, based on the empirical observation that $\mathrm{A}$ is more important than $\mathrm{B}$; the performance does not change significantly with this simplification.) The resultant matrices of the associated discrete SSM are:

$$\overline{\mathrm{A}} = \exp(\mathrm{A}\Delta); \quad \overline{\mathrm{B}} = \mathrm{A}^{-1}(\exp(\mathrm{A}\Delta) - \mathrm{I})\,\mathrm{B}; \quad \overline{\mathrm{C}} = \mathrm{C}.$$
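Since A is diagonal, the zero-order hold expressions above reduce to elementwise operations; a small illustrative sketch (array shapes hypothetical, nonzero diagonal entries assumed):

```python
import numpy as np

def zoh_discretize(A_diag, B, delta):
    """Zero-order hold discretization for a diagonal A (stored as a vector).

    A_diag: (n,) diagonal entries of A (assumed nonzero), B: (n,), delta: scalar timestep.
    Returns (A_bar, B_bar) with A_bar = exp(A*delta) and, elementwise,
    B_bar = (exp(A*delta) - 1) / A * B, i.e., A^{-1}(exp(A*delta) - I)B.
    """
    A_bar = np.exp(A_diag * delta)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar
```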

Selection mechanics and gating in recurrent networks.

Gu and Dao [2023] note that a selective SSM can be realized as a gated recurrence by setting $\Delta = \operatorname{softplus}(z(x)) = \operatorname{softplus}(W_{\Delta}(W_R x))$ (as indicated in (3) of §2). By letting $\mathrm{A} = -1$, $\mathrm{B} = 1$, and $n = 1$, the authors observe:

$$\begin{aligned}
\overline{\mathrm{A}} &= \exp(\mathrm{A}\Delta) \\
&= \exp(-\log(1 + \exp(z(x)))) \\
&= \frac{1}{1 + \exp(z(x))} \\
&= \sigma(-z(x)) \\
&= 1 - \sigma(z(x)), \\
\overline{\mathrm{B}} &= \mathrm{A}^{-1}(\exp(\mathrm{A}\Delta) - \mathrm{I})\,\mathrm{B} \\
&= \mathrm{I} - \exp(\mathrm{A}\Delta) \\
&= \sigma(z(x)).
\end{aligned}$$

Using $\overline{\mathrm{A}}$ and $\overline{\mathrm{B}}$ from above in the discrete recurrence (2), the selective SSM takes the form of a 1D gated recurrence:

$$h[k] = (1 - \sigma(z(x)))\,h[k-1] + \sigma(z(x))\,x[k]. \tag{4}$$

It is interesting to note from (4) that $\lim_{\Delta\to\infty} h[k] = x[k]$ and $\lim_{\Delta\to 0} h[k] = h[k-1]$: a large $\Delta$ ($\Delta \to \infty$) makes the system focus only on the current input and forget the state, whereas a small $\Delta$ ($\Delta \to 0$) makes the system ignore the transient input and retain the previous state.
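This limiting behavior is easy to check numerically; the toy sketch below implements the 1D gated recurrence of (4), with the gate input z passed as a separate sequence purely for illustration (in Mamba, z is a function of x):

```python
import numpy as np

def gated_recurrence(x, z):
    """1D gated recurrence of Eq. (4): h[k] = (1 - g) h[k-1] + g x[k], with g = sigmoid(z[k]).

    Large z[k] (i.e., large Delta) drives g -> 1, so h follows the current input;
    very negative z[k] (Delta -> 0) drives g -> 0, so h retains the previous state.
    """
    h, out = 0.0, []
    for xk, zk in zip(x, z):
        g = 1.0 / (1.0 + np.exp(-zk))
        h = (1.0 - g) * h + g * xk
        out.append(h)
    return out
```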

Selectivity of the A, B, and C matrices.

Gu and Dao [2023] argue that the system matrix $\mathrm{A}$ only affects the model through $\Delta$, i.e., $\overline{\mathrm{A}} = \exp(\mathrm{A}\Delta)$; hence, selectivity in $\Delta$ is sufficient to ensure selectivity in $\overline{\mathrm{A}}$.

While the selectivity in $\Delta$ enables selectivity in the input matrix $\overline{\mathrm{B}}$, Gu and Dao [2023] hypothesize that making $\mathrm{B}$ and $\mathrm{C}$ selective (in addition to $\Delta$) would allow for more fine-grained control based on the content $x[k]$ and the evolving context $h[k]$.

Appendix E Evaluation metrics

Subword-based language models [Vaswani et al., 2017, Hawthorne et al., 2022, Hutchins et al., 2022] report their performance in word-level $\operatorname{PPL}$, while byte-level language models [Xue et al., 2022, Yu et al., 2023] report theirs in $\operatorname{BPB}$. To facilitate meaningful comparisons, we report performance in $\operatorname{BPB}$ when benchmarking against byte-level models and $\operatorname{PPL}$ when comparing to token-level models. In this section, we detail the conversion between word-level $\operatorname{PPL}$ and $\operatorname{BPB}$.

Irrespective of the underlying segmentation, the amount of information $I(D)$ in a given dataset $D$ is constant. Simply put,

$$I(D) = L_T \,\text{bits per token} = L_B \,\text{bits per byte} \tag{5a}$$
$$\triangleq \frac{-\ln(D;\,\text{model})}{\ln(2)}, \tag{5b}$$

where $L_T$ and $L_B$ are the length of the dataset in tokens and bytes, respectively. From (5), we observe:

$$\operatorname{BPB} = \frac{-\ln(D;\,\text{model})/L_B}{\ln(2)} = \frac{\ell_{\text{byte}}}{\ln(2)},$$

where $\ell_{\text{byte}}$ is the observed byte-level negative log-likelihood loss (computed using $\ln$). From (5), we also note the following conversion from $\operatorname{BPB}$ to word-level $\operatorname{PPL}$:

$$\frac{-\ln(D;\,\text{model})/L_T}{\ln(2)} = \frac{L_B}{L_T}\operatorname{BPB} = \frac{L_B}{L_T}\frac{\ell_{\text{byte}}}{\ln(2)}$$
$$\Rightarrow \operatorname{PPL} = \exp\left(\frac{L_B}{L_T}\ell_{\text{byte}}\right) = \exp\left(\frac{L_B}{L_T}\ln(2)\operatorname{BPB}\right).$$

Table 8: Length of each PG19 split in bytes (L_B) and tokens (L_T), and the corresponding ratio L_B/L_T.

Split         L_B              L_T             L_B/L_T
Train         11,677,824,216   1,973,048,393   5.92
Validation    17,733,002       3,007,061       5.90
Test          41,289,101       6,965,511       5.93

For the PG19 dataset, we train MambaByte-972M to minimize BPB over the training data and report word-level PPL on the test data. Split-wise values of $L_B/L_T$ for the PG19 dataset are tabulated in Table 8.
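As a purely hypothetical worked example (the loss value below is illustrative, not a reported result), a byte-level test loss of 0.80 nats per byte would convert, using the test-split ratio $L_B/L_T = 5.93$ from Table 8 and the helpers sketched above, as follows:

    bpb = bpb_from_byte_nll(0.80)       # 0.80 / ln(2) ≈ 1.15 bits per byte
    ppl = word_ppl_from_bpb(bpb, 5.93)  # exp(5.93 * 0.80) ≈ 115 word-level PPL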

Appendix F PG19 generation samples

This section includes a few sample generations from the MambaByte-972M model trained on the PG19 dataset. We use nucleus sampling with $p = 0.98$ [Holtzman et al., 2020] and generate continuations for a total of 8,192 bytes (including the given context prefix). Furthermore, we chose the same test set prefixes used in Appendix F of Rae et al. [2020]. We observe that the model is able to continue the dialogue in the style of the prefix and effectively recall the character names over hundreds of bytes.
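For reference, nucleus (top-$p$) sampling [Holtzman et al., 2020] restricts sampling to the smallest set of next-byte candidates whose cumulative probability reaches $p$. The sketch below is a generic Python/PyTorch illustration under our own naming; it is not the project's decoding code, and the model interface (a 256-way logit vector over byte values) is an assumption.

    import torch

    def sample_next_byte(logits: torch.Tensor, p: float = 0.98) -> int:
        """Nucleus (top-p) sampling over a vocabulary of 256 byte values."""
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_ids = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # Drop bytes whose exclusive cumulative mass already reaches p;
        # the highest-probability byte is always kept.
        outside_nucleus = (cumulative - sorted_probs) >= p
        sorted_probs = sorted_probs.masked_fill(outside_nucleus, 0.0)
        sorted_probs = sorted_probs / sorted_probs.sum()
        choice = torch.multinomial(sorted_probs, num_samples=1)
        return int(sorted_ids[choice].item())

In an autoregressive decoding loop, each sampled byte would be appended to the context and fed back to the model until the total length (prefix plus continuation) reaches 8,192 bytes.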

Baby Mine by Margaret Mayo

Context (487 bytes):

CHAPTER I

Even in college Alfred Hardy was a young man of fixed ideas and high ideals and proud of it. His friend, Jimmy Jinks, had few ideas and no ideals, and was glad of it, and before half of their first college term had passed, Jimmy had ridded himself of all such worries as making up his own mind or directing his own morals. Alfred did all these things so much better, argued Jimmy, furthermore, Alfred LIKED to do them–Jimmy owed it to his friend to give him that pleasure.

MambaByte-972M:

It was not that Alfred really considered his friend an ideal husband,but Jimmy admitted that it looked differently. And, too, it was allthere was to life.And on went Alfred, through long years of honest comradeship andunblushing promises and bubbling kindness. In his heart he had long sinceforgotten Jimmy and his friendship, since he realized the superiority ofJimmy’s ideal, if of any one’s.It was to be expected that Prince William would fall a victim. Perhapshe was a modified betting man, and then had thrown over Nature and histeam-mates, as a victim to ridicule or a follower of hazardous ways.It was then that Prince William went back into society, perhaps soughtan interview with Miss La Roche, and learned that he had been the victimof romance death, and a suicide, and had been endowed with an absurd romancross. Whatever it was, he saw in it the myth of Alfred’s heroic deeds orthe myth of a yellow dog in the market plate. It sobered him gradually.He became nearly a convert, and it was Alfred who, in bidding Miss Rossitergood-by, said that he would never forget her, and never would befriendly with his friend’s family. It was not till now that Miss La Rossrealized that a really good life results in an excellent impulse, but thesequel proved the truth of Miss La Ross’s predictions. One night, a yearbefore Prince William’s death, when she was wearing a diamond star in herhair, and admiring it, she heard Jimmy’s front door open in the hall,and, turning round to look through the open doorway, in came Jimmy, andafter that where Jimmy went Alfred would sometimes go to the house nextdoor, and they went over this now, and ever since Miss La Ross’s spirithad been there, and Jimmy had often insisted that a marriage should beaccomplished in some ancient house–"And we will found a family of ourown," said Jimmy, "a for some distant house in this great Belgium."The days went on and it was a glorious spring morning. Early in March athick autumn rain had fallen on the house in New York and the leaflesstrees looked weird and strange, and the wind roared like a crested waveof fire. In the afternoon the thunder roared and soon an occasionalgray drizzle eddied about the wet streets and the city glow and tintbegan to merge into a world of pink and red and amber and purple andsunset colors.The low ceiling pillars had covered their treasures of gold and gems, andthe great stained-glass windows in dreamily gentle, half-lights, hadarranged themselves in red and gold stripes of magic tints. Even thoughthe delicate color scheme had had its disappointments once, it hadcertainly affected one hour in this room as well as ten months later andlonger stays at least would have done.To-day a soft South wind had drifted through the open door, and a suddeneast wind had driven open the French windows of Miss La Ross’s bedroom,and it rained in pure flames between the ceiling and boarded floor.Alfred’s room was fragrant with his presence. "… A little singing bird That, living in a cage, demands a friend Whose cool-blooded lure can warm the heart with love To a fluttering, wounded thing.As in a pathetic garden, so in the hall room.It was cold, to-day. Already the rooms seemed overheated. The curtainswere already half drawn.She shivered."Mid-winter, to-day," thought Alfred, watching the sweep of Ann’s whiteshoulder and patting her thin cotton frock. "Seven there were of days.And seven is yet untold gone. Fine, fine day, by Christ! Come out of thisold soot, and we’ll fly… Away. 
God rest his soul from hell, if ever sucha devil crawled this broad, raw earth…. Where are you, Ann?"Ann waited and trembled, she knew not why, for a sharp voice was askingsuddenly for the check book in her hand."Get me change enough to pay for lunch for Jimmy," Alfred chided.Before the one empty chair on the hall table and under the curtains laya crashing pile of ready money. "And the window shades are closed," added Alfred."It won’t shut out the rain," smiled Ann."But he won’t care," protested Ann.Alfred laid a strong withdrawing hand on the fair golden hair for amoment."It’s all right," he coaxed. "Without a cent behind them to-day we can putin four thousand and close the bottom against a falling price like this."He was holding up the window sill six inches.While he stood she whispered:"I’m only lucky to save the day.""He helps you without a reward," Alfred said."He’s kind… and darned bad."Ann noted dangerous things that afternoon."You could sing and play?" she asked."No, no!" insisted Alfred. "I CAN’T play and sing. The room is cold. It’swarm within."Alfred was changing clothes when he had that lucky escape, and Alfredmomentarily forgot his debt. Ann laid the bill she had placed on thetable, and when she had gone Alfred had not even looked at it, and it wasthe act she saw in that frame of mind, remembering it, that made her putit back again.Now Alfred was thoroughly cold and temperamental, and when he probed anobligation that he had just been trying to shift on the other fellow,he was more easily reminded. When Jimmy, cold and hungry, had wormed hisway into his room that day at dinner, and been halted at his close chairby the soup stove, the young man’s gaze had fixed furiously to theplatter of gold and had immediately started on the other food with anintensity of expression that had awakened Jimmy’s appreciation of thehot day of purposes and had aroused even Ann’s observant sense.Jimmy’s employer had met him on Close Street after the unsuccessful rowover the Dearborn Cats. Jimmy, who was not naturally an observant boy,had tried to keep in the line of his employer’s movements and tell Alfredhis employer just what he did for a living, but all Alfred’s energy hadvanished, and on sundry occasions he had caught Jimmy’s eye, and once hehad promptly appeared to mere assiduous examination of the window.Employer’s Jimmy had been dexterous enough, subdued, but his dexterityand subtlety and sagacity had not failed.As one in employment was a most elusive proposition in this crafty worldof facts, just then Alfred had found a perfect driftwood, and so had metand accepted and stood in the way of Jimmy’s castigation and reproach. 
Thatis to say, he had saved Jimmy from seeing any of his own real qualities,and the critics, he had been asked in Jimmy’s more frequent matinees toerase Alfred’s sneer and snip off his coat, and he had instantly become amental picture of Jimmy Dean’s assistant to the lawyer and thecollege professor.It was Jimmy’s reckless impetuousness, not his single fearless singleenergy, that had led Ann through the door at sight of Ann, that hadelectrified the tremendous audience, not her own act or attitude.Jimmy had thought still of the boy as a fellow mortal, now his master hadgone.That was a satisfactory driftwood, of Jimmy.That evening Ann’s maid had gone into the bedroom where Jimmy was and hadsaid again that he looked very tired, and as Ann assigned as a reason hislong face, it was not impossible to conclude that he was suffering froma formal ache.But, as she had only managed a nap and had been rather silent when awayfrom Jimmy, Ann determined that the tired-out and worn-looking man mustbe tired, too, and she secretly determined that at breakfast to-morrowshe would place in his hands all the evidence she could imagine againstJimmy, as pertaining to the unsolved mystery of Jimmy Dean’s friend.CHAPTER IIThat afternoon Jimmy was taken into the professor’s home to dinner.Wonderful Dean’s Hal

The Diary of Samuel Pepys

Context (826 bytes):

June 1st. Having taken our leaves of Sir W. Batten and my Lady, who are gone this morning to keep their Whitsuntide, Sir W. Pen and I and Mr. Gauden by water to Woolwich, and there went from ship to ship to give order for and take notice of their forwardness to go forth, and then to Deptford and did the like, having dined at Woolwich with Captain Poole at the tavern there. From Deptford we walked to Redriffe, calling at the half-way house, and there come into a room where there was infinite of new cakes placed that are made against Whitsuntide, and there we were very merry. By water home, and there did businesses of the office. Among others got my Lord’s imprest of L1000 and Mr. Creed’s of L10,000 against this voyage their bills signed. Having wrote letters into the country and read some things I went to bed.

MambaByte-972M:

There came my ship in this morning, which had safely arrived afterall our troubles, though some to this day we thought had been lost,viz., of the north winds. Whereabouts of us, and going to find them, wehad much ado to get up in the day time without much losing our mast andgoing about by night. At noon came the King and Queen and all theirCourt (unhappily in some pain from the swelling of the belly). Therewere myself with others going to dinner to meet them; and my LadySandwich came, by special invitation of my Lady Jem., to dine at Court.She was invited to dinner afterwards by Mrs. Penington; and after dinnerI going out to her coach, we fell to talk of the happiness of this, andbeing astonished at the curiosity and ingenuity almost of the age andremarkable folly of the age for things of this kind: "Many good fortunes,"says she, "will happen upon this discourse; your lady (I know) haththe choice of your person taken in some few; but you, charming and ingenas you are, I doubt you will not make an end." "Bah!" says I, "it isgood dallying with the ladies; for if once off of her, I promise you hertalking and witty company, and you could not choose but have her."At dinner we staid some time, discoursing about her forwardness as toher coming over from her beloved Deptford to be Lady. My Lady is sotender, that I did not help to give her opportunity of shining a little,by having been called out to dinner to Sir R. Holmes before I set out, ithaving been proposed to my Lady to stay and go as far as Gravesende; ofwhom she was very much pleased, even to the skirt of her gowne. But,when I came back I found the strictest inquiry made after me, I comingover to Hinchinbroke; and my Lady cryed out of my going into thecoach, lest she should be made to say I stayed for her. Poor littlewench, I confess I was in a great confusion, being to see my Lady set outwithout me; which, if I had known, I had wisely taken a resolution not tohave been of it. At last it was agreed between Sir W. Batten and my Ladythat the maid should go herself in a boat, and my Lady his wife to go inanother boat with Mrs. Penington, who had a great mind to go. My Ladyhad thoughts of sending Jane for her. Which we were loth to, thesingleness of their minds and opposition they meeting with no manner ofbusiness at home, and being desired to do so by my Lady, did instance meto get Jane and little Mrs. Penington to stay with me, and myself tolodge and launch two boats more, for the building now and to study thedisposing of them; but, being there, occasion will not suffer me to goupon them, but I leave it wholly to their own pleasures. The day provingmost pleasant all the afternoon, so that it was agreed on, and even ourpreparations done, my mind all over afeard and discouraged, as if thingswould not have been better. Upon our sending away this poor wench (myLady sitting looking on) I rose up and took my wife and my son T. Greavesas proxies for her to go by herself, in my great coach in the Towregarden, in the last coach my Lord will lend me; which coach is wellchosen and good. So giving order for the horse to be chosen neatest andwhitest, we set forth; Mrs. Penington on the box with me scolding at himall the way; my great discourse my losing the fine pearl, where I hadmuch ado to keep from crying, such a jealous pain she put me to. 
Away toLambeth chappell; where, I perceive, my Lord did give his desire to haveunderstand who are to sit in his chappell, he standing up all the sermon.This continued most part of the sermon, and in our going over the bridgeabove bridge, I bethought myself that my Lord must have grounded hisresolution of having it so. He seemed therefore to give very civildirection at cards after sermon in the evening, to my Lord’s great satisfaction;but on the reading of the Breviate he was as offended as Icould be of the alteration (enquiring too deeply into the goodness of thegentlemen in their not desiring to have a further change) which my Lorddid very readily, and told me he fell very off; and, losing all he had,and taking nothing, sighed very heavily and mournfully, and so went away.The business was not made much merrier for this sermon; for which I wassorry, and remaining professedly sad we rowed to Kingston, and after somesolemn service we landed there and went up to a little alehouse some fewmiles up yet Stoul. But, Lord! the humours of boys, which I never knew inmy life before, how fretting, tearing and swearing, and so I made theirbellies fret presently, no supper nor ever man of the house sitting down asI expected. So ended this businesse with joy beyond what was hoped for itthe first time. To Stoul and back again.

The Patrol of the Sun Dance Trail by Ralph Connor

Context (1,059 bytes):

High up on the hillside in the midst of a rugged group of jack pines the Union Jack shook out its folds gallantly in the breeze that swept down the Kicking Horse Pass. That gallant flag marked the headquarters of Superintendent Strong, of the North West Mounted Police, whose special duty it was to preserve law and order along the construction line of the Canadian Pacific Railway Company, now pushed west some scores of miles. Along the tote-road, which ran parallel to the steel, a man, dark of skin, slight but wiry, came running, his hard panting, his streaming face, his open mouth proclaiming his exhaustion. At a little trail that led to the left he paused, noted its course toward the flaunting flag, turned into it, then struggled up the rocky hillside till he came to the wooden shack, with a deep porch running round it, and surrounded by a rustic fence which enclosed a garden whose neatness illustrated a characteristic of the British soldier. The runner passed in through the gate and up the little gravel walk and began to ascend the steps.

MambaByte-972M:

Indescribable and yet unconscious was this man. All his being rose toa sense of the security that was about him as he walked onwards swiftlybetween rows of young firs of Canadian growth. He swung impulsively bythe gap of a gate and paused. Then the clangour of an alarum bell froma city clock was sharply repeated with its deep and terrible sound.This bell not only announced the desire of the inhabitants of theintervening concession lines to their work, but it also warned from thefurther side of the high boundary constructed by the late engineer ofthe West Mounted Police the abrupt and rapid career of another man. Thepistol had fallen, and the runner was searching for another.The solicitor buried his face in his hands."In God’s name, man!" he ejacul*ted. "How do I know? Haven’t I justfound one?"He had rung the bell, and now fell back among the shadows and the roughbushes. The sound startled him. He stood transfixed a moment before helooked about him."There’s only one man knowing where that tote-road goes." He still heldhis breath and then listened. "’Tis where the first news comes from,and–oh, I see! There’s my boy."He waited a moment. Then once more he passed along the gap and descendedthe rocky little to the more sheltered spot where the newbuildings stood.He was an under officer in that high branch of the service known as theUnion League, and a short sharp beat was his upon the white drilling ofbashi-bazouks. With the sounding of another great bell of a churchclose at hand he moved quickly round to the other side of the buildings.As he approached, however, he took from his pocket a thin black silkneckerchief. It was damp and stained with the blood of dead men. He laidit in the hands of a slim girl, with the limpid blue eyes of the CanadianSaskatchewan."What’s that for?" he demanded.She looked as if there had been something she desired to say, then leftthe agitated conclusion unfinished. Her eyes sought his in the patheticwistfulness of a child, then suddenly fell. For the hurt he had done herwas not a wound incurred in battle. It was merely a little scratch inthe hand, and let alone that, in a manner of speaking, it was all shehad. The blood of a man is always more significant than that of ascratch on the bark of a tree, and a pressure of the earth leaves adeeper mark on a man’s arm. With a sigh the runner removed the bloodstain and turned his face towards the sound again. He walked half acrossthe open grass from which he had sprung. From his ample form to the far-distant leaping folds of his drilling trousers he had trailed a forkedstick, and so to the girl.In a few seconds he came back."It’s me, pardner, Superintendent Strong. It’s me I’m goin’ down from theSoo, for the job I had in Mexico after I came out here. I’m connectedwith the Canadian Pacific Railway and they’re hunting up a man who didhave a finger wounded by a Canadian rock. I’m sendin’ the little flagwith her." He emphasised the word "flag." A rough skin mark, furrowed ina straight line down his left cheek, marked the place of the scar andbrought him to a sudden stop. His eyes were on the scrolled lettersabove his head."I’m going down to get it. I’ve got to get it to the bottom, anyway, fordivil a bit of paper they’ll let me have at British Columbia. Oh, God!"He raised his voice. In a moment he had departed. In a few minutes hehad rejoined the girl. They rejoined the solicitor and returned with himto an open space before the meeting place of the railway company. 
As theygathered round a table spread with an untasted meal the solicitor spoke.The railroad company was working out from British Columbia to Montreal."In our fight we had it hard," he said. "The northern route to LeagueIsland was blocked, we could not reach there to recruit. We had to lookfor a northern route, for there was none. At first the league flag ofOttawa was given up. That was only till October. Then a young man on theground from London came to us. He’d been in the runner’s service alongthe whole line from Montreal. He was headed for Canada on thetelegraph. Two of us had to flag him as soon as we set out from here.He had been over that ground about fifty times before, and knew thewhole road well for forty miles. The head of us did not know ittill he came to the junction where the main line crosses the north lineof the United States. We took that name on the tin to test him.""What was the corporation over there for?" said the solicitor. "I remember,I remember. It occupied a part of the big Kelvin mine. I was helping getthe first claim post run by the Union League at the time I was there.He was out hunting coal. He came down one day to see the coal pits aboutthe ground. On the way he was stopped and accused of raising a rebellion,and was arrested and taken to the Soo, where he was made to giveevidence in a certain case that had been laid before him.""And what was the precise cause of the complaint?" asked the runner."Well, it wasn’t a case at all, it was a fact. That’s all," explained theconstable."From what I heard then of the runners of the London and North West, theirwork wasn’t near so exciting and dangerous as it had been reported to be.Also it was the work of others, others still, and they were arrested.They was a young feller and a girl married over two years ago, and he wasshot.""Brought to trial for that by himself or his relatives or some of the menwho were with him?" There was a puzzled, gentle expression on the face ofthe railway superintendent. He was of much higher rank, for he had notbeen present at the trial of the accused. He glanced up at the runner."Arrested?" The bit of food in his mouth was working like a millstone inthe Soo employer’s breast. Then, as though unconsciously to himself, hislips said "yes" instead of "no," and he added instead, "and sworn to it.That’s as far as you’ve got, pardner. Anything else, sir?" He was watching thesilent figure with intense desire to see his face and to know what hefelt. It did not come, and he settled himself in his chair with a sigh."That was short work. They marched the young feller up here, and give himthe Canadian division. It was the station sergeant-inspector from theCanadian line sending down from headquarters to show he was all right andnot having heard anything against him. And if you don’t know that it’s notthe worst of the testimony we have to give, pardner. It wasn’t the best.The fact is the young man was getting three weeks’ sentence at the time.""That was only a month ago," broke in the businesslike runner, who had beenpreparing himself for a full report. "What had he done? Tell us?"There was something pathetic in the voice and in the manner of the youngman. Then, as he mounted his story, the under-officer took up the threadin an apologetic tone, but was brought back to a moment’s serious interestby the stopping of it by the voice of the other.

