TinyStories#
Introduction#
TinyStories is a model created with the goal of making language models small enough to run on edge hardware, hence the term Small Language Model (SLM). In this tutorial, we use the MXA to generate stories with the popular TinyStories SLM, demonstrating that the MXA works well not only with vision models but also with language models.
Note
This application assumes that the MXA drivers, runtimes, and compilers have been successfully installed. For more information, please refer to the Install page.
Export the Model#
The official model is in PyTorch, so we make a few changes, export it to ONNX, and then compile the ONNX model to a DFP for use on the MXA.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn as nn
tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-33M")
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")
model.eval()
max_length = 128
dummy_input = tokenizer("Once upon a time", return_tensors="pt").input_ids
dummy_input = torch.cat([dummy_input, torch.zeros(1, max_length - dummy_input.size(1), dtype=torch.long)], dim=1)
# Wrapper that returns only the logits tensor from the model output
class WrappedModel(nn.Module):
    def __init__(self, original_model):
        super(WrappedModel, self).__init__()
        self.original_model = original_model

    def forward(self, x):
        out = self.original_model(x)
        return out[0]
wrapped_model = WrappedModel(model)
wrapped_model.eval()
torch.onnx.export(wrapped_model, dummy_input,
                  "models/tinystories33M.onnx", input_names=["input_ids"],
                  output_names=["output"], dynamic_axes=None)
The parameter max_length is used to set the maximum context length of the model. Other values can also be used to experiment with speed and accuracy.
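For example, to experiment with a longer context, simply change max_length before exporting (the value below is arbitrary, and the model must be re-exported and recompiled afterwards):
max_length = 256  # larger context: room for longer stories, at the cost of slower compilation and inference
dummy_input = tokenizer("Once upon a time", return_tensors="pt").input_ids
dummy_input = torch.cat([dummy_input, torch.zeros(1, max_length - dummy_input.size(1), dtype=torch.long)], dim=1)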
Note
Support for transformer models is rapidly expanding. For more information, please check the Transformer Support page.
Compile the Model#
Compiling this model is more involved than compiling the vision models in the other tutorials, as it requires a few experimental flags. The --effort hard flag is required, which makes compilation take longer than a standard compile. For this reason, we encourage users to use the precompiled models provided in the zip file attached at the end of the tutorial.
cd models
mx_nc -m tinystories33M.onnx --inputs /original_model/transformer/Add_output_0:0,/original_model/transformer/Add_output_0:1,/original_model/transformer/Add_output_0:2 --outputs /original_model/transformer/ln_f/Add_1_output_0 --graph_extensions TinyStories --insert_onnx_io_format_adapters "io" --effort hard
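If you compile the model yourself, a quick sanity check is to load the resulting DFP and print its input shapes (a minimal sketch; it assumes the MemryX Python API is imported as mx, as in the application code below):
import memryx as mx  # assumption: the MemryX Python API package name

dfp = mx.Dfp("tinystories33M.dfp")
print(dfp.input_shapes)  # shapes the MXA core model expects (channel-last)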
Model Details#
The TinyStories model takes a sequence of input tokens and outputs a sequence of tokens. We are only interested in the final token of the output, since the earlier tokens were produced in previous iterations. We use a concept called beam search, which selects the top n tokens of the output instead of just the best one. This approach produces much better results than using a single token. With a beam_width of 1, only one story is generated, while with a beam_width of 3, an average of 50 or more stories are generated.
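To make the beam-search step concrete, here is a minimal sketch (with hypothetical names, not the tutorial's actual implementation) of how the top beam_width candidate tokens can be picked from the logits at the last generated position:
import numpy as np

def top_candidates(logits, beam_width=3):
    # logits: (vocab_size,) scores for the next token at the current position.
    # Returns the beam_width highest-scoring token ids, best first.
    return np.argsort(logits)[-beam_width:][::-1]
Each kept sequence is then extended with each of its candidate tokens, which is why a beam_width greater than 1 branches into multiple stories.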
Model Initialization#
Certain layers in the model that are not supported on the accelerator are cropped into pre/post-processing models, which run on the host system. Below is the code that initializes each model based on where it runs:
# Assumes onnxruntime is imported as ort and the MemryX Python API as mx
self.dfp = mx.Dfp(f"{model_dir}/tinystories33M.dfp")

session_options = ort.SessionOptions()
session_options.intra_op_num_threads = 4
session_options.inter_op_num_threads = 1
session_options.enable_mem_pattern = True
session_options.enable_cpu_mem_arena = True

# Cropped embedding (pre-processing) model runs on the host through ONNX Runtime
self.embedding_model = ort.InferenceSession(f"{model_dir}/tinystories33M_pre.onnx", session_options)
# Core transformer layers run on the MXA
self.mxa_core_model = mx.SyncAccl(self.dfp)
# Cropped reverse-embedding (post-processing) model runs on the host through ONNX Runtime
self.rev_embedding_model = ort.InferenceSession(f"{model_dir}/tinystories33M_post.onnx", session_options)

self.embedding_input_name = self.embedding_model.get_inputs()[0].name
self.rev_embedding_input_name = self.rev_embedding_model.get_inputs()[0].name
self.max_len = max_len
Input Encoding#
Encoding the input to send to the model for inference:
prompt = input("Enter the prompt: ")
input_ids = tokenizer.encode(prompt, return_tensors="np")
sequences = [input_ids]
while(True):
    inp = [np.zeros([1, max_len], dtype=np.int64) for _ in range(beam_width)]
    n = len(sequences)
    for i in range(n):
        idx = sequences[i].shape[-1]  # number of tokens already in this sequence
        inp[i][:, 0:idx] = sequences[i]
In the above code, we start with a prompt and then, on each subsequent iteration, build the input encodings from the outputs of the current inference and the chosen beam_width.
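The snippet above only shows how the padded inputs are built. As a rough sketch of the rest of an iteration (not the tutorial's actual code; run_inference stands in for the pipeline shown in the next section), each sequence could be extended with its top beam_width candidate tokens before being re-encoded:
outputs = run_inference(inp)  # hypothetical: returns one (1, max_len, vocab_size) logits array per input
new_sequences = []
for seq, logits in zip(sequences, outputs):
    seq = np.ravel(seq)                   # 1-D array of token ids
    last = len(seq) - 1                   # logits at the last real token predict the next one
    for token_id in np.argsort(logits[0, last])[-beam_width:]:
        new_sequences.append(np.append(seq, token_id))  # one new branch per candidate
sequences = new_sequences
# A stopping condition (an end-of-text token or reaching max_len) would break out of the loop.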
Inference#
The inference pipeline runs the cropped pre/post-processing models on the host and the core model on the MXA:
core_input = []
for inp in input_data:
    core_inp = self.embedding_model.run(None, {self.embedding_input_name: inp})[0]
    # MXA requires the input shapes to be channel-last
    core_inp = core_inp.reshape(self.dfp.input_shapes[0]).astype(np.float32)
    core_input.append(core_inp)

channels = core_input[0].shape[-1]
core_output = self.mxa_core_model.run(core_input)

rev_input = []
if isinstance(core_output, list):
    for out in core_output:
        # Converting the channel-last output back to channel-first for ONNX post-processing
        out = out.reshape([1, self.max_len, channels])
        rev_input.append(out)
else:
    rev_input = [core_output.reshape([1, self.max_len, channels])]

output_data = []
for out in rev_input:
    output_data.append(self.rev_embedding_model.run(None, {self.rev_embedding_input_name: out})[0])

return output_data
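Once enough tokens have been generated, each finished sequence can be decoded back into readable text (a minimal sketch; finished_sequences is a hypothetical name for the candidate sequences collected by the generation loop):
for seq in finished_sequences:
    story = tokenizer.decode(np.ravel(seq), skip_special_tokens=True)
    print(story)
    print("-" * 40)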
Usage#
To use this application, simply run the stories.py file and provide an input prompt when asked. A typical prompt might be Once upon a time.
python stories.py --beam_width=3
Final Thoughts#
Try using different beam_width values to observe the variation in story generation. The MXA excels in throughput when used with large batch sizes in Sync mode, so a larger beam_width will be much faster on the MXA than on a CPU.
Third-Party License#
This tutorial uses third-party software and libraries. Below are the details of the licenses for these dependencies:
Model: Copyright (c) Ronen Eldan, MIT license