
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/_rendered_examples/dynamo/torch_export_gpt2.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tutorials__rendered_examples_dynamo_torch_export_gpt2.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials__rendered_examples_dynamo_torch_export_gpt2.py:


.. _torch_export_gpt2:

Compiling GPT2 using the dynamo backend
==========================================================

This script illustrates the Torch-TensorRT workflow with the dynamo backend on the popular GPT2 model.

.. GENERATED FROM PYTHON SOURCE LINES 10-12

Imports and Model Definition
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 12-17

.. code-block:: python

    import torch
    import torch_tensorrt
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from utils import export_llm, generate


.. GENERATED FROM PYTHON SOURCE LINES 18-39

.. code-block:: python


    # Define the parameters and initialize the model
    MAX_TOKENS = 32
    DEVICE = torch.device("cuda:0")

    # Load the GPT2 model from Hugging Face.
    # kv_cache is not currently supported in Torch-TRT, so use_cache is disabled.
    # The model is kept on CPU here so that GPU memory is reserved for TRT compilation.
    with torch.no_grad():
        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = (
            AutoModelForCausalLM.from_pretrained(
                "gpt2",
                pad_token_id=tokenizer.eos_token_id,
                use_cache=False,
                attn_implementation="eager",
            )
            .eval()
            .half()
        )


.. GENERATED FROM PYTHON SOURCE LINES 40-41

Tokenize a sample input prompt and get the PyTorch model's outputs

.. GENERATED FROM PYTHON SOURCE LINES 41-50

.. code-block:: python

    prompt = "I enjoy walking with my cute dog"
    model_inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = model_inputs["input_ids"]

    # Auto-regressive generation loop for greedy decoding using the PyTorch model.
    # We use a custom generate function (from utils) which closely mirrors the Hugging Face one.
    pyt_gen_tokens = generate(model, input_ids, MAX_TOKENS, tokenizer.eos_token_id)
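
The ``generate`` helper above comes from the tutorial's local ``utils`` module. As a rough illustration (a sketch, not the actual helper), a greedy decoding loop looks like this in plain Python, with a toy ``next_token_logits`` standing in for a model forward pass:

.. code-block:: python

    # Hypothetical stand-in for a model forward pass: given the token ids so
    # far, return logits over a tiny 5-token vocabulary. The toy rule favors
    # token (last + 1) % 5, so decoding walks 0 -> 1 -> 2 -> 3 -> 4.
    def next_token_logits(token_ids):
        logits = [0.0] * 5
        logits[(token_ids[-1] + 1) % 5] = 1.0
        return logits

    def greedy_generate(token_ids, max_tokens, eos_id):
        """Greedy decoding: repeatedly append the argmax token until EOS or budget."""
        tokens = list(token_ids)
        for _ in range(max_tokens):
            logits = next_token_logits(tokens)
            next_id = max(range(len(logits)), key=logits.__getitem__)
            tokens.append(next_id)
            if next_id == eos_id:
                break
        return tokens

    print(greedy_generate([0], max_tokens=10, eos_id=4))  # → [0, 1, 2, 3, 4]

The real helper does the same thing with tensors: each iteration runs the model on the full sequence, takes the argmax of the last position's logits, and appends it.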



.. GENERATED FROM PYTHON SOURCE LINES 51-53

Compile with Torch-TensorRT using the dynamo backend and generate TensorRT outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 53-77

.. code-block:: python


    # Export the GPT2 model into an ExportedProgram, which is the input to TRT compilation.
    # To compile the model in FP16, we do the following:
    # 1) Cast the model to FP16 via model.half()
    # 2) Enable use_explicit_typing=True. Certain layers are explicitly cast to FP32 within the PyTorch model, and this flag respects that behavior during TRT compilation.
    # 3) Enable use_fp32_acc=True. This ensures all matmuls are accumulated in FP32 precision (similar to PyTorch).
    gpt2_ep = export_llm(model, input_ids, max_seq_len=1024)
    trt_model = torch_tensorrt.dynamo.compile(
        gpt2_ep,
        inputs=[input_ids],
        enabled_precisions={torch.float32},
        truncate_double=True,
        device=DEVICE,
        disable_tf32=True,
        use_explicit_typing=True,
        use_fp32_acc=True,
    )

    # Auto-regressive generation loop for greedy decoding using the TensorRT model.
    # We use a custom generate function (from utils) which closely mirrors the Hugging Face one.
    # Move inputs to GPU
    input_ids = input_ids.to(DEVICE)
    trt_gen_tokens = generate(trt_model, input_ids, MAX_TOKENS, tokenizer.eos_token_id)
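
The ``export_llm`` helper is also defined in the tutorial's ``utils`` module. As a hedged sketch of what such an export step involves (an assumption, not the actual helper), ``torch.export`` can trace a model with a dynamic sequence dimension; here a toy stand-in model replaces GPT2 so the snippet stays self-contained:

.. code-block:: python

    # Minimal sketch (not the real utils.export_llm): export a toy LM with a
    # dynamic sequence-length dimension via torch.export.
    import torch
    from torch.export import Dim, export


    class TinyLM(torch.nn.Module):
        """Toy embedding + linear head standing in for the real LLM."""

        def __init__(self, vocab_size=128, hidden=8):
            super().__init__()
            self.emb = torch.nn.Embedding(vocab_size, hidden)
            self.head = torch.nn.Linear(hidden, vocab_size)

        def forward(self, input_ids):
            return self.head(self.emb(input_ids))


    model = TinyLM().eval()
    example_ids = torch.randint(0, 128, (1, 7))

    # Mark dim 1 (sequence length) as dynamic, up to a chosen max_seq_len.
    seq_len = Dim("seq_len", min=2, max=1024)
    exported = export(model, (example_ids,), dynamic_shapes=({1: seq_len},))

    # The ExportedProgram can be re-run at a different sequence length.
    out = exported.module()(torch.randint(0, 128, (1, 5)))
    print(out.shape)

The resulting ``ExportedProgram`` is what ``torch_tensorrt.dynamo.compile`` consumes above; the dynamic sequence dimension is what allows the compiled engine to serve the growing sequence during auto-regressive decoding.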


.. GENERATED FROM PYTHON SOURCE LINES 78-80

Decode the output sentences from the PyTorch and TensorRT models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 80-98

.. code-block:: python

    print("=============================")
    print(
        "PyTorch model generated text: ",
        tokenizer.decode(pyt_gen_tokens[0], skip_special_tokens=True),
    )
    print("=============================")
    print(
        "TensorRT model generated text: ",
        tokenizer.decode(trt_gen_tokens[0], skip_special_tokens=True),
    )

    # Sample output below was generated with a different prompt than the one
    # used above.
    # Prompt : What is parallel programming ?

    # =============================
    # PyTorch model generated text: The parallel programming paradigm is a set of programming languages that are designed to be used in parallel. The main difference between parallel programming and parallel programming is that

    # =============================
    # TensorRT model generated text: The parallel programming paradigm is a set of programming languages that are designed to be used in parallel. The main difference between parallel programming and parallel programming is that


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.000 seconds)


.. _sphx_glr_download_tutorials__rendered_examples_dynamo_torch_export_gpt2.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example




    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: torch_export_gpt2.py <torch_export_gpt2.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: torch_export_gpt2.ipynb <torch_export_gpt2.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
