
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/_rendered_examples/dynamo/torch_export_llama2.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tutorials__rendered_examples_dynamo_torch_export_llama2.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials__rendered_examples_dynamo_torch_export_llama2.py:


.. _torch_export_llama2:

Compiling Llama2 using the dynamo backend
==========================================================

This script illustrates the Torch-TensorRT workflow with the dynamo backend on the popular Llama2 model.

.. GENERATED FROM PYTHON SOURCE LINES 10-12

Imports and Model Definition
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 12-17

.. code-block:: python

    import torch
    import torch_tensorrt
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from utils import export_llm, generate


.. GENERATED FROM PYTHON SOURCE LINES 18-19

Define the parameters and initialize the model

.. GENERATED FROM PYTHON SOURCE LINES 19-37

.. code-block:: python

    MAX_TOKENS = 32
    DEVICE = torch.device("cuda:0")

    # Load the Llama2 model from Hugging Face.
    # kv_cache is not currently supported in Torch-TRT.
    # The model is kept on the CPU so that GPU memory is reserved for TRT compilation.
    llama_path = "meta-llama/Llama-2-7b-chat-hf"
    with torch.no_grad():
        model = (
            AutoModelForCausalLM.from_pretrained(
                llama_path, use_cache=False, attn_implementation="eager"
            )
            .eval()
            .half()
        )

    tokenizer = AutoTokenizer.from_pretrained(llama_path)


.. GENERATED FROM PYTHON SOURCE LINES 38-39

Tokenize a sample input prompt and get the PyTorch model outputs

.. GENERATED FROM PYTHON SOURCE LINES 39-47

.. code-block:: python

    prompt = "What is dynamic programming?"
    model_inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = model_inputs.input_ids

    # Auto-regressive generation loop for greedy decoding using the PyTorch model.
    # We use a custom generate function which is very similar to the HuggingFace one.
    pyt_gen_tokens = generate(model, input_ids, MAX_TOKENS, tokenizer.eos_token_id)
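
The ``generate`` helper imported from ``utils`` is not shown here; conceptually it is a greedy auto-regressive loop: run the model forward, take the argmax of the next-token logits, append it to the sequence, and stop at the EOS token or after ``MAX_TOKENS`` steps. A minimal pure-Python sketch of that loop, with a hypothetical toy next-token function standing in for the real model (the names below are illustrative, not the tutorial's actual ``utils`` code):

.. code-block:: python

    def greedy_generate(next_logits, input_ids, max_tokens, eos_id):
        """Greedy decoding: repeatedly append the argmax next token."""
        ids = list(input_ids)
        for _ in range(max_tokens):
            logits = next_logits(ids)  # model forward pass on the current sequence
            next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
            ids.append(next_id)
            if next_id == eos_id:  # stop once the model emits end-of-sequence
                break
        return ids

    # Toy "model" over a 4-token vocabulary (token 0 is <eos>):
    # it always prefers last_token + 1, and <eos> once the vocabulary runs out.
    def toy_model(ids):
        logits = [0.0, 0.0, 0.0, 0.0]
        nxt = ids[-1] + 1
        logits[nxt if nxt < 4 else 0] = 1.0
        return logits

    greedy_generate(toy_model, [1], max_tokens=8, eos_id=0)  # -> [1, 2, 3, 0]

The real ``generate`` works the same way, except the forward pass is a PyTorch (or TensorRT-compiled) module operating on token-ID tensors.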


.. GENERATED FROM PYTHON SOURCE LINES 48-50

Compile with Torch-TensorRT using the dynamo backend and generate TensorRT outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 50-74

.. code-block:: python


    # Export the Llama2 model into an ExportedProgram, which is the input to TRT compilation.
    # To compile the model in FP16, we do the following:
    # 1) Cast the model to FP16 via model.half()
    # 2) Enable use_explicit_typing=True. Certain layers are explicitly cast to FP32
    #    within the PyTorch model, and this flag preserves that behavior during TRT compilation.
    # 3) Enable use_fp32_acc=True. This ensures all matmuls are accumulated in FP32
    #    precision (similar to PyTorch).
    llama2_ep = export_llm(model, input_ids, max_seq_len=64)
    trt_model = torch_tensorrt.dynamo.compile(
        llama2_ep,
        inputs=[input_ids],
        enabled_precisions={torch.float32},
        truncate_double=True,
        device=DEVICE,
        disable_tf32=True,
        use_explicit_typing=True,
        use_fp32_acc=True,
    )

    # Auto-regressive generation loop for greedy decoding using the TensorRT model.
    # We use a custom generate function which is very similar to the HuggingFace one.
    # Move the inputs to the GPU
    input_ids = input_ids.to(DEVICE)
    trt_gen_tokens = generate(trt_model, input_ids, MAX_TOKENS, tokenizer.eos_token_id)
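
Setting ``use_fp32_acc=True`` matters because an FP16 accumulator silently drops small addends once the running sum grows large. The effect can be sketched with the standard library alone, emulating FP16 rounding via ``struct``'s half-precision format (an illustration of the numeric behavior, not Torch-TensorRT code):

.. code-block:: python

    import struct

    def to_fp16(x):
        """Round a Python float to the nearest IEEE-754 half-precision value."""
        return struct.unpack("e", struct.pack("e", x))[0]

    step = to_fp16(0.1)
    acc_fp16 = 0.0
    acc_hi = 0.0
    for _ in range(10_000):
        # An FP16 accumulator rounds after every add ...
        acc_fp16 = to_fp16(acc_fp16 + step)
        # ... while a higher-precision accumulator (FP32 in TRT, double here) does not.
        acc_hi += step

    print(acc_fp16)  # stalls far below the true sum once ulp(acc) > 2 * step
    print(acc_hi)    # close to the true sum of ~999.76

With ``use_fp32_acc=True``, Torch-TensorRT keeps matmul accumulation in FP32 (as PyTorch does), avoiding exactly this kind of drift while still storing weights and activations in FP16.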


.. GENERATED FROM PYTHON SOURCE LINES 75-77

Decode the output sentences of PyTorch and TensorRT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 77-102

.. code-block:: python

    print("=============================")
    print(
        "Pytorch model generated text: ",
        tokenizer.batch_decode(
            pyt_gen_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0],
    )
    print("=============================")
    print(
        "TensorRT model generated text: ",
        tokenizer.batch_decode(
            trt_gen_tokens,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )[0],
    )


    # Prompt : What is dynamic programming?

    # =============================
    # Pytorch model generated text: Dynamic programming is an algorithmic technique used to solve complex problems by breaking them down into smaller subproblems, solving each subproblem only once, and

    # =============================
    # TensorRT model generated text: Dynamic programming is an algorithmic technique used to solve complex problems by breaking them down into smaller subproblems, solving each subproblem only once, and


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.000 seconds)


.. _sphx_glr_download_tutorials__rendered_examples_dynamo_torch_export_llama2.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example




    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: torch_export_llama2.py <torch_export_llama2.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: torch_export_llama2.ipynb <torch_export_llama2.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
