:orphan:

.. _torch_tensorrt_tutorials:

Torch-TensorRT Tutorials
===========================


.. raw:: html

    <div class="sphx-glr-thumbnails">


.. raw:: html

    </div>


Here we provide examples of Torch-TensorRT compilation of popular computer vision and language models.

Dependencies
------------------------------------

Install the following external dependencies. This assumes that compatible versions of the ``torch``, ``torch_tensorrt`` and ``tensorrt`` libraries are already installed (see `dependencies <https://github.com/pytorch/TensorRT?tab=readme-ov-file#dependencies>`_):

.. code-block:: bash

    pip install -r requirements.txt

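As a quick sanity check before running the examples, the sketch below reports which of the three core packages are visible in the current environment. It uses only the Python standard library. The helper name ``installed_versions`` is illustrative and not part of Torch-TensorRT, and the distribution names are an assumption that may differ for some wheels.

.. code-block:: python

    from importlib import metadata

    # Distribution names assumed by this sketch; adjust them if your
    # environment uses differently named wheels.
    CORE_PACKAGES = ("torch", "torch_tensorrt", "tensorrt")


    def installed_versions(packages=CORE_PACKAGES):
        """Map each distribution name to its version, or None if it is missing."""
        versions = {}
        for name in packages:
            try:
                versions[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                versions[name] = None
        return versions


    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
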

Model Zoo
------------------------------------
* :ref:`torch_compile_resnet`: Compiling a ResNet model using the Torch Compile Frontend for ``torch_tensorrt.compile``
* :ref:`torch_compile_transformer`: Compiling a Transformer model using ``torch.compile``
* :ref:`torch_compile_stable_diffusion`: Compiling a Stable Diffusion model using ``torch.compile``
* :ref:`torch_compile_gpt2`: Compiling a GPT2 model using ``torch.compile``
* :ref:`torch_export_gpt2`: Compiling a GPT2 model using the AOT workflow (``ir=dynamo``)
* :ref:`torch_export_llama2`: Compiling a Llama2 model using the AOT workflow (``ir=dynamo``)
* :ref:`torch_export_sam2`: Compiling a SAM2 model using the AOT workflow (``ir=dynamo``)

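The entries above use two different entry points: the ``torch.compile`` backend, which builds TensorRT engines lazily on the first call, and the ahead-of-time workflow selected with ``ir="dynamo"``. The following is a minimal sketch of the difference, not a replacement for the linked examples. The wrapper names ``compile_jit`` and ``compile_aot`` are hypothetical, and both functions need a CUDA-capable environment to actually run.

.. code-block:: python

    def compile_jit(model):
        """torch.compile workflow: engines are built on the first forward call."""
        import torch
        import torch_tensorrt  # noqa: F401  (importing registers the backend)

        return torch.compile(model, backend="torch_tensorrt")


    def compile_aot(model, example_inputs):
        """AOT workflow: the model is compiled before it is ever called."""
        import torch_tensorrt

        return torch_tensorrt.compile(model, ir="dynamo", inputs=example_inputs)

Here ``example_inputs`` would be a list of tensors (or ``torch_tensorrt.Input`` specifications) describing the shapes the model is compiled for.
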

.. raw:: html

    <div class="sphx-glr-thumbnails">


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="This interactive script is intended as a sample of the Torch-TensorRT workflow with torch.compi...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_torch_compile_stable_diffusion_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_torch_compile_stable_diffusion.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Compiling Stable Diffusion model using the torch.compile backend</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="Cross runtime compilation for windows example =================================================...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_cross_runtime_compilation_for_windows_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_cross_runtime_compilation_for_windows.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">cross runtime compilation limitations:</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="Compilation is an expensive operation as it involves many graph transformations, translations a...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_refit_engine_example_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_refit_engine_example.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Refitting Torch-TensorRT Programs with New Weights</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="This interactive script is intended as a sample of the Torch-TensorRT workflow with torch.compi...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_torch_compile_transformers_example_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_torch_compile_transformers_example.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Compiling BERT using the torch.compile backend</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="This example illustrates the state of the art model GPT2 optimized using torch.compile frontend...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_torch_compile_gpt2_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_torch_compile_gpt2.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Compiling GPT2 using the Torch-TensorRT torch.compile frontend</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="This interactive script is intended as an overview of the process by which torch_tensorrt.compi...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_torch_compile_advanced_usage_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_torch_compile_advanced_usage.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Torch Compile Advanced Usage</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="This interactive script is intended as an overview of the process by which the Torch-TensorRT C...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_torch_export_cudagraphs_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_torch_export_cudagraphs.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Torch Export with Cudagraphs</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="Small caching example on BERT.">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_engine_caching_bert_example_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_engine_caching_bert_example.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Engine Caching (BERT)</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="The TensorRT runtime module acts as a wrapper around a PyTorch model (or subgraph) that has bee...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_pre_allocated_output_example_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_pre_allocated_output_example.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Pre-allocated output buffer</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="We are going to demonstrate how we can easily use Mutable Torch TensorRT Module to compile, int...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_mutable_torchtrt_module_example_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_mutable_torchtrt_module_example.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Mutable Torch TensorRT Module</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="This interactive script is intended as a sample of the Torch-TensorRT workflow with torch.compi...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_torch_compile_resnet_example_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_torch_compile_resnet_example.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Compiling ResNet with dynamic shapes using the torch.compile backend</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="This script illustrates Torch-TensorRT workflow with dynamo backend on popular GPT2 model.">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_torch_export_gpt2_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_torch_export_gpt2.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Compiling GPT2 using the dynamo backend</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="This script illustrates Torch-TensorRT workflow with dynamo backend on popular Llama2 model.">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_torch_export_llama2_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_torch_export_llama2.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Compiling Llama2 using the dynamo backend</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="We are going to demonstrate how to automatically generate a converter for a custom kernel using...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_auto_generate_converters_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_auto_generate_converters.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Automatically Generate a Converter for a Custom Kernel</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="If for some reason you want to change the conversion behavior of a specific PyTorch operation t...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_converter_overloading_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_converter_overloading.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Overloading Torch-TensorRT Converters with Custom Converters</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="Weight streaming in TensorRT is a powerful feature designed to overcome GPU memory limitations ...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_weight_streaming_example_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_weight_streaming_example.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Weight Streaming</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="This example illustrates the state of the art model Segment Anything Model 2 (SAM2) optimized u...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_torch_export_sam2_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_torch_export_sam2.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Compiling SAM2 using the dynamo backend</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="Here we demonstrate how to deploy a model quantized to INT8 or FP8 using the Dynamo frontend of...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_vgg16_ptq_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_vgg16_ptq.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Deploy Quantized Models using Torch-TensorRT</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="As model sizes increase, the cost of compilation will as well. With AOT methods like torch.dyna...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_engine_caching_example_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_engine_caching_example.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Engine Caching</div>
    </div>


.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="We are going to demonstrate how a developer could include a custom kernel in a TensorRT engine ...">

.. only:: html

  .. image:: /tutorials/_rendered_examples/dynamo/images/thumb/sphx_glr_custom_kernel_plugins_thumb.png
    :alt:

  :ref:`sphx_glr_tutorials__rendered_examples_dynamo_custom_kernel_plugins.py`

.. raw:: html

      <div class="sphx-glr-thumbnail-title">Using Custom Kernels within TensorRT Engines with Torch-TensorRT</div>
    </div>


.. raw:: html

    </div>


Serving a Torch-TensorRT model with Triton
==========================================

Optimization and deployment go hand in hand in any discussion of machine
learning infrastructure. Once network-level optimizations have been applied
to get the maximum performance, the next step is to deploy the model.

However, serving the optimized model comes with its own set of considerations
and challenges, such as building infrastructure to support concurrent model
executions and supporting clients over HTTP or gRPC.

The `Triton Inference Server <https://github.com/triton-inference-server/server>`__
addresses these challenges and more. Let's walk step by step through the process of
optimizing a model with Torch-TensorRT, deploying it on Triton Inference
Server, and building a client to query the model.

Step 1: Optimize your model with Torch-TensorRT
-----------------------------------------------

Most Torch-TensorRT users will be familiar with this step. For the purpose of
this demonstration, we will use a ResNet50 model from Torch Hub.

We will be working in the ``//examples/triton`` directory which contains the scripts used in this tutorial.

First, pull the `NGC PyTorch Docker container <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`__. You may need to create
an account and get an API key from `here <https://ngc.nvidia.com/setup/>`__.
Sign up and log in with your key (follow the instructions
`here <https://ngc.nvidia.com/setup/api-key>`__ after signing up).

::

   # <xx.xx> is the yy.mm publishing tag for NVIDIA's PyTorch
   # container, e.g. 24.08
   # NOTE: Use the same publishing tag for both the PyTorch container and the Triton containers

   docker run -it --gpus all -v ${PWD}:/scratch_space nvcr.io/nvidia/pytorch:<xx.xx>-py3
   cd /scratch_space

With the container running, we can export the model into the correct directory in our Triton model repository. The export script uses the **Dynamo** frontend for Torch-TensorRT to compile the PyTorch model to TensorRT. We then save the model using **TorchScript** as the serialization format, which is supported by Triton.

::

  import torch
  import torch_tensorrt

  # Skip torch.hub's forked-repo validation
  torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

  # Load the pretrained model in eval mode on the GPU
  model = torch.hub.load("pytorch/vision:v0.10.0", "resnet50", pretrained=True).eval().to("cuda")

  # Compile with Torch-TensorRT
  trt_model = torch_tensorrt.compile(
      model,
      inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
      enabled_precisions={torch_tensorrt.dtype.f16},
  )

  # Trace the compiled module to obtain a TorchScript module
  ts_trt_model = torch.jit.trace(trt_model, torch.rand(1, 3, 224, 224).to("cuda"))

  # Save the model into the Triton model repository
  torch.jit.save(ts_trt_model, "/triton_example/model_repository/resnet50/1/model.pt")

You can run the script with the following command (from ``//examples/triton``):

::

  docker run --gpus all -it --rm -v ${PWD}:/triton_example nvcr.io/nvidia/pytorch:YY.MM-py3 python /triton_example/export.py

This will save the serialized TorchScript version of the ResNet model in the right directory in the model repository.

Step 2: Set Up Triton Inference Server
--------------------------------------

If you are new to the Triton Inference Server and want to learn more, we
highly recommend checking out our `GitHub
Repository <https://github.com/triton-inference-server>`__.

To use Triton, we need to make a model repository. A model repository, as the
name suggests, is a repository of the models the inference server hosts. While
Triton can serve models from multiple repositories, in this example we will
discuss the simplest possible form of the model repository.

The structure of this repository should look something like this:

::

   model_repository
   |
   +-- resnet50
       |
       +-- config.pbtxt
       +-- 1
           |
           +-- model.pt

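The versioned directory has to exist before the export script can write ``model.pt`` into it. A minimal sketch for creating the layout shown above follows; the helper name ``make_model_dir`` is hypothetical and only the standard library is used.

.. code-block:: python

    from pathlib import Path


    def make_model_dir(repo_root, model_name="resnet50", version=1):
        """Create <repo_root>/<model_name>/<version>/ and return that directory."""
        version_dir = Path(repo_root) / model_name / str(version)
        # parents=True creates the intermediate directories as needed;
        # exist_ok=True makes the call safe to repeat.
        version_dir.mkdir(parents=True, exist_ok=True)
        return version_dir

For example, ``make_model_dir("model_repository")`` returns the directory into which ``model.pt`` is saved, while ``config.pbtxt`` goes one level up, next to the version directory.
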
There are two files that Triton requires to serve the model: the model itself
and a model configuration file which is typically provided in ``config.pbtxt``.
For the model we prepared in step 1, the following configuration can be used:

::

    name: "resnet50"
    backend: "pytorch"
    max_batch_size : 0
    input [
      {
        name: "x"
        data_type: TYPE_FP32
        dims: [ 1, 3, 224, 224 ]
      }
    ]
    output [
      {
        name: "output0"
        data_type: TYPE_FP32
        dims: [1, 1000]
      }
    ]

The ``config.pbtxt`` file is used to describe the exact model configuration
with details like the names and shapes of the input and output layer(s),
datatypes, scheduling and batching details and more. If you are new to Triton,
we highly encourage you to check out this `section of our
documentation <https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md>`__
for more details.

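When the input or output shapes change, it can be convenient to generate the configuration instead of editing it by hand. The sketch below renders the same fields as the configuration above for a single-input, single-output model. The function ``render_config`` is a hypothetical helper, not part of Triton, and it covers only the fields used in this tutorial.

.. code-block:: python

    def render_config(name, input_name, input_dims, output_name, output_dims,
                      data_type="TYPE_FP32", backend="pytorch"):
        """Return config.pbtxt text for a single-input, single-output model."""

        def dims(values):
            return "[ " + ", ".join(str(v) for v in values) + " ]"

        def tensor(kind, tensor_name, tensor_dims):
            return [
                f"{kind} [",
                "  {",
                f'    name: "{tensor_name}"',
                f"    data_type: {data_type}",
                f"    dims: {dims(tensor_dims)}",
                "  }",
                "]",
            ]

        lines = [f'name: "{name}"', f'backend: "{backend}"', "max_batch_size: 0"]
        lines += tensor("input", input_name, input_dims)
        lines += tensor("output", output_name, output_dims)
        return "\n".join(lines) + "\n"


    print(render_config("resnet50", "x", (1, 3, 224, 224), "output0", (1, 1000)))

The returned text can be written to ``model_repository/resnet50/config.pbtxt``.
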
With the model repository set up, we can launch the Triton server
with the docker command below. Refer to `this page <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver>`__ for the pull tag for the container.

::

   # Make sure that the TensorRT version in the Triton container
   # and the TensorRT version in the environment used to optimize the model
   # are the same. As a rule of thumb, containers with the same publishing tag
   # should ship the same TensorRT version.

   docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:YY.MM-py3 tritonserver --model-repository=/triton_example/model_repository

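The server needs some time to load the model before it accepts requests. One way to wait for it is to poll the HTTP readiness endpoint ``/v2/health/ready`` on port 8000, which is the HTTP port published in the command above. The following is a minimal standard-library sketch; the function name ``wait_until_ready`` and the default timeout values are assumptions made for illustration.

.. code-block:: python

    import time
    import urllib.error
    import urllib.request


    def wait_until_ready(base_url="http://localhost:8000", timeout_s=60.0, interval_s=1.0):
        """Poll <base_url>/v2/health/ready until it answers HTTP 200 or time runs out."""
        deadline = time.monotonic() + timeout_s
        while True:
            try:
                url = f"{base_url}/v2/health/ready"
                with urllib.request.urlopen(url, timeout=interval_s) as response:
                    if response.status == 200:
                        return True
            except (urllib.error.URLError, OSError):
                pass  # not reachable or not ready yet; retry below
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)

Calling ``wait_until_ready()`` before sending the first inference request avoids connection errors while the server is still starting.
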
This should spin up a Triton Inference Server. The next step is to build a simple
HTTP client to query the server.

Step 3: Building a Triton Client to Query the Server
----------------------------------------------------

Before proceeding, make sure to have a sample image on hand. If you don't
have one, download an example image to test inference. In this section, we
will go over a very basic client. For a variety of more fleshed-out
examples, refer to the `Triton Client Repository <https://github.com/triton-inference-server/client/tree/main/src/python/examples>`__.

::

   wget  -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"

We then need to install the dependencies for building a Python client. These will
change from client to client. For a full list of all languages supported by Triton,
please refer to `Triton's client repository <https://github.com/triton-inference-server/client>`__.

::

   pip install torchvision
   pip install attrdict
   pip install nvidia-pyindex
   pip install tritonclient[all]

Let's jump into the client. Firstly, we write a small preprocessing function to
resize and normalize the query image.

::

  import numpy as np
  from torchvision import transforms
  from PIL import Image
  import tritonclient.http as httpclient
  from tritonclient.utils import triton_to_np_dtype

  # preprocessing function
  def rn50_preprocess(img_path="/triton_example/img1.jpg"):
      img = Image.open(img_path)
      preprocess = transforms.Compose(
          [
              transforms.Resize(256),
              transforms.CenterCrop(224),
              transforms.ToTensor(),
              transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
          ]
      )
      # Add a batch dimension and convert to a NumPy array for the Triton client
      return preprocess(img).unsqueeze(0).numpy()

  transformed_img = rn50_preprocess()

Building a client involves three basic steps. Firstly, we set up a connection
with the Triton Inference Server.

::

   # Setting up client
   client = httpclient.InferenceServerClient(url="localhost:8000")

Secondly, we specify the names of the input and output layer(s) of our model. These can be obtained during export and should already be specified in your ``config.pbtxt``.

::

   inputs = httpclient.InferInput("x", transformed_img.shape, datatype="FP32")
   inputs.set_data_from_numpy(transformed_img, binary_data=True)

   outputs = httpclient.InferRequestedOutput("output0", binary_data=True, class_count=1000)

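If you are unsure about the tensor names, the running server can report them: the HTTP client's ``get_model_metadata("resnet50")`` call returns a dictionary describing the model's inputs and outputs. The sketch below condenses such a dictionary into name-to-(datatype, shape) mappings. The helper ``summarize_io`` is hypothetical, and the sample dictionary is written by hand to mirror the ``config.pbtxt`` from Step 2 rather than captured from a live server.

.. code-block:: python

    def summarize_io(metadata):
        """Return ({input name: (datatype, shape)}, {output name: (datatype, shape)})."""

        def collect(entries):
            return {e["name"]: (e["datatype"], tuple(e["shape"])) for e in entries}

        return collect(metadata.get("inputs", [])), collect(metadata.get("outputs", []))


    # Hand-written sample that mirrors the config.pbtxt used in this tutorial.
    sample_metadata = {
        "name": "resnet50",
        "inputs": [{"name": "x", "datatype": "FP32", "shape": [1, 3, 224, 224]}],
        "outputs": [{"name": "output0", "datatype": "FP32", "shape": [1, 1000]}],
    }

    model_inputs, model_outputs = summarize_io(sample_metadata)
    print(model_inputs)
    print(model_outputs)

With a live server, ``summarize_io(client.get_model_metadata("resnet50"))`` would give the names to pass to ``InferInput`` and ``InferRequestedOutput``.
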
Lastly, we send an inference request to the Triton Inference Server.

::

   # Querying the server
   results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
   inference_output = results.as_numpy('output0')
   print(inference_output[:5])

The output should look similar to the following:

::

   [b'12.468750:90' b'11.523438:92' b'9.664062:14' b'8.429688:136'
    b'8.234375:11']

The output format here is ``<confidence_score>:<classification_index>``.
To learn how to map these to the label names and more, refer to Triton Inference Server's
`documentation <https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_classification.md>`__.

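To work with these results in Python, each entry can be split back into a numeric score and a class index. The sketch below does this for the sample output shown above. The function name ``parse_classification`` is hypothetical, and any trailing ``:<label>`` field (returned when the model provides labels) is ignored.

.. code-block:: python

    def parse_classification(entries):
        """Convert '<score>:<index>' items (bytes or str) into (score, index) tuples."""
        parsed = []
        for entry in entries:
            text = entry.decode("utf-8") if isinstance(entry, bytes) else str(entry)
            # Keep only the first two fields; an optional label may follow.
            score, index = text.split(":")[:2]
            parsed.append((float(score), int(index)))
        return parsed


    sample = [b"12.468750:90", b"11.523438:92", b"9.664062:14",
              b"8.429688:136", b"8.234375:11"]
    for score, index in parse_classification(sample):
        print(f"class {index}: score {score:.3f}")

In the client above, ``parse_classification(inference_output[:5])`` would yield the five leading predictions as numeric pairs.
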
You can try out this client quickly using:

::

  # Remember to use the same publishing tag for all steps (e.g. 24.08)

  docker run -it --net=host -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:YY.MM-py3-sdk bash -c "pip install torchvision && python /triton_example/client.py"



.. raw:: html

    <div class="sphx-glr-thumbnails">


.. raw:: html

    </div>


.. toctree::
   :hidden:
   :includehidden:


   /tutorials/_rendered_examples//dynamo/index.rst
   /tutorials/_rendered_examples//triton/index.rst


.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-gallery

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download all examples in Python source code: _rendered_examples_python.zip </tutorials/_rendered_examples/_rendered_examples_python.zip>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download all examples in Jupyter notebooks: _rendered_examples_jupyter.zip </tutorials/_rendered_examples/_rendered_examples_jupyter.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
