Compare commits


63 Commits

Author SHA1 Message Date
Gaurav Shukla
4b1a0b43ff [WEB] Remove long prompts support
Removes support for long prompts due to the higher lag when loading them.

Signed-Off-by: Gaurav Shukla <gaurav@nod-labs>
2022-11-03 18:57:58 +05:30
Gaurav Shukla
099f2160c3 [WEB] fix background color
Signed-Off-by: Gaurav Shukla
2022-11-03 17:36:24 +05:30
Gaurav Shukla
9d2d62dedf [WEB] Add support for long prompts (#467) 2022-11-03 03:27:36 -07:00
Gaurav Shukla
15ed05b221 [WEB] Update the title (#466) 2022-11-02 14:30:03 -07:00
Gaurav Shukla
7c825fc288 [WEB] CSS changes to the web-ui (#465)
This commit updates the UI styling.

Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>

Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-11-02 12:36:11 -07:00
Gaurav Shukla
88f8718635 [WEB] Load prompts from json
The prompt examples are now loaded from a JSON file, `prompts.json`.

Signed-Off-by: Gaurav Shukla
2022-11-02 20:52:34 +05:30
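As an aside, a minimal sketch of how such a `prompts.json` could be consumed at startup (the file layout and the `prompts` key below are assumptions for illustration, not taken from the repository):

```python
# Hypothetical sketch: load prompt examples from prompts.json at startup.
# The file layout and key name are assumptions, not the repo's actual format.
import json
from pathlib import Path


def load_prompt_examples(path: str = "prompts.json") -> list[str]:
    """Return the list of example prompts, or an empty list if the file is missing."""
    prompts_file = Path(path)
    if not prompts_file.exists():
        return []
    with prompts_file.open() as f:
        data = json.load(f)
    # Assumes a top-level {"prompts": ["...", ...]} structure.
    return data.get("prompts", [])


examples = load_prompt_examples()
```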
Prashant Kumar
a081733a42 Add the clip text shark_model. (#458) 2022-11-02 00:08:33 -07:00
Gaurav Shukla
06ccfb0533 [WEB] Load vae and unet during server start up
The VAE and UNet models (both fp16 and fp32 variants) can be loaded at
server startup to reduce web response time.

Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-11-01 23:11:52 +05:30
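For illustration, a rough sketch of the preload-at-startup idea, assuming a hypothetical `load_compiled_model` helper rather than any actual SHARK API:

```python
# Minimal sketch: load models once before the server handles requests,
# so requests do not pay the load cost. `load_compiled_model` is a
# hypothetical placeholder, not a SHARK function.
PRELOADED_MODELS = {}


def load_compiled_model(name: str, precision: str):
    """Placeholder for whatever actually loads/compiles the module."""
    ...


def preload_models():
    for name in ("vae", "unet"):
        for precision in ("fp16", "fp32"):
            PRELOADED_MODELS[(name, precision)] = load_compiled_model(name, precision)


# Called once at server startup.
preload_models()
```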
Gaurav Shukla
b18d75e3f7 [WEB] Use tuned version of UNET fp16
This commit updates the SD script to use the tuned version of the UNet
fp16 model.

Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-11-01 19:00:21 +05:30
Quinn Dawkins
3e7efaa048 Switch stable diffusion to the new tuned model (#455) 2022-10-31 15:15:31 -07:00
Gaurav Shukla
a3fdfc81db [WEB] Minor changes in the shark web (#454)
1. Default steps = 50.
2. Live preview yields an intermediate image every 5 steps (see the sketch after this entry).
3. Add logs to .gitignore

Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>

Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-10-31 14:29:00 -07:00
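A generic sketch of the "intermediate image every 5 steps" behaviour described above; the callbacks are placeholders supplied by the caller, and this is not SHARK's actual code:

```python
# Illustrative sketch only: yield a decoded image every N denoising steps
# so a web UI can show a live preview, plus the final image at the end.
def run_with_preview(denoise_step, decode_to_image, latents,
                     num_steps: int = 50, preview_every: int = 5):
    """Generator yielding (step, image) pairs every `preview_every` steps."""
    for step in range(1, num_steps + 1):
        latents = denoise_step(latents, step)
        if step % preview_every == 0:
            yield step, decode_to_image(latents)
    # Always yield the final image as well.
    yield num_steps, decode_to_image(latents)
```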
Gaurav Shukla
f4c91df1df [WEB] Add pillow dependency (#453)
Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>

Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-10-31 12:57:21 -07:00
Prashant Kumar
32e1ba8c0d Adding batch_size support for stable diffusion. 2022-11-01 00:57:52 +05:30
Gaurav Shukla
1939376d72 [WEB] Cache model parameters (#452)
This commit caches some of the model parameters to reduce the response
time of the SHARK web UI.

Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>

Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-10-31 11:55:10 -07:00
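One common way to implement this kind of caching is in-process memoization; the sketch below is a hypothetical illustration, not the commit's code:

```python
# Hedged sketch: cache expensive model artifacts between web requests with
# functools.lru_cache. The loader below is a placeholder, not SHARK code.
from functools import lru_cache


def _load_params_from_disk(model_name: str, precision: str):
    """Placeholder for the actual (slow) parameter load."""
    ...


@lru_cache(maxsize=None)
def get_model_params(model_name: str, precision: str):
    """Load parameters once per (model, precision) pair and reuse them."""
    return _load_params_from_disk(model_name, precision)
```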
Gaurav Shukla
25931d48a3 [WEB] Update stable diffusion UI and enable live preview (#447)
This commit enables the live preview feature and also updates the stable
diffusion web UI.

Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>

Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-10-31 04:10:15 -07:00
powderluv
024c5e153a Update Windows in README 2022-10-30 22:27:03 -07:00
powderluv
83f34b645d Add Windows instructions 2022-10-30 22:25:42 -07:00
powderluv
3f9f450e0d Add setup_venv.ps1 for windows (#448)
PowerShell users can run ./setup_venv.ps1 to set up the env
2022-10-30 22:17:35 -07:00
powderluv
fd89b06641 Drop RDNA1 for now 2022-10-29 14:29:09 -07:00
Gaurav Shukla
f8dc996004 Update vulkan-target-triple for Radeon devices. (#446)
Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>

Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-10-29 14:27:20 -07:00
Phaneesh Barwaria
e6a964088b Add os agnostic vulkan device name check (#445) 2022-10-29 13:19:14 -07:00
Gaurav Shukla
e3e767c7eb [WEB] Remove live preview and disable resnet|albert_maskfill
This commit removes the live preview feature for now, as it's not functional.
The feature will be added back in a later patch.

Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-10-30 00:37:59 +05:30
Quinn Dawkins
239c19eb12 Update Stable diffusion script to enable use of tuned models (#443) 2022-10-29 01:42:49 -04:00
Eliasj42
7f37599a60 Added a dispatch benchmarking tool (#441)
To produce benchmarks of individual dispatches, you can add --dispatch_benchmarks=All --dispatch_benchmarks_dir=<output_dir> to your command-line arguments.

Co-authored-by: Elias Joseph <elias@nod-labs.com>
2022-10-28 14:31:03 -07:00
Prashant Kumar
77c9a2c5ea Add profiling vulkan_device info and minor changes to reflect upstream
changes.
2022-10-28 18:02:07 +05:30
Ean Garvey
fd7baae548 Serialize torch-mlir CAPI module as bytecode instead of string. (#435)
* Serialize torch-mlir CAPI as bytecode instead of string.

* Minor fixes to MLIR data handling in SHARK python.
2022-10-27 14:37:15 -05:00
Stanley Winata
01fdf5ee16 [example][SD] compile fp16 with iree-spirv-unify-aliased-resources (#436) 2022-10-27 05:12:28 -07:00
Gaurav Shukla
e52f533c16 [WEB] Save vmfb and add live preview
This commit updates the SD script to save the compiled module and also adds
a live preview of generated images.

Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-10-26 23:20:53 +05:30
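A minimal sketch of the save-and-reuse idea for the compiled module, assuming a generic compile callback rather than the repo's actual flow:

```python
# Hedged sketch: keep the compiled .vmfb on disk and reuse it on later runs
# instead of recompiling. `compile_fn` is a hypothetical callback that
# returns the flatbuffer bytes.
from pathlib import Path


def get_or_compile_vmfb(cache_path: str, compile_fn) -> bytes:
    """Return cached flatbuffer bytes, compiling and saving them if absent."""
    path = Path(cache_path)
    if path.exists():
        return path.read_bytes()
    flatbuffer = compile_fn()
    path.write_bytes(flatbuffer)
    return flatbuffer
```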
Quinn Dawkins
fbd77dc936 Enable iterator space fusion for SD (#432) 2022-10-26 01:08:26 -04:00
Quinn Dawkins
cdc6dd19e3 Force stable diffusion fp16 and fp32 to generate images with similar noise (#431) 2022-10-25 17:28:18 -04:00
PhaneeshB
fd578a48a9 add cli args for vulkan target triple 2022-10-25 21:47:26 +05:30
Ean Garvey
9956099516 Add pytest option for updating tank and fix save_mlir function. (#413)
* Use IREE tf tools to save .mlir modules when generating shark_tank.

* Add option to pytest for enabling auto-updates to local shark tank.

* xfail mobilenet torch on cpu, cuda and fix CI macos setup

* Update test-models.yml to disable macos vulkan CI.
2022-10-25 21:29:18 +05:30
powderluv
f97b8fffed Update README.md 2022-10-24 12:51:49 -07:00
Gaurav Shukla
7b9e309724 [WEB] Expose SD parameters in the web ui (#427) 2022-10-24 04:34:35 -07:00
Quinn Dawkins
1d33913d48 Add option to save and load precompiled flatbuffer (#425) 2022-10-23 16:24:09 -07:00
Prashant Kumar
a48eaaed20 Pass the flags to vae. 2022-10-23 23:57:48 +05:30
Prashant Kumar
2741b8be53 Pass the flags to vae. (#422) 2022-10-23 11:23:13 -07:00
Anush Elangovan
4f906a265c Fix lint 2022-10-22 12:43:52 -07:00
Anush Elangovan
0dff8d7af0 Simple download script to prime the hf model cache 2022-10-21 17:42:05 -07:00
Quinn Dawkins
4f0d0d8167 Update vulkan gui README for iree-vulkan-gui + Stable Diffusion (#399) 2022-10-21 14:02:40 -04:00
Vivek Khandelwal
d513060b21 Add params for Stable Diffusion (#420) 2022-10-21 23:11:09 +05:30
Prashant Kumar
d1a25ce4f3 Update stable_args.py 2022-10-21 17:26:31 +05:30
Gaurav Shukla
51c98695b2 [WEB] Update stable diffusion inference
This commit updates the stable diffusion web UI, incorporating the latest
improvements.

Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>
2022-10-21 01:26:38 +05:30
Quinn Dawkins
b448770ec2 Add ms/iter timing for stable diffusion script (#414) 2022-10-20 13:32:37 -04:00
Prashant Kumar
5fe22a7980 Minor fix. 2022-10-20 22:57:22 +05:30
Prashant Kumar
38ae6b5af4 Add stable_diffusion fp16 and fp32 with args. 2022-10-20 21:47:11 +05:30
Ean Garvey
0bfe30d75d Fix issues with extra_args in benchmarks, pin tf==2.10 (#411) 2022-10-20 06:55:26 -07:00
Quinn Dawkins
7be1d7d0be Add option for extra arguments through SharkInference.compile (#408) 2022-10-19 15:32:48 -05:00
Prashant Kumar
0d74c873f0 Add stable_diff_f16 version. (#407) 2022-10-19 10:04:24 -07:00
powderluv
139aff2938 Update nightly.yml
fix links
2022-10-18 23:42:22 -07:00
anush elangovan
a3f733490c Force update of packages
Pickup tools from upstream IREE
2022-10-19 05:20:53 +00:00
anush elangovan
8a11f138d1 Update SHARK-Runtime releases page 2022-10-19 05:06:36 +00:00
Ean Garvey
3405607917 (TESTING) Fix .whl assets path (#404) 2022-10-14 12:13:14 -05:00
Ean Garvey
7c99a6bd33 Update README.md (#406) 2022-10-13 20:29:49 -05:00
Ean Garvey
3fba8ce0e6 Update README.md (#405) 2022-10-13 12:43:03 -07:00
Ean Garvey
f3bde3c7fc Cleanup tank directory and move instructions to tank/README.md (#401) 2022-10-13 12:20:02 -05:00
Phaneesh Barwaria
21fee8ef33 enable only one workflow job per branch (#402) 2022-10-13 12:15:30 -05:00
Vivek Khandelwal
0e217d6180 Add Stable Diffusion Img2Img model script 2022-10-13 21:56:46 +05:30
Phaneesh Barwaria
00a8ce75d1 Xfail vulkan tests and Enable MacOs test on CI (#383) 2022-10-13 11:14:41 -05:00
Quinn Dawkins
8f3f00cd99 Add iree-run-module like tool for running in a vulkan session (#398) 2022-10-12 20:46:26 -04:00
Ean Garvey
13bae2538a Update URL for IREE compiler/runtime install (#397)
* Update URL for IREE compiler/runtime install

* Update gh-pages-releases.yml

* Update test_models.py

* Update assets path
2022-10-12 15:47:11 -05:00
Ean Garvey
f508c80c23 Add workflow for GH pages releases and release scraping script. (#394)
* Add workflow for GH pages releases and release scraping script.

* Update test_models.py and change tokens for gh pages.
2022-10-11 22:03:33 -05:00
gpetters94
53df0620e3 Add OPT to tank (#214) 2022-10-11 11:03:56 -05:00
94 changed files with 5940 additions and 881 deletions

.github/workflows/gh-pages-releases.yml (new file, 37 changed lines)

@@ -0,0 +1,37 @@
# See: https://github.com/llvm/torch-mlir/issues/1374
name: Publish releases page
on:
workflow_dispatch:
jobs:
scrape_and_publish_releases:
name: "Scrape and publish releases"
runs-on: ubuntu-latest
# Don't run this in everyone's forks.
if: github.repository == 'nod-ai/SHARK'
steps:
- name: Checking out repository
uses: actions/checkout@v2
with:
token: ${{ secrets.NODAI_INVOCATION_TOKEN }}
- name: Run scrape releases script
run: python ./build_tools/scrape_releases.py nod-ai SHARK > /tmp/index.html
shell: bash
- run: git fetch --all
- run: git switch github-pages
- run: git config --global user.email "none@none.com"
- run: git config --global user.name "nod-ai"
- run: mv /tmp/index.html package-index/index.html
- run: git add package-index/index.html
# Only try to make a commit if the file has changed.
- run: git diff --cached --exit-code || git commit -m "Update releases."
- name: GitHub Push
uses: ad-m/github-push-action@v0.6.0
with:
github_token: ${{ secrets.NODAI_INVOCATION_TOKEN }}
branch: github-pages


@@ -39,6 +39,10 @@ jobs:
tag_name="${package_version}"
echo "package_version=${package_version}" >> $GITHUB_ENV
echo "tag_name=${tag_name}" >> $GITHUB_ENV
- name: Set Environment Variables
run: |
echo "SHORT_SHA=`git rev-parse --short=4 HEAD`" >> $GITHUB_ENV
echo "DATE=$(date +'%Y-%m-%d')" >> $GITHUB_ENV
- name: Create Release
id: create_release
uses: actions/create-release@v1
@@ -51,17 +55,11 @@ jobs:
Automatic snapshot release of nod.ai SHARK.
draft: true
prerelease: false
- name: Find Torch-MLIR Release
run: |
TM_HTML_URL="$(python3 -c "import urllib.request, json, sys; u=json.loads(urllib.request.urlopen('https://api.github.com/repos/llvm/torch-mlir/releases/latest').read().decode()).get('html_url', False); print(u) if u else sys.exit(1);")"
TM_RELEASE_DIR=${TM_HTML_URL/"tag"/"expanded_assets"}
echo "TM_RELEASE_DIR=${TM_RELEASE_DIR}" >> $GITHUB_ENV
- name: Install dependencies
run: |
echo "Torch-MLIR Release DIR is ${{ env.TM_RELEASE_DIR }}"
python -m pip install --upgrade pip
python -m pip install flake8 pytest toml
if [ -f requirements.txt ]; then pip install -r requirements.txt -f ${{ env.TM_RELEASE_DIR }} -f https://github.com/nod-ai/SHARK-Runtime/releases; fi
if [ -f requirements.txt ]; then pip install -r requirements.txt -f https://llvm.github.io/torch-mlir/package-index/ -f https://nod-ai.github.io/SHARK-Runtime/pip-release-links.html; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
@@ -76,19 +74,19 @@ jobs:
source iree.venv/bin/activate
package_version="$(printf '%(%Y%m%d)T.${{ github.run_number }}')"
SHARK_PACKAGE_VERSION=${package_version} \
pip wheel -v -w wheelhouse . --pre -f https://download.pytorch.org/whl/nightly/torch -f ${{ env.TM_RELEASE_DIR }} -f https://github.com/iree-org/iree/releases
pip wheel -v -w wheelhouse . --pre -f https://download.pytorch.org/whl/nightly/torch -f https://llvm.github.io/torch-mlir/package-index/ -f https://iree-org.github.io/iree/pip-release-links.html
# Install the built wheel
pip install ./wheelhouse/nodai*
# Validate the Models
/bin/bash "$GITHUB_WORKSPACE/build_tools/populate_sharktank_ci.sh"
pytest tank/test_models.py |
pytest --ci --ci_sha=${SHORT_SHA} --local_tank_cache="./gen_shark_tank/" tank/test_models.py |
tail -n 1 |
tee -a pytest_results.txt
if !(grep -Fxq " failed" pytest_results.txt)
then
export SHA=$(git log -1 --format='%h')
gsutil -m cp -r $GITHUB_WORKSPACE/gen_shark_tank/* gs://shark_tank/$SHA
gsutil -m cp -r gs://shark_tank/$SHA/* gs://shark_tank/latest/
gsutil -m cp -r $GITHUB_WORKSPACE/gen_shark_tank/* gs://shark_tank/${DATE}_$SHA
gsutil -m cp -r gs://shark_tank/${DATE}_$SHA/* gs://shark_tank/latest/
fi
rm -rf ./wheelhouse/nodai*
@@ -100,17 +98,14 @@ jobs:
source shark.venv/bin/activate
package_version="$(printf '%(%Y%m%d)T.${{ github.run_number }}')"
SHARK_PACKAGE_VERSION=${package_version} \
pip wheel -v -w wheelhouse . --pre -f https://download.pytorch.org/whl/nightly/torch -f ${{ env.TM_RELEASE_DIR }} -f https://github.com/nod-ai/SHARK-Runtime/releases
pip wheel -v -w wheelhouse . --pre -f https://download.pytorch.org/whl/nightly/torch -f https://llvm.github.io/torch-mlir/package-index/ -f https://nod-ai.github.io/SHARK-Runtime/pip-release-links.html
# Install the built wheel
pip install ./wheelhouse/nodai*
# Validate the Models
pytest tank/test_models.py |
pytest --ci --ci_sha=${SHORT_SHA} tank/test_models.py |
tail -n 1 |
tee -a pytest_results.txt
publish:
runs-on: a100
needs: build
steps:
- name: Upload Release Assets
if: ${{ matrix.backend == 'SHARK' }}
id: upload-release-assets
@@ -119,7 +114,7 @@ jobs:
GITHUB_TOKEN: ${{ secrets.NODAI_INVOCATION_TOKEN }}
with:
release_id: ${{ steps.create_release.outputs.id }}
assets_path: ${GITHUB_WORKSPACE}/wheelhouse/nodai_*.whl
assets_path: ./wheelhouse/nodai_*.whl
- name: Publish Release
if: ${{ matrix.backend == 'SHARK' }}


@@ -10,6 +10,14 @@ on:
branches: [ main ]
workflow_dispatch:
# Ensure that only a single job or workflow using the same
# concurrency group will run at a time. This would cancel
# any in-progress jobs in the same github workflow and github
# ref (e.g. refs/heads/main or refs/pull/<pr_number>/merge).
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
build-validate:
strategy:
@@ -28,12 +36,12 @@ jobs:
suite: cuda
- os: ubuntu-latest
suite: cpu
- os: MacStudio
suite: vulkan
- os: MacStudio
suite: cuda
- os: MacStudio
suite: cpu
- os: MacStudio
suite: vulkan
- os: icelake
suite: vulkan
- os: icelake
@@ -90,7 +98,7 @@ jobs:
cd $GITHUB_WORKSPACE
PYTHON=python${{ matrix.python-version }} BENCHMARK=1 IMPORTER=1 ./setup_venv.sh
source shark.venv/bin/activate
pytest --benchmark --ci --ci_sha=${SHORT_SHA} --local_tank_cache="/data/anush" tank/test_models.py -k cpu
pytest --benchmark --ci --ci_sha=${SHORT_SHA} -s --local_tank_cache="/data/anush/shark_cache" tank/test_models.py -k cpu --update_tank
gsutil cp ./bench_results.csv gs://shark-public/builder/bench_results/${DATE}/bench_results_cpu_${SHORT_SHA}.csv
gsutil cp gs://shark-public/builder/bench_results/${DATE}/bench_results_cpu_${SHORT_SHA}.csv gs://shark-public/builder/bench_results/latest/bench_results_cpu_latest.csv
@@ -100,14 +108,29 @@ jobs:
cd $GITHUB_WORKSPACE
PYTHON=python${{ matrix.python-version }} BENCHMARK=1 IMPORTER=1 ./setup_venv.sh
source shark.venv/bin/activate
pytest --benchmark --ci --ci_sha=${SHORT_SHA} --local_tank_cache="/data/anush" tank/test_models.py -k cuda
pytest --benchmark --ci --ci_sha=${SHORT_SHA} -s --local_tank_cache="/data/anush/shark_cache" tank/test_models.py -k cuda --update_tank
gsutil cp ./bench_results.csv gs://shark-public/builder/bench_results/${DATE}/bench_results_cuda_${SHORT_SHA}.csv
gsutil cp gs://shark-public/builder/bench_results/${DATE}/bench_results_cuda_${SHORT_SHA}.csv gs://shark-public/builder/bench_results/latest/bench_results_cuda_latest.csv
- name: Validate Vulkan Models
if: matrix.suite == 'vulkan'
- name: Validate Vulkan Models (MacOS)
if: matrix.suite == 'vulkan' && matrix.os == 'MacStudio'
run: |
cd $GITHUB_WORKSPACE
PYTHON=python${{ matrix.python-version }} IMPORTER=1 ./setup_venv.sh
source shark.venv/bin/activate
echo "VULKAN SDK PATH wo setup: $VULKAN_SDK"
cd /Users/anush/VulkanSDK/1.3.224.1/
source setup-env.sh
cd $GITHUB_WORKSPACE
echo "VULKAN SDK PATH with setup: $VULKAN_SDK"
echo $PATH
pip list | grep -E "torch|iree"
pytest --ci --ci_sha=${SHORT_SHA} --local_tank_cache="/Volumes/builder/anush/shark_cache" tank/test_models.py -k vulkan --update_tank
- name: Validate Vulkan Models (a100)
if: matrix.suite == 'vulkan' && matrix.os != 'MacStudio'
run: |
cd $GITHUB_WORKSPACE
PYTHON=python${{ matrix.python-version }} BENCHMARK=1 IMPORTER=1 ./setup_venv.sh
source shark.venv/bin/activate
pytest --ci --ci_sha=${SHORT_SHA} --local_tank_cache="/data/anush" tank/test_models.py -k vulkan
pytest --ci --ci_sha=${SHORT_SHA} -s --local_tank_cache="/data/anush/shark_cache" tank/test_models.py -k vulkan --update_tank

.gitignore (4 changed lines)

@@ -167,3 +167,7 @@ shark_tmp/
# ORT related artefacts
cache_models/
onnx_models/
#web logging
web/logs/
web/stored_results/stable_diffusion/

README.md (307 changed lines)

@@ -14,16 +14,16 @@ High Performance Machine Learning and Data Analytics for CPUs, GPUs, Accelerator
## Installation
<details>
<summary>Installation (Linux and macOS)</summary>
<summary>Installation (Linux, macOS and Windows)</summary>
### Setup a new pip Virtual Environment
This step sets up a new VirtualEnv for Python
```shell
python --version #Check you have 3.7->3.10 on Linux or 3.10 on macOS
python --version #Check you have 3.10 on Linux, macOS or Windows Powershell
python -m venv shark_venv
source shark_venv/bin/activate
source shark_venv/bin/activate # Use shark_venv/Scripts/activate on Windows
# If you are using conda create and activate a new conda env
@@ -38,9 +38,14 @@ python -m pip install --upgrade pip
This step pip installs SHARK and related packages on Linux Python 3.7, 3.8, 3.9, 3.10 and macOS Python 3.10
```shell
pip install nodai-shark -f https://github.com/nod-ai/SHARK/releases -f https://github.com/llvm/torch-mlir/releases -f https://github.com/nod-ai/shark-runtime/releases --extra-index-url https://download.pytorch.org/whl/nightly/cpu
pip install nodai-shark -f https://nod-ai.github.io/SHARK/package-index/ -f https://llvm.github.io/torch-mlir/package-index/ -f https://nod-ai.github.io/SHARK-Runtime/pip-release-links.html --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```
If you are on an Intel macOS machine you need this [workaround](https://github.com/nod-ai/SHARK/issues/102) for an upstream issue.
### Run shark tank model tests.
```shell
pytest tank/test_models.py
```
See tank/README.md for a more detailed walkthrough of our pytest suite and CLI.
### Download and run Resnet50 sample
@@ -71,14 +76,41 @@ git clone https://github.com/nod-ai/SHARK.git
```
## Setup your Python VirtualEnvironment and Dependencies
### Windows Users
```shell
# Setup venv and install necessary packages (torch-mlir, nodLabs/Shark, ...).
# Requires Python 3.10 and Powershell
./setup_venv.ps1
shark.venv/Scripts/activate
```
### Linux / macOS Users
```shell
# Setup venv and install necessary packages (torch-mlir, nodLabs/Shark, ...).
./setup_venv.sh
source shark.venv/bin/activate
```
For example if you want to use Python3.10 and upstream IREE with TF Import tools you can use the environment variables like:
### Run a demo script
```shell
python -m shark.examples.shark_inference.resnet50_script --device="cpu" # Use gpu | vulkan
# Or a pytest
pytest tank/test_models.py -k "MiniLM"
```
# PYTHON=python3.10 VENV_DIR=0617_venv IMPORTER=1 USE_IREE=1 ./setup_venv.sh
</details>
<details>
<summary>Development, Testing and Benchmarks</summary>
If you want to use Python 3.10 with the TF import tools, you can set environment variables like:
Set `USE_IREE=1` to use upstream IREE
```
# PYTHON=python3.10 VENV_DIR=0617_venv IMPORTER=1 ./setup_venv.sh
```
If you are a *Torch-mlir developer or an IREE developer* and want to test local changes you can uninstall
@@ -102,82 +134,38 @@ for Torch-MLIR.
```
Now SHARK will use your locally built Torch-MLIR repo.
### Run a demo script
```shell
python -m shark.examples.shark_inference.resnet50_script --device="cpu" # Use gpu | vulkan
# Or a pytest
pytest tank/test_models.py -k "MiniLM"
## Benchmarking Dispatches
To produce benchmarks of individual dispatches, you can add `--dispatch_benchmarks=All --dispatch_benchmarks_dir=<output_dir>` to your command-line arguments.
If you only want to compile specific dispatches, you can specify them with a space-separated string instead of `"All"`, e.g. `--dispatch_benchmarks="0 1 2 10"`
If you instead want to incorporate this into a Python script, you can pass the `dispatch_benchmarks` and `dispatch_benchmarks_dir` arguments when initializing `SharkInference`, and the benchmarks will be generated at compile time, e.g.:
```
shark_module = SharkInference(
mlir_model,
func_name,
device=args.device,
mlir_dialect="tm_tensor",
dispatch_benchmarks="all",
dispatch_benchmarks_dir="results"
)
```
Output will include:
- Inside the specified directory, there will be a directory for each dispatch (there will be mlir files for all dispatches, but only compiled binaries and benchmark data for the specified dispatches)
- An .mlir file containing the dispatch benchmark
- A compiled .vmfb file containing the dispatch benchmark
- An .mlir file containing just the hal executable
- A compiled .vmfb file of the hal executable
- A .txt file containing benchmark output
See tank/README.md for instructions on how to run model tests and benchmarks from the SHARK tank.
</details>
<details>
<summary>Testing and Benchmarks</summary>
### Run all model tests on CPU/GPU/VULKAN/Metal
```shell
pytest tank/test_models.py
# If on Linux for multithreading on CPU (faster results):
pytest tank/test_models.py -n auto
```
### Running specific tests
```shell
# Search for test cases by including a keyword that matches all or part of the test case's name;
pytest tank/test_models.py -k "keyword"
# Test cases are named uniformly by format test_module_<model_name_underscores_only>_<torch/tf>_<static/dynamic>_<device>.
# Example: Test all models on nvidia gpu:
pytest tank/test_models.py -k "cuda"
# Example: Test all tensorflow resnet models on Vulkan backend:
pytest tank/test_models.py -k "resnet and tf and vulkan"
# Exclude a test case:
pytest tank/test_models.py -k "not ..."
### Run benchmarks on SHARK tank pytests and generate bench_results.csv with results.
(the following requires source installation with `IMPORTER=1 ./setup_venv.sh`)
```shell
pytest --benchmark tank/test_models.py
# Just do static GPU benchmarks for PyTorch tests:
pytest --benchmark tank/test_models.py -k "pytorch and static and cuda"
```
### Benchmark Resnet50, MiniLM on CPU
(requires source installation with `IMPORTER=1 ./setup_venv.sh`)
```shell
# We suggest running the following commands as root before running benchmarks on CPU:
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | awk -F, '{print $2}' | sort -n | uniq | ( while read X ; do echo $X ; echo 0 > /sys/devices/system/cpu/cpu$X/online ; done )
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# Benchmark canonical Resnet50 on CPU via pytest
pytest --benchmark tank/test_models -k "resnet50 and tf_static_cpu"
# Benchmark canonical MiniLM on CPU via pytest
pytest --benchmark tank/test_models -k "MiniLM and cpu"
# Benchmark MiniLM on CPU via transformer-benchmarks:
git clone --recursive https://github.com/nod-ai/transformer-benchmarks.git
cd transformer-benchmarks
./perf-ci.sh -n
# Check detail.csv for MLIR/IREE results.
```
</details>
<details>
<summary>API Reference</summary>
@@ -228,160 +216,21 @@ result = shark_module.forward((arg0, arg1))
```
</details>
## Supported and Validated Models
<details>
<summary>PyTorch Models</summary>
SHARK is maintained to support the latest innovations in ML Models:
### Huggingface PyTorch Models
| TF HuggingFace Models | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------|----------|-------------|
| BERT | :green_heart: | :green_heart: | :green_heart: |
| DistilBERT | :green_heart: | :green_heart: | :green_heart: |
| GPT2 | :green_heart: | :green_heart: | :green_heart: |
| BLOOM | :green_heart: | :green_heart: | :green_heart: |
| Stable Diffusion | :green_heart: | :green_heart: | :green_heart: |
| Vision Transformer | :green_heart: | :green_heart: | :green_heart: |
| ResNet50 | :green_heart: | :green_heart: | :green_heart: |
| Hugging Face Models | Torch-MLIR lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| BERT | :green_heart: (JIT) | :green_heart: | :green_heart: | :green_heart: |
| Albert | :green_heart: (JIT) | :green_heart: | :green_heart: | :green_heart: |
| BigBird | :green_heart: (AOT) | | | |
| DistilBERT | :green_heart: (JIT) | :green_heart: | :green_heart: | :green_heart: |
| GPT2 | :broken_heart: (AOT) | | | |
| MobileBert | :green_heart: (JIT) | :green_heart: | :green_heart: | :green_heart: |
### Torchvision Models
| TORCHVISION Models | Torch-MLIR lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|--------------------|----------------------|----------|----------|-------------|
| AlexNet | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| DenseNet121 | :green_heart: (Script) | | | |
| MNasNet1_0 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| MobileNetV2 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| MobileNetV3 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Unet | :broken_heart: (Script) | | | |
| Resnet18 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Resnet50 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Resnet101 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Resnext50_32x4d | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| ShuffleNet_v2 | :broken_heart: (Script) | | | |
| SqueezeNet | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| EfficientNet | :green_heart: (Script) | | | |
| Regnet | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Resnest | :broken_heart: (Script) | | | |
| Vision Transformer | :green_heart: (Script) | | | |
| VGG 16 | :green_heart: (Script) | :green_heart: | :green_heart: | |
| Wide Resnet | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| RAFT | :broken_heart: (JIT) | | | |
For more information refer to [MODEL TRACKING SHEET](https://docs.google.com/spreadsheets/d/15PcjKeHZIrB5LfDyuw7DGEEE8XnQEX2aX8lm8qbxV8A/edit#gid=0)
### PyTorch Training Models
| Models | Torch-MLIR lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| BERT | :broken_heart: | :broken_heart: | | |
| FullyConnected | :green_heart: | :green_heart: | | |
</details>
<details>
<summary>JAX Models</summary>
### JAX Models
| Models | JAX-MHLO lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| DALL-E | :broken_heart: | :broken_heart: | | |
| FullyConnected | :green_heart: | :green_heart: | | |
</details>
<details>
<summary>TFLite Models</summary>
### TFLite Models
| Models | TOSA/LinAlg | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| BERT | :broken_heart: | :broken_heart: | | |
| FullyConnected | :green_heart: | :green_heart: | | |
| albert | :green_heart: | :green_heart: | | |
| asr_conformer | :green_heart: | :green_heart: | | |
| bird_classifier | :green_heart: | :green_heart: | | |
| cartoon_gan | :green_heart: | :green_heart: | | |
| craft_text | :green_heart: | :green_heart: | | |
| deeplab_v3 | :green_heart: | :green_heart: | | |
| densenet | :green_heart: | :green_heart: | | |
| east_text_detector | :green_heart: | :green_heart: | | |
| efficientnet_lite0_int8 | :green_heart: | :green_heart: | | |
| efficientnet | :green_heart: | :green_heart: | | |
| gpt2 | :green_heart: | :green_heart: | | |
| image_stylization | :green_heart: | :green_heart: | | |
| inception_v4 | :green_heart: | :green_heart: | | |
| inception_v4_uint8 | :green_heart: | :green_heart: | | |
| lightning_fp16 | :green_heart: | :green_heart: | | |
| lightning_i8 | :green_heart: | :green_heart: | | |
| lightning | :green_heart: | :green_heart: | | |
| magenta | :green_heart: | :green_heart: | | |
| midas | :green_heart: | :green_heart: | | |
| mirnet | :green_heart: | :green_heart: | | |
| mnasnet | :green_heart: | :green_heart: | | |
| mobilebert_edgetpu_s_float | :green_heart: | :green_heart: | | |
| mobilebert_edgetpu_s_quant | :green_heart: | :green_heart: | | |
| mobilebert | :green_heart: | :green_heart: | | |
| mobilebert_tf2_float | :green_heart: | :green_heart: | | |
| mobilebert_tf2_quant | :green_heart: | :green_heart: | | |
| mobilenet_ssd_quant | :green_heart: | :green_heart: | | |
| mobilenet_v1 | :green_heart: | :green_heart: | | |
| mobilenet_v1_uint8 | :green_heart: | :green_heart: | | |
| mobilenet_v2_int8 | :green_heart: | :green_heart: | | |
| mobilenet_v2 | :green_heart: | :green_heart: | | |
| mobilenet_v2_uint8 | :green_heart: | :green_heart: | | |
| mobilenet_v3-large | :green_heart: | :green_heart: | | |
| mobilenet_v3-large_uint8 | :green_heart: | :green_heart: | | |
| mobilenet_v35-int8 | :green_heart: | :green_heart: | | |
| nasnet | :green_heart: | :green_heart: | | |
| person_detect | :green_heart: | :green_heart: | | |
| posenet | :green_heart: | :green_heart: | | |
| resnet_50_int8 | :green_heart: | :green_heart: | | |
| rosetta | :green_heart: | :green_heart: | | |
| spice | :green_heart: | :green_heart: | | |
| squeezenet | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v1 | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v1_uint8 | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v2_fpnlite | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v2_fpnlite_uint8 | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v2_int8 | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v2 | :green_heart: | :green_heart: | | |
| ssd_spaghettinet_large | :green_heart: | :green_heart: | | |
| ssd_spaghettinet_large_uint8 | :green_heart: | :green_heart: | | |
| visual_wake_words_i8 | :green_heart: | :green_heart: | | |
</details>
<details>
<summary>TF Models</summary>
### Tensorflow Models (Inference)
| Hugging Face Models | tf-mhlo lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| BERT | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| albert-base-v2 | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| DistilBERT | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| CamemBert | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| ConvBert | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| Deberta | | | | |
| electra | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| funnel | | | | |
| layoutlm | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| longformer | | | | |
| mobile-bert | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| remembert | | | | |
| tapas | | | | |
| flaubert | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| roberta | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| xlm-roberta | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| mpnet | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
</details>
For a complete list of the models supported in SHARK, please refer to [tank/README.md](https://github.com/nod-ai/SHARK/blob/main/tank/README.md).
## Related Projects


@@ -0,0 +1,37 @@
"""Scrapes the github releases API to generate a static pip-install-able releases page.
See https://github.com/llvm/torch-mlir/issues/1374
"""
import argparse
import json
import requests
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("owner", type=str)
parser.add_argument("repo", type=str)
args = parser.parse_args()
# Get releases
response = requests.get(
f"https://api.github.com/repos/{args.owner}/{args.repo}/releases"
)
body = json.loads(response.content)
# Parse releases
releases = []
for row in body:
for asset in row["assets"]:
releases.append((asset["name"], asset["browser_download_url"]))
# Output HTML
html = """<!DOCTYPE html>
<html>
<body>
"""
for name, url in releases:
html += f" <a href='{url}'>{name}</a><br />\n"
html += """ </body>
</html>"""
print(html)


@@ -36,6 +36,12 @@ def pytest_addoption(parser):
default="False",
help="Enables uploading of reproduction artifacts upon test case failure during iree-compile or validation. Must be passed with --ci_sha option ",
)
parser.addoption(
"--update_tank",
action="store_true",
default="False",
help="Update local shark tank with latest artifacts.",
)
parser.addoption(
"--ci_sha",
action="store",
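For context, a hedged sketch of how an option like `--update_tank` is typically declared and then consumed from a fixture; this is not the repository's actual `conftest.py`:

```python
# Hedged sketch (not the repo's conftest.py) of wiring up and consuming a
# pytest flag like --update_tank.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--update_tank",
        action="store_true",
        default=False,  # plain boolean default for this sketch
        help="Update local shark tank with latest artifacts.",
    )


@pytest.fixture
def update_tank(request) -> bool:
    """Expose the --update_tank flag to tests as a boolean."""
    return request.config.getoption("--update_tank")
```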

cpp/.gitignore (new file, 3 changed lines)

@@ -0,0 +1,3 @@
*.mlir
*.vmfb
*.ini


@@ -54,5 +54,29 @@ python -m pip install tensorflow
*Run the vulkan_gui*
```bash
./build/vulkan_gui/iree-samples-vulkan-gui
./build/vulkan_gui/iree-samples-resnet-vulkan-gui
```
## Other models
A tool for benchmarking other models is also built and can be invoked with a command like the following:
```bash
./build/vulkan_gui/iree-vulkan-gui --module-file=path/to/.vmfb --function_input=...
```
See `./build/vulkan_gui/iree-vulkan-gui --help` for an explanation of the function input. For example, the stable diffusion UNet can be tested with the following commands:
```bash
wget https://storage.googleapis.com/shark_tank/quinn/stable_diff_tf/stable_diff_tf.mlir
iree-compile --iree-input-type=mhlo --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvm-target-cpu-features=host -iree-vulkan-target-triple=rdna2-unknown-linux --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 stable_diff_tf.mlir -o stable_diff_tf.vmfb
./build/vulkan_gui/iree-vulkan-gui --module-file=stable_diff_tf.vmfb --function_input=2x4x64x64xf32 --function_input=1xf32 --function_input=2x77x768xf32
```
The VAE and CLIP autoencoder are also available:
```bash
# VAE
wget https://storage.googleapis.com/shark_tank/quinn/stable_diff_tf/vae_tf/vae.mlir
iree-compile --iree-input-type=mhlo --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvm-target-cpu-features=host -iree-vulkan-target-triple=rdna2-unknown-linux --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 vae.mlir -o vae.vmfb
./build/vulkan_gui/iree-vulkan-gui --module-file=stable_diff_tf.vmfb --function_input=1x4x64x64xf32
# CLIP Autoencoder
wget https://storage.googleapis.com/shark_tank/quinn/stable_diff_tf/clip_tf/clip_autoencoder.mlir
iree-compile --iree-input-type=mhlo --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvm-target-cpu-features=host -iree-vulkan-target-triple=rdna2-unknown-linux --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 clip_autoencoder.mlir -o clip_autoencoder.vmfb
./build/vulkan_gui/iree-vulkan-gui --module-file=stable_diff_tf.vmfb --function_input=1x77xi32 --function_input=1x77xi32
```


@@ -40,45 +40,77 @@ set(IMGUI_DIR ${CMAKE_BINARY_DIR}/_deps/imgui-src)
message("Looking for Imgui in ${IMGUI_DIR}")
include_directories(${IMGUI_DIR} ${IMGUI_DIR}/backends ..)
# Define the sample executable.
set(_NAME "iree-samples-vulkan-gui")
add_executable(${_NAME} "")
target_sources(${_NAME}
PRIVATE
vulkan_inference_gui.cc
"${IMGUI_DIR}/backends/imgui_impl_sdl.cpp"
"${IMGUI_DIR}/backends/imgui_impl_vulkan.cpp"
"${IMGUI_DIR}/imgui.cpp"
"${IMGUI_DIR}/imgui_draw.cpp"
"${IMGUI_DIR}/imgui_demo.cpp"
"${IMGUI_DIR}/imgui_tables.cpp"
"${IMGUI_DIR}/imgui_widgets.cpp"
)
set_target_properties(${_NAME} PROPERTIES OUTPUT_NAME "iree-samples-vulkan-gui")
target_include_directories(${_NAME} PUBLIC
$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}>
)
target_link_libraries(${_NAME}
SDL2::SDL2
Vulkan::Vulkan
iree_runtime_runtime
iree_base_internal_main
iree_hal_drivers_vulkan_registration_registration
iree_modules_hal_hal
iree_vm_vm
iree_vm_bytecode_module
iree_vm_cc
function(iree_vulkan_sample)
cmake_parse_arguments(
_RULE
""
"NAME"
"SRCS"
${ARGN}
)
# Define the sample executable.
set(_NAME "${_RULE_NAME}")
set(SRCS "${_RULE_SRCS}")
add_executable(${_NAME} "")
target_sources(${_NAME}
PRIVATE
${SRCS}
"${IMGUI_DIR}/backends/imgui_impl_sdl.cpp"
"${IMGUI_DIR}/backends/imgui_impl_vulkan.cpp"
"${IMGUI_DIR}/imgui.cpp"
"${IMGUI_DIR}/imgui_draw.cpp"
"${IMGUI_DIR}/imgui_demo.cpp"
"${IMGUI_DIR}/imgui_tables.cpp"
"${IMGUI_DIR}/imgui_widgets.cpp"
)
set_target_properties(${_NAME} PROPERTIES OUTPUT_NAME "${_NAME}")
target_include_directories(${_NAME} PUBLIC
$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}>
)
target_link_libraries(${_NAME}
SDL2::SDL2
Vulkan::Vulkan
iree_runtime_runtime
iree_base_internal_main
iree_hal_drivers_vulkan_registration_registration
iree_modules_hal_hal
iree_vm_vm
iree_vm_bytecode_module
iree_vm_cc
iree_tooling_vm_util_cc
iree_tooling_context_util
)
if(${CMAKE_SYSTEM_NAME} STREQUAL "Windows")
set(_GUI_LINKOPTS "-SUBSYSTEM:CONSOLE")
else()
set(_GUI_LINKOPTS "")
endif()
target_link_options(${_NAME}
PRIVATE
${_GUI_LINKOPTS}
)
endfunction()
iree_vulkan_sample(
NAME
iree-samples-resnet-vulkan-gui
SRCS
vulkan_resnet_inference_gui.cc
)
if(${CMAKE_SYSTEM_NAME} STREQUAL "Windows")
set(_GUI_LINKOPTS "-SUBSYSTEM:CONSOLE")
else()
set(_GUI_LINKOPTS "")
endif()
iree_vulkan_sample(
NAME
iree-vulkan-gui
target_link_options(${_NAME}
PRIVATE
${_GUI_LINKOPTS}
SRCS
vulkan_inference_gui.cc
)
message(STATUS "Configured vulkan_gui sample successfully")


@@ -18,6 +18,12 @@
#include <set>
#include <vector>
#include <fstream>
#include <array>
#include <cstdio>
#include <cstdlib>
#include <iterator>
#include <string>
#include <utility>
#include "iree/hal/drivers/vulkan/api.h"
@@ -30,6 +36,15 @@
#include "iree/vm/bytecode_module.h"
#include "iree/vm/ref_cc.h"
// iree-run-module
#include "iree/base/internal/flags.h"
#include "iree/base/status_cc.h"
#include "iree/base/tracing.h"
#include "iree/modules/hal/types.h"
#include "iree/tooling/comparison.h"
#include "iree/tooling/context_util.h"
#include "iree/tooling/vm_util_cc.h"
// Other dependencies (helpers, etc.)
#include "iree/base/internal/main.h"
@@ -38,6 +53,49 @@
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
IREE_FLAG(string, entry_function, "",
"Name of a function contained in the module specified by module_file "
"to run.");
// TODO(benvanik): move --function_input= flag into a util.
static iree_status_t parse_function_io(iree_string_view_t flag_name,
void* storage,
iree_string_view_t value) {
auto* list = (std::vector<std::string>*)storage;
list->push_back(std::string(value.data, value.size));
return iree_ok_status();
}
static void print_function_io(iree_string_view_t flag_name, void* storage,
FILE* file) {
auto* list = (std::vector<std::string>*)storage;
if (list->empty()) {
fprintf(file, "# --%.*s=\n", (int)flag_name.size, flag_name.data);
} else {
for (size_t i = 0; i < list->size(); ++i) {
fprintf(file, "--%.*s=\"%s\"\n", (int)flag_name.size, flag_name.data,
list->at(i).c_str());
}
}
}
static std::vector<std::string> FLAG_function_inputs;
IREE_FLAG_CALLBACK(
parse_function_io, print_function_io, &FLAG_function_inputs, function_input,
"An input (a) value or (b) buffer of the format:\n"
" (a) scalar value\n"
" value\n"
" e.g.: --function_input=\"3.14\"\n"
" (b) buffer:\n"
" [shape]xtype=[value]\n"
" e.g.: --function_input=\"2x2xi32=1 2 3 4\"\n"
"Optionally, brackets may be used to separate the element values:\n"
" 2x2xi32=[[1 2][3 4]]\n"
"Raw binary files can be read to provide buffer contents:\n"
" 2x2xi32=@some/file.bin\n"
"numpy npy files (from numpy.save) can be read to provide 1+ values:\n"
" @some.npy\n"
"Each occurrence of the flag indicates an input in the order they were\n"
"specified on the command line.");
typedef struct iree_file_toc_t {
const char* name; // the file's original name
char* data; // beginning of the file
@@ -87,225 +145,6 @@ static void check_vk_result(VkResult err) {
abort();
}
// Helper function to find Vulkan memory type bits. See ImGui_ImplVulkan_MemoryType() in imgui_impl_vulkan.cpp
uint32_t findMemoryType(uint32_t type_filter, VkMemoryPropertyFlags properties)
{
VkPhysicalDeviceMemoryProperties mem_properties;
vkGetPhysicalDeviceMemoryProperties(g_PhysicalDevice, &mem_properties);
for (uint32_t i = 0; i < mem_properties.memoryTypeCount; i++)
{
if ((type_filter & (1 << i)) && (mem_properties.memoryTypes[i].propertyFlags & properties) == properties)
{
return i;
}
}
return 0xFFFFFFFF; // Unable to find memoryType
}
// Helper function to load an image with common settings and return a VkDescriptorSet as a sort of Vulkan pointer
bool LoadTextureFromFile(const char* filename, VkDescriptorSet* img_ds, int* image_width, int* image_height)
{
// Specifying 4 channels forces stb to load the image in RGBA which is an easy format for Vulkan
int image_channels = 4;
unsigned char* image_data = stbi_load(filename, image_width, image_height, 0, image_channels);
if (image_data == NULL)
{
return false;
}
// Calculate allocation size (in number of bytes)
size_t image_size = (*image_width)*(*image_height)*image_channels;
VkResult err;
// Create the Vulkan image.
VkImage texture_image;
VkDeviceMemory texture_image_memory;
{
VkImageCreateInfo info = {};
info.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
info.imageType = VK_IMAGE_TYPE_2D;
info.format = VK_FORMAT_R8G8B8A8_UNORM;
info.extent.width = *image_width;
info.extent.height = *image_height;
info.extent.depth = 1;
info.mipLevels = 1;
info.arrayLayers = 1;
info.samples = VK_SAMPLE_COUNT_1_BIT;
info.tiling = VK_IMAGE_TILING_OPTIMAL;
info.usage = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT;
info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
info.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
err = vkCreateImage(g_Device, &info, g_Allocator, &texture_image);
check_vk_result(err);
VkMemoryRequirements req;
vkGetImageMemoryRequirements(g_Device, texture_image, &req);
VkMemoryAllocateInfo alloc_info = {};
alloc_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
alloc_info.allocationSize = req.size;
alloc_info.memoryTypeIndex = findMemoryType(req.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
err = vkAllocateMemory(g_Device, &alloc_info, g_Allocator, &texture_image_memory);
check_vk_result(err);
err = vkBindImageMemory(g_Device, texture_image, texture_image_memory, 0);
check_vk_result(err);
}
// Create the Image View
VkImageView image_view;
{
VkImageViewCreateInfo info = {};
info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
info.image = texture_image;
info.viewType = VK_IMAGE_VIEW_TYPE_2D;
info.format = VK_FORMAT_R8G8B8A8_UNORM;
info.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
info.subresourceRange.levelCount = 1;
info.subresourceRange.layerCount = 1;
err = vkCreateImageView(g_Device, &info, g_Allocator, &image_view);
check_vk_result(err);
}
// Create Sampler
VkSampler sampler;
{
VkSamplerCreateInfo sampler_info{};
sampler_info.sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO;
sampler_info.magFilter = VK_FILTER_LINEAR;
sampler_info.minFilter = VK_FILTER_LINEAR;
sampler_info.mipmapMode = VK_SAMPLER_MIPMAP_MODE_LINEAR;
sampler_info.addressModeU = VK_SAMPLER_ADDRESS_MODE_REPEAT; // outside image bounds just use border color
sampler_info.addressModeV = VK_SAMPLER_ADDRESS_MODE_REPEAT;
sampler_info.addressModeW = VK_SAMPLER_ADDRESS_MODE_REPEAT;
sampler_info.minLod = -1000;
sampler_info.maxLod = 1000;
sampler_info.maxAnisotropy = 1.0f;
err = vkCreateSampler(g_Device, &sampler_info, g_Allocator, &sampler);
check_vk_result(err);
}
// Create Descriptor Set using ImGUI's implementation
*img_ds = ImGui_ImplVulkan_AddTexture(sampler, image_view, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL);
// Create Upload Buffer
VkBuffer upload_buffer;
VkDeviceMemory upload_buffer_memory;
{
VkBufferCreateInfo buffer_info = {};
buffer_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
buffer_info.size = image_size;
buffer_info.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
buffer_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
err = vkCreateBuffer(g_Device, &buffer_info, g_Allocator, &upload_buffer);
check_vk_result(err);
VkMemoryRequirements req;
vkGetBufferMemoryRequirements(g_Device, upload_buffer, &req);
VkMemoryAllocateInfo alloc_info = {};
alloc_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
alloc_info.allocationSize = req.size;
alloc_info.memoryTypeIndex = findMemoryType(req.memoryTypeBits, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
err = vkAllocateMemory(g_Device, &alloc_info, g_Allocator, &upload_buffer_memory);
check_vk_result(err);
err = vkBindBufferMemory(g_Device, upload_buffer, upload_buffer_memory, 0);
check_vk_result(err);
}
// Upload to Buffer:
{
void* map = NULL;
err = vkMapMemory(g_Device, upload_buffer_memory, 0, image_size, 0, &map);
check_vk_result(err);
memcpy(map, image_data, image_size);
VkMappedMemoryRange range[1] = {};
range[0].sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
range[0].memory = upload_buffer_memory;
range[0].size = image_size;
err = vkFlushMappedMemoryRanges(g_Device, 1, range);
check_vk_result(err);
vkUnmapMemory(g_Device, upload_buffer_memory);
}
// Release image memory using stb
stbi_image_free(image_data);
// Create a command buffer that will perform following steps when hit in the command queue.
// TODO: this works in the example, but may need input if this is an acceptable way to access the pool/create the command buffer.
VkCommandPool command_pool = g_MainWindowData.Frames[g_MainWindowData.FrameIndex].CommandPool;
VkCommandBuffer command_buffer;
{
VkCommandBufferAllocateInfo alloc_info{};
alloc_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
alloc_info.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
alloc_info.commandPool = command_pool;
alloc_info.commandBufferCount = 1;
err = vkAllocateCommandBuffers(g_Device, &alloc_info, &command_buffer);
check_vk_result(err);
VkCommandBufferBeginInfo begin_info = {};
begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
begin_info.flags |= VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
err = vkBeginCommandBuffer(command_buffer, &begin_info);
check_vk_result(err);
}
// Copy to Image
{
VkImageMemoryBarrier copy_barrier[1] = {};
copy_barrier[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
copy_barrier[0].dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
copy_barrier[0].oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
copy_barrier[0].newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
copy_barrier[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
copy_barrier[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
copy_barrier[0].image = texture_image;
copy_barrier[0].subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
copy_barrier[0].subresourceRange.levelCount = 1;
copy_barrier[0].subresourceRange.layerCount = 1;
vkCmdPipelineBarrier(command_buffer, VK_PIPELINE_STAGE_HOST_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT, 0, 0, NULL, 0, NULL, 1, copy_barrier);
VkBufferImageCopy region = {};
region.imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
region.imageSubresource.layerCount = 1;
region.imageExtent.width = *image_width;
region.imageExtent.height = *image_height;
region.imageExtent.depth = 1;
vkCmdCopyBufferToImage(command_buffer, upload_buffer, texture_image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);
VkImageMemoryBarrier use_barrier[1] = {};
use_barrier[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
use_barrier[0].srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
use_barrier[0].dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
use_barrier[0].oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
use_barrier[0].newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
use_barrier[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
use_barrier[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
use_barrier[0].image = texture_image;
use_barrier[0].subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
use_barrier[0].subresourceRange.levelCount = 1;
use_barrier[0].subresourceRange.layerCount = 1;
vkCmdPipelineBarrier(command_buffer, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, 0, 0, NULL, 0, NULL, 1, use_barrier);
}
// End command buffer
{
VkSubmitInfo end_info = {};
end_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
end_info.commandBufferCount = 1;
end_info.pCommandBuffers = &command_buffer;
err = vkEndCommandBuffer(command_buffer);
check_vk_result(err);
err = vkQueueSubmit(g_Queue, 1, &end_info, VK_NULL_HANDLE);
check_vk_result(err);
err = vkDeviceWaitIdle(g_Device);
check_vk_result(err);
}
return true;
}
// Returns the names of the Vulkan layers used for the given IREE
// |extensibility_set| and |features|.
std::vector<const char*> GetIreeLayers(
@@ -723,7 +562,16 @@ namespace iree {
extern "C" int iree_main(int argc, char** argv) {
fprintf(stdout, "starting yo\n");
iree_flags_parse_checked(IREE_FLAGS_PARSE_MODE_DEFAULT, &argc, &argv);
if (argc > 1) {
// Avoid iree-run-module spinning endlessly on stdin if the user uses single
// dashes for flags.
printf(
"[ERROR] unexpected positional argument (expected none)."
" Did you use pass a flag with a single dash ('-')?"
" Use '--' instead.\n");
return 1;
}
// --------------------------------------------------------------------------
// Create a window.
@@ -835,8 +683,6 @@ extern "C" int iree_main(int argc, char** argv) {
// Demo state.
bool show_iree_window = true;
// --------------------------------------------------------------------------
// --------------------------------------------------------------------------
// Setup IREE.
@@ -900,69 +746,44 @@ extern "C" int iree_main(int argc, char** argv) {
// Load bytecode module
iree_file_toc_t module_file_toc;
const char network_model[] = "resnet50_tf.vmfb";
fprintf(stdout, "Loading: %s\n", network_model);
if (load_file(network_model, &module_file_toc.data, &module_file_toc.size) == false)
{
abort();
return 1;
}
fprintf(stdout, "module size: %zu\n", module_file_toc.size);
static float input_res50[224*224*3];
static float output_res50[1000];
char filename[] = "dog_imagenet.jpg";
fprintf(stdout, "loading: %s\n", filename);
int x,y,n;
//unsigned char *image_raw = stbi_load(filename, &x, &y, &n, 3);
stbi_load(filename, &x, &y, &n, 3);
fprintf(stdout, "res: %i x %i x %i\n", x, y, n);
/* Preprocessing needs to go here. For now use a buffer preprocessed in python.
//convert image into floating point format
for(int i=0;i<224*224*3;i++)
{
input_res50[i]= ((float)image_raw[i])/255.0f;
}*/
std::ifstream fin("dog.bin", std::ifstream::in | std::ifstream::binary);
fin.read((char*)input_res50, 224*224*3*sizeof(float));
// load image again so imgui can display it
int my_image_width = 0;
int my_image_height = 0;
VkDescriptorSet my_image_texture = 0;
bool ret = LoadTextureFromFile(filename, &my_image_texture, &my_image_width, &my_image_height);
fprintf(stdout, "creating vulkan image: %s\n", ret ?"OK":"FAIL");
IM_ASSERT(ret);
//iree_file_toc_t module_file_toc;
//const char network_model[] = "resnet50_tf.vmfb";
//fprintf(stdout, "Loading: %s\n", network_model);
//if (load_file(network_model, &module_file_toc.data, &module_file_toc.size) == false)
//{
// abort();
// return 1;
//}
//fprintf(stdout, "module size: %zu\n", module_file_toc.size);
iree_vm_module_t* bytecode_module = nullptr;
IREE_CHECK_OK(iree_vm_bytecode_module_create(
iree_instance,
iree_const_byte_span_t{
reinterpret_cast<const uint8_t*>(module_file_toc.data),
module_file_toc.size},
iree_allocator_null(), iree_allocator_system(), &bytecode_module));
// Query for details about what is in the loaded module.
iree_vm_module_signature_t bytecode_module_signature =
iree_vm_module_signature(bytecode_module);
fprintf(stdout, "Module loaded, have <%" PRIhsz "> exported functions:\n",
bytecode_module_signature.export_function_count);
for (int i = 0; i < bytecode_module_signature.export_function_count; ++i) {
iree_vm_function_t function;
IREE_CHECK_OK(iree_vm_module_lookup_function_by_ordinal(
bytecode_module, IREE_VM_FUNCTION_LINKAGE_EXPORT, i, &function));
auto function_name = iree_vm_function_name(&function);
auto function_signature = iree_vm_function_signature(&function);
iree_status_t module_status = iree_tooling_load_module_from_flags(
iree_instance, iree_allocator_system(), &bytecode_module);
if (!iree_status_is_ok(module_status))
return -1;
//IREE_CHECK_OK(iree_vm_bytecode_module_create(
// iree_instance,
// iree_const_byte_span_t{
// reinterpret_cast<const uint8_t*>(module_file_toc.data),
// module_file_toc.size},
// iree_allocator_null(), iree_allocator_system(), &bytecode_module));
//// Query for details about what is in the loaded module.
//iree_vm_module_signature_t bytecode_module_signature =
// iree_vm_module_signature(bytecode_module);
//fprintf(stdout, "Module loaded, have <%" PRIhsz "> exported functions:\n",
// bytecode_module_signature.export_function_count);
//for (int i = 0; i < bytecode_module_signature.export_function_count; ++i) {
// iree_vm_function_t function;
// IREE_CHECK_OK(iree_vm_module_lookup_function_by_ordinal(
// bytecode_module, IREE_VM_FUNCTION_LINKAGE_EXPORT, i, &function));
// auto function_name = iree_vm_function_name(&function);
// auto function_signature = iree_vm_function_signature(&function);
fprintf(stdout, " %d: '%.*s' with calling convention '%.*s'\n", i,
(int)function_name.size, function_name.data,
(int)function_signature.calling_convention.size,
function_signature.calling_convention.data);
}
// fprintf(stdout, " %d: '%.*s' with calling convention '%.*s'\n", i,
// (int)function_name.size, function_name.data,
// (int)function_signature.calling_convention.size,
// function_signature.calling_convention.data);
//}
// Allocate a context that will hold the module state across invocations.
iree_vm_context_t* iree_context = nullptr;
@@ -988,33 +809,42 @@ extern "C" int iree_main(int argc, char** argv) {
// Write inputs into mappable buffers.
iree_hal_allocator_t* allocator =
iree_hal_device_allocator(iree_vk_device);
iree_hal_memory_type_t input_memory_type =
static_cast<iree_hal_memory_type_t>(
IREE_HAL_MEMORY_TYPE_HOST_LOCAL |
IREE_HAL_MEMORY_TYPE_DEVICE_VISIBLE);
iree_hal_buffer_usage_t input_buffer_usage =
static_cast<iree_hal_buffer_usage_t>(IREE_HAL_BUFFER_USAGE_DEFAULT);
iree_hal_buffer_params_t buffer_params;
buffer_params.type = input_memory_type;
buffer_params.usage = input_buffer_usage;
buffer_params.access = IREE_HAL_MEMORY_ACCESS_READ | IREE_HAL_MEMORY_ACCESS_WRITE;
//iree_hal_memory_type_t input_memory_type =
// static_cast<iree_hal_memory_type_t>(
// IREE_HAL_MEMORY_TYPE_HOST_LOCAL |
// IREE_HAL_MEMORY_TYPE_DEVICE_VISIBLE);
//iree_hal_buffer_usage_t input_buffer_usage =
// static_cast<iree_hal_buffer_usage_t>(IREE_HAL_BUFFER_USAGE_DEFAULT);
//iree_hal_buffer_params_t buffer_params;
//buffer_params.type = input_memory_type;
//buffer_params.usage = input_buffer_usage;
//buffer_params.access = IREE_HAL_MEMORY_ACCESS_READ | IREE_HAL_MEMORY_ACCESS_WRITE;
// Wrap input buffers in buffer views.
iree_hal_buffer_view_t* input0_buffer_view = nullptr;
constexpr iree_hal_dim_t input_buffer_shape[] = {1, 224, 224, 3};
IREE_CHECK_OK(iree_hal_buffer_view_allocate_buffer(
allocator,
/*shape_rank=*/4, /*shape=*/input_buffer_shape,
IREE_HAL_ELEMENT_TYPE_FLOAT_32,
IREE_HAL_ENCODING_TYPE_DENSE_ROW_MAJOR, buffer_params,
iree_make_const_byte_span(&input_res50, sizeof(input_res50)),
&input0_buffer_view));
vm::ref<iree_vm_list_t> inputs;
IREE_CHECK_OK(iree_vm_list_create(/*element_type=*/nullptr, 6, iree_allocator_system(), &inputs));
auto input0_buffer_view_ref = iree_hal_buffer_view_move_ref(input0_buffer_view);
IREE_CHECK_OK(iree_vm_list_push_ref_move(inputs.get(), &input0_buffer_view_ref));
iree_status_t input_status = ParseToVariantList(
allocator,
iree::span<const std::string>{FLAG_function_inputs.data(),
FLAG_function_inputs.size()},
iree_allocator_system(), &inputs);
if (!iree_status_is_ok(input_status))
return -1;
//vm::ref<iree_vm_list_t> inputs;
//IREE_CHECK_OK(iree_vm_list_create(/*element_type=*/nullptr, 6, iree_allocator_system(), &inputs));
//iree_hal_buffer_view_t* input0_buffer_view = nullptr;
//constexpr iree_hal_dim_t input_buffer_shape[] = {1, 224, 224, 3};
//IREE_CHECK_OK(iree_hal_buffer_view_allocate_buffer(
// allocator,
// /*shape_rank=*/4, /*shape=*/input_buffer_shape,
// IREE_HAL_ELEMENT_TYPE_FLOAT_32,
// IREE_HAL_ENCODING_TYPE_DENSE_ROW_MAJOR, buffer_params,
// iree_make_const_byte_span(&input_res50, sizeof(input_res50)),
// &input0_buffer_view));
//auto input0_buffer_view_ref = iree_hal_buffer_view_move_ref(input0_buffer_view);
//IREE_CHECK_OK(iree_vm_list_push_ref_move(inputs.get(), &input0_buffer_view_ref));
// Prepare outputs list to accept results from the invocation.
@@ -1023,6 +853,7 @@ extern "C" int iree_main(int argc, char** argv) {
IREE_CHECK_OK(iree_vm_list_create(/*element_type=*/nullptr, kOutputCount * sizeof(float), iree_allocator_system(), &outputs));
// --------------------------------------------------------------------------
// Main loop.
bool done = false;
while (!done) {
@@ -1076,46 +907,11 @@ extern "C" int iree_main(int argc, char** argv) {
/*policy=*/nullptr, inputs.get(),
outputs.get(), iree_allocator_system()));
// Read back the results.
auto* output_buffer_view = reinterpret_cast<iree_hal_buffer_view_t*>(
iree_vm_list_get_ref_deref(outputs.get(),
0,
iree_hal_buffer_view_get_descriptor()));
IREE_CHECK_OK(iree_hal_device_transfer_d2h(
iree_vk_device,
iree_hal_buffer_view_buffer(output_buffer_view),
0,
output_res50, sizeof(output_res50),
IREE_HAL_TRANSFER_BUFFER_FLAG_DEFAULT, iree_infinite_timeout()));
// we want to run continuously so we can use tools like RenderDoc, RGP, etc...
dirty = true;
}
// find maxarg from results
float max = 0.0f;
int max_idx = -1;
for(int i=0;i<1000;i++)
{
if (output_res50[i] > max)
{
max = output_res50[i];
max_idx = i;
}
}
ImGui::Text("pointer = %p", my_image_texture);
ImGui::Text("size = %d x %d", my_image_width, my_image_height);
ImGui::Image((ImTextureID)my_image_texture, ImVec2(my_image_width, my_image_height));
// Display the latest computation output.
ImGui::Text("Max idx = [%i]", max_idx);
ImGui::Text("Max value = [%f]", max);
ImGui::Text("Resnet50 categories:");
ImGui::PlotHistogram("Histogram", output_res50, IM_ARRAYSIZE(output_res50), 0, NULL, 0.0f, 1.0f, ImVec2(0,80));
ImGui::Separator();
// Framerate counter.
ImGui::Text("Application average %.3f ms/frame (%.1f FPS)",
1000.0f / ImGui::GetIO().Framerate, ImGui::GetIO().Framerate);
@@ -1137,6 +933,7 @@ extern "C" int iree_main(int argc, char** argv) {
iree_vm_module_release(bytecode_module);
iree_vm_context_release(iree_context);
iree_hal_device_release(iree_vk_device);
iree_hal_allocator_release(allocator);
iree_hal_driver_release(iree_vk_driver);
iree_hal_vulkan_syms_release(iree_vk_syms);
iree_vm_instance_release(iree_instance);

File diff suppressed because it is too large.

View File

@@ -205,14 +205,14 @@ if __name__ == "__main__":
parser.add_argument(
"--torch_model_csv",
type=lambda x: is_valid_file(x),
default="./tank/pytorch/torch_model_list.csv",
default="./tank/torch_model_list.csv",
help="""Contains the file with torch_model name and args.
Please see: https://github.com/nod-ai/SHARK/blob/main/tank/pytorch/torch_model_list.csv""",
Please see: https://github.com/nod-ai/SHARK/blob/main/tank/torch_model_list.csv""",
)
parser.add_argument(
"--tf_model_csv",
type=lambda x: is_valid_file(x),
default="./tank/tf/tf_model_list.csv",
default="./tank/tf_model_list.csv",
help="Contains the file with tf model name and args.",
)
parser.add_argument(

View File

@@ -4,9 +4,9 @@ requires = [
"wheel",
"packaging",
"numpy==1.22.4",
"torch-mlir>=20220428.420",
"iree-compiler>=20220427.13",
"iree-runtime>=20220427.13",
"numpy>=1.22.4",
"torch-mlir>=20221021.633",
"iree-compiler>=20221022.190",
"iree-runtime>=20221022.190",
]
build-backend = "setuptools.build_meta"

View File

@@ -1,8 +1,8 @@
-f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
-f https://download.pytorch.org/whl/nightly/cpu/
--pre
numpy
torch
torch==1.14.0.dev20221021
torchvision
tqdm

View File

@@ -14,7 +14,8 @@ iree-tools-tf
# TensorFlow and JAX.
gin-config
tensorflow
tensorflow==2.10
keras==2.10
#tf-models-nightly
#tensorflow-text-nightly
transformers

View File

@@ -10,8 +10,8 @@ PACKAGE_VERSION = os.environ.get("SHARK_PACKAGE_VERSION") or "0.0.4"
backend_deps = []
if "NO_BACKEND" in os.environ.keys():
backend_deps = [
"iree-compiler>=20220427.13",
"iree-runtime>=20220427.13",
"iree-compiler>=20221022.190",
"iree-runtime>=20221022.190",
]
setup(
@@ -33,11 +33,11 @@ setup(
"Operating System :: OS Independent",
],
packages=find_packages(exclude=("examples")),
python_requires=">=3.7",
python_requires=">=3.9",
install_requires=[
"numpy",
"PyYAML",
"torch-mlir>=20220428.420",
"torch-mlir>=20221021.633",
]
+ backend_deps,
)

setup_venv.ps1
View File

@@ -0,0 +1,40 @@
#Write-Host "Installing python"
#Start-Process winget install Python.Python.3.10 '/quiet InstallAllUsers=1 PrependPath=1' -wait -NoNewWindow
#Write-Host "python installation completed successfully"
#Write-Host "Reload environment variables"
#$env:Path = [System.Environment]::GetEnvironmentVariable("Path","Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path","User")
#Write-Host "Reloaded environment variables"
# redirect stderr into stdout
$p = &{python -V} 2>&1
# check if an ErrorRecord was returned
$version = if($p -is [System.Management.Automation.ErrorRecord])
{
# grab the version string from the error message
$p.Exception.Message
}
else
{
# otherwise return as is
$p
}
Write-Host "Python version found is"
Write-Host $p
Write-Host "Installing Build Dependencies"
python -m venv .\shark.venv\
.\shark.venv\Scripts\activate
pip install -r requirements.txt
pip install --pre torch-mlir torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cu116 -f https://llvm.github.io/torch-mlir/package-index/
pip install --upgrade -f https://nod-ai.github.io/SHARK-Runtime/pip-release-links.html iree-compiler iree-runtime
Write-Host "Building SHARK..."
pip install -e . -f https://llvm.github.io/torch-mlir/package-index/ -f https://nod-ai.github.io/SHARK-Runtime/pip-release-links.html
pip install diffusers transformers scipy pillow gradio
Write-Host "Build and installation completed successfully"
Write-Host "Source your venv with ./shark.venv/Scripts/activate"

View File

@@ -76,11 +76,15 @@ fi
$PYTHON -m pip install --upgrade pip || die "Could not upgrade pip"
$PYTHON -m pip install --upgrade -r "$TD/requirements.txt"
if [ "$torch_mlir_bin" = true ]; then
$PYTHON -m pip install --pre torch-mlir -f https://llvm.github.io/torch-mlir/package-index/
if [ $? -eq 0 ];then
echo "Successfully Installed torch-mlir"
if [[ $(uname -s) = 'Darwin' ]]; then
echo "MacOS detected. Please install torch-mlir from source or .whl, as dependency problems may occur otherwise."
else
echo "Could not install torch-mlir" >&2
$PYTHON -m pip install --pre torch-mlir -f https://llvm.github.io/torch-mlir/package-index/
if [ $? -eq 0 ];then
echo "Successfully Installed torch-mlir"
else
echo "Could not install torch-mlir" >&2
fi
fi
else
echo "${Red}No binaries found for Python $PYTHON_VERSION_X_Y on $(uname -s)"
@@ -89,13 +93,13 @@ else
exit 1
fi
if [[ -z "${USE_IREE}" ]]; then
RUNTIME="nod-ai/SHARK-Runtime"
RUNTIME="https://nod-ai.github.io/SHARK-Runtime/pip-release-links.html"
else
RUNTIME="google/iree"
RUNTIME="https://iree-org.github.io/iree/pip-release-links.html"
fi
if [[ -z "${NO_BACKEND}" ]]; then
echo "Installing ${RUNTIME}..."
$PYTHON -m pip install --find-links https://github.com/${RUNTIME}/releases iree-compiler iree-runtime
$PYTHON -m pip install --upgrade --find-links ${RUNTIME} iree-compiler iree-runtime
else
echo "Not installing a backend, please make sure to add your backend to PYTHONPATH"
fi
@@ -103,15 +107,17 @@ if [[ ! -z "${IMPORTER}" ]]; then
echo "${Yellow}Installing importer tools.."
if [[ $(uname -s) = 'Linux' ]]; then
echo "${Yellow}Linux detected.. installing Linux importer tools"
$PYTHON -m pip install --upgrade -r "$TD/requirements-importer.txt" -f https://github.com/${RUNTIME}/releases --extra-index-url https://download.pytorch.org/whl/nightly/cpu
#Always get the importer tools from upstream IREE
$PYTHON -m pip install --upgrade -r "$TD/requirements-importer.txt" -f https://iree-org.github.io/iree/pip-release-links.html --extra-index-url https://download.pytorch.org/whl/nightly/cpu
elif [[ $(uname -s) = 'Darwin' ]]; then
echo "${Yellow}macOS detected.. installing macOS importer tools"
#Conda seems to have some problems installing these packages; we hope they get resolved upstream.
$PYTHON -m pip install --upgrade -r "$TD/requirements-importer-macos.txt" -f https://github.com/${RUNTIME}/releases --extra-index-url https://download.pytorch.org/whl/nightly/cpu
$PYTHON -m pip install --upgrade -r "$TD/requirements-importer-macos.txt" -f ${RUNTIME} --extra-index-url https://download.pytorch.org/whl/nightly/cpu
$PYTHON -m pip install https://github.com/llvm/torch-mlir/releases/download/snapshot-20221024.636/torch_mlir-20221024.636-cp310-cp310-macosx_11_0_universal2.whl
fi
fi
$PYTHON -m pip install -e . -f https://llvm.github.io/torch-mlir/package-index/ -f https://github.com/${RUNTIME}/releases
$PYTHON -m pip install -e . -f https://llvm.github.io/torch-mlir/package-index/ -f ${RUNTIME}
if [[ $(uname -s) = 'Linux' && ! -z "${BENCHMARK}" ]]; then
$PYTHON -m pip uninstall -y torch torchvision

View File

@@ -69,7 +69,7 @@ labels = load_labels()
mlir_model, func_name, inputs, golden_out = download_torch_model("resnet50")
shark_module = SharkInference(mlir_model, func_name, mlir_dialect="linalg")
# shark_module.compile()
shark_module.compile()
path = shark_module.save_module()
shark_module.load_module(path)
result = shark_module.forward((img.detach().numpy(),))

View File

@@ -47,7 +47,7 @@ def load_mlir(mlir_loc):
return mlir_module
def compile_through_fx(model, inputs, mlir_loc=None):
def compile_through_fx(model, inputs, mlir_loc=None, extra_args=[]):
module = load_mlir(mlir_loc)
if mlir_loc == None:
@@ -98,9 +98,12 @@ def compile_through_fx(model, inputs, mlir_loc=None):
func_name = "forward"
shark_module = SharkInference(
mlir_model, func_name, device=args.device, mlir_dialect="tm_tensor"
mlir_model,
func_name,
device=args.device,
mlir_dialect="tm_tensor",
)
shark_module.compile()
shark_module.compile(extra_args)
return shark_module
@@ -161,6 +164,7 @@ if __name__ == "__main__":
unet,
(latent_model_input, torch.tensor([1.0]), text_embeddings),
args.mlir_loc,
["--iree-flow-enable-conv-nchw-to-nhwc-transform"],
)
# torch.jit.script(unet)

View File

@@ -0,0 +1,278 @@
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
import torch
from PIL import Image
from diffusers import LMSDiscreteScheduler
from tqdm.auto import tqdm
from shark.shark_inference import SharkInference
from torch.fx.experimental.proxy_tensor import make_fx
from torch._decomp import get_decompositions
import torch_mlir
import tempfile
import numpy as np
# pip install diffusers
# pip install scipy
############### Parsing args #####################
import argparse
p = argparse.ArgumentParser(
description=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
p.add_argument(
"--prompt",
type=str,
default="a photograph of an astronaut riding a horse",
help="the text prompt to use",
)
p.add_argument("--device", type=str, default="cpu", help="the device to use")
p.add_argument("--steps", type=int, default=50, help="the device to use")
p.add_argument("--mlir_loc", type=str, default=None, help="the device to use")
p.add_argument("--vae_loc", type=str, default=None, help="the device to use")
args = p.parse_args()
#####################################################
def fp16_unet():
from shark.shark_downloader import download_torch_model
mlir_model, func_name, inputs, golden_out = download_torch_model(
"stable_diff_f16_18_OCT", tank_url="gs://shark_tank/prashant_nod"
)
shark_module = SharkInference(
mlir_model, func_name, device=args.device, mlir_dialect="linalg"
)
shark_module.compile()
return shark_module
def load_mlir(mlir_loc):
import os
if mlir_loc == None:
return None
print(f"Trying to load the model from {mlir_loc}.")
with open(os.path.join(mlir_loc)) as f:
mlir_module = f.read()
return mlir_module
def compile_through_fx(model, inputs, mlir_loc=None):
module = load_mlir(mlir_loc)
if mlir_loc == None:
fx_g = make_fx(
model,
decomposition_table=get_decompositions(
[
torch.ops.aten.embedding_dense_backward,
torch.ops.aten.native_layer_norm_backward,
torch.ops.aten.slice_backward,
torch.ops.aten.select_backward,
torch.ops.aten.norm.ScalarOpt_dim,
torch.ops.aten.native_group_norm,
torch.ops.aten.upsample_bilinear2d.vec,
torch.ops.aten.split.Tensor,
torch.ops.aten.split_with_sizes,
]
),
)(*inputs)
fx_g.graph.set_codegen(torch.fx.graph.CodeGen())
fx_g.recompile()
def strip_overloads(gm):
"""
Modifies the target of graph nodes in :attr:`gm` to strip overloads.
Args:
gm(fx.GraphModule): The input Fx graph module to be modified
"""
for node in gm.graph.nodes:
if isinstance(node.target, torch._ops.OpOverload):
node.target = node.target.overloadpacket
gm.recompile()
strip_overloads(fx_g)
ts_g = torch.jit.script(fx_g)
module = torch_mlir.compile(
ts_g,
inputs,
torch_mlir.OutputType.LINALG_ON_TENSORS,
use_tracing=False,
verbose=False,
)
mlir_model = module
func_name = "forward"
shark_module = SharkInference(
mlir_model, func_name, device=args.device, mlir_dialect="linalg"
)
shark_module.compile()
return shark_module
if __name__ == "__main__":
YOUR_TOKEN = "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"
# 1. Load the autoencoder model which will be used to decode the latents into image space.
vae = AutoencoderKL.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="vae",
use_auth_token=YOUR_TOKEN,
)
# 2. Load the tokenizer and text encoder to tokenize and encode the text.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained(
"openai/clip-vit-large-patch14"
)
class VaeModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.vae = AutoencoderKL.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="vae",
use_auth_token=YOUR_TOKEN,
)
def forward(self, input):
return self.vae.decode(input, return_dict=False)[0]
vae = VaeModel()
vae_input = torch.rand(1, 4, 64, 64)
shark_vae = compile_through_fx(vae, (vae_input,), args.vae_loc)
# Wrap the unet model to return tuples.
class UnetModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.unet = UNet2DConditionModel.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="unet",
use_auth_token=YOUR_TOKEN,
)
self.in_channels = self.unet.in_channels
self.train(False)
def forward(self, x, y, z):
return self.unet.forward(x, y, z, return_dict=False)[0]
# # 3. The UNet model for generating the latents.
unet = UnetModel()
shark_unet = fp16_unet()
scheduler = LMSDiscreteScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
num_train_timesteps=1000,
)
prompt = [args.prompt]
height = 512 # default height of Stable Diffusion
width = 512 # default width of Stable Diffusion
num_inference_steps = args.steps # Number of denoising steps
guidance_scale = 7.5 # Scale for classifier-free guidance
generator = torch.manual_seed(
42
) # Seed generator to create the initial latent noise
batch_size = len(prompt)
text_input = tokenizer(
prompt,
padding="max_length",
max_length=tokenizer.model_max_length,
truncation=True,
return_tensors="pt",
)
text_embeddings = text_encoder(text_input.input_ids)[0]
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
[""] * batch_size,
padding="max_length",
max_length=max_length,
return_tensors="pt",
)
uncond_embeddings = text_encoder(uncond_input.input_ids)[0]
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
latents = torch.randn(
(batch_size, unet.in_channels, height // 8, width // 8),
generator=generator,
)
# latents = latents.to(torch_device)
scheduler.set_timesteps(num_inference_steps)
latents = latents * scheduler.sigmas[0]
# print(latents, latents.shape)
for i, t in tqdm(enumerate(scheduler.timesteps)):
print(f"i = {i} t = {t}")
# expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
latent_model_input = torch.cat([latents] * 2)
sigma = scheduler.sigmas[i]
latent_model_input = latent_model_input / ((sigma**2 + 1) ** 0.5)
# predict the noise residual
# with torch.no_grad():
# noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)
latent_model_input_numpy = (
latent_model_input.detach().numpy().astype(np.half)
)
text_embeddings_numpy = (
text_embeddings.detach().numpy().astype(np.half)
)
noise_pred = shark_unet.forward(
(
latent_model_input_numpy,
np.array([t]).astype(np.half),
text_embeddings_numpy,
)
)
noise_pred = torch.from_numpy(noise_pred).to(torch.float32)
# perform guidance
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (
noise_pred_text - noise_pred_uncond
)
# compute the previous noisy sample x_t -> x_t-1
latents = scheduler.step(noise_pred, i, latents)["prev_sample"]
# print("Latents shape : ", latents.shape)
# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
latents_numpy = latents.detach().numpy()
image = shark_vae.forward((latents_numpy,))
image = torch.from_numpy(image)
image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
pil_images[0].save("astro.jpg")

View File

@@ -0,0 +1,2 @@
*.vmfb
*.jpg

View File

@@ -0,0 +1,15 @@
# STABLE DIFFUSION
## Installation
```shell
pip install diffusers
pip install scipy
```
## RUN
```shell
python main.py --precision="fp32"|"fp16" --prompt="enter the text" --device="cpu"|"cuda"|"vulkan" --import_mlir|--no-import_mlir
```
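For instance, a concrete fp16 run on Vulkan using the flags listed above (the prompt text is illustrative) might look like:
```shell
python main.py --precision="fp16" --prompt="a photograph of an astronaut riding a horse" --device="vulkan"
```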

View File

@@ -0,0 +1,25 @@
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(
text=["a photo of a cat", "a photo of a dog"],
images=image,
return_tensors="pt",
padding=True,
)
outputs = model(**inputs)
logits_per_image = (
outputs.logits_per_image
) # this is the image-text similarity score
probs = logits_per_image.softmax(
dim=1
) # we can take the softmax to get the label probabilities

View File

@@ -0,0 +1,241 @@
from transformers import CLIPTextModel, CLIPTokenizer
import torch
from PIL import Image
from diffusers import LMSDiscreteScheduler
from tqdm.auto import tqdm
import numpy as np
from stable_args import args
from model_wrappers import (
get_vae32,
get_vae16,
get_unet16_wrapped,
get_unet32_wrapped,
get_clipped_text,
)
from utils import get_shark_model
import time
GCLOUD_BUCKET = "gs://shark_tank/prashant_nod"
VAE_FP16 = "vae_fp16"
VAE_FP32 = "vae_fp32"
UNET_FP16 = "unet_fp16"
UNET_FP32 = "unet_fp32"
IREE_EXTRA_ARGS = []
TUNED_GCLOUD_BUCKET = "gs://shark_tank/quinn"
UNET_FP16_TUNED = "unet_fp16_tunedv2"
BATCH_SIZE = len(args.prompts)
if BATCH_SIZE not in [1, 2]:
import sys
sys.exit("Only batch size 1 and 2 are supported.")
if BATCH_SIZE > 1 and args.precision != "fp16":
sys.exit("batch size > 1 is supported for fp16 model.")
if BATCH_SIZE != 1:
TUNED_GCLOUD_BUCKET = "gs://shark_tank/prashant_nod"
UNET_FP16_TUNED = f"unet_fp16_{BATCH_SIZE}"
VAE_FP16 = f"vae_fp16_{BATCH_SIZE}"
# Helper function to profile the vulkan device.
def start_profiling(file_path="foo.rdc", profiling_mode="queue"):
if args.vulkan_debug_utils and "vulkan" in args.device:
import iree
print(f"Profiling and saving to {file_path}.")
vulkan_device = iree.runtime.get_device(args.device)
vulkan_device.begin_profiling(mode=profiling_mode, file_path=file_path)
return vulkan_device
return None
def end_profiling(device):
if device:
return device.end_profiling()
def get_models():
global IREE_EXTRA_ARGS
if args.precision == "fp16":
IREE_EXTRA_ARGS += [
"--iree-flow-enable-padding-linalg-ops",
"--iree-flow-linalg-ops-padding-size=32",
]
if args.use_tuned:
unet_gcloud_bucket = TUNED_GCLOUD_BUCKET
vae_gcloud_bucket = GCLOUD_BUCKET
unet_args = IREE_EXTRA_ARGS
vae_args = IREE_EXTRA_ARGS + [
"--iree-flow-enable-conv-nchw-to-nhwc-transform"
]
unet_name = UNET_FP16_TUNED
vae_name = VAE_FP16
else:
unet_gcloud_bucket = GCLOUD_BUCKET
vae_gcloud_bucket = GCLOUD_BUCKET
IREE_EXTRA_ARGS += [
"--iree-flow-enable-conv-nchw-to-nhwc-transform"
]
unet_args = IREE_EXTRA_ARGS
vae_args = IREE_EXTRA_ARGS
unet_name = UNET_FP16
vae_name = VAE_FP16
if BATCH_SIZE > 1:
vae_args = []
if args.import_mlir == True:
return get_vae16(model_name=VAE_FP16), get_unet16_wrapped(
model_name=UNET_FP16
)
else:
return get_shark_model(
vae_gcloud_bucket,
vae_name,
vae_args,
), get_shark_model(
unet_gcloud_bucket,
unet_name,
unet_args,
)
elif args.precision == "fp32":
IREE_EXTRA_ARGS += [
"--iree-flow-enable-conv-nchw-to-nhwc-transform",
"--iree-flow-enable-padding-linalg-ops",
"--iree-flow-linalg-ops-padding-size=16",
]
if args.import_mlir == True:
return get_vae32(model_name=VAE_FP32), get_unet32_wrapped(
model_name=UNET_FP32
)
else:
return get_shark_model(
GCLOUD_BUCKET,
VAE_FP32,
IREE_EXTRA_ARGS,
), get_shark_model(
GCLOUD_BUCKET,
UNET_FP32,
IREE_EXTRA_ARGS,
)
if __name__ == "__main__":
dtype = torch.float32 if args.precision == "fp32" else torch.half
if len(args.iree_vulkan_target_triple) > 0:
IREE_EXTRA_ARGS.append(
f"-iree-vulkan-target-triple={args.iree_vulkan_target_triple}"
)
clip_model = "clip_text"
clip_extra_args = [
"--iree-flow-linalg-ops-padding-size=16",
"--iree-flow-enable-padding-linalg-ops",
]
clip = get_shark_model(GCLOUD_BUCKET, clip_model, clip_extra_args)
prompt = args.prompts
height = 512 # default height of Stable Diffusion
width = 512 # default width of Stable Diffusion
num_inference_steps = args.steps # Number of denoising steps
guidance_scale = args.guidance_scale # Scale for classifier-free guidance
generator = torch.manual_seed(
args.seed
) # Seed generator to create the initial latent noise
batch_size = len(prompt)
vae, unet = get_models()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
scheduler = LMSDiscreteScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
num_train_timesteps=1000,
)
start = time.time()
text_input = tokenizer(
prompt,
padding="max_length",
max_length=args.max_length,
truncation=True,
return_tensors="pt",
)
text_embeddings = clip.forward((text_input.input_ids,))
text_embeddings = torch.from_numpy(text_embeddings).to(dtype)
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
[""] * batch_size,
padding="max_length",
max_length=max_length,
return_tensors="pt",
)
uncond_embeddings = clip.forward((uncond_input.input_ids,))
uncond_embeddings = torch.from_numpy(uncond_embeddings).to(dtype)
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
latents = torch.randn(
(batch_size, 4, height // 8, width // 8),
generator=generator,
dtype=torch.float32,
).to(dtype)
scheduler.set_timesteps(num_inference_steps)
scheduler.is_scale_input_called = True
latents = latents * scheduler.sigmas[0]
text_embeddings_numpy = text_embeddings.detach().numpy()
avg_ms = 0
for i, t in tqdm(enumerate(scheduler.timesteps)):
step_start = time.time()
print(f"i = {i} t = {t}", end="")
timestep = torch.tensor([t]).to(dtype).detach().numpy()
latents_numpy = latents.detach().numpy()
sigma_numpy = np.array(scheduler.sigmas[i]).astype(np.float32)
profile_device = start_profiling(file_path="unet.rdc")
noise_pred = unet.forward(
(latents_numpy, timestep, text_embeddings_numpy, sigma_numpy)
)
end_profiling(profile_device)
noise_pred = torch.from_numpy(noise_pred)
step_time = time.time() - step_start
avg_ms += step_time
step_ms = int((step_time) * 1000)
print(f" ({step_ms}ms)")
latents = scheduler.step(noise_pred, i, latents)["prev_sample"]
avg_ms = 1000 * avg_ms / args.steps
print(f"Average step time: {avg_ms}ms/it")
# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
latents_numpy = latents.detach().numpy()
profile_device = start_profiling(file_path="vae.rdc")
image = vae.forward((latents_numpy,))
end_profiling(profile_device)
image = torch.from_numpy(image)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
print("Total image generation runtime (s): {}".format(time.time() - start))
pil_images = [Image.fromarray(image) for image in images]
for i in range(batch_size):
pil_images[i].save(f"{args.prompts[i]}_{i}.jpg")

View File

@@ -0,0 +1,223 @@
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from transformers import CLIPTextModel
from utils import compile_through_fx
from stable_args import args
import torch
YOUR_TOKEN = "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"
BATCH_SIZE = len(args.prompts)
def get_clipped_text(model_name="clip_text"):
class CLIPText(torch.nn.Module):
def __init__(self):
super().__init__()
self.text_encoder = CLIPTextModel.from_pretrained(
"openai/clip-vit-large-patch14"
)
def forward(self, input):
return self.text_encoder(input)[0]
clip_model = CLIPText()
clip_input = torch.randint(1, 2, (BATCH_SIZE, 77))
shark_clip = compile_through_fx(
clip_model,
(clip_input,),
model_name=model_name,
)
return shark_clip
def get_vae32(model_name="vae_fp32"):
class VaeModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.vae = AutoencoderKL.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="vae",
use_auth_token=YOUR_TOKEN,
)
def forward(self, input):
x = self.vae.decode(input, return_dict=False)[0]
return (x / 2 + 0.5).clamp(0, 1)
vae = VaeModel()
vae_input = torch.rand(BATCH_SIZE, 4, 64, 64)
shark_vae = compile_through_fx(
vae,
(vae_input,),
model_name=model_name,
)
return shark_vae
def get_vae16(model_name="vae_fp16"):
class VaeModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.vae = AutoencoderKL.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="vae",
use_auth_token=YOUR_TOKEN,
revision="fp16",
)
def forward(self, input):
x = self.vae.decode(input, return_dict=False)[0]
return (x / 2 + 0.5).clamp(0, 1)
vae = VaeModel()
vae = vae.half().cuda()
vae_input = torch.rand(BATCH_SIZE, 4, 64, 64, dtype=torch.half).cuda()
shark_vae = compile_through_fx(
vae,
(vae_input,),
model_name=model_name,
)
return shark_vae
def get_unet32(model_name="unet_fp32"):
class UnetModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.unet = UNet2DConditionModel.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="unet",
use_auth_token=YOUR_TOKEN,
)
self.in_channels = self.unet.in_channels
self.train(False)
def forward(self, x, y, z):
return self.unet.forward(x, y, z, return_dict=False)[0]
unet = UnetModel()
latent_model_input = torch.rand([2, 4, 64, 64])
text_embeddings = torch.rand([2, args.max_length, 768])
shark_unet = compile_through_fx(
unet,
(latent_model_input, torch.tensor([1.0]), text_embeddings),
model_name=model_name,
)
return shark_unet
def get_unet16(model_name="unet_fp16"):
class UnetModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.unet = UNet2DConditionModel.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="unet",
use_auth_token=YOUR_TOKEN,
revision="fp16",
)
self.in_channels = self.unet.in_channels
self.train(False)
def forward(self, x, y, z):
return self.unet.forward(x, y, z, return_dict=False)[0]
unet = UnetModel()
unet = unet.half().cuda()
latent_model_input = torch.rand([2, 4, 64, 64]).half().cuda()
text_embeddings = torch.rand([2, args.max_length, 768]).half().cuda()
shark_unet = compile_through_fx(
unet,
(
latent_model_input,
torch.tensor([1.0]).half().cuda(),
text_embeddings,
),
model_name=model_name,
)
return shark_unet
def get_unet16_wrapped(guidance_scale=7.5, model_name="unet_fp16_wrapped"):
class UnetModel(torch.nn.Module):
def __init__(self, guidance_scale=guidance_scale):
super().__init__()
self.unet = UNet2DConditionModel.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="unet",
use_auth_token=YOUR_TOKEN,
revision="fp16",
)
self.in_channels = self.unet.in_channels
self.guidance_scale = guidance_scale
self.train(False)
def forward(self, latent, timestep, text_embedding, sigma):
# expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
latents = torch.cat([latent] * 2)
latents = latents / (torch.pow((torch.pow(sigma, 2) + 1), 0.5))
unet_out = self.unet.forward(
latents, timestep, text_embedding, return_dict=False
)[0]
noise_pred_uncond, noise_pred_text = unet_out.chunk(2)
noise_pred = noise_pred_uncond + self.guidance_scale * (
noise_pred_text - noise_pred_uncond
)
return noise_pred
unet = UnetModel()
unet = unet.half().cuda()
latent_model_input = torch.rand([BATCH_SIZE, 4, 64, 64]).half().cuda()
text_embeddings = (
torch.rand([2 * BATCH_SIZE, args.max_length, 768]).half().cuda()
)
sigma = torch.tensor(1).to(torch.float32)
shark_unet = compile_through_fx(
unet,
(
latent_model_input,
torch.tensor([1.0]).half().cuda(),
text_embeddings,
sigma,
),
model_name=model_name,
)
return shark_unet
def get_unet32_wrapped(guidance_scale=7.5, model_name="unet_fp32_wrapped"):
class UnetModel(torch.nn.Module):
def __init__(self, guidance_scale=guidance_scale):
super().__init__()
self.unet = UNet2DConditionModel.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="unet",
use_auth_token=YOUR_TOKEN,
)
self.in_channels = self.unet.in_channels
self.guidance_scale = guidance_scale
self.train(False)
def forward(self, latent, timestep, text_embedding, sigma):
latents = torch.cat([latent] * 2)
latents = latents / (torch.pow((torch.pow(sigma, 2) + 1), 0.5))
unet_out = self.unet.forward(
latents, timestep, text_embedding, return_dict=False
)[0]
noise_pred_uncond, noise_pred_text = unet_out.chunk(2)
noise_pred = noise_pred_uncond + self.guidance_scale * (
noise_pred_text - noise_pred_uncond
)
return noise_pred
unet = UnetModel()
latent_model_input = torch.rand([BATCH_SIZE, 4, 64, 64])
text_embeddings = torch.rand([2 * BATCH_SIZE, args.max_length, 768])
sigma = torch.tensor(1).to(torch.float32)
shark_unet = compile_through_fx(
unet,
(latent_model_input, torch.tensor([1.0]), text_embeddings, sigma),
model_name=model_name,
)
return shark_unet

View File

@@ -0,0 +1,88 @@
import argparse
p = argparse.ArgumentParser(
description=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
p.add_argument(
"--prompts",
nargs="+",
default=["a photograph of an astronaut riding a horse"],
help="text of which images to be generated.",
)
p.add_argument(
"--device", type=str, default="cpu", help="device to run the model."
)
p.add_argument(
"--steps",
type=int,
default=10,
help="the no. of steps to do the sampling.",
)
p.add_argument(
"--seed",
type=int,
default=42,
help="the seed to use.",
)
p.add_argument(
"--guidance_scale",
type=float,
default=7.5,
help="the value to be used for guidance scaling.",
)
p.add_argument(
"--import_mlir",
default=False,
action=argparse.BooleanOptionalAction,
help="imports the model from torch module to shark_module otherwise downloads the model from shark_tank.",
)
p.add_argument(
"--precision", type=str, default="fp32", help="precision to run the model."
)
p.add_argument(
"--max_length",
type=int,
default=77,
help="max length of the tokenizer output.",
)
p.add_argument(
"--load_vmfb",
default=True,
action=argparse.BooleanOptionalAction,
help="attempts to load the model from a precompiled flatbuffer and compiles + saves it if not found.",
)
p.add_argument(
"--save_vmfb",
default=False,
action=argparse.BooleanOptionalAction,
help="saves the compiled flatbuffer to the local directory",
)
p.add_argument(
"--iree-vulkan-target-triple",
type=str,
default="",
help="Specify target triple for vulkan",
)
p.add_argument(
"--vulkan_debug_utils",
default=False,
action=argparse.BooleanOptionalAction,
help="Profiles vulkan device and collects the .rdc info",
)
p.add_argument(
"--use_tuned",
default=True,
action=argparse.BooleanOptionalAction,
help="Download and use the tuned version of the model if available",
)
args = p.parse_args()

View File

@@ -0,0 +1,103 @@
import os
import torch
from shark.shark_inference import SharkInference
from shark.shark_importer import SharkImporter
from torch.fx.experimental.proxy_tensor import make_fx
from stable_args import args
from torch._decomp import get_decompositions
import torch_mlir
def _compile_module(shark_module, model_name, extra_args=[]):
if args.load_vmfb or args.save_vmfb:
extended_name = "{}_{}".format(model_name, args.device)
vmfb_path = os.path.join(os.getcwd(), extended_name + ".vmfb")
if args.load_vmfb and os.path.isfile(vmfb_path) and not args.save_vmfb:
print("Loading flatbuffer from {}".format(vmfb_path))
shark_module.load_module(vmfb_path)
else:
if args.save_vmfb:
print("Saving to {}".format(vmfb_path))
else:
print(
"No vmfb found. Compiling and saving to {}".format(
vmfb_path
)
)
path = shark_module.save_module(
os.getcwd(), extended_name, extra_args
)
shark_module.load_module(path)
else:
shark_module.compile(extra_args)
return shark_module
# Downloads the model from shark_tank and returns the shark_module.
def get_shark_model(tank_url, model_name, extra_args=[]):
from shark.shark_downloader import download_torch_model
mlir_model, func_name, inputs, golden_out = download_torch_model(
model_name, tank_url=tank_url
)
shark_module = SharkInference(
mlir_model, func_name, device=args.device, mlir_dialect="linalg"
)
return _compile_module(shark_module, model_name, extra_args)
# Converts the torch-module into shark_module.
def compile_through_fx(model, inputs, model_name, extra_args=[]):
fx_g = make_fx(
model,
decomposition_table=get_decompositions(
[
torch.ops.aten.embedding_dense_backward,
torch.ops.aten.native_layer_norm_backward,
torch.ops.aten.slice_backward,
torch.ops.aten.select_backward,
torch.ops.aten.norm.ScalarOpt_dim,
torch.ops.aten.native_group_norm,
torch.ops.aten.upsample_bilinear2d.vec,
torch.ops.aten.split.Tensor,
torch.ops.aten.split_with_sizes,
]
),
)(*inputs)
fx_g.graph.set_codegen(torch.fx.graph.CodeGen())
fx_g.recompile()
def strip_overloads(gm):
"""
Modifies the target of graph nodes in :attr:`gm` to strip overloads.
Args:
gm(fx.GraphModule): The input Fx graph module to be modified
"""
for node in gm.graph.nodes:
if isinstance(node.target, torch._ops.OpOverload):
node.target = node.target.overloadpacket
gm.recompile()
strip_overloads(fx_g)
ts_g = torch.jit.trace(fx_g, inputs)
mlir_importer = SharkImporter(
ts_g,
inputs,
frontend="torch",
)
(mlir_module, func_name), _, _ = mlir_importer.import_debug()
shark_module = SharkInference(
mlir_module,
func_name,
device=args.device,
mlir_dialect="linalg",
)
return _compile_module(shark_module, model_name, extra_args)

View File

@@ -0,0 +1,41 @@
# Stable Diffusion Img2Img model
## Installation
<details>
<summary>Installation (Linux)</summary>
### Activate shark.venv Virtual Environment
```shell
source shark.venv/bin/activate
# Some older pip installs may not be able to handle the recent PyTorch deps
python -m pip install --upgrade pip
```
### Install dependencies
Run the setup.sh script:
```shell
./setup.sh
```
### Run the Stable diffusion Img2Img model
To run the model with the default set of images and params, run:
```shell
python stable_diffusion_img2img.py
```
To run the model with your own set of images and parameters, you need to specify the following params:
1.) Input images directory, with the arg `--input_dir`, containing 3-5 images.
2.) What to teach the model, using the arg `--what_to_teach`; allowed values are `object` or `style`.
3.) Placeholder token, using the arg `--placeholder_token`, that represents your new concept. It should be passed with the opening and closing angle brackets; for example, the token `cat-toy` should be passed as `<cat-toy>`.
4.) Initializer token, using the arg `--initializer_token`, a word that summarizes your new concept.
For the result, pass the text prompt with the arg `--prompt`. The prompt string should contain a `"*s"`, which will be replaced by the placeholder token during inference.
By default the result images go into the `sd_result` dir; to specify your own output dir, use the arg `--output_dir`.
The default value of max_training_steps is `3000`, which can take several hours to complete; a smaller value can be passed with the arg `--training_steps`. Specify the number of images to be sampled for the result with the `--num_inference_samples` arg. An example invocation is shown below.
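A minimal sketch of such a run, assuming a directory `input_images/` with 3-5 photos of a toy (all values are illustrative):
```shell
python stable_diffusion_img2img.py \
  --input_dir="input_images/" \
  --what_to_teach="object" \
  --placeholder_token="<cat-toy>" \
  --initializer_token="toy" \
  --training_steps=1000 \
  --prompt="a photo of a *s on the beach" \
  --output_dir="sd_result"
```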

View File

@@ -0,0 +1,25 @@
#!/bin/bash
TD="$(cd $(dirname $0) && pwd)"
if [ -z "$PYTHON" ]; then
PYTHON="$(which python3)"
fi
function die() {
echo "Error executing command: $*"
exit 1
}
PYTHON_VERSION_X_Y=`${PYTHON} -c 'import sys; version=sys.version_info[:2]; print("{0}.{1}".format(*version))'`
echo "Python: $PYTHON"
echo "Python version: $PYTHON_VERSION_X_Y"
mkdir input_images
wget https://huggingface.co/datasets/valhalla/images/resolve/main/2.jpeg -P input_images/
wget https://huggingface.co/datasets/valhalla/images/resolve/main/3.jpeg -P input_images/
wget https://huggingface.co/datasets/valhalla/images/resolve/main/5.jpeg -P input_images/
wget https://huggingface.co/datasets/valhalla/images/resolve/main/6.jpeg -P input_images/
pip install diffusers["training"]==0.4.1 transformers ftfy opencv-python

View File

@@ -0,0 +1,597 @@
# Textual-inversion fine-tuning for Stable Diffusion using diffusers
# This script shows how to "teach" Stable Diffusion a new concept via
# textual-inversion using 🤗 Hugging Face [🧨 Diffusers library](https://github.com/huggingface/diffusers).
# By using just 3-5 images you can teach new concepts to Stable Diffusion
# and personalize the model on your own images.
import argparse
import itertools
import math
import os
import random
import cv2
import numpy as np
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch.utils.data import Dataset
import PIL
from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import set_seed
from diffusers import (
AutoencoderKL,
DDPMScheduler,
PNDMScheduler,
StableDiffusionPipeline,
UNet2DConditionModel,
)
from diffusers.hub_utils import init_git_repo, push_to_hub
from diffusers.optimization import get_scheduler
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm
from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
YOUR_TOKEN = "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"
p = argparse.ArgumentParser(
description=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
p.add_argument(
"--input_dir",
type=str,
default="input_images/",
help="the directory contains the images used for fine tuning",
)
p.add_argument(
"--output_dir",
type=str,
default="sd_result",
help="the directory contains the images used for fine tuning",
)
p.add_argument(
"--training_steps",
type=int,
default=3000,
help="the maximum number of training steps",
)
p.add_argument("--seed", type=int, default=42, help="the random seed")
p.add_argument(
"--what_to_teach",
type=str,
choices=["object", "style"],
default="object",
help="what is it that you are teaching?",
)
p.add_argument(
"--placeholder_token",
type=str,
default="<cat-toy>",
help="It is the token you are going to use to represent your new concept",
)
p.add_argument(
"--initializer_token",
type=str,
default="toy",
help="It is a word that can summarise what is your new concept",
)
p.add_argument(
"--inference_steps",
type=int,
default=50,
help="the number of steps for inference",
)
p.add_argument(
"--num_inference_samples",
type=int,
default=4,
help="the number of samples for inference",
)
p.add_argument(
"--prompt",
type=str,
default="a grafitti in a wall with a *s on it",
help="the text prompt to use",
)
args = p.parse_args()
if "*s" not in args.prompt:
raise ValueError(
'The prompt should contain a "*s", which will be replaced by the placeholder token.'
)
prompt1, prompt2 = args.prompt.split("*s")
args.prompt = prompt1 + args.placeholder_token + prompt2
pretrained_model_name_or_path = "CompVis/stable-diffusion-v1-4"
# Load input images.
images = []
for filename in os.listdir(args.input_dir):
img = cv2.imread(os.path.join(args.input_dir, filename))
if img is not None:
images.append(img)
# Setup the prompt templates for training
imagenet_templates_small = [
"a photo of a {}",
"a rendering of a {}",
"a cropped photo of the {}",
"the photo of a {}",
"a photo of a clean {}",
"a photo of a dirty {}",
"a dark photo of the {}",
"a photo of my {}",
"a photo of the cool {}",
"a close-up photo of a {}",
"a bright photo of the {}",
"a cropped photo of a {}",
"a photo of the {}",
"a good photo of the {}",
"a photo of one {}",
"a close-up photo of the {}",
"a rendition of the {}",
"a photo of the clean {}",
"a rendition of a {}",
"a photo of a nice {}",
"a good photo of a {}",
"a photo of the nice {}",
"a photo of the small {}",
"a photo of the weird {}",
"a photo of the large {}",
"a photo of a cool {}",
"a photo of a small {}",
]
imagenet_style_templates_small = [
"a painting in the style of {}",
"a rendering in the style of {}",
"a cropped painting in the style of {}",
"the painting in the style of {}",
"a clean painting in the style of {}",
"a dirty painting in the style of {}",
"a dark painting in the style of {}",
"a picture in the style of {}",
"a cool painting in the style of {}",
"a close-up painting in the style of {}",
"a bright painting in the style of {}",
"a cropped painting in the style of {}",
"a good painting in the style of {}",
"a close-up painting in the style of {}",
"a rendition in the style of {}",
"a nice painting in the style of {}",
"a small painting in the style of {}",
"a weird painting in the style of {}",
"a large painting in the style of {}",
]
# Setup the dataset
class TextualInversionDataset(Dataset):
def __init__(
self,
data_root,
tokenizer,
learnable_property="object", # [object, style]
size=512,
repeats=100,
interpolation="bicubic",
flip_p=0.5,
set="train",
placeholder_token="*",
center_crop=False,
):
self.data_root = data_root
self.tokenizer = tokenizer
self.learnable_property = learnable_property
self.size = size
self.placeholder_token = placeholder_token
self.center_crop = center_crop
self.flip_p = flip_p
self.image_paths = [
os.path.join(self.data_root, file_path)
for file_path in os.listdir(self.data_root)
]
self.num_images = len(self.image_paths)
self._length = self.num_images
if set == "train":
self._length = self.num_images * repeats
self.interpolation = {
"linear": PIL.Image.LINEAR,
"bilinear": PIL.Image.BILINEAR,
"bicubic": PIL.Image.BICUBIC,
"lanczos": PIL.Image.LANCZOS,
}[interpolation]
self.templates = (
imagenet_style_templates_small
if learnable_property == "style"
else imagenet_templates_small
)
self.flip_transform = transforms.RandomHorizontalFlip(p=self.flip_p)
def __len__(self):
return self._length
def __getitem__(self, i):
example = {}
image = Image.open(self.image_paths[i % self.num_images])
if not image.mode == "RGB":
image = image.convert("RGB")
placeholder_string = self.placeholder_token
text = random.choice(self.templates).format(placeholder_string)
example["input_ids"] = self.tokenizer(
text,
padding="max_length",
truncation=True,
max_length=self.tokenizer.model_max_length,
return_tensors="pt",
).input_ids[0]
# default to score-sde preprocessing
img = np.array(image).astype(np.uint8)
if self.center_crop:
crop = min(img.shape[0], img.shape[1])
h, w, = (
img.shape[0],
img.shape[1],
)
img = img[
(h - crop) // 2 : (h + crop) // 2,
(w - crop) // 2 : (w + crop) // 2,
]
image = Image.fromarray(img)
image = image.resize(
(self.size, self.size), resample=self.interpolation
)
image = self.flip_transform(image)
image = np.array(image).astype(np.uint8)
image = (image / 127.5 - 1.0).astype(np.float32)
example["pixel_values"] = torch.from_numpy(image).permute(2, 0, 1)
return example
# Setting up the model
# Load the tokenizer and add the placeholder token as an additional special token.
# Please read and if you agree accept the LICENSE
# [here](https://huggingface.co/CompVis/stable-diffusion-v1-4) if you see an error
tokenizer = CLIPTokenizer.from_pretrained(
pretrained_model_name_or_path,
subfolder="tokenizer",
use_auth_token=YOUR_TOKEN,
)
# Add the placeholder token in tokenizer
num_added_tokens = tokenizer.add_tokens(args.placeholder_token)
if num_added_tokens == 0:
raise ValueError(
f"The tokenizer already contains the token {args.placeholder_token}. Please pass a different"
" `placeholder_token` that is not already in the tokenizer."
)
# Get token ids for our placeholder and initializer token.
# This code block will complain if initializer string is not a single token
# Convert the initializer_token, placeholder_token to ids
token_ids = tokenizer.encode(args.initializer_token, add_special_tokens=False)
# Check if initializer_token is a single token or a sequence of tokens
if len(token_ids) > 1:
raise ValueError("The initializer token must be a single token.")
initializer_token_id = token_ids[0]
placeholder_token_id = tokenizer.convert_tokens_to_ids(args.placeholder_token)
# Load the Stable Diffusion model
# Load models and create wrapper for stable diffusion
text_encoder = CLIPTextModel.from_pretrained(
pretrained_model_name_or_path,
subfolder="text_encoder",
use_auth_token=YOUR_TOKEN,
)
vae = AutoencoderKL.from_pretrained(
pretrained_model_name_or_path,
subfolder="vae",
use_auth_token=YOUR_TOKEN,
)
unet = UNet2DConditionModel.from_pretrained(
pretrained_model_name_or_path,
subfolder="unet",
use_auth_token=YOUR_TOKEN,
)
# We have added the `placeholder_token` to the `tokenizer`, so we resize the token embeddings here;
# this will add a new embedding vector in the token embeddings for our `placeholder_token`
text_encoder.resize_token_embeddings(len(tokenizer))
# Initialise the newly added placeholder token with the embeddings of the initializer token
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_token_id]
# In Textual-Inversion we only train the newly added embedding vector,
# so let's freeze the rest of the model parameters here.
def freeze_params(params):
for param in params:
param.requires_grad = False
# Freeze vae and unet
freeze_params(vae.parameters())
freeze_params(unet.parameters())
# Freeze all parameters except for the token embeddings in text encoder
params_to_freeze = itertools.chain(
text_encoder.text_model.encoder.parameters(),
text_encoder.text_model.final_layer_norm.parameters(),
text_encoder.text_model.embeddings.position_embedding.parameters(),
)
freeze_params(params_to_freeze)
# Creating our training data
train_dataset = TextualInversionDataset(
data_root=args.input_dir,
tokenizer=tokenizer,
size=512,
placeholder_token=args.placeholder_token,
repeats=100,
learnable_property=args.what_to_teach, # Option selected above between object and style
center_crop=False,
set="train",
)
def create_dataloader(train_batch_size=1):
return torch.utils.data.DataLoader(
train_dataset, batch_size=train_batch_size, shuffle=True
)
# Create noise_scheduler for training.
noise_scheduler = DDPMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
num_train_timesteps=1000,
tensor_format="pt",
)
# Define hyperparameters for our training
hyperparameters = {
"learning_rate": 5e-04,
"scale_lr": True,
"max_train_steps": args.training_steps,
"train_batch_size": 1,
"gradient_accumulation_steps": 4,
"seed": args.seed,
"output_dir": "sd-concept-output",
}
def training_function(text_encoder, vae, unet):
logger = get_logger(__name__)
train_batch_size = hyperparameters["train_batch_size"]
gradient_accumulation_steps = hyperparameters[
"gradient_accumulation_steps"
]
learning_rate = hyperparameters["learning_rate"]
max_train_steps = hyperparameters["max_train_steps"]
output_dir = hyperparameters["output_dir"]
accelerator = Accelerator(
gradient_accumulation_steps=gradient_accumulation_steps,
)
train_dataloader = create_dataloader(train_batch_size)
if hyperparameters["scale_lr"]:
learning_rate = (
learning_rate
* gradient_accumulation_steps
* train_batch_size
* accelerator.num_processes
)
# Initialize the optimizer
optimizer = torch.optim.AdamW(
text_encoder.get_input_embeddings().parameters(), # only optimize the embeddings
lr=learning_rate,
)
text_encoder, optimizer, train_dataloader = accelerator.prepare(
text_encoder, optimizer, train_dataloader
)
# Move vae and unet to device
vae.to(accelerator.device)
unet.to(accelerator.device)
# Keep vae and unet in eval mode as we don't train these
vae.eval()
unet.eval()
# We need to recalculate our total training steps as the size of the training dataloader may have changed.
num_update_steps_per_epoch = math.ceil(
len(train_dataloader) / gradient_accumulation_steps
)
num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)
# Train!
total_batch_size = (
train_batch_size
* accelerator.num_processes
* gradient_accumulation_steps
)
logger.info("***** Running training *****")
logger.info(f" Num examples = {len(train_dataset)}")
logger.info(f" Instantaneous batch size per device = {train_batch_size}")
logger.info(
f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}"
)
logger.info(
f" Gradient Accumulation steps = {gradient_accumulation_steps}"
)
logger.info(f" Total optimization steps = {max_train_steps}")
# Only show the progress bar once on each machine.
progress_bar = tqdm(
range(max_train_steps), disable=not accelerator.is_local_main_process
)
progress_bar.set_description("Steps")
global_step = 0
for epoch in range(num_train_epochs):
text_encoder.train()
for step, batch in enumerate(train_dataloader):
with accelerator.accumulate(text_encoder):
# Convert images to latent space
latents = (
vae.encode(batch["pixel_values"])
.latent_dist.sample()
.detach()
)
latents = latents * 0.18215
# Sample noise that we'll add to the latents
noise = torch.randn(latents.shape).to(latents.device)
bsz = latents.shape[0]
# Sample a random timestep for each image
timesteps = torch.randint(
0,
noise_scheduler.num_train_timesteps,
(bsz,),
device=latents.device,
).long()
# Add noise to the latents according to the noise magnitude at each timestep
# (this is the forward diffusion process)
noisy_latents = noise_scheduler.add_noise(
latents, noise, timesteps
)
# Get the text embedding for conditioning
encoder_hidden_states = text_encoder(batch["input_ids"])[0]
# Predict the noise residual
noise_pred = unet(
noisy_latents, timesteps, encoder_hidden_states
).sample
loss = (
F.mse_loss(noise_pred, noise, reduction="none")
.mean([1, 2, 3])
.mean()
)
accelerator.backward(loss)
# Zero out the gradients for all token embeddings except the newly added
# embeddings for the concept, as we only want to optimize the concept embeddings
if accelerator.num_processes > 1:
grads = (
text_encoder.module.get_input_embeddings().weight.grad
)
else:
grads = text_encoder.get_input_embeddings().weight.grad
# Get the index for tokens that we want to zero the grads for
index_grads_to_zero = (
torch.arange(len(tokenizer)) != placeholder_token_id
)
grads.data[index_grads_to_zero, :] = grads.data[
index_grads_to_zero, :
].fill_(0)
optimizer.step()
optimizer.zero_grad()
# Checks if the accelerator has performed an optimization step behind the scenes
if accelerator.sync_gradients:
progress_bar.update(1)
global_step += 1
logs = {"loss": loss.detach().item()}
progress_bar.set_postfix(**logs)
if global_step >= max_train_steps:
break
accelerator.wait_for_everyone()
# Create the pipeline using the trained modules and save it.
if accelerator.is_main_process:
pipeline = StableDiffusionPipeline(
text_encoder=accelerator.unwrap_model(text_encoder),
vae=vae,
unet=unet,
tokenizer=tokenizer,
scheduler=PNDMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
skip_prk_steps=True,
),
safety_checker=StableDiffusionSafetyChecker.from_pretrained(
"CompVis/stable-diffusion-safety-checker"
),
feature_extractor=CLIPFeatureExtractor.from_pretrained(
"openai/clip-vit-base-patch32"
),
)
pipeline.save_pretrained(output_dir)
# Also save the newly trained embeddings
learned_embeds = (
accelerator.unwrap_model(text_encoder)
.get_input_embeddings()
.weight[placeholder_token_id]
)
learned_embeds_dict = {
args.placeholder_token: learned_embeds.detach().cpu()
}
torch.save(
learned_embeds_dict, os.path.join(output_dir, "learned_embeds.bin")
)
import accelerate
accelerate.notebook_launcher(
training_function, args=(text_encoder, vae, unet), num_processes=1
)
# Set up the pipeline
pipe = StableDiffusionPipeline.from_pretrained(
hyperparameters["output_dir"],
# torch_dtype=torch.float16,
)
all_images = []
for _ in range(args.num_inference_samples):
images = pipe(
[args.prompt],
num_inference_steps=args.inference_steps,
guidance_scale=7.5,
).images
all_images.extend(images)
# output_path = os.path.abspath(os.path.join(os.getcwd(), args.output_dir))
if not os.path.isdir(args.output_dir):
os.mkdir(args.output_dir)
[
image.save(f"{args.output_dir}/{i}.jpeg")
for i, image in enumerate(all_images)
]

View File

@@ -78,6 +78,31 @@ def build_benchmark_args(
return benchmark_cl
def build_benchmark_args_non_tensor_input(
input_file: str,
device: str,
inputs: tuple,
mlir_dialect: str,
function_name: str,
):
"""
Inputs: input_file leading to the .vmfb, target device, non-tensor function
inputs, the MLIR dialect, and the entry function name.
Outputs: the command that executes iree-benchmark-module on the target module.
"""
path = benchmark_module.__path__[0]
benchmarker_path = os.path.join(path, "..", "..", "iree-benchmark-module")
benchmark_cl = [benchmarker_path, f"--module_file={input_file}"]
# TODO: The function name can be passed as one of the args.
benchmark_cl.append(f"--entry_function={function_name}")
benchmark_cl.append(f"--device={IREE_DEVICE_MAP[device]}")
for input in inputs:
benchmark_cl.append(f"--function_input={input}")
time_extractor = "| awk 'END{{print $2 $3}}'"
benchmark_cl.append(time_extractor)
return benchmark_cl
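# A minimal usage sketch (hypothetical file and function names), mirroring how
# compile_benchmark_dirs uses this helper:
#
#   cl = build_benchmark_args_non_tensor_input(
#       input_file="dispatch_0/dispatch_0_benchmark.vmfb",
#       device="vulkan",
#       inputs=(0,),
#       mlir_dialect="linalg",
#       function_name="forward",
#   )
#   iterations_per_second = run_benchmark_module(cl)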
def run_benchmark_module(benchmark_cl):
"""
Run benchmark command, extract result and return iteration/seconds.

View File

@@ -14,11 +14,13 @@
import iree.runtime as ireert
import iree.compiler as ireec
from shark.iree_utils._common import IREE_DEVICE_MAP, IREE_TARGET_MAP
from shark.iree_utils.benchmark_utils import *
import numpy as np
import os
import re
# Get the iree-compile arguments given device.
def get_iree_device_args(device):
def get_iree_device_args(device, extra_args=[]):
if device == "cpu":
from shark.iree_utils.cpu_utils import get_iree_cpu_args
@@ -30,7 +32,7 @@ def get_iree_device_args(device):
if device in ["metal", "vulkan"]:
from shark.iree_utils.vulkan_utils import get_iree_vulkan_args
return get_iree_vulkan_args()
return get_iree_vulkan_args(extra_args=extra_args)
if device == "rocm":
from shark.iree_utils.gpu_utils import get_iree_rocm_args
@@ -58,17 +60,138 @@ def get_iree_common_args():
return [
"--iree-stream-resource-index-bits=64",
"--iree-vm-target-index-bits=64",
"--iree-util-zero-fill-elided-attrs",
]
def create_dispatch_dirs(bench_dir, device):
bench_dir_path = bench_dir.split("/")
bench_dir_path[-1] = "temp_" + bench_dir_path[-1]
tmp_bench_dir = "/".join(bench_dir_path)
for f_ in os.listdir(bench_dir):
if os.path.isfile(f"{bench_dir}/{f_}"):
dir_name = re.sub("\.\S*$", "", f_)
if os.path.exists(f"{bench_dir}/{dir_name}"):
os.system(f"rm -rf {bench_dir}/{dir_name}")
os.system(f"mkdir {bench_dir}/{dir_name}")
os.system(f"mv {bench_dir}/{f_} {bench_dir}/{dir_name}/{f_}")
for f_ in os.listdir(tmp_bench_dir):
if os.path.isfile(f"{tmp_bench_dir}/{f_}"):
dir_name = ""
for d_ in os.listdir(bench_dir):
if re.search(f"{d_}(?=\D)", f_):
dir_name = d_
if dir_name != "":
os.system(
f"mv {tmp_bench_dir}/{f_} {bench_dir}/{dir_name}/{dir_name}_benchmark.mlir"
)
def compile_benchmark_dirs(bench_dir, device, dispatch_benchmarks):
dispatch_list = []
all_dispatches = False
if dispatch_benchmarks.lower().strip() == "all":
all_dispatches = True
else:
try:
dispatch_list = [
int(dispatch_index)
for dispatch_index in dispatch_benchmarks.split(" ")
]
except:
print("ERROR: Invalid dispatch benchmarks")
return None
for d_ in os.listdir(bench_dir):
in_dispatches = False
for dispatch in dispatch_list:
if str(dispatch) in d_:
in_dispatches = True
if all_dispatches or in_dispatches:
for f_ in os.listdir(f"{bench_dir}/{d_}"):
if "benchmark.mlir" in f_:
dispatch_file = open(f"{bench_dir}/{d_}/{f_}", "r")
module = dispatch_file.read()
dispatch_file.close()
flatbuffer_blob = ireec.compile_str(
module, target_backends=[IREE_TARGET_MAP[device]]
)
vmfb_file = open(
f"{bench_dir}/{d_}/{d_}_benchmark.vmfb", "wb"
)
vmfb_file.write(flatbuffer_blob)
vmfb_file.close()
config = ireert.Config(IREE_DEVICE_MAP[device])
vm_module = ireert.VmModule.from_flatbuffer(
config.vm_instance, flatbuffer_blob
)
benchmark_cl = build_benchmark_args_non_tensor_input(
input_file=f"{bench_dir}/{d_}/{d_}_benchmark.vmfb",
device=device,
inputs=(0,),
mlir_dialect="linalg",
function_name=vm_module.function_names[0],
)
benchmark_bash = open(
f"{bench_dir}/{d_}/{d_}_benchmark.sh", "w+"
)
benchmark_bash.write("#!/bin/bash\n")
benchmark_bash.write(" ".join(benchmark_cl))
benchmark_bash.close()
benchmark_data = run_benchmark_module(benchmark_cl)
benchmark_file = open(
f"{bench_dir}/{d_}/{d_}_data.txt", "w+"
)
benchmark_file.write(f"DISPATCH: {d_}\n")
benchmark_file.write(str(benchmark_data) + "\n")
benchmark_file.write(
"SHARK BENCHMARK RESULT: "
+ str(1 / (benchmark_data * 0.001))
+ "\n"
)
benchmark_file.close()
elif ".mlir" in f_ and "benchmark" not in f_:
dispatch_file = open(f"{bench_dir}/{d_}/{f_}", "r")
module = dispatch_file.read()
dispatch_file.close()
module = re.sub(
"hal.executable private",
"hal.executable public",
module,
)
flatbuffer_blob = ireec.compile_str(
module,
target_backends=[IREE_TARGET_MAP[device]],
extra_args=["--compile-mode=hal-executable"],
)
spirv_file = open(
f"{bench_dir}/{d_}/{d_}_spirv.vmfb", "wb"
)
spirv_file.write(flatbuffer_blob)
spirv_file.close()
def compile_module_to_flatbuffer(
module, device, frontend, func_name, model_config_path
module, device, frontend, func_name, model_config_path, extra_args
):
# Setup Compile arguments wrt to frontends.
input_type = ""
args = get_iree_frontend_args(frontend)
args += get_iree_device_args(device)
args += get_iree_device_args(device, extra_args)
args += get_iree_common_args()
args += extra_args
if frontend in ["tensorflow", "tf"]:
input_type = "mhlo"
@@ -77,12 +200,10 @@ def compile_module_to_flatbuffer(
elif frontend in ["tflite", "tflite-tosa"]:
input_type = "tosa"
elif frontend in ["tm_tensor"]:
input_type = frontend
input_type = ireec.InputType.TM_TENSOR
# TODO: make it simpler.
# Compile according to the input type, else just try compiling.
if input_type not in ["mhlo", "tosa"]:
module = str(module)
if input_type != "":
# Currently for MHLO/TOSA.
flatbuffer_blob = ireec.compile_str(
@@ -94,7 +215,7 @@ def compile_module_to_flatbuffer(
else:
# Currently for Torch.
flatbuffer_blob = ireec.compile_str(
str(module),
module,
target_backends=[IREE_TARGET_MAP[device]],
extra_args=args,
)
@@ -120,10 +241,11 @@ def get_iree_compiled_module(
frontend: str = "torch",
func_name: str = "forward",
model_config_path: str = None,
extra_args: list = [],
):
"""Given a module returns the compiled .vmfb and configs"""
flatbuffer_blob = compile_module_to_flatbuffer(
module, device, frontend, func_name, model_config_path
module, device, frontend, func_name, model_config_path, extra_args
)
return get_iree_module(flatbuffer_blob, device, func_name)
@@ -145,12 +267,15 @@ def export_iree_module_to_vmfb(
mlir_dialect: str = "linalg",
func_name: str = "forward",
model_config_path: str = None,
module_name: str = None,
extra_args: list = [],
):
# Compiles the module given specs and saves it as .vmfb file.
flatbuffer_blob = compile_module_to_flatbuffer(
module, device, mlir_dialect, func_name, model_config_path
module, device, mlir_dialect, func_name, model_config_path, extra_args
)
module_name = f"{mlir_dialect}_{func_name}_{device}"
if module_name is None:
module_name = f"{mlir_dialect}_{func_name}_{device}"
filename = os.path.join(directory, module_name + ".vmfb")
print(f"Saved vmfb in {filename}.")
with open(filename, "wb") as f:

View File

@@ -14,12 +14,28 @@
# All the iree_vulkan related functionalities go here.
from os import linesep
from shark.iree_utils._common import run_cmd
def get_vulkan_triple_flag():
vulkan_device_cmd = "vulkaninfo | grep deviceName"
vulkan_device = run_cmd(vulkan_device_cmd).strip()
def get_vulkan_device_name():
vulkaninfo_dump = run_cmd("vulkaninfo").split(linesep)
vulkaninfo_list = [s.strip() for s in vulkaninfo_dump if "deviceName" in s]
if len(vulkaninfo_list) == 0:
raise ValueError("No device name found in VulkanInfo!")
if len(vulkaninfo_list) > 1:
print(
f"Found {len(vulkaninfo_list)} device names. choosing first one: {vulkaninfo_list[0]}"
)
return vulkaninfo_list[0]
def get_vulkan_triple_flag(extra_args=[]):
if "-iree-vulkan-target-triple=" in " ".join(extra_args):
print(f"Using target triple from command line args")
return None
vulkan_device = get_vulkan_device_name()
if all(x in vulkan_device for x in ("Apple", "M1")):
print(f"Found {vulkan_device} Device. Using m1-moltenvk-macos")
return "-iree-vulkan-target-triple=m1-moltenvk-macos"
@@ -32,15 +48,8 @@ def get_vulkan_triple_flag():
elif all(x in vulkan_device for x in ("RTX", "3090")):
print(f"Found {vulkan_device} Device. Using ampere-rtx3090-linux")
return "-iree-vulkan-target-triple=ampere-rtx3090-linux"
elif all(x in vulkan_device for x in ("Radeon", "RX 5")):
print(
"Found AMD Radeon RX 5000 series device. Using rdna1-5700xt-linux"
)
return "-iree-vulkan-target-triple=rdna1-5700xt-linux"
elif all(x in vulkan_device for x in ("Radeon", "RX 6")):
print(
"Found AMD Radeon RX 6000 series device. Using rdna2-unknown-linux"
)
elif "AMD" in vulkan_device:
print("Found AMD device. Using rdna2-unknown-linux")
return "-iree-vulkan-target-triple=rdna2-unknown-linux"
else:
print(
@@ -52,10 +61,10 @@ def get_vulkan_triple_flag():
return None
def get_iree_vulkan_args():
def get_iree_vulkan_args(extra_args=[]):
# vulkan_flag = ["--iree-flow-demote-i64-to-i32"]
vulkan_flag = []
vulkan_triple_flag = get_vulkan_triple_flag()
vulkan_triple_flag = get_vulkan_triple_flag(extra_args)
if vulkan_triple_flag is not None:
vulkan_flag.append(vulkan_triple_flag)
return vulkan_flag
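Below is a minimal sketch of how the new `extra_args` plumbing might be exercised; the explicit triple is only an illustrative value, and whether auto-detection succeeds depends on the local `vulkaninfo` output.
```python
from shark.iree_utils.vulkan_utils import get_iree_vulkan_args

# Auto-detect: queries vulkaninfo and appends a matching
# -iree-vulkan-target-triple=... flag when the device is recognized.
flags = get_iree_vulkan_args()

# User override (illustrative value): when the triple is already present in
# extra_args, get_vulkan_triple_flag() returns None, this call returns an
# empty list, and the user's flag reaches the compiler through the usual
# `args += extra_args` path in compile_module_to_flatbuffer.
flags = get_iree_vulkan_args(
    extra_args=["-iree-vulkan-target-triple=rdna2-unknown-linux"]
)
```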

View File

@@ -93,4 +93,16 @@ parser.add_argument(
help="Specify where to save downloaded shark_tank artifacts. If this is not set, the default is ~/.local/shark_tank/.",
)
parser.add_argument(
"--dispatch_benchmarks",
default=None,
help='Dispatches to return benchmark data on. Use "All" for all, and None for none.',
)
parser.add_argument(
"--dispatch_benchmarks_dir",
default="temp_dispatch_benchmarks",
help='Directory in which to store dispatch data generated with "--dispatch_benchmarks".',
)
shark_args, unknown = parser.parse_known_args()
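A short sketch of how these flags might be consumed, mirroring the parsing done in `compile_benchmark_dirs` earlier in this diff; the command line in the comment is hypothetical.
```python
# Hypothetical invocation:
#   python app.py --dispatch_benchmarks "0 3 7" --dispatch_benchmarks_dir bench_out
from shark.parser import shark_args

selection = shark_args.dispatch_benchmarks       # None, "All", or e.g. "0 3 7"
target_dir = shark_args.dispatch_benchmarks_dir  # e.g. "bench_out"

if selection is not None:
    if selection.lower().strip() == "all":
        print(f"Benchmarking every dispatch into {target_dir}")
    else:
        # Space-separated dispatch indices, as expected by compile_benchmark_dirs.
        indices = [int(i) for i in selection.split(" ")]
        print(f"Benchmarking dispatches {indices} into {target_dir}")
```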

View File

@@ -43,25 +43,34 @@ class SharkBenchmarkRunner(SharkRunner):
# SharkRunner derived class with Benchmarking capabilities.
def __init__(
self,
mlir_module: str,
mlir_module: bytes,
function_name: str = "forward",
device: str = "none",
mlir_dialect: str = "linalg",
extra_args: list = [],
):
self.device = shark_args.device if device == "none" else device
self.frontend_model = None
self.vmfb_file = None
self.mlir_dialect = mlir_dialect
self.extra_args = extra_args
SharkRunner.__init__(
self,
mlir_module,
function_name,
device,
self.mlir_dialect,
self.extra_args,
compile_vmfb=True,
)
if self.vmfb_file is None:
self.vmfb_file = export_iree_module_to_vmfb(
mlir_module, device, shark_args.repro_dir, self.mlir_dialect
mlir_module,
device,
shark_args.repro_dir,
self.mlir_dialect,
function_name,
extra_args=self.extra_args,
)
def setup_cl(self, input_tensors):

View File

@@ -137,7 +137,8 @@ def download_torch_model(
model_dir = os.path.join(WORKDIR, model_dir_name)
with open(
os.path.join(model_dir, model_name + dyn_str + "_torch.mlir")
os.path.join(model_dir, model_name + dyn_str + "_torch.mlir"),
mode="rb",
) as f:
mlir_file = f.read()
@@ -201,7 +202,8 @@ def download_tflite_model(
model_dir = os.path.join(WORKDIR, model_dir_name)
with open(
os.path.join(model_dir, model_name + dyn_str + "_tflite.mlir")
os.path.join(model_dir, model_name + dyn_str + "_tflite.mlir"),
mode="rb",
) as f:
mlir_file = f.read()
@@ -266,7 +268,7 @@ def download_tf_model(
if not os.path.isfile(filename):
filename = os.path.join(model_dir, model_name + "_tf.mlir")
with open(filename) as f:
with open(filename, mode="rb") as f:
mlir_file = f.read()
function_name = str(np.load(os.path.join(model_dir, "function_name.npy")))

View File

@@ -75,14 +75,17 @@ class SharkImporter:
self.module, self.inputs, is_dynamic, tracing_required
)
def _tf_mlir(self, func_name):
def _tf_mlir(self, func_name, save_dir="./shark_tmp/"):
from iree.compiler import tf as tfc
return tfc.compile_module(
self.module, exported_names=[func_name], import_only=True
self.module,
exported_names=[func_name],
import_only=True,
output_file=save_dir,
)
def _tflite_mlir(self, func_name):
def _tflite_mlir(self, func_name, save_dir="./shark_tmp/"):
from iree.compiler import tflite as tflitec
from shark.iree_utils._common import IREE_TARGET_MAP
@@ -90,6 +93,7 @@ class SharkImporter:
self.raw_model_file, # in tflite, it is a path to .tflite file, not a tflite interpreter
input_type="tosa",
import_only=True,
output_file=save_dir,
)
return self.mlir_model
@@ -99,6 +103,7 @@ class SharkImporter:
is_dynamic=False,
tracing_required=False,
func_name="forward",
save_dir="./shark_tmp/",
):
if self.frontend in ["torch", "pytorch"]:
if self.inputs == None:
@@ -108,15 +113,15 @@ class SharkImporter:
sys.exit(1)
return self._torch_mlir(is_dynamic, tracing_required), func_name
if self.frontend in ["tf", "tensorflow"]:
return self._tf_mlir(func_name), func_name
return self._tf_mlir(func_name, save_dir), func_name
if self.frontend in ["tflite", "tf-lite"]:
func_name = "main"
return self._tflite_mlir(func_name), func_name
return self._tflite_mlir(func_name, save_dir), func_name
# Converts the frontend specific tensors into np array.
def convert_to_numpy(self, array_tuple: tuple):
if self.frontend in ["torch", "pytorch"]:
return [x.detach().numpy() for x in array_tuple]
return [x.detach().cpu().numpy() for x in array_tuple]
if self.frontend in ["tf", "tensorflow"]:
return [x.numpy() for x in array_tuple]
@@ -130,19 +135,20 @@ class SharkImporter:
outputs_name = "golden_out.npz"
func_file_name = "function_name"
model_name_mlir = model_name + "_" + self.frontend + ".mlir"
try:
inputs = [x.cpu().detach() for x in inputs]
except AttributeError:
try:
inputs = [x.numpy() for x in inputs]
except AttributeError:
inputs = [x for x in inputs]
np.savez(os.path.join(dir, inputs_name), *inputs)
np.savez(os.path.join(dir, outputs_name), *outputs)
np.save(os.path.join(dir, func_file_name), np.array(func_name))
mlir_str = mlir_data
if self.frontend == "torch":
mlir_str = mlir_data.operation.get_asm()
elif self.frontend == "tf":
mlir_str = mlir_data.decode("utf-8")
elif self.frontend == "tflite":
mlir_str = mlir_data.decode("utf-8")
with open(os.path.join(dir, model_name_mlir), "w") as mlir_file:
mlir_file.write(mlir_str)
with open(os.path.join(dir, model_name_mlir), "wb") as mlir_file:
mlir_file.write(mlir_data)
return
@@ -159,9 +165,13 @@ class SharkImporter:
f"There is no input provided: {self.inputs}, please provide inputs or simply run import_mlir."
)
sys.exit(1)
model_name_mlir = model_name + "_" + self.frontend + ".mlir"
artifact_path = os.path.join(dir, model_name_mlir)
imported_mlir = self.import_mlir(
is_dynamic, tracing_required, func_name
is_dynamic,
tracing_required,
func_name,
save_dir=artifact_path,
)
# TODO: Make sure that any generic function name is accepted. Currently takes in the default function names.
# TODO: Check for multiple outputs.
@@ -171,7 +181,7 @@ class SharkImporter:
golden_out = self.module(*self.inputs)
if torch.is_tensor(golden_out):
golden_out = tuple(
golden_out.detach().numpy(),
golden_out.detach().cpu().numpy(),
)
else:
golden_out = self.convert_to_numpy(golden_out)

View File

@@ -12,6 +12,8 @@
from shark.iree_utils.compile_utils import (
export_iree_module_to_vmfb,
load_flatbuffer,
create_dispatch_dirs,
compile_benchmark_dirs,
)
import os
from shark.shark_runner import SharkRunner
@@ -37,7 +39,7 @@ class SharkInference:
Attributes
----------
mlir_module : str
mlir_module represented in string.
mlir_module represented as a string or bytes; modules from torch-mlir are serialized in the MLIR bytecode format.
function_name : str
function to execute in the given mlir_module.
device : str
@@ -63,21 +65,45 @@ class SharkInference:
def __init__(
self,
mlir_module: str,
mlir_module: bytes,
function_name: str = "forward",
device: str = "none",
mlir_dialect: str = "linalg",
is_benchmark: bool = False,
dispatch_benchmark: str = None,
dispatch_benchmark_dir: str = "temp_dispatch_benchmarks",
):
self.mlir_module = mlir_module
self.function_name = function_name
self.device = shark_args.device if device == "none" else device
self.mlir_dialect = mlir_dialect
self.is_benchmark = is_benchmark
self.dispatch_benchmarks = (
shark_args.dispatch_benchmarks
if dispatch_benchmark is None
else dispatch_benchmark
)
self.dispatch_benchmarks_dir = (
shark_args.dispatch_benchmarks_dir
if dispatch_benchmark_dir == "temp_dispatch_benchmarks"
else dispatch_benchmark_dir
)
self.shark_runner = None
def compile(self):
def compile(self, extra_args=[]):
if self.dispatch_benchmarks is not None:
extra_args.append(
f"--iree-hal-dump-executable-sources-to={self.dispatch_benchmarks_dir}"
)
temp_dir = self.dispatch_benchmarks_dir.split("/")
temp_dir[-1] = "temp_" + temp_dir[-1]
temp_dir = "/".join(temp_dir)
self.temp_dispatch_benchmarks_dir = temp_dir
extra_args.append(
f"--iree-hal-dump-executable-benchmarks-to={self.temp_dispatch_benchmarks_dir}"
)
if self.is_benchmark == True:
from shark.shark_benchmark_runner import SharkBenchmarkRunner
@@ -87,6 +113,7 @@ class SharkInference:
self.function_name,
self.device,
self.mlir_dialect,
extra_args=extra_args,
)
else:
@@ -95,8 +122,18 @@ class SharkInference:
self.function_name,
self.device,
self.mlir_dialect,
extra_args=extra_args,
)
if self.dispatch_benchmarks is not None:
create_dispatch_dirs(self.dispatch_benchmarks_dir, self.device)
compile_benchmark_dirs(
self.dispatch_benchmarks_dir,
self.device,
self.dispatch_benchmarks,
)
os.system(f"rm -rf {self.temp_dispatch_benchmarks_dir}")
# inputs are considered to be tuple of np.array.
def forward(self, inputs: tuple):
return self.shark_runner.run(inputs)
@@ -144,13 +181,15 @@ class SharkInference:
# TODO: Instead of passing directory and having names decided by the module
# , user may want to save the module with manual names.
def save_module(self, dir=os.getcwd()):
def save_module(self, dir=os.getcwd(), module_name=None, extra_args=[]):
return export_iree_module_to_vmfb(
self.mlir_module,
self.device,
dir,
self.mlir_dialect,
self.function_name,
module_name=module_name,
extra_args=extra_args,
)
# load and return the module.
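Putting the pieces together, here is a hedged end-to-end sketch of the dispatch-benchmarking path added in this file; the module bytes, entry function, and input shape below are placeholders rather than values taken from this repository.
```python
import numpy as np
from shark.shark_inference import SharkInference

# Placeholder: in practice this comes from get_torch_mlir_module() or the
# shark_tank downloader, both of which now hand back serialized MLIR bytes.
mlir_bytecode: bytes = ...

shark_module = SharkInference(
    mlir_bytecode,
    "forward",
    device="vulkan",
    mlir_dialect="tm_tensor",
    dispatch_benchmark="All",                # or a space-separated index list, e.g. "0 3"
    dispatch_benchmark_dir="my_dispatches",  # per-dispatch sources, .vmfb and _data.txt land here
)
# compile() also dumps executable sources/benchmarks and then runs
# create_dispatch_dirs / compile_benchmark_dirs when dispatch benchmarking is on.
shark_module.compile()
result = shark_module.forward((np.zeros((1, 16), dtype=np.int64),))  # placeholder input
```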

View File

@@ -25,7 +25,7 @@ import sys
# supported dialects by the shark-runtime.
supported_dialects = {"linalg", "mhlo", "tosa", "tf-lite"}
supported_dialects = {"linalg", "mhlo", "tosa", "tf-lite", "tm_tensor"}
class SharkRunner:
@@ -61,16 +61,18 @@ class SharkRunner:
def __init__(
self,
mlir_module: str = "none",
mlir_module: bytes = None,
function_name: str = "forward",
device: str = "none",
mlir_dialect: str = "linalg",
extra_args: list = [],
compile_vmfb: bool = True,
):
self.mlir_module = mlir_module
self.function_name = function_name
self.device = shark_args.device if device == "none" else device
self.mlir_dialect = mlir_dialect
self.extra_args = extra_args
if check_device_drivers(self.device):
device_driver_info(self.device)
@@ -86,6 +88,7 @@ class SharkRunner:
self.device,
self.mlir_dialect,
func_name=self.function_name,
extra_args=self.extra_args,
)
def run(self, inputs: tuple):

View File

@@ -17,6 +17,7 @@ import torch_mlir
from torch_mlir_e2e_test.linalg_on_tensors_backends import refbackend
import tempfile
from shark.parser import shark_args
import io
def get_module_name_for_asm_dump(module):
@@ -66,11 +67,14 @@ def get_torch_mlir_module(
tempfile.tempdir = shark_args.repro_dir
module = torch_mlir.compile(
mlir_module = torch_mlir.compile(
module,
input,
output_type=torch_mlir.OutputType.LINALG_ON_TENSORS,
use_tracing=jit_trace,
ignore_traced_shapes=ignore_traced_shapes,
)
return module
bytecode_stream = io.BytesIO()
mlir_module.operation.write_bytecode(bytecode_stream)
bytecode = bytecode_stream.getvalue()
return bytecode

View File

@@ -1,3 +1,211 @@
## Supported and Validated Models
### PyTorch HuggingFace Models
| PyTorch Language Models | Torch-MLIR lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| BERT | :green_heart: (JIT) | :green_heart: | :green_heart: | :green_heart: |
| Albert | :green_heart: (JIT) | :green_heart: | :green_heart: | :green_heart: |
| BigBird | :green_heart: (AOT) | | | |
| dbmdz/ConvBERT | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| DistilBERT | :broken_heart: (JIT) | | | |
| GPT2 | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| MobileBert | :green_heart: (JIT) | :green_heart: | :green_heart: | :green_heart: |
| microsoft/beit | :green_heart: | :green_heart: | :broken_heart: | :broken_heart: |
| facebook/deit | :green_heart: | :green_heart: | :broken_heart: | :broken_heart: |
| facebook/convnext | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
### Torchvision Models
| TORCHVISION Models | Torch-MLIR lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|--------------------|----------------------|----------|----------|-------------|
| AlexNet | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| MobileNetV2 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| MobileNetV3 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Unet | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Resnet18 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Resnet50 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Resnet101 | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| Resnext50_32x4d | :green_heart: (Script) | | | |
| SqueezeNet | :green_heart: (Script) | :green_heart: | :broken_heart: | :broken_heart: |
| EfficientNet | :green_heart: (Script) | | | |
| Regnet | :green_heart: (Script) | | | |
| Resnest | :broken_heart: (Script) | | | |
| Vision Transformer | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| VGG 16 | :green_heart: (Script) | :green_heart: | :green_heart: | |
| Wide Resnet | :green_heart: (Script) | :green_heart: | :green_heart: | :green_heart: |
| RAFT | :broken_heart: (JIT) | | | |
For more information, refer to the [MODEL TRACKING SHEET](https://docs.google.com/spreadsheets/d/15PcjKeHZIrB5LfDyuw7DGEEE8XnQEX2aX8lm8qbxV8A/edit#gid=0).
### Tensorflow Models (Inference)
| Hugging Face Models | tf-mhlo lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| BERT | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| MiniLM | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| albert-base-v2 | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| DistilBERT | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| CamemBert | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| ConvBert | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| Deberta | | | | |
| electra | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| funnel | | | | |
| layoutlm | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| longformer | | | | |
| mobile-bert | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| rembert | | | | |
| tapas | | | | |
| flaubert | :broken_heart: | :green_heart: | :green_heart: | :green_heart: |
| roberta | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| xlm-roberta | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
| mpnet | :green_heart: | :green_heart: | :green_heart: | :green_heart: |
### PyTorch Training Models
| Models | Torch-MLIR lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| BERT | :green_heart: | :green_heart: | | |
| FullyConnected | :green_heart: | :green_heart: | | |
### JAX Models
| Models | JAX-MHLO lowerable | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| DALL-E | :broken_heart: | :broken_heart: | | |
| FullyConnected | :green_heart: | :green_heart: | | |
<details>
<summary>TFLite Models</summary>
### TFLite Models
| Models | TOSA/LinAlg | SHARK-CPU | SHARK-CUDA | SHARK-METAL |
|---------------------|----------------------|----------|----------|-------------|
| BERT | :broken_heart: | :broken_heart: | | |
| FullyConnected | :green_heart: | :green_heart: | | |
| albert | :green_heart: | :green_heart: | | |
| asr_conformer | :green_heart: | :green_heart: | | |
| bird_classifier | :green_heart: | :green_heart: | | |
| cartoon_gan | :green_heart: | :green_heart: | | |
| craft_text | :green_heart: | :green_heart: | | |
| deeplab_v3 | :green_heart: | :green_heart: | | |
| densenet | :green_heart: | :green_heart: | | |
| east_text_detector | :green_heart: | :green_heart: | | |
| efficientnet_lite0_int8 | :green_heart: | :green_heart: | | |
| efficientnet | :green_heart: | :green_heart: | | |
| gpt2 | :green_heart: | :green_heart: | | |
| image_stylization | :green_heart: | :green_heart: | | |
| inception_v4 | :green_heart: | :green_heart: | | |
| inception_v4_uint8 | :green_heart: | :green_heart: | | |
| lightning_fp16 | :green_heart: | :green_heart: | | |
| lightning_i8 | :green_heart: | :green_heart: | | |
| lightning | :green_heart: | :green_heart: | | |
| magenta | :green_heart: | :green_heart: | | |
| midas | :green_heart: | :green_heart: | | |
| mirnet | :green_heart: | :green_heart: | | |
| mnasnet | :green_heart: | :green_heart: | | |
| mobilebert_edgetpu_s_float | :green_heart: | :green_heart: | | |
| mobilebert_edgetpu_s_quant | :green_heart: | :green_heart: | | |
| mobilebert | :green_heart: | :green_heart: | | |
| mobilebert_tf2_float | :green_heart: | :green_heart: | | |
| mobilebert_tf2_quant | :green_heart: | :green_heart: | | |
| mobilenet_ssd_quant | :green_heart: | :green_heart: | | |
| mobilenet_v1 | :green_heart: | :green_heart: | | |
| mobilenet_v1_uint8 | :green_heart: | :green_heart: | | |
| mobilenet_v2_int8 | :green_heart: | :green_heart: | | |
| mobilenet_v2 | :green_heart: | :green_heart: | | |
| mobilenet_v2_uint8 | :green_heart: | :green_heart: | | |
| mobilenet_v3-large | :green_heart: | :green_heart: | | |
| mobilenet_v3-large_uint8 | :green_heart: | :green_heart: | | |
| mobilenet_v35-int8 | :green_heart: | :green_heart: | | |
| nasnet | :green_heart: | :green_heart: | | |
| person_detect | :green_heart: | :green_heart: | | |
| posenet | :green_heart: | :green_heart: | | |
| resnet_50_int8 | :green_heart: | :green_heart: | | |
| rosetta | :green_heart: | :green_heart: | | |
| spice | :green_heart: | :green_heart: | | |
| squeezenet | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v1 | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v1_uint8 | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v2_fpnlite | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v2_fpnlite_uint8 | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v2_int8 | :green_heart: | :green_heart: | | |
| ssd_mobilenet_v2 | :green_heart: | :green_heart: | | |
| ssd_spaghettinet_large | :green_heart: | :green_heart: | | |
| ssd_spaghettinet_large_uint8 | :green_heart: | :green_heart: | | |
| visual_wake_words_i8 | :green_heart: | :green_heart: | | |
</details>
## Testing and Benchmarks
### Run all model tests on CPU/GPU/VULKAN/Metal
For a list of models included in our pytest model suite, see https://github.com/nod-ai/SHARK/blob/main/tank/all_models.csv
```shell
pytest tank/test_models.py
# Models included in the pytest suite are listed in all_models.csv.
# If on Linux for multithreading on CPU (faster results):
pytest tank/test_models.py -n auto
```
### Running specific tests
```shell
# Search for test cases by passing a keyword that matches all or part of the test case's name:
pytest tank/test_models.py -k "keyword"
# Test cases are named uniformly using the format test_module_<model_name_underscores_only>_<torch/tf>_<static/dynamic>_<device>.
# Example: Test all models on nvidia gpu:
pytest tank/test_models.py -k "cuda"
# Example: Test all tensorflow resnet models on Vulkan backend:
pytest tank/test_models.py -k "resnet and tf and vulkan"
# Exclude a test case:
pytest tank/test_models.py -k "not ..."
### Run benchmarks on SHARK tank pytests and generate bench_results.csv with results.
(the following requires source installation with `IMPORTER=1 ./setup_venv.sh`)
```shell
pytest --benchmark tank/test_models.py
# Just do static GPU benchmarks for PyTorch tests:
pytest --benchmark tank/test_models.py -k "pytorch and static and cuda"
```
### Benchmark Resnet50, MiniLM on CPU
(requires source installation with `IMPORTER=1 ./setup_venv.sh`)
```shell
# We suggest running the following commands as root before running benchmarks on CPU:
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | awk -F, '{print $2}' | sort -n | uniq | ( while read X ; do echo $X ; echo 0 > /sys/devices/system/cpu/cpu$X/online ; done )
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# Benchmark canonical Resnet50 on CPU via pytest
pytest --benchmark tank/test_models.py -k "resnet50 and tf_static_cpu"
# Benchmark canonical MiniLM on CPU via pytest
pytest --benchmark tank/test_models.py -k "MiniLM and cpu"
# Benchmark MiniLM on CPU via transformer-benchmarks:
git clone --recursive https://github.com/nod-ai/transformer-benchmarks.git
cd transformer-benchmarks
./perf-ci.sh -n
# Check detail.csv for MLIR/IREE results.
```
To run the fine tuning example, from the root SHARK directory, run:
```shell
@@ -11,3 +219,5 @@ if running from a google vm, you can view jupyter notebooks on your local system
gcloud compute ssh <YOUR_INSTANCE_DETAILS> --ssh-flag="-N -L localhost:8888:localhost:8888"
```

View File

@@ -0,0 +1,83 @@
#!/usr/bin/env python
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
import tensorflow as tf
from shark.shark_inference import SharkInference
from shark.parser import shark_args
import argparse
seq_parser = argparse.ArgumentParser(
description="Shark Sequence Classification."
)
seq_parser.add_argument(
"--hf_model_name",
type=str,
default="bert-base-uncased",
help="Hugging face model to run sequence classification.",
)
seq_args, unknown = seq_parser.parse_known_args()
BATCH_SIZE = 1
MAX_SEQUENCE_LENGTH = 16
# Create a set of input signature.
inputs_signature = [
tf.TensorSpec(shape=[BATCH_SIZE, MAX_SEQUENCE_LENGTH], dtype=tf.int32),
tf.TensorSpec(shape=[BATCH_SIZE, MAX_SEQUENCE_LENGTH], dtype=tf.int32),
]
# For supported models please see here:
# https://huggingface.co/docs/transformers/model_doc/auto#transformers.TFAutoModelForSequenceClassification
def preprocess_input(text="This is just used to compile the model"):
tokenizer = AutoTokenizer.from_pretrained(seq_args.hf_model_name)
inputs = tokenizer(
text,
padding="max_length",
return_tensors="tf",
truncation=True,
max_length=MAX_SEQUENCE_LENGTH,
)
return inputs
class SeqClassification(tf.Module):
def __init__(self, model_name):
super(SeqClassification, self).__init__()
self.m = TFAutoModelForSequenceClassification.from_pretrained(
model_name, output_attentions=False, num_labels=2
)
self.m.predict = lambda x, y: self.m(input_ids=x, attention_mask=y)[0]
@tf.function(input_signature=inputs_signature)
def forward(self, input_ids, attention_mask):
return tf.math.softmax(
self.m.predict(input_ids, attention_mask), axis=-1
)
if __name__ == "__main__":
inputs = preprocess_input()
shark_module = SharkInference(
SeqClassification(seq_args.hf_model_name),
(inputs["input_ids"], inputs["attention_mask"]),
)
shark_module.set_frontend("tensorflow")
shark_module.compile()
print(f"Model has been successfully compiled on {shark_args.device}")
while True:
input_text = input(
"Enter the text to classify (press q or nothing to exit): "
)
if not input_text or input_text == "q":
break
inputs = preprocess_input(input_text)
print(
shark_module.forward(
(inputs["input_ids"], inputs["attention_mask"])
)
)

View File

@@ -0,0 +1,3 @@
# Running Different OPT Variants
To run different sizes of OPT, change the `OPT_MODEL` string in `opt_torch_test.py`. The default is the 350m-parameter variant. Test cases for the 66b variant also exist in the file; simply uncomment them.
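For instance, a hypothetical edit to the constants defined near the top of `opt_torch_test.py` (the 1.3b name is just one of the published OPT checkpoints):
```python
# opt_torch_test.py
OPT_MODEL = "facebook/opt-1.3b"      # default is "facebook/opt-350m"
OPT_MODEL_66B = "facebook/opt-66b"   # used by the commented-out 66b test cases
```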

View File

@@ -0,0 +1,881 @@
# coding=utf-8
# Copyright 2022 The Fairseq Authors and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch OPT model."""
import random
from typing import List, Optional, Tuple, Union
import torch
import torch.utils.checkpoint
from torch import nn
from torch.nn import CrossEntropyLoss
from transformers import OPTConfig, PreTrainedModel
from transformers.activations import ACT2FN
from transformers.modeling_outputs import (
BaseModelOutputWithPast,
CausalLMOutputWithPast,
)
_CHECKPOINT_FOR_DOC = "facebook/opt-350m"
_CONFIG_FOR_DOC = "OPTConfig"
_TOKENIZER_FOR_DOC = "GPT2Tokenizer"
# Base model docstring
_EXPECTED_OUTPUT_SHAPE = [1, 8, 1024]
OPT_PRETRAINED_MODEL_ARCHIVE_LIST = [
"facebook/opt-125m",
"facebook/opt-350m",
"facebook/opt-1.3b",
"facebook/opt-2.7b",
"facebook/opt-6.7b",
"facebook/opt-13b",
"facebook/opt-30b",
# See all OPT models at https://huggingface.co/models?filter=opt
]
def _make_causal_mask(
input_ids_shape: torch.Size,
dtype: torch.dtype,
past_key_values_length: int = 0,
):
"""
Make causal mask used for bi-directional self-attention.
"""
bsz, tgt_len = input_ids_shape
mask = torch.full((tgt_len, tgt_len), float("-inf"))
mask_cond = torch.arange(int(mask.size(-1)))
mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
# mask = mask.to(dtype)
if past_key_values_length > 0:
mask = torch.cat(
[torch.zeros(tgt_len, past_key_values_length, dtype=dtype), mask],
dim=-1,
)
return mask[None, None, :, :].expand(
bsz, 1, tgt_len, tgt_len + past_key_values_length
)
def _expand_mask(
mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None
):
"""
Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
"""
bsz, src_len = map(int, mask.size())
tgt_len = tgt_len if tgt_len is not None else src_len
expanded_mask = (
mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
)
inverted_mask = 1.0 - expanded_mask
return inverted_mask.masked_fill(
inverted_mask.to(torch.bool), torch.finfo(dtype).min
)
# return inverted_mask.masked_fill(inverted_mask, torch.finfo(dtype).min)
class OPTLearnedPositionalEmbedding(nn.Embedding):
"""
This module learns positional embeddings up to a fixed maximum size.
"""
def __init__(self, num_embeddings: int, embedding_dim: int):
# OPT is set up so that if padding_idx is specified then offset the embedding ids by 2
# and adjust num_embeddings appropriately. Other models don't have this hack
self.offset = 2
super().__init__(num_embeddings + self.offset, embedding_dim)
def forward(
self,
attention_mask: torch.LongTensor,
past_key_values_length: int = 0,
):
"""`input_ids_shape` is expected to be [bsz x seqlen]."""
attention_mask = attention_mask.long()
# create positions depending on attention_mask
positions = (
torch.cumsum(attention_mask, dim=1).type_as(attention_mask)
* attention_mask
).long() - 1
# cut positions if `past_key_values_length` is > 0
positions = positions[:, past_key_values_length:]
return super().forward(positions + self.offset)
# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->OPT
class OPTAttention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
def __init__(
self,
embed_dim: int,
num_heads: int,
dropout: float = 0.0,
is_decoder: bool = False,
bias: bool = True,
):
super().__init__()
self.embed_dim = embed_dim
self.num_heads = num_heads
self.dropout = dropout
self.head_dim = embed_dim // num_heads
if (self.head_dim * num_heads) != self.embed_dim:
raise ValueError(
"embed_dim must be divisible by num_heads (got `embed_dim`:"
f" {self.embed_dim} and `num_heads`: {num_heads})."
)
self.scaling = self.head_dim**-0.5
self.is_decoder = is_decoder
self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
return (
tensor.view(bsz, seq_len, self.num_heads, self.head_dim)
.transpose(1, 2)
.contiguous()
)
def forward(
self,
hidden_states: torch.Tensor,
key_value_states: Optional[torch.Tensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
attention_mask: Optional[torch.Tensor] = None,
layer_head_mask: Optional[torch.Tensor] = None,
output_attentions: bool = False,
) -> Tuple[
torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]
]:
"""Input shape: Batch x Time x Channel"""
# if key_value_states are provided this layer is used as a cross-attention layer
# for the decoder
is_cross_attention = key_value_states is not None
# bsz, tgt_len, _ = map(int, hidden_states.size())
bsz, tgt_len, _ = hidden_states.size()
# get query proj
query_states = self.q_proj(hidden_states) * self.scaling
# get key, value proj
if is_cross_attention and past_key_value is not None:
# reuse k,v, cross_attentions
key_states = past_key_value[0]
value_states = past_key_value[1]
elif is_cross_attention:
# cross_attentions
key_states = self._shape(self.k_proj(key_value_states), -1, bsz)
value_states = self._shape(self.v_proj(key_value_states), -1, bsz)
elif past_key_value is not None:
# reuse k, v, self_attention
key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
else:
# self_attention
key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
if self.is_decoder:
# if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
# Further calls to cross_attention layer can then reuse all cross-attention
# key/value_states (first "if" case)
# if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
# all previous decoder key/value_states. Further calls to uni-directional self-attention
# can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
# if encoder bi-directional self-attention `past_key_value` is always `None`
past_key_value = (key_states, value_states)
proj_shape = (bsz * self.num_heads, -1, self.head_dim)
query_states = self._shape(query_states, tgt_len, bsz).view(
*proj_shape
)
key_states = key_states.view(*proj_shape)
value_states = value_states.view(*proj_shape)
src_len = key_states.size(1)
attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
raise ValueError(
"Attention weights should be of size"
f" {(bsz * self.num_heads, tgt_len, src_len)}, but is"
f" {attn_weights.size()}"
)
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, tgt_len, src_len):
raise ValueError(
"Attention mask should be of size"
f" {(bsz, 1, tgt_len, src_len)}, but is"
f" {attention_mask.size()}"
)
attn_weights = (
attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+ attention_mask
)
attn_weights = attn_weights.view(
bsz * self.num_heads, tgt_len, src_len
)
attn_weights = nn.functional.softmax(attn_weights, dim=-1)
if layer_head_mask is not None:
if layer_head_mask.size() != (self.num_heads,):
raise ValueError(
"Head mask for a single layer should be of size"
f" {(self.num_heads,)}, but is {layer_head_mask.size()}"
)
attn_weights = layer_head_mask.view(
1, -1, 1, 1
) * attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
attn_weights = attn_weights.view(
bsz * self.num_heads, tgt_len, src_len
)
if output_attentions:
# this operation is a bit awkward, but it's required to
# make sure that attn_weights keeps its gradient.
# In order to do so, attn_weights have to be reshaped
# twice and have to be reused in the following
attn_weights_reshaped = attn_weights.view(
bsz, self.num_heads, tgt_len, src_len
)
attn_weights = attn_weights_reshaped.view(
bsz * self.num_heads, tgt_len, src_len
)
else:
attn_weights_reshaped = None
attn_probs = nn.functional.dropout(
attn_weights, p=self.dropout, training=self.training
)
attn_output = torch.bmm(attn_probs, value_states)
if attn_output.size() != (
bsz * self.num_heads,
tgt_len,
self.head_dim,
):
raise ValueError(
"`attn_output` should be of size"
f" {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is"
f" {attn_output.size()}"
)
attn_output = attn_output.view(
bsz, self.num_heads, tgt_len, self.head_dim
)
attn_output = attn_output.transpose(1, 2)
# Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be
# partitioned across GPUs when using tensor-parallelism.
attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim)
attn_output = self.out_proj(attn_output)
return attn_output, attn_weights_reshaped, past_key_value
class OPTDecoderLayer(nn.Module):
def __init__(self, config: OPTConfig):
super().__init__()
self.embed_dim = config.hidden_size
self.self_attn = OPTAttention(
embed_dim=self.embed_dim,
num_heads=config.num_attention_heads,
dropout=config.attention_dropout,
is_decoder=True,
)
self.do_layer_norm_before = config.do_layer_norm_before
self.dropout = config.dropout
self.activation_fn = ACT2FN[config.activation_function]
self.activation_dropout = config.activation_dropout
self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.fc1 = nn.Linear(self.embed_dim, config.ffn_dim)
self.fc2 = nn.Linear(config.ffn_dim, self.embed_dim)
self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
layer_head_mask: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = False,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
) -> Tuple[
torch.FloatTensor,
Optional[Tuple[torch.FloatTensor, torch.FloatTensor]],
]:
# TODO: Refactor this function
residual = hidden_states
# 125m, 1.7B, ..., 175B applies layer norm BEFORE attention
if self.do_layer_norm_before:
hidden_states = self.self_attn_layer_norm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
hidden_states=hidden_states,
past_key_value=past_key_value,
attention_mask=attention_mask,
layer_head_mask=layer_head_mask,
output_attentions=output_attentions,
)
hidden_states = nn.functional.dropout(
hidden_states, p=self.dropout, training=self.training
)
hidden_states = residual + hidden_states
# 350m applies layer norm AFTER attention
if not self.do_layer_norm_before:
hidden_states = self.self_attn_layer_norm(hidden_states)
# Fully Connected
hidden_states_shape = hidden_states.shape
hidden_states = hidden_states.reshape(-1, hidden_states.size(-1))
residual = hidden_states
# 125m, 1.7B, ..., 175B applies layer norm BEFORE attention
if self.do_layer_norm_before:
hidden_states = self.final_layer_norm(hidden_states)
hidden_states = self.fc1(hidden_states)
hidden_states = self.activation_fn(hidden_states)
hidden_states = self.fc2(hidden_states)
hidden_states = nn.functional.dropout(
hidden_states, p=self.dropout, training=self.training
)
hidden_states = (residual + hidden_states).view(hidden_states_shape)
# 350m applies layer norm AFTER attention
if not self.do_layer_norm_before:
hidden_states = self.final_layer_norm(hidden_states)
outputs = (hidden_states,)
if output_attentions:
outputs += (self_attn_weights,)
if use_cache:
outputs += (present_key_value,)
return outputs
class OPTPreTrainedModel(PreTrainedModel):
config_class = OPTConfig
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["OPTDecoderLayer"]
_keys_to_ignore_on_load_unexpected = [r"decoder.version"]
def _init_weights(self, module):
std = self.config.init_std
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
def _set_gradient_checkpointing(self, module, value=False):
if isinstance(module, (OPTDecoder)):
module.gradient_checkpointing = value
class OPTDecoder(OPTPreTrainedModel):
def __init__(self, config: OPTConfig):
super().__init__(config)
self.dropout = config.dropout
self.layerdrop = config.layerdrop
self.padding_idx = config.pad_token_id
self.max_target_positions = config.max_position_embeddings
self.vocab_size = config.vocab_size
self.embed_tokens = nn.Embedding(
config.vocab_size, config.word_embed_proj_dim, self.padding_idx
)
self.embed_positions = OPTLearnedPositionalEmbedding(
config.max_position_embeddings, config.hidden_size
)
if config.word_embed_proj_dim != config.hidden_size:
self.project_out = nn.Linear(
config.hidden_size, config.word_embed_proj_dim, bias=False
)
else:
self.project_out = None
if config.word_embed_proj_dim != config.hidden_size:
self.project_in = nn.Linear(
config.word_embed_proj_dim, config.hidden_size, bias=False
)
else:
self.project_in = None
self.layer_norm = None
self.layers = nn.ModuleList(
[OPTDecoderLayer(config) for _ in range(config.num_hidden_layers)]
)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.embed_tokens
def set_input_embeddings(self, value):
self.embed_tokens = value
# Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
def _prepare_decoder_attention_mask(
self,
attention_mask,
input_shape,
inputs_embeds,
past_key_values_length,
):
# create causal mask
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
combined_attention_mask = None
if input_shape[-1] > 1:
combined_attention_mask = _make_causal_mask(
input_shape,
inputs_embeds.dtype,
past_key_values_length=past_key_values_length,
) # .to(inputs_embeds.device)
if attention_mask is not None:
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
expanded_attn_mask = _expand_mask(
attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
)
combined_attention_mask = (
expanded_attn_mask
if combined_attention_mask is None
else expanded_attn_mask + combined_attention_mask
)
return combined_attention_mask
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
head_mask: Optional[torch.Tensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
# TODO: Refactor this function
output_attentions = (
output_attentions
if output_attentions is not None
else self.config.output_attentions
)
output_hidden_states = (
output_hidden_states
if output_hidden_states is not None
else self.config.output_hidden_states
)
use_cache = (
use_cache if use_cache is not None else self.config.use_cache
)
return_dict = (
return_dict
if return_dict is not None
else self.config.use_return_dict
)
# retrieve input_ids and inputs_embeds
if input_ids is not None and inputs_embeds is not None:
raise ValueError(
"You cannot specify both decoder_input_ids and"
" decoder_inputs_embeds at the same time"
)
elif input_ids is not None:
input_shape = input_ids.size()
input_ids = input_ids.view(-1, input_shape[-1])
elif inputs_embeds is not None:
input_shape = inputs_embeds.size()[:-1]
else:
raise ValueError(
"You have to specify either decoder_input_ids or"
" decoder_inputs_embeds"
)
past_key_values_length = (
past_key_values[0][0].shape[2]
if past_key_values is not None
else 0
)
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids)
# embed positions
if attention_mask is None:
attention_mask = torch.ones(
inputs_embeds.shape[:2],
dtype=torch.bool,
device=inputs_embeds.device,
)
pos_embeds = self.embed_positions(
attention_mask, past_key_values_length
)
attention_mask = self._prepare_decoder_attention_mask(
attention_mask, input_shape, inputs_embeds, past_key_values_length
)
if self.project_in is not None:
inputs_embeds = self.project_in(inputs_embeds)
hidden_states = inputs_embeds + pos_embeds
hidden_states = nn.functional.dropout(
hidden_states, p=self.dropout, training=self.training
)
# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = () if use_cache else None
# check if head_mask has a correct number of layers specified if desired
for attn_mask, mask_name in zip([head_mask], ["head_mask"]):
if attn_mask is not None:
if attn_mask.size()[0] != (len(self.layers)):
raise ValueError(
f"The `{mask_name}` should be specified for"
f" {len(self.layers)} layers, but it is for"
f" {head_mask.size()[0]}."
)
for idx, decoder_layer in enumerate(self.layers):
# add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
if output_hidden_states:
all_hidden_states += (hidden_states,)
dropout_probability = random.uniform(0, 1)
if self.training and (dropout_probability < self.layerdrop):
continue
past_key_value = (
past_key_values[idx] if past_key_values is not None else None
)
if self.gradient_checkpointing and self.training:
if use_cache:
use_cache = False
def create_custom_forward(module):
def custom_forward(*inputs):
# None for past_key_value
return module(*inputs, output_attentions, None)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(decoder_layer),
hidden_states,
attention_mask,
head_mask[idx] if head_mask is not None else None,
None,
)
else:
layer_outputs = decoder_layer(
hidden_states,
attention_mask=attention_mask,
layer_head_mask=(
head_mask[idx] if head_mask is not None else None
),
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache += (
layer_outputs[2 if output_attentions else 1],
)
if output_attentions:
all_self_attns += (layer_outputs[1],)
if self.project_out is not None:
hidden_states = self.project_out(hidden_states)
# add hidden states from the last decoder layer
if output_hidden_states:
all_hidden_states += (hidden_states,)
next_cache = next_decoder_cache if use_cache else None
if not return_dict:
# TODO: This tuple needs to be a static list (of tensors)
# return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
return hidden_states
return BaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=next_cache,
hidden_states=all_hidden_states,
attentions=all_self_attns,
)
class OPTModel(OPTPreTrainedModel):
def __init__(self, config: OPTConfig):
super().__init__(config)
self.decoder = OPTDecoder(config)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.decoder.embed_tokens
def set_input_embeddings(self, value):
self.decoder.embed_tokens = value
def get_decoder(self):
return self.decoder
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
head_mask: Optional[torch.Tensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
output_attentions = (
output_attentions
if output_attentions is not None
else self.config.output_attentions
)
output_hidden_states = (
output_hidden_states
if output_hidden_states is not None
else self.config.output_hidden_states
)
use_cache = (
use_cache if use_cache is not None else self.config.use_cache
)
return_dict = (
return_dict
if return_dict is not None
else self.config.use_return_dict
)
# decoder outputs consists of (dec_features, past_key_value, dec_hidden, dec_attn)
decoder_outputs = self.decoder(
input_ids=input_ids,
attention_mask=attention_mask,
head_mask=head_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
# if not return_dict:
# return decoder_outputs
# return BaseModelOutputWithPast(
# last_hidden_state=decoder_outputs.last_hidden_state,
# past_key_values=decoder_outputs.past_key_values,
# hidden_states=decoder_outputs.hidden_states,
# attentions=decoder_outputs.attentions,
# )
return decoder_outputs.last_hidden_state
class OPTForCausalLM(OPTPreTrainedModel):
_keys_to_ignore_on_load_missing = [r"lm_head.weight"]
def __init__(self, config):
super().__init__(config)
self.model = OPTModel(config)
# the lm_head weight is automatically tied to the embed tokens weight
self.lm_head = nn.Linear(
config.word_embed_proj_dim, config.vocab_size, bias=False
)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.decoder.embed_tokens
def set_input_embeddings(self, value):
self.model.decoder.embed_tokens = value
def get_output_embeddings(self):
return self.lm_head
def set_output_embeddings(self, new_embeddings):
self.lm_head = new_embeddings
def set_decoder(self, decoder):
self.model.decoder = decoder
def get_decoder(self):
return self.model.decoder
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
head_mask: Optional[torch.Tensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
# TODO: Refactor this function
output_attentions = (
output_attentions
if output_attentions is not None
else self.config.output_attentions
)
output_hidden_states = (
output_hidden_states
if output_hidden_states is not None
else self.config.output_hidden_states
)
return_dict = (
return_dict
if return_dict is not None
else self.config.use_return_dict
)
# decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model.decoder(
input_ids=input_ids,
attention_mask=attention_mask,
head_mask=head_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
logits = self.lm_head(outputs[0]).contiguous()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
loss = loss_fct(
shift_logits.view(-1, self.config.vocab_size),
shift_labels.view(-1),
)
if not return_dict:
output = (logits,) + outputs[1:]
return (loss,) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def prepare_inputs_for_generation(
self,
input_ids,
past=None,
attention_mask=None,
use_cache=None,
**kwargs,
):
# if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
if attention_mask is None:
attention_mask = input_ids.new_ones(input_ids.shape)
if past:
input_ids = input_ids[:, -1:]
# first step, decoder_cached_states are empty
return {
"input_ids": input_ids, # encoder_outputs is defined. input_ids not needed
"attention_mask": attention_mask,
"past_key_values": past,
"use_cache": use_cache,
}
@staticmethod
def _reorder_cache(past, beam_idx):
reordered_past = ()
for layer_past in past:
reordered_past += (
tuple(
past_state.index_select(0, beam_idx)
for past_state in layer_past
),
)
return reordered_past

View File

@@ -0,0 +1,188 @@
import unittest
import pytest
import torch_mlir
from hacked_hf_opt import OPTModel
from shark.iree_utils._common import check_device_drivers, device_driver_info
from shark.shark_inference import SharkInference
from tank.model_utils import compare_tensors
from transformers import GPT2Tokenizer
OPT_MODEL = "facebook/opt-350m"
OPT_MODEL_66B = "facebook/opt-66b"
class OPTModuleTester:
def __init__(
self,
benchmark=False,
):
self.benchmark = benchmark
def create_and_check_module(self, dynamic, device, model_name):
# model_mlir, func_name, input, act_out = download_torch_model(
# "opt", dynamic
# )
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
# config = OPTConfig()
# opt_model = OPTModel(config)
opt_model = OPTModel.from_pretrained(model_name)
opt_model.eval()
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
input_ids, attention_mask = (
inputs.data["input_ids"],
inputs.data["attention_mask"],
)
module = torch_mlir.compile(
opt_model,
(input_ids, attention_mask),
output_type=torch_mlir.OutputType.LINALG_ON_TENSORS,
use_tracing=True,
)
model_mlir = module.operation.get_asm(
large_elements_limit=None, enable_debug_info=True
)
func_name = "forward"
act_out = opt_model(input_ids, attention_mask).detach()
# mlir_importer = SharkImporter(
# model,
# (input,),
# frontend="torch",
# )
# minilm_mlir, func_name = mlir_importer.import_mlir(
# is_dynamic=dynamic, tracing_required=True
# )
shark_module = SharkInference(
model_mlir,
func_name,
device=device,
mlir_dialect="tm_tensor",
is_benchmark=self.benchmark,
)
shark_module.compile()
results = shark_module.forward((input_ids, attention_mask))
assert compare_tensors(act_out, results)
if self.benchmark:
shark_module.shark_runner.benchmark_all_csv(
(input_ids, attention_mask),
"opt",
dynamic,
device,
"torch",
)
class OPTModuleTest(unittest.TestCase):
@pytest.fixture(autouse=True)
def configure(self, pytestconfig):
self.module_tester = OPTModuleTester(self)
self.module_tester.save_mlir = False
self.module_tester.save_vmfb = False
self.module_tester.benchmark = pytestconfig.getoption("benchmark")
def test_350m_static_cpu(self):
dynamic = False
device = "cpu"
self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
def test_350m_dynamic_cpu(self):
dynamic = True
device = "cpu"
self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
@pytest.mark.skipif(
check_device_drivers("cuda"), reason=device_driver_info("cuda")
)
def test_350m_static_cuda(self):
dynamic = False
device = "cuda"
self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
@pytest.mark.skipif(
check_device_drivers("cuda"), reason=device_driver_info("cuda")
)
def test_350m_dynamic_cuda(self):
dynamic = True
device = "cuda"
self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
@pytest.mark.skipif(
check_device_drivers("vulkan"), reason=device_driver_info("vulkan")
)
def test_350m_static_vulkan(self):
dynamic = False
device = "vulkan"
self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
@pytest.mark.skipif(
check_device_drivers("vulkan"), reason=device_driver_info("vulkan")
)
def test_350m_dynamic_vulkan(self):
dynamic = True
device = "vulkan"
self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
# def test_66b_static_cpu(self):
# dynamic = False
# device = "cpu"
# self.module_tester.create_and_check_module(
# dynamic, device, OPT_MODEL_66B
# )
# def test_66b_dynamic_cpu(self):
# dynamic = True
# device = "cpu"
# self.module_tester.create_and_check_module(
# dynamic, device, OPT_MODEL_66B
# )
# @pytest.mark.skipif(
# check_device_drivers("cuda"), reason=device_driver_info("cuda")
# )
# def test_66b_static_cuda(self):
# dynamic = False
# device = "cuda"
# self.module_tester.create_and_check_module(
# dynamic, device, OPT_MODEL_66B
# )
# @pytest.mark.skipif(
# check_device_drivers("cuda"), reason=device_driver_info("cuda")
# )
# def test_66b_dynamic_cuda(self):
# dynamic = True
# device = "cuda"
# self.module_tester.create_and_check_module(
# dynamic, device, OPT_MODEL_66B
# )
# @pytest.mark.skipif(
# check_device_drivers("vulkan"), reason=device_driver_info("vulkan")
# )
# def test_66b_static_vulkan(self):
# dynamic = False
# device = "vulkan"
# self.module_tester.create_and_check_module(
# dynamic, device, OPT_MODEL_66B
# )
# @pytest.mark.skipif(
# check_device_drivers("vulkan"), reason=device_driver_info("vulkan")
# )
# def test_66b_dynamic_vulkan(self):
# dynamic = True
# device = "vulkan"
# self.module_tester.create_and_check_module(
# dynamic, device, OPT_MODEL_66B
# )
if __name__ == "__main__":
unittest.main()

View File

@@ -131,6 +131,7 @@ class SharkModuleTester:
def create_and_check_module(self, dynamic, device):
shark_args.local_tank_cache = self.local_tank_cache
shark_args.update_tank = self.update_tank
if self.config["framework"] == "tf":
model, func_name, inputs, golden_out = download_tf_model(
self.config["model_name"],
@@ -266,6 +267,9 @@ class SharkModuleTest(unittest.TestCase):
self.module_tester.local_tank_cache = self.pytestconfig.getoption(
"local_tank_cache"
)
self.module_tester.update_tank = self.pytestconfig.getoption(
"update_tank"
)
self.module_tester.tank_url = self.pytestconfig.getoption("tank_url")
if (
config["model_name"] == "distilbert-base-uncased"
@@ -311,6 +315,7 @@ class SharkModuleTest(unittest.TestCase):
reason="https://github.com/nod-ai/SHARK/issues/311, https://github.com/nod-ai/SHARK/issues/342"
)
if config["model_name"] == "funnel-transformer/small" and device in [
"cpu",
"cuda",
"metal",
"vulkan",
@@ -348,14 +353,55 @@ class SharkModuleTest(unittest.TestCase):
and device == "cuda"
):
pytest.xfail(reason="https://github.com/nod-ai/SHARK/issues/390")
if config["model_name"] == "squeezenet1_0" and device == "vulkan":
if config["model_name"] == "squeezenet1_0" and device in [
"cpu",
"metal",
"vulkan",
]:
pytest.xfail(
reason="Numerics Issues: https://github.com/nod-ai/SHARK/issues/388"
)
if config["model_name"] == "mobilenet_v3_small" and device == "vulkan":
if config["model_name"] == "mobilenet_v3_small" and device in [
"metal",
"vulkan",
]:
pytest.xfail(
reason="Numerics Issues: https://github.com/nod-ai/SHARK/issues/388"
)
if config["model_name"] == "hf-internal-testing/tiny-random-flaubert":
pytest.xfail(reason="Transformers API mismatch")
if config["model_name"] == "alexnet" and device in ["metal", "vulkan"]:
pytest.xfail(reason="Assertion Error: Zeros Output")
if (
config["model_name"] == "camembert-base"
and dynamic == False
and device in ["metal", "vulkan"]
):
pytest.xfail(
reason="chlo.broadcast_compare failed to satify constraint"
)
if (
config["model_name"] == "roberta-base"
and dynamic == False
and device in ["metal", "vulkan"]
):
pytest.xfail(
reason="chlo.broadcast_compare failed to satify constraint"
)
if config["model_name"] in [
"microsoft/MiniLM-L12-H384-uncased",
"wide_resnet50_2",
"resnet50",
"resnet18",
"resnet101",
"microsoft/resnet-50",
] and device in ["metal", "vulkan"]:
pytest.xfail(reason="Vulkan Numerical Error (mostly conv)")
if config["model_name"] == "mobilenet_v3_small" and device in [
"cuda",
"cpu",
]:
pytest.xfail(reason="https://github.com/nod-ai/SHARK/issues/424")
if config["framework"] == "tf" and dynamic == True:
pytest.skip(
reason="Dynamic shapes not supported for this framework."

View File

@@ -1,15 +0,0 @@
## Running SharkInference on CPUs, GPUs and MAC.
### Run the binary sequence_classification.
#### The models supported are: [hugging face sequence classification](https://huggingface.co/docs/transformers/model_doc/auto#transformers.TFAutoModelForSequenceClassification)
```shell
./seq_classification.py --hf_model_name="hf_model" --device="cpu" # Use gpu | vulkan
```
Once the model is compiled to run on the device mentioned, we can pass in text and
get the logits.

View File

@@ -5,7 +5,6 @@ camembert-base,hf
dbmdz/convbert-base-turkish-cased,hf
distilbert-base-uncased,hf
google/electra-small-discriminator,hf
hf-internal-testing/tiny-random-flaubert,hf
funnel-transformer/small,hf
microsoft/layoutlm-base-uncased,hf
google/mobilebert-uncased,hf

Binary file not shown.


View File

@@ -1,13 +1,16 @@
In order to launch SHARK-web, from the root SHARK directory, run:
## Linux
```shell
IMPORTER=1 ./setup_venv.sh
source shark.venv/bin/activate
pip install diffusers scipy
cd web
wget -O stable_diffusion.mlir https://storage.googleapis.com/shark_tank/prashant_nod/stable_diff/stable_diff_torch.mlir
python index.py
```
This will launch a Gradio server and print a public URL like:
Running on public URL: https://xxxxx.gradio.app
## Windows
```shell
./setup_venv.ps1
cd web
python index.py --local_tank_cache=<current_working_dir>
```

View File

@@ -1,142 +1,241 @@
from models.resnet50 import resnet_inf
from models.albert_maskfill import albert_maskfill_inf
from models.stable_diffusion import stable_diff_inf
# from models.diffusion.v_diffusion import vdiff_inf
import gradio as gr
from PIL import Image
def debug_event(debug):
return gr.Textbox.update(visible=debug)
with gr.Blocks() as shark_web:
with gr.Row():
with gr.Group():
with gr.Column(scale=1):
img = Image.open("./Nod_logo.jpg")
gr.Image(value=img, show_label=False, interactive=False).style(
height=70, width=70
)
with gr.Column(scale=9):
gr.Label(value="Shark Models Demo.")
with gr.Tabs():
with gr.TabItem("ResNet50"):
image = device = debug = resnet = output = std_output = None
with gr.Row():
with gr.Column(scale=1, min_width=600):
image = gr.Image(label="Image")
device = gr.Textbox(label="Device", value="cpu")
debug = gr.Checkbox(label="DEBUG", value=False)
resnet = gr.Button("Recognize Image").style(
full_width=True
)
with gr.Column(scale=1, min_width=600):
output = gr.Label(label="Output")
std_output = gr.Textbox(
label="Std Output",
value="Nothing to show.",
visible=False,
)
debug.change(
debug_event,
inputs=[debug],
outputs=[std_output],
show_progress=False,
)
resnet.click(
resnet_inf,
inputs=[image, device],
outputs=[output, std_output],
)
with gr.TabItem("Albert MaskFill"):
masked_text = device = debug = albert_mask = decoded_res = std_output = None
with gr.Row():
with gr.Column(scale=1, min_width=600):
masked_text = gr.Textbox(
label="Masked Text",
placeholder="Give me a sentence with [MASK] to fill",
)
device = gr.Textbox(label="Device", value="cpu")
debug = gr.Checkbox(label="DEBUG", value=False)
albert_mask = gr.Button("Decode Mask")
with gr.Column(scale=1, min_width=600):
decoded_res = gr.Label(label="Decoded Results")
std_output = gr.Textbox(
label="Std Output",
value="Nothing to show.",
visible=False,
)
debug.change(
debug_event,
inputs=[debug],
outputs=[std_output],
show_progress=False,
)
albert_mask.click(
albert_maskfill_inf,
inputs=[masked_text, device],
outputs=[decoded_res, std_output],
)
# with gr.TabItem("V-Diffusion"):
# prompt = sample_count = batch_size = iters = device = v_diffusion = generated_img = None
# with gr.Row():
# with gr.Column(scale=1, min_width=600):
# prompt = gr.Textbox(
# label="Prompt", value="New York City, oil on canvas:5"
# )
# sample_count = gr.Number(label="Sample Count", value=1)
# batch_size = gr.Number(label="Batch Size", value=1)
# iters = gr.Number(label="Steps", value=2)
# device = gr.Textbox(label="Device", value="gpu")
# v_diffusion = gr.Button("Generate image from prompt")
# with gr.Column(scale=1, min_width=600):
# generated_img = gr.Image(type="pil", shape=(100, 100))
# std_output = gr.Textbox(label="Std Output", value="Nothing.")
# v_diffusion.click(
# vdiff_inf,
# inputs=[prompt, sample_count, batch_size, iters, device],
# outputs=[generated_img, std_output]
# )
with gr.TabItem("Stable-Diffusion"):
prompt = iters = device = debug = stable_diffusion = generated_img = std_output = None
with gr.Row():
with gr.Column(scale=1, min_width=600):
prompt = gr.Textbox(
label="Prompt",
value="a photograph of an astronaut riding a horse",
)
iters = gr.Number(label="Steps", value=2)
device = gr.Textbox(label="Device", value="vulkan")
debug = gr.Checkbox(label="DEBUG", value=False)
stable_diffusion = gr.Button("Generate image from prompt")
with gr.Column(scale=1, min_width=600):
generated_img = gr.Image(type="pil", shape=(100, 100))
std_output = gr.Textbox(
label="Std Output", value="Nothing.", visible=False
)
debug.change(
debug_event,
inputs=[debug],
outputs=[std_output],
show_progress=False,
)
stable_diffusion.click(
stable_diff_inf,
inputs=[prompt, iters, device],
outputs=[generated_img, std_output],
)
shark_web.launch(share=True, server_port=8080, enable_queue=True)
# from models.resnet50 import resnet_inf
# from models.albert_maskfill import albert_maskfill_inf
from models.stable_diffusion.main import stable_diff_inf
# from models.diffusion.v_diffusion import vdiff_inf
import gradio as gr
from PIL import Image
import json
import os
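# Show or hide the std-output textbox when the DEBUG checkbox is toggled.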
def debug_event(debug):
return gr.Textbox.update(visible=debug)
prompt_examples = []
prompt_loc = "./prompts.json"
if os.path.exists(prompt_loc):
with open("./prompts.json", encoding="utf-8") as fopen:
prompt_examples = json.load(fopen)
demo_css = """
.gradio-container {background-color: black}
.container {background-color: black !important; padding-top:20px !important; }
#ui_title {padding: 10px !important; }
#top_logo {background-color: transparent; border-radius: 0 !important; border: 0; }
#demo_title {background-color: black; border-radius: 0 !important; border: 0; padding-top: 50px; padding-bottom: 0px; width: 460px !important;}
#demo_title_outer {border-radius: 0; }
#prompt_box_outer div:first-child {border-radius: 0 !important}
#prompt_box textarea {background-color:#1d1d1d !important}
#prompt_examples {margin:0 !important}
#prompt_examples svg {display: none !important;}
.gr-sample-textbox { border-radius: 1rem !important; border-color: rgb(31,41,55) !important; border-width:2px !important; }
#ui_body {background-color: #111111 !important; padding: 10px !important; border-radius: 0.5em !important;}
#img_result+div {display: none !important;}
footer {display: none !important;}
"""
with gr.Blocks(css=demo_css) as shark_web:
# Title row with the demo logo.
with gr.Row(elem_id="ui_title"):
with gr.Column(scale=1, elem_id="demo_title_outer"):
logo2 = Image.open("./logos/sd-demo-logo.png")
gr.Image(
value=logo2,
show_label=False,
interactive=False,
elem_id="demo_title",
).style(width=230)
# with gr.Column(scale=1):
# gr.Label(value="Ultra fast Stable Diffusion")
with gr.Row(elem_id="ui_body"):
prompt = scheduler = iters_count = batch_size = steps = guidance = None
height = width = seed = precision = device = cache = None
iree_vulkan_target_triple = live_preview = debug = save_img = None
stable_diffusion = generated_img = std_output = None
# Prompt input, examples, and generation controls.
with gr.Row():
with gr.Column(scale=1, min_width=600):
with gr.Group(elem_id="prompt_box_outer"):
prompt = gr.Textbox(
label="Prompt",
value="A photograph of an astronaut riding a horse",
lines=1,
elem_id="prompt_box",
)
with gr.Group():
ex = gr.Examples(
label="Examples",
examples=prompt_examples,
inputs=prompt,
cache_examples=False,
elem_id="prompt_examples",
)
with gr.Row():
steps = gr.Slider(1, 100, value=50, step=1, label="Steps")
guidance = gr.Slider(
0,
50,
value=7.5,
step=0.1,
label="Guidance Scale",
interactive=False,
)
with gr.Row():
height = gr.Slider(
384,
768,
value=512,
step=64,
label="Height",
interactive=False,
)
width = gr.Slider(
384,
768,
value=512,
step=64,
label="Width",
interactive=False,
)
with gr.Row():
precision = gr.Radio(
label="Precision",
value="fp16",
choices=["fp16", "fp32"],
)
seed = gr.Textbox(value="42", max_lines=1, label="Seed")
with gr.Row():
cache = gr.Checkbox(label="Cache", value=True)
# debug = gr.Checkbox(label="DEBUG", value=False)
save_img = gr.Checkbox(label="Save Image", value=False)
live_preview = gr.Checkbox(
label="Live Preview", value=False
)
# Hidden Items.
scheduler = gr.Radio(
label="Scheduler",
value="LMS",
choices=["PNDM", "LMS", "DDIM"],
interactive=False,
visible=False,
)
device = gr.Radio(
label="Device",
value="vulkan",
choices=["cpu", "cuda", "vulkan"],
interactive=False,
visible=False,
elem_id="ugly_line",
)
iters_count = gr.Slider(
1,
24,
value=1,
step=1,
label="Iteration Count",
visible=False,
)
batch_size = gr.Slider(
1,
4,
value=1,
step=1,
label="Batch Size",
visible=False,
)
iree_vulkan_target_triple = gr.Textbox(
value="",
max_lines=1,
label="IREE VULKAN TARGET TRIPLE",
visible=False,
elem_id="ugly_line",
)
stable_diffusion = gr.Button("Generate Image")
# logo
nod_logo = Image.open("./logos/amd-nod-logo.png")
gr.Image(
value=nod_logo,
show_label=False,
interactive=False,
elem_id="top_logo",
).style(width=230)
with gr.Column(scale=1, min_width=600):
generated_img = gr.Image(
type="pil", elem_id="img_result", interactive=False
).style(height=768, width=768)
std_output = gr.Textbox(
label="Std Output",
value="Nothing.",
lines=5,
visible=False,
elem_id="ugly_line",
)
"""
debug.change(
debug_event,
inputs=[debug],
outputs=[std_output],
show_progress=False,
)
"""
stable_diffusion.click(
stable_diff_inf,
inputs=[
prompt,
scheduler,
iters_count,
batch_size,
steps,
guidance,
height,
width,
seed,
precision,
device,
cache,
iree_vulkan_target_triple,
live_preview,
save_img,
],
outputs=[generated_img, std_output],
show_progress=False,
)
shark_web.queue()
shark_web.launch(server_name="0.0.0.0", server_port=8080, enable_queue=True)

BIN
web/logos/Nod_logo.png Normal file

Binary file not shown.


BIN
web/logos/amd-nod-logo.png Normal file

Binary file not shown.


BIN
web/logos/other_logo.png Normal file

Binary file not shown.


BIN
web/logos/sd-demo-logo.png Normal file

Binary file not shown.


View File

@@ -23,7 +23,7 @@ def load_mlir(mlir_loc):
return mlir_module
def compile_through_fx(model, inputs, device, mlir_loc=None):
def compile_through_fx(model, inputs, device, mlir_loc=None, extra_args=[]):
module = load_mlir(mlir_loc)
if mlir_loc == None:
@@ -74,9 +74,12 @@ def compile_through_fx(model, inputs, device, mlir_loc=None):
func_name = "forward"
shark_module = SharkInference(
mlir_model, func_name, device=device, mlir_dialect="tm_tensor"
mlir_model,
func_name,
device=device,
mlir_dialect="tm_tensor",
)
shark_module.compile()
shark_module.compile(extra_args)
return shark_module
@@ -150,6 +153,7 @@ def stable_diff_inf(prompt: str, steps, device: str):
(latent_model_input, torch.tensor([1.0]), text_embeddings),
args["device"],
args["mlir_loc"],
["--iree-flow-enable-conv-nchw-to-nhwc-transform"],
)
compiled_module[args["device"]] = shark_unet
if DEBUG:

View File

@@ -0,0 +1,365 @@
from transformers import CLIPTextModel, CLIPTokenizer
import torch
from PIL import Image
from diffusers import DDIMScheduler, LMSDiscreteScheduler, PNDMScheduler
from tqdm.auto import tqdm
import numpy as np
from models.stable_diffusion.model_wrappers import (
get_vae32,
get_vae16,
get_unet16_wrapped,
get_unet32_wrapped,
)
from models.stable_diffusion.utils import get_shark_model
import time
import os
GCLOUD_BUCKET = "gs://shark_tank/prashant_nod"
VAE_FP16 = "vae_fp16"
VAE_FP32 = "vae_fp32"
UNET_FP16 = "unet_fp16"
UNET_FP32 = "unet_fp32"
TUNED_GCLOUD_BUCKET = "gs://shark_tank/quinn"
UNET_FP16_TUNED = "unet_fp16_tunedv2"
args = None
class Arguments:
def __init__(
self,
prompt: str,
scheduler: str,
iteration_count: int,
batch_size: int,
steps: int,
guidance: float,
height: int,
width: int,
seed: int,
precision: str,
device: str,
cache: bool,
iree_vulkan_target_triple: str,
live_preview: bool,
save_img: bool,
import_mlir: bool = False,
max_length: int = 77,
use_tuned: bool = True,
):
self.prompt = prompt
self.scheduler = scheduler
self.iteration_count = iteration_count
self.batch_size = batch_size
self.steps = steps
self.guidance = guidance
self.height = height
self.width = width
self.seed = seed
self.precision = precision
self.device = device
self.cache = cache
self.iree_vulkan_target_triple = iree_vulkan_target_triple
self.live_preview = live_preview
self.save_img = save_img
self.import_mlir = import_mlir
self.max_length = max_length
self.use_tuned = use_tuned
def get_models():
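# Select the gcloud bucket, artifact names, and IREE flags based on precision and whether the tuned UNet is used, then fetch (or import) the VAE and UNet modules.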
global args
IREE_EXTRA_ARGS = []
if args.precision == "fp16":
IREE_EXTRA_ARGS += [
"--iree-flow-enable-padding-linalg-ops",
"--iree-flow-linalg-ops-padding-size=32",
]
if args.use_tuned:
unet_gcloud_bucket = TUNED_GCLOUD_BUCKET
vae_gcloud_bucket = GCLOUD_BUCKET
unet_args = IREE_EXTRA_ARGS
vae_args = IREE_EXTRA_ARGS + [
"--iree-flow-enable-conv-nchw-to-nhwc-transform"
]
unet_name = UNET_FP16_TUNED
vae_name = VAE_FP16
else:
unet_gcloud_bucket = GCLOUD_BUCKET
vae_gcloud_bucket = GCLOUD_BUCKET
IREE_EXTRA_ARGS += [
"--iree-flow-enable-conv-nchw-to-nhwc-transform"
]
unet_args = IREE_EXTRA_ARGS
vae_args = IREE_EXTRA_ARGS
unet_name = UNET_FP16
vae_name = VAE_FP16
if args.import_mlir == True:
return get_vae16(args, model_name=VAE_FP16), get_unet16_wrapped(
args, model_name=UNET_FP16
)
else:
return get_shark_model(
args,
vae_gcloud_bucket,
vae_name,
vae_args,
), get_shark_model(
args,
unet_gcloud_bucket,
unet_name,
unet_args,
)
elif args.precision == "fp32":
IREE_EXTRA_ARGS += [
"--iree-flow-enable-conv-nchw-to-nhwc-transform",
"--iree-flow-enable-padding-linalg-ops",
"--iree-flow-linalg-ops-padding-size=16",
]
if args.import_mlir == True:
return get_vae32(args, model_name=VAE_FP32), get_unet32_wrapped(
args, model_name=UNET_FP32
)
else:
return get_shark_model(
args,
GCLOUD_BUCKET,
VAE_FP32,
IREE_EXTRA_ARGS,
), get_shark_model(
args,
GCLOUD_BUCKET,
UNET_FP32,
IREE_EXTRA_ARGS,
)
schedulers = dict()
# Pre-build the supported noise schedulers.
schedulers["PNDM"] = PNDMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
num_train_timesteps=1000,
)
schedulers["LMS"] = LMSDiscreteScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
num_train_timesteps=1000,
)
schedulers["DDIM"] = DDIMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False,
set_alpha_to_one=False,
)
cache_obj = dict()
# cache tokenizer and text_encoder
cache_obj["tokenizer"] = CLIPTokenizer.from_pretrained(
"openai/clip-vit-large-patch14"
)
cache_obj["text_encoder"] = CLIPTextModel.from_pretrained(
"openai/clip-vit-large-patch14"
)
# cache vae and unet.
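# Pre-compile the vulkan VAE/UNet (fp16 and fp32) at server startup with a default set of arguments so later requests can reuse them.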
args = Arguments(
prompt="load unet/vmfb",
scheduler="LMS",
iteration_count=1,
batch_size=1,
steps=50,
guidance=7.5,
height=512,
width=512,
seed=42,
precision="fp16",
device="vulkan",
cache=True,
iree_vulkan_target_triple="",
live_preview=False,
save_img=False,
import_mlir=False,
max_length=77,
use_tuned=True,
)
cache_obj["vae_fp16_vulkan"], cache_obj["unet_fp16_vulkan"] = get_models()
args.precision = "fp32"
cache_obj["vae_fp32_vulkan"], cache_obj["unet_fp32_vulkan"] = get_models()
output_dir = "./stored_results/stable_diffusion"
os.makedirs(output_dir, exist_ok=True)
def stable_diff_inf(
prompt: str,
scheduler: str,
iteration_count: int,
batch_size: int,
steps: int,
guidance: float,
height: int,
width: int,
seed: str,
precision: str,
device: str,
cache: bool,
iree_vulkan_target_triple: str,
live_preview: bool,
save_img: bool,
):
global args
global schedulers
global cache_obj
global output_dir
start = time.time()
# set seed value
if seed == "":
seed = int(torch.randint(low=25, high=100, size=()))
else:
try:
seed = int(seed)
if seed < 0 or seed > 10000:
seed = hash(seed)
except (ValueError, OverflowError) as error:
seed = hash(seed)
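# Non-numeric seeds (and numeric seeds outside [0, 10000]) fall back to Python's hash().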
scheduler = schedulers[scheduler]
args = Arguments(
prompt,
scheduler,
iteration_count,
batch_size,
steps,
guidance,
height,
width,
seed,
precision,
device,
cache,
iree_vulkan_target_triple,
live_preview,
save_img,
)
dtype = torch.float32 if args.precision == "fp32" else torch.half
num_inference_steps = int(args.steps) # Number of denoising steps
generator = torch.manual_seed(
args.seed
) # Seed generator to create the initial latent noise
# Initialize vae and unet models.
is_model_initialized = False
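# Reuse the vulkan modules compiled at server startup when caching is enabled and the tuned, non-imported path is selected; otherwise compile on demand.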
if (
args.cache
and args.use_tuned
and args.device == "vulkan"
and not args.import_mlir
):
vae_key = f"vae_{args.precision}_vulkan"
unet_key = f"unet_{args.precision}_vulkan"
cached_keys = cache_obj.keys()
if vae_key in cached_keys and unet_key in cached_keys:
vae, unet = cache_obj[vae_key], cache_obj[unet_key]
is_model_initialized = True
if not is_model_initialized:
vae, unet = get_models()
tokenizer = cache_obj["tokenizer"]
text_encoder = cache_obj["text_encoder"]
text_input = tokenizer(
[args.prompt],
padding="max_length",
max_length=args.max_length,
truncation=True,
return_tensors="pt",
)
text_embeddings = text_encoder(text_input.input_ids)[0].to(dtype)
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
[""] * batch_size,
padding="max_length",
max_length=max_length,
return_tensors="pt",
)
uncond_embeddings = text_encoder(uncond_input.input_ids)[0].to(dtype)
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
latents = torch.randn(
(batch_size, 4, args.height // 8, args.width // 8),
generator=generator,
dtype=torch.float32,
).to(dtype)
scheduler.set_timesteps(num_inference_steps)
scheduler.is_scale_input_called = True
latents = latents * scheduler.sigmas[0]
text_embeddings_numpy = text_embeddings.detach().numpy()
avg_ms = 0
out_img = None
text_output = ""
for i, t in tqdm(enumerate(scheduler.timesteps)):
text_output += f"\n Iteration = {i} | Timestep = {t} | "
step_start = time.time()
timestep = torch.tensor([t]).to(dtype).detach().numpy()
latents_numpy = latents.detach().numpy()
sigma_numpy = np.array(scheduler.sigmas[i]).astype(np.float32)
noise_pred = unet.forward(
(latents_numpy, timestep, text_embeddings_numpy, sigma_numpy)
)
noise_pred = torch.from_numpy(noise_pred)
step_time = time.time() - step_start
avg_ms += step_time
step_ms = int((step_time) * 1000)
text_output += f"Time = {step_ms}ms."
latents = scheduler.step(noise_pred, i, latents)["prev_sample"]
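# Decode the intermediate latents through the VAE every 5 steps to feed the live preview.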
if live_preview and i % 5 == 0:
scaled_latents = 1 / 0.18215 * latents
latents_numpy = scaled_latents.detach().numpy()
image = vae.forward((latents_numpy,))
image = torch.from_numpy(image)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
out_img = pil_images[0]
yield out_img, text_output
# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
latents_numpy = latents.detach().numpy()
image = vae.forward((latents_numpy,))
image = torch.from_numpy(image)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
out_img = pil_images[0]
avg_ms = 1000 * avg_ms / args.steps
text_output += f"\n\nAverage step time: {avg_ms}ms/it"
total_time = time.time() - start
text_output += f"\n\nTotal image generation time: {total_time}sec"
if args.save_img:
# save outputs.
output_loc = f"{output_dir}/{time.time()}_{int(args.steps)}_{args.precision}_{args.device}.jpg"
out_img.save(os.path.join(output_loc))
yield out_img, text_output

View File

@@ -0,0 +1,201 @@
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from models.stable_diffusion.utils import compile_through_fx
import torch
YOUR_TOKEN = "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"
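# Each getter below wraps a diffusers module in a small nn.Module with a tensor-only forward and compiles it through FX into a SharkInference module.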
def get_vae32(args, model_name="vae_fp32"):
class VaeModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.vae = AutoencoderKL.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="vae",
use_auth_token=YOUR_TOKEN,
)
def forward(self, input):
x = self.vae.decode(input, return_dict=False)[0]
return (x / 2 + 0.5).clamp(0, 1)
vae = VaeModel()
vae_input = torch.rand(1, 4, 64, 64)
shark_vae = compile_through_fx(
args,
vae,
(vae_input,),
model_name,
)
return shark_vae
def get_vae16(args, model_name="vae_fp16"):
class VaeModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.vae = AutoencoderKL.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="vae",
use_auth_token=YOUR_TOKEN,
revision="fp16",
)
def forward(self, input):
x = self.vae.decode(input, return_dict=False)[0]
return (x / 2 + 0.5).clamp(0, 1)
vae = VaeModel()
vae = vae.half().cuda()
vae_input = torch.rand(1, 4, 64, 64, dtype=torch.half).cuda()
shark_vae = compile_through_fx(
args,
vae,
(vae_input,),
model_name,
)
return shark_vae
def get_unet32(args, model_name="unet_fp32"):
class UnetModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.unet = UNet2DConditionModel.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="unet",
use_auth_token=YOUR_TOKEN,
)
self.in_channels = self.unet.in_channels
self.train(False)
def forward(self, x, y, z):
return self.unet.forward(x, y, z, return_dict=False)[0]
unet = UnetModel()
latent_model_input = torch.rand([2, 4, 64, 64])
text_embeddings = torch.rand([2, args.max_length, 768])
shark_unet = compile_through_fx(
args,
unet,
(latent_model_input, torch.tensor([1.0]), text_embeddings),
model_name,
)
return shark_unet
def get_unet16(args, model_name="unet_fp16"):
class UnetModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.unet = UNet2DConditionModel.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="unet",
use_auth_token=YOUR_TOKEN,
revision="fp16",
)
self.in_channels = self.unet.in_channels
self.train(False)
def forward(self, x, y, z):
return self.unet.forward(x, y, z, return_dict=False)[0]
unet = UnetModel()
unet = unet.half().cuda()
latent_model_input = torch.rand([2, 4, 64, 64]).half().cuda()
text_embeddings = torch.rand([2, args.max_length, 768]).half().cuda()
shark_unet = compile_through_fx(
args,
unet,
(
latent_model_input,
torch.tensor([1.0]).half().cuda(),
text_embeddings,
),
model_name,
)
return shark_unet
def get_unet16_wrapped(args, model_name="unet_fp16_wrapped"):
class UnetModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.unet = UNet2DConditionModel.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="unet",
use_auth_token=YOUR_TOKEN,
revision="fp16",
)
self.in_channels = self.unet.in_channels
self.guidance_scale = args.guidance_scale
self.train(False)
def forward(self, latent, timestep, text_embedding, sigma):
# expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
latents = torch.cat([latent] * 2)
latents = latents / (torch.pow((torch.pow(sigma, 2) + 1), 0.5))
unet_out = self.unet.forward(
latents, timestep, text_embedding, return_dict=False
)[0]
noise_pred_uncond, noise_pred_text = unet_out.chunk(2)
noise_pred = noise_pred_uncond + self.guidance_scale * (
noise_pred_text - noise_pred_uncond
)
return noise_pred
unet = UnetModel()
unet = unet.half().cuda()
latent_model_input = torch.rand([1, 4, 64, 64]).half().cuda()
text_embeddings = torch.rand([2, args.max_length, 768]).half().cuda()
sigma = torch.tensor(1).to(torch.float32)
shark_unet = compile_through_fx(
args,
unet,
(
latent_model_input,
torch.tensor([1.0]).half().cuda(),
text_embeddings,
sigma,
),
model_name,
)
return shark_unet
def get_unet32_wrapped(args, model_name="unet_fp32_wrapped"):
class UnetModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.unet = UNet2DConditionModel.from_pretrained(
"CompVis/stable-diffusion-v1-4",
subfolder="unet",
use_auth_token=YOUR_TOKEN,
)
self.in_channels = self.unet.in_channels
self.guidance_scale = args.guidance_scale
self.train(False)
def forward(self, latent, timestep, text_embedding, sigma):
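# Same classifier-free guidance wrapper as the fp16 variant: duplicate the latent, scale by sigma, and blend the unconditional and text-conditioned predictions.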
latents = torch.cat([latent] * 2)
latents = latents / (torch.pow((torch.pow(sigma, 2) + 1), 0.5))
unet_out = self.unet.forward(
latents, timestep, text_embedding, return_dict=False
)[0]
noise_pred_uncond, noise_pred_text = unet_out.chunk(2)
noise_pred = noise_pred_uncond + self.guidance_scale * (
noise_pred_text - noise_pred_uncond
)
return noise_pred
unet = UnetModel()
latent_model_input = torch.rand([1, 4, 64, 64])
text_embeddings = torch.rand([2, args.max_length, 768])
sigma = torch.tensor(1).to(torch.float32)
shark_unet = compile_through_fx(
args,
unet,
(latent_model_input, torch.tensor([1.0]), text_embeddings, sigma),
model_name,
)
return shark_unet

View File

@@ -0,0 +1,90 @@
import torch
from shark.shark_inference import SharkInference
from shark.shark_importer import SharkImporter
from torch.fx.experimental.proxy_tensor import make_fx
from torch._decomp import get_decompositions
import torch_mlir
import os
def _compile_module(args, shark_module, model_name, extra_args=[]):
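# Load a cached .vmfb flatbuffer from the current working directory when caching is enabled; otherwise compile the module, save it, and load it.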
extended_name = "{}_{}".format(model_name, args.device)
if args.cache:
vmfb_path = os.path.join(os.getcwd(), extended_name + ".vmfb")
if os.path.isfile(vmfb_path):
print("Loading flatbuffer from {}".format(vmfb_path))
shark_module.load_module(vmfb_path)
return shark_module
print("No vmfb found. Compiling and saving to {}".format(vmfb_path))
path = shark_module.save_module(os.getcwd(), extended_name, extra_args)
shark_module.load_module(path)
return shark_module
# Downloads the model from shark_tank and returns the shark_module.
def get_shark_model(args, tank_url, model_name, extra_args=[]):
from shark.shark_downloader import download_torch_model
mlir_model, func_name, inputs, golden_out = download_torch_model(
model_name, tank_url=tank_url
)
shark_module = SharkInference(
mlir_model, func_name, device=args.device, mlir_dialect="linalg"
)
return _compile_module(args, shark_module, model_name, extra_args)
# Converts the torch-module into shark_module.
def compile_through_fx(args, model, inputs, model_name, extra_args=[]):
fx_g = make_fx(
model,
decomposition_table=get_decompositions(
[
torch.ops.aten.embedding_dense_backward,
torch.ops.aten.native_layer_norm_backward,
torch.ops.aten.slice_backward,
torch.ops.aten.select_backward,
torch.ops.aten.norm.ScalarOpt_dim,
torch.ops.aten.native_group_norm,
torch.ops.aten.upsample_bilinear2d.vec,
torch.ops.aten.split.Tensor,
torch.ops.aten.split_with_sizes,
]
),
)(*inputs)
fx_g.graph.set_codegen(torch.fx.graph.CodeGen())
fx_g.recompile()
def strip_overloads(gm):
"""
Modifies the target of graph nodes in :attr:`gm` to strip overloads.
Args:
gm(fx.GraphModule): The input Fx graph module to be modified
"""
for node in gm.graph.nodes:
if isinstance(node.target, torch._ops.OpOverload):
node.target = node.target.overloadpacket
gm.recompile()
strip_overloads(fx_g)
ts_g = torch.jit.trace(fx_g, inputs)
mlir_importer = SharkImporter(
ts_g,
inputs,
frontend="torch",
)
(mlir_module, func_name), _, _ = mlir_importer.import_debug()
shark_module = SharkInference(
mlir_module,
func_name,
device=args.device,
mlir_dialect="linalg",
)
return _compile_module(args, shark_module, model_name, extra_args)

Binary file not shown.


web/prompts.json Normal file
View File

@@ -0,0 +1,8 @@
[["A high tech solarpunk utopia in the Amazon rainforest"],
["A pikachu fine dining with a view to the Eiffel Tower"],
["A mecha robot in a favela in expressionist style"],
["an insect robot preparing a delicious meal"],
["A digital Illustration of the Babel tower, 4k, detailed, trending in artstation, fantasy vivid colors"],
["Cluttered house in the woods, anime, oil painting, high resolution, cottagecore, ghibli inspired, 4k"],
["A beautiful mansion beside a waterfall in the woods, by josef thoma, matte painting, trending on artstation HQ"],
["portrait photo of a asia old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes"]]