Bump gradio from 3.15.0 to 3.34.0 (#1518 )

Bumps [gradio](https://github.com/gradio-app/gradio) from 3.15.0 to 3.34.0. - [Release notes](https://github.com/gradio-app/gradio/releases) - [Changelog](https://github.com/gradio-app/gradio/blob/main/CHANGELOG.md) - [Commits](https://github.com/gradio-app/gradio/compare/v3.15.0...v3.34.0) --- updated-dependencies: - dependency-name: gradio dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Change mlir dialects for tf tests to stablehlo. (#1535 )
2026-04-20 03:00:34 -04:00 · 2023-06-14 18:13:48 -07:00 · 2023-06-14 10:43:49 -07:00 · 2023-06-14 09:39:16 -07:00 · 2023-06-14 22:08:24 +05:30 · 2023-06-14 19:06:29 +05:30
38 changed files with 2305 additions and 2378 deletions
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -50,27 +50,13 @@ jobs:
      shell: powershell
      run: |
        ./setup_venv.ps1
+        $env:SHARK_PACKAGE_VERSION=${{ env.package_version }}
+        pip wheel -v -w dist . --pre -f https://download.pytorch.org/whl/nightly/cpu -f https://llvm.github.io/torch-mlir/package-index/ -f https://nod-ai.github.io/SHARK-Runtime/pip-release-links.html
        python process_skipfiles.py
        pyinstaller .\apps\stable_diffusion\shark_sd.spec
-        mv ./dist/shark_sd.exe ./dist/shark_sd_${{ env.package_version_ }}.exe
-        signtool sign /f c:\g\shark_02152023.cer /csp "eToken Base Cryptographic Provider" /k "${{ secrets.CI_CERT }}" ./dist/shark_sd_${{ env.package_version_ }}.exe
-        pyinstaller .\apps\stable_diffusion\shark_sd_cli.spec
-        python process_skipfiles.py
-        mv ./dist/shark_sd_cli.exe ./dist/shark_sd_cli_${{ env.package_version_ }}.exe
-        signtool sign /f c:\g\shark_02152023.cer /csp "eToken Base Cryptographic Provider" /k "${{ secrets.CI_CERT }}" ./dist/shark_sd_cli_${{ env.package_version_ }}.exe
-
-        
-    # GHA windows VM OOMs so disable for now
-    #- name: Build and validate the SHARK Runtime package
-    #  shell: powershell
-    #  run: |
-    #    $env:SHARK_PACKAGE_VERSION=${{ env.package_version }}
-    #    pip wheel -v -w dist . --pre -f https://download.pytorch.org/whl/nightly/torch -f https://llvm.github.io/torch-mlir/package-index/ -f https://nod-ai.github.io/SHARK-Runtime/pip-release-links.html
-
-    #- uses: actions/upload-artifact@v2
-    #  with:
-    #    path: dist/*
-    
+        mv ./dist/shark_sd.exe ./dist/nodai_shark_sd_${{ env.package_version_ }}.exe
+        signtool sign /f c:\g\shark_02152023.cer /csp "eToken Base Cryptographic Provider" /k "${{ secrets.CI_CERT }}" ./dist/nodai_shark_sd_${{ env.package_version_ }}.exe
+  
    - name: Upload Release Assets
      id: upload-release-assets
      uses: dwenegar/upload-release-assets@v1
@@ -78,7 +64,7 @@ jobs:
        GITHUB_TOKEN: ${{ secrets.NODAI_INVOCATION_TOKEN }}
      with:
        release_id: ${{ steps.create_release.outputs.id }}
-        assets_path: ./dist/*
+        assets_path: ./dist/nodai*
        #asset_content_type: application/vnd.microsoft.portable-executable 

    - name: Publish Release
--- a/apps/language_models/scripts/vicuna.py
+++ b/apps/language_models/scripts/vicuna.py
--- a/apps/language_models/scripts/vicuna_web.py
+++ b/apps/language_models/scripts/vicuna_web.py
@@ -1,777 +0,0 @@
-import sys
-import warnings
-import gradio as gr
-import time
-
-warnings.filterwarnings("ignore")
-sys.path.insert(0, "D:\S\SB\I\python_packages\iree_compiler")
-sys.path.insert(0, "D:\S\SB\I\python_packages\iree_runtime")
-import torch
-import torch_mlir
-from transformers import AutoTokenizer, AutoModelForCausalLM
-from torch.fx.experimental.proxy_tensor import make_fx
-from torch._decomp import get_decompositions
-from typing import List
-from io import BytesIO
-from pathlib import Path
-from shark.shark_downloader import download_public_file
-from shark.shark_importer import transform_fx as transform_fx_
-import re
-from shark.shark_inference import SharkInference
-from tqdm import tqdm
-from torch_mlir import TensorPlaceholder
-from apps.stable_diffusion.web.ui.utils import available_devices
-
-
-class FirstVicunaLayer(torch.nn.Module):
-    def __init__(self, model):
-        super().__init__()
-        self.model = model
-
-    def forward(self, hidden_states, attention_mask, position_ids):
-        outputs = self.model(
-            hidden_states,
-            attention_mask=attention_mask,
-            position_ids=position_ids,
-            use_cache=True,
-        )
-        next_hidden_states = outputs[0]
-        past_key_value_out0, past_key_value_out1 = (
-            outputs[-1][0],
-            outputs[-1][1],
-        )
-
-        return (
-            next_hidden_states,
-            past_key_value_out0,
-            past_key_value_out1,
-        )
-
-
-class SecondVicunaLayer(torch.nn.Module):
-    def __init__(self, model):
-        super().__init__()
-        self.model = model
-
-    def forward(
-        self,
-        hidden_states,
-        attention_mask,
-        position_ids,
-        past_key_value0,
-        past_key_value1,
-    ):
-        outputs = self.model(
-            hidden_states,
-            attention_mask=attention_mask,
-            position_ids=position_ids,
-            past_key_value=(
-                past_key_value0,
-                past_key_value1,
-            ),
-            use_cache=True,
-        )
-        next_hidden_states = outputs[0]
-        past_key_value_out0, past_key_value_out1 = (
-            outputs[-1][0],
-            outputs[-1][1],
-        )
-
-        return (
-            next_hidden_states,
-            past_key_value_out0,
-            past_key_value_out1,
-        )
-
-
-class CompiledFirstVicunaLayer(torch.nn.Module):
-    def __init__(self, shark_module):
-        super().__init__()
-        self.model = shark_module
-
-    def forward(
-        self,
-        hidden_states,
-        attention_mask,
-        position_ids,
-        past_key_value=None,
-        output_attentions=False,
-        use_cache=True,
-    ):
-        hidden_states = hidden_states.detach()
-        attention_mask = attention_mask.detach()
-        position_ids = position_ids.detach()
-        output = self.model(
-            "forward",
-            (
-                hidden_states,
-                attention_mask,
-                position_ids,
-            ),
-        )
-
-        output0 = torch.tensor(output[0])
-        output1 = torch.tensor(output[1])
-        output2 = torch.tensor(output[2])
-
-        return (
-            output0,
-            (
-                output1,
-                output2,
-            ),
-        )
-
-
-class CompiledSecondVicunaLayer(torch.nn.Module):
-    def __init__(self, shark_module):
-        super().__init__()
-        self.model = shark_module
-
-    def forward(
-        self,
-        hidden_states,
-        attention_mask,
-        position_ids,
-        past_key_value,
-        output_attentions=False,
-        use_cache=True,
-    ):
-        hidden_states = hidden_states.detach()
-        attention_mask = attention_mask.detach()
-        position_ids = position_ids.detach()
-        pkv0 = past_key_value[0].detach()
-        pkv1 = past_key_value[1].detach()
-        output = self.model(
-            "forward",
-            (
-                hidden_states,
-                attention_mask,
-                position_ids,
-                pkv0,
-                pkv1,
-            ),
-        )
-
-        output0 = torch.tensor(output[0])
-        output1 = torch.tensor(output[1])
-        output2 = torch.tensor(output[2])
-
-        return (
-            output0,
-            (
-                output1,
-                output2,
-            ),
-        )
-
-
-class ShardedVicunaModel(torch.nn.Module):
-    def __init__(self, model, layers0, layers1):
-        super().__init__()
-        self.model = model
-        assert len(layers0) == len(model.model.layers)
-        # self.model.model.layers = torch.nn.modules.container.ModuleList(layers0)
-        self.model.model.config.use_cache = True
-        self.model.model.config.output_attentions = False
-        self.layers0 = layers0
-        self.layers1 = layers1
-
-    def forward(
-        self,
-        input_ids,
-        is_first=True,
-        past_key_values=None,
-        attention_mask=None,
-    ):
-        if is_first:
-            self.model.model.layers = torch.nn.modules.container.ModuleList(
-                self.layers0
-            )
-            return self.model.forward(input_ids, attention_mask=attention_mask)
-        else:
-            self.model.model.layers = torch.nn.modules.container.ModuleList(
-                self.layers1
-            )
-            return self.model.forward(
-                input_ids,
-                attention_mask=attention_mask,
-                past_key_values=past_key_values,
-            )
-
-
-def write_in_dynamic_inputs0(module, dynamic_input_size):
-    new_lines = []
-    for line in module.splitlines():
-        line = re.sub(f"{dynamic_input_size}x", "?x", line)
-        if "?x" in line:
-            line = re.sub("tensor.empty\(\)", "tensor.empty(%dim)", line)
-        line = re.sub(f" {dynamic_input_size},", " %dim,", line)
-        if "tensor.empty" in line and "?x?" in line:
-            line = re.sub(
-                "tensor.empty\(%dim\)", "tensor.empty(%dim, %dim)", line
-            )
-        if "arith.cmpi" in line:
-            line = re.sub(f"c{dynamic_input_size}", "dim", line)
-        new_lines.append(line)
-    new_module = "\n".join(new_lines)
-    return new_module
-
-
-def write_in_dynamic_inputs1(module, dynamic_input_size):
-    new_lines = []
-    for line in module.splitlines():
-        if "dim_42 =" in line:
-            continue
-        if f"%c{dynamic_input_size}_i64 =" in line:
-            new_lines.append(
-                "%dim_42 = tensor.dim %arg1, %c3 : tensor<1x1x1x?xf32>"
-            )
-            new_lines.append(
-                f"%dim_42_i64 = arith.index_cast %dim_42 : index to i64"
-            )
-            continue
-        line = re.sub(f"{dynamic_input_size}x", "?x", line)
-        if "?x" in line:
-            line = re.sub("tensor.empty\(\)", "tensor.empty(%dim_42)", line)
-        line = re.sub(f" {dynamic_input_size},", " %dim_42,", line)
-        if "tensor.empty" in line and "?x?" in line:
-            line = re.sub(
-                "tensor.empty\(%dim_42\)",
-                "tensor.empty(%dim_42, %dim_42)",
-                line,
-            )
-        if "arith.cmpi" in line:
-            line = re.sub(f"c{dynamic_input_size}", "dim_42", line)
-        new_lines.append(line)
-    new_module = "\n".join(new_lines)
-    return new_module
-
-
-def compile_vicuna_layer(
-    vicuna_layer,
-    hidden_states,
-    attention_mask,
-    position_ids,
-    past_key_value0=None,
-    past_key_value1=None,
-):
-    hidden_states_placeholder = TensorPlaceholder.like(
-        hidden_states, dynamic_axes=[1]
-    )
-    attention_mask_placeholder = TensorPlaceholder.like(
-        attention_mask, dynamic_axes=[2, 3]
-    )
-    position_ids_placeholder = TensorPlaceholder.like(
-        position_ids, dynamic_axes=[1]
-    )
-
-    if past_key_value0 is None and past_key_value1 is None:
-        fx_g = make_fx(
-            vicuna_layer,
-            decomposition_table=get_decompositions(
-                [
-                    torch.ops.aten.embedding_dense_backward,
-                    torch.ops.aten.native_layer_norm_backward,
-                    torch.ops.aten.slice_backward,
-                    torch.ops.aten.select_backward,
-                    torch.ops.aten.norm.ScalarOpt_dim,
-                    torch.ops.aten.native_group_norm,
-                    torch.ops.aten.upsample_bilinear2d.vec,
-                    torch.ops.aten.split.Tensor,
-                    torch.ops.aten.split_with_sizes,
-                ]
-            ),
-        )(hidden_states, attention_mask, position_ids)
-
-    else:
-        fx_g = make_fx(
-            vicuna_layer,
-            decomposition_table=get_decompositions(
-                [
-                    torch.ops.aten.embedding_dense_backward,
-                    torch.ops.aten.native_layer_norm_backward,
-                    torch.ops.aten.slice_backward,
-                    torch.ops.aten.select_backward,
-                    torch.ops.aten.norm.ScalarOpt_dim,
-                    torch.ops.aten.native_group_norm,
-                    torch.ops.aten.upsample_bilinear2d.vec,
-                    torch.ops.aten.split.Tensor,
-                    torch.ops.aten.split_with_sizes,
-                ]
-            ),
-        )(
-            hidden_states,
-            attention_mask,
-            position_ids,
-            past_key_value0,
-            past_key_value1,
-        )
-
-    def _remove_nones(fx_g: torch.fx.GraphModule) -> List[int]:
-        removed_indexes = []
-        for node in fx_g.graph.nodes:
-            if node.op == "output":
-                assert (
-                    len(node.args) == 1
-                ), "Output node must have a single argument"
-                node_arg = node.args[0]
-                if isinstance(node_arg, (list, tuple)):
-                    node_arg = list(node_arg)
-                    node_args_len = len(node_arg)
-                    for i in range(node_args_len):
-                        curr_index = node_args_len - (i + 1)
-                        if node_arg[curr_index] is None:
-                            removed_indexes.append(curr_index)
-                            node_arg.pop(curr_index)
-                    node.args = (tuple(node_arg),)
-                    break
-
-        if len(removed_indexes) > 0:
-            fx_g.graph.lint()
-            fx_g.graph.eliminate_dead_code()
-            fx_g.recompile()
-        removed_indexes.sort()
-        return removed_indexes
-
-    def _unwrap_single_tuple_return(fx_g: torch.fx.GraphModule) -> bool:
-        """
-        Replace tuple with tuple element in functions that return one-element tuples.
-        Returns true if an unwrapping took place, and false otherwise.
-        """
-        unwrapped_tuple = False
-        for node in fx_g.graph.nodes:
-            if node.op == "output":
-                assert (
-                    len(node.args) == 1
-                ), "Output node must have a single argument"
-                node_arg = node.args[0]
-                if isinstance(node_arg, tuple):
-                    if len(node_arg) == 1:
-                        node.args = (node_arg[0],)
-                        unwrapped_tuple = True
-                        break
-
-        if unwrapped_tuple:
-            fx_g.graph.lint()
-            fx_g.recompile()
-        return unwrapped_tuple
-
-    def transform_fx(fx_g):
-        for node in fx_g.graph.nodes:
-            if node.op == "call_function":
-                if node.target in [
-                    torch.ops.aten.empty,
-                ]:
-                    # aten.empty should be filled with zeros.
-                    if node.target in [torch.ops.aten.empty]:
-                        with fx_g.graph.inserting_after(node):
-                            new_node = fx_g.graph.call_function(
-                                torch.ops.aten.zero_,
-                                args=(node,),
-                            )
-                            node.append(new_node)
-                            node.replace_all_uses_with(new_node)
-                            new_node.args = (node,)
-
-        fx_g.graph.lint()
-
-    transform_fx(fx_g)
-    fx_g.recompile()
-    removed_none_indexes = _remove_nones(fx_g)
-    was_unwrapped = _unwrap_single_tuple_return(fx_g)
-
-    fx_g.graph.set_codegen(torch.fx.graph.CodeGen())
-    fx_g.recompile()
-
-    print("FX_G recompile")
-
-    def strip_overloads(gm):
-        """
-        Modifies the target of graph nodes in :attr:`gm` to strip overloads.
-        Args:
-            gm(fx.GraphModule): The input Fx graph module to be modified
-        """
-        for node in gm.graph.nodes:
-            if isinstance(node.target, torch._ops.OpOverload):
-                node.target = node.target.overloadpacket
-        gm.recompile()
-
-    strip_overloads(fx_g)
-    ts_g = torch.jit.script(fx_g)
-    return ts_g
-
-
-path = "TheBloke/vicuna-7B-1.1-HF"
-kwargs = {"torch_dtype": torch.float}
-vicuna_model = AutoModelForCausalLM.from_pretrained(path, **kwargs)
-tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False)
-
-
-def compile_to_vmfb(inputs, layers, is_first=True):
-    mlirs, modules = [], []
-    for idx, layer in tqdm(enumerate(layers), desc="Getting mlirs"):
-        if is_first:
-            mlir_path = Path(f"{idx}_0.mlir")
-            vmfb_path = Path(f"{idx}_0.vmfb")
-        else:
-            mlir_path = Path(f"{idx}_1.mlir")
-            vmfb_path = Path(f"{idx}_1.vmfb")
-        if vmfb_path.exists():
-            continue
-        if mlir_path.exists():
-            # print(f"Found layer {idx} mlir")
-            f_ = open(mlir_path, "rb")
-            bytecode = f_.read()
-            f_.close()
-        else:
-            hidden_states_placeholder = TensorPlaceholder.like(
-                inputs[0], dynamic_axes=[1]
-            )
-            attention_mask_placeholder = TensorPlaceholder.like(
-                inputs[1], dynamic_axes=[3]
-            )
-            position_ids_placeholder = TensorPlaceholder.like(
-                inputs[2], dynamic_axes=[1]
-            )
-            if not is_first:
-                pkv0_placeholder = TensorPlaceholder.like(
-                    inputs[3], dynamic_axes=[2]
-                )
-                pkv1_placeholder = TensorPlaceholder.like(
-                    inputs[4], dynamic_axes=[2]
-                )
-            print(f"Compiling layer {idx} mlir")
-            if is_first:
-                ts_g = compile_vicuna_layer(
-                    layer, inputs[0], inputs[1], inputs[2]
-                )
-                module = torch_mlir.compile(
-                    ts_g,
-                    (
-                        hidden_states_placeholder,
-                        inputs[1],
-                        inputs[2],
-                    ),
-                    torch_mlir.OutputType.LINALG_ON_TENSORS,
-                    use_tracing=False,
-                    verbose=False,
-                )
-            else:
-                ts_g = compile_vicuna_layer(
-                    layer,
-                    inputs[0],
-                    inputs[1],
-                    inputs[2],
-                    inputs[3],
-                    inputs[4],
-                )
-                module = torch_mlir.compile(
-                    ts_g,
-                    (
-                        inputs[0],
-                        attention_mask_placeholder,
-                        inputs[2],
-                        pkv0_placeholder,
-                        pkv1_placeholder,
-                    ),
-                    torch_mlir.OutputType.LINALG_ON_TENSORS,
-                    use_tracing=False,
-                    verbose=False,
-                )
-
-            # bytecode_stream = BytesIO()
-            # module.operation.write_bytecode(bytecode_stream)
-            # bytecode = bytecode_stream.getvalue()
-
-            if is_first:
-                module = write_in_dynamic_inputs0(str(module), 137)
-                bytecode = module.encode("UTF-8")
-                bytecode_stream = BytesIO(bytecode)
-                bytecode = bytecode_stream.read()
-
-            else:
-                module = write_in_dynamic_inputs1(str(module), 138)
-                if idx in [0, 5, 6, 7]:
-                    module_str = module
-                    module_str = module_str.splitlines()
-                    new_lines = []
-                    for line in module_str:
-                        if len(line) < 1000:
-                            new_lines.append(line)
-                        else:
-                            new_lines.append(line[:999])
-                    module_str = "\n".join(new_lines)
-                    f1_ = open(f"{idx}_1_test.mlir", "w+")
-                    f1_.write(module_str)
-                    f1_.close()
-
-                bytecode = module.encode("UTF-8")
-                bytecode_stream = BytesIO(bytecode)
-                bytecode = bytecode_stream.read()
-
-            f_ = open(mlir_path, "wb")
-            f_.write(bytecode)
-            f_.close()
-        mlirs.append(bytecode)
-
-    for idx, layer in tqdm(enumerate(layers), desc="compiling modules"):
-        if is_first:
-            vmfb_path = Path(f"{idx}_0.vmfb")
-            if idx < 25:
-                device = "cpu"
-            else:
-                device = "cpu"
-            if vmfb_path.exists():
-                # print(f"Found layer {idx} vmfb")
-                module = SharkInference(
-                    None, device=device, mlir_dialect="tm_tensor"
-                )
-                module.load_module(vmfb_path)
-            else:
-                print(f"Compiling layer {idx} vmfb")
-                module = SharkInference(
-                    mlirs[idx], device=device, mlir_dialect="tm_tensor"
-                )
-                module.save_module(
-                    module_name=f"{idx}_0",
-                    extra_args=[
-                        "--iree-hal-dump-executable-sources-to=ies",
-                        "--iree-vm-target-truncate-unsupported-floats",
-                        "--iree-codegen-check-ir-before-llvm-conversion=false",
-                        "--iree-vm-bytecode-module-output-format=flatbuffer-binary",
-                    ],
-                )
-                module.load_module(vmfb_path)
-            modules.append(module)
-        else:
-            vmfb_path = Path(f"{idx}_1.vmfb")
-            if idx < 25:
-                device = "vulkan"
-            else:
-                device = "cpu"
-            if vmfb_path.exists():
-                # print(f"Found layer {idx} vmfb")
-                module = SharkInference(
-                    None, device=device, mlir_dialect="tm_tensor"
-                )
-                module.load_module(vmfb_path)
-            else:
-                print(f"Compiling layer {idx} vmfb")
-                module = SharkInference(
-                    mlirs[idx], device=device, mlir_dialect="tm_tensor"
-                )
-                module.save_module(
-                    module_name=f"{idx}_1",
-                    extra_args=[
-                        "--iree-hal-dump-executable-sources-to=ies",
-                        "--iree-vm-target-truncate-unsupported-floats",
-                        "--iree-codegen-check-ir-before-llvm-conversion=false",
-                        "--iree-vm-bytecode-module-output-format=flatbuffer-binary",
-                    ],
-                )
-                module.load_module(vmfb_path)
-            modules.append(module)
-
-    return mlirs, modules
-
-
-def get_sharded_model():
-    # SAMPLE_INPUT_LEN is used for creating mlir with dynamic inputs, which is currently an increadibly hacky proccess
-    # please don't change it
-    SAMPLE_INPUT_LEN = 137
-    global vicuna_model
-
-    placeholder_input0 = (
-        torch.zeros([1, SAMPLE_INPUT_LEN, 4096]),
-        torch.zeros([1, 1, SAMPLE_INPUT_LEN, SAMPLE_INPUT_LEN]),
-        torch.zeros([1, SAMPLE_INPUT_LEN], dtype=torch.int64),
-    )
-
-    placeholder_input1 = (
-        torch.zeros([1, 1, 4096]),
-        torch.zeros([1, 1, 1, SAMPLE_INPUT_LEN + 1]),
-        torch.zeros([1, 1], dtype=torch.int64),
-        torch.zeros([1, 32, SAMPLE_INPUT_LEN, 128]),
-        torch.zeros([1, 32, SAMPLE_INPUT_LEN, 128]),
-    )
-
-    layers0 = [FirstVicunaLayer(layer) for layer in vicuna_model.model.layers]
-    _, modules0 = compile_to_vmfb(placeholder_input0, layers0, is_first=True)
-    shark_layers0 = [CompiledFirstVicunaLayer(m) for m in modules0]
-
-    layers1 = [SecondVicunaLayer(layer) for layer in vicuna_model.model.layers]
-    _, modules1 = compile_to_vmfb(placeholder_input1, layers1, is_first=False)
-    shark_layers1 = [CompiledSecondVicunaLayer(m) for m in modules1]
-
-    sharded_model = ShardedVicunaModel(
-        vicuna_model, shark_layers0, shark_layers1
-    )
-    return sharded_model
-
-
-sharded_model = get_sharded_model()
-
-
-def user(message, history):
-    print("msg=", message)
-    print("history=", history)
-    # Append the user's message to the conversation history
-    return "", history + [[message, ""]]
-
-
-def chat(curr_system_message, history):
-    global sharded_model
-    past_key_values = None
-    messages = curr_system_message + "".join(
-        [
-            "".join(["<|USER|>" + item[0], "<|ASSISTANT|>" + item[1]])
-            for item in history
-        ]
-    )
-    print(messages)
-    prompt = messages.strip()
-    input_ids = tokenizer(prompt).input_ids
-    tokens = input_ids
-    new_sentence = []
-    max_response_len = 1000
-    partial_sentence = []
-    partial_text = ""
-    start_time = time.time()
-    for iteration in range(max_response_len):
-        original_input_ids = input_ids
-        input_id_len = len(input_ids)
-        input_ids = torch.tensor(input_ids)
-        input_ids = input_ids.reshape([1, input_id_len])
-
-        if iteration == 0:
-            output = sharded_model.forward(input_ids, is_first=True)
-        else:
-            output = sharded_model.forward(
-                input_ids, past_key_values=past_key_values, is_first=False
-            )
-        logits = output["logits"]
-        past_key_values = output["past_key_values"]
-        new_token = int(torch.argmax(logits[:, -1, :], dim=1)[0])
-        if new_token == 2:
-            break
-        new_sentence += [new_token]
-        partial_sentence += [new_token]
-        if iteration > 0 and iteration % 2 == 0:
-            new_text = tokenizer.decode(partial_sentence)
-            partial_sentence = []
-            print(new_text, " ")
-            partial_text += new_text + " "
-            history[-1][1] = partial_text
-            yield history
-
-        tokens.append(new_token)
-        original_input_ids.append(new_token)
-        input_ids = [new_token]
-    end_time = time.time()
-    print(
-        f"Total time taken to generated response is {end_time-start_time} seconds"
-    )
-
-    for i in range(len(tokens)):
-        if type(tokens[i]) != int:
-            tokens[i] = int(tokens[i][0])
-    new_sentence_str = tokenizer.decode(new_sentence)
-    print(new_sentence_str)
-    history[-1][1] = new_sentence_str
-    return history
-
-
-system_msg = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
-# history_eg = [["hi hello how are you", ""]]
-# print(chat(system_msg, history_eg))
-
-with gr.Blocks(title="Chatbot") as vicuna_chat:
-    with gr.Row():
-        model = gr.Dropdown(
-            label="Select Model",
-            value="TheBloke/vicuna-7B-1.1-HF",
-            choices=[
-                "TheBloke/vicuna-7B-1.1-HF",
-            ],
-        )
-        device_value = None
-        for d in available_devices:
-            if "vulkan" in d:
-                device_value = d
-                break
-
-        device = gr.Dropdown(
-            label="Device",
-            value=device_value if device_value else available_devices[0],
-            interactive=False,
-            choices=available_devices,
-        )
-    chatbot = gr.Chatbot().style(height=500)
-    with gr.Row():
-        with gr.Column():
-            msg = gr.Textbox(
-                label="Chat Message Box",
-                placeholder="Chat Message Box",
-                show_label=False,
-            ).style(container=False)
-        with gr.Column():
-            with gr.Row():
-                submit = gr.Button("Submit")
-                stop = gr.Button("Stop")
-                clear = gr.Button("Clear")
-    system_msg = gr.Textbox(
-        system_msg, label="System Message", interactive=False, visible=False
-    )
-
-    submit_event = msg.submit(
-        fn=user, inputs=[msg, chatbot], outputs=[msg, chatbot], queue=False
-    ).then(
-        fn=chat,
-        inputs=[system_msg, chatbot],
-        outputs=[chatbot],
-        queue=True,
-    )
-    submit_click_event = submit.click(
-        fn=user, inputs=[msg, chatbot], outputs=[msg, chatbot], queue=False
-    ).then(
-        fn=chat,
-        inputs=[system_msg, chatbot],
-        outputs=[chatbot],
-        queue=True,
-    )
-    stop.click(
-        fn=None,
-        inputs=None,
-        outputs=None,
-        cancels=[submit_event, submit_click_event],
-        queue=False,
-    )
-    clear.click(lambda: None, None, [chatbot], queue=False)
-
-import argparse
-
-p = argparse.ArgumentParser(
-    description=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter
-)
-p.add_argument(
-    "--share",
-    default=False,
-    action=argparse.BooleanOptionalAction,
-    help="flag for generating a public URL",
-)
-p.add_argument(
-    "--server_port",
-    type=int,
-    default=8080,
-    help="flag for setting server port",
-)
-args, unknown = p.parse_known_args()
-
-vicuna_chat.queue()
-vicuna_chat.launch(
-    share=args.share,
-    inbrowser=True,
-    server_name="0.0.0.0",
-    server_port=args.server_port,
-)
--- a/apps/language_models/src/model_wrappers/falcon_model.py
+++ b/apps/language_models/src/model_wrappers/falcon_model.py
@@ -0,0 +1,22 @@
+import torch
+
+
+class FalconModel(torch.nn.Module):
+    def __init__(self, model):
+        super().__init__()
+        self.model = model
+
+    def forward(self, input_ids, attention_mask):
+        input_dict = {
+            "input_ids": input_ids,
+            "attention_mask": attention_mask,
+            "past_key_values": None,
+            "use_cache": True,
+        }
+        output = self.model(
+            **input_dict,
+            return_dict=True,
+            output_attentions=False,
+            output_hidden_states=False,
+        )[0]
+        return output[:, -1, :]
--- a/apps/language_models/src/model_wrappers/vicuna_model.py
+++ b/apps/language_models/src/model_wrappers/vicuna_model.py
@@ -237,3 +237,25 @@ class SecondVicuna(torch.nn.Module):
            return_vals.append(item[0])
            return_vals.append(item[1])
        return tuple(return_vals)
+
+
+class CombinedModel(torch.nn.Module):
+    def __init__(
+        self,
+        first_vicuna_model_path="TheBloke/vicuna-7B-1.1-HF",
+        second_vicuna_model_path="TheBloke/vicuna-7B-1.1-HF",
+    ):
+        super().__init__()
+        self.first_vicuna = FirstVicuna(first_vicuna_model_path)
+        self.second_vicuna = SecondVicuna(second_vicuna_model_path)
+
+    def forward(self, input_ids):
+        first_output = self.first_vicuna(input_ids=input_ids, use_cache=True)
+        logits = first_output[0]
+        pkv = first_output[1:]
+
+        token = torch.argmax(torch.tensor(logits)[:, -1, :], dim=1)
+        token = token.to(torch.int64).reshape([1, 1])
+        secondVicunaInput = (token,) + tuple(pkv)
+        second_output = self.second_vicuna(secondVicunaInput)
+        return second_output
--- a/apps/language_models/src/pipelines/falcon_pipeline.py
+++ b/apps/language_models/src/pipelines/falcon_pipeline.py
@@ -0,0 +1,473 @@
+from apps.language_models.src.model_wrappers.falcon_model import FalconModel
+from apps.language_models.src.pipelines.SharkLLMBase import SharkLLMBase
+from apps.language_models.utils import (
+    get_vmfb_from_path,
+)
+from io import BytesIO
+from pathlib import Path
+from contextlib import redirect_stdout
+from shark.shark_downloader import download_public_file
+from shark.shark_importer import import_with_fx
+from shark.shark_inference import SharkInference
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from transformers.generation import (
+    GenerationConfig,
+    LogitsProcessorList,
+    StoppingCriteriaList,
+)
+import copy
+
+import re
+import torch
+import torch_mlir
+import os
+import argparse
+
+parser = argparse.ArgumentParser(
+    prog="falcon runner",
+    description="runs a falcon model",
+)
+
+parser.add_argument(
+    "--precision", "-p", default="fp32", help="fp32, fp16, int8, int4"
+)
+parser.add_argument("--device", "-d", default="cuda", help="vulkan, cpu, cuda")
+parser.add_argument(
+    "--falcon_vmfb_path", default=None, help="path to falcon's vmfb"
+)
+parser.add_argument(
+    "--falcon_mlir_path",
+    default=None,
+    help="path to falcon's mlir file",
+)
+
+parser.add_argument(
+    "--load_mlir_from_shark_tank",
+    default=False,
+    action=argparse.BooleanOptionalAction,
+    help="download precompile mlir from shark tank",
+)
+parser.add_argument(
+    "--cli",
+    default=True,
+    action=argparse.BooleanOptionalAction,
+    help="Run model in cli mode",
+)
+
+
+class Falcon(SharkLLMBase):
+    def __init__(
+        self,
+        model_name,
+        hf_model_path="tiiuae/falcon-7b-instruct",
+        max_num_tokens=150,
+        device="cuda",
+        precision="fp32",
+        falcon_mlir_path=Path("falcon.mlir"),
+        falcon_vmfb_path=Path("falcon.vmfb"),
+    ) -> None:
+        super().__init__(model_name, hf_model_path, max_num_tokens)
+        self.max_padding_length = 100
+        self.device = device
+        self.precision = precision
+        self.falcon_vmfb_path = falcon_vmfb_path
+        self.falcon_mlir_path = falcon_mlir_path
+        self.tokenizer = self.get_tokenizer()
+        self.shark_model = self.compile()
+        self.src_model = self.get_src_model()
+
+    def get_tokenizer(self):
+        tokenizer = AutoTokenizer.from_pretrained(
+            self.hf_model_path, trust_remote_code=True
+        )
+        tokenizer.padding_side = "left"
+        tokenizer.pad_token_id = 11
+        return tokenizer
+
+    def get_src_model(self):
+        print("Loading src model")
+        kwargs = {"torch_dtype": torch.float, "trust_remote_code": True}
+        falcon_model = AutoModelForCausalLM.from_pretrained(
+            self.hf_model_path, **kwargs
+        )
+        return falcon_model
+
+    def compile_falcon(self):
+        vmfb = get_vmfb_from_path(self.falcon_vmfb_path, self.device, "linalg")
+        if vmfb is not None:
+            return vmfb
+
+        print(
+            f"[DEBUG] vmfb not found at {self.falcon_vmfb_path.absolute()}. Trying to work with"
+            f"[DEBUG] mlir path { self.falcon_mlir_path} {'exists' if self.falcon_mlir_path.exists() else 'does not exist'}"
+        )
+        if self.falcon_mlir_path.exists():
+            with open(self.falcon_mlir_path, "rb") as f:
+                bytecode = f.read()
+        else:
+            mlir_generated = False
+            if args.load_mlir_from_shark_tank:
+                if self.precision == "fp32":
+                    # download MLIR from shark_tank for fp32
+                    download_public_file(
+                        "gs://shark_tank/falcon/7b/cuda/falcon.mlir",
+                        self.falcon_mlir_path.absolute(),
+                        single_file=True,
+                    )
+                    if self.falcon_mlir_path.exists():
+                        with open(self.falcon_mlir_path, "rb") as f:
+                            bytecode = f.read()
+                        mlir_generated = True
+                    else:
+                        raise ValueError(
+                            f"MLIR not found at {self.falcon_mlir_path.absolute()}"
+                            " after downloading! Please check path and try again"
+                        )
+                else:
+                    print(
+                        "Only fp32 mlir added to tank, generating mlir on device."
+                    )
+
+            if not mlir_generated:
+                compilation_input_ids = torch.randint(
+                    low=1, high=10000, size=(1, 100)
+                )
+                compilation_attention_mask = torch.ones(
+                    1, 100, dtype=torch.int64
+                )
+                falconCompileInput = (
+                    compilation_input_ids,
+                    compilation_attention_mask,
+                )
+                model = FalconModel(self.src_model)
+
+                print(f"[DEBUG] generating torchscript graph")
+                ts_graph = import_with_fx(
+                    model,
+                    falconCompileInput,
+                    is_f16=self.precision == "fp16",
+                    f16_input_mask=[False, False],
+                    mlir_type="torchscript",
+                )
+                del model
+                print(f"[DEBUG] generating torch mlir")
+
+                module = torch_mlir.compile(
+                    ts_graph,
+                    [*falconCompileInput],
+                    torch_mlir.OutputType.LINALG_ON_TENSORS,
+                    use_tracing=False,
+                    verbose=False,
+                )
+                del ts_graph
+
+                print(f"[DEBUG] converting to bytecode")
+                bytecode_stream = BytesIO()
+                module.operation.write_bytecode(bytecode_stream)
+                bytecode = bytecode_stream.getvalue()
+                del module
+
+                print(f"[DEBUG] writing mlir to file")
+                with open(f"{self.model_name}.mlir", "wb") as f_:
+                    with redirect_stdout(f_):
+                        print(module.operation.get_asm())
+                f_.close()
+
+        shark_module = SharkInference(
+            mlir_module=bytecode, device=self.device, mlir_dialect="linalg"
+        )
+        path = shark_module.save_module(
+            self.falcon_vmfb_path.parent.absolute(),
+            self.falcon_vmfb_path.stem,
+            extra_args=[
+                "--iree-hal-dump-executable-sources-to=ies",
+                "--iree-vm-target-truncate-unsupported-floats",
+                "--iree-codegen-check-ir-before-llvm-conversion=false",
+                "--iree-vm-bytecode-module-output-format=flatbuffer-binary",
+            ],
+        )
+        print("Saved falcon vmfb at ", str(path))
+        shark_module.load_module(path)
+
+        return shark_module
+
+    def compile(self):
+        if (
+            not self.falcon_vmfb_path.exists()
+            and self.device == "cuda"
+            and self.precision == "fp32"
+        ):
+            download_public_file(
+                "gs://shark_tank/falcon/7b/cuda/falcon.vmfb",
+                self.falcon_vmfb_path.absolute(),
+                single_file=True,
+            )
+
+        falcon_shark_model = self.compile_falcon()
+        return falcon_shark_model
+
+    def generate(self, prompt):
+        model_inputs = self.tokenizer(
+            prompt,
+            padding="max_length",
+            max_length=self.max_padding_length,
+            add_special_tokens=False,
+            return_tensors="pt",
+        )
+        model_inputs["prompt_text"] = prompt
+
+        input_ids = model_inputs["input_ids"]
+        attention_mask = model_inputs.get("attention_mask", None)
+
+        # Allow empty prompts
+        if input_ids.shape[1] == 0:
+            input_ids = None
+            attention_mask = None
+            in_b = 1
+        else:
+            in_b = input_ids.shape[0]
+
+        generate_kwargs = {
+            "max_length": self.max_num_tokens,
+            "do_sample": True,
+            "top_k": 10,
+            "num_return_sequences": 1,
+            "eos_token_id": 11,
+        }
+        generate_kwargs["input_ids"] = input_ids
+        generate_kwargs["attention_mask"] = attention_mask
+        generation_config_ = GenerationConfig.from_model_config(
+            self.src_model.config
+        )
+        generation_config = copy.deepcopy(generation_config_)
+        model_kwargs = generation_config.update(**generate_kwargs)
+
+        logits_processor = LogitsProcessorList()
+        stopping_criteria = StoppingCriteriaList()
+
+        eos_token_id = generation_config.eos_token_id
+        generation_config.pad_token_id = eos_token_id
+
+        (
+            inputs_tensor,
+            model_input_name,
+            model_kwargs,
+        ) = self.src_model._prepare_model_inputs(
+            None, generation_config.bos_token_id, model_kwargs
+        )
+        batch_size = inputs_tensor.shape[0]
+
+        model_kwargs["output_attentions"] = generation_config.output_attentions
+        model_kwargs[
+            "output_hidden_states"
+        ] = generation_config.output_hidden_states
+        model_kwargs["use_cache"] = generation_config.use_cache
+
+        input_ids = (
+            inputs_tensor
+            if model_input_name == "input_ids"
+            else model_kwargs.pop("input_ids")
+        )
+
+        self.logits_processor = self.src_model._get_logits_processor(
+            generation_config=generation_config,
+            input_ids_seq_length=input_ids.shape[-1],
+            encoder_input_ids=inputs_tensor,
+            prefix_allowed_tokens_fn=None,
+            logits_processor=logits_processor,
+        )
+
+        self.stopping_criteria = self.src_model._get_stopping_criteria(
+            generation_config=generation_config,
+            stopping_criteria=stopping_criteria,
+        )
+
+        self.logits_warper = self.src_model._get_logits_warper(
+            generation_config
+        )
+
+        (
+            self.input_ids,
+            self.model_kwargs,
+        ) = self.src_model._expand_inputs_for_generation(
+            input_ids=input_ids,
+            expand_size=generation_config.num_return_sequences,  # 1
+            is_encoder_decoder=self.src_model.config.is_encoder_decoder,  # False
+            **model_kwargs,
+        )
+
+        if isinstance(eos_token_id, int):
+            eos_token_id = [eos_token_id]
+        self.eos_token_id_tensor = (
+            torch.tensor(eos_token_id) if eos_token_id is not None else None
+        )
+
+        self.pad_token_id = generation_config.pad_token_id
+        self.eos_token_id = eos_token_id
+
+        output_scores = generation_config.output_scores  # False
+        output_attentions = generation_config.output_attentions  # False
+        output_hidden_states = generation_config.output_hidden_states  # False
+        return_dict_in_generate = (
+            generation_config.return_dict_in_generate  # False
+        )
+
+        # init attention / hidden states / scores tuples
+        self.scores = (
+            () if (return_dict_in_generate and output_scores) else None
+        )
+        decoder_attentions = (
+            () if (return_dict_in_generate and output_attentions) else None
+        )
+        cross_attentions = (
+            () if (return_dict_in_generate and output_attentions) else None
+        )
+        decoder_hidden_states = (
+            () if (return_dict_in_generate and output_hidden_states) else None
+        )
+
+        # keep track of which sequences are already finished
+        self.unfinished_sequences = torch.ones(
+            input_ids.shape[0], dtype=torch.long, device=input_ids.device
+        )
+
+        all_text = prompt
+
+        for i in range(self.max_num_tokens - 1):
+            next_token = self.generate_new_token()
+            new_word = self.tokenizer.decode(
+                next_token.cpu().numpy(),
+                add_special_tokens=False,
+                skip_special_tokens=True,
+                clean_up_tokenization_spaces=True,
+            )
+
+            all_text = all_text + new_word
+
+            print(f"{new_word}", end="", flush=True)
+
+            # if eos_token was found in one sentence, set sentence to finished
+            if self.eos_token_id_tensor is not None:
+                self.unfinished_sequences = self.unfinished_sequences.mul(
+                    next_token.tile(self.eos_token_id_tensor.shape[0], 1)
+                    .ne(self.eos_token_id_tensor.unsqueeze(1))
+                    .prod(dim=0)
+                )
+                # stop when each sentence is finished
+                if (
+                    self.unfinished_sequences.max() == 0
+                    or self.stopping_criteria(input_ids, self.scores)
+                ):
+                    break
+
+        torch.cuda.empty_cache()
+        gc.collect()
+
+        return all_text
+
+    def generate_new_token(self):
+        model_inputs = self.src_model.prepare_inputs_for_generation(
+            self.input_ids, **self.model_kwargs
+        )
+        outputs = torch.from_numpy(
+            self.shark_model(
+                "forward",
+                (model_inputs["input_ids"], model_inputs["attention_mask"]),
+            )
+        )
+        next_token_logits = outputs
+
+        # pre-process distribution
+        next_token_scores = self.logits_processor(
+            self.input_ids, next_token_logits
+        )
+        next_token_scores = self.logits_warper(
+            self.input_ids, next_token_scores
+        )
+
+        # sample
+        probs = torch.nn.functional.softmax(next_token_scores, dim=-1)
+
+        next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
+
+        # finished sentences should have their next token be a padding token
+        if self.eos_token_id is not None:
+            if self.pad_token_id is None:
+                raise ValueError(
+                    "If `eos_token_id` is defined, make sure that `pad_token_id` is defined."
+                )
+            next_token = (
+                next_token * self.unfinished_sequences
+                + self.pad_token_id * (1 - self.unfinished_sequences)
+            )
+
+        self.input_ids = torch.cat(
+            [self.input_ids, next_token[:, None]], dim=-1
+        )
+
+        self.model_kwargs["past_key_values"] = None
+        if "attention_mask" in self.model_kwargs:
+            attention_mask = self.model_kwargs["attention_mask"]
+            self.model_kwargs["attention_mask"] = torch.cat(
+                [
+                    attention_mask,
+                    attention_mask.new_ones((attention_mask.shape[0], 1)),
+                ],
+                dim=-1,
+            )
+
+        self.input_ids = self.input_ids[:, 1:]
+        self.model_kwargs["attention_mask"] = self.model_kwargs[
+            "attention_mask"
+        ][:, 1:]
+
+        return next_token
+
+
+if __name__ == "__main__":
+    args = parser.parse_args()
+
+    falcon_mlir_path = (
+        Path("falcon.mlir")
+        if args.falcon_mlir_path is None
+        else Path(args.falcon_mlir_path)
+    )
+    falcon_vmfb_path = (
+        Path("falcon.vmfb")
+        if args.falcon_vmfb_path is None
+        else Path(args.falcon_vmfb_path)
+    )
+
+    falcon = Falcon(
+        "falcon",
+        device=args.device,
+        precision=args.precision,
+        falcon_mlir_path=falcon_mlir_path,
+        falcon_vmfb_path=falcon_vmfb_path,
+    )
+
+    import gc
+
+    default_prompt_text = "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
+    continue_execution = True
+
+    while continue_execution:
+        use_default_prompt = input(
+            "\nDo you wish to use the default prompt text? True or False?: "
+        )
+        if use_default_prompt:
+            prompt = default_prompt_text
+        else:
+            prompt = input("Please enter the prompt text: ")
+        print("\nPrompt Text: ", prompt)
+
+        res_str = falcon.generate(prompt)
+        torch.cuda.empty_cache()
+        gc.collect()
+        print(
+            "\n\n-----\nHere's the complete formatted result: \n\n",
+            res_str,
+        )
+        continue_execution = input(
+            "\nDo you wish to run script one more time? True or False?: "
+        )
--- a/apps/language_models/src/pipelines/vicuna_pipeline.py
+++ b/apps/language_models/src/pipelines/vicuna_pipeline.py
@@ -3,9 +3,14 @@ from apps.language_models.src.model_wrappers.vicuna_model import (
    SecondVicuna,
 )
 from apps.language_models.src.pipelines.SharkLLMBase import SharkLLMBase
-from apps.language_models.utils import get_torch_mlir_module_bytecode
+from apps.language_models.utils import (
+    get_vmfb_from_path,
+)
+
 from io import BytesIO
 from pathlib import Path
+from shark.shark_downloader import download_public_file
+from shark.shark_importer import import_with_fx
 from shark.shark_inference import SharkInference
 from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -23,13 +28,23 @@ class Vicuna(SharkLLMBase):
        max_num_tokens=512,
        device="cuda",
        precision="fp32",
+        first_vicuna_mlir_path=Path("first_vicuna.mlir"),
+        second_vicuna_mlir_path=Path("second_vicuna.mlir"),
+        first_vicuna_vmfb_path=Path("first_vicuna.vmfb"),
+        second_vicuna_vmfb_path=Path("second_vicuna.vmfb"),
+        load_mlir_from_shark_tank=True,
    ) -> None:
        super().__init__(model_name, hf_model_path, max_num_tokens)
        self.max_sequence_length = 256
        self.device = device
        self.precision = precision
+        self.first_vicuna_vmfb_path = first_vicuna_vmfb_path
+        self.second_vicuna_vmfb_path = second_vicuna_vmfb_path
+        self.first_vicuna_mlir_path = first_vicuna_mlir_path
+        self.second_vicuna_mlir_path = second_vicuna_mlir_path
        self.tokenizer = self.get_tokenizer()
        self.shark_model = self.compile()
+        self.load_mlir_from_shark_tank = load_mlir_from_shark_tank

    def get_tokenizer(self):
        tokenizer = AutoTokenizer.from_pretrained(
@@ -45,96 +60,137 @@ class Vicuna(SharkLLMBase):
        return vicuna_model

    def compile_first_vicuna(self):
-        vmfb_path = Path(self.model_name + ".vmfb")
-        if vmfb_path.exists():
-            shark_module = SharkInference(
-                None, device=self.device, mlir_dialect="tm_tensor"
-            )
-            shark_module.load_module(vmfb_path)
-            # self.shark_module = shark_module
-            return shark_module
-        mlir_path = Path(self.model_name + ".mlir")
-        print(
-            f"[DEBUG] mlir path { mlir_path} {'exists' if mlir_path.exists() else 'does not exist'}"
+        vmfb = get_vmfb_from_path(
+            self.first_vicuna_vmfb_path, self.device, "tm_tensor"
        )
-        if mlir_path.exists():
-            with open(mlir_path, "rb") as f:
+        if vmfb is not None:
+            return vmfb
+
+        # Compilation path needs some more work before it is functional
+
+        print(
+            f"[DEBUG] vmfb not found at {self.first_vicuna_vmfb_path.absolute()}. Trying to work with"
+            f"[DEBUG] mlir path { self.first_vicuna_mlir_path} {'exists' if self.first_vicuna_mlir_path.exists() else 'does not exist'}"
+        )
+        if self.first_vicuna_mlir_path.exists():
+            with open(self.first_vicuna_mlir_path, "rb") as f:
                bytecode = f.read()
        else:
-            compilation_prompt = "".join(["0" for _ in range(17)])
-            compilation_input_ids = self.tokenizer(
-                compilation_prompt
-            ).input_ids
-            compilation_input_ids = torch.tensor(
-                compilation_input_ids
-            ).reshape([1, 19])
-            firstVicunaCompileInput = (compilation_input_ids,)
-            model = FirstVicuna(self.hf_model_path)
-
-            ts_graph = get_torch_mlir_module_bytecode(
-                model, firstVicunaCompileInput
-            )
-
-            firstVicunaCompileInput = list(firstVicunaCompileInput)
-            firstVicunaCompileInput[0] = torch_mlir.TensorPlaceholder.like(
-                firstVicunaCompileInput[0], dynamic_axes=[1]
-            )
-            firstVicunaCompileInput = tuple(firstVicunaCompileInput)
-            module = torch_mlir.compile(
-                ts_graph,
-                [*firstVicunaCompileInput],
-                torch_mlir.OutputType.LINALG_ON_TENSORS,
-                use_tracing=False,
-                verbose=False,
-            )
-
-            def remove_constant_dim(line):
-                if "19x" in line:
-                    line = re.sub("19x", "?x", line)
-                    line = re.sub(
-                        "tensor.empty\(\)", "tensor.empty(%dim)", line
+            mlir_generated = False
+            if self.load_mlir_from_shark_tank:
+                if self.precision == "fp32":
+                    # download MLIR from shark_tank for fp32
+                    download_public_file(
+                        "gs://shark_tank/vicuna/unsharded/mlir/first_vicuna.mlir",
+                        self.first_vicuna_mlir_path.absolute(),
+                        single_file=True,
                    )
-                if "tensor.empty" in line and "?x?" in line:
-                    line = re.sub(
-                        "tensor.empty\(%dim\)",
-                        "tensor.empty(%dim, %dim)",
-                        line,
+                    if self.first_vicuna_mlir_path.exists():
+                        with open(self.first_vicuna_mlir_path, "rb") as f:
+                            bytecode = f.read()
+                        mlir_generated = True
+                    else:
+                        raise ValueError(
+                            f"MLIR not found at {self.first_vicuna_mlir_path.absolute()}"
+                            " after downloading! Please check path and try again"
+                        )
+                else:
+                    print(
+                        "Only fp32 mlir added to tank, generating mlir on device."
                    )
-                if "arith.cmpi" in line:
-                    line = re.sub("c19", "dim", line)
-                if " 19," in line:
-                    line = re.sub(" 19,", " %dim,", line)
-                return line

-            module_str = str(module)
-            new_lines = []
+            if not mlir_generated:
+                compilation_prompt = "".join(["0" for _ in range(17)])
+                compilation_input_ids = self.tokenizer(
+                    compilation_prompt
+                ).input_ids
+                compilation_input_ids = torch.tensor(
+                    compilation_input_ids
+                ).reshape([1, 19])
+                firstVicunaCompileInput = (compilation_input_ids,)
+                model = FirstVicuna(self.hf_model_path)

-            for line in module_str.splitlines():
-                line = remove_constant_dim(line)
-                if "%0 = tensor.empty(%dim) : tensor<?xi64>" in line:
-                    new_lines.append(
+                print(f"[DEBUG] generating torchscript graph")
+                ts_graph = import_with_fx(
+                    model,
+                    firstVicunaCompileInput,
+                    is_f16=self.precision == "fp16",
+                    f16_input_mask=[False, False],
+                    mlir_type="torchscript",
+                )
+                del model
+                print(f"[DEBUG] generating torch mlir")
+
+                firstVicunaCompileInput = list(firstVicunaCompileInput)
+                firstVicunaCompileInput[0] = torch_mlir.TensorPlaceholder.like(
+                    firstVicunaCompileInput[0], dynamic_axes=[1]
+                )
+                firstVicunaCompileInput = tuple(firstVicunaCompileInput)
+                module = torch_mlir.compile(
+                    ts_graph,
+                    [*firstVicunaCompileInput],
+                    torch_mlir.OutputType.LINALG_ON_TENSORS,
+                    use_tracing=False,
+                    verbose=False,
+                )
+                del ts_graph
+
+                def remove_constant_dim(line):
+                    if "19x" in line:
+                        line = re.sub("19x", "?x", line)
+                        line = re.sub(
+                            "tensor.empty\(\)", "tensor.empty(%dim)", line
+                        )
+                    if "tensor.empty" in line and "?x?" in line:
+                        line = re.sub(
+                            "tensor.empty\(%dim\)",
+                            "tensor.empty(%dim, %dim)",
+                            line,
+                        )
+                    if "arith.cmpi" in line:
+                        line = re.sub("c19", "dim", line)
+                    if " 19," in line:
+                        line = re.sub(" 19,", " %dim,", line)
+                    return line
+
+                module = str(module)
+                new_lines = []
+
+                print(f"[DEBUG] rewriting torch_mlir file")
+                for line in module.splitlines():
+                    line = remove_constant_dim(line)
+                    if "%0 = tensor.empty(%dim) : tensor<?xi64>" in line:
+                        new_lines.append(
+                            "%dim = tensor.dim %arg0, %c1 : tensor<1x?xi64>"
+                        )
+                    if (
                        "%dim = tensor.dim %arg0, %c1 : tensor<1x?xi64>"
-                    )
-                if "%dim = tensor.dim %arg0, %c1 : tensor<1x?xi64>" in line:
-                    continue
+                        in line
+                    ):
+                        continue

-                new_lines.append(line)
+                    new_lines.append(line)

-            module_str = "\n".join(new_lines)
-            bytecode = module_str.encode("UTF-8")
-            bytecode_stream = BytesIO(bytecode)
-            bytecode = bytecode_stream.read()
-            f_ = open(f"{self.model_name}.mlir", "wb")
-            f_.write(bytecode)
-            f_.close()
+                module = "\n".join(new_lines)
+
+                print(f"[DEBUG] converting to bytecode")
+                del new_lines
+                module = module.encode("UTF-8")
+                module = BytesIO(module)
+                bytecode = module.read()
+                del module
+
+                print(f"[DEBUG] writing mlir to file")
+                f_ = open(self.first_vicuna_mlir_path, "wb")
+                f_.write(bytecode)
+                f_.close()

        shark_module = SharkInference(
            mlir_module=bytecode, device=self.device, mlir_dialect="tm_tensor"
        )
-
        path = shark_module.save_module(
-            os.getcwd(),
-            self.model_name,
+            self.first_vicuna_vmfb_path.parent.absolute(),
+            self.first_vicuna_vmfb_path.stem,
            extra_args=[
                "--iree-hal-dump-executable-sources-to=ies",
                "--iree-vm-target-truncate-unsupported-floats",
@@ -142,124 +198,150 @@ class Vicuna(SharkLLMBase):
                "--iree-vm-bytecode-module-output-format=flatbuffer-binary",
            ],
        )
-        print("Saved vmfb at ", str(path))
-        shark_module.load_module(vmfb_path)
+        print("Saved first vic vmfb at vmfb at ", str(path))
+        shark_module.load_module(path)

        return shark_module

    def compile_second_vicuna(self):
-        vmfb_path = Path(self.model_name + ".vmfb")
-        if vmfb_path.exists():
-            shark_module = SharkInference(
-                None, device=self.device, mlir_dialect="tm_tensor"
-            )
-            shark_module.load_module(vmfb_path)
-            # self.shark_module = shark_module
-            return shark_module
-        mlir_path = Path(self.model_name + ".mlir")
-        print(
-            f"[DEBUG] mlir path { mlir_path} {'exists' if mlir_path.exists() else 'does not exist'}"
+        vmfb = get_vmfb_from_path(
+            self.second_vicuna_vmfb_path, self.device, "tm_tensor"
        )
-        if mlir_path.exists():
-            with open(mlir_path, "rb") as f:
+        if vmfb is not None:
+            return vmfb
+
+        # Compilation path needs some more work before it is functional
+        print(
+            f"[DEBUG] mlir path {self.second_vicuna_mlir_path} {'exists' if self.second_vicuna_mlir_path.exists() else 'does not exist'}"
+        )
+        if self.second_vicuna_mlir_path.exists():
+            with open(self.second_vicuna_mlir_path, "rb") as f:
                bytecode = f.read()
        else:
-            compilation_input_ids = torch.zeros([1, 1], dtype=torch.int64)
-            pkv = tuple(
-                (torch.zeros([1, 32, 19, 128], dtype=torch.float32))
-                for _ in range(64)
-            )
-            secondVicunaCompileInput = (compilation_input_ids,) + pkv
-            model = SecondVicuna(self.hf_model_path)
-            ts_graph = get_torch_mlir_module_bytecode(
-                model, secondVicunaCompileInput
-            )
-            secondVicunaCompileInput = list(secondVicunaCompileInput)
-            for i in range(len(secondVicunaCompileInput)):
-                if i != 0:
-                    secondVicunaCompileInput[
-                        i
-                    ] = torch_mlir.TensorPlaceholder.like(
-                        secondVicunaCompileInput[i], dynamic_axes=[2]
+            mlir_generated = False
+            if self.load_mlir_from_shark_tank:
+                if self.precision == "fp32":
+                    # download MLIR from shark_tank for fp32
+                    download_public_file(
+                        "gs://shark_tank/vicuna/unsharded/mlir/second_vicuna.mlir",
+                        self.second_vicuna_mlir_path.absolute(),
+                        single_file=True,
+                    )
+                    if self.second_vicuna_mlir_path.exists():
+                        with open(self.second_vicuna_mlir_path, "rb") as f:
+                            bytecode = f.read()
+                        mlir_generated = True
+                    else:
+                        raise ValueError(
+                            f"MLIR not found at {self.second_vicuna_mlir_path.absolute()}"
+                            " after downloading! Please check path and try again"
+                        )
+                else:
+                    print(
+                        "Only fp32 mlir added to tank, generating mlir on device."
                    )
-            secondVicunaCompileInput = tuple(secondVicunaCompileInput)
-            module = torch_mlir.compile(
-                ts_graph,
-                [*secondVicunaCompileInput],
-                torch_mlir.OutputType.LINALG_ON_TENSORS,
-                use_tracing=False,
-                verbose=False,
-            )

-            def remove_constant_dim(line):
-                if "c19_i64" in line:
-                    line = re.sub("c19_i64", "dim_i64", line)
-                if "19x" in line:
-                    line = re.sub("19x", "?x", line)
-                    line = re.sub(
-                        "tensor.empty\(\)", "tensor.empty(%dim)", line
-                    )
-                if "tensor.empty" in line and "?x?" in line:
-                    line = re.sub(
-                        "tensor.empty\(%dim\)",
-                        "tensor.empty(%dim, %dim)",
-                        line,
-                    )
-                if "arith.cmpi" in line:
-                    line = re.sub("c19", "dim", line)
-                if " 19," in line:
-                    line = re.sub(" 19,", " %dim,", line)
-                if "20x" in line:
-                    line = re.sub("20x", "?x", line)
-                    line = re.sub(
-                        "tensor.empty\(\)", "tensor.empty(%dimp1)", line
-                    )
-                if " 20," in line:
-                    line = re.sub(" 20,", " %dimp1,", line)
-                return line
+            if not mlir_generated:
+                compilation_input_ids = torch.zeros([1, 1], dtype=torch.int64)
+                pkv = tuple(
+                    (torch.zeros([1, 32, 19, 128], dtype=torch.float32))
+                    for _ in range(64)
+                )
+                secondVicunaCompileInput = (compilation_input_ids,) + pkv
+                model = SecondVicuna(self.hf_model_path)
+                ts_graph = import_with_fx(
+                    model,
+                    secondVicunaCompileInput,
+                    is_f16=self.precision == "fp16",
+                    f16_input_mask=[False, False],
+                    mlir_type="torchscript",
+                )
+                secondVicunaCompileInput = list(secondVicunaCompileInput)
+                for i in range(len(secondVicunaCompileInput)):
+                    if i != 0:
+                        secondVicunaCompileInput[
+                            i
+                        ] = torch_mlir.TensorPlaceholder.like(
+                            secondVicunaCompileInput[i], dynamic_axes=[2]
+                        )
+                secondVicunaCompileInput = tuple(secondVicunaCompileInput)
+                module = torch_mlir.compile(
+                    ts_graph,
+                    [*secondVicunaCompileInput],
+                    torch_mlir.OutputType.LINALG_ON_TENSORS,
+                    use_tracing=False,
+                    verbose=False,
+                )

-            module_str = str(module)
-            new_lines = []
+                def remove_constant_dim(line):
+                    if "c19_i64" in line:
+                        line = re.sub("c19_i64", "dim_i64", line)
+                    if "19x" in line:
+                        line = re.sub("19x", "?x", line)
+                        line = re.sub(
+                            "tensor.empty\(\)", "tensor.empty(%dim)", line
+                        )
+                    if "tensor.empty" in line and "?x?" in line:
+                        line = re.sub(
+                            "tensor.empty\(%dim\)",
+                            "tensor.empty(%dim, %dim)",
+                            line,
+                        )
+                    if "arith.cmpi" in line:
+                        line = re.sub("c19", "dim", line)
+                    if " 19," in line:
+                        line = re.sub(" 19,", " %dim,", line)
+                    if "20x" in line:
+                        line = re.sub("20x", "?x", line)
+                        line = re.sub(
+                            "tensor.empty\(\)", "tensor.empty(%dimp1)", line
+                        )
+                    if " 20," in line:
+                        line = re.sub(" 20,", " %dimp1,", line)
+                    return line

-            for line in module_str.splitlines():
-                if "%c19_i64 = arith.constant 19 : i64" in line:
-                    new_lines.append("%c2 = arith.constant 2 : index")
-                    new_lines.append(
-                        "%dim_4_int = tensor.dim %arg1, %c2 : tensor<1x32x?x128xf32>"
-                    )
-                    new_lines.append(
-                        "%dim_i64 = arith.index_cast %dim_4_int : index to i64"
-                    )
-                    continue
-                if "%c2 = arith.constant 2 : index" in line:
-                    continue
-                if "%c20_i64 = arith.constant 20 : i64" in line:
-                    new_lines.append("%c1_i64 = arith.constant 1 : i64")
-                    new_lines.append(
-                        "%c20_i64 = arith.addi %dim_i64, %c1_i64 : i64"
-                    )
-                    new_lines.append(
-                        "%dimp1 = arith.index_cast %c20_i64 : i64 to index"
-                    )
-                    continue
-                line = remove_constant_dim(line)
-                new_lines.append(line)
+                module_str = str(module)
+                new_lines = []

-            module_str = "\n".join(new_lines)
-            bytecode = module_str.encode("UTF-8")
-            bytecode_stream = BytesIO(bytecode)
-            bytecode = bytecode_stream.read()
-            f_ = open(f"{self.model_name}.mlir", "wb")
-            f_.write(bytecode)
-            f_.close()
+                for line in module_str.splitlines():
+                    if "%c19_i64 = arith.constant 19 : i64" in line:
+                        new_lines.append("%c2 = arith.constant 2 : index")
+                        new_lines.append(
+                            "%dim_4_int = tensor.dim %arg1, %c2 : tensor<1x32x?x128xf32>"
+                        )
+                        new_lines.append(
+                            "%dim_i64 = arith.index_cast %dim_4_int : index to i64"
+                        )
+                        continue
+                    if "%c2 = arith.constant 2 : index" in line:
+                        continue
+                    if "%c20_i64 = arith.constant 20 : i64" in line:
+                        new_lines.append("%c1_i64 = arith.constant 1 : i64")
+                        new_lines.append(
+                            "%c20_i64 = arith.addi %dim_i64, %c1_i64 : i64"
+                        )
+                        new_lines.append(
+                            "%dimp1 = arith.index_cast %c20_i64 : i64 to index"
+                        )
+                        continue
+                    line = remove_constant_dim(line)
+                    new_lines.append(line)
+
+                module_str = "\n".join(new_lines)
+                bytecode = module_str.encode("UTF-8")
+                bytecode_stream = BytesIO(bytecode)
+                bytecode = bytecode_stream.read()
+                f_ = open(self.second_vicuna_mlir_path, "wb")
+                f_.write(bytecode)
+                f_.close()

        shark_module = SharkInference(
            mlir_module=bytecode, device=self.device, mlir_dialect="tm_tensor"
        )

        path = shark_module.save_module(
-            os.getcwd(),
-            self.model_name,
+            self.second_vicuna_vmfb_path.parent.absolute(),
+            self.second_vicuna_vmfb_path.stem,
            extra_args=[
                "--iree-hal-dump-executable-sources-to=ies",
                "--iree-vm-target-truncate-unsupported-floats",
@@ -268,13 +350,50 @@ class Vicuna(SharkLLMBase):
            ],
        )
        print("Saved vmfb at ", str(path))
-        shark_module.load_module(vmfb_path)
+        shark_module.load_module(self.second_vicuna_vmfb_path)

        # self.shark_module = shark_module

        return shark_module

    def compile(self):
+        # Cannot load both the models in the memory at once
+        # due to memory constraints, hence on demand compilation
+        # is being used until the space is enough for both models
+
+        # Testing : DO NOT Download Vmfbs if not found. Modify later
+        # download vmfbs for A100
+        if (
+            not self.first_vicuna_vmfb_path.exists()
+            and self.device == "cuda"
+            and self.precision == "fp32"
+        ):
+            download_public_file(
+                "gs://shark_tank/vicuna/unsharded/first_vicuna.vmfb",
+                self.first_vicuna_vmfb_path.absolute(),
+                single_file=True,
+            )
+        else:
+            # get first vic
+            # TODO: Remove after testing to avoid memory overload
+            # fvic_shark_model = self.compile_first_vicuna()
+            pass
+        if (
+            not self.second_vicuna_vmfb_path.exists()
+            and self.device == "cuda"
+            and self.precision == "fp32"
+        ):
+            download_public_file(
+                "gs://shark_tank/vicuna/unsharded/second_vicuna.vmfb",
+                self.second_vicuna_vmfb_path.absolute(),
+                single_file=True,
+            )
+        else:
+            # get second vic
+            # TODO: Remove after testing to avoid memory overload
+            # svic_shark_model = self.compile_second_vicuna()
+            pass
+
        # get first vic
        # fvic_shark_model = self.compile_first_vicuna()
        # get second vic
@@ -282,14 +401,19 @@ class Vicuna(SharkLLMBase):
        # return tuple of shark_modules
        # return fvic_shark_model, svic_shark_model
        return None
+        # return tuple of shark_modules once mem is supported
+        # return fvic_shark_model, svic_shark_model

-    def generate(self, prompt):
+    def generate(self, prompt, cli=False):
        # TODO: refactor for cleaner integration
+        import gc

        res = []
+        res_tokens = []
        params = {
            "prompt": prompt,
            "is_first": True,
+            "fv": self.compile_first_vicuna(),
        }

        generated_token_op = self.generate_new_token(params=params)
@@ -300,21 +424,26 @@ class Vicuna(SharkLLMBase):
        detok = generated_token_op["detok"]

        res.append(detok)
+        res_tokens.append(token)
+        if cli:
+            print(f"Assistant: {detok}", end=" ", flush=True)

+        # Clear First Vic from Memory (main and cuda)
+        del params
+        torch.cuda.empty_cache()
+        gc.collect()
+
+        sec_vic = self.compile_second_vicuna()
        for _ in range(self.max_num_tokens - 2):
-            # t1 = time.time()
            params = {
                "prompt": None,
                "is_first": False,
                "logits": logits,
                "pkv": pkv,
+                "sv": sec_vic,
            }

            generated_token_op = self.generate_new_token(params=params)
-            import gc
-
-            gc.collect()
-            torch.cuda.empty_cache()

            token = generated_token_op["token"]
            logits = generated_token_op["logits"]
@@ -323,14 +452,28 @@ class Vicuna(SharkLLMBase):

            if token == 2:
                break
+            res_tokens.append(token)
            if detok == "<0x0A>":
                res.append("\n")
+                if cli:
+                    print("\n", end="", flush=True)
            else:
                res.append(detok)
+                if cli:
+                    print(f"{detok}", end=" ", flush=True)
+        del sec_vic, pkv, logits
+        torch.cuda.empty_cache()
+        gc.collect()

-        return res
+        for i in range(len(res_tokens)):
+            if type(res_tokens[i]) != int:
+                res_tokens[i] = int(res_tokens[i][0])

-    def generate_new_token(self, params):
+        res_str = self.tokenizer.decode(res_tokens)
+        # print(f"[DEBUG] final output : \n{res_str}")
+        return res_str
+
+    def generate_new_token(self, params, debug=False):
        def forward_first(first_vic, prompt, cache_outputs=False):
            input_ids = self.tokenizer(prompt).input_ids
            input_id_len = len(input_ids)
@@ -380,30 +523,29 @@ class Vicuna(SharkLLMBase):

        if is_first:
            prompt = params["prompt"]
-            fv = self.compile_first_vicuna()
+            fv = params["fv"]
            token, logits, pkv = forward_first(
                fv,  # self.shark_model[0],
                prompt=prompt,
                cache_outputs=False,
            )
-            del fv
        else:
            _logits = params["logits"]
            _pkv = params["pkv"]
            inputs = (_logits,) + tuple(_pkv)
-            sv = self.compile_second_vicuna()
+            sv = params["sv"]
            token, logits, pkv = forward_second(
                sv,  # self.shark_model[1],
                inputs=inputs,
                load_inputs=False,
            )
-            del sv

        detok = self.tokenizer.decode(token)
-        print(
-            f"[DEBUG] is_first: {is_first} |"
-            f" token : {token} | detok : {detok}"
-        )
+        if debug:
+            print(
+                f"[DEBUG] is_first: {is_first} |"
+                f" token : {token} | detok : {detok}"
+            )
        ret_dict = {
            "token": token,
            "logits": logits,
--- a/apps/language_models/src/pipelines/vicuna_sharded_pipeline.py
+++ b/apps/language_models/src/pipelines/vicuna_sharded_pipeline.py
@@ -0,0 +1,408 @@
+from apps.language_models.src.model_wrappers.vicuna_sharded_model import (
+    FirstVicunaLayer,
+    SecondVicunaLayer,
+    CompiledFirstVicunaLayer,
+    CompiledSecondVicunaLayer,
+    ShardedVicunaModel,
+)
+from apps.language_models.src.pipelines.SharkLLMBase import SharkLLMBase
+from shark.shark_importer import import_with_fx
+from io import BytesIO
+from pathlib import Path
+from shark.shark_inference import SharkInference
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from tqdm import tqdm
+from torch_mlir import TensorPlaceholder
+
+
+import re
+import torch
+import torch_mlir
+import os
+
+
+class Vicuna(SharkLLMBase):
+    def __init__(
+        self,
+        model_name,
+        hf_model_path="TheBloke/vicuna-7B-1.1-HF",
+        max_num_tokens=512,
+        device="cuda",
+        precision="fp32",
+    ) -> None:
+        super().__init__(model_name, hf_model_path, max_num_tokens)
+        self.max_sequence_length = 256
+        self.device = device
+        self.precision = precision
+        self.tokenizer = self.get_tokenizer()
+        self.shark_model = self.compile()
+
+    def get_tokenizer(self):
+        tokenizer = AutoTokenizer.from_pretrained(
+            self.hf_model_path, use_fast=False
+        )
+        return tokenizer
+
+    def get_src_model(self):
+        kwargs = {"torch_dtype": torch.float}
+        vicuna_model = AutoModelForCausalLM.from_pretrained(
+            self.hf_model_path, **kwargs
+        )
+        return vicuna_model
+
+    def write_in_dynamic_inputs0(self, module, dynamic_input_size):
+        new_lines = []
+        for line in module.splitlines():
+            line = re.sub(f"{dynamic_input_size}x", "?x", line)
+            if "?x" in line:
+                line = re.sub("tensor.empty\(\)", "tensor.empty(%dim)", line)
+            line = re.sub(f" {dynamic_input_size},", " %dim,", line)
+            if "tensor.empty" in line and "?x?" in line:
+                line = re.sub(
+                    "tensor.empty\(%dim\)", "tensor.empty(%dim, %dim)", line
+                )
+            if "arith.cmpi" in line:
+                line = re.sub(f"c{dynamic_input_size}", "dim", line)
+            new_lines.append(line)
+        new_module = "\n".join(new_lines)
+        return new_module
+
+    def write_in_dynamic_inputs1(self, module, dynamic_input_size):
+        new_lines = []
+        for line in module.splitlines():
+            if "dim_42 =" in line:
+                continue
+            if f"%c{dynamic_input_size}_i64 =" in line:
+                new_lines.append(
+                    "%dim_42 = tensor.dim %arg1, %c3 : tensor<1x1x1x?xf32>"
+                )
+                new_lines.append(
+                    f"%dim_42_i64 = arith.index_cast %dim_42 : index to i64"
+                )
+                continue
+            line = re.sub(f"{dynamic_input_size}x", "?x", line)
+            if "?x" in line:
+                line = re.sub(
+                    "tensor.empty\(\)", "tensor.empty(%dim_42)", line
+                )
+            line = re.sub(f" {dynamic_input_size},", " %dim_42,", line)
+            if "tensor.empty" in line and "?x?" in line:
+                line = re.sub(
+                    "tensor.empty\(%dim_42\)",
+                    "tensor.empty(%dim_42, %dim_42)",
+                    line,
+                )
+            if "arith.cmpi" in line:
+                line = re.sub(f"c{dynamic_input_size}", "dim_42", line)
+            new_lines.append(line)
+        new_module = "\n".join(new_lines)
+        return new_module
+
+    def compile_vicuna_layer(
+        self,
+        vicuna_layer,
+        hidden_states,
+        attention_mask,
+        position_ids,
+        past_key_value0=None,
+        past_key_value1=None,
+    ):
+        if past_key_value0 is None and past_key_value1 is None:
+            model_inputs = (hidden_states, attention_mask, position_ids)
+        else:
+            model_inputs = (
+                hidden_states,
+                attention_mask,
+                position_ids,
+                past_key_value0,
+                past_key_value1,
+            )
+        mlir_bytecode = import_with_fx(
+            vicuna_layer,
+            model_inputs,
+            is_f16=self.precision == "fp16",
+            f16_input_mask=[False, False],
+            mlir_type="torchscript",
+        )
+        return mlir_bytecode
+
+    def compile_to_vmfb(self, inputs, layers, is_first=True):
+        mlirs, modules = [], []
+        for idx, layer in tqdm(enumerate(layers), desc="Getting mlirs"):
+            if is_first:
+                mlir_path = Path(f"{idx}_0.mlir")
+                vmfb_path = Path(f"{idx}_0.vmfb")
+            else:
+                mlir_path = Path(f"{idx}_1.mlir")
+                vmfb_path = Path(f"{idx}_1.vmfb")
+            if vmfb_path.exists():
+                continue
+            if mlir_path.exists():
+                # print(f"Found layer {idx} mlir")
+                f_ = open(mlir_path, "rb")
+                bytecode = f_.read()
+                f_.close()
+            else:
+                hidden_states_placeholder = TensorPlaceholder.like(
+                    inputs[0], dynamic_axes=[1]
+                )
+                attention_mask_placeholder = TensorPlaceholder.like(
+                    inputs[1], dynamic_axes=[3]
+                )
+                position_ids_placeholder = TensorPlaceholder.like(
+                    inputs[2], dynamic_axes=[1]
+                )
+                if not is_first:
+                    pkv0_placeholder = TensorPlaceholder.like(
+                        inputs[3], dynamic_axes=[2]
+                    )
+                    pkv1_placeholder = TensorPlaceholder.like(
+                        inputs[4], dynamic_axes=[2]
+                    )
+                print(f"Compiling layer {idx} mlir")
+                if is_first:
+                    ts_g = self.compile_vicuna_layer(
+                        layer, inputs[0], inputs[1], inputs[2]
+                    )
+                    module = torch_mlir.compile(
+                        ts_g,
+                        (
+                            hidden_states_placeholder,
+                            inputs[1],
+                            inputs[2],
+                        ),
+                        torch_mlir.OutputType.LINALG_ON_TENSORS,
+                        use_tracing=False,
+                        verbose=False,
+                    )
+                else:
+                    ts_g = self.compile_vicuna_layer(
+                        layer,
+                        inputs[0],
+                        inputs[1],
+                        inputs[2],
+                        inputs[3],
+                        inputs[4],
+                    )
+                    module = torch_mlir.compile(
+                        ts_g,
+                        (
+                            inputs[0],
+                            attention_mask_placeholder,
+                            inputs[2],
+                            pkv0_placeholder,
+                            pkv1_placeholder,
+                        ),
+                        torch_mlir.OutputType.LINALG_ON_TENSORS,
+                        use_tracing=False,
+                        verbose=False,
+                    )
+
+                # bytecode_stream = BytesIO()
+                # module.operation.write_bytecode(bytecode_stream)
+                # bytecode = bytecode_stream.getvalue()
+
+                if is_first:
+                    module = self.write_in_dynamic_inputs0(str(module), 137)
+                    bytecode = module.encode("UTF-8")
+                    bytecode_stream = BytesIO(bytecode)
+                    bytecode = bytecode_stream.read()
+
+                else:
+                    module = self.write_in_dynamic_inputs1(str(module), 138)
+                    if idx in [0, 5, 6, 7]:
+                        module_str = module
+                        module_str = module_str.splitlines()
+                        new_lines = []
+                        for line in module_str:
+                            if len(line) < 1000:
+                                new_lines.append(line)
+                            else:
+                                new_lines.append(line[:999])
+                        module_str = "\n".join(new_lines)
+                        f1_ = open(f"{idx}_1_test.mlir", "w+")
+                        f1_.write(module_str)
+                        f1_.close()
+
+                    bytecode = module.encode("UTF-8")
+                    bytecode_stream = BytesIO(bytecode)
+                    bytecode = bytecode_stream.read()
+
+                f_ = open(mlir_path, "wb")
+                f_.write(bytecode)
+                f_.close()
+            mlirs.append(bytecode)
+
+        for idx, layer in tqdm(enumerate(layers), desc="compiling modules"):
+            if is_first:
+                vmfb_path = Path(f"{idx}_0.vmfb")
+                if idx < 25:
+                    device = "cpu"
+                else:
+                    device = "cpu"
+                if vmfb_path.exists():
+                    # print(f"Found layer {idx} vmfb")
+                    module = SharkInference(
+                        None, device=device, mlir_dialect="tm_tensor"
+                    )
+                    module.load_module(vmfb_path)
+                else:
+                    print(f"Compiling layer {idx} vmfb")
+                    module = SharkInference(
+                        mlirs[idx], device=device, mlir_dialect="tm_tensor"
+                    )
+                    module.save_module(
+                        module_name=f"{idx}_0",
+                        extra_args=[
+                            "--iree-hal-dump-executable-sources-to=ies",
+                            "--iree-vm-target-truncate-unsupported-floats",
+                            "--iree-codegen-check-ir-before-llvm-conversion=false",
+                            "--iree-vm-bytecode-module-output-format=flatbuffer-binary",
+                        ],
+                    )
+                    module.load_module(vmfb_path)
+                modules.append(module)
+            else:
+                vmfb_path = Path(f"{idx}_1.vmfb")
+                if idx < 25:
+                    device = "cpu"
+                else:
+                    device = "cpu"
+                if vmfb_path.exists():
+                    # print(f"Found layer {idx} vmfb")
+                    module = SharkInference(
+                        None, device=device, mlir_dialect="tm_tensor"
+                    )
+                    module.load_module(vmfb_path)
+                else:
+                    print(f"Compiling layer {idx} vmfb")
+                    module = SharkInference(
+                        mlirs[idx], device=device, mlir_dialect="tm_tensor"
+                    )
+                    module.save_module(
+                        module_name=f"{idx}_1",
+                        extra_args=[
+                            "--iree-hal-dump-executable-sources-to=ies",
+                            "--iree-vm-target-truncate-unsupported-floats",
+                            "--iree-codegen-check-ir-before-llvm-conversion=false",
+                            "--iree-vm-bytecode-module-output-format=flatbuffer-binary",
+                        ],
+                    )
+                    module.load_module(vmfb_path)
+                modules.append(module)
+
+        return mlirs, modules
+
+    def get_sharded_model(self):
+        # SAMPLE_INPUT_LEN is used for creating mlir with dynamic inputs, which is currently an increadibly hacky proccess
+        # please don't change it
+        SAMPLE_INPUT_LEN = 137
+        vicuna_model = self.get_src_model()
+        placeholder_input0 = (
+            torch.zeros([1, SAMPLE_INPUT_LEN, 4096]),
+            torch.zeros([1, 1, SAMPLE_INPUT_LEN, SAMPLE_INPUT_LEN]),
+            torch.zeros([1, SAMPLE_INPUT_LEN], dtype=torch.int64),
+        )
+
+        placeholder_input1 = (
+            torch.zeros([1, 1, 4096]),
+            torch.zeros([1, 1, 1, SAMPLE_INPUT_LEN + 1]),
+            torch.zeros([1, 1], dtype=torch.int64),
+            torch.zeros([1, 32, SAMPLE_INPUT_LEN, 128]),
+            torch.zeros([1, 32, SAMPLE_INPUT_LEN, 128]),
+        )
+
+        layers0 = [
+            FirstVicunaLayer(layer) for layer in vicuna_model.model.layers
+        ]
+        _, modules0 = self.compile_to_vmfb(
+            placeholder_input0, layers0, is_first=True
+        )
+        shark_layers0 = [CompiledFirstVicunaLayer(m) for m in modules0]
+
+        layers1 = [
+            SecondVicunaLayer(layer) for layer in vicuna_model.model.layers
+        ]
+        _, modules1 = self.compile_to_vmfb(
+            placeholder_input1, layers1, is_first=False
+        )
+        shark_layers1 = [CompiledSecondVicunaLayer(m) for m in modules1]
+
+        sharded_model = ShardedVicunaModel(
+            vicuna_model, shark_layers0, shark_layers1
+        )
+        return sharded_model
+
+    def compile(self):
+        return self.get_sharded_model()
+
+    def generate(self, prompt, cli=False):
+        # TODO: refactor for cleaner integration
+
+        tokens_generated = []
+        _past_key_values = None
+        _token = None
+        detoks_generated = []
+        for iteration in range(self.max_num_tokens):
+            params = {
+                "prompt": prompt,
+                "is_first": iteration == 0,
+                "token": _token,
+                "past_key_values": _past_key_values,
+            }
+
+            generated_token_op = self.generate_new_token(params=params)
+
+            _token = generated_token_op["token"]
+            _past_key_values = generated_token_op["past_key_values"]
+            _detok = generated_token_op["detok"]
+
+            if _token == 2:
+                break
+            detoks_generated.append(_detok)
+            tokens_generated.append(_token)
+
+        for i in range(len(tokens_generated)):
+            if type(tokens_generated[i]) != int:
+                tokens_generated[i] = int(tokens_generated[i][0])
+        result_output = self.tokenizer.decode(tokens_generated)
+        return result_output
+
+    def generate_new_token(self, params):
+        is_first = params["is_first"]
+        if is_first:
+            prompt = params["prompt"]
+            input_ids = self.tokenizer(prompt).input_ids
+            input_id_len = len(input_ids)
+            input_ids = torch.tensor(input_ids)
+            input_ids = input_ids.reshape([1, input_id_len])
+            output = self.shark_model.forward(input_ids, is_first=is_first)
+        else:
+            token = params["token"]
+            past_key_values = params["past_key_values"]
+            input_ids = [token]
+            input_id_len = len(input_ids)
+            input_ids = torch.tensor(input_ids)
+            input_ids = input_ids.reshape([1, input_id_len])
+            output = self.shark_model.forward(
+                input_ids, past_key_values=past_key_values, is_first=is_first
+            )
+
+        _logits = output["logits"]
+        _past_key_values = output["past_key_values"]
+        _token = int(torch.argmax(_logits[:, -1, :], dim=1)[0])
+        _detok = self.tokenizer.decode(_token)
+
+        ret_dict = {
+            "token": _token,
+            "detok": _detok,
+            "past_key_values": _past_key_values,
+        }
+
+        print(f" token : {_token} | detok : {_detok}")
+
+        return ret_dict
+
+    def autocomplete(self, prompt):
+        # use First vic alone to complete a story / prompt / sentence.
+        pass
--- a/apps/language_models/utils.py
+++ b/apps/language_models/utils.py
@@ -5,121 +5,6 @@ from typing import List
 from pathlib import Path


-def get_torch_mlir_module_bytecode(model, model_inputs):
-    fx_g = make_fx(
-        model,
-        decomposition_table=get_decompositions(
-            [
-                torch.ops.aten.embedding_dense_backward,
-                torch.ops.aten.native_layer_norm_backward,
-                torch.ops.aten.slice_backward,
-                torch.ops.aten.select_backward,
-                torch.ops.aten.norm.ScalarOpt_dim,
-                torch.ops.aten.native_group_norm,
-                torch.ops.aten.upsample_bilinear2d.vec,
-                torch.ops.aten.split.Tensor,
-                torch.ops.aten.split_with_sizes,
-            ]
-        ),
-        # tracing_mode='symbolic',
-    )(*model_inputs)
-    print("Got FX_G")
-
-    def _remove_nones(fx_g: torch.fx.GraphModule) -> List[int]:
-        removed_indexes = []
-        for node in fx_g.graph.nodes:
-            if node.op == "output":
-                assert (
-                    len(node.args) == 1
-                ), "Output node must have a single argument"
-                node_arg = node.args[0]
-                if isinstance(node_arg, (list, tuple)):
-                    node_arg = list(node_arg)
-                    node_args_len = len(node_arg)
-                    for i in range(node_args_len):
-                        curr_index = node_args_len - (i + 1)
-                        if node_arg[curr_index] is None:
-                            removed_indexes.append(curr_index)
-                            node_arg.pop(curr_index)
-                    node.args = (tuple(node_arg),)
-                    break
-
-        if len(removed_indexes) > 0:
-            fx_g.graph.lint()
-            fx_g.graph.eliminate_dead_code()
-            fx_g.recompile()
-        removed_indexes.sort()
-        return removed_indexes
-
-    def _unwrap_single_tuple_return(fx_g: torch.fx.GraphModule) -> bool:
-        """
-        Replace tuple with tuple element in functions that return one-element tuples.
-        Returns true if an unwrapping took place, and false otherwise.
-        """
-        unwrapped_tuple = False
-        for node in fx_g.graph.nodes:
-            if node.op == "output":
-                assert (
-                    len(node.args) == 1
-                ), "Output node must have a single argument"
-                node_arg = node.args[0]
-                if isinstance(node_arg, tuple):
-                    if len(node_arg) == 1:
-                        node.args = (node_arg[0],)
-                        unwrapped_tuple = True
-                        break
-
-        if unwrapped_tuple:
-            fx_g.graph.lint()
-            fx_g.recompile()
-        return unwrapped_tuple
-
-    def transform_fx(fx_g):
-        for node in fx_g.graph.nodes:
-            if node.op == "call_function":
-                if node.target in [
-                    torch.ops.aten.empty,
-                ]:
-                    # aten.empty should be filled with zeros.
-                    if node.target in [torch.ops.aten.empty]:
-                        with fx_g.graph.inserting_after(node):
-                            new_node = fx_g.graph.call_function(
-                                torch.ops.aten.zero_,
-                                args=(node,),
-                            )
-                            node.append(new_node)
-                            node.replace_all_uses_with(new_node)
-                            new_node.args = (node,)
-
-        fx_g.graph.lint()
-
-    transform_fx(fx_g)
-    fx_g.recompile()
-    removed_none_indexes = _remove_nones(fx_g)
-    was_unwrapped = _unwrap_single_tuple_return(fx_g)
-
-    fx_g.graph.set_codegen(torch.fx.graph.CodeGen())
-    fx_g.recompile()
-
-    print("FX_G recompile")
-
-    def strip_overloads(gm):
-        """
-        Modifies the target of graph nodes in :attr:`gm` to strip overloads.
-        Args:
-            gm(fx.GraphModule): The input Fx graph module to be modified
-        """
-        for node in gm.graph.nodes:
-            if isinstance(node.target, torch._ops.OpOverload):
-                node.target = node.target.overloadpacket
-        gm.recompile()
-
-    strip_overloads(fx_g)
-    ts_g = torch.jit.script(fx_g)
-    print("Got TS_G")
-    return ts_g
-
-
 # expects a Path / str as arg
 # returns None if path not found or SharkInference module
 def get_vmfb_from_path(vmfb_path, device, mlir_dialect):
--- a/apps/stable_diffusion/src/utils/sd_annotation.py
+++ b/apps/stable_diffusion/src/utils/sd_annotation.py
@@ -125,6 +125,8 @@ def load_lower_configs(base_model_id=None):
                config_name = f"{args.annotation_model}_v2_1_768_{args.precision}_{device}.json"
            else:
                config_name = f"{args.annotation_model}_{version}_{args.precision}_{device}.json"
+        elif spec in ["rdna2"] and version in ["v2_1", "v2_1base", "v1_4"]:
+            config_name = f"{args.annotation_model}_{version}_{args.precision}_{device}_{spec}_{args.width}x{args.height}.json"
        else:
            config_name = f"{args.annotation_model}_{version}_{args.precision}_{device}_{spec}.json"

--- a/apps/stable_diffusion/src/utils/utils.py
+++ b/apps/stable_diffusion/src/utils/utils.py
@@ -337,13 +337,25 @@ def set_init_device_flags():
                "stabilityai/stable-diffusion-2-1",
                "stabilityai/stable-diffusion-2-1-base",
            ]
-            or "rdna3" not in args.iree_vulkan_target_triple
+            or "rdna" not in args.iree_vulkan_target_triple
        )
    ):
        args.use_tuned = False

+    elif "rdna2" in args.iree_vulkan_target_triple and (
+        base_model_id
+        not in [
+            "stabilityai/stable-diffusion-2-1",
+            "stabilityai/stable-diffusion-2-1-base",
+            "CompVis/stable-diffusion-v1-4",
+        ]
+    ):
+        args.use_tuned = False
+
    if args.use_tuned:
-        print(f"Using tuned models for {base_model_id}/fp16/{args.device}.")
+        print(
+            f"Using tuned models for {base_model_id}(fp16) on device {args.device}."
+        )
    else:
        print("Tuned models are currently not supported for this setting.")

--- a/apps/stable_diffusion/web/index.py
+++ b/apps/stable_diffusion/web/index.py
@@ -7,6 +7,8 @@ import apps.stable_diffusion.web.utils.global_obj as global_obj

 if sys.platform == "darwin":
    os.environ["DYLD_LIBRARY_PATH"] = "/usr/local/lib"
+    # import before IREE to avoid MLIR library issues
+    import torch_mlir

 if args.clear_all:
    clear_all()
--- a/apps/stable_diffusion/web/ui/outputgallery_ui.py
+++ b/apps/stable_diffusion/web/ui/outputgallery_ui.py
@@ -9,10 +9,10 @@ from apps.stable_diffusion.src.utils import (
    get_generated_imgs_todays_subdir,
 )
 from apps.stable_diffusion.web.ui.utils import nodlogo_loc
-from apps.stable_diffusion.web.utils.png_metadata import (
-    parse_generation_parameters,
+from apps.stable_diffusion.web.utils.gradio_configs import (
+    gradio_tmp_galleries_folder,
 )
-from apps.stable_diffusion.web.utils.exif_metadata import parse_exif
+from apps.stable_diffusion.web.utils.metadata import displayable_metadata

 # -- Functions for file, directory and image info querying

@@ -32,37 +32,6 @@ def outputgallery_filenames(subdir) -> list[str]:
        return []


-def parameters_for_display(image_filename) -> tuple[str, list[list[str]]]:
-    pil_image = Image.open(image_filename)
-
-    # we have PNG generation parameters
-    if "parameters" in pil_image.info:
-        params = parse_generation_parameters(pil_image.info["parameters"])
-
-        # make showing the sizes more compact by using only one line each
-        if params.keys() & {"Size-1", "Size-2"}:
-            params["Size"] = f"{params.pop('Size-1')}x{params.pop('Size-2')}"
-
-        if params.keys() & {"Hires resize-1", "Hires resize-1"}:
-            hires_x = params.pop("Hires resize-1")
-            hires_y = params.pop("Hires resize-2")
-
-            if hires_x == 0 and hires_y == 0:
-                params["Hires resize"] = "None"
-            else:
-                params["Hires resize"] = f"{hires_x}x{hires_y}"
-
-        return "params", list(map(list, params.items()))
-
-    # we have EXIF data, but no generation parameters we know how to read
-    elif pil_image.getexif():
-        return "exif", list(map(list, parse_exif(pil_image).items()))
-
-    # couldn't find anything
-    else:
-        return None, None
-
-
 def output_subdirs() -> list[str]:
    # Gets a list of subdirectories of output_dir and below, as relative paths.
    relative_paths = [
@@ -77,8 +46,9 @@ def output_subdirs() -> list[str]:
    if get_generated_imgs_todays_subdir() not in relative_paths:
        relative_paths.append(get_generated_imgs_todays_subdir())

-    # sort subdirectories so that that the date named ones we probably created in this or previous sessions
-    # come first, sorted with the most recent first. Other subdirs are listed after.
+    # sort subdirectories so that that the date named ones we probably created in this or
+    # previous sessions come first, sorted with the most recent first. Other subdirs are listed
+    # after.
    generated_paths = sorted(
        [path for path in relative_paths if path.isnumeric()], reverse=True
    )
@@ -93,6 +63,19 @@ def output_subdirs() -> list[str]:
    return result_paths


+# clear zero length temporary files that gradio 3.22.0 buggily creates
+# TODO: remove once gradio is upgraded to or past 3.32.0
+def clear_zero_length_temps():
+    zero_length_temps = [
+        os.path.join(root, file)
+        for root, dirs, files in os.walk(gradio_tmp_galleries_folder)
+        for file in files
+        if os.path.getsize(os.path.join(root, file)) == 0
+    ]
+    for file in zero_length_temps:
+        os.remove(file)
+
+
 # --- Define UI layout for Gradio

 with gr.Blocks() as outputgallery_web:
@@ -122,6 +105,7 @@ with gr.Blocks() as outputgallery_web:
                visible=False,
                show_label=True,
            ).style(grid=4)
+            gallery.DEFAULT_TEMP_DIR = gradio_tmp_galleries_folder

        with gr.Column(scale=4):
            with gr.Box():
@@ -195,6 +179,7 @@ with gr.Blocks() as outputgallery_web:
    # --- Event handlers

    def on_clear_gallery():
+        clear_zero_length_temps()
        return [
            gr.Gallery.update(
                value=[],
@@ -262,6 +247,7 @@ with gr.Blocks() as outputgallery_web:

        # only update if the current subdir is the most recent one as new images only go there
        if subdir_paths[0] == subdir:
+            clear_zero_length_temps()
            new_images = outputgallery_filenames(subdir)
            new_label = f"{len(new_images)} images in {os.path.join(output_dir, subdir)} - {status}"

@@ -284,20 +270,18 @@ with gr.Blocks() as outputgallery_web:
    def on_select_image(images: list[str], evt: gr.SelectData) -> list:
        # evt.index is an index into the full list of filenames for the current subdirectory
        filename = images[evt.index]
+        params = displayable_metadata(filename)

-        # this gets the parameters in the form our dataframe is expecting (list of lists)
-        params_type, params = parameters_for_display(filename)
+        if params:
+            return [
+                filename,
+                list(map(list, params["parameters"].items())),
+            ]

-        if params_type == "params":
-            new_parameters = params
-        elif params_type == "exif":
-            new_parameters = [
-                ["Status", "No PNG parameters found, showing EXIF metadata"]
-            ] + params
-        else:
-            new_parameters = [["Status", "No parameters found"]]
-
-        return [filename, new_parameters]
+        return [
+            filename,
+            [["Status", "No parameters found"]],
+        ]

    def on_outputgallery_filename_change(filename: str) -> list:
        exists = filename != "None" and os.path.exists(filename)
--- a/apps/stable_diffusion/web/ui/stablelm_ui.py
+++ b/apps/stable_diffusion/web/ui/stablelm_ui.py
@@ -22,26 +22,37 @@ def user(message, history):

 sharkModel = 0
 sharded_model = 0
+vicuna_model = 0


 start_message_vicuna = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
 past_key_values = None


-def chat(curr_system_message, history, model):
+def chat(curr_system_message, history, model, device, precision):
    print(f"In chat for {model}")
    global sharded_model
    global past_key_values
+    global vicuna_model
    if "vicuna" in model:
-        from apps.language_models.scripts.vicuna import (
-            tokenizer,
-            get_sharded_model,
+        from apps.language_models.src.pipelines.vicuna_pipeline import (
+            Vicuna,
        )

-        SAMPLE_INPUT_LEN = 137
        curr_system_message = start_message_vicuna
-        if sharded_model == 0:
-            sharded_model = get_sharded_model()
+        if vicuna_model == 0:
+            first_vic_vmfb_path = Path("first_vicuna.vmfb")
+            second_vic_vmfb_path = Path("second_vicuna.vmfb")
+            if "cuda" in device:
+                device = "cuda"
+            vicuna_model = Vicuna(
+                "vicuna",
+                hf_model_path=model,
+                device=device,
+                precision=precision,
+                first_vicuna_vmfb_path=first_vic_vmfb_path,
+                second_vicuna_vmfb_path=second_vic_vmfb_path,
+            )
        messages = curr_system_message + "".join(
            [
                "".join(["<|USER|>" + item[0], "<|ASSISTANT|>" + item[1]])
@@ -50,40 +61,16 @@ def chat(curr_system_message, history, model):
        )
        prompt = messages.strip()
        print("prompt = ", prompt)
-        input_ids = tokenizer(prompt).input_ids
-        new_sentence = ""
-        for _ in range(200):
-            original_input_ids = input_ids
-            input_id_len = len(input_ids)
-            pad_len = SAMPLE_INPUT_LEN - input_id_len
-            attention_mask = torch.ones([1, input_id_len], dtype=torch.int64)
-            input_ids = torch.tensor(input_ids)
-            input_ids = input_ids.reshape([1, input_id_len])
-            attention_mask = torch.nn.functional.pad(
-                torch.tensor(attention_mask),
-                (0, pad_len),
-                mode="constant",
-                value=0,
-            )
+        sentence = vicuna_model.generate(prompt)

-            if _ == 0:
-                output = sharded_model.forward(input_ids, is_first=True)
-            else:
-                output = sharded_model.forward(
-                    input_ids, past_key_values=past_key_values, is_first=False
-                )
-            logits = output["logits"]
-            past_key_values = output["past_key_values"]
-            new_word = tokenizer.decode(torch.argmax(logits[:, -1, :], dim=1))
-            if new_word == "</s>":
-                break
-            new_sentence += " " + new_word
-            history[-1][1] = new_sentence
+        partial_text = ""
+        for new_text in sentence.split(" "):
+            # print(new_text)
+            partial_text += new_text + " "
+            history[-1][1] = partial_text
+            # Yield an empty string to cleanup the message textbox and the updated conversation history
            yield history
-            next_token = torch.argmax(logits[:, input_id_len - 1, :], dim=1)
-            original_input_ids.append(next_token)
-            input_ids = [next_token]
-        print(new_sentence)
+        history[-1][1] = sentence
        return history

    # else Model is StableLM
@@ -133,17 +120,26 @@ with gr.Blocks(title="Chatbot") as stablelm_chat:
                "TheBloke/vicuna-7B-1.1-HF",
            ],
        )
-        device_value = None
-        for d in available_devices:
-            if "vulkan" in d:
-                device_value = d
-                break
-
+        supported_devices = [
+            device for device in available_devices if "cuda" in device
+        ]
+        enabled = len(supported_devices) > 0
        device = gr.Dropdown(
            label="Device",
-            value=device_value if device_value else available_devices[0],
-            interactive=False,
-            choices=available_devices,
+            value=supported_devices[0]
+            if enabled
+            else "Only CUDA Supported for now",
+            choices=supported_devices,
+            interactive=enabled,
+        )
+        precision = gr.Radio(
+            label="Precision",
+            value="fp32",
+            choices=[
+                "fp16",
+                "fp32",
+            ],
+            visible=True,
        )
    chatbot = gr.Chatbot().style(height=500)
    with gr.Row():
@@ -152,12 +148,13 @@ with gr.Blocks(title="Chatbot") as stablelm_chat:
                label="Chat Message Box",
                placeholder="Chat Message Box",
                show_label=False,
+                interactive=enabled,
            ).style(container=False)
        with gr.Column():
            with gr.Row():
-                submit = gr.Button("Submit")
-                stop = gr.Button("Stop")
-                clear = gr.Button("Clear")
+                submit = gr.Button("Submit", interactive=enabled)
+                stop = gr.Button("Stop", interactive=enabled)
+                clear = gr.Button("Clear", interactive=enabled)
    system_msg = gr.Textbox(
        start_message, label="System Message", interactive=False, visible=False
    )
@@ -166,7 +163,7 @@ with gr.Blocks(title="Chatbot") as stablelm_chat:
        fn=user, inputs=[msg, chatbot], outputs=[msg, chatbot], queue=False
    ).then(
        fn=chat,
-        inputs=[system_msg, chatbot, model],
+        inputs=[system_msg, chatbot, model, device, precision],
        outputs=[chatbot],
        queue=True,
    )
@@ -174,7 +171,7 @@ with gr.Blocks(title="Chatbot") as stablelm_chat:
        fn=user, inputs=[msg, chatbot], outputs=[msg, chatbot], queue=False
    ).then(
        fn=chat,
-        inputs=[system_msg, chatbot, model],
+        inputs=[system_msg, chatbot, model, device, precision],
        outputs=[chatbot],
        queue=True,
    )
--- a/apps/stable_diffusion/web/ui/txt2img_ui.py
+++ b/apps/stable_diffusion/web/ui/txt2img_ui.py
@@ -16,7 +16,7 @@ from apps.stable_diffusion.web.ui.utils import (
    predefined_models,
    cancel_sd,
 )
-from apps.stable_diffusion.web.utils.png_metadata import import_png_metadata
+from apps.stable_diffusion.web.utils.metadata import import_png_metadata
 from apps.stable_diffusion.web.utils.common_label_calc import status_label
 from apps.stable_diffusion.src import (
    args,
--- a/apps/stable_diffusion/web/utils/gradio_configs.py
+++ b/apps/stable_diffusion/web/utils/gradio_configs.py
@@ -1,19 +1,48 @@
 import os
+import shutil
 import tempfile
 import gradio
-from os import listdir
+from time import time

 gradio_tmp_imgs_folder = os.path.join(os.getcwd(), "shark_tmp/")
+gradio_tmp_galleries_folder = os.path.join(gradio_tmp_imgs_folder, "galleries")


 # Clear all gradio tmp images
 def clear_gradio_tmp_imgs_folder():
    if not os.path.exists(gradio_tmp_imgs_folder):
        return
-    for fileName in listdir(gradio_tmp_imgs_folder):
-        # Delete tmp png files
-        if fileName.startswith("tmp") and fileName.endswith(".png"):
-            os.remove(gradio_tmp_imgs_folder + fileName)
+
+    # clear all gradio tmp files created by generation galleries
+    print(
+        "Clearing gradio temporary image files from a prior run. This may take some time..."
+    )
+    image_files = [
+        filename
+        for filename in os.listdir(gradio_tmp_imgs_folder)
+        if os.path.isfile(os.path.join(gradio_tmp_imgs_folder, filename))
+        and filename.startswith("tmp")
+        and filename.endswith(".png")
+    ]
+    if len(image_files) > 0:
+        cleanup_start = time()
+        for filename in image_files:
+            os.remove(gradio_tmp_imgs_folder + filename)
+        print(
+            f"Clearing generation temporary image files took {time() - cleanup_start:4f} seconds"
+        )
+    else:
+        print("no generation temporary files to clear")
+
+    # Clear all gradio tmp files created by output galleries
+    if os.path.exists(gradio_tmp_galleries_folder):
+        cleanup_start = time()
+        shutil.rmtree(gradio_tmp_galleries_folder, ignore_errors=True)
+        print(
+            f"Clearing output gallery temporary image files took {time() - cleanup_start:4f} seconds"
+        )
+    else:
+        print("no output gallery temporary files to clear")


 # Overwrite save_pil_to_file from gradio to save tmp images generated by gradio into our own tmp folder
--- a/apps/stable_diffusion/web/utils/metadata/init.py
+++ b/apps/stable_diffusion/web/utils/metadata/init.py
@@ -0,0 +1,6 @@
+from .png_metadata import (
+    import_png_metadata,
+)
+from .display import (
+    displayable_metadata,
+)
--- a/apps/stable_diffusion/web/utils/metadata/csv_metadata.py
+++ b/apps/stable_diffusion/web/utils/metadata/csv_metadata.py
@@ -0,0 +1,31 @@
+import csv
+import os
+from .format import humanize, humanizable
+
+
+def csv_path(image_filename: str):
+    return os.path.join(os.path.dirname(image_filename), "imgs_details.csv")
+
+
+def has_csv(image_filename: str) -> bool:
+    return os.path.exists(csv_path(image_filename))
+
+
+def parse_csv(image_filename: str):
+    # We use a reader instead of a DictReader here for images_details.csv files due to the lack of
+    # headers, and then match up the return list for each row with our guess at which column format
+    # the file is using.
+
+    # we assume the final column of the csv has the original filename with full path and match that
+    # against the image_filename. We then exclude the filename from the output, hence the -1's.
+    csv_filename = csv_path(image_filename)
+
+    matches = [
+        humanize(row)
+        for row in csv.reader(open(csv_filename, "r", newline=""))
+        if row
+        and humanizable(row)
+        and os.path.basename(image_filename) in row[-1]
+    ]
+
+    return matches[0] if matches else {}
--- a/apps/stable_diffusion/web/utils/metadata/display.py
+++ b/apps/stable_diffusion/web/utils/metadata/display.py
@@ -0,0 +1,50 @@
+import json
+import os
+from PIL import Image
+from .png_metadata import parse_generation_parameters
+from .exif_metadata import has_exif, parse_exif
+from .csv_metadata import has_csv, parse_csv
+from .format import compact, humanize
+
+
+def displayable_metadata(image_filename: str) -> dict:
+    pil_image = Image.open(image_filename)
+
+    # we have PNG generation parameters (preferred, as it's what the txt2img dropzone reads,
+    # and we go via that for SendTo, and is directly tied to the image)
+    if "parameters" in pil_image.info:
+        return {
+            "source": "png",
+            "parameters": compact(
+                parse_generation_parameters(pil_image.info["parameters"])
+            ),
+        }
+
+    # we have a matching json file (next most likely to be accurate when it's there)
+    json_path = os.path.splitext(image_filename)[0] + ".json"
+    if os.path.isfile(json_path):
+        with open(json_path) as params_file:
+            return {
+                "source": "json",
+                "parameters": compact(
+                    humanize(json.load(params_file), includes_filename=False)
+                ),
+            }
+
+    # we have a CSV file so try that (can be different shapes, and it usually has no
+    # headers/param names so of the things we we *know* have parameters, it's the
+    # last resort)
+    if has_csv(image_filename):
+        params = parse_csv(image_filename)
+        if params:  # we might not have found the filename in the csv
+            return {
+                "source": "csv",
+                "parameters": compact(params),  # already humanized
+            }
+
+    # EXIF data, probably a .jpeg, may well not include parameters, but at least it's *something*
+    if has_exif(image_filename):
+        return {"source": "exif", "parameters": parse_exif(pil_image)}
+
+    # we've got nothing
+    return None
--- a/apps/stable_diffusion/web/utils/metadata/exif_metadata.py
+++ b/apps/stable_diffusion/web/utils/metadata/exif_metadata.py
@@ -2,6 +2,10 @@ from PIL import Image
 from PIL.ExifTags import Base as EXIFKeys, TAGS, IFD, GPSTAGS


+def has_exif(image_filename: str) -> bool:
+    return True if Image.open(image_filename).getexif() else False
+
+
 def parse_exif(pil_image: Image) -> dict:
    img_exif = pil_image.getexif()

--- a/apps/stable_diffusion/web/utils/metadata/format.py
+++ b/apps/stable_diffusion/web/utils/metadata/format.py
@@ -0,0 +1,115 @@
+# As SHARK has evolved more columns have been added to images_details.csv. However, since
+# no version of the CSV has any headers (yet) we don't actually have anything within the
+# file that tells us which parameter each column is for. So this is a list of known patterns
+# indexed by length which is what we're going to have to use to guess which columns are the
+# right ones for the file we're looking at.
+
+# The same ordering is used for JSON, but these do have key names, however they are not very
+# human friendly, nor do they match up with the what is written to the .png headers
+
+# So these are functions to try and get something consistent out the raw input from all
+# these sources
+
+PARAMS_FORMATS = {
+    9: {
+        "VARIANT": "Model",
+        "SCHEDULER": "Sampler",
+        "PROMPT": "Prompt",
+        "NEG_PROMPT": "Negative prompt",
+        "SEED": "Seed",
+        "CFG_SCALE": "CFG scale",
+        "PRECISION": "Precision",
+        "STEPS": "Steps",
+        "OUTPUT": "Filename",
+    },
+    10: {
+        "MODEL": "Model",
+        "VARIANT": "Variant",
+        "SCHEDULER": "Sampler",
+        "PROMPT": "Prompt",
+        "NEG_PROMPT": "Negative prompt",
+        "SEED": "Seed",
+        "CFG_SCALE": "CFG scale",
+        "PRECISION": "Precision",
+        "STEPS": "Steps",
+        "OUTPUT": "Filename",
+    },
+    12: {
+        "VARIANT": "Model",
+        "SCHEDULER": "Sampler",
+        "PROMPT": "Prompt",
+        "NEG_PROMPT": "Negative prompt",
+        "SEED": "Seed",
+        "CFG_SCALE": "CFG scale",
+        "PRECISION": "Precision",
+        "STEPS": "Steps",
+        "HEIGHT": "Height",
+        "WIDTH": "Width",
+        "MAX_LENGTH": "Max Length",
+        "OUTPUT": "Filename",
+    },
+}
+
+PARAMS_FORMAT_LONGEST = PARAMS_FORMATS[max(PARAMS_FORMATS.keys())]
+
+
+def compact(metadata: dict) -> dict:
+    # we don't want to alter the original dictionary
+    result = dict(metadata)
+
+    # discard the filename because we should already have it
+    if result.keys() & {"Filename"}:
+        result.pop("Filename")
+
+    # make showing the sizes more compact by using only one line each
+    if result.keys() & {"Size-1", "Size-2"}:
+        result["Size"] = f"{result.pop('Size-1')}x{result.pop('Size-2')}"
+    elif result.keys() & {"Height", "Width"}:
+        result["Size"] = f"{result.pop('Height')}x{result.pop('Width')}"
+
+    if result.keys() & {"Hires resize-1", "Hires resize-1"}:
+        hires_y = result.pop("Hires resize-1")
+        hires_x = result.pop("Hires resize-2")
+
+        if hires_x == 0 and hires_y == 0:
+            result["Hires resize"] = "None"
+        else:
+            result["Hires resize"] = f"{hires_y}x{hires_x}"
+
+    return result
+
+
+def humanizable(metadata: dict | list[str], includes_filename=True) -> dict:
+    lookup_key = len(metadata) + (0 if includes_filename else 1)
+    return lookup_key in PARAMS_FORMATS.keys()
+
+
+def humanize(metadata: dict | list[str], includes_filename=True) -> dict:
+    lookup_key = len(metadata) + (0 if includes_filename else 1)
+
+    # For lists we can only work based on the length, we have no other information
+    if isinstance(metadata, list):
+        if humanizable(metadata, includes_filename):
+            return dict(zip(PARAMS_FORMATS[lookup_key].values(), metadata))
+        else:
+            raise KeyError(
+                f"Humanize could not find the format for a parameter list of length {len(metadata)}"
+            )
+
+    # For dictionaries we try to use the matching length parameter format if
+    # available, otherwise we use the longest. Then we swap keys in the
+    # metadata that match keys in the format for the friendlier name that we
+    # have set in the format value
+    if isinstance(metadata, dict):
+        if humanizable(metadata, includes_filename):
+            format = PARAMS_FORMATS[lookup_key]
+        else:
+            format = PARAMS_FORMAT_LONGEST
+
+        return {
+            format[key]: value
+            for (key, value) in metadata.items()
+            if key in format.keys()
+        }
+
+    raise TypeError("Can only humanize parameter lists or dictionaries")
--- a/apps/stable_diffusion/web/utils/metadata/png_metadata.py
+++ b/apps/stable_diffusion/web/utils/metadata/png_metadata.py
--- a/dataset/requirements.txt
+++ b/dataset/requirements.txt
@@ -1,3 +1,3 @@
 # SHARK Annotator
-gradio==3.15.0
+gradio==3.34.0
 jsonlines
--- a/requirements.txt
+++ b/requirements.txt
@@ -19,7 +19,7 @@ transformers
 diffusers @ git+https://github.com/huggingface/diffusers@e47459c80f6f6a5a1c19d32c3fd74edf94f47aa2
 scipy
 ftfy
-gradio==3.22.0
+gradio==3.34.0
 altair
 omegaconf
 safetensors
--- a/shark/examples/shark_inference/resnet50_script.py
+++ b/shark/examples/shark_inference/resnet50_script.py
@@ -70,11 +70,11 @@ mlir_model, func_name, inputs, golden_out = download_model(
    "resnet50", frontend="torch"
 )

-shark_module = SharkInference(mlir_model, func_name, mlir_dialect="linalg")
+shark_module = SharkInference(mlir_model, mlir_dialect="linalg")
 shark_module.compile()
 path = shark_module.save_module()
 shark_module.load_module(path)
-result = shark_module.forward((img.detach().numpy(),))
+result = shark_module("forward", (img.detach().numpy(),))

 print("The top 3 results obtained via shark_runner is:")
 print(top3_possibilities(torch.from_numpy(result)))
--- a/shark/iree_utils/_common.py
+++ b/shark/iree_utils/_common.py
@@ -62,6 +62,8 @@ def get_supported_device_list():

 _IREE_DEVICE_MAP = {
    "cpu": "local-task",
+    "cpu-task": "local-task",
+    "cpu-sync": "local-sync",
    "cuda": "cuda",
    "vulkan": "vulkan",
    "metal": "vulkan",
@@ -78,6 +80,8 @@ def iree_target_map(device):

 _IREE_TARGET_MAP = {
    "cpu": "llvm-cpu",
+    "cpu-task": "llvm-cpu",
+    "cpu-sync": "llvm-cpu",
    "cuda": "cuda",
    "vulkan": "vulkan",
    "metal": "vulkan",
--- a/shark/iree_utils/compile_utils.py
+++ b/shark/iree_utils/compile_utils.py
@@ -63,7 +63,6 @@ def get_iree_frontend_args(frontend):
    elif frontend in ["tensorflow", "tf", "mhlo", "stablehlo"]:
        return [
            "--iree-llvmcpu-target-cpu-features=host",
-            "--iree-mhlo-demote-i64-to-i32=false",
            "--iree-flow-demote-i64-to-i32",
        ]
    else:
--- a/shark/shark_generate_model_config.py
+++ b/shark/shark_generate_model_config.py
@@ -0,0 +1,38 @@
+import json
+from collections import OrderedDict
+
+
+class GenerateConfigFile:
+    def __init__(
+        self,
+        model,
+        num_sharding_stages: int,
+        sharding_stages_id: list[str] = None,
+    ):
+        self.model = model
+        self.num_sharding_stages = num_sharding_stages
+        self.sharding_stages_id = sharding_stages_id
+        assert self.num_sharding_stages == len(
+            self.sharding_stages_id
+        ), "Number of sharding stages should be equal to the list of their ID"
+
+    def generate_json(self):
+        model_dictionary = dict()
+
+        for name, m in self.model.named_modules():
+            if name == "":
+                continue
+
+            # Remove non-leaf nodes from the config as they aren't an operation
+            substring_before_final_period = name.split(".")[:-1]
+            substring_before_final_period = ".".join(
+                substring_before_final_period
+            )
+            if substring_before_final_period in model_dictionary:
+                del model_dictionary[substring_before_final_period]
+
+            layer_dict = {n: "None" for n in self.sharding_stages_id}
+            model_dictionary[name] = layer_dict
+
+        with open("model_config.json", "w") as outfile:
+            json.dump(model_dictionary, outfile)
--- a/shark/shark_importer.py
+++ b/shark/shark_importer.py
@@ -312,6 +312,47 @@ def get_f16_inputs(inputs, is_f16, f16_input_mask):
    return tuple(f16_masked_inputs)


+# Upcasts the block/list of ops.
+def add_upcast(fx_g):
+    import torch
+
+    for node in fx_g.graph.nodes:
+        if node.target in [torch.ops.aten.mul]:
+            # This is a very strict check.
+            if hasattr(node.args[1], "target"):
+                if (
+                    node.args[1].target in [torch.ops.aten.rsqrt]
+                    and node.args[1].args[0].target in [torch.ops.aten.add]
+                    and node.args[1].args[0].args[0].target
+                    in [torch.ops.aten.mean]
+                    and node.args[1].args[0].args[0].args[0].target
+                    in [torch.ops.aten.pow]
+                ):
+                    print("found an upcasting block let's upcast it.")
+                    pow_node = node.args[1].args[0].args[0].args[0]
+                    mul_node = node
+                    with fx_g.graph.inserting_before(pow_node):
+                        lhs = pow_node.args[0]
+                        upcast_lhs = fx_g.graph.call_function(
+                            torch.ops.aten._to_copy,
+                            args=(lhs,),
+                            kwargs={"dtype": torch.float32},
+                        )
+                        pow_node.args = (upcast_lhs, pow_node.args[1])
+                    with fx_g.graph.inserting_before(mul_node):
+                        new_node = fx_g.graph.call_function(
+                            torch.ops.aten._to_copy,
+                            args=(mul_node,),
+                            kwargs={"dtype": torch.float16},
+                        )
+                        mul_node.append(new_node)
+                        mul_node.replace_all_uses_with(new_node)
+                        new_node.args = (mul_node,)
+                        new_node.kwargs = {"dtype": torch.float16}
+
+    fx_g.graph.lint()
+
+
 def transform_fx(fx_g):
    import torch

@@ -340,6 +381,28 @@ def transform_fx(fx_g):
                if node.kwargs.get("dtype") == torch.float32:
                    node.kwargs = kwargs_dict1

+            if node.target in [
+                torch.ops.aten.masked_fill,
+            ]:
+                if node.args[2] > torch.finfo(torch.half).max:
+                    max_val = torch.finfo(torch.half).max
+                    node.args = (node.args[0], node.args[1], max_val)
+                elif node.args[2] < torch.finfo(torch.half).min:
+                    min_val = torch.finfo(torch.half).min
+                    node.args = (node.args[0], node.args[1], min_val)
+
+            if node.target in [
+                torch.ops.aten.full,
+            ]:
+                if node.args[1] > torch.finfo(torch.half).max:
+                    max_val = torch.finfo(torch.half).max
+                    node.args = (node.args[0], max_val)
+                    node.kwargs = kwargs_dict
+                elif node.args[1] < torch.finfo(torch.half).min:
+                    min_val = torch.finfo(torch.half).min
+                    node.args = (node.args[0], min_val)
+                    node.kwargs = kwargs_dict
+
            # Inputs and outputs of aten.var.mean should be upcasted to fp32.
            if node.target in [torch.ops.aten.var_mean]:
                with fx_g.graph.inserting_before(node):
@@ -363,18 +426,6 @@ def transform_fx(fx_g):
                        new_node.args = (node,)
                        new_node.kwargs = {"dtype": torch.float16}

-            # Change the default dtype of aten.full op. (Vicuna)
-            if node.target in [torch.ops.aten.full]:
-                new_node = fx_g.graph.call_function(
-                    torch.ops.aten._to_copy,
-                    args=(node,),
-                    kwargs={"dtype": torch.float16},
-                )
-                node.append(new_node)
-                node.replace_all_uses_with(new_node)
-                new_node.args = (node,)
-                new_node.kwargs = {"dtype": torch.float16}
-
            # aten.empty should be filled with zeros.
            if node.target in [torch.ops.aten.empty]:
                with fx_g.graph.inserting_after(node):
@@ -386,6 +437,14 @@ def transform_fx(fx_g):
                    node.replace_all_uses_with(new_node)
                    new_node.args = (node,)

+    # Required for cuda debugging.
+    # for node in fx_g.graph.nodes:
+    # if node.op == "call_function":
+    # if node.kwargs.get("device") == torch.device(type="cpu"):
+    # new_kwargs = node.kwargs.copy()
+    # new_kwargs["device"] = torch.device(type="cuda")
+    # node.kwargs = new_kwargs
+
    fx_g.graph.lint()


@@ -489,6 +548,8 @@ def import_with_fx(
    if is_f16:
        fx_g = fx_g.half()
        transform_fx(fx_g)
+        # TODO: Have to make it more generic.
+        add_upcast(fx_g)
        fx_g.recompile()

    if training:
@@ -496,6 +557,9 @@ def import_with_fx(
        inputs = flatten_training_input(inputs)

    ts_graph = torch.jit.script(fx_g)
+    if mlir_type == "torchscript":
+        return ts_graph
+
    inputs = get_f16_inputs(inputs, is_f16, f16_input_mask)
    mlir_importer = SharkImporter(
        ts_graph,
--- a/shark/shark_runner.py
+++ b/shark/shark_runner.py
@@ -27,7 +27,7 @@ import sys
 # supported dialects by the shark-runtime.
 supported_dialects = {
    "linalg",
-    "mhlo",
+    "auto",
    "stablehlo",
    "tosa",
    "tf-lite",
--- a/tank/all_models.csv
+++ b/tank/all_models.csv
@@ -1,47 +1,47 @@
-resnet50,mhlo,tf,1e-2,1e-3,default,nhcw-nhwc,False,False,False,"","macos"
-albert-base-v2,mhlo,tf,1e-2,1e-2,default,None,False,False,False,"",""
-roberta-base,mhlo,tf,1e-02,1e-3,default,nhcw-nhwc,True,True,True,"","macos"
-bert-base-uncased,mhlo,tf,1e-2,1e-3,default,None,False,False,False,"","enabled_windows"
-camembert-base,mhlo,tf,1e-2,1e-3,default,None,True,True,True,"",""
-dbmdz/convbert-base-turkish-cased,mhlo,tf,1e-2,1e-3,default,nhcw-nhwc,True,True,False,"https://github.com/iree-org/iree/issues/9971",""
-distilbert-base-uncased,mhlo,tf,1e-2,1e-3,default,None,False,False,False,"",""
-facebook/convnext-tiny-224,mhlo,tf,1e-2,1e-3,tf_vit,nhcw-nhwc,True,True,False,"https://github.com/nod-ai/SHARK/issues/311 & https://github.com/nod-ai/SHARK/issues/342","macos"
-funnel-transformer/small,mhlo,tf,1e-2,1e-3,default,None,True,True,False,"https://github.com/nod-ai/SHARK/issues/201",""
-google/electra-small-discriminator,mhlo,tf,1e-2,1e-3,default,None,False,False,False,"",""
-google/mobilebert-uncased,mhlo,tf,1e-2,1e-3,default,None,True,False,False,"Fails during iree-compile",""
-google/vit-base-patch16-224,mhlo,tf,1e-2,1e-3,tf_vit,nhcw-nhwc,False,False,False,"",""
-microsoft/MiniLM-L12-H384-uncased,mhlo,tf,1e-2,1e-3,tf_hf,None,True,False,False,"Fails during iree-compile.",""
-microsoft/layoutlm-base-uncased,mhlo,tf,1e-2,1e-3,default,None,False,False,False,"",""
-microsoft/mpnet-base,mhlo,tf,1e-2,1e-2,default,None,True,True,True,"",""
+resnet50,stablehlo,tf,1e-2,1e-3,default,nhcw-nhwc,False,False,False,"","macos"
+albert-base-v2,stablehlo,tf,1e-2,1e-2,default,None,False,False,False,"",""
+roberta-base,stablehlo,tf,1e-02,1e-3,default,nhcw-nhwc,True,True,True,"","macos"
+bert-base-uncased,stablehlo,tf,1e-2,1e-3,default,None,False,False,False,"","enabled_windows"
+camembert-base,stablehlo,tf,1e-2,1e-3,default,None,True,True,True,"",""
+dbmdz/convbert-base-turkish-cased,stablehlo,tf,1e-2,1e-3,default,nhcw-nhwc,True,True,False,"https://github.com/iree-org/iree/issues/9971",""
+distilbert-base-uncased,stablehlo,tf,1e-2,1e-3,default,None,False,False,False,"",""
+facebook/convnext-tiny-224,stablehlo,tf,1e-2,1e-3,tf_vit,nhcw-nhwc,True,True,False,"https://github.com/nod-ai/SHARK/issues/311 & https://github.com/nod-ai/SHARK/issues/342","macos"
+funnel-transformer/small,stablehlo,tf,1e-2,1e-3,default,None,True,True,False,"https://github.com/nod-ai/SHARK/issues/201",""
+google/electra-small-discriminator,stablehlo,tf,1e-2,1e-3,default,None,False,False,False,"",""
+google/mobilebert-uncased,stablehlo,tf,1e-2,1e-3,default,None,True,False,False,"Fails during iree-compile","macos"
+google/vit-base-patch16-224,stablehlo,tf,1e-2,1e-3,tf_vit,nhcw-nhwc,False,False,False,"",""
+microsoft/MiniLM-L12-H384-uncased,stablehlo,tf,1e-2,1e-3,tf_hf,None,True,False,False,"Fails during iree-compile.",""
+microsoft/layoutlm-base-uncased,stablehlo,tf,1e-2,1e-3,default,None,False,False,False,"",""
+microsoft/mpnet-base,stablehlo,tf,1e-2,1e-2,default,None,True,True,True,"",""
 albert-base-v2,linalg,torch,1e-2,1e-3,default,None,True,True,True,"issue with aten.tanh in torch-mlir",""
 alexnet,linalg,torch,1e-2,1e-3,default,None,True,True,False,"https://github.com/nod-ai/SHARK/issues/879",""
 bert-base-cased,linalg,torch,1e-2,1e-3,default,None,False,True,False,"",""
 bert-base-uncased,linalg,torch,1e-2,1e-3,default,None,False,True,False,"",""
 bert-base-uncased_fp16,linalg,torch,1e-1,1e-1,default,None,True,True,True,"",""
 bert-large-uncased,linalg,torch,1e-2,1e-3,default,None,False,True,False,"",""
-bert-large-uncased,mhlo,tf,1e-2,1e-3,default,None,False,False,False,"",""
+bert-large-uncased,stablehlo,tf,1e-2,1e-3,default,None,False,False,False,"",""
 facebook/deit-small-distilled-patch16-224,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,False,True,False,"Fails during iree-compile.",""
 google/vit-base-patch16-224,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,False,True,False,"https://github.com/nod-ai/SHARK/issues/311",""
 microsoft/beit-base-patch16-224-pt22k-ft22k,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,False,True,False,"https://github.com/nod-ai/SHARK/issues/390","macos"
 microsoft/MiniLM-L12-H384-uncased,linalg,torch,1e-2,1e-3,default,None,False,True,False,"",""
-google/mobilebert-uncased,linalg,torch,1e-2,1e-3,default,None,False,True,False,"https://github.com/nod-ai/SHARK/issues/344",""
-mobilenet_v3_small,linalg,torch,1e-1,1e-2,default,nhcw-nhwc,False,True,False,"https://github.com/nod-ai/SHARK/issues/388","macos"
-nvidia/mit-b0,linalg,torch,1e-2,1e-3,default,None,True,True,False,"https://github.com/nod-ai/SHARK/issues/343","macos"
+google/mobilebert-uncased,linalg,torch,1e-2,1e-3,default,None,False,True,False,"https://github.com/nod-ai/SHARK/issues/344","macos"
+mobilenet_v3_small,linalg,torch,1e-1,1e-2,default,nhcw-nhwc,False,True,True,"https://github.com/nod-ai/SHARK/issues/388, https://github.com/nod-ai/SHARK/issues/1487","macos"
+nvidia/mit-b0,linalg,torch,1e-2,1e-3,default,None,True,True,True,"https://github.com/nod-ai/SHARK/issues/343,https://github.com/nod-ai/SHARK/issues/1487","macos"
 resnet101,linalg,torch,1e-2,1e-3,default,nhcw-nhwc/img2col,False,False,False,"","macos"
 resnet18,linalg,torch,1e-2,1e-3,default,None,True,True,False,"","macos"
 resnet50,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,False,False,False,"","macos"
 resnet50_fp16,linalg,torch,1e-2,1e-2,default,nhcw-nhwc/img2col,True,False,True,"",""
 squeezenet1_0,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,False,False,False,"","macos"
 wide_resnet50_2,linalg,torch,1e-2,1e-3,default,nhcw-nhwc/img2col,False,False,False,"","macos"
-efficientnet-v2-s,mhlo,tf,1e-02,1e-3,default,nhcw-nhwc,False,False,False,"","macos"
+efficientnet-v2-s,stablehlo,tf,1e-02,1e-3,default,nhcw-nhwc,False,False,False,"","macos"
 mnasnet1_0,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,True,True,True,"","macos"
-efficientnet_b0,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,True,True,False,"https://github.com/nod-ai/SHARK/issues/1243",""
-efficientnet_b7,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,False,True,False,"Fails on MacOS builder, VK device lost","macos"
-efficientnet_b0,mhlo,tf,1e-2,1e-3,default,nhcw-nhwc,False,False,False,"",""
-efficientnet_b7,mhlo,tf,1e-2,1e-3,default,nhcw-nhwc,False,False,False,"Fails on MacOS builder, VK device lost","macos"
-gpt2,mhlo,tf,1e-2,1e-3,default,None,True,False,False,"","macos"
+efficientnet_b0,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,True,True,True,"https://github.com/nod-ai/SHARK/issues/1487","macos"
+efficientnet_b7,linalg,torch,1e-2,1e-3,default,nhcw-nhwc,False,True,True,"https://github.com/nod-ai/SHARK/issues/1487","macos"
+efficientnet_b0,stablehlo,tf,1e-2,1e-3,default,nhcw-nhwc,False,False,False,"",""
+efficientnet_b7,stablehlo,tf,1e-2,1e-3,default,nhcw-nhwc,False,False,False,"Fails on MacOS builder, VK device lost","macos"
+gpt2,stablehlo,tf,1e-2,1e-3,default,None,True,False,False,"","macos"
 t5-base,linalg,torch,1e-2,1e-3,default,None,True,True,True,"Inputs for seq2seq models in torch currently unsupported.","macos"
-t5-base,mhlo,tf,1e-2,1e-3,default,None,False,False,False,"","macos"
+t5-base,stablehlo,tf,1e-2,1e-3,default,None,False,False,False,"","macos"
 t5-large,linalg,torch,1e-2,1e-3,default,None,True,True,True,"Inputs for seq2seq models in torch currently unsupported","macos"
-t5-large,mhlo,tf,1e-2,1e-3,default,None,False,False,False,"","macos"
+t5-large,stablehlo,tf,1e-2,1e-3,default,None,False,False,False,"","macos"
 stabilityai/stable-diffusion-2-1-base,linalg,torch,1e-3,1e-3,default,None,True,False,False,"","macos"
--- a/tank/examples/opt/opt_causallm.py
+++ b/tank/examples/opt/opt_causallm.py
@@ -0,0 +1,143 @@
+import os
+import torch
+import numpy as np
+from shark_opt_wrapper import OPTForCausalLMModel
+from shark.iree_utils._common import (
+    check_device_drivers,
+    device_driver_info,
+)
+from shark.shark_inference import SharkInference
+from shark.shark_importer import import_with_fx
+from transformers import AutoTokenizer, OPTForCausalLM
+
+OPT_MODEL = "opt-1.3b"
+OPT_FS_NAME = "opt-1_3b"
+MAX_SEQUENCE_LENGTH = 30
+MAX_NEW_TOKENS = 20
+
+
+def create_module(model_name, tokenizer, device):
+    opt_base_model = OPTForCausalLM.from_pretrained("facebook/" + model_name)
+    opt_base_model.eval()
+    opt_model = OPTForCausalLMModel(opt_base_model)
+    encoded_inputs = tokenizer(
+        "What is the meaning of life?",
+        padding="max_length",
+        truncation=True,
+        max_length=MAX_SEQUENCE_LENGTH,
+        return_tensors="pt",
+    )
+    inputs = (
+        encoded_inputs["input_ids"],
+        encoded_inputs["attention_mask"],
+    )
+    # np.save("model_inputs_0.npy", inputs[0])
+    # np.save("model_inputs_1.npy", inputs[1])
+
+    mlir_path = f"./{OPT_FS_NAME}_causallm_{MAX_SEQUENCE_LENGTH}_torch.mlir"
+    if os.path.isfile(mlir_path):
+        with open(mlir_path, "r") as f:
+            model_mlir = f.read()
+        print(f"Loaded .mlir from {mlir_path}")
+    else:
+        (model_mlir, func_name) = import_with_fx(
+            model=opt_model,
+            inputs=inputs,
+            is_f16=False,
+            model_name=OPT_FS_NAME,
+            return_str=True,
+        )
+        with open(mlir_path, "w") as f:
+            f.write(model_mlir)
+        print(f"Saved mlir at {mlir_path}")
+
+    shark_module = SharkInference(
+        model_mlir,
+        device=device,
+        mlir_dialect="tm_tensor",
+        is_benchmark=False,
+    )
+
+    vmfb_name = f"{OPT_FS_NAME}_causallm_{MAX_SEQUENCE_LENGTH}_torch_{device}"
+    shark_module.save_module(module_name=vmfb_name)
+    vmfb_path = vmfb_name + ".vmfb"
+    return vmfb_path
+
+
+def shouldStop(tokens):
+    stop_ids = [50278, 50279, 50277, 0]
+    for stop_id in stop_ids:
+        if tokens[0][-1] == stop_id:
+            return True
+    return False
+
+
+def generate_new_token(shark_model, tokenizer, new_text):
+    model_inputs = tokenizer(
+        new_text,
+        padding="max_length",
+        max_length=MAX_SEQUENCE_LENGTH,
+        truncation=True,
+        return_tensors="pt",
+    )
+    inputs = (
+        model_inputs["input_ids"],
+        model_inputs["attention_mask"],
+    )
+    sum_attentionmask = torch.sum(model_inputs.attention_mask)
+    output = shark_model("forward", inputs)
+    output = torch.FloatTensor(output[0])
+    next_toks = torch.topk(output, 1)
+    stop_generation = False
+    if shouldStop(next_toks.indices):
+        stop_generation = True
+    new_token = next_toks.indices[int(sum_attentionmask) - 1]
+    detok = tokenizer.decode(
+        new_token,
+        skip_special_tokens=False,
+        clean_up_tokenization_spaces=False,
+    )
+    ret_dict = {
+        "new_token": new_token,
+        "detok": detok,
+        "stop_generation": stop_generation,
+    }
+    return ret_dict
+
+
+if __name__ == "__main__":
+    tokenizer = AutoTokenizer.from_pretrained(
+        "facebook/" + OPT_MODEL, use_fast=False
+    )
+    vmfb_path = (
+        f"./{OPT_FS_NAME}_causallm_{MAX_SEQUENCE_LENGTH}_torch_cpu-sync.vmfb"
+    )
+    opt_shark_module = SharkInference(mlir_module=None, device="cpu-sync")
+    if os.path.isfile(vmfb_path):
+        opt_shark_module.load_module(vmfb_path)
+    else:
+        vmfb_path = create_module(OPT_MODEL, tokenizer, "cpu-sync")
+        opt_shark_module.load_module(vmfb_path)
+    while True:
+        try:
+            new_text = input("Give me a sentence to complete:")
+            new_text_init = new_text
+            words_list = []
+
+            for i in range(MAX_NEW_TOKENS):
+                generated_token_op = generate_new_token(
+                    opt_shark_module, tokenizer, new_text
+                )
+                detok = generated_token_op["detok"]
+                stop_generation = generated_token_op["stop_generation"]
+                if stop_generation:
+                    break
+                print(detok, end="", flush=True)
+                words_list.append(detok)
+                if detok == "":
+                    break
+                new_text = new_text + detok
+
+        except KeyboardInterrupt:
+            print("Exiting program.")
+            break
--- a/tank/examples/opt/opt_causallm_torch_test.py
+++ b/tank/examples/opt/opt_causallm_torch_test.py
@@ -0,0 +1,200 @@
+import unittest
+import os
+import pytest
+import torch
+import numpy as np
+from shark_opt_wrapper import OPTForCausalLMModel
+from shark.iree_utils._common import check_device_drivers, device_driver_info
+from shark.shark_inference import SharkInference
+from shark.shark_importer import import_with_fx
+from transformers import AutoTokenizer, OPTForCausalLM
+
+OPT_MODEL = "facebook/opt-1.3b"
+OPT_FS_NAME = "opt-1_3b"
+OPT_MODEL_66B = "facebook/opt-66b"
+
+
+class OPTModuleTester:
+    def __init__(
+        self,
+        benchmark=False,
+    ):
+        self.benchmark = benchmark
+
+    def create_and_check_module(self, dynamic, device, model_name):
+        tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
+        opt_model = OPTForCausalLM.from_pretrained(
+            model_name, return_dict=False
+        )
+        opt_model.eval()
+
+        model_inputs = tokenizer(
+            "The meaning of life is",
+            padding="max_length",
+            max_length=30,
+            truncation=True,
+            return_tensors="pt",
+        )
+        inputs = (
+            model_inputs.data["input_ids"],
+            model_inputs.data["attention_mask"],
+        )
+        act_out = opt_model(
+            inputs[0], attention_mask=inputs[1], return_dict=False
+        )[0]
+        (
+            mlir_module,
+            func_name,
+        ) = import_with_fx(
+            model=opt_model,
+            inputs=inputs,
+            is_f16=False,
+            model_name=OPT_FS_NAME,
+        )
+        del opt_model
+        opt_filename = f"./{OPT_FS_NAME}_causallm_30_torch_{device}"
+        mlir_path = os.path.join(opt_filename, ".mlir")
+        with open(mlir_path, "w") as f:
+            f.write(mlir_module)
+        print(f"Saved mlir at {mlir_path}")
+
+        shark_module = SharkInference(
+            mlir_module,
+            device=device,
+            mlir_dialect="tm_tensor",
+            is_benchmark=self.benchmark,
+        )
+
+        shark_module.compile()
+        results = shark_module("forward", inputs)
+        print(
+            "SHARK logits have shape: ",
+            str(results[0].shape) + " : " + str(results[0]),
+        )
+        print(
+            "PyTorch logits have shape: "
+            + str(act_out[0].shape)
+            + " : "
+            + str(act_out[0])
+        )
+        # exp_out = tokenizer.decode(act_out[0][0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
+        # shark_out = tokenizer.decode(results[0][0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
+        np.testing.assert_allclose(act_out[0].detach(), results[0])
+
+        if self.benchmark:
+            shark_module.shark_runner.benchmark_all_csv(
+                inputs,
+                "opt",
+                dynamic,
+                device,
+                "torch",
+            )
+
+
+class OPTModuleTest(unittest.TestCase):
+    @pytest.fixture(autouse=True)
+    def configure(self, pytestconfig):
+        self.module_tester = OPTModuleTester(self)
+        self.module_tester.save_mlir = False
+        self.module_tester.save_vmfb = False
+        self.module_tester.benchmark = pytestconfig.getoption("benchmark")
+
+    def test_1_3b_static_cpu(self):
+        dynamic = False
+        device = "cpu"
+        self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
+
+    def test_1_3b_dynamic_cpu(self):
+        dynamic = True
+        device = "cpu"
+        self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
+
+    @pytest.mark.skipif(
+        check_device_drivers("cuda"), reason=device_driver_info("cuda")
+    )
+    def test_1_3b_static_cuda(self):
+        dynamic = False
+        device = "cuda"
+        self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
+
+    @pytest.mark.skipif(
+        check_device_drivers("cuda"), reason=device_driver_info("cuda")
+    )
+    def test_1_3b_dynamic_cuda(self):
+        dynamic = True
+        device = "cuda"
+        self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
+
+    @pytest.mark.skipif(
+        check_device_drivers("vulkan"), reason=device_driver_info("vulkan")
+    )
+    def test_1_3b_static_vulkan(self):
+        dynamic = False
+        device = "vulkan"
+        self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
+
+    @pytest.mark.skipif(
+        check_device_drivers("vulkan"), reason=device_driver_info("vulkan")
+    )
+    def test_1_3b_dynamic_vulkan(self):
+        dynamic = True
+        device = "vulkan"
+        self.module_tester.create_and_check_module(dynamic, device, OPT_MODEL)
+
+    # def test_66b_static_cpu(self):
+    #    dynamic = False
+    #    device = "cpu"
+    #    self.module_tester.create_and_check_module(
+    #        dynamic, device, OPT_MODEL_66B
+    #    )
+
+    # def test_66b_dynamic_cpu(self):
+    #    dynamic = True
+    #    device = "cpu"
+    #    self.module_tester.create_and_check_module(
+    #        dynamic, device, OPT_MODEL_66B
+    #    )
+
+    # @pytest.mark.skipif(
+    #    check_device_drivers("cuda"), reason=device_driver_info("cuda")
+    # )
+    # def test_66b_static_cuda(self):
+    #    dynamic = False
+    #    device = "cuda"
+    #    self.module_tester.create_and_check_module(
+    #        dynamic, device, OPT_MODEL_66B
+    #    )
+
+    # @pytest.mark.skipif(
+    #    check_device_drivers("cuda"), reason=device_driver_info("cuda")
+    # )
+    # def test_66b_dynamic_cuda(self):
+    #    dynamic = True
+    #    device = "cuda"
+    #    self.module_tester.create_and_check_module(
+    #        dynamic, device, OPT_MODEL_66B
+    #    )
+
+    # @pytest.mark.skipif(
+    #    check_device_drivers("vulkan"), reason=device_driver_info("vulkan")
+    # )
+    # def test_66b_static_vulkan(self):
+    #    dynamic = False
+    #    device = "vulkan"
+    #    self.module_tester.create_and_check_module(
+    #        dynamic, device, OPT_MODEL_66B
+    #    )
+
+    # @pytest.mark.skipif(
+    #    check_device_drivers("vulkan"), reason=device_driver_info("vulkan")
+    # )
+    # def test_66b_dynamic_vulkan(self):
+    #    dynamic = True
+    #    device = "vulkan"
+    #    self.module_tester.create_and_check_module(
+    #        dynamic, device, OPT_MODEL_66B
+    #    )
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/tank/examples/opt/opt_torch_test.py
+++ b/tank/examples/opt/opt_torch_test.py
@@ -2,7 +2,7 @@ import unittest

 import pytest
 import torch_mlir
-from hacked_hf_opt import OPTModel
+from shark_hf_opt import OPTModel
 from shark.iree_utils._common import check_device_drivers, device_driver_info
 from shark.shark_inference import SharkInference
 from tank.model_utils import compare_tensors
@@ -56,13 +56,12 @@ class OPTModuleTester:

        shark_module = SharkInference(
            model_mlir,
-            func_name,
            device=device,
            mlir_dialect="tm_tensor",
            is_benchmark=self.benchmark,
        )
        shark_module.compile()
-        results = shark_module.forward((input_ids, attention_mask))
+        results = shark_module("forward", (input_ids, attention_mask))
        assert compare_tensors(act_out, results)

        if self.benchmark:
--- a/tank/examples/opt/shark_hf_base_opt.py
+++ b/tank/examples/opt/shark_hf_base_opt.py
@@ -0,0 +1,47 @@
+import os
+import torch
+from transformers import AutoTokenizer, OPTForCausalLM
+from shark.shark_inference import SharkInference
+from shark.shark_importer import import_with_fx
+from shark_opt_wrapper import OPTForCausalLMModel
+
+model_name = "facebook/opt-1.3b"
+base_model = OPTForCausalLM.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
+
+model = OPTForCausalLMModel(base_model)
+
+prompt = "What is the meaning of life?"
+model_inputs = tokenizer(prompt, return_tensors="pt")
+inputs = (
+    model_inputs["input_ids"],
+    model_inputs["attention_mask"],
+)
+
+(
+    mlir_module,
+    func_name,
+) = import_with_fx(
+    model=model,
+    inputs=inputs,
+    is_f16=False,
+    debug=True,
+    model_name=model_name.split("/")[1],
+    save_dir=".",
+)
+
+shark_module = SharkInference(
+    mlir_module,
+    device="cpu-sync",
+    mlir_dialect="tm_tensor",
+)
+shark_module.compile()
+# Generated logits.
+logits = shark_module("forward", inputs=inputs)
+print("SHARK module returns logits:")
+print(logits[0])
+
+hf_logits = base_model.forward(inputs[0], inputs[1], return_dict=False)[0]
+
+print("PyTorch baseline returns logits:")
+print(hf_logits)
--- a/tank/examples/opt/shark_opt_wrapper.py
+++ b/tank/examples/opt/shark_opt_wrapper.py
@@ -0,0 +1,15 @@
+import torch
+
+
+class OPTForCausalLMModel(torch.nn.Module):
+    def __init__(self, model):
+        super().__init__()
+        self.model = model
+
+    def forward(self, input_ids, attention_mask):
+        combine_input_dict = {
+            "input_ids": input_ids,
+            "attention_mask": attention_mask,
+        }
+        output = self.model(**combine_input_dict)
+        return output.logits
--- a/tank/examples/opt/shark_opt_wrapper_train.py
+++ b/tank/examples/opt/shark_opt_wrapper_train.py
@@ -279,7 +279,6 @@ class OPTAttention(nn.Module):
        )

        attn_output = torch.bmm(attn_probs, value_states)
-
        if attn_output.size() != (
            bsz * self.num_heads,
            tgt_len,
@@ -314,6 +313,7 @@ class OPTDecoderLayer(nn.Module):
            num_heads=config.num_attention_heads,
            dropout=config.attention_dropout,
            is_decoder=True,
+            bias=config.enable_bias,
        )
        self.do_layer_norm_before = config.do_layer_norm_before
        self.dropout = config.dropout
@@ -321,10 +321,16 @@ class OPTDecoderLayer(nn.Module):

        self.activation_dropout = config.activation_dropout

-        self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
+        self.self_attn_layer_norm = nn.LayerNorm(
+            self.embed_dim,
+            elementwise_affine=config.layer_norm_elementwise_affine,
+        )
        self.fc1 = nn.Linear(self.embed_dim, config.ffn_dim)
        self.fc2 = nn.Linear(config.ffn_dim, self.embed_dim)
-        self.final_layer_norm = nn.LayerNorm(self.embed_dim)
+        self.final_layer_norm = nn.LayerNorm(
+            self.embed_dim,
+            elementwise_affine=config.layer_norm_elementwise_affine,
+        )

    def forward(
        self,
@@ -450,7 +456,14 @@ class OPTDecoder(OPTPreTrainedModel):
        else:
            self.project_in = None

-        self.layer_norm = None
+        if config.do_layer_norm_before and not config._remove_final_layer_norm:
+            self.final_layer_norm = nn.LayerNorm(
+                config.hidden_size,
+                elementwise_affine=config.layer_norm_elementwise_affine,
+            )
+        else:
+            self.final_layer_norm = None
+
        self.layers = nn.ModuleList(
            [OPTDecoderLayer(config) for _ in range(config.num_hidden_layers)]
        )
@@ -647,6 +660,9 @@ class OPTDecoder(OPTPreTrainedModel):
            if output_attentions:
                all_self_attns += (layer_outputs[1],)

+        if self.final_layer_norm is not None:
+            hidden_states = self.final_layer_norm(hidden_states)
+
        if self.project_out is not None:
            hidden_states = self.project_out(hidden_states)

@@ -832,7 +848,10 @@ class OPTForCausalLM(OPTPreTrainedModel):
            )

        if not return_dict:
-            output = (logits,) + outputs[1:]
+            if isinstance(outputs[1:], tuple):
+                output = (logits,) + outputs[1:]
+            else:
+                output = (logits, outputs[1:])
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
--- a/tank/test_models.py
+++ b/tank/test_models.py
@@ -64,6 +64,7 @@ def get_valid_test_params():
        device
        for device in get_supported_device_list()
        if not check_device_drivers(device)
+        and device not in ["cpu-sync", "cpu-task"]
    ]
    dynamic_list = (True, False)
    # TODO: This is soooo ugly, but for some reason creating the dict at runtime
@@ -92,6 +93,8 @@ def get_valid_test_params():
 def is_valid_case(test_params):
    if test_params[0] == True and test_params[2]["framework"] == "tf":
        return False
+    if test_params[2]["framework"] == "tf":
+        return False
    elif "fp16" in test_params[2]["model_name"] and test_params[1] != "cuda":
        return False
    else:
@@ -348,7 +351,11 @@ class SharkModuleTest(unittest.TestCase):
            self.pytestconfig.getoption("dispatch_benchmarks_dir")
        )

-        if config["xfail_cpu"] == "True" and device == "cpu":
+        if config["xfail_cpu"] == "True" and device in [
+            "cpu",
+            "cpu-sync",
+            "cpu-task",
+        ]:
            pytest.xfail(reason=config["xfail_reason"])

        if config["xfail_cuda"] == "True" and device == "cuda":
Author	SHA1	Message	Date
dependabot[bot]	a5c882f296	Bump gradio from 3.15.0 to 3.34.0 (#1518 ) Bumps [gradio](https://github.com/gradio-app/gradio) from 3.15.0 to 3.34.0. - [Release notes](https://github.com/gradio-app/gradio/releases) - [Changelog](https://github.com/gradio-app/gradio/blob/main/CHANGELOG.md) - [Commits](https://github.com/gradio-app/gradio/compare/v3.15.0...v3.34.0) --- updated-dependencies: - dependency-name: gradio dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-06-14 18:13:48 -07:00
Ean Garvey	eb6d11cfed	Change mlir dialects for tf tests to stablehlo. (#1535 ) * Change mlir dialects for tf tests to stablehlo * Update shark_runner.py	2023-06-14 10:43:49 -07:00
Vivek Khandelwal	46184a81ac	Add Falcon pipeline (#1534 )	2023-06-14 09:39:16 -07:00
PhaneeshB	149165a2f0	add multi-device mutli-precision vmfb names	2023-06-14 22:08:24 +05:30
dan	bec82a665f	mega vicuna merge single endpoint in apps/language/models/scripts/vicuna.py removed main functions from pipelines replaced divergent utils compile with shark_importer adds support for different precisions	2023-06-14 19:06:29 +05:30
Ean Garvey	9551490341	Remove deprecared --iree-mhlo-demote-164-to-132 flag usage. (#1533 )	2023-06-13 22:40:47 -05:00
Ean Garvey	49b3ecdbca	(pytest) don't run redundant tests in cpu suite (#1532 )	2023-06-13 22:40:33 -05:00
Ean Garvey	f53e3594c3	OPT Refactor (#1516 ) * Change script to 1.3b model and add pytorch comparison * fix CLI command * Match OPT transformers model updates + numerics against latest version * Cleanup OPT sentence completion script. * Fix formatting and add standalone validation scripts. * Add minimal OPT wrapper and example with import_with_fx * Rename OPT full model wrapper. * Cleanup test scripts for OPT.	2023-06-13 22:40:07 -05:00
Ean Garvey	5562d1dfda	Fix xfails for cpu pytest cases (#1527 ) Adding cpu-sync and cpu-task device configs was allowing respective tests to bypass the xfail conditional for cpu pytests marked in tank/all_models.csv. This commit updates the conditional to xfail those cases for cpu-sync and cpu-task as well.	2023-06-13 17:01:51 -07:00
Stefan Kapusniak	c7b0c2961e	UI/Web Improve output gallery temp file handling (#1531 ) * On startup report that cleaning up of temp files is taking place, in case it takes a long time. * Have the output gallery tab delete any zero length temporary files generated by gradio < 3.32.0 for its gallery control whenever it needs to update that control with images. This prevents such files multiplying out of control.	2023-06-13 16:25:37 -05:00
Ean Garvey	44273b0791	Fix conditional in transform_fx() (#1530 )	2023-06-13 16:24:53 -05:00
Prashant Kumar	0a4c8fcb3e	Minor changes in the fx transforms.	2023-06-13 21:23:35 +05:30
Stefan Kapusniak	2fec3c8169	re-indents add_upcast in shark importer (#1523 ) * The two with blocks in add_upcast appear to be underindented making SD 1.4 break on rdna3, I've pushed them out one more tab, and then everything appears to work again.	2023-06-12 14:41:10 -05:00
Gaurav Shukla	5e7d5930dd	[vicuna] Add device and precision propagation in vicuna (#1520 ) Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>	2023-06-12 12:14:43 -05:00
Prashant Kumar	b6dbd20250	Modify the fx transforms. (#1521 ) - The bounds are set properly. - The upcasting and downcasting is done for vicuna.	2023-06-12 09:40:14 -07:00
Nithin Meganathan	34f1295349	Add a model config generator (#1511 ) Model config generator takes a PyTorch model as input and generates a JSON file with model layers and other propperties that define sharding on a particular hardware.	2023-06-09 15:32:00 -07:00
Phaneesh Barwaria	1980d7b2c3	Cpu device map (#1515 ) * update cpu iree device * fix vmfb paths vic unsharded	2023-06-09 11:27:02 -05:00
powderluv	2cfacc5051	fix osx torch_mlir (#1513 ) * fix osx torch_mlir * Update index.py * Update index.py	2023-06-09 00:57:26 -07:00
Phaneesh Barwaria	436f58ddc4	cli using generate and mem fixes (#1509 )	2023-06-08 13:13:32 -05:00
Phaneesh Barwaria	6b29bd17c8	Enable compilation vicuna (#1507 ) * add cli for unsharded vic * enable mlir download and compile	2023-06-07 13:08:22 -07:00
Ean Garvey	2c3485ca3e	Add standalone OPT sentence completion script. (#1506 )	2023-06-07 10:58:03 -07:00
Daniel Garvey	f206ecc635	reenable compilation in vicuna pipeline, add flags (#1505 ) * replace vicuna.py backend with pipeline * add some memory management to fist vicuna compile reenable compilation	2023-06-07 09:49:27 -07:00
Stefan Kapusniak	a187e05ae6	Prevent having no cuda devices breaking the UI (#1503 ) Don't break the UI when the LLM tab only wants cuda devices but there aren't any.	2023-06-06 11:41:16 -07:00
Gaurav Shukla	8c21960486	[vicuna] Set only cuda devices in vicuna UI for now Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>	2023-06-06 22:15:20 +05:30
Gaurav Shukla	be62fce676	[vicuna] Fix vicuna chatbot (#1499 ) Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>	2023-06-06 09:23:32 -07:00
PhaneeshB	f23b778a6c	remove old vicuna scripts	2023-06-06 21:35:58 +05:30
PhaneeshB	436edf900d	add vic sharded pipeline	2023-06-06 21:35:58 +05:30
Gaurav Shukla	ed58c2553f	[vicuna] Integrate vicuna in shark studio Signed-Off-by: Gaurav Shukla <gaurav@nod-labs.com>	2023-06-06 20:57:48 +05:30
Stefan Kapusniak	f2ca58e844	Add .csv and .json param info to output gallery (#1495 )	2023-06-06 07:08:34 -07:00
Ean Garvey	1dbcc736eb	[SD] (RDNA2) Enable new tuning for sd1.4 (#1498 )	2023-06-06 06:48:58 -07:00
Phaneesh Barwaria	a83808ddc5	Vicuna cuda on A100 40G (#1496 ) * vic chat with memory management (precompiled vmfb) * fix vmfb path and download	2023-06-06 15:10:33 +05:30
Ean Garvey	a07fe80530	Update OPT, ResNet example scripts. (#1492 ) * Update API in OPT example. * fix resnet50 script * Add OPT1.3b test script.	2023-06-05 20:19:35 -07:00
Ean Garvey	d0ba3ef8fa	disable use_tuned on SD1.4 for rdna2 (#1490 ) this is a temporary measure while we retune SD1.4 for rdna2. The current config fails during iree-compile.	2023-06-05 19:46:16 -05:00
Stefan Kapusniak	8400529c2c	Fix output gallery not using shark_tmp (#1493 ) This fix the gallery component of the output gallery dumping temporary files into the standard folders rather than shark_tmp so those files never got cleared out on restart and would build up.	2023-06-05 16:23:49 -05:00
powderluv	7eaee9c242	update SHARK to nodai SHARK	2023-06-05 00:44:49 -07:00
powderluv	8230eebce5	Switch to CPU torch builds for shark.whl	2023-06-05 00:36:03 -07:00
Ean Garvey	6296ea4be9	fix config handling for sd1.4 on rdna2 (#1489 )	2023-06-05 00:02:30 -07:00
Ean Garvey	4151ec3a8f	(pytest) tag efficientnet, mobilenet as xfails on vulkan (#1488 )	2023-06-04 23:22:32 -07:00
powderluv	a2467e8d43	Enable SHARK whl packages	2023-06-04 23:21:22 -07:00
Ean Garvey	e677178bcc	Replace RDNA2 SD lowering configs. (#1486 )	2023-06-05 00:57:43 -05:00
Anush Elangovan	7ef1bea953	XFAIL some macos tests	2023-06-04 15:27:03 -07:00