# Quick Start

QEfficient Library was designed with one goal:

**To make onboarding of model inference straightforward for any Transformer architecture, while leveraging the complete power of the Cloud AI platform.**

To achieve this, we provide two levels of APIs with different levels of abstraction:

1. The command line interface abstracts away complex details and offers a simpler interface. It is ideal for quick development and prototyping, for users new to the technology, or for minimizing coding effort.
2. The Python high-level APIs offer more granular control and are ideal when customization is necessary.

---

## Transformed models and QPC storage

By default, exported models and Qaic Program Container (QPC) files (compiled, inference-ready model binaries generated by the compiler) are stored in `~/.cache/qeff_cache`. You can customize this storage path using the following environment variables:

1. **QEFF_HOME**: If this variable is set, its path is used for storing models and QPC files.
2. **XDG_CACHE_HOME**: If `QEFF_HOME` is not set but `XDG_CACHE_HOME` is provided, this path is used instead. Note that setting `XDG_CACHE_HOME` reroutes the entire `~/.cache` directory to the specified folder, including HF models.
3. **Default**: If neither `QEFF_HOME` nor `XDG_CACHE_HOME` is set, the default path `~/.cache/qeff_cache` is used.

---

## Command Line Interface Execution

```{NOTE}
Use a ``bash`` terminal. If you are using a ``zsh`` terminal, ``device_group`` should be in single quotes, e.g. ``'--device_group [0]'``.
```

### Inference

Below are the command line APIs the library supports for inference.

#### Export

**CLI API:** [`QEfficient.cloud.export`](#export_api)

You can export a model to ONNX using the CLI command below. This converts the model to ONNX format and stores the resulting ONNX model file in the QEfficient cache folder. [Click here](#export_api) for more information about the export command and its arguments.

```bash
python -m QEfficient.cloud.export --model_name gpt2
```

---

#### Compile

**CLI API:** [`QEfficient.cloud.compile`](#compile_api)

```{warning}
The `QEfficient.cloud.compile` API is **deprecated** and **not supported** for direct use. It will be removed in future versions. Please use the unified `QEfficient.cloud.infer` API instead, which handles both compilation and execution.
```

You can also use the `compile` API to compile pre-exported ONNX models using the QNN SDK. Refer to the [Compile API doc](#compile_api) for more details.

Without QNN config:

```bash
python -m QEfficient.cloud.compile --onnx_path <path/to/model.onnx> --qpc-path <path/to/qpc_dir> --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --mos 1 --aic_enable_depth_first --enable_qnn
```

With QNN config:

```bash
python -m QEfficient.cloud.compile --onnx_path <path/to/model.onnx> --qpc-path <path/to/qpc_dir> --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --mos 1 --aic_enable_depth_first --enable_qnn QEfficient/compile/qnn_config.json
```

**QNN Compilation**

You can compile a model with the QNN SDK by following the steps below (a combined sketch follows this list):

* Set the QNN SDK path: `export QNN_SDK_ROOT=/path/to/qnn_sdk_folder`
* Enable QNN by adding the `--enable_qnn` flag to the CLI command.
* Optionally, pass a config file to override the default parameters.
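Putting the steps together, here is a minimal sketch of a QNN compilation run; the SDK location and the ONNX/QPC paths are placeholders that you should replace with your own:

```bash
# Point the toolchain at your QNN SDK installation (placeholder path)
export QNN_SDK_ROOT=/path/to/qnn_sdk_folder

# Compile with QNN enabled; the trailing config file is optional and overrides the defaults listed below
python -m QEfficient.cloud.compile --onnx_path <path/to/model.onnx> --qpc-path <path/to/qpc_dir> \
    --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] \
    --mos 1 --aic_enable_depth_first --enable_qnn QEfficient/compile/qnn_config.json
```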
**Default Parameters**

QNN converter stage:

```
"--float_bias_bitwidth 32 --float_bitwidth 16 --preserve_io_datatype --onnx_skip_simplification --target_backend AIC"
```

QNN context binary stage:

```
LOG_LEVEL = "error"
COMPILER_COMPILATION_TARGET = "hardware"
COMPILER_CONVERT_TO_FP16 = True
COMPILER_DO_DDR_TO_MULTICAST = True
COMPILER_HARDWARE_VERSION = "2.0"
COMPILER_PERF_WARNINGS = False
COMPILER_PRINT_DDR_STATS = False
COMPILER_PRINT_PERF_METRICS = False
COMPILER_RETAINED_STATE = True
COMPILER_STAT_LEVEL = 10
COMPILER_STATS_BATCH_SIZE = 1
COMPILER_TIME_PASSES = False
```

---

#### Execute

**CLI API:** [`QEfficient.cloud.execute`](#execute_api)

Once you have compiled the QPC using the `infer` or `compile` API, you can pass the precompiled QPC to the `execute` API to run it on different prompts. Make sure to pass the same `--device_group` as used during `infer`. Refer to the [Execute API doc](#execute_api) for more details.

```bash
python -m QEfficient.cloud.execute --model_name gpt2 --qpc_path qeff_models/gpt2/qpc_qnn_16cores_1BS_32PL_128CL_1devices_mxfp6/qpcs --prompt "Once upon a time in" --device_group [0]
```
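For batch runs, you can also point `execute` at a prompts file instead of a single prompt. The sketch below assumes `execute` accepts the same `--prompts_txt_file_path` option that `infer` does (described in the next section); check `python -m QEfficient.cloud.execute --help` to confirm.

```bash
python -m QEfficient.cloud.execute --model_name gpt2 --qpc_path qeff_models/gpt2/qpc_qnn_16cores_1BS_32PL_128CL_1devices_mxfp6/qpcs --prompts_txt_file_path examples/sample_prompts/prompts.txt --device_group [0]
```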
---

#### Infer

**CLI API:** [`QEfficient.cloud.infer`](#infer_api)

This is the single end-to-end CLI API, which takes the model card name as input along with other compilation arguments. Check the [Infer API doc](#infer_api) for more details.

* HuggingFace model files download → optimize for Cloud AI 100 → export to `ONNX` → compile on Cloud AI 100 → [execute](#execute_api)
* It skips the export/compile stages if `ONNX` or `qpc` files are already found. If you run `infer` a second time with different compilation arguments, it automatically skips ONNX model creation and jumps directly to the compile stage.

```bash
# Check out the options using the help
python -m QEfficient.cloud.infer --help
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first
```

If executing with batch size > 1, you can pass the input prompts in a single string, separated by the pipe (`|`) symbol:

```bash
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompt "My name is|The flat earth theory is the belief that|The sun rises from" --mxfp6 --mos 1 --aic_enable_depth_first
```

When you want to run inference on many prompts, you can instead pass the path of a txt file containing the input prompts. A sample file (`prompts.txt`) is present in the `examples/sample_prompts` folder:

```bash
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompts_txt_file_path examples/sample_prompts/prompts.txt --mxfp6 --mos 1 --aic_enable_depth_first
```

**QNN CLI Inference Command**

Without QNN config:

```bash
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first --enable_qnn
```

With QNN config:

```bash
python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first --enable_qnn QEfficient/compile/qnn_config.json
```

**You can also take advantage of features like multi-Qranium inference and continuous batching with QNN SDK compilation.**

---

### Finetune

**CLI API:** [`QEfficient.cloud.finetune`](#finetune_api)

You can run fine-tuning with a set of predefined datasets on QAIC using the eager pipeline. Check the [Finetune API doc](#finetune_api) for more details.

```bash
python -m QEfficient.cloud.finetune --device qaic:0 --use-peft --output_dir ./meta-sam --num_epochs 2 --context_length 256
```

For more details on fine-tuning, please refer to the [**finetune**](finetune.md) page.

---

## QEFF Auto Class Execution

Here is the high-level API to compile and run models on Cloud AI 100 via Python using the QEFF auto classes. To learn more about them, refer to [QEFFAutoClasses](qeff_autoclasses.md).

### 1. Model download and Optimize for Cloud AI 100

If your model falls into one of the [already supported](validated_models) model architectures, the steps below should work fine. Please raise an [issue](https://github.com/quic/efficient-transformers/issues) in case of trouble.

```Python
# Initiate the original Transformer model
# import os
from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM
from transformers import AutoTokenizer

# Please uncomment and use an appropriate cache directory for transformers, in case you don't want to use the default ~/.cache dir.
# os.environ["TRANSFORMERS_CACHE"] = "/local/mnt/workspace/hf_cache"

# ROOT_DIR = os.path.dirname(os.path.abspath(""))
# CACHE_DIR = os.path.join(ROOT_DIR, "tmp")  # You can use a different location for just one model by passing this path as cache_dir in the API below.

# Model card name (this is the HF model card name): https://huggingface.co/gpt2
model_name = "gpt2"  # Similarly, we can change the model name and generate the corresponding model, if support has been added in the library.

qeff_model = AutoModelForCausalLM.from_pretrained(model_name)
print(f"{model_name} optimized for AI 100 \n", qeff_model)
```

### 2. Export and Compile with one API

Use the `compile` API to export the KV-transformed model to ONNX, verify it against PyTorch, and compile it for Cloud AI 100.

```Python
# We can now export the modified model to the ONNX framework.
# This generates a single ONNX model for both the prefill and decode variations,
# optimized for the Cloud AI 100 platform.
# While generating the ONNX model, this will clip the overflow constants to fp16,
# verify the model on ONNXRuntime vs PyTorch,
# and then generate the inputs and custom IO yaml file required for compilation.

# Compile the model for the provided compilation arguments.
# Please use the platform SDK to check num_cores for your card.
generated_qpc_path = qeff_model.compile(
    num_cores=16,
    mxfp6_matmul=True,
)
```
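If you need the same knobs the CLI exposes (batch size, prompt length, context length), `compile` takes them as keyword arguments. The sketch below assumes the parameter names mirror the CLI flags (`--batch_size`, `--prompt_len`, `--ctx_len`); check the [QEFFAutoClasses](qeff_autoclasses.md) reference for the authoritative signature.

```Python
# Sketch: compile with explicit shape parameters (names assumed to mirror the CLI flags;
# verify against the QEFFAutoClasses documentation)
generated_qpc_path = qeff_model.compile(
    num_cores=16,        # number of cores on your Cloud AI 100 card
    mxfp6_matmul=True,   # compress MatMul weights to MXFP6
    batch_size=1,        # assumption: mirrors --batch_size
    prefill_seq_len=32,  # assumption: mirrors --prompt_len
    ctx_len=128,         # assumption: mirrors --ctx_len
)
```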
### 3. Execute

Benchmark the model on Cloud AI 100: run the `generate` API to print the generated tokens and tokens/sec.

```Python
# Post compilation, we can print the latency stats for the KV models.
# We provide an API to print token and latency stats on AI 100.
# The compiled prefill and decode QPCs are needed to compute the generated tokens;
# this is based on a greedy sampling approach.
tokenizer = AutoTokenizer.from_pretrained(model_name)
qeff_model.generate(prompts=["My name is"], tokenizer=tokenizer)
```

### Local Model Execution

If the model and tokenizer are already downloaded, you can load them directly from a local path.

```python
from QEfficient import QEFFAutoModelForCausalLM
from transformers import AutoTokenizer

# Local path to the downloaded model. You can find downloaded HF models in:
# - Default location: ~/.cache/huggingface/hub/models--{model_name}/snapshots/{snapshot_id}/
local_model_repo = "~/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e"

# Load the model from the local path
model = QEFFAutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=local_model_repo)
model.compile(num_cores=16)

# Load the tokenizer from the same local path
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=local_model_repo)

model.generate(prompts=["Hi there!!"], tokenizer=tokenizer)
```

End-to-end demo examples for various models are available in the [**notebooks**](https://github.com/quic/efficient-transformers/tree/main/notebooks) directory. Please check them out.
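Rather than hard-coding the snapshot hash in the local-path example above, you can resolve it programmatically with `huggingface_hub`. This is a convenience sketch, assuming the package is installed and the model has already been downloaded to the local HF cache.

```python
from huggingface_hub import snapshot_download

# Resolve the local snapshot directory of an already-downloaded model without
# touching the network; this raises an error if the model is not in the local cache.
local_model_repo = snapshot_download("gpt2", local_files_only=True)
print(local_model_repo)
```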