Deploy the Whisper Speech Recognition Model Locally
What is Whisper?
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Deployment Methods
Installing with pip
We used Python 3.9.9 and PyTorch 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.8-3.11 and recent PyTorch versions. The codebase also depends on a few Python packages, most notably OpenAI's tiktoken for their fast tokenizer implementation. You can download and install (or update to) the latest release of Whisper with the following command:
```
pip install -U openai-whisper
```
Alternatively, the following command will pull and install the latest commit from this repository, along with its Python dependencies:
```
pip install git+https://github.com/openai/whisper.git
```
To update the package to the latest version of this repository, please run:
```
pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
```
It also requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:
```
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```
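Once installed, a quick way to confirm that ffmpeg is on your PATH before running Whisper:

```
ffmpeg -version
```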
You may need Rust installed as well, in case tiktoken does not provide a pre-built wheel for your platform. If you see installation errors during the pip install command above, please follow the Getting Started page to install the Rust development environment. Additionally, you may need to configure the PATH environment variable, e.g. export PATH="$HOME/.cargo/bin:$PATH". If the installation fails with No module named 'setuptools_rust', you need to install setuptools_rust, e.g. by running:
```
pip install setuptools-rust
```
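Once the installation succeeds, a quick sanity check is to list the bundled model names from Python; whisper.available_models() returns them as a list of strings:

```python
import whisper

# Prints the names of the models shipped with the package,
# e.g. ['tiny.en', 'tiny', 'base.en', 'base', ...]
print(whisper.available_models())
```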
Available Models and Supported Languages
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model; actual speed may vary depending on many factors, including the available hardware.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|---|
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
base | 74 M | base.en | base | ~1 GB | ~16x |
small | 244 M | small.en | small | ~2 GB | ~6x |
medium | 769 M | medium.en | medium | ~5 GB | ~2x |
large | 1550 M | N/A | large | ~10 GB | 1x |
The .en models for English-only applications tend to perform better, especially the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.
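As a concrete illustration, the name passed to whisper.load_model() selects a row from the table above, with the .en suffix choosing the English-only variant; a minimal sketch:

```python
import whisper

# English-only variant: per the note above, often slightly more accurate
# for English audio at the tiny/base sizes
model = whisper.load_model("base.en")

# The multilingual variant of the same size uses the plain name:
# model = whisper.load_model("base")
```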
Command-line Usage
The following command will transcribe speech in audio files, using the medium model:
```
whisper audio.flac audio.mp3 audio.wav --model medium
```
The default setting (which selects the small model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the --language option:
```
whisper japanese.wav --language Japanese
```
Adding --task translate will translate the speech into English:
```
whisper japanese.wav --language Japanese --task translate
```
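The same options are available from Python: the language and task keyword arguments to transcribe() mirror the --language and --task CLI flags. A minimal sketch, assuming a japanese.wav file exists:

```python
import whisper

model = whisper.load_model("medium")

# Transcribe non-English speech (mirrors --language Japanese)
result = model.transcribe("japanese.wav", language="Japanese")
print(result["text"])

# Translate the speech into English (mirrors --task translate)
translated = model.transcribe("japanese.wav", task="translate")
print(translated["text"])
```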
Run the following to view all available options:
```
whisper --help
```
See tokenizer.py for the list of all available languages.

Python Usage

Transcription can also be performed within Python:
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```
Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
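This windowed decoding is visible in the result: along with the full text, transcribe() returns a list of timestamped segments. A small sketch, assuming an audio.mp3 file exists:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Each segment carries the start/end timestamps assigned during
# the windowed, autoregressive decoding described above
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")
```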
Below is an example usage of whisper.detect_language() and whisper.decode(), which provide lower-level access to the model:
```python
import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
```
Deploying with a Container
Container Dockerfile
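The original Dockerfile did not survive extraction, so below is a minimal sketch of what such an image could look like, assuming a Debian-based Python base image, the pip install shown earlier, and the whisper CLI as the entrypoint; adapt it to your own setup:

```dockerfile
# Minimal sketch; the original Dockerfile was not preserved, so this is an
# assumption-based reconstruction rather than the author's actual file.
FROM python:3.10-slim

# Whisper needs the ffmpeg CLI at runtime to read audio files
RUN apt-get update && \
    apt-get install -y --no-install-recommends ffmpeg && \
    rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir -U openai-whisper

WORKDIR /data

# Example usage:
#   docker build -t whisper-local .
#   docker run --rm -v "$(pwd)":/data whisper-local audio.mp3 --model medium
ENTRYPOINT ["whisper"]
```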