
Deploying the Whisper Speech Recognition Model Locally

What is Whisper?

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

Deployment Methods

Installing with pip

We used Python 3.9.9 and PyTorch 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.8-3.11 and recent PyTorch versions. The codebase also depends on a few Python packages, most notably OpenAI's tiktoken for their fast tokenizer implementation. You can download and install (or update to) the latest release of Whisper with the following command:

```bash
pip install -U openai-whisper
```

Alternatively, the following command will pull and install the latest commit from this repository, along with its Python dependencies:

```bash
pip install git+https://github.com/openai/whisper.git
```

To update the package to the latest version of this repository, please run:

```bash
pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
```

It also requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```
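
To confirm that the tool is actually reachable on your PATH after installation, you can print its version (standard ffmpeg flag):

```bash
ffmpeg -version
```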

You may need Rust installed as well, in case tiktoken does not provide a pre-built wheel for your platform. If you see installation errors during the pip install command above, please follow the Getting started page to install the Rust development environment. Additionally, you may need to configure the PATH environment variable, e.g. export PATH="$HOME/.cargo/bin:$PATH". If the installation fails with No module named 'setuptools_rust', you need to install setuptools_rust, e.g. by running:

```bash
pip install setuptools-rust
```

Available Models and Supported Languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model; actual speed may vary depending on many factors including the available hardware.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny   | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           |
| base   | 74 M       | base.en            | base               | ~1 GB         | ~16x           |
| small  | 244 M      | small.en           | small              | ~2 GB         | ~6x            |
| medium | 769 M      | medium.en          | medium             | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | large              | ~10 GB        | 1x             |

The .en models for English-only applications tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.
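
For example, if your audio is English-only and you have roughly 1 GB of free VRAM, tiny.en is a reasonable starting point; with the command-line interface described in the next section (the file name here is hypothetical), that is simply:

```bash
whisper meeting.wav --model tiny.en
```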

Command-line Usage

The following command will transcribe speech in audio files, using the medium model:

```bash
whisper audio.flac audio.mp3 audio.wav --model medium
```

The default setting (which selects the small model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the --language option:

```bash
whisper japanese.wav --language Japanese
```

Adding --task translate will translate the speech into English:

```bash
whisper japanese.wav --language Japanese --task translate
```

Run the following to view all available options:

```bash
whisper --help
```

See tokenizer.py for the list of all available languages:

```python
LANGUAGES = {
    "en": "english",
    "zh": "chinese",
    "de": "german",
    "es": "spanish",
    "ru": "russian",
    "ko": "korean",
    "fr": "french",
    "ja": "japanese",
    "pt": "portuguese",
    "tr": "turkish",
    "pl": "polish",
    "ca": "catalan",
    "nl": "dutch",
    "ar": "arabic",
    "sv": "swedish",
    "it": "italian",
    "id": "indonesian",
    "hi": "hindi",
    "fi": "finnish",
    "vi": "vietnamese",
    "he": "hebrew",
    "uk": "ukrainian",
    "el": "greek",
    "ms": "malay",
    "cs": "czech",
    "ro": "romanian",
    "da": "danish",
    "hu": "hungarian",
    "ta": "tamil",
    "no": "norwegian",
    "th": "thai",
    "ur": "urdu",
    "hr": "croatian",
    "bg": "bulgarian",
    "lt": "lithuanian",
    "la": "latin",
    "mi": "maori",
    "ml": "malayalam",
    "cy": "welsh",
    "sk": "slovak",
    "te": "telugu",
    "fa": "persian",
    "lv": "latvian",
    "bn": "bengali",
    "sr": "serbian",
    "az": "azerbaijani",
    "sl": "slovenian",
    "kn": "kannada",
    "et": "estonian",
    "mk": "macedonian",
    "br": "breton",
    "eu": "basque",
    "is": "icelandic",
    "hy": "armenian",
    "ne": "nepali",
    "mn": "mongolian",
    "bs": "bosnian",
    "kk": "kazakh",
    "sq": "albanian",
    "sw": "swahili",
    "gl": "galician",
    "mr": "marathi",
    "pa": "punjabi",
    "si": "sinhala",
    "km": "khmer",
    "sn": "shona",
    "yo": "yoruba",
    "so": "somali",
    "af": "afrikaans",
    "oc": "occitan",
    "ka": "georgian",
    "be": "belarusian",
    "tg": "tajik",
    "sd": "sindhi",
    "gu": "gujarati",
    "am": "amharic",
    "yi": "yiddish",
    "lo": "lao",
    "uz": "uzbek",
    "fo": "faroese",
    "ht": "haitian creole",
    "ps": "pashto",
    "tk": "turkmen",
    "nn": "nynorsk",
    "mt": "maltese",
    "sa": "sanskrit",
    "lb": "luxembourgish",
    "my": "myanmar",
    "bo": "tibetan",
    "tl": "tagalog",
    "mg": "malagasy",
    "as": "assamese",
    "tt": "tatar",
    "haw": "hawaiian",
    "ln": "lingala",
    "ha": "hausa",
    "ba": "bashkir",
    "jw": "javanese",
    "su": "sundanese",
    "yue": "cantonese",
}
```

Using Whisper in Python

Transcription can also be performed within Python:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```
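
transcribe() also accepts decoding options as keyword arguments. As a sketch (the file name is hypothetical), you can pin the language and disable FP16 when running on a CPU:

```python
import whisper

model = whisper.load_model("base")
# language pins the decoder's language; fp16=False avoids the FP16-on-CPU warning.
result = model.transcribe("japanese.wav", language="Japanese", fp16=False)
print(result["text"])
```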

Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.

Below is an example usage of whisper.detect_language() and whisper.decode(), which provide lower-level access to the model:

```python
import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds 
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
```
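
DecodingOptions takes the same knobs as the CLI flags. Continuing from the snippet above (a sketch; fp16=False is only needed on CPU-only machines), this mirrors --task translate:

```python
# Translate the detected speech into English instead of transcribing it.
options = whisper.DecodingOptions(task="translate", fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```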

Deploying with a Container

Container Dockerfile

```dockerfile
FROM python:3.10.9-slim

# Use a local apt mirror (sources.list is expected next to the Dockerfile).
COPY sources.list /etc/apt/sources.list
RUN apt-get update && apt-get install -y ffmpeg && apt-get clean

# Use the Tsinghua PyPI mirror to speed up pip installs.
RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

RUN mkdir -p /app/models && mkdir -p /app/worker
WORKDIR /app

COPY requirements.txt /app/

RUN python -m pip install --upgrade pip
RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 install -U openai-whisper

# Copy the application code into the container.
COPY . /app

ENTRYPOINT ["python3"]

CMD ["-u", "-m", "ars_worker"]
```
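
A sketch of building and running the image, assuming sources.list, requirements.txt, and the ars_worker module sit next to the Dockerfile (the image tag is arbitrary):

```bash
# Build the worker image from the directory containing the Dockerfile.
docker build -t whisper-worker .

# Run it; ENTRYPOINT plus CMD expand to: python3 -u -m ars_worker
docker run --rm whisper-worker
```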