Typo Detector with OpenVINO™

This Jupyter notebook can be launched only after a local installation.


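Typo detection in AI is the process of identifying and correcting typographical errors in text data using machine learning algorithms. Its goal is to improve the accuracy, readability, and usability of text by identifying and flagging mistakes made during writing. AI-based typo detectors draw on techniques such as natural language processing (NLP), machine learning (ML), and deep learning (DL).

A typo detector takes a sentence as input and identifies all typographical errors in it, such as misspellings and homophone errors.

This tutorial shows how to perform this task with the Typo Detector model from the Hugging Face Transformers library in an OpenVINO environment.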

The model detects typos in a given text with high accuracy. Its reported performance is:

- Precision score of 0.9923
- Recall score of 0.9859
- F1 score of 0.9891

Source for the above metrics

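These metrics indicate that the model correctly identifies a high proportion of both correct and incorrect text, keeping false positives and false negatives low.

The model has been pretrained on the NeuSpell dataset.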
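As a quick consistency check on the reported figures, the F1 score is the harmonic mean of precision and recall. The snippet below is a minimal sketch that recomputes it from the two published values; the function name `f1_from` is only illustrative.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values reported for the typo detector model
precision = 0.9923
recall = 0.9859

f1 = f1_from(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.9891
```

The recomputed value agrees with the reported F1 score to four decimal places.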


%pip install -q "diffusers>=0.17.1" "openvino>=2023.1.0" "nncf>=2.5.0" "gradio" "onnx>=1.11.0" "transformers>=4.33.0" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "git+https://github.com/huggingface/optimum-intel.git"

Imports

from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification, pipeline
from pathlib import Path
import numpy as np
import re
from typing import List, Dict
import time

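Methods

This notebook provides two ways to run typo-detector inference with OpenVINO Runtime. You can call the Optimum API, which has OpenVINO Runtime built in, or you can load the model in another framework, convert it to OpenVINO IR format, and run inference with OpenVINO Runtime yourself.

1. Using the Hugging Face Optimum library

The Hugging Face Optimum API is a high-level API that can convert models from the Hugging Face Transformers library to the OpenVINO™ IR format. Models compiled in OpenVINO IR format can be loaded using Optimum, which also applies optimizations for the target hardware.

2. Converting the model to OpenVINO IR

The PyTorch model is converted to OpenVINO IR format. This method gives more detail on how to set up the pipeline, from loading and converting the model to compiling it and running inference with OpenVINO, so the same steps can be used to optimize and accelerate inference for other deep learning models. Optimizations for the target hardware are applied here as well.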

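The following table summarizes the main differences between the two methods:

| Method 1 | Method 2 |
|---|---|
| Load the model from Optimum, an extension of Transformers | Load the model from Transformers |
| Load the model in OpenVINO IR format on the fly | Convert the model to OpenVINO IR |
| Load the compiled model by default | Compile the OpenVINO IR and run inference with OpenVINO Runtime |
| Create a pipeline to run inference with OpenVINO Runtime | Run inference manually |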

Select inference device

Select a device from the dropdown list to run inference with OpenVINO.

import ipywidgets as widgets
import openvino as ov

core = ov.Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

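1. Hugging Face Optimum Intel library

This method requires installing the Hugging Face Optimum Intel library, which is accelerated by the OpenVINO integration.

Optimum Intel can be used to load optimized models from the Hugging Face Hub and to create pipelines that run inference with OpenVINO Runtime through the Hugging Face APIs. Optimum inference models are API-compatible with Hugging Face Transformers models, so you only need to replace the AutoModelForXxx class with the corresponding OVModelForXxx class.

Import the required model class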

from optimum.intel.openvino import OVModelForTokenClassification

Load the model

Load the relevant pretrained model with the OVModelForTokenClassification class. To load a Transformers model and convert it to the OpenVINO format on the fly, set export=True when loading the model.

# The pretrained model we are using
model_id = "m3hrdadfi/typo-detector-distilbert-en"

model_dir = Path("optimum_model")

# Save the model to the path if not existing
if model_dir.exists():
    model = OVModelForTokenClassification.from_pretrained(model_dir, device=device.value)
else:
    model = OVModelForTokenClassification.from_pretrained(model_id, export=True, device=device.value)
    model.save_pretrained(model_dir)

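Load the tokenizer

Text preprocessing cleans text-based input data so that it can be fed into the model. Tokenization splits paragraphs and sentences into smaller units that can more easily be assigned meaning. It involves cleaning the data and assigning tokens or IDs to words, so that words are represented in a vector space where similar words have similar vectors. This helps the model understand the context of a sentence. Here, we use AutoTokenizer from Hugging Face, which loads a pretrained tokenizer.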
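To make the idea of subword tokens and IDs concrete, here is a minimal sketch of a greedy longest-match-first split in the WordPiece style, where continuation pieces are prefixed with "##". The vocabulary `TOY_VOCAB` and the functions `wordpiece` and `encode` are hypothetical and exist only for illustration; the real tokenizer loaded by AutoTokenizer ships its own, much larger vocabulary and additional normalization steps.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary (assumption, for illustration only)
TOY_VOCAB: Dict[str, int] = {
    "[UNK]": 0, "he": 1, "had": 2, "st": 3, "##rug": 4, "##gled": 5, "##g": 6,
}


def wordpiece(word: str, vocab: Dict[str, int]) -> List[str]:
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text: str, vocab: Dict[str, int]) -> Tuple[List[str], List[int]]:
    """Lowercase, split on whitespace, and map every piece to its ID."""
    tokens: List[str] = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens, [vocab[token] for token in tokens]


tokens, ids = encode("He had stgruggled", TOY_VOCAB)
print(tokens)  # ['he', 'had', 'st', '##g', '##rug', '##gled']
print(ids)     # [1, 2, 3, 6, 4, 5]
```

A misspelled word typically falls apart into several pieces, which is why the per-token predictions later need to be mapped back to whole words.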

tokenizer = AutoTokenizer.from_pretrained(model_id)

Next, use the inference pipeline for the token-classification task. For more details on using Hugging Face inference pipelines, see this tutorial.

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="average")
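The pipeline call above passes aggregation_strategy="average", which, as documented for the Transformers token-classification pipeline, averages scores across the tokens of a word before choosing the label. The following is a simplified sketch of that idea in plain Python, not the library's implementation. The function `average_by_word`, the score values, and the word IDs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def average_by_word(
    token_scores: List[List[float]], word_ids: List[int]
) -> List[Tuple[int, int, float]]:
    """Average per-label scores over the tokens of each word, then pick the best label."""
    grouped: Dict[int, List[List[float]]] = defaultdict(list)
    for scores, word_id in zip(token_scores, word_ids):
        grouped[word_id].append(scores)

    results = []
    for word_id in sorted(grouped):
        rows = grouped[word_id]
        # Column-wise mean: one averaged score per label
        mean = [sum(column) / len(rows) for column in zip(*rows)]
        label = mean.index(max(mean))
        results.append((word_id, label, mean[label]))
    return results


# Hypothetical scores over the labels [no typo, typo] for three tokens;
# the last two tokens belong to the same word.
token_scores = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
word_ids = [0, 1, 1]

words = average_by_word(token_scores, word_ids)
for word_id, label, score in words:
    print(word_id, label, round(score, 2))
```

Aggregating at the word level is what lets the pipeline return one span per misspelled word rather than one per subword token.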

Function to find typos in a sentence and write them to the terminal

def show_typos(sentence: str):
    """
    Detect typos from the given sentence.
    Writes both the original input and typo-tagged version to the terminal.

    Arguments:
    sentence -- Sentence to be evaluated (string)
    """

    typos = [sentence[r["start"]: r["end"]] for r in nlp(sentence)]

    detected = sentence
    for typo in typos:
        detected = detected.replace(typo, f'<i>{typo}</i>')

    print("[Input]: ", sentence)
    print("[Detected]: ", detected)
    print("-" * 130)
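The function above tags typos with str.replace, which replaces every occurrence of the typo string, including occurrences inside other words. Since the pipeline results already carry "start" and "end" character offsets, one alternative is to wrap exactly those spans. The sketch below shows this with a hypothetical helper `highlight_spans` and hand-written offsets standing in for pipeline output.

```python
from typing import List, Tuple


def highlight_spans(sentence: str, spans: List[Tuple[int, int]]) -> str:
    """Wrap each (start, end) character span in <i>...</i>, leaving other text untouched."""
    parts: List[str] = []
    cursor = 0
    for start, end in sorted(spans):
        parts.append(sentence[cursor:start])           # text before the span
        parts.append(f"<i>{sentence[start:end]}</i>")  # the tagged span itself
        cursor = end
    parts.append(sentence[cursor:])                    # remaining tail
    return "".join(parts)


sentence = "I wrote te letter ."
spans = [(8, 10)]  # hypothetical offsets of the typo "te"

by_offsets = highlight_spans(sentence, spans)
by_replace = sentence.replace("te", "<i>te</i>")

print(by_offsets)  # I wrote <i>te</i> letter .
print(by_replace)  # I wro<i>te</i> <i>te</i> let<i>te</i>r .
```

For the demo sentences in this notebook both approaches give the same result; the difference only shows up when a typo string also occurs inside other words. The sketch assumes the spans do not overlap.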

Let's run a demo using the Hugging Face Optimum API.

sentences = [
    "He had also stgruggled with addiction during his time in Congress .",
    "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
    "Letterma also apologized two his staff for the satyation .",
    "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
    "It is left to the directors to figure out hpw to bring the stry across to tye audience .",
    "I wnet to the park yestreday to play foorball with my fiends, but it statred to rain very hevaily and we had to stop.",
    "My faorite restuarant servs the best spahgetti in the town, but they are always so buzy that you have to make a resrvation in advnace.",
    "I was goig to watch a mvoie on Netflx last night, but the straming was so slow that I decided to cancled my subscrpition.",
    "My freind and I went campign in the forest last weekend and saw a beutiful sunst that was so amzing it took our breth away.",
    "I  have been stuying for my math exam all week, but I'm stil not very confidet that I will pass it, because there are so many formuals to remeber."
]

start = time.time()

for sentence in sentences:
    show_typos(sentence)

print(f"Time elapsed: {time.time() - start}")
[Input]:  He had also stgruggled with addiction during his time in Congress .
[Detected]:  He had also <i>stgruggled</i> with addiction during his time in Congress .
----------------------------------------------------------------------------------------------------------------------------------
[Input]:  The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .
[Detected]:  The review <i>thoroughla</i> assessed all aspects of JLENS SuR and CPG <i>esign maturit</i> and confidence .
----------------------------------------------------------------------------------------------------------------------------------
[Input]:  Letterma also apologized two his staff for the satyation .
[Detected]:  <i>Letterma</i> also apologized <i>two</i> his staff for the <i>satyation</i> .
----------------------------------------------------------------------------------------------------------------------------------
[Input]:  Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .
[Detected]:  Vincent Jay had earlier won France 's first gold in <i>gthe</i> 10km biathlon sprint .
----------------------------------------------------------------------------------------------------------------------------------
[Input]:  It is left to the directors to figure out hpw to bring the stry across to tye audience .
[Detected]:  It is left to the directors to figure out <i>hpw</i> to bring the <i>stry</i> across to <i>tye</i> audience .
----------------------------------------------------------------------------------------------------------------------------------
[Input]:  I wnet to the park yestreday to play foorball with my fiends, but it statred to rain very hevaily and we had to stop.
[Detected]:  I <i>wnet</i> to the park <i>yestreday</i> to play <i>foorball</i> with my <i>fiends</i>, but it <i>statred</i> to rain very <i>hevaily</i> and we had to stop.
----------------------------------------------------------------------------------------------------------------------------------
[Input]:  My faorite restuarant servs the best spahgetti in the town, but they are always so buzy that you have to make a resrvation in advnace.
[Detected]:  My <i>faorite restuarant servs</i> the best <i>spahgetti</i> in the town, but they are always so <i>buzy</i> that you have to make a <i>resrvation</i> in <i>advnace</i>.
----------------------------------------------------------------------------------------------------------------------------------
[Input]:  I was goig to watch a mvoie on Netflx last night, but the straming was so slow that I decided to cancled my subscrpition.
[Detected]:  I was <i>goig</i> to watch a <i>mvoie</i> on <i>Netflx</i> last night, but the <i>straming</i> was so slow that I decided to <i>cancled</i> my <i>subscrpition</i>.
----------------------------------------------------------------------------------------------------------------------------------
[Input]:  My freind and I went campign in the forest last weekend and saw a beutiful sunst that was so amzing it took our breth away.
[Detected]:  My <i>freind</i> and I went <i>campign</i> in the forest last weekend and saw a <i>beutiful sunst</i> that was so <i>amzing</i> it took our <i>breth</i> away.
----------------------------------------------------------------------------------------------------------------------------------
[Input]:  I  have been stuying for my math exam all week, but I'm stil not very confidet that I will pass it, because there are so many formuals to remeber.
[Detected]:  I  have been <i>stuying</i> for my math exam all week, but I'm <i>stil</i> not very <i>confidet</i> that I will pass it, because there are so many formuals to <i>remeber</i>.
----------------------------------------------------------------------------------------------------------------------------------
Time elapsed: 0.15669655799865723

2. Converting the model to OpenVINO IR

Load the PyTorch model

Use the AutoModelForTokenClassification class to load the pretrained PyTorch model.

model_id = "m3hrdadfi/typo-detector-distilbert-en"
model_dir = Path("pytorch_model")

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

# Save the model to the path if not existing
if model_dir.exists():
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
else:
    model = AutoModelForTokenClassification.from_pretrained(model_id, config=config)
    model.save_pretrained(model_dir)

Convert to OpenVINO IR

ov_model_path = Path(model_dir) / "typo_detect.xml"

dummy_model_input = tokenizer("This is a sample", return_tensors="pt")
ov_model = ov.convert_model(model, example_input=dict(dummy_model_input))
ov.save_model(ov_model, ov_model_path)

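Inference

The OpenVINO™ Runtime Python API is used to compile the model in OpenVINO IR format. The Core class from the openvino module provides access to the OpenVINO Runtime API. The core object, an instance of the Core class, represents the API and is used to compile the model. The output layer is extracted from the compiled model because it is needed for inference.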

compiled_model = core.compile_model(ov_model, device.value)
output_layer = compiled_model.output(0)
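The output layer yields one row of scores per input token, including the special tokens that the tokenizer adds at the start and end of the sequence. The helper functions below drop those two rows and take the argmax of each remaining row, treating label 1 as "typo". Here is a minimal sketch of that decoding step in plain Python; the function `typo_token_positions` and the logit values are hypothetical.

```python
from typing import List


def typo_token_positions(logits: List[List[float]]) -> List[int]:
    """Return positions (after removing special tokens) whose highest score is label 1."""
    inner = logits[1:-1]  # drop the rows for the leading and trailing special tokens
    return [
        position
        for position, row in enumerate(inner)
        if row.index(max(row)) == 1  # argmax over the two labels
    ]


# Hypothetical logits for: [special], "he", "wnet", "home", [special]
logits = [
    [5.0, -5.0],
    [3.2, -2.9],
    [-1.5, 2.4],
    [4.0, -3.0],
    [5.0, -5.0],
]

positions = typo_token_positions(logits)
print(positions)  # [1]
```

Position 1 corresponds to the second real token, so the indexes line up with the output of tokenizer.tokenize, which does not include special tokens.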

Helper functions

def token_to_words(tokens: List[str]) -> Dict[str, int]:
    """
    Maps the list of tokens to words in the original text.
    Relies on the convention that tokens starting with '##' are continuations of the previous token and belong to the same word.

    Arguments:
    tokens -- List of tokens

    Returns:
    map_to_words -- Dictionary mapping tokens to words in original text
    """

    word_count = -1
    map_to_words = {}
    for token in tokens:
        if token.startswith('##'):
            map_to_words[token] = word_count
            continue
        word_count += 1
        map_to_words[token] = word_count
    return map_to_words


def infer(input_text: str) -> np.ndarray:
    """
    Generic inference function that tokenizes the input and runs the compiled model.

    Arguments:
    input_text -- The text to run inference on (string)

    Returns:
    result -- Array of per-token scores from the output layer
    """

    tokens = tokenizer(
        input_text,
        return_tensors="np",
    )
    inputs = dict(tokens)
    result = compiled_model(inputs)[output_layer]
    return result


def get_typo_indexes(result: np.ndarray, map_to_words: Dict[str, int], tokens: List[str]) -> List[int]:
    """
    Given results from the inference and tokens-map-to-words, identifies the indexes of the words with typos.

    Arguments:
    result -- Result from inference (array of per-token scores)
    map_to_words -- Dictionary mapping tokens to words (Dictionary)
    tokens -- List of tokens, in the same order as the inference result

    Returns:
    wrong_words -- List of indexes of words with typos
    """

    wrong_words = []
    c = 0
    result_list = result[0][1:-1]
    for i in result_list:
        prob = np.argmax(i)
        if prob == 1:
            if map_to_words[tokens[c]] not in wrong_words:
                wrong_words.append(map_to_words[tokens[c]])
        c += 1
    return wrong_words


def sentence_split(sentence: str) -> List[str]:
    """
    Split the sentence into words and punctuation characters.

    Arguments:
    sentence -- Sentence to be split (string)

    Returns:
    splitted -- List of words and punctuation characters
    """

    # The capturing group keeps the delimiters; spaces and empty strings are dropped afterwards
    splitted = re.split("([',. ])", sentence)
    splitted = [x for x in splitted if x != " " and x != ""]
    return splitted


def show_typos(sentence: str):
    """
    Detect typos from the given sentence.
    Writes both the original input and typo-tagged version to the terminal.

    Arguments:
    sentence -- Sentence to be evaluated (string)
    """

    tokens = tokenizer.tokenize(sentence)
    map_to_words = token_to_words(tokens)
    result = infer(sentence)
    typo_indexes = get_typo_indexes(result, map_to_words, tokens)

    sentence_words = sentence_split(sentence)

    typos = [sentence_words[i] for i in typo_indexes]

    detected = sentence
    for typo in typos:
        detected = detected.replace(typo, f'<i>{typo}</i>')

    print("   [Input]: ", sentence)
    print("[Detected]: ", detected)
    print("-" * 130)
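One detail of the helpers above is worth noting: token_to_words keys its mapping by token string, so when the same token occurs more than once in a sentence, the later occurrence overwrites the earlier entry. A possible alternative, shown here only as a sketch, is to key the mapping by token position instead. The function `token_positions_to_words` is hypothetical and is not part of the notebook.

```python
from typing import Dict, List


def token_positions_to_words(tokens: List[str]) -> List[int]:
    """Return, for each token position, the index of the word that token belongs to."""
    word_index = -1
    mapping: List[int] = []
    for token in tokens:
        if not token.startswith("##"):
            word_index += 1  # a token without '##' starts a new word
        mapping.append(word_index)
    return mapping


tokens = ["to", "go", "to", "st", "##ry"]

by_position = token_positions_to_words(tokens)
print(by_position)  # [0, 1, 2, 3, 3]

# For comparison, a mapping keyed by token string, built the same way
by_string: Dict[str, int] = {}
word_index = -1
for token in tokens:
    if not token.startswith("##"):
        word_index += 1
    by_string[token] = word_index
print(by_string["to"])  # 2, for both occurrences of "to"
```

With the position-based list, a typo flagged at token position c maps to word by_position[c], regardless of whether the same token text appears elsewhere in the sentence.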

Let's run a demo using the converted OpenVINO IR model.

sentences = [
    "He had also stgruggled with addiction during his time in Congress .",
    "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
    "Letterma also apologized two his staff for the satyation .",
    "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
    "It is left to the directors to figure out hpw to bring the stry across to tye audience .",
    "I wnet to the park yestreday to play foorball with my fiends, but it statred to rain very hevaily and we had to stop.",
    "My faorite restuarant servs the best spahgetti in the town, but they are always so buzy that you have to make a resrvation in advnace.",
    "I was goig to watch a mvoie on Netflx last night, but the straming was so slow that I decided to cancled my subscrpition.",
    "My freind and I went campign in the forest last weekend and saw a beutiful sunst that was so amzing it took our breth away.",
    "I  have been stuying for my math exam all week, but I'm stil not very confidet that I will pass it, because there are so many formuals to remeber."
]

start = time.time()

for sentence in sentences:
    show_typos(sentence)

print(f"Time elapsed: {time.time() - start}")
   [Input]:  He had also stgruggled with addiction during his time in Congress .
[Detected]:  He had also <i>stgruggled</i> with addiction during his time in Congress .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .
[Detected]:  The review <i>thoroughla</i> assessed all aspects of JLENS SuR and CPG <i>esign</i> <i>maturit</i> and confidence .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  Letterma also apologized two his staff for the satyation .
[Detected]:  <i>Letterma</i> also apologized <i>two</i> his staff for the <i>satyation</i> .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .
[Detected]:  Vincent Jay had earlier won France 's first gold in <i>gthe</i> 10km biathlon sprint .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  It is left to the directors to figure out hpw to bring the stry across to tye audience .
[Detected]:  It is left to the directors to figure out <i>hpw</i> to bring the <i>stry</i> across to <i>tye</i> audience .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  I wnet to the park yestreday to play foorball with my fiends, but it statred to rain very hevaily and we had to stop.
[Detected]:  I <i>wnet</i> to the park <i>yestreday</i> to play <i>foorball</i> with my <i>fiends</i>, but it <i>statred</i> to rain very <i>hevaily</i> and we had to stop.
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  My faorite restuarant servs the best spahgetti in the town, but they are always so buzy that you have to make a resrvation in advnace.
[Detected]:  My <i>faorite</i> <i>restuarant</i> <i>servs</i> the best <i>spahgetti</i> in the town, but they are always so <i>buzy</i> that you have to make a <i>resrvation</i> in <i>advnace</i>.
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  I was goig to watch a mvoie on Netflx last night, but the straming was so slow that I decided to cancled my subscrpition.
[Detected]:  I was <i>goig</i> to watch a <i>mvoie</i> on <i>Netflx</i> last night, but the <i>straming</i> was so slow that I decided to <i>cancled</i> my <i>subscrpition</i>.
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  My freind and I went campign in the forest last weekend and saw a beutiful sunst that was so amzing it took our breth away.
[Detected]:  My <i>freind</i> and I went <i>campign</i> in the forest last weekend and saw a <i>beutiful</i> <i>sunst</i> that was so <i>amzing</i> it took our <i>breth</i> away.
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  I  have been stuying for my math exam all week, but I'm stil not very confidet that I will pass it, because there are so many formuals to remeber.
[Detected]:  I  have been <i>stuying</i> for my math exam all week, but I'm <i>stil</i> not very <i>confidet</i> that I will pass it, because there are so many formuals to <i>remeber</i>.
----------------------------------------------------------------------------------------------------------------------------------
Time elapsed: 0.09725761413574219