Text Detection C++ Demo#

This demo shows how to use neural networks to detect and recognize printed text rotated at any angle in various environments. You can use the following pre-trained models with the demo:

text-detection-0003, a detection network for finding text.
text-detection-0004, a lightweight detection network for finding text.
horizontal-text-detection-0001, a detection network that runs much faster than the models above but can only find horizontal text.
text-recognition-0012, a recognition network for recognizing text.
text-recognition-0014, a recognition network for recognizing text. For this model, add the -tr_pt_first option and specify the output layer name with -tr_o_blb_nm "logits" (see the model description for details).
text-recognition-0015, a recognition network for recognizing text. Add the options -tr_pt_first, -m_tr_ss "?0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" (the supported symbol set), -tr_o_blb_nm "decoder_output" (the output name), and -dt simple (the decoder type). You can also specify -lower to convert the predicted text to lowercase. See the model description for details.
text-recognition-0016, a recognition network for recognizing text. Add the options -tr_pt_first, -m_tr_ss "?0123456789abcdefghijklmnopqrstuvwxyz" (the supported symbol set), -tr_o_blb_nm "decoder_output" (the output name), and -dt simple (the decoder type). You can also specify -lower to convert the predicted text to lowercase. See the model description for details.
text-recognition-resnet-fc, a recognition network for recognizing text. Add the options -tr_pt_first and -dt simple (the decoder type).
handwritten-score-recognition-0003, a recognition network for recognizing handwritten score marks such as <digit> or <digit>.<digit>. Add the options -m_tr_ss "0123456789._" (the supported symbol set) and -dt ctc (the decoder type).
vitstr-small-patch16-224, a recognition network for recognizing text. Add the options -tr_pt_first, -m_tr_ss <path to vocab file>/.vocab.txt (the supported symbol set), -dt simple (the decoder type), -start_index 1 (to process the output starting from the given index), and -pad " " (to use a specific pad symbol).
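
To illustrate what the -dt simple, -start_index, and -pad options control, here is a minimal sketch of a simple decoder. It assumes the recognition output is a row-major matrix with one row of per-symbol scores for each position, that decoding picks the highest-scoring symbol at each position beginning at the start index, and that it stops at the first pad symbol. The function name simpleDecode and these exact semantics are assumptions made for illustration, not the demo's actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, not part of the demo sources.
// scores: row-major matrix with alphabet.size() scores per position.
// The alphabet is assumed to contain the pad symbol itself.
std::string simpleDecode(const std::vector<float>& scores,
                         const std::string& alphabet,
                         char pad,
                         std::size_t startIndex) {
    std::string text;
    const std::size_t numClasses = alphabet.size();
    if (numClasses == 0) return text;
    const std::size_t steps = scores.size() / numClasses;
    for (std::size_t t = startIndex; t < steps; ++t) {
        const auto rowBegin = scores.begin() + static_cast<std::ptrdiff_t>(t * numClasses);
        const auto rowEnd = rowBegin + static_cast<std::ptrdiff_t>(numClasses);
        // Index of the highest-scoring symbol at this position.
        const std::size_t best = static_cast<std::size_t>(
            std::distance(rowBegin, std::max_element(rowBegin, rowEnd)));
        const char symbol = alphabet[best];
        if (symbol == pad) break;  // assumption: the pad symbol ends the text
        text += symbol;
    }
    return text;
}
```

With this sketch, a start index of 1 simply skips the first output position, which is the kind of behavior -start_index 1 suggests for vitstr-small-patch16-224.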

How It Works#

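On startup, the application reads the command-line parameters and loads the models to the OpenVINO™ Runtime plugin. When it receives an image, it runs text detection inference and prints the result as four points (x1, y1), (x2, y2), (x3, y3), (x4, y4) for each text bounding box.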

If a text recognition model is provided, the demo also prints the recognized text.
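
Because detected text can be rotated, each box is described by four corner points rather than by a single origin and size. The following sketch shows the geometry under stated assumptions: the box is given as a center, a size, and a rotation angle in degrees, and the corners are listed starting from the top-left corner of the unrotated box. The Point struct, the boxCorners name, the corner order, and the rotation direction are all hypothetical choices for illustration and may differ from what the demo prints.

```cpp
#include <array>
#include <cmath>

struct Point {
    float x;
    float y;
};

// Hypothetical helper: corners of a width x height box centered at (cx, cy)
// and rotated by angleDeg degrees around its center.
std::array<Point, 4> boxCorners(float cx, float cy, float width, float height,
                                float angleDeg) {
    const float kPi = 3.14159265358979f;
    const float angle = angleDeg * kPi / 180.0f;
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    const float hw = width / 2.0f;
    const float hh = height / 2.0f;
    // Corner offsets of the unrotated box: top-left, top-right,
    // bottom-right, bottom-left (assumed order).
    const float dx[4] = {-hw, hw, hw, -hw};
    const float dy[4] = {-hh, -hh, hh, hh};
    std::array<Point, 4> corners{};
    for (int i = 0; i < 4; ++i) {
        // Standard 2D rotation of the offset, then translation to the center.
        corners[i].x = cx + dx[i] * c - dy[i] * s;
        corners[i].y = cy + dx[i] * s + dy[i] * c;
    }
    return corners;
}
```

For an angle of 0 the four points reduce to the corners of an ordinary axis-aligned rectangle.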

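NOTE: By default, Open Model Zoo demos expect input with the BGR channel order. If you trained your model to work with the RGB order, you need to either manually rearrange the default channel order in the sample or demo application, or reconvert your model using the Model Optimizer tool with the --reverse_input_channels argument specified. For more information about the argument, refer to the When to Reverse Input Channels section of Embedding Preprocessing Computation.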
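
Rearranging the channel order manually amounts to swapping the first and third value of every pixel. The sketch below shows this on a raw buffer, assuming interleaved 8-bit pixels with exactly three channels each; the reverseChannels name is hypothetical and the function is not part of the demo.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: converts interleaved BGR pixels to RGB (or back)
// in place by swapping the first and third channel of every pixel.
void reverseChannels(std::vector<std::uint8_t>& pixels) {
    for (std::size_t i = 0; i + 2 < pixels.size(); i += 3) {
        std::swap(pixels[i], pixels[i + 2]);
    }
}
```

When the frame is held in an OpenCV cv::Mat, cv::cvtColor with cv::COLOR_BGR2RGB performs the same conversion.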

Preparing to Run#

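For demo input image or video files, refer to the Media Files Available for Demos section in the Open Model Zoo Demos Overview. The list of models supported by the demo is in the <omz_dir>/demos/text_detection_demo/cpp/models.lst file. You can pass this file as a parameter to the Model Downloader and Converter to download the models and, if necessary, convert them to the OpenVINO IR format (*.xml + *.bin).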

An example of using the Model Downloader:

omz_downloader --list models.lst

An example of using the Model Converter:

omz_converter --list models.lst

Supported Models#

handwritten-score-recognition-0003
horizontal-text-detection-0001
text-detection-0003
text-detection-0004
decoder_type = ctc
- text-recognition-0012
- text-recognition-0014
decoder_type = simple
- text-recognition-0015
- text-recognition-0016
- text-recognition-resnet-fc
- vitstr-small-patch16-224
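
The two decoder types differ in how they turn per-step scores into text. As a rough illustration of the ctc type, the sketch below implements the standard CTC greedy rule: take the best symbol at each time step, collapse consecutive repeats, and drop the blank (pad) symbol. It assumes a row-major score matrix and an alphabet string that already contains the blank at blankIndex; the ctcGreedyDecode name and this data layout are assumptions for illustration, not the demo's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, not part of the demo sources.
// scores: row-major matrix with alphabet.size() scores per time step.
std::string ctcGreedyDecode(const std::vector<float>& scores,
                            const std::string& alphabet,
                            std::size_t blankIndex) {
    std::string text;
    const std::size_t numClasses = alphabet.size();
    if (numClasses == 0) return text;
    const std::size_t steps = scores.size() / numClasses;
    std::size_t previous = numClasses;  // sentinel: no symbol seen yet
    for (std::size_t t = 0; t < steps; ++t) {
        const auto rowBegin = scores.begin() + static_cast<std::ptrdiff_t>(t * numClasses);
        const auto rowEnd = rowBegin + static_cast<std::ptrdiff_t>(numClasses);
        const std::size_t best = static_cast<std::size_t>(
            std::distance(rowBegin, std::max_element(rowBegin, rowEnd)));
        // Emit a symbol only if it is not the blank and not a repeat
        // of the previous time step.
        if (best != blankIndex && best != previous) {
            text += alphabet[best];
        }
        previous = best;
    }
    return text;
}
```

A blank between two identical symbols is what lets a doubled letter survive the collapse step. Whether the blank sits first or last in the alphabet depends on the model, which is what the -tr_pt_first option communicates to the demo.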

NOTE: For details on model inference support on different devices, refer to the tables Intel's Pre-Trained Models Device Support and Public Pre-Trained Models Device Support.

Running#

Running the application with the -h option yields the following usage message:

text_detection_demo [OPTION] 
Options: 

    -h                             Print a usage message.
    -i                             Required. An input to process. The input must be a single image, a folder of images, video file or camera id.
    -loop                          Optional. Enable reading the input in a loop.
    -o "<path>"                    Optional. Name of the output file(s) to save. Frames of odd width or height can be truncated. See https://github.com/opencv/opencv/pull/24086 
    -limit "<num>"                 Optional. Number of frames to store in output. If 0 is set, all frames are stored.
    -m_td "<path>"                 Required. Path to the Text Detection model (.xml) file.
    -m_tr "<path>"                 Required. Path to the Text Recognition model (.xml) file.
    -dt "<type>"                   Optional. Type of the decoder, either 'simple' for SimpleDecoder or 'ctc' for CTC greedy and CTC beam search decoders. Default is 'ctc'.
    -m_tr_ss "<value>" or "<path>" Optional. String or vocabulary file with symbol set for the Text Recognition model.
    -tr_pt_first                   Optional. Specifies if pad token is the first symbol in the alphabet. Default is false
    -lower                         Optional. Set this flag to convert recognized text to lowercase 
    -out_enc_hidden_name "<value>" Optional. Name of the text recognition model encoder output hidden blob 
    -out_dec_hidden_name "<value>" Optional. Name of the text recognition model decoder output hidden blob 
    -in_dec_hidden_name "<value>"  Optional. Name of the text recognition model decoder input hidden blob 
    -features_name "<value>"       Optional. Name of the text recognition model features blob 
    -in_dec_symbol_name "<value>"  Optional. Name of the text recognition model decoder input blob (prev. decoded symbol) 
    -out_dec_symbol_name "<value>" Optional. Name of the text recognition model decoder output blob (probability distribution over tokens) 
    -tr_o_blb_nm "<value>"         Optional. Name of the output blob of the model which would be used as model output. If not stated, first blob of the model would be used.
    -cc                            Optional. If it is set, then in case of absence of the Text Detector, the Text Recognition model takes a central image crop as an input, but not full frame.
    -w_td "<value>"                Optional. Input image width for Text Detection model.
    -h_td "<value>"                Optional. Input image height for Text Detection model.
    -thr "<value>"                 Optional. Specify a recognition confidence threshold. Text detection candidates with text recognition confidence below specified threshold are rejected.
    -cls_pixel_thr "<value>"       Optional. Specify a confidence threshold for pixel classification. Pixels with classification confidence below specified threshold are rejected.
    -link_pixel_thr "<value>"      Optional. Specify a confidence threshold for pixel linkage. Pixels with linkage confidence below specified threshold are not linked.
    -max_rect_num "<value>"        Optional. Maximum number of rectangles to recognize. If it is negative, number of rectangles to recognize is not limited.
    -d_td "<device>"               Optional. Specify the target device for the Text Detection model to infer on (the list of available devices is shown below). The demo will look for a suitable plugin for a specified device. By default, it is CPU.
    -d_tr "<device>"               Optional. Specify the target device for the Text Recognition model to infer on (the list of available devices is shown below). The demo will look for a suitable plugin for a specified device. By default, it is CPU.
    -no_show                       Optional. If it is true, then detected text will not be shown on image frame. By default, it is false.
    -r                             Optional. Output Inference results as raw values.
    -u                             Optional. List of monitors to show initially.
    -b                             Optional. Bandwidth for CTC beam search decoder. Default value is 0, in this case CTC greedy decoder will be used.
    -start_index                   Optional. Start index for Simple decoder. Default value is 0.
    -pad                           Optional. Pad symbol. Default value is '#'.

Running the application with an empty list of options yields the usage message given above and an error message.

For example, use the following command line to run the application:

./text_detection_demo \
    -i <path_to_image>/sample.jpg \
    -m_td <path_to_model>/text-detection-0004.xml \
    -m_tr <path_to_model>/text-recognition-0014.xml \
    -dt ctc \
    -tr_pt_first \
    -tr_o_blb_nm "logits"

For text-recognition-resnet-fc, text-recognition-0015, and text-recognition-0016, use the simple decoder with the -dt option. For the other models, use the ctc decoder (the default). For the text-recognition-0015 and text-recognition-0016 models, pass the path to the text-recognition-0015-encoder (or text-recognition-0016-encoder) model with the -m_tr key; the decoder part (for example, text-recognition-0015-decoder) is then found automatically, as in the following example:

./text_detection_demo \
    -i <path_to_image>/sample.jpg \
    -m_td <path_to_model>/text-detection-0003.xml \
    -m_tr <path_to_model>/text-recognition-0015/text-recognition-0015-encoder/<precision>/text-recognition-0015-encoder.xml \
    -dt simple \
    -tr_pt_first \
    -tr_o_blb_nm "logits" \
    -m_tr_ss "?0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

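NOTE: If you provide a single image as input, the demo processes and renders it quickly, then exits. To continuously visualize inference results on the screen, apply the -loop option, which makes the demo process the single image in a loop.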

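You can save processed results to a Motion JPEG AVI file or to separate JPEG or PNG files using the -o option:

To save processed results in an AVI file, specify the name of the output file with the avi extension, for example: -o output.avi.
To save processed results as images, specify the template name of the output image file with the jpg or png extension, for example: -o output_%03d.jpg. The actual file names are constructed from the template at runtime by replacing the printf-style placeholder %03d with the frame number, resulting in output_000.jpg, output_001.jpg, and so on. To avoid disk space overrun with a continuous input stream, such as a camera, you can limit the amount of data stored in the output files with the -limit option. The default value is 1000. To change it, apply the -limit N option, where N is the number of frames to store.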
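
The file naming and the frame limit can be sketched as follows. This is a simplified illustration that handles only the %03d placeholder and treats a limit of 0 as "store all frames", as the usage message states; the frameFileName and shouldSave names are hypothetical and the demo's own implementation may differ.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: replaces the first "%03d" in the template with the
// frame number, zero-padded to at least three digits.
std::string frameFileName(const std::string& nameTemplate, int frameNumber) {
    const std::string placeholder = "%03d";
    const std::size_t pos = nameTemplate.find(placeholder);
    if (pos == std::string::npos) return nameTemplate;  // nothing to expand
    std::ostringstream number;
    number << std::setw(3) << std::setfill('0') << frameNumber;
    return nameTemplate.substr(0, pos) + number.str() +
           nameTemplate.substr(pos + placeholder.size());
}

// Hypothetical helper: decides whether the frame with the given zero-based
// index should still be written. A limit of 0 means no limit.
bool shouldSave(unsigned frameIndex, unsigned limit) {
    return limit == 0 || frameIndex < limit;
}
```

With the default limit of 1000, frames 0 through 999 would be saved under this sketch and later frames skipped.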

NOTE: Windows* systems may not have the Motion JPEG codec installed by default. If this is the case, you can download the OpenCV FFMPEG backend using the PowerShell script provided with the OpenVINO™ install package and located at <INSTALL_DIR>/opencv/ffmpeg-download.ps1. Run the script with administrative privileges if OpenVINO™ is installed in a system-protected folder (the typical case). Alternatively, you can save results as images.

Demo Output#

The demo uses OpenCV to display the resulting frame with detections rendered as bounding boxes and text. The demo reports:

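FPS: average rate of video frame processing (frames per second).
Latency: average time required to process one frame (from reading the frame to displaying the results).
Latency for each of the following pipeline stages:
- Text detection inference: inferring the input data (images) and getting the result of the text detection model.
- Text detection postprocessing: preparing the inference result of the text detection model for output.
- Text recognition inference: inferring the input data (images) and getting the result of the text recognition model.
- Text recognition postprocessing: preparing the inference result of the text recognition model for output.
- Text crop: cropping the bounding boxes that contain text from the input image.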

You can use these metrics to measure application-level performance.
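
As a rough illustration of how the two headline metrics relate, the sketch below derives both from per-frame measurements. It assumes you have collected one latency value per frame in milliseconds, plus the total wall-clock time of the run in seconds; the Metrics struct and the summarize name are hypothetical and do not reflect how the demo collects its numbers.

```cpp
#include <numeric>
#include <vector>

struct Metrics {
    double fps;           // frames processed per second of wall-clock time
    double avgLatencyMs;  // mean time from reading a frame to showing it
};

// Hypothetical helper: summarizes a run from per-frame latencies (ms)
// and the total elapsed time (s).
Metrics summarize(const std::vector<double>& latenciesMs, double totalSeconds) {
    Metrics metrics{0.0, 0.0};
    if (latenciesMs.empty() || totalSeconds <= 0.0) return metrics;
    const double frames = static_cast<double>(latenciesMs.size());
    metrics.avgLatencyMs =
        std::accumulate(latenciesMs.begin(), latenciesMs.end(), 0.0) / frames;
    metrics.fps = frames / totalSeconds;
    return metrics;
}
```

FPS is computed from elapsed time rather than from the inverse of latency, because a pipeline that overlaps work on several frames can deliver a frame rate higher than 1000 / latency.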