Using a user dictionary with Sudachi, the morphological analyzer for Elasticsearch

2019.01.30

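Last time I set up a dictionary and synonyms for kuromoji in Elasticsearch, so this time I will try registering a dictionary for Sudachi.

★Setting up the dictionary and synonyms that matter for kuromoji search in Elasticsearch
https://colabmix.co.jp/tech-blog/elasticsearch-kuro-dictionary-synonym/

Registering synonyms works exactly as it does with kuromoji.

The only thing to watch out for is that it fails in search mode, so use normal mode instead.

The dictionary, however, is built quite differently from kuromoji: you need to create a data file.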

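Creating a user dictionary for Sudachi

The official documentation is here:
https://github.com/WorksApplications/Sudachi/blob/develop/docs/user_dict.md

To begin with, the dictionary file is defined in a very different way.

The format is CSV, with the following columns.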


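Headword (for TRIE),Left connection ID,Right connection ID,Cost,Headword (for analysis result display),POS 1,POS 2,POS 3,POS 4,POS (conjugation type),POS (conjugation form),Reading,Normalized form,Dictionary form ID,(unused),(unused),(unused),(unused)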

The first thing you notice is that each entry has a very large number of fields: 18 in all.
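With this many columns it is easy to miscount, so a small check can help. The following is a minimal Python sketch that splits one line into named fields and verifies the column count. The English field names in `FIELDS` and the function `parse_entry` are this sketch's own labels, not identifiers defined by Sudachi, and the sample line is the entry used later in this article.

```python
import csv
import io

# Hypothetical English labels for the 18 columns of the user dictionary CSV.
FIELDS = [
    "surface_trie", "left_id", "right_id", "cost", "surface_display",
    "pos1", "pos2", "pos3", "pos4", "conjugation_type", "conjugation_form",
    "reading", "normalized_form", "dictionary_form_id",
    "unused1", "unused2", "unused3", "unused4",
]


def parse_entry(line):
    """Split one CSV line into a dict keyed by FIELDS, checking the column count."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(row)}")
    return dict(zip(FIELDS, row))


entry = parse_entry(
    "きゃりーぱみゅぱみゅ,4786,4786,5000,きゃりーぱみゅぱみゅ,"
    "名詞,固有名詞,一般,*,*,*,キャリーパミュパミュ,きゃりーぱみゅぱみゅ,*,*,*,*,*"
)
print(entry["cost"], entry["pos1"], entry["reading"])
```

Running a check like this over the source file before building the dictionary can catch a missing or extra comma early.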

This time I create the following dictionary file.


きゃりーぱみゅぱみゅ,4786,4786,5000,きゃりーぱみゅぱみゅ,名詞,固有名詞,一般,*,*,*,キャリーパミュパミュ,きゃりーぱみゅぱみゅ,*,*,*,*,*

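For the left and right connection IDs and the other fields, see the documentation on GitHub.

For the cost, I set 5000: within the 5000 to 9000 range recommended for nouns, this is the value that makes the entry most likely to appear in the analysis result.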
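If you register many proper nouns, generating the lines from code avoids typing the repeated fields by hand. Here is a rough Python sketch that builds a line in the same shape as the entry above. The function name `make_proper_noun_entry` is hypothetical, the default connection ID 4786 and cost 5000 are simply the values used in this article, and the range check mirrors the recommendation quoted above rather than any rule enforced by Sudachi.

```python
def make_proper_noun_entry(surface, reading, cost=5000, conn_id=4786):
    """Build one user dictionary CSV line for a general proper noun.

    Assumes surface and reading contain no commas, so no CSV quoting is done.
    """
    if not 5000 <= cost <= 9000:
        raise ValueError("cost is outside the 5000 to 9000 range recommended for nouns")
    fields = [
        surface, str(conn_id), str(conn_id), str(cost), surface,
        "名詞", "固有名詞", "一般", "*", "*", "*",
        reading, surface,          # reading, normalized form
        "*", "*", "*", "*", "*",   # dictionary form ID and the unused columns
    ]
    return ",".join(fields)


line = make_proper_noun_entry("きゃりーぱみゅぱみゅ", "キャリーパミュパミュ")
print(line)
```

A lower cost makes the entry more likely to win during analysis, so raising it toward 9000 is the knob to turn if a registered word starts swallowing text it should not.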

Next, convert this into a data file.

The command format is as follows.


$ java -Dfile.encoding=UTF-8 -cp sudachi-XX.jar com.worksap.nlp.sudachi.dictionary.UserDictionaryBuilder system_core.dic input output.dic [comment]

The actual command I ran was the following.


$ java -Dfile.encoding=UTF-8 -cp /usr/share/elasticsearch/plugins/analysis-sudachi/sudachi-0.1.1-20181018.034406-43.jar com.worksap.nlp.sudachi.dictionary.UserDictionaryBuilder /etc/elasticsearch/sudachi/system_full.dic /etc/elasticsearch/sudachi/sample_dict.text /etc/elasticsearch/sudachi/sample_dict.dic
reading the source file... 1 words
building the trie.done
writing the trie... 1,028 bytes
writing the word-ID table... 9 bytes
writing the word parameters... 10 bytes
writing the wordInfos... 53 bytes
writing wordInfo offsets... 8 bytes

Because the command uses the plugin's jar file, adjust the paths to match your environment.
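Since the paths vary by environment, it can be convenient to assemble the command from variables. The sketch below only builds and prints the argument list in the format shown above and does not run anything. The function name `builder_command` is hypothetical, and the paths are the ones from this article.

```python
import shlex


def builder_command(jar, system_dic, source_csv, output_dic, comment=None):
    """Return the argv list for UserDictionaryBuilder in the format shown above."""
    cmd = [
        "java", "-Dfile.encoding=UTF-8", "-cp", jar,
        "com.worksap.nlp.sudachi.dictionary.UserDictionaryBuilder",
        system_dic, source_csv, output_dic,
    ]
    if comment is not None:
        cmd.append(comment)  # optional trailing [comment] argument
    return cmd


sudachi_dir = "/etc/elasticsearch/sudachi"
cmd = builder_command(
    "/usr/share/elasticsearch/plugins/analysis-sudachi/sudachi-0.1.1-20181018.034406-43.jar",
    f"{sudachi_dir}/system_full.dic",
    f"{sudachi_dir}/sample_dict.text",
    f"{sudachi_dir}/sample_dict.dic",
)
print(shlex.join(cmd))
# To actually run it, pass cmd to subprocess.run(cmd, check=True).
```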

The Sudachi settings file, /etc/elasticsearch/sudachi/sudachi.json, then looks like this.


{
    "systemDict" : "system_full.dic",
    "userDict" : ["sample_dict.dic"],
    "inputTextPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.DefaultInputTextPlugin" },
        { "class" : "com.worksap.nlp.sudachi.ProlongedSoundMarkInputTextPlugin",
          "prolongedSoundMarks": ["ー", "-", "⁓", "〜", "〰"],
          "replacementSymbol": "ー"}
    ],
    "oovProviderPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.MeCabOovProviderPlugin" },
        { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
          "oovPOS" : [ "補助記号", "一般", "*", "*", "*", "*" ],
          "leftId" : 5968,
          "rightId" : 5968,
          "cost" : 3857 }
    ],
    "pathRewritePlugin" : [
        { "class" : "com.worksap.nlp.sudachi.JoinNumericPlugin",
          "joinKanjiNumeric" : true },
        { "class" : "com.worksap.nlp.sudachi.JoinKatakanaOovPlugin",
          "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ],
          "minLength" : 3
        }
    ]
}

The userDict entry is the part that has been added.

It must be an array; otherwise you get an error.
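Because a plain string in userDict is an easy mistake to make, you may want to check the file before restarting Elasticsearch. This is a small Python sketch of such a check. The function `check_user_dict` and its error message are this sketch's own and are not the error the plugin itself reports.

```python
import json


def check_user_dict(config_text):
    """Return the userDict list from a sudachi.json string, or raise if it is not an array."""
    config = json.loads(config_text)
    user_dict = config.get("userDict")
    if user_dict is None:
        return []  # no user dictionary configured
    if not isinstance(user_dict, list):
        raise TypeError('"userDict" must be a JSON array, e.g. ["sample_dict.dic"]')
    return user_dict


good = '{"systemDict": "system_full.dic", "userDict": ["sample_dict.dic"]}'
bad = '{"systemDict": "system_full.dic", "userDict": "sample_dict.dic"}'

print(check_user_dict(good))
try:
    check_user_dict(bad)
except TypeError as err:
    print("invalid:", err)
```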

Now let's run the analysis.

First, prepare a settings file for creating the index.


{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "mode": "search",
            "discard_punctuation": true,
            "resources_path": "/etc/elasticsearch/sudachi",
            "settings_path": "/etc/elasticsearch/sudachi/sudachi.json"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
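If you want to try the same settings with a different tokenizer mode, generating the JSON from code is one option. The following Python sketch reproduces the settings above as a dict, with the mode and the Sudachi directory as parameters. The function name `build_index_settings` is hypothetical, and "normal" is simply the other mode name mentioned earlier in this article.

```python
import json


def build_index_settings(mode="search", sudachi_dir="/etc/elasticsearch/sudachi"):
    """Return the index settings shown above as a Python dict."""
    return {
        "settings": {"index": {"analysis": {
            "tokenizer": {"sudachi_tokenizer": {
                "type": "sudachi_tokenizer",
                "mode": mode,
                "discard_punctuation": True,
                "resources_path": sudachi_dir,
                "settings_path": f"{sudachi_dir}/sudachi.json",
            }},
            "analyzer": {"sudachi_analyzer": {
                "filter": [],
                "tokenizer": "sudachi_tokenizer",
                "type": "custom",
            }},
        }}}
    }


settings = build_index_settings()
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

Writing the printed JSON to a file gives you the same content as the settings file above.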

Create the index.


$ curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sample1/?pretty' -d @sample1.json
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "sample1"
}
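The same request can also be prepared from Python using only the standard library. The sketch below builds the PUT request without sending it, so it can be inspected safely. The function name `create_index_request` is hypothetical, and the empty settings passed in the example are only a placeholder for the real settings file contents.

```python
import json
import urllib.request


def create_index_request(index, settings, base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates an index."""
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/?pretty",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = create_index_request("sample1", {"settings": {}})  # placeholder settings
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req).read()
```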

Next, analyze a word.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample1/_analyze?pretty=true' -d '
> {
>   "analyzer": "sudachi_analyzer",
>   "text": "きゃりーぱみゅぱみゅ",
>   "explain": true
> }
> '
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "sudachi_tokenizer",
      "tokens" : [
        {
          "token" : "きゃりーぱみゅぱみゅ",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "word",
          "position" : 0,
          "baseForm" : "きゃりーぱみゅぱみゅ",
          "bytes" : "[e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
          "normalizedForm" : "きゃりーぱみゅぱみゅ",
          "partOfSpeech" : "名詞,固有名詞,一般,*,*,*",
          "positionLength" : 1,
          "pronunciation" : "キャリーパミュパミュ",
          "reading" : "キャリーパミュパミュ",
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}

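The word came out as a single token, exactly as registered in the dictionary.

For reference, before the dictionary was registered the same word was split apart as follows.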


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample1/_analyze?pretty=true' -d '
> {
>   "analyzer": "sudachi_analyzer",
>   "text": "きゃりーぱみゅぱみゅ",
>   "explain": true
> }
> '
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "sudachi_tokenizer",
      "tokens" : [
        {
          "token" : "き",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0,
          "baseForm" : "き",
          "bytes" : "[e3 81 8d]",
          "normalizedForm" : "き",
          "partOfSpeech" : "助動詞,*,*,*,文語助動詞-キ,終止形-一般",
          "positionLength" : 1,
          "pronunciation" : "キ",
          "reading" : "キ",
          "termFrequency" : 1
        },
        {
          "token" : "ゃ",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "word",
          "position" : 1,
          "baseForm" : "ゃ",
          "bytes" : "[e3 82 83]",
          "normalizedForm" : "ヤ",
          "partOfSpeech" : "記号,一般,*,*,*,*",
          "positionLength" : 1,
          "pronunciation" : "ャ",
          "reading" : "ャ",
          "termFrequency" : 1
        },
        {
          "token" : "り",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "word",
          "position" : 2,
          "baseForm" : "り",
          "bytes" : "[e3 82 8a]",
          "normalizedForm" : "り",
          "partOfSpeech" : "助動詞,*,*,*,文語助動詞-リ,終止形-一般",
          "positionLength" : 1,
          "pronunciation" : "リ",
          "reading" : "リ",
          "termFrequency" : 1
        },
        {
          "token" : "ー",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "word",
          "position" : 3,
          "baseForm" : "ー",
          "bytes" : "[e3 83 bc]",
          "normalizedForm" : "ー",
          "partOfSpeech" : "補助記号,一般,*,*,*,*",
          "positionLength" : 1,
          "pronunciation" : "ー",
          "reading" : "ー",
          "termFrequency" : 1
        },
        {
          "token" : "ぱ",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "word",
          "position" : 4,
          "baseForm" : "ぱ",
          "bytes" : "[e3 81 b1]",
          "normalizedForm" : "ぱっ",
          "partOfSpeech" : "副詞,*,*,*,*,*",
          "positionLength" : 1,
          "pronunciation" : "パ",
          "reading" : "パ",
          "termFrequency" : 1
        },
        {
          "token" : "み",
          "start_offset" : 5,
          "end_offset" : 6,
          "type" : "word",
          "position" : 5,
          "baseForm" : "み",
          "bytes" : "[e3 81 bf]",
          "normalizedForm" : "み",
          "partOfSpeech" : "接頭辞,*,*,*,*,*",
          "positionLength" : 1,
          "pronunciation" : "ミ",
          "reading" : "ミ",
          "termFrequency" : 1
        },
        {
          "token" : "ゅ",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "word",
          "position" : 6,
          "baseForm" : "ゅ",
          "bytes" : "[e3 82 85]",
          "normalizedForm" : "ユ",
          "partOfSpeech" : "記号,一般,*,*,*,*",
          "positionLength" : 1,
          "pronunciation" : "ュ",
          "reading" : "ュ",
          "termFrequency" : 1
        },
        {
          "token" : "ぱ",
          "start_offset" : 7,
          "end_offset" : 8,
          "type" : "word",
          "position" : 7,
          "baseForm" : "ぱ",
          "bytes" : "[e3 81 b1]",
          "normalizedForm" : "ぱっ",
          "partOfSpeech" : "副詞,*,*,*,*,*",
          "positionLength" : 1,
          "pronunciation" : "パ",
          "reading" : "パ",
          "termFrequency" : 1
        },
        {
          "token" : "み",
          "start_offset" : 8,
          "end_offset" : 9,
          "type" : "word",
          "position" : 8,
          "baseForm" : "み",
          "bytes" : "[e3 81 bf]",
          "normalizedForm" : "み",
          "partOfSpeech" : "接頭辞,*,*,*,*,*",
          "positionLength" : 1,
          "pronunciation" : "ミ",
          "reading" : "ミ",
          "termFrequency" : 1
        },
        {
          "token" : "ゅ",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "word",
          "position" : 9,
          "baseForm" : "ゅ",
          "bytes" : "[e3 82 85]",
          "normalizedForm" : "ユ",
          "partOfSpeech" : "記号,一般,*,*,*,*",
          "positionLength" : 1,
          "pronunciation" : "ュ",
          "reading" : "ュ",
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}
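The two responses are long, so comparing only the token strings makes the difference easier to see. Here is a minimal Python sketch that pulls the tokens out of an `_analyze` response produced with explain set to true. The function name `token_surfaces` is hypothetical, and the two dicts are abbreviated copies of the responses shown above with only the "token" field kept.

```python
def token_surfaces(response):
    """Return the token strings from an _analyze response produced with explain=true."""
    return [t["token"] for t in response["detail"]["tokenizer"]["tokens"]]


text = "きゃりーぱみゅぱみゅ"

# With the user dictionary: one token covering the whole word.
with_dict = {"detail": {"tokenizer": {"tokens": [{"token": text}]}}}

# Without it: the response above contains one token per character.
without_dict = {"detail": {"tokenizer": {"tokens": [{"token": c} for c in text]}}}

print(token_surfaces(with_dict))
print(token_surfaces(without_dict))
```

In practice you would pass the parsed JSON body of each response instead of these hand-built dicts.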

That is all for creating a user dictionary for Sudachi.

Combining a dictionary with synonyms lets you build a more accurate search engine.

That's all for this time.

This is the technical blog of CoLabMix Inc.

