Elasticsearch kuromoji: the dictionary and synonym settings that matter for search

2019.01.30

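In this post I set up two things that matter a great deal for kuromoji-based search in Elasticsearch: the user dictionary and synonyms.

Both are important settings once a system is in operation.

When it comes to improving search precision, maintaining these two on a regular basis is probably what matters most.

For this walkthrough I use the word 「きゃりーぱみゅぱみゅ」 (Kyary Pamyu Pamyu) as the test case.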

★きゃりーぱみゅぱみゅ - Wikipedia
https://ja.wikipedia.org/wiki/%E3%81%8D%E3%82%83%E3%82%8A%E3%83%BC%E3%81%B1%E3%81%BF%E3%82%85%E3%81%B1%E3%81%BF%E3%82%85

Her full stage name, it turns out, is 「きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ」.

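The goals of this walkthrough are:

「きゃりーぱみゅぱみゅ」 is searched as a single term, without being split
The document is found by searching for either 「きゃりーぱみゅぱみゅ」 or 「きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ」

I am using Elasticsearch 6.x with the kuromoji plugin.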

Checking the plugin

Check that the plugin is installed as follows.


# /usr/share/elasticsearch/bin/elasticsearch-plugin list
analysis-kuromoji

kuromoji is installed.

Registering settings and a mapping without a dictionary, then searching

First, let's set things up without registering a dictionary.

Create the following settings file.


{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer"
        }
      }
    }
  }
}
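
If you would rather generate this file than write it by hand, here is a minimal Python sketch. The helper name `build_settings` is my own, introduced only for illustration; the analyzer and tokenizer names are the ones used in the JSON above.

```python
import json


def build_settings(analyzer_name="my_kuromoji_analyzer",
                   tokenizer="kuromoji_tokenizer"):
    """Return index settings with one custom analyzer on the given tokenizer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": tokenizer,
                    }
                }
            }
        }
    }


# Print the JSON; redirect the output to a file to use it with curl.
print(json.dumps(build_settings(), ensure_ascii=False, indent=2))
```

Redirecting the output to kuromoji_setting1.json gives a file with the same content as the one shown above.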

Next, create the index.


$ curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sample1?pretty' -d @kuromoji_setting1.json
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "sample1"
}
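
The same request can be built from a script. The sketch below uses only the Python standard library; the function names `create_index_request` and `send` are assumptions of mine, and `send` needs a running cluster on localhost:9200 like the curl command above.

```python
import json
import urllib.request


def create_index_request(index, settings, base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates an index."""
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}?pretty",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(create_index_request("sample1", settings))` with the settings from the file should correspond to the curl call above. Building the request separately makes it easy to inspect the URL and body before anything is sent.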

The index was created.

Next, create the mapping file.


{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "my_kuromoji_analyzer"
    }
  }
}

Apply the mapping.


$ curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sample1/_mapping/type?pretty' -d @mapping1.json
{
  "acknowledged" : true
}

Now let's check the resulting configuration.


$ curl -H "Content-Type: application/json" -X GET 'localhost:9200/sample1?pretty'
{
  "sample1" : {
    "aliases" : { },
    "mappings" : {
      "type" : {
        "properties" : {
          "name" : {
            "type" : "text",
            "analyzer" : "my_kuromoji_analyzer"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "sample1",
        "creation_date" : "1548803353804",
        "analysis" : {
          "analyzer" : {
            "my_kuromoji_analyzer" : {
              "type" : "custom",
              "tokenizer" : "kuromoji_tokenizer"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "xxn1ofl3RIOI7dlqTmJRMg",
        "version" : {
          "created" : "6040199"
        }
      }
    }
  }
}

Now index one document.


$ curl -H "Content-Type: application/json" -XPOST 'localhost:9200/sample1/type/?pretty' -d '
> {
>   "name": "きゃりーぱみゅぱみゅ"
> }'
{
  "_index" : "sample1",
  "_type" : "type",
  "_id" : "ekAvnGgBnYzPeU5w9UfW",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

The document was indexed.

Now let's run a search.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample1/type/_search?pretty=true' -d '
> {
>   "query" : {
>     "simple_query_string" : {
>        "query": "きゃりーぱみゅぱみゅ"
>     }
>   }
> }
> '
{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.1507283,
    "hits" : [
      {
        "_index" : "sample1",
        "_type" : "type",
        "_id" : "ekAvnGgBnYzPeU5w9UfW",
        "_score" : 1.1507283,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

The document was found.

Now for the main part.

Let's use the _analyze API to see how 「きゃりーぱみゅぱみゅ」 is tokenized.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample1/_analyze?pretty=true' -d '
> {
>   "analyzer": "my_kuromoji_analyzer",
>   "text": "きゃりーぱみゅぱみゅ",
>   "explain": true
> }
> '
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "kuromoji_tokenizer",
      "tokens" : [
        {
          "token" : "きゃ",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0,
          "baseForm" : "く",
          "bytes" : "[e3 81 8d e3 82 83]",
          "inflectionForm" : "仮定縮約１",
          "inflectionForm (en)" : "conditional-contracted-1",
          "inflectionType" : "五段・カ行促音便",
          "inflectionType (en)" : "5-row-cons-k-cons-onbin",
          "partOfSpeech" : "動詞-非自立",
          "partOfSpeech (en)" : "verb-auxiliary",
          "positionLength" : 1,
          "pronunciation" : "キャ",
          "pronunciation (en)" : "kya",
          "reading" : "キャ",
          "reading (en)" : "kya",
          "termFrequency" : 1
        },
        {
          "token" : "り",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "word",
          "position" : 1,
          "baseForm" : null,
          "bytes" : "[e3 82 8a]",
          "inflectionForm" : "基本形",
          "inflectionForm (en)" : "base",
          "inflectionType" : "文語・リ",
          "inflectionType (en)" : "classical-ri",
          "partOfSpeech" : "助動詞",
          "partOfSpeech (en)" : "auxiliary-verb",
          "positionLength" : 1,
          "pronunciation" : "リ",
          "pronunciation (en)" : "ri",
          "reading" : "リ",
          "reading (en)" : "ri",
          "termFrequency" : 1
        },
        {
          "token" : "ー",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "word",
          "position" : 2,
          "baseForm" : null,
          "bytes" : "[e3 83 bc]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "名詞-固有名詞-一般",
          "partOfSpeech (en)" : "noun-proper-misc",
          "positionLength" : 1,
          "pronunciation" : null,
          "pronunciation (en)" : null,
          "reading" : null,
          "reading (en)" : null,
          "termFrequency" : 1
        },
        {
          "token" : "ぱみゅぱみゅ",
          "start_offset" : 4,
          "end_offset" : 10,
          "type" : "word",
          "position" : 3,
          "baseForm" : null,
          "bytes" : "[e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "名詞-一般",
          "partOfSpeech (en)" : "noun-common",
          "positionLength" : 1,
          "pronunciation" : null,
          "pronunciation (en)" : null,
          "reading" : null,
          "reading (en)" : null,
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}

It has been split up quite thoroughly.

The result is four tokens: 「きゃ」, 「り」, 「ー」 and 「ぱみゅぱみゅ」.

That means a search for 「きゃ」 or 「ぱみゅぱみゅ」 will also find the document.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample1/type/_search?pretty=true' -d '
> {
>   "query": { "term": { "name": "きゃ" } }
> }
> '
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "sample1",
        "_type" : "type",
        "_id" : "ekAvnGgBnYzPeU5w9UfW",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample1/type/_search?pretty=true' -d '
> {
>   "query": { "match": { "name": "きゃ" } }
> }
> '
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "sample1",
        "_type" : "type",
        "_id" : "ekAvnGgBnYzPeU5w9UfW",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample1/type/_search?pretty=true' -d '
> {
>   "query": { "term": { "name": "ぱみゅぱみゅ" } }
> }
> '
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "sample1",
        "_type" : "type",
        "_id" : "ekAvnGgBnYzPeU5w9UfW",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample1/type/_search?pretty=true' -d '
> {
>   "query": { "match": { "name": "ぱみゅぱみゅ" } }
> }
> '
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "sample1",
        "_type" : "type",
        "_id" : "ekAvnGgBnYzPeU5w9UfW",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

The score is lower than in the search for the full word, but the document is still found.

Both the term query and the match query find it, with the same score.
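
Why the fragments match can be pictured with a toy inverted index. This is only an illustrative model, not how Lucene is implemented: the token list is copied from the _analyze output above, and the names `build_inverted_index` and `term_lookup` are made up for this sketch.

```python
from collections import defaultdict

# Tokens copied from the _analyze output above (no user dictionary).
TOKENS_WITHOUT_DICTIONARY = ["きゃ", "り", "ー", "ぱみゅぱみゅ"]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def term_lookup(index, term):
    """Exact lookup: the term is used as-is, with no analysis."""
    return sorted(index.get(term, set()))


index = build_inverted_index({"doc1": TOKENS_WITHOUT_DICTIONARY})
print(term_lookup(index, "きゃ"))          # ['doc1']
print(term_lookup(index, "ぱみゅぱみゅ"))  # ['doc1']
print(term_lookup(index, "きゃりー"))      # []
```

Only strings that exist as whole tokens in the index can be found, which is why the way the tokenizer splits the text decides what is searchable.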

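Registering a dictionary and searching

That was a long preamble, but a hit on 「きゃ」 in the searches above is something we want to avoid.

The kuromoji user dictionary is configured in the settings file.

The dictionary is a CSV file, with one entry per line in the following format.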


word,word after morphological analysis,reading,part of speech

The entry I registered is the following.


きゃりーぱみゅぱみゅ,きゃりーぱみゅぱみゅ,キャリーパミュパミュ,カスタム名詞

To split it into 「きゃりー」 and 「ぱみゅぱみゅ」 instead, write the entry like this.


きゃりーぱみゅぱみゅ,きゃりー ぱみゅぱみゅ,キャリー パミュパミュ,カスタム名詞

The reading must be separated by spaces in the same way.
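
The four-column format can be generated with a small helper. This is a sketch under my own assumptions: the function name `user_dictionary_entry` is hypothetical, and the check that the segments join back into the original word is just a sanity check of this sketch.

```python
def user_dictionary_entry(surface, segments, readings, pos="カスタム名詞"):
    """Build one line of a kuromoji user dictionary.

    surface  : the word as it appears in documents
    segments : the tokens the word should be split into
    readings : one katakana reading per segment
    """
    if len(segments) != len(readings):
        raise ValueError("each segment needs exactly one reading")
    if "".join(segments) != surface:
        raise ValueError("segments must join back into the surface form")
    # Segments and readings are space-separated inside their columns.
    return ",".join([surface, " ".join(segments), " ".join(readings), pos])


word = "きゃりーぱみゅぱみゅ"
print(user_dictionary_entry(word, [word], ["キャリーパミュパミュ"]))
print(user_dictionary_entry(word, ["きゃりー", "ぱみゅぱみゅ"],
                            ["キャリー", "パミュパミュ"]))
```

The two printed lines correspond to the single-term entry and the two-token entry shown above.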

Now update the settings file.


{
  "settings": {
    "analysis": {
      "tokenizer": {
        "custom_kuromoji": {
          "type": "kuromoji_tokenizer",
          "user_dictionary": "/etc/elasticsearch/sample.dic"
        }
      },
      "analyzer": {
        "my_kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "custom_kuromoji"
        }
      }
    }
  }
}

Next, create the index.


$ curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sample2?pretty' -d @kuromoji_setting2.json
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "sample2"
}

The index was created.

Next, apply the mapping.

The mapping file itself is unchanged.


$ curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sample2/_mapping/type?pretty' -d @mapping1.json
{
  "acknowledged" : true
}

Now let's check the resulting configuration.


$ curl -H "Content-Type: application/json" -X GET 'localhost:9200/sample2?pretty'
{
  "sample2" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "sample2",
        "creation_date" : "1548811611770",
        "analysis" : {
          "analyzer" : {
            "my_kuromoji_analyzer" : {
              "type" : "custom",
              "tokenizer" : "custom_kuromoji"
            }
          },
          "tokenizer" : {
            "custom_kuromoji" : {
              "type" : "kuromoji_tokenizer",
              "user_dictionary" : "/etc/elasticsearch/sample.dic"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "8rSzbaQSQgWJjf5olalDlg",
        "version" : {
          "created" : "6040199"
        }
      }
    }
  }
}

Now index one document.


$ curl -H "Content-Type: application/json" -XPOST 'localhost:9200/sample2/type/?pretty' -d '
> {
>   "name": "きゃりーぱみゅぱみゅ"
> }'
{
  "_index" : "sample2",
  "_type" : "type",
  "_id" : "e0BgnGgBnYzPeU5wiEes",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

The document was indexed.

Now let's run a search.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample2/type/_search?pretty=true' -d '
> {
>   "query" : {
>     "simple_query_string" : {
>        "query": "きゃりーぱみゅぱみゅ"
>     }
>   }
> }
> '
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 3.8117876,
    "hits" : [
      {
        "_index" : "sample2",
        "_type" : "type",
        "_id" : "e0BgnGgBnYzPeU5wiEes",
        "_score" : 3.8117876,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

The score is higher than in the previous search.

Now let's use _analyze to see how 「きゃりーぱみゅぱみゅ」 is tokenized.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample2/_analyze?pretty=true' -d '
> {
>   "analyzer": "my_kuromoji_analyzer",
>   "text": "きゃりーぱみゅぱみゅ",
>   "explain": true
> }
> '
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "custom_kuromoji",
      "tokens" : [
        {
          "token" : "きゃりーぱみゅぱみゅ",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "word",
          "position" : 0,
          "baseForm" : null,
          "bytes" : "[e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "カスタム名詞",
          "partOfSpeech (en)" : null,
          "positionLength" : 1,
          "pronunciation" : null,
          "pronunciation (en)" : null,
          "reading" : "キャリーパミュパミュ",
          "reading (en)" : "kyaripamyupamyu",
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}

It is no longer split and is registered as a single term.
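
When the _analyze output gets long, it helps to pull out just the tokens. The sketch below reads an abridged copy of the response above; the function name `tokenizer_tokens` is my own, and the key path follows the `explain: true` output shown in this post.

```python
import json

# Abridged copy of the _analyze response shown above.
RESPONSE = """
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "custom_kuromoji",
      "tokens": [
        {"token": "きゃりーぱみゅぱみゅ", "start_offset": 0, "end_offset": 10,
         "partOfSpeech": "カスタム名詞", "reading": "キャリーパミュパミュ"}
      ]
    },
    "tokenfilters": []
  }
}
"""


def tokenizer_tokens(analyze_response):
    """Return (token, partOfSpeech) pairs emitted by the tokenizer stage."""
    tokens = analyze_response["detail"]["tokenizer"]["tokens"]
    return [(t["token"], t.get("partOfSpeech")) for t in tokens]


print(tokenizer_tokens(json.loads(RESPONSE)))
# [('きゃりーぱみゅぱみゅ', 'カスタム名詞')]
```

A single pair with the part of speech カスタム名詞 is a quick way to confirm that the user dictionary entry was picked up.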

Let's search for 「きゃ」 and 「ぱみゅぱみゅ」, which matched before the dictionary was registered.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample2/type/_search?pretty=true' -d '
> {
>   "query": { "term": { "name": "きゃ" } }
> }
> '
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample2/type/_search?pretty=true' -d '
> {
>   "query": { "match": { "name": "きゃ" } }
> }
> '
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "sample2",
        "_type" : "type",
        "_id" : "e0BgnGgBnYzPeU5wiEes",
        "_score" : 0.5753642,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample2/type/_search?pretty=true' -d '
> {
>   "query": { "term": { "name": "ぱみゅぱみゅ" } }
> }
> '
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample2/type/_search?pretty=true' -d '
> {
>   "query": { "match": { "name": "ぱみゅぱみゅ" } }
> }
> '
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 2.373377,
    "hits" : [
      {
        "_index" : "sample2",
        "_type" : "type",
        "_id" : "e0BgnGgBnYzPeU5wiEes",
        "_score" : 2.373377,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

This is quite interesting.

With the term query, the document no longer shows up.

With the match query, the document still shows up.

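Registering synonyms and searching by an alternative name

Next, let's make the document registered as 「きゃりーぱみゅぱみゅ」 searchable by 「きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ」 as well.

Synonyms are configured with a token filter called the Synonym Token Filter.

They can be written directly in the settings JSON, or kept in an external file.

This time I use an external file.

The file is comma-separated: words that are synonyms of each other go on one line, separated by commas.

I prepared the following file.

It groups the words as equivalents.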


きゃりーぱみゅぱみゅ,きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ

Synonyms can also replace one word with another, as follows.
(Multiple words can be listed on the left-hand side, separated by commas.)


きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ => きゃりーぱみゅぱみゅ

Here 「きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ」 is replaced by the term 「きゃりーぱみゅぱみゅ」.
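
The difference between the two rule styles can be sketched in a few lines of Python. This is a simplified model I wrote for illustration, not the filter's actual implementation: the real filter works on token streams, while `parse_synonym_rules` and `expand` (both hypothetical names) treat each side of a rule as one token.

```python
def parse_synonym_rules(lines):
    """Parse a simplified subset of the synonym file format into a dict.

    "a,b"     : equivalent terms; each one expands to the whole group
    "a => b"  : explicit mapping; the left-hand terms become the right-hand terms
    """
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for source in (s.strip() for s in left.split(",")):
                if source:
                    mapping[source] = targets
        else:
            group = [t.strip() for t in line.split(",") if t.strip()]
            for term in group:
                mapping[term] = group
    return mapping


def expand(mapping, token):
    """Return the terms a token becomes; unknown tokens pass through."""
    return mapping.get(token, [token])


short = "きゃりーぱみゅぱみゅ"
full = "きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ"
grouped = parse_synonym_rules([f"{short},{full}"])
replaced = parse_synonym_rules([f"{full} => {short}"])
print(expand(grouped, short))   # both names
print(expand(replaced, full))   # only the short name
```

With grouping, either name expands to both names. With replacement, only the long name is rewritten and the short name is left as it is.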

This time I use the first form, the grouping.

Now update the settings file.


{
  "settings": {
    "analysis": {
      "tokenizer": {
        "custom_kuromoji": {
          "type": "kuromoji_tokenizer",
          "user_dictionary": "/etc/elasticsearch/sample.dic"
        }
      },
      "analyzer": {
        "my_kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "custom_kuromoji",
          "filter": [
            "custom_synonym"
          ]
        }
      },
      "filter": {
        "custom_synonym": {
          "type": "synonym",
          "synonyms_path" : "/etc/elasticsearch/custom_synonyms.txt"
        }
      }
    }
  }
}
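
For comparison, here is a sketch of the inline variant mentioned earlier, where the rules go directly into the settings through the filter's `synonyms` parameter instead of `synonyms_path`. The helper name `synonym_settings` is an assumption of mine, and I did not run this variant in this walkthrough.

```python
import json


def synonym_settings(synonyms, user_dictionary="/etc/elasticsearch/sample.dic"):
    """Return settings like the JSON above, but with inline synonym rules."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "custom_kuromoji": {
                        "type": "kuromoji_tokenizer",
                        "user_dictionary": user_dictionary,
                    }
                },
                "analyzer": {
                    "my_kuromoji_analyzer": {
                        "type": "custom",
                        "tokenizer": "custom_kuromoji",
                        "filter": ["custom_synonym"],
                    }
                },
                "filter": {
                    "custom_synonym": {
                        "type": "synonym",
                        # Inline rules take the place of synonyms_path.
                        "synonyms": list(synonyms),
                    }
                },
            }
        }
    }


rules = ["きゃりーぱみゅぱみゅ,きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ"]
print(json.dumps(synonym_settings(rules), ensure_ascii=False, indent=2))
```

Inline rules keep everything in one place for a handful of entries, while an external file is easier to maintain as the list grows.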

Next, create the index.


$ curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sample3?pretty' -d @kuromoji_setting3.json
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "sample3"
}

The index was created.

Next, apply the mapping.

The mapping file itself is unchanged.


$ curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sample3/_mapping/type?pretty' -d @mapping1.json
{
  "acknowledged" : true
}

Now let's check the resulting configuration.


$ curl -H "Content-Type: application/json" -X GET 'localhost:9200/sample3?pretty'
{
  "sample3" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "sample3",
        "creation_date" : "1548814737822",
        "analysis" : {
          "filter" : {
            "custom_synonym" : {
              "type" : "synonym",
              "synonyms_path" : "/etc/elasticsearch/custom_synonyms.txt"
            }
          },
          "analyzer" : {
            "my_kuromoji_analyzer" : {
              "filter" : [
                "custom_synonym"
              ],
              "type" : "custom",
              "tokenizer" : "custom_kuromoji"
            }
          },
          "tokenizer" : {
            "custom_kuromoji" : {
              "type" : "kuromoji_tokenizer",
              "user_dictionary" : "/etc/elasticsearch/sample.dic"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "8SbFeGcoQPy_04tHBZmmDQ",
        "version" : {
          "created" : "6040199"
        }
      }
    }
  }
}

Now index one document.


$ curl -H "Content-Type: application/json" -XPOST 'localhost:9200/sample3/type/?pretty' -d '
> {
>   "name": "きゃりーぱみゅぱみゅ"
> }'
{
  "_index" : "sample3",
  "_type" : "type",
  "_id" : "fECOnGgBnYzPeU5wTEdE",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

The document was registered successfully.

Now let's run a search.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample2/type/_search?pretty=true' -d '
> {
>   "query" : {
>     "simple_query_string" : {
>        "query": "きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ"
>     }
>   }
> }
> '
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 4.3871517,
    "hits" : [
      {
        "_index" : "sample2",
        "_type" : "type",
        "_id" : "e0BgnGgBnYzPeU5wiEes",
        "_score" : 4.3871517,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample2/type/_search?pretty=true' -d '
> {
>   "query" : {
>     "simple_query_string" : {
>        "query": "きゃろらいんちゃろんぷろっぷ"
>     }
>   }
> }
> '
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.8630463,
    "hits" : [
      {
        "_index" : "sample2",
        "_type" : "type",
        "_id" : "e0BgnGgBnYzPeU5wiEes",
        "_score" : 0.8630463,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample2/type/_search?pretty=true' -d '
> {
>   "query" : {
>     "simple_query_string" : {
>        "query": "ちゃろん"
>     }
>   }
> }
> '
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "sample2",
        "_type" : "type",
        "_id" : "e0BgnGgBnYzPeU5wiEes",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

Even "ちゃろん", a string that does not appear in "きゃりーぱみゅぱみゅ", now hits in search.

Next, let's use analyze to see how "きゃりーぱみゅぱみゅ" (not "きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ") is tokenized.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample3/_analyze?pretty=true' -d '
> {
>   "analyzer": "my_kuromoji_analyzer",
>   "text": "きゃりーぱみゅぱみゅ",
>   "explain": true
> }
> '
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "custom_kuromoji",
      "tokens" : [
        {
          "token" : "きゃりーぱみゅぱみゅ",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "word",
          "position" : 0,
          "baseForm" : null,
          "bytes" : "[e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "カスタム名詞",
          "partOfSpeech (en)" : null,
          "positionLength" : 1,
          "pronunciation" : null,
          "pronunciation (en)" : null,
          "reading" : "キャリーパミュパミュ",
          "reading (en)" : "kyaripamyupamyu",
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [
      {
        "name" : "synonym",
        "tokens" : [
          {
            "token" : "きゃりーぱみゅぱみゅ",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "word",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : "カスタム名詞",
            "partOfSpeech (en)" : null,
            "positionLength" : 1,
            "pronunciation" : null,
            "pronunciation (en)" : null,
            "reading" : "キャリーパミュパミュ",
            "reading (en)" : "kyaripamyupamyu",
            "termFrequency" : 1
          },
          {
            "token" : "き",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "SYNONYM",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e3 81 8d]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : null,
            "partOfSpeech (en)" : null,
            "positionLength" : 1,
            "pronunciation" : null,
            "pronunciation (en)" : null,
            "reading" : null,
            "reading (en)" : null,
            "termFrequency" : 1
          },
          {
            "token" : "ゃろらいんちゃろんぷろっぷきゃり",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "SYNONYM",
            "position" : 1,
            "baseForm" : null,
            "bytes" : "[e3 82 83 e3 82 8d e3 82 89 e3 81 84 e3 82 93 e3 81 a1 e3 82 83 e3 82 8d e3 82 93 e3 81 b7 e3 82 8d e3 81 a3 e3 81 b7 e3 81 8d e3 82 83 e3 82 8a]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : null,
            "partOfSpeech (en)" : null,
            "positionLength" : 1,
            "pronunciation" : null,
            "pronunciation (en)" : null,
            "reading" : null,
            "reading (en)" : null,
            "termFrequency" : 1
          },
          {
            "token" : "ー",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "SYNONYM",
            "position" : 2,
            "baseForm" : null,
            "bytes" : "[e3 83 bc]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : null,
            "partOfSpeech (en)" : null,
            "positionLength" : 1,
            "pronunciation" : null,
            "pronunciation (en)" : null,
            "reading" : null,
            "reading (en)" : null,
            "termFrequency" : 1
          },
          {
            "token" : "ぱみゅぱみゅ",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "SYNONYM",
            "position" : 3,
            "baseForm" : null,
            "bytes" : "[e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : null,
            "partOfSpeech (en)" : null,
            "positionLength" : 1,
            "pronunciation" : null,
            "pronunciation (en)" : null,
            "reading" : null,
            "reading (en)" : null,
            "termFrequency" : 1
          }
        ]
      }
    ]
  }
}

The text is split in a complicated way.

On the synonym side, it appears that "きゃりーぱみゅぱみゅ", "き", "ゃろらいんちゃろんぷろっぷきゃり", "ー", and "ぱみゅぱみゅ" will all hit.
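
Because the explain output is long, it can help to list only the token strings. The following is a minimal sketch, assuming the pretty-printed `"token" : "..."` layout shown in the transcript above. `list_tokens` is a hypothetical helper name, the sample is an abbreviated copy of the token filter output, and the pattern match is not a general JSON parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the "token" values from a pretty-printed
# _analyze response read on stdin. It depends on the `"token" : "..."` layout
# and is not a general JSON parser.
list_tokens() {
  grep -o '"token" : "[^"]*"' | sed -e 's/^"token" : "//' -e 's/"$//'
}

# Abbreviated sample taken from the tokenfilters part of the output above.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
            "token" : "きゃりーぱみゅぱみゅ",
            "type" : "word",
            "token" : "き",
            "type" : "SYNONYM",
            "token" : "ゃろらいんちゃろんぷろっぷきゃり",
            "type" : "SYNONYM",
            "token" : "ー",
            "type" : "SYNONYM",
            "token" : "ぱみゅぱみゅ",
            "type" : "SYNONYM",
EOF

tokens="$(list_tokens < "$sample")"
printf '%s\n' "$tokens"
rm -f "$sample"
```

With a running cluster, the same function could be fed from the `_analyze` call above by running curl with `-s` and appending `| list_tokens`.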

As a test, let's search for "き".


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample3/type/_search?pretty=true' -d '
> {
>   "query": { "term": { "name": "き" } }
> }
> '
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "sample3",
        "_type" : "type",
        "_id" : "fECOnGgBnYzPeU5wTEdE",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

The term query hit.

That is clearly a problem, so the word registered as a synonym is also added to the dictionary.


きゃりーぱみゅぱみゅ,きゃりーぱみゅぱみゅ,キャリーパミュパミュ,カスタム名詞
きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ,きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ,キャロラインチャッロンプロップキャッリーパミュパみゅ,カスタム名詞
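
Each line of a kuromoji user dictionary has four comma-separated fields: the surface form, its space-separated segmentation, one reading per segment, and a part-of-speech tag. As a rough pre-flight check before reloading the index, the file could be validated with something like the sketch below. `check_user_dict` is a hypothetical helper, not part of Elasticsearch or the plugin, and the rules it enforces (segments join back to the surface form, one reading per segment, blank and `#` lines ignored) are the assumptions of this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Elasticsearch or the kuromoji plugin):
# sanity-check a user dictionary file.
# Expected line format: surface,segmentation,readings,part-of-speech
check_user_dict() {
  awk -F',' '
    /^[ \t]*(#.*)?$/ { next }                # ignore blank lines and # comments
    NF != 4 {
      printf "line %d: expected 4 comma-separated fields, found %d\n", NR, NF
      bad = 1
      next
    }
    {
      nseg  = split($2, seg, " ")            # segments are space-separated
      nread = split($3, rd, " ")             # one reading per segment
      joined = $2
      gsub(/ /, "", joined)
      if (joined != $1) {
        printf "line %d: segments do not join back to the surface form\n", NR
        bad = 1
      }
      if (nseg != nread) {
        printf "line %d: %d segment(s) but %d reading(s)\n", NR, nseg, nread
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# The two entries from this article, written to a temporary file.
dict_file="$(mktemp)"
cat > "$dict_file" <<'EOF'
きゃりーぱみゅぱみゅ,きゃりーぱみゅぱみゅ,キャリーパミュパミュ,カスタム名詞
きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ,きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ,キャロラインチャッロンプロップキャッリーパミュパみゅ,カスタム名詞
EOF

if check_user_dict "$dict_file"; then
  echo "dictionary looks well-formed"
fi
```

The check only looks at the shape of each line. It does not judge whether a reading is sensible.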

Let's close and reopen the index, then check again.


$ curl -H "Content-Type: application/json" -X POST 'http://localhost:9200/sample3/_close?pretty'
{
  "acknowledged" : true
}

$ curl -H "Content-Type: application/json" -X POST 'http://localhost:9200/sample3/_open?pretty'
{
  "acknowledged" : true,
  "shards_acknowledged" : true
}

Now let's check with analyze again.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample3/_analyze?pretty=true' -d '
{
  "analyzer": "my_kuromoji_analyzer",
  "text": "きゃりーぱみゅぱみゅ",
  "explain": true
}'
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "custom_kuromoji",
      "tokens" : [
        {
          "token" : "きゃりーぱみゅぱみゅ",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "word",
          "position" : 0,
          "baseForm" : null,
          "bytes" : "[e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "カスタム名詞",
          "partOfSpeech (en)" : null,
          "positionLength" : 1,
          "pronunciation" : null,
          "pronunciation (en)" : null,
          "reading" : "キャリーパミュパミュ",
          "reading (en)" : "kyaripamyupamyu",
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [
      {
        "name" : "synonym",
        "tokens" : [
          {
            "token" : "きゃりーぱみゅぱみゅ",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "word",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : "カスタム名詞",
            "partOfSpeech (en)" : null,
            "positionLength" : 1,
            "pronunciation" : null,
            "pronunciation (en)" : null,
            "reading" : "キャリーパミュパミュ",
            "reading (en)" : "kyaripamyupamyu",
            "termFrequency" : 1
          },
          {
            "token" : "きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "SYNONYM",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e3 81 8d e3 82 83 e3 82 8d e3 82 89 e3 81 84 e3 82 93 e3 81 a1 e3 82 83 e3 82 8d e3 82 93 e3 81 b7 e3 82 8d e3 81 a3 e3 81 b7 e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : null,
            "partOfSpeech (en)" : null,
            "positionLength" : 1,
            "pronunciation" : null,
            "pronunciation (en)" : null,
            "reading" : null,
            "reading (en)" : null,
            "termFrequency" : 1
          }
        ]
      }
    ]
  }
}

There are now two tokens, arranged according to the dictionary, and the odd splitting is gone.

Let's check whether "き" still hits.


$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample3/type/_search?pretty=true' -d '
{
  "query": { "term": { "name": "き" } }
}
'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "sample3",
        "_type" : "type",
        "_id" : "fECOnGgBnYzPeU5wTEdE",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

It still hits.

This is presumably because the document that was already registered was indexed with the split tokens.
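
That guess can be pictured with a toy model. The sketch below is not Elasticsearch code: `analyze_old` and `analyze_new` are hypothetical stand-ins that simply return the token lists seen in the two `_analyze` runs, and a plain file plays the role of the index. It assumes only that terms are written when a document is indexed and are not rewritten when the analyzer changes afterwards.

```shell
#!/usr/bin/env bash
# Toy model only: this is NOT how Elasticsearch stores data internally.

# Hypothetical stand-in for the analyzer before the dictionary update.
analyze_old() {
  printf '%s\n' 'きゃりーぱみゅぱみゅ' 'き' 'ゃろらいんちゃろんぷろっぷきゃり' 'ー' 'ぱみゅぱみゅ'
}

# Hypothetical stand-in for the analyzer after the dictionary update.
analyze_new() {
  printf '%s\n' 'きゃりーぱみゅぱみゅ' 'きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ'
}

# A term lookup matches whole stored terms only, like a term query.
term_hits() {
  grep -Fxq -- "$2" "$1"
}

old_index="$(mktemp)"
new_index="$(mktemp)"

# The document was indexed while the old analyzer was active.
analyze_old > "$old_index"

# Switching analyzers changes nothing in the terms already stored.
if term_hits "$old_index" 'き'; then stale="hit"; else stale="miss"; fi

# A freshly created index stores only what the new analyzer emits.
analyze_new > "$new_index"
if term_hits "$new_index" 'き'; then fresh="hit"; else fresh="miss"; fi

echo "old index: $stale, fresh index: $fresh"
rm -f "$old_index" "$new_index"
```

In this model the only way to get rid of the stale "き" term is to build the stored terms again from scratch.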

Neither deleting the document nor running reindex had any effect, so let's create the index again.


$ curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sample4?pretty' -d @kuromoji_setting3.json
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "sample4"
}

$ curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sample4/_mapping/type?pretty' -d @mapping1.json
{
  "acknowledged" : true
}

$ curl -H "Content-Type: application/json" -X GET 'localhost:9200/sample4?pretty'
{
  "sample4" : {
    "aliases" : { },
    "mappings" : {
      "type" : {
        "properties" : {
          "name" : {
            "type" : "text",
            "analyzer" : "my_kuromoji_analyzer"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "sample4",
        "creation_date" : "1548824896792",
        "analysis" : {
          "filter" : {
            "custom_synonym" : {
              "type" : "synonym",
              "synonyms_path" : "/etc/elasticsearch/custom_synonyms.txt"
            }
          },
          "analyzer" : {
            "my_kuromoji_analyzer" : {
              "filter" : [
                "custom_synonym"
              ],
              "type" : "custom",
              "tokenizer" : "custom_kuromoji"
            }
          },
          "tokenizer" : {
            "custom_kuromoji" : {
              "type" : "kuromoji_tokenizer",
              "user_dictionary" : "/etc/elasticsearch/sample.dic"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "1MB7pKqnQOWnPgsMHaOhdA",
        "version" : {
          "created" : "6040199"
        }
      }
    }
  }
}

$ curl -H "Content-Type: application/json" -XPOST 'localhost:9200/sample4/type/?pretty' -d '
> {
>   "name": "きゃりーぱみゅぱみゅ"
> }'
{
  "_index" : "sample4",
  "_type" : "type",
  "_id" : "fkApnWgBnYzPeU5w6Ecr",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample4/_analyze?pretty=true' -d '
> {
>   "analyzer": "my_kuromoji_analyzer",
>   "text": "きゃりーぱみゅぱみゅ",
>   "explain": true
> }'
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "custom_kuromoji",
      "tokens" : [
        {
          "token" : "きゃりーぱみゅぱみゅ",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "word",
          "position" : 0,
          "baseForm" : null,
          "bytes" : "[e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "カスタム名詞",
          "partOfSpeech (en)" : null,
          "positionLength" : 1,
          "pronunciation" : null,
          "pronunciation (en)" : null,
          "reading" : "キャリーパミュパミュ",
          "reading (en)" : "kyaripamyupamyu",
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [
      {
        "name" : "synonym",
        "tokens" : [
          {
            "token" : "きゃりーぱみゅぱみゅ",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "word",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : "カスタム名詞",
            "partOfSpeech (en)" : null,
            "positionLength" : 1,
            "pronunciation" : null,
            "pronunciation (en)" : null,
            "reading" : "キャリーパミュパミュ",
            "reading (en)" : "kyaripamyupamyu",
            "termFrequency" : 1
          },
          {
            "token" : "きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "SYNONYM",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e3 81 8d e3 82 83 e3 82 8d e3 82 89 e3 81 84 e3 82 93 e3 81 a1 e3 82 83 e3 82 8d e3 82 93 e3 81 b7 e3 82 8d e3 81 a3 e3 81 b7 e3 81 8d e3 82 83 e3 82 8a e3 83 bc e3 81 b1 e3 81 bf e3 82 85 e3 81 b1 e3 81 bf e3 82 85]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : null,
            "partOfSpeech (en)" : null,
            "positionLength" : 1,
            "pronunciation" : null,
            "pronunciation (en)" : null,
            "reading" : null,
            "reading (en)" : null,
            "termFrequency" : 1
          }
        ]
      }
    ]
  }
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample4/type/_search?pretty=true' -d '
> {
>   "query": { "term": { "name": "きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ" } }
> }
> '
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.36165747,
    "hits" : [
      {
        "_index" : "sample4",
        "_type" : "type",
        "_id" : "fkApnWgBnYzPeU5w6Ecr",
        "_score" : 0.36165747,
        "_source" : {
          "name" : "きゃりーぱみゅぱみゅ"
        }
      }
    ]
  }
}

$ curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sample4/type/_search?pretty=true' -d '
{
  "query": { "term": { "name": "き" } }
}
'
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

"き" no longer hits in search, and "きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ" now hits.

To further improve search precision, it seems better to also register in the dictionary any words registered as synonyms.
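
One way to keep the two files in step is to list every synonym term that has no matching surface form in the user dictionary. The sketch below assumes a Solr-format synonyms file (comma-separated terms, optionally with `=>`) and the four-field dictionary format used above. `missing_from_dict` is a hypothetical helper and the sample file contents are assumptions for illustration, not the files used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each term of a Solr-format synonyms file that
# does not appear as a surface form (first CSV field) in the user dictionary.
missing_from_dict() {
  local synonyms_file="$1"
  local dict_file="$2"
  awk -v dict="$dict_file" '
    BEGIN {
      # Load the surface forms from the dictionary.
      while ((getline line < dict) > 0) {
        if (line ~ /^[ \t]*(#.*)?$/) continue
        split(line, field, ",")
        surface[field[1]] = 1
      }
      close(dict)
    }
    /^[ \t]*(#.*)?$/ { next }                # ignore blank lines and # comments
    {
      rule = $0
      gsub(/=>/, ",", rule)                  # treat both sides of a mapping alike
      n = split(rule, term, ",")
      for (i = 1; i <= n; i++) {
        t = term[i]
        gsub(/^[ \t]+|[ \t]+$/, "", t)       # trim surrounding whitespace
        if (t != "" && !(t in surface)) print t
      }
    }
  ' "$synonyms_file"
}

# Assumed sample contents: a synonym rule, and a dictionary that so far
# contains only the short name.
syn_file="$(mktemp)"
dic_file="$(mktemp)"
printf '%s\n' 'きゃりーぱみゅぱみゅ,きゃろらいんちゃろんぷろっぷきゃりーぱみゅぱみゅ' > "$syn_file"
printf '%s\n' 'きゃりーぱみゅぱみゅ,きゃりーぱみゅぱみゅ,キャリーパミュパミュ,カスタム名詞' > "$dic_file"

missing="$(missing_from_dict "$syn_file" "$dic_file")"
printf 'not in dictionary: %s\n' "$missing"
```

An empty result would mean every synonym term has its own dictionary entry, which is the state this article ends up in.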

Also, after adding entries to the dictionary, it is probably best to recreate the index.

This post has gotten long, so I'll stop here.

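This is the technical blog of CoLabMix Inc. (株式会社CoLabMix).

We build and operate infrastructure on GCP, AWS, and similar platforms, and do development centered on crawling, analysis, and search.

Feel free to contact us about development work with Ruby on Rails, Django, Python, and so on.

We are keen to connect with companies that want to add development partners.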
