Installing Elasticsearch 6.2.0 and Sudachi on a CentOS 7 server on Sakura VPS

2018.10.11

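In this post I install Elasticsearch 6.2 on Sakura VPS and try out Sudachi, a morphological analyzer for Japanese.

The OS is CentOS 7.

For more about Sudachi, the following article covers it in detail.

★ Elasticsearchのための新しい形態素解析器「Sudachi」 (A new morphological analyzer for Elasticsearch: "Sudachi")

The Elasticsearch version used here is 6.2.

The plugin source is available here: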
https://github.com/WorksApplications/elasticsearch-sudachi

I also tried other versions such as 6.4.1, but the build failed with errors, so I install 6.2.0, the version that pom.xml specifies by default in "<elasticsearch.version>6.2.0</elasticsearch.version>".
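Since the plugin only builds against the version named in pom.xml, it can help to read that value out before installing Elasticsearch. The following is a minimal sketch and not part of the original procedure: pom_es_version is a made-up helper name, and it assumes the property sits on a single line, as in the excerpt above.

```shell
# Print the value of <elasticsearch.version> from a pom.xml file.
# pom_es_version is a hypothetical helper name, not part of the plugin.
# Assumes the property is written on one line, as in the excerpt above.
pom_es_version() {
    sed -n 's:.*<elasticsearch\.version>\(.*\)</elasticsearch\.version>.*:\1:p' "$1" | head -n 1
}
```

Once the repository has been cloned, running pom_es_version on its pom.xml prints the targeted version, which can then be compared with the "number" field that Elasticsearch reports on port 9200.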

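If matching versions is a hassle, it may be easier to download the build that matches the snapshot repository.

Note that the snapshot also lags slightly behind the latest Elasticsearch release.
(At the time of writing, the latest release is 6.4.2 and the snapshot targets 6.2.2.)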

Installing Elasticsearch 6.2.0

First, install Elasticsearch 6.2.0.

Following the official documentation, I install it from the RPM package.
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/rpm.html


# yum -y install perl-Digest-SHA
# yum -y install java-1.8.0-openjdk-devel

# cd /tmp

# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.0.rpm
# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.0.rpm.sha512
# shasum -a 512 -c elasticsearch-6.2.0.rpm.sha512

# rpm --install elasticsearch-6.2.0.rpm

Next, register the service so that it starts at boot, and start it.


# systemctl daemon-reload
# systemctl enable elasticsearch.service
# systemctl start elasticsearch

Elasticsearch takes a little while to start. Once it is up, check that it responds on port 9200.


# curl localhost:9200
{
  "name" : "jeeFzKI",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "3rGak3xUS9CfbHWjn3zGkA",
  "version" : {
    "number" : "6.2.0",
    "build_hash" : "37cdac1",
    "build_date" : "2018-02-01T17:31:12.527918Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
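Because the node needs some time before it answers, a scripted setup may want to poll instead of waiting by hand. The following is a minimal sketch of a generic retry loop; retry is a hypothetical helper name, and the attempt count and delay in the usage note are arbitrary assumptions, not values from this article.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# retry is a hypothetical helper.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Stop as soon as the command exits with status 0.
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only when another attempt will follow.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}
```

For example, "retry 30 2 curl -sf -o /dev/null localhost:9200" would probe the port up to 30 times, two seconds apart, and return non-zero if Elasticsearch never answers.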

Installing Apache Maven

Next, install Apache Maven, the build and dependency management tool for Java.


# wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
# yum -y install apache-maven
# mvn -version
Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T16:58:13+09:00)
Maven home: /usr/share/apache-maven
Java version: 1.8.0_181, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-3.b13.el7_5.x86_64/jre
Default locale: ja_JP, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-327.36.3.el7.x86_64", arch: "amd64", family: "unix"

Everything is now ready for installing Sudachi.

Installing Sudachi

Now build the plugin and install it.


# cd /tmp
# git clone https://github.com/WorksApplications/elasticsearch-sudachi.git
Cloning into 'elasticsearch-sudachi'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 970 (delta 2), reused 8 (delta 2), pack-reused 962
Receiving objects: 100% (970/970), 149.19 KiB | 0 bytes/s, done.
Resolving deltas: 100% (344/344), done.

# cd elasticsearch-sudachi

# mvn package
          ・
          ・
          ・
[INFO] Building jar: /tmp/elasticsearch-sudachi/target/analysis-sudachi-elasticsearch6.2-1.1.0-SNAPSHOT-sources.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:30 min
[INFO] Finished at: 2018-10-11T16:57:45+09:00
[INFO] Final Memory: 40M/206M
[INFO] ------------------------------------------------------------------------

# cd /usr/share/elasticsearch
# mkdir -p target/releases
# mv /tmp/elasticsearch-sudachi/target/releases/analysis-sudachi-elasticsearch6.2-1.1.0-SNAPSHOT.zip target/releases/.
# bin/elasticsearch-plugin install file:///usr/share/elasticsearch/target/releases/analysis-sudachi-elasticsearch6.2-1.1.0-SNAPSHOT.zip

That completes the installation.

Let's confirm that the plugin is listed.


# /usr/share/elasticsearch/bin/elasticsearch-plugin list
analysis-sudachi

Installing the dictionary

"dictionary_full" is a superset of "dictionary_core" and is a little under twice its file size. At the time of writing it is about 110 MB.

This time I install "dictionary_full".


# cd /tmp
# wget https://oss.sonatype.org/content/repositories/snapshots/com/worksap/nlp/sudachi/0.1.1-SNAPSHOT/sudachi-0.1.1-20181002.083840-42-dictionary-full.tar.bz2
# tar xvf sudachi-0.1.1-20181002.083840-42-dictionary-full.tar.bz2

# mkdir /etc/elasticsearch/sudachi
# mv system_full.dic /etc/elasticsearch/sudachi/.

Next, restart the service to enable Sudachi.


# service elasticsearch restart

# curl -X GET 'http://localhost:9200/_nodes/plugins?pretty'
          ・
          ・
          ・
      "plugins" : [
        {
          "name" : "analysis-sudachi",
          "version" : "1.1.0-SNAPSHOT",
          "description" : "The Japanese (Sudachi) Analysis plugin integrates Lucene Sudachi analysis module into elasticsearch.",
          "classname" : "com.worksap.nlp.elasticsearch.sudachi.plugin.AnalysisSudachiPlugin",
          "extended_plugins" : [ ],
          "has_native_controller" : false,
          "requires_keystore" : false
        }
          ・
          ・
          ・

# ls /etc/elasticsearch/sudachi/
system_full.dic
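When this check is part of a script, the plugin entry can be looked for automatically rather than read by eye. The following is a minimal sketch under assumptions: has_plugin is a hypothetical helper name, and it matches the pretty-printed '"name" : "..."' layout shown above as plain text instead of parsing the JSON.

```shell
# Succeed if the named plugin appears in a pretty-printed _nodes/plugins
# response read from stdin. has_plugin is a hypothetical helper name.
# It does a fixed-string match on the '"name" : "<plugin>"' line.
has_plugin() {
    grep -qF -- "\"name\" : \"$1\""
}
```

Piping the output of the curl command above into "has_plugin analysis-sudachi" yields exit status 0 when the plugin is loaded and non-zero otherwise.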

Create the file "/etc/elasticsearch/sudachi/sudachi.json", using the following repository as a reference: https://github.com/WorksApplications/elasticsearch-sudachi


# vi /etc/elasticsearch/sudachi/sudachi.json
-----------------------------Add the following
{
    "systemDict" : "system_full.dic",
    "inputTextPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.DefaultInputTextPlugin" },
        { "class" : "com.worksap.nlp.sudachi.ProlongedSoundMarkInputTextPlugin",
          "prolongedSoundMarks": ["ー", "-", "⁓", "〜", "〰"],
          "replacementSymbol": "ー"}
    ],
    "oovProviderPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.MeCabOovProviderPlugin" },
        { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
          "oovPOS" : [ "補助記号", "一般", "*", "*", "*", "*" ],
          "leftId" : 5968,
          "rightId" : 5968,
          "cost" : 3857 }
    ],
    "pathRewritePlugin" : [
        { "class" : "com.worksap.nlp.sudachi.JoinNumericPlugin",
          "joinKanjiNumeric" : true },
        { "class" : "com.worksap.nlp.sudachi.JoinKatakanaOovPlugin",
          "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ],
          "minLength" : 3
        }
    ]
}
-----------------------------

Creating an index to prepare for analysis

First, create a settings file that will be used to create the index.


# cd /tmp
# vi analysis_sudachi_settings.json
-----------------------------Add the following
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "mode": "search",
            "discard_punctuation": true,
            "resources_path": "/etc/elasticsearch/sudachi",
            "settings_path": "/etc/elasticsearch/sudachi/sudachi.json"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
-----------------------------

Next, create the index from the settings file.


# curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/sudachi_test/' -d @analysis_sudachi_settings.json
{"acknowledged":true,"shards_acknowledged":true,"index":"sudachi_test"}

Confirm that the index has been created.


# curl -X GET 'http://localhost:9200/sudachi_test/?pretty'
{
  "sudachi_test" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "sudachi_test",
        "creation_date" : "1539250623599",
        "analysis" : {
          "analyzer" : {
            "sudachi_analyzer" : {
              "filter" : [ ],
              "type" : "custom",
              "tokenizer" : "sudachi_tokenizer"
            }
          },
          "tokenizer" : {
            "sudachi_tokenizer" : {
              "mode" : "search",
              "settings_path" : "/etc/elasticsearch/sudachi/sudachi.json",
              "resources_path" : "/etc/elasticsearch/sudachi",
              "type" : "sudachi_tokenizer",
              "discard_punctuation" : "true"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "2-RdcSUfTCmOkRuvxBHT9A",
        "version" : {
          "created" : "6020099"
        }
      }
    }
  }
}

Analysis with the Sudachi plugin

Let's run a few tests.

The first is a test of spelling variants.


# curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sudachi_test/_analyze?pretty=true' -d '
> {
>   "analyzer": "sudachi_analyzer",
>   "text": "見積もり"
> }
> '
{
  "tokens" : [
    {
      "token" : "見積もり",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}

# curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sudachi_test/_analyze?pretty=true' -d '
> {
>   "analyzer": "sudachi_analyzer",
>   "text": "見積り"
> }
> '
{
  "tokens" : [
    {
      "token" : "見積もり",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}

In the output above, both "見積もり" and "見積り" are returned as the same token, "見積もり".
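To compare analyzers over many inputs, it is easier to look only at the token values. The following is a minimal sketch, assuming the pretty-printed response layout shown above; analyze_tokens is a hypothetical helper name, and it uses a text match on the '"token" : "..."' lines rather than a real JSON parser.

```shell
# Print each "token" value from a pretty-printed _analyze response on stdin,
# one token per line. analyze_tokens is a hypothetical helper name.
# It relies on the '"token" : "..."' line layout, not on JSON parsing.
analyze_tokens() {
    sed -n 's/^ *"token" : "\([^"]*\)".*$/\1/p'
}
```

Piping each of the two curl responses above through analyze_tokens prints a single line each time, so the two results can be compared directly as strings.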

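For comparison, kuromoji produces the following. (The "restaurant" index and "my_kuromoji_analyzer" used here are not set up in this article.)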


# curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/restaurant/_analyze?pretty=true' -d '
> {
>   "analyzer": "my_kuromoji_analyzer",
>   "text": "見積もり"
> }
> '
{
  "tokens" : [
    {
      "token" : "見積もり",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}

# curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/restaurant/_analyze?pretty=true' -d '
> {
>   "analyzer": "my_kuromoji_analyzer",
>   "text": "見積り"
> }
> '
{
  "tokens" : [
    {
      "token" : "見積り",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}

Next, the name of a musician.


# curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sudachi_test/_analyze?pretty=true' -d '
> {
>   "analyzer": "sudachi_analyzer",
>   "text": "きゃりーぱみゅぱみゅ"
> }
> '
{
  "tokens" : [
    {
      "token" : "き",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ヤ",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "り",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "ー",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ぱっ",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "み",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "ユ",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "ぱっ",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "み",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "ユ",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 9
    }
  ]
}

Oops, that did not work: the name is split into single-character tokens.

For comparison, here is kuromoji as well.


# curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/restaurant/_analyze?pretty=true' -d '
> {
>   "analyzer": "my_kuromoji_analyzer",
>   "text": "きゃりーぱみゅぱみゅ"
> }
> '
{
  "tokens" : [
    {
      "token" : "きゃ",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "り",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ー",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "ぱみゅぱみゅ",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "word",
      "position" : 3
    }
  ]
}

The token boundaries differ slightly, but kuromoji does not recognize the full name either.

Next, a general term.


# curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/sudachi_test/_analyze?pretty=true' -d '
> {
>   "analyzer": "sudachi_analyzer",
>   "text": "東京都知事選"
> }
> '
{
  "tokens" : [
    {
      "token" : "東京都知事選",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0,
      "positionLength" : 4
    },
    {
      "token" : "東京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "都",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "知事",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "選",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    }
  ]
}

Sudachi correctly recognizes "東京都知事選" as a single keyword, and also emits its component words.

For comparison, here is kuromoji as well.


# curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/restaurant/_analyze?pretty=true' -d '
> {
>   "analyzer": "my_kuromoji_analyzer",
>   "text": "東京都知事選"
> }
> '
{
  "tokens" : [
    {
      "token" : "東京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "都",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "都知事",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "知事",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "選",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    }
  ]
}

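Kuromoji did not return it as a single word.

Judging from these tests, Sudachi appears to be considerably stronger for Japanese analysis.

It looks especially effective for absorbing spelling variants in running text.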

This is the technical blog of CoLabMix Inc.

We build and operate infrastructure on GCP, AWS, and similar platforms, and do development centered on crawling, analysis, and search.
