How to write custom analyzer settings when using elasticsearch-rails

Installing the analysis-kuromoji plugin in Elasticsearch makes an analyzer named "kuromoji" available, as registered in the plugin source below.

https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/blob/v2.3.0/src/main/java/org/elasticsearch/plugin/analysis/kuromoji/AnalysisKuromojiPlugin.java#L52

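However, there are surely cases where you want to tweak this a little further and use a custom analyzer. The README does not describe how to write custom analyzer settings when using the elasticsearch-rails gem, and I could not find any information in Japanese, so I am writing it down here as a note.

The environment is as follows.

Elasticsearch 1.3.1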
analysis-kuromoji 2.3.0
elasticsearch-rails 0.1.4
elasticsearch-model 0.1.4

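Sample configuration

To be honest, I am still experimenting here and have not yet arrived at a configuration I would call definitive. I will limit myself to showing what a configuration commonly seen on the web looks like when written with elasticsearch-rails.

For the analyzer settings I referred to the following article.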

Elasticsearchとkuromojiでちゃんとした日本語全文検索をやるメモ | GMOメディアエンジニアブログ

For the overall layout of the settings file, I followed the official sample.

https://github.com/elasticsearch/elasticsearch-rails/blob/master/elasticsearch-rails/lib/rails/templates/searchable.rb

# app/models/concerns/searchable.rb

module Searchable
  extend ActiveSupport::Concern

  included do
    include Elasticsearch::Model

    # Customize the index name
    index_name [Rails.application.engine_name, Rails.env].join('_')

    # Set up index configuration and mapping
    settings index: {
      number_of_shards:   1,
      number_of_replicas: 0,
      analysis: {
        filter: {
          pos_filter: {
            type:     'kuromoji_part_of_speech',
            stoptags: ['助詞-格助詞-一般', '助詞-終助詞'],
          },
          greek_lowercase_filter: {
            type:     'lowercase',
            language: 'greek',
          },
        },
        analyzer: {
          kuromoji_analyzer: {
            type:      'custom',
            tokenizer: 'kuromoji_tokenizer',
            filter:    ['kuromoji_baseform', 'pos_filter', 'greek_lowercase_filter', 'cjk_width'],
          }
        }
      }
    } do
      mapping do
        indexes :status, type: 'string', index: 'not_analyzed'
        indexes :title, type: 'string', index: 'analyzed', analyzer: 'kuromoji_analyzer'
        indexes :displayed_at, type: 'date'
        
        # ...
      end
    end

    # ...

# app/models/video.rb

class Video < ActiveRecord::Base
  include Searchable

# ...

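Once it is written down it is nothing special, but I was unsure where analysis was supposed to go. As the sample shows, it sits inside the index hash passed to settings, next to number_of_shards and number_of_replicas.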
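Because the analysis section is just a nested Ruby hash, it can be sanity-checked in plain Ruby before it is handed to settings. The following is a minimal sketch, not part of elasticsearch-rails: ANALYSIS mirrors the hash from the sample, while ASSUMED_BUILTIN_FILTERS and undefined_filters are hypothetical names introduced here. The built-in list is an assumption that covers only the predefined filters this post uses.

```ruby
require 'set'

# The analysis section from the sample settings, as a plain Ruby hash.
ANALYSIS = {
  filter: {
    pos_filter: {
      type:     'kuromoji_part_of_speech',
      stoptags: ['助詞-格助詞-一般', '助詞-終助詞']
    },
    greek_lowercase_filter: { type: 'lowercase', language: 'greek' }
  },
  analyzer: {
    kuromoji_analyzer: {
      type:      'custom',
      tokenizer: 'kuromoji_tokenizer',
      filter:    ['kuromoji_baseform', 'pos_filter', 'greek_lowercase_filter', 'cjk_width']
    }
  }
}.freeze

# Assumption: filters that are usable by name without a custom definition.
# Only the two that appear in this post are listed.
ASSUMED_BUILTIN_FILTERS = Set['kuromoji_baseform', 'cjk_width'].freeze

# Hypothetical helper: for each analyzer, list the filters it references
# that are neither defined under :filter nor in the assumed built-in list.
def undefined_filters(analysis, builtin = ASSUMED_BUILTIN_FILTERS)
  defined = analysis.fetch(:filter, {}).keys.map(&:to_s)
  analysis.fetch(:analyzer, {}).each_with_object({}) do |(name, config), missing|
    unknown = Array(config[:filter]).reject { |f| defined.include?(f) || builtin.include?(f) }
    missing[name] = unknown unless unknown.empty?
  end
end

p undefined_filters(ANALYSIS) # prints {} because every filter is accounted for
```

A check like this mainly catches typos, such as an analyzer referring to a filter name that was renamed in the filter section.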

stoptags

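Among the settings above, the kuromoji_part_of_speech filter removes tokens whose part of speech is listed in stoptags. I have set it to

stoptags: ['助詞-格助詞-一般', '助詞-終助詞'],

but in practice you decide which words you want to exclude while debugging (described later), look up which part-of-speech tag they carry, and add those tags to stoptags.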
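To make the effect of stoptags concrete, here is a small plain-Ruby sketch of the idea. It does not call kuromoji: Token and remove_stoptags are hypothetical names, and the tokens with their part-of-speech tags are written by hand for illustration, not taken from real analyzer output.

```ruby
# A token as a (surface form, part-of-speech tag) pair.
# Assumption: tags are assigned by hand here, not produced by kuromoji.
Token = Struct.new(:surface, :pos)

STOPTAGS = ['助詞-格助詞-一般', '助詞-終助詞'].freeze

# Drop every token whose part-of-speech tag appears in stoptags,
# which is the behavior described for kuromoji_part_of_speech.
def remove_stoptags(tokens, stoptags = STOPTAGS)
  tokens.reject { |token| stoptags.include?(token.pos) }
end

tokens = [
  Token.new('手', '名詞-一般'),
  Token.new('を', '助詞-格助詞-一般'),
  Token.new('出す', '動詞-自立'),
  Token.new('よ', '助詞-終助詞')
]

puts remove_stoptags(tokens).map(&:surface).join(' ') # prints: 手 出す
```

Tokens are matched on the full tag string, so adding a tag to the list affects only tokens carrying exactly that tag in this sketch.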

If you are unsure which part-of-speech tag a word you want to exclude falls under, take a look at stoptags.txt bundled in lucene-analyzer-kuromoji.jar, which describes each tag with examples.

https://github.com/elasticsearch/elasticsearch-analysis-kuromoji#tokenfilter--kuromoji_part_of_speech

#  particle-case-misc: Case particles.
#  e.g. から, が, で, と, に, へ, より, を, の, にて
助詞-格助詞-一般

#  particle-case-quote: the "to" that appears after nouns, a person’s speech, 
#  quotation marks, expressions of decisions from a meeting, reasons, judgements,
#  conjectures, etc.
#  e.g. ( だ) と (述べた.), ( である) と (して執行猶予...)
助詞-格助詞-引用

#  particle-dependency:
#  e.g. こそ, さえ, しか, すら, は, も, ぞ
助詞-係助詞

#  particle-final:
#  e.g. かい, かしら, さ, ぜ, (だ)っけ-口語/, (とまってる) で-方言/, な, ナ, なあ-口語/, ぞ, ね, ネ, 
#       ねぇ-口語/, ねえ-口語/, ねん-方言/, の, のう-口語/, や, よ, ヨ, よぉ-口語/, わ, わい-口語/
助詞-終助詞

# ...

Using the kuromoji demo alongside it is also handy.

kuromoji demo - japanese morphological analyzer

Meaning of the other settings

For the meaning of the other settings, the following article is very helpful! (Thank you, @9215. Your posts always help me a lot!)

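Recreating the index

After changing the settings above, recreate the index so that they take effect. Note that force: true deletes the existing index first, so the documents have to be imported again afterwards (for example with Video.import).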

Video.__elasticsearch__.create_index! force: true

Debugging

Check that the analyzer behaves as intended.

curl -XGET 'http://localhost:9200/x300_application_development/_analyze?pretty=true&analyzer=kuromoji_analyzer' -d '絶対に手を出してはいけない相手を夜這いしちゃった俺'
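The same request can be sent from Ruby with only the standard library, which is convenient in a console session. This is a rough sketch under assumptions: analyze_uri and analyze are hypothetical helpers, the host and port defaults match the curl example, and the response is assumed to contain a tokens array whose entries have a token key, as in the 1.x analyze API.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the _analyze URL for an index and analyzer.
# Assumption: the node listens on localhost:9200 as in the curl example.
def analyze_uri(index, analyzer, host: 'localhost', port: 9200)
  URI::HTTP.build(
    host:  host,
    port:  port,
    path:  "/#{index}/_analyze",
    query: URI.encode_www_form(pretty: true, analyzer: analyzer)
  )
end

# Send the text as the request body and return only the token strings.
# Assumption: the response JSON looks like { "tokens": [{ "token": "..." }, ...] }.
def analyze(index, analyzer, text)
  response = Net::HTTP.post(analyze_uri(index, analyzer), text)
  JSON.parse(response.body).fetch('tokens', []).map { |t| t['token'] }
end

# Usage against a running node (not executed here):
#   puts analyze('x300_application_development', 'kuromoji_analyzer', 'text to analyze')
```

Returning only the token strings makes it easier to see at a glance which words survive the filters and which ones should be added to stoptags.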

Alternatively, the Elasticsearch plugin called inquisitor is handy.

polyfractal/elasticsearch-inquisitor

That's all. See you next time.
