After trying out simple queries and aggregations a while back, the next experiment is tokenizing the chat messages.

Installation

First, install kuromoji by following the steps on the official site:

sudo bin/elasticsearch-plugin install analysis-kuromoji

If Elasticsearch was installed from the rpm package, bin/elasticsearch-plugin is usually located under /usr/share/elasticsearch/.
After running the command, it downloads the required files and installs the plugin on its own. Remember to restart Elasticsearch afterwards for the plugin to take effect.
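
For an rpm install managed by systemd, the restart typically looks like this (assuming the service is named elasticsearch, the default for the rpm package):

sudo systemctl restart elasticsearch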

You can then call the API to check whether the installation succeeded:

curl -X GET http://10.0.0.19:9200/_nodes/plugins?pretty

If analysis-kuromoji appears in the plugins list, the installation basically succeeded.
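
Trimmed down to the relevant part, the response should look roughly like this (node id and the other plugin fields omitted; exact contents vary by version):

{
  "nodes" : {
    "<node-id>" : {
      "plugins" : [
        {
          "name" : "analysis-kuromoji",
          ...
        }
      ]
    }
  }
}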

Trying out tokenization

Start with the default analyzer:

GET /_analyze
{
  "text": "無料でしずりん買います おつりん!"
}

The result is:

{
  "tokens" : [
    {
      "token" : "無",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "料",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "で",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<HIRAGANA>",
      "position" : 2
    },
    {
      "token" : "し",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<HIRAGANA>",
      "position" : 3
    },
    {
      "token" : "ず",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<HIRAGANA>",
      "position" : 4
    },
    {
      "token" : "り",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<HIRAGANA>",
      "position" : 5
    },
    {
      "token" : "ん",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<HIRAGANA>",
      "position" : 6
    },
    {
      "token" : "買",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "い",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<HIRAGANA>",
      "position" : 8
    },
    {
      "token" : "ま",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<HIRAGANA>",
      "position" : 9
    },
    {
      "token" : "す",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<HIRAGANA>",
      "position" : 10
    },
    {
      "token" : "お",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "<HIRAGANA>",
      "position" : 11
    },
    {
      "token" : "つ",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "<HIRAGANA>",
      "position" : 12
    },
    {
      "token" : "り",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "<HIRAGANA>",
      "position" : 13
    },
    {
      "token" : "ん",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "<HIRAGANA>",
      "position" : 14
    }
  ]
}

Basically every character becomes its own token, since the standard analyzer splits CJK text into single-character tokens.

Next, tokenize the same text with kuromoji:

GET /_analyze
{
  "analyzer": "kuromoji",
  "text": "無料でしずりん買います おつりん!"
}

This time the result is:

{
  "tokens" : [
    {
      "token" : "無料",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "りん",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "買う",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "つり",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 8
    }
  ]
}

Quite a few characters seem to have been dropped; after all, しずりん and おつりん are not ordinary vocabulary.

After some digging, it turns out kuromoji supports a user_dictionary, which lets you define custom terms.

Configuration

The dictionary file's default name seems to be userdict_ja.txt, and the default path is /etc/elasticsearch/ (the Elasticsearch config directory).
Its content is one entry per line, in this format:

<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>

The official example is:

東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞

So there are four parts, separated by commas: the full string, its segmentation (tokens separated by spaces), the readings, and the part-of-speech tag.

For now, add two lines to the dictionary as a test:

しずりん,しずりん,しずりん,カスタム名詞
おつりん,おつりん,おつりん,カスタム名詞

My Japanese isn't great, so let's set the readings and part-of-speech aside for now (?).
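
One way to create the file in the config directory (using the rpm path mentioned above):

sudo tee /etc/elasticsearch/userdict_ja.txt > /dev/null <<'EOF'
しずりん,しずりん,しずりん,カスタム名詞
おつりん,おつりん,おつりん,カスタム名詞
EOF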

Next, create an index that points at the dictionary we just created:

PUT try_kuromoji
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

Once the index is created, try calling it directly:

GET /try_kuromoji/_analyze
{
  "analyzer": "my_analyzer",
  "text": "無料でしずりん買います おつりん!"
}

Result:

{
  "tokens" : [
    {
      "token" : "無料",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "で",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "しずりん",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "買い",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ます",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : " ",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "おつりん",
      "start_offset" : 12,
      "end_offset" : 16,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "!",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "word",
      "position" : 7
    }
  ]
}

That looks a lot better.
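
One caveat: the dictionary file is read when the analyzer is instantiated, so later edits to userdict_ja.txt are not picked up automatically. Closing and reopening the index is one way to force a reload (using the index from above):

POST /try_kuromoji/_close
POST /try_kuromoji/_open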

Final thoughts

I originally assumed Elasticsearch would offer an API for a pipeline like this: query a time range, tokenize some field of the matching documents, sum up how often each token appears, then sort by the totals. After searching for a while, it seems no such thing exists, so it has to be implemented by hand.
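
For reference, here is a minimal client-side sketch of that pipeline in Python, assuming a hypothetical chat index with a timestamp date field and a message text field (those three names are made up for illustration; try_kuromoji and my_analyzer are the ones defined above):

from collections import Counter

import requests  # third-party HTTP client

ES = "http://10.0.0.19:9200"  # same host as in the examples above
INDEX = "chat"                # hypothetical index holding the chat messages
FIELD = "message"             # hypothetical text field
TS = "timestamp"              # hypothetical date field

# 1. Query messages within a time range.
#    (size is capped here; real code would use scroll or search_after to paginate)
search_body = {
    "size": 1000,
    "_source": [FIELD],
    "query": {"range": {TS: {"gte": "now-1d/d", "lt": "now/d"}}},
}
hits = requests.get(f"{ES}/{INDEX}/_search", json=search_body).json()["hits"]["hits"]

# 2. Tokenize each message with the custom analyzer defined on try_kuromoji.
counts = Counter()
for hit in hits:
    text = hit["_source"].get(FIELD)
    if not text:
        continue
    resp = requests.get(
        f"{ES}/try_kuromoji/_analyze",
        json={"analyzer": "my_analyzer", "text": text},
    ).json()
    counts.update(t["token"] for t in resp["tokens"])

# 3. Sum up occurrences and sort: print the 20 most frequent tokens.
for token, count in counts.most_common(20):
    print(count, token)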

Also, during the implementation I found that far too many terms need custom dictionary entries, and a lot of unwanted tokens have to be filtered out as well, such as digits like 1234 or punctuation. But that's a topic for another day.