2.2 Elasticsearch Analysis: The _analyze API

1. Introduction
Elasticsearch provides an API for testing how text is analyzed; its endpoint is _analyze. It supports three ways of testing: specifying an analyzer by name, specifying a field in an index (useful when analysis results differ from what you expect), and assembling a custom analyzer directly in the request.

2. Specifying an analyzer

POST /_analyze
{
	"analyzer": "standard",
	"text": "stephen curry"
}
  • analyzer: the analyzer name; standard is built into Elasticsearch and is the default
  • text: the text to analyze
{
  "tokens" : [
    {
      "token" : "stephen",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "curry",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
  • token: the term produced by analysis
  • start_offset: start offset of the term in the original text
  • end_offset: end offset of the term in the original text
  • type: the token type
  • position: position of the term in the token stream
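For comparison, the keyword analyzer emits the entire input as a single token rather than splitting it. Running the same text through it makes the contrast with standard clear:

POST /_analyze
{
	"analyzer": "keyword",
	"text": "stephen curry"
}

The response contains one token, "stephen curry", instead of two.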

3. Specifying a field in an index

POST /employee/_analyze
{
	"field": "name",
	"text": "stephen curry"
}
{
  "tokens" : [
    {
      "token" : "stephen",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "curry",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

The result is identical to specifying the standard analyzer, which shows that no analyzer is configured for the name field in the mapping: Elasticsearch falls back to standard, the default analyzer for text fields.
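If the field were mapped with a different analyzer, the field-based test would reflect that. As a minimal sketch (assuming the employee index is being created from scratch), the name field could be mapped to use the whitespace analyzer:

PUT /employee
{
	"mappings": {
		"properties": {
			"name": { "type": "text", "analyzer": "whitespace" }
		}
	}
}

With this mapping, the field-based _analyze request above would run the whitespace analyzer instead of standard. This is exactly the situation where testing by field helps diagnose unexpected search results.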

4. Custom analyzer

An ad-hoc analyzer can be assembled in the request itself by specifying a tokenizer and a list of token filters, without defining anything in an index:

POST /employee/_analyze
{
	"tokenizer": "standard",
	"filter": ["lowercase"],
	"text": "stephen curry"
}
{
  "tokens" : [
    {
      "token" : "stephen",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "curry",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
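Because the sample text is already lowercase, the lowercase filter has no visible effect above. Repeating the request with mixed-case input shows it at work:

POST /employee/_analyze
{
	"tokenizer": "standard",
	"filter": ["lowercase"],
	"text": "Stephen CURRY"
}

The tokens come back as "stephen" and "curry": the standard tokenizer splits the text into words, and the lowercase token filter normalizes the case.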