2.2 Elasticsearch Analysis: The _analyze API

1. Introduction
Elasticsearch provides an API for testing how text is analyzed; its endpoint is _analyze. It supports three ways of testing: specifying an analyzer by name, specifying a field in an index (useful when analysis results differ from what you expect), and assembling a custom analyzer directly in the request.

2. Specifying an analyzer

POST /_analyze
{
	"analyzer": "standard",
	"text": "stephen curry"
}
  • analyzer: the analyzer name; standard is built into Elasticsearch and is the default
  • text: the text to analyze
{
  "tokens" : [
    {
      "token" : "stephen",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "curry",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
  • token: the term produced by analysis
  • start_offset: start offset of the term in the original text
  • end_offset: end offset of the term in the original text
  • type: the token type
  • position: position of the term in the token stream
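For comparison, the keyword analyzer emits the entire input as a single token rather than splitting it. Running the same text through it makes the contrast with standard clear:

POST /_analyze
{
	"analyzer": "keyword",
	"text": "stephen curry"
}

The response contains one token, "stephen curry", instead of two.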

3. Specifying a field in an index

POST /employee/_analyze
{
	"field": "name",
	"text": "stephen curry"
}
{
  "tokens" : [
    {
      "token" : "stephen",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "curry",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

The result is identical to specifying the standard analyzer, which shows that no analyzer is configured for the name field in the mapping: Elasticsearch falls back to standard, the default analyzer for text fields.
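If the field were mapped with a different analyzer, the field-based test would reflect that. As a minimal sketch (assuming the employee index is being created from scratch), the name field could be mapped to use the whitespace analyzer:

PUT /employee
{
	"mappings": {
		"properties": {
			"name": { "type": "text", "analyzer": "whitespace" }
		}
	}
}

With this mapping, the field-based _analyze request above would run the whitespace analyzer instead of standard. This is exactly the situation where testing by field helps diagnose unexpected search results.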

4. Custom analyzer

An ad-hoc analyzer can be assembled in the request itself by specifying a tokenizer and a list of token filters, without defining anything in an index:

POST /employee/_analyze
{
	"tokenizer": "standard",
	"filter": ["lowercase"],
	"text": "stephen curry"
}
{
  "tokens" : [
    {
      "token" : "stephen",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "curry",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
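Because the sample text is already lowercase, the lowercase filter has no visible effect above. Repeating the request with mixed-case input shows it at work:

POST /employee/_analyze
{
	"tokenizer": "standard",
	"filter": ["lowercase"],
	"text": "Stephen CURRY"
}

The tokens come back as "stephen" and "curry": the standard tokenizer splits the text into words, and the lowercase token filter normalizes the case.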