2.2 Elasticsearch Analysis: The _analyze API
1. Introduction
Elasticsearch provides an API for testing how text is analyzed; its endpoint is _analyze. It supports three ways of testing: specifying an analyzer by name, specifying a field in an index (useful when analysis results differ from what you expect), and assembling a custom analyzer on the fly.
2. Specifying an analyzer
POST /_analyze
{
  "analyzer": "standard",
  "text": "stephen curry"
}
- analyzer: the name of the analyzer to use; standard is built into Elasticsearch and is the default
- text: the text to analyze
{
  "tokens" : [
    {
      "token" : "stephen",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "curry",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
- token: the term produced by analysis
- start_offset: start offset of the term in the original text
- end_offset: end offset of the term in the original text
- type: token type, e.g. <ALPHANUM> for alphanumeric terms
- position: position of the term in the token stream
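As a quick sanity check, the offsets in the response can be mapped back onto the original text: slicing the input with start_offset and end_offset should reproduce each token. A minimal Python sketch, using the sample response above:

```python
import json

# Sample _analyze response for the text "stephen curry" (copied from above).
response = json.loads("""
{
  "tokens": [
    {"token": "stephen", "start_offset": 0, "end_offset": 7,
     "type": "<ALPHANUM>", "position": 0},
    {"token": "curry", "start_offset": 8, "end_offset": 13,
     "type": "<ALPHANUM>", "position": 1}
  ]
}
""")

text = "stephen curry"

# Each token's offsets should slice the original term back out of the input.
for tok in response["tokens"]:
    assert text[tok["start_offset"]:tok["end_offset"]] == tok["token"]

terms = [tok["token"] for tok in response["tokens"]]
print(terms)  # ['stephen', 'curry']
```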
3. Specifying a field in an index
POST /employee/_analyze
{
  "field": "name",
  "text": "stephen curry"
}
{
  "tokens" : [
    {
      "token" : "stephen",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "curry",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
The result is identical to specifying "standard" directly, which shows that when no analyzer is explicitly configured for a field, Elasticsearch falls back to the standard analyzer by default.
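These calls are plain JSON over HTTP, so they are easy to script. A minimal sketch that builds the path and body for the field-based request (the employee index and name field are just the example values used above; actually sending the request with an HTTP client is omitted):

```python
import json

def build_field_analyze_request(index: str, field: str, text: str):
    """Build the URL path and JSON body for a field-based _analyze call."""
    path = f"/{index}/_analyze"
    body = json.dumps({"field": field, "text": text})
    return path, body

path, body = build_field_analyze_request("employee", "name", "stephen curry")
print(path)  # /employee/_analyze
print(body)  # {"field": "name", "text": "stephen curry"}
```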
4. Custom analyzer
Instead of naming a predefined analyzer, you can assemble one on the fly by specifying a tokenizer and a list of token filters in the request:
POST /employee/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "stephen curry"
}
{
  "tokens" : [
    {
      "token" : "stephen",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "curry",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
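The output is again the same, because the standard analyzer is itself built from the standard tokenizer plus the lowercase token filter. To make the tokenizer/filter split concrete, here is a rough Python emulation of this two-stage pipeline; the regex-based tokenizer is a crude stand-in for the real standard tokenizer, not a faithful reimplementation:

```python
import re

def tokenize(text):
    """Crude stand-in for the standard tokenizer: emit word runs
    along with their offsets and positions."""
    return [
        {"token": m.group(), "start_offset": m.start(),
         "end_offset": m.end(), "position": i}
        for i, m in enumerate(re.finditer(r"\w+", text))
    ]

def lowercase_filter(tokens):
    """Emulate the lowercase token filter: rewrite each term,
    leaving offsets and positions untouched."""
    return [dict(t, token=t["token"].lower()) for t in tokens]

# Mixed-case input makes the effect of the lowercase filter visible.
tokens = lowercase_filter(tokenize("Stephen Curry"))
print([t["token"] for t in tokens])  # ['stephen', 'curry']
```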