Elasticsearch： Join 数据类型

在 Elasticsearch 中，Join 可以让我们创建 parent/child 关系。Elasticsearch 不是一个 RDMS。通常 join 数据类型尽量不要使用，除非不得已。那么 Elasticsearch 为什么需要 Join 数据类型呢？

在 Elasticsearch 中，更新一个 object 需要 root object 一个完整的 reindex：

即使是一个 field 的一个字符的改变
即便是 nested object 也需要完整的 reindex 才可以实现搜索

通常情况下，这是完全 OK 的，但是在有些场合下，如果我们有频繁的更新操作，这样可能对性能带来很大的影响。

如果你的数据需要频繁的更新，并带来性能上的影响，这个时候，join 数据类型可能是你的一个解决方案。

join 数据类型可以完全地把两个 object 分开，但是还是保持这两者之前的关系。

parent 及 child 是完全分开的两个文档
parent 可以单独更新而不需要重新 reindex child
children 可以任意被添加/串改/删除而不影响 parent 及其它的 children

与 nested 类型类似，父子关系也允许你将不同的实体关联在一起，但它们在实现和行为上有所不同。与 nested 文档不同，它们不在同一文档中，而 parent/child 文档是完全独立的文档。它们遵循一对多关系原则，允许你将一种类型定义为 parent 类型，将一种或多种类型定义为 child 类型

即便 join 数据类型给我们带来了方便，但是，它也在搜索时给我带来额外的内存及计算的开销。

注意：目前 Kibana 对 nested 及 join 数据类型有比较少的支持。如果你想使用 Kibana 来在 dashboard 里展示数据，这个方面的你需要考虑。在未来，这种情况可能会发生改变。

join 数据类型是一个特殊字段，用于在同一索引的文档中创建父/子关系。关系部分定义文档中的一组可能关系，每个关系是父（parent）名称和子（child）名称。

一个例子：

PUT my_index
{
  "mappings": {
    "properties": {
      "my_join_field": { 
        "type": "join",
        "relations": {
          "question": "answer" 
        }
      }
    }
  }
}

在这里我们定义了一个叫做 my_index 的索引。在这个索引中，我们定义了一个 field，它的名字是 my_join_field。它的类型是 join 数据类型。同时我们定义了单个关系：question 是 answer 的 parent。

要使用 join 来 index 文档，必须在 source 中提供关系的 name 和文档的可选 parent。例如，以下示例在 question 上下文中创建两个 parent 文档：

PUT my_index/_doc/1?refresh
{
  "text": "This is a question",
  "my_join_field": {
    "name": "question" 
  }
}

PUT my_index/_doc/2?refresh
{
  "text": "This is another question",
  "my_join_field": {
    "name": "question"
  }
}

这里采用 refresh 来强制进行索引，以便接下来的搜索。在这里 name 标识 question，说明这个文档时一个 question 文档。

索引 parent 文档时，你可以选择仅将关系的名称指定为快捷方式，而不是将其封装在普通对象表示法中：

PUT my_index/_doc/1?refresh
{
  "text": "This is a question",
  "my_join_field": "question" 
}

PUT my_index/_doc/2?refresh
{
  "text": "This is another question",
  "my_join_field": "question"
}

这种方法和前面的是一样的，只是这里我们只使用了 question, 而不是一个像第一种方法那样，使用如下的一个对象来表达：

"my_join_field": {
    "name": "question"
  }

在实际的使用中，你可以根据自己的喜好来使用。

索引 child 项时，必须在 _source 中添加关系的名称以及文档的 parent id。

注意：需要在同一分片中索引父级的谱系，必须使用其 parent 的 id 来确保这个 child 和 parent 是在一个 shard 中。每个文档分配在那个 shard 之中在默认的情况下是按照文档的 id 进行一些 hash 来分配的，当然也可以通过 routing 来进行。针对 child，我们使用其 parent 的 id，这样就可以保证。否则在我们 join 数据的时候，跨 shard 是非常大的一个消费。

例如，以下示例显示如何索引两个 child 文档：

PUT my_index/_doc/3?routing=1?refresh  (1)
{
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer",   (2)
    "parent": "1"       (3)
  }
}

PUT my_index/_doc/4?routing=1?refresh
{
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}

在上面的（1）处，我们必须使用 routing，这样能确保 parent 和 child 是在同一个 shard 里。我们这里 routing 为1，这是因为 parent 的 id 为1，在（3）处定义。(2) 处定义了该文档 join 的名称。

parent-join 及其性能

join 字段不应像关系数据库中的连接一样使用。在 Elasticsearch 中，良好性能的关键是将数据去规范化为文档。每个连接字段 has_child 或 has_parent 查询都会对查询性能产生重大影响。

join 字段有意义的唯一情况是，如果你的数据包含一对多关系，其中一个实体明显超过另一个实体。

parent-join 的限制

对于每个索引来说，只能有一个 join 字段
parent 及 child 文档，必须是在一个 shard 里建立索引。这也意味着，同样的 routing 值必须应用于 getting, deleting 或 updating 一个 child 文档。
一个元素可以有多个 children，但是只能有一个 parent.
可以对已有的 join 项添加新的关系
也可以将 child 添加到现有元素，但仅当元素已经是 parent 时才可以。

针对 parent-join 的搜索

parent-join 创建一个字段来索引文档中关系的名称（my_parent，my_child，...）。

它还为每个 parent/child 关系创建一个字段。此字段的名称是 join 字段的名称，后跟＃和关系中 parent 的名称。因此，例如对于 my_parent⇒[my_child，another_child] 关系，join 字段会创建一个名为 my_join_field ＃my_parent 的附加字段。

如果文档是子文件（my_child 或 another_child），则此字段包含文档链接到的 parent_id，如果文档是 parent 文件（my_parent），则包含文档的_id。

搜索包含 join 字段的索引时，始终在搜索响应中返回这两个字段：

上面的描述比较绕口，我们还是以一个例子来说说明吧：

GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": ["_id"]
}

这里我们搜索所有的文档，并以 _id 进行排序：

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "text" : "This is a question",
          "my_join_field" : "question" (1)
        },
        "sort" : [
          "1"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "text" : "This is another question",
          "my_join_field" : "question" (2)
        },
        "sort" : [
          "2"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_routing" : "1",
        "_source" : {
          "text" : "This is an answer",
          "my_join_field" : {
            "name" : "answer", (3)
            "parent" : "1"     (4)
          }
        },
        "sort" : [
          "3"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : null,
        "_routing" : "1",
        "_source" : {
          "text" : "This is another answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        },
        "sort" : [
          "4"
        ]
      }
    ]
  }
}

在这里，我们可以看到 4 个文档：

（1）表明这个文档是一个 question join
（2）表明这个文档是一个 question join
（3）表明这个文档是一个 answer join
(4) 表明这个文档的parent是 id 为1的文档

Parent-join 查询及 aggregation

可以在 aggregation 和 script 中访问 join 字段的值，并可以使用 parent_id 查询进行查询：

GET my_index/_search
{
  "query": {
    "parent_id": { 
      "type": "answer",
      "id": "1"
    }
  }
}

我们通过查询 parent_id，返回所有 parent_id 为 1 的所有 answer 类型的文档：

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.35667494,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.35667494,
        "_routing" : "1",
        "_source" : {
          "text" : "This is another answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.35667494,
        "_routing" : "1",
        "_source" : {
          "text" : "This is an answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        }
      }
    ]
  }
}

在这里，我们可以看到返回 id 为 3 和 4 的文档。我们也可以对这些文档进行 aggregation:

GET my_index/_search
{
  "query": {
    "parent_id": {
      "type": "answer",
      "id": "1"
    }
  },
  "aggs": {
    "parents": {
      "terms": {
        "field": "my_join_field#question",
        "size": 10
      }
    }
  },
  "script_fields": {
    "parent": {
      "script": {
        "source": "doc['my_join_field#question']"
      }
    }
  }
}

就像我们在上一节中介绍的那样，在我们的应用实例中，在 index 时，它也创建一个额外的一个字段，虽然在 source 里我们看不到。这个字段就是 my_join_filed#question，这个字段含有 parent _id。在上面的查询中，我们首先查询所有的 parent_id 为1的所有的 answer 类型的文档。接下来对所有的文档以 parent_id 进行聚合：

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.35667494,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.35667494,
        "_routing" : "1",
        "fields" : {
          "parent" : [
            "1"
          ]
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.35667494,
        "_routing" : "1",
        "fields" : {
          "parent" : [
            "1"
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "parents" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "1",
          "doc_count" : 2
        }
      ]
    }
  }
}

一个 parent 对应多个 child

对于一个 parent 来说，我们可以定义多个 child，比如：

PUT my_index
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "question": ["answer", "comment"]  
        }
      }
    }
  }
}

在这里，question 是 answer 及 comment 的 parent。

多层的 parent join

虽然这个不建议，这样做可能会可能在 query 时带来更多的内存及计算方面的开销：

PUT my_index
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "question": ["answer", "comment"],  
          "answer": "vote" 
        }
      }
    }
  }
}

这里 question 是 answer 及 comment 的 parent，同时 answer 也是 vote 的 parent。它表明了如下的关系：

索引 grandchild 文档需 routing 值等于 grand-parent（谱系里的更大 parent）：

PUT my_index/_doc/3?routing=1&refresh 
{
  "text": "This is a vote",
  "my_join_field": {
    "name": "vote",
    "parent": "2" 
  }
}

这个 child 文档必须是和他的 grand-parent 在一个 shard 里。在这里它使用了1，也即 question 的id。同时，对于 vote 来说，它的 parent 必须是它的 parent，也即 answer 的 id。

更多阅读，请参阅文章 “Elasticsearch：在 Elasticsearch 中的 join 数据类型父子关系”。