Hybrid Queries Based on Reciprocal Rank Fusion

Last updated: 2024-07-12

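Background

Reciprocal Rank Fusion (RRF) is a rank fusion method used in information retrieval and machine learning. It merges the result sets of multiple queries, each scored with a different relevance metric, into a single result set.

Baidu AI Cloud Elasticsearch provides its own implementation of the RRF algorithm. Because RRF works on ranks rather than raw scores, the relevance metrics of the individual result sets do not need to be related to one another for the fused result to be of high quality.

Note: This feature is currently available only on clusters running version 7.10.2 with kernel version 1.7.5 or later. You can check the cluster version and kernel version on the [Cluster Details] page of the cluster in the Baidu AI Cloud console.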

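How Reciprocal Rank Fusion Is Calculated

RRF takes the ranked document list of each query's result set and assigns every document a score. The score is the sum, over all result sets that contain the document, of the reciprocal of the rank constant plus the document's rank in that result set.

RRF uses the following formula to compute the final score of a document d, combining its ranks across all query result sets:

$$\mathrm{score}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$$

D is the collection of result sets returned by the queries, each produced with a different relevance metric.
d is a document in one of the result sets in D.
k is the rank constant. It determines how much the documents in each individual result set influence the final ranking. The larger k is, the more influence lower-ranked documents have on the final ranking.
R is the set of rankings, one for each query's result set. r(d) is the rank of d in a single result set. Ranks start at 1, and a term is added to the sum only for result sets that contain d.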
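The formula can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used by the service: the function names `rrf_scores` and `contribution_ratio` and the toy document IDs are assumptions made for the example. The second helper shows the effect of k: with a small k the first rank contributes far more than the tenth, while a large k brings the two contributions close together.

```python
from collections import defaultdict


def rrf_scores(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first (rank 1 first).
    k: the rank constant from the formula.
    Returns (doc_id, score) pairs, highest fused score first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document only contributes for the lists it appears in.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def contribution_ratio(k):
    """How many times more a rank-1 hit contributes than a rank-10 hit."""
    return (1.0 / (k + 1)) / (1.0 / (k + 10))


# Two toy result sets (hypothetical document IDs).
text_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = rrf_scores([text_hits, vector_hits], k=60)

for doc_id, score in fused:
    print(doc_id, round(score, 6))

print("k=1 :", contribution_ratio(1))
print("k=60:", contribution_ratio(60))
```

Documents that appear in both lists ("a" and "c") end up ahead of documents that appear in only one.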

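Querying with Reciprocal Rank Fusion

Use the rrf parameter to run a reciprocal rank fusion query. Configure multiple sub-queries inside rrf; you can also set the fusion parameters rank_constant and window_size.

The query evaluates every sub-query and fuses all of their result sets into one final result set with a combined ranking. The rrf syntax accepts the following parameters:

queries (required, list of query objects)

queries must contain two or more entries. Each entry is an independent query that follows the standard query syntax. RRF combines the results of all entries into the final result set.

rank_constant (optional, float)

rank_constant corresponds to k in the RRF formula. It determines how much the documents in each individual result set influence the final ranking. The default is 60, and the value must be greater than or equal to 1.

window_size (optional, int)

window_size sets how many results are taken from each individual query's result set. A larger window_size improves the relevance of the results at the cost of performance. The default is 10. The value must be greater than 0 and greater than or equal to the size of the search request.

An example RRF query: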

GET my_index/_search
{
  "rrf": {
    "queries": [
      {
        "query": {
          "match": {
            "text": "RRF"
          }
        }
      },
      {
        "query": {
          "match": {
            "title": "Reciprocal Rank Fusion"
          }
        }
      },
      {
        "query": {
          "knn": {
            "vector": {
              "vector": [ 5, 4, 3, 2, 1 ],
              "k": 3,
              "ef": 100
            }
          }
        }
      }
    ],
    "window_size": 5,
    "rank_constant": 10
  },
  "size": 5
}
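When the request body is assembled programmatically, the parameter constraints listed above can be checked before the request is sent. The sketch below is a hedged illustration: `build_rrf_body` is a hypothetical helper, not part of any client library, and the default `size` of 10 is assumed to follow the standard Elasticsearch search default.

```python
import json


def build_rrf_body(queries, size=10, window_size=10, rank_constant=60):
    """Assemble an rrf search body, enforcing the documented constraints.

    queries: list of query clauses, e.g. {"match": {"text": "RRF"}}.
    """
    if len(queries) < 2:
        raise ValueError("queries must contain at least two sub-queries")
    if rank_constant < 1:
        raise ValueError("rank_constant must be >= 1")
    if window_size <= 0 or window_size < size:
        raise ValueError("window_size must be > 0 and >= size")
    return {
        "rrf": {
            # Each sub-query is wrapped in its own {"query": ...} object.
            "queries": [{"query": clause} for clause in queries],
            "window_size": window_size,
            "rank_constant": rank_constant,
        },
        "size": size,
    }


body = build_rrf_body(
    [
        {"match": {"text": "RRF"}},
        {"match": {"title": "Reciprocal Rank Fusion"}},
    ],
    size=5,
    window_size=5,
    rank_constant=10,
)
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of a `GET my_index/_search` request.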

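Example

First create an index whose mapping contains a text field, a vector field, and an integer field, then add a few documents to it.

The vector field of this index uses one-dimensional vectors so that the rankings are easier to interpret.

Create the Index

Create an index named index_rrf with the following settings and mappings: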

PUT index_rrf
{
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "vector": { 
        "type": "bpack_vector",
        "dims": 1,
        "index_type":  "hnsw",
        "space_type": "l2",
        "parameters": {
            "ef_construction": 200,
            "m": 32
        }
      },
      "integer": {
        "type": "integer"
      },
      "text": {
        "type": "text"
      }
    }
  }
}

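Write Data

Index five documents. Documents 1 to 3 contain both a text value and a one-dimensional vector, document 4 has no vector field, and document 5 has no text field.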

PUT index_rrf/_doc/1
{
    "text" : "rrf",
    "vector" : [5],
    "integer": 1
}
 
PUT index_rrf/_doc/2
{
    "text" : "rrf rrf",
    "vector" : [4],
    "integer": 2
}
 
PUT index_rrf/_doc/3
{
    "text" : "rrf rrf rrf",
    "vector" : [3],
    "integer": 1
}
 
PUT index_rrf/_doc/4
{
    "text" : "rrf rrf rrf rrf",
    "integer": 2
}
 
PUT index_rrf/_doc/5
{
    "vector" : [0],
    "integer": 1
}
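The five documents can also be written in one request. The sketch below builds a newline-delimited payload for the standard Elasticsearch `_bulk` endpoint; it assumes the cluster exposes that endpoint unchanged, and `to_bulk_ndjson` is a hypothetical helper written for this example. It only builds the payload and does not send it.

```python
import json

# The same five documents as above, keyed by _id.
docs = {
    "1": {"text": "rrf", "vector": [5], "integer": 1},
    "2": {"text": "rrf rrf", "vector": [4], "integer": 2},
    "3": {"text": "rrf rrf rrf", "vector": [3], "integer": 1},
    "4": {"text": "rrf rrf rrf rrf", "integer": 2},
    "5": {"vector": [0], "integer": 1},
}


def to_bulk_ndjson(index, documents):
    """Return an NDJSON string: one action line plus one source line per doc."""
    lines = []
    for doc_id, source in documents.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


payload = to_bulk_ndjson("index_rrf", docs)
print(payload)
```

The payload would be sent as the body of `POST _bulk` with the header `Content-Type: application/x-ndjson`.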

Run the RRF Query

Run the following query:

GET index_rrf/_search
{
  "rrf": {
    "queries": [
      {
        "query": {
          "match": {
            "text": "rrf"
          }
        }
      },
      {
        "query": {
          "knn": {
            "vector": {
              "vector": [
                5
              ],
              "k": 3,
              "ef": 100
            }
          }
        }
      }
    ],
    "window_size": 5,
    "rank_constant": 1
  },
  "size": 5
}

The query returns the following result:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : { ... }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.53333336,
        "_source" : { ... }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.5,
        "_source" : { ... }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.33333334,
        "_source" : { ... }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.2,
        "_source" : { ... }
      }
    ]
  }
}

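Step-by-Step Analysis

Scalar Query Analysis

Run the scalar (text) query on its own to get the rank of each document it matches. A term query is used here; because the search text is the single lowercase token rrf, it should match the same documents as the match query in the RRF request.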

GET index_rrf/_search?filter_path=**.hits
{
  "query": {
    "term": {
      "text": "rrf"
    }
  }
}

The result is as follows:

{
  "hits" : {
    "hits" : [
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "text" : "rrf",
          "vector" : [
            5
          ],
          "integer" : 1
        }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.21365023,
        "_source" : {
          "text" : "rrf rrf rrf rrf",
          "integer" : 2
        }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.20983505,
        "_source" : {
          "text" : "rrf rrf rrf",
          "vector" : [
            3
          ],
          "integer" : 1
        }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.20259935,
        "_source" : {
          "text" : "rrf rrf",
          "vector" : [
            4
          ],
          "integer" : 2
        }
      }
    ]
  }
}

The scalar query matches four documents (the document with _id 5 has no text field), so the scalar ranks are:

_id	1	2	3	4	5
rank	1	4	3	2	none

Vector Query Analysis

Run the vector query on its own to get the rank of each document it matches:

GET index_rrf/_search?filter_path=**.hits
{
  "query": {
    "knn": {
      "vector": {
        "vector": [
          5
        ],
        "k": 3,
        "ef": 100
      }
    }
  }
}

The result is as follows:

{
  "hits" : {
    "hits" : [
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "text" : "rrf",
          "vector" : [
            5
          ],
          "integer" : 1
        }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5,
        "_source" : {
          "text" : "rrf rrf",
          "vector" : [
            4
          ],
          "integer" : 2
        }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.33333334,
        "_source" : {
          "text" : "rrf rrf rrf",
          "vector" : [
            3
          ],
          "integer" : 1
        }
      },
      {
        "_index" : "index_rrf",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.16666667,
        "_source" : {
          "vector" : [
            0
          ],
          "integer" : 1
        }
      }
    ]
  }
}

The vector query matches four documents (the document with _id 4 has no vector field), so the vector ranks are:

_id	1	2	3	4	5
rank	1	2	3	none	4

Result Analysis

Put the ranks from the two queries side by side:

_id	query_rank	knn_rank
1	1	1
2	4	2
3	3	3
4	2	none
5	none	4
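A table like this can be derived mechanically from the order in which each query returns its hits. The following sketch is illustrative only; `rank_table` is a hypothetical helper, and the two ID lists are copied from the hit order of the standalone scalar and vector queries above.

```python
def rank_table(result_lists):
    """Build {doc_id: {query_name: rank or None}} from ordered hit lists.

    result_lists: dict mapping a query name to its document IDs, best first.
    """
    all_ids = sorted({doc_id for hits in result_lists.values() for doc_id in hits})
    table = {}
    for doc_id in all_ids:
        table[doc_id] = {
            # Ranks start at 1; None marks a document the query did not return.
            name: (hits.index(doc_id) + 1 if doc_id in hits else None)
            for name, hits in result_lists.items()
        }
    return table


table = rank_table(
    {
        "query_rank": ["1", "4", "3", "2"],  # hit order of the scalar query
        "knn_rank": ["1", "2", "3", "5"],    # hit order of the vector query
    }
)

for doc_id, ranks in table.items():
    print(doc_id, ranks["query_rank"], ranks["knn_rank"])
```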

Then compute the final scores with the RRF formula given above, using rank_constant = 1:

# doc  | query     | knn       | score
_id: 1 = 1.0/(1+1) + 1.0/(1+1) = 1
_id: 2 = 1.0/(1+4) + 1.0/(1+2) = 0.5333
_id: 3 = 1.0/(1+3) + 1.0/(1+3) = 0.5
_id: 4 = 1.0/(1+2)             = 0.3333
_id: 5 =             1.0/(1+4) = 0.2000

The final order is therefore _id 1, _id 2, _id 3, _id 4, _id 5, which matches the order and the _score values returned by the RRF query.
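The arithmetic can be double-checked with a short script. This is a verification sketch, not service code: the ranks are copied from the combined table, the `reported` values are the `_score` fields from the RRF response shown earlier, and `fused_score` is a hypothetical helper name.

```python
RANK_CONSTANT = 1

# (scalar rank, vector rank) per _id; None means the query did not return it.
ranks = {
    "1": (1, 1),
    "2": (4, 2),
    "3": (3, 3),
    "4": (2, None),
    "5": (None, 4),
}

# _score values copied from the RRF query response.
reported = {"1": 1.0, "2": 0.53333336, "3": 0.5, "4": 0.33333334, "5": 0.2}


def fused_score(doc_ranks, k):
    """Sum 1 / (k + rank) over the result sets that contain the document."""
    return sum(1.0 / (k + rank) for rank in doc_ranks if rank is not None)


scores = {doc_id: fused_score(r, RANK_CONSTANT) for doc_id, r in ranks.items()}
order = sorted(scores, key=scores.get, reverse=True)

for doc_id in order:
    print(doc_id, round(scores[doc_id], 4), reported[doc_id])
```

The small differences in the last digits come from the response reporting single-precision floating-point values.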
