Elastic Stack Notes (7): Aggregation Analysis in Elasticsearch 5.6

Date: 2019-11-06. Tags: Elastic Stack, Elasticsearch 5.6, aggregation analysis

Blog: http://www.moonxy.com

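1. Preface

Elasticsearch is a distributed full-text search engine whose core functions are indexing and searching. Its aggregations feature is also very powerful and allows complex statistical analysis over the data. Elasticsearch provides four families of aggregations: metric, bucket, pipeline, and matrix aggregations. The two to master first are metric aggregations and bucket aggregations.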

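Official documentation for aggregations: Aggregations

2. Aggregation Analysis

2.1 Metric Aggregations

Official documentation for metric aggregations: Metric

Metric aggregations mainly include min, max, sum, avg, stats, extended_stats, and value_count, and correspond to the aggregate functions in SQL.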

The official documentation describes metric aggregations as follows:

Aggregations that keep track and compute metrics over a set of documents.

The examples below start with the max aggregation.

Max Aggregation

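Official documentation for the max aggregation: Max Aggregation

The max aggregation computes the maximum value of a field, similar to the SQL aggregate function max(). In the request below, "max_price" is a user-defined aggregation name.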

##Max Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "max_price": {
      "max":  {
        "field": "price"
      }
    }
  }
}

The response is as follows:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "max_price": {
      "value": 81.4
    }
  }
}

Cardinality Aggregation

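Official documentation for the cardinality aggregation: Cardinality Aggregation

The cardinality aggregation counts distinct values. It works like a SQL distinct operation: duplicates are removed from the set of values, and the size of the deduplicated set is returned.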

##Cardinality Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "all_language": {
      "cardinality":  {
        "field": "language"
      }
    }
  }
}

The response is as follows:

{
  "took": 41,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "all_language": {
      "value": 3
    }
  }
}
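As a rough illustration of the distinct-then-count semantics, the following Python sketch computes the same figure over an in-memory list. The language values are an assumption, inferred from the terms aggregation output that appears later in this note, and the helper name `cardinality` is made up for illustration. Elasticsearch itself returns an approximate count based on the HyperLogLog++ algorithm, whereas this sketch is exact.

```python
# Hypothetical sample: one language value per book, inferred from the
# terms aggregation output in this note (2 java, 2 python, 1 javascript).
languages = ["java", "java", "python", "python", "javascript"]


def cardinality(values):
    """Exact distinct count: deduplicate the values, then measure the set size."""
    return len(set(values))


print(cardinality(languages))
```

For a handful of documents the exact and approximate counts agree; the approximation only matters for fields with very many distinct values.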

Stats Aggregation

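Official documentation for the stats aggregation: Stats Aggregation

The stats aggregation computes basic statistics and returns five metrics in a single response: count, max, min, avg, and sum. For example: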

##Stats Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "stats_pirce": {
      "stats":  {
        "field": "price"
      }
    }
  }
}

The response is as follows:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "stats_pirce": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319
    }
  }
}
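To make the five metrics concrete, here is a minimal Python sketch that computes them over a plain list. The price values are an assumption: they are not listed in this note, but were inferred from the aggregation responses (they are consistent with the reported min, max, sum, and per-language averages). The function name `stats` is hypothetical.

```python
import math

# Hypothetical prices, inferred from the responses shown in this note.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]


def stats(values):
    """Return the five metrics reported by a stats aggregation."""
    total = math.fsum(values)  # fsum limits floating-point rounding error
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


result = stats(prices)
print(result)
```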

Extended Stats Aggregation

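Official documentation for the extended stats aggregation: Extended Stats Aggregation

The extended stats aggregation is similar to the stats aggregation but returns four additional results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.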

##Extended Stats Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "extend_stats_pirce": {
      "extended_stats":  {
        "field": "price"
      }
    }
  }
}

The response is as follows:

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "extend_stats_pirce": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319,
      "sum_of_squares": 21095.46,
      "variance": 148.65199999999967,
      "std_deviation": 12.19229264740638,
      "std_deviation_bounds": {
        "upper": 88.18458529481276,
        "lower": 39.41541470518724
      }
    }
  }
}
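The additional fields can be reproduced by hand. The sketch below is an illustration only: the price list is an assumption inferred from the responses in this note, the function name `extended_stats` and its `sigma` parameter are hypothetical, and the variance is taken as the population variance (sum of squares divided by count, minus the squared mean), which matches the figures in the response above.

```python
import math

# Hypothetical prices, inferred from the responses shown in this note.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]


def extended_stats(values, sigma=2.0):
    """Compute the extra metrics of an extended stats aggregation."""
    count = len(values)
    total = math.fsum(values)
    avg = total / count
    sum_of_squares = math.fsum(v * v for v in values)
    # Population variance; clamp at 0 to guard against rounding error.
    variance = max(0.0, sum_of_squares / count - avg * avg)
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


ext = extended_stats(prices)
print(ext)
```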

Value Count Aggregation

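Official documentation for the value count aggregation: Value Count Aggregation

The value count aggregation counts the number of values of a given field across the matching documents.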

##Value Count Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "doc_count": {
      "value_count":  {
        "field": "author"
      }
    }
  }
}

The response is as follows:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "doc_count": {
      "value": 5
    }
  }
}
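Because the aggregation counts values rather than documents, the result can differ from the hit count when a field is missing or multi-valued. The following Python sketch illustrates that behavior over hypothetical documents; the sample data and the helper name `value_count` are assumptions made for illustration.

```python
# Hypothetical documents: one single-valued author, one multi-valued
# author, and one document without an author field.
docs = [
    {"title": "Book A", "author": "Alice"},
    {"title": "Book B", "author": ["Bob", "Carol"]},
    {"title": "Book C"},
]


def value_count(documents, field):
    """Count the values of a field; missing fields contribute nothing."""
    count = 0
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue
        # A list stands for a multi-valued field: every entry is counted.
        count += len(value) if isinstance(value, list) else 1
    return count


print(value_count(docs, "author"))
```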

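Note:

By default, fields of type text cannot be used for sorting or aggregation, because fielddata is disabled on them. For example, the following request aggregates on the title field, which is mapped as text: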

GET books/_search
{
  "size": 0, 
  "aggs": {
    "doc_count": {
      "value_count":  {
        "field": "title"
      }
    }
  }
}

The response is as follows:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "books",
        "node": "6n3douACShiPmlA9j2soBw",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ]
  },
  "status": 400
}

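2.2 Bucket Aggregations

Official documentation for bucket aggregations: Bucket Aggregations

A bucket can be thought of as a container: the aggregation iterates over the documents and places every document that meets a given criterion into the corresponding bucket. Bucketing is comparable to group by in SQL.

Two bucket aggregations are covered here, terms and range.

The terms aggregation groups documents by field value. The following request counts the books for each programming language: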

GET books/_search
{
  "size": 0, 
  "aggs": {
    "terms_count": {
      "terms":  {
        "field": "language"
      }
    }
  }
}

The response is as follows:

{
  "took": 31,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "terms_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "java",
          "doc_count": 2
        },
        {
          "key": "python",
          "doc_count": 2
        },
        {
          "key": "javascript",
          "doc_count": 1
        }
      ]
    }
  }
}

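On top of terms bucketing, a metric aggregation can be applied to each bucket. For example, to compute the average price of each category of books, first run a terms aggregation on the language field and then nest an avg aggregation inside it. The query is as follows: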

GET books/_search
{
  "size": 0, 
  "aggs": {
    "terms_count": {
      "terms":  {
        "field": "language"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

The response is as follows:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "terms_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "java",
          "doc_count": 2,
          "avg_price": {
            "value": 58.35
          }
        },
        {
          "key": "python",
          "doc_count": 2,
          "avg_price": {
            "value": 67.95
          }
        },
        {
          "key": "javascript",
          "doc_count": 1,
          "avg_price": {
            "value": 66.4
          }
        }
      ]
    }
  }
}
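The bucket-then-metric flow is the same idea as a SQL group by followed by avg(). A minimal Python sketch follows; the documents are an assumption inferred from the responses in this note, the function name `terms_with_avg` is hypothetical, and the ordering (document count descending, then key ascending for ties) is assumed from the order of the buckets in the response above.

```python
from collections import defaultdict

# Hypothetical documents, inferred from the responses shown in this note.
books = [
    {"language": "java", "price": 46.5},
    {"language": "java", "price": 70.2},
    {"language": "python", "price": 81.4},
    {"language": "python", "price": 54.5},
    {"language": "javascript", "price": 66.4},
]


def terms_with_avg(docs, key_field, value_field):
    """Group documents by key_field, then average value_field per bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key_field]].append(doc[value_field])
    buckets = [
        {"key": key, "doc_count": len(vals), "avg_price": sum(vals) / len(vals)}
        for key, vals in groups.items()
    ]
    # Assumed ordering: largest buckets first, ties broken by key.
    buckets.sort(key=lambda b: (-b["doc_count"], b["key"]))
    return buckets


for bucket in terms_with_avg(books, "language", "price"):
    print(bucket)
```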

Range Aggregation

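The range aggregation reflects how the data is distributed across ranges. For example, the following request buckets the books in the books index by price into three ranges: below 50, from 50 up to 80, and 80 and above. Each range includes its from value and excludes its to value.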

GET books/_search
{
  "size": 0, 
  "aggs": {
    "price_range": {
      "range": {
        "field": "price",
        "ranges": [
          {"to": 50},
          {"from": 50, "to": 80},
          {"from": 80}
        ]
      }
    }
  }
}

The response is as follows:

{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "price_range": {
      "buckets": [
        {
          "key": "*-50.0",
          "to": 50,
          "doc_count": 1
        },
        {
          "key": "50.0-80.0",
          "from": 50,
          "to": 80,
          "doc_count": 3
        },
        {
          "key": "80.0-*",
          "from": 80,
          "doc_count": 1
        }
      ]
    }
  }
}
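The bucketing rule can be sketched in a few lines of Python. This is an illustration under assumptions: the price list is inferred from the responses in this note, and the function name `range_buckets` is hypothetical. The ranges use the same from/to keys as the request above, with from treated as inclusive and to as exclusive.

```python
# Hypothetical prices, inferred from the responses shown in this note.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]

ranges = [{"to": 50}, {"from": 50, "to": 80}, {"from": 80}]


def range_buckets(values, range_specs):
    """Count the values falling into each [from, to) range."""
    buckets = []
    for spec in range_specs:
        lower = spec.get("from", float("-inf"))  # open-ended when omitted
        upper = spec.get("to", float("inf"))
        doc_count = sum(1 for v in values if lower <= v < upper)
        buckets.append({**spec, "doc_count": doc_count})
    return buckets


for bucket in range_buckets(prices, ranges):
    print(bucket)
```

Under this rule a price of exactly 50 lands in the second bucket, not the first.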

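The range aggregation works on numeric fields and can also be applied to date fields. The date range aggregation is dedicated to date fields; it differs from the range aggregation in that the from and to values can be written as date math expressions.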

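2.3 Pipeline Aggregations

Official documentation for pipeline aggregations: Pipeline Aggregations

Pipeline aggregations operate on the output of other aggregations rather than on documents.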
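As a rough sketch of the idea, the Python below takes the per-language average prices returned by the nested terms/avg query earlier in this note and derives new values from those buckets, in the spirit of the avg_bucket and max_bucket pipeline aggregations. The helper functions are hypothetical and only mimic the concept; they are not Elasticsearch APIs.

```python
import math

# Bucket output copied from the terms + avg response in this note.
buckets = [
    {"key": "java", "avg_price": 58.35},
    {"key": "python", "avg_price": 67.95},
    {"key": "javascript", "avg_price": 66.4},
]


def avg_bucket(bucket_list, metric):
    """Average a metric across buckets (input is buckets, not documents)."""
    return math.fsum(b[metric] for b in bucket_list) / len(bucket_list)


def max_bucket(bucket_list, metric):
    """Return the key and value of the bucket with the largest metric."""
    best = max(bucket_list, key=lambda b: b[metric])
    return best["key"], best[metric]


print(avg_bucket(buckets, "avg_price"))
print(max_bucket(buckets, "avg_price"))
```

Note that the average of the bucket averages generally differs from the overall average price, because the buckets hold different numbers of documents.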

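2.4 Matrix Aggregations

Official documentation for matrix aggregations: Matrix Aggregations

Matrix Stats

The matrix stats aggregation is a numeric aggregation that computes the following statistics over a set of document fields:

Count: the number of samples of each field included in the calculation;

Mean: the average value of each field;

Variance: how far the samples of each field spread out from the mean;

Skewness: a measure of the asymmetry of each field's distribution around the mean;

Kurtosis: a measure of the shape of each field's distribution;

Covariance: a matrix that quantifies how changes in one field are associated with changes in another;

Correlation: the covariance matrix scaled to values in the range [-1, 1]; it describes the relationship between the distributions of two fields.

It is mainly used to examine the relationship between two numeric fields, for example between log record size and HTTP status code.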

GET /_search
{
    "aggs": {
        "statistics": {
            "matrix_stats": {
                "fields": ["log_size", "status_code"]
            }
        }
    }
}
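To show what the covariance and correlation entries express, here is a small Python sketch for two numeric fields. Everything in it is an assumption made for illustration: the sample values for log_size and status_code are invented, the function names are hypothetical, and the covariance uses the sample (n - 1) normalization, which may differ from what Elasticsearch uses internally. The correlation value does not depend on that choice.

```python
import math

# Invented sample values for the two fields used in the request above.
log_size = [120.0, 340.0, 560.0, 780.0]
status_code = [200.0, 200.0, 404.0, 500.0]


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (n - 1 normalization)."""
    n = len(xs)
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    return math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled into the range [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


print(covariance(log_size, status_code))
print(correlation(log_size, status_code))
```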