Postgresql 中文操作指南

CREATE STATISTICS

CREATE STATISTICS — 定义扩展统计信息

CREATE STATISTICS — define extended statistics

Synopsis

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
    ON ( expression )
    FROM table_name

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | ( expression ) }, { column_name | ( expression ) } [, ...]
    FROM table_name

Description

CREATE STATISTICS 将创建一个新的扩展统计对象,用于跟踪有关指定表、外部表或物化视图的数据。统计对象将在当前数据库中创建,并由发出该命令的用户拥有。

CREATE STATISTICS will create a new extended statistics object tracking data about the specified table, foreign table or materialized view. The statistics object will be created in the current database and will be owned by the user issuing the command.

CREATE STATISTICS 命令具有两种基本形式。第一种形式允许收集单个表达式的单变量统计信息,提供类似于表达式索引的好处,而没有维护索引的开销。此表单不允许指定统计种类,因为各种统计种类仅指多变量统计。该命令的第二种形式允许收集多个列和/或表达式的多变量统计信息,并选择性地指定要包括的统计种类。此表单还会自动收集列表中包含的任何表达式上的单变量统计信息。

The CREATE STATISTICS command has two basic forms. The first form allows univariate statistics for a single expression to be collected, providing benefits similar to an expression index without the overhead of index maintenance. This form does not allow the statistics kind to be specified, since the various statistics kinds refer only to multivariate statistics. The second form of the command allows multivariate statistics on multiple columns and/or expressions to be collected, optionally specifying which statistics kinds to include. This form will also automatically cause univariate statistics to be collected on any expressions included in the list.

如果给定模式名称(例如 CREATE STATISTICS myschema.mystat …​ ),则统计对象将创建在指定模式中。否则,它将创建在当前模式中。如果给定,则统计对象名称必须不同于同一模式中任何其他统计对象的名称。

If a schema name is given (for example, CREATE STATISTICS myschema.mystat …​) then the statistics object is created in the specified schema. Otherwise it is created in the current schema. If given, the name of the statistics object must be distinct from the name of any other statistics object in the same schema.

Parameters

  • IF NOT EXISTS

    • Do not throw an error if a statistics object with the same name already exists. A notice is issued in this case. Note that only the name of the statistics object is considered here, not the details of its definition. Statistics name is required when IF NOT EXISTS is specified.

  • statistics_name

    • The name (optionally schema-qualified) of the statistics object to be created. If the name is omitted, PostgreSQL chooses a suitable name based on the parent table’s name and the defined column name(s) and/or expression(s).

  • statistics_kind

    • A multivariate statistics kind to be computed in this statistics object. Currently supported kinds are ndistinct, which enables n-distinct statistics, dependencies, which enables functional dependency statistics, and mcv which enables most-common values lists. If this clause is omitted, all supported statistics kinds are included in the statistics object. Univariate expression statistics are built automatically if the statistics definition includes any complex expressions rather than just simple column references. For more information, see Section 14.2.2 and Section 76.2.

  • column_name

    • The name of a table column to be covered by the computed statistics. This is only allowed when building multivariate statistics. At least two column names or expressions must be specified, and their order is not significant.

  • expression

    • An expression to be covered by the computed statistics. This may be used to build univariate statistics on a single expression, or as part of a list of multiple column names and/or expressions to build multivariate statistics. In the latter case, separate univariate statistics are built automatically for each expression in the list.

  • table_name

    • The name (optionally schema-qualified) of the table containing the column(s) the statistics are computed on; see ANALYZE for an explanation of the handling of inheritance and partitions.

Notes

您必须是表的拥有者才能创建该表的读取统计信息对象。但是,一旦创建统计信息对象,它便独立于基础表的所有权。

You must be the owner of a table to create a statistics object reading it. Once created, however, the ownership of the statistics object is independent of the underlying table(s).

表达式统计信息是逐表达式的,类似于在表达式上创建索引,但它们可以避免索引维护的开销。会自动为统计信息对象定义中的每个表达式构建表达式统计信息。

Expression statistics are per-expression and are similar to creating an index on the expression, except that they avoid the overhead of index maintenance. Expression statistics are built automatically for each expression in the statistics object definition.

计划程序目前未使用扩展统计信息进行表连接计算的选择性估计值。此限制可能会在未来版本的 PostgreSQL 中得到解决。

Extended statistics are not currently used by the planner for selectivity estimations made for table joins. This limitation will likely be removed in a future version of PostgreSQL.

Examples

使用两个函数相关列创建表 t1 ,即,了解第一列的值足以确定其他列中的值。然后,在这些列上构建函数相关统计信息:

Create table t1 with two functionally dependent columns, i.e., knowledge of a value in the first column is sufficient for determining the value in the other column. Then functional dependency statistics are built on those columns:

CREATE TABLE t1 (
    a   int,
    b   int
);

INSERT INTO t1 SELECT i/100, i/500
                 FROM generate_series(1,1000000) s(i);

ANALYZE t1;

-- the number of matching rows will be drastically underestimated:
EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);

CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;

ANALYZE t1;

-- now the row count estimate is more accurate:
EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);

如果没有函数相关统计信息,计划程序会假设这两个 WHERE 条件是独立的,并且会将它们的选择性相乘以得出小得多的行计数估计值。有了这些统计信息,计划程序就会认识到 WHERE 条件是冗余的,并且不会低估行计数。

Without functional-dependency statistics, the planner would assume that the two WHERE conditions are independent, and would multiply their selectivities together to arrive at a much-too-small row count estimate. With such statistics, the planner recognizes that the WHERE conditions are redundant and does not underestimate the row count.

使用两个完全相关的列(包含相同数据)和一个 MCV 列表创建表 t2

Create table t2 with two perfectly correlated columns (containing identical data), and an MCV list on those columns:

CREATE TABLE t2 (
    a   int,
    b   int
);

INSERT INTO t2 SELECT mod(i,100), mod(i,100)
                 FROM generate_series(1,1000000) s(i);

CREATE STATISTICS s2 (mcv) ON a, b FROM t2;

ANALYZE t2;

-- valid combination (found in MCV)
EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);

-- invalid combination (not found in MCV)
EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);

MCV 列表为计划程序提供了有关表中通常出现的具体值的更详细信息,以及未在表中出现的组合值的选择性的上限,从而允许它在两种情况下生成更好的估计值。

The MCV list gives the planner more detailed information about the specific values that commonly appear in the table, as well as an upper bound on the selectivities of combinations of values that do not appear in the table, allowing it to generate better estimates in both cases.

使用单个时间戳列创建表 t3 ,并使用该列上的表达式运行查询。如果没有扩展统计信息,计划程序不了解表达式的数据分布,会使用默认估计值。计划程序也不认识日期截断为月的值完全由日期截断为天的值确定。然后基于这两个表达式构建表达式和 ndistinct 统计信息:

Create table t3 with a single timestamp column, and run queries using expressions on that column. Without extended statistics, the planner has no information about the data distribution for the expressions, and uses default estimates. The planner also does not realize that the value of the date truncated to the month is fully determined by the value of the date truncated to the day. Then expression and ndistinct statistics are built on those two expressions:

CREATE TABLE t3 (
    a   timestamp
);

INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
                                             '2020-12-31'::timestamp,
                                             '1 minute'::interval) s(i);

ANALYZE t3;

-- the number of matching rows will be drastically underestimated:
EXPLAIN ANALYZE SELECT * FROM t3
  WHERE date_trunc('month', a) = '2020-01-01'::timestamp;

EXPLAIN ANALYZE SELECT * FROM t3
  WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
                                 AND '2020-06-30'::timestamp;

EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
   FROM t3 GROUP BY 1, 2;

-- build ndistinct statistics on the pair of expressions (per-expression
-- statistics are built automatically)
CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;

ANALYZE t3;

-- now the row count estimates are more accurate:
EXPLAIN ANALYZE SELECT * FROM t3
  WHERE date_trunc('month', a) = '2020-01-01'::timestamp;

EXPLAIN ANALYZE SELECT * FROM t3
  WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
                                 AND '2020-06-30'::timestamp;

EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
   FROM t3 GROUP BY 1, 2;

如果没有表达式和 ndistinct 统计信息,计划程序不了解表达式中不同值的个数,不得不依靠默认估计值。等式和范围条件被认为具有 0.5% 的选择性,并且表达式中不同值的个数被认为与列相同(即唯一)。这样会严重低估前两个查询中的行计数。此外,计划程序不了解表达式之间的关系,所以它假设 WHEREGROUP BY 条件是独立的,并将它们的选择性相乘以得出总计查询中组计数的严重高估值。由于没有表达式的准确统计信息,这种情况会进一步恶化,这会迫使计划程序为该表达式使用默认的 ndistinct 估计,该估计是根据该列的 ndistinct 得出的。通过此类统计信息,计划程序能认识到这些条件是相关的,并且得出更准确的估计值。

Without expression and ndistinct statistics, the planner has no information about the number of distinct values for the expressions, and has to rely on default estimates. The equality and range conditions are assumed to have 0.5% selectivity, and the number of distinct values in the expression is assumed to be the same as for the column (i.e. unique). This results in a significant underestimate of the row count in the first two queries. Moreover, the planner has no information about the relationship between the expressions, so it assumes the two WHERE and GROUP BY conditions are independent, and multiplies their selectivities together to arrive at a severe overestimate of the group count in the aggregate query. This is further exacerbated by the lack of accurate statistics for the expressions, forcing the planner to use a default ndistinct estimate for the expression derived from ndistinct for the column. With such statistics, the planner recognizes that the conditions are correlated, and arrives at much more accurate estimates.

Compatibility

SQL 标准中没有 CREATE STATISTICS 命令。

There is no CREATE STATISTICS command in the SQL standard.