What is Baked in the Baker's Dozen?
On April 8, PostgreSQL feature freeze took place, so only features committed earlier will get into version PostgreSQL 13. Probably, this version can hardly be considered revolutionary, since it has no conceptual changes. Some of critical patches were late to get into it, such as Table and Functions for the JSON/SQL standard, which had been desirable to be part of PostgreSQL 12, along with the JSONPath patch; plug-in warehouses did not appear either — only the interface is being finalized. The list of improvements is still impressive. We prepared a pretty complete overview of the patches of the Baker’s Dozen.
Changes to SQL commands
CREATE DATABASE… LOCALE
The initdb
and createdb
utilities, as well as the CREATE COLLATION
command, have the LOCALE
parameter, which enables specifying the values of LC_CTYPE
and LC_COLLATE
at the start. Now the CREATE DATABASE
command has it as well:
CREATE DATABASE db_koi8r TEMPLATE template0
ENCODING 'KOI8R' LOCALE 'ru_RU.KOI8R';
ALTER VIEW… RENAME COLUMN
The ALTER VIEW
command now enables changing the column name in a view. Earlier, we had to recreate the view to do this.
Suppose we forgot to give a name to a column:
CREATE VIEW uptime AS SELECT current_timestamp, current_timestamp - pg_postmaster_start_time();
SELECT * FROM uptime;
current_timestamp | ?column?
-------------------------------+-----------------
2020-03-23 15:37:00.088824+03 | 04:18:24.897856
We can correct this:
ALTER VIEW uptime RENAME COLUMN "?column?" TO uptime;
SELECT * FROM uptime;
current_timestamp | uptime
-------------------------------+-----------------
2020-03-23 15:37:40.726516+03 | 04:19:05.535548
ALTER TABLE… ALTER COLUMN… DROP EXPRESSION
Now we can turn a generated column of a table into a usual one, that is, remove the expression:
CREATE TABLE payments (
id integer PRIMARY KEY,
amount numeric(18,2),
income_tax numeric(18,2) GENERATED ALWAYS AS (amount*0.13) STORED
);
INSERT INTO payments(id, amount) VALUES(1, 42);
\d payments
Table "public.payments"
Column | Type | Collation | Nullable | Default
------------+---------------+-----------+----------+--------------------------------------
id | integer | | not null |
amount | numeric(18,2) | | |
income_tax | numeric(18,2) | | | generated always as ((amount * 0.13))
stored
Indexes:
"payments_pkey" PRIMARY KEY, btree (id)
Suppose we changed our mind, and now we prefer to define income_tax
explicitly. So we just remove the expression:
ALTER TABLE payments ALTER COLUMN income_tax DROP EXPRESSION;
\d payments
Table "public.payments"
Column | Type | Collation | Nullable | Default
------------+---------------+-----------+----------+---------
id | integer | | not null |
amount | numeric(18,2) | | |
income_tax | numeric(18,2) | | |
Indexes:
"payments_pkey" PRIMARY KEY, btree (id)
Certainly, the existing data in the column are still there:
SELECT * FROM payments;
id | amount | income_tax
----+--------+------------
1 | 42.00 | 5.46
DROP DATABASE… FORCE
If we need to delete a database before all the users are disconnected, we can benefit from the new FORCE
option of the DROP DATABASE
command.
CREATE DATABASE db;
We connect to the new database:
CREATE EXTENSION IF NOT EXISTS dblink;
SELECT dblink_connect('dbname=db');
dblink_connect
----------------
OK
And now we delete it forcing open connections to terminate, same way as pg_terminate_backend
does:
DROP DATABASE db WITH (FORCE);
ALTER TYPE… SET STORAGE
The ALTER TYPE
command enables us to change various properties of base data types, including the storage strategy. Earlier, it could be specified only in the CREATE TYPE
command.
Rather than create a new base type for demo purposes, we will use the existing tsquery
type. But first we will create a separate database and connect to it:
CREATE DATABASE db;
\c db
For the tsquery
data type, the plain
storage strategy is used, so columns in tables of this type get the same strategy:
SELECT typname,typstorage FROM pg_type WHERE typname = 'tsquery';
typname | typstorage
---------+------------
tsquery | p
CREATE TABLE queries (query tsquery);
SELECT attname, attstorage FROM pg_attribute WHERE attrelid = 'queries'::regclass AND attname = 'query';
attname | attstorage
---------+------------
query | p
If we need to use another strategy for new tables, we can change the base type:
ALTER TYPE tsquery SET (storage=external);
SELECT typname,typstorage FROM pg_type WHERE typname = 'tsquery';
typname | typstorage
---------+------------
tsquery | e
The storage type in the new tables will also change:
CREATE TABLE new_queries (query tsquery);
SELECT attname, attstorage FROM pg_attribute WHERE attrelid = 'new_queries'::regclass AND attname = 'query';
attname | attstorage
---------+------------
query | e
Please keep in mind that we cannot change the strategy intended for use of TOAST back to plain
:
ALTER TYPE tsquery SET (storage=plain);
ERROR: cannot change type's storage to PLAIN
That’s why we conducted the experiments in a separate database, which we won’t mind to delete.
ALTER STATISTICS… SET STATISTICS
The CREATE STATISTICS
command can gather lists of most common values for selected combinations of table columns. The number of most common values gathered is defined by the default_statistics_target
parameter. The value for specific statistics can now be changed by the command:
ALTER STATISTICS name SET STATISTICS new_value;
FETCH FIRST clause WITH TIES option
As we know, in the SELECT
command, the syntax defined in the SQL standard can be used instead of the LIMIT
option:
SELECT * FROM (VALUES (1,'1.1'), (2,'2.1'), (2,'2.2'), (3,'3.1')) t(a,b)
ORDER BY a
FETCH FIRST 2 ROWS ONLY;
a | b
---+-----
1 | 1.1
2 | 2.1
(2 rows)
Now FETCH
also supports WITH TIES
option, which adds to the output all «familial» rows (rows equal to already selected if only sort condition is taken into account):
SELECT * FROM (VALUES (1,'1.1'), (2,'2.1'), (2,'2.2'), (3,'3.1')) t(a,b)
ORDER BY a
FETCH FIRST 2 ROWS WITH TIES;
a | b
---+-----
1 | 1.1
2 | 2.1
2 | 2.2
(3 rows)
Built-in functions and data types
get_random_uuid
The new get_random_uuid
function returns version 4 UUID (random value):
SELECT gen_random_uuid();
gen_random_uuid
--------------------------------------
25e02793-80c0-438c-be07-c94b966c43ab
The function is useful for generation of unique values of the UUID type in distributed systems.
Formerly, we had to use the uuid-ossp
or pgcrypto
library.
min_scale and trim_scale for values of type numeric
The min_scale
function determines how many significant figures are contained in the fractional part a number, and the trim_scale
function removes zeros:
SELECT n, min_scale(n), trim_scale(n) FROM (VALUES(42.42000)) as t(n);
n | min_scale | trim_scale
----------+-----------+------------
42.42000 | 2 | 42.42
gcd and lcm
This is an addition to a set of mathematical functions. Now we can quickly find the greatest common divisor (gcm
) and least common multiple (lcm
):
SELECT gcd(54,24), lcm(54,24);
gcd | lcm
-----+-----
6 | 216
Aggregate functions min and max for the pg_lsn data type
For the pg_lsn
data type, min
and max
aggregate functions are added, which enables execution of queries such as:
SELECT min(restart_lsn) FROM pg_replication_slots;
SELECT min(sent_lsn) FROM pg_stat_replication;
Check of the type modifier of the returned function value
In previous versions, the type modifier was not checked for the function value returned.
Suppose that a type is available to store currency units and a function that returns the amount of income taxes:
CREATE TYPE currency AS (
amount numeric(18,2),
code text
);
CREATE FUNCTION income_tax(quantity currency) RETURNS currency
AS 'SELECT quantity.amount * 0.13, quantity.code' LANGUAGE SQL;
When calling the function, we expect two decimal places, but we get four. Even the explicit type cast after the function call does not help (the third column):
SELECT amount, code, amount::numeric(18,2)
FROM income_tax(ROW(42.42, 'RUR'))\gx
-[ RECORD 1 ]--
amount | 5.5146
code | RUR
amount | 5.5146
In version 13, the result is correct:
SELECT amount, code, amount::numeric(18,2)
FROM income_tax(ROW(42.42, 'RUR'))\gx
-[ RECORD 1 ]
amount | 5.51
code | RUR
amount | 5.51
Localized names in to_date () and to_timestamp ()
The to_date
and to_timestamp
functions learned to recognize localized names of months and of week days. Formerly, only English names could be used:
SELECT to_date('Вторник, 24 Марта 2020', 'TMDay, DD TMMonth YYYY');
to_date
------------
2020-03-24
normalize and IS NORMALIZED
For compliance with the SQL standard, the normalize()
function was added to normalize a Unicode string and the IS NORMALIZED predicate to check whether the string is normalized.
SELECT str, normalize(str, NFC) AS nfc,
str IS NFC NORMALIZED AS is_nfc_normalized,
normalize(str, NFKC) AS nfkc,
str IS NFKC NORMALIZED AS is_nfkc_normalized
FROM (VALUES(U&'\0032\2075')) AS vals(str)\gx
-[ RECORD 1 ]------+---
str | 2⁵
nfc | 2⁵
is_nfc_normalized | t
nfkc | 25
is_nfkc_normalized | f
More details on Unicode normalization forms.
xid8 type and xid8_current () function for 64-bit transaction IDs
The new xid8
data type was added for a 64-bit transaction ID. But this does not mean that PostgreSQL migrated to 64-bit transactions: everything works exactly as before. But certain functions return the new type; for example: it is recommended to use pg_current_xact_id
rather than the old txid_current
function, which returned int8
, and so forth.
New polymorphic data types of anycompatible family
anycompatible
, anycompatiblearray
, anycompatiblenonarray
and anycompatiblerange
types are added. Unlike types of the anyelement
family, the new types enable using not exactly the same, but actually compatible types.
In the following example, the maximum
function takes integer
and numeric
arguments defined as anycompatible
. The value returned is cast to the type common for the two:
CREATE FUNCTION maximum(a anycompatible, b anycompatible) RETURNS anycompatible
AS 'SELECT CASE WHEN a > b THEN a ELSE b END' LANGUAGE SQL;
SELECT f, pg_typeof(f) FROM maximum(42, 42.42) f;
f | pg_typeof
-------+-----------
42.42 | numeric
Moreover, anycompatible-
and any-
are two independent type sets:
CREATE FUNCTION first_elems(a anyarray, b anycompatiblearray)
RETURNS TABLE(a anyelement, b anycompatible) AS $$
SELECT ROW(a[array_lower(a,1)], b[array_lower(b,1)]);
$$ LANGUAGE sql;
SELECT first_elems(ARRAY[1,2,3], ARRAY['a','b','c']) AS str,
first_elems(ARRAY[1.1,2.2,3.3],ARRAY[true,false]) bool;
str | bool
-------+---------
(1,a) | (1.1,t)
Procedural languages
Transform for bool type to PL/Perl
A while ago, TRANSFORM by Ivan Panchenko (Deputy CEO of Postgres Professional) was committed — bool_plperl. Postgres passes Boolean values as t
or f
to PL/Perl, but for Perl, t
and f
are not the true and false values of the Boolean variable, but just the letters «t» and «f». There are a few ways to resolve this issue (see email), but to create TRANSFORM for bool
is the most efficient.
Faster execution of simple expression in PL/pgSQL
Simple expressions (at least not containing access to tables and not requiring locking) will be executed faster. Formerly, in such cases time was inefficiently wasted on invoking the planner in each loop.
CREATE OR REPLACE FUNCTION slow_pi() RETURNS double precision AS $$
DECLARE
a double precision := 1;
s double precision := 1;
r double precision := 0;
BEGIN
FOR i IN 1 .. 10000000 LOOP
r := r + s/a; a := a + 2; s := -s;
END LOOP;
RETURN 4*r;
END;
$$ LANGUAGE plpgsql;
Calling slow_pi()
in PostgreSQL 12:
SELECT slow_pi();
slow_pi
--------------------
3.1415925535898497
(1 row)
Time: 13060,650 ms (00:13,061)
Doing the same in PostgreSQL 13:
SELECT slow_pi();
slow_pi
--------------------
3.1415925535897915
(1 row)
Time: 2108,464 ms (00:02,108)
Trustable extensions instead of pg_pltemplate
The system catalog now contains one table less. pg_pltemplate
is removed. It stored properties of procedural languages that were needed to execute CREATE LANGUAGE
. But it was decided to move the properties from pg_pltemplate
into scripts of appropriate language extensions and get rid of the table itself. But to implement the idea, the database owner (non-superuser) must be able to create the trusted language from an extension script. This is because currently to create plpgsql
for instance, the database owner does not need to be superuser.
So, the following was done. A new Boolean parameter, trusted
, was added to the extensions' control file. If both trusted
and superuser
parameters are turned on, not only superuser can create the extension, but also a user with CREATE
privilege for the current database (including its owner). When the script of such an extension is run, privileges of the superuser who initialized the cluster will be used. This superuser will also own the objects created by the extension, while the extension itself will be owned by the user who created it.
The following are important implications of these changes:
Trustable extensions open the floodgates for third-party developers to creation of other trusted languages. It was hardcoded in pg_pltemplate
that plpython pertains to the second version of the language. Dropping pg_pltemplate
is a step (necessary, although not sufficient) towards transitioning to python3.
Indexes
B-tree deduplication
A critical and long awaited patch (the work on it started as early as in Y2015) created by Anastasia Lubennikova (Postgres Professional) and Peter Geoghegan was finally committed by Peter. Anastasia already reported on this at PGconf India. Postgres learned to considerably reduce sizes of B-tree indexes by means of deduplication, that is, by skimping on repeating index keys. These indexes were deeply reworked to enable compression without losing backward compatibility of indexes. The idea of deduplication was borrowed from a more flexible architecture of GIN indexes.
In these indexes, situations where a lot of rows match a key occur more frequently than in B-tree. For example: in the case of text processing, one lexeme usually occurs in several documents. And it is stored in the index only once. B-tree indexes could not ensure this until recently.
B-tree indexes are different from GIN mainly in leaf pages. Depending on the number of rows related to the same value of a key, there are the following possibilities: a page contains only the «posting list», that is, the list of TIDs (tuple IDs of rows being indexed) if the list is short, but if there are a lot of TIDs, instead of value lists, new «branches of trees» are stored — references to other pages of the «posting list» type or other branches of the tree (they are called «posting trees»).
Such a tree-like structure is similar to B-tree, but differs in significant details: for example, lists for moving across pages of the same tree level are unidirectional in GIN rather than bidirectional. Therefore (among the rest), it is uneasy to ensure good backward compatibility of new deduplicated indexes. And enhancements really took more than 3 years. It was also needed to perfect vacuum techniques (microvacuum) and other details.
In performance tests, all the indexes to which deduplication was applicable were compressed by approximately 3 times. Compression of duplicates also helps unique indexes by eliminating the issue of index bloat up when a table changes at a high rate of INSERT/DELETE. This behavior can be turned on/off manually at index setup.
Avoid full GIN index scan when possible
Now we can avoid a walk through a whole GIN index in certain cases. Some operations, although supported by GIN index, are actually performed by a full index scan. For example: let’s consider an index for full-text search by the tsvector
column. If the search query looks like «anything but a given word», we will have to read the whole index. If the same query also includes another condition that does not require a full index scan, the whole index will be scanned regardless.
With the new optimization, first a more restrictive condition will be used and then the results will be rechecked so as to also take into account the other limitation. Compare the number of pages that were read in version 12 (Buffers):
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM mail_messages WHERE tsv @@ to_tsquery('!vacuum') AND tsv @@ to_tsquery('analyze');
QUERY PLAN
------------------------------------------------------------------
Bitmap Heap Scan on mail_messages (actual rows=5864 loops=1)
Recheck Cond: ((tsv @@ to_tsquery('!vacuum'::text)) AND (tsv @@ to_tsquery('analyze'::text)))
Heap Blocks: exact=5167
Buffers: shared hit=24 read=27405
-> Bitmap Index Scan on mail_messages_tsv_idx (actual rows=5864 loops=1)
Index Cond: ((tsv @@ to_tsquery('!vacuum'::text)) AND (tsv @@ to_tsquery('analyze'::text)))
Buffers: shared hit=24 read=22238
Planning Time: 0.283 ms
Execution Time: 3258.234 ms
with the number of buffers in the new version:
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM mail_messages WHERE tsv @@ to_tsquery('!vacuum') AND tsv @@ to_tsquery('analyze');
QUERY PLAN
---------------------------------------------------------------------------
Bitmap Heap Scan on mail_messages (actual rows=5864 loops=1)
Recheck Cond: ((tsv @@ to_tsquery('!vacuum'::text)) AND (tsv @@ to_tsquery('analyze'::text)))
Heap Blocks: exact=5156
Buffers: shared hit=5179
-> Bitmap Index Scan on mail_messages_tsv_idx (actual rows=5864 loops=1)
Index Cond: ((tsv @@ to_tsquery('!vacuum'::text)) AND (tsv @@ to_tsquery('analyze'::text)))
Buffers: shared hit=23
Planning Time: 0.250 ms
Execution Time: 8.779 ms
We can come across a similar situation when using trigrams and when checking for being array elements.
Opclass parameters
Many index access methods in PostgreSQL are actually a «framework» that takes over a high-level implementation of a search algorithm, work with pages and locks, as well as with WAL. And it’s operator classes that associate the method with specific data types and operators.
Operators could not have parameters as yet. For example, GiST index with the tsvector_ops
operator class can be used for full-text search (refer here for GiST operator classes). This operator class uses a signature tree, and the signature must be fixed-length (124 bytes). Now we can explicitly specify the length, which enables us to control the balance between the index size and efficiency (the number of hash collisions):
CREATE INDEX ON mail_messages USING gist(tsv tsvector_ops(siglen=1024));
For a start, similar changes were also made to other GiST operator classes that use a signature tree; this involves the hstore
, intarray
and pg_trgm
extensions.
But the main idea behind proposing this change is a possibility to pass a JSONPath expression to a GIN index in order to index only the necessary part of the JSON document rather than the entire document. In many cases this will enable us to drastically reduce the index sizes. But this work is yet to be done.
This Oleg Bartunov’s idea is implemented by Nikita Glukhov and Alexander Korotkov (all from Postgres Professional).
<-> (box, point) operator is added
A missing operation to be used in kNN for GiST and SP-GiST was added. In PostgreSQL 12, when working with geometric types point
and box
, a distance operator <->(point, box)
can be used, which will speed up search with GiST and SP-GiST indexes. But the reverse operator <->(box, point)
was not implemented although box
already recognized distances to more complex types — polygons and circles.
CREATE TABLE points(point point);
CREATE TABLE boxes(box box);
INSERT INTO points VALUES('1,2','3,4');
INSERT INTO boxes VALUES(box(point(3,4),point(5,6)), box(point(13,14),point(15,16)));
В PostgreSQL 12:
SELECT p.point, b.box, b.box <-> p.point distance FROM points AS p, boxes AS b ORDER BY distance;
ERROR: operator does not exist: box <-> point
Inversely, everything is OK:
SELECT p.point, b.box, p.point <-> b.box distance FROM points AS p, boxes AS b ORDER BY distance;
point | box | distance
-------+-----------------+--------------------
(1,2) | (5,6),(3,4) | 2.8284271247461903
(2,1) | (5,6),(3,4) | 3.1622776601683795
(1,2) | (15,16),(13,14) | 16.970562748477143
(2,1) | (15,16),(13,14) | 17.029386365926403
And in PostgreSQL 13:
SELECT p.point, b.box, b.box <-> p.point distance FROM points AS p, boxes AS b ORDER BY distance;
point | box | distance
-------+-----------------+--------------------
(1,2) | (5,6),(3,4) | 2.8284271247461903
(2,1) | (5,6),(3,4) | 3.1622776601683795
(1,2) | (15,16),(13,14) | 16.970562748477143
(2,1) | (15,16),(13,14) | 17.029386365926403
GiST and SP-GiST indexes will accelerate on this operation too.
Note what this query returns in PostgreSQL 13:
SELECT count(*) FROM pg_operator WHERE oprname = '<->';
count
-------
28
But if we run the same query in PostgreSQL 12, we will get 20 rows: as many as 8 operators were added to the list in version 13.
JSON
Support for jsonpath .datetime () method
It is one of non-qualified patches from a large series of JSONPath patches, which was late to be ready by the PostgreSQL 12 release. It is part of JSON/SQL standard. The issue is that all functions in the series of JSONPath patches are immutable, but comparison of dates takes into account the time zone, which can vary during a session.
In such cases, we allow existing immutable functions to throw an error on non-immutable comparisons. At the same time, this patch contains functions named with the _tz
suffix, which work fine with timezone
.
jsonb_set_lax function
In general, «lax» is a non-strict (unlike «strict») mode of functions manipulating jsonb
. In this case, the function will be operational when one of the arguments that it takes equals NULL. Unlike the strict version — jsonb_set()
, it has an additional argument that indicates actions in the case of NULL. The options are: use_json_null / raise_exception / return_target / delete_key
. These options were proposed by concerned users.
Optimization of some jsonb functions
A great deal was optimized, mainly by Nikita Glukhov (Postgres Professional). It does not make sense to discuss each item here: first, the abundance of them would swell the article that is pretty large as it is; second, the changes deal with the internal structure, which is interesting not to everyone. So, let’s only list most of the changes:
JsonbExtractScalar()
function is optimized.#>>
operator, as well asjsonb_each_text()
andjsonb_array_elements_text()
functions are optimized.- Recognition of the
JsonbContainer
type inget_jsonb_path_all()
is optimized. - Fetch of the first token from the
JsonbIterator
iterator is replaced with a light-weightJsonbContainerIsXxx()
macro. - Key retrieval is more convenient now —
findJsonbKeyInObject
. - Storage of the result of
findJsonbValueFromContainer()
andgetIthJsonbValueFromContainer()
is optimized. - The
get_jsonb_path_all()
function is optimized, it no longer uses an iterator.
Code ofJsonbValueAsText
is rewritten.
As they say, information on patches of SQL/JSON: JSON_TABLE and SQL/JSON: functions could be here. We expected them and hoped to see them committed. Unfortunately this did not happen. The patches are large and involve a lot of files. We believe they will get into PostgreSQL 14. And JSONPath is left on its own so far.
Utilities and extensions
pgbench
The benchmarking utility got a number of improvements. Execution statistics for tasks at the initialization phase, a more visualized output, a possibility to look up the code of the built-in scriptsand testing on a partitioned table of accounts were added.
Besides, the \aset
command was added, which is similar to \gset
, but enables setting the values for variables of several queries sent in one go. The following string sent to the server for execution will set the values of both variables one
and two
:
SELECT 1 AS one \; SELECT 2 AS two \aset
pg_dump
pg_dump
learned to dump the data of foreign tables. The --include-foreign-data
parameter enables specifing a list of foreign servers whose table data will be dumped.
Such dumping should be used cautiously. Data do not necessarily need to be loaded to a foreign server. In addition, a foreign server is not unlikely to be unavailable during recovery. Or a foreign server can allow only reading, but not writing.
psql
A series of small patches adds more convenience to working in psql:
- Autocomplete is improved for several commands.
- In addition to
\echo
, which sends a string to STDOUT, the new\warn
command sends a string to the standard error output (STDERR). - The
\d
command for a TOAST table provides information on the index and table. And for an index on a partitioned table, we can use the\d+
command to get a list of index partitions with table spaces. - In the
\dt+
command, the new «Persistence» column shows whether tables are unlogged or permanent. - We can output a query edited by the
\e
command to the screen if we do not put;
(or the\g*
command) at the end of the query. Now, on return from the editor to thepsql
window, we cannot see the text of the command prior to its execution. - The default prompt will include the information on the status of the current transaction. To this end, a special character
%x
is added to thePROMPT1
andPROMPT2
variables. - New commands appeared for getting additional information on access methods:
\dAc
,\dAf
,\dAo
,\dAp
. - In
\g
, now any options supported by\pset
can be specified in parentheses. They will affect only the current transaction.
libpq
Some changes relate to connecting to PostgreSQL:
- Inaccuracy is fixed in the descriptions of the
host
andhostadr
parameters and the related inconsistency in the output of the\conninfo
command of thepsql
utility. - If the key of the client’s certificate is stored in an encrypted form, you can enter the password only interactively. The new
sslpassword
parameter will allow non-interactive key decryption. - Two new parameters
sslminprotocolversion
andsslmaxprotocolversion
enable specifying limitations of the version of the SSL/TCL protocol for which connection is permitted.
reindexdb
The new --jobs parameter of the reindexdb
utility specifies the number of connections to the database for which indexes will be rebuilt simultaneously.
pg_rewind
The limitations of the utility are gradually removed and the capabilities increased.
First, pg_rewind
can now write information for recovery (as pg_basebackup
can do it) and also start the recovery and subsequent shutdown of the instance unless it was shut down via a checkpoint (earlier, this could be done only manually).
Second, pg_rewind
learned to work with WAL archive.
When the utility finds the point of divergence of WAL between two servers, it must make up a list of all pages that need to be copied to the target server in order to eliminate the differences. To do this, the utility needs all WAL files starting at the point found. Earlier, if the necessary WAL files were unavailable on the target cluster, the utility could not do the job.
With this patch by Alexey Kondratov (Postgres Professional), pg_rewind
will be able to read missing WAL segments from the archive of WAL files using the restore_command
parameter if the new key -c
or --restore-target-wal
is provided.
pg_waldump
pg_waldump
will extracts information about the prepared transaction from the record.
amchec
The amcheck
extension learned to better detect corruption in B-tree indexes.
By the way, now error messages in the server log about corrupted pages will differ for indexes and tables.
pageinspect
The heap_tuple_infomask_flags
function of the pageinspect
extension extracts information from the values of fields infomask
and infomask2
returned by the heap_page_items
function. This is useful for investigating situations related to data corruption.
postgres_fdw
Superuser can allow regular users to connect without the password at the level of user names maps:
ALTER USER MAPPING FOR non-superuser SERVER server
OPTIONS (ADD password_required 'false');
Among the rest, this is done to enable using sslkey
and sslcert
as connection parameters.
adminpack
The adminpack
extension contains a new function — pg_file_sync
. It can be used to do fsync
for files that the server writes to disk, for instance, by means of pg_file_write
or COPY TO
.
Monitoring
pg_stat_slru
The shared memory of a server contains not only a large buffer cache, but also a number of other, simpler, caches (for example: for transaction statuses). They use a simple algorithm of evicting least frequently used pages (simple least-recently-used, or SLRU). These caches «just worked» so far, but now the time is ripe for monitoring them; first of all the developers of PostgreSQL core need this to figure out whether anything needs to be changed in the caches. To this end, a new, pg_stat_slru, view was added.
pg_stat_activity
A new column, leader_id
, was added to the pg_stat_activity
view. For processes involved in execution of parallel queries, it is filled with the pid
of the leader process. And for the leader process, leader_id
equals pid
.
The following query shows what processes execute which queries in the parallel mode:
SELECT query, leader_pid,
array_agg(pid) filter(WHERE leader_pid != pid) AS members
FROM pg_stat_activity
WHERE leader_pid IS NOT NULL
GROUP BY query, leader_pid;
The list of wait events was changed. Two new events were added: BackupWaitWalArchive
and RecoveryPause
. And two more were given more adequate names: RecoveryWalStream -> RecoveryRetrieveRetryInterval
and RecoveryWalAll -> RecoveryWalStream
.
There are also two new events that occur on a replica: RecoveryConflictSnapshot
(a conflict with VACUUM that deleted needed tuples) and RecoveryConflictTablespace
(a conflict related to deletion of a table space).
pg_stat_statements
The pg_stat_statements
extension regarded queries with the FOR UPDATE
clause and without it as the same query so far. Now queries with FOR UPDATE
are treated differently.
The amount of information gathered also increased. From now on, not only information on resources for execution of commands is tracked, but also statistics on generated WAL records. New columns of the view are: wal_bytes
— the amount of WAL records generated, wal_records
— the number of WAL records generated, wal_num_fpw
— the number of full page writes.
This was achieved thanks to the developed infrastructure to track WAL usage. So, now EXPLAIN
with the new WAL
option will also show the number of WAL records generated:
CREATE TABLE t (id int);
EXPLAIN (ANALYZE, WAL, COSTS OFF, TIMING OFF, SUMMARY OFF)
INSERT INTO t VALUES(1);
QUERY PLAN
--------------------------------------
Insert on t (actual rows=0 loops=1)
WAL: records=1 bytes=59
-> Result (actual rows=1 loops=1)
The auto_explain
extension, VACUUM
with VERBOSE
and autovacuum
also use the infrastructure developed and will output WAL amounts.
Back to pg_stat_statements
. If the new pg_stat_statements.track_planning parameter is turned on, additional, planner, statistics will be written: the number of plan creations, the total planning time and the maximum time of a single planning, as well as the average and standard deviation.
One more patch, unrelated to pg_stat_statements
, deals with tracking the resources allocated to the planner. EXPLAIN
with the BUFFERS
option will report the number of buffers used during planning:
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM pg_class;
QUERY PLAN
------------------------------------------------
Seq Scan on pg_class (actual rows=386 loops=1)
Buffers: shared hit=9 read=4
Planning Time: 0.782 ms
Buffers: shared hit=103 read=11
Execution Time: 0.219 ms
Logging
- The
log_statement_sample_rate
parameter now defines the share of SQL commands being logged if the duration of their execution is greater than the value oflog_min_duration_sample
(new parameter).
Note that the value of thelog_min_duration_statement
parameter takes precedence, i. e., iflog_min_duration_statement
is less thanlog_min_duration_sample
, all the commands will be logged regardless of the share defined inlog_statement_sample_rate
.
In addition, the newlog_transaction_sample_rate
parameter allows us to define the share of transactions for which all the commands will be logged regardless of their execution time. - The values of the bind variables can now be logged along with commands that caused an error condition (which is defined by the
log_min_error_statement
parameter). The maximum size of values logged is defined by the newlog_parameter_max_length_on_error
parameter. By default, it equals zero, which turns the output off.
Turninglog_parameter_max_length_on_error
on entails additional overhead costs for execution of all SQL commands rather than only erroneous ones, but provides for more accurate error diagnostics.
For non-erroneous statements (which are logged as defined by thelog_statements
andlog_duration
parameters), a parameter was also added to limit the size:log_parameter_max_length
. By default, entire values are logged, as before. - The process type can be logged now in the server log (
pg_stat_activity.backend_type
). To this end, the%b
special character can be provided in thelog_line_prefix
parameter. And if the log has the csv format (log_destination=csvlog
), thebackend_type
column is already there.
Progress report
New views pg_stat_progress_analyze
and pg_stat_progress_basebackup
allow us to track execution of gathering statistics by the ANALYZE
command and of a backup by the pg_basebackup
utility, respectively.
Optimization
Evaluation of immutable functions in the FROM clause during planning
The patch by Alexander Kuzmenkov and Alexandr Parfenov (both from Postgres Professional) helps when the FROM
clause contains a call to a function that is actually a constant. In this case, instead of doing the join, the value of the constant is substituted into the query where appropriate.
This is how it happens, for instance, in a full-text search query:
EXPLAIN (COSTS OFF)
SELECT subject, ts_rank_cd(tsv, q) AS rank
FROM mail_messages, to_tsquery('english', 'tuple') q
WHERE tsv @@ q
ORDER BY rank DESC;
QUERY PLAN
------------------------------------------------------------------
Sort
Sort Key: (ts_rank_cd(mail_messages.tsv, '''tuple'''::tsquery)) DESC
-> Bitmap Heap Scan on mail_messages
Recheck Cond: (tsv @@ '''tuple'''::tsquery)
-> Bitmap Index Scan on mail_messages_tsv_idx
Index Cond: (tsv @@ '''tuple'''::tsquery)
There is no join here, and the value of 'tuple'::tsquery
is substituted into the query as early as during planning. In version 12, the picture was quite different:
EXPLAIN (COSTS OFF)
SELECT subject, ts_rank_cd(tsv, q) AS rank
FROM mail_messages, to_tsquery('english', 'tuple') q
WHERE tsv @@ q
ORDER BY rank DESC;
QUERY PLAN
-----------------------------------------------------
Sort
Sort Key: (ts_rank_cd(mail_messages.tsv, q.q)) DESC
-> Nested Loop
-> Function Scan on q
-> Bitmap Heap Scan on mail_messages
Recheck Cond: (tsv @@ q.q)
-> Bitmap Index Scan on mail_messages_tsv_idx
Index Cond: (tsv @@ q.q)
Incremental sort
In cases where a sort by many keys (k1, k2, k3…) is required, the planner can now use the awareness that the data is already sorted by a few of the first keys (say, k1 and k2). In this case, we can avoid re-sort of all the data from scratch, but divide them into sequential groups with the same values of k1 and k2 and only do additional sort by k3.
This way, the entire sort is decomposed into several sequential smaller-size sorts. This reduces the size of the memory required and also allows the first data to be output earlier than the entire sort is complete.
For example, in the demo database the tickets
table has an index on the ticket_id
column. The data obtained from the index will be definitely sorted by ticket_id
, therefore, the following query will do an incremental sort:
EXPLAIN (analyze, costs off, timing off)
SELECT * FROM tickets ORDER BY ticket_no, passenger_id;
QUERY PLAN
------------------------------------------------------------------------------
Incremental Sort (actual rows=2949857 loops=1)
Sort Key: ticket_no, passenger_id
Presorted Key: ticket_no
Full-sort Groups: 92184 Sort Method: quicksort Memory: avg=31kB peak=31kB
-> Index Scan using tickets_pkey on tickets (actual rows=2949857 loops=1)
Planning Time: 2.137 ms
Execution Time: 2230.019 ms
The incremental sort functionality can be turned off by the enable_incrementalsort
parameter. In such a case, the sort will take noticeably more time:
SET enable_incrementalsort = off;
EXPLAIN (analyze, costs off, timing off)
SELECT * FROM tickets ORDER BY ticket_no, passenger_id;
QUERY PLAN
-----------------------------------------------------------------------
Gather Merge (actual rows=2949857 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (actual rows=983286 loops=3)
Sort Key: ticket_no, passenger_id
Sort Method: external merge Disk: 107024kB
Worker 0: Sort Method: external merge Disk: 116744kB
Worker 1: Sort Method: external merge Disk: 107200kB
-> Parallel Seq Scan on tickets (actual rows=983286 loops=3)
Planning Time: 3.994 ms
Execution Time: 12291.844 ms
Alexander Korotkov (Postgres Professional) suggested the idea of incremental sort as early as in Y2013, and then seven years later, James Coleman brought the patch to the state committed by the community.
Speedup truncates of relation forks
When TRUNCATE
is performed, shared buffers are scanned to remove buffers of the table from the shared memory. Formerly, scan was done three times, for each fork of the table: MAIN (the main, data, fork), FSM (the free space map), VM (the visibility map). Now the logic is changed: the buffers are scanned only once rather than three times. For large values of shared_buffers
, this yields considerable gains.
Optimized partial TOAST decompression
When there is no need to read the entire TOAST, but only a slice at the beginning or not far from the beginning is sufficient, it makes no sense to decompress TOAST entirely. Compressed TOAST is read in iterations: we read a slice and if there is no data needed, we decompress more and read further. This is suggested by the Google Summer of Code student Binguo Bao, who provides an example:
CREATE TABLE slicingtest (
id serial PRIMARY KEY,
a text
);
INSERT INTO slicingtest (a) SELECT
repeat('1234567890-=abcdefghijklmnopqrstuvwxyz', 1000000) AS a FROM
generate_series(1,100);
\timing
SELECT sum(length(substr(a, 0, 20))) FROM slicingtest;
Time: 28.123 ms
With the patch, this is faster by an order of magnitude:
Time: 2.306 ms
Parallel VACUUM
In his related article, Egor Rogov explains this critical step to parallelization in detail. In short: «Masahiko Sawada’s patch that enables vacuuming in the parallel mode. The table itself is still vacuumed by one (leader) process, but to vacuum indexes, this process can now launch background worker processes, one for each index. In the manual mode, this enables us to speed up vacuuming of large tables with several indexes; autovacuum does not use this functionality yet.» — translated.
Trigger autovacuum on tuple insertion
For this patch (also known as Berserk autovacuum), we should thank Darafei Praliaskouski, who suggested a solution to the following problem: autovacuum never comes to append-only tables since they do not contain «dead» tuples. Because of this, the visibility map does not get updated, which makes index-only scans inefficient. But when vacuum does come to prevent transaction id wraparound, it needs to do very much work instantaneously. This situation is fixed now: row inserts will trigger autovacuum. Two new server parameters were added (autovacuum_vacuum_insert_threshold
and autovacuum_vacuum_insert_scale_factor
), similar to existing parameters for deletes and updates (autovacuum_vacuum_threshold
and autovacuum_vacuum_scale_factor
).
Memory-Bounded Hash Aggregation
Aggregation that uses hashing may require more RAM than the planner expected and than specified in work_mem
. Formerly this planner error resulted in ignoring work_mem
and allocating as much memory as needed for the operation or for arrival of OOM Killer. Now the algorithm can avoid falling outside the range of work_mem
and use temporary files on disk when necessary. To control the behavior of the planner, new parameters were added: enable_groupingsets_hash_disk
and enable_hashagg_disk
.
Optimize update of tables with generated columns
In version 12, generated columns were recomputed for any update of a row, even if this update did not affect these columns at all. Now they will be recomputed only if it is really necessary (if their base columns changed).
This optimization can, for instance, considerably speed up updates of tables with a generated column of the tsvector
type since the to_tsvector()
function is pretty expensive.
Allow trigger to get updated columns
This small patch adds the bitmap of updated columns to the TriggerData
structure. General-purpose trigger functions, such as tsvector_update_trigger()
or lo_manage(), can use this information
to avoid doing extra work.
Using multiple extended statistics for estimates
In PostgreSQL 12, the planner could not use several extended statistics together for one table. For example: imagine the situation when there are two extended statistics on different sets of columns and a query contains columns both from one set and the other. The planner can now access all the available information.
Infrastructure for parallelization and COPY (See also this patch.)
Parallelism in PostgreSQL still works only for read queries. For write queries, there are issues, and one of them is locking the processes that perform the same task in parallel, all in a common parallel group). It is taken to be that locks of such processes do not conflict — for example: several processes can hold an exclusive lock on the same table. This requires the developers of the core to be especially scrupulous, otherwise, deadlocks would continuously arise.
But there are two exceptions:
- Relation extension lock, which is acquired when new pages are added at the end of a data file.
- Page lock, which is used when moving elements of a GIN index from the pending list to the main tree.
(This article provides more details.)
Locks like these must conflict even between processes in one parallel group — this is what this patch implements. But these locks can never cause deadlocks, therefore, they are not checked.
In general, nothing changes for users, but this patch is important because first, it paves the way to parallel INSERT and COPY and second, it eliminates one of the bottlenecks of PostgreSQL under heavy load (of which you can hear in this presentation at HL++).
Security
EDH SKIP primes are replaced
This is about discontinuing use of EDH primes (Diffie-Hellman ephemeral keys) via the already outdated SKIP protocol.
Change authentication defaults for initdb
Default access settings for local and network connections at the start of initdb
are changed. Now in pg_hba.conf
, instead of the trust
authentication method, peer
will be used for local connections (or md5
, if peer
is not supported) and md5
for network connections. Initially, more liberal measures were discussed: a warning in the documentation. And then stricter ones were considered: scram-sha-256
. As a result, it was decided not to go beyond peer
and md5
so far.
Using explicit_bzero
This is a critical patch. OS functions bzero()
and explicit_bzero()
fill specified memory areas with bytes containing '\0'
(for an example on Linux, refer here). These patches are only a beginning: there are quite a few memory areas where passwords and other sensitive information can remain. It was decided to begin with libpq
, where an entire file with passwords can remain in memory after reading .pgpass
, and with cleaning after closing a connection. be-secure-common.c
now includes erasing a secret phrase that is input in SSL and that shows up in the error string (path).
Add password_protocol connection parameter to libpq
This patch enables libpq
to check which password transfer protocol is used when connecting. After receiving this parameter, libpq
will reject authentication if the protocol is weaker than the specified one. The default value of this parameter is plaintext
, that is, all protocols will do.
Add mandatory access control for TRUNCATE
This patch allows extensions to build in mandatory access control (MAC) for the TRUNCATE operation. Now the sepgsql extension will also check the privileges for this operation. SELinux Reference Policy and Redhat-based Linux distributions do not support checks of permissions for db_table {truncate}
by SELinux. In cases like this, sepgsql
with deny_unknown
equal to 1 will be used and TRUNCATE will not be performed.
Exposure related to GUC value of ssl_passphrase_command
This is a simple, but useful patch. Now the value of the ssl_passphrase_command
parameter will be visible only to superuser. This parameter specifies an external command to be called when a password is required to decrypt an SSL file, such as a private key.
Localization
Simple collation versions for glibc
Databases store version numbers for ICU collations. For every use of a rule (collation, comparison of symbols) the stored version number is checked against the current OS version of the ISU library and in case of discrepancies, a warning is issued. This allows us to detect indexes that were built using a changed collation and may therefore appear incorrect and need rebuilding. After rebuilding the indexes using the ALTER COLLATION ... REFRESH VERSION
command, the collation version in the database gets updated and warnings are no longer issued.
But this was the case only for ICU. Now the version number is also stored for libc
collations:
SELECT collname, collversion FROM pg_collation WHERE collname = 'ru_RU';
collname | collversion
----------+-------------
ru_RU | 2.27
This enables us to issue warnings when the library changes in the OS. It is relevant in view of transitioning to glibc 2.28
, where a lot of collations changed and appropriate indexes will need rebuilding.
But until we transition to 2.28, all is quiet:
ALTER COLLATION "ru_RU" REFRESH VERSION;
NOTICE: version has not changed
ALTER COLLATION
Full-text search
Full-text search for Greek language
No comment.
Add absolute value to dict_int
A capability to remove the sign of a number was added to the dict_int
template dictionary (same as the extension).
CREATE EXTENSION dict_int;
SELECT ts_lexize('intdict', '-123');
ts_lexize
-----------
{-123}
(1 row)
ALTER TEXT SEARCH DICTIONARY intdict (ABSVAL = true);
SELECT ts_lexize('intdict', '-123');
ts_lexize
-----------
{123}
(1 row)
So this time the absolute value is recognized.
Partitioning
BEFORE ROW triggers for partitioned tables
In version 12, row triggers BEFORE cannot be created on a partitioned table, while there is no problem to create them on separate partitions. Now the BEFORE FOR EACH ROW trigger created on a partitioned table will be automatically inherited and work for all partitions, but the partition key of the trigger on UPDATE will be able to vary only in the range of the current partition.
Support adding partitioned tables to logical replication publications
Earlier, addition of a partitioned table to a publication resulted in an error:
CREATE PUBLICATION pub FOR TABLE p;
ERROR: "p" is a partitioned table
DETAIL: Adding partitioned tables to publications is not supported.
HINT: You can add the table partitions individually.
Now this does work.
Advanced partition matching for partition-wise join
Starting with version 11, the planner can join partitioned tables in a partition-wise fashion, but only when partition bounds are exactly the same. Now the algorithm is improved: it will work when a partition of one table is fully contained in a partition of the other even if their sizes are different (for example: if one table is partitioned by days, and the other — by months). The new algorithm is functional for range and list partitioning.
Fix handling of full outer joins by partition-wise join
Partition-wise joins also work now for full outer joins with the USING clause.
tableam
There are no drastic advancements with respect to PostgreSQL 12 so far in this attractive and challenging, but difficult area. There are no ready-to-use pluggable starages (zheap
and so on). The old friend heap
) is still the only choice. But development of API is in progress.
Abstracting relation sizing code
Robert Haas rewrote the code by changing its architecture in favor of abstract forks in order to avoid code duplication in future. This part relates to estimate_rel_size
— the size of table forks.
Allow table AM’s to cache stuff in relcache
This patch makes the possibilities of time management for table access methods closer to those of index methods.
tableam vs. TOAST
TOAST is mainly designed for the heap
storage concept, therefore, when table access methods are created, one can choose between two ways to go: whether help developers of the new methods to integrate there inserts, updates and deletes of TOAST rows, or delegate work with TOAST to the code that uses heap storage, which is traditional for PostgreSQL. A set of 5 patches to implement INSERT, UPDATE and DELETE operations uses slots of tuples and can help those going either way.
fsync
fsync error handling in pg_receivewal, pg_recvlogical
Struggle with the fsync()
issue continues. PostgreSQL believes that a successful call to fsync()
means that all data of the file is on disk, but sometimes this does not happen (it depends on the OS) and can cause data loss. It was decided that in PostgreSQL 13, we would need to handle the pg_receivewal
and pg_recvlogical
utilities. At present, the default behavior is like this: the utilities will log fsync
errors, restore the connection and continue as if nothing ever happened. As a result, WAL will contain information about successfully copied files, which actually were not correctly copied to disk. So, it is better to terminate the utility. The future of pg_dump
, pg_basebackup
, pg_rewind
and pg_checksums
was also discussed, but it was decided not to go beyond the above two.
Safeguards against incorrect fd flags for fsync ()
This patch checks whether the flags are set correctly when receiving the file descriptor for fsync()
— directories are open only for reading, while files are open for writing or for both.
Backup and replication
Pause recovery if PITR target not reached
If WALs are over and recovery_target_time
is not reached, the server stops the recovery and proceeds to a regular work. Now the behavior will be different. The recovery process will pause and put information into the log, and the administrator will be able to add missing WAL segments to continue the recovery.
Add ignore_invalid_pages parameter
If a recovery process comes across a reference to an incorrect page in a WAL record on a replica, panic
happens. Turning on the ignore_invalid_pages
parameter can help to handle it. The recovery will continue with possible loss of consistency or data, as well as with other severe consequences. The parameter is intended for server developers and must be used in cases where it is anyway needed to complete the recovery and run the replica by all means.
Online change of primary_conninfo
This Sergey Kornilov’s patch allows us to change primary_conninfo
, primary_slot_name
and wal_receiver_create_temp_slot
parameters without restarting the server. It’s for this sake that in the 12-th release, the recovery.conf
file was dropped.
Backup manifests
The pg_basebackup
tool now creates a «manifest» — a JSON file that contains information on the backup copy created (filenames and file sizes, necessary WAL files, as well as checksums of anything and everything).
The new pg_validatebackup
utility validates backup copies against the manifest and also uses the pg_waldump
utility to check availability of WAL files needed for recovery and make sure they are not corrupted (only WAL files inside the backup copy itself rather than in the archive are meant).
This will allow us to detect situations when backup files got corrupted or disappeared or when recovery became impossible due to lack of necessary WAL files.
Restricting maximum keep segments by repslots
A replication slot is a convenient but dangerous technique: if a client does not read data from the slot in due time, unread WAL records may occupy all the server space. Now the max_slot_wal_keep_size
parameter will enab