LoopJump's Blog

Extensible/Rule Based Query Rewrite Optimization in Starburst

2022-04-10

《Extensible/Rule Based Query Rewrite Optimization in Starburst》SIGMOD 1992

这篇文章是介绍Starburst的基于规则的优化器实现框架。

背景

很多业务的SQL语句或类SQL语句,尤其是中间件软件自动生成的SQL语句,主要面向业务模型,如果不做query rewrite,可能执行非常慢。

例如:

1
2
3
4
5
6
7
8
DISTINCT P
FROM Patient p IN Patient_Set
WHERE p.sex == 'male' AND
EXISTS (SELECT r
FROM Medical_recrod r IN p.get_medical_record()
WHERE r.get_date() < '10/10/89' AND
(r.get_diagnosis() == 'Malaria' OR
r.get_diagnosis() == 'Smallpox'));

这条类SQL语句(某些OO中间件生成的)查询了所有在10/10/89之前被诊断为Malaria和Smallpox的男性病人。

如果按照SQL语句直观的查询计划,需要先遍历每一个男性病人,然后对每一个病人查阅其诊断记录。但实际上,男性病人中得Malaria和Smallpox的是少数,如果可以先查阅得此病的人,再过滤男性,执行效率更高。这就是Query Rewrite的价值。

Query Graph Model

Query在内部用Query Graph Model (QGM) 来表示,QGM表达能力很强,支持任意的表操作,它的输入和输出都是表。

例如如下的查询

1
2
3
4
SELECT DISTINCT q1.partno, q1.descr, q2.suppno
FROM inventory q1, quotations q2
WHERE q1.partno = q2.partno AND q1.descr = 'engine'
AND q2.price <= ALL (SELECT q3.price FROM quotations q3 WHERE q2.partno = q3.partno);

对应的QGM图为

http://loopjump.com/wp-content/uploads/2022/04/1649496507207-1024x996.jpg

1649496507207.jpg

!http://loopjump.com/wp-content/uploads/2022/04/1649496507207-1024x996.jpg

Box1和Box2是对应q1和q2,Box3对应主查询,Box4对应嵌套查询。

以Box3为例,head表示该box的输出表,并且有一个属性head.distinct。body表示生成输出表的过程,黑色圆点表示quantifier (quantified tuple variable),它的含义类似于参与计算的操作数的输入源,例如q1(F)表示类型为F(来自FROM)、关联到Box1输出的quantifier。q4(A)表示类型为ALL、关联到Box4的quantifier。q1和q2之间的方形代表join predicate。q1左侧圈代表在q1上的local predicate。body也有一个distinct属性,取值ENFORCE / PRESERVE / PERMIT,含义分别是 必须去重 / 保留含重复的数量 / 允许重复,该值和head.distinct直接关联。

Rewrite Rules

设计rewrite engine和rule的核心 Rewrite Philosophy : Whenever possible, a query should be converted to a single SELECT operator。其原因是单个SELECT操作简单,容易得出最优路径。

Starburst支持的rules

1649505294859.jpg

http://loopjump.com/wp-content/uploads/2022/04/1649505294859.jpg

!http://loopjump.com/wp-content/uploads/2022/04/1649505294859.jpg

其中箭头表示一个rule应用完成后可能使得另一个rule满足条件。

因为Rewrite Philosophy,所以有很多最终的箭头指向SELMERGE。

Rule 1. SELECT Merge (SELMERGE)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
if ( in a SELECT box(upper box)
a quantifier has a type F
AND ranges over a SELECT box(lower box)
AND no other quantifier ranges over lower box
AND
(upper.head.distinct = TRUE
OR
upper.body.distinct = PERMIT
OR
lower.body.distinct != ENFORCE
)
)
{
merge the lower box into upper box;
if(lower.body.distinct = ENFORCE
AND upper.body.distinct != PERMIT)
{
upper.body.distinct = ENFORCE;
}
}

例子:

1
2
3
4
5
6
7
8
CREATE VIEW itpv AS
(SELECT DISTINCT itp.itemn, pur.vendn
FROM itp, pur
WHERE itp.ponum = pur.ponum AND pur.odate > '85');

SELECT itm.itmn, itpv.vendn FROM itm, itpv
WHERE itm.itemn = itpv.itemn
AND itm.itemn >= '01' AND itm.itemn < '20';

Rewrite后的结果:

1
2
3
4
5
SELECT DISTINCT itm.itmn, pur.vendn
FROM itm, itp, pur
WHERE itp.ponum = pur.ponum AND itm.itemn = itp.itemn
AND pur.odate > '85'
AND itm.itemn >= '01' AND itm.itemn < '20';

这样,查询优化器就可以从新的查询将view的内部查询考虑进去进行优化。

Rule 2. Distinct Pullup (DISTPU)

1
2
3
4
5
6
7
8
if (in a SELECT box
either quantifier-nodup-condition
or one-tuple-condition
holds for all F quantifier)
{
head.distinct = TRUE;
body.distinct = PERSERVE;
}

其中 one-tuple-condition 指的是只会吐出一条满足条件的结果,quantifier-nodup-condition 指的是可能吐出多条记录,但记录包含候选码(候选码唯一,例如输出列含主键),也就是说,这两个条件自动保证了输出无重复。这种情况下,box的head/body的distinct可以提前明确到代价低的情况(比如不需要再进行额外的去重)。

Rule 3. Distinct Pushdown From/To (DISTPDFR/DISTPDTO)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
/* DISTPDFR */
if (in a box with type SELECT, UNION, INTERSECT or EXCEPT,
body.distinct = PERMIT or ENFORCE)
{
for (each F quantifier in the body)
quantifier.distinct = PERMIT;
}

/* DISTPDTO */
if (in a box with type SELECT, UNION, INTERSECT or EXCEPT,
all quantifiers ranging over the box have quantifier.distinct = PERMIT)
{
body.distinct = PERMIT;
}

这条rule也比较简单,如果查询允许重复,那么例如子查询等quantifier也不必去重。

Rule 4. E or A Distinct Pushdown From (EorAPDFR)

1
2
3
4
5
if (in a SELECT box
a quantifier has type = E or A)
{
quantifier.distinct = PERMIT;
}

意指quantifier如果是E或者A量词,则quantifier本身可以不去重。

Rule 5. Common Subexpression Replication (BOXCOPY)

公共子表达式复制。

1
2
3
4
5
6
7
if (in a SELECT box
more than one quantifier
range over the box)
{
make a copy of the quantifiers ranging over the original box
and change it to range over the new copy;
}

Rule 6. Add keys (ADDKEYS)

1
2
3
4
5
6
7
8
9
10
if (in a SELECT box
head.distinct = FALSE)
{
for (each F quantifier)
if (the key of the F quantifier does not appear in the output)
{
Add the key to the head;
head.distinct = TURE;
}
}

如果原SELECT输出非distinct,则可以在quantifier多输出唯一性的列用到外层SELECT。

例子:

1
2
3
4
5
6
7
8
9
10
CREATE VIEW itemprice AS
(
SELECT DISTINCT itp.itemno, itp.NegotiatedPrice
FROM itp
WHERE NegotiatedPrice > 1000;
)

SELECT itemprice.NegotiatedPrice, itm.type
FROM itemprice, itm
WHERE itemprice.itemno = itm.itemno;

Rewrite后的结果:

1
2
3
SELECT DISTINCT itp.NegotiatedPrice, itm.type, itm.itemno
FROM itp, itm
WHERE itp.NegotiatedPrice > 1000 AND itp.itemno = itm.itemno;

Rule 7. E to F Quantifier Conversion (EtoF)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
if (in a SELECT box
there is a quantifier of type E
forming a Boolean factor
AND
(
head.distinct = TRUE
OR
body.distinct = PERMIT
OR
one-tuple-condition
)
)
{
set quantifier type to F;
if (one-tuple-condiftion is FALSE
AND head.distinct = TRUE)
{
body.distinct = ENFORCE;
}
}

例子:

1
2
3
4
SELECT * FROM itp
WHERE itp.itemn IN
( SELECT itl.itemn FRO itl
WHERE itl.wkcen = 'WK468' AND itl.locatio = 'LOCA000IN');

Rewrite后的结果:

1
2
3
SELECT DISTINCT itp.* FROM itp, itl
WHERE itp.itemn = itl.itemn
AND itl.wkcen = 'WK468' AND itl.locatio = 'LOCA000IN';

Rule 8. INTERSECT to Exists (INT2EXIST)

将INTERSECT查询改写为EXISTS子查询,进而通过EtoF改写为SELMERGE。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
if ( in an INTERSECT box
body.distinct != PRESERVE)
{
set the box to be of type SELECT;
choose an arbitrary quantifier Q1;
/* Q1 will keep tyoe F */
for ( each quantifier Q != Q1 in the box)
{
Q.type = E;
add the predicate
EXISTS ( SELECT * FROM Q
WHERE Q1.c1 = Q.c1 AND
Q1.c2 = Q.c2 AND
...
Q1.cn = Q.cn
);
}
}

Rule Engine

Starburst的Rule Engine的特点:

1. 支持任意复杂度的规则

规则定义为两个(C语言)函数,一个是condition函数,返回true/false表示是否满足该规则应用条件,一个是action函数,在满足条件时被调用,可以执行任意操作。

2. 上下文

规则引擎执行操作时,带有上下文参数,该参数包含指向用户查询信息的指针。

3. 规则分类和扩展的冲突解决

规则分类的好处:有利于对规则的理解;有利于规则condition/action函数的模块化和可理解性;不同的规则类的冲突解决方式不一样。

规则冲突及解决:规则冲突是指当前存在规则可以应用。解决的两种模式:sequential(对所有规则按某个顺序一轮轮地尝试) 和 priority(总是应用满足条件的优先级最高的规则)。

4. 保证终止

5. 规则控制

支持在线enable/disable某些rule。

Tags: SQL

扫描二维码,分享此文章