Extensible/Rule Based Query Rewrite Optimization in Starburst
这篇文章是介绍Starburst的基于规则的优化器实现框架。
背景
很多业务的SQL语句或类SQL语句,尤其是中间件软件自动生成的SQL语句,主要面向业务模型,如果不做query rewrite,可能执行非常慢。
例如:
1 2 3 4 5 6 7 8 |
DISTINCT P FROM Patient p IN Patient_Set WHERE p.sex == 'male' AND EXISTS (SELECT r FROM Medical_recrod r IN p.get_medical_record() WHERE r.get_date() < '10/10/89' AND (r.get_diagnosis() == 'Malaria' OR r.get_diagnosis() == 'Smallpox')); |
这条类SQL语句(某些OO中间件生成的)查询了所有在10/10/89之前被诊断为Malaria和Smallpox的男性病人。
如果按照SQL语句直观的查询计划,需要先遍历每一个男性病人,然后对每一个病人查阅其诊断记录。但实际上,男性病人中得Malaria和Smallpox的是少数,如果可以先查阅得此病的人,再过滤男性,执行效率更高。这就是Query Rewrite的价值。
Query Graph Model
Query在内部用Query Graph Model (QGM) 来表示,QGM表达能力很强,支持任意的表操作,它的输入和输出都是表。
例如如下的查询
1 2 3 4 |
SELECT DISTINCT q1.partno, q1.descr, q2.suppno FROM inventory q1, quotations q2 WHERE q1.partno = q2.partno AND q1.descr = 'engine' AND q2.price <= ALL (SELECT q3.price FROM quotations q3 WHERE q2.partno = q3.partno); |
对应的QGM图为
Box1和Box2是对应q1和q2,Box3对应主查询,Box4对应嵌套查询。
以Box3为例,head表示该box的输出表,并且有一个属性head.distinct。body表示生成输出表的过程,黑色圆点表示quantifier (quantified tuple variable),它的含义类似于参与计算的操作数的输入源,例如q1(F)表示类型为F(来自FROM)、关联到Box1输出的quantifier。q4(A)表示类型为ALL、关联到Box4的quantifier。q1和q2之间的方形代表join predicate。q1左侧圈代表在q1上的local predicate。body也有一个distinct属性,取值ENFORCE / PRESERVE / PERMIT,含义分别是 必须去重 / 保留含重复的数量 / 允许重复,该值和head.distinct直接关联。
Rewrite Rules
设计rewrite engine和rule的核心 Rewrite Philosophy : Whenever possible, a query should be converted to a single SELECT operator。其原因是单个SELECT操作简单,容易得出最优路径。
Starburst支持的rules
其中箭头表示一个rule应用完成后可能使得另一个rule满足条件。
因为Rewrite Philosophy,所以有很多最终的箭头指向SELMERGE。
Rule 1. SELECT Merge (SELMERGE)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
if ( in a SELECT box(upper box) a quantifier has a type F AND ranges over a SELECT box(lower box) AND no other quantifier ranges over lower box AND (upper.head.distinct = TRUE OR upper.body.distinct = PERMIT OR lower.body.distinct != ENFORCE ) ) { merge the lower box into upper box; if(lower.body.distinct = ENFORCE AND upper.body.distinct != PERMIT) { upper.body.distinct = ENFORCE; } } |
例子:
1 2 3 4 5 6 7 8 |
CREATE VIEW itpv AS (SELECT DISTINCT itp.itemn, pur.vendn FROM itp, pur WHERE itp.ponum = pur.ponum AND pur.odate > '85'); SELECT itm.itmn, itpv.vendn FROM itm, itpv WHERE itm.itemn = itpv.itemn AND itm.itemn >= '01' AND itm.itemn < '20'; |
Rewrite后的结果:
1 2 3 4 5 |
SELECT DISTINCT itm.itmn, pur.vendn FROM itm, itp, pur WHERE itp.ponum = pur.ponum AND itm.itemn = itp.itemn AND pur.odate > '85' AND itm.itemn >= '01' AND itm.itemn < '20'; |
这样,查询优化器就可以从新的查询将view的内部查询考虑进去进行优化。
Rule 2. Distinct Pullup (DISTPU)
1 2 3 4 5 6 7 8 |
if (in a SELECT box either quantifier-nodup-condition or one-tuple-condition holds for all F quantifier) { head.distinct = TRUE; body.distinct = PERSERVE; } |
其中 one-tuple-condition
指的是只会吐出一条满足条件的结果,quantifier-nodup-condition
指的是可能吐出多条记录,但记录包含候选码(候选码唯一,例如输出列含主键),也就是说,这两个条件自动保证了输出无重复。这种情况下,box的head/body的distinct可以提前明确到代价低的情况(比如不需要再进行额外的去重)。
Rule 3. Distinct Pushdown From/To (DISTPDFR/DISTPDTO)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
/* DISTPDFR */ if (in a box with type SELECT, UNION, INTERSECT or EXCEPT, body.distinct = PERMIT or ENFORCE) { for (each F quantifier in the body) quantifier.distinct = PERMIT; } /* DISTPDTO */ if (in a box with type SELECT, UNION, INTERSECT or EXCEPT, all quantifiers ranging over the box have quantifier.distinct = PERMIT) { body.distinct = PERMIT; } |
这条rule也比较简单,如果查询允许重复,那么例如子查询等quantifier也不必去重。
Rule 4. E or A Distinct Pushdown From (EorAPDFR)
1 2 3 4 5 |
if (in a SELECT box a quantifier has type = E or A) { quantifier.distinct = PERMIT; } |
意指quantifier如果是E或者A量词,则quantifier本身可以不去重。
Rule 5. Common Subexpression Replication (BOXCOPY)
公共子表达式复制。
1 2 3 4 5 6 7 |
if (in a SELECT box more than one quantifier range over the box) { make a copy of the quantifiers ranging over the original box and change it to range over the new copy; } |
Rule 6. Add keys (ADDKEYS)
1 2 3 4 5 6 7 8 9 10 |
if (in a SELECT box head.distinct = FALSE) { for (each F quantifier) if (the key of the F quantifier does not appear in the output) { Add the key to the head; head.distinct = TURE; } } |
如果原SELECT输出非distinct,则可以在quantifier多输出唯一性的列用到外层SELECT。
例子:
1 2 3 4 5 6 7 8 9 10 |
CREATE VIEW itemprice AS ( SELECT DISTINCT itp.itemno, itp.NegotiatedPrice FROM itp WHERE NegotiatedPrice > 1000; ) SELECT itemprice.NegotiatedPrice, itm.type FROM itemprice, itm WHERE itemprice.itemno = itm.itemno; |
Rewrite后的结果:
1 2 3 |
SELECT DISTINCT itp.NegotiatedPrice, itm.type, itm.itemno FROM itp, itm WHERE itp.NegotiatedPrice > 1000 AND itp.itemno = itm.itemno; |
Rule 7. E to F Quantifier Conversion (EtoF)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
if (in a SELECT box there is a quantifier of type E forming a Boolean factor AND ( head.distinct = TRUE OR body.distinct = PERMIT OR one-tuple-condition ) ) { set quantifier type to F; if (one-tuple-condiftion is FALSE AND head.distinct = TRUE) { body.distinct = ENFORCE; } } |
例子:
1 2 3 4 |
SELECT * FROM itp WHERE itp.itemn IN ( SELECT itl.itemn FRO itl WHERE itl.wkcen = 'WK468' AND itl.locatio = 'LOCA000IN'); |
Rewrite后的结果:
1 2 3 |
SELECT DISTINCT itp.* FROM itp, itl WHERE itp.itemn = itl.itemn AND itl.wkcen = 'WK468' AND itl.locatio = 'LOCA000IN'; |
Rule 8. INTERSECT to Exists (INT2EXIST)
将INTERSECT查询改写为EXISTS子查询,进而通过EtoF改写为SELMERGE。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
if ( in an INTERSECT box body.distinct != PRESERVE) { set the box to be of type SELECT; choose an arbitrary quantifier Q1; /* Q1 will keep tyoe F */ for ( each quantifier Q != Q1 in the box) { Q.type = E; add the predicate EXISTS ( SELECT * FROM Q WHERE Q1.c1 = Q.c1 AND Q1.c2 = Q.c2 AND ... Q1.cn = Q.cn ); } } |
Rule Engine
Starburst的Rule Engine的特点:
1. 支持任意复杂度的规则
规则定义为两个(C语言)函数,一个是condition函数,返回true/false表示是否满足该规则应用条件,一个是action函数,在满足条件时被调用,可以执行任意操作。
2. 上下文
规则引擎执行操作时,带有上下文参数,该参数包含指向用户查询信息的指针。
3. 规则分类和扩展的冲突解决
规则分类的好处:有利于对规则的理解;有利于规则condition/action函数的模块化和可理解性;不同的规则类的冲突解决方式不一样。
规则冲突及解决:规则冲突是指当前存在规则可以应用。解决的两种模式:sequential(对所有规则按某个顺序一轮轮地尝试) 和 priority(总是应用满足条件的优先级最高的规则)。
4. 保证终止
5. 规则控制
支持在线enable/disable某些rule。