Extensible/Rule Based Query Rewrite Optimization in Starburst

《Extensible/Rule Based Query Rewrite Optimization in Starburst》SIGMOD 1992

这篇文章是介绍Starburst的基于规则的优化器实现框架。

背景

很多业务的SQL语句或类SQL语句，尤其是中间件软件自动生成的SQL语句，主要面向业务模型，如果不做query rewrite，可能执行非常慢。

例如：

DISTINCT P
FROM Patient p IN Patient_Set
WHERE p.sex == 'male' AND
      EXISTS (SELECT r
              FROM Medical_recrod r IN p.get_medical_record()
              WHERE r.get_date() < '10/10/89' AND
                    (r.get_diagnosis() == 'Malaria' OR
                     r.get_diagnosis() == 'Smallpox'));

这条类SQL语句（某些OO中间件生成的）查询了所有在10/10/89之前被诊断为Malaria和Smallpox的男性病人。

如果按照SQL语句直观的查询计划，需要先遍历每一个男性病人，然后对每一个病人查阅其诊断记录。但实际上，男性病人中得Malaria和Smallpox的是少数，如果可以先查阅得此病的人，再过滤男性，执行效率更高。这就是Query Rewrite的价值。

Query Graph Model

Query在内部用Query Graph Model (QGM) 来表示，QGM表达能力很强，支持任意的表操作，它的输入和输出都是表。

例如如下的查询

SELECT DISTINCT q1.partno, q1.descr, q2.suppno
FROM inventory q1, quotations q2
WHERE q1.partno = q2.partno AND q1.descr = 'engine'
      AND q2.price <= ALL (SELECT q3.price FROM quotations q3 WHERE q2.partno = q3.partno);

对应的QGM图为

Box1和Box2是对应q1和q2，Box3对应主查询，Box4对应嵌套查询。

以Box3为例，head表示该box的输出表，并且有一个属性head.distinct。body表示生成输出表的过程，黑色圆点表示quantifier (quantified tuple variable)，它的含义类似于参与计算的操作数的输入源，例如q1(F)表示类型为F（来自FROM）、关联到Box1输出的quantifier。q4(A)表示类型为ALL、关联到Box4的quantifier。q1和q2之间的方形代表join predicate。q1左侧圈代表在q1上的local predicate。body也有一个distinct属性，取值ENFORCE / PRESERVE / PERMIT，含义分别是必须去重 / 保留含重复的数量 / 允许重复，该值和head.distinct直接关联。

Rewrite Rules

设计rewrite engine和rule的核心 Rewrite Philosophy : Whenever possible, a query should be converted to a single SELECT operator。其原因是单个SELECT操作简单，容易得出最优路径。

Starburst支持的rules

其中箭头表示一个rule应用完成后可能使得另一个rule满足条件。

因为Rewrite Philosophy，所以有很多最终的箭头指向SELMERGE。

Rule1. SELECT Merge (SELMERGE)

if ( in a SELECT box(upper box)
        a quantifier has a type F
     AND ranges over a SELECT box(lower box)
     AND no other quantifier ranges over lower box
     AND
    	  (upper.head.distinct = TRUE
           OR
           upper.body.distinct = PERMIT
           OR
           lower.body.distinct != ENFORCE
          )
   )
{
     merge the lower box into upper box;
     if（lower.body.distinct = ENFORCE
         AND upper.body.distinct != PERMIT)
     {
         upper.body.distinct = ENFORCE;
     }
}

例子：

CREATE VIEW itpv AS
 (SELECT DISTINCT itp.itemn, pur.vendn
  FROM itp, pur
  WHERE itp.ponum = pur.ponum AND pur.odate > '85');

SELECT itm.itmn, itpv.vendn FROM itm, itpv
WHERE itm.itemn = itpv.itemn
      AND itm.itemn >= '01' AND itm.itemn < '20';

Rewrite后的结果：

SELECT DISTINCT itm.itmn, pur.vendn
FROM itm, itp, pur
WHERE itp.ponum = pur.ponum AND itm.itemn = itp.itemn
      AND pur.odate > '85'
      AND itm.itemn >= '01' AND itm.itemn < '20';

这样，查询优化器就可以从新的查询将view的内部查询考虑进去进行优化。

Rule2. Distinct Pullup (DISTPU)

if (in a SELECT box
    either quantifier-nodup-condition
    or one-tuple-condition
    holds for all F quantifier)
{
    head.distinct = TRUE;
    body.distinct = PERSERVE;
}

其中 one-tuple-condition 指的是只会吐出一条满足条件的结果，quantifier-nodup-condition 指的是可能吐出多条记录，但记录包含候选码（候选码唯一，例如输出列含主键），也就是说，这两个条件自动保证了输出无重复。这种情况下，box的head/body的distinct可以提前明确到代价低的情况（比如不需要再进行额外的去重）。

Rule3. Distinct Pushdown From/To (DISTPDFR/DISTPDTO)

/* DISTPDFR */
if (in a box with type SELECT, UNION, INTERSECT or EXCEPT,
       body.distinct = PERMIT or ENFORCE)
{
    for (each F quantifier in the body)
        quantifier.distinct = PERMIT;
}

/* DISTPDTO */
if (in a box with type SELECT, UNION, INTERSECT or EXCEPT,
       all quantifiers ranging over the box have quantifier.distinct = PERMIT)
{
    body.distinct = PERMIT;
}

这条rule也比较简单，如果查询允许重复，那么例如子查询等quantifier也不必去重。

Rule4. E or A Distinct Pushdown From (EorAPDFR)

if (in a SELECT box
       a quantifier has type = E or A)
{
    quantifier.distinct = PERMIT;
}

意指quantifier如果是E或者A量词，则quantifier本身可以不去重。

Rule5. Common Subexpression Replication (BOXCOPY)

公共子表达式复制。

if (in a SELECT box
       more than one quantifier
       range over the box)
{
    make a copy of the quantifiers ranging over the original box
    and change it to range over the new copy;
}

Rule6. Add keys (ADDKEYS)

if (in a SELECT box
       head.distinct = FALSE)
{
    for (each F quantifier)
        if (the key of the F quantifier does not appear in the output)
        {
            Add the key to the head;
            head.distinct = TURE;
        }
}

如果原SELECT输出非distinct，则可以在quantifier多输出唯一性的列用到外层SELECT。

例子：

CREATE VIEW itemprice AS
(
    SELECT DISTINCT itp.itemno, itp.NegotiatedPrice
    FROM itp
    WHERE NegotiatedPrice > 1000;
)

SELECT itemprice.NegotiatedPrice, itm.type
FROM itemprice, itm
WHERE itemprice.itemno = itm.itemno;

Rewrite后的结果：

1
2
3

SELECT DISTINCT itp.NegotiatedPrice, itm.type, itm.itemno
FROM itp, itm
WHERE itp.NegotiatedPrice > 1000 AND itp.itemno = itm.itemno;

Rule7. E to F Quantifier Conversion (EtoF)

if (in a SELECT box
        there is a quantifier of type E
        forming a Boolean factor
    AND
    (
        head.distinct = TRUE
      OR
        body.distinct = PERMIT
      OR
        one-tuple-condition
    )
   )
{
    set quantifier type to F;
    if (one-tuple-condiftion is FALSE
        AND head.distinct = TRUE)
    {
        body.distinct = ENFORCE;
    }
}

例子：

SELECT * FROM itp
WHERE itp.itemn IN
    ( SELECT itl.itemn FRO itl
      WHERE itl.wkcen = 'WK468' AND itl.locatio = 'LOCA000IN');

Rewrite后的结果：

1
2
3

SELECT DISTINCT itp.* FROM itp, itl
WHERE itp.itemn = itl.itemn
      AND itl.wkcen = 'WK468' AND itl.locatio = 'LOCA000IN';

Rule8. INTERSECT to Exists (INT2EXIST)

将INTERSECT查询改写为EXISTS子查询，进而通过EtoF改写为SELMERGE。

if ( in an INTERSECT box
        body.distinct != PRESERVE)
{
    set the box to be of type SELECT;
    choose an arbitrary quantifier Q1;
    /* Q1 will keep tyoe F */
    for ( each quantifier Q != Q1 in the box)
    {
        Q.type = E;
        add the predicate
        EXISTS ( SELECT * FROM Q
                 WHERE Q1.c1 = Q.c1 AND
                       Q1.c2 = Q.c2 AND
                       ...
                       Q1.cn = Q.cn
               );
    }
}

Rule Engine

Starburst的Rule Engine的特点：

1. 支持任意复杂度的规则

规则定义为两个(C语言)函数，一个是condition函数，返回true/false表示是否满足该规则应用条件，一个是action函数，在满足条件时被调用，可以执行任意操作。

2. 上下文

规则引擎执行操作时，带有上下文参数，该参数包含指向用户查询信息的指针。

3. 规则分类和扩展的冲突解决

规则分类的好处：有利于对规则的理解；有利于规则condition/action函数的模块化和可理解性；不同的规则类的冲突解决方式不一样。

规则冲突及解决：规则冲突是指当前存在规则可以应用。解决的两种模式：sequential（对所有规则按某个顺序一轮轮地尝试）和 priority（总是应用满足条件的优先级最高的规则）。

4. 保证终止

5. 规则控制

支持在线enable/disable某些rule。

Tags: SQL

← Content-Aware Lock Scheduling Helping compiler help you - Ispike, BOLT →

扫描二维码，分享此文章