Java應用中表達式解析器(Java Cup/JFlex)生成器的介紹及示例

openkk 14年前發布 | 115K 次閱讀 JFlex 語法/詞法分析器

在基于Java的軟件系統的構建過程中，開發人員經常會遇到詞法解析、語法解析等問題，比如：在報表系統的中一般需要支持單元格計算公式（類似于 Excel的公式），在某些系統的數據轉換過程要實現自定義的轉換規則腳本。面對這一類問題，我們最終需要的是一個針對問題域的詞法及語法解析器，通常的實現方式不外乎有以下幾種：

1. 自己實現全部過程

當所涉及語法非常簡單（一般只有一兩句原語）時，可以選擇該方式。

優點：最直接，自然而然的選擇；如果開發人員理論基礎扎實的話不需要額外的學習負擔。

缺點：要求有一定的理論基礎；不利于擴充、或設計良好利于擴充但隨著未來需求的改變及新語法的不斷增加使擴充的開發成本不可控；測試成本增加；因未經過應用檢驗而存在風險；

2. 使用第三方的腳本引擎

當所涉及的語法及其復雜（需要支持比較復雜的腳本，甚至需要實現一種新的程序設計語言時，類似于Brio Query中提供的自定義腳本功能）時，可以選擇該方式。目前，在互聯網上有很多第三方腳本引擎（如：針對應用有：Netscape Script Engine、JavaRhino、BeanShell、IBM BSF、Tcl-Java等，針對Delphi應用有：Innerfuse Pascal Script 3，對微軟體系有Microsoft Script Engine。這些引擎的具體情況不在這里討論）可以選擇，只需要少量的編程工作這些腳本引擎大多都可以內嵌到宿主應用程序中，為宿主應用程序提供腳本支持功能。

優點：開發工作量少；部分引擎已經得到廣泛應用的驗證；具有復雜的腳本支持能力，可以提供類似于程序設計語言的處理能力。

缺點：腳本的語法由使用的引擎決定，往往比較復雜，對于處理簡單問題的場合顯得過于累贅；系統中需要引入第三方的類庫/控件；要求最終用戶學習引擎所規定的腳本語法；某些引擎在訪問應用系統的運行時對象時能力有限。

3. 使用語法解析器生成器

使用過Unix的同志一般會知道在Unix中有Yacc/Lex，C語言的編譯器生成器。在Java中，也有相應的工具。此方法使用JFlex /Java_Cup生成語法解析源碼。本人使用這種方法成功的實現了類似Excel那樣的公式，支持多種運算及函數，可以維護上下文，可以在表達式中引用系統的對象體系，可以執行動作，并且可以任意擴展而不需要修改解析源碼。

關于JFlex與Java_Cup的文檔你需要自己到下面的鏈接中仔細看看。其中關鍵的部分是需要定義你的語法及詞法分析文件，類似于BNF表達式。

主要步驟如下（為了方便，這里僅僅以四則運算為例，這個例子也是編譯器構造工具的通用例子，滿世界都采用它。）：

當然首先你必須下在JFlex與Java_cup，并解壓到特定目錄，假定二者的目錄分別是${JFlex_home}、${Java_Cup_home}。

3.1定義JFlex詞法分析文件calc.flex

/*
This example comes from a short article series in the Linux
Gazette by Richard A. Sevenich and Christopher Lopes, titled
"Compiler Construction Tools". The article series starts at

http://www.linuxgazette.com/issue39/sevenich.html

Small changes and updates to newest JFlex+Cup versions
by Gerwin Klein
*/

/*
Commented By: Christopher Lopes
File Name: lcalc.flex
To Create: > jflex lcalc.flex

and then after the parser is created
> javac Lexer.java
*/
/* ————————–Usercode Section———————— */
import java_cup.runtime.*;
%%
/* —————–Options and Declarations Section—————– */
/*
The name of the class JFlex will create will be Lexer.
Will write the code to the file Lexer.java.
*/
%class Lexer

/*
The current line number can be accessed with the variable yyline
and the current column number with the variable yycolumn.
*/
%line
%column
/*
   Will switch to a CUP compatibility mode to interface with a CUP
   generated parser.
*/
%cup
/*
Declarations
Code between %{ and %}, both of which must be at the beginning of a
line, will be copied letter to letter into the lexer class source.
Here you declare member variables and functions that are used inside
scanner actions.
*/
%{
    /* To create a new java_cup.runtime.Symbol with information about
       the current token, the token will have no value in this
       case. */
    private Symbol symbol(int type) {
        return new Symbol(type, yyline, yycolumn);
    }
    /* Also creates a new java_cup.runtime.Symbol with information
       about the current token, but this object has a value. */
    private Symbol symbol(int type, Object value) {
        return new Symbol(type, yyline, yycolumn, value);
    }
%}

/*
Macro Declarations
These declarations are regular expressions that will be used latter
in the Lexical Rules Section.
*/
/* A line terminator is a r (carriage return), n (line feed), or
   rn. */
LineTerminator = r|n|rn
/* White space is a line terminator, space, tab, or line feed. */
WhiteSpace     = {LineTerminator} | [ tf]
/* A literal integer is is a number beginning with a number between
   one and nine followed by zero or more numbers between zero and nine
   or just a zero. */
dec_int_lit = 0 | [1-9][0-9]*
/* A identifier integer is a word beginning a letter between A and
   Z, a and z, or an underscore followed by zero or more letters
   between A and Z, a and z, zero and nine, or an underscore. */
dec_int_id = [A-Za-z_][A-Za-z_0-9]*
%%
/* ————————Lexical Rules Section———————- */
/*
   This section contains regular expressions and actions, i.e. Java
   code, that will be executed when the scanner matches the associated
   regular expression. */
   /* YYINITIAL is the state at which the lexer begins scanning. So
   these regular expressions will only be matched if the scanner is in
   the start state YYINITIAL. */
{
    /* Return the token SEMI declared in the class sym that was found. */
    ";"                { return symbol(sym.SEMI); }
    /* Print the token found that was declared in the class sym and then
       return it. */
    "+"                { System.out.print(" + "); return symbol(sym.PLUS); }
    "-"                { System.out.print(" – "); return symbol(sym.MINUS); }
    "*"                { System.out.print(" * "); return symbol(sym.TIMES); }
    "/"                { System.out.print(" / "); return symbol(sym.DIVIDE); }
    "("                { System.out.print(" ( "); return symbol(sym.LPAREN); }
    ")"                { System.out.print(" ) "); return symbol(sym.RPAREN); }
    /* If an integer is found print it out, return the token NUMBER
       that represents an integer and the value of the integer that is
       held in the string yytext which will get turned into an integer
       before returning */
    {dec_int_lit}      { System.out.print(yytext());
                         return symbol(sym.NUMBER, new Integer(yytext())); }
    /* If an identifier is found print it out, return the token ID
       that represents an identifier and the default value one that is
       given to all identifiers. */
    {dec_int_id}       { System.out.print(yytext());
                         return symbol(sym.ID, new Integer(1));}
    /* Don’t do anything if whitespace is found */
    {WhiteSpace}       { /* just skip what was found, do nothing */ }
}

/* No token was found for the input so through an error. Print out an
Illegal character message with the illegal character that was found. */
[^] { throw new Error("Illegal character <"+yytext()+">"); }

3.2生成詞法解析程序

執行${JFlex_home}binjflex.bat，并輸入calc.flex，成功后輸出lex.java文件即四則運算的詞法解析文件。職責主要是從本例中的四則運算字符串輸入中進行詞法分析，輸出每一個Token，比如數字、運算符等。

3.3定義Java_cup語法分析文件

/*
This example comes from a short article series in the Linux
Gazette by Richard A. Sevenich and Christopher Lopes, titled
"Compiler Construction Tools". The article series starts at

http://www.linuxgazette.com/issue39/sevenich.html

Small changes and updates to newest JFlex+Cup versions
by Gerwin Klein
*/

/*
Commented By: Christopher Lopes
File Name: ycalc.cup
To Create: > java java_cup.Main < ycalc.cup
*/
/* ———————-Preliminary Declarations Section——————–*/
/* Import the class java_cup.runtime.* */
import java_cup.runtime.*;
/* Parser code to change the way the parser reports errors (include
   line and column number of the error). */
parser code {:
    /* Change the method report_error so it will display the line and
       column of where the error occurred in the input as well as the
       reason for the error which is passed into the method in the
       String ‘message’. */
    public void report_error(String message, Object info) {
        /* Create a StringBuffer called ‘m’ with the string ‘Error’ in it. */
        StringBuffer m = new StringBuffer("Error");
        /* Check if the information passed to the method is the same
           type as the type java_cup.runtime.Symbol. */
        if (info instanceof java_cup.runtime.Symbol) {
            /* Declare a java_cup.runtime.Symbol object ‘s’ with the
               information in the object info that is being typecasted
               as a java_cup.runtime.Symbol object. */
            java_cup.runtime.Symbol s = ((java_cup.runtime.Symbol) info);
            /* Check if the line number in the input is greater or
               equal to zero. */
            if (s.left >= 0) {
                /* Add to the end of the StringBuffer error message
                   the line number of the error in the input. */
                m.append(" in line "+(s.left+1));
                /* Check if the column number in the input is greater
                   or equal to zero. */
                if (s.right >= 0)
                    /* Add to the end of the StringBuffer error message
                       the column number of the error in the input. */
                    m.append(", column "+(s.right+1));
            }
        }
        /* Add to the end of the StringBuffer error message created in
           this method the message that was passed into this method. */
        m.append(" : "+message);
        /* Print the contents of the StringBuffer ‘m’, which contains
           an error message, out on a line. */
        System.err.println(m);
    }
    /* Change the method report_fatal_error so when it reports a fatal
       error it will display the line and column number of where the
       fatal error occurred in the input as well as the reason for the
       fatal error which is passed into the method in the object
       ‘message’ and then exit.*/
    public void report_fatal_error(String message, Object info) {
        report_error(message, info);
        System.exit(1);
    }
:};

/* ————Declaration of Terminals and Non Terminals Section———– */
/* Terminals (tokens returned by the scanner).

   Terminals that have no value are listed first and then terminals
   that do have an value, in this case an integer value, are listed on
   the next line down. */
terminal           SEMI, PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN;
terminal Integer   NUMBER, ID;
/* Non terminals used in the grammar section.

   Non terminals that have an object value are listed first and then
   non terminals that have an integer value are listed. An object
   value means that it can be any type, it isn’t set to a specific
   type. So it could be an Integer or a String or whatever. */
non terminal Object     expr_list, expr_part;
non terminal Integer    expr, factor, term;

/* ————-Precedence and Associatively of Terminals Section———– */
/*
Precedence of non terminals could be defined here. If you do define
precedence here you won’t need to worry about precedence in the
Grammar Section, i.e. that TIMES should have a higher precedence
than PLUS.
The precedence defined here would look something like this where the
lower line always will have higher precedence than the line before it.
precedence left PLUS, MINUS;
precedence left TIMES, DIVIDE;
*/

/* —————————-Grammar Section——————– */
/* The grammar for our parser.
   expr_list ::=   expr_list expr_part
                 | expr_part
   expr_part ::=   expr SEMI
   expr      ::=   expr PLUS factor
                 | expr MINUS factor
                 | factor
   factor    ::=   factor TIMES term
                 | factor DIVIDE term
                 | term
   term     ::=    LPAREN expr RPAREN
                 | NUMBER
                 | ID
*/
/* ‘expr_list’ is the start of our grammar. It can lead to another
   ‘expr_list’ followed by an ‘expr_part’ or it can just lead to an
   ‘expr_part’. The lhs of the non terminals ‘expr_list’ and
   ‘expr_part’ that are in the rhs side of the production below need
   to be found. Then the rhs sides of those non terminals need to be
   followed in a similar manner, i.e. if there are any non terminals
   in the rhs of those productions then the productions with those non
   terminals need to be found and those rhs’s followed. This process
   keeps continuing until only terminals are found in the rhs of a
   production. Then we can work our way back up the grammar bringing
   any values that might have been assigned from a terminal. */
   expr_list ::= expr_list expr_part
                 |
                 expr_part;
/* ‘expr_part’ is an ‘expr’ followed by the terminal ‘SEMI’. The ‘:e’
   after the non terminal ‘expr’ is a label an is used to access the
   value of ‘expr’ which will be an integer. The action for the
   production lies between {: and :}. This action will print out the
   line " = + e" where e is the value of ‘expr’. Before the action
   takes places we need to go deeper into the grammar since ‘expr’ is
   a non terminal. Whenever a non terminal is encountered on the rhs
   of a production we need to find the rhs of that non terminal until
   there are no more non terminals in the rhs. So when we finish
   going through the grammar and don’t encounter any more non
   terminals in the rhs productions will return until we get back to
   ‘expr’ and at that point ‘expr’ will contain an integer value. */
   expr_part ::= expr:e
                 {: System.out.println(" = " + e); :}
                 SEMI
                 ;
/* ‘expr’ can lead to ‘expr PLUS factor’, ‘expr MINUS factor’, or
   ‘factor’. The ‘TIMES’ and ‘DIVIDE’ productions are not at this
   level. They are at a lower level in the grammar which in affect
   makes them have higher precedence. Actions for the rhs of the non
   terminal ‘expr’ return a value to ‘expr’. This value that is
   created is an integer and gets stored in ‘RESULT’ in the action.
   RESULT is the label that is assigned automatically to the rhs, in
   this case ‘expr’. If the rhs is just ‘factor’ then ‘f’ refers to
   the non terminal ‘factor’. The value of ‘f’ is retrieved with the
   function ‘intValue()’ and will be stored in ‘RESULT’. In the other
   two cases ‘f’ and ‘e’ refers to the non terminals ‘factor’ and
   ‘expr’ respectively with a terminal between them, either ‘PLUS’ or
   ‘MINUS’. The value of each is retrieved with the same function
   ‘intValue’. The values will be added or subtracted and then the
   new integer will be stored in ‘RESULT’.*/
   expr      ::= expr:e PLUS factor:f
                 {: RESULT = new Integer(e.intValue() + f.intValue()); :}
                 |
                 expr:e MINUS factor:f
                 {: RESULT = new Integer(e.intValue() – f.intValue()); :}
                 |
                 factor:f
                 {: RESULT = new Integer(f.intValue()); :}
                 ;
/* ‘factor’ can lead to ‘factor TIMES term’, ‘factor DIVIDE term’, or
   ‘term’. Since the productions for TIMES and DIVIDE are lower in
   the grammar than ‘PLUS’ and ‘MINUS’ they will have higher
   precedence. The same sort of actions take place in the rhs of
   ‘factor’ as in ‘expr’. The only difference is the operations that
   takes place on the values retrieved with ‘intValue()’, ‘TIMES’ and
   ‘DIVIDE’ here instead of ‘PLUS’ and ‘MINUS’. */
   factor    ::= factor:f TIMES term:t
                 {: RESULT = new Integer(f.intValue() * t.intValue()); :}
                 |
                 factor:f DIVIDE term:t
                 {: RESULT = new Integer(f.intValue() / t.intValue()); :}
                 |
                 term:t
                 {: RESULT = new Integer(t.intValue()); :}
                 ;
/* ‘term’ can lead to ‘LPAREN expr RPAREN’, ‘NUMBER’, or ‘ID’. The
   first production has the non terminal ‘expr’ in it so the
   production with its lhs side needs to be found and followed. The
   next rhs has no non terminals. So the grammar ends here and can go
   back up. When it goes back up it will bring the value that was
   retrieved when the scanner encounter the token ‘NUMBER’. ‘RESULT’
   is assigned ‘n’, which refers to ‘NUMBER’, as the action for this
   production. The same action occurs for ‘ID’, except the ‘i’ is
   used to refer to ‘ID’. ‘ID’ is also the only thing on the rhs of
   the production. And since ‘ID’ is a terminal the grammar will end
   here and go back up. */
   term      ::= LPAREN expr:e RPAREN
                 {: RESULT = e; :}
                 |
                 NUMBER:n
                 {: RESULT = n; :}
                 |
                 ID:i
                 {: RESULT = i; :}
                 ;

3.4 生成語法解析器calc.cup

當前目錄切換到${java_cup_home}下，執行

java java_cup.Main options < calc.cup

其中，Options可以自己讀文檔，視情況而設。

執行的結果是生成了語法解析文件parser.java，它負責從lex.java中接受token流，構造語法規則，比如生成語法解析樹。

至此，你已經擁有了一個解析器，只要在你的代碼中引入Parser.java，調用其相應parse()方法即可完成表達式解析過程。測試調用過程如下：

/*
Commented By: Christopher Lopes
File Name: Main.java
To Create:
After the scanner, lcalc.flex, and the parser, ycalc.cup, have been created.
> javac Main.java
To Run:
> java Main test.txt
where test.txt is an test input file for the calculator.
*/
import java.io.*;
public class Main {
static public void main(String argv[]) {
    /* Start the parser */
    try {
      parser p = new parser(new Lexer(new FileReader(argv[0])));
      Object result = p.parse().value;
    } catch (Exception e) {
      /* do cleanup here — possibly rethrow e */
      e.printStackTrace();
    }
}
}

不過在處理復雜表達式（涉及到上下文、系統其他對象引用等）時，你需要仔細定義語法規則文件中的action。

1. 自己實現全部過程

當所涉及語法非常簡單（一般只有一兩句原語）時，可以選擇該方式。

優點：最直接，自然而然的選擇；如果開發人員理論基礎扎實的話不需要額外的學習負擔。

2. 使用第三方的腳本引擎

優點：開發工作量少；部分引擎已經得到廣泛應用的驗證；具有復雜的腳本支持能力，可以提供類似于程序設計語言的處理能力。

3. 使用語法解析器生成器

關于JFlex與Java_Cup的文檔你需要自己到下面的鏈接中仔細看看。其中關鍵的部分是需要定義你的語法及詞法分析文件，類似于BNF表達式。

主要步驟如下（為了方便，這里僅僅以四則運算為例，這個例子也是編譯器構造工具的通用例子，滿世界都采用它。）：

當然首先你必須下在JFlex與Java_cup，并解壓到特定目錄，假定二者的目錄分別是${JFlex_home}、${Java_Cup_home}。

3.1定義JFlex詞法分析文件calc.flex

/*
This example comes from a short article series in the Linux
Gazette by Richard A. Sevenich and Christopher Lopes, titled
"Compiler Construction Tools". The article series starts at

http://www.linuxgazette.com/issue39/sevenich.html

Small changes and updates to newest JFlex+Cup versions
by Gerwin Klein
*/

/*
Commented By: Christopher Lopes
File Name: lcalc.flex
To Create: > jflex lcalc.flex

3.2生成詞法解析程序

3.3定義Java_cup語法分析文件

/*
This example comes from a short article series in the Linux
Gazette by Richard A. Sevenich and Christopher Lopes, titled
"Compiler Construction Tools". The article series starts at

http://www.linuxgazette.com/issue39/sevenich.html

Small changes and updates to newest JFlex+Cup versions
by Gerwin Klein
*/

/* ————Declaration of Terminals and Non Terminals Section———– */
/* Terminals (tokens returned by the scanner).

3.4 生成語法解析器calc.cup

當前目錄切換到${java_cup_home}下，執行

java java_cup.Main options < calc.cup

其中，Options可以自己讀文檔，視情況而設。

執行的結果是生成了語法解析文件parser.java，它負責從lex.java中接受token流，構造語法規則，比如生成語法解析樹。

至此，你已經擁有了一個解析器，只要在你的代碼中引入Parser.java，調用其相應parse()方法即可完成表達式解析過程。測試調用過程如下：

不過在處理復雜表達式（涉及到上下文、系統其他對象引用等）時，你需要仔細定義語法規則文件中的action。

Java應用中表達式解析器(Java Cup/JFlex)生成器的介紹及示例

1. 自己實現全部過程

2. 使用第三方的腳本引擎

3. 使用語法解析器生成器

3.1定義JFlex詞法分析文件calc.flex

3.2生成詞法解析程序

3.3定義Java_cup語法分析文件

3.4 生成語法解析器calc.cup

1. 自己實現全部過程

2. 使用第三方的腳本引擎

3. 使用語法解析器生成器

3.1定義JFlex詞法分析文件calc.flex

3.2生成詞法解析程序

3.3定義Java_cup語法分析文件

3.4 生成語法解析器calc.cup

相關資源：

相關經驗

相關資訊

相關文檔

目錄