Parsing JavaScript with Acorn

Posted by KandisMouto 9 years ago

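I recently needed to parse JavaScript code at work. In most cases matching with regular expressions is enough, but as soon as the result depends on the surrounding code context, regular expressions or simple character-level parsing fall short. At that point you need a real language parser that produces a full AST (abstract syntax tree).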

I then found several JavaScript parsers that are themselves written in JavaScript:

Esprima
Acorn
UglifyJS 2
Shift

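Judging by their commit histories, all of them are well maintained and keep up with new ECMAScript features. I took a brief look at each one, so here are some notes on them.

Esprima is a classic parser. Acorn came after it, and both have been around for several years. According to Acorn's author, he reinvented this wheel mostly for fun; its speed is comparable to Esprima's, with less implementation code. The key point is that the ASTs produced by both parsers (only the AST; the tokens differ) follow The Estree Spec (this is the specification Mozilla engineers wrote for the JavaScript AST output by the SpiderMonkey engine; see also SpiderMonkey in MDN), so their results are largely compatible.

Webpack, which is very popular these days, also uses Acorn to parse code.

UglifyJS is a well-known JavaScript minifier. It ships with its own parser and can output an AST too, but its features are aimed at minifying code, so using it purely as a parser does not feel like a clean fit.

I did not look into Shift much; I only know that it defines its own AST specification.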

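The Esprima website has a performance benchmark. I ran it in Chrome and compared the results (the benchmark chart is not reproduced here).

The results show that Acorn performs very well, and it follows the Estree specification as well (specifications matter: I personally think that following a common specification is an important foundation for code reuse), so I chose Acorn for code parsing.

The benchmark also includes Google's Traceur, but that is more of an ES6-to-ES5 compiler, which does not match the kind of parser we are looking for.

Now to the main topic: how to use Acorn to parse JavaScript.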

API

The parser's API is very simple:

const ast = acorn.parse(code, options)

Acorn has quite a few options, including some hooks that accept callback functions. Let's go through the more important ones:

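ecmaVersion
As the name suggests, this sets the ECMAScript version of the JavaScript you want to parse. At the time of writing the default was ES7; more recent Acorn releases expect this option to be set explicitly.
sourceType
This option takes one of two values, module or script, and defaults to script.
The difference is mainly about strict mode and import/export. ES6 modules are always in strict mode, so you do not need to add use strict. Scripts as we normally use them in the browser have no import/export syntax.
So with script, import/export is reported as an error and a strict mode declaration can be used; with module, no strict mode declaration is needed and import/export syntax is allowed.
locations
Defaults to false. When set to true, each AST node carries an additional loc object holding the line and column of its start and end.
onComment
Takes a callback function that is called whenever a comment is parsed, giving access to the current comment. Its parameters are: [block, text, start, end].
block indicates whether it is a block comment, text is the comment content, and start and end are the positions where the comment starts and ends.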
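
To make the onComment parameter list a little more concrete, here is a minimal sketch of a callback that collects comments. The helper name createCommentCollector and the offsets are made up for this illustration, and the callback is invoked by hand rather than by Acorn.

```javascript
// Hypothetical helper (not part of Acorn): builds a callback with the
// (block, text, start, end) parameter list described above.
function createCommentCollector() {
  const comments = []
  function onComment(block, text, start, end) {
    comments.push({
      kind: block ? 'block' : 'line',
      text: text.trim(),
      start,
      end
    })
  }
  return { comments, onComment }
}

// Simulate what a parser might report for a "// test" line comment
// (the offsets 30 and 37 are illustrative, not measured).
const collector = createCommentCollector()
collector.onComment(false, ' test', 30, 37)
console.log(collector.comments)
```

In real use the callback would presumably be passed along as acorn.parse(code, { onComment: collector.onComment }).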

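Espree (the parser used by ESLint) needs Esprima's attachComment option: when set to true, Esprima attaches comment information (trailingComments and leadingComments) to the nodes of the parse result. Espree uses Acorn's onComment option to stay compatible with this Esprima feature.

A parser usually also provides an interface for getting the lexical analysis result: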

const tokens = [...acorn.tokenizer(code, options)]

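The second argument of the tokenizer method also accepts locations.

The token data structure differs somewhat from Esprima's (again, Espree adds a compatibility layer for this). If you are interested, take a look at Esprima's output: http://esprima.org/demo/parse.html .

The AST and the tokens that Acorn produces are described in detail next.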

Token

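I searched for a long time without finding a detailed description of the token data structure, so I had to look at it myself.

The code I used to test parsing is: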

import "hello.js"

var a = 2;

// test
function name(){ console.log(arguments); }

The resulting token array consists of objects like this one:

Token {
 type:
 TokenType {
 label: 'import',
 keyword: 'import',
 beforeExpr: false,
 startsExpr: false,
 isLoop: false,
 isAssign: false,
 prefix: false,
 postfix: false,
 binop: null,
 updateContext: null },
 value: 'import',
 start: 5,
 end: 11 },

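It looks easy to understand, right? In the object under type, label indicates the kind of the current token, and keyword is the keyword, such as import in the example, or function.

value is the value of the current token, and start/end are its start and end positions.

Usually label/keyword/value are what we need to care about. For the other details, see the source: tokentype.js .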
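
As a small sketch of working with label/keyword/value, the snippet below filters keyword tokens out of a token list. The sample tokens are hand-written objects that only mimic the shape of the dump above (their offsets and the label strings are assumptions); in real use they would come from the tokenizer.

```javascript
// Hand-written objects that mimic the token shape shown above.
const sampleTokens = [
  { type: { label: 'import', keyword: 'import' }, value: 'import', start: 0, end: 6 },
  { type: { label: 'string', keyword: undefined }, value: 'hello.js', start: 7, end: 17 },
  { type: { label: 'var', keyword: 'var' }, value: 'var', start: 19, end: 22 },
  { type: { label: 'name', keyword: undefined }, value: 'a', start: 23, end: 24 }
]

// Keep only the tokens whose type carries a keyword, and return their values.
function keywordValues(tokens) {
  return tokens
    .filter((token) => Boolean(token.type.keyword))
    .map((token) => token.value)
}

console.log(keywordValues(sampleTokens))
```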

The Estree Spec

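This part is the main event, because what I actually need is the parsed AST. The authoritative content comes from The Estree Spec; I am merely relaying what I read there.

The advantage of having a standard document is that many things can be looked up. There is also a tool, escodegen, for turning an AST that conforms to the Estree standard back into ECMAScript code.

Back to the topic. Let's look at the ES5 part first. You can test the parse results for all kinds of code on the Esprima: Parser page.

AST nodes that follow this specification are represented as Node objects, which should conform to this interface: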

interface Node {
 type: string;
 loc: SourceLocation | null;
}

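The type field indicates the node type. The various types, and the JavaScript syntax each corresponds to, are covered below.

The loc field holds the source location information. It is null if no such information is available; otherwise it is an object containing a start and an end position. The interface is: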

interface SourceLocation {
 source: string | null;
 start: Position;
 end: Position;
}

The Position object here holds line and column information; lines start at 1 and columns start at 0:

interface Position {
 line: number; // >= 1
 column: number; // >= 0
}
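
A minimal sketch of reading these location fields, assuming a node shaped like the Node and SourceLocation interfaces above. The function name formatLoc and the sample node are made up for this example.

```javascript
// Format a node's loc as "startLine:startColumn-endLine:endColumn".
// Falls back to a placeholder when loc is null, as the interface allows.
function formatLoc(node) {
  if (!node.loc) return '(no location)'
  const { start, end } = node.loc
  return `${start.line}:${start.column}-${end.line}:${end.column}`
}

// Hand-built node for the "a" in "var a = 2;" (line 1, columns 4 to 5).
const nodeWithLoc = {
  type: 'Identifier',
  name: 'a',
  loc: { source: null, start: { line: 1, column: 4 }, end: { line: 1, column: 5 } }
}

console.log(formatLoc(nodeWithLoc))
console.log(formatLoc({ type: 'Identifier', name: 'b', loc: null }))
```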

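That covers the basics. Next come the various node types, which also serve as a refresher on JavaScript syntax. Each one is discussed briefly without going into depth (there is a lot of material); anyone familiar with JavaScript will follow easily.

Reading through them feels like going over the basic syntax of JavaScript once more.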

Identifier

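An identifier (I believe that is the right term) is a name we define ourselves when writing JS; variable names, function names, and property names all count as identifiers. The interface looks like this: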

interface Identifier <: Expression, Pattern {
 type: "Identifier";
 name: string;
}

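An identifier can be an expression, or a destructuring pattern (the destructuring syntax of ES6). We will get to Expression and Pattern later.

Literal

A literal. This does not mean [] or {}, but a literal whose own semantics represent a value, such as 1, "hello", or true, as well as regular expressions (there is an extended Node for those), such as /\d?/. Here is the definition from the document: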

interface Literal <: Expression {
 type: "Literal";
 value: string | boolean | null | number | RegExp;
}

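value is the value of the literal. From it we can see the possible types of a literal value: string, boolean, number, null, and regular expression.

RegExpLiteral

This one is for regular expression literals. To make the regular expression easier to work with, an extra regex field is added, containing the pattern itself and its flags.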

interface RegExpLiteral <: Literal {
 regex: {
 pattern: string;
 flags: string;
 };
}
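
As an illustration of telling the two literal shapes apart, here is a small sketch that relies only on the fields listed in the interfaces above. The function name describeLiteral and the sample nodes are made up for this example.

```javascript
// Describe a Literal node, using the regex field to spot RegExpLiteral.
function describeLiteral(node) {
  if (node.type !== 'Literal') throw new TypeError('expected a Literal node')
  if (node.regex) return `regex /${node.regex.pattern}/${node.regex.flags}`
  if (node.value === null) return 'null'
  return `${typeof node.value} ${JSON.stringify(node.value)}`
}

// Hand-built nodes for 1, "hello" and /\d?/g.
const numberNode = { type: 'Literal', value: 1 }
const stringNode = { type: 'Literal', value: 'hello' }
const regexNode = { type: 'Literal', value: /\d?/g, regex: { pattern: '\\d?', flags: 'g' } }

console.log(describeLiteral(numberNode))
console.log(describeLiteral(stringNode))
console.log(describeLiteral(regexNode))
```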

Programs

This is normally the root node, representing the tree of a complete program.

interface Program <: Node {
 type: "Program";
 body: [ Statement ];
}

The body property is an array containing multiple Statement nodes.

Functions

A function declaration or function expression node.

interface Function <: Node {
 id: Identifier | null;
 params: [ Pattern ];
 body: BlockStatement;
}

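id is the function name, the params property is an array representing the function's parameters, and body is a block statement.

One thing to note: while testing, you will never find a node with type: "Function", but you will find type: "FunctionDeclaration" and type: "FunctionExpression". A function appears either as a declaration statement or as a function expression, and both are combinations of node types. FunctionDeclaration and FunctionExpression are covered again later.

This makes the document feel carefully designed: the function name, parameters, and body belong to the function part, while the declaration or expression adds whatever it needs itself.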
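
Since no node ever has type "Function", code that wants to treat all functions alike has to check the concrete types. A minimal sketch, limited to the two ES5 types named above; the helper names and sample nodes are made up for this example.

```javascript
// The two concrete ES5 node types that share the Function interface.
const FUNCTION_TYPES = ['FunctionDeclaration', 'FunctionExpression']

function isFunctionNode(node) {
  return FUNCTION_TYPES.includes(node.type)
}

// Build a short signature such as "name(a, b)" from id and params.
function describeFunction(node) {
  const name = node.id ? node.id.name : '(anonymous)'
  const params = node.params.map((p) => (p.type === 'Identifier' ? p.name : '?'))
  return `${name}(${params.join(', ')})`
}

// Hand-built nodes for "function name(){}" and "function (a, b) {}".
const emptyBlock = { type: 'BlockStatement', body: [] }
const declaration = {
  type: 'FunctionDeclaration',
  id: { type: 'Identifier', name: 'name' },
  params: [],
  body: emptyBlock
}
const expression = {
  type: 'FunctionExpression',
  id: null,
  params: [{ type: 'Identifier', name: 'a' }, { type: 'Identifier', name: 'b' }],
  body: emptyBlock
}

console.log(isFunctionNode(declaration), describeFunction(declaration))
console.log(isFunctionNode(expression), describeFunction(expression))
```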

Statement

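There is nothing special about the statement node; it is just a node used for classification. There are many kinds of statements, though, which are described below.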

interface Statement <: Node { }

ExpressionStatement

An expression statement node, such as a = a + 1 or a++. It has an expression property pointing to an expression node (expressions are covered later).

interface ExpressionStatement <: Statement {
 type: "ExpressionStatement";
 expression: Expression;
}

BlockStatement

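A block statement node, for example: if (...) { // the block statement content goes here }. A block can contain multiple other statements, so it has a body property, an array holding the statements in the block.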

interface BlockStatement <: Statement {
 type: "BlockStatement";
 body: [ Statement ];
}

EmptyStatement

An empty statement node that executes no useful code, for example a lone semicolon ;

interface EmptyStatement <: Statement {
 type: "EmptyStatement";
}

DebuggerStatement

debugger: it represents exactly that and nothing else.

interface DebuggerStatement <: Statement {
 type: "DebuggerStatement";
}

WithStatement

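A with statement node. It has two specific properties: object is the object that with operates on (it can be an expression), and body is the statement executed after with, usually a block statement.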

interface WithStatement <: Statement {
 type: "WithStatement";
 object: Expression;
 body: Statement;
}

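Next are the control flow statements:

ReturnStatement

A return statement node. The argument property is an expression representing the returned value, or null when nothing is returned.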

interface ReturnStatement <: Statement {
 type: "ReturnStatement";
 argument: Expression | null;
}

LabeledStatement

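A label statement. You may not run into it often; here is an example:

loop: for (let i = 0; i < len; i++) {
  // ...
  for (let j = 0; j < min; j++) {
    // ...
    break loop;
  }
}

Here loop is a label. In nested loops we can use break loop to specify which loop to break out of. So the label statement here refers to the whole loop: ... construct.

A label statement node has two properties: label is the name of the label, and body points to the labeled statement, usually a loop statement or a switch statement.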

interface LabeledStatement <: Statement {
 type: "LabeledStatement";
 label: Identifier;
 body: Statement;
}

BreakStatement

A break statement node. Its label property is the name of the target label; when no label is needed (the usual case), it is null.

interface BreakStatement <: Statement {
 type: "BreakStatement";
 label: Identifier | null;
}

ContinueStatement

A continue statement node, similar to break.

interface ContinueStatement <: Statement {
 type: "ContinueStatement";
 label: Identifier | null;
}

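Next are the conditional statements:

IfStatement

An if statement node, which is very common. It has three properties. test is the expression inside the parentheses of if (...).

consequent is the statement executed when the condition is true, usually a block statement.

alternate is the statement node that follows else. It is usually a block statement too, but it can also be another if statement node, as in this structure:

if (a) { //... } else if (b) { // ... }

alternate can of course also be null.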

interface IfStatement <: Statement {
 type: "IfStatement";
 test: Expression;
 consequent: Statement;
 alternate: Statement | null;
}
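
Because an else if is simply an IfStatement stored in alternate, a whole chain can be flattened by following that property. A minimal sketch under that reading of the interface; the function name collectBranchTests and the sample tree are made up for this example.

```javascript
// Follow alternate links to collect the test of every branch in an
// if / else if chain, and report whether a final plain else exists.
function collectBranchTests(ifNode) {
  const tests = []
  let current = ifNode
  while (current != null && current.type === 'IfStatement') {
    tests.push(current.test)
    current = current.alternate
  }
  return { tests, hasElse: current != null }
}

// Hand-built tree for: if (a) { } else if (b) { }
const block = { type: 'BlockStatement', body: [] }
const chain = {
  type: 'IfStatement',
  test: { type: 'Identifier', name: 'a' },
  consequent: block,
  alternate: {
    type: 'IfStatement',
    test: { type: 'Identifier', name: 'b' },
    consequent: block,
    alternate: null
  }
}

const branches = collectBranchTests(chain)
console.log(branches.tests.map((t) => t.name), branches.hasElse)
```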

SwitchStatement

A switch statement node with two properties: discriminant is the expression right after switch, usually a variable, and cases is an array of case nodes representing the individual case clauses.

interface SwitchStatement <: Statement {
 type: "SwitchStatement";
 discriminant: Expression;
 cases: [ SwitchCase ];
}

SwitchCase

A case node of a switch. test is the expression this case is compared against, and consequent holds the statements this case executes.

When test is null, the node represents the default case.

interface SwitchCase <: Node {
 type: "SwitchCase";
 test: Expression | null;
 consequent: [ Statement ];
}
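
The test: null convention makes the default clause easy to locate. A tiny sketch, assuming a SwitchStatement node shaped like the interfaces above; findDefaultCase and the sample node are made up for this example.

```javascript
// Return the default clause of a switch (the case whose test is null),
// or null when the switch has no default.
function findDefaultCase(switchNode) {
  return switchNode.cases.find((c) => c.test === null) || null
}

// Hand-built node for: switch (x) { case 1: break; default: ; }
const switchNode = {
  type: 'SwitchStatement',
  discriminant: { type: 'Identifier', name: 'x' },
  cases: [
    {
      type: 'SwitchCase',
      test: { type: 'Literal', value: 1 },
      consequent: [{ type: 'BreakStatement', label: null }]
    },
    { type: 'SwitchCase', test: null, consequent: [{ type: 'EmptyStatement' }] }
  ]
}

console.log(findDefaultCase(switchNode))
```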

Next are the exception-related statements:

ThrowStatement

A throw statement node. The argument property is the expression that follows throw.

interface ThrowStatement <: Statement {
 type: "ThrowStatement";
 argument: Expression;
}

TryStatement

A try statement node. The block property is the block statement executed by try.

The handler property is the catch node and finalizer is the finally block. When handler is null, finalizer must be a block statement node.

interface TryStatement <: Statement {
 type: "TryStatement";
 block: BlockStatement;
 handler: CatchClause | null;
 finalizer: BlockStatement | null;
}
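
The rule that handler and finalizer cannot both be null can be written as a one-line check. This is only a sketch of that constraint for hand-built or transformed trees; the function name isWellFormedTry is made up for this example.

```javascript
// A TryStatement needs at least one of handler (catch) or finalizer (finally).
function isWellFormedTry(node) {
  return node.type === 'TryStatement' &&
    (node.handler !== null || node.finalizer !== null)
}

const tryBlock = { type: 'BlockStatement', body: [] }

// try { } finally { }
const tryFinally = { type: 'TryStatement', block: tryBlock, handler: null, finalizer: tryBlock }
// Neither catch nor finally: not something a parser would produce from valid code.
const brokenTry = { type: 'TryStatement', block: tryBlock, handler: null, finalizer: null }

console.log(isWellFormedTry(tryFinally), isWellFormedTry(brokenTry))
```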

CatchClause

A catch node. param is the parameter after catch, and body is the block statement executed by catch.

interface CatchClause <: Node {
 type: "CatchClause";
 param: Pattern;
 body: BlockStatement;
}

Next are the loop statements:

WhileStatement

A while statement node. test is the expression in the parentheses, and body is the statement to execute repeatedly.

interface WhileStatement <: Statement {
 type: "WhileStatement";
 test: Expression;
 body: Statement;
}

DoWhileStatement

A do/while statement node, similar to the while statement.

interface DoWhileStatement <: Statement {
 type: "DoWhileStatement";
 body: Statement;
 test: Expression;
}

ForStatement

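A for loop statement node. The init/test/update properties are the three parts inside the parentheses of the for statement: the initializer, the loop condition, and the update executed on each iteration (init can be a variable declaration or an expression). All three can be null, as in for(;;){}.

The body property is the statement to execute repeatedly.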

interface ForStatement <: Statement {
 type: "ForStatement";
 init: VariableDeclaration | Expression | null;
 test: Expression | null;
 update: Expression | null;
 body: Statement;
}

ForInStatement

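A for/in statement node. left and right are the parts on either side of the in keyword (the left side can be a variable declaration or a pattern). body is again the statement to execute repeatedly.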

interface ForInStatement <: Statement {
 type: "ForInStatement";
 left: VariableDeclaration | Pattern;
 right: Expression;
 body: Statement;
}

Declarations

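A declaration statement node. It is also a statement, just a more specific classification. The various kinds of declarations are introduced below.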

interface Declaration <: Statement { }

FunctionDeclaration

A function declaration. Unlike the Function mentioned earlier, id cannot be null.

interface FunctionDeclaration <: Function, Declaration {
 type: "FunctionDeclaration";
 id: Identifier;
}

VariableDeclaration

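A variable declaration. The kind property says what kind of declaration it is. In the ES5 part of the spec shown here it is always "var"; the ES2015 part extends it with "let" and "const".

declarations holds the individual declarators, because we can write: let a = 1, b = 2; .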

interface VariableDeclaration <: Declaration {
 type: "VariableDeclaration";
 declarations: [ VariableDeclarator ];
 kind: "var";
}

VariableDeclarator

A variable declarator. id is the node for the variable name, and init is the expression for the initial value, which can be null.

interface VariableDeclarator <: Node {
 type: "VariableDeclarator";
 id: Pattern;
 init: Expression | null;
}
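
To show how declaration and declarators fit together, here is a small sketch that lists the names a VariableDeclaration introduces. It only handles plain Identifier ids (destructuring patterns are skipped), and the function name declaredNames and the sample node are made up for this example.

```javascript
// List the simple names declared by a VariableDeclaration and whether
// each declarator has an initial value.
function declaredNames(declaration) {
  return declaration.declarations
    .filter((d) => d.id.type === 'Identifier')
    .map((d) => ({ name: d.id.name, initialized: d.init !== null }))
}

// Hand-built node for: var a = 1, b;
const varDeclaration = {
  type: 'VariableDeclaration',
  kind: 'var',
  declarations: [
    {
      type: 'VariableDeclarator',
      id: { type: 'Identifier', name: 'a' },
      init: { type: 'Literal', value: 1 }
    },
    { type: 'VariableDeclarator', id: { type: 'Identifier', name: 'b' }, init: null }
  ]
}

console.log(declaredNames(varDeclaration))
```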

Expressions

An expression node.

interface Expression <: Node { }

ThisExpression

Represents this.

interface ThisExpression <: Expression {
 type: "ThisExpression";
}

ArrayExpression

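An array expression node. elements is an array holding the elements of the array literal; each element is an expression node, or null for a hole such as the one in [1, , 2].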

interface ArrayExpression <: Expression {
 type: "ArrayExpression";
 elements: [ Expression | null ];
}

ObjectExpression

An object expression node. The properties property is an array representing the key-value pairs of the object; each element is a property node.

interface ObjectExpression <: Expression {
 type: "ObjectExpression";
 properties: [ Property ];
}

Property

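A property node inside an object expression. key is the key and value is the value. Because ES5 syntax has get/set, there is a kind property saying whether this is a normal initialization or a get/set.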

interface Property <: Node {
 type: "Property";
 key: Literal | Identifier;
 value: Expression;
 kind: "init" | "get" | "set";
}
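
As a sketch of reading key, value, and kind, the function below turns an ObjectExpression into a plain object. It deliberately handles only kind "init" properties whose value is a Literal and skips everything else; the name toPlainObject and the sample node are made up for this example.

```javascript
// Convert an ObjectExpression with literal values into a plain object.
// Getters, setters and non-literal values are skipped in this sketch.
function toPlainObject(objectExpression) {
  const result = {}
  for (const prop of objectExpression.properties) {
    if (prop.kind !== 'init' || prop.value.type !== 'Literal') continue
    // A key is either an Identifier (a: 1) or a Literal ("b": 2).
    const key = prop.key.type === 'Identifier' ? prop.key.name : String(prop.key.value)
    result[key] = prop.value.value
  }
  return result
}

// Hand-built node for: { a: 1, "b": "two", get c() { } }
const objectNode = {
  type: 'ObjectExpression',
  properties: [
    { type: 'Property', key: { type: 'Identifier', name: 'a' }, value: { type: 'Literal', value: 1 }, kind: 'init' },
    { type: 'Property', key: { type: 'Literal', value: 'b' }, value: { type: 'Literal', value: 'two' }, kind: 'init' },
    {
      type: 'Property',
      key: { type: 'Identifier', name: 'c' },
      value: { type: 'FunctionExpression', id: null, params: [], body: { type: 'BlockStatement', body: [] } },
      kind: 'get'
    }
  ]
}

console.log(toPlainObject(objectNode))
```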

FunctionExpression

A function expression node.

interface FunctionExpression <: Function, Expression {
 type: "FunctionExpression";
}

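Next are the expressions related to unary operators:

UnaryExpression

A unary expression node (++/-- are update operators and do not belong here). operator is the operator, prefix says whether it is a prefix operator, and argument is the expression being operated on.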

interface UnaryExpression <: Expression {
 type: "UnaryExpression";
 operator: UnaryOperator;
 prefix: boolean;
 argument: Expression;
}

UnaryOperator

A unary operator, an enumeration with the following values:

enum UnaryOperator {
 "-" | "+" | "!" | "~" | "typeof" | "void" | "delete"
}

UpdateExpression

An update expression node, i.e. ++/--. It is similar to a unary expression, except that operator is of a different type: here it is an update operator.

interface UpdateExpression <: Expression {
 type: "UpdateExpression";
 operator: UpdateOperator;
 argument: Expression;
 prefix: boolean;
}

UpdateOperator

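An update operator, either ++ or --. Combined with the prefix property of the update expression node, it tells whether the operator comes before or after the operand.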

enum UpdateOperator {
 "++" | "--"
}

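Next are the expressions related to binary operators:

BinaryExpression

A binary expression node. left and right are the two expressions on either side of the operator, and operator is a binary operator.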

interface BinaryExpression <: Expression {
 type: "BinaryExpression";
 operator: BinaryOperator;
 left: Expression;
 right: Expression;
}

BinaryOperator

A binary operator, with the following values:

enum BinaryOperator {
 "==" | "!=" | "===" | "!=="
 | "<" | "<=" | ">" | ">="
 | "<<" | ">>" | ">>>"
 | "+" | "-" | "*" | "/" | "%"
 | "|" | "^" | "&" | "in"
 | "instanceof"
}

AssignmentExpression

An assignment expression node. operator is an assignment operator, and left and right are the expressions on either side of it.

interface AssignmentExpression <: Expression {
 type: "AssignmentExpression";
 operator: AssignmentOperator;
 left: Pattern | Expression;
 right: Expression;
}

AssignmentOperator

An assignment operator, with the following values (only a few of them are commonly used):

enum AssignmentOperator {
 "=" | "+=" | "-=" | "*=" | "/=" | "%="
 | "<<=" | ">>=" | ">>>="
 | "|=" | "^=" | "&="
}

LogicalExpression

A logical expression node, similar to an assignment or binary expression, except that operator is of the logical operator type.

interface LogicalExpression <: Expression {
 type: "LogicalExpression";
 operator: LogicalOperator;
 left: Expression;
 right: Expression;
}

LogicalOperator

A logical operator with two possible values, OR and AND.

enum LogicalOperator {
 "||" | "&&"
}
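
The unary, binary, and logical nodes above share a recursive shape, which a tiny evaluator makes visible. This is only a sketch covering literals and a handful of operators, not a complete interpreter; the function name evaluate and the sample tree are made up for this example.

```javascript
// Evaluate a small subset of expression nodes: Literal, UnaryExpression,
// BinaryExpression and LogicalExpression. Anything else throws.
function evaluate(node) {
  switch (node.type) {
    case 'Literal':
      return node.value
    case 'UnaryExpression': {
      const value = evaluate(node.argument)
      if (node.operator === '-') return -value
      if (node.operator === '+') return +value
      if (node.operator === '!') return !value
      throw new Error(`unsupported unary operator ${node.operator}`)
    }
    case 'BinaryExpression': {
      const left = evaluate(node.left)
      const right = evaluate(node.right)
      if (node.operator === '+') return left + right
      if (node.operator === '-') return left - right
      if (node.operator === '*') return left * right
      if (node.operator === '/') return left / right
      if (node.operator === '===') return left === right
      if (node.operator === '<') return left < right
      throw new Error(`unsupported binary operator ${node.operator}`)
    }
    case 'LogicalExpression':
      // Keep short-circuit behaviour: the right side is only evaluated when needed.
      return node.operator === '&&'
        ? evaluate(node.left) && evaluate(node.right)
        : evaluate(node.left) || evaluate(node.right)
    default:
      throw new Error(`unsupported node type ${node.type}`)
  }
}

const lit = (value) => ({ type: 'Literal', value })

// Hand-built tree for: 1 + 2 * 3
const sum = {
  type: 'BinaryExpression',
  operator: '+',
  left: lit(1),
  right: { type: 'BinaryExpression', operator: '*', left: lit(2), right: lit(3) }
}

console.log(evaluate(sum))
```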

MemberExpression

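A member expression node, i.e. an expression that accesses a member of an object. object is the expression node for the object being accessed, and property is the name of the member. If computed is false, the member is accessed with . and property should be an Identifier node. If computed is true, the member is accessed with [], so property is an Expression node and the member name is the value that expression evaluates to.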

interface MemberExpression <: Expression, Pattern {
 type: "MemberExpression";
 object: Expression;
 property: Expression;
 computed: boolean;
}
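
The effect of computed is easiest to see when printing a member expression back as text. A minimal sketch that only understands identifiers, this, literals, and nested member expressions; the function name memberPath and the sample nodes are made up for this example.

```javascript
// Render a (possibly nested) MemberExpression as source-like text.
function memberPath(node) {
  if (node.type === 'Identifier') return node.name
  if (node.type === 'ThisExpression') return 'this'
  if (node.type === 'Literal') return JSON.stringify(node.value)
  if (node.type === 'MemberExpression') {
    const object = memberPath(node.object)
    const property = memberPath(node.property)
    // computed: true means bracket access, false means dot access.
    return node.computed ? `${object}[${property}]` : `${object}.${property}`
  }
  return '<expr>'
}

const id = (name) => ({ type: 'Identifier', name })

// console.log
const dotAccess = { type: 'MemberExpression', object: id('console'), property: id('log'), computed: false }
// this.items[i]
const bracketAccess = {
  type: 'MemberExpression',
  object: { type: 'MemberExpression', object: { type: 'ThisExpression' }, property: id('items'), computed: false },
  property: id('i'),
  computed: true
}

console.log(memberPath(dotAccess))
console.log(memberPath(bracketAccess))
```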

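Next are some other expressions:

ConditionalExpression

A conditional expression, commonly called the ternary expression, i.e. boolean ? true : false. Its properties mirror those of the if statement.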

interface ConditionalExpression <: Expression {
 type: "ConditionalExpression";
 test: Expression;
 alternate: Expression;
 consequent: Expression;
}

CallExpression

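A function call expression, representing things like func(1, 2). callee is an expression node representing the function, and arguments is an array of expression nodes representing the argument list.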

interface CallExpression <: Expression {
 type: "CallExpression";
 callee: Expression;
 arguments: [ Expression ];
}

NewExpression

A new expression.

interface NewExpression <: CallExpression {
 type: "NewExpression";
}

SequenceExpression

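This is the expression built with the comma operator (I am not sure of its exact name). expressions is an array of the comma-separated expressions that make up the whole expression.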

interface SequenceExpression <: Expression {
 type: "SequenceExpression";
 expressions: [ Expression ];
}

Patterns

A pattern. It is mainly meaningful for ES6 destructuring assignment; in ES5 it can be thought of as roughly the same thing as an Identifier.

interface Pattern <: Node { }

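This part covers a lot, but everything follows the same logic, and writing it up was a chance to review JavaScript syntax again. The document also has sections for ES2015, ES2016, and ES2017, which cover quite a lot as well, but once you understand the above and approach the document from the syntax side, the rest is easy to follow.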

Plugins

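Back to our protagonist. Acorn provides an extension mechanism for writing plugins: Acorn Plugins .

We can use plugins to extend the parser so that it handles more syntax, such as .jsx. If you are interested, take a look at this plugin: acorn-jsx .

According to the official documentation, Acorn plugins are meant to make extending the parser convenient, but they require a fairly good understanding of how Acorn works internally, since extending works by redefining some methods on top of the original ones. I will not go into it here; if I ever need a plugin, I will write another article about it.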

Examples

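Now let's see how to put the parser to use. Suppose we need to find out which modules a CommonJS module depends on. We can use Acorn to find the calls to the require function and take the argument passed to each call, which gives us the modules it depends on.

Here is the example code: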

const acorn = require('acorn')

// Walk every node in the tree
function walkNode(node, callback){
  callback(node)

  // Anything with a type field is treated as a node
  Object.keys(node).forEach((key) => {
    const item = node[key]
    if (Array.isArray(item)) {
      item.forEach((sub) => {
        // sub can be null, e.g. a hole in an array literal
        sub && sub.type && walkNode(sub, callback)
      })
    }

    item && item.type && walkNode(item, callback)
  })
}

function parseDependencies(str){
  const ast = acorn.parse(str, { ranges: true })
  const resource = [] // list of dependencies

  // Start from the root node
  walkNode(ast, (node) => {
    const callee = node.callee
    const args = node.arguments

    // A require is a function call whose callee is named require and
    // which takes exactly one argument, which must be a literal
    if (
      node.type === 'CallExpression' &&
      callee.type === 'Identifier' &&
      callee.name === 'require' &&
      args.length === 1 &&
      args[0].type === 'Literal'
    ) {
      // Record the information about this dependency
      resource.push({
        string: str.substring(node.range[0], node.range[1]),
        path: args[0].value,
        start: node.range[0],
        end: node.range[1]
      })
    }
  })

  return resource
}

This only handles one simple case, but it already shows how to use the parser. Webpack builds much more on top of this, including handling var r = require; r('a') or require.async('a').
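
To give a feel for one of those extra cases, here is a sketch of a predicate that accepts both require('a') and require.async('a'). It is an illustration on hand-built nodes, not how Webpack implements it; the function name requiredPath and the sample nodes are made up for this example.

```javascript
// Return the module path for require('x') or require.async('x') calls,
// or null for any other node.
function requiredPath(node) {
  if (node.type !== 'CallExpression') return null
  const { callee, arguments: args } = node
  if (args.length !== 1 || args[0].type !== 'Literal') return null

  const isPlainRequire = callee.type === 'Identifier' && callee.name === 'require'
  const isRequireAsync =
    callee.type === 'MemberExpression' &&
    !callee.computed &&
    callee.object.type === 'Identifier' &&
    callee.object.name === 'require' &&
    callee.property.type === 'Identifier' &&
    callee.property.name === 'async'

  return isPlainRequire || isRequireAsync ? args[0].value : null
}

const requireId = { type: 'Identifier', name: 'require' }

// require('a')
const plainCall = {
  type: 'CallExpression',
  callee: requireId,
  arguments: [{ type: 'Literal', value: 'a' }]
}
// require.async('b')
const asyncCall = {
  type: 'CallExpression',
  callee: {
    type: 'MemberExpression',
    object: requireId,
    property: { type: 'Identifier', name: 'async' },
    computed: false
  },
  arguments: [{ type: 'Literal', value: 'b' }]
}

console.log(requiredPath(plainCall), requiredPath(asyncCall))
```

The aliasing case (var r = require; r('a')) needs scope tracking and is left out of this sketch.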

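As front-end developers we benefit from what ASTs make possible all the time (module bundling, code minification, code obfuscation), so it is well worth getting to know them.

Source: http://blog.fronts.in/acorn.html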
