Skip to content

DTStack/dt-sql-parser

Repository files navigation

dt-sql-parser

NPM versionNPM downloadsChat

English |Giản thể tiếng Trung

dt-sql-parser is aSQL Parserproject built withANTLR4,and it's mainly for theBigDatafield. TheANTLR4generated the basic Parser, Visitor, and Listener, so it's easy to complete theLexer,Parser,traverse the AST,and so on features.

Additionally, it provides advanced features such asSQL Validation,Code CompletionandCollecting Table and Columns in SQL.

Supported SQL:

  • MySQL
  • Flink
  • Spark
  • Hive
  • PostgreSQL
  • Trino
  • Impala

Tip

This project is the default for Typescript target, also you can try to compile it to other languages if you need.


Integrating SQL Parser with Monaco Editor

We also have providedmonaco-sql-languagesto easily to integratedt-sql-parserwithmonaco-editor.


Installation

#use npm
npm i dt-sql-parser --save

#use yarn
yarn add dt-sql-parser

Usage

We recommend learning the fundamentals usage before continuing. The dt-sql-parser library provides SQL classes for different types of SQL.

import{MySQL,FlinkSQL,SparkSQL,HiveSQL,PostgreSQL,TrinoSQL,ImpalaSQL}from'dt-sql-parser';

Before using syntax validation, code completion, and other features, it is necessary to instantiate the Parser of the relevant SQL type. For instance, one can consider usingMySQLas an example:

constmysql=newMySQL();

The following usage examples will utilize theMySQL,and the Parser for other SQL types will be used in a similar manner asMySQL.

Syntax Validation

First instanced a Parser object, then call thevalidatemethod on the SQL instance to validate the sql content, if failed returns an array includeserrormessage.

import{MySQL}from'dt-sql-parser';

constmysql=newMySQL();
constincorrectSql='selec id,name from user1;';
consterrors=mysql.validate(incorrectSql);

console.log(errors);

output:

/*
[
{
endCol: 5,
endLine: 1,
startCol: 0,
startLine: 1,
message: "..."
}
]
*/

Tokenizer

Call thegetAllTokensmethod on the SQL instance:

import{MySQL}from'dt-sql-parser';

constmysql=newMySQL()
constsql='select id,name,sex from user1;'
consttokens=mysql.getAllTokens(sql)

console.log(tokens)

output:

/*
[
{
channel: 0
column: 0
line: 1
source: [SqlLexer, InputStream]
start: 0
stop: 5
tokenIndex: -1
type: 137
_text: null
},
...
]
*/

Visitor

Traverse the tree node by the Visitor:

import{MySQL,MySqlParserVisitor}from'dt-sql-parser';

constmysql=newMySQL();
constsql=`select id, name from user1;`;
constparseTree=mysql.parse(sql);

classMyVisitorextendsMySqlParserVisitor<string>{
defaultResult():string{
return'';
}
aggregateResult(aggregate:string,nextResult:string):string{
returnaggregate+nextResult;
}
visitProgram=(ctx)=>{
returnthis.visitChildren(ctx);
};
visitTableName=(ctx)=>{
returnctx.getText();
};
}
constvisitor=newMyVisitor();
constresult=visitor.visit(parseTree);

console.log(result);

output:

/*
user1
*/

Listener

Access the specified node in the AST by the Listener

import{MySQL,MySqlParserListener}from'dt-sql-parser';

constmysql=newMySQL();
constsql='select id, name from user1;';
constparseTree=mysql.parse(sql);

classMyListenerextendsMySqlParserListener{
result='';
enterTableName=(ctx):void=>{
this.result=ctx.getText();
};
}

constlistener=newMyListener();
mysql.listen(listener,parseTree);

console.log(listener.result)

output:

/*
user1
*/

Splitting SQL statements

TakeFlinkSQLas an example, call thesplitSQLByStatementmethod on the SQL instance:

import{FlinkSQL}from'dt-sql-parser';

constflink=newFlinkSQL();
constsql='SHOW TABLES;\nSELECT * FROM tb;';
constsqlSlices=flink.splitSQLByStatement(sql);

console.log(sqlSlices)

output:

/*
[
{
startIndex: 0,
endIndex: 11,
startLine: 1,
endLine: 1,
startColumn: 1,
endColumn: 12,
text: 'SHOW TABLES;'
},
{
startIndex: 13,
endIndex: 29,
startLine: 2,
endLine: 2,
startColumn: 1,
endColumn: 17,
text: 'SELECT * FROM tb;'
}
]
*/

Code Completion

Obtaining code completion information at a specified position in SQL.

Call thegetAllEntitiesmethod on the SQL instance, pass the SQL content and the row and column numbers indicating the position where code completion is desired. The following are some additional explanations aboutCaretPosition.

  • keyword candidates list

    import{FlinkSQL}from'dt-sql-parser';
    
    constflink=newFlinkSQL();
    constsql='CREATE ';
    constpos={lineNumber:1,column:16};// the end position
    constkeywords=flink.getSuggestionAtCaretPosition(sql,pos)?.keywords;
    
    console.log(keywords);

    output:

    /*
    [ 'CATALOG', 'FUNCTION', 'TEMPORARY', 'VIEW', 'DATABASE', 'TABLE' ]
    */
  • Obtaining information related to grammar completion

    import{FlinkSQL}from'dt-sql-parser';
    
    constflink=newFlinkSQL();
    constsql='SELECT * FROM tb';
    constpos={lineNumber:1,column:16};// after 'tb'
    constsyntaxSuggestions=flink.getSuggestionAtCaretPosition(sql,pos)?.syntax;
    
    console.log(syntaxSuggestions);

    output:

    /*
    [
    {
    syntaxContextType: 'table',
    wordRanges: [
    {
    text: 'tb',
    startIndex: 14,
    stopIndex: 15,
    line: 1,
    startColumn: 15,
    stopColumn: 16
    }
    ]
    },
    {
    syntaxContextType: 'view',
    wordRanges: [
    {
    text: 'tb',
    startIndex: 14,
    stopIndex: 15,
    line: 1,
    startColumn: 15,
    stopColumn: 16
    }
    ]
    }
    ]
    */

The grammar-related code completion information returns an array, where each item represents what grammar can be filled in at that position. For example, the output in the above example represents that the position can be filled with either atable nameora view name.In this case,syntaxContextTyperepresents the type of grammar that can be completed, andwordRangesrepresents the content that has already been filled.

Get all entities in SQL (e.g. table, column)

Call thegetAllEntitiesmethod on the SQL instance, and pass in the sql text and the row and column numbers at the specified location to easily get them.

import{FlinkSQL}from'dt-sql-parser';

constflink=newFlinkSQL();
constsql='SELECT * FROM tb;';
constpos={lineNumber:1,column:16};// tb mặt sau
constentities=flink.getAllEntities(sql,pos);

console.log(entities);

output

/*
[
{
entityContextType: 'table',
text: 'tb',
position: {
line: 1,
startIndex: 14,
endIndex: 15,
startColumn: 15,
endColumn: 17
},
belongStmt: {
stmtContextType: 'selectStmt',
position: [Object],
rootStmt: [Object],
parentStmt: [Object],
isContainCaret: true
},
relatedEntities: null,
columns: null,
isAlias: false,
origin: null,
alias: null
}
]
*/

Position is not required, if the position is passed, then in the collected entities, if the entity is located under the statement where the corresponding position is located, then the statement object to which the entity belongs will be marked withisContainCaret,which can help you quickly filter out the required entities when combined with the code completion function.

Other API

  • createLexerCreate an instance of Antlr4 Lexer and return it;
  • createParserCreate an instance of Antlr4 parser and return it;
  • parseParses the input SQL and returns the parse tree;

Position and Range

Some return results of the APIs provided bydt-sql-parsercontain text information, among which the range and start value of line number, column number and index may cause some confusion.

Index

The index starts at 0. In the programming field, it is more intuitive.

index-image

For an index range, the start index starts from 0 and ends with n-1, as shown in the figure above, an index range of blue text should be represented as follows:

{
startIndex:0,
endIndex:3
}

Line

The line starts at 1.

line-image

For a range of multiple lines, the line number starts from 1 and ends with n. A range of the first and second lines is represented as follows:

{
startLine:1,
endLine:2
}

Column

The column also starts at 1.

column-image

It is easier to understand by comparing the column number with the cursor position of the editor. For a range of multiple columns, the column number starts from 1 and ends with n+1, as shown in the figure above, a range of blue text columns is represented as follows:

{
startColumn:1,
endColumn:5
}

CaretPosition Of Code Completion

The code completion ofdt-sql-parserwas designed to be used in the editor, so the format of the second parameter (CaretPosition) of thegetSuggestionAtCaretPositionmethod is line and column number instead of character position index. This makes it easier to integrate the code completion into the editor. For the editor, it only needs to get the text content and cursor position in the editor at a specific time to call the code completion ofdt-sql-parser,without any additional calculation.

But in some other scenarios, you may need to get the caret position required by the code completion through conversion or calculation. Then, there are some precautions that you may need to care about before that.

The code completion ofdt-sql-parserdepends onantlr4-c3,which is a great library. The code completion ofdt-sql-parseris just encapsulated and converted based on antlr4-c3, including converting the line and column number information into the token index required by antlr4-c3, as shown in the figure below:

column-image

Regard the column in the figure as the cursor position, and put this text into the editor, you will get 13 possible cursor positions, while for dt-sql-parser, this text will generate 4 Tokens after being parsed. An important strategy of the code completion is:When the cursor (CaretPosition) has not completely left a Token, dt-sql-parser thinks that this Token has not been completed, and the code completion will infer what can be filled in the position of this Token.

For example, if you want to know what to fill in afterSHOWthrough the code completion, the caret position should be:

{
lineNumber:1,
column:6
}

At this time, dt-sql-parser will think thatSHOWis already a complete Token, and it should infer what can be filled in afterSHOW.If the column in the passed-in caret position is 5, then dt-sql-parser will think thatSHOWhas not been completed, and then infer what can be filled in the position ofSHOW.In other words, in the figure above,column: 5belongs totoken: 0,andcolumn: 6belongs totoken: 1.

For the editor, this strategy is also more intuitive. After the user entersSHOW,before pressing the space key, the user probably has not finished entering, maybe the user wants to enter something likeSHOWS.When the user presses the space key, the editor thinks that the user wants to enter the next Token, and it is time to ask dt-sql-parser what can be filled in the next Token position.


Contributing

Refer toCONTRIBUTING

License

MIT