且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

ANTLR 实现 python 之类的缩进依赖语法的最简单方法是什么?

更新时间:2022-10-17 21:13:45

我不知道最简单的处理方法是什么,但下面是一个比较简单的方法.每当您在词法分析器中匹配换行符时,可选择匹配一个或多个空格.如果换行后有空格,则将这些空格的长度与当前的缩进大小进行比较.如果大于当前缩进大小,则发出 Indent 标记,如果小于当前缩进大小,则发出 Dedent 标记,如果相同,则不要什么都不做.

您还需要在文件末尾发出多个 Dedent 标记,让每个 Indent 都有一个匹配的 Dedent令牌.

要使其正常工作,您必须在输入源文件中添加前导和尾随换行符!

ANTRL3

快速演示:

grammar PyEsque;选项 {输出=AST;}令牌{堵塞;}@lexer::members {private int previousIndents = -1;私人 int indentLevel = 0;java.util.Queuetoken = new java.util.LinkedList();@覆盖公共无效发射(令牌 t){state.token = t;代币.offer(t);}@覆盖公共令牌 nextToken() {super.nextToken();返回 tokens.isEmpty() ?Token.EOF_TOKEN:tokens.poll();}私有无效跳转(int ttype){indentLevel += (ttype == Dedent ? -1 : 1);发射(新CommonToken(ttype,级别="+缩进级别));}}解析: 块 EOF ->堵塞;堵塞:缩进 block_atoms Dedent ->^(BLOCK block_atoms);块原子: (Id | 块)+;新队: NL SP?{int n = $SP.text == null ?0 : $SP.text.length();如果(n>previousIndents){跳转(缩进);以前的缩进 = n;}else if(n  0) {跳转(Dedent);}}别的 {跳过();}};ID: ('a'..'z' | 'A'..'Z')+;空格字符: SP {跳过();};片段 NL : '\r'?'\n' |'\r';片段 SP : (' ' | '\t')+;片段缩进:;片段 Dedent : ;

您可以使用类测试解析器:

import org.antlr.runtime.*;导入 org.antlr.runtime.tree.*;导入 org.antlr.stringtemplate.*;公共课主要{public static void main(String[] args) 抛出异常 {PyEsqueLexer 词法分析器 = new PyEsqueLexer(new ANTLRFileStream("in.txt"));PyEsqueParser parser = new PyEsqueParser(new CommonTokenStream(lexer));CommonTree 树 = (CommonTree)parser.parse().getTree();DOTTreeGenerator gen = new DOTTreeGenerator();StringTemplate st = gen.toDOT(tree);System.out.println(st);}}

如果您现在将以下内容放入名为 in.txt 的文件中:

AAA AAAABBB BB BBB BBBBB BB中交建中交BB BBBBBBCCC滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴滴

(注意前后换行!)

然后您将看到与以下 AST 对应的输出:

请注意,我的演示不会连续产生足够的 dedents,例如从 cccaaa 的 dedenting(需要 2 个 dedent 标记):

aaabbb抄送啊啊啊啊

您需要调整 else if(n < previousIndents) { ... } 中的代码,以根据 n 之间的差异发出超过 1 个 dedent 标记code> 和 previousIndents.在我的脑海里,这可能是这样的:

 else if(n 

ANTLR4

对于 ANTLR4,请执行以下操作:

语法 Python3;令牌 { INDENT, DEDENT }@lexer::members {//推送额外令牌的队列(参见 NEWLINE 词法分析器规则).私有 java.util.LinkedList<Token>token = new java.util.LinkedList();//跟踪缩进级别的堆栈.私有 java.util.Stackindents = new java.util.Stack();//打开的大括号、方括号和圆括号的数量.私有整数打开 = 0;//最近生成的令牌.私人令牌 lastToken = null;@覆盖公共无效发射(令牌 t){super.setToken(t);代币.offer(t);}@覆盖公共令牌 nextToken() {//检查文件尾是否在前面并且仍然有一些预期的 DEDENTS.if (_input.LA(1) == EOF && !this.indents.isEmpty()) {//从我们的缓冲区中删除任何尾随的 EOF 标记.for (int i = tokens.size() - 1; i >= 0; i--) {if (tokens.get(i).getType() == EOF) {token.remove(i);}}//首先发出一个额外的换行符作为语句的结尾.this.emit(commonToken(Python3Parser.NEWLINE, "\n"));//现在根据需要发出尽可能多的 DEDENT 令牌.while (!indents.isEmpty()) {this.emit(createDedent());缩进.pop();}//将 EOF 放回令牌流中.this.emit(commonToken(Python3Parser.EOF, ""));}令牌下一个 = super.nextToken();if (next.getChannel() == Token.DEFAULT_CHANNEL) {//跟踪默认通道上的最后一个标记.this.lastToken = 下一个;}返回 tokens.isEmpty() ?下一个:tokens.poll();}私人令牌 createDedent() {CommonToken dedent = commonToken(Python3Parser.DEDENT, "");dedent.setLine(this.lastToken.getLine());归还牙医;}private CommonToken commonToken(int type, String text) {int stop = this.getCharIndex() - 1;int start = text.isEmpty() ?停止:停止 - text.length() + 1;return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);}//计算提供的空格的缩进,取//考虑以下规则:////制表符被一到八个空格替换(从左到右)//使得字符总数达到并包括//替换是八的倍数 [...]"////-- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation静态 int getIndentationCount(字符串空格){整数计数 = 0;for (char ch:spaces.toCharArray()) {开关(通道){案例'\t':计数 += 8 - (计数 % 8);休息;默认://一个普通的空格字符.计数++;}}返回计数;}布尔值 atStartOfInput() {返回 super.getCharPositionInLine() == 0 &&super.getLine() == 1;}}单输入: 新队|simple_stmt|Compound_stmt NEWLINE;//更多解析规则新队:( {atStartOfInput()}?空格|( '\r'? '\n' | '\r' ) 空格?){String newLine = getText().replaceAll("[^\r\n]+", "");字符串空格 = getText().replaceAll("[\r\n]+", "");int next = _input.LA(1);if (opened > 0 || next == '\r' || next == '\n' || next == '#') {//如果我们在一个列表中或一个空行,忽略所有缩进,//凹痕和换行符.跳过();}别的 {发出(commonToken(NEWLINE,newLine));int indent = getIndentationCount(spaces);int previous = indents.isEmpty() ?0 : 缩进.peek();如果(缩进 == 前一个){//跳过与当前缩进大小相同大小的缩进跳过();}否则如果(缩进>前一个){缩进.推(缩进);发出(commonToken(Python3Parser.INDENT,空格));}别的 {//可能发出超过 1 个 DEDENT 令牌.while(!indents.isEmpty() && indents.peek() > indent) {this.emit(createDedent());缩进.pop();}}}};//更多词法规则

取自:https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4

I am trying realize python like indent-depending grammar.

Source example:

ABC QWE
  CDE EFG
  EFG CDE
    ABC 
  QWE ZXC

As i see, what i need is to realize two tokens INDENT and DEDENT, so i could write something like:

grammar mygrammar;
text: (ID | block)+;
block: INDENT (ID|block)+ DEDENT;
INDENT: ????;
DEDENT: ????;

Is there any simple way to realize this using ANTLR?

(I'd prefer, if it's possible, to use standard ANTLR lexer.)

I don't know what the easiest way to handle it is, but the following is a relatively easy way. Whenever you match a line break in your lexer, optionally match one or more spaces. If there are spaces after the line break, compare the length of these spaces with the current indent-size. If it's more than the current indent size, emit an Indent token, if it's less than the current indent-size, emit a Dedent token and if it's the same, don't do anything.

You'll also want to emit a number of Dedent tokens at the end of the file to let every Indent have a matching Dedent token.

For this to work properly, you must add a leading and trailing line break to your input source file!

ANTRL3

A quick demo:

grammar PyEsque;

options {
  output=AST;
}

tokens {
  BLOCK;
}

@lexer::members {

  private int previousIndents = -1;
  private int indentLevel = 0;
  java.util.Queue<Token> tokens = new java.util.LinkedList<Token>();

  @Override
  public void emit(Token t) {
    state.token = t;
    tokens.offer(t);
  }

  @Override
  public Token nextToken() {
    super.nextToken();
    return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll();
  }

  private void jump(int ttype) {
    indentLevel += (ttype == Dedent ? -1 : 1);
    emit(new CommonToken(ttype, "level=" + indentLevel));
  }
}

parse
 : block EOF -> block
 ;

block
 : Indent block_atoms Dedent -> ^(BLOCK block_atoms)
 ;

block_atoms
 :  (Id | block)+
 ;

NewLine
 : NL SP?
   {
     int n = $SP.text == null ? 0 : $SP.text.length();
     if(n > previousIndents) {
       jump(Indent);
       previousIndents = n;
     }
     else if(n < previousIndents) {
       jump(Dedent);
       previousIndents = n;
     }
     else if(input.LA(1) == EOF) {
       while(indentLevel > 0) {
         jump(Dedent);
       }
     }
     else {
       skip();
     }
   }
 ;

Id
 : ('a'..'z' | 'A'..'Z')+
 ;

SpaceChars
 : SP {skip();}
 ;

fragment NL     : '\r'? '\n' | '\r';
fragment SP     : (' ' | '\t')+;
fragment Indent : ;
fragment Dedent : ;

You can test the parser with the class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PyEsqueLexer lexer = new PyEsqueLexer(new ANTLRFileStream("in.txt"));
    PyEsqueParser parser = new PyEsqueParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}    

If you now put the following in a file called in.txt:


AAA AAAAA
  BBB BB B
  BB BBBBB BB
    CCCCCC C CC
  BB BBBBBB
    C CCC
      DDD DD D
      DDD D DDD

(Note the leading and trailing line breaks!)

then you'll see output that corresponds to the following AST:

Note that my demo wouldn't produce enough dedents in succession, like dedenting from ccc to aaa (2 dedent tokens are needed):

aaa
  bbb
    ccc
aaa

You would need to adjust the code inside else if(n < previousIndents) { ... } to possibly emit more than 1 dedent token based on the difference between n and previousIndents. Off the top of my head, that could look like this:

 else if(n < previousIndents) {
   // Note: assuming indent-size is 2. Jumping from previousIndents=6 
   // to n=2 will result in emitting 2 `Dedent` tokens
   int numDedents = (previousIndents - n) / 2; 
   while(numDedents-- > 0) {
     jump(Dedent);
   }
   previousIndents = n;
 }

ANTLR4

For ANTLR4, do something like this:

grammar Python3;

tokens { INDENT, DEDENT }

@lexer::members {
  // A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
  private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
  // The stack that keeps track of the indentation level.
  private java.util.Stack<Integer> indents = new java.util.Stack<>();
  // The amount of opened braces, brackets and parenthesis.
  private int opened = 0;
  // The most recently produced token.
  private Token lastToken = null;
  @Override
  public void emit(Token t) {
    super.setToken(t);
    tokens.offer(t);
  }

  @Override
  public Token nextToken() {
    // Check if the end-of-file is ahead and there are still some DEDENTS expected.
    if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
      // Remove any trailing EOF tokens from our buffer.
      for (int i = tokens.size() - 1; i >= 0; i--) {
        if (tokens.get(i).getType() == EOF) {
          tokens.remove(i);
        }
      }

      // First emit an extra line break that serves as the end of the statement.
      this.emit(commonToken(Python3Parser.NEWLINE, "\n"));

      // Now emit as much DEDENT tokens as needed.
      while (!indents.isEmpty()) {
        this.emit(createDedent());
        indents.pop();
      }

      // Put the EOF back on the token stream.
      this.emit(commonToken(Python3Parser.EOF, "<EOF>"));
    }

    Token next = super.nextToken();

    if (next.getChannel() == Token.DEFAULT_CHANNEL) {
      // Keep track of the last token on the default channel.
      this.lastToken = next;
    }

    return tokens.isEmpty() ? next : tokens.poll();
  }

  private Token createDedent() {
    CommonToken dedent = commonToken(Python3Parser.DEDENT, "");
    dedent.setLine(this.lastToken.getLine());
    return dedent;
  }

  private CommonToken commonToken(int type, String text) {
    int stop = this.getCharIndex() - 1;
    int start = text.isEmpty() ? stop : stop - text.length() + 1;
    return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);
  }

  // Calculates the indentation of the provided spaces, taking the
  // following rules into account:
  //
  // "Tabs are replaced (from left to right) by one to eight spaces
  //  such that the total number of characters up to and including
  //  the replacement is a multiple of eight [...]"
  //
  //  -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation
  static int getIndentationCount(String spaces) {
    int count = 0;
    for (char ch : spaces.toCharArray()) {
      switch (ch) {
        case '\t':
          count += 8 - (count % 8);
          break;
        default:
          // A normal space char.
          count++;
      }
    }

    return count;
  }

  boolean atStartOfInput() {
    return super.getCharPositionInLine() == 0 && super.getLine() == 1;
  }
}

single_input
 : NEWLINE
 | simple_stmt
 | compound_stmt NEWLINE
 ;

// more parser rules

NEWLINE
 : ( {atStartOfInput()}?   SPACES
   | ( '\r'? '\n' | '\r' ) SPACES?
   )
   {
     String newLine = getText().replaceAll("[^\r\n]+", "");
     String spaces = getText().replaceAll("[\r\n]+", "");
     int next = _input.LA(1);
     if (opened > 0 || next == '\r' || next == '\n' || next == '#') {
       // If we're inside a list or on a blank line, ignore all indents, 
       // dedents and line breaks.
       skip();
     }
     else {
       emit(commonToken(NEWLINE, newLine));
       int indent = getIndentationCount(spaces);
       int previous = indents.isEmpty() ? 0 : indents.peek();
       if (indent == previous) {
         // skip indents of the same size as the present indent-size
         skip();
       }
       else if (indent > previous) {
         indents.push(indent);
         emit(commonToken(Python3Parser.INDENT, spaces));
       }
       else {
         // Possibly emit more than 1 DEDENT token.
         while(!indents.isEmpty() && indents.peek() > indent) {
           this.emit(createDedent());
           indents.pop();
         }
       }
     }
   }
 ;

// more lexer rules

Taken from: https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4