ParseLib

This is the C++ core of the ParseLib framework.

Interfaces

The ParseLib interfaces expose several features, among them:

Doxygen doc

You can use Doxygen’s doxywizard or its CLI to generate parselib’s documentation.

Sessions

All the functions described in the following sections, and more, are wrapped in a utility class (pl::ParseSession).

Reading a grammar and parsing source code then becomes trivial:

#include <parselib/parselibinstance.hpp>
#include <boost/property_tree/ptree.hpp>

namespace pl = parselib;

int main(int argc, char** argv){
  // define a parselib session
  pl::ParseSession parsesession ;
  bool verbose = true ;

  // load a grammar from a raw text file
  parsesession.loadGrammar("data/grammar.grm", verbose) ;
  
  // parse some source code if parsable
  // return type is boost::property_tree::ptree 
  auto out = parsesession.process_source_to_ptree("data/test.java", verbose) ; 

  return 0 ;
}

Recursive File Glober

This object recursively globs files and filters them by file extension.

#include <parselib/utils/io.hpp>

// set up the file glober
parselib::utils::FileGlober fileglober ("foo/bar", "java") ;
// recursively globs all java files in foo/bar (relative path accepted)
auto files = fileglober.glob() ; 
// type is parselib::utils::FileGlober::FilesList or std::vector<std::string>

This is mainly useful for setting up an engine that recursively explores files to parse.
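As an illustration of what recursive globbing with an extension filter involves, here is a minimal sketch built only on std::filesystem (C++17). It is not ParseLib's implementation: the names has_extension and glob_recursive are hypothetical, and the actual FileGlober may behave differently (for example regarding symlinks or result ordering).

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper, not part of ParseLib: true when `file` ends with
// ".<extension>". The extension is given without the leading dot.
bool has_extension(const fs::path& file, const std::string& extension) {
  assert(!extension.empty());
  return file.extension().string() == "." + extension;
}

// Hypothetical helper, not part of ParseLib: collects every regular file
// below `root` (relative or absolute) that carries the wanted extension.
std::vector<std::string> glob_recursive(const fs::path& root,
                                        const std::string& extension) {
  std::vector<std::string> files;
  if (!fs::is_directory(root)) {
    return files;  // nothing to explore
  }
  for (const auto& entry : fs::recursive_directory_iterator(root)) {
    if (entry.is_regular_file() && has_extension(entry.path(), extension)) {
      files.push_back(entry.path().string());
    }
  }
  // Directory iteration order is unspecified, so sort for stable results.
  std::sort(files.begin(), files.end());
  return files;
}
```

Calling glob_recursive("foo/bar", "java") would return the paths of all .java files below foo/bar, and an empty list if the directory does not exist.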

Under the hood (Pipeline)

Graph encoder for generic textual CFG

This component encodes a context-free grammar (CFG) written as text into a graph data structure that the appropriate components, mainly the implemented parsers, can handle.

//import important stuff
#include <parselib/parselibinstance.hpp>

#include <iostream>
#include <string>

using namespace parselib ;

Grammar load(std::string filename) {

  bool verbose = true ;

  //define preprocessor to use
  utils::OnePassPreprocessor *preproc = new utils::OnePassPreprocessor() ;
  parsers::GenericGrammarParser ggp (preproc) ; //define the parser

  auto grammar = ggp.parse (filename, verbose) ; //..and parse
  delete preproc ; //the preprocessor is not needed anymore

  std::cout << grammar ; //it is printable
  
  return grammar;
}

Results on display :

RULE AXIOM = [
	S(NONTERMINAL)
]
RULE S = [
	a(TERMINAL) + S(NONTERMINAL) + b(TERMINAL)
	''(EMPTY)
]

TOKEN a = regex('a')
TOKEN b = regex('b')
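To make the printed layout above more concrete, here is a minimal sketch of one way such a grammar could be held in memory and rendered. Every name in it (SymbolKind, Symbol, SimpleGrammar, render_rule, make_anbn_grammar) is hypothetical; ParseLib's own Grammar class is not shown in this document and most likely differs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types that only mirror the printed layout shown above.
enum class SymbolKind { Terminal, NonTerminal, Empty };

struct Symbol {
  std::string name;
  SymbolKind kind;
};

using Production = std::vector<Symbol>;  // one alternative of a rule

struct SimpleGrammar {
  std::map<std::string, std::vector<Production>> rules;  // rule name -> alternatives
  std::map<std::string, std::string> tokens;             // token name -> regex
};

std::string kind_label(SymbolKind kind) {
  switch (kind) {
    case SymbolKind::Terminal: return "TERMINAL";
    case SymbolKind::NonTerminal: return "NONTERMINAL";
    case SymbolKind::Empty: return "EMPTY";
  }
  return "UNKNOWN";
}

// Renders one rule in the "RULE name = [ ... ]" layout.
std::string render_rule(const SimpleGrammar& g, const std::string& name) {
  std::ostringstream out;
  out << "RULE " << name << " = [\n";
  for (const Production& production : g.rules.at(name)) {
    assert(!production.empty());  // an empty alternative uses an Empty symbol
    out << "\t";
    for (std::size_t i = 0; i < production.size(); ++i) {
      if (i > 0) out << " + ";
      out << production[i].name << "(" << kind_label(production[i].kind) << ")";
    }
    out << "\n";
  }
  out << "]\n";
  return out.str();
}

// Builds the example grammar: AXIOM -> S, S -> a S b | ''.
SimpleGrammar make_anbn_grammar() {
  const Symbol a{"a", SymbolKind::Terminal};
  const Symbol b{"b", SymbolKind::Terminal};
  const Symbol s{"S", SymbolKind::NonTerminal};
  const Symbol empty{"''", SymbolKind::Empty};

  SimpleGrammar g;
  g.rules["AXIOM"] = {Production{s}};
  g.rules["S"] = {Production{a, s, b}, Production{empty}};
  g.tokens = {{"a", "a"}, {"b", "b"}};
  return g;
}
```

Each nonterminal appearing in a production names another entry of the rules map, which is what gives the structure its graph shape.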

Operators for grammar transformation

to 2NF^[1]

TERM : creates, for each terminal appearing in a production rule, a dedicated production rule that points to that terminal
BIN : binarizes all rules

Note : the START operator is enforced by the grammar language through the AXIOM keyword

parselib::normoperators::get2nf(grammar) ;

Result on display :

RULE AXIOM = [
	S(NONTERMINAL)
]
RULE S = [
	a.(NONTERMINAL) + S-b(NONTERMINAL)
]
RULE S-b = [
	S(NONTERMINAL) + b.(NONTERMINAL)
	b(TERMINAL)
]
RULE a. = [
	a(TERMINAL)
]
RULE b. = [
	b(TERMINAL)
]

TOKEN a = regex('a')
TOKEN b = regex('b')
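The TERM and BIN operators can be sketched on a simplified grammar representation as follows. This is an illustration under stated assumptions, not the code behind get2nf: the types and the functions apply_term and apply_bin are hypothetical, the naming of helper rules is a guess and differs slightly from the output above, and the sketch leaves empty productions untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified representation: a production is a list of symbol names.
using Production = std::vector<std::string>;
using Rules = std::map<std::string, std::vector<Production>>;

// TERM: in every production longer than one symbol, replace each terminal `t`
// by a proxy nonterminal `t.` and add the rule `t. -> t`.
Rules apply_term(const Rules& rules, const std::set<std::string>& terminals) {
  Rules result;
  for (const auto& [lhs, productions] : rules) {
    for (Production production : productions) {  // copied, then rewritten
      if (production.size() > 1) {
        for (std::string& symbol : production) {
          if (terminals.count(symbol) > 0) {
            const std::string proxy = symbol + ".";
            result[proxy] = {Production{symbol}};
            symbol = proxy;
          }
        }
      }
      result[lhs].push_back(production);
    }
  }
  return result;
}

// BIN: split every production longer than two symbols into a chain of binary
// productions: A -> X1 X2 ... Xn becomes A -> X1 R and R -> X2 ... Xn.
Rules apply_bin(const Rules& rules) {
  Rules result;
  std::set<std::string> created;  // helper rules already emitted
  for (const auto& [lhs, productions] : rules) {
    for (const Production& production : productions) {
      assert(!production.empty());
      std::string head = lhs;
      Production rest = production;
      bool suffix_known = false;
      while (rest.size() > 2 && !suffix_known) {
        // The helper is named after the symbols it derives (an assumption).
        std::string helper = rest[1];
        for (std::size_t i = 2; i < rest.size(); ++i) helper += "-" + rest[i];
        result[head].push_back(Production{rest[0], helper});
        rest.erase(rest.begin());
        suffix_known = !created.insert(helper).second;
        head = helper;
      }
      if (!suffix_known) result[head].push_back(rest);
    }
  }
  return result;
}
```

Applying apply_term and then apply_bin to S -> a S b yields S -> a. S-b. together with S-b. -> S b., a. -> a and b. -> b.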

CYK parsers for grammars in 2NF^[1]

//import the good stuff
#include <parselib/parselibinstance.hpp> 

using namespace parselib;

Frame parse_file_into_frame (Grammar grammar, std::string filename) {

  parsers::CYK parser (grammar) ; //instantiate parser
  std::string source = utils::gettextfilecontent(filename) ; //load source from text file

  //tokenize source code
  lexer::Lexer tokenizer (grammar.tokens) ;
  tokenizer.tokenize (source) ;

  //result is in a Frame which is a polite term to say std::vector<parselib::parsetree::Node*>
  //containing all accepted solutions/parse trees
  //we generally use the first one
  Frame result = parser.membership (tokenizer.tokens) ;
  return result;
}
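The tokenization step used above can be pictured with the following sketch of a regex-based tokenizer. It is an assumption about how TOKEN definitions might be applied, not the code of lexer::Lexer: the Token struct and the tokenize function are hypothetical, whitespace is simply skipped, and the longest match wins.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token record: which TOKEN definition matched, and on what text.
struct Token {
  std::string type;
  std::string value;
};

// Hypothetical tokenizer: `definitions` pairs a token name with its regex.
std::vector<Token> tokenize(
    const std::string& source,
    const std::vector<std::pair<std::string, std::string>>& definitions) {
  assert(!definitions.empty());

  // Compile each pattern once.
  std::vector<std::pair<std::string, std::regex>> compiled;
  for (const auto& [name, pattern] : definitions) {
    compiled.emplace_back(name, std::regex(pattern));
  }

  std::vector<Token> tokens;
  auto cursor = source.cbegin();
  while (cursor != source.cend()) {
    if (std::isspace(static_cast<unsigned char>(*cursor))) {
      ++cursor;  // whitespace separates tokens and is dropped
      continue;
    }
    Token best;
    for (const auto& [name, regex] : compiled) {
      std::smatch match;
      // match_continuous only accepts matches that start exactly at `cursor`.
      if (std::regex_search(cursor, source.cend(), match, regex,
                            std::regex_constants::match_continuous) &&
          match.length(0) > static_cast<std::ptrdiff_t>(best.value.size())) {
        best = Token{name, match.str(0)};
      }
    }
    if (best.value.empty()) {
      throw std::runtime_error("unexpected character: " + std::string(1, *cursor));
    }
    cursor += static_cast<std::ptrdiff_t>(best.value.size());
    tokens.push_back(best);
  }
  return tokens;
}
```

With the definitions a = 'a' and b = 'b', the source "a a b b" produces four tokens whose types are a, a, b, b, which is the kind of sequence a CYK parser consumes.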

The parser supports basic error handling. If membership fails, the returned frame contains the token most suspected of breaking the parse; otherwise it contains a list of accepted parse trees.

This is all handled transparently in the ParseSession object.
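For readers unfamiliar with CYK, here is a minimal membership-test sketch. It is a simplified recognizer, not ParseLib's parsers::CYK and not the exact algorithm of [1]: it handles binary rules and unit rules by closing each table cell under the unit rules, ignores empty rules, and builds no parse trees. The types TwoNfGrammar, BinaryRule and UnitRule and the functions are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct BinaryRule { std::string lhs, left, right; };  // lhs -> left right
struct UnitRule { std::string lhs, rhs; };            // lhs -> rhs (nonterminal or token type)

struct TwoNfGrammar {
  std::string axiom;
  std::vector<BinaryRule> binary;
  std::vector<UnitRule> unit;
};

using Cell = std::set<std::string>;

// Adds every symbol reachable through unit rules until nothing changes.
void close_under_unit_rules(const TwoNfGrammar& g, Cell& cell) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (const UnitRule& rule : g.unit) {
      if (cell.count(rule.rhs) > 0 && cell.insert(rule.lhs).second) {
        changed = true;
      }
    }
  }
}

// Returns true when the token types can be derived from the axiom.
bool cyk_accepts(const TwoNfGrammar& g, const std::vector<std::string>& tokens) {
  assert(!g.axiom.empty());
  const std::size_t n = tokens.size();
  if (n == 0) return false;  // empty rules are outside this sketch

  // table[i][len - 1] holds the symbols deriving tokens[i .. i + len - 1].
  std::vector<std::vector<Cell>> table(n, std::vector<Cell>(n));
  for (std::size_t i = 0; i < n; ++i) {
    table[i][0].insert(tokens[i]);
    close_under_unit_rules(g, table[i][0]);
  }
  for (std::size_t len = 2; len <= n; ++len) {
    for (std::size_t i = 0; i + len <= n; ++i) {
      Cell& cell = table[i][len - 1];
      for (std::size_t split = 1; split < len; ++split) {
        const Cell& left = table[i][split - 1];
        const Cell& right = table[i + split][len - split - 1];
        for (const BinaryRule& rule : g.binary) {
          if (left.count(rule.left) > 0 && right.count(rule.right) > 0) {
            cell.insert(rule.lhs);
          }
        }
      }
      close_under_unit_rules(g, cell);
    }
  }
  return table[0][n - 1].count(g.axiom) > 0;
}

// The 2NF grammar printed earlier, transcribed by hand.
TwoNfGrammar make_anbn_2nf() {
  TwoNfGrammar g;
  g.axiom = "AXIOM";
  g.binary = {{"S", "a.", "S-b"}, {"S-b", "S", "b."}};
  g.unit = {{"AXIOM", "S"}, {"S-b", "b"}, {"a.", "a"}, {"b.", "b"}};
  return g;
}
```

With this grammar, cyk_accepts returns true for the token types a a b b and false for a a b. A full parser would additionally record, in each cell, how a symbol was derived so that parse trees can be rebuilt.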

Full workflow :

The full workflow for the parselib parser generator can look like this:

void full_work_flow (std::string t_grammar_filename, std::string t_source_filename) {

  using namespace parselib ;

  bool verbose = true ;
  
  //define preprocessor to use
  utils::OnePassPreprocessor *preproc = new utils::OnePassPreprocessor() ;
  parsers::GenericGrammarParser ggp (preproc) ; //define the grammar parser

  auto grammar = ggp.parse (t_grammar_filename, verbose) ; //..and parse
  delete preproc; // delete preprocessor, not needed anymore

  parsers::CYK parser (grammar) ; //instantiate CYK parser
  std::string source = utils::gettextfilecontent(t_source_filename) ; //load source from text file

  //tokenize source code
  lexer::Lexer tokenizer (grammar.tokens) ;
  tokenizer.tokenize (source) ;

  //result is in a Frame which is a polite term to say std::vector<parselib::parsetree::Node*>
  //containing all accepted solutions/parse trees
  //we generally use the first one
  Frame result = parser.membership (tokenizer.tokens) ;

}

A complete example with error handling and parsing to JSON can be found in process_source_to_ptree, in the parselibinstance.h/cpp file.

References :

[1] Lange, Martin; Leiß, Hans (2009). “To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm”.