Common syntax definitions

Proposal and brainstorming

This page deals with a problem many developers struggle with: every code editor and related tool has its own implementation and configuration of syntax highlighting. As a result, a separate stack of programming language definitions has to be maintained for each tool, which means a lot of repeated work. A common syntax interface (a set of converter scripts) could help.

Problem:

Many tools work with source code (editors, IDEs, version control systems, …), and most of them support syntax highlighting, a feature that became popular with the Borland Turbo Pascal IDE in the early 90s. Syntax highlighting means visually distinguishing syntax elements such as keywords, numbers, comments and other symbols. Highlighted code is easier and faster to read, and in an IDE it can expose typos before compilation starts. Because of this gain in usability, all major programming editors and IDEs support highlighting.

Unfortunately, every tool uses its own format to describe the syntax of the supported programming languages (syntax definitions). Since the implementation of syntax highlighting is different for almost every tool, the syntax definitions also differ: some are simple configuration files, some are scripts written in languages you cannot pronounce correctly. If you only use widespread programming languages like C, Java and Python, there is no problem, since supporting these languages is a basic requirement for all code editing tools. But what about the other 60%, the hundreds of languages with smaller user bases? SNOBOL, Rexx or Boo programmers usually have to take care of their favorite language themselves, and unless you can stay with the same development toolkit for 20 years, it is tiring to add support to one editor after another. Syntax definitions cannot be exchanged between tools; they have to be written from scratch each time. Some tools offer converters, but these utilities only convert one specific format into another; there is no general approach.

Possible solution:

Most modern programming languages have a similar structure. The syntax consists of several sets of keywords, comments, numbers, directives, symbols and so on. That means the syntax definitions are also similar, only the data formats are different, and they may contain additional parameters to support unique features of the tool.

The idea is to define a common description of programming languages without giving up optional configuration parameters in which applications can store additional data for their own needs. The default configuration format could be XML, but there should be scripts that convert it into other text configuration formats, like JSON or simple param=value style files. It should be possible to generate every format from any other file containing a valid syntax description. The converters could either strip tool-specific configuration data or copy it, since it may be of use to other tools. To avoid compatibility issues, syntax definitions and tool-specific elements should carry a version parameter.

Example (Pascal):

This is just an idea, not a format proposal:

<?xml version="1.0" encoding="utf-8" ?>
<syntax-desc>
  <version>1.0</version>
  <desc>Pascal</desc>
  <keywords set="set1">
    absolute abstract and array as asm assembler automated begin case
    cdecl class const constructor destructor dispid dispinterface div do downto
  </keywords>
  <keywords set="set2">
    boolean char integer pointer real text
    true false cardinal longint byte word single double int64
  </keywords>
  <keywords set="set3">
    if else then downto do for repeat while to until with
  </keywords>
  <keywords set="set4">
    <reg-exp>(\w+?)\s*\(</reg-exp>
  </keywords>
  <string>&quot; '</string>
  <comment type="single-line" open="//" />
  <comment type="multi-line" open="{" close="}"/>
  <comment type="multi-line" open="(*" close="*)"/>
  <case-sensitivity value="0" />
  <symbols>
    ( ) [ ] , ; : &amp; | &lt; &gt; !  = / * %  + - @ . ^
  </symbols>
  <escape-sequence>
    <reg-exp>\#\$\p{XDigit}{2}|\#\d{1,3}</reg-exp>
  </escape-sequence>
  
  <proprietary>
    <name>highlight</name>
    <version>2.6.2</version>
    <!-- tool specific parameters -->
    <allow-external-escape-seq value="1" />
  </proprietary>
</syntax-desc>
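
The converter idea can be sketched in a few lines. The following Python sketch is only an illustration, assuming the element names of the example above (which are themselves not a format proposal); the function names `parse_syntax`, `to_json` and `to_param_value` and the key layout of the param=value output are made up for this page. It reads a trimmed copy of the Pascal description into a neutral dictionary, optionally drops or keeps the tool-specific block, and writes the other two text formats.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the Pascal example; the element names are the page's
# illustrative ones, not an agreed format.
SYNTAX_XML = """<?xml version="1.0" encoding="utf-8" ?>
<syntax-desc>
  <version>1.0</version>
  <desc>Pascal</desc>
  <keywords set="set2">boolean char integer</keywords>
  <keywords set="set3">if else then</keywords>
  <comment type="single-line" open="//" />
  <comment type="multi-line" open="{" close="}"/>
  <case-sensitivity value="0" />
  <proprietary>
    <name>highlight</name>
    <version>2.6.2</version>
    <allow-external-escape-seq value="1" />
  </proprietary>
</syntax-desc>"""


def parse_syntax(xml_text, keep_proprietary=False):
    """Read the XML description into a plain dict (the neutral form)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    desc = {
        "version": root.findtext("version"),
        "desc": root.findtext("desc"),
        "case_sensitive": root.find("case-sensitivity").get("value") != "0",
        "keywords": {kw.get("set"): (kw.text or "").split()
                     for kw in root.findall("keywords")},
        "comments": [dict(c.attrib) for c in root.findall("comment")],
    }
    if keep_proprietary:
        # Copy tool-specific data instead of stripping it.
        desc["proprietary"] = {
            child.tag: child.get("value", (child.text or "").strip())
            for child in root.find("proprietary")
        }
    return desc


def to_json(desc):
    """Serialize the neutral form as JSON."""
    return json.dumps(desc, indent=2)


def to_param_value(desc):
    """Serialize the neutral form as simple param=value lines."""
    lines = [
        f"version={desc['version']}",
        f"desc={desc['desc']}",
        f"case_sensitive={int(desc['case_sensitive'])}",
    ]
    for name, words in desc["keywords"].items():
        lines.append(f"keywords.{name}={' '.join(words)}")
    for index, comment in enumerate(desc["comments"]):
        for key, value in comment.items():
            lines.append(f"comment.{index}.{key}={value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_json(parse_syntax(SYNTAX_XML)))
    print(to_param_value(parse_syntax(SYNTAX_XML)))
```

Going through one neutral in-memory form means each text format needs only one reader and one writer, instead of one converter per pair of formats.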

Advantages:

  • Syntax descriptions can be exchanged between all tools that support the common syntax description. It does not matter whether XML, INI or some other format is used: support means that a converter exists which can transform the application's configuration format into one of the official text formats.
  • If there is a central repository with syntax descriptions ready to download, users can add new files to their tools with much less effort. The repository could be used to update and add descriptions online.

Issues:

  • Maintenance of tool specific configuration data (which may get lost during conversion into another text format). Can be solved by import scripts.
  • Syntax of regular expressions (compatibility with several regex engines?)
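
To illustrate the regex compatibility issue: a pattern modelled on the escape-sequence expression of the Pascal example uses `\p{XDigit}`, which not every engine accepts (the `re` module of the Python versions I know rejects it). One possible converter step, sketched here under the assumption that a small rewrite table is good enough, replaces such constructs with plain character classes. The table `PORTABLE_REWRITES` and the function names are hypothetical.

```python
import re

# Hypothetical rewrite table: engine-specific construct -> portable equivalent.
PORTABLE_REWRITES = {
    r"\p{XDigit}": "[0-9A-Fa-f]",
    r"\p{Digit}": "[0-9]",
    r"\p{Alpha}": "[A-Za-z]",
}


def to_portable(pattern):
    """Replace known engine-specific constructs by plain character classes."""
    for construct, replacement in PORTABLE_REWRITES.items():
        pattern = pattern.replace(construct, replacement)
    return pattern


def compiles(pattern):
    """Report whether this engine (Python's re) accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Pattern modelled on the Pascal escape sequences (#$0D, #13).
original = r"\#\$\p{XDigit}{2}|\#\d{1,3}"
portable = to_portable(original)

if __name__ == "__main__":
    print("original accepted:", compiles(original))
    print("portable accepted:", compiles(portable))
    print(portable)
```

A plain string replacement like this is only safe for constructs that cannot be confused with other syntax; a real converter would probably need a small regex parser.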

Syntax color descriptions:

Apart from the syntax descriptions, formatting information such as colors and font styles (bold, italic, underline) could be defined in a common format, too. If a tool can make use of multiple color sets, it would be nice to share them with others. The formatting attributes should not be defined in the syntax descriptions.
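
A minimal sketch of this separation, assuming the highlighter only emits abstract token classes such as "keyword" or "comment": the color sets live in their own table and can be swapped without touching any syntax description. The theme names, class names and colors below are invented for illustration.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Style:
    color: str
    bold: bool = False
    italic: bool = False
    underline: bool = False

    def css(self):
        """Express the style as an inline CSS declaration list."""
        parts = [f"color:{self.color}"]
        if self.bold:
            parts.append("font-weight:bold")
        if self.italic:
            parts.append("font-style:italic")
        if self.underline:
            parts.append("text-decoration:underline")
        return ";".join(parts)


# Two interchangeable color sets; neither knows anything about Pascal.
THEMES = {
    "dark": {
        "default": Style("#dddddd"),
        "keyword": Style("#ffcc66", bold=True),
        "comment": Style("#888888", italic=True),
    },
    "print": {
        "default": Style("#000000"),
        "keyword": Style("#000000", bold=True),
        "comment": Style("#555555", italic=True),
    },
}


def render_html(tokens, theme):
    """Render (token_class, text) pairs; unknown classes use the default style."""
    out = []
    for token_class, text in tokens:
        style = theme.get(token_class, theme["default"])
        out.append(f'<span style="{style.css()}">{escape(text)}</span>')
    return "".join(out)


if __name__ == "__main__":
    tokens = [("keyword", "begin"), ("comment", " { demo } "), ("keyword", "end")]
    print(render_html(tokens, THEMES["dark"]))
    print(render_html(tokens, THEMES["print"]))
```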

Discussion/Brainstorming:

Do you think this is complete rubbish? Any show stopper not mentioned here? Do you agree or have further ideas?


I don't think it will be that simple, but I do think this topic is important.

I see the following problems:

1. Hybrid documents. Many modern file formats actually consist of several different kinds of subdocuments, like HTML (+JS, +CSS, maybe +SVG, +MathML), PHP (which includes all of the aforementioned plus SQL) and XML (custom dialects often consist of various subdocuments in all sorts of standard XML languages plus JS and/or XPath scripting).

Some kind of autodetection might be beneficial as well. That shouldn't be too hard, as the choice and placement of embedded subdocuments is usually limited, and the subdocuments are (usually) easily recognizable. Autodetection can't possibly be 100% correct, so it should fail gracefully.
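
As a rough sketch of such detection, assuming HTML as the host language and only two embedding rules (`<script>` and `<style>`): the function below splits a document into regions per language, and anything it cannot match cleanly simply stays with the host language, which is one way to fail gracefully. The rule table and function name are hypothetical.

```python
import re

# Hypothetical embedding rules: sublanguage name -> pattern whose group 1
# is the embedded subdocument.
EMBED_RULES = [
    ("javascript", re.compile(r"<script\b[^>]*>(.*?)</script>", re.S | re.I)),
    ("css", re.compile(r"<style\b[^>]*>(.*?)</style>", re.S | re.I)),
]


def split_regions(text, host="html"):
    """Return (language, start, end) triples that cover the text in order."""
    embedded = []
    for lang, rule in EMBED_RULES:
        for match in rule.finditer(text):
            if match.start(1) < match.end(1):
                embedded.append((match.start(1), match.end(1), lang))
    embedded.sort()

    regions, pos = [], 0
    for start, end, lang in embedded:
        if start < pos:
            # Overlapping match: ignore it and stay in the host language.
            continue
        if start > pos:
            regions.append((host, pos, start))
        regions.append((lang, start, end))
        pos = end
    if pos < len(text):
        regions.append((host, pos, len(text)))
    return regions


if __name__ == "__main__":
    doc = "<p>a</p><script>var x = 1;</script><style>p{}</style>"
    for lang, start, end in split_regions(doc):
        print(f"{lang:10} {doc[start:end]!r}")
```

An unterminated `<script>` never matches a rule, so the whole text is treated as HTML instead of breaking the rest of the document.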

2. Widely differing implementations. Take kwrite as an example: its syntax highlighting is based on a state machine instead of simple pattern matching. It can highlight most popular formats 100% correctly (i.e., analyzing them just like the real parser would), while most other syntax highlighters depend on some form of coding conventions and can fail in notorious cases. This kind of exact parsing can be used to flag syntax errors reliably as well, and since it is based on a state machine, states could be annotated with information for a similar problem: auto-indenting. (This last possibility is not implemented; it is just an idea.) While it is very powerful, it is still fast thanks to clever state management and incremental parsing. This efficiency also limits what it can analyze: it is not a parser suitable for all documents in use today. Still, I don't know of a highlighter as flexible: I once implemented a 99% conforming XML parser with it, missing only a few error checks that are impossible with a DFA.

NetBeans and various other IDEs have even more powerful code analyzers that do much more than just syntax-highlighting. The problem is that all these examples might be able to do simple pattern matching as required by a highlighting format as proposed above, but then the additional power goes unused, which is as undesirable as locking out simpler highlighters that can't do anything but plain pattern matches.
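
To make the state-machine idea concrete, here is a toy sketch. It is not kwrite's actual engine or rule format, only an illustration of the principle for Pascal comments and strings: each line is scanned starting from the state the previous line ended in, so an editor only has to rescan lines whose start state changed. The state names and the `scan_line` function are assumptions.

```python
def scan_line(line, state="code"):
    """Classify one line; return (spans, state at the end of the line)."""
    spans = []
    start = 0

    def emit(token_class, end):
        # Close the current span at `end` and start the next one there.
        nonlocal start
        if end > start:
            spans.append((token_class, line[start:end]))
        start = end

    i, n = 0, len(line)
    while i < n:
        if state == "code":
            if line.startswith("//", i):
                emit("code", i)
                emit("comment", n)  # the rest of the line is a comment
                i = n
            elif line[i] == "{":
                emit("code", i)
                state = "brace_comment"
                i += 1
            elif line.startswith("(*", i):
                emit("code", i)
                state = "paren_comment"
                i += 2
            elif line[i] == "'":
                emit("code", i)
                state = "string"
                i += 1
            else:
                i += 1
        elif state == "brace_comment" and line[i] == "}":
            i += 1
            emit("comment", i)
            state = "code"
        elif state == "paren_comment" and line.startswith("*)", i):
            i += 2
            emit("comment", i)
            state = "code"
        elif state == "string" and line[i] == "'":
            i += 1
            emit("string", i)
            state = "code"
        else:
            i += 1

    if state == "code":
        emit("code", n)
    elif state == "string":
        emit("string", n)
        state = "code"  # Pascal strings end at the line break: recover
    else:
        emit("comment", n)  # the comment continues on the next line
    return spans, state


if __name__ == "__main__":
    state = "code"
    for line in ["x := 1; { start", "end } y := 'a';"]:
        spans, state = scan_line(line, state)
        print(spans, "->", state)
```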

3. Increasingly complex languages. The popular saying "Only perl can parse Perl" is often augmented with “it's a miracle that even perl can parse Perl” for a good reason. Both apply to Perl 5, but Perl 6 isn't much better. The more dynamic and reflective a language is, the more difficult it gets to highlight. Another interesting example is Smalltalk: it doesn't have any keywords at all, since truly everything is an object. Highlighting might make sense for some common predefined variables, but things usually known as “control structures” are really just method calls on objects. Whether “do:” denotes the third parameter of a counting loop or (part of) some totally unrelated message can only be established from context. C with its preprocessor is another notorious example, as the preprocessor is quite able to throw a parser off its tracks.

The obvious solution (as employed by “highlight” as well) is to use simple pattern matching and rely on coding conventions to avoid possible pitfalls. This is a viable strategy for print-publishing source code (you'll usually want it to look clean anyways) or writing your own code, but what about code from foreign sources which you'd like to understand better using syntax highlighting or need to integrate? Moreover, there are the infamous “fix emacs highlighting” comments in all sorts of source files to fix some mis-parsing of clean code.

4. Dynamic languages. Something that's a keyword might not be one a line later. 100% exact highlighting is inherently impossible, as dynamic languages are able to redefine all sorts of symbols. In Smalltalk, the question whether we are facing a for-loop or something totally unrelated might only be decidable at run-time. In C, redefining a few crucial names might change the meaning of a header file entirely. In Perl, “built-in” functions can be replaced and new functions can be declared that parse like builtins. Therefore, graceful failure is an absolute must, otherwise it's “emacs fix up” time again.

So it's not actually a question of syntax-highlighting, but of parsing, which can then be used to do other things as well (auto-indenting, cross-referencing, refactoring, …) depending on how exact it is. There are three basic approaches:

1. Simple more or less line-oriented regex-based pattern matching, maybe augmented by some hard-coded special cases for popular languages. “highlight” falls into this category, as do many others. This low-tech solution works for many cases but can easily mess up valid, clean source code if certain less common constructs are used. Traditionally, failure isn't exactly graceful if quoting rules of special characters are involved.

2. Rule-based full program parsing (using regexes, DFA or specific grammars). “kwrite” and “emacs” are examples. If the rules are implemented correctly (which isn't always the case!), these highlighters can parse many languages 100% correctly, although often with barely enough information for highlighting. Reuse of such parsers for other applications is usually impossible. Highlight failure is usually catastrophic, as the parser gets stuck in a state it doesn't get out of. These often have speed problems on files of several thousand lines.

3. Real parsers. “NetBeans” and most other IDEs are able to parse a file like the real compiler would, including all necessary context (include files, macro definitions), reaching the best accuracy possible. Parsing is so slow that it usually happens asynchronously, with a simpler highlighter doing the real-time updates. Correct parsing might depend on additional configuration, for example include paths or compiler flags. Implementation is usually done in native code, not via user-modifiable rule sets.
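
Approach (1) is easy to sketch, which is also why it is so common. The following toy highlighter (a simplified illustration, not the code of “highlight”) matches each line on its own against a handful of regexes, with a shortened keyword set. It handles ordinary lines well and shows the typical weakness: a comment that stays open at the end of a line is not recognized, so words inside it get highlighted as keywords.

```python
import re

# Shortened keyword set for the sketch.
KEYWORDS = {"begin", "end", "if", "then", "else", "for", "to", "do", "while"}

TOKEN_RE = re.compile(r"""
    (?P<comment>//.*|\{.*?\}|\(\*.*?\*\))   # only comments closed on this line
  | (?P<string>'[^']*')
  | (?P<number>\b\d+(?:\.\d+)?\b)
  | (?P<word>\b[A-Za-z_]\w*\b)
""", re.X)


def highlight_line(line):
    """Return (token_class, text) pairs for one line, looked at in isolation."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            if text.lower() not in KEYWORDS:
                continue  # plain identifier: leave unhighlighted
            kind = "keyword"
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    print(highlight_line("if x > 10 then { check } y := 'if';"))
    # Weak spot: the comment is still open at the end of the line,
    # so "begin" is reported as a keyword.
    print(highlight_line("{ begin of a long comment"))
```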

Approach (3) seems desirable, but has serious drawbacks: the ability to create new syntax definitions is crucial, yet real parsers are usually native code rather than user-modifiable rule sets, and some languages are fastest to parse using the language's own native parser.

Approach (1) suffers least from parse errors: You can feed mismatching languages to it, claiming it to be some similar known language, and still get something useful (not nearly perfect, but at least better than unhighlighted).

Approach (2) can yield high-quality output if done right, but really needs matching rule definitions. It can be fast enough for all practical purposes, if done right. “If done right” is its drawback: graceful failure is difficult. Moreover, it is not generic enough to parse everything.

So, I'd think the best solution would be a light-weight library with a flexible API (and some useful language bindings) that allows all three approaches. Plugins would be possible for hand-writing parsers in native code (or interfacing the existing native parser) for difficult languages, while two different parsers could do highlighting using approach (1) or (2). The kwrite parser is a good source of ideas for an efficient implementation of (2).

The library would need to have multiple modes of operation, at least “strict” and “common”. “Strict” mode means “rely on the correctness of rules and that the code doesn't do unfair run-time tricks” (i.e., approach 2), while “common” mode would mean “do simplified parsing that only works if certain coding standards are followed” (i.e., approach 1). Also possible would be a mode “auto” that uses strict parsing but switches to simple parsing if some definitely impossible situations are encountered.
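
The three modes could look roughly like this from the caller's side. This is only a sketch of the dispatch logic with two toy parsers standing in for approaches (2) and (1); the names `highlight`, `ParseError`, `strict_parser` and `simple_parser` are invented here and do not refer to an existing library.

```python
import re


class ParseError(Exception):
    """Raised by a strict parser when it meets a definitely impossible situation."""


def strict_parser(text):
    """Toy stand-in for approach (2): insists that every { comment is closed."""
    tokens, pos = [], 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return tokens
        end = text.find("}", start)
        if end == -1:
            raise ParseError(f"unterminated comment at offset {start}")
        tokens.append(("comment", text[start:end + 1]))
        pos = end + 1


def simple_parser(text):
    """Toy stand-in for approach (1): plain keyword matching, never fails."""
    return [("keyword", word) for word in re.findall(r"\b(?:begin|end)\b", text)]


def highlight(text, mode="auto"):
    """Return (tokens, mode actually used)."""
    if mode == "common":
        return simple_parser(text), "common"
    if mode == "strict":
        return strict_parser(text), "strict"
    if mode == "auto":
        try:
            return strict_parser(text), "strict"
        except ParseError:
            # Sanity check failed: fall back to simplified parsing.
            return simple_parser(text), "common"
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(highlight("begin { ok } end"))
    print(highlight("begin { broken end"))
```

Returning the mode that was actually used lets an editor show the user that the result comes from the fallback and may be less exact.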

The library would also handle hybrid documents, maintaining separate parsers for subdocuments, doing auto-detection if needed.

The associated file format might allow some simple rules for an implementation according to (1), an automaton or grammar for (2), and additional annotations for the other features: information on scope nesting depth (for indentation), the type of symbol being highlighted, rules for “auto” mode sanity checks, locations that possibly or definitely start a subdocument, quoting inside subdocuments, auto-detection rules for sublanguages and so on.

I think only such a complete implementation is likely to get wide support, as most people are satisfied with the current “complete enough” state of affairs. An alternative would need to be a big improvement over current solutions, in quality and/or in number of supported document formats.

(by: Jörg Walter jwalt@garni.ch)


This comment is pretty impressive, but scary in consideration of the workload of such a project ;) Volunteers, run to the hills!


csi/common_syntax_interface.txt · Last modified: 2010/01/13 16:43 (external edit)