Natural Language Concrete Syntax Tree format.
nlcst is a specification for representing natural language in a syntax tree. It implements the unist spec.
This document may not be released.
See releases for released documents.
The latest released version is 1.0.2.
- Introduction
- Types
- Nodes (abstract)
- Nodes
- Glossary
- List of utilities
- Related
- References
- Contribute
- Acknowledgments
- License
This document defines a format for representing natural language as a concrete syntax tree. Development of nlcst started in May 2014, in the now deprecated textom project for retext, before unist existed. This specification is written in a Web IDL-like grammar.
nlcst extends unist, a format for syntax trees, to benefit from its ecosystem of utilities.
nlcst relates to JavaScript in that it has an ecosystem of utilities for working with compliant syntax trees in JavaScript. However, nlcst is not limited to JavaScript and can be used in other programming languages.
nlcst relates to the unified and retext projects in that nlcst syntax trees are used throughout their ecosystems.
If you are using TypeScript, you can use the nlcst types by installing them with npm:
npm install @types/nlcst
interface Literal <: UnistLiteral {
  value: string
}
Literal (UnistLiteral) represents a node in nlcst containing a value.
Its value field is a string.
interface Parent <: UnistParent {
  children: [Paragraph | Punctuation | Sentence | Source | Symbol | Text | WhiteSpace | Word]
}
Parent (UnistParent) represents a node in nlcst containing other nodes (said to be children).
Its content is limited to only other nlcst content.
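The two abstract interfaces above could be sketched in TypeScript roughly as follows. This is a minimal illustration that assumes locally declared stand-in types rather than the ones shipped in `@types/nlcst`; the names `NlcstLiteral`, `NlcstParent`, `NlcstNode`, and the helper `isParent` are hypothetical and not part of the specification.

```typescript
// Local, simplified stand-ins for the abstract nlcst interfaces.
// These are illustrative assumptions, not the official `@types/nlcst` types.
interface NlcstLiteral {
  type: string
  value: string
}

interface NlcstParent {
  type: string
  children: NlcstNode[]
}

type NlcstNode = NlcstLiteral | NlcstParent

// Hypothetical helper: a node is treated as a parent when it has a `children` array.
function isParent(node: NlcstNode): node is NlcstParent {
  return 'children' in node && Array.isArray(node.children)
}

const textNode: NlcstLiteral = {type: 'TextNode', value: 'Alpha'}
const wordNode: NlcstParent = {type: 'WordNode', children: [textNode]}

console.log(isParent(wordNode)) // true
console.log(isParent(textNode)) // false
```

A check on the presence of `children` is one possible way to tell the two kinds of node apart; checking the `type` field against the known node types would work as well.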
interface Paragraph <: Parent {
  type: 'ParagraphNode'
  children: [Sentence | Source | WhiteSpace]
}
Paragraph (Parent) represents a unit of discourse dealing with a particular point or idea.
Paragraph can be used in a root node. It can contain sentence, whitespace, and source nodes.
interface Punctuation <: Literal {
  type: 'PunctuationNode'
}
Punctuation (Literal) represents typographical devices which aid understanding and correct reading of other grammatical units.
Punctuation can be used in sentence or word nodes.
interface Root <: Parent {
  type: 'RootNode'
}
Root (Parent) represents a document.
Root can be used as the root of a tree, never as a child. Its content model is not limited: it can contain any nlcst content, with the restriction that all content must be of the same category.
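To make the nesting concrete, here is a hedged sketch of how the text "Alpha bravo." might be represented as a tree. The type aliases are local simplifications written for this example (for instance, words only hold text nodes here, which is narrower than the full content model), not the official `@types/nlcst` types, and the tree is hand-written rather than produced by a parser.

```typescript
// Simplified local types mirroring this document's interfaces.
// They are assumptions for illustration, not the official `@types/nlcst` types.
type TextNode = {type: 'TextNode'; value: string}
type PunctuationNode = {type: 'PunctuationNode'; value: string}
type WhiteSpaceNode = {type: 'WhiteSpaceNode'; value: string}
type WordNode = {type: 'WordNode'; children: TextNode[]}
type SentenceNode = {
  type: 'SentenceNode'
  children: Array<WordNode | PunctuationNode | WhiteSpaceNode>
}
type ParagraphNode = {type: 'ParagraphNode'; children: SentenceNode[]}
type RootNode = {type: 'RootNode'; children: ParagraphNode[]}

// The text "Alpha bravo." as a tree: root > paragraph > sentence > words.
const tree: RootNode = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'Alpha'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'bravo'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// List the node types that appear directly inside the sentence.
const sentenceChildTypes = tree.children[0].children[0].children.map(
  (node) => node.type
)
console.log(sentenceChildTypes)
```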
interface Sentence <: Parent {
  type: 'SentenceNode'
  children: [Punctuation | Source | Symbol | WhiteSpace | Word]
}
Sentence (Parent) represents a grouping of grammatically linked words that in principle expresses a complete thought, although it may make little sense when taken in isolation, out of context.
Sentence can be used in a paragraph node. It can contain word, symbol, punctuation, whitespace, and source nodes.
interface Source <: Literal {
  type: 'SourceNode'
}
Source (Literal) represents an external (ungrammatical) value embedded into a grammatical unit: a hyperlink, code, and such.
Source can be used in root, paragraph, sentence, or word nodes.
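As an illustration of how a source node can sit next to grammatical content, the sketch below hand-writes the children of a sentence for the text "Run `npm test` now." and then collects only the words. It assumes that the inline code is kept as a single opaque `SourceNode` including its backticks; that choice, the local type aliases, and the variable names are assumptions made for this example.

```typescript
// Simplified local types (assumptions, not the official `@types/nlcst` types).
type LiteralNode = {
  type: 'PunctuationNode' | 'SourceNode' | 'TextNode' | 'WhiteSpaceNode'
  value: string
}
type WordNode = {type: 'WordNode'; children: LiteralNode[]}
type SentenceChild = LiteralNode | WordNode

// Children of a sentence for "Run `npm test` now.", with the inline code
// kept as an opaque source node.
const sentenceChildren: SentenceChild[] = [
  {type: 'WordNode', children: [{type: 'TextNode', value: 'Run'}]},
  {type: 'WhiteSpaceNode', value: ' '},
  {type: 'SourceNode', value: '`npm test`'},
  {type: 'WhiteSpaceNode', value: ' '},
  {type: 'WordNode', children: [{type: 'TextNode', value: 'now'}]},
  {type: 'PunctuationNode', value: '.'}
]

// Collect only the words; the source node is skipped because it is not a word.
const words = sentenceChildren
  .filter((node): node is WordNode => node.type === 'WordNode')
  .map((word) => word.children.map((child) => child.value).join(''))

console.log(words) // ['Run', 'now']
```

Because the code is wrapped in a source node, a tool that inspects words never sees `npm` or `test` as natural language.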
interface Symbol <: Literal {
  type: 'SymbolNode'
}
Symbol (Literal) represents typographical devices different from characters which represent sounds (like letters and numerals), white space, or punctuation.
Symbol can be used in sentence or word nodes.
interface Text <: Literal {
  type: 'TextNode'
}
Text (Literal) represents actual content in nlcst documents: one or more characters.
Text can be used in word nodes.
interface WhiteSpace <: Literal {
  type: 'WhiteSpaceNode'
}
WhiteSpace (Literal) represents typographical devices devoid of content, separating other units.
WhiteSpace can be used in root, paragraph, or sentence nodes.
interface Word <: Parent {
  type: 'WordNode'
  children: [Punctuation | Source | Symbol | Text]
}
Word (Parent) represents the smallest element that may be uttered in isolation with semantic or pragmatic content.
Word can be used in a sentence node. It can contain text, symbol, punctuation, and source nodes.
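Since a word can mix text and punctuation, getting its plain string back means concatenating the values of its literal descendants. The following is a minimal sketch of that idea using a hypothetical `serialize` helper and local stand-in types; it is not the implementation of any existing utility.

```typescript
// Local stand-in types (assumptions, not the official `@types/nlcst` types).
type NlcstLiteral = {type: string; value: string}
type NlcstParent = {type: string; children: NlcstNode[]}
type NlcstNode = NlcstLiteral | NlcstParent

// Hypothetical helper: concatenates the values of all literal descendants,
// in document order.
function serialize(node: NlcstNode): string {
  if ('children' in node) {
    return node.children.map((child) => serialize(child)).join('')
  }
  return node.value
}

// The word "don't" modeled as text, punctuation, text.
const contraction: NlcstParent = {
  type: 'WordNode',
  children: [
    {type: 'TextNode', value: 'don'},
    {type: 'PunctuationNode', value: "'"},
    {type: 'TextNode', value: 't'}
  ]
}

console.log(serialize(contraction)) // don't
```

The same recursive walk works for sentences, paragraphs, and roots, because every parent only contains literals or other parents.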
See the unist glossary.
See the unist list of utilities for more utilities.
- nlcst-affix-emoticon-modifier: merge affix emoticons into the previous sentence
- nlcst-emoji-modifier: support emoji
- nlcst-emoticon-modifier: support emoticons
- nlcst-is-literal: check whether a node is meant literally
- nlcst-normalize: normalize a word for easier comparison
- nlcst-search: search for patterns
- nlcst-to-string: serialize a node
- nlcst-test: validate a node
- mdast-util-to-nlcst: transform mdast to nlcst
- hast-util-to-nlcst: transform hast to nlcst
- mdast — Markdown Abstract Syntax Tree format
- hast — Hypertext Abstract Syntax Tree format
- xast — Extensible Abstract Syntax Tree
- unist: Universal Syntax Tree. T. Wormer; et al.
- JavaScript: ECMAScript Language Specification. Ecma International.
- Web IDL: Web IDL, C. McCormack. W3C.
See contributing.md in syntax-tree/.github for ways to get started.
See support.md for ways to get help.
Ideas for new utilities and tools can be posted in syntax-tree/ideas.
A curated list of awesome syntax-tree, unist, mdast, hast, xast, and nlcst resources can be found in awesome syntax-tree.
This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.
The initial release of this project was authored by @wooorm.
Thanks to @nwtn, @tmcw, @muraken720, and @dozoisch for contributing to nlcst and related projects!