A powerful dataset generator for Rasa NLU, inspired by Chatito

SimGus/Chatette

Chatette

A data generator for Rasa NLU


Installation • Uninstallation • How to use Chatette? • Chatette vs Chatito? • Development • Credits

Chatette is a Python program that generates training datasets for Rasa NLU given template files. If you want to make large datasets of example data for Natural Language Understanding tasks without too much of a headache, Chatette is a project for you.

Preview of Chatette's capabilities

Specifically, Chatette implements a Domain Specific Language (DSL) that allows you to define templates to generate a large number of sentences, which are then saved in the input format(s) of Rasa NLU.

The DSL used is a near-superset of the excellent project Chatito created by Rodrigo Pimentel. (Note: the DSL is actually a superset of Chatito v2.1.x for Rasa NLU, not for all possible adapters.)

An interactive mode is available as well:

Interactive mode

Installation

To run Chatette, you will need to have Python installed. Chatette works with both Python 2.7 and 3.x (>= 3.4).

Chatette is available on PyPI, and can thus be installed using pip:

pip install chatette

Alternatively, you can clone the GitHub repository and install the requirements:

pip install -r requirements/common.txt

You can then install the project (as an editable package) using pip, by executing the following command from the directory Chatette/chatette/:

pip install -e .

You can then run the module by using the commands below in the cloned directory.

Uninstallation

You can just use pip to uninstall Chatette:

pip uninstall chatette

How to use Chatette?

Input and output data

The data that Chatette uses and generates is loaded from and saved to files. You will thus have:

  • One or several input file(s) containing the templates. There is no need for a specific file extension. The syntax of the DSL to make those templates is described on the wiki.

  • One or several output file(s), which will be generated by Chatette and will contain the generated examples. Those files can be formatted in JSON (by default) or in Markdown and can be directly fed to Rasa NLU. It is also possible to use a JSONL format.
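As a rough illustration of what can be done with a generated JSON file, here is a small Python sketch that counts the examples per intent. It assumes the legacy Rasa NLU JSON layout (a `rasa_nlu_data` object holding a `common_examples` list); the function name `count_examples_per_intent` is hypothetical and the sample data is hand-written for this example, not actual Chatette output.

```python
import json
from collections import Counter


def count_examples_per_intent(rasa_json_text):
    """Count examples per intent in a Rasa NLU JSON document.

    Assumption: the document uses the legacy Rasa NLU layout, with the
    examples stored under data["rasa_nlu_data"]["common_examples"].
    """
    data = json.loads(rasa_json_text)
    examples = data["rasa_nlu_data"]["common_examples"]
    return Counter(example["intent"] for example in examples)


# Hand-written sample in the assumed layout (not real Chatette output).
SAMPLE = """
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "where is the loo?",
        "intent": "ask_toilet",
        "entities": [
          {"start": 13, "end": 16, "value": "loo", "entity": "toilet"}
        ]
      },
      {
        "text": "tell me where the toilets are?",
        "intent": "ask_toilet",
        "entities": [
          {"start": 18, "end": 25, "value": "toilets", "entity": "toilet"}
        ]
      }
    ]
  }
}
"""

print(count_examples_per_intent(SAMPLE))
```

A check like this can help spot intents that ended up with far more (or far fewer) examples than intended.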

Running Chatette

Once Chatette is installed and you have created the template files, run the following command:

python -m chatette <path_to_template>

where python is your Python interpreter (some operating systems use python3 as the alias to the Python 3.x interpreter).

You can specify the output directory as follows:

python -m chatette <path_to_template> -o <output_directory_path>

<output_directory_path> is specified relative to the directory from which the script is executed. The output file(s) will then be saved in numbered .json files in <output_directory_path>/train and <output_directory_path>/test. If you didn't specify a path for the output directory, the default one is output.
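If you prefer to drive the generation from a Python script, the command above can be wrapped as in the following sketch. It only uses the documented `-o` option; the helper names (`build_chatette_command`, `run_chatette`, `generated_files`) are hypothetical and are not part of Chatette.

```python
import subprocess
import sys
from pathlib import Path


def build_chatette_command(template_path, output_dir=None):
    """Build the argument list for `python -m chatette <template> [-o <dir>]`."""
    command = [sys.executable, "-m", "chatette", str(template_path)]
    if output_dir is not None:
        command += ["-o", str(output_dir)]
    return command


def run_chatette(template_path, output_dir=None):
    """Run Chatette in a subprocess; raises CalledProcessError on failure."""
    subprocess.run(build_chatette_command(template_path, output_dir), check=True)


def generated_files(output_dir="output"):
    """List the .json files found in <output_dir>/train and <output_dir>/test."""
    root = Path(output_dir)
    files = []
    for subdir in ("train", "test"):
        directory = root / subdir
        if directory.is_dir():
            files.extend(sorted(directory.glob("*.json")))
    return files
```

For instance, `run_chatette("<path_to_template>", "output")` followed by `generated_files("output")` would run the generation and then list the files it produced.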

Other program arguments are described in the wiki.

Chatette vs Chatito?

TL;DR: main selling point: it is easier to deal with large projects using Chatette, and you can transform most Chatito projects into a Chatette one without any modification.

A perfectly legitimate question is:

Why does Chatette exist when Chatito already fulfills the same purposes?

The two projects actually have different goals:

Chatito aims to be a generic but powerful DSL that should stay very legible. While it is perfectly fine for small projects, when projects get larger, the simplicity of its DSL may become a burden: your template file becomes overwhelmingly large, to the point you get lost inside it.

Chatette defines a more complex DSL to be able to manage larger projects and tries to stay as interoperable with Chatito as possible. Here is a non-exhaustive list of features that Chatette has and Chatito does not:

  • Ability to break down templates into multiple files
  • Possibility to specify the probability of generating some parts of the sentences
  • Conditional generation of some parts of the sentences, given which other parts were generated
  • Choice syntax to prevent copy-pasting rules with only a few changes and to easily modify the generation behavior of parts of sentences
  • Ability to define the value of each slot (entity) whatever the generated example
  • Syntax for generating words with different case for the leading letter
  • Argument support so that some templates may be filled by different strings in different situations
  • Indentation is permissive and must only be somewhat coherent
  • Support for synonyms
  • Interactive command interpreter
  • Output for Rasa in JSON or in Markdown formats

As Chatette's DSL is a superset of Chatito's, input files used for Chatito are most of the time completely usable with Chatette (not the other way around). Hence, it is easy to start using Chatette if you used Chatito before.

As an example, this Chatito data:

// This template defines different ways to ask for the location of toilets (Chatito version)
%[ask_toilet]('training': '3')
    ~[sorry?] ~[tell me] where the @[toilet#singular] is ~[please?]?
    ~[sorry?] ~[tell me] where the @[toilet#plural] are ~[please?]?

~[sorry]
    sorry
    Sorry
    excuse me
    Excuse me

~[tell me]
    ~[can you?] tell me
    ~[can you?] show me
~[can you]
    can you
    could you
    would you

~[please]
    please

@[toilet#singular]
    toilet
    loo
@[toilet#plural]
    toilets

could be directly given as input to Chatette, but this Chatette template would produce the same results:

// This template defines different ways to ask for the location of toilets (Chatette version)
%[&ask_toilet](3)
    ~[sorry?] ~[tell me] where the @[toilet#singular] is [please?]?
    ~[sorry?] ~[tell me] where the @[toilet#plural] are [please?]?

~[sorry]
    sorry
    excuse me

~[tell me]
    ~[can you?] [tell|show] me
~[can you]
    [can|could|would] you

@[toilet#singular]
    toilet
    loo
@[toilet#plural]
    toilets

The Chatito version is arguably easier to read, but the Chatette version is shorter, which may be very useful when dealing with lots of templates and potential repetition.
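To see why a single rule with a choice can replace several near-identical rules, here is a minimal Python sketch that lists the strings a plain choice such as [can|could|would] stands for. It is only an illustration and not Chatette's actual parser: the function name `expand_choices` is hypothetical, and it handles plain choices only (no modifiers such as ?, no nesting, and no ~[...] or @[...] references).

```python
import itertools
import re

# Matches a plain choice such as "[can|could|would]" (no nesting).
CHOICE = re.compile(r"\[([^\[\]]+)\]")


def expand_choices(rule):
    """Return every string obtained by picking one alternative per choice.

    Illustration only: modifiers and references are not supported.
    """
    # With one capture group, re.split alternates between literal text
    # (even indexes) and the content of each choice (odd indexes).
    parts = CHOICE.split(rule)
    options = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*options)]


print(expand_choices("[can|could|would] you"))
# ['can you', 'could you', 'would you']
```

The three strings printed are exactly the three rules listed under ~[can you] in the Chatito version.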

Beware that, as always with machine learning, having too much data may cause your models to perform less well because of overfitting. While this script can be used to generate thousands upon thousands of examples, doing so isn't advised for machine learning tasks.

Chatette is named after Chatito: -ette in French could be translated to -ita or -ito in Spanish. Note that the last e in Chatette is not pronounced (as is the case in "note").

Development

For developers, you can clone the repo and install the development requirements:

pip install -r requirements/develop.txt

Then, install the module as editable:

pip install -e <path-to-chatette-module>

Credits

Author and maintainer

Disclaimer: This is a side-project I'm not paid for, don't expect me to work 24/7 on it.

Contributors

Many thanks to them!