
thunderpoot/isogloss


🌐 iso·gloss

isogloss

ISO 639 and IETF Language Code Lookup Tool

isogloss is a Python-based command-line tool for looking up language details from ISO 639 codes and IETF (BCP-47) language tags. It reports the names, native names, scope, type, and related codes associated with each code or tag.

There is also a web-based version. The BCP47 parser has some known issues, documented below in the "Errata" section.

Elsewhere, the word isogloss means a boundary line on a map denoting the regional use of a particular linguistic characteristic, but in this case it just seemed to fit.

Features

  • Look up language details using ISO 639-1, 639-2/B, 639-2/T, or 639-3 codes.
  • Look up language details by language name.
  • Look up language details using IETF BCP-47 language tags.
    • Examples: en-GB, en-US, sv-SE, zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1, and so on.

Installation

Clone the repository to your local machine:

git clone https://github.com/thunderpoot/isogloss.git

Create a virtual environment and install the requirements:

python3.11 -m venv venv
source venv/bin/activate
pip install unidecode

Usage

The script can be run directly from the command line. Below are some examples of how to use it:

To look up information by ISO 639 code:

$ isogloss/isogloss.py -c swe
{
"639-1": "sv",
"Scope": "Individual",
"Type": "Living",
"Native name(s)": "svenska",
"Other name(s)": "",
"639-2/T": "swe",
"639-2/B": "",
"639-3": "swe",
"Name(s)": "Swedish"
}

To look up information by language name:

$ isogloss/isogloss.py -n "egyptian arabic"
{
"Egyptian Arabic": "arz"
}

Example of lookup via native name:

$ isogloss/isogloss.py -n nhật bổn ngữ
{
"\u65e5\u672c\u8a9e Nihongo": "jpn"
}

Example of multiple results being found:

$ isogloss/isogloss.py -n norwegian
{
"Norwegian Nynorsk": "nno",
"Nynorsk, Norwegian": "nno",
"Bokm\u00e5l, Norwegian": "nob",
"Norwegian Bokm\u00e5l": "nob",
"Norwegian": "nor",
"Norwegian Sign Language": "nsl",
"Traveller Norwegian": "rmg"
}
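Because the output is plain JSON, it can be post-processed by other scripts. The following is a minimal sketch, not part of isogloss itself, that inverts the name-to-code mapping shown above so that each code lists all of its matching names. The helper name `group_by_code` is hypothetical, and the input is simply the example output pasted into a string.

```python
import json
from collections import defaultdict

# Output copied from the `-n norwegian` example above.
raw = r'''
{
"Norwegian Nynorsk": "nno",
"Nynorsk, Norwegian": "nno",
"Bokm\u00e5l, Norwegian": "nob",
"Norwegian Bokm\u00e5l": "nob",
"Norwegian": "nor",
"Norwegian Sign Language": "nsl",
"Traveller Norwegian": "rmg"
}
'''


def group_by_code(name_to_code):
    """Invert a name -> code mapping into code -> sorted list of names.

    Hypothetical helper for illustration; isogloss does not ship this.
    """
    grouped = defaultdict(list)
    for name, code in name_to_code.items():
        grouped[code].append(name)
    return {code: sorted(names) for code, names in grouped.items()}


codes = group_by_code(json.loads(raw))
for code, names in sorted(codes.items()):
    print(code, names)
```

Grouping this way makes it easier to see that the seven matching names correspond to five distinct ISO 639-3 codes.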

Language names are normalised, allowing for case-insensitive and accent-insensitive matching when searching:

$ isogloss/isogloss.py -n espanol
{
"Judeo-espa\u00f1ol": "lad",
"espa\u00f1ol": "spa"
}
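The kind of normalisation described above can be sketched as follows. This is an illustration only and makes some assumptions: it uses the standard library's `unicodedata` module as a stand-in for the `unidecode` package installed earlier (which also transliterates non-Latin scripts, something this sketch does not attempt), and the function names `normalise` and `search` are hypothetical rather than taken from isogloss.

```python
import unicodedata


def normalise(name):
    """Case-fold a name and strip combining marks (accents).

    Stand-in for unidecode-style folding; hypothetical helper.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search(query, names):
    """Return entries of `names` whose normalised key contains the normalised query."""
    needle = normalise(query)
    return {name: code for name, code in names.items() if needle in normalise(name)}


# Small sample mapping of names to ISO 639-3 codes, taken from the examples above.
sample = {"Judeo-espa\u00f1ol": "lad", "espa\u00f1ol": "spa", "svenska": "swe"}
print(search("ESPANOL", sample))
```

Normalising both the query and the stored names with the same function is what lets `espanol`, `Español`, and `ESPAÑOL` all match the same entries.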

To look up information by IETF language tag:

$ isogloss/isogloss.py -i fr-FR
{
"Language": {
"639-1": "fr",
"Scope": "Individual",
"Type": "Living",
"Native name(s)": "fran\u00e7ais",
"Other name(s)": "",
"639-2/T": "fra",
"639-2/B": "fre",
"639-3": "fra",
"Name(s)": "French"
},
"Region": "France"
}
$ isogloss/isogloss.py -i zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1
{
"Primary Language": {
"639-1": "zh",
"639-2/B": "chi",
"639-2/T": "zho",
"639-3": "zho",
"Deprecated": false,
"Name(s)": "Chinese",
"Native name(s)": "\u4e2d\u6587 Zh\u014dngw\u00e9n; \u6c49\u8bed; \u6f22\u8a9e H\u00e0ny\u01d4",
"Other name(s)": "",
"Scope": "Macrolanguage",
"Type": "Living"
},
"Extended Languages": [
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "cmn",
"Deprecated": false,
"Name(s)": "Mandarin Chinese",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
}
],
"Script": "Han (Simplified variant)",
"Region": "China",
"Variant": "pinyin",
"Extension": "ud1-p9t4",
"Private Use": "x-private1"
}
$ isogloss/isogloss.py -i ar-ajp-apc-apd-Arab-CV-arevela-g-231243-r-sdarre-x-private-x-private1 | jq
{
"Primary Language": {
"639-1": "ar",
"639-2/B": "",
"639-2/T": "ara",
"639-3": "ara",
"Deprecated": false,
"Name(s)": "Arabic",
"Native name(s)": "العربية; al'Arabiyyeẗ",
"Other name(s)": "",
"Scope": "Macrolanguage",
"Type": "Living"
},
"Extended Languages": [
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"Deprecated": true,
"Language Name(s)": "South Levantine Arabic",
"Language Type": "Living",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual"
},
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "apc",
"Deprecated": false,
"Name(s)": "Levantine Arabic",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
},
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "apd",
"Deprecated": false,
"Name(s)": "Sudanese Arabic",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
}
],
"Script": "Arabic",
"Region": "Cabo Verde",
"Variant": "arevela",
"Extension": "g-231243-r-sdarre",
"Private Use": "x-private-x-private1"
}

Files

  • data/consolidated_langs.json: Contains language data in JSON format used for the lookup.
  • data/region_names.json: Contains region data in JSON format used for the BCP47 lookup.
  • data/script_codes.json: Contains script code data in JSON format used for the BCP47 lookup.
  • data/deprecated-639-3.csv: Contains deprecated ISO 639-3 codes in CSV format, for quick reference.

Errata

There are known issues with the BCP47 parser in the web interface. It uses regular expressions to validate input, which produces the following behaviour:

Examples of valid tags:

  • en

  • fr-CA

  • i-klingon

  • az-Arab-IR

  • sr-Cyrl-RS

  • zh-cmn-Hans

  • ja-JP-x-tokyo

  • uz-Cyrl-UZ-1992

  • bo-Tibt-x-dialect

  • zh-cmn-Hans-CN-x-private1

  • hy-Latn-IT-arevela-x-test

Examples of invalid tags (malformed):

  • en-GB-oed-x-private

  • de-CH-1901-co-phonebk-sc-gothic-x-bavaria

(and more)

Examples of inputs that reveal parsing bugs:

  • ca-valencia-nedis (Highlighted input section is missing "valencia")

  • en-US-u-islamcal (Variant "u" and Extension "islamcal", Extension section says "u - islamcal")

  • es-419-fonipa (Extended languages blank)

  • de-Latf-1901 (Region undefined)

  • sl-rozaj (rozaj is coloured differently in the result container to how it is in the highlighted input section)
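
For comparison, here is a minimal sketch of how these tags decompose under the `langtag` grammar of RFC 5646 (the successor to the RFC 4646 listed below). It is not the parser used by isogloss or its web interface. The names `TAG_RE` and `parse_tag` are hypothetical, and the sketch checks syntax only: it does not consult any subtag registry, and it deliberately omits grandfathered tags such as i-klingon and the reserved 4 to 8 letter language subtags.

```python
import re

# Simplified RFC 5646 "langtag" production (syntax only, no registry lookup).
TAG_RE = re.compile(
    r"""
    (?P<language>[a-z]{2,3})                                    # primary language
    (?P<extlang>(?:-[a-z]{3}){0,3})                             # up to three extlangs
    (?:-(?P<script>[a-z]{4}))?                                  # script, e.g. Latf
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                         # region, e.g. US or 419
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)      # variants, e.g. 1901
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)        # singleton extensions
    (?:-(?P<private>x(?:-[a-z0-9]{1,8})+))?                     # private use
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_tag(tag):
    """Split a tag into its components, or return None if it is not well-formed."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        return None

    def subtags(name):
        # These groups always participate in the match, possibly as "".
        return [s for s in m.group(name).split("-") if s]

    return {
        "language": m.group("language"),
        "extlang": subtags("extlang"),
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": subtags("variants"),
        "extensions": m.group("extensions").lstrip("-"),
        "private": m.group("private"),
    }


for tag in ["ca-valencia-nedis", "en-US-u-islamcal", "es-419-fonipa",
            "de-Latf-1901", "sl-rozaj", "en-GB-oed-x-private"]:
    print(tag, "->", parse_tag(tag))
```

Under this grammar, `de-Latf-1901` has a script (`Latf`), no region, and one variant (`1901`), while `en-US-u-islamcal` has no variants and a single extension, `u-islamcal`. The sketch rejects `en-GB-oed-x-private`, in line with its listing among the malformed examples above.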

Contributing

Contributions, issues, and feature requests are welcome!

Author

Written by T E Vaughan

Sponsorship


If you find this project useful, please consider sponsoring my work. <3

Related Standards and RFCs

The codes used in this program conform to the following ISO standards:

Standards

RFCs

  • RFC 1766: Tags for the Identification of Languages
  • RFC 4646: Tags for Identifying Languages
  • RFC 4647: Matching of Language Tags

License

This project is MIT licensed.