isogloss
isogloss is a Python-based command-line tool for looking up language details from ISO 639 codes and IETF (BCP-47) language tags. It reports each language's names, native names, and any additional details associated with a code or tag.
There is also a web-based version. The BCP47 parser has some known issues, documented below in the "Errata" section.
Elsewhere, the word isogloss means a boundary line on a map denoting the regional use of a particular linguistic characteristic, but in this case it just seemed to fit.
- Look up language details using ISO 639-1, 639-2/B, 639-2/T, or 639-3 codes.
- Look up language details by language name.
- Look up language details using IETF BCP-47 language tags.
  - Examples: en-GB, en-US, sv-SE, zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1, and so on.
Clone the repository to your local machine:
git clone https://github.com/thunderpoot/isogloss.git
Create a virtual environment and install the requirements:
python3.11 -m venv venv
source venv/bin/activate
pip install unidecode
The script can be run directly from the command line. Below are some examples of how to use it:
To look up information by ISO 639 code:
$ isogloss/isogloss.py -c swe
{
"639-1": "sv",
"Scope": "Individual",
"Type": "Living",
"Native name(s)": "svenska",
"Other name(s)": "",
"639-2/T": "swe",
"639-2/B": "",
"639-3": "swe",
"Name(s)": "Swedish"
}
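As a rough illustration of how a single lookup can accept any of the four code types, the sketch below indexes each record under every non-empty code it carries. The names `RECORDS`, `build_index`, and `lookup_code` are hypothetical, the two records are abridged from the sample output in this README, and none of this is taken from the isogloss source.

```python
# Hypothetical sketch: one index that resolves 639-1, 639-2/T, 639-2/B and 639-3 codes.
# The records are abridged from the sample output shown in this README.
RECORDS = [
    {"639-1": "sv", "639-2/T": "swe", "639-2/B": "", "639-3": "swe", "Name(s)": "Swedish"},
    {"639-1": "fr", "639-2/T": "fra", "639-2/B": "fre", "639-3": "fra", "Name(s)": "French"},
]

CODE_FIELDS = ("639-1", "639-2/T", "639-2/B", "639-3")


def build_index(records):
    """Map every non-empty code of every record to that record."""
    index = {}
    for record in records:
        for field in CODE_FIELDS:
            code = record.get(field, "")
            if code:
                # setdefault keeps the first record seen for a given code
                index.setdefault(code.lower(), record)
    return index


def lookup_code(index, code):
    """Return the record for a code, or None if the code is unknown."""
    return index.get(code.strip().lower())


INDEX = build_index(RECORDS)
print(lookup_code(INDEX, "swe")["Name(s)"])  # Swedish
print(lookup_code(INDEX, "fre")["Name(s)"])  # French
```

Indexing once up front keeps each lookup a single dictionary access, whichever code type the user supplies.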
To look up information by language name:
$ isogloss/isogloss.py -n "egyptian arabic"
{
"Egyptian Arabic": "arz"
}
Example of lookup via native name:
$ isogloss/isogloss.py -n "nhật bổn ngữ"
{
"\u65e5\u672c\u8a9e Nihongo": "jpn"
}
Example of multiple results being found:
$ isogloss/isogloss.py -n norwegian
{
"Norwegian Nynorsk": "nno",
"Nynorsk, Norwegian": "nno",
"Bokm\u00e5l, Norwegian": "nob",
"Norwegian Bokm\u00e5l": "nob",
"Norwegian": "nor",
"Norwegian Sign Language": "nsl",
"Traveller Norwegian": "rmg"
}
Language names are normalised, allowing for case-insensitive and accent-insensitive matching when searching:
$ isogloss/isogloss.py -n espanol
{
"Judeo-espa\u00f1ol": "lad",
"espa\u00f1ol": "spa"
}
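The normalisation described above can be approximated with the standard library alone. The sketch below is an assumption about the general technique, not the project's code: the installation steps pull in unidecode, which the script presumably uses instead, and the helper names `normalise` and `search_names` and the substring matching are hypothetical.

```python
import unicodedata


def normalise(text):
    """Fold case and strip combining accents so 'Español' matches 'espanol'."""
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# A tiny sample of name -> code pairs taken from the examples in this README
NAMES = {"Judeo-español": "lad", "español": "spa", "Norwegian Bokmål": "nob"}


def search_names(query, names):
    """Return every name whose normalised form contains the normalised query."""
    needle = normalise(query)
    return {name: code for name, code in names.items() if needle in normalise(name)}


print(search_names("ESPANOL", NAMES))  # {'Judeo-español': 'lad', 'español': 'spa'}
print(search_names("bokmal", NAMES))   # {'Norwegian Bokmål': 'nob'}
```

One limitation of this stdlib approach: letters such as "ø" have no Unicode decomposition and pass through unchanged, whereas a transliteration library such as unidecode maps them to ASCII.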
To look up information by IETF language tag:
$ isogloss/isogloss.py -i fr-FR
{
"Language": {
"639-1": "fr",
"Scope": "Individual",
"Type": "Living",
"Native name(s)": "fran\u00e7ais",
"Other name(s)": "",
"639-2/T": "fra",
"639-2/B": "fre",
"639-3": "fra",
"Name(s)": "French"
},
"Region": "France"
}
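Because the tool prints JSON, its output can be consumed by other scripts. Non-ASCII characters appear as `\uXXXX` escapes in the raw output, and a JSON parser decodes them back. The sketch below parses an abridged copy of the output above; the variable names are illustrative only.

```python
import json

# Abridged copy of the output shown above, kept as a raw string so the
# \u00e7 escape reaches the JSON parser exactly as the tool printed it.
raw_output = r'''
{
  "Language": {
    "639-1": "fr",
    "Native name(s)": "fran\u00e7ais",
    "Name(s)": "French"
  },
  "Region": "France"
}
'''

result = json.loads(raw_output)
print(result["Language"]["Native name(s)"])  # français
print(result["Region"])                      # France
```

In a real script the string would typically come from `subprocess.run([...], capture_output=True, text=True).stdout` rather than a literal.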
$ isogloss/isogloss.py -i zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1
{
"Primary Language": {
"639-1": "zh",
"639-2/B": "chi",
"639-2/T": "zho",
"639-3": "zho",
"Deprecated": false,
"Name(s)": "Chinese",
"Native name(s)": "\u4e2d\u6587 Zh\u014dngw\u00e9n; \u6c49\u8bed; \u6f22\u8a9e H\u00e0ny\u01d4",
"Other name(s)": "",
"Scope": "Macrolanguage",
"Type": "Living"
},
"Extended Languages": [
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "cmn",
"Deprecated": false,
"Name(s)": "Mandarin Chinese",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
}
],
"Script": "Han (Simplified variant)",
"Region": "China",
"Variant": "pinyin",
"Extension": "ud1-p9t4",
"Private Use": "x-private1"
}
$ isogloss/isogloss.py -i ar-ajp-apc-apd-Arab-CV-arevela-g-231243-r-sdarre-x-private-x-private1 | jq
{
"Primary Language": {
"639-1": "ar",
"639-2/B": "",
"639-2/T": "ara",
"639-3": "ara",
"Deprecated": false,
"Name(s)": "Arabic",
"Native name(s)": "العربية; al'Arabiyyeẗ",
"Other name(s)": "",
"Scope": "Macrolanguage",
"Type": "Living"
},
"Extended Languages": [
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"Deprecated": true,
"Language Name(s)": "South Levantine Arabic",
"Language Type": "Living",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual"
},
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "apc",
"Deprecated": false,
"Name(s)": "Levantine Arabic",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
},
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "apd",
"Deprecated": false,
"Name(s)": "Sudanese Arabic",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
}
],
"Script": "Arabic",
"Region": "Cabo Verde",
"Variant": "arevela",
"Extension": "g-231243-r-sdarre",
"Private Use": "x-private-x-private1"
}
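To make the breakdown above easier to follow, here is a minimal positional parser for well-formed tags, following the subtag order in RFC 5646 (language, extended languages, script, region, variants, extensions, private use). It is a sketch under stated assumptions, not the parser isogloss uses: `parse_tag` is a hypothetical name, grandfathered tags such as i-klingon are rejected, and isogloss itself is more lenient, accepting subtags such as `ud1` above that this stricter sketch rejects.

```python
import re


def parse_tag(tag):
    """Split a well-formed BCP 47 tag into components (simplified RFC 5646 grammar)."""
    subtags = tag.split("-")
    n = len(subtags)
    parsed = {"language": None, "extlangs": [], "script": None, "region": None,
              "variants": [], "extensions": {}, "private_use": []}

    # Primary language: 2 or 3 letters (longer and grandfathered forms not handled)
    if not re.fullmatch(r"[A-Za-z]{2,3}", subtags[0]):
        raise ValueError(f"unsupported primary language subtag: {subtags[0]!r}")
    parsed["language"] = subtags[0].lower()
    i = 1

    # Up to three extended language subtags, each exactly 3 letters
    while i < n and len(parsed["extlangs"]) < 3 and re.fullmatch(r"[A-Za-z]{3}", subtags[i]):
        parsed["extlangs"].append(subtags[i].lower())
        i += 1

    # Script: exactly 4 letters, conventionally title-cased
    if i < n and re.fullmatch(r"[A-Za-z]{4}", subtags[i]):
        parsed["script"] = subtags[i].title()
        i += 1

    # Region: 2 letters or 3 digits
    if i < n and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtags[i]):
        parsed["region"] = subtags[i].upper()
        i += 1

    # Variants: 5-8 alphanumerics, or 4 characters starting with a digit
    while i < n and re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", subtags[i]):
        parsed["variants"].append(subtags[i].lower())
        i += 1

    # Extensions: a single-character singleton (not 'x') plus 2-8 character subtags
    while i < n and re.fullmatch(r"[0-9A-WY-Za-wy-z]", subtags[i]):
        singleton = subtags[i].lower()
        i += 1
        parts = []
        while i < n and re.fullmatch(r"[A-Za-z0-9]{2,8}", subtags[i]):
            parts.append(subtags[i].lower())
            i += 1
        if not parts:
            raise ValueError(f"extension {singleton!r} has no subtags")
        parsed["extensions"][singleton] = parts

    # Private use: 'x' followed by everything that remains
    if i < n and subtags[i].lower() == "x":
        parsed["private_use"] = subtags[i + 1:]
        if not parsed["private_use"]:
            raise ValueError("private use section is empty")
        i = n

    if i != n:
        raise ValueError(f"unexpected subtag: {subtags[i]!r}")
    return parsed


print(parse_tag("zh-cmn-Hans-CN")["extlangs"])      # ['cmn']
print(parse_tag("en-US-u-islamcal")["extensions"])  # {'u': ['islamcal']}
```

Walking the subtags in order, rather than matching one large regular expression, keeps each rule readable and makes it clear that `u` in en-US-u-islamcal is an extension singleton, not a variant.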
- data/consolidated_langs.json: Contains language data in JSON format used for the lookup.
- data/region_names.json: Contains region data in JSON format used for the BCP47 lookup.
- data/script_codes.json: Contains script code data in JSON format used for the BCP47 lookup.
- data/deprecated-639-3.csv: Contains deprecated ISO 639-3 codes in CSV format, for quick reference.
There are known issues with the BCP47 parser in the web interface. It uses regular expressions to validate input. Tags such as the following are accepted (and more):
- en
- fr-CA
- i-klingon
- az-Arab-IR
- sr-Cyrl-RS
- zh-cmn-Hans
- ja-JP-x-tokyo
- uz-Cyrl-UZ-1992
- bo-Tibt-x-dialect
- zh-cmn-Hans-CN-x-private1
- hy-Latn-IT-arevela-x-test
- en-GB-oed-x-private
- de-CH-1901-co-phonebk-sc-gothic-x-bavaria
The following tags show problems:
- ca-valencia-nedis (Highlighted input section is missing "valencia")
- en-US-u-islamcal (Variant "u" and Extension "islamcal"; Extension section says "u - islamcal")
- es-419-fonipa (Extended languages blank)
- de-Latf-1901 (Region undefined)
- sl-rozaj (rozaj is coloured differently in the result container to how it is in the highlighted input section)
Contributions, issues, and feature requests are welcome!
Written by T E Vaughan
If you find this project useful, please consider sponsoring my work. <3
The codes used in this program conform to the following standards:
- ISO 639: Language codes
- ISO 3166-1 alpha-2: Country codes
- ISO 15924: Script codes
- RFC 1766: Tags for the Identification of Languages
- RFC 4646: Tags for Identifying Languages
- RFC 4647: Matching of Language Tags
This project is MIT licensed.