User:Conrad.Bot

From Wiktionary, the free dictionary

A bot using the pywiki framework, run by User:Conrad.Irwin.

Tasks

  • (Approved) Link fixing for deleted/deletable redirects.
  • (With consensus) Uploading the index files, see User:Conrad.Bot/Indexing.
  • (Without explicit approval) Replacing {{see}} with {{also}}. (Will only work on pages that start with {{see| and contain no other occurrences of {{see|, to avoid propagating formatting errors.) A bad idea...

Anagrams

Adding and updating ==Anagrams== sections in:

  • English
  • French
  • request your language here

For both of these languages:

  • Ignore anything containing a number, anything that looks like a prefix, suffix or interfix '(^-|- -|[0-9]|-$)', and anything that only has "{{misspelling of}}" definitions.
  • The normal form is the word in lower case, with all diacritics and all non-letters removed.
  • The base anagram is formed by sorting the normal form's letters into order. Anything that has the same base anagram but a different normal form is considered an anagram.
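
A rough Python sketch of the rules above, for illustration only (this is not the bot's actual code). The function names are made up here, the pattern is copied as quoted above, and stripping diacritics via Unicode NFD decomposition is an assumption about how "remove all diacritics" might be done. The "{{misspelling of}}" check needs the page text and is left out.

```python
import re
import unicodedata

# Pattern quoted in the rules above: numbers and affix-like titles are ignored.
SKIP = re.compile(r"(^-|- -|[0-9]|-$)")


def normal_form(word):
    """Lower-case the word, strip diacritics, drop all non-letters."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Combining marks are not letters, so isalpha() drops them as well.
    return "".join(ch for ch in decomposed if ch.isalpha())


def base_anagram(word):
    """Sort the letters of the normal form into order."""
    return "".join(sorted(normal_form(word)))


def anagrams_of(word, candidates):
    """Candidates with the same base anagram but a different normal form."""
    target_normal = normal_form(word)
    target_base = base_anagram(word)
    return [
        c for c in candidates
        if not SKIP.search(c)
        and base_anagram(c) == target_base
        and normal_form(c) != target_normal
    ]


if __name__ == "__main__":
    print(anagrams_of("listen", ["silent", "Listen", "enlist", "-ing"]))
```

Note that "Listen" is not reported as an anagram of "listen" here, because both share the same normal form.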

Indexing

Horribly out of date now...

This page may be out of date, but it should accurately reflect the current status when it was updated.

Languages

  • On multiple pages: Hungarian, Irish, Italian, Spanish, Galician, Ancient Greek, English, Lithuanian
  • On one page: Mapudungun, Hiligaynon

Overview

  1. create_indices.sh: Downloads the latest XML dump from http://devtionary.info/w/dump/xmlu and then runs the following programs.
  2. nicen.dump.awk: Normalize the XML dump, removing entries I am not interested in and formatting those that I am interested in more readably.
  3. extract_words.awk: Scan through the dump and add to a list every entry that contains at least one definition that doesn't look like a "form of" definition. This step also stores any audio files it finds, and notes whether the link will need a #Language anchor because the entry is not the first section on the page.
    • Entries whose only definition line consists entirely of a template (except {{SI unit}} and {{given name}}) are excluded.
    • Definitions that start with "compound of" are excluded.
    • Definitions that contain variations on "X form of", where X is present/perfect/plural/singular/past historic/preterite/compound/ending in ive, are excluded.
    • This is of course guesswork. If you notice words that should be in the index but aren't, or words that shouldn't be in the index but are, let me know.
  4. get_trans.py: Scan through the dump for every translation of a word into a language that is being indexed, and add those translations to the lists created in step 3.
    • This looks for any line starting with "*<Language name>:"
    • It discards everything in (brackets).
    • It will include anything in a {{t}} template or {{l}} template.
    • It will include any remaining links.
    • If the entire line looks like a valid term, then it will include the whole line.
  5. get_missing.py: (For some languages) scan through the current index for that language and add all words there to the list as "missing".
  6. split_index.<language name>.pl: Split the list for each language into files for each starting letter, corresponding to the list of entries on each page, and (for newly added languages) sort them and divide them by second letter.
  7. format_index.<language name>.pl: Format the per-letter lists into wikitext (for the older few languages, the sorting and splitting by second letter happens here).
  8. indexupload.py: Upload each formatted output file.
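
To make the translation-line rules in step 4 more concrete, here is a minimal Python sketch. It is a guess at the behaviour described, not the real get_trans.py: the function name and regular expressions are invented for illustration, the term is assumed to be the second parameter of {{t}} and {{l}}, optional whitespace after the asterisk is tolerated, and "the entire line looks like a valid term" is read as "nothing but letters, spaces, hyphens and apostrophes remains".

```python
import re

# Assumed shapes: {{t|fr|chien|m}} / {{l|fr|chien}} and [[chien]] / [[chien#French|chien]].
TEMPLATE = re.compile(r"\{\{[tl]\|[^|{}]*\|([^|{}]+)[^{}]*\}\}")
LINK = re.compile(r"\[\[([^\]|#]+)[^\]]*\]\]")
PLAIN_TERM = re.compile(r"[^\W\d_]+(?:[ '\-][^\W\d_]+)*")


def extract_translations(line, language):
    """Return the terms found on one translation line for `language`."""
    match = re.match(r"^\*\s*" + re.escape(language) + r":\s*(.*)$", line.strip())
    if not match:
        return []
    # Discard everything in (brackets).
    rest = re.sub(r"\([^()]*\)", "", match.group(1))
    terms = []

    def take(found):
        terms.append(found.group(1).strip())
        return ""

    rest = TEMPLATE.sub(take, rest)  # anything in {{t}} or {{l}}
    rest = LINK.sub(take, rest)      # any remaining links
    leftover = rest.strip(" ,;")
    # If what is left is just a bare term, take the whole thing.
    if not terms and PLAIN_TERM.fullmatch(leftover):
        terms.append(leftover)
    return terms


if __name__ == "__main__":
    print(extract_translations("* French: {{t|fr|chien|m}}, [[cabot]] (slang)", "French"))
```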

Sorting and splitting


For all languages, the strings are first normalized to lowercase. As I get round to it, I intend to rewrite the old-style ones as new style ones.

Ancient Greek

  • Remove all space and punctuation.
  • Treat any remaining non-alphabetic characters and 𐠀ϝϻϡϙ as 0.
  • Remove diacritics.
  • Split on first two characters.
  • Use el_EL.utf-8 to sort original strings.

Galician

  • (old style)
  • Remove all diacritics (except ñ).
  • Treat non-alphabetic characters as 0.
  • Split on first two characters.
  • Sort on normalised form.

Hungarian

  • (old style)
  • Replace á é í ó ú ő ű with a e i o u ö ü.
  • Treat non-alphanumeric characters as 0.
  • Split on first two (cs|gy|ly|ny|sz|ty|zs|[[:alpha:]0])
  • Sort on normalised form.
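
The Hungarian rules can be sketched in Python roughly as follows. This is an illustration under assumptions, not the Perl the bot runs: the function names are invented, Python's Unicode letter class stands in for [[:alpha:]], and digits (which the rules keep as alphanumeric but the split pattern does not match) are simply skipped when building the split key.

```python
import re

# á é í ó ú ő ű are folded onto a e i o u ö ü.
FOLD = str.maketrans("áéíóúőű", "aeiouöü")
# One "unit" is a digraph, a single letter, or 0; digraphs are tried first.
UNIT = re.compile(r"cs|gy|ly|ny|sz|ty|zs|[^\W\d_]|0")


def normalise(word):
    """Lower-case, fold long vowels, turn non-alphanumerics into 0."""
    folded = word.lower().translate(FOLD)
    return "".join(ch if ch.isalnum() else "0" for ch in folded)


def split_key(word):
    """The first two units of the normalised form."""
    return "".join(UNIT.findall(normalise(word))[:2])


def build_index(words):
    """Group words by split key, sorted on the normalised form."""
    sections = {}
    for word in sorted(words, key=normalise):
        sections.setdefault(split_key(word), []).append(word)
    return sections


if __name__ == "__main__":
    print(build_index(["szép", "ágy", "agy", "cica"]))
```

Because the digraphs count as one unit, "szép" lands under "sze" rather than "sz".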

Irish

  • Remove all space and punctuation.
  • Treat any remaining non-alphabetic characters as 0.
  • Remove diacritics.
  • Remove any leading an.
  • Split on first two characters.
  • Use gl_GL.utf-8 to sort original string.

Italian

  • (old style)
  • Remove all diacritics.
  • Remove any leading a.
  • Treat non-alphabetic characters as 0.
  • Split on first two characters.
  • Sort on normalised form.

Spanish

  • Remove all space and punctuation.
  • Treat any remaining non-alphabetic characters as 0.
  • Split on first two (ñ|ll|ch|[[:alpha:]0])
  • Use es_ES.utf-8 to sort original string.
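
A comparable sketch for the new-style Spanish rules, again only as an illustration with invented function names. Space and punctuation are identified here by Unicode category, which is an assumption, and the locale-based sort falls back to plain code-point order when es_ES.utf-8 is not installed on the machine.

```python
import locale
import re
import unicodedata

# One "unit" is ñ, a digraph, a single letter, or 0.
UNIT = re.compile(r"ñ|ll|ch|[^\W\d_]|0")


def normalise(word):
    """Drop space and punctuation, map other non-letters to 0."""
    out = []
    for ch in word.lower():
        if ch.isspace() or unicodedata.category(ch).startswith(("Z", "P")):
            continue  # space and punctuation are removed outright
        out.append(ch if ch.isalpha() else "0")
    return "".join(out)


def split_key(word):
    """The first two units of the normalised form."""
    return "".join(UNIT.findall(normalise(word))[:2])


def sort_words(words, locale_name="es_ES.utf-8"):
    """Sort the original strings with the named locale's collation."""
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(words)  # fallback: locale not available here
    return sorted(words, key=locale.strxfrm)


if __name__ == "__main__":
    for word in sort_words(["llama", "chico", "año", "¡hola!"]):
        print(split_key(word), word)
```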

Formatting


Currently all languages are treated about the same:

  • Strikethrough links that were added as "missing" from the indexes.
  • Add an {{audio-list}} for an audio file, if one was found.
  • Abbreviate PoS and add that in italics.
  • Add an * linked to any entries which contained the word.
  • Add #<language name> to links that were not the first on the page.
  • Put the lists (#-lists) into a <div class="index"></div> separated by ===-headings and a table of contents.
    • This means that the lists run horizontally, so they can change width to fill the space available to them, and users can keep scrolling downwards without having to go back up to find the next column.

Old Stuff


Wiktionary: cross-namespace redirects


Will replace all links to these except for:


as they are widely known and linked to.

Hungarian Indexes

Need to move these first.

Won't touch

Totals


Before pruning: 15839

After pruning: 7748 links to fix (on 7343 pages).