
Hamshahri Corpus

From Wikipedia, the free encyclopedia

The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG)[1] of the University of Tehran. Later, a team headed by Abolfazl AleAhmad[2] built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks.

This corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages to produce a standard text corpus for modern information retrieval experiments.
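The HTML-processing step described above can be illustrated with a minimal sketch. It is not the pipeline used by the corpus authors, which the source does not describe; it only shows one common way to reduce a crawled HTML page to plain text using Python's standard library. The names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script and style content.

    Hypothetical helper for illustration; not part of the Hamshahri tooling.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []      # non-empty text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        # Join fragments with single spaces.
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `html_to_text("<p>پیکره همشهری</p>")` returns the Persian text without the surrounding markup. A real crawl would also need to isolate the article body from navigation and advertising, which this sketch does not attempt.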

Version 1.0


The collection contains more than 160,000 articles covering subject categories such as politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, and sports. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average size of 1.8 KB.

The corpus is available in several formats for download:[2]

  • Tagged Text: 560 MB
  • In SQL Server 2000 Tables: 712 MB

Version 2.0


The second release of the Hamshahri Corpus was launched on 20 October 2008. It offers several new features and improvements:

  • More News: 323,616 text stories in 3,206 XML files (one file for each day)
  • Increased Time Span: from 22 June 1996 to 13 May 2007
  • Bigger in Size: 1.42 GB uncompressed
  • Standard Container: Unicode XML
  • Included Images: images have been extracted from the news and preserved (available in an additional package), which makes the corpus suitable for image retrieval tasks.
  • Categorized News: the news stories have been categorized semi-automatically (appropriate for text categorization and classification tasks).

The corpus is available for download in XML format.
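As a rough illustration of working with one daily XML file, the sketch below counts stories per category using Python's standard library. The element names `DOC` and `CAT`, the `CORPUS` root, and the sample content are assumptions made for this example; the source does not document the actual schema, so the tag names should be adjusted to match the downloaded files.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_categories(xml_text, doc_tag="DOC", cat_tag="CAT"):
    """Count stories per category in one XML document.

    doc_tag and cat_tag are hypothetical element names, not a confirmed schema.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for doc in root.iter(doc_tag):
        # findtext returns the default when the category element is absent.
        category = doc.findtext(cat_tag, default="").strip()
        counts[category or "(uncategorized)"] += 1
    return counts


# Invented sample standing in for one daily file; not real corpus data.
SAMPLE = """
<CORPUS>
  <DOC><DOCID>1</DOCID><CAT>sports</CAT><TEXT>story one</TEXT></DOC>
  <DOC><DOCID>2</DOCID><CAT>politics</CAT><TEXT>story two</TEXT></DOC>
  <DOC><DOCID>3</DOCID><CAT>sports</CAT><TEXT>story three</TEXT></DOC>
</CORPUS>
"""

category_counts = count_categories(SAMPLE)
```

To process a file on disk rather than a string, `ET.parse(path).getroot()` can replace `ET.fromstring`. Summing the counts over all daily files would give a per-category distribution for text classification experiments.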


References

  1. ^ DBRG News. Archived 2017-05-15 at the Wayback Machine. Database Research Group.
  2. ^ a b Hamshahri. Archived 2017-05-14 at the Wayback Machine. Database Research Group.