Web Archive Discovery

These are the components we use to data-mine and index our ARC and WARC files and make the contents explorable and discoverable.

Documentation

See the wiki.

Running the development Opensearch Server

The Opensearch part is also usable with Elasticsearch 7.10.2 and may be usable with older versions (with minor modifications). You can start it with the provided docker-compose file. After checking out the repository, run the following steps in a shell:

$ cd warc-indexer/src/main/opensearch/os1
$ docker-compose up -d

Initialize the index

To use the cluster you need to create an index. You can do so by calling:

$ curl --insecure --user admin:admin -H 'Content-Type: application/json' -XPUT https://localhost:9200/warcdiscovery/  -d @schema.json

This call creates the index from schema.json, which you can then use with warcindexer. You can delete the index by calling:

$ curl --insecure --user admin:admin -XDELETE https://localhost:9200/warcdiscovery
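To confirm that the index was created (or deleted), you can ask the cluster whether it exists. The following is a minimal Python sketch, not part of this repository. It assumes the URL, index name, and admin:admin credentials from the curl commands above; the helper names build_request and index_exists are hypothetical. A HEAD request on the index returns 200 if it exists and 404 if it does not.

```python
import base64
import ssl
import urllib.error
import urllib.request

# Assumed values, copied from the curl commands above.
BASE_URL = "https://localhost:9200"
INDEX = "warcdiscovery"


def build_request(path, user="admin", password="admin", method="GET"):
    """Build a request with HTTP Basic auth, like curl's --user option."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(f"{BASE_URL}/{path}", method=method)
    req.add_header("Authorization", f"Basic {token}")
    return req


def index_exists(index=INDEX):
    """Return True if the index exists (HTTP 200), False if it does not (HTTP 404)."""
    # Equivalent of curl's --insecure: skip certificate verification.
    # Only appropriate for the local development cluster.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        request = build_request(index, method="HEAD")
        with urllib.request.urlopen(request, context=ctx) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Calling index_exists() after the PUT should return True, and False again after the DELETE, provided the development cluster is running.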

Solr-schema ported to Opensearch

The Solr schema was ported to Opensearch as closely as possible. There are just a few small differences:

The default value "NOW" of index_time is set by the warcindexer.
The default value "other" of content_type_norm is set by the warcindexer.
The content field must be indexed; otherwise position_increment_gap cannot be used in Elasticsearch.
Only ssdeep_hash_bs_* is defined as a dynamic field; the institution-specific values were skipped, but they can be added easily.
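The first two differences mean that defaults which the Solr schema used to supply must now be filled in before a document is sent to the index. The sketch below illustrates that idea in Python. It is not the warcindexer's actual code (which is Java); the function name apply_defaults and the timestamp format are assumptions made for illustration.

```python
from datetime import datetime, timezone


def apply_defaults(doc, now=None):
    """Return a copy of doc with the defaults the Solr schema used to supply."""
    result = dict(doc)
    # Solr's default "NOW": stamp the document with the current UTC time.
    # The ISO-8601 format used here is an assumption.
    if "index_time" not in result:
        moment = now or datetime.now(timezone.utc)
        result["index_time"] = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Solr's default "other": used when no normalised content type is set.
    if not result.get("content_type_norm"):
        result["content_type_norm"] = "other"
    return result
```

Fields that are already present are left untouched, so the function can be applied to every document without overwriting values determined during analysis.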

Indexing a WARC file

Use the following command to populate the Opensearch index:

$ java -jar target/warc-indexer-*-jar-with-dependencies.jar -e https://localhost:9200/warcdiscovery/ --user admin --password admin src/test/resources/wikipedia-mona-lisa/flashfrozen-jwat-recompressed.warc.gz
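One way to check that the indexer wrote anything is to query the index's _count endpoint, which returns a JSON body containing a count field. The following Python sketch assumes the same index URL and admin:admin credentials as the command above; the function names parse_count and count_documents are hypothetical. Newly indexed documents may take a moment to become visible, until the index has been refreshed.

```python
import base64
import json
import ssl
import urllib.request


def parse_count(body):
    """Extract the document count from a _count response body."""
    return int(json.loads(body)["count"])


def count_documents(index_url, user="admin", password="admin"):
    """Ask the cluster how many documents the index currently holds."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(index_url.rstrip("/") + "/_count")
    req.add_header("Authorization", f"Basic {token}")
    # Skip certificate verification, like curl's --insecure.
    # Only appropriate for the local development cluster.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp:
        return parse_count(resp.read())
```

For example, count_documents("https://localhost:9200/warcdiscovery/") should return a number greater than zero once the WARC file has been indexed.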

License

The project as a whole is licensed under the GNU General Public License Version 2, but some sub-components are licensed under the Apache Software License, Version 2.0.

