
[ES-M3]: Investigate how search could work by label and aliases on the EntitySchema expert
Closed, Resolved Public

Description

Once the EntitySchema expert is created in T362004, we would like to make it easier for users to search for EntitySchemas by label and aliases.

Before we do this, we must investigate how search could work by label and aliases on the EntitySchema expert.

Acceptance Criteria

  • A technical direction is documented on how we can enable search by label and aliases on the EntitySchema expert

Notes
This should be timeboxed before we begin the investigation

Event Timeline

Arian_Bozorg renamed this task from [ES-M2]: Investigate how search could work by label and aliases on the EntitySchema expert to [ES-M3]: Investigate how search could work by label and aliases on the EntitySchema expert. May 29 2024, 11:17 AM

Change #1056116 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/EntitySchema@master] [PoC] Use Wikibase Lib's term store in ES

https://gerrit.wikimedia.org/r/1056116

Hm, I’ll admit this isn’t what I expected from this task 😅 I’ll leave some comments here rather than on Gerrit, since they’re pretty general:

  • I assumed the main subject of this investigation would be ElasticSearch integration – after all, we’re not using the term store for searching Items and Properties in production. Did you look into ElasticSearch, or other alternative approaches to the term store? Are there known reasons why we should use the term store instead? (The main downside of the term store that I’m aware of is that any search based on it is case sensitive, at least with the current implementation. I think there are ways to make term store search case insensitive, and they would actually be useful to some third-party Wikibases, but they haven’t been a product priority so far.)
  • The attached patch goes a lot further than I would expect from an investigation task – if we can use it, that’s really cool!
  • I’m quite happy to see the term store being used in an extension, because in my view this paves the way towards using the term store for showing links to other pages – which is something that, at least for Lexemes, I assume we want to do sooner or later (rather than the current approach of actually loading and parsing the full Lexeme / EntitySchema). The biggest TODO left in the PoC change is probably the FindUnusedTermTrait::findActuallyUnusedTermInLangIds() hook (or similar) that you already mention in the commit message.

> Hm, I’ll admit this isn’t what I expected from this task 😅 I’ll leave some comments here rather than on Gerrit, since they’re pretty general:
>
>   • I assumed the main subject of this investigation would be ElasticSearch integration – after all, we’re not using the term store for searching Items and Properties in production. Did you look into ElasticSearch, or other alternative approaches to the term store? Are there known reasons why we should use the term store instead? (The main downside of the term store that I’m aware of is that any search based on it is case sensitive, at least with the current implementation. I think there are ways to make term store search case insensitive, and they would actually be useful to some third-party Wikibases, but they haven’t been a product priority so far.)

I briefly looked into this and I thought making the WikibaseCirrusSearch extension work with pseudo-entities would be too much work. Maybe I'm missing something or the code is not as tightly bound to Wikibase (mostly data model) as it seemed to me, but forking the extension for ES might have been the only path forward here, and I don't think that's worthwhile.

For the number of entity schemas we will have for the foreseeable future I think this should work well enough.

>   • The attached patch goes a lot further than I would expect from an investigation task – if we can use it, that’s really cool!

I tried to make sure everything (except pruning, see the todos) can be made to work.

>   • I’m quite happy to see the term store being used in an extension, because in my view this paves the way towards using the term store for showing links to other pages – which is something that, at least for Lexemes, I assume we want to do sooner or later (rather than the current approach of actually loading and parsing the full Lexeme / EntitySchema). The biggest TODO left in the PoC change is probably the FindUnusedTermTrait::findActuallyUnusedTermInLangIds() hook (or similar) that you already mention in the commit message.

Indeed, but I think this shouldn't be particularly hard… we probably only need a hook to provide us with the tables+columns to check.

> I briefly looked into this and I thought making the WikibaseCirrusSearch extension work with pseudo-entities would be too much work. Maybe I'm missing something or the code is not as tightly bound to Wikibase (mostly data model) as it seemed to me, but forking the extension for ES might have been the only path forward here, and I don't think that's worthwhile.

I think forking it is more or less what I had in mind, though we could include the code directly in EntitySchema without making it a separate extension (WikibaseMediaInfo also does it this way, I believe).

If we go with the term store – is it okay for product that the EntitySchema search is case-sensitive? Or should we try to add a case-insensitive search based on the term store? (I think it would be relatively doable in the current SQL schema by introducing normalized labels/aliases as a separate term type, like how wb_terms used to have term_search_key in addition to term_text, pointing to the same text_in_lang/text tables. But it would take some more work, of course.)
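The "normalized labels/aliases as a separate term type" idea could be sketched roughly as below. This is only an illustrative in-memory model, not the actual term store schema: `TermRow`, `normalize`, the `_search_key` suffix, and the IDs `E10`/`E42` are all hypothetical names chosen for the example, and case folding stands in for whatever normalization would really be used.

```python
from dataclasses import dataclass

# Hypothetical suffix marking the extra, normalized term types.
SEARCH_KEY_SUFFIX = "_search_key"


@dataclass(frozen=True)
class TermRow:
    """One simplified term row: which entity, which term type, language, text."""
    entity_id: str
    term_type: str
    language: str
    text: str


def normalize(text: str) -> str:
    """Assumed normalization: trim whitespace and case-fold."""
    return text.strip().casefold()


def rows_for_terms(entity_id, language, label, aliases):
    """Store the display terms plus one normalized search-key row per term."""
    rows = [TermRow(entity_id, "label", language, label)]
    rows += [TermRow(entity_id, "alias", language, alias) for alias in aliases]
    keys = [
        TermRow(r.entity_id, r.term_type + SEARCH_KEY_SUFFIX, r.language, normalize(r.text))
        for r in rows
    ]
    return rows + keys


def search(rows, query, language):
    """Case-insensitive prefix search that only looks at the search-key rows."""
    needle = normalize(query)
    hits = []
    for row in rows:
        if (
            row.term_type.endswith(SEARCH_KEY_SUFFIX)
            and row.language == language
            and row.text.startswith(needle)
            and row.entity_id not in hits
        ):
            hits.append(row.entity_id)
    return hits


store = rows_for_terms("E10", "en", "Human", ["Person"])
store += rows_for_terms("E42", "en", "human settlement", [])
print(search(store, "HUM", "en"))
```

The display rows stay untouched, so anything that renders labels keeps the original casing; only the search path reads the normalized rows.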

> If we go with the term store – is it okay for product that the EntitySchema search is case-sensitive? Or should we try to add a case-insensitive search based on the term store? (I think it would be relatively doable in the current SQL schema by introducing normalized labels/aliases as a separate term type, like how wb_terms used to have term_search_key in addition to term_text, pointing to the same text_in_lang/text tables. But it would take some more work, of course.)

I fear it would be too confusing if case-insensitive search works when making some statements but not others. It will probably lead people to assume that an EntitySchema doesn't exist and, in the worst case, to create a duplicate one.

This ticket is marked as ready for peer review, but peer review seems already to have started. @hoo will you move this back into development based on the feedback?

hoo removed hoo as the assignee of this task. Aug 21 2024, 7:32 PM
hoo subscribed.

Due to problems with my local development setup (I had huge trouble setting up WikibaseCirrusSearch… while not a strict requirement for this task, I figured I should really get that to work), I didn't manage to make significant progress here.

I can go back to this next week or someone else can pick it up at will.

Change #1071938 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/EntitySchema@master] [PoC] Use CirrusSearch/ Elastic for label search

https://gerrit.wikimedia.org/r/1071938

I have created an initial PoC which uses CirrusSearch for labels only, similar to WikibaseCirrusSearch (but unlike e.g. WikibaseMediaInfo we can't reuse any of its infrastructure). This is mostly code copied over from WikibaseCirrusSearch and modified to both not interfere with its search fields (as we can't re-use them, because WikibaseCirrusSearch expects actual entity-entities) and to allow working with non-entity entities.

The code is wired up via EntitySchemaContentHandler in a manner similar to Wikibase's EntityHandler (but incomplete for now, see TODOs).

This is how we could tackle this:

  1. Create (by copying and removing Wikibase dependencies from) the relevant field definitions for labels (for other entity types the labels and labels_all fields; labels_all is used for supporting language fallback)
    1. This is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1071938/1/src/LabelsField.php and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1071938/1/src/AllLabelsField.php in the PoC
  2. Wire these up in EntitySchemaContentHandler. With that done we will have our labels indexed, ready to be queried.
    1. This is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1071938/1/src/MediaWiki/Content/EntitySchemaContentHandler.php in the PoC
  3. Create a new EntitySearchHelper implementation that can be switched in via configuration, similar to EntitySearchElastic in the WikibaseCirrusSearch extension.
    1. In the PoC this is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1071938/1/src/Wikibase/Search/EntitySchemaSearchHelper.php (although, while working, this is fairly incomplete).
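To make the three steps a little more concrete, here is a rough, language-neutral sketch in Python (the PoC itself is PHP). It only illustrates the shape of the idea: build per-language label fields plus a catch-all field, then query them through a helper chosen by configuration. Everything here is assumed for illustration: `build_label_fields`, `LabelIndexSearchHelper`, `make_search_helper`, the `search_helper` config key, and the in-memory "index" are hypothetical and do not mirror the CirrusSearch or Wikibase APIs.

```python
def build_label_fields(labels):
    """Steps 1 and 2: turn {language: label} into the fields to be indexed.

    "labels" keeps one entry per language; "labels_all" collects every label
    so a query can still match when the requested language has no label.
    """
    per_language = {lang: labels[lang] for lang in sorted(labels)}
    return {"labels": per_language, "labels_all": list(per_language.values())}


class LabelIndexSearchHelper:
    """Step 3: a search helper reading the indexed fields (in memory here)."""

    def __init__(self, documents):
        self.documents = documents  # {schema_id: indexed fields}

    def search(self, text, language, limit=10):
        needle = text.casefold()
        if not needle:
            return []
        # First preference: prefix matches in the requested language.
        in_language = [
            schema_id
            for schema_id, doc in self.documents.items()
            if doc["labels"].get(language, "").casefold().startswith(needle)
        ]
        # Fallback: prefix matches on any label via the catch-all field.
        fallback = [
            schema_id
            for schema_id, doc in self.documents.items()
            if schema_id not in in_language
            and any(label.casefold().startswith(needle) for label in doc["labels_all"])
        ]
        return (in_language + fallback)[:limit]


# Hypothetical registry so the implementation can be switched via configuration.
SEARCH_HELPERS = {"label_index": LabelIndexSearchHelper}


def make_search_helper(config, documents):
    return SEARCH_HELPERS[config["search_helper"]](documents)


docs = {
    "E1": build_label_fields({"en": "Human", "de": "Mensch"}),
    "E2": build_label_fields({"de": "Humboldt-Schema"}),
}
helper = make_search_helper({"search_helper": "label_index"}, docs)
print(helper.search("hum", "en"))
```

Matches in the requested language are ranked first and the catch-all field only fills in afterwards, which is one simple way to approximate language fallback; a real implementation would express this as query clauses rather than Python loops.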

I think this is not going to affect Special:Search at all, as we are (forced to be) using different indexes, so haslabel:… won't work for EntitySchemas (but that is beyond the scope of this ticket).

hoo claimed this task.

See T375641 for implementing this, based on my findings from creating the WikibaseCirrusSearch-based proof of concept.