
Research Spike: Add a Link (Structured Task): remove link suggestions for certain words or Wikidata properties
Closed, Resolved | Public | 5 Estimated Story Points

Description

User story & summary:

As a Wikimedian, I want the "Add a link" task to follow the Manual of Style (MOS) guidelines and the specific norms of my wiki. In particular, I want to prevent recurring problematic link suggestions so that newcomers aren't repeatedly prompted to link words or phrases that are inappropriate, unnecessary, or against community standards.

Research Goals

There are multiple ways to improve "Add a link" to better align with wiki-specific guidelines, each with trade-offs. This timeboxed research spike will allow engineers to:

  • Explore possible solutions.
  • Assess feasibility, pros, and cons.
  • Identify major dependencies (e.g., Research or Machine Learning team support).
  • Provide short-term recommendations (for Q4) and long-term considerations (for the next fiscal year).
Background / User Problem:

This task is critical because the English Wikipedia community identified compliance with MOS:OL as a blocker to expanding the "Add a link" feature to more editors:

Please implement some means of complying more broadly with MOS:OL. :meta:Research:Link recommendation model for add-a-link structured task § Hard-coded rules for (not) linking is a great start, but continued non-compliant suggestions are what in my estimation will create the most community pushback, since they exhibit all of the following: contravene longstanding community consensus; generate maintenance burden; forewarned of; already documented during trial period; obviously solveable in software somehow. I'm not sure I'd be comfortable going over 10% deployment before this is addressed, but others might feel differently.

from enwiki discussion: Wikipedia talk:Growth Team features

This request is also emerging at several other wikis:

Just a few days ago I received another suggestion for "Lika", which I have denied for several years... What the function needs is a list of words that should never be suggested.

from svwiki discussion: Wikipedia:Bybrunnen

Potential Approaches

There are several ways to make improvements, each with pros and cons. This timeboxed research spike gives engineers time to examine the issue in depth, make sure all options are considered, and start to understand the trade-offs and time investment of each approach.

1. Update hardcoded rules and retrain the model

2. Leverage Wikidata-based filtering

  • Adapt the logic used in the current hardcoded rules.
  • Instead of static rules, allow communities to configure a list of P31 (instance-of) properties from Wikidata to exclude.
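
As a rough illustration of approach 2, the sketch below drops link candidates whose target item is an instance of an excluded class. The configuration shape, the `Candidate` type, and the function name are all hypothetical, and the sketch assumes the P31 values of each target have already been looked up; it is not how mwaddlink currently implements its hardcoded rules.

```python
from dataclasses import dataclass

# Hypothetical per-wiki configuration: P31 (instance-of) values whose
# instances should never be suggested as link targets.
EXCLUDED_P31_BY_WIKI = {
    "enwiki": {"Q6256"},  # Q6256 = country
}


@dataclass
class Candidate:
    """A link candidate (illustrative type, not the real model's data structure)."""
    anchor_text: str
    target_title: str
    # P31 values of the target's Wikidata item, assumed to be resolved upstream.
    instance_of: frozenset = frozenset()


def filter_by_instance_of(wiki, candidates):
    """Keep only candidates whose target shares no P31 value with the wiki's exclusions."""
    excluded = EXCLUDED_P31_BY_WIKI.get(wiki, set())
    return [c for c in candidates if not (c.instance_of & excluded)]


candidates = [
    Candidate("France", "France", frozenset({"Q6256"})),
    Candidate("Marie Curie", "Marie Curie", frozenset({"Q5"})),  # Q5 = human
]
kept = filter_by_instance_of("enwiki", candidates)
```

With this sample data, only the "Marie Curie" candidate survives on enwiki, while a wiki with no configured exclusions keeps both.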

3. Community-configurable "Never Link" list

  • Enable wikis to define a list of terms that should never be suggested.
  • Example: Swedish Wikipedia could add "Lika," while English Wikipedia might exclude all country names.

4. Model retraining based on rejection data

  • Ensure the Link Suggestion model is retrained regularly.
  • Ensure the Link Suggestion model learns from past rejected and/or reverted link suggestions.
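
One possible way to turn rejection data into a usable signal is to aggregate review outcomes per anchor text and flag terms that are rejected or reverted most of the time. The sketch below is only an assumption about what such a step could look like: the `(anchor_text, outcome)` record shape, the outcome labels, and the thresholds are hypothetical, and it does not retrain the model itself.

```python
from collections import Counter


def frequently_rejected(reviews, min_reviews=20, min_rejection_rate=0.8):
    """Return {anchor_text: rejection_rate} for anchors that are mostly rejected.

    `reviews` is an iterable of (anchor_text, outcome) pairs, where outcome is
    one of "accepted", "rejected" or "reverted" (hypothetical labels).
    """
    total = Counter()
    negative = Counter()
    for anchor, outcome in reviews:
        key = anchor.casefold()  # treat "Lika" and "lika" as the same term
        total[key] += 1
        if outcome in ("rejected", "reverted"):
            negative[key] += 1

    flagged = {}
    for key, count in total.items():
        rate = negative[key] / count
        # Require a minimum sample size so one-off rejections are ignored.
        if count >= min_reviews and rate >= min_rejection_rate:
            flagged[key] = rate
    return flagged


sample = [("Lika", "rejected")] * 9 + [("lika", "accepted")] + [("Paris", "accepted")] * 10
flagged = frequently_rejected(sample, min_reviews=10, min_rejection_rate=0.8)
```

The flagged terms could feed either a retraining job or a reviewable "never link" candidate list; which of the two is preferable is left open here.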

5. Other possible solutions

  • [Insert additional ideas—there may be other creative approaches worth considering!]
Acceptance Criteria:
  • Document the problem space, including feasibility, trade-offs, and implementation complexity.
  • Identify whether this work requires external team support (e.g., Research, Machine Learning).
  • Recommend next steps for Q4 and the next fiscal year.

Timeboxed research spike: 5 days.

Event Timeline

KStoller-WMF removed KStoller-WMF as the assignee of this task.
KStoller-WMF triaged this task as High priority.
KStoller-WMF removed a project: Epic.
KStoller-WMF updated the task description.
KStoller-WMF moved this task from Inbox to Needs Discussion on the Growth-Team board.
KStoller-WMF added a subscriber: Sgs.
KStoller-WMF renamed this task from "Research Spike: Add a Link (Structured Task): disallow linking for certain words or instances" to "Research Spike: Add a Link (Structured Task): remove link suggestions for certain words or Wikidata properties". (Apr 1 2025, 11:39 AM)
KStoller-WMF set the point value for this task to 5.
KStoller-WMF moved this task from Needs Discussion to Up Next on the Growth-Team board.

I believe there are two main lines of work derived from this task: (1) improve the model per se, and (2) improve how often and how quickly the model can be re-trained based on some input. Expanding on these two below.

(1) Improve the model

Potential approaches (1), (2) and probably (3) require the model to be re-trained based on new input. Every type of new input we are exploring requires code changes to the model: a new hard-coded rule, a Wikidata entity deemed useful, or a "Never link" word. Feedback from Research folks would be appreciated to make sure the new rules or Wikidata entities we want to introduce make sense across wikis/languages. Aside from that, the changes seem simple enough to be handled between Growth and Research. The "Never link" word input is a bit trickier, as it needs to be applied on a per-language-model basis. I believe that's currently possible since we have one model per language, but I'm unsure where in the model that should be done. In any case, we would need to make the wiki configuration available to the model training process, which is not currently the case.

Note on the "Never link" words approach: we could filter these words downstream, either in the mwaddlink API call or in GrowthExperiments. This does not seem very efficient, since the model would still generate suggestions that are then discarded, but it's cheap to implement.
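
A minimal sketch of that downstream filter, assuming each suggestion is a dictionary with a `link_text` field holding the anchor text. The field names and the function name are assumptions made for illustration and are not taken from the actual mwaddlink response or GrowthExperiments code.

```python
def drop_never_link(suggestions, never_link_terms):
    """Remove suggestions whose anchor text is on the wiki's "Never link" list.

    Matching is case-insensitive and exact on the whole anchor text.
    """
    blocked = {term.casefold() for term in never_link_terms}
    return [s for s in suggestions if s["link_text"].casefold() not in blocked]


# Hypothetical suggestions as they might arrive for a svwiki article.
suggestions = [
    {"link_text": "Lika", "link_target": "Lika"},
    {"link_text": "Stockholm", "link_target": "Stockholm"},
]
remaining = drop_never_link(suggestions, ["lika"])
```

Because the filter runs after the model, an article can end up with fewer suggestions than requested, which is one concrete form of the inefficiency mentioned above.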

(2) Improve how often and how quickly the model can be re-trained based on some input

Whatever changes we make to the model, it will need re-training and deployment (there's a good summary of the steps involved in T388258#10707714). The main problem is that this is currently a manual operation that has to be done per language, and some languages produce errors (see T387556). In my opinion this is a blocker for allowing the model to be re-trained on any "dynamic" data, e.g. community-configured "Never link" words or rejection data. The relevant task to unblock this would be T388258: Make airflow-dag for addalink training pipeline output compatible with deployed model. Once that's resolved we can consider allowing such kinds of inputs into the model.

Lastly, even with an airflow-dag driven model, we would need to think about how to invalidate and re-populate the recommendations pool. As things are now, it can take time for changes made in the model to become visible in the recommendations feed. More details on this coming.

Posting some thoughts from a review we had with @Sgs around the spike suggestions.

(1) Improve the model

This approach would be suitable for getting the Add Link improvements out to enwiki faster, but it wouldn't scale across other wikis for the reasons posted in the findings, mostly because the model has to be run manually every time another "Never Link" criterion is defined on a wiki. We could move ahead with this approach and improve the model for enwiki so that we can offer Add a Link to more enwiki users, but decide not to use it to scale to other wikis because of the degree of manual labour involved.

(2) Improve how often and how quickly the model can be re-trained based on some input

This approach of automating training across different wikis is definitely the ideal one, but it would eat into the time available for rolling out the Add Link improvements within the hypothesis time frame. It relies on the training automation work in T388258: Make airflow-dag for addalink training pipeline output compatible with deployed model, as well as on reworking how we intend to serve recommendations, since training output would now be served differently. Rather than trying to tackle this within a sprint or two, this approach would be better served as its own hypothesis around optimising link recommendation training and serving, so that it can scale to other wikis with little to no manual involvement.

Thanks, @Sgs! Between your feedback in this task and the discussions from the Add a Link diagramming session, I think I have what I need to propose a path forward. My current thinking:

  1. Given the fact that in the medium to long-term there are some major changes to the Add a Link model being planned, it’s not a great time for us to invest a lot in a new Community Configuration filter or “never link” list. So I’m not suggesting we add any new Community Configurable Wikidata-based filtering or “Never Link” list.
  2. It seems unlikely that we will spend the time to manually retrain all of the existing models, so it seems safe as a short-term approach to just adjust the hardcoded list and retrain enwiki to unblock the rollout there. AKA move forward with approach 1 from: T386867: Add a Link: add "do not link" rule for country names (Q6256) on English Wikipedia
  3. We should continue to discuss and partner with Machine Learning and Research to ensure their next steps are in alignment with what will work for the Growth task and Newcomers as a whole. Related: T393474: FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment.