Specific issues where an Analytics dataset has incorrect, missing, or malformed data or shows an anomaly which might be caused by such data. Not for general work on data quality processes or monitoring.
(Project tag requested in T362839.)
Agreed, they look like bots.
We are working on an improvement to the bot detection pipeline currently.
Maybe after the update, these requests won't exist any more.
Nevertheless, it makes sense to normalize, since we are planning to backfill unique devices numbers soon.
This was a known change when we migrated from Varnish to HAProxy. We decided not to normalize the hosts, to keep the data as close as possible to the source.
My assumption is that hits where the domain has a trailing dot are most probably bots. I'd be super happy to be proven wrong and asked to normalize though :)
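For reference, the normalization being discussed could be as small as the sketch below. This is only an illustration: `normalize_host` is a hypothetical helper name, not something from the existing pipeline, and the lowercasing step is an extra assumption (hostnames are case-insensitive) that goes beyond the trailing-dot issue raised here.

```python
def normalize_host(host: str) -> str:
    """Return the hostname without a trailing root dot.

    Hypothetical helper for illustration; not part of the actual pipeline.
    """
    # Assumption: surrounding whitespace and letter case carry no meaning.
    host = host.strip().lower()
    # "en.wikipedia.org." and "en.wikipedia.org" name the same host;
    # the trailing dot only marks the name as fully qualified.
    if host.endswith("."):
        host = host[:-1]
    return host


if __name__ == "__main__":
    for raw in ["en.wikipedia.org.", "en.wikipedia.org", "EN.Wikipedia.org."]:
        print(raw, "->", normalize_host(raw))
```

Only a single trailing dot is removed, so a malformed value such as `example.org..` stays visibly malformed instead of being silently cleaned up.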
@mforns please take a look
This has been completed by rerunning the Airflow DAGs for that snapshot. Thanks to the Data Engineering ops team!
(For some projects, 0 was reported, up to somewhere around June 2.)
Changed task name to "Sharp spike.."
Might be easier to do this through Airflow. Waiting for the DAG errors to go away first, though.
I checked Wikistats and I am not seeing zero unique devices for any of the projects. Yes, they are inflated for May 2025 and we are investigating, as mentioned above, but could we please change the title and description of this task to reflect the issue correctly?
In T388825#10861480, @mpopov wrote:<snip /> I think the behavior here is that when events are logged, meta.domain is set to whatever the hostname is <snip />
I'll flag this to Experiment Platform, since we're maintainers of EventLogging and I think the behavior here is that when events are logged, meta.domain is set to whatever the hostname is, but we should discuss whether that is appropriate behavior for an instrument running on auth.wikimedia.org
@OSefu-WMF can you help assess the effort, impact and prioritize?
@Mayakp.wiki weird, we did comprehensively backfill, including Druid.
Is there a way you can verify with the raw data?
At first thought it's not, because the October issue focussed on Wikipedia unique devices (T373630#10177892). This one affected Wikifunctions.
However, considering that it affected similar countries, the root cause could be the same, and the Jun, Jul, Aug, Sep spikes coincide with what we saw.
@Ahoelzl , was the backfill applied to Wikipedia unique devices only?
The data from the past few months looks stable.
@OSefu-WMF is this related to the unique devices issues we fixed in October?
Could we automatically set the gadget name in mw.Api by giving each gadget its own copy of mw? I'm not very familiar with the details of JS execution - could we just override getScript in GadgetResourceLoaderModule and wrap the entire thing in a scope that sets a local value for mw?
... we could encode the user agent info in an Accept header, e.g. Accept: application/json;user-agent=my-gadget. That's a terrible hack, and would be bad for Turnilo as well, but it would work around the CORS issue.
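To make the Accept header idea concrete, here is a rough sketch of how a consumer might pull such a parameter back out. It assumes the parameter is literally named `user-agent`, as in the example above; `extract_accept_param` is a hypothetical name, and the naive comma split would break on quoted values that contain commas.

```python
from typing import Optional


def extract_accept_param(accept: str, name: str = "user-agent") -> Optional[str]:
    """Return the first value of media-type parameter `name`, or None.

    Hypothetical helper for illustration; not an existing API.
    """
    wanted = name.lower()
    # Each comma-separated entry is a media range, e.g. "application/json;q=0.9".
    for media_range in accept.split(","):
        parts = [p.strip() for p in media_range.split(";")]
        # parts[0] is the media type itself; the rest are key=value parameters.
        for param in parts[1:]:
            key, sep, value = param.partition("=")
            if sep and key.strip().lower() == wanted:
                return value.strip().strip('"')
    return None


if __name__ == "__main__":
    print(extract_accept_param("application/json;user-agent=my-gadget"))
```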