
Improve user experience of spark-history service
Closed, ResolvedPublic

Description

https://yarn.wikimedia.org/spark-history/ frequently times out when trying to fetch long-running Spark jobs (5+ hours) or jobs that generated lots of tasks (20k+).

I presume this is because the first fetch needs to decompress all the logs. Retrying the same app id multiple times eventually gets you the contents.

Ask #1: Considering that heavy requests like these are typical for this service, can we bump the maximum response time, perhaps by 10x?

@BTullis also mentions:

It looks like the CPU is also being throttled here, so maybe we can increase the maximum limit of CPU that it can have.

(Attached screenshot: image.png, 534×631 px, 49 KB)

Thus, ask #2: Allow the service to use extra CPU if available.
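For reference, a change like this usually lands in the container's resources stanza of the Helm chart. A minimal sketch of that shape, with purely illustrative values rather than the actual spark-history settings:

```yaml
# Illustrative only: the usual shape of a Kubernetes container resources stanza.
# The figures below are placeholders, not the real spark-history configuration.
resources:
  requests:
    cpu: "1"          # CPU the scheduler reserves for the container
    memory: 2Gi
  limits:
    cpu: "2"          # raising this ceiling reduces CPU throttling under load
    memory: 3Gi
```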

And finally, ask #3, a nitpick: https://yarn.wikimedia.org/spark-history/ (with a trailing slash) works, but without it (https://yarn.wikimedia.org/spark-history) it returns a 404. Can we fix this?

And one last thing: it would also be great to fix T331448: Make YARN web interface work with both primary and standby resourcemanager.

Event Timeline

Gehel triaged this task as Medium priority. Jun 13 2025, 8:02 AM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Change #1168153 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Increase the limitranges for the spark-history service

https://gerrit.wikimedia.org/r/1168153

Change #1168165 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable greater timeouts and rewriting for the spark-history service

https://gerrit.wikimedia.org/r/1168165

Change #1168171 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Increase the CPU and memory limits for the spark-history service

https://gerrit.wikimedia.org/r/1168171

Hopefully, these patches will address the first three asks:

  • Increased the timeout to the proxy backend (300 seconds).
  • Doubled the CPU limit (with 50% more RAM, too); see the LimitRange sketch below.
  • Fixed the trailing slash issue.
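For context, the separate limitranges patches are needed because a Kubernetes LimitRange on the namespace caps the resources any single container may be granted, so it has to be raised before the higher container limits can take effect. A hypothetical sketch of such an object, with placeholder names and numbers rather than what was actually deployed to dse-k8s-eqiad:

```yaml
# Hypothetical sketch of a namespace LimitRange; values are placeholders.
apiVersion: v1
kind: LimitRange
metadata:
  name: spark-history-limits
  namespace: spark-history
spec:
  limits:
    - type: Container
      max:
        cpu: "4"        # ceiling on any single container's CPU limit
        memory: 6Gi
      defaultRequest:
        cpu: "1"        # request applied to containers that don't set one
        memory: 2Gi
```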

Change #1168153 merged by jenkins-bot:

[operations/deployment-charts@master] Increase the limitranges for the spark-history service

https://gerrit.wikimedia.org/r/1168153

Change #1168171 merged by jenkins-bot:

[operations/deployment-charts@master] Increase the CPU and memory limits for the spark-history service

https://gerrit.wikimedia.org/r/1168171

Change #1169045 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Tweak the limitrange for the dse-k8s-eqiad/spark-history namespace

https://gerrit.wikimedia.org/r/1169045

Change #1168165 merged by Btullis:

[operations/puppet@production] Enable greater timeouts and rewriting for the spark-history service

https://gerrit.wikimedia.org/r/1168165

Change #1169045 merged by jenkins-bot:

[operations/deployment-charts@master] Tweak the limitrange for the dse-k8s-eqiad/spark-history namespace

https://gerrit.wikimedia.org/r/1169045

These patches are all deployed, so I think that we can resolve this ticket now.
I checked that https://yarn.wikimedia.org/spark-history (without the trailing slash) now works.

I think the service is snappier and now has an increased backend timeout, but please feel free to reopen the ticket if this doesn't match your experience. Thanks.