
Improve user experience of spark-history service
Closed, ResolvedPublic

Description

https://yarn.wikimedia.org/spark-history/ frequently times out when trying to fetch long-running Spark jobs (5+ hours) or jobs that generated lots of tasks (20k+).

I presume this is because the first fetch needs to decompress all the logs. Retrying the same app id multiple times eventually gets you the contents.

Ask #1: Considering that heavy requests like these are typical for this service, can we bump the maximum response time, perhaps by 10x?

@BTullis also mentions:

It looks like the CPU is also being throttled here, so maybe we can increase the maximum limit of CPU that it can have.

(Attached screenshot: image.png, 534×631 px, 49 KB)

Thus, ask #2: Allow the service to use extra CPU if available.
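For reference, a change like this usually lands in the container's resources stanza of the Helm chart. A minimal sketch of that shape, with purely illustrative values rather than the actual spark-history settings:

```yaml
# Illustrative only: the usual shape of a Kubernetes container resources stanza.
# The figures below are placeholders, not the real spark-history configuration.
resources:
  requests:
    cpu: "1"          # CPU the scheduler reserves for the container
    memory: 2Gi
  limits:
    cpu: "2"          # raising this ceiling reduces CPU throttling under load
    memory: 3Gi
```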

And finally, ask #3, a nitpick: https://yarn.wikimedia.org/spark-history/ (with a trailing slash) works, but without it (https://yarn.wikimedia.org/spark-history) it returns a 404. Can we fix this?

And one last thing: it would also be great to fix T331448: Make YARN web interface work with both primary and standby resourcemanager.

Event Timeline

Gehel triaged this task as Medium priority. Jun 13 2025, 8:02 AM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Change #1168153 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Increase the limitranges for the spark-history service

https://gerrit.wikimedia.org/r/1168153

Change #1168165 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable greater timeouts and rewriting for the spark-history service

https://gerrit.wikimedia.org/r/1168165

Change #1168171 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Increase the CPU and memory limits for the spark-history service

https://gerrit.wikimedia.org/r/1168171

Hopefully, these patches will address the first three asks:

  • Increased the timeout to the proxy backend (300 seconds).
  • Doubled the CPU limit (with 50% more RAM, too); see the LimitRange sketch below.
  • Fixed the trailing slash issue.
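For context, the separate limitranges patches are needed because a Kubernetes LimitRange on the namespace caps the resources any single container may be granted, so it has to be raised before the higher container limits can take effect. A hypothetical sketch of such an object, with placeholder names and numbers rather than what was actually deployed to dse-k8s-eqiad:

```yaml
# Hypothetical sketch of a namespace LimitRange; values are placeholders.
apiVersion: v1
kind: LimitRange
metadata:
  name: spark-history-limits
  namespace: spark-history
spec:
  limits:
    - type: Container
      max:
        cpu: "4"        # ceiling on any single container's CPU limit
        memory: 6Gi
      defaultRequest:
        cpu: "1"        # request applied to containers that don't set one
        memory: 2Gi
```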

Change #1168153 merged by jenkins-bot:

[operations/deployment-charts@master] Increase the limitranges for the spark-history service

https://gerrit.wikimedia.org/r/1168153

Change #1168171 merged by jenkins-bot:

[operations/deployment-charts@master] Increase the CPU and memory limits for the spark-history service

https://gerrit.wikimedia.org/r/1168171

Change #1169045 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Tweak the limitrange for the dse-k8s-eqiad/spark-history namespace

https://gerrit.wikimedia.org/r/1169045

Change #1168165 merged by Btullis:

[operations/puppet@production] Enable greater timeouts and rewriting for the spark-history service

https://gerrit.wikimedia.org/r/1168165

Change #1169045 merged by jenkins-bot:

[operations/deployment-charts@master] Tweak the limitrange for the dse-k8s-eqiad/spark-history namespace

https://gerrit.wikimedia.org/r/1169045

These patches are all deployed, so I think that we can resolve this ticket now.
I checked that https://yarn.wikimedia.org/spark-history (without the trailing slash) now works.

I think the service is snappier and now has an increased backend timeout, but please feel free to reopen the ticket if this doesn't match your experience. Thanks.