https://yarn.wikimedia.org/spark-history/ frequently times out when trying to fetch long-running Spark jobs (5+ hours) or jobs that generated lots of tasks (20k+).
I presume this is because the first fetch needs to decompress all the logs. Retrying the same app id multiple times eventually gets you the contents.
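Until the timeout is raised, the manual retrying described above could be scripted. This is only a rough sketch of a client-side workaround, not a fix for the service: the function name `fetch_with_retry`, the retry counts, the delays, and the choice of which HTTP statuses count as "timed out" (502/503/504) are all assumptions, not something confirmed for this deployment.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=5, timeout=120.0, backoff=10.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, retrying on timeouts and gateway-style errors.

    Returns the response body as bytes, or re-raises the last error
    once `attempts` tries have been used up.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Assumption: a proxy in front of the service reports a slow
            # backend as 502/503/504. Anything else (e.g. 404) is not retried.
            if err.code not in (502, 503, 504):
                raise
            last_error = err
        except (urllib.error.URLError, TimeoutError, socket.timeout) as err:
            # Connection-level failures and client-side timeouts.
            last_error = err
        if attempt < attempts:
            # Linear backoff: wait longer after each failed attempt.
            sleep(backoff * attempt)
    raise last_error
```

Usage would look something like `fetch_with_retry("https://yarn.wikimedia.org/spark-history/history/<app-id>/jobs/")`, where `<app-id>` is the Spark application id and the `history/<app-id>/jobs/` path layout behind this URL is itself an assumption.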
Ask #1: Since heavy requests like these are typical for this service, can we raise the maximum response time, perhaps by 10x?
@BTullis also mentions:
It looks like the CPU is also being throttled here, so maybe we can increase the maximum limit of CPU that it can have.
Thus, ask #2: Allow the service to use extra CPU if available.
And finally, ask #3, a nitpick: https://yarn.wikimedia.org/spark-history/ (with a trailing slash) works, but https://yarn.wikimedia.org/spark-history (without it) returns a 404. Can we fix this?
And finally finally: it would also be great to fix T331448: Make YARN web interface work with both primary and standby resourcemanager.