Skip to content

[WIP] Swarmkit task reaper should be more robust for dead nodes.#2408

Open
anshulpundir wants to merge 1 commit intomoby:masterfrom
anshulpundir:reap
Open

[WIP] Swarmkit task reaper should be more robust for dead nodes.#2408
anshulpundir wants to merge 1 commit intomoby:masterfrom
anshulpundir:reap

Conversation

@anshulpundir
Copy link
Copy Markdown
Contributor

@anshulpundir anshulpundir commented Oct 12, 2017

The issue we're trying to address that of bloating of the store because of a large number of orphaned tasks. It was noted in the code that orphaned tasks are not deleted unless service id == "".

Also, task history is only cleaned up when EventCreateTask is seen for the service for that task. It is not clear if this is enough in the case of restarting from a snapshot when orphaned tasks may not have been cleaned up from the history before the restart.

Signed-off-by: Anshul Pundir anshul.pundir@docker.com

@anshulpundir anshulpundir force-pushed the reap branch 2 times, most recently from 4a8a50d to a5506ee Compare October 12, 2017 21:33
@codecov
Copy link
Copy Markdown

codecov Bot commented Oct 12, 2017

Codecov Report

Merging #2408 into master will increase coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2408      +/-   ##
==========================================
+ Coverage   61.89%   61.98%   +0.08%     
==========================================
  Files         134      134              
  Lines       21771    21771              
==========================================
+ Hits        13476    13494      +18     
+ Misses       6836     6819      -17     
+ Partials     1459     1458       -1

case api.EventUpdateTask:
t := v.Task
if t.Status.State >= api.TaskStateOrphaned && t.ServiceID == "" {
if t.Status.State >= api.TaskStateOrphaned {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only fire when the task actually gets updated, right? If we load this task out of a snapshot, this event will not fire for the task.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll address that and see if I can add a test for that case.

@aaronlehmann
Copy link
Copy Markdown
Collaborator

The Orphaned state was added to let us free resources associated with a task (such as IP addresses) without deleting the task itself. See #440 for background.

The idea was that we don't want to be forced to delete a task in order to free its resources, because then we lose the history associated with that task. The task can remain in the task list in an Orphaned state, and the administrator will be able to see it in service ps. There is an exception for tasks that don't have a service associated with them (these are used for network attachments on classic docker run containers), because they have no concept of slot, so the task reaper can't clean them up with its normal approach.

This change seems to negate the purpose of the Orphaned state, by deleting the tasks right away. I don't think it's the right direction. As usual, it would be helpful to know what problem this is trying to solve. A reference to a bug in a private bug tracker isn't helpful to anyone in the open source community.

@anshulpundir anshulpundir changed the title Swarmkit task reaper should be more robust for dead nodes #9388. Swarmkit task reaper should be more robust for dead nodes. Oct 15, 2017
@anshulpundir
Copy link
Copy Markdown
Contributor Author

anshulpundir commented Oct 15, 2017

I looked at the PR that the Orphaned state was added in and #440 and it was not clear from either on why the check for t.ServiceID == "" made sense and why it sense for service-less tasks to be deleted right away but not for service tasks. The code doesn't have any comments whatsoever on the motivation behind this, so thanks for providing this clarification.

I have updated the description to provide the motivation for this change. The change was made before I was clear on the motivation for the Orphaned state (I tried digging around but I wasn't able to get a good description of it). I agree now that it is not in the right direction to remove orphaned tasks right away.

I think the one concern that may still need to be addressed is that of cleaning out the task history upon restart when there are no longer any EventCreateTask events (is that even possible ?). Either way, this may not be that high priority.

I also think it would be very helpful to add comments to the code. Its not scalable to have to piece together motivation behind a change from PRs and issues.

@aaronlehmann

@anshulpundir anshulpundir changed the title Swarmkit task reaper should be more robust for dead nodes. [WIP] Swarmkit task reaper should be more robust for dead nodes. Oct 15, 2017
@wsong
Copy link
Copy Markdown

wsong commented Oct 16, 2017

@aaronlehmann The original idea behind this change is that orphaned tasks are supposed to get cleaned up when you run docker node rm on a dead node. If that removal fails somehow, however, those tasks will never get cleaned up. We need some way of removing tasks on nodes that no longer exist.

@aaronlehmann
Copy link
Copy Markdown
Collaborator

The original idea behind this change is that orphaned tasks are supposed to get cleaned up when you run docker node rm on a dead node. If that removal fails somehow, however, those tasks will never get cleaned up. We need some way of removing tasks on nodes that no longer exist.

Got it. Sounds like the task reaper needs some reconciliation that makes sure global service tasks for dead nodes are deleted.

I assume the problem is for global service tasks? Replicated tasks would be retained up to the "task history limit", and then deleted.

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants