Re: [RFC] Run-time PM framework (was: Re: [patch update] PM: Introduce core framework for run-time PM of I/O devices (rev. 6))



On Thursday 02 July 2009, Alan Stern wrote:
On Thu, 2 Jul 2009, Rafael J. Wysocki wrote:

The RPM_IN_TRANSITION flag is unnecessary. It would always be equal to
(status == RPM_SUSPENDING || status == RPM_RESUMING).

I thought of replacing the old flags with RPM_IN_TRANSITION, actually.

Okay, but hopefully you won't mind if I continue to use the old state
names in conversation.

Sure.

Still, the additional flag for 'idle notification is in progress' is still
necessary for the following two reasons:

(1) Idle notifications cannot be run (synchronously) when one is already in
progress, so we need a means to determine whether or not this is the case.

(2) If run-time PM is to be disabled, the function doing that must guarantee
that ->runtime_idle() won't be running after it's returned, so it needs to
know how to check that.

Agreed.


I don't agree. For example, suppose the device has an active child
when the driver says: Suspend it in 30 seconds. If the child is then
removed after only 10 seconds, does it make sense to go ahead with
suspending the parent 20 seconds later? No -- if the parent is going
to be suspended, the decision as to when should be made at the time the
child is removed, not beforehand.

There are two functions, on that sets up the timer and the other that queues
up the request. This is the second one that makes the decision if the request
is still worth queuing up.

(Even more concretely, suppose there is a 30-second inactivity timeout
for autosuspend. Removing the child counts as activity and so should
restart the timer.)

To put it another way, suppose you accept a delayed request under
inappropriate conditions. If the conditions don't change, the whole
thing was a waste of effort. And if the conditions do change, then the
whole delayed request should be reconsidered anyhow.

The problem is, even if you always accept a delayed request under appropriate
conditions, you still have to reconsider it before putting it into the work
queue, because the conditions might have changed. So, you'd like to do this:

(1) Check if the conditions are appropriate, set up a timer.
(2) Check if the conditions are appropriate, queue up a suspend request.

while I think it will be simpler to do this:

(1) Set up a timer.
(2) Check if the conditions are appropriate, queue up a suspend request.

In any case you can have a pm_runtime_suspend() / pm_runtime_resume() cycle
between (1) and (2), so I don't really see a practical difference.

A cycle like that would cancel the timer anyway. Maybe that's what you
meant...

Yes.

Hmm. What sort of conditions are we talking about? One possiblity is
that we are in the wrong state, i.e., in SUSPENDING or SUSPENDED. It's
completely useless to start a timer then; if the state changes the
timer will be cancelled, and if it doesn't change then the request
won't be queued when the timer expires.

OK

The other possiblity is that either the children or usage counter is
positive. If the counter decrements to 0 so that a suspend is feasible
then we would send an idle notification. At that point the driver
could decide what to do; the most likely response would be to
reschedule the suspend. In fact, it's hard to think of a situation
where the driver would want to just let the timer keep on running.

OK

For example, suppose ->runtime_resume() has been called as
a result of a remote wake-up (ie. after pm_request_resume()) and it has some
I/O to process, but it is known beforehand that the device will most likely be
inactive after the I/O is done. So, it's tempting to call
pm_schedule_suspend() from within ->runtime_resume(), but the conditions are
inappropriate (the device is not regarded as suspended).

?? Conditions are perfectly appropriate, since suspend requests are
allowed in the RESUMING state.

OK

Unless the driver also did a pm_runtime_get, of course. But in that
case it would have to do a pm_runtime_put eventually, at which point it
could schedule the suspend.

However, calling
pm_schedule_suspend() with a long enough delay doesn't break any rules related
to the ->runtime_*() callbacks, so why should it be forbidden?

It isn't.

Next, suppose pm_schedule_suspend() is called, but it fails because the
conditions are inappropriate. What's the caller supposed to do? Wait for the
conditions to change and repeat?

In a manner of speaking. More precisely, whatever code is responsible
for changing the conditions should call pm_schedule_suspend. Or set up
an idle notification, leading indirectly to pm_schedule_suspend.

But why should it bother if the conditions
may still change before ->runtime_suspend() is actually called?

It should bother because conditions might _not_ change, in which case
the suspend would occur. But for what you are proposing, if the
conditions don't change then the suspend will not occur.

IMO, it's the caller's problem whether or not what it does is useful or
efficient. The core's problem is to ensure that it doesn't break things.

But what's the drawback? The extra overhead of checking whether two
counters are positive is minuscule compared to the effort of setting up
a timer. And it's even better when you consider that the mostly likely
outcome of letting the timer run is that the timer handler would fail
to queue a suspend request (because the counters are unchanged).


Trying to keep track of reasons for incrementing and decrementing
usage_count is very difficult to do in the core. What happens if
pm_request_resume increments the count but then the driver calls
pm_runtime_get, pm_runtime_resume, pm_runtime_put all before the work
routine can run?

Nothing wrong, as long as the increments and decrements are balanced (if they
aren't balanced, there is a bug in the driver anyway).

That's my point -- in this situation it's very difficult for the driver
to balance them. There would be no decrement to balance
pm_request_resume's automatic increment, because the work routine would
never run.

In fact, for this to
work we need the rule that a new request of the same type doesn't replace an
existing one. Then, the submitted resume request cannot be canceled, so the
work function will run and drop the usage counter.

A new pm_schedule_suspend _should_ replace an existing one. For
idle_notify and resume requests, this rule is more or less a no-op.

It's better to make the driver responsible for maintaining the counter
value. Forcing the driver to do pm_runtime_get, pm_request_resume is
better than having the core automatically change the counter.

So the caller will do:

pm_runtime_get(dev);
error = pm_request_resume(dev);
if (error)
goto out;
<process I/O>
pm_runtime_put();

["error" isn't a good name. The return value would be 0 to indicate
the request was accepted and queued, or 1 to indicate the device is
already active. Or perhaps vice versa.]

Why do you insist on using positive values? Also, there are other situations
possible (like run-time PM is disabled etc.).

but how is it supposed to ensure that pm_runtime_put() will be called after
executing the 'goto out' thing?

The same way it knows that the runtime_resume method has to process the
pending I/O. That is, the presence of I/O to process means that once
the processing is over, the driver should call pm_runtime_put.

I overlooked the fact that if pm_request_resume() returns a value indicating
that the request has been queued up, the status is such that it won't allow any
other requests to be queued up and only pm_runtime_resume() can. The status
will still remain this way until ->runtime_resume() has returned, so the caller
can just call pm_runtime_put() right after pm_request_resume() in that case
(unless it wants to process I/O after ->runtime_resume() has returned, but
then it can increment the usage counter in ->runtime_resume()).

Anyway, we don't need to use the usage counter for that (although it's cheap).
Instead, we can make pm_request_suspend() and pm_request_idle() check if a
resume request is pending and fail if that's the case.

But what about pm_runtime_suspend? I think we need to use the counter.
Besides, the states in which suspend requests and idle requests are
valid are disjoint from the states in which resume requests are valid.

That's correct. pm_runtime_suspend() should check the counter IMO, but it
shouldn't change it.

Also, it looks like the status bits are sufficient to prevent suspend requests
or synchronous suspends from happening at wrong times, from the core's point
of view, so scratch the idea of using the usage counter to block them.

Let's put it another way. What's the practical benefit to the caller if we
always check the counters in submissions?

It saves the overhead of setting up and running a useless timer. It
avoids a race between the timer routine and pm_runtime_put.

OK

Any pending request takes precedence over a new idle notification request.

For pending resume requests this rule is unnecessary; it's invalid to
submit an idle notification request while a resume request is pending
(since resume requests can be pending only in the RPM_SUSPENDING and
RPM_SUSPENDED states while idle notification requests are accepted only
in RPM_RESUMING and RPM_ACTIVE).

It is correct nevertheless. :-)

Okay, if you want. Provided you agree that "pending request" doesn't
include unexpired suspend timers.

Sure.

Well, after some reconsideration I think it's not enough (as I wrote in my last
message), becuase it generally makes sense to make the following rule:

A pending request always takes precedence over a new request of the same
type.

So, for example, if pm_request_resume() is called and there's a resume request
pending already, the new pm_request_resume() should just let the pending
request alone and quit.

Do you mean we shouldn't cancel the work item and then requeue it? I
agree. In fact I'd go even farther: If the timer routine find an idle
request pending, it shouldn't cancel it -- instead it should simply
change async_action to ASYNC_SUSPEND. That's a simple optimization.
Regardless, the effect isn't visible to drivers.

I don't really like the async_action idea, as you might have noticed.

Thus, it seems reasonable to remember what type of a request is pending
(i don't think we can figure it out from the status fields in 100% of the
cases).

That's what the async_action field in my proposal is for.

Ah. Why don't we just use a request type field instead?

In fact, we can use a 2-bit status field (RPM_ACTIVE, RPM_SUSPENDING,
RPM_SUSPENDED, RPM_RESUMING) and a 2-bit request type field
(RPM_REQ_NONE, RPM_REQ_IDLE, RPM_REQ_SUSPEND, RPM_REQ_RESUME).

Additionally, we'll need an "idle notification is running" flag as we've aleady
agreed, but that's independent on the status and request type (except that, I
think, it should be forbidden to set the request type to RPM_REQ_IDLE if
this flag is set).

That would pretty much suffice to represent all of the possibilities.

I'd also add a "disabled" flag indicating that run-time PM of the device is
disabled, an "error" flag indicating that one of the
->runtime_[suspend/resume]() callbacks has failed to do its job and
and an int field to store the error code returned by the failing callback (in
case the failure happened in an asynchronous routine).

Yes. But the driver might depend on something happening inside the
runtime_resume method, so it would need to know if a successful
pm_runtime_resume wasn't going to invoke the callback.

Hmm. That would require the driver to know that the device was suspended,
but in that case pm_runtime_resume() returning 0 would mean that _someone_
ran ->runtime_resume() for it in any case.

If the driver doesn't know if the device was suspended beforehand, it cannot
depend on the execution of ->runtime_resume().

Exactly. Therefore it needs to be told if pm_runtime_resume isn't
going to call the runtime_resume method, so that it can take
appropriate remedial action.

OK, it can return 1 if the status was already RPM_ACTIVE.

To be determined: How runtime PM will interact with system sleep.

Yes. My first idea was to disable run-time PM before entering a system sleep
state, but that would involve canceling all of the pending requests.

Or simply freezing the workqueue.

Well, what about the synchronous calls? How are we going to prevent them
from happening after freezing the workqueue?

How about your "rpm_disabled" flag?

That's fine, we'd also need to wait for running callbacks to finish too. And
I'm still not convinced if we should preserve requests queued up before the
system sleep. Or keep the suspend timer running for that matter.

Now there's a point in which allowing to set up the suspend timer at any time
simplifies things quite a bit. Namely, in that case, if pm_schedule_suspend()
is called and it sees a timer pending, it deactivates the timer with
del_timer() and sets up a new one with add_timer(). It doesn't need to worry
about whether the suspend request has been queued up already or
pm_runtime_suspend() is running or something. Things will work themselves out
anyway eventually.

Otherwise, after calling del_timer() we'll need to check if the timer was pending
and if it wasn't, then if the suspend requests has been queued up already, and
if it has, then if pm_runtime_suspend() is running (the current status is
RPM_SUSPENDING) etc. That doesn't look particularly clean.

It's not as bad as you think. In pseudo code:

ret = suspend_allowed(dev);
if (ret)
return ret;
if (dev->power.timer_expiration) {
del_timer(&dev->power.timer);
dev->power.timer_expiration = 0;
}
if (dev->power.work_pending) {
cancel_work(&dev->power.work);
dev->power.work_pending = 0;
dev->power.async_action = 0;
}
dev->power.timer_expiration = max(jiffies + delay, 1UL);
mod_timer(&dev->power.timer, delay);

The middle section could usefully be put in a subroutine.

Could you please remind me what timer_expiration is for?

So, at a high level, the pm_request_* and pm_schedule_* functions would work
like this (I'm omitting acquiring and releasing locks):

pm_request_idle()
* return -EINVAL if 'disabled' is set or 'runtime_error' is set
* return -EAGAIN if 'runtime status' is not RPM_ACTIVE or 'request type' is
RPM_REQ_SUSPEND or 'usage_count' > 0 or 'child_count' > 0
* return -EALREADY if 'request type' is RPM_REQ_IDLE
* return -EINPROGRESS if 'idle notification in progress' is set
* change 'request type' to RPM_REQ_IDLE and queue up a request to execute
->runtime_idle() or ->runtime_suspend() (which one will be executed depends
on 'request type' at the time when the work function is run)
* return 0

pm_schedule_suspend()
* return -EINVAL if 'disabled' is set or 'runtime_error' is set
* return -EAGAIN if 'usage_count' > 0 or 'child_count' > 0
* return -EALREADY if 'runtime status' is RPM_SUSPENDED
* return -EINPROGRESS if 'runtime status' is RPM_SUSPENDING
* if suspend timer is pending, deactivate it
* if 'request type' is not RPM_REQ_NONE, cancel the work
* set up a timer to execute pm_request_suspend()
* return 0

pm_request_suspend()
* return if 'disabled' is set or 'runtime_error' is set
* return if 'usage_count' > 0 or 'child_count' > 0
* return if 'runtime status' is RPM_SUSPENDING or RPM_SUSPENDED
* if 'request type' is RPM_REQ_IDLE, change it to RPM_REQ_SUSPEND and return
* change 'request type' to RPM_REQ_SUSPEND and queue up a request to
execute ->runtime_suspend()

pm_request_resume()
* return -EINVAL if 'disabled' is set or 'runtime_error' is set
* return -EINPROGRESS if 'runtime status' is RPM_RESUMING
* return -EALREADY if 'request type' is RPM_REQ_RESUME
* if suspend timer is pending, deactivate it
* if 'request type' is not RPM_REQ_NONE, cancel the work
* return 1 if 'runtime status' is RPM_ACTIVE
* change 'request type' to RPM_REQ_RESUME and queue up a request to
execute ->runtime_resume()
* return 0

Or did I miss anything?

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages