VHPT implementation issues

VHPT implementation issues

Arun Sharma

This email contains minutes of an internal discussion we had at Intel a
few weeks ago, following the thread about global vs. per-domain VHPTs.

As someone else on the list suggested, there are really 4 options on the
table (the original thread dealt primarily with global vs. per-domain):

1. Global
2. One per logical processor (logical processor = hardware thread, as in
SMT/SOEMT)
3. One per domain
4. One per virtual processor

Generally speaking, the list is in ascending order of the number of
VHPTs (although there are exceptions).
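
(Purely for reference in the sketches further down, the four options
could be named like this; the identifiers are made up for illustration
and do not come from the actual Xen/ia64 tree.)

/* Hypothetical policy names for the four options above. */
enum vhpt_policy {
    VHPT_GLOBAL     = 1,   /* one VHPT for the whole machine               */
    VHPT_PER_LP     = 2,   /* one per logical processor (SMT/SOEMT thread) */
    VHPT_PER_DOMAIN = 3,   /* one per domain                               */
    VHPT_PER_VCPU   = 4,   /* one per virtual processor                    */
};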

We first eliminated 1 and 3, as they have scalability issues on
large-scale SMP systems.

So it was really 2 vs. 4, and I was initially arguing for 2 and against
4. Before we go further, I should explain some details about how we
propose to implement 4.

The idea is that we set aside a fixed amount of memory for VHPT
purposes. In all of the above schemes, the VHPT memory is sized
proportionally to the amount of physical memory on the system.
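
To make the sizing rule concrete, here is a minimal sketch, assuming a
1/64 ratio and the architectural 32KB minimum VHPT size; the ratio and
the function name are made up for illustration, not taken from any
actual code:

#include <stdint.h>

#define VHPT_RATIO_SHIFT  6   /* reserve 1/64 of machine memory (arbitrary) */

/* Round the reserved pool up to a power of two: the IA-64 VHPT must be a
 * power-of-two size (PTA.size holds log2 of it), with a 32KB minimum. */
static inline uint64_t vhpt_pool_bytes(uint64_t machine_memory_bytes)
{
    uint64_t want = machine_memory_bytes >> VHPT_RATIO_SHIFT;
    uint64_t size = 1UL << 15;          /* 32KB architectural minimum */

    while (size < want)
        size <<= 1;
    return size;
}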

In the earlier arguments on the thread, IIRC, it was argued that since #4
has more VHPTs of the same size, it must consume more memory. But in
our modified proposal above, 2 and 4 consume the same amount of memory,
which is set aside at boot time (so there are no issues with finding
contiguous memory for the VHPT, etc.).

So now the argument comes down to how best to use this memory for VHPT
caching purposes. Because this is a cache, it can be thrown away at will
without compromising correctness. So if we can't find enough VHPT memory
when we create a domain, we can steal some from another domain.
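
A rough sketch of what such stealing could look like, assuming a simple
per-VCPU descriptor; the structure and function names are hypothetical,
and a real implementation would also have to reprogram the victim's PTA
register and keep each table contiguous and power-of-two sized:

#include <stddef.h>
#include <string.h>

/* Hypothetical per-VCPU VHPT descriptor; not actual Xen structures. */
struct vhpt_area {
    void   *base;    /* contiguous VHPT memory for this VCPU */
    size_t  bytes;   /* current size                          */
};

/*
 * Take VHPT memory away from a victim and hand it to a new owner.
 * Correctness is unaffected: the victim simply takes VHPT misses until
 * its (now smaller) table is repopulated on demand.
 */
static void vhpt_steal(struct vhpt_area *victim, struct vhpt_area *newcomer,
                       size_t bytes)
{
    if (bytes > victim->bytes)
        bytes = victim->bytes;

    /* Shrink the victim and clear what remains (a real implementation
     * would mark each entry's tag invalid rather than just zeroing). */
    victim->bytes -= bytes;
    memset(victim->base, 0, victim->bytes);

    /* Hand the reclaimed tail to the newcomer, cleared. Placement is
     * simplified here; see the caveats in the lead-in above. */
    newcomer->base  = (char *)victim->base + victim->bytes;
    newcomer->bytes = bytes;
    memset(newcomer->base, 0, bytes);
}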

The arguments for 4:

- less interference between domains due to similar access patterns
- easier reuse of RIDs (both cases are sketched just below):
  - with 2, when a domain terminates and we want to reuse its RIDs, we
    need to walk through all entries in the VHPT and clear them; there
    is no easy way to walk the VHPT by RID
  - with 4, this comes for free
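
To make the RID point concrete, here is a hedged sketch of the two
cases. It assumes the implementation stashes the owning RID in the
spare word of the 32-byte long-format entry; that field and all the
names here are hypothetical, and without such a software field the only
option is to clear the entire table, which is exactly the point above.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical long-format VHPT entry, reduced to the fields used here. */
struct vhpt_entry {
    uint64_t pte;
    uint64_t itir;
    uint64_t tag;    /* setting bit 63 (ti) makes the entry never match */
    uint64_t rid;    /* software-only field; an assumption, see above   */
};

#define VHPT_TAG_INVALID  (1UL << 63)

/* Option 2 (per logical processor): reclaiming a dead domain's RIDs means
 * a full scan of every per-LP VHPT, because entries are hashed by address
 * and cannot be looked up by RID. */
static void vhpt_purge_rid(struct vhpt_entry *vhpt, size_t nr, uint64_t rid)
{
    for (size_t i = 0; i < nr; i++)
        if (vhpt[i].rid == rid)
            vhpt[i].tag |= VHPT_TAG_INVALID;
}

/* Option 4 (per virtual processor): the dead domain's VCPUs own their
 * tables outright, so the whole table can simply be invalidated or freed. */
static void vhpt_reset(struct vhpt_entry *vhpt, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        vhpt[i].tag = VHPT_TAG_INVALID;
}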

The arguments for 2:

- Potential for a better VHPT hit ratio for workloads with large working
sets. In some sense, this is like the argument between 1 and 2, but most
people (including Gelato for Linux) seem to choose 2 over 1 because SMP
locking overhead and cache benefits trump the VHPT hit ratio (see the
toy sketch below).
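
As a toy illustration of the locking point only (pthread mutexes stand
in for hypervisor spinlocks, and every name here is hypothetical): a
shared table forces every insert to synchronize, while a per-LP or
per-VCPU table is written only by its owner.

#include <pthread.h>

#define VHPT_ENTRIES 4096

struct vhpt { unsigned long tag[VHPT_ENTRIES], pte[VHPT_ENTRIES]; };

/* Trivial stand-in for the real insert: hash the address into one slot. */
static void vhpt_write_entry(struct vhpt *v, unsigned long vaddr,
                             unsigned long pte)
{
    unsigned long slot = (vaddr >> 14) % VHPT_ENTRIES;  /* 16KB pages */
    v->tag[slot] = vaddr >> 14;
    v->pte[slot] = pte;
}

/* Options 1/3: the table is shared between processors, so every insert
 * (and purge) has to synchronize with the other CPUs. */
static pthread_mutex_t shared_vhpt_lock = PTHREAD_MUTEX_INITIALIZER;

static void shared_vhpt_insert(struct vhpt *v, unsigned long vaddr,
                               unsigned long pte)
{
    pthread_mutex_lock(&shared_vhpt_lock);
    vhpt_write_entry(v, vaddr, pte);
    pthread_mutex_unlock(&shared_vhpt_lock);
}

/* Options 2/4: only the owning logical processor (or VCPU) ever writes
 * its table, so no lock is taken and the lines stay warm in its cache. */
static void private_vhpt_insert(struct vhpt *my_vhpt, unsigned long vaddr,
                                unsigned long pte)
{
    vhpt_write_entry(my_vhpt, vaddr, pte);
}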

Another way to look at this is that, instead of statically partitioning
the VHPT memory between domains (or not partitioning it at all, as in 1),
we can do it more dynamically.

Does this new proposal address some of the concerns expressed earlier?

        -Arun


_______________________________________________
Xen-ia64-devel mailing list
[hidden email]
http://lists.xensource.com/xen-ia64-devel

RE: VHPT implementation issues

Dan Magenheimer

Good summary!

Some other factors would be worth evaluating when comparing/contrasting
the options:

1) Performance impact of "fragmenting" the VHPT if the VHPT is not
   physically contiguous (e.g., TLB entries used for the VHPT push user
   mappings out).
2) Cost of resizing the VHPTs when stealing a page of the VHPT page
   "cache" from one domain and adding it to another. What is the
   mechanism for choosing which page(s) to steal? (One possible policy
   is sketched after this list.)
3) Granularity of fragment size (e.g., what is the impact if machine
   memory for domains A and B is sized at A=4GB and B=4GB and gets
   changed -- invisibly to the domains -- to A=4GB+16KB and B=4GB-16KB)?
   Does there need to be "slop" in the cache for VHPT-cache pages to map
   partial granules?
4) Overhead/frequency of VHPT flushing vs. the above.
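
One possible (purely hypothetical) answer to the victim-selection
question in 2), sketched under the assumption that we keep a little
per-domain accounting: steal from the domain whose table is "coldest",
i.e., holds the most VHPT memory per recent miss.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-domain accounting used only in this sketch; not a
 * description of any existing Xen mechanism. */
struct vhpt_account {
    uint64_t vhpt_bytes;     /* VHPT memory currently held                */
    uint64_t recent_misses;  /* VHPT misses over the last sampling period */
};

/* Pick the domain with the most VHPT bytes per recent miss: its table is
 * the coldest and therefore the cheapest to shrink. */
static size_t pick_victim(const struct vhpt_account *dom, size_t ndoms)
{
    size_t victim = 0;
    uint64_t best = 0;

    for (size_t i = 0; i < ndoms; i++) {
        uint64_t score = dom[i].vhpt_bytes / (dom[i].recent_misses + 1);
        if (score >= best) {
            best = score;
            victim = i;
        }
    }
    return victim;
}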

Also, there's still the "use model" question -- in a
nutshell: which is more important, scale-up or scale-out?

My bottom-line requirement is that the VHPT implementation must flexibly
handle frequent changes in the guest-physical memory size/location of
each domain. It's very possible that the current monolithic global VHPT
is overkill; I like this continued discussion of the alternatives, as we
may end up with an ideal design that meets all of our needs/requirements.
Or we may end up with two designs, needed depending on the use model.

Dan
