[PATCH] 0/7 xen: Add basic NUMA support


[PATCH] 0/7 xen: Add basic NUMA support

Ryan Harper
The patchset will add basic NUMA support to Xen (hypervisor only).  We
borrowed from Linux its support for NUMA SRAT table parsing, discontiguous
memory tracking (mem chunks), and cpu support (node_to_cpumask, etc.).

The hypervisor parses the SRAT tables and constructs mappings for each
node, such as node-to-cpu mappings and memory-range-to-node mappings.
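
As a rough illustration only, the per-node state this builds up looks
something like the following; the identifier names here follow the Linux
code we borrowed from and are not necessarily the exact names used in
the patches:

/* Illustrative only: per-node state derived from the SRAT. */
#define MAX_NUMNODES    8
#define NR_NODE_MEMBLKS (MAX_NUMNODES * 2)

struct node_memblk {
    unsigned long start_pfn, end_pfn;   /* one contiguous memory chunk */
};

/* node -> cpu bitmap, memory chunk ranges, and chunk -> node map */
static unsigned long node_to_cpumask[MAX_NUMNODES];
static struct node_memblk node_memblk_range[NR_NODE_MEMBLKS];
static int memblk_nodeid[NR_NODE_MEMBLKS];

/* Look up the node owning a page by scanning the memory chunks. */
static int pfn_to_node(unsigned long pfn)
{
    int i;

    for ( i = 0; i < NR_NODE_MEMBLKS; i++ )
        if ( pfn >= node_memblk_range[i].start_pfn &&
             pfn <  node_memblk_range[i].end_pfn )
            return memblk_nodeid[i];

    return 0;   /* default to node 0 if no SRAT chunk matches */
}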

Using this information, we also modified the page allocator to provide a
simple NUMA-aware API.  The modified allocator attempts to find pages
local to the cpu where possible, but rather than fragmenting larger
contiguous chunks to find local pages, it falls back to memory of the
requested size elsewhere.  We expect to tune this algorithm in the
future after further study.
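
To make that concrete, one way to read the fallback is sketched below;
cpu_to_node(), find_block_on_node() and find_block_any_node() are
stand-in names for illustration, not the actual functions in the patch:

/* Illustrative reading of the allocation policy; helper names are
 * stand-ins, not the functions actually used in the patch. */
struct pfn_info;                        /* Xen's page frame descriptor */
extern unsigned int cpu_to_node(unsigned int cpu);
extern struct pfn_info *find_block_on_node(unsigned int node,
                                           unsigned int order);
extern struct pfn_info *find_block_any_node(unsigned int order);

struct pfn_info *numa_alloc_heap_pages(unsigned int cpu, unsigned int order)
{
    unsigned int node = cpu_to_node(cpu);   /* mapping built from the SRAT */
    struct pfn_info *pg;

    /* Prefer a free block of exactly 2^order pages on the local node. */
    if ( (pg = find_block_on_node(node, order)) != NULL )
        return pg;

    /* Rather than splitting a larger local chunk to stay local, fall
     * back to a block of the requested order on any node. */
    return find_block_any_node(order);
}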

We also modified Xen's increase_reservation memory op to balance memory
distribution across the vcpus in use by a domain.  Relying on previous
patches which have already been committed to xen-unstable, a guest can be
constructed such that its entire memory is contained within a specific
NUMA node.
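
As a hedged sketch of the balancing (vcpu_to_node(), num_online_vcpus()
and alloc_domheap_pages_on_node() are illustrative names rather than the
exact interfaces touched by the patch), the idea is to hand out the
requested extents round-robin across the nodes hosting the domain's
vcpus:

/* Illustrative sketch only; the helpers named here are stand-ins. */
static unsigned long balance_extents_over_vcpus(struct domain *d,
                                                unsigned long nr_extents,
                                                unsigned int order)
{
    unsigned long i;

    for ( i = 0; i < nr_extents; i++ )
    {
        /* Rotate through the domain's vcpus so each vcpu's node
         * receives roughly an equal share of the new extents. */
        struct vcpu *v = d->vcpu[i % num_online_vcpus(d)];

        if ( alloc_domheap_pages_on_node(d, order, vcpu_to_node(v)) == NULL )
            break;
    }

    return i;   /* number of extents successfully allocated */
}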

We've added a keyhandler for exposing some of the NUMA-related
information and statistics that pertain to the hypervisor.

We export NUMA system information via the physinfo hypercall.  This
information provides cpu/memory topology and configuration information
gleaned from the SRAT tables to userspace applications.  Currently, xend
doesn't leverage any of the information automatically but we intend to
do so in the future.
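
For a feel of the kind of data involved, a hypothetical shape of the
extra topology information is shown below; these field names are
illustrative only and not the actual physinfo layout in the patch:

#include <stdint.h>

#define MAX_NODES 8     /* illustrative limit */

/* Hypothetical sketch of NUMA topology data exposed to userspace. */
struct numa_topology_info {
    uint32_t nr_nodes;                    /* nodes described by the SRAT   */
    uint64_t node_memsize[MAX_NODES];     /* total memory per node (bytes) */
    uint64_t node_memfree[MAX_NODES];     /* free memory per node (bytes)  */
    uint64_t node_to_cpumask[MAX_NODES];  /* cpus belonging to each node   */
};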

We've integrated NUMA information into xentrace so we can track various
points such as page allocator hits and misses, as well as other
information.  In the process of implementing the trace, we also fixed
some incorrect assumptions about the symmetry of NUMA systems w.r.t. the
sockets_per_node value.  Details are available in a later email with the
patch.

These patches have been tested on several IBM NUMA and non-NUMA systems:

NUMA-aware systems:
IBM Dual Opteron:  2 Node,  2 CPU,  4GB
IBM x445        :  4 Node, 32 CPU, 32GB
IBM x460        :  1 Node,  8 CPU, 16GB
IBM x460        :  2 Node, 32 CPU, 32GB

Non NUMA-aware systems (i.e., no SRAT tables):
IBM Dual Xeon   :  1 Node,  2 CPU,  2GB
IBM P4          :  1 Node,  1 CPU,  1GB


We look forward to your review of the patches for acceptance.

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
[hidden email]

_______________________________________________
Xen-devel mailing list
[hidden email]
http://lists.xensource.com/xen-devel

RE: [PATCH] 0/7 xen: Add basic NUMA support

Ian Pratt
 
> The patchset will add basic NUMA support to Xen (hypervisor
> only).  

I think we need a lot more discussion on this -- your approach differs
from what we've previously discussed on the list. We need a session at
the Jan summit.

> We borrowed from Linux support for NUMA SRAT table
> parsing, discontiguous memory tracking (mem chunks), and cpu
> support (node_to_cpumask etc).
>
> The hypervisor parses the SRAT tables and constructs mappings
> for each node such as node to cpu mappings and memory range
> to node mappings.

Having the code for parsing the SRAT table is clearly a good thing.

> Using this information, we also modified the page allocator
> to provide a simple NUMA-aware API.  The modified allocator
> will attempt to find pages local to the cpu where possible,
> but will fall back on using memory that is of the requested
> size rather than fragmenting larger contiguous chunks to find
> local pages.  We expect to tune this algorithm in the future
> after further study.

Personally, I think we should have separate buddy allocators for each of
the zones; much simpler and faster in the common case.
 
> We also modified Xen's increase_reservation memory op to
> balance memory distribution across the vcpus in use by a
> domain.  Relying on previous patches which have already been
> committed to xen-unstable, a guest can be constructed such
> that its entire memory is contained within a specific NUMA node.

This makes sense for 1 vcpu guests, but for multi vcpu guests this needs
way more discussion. How do we expose the (potentially dynamic) mapping
of vcpus to nodes? How do we expose the different memory zones to
guests? How does Linux make use of this information? This is a can of
worms, definitely phase 2.

> We've added a keyhandler for exposing some of the
> NUMA-related information and statistics that pertain to the
> hypervisor.
>
> We export NUMA system information via the physinfo hypercall.
>  This information provides cpu/memory topology and
> configuration information gleaned from the SRAT tables to
> userspace applications.  Currently, xend doesn't leverage any
> of the information automatically but we intend to do so in the future.

Yep, useful.

> We've integrated in NUMA information into xentrace so we can
> track various points such as page allocator hits and misses
> as well as other information.  In the process of implementing
> the trace, we also fixed some incorrect assumptions about the
> symmetry of NUMA systems w.r.t the sockets_per_node value.  
> Details are available a later email with the patch.

Nice.

> These patches have been tested on several IBM NUMA and
> non-NUMA systems:
>
> NUMA-aware systems:
> IBM Dual Opteron:  2 Node,  2 CPU,  4GB
> IBM x445        :  4 Node, 32 CPU, 32GB
> IBM x460        :  1 Node,  8 CPU, 16GB
> IBM x460        :  2 Node, 32 CPU, 32GB


If only we had an x445 to be able to work on these patches :)

Ian


Re: [PATCH] 0/7 xen: Add basic NUMA support

Ryan Harper
* Ian Pratt <[hidden email]> [2005-12-16 19:28]:
>  
> > The patchset will add basic NUMA support to Xen (hypervisor
> > only).  
>
> I think we need a lot more discussion on this -- your approach differs
> from what we've previously discussed on the list. We need a session at
> the Jan summit.

OK.

> > Using this information, we also modified the page allocator
> > to provide a simple NUMA-aware API.  The modified allocator
> > will attempt to find pages local to the cpu where possible,
> > but will fall back on using memory that is of the requested
> > size rather than fragmenting larger contiguous chunks to find
> > local pages.  We expect to tune this algorithm in the future
> > after further study.
>
> Personally, I think we should have separate buddy allocators for each of
> the zones; much simpler and faster in the common case.

I'm not sure how having multiple buddy allocators helps one choose
memory local to a node.  Do you mean to have a buddy allocator per node?

> > We also modified Xen's increase_reservation memory op to
> > balance memory distribution across the vcpus in use by a
> > domain.  Relying on previous patches which have already been
> > committed to xen-unstable, a guest can be constructed such
> > that its entire memory is contained within a specific NUMA node.
>
> This makes sense for 1 vcpu guests, but for multi vcpu guests this needs
> way more discussion. How do we expose the (potentially dynamic) mapping
> of vcpus to nodes? How do we expose the different memory zones to
> guests? How does Linux make use of this information? This is a can of
> worms, definitely phase 2.

I believe this makes sense for multi-vcpu guests as well, since currently
the vcpu to cpu mapping is known at domain construction time and prior to
memory allocation.  The dynamic case requires some thought: we don't want
to spread memory around, unplug two or three vcpus, and then incur a
large number of misses because the remaining vcpus are not local to all
of the domain's memory.

The phase two plan is to provide virtual SRAT and SLIT tables to the
guests to leverage existing Linux NUMA code.  Lots to discuss here.

> If only we had an x445 to be able to work on these patches :)

=)

Thanks for the feedback.

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
[hidden email]


RE: [PATCH] 0/7 xen: Add basic NUMA support

Ian Pratt
> > Personally, I think we should have separate buddy allocators for each
> > of the zones; much simpler and faster in the common case.
>
> I'm not sure how having multiple buddy allocators helps one choose
> memory local to a node.  Do you mean to have a buddy allocator per
> node?

Absolutely. You try to allocate from a local node, and then fall back
to others.
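
Roughly along these lines (purely illustrative, with stand-in names --
cpu_to_node() and alloc_from_node() here just denote "look up the local
node" and "allocate from that node's free lists"):

#define MAX_NUMNODES 8

struct pfn_info;                                    /* Xen page descriptor */
extern unsigned int cpu_to_node(unsigned int cpu);  /* built from the SRAT */
extern struct pfn_info *alloc_from_node(unsigned int node,
                                        unsigned int order);

struct pfn_info *per_node_buddy_alloc(unsigned int cpu, unsigned int order)
{
    unsigned int local = cpu_to_node(cpu), node;
    struct pfn_info *pg;

    /* Local node first... */
    if ( (pg = alloc_from_node(local, order)) != NULL )
        return pg;

    /* ...then fall back to the other nodes in turn. */
    for ( node = 0; node < MAX_NUMNODES; node++ )
        if ( node != local && (pg = alloc_from_node(node, order)) != NULL )
            return pg;

    return NULL;
}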
 

> > This makes sense for 1 vcpu guests, but for multi vcpu guests this
> > needs way more discussion. How do we expose the (potentially dynamic)
> > mapping of vcpus to nodes? How do we expose the different memory zones
> > to guests? How does Linux make use of this information? This is a can
> > of worms, definitely phase 2.
>
> I believe this makes sense for multi-vcpu guests as currently
> the vcpu to cpu mapping is known at domain construction time
> and prior to memory allocation.  The dynamic case requires
> some thought as we don't want to spread memory around, unplug
> two or three vcpus and potentially incur a large number of
> misses because the remaining vcpus are not local to all the
> domains memory.

Fortunately we already have a good mechanism for moving pages between
nodes: save/restore could be adapted to do this. For shadow-translate
guests this is even easier, but of course there are other penalties of
running in shadow translate mode the whole time.

> The phase two plan is to provide virtual SRAT and SLIT tables
> to the guests to leverage existing Linux NUMA code.  Lots to
> discuss here.

The existing mechanisms (in Linux and other OSes) are not intended for a
dynamic situation. I guess that will be phase 3, but it may mean that
using the SRAT is not the best way of communicating this information.

Best,
Ian
