It seems that NVIDIA Optimus support for Linux is picking up pace, and David Airlie has started to work on some proof-of-concept code for multi-GPU rendering in Linux. Since Phoronix posted about it first, we link to their post here, followed by David's original post:
[Phoronix] Proof Of Concept: Open-Source Multi-GPU Rendering!
Now that David Airlie's vga_switcheroo, which provides hybrid graphics support and delayed GPU switching, has gone upstream in the Linux 2.6.34 kernel, David went on to look for something new to work on in his downtime when not busy with tasks at Red Hat. This new work is on GPU offloading / multi-GPU rendering.
airlied: GPU offloading - PRIME - proof of concept
Last month NVIDIA introduced Optimus as a way for dual-GPU notebooks to seamlessly switch between the two GPUs but also to offload the rendering workload to the other graphics processor. This is somewhat similar to NVIDIA's SLI and ATI/AMD's CrossFire for splitting the rendering workload across multiple GPUs, but it has its differences. David ended up developing a proof-of-concept similar to NVIDIA's Optimus that he is calling "Prime" and it works with Intel and ATI GPUs.
David's goals with Prime are to allow a second GPU to render 3D applications onto the screen of the first GPU, selectable from the client side, and to handle only the rendering side. This work wasn't as simple as his vga_switcheroo implementation: it required changes to the Linux kernel and the Graphics Execution Manager (GEM), the DRI2 protocol, the X Server and DRI2 modules, and then the actual Linux hardware drivers.
All of this code has already been published as a proof-of-concept, but David shares on his blog that he's unlikely to personally take this work further by upstreaming the code. He has been successful though in using this code to offload the rendering work from an Intel IGP that's driving a display to a discrete ATI graphics processor.
Right now Intel and ATI hardware is supported, but NVIDIA GPUs could be supported too. This work depends upon a system using DRI2 (albeit with these out-of-tree patches) and a compositing manager must be running. David also shares, "To make this as good as Windows we need to seriously re-architect the X server + drivers. At the moment you can't load an X driver without having a screen to attach it to, I don't really want a screen for the slave driver, however I still have to have one all setup and doing nothing and hopefully not getting in the way. We'd need to separate screen + drivers a lot better. Having some sort of dynamic screens would probably fall out of this work if someone decides to actually do it."
It would be wonderful if this work on Prime could be continued and worked its way upstream, or if someone took the reins from David to continue this GPU offloading work for open-source drivers. First, though, it may make more sense to focus on getting decent performance out of a single GPU before dealing with multi-GPU excitement.
THIS IS A PROOF OF CONCEPT - it's not going to be upstream unless someone else dedicates their life to it (btw, anyone know anyone in ASUS?).
So NVIDIA unveiled their Optimus GPU selection solution for Windows 7, so I decided to see what it would take to implement something similar under DRI. I've named it PRIME for obvious reasons.
Goals:
1. Allow a second GPU to render 3D apps onto the screen of the first, pickable from the client side.
2. Just target the rendering side; I'm assuming the GPU power up/down is similar to what was done for the older switching method.
Restrictions + limitations:
1. Must have compositing manager running
2. Must have second screen configured for slave card (doesn't need to be used)
Test system:
Intel 945 IGP + radeon r200 PCI card - yes this won't be a speed demon.
Terms:
Master: the IGP displaying the output - intel
Slave: the GPU rendering the app - radeon r200 in this case.
Step 1: kernel support
http://git.kernel.org/?p=linux/kernel/git/airlied/drm-testing.git;a=shortlog;h=refs/heads/drm-prime-test
http://cgit.freedesktop.org/~airlied/drm/log/?h=prime-test
The kernel requirements were simple, we needed a way to share a memory managed object between two kernel device drivers.
The kernel has a GEM namespace per device, however this isn't good enough to share with other devices, so I introduced a new PRIME namespace with two ioctls. One ioctl allows the master device to associate a device buffer handle with a name in the prime namespace, and the other allows the slave device to associate a prime namespace handle with a buffer. When the master creates a prime buffer the kernel associates the list of pages with the handle, and when the slave looks up the same handle it retrieves the list of pages and fakes up a TTM buffer populated with those pages as backing store. I've added the concept of a slave object to TTM to allow for this.
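The sharing scheme described above can be modelled in a few lines of Python. This is only a toy sketch of the idea, not the kernel code: the class and method names (`Device`, `PrimeNamespace`, `export_handle`, `import_name`) are invented for illustration, and plain Python lists stand in for page lists and TTM buffers.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Device:
    """Toy stand-in for one DRM device with its own private handle table."""
    name: str
    buffers: dict = field(default_factory=dict)  # local handle -> list of pages
    _next_handle: count = field(default_factory=lambda: count(1))

    def create_buffer(self, pages):
        handle = next(self._next_handle)
        self.buffers[handle] = pages
        return handle


class PrimeNamespace:
    """Cross-device table (hypothetical): prime name -> shared page list."""

    def __init__(self):
        self._pages_by_name = {}
        self._next_name = count(1)

    def export_handle(self, master, handle):
        """Models the first ioctl: the master ties a local handle to a prime name."""
        name = next(self._next_name)
        self._pages_by_name[name] = master.buffers[handle]
        return name

    def import_name(self, slave, name):
        """Models the second ioctl: the slave gets a buffer backed by the same pages."""
        pages = self._pages_by_name[name]
        return slave.create_buffer(pages)  # same list object, nothing is copied


prime = PrimeNamespace()
intel = Device("intel")    # master: displays the output
radeon = Device("radeon")  # slave: renders the app

front = intel.create_buffer([bytearray(4096) for _ in range(2)])
name = prime.export_handle(intel, front)
slave_handle = prime.import_name(radeon, name)

# A write through the slave's buffer is visible through the master's handle.
radeon.buffers[slave_handle][0][0] = 0xFF
print(intel.buffers[front][0][0])  # 255
```

The point of the sketch is that the two devices keep separate handle tables and only the prime name crosses between them, while the backing pages are shared rather than copied.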
The drm repo contains the API wrappers + intel + radeon pieces to call the association functions for buffer objects.
Step 2: DRI2 Protocol
http://people.freedesktop.org/~airlied/prime/0001-dri2proto-add-prime-token.patch
http://people.freedesktop.org/~airlied/prime/0001-prime-support-for-mesa.patch
From the X server point of view, a recent change to the DRI2 layer allowed for multiple device driver names to be associated with a DRI2 end point. The client can currently request either a DRI or VDPAU device name. I first extended the DRI2 protocol to add a new buffer type, called PRIME, and added a hack to mesa's glx loader to request the prime driver if an environment variable was specified.
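As a rough illustration of the loader hack described above, the client-side choice boils down to a check like the one below. This is a sketch under assumptions: the variable name `EXAMPLE_USE_PRIME` and the string constants are hypothetical and are not taken from the mesa patch.

```python
import os

# Hypothetical names for illustration only; the actual patch may differ.
PRIME_ENV_VAR = "EXAMPLE_USE_PRIME"
DRIVER_TYPE_DRI = "DRI"
DRIVER_TYPE_PRIME = "PRIME"


def pick_driver_type(environ=None):
    """Return which driver type the client should request over DRI2."""
    env = os.environ if environ is None else environ
    # Any non-empty value opts this client in to the slave GPU.
    if env.get(PRIME_ENV_VAR):
        return DRIVER_TYPE_PRIME
    return DRIVER_TYPE_DRI


print(pick_driver_type({}))                          # DRI
print(pick_driver_type({"EXAMPLE_USE_PRIME": "1"}))  # PRIME
```

Keeping the decision in the client's environment means the choice is per application, which matches the goal of making the second GPU pickable from the client side.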
Step 3: X server DRI2 module + drivers
http://people.freedesktop.org/~airlied/prime/0001-intel-add-prime-master-support.patch
http://cgit.freedesktop.org/~airlied/xf86-video-ati/log/?h=prime-test
http://people.freedesktop.org/~airlied/prime/0001-dri2-prime-hackfest.patch
This was the messiest bit and still requires a lot of change. First up I added an interface for the drivers to register as PRIME masters and slaves. The Intel driver registers as master, radeon as slave for my demo. We store these in an array. When a client connects and requests the prime driver, we mark the drawable and redirect the dri2 buffer creation requests to the slave screen driver. Also the drm authentication is sent to both kernel drms. It then hooks the swapbuffers command where it does a region copy, and redirects this to the slave driver, and damages the pixmap in the master driver. Now the "interesting" part: my original implementation simply grabbed the window pixmap at dri2 create buffers time, however there is an ordering issue with compositing. This pixmap is pre-composite redirection, so it isn't actually the pixmap you want to tell the kernel to bind to both gpus. This turned out to function badly, I could see gears all stretched over the front buffer.
So a quick coke + chocolate break later, I had enough sugar to bash out the hack that now exists. DRI2 calls the slave driver copy region callback, which checks if the drawable pixmap is on the same screen. If it's not, it checks if we've marked the pixmap as a prime pixmap (i.e. one that belongs to the master). If it is, it swaps in the slave's copy, otherwise it calls back into DRI2. This callback calls the Intel driver to make the buffer object backing the pixmap shareable and return the handle, then calls into radeon with the handle to create a new pixmap pointing at the shared buffer object. Once all that is done, radeon copies the back buffer to the shared front pixmap, we return, damage is posted, and the compositor grabs the window pixmap and displays it.
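The copy-region decision described above can be sketched as follows. This is a toy model, not the X server code: every name here (`Pixmap`, `slave_copy_region`, `dri2_prime_callback`, the handle value 42) is made up for illustration, and a list of strings stands in for the actual driver calls.

```python
from dataclasses import dataclass
from typing import Optional

MASTER_SCREEN = 0  # Intel, drives the display
SLAVE_SCREEN = 1   # radeon, renders the app


@dataclass
class Pixmap:
    screen: int
    slave_copy: Optional["Pixmap"] = None  # set once marked as a prime pixmap


log = []  # records the order of driver calls


def master_make_shareable(pixmap):
    log.append("master: export buffer")
    return 42  # hypothetical shared handle


def slave_pixmap_from_handle(handle):
    log.append("slave: wrap handle")
    return Pixmap(screen=SLAVE_SCREEN)


def dri2_prime_callback(pixmap):
    """First use of a master pixmap: share it and remember the slave's view."""
    handle = master_make_shareable(pixmap)
    pixmap.slave_copy = slave_pixmap_from_handle(handle)
    return pixmap.slave_copy


def slave_copy_region(dest):
    if dest.screen == SLAVE_SCREEN:
        target = dest                       # already on the slave's screen
    elif dest.slave_copy is not None:
        target = dest.slave_copy            # known prime pixmap, swap in the copy
    else:
        target = dri2_prime_callback(dest)  # call back into DRI2 to set it up
    log.append("slave: copy back buffer to front")
    log.append("master: post damage")
    return target


window = Pixmap(screen=MASTER_SCREEN)
first = slave_copy_region(window)
second = slave_copy_region(window)
print(first is second)                     # True
print(log.count("master: export buffer"))  # 1
```

In this model the expensive sharing setup happens once, on the first swap, and later swaps reuse the slave's pixmap for the same window.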
So does it work?
On my blistering fast test system with X + xcompmgr running, glxgears was going at 150fps from the r200 PCI card. Hopefully I can get some time on a faster system or one of the dual laptops.
Caveats:
- When a window manager is running the gears get all corrupted; this looks like the clipping and/or stride matching between the drivers isn't correct. I suspect something with reparenting and decorations, I'm not enough of an X guru to understand this yet, hopefully one of the other hackers can fill me in. Also before it gets reparented and redirected a frame can land on the real front buffer; again clipping should take care of this, but isn't working yet. I need to work out how clipping and that stuff works in X/DRI2. - talk to ppl about clipping then JDI.
- Once a client has connected as a prime, we don't tear it down properly, so later clients can end up marked as prime. - work out some sort of resources to turn stuff off
- Reference counting on the pages in the kernel is iffy, currently i915 ups the page list refcount but never drops it. - solution JDI
- hardcoded /dev/dri paths in dri2 for slave device - solution JDI
- radeon driver could in theory be a prime master - solution JDI
- nouveau could support prime master/slave also. - solution nouveau guys JDI
- requires an ugly second screen in xorg.conf to load the slave driver. Can we have a 0 sized screen or maybe a rootless second screen. - solution: rearchitect X server to allow drivers without screens (6m-1yr work)
- pageflipping needs to be hacked off in intel driver. - work out and then JDI
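To make the refcounting caveat in the list above concrete, here is a minimal sketch of a page list whose export reference is never dropped, next to the balanced version. It is an illustration only: the `PageList` class is invented, and the assumption that the extra reference is taken at export time is mine, not something taken from the i915 code.

```python
class PageList:
    """Toy refcounted page list; pages are released when the count hits zero."""

    def __init__(self, pages):
        self.pages = pages
        self.refcount = 1  # reference held by the creating (master) buffer

    def get(self):
        self.refcount += 1
        return self

    def put(self):
        assert self.refcount > 0, "put() without a matching get()"
        self.refcount -= 1
        if self.refcount == 0:
            self.pages = None  # stands in for releasing the pages

    @property
    def freed(self):
        return self.pages is None


def share_and_release(drop_export_ref):
    pages = PageList(["page0", "page1"])
    export_ref = pages.get()  # assumed: taken when the master exports the buffer
    slave_ref = pages.get()   # taken when the slave imports it
    slave_ref.put()           # slave client exits
    if drop_export_ref:
        export_ref.put()      # the drop the caveat says is currently missing
    pages.put()               # master destroys its own buffer
    return pages


leaky = share_and_release(drop_export_ref=False)
balanced = share_and_release(drop_export_ref=True)
print(leaky.refcount, leaky.freed)        # 1 False
print(balanced.refcount, balanced.freed)  # 0 True
```

With the unmatched reference the count never reaches zero, so the pages stay pinned after every client has gone away.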
Where is the video?
Once I get it working with a window manager on a useful machine I might do a video of two gears going.
Where now?
Well this is a purely academic exercise so far, after a week of kernel fighting I decided to do something new and cool. To make this as good as Windows we need to seriously re-architect the X server + drivers. At the moment you can't load an X driver without having a screen to attach it to, I don't really want a screen for the slave driver, however I still have to have one all setup and doing nothing and hopefully not getting in the way. We'd need to separate screen + drivers a lot better. Having some sort of dynamic screens would probably fall out of this work if someone decides to actually do it.
The kernel bits aren't as ugly as I thought but I'm not sure if upstreaming them is a good idea without the other bits. The refcounting definitely needs work, as does the cleanup when clients exit.
DRI2 needs some more changes, I might try and flesh it out a bit more and then talk to krh about a sane interface.
I'm probably going to get a forced task switch quite soon, so I might just get to having this running on a W500 or T500 before dropping it for 6 months, so if anyone wants a neat project to play with and has the hw, feel free to try and take this on.
ASUS, feel free to send me one of the real Optimus laptops and I'll get the nouveau guys hooked up and try and RE the nvidia DMA engine.