What we want to do
Unless we use Sparse Residency we cannot know what memory address maps to what part of the image.
However, Sparse Residency has its own host of problems; see below.
And even with Sparse Residency we only know the mapping at a 64 KiB granularity.
Faster Uploads, Zero Copies, Decompression Straight to the Image
We would like to use zstd/lz4 on the CPU to decompress straight into an OPTIMAL-tiling texture bound to DEVICE_LOCAL and HOST_VISIBLE memory.
Or use VK_EXT_memory_decompression to decompress into a buffer device address (BDA) that aliases the image's bound memory.
Note that host_image_copy is NOT a solution to our problem, as unlike #2271 we're not after a simple memcpy.
We know that not every WxHxD region of texels at any offset (X,Y,Z) is a contiguous range in memory. But this is for a Software Virtual Texture Page Pool, so our regions can be PoT or aligned to a particular size.
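To illustrate why aligned PoT regions are the interesting case, here is a minimal sketch in C. It assumes a plain 2D Morton (Z-order) texel layout, which is only a stand-in: real OPTIMAL tilings are implementation-defined and generally differ. The helper names (`part1by1`, `morton2d`, `tile_is_contiguous`) are made up for this example. Under that assumed layout, an aligned 2^k x 2^k tile occupies one contiguous index range, while an unaligned tile of the same size does not.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of v so a zero bit sits between each original bit. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* ASSUMED layout (not what any driver guarantees): texel (x, y) is stored
   at the index obtained by interleaving the bits of x and y. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    assert(x <= 0xFFFFu && y <= 0xFFFFu);
    return part1by1(x) | (part1by1(y) << 1);
}

/* Returns 1 when the side x side tile whose corner is (x0, y0) covers one
   contiguous index range, and stores the lowest index in *first.
   side = 1 << log2_side. */
static int tile_is_contiguous(uint32_t x0, uint32_t y0,
                              uint32_t log2_side, uint32_t *first)
{
    assert(log2_side <= 12);
    const uint32_t side = 1u << log2_side;
    uint32_t lo = UINT32_MAX, hi = 0;
    for (uint32_t j = 0; j < side; ++j) {
        for (uint32_t i = 0; i < side; ++i) {
            const uint32_t m = morton2d(x0 + i, y0 + j);
            if (m < lo) lo = m;
            if (m > hi) hi = m;
        }
    }
    *first = lo;
    /* Morton indices are unique per texel, so a span equal to the texel
       count means the range has no gaps. */
    return (hi - lo + 1u) == side * side;
}
```

With this layout, `tile_is_contiguous(128, 256, 7, &first)` reports a contiguous range starting at `morton2d(128, 256)`, whereas a 2x2 tile at (1, 0) straddles several ranges. Multiplying the index by the texel block size would give a byte offset under the same assumption.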
Aliasing Layers of 2D Image Arrays to the same memory for Flexible Page Pools
As it stands right now, to create an Image View of the same Image with a different format, the Texel Block sizes of the formats need to be compatible.
For example I cannot create R16_SFLOAT and RGBA8_SRGB views of the same image.
This makes sense and I'm not contesting that. But for Software Virtual Texturing, where the tiles need to be PoT plus some border pixels, it's impossible to have some pages interpreted as a format with a different texel block size, because we could no longer use a pool allocator to allocate pages (as they'd be different sizes).
Our solution there would be to have a single Page Pool that's bound to multiple 2D Image Arrays of different resolutions but the same layer count (one per Texel Block Compatibility Class), and then exclusively associate each layer with a Texel Block Size Class.
So two images, one R16_SFLOAT and one RGBA8_SRGB, would alias each other in memory.
In current Vulkan I can do this, but without Sparse Residency we won't know which part of each image aliases what part of the other.
This is incredibly annoying for our Software VT implementation, where we'd like to keep a common shared Page Pool instead of a Pool per Texel Block Compatibility Class, so that we don't need to resize, move or defragment should we run out of pages for a particular format while another pool sits unused.
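The bookkeeping side of such a shared pool can be sketched as below. This is a rough illustration under stated assumptions, not our actual implementation: the page payload is fixed at 64 KiB, border pixels are ignored, and all names (`PAGE_BYTES`, `PageExtent`, `PagePool`, `pool_alloc`, `pool_free`) are hypothetical. The per-class extents it computes happen to match the standard sparse image block shapes the Vulkan spec lists for 2D single-sample images (256x256, 256x128, 128x128, 128x64, 64x64 for 1, 2, 4, 8 and 16 byte texel blocks).

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE_BYTES = 64 * 1024, LAYER_COUNT = 8, CLASS_FREE = 0 };

typedef struct { uint32_t width, height; } PageExtent;

/* Extent in texel blocks of one page for a power-of-two texel block size
   (1..16 bytes). The texel count is split as evenly as possible between
   the two axes, with width taking the extra factor of two. */
static PageExtent page_extent_for_block_size(uint32_t block_bytes)
{
    assert(block_bytes >= 1 && block_bytes <= 16 &&
           (block_bytes & (block_bytes - 1)) == 0);
    const uint32_t texels = PAGE_BYTES / block_bytes;
    uint32_t log2_texels = 0;
    while ((1u << (log2_texels + 1)) <= texels) ++log2_texels;
    PageExtent e;
    e.height = 1u << (log2_texels / 2);
    e.width  = 1u << (log2_texels - log2_texels / 2);
    return e;
}

/* One entry per array layer, shared by every aliased image array.
   0 means free, otherwise the texel block byte size that owns the layer. */
typedef struct { uint8_t layer_class[LAYER_COUNT]; } PagePool;

/* Claims the first free layer for the given class; returns -1 when full. */
static int pool_alloc(PagePool *pool, uint8_t block_bytes)
{
    assert(block_bytes != CLASS_FREE);
    for (int i = 0; i < LAYER_COUNT; ++i) {
        if (pool->layer_class[i] == CLASS_FREE) {
            pool->layer_class[i] = block_bytes;
            return i;
        }
    }
    return -1;
}

/* Returns a layer to the pool so any class may claim it next. */
static void pool_free(PagePool *pool, int layer)
{
    assert(layer >= 0 && layer < LAYER_COUNT);
    pool->layer_class[layer] = CLASS_FREE;
}
```

A layer freed by the R16_SFLOAT class can immediately be handed to the RGBA8_SRGB class, which is the whole point of sharing the pool. Note that equal payload bytes per layer do not guarantee that layer N of one image aliases layer N of the other in an OPTIMAL tiling; that missing guarantee is exactly what this issue is about.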
Other APIs
DirectX has the following https://learn.microsoft.com/en-us/windows/win32/direct3d12/default-texture-mapping with the enums
D3D12_TEXTURE_LAYOUT_64KB_UNDEFINED_SWIZZLE
D3D12_TEXTURE_LAYOUT_64KB_STANDARD_SWIZZLE
D3D12_TEXTURE_LAYOUT_64KB_UNDEFINED_SWIZZLE is similar to Vulkan's Sparse Images, but doesn't require that the binding be virtualized.
D3D12_TEXTURE_LAYOUT_64KB_STANDARD_SWIZZLE, on the other hand, would truly let us do what we want, as we would not have to align our pages to multiples of 512, 256, 128 or 64 depending on the texel block byte size.
I'm not a D3D12 expert, and maybe despite the texture not being virtualized it still has the same perf penalty in D3D12 from being created this way.
The problem
Richermoz's and Neyret's paper "The Sad State of Virtual Textures" confirms the common wisdom that binding pages on Windows is abysmally slow, and the bindings are obviously non-programmable/indirect, so that's why we are using Software Virtual Texturing. Another reason is the resolution limits, as we're rendering GeoTIFF and ECW with resolutions of 128k or 256k per axis.
The paper also shows that using Sparse Residency imparts a very measurable (up to 50% on Nvidia) overhead when sampling a Sparse Residency Image. Nobody has been able to confirm the cause to me officially (also because the impl is hidden in the texturing or memory unit), but it might stem from the fact that when a page is accessed the memory address has to be looked up instead of computed with ALU and added to a per-image base offset.
In Vulkan, if I want a reliable mapping of sections of the image to a memory address I can get one, but only if the image is created with Sparse Residency. **Plain Sparse Binding (without Residency) allows me to bind different chunks of memory to the image at certain sizes and alignments, but defines no mapping between the texels and the memory, or even the layers!**
The drawbacks outlined in the paper are why making our Software VT Page Pool have Sparse Residency just for the purposes of having a reliable mapping of an image region to some chunk of memory is a non-starter.
Solution - Every GPU with host_image_copy should be able to support it
ANY driver supporting host_image_copy already has the knowledge of what memory address maps to what texel.
It needs to know because, to use the PCIe bandwidth efficiently (write combining, no wasted PCIe packets), the CPU thread must iterate over the texels in memory order as it writes them out.
So what we need is an extension + set of queries that lets us perform this work ourselves instead of pawning it off to the driver. This would allow us to use any decompression algorithm, or to produce data procedurally on the CPU or GPU.
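To make the request a bit more concrete, here is a rough sketch of how application code might consume the result of such a query. Everything in it is hypothetical: `HypotheticalRegionLayout`, `RegionProducer`, `fill_region` and `copy_producer` are made-up names, not part of Vulkan or of any existing extension. A plain byte array stands in for the mapped image memory, and a `memcpy` stands in for a real decompressor that emits data already in device order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL: what a layout query might report for one aligned region. */
typedef struct {
    size_t byte_offset; /* from the start of the image's bound memory */
    size_t byte_size;   /* size of the contiguous range backing the region */
} HypotheticalRegionLayout;

/* App-side producer: writes exactly `size` bytes to dst in one call, so
   there is no per-texel callback and no DLL boundary to cross. */
typedef void (*RegionProducer)(void *dst, size_t size, void *user);

/* Writes a region straight into the mapping, with no staging copy.
   Returns 0 (and writes nothing) when the range falls outside the mapping. */
static int fill_region(uint8_t *mapped_base, size_t mapped_size,
                       HypotheticalRegionLayout layout,
                       RegionProducer produce, void *user)
{
    if (layout.byte_offset > mapped_size ||
        layout.byte_size > mapped_size - layout.byte_offset)
        return 0;
    produce(mapped_base + layout.byte_offset, layout.byte_size, user);
    return 1;
}

/* Stand-in for a decompressor: copies pre-swizzled bytes from `user`. */
static void copy_producer(void *dst, size_t size, void *user)
{
    memcpy(dst, user, size);
}
```

In real use the producer would be something like a zstd or lz4 decode call writing into the ReBAR mapping, and the layout would come from the driver rather than being filled in by hand.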
A per-texel-block callback to produce texel data within the host copy won't do, because that crosses a DLL boundary and without extreme batching would incur horrible overhead.
TL;DR
Currently, decompressing OPTIMAL-layout image data (stored as it was downloaded from the GPU) straight across a ReBAR VRAM mapping is impossible without an extra staging VkImage, unless we're yeeting (overwriting) the entire image, and worse yet, the image might need to be newly created and in PREINITIALIZED layout.