This process of height map displacement may be advantageous in that running time is primarily independent on the complexity of the filtering algorithm used to calculate displacement. Unlike other displacement algorithms, in which the displacement value must be calculated for every sample, the complexity of height map composition in this case does not significantly effect run time.
That being said, there are a few options to implement the displacement filter:
If implemented on the GPU, the displacement algorithm is basically a for loop checking up to Max_Displacement pixels bellow, and returning the index of the first valid location. Though as register space is limited, this algorithm must be broken into several steps.
This method seems to work the best, but scales directly with the maximum displacement value. Although this value can be scaled to simulate more displacement, but displacement is then culled closer to the screen. The running time for this algorithm is dependent on the Max_Displacement value,
The displacement algorithm may also be implemented on the CPU, where variables may be stored and referenced cumulatively for each column in the image. This advantage, being able to use global variables, allows for a significant speed up in running time, theoretically.
Each column in the image can be processed in a for loop, that keeps track of the current reference pixel and offset distance. In this method, a while loop advanced the position of the reference pixel and calculates a new displacement value. The while loop advances until a valid displacement value is found, or until the current and refrence pixel indices are equal, in which no displacement occurs for the current pixel. This means that the cumulative executions of the while loops for each column will not exceed the image_height, thus running at O(HEIGHT * WIDTH). Though most significantly, the running time for this algorithm is not dependent on the Max_Displacement value.
Though of course as this method is on the CPU, it is sequential, and requires time to transfer resources from the GPU, meaning it does not run in real time. (about 1-2 fps
GPU Direct Programming
As the above algorithm operates independently for each column, it is easily parallelized using direct programming with the GPU. Although with the above implementation, parallelization is limited, as the execution paths for each unit are highly variable. Additionally, transfering resources and switching rendering modes for the GPU is slow, which accounts for about 75% of the running time. This algorithm currently runs at about 5-7 fps, although I definitely think it could be implemented in a more parallelized way.
These explanations here are quite simple and brief, but were are a significant portion of my research in developing this process. I elaborate on the specifics of implementation in the paper I have been working on...