Optimising a Julia Fractal Animation

Published: Sat 02 September 2023
Updated: Sun 17 September 2023
By Rahul Pai

In misc.

I recently came across this video on YouTube. It's a short tutorial on generating a Julia set fractal image in Rust. I've used Rust before but never with image generation or mathematics, so it looked interesting, and I followed along. By the end of the video, you have some code to generate a single black and white image.

I extended the code to use colour and generate an animation.

How does it work?

  1. define an iterative function, say z = z² + c
  2. take a point on the complex plane z = x + yi and a constant c
  3. iterate until the value diverges to "infinity", in this case until the magnitude exceeds 2
  4. stop after MAX_ITER iterations if the value hasn't diverged by then
  5. the Julia set is the set of points that don't diverge
  6. for the points that do diverge, record the number of iterations it took and colour the pixel based on that count

In the tutorial, he used a maximum of 255 iterations and mapped the iteration count to grayscale.
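
As a rough sketch of that core escape-time loop (using the num_complex crate here; the tutorial's actual code may differ), the per-pixel calculation is:

use num_complex::Complex;

// Count iterations of z = z² + c until |z| exceeds 2, up to max_iter.
// Points that never exceed the bound are treated as part of the Julia set.
fn escape_time(z0: Complex<f64>, c: Complex<f64>, max_iter: u32) -> u32 {
    let mut z = z0;
    for i in 0..max_iter {
        if z.norm_sqr() > 4.0 {
            return i; // diverged on iteration i
        }
        z = z * z + c;
    }
    max_iter // never diverged within max_iter iterations
}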

I extended it to an arbitrary number of iterations and mapped the iteration count to the hue component in HSV colour. I also used it for the value component, to darken the earlier iterations; this was a stylistic choice to stop the images from being dominated by bright red.
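
A sketch of that colouring, with a hand-rolled HSV-to-RGB conversion (my real code may use a crate for this, and the exact constants are illustrative):

// Map an iteration count to an RGB pixel: hue and value both scale with the
// iteration count, so early iterations are dark red and later ones are
// brighter and cycle through the hues. Points in the set stay black.
fn colour(iters: u32, max_iter: u32) -> [u8; 3] {
    if iters >= max_iter {
        return [0, 0, 0];
    }
    let t = iters as f64 / max_iter as f64;
    hsv_to_rgb(360.0 * t, 1.0, t)
}

// Minimal HSV (hue in degrees, saturation and value in 0..1) to RGB conversion.
fn hsv_to_rgb(h: f64, s: f64, v: f64) -> [u8; 3] {
    let c = v * s;
    let x = c * (1.0 - ((h / 60.0) % 2.0 - 1.0).abs());
    let m = v - c;
    let (r, g, b) = match (h / 60.0) as u32 {
        0 => (c, x, 0.0),
        1 => (x, c, 0.0),
        2 => (0.0, c, x),
        3 => (0.0, x, c),
        4 => (x, 0.0, c),
        _ => (c, 0.0, x),
    };
    [
        ((r + m) * 255.0) as u8,
        ((g + m) * 255.0) as u8,
        ((b + m) * 255.0) as u8,
    ]
}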

I ran it with 1000+ iterations and increased the resolution to 1080p, and the result was pretty nice.

Animation

My next idea was to animate it. I wanted to see how the image looked "over time", with each frame using a MAX_ITER equal to the frame number. In the basic first version, all it did was vary MAX_ITER and generate each image independently in sequence. This turned out to be very slow, taking 2210 seconds (~37 minutes) for 360 frames. That is a long time for a 15 second video (24fps). Later, I switched to 1000 frames, and with this version I estimate it would have taken around 2 hours!
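
Putting the earlier sketches together, the naive loop looks roughly like this (the resolution and the region of the plane are illustrative, not my actual parameters, and I'm assuming the image crate for output):

use image::{Rgb, RgbImage};
use num_complex::Complex;

const WIDTH: u32 = 1920;
const HEIGHT: u32 = 1080;

// Map a pixel to a point in (roughly) the [-2, 2] × [-2, 2] square of the plane.
fn pixel_to_complex(x: u32, y: u32) -> Complex<f64> {
    Complex::new(
        (x as f64 / WIDTH as f64) * 4.0 - 2.0,
        (y as f64 / HEIGHT as f64) * 4.0 - 2.0,
    )
}

// The naive animation: render every frame completely from scratch, with
// MAX_ITER equal to the frame number, and save each image in sequence.
fn render_frames_naive(total_frames: u32, c: Complex<f64>) {
    for frame in 1..=total_frames {
        let mut img = RgbImage::new(WIDTH, HEIGHT);
        for (x, y, px) in img.enumerate_pixels_mut() {
            let iters = escape_time(pixel_to_complex(x, y), c, frame);
            *px = Rgb(colour(iters, frame));
        }
        img.save(format!("frames/{frame:04}.png")).unwrap();
    }
}

Calling it as render_frames_naive(360, Complex::new(-0.8, 0.156)) would reproduce the 360-frame run with one of the classic Julia constants (not necessarily the one I used).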

I also timed each frame:

You can see that as MAX_ITER increases, the frame time increases, but the rate of increase slows, because fewer and fewer points (pixels) require the increased iterations.

Optimisation

Multithreading

The first version used only a single thread, but my CPU has 12 threads, so I decided to use multithreading to calculate the frames in parallel. I programmed it to use one less than the number of available threads, allowing one thread for the rest of the applications running on my PC. In a way, I "turned it up to 11"!

use std::thread;

// Use all available logical cores except one, leaving a little headroom
// for the rest of the system.
fn num_threads() -> usize {
    thread::available_parallelism().unwrap().get() - 1
}

I initially divided the frames into separate lists, one per thread, but later switched to a multi-consumer message-passing pattern using the crossbeam_channel crate. I'm not sure there was any significant performance improvement, but the code looked cleaner. The latter pattern also lets threads pick up the slack if others are slower, though in this case the load should be fairly even across threads.
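
A minimal sketch of that pattern with crossbeam_channel, where render_and_save is a stand-in for one frame's worth of the naive loop above and workers would be num_threads() from earlier:

use crossbeam_channel::unbounded;
use std::thread;

// Multi-consumer work queue: one producer pushes frame numbers onto a channel,
// and each worker thread pulls the next available frame until the queue drains.
fn render_frames_parallel(total_frames: u32, workers: usize) {
    let (tx, rx) = unbounded::<u32>();
    for frame in 1..=total_frames {
        tx.send(frame).unwrap();
    }
    drop(tx); // close the channel so recv() returns Err once the queue is empty

    thread::scope(|s| {
        for _ in 0..workers {
            let rx = rx.clone();
            s.spawn(move || {
                while let Ok(frame) = rx.recv() {
                    render_and_save(frame); // one frame's work, as in the naive loop
                }
            });
        }
    });
}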

With multithreading, the total time was reduced to 318 seconds (~5 minutes). This is a significant improvement!

However, it is not an 11× improvement (2210 / 318 is roughly 7×), and you can see in the frame times that each frame takes longer. This is probably due to CPU contention between the threads, and maybe I should have used fewer than 11 and left more for the other applications. It could also be contention on the disk when saving the images in parallel. Regardless, I stuck with 11 for the remainder.

From here on, I switched to generating 1000 frames (~42 seconds of video at 24fps). With multithreading, this took 1066 seconds (~18 minutes).

Hashmap Cache

My next insight was that in each new frame, I was recalculating every pixel, even if it had already diverged in previous frames. I could reduce the calculations by remembering those results and carrying them forward into the following frames. That is, if in frame 50 we know that a certain point diverges on the 50th iteration, I can cache that value of 50 for those coordinates, and in later frames check the cache first before calculating. In many cases, the natural cache is a hashmap/dictionary: for any hashable key you can store a value and later retrieve it if it exists.
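
As a sketch, the cached version of the per-pixel calculation looks something like this (building on the helpers sketched earlier):

use std::collections::HashMap;
use num_complex::Complex;

// Per-thread cache of divergence results, keyed by pixel coordinates. A value
// of n means "this pixel diverged on iteration n"; pixels that have not
// diverged yet are simply absent from the map.
fn escape_time_cached(
    cache: &mut HashMap<(u32, u32), u32>,
    x: u32,
    y: u32,
    c: Complex<f64>,
    max_iter: u32,
) -> u32 {
    if let Some(&iters) = cache.get(&(x, y)) {
        return iters; // result carried forward from an earlier frame
    }
    let iters = escape_time(pixel_to_complex(x, y), c, max_iter);
    if iters < max_iter {
        cache.insert((x, y), iters); // only cache pixels that actually diverged
    }
    iters
}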

Each thread gets its own independent cache, so there are some repeated calculations, as each thread calculates each pixel regardless of whether it was already cached in another thread. I did try a shared cache, but the mutex added so much cost, with threads waiting for the lock before adding and retrieving values, that it became slower than having independent caches.

The hashmap cache reduced the total time to 514 seconds (~8.5 minutes). Also the curve of the frame times looks very different:

Initially, the frame times are higher due to the cost of checking the cache and adding to it, and most of the cache additions happen early on. The frame times then start falling as more and more pixels are served from the cache, and eventually the curve flattens, since the majority of pixels come straight from the cache and a cache lookup takes constant time.

Matrix Cache

The initial costs were high with the hashmap because getting and setting values requires hashing the key, which in this case is wasted work. The keys here are pixel coordinates, so a 2D array is a much better fit for a cache. I used DMatrix from the nalgebra crate. This reduced the total time to 426 seconds (~7 minutes), and you can see in the frame times that the curve has the same shape (since the function of the cache is the same), but it is consistently lower than the hashmap times, especially at the start.
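
The matrix-backed version of the same cache might look like this (I use u32::MAX as a "not diverged yet" sentinel here, which is one way to do it, not necessarily how my code stores it):

use nalgebra::DMatrix;
use num_complex::Complex;

// Sentinel meaning "not cached / not diverged yet".
const NOT_DIVERGED: u32 = u32::MAX;

// Same idea as the hashmap cache, but backed by a dense matrix indexed
// directly by pixel coordinates, so no hashing is needed.
fn escape_time_matrix_cached(
    cache: &mut DMatrix<u32>,
    x: u32,
    y: u32,
    c: Complex<f64>,
    max_iter: u32,
) -> u32 {
    let idx = (y as usize, x as usize); // (row, column)
    if cache[idx] != NOT_DIVERGED {
        return cache[idx];
    }
    let iters = escape_time(pixel_to_complex(x, y), c, max_iter);
    if iters < max_iter {
        cache[idx] = iters;
    }
    iters
}

Each thread's cache starts out as DMatrix::from_element(HEIGHT as usize, WIDTH as usize, NOT_DIVERGED).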

Max First

At this point, I was saving time on pixels that diverged, but it felt like a waste to start from scratch every time for the pixels that had not diverged yet. For example, if a pixel had not diverged after 49 iterations in frame 49, frame 50 would start again from zero and do 50 iterations, frame 51 would do 51, and so on. I thought about storing the current progress in another cache and passing that through the threads. This would need to store the current value of z along with the number of iterations that got it there, rather than just the number of iterations to divergence like the existing cache. However, this complicates the calculation logic quite a bit.

I then had the realisation that I could simply calculate the final frame (the one with the largest MAX_ITER) first, and then derive all the other frames from it. This final frame would be a matrix with values from 0 to MAX_ITER, and to generate the nth frame I would simply disregard any pixels with a value greater than n.
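
Deriving a frame from the final matrix is then just a thresholding pass, roughly:

use image::{Rgb, RgbImage};
use nalgebra::DMatrix;

// Derive frame n from the final-frame matrix: any pixel whose divergence
// iteration is greater than n is treated as "not diverged yet" in this frame,
// so it is clamped to n and rendered as black by the colouring sketched above.
fn frame_from_final(final_frame: &DMatrix<u32>, n: u32) -> RgbImage {
    let mut img = RgbImage::new(WIDTH, HEIGHT);
    for (x, y, px) in img.enumerate_pixels_mut() {
        let iters = final_frame[(y as usize, x as usize)].min(n);
        *px = Rgb(colour(iters, n));
    }
    img
}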

This method reduced the total time to 302 seconds (~5 minutes), and now you can see the frame times are pretty much constant! It first calculates the final frame matrix in a single thread, taking 8 seconds, then spends 294 seconds generating all the images.

Parallel Pixels

A small optimisation I then made was to calculate the pixels of the final frame matrix in parallel, again using message passing. This reduced that initial step from 8 seconds to 1.56 seconds.
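
A sketch of that step, handing out one row of pixels per message (again with crossbeam_channel; the structure is illustrative):

use crossbeam_channel::unbounded;
use nalgebra::DMatrix;
use num_complex::Complex;
use std::thread;

// Compute the final-frame matrix in parallel: one row of pixels per message,
// with workers sending (row index, results) pairs back to the main thread.
fn final_frame_parallel(c: Complex<f64>, max_iter: u32, workers: usize) -> DMatrix<u32> {
    let (row_tx, row_rx) = unbounded::<u32>();
    let (res_tx, res_rx) = unbounded::<(u32, Vec<u32>)>();
    for y in 0..HEIGHT {
        row_tx.send(y).unwrap();
    }
    drop(row_tx);

    thread::scope(|s| {
        for _ in 0..workers {
            let row_rx = row_rx.clone();
            let res_tx = res_tx.clone();
            s.spawn(move || {
                while let Ok(y) = row_rx.recv() {
                    let row: Vec<u32> = (0..WIDTH)
                        .map(|x| escape_time(pixel_to_complex(x, y), c, max_iter))
                        .collect();
                    res_tx.send((y, row)).unwrap();
                }
            });
        }
        drop(res_tx); // only the workers' clones remain

        let mut matrix = DMatrix::from_element(HEIGHT as usize, WIDTH as usize, 0u32);
        while let Ok((y, row)) = res_rx.recv() {
            for (x, iters) in row.into_iter().enumerate() {
                matrix[(y as usize, x)] = iters;
            }
        }
        matrix
    })
}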

The frame times are, of course, pretty much the same as before.

Image Formats

As you can see in the previous chart, the frame times are now approximately constant at 3.2 seconds. This still felt slow, and I wondered if this was related to the compression used in generating PNG images. I tried JPG and BMP formats, and the difference was huge.

BMP is around 5 times faster than PNG! This makes sense, as BMP is one of the simplest image formats, with no compression at all. With BMP the total time is only 62 seconds (~1 minute)!
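
Assuming the image crate, the format switch is just a change of file extension, since the encoder is picked from it:

// The image crate picks the encoder from the file extension, so switching the
// output from PNG to BMP (uncompressed, much faster to write) is just a rename.
fn save_frame(img: &image::RgbImage, n: u32) {
    img.save(format!("frames/{n:04}.bmp")).unwrap(); // previously "...png"
}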

The drawback is the file size, with each 1080p frame being 5.93 MB (1920 × 1080 pixels at 3 bytes per pixel). A single 1000-frame animation would be about 6 GB. At one point, I filled up my whole disk with frames for the final animations! I was playing with higher resolutions and around 10,000 iterations, but I found that the images don't change much beyond around 1000 iterations, so I reduced it after that.

General Application

An interesting point is that much of the caching and optimisation I implemented worked because of the specific animation I was aiming for: every frame uses the same set of coordinates. If I wanted to zoom in as the animation progresses (like many fractal videos you'll see), these techniques would not be effective. This shows how really understanding your specific problem and managing its scope can give you helpful insights.

More Charts

Here are some charts with all the data:

Final Video

For the final video, I generated animations for a number of parameter variations. I used ffmpeg to stitch the BMP sequences into MP4 videos, then I used Davinci Resolve to edit them together. The animations turned out better than I imagined. Because the colour scale is adjusted to each frame independently, there is a feeling of motion as the colours "follow" the iterations. I'm very happy with the result.

The final video is here, and the source code can be found here.
