Train your own GAN

Arxiv Insights
8 min read · Oct 10, 2020


So, you want to train your own GAN model and create a visual theme that is 100% customized to your needs using your own images? That’s possible, but there are some details to consider. Let’s see how it works!

This blog post is meant as a guide to custom model training for wzrd.ai.

To get a sense of the types of digital audiovisual content that can be created with these models, take a look at:

Creating videos with GANs

The first question people ask us when they want to create their own visual theme is:

“How can I use my own images in a WZRD video?”

The short answer is: you can’t…

The long answer is: it’s tricky, let me explain.

How GANs work

To understand why you can’t simply take your own images and use WZRD’s AI to turn them into a beautiful, morphing video, I need to explain how GANs work and how we use them to create music videos.

A Generative Adversarial Network (GAN) has two main components: the Generator and the Discriminator. Both of those are neural networks, but they each have a very specific job to do:

  • The Generator receives a bunch of (randomly chosen) numbers as input and transforms those numbers into an image. Every time it receives different input numbers, it produces a different image. We call this image ‘a fake (or generated) sample’.
  • The Discriminator receives an image as input and needs to decide if that image is a) a real image from the dataset or b) generated by the Generator. The Discriminator is shown a random mix of real images and samples from the Generator and needs to learn how to tell them apart.

Now, the Generator’s job is to produce samples that are so realistic that the Discriminator can no longer tell the difference between them and the real images from the dataset. That is the goal of the training process.
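To make this tug-of-war a bit more concrete, here is a minimal sketch of the two training objectives from the standard GAN formulation. It is not WZRD's actual training code; the function names and the example Discriminator outputs are made up for illustration. Each number is the Discriminator's estimated probability that an image is real.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: low when real images score high and fakes score low."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: low when fakes are scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical Discriminator outputs (probability that each image is real).
early_real, early_fake = [0.9, 0.8], [0.1, 0.2]  # fakes are easy to spot
late_real, late_fake = [0.5, 0.5], [0.5, 0.5]    # Discriminator can only guess

print("early:", discriminator_loss(early_real, early_fake), generator_loss(early_fake))
print("late: ", discriminator_loss(late_real, late_fake), generator_loss(late_fake))
```

As the Generator improves, its loss drops while the Discriminator's loss rises towards the "pure guessing" level, which is the situation described above.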

Once the GAN is trained, we no longer need the Discriminator, so we throw it away. What we care about is the Generator, since this network can now transform random input numbers into samples that look like images from the dataset: let the magic begin!

On the WZRD dashboard, you can find a bunch of different GAN models that we already trained for you! Just select a visual style, pick your favourite generated images and follow the instructions to turn them into a morphing video.

The magic

The final thing you need to know about GANs is how you can use them to create morphing videos. Let’s see how that works!

Remember that the Generator transforms a bunch of random numbers into an image? Well, if you change those numbers just a tiny, tiny bit, the resulting image will also change a tiny, tiny bit!

So by slowly changing the input numbers, we can slowly change the generated image. Doing this many times per second essentially creates a morphing video.

When you are picking favourites on the WZRD platform, what you are actually doing is picking sets of input numbers that result in beautiful images when sent through the Generator network.

Once you click “render video” on the timeline, our AI slowly changes those input numbers from the first favourite to the next, and so on, creating a morphing video that slowly blends from one favourite image to the next.
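The idea of "slowly changing the input numbers" can be sketched as a simple interpolation between the input vectors of consecutive favourites. This is an illustrative sketch, not WZRD's rendering code: it assumes plain linear interpolation (real pipelines may use other schemes), and `latent_dim`, `morph_path` and the frame count are hypothetical names and values.

```python
import random

def lerp(z_start, z_end, t):
    """Linear blend of two input vectors; t=0 gives z_start, t=1 gives z_end."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]

def morph_path(favourites, frames_per_transition):
    """One input vector per video frame, visiting each favourite in order."""
    path = []
    for z_start, z_end in zip(favourites, favourites[1:]):
        for i in range(frames_per_transition):
            path.append(lerp(z_start, z_end, i / frames_per_transition))
    path.append(list(favourites[-1]))  # end exactly on the last favourite
    return path

rng = random.Random(0)
latent_dim = 8  # hypothetical; real Generators typically use far more numbers
favourites = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(3)]

frames = morph_path(favourites, frames_per_transition=30)
print(len(frames), "input vectors, one per frame")
```

Each vector in `frames` would then be sent through the Generator to produce one frame of the video.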

Why you can’t directly use your own images

If we are slowly changing input numbers to create a morphing video, you might already see the problem with blending between custom images that you provide yourself: we don’t know what input numbers we must feed to the Generator in order to produce those images. And if we don’t know those input numbers, it is impossible to create a morphing video that blends between those given, real images.

In other words, a GAN can only create videos that morph between images it has generated itself, not between any image that we give it!

So then, how can we create custom visual themes?

Luckily, there is a way to create custom visual styles, and that is by uploading loads of real images and having the GAN learn to reproduce that visual theme. Unfortunately, the samples that the Generator will learn to produce will not be identical to the images you have uploaded, but, if the training went well, the generated samples should look similar to the training images.

We can then pick our favourites from the samples that the GAN produces and create a video that smoothly morphs between them!

Below, you can see an example of some real training images (left) and some generated samples (right) from the GAN after training on those images:

Real images (left) vs Generated samples (right) from WZRD’s infinity model
Digital Artpiece created with WZRD’s infinity model

If you’d like to dive deeper into the rabbit hole, you can take a look at my video here where I explain GANs in even more detail, but don’t feel compelled: you know more than enough to continue right here!

All right, I’m starting to get it…

So, what do I need to create my own, custom GAN model?

Training Images

Machine learning models are known to be data hungry and the same applies to GANs. Traditionally, you would need around 50,000 diverse images to train a good generative model, and yes, I know, that is A LOT…

Luckily, using the latest tricks (academic references: paper1, paper2), it is now possible to get almost the same results with far fewer example images. To be specific, we’re now able to train custom models with as few as ~1000 images. There are a few caveats to keep in mind though:

  • Data diversity ⟶ Model diversity, in other words: the more training images you can provide, the more diverse the samples will be that the GAN is able to generate. Small image sets tend to produce models that exhibit low visual change in their outputs.
  • On the other hand, a lot also depends on the inherent diversity of the dataset itself. If all your images are rather similar (e.g. faces), the model has an easier job modelling them, and it can learn to generate new, realistic samples from a fairly limited set of training images. For visual styles that have loads of diversity (landscapes, for example), the model has a much harder job to do and more images are usually needed to get good results.

As a rule of thumb:

  • ~1000 images: just doable, but the model might lack some diversity
  • 1000–2000 images: good
  • 2000–4000 images: great
  • 4k+ images: purrrfect!

To quantify these rules of thumb a bit more rigorously, we can look at something called the “FID” (Fréchet Inception Distance) score of a GAN model. Briefly explained, FID is a statistic that measures the visual difference between the real and generated images of a model. If FID is zero, the generated images are essentially indistinguishable from the real ones, so lower FID is better.
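To give a feel for what such a score computes, here is a deliberately simplified sketch. The real FID compares the mean vectors and covariance matrices of features extracted by an Inception network; the toy version below assumes each image has already been reduced to a single number (a big simplification made only for illustration), so the same Fréchet distance collapses to a formula over two means and two standard deviations. The function name and sample values are hypothetical.

```python
import statistics

def frechet_distance_1d(real_features, generated_features):
    """Squared Fréchet distance between two 1-D Gaussians fitted to the samples.

    In one dimension the formula reduces to
    (mean difference)^2 + (standard deviation difference)^2.
    """
    mu_r = statistics.fmean(real_features)
    mu_g = statistics.fmean(generated_features)
    sigma_r = statistics.pstdev(real_features)
    sigma_g = statistics.pstdev(generated_features)
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2

# Hypothetical one-number "features" for a handful of images.
real = [1.0, 2.0, 3.0]
good_fakes = [1.0, 2.0, 3.0]   # same statistics as the real images
poor_fakes = [2.0, 3.0, 4.0]   # shifted away from the real images

print(frechet_distance_1d(real, good_fakes))
print(frechet_distance_1d(real, poor_fakes))
```

Matching statistics give a distance of zero, and the further the generated statistics drift from the real ones, the larger the score becomes.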

Coupling this metric (vertical axis) with the size of the training data (horizontal axis) you can see why you need as many images as you can get:

More images yield better GAN models: FID scores (vertical axis) of GANs using different amounts of training data (horizontal axis) and different training methods (colors). WZRD’s backend uses the orange method to train custom models.

Computational Resources

Even though a lot of progress has been made, GANs are still very compute-hungry to train. One of the reasons is that GANs essentially involve two neural networks (the Generator and the Discriminator) that both have to be trained simultaneously.

As such, training a good GAN usually requires about 100–200 hours of training time on a single GPU. Training a model for longer generally improves results, so there is a cost-quality trade-off here.

Left: Model performance vs Training time (source) | Right: Model performance vs Training steps and Dataset size (colors) (source)

Cost

Because of the required computational resources, training custom GAN models is expensive. If you look at the GPU pricing table here and consider that the above charts are using V100 GPUs, you can understand why training a custom model will set you back at least $500.
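The arithmetic behind that estimate is simply GPU hours multiplied by the hourly rental price. The sketch below assumes a hypothetical rate of $2.50 per V100 hour, which is not taken from the pricing table mentioned above; actual prices vary by cloud provider and over time.

```python
ASSUMED_V100_PRICE_USD_PER_HOUR = 2.50  # hypothetical rate, for illustration only

def training_cost(gpu_hours, price_per_gpu_hour):
    """Rental cost in USD for a single-GPU training run."""
    return gpu_hours * price_per_gpu_hour

# The 100-200 hour range quoted in the Computational Resources section.
for hours in (100, 200):
    cost = training_cost(hours, ASSUMED_V100_PRICE_USD_PER_HOUR)
    print(f"{hours} GPU hours -> ${cost:.2f}")
```

At this assumed rate, the upper end of the training-time range lands at about $500, before counting any failed runs or experiments along the way.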

Sample quality

Finally, while GANs have seen impressive progress over the past 5 years, they cannot just model any kind of image dataset. There are limits to what these AI models can currently do.

It is a good idea to manually curate the image dataset so that it contains only “good” images, which means:

  • High definition (> 1000 x 1000 pixels per image)
  • The primary object of focus is clear & centered in the image
  • The dataset is not too diverse, as this will simply result in what is called “mode dropping”, where the GAN is not able to model all the diversity of the dataset and simply drops parts of the visual theme from its output.
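The resolution criterion is the easiest one to automate as a first pass before curating by hand. The sketch below assumes you already have each image's width and height (reading them from files would need an image library, which is not shown); the file names and sizes are made up.

```python
MIN_SIDE = 1000  # from the guideline above: more than 1000 x 1000 pixels

def is_high_definition(width, height, min_side=MIN_SIDE):
    """True if both sides of the image exceed the minimum size."""
    return width > min_side and height > min_side

# Hypothetical (filename, width, height) records for candidate training images.
candidates = [
    ("sunset.jpg", 1920, 1080),
    ("thumbnail.png", 256, 256),
    ("portrait.jpg", 1200, 900),
]

kept = [name for name, width, height in candidates if is_high_definition(width, height)]
print(kept)
```

The other two criteria (a clear, centered subject and limited diversity) are judgment calls that still need a human eye.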

Examples

To give an idea of how the WZRD platform can be used, here are some sample images from the training data, plus the generated samples from several models publicly available on the WZRD platform. For every model, a video example is also included that shows how the model can be used to create various styles of digital audio-reactive content.

Real images (left) vs Generated samples (right) of WZRD’s Religion model, trained for 10 days on 15k images.
Real images (left) vs Generated samples (right) of WZRD’s Earth model, trained for 5 days on 5k images.
Real images (left) vs Generated samples (right) of WZRD’s Dots model, based on the amazing artworks of Noj Barker, trained for 4 days on 1k images.
