Image classification of Calvino's artworks using Convolutional Neural Networks


Image classification with convolutional neural networks was used to assess the artistic portfolio of Martin Calvino, distinguishing artworks derived from hand gestures from artworks created algorithmically. The machine learning approach predicted the nature of an artwork with about 90% accuracy. This work could serve as inspiration for cataloguing images from art collections at museums and art organizations.


Artificial intelligence (AI) is an area of study focusing on the creation of machines that mimic intelligent behaviors characteristically associated with humans [1]. Machine learning (ML) is a subfield of AI that focuses on teaching computers how to learn without needing to be explicitly programmed for specific tasks. In ML it is possible to create algorithms that learn from data and make predictions based on it [1]. ML thus brought about a paradigm shift in computing: in classical programming, humans input rules (a program) and data to be processed by those rules, and out come answers; with ML, humans input data together with the answers expected from the data, and out come the rules. These rules can subsequently be applied to new data to obtain novel answers [2]. Thus, a machine-learning system is trained rather than explicitly programmed: it is exposed to many examples pertinent to a given task, and it finds statistical structure within these examples that eventually allows the system to 'elaborate' its own rules for automating the task [2].

Deep learning (DL) is in turn a subfield of ML concerned with the use of artificial neural networks (ANN). The term 'deep' refers to the presence of many 'layers' in these networks, which 'learn' increasingly meaningful representations from data [1, 2]. In this sense, DL can be considered a 'multistage-data-distillation operation' in which data goes through successive filters and comes out increasingly 'purified'; in other words, it is a multistage procedure for learning more meaningful data representations [2].

How does the learning happen in deep learning? ML consists of mapping inputs (images of artworks in the example presented below) to targets (image labels such as 'hand drawings/paintings' and 'art created with algorithms'), which is achieved by exposing the algorithm to a large number of examples of inputs and targets. DL in particular implements this input-to-target mapping through a deep sequence of simple data transformations (layers), and these transformations are learned by exposure to inputs and targets. The transformation a layer applies to its input data is stored in the layer's weights, which are essentially just numbers. In this view, learning means finding a set of values for the weights of all layers in a network such that the network correctly maps example inputs to targets [2]. To control the output of a neural network, the distance between the output and the expected result needs to be measured. This is the role of the loss function, which compares the prediction of the network to the true target (what you expected the network to output) and computes a loss score that captures how well the network has done on this particular example (Figure 1) [2]. This score is then used as a feedback signal to adjust the values of the weights in a direction that lowers the loss score for the current example. This adjustment is performed by the optimizer, which implements the so-called backpropagation algorithm, a fundamental algorithm in deep learning (Figure 1) [2].

Figure 1. Diagram displaying the overall organization of a deep learning pipeline. Diagram was taken from [2].

At the beginning, the weights of the network are assigned random values, so the network merely implements random data transformations. Consequently, its output is far from the expected value and the loss score is very high. As the weights are adjusted in small increments with each example the network processes, the loss score decreases. This is the training loop, which, repeated a sufficient number of times, eventually yields weight values that minimize the loss function. A network with minimal loss is one whose output predictions closely approach the true target values: a trained network [2]. A training loop consists of the following steps, repeated as long as necessary:

1- draw a batch of training samples 'X' and corresponding targets 'Y'

2- run the network on 'X' (what's called a forward pass) to obtain predictions 'Y_pred'

3- compute the loss of the network on the batch, which measures the mismatch between 'Y_pred' and 'Y'

4- update all the weights in the network in a manner that gradually reduces the loss on the batch

After many iterations of the training loop (one complete pass over all the training data is called an epoch), the network eventually arrives at a very low loss score on its training data; that is, a small mismatch between predictions 'Y_pred' and expected targets 'Y'. When this state is reached, the network has 'learned' to map its inputs to correct targets [2].
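The four steps of the training loop can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code used in this project: it assumes a hypothetical one-dimensional toy dataset, a "network" made of a single weight 'w' and bias 'b' followed by a sigmoid, and hand-picked values for the learning rate, batch size and number of epochs.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, xs, ys, eps=1e-12):
    """Mean binary cross-entropy between targets ys and the predictions on xs."""
    losses = []
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        losses.append(-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)))
    return sum(losses) / len(losses)


def accuracy(w, b, xs, ys):
    hits = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


# Hypothetical toy data: 64 evenly spaced points in [-1, 1]; the target is 1 when x > 0.
X = [-1.0 + 2.0 * i / 63 for i in range(64)]
Y = [1 if x > 0 else 0 for x in X]

rng = random.Random(0)
w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # weights start with random values
initial_loss = mean_loss(w, b, X, Y)

learning_rate, batch_size, epochs = 0.5, 8, 50  # assumed values for illustration
indices = list(range(len(X)))
for epoch in range(epochs):
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        # 1- draw a batch of training samples and corresponding targets
        batch = indices[start:start + batch_size]
        xb = [X[i] for i in batch]
        yb = [Y[i] for i in batch]
        # 2- forward pass: run the model on the batch to obtain predictions
        y_pred = [sigmoid(w * x + b) for x in xb]
        # 3- compute the loss of the model on the batch
        batch_loss = mean_loss(w, b, xb, yb)
        # 4- update the weights against the gradient of the batch loss
        grad_w = sum((p - y) * x for p, y, x in zip(y_pred, yb, xb)) / len(xb)
        grad_b = sum(p - y for p, y in zip(y_pred, yb)) / len(xb)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mean_loss(w, b, X, Y)
print(f"loss before training: {initial_loss:.3f}, after training: {final_loss:.3f}")
print(f"accuracy after training: {accuracy(w, b, X, Y):.3f}")
```

In a real deep learning library the gradient in step 4 is not written by hand as it is here; it is obtained automatically through backpropagation across all layers.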

Layers of a neural network perform data transformations through tensor operations. For instance, vector data are stored in 2D tensors of shape (samples, features) and are processed by densely connected layers. Sequence data are stored in 3D tensors of shape (samples, time steps, features) and are processed by recurrent layers such as LSTM layers. Image data are stored in 4D tensors of shape (samples, height, width, color depth) and are usually processed by 2D convolution layers. Video data are stored in 5D tensors of shape (samples, frames, height, width, color depth) [2].
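To make these shapes concrete, the short sketch below lists one example shape per data type and reports its rank (number of axes) and the total number of values it holds. The sample counts and feature sizes are illustrative assumptions; only the 256 x 256 image size matches the images used later in this work.

```python
import math

# Example tensor shapes; sample counts and feature sizes are illustrative only.
EXAMPLE_SHAPES = {
    "vector data (samples, features)": (500, 20),
    "sequence data (samples, time steps, features)": (500, 100, 20),
    "image data (samples, height, width, color depth)": (32, 256, 256, 3),
    "video data (samples, frames, height, width, color depth)": (4, 30, 256, 256, 3),
}


def describe(shape):
    """Return the rank (number of axes) and the total number of values of a tensor."""
    return len(shape), math.prod(shape)


for name, shape in EXAMPLE_SHAPES.items():
    rank, size = describe(shape)
    print(f"{name}: shape {shape}, rank {rank}, {size:,} values")
```

A single batch of 32 RGB images of 256 x 256 pixels already holds more than six million values, which is one reason images are fed to the network in batches rather than all at once.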

In this work, deep learning models for image classification were built using the Keras library, which is implemented in the Python programming language [2]. Models are built by clipping together compatible layers to form useful data transformation pipelines [2]. The Keras workflow consists of the following steps:

1- define training data: input and target tensors

2- define a network of layers (or model) that maps input to targets

3- configure the learning process by selecting the loss function, optimizer, and metrics to evaluate

4- iterate on training data by calling the fit() method of the model defined in step 2

The most popular network architecture is a linear stack of layers, which in Keras is defined by the Sequential class. Once the network architecture has been defined, a loss function and an optimizer need to be chosen. For a two-class image classification problem, the suggested loss function is binary crossentropy and the optimizer is either RMSprop or SGD [2, 3, 4, 5].
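For intuition about what binary crossentropy measures, here is a minimal plain-Python sketch of the formula. It is an illustration rather than the Keras implementation; the clipping constant 'eps' is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


targets = [1, 0]
print(binary_crossentropy(targets, [0.9, 0.1]))  # confident and correct: small loss
print(binary_crossentropy(targets, [0.5, 0.5]))  # undecided: loss equals log(2)
print(binary_crossentropy(targets, [0.1, 0.9]))  # confident and wrong: large loss
```

The loss grows sharply when the network is confident and wrong, which is what pushes the weights toward outputs near 0 for one class and near 1 for the other.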

Image classification is considered an instance of supervised-learning, a branch of machine learning that entails the learning of the relationship between training inputs and training targets [2].


My objective in the work presented here was to apply convolutional neural networks (convnets) to the following image classification problem: can a deep learning algorithm correctly distinguish images of artworks created by hand (drawings and paintings) from images of artworks created using computer code? This question is interesting because I have created both types of artworks using different mediums (Figure 2). Even though the creative impulse behind both types of artworks may not differ much, their implementation is radically different, which raises the question of whether their visual output can be told apart as clearly by machines as by humans.

Figure 2a. Examples of Calvino's artworks created by drawing with the artist's hands, used to assemble the image dataset. Some artworks are digital in nature, but they were created with hand gestures (as opposed to writing computer code), by drawing with a digital pen on a Wacom tablet or with fingers on an iPad screen. Thus the defining characteristic of this dataset is that lines were created with the artist's hand gestures.

Figure 2b. Examples of Calvino's algorithmic artworks used to assemble image dataset. The defining characteristic of this dataset is that all artworks were created by implementing computer code (as opposed to drawing using hand gestures).


Because convolutional neural networks emphasize local patterns in images, and in order to increase the number of images in the dataset used for classification, I took 25 images similar to those shown in Figure 2a and 25 images similar to those shown in Figure 2b. Using a script written in Processing, I cropped 10 squares of 256 x 256 pixels from each image, with the squares taken from random locations within the image (Figure 3). In this manner I was able to assemble 500 images in total for the training dataset (250 images derived from hand-gesture artworks like those shown in Figure 2a, and 250 images derived from computer-code artworks like those shown in Figure 2b). A similar approach was used to gather 240 images of 256 x 256 pixels each for the test dataset (120 images for hand-gesture artworks and 120 images for computer-code artworks).

Figure 3. Sub-sampled images of 256 x 256 pixels taken from two artworks shown in Figures 2a and 2b, respectively. A total of 500 images like these comprised the training set, whereas 240 images comprised the test set.
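The original sub-sampling script was written in Processing and is not reproduced here. The Python sketch below only illustrates the idea of choosing random 256 x 256 squares inside an image; the function name, the 1024 x 768 image size and the fact that squares are allowed to overlap are all assumptions made for the example.

```python
import random

TILE = 256             # side of each square, in pixels
TILES_PER_IMAGE = 10   # squares taken from each source image


def random_tile_boxes(width, height, n=TILES_PER_IMAGE, tile=TILE, rng=None):
    """Return n (left, top, right, bottom) boxes placed at random inside an image."""
    if width < tile or height < tile:
        raise ValueError("image is smaller than the tile size")
    rng = rng or random.Random()
    boxes = []
    for _ in range(n):
        left = rng.randint(0, width - tile)   # randint includes both end points
        top = rng.randint(0, height - tile)
        boxes.append((left, top, left + tile, top + tile))
    return boxes


# Hypothetical 1024 x 768 source image; a seeded generator makes the run repeatable.
boxes = random_tile_boxes(1024, 768, rng=random.Random(1))
for box in boxes:
    print(box)

# 25 source images per class, 10 squares each, 2 classes
print(25 * TILES_PER_IMAGE * 2, "training images in total")
```

Each box uses the (left, top, right, bottom) convention accepted by the crop function of common imaging libraries such as Pillow, so the coordinates could be passed on directly to produce the image files.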

The directory structure in which the image data was laid out for modeling is shown in Figure 4:

Figure 4. Image dataset directory and file structure used by the author in this project. Under each of the test and train directories, the author placed one sub-directory for each of the two classes (hand-gesture-derived artworks [herein 'han'] and algorithmic artworks [herein 'kom']), to which the actual image files were allocated. Note that the han/ and kom/ directories do not contain the same image files, but rather different images of hand-gesture-derived artworks and algorithmic artworks respectively. Similarly, the train and test datasets contain different image files.


Images were loaded into Keras in batches using the ImageDataGenerator class [2, 3]. This class also converts images into pixel arrays that serve as input to the network. Additional capabilities of the ImageDataGenerator class include automatically scaling the pixel values of images and generating augmented versions of images (see data augmentation below). The ImageDataGenerator class is used as follows:

> Construct and configure an instance of the ImageDataGenerator class

> Retrieve an iterator by calling the flow_from_directory() function

> Use the iterator in training and evaluation of model

Subdirectories of images, one for each class, are loaded by the flow_from_directory() function in alphabetical order, with an integer assigned to each class. For instance, the subdirectory 'han' comes before 'kom' and thus the class labels are assigned as han=0 and kom=1. The same ImageDataGenerator class can be used to prepare iterators for separate dataset directories such as train/ and test/ respectively. Iterators are then used when fitting and evaluating the model by calling the fit_generator() function on the model and passing training and testing iterators (train_it and test_it respectively). Once the model is fit, it can be evaluated on a test dataset using the evaluate_generator() function and passing in the test iterator.
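The alphabetical labelling rule described above can be mimicked in plain Python. This sketch does not call Keras; the function names are hypothetical, and the 0.5 threshold used to turn a sigmoid output into a class name is an assumed convention for a two-class problem.

```python
def class_indices(subdirectories):
    """Map class sub-directory names to integer labels in alphabetical order."""
    return {name: index for index, name in enumerate(sorted(subdirectories))}


def decode(probability, indices, threshold=0.5):
    """Turn a single sigmoid output into a class name (threshold is assumed)."""
    label = 1 if probability >= threshold else 0
    names = {index: name for name, index in indices.items()}
    return names[label]


indices = class_indices(["kom", "han"])
print(indices)                # {'han': 0, 'kom': 1}
print(decode(0.83, indices))  # kom
print(decode(0.12, indices))  # han
```

Knowing which class received label 1 matters when interpreting predictions: with this ordering, an output close to 1 means the network considers the image an algorithmic ('kom') artwork.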


The procedure to create, train, and evaluate a convolutional neural network (or any other neural network) in Keras followed a series of steps [3] that consisted of:

> Defining network

> Compiling network

> Fitting network

> Evaluating network

> Making predictions

The complete code used by the author is shown in Figure 5:

Figure 5. CNN (VGG) code implementation using Keras library in Python. Code was taken from [3] and adapted by Martin Calvino.
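As a rough guide to the size of a one-block VGG-style network, the sketch below traces tensor shapes and counts trainable parameters using plain arithmetic. Apart from the 256 x 256 input size, every number here is an assumption made for illustration (RGB input, one 3 x 3 convolution with 32 filters and 'same' padding, 2 x 2 max pooling, a dense layer of 128 units, and a single sigmoid output); the values used in Figure 5 may differ.

```python
def one_block_vgg_summary(height=256, width=256, channels=3,
                          filters=32, kernel=3, dense_units=128):
    """Shape and parameter arithmetic for an assumed one-block VGG-style network."""
    # Convolution: one kernel x kernel x channels filter plus one bias per output filter.
    conv_params = (kernel * kernel * channels + 1) * filters
    # 'same' padding keeps height and width; 2 x 2 max pooling halves both.
    pooled_h, pooled_w = height // 2, width // 2
    # Flattening turns the pooled feature maps into one long vector.
    flat_features = pooled_h * pooled_w * filters
    # Dense layers: one weight per input plus one bias, for each unit.
    dense_params = (flat_features + 1) * dense_units
    output_params = dense_units + 1
    return {
        "conv_params": conv_params,
        "pooled_shape": (pooled_h, pooled_w, filters),
        "flat_features": flat_features,
        "dense_params": dense_params,
        "output_params": output_params,
        "total_params": conv_params + dense_params + output_params,
    }


for key, value in one_block_vgg_summary().items():
    print(f"{key}: {value}")
```

Under these assumptions almost all of the parameters sit in the first dense layer rather than in the convolution, which helps explain why such a model can overfit a training set of only 500 images.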

The architecture of the convolutional neural network shown in Figure 5 is a variant of VGG, specifically a one-block VGG [3]. The VGG design was first described by Karen Simonyan and Andrew Zisserman in 2015 [6], and one of its key characteristics is that the number of filters increases with the depth of the model. Because this architecture has proven very effective at extracting features from images, the author decided to use it for this project. When implementing the code shown in Figure 5, the author experimented with hyperparameters such as batch size (32 and 64), data augmentation (rotating images from the training dataset), and shuffling images from the training and test datasets as they were progressively loaded into the network. The results are shown below:

Figure 6. Cross entropy loss and classification accuracy for a one-block VGG-CNN run for 20 epochs with batch_size=32. On the y-axis are cross entropy loss and classification accuracy respectively, whereas on the x-axis are number of epochs. Values for the train dataset are shown in blue whereas values for the test dataset are shown in orange. The accuracy of the model for image classification is 76.250%.

From epoch 6 onward in Figure 6, classification accuracy for the training dataset continues to increase while accuracy for the test dataset fluctuates without overall improvement. This is an indication of overfitting, in which the model performs better on the training data than on the test data. To address overfitting, the author implemented two approaches: dropout regularization (Figure 7) and data augmentation (Figures 8, 9 and 10).

Dropout is a technique that consists of randomly dropping nodes out of the network during training. It has a regularizing effect because the remaining nodes need to adapt in order to pick up the slack of the removed nodes [3]. Dropout can be added to the model by inserting Dropout layers, with the fraction of nodes to be removed specified as a parameter (20% and 50%, as shown in Figure 5). For this project, the author added Dropout layers after a max pooling layer and after the fully connected layer (Figure 5). The improvement in cross entropy loss and classification accuracy resulting from dropout regularization is shown in Figure 7:

Figure 7. Cross entropy loss and classification accuracy for a one-block VGG-CNN run for 20 epochs with batch_size=32 and dropout regularization. On the y-axis are cross entropy loss and classification accuracy respectively, whereas on the x-axis are number of epochs. Values for the train dataset are shown in blue whereas values for the test dataset are shown in orange. The accuracy of the model for image classification is 81.667%.
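The dropout mechanism described above can be sketched in plain Python as follows. This is an illustration of the commonly used 'inverted dropout' formulation, not the Keras source code: each value is set to zero with probability 'rate' during training, and the surviving values are scaled by 1 / (1 - rate) so that their expected sum is unchanged. The function name and the example activations are hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout on a list of activations.

    During training each value is zeroed with probability `rate` and the
    survivors are divided by (1 - rate); at evaluation time nothing changes.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0] * 1000  # hypothetical layer output

dropped = dropout(activations, rate=0.5, rng=rng)
print("zeroed during training:", dropped.count(0.0), "of", len(dropped))
print("unchanged at evaluation:", dropout(activations, 0.5, rng, training=False) == activations)
```

Because nodes are only dropped during training, the network is evaluated on the test dataset with all of its nodes active.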

Although classification accuracy for the test dataset improved from 76.25% to 81.67% as a result of dropout regularization, overfitting is still evident from epoch 8 onward, where accuracy keeps improving for the training dataset but not for the test dataset. In a second attempt to reduce overfitting, the author tried a model that combined dropout regularization with data augmentation. Data augmentation is a technique that artificially expands the training dataset by creating modified versions of its images (rotated images in this particular case). Augmentation creates variations of images that can improve the ability of the fitted model to generalize what it has learned to new images [3]. The augmentation used for this project consisted of rotating the images by up to 25 degrees (Figure 8 and Figure 9):

Figure 8. Variations derived from rotating a single image from the training dataset of hand-gesture artworks. Rotation was used as a data augmentation technique to reduce overfitting and improve model performance. Code taken from [3]. Data augmentation was applied only to the training dataset and not to the test dataset.
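The kind of rotation illustrated in Figure 8 can be sketched on a tiny grid of pixel values in plain Python. This is a simplified nearest-neighbour rotation written for illustration only, and it is not the code from [3]: the function name, the fill value for pixels that rotate in from outside the image, and the assumption that angles are drawn uniformly between -25 and +25 degrees are all choices made for this example.

```python
import math
import random


def rotate_nearest(image, angle_deg, fill=0):
    """Rotate a 2D grid of pixel values about its centre (nearest neighbour)."""
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2, (width - 1) / 2
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Inverse mapping: find the source pixel that lands on (x, y).
            dx, dy = x - cx, y - cy
            src_x = round(cos_t * dx + sin_t * dy + cx)
            src_y = round(-sin_t * dx + cos_t * dy + cy)
            if 0 <= src_x < width and 0 <= src_y < height:
                out[y][x] = image[src_y][src_x]
    return out


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
print(rotate_nearest(grid, 90))  # a quarter turn, used here only as an easy check

# Augmentation: a new random angle within the allowed range for every pass.
rng = random.Random(0)
angle = rng.uniform(-25, 25)
augmented = rotate_nearest(grid, angle)
print(f"rotated by {angle:.1f} degrees")
```

Because a fresh angle is drawn each time an image is loaded, the network rarely sees exactly the same training image twice, while the test images are left untouched.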

Figure 9. Cross entropy loss and classification accuracy for a one-block VGG-CNN run for 20 epochs with batch_size=32 and dropout regularization + data augmentation (image rotation). On the y-axis are cross entropy loss and classification accuracy respectively, whereas on the x-axis are number of epochs. Values for the train dataset are shown in blue whereas values for the test dataset are shown in orange. The accuracy of the model for image classification is 82.251%.

From Figure 9 it can be seen that the inclusion of data augmentation (image rotation) resulted in classification accuracy for the test dataset that was higher than for the training dataset, a situation the author treated as underfitting. Part of this gap is expected, because dropout and augmentation are applied only during training and not during evaluation. It could also mean that the learning process did not benefit from rotating images. In order to reduce underfitting, the author used different batch sizes for the training and test datasets. Because the training dataset has roughly twice as many images as the test dataset, a batch size of 64 was assigned to the training dataset whereas a batch size of 32 was assigned to the test dataset. Bigger batches have been reported to improve model efficiency by reducing the loss error [7]. By increasing the batch size for the training dataset, underfitting was effectively reduced (Figure 10) and the model achieved a classification accuracy of 90.417% by the end of epoch 20.

Figure 10. Cross entropy loss and classification accuracy for a one-block VGG-CNN run for 20 epochs with batch_size=64 for training dataset and batch_size=32 for test dataset and dropout regularization + data augmentation (image rotation). On the y-axis are cross entropy loss and classification accuracy respectively, whereas on the x-axis are number of epochs. Values for the train dataset are shown in blue whereas values for the test dataset are shown in orange. The accuracy of the model for image classification is 90.417%.
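To see what the batch sizes discussed above mean in practice, the following sketch computes how many batches make up one epoch for the dataset sizes reported in this work (500 training and 240 test images). The function names are hypothetical, and the sketch assumes that a final, smaller batch is used when the dataset size is not a multiple of the batch size.

```python
import math


def batches_per_epoch(n_images, batch_size):
    """Number of batches needed to see every image once (the last may be smaller)."""
    return math.ceil(n_images / batch_size)


def last_batch_size(n_images, batch_size):
    """Size of the final batch in an epoch."""
    remainder = n_images % batch_size
    return remainder if remainder else batch_size


for name, n_images, batch_size in [("train", 500, 32), ("train", 500, 64), ("test", 240, 32)]:
    print(f"{name}: {n_images} images, batch_size={batch_size} -> "
          f"{batches_per_epoch(n_images, batch_size)} batches per epoch, "
          f"last batch of {last_batch_size(n_images, batch_size)}")
```

Since the weights are updated once per batch, moving the training batch size from 32 to 64 halves the number of weight updates per epoch (from 16 to 8), with each update averaged over more images.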


A final mo