Deep Learning Model for GUI Classification

This essay describes a project born from an interest in experimenting with deep learning techniques and their potential application to the design of digital products. Although, as will be seen, it starts from a simplified view of the product design process, it hints at the possibilities these kinds of programs open up when applied to design.

Many thanks to Carlos Antonio Diez for feedback and review.

View the model on GitHub

Introduction

In a context marked by technology and the increasing automation of services, the question inevitably arises in our field, as in practically all others: can tasks such as optimizing and improving product design be automated, or at least supported by an automated process that enhances work cycles? We believe AI can be a valuable ally in this quest, especially in the evaluation of GUIs (Graphical User Interfaces), which would not only streamline work processes but also deliver a higher-quality final product.

Here we share our exploration in building a deep learning model that distinguishes between “good” and “bad” quality GUIs of mobile applications. Starting from this simplified, boolean framing allowed us to move through the process quickly without pausing to overanalyze the design details of each product.

Data collection

To train our model, we collected 2,674 representative examples of GUIs, classified as “good” or “bad” in a 50/50 split. We did this by building a crawler that downloaded images of complete flows from sites such as Mobbin and UXArchive, which document complete application flows from different companies and products, making them an ideal source for this type of data.

We classified as “good” any flow belonging to mobile applications from well-known companies and products that are known to invest heavily in their design teams, with visibly strong results: companies like Netflix, Coinbase, Airbnb, and Uber. This choice gave us a clear benchmark for design quality, setting the bar on criteria we considered essential, such as usability, visual consistency, readability, and overall aesthetics.

Conversely, we classified as “bad” examples from lesser-known applications or smaller companies that, we assume, may lack teams large enough to achieve results of the same quality as those classified as “good.” This was not intended to be derogatory toward their design teams or their efforts; our goal was not to belittle but to identify where improvements could be made.
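For illustration, below is a minimal sketch of the kind of downloader such a crawler might use. The page structure of Mobbin and UXArchive is not described here, so the URL, CSS selector, and file layout are hypothetical placeholders.

```python
# Hypothetical downloader sketch: the selector and URL scheme below are
# placeholders, not the actual structure of Mobbin or UXArchive.
import os
import requests
from bs4 import BeautifulSoup

def download_flow_images(page_url: str, label: str, out_dir: str = "data") -> None:
    """Save every screenshot found on a flow page under data/<label>/."""
    os.makedirs(os.path.join(out_dir, label), exist_ok=True)
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for i, img in enumerate(soup.select("img")):  # selector is site-specific
        src = img.get("src", "")
        if not src.startswith("http"):
            continue
        with open(os.path.join(out_dir, label, f"{i:04d}.png"), "wb") as f:
            f.write(requests.get(src, timeout=30).content)

# Flows from well-resourced products go under "good", the rest under "bad":
# download_flow_images("https://example.com/some-flow", label="good")
```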

Model preparation and training

Once the examples were collected, they were prepared for use with convolutional neural networks, specifically Google’s Inception V3 model. Unlike classical fully connected networks, convolutional networks process two-dimensional data and analyze local regions, identifying patterns that are then combined into a holistic understanding of the image.
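As a reference, a transfer-learning setup of this kind could look like the following sketch, assuming Keras. The classification head (dense layer sizes, dropout rate, optimizer settings) is an illustrative choice, not necessarily the one used in the project.

```python
# Minimal transfer-learning sketch on top of Inception V3 (Keras).
# The head below is an illustrative assumption, not the project's exact one.
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                   input_shape=(299, 299, 3))
base.trainable = False  # keep the pretrained convolutional features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(GUI is "good")
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

Freezing the pretrained base keeps training fast and reduces overfitting on a dataset of only ~2,700 images.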

The collected images were loaded using Python, and a random 80% of both “good” and “bad” images was assigned to the model’s training set, reserving the remaining 20% of each category for the test set.
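A stratified split like this can be reproduced with scikit-learn; the arrays below are placeholders standing in for the loaded screenshots and their labels.

```python
# Sketch of the 80/20 split; X and y are placeholders for the real data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((2674, 299, 299, 3), dtype="float32")  # loaded screenshots
y = np.array([1] * 1337 + [0] * 1337)               # 1 = "good", 0 = "bad"

# stratify=y preserves the 50/50 class balance in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```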

After this split, the images were converted into a format compatible with Inception V3’s expected inputs, and data augmentation was applied to the training set to obtain a larger and more diverse set of GUI examples.
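Assuming Keras utilities, this step could look like the sketch below; Inception V3 expects 299x299 inputs scaled to [-1, 1], and the augmentation ranges are illustrative.

```python
# Sketch of preprocessing + augmentation; the ranges are illustrative.
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,  # scales pixels to [-1, 1]
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
)
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

# X_train/y_train come from the split shown earlier. Note the absence of
# horizontal flips: mirrored GUIs would be unrealistic training examples.
train_flow = train_datagen.flow(X_train, y_train, batch_size=32)
```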

Once the data was ready, the model was trained using scikit-learn’s StratifiedKFold to generate stratified folds for cross-validation, aiming for more robust predictions. After the initial training, the learning rate and classification threshold hyperparameters were adjusted to enhance the model’s predictive ability, and a new round of predictions was made.
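A condensed version of such a cross-validation loop might look as follows; build_model() stands for the Inception V3 setup sketched earlier, and the fold count and epochs are illustrative.

```python
# Cross-validation sketch with scikit-learn's StratifiedKFold.
# build_model() is a placeholder for the Inception V3 setup shown earlier.
import numpy as np
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
val_accuracies = []

for train_idx, val_idx in kfold.split(X_train, y_train):
    model = build_model()  # fresh weights for every fold
    model.fit(X_train[train_idx], y_train[train_idx],
              validation_data=(X_train[val_idx], y_train[val_idx]),
              epochs=10, batch_size=32, verbose=0)
    _, acc = model.evaluate(X_train[val_idx], y_train[val_idx], verbose=0)
    val_accuracies.append(acc)

print(f"Mean cross-validation accuracy: {np.mean(val_accuracies):.3f}")
```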

Results

The initial training produced encouraging results, with most performance metrics (see table) exceeding 60% and showing similar values across the training, cross-validation, and test sets. This reasonably suggests neither overfitting nor underfitting, although there was a gap of more than 10 points between the Recall of the cross-validation set and that of the test set. Since evaluation metrics should generally stay close across sets, this was something we specifically set out to improve.

Subsequently, after optimizing the classification threshold and learning rate, another test was conducted, yielding slightly better metrics overall (see Optimized test in the table). Notably, Recall and the F1 score increased considerably, indicating fewer good designs incorrectly classified as “bad”, as can be observed in Figure 2.
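As an illustration of the threshold adjustment, one common approach is to sweep candidate thresholds over the predicted probabilities and keep the one that maximizes the F1 score; the variable names below are placeholders, and in practice the sweep should run on validation data rather than the test set.

```python
# Illustrative threshold sweep: pick the cut-off that maximizes F1.
# In practice, tune on validation data to avoid leaking the test set.
import numpy as np
from sklearn.metrics import f1_score

probs = model.predict(X_test).ravel()  # sigmoid outputs in [0, 1]

thresholds = np.linspace(0.1, 0.9, 81)
f1_scores = [f1_score(y_test, probs >= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1_scores))]

print(f"Best threshold: {best_threshold:.2f}, F1: {max(f1_scores):.3f}")
y_pred = (probs >= best_threshold).astype(int)
```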

|                  | Accuracy | Precision | Recall | F1 Score | AUC-ROC | AUC-PR |
|------------------|----------|-----------|--------|----------|---------|--------|
| Training         | 0.693    | 0.715     | 0.642  | 0.677    | 0.768   | 0.754  |
| Cross validation | 0.704    | 0.735     | 0.639  | 0.684    | 0.768   | 0.764  |
| Test             | 0.658    | 0.724     | 0.515  | 0.602    | 0.776   | 0.716  |
| Optimized test   | 0.713    | 0.704     | 0.737  | 0.720    | 0.776   | 0.716  |
Figure 1
Figure 2

Conclusion

With the model created and trained, we can conclude that it is possible to detect and classify GUIs in this way. As a preliminary model, based strictly on visual aspects, it sets the stage for more complex developments, such as:

  1. Models that can access a GUI’s code or its previous versions, to gain a comprehensive view of how optimization could be achieved.
  2. API access points that give designers feedback on the work they have done, helping them make decisions for the creation and maintenance of products (a minimal sketch of such an endpoint follows this list).
  3. A cloud deployment of the model, giving it a dynamic lifecycle: both an entry point for users and access to a growing source of data for refinement and evolution.
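As a sketch of what points 2 and 3 could look like, here is a hypothetical inference endpoint built with FastAPI; the model filename, route, and response fields are assumptions, not part of the project as described.

```python
# Hypothetical inference endpoint (FastAPI); filename and route are assumed.
import io

import numpy as np
import tensorflow as tf
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from tensorflow.keras.applications.inception_v3 import preprocess_input

app = FastAPI()
model = tf.keras.models.load_model("gui_classifier.h5")  # assumed filename

@app.post("/classify")
async def classify(file: UploadFile = File(...)):
    """Return a good/bad verdict for an uploaded GUI screenshot."""
    raw = await file.read()
    img = Image.open(io.BytesIO(raw)).convert("RGB").resize((299, 299))
    batch = preprocess_input(np.array(img, dtype="float32"))[None, ...]
    prob = float(model.predict(batch)[0][0])
    return {"label": "good" if prob >= 0.5 else "bad", "probability": prob}
```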

Whatever its final form, it is evident that artificial intelligence is here to facilitate the tasks we perform daily and to change paradigms across the sectors where it can be applied, allowing users to devote more time to creative and intellectual work while delegating simpler, tedious tasks to this new technology.

We are excited by the idea that, building on the developments mentioned above, evolution and change in all areas will be exponential and, in our optimistic view, will work to everyone’s benefit.