Authors:
- Ángeles Soto Pérez
- Salvador Corts Sánchez
Hyperparameter optimization of Keras fully-connected neural networks using Evolutionary Algorithms.
Hyperparameter optimization of Neural Networks (NN from now on) is often a challenging task that requires a lot of expertise and trial and error. Grid-Search is the most common method to tune these parameters: it consists of configuring the NN with every possible combination of a given (predefined) set of parameter values.
The main problem of Grid-Search is that it is very difficult to explore the whole search space of these parameters, so it is limited to a set of "promising" values. As a solution to this problem, we assess the use of evolutionary algorithms as a tool to optimize the values of these hyperparameters.
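To make the size of the problem concrete, the following sketch counts the configurations Grid-Search would have to train for a small grid. The value lists are illustrative assumptions, not the grid used in this project.

```python
from itertools import product

# Hypothetical search grid; every value list below is an assumption
# chosen only to illustrate how fast the number of combinations grows.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "optimizer": ["Adam", "SGD", "RMSprop"],
    "activation": ["Relu", "Sigmoid", "Softmax", "Tanh"],
    "layers": [(32,), (64,), (32, 32), (64, 64)],
    "dropout": [False, True],
}

# Grid-Search trains one network per element of the Cartesian product.
configurations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configurations))  # 3 * 3 * 4 * 4 * 2 = 288
```

Even with only a handful of candidate values per parameter, this grid already requires 288 training runs, and adding one more value to any list multiplies the total.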
An Evolutionary Algorithm (EA) uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions. Evolution of the population then takes place after the repeated application of the above operators.
In this case we apply a simple generational model. Simply put, it generates n offspring from a population of size n and replaces the population with them. The offspring are generated by repeatedly selecting 2 individuals from the population and applying a crossover method to them until n offspring have been generated. The newly generated offspring are then optionally mutated before replacing the original population.
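The generational step described above can be sketched as follows. This is a minimal illustration, not the project's Go implementation: the function name `next_generation` and the toy operators in the usage part are assumptions.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def next_generation(
    population: List[T],
    select: Callable[[List[T]], T],
    crossover: Callable[[T, T], List[T]],
    mutate: Callable[[T], T],
) -> List[T]:
    """Build n offspring from a population of size n and return them."""
    n = len(population)
    offspring: List[T] = []
    while len(offspring) < n:
        parent_a, parent_b = select(population), select(population)
        offspring.extend(crossover(parent_a, parent_b))
    # Crossover may overshoot by one child, so trim to n before mutating.
    return [mutate(child) for child in offspring[:n]]


# Toy usage (assumption): individuals are floats, smaller |x| is fitter.
rng = random.Random(0)
population = [rng.uniform(-10, 10) for _ in range(6)]
new_population = next_generation(
    population,
    select=lambda pop: min(rng.sample(pop, 2), key=abs),
    crossover=lambda a, b: [(a + b) / 2, (a - b) / 2],
    mutate=lambda x: x + rng.uniform(-0.1, 0.1),
)
print(len(new_population))  # 6
```

The returned list fully replaces the old population, which is what distinguishes a generational model from a steady-state one.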
In our particular problem, each individual represents a given set of hyperparameters consisting of:
- Model ID: Unique identifier of the individual (aka model).
- Learning Rate: Also known as LR.
- Optimizer: either Adam, SGD or RMSprop.
- Activation Function: either Relu, Sigmoid, Softmax or Tanh.
- Layers: A list where each element contains the number of neurons of the corresponding layer. For example, [1, 2, 3] means that our NN will have three hidden layers with 1, 2 and 3 neurons respectively.
- Dropout: Whether or not to apply a 25% dropout after each layer.
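As an illustration, an individual with the fields listed above could be represented like this. The class name, the helper `random_individual`, and all sampling ranges are assumptions made for the sketch; the actual representation lives in the Go server and the gRPC messages.

```python
import random
import uuid
from dataclasses import dataclass, field
from typing import List

OPTIMIZERS = ("Adam", "SGD", "RMSprop")
ACTIVATIONS = ("Relu", "Sigmoid", "Softmax", "Tanh")


@dataclass
class Individual:
    learning_rate: float
    optimizer: str
    activation: str
    layers: List[int]  # neurons per hidden layer, e.g. [1, 2, 3]
    dropout: bool      # True means 25% dropout after each layer
    model_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def random_individual(rng: random.Random) -> Individual:
    """Sample an individual; the ranges are illustrative assumptions."""
    return Individual(
        learning_rate=rng.uniform(0.0001, 0.1),
        optimizer=rng.choice(OPTIMIZERS),
        activation=rng.choice(ACTIVATIONS),
        layers=[rng.randint(1, 256) for _ in range(rng.randint(1, 4))],
        dropout=rng.random() < 0.5,
    )


individual = random_individual(random.Random(42))
print(individual)
```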
We have implemented the following Genetic Operators:
- Selection: Tournament selection involves running several "tournaments" among a few individuals chosen at random from the population. The winner of each tournament, which is the one with the best fitness, is selected for crossover.
- Crossover: With a given probability, two individuals (aka models) will:
  - Swap their optimizers, learning rates and dropout settings.
  - Apply a single-point crossover.
- Mutation: With a given probability, we will:
  - Increase or decrease the LR by a random delta in the range (-0.05, 0.05).
  - Randomly select a different optimizer.
  - Randomly select a different activation function.
  - Toggle dropout.
  - For each layer, randomly add or subtract up to 25% of the neurons of the layer.
  - Permute the layers.
- Evaluation: Train a model with the resulting set of parameters and evaluate the trained model against a validation and a test dataset.
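The Selection, Crossover and Mutation operators above can be sketched in Python as follows. Several details are assumptions, since the text does not fix them: the single-point crossover is applied to the layer list, each mutation fires independently with the same probability, the learning rate is clamped to stay positive, and lower fitness is treated as better (as the logs further down suggest).

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

OPTIMIZERS = ["Adam", "SGD", "RMSprop"]
ACTIVATIONS = ["Relu", "Sigmoid", "Softmax", "Tanh"]


@dataclass(frozen=True)
class Individual:
    learning_rate: float
    optimizer: str
    activation: str
    layers: Tuple[int, ...]
    dropout: bool


def tournament(pop: List[Individual], fitness: Callable[[Individual], float],
               k: int, rng: random.Random) -> Individual:
    """Pick k random contestants; the lowest fitness wins (assumption)."""
    return min(rng.sample(pop, k), key=fitness)


def crossover(a: Individual, b: Individual, prob: float,
              rng: random.Random) -> Tuple[Individual, Individual]:
    """Swap optimizer, LR and dropout, then single-point cross the layers."""
    if rng.random() >= prob:
        return a, b
    # Cut point chosen so that both children keep at least one layer.
    cut = rng.randint(1, min(len(a.layers), len(b.layers)))
    child_a = replace(a, optimizer=b.optimizer, learning_rate=b.learning_rate,
                      dropout=b.dropout, layers=a.layers[:cut] + b.layers[cut:])
    child_b = replace(b, optimizer=a.optimizer, learning_rate=a.learning_rate,
                      dropout=a.dropout, layers=b.layers[:cut] + a.layers[cut:])
    return child_a, child_b


def mutate(ind: Individual, prob: float, rng: random.Random) -> Individual:
    """Apply each mutation independently with probability prob (assumption)."""
    lr, opt, act = ind.learning_rate, ind.optimizer, ind.activation
    layers, dropout = list(ind.layers), ind.dropout
    if rng.random() < prob:
        # Clamp so the learning rate stays positive (assumption).
        lr = max(1e-6, lr + rng.uniform(-0.05, 0.05))
    if rng.random() < prob:
        opt = rng.choice([o for o in OPTIMIZERS if o != opt])
    if rng.random() < prob:
        act = rng.choice([f for f in ACTIVATIONS if f != act])
    if rng.random() < prob:
        dropout = not dropout
    if rng.random() < prob:
        # Add or subtract up to 25% of each layer's neurons, keeping >= 1.
        layers = [max(1, n + round(n * rng.uniform(-0.25, 0.25))) for n in layers]
    if rng.random() < prob:
        rng.shuffle(layers)
    return Individual(lr, opt, act, tuple(layers), dropout)


rng = random.Random(1)
a = Individual(0.01, "Adam", "Relu", (64, 32), False)
b = Individual(0.05, "SGD", "Tanh", (128, 64, 16), True)
child_a, child_b = crossover(a, b, prob=1.0, rng=rng)
mutant = mutate(child_a, prob=1.0, rng=rng)
winner = tournament([a, b], fitness=lambda i: i.learning_rate, k=2, rng=rng)
print(child_a, child_b, mutant, winner, sep="\n")
```

Note that the activation function is not exchanged during crossover in this sketch, because the list above only mentions optimizers, learning rates and dropout.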
The Evaluation operator is particularly compute-intensive. To speed up the algorithm, we have implemented our genetic algorithm with a client-server distributed architecture in which the server applies the Selection, Mutation and Crossover operators, and the clients evaluate the resulting individuals.
This lets us divide the work among several clients, making our algorithm run considerably faster.
Note that such an architecture only makes sense for problems where the operators computed by the clients take far more time than the latency of sending and receiving results between the client and the server.
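The division of work can be illustrated with an in-process simulation in which threads stand in for remote clients and a queue stands in for the server's list of pending individuals. Everything here is an assumption for illustration: the real system uses separate processes, and `evaluate` is a placeholder for training a network.

```python
import queue
import threading
from typing import Dict, List


def evaluate(layers: List[int]) -> float:
    """Stand-in for training a network; returns a made-up fitness."""
    return 1.0 / (1 + sum(layers))


def client(pending: "queue.Queue", results: Dict[int, float],
           lock: threading.Lock) -> None:
    """Pull individuals until none are left, reporting each fitness."""
    while True:
        try:
            model_id, layers = pending.get_nowait()
        except queue.Empty:
            return
        fitness = evaluate(layers)
        with lock:
            results[model_id] = fitness


# "Server" side: enqueue the individuals that need evaluation.
population = {0: [8], 1: [16, 8], 2: [32, 16, 8], 3: [64]}
pending: "queue.Queue" = queue.Queue()
for item in population.items():
    pending.put(item)

# Two simulated clients share the work.
results: Dict[int, float] = {}
lock = threading.Lock()
workers = [threading.Thread(target=client, args=(pending, results, lock))
           for _ in range(2)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(sorted(results))  # [0, 1, 2, 3]
```

Because each client pulls a new individual as soon as it finishes the previous one, faster clients naturally end up evaluating more individuals.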
The server is written in Go, a compiled (hence fast) language that is well suited to implementing distributed systems thanks to its concurrency features (e.g. channels and goroutines).
The client is written in Python since we train the neural networks using Google's TensorFlow library. Even though Python is an interpreted language (hence slower), the intensive computation is done by TensorFlow, whose core is optimized and written in C++.
Communication is done via gRPC, with the following API:
service API {
    rpc GetModelParams(Empty) returns (ModelParameters) {}
    rpc ReturnModel(ModelResults) returns (Empty) {}
}
To test our evolutionary model, we have designed an experiment in which we run 30 generations of 50 individuals each. The objective is to get the best neural network for a binary-classification problem.
We'll use the Algerian Forest Fires dataset, available at the UCI Machine Learning Repository.
In this paper we can see that the best results were achieved with an AdaBoost model that obtained a recall of 0.95 and a precision of 0.79, hence an F1-score of 0.86.
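The baseline F1-score quoted above follows from the reported precision and recall, since F1 is their harmonic mean. A short check (the function name is our own):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline = f1_score(precision=0.79, recall=0.95)
print(round(baseline, 2))  # 0.86
```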
These are our results:
{"level":"warning","msg":"No models to evaluate","time":"2021-06-18T18:15:47Z"}
{"level":"info","msg":"Listening at 0.0.0.0:10000","time":"2021-06-18T18:16:33Z"}
2021/06/18 18:17:08 pop_id=Qut min=0.018868 max=1.000000 avg=0.240941 std=0.250306
{"level":"info","msg":"Best fitness at generation 0: 0.018868","time":"2021-06-18T18:17:08Z"}
2021/06/18 18:17:38 pop_id=Qut min=0.018868 max=1.000000 avg=0.176518 std=0.171339
{"level":"info","msg":"Best fitness at generation 1: 0.018868","time":"2021-06-18T18:17:38Z"}
{"level":"warning","msg":"No models to evaluate","time":"2021-06-18T18:17:38Z"}
2021/06/18 18:18:10 pop_id=Qut min=0.018868 max=1.000000 avg=0.182777 std=0.198851
{"level":"info","msg":"Best fitness at generation 2: 0.018868","time":"2021-06-18T18:18:10Z"}
2021/06/18 18:18:41 pop_id=Qut min=0.000000 max=1.000000 avg=0.254514 std=0.316698
{"level":"info","msg":"Best fitness at generation 3: 0.000000","time":"2021-06-18T18:18:41Z"}
{"level":"info","msg":"Best model found: model_id:\"42378d24-2be2-4cb3-9bb9-daa18ebbf0e2\" learning_rate:0.010019941 optimizer:RMSprop activation_func:Tanh layers:{num_neurons:164} layers:{num_neurons:214}","time":"2021-06-18T18:18:41Z"}
As we can see, the first generation already produced a better model than the baseline, with an F1 score of 0.981132 (the logged fitness is 1 - F1, so 1 - 0.018868), and after 3 generations we got a model that predicted all the test examples correctly.
All in all, the best model is the one that uses:
- Learning Rate: 0.010019941
- Optimizer: RMSprop
- Activation Function: Tanh
- Layers: [164, 214]
- Dropout: No
Possible future improvements:
- Allow NN architectures other than fully connected networks.
- Allow different activation functions on each layer.
- Allow different dropout rates.
- Island-based distributed evolutionary algorithm.
- Pool-based evolutionary algorithms, so the evolution is not delayed by slower clients.