SmallGP - a lean Genetic Programming System for Symbolic Regression

Author: Hannes Planatscher

Java SmallGP Applet

The functionality of SmallGP is implemented as a Java Applet: Click here to start!

Classic Command-line SmallGP

Introduction

SmallGP is a implementation of Symbolic Regression wich is compact by means of lines of code and memory usage, developed as an entry for the "Tiny GP"-Coding-Contest at the Genetic and Evolutionary Computation Conference 2004. It did not win however. But I still think that it contains some ideas that may be useful to someone interested or new in the area.

SmallGP is free software released under the GNU GPL.


Download

tar.gz: smallgp-1.0.tar.gz
zip: smallgp-1.0.zip

Installation Instructions

gcc and make are needed to compile SmallGP. It has succesfully been compiled under Linux, FreeBSD, Solaris and Win32. I recommend the Cygwin-Environment for Win32. It should work for all Win32-C-Compilers too however.

Unpack smallgp-1.0.tar.gz:
> tar xvfz smallgp-1.0.tar.gz

Move to the created directory:
> cd smallgp-1.0

Run Make:
> make

Some compiler warnings may occur, but after compilation you should find a working version of smallgp in the same directory.

Usage

SmallGP is a commandline-tool. All important parameters can be changed easily.

Display the help with

> smallgp -h

to learn which parameters are avaiable.

Most of the parameters and their default values are explained in this specs.
Example parameters for a test run:

dataset: test.data
population size: 10000
tournament size: 100
maximum generations: 80
maximum prog. length: 50
maximum depth: 7
#constants: 200
Pcross: 0.7
Pmut/Node: 0.1

To start smallgp with these parameters just run
> smallgp -i test.data -p 10000 -t 100 -g 80 -l 50 -d 7 -r 200 -c 0.7 -m 0.1

.
Warning: Parameters are not checked for validity (ex. negative probabilities etc.).

Detailed documentation

Also available as plain text: smallgpdoc.txt .
 _____       ___  ___       ___   _       _       _____   _____  
/  ___/     /   |/   |     /   | | |     | |     /  ___| |  _  \ 
| |___     / /|   /| |    / /| | | |     | |     | |     | |_| | 
\___  \   / / |__/ | |   / / | | | |     | |     | |  _  |  ___/ 
 ___| |  / /       | |  / /  | | | |___  | |___  | |_| | | |     
/_____/ /_/        |_| /_/   |_| |_____| |_____| \_____/ |_| 1.0 

       a compact implementation of Symbolic Regression	
----------------------------------------------------------------


Content:

I) Introduction
II) Installation/Compilation
III) Usage
IV) Implementation Details
  1) Representation of individuals
  2) Crossover
  3) Mutation  
  4) Strategy
  5) Initialization
V) References
VI) License

I) Introduction

SmallGP is a implementation of Symbolic Regression [1] wich is compact by 
means of lines of code and memory usage, developed as an entry 
for the "Tiny GP"-Coding-Contest [2] at the Genetic and Evolutionary 
Computation Conference 2004.


II) Installation Instructions

tar, gcc and make are needed to compile SmallGP. It has succesfully been 
compiled under  Linux, FreeBSD, Solaris and Win32 (with djgpp).

Unpack smallgp-1.0.tar.gz:
> tar xvfz smallgp-1.0.tar.gz

Move to the created directory:
> cd smallgp-1.0

Run Make:
> make

Some compiler warnings may occur, but after compilation you should find
 a working version of smallgp in the same directory.


III) Usage

SmallGP is a commandline-tool. All important parameters can be changed easily.

Display the help with

> smallgp -h

to learn which parameters are avaiable.

Most of the parameters and their default values are explained in [2]. 

Example parameters for a test run:

dataset: 		test.data
population size: 	10000
tournament size:	100
maximum generations:	80
maximum prog. length: 	50
maximum depth:		7
#constants: 		200
Pcross:			0.7
Pmut/Node:		0.1

To start smallgp with these parameters just run

> smallgp -i test.data -p 10000 -t 100 -g 80 -l 50 -d 7 -r 200 -c 0.7 -m 0.1

. 
Warning: Parameters are not checked for validity (ex. negative 
probabilities etc.).  	

IV) Implementation Details

1) Representation of individuals

SmallGP uses the concept of genotype-phenotype-duality.

Consider the following tree:

   (+)
   / \
  x1 (+) 
     / \
    x2 c2
	
This can be written in polish notation [3]:

 + x1 + x2 c2

Due limited amount of different symbols it is possible to store this
expression in a vector of integer values.

   (+)
   / \
  x1 (+)   => + x1 + x2 c2 => (2,5,2,6,16)
     / \
    x2 c2

The integer vector represents the genotype, while the phenotype 
is a standard  binary-tree, that consists of nodes containing 
information which are holding pointers to their children and 
their predecessor.

struct node {
    struct node *right;
    struct node *left;
    struct node **bp;
    unsigned char no;
};

node->no specifies the primitive function/terminal represented by this node.
It can easily be deducted that the maximum number of different primitves and
terminals is limited to 255 with using an 8-bit  datatype.
This phenotypic representation enables easy implementation of crossover and 
evaluation. 

With 104 bits memory needed per node it isnt really memory efficient, 
as every node contains just 8 bit of information, so we use the genotype
for storage.

The phenotype is easily converted to the genotype: a vector of 8-bit values.
The genotype is far more memory efficient than the phenotypic representation, 
but the implementation of subtree crossover and evaluation is less trivial.

SmallGP uses the phenotype-genotype-duality to save memory and to allow 
easy manipulation. While for evaluation and crossover individuals are 
temporarily expanded to their phenotype, point mutation can be 
performed on the genotype.

2) Crossover

SmallGP implements a size-safe kozastyle subtree-crossover[4]. Subtrees of
appropriate size are selected from two individuals, and then exchanged. It is
guaranteed that the size constraint is met by both individuals after the
operation.

Suppose that size(tree1) < size(tree2).

tree1 can be extended by (MAXLEN - size(tree1)) nodes. The maximum size
of a subtree of tree2 is size(tree2) so we fix a maximum size for the 
subtree selected from tree2 by a positive random integer value 
smaller than

((MAX_LEN - size(tree1)) modulo size(tree2)) = maxsize2

After the selection of a subtree subtree_2 with a maximal size 
maxsize2 from tree2, we can determine maxsize1. tree2 can be extended 
by (MAXLEN - size(tree2) + size(subtree2)).
As the maximum size of a subtree from tree1 is size(tree1) we choose
a positive random integer value not bigger than

((MAX_LEN - size(tree2) + size(subtree2)) modulo size(tree1)) = maxsize1

.

3) Mutation

SmallGP implements point-mutation, that simply substitutes a terminal with
another terminal, and a primitive with another primitive [5].

4) Strategy

SmallGP uses a generational strategy, with tournament selection.

5) Initialization

The tree in the initial population are initialized with a size-safe
"Grow"-Method. The probabilty of choosing a terminal at a certain depth
is obtained as follows:

		 |primitives| + |variables| + 1
       P_term =	------------------------------
			|variables| + 1

Constants are considered as one virtual terminal symbol.

Due to the "Grow"-Method most of the individuals are very small in
the initial population. 
This, in combination with the previously defined point mutation, can 
cause lack of structural  innovation in some cases. 
So an additional parameter P_term_really is introduced that allows 
to scale between "Grow" and "Full" initialization.
A terminal is set with the probability:

	 P_term' =  P_term * P_term_really

. With P_term_really = 1 the method equals "Grow", with P_term_really = 0
"Full". The defaultvalue of Pterm is 1.


V) References

[1] Genetic Programming, Symbolic Regression 
	http://www.genetic-programming.org/

[2] Tiny GP specification:
	http://cswww.essex.ac.uk/staff/sml/gecco/TinyGP.html

[3] Polish Notation
	http://www.cenius.net/refer/display.php?ArticleID=polishnotation

[4] Subtree Crossover
	http://www.geneticprogramming.com/Tutorial/#anchor179890

[5] Genetic Programming with One-Point Crossover and Point-Mutation 
   (Poli, Langdon)  
	http://cswww.essex.ac.uk/staff/poli/papers/Poli-WSC2-1997.pdf

VI) License

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  US