1 Introduction
Basic relations such as equality are fundamental to relational data structures. One goal of applying neural networks to relational data is that the networks learn to infer these relational structures from data. Although equality is typically not learned from data, equality or approximate equality may be embedded as part of other tasks. The modelling of equality is clearly in the hypothesis space of feedforward neural networks (FFNNs) [Leshno et al., 1993], but Marcus et al. [1999] and Marcus [2001] already highlighted that identity relationships learned by neural networks may not generalise to unseen data. Therefore, we see learning to recognise equality as relevant from both a theoretical and a practical perspective.
In this study we test whether feedforward networks learn equality as well as a numeric comparison, thresholded digit sum, and digit reversal of pairs of binary vectors and then generalise this to new data in different settings regarding the task, the amount of data provided, and the depth of the network. We find that the recognition of binary relations is not generalised reliably by feedforward networks.
To address this problem, we introduce an inductive bias with additional predefined network structures that we call differential rectifier (DR) units. In our experiments, DR units induce reliable, perfect generalisation for equality and for all other tasks except digit reversal.
We see two questions that these results raise: First, which other relations neural networks do not learn and what that means for more complex tasks. Second, what kinds of inductive biases to design and how to implement them.
2 Related work
In relational learning, equality is often not learned from the data, with the exception of the work by Santoro et al. [2017], who learn to detect equality of attributes of objects from images. Learning equality could be interesting in the context of constraint learning [De Raedt et al., 2018], to learn when equality constraints should be regarded as satisfied. Another relevant area is rule learning and application, where soft unification as in Campero et al. [2018] could be replaced with a learnt model.
Since neural networks are currently by far the most popular machine learning method, it seems of interest whether they can learn equality. There have been a number of theoretical contributions showing that feedforward networks are universal approximators, most generally to our knowledge by Leshno et al. [1993]. Presumably because of these results, there was relatively little interest in the question of which functions neural networks cannot learn. One of the few studies in this direction was undertaken by Marcus et al. [1999], where a recurrent neural network failed to distinguish abstract patterns based on equality relations between sequence elements, although seven-month-old infants showed the ability to distinguish them after a few minutes of exposure. This was followed by a lively exchange on rule learning by neural networks and in human language acquisition, where results by [Elman, 1999, Altmann and Dienes, 1999, Shultz and Bale, 2001] could not be reproduced by [Vilcu and Hadley, 2001, 2005] and Shultz and Bale [2006] disputed claims by [Vilcu and Hadley, 2005]. Other approaches, such as [Shastri and Chang, 1999, Dominey and Ramus, 2000, Alhama and Zuidema, 2016], use different network architectures, problem formulations or evaluation methods.
A more specific problem of learning equality relations was posed in Marcus [2001] by showing that learning of equality on even numbers does not transfer to odd numbers in binary representation. This relates to the input neuron for the least significant bit not being set to 1 during training. Recently, Mitchell et al. [2018] addressed this specific problem with different approaches, as an example of extrapolation and inductive biases for machine learning in natural language processing. However, they did not address the general question of learning equality with neural networks.
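The even/odd problem can be made concrete with a short sketch. The helper name `to_bits` and the 4-bit width are illustrative assumptions and are not taken from the cited works.

```python
def to_bits(value, width=4):
    """Return the binary digits of value, most significant bit first."""
    return [(value >> i) & 1 for i in reversed(range(width))]

# Training inputs: even numbers only (illustrative 4-bit range).
even_inputs = [to_bits(v) for v in range(0, 16, 2)]
# Unseen inputs: odd numbers.
odd_inputs = [to_bits(v) for v in range(1, 16, 2)]

# The least significant bit (last position) is 0 in every even number,
# so the input neuron for that bit is never active during training.
lsb_values_seen_in_training = {bits[-1] for bits in even_inputs}
lsb_values_at_test_time = {bits[-1] for bits in odd_inputs}

print(lsb_values_seen_in_training)
print(lsb_values_at_test_time)
```

Any weights attached to that input therefore see no informative signal during training, which is one way to read why the learned behaviour does not carry over to odd numbers.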
If standard neural networks do not generalise equality relations despite the solution being in their hypothesis space, as we will show for FFNNs below, then the question is how we can enable the learning of solutions that do generalise. Inductive biases as a solution can be realised in a number of ways and have been of increased interest recently [Hamrick et al., 2018, Snell et al., 2017].
3 Equality relation learning
The studies listed above motivated the approach taken here to study a reduced problem outside common contexts such as image analysis or cognitive modelling: whether feedforward neural networks trained with backpropagation generally have the ability to learn equality relations and generalise to unseen data.
The general task is to learn the relation between pairs of binary vectors. This leads to a binary classification of the pairs according to the equality or otherwise of their element vectors. We use a standard FFNN as sketched in Figure 1a). This network has $2n$ input neurons, where $n$ is the vector dimensionality. The hidden layer has 10 neurons with ReLU activation. The output layer has two neurons representing the two classes (equal/unequal), which use softmax activation. The training uses the Adam optimiser [Kingma and Ba, 2014] with cross-entropy loss.
The data we train and test the network with is synthetically generated, and we vary the type and the distribution of the data in the experiments below. We are interested in how many training examples are needed until the network learns to correctly classify pairs of equal vs. unequal vectors. This network, like the following ones, has been implemented in Python using PyTorch (http://pytorch.org).
Inductive bias creation with DR units
In our model, we use differential rectifier (DR) units that compare input values by calculating the absolute difference: $f(x, y) = |x - y|$. We create one DR unit for every vector dimension, with the weights from the inputs to the DR units fixed rather than trained; thus learning suitable summation weights for the DR units is sufficient for creating a generalisable equality detector.
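As a minimal sketch of this idea in plain Python (not the PyTorch implementation used in the experiments), the DR outputs for a pair are the per-dimension absolute differences, and a weighted sum of them is zero exactly when the vectors are equal, provided all summation weights are positive. The function names and the default weights of 1.0 are assumptions made for illustration.

```python
def dr_units(x, y):
    """One DR unit per dimension: the absolute difference |x_i - y_i|."""
    return [abs(a - b) for a, b in zip(x, y)]

def is_equal(x, y, weights=None):
    """Classify a pair as equal when the weighted sum of DR outputs is zero.

    `weights` stands in for the learned summation weights; the default of
    1.0 per unit is an illustrative assumption, not a learned value.
    """
    outputs = dr_units(x, y)
    if weights is None:
        weights = [1.0] * len(outputs)
    return sum(w * d for w, d in zip(weights, outputs)) == 0

print(dr_units([1, 0, 1], [1, 1, 1]))   # [0, 1, 0]
print(is_equal([1, 0, 1], [1, 0, 1]))   # True
print(is_equal([1, 0, 1], [1, 1, 1]))   # False
```

Because the comparison is wired in per dimension, the decision does not depend on which particular vectors were seen during training, which is the intended inductive bias.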
4 Experiments and Results
We performed different sets of experiments using binary vectors for estimating vector equality in relation to vector dimensionality, data size and dataset structure. We also use two additional tasks to test the effect of DR units in different contexts.
Effect of network architecture and vector dimensionality
We generate pairs of random binary vectors with dimensionality between 2 and 100, as shown in Table 1. We use all the possible binary vectors to generate equal pairs, i.e. $2^n$ pairs, for small $n$, and a random selection of 1000 vectors otherwise. We also generate the same number of randomly selected unequal vector pairs. Then we use stratified sampling to split the data 75:25 into training and test sets. The network is then trained for 20 epochs, which led to convergence in all cases. We run 10 simulations for each configuration. The average results are shown in Table 1.
We see that the standard FFNNs never fully generalise, and in many cases barely exceed chance level (50%). The Early Fusion model improves results, but only reaches full performance for 100 dimensions. The Mid Fusion model reaches perfect test performance in all cases.
For the plain FFNN, it looks like there is a trend towards better performance at higher dimensionality, but with the observed variation that may be coincidental. We did not perform an exhaustive grid search over all hyperparameters, but tested higher numbers of hidden layers (2,3), and larger hidden layers (20,30 neurons) without observing a significant change in the results.
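Table 1: Average test accuracy over 10 simulations for each vector dimensionality and network type.
Vector dimensions   Plain FFNN   Early Fusion   Mid Fusion
n=2                 52%          82%            100%
n=3                 55%          75%            100%
n=5                 37%          67%            100%
n=10                52%          75%            100%
n=30                65%          75%            100%
n=100               75%          100%           100%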
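The data generation and split described in this subsection can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the experiments: the rule for switching from enumeration to sampling (`2 ** n <= max_vectors`), the function names, and the seeds are all hypothetical.

```python
import itertools
import random

def make_pairs(n, max_vectors=1000, seed=0):
    """Return (equal, unequal) lists of ((x, y), label) items for dimension n."""
    rng = random.Random(seed)
    if 2 ** n <= max_vectors:
        # Small n: enumerate every binary vector (assumed switching rule).
        vectors = list(itertools.product((0, 1), repeat=n))
    else:
        # Large n: sample distinct random vectors instead.
        seen = set()
        while len(seen) < max_vectors:
            seen.add(tuple(rng.randint(0, 1) for _ in range(n)))
        vectors = sorted(seen)
    equal = [((v, v), 1) for v in vectors]
    # rng.sample draws two different list entries, so the pair is unequal.
    unequal = [(tuple(rng.sample(vectors, 2)), 0) for _ in equal]
    return equal, unequal

def stratified_split(equal, unequal, train_fraction=0.75, seed=0):
    """Split each class separately so both sets keep the class balance."""
    rng = random.Random(seed)
    train, test = [], []
    for group in (equal, unequal):
        items = list(group)
        rng.shuffle(items)
        cut = round(train_fraction * len(items))
        train += items[:cut]
        test += items[cut:]
    return train, test

equal, unequal = make_pairs(n=3)
train, test = stratified_split(equal, unequal)
print(len(train), len(test))
```

Splitting per class means that no equal pair in the test set has been seen in training, so test accuracy on equal pairs measures generalisation to unseen vectors.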
Effect of training data size
We study here how much the performance depends on the training data size. For this, we vary only the training data size and keep the test set and all other parameters constant. We use training data sizes of 1% to 50% (in relation to the total available data as defined above), and the accuracy achieved in the various conditions is plotted in Figure 2. It is worth noting that the Mid Fusion network reaches 100% accuracy from a training data size of 10% upwards, while the FFNN shows only small learning effects.
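The protocol of varying only the training set size can be sketched as below. The intermediate fractions, the placeholder data, and the function name are assumptions for illustration; the text above only states the range of 1% to 50%.

```python
import random

def subsample_training_set(train, total_size, fraction, seed=0):
    """Draw a random training subset sized as a fraction of all available data."""
    rng = random.Random(seed)
    size = min(len(train), max(1, round(fraction * total_size)))
    return rng.sample(train, size)

# Placeholder items standing in for labelled vector pairs.
total_size = 2000
train = list(range(1500))  # the 75% training portion; the test set is untouched

for fraction in (0.01, 0.05, 0.10, 0.25, 0.50):
    subset = subsample_training_set(train, total_size, fraction)
    print(f"{fraction:.0%} of the data -> {len(subset)} training items")
```

This sketch samples without stratification; a per-class version would subsample the equal and unequal pairs separately.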
Effect of vector coverage
A possible hypothesis for the results of the FFNN is that the coverage of the vectors in the training set plays a role. Sharing vectors in equal pairs between training and test set would mean training on the test data, but we created a training set that contains, in its unequal pairs, all vectors that appear in the test set, for a fixed vector dimensionality $n$. The results are shown in column a) of Table 2. We also created a training set where each vector appeared as above, but in both position 1 and position 2. The results are shown in column b) of Table 2. The results in both cases are similar to those without this additional coverage in Table 1.
Other classification tasks
We evaluate here whether the DR units have a negative effect on other learning tasks (using a fixed vector dimensionality $n$). We evaluated the networks on a classification task that compares the two vectors in the pair as binary numbers, with results shown in column c) of Table 2. We also tested a task that is not a comparison of the two vectors in the pair, by calculating the digit sum and classifying by checking it against a threshold (column d). In both c) and d) we see that the performance is actually not hindered but helped by the DR units. We finally tested the task of recognising digit reversal (swapping least with most significant bits), which DR units are not designed for, as they compare corresponding digits. As we can see in column e), DR units do not deliver a perfect solution here, but still lead to somewhat better results than a plain FFNN.
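Table 2: Accuracy by network type for a) vector coverage, b) vector coverage in both positions, c) numeric comparison, d) thresholded digit sum, and e) digit reversal.
Type              a)     b)     c)     d)     e)
1) Plain FFNN     50%    52%    75%    77%    50%
2) Early Fusion   75%    87%    92%    82%    55%
3) Mid Fusion     100%   100%   100%   100%   58%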
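The three additional tasks can be written as labelling functions over a pair of bit vectors. This is a sketch under assumptions: the direction of the numeric comparison, the digit sum being taken over both vectors, the `>=` test, and the threshold value are not specified in the text and are chosen here only for illustration.

```python
def to_int(bits):
    """Interpret a bit sequence as a binary number, most significant bit first."""
    return int("".join(str(b) for b in bits), 2)

def label_greater(x, y):
    """Numeric comparison: 1 if x > y as binary numbers (assumed direction)."""
    return int(to_int(x) > to_int(y))

def label_digit_sum(x, y, threshold):
    """Thresholded digit sum over both vectors (assumed form of the task)."""
    return int(sum(x) + sum(y) >= threshold)

def label_reversal(x, y):
    """Digit reversal: 1 if y is x with its bit order reversed."""
    return int(list(y) == list(x)[::-1])

print(label_greater([1, 0, 1], [0, 1, 1]))       # 5 > 3
print(label_digit_sum([1, 0, 1], [0, 1, 1], 4))  # digit sum 4 meets threshold 4
print(label_reversal([1, 1, 0], [0, 1, 1]))      # 110 reversed is 011
```

The reversal label depends on comparing position i of one vector with position n-1-i of the other, which a unit comparing corresponding positions does not capture directly.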
5 Conclusions
In this study we examined the learning behaviour of feedforward neural networks in vector equality detection and observed that the networks do not generalise well to unseen data. We also had similar results in other tasks like numeric inequality and sum of bits of binary vectors. We therefore introduced a simple modification to the network with differential rectifier (DR) units and noticed substantial improvements on unseen test data. This improvement is largely independent of vector dimension, data size and other parameters.
The question why standard FFNNs do not learn vector equality relations in a generalisable way is a relevant one, and deserves further theoretical and empirical study. It is also important to investigate the design of further measures for creating and controlling inductive biases in neural network learning, as we find that even relatively simple tasks like generalising equality require them.
References

Leshno et al. [1993] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
Marcus et al. [1999] G. F. Marcus, S. Vijayan, S. B. Rao, and P. M. Vishton. Rule learning by seven-month-old infants. Science, 283(5398):77–80, 1999.
Marcus [2001] G. F. Marcus. The algebraic mind: Integrating connectionism and cognitive science. Cambridge, MA: MIT Press, 2001.
Santoro et al. [2017] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.
De Raedt et al. [2018] Luc De Raedt, Andrea Passerini, and Stefano Teso. Learning constraints from examples. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Campero et al. [2018] Andres Campero, Aldo Pareja, Tim Klinger, Josh Tenenbaum, and Sebastian Riedel. Logical rule induction and theory learning using neural theorem proving, 2018.
Elman [1999] Jeffrey Elman. Generalization, rules, and neural networks: A simulation of Marcus et al. https://crl.ucsd.edu/~elman/Papers/MVRVsimulation.html, 1999.
Altmann and Dienes [1999] Gerry Altmann and Zoltan Dienes. Technical comment on rule learning by seven-month-old infants and neural networks. Science, 284(5416):875, 1999.
Shultz and Bale [2001] Thomas R. Shultz and Alan C. Bale. Neural network simulation of infant familiarization to artificial sentences: Rule-like behavior without explicit rules and variables. Infancy, 2(4):501–536, DOI: 10.1207/S15327078IN020407, 2001.
Vilcu and Hadley [2001] Marius Vilcu and Robert F Hadley. Generalization in simple recurrent networks. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 23, 2001.
Vilcu and Hadley [2005] Marius Vilcu and Robert F Hadley. Two apparent ‘counterexamples’ to Marcus: A closer look. Minds and Machines, 15(3–4):359–382, 2005.
Shultz and Bale [2006] Thomas R Shultz and Alan C Bale. Neural networks discover a near-identity relation to distinguish simple syntactic forms. Minds and Machines, 16(2):107–139, 2006.
Shastri and Chang [1999] Shastri and Chang. A spatiotemporal connectionist model of algebraic rule-learning. Technical Report TR-99-011, International Computer Science Institute, 1999.
Dominey and Ramus [2000] P. Dominey and F. Ramus. Neural network processing of natural language: I. Sensitivity to serial, temporal and abstract structure of language in the infant. Language and Cognitive Processes, 15(1):87–127, 2000.
Alhama and Zuidema [2016] Raquel G. Alhama and Willem Zuidema. Pre-wiring and pre-training: What does a neural network need to learn truly general identity rules. CoCo at NIPS, 2016.
Mitchell et al. [2018] Jeff Mitchell, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Extrapolation in NLP. arXiv:1805.06648, 2018.
Hamrick et al. [2018] Jessica B Hamrick, Kelsey R Allen, Victor Bapst, Tina Zhu, Kevin R McKee, Joshua B Tenenbaum, and Peter W Battaglia. Relational inductive bias for physical construction in humans and machines. arXiv preprint arXiv:1806.01203, 2018.
Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.