BIEDERMAN'S 'RECOGNITION-BY-COMPONENTS'

2002

Lucy Bartosik

Outline Biederman's 'recognition-by-components' (RBC) theory of object recognition and discuss its relationship to Marr's theory of early visual processing.

'Object recognition' is described by Atkinson et al. (2000) as determining the meaning of an object. It is vital for survival because we can only respond to an object's significant features once we know what the object is. For instance, once we realise that the object advancing on us is a tiger, we can respond appropriately: we realise it is about to eat us and start to run the other way very quickly. Similarly, once we realise that an object is a piece of scrumptious chocolate cake, we know that the object is edible. 'Spatial localisation' refers to determining where visual objects are, and it too is necessary for survival (Atkinson et al., 2000). Finally, 'perceptual constancy' is described as 'keeping the appearance of objects constant even though their impressions on the retina are constantly changing' (Atkinson et al., 2000). These three, localisation, recognition and perceptual constancy, are the major goals of perception.

Biederman's recognition-by-components (RBC) theory holds that all complex forms are made up of simple geometrical forms known as 'geons' (geometrical ions). Pattern recognition consists of recognising these components (Gross, 1996). Marr's theory of early visual processing, by contrast, is known as the computational approach: it aims to define the stages involved in extracting useful 3-D information from 2-D images. Marr's theory involves four stages: the image or grey-level description, the primal sketch, the 2½-D sketch and the 3-D model representation (Gross, 1996).

Atkinson et al. (2000) argue that it is primarily shape, alongside size, colour, texture and orientation, that we use to recognise an object. For example, we recognise a cat as a cat whether it is big or little (change in size), black or white (change in colour), extra fuzzy or without hair (change in texture), and whether it is presented curled up or sitting upright (change in orientation). There is evidence for the importance of shape (Biederman & Ju, 1988, in Atkinson et al., 2000, p. 164): we are able to recognise many objects from basic line drawings as well as we can from detailed colour photographs, even though the latter also provide other attributes of the objects.

Biederman (1987, 1990) proposed a theory of object recognition which extended an existing theory by Marr and Nishihara (1978). As stated above, the central assumption of Biederman's RBC theory is that objects consist of basic components known as geons. Arcs, wedges, spheres, blocks and cylinders are all examples of geons, and Biederman suggested that there are about thirty-six different ones. Eysenck and Keane (2001) explain that although thirty-six seems too few to describe all the objects we can identify and recognise, the situation may be compared to that of phonemes. There are only around forty-four different phonemes in the English language and yet these are sufficient for all spoken words, because there are almost infinite possibilities for arranging them. In the same way, a small set of geons can describe a great many objects because of the many different spatial relationships possible between them. For example, a bucket may be described as a cylinder with an arc connected to the top, whereas a cup may be described as a cylinder with an arc connected to the side (Eysenck and Keane, 2001). The theory therefore centres on determining the geons of a visual object and their relationships. This information is then matched with stored object representations, or structural models, which contain information about the attributes of objects (such as texture, colour, size and orientation). A visual object is thus identified by the stored object representation that best fits the geon-based information it provides.
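The matching step described above can be illustrated with a minimal sketch. It is not Biederman's own procedure: the relation labels, the two-item memory and the overlap score (a Jaccard measure) are all assumptions made purely for illustration, using the bucket and cup examples from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    """A stored structural description: a name plus its geon relations."""
    name: str
    relations: frozenset  # of (geon, relation, geon) triples


# Hypothetical visual memory, holding only the two examples from the text.
MEMORY = [
    StoredObject("bucket", frozenset({("arc", "on top of", "cylinder")})),
    StoredObject("cup", frozenset({("arc", "on side of", "cylinder")})),
]


def fit(observed, stored):
    """Share of relations common to both descriptions (Jaccard overlap)."""
    union = observed | stored.relations
    return len(observed & stored.relations) / len(union) if union else 0.0


def recognise(observed, memory=MEMORY):
    """Return the name of the best-fitting stored object."""
    best = max(memory, key=lambda stored: fit(observed, stored))
    return best.name


# The same two geons in a different spatial relationship give a different object.
seen = frozenset({("arc", "on side of", "cylinder")})
print(recognise(seen))
```

The point of the sketch is that the two stored objects contain identical geons and differ only in the relation between them, which is why the relationships, and not just the geons, have to be part of the description. Note that this sketch always returns some object, even when nothing fits well.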

The outline of Biederman's theory has been provided, but the theory also contains an analysis of how an object's geons are determined. The first step is edge extraction, which Biederman describes as follows: 'an early edge extraction stage, responsive to differences in surface characteristics namely, luminance, texture, or colour, provides a line drawing description of the object' (Biederman, 1987, p. 117, in Eysenck and Keane, 2001). Next it must be decided how the object should be segmented, so that the number of its parts or components (geons) can be found. Biederman (1987) agreed with Marr and Nishihara (1978) that when segmenting a visual image into geons, the concave parts of the object's contour are of specific importance. The remaining vital element of the theory is deciding which edge information from an object possesses the necessary characteristic of remaining invariant across different viewing angles (Eysenck and Keane, 2001). Biederman (1987) states five invariant properties of edges: points on a curve (Curvature), sets of points in parallel (Parallel), edges terminating at a common point (Co-Termination), Symmetry, and points in a straight line (Co-Linearity). The theory goes on to explain that a visual object's geons are constructed from these invariant properties: for example, a book is constructed of three parallel edges and no curved edges, but a cup has two parallel edges connecting the curved edges.
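The role of concavities in segmentation can be made concrete with a small sketch. It assumes, purely for illustration, that the extracted contour is a closed polygon whose corners are listed in counter-clockwise order; the function name and the L-shaped example are hypothetical and are not taken from Biederman (1987). A corner counts as concave when the contour turns clockwise there, which is tested with the sign of a cross product.

```python
def concave_vertices(contour):
    """Return the corners at which a counter-clockwise contour turns clockwise."""
    n = len(contour)
    found = []
    for i in range(n):
        x0, y0 = contour[i - 1]        # previous corner (wraps around at i = 0)
        x1, y1 = contour[i]            # current corner
        x2, y2 = contour[(i + 1) % n]  # next corner
        # Cross product of the incoming and outgoing edge vectors:
        # positive for a left (convex) turn, negative for a right (concave) turn.
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross < 0:
            found.append(contour[i])
    return found


# Hypothetical L-shaped outline: two blocks joined at a single concave corner.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(concave_vertices(l_shape))
```

In this example the only concave corner lies where the two arms of the L meet, which is the natural place to divide the outline into two block-like parts.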

Marr's computational theory (1982) involves a series of representations which provide increasingly detailed information about the visual environment. The first is the primal sketch, which may be either the raw primal sketch or the full primal sketch. The raw primal sketch contains information about light-intensity variations in the visual scene, whereas the full primal sketch uses this information to identify the number and outline of visual objects. Two primal sketches are formed because the intensity of light reflected from an object depends on its texture as well as on the angle at which the light hits the object. This means that the light-intensity changes included in the raw primal sketch give an imperfect but useful guide to the object's shapes and edges. The raw primal sketch is formed from the 'grey-level representation' of the retinal image. This representation is based on 'pixels', the very small individual areas of the image, and their light intensities.

The light intensity reflected by each pixel fluctuates continuously, which threatens to distort the representation. This noise can be eliminated by smoothing, that is, by averaging the light intensities of surrounding pixels. A drawback is that this can often result in the loss of important information. To solve this problem, several representations of the image are assumed to be formed, each with a different amount of blurring. The information from these representations is combined, and the result is the raw primal sketch, made up of four types of token: edge-segments, bars, terminations and blobs. The full primal sketch is a further version of the primal sketch; it is important to note that both are two-dimensional.
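The trade-off between removing noise and losing information can be shown with a minimal sketch. It assumes a single hypothetical row of pixel intensities and uses a simple neighbourhood average as the blur; this is a stand-in chosen for clarity and is not claimed to be the filter used in Marr's theory. The function names and blur radii are likewise illustrative.

```python
def smooth(signal, radius):
    """Replace each pixel by the mean of itself and its neighbours within `radius`."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def largest_change(signal):
    """Largest intensity difference between two adjacent pixels."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


# Hypothetical row of pixels: a dark region followed by a bright region,
# with small fluctuations (noise) on both sides of the true edge.
row = [10, 12, 9, 11, 10, 12, 50, 52, 49, 51, 50, 52]

# One representation per amount of blurring; radius 0 leaves the row unchanged.
representations = {radius: smooth(row, radius) for radius in (0, 1, 2)}

for radius, rep in representations.items():
    noise = largest_change(rep[:4])  # fluctuation well inside the dark region
    edge = largest_change(rep)       # sharpness of the dark/bright boundary
    print(radius, round(noise, 2), round(edge, 2))
```

As the blur radius grows, the fluctuation inside the dark region shrinks, but so does the sharpness of the boundary between the two regions. Keeping several representations at different amounts of blurring means that neither the fine detail nor the noise-free overall structure has to be given up.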

Like the primal sketch, the 2½-D sketch is observer-centred (or viewpoint-dependent). Marr (1982) identified several steps in the transformation from the primal sketch into the 2½-D sketch. Firstly a 'range map' is formed; next, related parts of this map are used to form higher-level descriptions. For example, information from specific parts of the map is used to identify concave and convex junctions between one surface and another. It is important to note that motion, shading, shape, texture and binocular disparity are all made use of in the transformation from the primal sketch into the 2½-D sketch.

The 3-D model representation is produced because the 2½-D sketch is viewpoint-centred and so is a poor basis for identifying an object. The representation of an object in the 2½-D sketch varies with the angle from which the object is viewed, which makes object recognition more complicated. The 3-D model representation, on the other hand, describes the shapes of objects in a way that is 'viewpoint-invariant' (independent of the observer's viewpoint).

There are several similarities and differences between Biederman's RBC theory and Marr's theory of early visual processing. Marr's theory appears much more complex from the outset because of the variety of sketches and models involved. Biederman's theory, by contrast, seems to treat object recognition as starting at basic levels and then becoming more complex. In addition, the two theories differ in their basic units: Marr's theory suggests that we recognise objects from their components and the shapes of these components, which are cylinders, whereas Biederman's theory takes geons as its basic units, which Marr's theory does not share. One similarity between the two theories is that both may be described as involving top-down processing, such that objects are indirectly perceived and our knowledge of the world is used to perceive at the end of the process (Gross, 1996).

A fully viewpoint-independent theory would specify that objects are mentally represented as 3-D models, and would thus predict that these representations should be equally accessible from any point of view. Biederman's model does not quite predict this, because a structural description lacks information about hidden parts, meaning that more than one structural description might be required to identify the object. RBC is thus different from Marr's theory, because RBC suggests that once we see an object, we are able to recognise it from having seen similar patterns in the past: hence we recognise a tiger's tail as such because we recognise the pattern of stripes, the shape, the texture and so on. Put simply, Biederman's theory is based upon identifying an object's features and then using these to identify the object's geons and their relationships. Visual memory is then called upon to see whether there is an object that matches what we have detected (Gleitman et al., 1998). With regard to Marr's theory, more is known about the processes involved in constructing a range map than about the transition from the range map to the 2½-D sketch, suggesting that there are still some areas of weakness in Marr's theory.

A lot may often be learned about the functioning of the brain after damage to it: there are patients who have suffered specific cortical lesions which lead to visual agnosia (Farah, 1990, in Gleitman et al., 1998). Such patients are able to see but are not able to recognise what they see. One patient demonstrated this clearly: his task was to copy some drawings, but he could not name his drawing of a key and named the bird he had just drawn as a tree stump (Farah, 1990, in Gleitman et al., 1998). The evidence suggests that he had formed adequate structural descriptions of the objects but could not access their meaning, because his processing was blocked at this point. This example shows that the visual system is able to describe structure but may fail at the final stage, that of assigning meaning to the object. This links with Biederman's theory in that the object's features are identified but cannot then be used to identify the object's geons and their relationships. Using this patient as an example, it seems easier to accept Biederman's RBC theory as more plausible than Marr's theory of early visual processing. This is because the patient is halted once it comes to the meaning of the object but is still able to see it, and it is unclear how Marr's theory could explain the patient's situation. Perhaps one way to understand agnosia using Marr's theory is to consider whether a blockage could be present between the 2½-D sketch and the 3-D model, so that the final representation would lack vital meaning and the object would be difficult to identify.

The Gestalt view holds that pattern recognition usually depends upon the overall shape of a visual stimulus and less upon the object's individual features; in this respect it is similar to Marr's theory and less similar to Biederman's. Biederman's RBC theory was based upon Marr's theory of early visual processing, and so a relationship between the two is unavoidable. It is vital to understand that Marr's theory of early visual processing and Biederman's RBC theory are two of a large field of competing theories of 3-D object recognition. There are several differences and similarities between the two theories, but essentially a relationship does exist because both are plausible theories of object recognition, although Marr's theory of early visual processing seems more complex than Biederman's recognition-by-components theory.

References

Atkinson, Rita L., Atkinson, Richard C., Smith, Edward E., Bem, Daryl J. and Nolen-Hoeksema, Susan (2000) Hilgard's Introduction to Psychology, Thirteenth Edition.

Eysenck, Michael W. and Keane, Mark T. (2001) Cognitive Psychology: A Student's Handbook, Fourth Edition.

Gleitman, Henry, Fridlund, Alan J. and Reisberg, Daniel (1998) Psychology, Fifth Edition.

Gross, Richard (1996) Psychology: The Science of Mind and Behaviour, Third Edition.