2002
Lucy Bartosik
Outline Biederman's 'recognition-by-components' (RBC) theory of object
recognition and discuss its relationship to Marr's theory of early visual
processing.
'Object recognition' is a term described in Atkinson et al. (2000) as
referring to determining the meaning of an object: it is described as being
vital for survival because we are only able to respond and react to an
object's significant features once we know what the object is. For instance,
once we realise that the object advancing on us is a tiger, we can then
respond in the appropriate way by realising it is about to eat us and so we
can start to run the other way very quickly. Similarly, once we realise that
an object is a piece of scrumptious chocolate cake, then we know that the
object is edible. 'Spatial localisation' refers to determining where visual
objects are, and it too is necessary for survival (Atkinson et al., 2000).
Finally, 'perceptual constancy' is described as 'keeping the appearance of
objects constant even though their impressions on the retina are constantly
changing' (Atkinson et al., 2000). These are the three major goals of
perception: localisation, recognition and perceptual constancy.
Biederman's recognition-by-components (RBC) theory holds that all complex
forms are made up of simple geometrical forms known as 'geons' (geometric
ions). Pattern recognition consists of recognising these components
(Gross, 1996). On the other hand, Marr's theory of early visual
processing is known as the computational approach: this theory aims to
define the stages involved in extracting useful 3-D information from 2-D
images. Marr's theory involves: image/grey level description, primal sketch,
2½-D sketch, 3-D model representation (Gross, 1996).
Atkinson et al. (2000) argue that we recognise an object primarily by its
shape, with size, colour, texture and orientation playing a lesser role. For
example, we recognise a cat as a cat whether it is big or little (change in
size), black or white (change in colour), extra fuzzy or without hair
(change in texture), curled up or sitting upright (change in orientation).
Evidence for the importance of shape (Biederman & Ju, 1988 in Atkinson et
al., 2000, page 164) is that we can recognise many objects from basic line
drawings as well as we can from detailed colour photographs, even though the
latter also convey other attributes of the objects.
Biederman (1987, 1990) proposed a theory of object recognition which
extended an existing theory by Marr and Nishihara (1978). As has previously
been stated, Biederman's RBC theory has a central assumption that objects
consist of basic components known as geons. Arcs, wedges, spheres, blocks
and cylinders are all examples of geons, of which Biederman suggested there
are about thirty-six. Eysenck and Keane (2001) explain that although
thirty-six seems far too few to describe all the objects we can identify
and recognise, the situation is comparable to that of phonemes. There are
only around forty-four phonemes in the English language, yet these are
sufficient for all spoken words because they can be arranged in an almost
infinite number of ways. Geons may be thought of in the same way: a small
set of geons can describe a vast range of objects because the geons can
stand in many different spatial relationships to one another. For example,
a bucket may be described as a cylinder with an arc connected to the top,
whereas a cup may be described as a cylinder with an arc connected to the
side (Eysenck and Keane, 2001). The theory therefore centres on determining
the geons of a visual object and their relationships. This information is
then matched with stored object representations, or structural models,
which contain information about the attributes of objects (such as texture,
colour, size and orientation). A visual object is thus identified by
whichever stored object representation best fits the geon-based information
that the visual object provides.
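The matching step described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the relation labels,
the overlap score and the function names are hypothetical and are not taken
from Biederman's account. Only the bucket and cup descriptions come from the
example given by Eysenck and Keane (2001).

```python
# Hypothetical sketch: relation labels and the scoring rule are assumptions
# made for illustration, not part of Biederman's theory.

# Each stored representation is a set of (geon, relation, geon) triples.
STORED_OBJECTS = {
    "bucket": {("arc", "connected-to-top-of", "cylinder")},
    "cup": {("arc", "connected-to-side-of", "cylinder")},
}


def match_score(observed, stored):
    """Overlap between two descriptions: shared triples / all triples."""
    union = observed | stored
    if not union:
        return 0.0
    return len(observed & stored) / len(union)


def recognise(observed):
    """Return the name of the best-fitting stored representation."""
    return max(
        STORED_OBJECTS,
        key=lambda name: match_score(observed, STORED_OBJECTS[name]),
    )


seen = {("arc", "connected-to-side-of", "cylinder")}
print(recognise(seen))
```

The same two geons appear in both stored objects, so in this sketch it is
the spatial relationship alone that separates the cup from the bucket.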
The outline of Biederman's theory has been provided, but the theory also
analyses how an object's geons are determined. The first step is edge
extraction, which Biederman describes as follows: 'an early edge
extraction stage, responsive to differences in surface characteristics
namely, luminance, texture, or colour, provides a line drawing description
of the object.' (Eysenck and Keane, 2001 from Biederman 1987, p. 117). Next
it must be decided how the object should be segmented so that the number of
parts or components (geons) can be found. Biederman (1987) agreed with Marr
and Nishihara (1978) that when segmenting a visual image into geons, the
concave parts of the object's contour are of particular importance. The
remaining vital element of the theory is deciding which edge information
from an object remains invariant across different viewing angles (Eysenck
and Keane, 2001). Biederman (1987) states five invariant properties of
edges: points on a curve (Curvature), sets of points in parallel
(Parallel), edges terminating at a common point (Co-Termination), points in
a straight line (Co-Linearity) and Symmetry. The theory goes on to explain
that a visual object's geons are constructed from these invariant
properties: for example, a book has three parallel edges and no curved
edges, whereas a cup has two parallel edges connecting the curved edges.
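Three of the invariant properties listed above can be expressed as simple
geometric tests on the edges of a line drawing. The sketch below is an
illustration only: representing an edge as a pair of 2-D image points, the
tolerance value and the function names are all assumptions, and nothing
here is claimed to be how the visual system computes these properties.

```python
# Illustrative only: edges as pairs of 2-D image points and the tolerance
# value are assumptions of this sketch, not part of RBC theory.

def cross(ux, uy, vx, vy):
    """2-D cross product; zero when the two vectors lie along one direction."""
    return ux * vy - uy * vx


def is_collinear(p, q, r, tol=1e-9):
    """Co-Linearity: the three points lie on one straight line."""
    return abs(cross(q[0] - p[0], q[1] - p[1],
                     r[0] - p[0], r[1] - p[1])) <= tol


def is_parallel(edge_a, edge_b, tol=1e-9):
    """Parallel: the two edges run in the same or opposite direction."""
    (a0, a1), (b0, b1) = edge_a, edge_b
    return abs(cross(a1[0] - a0[0], a1[1] - a0[1],
                     b1[0] - b0[0], b1[1] - b0[1])) <= tol


def co_terminate(edges):
    """Co-Termination: all edges share at least one common end point."""
    shared = set(edges[0])
    for edge in edges[1:]:
        shared &= set(edge)
    return bool(shared)


# Two vertical sides of a drawn cylinder are parallel.
left_side = ((0, 0), (0, 2))
right_side = ((1, 0), (1, 5))
print(is_parallel(left_side, right_side))
```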
Marr's computational theory (1982) involves a series of representations
which provide increasingly detailed information about the visual
environment. The first is the primal sketch, which may be either the raw
primal sketch or the full primal sketch. The raw primal sketch contains
information about light-intensity variations in the visual scene, whereas
the full primal sketch uses this information to identify the number and
outline of visual objects. Two primal sketches are formed because the
intensity of light reflected from an object depends on its texture as well
as on the angle at which the light hits the object. This means that the
light-intensity changes in the raw primal sketch give an imperfect but
useful guide to the object's shapes and edges. The raw primal sketch is
formed from the 'grey-level representation' of the retinal image. This
representation is based on 'pixels', the very small individual areas of the
image, and their light intensities.
The light intensity reflected by each pixel fluctuates continuously, which
threatens to distort the representation. This noise can be eliminated by
smoothing, that is, by averaging the light intensities of neighbouring
pixels. A drawback is that smoothing can often lose important information.
To solve this problem, several representations of the image are formed,
each with a different amount of blurring. The information from these
representations is then combined, and the result is the raw primal sketch,
which is made up of four types of token: edge-segments, bars, terminations
and blobs. The full primal sketch is a further version of the primal
sketch, and it is important to note that both are two-dimensional.
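The idea of blurring the image by different amounts and then combining the
results can be illustrated with a single row of pixel intensities. This is
a simplified sketch under stated assumptions: it uses plain neighbourhood
averaging and a fixed threshold on intensity differences, which stand in
for the filtering Marr actually proposed, and the function names, radii
and threshold are hypothetical. An intensity change counts as an edge here
only if it survives at every level of blurring.

```python
# Simplified sketch: box averaging and a difference threshold are
# stand-ins chosen for illustration, not Marr's actual operators.

def smooth(intensities, radius):
    """Average each pixel with its neighbours within `radius` positions."""
    smoothed = []
    for i in range(len(intensities)):
        lo = max(0, i - radius)
        hi = min(len(intensities), i + radius + 1)
        window = intensities[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def edge_positions(intensities, threshold):
    """Indices i where intensity changes by more than `threshold` at i -> i+1."""
    return {
        i for i in range(len(intensities) - 1)
        if abs(intensities[i + 1] - intensities[i]) > threshold
    }


def edges_across_scales(intensities, radii=(0, 1), threshold=2.0):
    """Keep only the edges that appear at every amount of blurring."""
    per_scale = [
        edge_positions(smooth(intensities, r), threshold) for r in radii
    ]
    return set.intersection(*per_scale)


# A noisy spike at index 2 and a genuine dark-to-light edge after index 5.
row = [10, 10, 13, 10, 10, 10, 40, 40, 40, 40]
print(edges_across_scales(row))
```

In this row the unblurred version reports the noise spike as well as the
genuine edge, while the blurred version smears the genuine edge over
several pixels. Only the change after index 5 is present in both, which is
the sense in which combining representations recovers information that a
single amount of blurring would lose.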
Like the primal sketch, the 2½-D sketch is observer-centred (or
viewpoint-dependent). Marr (1982) identified several steps in the
transformation from the primal sketch into the 2½-D sketch. Firstly a 'range
map' is formed; next, related parts of this map are used to form higher-level
descriptions. For example, concave and convex junctions between one surface
and another are identified using information from the relevant parts of the
map. It is important to note that motion, shading, shape, texture and
binocular disparity are all used in the transformation from the primal
sketch into the 2½-D sketch.
The 3-D model representation is produced because the 2½-D sketch is
viewpoint-centred and so is a poor basis for identifying an object. The
representation of an object varies with the angle from which it is viewed,
making object recognition more complicated. By contrast, the 3-D model
representation describes the shapes of objects in a way that is 'viewpoint
invariant' (independent of the observer's viewpoint).
There are several similarities and differences between Biederman's RBC
theory and Marr's theory of early visual processing. Marr's theory appears
much more complex from the outset because of the variety of sketches and
models involved. Biederman's theory, by contrast, seems to treat object
recognition as starting at a basic level and then becoming more complex. The
two theories also differ in their basic units: Marr's theory suggests that
we recognise objects from their components and the shapes of these
components, which are cylinders, whereas Biederman's theory uses geons as
its basic units, which Marr's theory does not share. One similarity between
the two theories is that both may be described as involving top-down
processing, in that objects are perceived indirectly and our knowledge of
the world is used at the end of the process (Gross, 1996).
A theory that was fully viewpoint-independent would essentially specify that
objects are mentally represented as 3-D models, thus predicting that these
representations should be equally accessible from any point of view.
Biederman's model does not quite predict this because a structural
description lacks information about hidden parts, meaning that more than one
structural description might be required to identify the object. RBC theory
thus differs from Marr's theory in suggesting that once we see an object, we
are able to recognise it from having seen similar patterns in the past:
hence we recognise a tiger's tail as such because we recognise the pattern
of stripes, the shape, the texture and so on. Put simply, Biederman's theory
is based upon identifying an object's features and then using these to
identify the object's geons and their relationships. Visual memory is then
searched for an object that matches what has been detected (Gleitman et
al., 1998). As for Marr's theory, more is known about the processes involved
in constructing a range map than about the transition from the range map to
the 2½-D sketch, suggesting that there are still some areas of weakness in
Marr's theory.
A great deal can often be learned about the functioning of the brain after
it has been damaged: there are patients who have suffered specific cortical
lesions which lead to visual agnosia (Farah, 1990 in Gleitman et al., 1998).
Such patients are able to see but are not able to recognise what they see.
One patient demonstrated this clearly: his task was to copy some drawings,
but he could not name his drawing of a key and named the bird he had just
drawn as a tree stump (Farah, 1990 in Gleitman et al., 1998). The evidence
suggests that he had formed adequate structural descriptions of the objects
but could not access their meaning because his processing was blocked at
this point. This example shows that the visual system is able to describe
structure but may fail at the final stage, that of the object's meaning.
This links with Biederman's theory in that the object's features are
identified but are not then used to identify the object's geons and their
relationships. Using this patient as an example, it seems easier to accept
Biederman's RBC theory as more plausible than Marr's early processing
theory. This is because the patient is halted when it comes to the meaning
of the object but is still able to see it, and it is unclear how Marr's
theory could explain the patient's situation. Perhaps one way to understand
agnosia using Marr's theory is to suppose that a blockage is present
between the 2½-D sketch and the 3-D model representation, so that the final
representation lacks vital meaning and the object is difficult to identify.
The Gestalt view of pattern recognition depends mainly upon the overall
shape of a visual stimulus and less upon the object's individual features;
in this respect it is similar to Marr's theory and less similar to
Biederman's. Biederman's RBC theory was based upon Marr's theory of early
visual processing and so a relationship between the two is unavoidable. It
is vital to understand that Marr's theory of early visual processing and
Biederman's RBC theory are two from a large field of competing theories of
3-D object recognition. There are several differences and similarities
between the two theories, but essentially a relationship does exist because
both are plausible theories of object recognition. However, Marr's theory of
early visual processing seems more complex than Biederman's
recognition-by-components theory.
References
Atkinson, Rita L., Atkinson, Richard C., Smith, Edward E., Bem, Daryl J.
and Nolen-Hoeksema, Susan (2000) Hilgard's Introduction to Psychology,
Thirteenth Edition.
Gleitman, Henry, Fridlund, Alan J. and Reisberg, Daniel (1998) Psychology,
Fifth Edition.
Gross, Richard (1996) Psychology: The Science of Mind and Behaviour, Third
Edition.
Eysenck, Michael W. and Keane, Mark T. (2001) Cognitive Psychology: A
Student's Handbook, Fourth Edition.