Effects of Handling Real Objects and Self-Avatar Fidelity on Cognitive Task Performance and Sense of Presence in Virtual Environments
Immersive virtual environments (VEs) provide participants with computer-generated environments filled with virtual objects to assist in learning, training, and practicing dangerous and/or expensive tasks. But does making every object virtual inhibit interactivity and the level of immersion for certain tasks? If participants spend most of their time and cognitive effort learning and adapting to interaction with a purely virtual system, does this reduce the VE’s effectiveness?
We conducted a study that investigated how handling real objects and self-avatar visual fidelity affect performance and sense-of-presence on a spatial cognitive manual task. We compared participants’ performance of a block arrangement task in both a real-space environment and several virtual and hybrid environments. The results showed that manipulating real objects in a VE brings task performance closer to that of real space, compared to manipulating virtual objects. There was no significant difference in reported sense-of-presence, regardless of the self-avatar’s visual fidelity or the presence of real objects.
Keywords: Virtual Environments, Sense of Presence, Human-Computer Interaction
Conducting design evaluation and assembly feasibility evaluation tasks in immersive virtual environments (VEs) enables designers to evaluate and validate multiple alternative designs more quickly and cheaply than if mock-ups are built and more thoroughly than can be done from drawings. Design review has become one of the major productive applications of VEs [1]. Virtual models can be used to study the following important design questions:
· Can an artifact readily be assembled?
· Can repairers readily service it?
The ideal VE system would have the participant fully believe he was actually performing a task. In the assembly verification example, parts and tools would have mass, feel real, and handle appropriately. The participant would naturally interact with the virtual world, and in turn, the virtual objects would respond to the participant’s action appropriately [2].
Obviously, current VEs are far from that ideal system. Indeed, not interacting with every object as if it were real has distinct advantages, as in dangerous or expensive tasks. In current VEs, almost all objects in the environment are virtual, but both assembly and servicing are hands-on tasks, and the principal drawback of virtual models — that there is nothing there to feel, nothing to give manual affordances, and nothing to constrain motions — is a serious one for these applications.
Simulating a wrench with a six degree-of-freedom wand, for example, is far from realistic, perhaps too unrealistic to be useful. Imagine trying to simulate a task as basic as unscrewing an oil filter from an engine in such a VE!
Interacting with purely virtual objects could impose three limiting factors on VEs:
· It limits the types of feedback, such as motion constraints and haptics, that the system can provide to the user.
· The VE representation of real objects (real-object avatars) is usually stylized and not necessarily visually faithful to the object itself.
· It hinders real objects (including the user) from naturally interacting with virtual objects.
This work investigates the impact of the first two of these factors on task performance and sense of presence in a spatial cognitive task. As opposed to perceptual motor tasks (e.g., pick up a pen), cognitive tasks require problem-solving decisions on actions (e.g., pick up a red pen). Most design verification and training tasks are cognitive.
We extend our definition of an avatar to include a virtual representation of any real object, not just the participant. A real-object avatar is registered with its real object and ideally matches it in look, form, and function. The self-avatar refers specifically to the user’s virtual representation.
We believe a hybrid environment system, one that could handle dynamic real objects, would be effective in providing natural interactivity and visually-faithful self-avatars. In turn, this should improve task performance and sense of presence.
The advantages of interacting with real objects could enable applying VEs to tasks that are hampered by using all virtual objects. We believe spatial cognitive manual tasks, common in simulation and training VEs, would benefit from incorporating real objects. These tasks require problem solving through manipulating objects while maintaining spatial relationships.
The user is represented within the VE by a self-avatar chosen from a library of representations, by a generic self-avatar, or by no self-avatar at all. A survey of VE research shows the most common approach is a generic self-avatar: literally, one size fits all [1]. The participant’s self-avatars are typically stylized human models, such as those found in commercial packages. These models, while containing a substantial amount of detail, do not visually match a participant’s appearance.
Researchers believe that providing generic self-avatars substantially improves sense-of-presence over providing no self-avatar [3]. However, they hypothesize that the visual misrepresentation of self would reduce a participant’s sense-of-presence, that is, how much he believed he was “in” the virtual world. Usoh hypothesizes, “Substantial potential presence gains can be had from tracking all limbs and customizing [self-]avatar appearance” [4].
Recent studies suggest that even crude self-avatar representations convey substantial information. Even having some representation of the participants in the environment was important for navigation, social interaction, and task performance [5]. With self-avatars, emotions such as embarrassment, irritation, and self-awareness could be generated [6][7].
Providing realistic self-avatars requires capturing the participant’s motion, shape, and appearance. In general, VE systems attach extra trackers to the participant for sensing changing positions to drive an articulated stock self-avatar model. Presenting and controlling an accurate representation of the participant’s shape and pose is difficult due to the human body’s deformability and numerous degrees of freedom. Matching the virtual look to the physical reality is difficult to do dynamically, though commercial systems, such as the AvatarMe system, that generate static-textured, personalized self-avatars are available [8].
Ideally, a participant should be able to interact with the VE by natural speech and natural body motions. The VE system would understand and react to expressions, gestures, and motion. The difficulty is in capturing this information, both for rendering images and for input to simulations.
The fundamental interaction problem is that most things are not real in a virtual environment. In an effort to address this, some VEs provide tracked, instrumented real objects as input devices. Common interaction devices include an articulated glove with gesture recognition or buttons (Immersion’s Cyberglove), a tracked mouse (Ascension Technology’s 6D Mouse), or a tracked joystick (Fakespace’s NeoWand).
Another approach is to engineer a device for a specific type of interaction. This typically improves interaction affordance, so that the participant interacts with the system in a more natural manner. For example, augmenting a doll’s head with sliding rods and trackers enables doctors to more naturally select cutting planes for visualizing MRI data [9]. However, this specialized engineering is time-consuming and often usable for only a particular type of task. VE interaction studies have been done on interaction ontologies [10], interaction methodologies [11], and 3-D GUI widgets and physical interaction [12].
We set out to answer the following questions for cognitive tasks:
· Does interacting with real objects improve task performance?
· Does seeing a visually faithful self-avatar improve sense-of-presence?
To test this, we employed a hybrid system that can incorporate dynamic real objects into a VE. It uses multiple cameras to generate virtual representations of real objects at interactive rates [13]. Thus we could investigate how cognitive task performance is affected by interacting with real versus virtual objects. The results would be useful for training and assembly verification VEs, which often require problem solving while interacting with tools and parts.
Video capture of real-object appearance also has another potential advantage: enhanced visual realism. Generating virtual representations of the participant in real time would allow the system to render a visually faithful self-avatar. The real-object appearance is captured from a camera that has a line of sight similar to the participant’s. Thus the system also allows us to investigate whether having a visually faithful self-avatar, as opposed to a generic self-avatar, increases sense-of-presence. The results will provide insight into the need to invest the additional effort to render a high-fidelity visual self-avatar. This will be useful for immersive virtual environments that aim for high sense-of-presence, such as phobia treatment and entertainment VEs.
We sought to abstract tasks common to VE design applications. In surveying production VEs [1], we noted that a substantial number involve participants doing spatial cognitive manual tasks.
We specifically wanted to use a task that focused on cognition and manipulation over participant dexterity or reaction speed because of current technology, typical VE applications, and participant physical variability. We conducted a user study on a block arrangement task. We compared a purely virtual task system and two hybrid task systems that differed in level of visual fidelity. In all three cases, we used a real-space task as a baseline.
The task we designed is similar to, and based on, the block design portion of the Wechsler Adult Intelligence Scale (WAIS). Developed in 1939, the Wechsler Adult Intelligence Scale is a test widely used to measure IQ [14]. The block-design component measures reasoning, problem solving, and spatial visualization.
In the standard WAIS block design task, participants manipulate one-inch cubes to match target patterns. As the WAIS test is copyrighted, we modified the task to still require cognitive and problem solving skills while focusing on interaction methodologies. Also, the small one-inch cubes of the WAIS would be difficult to manipulate with purely virtual approaches and hamper the conditions that used the reconstruction system due to reconstruction error. We increased the size of the blocks to three-inch cubes, as shown in Figure 1.

Participants manipulated four or nine identical wooden blocks to make the top faces of the blocks match a target pattern. Each cube had six patterns on its faces that represented the possible quadrant-divided white-blue patterns. There were two target pattern sizes: small four-block patterns in a 2 x 2 arrangement, and large nine-block patterns in a 3 x 3 arrangement.
The user study was a between-subjects design. Each participant performed the task in a real space environment (RSE), and then in a VE condition. The independent variables were the VE interaction modality (real or virtual blocks) and the VE self-avatar visual fidelity (generic or visually faithful). The three VE conditions had:
· Virtual objects, generic self-avatar (purely virtual environment - PVE)
· Real objects, generic self-avatar (hybrid environment - HE)
· Real objects, visually faithful self-avatar (visually-faithful hybrid environment - VFHE)

The task was accessible to all participants, and the target patterns were intentionally of a medium difficulty (determined through pilot testing). Our goal was to use target patterns that were not so cognitively easy as to be manual dexterity tests, nor so difficult that participant spatial ability dominated the data. The participants were randomly assigned to one of the three groups, 1) RSE then PVE, 2) RSE then HE, or 3) RSE then VFHE (Figure 2).

Real Space Environment (RSE). The participant sat at a desk (Figure 3) with nine wooden blocks inside a rectangular enclosure. The side facing the participant was open and the whole enclosure was draped with a dark cloth. Two small lights lit the inside of the enclosure. A television placed atop the enclosure displayed the video feed from a “lipstick camera” mounted inside the enclosure. The camera had a similar line of sight as the participant, and the participant performed the task while watching the TV.

Purely Virtual Environment (PVE). Participants stood at a four-foot high table, and wore Fakespace Pinchgloves, each tracked with Polhemus Fastrak trackers, and a Virtual Research V8 head-mounted display (HMD) (Figure 4). The participant picked up a virtual block by pinching two fingers together (i.e. thumb and forefinger). When the participant released the pinch, the virtual block was dropped and an open hand avatar was displayed. The self-avatar’s appearance was generic (its color was a neutral gray).
The block closest to an avatar’s hand was highlighted to inform the participant which block would be selected by pinching. Pinching caused the virtual block to snap into the virtual avatar’s hand, and the hand appeared to be holding the block. To rotate the block, the participant rotated his hand while maintaining the pinching gesture.
Releasing the block within six inches of the workspace surface caused the block to snap into an unoccupied position in a three-by-three grid on the table. This reduced the fine-grained interaction that would have artificially inflated the time to complete the task. Releasing the block away from the grid caused it to simply drop onto the table. Releasing the block more than six inches above the table caused the block to float in mid-air to aid in rotation. There was no inter-block collision detection, and block interpenetration was not automatically resolved.
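The PVE selection and release rules described above can be summarized in a short Python sketch. This is an illustration only, not the study's implementation: the function names, the coordinate convention (inches, with the grid occupying a 9 x 9 inch square starting at the origin), and the choice of the nearest free cell as the snap target are all assumptions, since the text says only that the block snapped into an unoccupied position.

```python
import math

BLOCK_IN = 3.0        # block edge length in inches (from the text)
GRID_N = 3            # three-by-three grid (from the text)
SNAP_HEIGHT_IN = 6.0  # release-height threshold in inches (from the text)


def nearest_block(hand, blocks):
    """Index of the block closest to the hand, i.e. the one to highlight."""
    return min(range(len(blocks)), key=lambda i: math.dist(hand, blocks[i]))


def cell_center(cell):
    """Table-plane centre of a grid cell (assumed layout, origin at a grid corner)."""
    row, col = divmod(cell, GRID_N)
    return ((col + 0.5) * BLOCK_IN, (row + 0.5) * BLOCK_IN)


def on_release(x, y, height, occupied):
    """Decide what happens to a block released at (x, y), `height` inches above the table.

    Returns ("float", None), ("drop", None), or ("snap", cell_index).
    """
    if height > SNAP_HEIGHT_IN:
        return ("float", None)  # released high: block hovers to aid rotation
    side = GRID_N * BLOCK_IN
    over_grid = 0.0 <= x <= side and 0.0 <= y <= side
    free = [c for c in range(GRID_N * GRID_N) if c not in occupied]
    if not over_grid or not free:
        return ("drop", None)   # released away from the grid: falls onto the table
    # Assumption: snap to the nearest unoccupied cell.
    target = min(free, key=lambda c: math.dist((x, y), cell_center(c)))
    return ("snap", target)
```

For example, under these assumptions `on_release(1.0, 1.0, 2.0, set())` snaps the block into cell 0, while the same release at a height of 7 inches leaves it floating.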

Hybrid Environment (HE). Participants wore yellow dishwashing gloves and the HMD (Figure 5). Within the VE, participants handled physical blocks, identical to the RSE blocks, and saw a self-avatar with accurate shape and generic appearance (due to the gloves).

Visually-Faithful Hybrid Environment (VFHE). Participants wore only the HMD. The self-avatar was visually faithful, as the shape reconstruction was texture-mapped with images from an HMD-mounted camera. The participant saw an image of his own hands (Figure 6).
Virtual Environment. The VE room was identical in all three of the virtual conditions (PVE, HE, VFHE). It had several virtual objects, including a lamp, plant, and painting, along with a virtual table that was registered with a real Styrofoam table. The enclosure in the RSE was also rendered with transparency in the VE (Figure 7).

All the VE conditions were rendered on an SGI Reality Monster. The PVE ran on one rendering pipe at a minimum of twenty FPS. The HE and VFHE ran on four rendering pipes at a minimum of twenty FPS for virtual objects and twelve FPS for reconstructing real objects. The reconstruction system used 4 cameras, with 0.3 seconds of estimated latency, and 1 cm reconstruction error. The participant wore a Virtual Research V8 HMD (640 x 480 resolution) that was tracked with the UNC HiBall optical tracker.
Rationale for Conditions. We expected a participant’s RSE (no VE equipment) performance to produce the best results, as the interaction and visual fidelity were optimal. Thus, we compared how close a participant’s task performance in the VE was to their RSE task performance. We compared the reported sense-of-presence in the VE conditions to each other.
The RSE was used for task training to reduce variability in individual task performance and as a baseline. The block design task had a learning curve (examined through pilot testing), and performing the task in the RSE allowed participants to become proficient without spending additional time in the VE. We limited VE time to fifteen minutes, as many pilot subjects complained of fatigue after that amount of time.
Task Performance. Participants were timed on correctly replicating the target pattern. We also recorded whether the participant incorrectly concluded that the target pattern was replicated. In these cases, the participant was informed and continued to work on the pattern. Each participant eventually completed every pattern correctly.
Sense-of-presence. Participants answered the Slater-Usoh-Steed Presence Questionnaire (SUS) after completing the task in the VE condition [15].
Other Factors. We also measured spatial ability and simulator sickness by using the Guilford-Zimmerman Aptitude Survey, Part 5: Spatial Orientation and the Kennedy-Lane Simulator Sickness Questionnaire.
Participant Reactions. After the VE session, we interviewed the participant on their impressions of their experience. We recorded self- and experimenter-reported behaviors.
All participants completed a consent form and questionnaires to gauge their physical and mental condition, simulator sickness, and spatial ability.
Real Space. Next, the participant entered the room with the real space environment (RSE) setup. The participant was presented with the wooden blocks and was instructed on the task. The participant was also told that they would be timed, and to examine the blocks and become comfortable with moving them. The cloth on the enclosure was lowered, and the TV turned on.
The participant was given a series of six practice patterns, three small (2 x 2) and then three large (3 x 3). The participant was told the number of blocks involved in a pattern, and to notify the experimenter when they were done. After the practice patterns were completed, a series of six timed test patterns were administered, three small and three large. Between patterns, the participant was asked to randomize the blocks’ orientations. The order of the patterns that each participant saw was unique, though all participants saw the same twenty patterns (real space: six practice, six timed; VE: four practice, four timed).
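One way to produce a per-participant pattern order with these counts is sketched below in Python. It is a hypothetical illustration, not the procedure the study used: the function name, the seeding by participant ID, and the assumption that the twenty patterns split into ten small and ten large (three plus three per size in real space, two plus two per size in the VE session) are introduced here. A seeded shuffle makes each order reproducible; it makes identical orders for two participants very unlikely but does not strictly guarantee uniqueness.

```python
import random


def pattern_schedule(participant_id, small, large):
    """Split ten small and ten large patterns into the four session phases.

    Each phase presents its small patterns first, then its large ones.
    """
    if len(small) != 10 or len(large) != 10:
        raise ValueError("expected ten small and ten large patterns")
    rng = random.Random(participant_id)  # reproducible per participant
    s, l = list(small), list(large)
    rng.shuffle(s)
    rng.shuffle(l)
    return {
        "rse_practice": s[0:3] + l[0:3],   # three small, three large
        "rse_timed": s[3:6] + l[3:6],      # three small, three large
        "ve_practice": s[6:8] + l[6:8],    # two small, two large
        "ve_timed": s[8:10] + l[8:10],     # two small, two large
    }
```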
We recorded the time required to complete each test pattern correctly. If the participant misjudged the completion of the pattern, we noted this as an error and told the participant that the pattern was not yet complete, and to continue working on the pattern. We did not stop the clock on errors. The final time was used as the task performance measure for that pattern.
Virtual Space. Next, the participant entered a different room where the experimenter helped the participant put on the HMD and any additional equipment particular to the VE condition (PVE – tracked pinch gloves, HE – dishwashing gloves). Following a period of adaptation to the VE, the participant practiced on two small and two large patterns. The participant then was timed on two small and two large test patterns. A participant could ask questions and take breaks between patterns if so desired. Only one person (a PVE participant) asked for a break.
Post Experience. Finally, the participant was interviewed about their impressions of and reactions to the session. The debriefing session was a semi-structured interview. The specific questions asked were only starting points, and the interviewer could delve more deeply into responses for further clarification or to explore unexpected conversation paths.
The participant filled out the simulator sickness questionnaire again. By comparing their pre- and post-experience scores, we could assess if their level of simulator sickness had changed while performing the task. Finally, an expanded Slater-Usoh-Steed Virtual Presence Questionnaire was given to measure the participant’s sense of presence in the VE.
Managing Anomalies. If the head or hand tracker lost tracking or crashed, we quickly restarted the system (about 5 seconds). In almost all the cases, the participants were so engrossed with the task they never noticed the lack of tracking and continued working. We noted long or repeated tracking failures, and participants who were tall (which gave the head tracker problems) were allowed to sit to perform the task. None of the tracking failures appeared to significantly affect the task performance time.
On hand were additional patterns for replacement of voided trials, such as if a participant dropped a block onto the floor. This happened twice and was noted.
Task Performance. Participants who manipulate real objects in the VE (HE, VFHE) will complete the spatial cognitive manual task significantly closer to their RSE task performance than will participants who manipulate virtual objects (PVE), i.e. interacting with real objects improves task performance. Further, there will not be a significant difference in task performance for VFHE and HE participants, i.e. interacting with real objects improves task performance regardless of self-avatar visual fidelity.
Sense-of-Presence. Participants represented in the VE by a visually faithful self-avatar (VFHE) will report a higher sense-of-presence than will participants represented by a generic self-avatar (PVE, HE), i.e. avatar visual fidelity increases sense-of-presence. Further, there will not be a significant difference in sense-of-presence for HE and PVE participants, i.e. generic self-avatars would have similar effects on sense-of-presence regardless of the presence of real objects.
We use a two-tailed t-test with unequal variances and a significance level of α = 0.05.
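A two-sample t-test with unequal variances is commonly known as Welch's t-test. The sketch below, using only the Python standard library, shows how the test statistic and the Welch-Satterthwaite degrees of freedom are computed; the function name and the sample values in the usage note are illustrative assumptions, not data from the study.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

For the made-up samples `[1, 2, 3, 4, 5]` and `[2, 4, 6, 8, 10]`, this gives t of about -1.90 with about 5.88 degrees of freedom. The two-tailed p-value then comes from the t distribution with that many degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and reports the p-value directly.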
Forty participants completed the study: thirteen each in the purely virtual environment (PVE) and the hybrid environment (HE), and fourteen in the visually-faithful hybrid environment (VFHE). They were primarily male (thirty-three) undergraduate students enrolled at UNC-CH (thirty-one). Participants were recruited from UNC-CH Computer Science classes and by word of mouth.
They reported little prior VE experience (M=1.37, s.d.=0.66), high computer usage (M=6.39, s.d.=1.14), and moderate – 1 to 5 hours a week – computer/video game play, on [1..7] scales. There were no significant differences between the groups.
During the recruiting process, we required participants to have taken or be currently enrolled in a higher-level mathematics course (equivalent of a Calculus 1 course). This greatly reduced participant spatial ability variability, and in turn reduced task performance variability.
The dependent variable for task performance was the difference in the time to correctly replicate the target pattern in the VE condition compared to the RSE.
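As a small illustration of this measure, the Python sketch below computes one participant's VE-minus-RSE time difference for one pattern size. The function name and the choice to average the timed patterns of that size within each environment are assumptions made for the example; the times in the usage note are invented.

```python
from statistics import mean


def time_difference(ve_times, rse_times):
    """Mean VE completion time minus mean RSE completion time, in seconds.

    A value near zero means VE performance was close to real-space performance.
    """
    return mean(ve_times) - mean(rse_times)
```

For instance, hypothetical VE times of 40 and 50 seconds against RSE times of 20, 25, and 30 seconds give a difference of 20 seconds.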

Table 1 – Task performance results
|
|
Small Pattern Time (seconds) |
|||