
Review of Human Gaze Control

A Model of Saliency-Based Visual Attention for Rapid Scene Analysis

Contextual Guidance of Attention in Natural Scenes: The Role of Global Features in Object Search

Human experiments

  • Task: counting people/paintings/mugs
  • Evaluation of eye movements
  • Exhaustive search regardless of the true number of targets present
  • Smaller objects lead to longer search times
  • Participants are highly consistent with one another

Summary

  • Implemented a computational Bayesian model of attention and demonstrated the importance of scene-schema knowledge for visual search tasks
  • Improved the performance of the saliency-map model by adding global features
  • Provided a baseline system combining bottom-up and top-down processes

Issues

  • Training of the model is based on a relatively small training set, so the model may not perform well on unseen test images
  • When using EM, only one starting point is chosen, which may lead to a local optimum instead of the global optimum. A possible solution is to run EM several times from different random starting points (see the sketch after this list)
  • Using horizontal layers to compute global features instead of object recognition is computationally cheap, but it may fail to give a good description in some cases
  • R, G, and B are computed separately; it may also be necessary to combine these channels
  • Only the bottom-up and full models were compared; additional experiments on a top-down-only model would be informative
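
A minimal sketch of the multiple-restart idea suggested above. It uses scikit-learn's GaussianMixture with several EM restarts on toy, hypothetical feature vectors; the paper's own EM implementation and features are not reproduced here.

```python
# Minimal sketch: run EM from several random initializations and keep the
# best-scoring fit. GaussianMixture is used purely for illustration; the
# original model's EM code is not shown here.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for global-feature vectors from a training set (hypothetical data).
features = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(200, 2)),
    rng.normal(loc=+2.0, scale=0.8, size=(200, 2)),
])

# n_init=10 restarts EM from 10 random seeds and keeps the solution with the
# highest likelihood, reducing the risk of stopping at a poor local optimum.
gmm = GaussianMixture(n_components=2, n_init=10, random_state=0)
gmm.fit(features)
print("best average log-likelihood:", gmm.score(features))
```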

General Discussion

Thank you for your ATTENTION!

Similarities

  • Modeling the early stage of visual processing
  • Feed-forward and parallel structure
  • Static images as stimuli

Open Questions

  • What features should be included in a saliency map, and how should they be weighted?
  • Gaze control for moving objects?
  • How can episodic knowledge be modeled?
  • How can such systems be made to perform as efficiently as humans?

Differences

Paper 1:

  • Bottom-up, stimulus-based
  • Uses only local features
  • Multi-scale image properties: color, intensity, orientation
  • Based on a neural network

Paper 2:

  • Combines bottom-up & top-down
  • Uses local & global features
  • Saliency map: only orientation in the R, G, and B components
  • Based on a Bayesian framework

Visual Attention

References

  • [1] Henderson, J. M. (2003). Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7(11), 498-504.
  • [2] Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254-1259.
  • [3] Torralba, A., Oliva, A., Castelhano, M. S., & Henderson, J. M. (2006). Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113(4), 766-786.

Leimin Tian

Shang Zhao

Introduction

  • Primate visual system
      • Reduces complexity before time-consuming processing
      • Selects a subset of the scene
  • Feature integration theory
      • Pre-attentive stage: features registered in parallel
      • Focused attention stage: features at the attended location are combined to perceive the whole object
  • Bottom-up, without top-down guidance
      • Fast selection of a small number of interesting image locations

Experiments and Results

  • Detects salient traffic signs quickly
  • Robust to noise that does not directly interfere with the main feature of the target
  • Predicts objects of interest: faces and flags
  • Comparison with the Spatial Frequency Content (SFC) model (see the sketch after this list)
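
For illustration, a rough sketch of what a spatial frequency content (SFC) measure could look like: the fraction of non-negligible 2-D FFT coefficients in small local patches. The patch size and threshold below are assumptions, not the exact settings used in the paper's comparison.

```python
# Illustrative SFC measure: for each 16x16 patch, count the fraction of FFT
# coefficients above a small threshold. Patch size and threshold are assumed.
import numpy as np

def sfc_map(gray: np.ndarray, patch: int = 16, thresh_ratio: float = 0.05) -> np.ndarray:
    h, w = gray.shape
    out = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            block = gray[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            spec = np.abs(np.fft.fft2(block))
            spec[0, 0] = 0.0                      # ignore the DC component
            out[i, j] = np.mean(spec > thresh_ratio * spec.max())
    return out

# Example: a random image stands in for a real scene.
print(sfc_map(np.random.rand(128, 128)).shape)    # -> (8, 8)
```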

Model

(Itti et al., 1998)

  • Spatial frequency: any structure that is periodic across position in space
      • Poor performance with speckle noise
      • Informative regions are not simply those with high SFC
  • Images at 9 spatial scales
      • Fine center scales: c = {2, 3, 4}
      • Coarse surround scales: s = c + d, d = {3, 4}
  • Features:
      • Normalized colors: R, G, B, Y
      • Intensity: (R + G + B) / 3
      • Orientations: {0°, 45°, 90°, 135°}
  • Feature maps: center-surround, across-scale subtraction
      • Intensity contrast: dark centers against bright surrounds, or vice versa
      • Color double-opponency: red/green and blue/yellow
      • Orientation contrast: center vs. surround
  • Conspicuity maps:
      • Normalization: promotes maps with strong peaks while suppressing homogeneous ones
      • Across-scale addition
  • Saliency map:
      • Average of the three conspicuity maps, equally weighted
  • Neural network:
      • Leaky integrate-and-fire neurons
      • Winner-take-all selection (see the sketch after this list)
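
The pipeline above can be summarized in a short sketch. The Python code below, restricted to the intensity channel for brevity, builds a Gaussian pyramid, takes center-surround differences across scales (c = {2,3,4}, s = c + d, d = {3,4}), applies a crude stand-in for the paper's normalization operator, and reads out the most salient location with a simple argmax in place of the winner-take-all network. The smoothing, resizing, and normalization choices are simplifications, not the authors' exact implementation.

```python
# Center-surround saliency sketch for the intensity channel (after Itti et al., 1998).
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def pyramid(img, levels=9):
    """Gaussian pyramid: level k is smoothed and downsampled by 2**k."""
    pyr = [img]
    for _ in range(levels - 1):
        img = gaussian_filter(img, sigma=1.0)[::2, ::2]
        pyr.append(img)
    return pyr

def resize_to(src, shape):
    """Resample a map so maps at different scales can be compared."""
    return zoom(src, (shape[0] / src.shape[0], shape[1] / src.shape[1]), order=1)

def normalize(m):
    """Crude stand-in for the paper's N(.) operator: rescale to [0, 1], then
    emphasize maps whose global peak stands out from their mean activity."""
    m = m - m.min()
    if m.max() > 0:
        m = m / m.max()
    return m * (m.max() - m.mean()) ** 2

def intensity_conspicuity(gray):
    pyr = pyramid(gray, levels=9)
    out = np.zeros_like(pyr[4])          # accumulate at a common coarse scale
    for c in (2, 3, 4):                  # fine "center" scales
        for d in (3, 4):                 # surround offset: s = c + d
            center = resize_to(pyr[c], out.shape)
            surround = resize_to(pyr[c + d], out.shape)
            out += normalize(np.abs(center - surround))   # across-scale subtraction
    return normalize(out)

# Example: a bright square on a dark background should pop out; the argmax
# stands in for the winner-take-all readout.
img = np.zeros((256, 256))
img[100:120, 100:120] = 1.0
saliency = intensity_conspicuity(img)
print("most salient location (coarse grid):",
      np.unravel_index(saliency.argmax(), saliency.shape))
```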

Conclusion and Discussion

(Itti et al., 1998)

Human Gaze

Bottom-up Stimulus-based

  • Summary
      • Bottom-up saliency map to guide visual attention
      • Feed-forward feature-extraction mechanisms
      • Massively parallel architecture
      • Multi-scale features: intensity, color, orientation
      • Biologically plausible neural networks
      • Successfully detects local salient targets
      • Shows some robustness to noisy images
  • Issues
      • Model predictions vs. human fixation statistics
      • Unimplemented feature types (e.g., T-junctions or line terminators)
      • A weighted linear combination of image properties
      • Recurrent mechanisms for contour completion and closure

High-quality visual information is acquired only from a limited spatial region surrounding the center of gaze (the point of fixation).

Why Important?

  • Scene statistics
      • Differences between fixated patches and unselected patches
  • Saliency map
      • Model prediction of fixations using scene statistics (color, orientation, contrast, edge density, etc.)
  • Gaze is the first step of visual cognition
  • Eye movements serve as a window into the operation of the attentional system
  • Plays an important role in studies of human language

Connection

Top-down Knowledge-driven

Where do we fixate?

  • Paper 2 is a more recent work combining the bottom-up process with the top-down process
      • Global context modulates local salient features
      • A more integrated model of visual attention
  • Episodic scene knowledge
      • e.g., the clock on the wall
  • Scene-schema knowledge
      • e.g., more likely to find Superman in the sky than on the road
  • Task-related knowledge
      • e.g., visual search vs. memorization
  • Early studies of gaze control demonstrated that empty, uniform, and uninformative scene regions are often not fixated
  • Viewers instead concentrate their fixations, including the very first fixation in a scene, on interesting and informative regions
  • What is an "interesting and informative region"?
Comparison of the saliency model, the full model, and human observers

  • All performed well above chance level (chance based on the ratio of target area to the whole image)
  • The contextual guidance model performed better than the saliency-map-only model
  • In the people-search task, humans tended to start by fixating image locations selected by global features
  • In the mug-search task, the saliency model performed almost as well as the full model
  • Humans performed better than all models in all tasks

Experiments and Results

Saliency Map Alone vs. Contextual Guidance Model

Papers

(Torralba et al., 2006)

Introduction

  • Paper 1: Bottom-up
      • A Model of Saliency-Based Visual Attention for Rapid Scene Analysis (Itti et al., 1998)
  • Paper 2: Bottom-up + Top-down
      • Contextual Guidance of Attention in Natural Scenes: The Role of Global Features in Object Search (Torralba et al., 2006)

Conclusion and Discussion

  • Provided a contextual guidance model combining the bottom-up process with the top-down process
  • Modeled attention using a Bayesian framework
  • Uses two parallel pathways: local features (saliency map) and global (scene context) features
  • Features are computed in a feed-forward manner
  • The top-down process mainly uses scene-schema knowledge to select image regions relevant to the visual search task

Model

(Torralba et al., 2006)

  • Local features: orientations computed from the R, G, and B channels separately by passing them through a filter bank; PCA then reduces the dimensionality
  • Global features (Bayesian decomposition): p(O = 1, X | L, G) = [p(L | O = 1, X, G) * p(X | O = 1, G) * p(O = 1 | G)] / p(L | G), with the densities estimated from the training set
  • Scene-modulated saliency map: S(X) = p(L | G)^(-r) * p(X | O = 1, G), with r fitted via EM (see the sketch after this list)
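
A minimal sketch of the scene-modulated saliency map defined above, under illustrative assumptions: random stand-in local features, a Gaussian mixture fitted by EM for p(L|G), a fixed exponent r, and a 1-D vertical Gaussian as the contextual prior p(X|O=1, G). None of these choices reproduce the paper's exact estimators.

```python
# Sketch of S(X) = p(L|G)^(-r) * p(X|O=1, G) with assumed, illustrative inputs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
H, W = 64, 64

# Stand-in local features L(X), e.g., oriented-filter energies per pixel after
# a hypothetical PCA step (random here).
local_features = rng.normal(size=(H * W, 6))

# Bottom-up term: p(L|G) modeled as a mixture of Gaussians fitted with EM;
# rarer features (low density) become more salient via the negative exponent.
gmm = GaussianMixture(n_components=4, random_state=0).fit(local_features)
log_p_l = gmm.score_samples(local_features).reshape(H, W)
r = 0.5                                    # exponent; assumed fixed in this sketch
bottom_up = np.exp(-r * log_p_l)           # p(L|G)^(-r)

# Top-down term: contextual prior p(X|O=1, G) concentrating probability on
# likely target rows (e.g., mugs near counter height) -- a 1-D Gaussian here.
rows = np.arange(H)
prior_rows = np.exp(-0.5 * ((rows - 40) / 6.0) ** 2)
contextual_prior = np.tile(prior_rows[:, None], (1, W))
contextual_prior /= contextual_prior.sum()

# Scene-modulated saliency map and its most likely fixation location.
S = bottom_up * contextual_prior
print("predicted fixation (row, col):", np.unravel_index(S.argmax(), S.shape))
```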

15 March 2013
