A Review of Human Gaze Control
A Model of Saliency-Based Visual Attention
for Rapid Scene Analysis
Contextual Guidance of Eye Movements and Attention in Real-World Scenes:
The Role of Global Features in Object Search
Human experiments
- Task: counting people/paintings/mugs
- Evaluation of eye movements
- Search was exhaustive regardless of the true number of targets present
- Smaller objects required longer search times
- Participants were highly consistent with one another
Summary
- Implemented a Bayesian computational model of attention and demonstrated the importance of scene-schema knowledge for visual search tasks
- Improved the performance of the saliency-map model by adding global features
- Provided a baseline system combining bottom-up and top-down processing
Issues
- The model is trained on a relatively small training set and may not generalize well to unseen test images
- EM is run from a single starting point, which can converge to a local rather than the global optimum; a possible fix is to run EM from several random initializations and keep the best fit (see the sketch after this list)
- Computing global features from horizontal layers instead of object recognition is computationally cheap, but it may fail to give a good scene description in some cases
- R, G, and B are computed separately; combining these channels may also help
- Only the bottom-up and full models were compared; experiments with a top-down-only model would strengthen the analysis
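The restart remedy above can be sketched in a few lines, here using scikit-learn's GaussianMixture as a stand-in for the paper's EM step; the data X and the number of components are hypothetical placeholders, not the paper's actual setup.

```python
# Minimal sketch: run EM from several random starting points and keep
# the fit with the highest log-likelihood lower bound. X and the number
# of components are placeholders, not the paper's actual setup.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))              # placeholder training data

best_model, best_score = None, -np.inf
for seed in range(10):                     # 10 random initializations
    gm = GaussianMixture(n_components=3, random_state=seed).fit(X)
    if gm.lower_bound_ > best_score:
        best_model, best_score = gm, gm.lower_bound_
```

Equivalently, GaussianMixture(n_components=3, n_init=10) performs the restarts internally and keeps the best run.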
General Discussion
- Modeling the early stage of visual processing
- Feed-forward and parallel structure
- Static images as stimuli
- What features should be involved in a saliency map and how are they weighted?
- Gaze control of moving objects?
- How to model episodic knowledge?
- How can such systems be made as efficient as human vision?
Paper 1:
- Bottom-up stimulus-based
- Uses only local features
- Multi-scale image properties: color, intensity, orientation
- Based on a neural network
Paper 2:
- Combines Bottom-up & Top-down
- Uses local & global features
- Saliency map: only orientation features, computed on the R, G, and B channels
- Based on a Bayesian framework
Visual Attention
- [1] J. M. Henderson. 2003. Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7(11), 498–504.
- [2] L. Itti, C. Koch, and E. Niebur. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.
- [3] A. Torralba, A. Oliva, M. S. Castelhano, and J. M. Henderson. 2006. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113(4), 766–786.
Leimin Tian
Shang Zhao
- Primate visual system
- Reduce complexity before time-consuming processing
- Select a subset of the scene
- Feature integration theory
- Pre-attentive stage
- Features registered in parallel
- Focused attention stage
- Features at the attended location are combined to perceive the whole object
- Bottom-up without top-down guidance
- Fast selection of a small number of interesting image locations
- Detect salient traffic signs quickly
- Robust to noise that does not directly interfere with the target's main feature
- Predicts objects of interest (e.g., faces and flags)
- Comparison with Spatial Frequency Content Model
- Spatial frequency: any structure that is periodic across position in space (a patch-wise SFC sketch follows this list)
- The SFC model performs poorly on images corrupted by speckle noise
- Salient regions are informative, not merely high in SFC
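One way to make the SFC comparison concrete: a minimal sketch that scores each image patch by the fraction of non-negligible 2-D FFT coefficients. The 16-pixel patch size and 3% threshold are illustrative assumptions, not the paper's exact parameters.

```python
# Patch-wise spatial-frequency-content (SFC) map: each patch is scored
# by the fraction of FFT coefficients whose magnitude exceeds a small
# fraction of the patch maximum. Parameters are illustrative only.
import numpy as np

def sfc_map(image, patch=16, thresh=0.03):
    h, w = image.shape
    out = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            block = image[i * patch:(i + 1) * patch,
                          j * patch:(j + 1) * patch]
            spec = np.abs(np.fft.fft2(block))
            out[i, j] = np.mean(spec > thresh * spec.max())
    return out  # high values = rich spatial frequency content
```

Under this measure, speckle noise scores high everywhere, which illustrates why high SFC alone is a poor proxy for saliency.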
- Images in 9 spatial scales
- Fine center: c = {2,3,4}
- Coarse surround: s = c + d, d = {3,4}
- Features:
- Normalized colors: R, G, B, Y
- Intensity: I = (R + G + B) / 3
- Orientations: {0°, 45°, 90°, 135°}
- Feature maps: center-surround, across-scale subtraction
- Intensity contrast: dark centers against bright surrounds, or vice versa
- Color double-opponency: red/green and blue/yellow
- Orientation contrast: center vs. surround
- Conspicuity maps:
- Normalization: promotes maps with strong peaks while suppressing homogeneous ones
- Across-scale addition
- Saliency map: equally weighted average of the three conspicuity maps (see the sketch after this list)
- Neural Network
- Leaky integrate-and-fire
- Winner-take-all
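The pipeline above can be sketched end to end for one channel. This is a minimal illustration of the intensity pathway, assuming a square float image of at least 256×256; the peak-based normalization is a simplified stand-in for the paper's N(.) operator, and the color and orientation channels would follow the same center-surround pattern.

```python
# Sketch of the intensity conspicuity pathway: 9-level Gaussian pyramid,
# center-surround across-scale subtraction for c in {2,3,4}, d in {3,4},
# peak-based normalization, and across-scale addition.
import numpy as np
from scipy.ndimage import zoom, gaussian_filter, maximum_filter

def pyramid(img, levels=9):
    """Gaussian pyramid: each level is smoothed and downsampled by 2."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])
    return pyr

def center_surround(pyr, c, s, shape):
    """Across-scale subtraction: resize both levels to a common size, then diff."""
    up = lambda k: zoom(pyr[k], (shape[0] / pyr[k].shape[0],
                                 shape[1] / pyr[k].shape[1]), order=1)
    return np.abs(up(c) - up(s))

def normalize(m):
    """Approximation of N(.): promote maps whose global peak stands out
    from the average local maximum; suppress homogeneous maps."""
    m = (m - m.min()) / (m.max() - m.min() + 1e-12)
    peaks = m[(m == maximum_filter(m, size=8)) & (m > 0.1)]
    mbar = peaks.mean() if peaks.size else 0.0
    return m * (1.0 - mbar) ** 2

def intensity_conspicuity(img):
    pyr = pyramid(img)
    shape = pyr[4].shape                         # common map resolution
    maps = [center_surround(pyr, c, c + d, shape)
            for c in (2, 3, 4) for d in (3, 4)]  # 6 feature maps
    return sum(normalize(m) for m in maps)       # across-scale addition
```

The final saliency map would be the equally weighted average of the intensity, color, and orientation conspicuity maps, each normalized once more, before the winner-take-all network selects the attended location.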
Conclusion and Discussion
- Summary
- Bottom-up saliency map to guide visual attention
- Feed-forward feature-extraction mechanisms
- Massively parallel architecture
- Multi-scale features: Intensity, color, orientation
- Biologically plausible neural networks
- Successfully detects locally salient targets
- Shows some robustness in noisy images
- Issues
- Model predictions vs. human fixation statistics
- Unimplemented feature types (e.g., T junctions or line terminators)
- A weighted linear combination of image properties
- Recurrent mechanism for contour completion and closure
High-quality visual information is acquired only from a limited spatial region surrounding the center of gaze (the point of fixation)
- Scene statistics
- Difference between fixated patches and unselected patches
- Saliency-map
- Model prediction of fixation using scene statistics (color, orientation, contrast, edge density, etc.)
- Gaze is the first step of visual cognition
- Eye movements serve as a window into the operation of the attentional system
- Eye movements also play an important role in studies of human language
Top-down Knowledge-driven
- Paper 2 is more recent work combining bottom-up and top-down processing
- Global context modulates local salient features
- A more integrated model of visual attention
- Episodic scene knowledge
- e.g., remembering that a particular clock hangs on a particular wall
- Scene-schema knowledge
- e.g., Superman is more likely to be found in the sky than on the road
- Task-related knowledge
- e.g., visual search vs. memorization
- Early studies of gaze control demonstrated that empty, uniform, and uninformative scene regions are often not fixated.
- Viewers instead concentrate their fixations, including the very first fixation in a scene, on interesting and informative regions.
- What is an "interesting and informative region"?
Compared the saliency-only model, the full model, and human observers
- All performed well above chance (chance defined by the ratio of target area to the whole image)
- The contextual guidance model performed better than the saliency-map-alone model
- In the people-search task, humans tended to start by fixating image locations selected by global features
- In the mug-search task, the saliency model performed almost as well as the full model
- Humans performed better than all models in all tasks
Saliency Map Alone vs. Contextual Guidance Model
- Paper 1: Bottom-up
- A Model of Saliency-Based Visual Attention for Rapid Scene Analysis (Itti et al., 1998)
- Paper 2: Bottom-up + Top-down
- Contextual Guidance of Eye Movements and Attention in Real-World Scenes: The Role of Global Features in Object Search (Torralba et al., 2006)
Conclusion and Discussion
- Provided a contextual guidance model combining bottom-up process with top-down process
- Modeled attention using a Bayesian framework
- Uses two parallel pathways: local features (saliency map) and global features (scene context)
- Features are computed in a feed-forward manner
- Top-down process mainly used scene-schema knowledge to select relevant image regions for visual search task
- Local features L: orientations computed from the R, G, and B channels separately via a filter bank, then reduced in dimensionality with PCA
- Bayesian combination of local features L and global (gist) features G for target presence O and location X: p(O=1, X | L, G) = p(L | O=1, X, G) * p(X | O=1, G) * p(O=1 | G) / p(L | G), with all densities estimated from the training set
- Scene-modulated saliency map: S(X) = p(L | G)^(-γ) * p(X | O=1, G), with the exponent γ fit by EM (a sketch follows below)
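To make the last equation concrete: a minimal sketch that combines a bottom-up term (rare local features, i.e., low p(L|G)) with a contextual prior over target location. The Gaussian band prior, the 64×64 size, and γ = 0.05 are illustrative assumptions; in the paper the densities are learned from training data and γ is fit with EM.

```python
# Scene-modulated saliency: S(X) = p(L|G)^(-gamma) * p(X|O=1,G).
# Rare local features and contextually likely locations both raise S.
import numpy as np

def contextual_saliency(p_L_given_G, p_X_given_OG, gamma=0.05):
    return p_L_given_G ** (-gamma) * p_X_given_OG

# Toy example: the scene gist predicts targets near image row 20
# (e.g., pedestrians' heads along a typical horizon line).
rng = np.random.default_rng(0)
p_L = rng.uniform(0.01, 1.0, size=(64, 64))      # placeholder feature likelihoods
rows = np.arange(64)[:, None]
p_X = np.exp(-0.5 * ((rows - 20.0) / 5.0) ** 2)  # Gaussian band prior
p_X = np.broadcast_to(p_X / p_X.sum(), (64, 64))
S = contextual_saliency(p_L, p_X)
print(np.unravel_index(S.argmax(), S.shape))     # most promising fixation
```

Because γ < 1 tempers the bottom-up term, the contextual prior dominates: the highest-scoring locations fall inside the expected band even when rarer features appear elsewhere.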
March 15, 2013
Thank you for your ATTENTION!