Dissertation Proposal: List Reading

by

Thomas Packer

on 7 December 2012

Transcript of Dissertation Proposal: List Reading

Weakly Supervised
Transductive
Self-training
Co-training

May contain:
OCR errors
Zoning
Grammatical structure
HTML tag structure
Tabular structure
Heterogeneous structure
Preprocessing (e.g. tokenization, POS tags, other features)

Computer Science
Information Extraction
Machine Learning
Graphical Processing
Document Image Analysis
Structure Recognition
Physical Structure Recognition
Optical Character Recognition and Error Correction

Research Area:
Machine Learning-based Information Extraction from Document Images

Statistical IE
Rule-based IE

Input
Output

May contain:
Entity categories
Relations among entities
Other ontological structures

Supervised
Unsupervised
Semi-supervised

Model / Mapping Function

Oracle
May be:
Human
Automated

Instances / Data

"Twenty years of document image analysis in PAMI" by G. Nagy (2000) IEEE Transactions on Pattern Analysis and Machine Intelligence

"Document structure analysis algorithms: A literature survey" by S. Mao, A. Rosenfeld and T. Kanungo (2003) SPIE Electronic Imaging

"Forty years of research in character and document recognition --- an industrial perspective" by H. Fujisawa (2008) Pattern Recognition

"Historical review of OCR research and development" by S. Mori, C. Y. Suen and K. Yamamoto (1992) Proceedings of the IEEE

"Techniques for automatically correcting words in text" by K. Kukich (1992) ACM Computing Surveys

"Information extraction" by S. Sarawagi (2008) Foundations and Trends in Databases

"Adaptive information extraction" by J. Turmo, A. Ageno, and N. Catala (2006) ACM Computing Surveys

"A brief survey of web data extraction tools" by A. H. F. Laender, B. A. Ribeiro-Neto, A. S. da Silva, and J. S. Teixeira (2002) ACM SIGMOD Record

"Wrapper induction for information extraction" by N. Kushmerick (1997) PhD thesis, University of Washington

"Maximum entropy Markov models for information extraction and segmentation" by A. McCallum, D. Freitag and F. Pereira (2000) International Conference on Machine Learning

"Conceptual-model-based data extraction from multiple-record web pages" by D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, D. W. Lonsdale, Y. Ng, and R. D. Smith (1999) Data & Knowledge Engineering

"Semi-supervised learning literature survey" by Xiaojin Zhu (2006) Technical report, University of Wisconsin, Madison

"Statistical Learning Theory" by Vladimir N. Vapnik (1998) Wiley New York

"Maximum likelihood from incomplete data via the EM algorithm" by Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin (1977) Journal of the Royal Statistical Society

"Combining labeled and unlabeled data with co-training" by A. Blum and T. Mitchell (1998) Annual Conference on Computational Learning Theory

"Unsupervised word sense disambiguation rivaling supervised methods" by David Yarowsky (1995) Annual Meeting on Association for Computational Linguistics

Online
Active

"Active learning literature survey" by Burr Settles (2010) Technical report, University of Wisconsin-Madison

Semi-supervised NER
Bootstrapping
Grammar Induction

"Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision" by David Nadeau (2007) Thesis, University of Ottawa, Canada

Transfer
Multi-task

"A survey on transfer learning" by Sinno Jialin Pan and Qiang Yang (2010) IEEE Transactions on Knowledge and Data Engineering

"Unsupervised word sense disambiguation rivaling supervised methods" by David Yarowsky (1995) Annual Meeting on Association for Computational Linguistics

"Automatic acquisition of hyponyms from large text corpora" by Marti A. Hearst (1992) International Conference on Computational Linguistics

"Learning dictionaries for information extraction by multi-level bootstrapping" by Ellen Riloff and Rosie Jones (1999) National Conference on Artificial Intelligence

"Nonparametric Bayesian Models of Lexical Acquisition" by S. J. Goldwater (2007) PhD thesis, Brown University

"Information compression by multiple alignment, unification and search as a unifying principle in computing and cognition" by J. G. Wolff (2003) Artificial Intelligence Review

"Wrapper induction for information extraction" by N. Kushmerick (1997) PhD thesis, University of Washington

"Hidden Markov model induction by Bayesian model merging" by A. Stolcke and S. Omohundro (1993) Advances in Neural Information Processing Systems

Model Formalism, Creation, Resources

Knowledge-constrained

"Machine Learning" by Tom Mitchell (1997) McGraw Hill

(Focus: Semi-supervised ML)
(Focus: Semi-structured IE)
(Focus: Structure Recognition & OCR)

Logical Structure Recognition

Model 1
Model 2

Related terms:
Page Decomposition
Document Structure Extraction
Document Layout Analysis
Document Structure Analysis
Document Structure Recognition

Related terms:
Physical Layout Analysis
Geometric Layout Analysis
Physical Component Analysis
Syntactic Analysis
Physical Structure Recognition

Related terms:
Logical Layout Recognition
Functional Layout Recognition
Logical Component Analysis
Functional Component Analysis
Semantic Analysis
Logical Structure Analysis

"Document Image Analysis (DIA) is the theory and practice of recovering the symbol structure of digital images scanned from paper or produced by computer."

G. Nagy, 2000

"OCR converts the individual word or character images into a character code like ASCII or Unicode."

G. Nagy, 2000

Character Segmentation
Character Classification
Hypothesis Ranking
Hypothesis Selection

Machine-queryable, -linkable, and -editable

Research Area

Labels are indirectly, incompletely, or inconsistently relevant to the target function, but are cheap to acquire.

Expensive labeled and cheap unlabeled data are both leveraged.

Based on:
Character classification confidence
Resulting word (unigram) probability
Resulting word (n-gram) probability conditioned on context

Page Grammar:
Models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function.
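
As an illustration of the hypothesis-ranking criteria listed above (character classification confidence, unigram word probability, and n-gram probability conditioned on context), here is a minimal Python sketch. It is not taken from the presentation: the probability tables, the smoothing floor, the equal weighting of the three signals, and the function names are all assumptions chosen for the example.

```python
import math

# Hypothetical toy language-model tables (assumed values, not from the source).
# A real system would estimate these from a large corpus.
UNIGRAM = {"the": 0.05, "house": 0.002, "horse": 0.001}
BIGRAM = {("the", "house"): 0.01, ("the", "horse"): 0.004}
FLOOR = 1e-8  # assumed smoothing floor for unseen words and word pairs


def score(candidate, char_confidences, previous_word):
    """Sum the log of the three signals, weighting them equally (an assumption)."""
    # 1. Character classification confidence, one value per character.
    char_score = sum(math.log(c) for c in char_confidences)
    # 2. Unigram probability of the resulting word.
    unigram_score = math.log(UNIGRAM.get(candidate, FLOOR))
    # 3. Bigram probability of the word given the previous word (context).
    bigram_score = math.log(BIGRAM.get((previous_word, candidate), FLOOR))
    return char_score + unigram_score + bigram_score


def rank_hypotheses(hypotheses, previous_word):
    """Return (word, confidences) pairs sorted from best to worst score."""
    return sorted(
        hypotheses,
        key=lambda h: score(h[0], h[1], previous_word),
        reverse=True,
    )


# Three competing readings of the same word image, each with
# per-character classifier confidences.
hypotheses = [
    ("horse", [0.9, 0.6, 0.4, 0.9, 0.9]),
    ("hcuse", [0.9, 0.7, 0.9, 0.9, 0.9]),
    ("house", [0.9, 0.6, 0.9, 0.9, 0.9]),
]
ranked = rank_hypotheses(hypotheses, previous_word="the")
best_word = ranked[0][0]  # hypothesis selection: keep the top-ranked word
```

In this toy setup the non-word "hcuse" has the highest raw character confidence, but the two language-model terms push it to the bottom of the ranking, which is the point of combining the signals.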