Conversion of Research Publications into XML

ITEC 810 Presentation

Suherman Suherman

on 7 June 2013

Transcript of Conversion of Research Publications into XML

Introduction
Methodology
Approach
Discussion
Prototype 1 Problem
Prototype 2 Problem
Prototype 3 Problem
Final System
Conclusion

Conversion of Research Publications into XML
By: Suherman (42968461)
Supervisor: Dr. Diego Mollá Aliod

Topics:
Introduction
Literature Review
Conclusion

Literature Review
XML Formats
PubMed DTD

Text Encoding Initiative


Comparison Result
XML Processing Tools
Prototype Development

Requirements Definition
System and Software Design
Implementation and System Testing
Operation and Maintenance

Any Questions?

JSoup
XOM

The converter tool is an application that converts various journals into one standard XML format;
The application will use a set of tag mappings in several XML configuration files to process various journals. Full documentation will be provided;
Application source code will be handed over.

Revision History
Prototype 1 Development
Prototype 2 Development
Prototype 3 Development
Final Prototype Development

Separate all Cochrane Library source files;
Analyze Cochrane Library journals;
Hard-code the source-to-target tag mapping in Java;
Develop the converter application for the Cochrane Library;
Verify the results manually against the PubMed Journal Publishing DTD version 3.0.

Process New England journals;
Acquire source document metadata and abstracts automatically from the PubMed website;
Implement internal anchors and external HTML links;
Accommodate subsequent content, bulleted lists, numbered lists and table components from the data source;
Detect documents that are invalid against the DTD on the spot.

Create the main configuration file as an XML file;
Create external XML source-to-target mapping files;
Use the XPath query language to select nodes and extract data from the XML document;
Implement a parsing and cleaning mechanism for source documents before they are processed with XPath.

Add a length comparison between the source and result files;
Arrange the HTML sources by journal type, source file and clean XHTML file;
Move source and result files that caused errors into the “error/” directory for further investigation;
Move articles with a large discrepancy between the original and the converted version into the “missing/” directory for further investigation;
Fix some missing tags and invalid XML structures in the Cochrane, JAMA and New England journals;
Share journal mapping tags through a shared XML mapping file.

Inconsistent document metadata and abstract components;
Some tags are missing because the tag map is insufficient;
Unable to verify the resulting document against the DTD;
Unable to process complex nested sub-tags.

Eliminates all problems in Prototype 1;
The source-to-destination tag mapping is hard-coded in the Java source code;
Adding a new journal is difficult because it requires knowledge of the Java programming language;
A large discrepancy was found when comparing the length of the source document against the result document.

Eliminates all problems in Prototype 2;
The document traversal mechanism makes processing slow (up to 4 seconds per file);
There is no guidance on how to create the XML mapping file.

Eliminates all problems in Prototype 3;
Provides a user manual and technical diagrams;
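
The external source-to-target mapping and XPath selection described for Prototype 3 might look like the minimal sketch below. It is an illustration under stated assumptions, not the project's code: the class name XPathMappingSketch, the mapping entries and the sample XHTML are hypothetical, the mapping is held in an in-memory map rather than loaded from an external XML file, and the sketch uses the JDK's built-in javax.xml.xpath API instead of XOM so that it needs no external libraries.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathMappingSketch {

    // Hypothetical mapping: target tag name -> XPath into the cleaned XHTML source.
    // In the project this information would live in an external XML mapping file.
    static Map<String, String> exampleMapping() {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("article-title", "/html/body/h1");
        mapping.put("abstract", "/html/body/div[@class='abstract']/p");
        return mapping;
    }

    // Evaluates each XPath against the source and emits one target element per entry.
    static String convert(String xhtml, Map<String, String> mapping) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        StringBuilder out = new StringBuilder("<article>");
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            // evaluate(String, Object) returns the string value of the first match.
            String text = xpath.evaluate(entry.getValue(), doc).trim();
            out.append('<').append(entry.getKey()).append('>')
               .append(escape(text))
               .append("</").append(entry.getKey()).append('>');
        }
        return out.append("</article>").toString();
    }

    // Escapes the characters that are not allowed in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        String source = "<html><body><h1>Sample title</h1>"
                + "<div class='abstract'><p>Sample abstract.</p></div></body></html>";
        System.out.println(convert(source, exampleMapping()));
    }
}
```

Keeping the mapping as data means that supporting a new journal would only need a new set of entries, which is the motivation the slides give for moving the mapping out of the Java source.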

Some journals do not have a valid article tag scope;
Some journal files do not have a valid reference tag scope;
Some journal files have many variants of article and reference scopes.

Altering the source file and adding an article scope tag;
Ignoring the discrepancy warning for specific journal types that show a large length discrepancy;
Manually updating journals with variant scopes by adding fixed article and reference scope tags.

Current Solutions

Of 1071 source documents, only 302 were processed, so the application handled about 28% of the documents by the end of the project;
Recommendation: create XML configuration files, each containing a set of XPath expressions for the document components of one particular document source;
Lesson: HTML with an inconsistent structure is complex to process, because documents of the same journal type can have different tag scopes and missing or incomplete reference tags;
Further development: implement a special algorithm to process incomplete and inconsistent biomedical journals that have a low percentage of similarity with other journals.

Background, research problem, and significance

Project aims and expected outcomes

Dr. Diego Mollá Aliod conducts research on clinical questions;
The results consist of multiple types of biomedical journals;
These need to be converted into a unified XML format for data extraction;
This helps physicians by providing the references in less time.

Determine the variants and the structures of biomedical journals;
Find the best XML format that is able to handle most variants;
Develop several prototypes of the converter application;
Produce a set of converted journals in the unified XML format;
Provide a final prototype application that converts multiple file sources into the unified XML format.

PubMed is a free search engine for accessing the MEDLINE database of citations;
PubMed DTD XML files are compliant with the MEDLINE data repository.

TEI is a consortium that collectively develops and maintains standards for the representation of texts in digital form;
TEI XML files are suited to online research, teaching, and preservation.

DocBook is a semantic markup language for technical documentation;
DocBook XML files are suited to technical documents about computer hardware and software.

Find and extract data using DOM traversal or CSS selectors;
Manipulate the HTML elements, attributes, and text.

SAX-compliant parser in Java;
Generates clean, well-formed XML from HTML.

Offers a dual streaming and tree-based API;
Built-in support for namespaces, XPath, XSLT, and XInclude.

Analyze a random sample against manual conversion;
Check the overall error rate of the XML results.

Some Problems
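
The large length discrepancies noted among the problems above could be flagged automatically, in the spirit of the length comparison added in the final prototype. The following is a minimal sketch under stated assumptions: the class name LengthCheckSketch, the tag-stripping regular expression, the 0.2 threshold and the "result/" directory name are all hypothetical; only the "missing/" directory comes from the slides.

```java
class LengthCheckSketch {

    // Strips markup and collapses whitespace so that only text content is compared.
    // A regular expression is a rough approximation; a real parser would be more robust.
    static String textOnly(String markup) {
        return markup.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Fraction of the source text length that is absent from the result.
    // 0.0 means the result is at least as long as the source.
    static double discrepancy(String sourceHtml, String resultXml) {
        int sourceLength = textOnly(sourceHtml).length();
        int resultLength = textOnly(resultXml).length();
        if (sourceLength == 0) {
            return 0.0;
        }
        return Math.max(0, sourceLength - resultLength) / (double) sourceLength;
    }

    // Chooses the output directory: "missing/" when too much text was lost.
    // The threshold value is an assumption, not a figure from the project.
    static String targetDirectory(String sourceHtml, String resultXml, double threshold) {
        return discrepancy(sourceHtml, resultXml) > threshold ? "missing/" : "result/";
    }

    public static void main(String[] args) {
        String source = "<p>abcdefghij</p>";
        String result = "<p>abcde</p>";
        System.out.println(discrepancy(source, result));
        System.out.println(targetDirectory(source, result, 0.2));
    }
}
```

A check like this only measures how much text survived the conversion, so it complements the manual comparison of a random sample and does not replace it.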