- Aggregation of metadata from heterogeneous collections leads to data quality issues
- Large scale aggregation also brings opportunities for data enrichment and enhancement
The Europeana case is quite different from many library-focused ones
- Persons are referred to in the simple ESE (Europeana Semantic Elements) metadata
- There is no direct linking, for example, via a reference to an authority number used at a national library.
The pilot would allow an improvement of the enrichment process in Europeana.
2. Connect related Europeana records
- Detect duplicates or near-duplicates
- Identify and create semantic links between objects that are related, for example:
  - a painting and photographs of that painting
  - all digitized pages of the same book
  - a collection of letters that belong to the same person
  - different editions of one book
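As an illustration of the duplicate-detection goal above, here is a minimal sketch that groups records sharing a normalised key. It is not the method used in the pilot: the field names `title` and `creator`, the sample records, and the `fingerprint` helper are all assumptions made for the example.

```python
import re
from collections import defaultdict

def fingerprint(record, fields=("title", "creator")):
    """Normalised key: lowercase, drop punctuation, sort the unique tokens."""
    tokens = []
    for field in fields:
        tokens.extend(re.findall(r"\w+", record.get(field, "").lower()))
    return " ".join(sorted(set(tokens)))

def group_duplicates(records):
    """Return lists of record ids whose fingerprints are identical."""
    groups = defaultdict(list)
    for record in records:
        key = fingerprint(record)
        if key:  # records with no usable metadata are not grouped together
            groups[key].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical sample records (not real Europeana data).
records = [
    {"id": "a1", "title": "The Night Watch", "creator": "Rembrandt"},
    {"id": "b7", "title": "Night Watch, The", "creator": "rembrandt"},
    {"id": "c3", "title": "Self-portrait", "creator": "Rembrandt"},
]
duplicate_groups = group_duplicates(records)
```

A key of this kind only catches records that differ in case, punctuation or word order; near-duplicates with missing or extra words need a similarity measure instead.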
http://thoth.pica.nl/eu/results_en/level40/40_8251.html
2. Categorize clusters and identify semantic links between records
Duplicates
Findings
Same page digitized 3 times: duplicates?
OCLC internal data (Digital Collection Gateway, etc.)
On the clusters
- Clusters are generally good but are limited to close relationships
On the data used for the research
- Quality issues in the data
- Standards are interpreted differently by providers despite the presence of guidelines
- Creation of digital objects is not always in line with the creation of descriptive metadata
- The logical structure of cultural heritage objects is not always reflected in the metadata
Applying the types of relations available in EDM to the types of clusters found during the experiment.
Findings from the pilot could feed into best practice guides for content providers and thereby improve the quality of the whole Europeana dataset.
Same objects, different providers
Clustering and enrichment
innovation
Data services for third parties
Digitized content of Europe's galleries, libraries, museums, archives and audiovisual collections.
Over 22 million books, films, paintings, museum objects and archival documents from some 2,200 content providers.
Hunting for Semantic Clusters
Europeana Innovation Pilots
How can we find interesting stuff in over 22 million Europeana objects?
1. Connecting as many objects (books, films, paintings, etc.) as possible to the resources of the Virtual International Authority File (VIAF)
Shenghui Wang
OCLC Research
Leiden, The Netherlands
OCLC Research: Two step approach
1. Cluster records into small clusters
- A fast clustering method which clusters 23.6 million records in 4 minutes
- Genetic algorithm to automatically select important metadata for more meaningful clusters, such as
  - all pages of the same book
  - all postcards sent by one person
- Different similarity thresholds for a hierarchical way of exploring records
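To make the idea of exploring records at different similarity thresholds concrete, the sketch below clusters a handful of titles at two thresholds. It assumes Jaccard similarity over title tokens and a simple pairwise single-link merge; this is a stand-in chosen for readability, not the fast clustering method or the genetic feature selection described above, and the sample titles are invented.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(token_sets, threshold):
    """Single-link clustering: merge records whose similarity >= threshold."""
    parent = list(range(len(token_sets)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(token_sets)), 2):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(token_sets)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical titles (not real Europeana data).
titles = [
    "letter from vincent to theo page 1",
    "letter from vincent to theo page 2",
    "letter from vincent to emile",
    "portrait of a woman",
]
token_sets = [set(title.split()) for title in titles]

# A high threshold gives tight clusters; a lower one merges them into broader groups.
levels = {t: cluster(token_sets, t) for t in (0.7, 0.4)}
```

At the higher threshold only the two pages of the same letter fall together; at the lower one the third letter joins them, which gives the nested, hierarchical view. The pairwise comparison is quadratic in the number of records, so it would not scale to millions of records as is.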
Current situation in Europeana
Europeana
Clusters
(Near-)Duplicates
Thematic clusters or collections
Views of the same object
Parts of the same CHO
Derivative works
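One possible way to apply EDM relation types to the cluster types above is a simple lookup table. The sketch below is hypothetical: the property names are existing EDM and Dublin Core terms, but the pairing of each cluster type with a property is an assumption made for illustration, not a result of the pilot, and the record identifiers are invented.

```python
# Assumed mapping from cluster type to an EDM / DC Terms property.
CLUSTER_TO_EDM = {
    "near_duplicates": "edm:isSimilarTo",
    "thematic_cluster": "edm:isRelatedTo",
    "views_of_same_object": "edm:isRepresentationOf",
    "parts_of_same_cho": "dcterms:isPartOf",
    "derivative_works": "edm:isDerivativeOf",
}

def links_for_cluster(cluster_type, member_ids, target_id):
    """Emit (subject, property, object) triples linking each member to the target."""
    prop = CLUSTER_TO_EDM[cluster_type]
    return [(member, prop, target_id) for member in member_ids if member != target_id]

# Hypothetical identifiers: three digitized pages linked to their book.
triples = links_for_cluster("parts_of_same_cho", ["page1", "page2", "page3"], "book")
```

Directed properties such as `dcterms:isPartOf` need a designated target (the book, the original work), whereas a looser relation such as `edm:isRelatedTo` could equally be asserted between every pair of cluster members.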