BIBFLOW: A Roadmap for Library Linked Data Transition
Prepared 14 March, 2017
MacKenzie Smith | Carl G. Stahmer | Xiaoli Li | Gloria Gonzalez
University Library, University of California, Davis | Zepheira Inc.
Research support by the Institute for Museum and Library Services
BIBFLOW is an Institute for Museum and Library Services (IMLS) funded multi-year project of the University of California Davis Library and Zepheira Corporation. Traditional library data methods are out of sync with the data storage, transmission, and linking standards that drive the new information economy. As a result, new standards and technologies are sorely needed to help the library community leverage the benefits and efficiencies that the Web has afforded other industries. The findings in this report are the result of research focused on how libraries should adapt their practices, workflows, software systems, and partnerships to support their evolution to new standards and technologies. In conducting this research, the BIBFLOW team collaborated and communicated with partners across the library data ecosystem – key organizations like the Library of Congress, OCLC, library vendors, standards organizations like NISO, software tool vendors, commercial data providers, and other libraries that are trying to plan for change. We also experimented with various technologies as a means of testing Linked Data transition and operation workflows. The specific focus of this study was the Library of Congress’ emerging BIBFRAME model, a framework developed specifically to help libraries leverage Linked Data capabilities.
This report is the result of two years of research across the spectrum of Linked Data implementation and operations. Its purpose is to provide a roadmap that individual libraries can use to plan their own transition to Linked Data operations. It makes specific recommendations regarding a phased transition approach designed to minimize costs and increase the efficiency and benefits of transition. An analysis of specific transition tools is provided, as well as an analysis of workflow transitions and estimated training and work effort requirements.
A key finding of the report is that libraries are better positioned than most believe to transition to Linked Data. The wider Linked Data ecosystem and the semantic web in general are built on the bedrock of shared, unique identifiers for both entities (people, places, etc.) and actions (authored, acquired, etc.). Libraries have a long history of shared data governance and standards; as such, library culture is well suited to transitioning to Linked Data, and library structured data (MARC) is well situated for data transformation. In light of the above, it is our conclusion that Linked Data represents an opportunity rather than a challenge, and this roadmap is intended to serve as a guide for libraries wishing to seize this opportunity.
Transitioning to Linked Data is not merely a data transformation activity. Libraries have extensive experience transforming data from one format to another. While crosswalk processes can be cumbersome and time consuming, they are well understood and we are quite good at them. Transitioning to Linked Data, however, requires more than simply mapping fields across data models and performing the reformatting necessary to comply with the specifications of the new model. It requires adding new data to each record, data that can often be difficult to disambiguate by machine. Specifically, a successful transition to a Linked Data ecosystem requires adding numerous shared, publicly recognized unique identifiers (Uniform Resource Identifiers, or URIs) to each record at the time of transformation.
URIs form the backbone of the Linked Data ecosystem. The fundamental concept is to provide a unique, machine actionable identifier for all entities in a graph. Thus, for example, whereas a human might say:
Figure 2: Human readable triple
A Linked Data representation of the same statement would look like:
Figure 3: Machine readable triple
When we refer to URIs as “machine actionable” or “machine traversable,” we mean that an identifier is uniformly recognized by independent computing systems, allowing them to link statements made about the same entity by different parties, or to express relationships that can be used to control function and output. For example, if you have a collection of records stating that “Shakespeare wrote Hamlet” and I have a collection of records stating that “Shakespeare wrote Romeo and Juliet,” adding URIs to our records allows a computer to infer that “Shakespeare wrote Hamlet and Romeo and Juliet.” Similarly, if we used URIs to identify Hamlet and Romeo and Juliet, the computer could search across the network for things that others have said about each of these plays.
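The merging behavior described above can be sketched with plain (subject, predicate, object) tuples. In this minimal sketch, the relation and work URIs are hypothetical placeholders (real vocabularies such as BIBFRAME would supply them); the VIAF URI for Shakespeare is real:

```python
# Two independently produced record sets use the same URIs, so a
# machine can merge them and answer questions neither set answers alone.

SHAKESPEARE = "http://viaf.org/viaf/96994048"          # real VIAF URI
WROTE = "http://example.org/relation/wrote"            # placeholder predicate
HAMLET = "http://example.org/work/hamlet"              # placeholder work URI
ROMEO = "http://example.org/work/romeo-and-juliet"     # placeholder work URI

# Each "library" asserts one triple: (subject, predicate, object).
library_a = {(SHAKESPEARE, WROTE, HAMLET)}
library_b = {(SHAKESPEARE, WROTE, ROMEO)}

# Because both sets use the same subject URI, merging is a simple set union.
merged = library_a | library_b

def works_by(graph, author_uri):
    """Return every object of a 'wrote' statement for the given author URI."""
    return sorted(o for (s, p, o) in graph if s == author_uri and p == WROTE)

print(works_by(merged, SHAKESPEARE))
```

No string matching on “Shakespeare” is involved: the union works only because both parties chose the same identifier, which is exactly the role shared URIs play.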
Figure 4: Dynamic Linked Data graph
The above figure shows a partial graph of relationships between Hamlet and Romeo and Juliet that was dynamically created, with no human intervention, by traversing URI based statements about the two plays that are currently available as Linked Data on the internet.
For a full discussion of the function and benefits of Linked Data see the “Why Linked Data” section of this report. For the present purposes, what concerns us is the role that URIs serve in the Linked Data universe. A Linked Data graph is only as good as its URIs. If two individuals use two different URIs for the same entity, William Shakespeare for example, then to the computer there are two William Shakespeares. As such, proper URI management is essential to the Linked Data effort.
Several organizations, such as Getty, the Library of Congress, OCLC, and VIAF, currently make available Linked Data gateways that provide URIs for entities and controlled vocabularies widely used by libraries and cultural heritage organizations. Using these resources, organizations can look up shared URIs for entities (people, organizations, subjects, etc.). Similarly, BIBFRAME defines a set of relationships for which public URIs have also been minted.
From a data perspective, the primary obstacle to transitioning to Linked Data is associating the literal representation of entities in MARC records (Shakespeare, William, 1564-1616) with machine actionable URIs (http://viaf.org/viaf/96994048). This association must be backward implemented on all legacy records (a daunting task), and library systems must be updated to create the association when dealing with new records or editing existing ones (a potentially difficult task, since most libraries rely on vendor software, over which they have little control, to perform this work).
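The association step can be sketched as follows. This is a hedged illustration, not production code: the static lookup table stands in for a real reconciliation service (such as VIAF or id.loc.gov), and the field layout is a simplified dictionary rather than a real MARC structure. The $1 subfield is the subfield the PCC work discussed below considered for carrying entity URIs:

```python
# Sketch: attach a URI to the literal form of a name heading.
# URI_LOOKUP stands in for a live reconciliation service (assumption);
# the VIAF URI shown is the real one cited in the text above.
URI_LOOKUP = {
    "Shakespeare, William, 1564-1616": "http://viaf.org/viaf/96994048",
}

def add_uri_to_heading(field):
    """Attach a $1 (entity URI) subfield when the heading's literal form
    is found in the lookup table; leave the field untouched otherwise."""
    heading = field["subfields"].get("a")
    uri = URI_LOOKUP.get(heading)
    if uri:
        field["subfields"]["1"] = uri
    return field

# Simplified stand-in for a MARC 100 (main entry, personal name) field.
field_100 = {"tag": "100", "subfields": {"a": "Shakespeare, William, 1564-1616"}}
print(add_uri_to_heading(field_100))
```

In practice the lookup would query a service and a human would resolve ambiguous matches; the point of the sketch is that the literal string survives alongside the new machine actionable identifier.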
In addition to the technical problems presented by conversion of data, transitioning to Linked Data also brings with it a host of potential systems and workflow issues. Current library operations rest on workflows designed for and performed by staff with specialized and advanced training and knowledge. Changing the required output of these workflows could potentially have dramatic effects on the workflows that create it. Section VII of this document discusses these changes in depth.
Finally, transitioning our data and workflows will also necessarily impact library systems and information flow. The figure below is a diagram of the numerous systems in place at the UC Davis library that communicate either directly or by association with our library catalog:
Figure 5: Library systems diagram
As depicted in the above diagram, 40 different systems connect either directly or indirectly with our library catalog. Each of these connections represents a potential point of failure during a Linked Data transition, further complicating any transformation process.
The transition roadmap presented here is based on two years of experimenting with various approaches to making a transition to Linked Data. The plan is driven by the following seven primary principles:
1. Ensure accuracy of resulting data
2. Ensure proper function of data in the wider information systems ecosystem
3. Minimize impacts on daily operations during transition
4. Minimize impacts on library workflows except where changes will result in increased efficiency and improved quality of work
5. Minimize the need for additional staff training
6. Maximize benefits Linked Data offers with regard to data sharing and interoperability
7. Maximize benefits Linked Data offers in terms of extensibility of descriptive practices and methods (improve depth of records)
The proposed transition plan is a two-phase plan, each phase comprising multiple steps. Importantly, Phase One can be undertaken as an end point in itself and will situate libraries to function in a Linked Data library ecosystem. Libraries that complete Phase One will be able to exchange BIBFRAME and other Linked Data graphs with other libraries and cultural heritage institutions with minimal impact on staff and systems, but also without capitalizing on the full potential of Linked Data. Libraries that go on to complete Phase Two will add to this the ability to capitalize on the extensibility inherent in Linked Data graph description and will also introduce efficiencies in cataloging workflows. Libraries should seek the level of engagement that aligns with their in-house technical expertise, efforts performing original cataloging, desire to create a deeper and more descriptive catalog, and budget.
The following figure presents a high-altitude view of the proposed conversion roadmap, including milestones of each phase:
Figure 6: Transition process overview
The primary focus of Phase One is preparing existing MARC records for transformation to Linked Data graphs. This involves inserting appropriate URIs into MARC records so that records can be converted into functioning Linked Data graphs that include machine actionable URIs. At the conclusion of Phase One, the catalog’s data store and cataloging user interface remain MARC based, but the presence of URIs in MARC records allows for the development of Application Programming Interfaces (APIs) to export and ingest Linked Data graphs. Libraries that lack the necessary resources or need, or are otherwise not interested in transitioning to complete internal Linked Data operations, could stop at the completion of Phase One and still function effectively in the wider Linked Data library ecosystem.
The primary focus of Phase Two is converting the entire library information ecosystem to native Linked Data operations. During this phase of conversion, the catalog itself is converted to a Linked Data, graph-based architecture, and cataloging interfaces and workflows are altered to maximize realization of the descriptive, search and discovery, and workflow benefits of Linked Data.
Step One of the Phase One conversion plan establishes a working environment in which all future-forward cataloging efforts will support Linked Data transition. Step Two of Phase One addresses the problem of legacy records. In October 2015 the Program for Cooperative Cataloging (PCC) charged a Task Group on URIs in MARC. The specific charge of the Task Group was to investigate the feasibility of, and make recommendations regarding, the insertion of URIs in standard MARC records. Much of the Task Group’s work focused on testing the potential impact of inserting URIs into MARC records, with an eye particularly to whether such an effort would negatively affect the functioning of current ILS systems. This testing necessitated the large-scale conversion of MARC records. To this end, librarians and staff at George Washington University, working under the guidance of the PCC Task Group’s chairperson, Jackie Shieh, tested various methods of inserting URIs in the MARC records of their 1.7 million title catalog.
The published results of George Washington University’s experiments with URI insertion provide details regarding the exact process used as well as scripts for performing the insertion. As such, these specific details are not included in this report. Relevant to this report is the calculation of effort required to complete the transformation. The most successful method implemented by the George Washington University team involved automated conversion and validation followed by human validation, correction, and supplemental cataloging. According to Shieh and Reese, automated conversion of records resulted in few errors. Human catalogers were used to spot check machine output. One cataloger was devoted to this task for the duration of the project, resulting in a very high, verified rate of conversion accuracy.
A potential option for completing Step Two of Phase One would be to share the conversion effort across libraries, through and with OCLC and other vendors. The present workflows of most libraries involve contributing records to and receiving records from OCLC and other vendors. There is an opportunity for service models in which OCLC inserts URIs in bibliographic records and distributes the updated records to libraries as appropriate. Additionally, vendors could provide shelf-ready acquisitions with records that include URIs. The costs of a conversion-as-a-service model are impossible to calculate without direct input from vendors; however, because such a service would dramatically reduce the work effort required at each local institution, it should represent a cost savings to participating libraries.
Phase One completion represents a significant milestone in the transition to Linked Data operations. At the conclusion of this phase, libraries will be situated such that their entire record collection and ongoing record creation and maintenance will support Linked Data operations, and they will be able to deliver and ingest Linked Data records. Implementation timelines for Phase One are dependent on vendor implementation timelines for all but those libraries that currently implement open source ILS and have the expertise to add the needed functionality to the ILS. Some commercial ILS systems already contain the necessary URI lookup and insertion functionality. In all circumstances, the cost of implementing Phase One is minimal, as is the effect on cataloging workflows.
Concurrently with, or after, migrating human workflows to native Linked Data operation, legacy MARC records must be converted to Linked Data graphs and stored in the new graph database. (As noted before, this database may not be strictly graph based, but the MARC records must be migrated to the new model regardless.) Automated transformation is made possible because the needed URIs were added to MARC records during Phase One of this transition plan. This process will primarily involve technical staff, but libraries should expect to devote one cataloger familiar with both MARC and BIBFRAME (or an alternate Linked Data model) to the effort in order to facilitate proper data mapping and to validate output.
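Why Phase One makes this conversion automatable can be sketched briefly: once a field carries a URI, emitting a triple is a mechanical tag-to-predicate mapping. The field layout and the tag-to-predicate table below are illustrative assumptions, not the official BIBFRAME conversion specification; the namespace shown is the real BIBFRAME namespace:

```python
# Sketch of the automated Phase Two conversion. Because Phase One already
# placed URIs in the records, no string disambiguation is needed here.
BF = "http://id.loc.gov/ontologies/bibframe/"  # real BIBFRAME namespace

def marc_to_triples(record_uri, fields):
    """Emit (subject, predicate, object) triples for fields that carry a
    $1 URI subfield; fields without URIs are simply skipped in this sketch."""
    # Illustrative tag-to-predicate mapping (assumption, not the LC spec).
    predicate_for_tag = {"100": BF + "contribution", "650": BF + "subject"}
    triples = []
    for f in fields:
        uri = f["subfields"].get("1")
        pred = predicate_for_tag.get(f["tag"])
        if uri and pred:
            triples.append((record_uri, pred, uri))
    return triples

fields = [
    {"tag": "100", "subfields": {"a": "Shakespeare, William, 1564-1616",
                                 "1": "http://viaf.org/viaf/96994048"}},
    {"tag": "245", "subfields": {"a": "Hamlet"}},  # no URI: skipped here
]
print(marc_to_triples("http://example.org/work/1", fields))
```

A real conversion would of course handle literals, blank nodes, and the full Work/Instance model; the sketch shows only why the cataloger's role shifts from data entry to mapping review and output validation.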
Several viable tools are currently available for performing conversion of MARC records to Linked Data graphs.
Library of Congress Transformation Service:
Figure 15: Current Library of Congress MARC to BIBFRAME Transformation Service
The current release of the Library of Congress MARC to BIBFRAME Transformation Service is a web-based service suitable for testing conversion from MARC to BIBFRAME 1.0. The Library of Congress is currently working on an open source, BIBFRAME 2.0 version of the software that can be installed locally and used to transform MARC to BIBFRAME 2.0, the latest BIBFRAME standard. This software is soon to be released. The MARC to BIBFRAME Transformation Service has undergone extensive testing at the Library of Congress and provides excellent MARC to BIBFRAME transformation. The software runs efficiently and requires a minimal storage footprint. Additionally, the transformation engine is highly flexible, using an XSLT transformation service to traverse a MARC-XML DOM and output data in any text-based format. The Library of Congress provides XSLT for MARC-to-BIBFRAME conversion only, but with custom-developed XSLT the software could export transformations using any ontology or combination of ontologies and frameworks, in any Linked Data serialization. As such, it represents a good choice for libraries interested in producing strict BIBFRAME with few alterations, and for libraries with in-house XSLT expertise that are interested in converting to frameworks other than or in combination with BIBFRAME.
Figure 16: MarcEdit
Most librarians are already familiar with Terry Reese’s MarcEdit software. An important feature of MarcEdit is its MARCNext component, which provides a collection of tools for manipulating MARC with an eye toward Linked Data transformation. Two tools are of particular use in this regard: 1) a highly configurable transformation service; and 2) the ability to export MARC records as a SQL database.
MarcEdit’s transformation engine is highly flexible, using an XSLT transformation service to traverse a MARC-XML DOM and output data in any text-based format. This could include RDF-XML, Turtle, or any other Linked Data serialization. Using this system, a library can easily run multiple transformations on the same collection of MARC records, producing specific outputs for specific uses. For example, a library could run one transformation to BIBFRAME for interlibrary use and another to Schema.org for search engine optimization. Terry Reese also maintains a public forum where XSLT transformation scripts can be shared. This means that one library could use another library’s BIBFRAME transformation out of the box, or modify it for a particular purpose and share the result with other libraries.
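The traverse-and-emit pattern these XSLT pipelines implement can be illustrated in a few lines. This is a hedged stand-alone analogue, not MarcEdit's actual transform: it walks a MARC-XML fragment with the Python standard library and emits one Turtle statement per title field. The MARC-XML namespace is real; the output predicate is a placeholder:

```python
# Minimal analogue of an XSLT MARC-XML-to-Turtle transform: walk the DOM,
# select fields, emit serialized statements.
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"  # real MARC-XML namespace

marcxml = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245"><subfield code="a">Hamlet</subfield></datafield>
</record>"""

def titles_as_turtle(xml_text, subject_uri):
    """Return one Turtle line for each 245 $a subfield found in the record.
    The predicate URI is a placeholder for a real vocabulary term."""
    root = ET.fromstring(xml_text)
    lines = []
    for df in root.iter(MARC_NS + "datafield"):
        if df.get("tag") == "245":
            for sf in df.iter(MARC_NS + "subfield"):
                if sf.get("code") == "a":
                    lines.append(
                        f'<{subject_uri}> <http://example.org/title> "{sf.text}" .')
    return lines

print(titles_as_turtle(marcxml, "http://example.org/work/1"))
```

Swapping the selection and emission logic is what lets the same engine target BIBFRAME for one run and Schema.org for another, which is the flexibility the paragraph above describes.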
MarcEdit’s ability to export MARC records as a collection of SQL scripts is also potentially quite useful. Exporting records to a SQL database opens the door for complex querying of data. Storing records in an accessible SQL database can simplify the transformation process for those libraries interested in writing their own, stand-alone transformation scripts or applications. All widely used scripting and programming environments have packages that provide easy access to a variety of SQL databases, simplifying the process of querying records as part of a transformation process.
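As a hedged sketch of the kind of querying a SQL export enables, the following uses an in-memory SQLite database. The table layout is invented for the example; MarcEdit's actual export schema may differ:

```python
# Sketch: once MARC fields live in SQL, audit queries that are awkward
# against a flat MARC file become one-liners.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fields (record_id TEXT, tag TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO fields VALUES (?, ?, ?)",
    [("rec1", "100", "Shakespeare, William, 1564-1616"),
     ("rec1", "245", "Hamlet"),
     ("rec2", "245", "Romeo and Juliet")])

# Find every record that carries a 100 (main entry) field -- e.g. to
# estimate how many headings will need URI reconciliation.
rows = conn.execute(
    "SELECT record_id FROM fields WHERE tag = '100'").fetchall()
print(rows)
```

A transformation script could use the same connection to pull exactly the fields it needs per record, rather than re-parsing the whole file on every pass.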
MarcEdit provides a highly flexible platform for shared development of transformation scripts. As such, it is a good tool for libraries interested in performing multiple transformations and/or sharing in communal development of transformations. A potential drawback is that MarcEdit runs only on Microsoft Windows and can be deployed only on Windows-based servers or desktops. It is therefore a suitable option only for libraries that operate in a Windows environment.
Figure 17: Extensible Catalog
The Extensible Catalog Metadata Services Toolkit (MST) is a web application that runs as a Java Servlet under server engines such as Apache Tomcat or Jetty. Administrative users use a web interface to manage transformation “Services” that map identified record sets to Java transformation scripts. A valuable feature of the MST is that transformations can be run once, or the service can poll the ILS for changes and execute the transformation as needed to keep the graph representation synchronized with the MARC data store. Transformed data sets are made available through an API. The MST can be run on any system that supports Java Servlets, including Linux, Mac, Unix, and Windows.
The MST is a good option for libraries with in-house server administration expertise and the computing infrastructure necessary to run a Java Servlet container. An ILS that supports OAI-PMH is also required, or the ability to install and maintain a service that uses APIs or exported MARC data to provide an OAI-PMH gateway. (The Extensible Catalog suite includes a MARC-XML to OAI-PMH gateway.) A particular disadvantage of Extensible Catalog’s MST is that it requires significant physical storage. In order to provide its synchronized transformation service, it maintains a local SQL copy of the entire catalog as pulled via OAI-PMH. As such, a single transformation pipeline from the ILS to BIBFRAME results in three complete instantiations of the catalog: the original in the ILS, a copy in the MST SQL database, and the exposed BIBFRAME version.
For libraries with robust technical services departments that are familiar with the APIs of their ILS, building a custom conversion tool could be an option. Our initial testing indicates that it will typically take from one to three months of full-time programming to code and test a fully functioning, stand-alone, custom conversion tool. Building a custom tool offers few advantages. It can, however, be useful in cases where the records being converted are stored in more than one system, or when attempting to combine records of different formats that reference the same object. For example, it is not uncommon for libraries holding special collections to maintain both a MARC record and an EAD record for the same object. Linked Data offers the opportunity to combine these two records into a single graph. In such cases, a custom application designed to communicate with both the MARC and EAD systems would be more efficient than using existing tools to create separate graphs and then combining the graphs after creation.
Third Party Service:
Zepheira Inc. will work with libraries to either assist with or completely handle a transformation process. To date, Zepheira has worked with the Library of Congress, a host of public libraries, and the American Antiquarian Society, to name a few, to convert their existing MARC records. Other vendors can be expected to move into this space as the number of libraries planning to transform records increases. Third party conversion services could focus on conversion for individual libraries or, taking advantage of economies of scale, provide a common, shared point of conversion and distribution. Libraries currently participate in shared cataloging through OCLC. A similar vendor service (OCLC is a natural point of service) that performs batch conversion and distributes converted records to libraries is a natural extension of services libraries already employ.
The final step in the Phase Two Linked Data transformation is the conversion of non-cataloging library systems to Linked Data operations, through either the development of necessary connectors or the adoption of Linked Data native versions of these systems as they become available. As with transitioning workflows, there is an advantage to pursuing an iterative approach to this last phase of transformation. Attempting to transition all systems simultaneously would be highly disruptive to overall operations: it increases the likelihood of a cascading error scenario in which failures propagate across nodes in the information pipeline, compounding the inherent difficulty of troubleshooting. Transitioning one system at a time localizes error potential, facilitates troubleshooting, and reduces potential impacts on the entire information ecosystem. Additionally, there are labor benefits to transitioning small teams at a time rather than the entire team over a short period. The small team approach offers management efficiencies and also simplifies human resources on-boarding and off-boarding.
Cataloging is the process of creating metadata for library collections, whether owned or accessed. Workflows associated with cataloging largely depend on the ecosystem in which cataloging activities take place. The BIBFLOW project examined the effects and opportunities created by transitioning cataloging to a native Linked Data ecosystem by examining the following workflows:
1. Copy cataloging of a non-rare book
2. Original non-rare book cataloging
3. Original cataloging of a print serial
4. Original cataloging of a print map
5. Personal and corporate name authority creation
The study method employed was to document the current workflows in place at the UC Davis library, followed by testing of various approaches to the same cataloging tasks using native Linked Data cataloging workbenches. In each case, an eye was directed toward efficiency, accuracy, and the training required for catalogers to work in the new ecosystem. The workflows tested were chosen because they are representative of the range of cataloging practice employed in the library.
Workflows for authority creation and management are covered in Section VIII of this report below. The remaining tested workflows are discussed in this section. Generally speaking, we found that catalogers had little difficulty transitioning to a Linked Data ecosystem. The amount of training required was equivalent to that of transitioning from one MARC-based interface to another. With the exception of serials cataloging, discussed below, neither a comprehensive knowledge of the technical details of Linked Data nor of the BIBFRAME model was required for catalogers to work successfully in the new environment. Additionally, cataloging in the Linked Data ecosystem offered various efficiencies in some workflows.
Upon completion of Steps One and Two of the transition plan outlined in this report, the Linked Data ecosystem consists of the following six components:
Figure 18: Six components of Linked Data ecosystem
At the center of this ecosystem is the Triplestore: the database management system for data in BIBFRAME format (RDF triples).
Human Discovery comprises the application(s) that facilitate transactions between patrons and the library’s triplestore. It should also support the retrieval of additional information from external resources pointed to by the URIs recorded in the local triplestore.
The Integrated Library System (ILS) is an inventory control tool used to manage the library’s internal operations only, such as ordering and payment, collection management, and circulation. In this model, it also serves as a stand-in for all external systems that communicate with the library’s catalog data. At the conclusion of Phase One of the transition plan, it will comprise a collection of applications that perform various functions such as acquisition, circulation, bibliometrics, etc. These systems may evolve to work directly with the triplestore, or they may continue to communicate with it through an API.
The Linked Data Editor is a tool that supports cataloging activities (metadata creation and management). At a minimum, an editor should have: 1) a user-friendly interface that does not require the cataloger to have a deep knowledge of the BIBFRAME data model or vocabularies; and 2) lookup services that can be configured to search, retrieve, and display Linked Data from external resources automatically.
Data Sources are resource locations available over the internet with which a Linked Data Editor can communicate in order to exchange data. These include endpoints such as OCLC WorldCat for bibliographic data and Library of Congress’s Linked Data services for subject and name headings. To increase the likelihood of finding authoritative URIs and to make library data more interoperable, the community should also explore the use of non-library data and identifiers, such as ORCID, publisher’s data, Wikidata, LinkedBrainz, etc.
Machine Discovery is a SPARQL endpoint that enables an external machine to query the library triplestore.
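A machine discovery request can be sketched as follows. The endpoint URL is a hypothetical placeholder and no request is actually sent; the query syntax is standard SPARQL 1.1, and the protocol detail used (sending the query as an encoded `query` parameter on a GET request) follows the SPARQL Protocol:

```python
# Sketch: build (but do not send) a SPARQL GET request against a
# library's endpoint. ENDPOINT and the predicate URI are assumptions.
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://library.example.org/sparql"  # hypothetical endpoint

query = """SELECT ?work WHERE {
  <http://viaf.org/viaf/96994048> <http://example.org/relation/wrote> ?work .
} LIMIT 10"""

# SPARQL Protocol: the query travels as an encoded 'query' parameter.
url = ENDPOINT + "?" + urlencode({"query": query})
request = Request(url, headers={"Accept": "application/sparql-results+json"})
print(request.full_url[:60])
```

An external system would dispatch this request and parse the JSON results; the point is that discovery requires no prior arrangement beyond knowing the endpoint URL.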
Figure 19 below illustrates the interactions among the six conceptual categories (OCLC and Authorities are used to represent “Data Sources”):
Figure 19: Interaction between the components of a Linked Data ecosystem
As can be seen, the information flows involved in a Linked Data ecosystem are more complex than in a MARC ecosystem. In the current MARC ecosystem, the Integrated Library System (ILS) acts as a centralized information exchange point wherein external data is ingested and served through a single point of access. The Linked Data ecosystem dis-integrates the ILS. The triplestore serves as a partial, centralized data store, but graphs stored locally in the triplestore are supplemented on the fly by information provided by other Linked Data services and can be interacted with by a flexible suite of applications. The net result is a more complex data ecosystem, but one in which the workflows surrounding the data remain unchanged or are actually simplified.
Below we discuss the impacts of Linked Data adoption on three main types of cataloging workflows – copy, original, and serials cataloging. In each case we present proposed Linked Data native workflows and discuss how they relate to traditional MARC-based cataloging workflows. Readers will note that the workflows presented are quite similar to their MARC ecosystem counterparts; however, each still presents its own issues and challenges. Some of the identified challenges may require further research and experimentation to address. Some may require the library community to rethink its cataloging rules and practices.
Linked Data copy catalogers will perform essentially the same tasks in a BIBFRAME ecosystem as they have traditionally in a MARC ecosystem: searching databases, finding existing bibliographic data, making local edits, checking access points, and saving data into a local system. During a Phase One implementation as defined in Section V above, the only additional required step is to synchronize thin MARC records with the existing ILS. The diagrams below illustrate the workflow used to perform copy cataloging. For demonstration purposes, OCLC WorldCat is used as an example of an external Linked Data source (OCLC publishes its bibliographic data in Schema.org) and the BIBFLOW Scribe interface (discussed in Section VI above) is assumed as the Linked Data cataloging workbench:
Figure 20: Step One of Linked Data copy cataloging workflow
In Step One, the copy cataloger uses the interface to see if a local bibliographic graph already exists for the item being cataloged. If a local graph does exist, a new local Holding is added to the local triplestore. If not, the cataloger moves to Step Two:
Figure 21: Step Two of Linked Data copy cataloging workflow
Step Two involves retrieving data about the item being cataloged from OCLC. This can be performed in one of two ways. Figure 14 in Section VI above depicts a system tested as part of the BIBFLOW project that allows users to scan the barcode of an item and automatically retrieve OCLC graph data based upon the extracted ISBN. Similarly, the BIBFRAME Scribe tool allows a cataloger to manually input an ISBN to perform the same search, or to perform a Title and/or Author search. In both cases, the cataloger may be required to disambiguate results, as a single ISBN or search return can reflect multiple Work graphs. This same disambiguation is required in a MARC ecosystem, and does not represent additional effort. Once an appropriate OCLC Work record has been identified, the Linked Data cataloging interface retrieves the graph for that resource from OCLC. This graph includes all information currently stored in exchanged MARC records. When a graph is pulled, its data is used to auto-fill all fields in the cataloging workbench for review by the cataloger.
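The disambiguation step can be sketched as follows. The candidate records and their attributes are invented for the example; in practice the cataloger (or a workbench heuristic) selects among the Work graphs a single ISBN search returns:

```python
# Sketch: one ISBN search can return several candidate Work graphs;
# pick the one matching the format in hand. Candidates are illustrative.
candidates = [
    {"uri": "http://example.org/work/10", "title": "Hamlet", "format": "print"},
    {"uri": "http://example.org/work/11", "title": "Hamlet", "format": "ebook"},
]

def pick_candidate(candidates, wanted_format):
    """Return the first candidate matching the desired format, or None.
    A real workbench would surface the choice to the cataloger instead."""
    for c in candidates:
        if c["format"] == wanted_format:
            return c
    return None

chosen = pick_candidate(candidates, "print")
print(chosen["uri"])
```

Once a candidate is chosen, its URI is what the workbench dereferences to pull the full graph, so a wrong choice here propagates into the local record; that is why the workflow keeps a human in the loop.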
Figure 22: Step Three of Linked Data copy cataloging workflow
Step Three involves using similar lookup functionality to automatically discover URIs for authority entries. Using services such as VIAF, Library of Congress Authorities, and the Getty vocabularies, catalogers can search for authorities using human readable forms and automatically pull Linked Data representations of the authority, including URIs.
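The lookup itself amounts to constructing a suggest-style query against an authority service. The id.loc.gov suggest endpoint shown below is believed current but should be verified against the service documentation; no request is sent in this sketch:

```python
# Sketch: build the URL a workbench would call to look up a heading
# against the Library of Congress names authority suggest service.
from urllib.parse import urlencode

def authority_suggest_url(term):
    """Build a names-authority suggest URL for a human-readable heading.
    Endpoint path is an assumption to verify against id.loc.gov docs."""
    base = "https://id.loc.gov/authorities/names/suggest/"
    return base + "?" + urlencode({"q": term})

url = authority_suggest_url("Shakespeare, William")
print(url)
```

The service returns candidate headings with their URIs; the cataloger picks the right entry and the workbench records the URI, which is exactly the step that spares catalogers from typing URIs by hand.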
Figure 23: Step Four of Linked Data copy cataloging workflow
Once the cataloger is satisfied with the graph data pulled from OCLC and any modifications made, the final step in the human cataloging workflow is to push the new graph to the triplestore. In the case of items for which no local bibliographic graph currently exists, this involves adding an appropriate bibliographic record to the database as well as the required Instance and Holding data. In a completely native Linked Data ecosystem, one in which all systems that surround the library’s cataloging data have been converted to communicate directly with the triplestore, Step Four is the final step in the copy cataloging process. In cases where the cataloger is working in a hybrid ecosystem (prior to the completion of Phase Two as defined in Section VI above), a final, machine-automated step is required:
Figure 24: Step Five of Linked Data copy cataloging workflow
In cases where the library is currently not operating in a completely Linked Data ecosystem, when a cataloger pushes a new graph to the triplestore (or modifies an existing one), these changes must be propagated to any systems still relying on MARC data. This transaction is handled by a machine process and requires no human interaction.
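Steps Four and Five together can be sketched as a commit to the triplestore followed by machine propagation to any MARC-consuming systems. The SPARQL serialization below is a minimal illustration, and the `to_marc` converter and `receive` interface are placeholders for whatever the local ecosystem actually provides:

```python
def sparql_insert(triples):
    """Serialize (subject, predicate, object) triples as a SPARQL
    INSERT DATA update.  Terms beginning with 'http' are treated as
    URIs; everything else becomes a quoted literal."""
    def term(t):
        if t.startswith("http"):
            return "<%s>" % t
        return '"%s"' % t.replace('"', '\\"')
    body = " .\n  ".join(" ".join(term(t) for t in tr) for tr in triples)
    return "INSERT DATA {\n  %s .\n}" % body

def commit_and_propagate(triples, execute_update, marc_systems, to_marc):
    """Step Four: commit the reviewed graph via `execute_update`
    (a function that sends a SPARQL update to the triplestore).
    Step Five: in a hybrid ecosystem, convert the graph to MARC
    with the supplied `to_marc` converter and deliver the record
    to each legacy system.  Returns the number of systems notified."""
    execute_update(sparql_insert(triples))
    record = to_marc(triples)
    for system in marc_systems:
        system.receive(record)  # placeholder interface for legacy systems
    return len(marc_systems)
```

In a fully native Linked Data ecosystem the `marc_systems` list is simply empty and the propagation loop does nothing, which matches the report's claim that Step Five disappears after Phase Two.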
As illustrated above, transition to a Linked Data ecosystem has no negative impact on the human workflows involved in copy cataloging and will improve efficiency in many cases due to the ability to auto-lookup and create graphs for items. Specific benefits of Linked Data copy cataloging include:
1. Catalogers do not need to search the OCLC database separately because the lookup services embedded in the Linked Data cataloging workbench can retrieve both bibliographic and authority data, with associated URIs, and automatically place the retrieved data into the appropriate fields (auto-populate)
2. Catalogers do not need in-depth knowledge of the BIBFRAME data model or BIBFRAME vocabularies because the data mapping between the data source (e.g. OCLC – Schema.org) and BIBFRAME has been done behind the scenes
3. Catalogers do not need to input URIs manually because the machine will record and save them into the triplestore automatically; they just need to identify and select the correct entry associated with a URI
4. Automated methods such as barcode scanning can be used to perform record creation in a fraction of the time currently required
One potential issue stands as a barrier to proper BIBFRAME implementation using the proposed model. Schema.org (the Linked Data framework used by OCLC) does not differentiate the title proper from the remainder of the title, but they are differentiated in the BIBFRAME specification. For our implementation, we opted to include the complete Schema.org title in the BIBFRAME Title Proper element. This approach was taken because a full-text search (or index) of a combined title element will match on any portion of the title. Given the nature of current full-text search capabilities, more discussion will be necessary about whether multiple title elements are still useful and, if so, how to reconcile OCLC and LOC data.
Transitioning to Linked Data cataloging using the proposed model raises the following questions for community consideration:
1. As per the discussion immediately above, and given the nature of current full-text search capabilities, are multiple title elements still useful and, if so, how should OCLC and LOC data be reconciled?
2. How much data is needed in the local triplestore? If most things can be identified by their associated URIs, and library discovery systems that sit on top of the local triplestore can pull information from external resources, how much data does the library still want or need in its local system?
3. If changes are made to source data, is it necessary to send the revised information back to the sources? If yes, what will we need to make this happen as an automatic process?
An original cataloging situation occurs when a cataloger is unable to locate, either locally or through an external authority, existing bibliographic data for the item being described. The process outlined above for copy cataloging an item included several options for searching both locally and through an external source (OCLC) for existing bibliographic graphs related to the item with which the cataloger is working. External lookup sources could include OCLC, publisher Linked Data endpoints, and even non-traditional data sources such as booksellers and Wikipedia. In the course of a cataloger’s workflow, it is possible that only partial data, or none at all, can be found for an object. In this case, the cataloger must switch to an original cataloging workflow.
Once a cataloger has switched to an original cataloging workflow, very little will change from current original cataloging methods. The task of describing the details of the item will remain the same; however, cataloging in a Linked Data environment offers some distinct efficiencies in the original cataloging workflow.
As discussed in Section V and Section VI of this report, Linked Data enabled cataloging workbenches have the ability to provide automatic lookup of entities at a variety of Linked Data endpoints such as OCLC, the Library of Congress, and Getty. This auto lookup feature facilitates original cataloging such that users can locate, disambiguate, and enter relevant data in a variety of fields that will be used to complete the bibliographic graph for an item. Current MARC-based cataloging systems employ similar functionality based on authority file lookup. When proper authority references are found, transitioning to Linked Data cataloging results in no net change in effort. However, Linked Data cataloging offers workflow efficiencies in situations where no appropriate authority references can be found.
Currently, a cataloger confronted with the need for a nonexistent authority follows one of the following two workflows:
Workflow One:
1. Identify need for new authority
2. Create new authority record
3. Submit new authority record to NACO
4. Return to original cataloging and continue cataloging item
Workflow Two:
1. Identify need for new authority
2. Submit request for new authority
3. Wait for response to request
4. Return to original cataloging and continue cataloging item
Both of the above workflows involve the cataloger moving from the current cataloging work to another workflow (and often another computing system and interface) to create or request creation of a new authority before returning (either immediately or after an undefined period of time) to the cataloging workflow.
Linked Data workbenches, such as the BIBFRAME Scribe workbench tested as part of this project, eliminate the need to step away, as it were, from the current cataloging effort to deal with authority issues. When a cataloger is unable to locate a suitable authority, the workbench prompts the cataloger to create a new authority using whatever information is currently available to the cataloger:
Figure 25: New authority in BIBFRAME Scribe
When a user creates a new authority entry, a graph for this authority is created in the local triplestore with a new, local URI. The cataloger is then returned to their ongoing cataloging effort.
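The minting step described above can be as simple as generating an opaque identifier under a locally controlled namespace and recording a minimal graph for it. A sketch, with the base URI and property choices as placeholders for whatever the local system uses:

```python
import uuid

# Placeholder namespace; a real deployment would use the library's own domain.
LOCAL_BASE = "http://authorities.library.example.edu/"

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def mint_local_authority(store, label,
                         rdf_type="http://id.loc.gov/ontologies/bibframe/Person"):
    """Mint a new local authority URI and record its minimal graph in
    `store` (a list of triples standing in for the triplestore).
    Returns the new URI so the workbench can use it immediately."""
    uri = LOCAL_BASE + str(uuid.uuid4())
    store.append((uri, RDF_TYPE, rdf_type))
    store.append((uri, RDFS_LABEL, label))
    return uri
```

Because the identifier is opaque and locally scoped, it can later be linked to an external authority (e.g. via a reconciliation service) without being rewritten.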
When a cataloger creates a new authority using the above system, the authority is subsequently available within the local domain for all future cataloging efforts. This ensures that all local cataloging efforts run efficiently, but it does not, by itself, solve the larger problem of authority control. As discussed in Section II above, Linked Data’s ability to facilitate information traversal rests on the availability of URIs over the network and also on the assumption that each entity is uniquely represented. As such, a local instance of a URI cannot function as an authority unless it is distributed across the network in a way that can be properly linked to or differentiated from other URIs in the Linked Data universe.
Section VIII below provides a more in-depth discussion of processes for managing the production of local URIs for new authorities. Relevant to the present discussion is the fact that systems can be put in place to allow for on-the-fly authority graph creation, thereby streamlining the workflows of catalogers involved in original cataloging. These efficiencies include:
1. Catalogers do not need to have in-depth knowledge of BIBFRAME’s data model or BIBFRAME vocabularies to perform cataloging because the terms used by Linked Data workbenches are the same ones currently used by catalogers
2. Catalogers do not have to leave the Editor in order to complete the cataloging work when confronted with authority issues
3. Catalogers have an option to create authority data on-the-fly and to mint local URIs which can be connected to other related URIs through a reconciliation service as discussed in Section VIII
In order to implement the above described workflow, the following systems need to be in place:
1. Robust lookup services which can interpret source data and present it in a readable format to catalogers
2. Systems for performing local authority reconciliation as described below in Section VIII
Transitioning to Linked Data cataloging using the proposed model raises the following questions for community consideration:
1. URIs are crucial for disambiguating and retrieving information in the Linked Data environment. As a result, the more sources a library can use, the less work is needed locally. But how do we find the right balance? To what extent should we consider using non-traditional information sources such as commercial booksellers and Wikipedia?
2. Cataloging descriptive rules have played an important role in the card or MARC cataloging environment. In a world where most of the entities we describe can be identified by a unique ID (URI), how much descriptive data do catalogers still need to create if that information can be retrieved from other data sources, such as publishers or vendors?
3. Library of Congress subject strings played an essential role in the era when the discovery technology was string based and not always automated. With faceted navigation and other features a 21st century library discovery tool can offer, library users can narrow down their search results more easily. Given this new environment, how much value is added by having tightly controlled, nested subject strings presented to library users?
4. Instead of creating new name authority data, would it make sense for the library community to start using other authoritative URI enabled name identifiers, such as ORCID (researchers) and ISNI (individuals and organizations) IDs and focus on building context around these identifiers?
Cataloging workflows described above can be used for cataloging serials. However, because of the changing nature of serial publications and the need to accommodate complex holdings information, cataloging serials in BIBFRAME has its own unique issues. During the lifetime of a serial publication, the serial title, issuing body, publication information, frequency, numbering, etc., may change. As a result, it is essential that catalogers are provided a means to associate dates or date ranges with assertions (triple statements). In this report, we want to highlight the following two areas where the current BIBFRAME data model will fail to maximize the potential of Linked Data:
1. The current state of BIBFRAME does not seem able to adequately address the issue of change over time to serials metadata. For example, there is no way to express a start and end date for changes to titles and publication information. It may make sense for the serials cataloging community to explore other vocabularies that are more suitable for modeling serials, such as PRESSoo, for use in conjunction with BIBFRAME.
2. Enumeration and chronology information is ubiquitous and important for describing serials. It is used with serial titles and appears in notes, item, and holdings records in the MARC environment. Figure 26 shows the mappings of enumeration and chronology data in MARC records to corresponding BIBFRAME properties.
Figure 26: Enumeration and chronology information mapping
As illustrated above, there are two problems with how enumeration and chronology information is expressed in BIBFRAME: 1) several different properties are often used to encapsulate a single data point, resulting in an overly complex representation; and 2) none of those data are machine-actionable because they are literals (strings of text). The serials cataloging community should consider the following questions:
a. Should enumeration/chronology data appearing at BIBFRAME Instance level be coded in a uniform way?
b. Should enumeration/chronology data appearing at both BIBFRAME Instance and Item be coded in the same way?
c. Does Linked Data offer the possibility of simplifying the ways in which we encode enumeration/chronology data while still achieving the same end-user functionality for which they are intended? For example, dropping enumeration when chronology alone is sufficient.
d. Would it be more useful to parse enumeration and chronology data currently recorded in MARC 853/863 fields into discrete pieces like this:
Figure 27: Possible model for holdings data
e. Should we explore other ontology/vocabularies such as ONIX for Serials Coverage Statement (Version 1.0) or Enumeration and Chronology of Periodicals Ontology?
f. Would incorporating other models or vocabularies enable the reusability of data? For example, harvesting existing enumeration and chronology data from content providers.
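The parsing suggested in (d) above can be sketched by pairing an 853 caption pattern with its linked 863 enumeration/chronology values. The pairing below follows the MARC holdings convention of matching caption and value subfield codes; a production parser would also need to honor the $8 linkage between fields and many edge cases (compressed holdings, alternative schemes, etc.):

```python
def parse_enum_chron(captions, values):
    """Pair MARC 853 caption subfields with the matching 863 value
    subfields, yielding labeled enumeration/chronology data.
    `captions` and `values` are dicts keyed by subfield code, e.g.
      captions = {"a": "v.", "b": "no.", "i": "(year)"}   # from 853
      values   = {"a": "32", "b": "4",   "i": "1998"}     # from 863
    Caption punctuation is stripped to yield clean labels."""
    return {captions[code].strip("()."): values[code]
            for code in captions if code in values}
```

The result ({"v": "32", "no": "4", "year": "1998"} for the example above) is exactly the kind of discrete, machine-actionable structure that question (d) proposes in place of concatenated literals.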
Several groups have been, and remain, actively involved in discussions surrounding modeling serials using BIBFRAME and other vocabularies. These include groups from the LD4P, LD4L, Library of Congress BIBFRAME working group, and the PCC BIBFRAME CONSER working group. Future reports from these groups may shed more light on modeling serials. Given the efforts currently devoted to this area of Linked Data implementation, it is reasonable to expect that best practices will be achieved before libraries are situated to begin the transition. It is also worth noting that work on this front could continue with different libraries adopting different serials models. While this scenario is not preferred, a multi-model ecosystem could be made functional through reconciliation graphs that use multiple sameAs designations to link disparate graphs. The following section provides an in-depth discussion of reconciliation models.
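The reconciliation approach mentioned above, in which multiple sameAs designations link disparate serials graphs, amounts to computing equivalence classes over the sameAs relation. A sketch using breadth-first traversal (the URIs in the test pairs are invented for illustration):

```python
from collections import defaultdict, deque

def sameas_closure(sameas_pairs, uri):
    """Return the full equivalence class of `uri` under the symmetric,
    transitive owl:sameAs relation, via breadth-first traversal of the
    undirected graph formed by the sameAs assertions."""
    neighbors = defaultdict(set)
    for a, b in sameas_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, queue = {uri}, deque([uri])
    while queue:
        for linked in neighbors[queue.popleft()]:
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return seen
```

A discovery layer could use such a closure to treat records modeled under different serials vocabularies as describing the same resource, which is what makes a multi-model ecosystem functional.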
Moving from MARC to Linked Data affords us the opportunity to take a fresh look at the way we describe serials. The answers to the challenges mentioned may be found by rethinking existing practices. Regardless of the path forward in serials cataloging, this is an area where we can expect the necessity for staff re-training.
This section characterizes the state of Linked Data readiness and awareness within the community of library services and product providers. Innovations in digital media applications on the Web from companies like Google and Amazon are clear wake-up calls to libraries and their service providers which, in response, need to expand their strategy to work in new and different ways. There are primarily two possible reactions to this major technological change: try to delay or deny the development, or seize the opportunity and use it to redefine the relationships between libraries and their communities of users.
Librarians and their service providers must work together to ensure that libraries are well positioned to take advantage of evolving technology and offer their rich resources to users in their communities and across the globe. In order to do so, library systems must become compatible with a range of external and internal systems including acquisitions, cataloging, circulation, discovery layers, and content management systems. An overview of the findings reported in this section can be found in Appendix A.
In pursuit of the objective to provide an assessment of Linked Data strategies in the library industry, a multi-method approach was employed. The information synthesized by this report was gathered through direct conversations with service providers and combined with material made publicly available to document the business and product development strategies of companies that provide library services. First, Zepheira worked with academic and public librarians to identify key library service providers. It then reached out to these companies in order to assess the following areas:
1. Have these service providers experienced demand for Linked Data integration or any Linked Data services yet?
2. Have the companies established any collaborative partnerships with customers or other companies for Linked Data developments, grant-funded or otherwise?
3. Have they published any reports, white papers or other public documents relevant to Linked Data initiatives?
For those companies that are not yet incorporating Linked Data into services, Zepheira then began educational discussions to explain the increasing interest in Linked Data among their customers in the library community. Some companies did not respond to requests for information, or were not willing to share information at the time of this report because their business plans are confidential for competitive reasons. In such cases, research was performed to gather public documentation on Linked Data products, services, and strategies. There may be service providers working on Linked Data products that are not addressed by this report. Due to the fast-paced and constantly changing nature of Linked Data adoption, this report is not intended to be comprehensive and does not provide recommendations to libraries for purchasing specific services.
Summary of Linked Data Assessment
With the exception of a few forward thinking companies including Atlas Systems, EBSCO, Ex Libris, Innovative Interfaces, Inc. (III), OCLC, Overdrive, ProQuest, SirsiDynix, and Zepheira, library vendors in general are either unaware or only minimally aware of Linked Data developments and benefits. Libraries, archives, and museums are starting to work together with their service providers to address these challenges and move toward a future with visible resources on the Web that can be used by a variety of Semantic Web applications. The following summary of the Linked Data assessment is divided up by service provider and arranged in alphabetical order. Details follow for each service provider on their plans for incorporating Linked Data, to the extent they were willing to share publicly.
Atlas Systems is the provider of Aeon, circulation and workflow automation software for archives and special collections, and ILLiad, resource sharing management software for automating interlibrary loans. Atlas Systems believes Linked Data and BIBFRAME will play a key role in how the Web understands libraries and reflects what libraries have to offer. Atlas Systems became a founding Libhub Initiative sponsor in spring 2015. The Libhub Initiative is an effort founded by Zepheira to bring people together to explore and experiment with Linked Data technologies in service of increasing library relevance through the Web. Atlas’ support, along with support from many other service providers, funded a forum and experimental space for librarians, libraries, and industry leaders.
In the fall of 2015, Atlas Systems partnered with Zepheira to start exploring Linked Data for archives and special collections. At the end of 2015, Atlas Systems became a Registered Service Provider for ArchivesSpace, an open source content management and publishing platform for archives and special collections. At ALA Mid-Winter, Atlas Systems presented findings of their Linked Data research to the Association for Library Collections & Technical Services MARC Formats Transition Group in Boston. Currently, Atlas Systems is continuing to explore methods for integrating Linked Data with ArchivesSpace, Aeon, and ILLiad.
EBSCO and NoveList
In February 2015, EBSCO announced that they would be funding development of Koha, an open source Integrated Library System created by librarians for librarians. Koha Linked Data updates were to include MARC to RDF cross-walking to enhance capabilities of linking to online data repositories. However, in April 2016 EBSCO announced that they would no longer be supporting Koha or Kuali OLE development and would instead fund the development of an open-source Library Service Platform. The outlined functional expectations for this new platform include support for Linked Data services. An initial version of the software will be available in early 2018.
EBSCO and their subsidiary company, NoveList, became Libhub Initiative sponsors in October 2015. EBSCO explained that they are “showing support for Zepheira and moving forward to support BIBFRAME and Linked Data which are seen as essential to opening up library collections to the World Wide Web.” NoveList launched the Linked Library Service in April 2016 at the Public Library Association in Denver. The service, created for public libraries, publishes Linked Data to the Web via the Library.Link Network. The Library.Link Network provides global infrastructure for publishing library Linked Data. NoveList is currently researching enrichment products and services for Linked Data.
Innovative Interfaces, Inc. (III)
In 2014, Innovative Interfaces, Inc. (III) demonstrated strong interest in Linked Data innovation and support for the library industry’s BIBFRAME transition by becoming a founding sponsor of the Libhub Initiative. In March 2016, after reviewing the results of Libhub Initiative experimentation done by III customers, the company partnered with Zepheira to release a new service, Innovative Linked Data. The goal of the Innovative Linked Data service is to extract bibliographic data from Polaris, Sierra, Millennium and Virtua library systems, transform the information, and publish the descriptions as Linked Data on the Web via the Library.Link Network. The Innovative Linked Data pages can be found on the open Web, including discovery via search engines. The pages direct users to the library’s interface where they can complete their interaction with the library. Leif Pedersen, Executive Vice President at Innovative, explains “the Innovative Linked Data service publishes regular updates of library data to the Web, and this constant exposure to search engines will help drive our library partners’ visibility among search results. Innovative Linked Data plays a critical role in the relevance and sustainable discovery of libraries, and catalog content and geographic locations are just the first step in our commitment to strengthen and expand the library user experience.”
III realized the importance of Linked Data before incorporating the technology into their tools and services. In early April 2015, III announced that “Linked Data is going to fundamentally change some of the assumptions which we have operated upon.” III continues to partner with Zepheira to streamline the transformation of their customers’ MARC records into Linked Data. At the 2016 Innovative User Group Meeting in San Francisco and the 2016 Public Library Association Annual Meeting in Denver, III launched Innovative Linked Data and made subscriptions to the service available.
Online Computer Library Center, Inc. (OCLC)
OCLC is broadly known for their support of Linked Data and actively speaks about integration of Linked Data into their strategy. Publication of the Virtual International Authority File (VIAF) and Faceted Application of Subject Terminology (FAST) as Linked Data were early demonstrations of OCLC’s strategic initiatives to provide authoritative library data in open formats native to the Web. Connexion, OCLC’s tool for creating, acquiring, and managing bibliographic and authority records, does not include Linked Data services. However, OCLC provides access to over 197 million bibliographic work descriptions in Linked Data format via WorldCat Works. These Linked Data entities are incorporated into WorldCat and made available to software applications via API. To support a more human-friendly understanding of these data, work entities are also available via the WorldCat Linked Data Explorer Interface. Libraries can use OCLC’s work entities to consistently identify works in a way the Web understands. Through creating actionable URIs to identify works, OCLC is providing the infrastructure that will be needed to identify works in future Linked Data based systems and services.
In January 2015, OCLC published a white paper with the Library of Congress entitled “Common Ground: Exploring Compatibilities Between the Linked Data Models of the Library of Congress and OCLC.” A major outcome of the paper is the recommendation that OCLC develop and test technical solutions that capture information expressed in BIBFRAME that cannot be expressed using the schema.org model. The report also recommends the development of services that can export and import BIBFRAME into OCLC systems without data loss.
In February 2015, OCLC featured BIBFLOW as part of the Collective Insight Series titled, “Linked Data [R]evolution: Applying Linked Data Concepts.” The goal of this session was to explain OCLC’s work with Linked Data and provide presentations from people experimenting with Linked Data in libraries, including “Linked Data in the Library Workflow Ecosystem” presented by Carl Stahmer, Director of Digital Scholarship at the UC Davis University Library.
A primary goal for OCLC’s work with Linked Data is to understand the library workflows that will drive the use of tools that use Linked Data. To support this strategy, OCLC is working with the Library of Congress, the BIBFRAME community, and the schema.org community. OCLC Research is also experimenting with the beta version of a discovery layer for Linked Data, called Entity JS, to demonstrate other uses for WorldCat Entities. In September 2015, OCLC announced a person entity lookup pilot project. The pilot aims to help library professionals reduce redundant data about people by linking related sets of identifiers and authorities. The libraries participating in the pilot include University of California, Davis, Cornell University, Harvard University, the Library of Congress, the National Library of Medicine, the National Library of Poland, and Stanford University. Together OCLC and these libraries will improve the relationships between authorities and the librarian’s ability to identify the vast number of people who create and are described by library collections.
To date, OverDrive’s public use of Linked Data has been limited to incorporating a small amount of schema.org markup into their interfaces in order to make high-level information available to Bing and Google. OverDrive continues to monitor Linked Data adoption in the library industry. The company is evaluating how Linked Data can be incorporated into their strategies for eBook, video, and audiobook access for public libraries. OverDrive is also engaged with Libhub Initiative partners and participants. Currently, OverDrive is working with customers to assess the potential utility of the Library.Link Network and possible integration with OverDrive content.
ProQuest and Ex Libris
In October 2015, ProQuest agreed to acquire the Ex Libris Group in order to “support ProQuest’s mission to innovate across libraries across the world.” In December 2015, Ex Libris announced their vision and roadmap for incorporating Linked Data into two products: Alma, their resource management service, and Primo, their discovery layer solution. The planned integration will enhance workflows and allow new methods for exploring library resources. In addition, Ex Libris plans to make the Linked Data provided by each product available to third-party tools.
In the outline of their plan for Linked Data services, Ex Libris explained, “While there is a shared understanding that the use of Linked Data will have many benefits in the form of new services for both library staff and end users, the precise nature of the possibilities is still a matter of discussion and debate. Ex Libris is working closely with libraries around the world to identify the scenarios and use cases that are expected to yield the greatest value to libraries and patrons, and is actively leading the way in planning and implementing linked-data services as part of the Alma resource management and Primo discovery and delivery solutions.” Ex Libris plans to incorporate BIBFRAME into their Linked Data services, which will include BIBFRAME import and export from Alma. The Alma Linked Data pilot has already produced demonstration functionalities to this end.
SirsiDynix was the first company to offer a Library.Link Network service for integrated library systems in partnership with Zepheira. In Fall 2015, SirsiDynix launched BLUECloud Visibility in order to transform their customers’ MARC records into Linked Data and make library resources visible on the Web. To make library Linked Data freely available to search engines and other applications, the service allows any library using Symphony or Horizon along with the BLUECloud web application to have their catalog data harvested, transformed to Linked Data, and published to the Library.Link Network.
In an announcement about their partnership with Zepheira, Bill Davison, the CEO of SirsiDynix said “Our goal is to take the mystery and complexity out of Linked Data and deliver to our customers a product that is plug-and-play. We want libraries to easily transform their MARC data into robust, Web-searchable, geo-locatable Linked Data—ready for the world to find.” Many libraries are starting to set up the infrastructure needed to publish a Local Graph of Linked Data. So far, 26 library organizations have published with SirsiDynix’s BLUECloud Visibility. These organizations include large consortia like the Houston Area Library Automated Network, the Library Integrated Network Consortium, and the System Wide Automated Network Consortium as well as individual libraries like Randwick City Library and special libraries like the International Bureau of Fiscal Documentation in the Netherlands.
In Fall 2014, Zepheira founded the Libhub Initiative, an effort to bring libraries together with data and service providers to explore and experiment with Linked Data technologies in service of increasing library relevance through the Web. The Libhub Initiative sparked more than 500 conversations, meetings, interviews and experiments with library professionals as well as library data and service providers, all committed to greater library relevance through better library visibility on the Web. With many successful partnerships with libraries across North America and early support from Atlas Systems, III, SirsiDynix and EBSCO/NoveList, Zepheira felt there was strong confirmation from libraries and vendors alike and saw a clear need for global Linked Data infrastructure. Currently, Zepheira’s top priority is offering its Linked Data infrastructure service, the Library.Link Network, to libraries, cultural history organizations, and their service providers who wish to improve the visibility of libraries and their collections on the Web. Partnering with library service providers to create Linked Data services lowers the barriers of entry for libraries that may not be able to participate in experimental projects.
Launched in 2016, the Library.Link Network is a global infrastructure for allowing libraries and other cultural heritage organizations to increase their visibility on the Web while maintaining the uniqueness of their own local identity. The Library.Link Network is the direct result of successful library-led collaborations for large-scale Linked Data experimentation completed under the umbrella of the Libhub Initiative. The Library.Link Network brings together libraries and their providers on the Web to share their localized, comprehensive, connection-rich stories. While Zepheira established the Libhub Initiative as a community space for libraries to share best practices around implementing Linked Data, the Library.Link Network provides shared infrastructure that libraries can use to make their resources and events visible on the Web by publishing their resources in Linked Data format.
The Library.Link Network infrastructure is used to reveal library resources including events, collections, bibliographic data and archival description in a Web-actionable format. Zepheira works with publishing partners and libraries to transform data from MARC and other formats into Linked Data to seed the Web with structured, openly published, and interlinked data.
Library.Link Network partners include Atlas Systems, Counting Opinions, Innovative, SirsiDynix, and most recently, EBSCO’s NoveList. Contributions to and participation in the Library.Link Network are possible at different levels. Some services are free to libraries, including the description of library locations and hours of operation with Linked Data. Other Library.Link Network services and partner services are fee-based, including Linked Data transformation for entire catalogs and publication of Linked Data to the Web via the Library.Link Network. Zepheira, SirsiDynix, Innovative Interfaces, and NoveList all offer library services that publish Linked Data to the Library.Link Network. All libraries are free to contribute their identifying information to the Library.Link Network in order to make the organization more visible on the Web. Participating in the Library.Link Network gives libraries and archives the opportunity to contribute collection details into an open data store, known as a Local Graph. The Library.Link Network also connects the shared resources across Local Graphs to create trustworthy Linked Data on the open Web.
Over 1,110 public library locations have published Linked Data via the Library.Link Network, including Denver Public Library, Arapahoe Public Library, Dallas Public Library, Worthington Public Library, and Tulsa Public Library. In total, 29,378,381 MARC records have been transformed, resulting in 118,799,193 Linked Data resources and 326,009,018 links connecting the data. Academic libraries are also beginning to join the Library.Link Network. Most recently, Boston University transformed their MARC catalogs into Linked Data and published them via the Library.Link Network. Jack Ammerman, Associate University Librarian for Digital Initiatives and Open Access, explains “We are committed to making the resources of Boston University Libraries discoverable in the preferred discovery environments of our users. Publishing our records as Linked Data is an essential first step for us. We are convinced that publishing these bibliographic data in Linked Data formats not only increases their discoverability, but enables their re-use in ways we can’t yet imagine.” The University of Manitoba was the first Canadian academic library to use the Library.Link Network. Les Moor, Head of Technical Services at University of Manitoba Libraries, is working to improve how users find resources. Les says, “Linked Data allows our faculty, students and researchers to use popular search engines like Google to find our resources. As a result, we take a big step towards closing a giant discovery gap.”
Discovery is the aspect of Linked Data implementation that is least studied, tested, and understood, and the aspect likely to have the greatest impact on library operations. It offers the potential for radical changes in the way users search and browse for library information. One of the most obvious and talked-about aspects of Linked Data adoption is the extent to which it positions library information to be accessible through search engines and connected with a graph of information beyond the library.
In his February 2015 presentation “Making Library Collections Discoverable on the Web,” part of the OCLC Collective Insight Series titled “Linked Data [R]evolution: Applying Linked Data Concepts,” Ted Fons outlines the way Linked Data-enabled library records can and should circulate in the wider information universe of the World Wide Web. According to Fons, as libraries increasingly shift focus from physical collections management to information access management, the need to make information discoverable through user-preferred mechanisms increases. This requires structuring information such that it can be located through non-library interfaces such as major search engines.
Search engine optimization is a valuable benefit of Linked Data adoption, but it is not the only, or even the most important, discovery advance it enables. Linked Data makes possible new visual discovery interfaces that speak to one of the most frequently voiced laments about the transition from stacks to screens: the loss of the serendipity of browsing. Consider the rudimentary network visualization of The Lord of the Rings presented in Figure 1 at the beginning of this report. Here the focus on a text of interest expands to include a multitude of context, much of which will inevitably be unknown to the user. This type of interface allows users to follow threads of relationship in a manner that harkens back to browsing the stacks, moving from one node to the next, with the option of focusing on a new node and following the subsequent traces growing from it.
The above is just one example of the new discovery mechanisms made possible through Linked Data adoption. Current Linked Data discovery efforts either present experimental pilot demonstrations of this potential (such as the thin network graph presented in Figure 1) or provide traditional search and discovery interfaces. One of the most widely adopted platforms for exposing Linked Data graphs for search and discovery is Blacklight.
Figure 32: Blacklight demonstration screenshot
Figure 32 above is a screenshot of the online demonstration of Blacklight. Blacklight is “a multi-institutional open-source collaboration building a better discovery platform framework.” It provides a traditional but sophisticated search and discovery interface to both MARC and Linked Data data-stores. Built on an Apache Solr/Lucene index, it provides fuzzy and full-text search with faceted browsing.
Another Linked Data discovery platform is Collex, a native Linked Data platform maintained by the Advanced Research Consortium (ARC) at the Institute for Digital Humanities, Media, and Culture (IDHMC).
Figure 33: Collex Linked Data browser
Figure 33 presents a screenshot of the Collex platform as implemented at the Michigan State University Library’s Studies in Radicalism Online. Much like Blacklight, Collex provides a fairly traditional library interface to its Linked Data data-store. But it also includes several features designed to capitalize on the extended web of Linked Data information. Note the “Currently Searching…” panel at the top of the right-hand menu column of the screenshot in Figure 33. Built into the Collex platform is the ability to query either an aggregated triplestore or a configured list of SOLR endpoints at other institutions, thereby producing an aggregated search and browse environment. This aggregation-through-linking functionality represents a significant step toward the type of network-expanded search and discovery made possible by Linked Data.
As noted at the beginning of this section, we are only beginning to explore the potential of Linked Data discovery, and experimentation along these lines is likely to continue for the next several years. Two important initiatives devoted to this area of research are the Mellon-funded Linked Data for Libraries (LD4L) and Linked Data for Production (LD4P) initiatives. These ongoing initiatives bring together Columbia, Cornell, Harvard, Princeton, and Stanford universities and the Library of Congress in a combined effort to examine the potential for Linked Data production and discovery, building on products such as Hydra, Blacklight, Fedora, VIVO, and Vitro. LD4L and LD4P have initiated wide engagement with the library community, and are actively developing new search and discovery methodologies and platforms based on their own research and engagement with other libraries.
It is difficult to predict exactly what new search and discovery approaches and capabilities will be developed out of initiatives like LD4L and LD4P. However, we already have enough examples of novel interfaces to begin to see some of the possibilities. Importantly, we currently lack sufficient library Linked Data data-stores to properly test and develop scalable Linked Data search and discovery platforms. We can, however, reasonably expect that the functionality of existing systems, which already provide capabilities on par with current library search and discovery, will expand over time as Linked Data uptake in the library community expands.
Actionable [Machine]: An object is said to be machine actionable when it is in a form that allows a computer to interact with it in some automated manner.
Application Programming Interface (API): An application programming interface (API) is a set of communication protocols that provide a clearly defined method of communication between various software components, programs, or network services.
BIBFRAME: Short for Bibliographic Framework—a data model created for bibliographic description. The design of BIBFRAME began in 2011 through a partnership between the Library of Congress and Zepheira. BIBFRAME’s goals include the replacement of MARC encoding standards with methods that integrate Linked Data principles in order to make bibliographic data more useful both within the library professional community and to the world at large.
Crosswalk: The process of migrating data from one serialized form to another.
Disambiguate: The process of determining which of several distinct entities an ambiguous name or reference denotes (for example, distinguishing two authors who share the same name).
Graph: A graph is a data arrangement that consists of nodes (objects) connected to each other via edges (relationships). A family tree is a common example of a graph where the persons represent nodes (John, Jane, etc.) and relationships represent edges (child, parent, etc.).
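The family-tree analogy can be sketched in a few lines of code; the names and relationships below are purely illustrative, not drawn from any real data set.

```python
# A minimal graph: nodes are people, edges are labeled relationships.
nodes = {"John", "Jane", "Sarah"}
edges = [
    ("John", "hasMother", "Sarah"),   # edge: John --hasMother--> Sarah
    ("Jane", "hasMother", "Sarah"),
    ("Sarah", "hasChild", "John"),
    ("Sarah", "hasChild", "Jane"),
]

def neighbors(node):
    """Return the (relationship, target) pairs for edges leaving `node`."""
    return [(rel, obj) for subj, rel, obj in edges if subj == node]

print(neighbors("Sarah"))  # -> [('hasChild', 'John'), ('hasChild', 'Jane')]
```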
Internationalized Resource Identifier (IRI): An IRI is a version of a URI that is encoded in a form that can render international characters.
Linked Data: According to the W3C, the term Linked Data refers to a set of best practices for publishing structured data on the Web: using Uniform Resource Identifiers (URIs) as names for things; using HTTP URIs so that people can look up those names; ensuring that when someone looks up a URI, useful information is provided; and including links to other URIs so that users can discover more things. Additionally, Linked Data describes a semantic data structure based on collections of n-triples, preferably (but not necessarily) serialized as RDF. See https://www.w3.org/wiki/LinkedData.
LD4L: Acronym for Linked Data for Libraries, a Mellon funded initiative focused on examining a variety of issues surrounding Linked Data implementation in libraries. See https://www.ld4l.org/.
LD4P: Acronym for Linked Data for Production. An extension of the LD4L project. See https://www.ld4l.org/ld4p/.
n-triple: An n-triple is the fundamental structure of Linked Data graphs, wherein relationships between objects are described through “subject::predicate::object” statements. For example, “John::hasMother::Sarah” is an n-triple.
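A minimal sketch of splitting the shorthand notation above into its three parts (note that real N-Triples syntax identifies subjects and predicates with full URIs rather than bare names):

```python
# Parse the "subject::predicate::object" shorthand used above into its parts.
def parse_triple(statement):
    subject, predicate, obj = statement.split("::")
    return subject, predicate, obj

s, p, o = parse_triple("John::hasMother::Sarah")
print(s, p, o)  # -> John hasMother Sarah
```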
Resource Description Framework (RDF): A standard model by the World Wide Web Consortium (W3C) for expressing Linked Data on the Web. See “Resource Description Framework (RDF) Model and Syntax Specification”. 22 Feb 1999. Accessed August 1, 2015. http://www.w3.org/TR/1999/ REC-rdf-syntax-19990222/.
Serialization: Serialization is the process of representing data in a particular form. In the Linked Data universe, this refers to one of the many formats that can be used to represent n-triples, such as RDF/XML, Turtle, and JSON-LD. A non-technical way to understand serialization is to think of it as the way a triple is formatted.
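As an illustration, the same statement (“John has mother Sarah”) can be written in two common serializations; the example.org URIs below are hypothetical placeholders.

```turtle
# N-Triples: one full triple per line, terminated by a period.
<http://example.org/John> <http://example.org/hasMother> <http://example.org/Sarah> .

# Turtle: prefixes abbreviate the URIs; the content is identical.
@prefix ex: <http://example.org/> .
ex:John ex:hasMother ex:Sarah .
```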
Schema.org: A Linked Data standard ontology implemented by most major search engines. See http://schema.org/.
Thin [MARC, Record, or Graph]: A thin record or graph is a sparse collection of data that describes only the minimum depth of information necessary for a particular context, as opposed to the full range of information that may be known about the object.
Traversable: When a person or computer follows the chain of relationships represented by a data graph, moving from one related node to the next, she/he/it is said to traverse the graph.
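Traversal can be sketched as a breadth-first walk over a small set of triples; the data below is purely illustrative.

```python
from collections import deque

# Illustrative triples; traversal follows edges outward from a starting node.
triples = [
    ("John", "hasMother", "Sarah"),
    ("Sarah", "hasMother", "Ruth"),
    ("Ruth", "bornIn", "Dublin"),
]

def traverse(start):
    """Breadth-first walk of the graph, returning every node reached."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for subj, _rel, obj in triples:
            if subj == node and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen

print(traverse("John"))  # reaches Sarah, Ruth, and Dublin
```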
Uniform Resource Identifier (URI): In information technology, a Uniform Resource Identifier (URI) is a string of characters used to identify a resource. Such identification enables interaction with representations of the resource over a network using specific protocols. In practical terms, to human readers URIs look like the URLs used to navigate the World Wide Web. URIs, however, by convention, are intended to be permanent identifiers for a resource, regardless of where the resource lives (or moves to) on the network. In other words, an item’s URL could change if, for example, a web-based resource moved to another hosting environment, but its URI would not, and any person or machine that traverses the URI would be directed to the current URL for the resource.
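The URI/URL distinction can be sketched as a lookup that maps a permanent identifier to its current location; both identifiers below are hypothetical examples.

```python
# A persistent URI maps to whatever URL currently hosts the resource.
uri_to_url = {
    "http://id.example.org/resource/42": "http://host-a.example.org/items/42",
}

def resolve(uri):
    """Return the current URL for a persistent URI."""
    return uri_to_url[uri]

# The resource moves to a new host: the URL changes, the URI does not.
uri_to_url["http://id.example.org/resource/42"] = "http://host-b.example.org/42"
print(resolve("http://id.example.org/resource/42"))  # -> http://host-b.example.org/42
```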
Workflow: The steps involved in completing a defined task.