W3C

DPub Archival Task Force

24 Mar 2016

Agenda

See also: IRC log

Attendees

Present
Bill Kasdorf, Leonard Rosenthol, Tim Cole, Ayla Stein, Craig Van Dyck (guest from CLOCKSS), Nicholas Taylor
Regrets
Tzviya Siegman, Markus Gylling
Chairs
Ayla Stein, Tim Cole
Scribe
Tim Cole

Contents


<scribe> scribenick: TimCole

Minutes Approval

TimCole: Any concerns about the minutes?
... Hearing none, they are approved.

CLOCKSS AND LOCKSS

Bill_Kasdorf: Do we need to give Craig and Nicholas an introduction?

Craig: we've read the documents, but would appreciate a brief intro

Bill_Kasdorf: First, DPub is a W3C Interest Group
... IGs do not publish recommendations but they inform the W3C about issues and coordinate with Working Groups that make Recommendations
... In the context of that work, an initiative was undertaken to draft a vision of a Web-based format that is independent of whether a document is online or offline
... This led to the Portable Web Publication document
... when online all the components of a PWP document are available online
... but whether online or offline to the user it's the same document.
... in the course of this discussion the issue of forming an archival task force came up
... we want to make sure that PWP is useful and can be archived.

Craig: I am the executive director of CLOCKSS, after 37 years in publishing.

Nicholas: Web archiving service manager at Stanford
... mostly work on Web archiving generally, but have been working a lot on LOCKSS

Craig: CLOCKSS is a free-standing org made up of publishers and libraries
... publishers are billed both annually and by the article or book archived.
... CLOCKSS builds on the LOCKSS protocol
... CLOCKSS adds a controlled element -- LOCKSS typically has more copies...
... libraries have a traditional role of archives, but in digital world they are not the holders of the digital copies
... so libraries wanted trusted 3rd parties to whom publishers could provide content
... 3rd parties like CLOCKSS are dark archives for safekeeping in case the resources become unavailable on the web (e.g., the publisher goes out of business)
... CLOCKSS for example has 20,000+ journals, so if a journal goes away, CLOCKSS can provide archival access
... scholars highly dependent on the literature, so access to lit is crucial
... CLOCKSS helps ensure that access
... CLOCKSS harvests or crawls - the publisher agrees that LOCKSS can harvest
... harvested content is then put in the 12 nodes that CLOCKSS has spread out across the globe
... 2nd method is to accept files from the publisher (or retrieved from the publisher)
... Regardless of whether delivered by publisher or crawled, the norm is to simply archive the files, not to normalize
... we do, however, perform quality assurance on the materials ingested.
... LOCKSS allows nodes to confirm that all nodes have the same content
... a voting system can be used among the nodes to validate that the data is correct
... this checking that all copies match is ongoing, and mismatches are repaired as needed
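The poll-and-repair cycle Craig describes might be sketched roughly as below. This is a toy illustration of majority voting over content hashes, not the actual LOCKSS polling protocol; node names and the repair strategy are invented for the example.

```python
import hashlib
from collections import Counter

def content_hash(data: bytes) -> str:
    """Hash a stored copy so nodes can compare holdings without shipping content."""
    return hashlib.sha256(data).hexdigest()

def poll_and_repair(copies: dict[str, bytes]) -> dict[str, bytes]:
    """Majority vote among node copies; repair dissenting nodes.

    `copies` maps a node name to the bytes that node currently holds.
    Returns the holdings after repair.
    """
    votes = Counter(content_hash(data) for data in copies.values())
    winning_hash, _ = votes.most_common(1)[0]
    # Any copy whose hash matches the majority can serve as the repair source.
    good = next(d for d in copies.values() if content_hash(d) == winning_hash)
    return {node: (data if content_hash(data) == winning_hash else good)
            for node, data in copies.items()}
```

In this sketch a damaged node is simply overwritten from a majority copy; the real system runs such polls continuously across its nodes.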

Bill_Kasdorf: question about nature of the content in CLOCKSS
... when you sign on a publisher is it all content of that publisher?
... is it mostly scientific literature

Craig: broader than just science
... anything scholars access from publishers
... in theory all journal and book publications, but in practice books may not be added right away
... databases, other content types not always in scope

Nicholas: drilling down on how CLOCKSS acquires content by harvest
... one challenge is that publishers use different platforms
... we have to account for differences in how content and related files are served
... we create 'plugins' for each publisher
... one thing that might be useful about PWP would be more consistent presentation across publisher platforms
... this might make it easier to acquire content
... but one concern would be that the PWP might be a parallel version - which is the canonical version? Do they stay in sync?

lrosenth: Our thinking right now - we recognize that for any given publication there is a canonical version
... our strategy is that there will be a locator associated with any 'copy' that refers back to canonical copy and/or breadcrumbs through versions that you have to follow
... so for example if annotations are added you may have a new version and so a new locator, but you can still go back to the original
... There is no requirement that a PWP is served as package
... for example a publication talking about the Mona Lisa, and so the most correct version of the pub references the Mona Lisa at the Louvre
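Leonard's locator-and-breadcrumbs idea might look roughly like the following: each derived copy points back toward its source, and a consumer walks the chain to reach the canonical publication. The field names (`locator`, `derived_from`) and URLs are hypothetical, since no PWP locator format has been specified.

```python
def find_canonical(copy: dict) -> str:
    """Follow the breadcrumb chain from any copy back to the canonical
    publication's locator. Field names are hypothetical, not from a spec."""
    while "derived_from" in copy:
        copy = copy["derived_from"]
    return copy["locator"]

# A canonical publication and an annotated derivative that points back to it.
original = {"locator": "https://publisher.example/pub/123"}
annotated = {"locator": "https://mirror.example/pub/123-annotated",
             "derived_from": original}
```

Here `find_canonical(annotated)` and `find_canonical(original)` resolve to the same locator, which is the property Leonard describes: however many versions intervene, the breadcrumbs lead back to the canonical copy.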

Nicholas: the manifest sounds like the Signposting work being discussed in the Web archiving community more generally

lrosenth: yes, sounds relevant
... the manifest documents everything, every part that is necessary to 'present' / 'consume' the publication
... this would include what a machine might need
... if you process the manifest, you have the set of elements needed for the publication
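As a rough illustration of what Leonard describes (the manifest enumerating every part needed to present or consume the publication), a sketch of such a manifest and a routine that extracts the archivable resources might look like this. All field names are invented for illustration; the PWP manifest format is not yet specified.

```python
# A hypothetical PWP-style manifest: every resource needed to present the
# publication, plus a pointer back to the canonical version.
# All field names here are illustrative, not from any spec.
manifest = {
    "canonical": "https://publisher.example/pub/123",
    "resources": [
        {"href": "index.html", "type": "text/html"},
        {"href": "styles.css", "type": "text/css"},
        {"href": "fonts/serif.woff2", "type": "font/woff2",
         "fallback": "serif"},  # licensed font with a named fallback
    ],
}

def resources_to_archive(manifest: dict) -> list[str]:
    """Return the hrefs an archive would need to fetch to hold a
    complete, consumable copy of the publication."""
    return [r["href"] for r in manifest["resources"]]
```

For an archive like CLOCKSS, such an inventory would replace the per-platform heuristics Nicholas mentions: the crawler would fetch exactly what the manifest lists rather than guessing which links matter.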

Bill_Kasdorf: If there are elements that the publisher wants to 'protect', the publisher can count on CLOCKSS not to release any of this until a trigger event occurs

Craig: Yes

Bill_Kasdorf: So there is another level of abstraction, potentially
... a font might be an example
... some of these fonts may require licenses
... so the PWP may name a fallback

lrosenth: Yes, that seems right. Important to look at.
... may help define which PWPs are truly archival

Nicholas: Seems like spec has a lot of potential
... as it stands now we have this two-pronged archival approach
... PWP may bring these together a little more
... the manifest idea or signposting or some level of semantic annotation that projects the publisher's perspective of the publication would be very useful
... right now we have a set of heuristics that we need to keep revisiting
... a little difficult to know what it might look like in practice

lrosenth: as I understand LOCKSS, you make no requirements on the content
... so if a trigger event occurs, you release what you have

Craig: not making any kind of legal warrant
... but our expectation is to deliver content in a consumable way
... we're not focused on user experience, but rather on content

lrosenth: Bill was talking about fonts
... in case you have a document that references a font, but the font is not archived
... you don't overtly address font issue?

Nicholas: Once the content is archived, we try to make sure that all the required content is archived
... not sure if we are going after fonts

lrosenth: has a lot of implications for what we are trying to do

Bill_Kasdorf: as an XML guy, ideally content is Unicode, so you should be able to consume the content, albeit without the proper glyph

lrosenth: But since they don't normalize, they can't ensure the content is Unicode

TimCole: a manifest may make it easier to make sure you get everything you need?

Nicholas: yes, it can be difficult to know for each platform exactly which links should be collected (e.g., fonts vs. publisher home page)
... is this related to Packaging on the Web (W3C)?

lrosenth: that initiative is broader and separate, but as we start talking about what our packages look like, we will look at that work

Bill_Kasdorf: would Nicholas have more time to join the DPub IG, since Stanford is already a member of the W3C?

Nicholas: have already joined...

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.144 (CVS log)
$Date: 2016/03/24 19:28:58 $