How the Cloud Brings New Opportunities to the Forefront for the Library of Congress

"As the oldest federal institution, the Library of Congress serves as a steward to some of the country’s most defining works, inspiring learners, creatives and artists, for more than two centuries. But as information becomes ever-more digital, the Library is turning to the cloud to expand access and enhance the services they provide in order to realize its vision of connecting all American people to the Library of Congress. In order to support accessibility and enhance the availability of services they provide, the Library is turning to transformative tools and capabilities in the cloud. Machine learning tools, including natural language translation and assistance for the visually impaired, are enabling the Library to scale and extend their services to include previously less accessible collections and resources, only available in their original printed format."

– Shannon Sullivan, Director of Federal, Google Cloud

It took 14 years for the Library of Congress to amass its first 3,000 books and maps, but just a single day to lose them all. In 1814, British troops turned the Library’s original volumes into kindling to burn down the Capitol, torching the symbolic seat of American democracy and learning.

Far from losing relevance in the digital age, the Library’s potential value today could be greater than ever before.

Today, the Library is the world’s largest repository of knowledge, employing several thousand personnel and spanning multiple buildings and departments in the pursuit of its mission: collecting, preserving and ensuring access to the world’s documents. But like other federal agencies, the Library of Congress is now looking to advance its mission into the next era of digital transformation, an effort spearheaded by the Library’s forward-looking leadership.

In fiscal year 2018, the Library of Congress…

Issued more than 560,000 Copyright registrations

Recorded a total of 168,291,624 items in the collections

Circulated nearly 21 million copies of braille, audio, and large-print items to over 970,000 blind and physically handicapped readers

Received nearly 1.9 million onsite visitors and more than 497.9 million page views on the Library's web properties

Recorded 114 million visits to the Library's web properties

Among those leaders is Kate Zwaard, the Library’s chief of digital initiatives. Her chief objective is to use technology to help make the Library’s digital material “maximally available”: that is, to make it accessible to as many users as possible, for as many purposes as possible, with the capability to turn that data into useful knowledge for any who seek it.

“This is the first digital strategy the Library has ever published,” Zwaard says.

While the Library has always strived to make its collections widely available, “what’s new is using digital to connect people to both the tangible material here in Washington, D.C., and also to make our digital material [more widely] available,” she adds.

In a model that sought consensus from the Library’s stakeholders, Zwaard’s team developed a roadmap for its digital future that builds on the Library’s existing strategy with a hyperfocus on emerging technologies. These include cloud storage, machine learning, and elastic search, all of which are being used to prepare a foundation for the institution’s digital journey.

The Cloud Springboards Transformation

Zwaard believes the cloud is the critical centerpiece that lets these technologies truly take off. For example, the cloud can give her team a way to automatically enforce digital rights management, so that only the appropriate types of users can access material at a given time and place. It could also give researchers access to portions of the Library’s collection without disrupting access for other users.

“We, the American citizens, have paid for this, and we should be able to use it,” Zwaard says. “We're also serving learners and creatives and artists and people who want to use our material in ways that even we don't expect. To that end, we've invested in converting our materials into machine-readable format to enable greater access and see where computation can help people better understand the humanities.”

To put these innovations in perspective, Zwaard tells the story of Frederick Mosteller and David Wallace, two mathematicians who spent three years studying the writings of James Madison and Alexander Hamilton to identify tell-tale elements of their individual writing styles and determine, beyond a reasonable doubt, the authorship of a dozen of the 85 Federalist Papers. Mosteller and Wallace used statistical analysis without the benefit of computers to make their discovery. In so doing, they also demonstrated the power of statistical analysis to uncover truths that had until then vexed the best minds in law and history.

“Now fast-forward to modern times,” she says. “There was a statistician who figured out that the author Robert Galbraith was really J.K. Rowling, who wrote the Harry Potter books.” With the benefits of modern technology, the statistician hired to investigate the case was able to make a definitive match in just under half an hour, significantly faster than the three years it took Mosteller and Wallace to conduct their research. “If you imagine the kind of knowledge that is hidden in our collections, then you can see the inherent potential energy that computation could help us unlock with machine-readable access.”
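
To make the idea concrete, here is a minimal sketch of the kind of comparison such authorship studies rest on: measuring how often a writer uses common function words and seeing which known author's habits a disputed text most resembles. The word list, the function names, and the simple distance measure are illustrative assumptions, not the method Mosteller and Wallace or later analysts actually used.

```python
import re
from collections import Counter

# Hypothetical marker words chosen for illustration. Function words carry
# little topic signal, so their rates tend to reflect habit, not subject.
FUNCTION_WORDS = ("upon", "while", "whilst", "on", "by", "to")


def rates_per_thousand(text, words=FUNCTION_WORDS):
    """Return each marker word's occurrences per 1,000 tokens of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {w: 1000 * counts[w] / total for w in words}


def profile_distance(a, b):
    """Sum of absolute rate differences between two word-rate profiles."""
    return sum(abs(a[w] - b[w]) for w in a)


def closest_author(disputed_text, known_samples):
    """Pick the author whose known writing is nearest the disputed text.

    known_samples maps an author name to a sample of their writing.
    """
    target = rates_per_thousand(disputed_text)
    return min(
        known_samples,
        key=lambda name: profile_distance(
            target, rates_per_thousand(known_samples[name])
        ),
    )
```

A real study would need far longer samples, a carefully chosen word list, and a proper statistical model to say how confident the match is; the sketch only shows why machine-readable text makes this kind of question quick to ask.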

The Public Taps Cloud for Innovation

As Zwaard sees it, the Library can deliver greater insights and help fuel the development of new knowledge through the intelligent use of advanced digital technologies. In 2018, the Library hosted a Congressional data challenge. “As part of our special relationship with Congress, we make available legislative information [via Congress.gov]. Members of Congress use this, their staff members use it, but also members of the public use it,” she explains. Inviting professionals and students to compete for cash prizes by designing digital tools that extract sentiment from text, Zwaard and her team were looking to answer a question: “what if we challenge people to think of [this legislative information] as a set of data? What could people do with it?”

The winners of the competition? Two high school students.

“That really surprised us,” Zwaard says, marveling at their success. One created a tool that clarifies for users the content of America’s treaties with other countries. The other built a smartphone app that sought to pair legislators with others who shared their interests and were receptive to collaborating on legislation.

“The thing that was so exciting about this for me was that our judges realized congressional staff would actually use these tools,” Zwaard says. “If that’s the kind of creative capability we can unearth from high schoolers, just imagine if we made data that much more accessible and widely available. Imagine what we could find.”

Zwaard wants to leverage cloud capabilities in other ways, most notably in supporting the Library’s crowd-sourcing initiatives. These initiatives involve enlisting the public’s help to identify images, transcribe handwritten letters, and take on other tasks that machines can’t do well. The Library has launched an app dedicated to crowd-sourcing this information. Crowd-sourcing gives the public a chance to contribute to the Library, but also aids the Library in making parts of its collections more useful to the public. When public interest peaks, however, database calls can spike. Cloud-based hosting helps manage those spikes to keep systems accessible.

Library of Congress Puts Cloud to Use to Jump Data Hurdles

With some 60 petabytes of data in its ever-expanding collections, the Library’s storage requirements are enormous. But as the central repository of the nation’s knowledge base, the Library of Congress is unlikely to move everything to the cloud. “There is great interest in having a copy of everything we have in our physical control,” Zwaard says. That interest extends to digital copies as well as physical ones; cloud storage, for its part, can scale as storage needs dictate.

But in Zwaard’s opinion, the cloud’s greatest utility for the Library is not in storage, but in its ability to process massive volumes of data under one roof. “Computational analysis of data requires you to have a copy of all that material in a place that you can compute against. Traditionally, that's meant downloading it to laptops. But we are talking about data collections that are too big to be downloadable.”
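
A rough back-of-the-envelope calculation shows why downloading is not realistic at this scale. The sketch below assumes the roughly 60 petabytes cited for the Library's collections, a decimal petabyte (10^15 bytes), and a hypothetical connection that sustains 1 gigabit per second without interruption; the function name is illustrative.

```python
PETABYTE = 10**15  # bytes, decimal definition


def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to move size_bytes over a link at a sustained rate."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


# Assumed: the full 60 PB over an uninterrupted 1 Gbps connection.
days = transfer_days(60 * PETABYTE, 10**9)
print(f"{days:,.0f} days, or about {days / 365:.1f} years")
# prints "5,556 days, or about 15.2 years"
```

Even a researcher who needed only one percent of the material would wait nearly two months under these assumptions, which is the practical case for bringing the computation to the data.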

The cloud can be particularly helpful when those datasets get really large. “Let’s say you want every instance of the mention of the word ‘elephant’ in all of the movies, all of the books, and all of the newspapers in our collection,” Zwaard explains. “The conventional approach to search the data would be to pull the collections onto a laptop that could process the information, but a data set this large is going to be more than a person can copy onto their laptop.” To improve what can be managed at scale, the Library is “investigating ways we could use the cloud to allow people to link that data, which is in the cloud, to a virtual machine, which is also in the cloud. Now they can pay for their own computation against our collections.”
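
As an illustration of the kind of job a researcher might run on a cloud virtual machine sitting next to the data, the sketch below counts mentions of a word across a stream of documents. The function name and the (identifier, text) input format are assumptions made for the example; the Library has not described its actual interfaces.

```python
import re
from collections import Counter


def count_mentions(documents, term):
    """Count whole-word, case-insensitive matches of term per document.

    documents is any iterable of (doc_id, text) pairs, so items can be
    streamed one at a time instead of being loaded all at once.
    """
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    hits = Counter()
    for doc_id, text in documents:
        n = len(pattern.findall(text))
        if n:  # record only documents that mention the term
            hits[doc_id] = n
    return hits


# Hypothetical sample items standing in for books, newspapers, and films.
sample = [
    ("book-1", "An elephant met another Elephant."),
    ("news-7", "No animals here."),
    ("film-3", "Elephants are not counted; elephant is."),
]
print(dict(count_mentions(sample, "elephant")))
# prints "{'book-1': 2, 'film-3': 1}"
```

Because the function reads one document at a time, the same logic could in principle be split across many machines, each scanning its own slice of the collection.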

On top of this, cloud resources can be used to organize search across digitized collections. In all, the Library has about 60 petabytes of data in its digital collections. To put that in context, Netflix has about 4 petabytes of video “master copies” in its entire streaming service; meanwhile, the Event Horizon Telescope team required 5 petabytes of storage to hold the data behind its historic first images of a black hole in April 2019. Managing that astronomical volume of data requires the Library to identify which data is available to the public, which data is only available for view on-site, and which data is restricted by rules set out in a gift letter.
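
The three categories described above can be pictured as a simple access check. This is only a sketch: the category names, the function, and the has_gift_terms_clearance flag are hypothetical, and real rights management would involve many more conditions.

```python
from enum import Enum


class Access(Enum):
    PUBLIC = "public"            # available to anyone, anywhere
    ONSITE_ONLY = "onsite_only"  # viewable only at the Library
    RESTRICTED = "restricted"    # limited by the terms of a gift letter


def can_view(access, is_onsite, has_gift_terms_clearance=False):
    """Decide whether a user may view an item with the given access tier.

    has_gift_terms_clearance is a hypothetical flag meaning the user
    satisfies whatever conditions the donor's gift letter sets out.
    """
    if access is Access.PUBLIC:
        return True
    if access is Access.ONSITE_ONLY:
        return is_onsite
    return has_gift_terms_clearance
```

Tagging every item with a tier like this is what would let a search service return only the results a given visitor is actually allowed to open.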

“We focus on digitizing material that's either unique to the Library or that is rare,” Zwaard says. “We emphasize the material that we can share the most widely in terms of our digitization dollars. But we also have large digital collections. Those include electronic newspapers, ebooks, journal articles, and also web archiving material — government websites but also websites where the copyright owner has given us permission to crawl it and show it.”

The Library of Congress also sees promise in leveraging a cloud-based geospatial hosting environment in order to combine geospatial information with demographic, environmental, or other kinds of data. The Library has already begun investing heavily in this environment, which allows researchers at the Library and in the Congressional Research Service to pursue in-depth geographical research.

“As you might guess, members of Congress are very interested in geography,” Zwaard says. “But they’re also interested in using computation to display that research in new ways.”

Disabled Americans Benefit from Cloud-Hosted Content

One of the more powerful stories emerging from the Library’s digital migration is its plan to use voice recognition technology at cloud scale to serve portions of the population that have historically lacked access to reading materials. For example, the Library now has a service to provide audiobooks to qualified, visually impaired Americans or those who cannot use a conventional computer. In FY 2018 alone, the Library circulated nearly 21 million copies of braille, audio, and large-print items to over 970,000 blind and physically handicapped readers. “We are investigating what it would take for us to develop natural language search that could enable you to work with Siri, Google Home or Alexa to search the book's availability through this program,” she says.

The Way Forward Is Hybrid

While the Library of Congress will never give up managing its own central collections, it does recognize that commercial cloud services can enable its other missions to push the envelope for what is possible, not just in making its collections accessible and searchable to the broader public, but also in creating new avenues for research and understanding.

[Photo caption: Kate Zwaard, chief of digital initiatives, Library of Congress, is looking to leverage emerging tech in the cloud to unlock the Library’s potential as an information source. Photo credit: Amelia Shuler]

“We are in the process of transforming to a complete hybrid data center model,” Zwaard says.

“We're building out a physical, on-premises Tier Three data center facility, with all the needed redundancy built in, but at the same time, we’re also building out the cloud. When people come to our website and want to look at something, that stuff will live in the cloud,” she says.

This approach provides the Library, Congress and the public with the best of both worlds.

“Hybrid lets us leverage the technology of the cloud while still protecting both the long-term nature of our occupation, the sensitivities around the cloud, and the need for compute power that the cloud can deliver best.”