Google seeks better access to government information

Online search giant says as much as 40 percent of public information on agency Web sites is unsearchable.

Officials from the leading Internet search engine are working to remove barriers that prevent their technology from reaching vast troves of information buried in government databases.

Internet users want government information because it has a reputation for being reliable and accurate, said J.L. Needham, a strategic partner development manager at Google. But while portions of agency Web sites are easily indexed by Google and other common search engines, the engines cannot search other areas, known as the deep Web.

For instance, Google cannot scan information in the database housed at the Environmental Protection Agency's Regulations.gov Web site, Needham said. The site allows users to view government regulations and post comments on proposed agency rules.

"If you were a business owner and found out you were potentially subject to a new regulation that you wanted to find out more information on, it may be difficult to find this information using a search engine like Google," Needham said. "The problem is that search engines are unable to crawl the full text of many government agencies' databases."

As much as 40 percent of the content on agency Web sites is invisible to Google's crawlers, Needham said. For the majority of Internet users, who do not know how to look beyond a search engine, that information is effectively invisible.

Needham said he is meeting with a variety of agencies to discuss how the information housed in their databases can be made available in the search results from engines such as Google, Yahoo or MSN. One method, Needham said, would be to use Google Sitemaps, a tool through which site operators submit lists of their pages so that Google's crawler can find them.

When a federal institution that maintains one of the world's largest networks of sites, including many databases, implemented Google Sitemaps, the number of Web links found by Google doubled, Needham said. This allowed millions of new documents to be included in search engine results, he said.
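To illustrate the general idea, the sketch below builds a sitemap file that lists one URL per database record, so that a crawler can reach pages it would not find by following links. It is a minimal example only: it assumes the publicly documented Sitemaps XML format (the sitemaps.org 0.9 schema), and the domain, URL pattern, record fields and function name are hypothetical, not details of any agency's actual implementation.

```python
import xml.etree.ElementTree as ET

# Namespace of the public Sitemaps XML format (sitemaps.org, schema 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records, base_url):
    """Return sitemap XML (as a string) with one <url> entry per record."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for record in records:
        url = ET.SubElement(urlset, "url")
        # Hypothetical URL pattern: one public page per database record.
        ET.SubElement(url, "loc").text = f"{base_url}/document/{record['id']}"
        # Date the record last changed, in YYYY-MM-DD form.
        ET.SubElement(url, "lastmod").text = record["updated"]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical records standing in for rows pulled from an agency database.
records = [
    {"id": "DOC-0001", "updated": "2006-11-01"},
    {"id": "DOC-0002", "updated": "2006-11-15"},
]

sitemap_xml = build_sitemap(records, "https://www.example.gov")
print(sitemap_xml)
```

In a setup like this, only records the agency chooses to list would be exposed, so tables holding sensitive data could simply be left out of the file.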

A Dec. 16, 2005, memorandum from Clay Johnson, deputy director for management at the Office of Management and Budget, required all agencies by Sept. 1, 2006, to set up their public information so that it is searchable. It stated that "increasingly sophisticated Internet search functions" can "greatly assist agencies in this area."

Agencies also were required by Dec. 31, 2005, to provide all public data in an open format that allows the public to aggregate "or otherwise manipulate and analyze the data to meet their needs," according to a separate OMB memorandum signed by Johnson on Dec. 17, 2004.

Mark Luttner, director of EPA's Office of Information Collection in the Office of Environmental Information, said the agency's e-rulemaking program management office is working with OMB to respond to a recent request from a search engine company that wants to index the Regulations.gov data.

In addition to the technical challenges presented by the company's request, EPA has to consider whether a commercial company could assert proprietary ownership over federal data and whether providing government data to one company would create an uneven playing field for other companies, Luttner said.

Needham said Google, for one, does not want to assert ownership over any information obtained from agencies, and agency efforts to improve the ability to search their Web sites would likely be equally beneficial to its competitors.

Commonly used search engines like Google are able to index other agency Web sites used to disseminate information, such as the Small Business Administration's Business Gateway e-government initiative.

Nancy Sternberg, the program manager for Business Gateway, said the initiative's Web site, Business.gov, has been optimized for all major search engines. But Business.gov does not contain a separate database, Sternberg said, which would make indexing much more challenging.

Search engines cannot index the Grants.gov database housed at the Health and Human Services Department, according to John Etcheverry, director of grants systems modernization at HHS. But in 2007, Grants.gov will implement a Google search appliance that will let Google scan specified database tables with grant synopsis information, he said. Allowing search engines to crawl the entire Grants.gov database would create security vulnerabilities since it contains sensitive applicant information, he noted.

Google's forays into government include a U.S. Government Search Web page, which is intended to provide a single location for searching across agency information and for keeping up to date on government news. Google maintains that the site is not intended to compete with FirstGov.gov, the government search site hosted by the General Services Administration. Rather, it is intended to complement it, company officials say.

John Murphy, director of FirstGov.gov technologies, said the FirstGov.gov pages are optimized for all search engines, but the site's MSN-run search tool is designed specifically to search government Web pages, including those hosted by state and local governments.