Using CKAN As A Catalog For Geospatial Data: Building The Next Generation Of geo.data.gov

Adrià Mercader (Open Knowledge Foundation) with Irina Bolychevsky (Open Knowledge Foundation)

15:00 on Thursday 19th September (in Session 12, starting at 2 p.m., EMCC: Room 3)


Description: CKAN is the most widely used open source data portal software across the world, built on open standards to make data easily discoverable and reusable. This presentation explores the features which allow publishing geospatial metadata with CKAN, providing an alternative to traditional geospatial catalogs, and more particularly how it has been used to build the new version of geo.data.gov, the US Government official online data catalog.
Abstract:

Data.gov is the main online data catalog of the US Government, aggregating data from a range of publishers, including federal agencies, states and universities. As part of a series of wider changes, a new version of the portal is being built that will merge the current data.gov and geo.data.gov sites into a single catalog hosting both non-geographic and geographic data.

This combined portal will be powered by CKAN, an open source data management system. A mature and widely used project, CKAN is maintained by the Open Knowledge Foundation, a UK-based non-profit organization that promotes open access to information. The main goals of CKAN are to help publishers manage data and place it online, and to make that data easily discoverable for users, while allowing developers to customize and extend the software for maximum re-use potential. Already used in several governmental Open Data catalogs across the world [1], CKAN will replace two existing instances currently powered by proprietary software.

The implementation of the new version of geo.data.gov has posed significant challenges, from technical ones (such as harvesting and managing large numbers of datasets) to user experience and design ones (such as presenting so much data in a useful and meaningful way). Data needed to be harvested from different sources across a wide range of organizations, using an authorization process compatible with the systems already in place. Metadata sources used different protocols and formats, and varied significantly in quality.

The harvesting extension of CKAN provides a framework for building harvesters for different kinds of sources, managing them via a web interface and generating job reports. Existing harvesters for CSW servers and Web Accessible Folders were improved, allowing the import of documents in both ISO 19139 and FGDC formats, and new ones were created for other sources such as ArcGIS REST API endpoints and Z39.50 databases.

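As a rough illustration of handling both metadata formats mentioned above, a harvester first has to tell which kind of document it has fetched. The Python sketch below guesses the format from the XML root element. It is not taken from the CKAN harvesters: the function name `guess_metadata_format` and its return values are hypothetical, and it assumes that ISO 19139 records use a `gmd:MD_Metadata` root in the `http://www.isotc211.org/2005/gmd` namespace while FGDC records use an un-namespaced `metadata` root.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for ISO 19139 records (gmd:MD_Metadata root element).
GMD_NS = "http://www.isotc211.org/2005/gmd"


def guess_metadata_format(xml_text):
    """Return 'iso19139', 'fgdc' or 'unknown' based on the XML root element.

    Hypothetical helper for illustration only; not part of CKAN.
    """
    root = ET.fromstring(xml_text)
    # ElementTree reports namespaced tags as '{namespace}localname'.
    if root.tag == "{%s}MD_Metadata" % GMD_NS:
        return "iso19139"
    # FGDC records are assumed to have a plain <metadata> root.
    if root.tag == "metadata":
        return "fgdc"
    return "unknown"


iso_doc = '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"/>'
fgdc_doc = "<metadata><idinfo/></metadata>"
print(guess_metadata_format(iso_doc))
print(guess_metadata_format(fgdc_doc))
```

A real harvester would likely need further checks (for example, records wrapped in a CSW response), so this is only a starting point.
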
Custom validation options were implemented to deal with common errors, such as wrong bounding boxes or misplaced elements in the XML document. Once the metadata is imported into CKAN, it goes through an approval process in which authorized users review it according to the organization it belongs to, with tools that allow bulk processing of large numbers of datasets.

After becoming publicly available, datasets can be found via a user interface that offers full text search, filtering by bounding box and term faceting, as well as a powerful JSON-based search API that allows building third party applications and mashups on top of the catalog. Great effort has been put into making search useful across such a large volume of data, with special work on ranking algorithms and on aggregating conceptually close datasets into collections (for example, map series) so they don't interfere with the main search results.

The same metadata is exposed via a CSW endpoint to ensure compatibility with other geospatial software. This has been done by leveraging pycsw, an open source CSW implementation, and a number of improvements have resulted from the collaboration between the two project teams.

In terms of data visualization, the portal integrates with online viewers based on GeoExt and OpenLayers for common geospatial formats and services like WMS or KML, with plans to extend support to others. Existing previews for non-geospatial formats like CSV and PDF are also available, giving users access to different types of data and making the catalog useful to users without a geospatial technical background.

Both the US and Canada Open Data Initiatives are committed to using and supporting CKAN, and to providing the Open Government Platform, an open source distribution based on CKAN and Drupal, to help other governments and agencies meet open data and open government policies and requirements.

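To give a sense of how a third party application might use the JSON search API with a bounding box filter, the following Python sketch builds a query URL. It assumes the catalog exposes the standard CKAN action API endpoint `package_search` under `/api/3/action/` and that the spatial search extension accepts an `ext_bbox` parameter in `minx,miny,maxx,maxy` order. The helper name `build_search_url`, the search terms and the bounding box values are illustrative assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, bbox=None, rows=10):
    """Build a package_search URL, optionally filtered by bounding box.

    bbox is (west, south, east, north) in decimal degrees.
    Hypothetical helper for illustration only; not part of CKAN.
    """
    params = {"q": query, "rows": rows}
    if bbox is not None:
        west, south, east, north = bbox
        # Reject obviously wrong boxes before querying the catalog.
        if not (-180 <= west <= 180 and -180 <= east <= 180):
            raise ValueError("longitude out of range")
        if not (-90 <= south <= north <= 90):
            raise ValueError("latitude out of range or south > north")
        # ext_bbox is assumed to be the spatial extension's filter parameter.
        params["ext_bbox"] = ",".join(str(v) for v in (west, south, east, north))
    return base_url.rstrip("/") + "/api/3/action/package_search?" + urlencode(params)


url = build_search_url(
    "http://catalog.data.gov",
    "water quality",
    bbox=(-125.0, 24.0, -66.0, 50.0),
    rows=5,
)
print(url)
```

The resulting URL could then be fetched with any HTTP client, and the JSON response parsed to list matching datasets.
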
The first version of the portal will be available in the coming weeks at the following URL: http://catalog.data.gov/

More information on CKAN and the main source code repository can be found at the following links:

* http://ckan.org
* https://github.com/okfn/ckan

[1] http://ckan.org/instances