Faster Rasters For All

Graeme Bell (Norsk Institutt for Skog og Landskap)

11:30 on Saturday 21st September (in Session 58, starting at 11 a.m., EMCC: Room 1)

Description: The geomatics group at the Norwegian Forest and Landscape Institute works with national-scale agricultural data. As part of a recent project, we have developed tools and techniques that significantly improve the performance of open source GIS software, and we want to share them with the community.

Project background: This project is run by the geomatics section of the Norwegian Forest and Landscape Institute, and involves building a new map describing farmland across Norway. The aim is to combine the best-quality information found in historic and current map datasets (stored in various formats) into a best-estimate description of land status today. We want to understand how the countryside is changing over time, and to find ways to help protect and monitor the landscape.

One major motivation for the project is that high-quality farmland ('mat jord') in Norway is under threat from erosion and urbanisation. Norway is a very mountainous country, so flat farmland is an extremely important national food resource. Another motivation comes from local research into marshland ('myr'). Researchers have found that marshes in Norway trap very large quantities of carbon and prevent CO2 from entering the atmosphere. They also provide a habitat for a wide variety of local species. If marshes and farmland are not maintained appropriately, both the environment and the national food supply in Norway could be harmed.

These issues extend far beyond Norway: many countries are experiencing deterioration in their farmland, environmental hotspots, and carbon sinks. We hope that the land-change discovery, visualisation and monitoring tools we are building can in future be re-used around the world. Our group has a commitment to using open source technology wherever possible, and that gives us the freedom to share what we build with geoinformatics and landscape experts everywhere. For this talk at FOSS4G, and its theme of "Geo for all", we want to contribute some new, fast, general-purpose tools and code performance techniques to the open source geoinformatics community.

Building the project's technology: We are standardising diverse vector and raster maps into a raster format that allows us to track change and status over time against a fixed spatial reference system. The main open source technologies used in this project are Linux, PostgreSQL, PostGIS, GDAL/OGR/PROJ.4, Python, Bash, QGIS, and OpenOffice. To accelerate our software and scale up our maps, we use parallel data processing techniques: primarily Python's NumPy project for high-speed numerical calculations and the GNU Parallel project for multi-core process management, as well as some tools of our own.

Presentation: Our presentation will discuss the first year of development in this project, and in particular what we learned about open source tools that enabled us to improve our software performance by several orders of magnitude. There are two main themes in the presentation:

- An introductory project description / case study that explains the context of this work and establishes the scale at which we're using open source. For example: we perform fairly complex mathematical transformations that combine multiple input datasets, each covering an area of around 2 trillion square meters, at up to 1 square meter resolution.
- "A bundle of hacks": techniques that improve the performance of open source geo-technology, discovered by analysing the program behaviour and source code of popular tools. This part of the talk is intended to suit developers at all levels, with a mix of theory, diagrams, benchmarks and code.
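
To make the array-programming idea concrete, here is a minimal sketch of what "NumPy for high-speed numerical calculations" can look like on one raster tile. It is not taken from the project's code: the merge rule (prefer the newer map, fall back to the older one where the newer map has no data), the `nodata` value of 0, the class codes and the function names are all assumptions chosen for illustration. The sketch also shows how the choice of data type changes the memory footprint of a tile.

```python
import numpy as np

def combine_loop(old, new, nodata=0):
    """Cell-by-cell reference version of a hypothetical merge rule."""
    out = np.empty_like(old)
    rows, cols = old.shape
    for r in range(rows):
        for c in range(cols):
            # Assumed rule: keep the newer class unless it is "no data".
            out[r, c] = new[r, c] if new[r, c] != nodata else old[r, c]
    return out

def combine_vectorised(old, new, nodata=0):
    """The same rule written as one array expression over the whole tile."""
    return np.where(new != nodata, new, old)

# Two synthetic 200 x 200 tiles of land-class codes (0 = no data, assumed).
rng = np.random.default_rng(0)
old = rng.integers(0, 5, size=(200, 200), dtype=np.uint8)
new = rng.integers(0, 5, size=(200, 200), dtype=np.uint8)

# Both versions must agree; only the execution strategy differs.
assert np.array_equal(combine_loop(old, new), combine_vectorised(old, new))

# Data type selection: a small set of class codes fits in one byte per cell.
bytes_uint8 = old.nbytes
bytes_float64 = old.astype(np.float64).nbytes
print(bytes_uint8, bytes_float64)
```

The vectorised form hands the whole tile to NumPy's compiled loops instead of the Python interpreter, and storing class codes as `uint8` rather than `float64` uses one eighth of the memory. How much faster this is in practice depends on the data and hardware, so it is worth benchmarking rather than assuming.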

Techniques: The second part of the presentation offers tips, suggestions and measurements relating to the following performance techniques:

- multi-core parallel programming, and the balance between increasing parallelism and overheads
- vectorisation/array programming (using mathematical operations that transform groups of data in one step)
- analysing/improving open source code
- appropriate selection of data types
- benchmarking open source software (e.g. gdalwarp vs; GDAL/files vs PostGIS/db)
- benchmarking the effects of algorithm choices & parameters upon performance
- benchmarking operator/function performance in languages
- precalculation
- divide and conquer
- cache management, fitting the problem into the cache, and taking advantage of temporal/spatial locality in geodata processing
- indexing & tiling
- tradeoffs between storage space, access times, and CPU time in compression of rasterised vector data
- compilation

Optimising the user's performance, as well as the code: We also want to make map programming more accessible to non-programmers. Particularly, in this project, we want to draw upon the knowledge of domain experts in soil status and landscape change. At the end of our presentation we will offer two new re-usable open source tools/approaches that can help non-programmer domain experts to work easily with open source map software. The first tool is a small compiler that transforms ideas described in spreadsheet tables into optimised array mathematics expressions suitable for NumPy. The second tool is a GDAL-based raster build environment that simplifies the process of starting from mixed map sources, applying the right calculations in the right order, and getting back to easy-to-use vector or raster maps.

Summary: Tips, techniques, benchmarks and tools from an environmentally-friendly and open-source-friendly project in Norway!
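
As an illustration of the precalculation technique and of the idea of turning a spreadsheet-style table into array mathematics, the sketch below precomputes a lookup table from a small rule table and then applies it to whole rasters in one indexing step. This is not the compiler described in the abstract, whose design is not given here. The rule table, the class codes, the value `N_CLASSES` and the function names `build_lookup` and `apply_rules` are hypothetical.

```python
import numpy as np

# Hypothetical rule table, as a domain expert might type it into a spreadsheet:
# (class in older map, class in newer map, combined class)
RULES = [
    (1, 1, 1),
    (1, 2, 3),
    (2, 1, 3),
    (2, 2, 2),
]
N_CLASSES = 3  # input codes 0..2; 0 is assumed to mean "no data"

def build_lookup(rules, n_classes, default=0):
    """Precalculate a 2-D table so each rule is evaluated once, not per cell."""
    lut = np.full((n_classes, n_classes), default, dtype=np.uint8)
    for old_code, new_code, result in rules:
        lut[old_code, new_code] = result
    return lut

def apply_rules(lut, old, new):
    """Apply the rule table to two rasters with a single fancy-indexing step."""
    return lut[old, new]

lut = build_lookup(RULES, N_CLASSES)
old = np.array([[1, 1], [2, 0]], dtype=np.uint8)
new = np.array([[1, 2], [2, 1]], dtype=np.uint8)
combined = apply_rules(lut, old, new)
print(combined)
```

Combinations not listed in the rule table fall back to the `default` code, so a missing row yields "no data" rather than an error. A real tool might instead report uncovered combinations to the domain expert.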