
Monitoring deforestation with Open Data and Machine Learning — Part 1

Written by Luis Di Martino
Published on March 15, 2024

Climate change is one of the biggest challenges we face as a species, and deforestation is a key contributing factor. This article aims to show how one can monitor deforestation using Machine Learning (ML) and open-access forest data.

The problem

Let’s explain how we got here. Since the start of the twentieth century, we have grown used to extracting fossil fuels from underground, obtaining energy from them, and pumping the resulting greenhouse gases into the atmosphere. This practice allowed industries to thrive, improved transportation, and gave more people access to electricity. Initially, it did not seem very harmful. But as the world population began to grow rapidly and consumption rose across a significant part of the world, the situation got out of control. The higher demand for goods, transportation, and electricity pushed up the rate at which we pollute the atmosphere.

World population — Wikipedia

Additionally, the rise in world population increased the need for food. Growing food demand pushed us to clear forests to make room for agriculture and livestock. Cutting down trees makes things worse, as forests do a great job of sequestering carbon dioxide, the greenhouse gas with the highest concentration in the atmosphere.

The combination of these two phenomena put us in a very delicate position. We increased the sources of greenhouse gas emissions while reducing nature's mechanisms for absorbing those emissions from the atmosphere. The equilibrium between sources and sinks broke, resulting in greenhouse gas accumulation and global warming. Dr. Jonathan Foley, a well-known environmental scientist and the Executive Director of Project Drawdown, explains this phenomenon clearly. Project Drawdown's mission is to help the world reach the point where greenhouse gas levels stop climbing and start declining steadily, thus avoiding catastrophic climate change.

Sources and sinks of greenhouse gases — Project Drawdown

There is a positive side to this story. In recent years, we started realizing the dangers of polluting the atmosphere and of deforestation. We woke up as we began to see the consequences of climate change first-hand: record-breaking wildfires razing forests in Australia and California and ever-larger chunks of ice breaking off at the poles. This realization drove government efforts like the Paris Agreement and private initiatives like the Bill Gates-led Breakthrough Energy Ventures and the Elon Musk-backed XPrize. It also fostered endeavors in the startup community: new companies are innovating and proposing solutions to reduce emissions or increase the sequestration of the greenhouse gases accumulated in the atmosphere.

A possible solution that emerged from the Kyoto Protocol is the introduction of carbon offsets. These allow an entity to compensate for its emissions in one place with emission reductions elsewhere. Because greenhouse gases spread throughout the atmosphere, the climate benefits from the reduction regardless of where the cutback occurs. In the usual terminology, the gases emitted by some activity are the activity’s “carbon footprint”. When carbon offsets compensate for this footprint, the activity is said to be “carbon neutral”. Carbon offsets can be bought, sold, or traded as part of a carbon market. Running a legitimate marketplace is a huge technical challenge for multiple reasons. On the one hand, it involves verifying that projects aimed at sequestering carbon from the atmosphere capture the amount they report. On the other, one needs to validate that an emission reduction is not counted twice, which would invalidate the underlying principles of the marketplace.

Companies like Pachama and Natural Capital Exchange are innovating to make the carbon credits marketplace accountable and reliable. They leverage the latest advances in remote sensing and Machine Learning (ML). Remote sensing techniques detect and monitor changes in an area from measurements made at a distance. The field has advanced considerably in the last decade with the growth of companies such as Satellogic and Planet Labs, which provide satellite imagery with better resolution and shorter revisit times. ML is the field in which one trains algorithms to perform complex tasks from data, and it is where technology has seen the most remarkable breakthroughs in recent years.

This article aims to show how one can monitor deforestation automatically using high-resolution satellite imagery. A tool like this can help curb deforestation by providing transparency on the matter. Several initiatives pursue this goal. The World Resources Institute provides Global Forest Watch, an online tool that maintains an up-to-date map of forest coverage and its year-by-year variations. Planet Labs, in partnership with Kaggle, also proposed the competition Planet: Understanding the Amazon from Space, which aimed at evaluating ML algorithms for tracking the human footprint in the Amazon rainforest.

Changes in tree cover in Tailândia municipality in the Amazon rainforest — Global Forest Watch

The solution

Training data

One of the main issues when building supervised ML models is the lack of labeled data, which is required to train the algorithms so they can later perform the target task on previously unseen data. Luckily, the Kaggle challenge Planet: Understanding the Amazon from Space provides precisely this. The training data comes from imagery of the Amazon basin captured by Planet’s Flock 2 satellites between January 1st, 2016, and February 1st, 2017. The images contain the visible red (R), green (G), and blue (B) bands and the near-infrared (NIR) band. The captures’ ground-sample distance (GSD) is 3.7 m, and they are orthorectified with a pixel size of 3 m.

The labels include several phenomena of interest in the Amazon rainforest basin. They can be divided into atmospheric conditions, common land cover/use, and rare land cover/use. Let’s see some examples:

Training data labels — Planet: Understanding the Amazon from Space

You can find more details on the data and the labels in the Data section of the Kaggle challenge.
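
To get a feel for the labels, here is a minimal sketch for inspecting their distribution with pandas. It assumes the challenge’s train_v2.csv label file, which pairs an image_name column with a space-separated tags column; adjust the names if your data bundle differs.

```python
import pandas as pd

# Assumes the challenge's label file (train_v2.csv) with an "image_name"
# column and a space-separated "tags" column.
labels = pd.read_csv("train_v2.csv")
labels["tags"] = labels["tags"].str.split()

# Count how often each label appears across the training set.
tag_counts = labels["tags"].explode().value_counts()
print(tag_counts.head(10))
```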

Remote sensing imagery is captured and delivered using different spectral bands. The most commonly used are those in the visible spectrum (R, G, and B) and the near-infrared (NIR).

The Kaggle challenge provides data in two different formats:

  1. Full-resolution images with all four bands (RGB-NIR), available as GeoTIFF files.
  2. JPEG-compressed versions of the captures, containing only the visible bands (RGB).

The GSD is the same in both cases, as the JPEG files are not subsampled versions of the originals, but they can suffer from artifacts introduced by the lossy JPEG compression. In addition, the GeoTIFF files contain 16-bit data, while the JPEG bundle only contains 8-bit data, so the JPEG images have a lower dynamic range.
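
To make the difference concrete, the following sketch loads one capture in each format and prints its shape and data type. The file paths are hypothetical placeholders, and it assumes the rasterio and Pillow packages are installed.

```python
import numpy as np
import rasterio
from PIL import Image

# Hypothetical paths pointing into the two Kaggle data bundles.
tif_path = "train-tif-v2/train_0.tif"
jpg_path = "train-jpg/train_0.jpg"

# GeoTIFF: four spectral bands, 16-bit samples.
with rasterio.open(tif_path) as src:
    tif = src.read()                  # array of shape (4, H, W), dtype uint16
print(tif.shape, tif.dtype)

# JPEG: three visible bands, 8-bit samples, lossy compression.
jpg = np.array(Image.open(jpg_path))  # array of shape (H, W, 3), dtype uint8
print(jpg.shape, jpg.dtype)
```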

A discussion of the differences between both data formats can be found here.

To train the classifier, we use the JPEG-compressed version of the images. Using the GeoTIFF captures poses an issue: convolutional neural networks (ConvNets) are usually pre-trained on the ImageNet database, so they expect RGB imagery coded with 8 bits. The bank of filters they implement is not suited to four-band images coded with 16 bits. We could adapt the classifier architecture to handle the extra channel, but we leave that alternative as future work.
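
For reference, here is a minimal sketch of what that adaptation could look like: swapping the stem convolution of an ImageNet-pre-trained ResNet50 for one that accepts a fourth (NIR) channel. This is not part of the implemented solution; it assumes PyTorch with torchvision >= 0.13, and initializing the NIR filters with the mean of the RGB ones is only one of several reasonable choices.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# ResNet50 pre-trained on ImageNet (assumes torchvision >= 0.13).
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)

# Build a new stem convolution that accepts four channels (RGB + NIR).
old_conv = model.conv1
new_conv = nn.Conv2d(4, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=False)

with torch.no_grad():
    # Reuse the pre-trained RGB filters and initialize the NIR channel
    # with their mean so the stem starts from a sensible point.
    new_conv.weight[:, :3] = old_conv.weight
    new_conv.weight[:, 3] = old_conv.weight.mean(dim=1)

model.conv1 = new_conv
```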

The classifier

To solve the issue at hand, we need to implement an image classifier. In recent years, ConvNets have shown the best results on this task across very different datasets. We chose a ResNet50 ConvNet to keep a good trade-off between the network’s size, complexity, and performance. You can find the technical details and code in a Jupyter Notebook I shared publicly here. The main takeaways from the implemented solution are the following:

  • We use transfer learning to fine-tune the network for the particular dataset and labels. Training the network from scratch requires a considerable amount of GPU time and lots of data. We avoid this by starting from a set of parameters pre-trained on the popular image recognition dataset ImageNet. Most deep learning courses cover transfer learning in depth. You can find technical details in Stanford’s CS231n lectures. PyTorch’s resources also provide an excellent hands-on tutorial.
  • Each image can belong to multiple categories, making this a multi-label classification problem. To handle the labels effortlessly, we use the MultiLabelBinarizer from scikit-learn.
  • When training a ConvNet, a critical parameter to calibrate is the learning rate. The optimal learning rate depends on the network architecture and the dataset, and it usually changes during training. We use the “One-Cycle policy”, which allows the network to converge in few training epochs. The article “The 1 Cycle Policy: An Experiment That Vanished the Struggle in Training of Neural Nets” explains the principles behind this strategy and its implementation details. This policy is especially beneficial when training models in the cloud, since converging in fewer iterations saves training time and resources. A minimal sketch combining these three ingredients follows this list.
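
Here is how these three ingredients could fit together in PyTorch. This is not the notebook’s exact code: the tag lists, optimizer, epoch count, and learning rates are illustrative placeholders.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from sklearn.preprocessing import MultiLabelBinarizer

# Labels: turn each image's list of tags into a binary indicator vector.
tags_per_image = [["clear", "primary"], ["haze", "agriculture", "road"]]  # illustrative
mlb = MultiLabelBinarizer()
targets = torch.tensor(mlb.fit_transform(tags_per_image), dtype=torch.float32)
num_classes = len(mlb.classes_)

# Transfer learning: ImageNet-pre-trained ResNet50 with a new classification head.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Multi-label classification: one independent sigmoid/BCE term per class.
criterion = nn.BCEWithLogitsLoss()

# One-cycle learning-rate policy spanning the whole training run.
epochs, steps_per_epoch = 5, 100       # placeholder values
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=steps_per_epoch)

# Per training batch one would then run:
#   loss = criterion(model(images), batch_targets)
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```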

The implemented network achieved a score of 0.89188 in the Kaggle challenge, where the top score on the leaderboard was 0.93317. I’m confident we can improve this result: other Kagglers scored 0.924662 by fine-tuning and refining a ResNet50 ConvNet similar to the one used here. The linked Jupyter Notebook lists possible ways to improve the classification performance. Nevertheless, the result seems good enough to move towards the end goal of implementing an automatic deforestation monitoring tool.
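
For context, the challenge ranks submissions by the mean F2 score, which weights recall higher than precision. A minimal sketch of computing it with scikit-learn, assuming binarized ground-truth and prediction matrices:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# y_true and y_pred are binary indicator matrices of shape
# (num_images, num_labels), e.g. the output of MultiLabelBinarizer and of
# thresholding the network's sigmoid outputs at 0.5 (illustrative values).
y_true = np.array([[1, 0, 1], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])

# Mean F2 across images ("samples" averaging) weights recall over precision.
print(fbeta_score(y_true, y_pred, beta=2, average="samples"))
```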

Open-access deforestation data

Now that we have a classifier, the next step in implementing an automatic deforestation monitoring tool is getting access to the data to monitor. As deforestation is a critical world issue, several initiatives provide free resources to foster progress on the subject. One of these is Norway’s International Climate & Forests Initiative (NICFI) Imagery Program, a partnership between NICFI, Kongsberg Satellite Services (KSAT), Airbus, and Planet with the following goal:

Through Norway’s International Climate & Forests Initiative, anyone can now access Planet’s high-resolution, analysis-ready mosaics of the world’s tropics in order to help reduce and reverse the loss of tropical forests, combat climate change, conserve biodiversity, and facilitate sustainable development.

You can find the technical details of the available data and how to use it on the program’s homepage. The provided mosaics cover tropical forested regions between 30 degrees North and 30 degrees South. Some areas with little or no forest cover, as well as some countries, are excluded. These exceptions are not an issue for the area we tackle in this article, as the whole Amazon basin is included. The imagery contains the visible RGB and NIR spectral bands at a spatial resolution of 4.77 m/px. The temporal resolution is as follows:

  • Bi-annual mosaics are provided for the period from December 2015 to August 2020.
  • Monthly mosaics are provided from September 2020 onwards.

NICFI Imagery Program coverage
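
To explore what the program offers, one can query Planet’s Basemaps API. The sketch below lists the mosaics available to an account; the endpoint and field names follow Planet’s public documentation, and it assumes a Planet API key with NICFI access stored in an environment variable.

```python
import os
import requests

# Minimal sketch: list the mosaics visible to your account through Planet's
# Basemaps API (requires an account with NICFI access and its API key in
# the PL_API_KEY environment variable).
API_URL = "https://api.planet.com/basemaps/v1/mosaics"
session = requests.Session()
session.auth = (os.environ["PL_API_KEY"], "")

response = session.get(API_URL)
response.raise_for_status()

for mosaic in response.json().get("mosaics", []):
    # Field names follow Planet's Basemaps API responses.
    print(mosaic.get("name"), mosaic.get("first_acquired"))
```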


So far, we have:

  • Presented the leading causes and consequences of climate change.
  • Highlighted the importance of forests and the need to preserve them as part of the solution.
  • Explained how to monitor deforestation by implementing a classifier based on a well-known ConvNet architecture.
  • Shown where to find open-access forest data for training and evaluating the classifier.

This story is the first part of a two-part article. In the next one, we evaluate the solution on the NICFI program data, analyze the obtained results, and discuss conclusions and lines of future work. You can find the second part here.