Affiliations:

John K. Dagsvik, Statistics Norway, Research Department;

Mariachiara Fortuna, freelance statistician, Turin;

Sigmund Hov Moen, Westerdals Oslo School of Arts, Communication and Technology.

Corresponding author:

John K. Dagsvik, E-mail: john.dagsvik@ssb.no

Mariachiara Fortuna, E-mail: mariachiara.fortuna@vanlog.it (reference for code and analysis)


Raw data organization

The data used in this project were collected by Sigmund Hov Moen, and are available in the Rimfrost system, www.rimfrost.no.

All the data are available in the tempFgn repository, see the temperature data section for more.

They consist of a large amount of monthly and annual temperature time series from all around the world.

The raw data, as organized in the Rimfrost system, are collected in the folder “data/raw”

The “data/raw” folder contains 101 subfolders named with the English or the Norwegian name of the countries included in the Rimfrost system.

Each country folder contains the temperature time series for each weather station included in the Rimfrost system.

Data structure

Each time series is collected in a separate txt file, usually named with the Norwegian name of the weather station.

Each file is structured as follow:

There are no column names, and the missing data are usually recorded with the string 99 (but several exceptions are present).

As an example, these are the first six rows of the Paris.txt file.

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1757 -0.33 3.73 6.03 11.23 14.53 19.03 24.63 19.63 16.23 8.23 9.03 0.43 11.0
1758 1.63 3.93 7.93 10.03 18.53 20.33 17.93 20.83 15.43 9.83 5.63 3.13 11.3
1759 4.53 6.03 7.33 11.73 15.33 19.43 23.93 20.43 17.73 12.63 3.73 3.13 12.2
1760 0.23 3.63 6.63 12.33 16.03 20.03 21.63 19.73 18.73 11.83 7.53 1.43 11.6
1761 1.83 6.33 8.73 10.83 16.33 19.53 21.43 21.93 18.33 10.43 5.63 6.43 12.3
1762 4.93 4.03 4.03 13.83 17.33 19.93 23.33 19.63 16.83 9.83 5.83 2.63 11.8

Main features of the raw data

In order to select the set of suitable time series and explore their main features, we first provide a preliminary table with useful information (T0_Information). This table contains general information about all the available time series, namely:

As an example, consider the first 20 rows of the T0_Information table:

Country Station Status From To Length Missing
afganistan herat OK 1963 1990 28 16
afganistan kabul OK 1961 1992 32 12
afganistan mazar_i_sharif OK 1964 1992 29 11
algerie adrar OK 1965 2011 47 4
algerie alger_dar_elbeida OK 1923 2011 89 3
algerie beni_abbes OK 1931 2011 81 25
algerie constantine ERROR NA NA NA NA
algerie el_golea OK 1930 2011 82 3
algerie in_amenas OK 1963 2011 49 8
algerie in_salah OK 1964 2011 48 4
algerie oran OK 1908 2011 104 12
algerie tamanrasset OK 1940 2011 72 2
alpene andermatt OK 1966 2006 41 2
alpene basel ERROR NA NA NA NA
alpene davos OK 1901 2006 106 2
alpene engelberg OK 1931 2006 76 2
alpene geneve_ecad OK 1901 2009 109 1
alpene geneve ERROR NA NA NA NA
alpene graz OK 1894 2009 116 1
alpene hohenpeissenberg OK 1781 2009 229 2

Raw data basic exploration

The total number of availbale time series is 1260.

The table below provides a summary of the information given about each recorded variable:

Status From To Length Missing
ERROR: 225 Min. :1701 Min. :1869 Min. : 5.00 Min. : 0.00
OK :1035 1st Qu.:1890 1st Qu.:2009 1st Qu.: 61.00 1st Qu.: 1.00
NA Median :1925 Median :2011 Median : 84.00 Median : 2.00
NA Mean :1917 Mean :2008 Mean : 92.28 Mean : 5.57
NA 3rd Qu.:1949 3rd Qu.:2012 3rd Qu.:122.00 3rd Qu.: 6.00
NA Max. :2002 Max. :2012 Max. :312.00 Max. :104.00
NA NA’s :225 NA’s :228 NA’s :225 NA’s :225

The following graph shows all the available time series, sorted by first recorded year. The green dot represents the first recorded year, while the purple dot represents the last recorded year. The grey segment represents the length of the time series.

\label{fig:ts-lollipop} Time series plot

(#fig:ts-lollipop) Time series plot


SELECTION PROCESS

Given the wide variety of properties of the available time series, we decided to apply a multi-step automatic selection procedure to obtain a subset of series with specific characteristics.

In any case, the full set of available time series (1260) is contained in the data/raw folder, and can be analyzed with the provided code.

Multi-step automatic process

Briefly, the multi-step process was designed as follow:

  1. Step 1 - Valid structure: selection of the time series with valid data structure

  2. Step 2 - Length in years: selection of the time series with more than 105 recorded years

  3. Step 3 - Recorded months: selection of the time series with more than 1280 recorded months (non missing)

  4. Step 4 - Missing months: selection of the time series with less than 80 missing months

  5. Step 5 - Length in months: selection of the time series with full monthly length not inferior to 1290 months

Step 1 - Selection by valid structure

Using the information table previusly built (T0_Information), we selected all the time series with the variable Status equal to OK.

Here the results of the selection procedure for valid structure (step 1):

##  Cleaning data - Step : 1 
##  ************************************ 
##  1035 accepted time series over 1260 
##  Current step acceptance rate : 82.14 % 
##  Loss from step 1: 17.86 %

Step 2 - Selection by length (years)

The second step of our selection procedure was to identify all the time series that had more than 106 recorded years.

The results of the second step of the selection procedure follows here:

##  Cleaning data - Step : 2 
##  ************************************ 
##  376 accepted time series over 1035 
##  Current step acceptance rate : 36.33 % 
##  Loss from step 1: 70.16 %

At this point, our subset consists of 376 weather stations.

Information table 2 - Monthly information

In order to apply step 3-5 of the selection procedure, we need to gather information about the data at a monthly level.

We then built a second information table, with the following fields:

As an example, consider the the first 10 rows of the T02_Information2 table:

Country Station First_Year Last_Year Years Months Full_Lenght Missing_M
alpene geneve_ecad 1901 2009 109 1303 1314 11
alpene graz 1894 2009 116 1387 1398 11
alpene hohenpeissenberg 1781 2009 229 2741 2754 13
alpene innsbruck 1877 2009 133 1490 1597 107
alpene klagenfurt 1881 2009 129 1513 1554 41
alpene kremsmunster 1876 2009 134 1602 1614 12

Step 3 - Selection by available months

We can now refine the selection procedure checking that the number of non missing months is superior to 1260.

Here the results of this selection step:

##  Cleaning data - Step : 3 
##  ************************************ 
##  329 accepted time series over 376 
##  Current step acceptance rate : 87.5 % 
##  Loss from step 1: 73.89 %

Step 4 - Selection by missing months

We can now exclude all the time series with number of missing months above 80.

Recall that the previous step was about non missing months: in this step we are avoiding time series that (although long), contain so many “holes” that they might compromise the data quality.

Here the results of this selection step:

##  Cleaning data - Step : 3 
##  ************************************ 
##  278 accepted time series over 329 
##  Current step acceptance rate : 84.5 % 
##  Loss from step 1: 77.94 %

Step 5 - Length in months

The last step of the selection procedure is to check if the total length of the time series (from the first recorded month to the last one, missing included) is not inferior to 1290.

Above the results:

##  Cleaning data - Step : 5 
##  ************************************ 
##  277 accepted time series over 278 
##  Current step acceptance rate : 99.64 % 
##  Loss from step 1: 78.02 %

At this point we have a subset of 277 waether stations, characterized by a valid data structure and a length and presence of missing observation below specific thresholds.

Final selection

In order to select the final set of weather stations we proceded by manual inspection of all the time series.

We excluded all the time series with substantial quality problems, such as those with extreme outliers and several consecutive missing over time intervals.

In the final selection we tried as much as possible to select weather stations that were distributed across most parts of the world.

Final data

Summary statistics of the selected of 96 time series is shown in Appendix B, table B1.

There are two time series that did not pass the tests but we still included them in our analysis. One is from Ivittuut, Greenland, and the second one is from Uppsala, Sweden. The reason is that Greenland is of particular interest in the climate debate, and the temperature series from Uppsala is the longest time series ever recorded.

All the selected time series are available in the data/final folder. All the names (country and stations) have been translated to English.

We have highlighted results from 9 weather stations because they are well known cities with good quality of the temperature data.