Best Track Data Exploratory Data Analysis
In this notebook we take a first look at the best track data to better
understand the nature of the Digital Typhoon data. We only consider data
starting from 1978, the year in which satellite imagery started to be
collected. Make sure to replace path_best with the path to the directory
containing the TSV files with best track data.
Load all the best track data
>>> # Read tsv data and convert to numpy array
>>> import numpy as np
>>> from pyphoon.io.tsv import read_tsvs
>>> all_data = np.array(read_tsvs(path_best))
>>> # Only consider data since 1978 (note: the strict '>' keeps years after
>>> # 1978; use '>=' to also keep samples from 1978 itself)
>>> all_data = all_data[all_data[:, 0] > 1978]
>>> # Number of samples and features
>>> N, D = all_data.shape
>>> # Print shape of best table
>>> print("Number of samples:", N)
>>> print("Number of features:", D)
Number of samples: 208953
Number of features: 20
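The year filter relies on NumPy boolean-mask indexing. The following toy sketch (made-up values, not best track data) shows how such a mask selects rows, and how a strict `>` differs from `>=` at the boundary year:

```python
import numpy as np

# Toy table: one row per sample, first column is the year (made-up values).
toy = np.array([
    [1977, 1000.0],
    [1978,  990.0],
    [1979,  985.0],
    [1980,  970.0],
])

strictly_after = toy[toy[:, 0] > 1978]   # drops the 1978 row
from_1978_on = toy[toy[:, 0] >= 1978]    # keeps the 1978 row

print(strictly_after.shape[0], from_1978_on.shape[0])
```

Which comparison is appropriate depends on whether samples from 1978 itself should be part of the analysis.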
To get information about the features in all_data, you can check the
variable feature_names from pyphoon.eda_jma.
>>> import pyphoon.eda_jma as eda
>>> eda.feature_names
['year',
'month',
'day',
'hour',
'class',
'latitude',
'longitude',
'pressure',
'wind',
'gust',
'storm_direc',
'storm_radius_major',
'storm_radius_minor',
'gale_direc',
'gale_radius_major',
'gale_radius_minor',
'landfall',
'speed',
'direction',
'interpolated']
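Because columns are addressed by position, it can help to look up indices by name rather than counting by hand. A minimal sketch, assuming the list above is copied verbatim; the helper `col` is hypothetical and not part of pyphoon:

```python
# Feature names copied from the listing above.
feature_names = [
    'year', 'month', 'day', 'hour', 'class', 'latitude', 'longitude',
    'pressure', 'wind', 'gust', 'storm_direc', 'storm_radius_major',
    'storm_radius_minor', 'gale_direc', 'gale_radius_major',
    'gale_radius_minor', 'landfall', 'speed', 'direction', 'interpolated',
]


def col(name):
    """Return the zero-based column index of a feature (hypothetical helper)."""
    return feature_names.index(name)


print(col('class'), col('pressure'), col('wind'), col('interpolated'))
```

Note that these indices are zero-based, so 'interpolated' at index 19 is the 20th column.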
Consider only real data
As mentioned earlier, only 17% of all best track data is real, while the rest
has been generated via interpolation to match the image observation frequency.
Luckily, the 20th feature in all_data (index 19, 'interpolated') tells us
which samples are original (i.e. ‘0’ if original, ‘1’ if synthetic).
>>> # Discard synthetic samples
>>> index = all_data[:, -1] == 0
>>> data = all_data[index]
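The share of original samples can be checked directly from the flag column. The sketch below uses a made-up toy table, not the real data, so the resulting fraction is purely illustrative:

```python
import numpy as np

# Toy table whose last column mimics the 'interpolated' flag
# (0 = original, 1 = synthetic); all values are made up.
toy = np.array([
    [1980, 0],
    [1980, 1],
    [1980, 1],
    [1980, 1],
    [1981, 0],
])

is_original = toy[:, -1] == 0
originals = toy[is_original]

# The mean of a boolean mask equals the share of True entries.
fraction = is_original.mean()

print(originals.shape[0], fraction)
```

Applied to all_data, the same `mean()` of the mask would give the fraction of non-interpolated samples.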
Plot histogram of classes
The function plot_hist() provides the necessary tools to obtain the histogram
of any feature. The feature at index 4 (the fifth column, 'class') in
all_data tells us the class of the sample.
>>> plot_hist(all_data, show_fig=True, feature_index=4, bins=[2,3,4,5,6,7,8],
...           normed=True, centre=True, title="Class histogram", xlabel="Class")
Use arguments save_fig and fig_name to store the generated plot.
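To cross-check plotted class frequencies numerically, the relative frequencies can be tabulated with plain NumPy. This is a sketch on made-up class labels; it does not reproduce the internals of plot_hist(), which are not shown here:

```python
import numpy as np

# Made-up class labels standing in for the 'class' column.
classes = np.array([2, 2, 3, 3, 3, 4, 5, 5, 6, 7])

# Distinct class values and how often each occurs.
values, counts = np.unique(classes, return_counts=True)

# Normalise counts so that they sum to 1.
rel_freq = counts / counts.sum()

for value, freq in zip(values, rel_freq):
    print(int(value), round(float(freq), 2))
```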
Plot histogram of pressure
The feature at index 7 (the eighth column, 'pressure') holds the pressure values. Let’s find the minimum and maximum pressure.
>>> minimum = min(all_data[:, 7])
>>> maximum = max(all_data[:, 7])
>>> print("minimum:", minimum, "\nmaximum:", maximum)
minimum: 870.0
maximum: 1018.0
Now, we want to plot the histogram. We will use bins with a width of 7 hPa.
>>> bins = np.arange(870, 1018, 7)
>>> plot_hist(data, show_fig=True, feature_index=7, bins=bins, normed=True,
...           title="Pressure histogram", xlabel="Pressure (hPa)")
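One detail worth checking when building bin edges: np.arange excludes its stop value, so the last edge can fall below the maximum. A small sketch using the minimum and maximum reported above; the padded variant is just one possible choice, not necessarily what the author intended:

```python
import numpy as np

lo, hi, width = 870, 1018, 7

# np.arange stops before `hi`, so the last edge is below the maximum.
edges = np.arange(lo, hi, width)

# Extending the stop by one bin width makes the edges cover the maximum.
padded = np.arange(lo, hi + width, width)

print(edges[-1], padded[-1])
```

With the unpadded edges, a sample at 1018 hPa would lie outside the last bin.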
Wind speed time analysis
Let us now provide a simple example of time analysis. In particular, we will
explore how wind speed values evolve over time. The feature at index 8 (the
ninth column, 'wind') holds the wind-speed data. First, we collect the wind
speeds of all samples per year and store them in lists, accessible by year
through the dictionary wind_sp.
>>> # Get wind speeds from all samples per year
>>> wind_sp = {}
>>> for idx in range(len(data)):  # data is filtered, so do not reuse N
...     year = int(data[idx, 0])
...     if year not in wind_sp:
...         wind_sp[year] = []
...     wind_sp[year].append(data[idx, 8])
Next, we iterate over the dictionary and compute the mean of each year's list. This way, we obtain the mean wind speed per year.
>>> # Get the mean
>>> mean_wind_sp = {}
>>> for key in wind_sp.keys():
...     mean_wind_sp[key] = np.mean(wind_sp[key])
Finally, we plot the mean wind speed over time.
>>> import matplotlib.pyplot as plt
>>> plt.plot(list(mean_wind_sp.keys()), list(mean_wind_sp.values()), 'k')
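As an alternative to the dictionary-based loop, the per-year mean can be computed in a vectorised way. The sketch below uses made-up year and wind arrays in place of the columns at index 0 and 8, and is only one possible approach:

```python
import numpy as np

# Made-up stand-ins for the 'year' and 'wind' columns.
years = np.array([1979, 1979, 1980, 1980, 1980, 1981])
wind = np.array([35.0, 45.0, 50.0, 60.0, 70.0, 40.0])

# Map every sample to the position of its year among the sorted unique years.
unique_years, inverse = np.unique(years, return_inverse=True)

# Sum of wind speeds and number of samples per year.
sums = np.bincount(inverse, weights=wind)
counts = np.bincount(inverse)

mean_per_year = sums / counts

for year, mean in zip(unique_years, mean_per_year):
    print(int(year), float(mean))
```

Because np.unique returns the years sorted, the result can be plotted directly against unique_years without relying on dictionary insertion order.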