Predicting Volcanic Eruptions from Seismic Behavior

Developing a web-based machine learning application to predict volcanic eruptions on the basis of seismic signals

Studies show that seismic indicators are an excellent predictor of an impending volcanic eruption. The problem with seismic signals, though, is that they are difficult to interpret: systems that make use of them can typically anticipate an eruption only a few minutes in advance, and longer-term predictions remain difficult.

In this blog we will go through the end-to-end process of creating a web app (from problem understanding to deployment) that can forecast the approximate time of eruption of a volcano based on seismic signals collected by sensors mounted near the volcano.

This part includes the application’s agenda and is intended to provide a high-level overview of the main stages of this case study.

The topics that will be covered in this blog are stated below.

Let’s begin by looking at some of the ideas, resources, and platforms needed to follow this blog smoothly.

Here is a list of pre-requisites that are needed (or rather, helpful) in building similar data-related projects.

Don’t worry if some of these pre-requisites are not yet in your toolbox; go through the blog anyway. You can always come back and pick up what is required!

Now that we have gone through the agenda and the pre-requisites, we can finally get our hands on the data.

Fig 2: Snapshot of Kaggle INGV competition Data Explorer

The download is a zip file which, upon extraction, yields a directory structured as shown below:

Each measurement is a collection of ten time series, captured at regular intervals over a duration of ten minutes by ten seismometers. Each measurement is associated with a value, called time to eruption, that represents the time elapsed between the end of the measurement and the start of an eruption.

But the problem is that we don’t know much else about the data, such as what the sensor’s signals mean or where they are in reference to others. In this study, we will make use of machine learning methods to derive useful features from raw data and attempt to model the relationship between the data and their corresponding time to eruption values.

However, there are key constraints: the model must have low latency (seconds, or at most a few minutes), it must be descriptive and reliable so that an impending eruption cannot be overlooked, and it must not raise false alarms.

The aim of machine learning is to understand the nature of the data and capture its essence with a mathematical model, which can in turn be used to comprehend or predict similar data in the future.

In this case study we will model the time to eruption value from its multi-dimensional seismic predictors using supervised machine learning algorithms. Since this variable is continuous, the task is regression, and the key performance metric used to measure model performance at different stages of this case study is the mean absolute error (MAE).

But machine learning is not only about the modelling itself; it is the whole process, from collecting, understanding, cleaning, and transforming the data to deploying the developed model.

Let’s proceed step by step…

We’ve downloaded the data to our device and are now ready to dig in. Let us take a look at what we have here.

Let’s take a look at what those particular csv files in the train or test folders hold.

Fig 5: First few samples of two random csv files from the train and test data

So we have a total of 60001 readings from 10 sensors for each segment id (one data point). The multivariate time series for each segment was recorded at regular intervals over 10 minutes (600 seconds), so we can assume the readings were taken every hundredth of a second, i.e. every 10 milliseconds (one centisecond). We also have a variable called time to eruption associated with each segment id; it represents the time elapsed between the end of each 10-minute segment and the start of an eruption.
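
To make this concrete, here is a minimal sketch of how one might load the data with pandas, assuming the Kaggle layout described above (a train.csv metadata file plus one csv per segment; the exact file and column names may differ in your download):

```python
import pandas as pd

# Assumed layout (hedged): train.csv maps each segment_id to its time_to_eruption,
# and train/<segment_id>.csv holds the 60001 x 10 sensor readings for that segment.
train_meta = pd.read_csv("train.csv")
print(train_meta.head())                     # expected columns: segment_id, time_to_eruption

sample_id = train_meta["segment_id"].iloc[0]
segment = pd.read_csv(f"train/{sample_id}.csv")
print(segment.shape)                         # expected: (60001, 10), one column per sensor
print(segment.head())
```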

Let’s take a look at the distribution of the variable time to eruption.

Visualization of the multivariate time series

Let us begin by plotting the data for the segment with the smallest value for the time to eruption variable.

The probability density functions (pdfs) of the sensor signals for the same segment are given below.

We’ve already shown that there are some test segments that have completely missing sensor columns. We’ll now try to figure out which sensors completely skipped a particular segment.

We can see from the plot above:

Let’s look at the range of values for each sensor now. We would do this by determining the maximum and minimum values for each sensor for all training segments.

We can deduce from the above output that all sensor values range from -32767 to 32767. And now we’ll sort the train segments by time to eruption and examine some of the segments with low and high time to eruption values.

Let’s look at some time series visualizations of segments with low time to eruption values.

The plot above depicts the segment whose eruption lies approximately 4 minutes in the future (with respect to the time when the data was captured). Let’s take a look at some more data with low time to eruption values.

The visualizations for the second and third segments (2nd and 3rd with respect to time to eruption) are shown in the above two plots, and we can see that:

Let’s take a look at several other segments that have higher time to eruption values. We will now plot data with a much larger time to eruption value and see how it compares to the plots above (which have a much smaller time to eruption).

The two plots above show the sensor values from the last two segments (in terms of time to eruption):

We’ll now produce sensor value density plots for some random segments from the training data to examine their distribution patterns.

From the above two plots, we can conclude:

Let’s take a brief look at some previous approaches to this challenge. This section is not part of the development process, but it offers a broader viewpoint by investigating past implementations of ideas taken in this direction; some of them are briefly discussed below:

Fig 2: Diagram of the solution pipeline

This method employs a blended ResNet model that is applied to the visualizations derived from the signals. As a result, this technique is divided into two parts:

The steps followed in this methodology are given below:

This approach employs the LSTM recurrent neural network architecture. The suggested method is divided into two stages:

We’ve looked deep enough into our data to begin some simple modelling. Each data point, as we know, is a 10-dimensional time series, and each individual series can be thought of as a distribution of sensor values for the corresponding data point or segment.

In this section, we will develop some fundamental descriptive features for the sensor value distributions. This feature set will include the mean, standard deviation, minimum, maximum, several quantiles (30th, 60th, 80th, 90th), skewness, and kurtosis of each sensor distribution, as sketched in the snippet below.
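
A minimal sketch of this featurization step, assuming each segment is available as a pandas DataFrame with one column per sensor:

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

def describe_segment(segment: pd.DataFrame) -> dict:
    """Basic descriptive features for one segment.

    `segment` is assumed to be the raw 60001 x 10 sensor DataFrame;
    missing values are filled with zero, matching the imputation used later.
    """
    feats = {}
    for col in segment.columns:
        x = segment[col].fillna(0).values
        feats[f"{col}_mean"] = x.mean()
        feats[f"{col}_std"] = x.std()
        feats[f"{col}_min"] = x.min()
        feats[f"{col}_max"] = x.max()
        for q in (30, 60, 80, 90):
            feats[f"{col}_q{q}"] = np.percentile(x, q)
        feats[f"{col}_skew"] = skew(x)
        feats[f"{col}_kurtosis"] = kurtosis(x)
    return feats
```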

Correlation coefficient plot

We can now investigate the correlation of the features with the target variable, time to eruption.

The above plot depicts the correlation between the features and the target variable.

Relative feature importance plot

The plot below depicts the relative importance of the extracted features. The feature importances are calculated by the Random Forest algorithm, roughly as in the snippet that follows.
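
A rough sketch of how such an importance plot can be produced; X_train (a DataFrame of the descriptive features) and y_train (the time to eruption values) are illustrative names for the matrices built above:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit a Random Forest on the descriptive features and rank them by importance.
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().tail(30).plot(kind="barh", figsize=(8, 10))
plt.title("Relative feature importance (Random Forest)")
plt.tight_layout()
plt.show()
```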

Since we are keeping track of the min and max of each sensor for each segment, we don’t need to record missed sensors separately, as that would be redundant information. We have imputed the missing values with zero, so a segment entirely missed by a sensor will have zero in both its min and max features.

Machine learning models on basic descriptive features

The first set of features for the given segments has been drawn out. In this section we apply various ML models and check the usefulness of these features.

In this section, we will go through small code fragments and their associated results, avoiding complete code walkthroughs. The implementation code for replicating the case study will be linked at the end of this blog via the corresponding GitHub repository.

In this experiment, we will examine the error measure for various values of alpha (a constant that determines the amount of regularization to be applied) and train an elastic net model with the alpha that produces the least error. Rather than feeding the model the raw features, we will scale the data first and then train the model; a sketch of this sweep is shown below.
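
A minimal sketch of this alpha sweep, assuming X_train/y_train and X_val/y_val are the (illustrative) train and validation splits of the descriptive features:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Scale the features first, then sweep over alpha and keep the best model.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)

best_alpha, best_mae = None, np.inf
for alpha in [0.0001, 0.001, 0.01, 0.1, 1, 10]:
    model = ElasticNet(alpha=alpha, random_state=42)
    model.fit(X_train_s, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val_s))
    if mae < best_mae:
        best_alpha, best_mae = alpha, mae

print(f"best alpha: {best_alpha}, validation MAE: {best_mae:.2f}")
```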

Now, we will investigate the error measure of a Random Forest model for various numbers of base learners and train a Random Forest with the optimal number of base learners, as sketched below.
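
A sketch of the same kind of sweep over the number of trees, using the same illustrative train/validation split:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Sweep over the number of base learners and keep the count with the lowest MAE.
best_n, best_mae = None, float("inf")
for n_estimators in [50, 100, 200, 400, 800]:
    rf = RandomForestRegressor(n_estimators=n_estimators, n_jobs=-1, random_state=42)
    rf.fit(X_train, y_train)
    mae = mean_absolute_error(y_val, rf.predict(X_val))
    if mae < best_mae:
        best_n, best_mae = n_estimators, mae

print(f"optimal number of base learners: {best_n}, validation MAE: {best_mae:.2f}")
```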

We will repeat the experiment that we did with the Random Forest model and train a gradient boosted regressor with the optimal number of base learners.

Summary:

Hyperparameter Tuning

Summary:

In this section, we will build features similar to those derived in the previous section, but with a minor difference. We first divide each segment into six parts (each containing 10000 timesteps) and then measure simple statistics (such as mean, maximum, and minimum) of the data in each of these six parts individually to obtain the new features, roughly as sketched below.
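
A minimal sketch of this windowed featurization, under the same assumptions as before (each segment is a DataFrame with one column per sensor):

```python
import pandas as pd

def windowed_features(segment: pd.DataFrame, n_windows: int = 6,
                      window_size: int = 10000) -> dict:
    """Split each sensor series into consecutive windows of `window_size`
    timesteps and record simple statistics for each window."""
    feats = {}
    for col in segment.columns:
        x = segment[col].fillna(0).values
        for w in range(n_windows):
            chunk = x[w * window_size:(w + 1) * window_size]
            feats[f"{col}_w{w}_mean"] = chunk.mean()
            feats[f"{col}_w{w}_max"] = chunk.max()
            feats[f"{col}_w{w}_min"] = chunk.min()
    return feats
```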

Correlation coefficient plot

We will now look into the correlation statistics of the features with respect to the target variable, time to eruption.

The above plot depicts the correlation between the features and the target variable.

Relative feature importance plot

The relative feature importance of the new features is given by the plot below.

Machine learning models on basic descriptive features over different time windows

The second set of features for each segment has been developed. In this section, we will use the same ML models as in the previous section to assess the effectiveness of the drawn features.

In this part we carry out a similar set of experiments as in the previous section with ElasticNet, RandomForest, and XGBoost. The results of these experiments are summarized below.

Summary:

The time-window descriptive features did not contribute significantly to the featurization process, as shown by comparing the MAE scores with those of the previous feature set. As a result, we did not do any hyperparameter tuning in this set of experiments.

We can now combine both feature sets and feed them into a feature selection module to determine the optimal number of features that are truly useful in predicting the time to eruption.
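
The exact feature selection routine is not prescribed here; one possible sketch uses recursive feature elimination (RFE) over the combined feature matrix (X_train, y_train, X_val, y_val are illustrative names):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_absolute_error

# Try different feature counts and record the validation MAE for each.
results = {}
for k in range(20, 121, 20):
    selector = RFE(RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42),
                   n_features_to_select=k)
    selector.fit(X_train, y_train)
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
    rf.fit(selector.transform(X_train), y_train)
    results[k] = mean_absolute_error(y_val, rf.predict(selector.transform(X_val)))

print(results)  # choose the number of features with the lowest validation MAE
```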

The above plot shows that the optimal number of features is 60. We will now repeat the same series of experiments for the three previously used models and summarize the results below.

Summary:

The performance on the validation set has degraded after hyperparameter tuning, but the problem of overfitting has been greatly reduced. Before moving forward, let us take a look at the mean absolute error values obtained so far.

Summary:

However, if we evaluate the model’s performance using the following criteria:

Given below are the Kaggle scores obtained on the private and public test set:

This section presents a custom ensemble algorithm to model our data. The model uses a stacking mechanism, i.e., obtaining predictions from base models, using those predictions as features, and training a meta-learner on them to model the relationship between these features and the target variable.

The process of modelling on the train set is stated as follows:

We have defined this custom ensemble regressor as a Python class, along the lines sketched below.
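
A minimal sketch of such a stacking regressor, assuming scikit-learn-style base models and meta-learner; the actual class in the repository may differ in its split strategy and bookkeeping:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone

class StackedRegressor(BaseEstimator, RegressorMixin):
    """Stacked ensemble: base models are fit on one part of the training data,
    their predictions on the remaining part become features for the meta-learner."""

    def __init__(self, base_models, meta_model, split_ratio=0.5, random_state=42):
        self.base_models = base_models
        self.meta_model = meta_model
        self.split_ratio = split_ratio
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.RandomState(self.random_state)
        idx = rng.permutation(len(X))
        cut = int(len(X) * self.split_ratio)
        base_idx, meta_idx = idx[:cut], idx[cut:]

        # Fit the base models on the first split.
        self.base_models_ = [clone(m).fit(X[base_idx], y[base_idx])
                             for m in self.base_models]
        # Use their predictions on the second split as meta-features.
        meta_features = np.column_stack(
            [m.predict(X[meta_idx]) for m in self.base_models_])
        self.meta_model_ = clone(self.meta_model).fit(meta_features, y[meta_idx])
        return self

    def predict(self, X):
        X = np.asarray(X)
        meta_features = np.column_stack(
            [m.predict(X) for m in self.base_models_])
        return self.meta_model_.predict(meta_features)
```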

We tested a variety of meta-learners, including SVM, Decision Tree, and other tree-based models, and discovered that XGBoost with 1300 base learners outperformed the others. As a result, we can limit the number of base learners to 1300, set the meta learner to XGBoost, and conduct grid search only on the meta learner.

Now that we have designed the custom stacked ensemble and obtained its scores, let us take a look at the mean absolute error values obtained so far.

Summary:

This custom model will be saved in Python pickle format so that it can be used later for deployment. For the time being, we will use the model to evaluate its performance on the public test set. The MAE scores obtained on the public and private test sets are given below:

In this segment, we will use the custom ensemble model trained on the data to construct a web-app interface for user interaction with Streamlit, an open-source Python library for building and sharing web apps.

Installation:
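
Streamlit is available on PyPI; a typical install is:

```
pip install streamlit
```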

For the sake of modularity, we have separated our code into separate python(.py) files for this case study, as follows:

One could simply put everything in the app.py file. In this section we will look into app.py only; the contents of the rest of the Python files are explained in the model development section.

Let’s get started…

Step 1: Set a title, cover image for the web-app and create a sidebar for additional information about the app.
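
A minimal sketch of this step (the cover image file name and the sidebar text are illustrative, not the project's actual assets):

```python
import streamlit as st

# "cover.jpg" and the sidebar text are placeholders; use the project's own assets.
st.title("Predicting Volcanic Eruptions from Seismic Behavior")
st.image("cover.jpg", use_column_width=True)

st.sidebar.title("About")
st.sidebar.info(
    "This app estimates a volcano's time to eruption from a ten-sensor "
    "seismic segment using a custom stacked ensemble model."
)
```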

Fig #: title, cover & sidebar

Step 2: Create a file uploader object for handling csv file uploads for testing
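
A sketch of the uploader and prediction logic; "model.pkl" and featurize() are placeholders for the pickled custom ensemble and the feature-extraction routine described earlier:

```python
import pickle

import pandas as pd
import streamlit as st

uploaded_file = st.file_uploader("Upload a 10-sensor segment (csv)", type=["csv"])

if uploaded_file is not None:
    segment = pd.read_csv(uploaded_file)

    # Placeholders: featurize() is assumed to return a single-row feature matrix,
    # and model.pkl is the saved custom ensemble from the modelling section.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    features = featurize(segment)
    prediction = model.predict(features)[0]
    st.write(f"Predicted time to eruption: {prediction:.0f}")
```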

Fig #: File uploader
Fig #: file uploader prediction output

Step 3: Create an option for visualizing the uploaded multi-dimensional time series
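
A sketch of a simple visualization toggle, where `segment` is the DataFrame obtained from the uploader in the previous step:

```python
import streamlit as st

# Let the user pick a sensor and plot its time series for the uploaded segment.
if st.checkbox("Show sensor time series"):
    sensor = st.selectbox("Choose a sensor", list(segment.columns))
    st.line_chart(segment[sensor])
```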

Before deploying the web app online, one can verify the application locally by running the following command in the environment's prompt.
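
With Streamlit, that is typically:

```
streamlit run app.py
```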

Now that we’ve completed the majority of our tasks, we have an app that runs locally, but only the developer can see or run it.

Here, Heroku comes into play. The Heroku platform executes the user’s applications in virtual containers on a reliable and consistent runtime environment. To deploy the developed app onto Heroku servers, we have to go through three major steps:

Step 1: Creating the files required by Heroku

So far, our project’s main directory includes:

Heroku expects more detail, such as which Python version to load, which dependencies are needed, and which file to run in the deployed environment. All of the information expected by the platform is provided by the following files, which, like the others, should be present in the main directory:
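
For a Streamlit app on Heroku, this typically means a Procfile, a setup.sh, and a requirements.txt (plus optionally a runtime.txt). A rough sketch of their contents, with illustrative Python version and package pins that should be matched to your own environment:

```
# Procfile
web: sh setup.sh && streamlit run app.py

# setup.sh -- writes a minimal Streamlit server config for Heroku
mkdir -p ~/.streamlit/
echo "[server]
headless = true
port = $PORT
enableCORS = false
" > ~/.streamlit/config.toml

# runtime.txt -- pin to the Python version you developed with (illustrative)
python-3.8.12

# requirements.txt -- one pinned package per line (illustrative versions)
streamlit==1.2.0
pandas==1.3.4
scikit-learn==1.0.1
xgboost==1.5.0
```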

Double-check your package versions by verifying the environment’s package list.
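
One way to do this is to list the installed packages, or to regenerate requirements.txt directly from the active environment:

```
pip list
pip freeze > requirements.txt
```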

Step 2: Making the code accessible to Heroku via GitHub

Step 3: Connecting the repo with the Heroku container

Once the code has been pushed, you are ready to build an app within the Heroku dashboard. For this we have to:
