Maintaining infrastructure can be a complex and expensive task. Predictive maintenance based on machine learning can help diagnose infrastructure failures or predict them before they happen.
In this blog post, we show how a machine learning solution for predicting the failure of water pump equipment is built. We use data collected about water pumps in Tanzania by a platform called Taarifa, made available to the public in an online competition by DrivenData. DrivenData publishes this set of pump data records for anyone to compete in developing the best prediction algorithm for the status of water pumps. At the time of writing, 5333 teams from all over the world had participated in this competition. Fintu Data Science ranks 37th, well within the top 1% of participating teams.
Understanding the problem
The data set published by DrivenData contains around 60,000 citizen reports about water pumps scattered across Tanzania. Each record includes attributes of the pump such as the exact location, the type of pump, the type of water source, the company that installed the pump, the funding agency, the date the record was made, etc. — 39 attributes in total. Each record in this training set also states the status of the pump (functional, functional but needs repair, not functional). The objective is to learn to predict this status solely from the pump’s attributes, to enable the planning and effective dispatch of maintenance work as quickly as possible. Some exemplary pump locations are shown on the map below.
We solve this problem using supervised learning: a model is trained to predict the functional status from labeled records of past pump reports. Each record contains the pump’s attributes, here called features, and the pump’s known functional status. After training, the model can predict the functional status of a pump from its recorded features alone. To estimate how well the model will perform in real-world operation, we use cross-validation, which helps to detect and avoid overfitting to the available training data. The available data is split into a training set and a validation set, and the model is optimized to give the best performance on the validation set, which the model did not see during training. In this competition, teams predict the status of pumps in a scoring data set of 14,850 entries, for which the true status is not known to the competing teams. Submissions to DrivenData are then ranked by their accuracy, that is, the share of correctly predicted statuses.
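The split into training and validation data and the accuracy metric can be illustrated with a minimal sketch. It uses only the Python standard library; the toy learner `train_majority`, the `pump_id` field, and the ten made-up records are assumptions for illustration and are not part of our actual pipeline.

```python
from collections import Counter
from typing import Callable, List, Sequence

Model = Callable[[dict], str]  # maps a pump's features to a predicted status


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of correctly predicted statuses."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def train_majority(X: Sequence[dict], y: Sequence[str]) -> Model:
    """Toy learner (assumption): always predicts the most common training status."""
    most_common = Counter(y).most_common(1)[0][0]
    return lambda features: most_common


def cross_validate(X, y, train, k: int = 5) -> List[float]:
    """Hold out each of k folds once and score a model trained on the rest."""
    scores = []
    for fold in range(k):
        val = [i for i in range(len(X)) if i % k == fold]
        trn = [i for i in range(len(X)) if i % k != fold]
        model = train([X[i] for i in trn], [y[i] for i in trn])
        predictions = [model(X[i]) for i in val]
        scores.append(accuracy([y[i] for i in val], predictions))
    return scores


# Made-up records: 7 functional pumps followed by 3 non-functional ones.
X = [{"pump_id": i} for i in range(10)]
y = ["functional"] * 7 + ["not functional"] * 3
scores = cross_validate(X, y, train_majority, k=5)
mean_score = sum(scores) / len(scores)
```

Because every record is held out exactly once, the mean of the fold scores estimates how the model behaves on data it has not seen, which is what the competition's scoring set measures.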
In a first step, the input data set has to undergo quality checks for missing or faulty data. We drop data points with unreasonable dates or locations outside of the country. If only some of the features are missing in a record, we estimate them based on complete records (a process often called imputation). For most features, such as for missing population data, we impute simply by using the median value. For two features we think are particularly important though – the height of the pump above sea level and the construction year – we train specific models to impute the missing values: First, we impute the missing height of a pump through a k-nearest-neighbors regression. The missing height is computed as the weighted mean of the heights of the five nearest pumps, using their distances as weights (for the details of this imputation, see this blog post). Second, missing construction years are imputed by a specific prediction model, using gradient boosted trees trained using cross-validation. After cleaning and preparation, around 59 thousand data points are left. Of those, 54% describe functional pumps, 7% functional pumps requiring repair, and 38% non-functional pumps.
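The two simpler imputation steps can be sketched as follows. This is an illustration under stated assumptions, not our production code: it assumes that the weights are inverse distances (so that closer pumps count more), it measures distance as plain Euclidean distance on longitude and latitude, and the function names and sample values are made up.

```python
import math
import statistics
from typing import List, Optional, Tuple

# (longitude, latitude, height in metres or None if missing)
Pump = Tuple[float, float, Optional[float]]


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries with the median of the known entries."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def impute_height(target: Tuple[float, float], pumps: List[Pump], k: int = 5) -> float:
    """Estimate a missing height from the k nearest pumps with a known height."""
    known = [(math.dist(target, (lon, lat)), h) for lon, lat, h in pumps if h is not None]
    nearest = sorted(known)[:k]
    # A pump at exactly the same position decides the estimate on its own.
    for distance, height in nearest:
        if distance == 0.0:
            return height
    # Assumption: inverse-distance weights, so nearer pumps have more influence.
    weights = [1.0 / distance for distance, _ in nearest]
    return sum(w * h for w, (_, h) in zip(weights, nearest)) / sum(weights)


population = impute_median([10, None, 30, 50])
pumps = [(1.0, 0.0, 100.0), (0.0, 2.0, 400.0), (3.0, 0.0, None)]
height = impute_height((0.0, 0.0), pumps)
```

In the example, the pump at distance 1 gets twice the weight of the pump at distance 2, and the pump without a recorded height is ignored.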
There are many machine learning models that can be used for predictive maintenance, each with a set of parameters that can be tuned. It is impossible to know beforehand which model performs best, so usually multiple combinations of models and parameters have to be tested. In the past, selecting the best models and their parameters was often left to the data scientist and could be very time-consuming. Here, we use an automated approach instead, in which model and parameter selection is left to an optimization procedure. The implementation we use, auto-sklearn, uses Bayesian optimization to select an entire ensemble of models and their parameters. This algorithmic search for the best model frees the data scientist to spend more time on what matters most – understanding the processes underlying the data and developing indicators for their improvement.
Our predictions submitted to the competition have an accuracy score of 0.8255. This means that for 82.55% of records in the scoring data set, the functional status of the pump is correctly predicted by our algorithm. Our optimal model ensemble consists of four models: Two linear support vector machines, a random forest, and an extremely randomized trees model. These models take very different approaches to finding and combining features to optimize prediction performance. Consequently, using an ensemble of different model types often leads to better prediction performance than using single models, as the errors of different types of models are not perfectly correlated and thus partly cancel each other out in predictions.
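Why partly uncorrelated errors cancel out can be seen in a small sketch. It combines three made-up models by simple majority vote, which is an assumption chosen for illustration; it is not necessarily how auto-sklearn weights the members of its ensembles.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[List[str]]) -> List[str]:
    """Combine per-model predictions by taking the most common vote per record."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Share of correctly predicted statuses."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


F, N = "functional", "not functional"
truth = [F, F, N, N, F, N]

# Made-up predictions: each model is wrong on exactly one, different record.
model_a = [N, F, N, N, F, N]
model_b = [F, F, F, N, F, N]
model_c = [F, F, N, N, N, N]

ensemble = majority_vote([model_a, model_b, model_c])
single_scores = [accuracy(truth, m) for m in (model_a, model_b, model_c)]
ensemble_score = accuracy(truth, ensemble)
```

Each single model gets five of six records right, but since no two models fail on the same record, the vote recovers the true status everywhere. With real models the errors overlap to some degree, so the gain is smaller.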
For effective maintenance, a swift indication of which pumps are starting to fail or are already broken is crucial. To measure the performance of our algorithm in this task, we combine the two predicted categories “functional, needs repair” and “not functional” into a single category “needs servicing”. Our model distinguishes this combined category from “functional” with 82% accuracy. The prediction contains a similar number of false positives and false negatives, and both error rates are below 18%. If either false positives or false negatives are much more costly for the infrastructure operators, the model could instead be built to minimize the rate of the costlier error type.
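The merging of categories and the counting of the two error types can be sketched as below. The label strings and the six example records are made up for illustration; “needs servicing” is treated as the positive class.

```python
from typing import Dict, List


def to_binary(status: str) -> str:
    """Merge both non-functional categories into 'needs servicing'."""
    return "functional" if status == "functional" else "needs servicing"


def confusion_counts(y_true: List[str], y_pred: List[str]) -> Dict[str, int]:
    """Count outcomes with 'needs servicing' as the positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for true, pred in zip(map(to_binary, y_true), map(to_binary, y_pred)):
        if pred == "needs servicing":
            counts["tp" if true == "needs servicing" else "fp"] += 1
        else:
            counts["fn" if true == "needs servicing" else "tn"] += 1
    return counts


# Made-up three-class labels and predictions.
y_true = ["functional", "functional, needs repair", "not functional",
          "functional", "not functional", "functional"]
y_pred = ["functional", "not functional", "functional",
          "functional, needs repair", "not functional", "functional"]

counts = confusion_counts(y_true, y_pred)
binary_accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
```

Note that the second record counts as correct in the merged view even though the three-class prediction was wrong: the pump needs servicing either way. A false negative (`fn`) is a pump that needs servicing but is reported as functional, which is typically the error an operator wants to weigh against unnecessary dispatches (`fp`).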
Putting predictions to use
The prediction model we built can easily be wrapped into a web service. Predictions for a pump’s status can then be obtained by posting a newly collected observation record for this pump to the prediction service. Integration into existing IT systems and processes is therefore straightforward. In the case of infrastructure equipment, our predictions could, for example, be used to dispatch servicing teams to failing or broken equipment more quickly and effectively.
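A minimal sketch of such a service, using only the Python standard library, could look like this. Everything in it is an assumption for illustration: the JSON field `construction_year`, the port, and in particular `predict_status`, which is a placeholder rule standing in for the trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_status(record: dict) -> str:
    """Placeholder (assumption): a real service would call the trained model here."""
    return "functional" if record.get("construction_year", 0) >= 2000 else "not functional"


def handle_request(body: bytes) -> bytes:
    """Turn a posted JSON observation record into a JSON prediction."""
    record = json.loads(body)
    return json.dumps({"status": predict_status(record)}).encode("utf-8")


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read the posted record and answer with the predicted status.
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)


def serve(port: int = 8000) -> None:
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Keeping the request handling in the plain function `handle_request` makes it possible to test the prediction logic without starting a server. Calling `serve()` starts the service, after which a client can post a record as JSON and read the predicted status from the response.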
Aside from using reports entered by people, predictive maintenance algorithms can also be based on readings of embedded sensors. Apart from the case described in this blog, predictive maintenance has applications in many areas:
- Infrastructure: water and power grids, telecommunication networks
- High value assets: elevators, aircraft engines, trains, medical imaging machines, etc.
- Manufacturing equipment
This large number of different applications comes with the need to tailor data acquisition, model selection, and integration into existing systems and processes. Implemented properly, a predictive maintenance solution can prevent equipment failures, improve field service effectiveness and prolong equipment lifetime – improving customer satisfaction while lowering service costs at the same time.
Fintu Data Science implements custom data science and machine learning solutions to optimize our customers’ processes. Our clients include small and medium-sized enterprises, startups, and NGOs across Germany and Europe. Send us an email at email@example.com.