Filling the data scientist gap, part 2: Diving into data analytics technologies
April 10, 2018
A new set of algorithms and infrastructure has emerged that allows businesses to use key data analytics techniques such as big data or machine learning to capitalize on opportunities.
The tsunami of data provides businesses an opportunity to optimize processes and provide differentiated products. A new set of algorithms and infrastructure has emerged that allows businesses to use key data analytics techniques such as big data or machine learning to capitalize on these opportunities.
The infrastructure behind big data and machine learning has also produced a host of technologies that support the iterative process of building a data analytics algorithm. The early stage of that process can set a business up for success. It involves trying several strategies, such as finding other sources of data, applying different machine learning approaches, and testing different feature transformations.
Given the potentially unlimited number of combinations to try, it is crucial to iterate quickly. Domain experts are well suited to iterate quickly, as they can use their knowledge and intuition to avoid approaches that are unlikely to give strong results. The faster an engineer with domain knowledge can apply their knowledge with the tools that enable quick iterations, the faster the business can gain a competitive advantage.
But before diving into the technologies that support this activity, let’s first walk through an example of this iterative process and some questions to ask along the way.
Iterating on data sets
A prosthetics company knows that it could build smarter prosthetics if it knew what activity its customer would be doing (standing, sitting, walking, etc.). So, the first question it asks is: What data could we use to determine this?
The engineers at the company know that most of their customers have smartphones, so they would like to use data from the smartphones’ sensors to determine each customer’s activity. They begin by logging data from the accelerometer and apply a machine learning algorithm directly to the raw data, but the results aren’t as good as they hoped. The iterative process begins, with the engineers asking: Are there additional ways we could prepare the data for machine learning that might give better results?
The company’s engineers apply signal processing techniques to extract frequency content from the sensor data, and try the machine learning techniques again. The results are better but not quite there yet, so they ask: Are there other sources of data we could use to improve our predictions?
They decide to also log gyroscope data from the smartphones, and combine this with the accelerometer data. Training their machine learning models again, they are now happy with the results, and move to production.
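The feature-extraction step in this example can be sketched in a few lines of Python. This is a minimal illustration, not the company’s actual method: the sensor signals are synthetic, the 32 Hz sampling rate is made up, and the function names (`dft_magnitudes`, `dominant_frequency`, `features`) are hypothetical. It uses a plain discrete Fourier transform to pull the dominant frequency out of each sensor signal and combines the two sensors into one feature vector that a machine learning model could be trained on.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the non-negative-frequency DFT bins of a real signal."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        mags.append(abs(total) / n)
    return mags


def dominant_frequency(signal, sample_rate_hz):
    """Frequency in Hz of the strongest component, ignoring the DC bin."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * sample_rate_hz / len(signal)


def features(accel, gyro, sample_rate_hz):
    """Combine frequency features from both sensors into one feature vector."""
    return [
        dominant_frequency(accel, sample_rate_hz),
        dominant_frequency(gyro, sample_rate_hz),
    ]


# Synthetic stand-in for logged data: a 2 Hz "walking" oscillation
# sampled at 32 Hz for 2 seconds (illustrative values only).
RATE = 32
walking_accel = [math.sin(2 * math.pi * 2.0 * t / RATE) for t in range(64)]
walking_gyro = [0.5 * math.sin(2 * math.pi * 2.0 * t / RATE + 0.3) for t in range(64)]

walking_features = features(walking_accel, walking_gyro, RATE)
print(walking_features)
```

In practice a fast Fourier transform from a numerical library would replace the hand-written transform, and the feature vector would likely include more than one number per sensor.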
Other questions an engineer in the iterative process might ask include:
- What data is available?
- Are there other data sources?
- What types of processes could be used to extract high-level information from the data?
- Where is the model going to run in production?
- Are certain types of misclassification costlier than others?
- How can we experiment quickly to validate ideas and answer the above questions?
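One of the questions above, whether certain types of misclassification are costlier than others, can be made concrete by scoring models with a cost table instead of plain accuracy. The sketch below is illustrative only: the activity labels follow the prosthetics example, but the cost values and the two sets of model predictions are invented.

```python
from collections import Counter

# Hypothetical costs keyed by (actual, predicted). Mistaking "walking" for
# "sitting" is assumed to be worse than the reverse; the values are made up.
COST = {
    ("walking", "sitting"): 5.0,
    ("sitting", "walking"): 1.0,
}


def total_cost(actual, predicted, cost=COST):
    """Sum the cost of every misclassified sample; correct predictions cost 0."""
    pairs = Counter(zip(actual, predicted))
    return sum(cost.get(pair, 0.0) * count for pair, count in pairs.items())


actual = ["walking", "walking", "sitting", "sitting"]
model_a = ["sitting", "walking", "sitting", "sitting"]  # one costly error
model_b = ["walking", "walking", "walking", "walking"]  # two cheap errors

cost_a = total_cost(actual, model_a)
cost_b = total_cost(actual, model_b)
print(cost_a, cost_b)
```

Here the second model makes more mistakes but incurs a lower total cost, which is the kind of trade-off this question is meant to surface.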
Now that you’ve seen an example of the iterative process and questions to ask, what about the technologies behind this process?
Iterating on big data
As more and more data is generated, systems need to evolve to process it all. In this “big data” space, two large projects have reshaped the landscape: Hadoop and Spark. Both projects are part of the Apache Software Foundation. Together, they have made it easier and cheaper to store and analyze large amounts of data.
These technologies can greatly impact an engineer’s work. For engineers accustomed to working with data in files on desktop machines, on network drives, or in traditional databases, these new tools require a different way of accessing the data before analysis can even be considered. In many cases this creates artificial data silos and inefficiencies, such as when an engineer must ask someone else to pull data out of the big data system each time a new analysis is performed.
Another challenge engineers face when working with big data is the need to change their computational approach. When data is small enough to fit in memory, the standard workflow is to load the data and perform the computation, which is typically fast because the data is already in memory. But with big data, there are often disk reads and writes, as well as data transfers across networks, which slow down computations.
When engineers are designing a new algorithm, they need to be able to iterate quickly over many designs. The result is a new workflow that involves grabbing a sample of the data and working with that locally, enabling quick iterations and easy usage of helpful development tools such as debuggers. Once the algorithm has been vetted on the sample, it is then run against the full data set in the big data system.
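A rough sketch of this sample-then-scale workflow is shown below. It assumes the algorithm can be written against any iterable, so the same function works on a small in-memory sample and on a stream of the full data. The `full_data_stream` generator is a hypothetical stand-in for a reader attached to a big data system, and `summarize` is a placeholder for the algorithm under development.

```python
from itertools import islice


def summarize(readings):
    """Algorithm under development: count, mean, and peak of sensor readings.

    Accepts any iterable, so the same code runs on an in-memory sample
    or on a stream that is read one record at a time.
    """
    count, total, peak = 0, 0.0, float("-inf")
    for value in readings:
        count += 1
        total += value
        peak = max(peak, value)
    return {"count": count, "mean": total / count, "peak": peak}


def full_data_stream(n):
    """Hypothetical stand-in for reading the full data set record by record."""
    for i in range(n):
        yield float(i % 100)


# Step 1: iterate quickly on a small local sample (every 37th record here).
sample = list(islice(full_data_stream(10_000), 0, None, 37))
sample_result = summarize(sample)
print(sample_result)

# Step 2: once vetted, run the unchanged function over the full data set.
full_result = summarize(full_data_stream(10_000))
print(full_result)
```

Because the sample is an ordinary list, it can be inspected in a debugger or plotted while the algorithm is still changing; only the final run touches the full data.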
The solution to these challenges is a system that lets engineers use a familiar environment to write code that runs both on the data sample locally and on the full data set in the big data system. Tools such as MATLAB can connect to big data systems such as Hadoop, so data samples can be downloaded and algorithms prototyped locally. Computational models built on deferred evaluation, in which operations are queued and executed together only when a result is requested, then run the algorithm on the full data set in a performance-optimized manner. For the iterative analysis that is common to engineering and data science workflows, deferred evaluation is key to reducing the time an analysis takes on a full data set, where a single run can often take minutes or hours.
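To illustrate the general idea of deferred evaluation, here is a minimal Python sketch. It is not how MATLAB or any particular big data tool implements it: the `Deferred` class and its `map`, `filter`, and `gather` methods are hypothetical names chosen for this example. Operations are only recorded when they are called, and the data source is read once, in a single pass, when a result is finally requested.

```python
class Deferred:
    """Records operations instead of running them; work happens in gather()."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable returning an iterable
        self._steps = tuple(steps)   # queued ("map" or "filter", function) pairs

    def map(self, fn):
        return Deferred(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return Deferred(self._source, self._steps + (("filter", fn),))

    def gather(self):
        """Run all queued steps in one pass over the data."""
        out = []
        for item in self._source():
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


stats = {"passes": 0}


def read_data():
    """Stand-in for an expensive read from a big data system."""
    stats["passes"] += 1
    return range(10)


# Building the pipeline reads nothing; the steps are only queued.
pipeline = Deferred(read_data).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
passes_before = stats["passes"]

# The single pass over the data happens here.
result = pipeline.gather()
print(result, stats["passes"])
```

Combining the queued steps into one pass is what saves time: two separate eager operations would each have read the data once.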
Big data technologies have been a key enabler in the growth of data science. With large amounts of data collected, new algorithms were needed to reason on this data, which has led to a boom in the use of machine learning.
Machine learning is used to identify the underlying trends and structures in data. Two of its main branches are unsupervised learning and supervised learning.
In unsupervised learning, we try to uncover relationships in data, such as groups of data points that are all similar. For example, we may want to look at driving data to see if there are distinct modes that people operate their cars in. From cluster analysis, we may discover different trends such as city versus highway driving or, more interestingly, different styles of drivers (e.g., aggressive drivers).
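As a small illustration of cluster analysis, the sketch below runs a plain k-means loop on made-up driving data. The trip records (average speed in km/h, stops per km) are invented, and seeding the centroids with the first two trips is a simplification for reproducibility; real tools typically use randomized initialization and scale the features first.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Invented trip summaries: (average speed in km/h, stops per km).
trips = [
    (28, 6.0), (32, 5.0), (30, 7.0),       # look like city driving
    (98, 0.2), (105, 0.1), (100, 0.3),     # look like highway driving
]

centroids, clusters = kmeans(trips, centroids=[trips[0], trips[1]])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the city and highway groups emerge from the data alone, which is the defining trait of unsupervised learning.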
In supervised learning, we are given input and output data, and the goal is to train a model that, given new inputs, can predict the new outputs. Supervised learning is commonly used in applications such as predictive maintenance, fraud detection, and facial recognition in images.
Each of the areas in machine learning – unsupervised learning and supervised learning – has dozens of popular algorithms (and hundreds of less popular ones). However, it’s hard to know which of these algorithms will be best for the particular problem you are working on. Often, the best thing to do is to try several and compare the results. This can be a challenge in some environments, because researchers build algorithms with different interfaces depending on their problem and preferences.
Mature machine learning tools have a consistent interface for the various algorithms and make it easy to quickly try different approaches. This is critical for domain experts performing data science because it enables them to identify “quick wins” where machine learning provides improvement over traditional methods. This approach also prevents them from spending days or weeks tuning a machine learning model to a data set that is not well-suited for machine learning. Tools such as MATLAB address this problem by providing point-and-click apps that train and compare multiple machine learning models.
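The value of a consistent interface can be sketched in Python. The example assumes a simple convention, invented for this illustration, in which every model exposes `fit` and `predict`; the two toy classifiers and the pump-style data (vibration amplitude, temperature) are made up. Because both models share the interface, comparing them takes a single loop.

```python
import math


class MajorityClass:
    """Baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestCentroid:
    """Predicts the label whose training-set mean is closest to the input."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids_[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
        return self

    def predict(self, X):
        return [
            min(self.centroids_, key=lambda lab: math.dist(x, self.centroids_[lab]))
            for x in X
        ]


def accuracy(model, X_train, y_train, X_test, y_test):
    """Train on one set, score on a held-out set."""
    predictions = model.fit(X_train, y_train).predict(X_test)
    return sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)


# Invented data: (vibration amplitude, temperature) with a health label.
X_train = [(0.10, 60), (0.20, 62), (0.15, 61), (0.90, 80), (0.80, 78)]
y_train = ["healthy", "healthy", "healthy", "failing", "failing"]
X_test = [(0.12, 59), (0.85, 82), (0.18, 63), (0.95, 79)]
y_test = ["healthy", "failing", "healthy", "failing"]

results = {
    type(model).__name__: accuracy(model, X_train, y_train, X_test, y_test)
    for model in (MajorityClass(), NearestCentroid())
}
print(results)
```

Including a trivial baseline such as `MajorityClass` is a cheap way to spot a “quick win”: if a real model cannot beat it, the data may not be well suited to machine learning.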
Combined, big data and machine learning are poised to bring new solutions to long-standing business problems. The underlying technology, in the hands of domain experts who are intimately familiar with these business problems, can yield significant results. For example, engineers at Baker Hughes used machine learning techniques to predict when pumps on their gas and oil extraction trucks would fail. They collected nearly a terabyte of data from these trucks, then used signal processing techniques to identify relevant frequency content. Domain knowledge was crucial here, as they needed to be aware of other systems on the truck that might show up in sensor readings, but that weren’t helpful at predicting pump failures. They applied machine learning techniques that can distinguish a healthy pump from an unhealthy one. The resulting system is projected to reduce overall costs by $10 million. Throughout the process, their knowledge of the systems on the pumping trucks enabled them to dig into the data and iterate quickly.
By leveraging tools for processing big data and applying machine learning, engineers such as those at Baker Hughes are well positioned to tackle problems whose solutions improve business outcomes. With their domain knowledge of these complex systems, engineers take these tools far beyond traditional uses for web and marketing applications.
Seth DeLand is an application manager at MathWorks for data analytics. Before that, he was product manager for optimization products. Prior to joining MathWorks, Seth earned his BS and MS in mechanical engineering from Michigan Technological University.
The MathWorks, Inc.