The data team had spent weeks preparing this critical first run, creating complex models to analyze word patterns pulled from OCR versions of medical forms to discover whether specific patterns in these reports would indicate the presence of cancer. Hundreds of thousands of paper forms had been scanned, the fastest CPUs and machine learning algorithms had been deployed, and finally it all came down to the press of a button.
When the report showed up, as test subject after test subject had been run through the model, the junior data scientists looked at the results and started to cheer – an accuracy rate of 99.67%. Yet as more data came through, the older, more grizzled members of the team began shaking their heads. No test was this good. Something was wrong.
They ordered the members of the team to start pulling up the scanned versions of each medical report to try to figure out what was wrong. They remained baffled until one of them noticed that, in the upper-right-hand corner of the page, every report the model had flagged as belonging to a cancer patient had a small circled C.
A phone call with the doctors, and later with the nurses, uncovered what a highly expensive machine learning model hadn’t: to help identify who was currently taking anti-cancer medications, the nurses had devised a convention of writing a small copyright symbol in the corner. The model had identified a crucial pattern, but it was a meaningless one.
Once this particular symbol was removed from consideration, the accuracy rate dropped to about 53% – only slightly better than a coin toss.
If you take apart most machine learning systems, what you are doing is trying to figure out how to ski down a multidimensional slope from a high point to a point of stability, and to find that point of stability you use the methods of differential geometry. The math very quickly gets to the point where you can no longer actually visualize that surface and the resulting path to those islands of stability, but the principle remains the same regardless. There are quite a few algorithms that can be used to reduce the amount of time it takes to simulate runs down the slope, but most of these primarily have to do with shortcuts that can be made because the slopes in question are mostly linear and connected.
Put another way, the benefits to be had by tweaking the algorithms may account for a certain minimal amount of optimization, but typically that optimization can’t help you if the data that you are working with is bad – it simply means that you come to the wrong conclusion faster.
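The point can be made concrete with a minimal sketch of "skiing down the slope", here plain gradient descent fitting a single slope to four points. The function name `fit_slope`, the learning rates, and the sample values are all illustrative assumptions rather than anything taken from the project described above.

```python
def fit_slope(xs, ys, learning_rate, tolerance=1e-9, max_steps=100_000):
    """Fit y = w * x by gradient descent on mean squared error.

    Returns the fitted slope and the number of steps it took.
    """
    w = 0.0
    n = len(xs)
    for step in range(1, max_steps + 1):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
        if abs(grad) < tolerance:
            return w, step
    return w, max_steps


xs = [1.0, 2.0, 3.0, 4.0]
clean_ys = [2.0, 4.0, 6.0, 8.0]    # true relationship: y = 2x
bad_ys = [2.0, 4.0, 6.0, 80.0]     # same data with one badly recorded value

# Tuning the algorithm changes how quickly we reach the bottom of the slope...
slow_w, slow_steps = fit_slope(xs, clean_ys, learning_rate=0.01)
fast_w, fast_steps = fit_slope(xs, clean_ys, learning_rate=0.1)

# ...but with bad data, the faster setting only reaches the wrong slope sooner.
bad_w, bad_steps = fit_slope(xs, bad_ys, learning_rate=0.1)

print(f"clean data, slow: w={slow_w:.3f} in {slow_steps} steps")
print(f"clean data, fast: w={fast_w:.3f} in {fast_steps} steps")
print(f"bad data,   fast: w={bad_w:.3f} in {bad_steps} steps")
```

Both runs on the clean data settle on the same slope, and only the step count differs. The run on the corrupted data converges just as briskly, to a slope that describes nothing real.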
Organizations often tend to face problems that the Pareto Principle can describe: 80% of the work involved seems to go towards 20% of the problem. In machine learning, this ratio could be even higher. Running a machine learning project is very much akin to launching a spacecraft: 80% of the work goes into ensuring that the rocket is upright and in the gantry, that all of the necessary valves are going to be able to work against extreme heat and pressure, that the crew is healthy and prepared, that the software is ready, and on and on and on. The actual launch may take only minutes to get into orbit, at which point, the goal is making sure that the reason for the launch gets carried through. Pressing the big red button to start the ignition or clicking the button to train the model? Those parts, while important, also take up only a small fraction of the time of building a viable data model pipeline.
Moreover, as the above anecdote suggests, machine learning is a very big hammer that is dependent upon good quality, clean, well-prepared data. All too often, rather than taking the effort (and it is effort) to achieve that goal, people work with raw data that is not worth putting through the model – there is too much noise and not enough meaningful signal, and as a consequence what comes out is what goes in: garbage.
This means that data engineering is actually becoming one of the most important facets of data science, even if many “pure” data scientists feel it is beneath them to sully their hands with it. Data engineering involves many different components:
Locating and identifying meaningful, semantically rich data. Most data has a structure because most data ultimately is a record of the interactions and relationships between various things in a system. Most true insights about data that come from machine learning do so because there are often hidden relationships that exist between the various dimensions involved. Still, it’s worth understanding that randomly grabbing a chunk of memory and pushing it into a model is likely to be about as effective as grabbing a chunk of memory at random and expecting to be able to run an application from it – the best-case scenario is that it doesn’t work! In this respect, semi-structured data – such as XML, JSON, or even better yet RDF – is likely to have enough richness to provide relevant patterns, so long as you don’t attempt to flatten things too far.
Clean data is vital. This has several implications. Machine learning is not intelligent. Stemming variations, multiple terms representing the same concepts, numeric strings interpreted as numbers (think ZIP codes), a whole host of date formats, ambiguous nulls and similar signals, and misspellings: all of these things can corrupt the data. Even things like idiomatic expressions can prove challenging for NLP-related systems. Especially when data sources lack sufficient fidelity (as in facial recognition systems), the potential for miscategorization is high. New MLOps-based processes can automate some of this cleansing, but even today no system is foolproof.
Preventing Overengineering. There are a few secrets about machine learning that tend not to be heard much in the press. Still, they are essential to understand if you want to make a career in the field. First is that more dimensions are not necessarily better, especially when those dimensions are non-orthogonal. All too often, models are created that are too complex, which in turn means that slight variations in inadvertently correlated data can find their way into the model and reduce its effectiveness. Again, some tools can help to identify and reduce spurious dimensions, but they have to be used.
Using More Intelligent, Semantic Labeling. Labeling is often seen as a pain in the butt, but it is one of the most important things you will do in creating models. Labels provide a way of associating metadata with your data, metadata that can be critical in interpreting the results of your models. This is actually an area where semantic modeling should be introduced into your machine learning models, as such semantics are essential in placing the data in the context of the larger information in the system.
Balancing the Hardware. The GPU has become increasingly tied to the perception of deep learning and machine learning systems. Still, it’s worth noting that while graphics processing units can work with specialized versions of libraries (such as CUDA-enabled, NumPy-compatible libraries for Python), GPU-powered processing systems generally provide only a 5-10% improvement in accuracy for many operations and in some cases provide considerably less. In other words, CUDA-based machine learning will likely be faster, but its impact on accuracy is debatable, and in the case of machine learning, reaching the wrong conclusion faster is not generally a good thing.
Understanding the Product. In addition to all of this, you need to know what you want to end up with. Machine learning can be used for categorization and is generally strongest there, but building a categorization model is different from building a predictive model, which is different from building a language recognition system. This will affect what you produce, the format of its expression, and its role in the organization’s overall data strategy. In other words, building a machine learning pipeline is more than simply running a few algorithms; it is the process of productizing the insights that emerge from them. As such, any tools that are used for building such pipelines must include the ability to govern the data process and assert data strategies relevant to the larger organization.
Supporting Various Data Platforms. There is no question that the information sector is now moving into a new phase. Once upon a time, data scientists could get by with a command-line shell running R or Python, but those days are quickly receding. Microsoft, Google, Amazon, Oracle, Apple, Databricks, and many other companies are now trying to capture the big money being spent by companies trying to gain meaningful insights into their data torrents, and this means that data solutions that once took six months need to be completed in days or even hours, in effect moving towards the real-time processing of data analytics. Yet for many smaller businesses, there is the very real possibility of getting priced out of the market, so it is likely, given history, that open-source MLOps solutions, whether in the cloud or at the edge, will become critical players in this space. This includes designing for containerization using Kubernetes or similar tools.
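As a rough illustration of the cleaning problems listed above (ZIP codes read as numbers, competing date formats, ambiguous nulls), here is a minimal sketch using only the Python standard library. The helper names, the set of null markers, and the accepted date formats are assumptions chosen for the example; a real pipeline would derive them from the data it actually sees. Note in particular that treating `03/04/2021` as month/day/year is itself an assumption that has to be checked against the source.

```python
from datetime import datetime

# Illustrative assumptions: spellings of "missing" and the date layouts we accept.
NULL_MARKERS = {"", "n/a", "na", "null", "none", "-", "unknown"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def clean_null(value):
    """Map the many spellings of 'missing' onto a single None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_MARKERS else text


def clean_zip(value):
    """Keep ZIP codes as zero-padded five-digit strings, never as numbers."""
    text = clean_null(value)
    if text is None:
        return None
    digits = text.split("-")[0]  # drop a ZIP+4 suffix if present
    return digits.zfill(5) if digits.isdigit() and len(digits) <= 5 else None


def clean_date(value):
    """Normalize any accepted date layout to ISO 8601, or None if unparseable."""
    text = clean_null(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


raw = {"zip": 2134, "visit": "03/04/2021", "referred_by": "N/A"}
cleaned = {
    "zip": clean_zip(raw["zip"]),
    "visit": clean_date(raw["visit"]),
    "referred_by": clean_null(raw["referred_by"]),
}
print(cleaned)
```

Values that cannot be interpreted come back as `None` rather than as a guess, so the ambiguity stays visible downstream instead of being silently baked into the model.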
Given all of this, extracting value from that data becomes a matter largely of recognizing that what you are looking for is the monetization of insight, as a product, a mechanism for cost savings, or a tool for driving other processes.
For the most part, data products are made up of information that has both high currency and value. Machine learning systems can, among other things, facilitate classification, and in fast-moving environments, the ability to classify (and consequently better search) information may mean the difference between a consumer finding their desired product on your forum or them finding it on someone else’s, whether it be grocery items, automobiles, houses, or investment opportunities. In most cases, however, machine learning does not so much enable this capability as it makes the capability more accurate and timely.
It should be noted that the classification aspect of machine learning ultimately should be seen as a refinement of existing data, adding value by providing more organization and semantic metadata to that data. This can be something of a chancy market, admittedly, because it has a comparatively low barrier to entry and there are diminishing returns to how much value you can add to data without getting hung up in poor modeling.
On the other hand, predictive analytics is a place where a data organization can shine – if it can build the right models, which usually means getting the right data. Again, the key here is that the data acquisition process is as critical as the analytics because what you are selling at this point is business forecasting. Developing an effective data acquisition strategy that can provide richer insights becomes essential. All too often, data strategies have foundered when the costs associated with good data acquisition are factored in.
This becomes even more important when working with systems that can adjust based on environmental changes. A simulation, when you get right down to it, is a prediction of the change of state of a system based upon its previous state, repeatedly iterated. Reflexive systems like this take the insights determined in a previous generation as additional input and are generally useful in studying a system’s evolution, whether natural, economic, military or otherwise.
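The description of a simulation as a repeatedly iterated prediction can be sketched in a few lines. The `simulate` helper and the logistic growth rule below are hypothetical stand-ins: in practice, the one-step predictor would be whatever trained model the organization has built.

```python
def simulate(step, initial_state, generations):
    """Repeatedly apply a one-step predictor, returning the full history."""
    history = [initial_state]
    for _ in range(generations):
        # Each generation's output becomes the next generation's input.
        history.append(step(history[-1]))
    return history


def logistic_growth(population, rate=0.5, capacity=1000.0):
    """Hypothetical one-step prediction: growth that slows near capacity."""
    return population + rate * population * (1 - population / capacity)


history = simulate(logistic_growth, initial_state=10.0, generations=50)
print(f"start: {history[0]:.1f}, after 1 step: {history[1]:.2f}, "
      f"after 50 steps: {history[-1]:.1f}")
```

Because every step feeds on the previous one, any error in the one-step prediction is carried forward and compounded, which is one more reason the quality of the underlying data matters so much here.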
This is also an area where businesses should spend some time looking at gaming worlds, which are, in effect, simulations. Autonomous agents acting in an environment, making use of near-real-time machine-learning loops with the local virtual environment acting as the data acquisition channel – in other words, a game – provide a very good indication of where businesses are going and what analytics will look like in the business environment by 2030.
Machine learning is becoming an integral part of the IT landscape, but it differs from past technologies in that it is not so much the power of the algorithms but the fidelity, cleanliness, and richness of the data that ultimately makes for the best models and consequently the best means towards monetization of those data insights. By understanding that the value in this technology comes ultimately from the intelligent and strategic acquisition, processing, interpretation and utilization of that data, companies will be able to move data beyond the command line and onto the bottom line.
Kurt Cagle is the editor of The Cagle Report and former Managing Editor of Data Science Central, a site devoted to the evolution of data systems in business, government and personal life.