By: PeopleTec Senior Data Scientist, Keith Allen, PhD.
Today, the collection, storage, and analysis of data take place in a cloud-based, artificially intelligent digital ecosystem that can grow alongside modern complexity to help decision makers solve problems and answer questions. This modernized system also allows users to spend less time preparing data and more time analyzing it for actionable decision-making, and it provides the backbone for the future of human learning shaped by technology.
PeopleTec works with several U.S. Government agencies in the Intelligence sector and the Test and Evaluation domain to modernize their digital ecosystems.
In these ecosystems, data at scale is automatically processed, stored, and made ready for immediate analysis. These ecosystems also include AI-enabled tools, trend and predictive analysis tools, and form readers that convert documents to data.
This end-to-end data system results in quicker, data-driven decisions while saving data processing time.
A Fool with a Tool is Still a Fool
Moravec’s Paradox states that computers can be trained to perform tasks that humans find difficult and, conversely, humans are good at performing tasks that are difficult for computers. The machine can process a large amount of data very quickly, but it takes the experienced human to provide the context and direction for the design and employment of the tool. If you have spent any amount of time collecting and analyzing data with a spreadsheet, then you understand Moravec’s Paradox and the basic challenges of modern data science.
We have all struggled to manually convert data from a document to a spreadsheet, spending an inordinate amount of time organizing the data before even reaching the analysis. Perhaps tables have changed over time, or maybe your colleague down the hall is working in a more current version of the same spreadsheet. Or maybe, as the requirements for your data set and analysis grow, your spreadsheet's performance degrades.
Now imagine if you had to scale your data process to that of a large company or a government agency. That spreadsheet stored on your hard drive is simply not going to hold up to the demands of the modern era.
The modern end-to-end data system addresses the five “V”s of data science:
- Volume (the size and amount of data that needs to be managed and analyzed)
- Value (the return on investment to the organization)
- Variety (the diversity of data types)
- Velocity (the speed at which organizations receive, store, and analyze data)
- Veracity (the quality and accuracy of the data)
Modern data systems enable data governance, enforcing data standards and policies on how data is collected, stored, and utilized. Effective data governance and a well-designed, modern data system results in trustworthy insights that drive informed decision-making. Ultimately, the challenge boils down to how humans and machines interoperate more effectively, not how one can out-pace (or replace) the other.
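Data governance of the kind described above often starts with machine-checkable data standards. The sketch below shows one minimal way to enforce a standard at ingest time; the field names and rules are hypothetical examples, not a specific PeopleTec implementation.

```python
# Minimal sketch: enforcing a hypothetical data standard before storage.
# Required fields and their expected types stand in for an organization's
# governance policy.
REQUIRED_FIELDS = {"record_id": str, "timestamp": str, "value": float}

def validate_record(record: dict) -> list:
    """Return a list of governance violations for one record (empty = valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

# A conforming record passes; a malformed one is flagged before it is stored.
print(validate_record({"record_id": "r1", "timestamp": "2024-01-01", "value": 3.2}))
print(validate_record({"record_id": "r2", "value": "n/a"}))
```

Rejecting or flagging nonconforming records at the point of collection is what keeps downstream insights trustworthy.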
The Modern Data System
The modern, end-to-end data system includes the following characteristics:
- A cloud-based infrastructure that is accessible by any verified user, regardless of location
- Tools to automatically ingest, store, and retrieve large amounts of data quickly
- Customizable extract, transform, and load (ETL) data pipelines
- Real time and historical resource costs and performance metrics
- Customizable visualizations tailored to the user’s specific analysis need
- Ability to organize and archive data based on relevance and importance
- Ability to scale to volume, velocity, and variety of the data over time while minimizing system latency and down time
In a modern, cloud-based system (as depicted in the image above), the user interfaces with the data system via an authorized device. As the centerpiece of the system, the data lake serves as the hub for all data inflows and outflows and is essential for proper data governance. Incoming data triggers an automatic process that stores and tags the data set in the raw container within the data lake.
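The automatic ingest step can be sketched in a few lines. Here an in-memory dictionary stands in for the cloud data lake, and the container and tag names are hypothetical, chosen only to mirror the flow described above.

```python
import hashlib
import json
from datetime import datetime, timezone

# Minimal sketch: an incoming data set triggers an ingest step that tags it
# with metadata and stores it in the "raw" container of the data lake.
# The dict below is a stand-in for real cloud storage.
data_lake = {"raw": {}, "processed": {}, "authoritative": {}}

def ingest_raw(payload: bytes, source: str) -> str:
    """Store an incoming payload in the raw container and return its tag."""
    tag = hashlib.sha256(payload).hexdigest()[:12]  # content-derived identifier
    data_lake["raw"][tag] = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    return tag

tag = ingest_raw(json.dumps({"sensor": "A1", "reading": 42}).encode(), "field-sensor")
print(tag, data_lake["raw"][tag]["source"])
```

Tagging raw data with its source and ingest time, before any transformation, is what later makes lineage and governance auditable.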
Cognitive artificial intelligent tools such as search algorithms and predictive tools draw both new and historical data from the data lake and work in tandem with custom data pipelines. The data pipelines then move, transform, filter, and clean the data through the system. Once complete, the processed data is stored in a separate container to preserve and archive the work that was done with it. This process may be executed several times.
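A custom pipeline of the kind described above can be modeled as a sequence of stage functions. The stages below (dropping incomplete records, converting hypothetical millivolt readings to volts) are illustrative assumptions, not a specific production pipeline.

```python
# Minimal sketch of a custom data pipeline: each stage filters, transforms,
# or cleans the records, and the final result would be written to a separate
# "processed" container. Stage functions here are hypothetical examples.
def drop_incomplete(records):
    """Filter stage: remove records with no measurement."""
    return [r for r in records if r.get("value") is not None]

def normalize_units(records):
    """Transform stage: convert hypothetical millivolt readings to volts."""
    return [{**r, "value": r["value"] / 1000.0} for r in records]

def run_pipeline(raw_records, stages):
    """Pass the data through each stage in order."""
    data = raw_records
    for stage in stages:
        data = stage(data)
    return data

raw = [{"id": 1, "value": 5300}, {"id": 2, "value": None}, {"id": 3, "value": 4100}]
processed = run_pipeline(raw, [drop_incomplete, normalize_units])
print(processed)
```

Composing a pipeline from small, named stages is what makes it customizable: stages can be reordered, swapped, or rerun as the text notes the process may be executed several times.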
The processed data is then moved to a data exploration and visualization tool. Many organizations prefer a combination of built-in custom visualizations along with the ability to modify and create new visualizations as needed. Once the final analysis is complete, the reportable data is stored in the authoritative data container, where it can be extracted for use in other applications. When new data is available, this entire process can be triggered manually or automatically.
Artificial Intelligence Tools
Artificial Intelligence is a collection of algorithms designed to use large amounts of data to make predictions and to assist with understanding that data. It applies current knowledge of how human neurology works (i.e., how the human brain learns) while leveraging the computational power of a fast computer. The figure below depicts the basic function of a neural network, modeled after the human neurological system.
In an artificially intelligent algorithm, data enters the system and is processed by a set of statistical functions. The output is a set of inferences about the dependent variable in question, with associated probabilities. For instance, a neural network may be used in image recognition to identify specific animals in trail camera footage. In this example, the algorithm processes a portion of the trail camera footage as a training set, provides a range of probabilities about the type of animal, and then compares the results to a validated set of the data.
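The forward pass just described can be sketched in pure Python. The toy weights and the two-class setup (standing in for, say, "deer" vs. "fox" in trail camera footage) are illustrative assumptions; in a real system the weights are learned from the training set.

```python
import math

# Minimal sketch of a neural network's forward pass: inputs flow through a
# hidden layer of weighted sums and nonlinear activations, and a softmax
# output turns the result into class probabilities. Weights are fixed toy
# values here; in practice they are learned from training data.
def forward(x, w_hidden, w_out):
    # Hidden layer: weighted sum per neuron, then ReLU activation
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w_hidden]
    # Output layer: weighted sum per class
    logits = [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w_out]
    # Softmax: convert scores to probabilities that sum to 1
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward([0.8, 0.2],
                w_hidden=[[1.0, -1.0], [0.5, 0.5]],
                w_out=[[2.0, 0.0], [0.0, 2.0]])
print(probs)
```

The output is the "range of probabilities" the text mentions: one probability per candidate animal, which can then be compared against the validated data set.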
Because the algorithm learns as more data becomes available, the animal predictions and their associated probabilities will change over time. Like all statistical models, artificially intelligent algorithms are susceptible to issues of statistical power (typically in the form of data availability), external validity (i.e., the algorithm can only provide meaningful insights for a specific problem and data set), and use of the wrong tool for the problem.
If you have ever used regression analysis or classification analysis at work or in school, then you have worked with Artificial Intelligence. The difference is that today's artificially intelligent systems have access to huge amounts of data and can be tuned to learn automatically as new data becomes available. Without a computational engine that can process large amounts of data quickly and adapt to new data sets, humans have little hope of harnessing the full power of the information age.
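The point above can be made concrete with ordinary least-squares regression, one of the simplest models in this family. The data points are made-up examples; a modern system applies the same fitting idea at far larger scale and refits automatically as new data arrives.

```python
# Minimal sketch: ordinary least-squares fit of a line y = a + b*x.
# This is the classic regression analysis referenced above; the numbers
# are illustrative toy data.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    # Intercept: line passes through the mean point
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(f"y = {a:.2f} + {b:.2f}x")
```

Refitting this model whenever new (x, y) pairs arrive is, in miniature, the "learn automatically over time" behavior of larger AI systems.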
Advances in data systems and Artificial Intelligence tools undoubtedly steer the future of humanity. Recent advancements in large language models and, in general, Artificial Intelligence have made some scenes from Kubrick’s 2001: A Space Odyssey seem a bit more fact than fiction (think Dr. Floyd interacting with HAL 9000 as an analogue for how we interact with Google or ChatGPT).
As humans begin to use more Artificial Intelligence tools, it is important to understand the capabilities, limitations, risks, and ethical issues that may arise and affect future data systems. The future artificially intelligent data system must address the following challenges:
- Computational Power: the computational power required for these systems is often large, which limits edge-based systems and systems with tight SWaP (size, weight, and power) constraints. Advancements in quantum computing and in fusion power may hold the key to addressing these limitations.
- Algorithmic Complexity: the complexity of the artificially intelligent algorithm can often become a source of bias, error, and slower processing speeds. Examining how best to write and test the algorithm, while minimizing unnecessary parameters, is challenging in a world where code is mass produced for a variety of different problem sets.
- Ethical Issues: As Artificial Intelligence changes the way the world operates, possible ethical and legal challenges arise that must be examined. It will be extremely important to examine and regulate the use cases of artificially intelligent systems.
- Security Issues: In the age of cloud computing, big data, and Artificial Intelligence, the security protocols used by all organizations are under new pressure. It is essential to examine the security implications of artificially intelligent systems to inform how policy can be modernized while protecting safety, security, and public law.
- Change Management: change is always disruptive, and demonstrating the utility of a modern data system is often not enough to quell fears of changing the status quo. Demonstrating the tool's utility, and how it can improve the human workforce, should be at the forefront of Artificial Intelligence research.