After the data is stored, the Data Scientist has to go through the data and clean it. Data Processing (Cleansing) is the process of ensuring that the data is correct, consistent, and usable. Common inaccuracies include missing values and typographical errors. Improving data quality is the central goal of this step, so the data should be validated against every rule that makes sense for that specific data set. It is also important to determine whether missing data can be obtained from another source, and whether values recorded in different units or measures can be converted to a common one. The Data Scientist must also check the accuracy and consistency of the data against other data sets and real-world values. Normally a repeatable process is created (using Python scripts, R, or other tools) that searches the data for unexpected or incorrect values, cleans them by fixing or removing them, checks the data again, and then reports the changes made and the quality of the resulting data.
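The search, clean, re-check, and report loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`height`, `unit`), and the plausibility rule of 50 to 250 cm are all hypothetical, and a real project would likely use a library such as pandas rather than plain dictionaries.

```python
# Hypothetical raw records: heights arrive in mixed units, and some
# values are missing or mistyped.
RAW = [
    {"id": 1, "height": "180", "unit": "cm"},
    {"id": 2, "height": "1.75", "unit": "m"},
    {"id": 3, "height": "", "unit": "cm"},      # missing value
    {"id": 4, "height": "17o", "unit": "cm"},   # typographical error
    {"id": 5, "height": "165", "unit": "cm"},
    {"id": 6, "height": "950", "unit": "cm"},   # fails the plausibility rule
]


def to_cm(record):
    """Return the height in centimetres, or None if it cannot be parsed."""
    try:
        value = float(record["height"])
    except ValueError:
        return None
    return value * 100 if record["unit"] == "m" else value


def is_plausible(height_cm):
    """Validation rule (an assumption): heights fall between 50 and 250 cm."""
    return height_cm is not None and 50 <= height_cm <= 250


def clean(records):
    """Normalize units, drop invalid rows, and report what was changed."""
    cleaned, rejected = [], []
    for record in records:
        height_cm = to_cm(record)
        if is_plausible(height_cm):
            cleaned.append({"id": record["id"], "height_cm": height_cm})
        else:
            rejected.append(record["id"])
    quality_report = {
        "rows_in": len(records),
        "rows_kept": len(cleaned),
        "rejected_ids": rejected,
    }
    return cleaned, quality_report


cleaned, quality_report = clean(RAW)
print(quality_report)
```

Here invalid rows are removed rather than repaired. Whether to fix, remove, or fill a value from another source depends on the data set and is a decision the sketch does not try to make.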
Data mining is the process of finding patterns, anomalies and correlations in data to solve problems through data analysis. Additionally, data mining techniques are used to build machine learning (ML) models that power applications such as search engine algorithms and recommendation systems.
Mathematical and statistical models are used to find patterns in the data. Many libraries in Python and R make these tools available, such as TensorFlow. Some examples of techniques include:
- Sequence or Path Analysis looks for patterns where one event leads to another later event.
- Clustering finds previously unknown groupings in the data. Items are grouped based on how similar they are to each other.
- Classification looks for new patterns and might result in a change in the way the data is organized. These algorithms predict the class of an item based on multiple features.
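To make the clustering idea concrete, here is a minimal sketch of k-means on one-dimensional data, written in plain Python. k-means is one common clustering algorithm, chosen here only for illustration; the purchase amounts and the starting centroids are made-up values, and a library implementation would normally be used in practice.

```python
def k_means_1d(values, centroids, iterations=10):
    """Assign each value to its nearest centroid, then move every
    centroid to the mean of the values assigned to it."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(value - centroids[i]),
            )
            clusters[nearest].append(value)
        # Keep the old centroid if no value was assigned to it.
        centroids = [
            sum(group) / len(group) if group else centroids[i]
            for i, group in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical purchase amounts with two visibly different groups.
purchases = [12, 15, 14, 90, 95, 88]
centroids, clusters = k_means_1d(purchases, centroids=[0, 100])
print(clusters)
```

The algorithm is never told which purchases are "small" or "large"; the two groups emerge purely from how close the values are to each other, which is the sense in which clustering finds groupings that were not known beforehand.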
In the end, the results are evaluated and compared to business objectives. Businesses can learn more about their customers, develop more effective strategies for their various business functions, and in turn use their resources more effectively. This helps businesses move closer to their objectives and make better decisions.
Visualization - Reporting
Findings are communicated through Reporting and Monitoring so that key people in the business can understand the results clearly and concisely. Monitoring usually provides an alert or warning for a specific point in time, while Reporting typically displays information in an organized manner. A report usually takes the shape of a table, graph, or chart. In the field of information technology, reporting is divided into two types: executive and operational. Operational reporting presents information that tends to be more technical and detailed. Executive reporting takes a broader, higher-level perspective and is generally used to inform managers about financial decisions.
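The difference between Monitoring and Reporting can be illustrated with a small Python sketch. Everything in it is hypothetical: the daily error counts, the alert threshold of 10, and the function names are assumptions chosen for the example, not part of any particular tool.

```python
# Hypothetical daily error counts for a service; the threshold is an assumption.
daily_errors = {"Mon": 3, "Tue": 4, "Wed": 21, "Thu": 5}
ALERT_THRESHOLD = 10


def monitor(metrics, threshold):
    """Monitoring: flag the specific points in time that exceed the threshold."""
    return [
        f"ALERT {day}: {count} errors"
        for day, count in metrics.items()
        if count > threshold
    ]


def build_report(metrics):
    """Reporting: present all values as an organized plain-text table."""
    lines = [f"{'Day':<5}{'Errors':>7}"]
    lines += [f"{day:<5}{count:>7}" for day, count in metrics.items()]
    lines.append(f"{'Total':<5}{sum(metrics.values()):>7}")
    return "\n".join(lines)


alerts = monitor(daily_errors, ALERT_THRESHOLD)
table = build_report(daily_errors)
print("\n".join(alerts))
print(table)
```

The monitoring function returns only the days that need attention, while the report shows every day plus a total, which matches the distinction drawn above between a point-in-time warning and an organized summary.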