Big Data Quality: What’s right, what’s wrong?
Large environments such as the internet reveal human wisdom and creativity. As the number of data sources and data producers grows, so does the number of differing opinions, synonyms, and inconsistencies. Big data sets are by nature full of them. But what is right or wrong from this perspective? Often there is no single truth, for several reasons. For example, data may be used in different contexts for different purposes and may therefore have multiple truths. Moreover, data may refer to different points in time. Last but not least, there are some truths that we may not be aware of yet. When I analyzed the genders of persons documented in Wikipedia, I discovered the values “Nerd”, “Puppet”, and “Cylon”, which were associated with fictional characters from TV shows. We could argue about whether every character should be classified as male or female. However, establishing a general truth here is neither possible nor feasible without suppressing the creativity of another group of people. Moreover, it would harm the community and would most likely amount to censorship.
When we analyze big data sets, we should therefore be aware of the variety of truths that we will most likely face and include them in our interpretation of the data. It is perfectly fine to apply a specific perspective when consuming data, but caution is required when attempting to apply subjective truths to general big data cleansing. What are your experiences and opinions about data quality in big data environments?