Data Reconciliation Strategies: A Priori vs. A Posteriori Reconciliation

The biggest difference between data integration semweb-style and data integration datawarehouse-style is when reconciliation happens: the semantic web model assumes that reconciliation must happen a posteriori, when the data is consumed, while data warehousing assumes that reconciliation must happen a priori, when the data enters the system.

The semantic web architects correctly identified a priori reconciliation as the biggest scalability impediment for a world-scale data integration effort and decided to avoid worrying about it (so much so that it took years for the concept of ‘identifier equivalence’ to even surface in the semantic web architecture, and when it did, it was included in OWL, which feels horribly overdesigned to be used simply for that particular purpose).

For years, I’ve been a fairly vocal advocate for the elegance and scalability of a posteriori reconciliation via equivalence mappings as a superior mechanism (scale-wise) to a priori reconciliation efforts… but this started to change very rapidly once I started working for Metaweb and saw first-hand how much more effective a priori reconciliation can be, even if it is drastically more expensive and limiting on the data acquisition front.

The difference between efforts like Freebase and efforts like Linking Open Data hinges on their model for reconciliation.

Freebase spends a considerable amount of resources performing a priori reconciliation of all the bulk loads of data to try to have the most compact and densest possible graph, even at the cost of limiting the rate at which new data can be acquired. On the other hand, Linking Open Data follows the a posteriori reconciliation model, where it is assumed that identifier reconciliation is a low-energy point and that the world-wide web of data will, once big enough, tend to naturally reconcile identifiers and schemas toward an increased graph density.

Both are huge bets: there is no indication that a priori reconciliation costs are not a function of the quantity of data already contained in the graph (which would eventually saturate its ability to grow); and there is no indication that a denser graph is naturally a lower energy point for unreconciled agglomerations of datasets and that an increase in relational density would happen naturally and spontaneously.
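To make the two models concrete, here is a minimal Python sketch. It is only an illustration: the identifiers, the triples, the `Equivalences` helper and both function names are invented for this example and are not how Freebase or Linking Open Data actually work. The a priori version rewrites identifiers to a canonical form when data is loaded, while the a posteriori version stores everything untouched and consults the equivalence links only when someone asks a question.

```python
from collections import defaultdict


class Equivalences:
    """Union-find over identifiers; each set stands for one real-world entity."""

    def __init__(self):
        self.parent = {}

    def find(self, ident):
        self.parent.setdefault(ident, ident)
        while self.parent[ident] != ident:
            # Path halving: point this node at its grandparent as we walk up.
            self.parent[ident] = self.parent[self.parent[ident]]
            ident = self.parent[ident]
        return ident

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller id as the canonical one.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


def build_equivalences(links):
    eq = Equivalences()
    for a, b in links:
        eq.union(a, b)
    return eq


def load_a_priori(triples, links):
    """Reconcile on the way in: subjects are stored under one canonical id."""
    eq = build_equivalences(links)
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        graph[eq.find(subject)].add((predicate, obj))
    return graph


def query_a_posteriori(triples, links, ident):
    """Reconcile on the way out: raw triples are kept, links are applied per query."""
    eq = build_equivalences(links)
    target = eq.find(ident)
    return {(p, o) for s, p, o in triples if eq.find(s) == target}


# Hypothetical data: two sources describe the same film under different ids.
TRIPLES = [
    ("a:blade_runner", "directed_by", "a:ridley_scott"),
    ("b:film_42", "release_year", "1982"),
]
LINKS = [("a:blade_runner", "b:film_42")]  # plays the role of an equivalence mapping

dense_graph = load_a_priori(TRIPLES, LINKS)
late_answer = query_a_posteriori(TRIPLES, LINKS, "b:film_42")
```

In this toy setting both paths return the same facts about the film. The difference is where the work happens: the a priori loader needs the links before it can accept new data, while the a posteriori query can start with no links at all and simply gets better answers as links appear.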

Filed under  //  Data  

Data Depression: In Search of a Better Data Compression Solution

What if you never had to throw away any data and, at the same time, could spend just a fraction of what it costs to maintain an ever-expanding collection of hard drives? The way to get there might be data compression.

Traditionally, compression has not been at the forefront of the storage discussion for IT folks dealing with massive amounts of next-gen sequencing data. When everyone started to see hard drive meters pushing past full, what came next was a panic about obtaining more storage hardware, quickly followed by a paradigm shift in data management and organization, rather than more thought about how to make better use of the hardware they already had.
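As a rough illustration of what "making better use of existing hardware" can mean, the sketch below compresses a synthetic nucleotide string with Python's standard `zlib` module and checks that nothing is lost on the way back. The random sequence and the `compression_ratio` helper are assumptions made up for this example; real sequencing files carry quality scores and read headers, and purpose-built tools will behave differently from general-purpose DEFLATE.

```python
import random
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return original size divided by compressed size for lossless zlib."""
    return len(data) / len(zlib.compress(data, level))


# Synthetic stand-in for sequence data: 10,000 random bases, one byte each.
random.seed(0)
bases = "".join(random.choice("ACGT") for _ in range(10_000)).encode("ascii")

packed = zlib.compress(bases, 9)
restored = zlib.decompress(packed)  # lossless: every base comes back

ratio = compression_ratio(bases)
print(f"{len(bases)} bytes -> {len(packed)} bytes (ratio {ratio:.2f})")
```

Even on random bases a byte-per-base encoding leaves room to shrink, because a four-letter alphabet needs only about two bits per symbol. Real data with repeated regions gives a general-purpose compressor more to work with.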

Filed under  //  Data  

Hardly anyone takes data analyses seriously

Hardly anyone takes data analyses seriously. Or perhaps more accurately, hardly anyone takes anyone else’s data analyses seriously. Like elaborately plumed birds who have long since lost the ability to procreate but not the desire, we preen and strut and display our t-values.
— “Let’s Take the Con Out of Econometrics” - Edward Leamer

Filed under  //  Data  