Data Reconciliation Strategies: A Priori vs. A Posteriori Reconciliation
The biggest difference between data integration semweb-style and data integration datawarehouse-style is when reconciliation happens: the semantic web model assumes that reconciliation must happen a posteriori, when the data is consumed, while data warehousing assumes that reconciliation must happen a priori, when the data enters the system.
The semantic web architects correctly identified a priori reconciliation as the biggest scalability impediment to a world-scale data integration effort, and decided not to worry about it. So much so that it took years for the concept of ‘identifier equivalence’ to even surface in the semantic web architecture, and when it did, it was included in OWL, which feels horribly overdesigned to be used simply for that purpose.
For years, I was a fairly vocal advocate for a posteriori reconciliation via equivalence mappings, which I considered more elegant and more scalable than a priori reconciliation efforts. That started to change very rapidly once I started working for Metaweb and saw first hand how much more effective a priori reconciliation can be, even if it is drastically more expensive and more limiting on the data acquisition front.
The difference between efforts like Freebase and efforts like Linking Open Data hinges on their model for reconciliation.
Freebase spends a considerable amount of resources performing a priori reconciliation of every bulk load of data, trying to keep the graph as compact and dense as possible, even at the cost of limiting the rate at which new data can be acquired. Linking Open Data, on the other hand, follows the a posteriori reconciliation model: it assumes that identifier reconciliation is a low-energy point and that the world-wide web of data will, once big enough, tend to naturally reconcile identifiers and schemas toward an increased graph density.
Both are huge bets. Nothing indicates that a priori reconciliation costs are independent of the quantity of data already contained in the graph; if they grow with it, the graph's ability to grow would eventually saturate. Likewise, nothing indicates that a denser graph is naturally a lower-energy point for unreconciled agglomerations of datasets, or that an increase in relational density would happen naturally and spontaneously.
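To make the two models a bit more concrete, here is a minimal Python sketch of each. Everything in it is an assumption made for illustration: the class names `APrioriStore` and `APosterioriStore`, the toy `normalize` matching key, and the `src_a:`/`src_b:` identifiers are all hypothetical, and neither class reflects how Freebase or Linking Open Data actually work. The first store reconciles identifiers when data is loaded; the second loads everything untouched and only merges at query time, following whatever equivalence links have been asserted so far.

```python
from __future__ import annotations

from collections import defaultdict


def normalize(label: str) -> str:
    """Toy matching key (hypothetical); real reconciliation is far harder."""
    return " ".join(label.lower().split())


class APrioriStore:
    """Reconcile when data enters: one node per matched entity."""

    def __init__(self) -> None:
        self.by_key: dict[str, str] = {}  # matching key -> canonical id
        self.facts: dict[str, set] = defaultdict(set)

    def load(self, source_id: str, label: str, facts: set) -> str:
        # Every load has to consult what is already in the graph.
        canonical = self.by_key.setdefault(normalize(label), source_id)
        self.facts[canonical] |= facts
        return canonical


class APosterioriStore:
    """Load data as-is; reconcile when it is consumed."""

    def __init__(self) -> None:
        self.facts: dict[str, set] = defaultdict(set)
        self.parent: dict[str, str] = {}  # union-find over identifiers

    def load(self, source_id: str, facts: set) -> None:
        # No matching at load time: identifiers are kept exactly as given.
        self.facts[source_id] |= facts
        self.parent.setdefault(source_id, source_id)

    def _find(self, node: str) -> str:
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def same_as(self, a: str, b: str) -> None:
        """Record an equivalence mapping between two identifiers."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def query(self, identifier: str) -> set:
        """Merge the facts of every identifier known to be equivalent."""
        root = self._find(identifier)
        merged: set = set()
        for node in list(self.parent):
            if self._find(node) == root:
                merged |= self.facts[node]
        return merged
```

In this sketch the a priori store pays for matching on every `load`, while the a posteriori store pays nothing up front and returns a merged view only for identifiers that someone has linked with `same_as`. Until such a link exists, `query` sees each source in isolation, which is the unreconciled state the second bet assumes will resolve itself over time.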
