# HG changeset patch
# User František Kučera
# Date 1545403680 -3600
# Node ID 9ea7e5c6510755cb6277f0e1856441ff7f5ef884
# Parent 6402cd6921c5989106fefc8e8f1bd21e1be867cc
faq: duplicate records and the relational model

diff -r 6402cd6921c5 -r 9ea7e5c65107 relpipe-data/faq.xml
--- a/relpipe-data/faq.xml Fri Dec 21 12:19:17 2018 +0100
+++ b/relpipe-data/faq.xml Fri Dec 21 15:48:00 2018 +0100
@@ -55,6 +55,26 @@
 record row tuple
+

+ What about duplicate records? +
+ In the relational model, records must be unique.
+ However, there is no central authority that would prevent you from appending duplicate records to the relational stream.
+ This means that at some points in the relational pipeline, data may occur that do not fit the rules of the relational model.
+ Deduplication is generally not done on the output side of particular steps, but is postponed and done on the input side of steps where uniqueness is important (e.g. JOIN or UNION).
+ You should not put duplicate records in the relational stream, but you can.
+ Duplicates can also occur after some transformations like relpipe-tr-cut (e.g. if you choose only the dump or type attributes from your fstab and omit the primary/unique key field).
+ Such data are not considered invalid, but should be processed as if there were no duplicates (if uniqueness is important for the particular step)
+ or should be passed through if doing so does not conflict with the goal of the given step (e.g. calling an uppercase() function on some field or doing UNION ALL).
+ Each tool must document how it handles duplicate records.
+

+ +

+ There are two reasons for this transient tolerance of duplicate records.
+ 1) Performance: guaranteeing uniqueness at every moment would negate streaming and would require holding the whole relation in memory and always sorting the records.
+ 2) Modularity: many tasks would have to be done by a single bulky tool that does everything, e.g. if you want to cut only the type field from your fstab and then compute statistics on how many times particular filesystems are used.
+

+
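The behaviour described in this FAQ entry can be illustrated with a minimal Python sketch. It is not part of the patch and does not use any relpipe tool or API: the functions `cut` and `distinct`, the `Record` type, and the sample fstab rows are all hypothetical names and data invented for the illustration. The sketch assumes records are plain tuples; it shows a streaming projection that may emit duplicates, a downstream step that deduplicates on its input side, and a counting step that relies on the duplicates being passed through.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, ...]  # hypothetical record representation


def cut(records: Iterable[Record], keep: Tuple[int, ...]) -> Iterator[Record]:
    """Streaming projection: emits each record as soon as it is read.

    It does not deduplicate its output, so dropping the key attribute
    can produce duplicate records.
    """
    for record in records:
        yield tuple(record[i] for i in keep)


def distinct(records: Iterable[Record]) -> Iterator[Record]:
    """Input-side deduplication for a step where uniqueness matters.

    Unlike cut(), this has to remember every record seen so far.
    """
    seen = set()
    for record in records:
        if record not in seen:
            seen.add(record)
            yield record


# Invented sample data: (device, mount point, type)
fstab = [
    ("/dev/sda1", "/", "ext4"),
    ("/dev/sda2", "/home", "ext4"),
    ("/dev/sda3", "/boot", "vfat"),
]

# Cutting only the type attribute yields duplicates.
types = list(cut(fstab, keep=(2,)))

# A step that needs uniqueness deduplicates what it reads.
unique_types = list(distinct(types))

# A statistics step needs the duplicates to count filesystem usage.
type_counts = Counter(types)
```

In this sketch, `cut` needs only constant memory per record, while `distinct` must keep the set of seen records, which mirrors the performance argument above. The `Counter` step would give useless results if `cut` had already removed the duplicates, which mirrors the modularity argument.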