faq: duplicate records and the relational model v_0
author František Kučera <franta-hg@frantovo.cz>
Fri, 21 Dec 2018 15:48:00 +0100
branch v_0
changeset 224 9ea7e5c65107
parent 223 6402cd6921c5
child 225 1c88e881ce79
faq: duplicate records and the relational model
relpipe-data/faq.xml
--- a/relpipe-data/faq.xml	Fri Dec 21 12:19:17 2018 +0100
+++ b/relpipe-data/faq.xml	Fri Dec 21 15:48:00 2018 +0100
@@ -55,6 +55,26 @@
 			record	row	tuple
 		</m:tabulka>
 		
+		<p>
+			<strong>What about duplicate records?</strong>
+			<br/>
+			In the relational model, records must be unique.
+			In <m:name/> there is no central authority that would prevent you from appending duplicate records to the relational stream.
+			This means that at some points in a relational pipeline, the data may not conform to the rules of the relational model.
+			Deduplication is generally not done on the output side of a step; it is postponed and done on the input side of those steps where uniqueness matters (e.g. JOIN or UNION).
+			You should not put duplicate records in the relational stream, but you can.
+			Duplicates can also arise from some transformations such as <code>relpipe-tr-cut</code> (e.g. if you select only the <code>dump</code> or <code>type</code> attributes from your <code>fstab</code> and omit the primary/unique key field).
+			Such data are not considered invalid, but they should be processed as if there were no duplicates (if uniqueness is important for the particular step)
+			or passed through unchanged if that does not conflict with the goal of the given step (e.g. calling an <code>uppercase()</code> function on some field or doing UNION ALL).
+			Each tool must document how it handles duplicate records.
+		</p>
+		
+		<p>
+			There are two reasons for this <em>transient tolerance of duplicate records</em>.
+			1) Performance: guaranteeing uniqueness at every moment would defeat streaming and would require holding the whole relation in memory and always sorting the records.
+			2) Modularity: many tasks would have to be done by a single bulky tool that does everything, e.g. if you want to cut only the <code>type</code> field from your <code>fstab</code> and then count how many times each filesystem type is used.
+		</p>
+		
 		<!--
 		<p>
 			<strong>?</strong>