# HG changeset patch
# User František Kučera
+ Assume that we have a text file containing a list of animals and their properties:
+
+ We can pass this file through a pipeline:
+
+ The individual steps of the pipeline are separated by the | pipe symbol.
+ In the first step, we just read the file and print it on STDOUT.
+ In the second step, grep dog keeps only the lines containing the word dog.
+ In the third step, we select the second field (fields are separated by spaces) and get the colors of our dogs:
+
+ In the fourth step, we translate the values to uppercase and get:
+
+ So we have a list of the colors of our dogs, printed in uppercase.
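The four steps described above can be tried with a short script. This is only a sketch: the listing of the animals file is not reproduced here, so the records below are made-up sample data (assumed layout: size, color, species, separated by single spaces), and a temporary file stands in for animals.txt.

```shell
# Hypothetical sample data; the real animals file may look different.
animals_file="$(mktemp)"
cat > "$animals_file" <<'EOF'
large white cat
medium black dog
big yellow dog
small yellow cat
small white dog
medium green turtle
EOF

# Step 1: read the file. Step 2: keep the lines containing "dog".
# Step 3: select the second space-separated field. Step 4: uppercase it.
dog_colors="$(cat "$animals_file" | grep dog | cut -d " " -f 2 | tr a-z A-Z)"

# With the sample data above this prints BLACK, YELLOW and WHITE, one per line.
printf '%s\n' "$dog_colors"

rm -f "$animals_file"
```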
+ In case we have several dogs of the same color, we could avoid duplicates simply by adding | sort -u to the pipeline (after the cut part).
+ The authors of the cat, grep, cut or tr programs don't have to know anything about cats.
+ And we don't have to know anything about low-level programming in the C language, or compile anything.
+ We simply build a pipeline in a shell (e.g. GNU Bash) from existing programs and focus on our business logic.
+ And we do it well without being distracted by any low-level issues.
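Deduplication with sort -u, placed after the cut step, might look like the following sketch. The three dogs and the cat are invented sample records (assumed layout: size, color, species), chosen so that two dogs share a color.

```shell
# Hypothetical sample data: two dogs share the color yellow.
animals='medium black dog
big yellow dog
small yellow dog
large white cat'

# sort -u sits between cut and tr, so each color is reported only once.
unique_colors="$(printf '%s\n' "$animals" | grep dog | cut -d " " -f 2 | sort -u | tr a-z A-Z)"

# With the sample data above this prints BLACK and YELLOW, one per line.
printf '%s\n' "$unique_colors"
```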
+
+ This simple example looks quite flawless.
+ But it is actually very brittle.
+
+ What if we have a very big cat that can be described by this line in our file?
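+ In the second step of the pipeline (grep) we will include this record, and RED will appear in the final result.
+ This is a really unexpected and unwanted result. We don't have a RED dog and this is just an accident. The same would happen if we had a monkey of a doggish color.
+ This problem is caused by the fact that grep dog filters lines containing the word dog regardless of its position (first, second or third field).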
+ What if we have a turtle that has a lighter color than the other turtles?
+
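+ If we do grep turtle it will work well in this case, but our pipeline will fail in the third step, where cut will select only light (instead of light green).
+ The final result is definitely wrong, because the second turtle is not LIGHT, it is LIGHT GREEN.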
+ This problem is caused by the fact that we don't have well-defined separators between fields.
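The turtle problem can be reproduced with two records. This is a sketch with assumed data: the second line is the one quoted in the text, the first is an invented ordinary turtle for comparison.

```shell
# The first record is hypothetical; the second is the line quoted in the text.
animals='medium green turtle
small light green turtle'

# cut splits on every single space, so "light green" is cut in half.
turtle_colors="$(printf '%s\n' "$animals" | grep turtle | cut -d " " -f 2 | tr a-z A-Z)"

# With the sample data above this prints GREEN and LIGHT, not LIGHT GREEN.
printf '%s\n' "$turtle_colors"
```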
+ Sometimes we could avoid such problems by imposing restrictions or assumptions, e.g. the color must not contain a space character (we could replace spaces with hyphens).
+ Or we could use some other field delimiter, e.g. ; or | or ,. But we still would not be able to use that character in the field values.
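A sketch of the delimiter workaround and its limit, assuming the records are rewritten with ; between fields (the records themselves are invented for illustration):

```shell
# Hypothetical records using ; as the field delimiter.
animals='medium;green;turtle
small;light green;turtle'

# A space inside a value is no longer a problem.
colors="$(printf '%s\n' "$animals" | grep turtle | cut -d ";" -f 2 | tr a-z A-Z)"

# With the sample data above this prints GREEN and LIGHT GREEN.
printf '%s\n' "$colors"

# But a value that itself contains ; is split again: only "green" survives.
truncated="$(printf '%s\n' 'small;green; partly yellow;turtle' | cut -d ";" -f 2)"
printf '%s\n' "$truncated"
```

The second command shows that changing the delimiter only moves the problem to a different character.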
+ So we must invent some kind of escaping (e.g. \; is not a separator but a part of the field value).
+ There are also other problems like character encoding, missing metadata (e.g. field names and types), joining multiple files (is there always a newline character at the end of the file? Or is there a BOM at the beginning of the file?)
+ or passing several types of data in a single stream (we have a list of animals and we could also have e.g. a list of foods or a list of our staff, where each list has different fields).
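The trailing-newline question can be illustrated with two small files. This is a sketch with invented file names and records; the first file deliberately lacks the final newline.

```shell
# Hypothetical files in a temporary directory.
tmpdir="$(mktemp -d)"
printf 'medium black dog' > "$tmpdir/a.txt"     # no newline at the end
printf 'small white cat\n' > "$tmpdir/b.txt"

# Joining the files glues the two records into a single line.
joined="$(cat "$tmpdir/a.txt" "$tmpdir/b.txt")"
line_count="$(cat "$tmpdir/a.txt" "$tmpdir/b.txt" | wc -l | tr -d ' ')"

# Prints "medium black dogsmall white cat" and a line count of 1.
printf '%s\n' "$joined"
printf '%s\n' "$line_count"

rm -rf "$tmpdir"
```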
+ | sort -u in the pipeline (after the cut part).
+ The great parts
+
+ cat, grep, cut or tr programs don't have to know anything about cats
+ The pitfalls
+
+ dog-sized red cat
+
+ grep) we will include this record and the final result will be:
+ grep dog filters lines containing the word dog regardless of its position (first, second or third field).
+ Sometimes we could avoid such problems with a slightly more complicated regular expression and/or by using Perl, but our pipeline wouldn't be as simple and legible as before.
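One possible "more complicated regular expression" is to anchor the match to the end of the line. This is a sketch that assumes the species is always the last field; the dogs are invented records, the cat is the line quoted in the text.

```shell
# Two hypothetical dogs plus the dog-sized cat from the text.
animals='medium black dog
dog-sized red cat
small white dog'

# Naive filter: matches "dog" anywhere on the line, so the cat slips through.
naive="$(printf '%s\n' "$animals" | grep dog | cut -d " " -f 2 | tr a-z A-Z)"

# Anchored filter: "dog" must be the last field (assumption about the layout).
anchored="$(printf '%s\n' "$animals" | grep -E ' dog$' | cut -d " " -f 2 | tr a-z A-Z)"

# naive prints BLACK, RED, WHITE; anchored prints BLACK, WHITE.
printf '%s\n' "$naive"
printf '%s\n' "$anchored"
```

The anchored version is already harder to read, and it does nothing for the multi-word color problem.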
+ small light green turtle
+
+ grep turtle it will work well in this case, but our pipeline will fail in the third step where the cut will select only light (instead of light green).
+ And the final result will be:
+ \; is not a separator but a part of the field value)
+ or add some quotes/apostrophes (which still requires escaping, because what if we have e.g. a name field containing an apostrophe?).
+ And parsing such inputs with classic tools and regular expressions is not easy, and sometimes not even possible.
+
+ A classic pipeline example (
Bytes, text, structured data? XML, YAML, JSON, ASN.1
+ +Rules:
+ +@@ -101,12 +118,12 @@
cat animals.txt | grep dog | cut -d " " -f 2 | tr a-z A-Z