Linux Pipes Tips & Tricks

18/10/13

A pipe is a unidirectional interprocess communication channel. The term was coined by Douglas McIlroy for the Unix shell and named by analogy with a physical pipeline.

Pipes are most often used in shell scripts to connect multiple commands by redirecting the output (stdout) of one command to the input (stdin) of the next, using the pipe symbol '|':

cmd1 | cmd2 | .... | cmdN

For example:

$ grep -i "error" ./log | wc -l 
43

The grep command performs a case-insensitive search for the string "error" in the file log, but the result is not displayed on the screen; instead it is redirected to the input (stdin) of the wc command, which in turn counts the number of lines.

Logic

Pipes provide asynchronous execution of commands using buffered I/O routines. Thus, all the commands in the pipeline operate in parallel, each in its own process.

Since kernel version 2.6.11 the size of the pipe buffer is 65536 bytes (64KB); in older kernels it is equal to the memory page size. When a process attempts to read from an empty buffer, it is blocked until data appears. Similarly, a process attempting to write to a full buffer is blocked until the necessary amount of space becomes available.
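As a minimal illustration of the blocking behavior (a sketch, with error handling omitted): the parent calls read() on a still-empty pipe and is blocked for about two seconds, until the child finally writes some data.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[4];

    pipe(fds);

    if (fork() == 0) {            /* child: the writer */
        close(fds[0]);
        sleep(2);                 /* keep the buffer empty for a while */
        write(fds[1], "hi", 3);
        close(fds[1]);
        _exit(0);
    }

    close(fds[1]);
    /* the buffer is empty, so this read() blocks for ~2 seconds
       until the child writes something */
    read(fds[0], buf, sizeof(buf));
    printf("got: %s\n", buf);
    close(fds[0]);
    wait(NULL);
    return 0;
}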

It is important to note that, although pipes are accessed through file descriptors like ordinary I/O streams, all operations are performed in memory, without writing to or reading from the disk.

All the information given below is for bash shell 4.2 and kernel 3.10.10.

Simple debugging

The strace utility allows you to track system calls during the execution of a program:

$ strace -f bash -c '/bin/echo foo | grep bar' 
.... 
getpid() = 13726 <- PID of the main process 
.... 
pipe([3, 4]) <- system call 
.... 
clone(....) = 13727 <- subprocess for the first pipe command (echo) 
.... 
[pid 13727] execve("/bin/echo", ["/bin/echo", "foo"], [/* 61 vars */]  
..... 
[pid 13726] clone(....) = 13728 <- subprocess for the second command (grep) is created also by the main process
.... 
[pid 13728] stat("/home/aikikode/bin/grep",  
....

We can see that the pipe() system call is used to create the pipe, and that the two commands are executed in parallel in separate processes.

Source code, level 1, shell

Since the best documentation is the source code, let's turn to it. Bash uses Yacc to parse the input commands and calls command_connect() when it encounters the '|' symbol.

parse.y:1242 (http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y?id=bash-4.2#n1242):

1242 pipeline: pipeline '|' newline_list pipeline 
1243 { $$ = command_connect ($1, $4, '|'); } 
1244 | pipeline BAR_AND newline_list pipeline 
1245 { 
1246 /* Make cmd1 |& cmd2 equivalent to cmd1 2>&1 | cmd2 */ 
1247 COMMAND *tc; 
1248 REDIRECTEE rd, sd; 
1249 REDIRECT *r; 
1250 
1251 tc = $1->type == cm_simple ? (COMMAND *)$1->value.Simple : $1; 
1252 sd.dest = 2; 
1253 rd.dest = 1; 
1254 r = make_redirection (sd, r_duplicating_output, rd, 0); 
1255 if (tc->redirects) 
1256 { 
1257 register REDIRECT *t; 
1258 for (t = tc->redirects; t->next; t = t->next) 
1259 ; 
1260 t->next = r; 
1261 } 
1262 else 
1263 tc->redirects = r; 
1264 
1265 $$ = command_connect ($1, $4, '|'); 
1266 } 
1267 | command 
1268 { $$ = $1; } 
1269 ;

Here we also see the handling of the '|&' character pair, which is equivalent to redirecting both stdout and stderr into the pipe. Next, let's turn to command_connect().

make_cmd.c:194 (http://git.savannah.gnu.org/cgit/bash.git/tree/make_cmd.c?id=bash-4.2#n194):

194 COMMAND * 
195 command_connect (com1, com2, connector) 
196 COMMAND *com1, *com2; 
197 int connector; 
198 { 
199 CONNECTION *temp; 
200 
201 temp = (CONNECTION *)xmalloc (sizeof (CONNECTION)); 
202 temp->connector = connector; 
203 temp->first = com1; 
204 temp->second = com2; 
205 return (make_command (cm_connection, (SIMPLE_COM *)temp)); 
206 }

where connector is the '|' character as an int. When a series of commands connected via '&', '|', ';', etc. is executed, execute_connection() is called:

execute_cmd.c:2255 (http://git.savannah.gnu.org/cgit/bash.git/tree/execute_cmd.c?id=bash-4.2#n2255):

2325 case '|': ... 
2331 exec_result = execute_pipeline (command, asynchronous, pipe_in, pipe_out, fds_to_close);

pipe_in and pipe_out are file descriptors describing the input and output streams of the pipeline. They can take the NO_PIPE value, which means that the I/O is stdin/stdout.

The execute_pipeline() function is rather extensive; its implementation is contained in execute_cmd.c:2094. We will look only at the parts most interesting to us.

execute_cmd.c: http://git.savannah.gnu.org/cgit/bash.git/tree/execute_cmd.c?id=bash-4.2#n2112

2112 prev = pipe_in; 
2113 cmd = command; 
2114 
2115 while (cmd && cmd->type == cm_connection && 
2116 cmd->value.Connection && cmd->value.Connection->connector == '|') 
2117 { 
2118 /* Creation of a pipe between two commands */ 
2119 if (pipe (fildes) <0) 
2120 { /* returning error */ } 
....... 
/* We execute the first command of the pipeline, using prev (the output of the previous command) as its input and fildes[1] (the write descriptor obtained from the call to pipe()) as its output */ 
2178 execute_command_internal (cmd->value.Connection->first, asynchronous, 
2179 prev, fildes[1], fd_bitmap); 
2180 
2181 if (prev >= 0) 
2182 close (prev); 
2183 
2184 prev = fildes[0]; /* The read end of the pipe becomes the input for the next command */ 
2185 close (fildes[1]); 
....... 
2190 cmd = cmd->value.Connection->second; /* "Move" to the next pipeline command */ 
2191 }

Thus, bash handles the pipe symbol by calling pipe() for each '|' and executes each command in a separate process, using the corresponding file descriptors as its input and output streams.
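To make this concrete, here is a minimal sketch in C (not bash's actual code, just an illustration of the same pipe()/fork()/dup2()/execve() pattern we saw in the strace output) of how a shell could wire up '/bin/echo foo | grep bar':

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fildes[2];

    if (pipe(fildes) < 0) {               /* the same pipe() call we saw in strace */
        perror("pipe");
        return 1;
    }

    if (fork() == 0) {                    /* first child: /bin/echo foo */
        dup2(fildes[1], STDOUT_FILENO);   /* stdout -> write end of the pipe */
        close(fildes[0]);
        close(fildes[1]);
        execl("/bin/echo", "echo", "foo", (char *)NULL);
        _exit(127);
    }

    if (fork() == 0) {                    /* second child: grep bar */
        dup2(fildes[0], STDIN_FILENO);    /* stdin <- read end of the pipe */
        close(fildes[0]);
        close(fildes[1]);
        execlp("grep", "grep", "bar", (char *)NULL);
        _exit(127);
    }

    /* the parent closes both ends, otherwise grep would never see EOF */
    close(fildes[0]);
    close(fildes[1]);
    while (wait(NULL) > 0)                /* wait for both children */
        ;
    return 0;
}

The key points are the same as in execute_pipeline(): each command runs in its own child process, the write end of the pipe replaces stdout of the first command, the read end replaces stdin of the second, and every unused descriptor is closed.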

Source code, level 2, kernel

Let’s turn to the code of the kernel and see how the pipe() function is implemented. In this article, we are looking at the stable kernel version 3.10.10.

fs/pipe.c (https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/pipe.c?id=refs/tags/v3.10.10) (the skipped sections of code are unimportant for this article):

/* 
The maximum size of the pipe buffer for an unprivileged user. Can be set by root via the file /proc/sys/fs/pipe-max-size 
*/ 
35 unsigned int pipe_max_size = 1048576; 
/* 
The minimum size of the pipe buffer; according to the POSIX recommendation it is the size of one memory page, i.e. 4KB 
*/ 
40 unsigned int pipe_min_size = PAGE_SIZE;
 
869 int create_pipe_files(struct file **res, int flags) 
870 { 
871 int err; 
872 struct inode *inode = get_pipe_inode(); 
873 struct file *f; 
874 struct path path; 
875 static struct qstr name = { .name = "" }; 
/* Allocate a dentry in the dcache */ 
881 path.dentry = d_alloc_pseudo(pipe_mnt->mnt_sb, &name); 
/* Allocate and initialize a file structure. Note FMODE_WRITE as well as the O_WRONLY flag, i.e. this is a structure for writing, used as the output stream of the pipe. We will return to the O_NONBLOCK flag later */ 
889 f = alloc_file(&path, FMODE_WRITE, &pipefifo_fops); 
893 f->f_flags = O_WRONLY | (flags & (O_NONBLOCK | O_DIRECT)); 
/* In the same way we allocate and initialize a file structure for reading (note FMODE_READ and the O_RDONLY flag) */ 
896 res[0] = alloc_file(&path, FMODE_READ, &pipefifo_fops); 
902 res[0]->f_flags = O_RDONLY | (flags & O_NONBLOCK); 
903 res[1] = f; 
904 return 0; 
917 } 
918 
919 static int __do_pipe_flags(int *fd, struct file **files, int flags) 
920 { 
921 int error; 
922 int fdw, fdr; 
/* Create a file structure for pipeline file descriptors (see above) */ 
927 error = create_pipe_files(files, flags); 
/* Choose spare file descriptors */ 
931 fdr = get_unused_fd_flags(flags); 
936 fdw = get_unused_fd_flags(flags); 
941 audit_fd_pair(fdr, fdw); 
942 fd[0] = fdr; 
943 fd[1] = fdw; 
944 return 0; 
952 } 
/* The implementation of the function int pipe2(int pipefd[2], int flags)... */ 
969 SYSCALL_DEFINE2(pipe2, int __user *, fildes, int, flags) 
970 { 
971 struct file *files[2]; 
972 int fd[2]; 
/* Create the I/O structures and find spare descriptors */ 
975 __do_pipe_flags(fd, files, flags); 
/* Copy the descriptors from kernel space to user space */ 
977 copy_to_user(fildes, fd, sizeof(fd)); 
/* Associate the file descriptors with the file structures */ 
984 fd_install(fd[0], files[0]);
985 fd_install(fd[1], files[1]); 
989 } 
/* ...and int pipe(int pipefd[2]), which is essentially a wrapper for the call pipe2 with default flags; */ 
991 SYSCALL_DEFINE1(pipe, int __user *, fildes) 
992 { 
993 return sys_pipe2(fildes, 0); 
994 }

As you may have noticed, the code checks the O_NONBLOCK flag. It can be set using the F_SETFL operation of fcntl and switches the pipe's I/O streams into non-blocking mode. In this mode, instead of blocking, a read from an empty pipe or a write to a full pipe fails with the errno code EAGAIN.
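A minimal sketch of this behavior: the read end of a fresh pipe is switched to non-blocking mode with fcntl(F_SETFL), so reading from the still-empty pipe returns immediately with EAGAIN instead of blocking.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[16];

    pipe(fds);

    /* switch the read end to non-blocking mode via F_SETFL */
    int flags = fcntl(fds[0], F_GETFL);
    fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

    /* the pipe is empty, so instead of blocking the read fails with EAGAIN */
    if (read(fds[0], buf, sizeof(buf)) < 0 && errno == EAGAIN)
        printf("read would block: EAGAIN\n");

    close(fds[0]);
    close(fds[1]);
    return 0;
}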

The maximum amount of data that can be written to a pipe atomically is a single memory page (4KB) on the ARM architecture:

arch/arm/include/asm/limits.h (https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/arch/arm/include/asm/limits.h?id=refs/tags/v3.10.10):

8 #define PIPE_BUF PAGE_SIZE
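PIPE_BUF is also exposed to user space via <limits.h>, so you can check the atomic-write limit on your own platform with a few lines of C (on x86 Linux it is 4096 as well):

#include <limits.h>   /* PIPE_BUF */
#include <stdio.h>

int main(void)
{
    /* PIPE_BUF is the largest write to a pipe that is guaranteed to be atomic */
    printf("PIPE_BUF = %d\n", PIPE_BUF);
    return 0;
}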

For kernels >= 2.6.35, you can change the size of the pipe buffer:

fcntl(fd, F_SETPIPE_SZ, size)

The maximum available size of the buffer, as we have seen above, is listed in the file /proc/sys/fs/pipe-max-size.
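For example, here is a small sketch (assuming a kernel >= 2.6.35) that asks for a 1MB buffer on a freshly created pipe and reads the resulting size back. Requests above /proc/sys/fs/pipe-max-size fail with EPERM for unprivileged users, and the kernel may round the requested size up to a power of two.

#define _GNU_SOURCE            /* F_SETPIPE_SZ / F_GETPIPE_SZ */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    pipe(fds);
    printf("default size: %d\n", fcntl(fds[1], F_GETPIPE_SZ));

    /* ask for a 1MB buffer; fails with EPERM if the request exceeds
       /proc/sys/fs/pipe-max-size for an unprivileged user */
    if (fcntl(fds[1], F_SETPIPE_SZ, 1048576) < 0)
        perror("F_SETPIPE_SZ");

    printf("new size:     %d\n", fcntl(fds[1], F_GETPIPE_SZ));
    close(fds[0]);
    close(fds[1]);
    return 0;
}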

Tips & tricks

In the examples below we will perform operations on the existing “Documents” directory and two non-existent files ./non-existent_file and ./other_non-existent_file.

  • Redirecting stdout and stderr to a pipe
    $ ls -d ./Documents ./non-existent_file ./other_non-existent_file 2>&1 | egrep "Doc|other"
    ls: cannot access ./other_non-existent_file: No such file or directory 
    ./Documents

    or we can use the '|&' combination (you can learn about it either from the shell documentation (man bash) or from the source code above, where we examined bash's Yacc parser):

    $ ls -d ./Documents ./non-existent_file ./other_non-existent_file |& egrep "Doc|other" 
    ls: cannot access ./other_non-existent_file: No such file or directory 
    ./Documents
  • Redirecting only stderr to a pipe
    $ ls -d ./Documents ./non-existent_file ./other_non-existent_file 2>&1 >/dev/null | egrep "Doc|other" 
    ls: cannot access ./other_non-existent_file: No such file or directory

    It is important to get the order of the stdout and stderr redirections right. For example, the combination '>/dev/null 2>&1' would redirect both stdout and stderr to /dev/null.

  • Getting the correct pipe exit code

    By default, the exit code of a pipe is the exit code of the last command in the pipeline. For example, take a command that finishes with a non-zero exit code:

    $ ls -d ./non-existent_file 2>/dev/null; echo $? 
    2

    And redirect it to a pipe:

    $ ls -d ./non-existent_file 2>/dev/null | wc; echo $? 
    0 0 0 
    0

    Now the pipe exit code is the exit code of the wc command, i.e. 0.

    Usually, however, we need to know whether an error occurred in any part of the pipeline. To do this, set the pipefail option, which tells the shell that the pipe exit code will be the exit code of the last command that exits with a non-zero status, or zero if all commands complete successfully:

    $ set -o pipefail 
    $ ls -d ./non-existent_file 2>/dev/null | wc; echo $? 
    0 0 0 
    2

    You should also keep in mind "harmless" commands that may return a non-zero exit code even though nothing actually went wrong. This applies not only to pipes. For example, consider an example with grep:

    $ egrep "^foo=[0-9]+" ./config | awk '{print "new_"$0;}'

    Here we modify all matching lines by prepending 'new_' to the beginning of each one, or modify nothing if no lines in the required format were found. The problem is that grep exits with code 1 if no match was found, so if your script has the pipefail option set, this example finishes with code 1:

    $ set -o pipefail 
    $ egrep "^foo=[0-9]+" ./config | awk '{print "new_"$0;}' >/dev/null; echo $? 
    1

    In large scripts with complex constructions and long pipelines it is easy to miss this point, which can lead to incorrect results.

  • Assigning values to variables in pipes

    To begin with, let us recall that all the commands in a pipe are executed in separate processes created by calling clone(). Typically this is not a problem, except when variable values are changed inside the pipe.

    Consider the following example:

    $ a=aaa 
    $ b=bbb 
    $ echo "one two" | read a b

    We expect that the values of a and b will now be "one" and "two", respectively. In fact, they remain "aaa" and "bbb": any change made to a variable's value inside a pipe does not affect the variable outside of it:

    $ filefound=0 
    $ find . -type f -size +100k | 
    while true 
    do 
    read f 
    echo "$f is over 100KB" 
    filefound=1 
    break # exit after the first found file 
    done 
    $ echo $filefound

    Even if the find command discovers a file larger than 100KB, the filefound variable will still be 0.

    Several solutions to this problem are possible:

    1. Using 'set -- $var':

      This construction will set the positional parameters according to the contents of the var variable. For instance, as in the first example above:

      $ var="one two" 
      $ set -- $var 
      $ a=$1 # "one" 
      $ b=$2 # "two"

      We must bear in mind that the original positional parameters of the script will be lost.

    2. Move all the logic that handles the variable's value into the same subprocess in the pipe:
      $ echo "one" | (read a; echo $a;) 
      one
    3. Change the logic to avoid assigning variables inside the pipeline. For example, let's change our find example:
      $ filefound=0 
      $ for f in $(find . -type f -size +100k) # we replaced the pipeline with a loop 
      do 
      echo "$f is over 100KB" 
      filefound=1 
      break 
      done 
      $ echo $filefound
    4. Using the lastpipe option (bash 4.2 and later only)

      The lastpipe option instructs the shell to run the last command of a pipeline in the main shell process instead of a subshell.

      $ (shopt -s lastpipe; a="aaa"; echo "one" | read a; echo $a) 
      one

      Importantly, in an interactive shell simply setting the lastpipe option in the same process where the pipe is invoked is not enough, which is why the parentheses in the example above are required. In scripts they are not necessary.
