muSED

MicroSED is a script language interpreter designed to allow programs to perform transformations on strings held in memory. The "SED" part of the name signals that the language being interpreted is related to that of the unix stream editor (SED).

The SED language home page can be found here:

http://sed.sourceforge.net/

The MicroSED implementation (partial, see below) can be found in

Numerous manual pages on the internet document the SED command language. No two of them are exactly the same, though the core functionality is stable across the several implementations that exist. cxxtls::muSED is yet another. To determine how to use it, combine the advice found on this site with advice found on other web pages, particularly the SED homepage on sourceforge or the manual page on any convenient Linux box. Most of all, try using the editor's ^X^T command to transform marked blocks using a script, or use the mus program to experiment with the language. Better yet, look at the comments for the *Statement functions that implement the features of the language in file cxxtls/muSED.h.

Differences with SED

muSED's grammar is mostly a proper subset of SED's -- with a few caveats:

It always operates in the "-n" mode of the SED program, meaning that no "output" is produced unless a "p" command is provided in the script.

Several SED commands are simply not supported in muSED: ":", a, b, c, D, l, n, N, r, t, w.

The 'y' command in muSED works more like the Unix "tr" program than like the SED y command. See the Unix manual page for tr for more details. The principal difference with SED is that muSED supports character ranges, whereas standard SED only supports a one-to-one mapping of character pairs.

For example, to use the 'y' command in SED to uppercase the first 8 letters of the alphabet, you must do this: "y/abcdefgh/ABCDEFGH/". In muSED, however, you can do this: "y/a-h/A-H/". To include a literal "-" character, make it the first character in the translation sequence. Of course, in both environments, you can also use the substitution command, 's', to translate character ranges, but the y command is faster in both environments.

It adds the 'W' command. 'W' stands for "while". Since muSED doesn't support any of the commands that define labels or branch to them, this new looping construct had to be added. See below.
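
The tr-style ranges accepted by the 'y' command can be illustrated with a small standalone sketch. This is not the muSED implementation: the names expandSet and translate are hypothetical, and the rules for a literal "-" and for sets of unequal length are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Expand a tr-style set such as "a-h" into "abcdefgh".
// Assumption: a '-' that is first or last in the set is taken literally.
std::string expandSet(const std::string& set)
{
    std::string out;
    for (std::size_t i = 0; i < set.size(); ++i)
    {
        const bool isRange = i > 0 && set[i] == '-' && i + 1 < set.size();
        if (isRange)
        {
            // The range start, set[i - 1], has already been appended.
            const int first = static_cast<unsigned char>(set[i - 1]) + 1;
            const int last  = static_cast<unsigned char>(set[i + 1]);
            for (int c = first; c <= last; ++c)
                out.push_back(static_cast<char>(c));
            ++i;  // skip the range end, which the loop above appended
        }
        else
        {
            out.push_back(set[i]);
        }
    }
    return out;
}

// Map every character found in fromSet to the character at the same
// position in toSet; all other characters are left alone.
std::string translate(const std::string& text,
                      const std::string& fromSet,
                      const std::string& toSet)
{
    const std::string from = expandSet(fromSet);
    const std::string to   = expandSet(toSet);
    assert(from.size() == to.size());  // this sketch requires equal-length sets

    std::string result = text;
    for (char& ch : result)
    {
        const std::size_t pos = from.find(ch);
        if (pos != std::string::npos)
            ch = to[pos];
    }
    return result;
}
```

With these definitions, translate("bad cabbage zip", "a-h", "A-H") plays the role of "y/a-h/A-H/" and leaves the letters outside the range untouched.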

Implementation details

Important muSED implementation details can be found in cxxtls/muSED.h, which defines namespace cxxtls::muSED as well as the data structures and C++ templates that comprise its implementation. Some code is also found in lib/muSED.cxx.

Its implementation is heavily templatized so that it can be used on any sequence of std::strings that you have stored in memory. Minor tweaks could enable the processing of lines from files just like regular SED using the stream iterators.

Test programs can be found in tests/muSED_tests/*.

The source code for the mus program can be found in bin/mus.cxx. "mus" provides a sed-like wrapper that lets you process files using muSED scripts. It is not meant as a sed competitor because it reads the entire input stream into memory before dumping output.

Basic Operation

Using muSED in a program typically involves the following steps:

Decide on the sequence of script commands to be performed.

Create a sequence of std::string in memory containing the script.

Either invoke one of the quick wrapper functions to automate the compile / execute logic on a one-time basis (see also: cxxtls::muSED::apply or cxxtls::muSED::oneLiner),

or create a compiled form of the script and apply it repeatedly to different input sequences (see also: the examples below).

A script is a sequence of commands as documented in the following paragraphs:

Scripts are typically small, so the time spent compiling is correspondingly small, but this may not always be true.

Commands

The following paragraphs discuss the muSED commands, which, like those in SED, operate on the "pattern buffer", and on the "hold buffer". These are the only two variables allowed in a SED program.

A muSED script, which is a sequence of muSED commands, is assumed to operate on one line of the input stream at a time so as to produce an output stream of strings.

The basic execution sequence of a SED script is as follows:

     for each line of input
     do
        for each script command
        do
           apply the current command to the current input line
              so as to produce zero or more output lines
        done
     done

The two SED variables are used like this:

The pattern buffer is populated with the current input line at the beginning of each script execution sequence.

The hold buffer is essentially optional: unless the script manipulates the hold buffer, it will be ignored.

No output will be produced unless the script executes the 'p' (print) command.

"Output" in this context means "data produced into the output container".

Note that the newline character, \n, is interpreted as a line break by the "p" command and each \n will result in a different string being produced into the output container.
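
The execution sequence and the role of the two buffers can be sketched in standalone C++. Everything here is a hypothetical stand-in (Buffers, printPattern, runScript, copyDemo) rather than the muSED API; the templates only mirror the idea that any sequence of std::string can serve as input or output.

```cpp
#include <deque>
#include <functional>
#include <list>
#include <string>
#include <vector>

// Hypothetical interpreter state; not the muSED types.
struct Buffers
{
    std::string pattern;  // reloaded from each input line
    std::string hold;     // persists from one input line to the next
};

// 'p': split the pattern buffer on '\n' and append each piece to the output.
template <class Output>
void printPattern(const Buffers& b, Output& out)
{
    std::string::size_type start = 0;
    for (;;)
    {
        const std::string::size_type nl = b.pattern.find('\n', start);
        if (nl == std::string::npos)
        {
            out.push_back(b.pattern.substr(start));
            return;
        }
        out.push_back(b.pattern.substr(start, nl - start));
        start = nl + 1;
    }
}

// Apply every command, in order, to every input line, in order.
template <class Script, class Input, class Output>
void runScript(const Script& script, const Input& input, Output& output)
{
    Buffers b;
    for (const std::string& line : input)
    {
        b.pattern = line;  // start of one script execution sequence
        for (const auto& command : script)
            command(b, output);
    }
}

// Usage: a one-command script ("p") copies a deque into a list.
std::list<std::string> copyDemo(const std::deque<std::string>& input)
{
    typedef std::list<std::string> Out;
    std::vector<std::function<void(Buffers&, Out&)>> script;
    script.push_back([](Buffers& b, Out& out) { printPattern(b, out); });

    Out output;
    runScript(script, input, output);
    return output;
}
```

An empty script passed to runScript leaves the output container empty, which corresponds to the always-on "-n" behavior described earlier.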

Condition Prefixes

All statements of interest in SED and muSED can be preceded by a condition prefix to allow the selective execution of statements. That is, not all statements in a muSED script apply to all input lines. There are several conditions that can be ascribed to each command to limit its applicability:

none -- that is, without any condition prefix, the statement applies to all input lines.

line -- with a single condition prefix, the statement applies only to lines which match the condition.

range -- when a statement has 2 condition prefixes, the two define a range of lines during which the statement is active.

See paragraph, Activation Conditions, below for an explanation of conditionality.

Note: this is how "and" clauses are created in SED -- nested condition prefixes only allow input lines that match all the clauses to be transformed by the command.
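
The three kinds of condition prefix can be modeled as a small state machine. The class below is a hypothetical sketch, not a muSED type. It handles only regular-expression conditions (line numbers, "$" and "!" are left out), and it assumes the usual SED convention that the end of a range is searched for starting on the line after the one that opened it.

```cpp
#include <regex>
#include <string>

// Hypothetical model of a condition prefix; muSED's real types may differ.
class Activation
{
    enum Kind { None, Line, Range };

public:
    // none: the statement applies to every input line
    Activation() : kind_(None) {}

    // line: /re/ -- applies only to lines matching the expression
    explicit Activation(const std::string& only) : kind_(Line), first_(only) {}

    // range: /re1/,/re2/ -- applies from a line matching re1 through the
    // next line matching re2, both included
    Activation(const std::string& from, const std::string& to)
        : kind_(Range), first_(from), last_(to) {}

    // Call once per input line, in input order.
    bool appliesTo(const std::string& line)
    {
        if (kind_ == None)
            return true;
        if (kind_ == Line)
            return std::regex_search(line, first_);

        if (!inRange_)
        {
            if (!std::regex_search(line, first_))
                return false;
            inRange_ = true;   // the range opens on this line
            return true;
        }
        if (std::regex_search(line, last_))
            inRange_ = false;  // the closing line is still in the range
        return true;
    }

private:
    Kind       kind_;
    std::regex first_;
    std::regex last_;
    bool       inRange_ = false;
};
```

An Activation("start", "end") object fed the lines "ignored1", "start", "2", "end", "ignored2" reports the statement as active for the middle three only.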

"p" is for Print

Usage:

The 'p' command lets the user print the contents of the pattern buffer. The pattern buffer is a variable that is initialized with the current input line. The script can transform it, delete it, or print it to the output stream as needed.

The print command assumes that the pattern buffer is a collection of lines, delimited by newline characters (\n). It splits the pattern buffer on the \n character and appends each separate chunk to the output stream.

Note: you can print the same text multiple times -- just repeat the 'p' command as often as needed.

"s" is for Substitute

Usage:

s/target/replacement/options

The 's' command lets the user perform regular expression find and replace operations on the content of the pattern buffer. The transforms that can occur are as follows:

substitute one simple string for another

find a regular expression match and replace it with either a constant or some intermingling of the original text and new content

delete matched text ( by replacing it with nothing )

The options are as follows:

1 -- substitutes only the first match in the pattern buffer with the replacement
g -- substitutes all matches
i -- the comparison is case insensitive.

See the SED manual page for details of the 's' command. It is quite powerful.
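
The effect of the three options can be tried out with the C++ standard library alone. The function below is a hypothetical sketch that borrows std::regex with the POSIX basic grammar and the sed replacement format; muSED's own regular expression engine and its handling of the options may differ in detail.

```cpp
#include <regex>
#include <string>

// Apply one substitution to a copy of the pattern buffer.
// options: "1" first match only (the default in this sketch),
//          "g" all matches, "i" case-insensitive comparison.
std::string substitute(const std::string& patternBuffer,
                       const std::string& target,
                       const std::string& replacement,
                       const std::string& options = "1")
{
    const bool global     = options.find('g') != std::string::npos;
    const bool ignoreCase = options.find('i') != std::string::npos;

    // POSIX basic grammar: groups are written \( ... \) as in SED.
    std::regex::flag_type syntax = std::regex::basic;
    if (ignoreCase)
        syntax = syntax | std::regex::icase;

    // format_sed: the replacement uses \1 and & rather than $1 and $&.
    std::regex_constants::match_flag_type flags =
        std::regex_constants::format_sed;
    if (!global)
        flags = flags | std::regex_constants::format_first_only;

    return std::regex_replace(patternBuffer, std::regex(target, syntax),
                              replacement, flags);
}
```

For instance, substitute("hello world world", "world", "there") changes only the first "world", while passing "g" as the last argument changes both.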

"d" is for delete

Usage:

The 'd' command lets you delete the current line from the output stream. Normally, input stream contents are ignored, so if you don't print them, they don't show up in the output stream. The 'd' command, however, lets you abort the processing of all other commands in the script on the current line -- presumably letting you skip any 'p' commands that would copy the current line to the output.

"q" means "Quit After This Line is printed"

Note: there are two "q" commands, see the "Q" command, below.

Usage:

The "q" command prints the current line to the output stream, then suppresses any further script execution on any other lines in the input stream.

"Q" means "Quit Before This Line and do not print it"

Usage:

The 'Q' command does not print the current line to the output stream, but suppresses any further script execution on any other lines in the input stream.
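
The difference between 'd', 'q' and 'Q' is purely one of control flow, which the following standalone sketch tries to make concrete. All names (Flow, Command, cmdPrint, cmdDelete, cmdQuit, cmdQuitSilently, onlyIf, run) are hypothetical, and the substring test in onlyIf is a simplified stand-in for a real condition prefix.

```cpp
#include <functional>
#include <string>
#include <vector>

// What the interpreter loop should do after a command has run.
enum class Flow { Continue, NextLine, Quit };

using Output  = std::vector<std::string>;
using Command = std::function<Flow(const std::string& pattern, Output& out)>;

// 'p': print and carry on (newline splitting omitted for brevity)
Flow cmdPrint(const std::string& pattern, Output& out)
{
    out.push_back(pattern);
    return Flow::Continue;
}

// 'd': skip the remaining commands for this line only
Flow cmdDelete(const std::string&, Output&) { return Flow::NextLine; }

// 'q': print the current line, then stop processing all input
Flow cmdQuit(const std::string& pattern, Output& out)
{
    out.push_back(pattern);
    return Flow::Quit;
}

// 'Q': stop processing all input without printing
Flow cmdQuitSilently(const std::string&, Output&) { return Flow::Quit; }

// Run `inner` only on lines containing `needle`.
Command onlyIf(const std::string& needle, Command inner)
{
    return [needle, inner](const std::string& pattern, Output& out) {
        return pattern.find(needle) != std::string::npos
                   ? inner(pattern, out)
                   : Flow::Continue;
    };
}

Output run(const std::vector<Command>& script,
           const std::vector<std::string>& input)
{
    Output out;
    for (const std::string& line : input)
    {
        for (const Command& command : script)
        {
            const Flow flow = command(line, out);
            if (flow == Flow::Quit)
                return out;  // no further lines are processed
            if (flow == Flow::NextLine)
                break;       // rest of the script is skipped for this line
        }
    }
    return out;
}
```

Placing onlyIf("2", cmdDelete) ahead of cmdPrint in a script drops every line containing "2", whereas swapping in cmdQuit or cmdQuitSilently ends the whole run at the first such line.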

'x' is for Swap

The 'x' command swaps the pattern and hold buffer without modifying either.

'h' is for "Hold this for me, will you?"

The 'h' command copies the pattern buffer to the hold buffer.

'H' is for "Add this to your list"

The 'H' command appends '\n' and the content of the pattern buffer to the hold buffer. Usually, you want to first perform an 'h' command, then perform zero or more 'H' commands to build up the hold buffer. Entire files can be built up as giant strings in the hold buffer.

'g' is for 'Get the hold buffer'

It discards the pattern buffer and replaces it with the entire content of the hold buffer.

'G' is for 'Append hold to the pattern with a \n delimiter'

It appends \n to the pattern buffer then follows that with the entire hold buffer.
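
The five buffer commands reduce to a few string operations. The sketch below restates them on a hypothetical two-member struct; the function names are made up for the example and are not part of muSED.

```cpp
#include <string>
#include <utility>

// Hypothetical holder for the two SED variables.
struct BufferPair
{
    std::string pattern;
    std::string hold;
};

// 'x': swap the two buffers
void swapBuffers(BufferPair& b) { std::swap(b.pattern, b.hold); }

// 'h': overwrite the hold buffer with the pattern buffer
void holdCopy(BufferPair& b) { b.hold = b.pattern; }

// 'H': append '\n' and the pattern buffer to the hold buffer
void holdAppend(BufferPair& b)
{
    b.hold += '\n';
    b.hold += b.pattern;
}

// 'g': overwrite the pattern buffer with the hold buffer
void getHold(BufferPair& b) { b.pattern = b.hold; }

// 'G': append '\n' and the hold buffer to the pattern buffer
void getAppend(BufferPair& b)
{
    b.pattern += '\n';
    b.pattern += b.hold;
}
```

Calling holdCopy on the first line of a group and holdAppend on each later line accumulates the group in the hold buffer as one '\n'-delimited string, ready to be brought back with swapBuffers or getHold.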

Examples

The following paragraphs contain muSED examples, explained:

C++ setup examples

The following examples focus on setting up the muSED interpreter in C++.

Example: Deleting the input

An empty script will ignore all the input lines when you call the apply function. To get any input produced into the output sequence, you must invoke the 'p' command at least once in the script.

Example: Copying input to output

 //  Using mused to copy one sequence of strings to another.
 //
     std::vector<std::string>  script;
 
     script.push_back("p");  // just print the input to the output
 
     std::deque<std::string>  input;
 
     input.push_back("1");
     input.push_back("2");
     input.push_back("3");
 
     std::list<std::string> output;
 
     cxxtls::muSED::CompiledScript  cScript(script);
 
     if(!cScript.ok())
     {
       std::cerr << "huh?  shouldn't happen:  " << cScript.error() << std::endl;
       exit(1);
     }
 
     cxxtls::muSED::apply(cScript, input, output);
 
 //  output now contains 3 lines ("1", "2", and "3")

Example: Line by line filtering

Copy only strings that match a pattern from the input sequence to the output.

 //  Using mused to copy only selected strings from one container
 //  to another:
 //
     std::list<std::string>  script;
 
     script.push_back("/2/p");  // only if the input line contains '2' will
                                // it be printed.
 
     std::vector<std::string>  input;
 
     input.push_back("1");
     input.push_back("2");  // only this line of input will be printed
     input.push_back("3");
 
     std::list<std::string> output;
 
     cxxtls::muSED::CompiledScript  cScript(script);
 
     if(!cScript.ok())
     {
       std::cerr << "huh?  shouldn't happen:  " << cScript.error() << std::endl;
       exit(1);
     }
 
     cxxtls::muSED::apply(cScript, input, output);
 
 //  output now contains 1 line ("2")

Example: Line Group Filtering

Assume that most of the sequence should not be copied, but that certain groups of strings might be worth keeping.

A group is defined by a start string and an end string (both considered to be included in the group).

However, not all groups should be copied from the input to the output: only those that contain a certain string.

 //  Using mused to copy only selected groups of strings.
 //
     std::list<std::string>  script;
 
     script.push_back("/start/,/end/{");
     script.push_back("               /start/h");
     script.push_back("               /start/!H");
     script.push_back("               /end/{");
     script.push_back("                      x");
     script.push_back("                      /match/p");
     script.push_back("                    }");
     script.push_back("             }");
 
 
     std::vector<std::string>  input;
 
     input.push_back("ignored1");    // not in a group
 
     input.push_back("start");       // group to be printed
     input.push_back("2");  
     input.push_back("3matches");  
     input.push_back("end");
 
     input.push_back("ignored2");   // not in a group
 
     input.push_back("start");      // non-matching group
     input.push_back("2");  
     input.push_back("3ignored");  
     input.push_back("end");        //
 
     std::list<std::string> output;
 
     cxxtls::muSED::CompiledScript  cScript(script);
 
     if(!cScript.ok())
     {
       std::cerr << "huh?  shouldn't happen:  " << cScript.error() << std::endl;
       exit(1);
     }
 
     cxxtls::muSED::apply(cScript, input, output);
 
 //  output now contains 4 lines ("start", "2", "3matches", "end")

muSED language examples

Most mused examples will behave exactly like SED examples that use the same commands. Specifically, you should be able to make use of the sed homepage on sourceforge.net for any examples that restrict themselves to the following commands:

The following sections describe muSED peculiarities.

Reformatting paragraphs

The following mused script reads text which consists of paragraphs delineated by blank lines. It forces all the paragraphs to fit within character positions 1-80 unless a word is longer than 10 characters -- in which case the line will overflow.

 s/^ *//1  # remove leading blanks from all lines
 s/  *$//1 # remove trailing blanks  from all lines
 
 1,/./  {
            # remove leading blank LINES at the start of the
            # stream
 
            /^$/d
 
         }
 
 /^$/ {
         # on blank lines, dump the current paragraph and start a new
         # one
 
         x              # get the paragraph into the pattern buffer
         y/\n/ /        # convert embedded newlines into spaces
         s/  */ /g      # convert multiple spaces into a single space
         s/^  *//1      # remove leading blanks
 
         W /^.\{80\}/   {
                            # split the paragraph into 80 character lines and
                            # print them.
 
                            s/^\(.\{70\}\)\([^ ]\+\)  *\(.*\)/\1\2\n\3/1
                            P
                            s/^[^\n]\+\n//1
 
 
                        }
 
         /./ p          # print the final line fragment
 
         g              # get the empty line back into the pattern buffer
 
         p              # print a paragraph separator
 
         d              # stop processing on blank lines, to avoid the 'h' and H commands
                        # below
 
      }
 
 ${
         # at end of file, dump the current paragraph and start a new
         # one
 
         g              # get the paragraph into the pattern buffer
         y/\n/ /        # convert embedded newlines into spaces
         s/  */ /g      # convert multiple spaces into a single space
         s/^  *//1      # remove leading blanks
         
         W /^.\{80\}/   {
                            # split the paragraph into 80 character lines and
                            # print them.
 
                            s/^\(.\{70\}\)\([^ ]\+\)  *\(.*\)/\1\2\n\3/1
                            P
                            s/^[^\n]\+\n//1
 
 
                        }
 
         /./ p          # print the final line fragment
 
      }
 
 #
 #  we won't get here if the current line is blank
 #
 
 1h      # if this is line 1, copy it to the hold buffer to start a paragraph 
 1!H     # append this line to the hold buffer separated by \n