Yelt

Yelt is a streaming text editor, like its older cousin, SED. Yelt could be described as "SED -- the next generation" because it is essentially SED plus features not present in SED.

And like Star Trek, the Next Generation, it has different characters and more sophisticated plots.

The rest of this page describes how to do many commonly needed tasks using yelt scripts.

YELT TRICKS


HOW DO I GET MORE DOCUMENTATION?
HOW DO I WRITE AND EXECUTE YELT SCRIPTS?
HOW ARE VARIABLES USED?
HOW DO I PASS COMMAND LINE ARGUMENTS AS PROGRAM VARIABLES?
HOW DO I USE IF STATEMENTS?
HOW DO I WRITE AND CALL FUNCTIONS?
HOW DO I #INCLUDE OTHER SCRIPTS INTO MY SCRIPT?
SUPPRESSING THE DEFAULT PRINTING OF LINES
CLEANING UP YELT INVOCATIONS BY COMBINING -e OPTIONS:
NUMBERING LINES
HOW DO I SIMULATE GREP?
HOW DO I WIDEN STRINGS?
HOW DO I REPEATEDLY PROCESS THE SAME LINE IN A SCRIPT?
WHY WOULD I REPEATEDLY PROCESS THE SAME LINE?
HOW DO I EXTRACT SECTIONS OF A LINE?
HOW DO I TRANSLATE ONE SET OF STRINGS TO ANOTHER?
HOW DO I TRANSLATE ONE CHARACTER SET (RANGE) TO ANOTHER?
HOW DO I INJECT A NEWLINE IN THE RHS OF A SUBSTITUTION?
HOW DO I INJECT SPECIAL CHARACTERS AND CONTROL CODES INTO THE RHS OF SUBSTITUTION?
HOW DO I SWAP TWO VARIABLES WITHOUT USING A TEMPORARY?
HOW DO I PERFORM A CASE INSENSITIVE COMPARE?
HOW DO I PRINT ONLY THE FIRST OCCURRENCE OF A STRING IN A FILE?
HOW DO I PARSE COMMA SEPARATED VALUE FILES?
HOW DO I PARSE FILES DELIMITED BY SPECIAL CHARACTERS?
HOW DO I CHANGE ONLY ONE SECTION OF A FILE?
HOW DO I IMPLEMENT IF-THEN-ELSE?
HOW DO I COMPUTE A REGEX FOR USE IN CONDITIONAL EXPRESSIONS?
HOW DO I SPLIT FILES INTO PARAGRAPHS?
HOW DO I SELECTIVELY PRINT ONLY THE PARAGRAPHS THAT MATCH A REGULAR EXPRESSION?
HOW DO I MAKE YELT READ A FILE CONTAINING A LIST OF FILES AND PROCESS EACH ONE?
HOW DO I JOIN LINES BASED ON THE CONTENTS OF THE FIRST LINE?
HOW DO I WRITE SCRIPT CODE THAT RUNS BEFORE THE FIRST LINE OF TEXT IS READ OR AFTER THE LAST ONE HAS BEEN PROCESSED?
HOW DO I INSERT A FILE AT THE TOP OF MY OUTPUT?
HOW DO I SIMULATE C #INCLUDE PROCESSING?
HOW DO I WRITE TO FILES OTHER THAN STDOUT?
HOW DO I AVOID CREATING A SEPARATE YELT SCRIPT WHEN WRITING A UNIX COMMAND LINE SCRIPT?
HOW DO I EXTRACT ANCHOR URLS FROM HTML FILES?
HOW DO I PROCESS REPEATED BLOCKS OF TEXT?

This file contains discussions of how to do various things using
yelt -- the particular focus is on yelt-specific behaviors, not
generalized stream editing.  Yelt is a descendant of SED but is
not a proper superset -- so be careful.




HOW DO I GET MORE DOCUMENTATION?


One way is to make yelt print its internal documentation -- this
will be the latest and greatest -- but it may be terse.

  yelt -h 2>&1 | less

You could also go to the yelt website:

    http://www.bordoon.com/yelt

However, right now, there is not much there.  By the time you read
this, maybe that will change.




HOW DO I WRITE AND EXECUTE YELT SCRIPTS?


Yelt scripts can be specified in any of 3 mutually exclusive ways:

  1.  use -e options to the yelt program to add single expressions to the
      default script:

        yelt -e 'y/a-z/A-Z/;' file1 file2 file3 ...

      This invocation adds 1 expression, to upper case the current line,
      to the default script, which looks like this:

        w { n; [-e options]; p; }

      The default script reads every line of input from all files and
      prints them out.  The -e options let you do things to the current
      line before it gets automatically printed.

  2.  use the -S option to manually specify the entire script from the
      command line:

        yelt -S "q9 STUFF; w { n; A90; p; }"

      This eliminates the default script and lets you do the whole thing
      yourself.  In this case, the register 9 is populated with 'STUFF'
      before the while loop that reads and prints lines begins.  The
      main while loop appends the contents of R9 to all lines then 
      prints each line (now with STUFF stuck on the end).

  3.  use the -f option to run a script stored in a file.  When writing
      your own scripts, remember to include the outer w{n; p;} parts!

        yelt -f scriptFileName datafile1 ...




HOW ARE VARIABLES USED?


Yelt commands are like assembly language instructions:  they have either
0, 1, 2, or 3 registers associated with them.  They are typically written
like this:

  Y01

This command means that the command 'Y' is a two register command and the
registers are 0 and 1.  If the command is a 1 register command and if the
register you are using is 0, you can leave it out:

  p;

This command prints register 0.


Yelt allows for 10 string registers.  They are numbered 0-9.  They
are equally empowered.  Some commands, however, assume variable 0
unless you specifically tell the command which variable to use. This
simplifies typing in most cases and gives some semblance of compatibility
with sed.  For example:

    -e 's/string/output/g'

This syntax works the same for yelt as it does for sed.  To use some
other variable, do this:

    -e 's4/string/output/g'

         ^
Note the 4 in the above syntax.

Most yelt commands allow one or more variable numbers to follow the single
letter that defines the command:

  a01
  p9
  q3
  etc

One exception to this positioning of the variable number is found in the
pattern conditional statement:


  /pattern/~3 command
           ^^

Note that you specify the variable that is checked against the pattern after
the regular expression and you must specify the ~.  Variable zero is the default,
so if you are using variable zero, this is how the command looks:

  /pattern/ command

You can increment and decrement variables if they are valid integers using the
following syntax:

  -8
  +3

Which is to say, subtract 1 from register 8, and add 1 to register 3.  When comparing
registers, you are on your own -- try formatting the text in a right justified 
field then doing a normal string comparison.  In general, math is not yelt's
strong suit.  The variables exist mainly for line counting.




HOW DO I PASS COMMAND LINE ARGUMENTS AS PROGRAM VARIABLES?


Yelt defines 10 string registers.  They can be prepopulated using the -r[0-9]
options like this:

  yelt -r3 text: -e 'a34; A04; p4; d;'  someFile

This particular script will print all the lines in someFile and prefix each
with "text:".  The string variables populated in this manner are not special --
if you change the text as part of the script, the command line values will
be lost.

Note that the F command lets you read and process files whose names are
specified in registers.  Thus, you could easily use a -r option to populate
a register with the name of the file you actually intend to transform and 
you could use the stdin data for some other purpose.  For example:

  yelt -r9 Somefile -S 'F9 { n; p; }'  # stdin is ignored completely, but
                                       # Somefile is copied to stdout.




HOW DO I USE IF STATEMENTS?


Yelt has a real if-statement, else clause and all.  Both yelt and sed have
pattern conditional statements with this syntax:

  /regex/ cmd

but yelt adds this syntax:

  if /regex/ cmd1 else cmd2

The "else" and cmd2 are optional.  In both cases, the cmds can be singleton
commands or command blocks like this:

  { cmd1; cmd2; ... }

The advantage of the if statement over the pattern conditional statement
is that the else clause will only execute if the if clause did not.  In sed,
traditionally, pattern conditional statements do not have else logic --
so you have to implement them yourself:

  /reg1/   cmd1;
  /reg1/!  cmd2;

This code only really works if cmd1 does not change the data being tested.
If it does, you have to come up with some complicated work around -- or
duplicate the cmd2 inside nested blocks.




HOW DO I WRITE AND CALL FUNCTIONS?


Functions are declared like this:

  Def fun 3
  {
    #  This function expects 3 registers to be passed into it.
    #  The registers from the caller's "address space" will be stored into
    #  this function's register set in locations 0,1, and 2.  When this function executes,
    #  all of its other registers will be empty strings.
    #
    #  Upon returning to the caller, this function's registers, 0-2, will be
    #  copied back into the caller's registers -- in the same registers that
    #  were passed to this function.  Thus, the return variables must be in
    #  the same group as the parameter variables.
  }

 C fun 7,2,1
   #
   # This statement calls function 'fun' using the contents of registers 7, 2, and 1.
   # The contents of register 7 will be placed in fun's variable 0, 2's contents will go
   # into fun's variable 1, and the contents of register 1 will go into
   # fun's register 2.
   #
   # fun can return any data it wants in its variables 0, 1, or 2.  The data
   # will be mapped back into the caller's 7, 2, or 1 using the same logic
   # as when data is passed into the function.
   #

The scheme is clunky, but at least fun can't change any registers in the caller
that aren't expected.
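
The copy-in/copy-out register mapping described above can be modeled outside
yelt.  The following Python sketch is not yelt code; call_function and concat
are made-up names, and the model assumes only what the comments above state:
ten string registers, arguments copied into the callee's registers 0..n-1, and
those same registers copied back on return.

```python
def call_function(caller_regs, func, arg_regs):
    """Model of 'C fun 7,2,1': copy in, run, copy back."""
    # The callee starts with ten empty registers of its own.
    local = [""] * 10
    # Copy-in: the i-th listed caller register lands in callee register i.
    for i, reg in enumerate(arg_regs):
        local[i] = caller_regs[reg]
    func(local)
    # Copy-out: callee registers 0..n-1 go back to the same caller registers.
    for i, reg in enumerate(arg_regs):
        caller_regs[reg] = local[i]


def concat(regs):
    """Hypothetical 3-register function: store regs[0] + regs[1] in regs[2]."""
    regs[2] = regs[0] + regs[1]
    regs[5] = "scratch"  # never reaches the caller: register 5 was not passed


caller = [""] * 10
caller[7], caller[2], caller[5] = "seven", "two", "untouched"
call_function(caller, concat, [7, 2, 1])
print(caller[1])  # seventwo
print(caller[5])  # untouched
```

The result lands in the caller's register 1 because register 1 was listed
third, and so corresponds to the callee's register 2.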

Here's an example:

  Def app 2
  {
    #
    # append parameter 0 to parameter 1 and return it in parameter 1 (ie do nothing but
    # leave the registers as they are on exit)
    #
    A01;
  }

  ...

  q0 VARIABLE 0'S CONTENTS;
  q1 MORE STUFF;

  C app 0,1;

  p1;

Upon execution, this script will print the contents of register 1 with
the contents of register 0 appended to it:

  MORE STUFFVARIABLE 0'S CONTENTS




HOW DO I #INCLUDE OTHER SCRIPTS INTO MY SCRIPT?


With a text editor.  There is no #include capability in the script
language.  Although anything can change....

If you are using the -S or -e options to yelt, however, you can
allow the shell command interpreter to do the substitution for you.
Here's a unix example:

  includeFileContents=`cat someScriptFile`

  yelt -S "$includeFileContents w{n; [my commands]; p;}"

This would be harder to do on Windows but possible.




SUPPRESSING THE DEFAULT PRINTING OF LINES


The d command can be used to avoid the final printing of the line that
naturally occurs as part of the default script.  Thus, you have a
choice.  If you don't like the default script because it has the
final print, you can just make your last -e command be 'd', which
will continue the while loop before you get to the offending p.
Or you can use the -S or -f option to replace the whole script.

For example, you might only want to print the top of a file:

  -e '101,$d'

This deletes all lines after 100.  

The 'd' command is the "continue" command.  The 'b' command breaks
out of a while loop and the 'd' command just goes back to the top.
The 'Q' command quits the current function, or the script if not in a function.
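
As a rough analogy (a Python sketch, not yelt code), 'd' behaves like
continue and 'b' like break inside the loop of the default script.  The
function name run_default_script and its limit parameter are made up for
illustration.

```python
def run_default_script(lines, limit=100):
    """Model of: w { n; 101,$d; p; } with the line limit made a parameter."""
    printed = []
    for number, line in enumerate(lines, start=1):  # w { n; ...
        if number > limit:
            continue          # 'd': skip the final p, go read the next line
        printed.append(line)  # 'p': the default print at the end of the loop
    return printed


print(run_default_script(["a", "b", "c", "d"], limit=2))  # ['a', 'b']
```

Replacing continue with break would model 'b':  the loop would stop at the
first line past the limit instead of reading and discarding the rest of the
input.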

If you want to write a script that works like normal except that it
has no final print, do this:

  -S "
        w
        {
          n;
          commands;
        }
     "



CLEANING UP YELT INVOCATIONS BY COMBINING -e OPTIONS:


The -e option to yelt can be used to specify multiple commands --
all you have to do is to specify ; between them within the string
you pass to yelt with -e:

  -e "s/bill/tom/g ; s/hank/susan/1;"




NUMBERING LINES


To number lines, use the l command like this:

  -e "l1 ; A01 ; p1 ; d"

This script fragment puts the line number in variable 1
then appends the current input line to it.  Then it prints variable
1 and stops processing on the current line (ie it continues).
Note that the line number is formatted with a trailing space.

Note that if you want to include both the filename and the line number
you can do this:

  -e "l1; L2; A12; A02; x02;"

These commands do the following:

 l1   -- puts the current line number of the current file into register 1.
 L2   -- puts the current filename into register 2.
 A12  -- appends the line number in register 1 to the end of the filename in register 2.
 A02  -- appends the text of the current line into register 2
 x02  -- swaps register 0 with register 2 so that the default print command will print 
         the filename, line number, text string we have just made.




HOW DO I SIMULATE GREP?


Like this:

  -e ' /regex/{ l1; L2; A12; A02; p2; } ; d; '

This script fragment produces output only for input lines that contain
the regular expression.  For each such line, it formats the filename,
line number, and text of the line in register 2 and prints that register.
Finally, the d suppresses the default printing of every line.




HOW DO I WIDEN STRINGS?


If you just want to widen a string, use this command:

  -e "j1 100"

This appends spaces to make variable 1 be at least 100 characters
wide.  That is, it left justifies in a field of 100 characters.
Use J to pad from the left (ie right justify).
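
For readers who think in another language, the padding behavior can be
modeled with Python's built-in string methods.  This is a sketch, not yelt
code; the function names are made up, and it assumes that text already wider
than the field is left unchanged (which is what "at least" suggests).

```python
def widen_left(text, width):
    """Model of 'j': pad on the right so text is at least `width` wide."""
    return text.ljust(width)


def widen_right(text, width):
    """Model of 'J': pad on the left so text is at least `width` wide."""
    return text.rjust(width)


print(repr(widen_left("abc", 6)))       # 'abc   '
print(repr(widen_right("abc", 6)))      # '   abc'
print(repr(widen_left("abcdefgh", 6)))  # 'abcdefgh'
```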




HOW DO I REPEATEDLY PROCESS THE SAME LINE IN A SCRIPT?


While loops do not force the reading of new lines -- you
have to do that with the n command.  So you can set up
a while loop that does not read a new line unless you
do it yourself:

  w
  {
      n;

      w
      {
        # this part is an infinite loop until you
        # use b, d, or Q.  If you want to read the
        # next line, you'll have to use 'n' -- and
        # if it hits the end of file, you'll break out
        # of the loop at that point
      }

      p;
  }




WHY WOULD I REPEATEDLY PROCESS THE SAME LINE?


Some yelt commands are designed to enable parsing of text, even
text spanning multiple lines -- and possibly having multiple 
interesting things happening on the same line.  See the SR012,
SW01, SC0 commands in the yelt -h output.  Or use s///g and the
A commands.




HOW DO I EXTRACT SECTIONS OF A LINE?


If you want to use only part of a line, you have several methods:

1.  use the substitute command to eliminate unneeded text:

     s/....\(.....\).*/\1/1

    This command removes the first 4 characters, keeps the next 5,
    then removes the rest.

2.  you can use the cut command:

     c 5-9

3.  The split commands let you split lines (strings) into parts and
    you can process each separately:

    SC01 10     -- splits register 0 into two parts:  columns 0-9 remain
                   in register 0.  Everything after that, if any, goes
                   into register 1.

    SW943 xy    -- splits register 9 into 3 parts:  the beginning of the
                   string up to the first 'x' or 'y' remains in register 9.
                   The character that triggered the splitting (either x or y)
                   goes into register 4.  The remainder of the line, after the
                   delimiter, goes into register 3.

    SR825 /r/   -- splits register 8 into 3 parts based on a regular 
                   expression:  the string up to the first instance of the
                   regular expression stays in R8.  The string matching the
                   regular expression goes into register 2.  The text after that
                   matching regex goes into R5.
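
The SC and SR descriptions above can be modeled in Python.  This is a sketch
of the described behavior, not yelt code:  the function names are made up,
Python's regular expression syntax differs from yelt's, and the no-match
behavior is an assumption that the text above does not state.

```python
import re


def split_column(text, column):
    """Model of 'SC01 10': text before `column` stays, the rest moves."""
    return text[:column], text[column:]


def split_regex(text, pattern):
    """Model of 'SR825 /r/': returns (before, matched text, after)."""
    match = re.search(pattern, text)
    if match is None:
        # Assumption: with no match the whole string stays in the first part.
        return text, "", ""
    return text[:match.start()], match.group(0), text[match.end():]


print(split_column("0123456789abc", 10))    # ('0123456789', 'abc')
print(split_regex("key = value", r" *= *"))  # ('key', ' = ', 'value')
```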




HOW DO I TRANSLATE ONE SET OF STRINGS TO ANOTHER?


In many sed scripts there are large numbers of simple textual translations
performed like this:

  s/Bill/William/1

This works great and is relatively fast but can lead to enormous scripts.
Sometimes it is even necessary to create a script using a script just so
that the second script has the right translations in it.

Yelt provides a "mapping" mechanism that lets you convert one string to
another, and this runs in O(log N) time rather than in O(N).  The -M
option to yelt lets you define a file that contains key|value pairs.
Your script can then use the M command to cause a register containing a
key to be converted to the corresponding value as defined in the -M
file.  Like this:

   yelt -M filename -e 'M' inputFile

This script, naturally, reads lines from inputFile into variable 0.  The
M command with no following variable number, maps the variable 0, containing
the input line, into the value defined in "filename".  That file would be
formatted like this:

  key1|value1
  key2|value2
  ...
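
The lookup that the M command performs can be pictured with a small Python
model.  This is a sketch, not yelt's implementation:  the function names are
made up, a Python dictionary stands in for whatever structure yelt uses, and
the handling of malformed lines and of keys with no entry are assumptions.

```python
def load_map(lines):
    """Parse key|value lines into a dictionary."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if "|" not in line:
            continue  # assumption: lines without a separator are skipped
        key, value = line.split("|", 1)
        table[key] = value
    return table


def map_register(table, text):
    """Model of 'M': replace text by its mapped value.

    Assumption: a key with no entry leaves the text unchanged.
    """
    return table.get(text, text)


table = load_map(["Bill|William\n", "Bob|Robert\n"])
print(map_register(table, "Bill"))  # William
print(map_register(table, "Sue"))   # Sue
```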




HOW DO I TRANSLATE ONE CHARACTER SET (RANGE) TO ANOTHER?


  yelt -e 'y/a-z/A-Z/'

This is useful for making all the characters in a string
be either upper case or lower case.  You can also perform
special character substitutions.  See the special character
escape sequence list below.




HOW DO I INJECT A NEWLINE IN THE RHS OF A SUBSTITUTION?


You can insert new lines into the right hand side of a substitution
using the \n character sequence:

   yelt -e 's/.*/text\n/1'




HOW DO I INJECT SPECIAL CHARACTERS AND CONTROL CODES INTO THE RHS OF SUBSTITUTION?


The special characters that are interpreted in the right hand side
of a substitution are as follows:

  \n -- new line          0x0a
  \t -- tab               0x09
  \r -- return character  0x0d
  \e -- escape            0x1b
  \b -- backspace         0x08
  \s -- space             0x20 ' '
  \a -- bel               0x07 (beep)




HOW DO I SWAP TWO VARIABLES WITHOUT USING A TEMPORARY?


  -e 'x01'

The 'x' command swaps any two registers.  It requires 2 digits that refer
to the registers.




HOW DO I PERFORM A CASE INSENSITIVE COMPARE?


option 1:

Write your comparand in all upper case and translate a copy of the incoming
text to upper case before comparing:

  a03
  y3/a-z/A-Z/
  /COMPARAND/~3

option 2:

Write the regexp of your comparison like this:

    /[Cc][Oo][Mm][Pp][Aa][Rr][Aa][Nn][Dd]/~3




HOW DO I PRINT ONLY THE FIRST OCCURRENCE OF A STRING IN A FILE?


You quit when you find it.  Consider:

  -e '/RE/{p;b}; /RE/!d;'

This command means, "if you find the regular expression, RE,
then print the current line and break out of the loop.  If the
line does not contain RE, then continue the while statement
before getting to the end, where the 'p' directive would print
the current line."




HOW DO I PARSE COMMA SEPARATED VALUE FILES?


The substitute command can be used to replace one set of text
with another and delimiters are easily handled:

  -e 's/\([^,]*\),\([^,]*\),\([^,]*\)/[field1=\1], [field2=\2], [field3=\3]/1'

This of course requires that you know how many fields are on one line.

Another way to do this is to use either SW or SR.  These commands let you
split the text in a register into parts:

  SW lets you split the string into two parts:  before the delimiter and after
  the delimiter.

  SR lets you split the string into three parts: 

    before the delimiter

    the delimiter itself

    after the delimiter.

For example, the command "SW01 ," will split the contents of register 0.
Everything up to the first comma will be left in R0.  Everything after the
comma will go into register 1.  The comma will be discarded.  Note that
SW lets you specify a set of delimiters, not just 1 character.  However, there
is no way to know which delimiter triggered the splitting.

The SR012 command is more flexible and lets you know what the delimiter actually
was.  For example:  "SR012 /[,|:]/" will do the following:

  The contents of register 0 will be split into three parts:

    the stuff up to the first comma, or bar, or colon will stay in R0.

    the delimiter will go into R1.

    The stuff after the first delimiter will go into R2.




HOW DO I PARSE FILES DELIMITED BY SPECIAL CHARACTERS?



Another form of parsing can be done which involves loops within loops in
the script language:

  suppose you want to convert all references to text like this:

    +high lighted text+

  into this:

    [i]high lighted text[/i]

  and handle text that crosses line boundaries, like this:

    +some text
     more text
     the end+

You can do this by keeping a state variable that records whether you are
currently "inside" highlighted text.  The TestLib/toggleTest script does this:

    w
    {
      #
      #  main loop to read lines and process them one at a time
      #
      n;

      w
      {
        #
        #  inner loop that repeatedly processes one line to handle all 
        #  the toggling
        #

         /+/{
              #
              # a plus is found in the current line which is stored
              # in variable 0
              #

              /^ON/~1 {       
                         # If "ON" is found in variable 1, empty variable 1

                         q1;  

                         # .. turn the next "+" into "[/i]"

                         s/+/[\/i]/1; 

                         # go back to the top of the inner loop

                         d;    
                            
                      }

              #
              # at this point, we know that variable 1 did not contain ON
              # so lets put one there

              q1 ON;

              # and convert the first + into [i]

              s/+/[i]/1;      
              d;
            }

         #
         #  if the current line had a + in it, we would not be
         #  here, so lets break out of the inner loop and
         #  let the outer loop print the line
         #  
         #

         b;
     }

     p;

   }
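
The same state machine can be written in Python to make the control flow
easier to follow.  This is a sketch of the logic only, not yelt code;
toggle_markup is a made-up name, and the boolean plays the role that the
"ON" marker in variable 1 plays in the script.

```python
def toggle_markup(lines):
    """Replace '+' alternately with '[i]' and '[/i]', across line boundaries."""
    inside = False            # plays the role of variable 1 ("ON" or empty)
    output = []
    for line in lines:        # outer loop: n ... p
        while "+" in line:    # inner loop: runs until no '+' is left
            tag = "[/i]" if inside else "[i]"
            line = line.replace("+", tag, 1)   # like s/+/.../1
            inside = not inside
        output.append(line)
    return output


for text in toggle_markup(["+some text", " more text", " the end+"]):
    print(text)
```

Because the state survives from one line to the next, a highlight opened on
one line is closed correctly on a later one.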




HOW DO I CHANGE ONLY ONE SECTION OF A FILE?


You can use a simple line range triggered command group like this:

  -e '10,20 y/a-z/A-Z/'

This upper cases the text in lines 10 through 20 but nowhere else.

You can use a regular expression comparison to trigger commands only
on some specific lines:

  -e '/fred/ { y/a-z/A-Z/; } '

This upper cases any line containing fred.


You can use a pair of regular expressions to define a range of lines
that should be processed:

  -e '/begin/, /end/ { y/a-z/A-Z/; } '

If you do not wish to process the end or beginning line, you can
make the expression more complex, but it works basically the same
way.  Suppose you were writing the script in a file, the code to
process all lines in a range but not the last one might look like
this:

  w
  {
    n;

    /begin/, /end/
    {
      /end/! 
      {
        p;
      }
    }

  }

This code processes all lines in the file and ignores most of
them.  But when groups of lines are bounded by the words
begin and end, it prints those lines -- except for the last line
of each group -- ie the line containing "end".

So if your data contains:

  stuf
  begin
  more 1
  more 2
  more 3
  end
  other
  garbage
  begin
  later 1
  end

the program would print:

  begin
  more 1
  more 2
  more 3
  begin
  later 1

Note that the "end" lines are not printed.
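
The range logic can be traced with a small Python model.  This is a sketch,
not yelt code:  the function name is made up, and it assumes sed-style
ranges in which the closing pattern is only looked for on lines after the
one that opened the range.

```python
import re


def print_range_without_end(lines, begin=r"begin", end=r"end"):
    """Model of: /begin/,/end/ { /end/! { p; } }"""
    in_range = False
    printed = []
    for line in lines:
        started_here = False
        if not in_range and re.search(begin, line):
            in_range = True
            started_here = True
        if in_range:
            if not re.search(end, line):
                printed.append(line)   # /end/! { p; }
            elif not started_here:
                in_range = False       # the closing line ends the range
    return printed


data = ["stuf", "begin", "more 1", "more 2", "more 3", "end",
        "other", "garbage", "begin", "later 1", "end"]
for text in print_range_without_end(data):
    print(text)
```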

Yet another approach is to use a /regex/ trigger to
execute a while loop that reads lines.  For example:

  w
  {
    n;

    /triggerLine/
    {
      w
      {
        /endLine/b;
        n;
      }
      d;
    }
    p;
  }

This code reads and prints most lines but in groups of lines
bounded by this pair of lines, it does not print the lines:

  triggerLine
  ...
  endLine


Note that the two styles of range specifications can be intermingled:

  /text/,$ cmd -- executes cmd from the first line containing /text/ to
                  the end of the file

  10,/AA/ cmd  -- executes cmd on line 10 and on all lines until a line
                  containing /AA/ is found.  It will also be executed on
                  that line, but no lines thereafter.

  /AA/,20 cmd  -- executes cmd on lines between the first line that contains
                  /AA/ and line 20, inclusive.





HOW DO I IMPLEMENT IF-THEN-ELSE?


You use two slightly different if clauses:

  /stuff/ {
            # do the "if" behavior
          }

  /stuff/! {
             # handle the "else" behavior
           }

This of course is not a true if-then-else because the test
for the else-clause will still be run even if the if-clause
executes -- so be careful with your if-clause.  You might need
to end it with 'd' to eliminate activation of the else-clause.



HOW DO I COMPUTE A REGEX FOR USE IN CONDITIONAL EXPRESSIONS?

In the above examples, the regular expression that controls
the execution of statements is a constant:  stuff.

If you need to compute a regular expression -- say one which
depends on the actual text -- you can use the |[digit]!command
syntax.  This syntax allows you to use a register as the 
regular expression -- like this:

  w
  {
    n;
    |30!cmd;
  }

In this case, the cmd will be executed if the data found in 
register 0 DOES NOT match the expression found in register 3.

You will have to have put some value in register 3 in some other
part of the code.  Note that it should not have  // around it.
Register three should be just the text of the desired regex.




HOW DO I SPLIT FILES INTO PARAGRAPHS?


Assuming that blank lines are paragraph delimiters
and you want multi-line paragraphs each converted into a single long
line, you can do the following:

  #
  #  the following code assumes that variable 1 will contain the
  #  paragraph concatenated into 1 long line.  Variable 0, is as usual,
  #  the current line of input from stdin.
  #

  w
  {
      #
      #  read the next line
      #
      n;
      
      #
      #  expand tabs
      #
      
      t;  
  
      #
      #  treat lines containing only spaces as being empty
      #
  
      s/^  *//g;
  
      #
      #  if the line is empty, call this an end of paragraph
      #
  
      /^$/ {
              # handle empty lines
  
              /./~1 {
                      #
                      #  the current paragraph is not empty: print it
                      #  and empty it.
                      #
  
                      p1;
  
                      q1 ;
  
                    }
  
              #
              #  if the current line is blank and the current paragraph
              #  is empty, do nothing.
              #
        
           }
  
     /./ {
           # the current line is not empty -- append it to the current
           # paragraph -- with an extra blank 
  
           s/.*/& /1;
  
           A01;
  
         }
  
    #
    # don't automatically print all lines
    #
  
  }
  
  #
  #  print the last paragraph
  #
  
  /./~1 p1
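
The same algorithm, modeled in Python for comparison.  This is a sketch, not
yelt code:  join_paragraphs is a made-up name and the tab width used when
expanding tabs is an assumption.  Like the script, it leaves a trailing blank
after the last line of each paragraph.

```python
def join_paragraphs(lines, tab_size=8):
    """Turn blank-line-separated paragraphs into one long line each."""
    paragraph = ""                        # plays the role of variable 1
    result = []
    for line in lines:
        line = line.expandtabs(tab_size)  # 't': expand tabs (width assumed)
        line = line.lstrip(" ")           # s/^  *//: drop leading spaces
        if line == "":
            if paragraph:                 # /./~1: paragraph is not empty
                result.append(paragraph)
                paragraph = ""
        else:
            paragraph += line + " "       # s/.*/& /1 then A01
    if paragraph:                         # the final /./~1 p1
        result.append(paragraph)
    return result


print(join_paragraphs(["one", "two", "", "   ", "three"]))
# ['one two ', 'three ']
```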
  



HOW DO I SELECTIVELY PRINT ONLY THE PARAGRAPHS THAT MATCH A REGULAR EXPRESSION?

See the script above for instructions on how to 
parse text into paragraphs.  Then once you have that
working, modify the two print statements so that
they only act if the paragraph contains the desired
regular expression.  In the example above, you see
the following two print statements:

  p1
  /./~1 p1

These should be changed to look like this:

  /desiredRegex/~1 p1
  /desiredRegex/~1 p1

The ~1 is needed because the paragraph is held in variable 1,
not in variable 0.  Only change the two lines containing the p1 
statement; leave the other /./ conditions as they
are.


  

HOW DO I MAKE YELT READ A FILE CONTAINING A LIST OF FILES AND PROCESS EACH ONE?

The F command is meant for this purpose.  Here's its
syntax:

  -e 'F {n; s/a/b/g; p;}; d'

Like most yelt commands, F can also be used to read
a file whose name is in some variable other than 0 -- using
F3 for example.

Note that a FULL script must follow the F command token --
except for the outermost while loop which is provided by the
F command itself.  If you don't include n in your sub-script, 
no lines will be read -- the current state of the variables 
will be maintained, which means they'll still be set up for the
outer script.

The command following the F token will be executed repeatedly
until the end of the opened file.  If you don't read the
lines using n, the script will hang.

Note that variables are not modified when you open the file
unless the script changes them.  Nor are they restored when the
file is closed.

Of course, you are not limited to files whose names are
read in.  You can use the quote command, q, to force the
reading of an explicit file:

  -e "q1 junk.txt; F1 {n; p;}"

You could of course have done this the same way that you
would have done it in sed:

  xargs <listOfFiles cat | yelt -e '...'




HOW DO I JOIN LINES BASED ON THE CONTENTS OF THE FIRST LINE?

  -e '/someRegex/ { n1; s/.*/& /1; A10; }'

This script fragment does the following:

  1.  if the current line does not contain the regex,
      it does nothing special

  2.  if the current line does contain the regex, the
      following line is read into variable 1, then
      variable 1 is appended to variable 0 which will
      automatically be printed.  When the lines are
      joined, a space is used as a separator for the
      data.

If the n1 command encounters an end of file, the block 
will terminate and so will the while loop that is closest 
to the n1 command.  This means that the current line won't 
be printed unless you do something about it.  The simplest
solution, is to write the entire script yourself like this:

         -S "
               w
               {
                 n;

                 /someRegex/ 
                 {
                   n1;

                   s/.*/& /1;

                   A10;
                 }

                 p;
                 q0 ;  
               }

               /someRegex/p;
            "

By using the quote command to empty variable 0 after printing it, 
you ensure that the final print after the while loop will not 
normally occur.  But if the n1 detects end of file and terminates the
while loop, variable 0 will still contain the line matching the regex,
so you will see it printed.
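
The join logic, including the end-of-file case, can be followed in this
Python model.  It is a sketch of the behavior described above, not yelt
code, and join_after_match is a made-up name.

```python
import re


def join_after_match(lines, pattern):
    """Join each line matching `pattern` with the line that follows it."""
    output = []
    source = iter(lines)
    for line in source:                     # n;
        if re.search(pattern, line):
            following = next(source, None)  # n1;
            if following is None:
                output.append(line)         # end of file: the final /someRegex/p
                break
            line = line + " " + following   # s/.*/& /1; A10;
        output.append(line)                 # p;
    return output


print(join_after_match(["a KEY", "b", "c", "KEY last"], "KEY"))
# ['a KEY b', 'c', 'KEY last']
```

Note that the line that gets joined in is consumed without being tested
against the pattern itself.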




HOW DO I WRITE SCRIPT CODE THAT RUNS BEFORE THE FIRST LINE OF TEXT IS READ OR AFTER THE LAST ONE HAS BEEN PROCESSED?

You have to write a full script to do that yourself, like this:

   -S "
         q1 stuff before the first line;

         w
         {
           n;
             q2 stuff after the first line;
           p;
         }

         q3 stuff after the last line is handled;
      "



HOW DO I INSERT A FILE AT THE TOP OF MY OUTPUT?


Assuming that you can't just process that file first
in the list of parameters to the yelt script, you can
use the F command in either of two ways:

1. -e '1{ q1 filename; F1 {n;p;} } commands for this file'

2.  write your own script like this:

     q0 filename;
     F  { n; p; }
     w
     {
       n;
       q1 your commands for this file
       p;
     }



HOW DO I SIMULATE C #INCLUDE PROCESSING?


   w
   {
     n;

     /^ *# *include *["<]/
     {
       #
       #  we replace the #include line with its
       #  contents
       #

       s/^[^"<]*["<]//1;
       s/[">].*//1;

       #
       # process the file whose name is in var0
       #
       F 
       {
         n;
         p;
       }


       # 
       #  don't print the #include line -- or 
       #  you could turn it into a comment and print it
       #  anyway
       #
       d; 
     }

     p;
   }
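
For comparison, here is the same idea modeled in Python.  This is a sketch,
not yelt code:  the names are made up, the caller supplies the function that
reads a file, and, like the script above, it copies included files as-is
without expanding includes nested inside them.

```python
import re

# Matches lines such as:  #include "file.h"   or   # include <file.h>
INCLUDE = re.compile(r'^ *# *include *["<]([^">]*)[">]')


def expand_includes(lines, read_file):
    """Replace each #include line with the lines of the named file.

    `read_file` is a caller-supplied function that takes a file name and
    returns a list of lines.
    """
    output = []
    for line in lines:
        match = INCLUDE.search(line)
        if match:
            output.extend(read_file(match.group(1)))  # F { n; p; }
            continue                                  # d: drop the #include line
        output.append(line)                           # p
    return output


files = {"a.h": ["int a;"], "b.h": ["int b;"]}
source = ['#include "a.h"', "  # include <b.h>", "int main;"]
print(expand_includes(source, files.__getitem__))
# ['int a;', 'int b;', 'int main;']
```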



HOW DO I WRITE TO FILES OTHER THAN STDOUT?


Yelt has a W command that lets you write to files
but it is clunky to use:

  -e 'W01'

Here's a practical example use:

 -S "q2 \n; w{n; A01; A21;} q3 text.out; W13;"

Here's what the script does:

  * before reading from stdin, populate the string
    variable 2 with a newline for use as a line
    separator in the file that we are creating

  * the main loop of the script reads from stdin
    into variable 0.  It then appends this text
    to the end of string register 1.  Next it
    appends the newline character found in register
    2 to the end of register 1.

  * the script does not print the lines of text as
    it executes

  * after the last line of input is read, the
    name of the output file, text.out, is stored in
    register 3.  Finally, the W13 command
    writes the saved up data from register 1 to 
    the file whose name is in register 3.

This would not be an appropriate way to handle a
gigabyte file. You might want to have yelt prefix
output records with some identifier so that you can
later split the output into multiple streams using
grep.

  


HOW DO I AVOID CREATING A SEPARATE YELT SCRIPT WHEN WRITING A UNIX COMMAND LINE SCRIPT?

The common unix command line interpreters, bash,
bourne, csh, and ksh all support the following 
syntax:

  command <<TOKEN
    stuf
  TOKEN

This syntax means that the shell collects the lines
up to TOKEN (typically in a temporary file or a pipe),
in this case just 'stuf', and feeds them to the stdin
of the 'command' being executed.

You can use this feature to avoid storing yelt scripts
in files separate from any command line scripts you
write.  Here's a trivial example:

  yelt -f - someFile <<TOKEN 
    w { n; s/stuff/junk/g; p; }
  TOKEN

This command line automatically creates a temporary
file containing

  w { n; s/stuff/junk/g; p; }

and feeds it to yelt as if you had typed:

  cat tmpFile | yelt -f - someFile

Because of the yelt command line options (-f -), the
script is read from stdin, while the text being operated
upon comes from someFile.  So every 'stuff' in someFile
is converted into 'junk' and the result is printed to
stdout.
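
One thing to watch for: with an unquoted TOKEN the shell expands
variables and backquotes inside the here-document before the command
ever sees them.  Quoting the token turns that off.  The sketch below
uses cat rather than yelt so that it runs anywhere.  The variable
names are arbitrary.

```shell
#!/bin/sh
name=world

# Unquoted delimiter: the shell replaces $name before cat sees it
expanded=$(cat <<TOKEN
s/hello/$name/
TOKEN
)

# Quoted delimiter: the text is passed through untouched
literal=$(cat <<'TOKEN'
s/hello/$name/
TOKEN
)

printf '%s\n%s\n' "$expanded" "$literal"
```

Note also that the closing TOKEN must start in column 1 of a real
script.  The examples on this page are indented only for display.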

  


HOW DO I EXTRACT ANCHOR URLS FROM HTML FILES?



The following script works only on Wikipedia HTML pages because
it intentionally keeps only the Wikipedia internal links,
but the basic idea for a general-case solution is as follows:

1.  Get the html of interest into a file.  An easy way to
    do this is to use the wget program.  It comes with Linux
    and a Windows version is available from sourceforge.net.
    If you are on Windows, you'll probably want to go into
    Internet Explorer and change the options so that it does
    not always serve pages from the IE cache.  For some reason,
    on Windows the wget program interacts with IE -- perhaps
    it uses an IE library.  In any event, you will have a
    problem getting the latest updates to your web files unless
    you ask IE to always at least check for a newer file.

    To run wget, do this:

      wget http://www.wikipedia.org/wiki/Subject

    This will create a file named Subject in your current
    directory.  Try using the -E or -O options if you want it
    named something else (like Subject.html).

    Note:  the browser's File/Save menu works too, but you get
    more files than you need.

2.  The following script can be used on a downloaded .html file
    and will produce an output like this:

      /wiki/subject| anchor text describing the link
      ...

    Note that when writing your own script, you might want to
    keep the full text of the url instead of just the path
    that the following script keeps:

      http://wikipedia.org/wiki/Subject| stuff

3.  The following Bourne shell script runs yelt in two steps
    to extract the links:

      A.  The first step splits the anchors into a form that is
          easy for the second step to parse:

            any instance of "<a", "</a", or ">" is moved to the
            start of a line of its own

      B.  The second step detects "<a" and splits out the 
          href information so that it can become the first
          part of the output line.

          Then, the body of the anchor tag group (<a ... </a)
          is appended together into a single long line for
          printing along with the url that goes with it.

Here is the Bourne shell script that invokes yelt to split out
the desired urls from Wikipedia articles:

    #!/bin/sh

    # ensure that each <a, </a and > starts a new line
    # and store the results in tmp1
    
    yelt -e 's/<\/*a/\n&/g' -e 's/>/\n>\n/g' "$1" >tmp1
    
    #
    #  Note that tmp1 has extra blank lines over and above
    #  what's in the original text
    #
    
    yelt -f - <<EOF tmp1 >tmp2
    
      w
      {
        #
        # outer loop to read all the lines in the file
        #
        n;
    
        /^<a/
        {
          #
          # handle each anchor line group -- we are expecting
          # to see
          #
          #    <a href="/wiki/" ..
          #    >
          #    anchor stuff
          #    </a
          #    >
          #
          # and we are going to produce this output
          #
          #    hrefstuff | anchorstuff 
    
          #
          # skip some well-known things
          #
          /Wiktionary/d;
          /:/d;
    
          # 
          # handle only the internal wikipedia entries
          #
    
          /href="\/wiki\//
          {
            #
            #  move the href into variable 1
            #
    
            s/^[^"]\+"//g;
            s/".*//g;
    
            a01;
    
            #
            #  now, find the end of the <a tag -- which has
            #  been spread out over multiple lines
            #
    
            w
            {
              n;
    
              /^>/b;
            }
    
            #
            #  we are now between the <a ...> group and the </a ...>
            #  group that terminates it.
            #
            #  let's print the contents of the a tag's text field in
            #  the tabular form needed by our output.  First let's append
            #  a | to the end of the href so that we can
            #  split on it in later processing steps
            #
    
            q2 |;
    
            A21;
    
            w
            {
              #
              #  read lines inside the <a>..</a> block and
              #  append them to variable 1 for final printing
              #  later
              #
    
              n;
              #
              #  quit if we are at the end of the
              #  anchor block
              #
              /<\/a/b;
    
              /<a/
              {
                q3 FATAL ERROR -- <A>...</A> SYNTAX IS NOT MATCHED;
                p3;
                Q;
              }
    
              s/|/ OR /g;
    
              s/.*/ &/1;
    
              A01;
            }
    
            p1;
    
          }
    
          # ignore any lines that get to here
    
        }
      }
    
    EOF
    
    cat tmp2
    
    rm tmp1 tmp2
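
If yelt is not at hand, the same two-step idea (split the anchors
onto their own lines, then rewrite each one) can be roughly sketched
with standard sed.  This is a simplified illustration, not a
translation of the script above: it only handles anchors whose text
contains no nested tags and which fit on one input line, and the
function name extract_wiki_links is invented for the example.

```shell
#!/bin/sh
# extract_wiki_links FILE  (hypothetical helper name)
# Print "href| anchor text" for internal /wiki/ links.  Links whose
# href contains ":" are skipped, as in the yelt script.
extract_wiki_links() {
  # Step 1: start a new line before every "<a " so that each line
  #         holds at most one opening anchor tag.
  # Step 2: keep only lines that begin with an internal link and
  #         rewrite them as "href| text".
  sed 's/<a /\
<a /g' "$1" |
    sed -n 's/^<a [^>]*href="\(\/wiki\/[^":]*\)"[^>]*>\([^<]*\)<\/a>.*/\1| \2/p'
}
```

Invoke it as  extract_wiki_links Subject.html  on a page you have
already downloaded.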




HOW DO I PROCESS REPEATED BLOCKS OF TEXT?



The n command reads lines of text into registers.  The N command
pushes lines of text back into the input stream for the n command to read
again (for the first time if need be).

This is useful for processing blocks of text that have section delimiters.
An outer while loop reads until it finds a section delimiter, then an inner
while loop reads until the next section delimiter.  The inner loop pushes that
delimiter back into the input stream and breaks out, at which point the outer
loop re-reads the delimiter and re-enters the inner loop to process the next
section.

For example, suppose we have the following input data:

  header information 1
  header information 2
  header information 3
  
    Section 1
  
        s1 line 1
        s1 line 2
        s1 line 3
  
    Section 2
  
        s2 line 1
        s2 line 2
        s2 line 3
        s2 line 4
  
    Section 3
  
        s3 line 1
        s3 line 2
        s3 line 3
        s3 line 4
        s3 line 5
  
  
  
  
     trailer information 1
     trailer information 2
     trailer information 3

Suppose we want to process only the lines between the section delimiters.
The following script does so; it simply prints each of those lines with its
section id in column 1:

  w
  {
    #
    #  Handle the whole file consisting of multiple sections -- process
    #  each section in the following way:
    #
    #   ignore the header lines
    #
    #   print each section's lines with the section number in column 1
    #
    #   ignore the trailer information
    #
  
    # read the next top level line and interpret what major group of lines
    # you are in -- sections, header, or trailer
  
    n;

    if /^ *Section/
    {
      #
      #  When a section line is found, save the section number into r1
      #
  
      a01;
      s1/^ *Section *\(.*\)/\1 /g;
  
      #
      # Then read the rest of the lines in the section and print them --
      # prefixed with the section number
      #
  
      w
      {
        n;

        #
        # If you read a line inside a section, first check to see if it is
        # the beginning of a new section or the beginning of the trailer.
        # If so, push the line back into the input stream and break out of
        # this inner loop -- this will leave the text available to be read
        # by the outer loop, which can then start the next section or break
        # out of the script as needed
        #
  
        /^ *Section/
        {
          N;
          b;
        }
  
        /^ *trailer/
        {
          N;
          b;
        }
  
        # at this point we know that we are in a section whose number is in r1
        # and we want to print the line of text, found in r0, with the contents
        # of r1 as a line prefix -- we'll use r2 as a working buffer to construct
        # the output line.
  
        a12;
  
        A02;
        p2;
  
  
      }
  
    }
    else
    {
      #
      # ignore the header lines completely
      # and quit the script on the first trailer line
      #
  
      /^ *trailer/ b;
    }
    
  }
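
For readers more familiar with the shell, the sketch below produces
the same kind of output from the same input.  The shell has no way
to push a line back into the input, so this version uses a different
technique: a single loop that remembers the current section number
in a variable.  The function name prefix_sections is made up for
this example.

```shell
#!/bin/sh
# prefix_sections  (hypothetical helper; reads stdin, writes stdout)
prefix_sections() {
  section=
  while IFS= read -r line; do
    # drop leading blanks so the keyword tests below stay simple
    trimmed=${line#"${line%%[! ]*}"}
    case $trimmed in
      Section*)
        # new section: remember its number (the text after "Section")
        section=${trimmed#Section}
        section=${section#"${section%%[! ]*}"}
        ;;
      trailer*)
        break   # first trailer line: nothing more to print
        ;;
      *)
        # header lines arrive before any section is set and are skipped
        if [ -n "$section" ]; then
          printf '%s %s\n' "$section" "$line"
        fi
        ;;
    esac
  done
}
```

Run it as  prefix_sections <inputFile  where inputFile holds data
laid out like the sample above.  The pushback approach keeps the
per-section logic in its own loop, which is easier to extend when
each section needs more elaborate handling.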