Hawk

Haskell-awk

Haskell text processor for the command-line

Mario Pastorelli

Introduction

awk

  • a generic text processor where

    “A file is treated as a sequence of records, and by default each line is a record.” - Alfred V. Aho

  • developed in 1977 by Alfred Aho, Peter Weinberger, and Brian Kernighan @ Bell Labs

  • uses AWK as its programming language
    awk 'BEGIN { print "Hello World!" }'
    • procedural
    • interpreted
    • a program is a series of pattern-action pairs

Why another awk?

    “Whenever faced with a problem, some people say `Lets use AWK.' Now, they have two problems.” - D. Tilbrook

  • avoid the AWK programming language
    • use a generic language, not a DSL
    • BEGIN{split("a b c c a",a);for(i in a)b[a[i]]=1;r="";for(i in b)r=r" "i;print r}
      nub $ words "a b c c a"

  • procedural (imperative) vs functional programming for stream processing

Haskell-awk (Hawk)

  • a generic text processor where

    “A stream is treated as a sequence of records, and by default each line is a record.”

    the same philosophy as awk!

  • developed in 2013 by me and Samuel Gélineau, the name is a tribute to awk

  • uses Haskell as its programming language
    hawk '"Hello World!"'
    • functional
    • (incrementally) compiled
    • a program is a Haskell expression

Why Haskell

  • expressive, clean and concise
    > filter odd [1,2,3,4]
    [1,3]
  • functions as composable building blocks
    > let wordCount = sum . map (length . words) . lines
    
    > :type wordCount
    wordCount :: String -> Int
    
    > wordCount "1 2 3\n4 5 6\n7 8 9"
    9
    
  • partial application
    > :type map
    map :: (a -> b) -> [a] -> [b]
    
    > :type not
    not :: Bool -> Bool
    
    > :type map not
    map not :: [Bool] -> [Bool]
    
    > map not [True,False]
    [False,True]
    
  • point-free style, laziness ...

Hawk

Modes

  • evaluate an expression
    $ hawk '1'
    1
    
    $ hawk '[1,2]'
    1
    2
    
    $ hawk '[[1,2],[3,4]]'
    1 2
    3 4
    

  • apply an expression to the input
    $ echo '1\n2\n3' | hawk -a 'L.reverse'
    3
    2
    1
    

  • map an expression to each record of the input
    $ echo '1 2\n3 4' | hawk -m 'L.reverse'
    2 1
    4 3
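
As a rough mental model of the apply and map modes, here is a small sketch. It is not Hawk's actual implementation: `parseInput`, `renderOutput`, `applyMode` and `mapMode` are hypothetical names, and plain `String` stands in for the ByteString that Hawk uses.

```haskell
-- Illustrative model only: these names are hypothetical, not Hawk's API.
type Record = [String]

-- Default input shape: one record per line, fields split on whitespace.
parseInput :: String -> [Record]
parseInput = map words . lines

-- Default output shape: fields joined by spaces, one record per line.
renderOutput :: [Record] -> String
renderOutput = unlines . map unwords

-- Apply mode (-a): the expression sees the whole input at once.
applyMode :: ([Record] -> [Record]) -> String -> String
applyMode f = renderOutput . f . parseInput

-- Map mode (-m): the expression is run on each record separately.
mapMode :: (Record -> Record) -> String -> String
mapMode f = renderOutput . map f . parseInput
```

In this model, `applyMode reverse` reverses the order of the lines, while `mapMode reverse` reverses the fields inside each line, which mirrors the two examples above.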
    

IO format

  • The input is, by default, a list of lists of strings where lines are separated by \n and words by spaces
    $ echo '1 2\n3 4' | hawk -a 'show'
    [["1","2"],["3","4"]]
    
  • Options -d/-D are provided to change delimiters or set them to empty
    $ echo '1,2;3,4' | hawk -a -d',' -D';' 'show'
    [["1","2"],["3","4"]]
    
    $ echo '1 2\n3 4' | hawk -a -d'' 'show'
    ["1 2","3 4"]
    
    $ echo '1 2\n3 4' | hawk -a -d'' -D'' 'show'
    "1 2\n3 4\n"
    

  • The output can be any type that is an instance of the Rows typeclass
    class (Show a) => Rows a where
    	repr :: ByteString -> a -> [ByteString]
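
The default input shape and the effect of the -d/-D delimiters can be approximated in plain Haskell. This is a sketch under assumptions, not Hawk's parser: `splitOn` and `parseTable` are hypothetical helpers, `String` replaces ByteString, and the empty-delimiter cases (where the input type itself changes) are not modelled.

```haskell
import Data.List (isPrefixOf)

-- Split a string on a non-empty delimiter (hypothetical helper).
-- An empty delimiter is not supported here and would not terminate.
splitOn :: String -> String -> [String]
splitOn delim = go
  where
    go s = case breakOn s of
      (chunk, Nothing)   -> [chunk]
      (chunk, Just rest) -> chunk : go rest
    -- Return the text before the first delimiter and, if one was found,
    -- the text after it.
    breakOn [] = ([], Nothing)
    breakOn s@(c:cs)
      | delim `isPrefixOf` s = ([], Just (drop (length delim) s))
      | otherwise =
          let (chunk, rest) = breakOn cs
          in (c : chunk, rest)

-- Parse input into records and fields: the first argument plays the role
-- of -d (field delimiter), the second the role of -D (record delimiter).
parseTable :: String -> String -> String -> [[String]]
parseTable fieldDelim recordDelim =
  map (splitOn fieldDelim) . dropTrailingEmpty . splitOn recordDelim
  where
    -- A final record delimiter (e.g. the newline added by echo) leaves an
    -- empty last chunk, which is dropped.
    dropTrailingEmpty xs
      | not (null xs) && null (last xs) = init xs
      | otherwise = xs
```

For example, `parseTable "," ";" "1,2;3,4"` yields `[["1","2"],["3","4"]]`, the same shape as the `-d',' -D';'` example above.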
    

Examples

  • get all users of a UNIX system
    $ cat /etc/passwd | hawk -d: -m 'L.head'
    root
    daemon
    ...
    
  • select username and userid
    $ cat /etc/passwd | hawk -d: -o'\t' -m '\l -> (l !! 0,l !! 2)'
    root	0
    daemon	1
    ...
    
  • sort by username (instead of uid)
    $ cat /etc/passwd | hawk -d: -a 'L.sortBy (compare `on` L.head)'
    bin:x:2:2:bin:/bin:/bin/sh
    daemon:x:1:1:daemon:/usr/sbin:/bin/sh
    ...
    
  • get the number of users using each shell
    $ cat /etc/passwd | hawk -ad: 'L.map (L.head &&& L.length) . L.group . L.sort . L.map L.last'
    /bin/bash:1
    ...
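
The last pipeline can be tried outside Hawk as an ordinary Haskell function. This is a sketch: `countByLastField` is a hypothetical name, and the sample records in the note below are made up for illustration rather than taken from a real /etc/passwd.

```haskell
import Control.Arrow ((&&&))
import Data.List (group, sort)

-- Count how many records share each distinct last field (for passwd
-- records, the login shell). Assumes every record is non-empty, since
-- `last` fails on an empty list.
countByLastField :: [[String]] -> [(String, Int)]
countByLastField = map (head &&& length) . group . sort . map last
```

Given the made-up records `[["root","/bin/bash"],["daemon","/bin/sh"],["bin","/bin/sh"]]`, the result is `[("/bin/bash",1),("/bin/sh",2)]`: sorting brings equal shells next to each other, `group` collects them, and `head &&& length` pairs each shell with its count.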
    

Context

  • Hawk can be customized using files inside the context directory (by default ~/.hawk)

  • The most important file is prelude.hs, which contains the "runtime context"
    $ cat ~/.hawk/prelude.hs
    {-# LANGUAGE ExtendedDefaultRules, OverloadedStrings #-}
    import Prelude
    import qualified Data.ByteString.Lazy.Char8 as B
    import qualified Data.List as L
    

  • for instance, we can add a function for taking elements in an interval
    $ echo 'takeBetween s e = L.take (e - s) . L.drop s' >> ~/.hawk/prelude.hs
    $ seq 0 100 | hawk -a 'takeBetween 2 4'
    2
    3
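
For reference, the same helper with an explicit type signature, so it can be checked in GHCi before it goes into the prelude. The signature is an assumption chosen for this sketch; the one-liner above leaves the type to inference.

```haskell
-- Take the elements at positions s, s+1, ..., e-1 (a half-open interval).
takeBetween :: Int -> Int -> [a] -> [a]
takeBetween s e = take (e - s) . drop s
```

Because the interval is half-open, `takeBetween 2 4 [0..100]` gives `[2,3]`, which matches the output of the `seq` example.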
    

Implementation

Hawk must be fast

  • cache the context
    • use the timestamp to check if the context has changed since the last run
    • compile it with ghc
  • use locks to compile only once when multiple Hawk instances are running
    hawk '[1..]' | hawk -a 'L.take 3'
  • use ByteString instead of String
  • ...
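
The timestamp check can be sketched as follows. This is only an illustration of the idea, not Hawk's caching code: `isStale` and `needsRecompile` are hypothetical names, and the two file paths are whatever source and compiled files the caller chooses.

```haskell
import System.Directory (doesFileExist, getModificationTime)

-- Pure core: the cache is stale when it is missing (Nothing) or older
-- than the source it was built from.
isStale :: Ord t => t -> Maybe t -> Bool
isStale _ Nothing = True
isStale sourceTime (Just cacheTime) = sourceTime > cacheTime

-- IO wrapper (hypothetical): compare the modification times of a source
-- file and its compiled counterpart.
needsRecompile :: FilePath -> FilePath -> IO Bool
needsRecompile sourcePath cachePath = do
  sourceTime <- getModificationTime sourcePath
  cacheExists <- doesFileExist cachePath
  cacheTime <-
    if cacheExists
      then Just <$> getModificationTime cachePath
      else pure Nothing
  pure (isStale sourceTime cacheTime)
```

Keeping the comparison in a pure function makes the rule easy to test without touching the file system. A real implementation would also need the locking mentioned above, so that two instances do not recompile at the same time.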

Parse and interpret Haskell

  • Hawk combines two Haskell libraries

    • haskell-src-exts to deal with Haskell source code
      > import Language.Haskell.Exts.Parser
      > getTopPragmas "{-# LANGUAGE NoImplicitPrelude,OverloadedStrings #-}\n"
      ParseOk [LanguagePragma (SrcLoc
       {srcFilename = "unknown.hs", srcLine = 1, srcColumn = 1}) 
      [Ident "NoImplicitPrelude",Ident "OverloadedStrings"]]
      

    • hint to interpret the user expression
      > import Language.Haskell.Interpreter
      > runInterpreter $ setImports ["Data.Int"] >> interpret "1" (as :: Int)
      Right 1
      > runInterpreter $ setImports ["Data.Int"] >> interpret "foo" (as :: Int)
      Left (WontCompile [GhcError {errMsg = "Not in scope: `foo'"}])
      

Thank you!


https://github.com/gelisam/hawk