
Hack 41 Log URLs People Mention


Logging URLs on IRC is useful in case you need to refer to them later on. Learn an unusual and interesting way to do it with a shell script on Linux/Unix.

Often, useful URLs are mentioned on a channel, and you cannot visit them straightaway but would like to check them out later. Perhaps you remember someone mentioning the URL of a really cool page containing various useful IRC hacks, but you just cannot remember it. Or maybe you just hate the constant cutting and pasting of URLs that your friends keep posting.

In this hack, you will look at a simple IRC client that is absolutely passive: it just sits in your channel, silently noting down the URLs that pass by. Because such a task would be too simple in a language like Perl, this hack also shows that you can write useful IRC hacks in pure shell script!

6.3.1 The Code

The trivial solution would be to have an input block, emitting just a few commands required to negotiate the connection and join the channel. This block would be piped to netcat, with netcat's output then redirected to another block, munching the server's lines and selecting the PRIVMSG messages that contain a URL.

But the world is never this simple. This architecture has a fatal flaw: once the input block has been emitted, you cannot send any further commands to the server. In particular, you cannot reply to PINGs from the server, so the server will decide the connection is dead and close it unexpectedly. You could, of course, cheat by generating some periodic activity. However, what if you want to offer a more elaborate interface, such as joining further channels on request? Or handle various errors properly?

You should ideally remove this limitation and somehow connect the input and output blocks. But how can one do that? Try thinking about it and see if your ideas work as you expected.

The bash shell can do wonderful things with redirections. You can, for example, redirect to or from a special file that triggers some magic inside bash. One of these files is /dev/tcp/hostname/port, which establishes a TCP connection. You may say, "Whatever, I have my netcat and I love it!", but realize that this way the socket behaves like a file for redirection purposes, which considerably expands your possibilities.

What about the other redirection trick? You need to direct both your input and your output to the socket. The answer is the <> redirection operator, which opens the given (magic) file for both reading and writing. By default it acts on stdin (file descriptor 0), so you also need to point stdout at the same descriptor with >&0.
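To get a feel for <> before pointing it at a socket, you can try it on an ordinary file first; the file name and descriptor number below are arbitrary examples, but the read-write semantics are the same ones the IRC connection relies on:

```shell
#!/bin/bash

# Open a scratch file for both reading and writing on fd 3.
tmp=`mktemp`
echo "hello socket" > "$tmp"

exec 3<>"$tmp"              # fd 3 is now both readable and writable
read -u 3 line              # read the existing line through fd 3
echo "read back: $line"
echo "sent via fd 3" >&3    # write through the very same descriptor
exec 3>&-                   # close it again

cat "$tmp"
rm -f "$tmp"
```

With /dev/tcp/$SERVER/$PORT in place of the scratch file, reads become data arriving from the server and writes become data sent to it.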

In case something goes wrong and you cannot connect, or the server dies, you should try again. That is easy: the read in the main input loop fails and the block bails out, so you just add another while loop around the whole block. Do not forget to sleep for a reasonable time between iterations, in case the connection failure is persistent.

Save the following as urlgrab.sh:

#!/bin/bash
# IRC URL grabber: records all URLs mentioned to a log file.
# This script is public domain.
# Note: the /dev/tcp redirection used below requires bash.

# Configuration section.
SERVER="irc.freenode.net"
PORT="6667"
NICK="urlspy"
IDENT="urlspy"
IRCNAME="URL Grabber"
CHANNEL="#irchacks" # We can specify multiple channels separated by a comma.
LOGFILE="url.log"

# Try to reconnect in case the connection fails.
while true; do

# Standard input/output of this block is redirected to an IRC connection.
{

# We prepare a few raw IRC commands and send them out in advance.  We do not do
# any error checking, so if one of the commands fails, the game is over.
echo "USER $IDENT x x :$IRCNAME"
echo "NICK $NICK"
echo "JOIN $CHANNEL"

while read input; do

    # Strip the CRLF at the end of each line.
    input=`echo "$input" | tr -d '\r\n'`

    # If this is a PING, then send a PONG back.
    ping=`echo "$input" | cut -d " " -f 1`
    if [ "$ping" = "PING" ]; then
        data=`echo "$input" | cut -d " " -f 2-`
        echo "PONG $data"
        continue
    fi

    # One PRIVMSG line looks like:
    # :pasky!pasky@pasky.or.cz PRIVMSG #elinks :(IRC hack ;)
    #  --------source--------- --cmd-- -dest--  ---text-----
    cmd=`echo "$input" | cut -d " " -f 2`
    if [ "$cmd" != "PRIVMSG" ]; then
        continue
    fi

    # Extract the other fields from the message.
    # We must not forget to strip the leading colons from $source and $text.
    source=`echo "$input" | cut -d " " -f 1`
    source=`echo "$source" | sed 's/^://'`

    target=`echo "$input" | cut -d " " -f 3`

    text=`echo "$input" | cut -d " " -f 4-`
    text=`echo "$text" | sed 's/^://'`

    # Our URL-matching regular expression is of course far from perfect.
    # Some more complex ones can be found
    # (e.g., at http://www.regexp.org/486).

    # Sed won't print the lines out on its own because of -n, and the 'p'
    # command will utter the line only if the preceding address (a regexp
    # in our case) is found.  This hack requires GNU sed.

    # Note that the continuation lines of the sed expression MUST start at
    # the beginning of the lines!
    url=`echo "$text" | sed -n 's/^.*\(\(http\|ftp\)s\{0,1\}:\/\/'\
'[\-\.\,\/\%\~\=\@\_\&\:\?\#a-zA-Z0-9]*'\
'[\/\=\#a-zA-Z0-9]\).*$/\1/gp'`

    if [ "$url" ]; then
        # One line in the log shall look like:
        # ----date---- :: ---source--- -> ---dest--- :: ---url---
        echo `date` ":: $source -> $target :: $url" >>$LOGFILE
    fi
done

} <>/dev/tcp/$SERVER/$PORT >&0

sleep 30

done

6.3.2 Running the Hack

First, edit the configuration section at the start of the script to suit your needs. Then just execute the script and watch your log file slowly grow.

To make the script executable, you can use the chmod command:

% chmod u+x urlgrab.sh

Then you can run the script from the command line:

% ./urlgrab.sh

Whenever a URL is detected within a message, it will append a line like this to the log file:

Mar 20 19:39:23 2004 :: pasky!pasky@pasky.or.cz -> #ch :: http://hacks.oreilly.com/

You can use this log file in any way you want, whether it's for your own personal use or to display the most popular links on a web page.
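For example, a short pipeline can rank the most frequently mentioned URLs; this assumes the log format shown above, with the URL as the last ::-separated field:

```shell
#!/bin/bash

# Print the ten most frequently logged URLs, most popular first.
awk -F ' :: ' '{ print $NF }' url.log | sort | uniq -c | sort -rn | head -10
```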

6.3.3 Hacking the Hack

The basic flaw here is obviously the lack of portability of our redirection tricks. This one is, however, easily fixed. There are alternative solutions, perhaps less elegant, but still very usable.

The simplest approach would involve having an "input" file that you tail -f to netcat. Then you can turn the rest of the script into an output block, where you just append all the commands to the input file, for example:

# ... configuration ...

# If you don't have mktemp installed (http://www.mktemp.org/mktemp/) you can
# use `/tmp/urlgrab.$$` instead, at the risk of a security problem.
TMPFILE=`mktemp`

tail -f $TMPFILE | nc $SERVER $PORT | {

# ... the original block's body ...

} >>$TMPFILE

rm $TMPFILE

Alternatively, you could use mkfifo to pass the data through a named pipe. This is less portable, but it takes virtually no disk space and may be more efficient. In that case, you could use a simple cat instead of tail -f.
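A sketch of that variant, untested against a live server and reusing the $SERVER and $PORT settings from the configuration section:

```shell
#!/bin/bash

# Named-pipe variant: the block's output is fed back to nc through a FIFO
# instead of an ever-growing temporary file.
FIFO=`mktemp -u`    # reserve a name only; mkfifo creates the pipe itself
mkfifo "$FIFO"

cat "$FIFO" | nc "$SERVER" "$PORT" | {
    # ... the original block's body ...
    :
} > "$FIFO"

rm -f "$FIFO"
```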

Of course, there are a lot of other possible enhancements. You should ideally handle any errors correctly. That means some code inflation: you cannot just dump the startup commands blindly, but have to wait for the appropriate numeric replies from the server, confirming that you have connected successfully, before you send the JOIN command.
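A sketch of that idea: the 001 numeric (RPL_WELCOME) is what servers send once registration succeeds, so you could hold the JOIN back until it appears. The function below only parses lines; a real version would also need to answer any PING that arrives during registration and watch for error numerics such as 433 (nickname in use):

```shell
#!/bin/bash

# Read server lines from stdin until the 001 (welcome) numeric arrives,
# the usual sign that registration succeeded and JOIN may be sent.
wait_for_welcome() {
    while read line; do
        num=`echo "$line" | tr -d '\r' | cut -d " " -f 2`
        if [ "$num" = "001" ]; then
            return 0
        fi
    done
    return 1    # connection closed before we were welcomed
}

# Inside the redirected block you would then write:
#   echo "USER $IDENT x x :$IRCNAME"
#   echo "NICK $NICK"
#   wait_for_welcome && echo "JOIN $CHANNEL"
```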

Another issue is cycling between multiple IRC servers, which is a must for a reliable IRC bot. It is easy to do using cut and a cycling counter.
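A sketch of that counter, with an illustrative server list; each reconnect iteration would call next_server before opening /dev/tcp/$SERVER/$PORT:

```shell
#!/bin/bash

# Cycle through a space-separated server list using cut and a counter.
SERVERS="irc.freenode.net irc.example.org irc.example.net"   # example list
NSERVERS=`echo "$SERVERS" | wc -w`
i=1

next_server() {
    SERVER=`echo "$SERVERS" | cut -d " " -f $i`
    i=`expr $i % $NSERVERS + 1`    # advance the counter, wrapping around
}

next_server; echo "$SERVER"    # irc.freenode.net
next_server; echo "$SERVER"    # irc.example.org
```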

Maybe you would like to log the whole line containing the URL instead of just the URL itself? This is useful if you are interested in the context or if there is a description placed near the URL. All you will need to do is modify the sed script from s/regexp/\1/gp to /regexp/p.
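The whole-line variant would then look like this (same regular expression, only the surrounding sed command changes; GNU sed is still assumed, and the message text is just an example):

```shell
#!/bin/bash

# Print every message line that contains a URL, rather than just the URL.
text="check out http://hacks.oreilly.com/ sometime"
echo "$text" | sed -n '/\(http\|ftp\)s\{0,1\}:\/\/'\
'[\-\.\,\/\%\~\=\@\_\&\:\?\#a-zA-Z0-9]*'\
'[\/\=\#a-zA-Z0-9]/p'
```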

A significantly more challenging problem is handling the possibility of multiple URLs in the same message. When you use just a simple search instead of the substitution, as outlined earlier, this is not an issue, but otherwise you would need to weed out the non-URL parts. Even though this problem would probably be solvable in sed, at this level of complexity it is wiser to switch to something more convenient, such as Perl. This would result in replacing the sed statement with something like this:

perl -nle 'print join (" ",
    m$((?:http|ftp)s?://[-\.,/%~=\@_\&:\?#a-zA-Z0-9]*[\/=#a-zA-Z0-9])$g);'

The -e flag makes Perl execute the given statement, -n runs the statement for each line of input, and -l strips the trailing newline from each input line and appends one to each print. m$regexp$g matches the input as many times as possible and returns the list of matched URLs, which is then joined with spaces and printed.

This hack doesn't even consider all the possibilities regarding various uses of the logged URLs, from automatic opening in a web browser to storing them in an SQL database [Hack #42] . This part is definitely up to your imagination.

Petr Baudis
