HTML FAQ to Text

---------

John Relph (relph@presto.ig.com)
Wed, 24 May 95 22:42:27 PDT


I use the following SunOS makefile to convert my HTML FAQ into text:

---- cut here (makefile) ----

%: %.html
	$(RM) temp.html temp.out
	sed -e 's=</*I>=_=g' -e 's=</*STRONG>=\*=g' \
	    -e 's=BLOCKQUOTE=PRE=g' -e 's=<HR>=-------=' $< > temp.out
	awk -f faq.awk temp.out > temp.html
	$(RM) temp.out
	lynx -dump temp.html > temp.out
	cat temp.out | tail +5 | \
	sed \
	    -e 's/^ -------/----------------------------------------------------------------------/' \
	    -e 's/^ -------/------------------------------/' \
	    -e 's/ *$$//' -e 's/^ //' | uniq > $@
	$(RM) temp.html temp.out

---- and here (faq.awk) ----
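#!/bin/awk -f
#
# Looks for <PRE> pre-formatted regions and removes the trailing <BR>
# from lines inside those regions.
#
BEGIN {
    pre = 0;
}

# Angle brackets are left unescaped: some awks (e.g. gawk) treat
# \< and \> as word-boundary operators rather than literal brackets.
$1 ~ /<PRE>/ {
    pre = 1;
}

$1 ~ /<\/PRE>/ {
    pre = 0;
}

$NF ~ /<BR>/ {
    if (pre == 1) {
        # scan backwards for the last <BR> and print what precedes it
        k = length($0) - 3;
        while (substr($0, k, 4) != "<BR>")
            k = k - 1;
        print substr($0, 1, k - 1);    # awk strings are 1-indexed
    }
    else
        print $0;
    next;
}

{
    print $0;
}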

---- and here ----
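
To see what the awk step does, here is a small sketch that wraps a
condensed version of faq.awk in a shell function and feeds it a few
made-up lines. The function name strip_pre_br and the sample input are
my own assumptions for illustration, not part of the makefile above,
and the regexes use plain angle brackets on the assumption that this
is the portable spelling.

```shell
# Condensed sketch of faq.awk as a shell function (strip_pre_br is a
# hypothetical name). Reads HTML on stdin, writes filtered HTML on stdout.
strip_pre_br() {
    awk '
        $1 ~ /<PRE>/   { pre = 1 }     # entering a pre-formatted region
        $1 ~ /<\/PRE>/ { pre = 0 }     # leaving it
        pre == 1 && $NF ~ /<BR>/ {
            k = length($0) - 3         # last position where "<BR>" could start
            while (substr($0, k, 4) != "<BR>")
                k = k - 1
            print substr($0, 1, k - 1) # everything before the last <BR>
            next
        }
        { print }
    '
}

# Made-up sample input: only the lines inside <PRE> lose their <BR>.
printf '%s\n' '<PRE>' '    line one<BR>' '    line two<BR>' '</PRE>' 'outside<BR>' |
    strip_pre_br
```

The two lines inside the <PRE> region come out without their trailing
<BR>, while the "outside<BR>" line passes through untouched, so lynx
still sees a line break there.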

Check out one of my FAQs for the input format I use, which must be
followed to get good-looking text out of the system. For example, the
Chalkhills FAQ is at "http://idaho.ig.com/chalkhlls/html/FAQ.html".

Specifically, I use <BLOCKQUOTE> for quotations, but the text in both
<BLOCKQUOTE> and <PRE> sections must be indented 12 spaces and laid
out as if it were <PRE>-formatted, because the text version changes
all <BLOCKQUOTE>s to <PRE>s before formatting. Note the answers are
all in a <UL> list. The script also converts <HR>s to something
vaguely resembling RFC 1153 format (very vaguely). I also convert
<I>talics to _underlines_ (because I deal with a lot of album titles),
and <STRONG>s to *emphasis here*, just to get the point across.
Anyway, as I say, the best way to get an idea of the format is
actually to look at the HTML source for one of my FAQs.
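
As a rough illustration of the tag conversions described above, the
following sketch runs the same sed expressions as the first step of
the makefile over a few sample lines. The function name
html_to_plain_markup and the sample text are assumptions made up for
this example.

```shell
# Same sed expressions as the first makefile step, wrapped in a
# function (html_to_plain_markup is a hypothetical name).
html_to_plain_markup() {
    sed -e 's=</*I>=_=g' \
        -e 's=</*STRONG>=\*=g' \
        -e 's=BLOCKQUOTE=PRE=g' \
        -e 's=<HR>=-------='
}

# Made-up sample lines covering each conversion.
printf '%s\n' \
    '<I>Some Album</I> is <STRONG>good</STRONG>' \
    '<BLOCKQUOTE>' \
    '</BLOCKQUOTE>' \
    '<HR>' |
    html_to_plain_markup
# prints:
#   _Some Album_ is *good*
#   <PRE>
#   </PRE>
#   -------
```

Because the pattern "</*I>" allows an optional slash, the opening and
closing tags are both replaced by the same character, which is what
produces the paired _underlines_ and *asterisks*.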

I suppose I could re-write this in Perl, but I haven't. Sorry.

-- John

--
http://www.ig.com/~relph/


