
How to extract a Plone site with wget

Contributed by: Roberto Romani

Publication date: January 26, 2007

This solution was developed because we needed to publish a site built with Plone (version 2.5.1) on a machine that could not support a Zope/Plone installation.

The main problem when copying a Plone site to another machine with wget is the CSS files that Plone generates dynamically. wget does not find these files, so the copy of the site is displayed without CSS.

To solve this, a Python script was created at the root of the Plone site that builds the dynamic links. Then it is just a matter of passing the script's address to wget so that it fetches the CSS files as well.

With the main problem solved, the secondary ones remain. Some links to images and other files inside the HTML and CSS files copied by wget still point to the original machine, even though those files were fetched by wget. To fix this, a script was written that uses sed to rewrite the links wget left unresolved.
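
As a minimal sketch of this link rewriting, the snippet below runs sed over a single sample line. The addresses origin.example.com:8080 and www.example.org are hypothetical placeholders, not the machines from the original setup, and the function name rewrite_links is made up for the illustration.

```shell
#!/bin/bash
# Hypothetical addresses, used only for illustration.
ORIGIN='http://origin.example.com:8080/'
DEST='http://www.example.org/'

# Rewrite every occurrence of the origin address read from stdin.
# '|' is used as the sed delimiter so the slashes in the URLs need no escaping.
# The dots in ORIGIN act as regex wildcards, which is harmless for this sample.
rewrite_links() {
    sed "s|${ORIGIN}|${DEST}|g"
}

sample='<img src="http://origin.example.com:8080/portal/logo.gif" />'
result=$(printf '%s\n' "$sample" | rewrite_links)
printf '%s\n' "$result"
```

The same substitution works on CSS files, where the address usually appears inside url(...) references.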

The same sed command also solves another problem we found, this time a malfunction in IE, which fails to display images and generates wrong internal links. The cause is the way IE interprets the base tag. wget writes an empty base tag (<base href="" />) into every HTML file. This tag tells the browser which address to use as the base for every relative link on the page. Other browsers treat an empty base tag as specifying no base address, so the links on the page stay as they are. IE, however, interprets the empty base tag as the root address and prepends a / to every link on the page, which breaks all of them, including links to images. The fix is simply to delete the line with the empty base tag from every copied HTML file. With no base tag present, IE (like the other browsers) leaves the links alone, and they were in fact generated correctly by wget.
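
The deletion can be tried in isolation on a small sample, as in the sketch below. The sample markup and the function name strip_empty_base are invented for the illustration; the sed expression is the one used later in replace.sh.

```shell
#!/bin/bash
# Delete every line that contains the empty base tag, reading from stdin.
strip_empty_base() {
    sed '/<base href="" \/>/d'
}

# Invented sample markup with the empty base tag on a line of its own.
page='<head>
<base href="" />
<title>portal</title>
</head>'

cleaned=$(printf '%s\n' "$page" | strip_empty_base)
printf '%s\n' "$cleaned"
```

Because the whole line is deleted, this assumes the base tag sits on a line of its own, as in the sample.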

The last point to fix is that Plone relies on Zope's inheritance between directories (acquisition) to always find the CSS files. This is easily handled by moving all the CSS files into the Plone Default folder. Note that this is done on the files copied by wget, so there is no need to change how Plone handles CSS internally.
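
A minimal sketch of that move, run on a throwaway directory tree so nothing real is touched. The layout portal/portal_css/Plone Default is assumed from the wget.sh script further down, and the file name ploneStyles1234.css is hypothetical.

```shell
#!/bin/bash
# Build a throwaway tree that mimics the assumed layout of the copy.
work=$(mktemp -d)
mkdir -p "$work/portal/portal_css/Plone Default"
touch "$work/portal/portal_css/ploneStyles1234.css"   # hypothetical file name

# Move the generated style sheets into the Plone Default folder.
# The glob is left outside the quotes so the shell can expand it.
mv "$work"/portal/portal_css/ploneStyles*.css "$work/portal/portal_css/Plone Default/"

ls "$work/portal/portal_css/Plone Default"
```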

Executive Summary :-)

To make the job easier I wrote the script wget.sh, which runs wget and applies the adjustments described above.

To implement this solution, follow these 3 steps:

Step 1

Create the following Python script at the root of the Plone site. Name it archive_portal.

archive_portal

style_sheets = ['plone.css','ploneColumns.css','ploneCustom.css']

# Static version
#dac_style_sheets = ['ploneStyles9563.css','ploneStyles0959.css','ploneStyles2195.css', 'ploneStyles3726.css', 'ploneStyles1531.css', 'ploneStyles2712.css']

# Dynamic version
plone_style_sheets = ["portal_css/"+i.getId() for i in context.portal_css.getCookedResources()]
# Another way of building the dynamic version
#plone_style_sheets = []
#for i in context.portal_css.getCookedResources():
#    plone_style_sheets.append("portal_css/"+i.getId())

graphics = ['bullet.gif','portal_logo']
index = 'index.html'

print "<html><head><title>archive portal</title></head><body>"
print "<a href='%s'>site entry point</a><br>" % context.absolute_url()
print "<a href='%s/%s'>site index</a><br>" % (context.absolute_url(), index)

#for item_name in plone_style_sheets:
#    print "<a href='%s/portal_css/%s'>%s</a><br>" % (context.absolute_url(),item_name, item_name)

# One link per style sheet and image, so that wget can find them
for item_name in style_sheets + graphics + plone_style_sheets:
    print "<a href='%s/%s'>%s</a><br>" % (context.absolute_url(),item_name, item_name)

# One link per published folder, pointing at its index page
for folder in context.portal_catalog(portal_type='Folder',review_state='published'):
    print "<a href='%s/%s'>%s</a><br>" % (folder.getURL(), index, folder.id)

print "</body></html>"
return printed
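
To see what wget will find when it fetches this page, the link targets can be listed with grep and sed. The sketch below works on a hand-written sample of the markup the script prints; the host origin.example.com:8080 is a hypothetical placeholder and list_links is a name made up for the illustration.

```shell
#!/bin/bash
# Hand-written sample of the markup printed by archive_portal (hypothetical host).
page="<html><head><title>archive portal</title></head><body>
<a href='http://origin.example.com:8080/portal'>site entry point</a><br>
<a href='http://origin.example.com:8080/portal/plone.css'>plone.css</a><br>
</body></html>"

# Print the link targets, one per line:
# grep -o keeps only the href='...' attributes, sed strips the wrapper.
list_links() {
    grep -o "href='[^']*'" | sed "s/^href='//;s/'\$//"
}

links=$(printf '%s\n' "$page" | list_links)
printf '%s\n' "$links"
```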

Step 2

Configure wget

I suggest creating a .wgetrc file in the home directory of the user who will run wget.

The file I used, with comments, is the following:

.wgetrc

#input = urls-to-get

# -B URL
# --base=URL
# When used in conjunction with -F, prepends URL to relative links in the file specified by -i.
base = http://www.example.com/prg/dac/dac_plone/dac/index.html

#-nH
# --no-host-directories
# Disable generation of host-prefixed directories. By default, invoking Wget with -r http://fly.srk.fer.hr/
# will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.
# no-host-directories = on and nH = on do not work here (-nH only works on the command line)

# Set directory prefix to prefix. The directory prefix is the
# directory where all other files and subdirectories will be saved to,
# i.e. the top of the retrieval tree. The default is . (the current
# directory).
#dir_prefix = /www/ns-home/docs/prg/dac/dac_plone/

# Specify comma-separated lists of file name suffixes or patterns to
# accept or reject (@pxref{Types of Files} for more details).
reject = author,copyright,sendto_form,folder_listing,topic*

# Change which characters found in remote URLs may show up in local
# file names generated from those URLs. Characters that are
# restricted by this option are escaped, i.e. replaced with %HH,
# where HH is the hexadecimal number that corresponds to the
# restricted character.

# By default, Wget escapes the characters that are not valid as part
# of file names on your operating system, as well as control
# characters that are typically unprintable. This option is useful for
# changing these defaults, either because you are downloading to a
# non-native partition, or because you want to disable escaping of
# the control characters.
#restrict_file_names = nocontrol
#restrict_file_names = windows
restrict_file_names = unix

# Do not ever ascend to the parent directory when retrieving
# recursively. This is a useful option, since it guarantees that only the
# files below a certain hierarchy will be downloaded.
no_parent = on

# This option causes Wget to download all the files that are
# necessary to properly display a given HTML page. This includes such
# things as inlined images, sounds, and referenced stylesheets.
page_requisites = on

# After the download is complete, convert the links in the document
# to make them suitable for local viewing. This affects not only the
# visible hyperlinks, but any part of the document that links to
# external content, such as embedded images, links to style sheets,
# hyperlinks to non-HTML content, etc.
convert_links = on

# When converting a file, back up the original version with a .orig
# suffix. Affects the behavior of timestamping.
backup-converted = on

# If a file of type application/xhtml+xml or text/html is downloaded
# and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this
# option will cause the suffix .html to be appended to the local
# filename. This is useful, for instance, when you are mirroring a
# remote site that uses .asp pages, but you want the mirrored pages
# to be viewable on your stock Apache server. Another good use for
# this is when you are downloading CGI-generated materials. A URL
# like http://site.com/article.cgi?25 will be saved as
# article.cgi?25.html.

# Note that filenames changed in this way will be re-downloaded every
# time you re-mirror a site, because Wget can't tell that the local
# X.html file corresponds to remote URL X (since it doesn't yet know
# that the URL produces output of type text/html or
# application/xhtml+xml). To prevent this re-downloading, you must use convert_links
# and backup-converted so that the original version of the file will be saved as
# X.orig.
html-extension = on

# Log all messages to logfile. The messages are normally reported to
# standard error.
# output-file=/www/ns-home/docs/prg/dac/dac_plone/wgetlogfile
# (-o only works on the command line)

# You can turn on recursive retrieving by default (don't do this if
# you are not sure you know what it means) by setting this to on.
#recursive = off
recursive = on

# Lowering the maximum depth of the recursive retrieval is handy to
# prevent newbies from going too "deep" when they unwittingly start
# the recursive retrieval. The default is 5.
#reclevel = 5
reclevel = inf

# It can be useful to make Wget wait between connections. Set this to
# the number of seconds you want Wget to wait.
#wait = 0

# You can set retrieve quota for beginners by specifying a value
# optionally followed by 'K' (kilobytes) or 'M' (megabytes). The
# default quota is unlimited.
#quota = inf
quota = 1000m

# You can lower (or raise) the default number of retries when
# downloading a file (default is 20).
#tries = 20

# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
# the default!
#robots = on

# By default Wget uses "passive FTP" transfer where the client
# initiates the data connection to the server rather than the other
# way around. That is required on systems behind NAT where the client
# computer cannot be easily reached from the Internet. However, some
# firewalls software explicitly supports active FTP and in fact has
# problems supporting passive transfer. If you are in such
# environment, use "passive_ftp = off" to revert to active FTP.
#passive_ftp = off
passive_ftp = on

# The "wait" command below makes Wget wait between every connection.
# If, instead, you want Wget to wait only between retries of failed
# downloads, set waitretry to maximum number of seconds to wait (Wget
# will use "linear backoff", waiting 1 second after the first failure
# on a file, 2 seconds after the second failure, etc. up to this max).
waitretry = 10

# Set this to on to use timestamping by default:
timestamping = on

# It is a good idea to make Wget send your email address in a `From:'
# header with your request (so that server administrators can contact
# you in case of errors). Wget does *not* send `From:' by default.
#header = From: Your Name <username@site.domain>

# You can set up other headers, like Accept-Language. Accept-Language
# is *not* sent by default.
#header = Accept-Language: en

# You can set the default proxies for Wget to use for http and ftp.
# They will override the value in the environment.
#http_proxy = http://www.example.com/
#ftp_proxy = http://proxy.yoyodyne.com:18023/

# If you do not want to use proxy at all, set this to off.
#use_proxy = on

# You can customize the retrieval outlook. Valid options are default,
# binary, mega and micro.
#dot_style = default

# You can force creating directory structure, even if a single file is being
# retrieved, by setting this to on.
#dirstruct = off

# To always back up file X as X.orig before converting its links (due
# to -k / --convert-links / convert_links = on having been specified),
# set this variable to on:
#backup_converted = off

# To have Wget follow FTP links from HTML files by default, set this
# to on:
#follow_ftp = off

Step 3

Copy the 2 scripts below into an empty directory.

wget.sh

#!/bin/bash
# Script that brings a Plone site called portal
# to another machine using wget
# Author: Roberto Romani
# Date: 22/11/2006

echo
echo "Step 1. Running wget. Please wait ..."
echo -n "Step 1 started at: "
date
# A .wgetrc file with the correct configuration must exist in the user's home directory.
wget -nH -o wget_log.txt http://ENDEREÇO_MAQUINA_ORIGEM:PORTA/portal/archive_portal

echo
echo "Step 2. Rewriting the addresses not resolved by wget and removing the base tag"
# The address to be replaced and its replacement are defined in
# the replace.sh file

LISTA="/tmp/lista.$$"

find portal/. -type f -name \*.css > "$LISTA"
find portal/. -type f -name \*.html >> "$LISTA"

# -s is true only when the list file exists and is not empty
[ -s "$LISTA" ] || {
    echo "No html or css file was found."
}

while IFS= read -r ITEM
do
    ./replace.sh "$ITEM"
done < "$LISTA"
rm -f "$LISTA"

echo
echo "Step 3. Moving the css files to the Plone Default folder"
mv portal/portal_css/ploneStyles*.css portal/portal_css/Plone\ Default/

echo
echo -n "End of script $0 at: "
date

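replace.sh

#!/bin/bash
# Removes the empty base tag and rewrites the origin address
# in the file given as the first argument.

cp -a "$1" "$1.tmp"
sed '/<base href="" \/>/d;s!ENDEREÇO_MAQUINA_ORIGEM/!ENDEREÇO_MAQUINA_DESTINO!g' "$1.tmp" > "$1"
rm "$1.tmp"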

Before running wget.sh, edit replace.sh, changing ENDEREÇO_MAQUINA_ORIGEM to the address the Plone site will be copied from and ENDEREÇO_MAQUINA_DESTINO to the address where the copy will be accessed. There is no need to include the site name, only the address up to it, that is, the domain and any directories that come before the site name. If Plone answers on a specific port, include the port in the origin address as well.

You also need to change ENDEREÇO_MAQUINA_ORIGEM in the line that runs wget in wget.sh.

If the Plone site is not named portal, replace the word portal with the name of the Plone site to be extracted.

Once these changes are made, just run wget.sh. It should create a directory named portal (or named after the copied site) in the directory where it was run, containing a static copy of the original Plone site.
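
One way to check the result is to search the copy for files that still mention the origin address. The sketch below is only an illustration run on a throwaway tree: the address origin.example.com:8080, the sample files and the function name find_leftovers are all hypothetical.

```shell
#!/bin/bash
# Hypothetical origin address; use the one configured in replace.sh.
ORIGIN='origin.example.com:8080'

# Print the files under a directory that still mention the origin address.
# -r recurses, -l lists file names only, -F treats the pattern as a fixed string.
find_leftovers() {
    grep -rlF -- "$ORIGIN" "$1"
}

# Demonstration on a throwaway tree with one bad and one good file.
work=$(mktemp -d)
mkdir -p "$work/portal"
echo '<a href="http://origin.example.com:8080/portal/x">x</a>' > "$work/portal/bad.html"
echo '<a href="x.html">x</a>' > "$work/portal/good.html"

leftovers=$(find_leftovers "$work/portal")
printf '%s\n' "$leftovers"
```

On a real copy, the .orig backups kept by backup-converted may still contain the original address, since replace.sh only touches the .html and .css files.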

Special thanks to Rodrigo Senra for his help in developing the Python script and to Rubens Queiroz for his help with the shell script.
