-
Notifications
You must be signed in to change notification settings - Fork 0
/
exporting-a-drupal-database-to-atom.xhtml
263 lines (239 loc) · 13.5 KB
/
exporting-a-drupal-database-to-atom.xhtml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Exporting a Drupal Database to an Atom/XHTML Bundle</title>
<meta name="author" content="Magnus Achim Deininger" />
<meta name="description" content="The Worst bash/sed/SQL/XSLT Combobulation You'll Ever See(tm): this article presents a nifty script to export a Drupal 5/6 MySQL database to a static Atom bundle." />
<meta name="date" content="2012-11-18T11:49:00Z" />
<meta name="mtime" content="2012-11-18T11:49:00Z" />
<meta name="category" content="Articles" />
<meta name="unix:name" content="exporting-a-drupal-database-to-atom" />
</head>
<body>
<h1>Not Quite a One-Liner</h1>
<p>The other day I stumbled upon an old Drupal database that, miraculously, was still running. Since I've since lost all motivation to set up <del>'real'</del> <ins>'open-source'</ins> CMSs, especially on nginx without MySQL, I thought it'd be a lot nicer if I could just export the whole thing as an .atom file with basic XHTML files in the atom:content nodes. (I'll admit that this output format may seem a wee bit strange. Atom turned out to be a really nice container format for XML files, I'll be writing about serving those files soon.)</p>
<p>Now I thought someone else must've been crazy enough to do this before me, but I was wrong. Guess I just had to do it myself then. So I dug out an inglorious number of scripting languages, cobbled them all together and ended up producing something that might even be useful to others migrating from Drupal.</p>
<p>The result of this script is available on this website, as <a href="archives">the einit.org and kyuba.org archives</a>.</p>
<h1>The Script</h1>
<p>This is the script I came up with. All you need to do is adjust the variables in the header and then it should be good to run. It'll create a file called '(database).atom' with all the major content in the drupal database (excluding comments). You might also want to <a href="download/drupal-to-atom.sh">download this script instead of copying and pasting it into an editor</a>.</p>
<p>The script assumes that you're running it on the same host as the database and that the database is a MySQL database (not sure if Drupal would let you use anything else). It also assumes you have xsltproc (from libxslt) installed on that machine, that you have bash installed and that your sed accepts GNU extensions. So, nothing out of the ordinary on a typical LAMP machine. The database I wrote this with was for a Drupal 5 or Drupal 6 installation. Not sure if it's going to work with other versions, you might have to adjust it a bit.</p>
<p>The variables in the script are as follows:</p>
<ul>
<li>database: the name of the MySQL database</li>
<li>user: the name of the MySQL database user</li>
<li>prefix: the table prefix in the database. Not sure if Drupal supported this, but most LAMP-y designes do. Leave blank if you had a dedicated database for the drupal installation.</li>
<li>timezone: the timezone suffix for any dates taken from the database. Either use 'Z' or '+....' for whatever offset should be applied. There's no decent way to query this from a MySQL database and it sneakily applies timezone adjustments to Unix dates without telling you at times, so you'll have to provide this manually if you care enough. Otherwise just use 'Z'.</li>
<li>categoryprefix: this specifies a secondary category to put each of the Drupal nodes into. I'm using this to differentiate between different sources of content in my combined .atoms. You could just use the domain name here.</li>
<li>fileprefix: I'm adding 'unix:name' meta attributes to the XHTML portions, so that it should be easier to split up when post-processing. This is a prefix for these meta attributes.</li>
<li>tbase: this is the prefix for any temporary files the script generates. Sadly, xsltproc doesn't seem to be able to get a stylesheet piped into it, so I have to write those XSLT scripts to real files.</li>
<li>name: you don't really need this one, I'm just using it to make the script slightly easier to read.</li>
</ul>
<h2>The Worst bash/sed/SQL/XSLT Combobulation You'll Ever See</h2>
<pre><code><![CDATA[#!/bin/bash
# configuration parameters for the scripts
name=kyuba
database=drupal-${name}
user=root
prefix=
timezone=Z
categoryprefix=${name}.org
fileprefix=${name}:
tbase=/tmp/drupal-x-
cat > ${tbase}pre-sed.xslt <<atomManglePreSedSQL
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="http://www.w3.org/2005/Atom"
exclude-result-prefixes="xhtml atom"
version="1.0">
<xsl:output method="xml" version="1.0" encoding="UTF-8"
indent="no"
media-type="application/atom+xml" />
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="atom:summary/text()">
<xsl:variable name="text" select="translate(normalize-space(.),'*[]<>=','-()')"/>
<xsl:value-of select="substring(\$text,1,160)"/>
<xsl:if test="string-length(\$text) > 160">...</xsl:if>
</xsl:template>
<xsl:template match="atom:category/@term[.='blog']">
<xsl:attribute name="term">Blog Post</xsl:attribute>
</xsl:template>
<xsl:template match="atom:category/@term[.='forum']">
<xsl:attribute name="term">Forum Post</xsl:attribute>
</xsl:template>
<xsl:template match="atom:category/@term[.='project_project']">
<xsl:attribute name="term">Project Description</xsl:attribute>
</xsl:template>
<xsl:template match="atom:category/@term[.='project_release']">
<xsl:attribute name="term">Project Release</xsl:attribute>
</xsl:template>
<xsl:template match="atom:category/@term[.='project_issue']">
<xsl:attribute name="term">Project Issue</xsl:attribute>
</xsl:template>
<xsl:template match="atom:category/@term[.='page']">
<xsl:attribute name="term">Page</xsl:attribute>
</xsl:template>
<xsl:template match="atom:category/@term[.='book']">
<xsl:attribute name="term">Book</xsl:attribute>
</xsl:template>
<xsl:template match="atom:category/@term[.='image']">
<xsl:attribute name="term">Image</xsl:attribute>
</xsl:template>
<xsl:template match="atom:feed/atom:title|atom:subtitle">
<xsl:copy>
<xsl:value-of select="substring-before(substring-after(text(),'"'),'"')"/>
</xsl:copy>
</xsl:template>
<xsl:template match="atom:content[@type='text']">
<content type='application/xhtml+xml'>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title><xsl:value-of select="../atom:title"/></title>
<meta name="date" content="{../atom:updated}"/>
<meta name="author" content="{../atom:author/atom:name}"/>
<meta name="mtime" content="{../atom:updated}"/>
<meta name="unix:name" content="{translate(concat('${fileprefix}',translate(normalize-space(translate(../atom:title,'=()<>[]?#!"/!',' ')),' ','-')),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')}"/>
</head>
<body>
<xsl:value-of select="."/>
</body>
</html>
</content>
</xsl:template>
</xsl:stylesheet>
atomManglePreSedSQL
cat > ${tbase}post-sed.xslt <<atomManglePostSedSQL
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="xhtml atom"
version="1.0">
<xsl:output method="xml" version="1.0" encoding="UTF-8"
indent="no"
media-type="application/atom+xml" />
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="atom:content/xhtml:html/xhtml:head">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
<meta name="description" content="{../../../atom:summary}"/>
<meta name="category" content="{../../../atom:category/@term}"/>
</xsl:copy>
</xsl:template>
<xsl:template match="xhtml:body/text()">
<xsl:variable name="text" select="normalize-space(.)"/>
<xsl:if test="string-length(\$text)>0"><p>
<xsl:value-of select="\$text"/>
</p></xsl:if>
</xsl:template>
<xsl:template match="xhtml:body/xhtml:br"/>
<xsl:template match="xhtml:body/xhtml:code/xhtml:br"/>
<xsl:template match="xhtml:body/xhtml:ul">
<xsl:variable name="name" select="local-name()"/>
<xsl:if test="local-name(preceding-sibling::*[position()=1]) != \$name">
<xsl:copy>
<xsl:apply-templates />
<xsl:apply-templates select="following-sibling::*[1][local-name()=\$name]" mode="next"/>
</xsl:copy>
</xsl:if>
</xsl:template>
<xsl:template match="xhtml:body/xhtml:ul" mode="next">
<xsl:variable name="name" select="local-name()"/>
<xsl:apply-templates />
<xsl:apply-templates select="following-sibling::*[1][local-name()=\$name]" mode="next"/>
</xsl:template>
<xsl:template match="xhtml:a[@rel='need-lookup']">
<xsl:variable name="href" select="@href"/>
<xsl:choose>
<xsl:when test="//xhtml:head[xhtml:title=\$href]">
<xsl:copy>
<xsl:attribute name="href">
<xsl:value-of select="//xhtml:head[xhtml:title=\$href]/xhtml:meta[@name='unix:name']/@content"/>
</xsl:attribute>
<xsl:copy-of select="text()"/>
</xsl:copy>
</xsl:when>
<xsl:otherwise>
<xsl:copy-of select="."/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
atomManglePostSedSQL
cat > ${tbase}db-to-xhtml.sql <<drupalExtractionSQL
select v.entry
from (
select 1 as seq,
concat(
'<?xml version="1.0" encoding="utf-8"?>',
'<feed xmlns="http://www.w3.org/2005/Atom">',]]>
'<title><![CDATA[',(select value from ${prefix}variable where name='site_name'),']]></title>',
'<subtitle><![CDATA[',(select value from ${prefix}variable where name='site_slogan'),']]></subtitle>',
<![CDATA[ '<updated>',(select replace(from_unixtime(timestamp),' ','T') from ${prefix}node_revisions order by timestamp desc limit 1),'${timezone}</updated>',
'<link rel="self" href="${database}"/>',
'<id></id>'
)
as entry
union
select
2 as seq,
concat(
'<entry>',]]>
'<title><![CDATA[',${prefix}node.title,']]></title>',
<![CDATA[ '<updated>',replace(from_unixtime(timestamp),' ','T'), '${timezone}</updated>',]]>
'<author><name><![CDATA[',${prefix}users.name,']]></name><email><![CDATA[',${prefix}users.mail,']]></email></author>',
<![CDATA[ '<id>drupal:${database}:${prefix}:',${prefix}node_revisions.nid,':',${prefix}node_revisions.vid,'</id>',]]>
'<summary><![CDATA[',teaser,']]></summary>',
'<content type="text"><![CDATA[',body,']]></content>',
<![CDATA[ '<category term="',type,'" />',
'<category term="${categoryprefix}"/>',
'</entry>'
)
as entry
from ${prefix}node
left join ${prefix}node_revisions on ${prefix}node.vid = ${prefix}node_revisions.vid
left join ${prefix}users on ${prefix}users.uid = ${prefix}node_revisions.uid
union
select
3 as seq,
'</feed>' as entry
) as v
order by v.seq;
drupalExtractionSQL
cat ${tbase}db-to-xhtml.sql\
| mysql -s -r -u ${user} ${database} -p\
| xsltproc ${tbase}pre-sed.xslt -\
| sed -e\
's#======\([^=]\+\)======#<h5>\1</h5>#g;
s#=====\([^=]\+\)=====#<h4>\1</h4>#g;
s#====\([^=]\+\)====#<h3>\1</h3>#g;
s#===\([^=]\+\)===#<h2>\1</h2>#g;
s#==\([^=]\+\)==#<h1>\1</h1>#g;
s#^ *\*\*\*\* \+\([^<]\+\)#<ul><ul><ul><ul><li>\1</li></ul></ul></ul></ul>#g;
s#^ *\*\*\* \+\([^<]\+\)#<ul><ul><ul><li>\1</li></ul></ul></ul>#g;
s#^ *\*\* \+\([^<]\+\)#<ul><ul><li>\1</li></ul></ul>#g;
s#^ *\* \+\([^<]\+\)#<ul><li>\1</li></ul>#g;
s#\[\[\([^]"<>]\+\)\]\]#<a href="\1" rel="need-lookup">\1</a>#g;
s#^$#<br/>#;
s#^---*#<hr/>#;
s#.lt.\(/\?code\).gt.#<\1>#g' \
| xsltproc ${tbase}post-sed.xslt -\
> ${database}.atom
]]></code></pre>
<h2>How it Works</h2>
<p>The script basically just generates two XSLT scripts and an SQL script to run. It saves those scripts under /tmp, then runs the SQL, the first XSLT, a custom sed and then the second XSLT in sequence. The SQL script gathers the data and produces a rudimentary .atom file. The first XSLT script is only used to do some cleanup - shorten descriptions, create unix:name nodes, turn the text nodes into rudimentary XHTML, rename some categories. The sed script then converts some of the typical Drupal text markup to 'real' XHTML markup. It also converts some wiki-esque markup as used by Drupal. The final XSLT then creates proper paragraphs from the post-processed text content, merges uls together and decorates things with slightly more metadata.</p>
<p>The resulting .atom file should probably be post-procssed to clean up names and remove nodes you don't need or want to serve, or those that won't export too well. You know, spam and Image nodes in Drupal. Renaming user accounts to real names. Things like that. XSLT will work wonders for that, too.</p>
</body>
</html>