金子邦彦研究室プログラミングRuby プログラミングRuby の REXML を使って XML ドキュメントをリレーショナル・データベースのテーブルに格納する.

Ruby の REXML を使って XML ドキュメントをリレーショナル・データベースのテーブルに格納する (Store a XML Document into a relational table using REXML of ruby)

XMLドキュメントをリレーショナルのテーブルにマッピングする。 (Store a XML document int a relational table.)

REXMLのクラスのうち主要なもの (main REXML classes)

XML Declaration (XML宣言) REXML::XMLDecl
Document Type Definition (文書型定義) REXML::DocType
Comment (コメント) REXML::Comment
Element (要素) REXML::Element
CDATA Section (CData セクション) REXML::CData
Processing Instruction (処理命令) REXML::Instruction
Text (テキスト) REXML::Text
REXML::ElementDecl
REXML::NotationDecl
REXML::AttlistDecl
REXML::Entity

前もって決めておく事項

テーブル定義 (table definition)

 create table R (
   start   integer primary key not null,
   end     INTEGER UNIQUE not null, 
   parent  INTEGER not null, 
   level   INTEGER not null, 
   class   TEXT not null, 
   tagname text, 
   path    text, 
   text    TEXT not null ); 

◆ sqlite3 での操作手順例 (Table definition using sqlite3)

sqlite3 mydb
create table R (
  start   integer primary key not null,
  end     INTEGER UNIQUE not null, 
  parent  INTEGER not null, 
  level   INTEGER not null, 
  class   TEXT not null, 
  tagname text, 
  path    text, 
  text    TEXT not null ); 
.exit

Windows での実行結果例 (in case of Windows)

[image]

◆ Linux での実行結果例 (in canse of Linux)

[image]

テーブルへの格納を行うプログラム (a program to store a XML document into the table 'R')

ファイル hoge.xml を読み込み,テーブル R(start, end, parent, level, class, tagname, path, text) に格納するプログラム (The program read a 'hoge.xml' file and store it into the table R(start, end, parent, level, class, tagname, path, text))

次のプログラムを conv.rb のようなファイル名で保存する. (The program file name is 'conv.rb')

require "rexml/document" 
 
f = open("hoge.xml")
source = f.read
f.close

def pre(n, parent, line, level, pathbase, tagname)
    line << "insert into R (start, parent, level, class, text, tagname, path, end) values("
    line << $number.to_s
    line << ", "
    line << parent.to_s
    line << ", "
    line << level.to_s
    line << ", '"
    line << n.class.to_s
    line << "', '"

    str = n.to_s
    # text
      # SQL escape \' \" \\ \% \_ \n \t
    line << str.gsub(/\\/,"\\\\\\").gsub(/'/,"\\\\'").gsub(/'/,"\\\"").gsub(/%/,"\\%").gsub(/_/,"\\_").gsub(/\n/,"\\n").gsub(/\t/,"\\\t")

    if n.is_a?(REXML::Element) then
      if ( tagname.size == 0 ) then
        line << "', NULL"        
      else        
        line << "','"
        # SQL escape \' \" \\ \% \_ \n \t
        line << tagname.gsub(/\\/,"\\\\\\").gsub(/'/,"\\\\'").gsub(/'/,"\\\"").gsub(/%/,"\\%").gsub(/_/,"\\_").gsub(/\n/,"\\n").gsub(/\t/,"\\\t") 
        line << "'"
      end 
    else
      line << "', NULL"
    end

    # generate path
    if ( pathbase.size == 0 ) then
      path = tagname
    else
      if ( tagname.size == 0 ) then
        path = pathbase
      else
        path = pathbase + "/" + tagname
      end
    end
    # print path
    if ( path.size == 0 ) then
      line << ", NULL, "
    else
      # SQL escape \' \" \\ \% \_ \n \t
      line << ", '"
      line << path.gsub(/\\/,"\\\\\\").gsub(/'/,"\\\\'").gsub(/'/,"\\\"").gsub(/%/,"\\%").gsub(/_/,"\\_").gsub(/\n/,"\\n").gsub(/\t/,"\\\t") 
      line << "', "
    end
end


def post(n, line)
    line << $number.to_s 
    line << ");"
end

def trav(n, parent, level, pathbase, tagname)
  line = ""
  $number = $number + 1

  # extract tag name    
  tagname = ""
  if n.is_a?(REXML::Element) then
    tagname = ((n.to_s.scan(/^<[^>]*/)[0]).scan(/[^\s]*/)[0]).gsub(/^</,"").gsub(/\/$/,"")
    if ( tagname == "?xml" ) then
      # "?xml" is not tag name! 
      tagname = ""
    elsif ( tagname == "?xml-stylesheet" ) then
      # "?xml-stylesheet" is not tag name! 
      tagname = ""
    end
  end

  pre(n, parent, line, level, pathbase, tagname) 
  current = $number

  if n.is_a?(REXML::Parent) then    
    n.children.each {|e|
      if ( pathbase.size == 0 ) then
        path = tagname
      else
        path = pathbase + "/" + tagname
      end
      trav(e, current, level + 1, path, tagname) 
    }
  end

  if n.is_a?(REXML::Element) then    
    if n.has_attributes? then    
      n.attributes.each_pair { |k,v|
        str = ""
        $number = $number + 1
        str << "insert into R (start, parent, level, class, text, tagname, path, end) values("
        str << $number.to_s
        str << ", "
        str << parent.to_s
        str << ", "
        str << level.to_s
        str << ", '"
        str << "Attribute"
        str << "', '"
        # text
          # SQL escape \' \" \\ \% \_ \n \t
        str << k.to_s.gsub(/\\/,"\\\\\\").gsub(/'/,"\\\\'").gsub(/'/,"\\\"").gsub(/%/,"\\%").gsub(/_/,"\\_").gsub(/\n/,"\\n").gsub(/\t/,"\\\t") 
        str << "="
          # SQL escape \' \" \\ \% \_ \n \t
        str << v.to_s.gsub(/\\/,"\\\\\\").gsub(/'/,"\\\\'").gsub(/'/,"\\\"").gsub(/%/,"\\%").gsub(/_/,"\\_").gsub(/\n/,"\\n").gsub(/\t/,"\\\t") 
        str << "', NULL, '"
          # SQL escape \' \" \\ \% \_ \n \t
        str << pathbase.gsub(/\\/,"\\\\\\").gsub(/'/,"\\\\'").gsub(/'/,"\\\"").gsub(/%/,"\\%").gsub(/_/,"\\_").gsub(/\n/,"\\n").gsub(/\t/,"\\\t") 
        str << "', "
        $number = $number + 1
        str << $number.to_s
        str << ");"
        puts str
      }
    end
  end

  $number = $number + 1
  post(n, line) 
  puts line
end

$doc = REXML::Document.new source
$number = 0
puts "begin transaction;"
trav($doc, $number, 0, "", "") 
puts "commit;"

使ってみる (try to use)

◆ テスト用に使う XML ドキュメント (an example XML document for a test).

次のような XML ドキュメントを,ファイル名 hoge.xml で保存 (Store the following file. the file name is 'hoge.xml').

<?xml version="1.0" encoding="UTF-8"?> 
<?xml-stylesheet href="www.hogehoge.com/css/orange.xsl" type="text/xsl" ?>


<order_records>
  <order_record hoge="aaa">
    <id>1</id>
    <year>2011</year>
    <month>12</month>
    <day>1</day>
    <customer_name>kaneko</customer_name>
    <product_name>orange A</product_name>
    <unit_price>1.2</unit_price>
    <qty>10</qty>
    <created_at>2011-12-01 10:42:48</created_at>
  </order_record>
  <order_record hoge="bbb">
    <id>2</id>
    <year>2011</year>
    <month>12</month>
    <day>1</day>
    <customer_name>miyamoto</customer_name>
    <product_name>Apple M</product_name>
    <unit_price>2.5</unit_price>
    <qty>2</qty>
    <created_at>2011-12-01 10:42:59</created_at>
  </order_record>
  <order_record hoge="ccc">
    <id>3</id>
    <year>2011</year>
    <month>12</month>
    <day>3</day>
    <customer_name>kaneko</customer_name>
    <product_name>orange B</product_name>
    <unit_price>1.2</unit_price>
    <qty>8</qty>
    <created_at>2011-12-03 11:08:56</created_at>
  </order_record>
  <order_record hoge="ddd">
    <id>4</id>
    <year>2011</year>
    <month>12</month>
    <day>4</day>
    <customer_name>miyamoto</customer_name>
    <product_name>Apple L</product_name>
    <unit_price>3</unit_price>
    <qty>1</qty>
    <created_at>2011-12-04 15:35:12</created_at>
  </order_record>
</order_records>

テスト結果の例 (test result example)

ruby conv.rb > hoge.sql
sqlite3 mydb
.read hoge.sql
.exit

Windows での実行結果例 (in case of Windows)

[image]

◆ Linux での実行結果例 (in case of Linux)

[image]

テーブル R の確認 (examine the table R)

sqlite3 mydb
SELECT * FROM R;
.exit

Windows での実行結果例 (in case of Windows)

[image]

◆ Linux での実行結果例 (in case of Linux)

[image]
sqlite3 mydb
select distinct tagname FROM R;
select distinct path FROM R;
.exit

◆ Linux での実行結果例 (in case of Linux)

[image]

タグ名のテーブル (generate a table for tag name)

テーブルRから,新しいテーブル tag(tagid, tagname) を作る. (generate a new table tag(tagid, tagname) from table R).

sqlite3 mydb
CREATE TEMPORARY TABLE T1 AS select distinct tagname FROM R WHERE tagname IS not null;
CREATE TEMPORARY TABLE T2 AS SELECT OID AS tagid, * FROM T1;
create table tag (
    tagid    integer primary key not null, 
    tagname  TEXT UNIQUE not null
);
insert into tag(tagid, tagname) SELECT * FROM T2; 
SELECT * FROM tag;  
.exit

◆ Linux での実行結果例 (in case of Linux)

[image]

パスのテーブル (generate a table for path)

テーブルRから,新しいテーブル path(pathid, path) を作る. (generate a new table path(pathid, path) from table R).

sqlite3 mydb
CREATE TEMPORARY TABLE T1 AS select distinct path FROM R WHERE path IS not null;
CREATE TEMPORARY TABLE T2 AS SELECT OID AS pathid, * FROM T1;
create table path (
    pathid    integer primary key not null, 
    path  TEXT UNIQUE not null
);
insert into path(pathid, path) SELECT * FROM T2; 
SELECT * FROM path;  
.exit

◆ Linux での実行結果例 (in case of Linux)

[image]

REXML::Element に関するテーブル (generate a table for REXML::Elementh)

テーブルRから,新しいテーブル elements(start, parent, level, pathid, tagid) を作る. (generate a new table elements(start, parent, level, pathid, tagid) from table R).

sqlite3 mydb
CREATE TEMPORARY TABLE R1 AS SELECT start, parent, level, path.pathid, tag.tagid
  FROM R, tag, path
  WHERE R.tagname = tag.tagname
        AND R.path = path.path 
        AND R.class = "REXML::Element";
create table elements (
 start   integer primary key not null,
 parent  INTEGER not null,
 level   INTEGER not null,
 pathid  INTEGER not null,
 tagid   INTEGER not null
);
CREATE INDEX idxtagid ON elements(pathid);
insert into elements(start, parent, level, pathid, tagid) SELECT start, parent, level, pathid, tagid
FROM R1 ORDER BY pathid;
DROP TABLE R1;

SELECT start, parent, level, path, tagname FROM elements, tag, path WHERE elements.tagid = tag.tagid AND elements.pathid = path.pathid;

.exit

◆ Linux での実行結果例 (in case of Linux)

[image]
# create table sim_elements01 ( id integer primary key not null, start INTEGER not null, parent INTEGER not null, tagid INTEGER not null, element_start INTEGER not null, deleted INTEGER not null); # 01 CREATE TEMPORARY TABLE R01 as SELECT t1.start, t1.parent, t1.tagid, t2.start as element_start FROM elements as t1, elements as t2 WHERE t1.tagid = t2.tagid AND t1.tagid = 1; insert into sim_elements01(start, parent, tagid, element_start, deleted) SELECT t.start, t.parent, t.tagid, t.element_start, '0' as deleted FROM R01 as t; DROP TABLE R01; # 01 revised CREATE TEMPORARY TABLE R01 as SELECT t1.start, t1.parent, t1.tagid, t2.start as element_start FROM elements as t1, elements as t2 WHERE t1.pathid = t2.pathid AND t1.pathid = 1; insert into sim_elements01(start, parent, tagid, element_start, deleted) SELECT t.start, t.parent, t.tagid, t.element_start, '0' as deleted FROM R01 as t; DROP TABLE R01;