Reading and replacing text in Word DocX and Excel XlsX documents using Ruby

So as you may know. The new Word and Excel formats are similar to open office document formats in that they are just zips of multiple xml documents (well mostly xml documents). So what we wanted to do for our project (the WebDav one mentioned in my last post) is to set up a simple templating system that would do variable replacement in Word/Excel documents. And it turned out to be a piece of cake. I am just going to go through the DocX version of template model, but the only difference between them is the folder structure so there is not too much to change to get this working for both.

The basic model looks as follows. make_temp is just a method that makes temp files/directories so you can set that up however you like.

class DocxTemplateTool < TemplateTool  
  require 'find'
  require 'zip/zip'
  require 'zip/zipfilesystem'

  def initialize(data)
    @temp_path = make_temp(".zip")
    @temp_dir = make_temp
    @data = data
  end
end  

Now we have can get started on reading the document and then replacing some text. We will start with the shell of method and then add stuff. In the beginning we need to save our data as a zip file (I should mention all of our data here is saved in blobs) and then unzip it's contents in another directory. After we are done, we of course want to clean up all of these temp files. Both of the methods are going to be used by the replace method and by a validation method so to keep it DRY they are going into protected class methods.

def replace(vars={})  
    rdata = nil
    begin
      loadup
      #more code is going here
    ensure
      cleanup
    end
    return rdata
  end

  protected
  def loadup
      Dir.mkdir(@temp_dir, 777)
      File.open(@temp_path, 'wb') { |f| f.write data }
      args = ['unzip', @temp_path, '-d', @temp_dir]
      raise "Could not unzip #{@temp_path}" unless system(*args)
      args = ['chmod','-R','777', @temp_dir]
      raise "Could not chmod #{@temp_path}" unless system(*args)
  end

  def cleanup
      Dir.chdir(RAILS_ROOT)
      FileUtils.rm(@temp_path) if File.exists?(@temp_path)
      FileUtils.rm_rf(@temp_dir) if File.exists?(@temp_dir)
  end

The only piece here that might not make sense is the need to call Dir.chdir(RAILS_ROOT) in cleanup. The reason for this is that during the course of the replace method we may find ourselves in that directory and when we go to clean it up we won't be able to delete it if we had cd'ed to it.

Before we go onto the replace method there is one more protected method that we are going to have to write. This will method will check to make sure that as we are going through directories we are only doing replacement in the files we want to.

protected  
    def prunecheck?(path)
      FileTest.directory?(path) &&
      path != File.join(@temp_dir) &&
      path != File.join(@temp_dir,'word') &&
      path != File.join(@temp_dir,'docProps')
    end

As you look through the file structure you will see that we are ignoring all of the _rels directories since their purpose is to tell word which files do what and we do not want to ruin that. I also ignored the theme directory by my own choice, and we probably could choose to ignore other files n the word directory such as fontTable.xml, but as is this will allow you to replace anything in the document itself as well as in the metadata.

So finally we are going to do some replacement. Follows is the code that goes in #more code is going here.

Find.find(File.join(@temp_dir)) do |path|  
        Find.prune if (prunecheck?(path))
        unless FileTest.directory?(path)
          sdata = nil
          File.open(path, 'r') { |f| sdata = f.read }
          FileUtils.rm(path)
          File.open(path, 'w') { |f| f.write replace_data(sdata,vars) }
        end
      end

      FileUtils.rm(@temp_path) if File.exists?(@temp_path)
      Dir.chdir(@temp_dir)
      args = ['zip', '-r', @temp_path, '*']
      raise "Could not rezip file" unless system(args.join(" "))
      File.open(@temp_path, 'r') { |f| rdata = f.read }

We are beginning by using the Find.prune method to search through the directory structure for all files that we are going to be working on. If we find a file the process is as follows.

  1. Read the data from the file into memory.
    1. Remove the file.
    2. Run a method on the data to do whatever replacements you would like and then save it back to where the file was before you deleted it.

I was not going to provide my replace_data method since it is particular to the project I wrote this for, but pretty much any sort of methods with gsub functions is what should be used. Lets just say replace_data looks like this.

def replace_data(data, vars)  
    data.gsub(/PHP/, 'Ruby')
  end

After the files have been replaced we are going to cd to the temp directory and rezip it. I chose to use a UNIX system call here since there were issues using Ruby's zip class (if I recall correctly it was adding an extra folder when it was rezipping it). After that just, reread the zip file so that you can return that data after you are done cleaning up and voila you have a Word Document that doesn't have any mention of PHP!

One final method we are going to add before we are done. It is very likely that you are going to want to make sure someone didn't put anything malicious in the file that is going to mess with your system or maybe just let the user know why some text isn't getting replaced. For that we have this method.

def validate_data  
    docx_errors = []
    begin
      loadup
      Find.find(File.join(@temp_dir)) do |path|
        Find.prune if (prunecheck?(path))
        File.open(path, 'r') { |f| docx_errors.concat(validate_template(f.read ))} unless FileTest.directory?(path)
      end
    ensure
      cleanup
    end
    return docx_errors
  end

It pretty much goes through the document just like we did with the replacement except we will call a method validate_template (that you will have to write) that will make sure your data is AOK!

comments powered by Disqus