Building a Site Scraper to Find Problem URLs in WordPress

Sometimes all you need is a quick scraper to catch what a database search-and-replace missed.

I recently hacked together this single-use scraper repo. Fork, extend, and tweak it for your needs!

Backstory

I recently had the pleasure of launching a new WordPress site for Peaceful Media (yeah!!), but a cursory walk-through of the live site had me stumbling across URLs still pointing to the development site.

How could this be? I had made sure to run WP-CLI’s search-replace command:

wp search-replace '//dev.example.com' '//example.com'

As it turns out, Visual Composer, the page layout plugin used to craft the site’s static pages, was saving some URLs to the database in slightly different forms. In a few cases the URLs had simply been percent-encoded, so http://dev.example.com became http%3A%2F%2Fdev.example.com. Not too bad: I could have run a second search-and-replace on the encoded string. But what about the more annoying Raw HTML element?
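As a quick aside on the percent-encoded case, here is a minimal Node.js sketch of how the encoded form can be derived. The helper name `urlVariants` is made up for illustration; `encodeURIComponent` is the standard JavaScript built-in.

```javascript
// Hypothetical helper: list the forms of a URL that a plain
// search-and-replace would have to cover.
function urlVariants(url) {
  return {
    plain: url,
    // encodeURIComponent escapes ":" and "/" but leaves "." and "-" alone.
    percentEncoded: encodeURIComponent(url),
  };
}

const variants = urlVariants('http://dev.example.com');
console.log(variants.plain);          // http://dev.example.com
console.log(variants.percentEncoded); // http%3A%2F%2Fdev.example.com
```

The encoded string is what a second search-and-replace pass would need to target.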

The Raw HTML element saves its content to the database as a base64-encoded string. I should give at least a little props to Visual Composer for this clever, albeit hacky, way of ensuring that raw HTML markup isn’t “helpfully” sanitized by TinyMCE or WordPress on post save. Since what gets saved is just a jumble of characters, there is nothing for either TinyMCE or WordPress to adjust and break.

That’s all well and good, but having the content stored as base64 means I can’t easily work with it outside of the GUI admin screens. In this case, that means I can’t run a simple search-and-replace on the database. Additionally, should we ever migrate away from Visual Composer, we now have a more complex migration path ahead of us.
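To see why a plain search misses these values, here is a small Node.js sketch. It is not how Visual Composer itself reads the data: `base64Mentions` is a hypothetical helper, and the extra percent-decoding step is an assumption that the decoded text might itself be escaped.

```javascript
// Hypothetical helper: decode a base64 blob and report whether the decoded
// text mentions the given host name.
function base64Mentions(blob, needle) {
  const decoded = Buffer.from(blob, 'base64').toString('utf8');
  // Assumption: the decoded text may itself be percent-encoded, so try both.
  let unescaped = decoded;
  try {
    unescaped = decodeURIComponent(decoded);
  } catch (err) {
    // Not valid percent-encoding; keep the decoded text as-is.
  }
  return decoded.includes(needle) || unescaped.includes(needle);
}

const html = '<a href="http://dev.example.com/about">About</a>';
const stored = Buffer.from(html, 'utf8').toString('base64');

// The base64 alphabet has no "." character, so the stored text can never
// contain the host name literally.
console.log(stored.includes('dev.example.com'));        // false
console.log(base64Mentions(stored, 'dev.example.com')); // true
```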

Anyway, back to the problem at hand: identifying every URL that matches a “problematic” pattern, in this case dev.example.com. I ended up writing a quick-and-dirty site scraper that scans all of the pages easily reachable from the homepage and flags any URLs matching a regex of “problem” URLs. Then it’s just a matter of updating them by hand. It turned out to be pretty easy using the node-simplescraper module and a 75-line Node.js file. One secondary benefit is that the scan is not confined to the content of the site database.
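The flagging step can be sketched roughly as follows. This is not the code from the repo and does not use node-simplescraper. The names `PROBLEM_PATTERN`, `findProblemUrls`, and `scanPage` are made up for illustration, and `scanPage` assumes a Node.js version with a global `fetch` (18 or later).

```javascript
// Assumed pattern: any URL on the development host, with or without a scheme.
const PROBLEM_PATTERN = /(?:https?:)?\/\/dev\.example\.com[^\s"'<>)]*/g;

// Return every match in `body` along with its 1-based line number.
// The pattern must carry the "g" flag, because matchAll requires it.
function findProblemUrls(body, pattern = PROBLEM_PATTERN) {
  const hits = [];
  body.split('\n').forEach((line, index) => {
    for (const match of line.matchAll(pattern)) {
      hits.push({ line: index + 1, url: match[0] });
    }
  });
  return hits;
}

// Fetch one page and scan its raw markup (defined here, not called).
async function scanPage(url) {
  const response = await fetch(url);
  return findProblemUrls(await response.text());
}

const page = [
  '<a href="http://dev.example.com/about">About</a>',
  '<img src="//example.com/logo.png">',
  '<link rel="stylesheet" href="//dev.example.com/style.css">',
].join('\n');

// Flags lines 1 and 3; the live-site image on line 2 is left alone.
console.log(findProblemUrls(page));
```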

Since this is an actual headless browser scraping our site, it scans everything a visitor would have access to: links, images, styles, and more. This means that any problem URLs in linked files, such as external JavaScript and CSS files, get flagged as well. Awesome!
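For illustration, one way to enumerate the linked files worth scanning is to pull every `href` and `src` value out of a page and resolve it against the page’s own address. This is only a sketch under that assumption, not the repo’s approach: `linkedResources` and `sameSite` are hypothetical names, and a regex over markup is a shortcut that a real HTML parser would handle more robustly.

```javascript
// Collect the absolute URLs of everything a page references via href or src.
function linkedResources(html, pageUrl) {
  const found = new Set();
  const attribute = /\b(?:href|src)\s*=\s*["']([^"']+)["']/gi;
  for (const match of html.matchAll(attribute)) {
    try {
      // Resolve relative paths against the page's own URL.
      found.add(new URL(match[1], pageUrl).href);
    } catch (err) {
      // Skip values that cannot be parsed as URLs.
    }
  }
  return [...found];
}

// Keep only resources on the same host, so a crawl stays on the site.
function sameSite(urls, pageUrl) {
  const host = new URL(pageUrl).host;
  return urls.filter((u) => new URL(u).host === host);
}

const markup =
  '<link href="/css/site.css"><script src="js/app.js"></script>' +
  '<a href="https://other.test/x">elsewhere</a>';
const resources = linkedResources(markup, 'https://example.com/blog/');
console.log(sameSite(resources, 'https://example.com/blog/'));
```

Each URL returned this way could then be fetched and checked against the same “problem” regex as the pages themselves.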

The end result is still a duct-taped, immediate-use sort of tool, but I think there is something here that I can come back to and extend. Until then, feel free to poke around and fork the repo on GitHub!