Sunday, April 19, 2015

PHP iterating over Blogger posts from Atom Feed XML

Thanks to some decent PHP APIs, it is really easy to read XML from a URL and process the results.

Use case: I have written an expression of interest form that people can use to raise an application to adopt a rescued Chihuahua. Part of the form lists the dogs available - and this list comes from a Blogger feed. The blog is used as a list of all the Chihuahuas that have been rescued (one dog per post) and the ones that are currently available all have a specific label: Available now. The expression of interest form reads the Atom feed for this label and displays a list of the available now dogs on the form.

Retrieve XML from URL and access child elements

The function below reads an XML document from a URL (an Atom feed) and will return an array of the child elements that represent specific posts with a given label from the blog.

function retrieveAvailableNowPosts() {
   // Set URL to XML we want to read - Available now.
   $file="http://chihuahuarescue.blogspot.com.au/feeds/posts/default/-/Available%20now";
   // Load specified XML file or report failure
   $xml = simplexml_load_file($file);
   if (!$xml) {
      return false;
   }
   // Load blog entries
   $posts =  $xml -> entry;
   if (sizeOf($posts) > 0) {
      return $posts;
   } else {
      return null;
   }
}

Notes about this function.

  • $xml = simplexml_load_file($file);
      • Loading the contents of a URL and then parsing the XML it contains is done with simplexml_load_file. The return from this function is either a SimpleXMLElement object or a boolean false if there was an error reading XML from the file (URL in this case).
      • The parameter to simplexml_load_file can be a file or URL.
      • Blogger supports feeds either from RSS 2.0 or Atom 1.0, and you can switch between them simply with a different URL.
      • The URL I am using (http://chihuahuarescue.blogspot.com.au/feeds/posts/default/-/Available%20now) is for an Atom feed. The label is the part after the last forward slash, URL encoded: the label Available now becomes Available%20now. This was easy in my case because I only had to swap the space for %20. If you have more complicated labels (or need to encode labels dynamically), you can use the PHP function rawurlencode, which encodes a space as %20. Note that urlencode encodes a space as + instead, which is not the form used in this URL.
  • $posts = $xml -> entry;
    • $xml is the variable containing the XML read from simplexml_load_file.
    • The Atom feed XML returned from the URL has feed as the root element and it contains a variable number of entry elements, each of which is a blog post that was made against the target label. The skeleton is shown below.
      <feed ...>
         ...
         <entry>
            ...
         </entry>
         <entry>
            ...
         </entry>
         <entry>
            ...
         </entry>
      </feed>
    • We access the entry elements on the feed using the "arrow" operator (T_OBJECT_OPERATOR for objects).
    • $posts is therefore a SimpleXMLElement that can be counted and iterated over like an array, and it might be empty.
  • if (sizeOf($posts) > 0)
    • If the array has 1 or more elements, return it.
    • Otherwise, return null.
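
The notes above can be tried without a network connection. The sketch below is only an illustration, not the code used on the live form: it parses a hand-written Atom skeleton with simplexml_load_string, counts the entry elements, and builds a label feed URL. The helper name buildLabelFeedUrl and the sample feed content are hypothetical.

```php
<?php
// Hypothetical helper: build the Atom feed URL for a Blogger label.
// rawurlencode() encodes a space as %20 (urlencode() would give "+").
function buildLabelFeedUrl($blogHost, $label) {
   return 'http://' . $blogHost . '/feeds/posts/default/-/' . rawurlencode($label);
}

// Hand-written stand-in for the feed; a real feed has many more elements.
$atom = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
   <title>Sample feed</title>
   <entry><title>RUDI</title></entry>
   <entry><title>MIMI</title></entry>
</feed>
XML;

$xml = simplexml_load_string($atom);
$posts = $xml->entry;

// $posts is a SimpleXMLElement, but it can be counted and iterated like a list.
$entryCount = count($posts);
$titles = array();
foreach ($posts as $post) {
   $titles[] = (string) $post->title;
}

$feedUrl = buildLabelFeedUrl('chihuahuarescue.blogspot.com.au', 'Available now');
echo $feedUrl . "\n";
echo $entryCount . ' entries: ' . implode(', ', $titles) . "\n";
```

Casting each title to a string matters if you store the values: without the cast you keep SimpleXMLElement objects rather than plain text.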

Iterate through XML elements

The function below accepts the XML elements read from the earlier function and iterates through them to output HTML.

function createDogList($posts) {
   $list = '<ul class="availableNowList">';
   // Check if posts is undefined, null, false or empty.
   if (!$posts || sizeOf($posts) == 0) {
      $list .= '<li>Unfortunately there are no dogs available at this time.</li>';
   } else {
      // Go over each entry.
      foreach($posts as $post) {
         // Publish time
         $dateTime = date("l jS F, Y", strtotime(strtok($post->published, 'T')));
         // Link.
         $link = $post->link[4]["href"];
         // Title.
         $title = $post->title;
         // List the entry.
         $list .=
            '<li>
               <a href="' . $link . '" target="_blank">'
                     . $title . '</a> <em><small>(published ' . $dateTime . ')</small></em>.
            </li>';
      }
   }
   $list .= '</ul>';
   return $list;
}

Notes about this function.

  • This function builds up an unordered list (ul) of blog posts, with each list item having
    • A link to the blog post - the link text being the blog entry title.
    • The date on which the post was published. Example list:
      • RUDI (published Sunday 29th March, 2015).
      • MIMI (published Sunday 29th March, 2015).
      • BAXTER (published Sunday 29th March, 2015).
  • If there are no posts against this label, output only one list item with an explanation that there are no matches at this time.
  • The parameter to this function is a list of blog posts against a certain label, retrieved by the previous function: retrieveAvailableNowPosts().
  • if (!$posts || sizeOf($posts) == 0) { .. }
    • This is more than just a null-check: !$posts will be true if the variable $posts is null, not set (undefined, has no value) or false.
    • Here is a quick overview, showing that an IF is a good test for all three things.
      <?php
         echo '<pre>';
         $foo1;        if($foo1) { ?>foo1 is set and not null/false.<?php } else { ?>foo1 is not set/null/false.<br><?php }
         $foo2=null;   if($foo2) { ?>foo2 is set and not null/false.<?php } else { ?>foo2 is not set/null/false.<br><?php }
         $foo3=false;  if($foo3) { ?>foo3 is set and not null/false.<?php } else { ?>foo3 is not set/null/false.<br><?php }
         echo 'foo1 - ';
         var_dump(isset($foo1));
         echo 'foo2 - ';
         var_dump(isset($foo2));
         echo 'foo3 - ';
         var_dump(isset($foo3));
         echo '</pre>';
      ?>
      
      The output of the above is:
      foo1 is not set/null/false.
      foo2 is not set/null/false.
      foo3 is not set/null/false.
      foo1 - bool(false)
      foo2 - bool(false)
      foo3 - bool(true)
    • This allows us to respond to two sad cases from retrieveAvailableNowPosts() at once.
      • retrieveAvailableNowPosts() returns false if it couldn't read the Atom feed XML from the blog.
      • retrieveAvailableNowPosts() returns null if the list of posts is empty.
      • The call to sizeOf (sizeOf($posts) == 0) is actually not needed, because retrieveAvailableNowPosts() returns null if the list of posts is empty, but I left it here in case I ever call this function from a different place and neglect to include the same rule.
  • $dateTime = date("l jS F, Y", strtotime(strtok($post->published, 'T')));
    • $post->published
      • The value of this looks like 2015-03-29T03:10:00.003-07:00.
      • I want only the year, month and day: I want to discard all the time information and just output the date. I will use a string tokenizer to do this in the next step.
    • strtok($post->published, 'T')
      • I use a string tokenizer with the letter "T" as the delimiter. Note that the first call to strtok returns the first token, and since I only need the first token, I don't call it again. Here is how you would use the tokenizer in another situation to go over all tokens:
        $string = "String to split";
        $delimiter = " \n\t";  // Split string on spaces, newlines and tabs.
        $token = strtok($string, $delimiter);
        while ($token !== false) {
            echo "Next token: $token <br />";
            $token = strtok($delimiter);
        }
        
      • The result of strtok($post->published, 'T') will be something like 2015-03-29 (note that it does not include the delimiter itself).
    • date("l jS F, Y", strtotime(strtok($post->published, 'T')))
      • I use strtotime to parse the date (from text like 2015-03-29) into a timestamp, and the date function to output it in a different format (like Sunday 29th March, 2015).
      • See the PHP page for date function to find the full list of date format options, but here is what my format uses.
        • l - A full textual representation of the day of the week: Sunday through Saturday.
        • j - Day of the month without leading zeros: 1 to 31.
        • S - English ordinal suffix for the day of the month, 2 characters: st, nd, rd or th.
        • F - A full textual representation of a month, such as January or March: January through December.
        • Y - A full numeric representation of a year, 4 digits: 1999 or 2003.
  • $link = $post->link[4]["href"]
    • In a given entry element, get the href attribute of the fifth link element (using a zero based index).
    • The fifth link element holds a direct URL to the post, such as: <link rel="alternate" type="text/html" href="http://chihuahuarescue.blogspot.com/2015/03/mimi.html" title="MIMI"/>.
  • Google's description of what is in each post element shows you what things you can access this way for each post:
    • posts: A list of all posts for this page. Each post contains the following:
      • dateHeader: The date of this post, only present if this is the first post in the list that was posted on this day.
      • id: The numeric post ID.
      • title: The post's title.
      • body: The content of the post.
      • author: The display name of the post author.
      • url: The permalink of this post.
      • timestamp: The post's timestamp. Unlike dateHeader, this exists for every post.
      • labels: The list of the post's labels. Each label contains the following:
        • name: The label text.
        • url: The URL of the page that lists all posts in this blog with this label.
        • isLast: True or false. Whether this label is the last one in the list (useful for placing commas).
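
Picking the link with index 4 relies on the order in which link elements appear in each entry. The sketch below shows one possible alternative under stated assumptions: it selects the link by its rel="alternate" attribute, and it repeats the date handling described above on a single hand-written entry. The helper names findAlternateLink and formatPublished are hypothetical, the self link in the sample is made up, and the timezone is fixed to UTC only to keep the example predictable.

```php
<?php
// Fix the timezone so date() behaves the same on any server (an assumption for this sketch).
date_default_timezone_set('UTC');

// Hypothetical helper: return the href of the first link with rel="alternate", or null.
function findAlternateLink($post) {
   foreach ($post->link as $link) {
      if ((string) $link['rel'] === 'alternate') {
         return (string) $link['href'];
      }
   }
   return null;
}

// Hypothetical helper: turn a published value into text like "Sunday 29th March, 2015".
function formatPublished($published) {
   // Keep only the part before the "T", i.e. the date.
   $dateOnly = strtok((string) $published, 'T');
   return date('l jS F, Y', strtotime($dateOnly));
}

// Hand-written stand-in for one entry of the feed.
$entry = simplexml_load_string(
   '<entry xmlns="http://www.w3.org/2005/Atom">
      <published>2015-03-29T03:10:00.003-07:00</published>
      <title>MIMI</title>
      <link rel="self" type="application/atom+xml" href="http://example.com/feeds/1"/>
      <link rel="alternate" type="text/html" href="http://chihuahuarescue.blogspot.com/2015/03/mimi.html" title="MIMI"/>
   </entry>'
);

$link = findAlternateLink($entry);
$dateTime = formatPublished($entry->published);

// Escape values before placing them in HTML.
$item = '<li><a href="' . htmlspecialchars($link) . '">'
      . htmlspecialchars((string) $entry->title) . '</a> (published ' . $dateTime . ')</li>';
echo $item . "\n";
```

Escaping the title and link with htmlspecialchars is a precaution in case a post title ever contains characters such as < or &.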

Error handling

The primary error condition is from the call to simplexml_load_file, which returns false if there was a failure reading XML from the URL. The secondary error condition occurs if we read the XML okay, but found it contained none of the elements we are interested in. On the page that uses these functions, both error conditions are treated as normal outputs from the retrieveAvailableNowPosts function and dealt with nicely, as you can see below. We output error messages if either error occurs, and display the "normal" content of the page otherwise.

<?php
$posts = retrieveAvailableNowPosts();
// If a boolean false is returned there was an error.
if ($posts === false) {
?>
   <p style="text-align: center; color: red;">Unable to load list of Available Now dogs from Chihuahua Rescue Victoria!</p>
<?php
// And null means there is nothing present.
} else if ($posts === null) {
?>
   <p style="text-align: center;">Unfortunately there are no dogs available at this time. Please try again later.</p>
<?php
// Otherwise, all good. Carry on.
} else {
?>
   ... normal page content goes here.
<?php
} // end else
?>
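
When the XML is malformed, the parser also raises PHP warnings, which can end up in the page output. The sketch below is one possible way to keep those out of the page and collect them for logging instead, using libxml_use_internal_errors, libxml_get_errors and libxml_clear_errors. It is not part of the original form: the helper name parseFeedQuietly is hypothetical, and it uses simplexml_load_string on inline text so that it runs without a network connection.

```php
<?php
// Hypothetical helper: parse XML without emitting warnings into the page.
// Returns array(SimpleXMLElement or false, list of error messages).
function parseFeedQuietly($xmlText) {
   $previous = libxml_use_internal_errors(true);   // collect errors instead of printing them
   $xml = simplexml_load_string($xmlText);
   $messages = array();
   foreach (libxml_get_errors() as $error) {
      $messages[] = trim($error->message);
   }
   libxml_clear_errors();
   libxml_use_internal_errors($previous);          // restore the earlier setting
   return array($xml, $messages);
}

// Well-formed input parses; the second input has an unclosed entry element.
list($good, $goodErrors) = parseFeedQuietly('<feed><entry/></feed>');
list($bad, $badErrors) = parseFeedQuietly('<feed><entry></feed>');

echo ($good !== false ? 'parsed' : 'failed') . "\n";
echo ($bad === false ? 'failed with ' . count($badErrors) . ' error(s)' : 'parsed') . "\n";
```

The collected messages could be written to a log while the visitor still sees only the friendly error paragraph.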

Just die!

We could have handled the error from simplexml_load_file in this way:

$xml = simplexml_load_file($file) or die("<p>An error message.</p>");

This offers a very poor user experience, especially on a web page, because it causes PHP to exit immediately and no further code on the page is processed. This will most likely result in an ugly page with broken HTML.
