Creating a Bulk Open Graph Data Extractor using PHP

Open Graph data in web pages can be a useful source of data for marketers (in addition to its main benefit of enabling richer content presentation on social networks).

Both Twitter and Facebook define their own metadata tags: Twitter’s open graph tags (officially, Twitter Cards markup) and Facebook’s Open Graph markup.

Consider the following tags, which may be present in an HTML document (using Twitter open graph tags):

[Image: Twitter open graph tags]

Twitter lets authors and websites attribute posts and articles to specific Twitter IDs, so that we (and bots) can see who is responsible for any given post.

[Image: open graph tags on marketinghacker]

Here’s part of the source code of the Marketinghacker.blog homepage. We can see that the blog’s Twitter ID, as specified using Twitter open graph tags, is ‘@marketinghack3r’.
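To make the idea concrete, here is a minimal sketch of pulling that Twitter ID out of a page's markup. The HTML string is a hypothetical sample (only the ‘@marketinghack3r’ handle comes from the example above), and the pattern assumes double-quoted attributes with name placed before content.

```php
<?php
// Hypothetical sample markup; a real page's tags will differ.
$html = '<head>
<meta name="twitter:card" content="summary" />
<meta name="twitter:site" content="@marketinghack3r" />
<meta name="twitter:creator" content="@marketinghack3r" />
</head>';

// Capture the content attribute of the twitter:site tag.
// Assumes double-quoted attributes with name before content.
$found = preg_match('~<meta\s+name="twitter:site"\s+content="([^"]*)"~i', $html, $m);
$twitterSite = $found ? $m[1] : null;

echo $twitterSite, PHP_EOL;
```

If the tag is missing, $twitterSite is null, so callers can tell "no Twitter ID" apart from an empty one.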

One example use case might be as follows:

I want to promote ‘Product X’.

  1. First, I scrape a list of the top 100 URLs on Google that rank for the term ‘Product X’ (using a Chrome data extractor plugin).
  2. Now, armed with 100 URLs in a spreadsheet, I’d like to get the Twitter ID of each website in order to:
    • Do influencer marketing: build a relationship with the author or webmaster via their Twitter ID, perhaps to turn them into a brand or product advocate
    • Perform social network analysis or content analysis on the user’s Twitter account
    • Get featured on their website
    • Enquire about buying an advertorial or request media packs
    • and more.

What if we want to bulk extract open graph data from a large number of URLs stored in a spreadsheet or CSV?

Let’s start with a CSV file of URLs.
Here are the top 20 results for a search for ‘social media tools’, saved on a web server in a file called ‘urls.csv’.
[Image: CSV input data]

Now, in the same directory as urls.csv, create and save the following file as bulk-scrape-og.php:

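	<?php
	// Remove the execution time limit; fetching many URLs can be slow.
	set_time_limit(0);

	$count = 0; // number of URLs processed
	if (($handle = fopen("urls.csv", "r")) !== FALSE) 
	{
	    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) 
	    {
	    	// Skip blank lines (fgetcsv returns array(null) for them).
	    	if (empty($data[0]))
	    		continue;
	    	$count++;
	    	$url = trim($data[0]);
	    	$aOG = getOG($url);
	    	// ?? yields '' when the fetch failed or the tag is missing.
	    	$site = $aOG['Meta name Tags']['twitter:site'] ?? '';
	    	$creator = $aOG['Meta name Tags']['twitter:creator'] ?? '';
	    	echo $url.','.$site.','.$creator.'<br/>';
	    }
	    fclose($handle);
	}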

	// Fetch $url and return its meta tags grouped into three associative
	// arrays, or null if the page could not be fetched.
	function getOG($url)
	{
		$curl = curl_init();
		curl_setopt($curl, CURLOPT_URL, $url);
		curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
		curl_setopt($curl, CURLOPT_HEADER, false);
		curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow redirects (e.g. http to https)
		curl_setopt($curl, CURLOPT_ENCODING, "");         // accept any supported compression
		$site_html = curl_exec($curl);
		curl_close($curl);

		if ($site_html == FALSE) // request failed or the body was empty
			return null;

		$ogtags = array();       // og:* property tags only
		$metatags = array();     // <meta name="..."> tags
		$propertytags = array(); // <meta property="..."> tags

		// Note: these patterns only match double-quoted attributes where
		// name/property comes immediately before content.
		$meta = null;
		preg_match_all('~<\s*meta\s+name="([^"]+)"\s+content="([^"]*)~i', $site_html, $meta);
		for ($i = 0; $i < count($meta[1]); $i++)
			$metatags[$meta[1][$i]] = $meta[2][$i];

		$property = null;
		preg_match_all('~<\s*meta\s+property="([^"]+)"\s+content="([^"]*)~i', $site_html, $property);
		for ($i = 0; $i < count($property[1]); $i++)
			$propertytags[$property[1][$i]] = $property[2][$i];

		$matches = null;
		preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $site_html, $matches);
		for ($i = 0; $i < count($matches[1]); $i++)
			$ogtags[$matches[1][$i]] = $matches[2][$i];

		$aTheData = array('Open Graph Tags' => $ogtags, 'Meta name Tags' => $metatags, 'Meta Property Tags' => $propertytags);

		return $aTheData;
	}
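The regular expressions in getOG() only match tags written with double quotes and with name or property placed directly before content. As a possible alternative (a different technique from the one used in the script above), the sketch below parses the markup with PHP's DOMDocument, which copes with single quotes and any attribute order. The function name extractMetaTags and the sample markup are hypothetical, and the sketch assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper: collect <meta> tags with DOMDocument instead of regex.
function extractMetaTags(string $html): array
{
	$tags = array();
	if ($html === '') {
		return $tags; // loadHTML() rejects an empty string
	}

	$doc = new DOMDocument();
	// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
	$previous = libxml_use_internal_errors(true);
	$doc->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	foreach ($doc->getElementsByTagName('meta') as $node) {
		// Tags may be keyed by name="" (Twitter) or property="" (Open Graph).
		$key = $node->getAttribute('name') ?: $node->getAttribute('property');
		if ($key !== '' && $node->hasAttribute('content')) {
			$tags[$key] = $node->getAttribute('content');
		}
	}
	return $tags;
}

// Sample markup with single quotes and content placed before name.
$sample = "<html><head><meta content='@example' name='twitter:site'>"
	. "<meta property=\"og:title\" content=\"Hello\"></head><body></body></html>";
$tags = extractMetaTags($sample);
echo $tags['twitter:site'], ',', $tags['og:title'], PHP_EOL;
```

This version returns a single flat array rather than the three groups built by getOG(), so the calling loop would read $tags['twitter:site'] directly.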

Now you can visit wherever you placed the script (perhaps yoursite.com/yourdirectory/bulk-scrape-og.php) to get a result like this:

[Image: extracted open graph data done in bulk with PHP]

Mission accomplished.
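If you would rather load the results straight back into a spreadsheet than copy them from the browser, one option is to write them out with fputcsv(). This is a rough sketch under assumptions: the rows are made-up examples in the URL, twitter:site, twitter:creator shape the script prints, the column names are my own, and php://temp stands in for a real output path such as a hypothetical results.csv.

```php
<?php
// Hypothetical rows in the shape the script produces: URL, twitter:site, twitter:creator.
$rows = array(
	array('https://example.com/', '@example', '@author'),
	array('https://example.org/', '', ''),
);

// php://temp keeps this sketch self-contained; swap in a real file path to save the output.
$out = fopen('php://temp', 'w+');

// Header row first, then one line per URL. Separator, enclosure and
// escape characters are passed explicitly rather than relying on defaults.
fputcsv($out, array('url', 'twitter_site', 'twitter_creator'), ',', '"', '\\');
foreach ($rows as $row) {
	fputcsv($out, $row, ',', '"', '\\');
}

// Read the result back so it can be inspected or printed.
rewind($out);
$csv = stream_get_contents($out);
fclose($out);

echo $csv;
```

fputcsv() also quotes any field that contains a comma or a quote, which the plain echo in the script does not.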
