A More Efficient (UTF-8) curl_multi()

Posted in PHP

If you need to make multiple cURL requests with PHP, curl_multi is a great way to do it. The problem with the usual curl_multi loop is that it waits for all requests to complete before it begins processing any of the responses. For example, if you’re making 100 requests and just one is slow, processing of the other 99 is held up until the slow one has come back. This is wasteful, so below is a script, originally written by Josh Fraser with my own modifications and improvements, that processes each response as soon as it arrives and then dispatches the next request.

function curl_multi_download(array $urls, callable $callback, array $custom_options = array())
{
	// re-index the urls so they can be walked with a numeric counter
	$urls = array_values($urls);
	// make sure the rolling window isn't greater than the # of urls
	$rolling_window = min(5, count($urls));
 
	$master = curl_multi_init();
	$curl_arr = array();
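	// caller-supplied options take precedence over these defaults
	// (with +, the left-hand array wins when a key exists in both)
	$options = $custom_options + array(
		CURLOPT_RETURNTRANSFER => true,
		CURLOPT_FOLLOWLOCATION => true,
		CURLOPT_MAXREDIRS => 5,
	);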
 
	// start the first batch of requests
	for ( $i = 0; $i < $rolling_window; $i++ )
	{
		$ch = curl_init();
		$options[CURLOPT_URL] = $urls[$i];
		curl_setopt_array($ch, $options);
		curl_multi_add_handle($master, $ch);
	}
 
	do
	{
		// let cURL do its work; older libcurl versions may ask to be called again
		while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
		if($execrun != CURLM_OK)
			break;
		// collect every request that has finished since the last pass
		while( $done = curl_multi_info_read($master) )
		{
			$info = curl_getinfo($done['handle']);
 
			// request finished (successfully or not).  process output using the callback function.
			$output = curl_multi_getcontent($done['handle']);
			call_user_func_array($callback, array($info, $output));
 
			// after the first batch, $i always points at the next unsent url
			if ( isset($urls[$i]) )
			{
				// start a new request (it's important to do this before removing the old one)
				$ch = curl_init();
				$options[CURLOPT_URL] = $urls[$i++];  // increment i
				curl_setopt_array($ch, $options);
				curl_multi_add_handle($master, $ch);
				$running = true;  // keep looping even if this was the last active transfer
			}
 
			// remove the curl handle that just completed
			curl_multi_remove_handle($master, $done['handle']);
		}
		// wait for activity instead of spinning the CPU; back off briefly if select fails
		if ( $running && curl_multi_select($master, 1.0) === -1 )
			usleep(1000);
	} while ($running);
 
	curl_multi_close($master);
	return true;
}
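
One detail worth checking when merging cURL option arrays is which array wins. PHP's + (array union) operator keeps the value from the left-hand array whenever a key exists in both, and it preserves integer keys, whereas array_merge() renumbers them, which would scramble the CURLOPT_* constants. A minimal sketch, using made-up integers (10, 20, 30) as stand-ins for the real constants so that it runs without the cURL extension:

```php
<?php
// Hypothetical stand-ins for CURLOPT_* constants (which are plain integers).
$defaults = array(10 => true, 20 => 5);
$custom   = array(20 => 2, 30 => 'agent');

// Left operand wins on duplicate keys: the defaults cannot be overridden here.
$defaults_win = $defaults + $custom;

// Swap the operands to let the caller's options override the defaults.
$custom_wins = $custom + $defaults;

// array_merge() renumbers integer keys, so the option IDs are lost.
$merged = array_merge($defaults, $custom);

var_dump($defaults_win[20]);   // int(5)
var_dump($custom_wins[20]);    // int(2)
var_dump(array_keys($merged)); // keys 0 to 3 instead of 10, 20, 30
```

Whichever order you pick, it is probably best to stick with + rather than array_merge() for anything keyed by CURLOPT_* constants.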

I would also highly recommend the curl_multi_getcontent_utf8 function below, based on the work of Paul Tarjan and Nadeau. Simply replace curl_multi_getcontent with curl_multi_getcontent_utf8 above and you should be good to go.

function curl_multi_getcontent_utf8( $ch )
{
	$data = curl_multi_getcontent( $ch );
	if ( !is_string($data) )
		return $data;
 
	$charset = null;
	// cast to string: curl_getinfo() returns null/false when no Content-Type was sent
	$content_type = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
 
	/* 1: HTTP Content-Type: header */
	preg_match( '@([\w/+]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
	if ( isset( $matches[3] ) )
		$charset = trim($matches[3], "\"';");  // drop quotes or a trailing semicolon
 
	/* 2: <meta> element in the page */
	if ( !isset($charset) )
	{
		preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s*charset=([^\s"]+))?@i', $data, $matches );
		if ( isset( $matches[3] ) )
			$charset = $matches[3];
	}
 
	/* 3: <xml> element in the page */
	if ( !isset($charset) )
	{
		preg_match( '@<\?xml.+encoding="([^\s"]+)@si', $data, $matches );
		if ( isset( $matches[1] ) )
			$charset = $matches[1];
	}
 
	/* 4: PHP's heuristic detection */
	if ( !isset($charset) )
	{
		$encoding = mb_detect_encoding($data);
		if ($encoding)
			$charset = $encoding;
	}
 
	/* 5: Default for HTML */
	if ( !isset($charset) )
	{
		// strpos() returns 0 when the content type starts with text/html
		if (strpos($content_type, "text/html") === 0)
			$charset = "ISO-8859-1";
	}
 
	/* Convert it if it is anything but UTF-8 */
	/* You can change "UTF-8"  to "UTF-8//IGNORE" to
	   ignore conversion errors and still output something reasonable */
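	if ( isset($charset) && strtoupper($charset) != "UTF-8" )
	{
		// iconv() returns false on failure (e.g. an unknown charset name);
		// in that case keep the original bytes rather than returning false
		$converted = @iconv($charset, 'UTF-8', $data);
		if ( $converted !== false )
			$data = $converted;
	}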
 
	return $data;
}
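
To see the detection order in isolation, here is a minimal sketch that only looks at the Content-Type header and the meta element, then converts with iconv(). The helper name sniff_charset() and the sample strings are hypothetical, not part of the function above, and the sketch deliberately skips the XML and mb_detect_encoding() steps.

```php
<?php
// Hypothetical helper: returns the declared charset, or null if none is found.
function sniff_charset($content_type, $body)
{
	// 1: charset parameter of the HTTP Content-Type header (optionally quoted)
	if (preg_match('@charset=["\']?([\w.:-]+)@i', (string) $content_type, $m))
		return $m[1];

	// 2: <meta http-equiv="Content-Type"> element in the page
	if (preg_match('@<meta\s+http-equiv="Content-Type"\s+content="[\w/]+;\s*charset=([^\s"]+)@i', $body, $m))
		return $m[1];

	return null;
}

// "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

$charset = sniff_charset('text/html; charset=ISO-8859-1', $latin1);
$utf8 = iconv($charset, 'UTF-8', $latin1);

var_dump($charset);       // string(10) "ISO-8859-1"
var_dump(bin2hex($utf8)); // string(10) "636166c3a9"
```

The header is checked first because it is the most authoritative source; the page body is only consulted when the server did not declare a charset.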

 

Usage:

The $callback (second) argument of curl_multi_download() can take any callback supported by PHP.

curl_multi_download(array(
	"http://translink.com.au/travel-information/services-and-timetables/trains/rss",
	"http://translink.com.au/travel-information/services-and-timetables/buses/rss",
	"http://arandombrokenurl.com/should/give/an/error",
	"http://translink.com.au/travel-information/services-and-timetables/ferries/rss",
), 'process_response');
 
function process_response( $info, $response ) 
{
	if ( $info['http_code'] != 200 )
	{
		echo "Error retrieving URL " . $info['url'] . "<br/>";
		return;
	}
	var_dump($info);
	var_dump($response);
}
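
Since the callback is invoked through call_user_func_array(), any PHP callable form should work: a function name, a closure, or an array naming an object or class method. Here is a small offline sketch of those forms. deliver() is a hypothetical stand-in that calls the callback the same way curl_multi_download() does, but with a fake $info array instead of a real request; by_name() and ResponseHandler are likewise made up for the example.

```php
<?php
// Hypothetical stand-in for the dispatch step inside curl_multi_download().
function deliver(callable $callback)
{
	$info = array('url' => 'http://example.com/', 'http_code' => 200);
	return call_user_func_array($callback, array($info, 'body'));
}

function by_name($info, $response)
{
	return 'name:' . $info['http_code'];
}

class ResponseHandler
{
	public function handle($info, $response)
	{
		return 'method:' . $response;
	}

	public static function handleStatic($info, $response)
	{
		return 'static:' . $info['url'];
	}
}

$results = array(
	deliver('by_name'),                                // function name
	deliver(function ($info, $response) {              // closure
		return 'closure:' . strlen($response);
	}),
	deliver(array(new ResponseHandler(), 'handle')),   // object method
	deliver(array('ResponseHandler', 'handleStatic')), // static method
);

print_r($results);
```

A closure is handy when the callback needs to collect responses into a variable from the calling scope (via use), while the array forms keep the handling logic inside a class.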

 

Bonus Round

Here’s a curl_download() function, used to grab the response from a single URL. (For extra points, wrap curl_exec() in a curl_exec_utf8() function that converts the result the same way curl_multi_getcontent_utf8() does.)

function curl_download( $url )
{
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
	$data = curl_exec($ch);  // plain function call; self:: is only valid inside a class
	curl_close($ch);
 
	return $data;
}