ygxu

Using Apache HttpClient

    Blog category:
  • java
Fetching web pages with Apache HttpClient

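	/**
	 * Fetches the page at the given URL and returns its body as a string.
	 * 
	 * @param urlstr
	 *            the absolute URL to fetch (must include http://)
	 * @param encode
	 *            fallback encoding used when the response declares no
	 *            charset; defaults to gb2312 if null or empty
	 * @return the page source, or an empty string if the request failed
	 */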
	public static String snatch(String urlstr, String encode) {

		
		String rs = "";
		// Create an instance of HttpClient.
		HttpClient client = new HttpClient();

		// Create a method instance.
		GetMethod method = new GetMethod(urlstr);

		// Provide a custom retry handler if necessary; this one disables retries.
		method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
				new DefaultHttpMethodRetryHandler(0, false));

		// method.getParams().setParameter(HttpMethodParams.USER_AGENT,
		// "Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)");

		try {
			// Execute the method.
			int statusCode = client.executeMethod(method);

			if (statusCode != HttpStatus.SC_OK) {
				System.err.println("Method failed: " + statusCode);
				System.err.println("Method failed: " + method.getStatusLine());
			} else {
				// Read the response body.
				byte[] responseBody = method.getResponseBody();
				// Deal with the response.
				// Use caution: ensure the character encoding is correct and
				// the body is not binary data.

				// ISO-8859-1 is the default HttpClient reports when the
				// response declares no charset, so treat it as "unknown".
				if (!method.getResponseCharSet().trim().equalsIgnoreCase(
						"ISO-8859-1")) {
					rs = new String(responseBody, method.getResponseCharSet());
				} else {
					if (encode != null && encode.length() > 0)
						rs = new String(responseBody, encode);
					else
						rs = new String(responseBody, "gb2312");
				}
			}
		} catch (HttpException e) {
			System.err.println("Fatal protocol violation: " + e.getMessage());
			e.printStackTrace();
		} catch (IOException e) {
			System.err.println("Fatal transport error: " + e.getMessage());
			System.err.println("=============" + urlstr);
			e.printStackTrace();
		} catch (IllegalArgumentException e) {
			System.err.println("Offending URL: " + urlstr);
			e.printStackTrace();
		} finally {
			// Release the connection.
			method.releaseConnection();
		}
		return rs;
	}
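
The charset handling in the middle of snatch can be pulled out and tried on its own, without a network call. The following is a minimal sketch using only the JDK; the class name CharsetFallback and its two methods are hypothetical names introduced here, not part of HttpClient. It assumes, as the code above does, that a reported charset of ISO-8859-1 means the server did not declare one.

```java
import java.nio.charset.Charset;

// Hypothetical helper mirroring the charset decision made inside snatch().
final class CharsetFallback {
    // Value treated as "no charset declared by the server" (assumption
    // carried over from the snatch() code above).
    static final String HTTP_DEFAULT = "ISO-8859-1";

    // Picks the charset name to decode with:
    // 1. the declared charset, unless it is the ISO-8859-1 default;
    // 2. otherwise the caller-supplied fallback, if non-empty;
    // 3. otherwise gb2312.
    static String chooseCharset(String declared, String encode) {
        if (declared != null
                && !declared.trim().equalsIgnoreCase(HTTP_DEFAULT)) {
            return declared.trim();
        }
        if (encode != null && encode.length() > 0) {
            return encode;
        }
        return "gb2312";
    }

    // Decodes the raw response bytes with the chosen charset.
    // Charset.forName throws an unchecked exception for unknown names.
    static String decode(byte[] body, String declared, String encode) {
        return new String(body, Charset.forName(chooseCharset(declared, encode)));
    }
}
```

For example, CharsetFallback.chooseCharset("ISO-8859-1", "gbk") selects the caller's gbk, while a declared UTF-8 is always used as-is. One consequence of this design is that a page genuinely served as ISO-8859-1 will be decoded with the fallback encoding instead.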


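The code above is a simple way to fetch a remote page's source with HttpClient. Remember to add the imports it needs:
import java.io.IOException;
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;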

Also note the commented-out code in the middle:
// method.getParams().setParameter(HttpMethodParams.USER_AGENT,
		// "Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)");

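Pay attention to it: some sites are configured to block crawlers, so if a page cannot be fetched, uncomment these lines to send a browser-like User-Agent.
To use the method, call snatch("URL, which must start with http://", "encoding") directly.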
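
Because the URL must carry its scheme, it can help to check it before calling snatch. The sketch below uses only java.net.URI from the JDK; UrlCheck and isAbsoluteHttpUrl are hypothetical names introduced for illustration. Accepting https as well as http is an assumption made here, since the post only mentions http://.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical pre-check for the urlstr argument of snatch().
final class UrlCheck {
    // Returns true only for absolute http/https URLs that name a host,
    // e.g. "http://example.com/"; returns false for "example.com".
    static boolean isAbsoluteHttpUrl(String urlstr) {
        if (urlstr == null) {
            return false;
        }
        try {
            URI uri = new URI(urlstr.trim());
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http")
                        || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Not parseable as a URI at all.
            return false;
        }
    }
}
```

A caller could skip or log any address for which isAbsoluteHttpUrl returns false rather than passing it to snatch.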