Using Apache HttpClient

ygxu

Blog category:

java

Apache Windows

Fetching web pages with Apache HttpClient

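	/**
	 * Fetches the page at the given URL and returns its body as a string.
	 * 
	 * @param urlstr
	 *            URL to fetch; must include the scheme (http://)
	 * @param encode
	 *            charset to decode with when the response declares none;
	 *            gb2312 is used if this is null or empty
	 * @return the page source, or an empty string on failure
	 */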
	public static String snatch(String urlstr, String encode) {

		
		String rs = "";
		// Create an instance of HttpClient.
		HttpClient client = new HttpClient();

		// Create a method instance.
		GetMethod method = new GetMethod(urlstr);

		// Provide a custom retry handler; (0, false) disables automatic retries.
		method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
				new DefaultHttpMethodRetryHandler(0, false));

		// method.getParams().setParameter(HttpMethodParams.USER_AGENT,
		// "Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)");

		try {
			// Execute the method.
			int statusCode = client.executeMethod(method);

			if (statusCode != HttpStatus.SC_OK) {
				System.err.println("Method failed: " + statusCode);
				System.err.println("Method failed: " + method.getStatusLine());
			} else {
				// Read the response body.
				byte[] responseBody = method.getResponseBody();
				// Deal with the response.
				// Use caution: ensure the character encoding is correct and
				// that the body is not binary data.
				// By default getResponseCharSet() reports ISO-8859-1 when the
				// response declares no charset, so that value is treated as
				// "unknown" and the caller's encoding is used instead.
				if (!method.getResponseCharSet().trim().equalsIgnoreCase(
						"ISO-8859-1")) {
					rs = new String(responseBody, method.getResponseCharSet());
				} else if (encode != null && encode.length() > 0) {
					rs = new String(responseBody, encode);
				} else {
					rs = new String(responseBody, "gb2312");
				}
			}
		} catch (HttpException e) {
			System.err.println("Fatal protocol violation: " + e.getMessage());
			e.printStackTrace();
		} catch (IOException e) {
			System.err.println("Fatal transport error: " + e.getMessage());
			System.err.println("=============" + urlstr);
			e.printStackTrace();
		} catch (IllegalArgumentException e) {
			System.err.println("Failing URL: " + urlstr);
			e.printStackTrace();
		} finally {
			// Release the connection.
			method.releaseConnection();
		}
		return rs;
	}
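
The charset selection inside snatch can be tried out without any network access. The sketch below pulls that decision into a standalone helper; the class name CharsetFallback and the method decode are hypothetical names used only for illustration, and it uses Charset.forName rather than the String-based constructor in the original, so an unknown charset name raises an unchecked exception here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of HttpClient): mirrors the charset
// decision made in snatch().
class CharsetFallback {

    /**
     * Decodes body with the reported charset unless it is missing or
     * ISO-8859-1 (treated as "not declared"); then uses fallback, and
     * finally gb2312 when no fallback is given.
     */
    static String decode(byte[] body, String reported, String fallback) {
        String name;
        if (reported != null
                && !reported.trim().equalsIgnoreCase("ISO-8859-1")) {
            name = reported.trim();
        } else if (fallback != null && fallback.length() > 0) {
            name = fallback;
        } else {
            name = "gb2312";
        }
        return new String(body, Charset.forName(name));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is a two-character Chinese sample string.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // No charset declared by the server, so the caller's fallback wins.
        System.out.println(decode(utf8, "ISO-8859-1", "UTF-8"));
    }
}
```

One consequence of this rule is that a page genuinely served as ISO-8859-1 is also decoded with the fallback, so the encode argument should match the sites being fetched.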

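The code above is a simple way to fetch a remote page's source with HttpClient. Remember to add the imports; the wildcard does not cover subpackages, so GetMethod, HttpMethodParams, and IOException need their own lines:

import java.io.IOException;
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;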

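Also note the commented-out code in the middle:

// method.getParams().setParameter(HttpMethodParams.USER_AGENT,
		// "Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)");

Some sites are configured to block crawler-style requests, so if a fetch fails you may need to uncomment these lines to send a browser-like User-Agent.
To use it, call snatch("URL, which must include http://", "encoding") directly.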
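
Since the URL has to carry its scheme, it can help to check that before calling snatch. The following is a minimal sketch using only java.net.URI; the class UrlCheck and the method hasHttpScheme are hypothetical names, not part of HttpClient, and accepting https as well as http is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical pre-check (not part of HttpClient): verifies that a URL
// string starts with an http or https scheme.
class UrlCheck {

    static boolean hasHttpScheme(String url) {
        if (url == null) {
            return false;
        }
        try {
            String scheme = new URI(url.trim()).getScheme();
            return "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
        } catch (URISyntaxException e) {
            // Not parseable as a URI at all, so not a usable URL.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasHttpScheme("http://www.example.com/"));
        System.out.println(hasHttpScheme("www.example.com/index.html"));
    }
}
```

A caller could then guard the fetch with something like: if (UrlCheck.hasHttpScheme(url)) { String html = snatch(url, "utf-8"); }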

2009-05-19 14:55