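A web crawler, put simply, requests a page's data by URL and then uses regular expressions to extract the data you want.
The link I access here is http://www.baixing.com/?changeLocation=yes
This article uses Java to request a web page, reads its text, and applies two regular expressions in succession so that the place names on the page and their corresponding links are displayed in the following form: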
宁县 = ningxian.baixing.com
天水 = tianshui.baixing.com
全天水 = tianshui.baixing.com
甘谷 = gangu.baixing.com
秦安 = qinan.baixing.com
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FinallyDemo {

    public static void main(String[] args) {
        String buf = getBuf();
        System.out.println("main--------------------------");
        // getBuf() returns null when the request fails, so check before parsing.
        if (buf == null) {
            System.out.println("Request failed; nothing to parse.");
            return;
        }
        String al = getRegex(buf);
        // System.out.println(al);
    }

    public static String getBuf() {
        try {
            // 1. Build the URL  2. Open a connection from the URL  3. Set the request method
            // 4. Set the timeout  5. Connect
            URL url = new URL("http://www.baixing.com/?changeLocation=yes");
            HttpURLConnection connect = (HttpURLConnection) url.openConnection();
            connect.setRequestMethod("GET");
            connect.setConnectTimeout(3000);
            connect.connect();
            // 6. Get the status code  7. Read the content  8. Return the content
            int code = connect.getResponseCode();
            if (code == 200) {
                // try-with-resources closes the reader even if reading fails.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(connect.getInputStream(), StandardCharsets.UTF_8))) {
                    StringBuilder buffer = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        buffer.append(line);
                    }
                    System.out.println("try----------------------------");
                    return buffer.toString();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("catch-----------------------");
        } finally {
            System.out.println("finally-----------------------");
        }
        return null;
    }

    public static String getRegex(String s) {
        // Target format: <a href='//jiangsu.baixing.com/'>江苏</a>
        // First regex: match every <a ...>...</a> element on the page.
        String regex = "<a[^>]*href=(\"([^\"]*)\"|'([^']*)'|([^\\s>]*))[^>]*>(.*?)</a>";
        // Second regex: from a single element, capture the host (group 1)
        // and the Chinese place name (group 2). Compiled once, outside the loop.
        String regex1 = "^<a href='//(.*?)/'.*?([\\u4e00-\\u9fa5]*)</a>$";
        Pattern r = Pattern.compile(regex);
        Pattern r1 = Pattern.compile(regex1);
        Matcher m = r.matcher(s);
        List<String> list = new ArrayList<>();
        while (m.find()) {
            list.add(m.group());
            Matcher m1 = r1.matcher(m.group());
            if (m1.find()) {
                System.out.println(m1.group(2) + " = " + m1.group(1));
            }
        }
        return list.toString();
    }
}
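The two-stage matching can also be tried without a network connection. The sketch below is not part of the original program: the class name LocationLinkDemo, the method extractLocations, and the sample HTML are made up for illustration, and it assumes the page keeps using single-quoted, protocol-relative links such as <a href='//jiangsu.baixing.com/'>江苏</a>. It reuses both patterns from the code above, except that the second one requires at least one Chinese character (+ instead of *), and it returns the pairs in a map instead of printing them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LocationLinkDemo {

    // Stage 1: any <a ...>...</a> element (same pattern as in the article).
    private static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]*href=(\"([^\"]*)\"|'([^']*)'|([^\\s>]*))[^>]*>(.*?)</a>");

    // Stage 2: a single-quoted href starting with // and a Chinese label.
    // Assumption: "+" instead of the article's "*" so empty labels are skipped.
    private static final Pattern LOCATION = Pattern.compile(
            "^<a href='//(.*?)/'.*?([\\u4e00-\\u9fa5]+)</a>$");

    // Returns place name -> host, in the order the links appear in the HTML.
    static Map<String, String> extractLocations(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher anchors = ANCHOR.matcher(html);
        while (anchors.find()) {
            Matcher location = LOCATION.matcher(anchors.group());
            if (location.find()) {
                result.put(location.group(2), location.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input; the last link does not fit stage 2 and is ignored.
        String sample = "<li><a href='//tianshui.baixing.com/'>天水</a></li>"
                + "<li><a href='//gangu.baixing.com/'>甘谷</a></li>"
                + "<a href=\"/about\">About</a>";
        extractLocations(sample)
                .forEach((name, host) -> System.out.println(name + " = " + host));
    }
}
```

Returning a map keeps the parsing step separate from printing, which makes it easier to check the patterns against a saved copy of the page before pointing the program at the live site.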
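If anything is still unclear or you get stuck, you are welcome to join my Java and Android reverse-engineering QQ group, where we can learn and improve together.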