Parsing HTML

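Web scraping with jsoup.
Web scraping uses data from other sites, so copyright problems such as unauthorized reproduction can come up; be careful when you use it.
On the jsoup site, choose Download and click the item labeled core library.
Once it has downloaded, copy the jar into the libs folder of your own project.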

Sample code that fetches the contents of the title tag from http://yahoo.co.jp

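import android.app.Activity;
import android.os.Bundle;
import android.widget.TextView;

public class MainActivity extends Activity {
    TextView tv;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        // TextView that will display the scraped title
        tv = (TextView) findViewById(R.id.textView1);
        // Hand the TextView to the AsyncTask subclass and start the request
        HttpLoader loader = new HttpLoader(tv);
        loader.execute("http://yahoo.co.jp");
    }

}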

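import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import android.os.AsyncTask;
import android.widget.TextView;

public class HttpLoader extends AsyncTask<String, Void, String> {
    TextView tv;

    public HttpLoader(TextView tv) {
        this.tv = tv;
    }

    @Override
    protected String doInBackground(String... url) {
        String title = "";
        try {
            // Network request: fetch the page at the given URL and parse it
            Document document = Jsoup.connect(url[0]).get();
            // Select the title element and keep only the text inside it
            title = document.getElementsByTag("title").text();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return title;
    }

    @Override
    protected void onPostExecute(String result) {
        super.onPostExecute(result);
        // Called on the UI thread, so the view can be updated here
        tv.setText(result);
    }

}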

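All this does is show "Yahoo!JAPAN" in the TextView.
Fetching the HTML document involves network communication, so I made an HttpLoader class that extends AsyncTask, and inside its doInBackground method I send the request and get the Document.
getElementsByTag extracts the title tag named in its argument from the fetched document, text() then pulls out just the text inside that tag, and the result is set on the TextView.
It takes a little while from the scrape until the text actually shows up in the TextView.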
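
The flow described above (do the blocking fetch on a background thread, then hand the result to a callback) can also be sketched in plain Java without any Android classes. This is only a rough illustration, not what the sample app does: the class name BackgroundLoadSketch, the loadAsync helper, and the fakeFetch function are all made up for this sketch, and fakeFetch stands in for the jsoup request so that nothing touches the network. Such a sketch may be useful to know about because AsyncTask has been deprecated since API level 30.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class BackgroundLoadSketch {

    // Runs the blocking fetch step on a worker thread (the role doInBackground
    // plays in HttpLoader) and returns a future that completes with its result.
    // fetchTitle is a hypothetical stand-in for the real jsoup call.
    public static CompletableFuture<String> loadAsync(String url,
                                                      Function<String, String> fetchTitle,
                                                      ExecutorService worker) {
        return CompletableFuture.supplyAsync(() -> fetchTitle.apply(url), worker);
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            // Assumption for illustration: no real request is made here.
            Function<String, String> fakeFetch = url -> "title of " + url;

            // thenAccept receives the result once the background step finishes,
            // which is roughly the role of onPostExecute.
            loadAsync("http://yahoo.co.jp", fakeFetch, worker)
                    .thenAccept(title -> System.out.println(title))
                    .join();
        } finally {
            worker.shutdown();
        }
    }
}
```

One difference to keep in mind: onPostExecute is called on the UI thread, while the thenAccept callback here is not. In an Android app the result would still have to be posted back to the main thread before touching the TextView.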

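First training article in a while...
A lot has piled up unwritten, so lately I've been thinking I'll put it out a little at a time...
Come to think of it, I never even posted an article on parsing JSON, did I? And now suddenly HTML parsing??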
