Crawling all the information posts from an Oriental Fortune.com stock bar

2021-09-11


Goals

Crawl all the information posts in the information section of a listed company's stock bar on Oriental Fortune.com.

Foreword

Media reports act as information intermediaries and as a channel of public-opinion oversight in the capital market. Tracking the number and content of media reports helps us analyze the ins and outs of hot events in the capital market and the related public-opinion dynamics. Oriental Fortune.com is a professional Internet financial-media platform that aggregates comprehensive financial news and market information. Today we introduce how to crawl all the information content of an Oriental Fortune.com stock bar section.

The recent "Xiaomi is making cars" story sparked heated discussion both in the capital market and on ordinary online platforms, so we take Xiaomi as our example.


Steps

The first step: open the Xiaomi Group page on Oriental Fortune.com and click the "Hong Kong Stock Bar" option to enter the Xiaomi stock bar channel.

The second step: in the "Information" channel of the Xiaomi stock bar, turn the pages and watch how the URL changes.

The pattern is not hard to spot:

The URL of page 1 of the information channel: ,hk01810,1,f_1.html

The URL of page 2: ,hk01810,1,f_2.html

The URL of page n: ,hk01810,1,f_n.html

Among them, the leading part (omitted here) is fixed, hk01810 is Xiaomi's stock code, and f_1 is the information page number.

With this rule, we can construct the URL list for any number of pages of the information channel of any listed company.

Here is the code:

def get_url(code, pages):
    '''Build the list of information-page URLs for an Oriental Fortune.com
    stock bar. code is the company code, pages is the number of pages to crawl.'''
    url_list = []
    for page in range(1, pages + 1):
        # the fixed URL prefix (omitted in the original) goes before the comma
        url = f",{code},1,f_{page}.html"
        url_list.append(url)
    return url_list
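For example, with two pages (the fixed prefix still omitted, as in the original):

>>> get_url("hk01810", 2)
[',hk01810,1,f_1.html', ',hk01810,1,f_2.html']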

The third step: analyze the HTML structure of each information page.

Looking at the page source, the read count, comment count, title, author, and update time of each post are laid out in similar span tags with classes such as l1 a1, l2 a2, and so on, so they are easy to grab with BeautifulSoup's CSS selectors.
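As a minimal illustration of how those selectors behave (the HTML fragment below is made up for demonstration and is not the real page markup):

from bs4 import BeautifulSoup

# made-up fragment in the same shape as one row of the information list
html = '''
<div class="row">
  <span class="l1 a1">1234</span>
  <span class="l2 a2">56</span>
  <span class="l3 a3"><a href="/news,hk01810,123.html" title="Sample headline">Sample headline</a></span>
  <span class="l4 a4">some author</span>
  <span class="l5 a5">04-01 10:00</span>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
print(soup.select(".l1.a1")[0].text.strip())   # 1234 (read count)
a = soup.select(".l3.a3")[0].select('a')[0]
print(a["title"], a["href"])                   # Sample headline /news,hk01810,123.html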

Code implementation

import random
from time import sleep

import openpyxl
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

def get_news(url_list):
    '''Crawl the Oriental Fortune.com information lists and save them to a
    local xlsx file. url_list is the list of page URLs to crawl.'''
    headers = {
        'User-Agent': UserAgent(verify_ssl=False).random,  # random UA per run
        'cookie': 'Your cookie',
    }
    outwb = openpyxl.Workbook()          # workbook that holds the crawled content
    outws = outwb.create_sheet(index=0)  # sheet to write into
    # header row
    outws.cell(row=1, column=1, value="read")
    outws.cell(row=1, column=2, value="comment")
    outws.cell(row=1, column=3, value="title")
    outws.cell(row=1, column=4, value="author")
    outws.cell(row=1, column=5, value="renew")
    outws.cell(row=1, column=6, value="link")
    index = 2
    for url in url_list:
        res = requests.get(url, headers=headers)
        res.encoding = res.apparent_encoding
        soup = BeautifulSoup(res.text, "html.parser")
        # skip the first match of each selection, which is the list's header row
        read_list = soup.select(".l1.a1")[1:]
        comment_list = soup.select(".l2.a2")[1:]
        title_list = soup.select(".l3.a3")[1:]
        author_list = soup.select(".l4.a4")[1:]
        renew_list = soup.select(".l5.a5")[1:]
        for k in range(len(title_list)):
            link = title_list[k].select('a')[0]
            outws.cell(row=index, column=1, value=read_list[k].text.strip())
            outws.cell(row=index, column=2, value=comment_list[k].text.strip())
            outws.cell(row=index, column=3, value=link["title"])
            outws.cell(row=index, column=4, value=author_list[k].text.strip())
            outws.cell(row=index, column=5, value=renew_list[k].text.strip())
            outws.cell(row=index, column=6, value=link["href"])
            index += 1
            print(link["title"], renew_list[k].text.strip())
        sleep(random.uniform(3, 4))  # polite pause between pages
    outwb.save("Eastern Fortune Network Information.xlsx")
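One practical refinement that is not in the original code: a crawl of 75+ pages will occasionally hit timeouts or transient network errors, so it helps to wrap the request in a small retry helper. A minimal sketch, reusing the imports above (the fetch helper and its parameters are hypothetical additions of ours):

def fetch(url, headers, retries=3, timeout=10):
    '''Hypothetical helper: GET a page with a timeout and simple retries.'''
    for attempt in range(retries):
        try:
            res = requests.get(url, headers=headers, timeout=timeout)
            res.raise_for_status()  # treat HTTP error codes as failures too
            return res
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            sleep(2 * (attempt + 1))  # back off before retrying

Inside get_news, res = requests.get(url, headers=headers) would then become res = fetch(url, headers).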

The fourth step: run the main program.

if __name__ == "__main__":
    code = "hk01810"
    pages = 75
    url_list = get_url(code, pages)
    get_news(url_list)
    print("Run complete")
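To reuse the crawler for another listed company, swap in that company's stock bar code and page count. For instance (hk00700 and the page count here are illustrative; the "hk" + ticker format follows the URL pattern observed above):

url_list = get_url("hk00700", 10)  # illustrative: 10 pages of another HK stock bar
get_news(url_list)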

Run result

The output during the run:

The saved file:

At this point we have crawled all the information in the Xiaomi stock bar: 5,788 posts in total.

The project is complete. The full code can be obtained by replying with the keyword "Stock Bar Information Crawler" in the backstage of our official account.

Zhihu and WeChat official account: Notes of an Accounting Programmer (ID: wylcfy2014)

Occasional posts: Python + Stata | text analysis + machine learning | finance + accounting

